What is LiteLLM?
LiteLLM is a Python SDK and proxy server (LLM gateway) that exposes a unified, OpenAI-compatible API for 100+ LLM providers. The proxy handles routing, load balancing, fallbacks, rate limiting, and cost tracking across providers such as OpenAI, Anthropic, Cohere, Azure OpenAI, AWS Bedrock, and local models served through Ollama.
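Because the proxy speaks the OpenAI wire format, any OpenAI-compatible client can use it simply by changing the base URL. A minimal sketch with the official openai Python package, assuming the proxy described in the sections below is running on localhost:4000 and that sk-your-litellm-key is a valid key for it:

from openai import OpenAI

# Point the standard OpenAI client at the LiteLLM proxy instead of api.openai.com.
client = OpenAI(
    base_url="http://localhost:4000/v1",   # LiteLLM proxy address (assumed local)
    api_key="sk-your-litellm-key",         # key issued by / configured on the proxy
)

# "gpt-4o" is the model_name defined in the proxy config, not the raw provider model.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)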
Installation
pip install 'litellm[proxy]'   # quoted so shells like zsh don't expand the brackets
# Or run with Docker
docker run -d --name litellm \
  -p 4000:4000 \
  -v /opt/litellm/config.yaml:/app/config.yaml \
  --restart unless-stopped \
  ghcr.io/berriai/litellm:main-latest \
  --config /app/config.yaml
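The pip install also brings in the litellm Python SDK, which can call providers directly without running the proxy at all. A quick sanity check, assuming OPENAI_API_KEY is set in your environment:

import litellm

# Calls OpenAI directly through the SDK (no proxy involved);
# the provider is inferred from the "openai/" prefix.
response = litellm.completion(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)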
Configuration
# /opt/litellm/config.yaml
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: sk-your-openai-key
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: sk-ant-your-key
  - model_name: local-llama
    litellm_params:
      model: ollama/llama3
      api_base: http://localhost:11434

litellm_settings:
  drop_params: true
  set_verbose: false

router_settings:
  routing_strategy: least-busy
  num_retries: 3
  fallbacks:
    - gpt-4o: [claude-sonnet, local-llama]
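If you prefer to stay in Python, the model_list and router_settings above map roughly onto the SDK's Router class. A sketch under that assumption, reusing the same placeholder keys and model names as the config:

from litellm import Router

# Mirrors model_list and router_settings from config.yaml (placeholder keys).
router = Router(
    model_list=[
        {"model_name": "gpt-4o",
         "litellm_params": {"model": "openai/gpt-4o", "api_key": "sk-your-openai-key"}},
        {"model_name": "claude-sonnet",
         "litellm_params": {"model": "anthropic/claude-sonnet-4-20250514", "api_key": "sk-ant-your-key"}},
        {"model_name": "local-llama",
         "litellm_params": {"model": "ollama/llama3", "api_base": "http://localhost:11434"}},
    ],
    routing_strategy="least-busy",
    num_retries=3,
    fallbacks=[{"gpt-4o": ["claude-sonnet", "local-llama"]}],
)

# Requests target "gpt-4o" first; on failure the router retries, then falls back.
response = router.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
)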
Usage
# Use the proxy exactly like the OpenAI API
curl http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-your-litellm-key" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
# If the OpenAI deployment fails, the router automatically falls back to claude-sonnet, then local-llama
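Streaming works the same way as against the upstream OpenAI API. A sketch in Python, reusing the same proxy address and key as the curl example above:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000/v1", api_key="sk-your-litellm-key")

# stream=True yields chunks in the OpenAI streaming format,
# regardless of which underlying provider actually served the request.
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()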
Features
- Unified API for 100+ LLM providers
- Automatic fallbacks and retries
- Load balancing across model deployments
- Rate limiting and spend tracking
- Virtual API keys for team management (see the key-generation sketch after this list)
- Streaming support
- Cost tracking and budgets per key
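Virtual keys and per-key budgets are managed through the proxy's /key/generate endpoint. A rough sketch, assuming a master_key has been configured under general_settings in config.yaml (not shown above); sk-your-master-key is a placeholder for that value:

import requests

# POST /key/generate mints a virtual key scoped to specific models and a budget.
resp = requests.post(
    "http://localhost:4000/key/generate",
    headers={"Authorization": "Bearer sk-your-master-key"},  # proxy master key (assumed)
    json={
        "models": ["gpt-4o", "claude-sonnet"],  # model_names from config.yaml
        "max_budget": 10.0,                     # spend cap (USD) for this key
        "duration": "30d",                      # key expiry
    },
)
print(resp.json()["key"])  # hand this virtual key out to a team member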