Why Cache Models?
Loading a large language model from disk can take 10-60 seconds, depending on model size and storage speed. Model caching keeps models resident in GPU memory or system RAM between requests, so subsequent requests skip the load entirely and response latency drops from seconds to milliseconds.
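The core idea can be sketched as a cache keyed by model name, where `load_model` is a hypothetical stand-in for the real, expensive from-disk load:

```python
# Hypothetical loader: stands in for an expensive from-disk model load.
def load_model(name):
    return {"name": name, "weights": "..."}  # real loads take 10-60 s

_cache = {}

def get_model(name):
    # First request pays the load cost; later requests hit the cache.
    if name not in _cache:
        _cache[name] = load_model(name)
    return _cache[name]

m1 = get_model("llama3:8b")   # cold: loads from disk
m2 = get_model("llama3:8b")   # warm: same object, returned from memory
print(m1 is m2)  # → True
```

Everything below is this pattern at production scale: Ollama and vLLM manage the cache for you, inside the serving process.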
Ollama Model Caching
# By default, Ollama unloads a model 5 minutes after its last request
# Extend the keep-alive duration so models stay resident
OLLAMA_KEEP_ALIVE=24h ollama serve
# Pre-load a model at startup: pull it, then send a warmup request
ollama pull llama3:8b
curl http://localhost:11434/api/generate -d '{"model":"llama3:8b","prompt":"warmup","stream":false}'
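The warmup curl above can be scripted for several models. A minimal sketch using only the standard library, assuming the default Ollama endpoint (Ollama's generate API accepts a per-request `keep_alive` field that overrides the server default):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint

def warmup_payload(model, keep_alive="24h"):
    # keep_alive in the request body overrides the server-wide default
    return {"model": model, "prompt": "warmup", "stream": False,
            "keep_alive": keep_alive}

def warm_up(models):
    for name in models:
        req = urllib.request.Request(
            OLLAMA_URL,
            data=json.dumps(warmup_payload(name)).encode(),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)  # blocks until the model is loaded

if __name__ == "__main__":
    warm_up(["llama3:8b"])
```

Run this from your service's startup hook so the first real user request never pays the load cost.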
vLLM Continuous Batching
# vLLM keeps models loaded and uses continuous batching
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3-8B-Instruct \
--gpu-memory-utilization 0.9 \
--max-model-len 8192 \
--port 8000
# vLLM automatically handles:
# - Model stays loaded in GPU memory
# - KV cache management
# - Continuous batching of requests
# - PagedAttention for efficient memory
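The KV cache vLLM manages grows linearly with sequence length, which is why `--gpu-memory-utilization` and `--max-model-len` matter. A rough size estimate, assuming Llama-3-8B's published shape (32 layers, 8 KV heads via grouped-query attention, head dimension 128) and fp16 (2 bytes per value):

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    # 2x for the K and V tensors, per layer, per KV head, per token
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len

print(kv_cache_bytes(1))           # 131072 bytes -> 128 KiB per token
print(kv_cache_bytes(8192) / 2**30)  # 1.0 GiB for a full 8192-token context
```

So each concurrent 8K-token request costs about 1 GiB of KV cache on top of the ~16 GiB of fp16 weights; PagedAttention lets vLLM allocate this in small blocks instead of reserving the worst case up front.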
KV Cache Optimization
# Prefix caching (prompt caching) reuses the computed KV cache
# for prompt prefixes shared across requests
# In vLLM:
python -m vllm.entrypoints.openai.api_server \
--enable-prefix-caching \
--model meta-llama/Meta-Llama-3-8B-Instruct
# This speeds up requests that share common system prompts
Multi-Model Caching
# Use LiteLLM as a proxy in front of multiple model servers
# Each model stays loaded on its own GPU:
#   GPU 0: llama3 (general-purpose)
#   GPU 1: codellama (code)
# LiteLLM routes each request to the matching server
# based on the request's "model" parameter
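The routing logic itself is simple; a minimal sketch, with hypothetical ports and a hand-rolled router standing in for LiteLLM's proxy:

```python
# Hypothetical layout: model name -> backend server URL
BACKENDS = {
    "llama3":    "http://localhost:8000/v1",  # GPU 0, general-purpose
    "codellama": "http://localhost:8001/v1",  # GPU 1, code
}

def route(request):
    # Dispatch on the OpenAI-style "model" field, as the proxy does
    model = request["model"]
    if model not in BACKENDS:
        raise ValueError(f"unknown model: {model}")
    return BACKENDS[model]

print(route({"model": "codellama", "prompt": "def fib(n):"}))
# → http://localhost:8001/v1
```

Because each backend process owns one model on one GPU, no request ever triggers a model swap: every model stays hot on its own device.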
Best Practices
- Pre-load frequently used models at server startup
- Use prefix caching for common system prompts
- Monitor GPU memory to avoid OOM errors
- Use quantized models to fit more in memory
- Implement request queuing to prevent concurrent model loading