Why Cache Models?
Loading a large language model from disk can take 10-60 seconds, depending on model size and storage speed. Model caching keeps models resident in GPU memory or system RAM between requests, so subsequent requests skip the load entirely and response latency drops from seconds to milliseconds.
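The core idea can be sketched as a cache keyed by model name, where `load_model` is a hypothetical stand-in for the real, expensive from-disk load:

```python
# Hypothetical loader: stands in for an expensive from-disk model load.
def load_model(name):
    return {"name": name, "weights": "..."}  # real loads take 10-60 s

_cache = {}

def get_model(name):
    # First request pays the load cost; later requests hit the cache.
    if name not in _cache:
        _cache[name] = load_model(name)
    return _cache[name]

m1 = get_model("llama3:8b")   # cold: loads from disk
m2 = get_model("llama3:8b")   # warm: same object, returned from memory
print(m1 is m2)  # → True
```

Everything below is this pattern at production scale: Ollama and vLLM manage the cache for you, inside the serving process.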
Ollama Model Caching
# By default, Ollama unloads a model 5 minutes after its last request
# Extend the keep-alive duration so models stay resident
OLLAMA_KEEP_ALIVE=24h ollama serve
# Pre-load a model at startup: pull it, then send a warmup request
ollama pull llama3:8b
curl http://localhost:11434/api/generate -d '{"model":"llama3:8b","prompt":"warmup","stream":false}'
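The warmup curl above can be scripted for several models. A minimal sketch using only the standard library, assuming the default Ollama endpoint (Ollama's generate API accepts a per-request `keep_alive` field that overrides the server default):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint

def warmup_payload(model, keep_alive="24h"):
    # keep_alive in the request body overrides the server-wide default
    return {"model": model, "prompt": "warmup", "stream": False,
            "keep_alive": keep_alive}

def warm_up(models):
    for name in models:
        req = urllib.request.Request(
            OLLAMA_URL,
            data=json.dumps(warmup_payload(name)).encode(),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)  # blocks until the model is loaded

if __name__ == "__main__":
    warm_up(["llama3:8b"])
```

Run this from your service's startup hook so the first real user request never pays the load cost.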
vLLM Continuous Batching
# vLLM keeps models loaded and uses continuous batching
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3-8B-Instruct \
--gpu-memory-utilization 0.9 \
--max-model-len 8192 \
--port 8000
# vLLM automatically handles:
# - Model stays loaded in GPU memory
# - KV cache management
# - Continuous batching of requests
# - PagedAttention for efficient memory
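The KV cache vLLM manages grows linearly with sequence length, which is why `--gpu-memory-utilization` and `--max-model-len` matter. A rough size estimate, assuming Llama-3-8B's published shape (32 layers, 8 KV heads via grouped-query attention, head dimension 128) and fp16 (2 bytes per value):

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    # 2x for the K and V tensors, per layer, per KV head, per token
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len

print(kv_cache_bytes(1))           # 131072 bytes -> 128 KiB per token
print(kv_cache_bytes(8192) / 2**30)  # 1.0 GiB for a full 8192-token context
```

So each concurrent 8K-token request costs about 1 GiB of KV cache on top of the ~16 GiB of fp16 weights; PagedAttention lets vLLM allocate this in small blocks instead of reserving the worst case up front.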
KV Cache Optimization
# Prefix caching (prompt caching) reuses the computed KV cache
# for prompt prefixes shared across requests
# In vLLM:
python -m vllm.entrypoints.openai.api_server \
--enable-prefix-caching \
--model meta-llama/Meta-Llama-3-8B-Instruct
# This speeds up requests that share common system prompts
Multi-Model Caching
# Use LiteLLM as a proxy in front of multiple model servers
# Each model stays loaded on its own GPU:
#   GPU 0: llama3 (general-purpose)
#   GPU 1: codellama (code)
# LiteLLM routes each request to the matching server
# based on the request's "model" parameter
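The routing logic itself is simple; a minimal sketch, with hypothetical ports and a hand-rolled router standing in for LiteLLM's proxy:

```python
# Hypothetical layout: model name -> backend server URL
BACKENDS = {
    "llama3":    "http://localhost:8000/v1",  # GPU 0, general-purpose
    "codellama": "http://localhost:8001/v1",  # GPU 1, code
}

def route(request):
    # Dispatch on the OpenAI-style "model" field, as the proxy does
    model = request["model"]
    if model not in BACKENDS:
        raise ValueError(f"unknown model: {model}")
    return BACKENDS[model]

print(route({"model": "codellama", "prompt": "def fib(n):"}))
# → http://localhost:8001/v1
```

Because each backend process owns one model on one GPU, no request ever triggers a model swap: every model stays hot on its own device.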
Best Practices
- Pre-load frequently used models at server startup
- Use prefix caching for common system prompts
- Monitor GPU memory to avoid OOM errors
- Use quantized models to fit more in memory
- Implement request queuing to prevent concurrent model loading