
Model Caching for Faster LLM Inference

By Admin · Mar 15, 2026 · Updated Apr 24, 2026

Why Cache Models?

Loading a large language model from disk takes anywhere from 10 to 60 seconds, depending on model size and storage speed. Model caching keeps models resident in GPU memory or system RAM between requests, so responses start in milliseconds instead of waiting for a reload.

Ollama Model Caching

# By default, Ollama keeps a model loaded for 5 minutes after its last request.
# Extend the keep-alive window so models stay resident longer:
OLLAMA_KEEP_ALIVE=24h ollama serve

# Pre-load models at startup: pull the weights, then trigger the
# actual load into memory with a short warm-up request
ollama pull llama3:8b
curl http://localhost:11434/api/generate -d '{"model":"llama3:8b","prompt":"warmup","stream":false}'
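You can check which models are currently resident and override the keep-alive per request; `ollama ps` and the `keep_alive` field of the generate API are both standard Ollama features (a sketch, assuming the default local server on port 11434):

```shell
# List models currently loaded in memory, with their expiry times
ollama ps

# Per-request override: keep_alive of -1 keeps the model loaded indefinitely
curl http://localhost:11434/api/generate \
  -d '{"model":"llama3:8b","prompt":"warmup","stream":false,"keep_alive":-1}'
```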

vLLM Continuous Batching

# vLLM keeps models loaded and uses continuous batching
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --gpu-memory-utilization 0.9 \
    --max-model-len 8192 \
    --port 8000

# vLLM automatically handles:
# - keeping the model loaded in GPU memory
# - KV cache management
# - continuous batching of incoming requests
# - PagedAttention for efficient KV-cache memory use
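Once the server is up, any OpenAI-compatible client can talk to it; a minimal smoke test with curl (assuming the server command above, listening on port 8000):

```shell
# Hit the OpenAI-compatible completions endpoint exposed by vLLM
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"meta-llama/Meta-Llama-3-8B-Instruct",
       "prompt":"Hello",
       "max_tokens":16}'
```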

KV Cache Optimization

# Prompt (prefix) caching reuses the KV cache for shared prompt prefixes,
# so the shared portion is computed only once
# In vLLM:
python -m vllm.entrypoints.openai.api_server \
    --enable-prefix-caching \
    --model meta-llama/Meta-Llama-3-8B-Instruct

# This speeds up requests that share common system prompts
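To see the effect, send two chat requests that share the same system prompt; with `--enable-prefix-caching`, the second request reuses the KV cache computed for the shared prefix (a sketch against the server started above; `SYSTEM` is just an illustrative variable):

```shell
# Two requests sharing one system prompt; the second hits the prefix cache
SYSTEM="You are a helpful assistant. Always answer concisely."
for Q in "What is caching?" "What is batching?"; do
  curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "{\"model\":\"meta-llama/Meta-Llama-3-8B-Instruct\",
         \"messages\":[{\"role\":\"system\",\"content\":\"$SYSTEM\"},
                       {\"role\":\"user\",\"content\":\"$Q\"}]}"
done
```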

Multi-Model Caching

# Use LiteLLM to manage multiple cached models
# Each model stays loaded on its respective GPU
# LiteLLM routes requests to the appropriate model server

# GPU 0: llama3 (general)
# GPU 1: codellama (code)
# LiteLLM routes based on model parameter
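A minimal LiteLLM proxy config for this layout might look like the following (a sketch: the model names, ports, and GPU assignments are illustrative assumptions, with each entry pointing at a separate OpenAI-compatible server):

```yaml
# config.yaml — one entry per backend model server
model_list:
  - model_name: llama3            # general-purpose model on GPU 0
    litellm_params:
      model: openai/meta-llama/Meta-Llama-3-8B-Instruct
      api_base: http://localhost:8000/v1
  - model_name: codellama         # code model on GPU 1
    litellm_params:
      model: openai/codellama/CodeLlama-7b-Instruct-hf
      api_base: http://localhost:8001/v1
```

Start the proxy with `litellm --config config.yaml`; clients then select a backend by setting the `model` parameter in their request.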

Best Practices

  • Pre-load frequently used models at server startup
  • Use prefix caching for common system prompts
  • Monitor GPU memory to avoid OOM errors
  • Use quantized models to fit more in memory
  • Implement request queuing to prevent concurrent model loading
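The last bullet can be sketched with a file lock, so that only one process runs the load step at a time even if several workers start together (a hypothetical sketch using util-linux `flock`; the lock path and the echo stand in for a real pull/warm-up command):

```shell
#!/bin/sh
# Serialize model loading: the exclusive lock ensures only one process
# runs the load step at a time; other processes block until it is freed.
(
  flock -x 200
  echo "loading model"   # placeholder for: ollama pull / warm-up request
) 200>/tmp/model_load.lock
```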
