Optimizing Memory for Large Language Models
Memory is the primary bottleneck when running LLMs on a Breeze. These techniques help you run larger models within your available RAM.
Understand Memory Requirements
A rough formula: parameter count multiplied by bytes per parameter. A 7B model at FP16 (2 bytes per parameter) needs around 14 GB, before counting KV-cache and activation overhead. Quantization reduces this significantly.
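The formula can be sketched as a one-liner (a lower-bound estimate only, since runtime overhead is not included):

```shell
# Back-of-the-envelope estimate: parameter count x bytes per parameter.
# FP16 stores 2 bytes per weight; the KV cache and activations add more
# on top, so treat the result as a lower bound.
params_billion=7
awk -v p="$params_billion" 'BEGIN { printf "~%d GB at FP16\n", p * 2 }'
```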
Use Quantized Models
- Q4_K_M -- good balance of quality and size, roughly 4 GB for 7B models
- Q5_K_M -- slightly better quality, roughly 5 GB for 7B models
- Q8_0 -- near-original quality, roughly 7 GB for 7B models
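The sizes above follow from the approximate bits per weight of each scheme. As a sketch (the bits-per-weight figures are rough averages, not exact values):

```shell
# Estimate on-disk size of a 7B model at common GGUF quantizations.
# Bits-per-weight values are rough averages per scheme (assumption).
awk 'BEGIN {
  n = split("Q4_K_M:4.8 Q5_K_M:5.7 Q8_0:8.5", quants, " ")
  for (i = 1; i <= n; i++) {
    split(quants[i], f, ":")
    printf "%-7s ~%.1f GB\n", f[1], 7e9 * f[2] / 8 / 1e9
  }
}'
```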
Configure Swap Space
Swap is far slower than RAM, but it prevents hard out-of-memory failures when a model briefly exceeds physical memory. To create and enable a 16 GB swap file:
sudo fallocate -l 16G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
Tune System Settings
Lowering vm.swappiness makes the kernel less eager to page out model weights, and vm.overcommit_memory=1 lets the large virtual allocations used by mmap-based model loaders succeed:
sudo sysctl vm.swappiness=10
sudo sysctl vm.overcommit_memory=1
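These sysctl changes last only until reboot. A sketch for persisting them, assuming a distro that reads /etc/sysctl.d/ (the file name 99-llm-memory.conf is arbitrary):

```shell
# Persist the settings across reboots (path and file name are assumptions)
printf 'vm.swappiness=10\nvm.overcommit_memory=1\n' \
  | sudo tee /etc/sysctl.d/99-llm-memory.conf
sudo sysctl --system   # reload every sysctl configuration file
```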
Limit Context Length
Reduce the context window to save memory: KV-cache size grows roughly linearly with context length, so a 2048-token context needs about a quarter of the cache RAM of an 8192-token one. In Ollama, set it from the interactive prompt:
ollama run llama3
>>> /set parameter num_ctx 2048
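To make the smaller context the default rather than a per-session setting, you can bake it into a model variant with a Modelfile (the variant name llama3-2k is arbitrary):

```shell
# Create a lower-memory variant of llama3 with a 2048-token context
cat > Modelfile <<'EOF'
FROM llama3
PARAMETER num_ctx 2048
EOF
# Build the variant if the ollama CLI is available
if command -v ollama >/dev/null; then
  ollama create llama3-2k -f Modelfile
fi
```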
Monitor Memory Usage
watch -n 1 free -h   # overall RAM and swap usage, refreshed every second
htop                 # interactive per-process view; watch the RES column
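For a quick one-off check without a live dashboard, /proc/meminfo reports the same numbers directly:

```shell
# Snapshot of memory headroom in kB, straight from the kernel
grep -E '^(MemTotal|MemAvailable|SwapTotal|SwapFree):' /proc/meminfo
```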
If the OOM killer terminates your process (check dmesg for "Out of memory" lines to confirm), reduce the model size, switch to a more aggressive quantization, or upgrade your Breeze to a plan with more RAM.