GGUF (GPT-Generated Unified Format) is the standard format for running quantized large language models on CPUs. Combined with llama.cpp, it enables you to run powerful AI models on standard VPS hardware without requiring expensive GPUs. This guide covers everything from building llama.cpp to optimizing inference for production workloads.
## Understanding GGUF and Quantization

GGUF files contain quantized model weights: compressed versions of full-precision models that trade a small amount of quality for dramatically reduced memory and compute requirements:

- `Q2_K`: Smallest, fastest, lowest quality; good for experimentation
- `Q4_K_M`: Best balance of speed and quality for most use cases
- `Q5_K_M`: Higher quality, roughly 30% slower than `Q4_K_M`
- `Q6_K`: Near full-precision quality, requires more RAM
- `Q8_0`: Highest-quality quantization, nearly matches FP16
### RAM Requirements by Model Size
| Model Parameters | Q4_K_M RAM | Q8_0 RAM | Recommended VPS |
|---|---|---|---|
| 7B | ~5GB | ~8GB | 8GB RAM |
| 13B | ~8GB | ~14GB | 16GB RAM |
| 34B | ~20GB | ~36GB | 32GB RAM |
| 70B | ~40GB | ~72GB | 64GB+ RAM |
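The figures above follow a simple rule of thumb: resident memory scales with parameter count times bits-per-weight, plus overhead for the KV cache and runtime buffers. A rough estimator is sketched below; the bits-per-weight values are approximate averages for each llama.cpp quantization scheme, and the 20% overhead factor is an assumption (the KV cache grows with context size):

```python
# Approximate average bits per weight for common llama.cpp quantizations.
# Real GGUF files vary slightly because different tensors use mixed schemes.
BITS_PER_WEIGHT = {
    "Q2_K": 2.6,
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
}

def estimate_ram_gb(params_billions: float, quant: str,
                    overhead: float = 1.2) -> float:
    """Estimate resident RAM in GB: weight bytes plus ~20% runtime overhead.

    The overhead factor is an assumption; budget more for large contexts.
    """
    weight_bytes = params_billions * 1e9 * BITS_PER_WEIGHT[quant] / 8
    return weight_bytes * overhead / 1e9

if __name__ == "__main__":
    for quant in ("Q4_K_M", "Q8_0"):
        print(f"7B {quant}: ~{estimate_ram_gb(7, quant):.1f} GB")
```

For a 7B model at `Q4_K_M` this lands around 5 GB, matching the table; treat the output as a sizing starting point, not a guarantee.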
## Building llama.cpp from Source

Building from source ensures the binary is compiled with your CPU's specific optimizations:

```bash
# Install build dependencies
sudo apt update
sudo apt install -y build-essential cmake git libcurl4-openssl-dev

# Clone llama.cpp
cd /opt
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# Build with native CPU optimizations
cmake -B build \
  -DLLAMA_NATIVE=ON \
  -DLLAMA_CURL=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)

# Verify build
./build/bin/llama-cli --version
```
### Check CPU Capabilities

```bash
# Check which SIMD instruction sets your CPU supports
grep -o 'avx[^ ]*\|sse[^ ]*\|f16c\|fma' /proc/cpuinfo | sort -u
```

AVX2 gives the biggest speedup for llama.cpp; AVX-512 provides an additional benefit on CPUs that support it.
## Downloading GGUF Models

```bash
# Create model storage directory
mkdir -p /opt/models

# Download from Hugging Face (example: Llama 3.1 8B)
# Using curl
curl -L -o /opt/models/llama-3.1-8b-q4_k_m.gguf \
  "https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"

# Using huggingface-cli (recommended for large files; it resumes
# interrupted downloads)
pip install huggingface-hub
huggingface-cli download \
  bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --local-dir /opt/models
```
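After a multi-gigabyte download it is worth confirming the file is actually a GGUF container and not, say, an HTML error page saved under the wrong name. GGUF files start with the ASCII magic `GGUF` followed by a little-endian uint32 format version. A small sanity check (the path is the example used above):

```python
import struct

def check_gguf(path: str) -> int:
    """Return the GGUF format version, or raise ValueError on a bad magic."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != b"GGUF":
        raise ValueError(f"{path} is not a GGUF file")
    # The version is a little-endian uint32 immediately after the magic.
    return struct.unpack("<I", header[4:8])[0]

# Example: check_gguf("/opt/models/llama-3.1-8b-q4_k_m.gguf")
```

If this raises, re-download the file before spending time debugging the server.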
## Running the OpenAI-Compatible API Server

```bash
# Start the server with settings tuned for CPU inference
/opt/llama.cpp/build/bin/llama-server \
  --model /opt/models/llama-3.1-8b-q4_k_m.gguf \
  --host 127.0.0.1 \
  --port 8080 \
  --ctx-size 8192 \
  --threads $(nproc) \
  --batch-size 512 \
  --n-predict 2048 \
  --parallel 4 \
  --cont-batching \
  --flash-attn \
  --mlock
```

Key parameters:

- `--ctx-size`: Context window in tokens (how much text the model can see)
- `--threads`: CPU threads; `$(nproc)` counts logical cores, but on hyperthreaded CPUs setting this to the physical core count often performs better
- `--batch-size`: Prompt-processing batch size
- `--n-predict`: Maximum tokens generated per request
- `--parallel`: Number of concurrent request slots
- `--cont-batching`: Enables continuous batching for better throughput under concurrent load
- `--flash-attn`: Flash attention for faster inference
- `--mlock`: Locks the model in RAM so it is never swapped out (requires a sufficient `memlock` limit)
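Once the server is up, anything that speaks the OpenAI chat completions protocol can talk to it. A minimal client sketch using only the Python standard library (the `model` field is included for compatibility with OpenAI-style tooling; llama-server serves whatever model it was started with):

```python
import json
import urllib.request

def build_chat_request(prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": "local",  # informational; the server loads one model at startup
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }

def chat(prompt: str, base_url: str = "http://127.0.0.1:8080") -> str:
    """POST to the server's OpenAI-compatible chat completions endpoint."""
    payload = json.dumps(build_chat_request(prompt)).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Example (requires a running server):
# print(chat("Explain GGUF in one sentence."))
```

Because the API shape matches OpenAI's, most existing SDKs also work by pointing their base URL at `http://127.0.0.1:8080/v1`.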
## Create a systemd Service

Write the unit file with `tee` rather than a plain redirect (`sudo cat > file` fails because the shell opens the redirection before `sudo` gains privileges):

```bash
sudo tee /etc/systemd/system/llama-server.service > /dev/null <<'EOF'
```
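The unit body might look like the following sketch. The paths and flags mirror the earlier examples; `User=llama` and the restart settings are assumptions to adapt to your setup:

```ini
# /etc/systemd/system/llama-server.service -- sketch, adjust paths and user
[Unit]
Description=llama.cpp OpenAI-compatible server
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
# Assumes a dedicated unprivileged user exists; create one or change this.
User=llama
ExecStart=/opt/llama.cpp/build/bin/llama-server \
  --model /opt/models/llama-3.1-8b-q4_k_m.gguf \
  --host 127.0.0.1 --port 8080 \
  --ctx-size 8192 --batch-size 512 --n-predict 2048 \
  --parallel 4 --cont-batching --flash-attn --mlock
# --mlock needs an unlimited memlock limit to pin the model in RAM.
LimitMEMLOCK=infinity
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```

Note that `$(nproc)` is not expanded by systemd, so the `--threads` flag is omitted here; llama-server picks a thread count automatically when it is not given. Close the heredoc with `EOF`, then reload and start the service:

```bash
sudo systemctl daemon-reload
sudo systemctl enable --now llama-server
```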