GGUF (GPT-Generated Unified Format) is the standard format for running quantized large language models on CPUs. Combined with llama.cpp, it enables you to run powerful AI models on standard VPS hardware without requiring expensive GPUs. This guide covers everything from building llama.cpp to optimizing inference for production workloads.
## Understanding GGUF and Quantization

GGUF files contain quantized model weights: compressed versions of full-precision models that trade a small amount of quality for dramatically reduced memory and compute requirements:

- `Q2_K`: Smallest, fastest, lowest quality; good for experimentation
- `Q4_K_M`: Best balance of speed and quality for most use cases
- `Q5_K_M`: Higher quality, roughly 30% slower than `Q4_K_M`
- `Q6_K`: Near full-precision quality, requires more RAM
- `Q8_0`: Highest-quality quantization, nearly matches FP16
### RAM Requirements by Model Size
| Model Parameters | Q4_K_M RAM | Q8_0 RAM | Recommended VPS |
|---|---|---|---|
| 7B | ~5GB | ~8GB | 8GB RAM |
| 13B | ~8GB | ~14GB | 16GB RAM |
| 34B | ~20GB | ~36GB | 32GB RAM |
| 70B | ~40GB | ~72GB | 64GB+ RAM |
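The figures above follow a simple rule of thumb: resident memory scales with parameter count times bits-per-weight, plus overhead for the KV cache and runtime buffers. A rough estimator is sketched below; the bits-per-weight values are approximate averages for each llama.cpp quantization scheme, and the 20% overhead factor is an assumption (the KV cache grows with context size):

```python
# Approximate average bits per weight for common llama.cpp quantizations.
# Real GGUF files vary slightly because different tensors use mixed schemes.
BITS_PER_WEIGHT = {
    "Q2_K": 2.6,
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
}

def estimate_ram_gb(params_billions: float, quant: str,
                    overhead: float = 1.2) -> float:
    """Estimate resident RAM in GB: weight bytes plus ~20% runtime overhead.

    The overhead factor is an assumption; budget more for large contexts.
    """
    weight_bytes = params_billions * 1e9 * BITS_PER_WEIGHT[quant] / 8
    return weight_bytes * overhead / 1e9

if __name__ == "__main__":
    for quant in ("Q4_K_M", "Q8_0"):
        print(f"7B {quant}: ~{estimate_ram_gb(7, quant):.1f} GB")
```

For a 7B model at `Q4_K_M` this lands around 5 GB, matching the table; treat the output as a sizing starting point, not a guarantee.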
## Building llama.cpp from Source

Building from source ensures the binary is compiled with your CPU's specific optimizations:

```bash
# Install build dependencies
sudo apt update
sudo apt install -y build-essential cmake git libcurl4-openssl-dev

# Clone llama.cpp
cd /opt
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# Build with native CPU optimizations
cmake -B build \
  -DLLAMA_NATIVE=ON \
  -DLLAMA_CURL=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)

# Verify build
./build/bin/llama-cli --version
```
### Check CPU Capabilities

```bash
# Check which SIMD instruction sets your CPU supports
grep -o 'avx[^ ]*\|sse[^ ]*\|f16c\|fma' /proc/cpuinfo | sort -u
```

AVX2 gives the biggest speedup for llama.cpp; AVX-512 provides an additional benefit on CPUs that support it.
## Downloading GGUF Models

```bash
# Create model storage directory
mkdir -p /opt/models

# Download from Hugging Face (example: Llama 3.1 8B)
# Using curl
curl -L -o /opt/models/llama-3.1-8b-q4_k_m.gguf \
  "https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"

# Using huggingface-cli (recommended for large files; it resumes
# interrupted downloads)
pip install huggingface-hub
huggingface-cli download \
  bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --local-dir /opt/models
```
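After a multi-gigabyte download it is worth confirming the file is actually a GGUF container and not, say, an HTML error page saved under the wrong name. GGUF files start with the ASCII magic `GGUF` followed by a little-endian uint32 format version. A small sanity check (the path is the example used above):

```python
import struct

def check_gguf(path: str) -> int:
    """Return the GGUF format version, or raise ValueError on a bad magic."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != b"GGUF":
        raise ValueError(f"{path} is not a GGUF file")
    # The version is a little-endian uint32 immediately after the magic.
    return struct.unpack("<I", header[4:8])[0]

# Example: check_gguf("/opt/models/llama-3.1-8b-q4_k_m.gguf")
```

If this raises, re-download the file before spending time debugging the server.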
## Running the OpenAI-Compatible API Server

```bash
# Start the server with settings tuned for CPU inference
/opt/llama.cpp/build/bin/llama-server \
  --model /opt/models/llama-3.1-8b-q4_k_m.gguf \
  --host 127.0.0.1 \
  --port 8080 \
  --ctx-size 8192 \
  --threads $(nproc) \
  --batch-size 512 \
  --n-predict 2048 \
  --parallel 4 \
  --cont-batching \
  --flash-attn \
  --mlock
```

Key parameters:

- `--ctx-size`: Context window in tokens (how much text the model can see)
- `--threads`: CPU threads; `$(nproc)` counts logical cores, but on hyperthreaded CPUs setting this to the physical core count often performs better
- `--batch-size`: Prompt-processing batch size
- `--n-predict`: Maximum tokens generated per request
- `--parallel`: Number of concurrent request slots
- `--cont-batching`: Enables continuous batching for better throughput under concurrent load
- `--flash-attn`: Flash attention for faster inference
- `--mlock`: Locks the model in RAM so it is never swapped out (requires a sufficient `memlock` limit)
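Once the server is up, anything that speaks the OpenAI chat completions protocol can talk to it. A minimal client sketch using only the Python standard library (the `model` field is included for compatibility with OpenAI-style tooling; llama-server serves whatever model it was started with):

```python
import json
import urllib.request

def build_chat_request(prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": "local",  # informational; the server loads one model at startup
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }

def chat(prompt: str, base_url: str = "http://127.0.0.1:8080") -> str:
    """POST to the server's OpenAI-compatible chat completions endpoint."""
    payload = json.dumps(build_chat_request(prompt)).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Example (requires a running server):
# print(chat("Explain GGUF in one sentence."))
```

Because the API shape matches OpenAI's, most existing SDKs also work by pointing their base URL at `http://127.0.0.1:8080/v1`.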
## Create a systemd Service

Write the unit file with `tee` rather than a plain redirect (`sudo cat > file` fails because the shell opens the redirection before `sudo` gains privileges):

```bash
sudo tee /etc/systemd/system/llama-server.service > /dev/null <<'EOF'
```
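The unit body might look like the following sketch. The paths and flags mirror the earlier examples; `User=llama` and the restart settings are assumptions to adapt to your setup:

```ini
# /etc/systemd/system/llama-server.service -- sketch, adjust paths and user
[Unit]
Description=llama.cpp OpenAI-compatible server
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
# Assumes a dedicated unprivileged user exists; create one or change this.
User=llama
ExecStart=/opt/llama.cpp/build/bin/llama-server \
  --model /opt/models/llama-3.1-8b-q4_k_m.gguf \
  --host 127.0.0.1 --port 8080 \
  --ctx-size 8192 --batch-size 512 --n-predict 2048 \
  --parallel 4 --cont-batching --flash-attn --mlock
# --mlock needs an unlimited memlock limit to pin the model in RAM.
LimitMEMLOCK=infinity
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```

Note that `$(nproc)` is not expanded by systemd, so the `--threads` flag is omitted here; llama-server picks a thread count automatically when it is not given. Close the heredoc with `EOF`, then reload and start the service:

```bash
sudo systemctl daemon-reload
sudo systemctl enable --now llama-server
```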