
Run GGUF Models with llama.cpp for CPU Inference

By Admin · Mar 15, 2026 · Updated Apr 23, 2026 · 3 min read

GGUF (GPT-Generated Unified Format) is the standard format for running quantized large language models on CPUs. Combined with llama.cpp, it enables you to run powerful AI models on standard VPS hardware without requiring expensive GPUs. This guide covers everything from building llama.cpp to optimizing inference for production workloads.

Understanding GGUF and Quantization

GGUF files contain quantized model weights — compressed versions of full-precision models that trade minimal quality for dramatically reduced memory and compute requirements:

  • Q2_K: Smallest, fastest, lowest quality — good for experimentation
  • Q4_K_M: Best balance of speed and quality for most use cases
  • Q5_K_M: Higher quality, ~30% slower than Q4
  • Q6_K: Near full-precision quality, requires more RAM
  • Q8_0: Highest quality quantization, nearly matches FP16
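
As a rough rule of thumb, a GGUF file's size is about parameters × bits-per-weight ÷ 8, plus a small overhead for metadata. The sketch below shows that arithmetic; the bits-per-weight figures are approximations (the exact value varies with each quantization's mix of tensor types):

```shell
# Rough GGUF size estimate in GB: params (billions) * bits-per-weight / 8.
# The bits-per-weight values used below are illustrative approximations.
estimate_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

estimate_gb 8 4.8   # roughly Q4_K_M for an 8B model
estimate_gb 8 8.5   # roughly Q8_0 for an 8B model
```

Leave 1–2GB of headroom on top of the file size for the KV cache and runtime buffers.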

RAM Requirements by Model Size

Model Parameters | Q4_K_M RAM | Q8_0 RAM | Recommended VPS
7B               | ~5GB       | ~8GB     | 8GB RAM
13B              | ~8GB       | ~14GB    | 16GB RAM
34B              | ~20GB      | ~36GB    | 32GB RAM
70B              | ~40GB      | ~72GB    | 64GB+ RAM
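
Before launching a server, it is worth confirming the model actually fits in memory. A small sketch that compares the file size against MemAvailable (the helper name is ours; the path in the usage example is illustrative):

```shell
# Returns success if the GGUF file is smaller than currently available RAM.
# Actual usage will be higher (KV cache, buffers), so leave headroom.
fits_in_ram() {
  local model_kb avail_kb
  model_kb=$(du -k "$1" | cut -f1)
  avail_kb=$(awk '/MemAvailable/ {print $2}' /proc/meminfo)
  [ "$model_kb" -lt "$avail_kb" ]
}

# Example (adjust the path to your download location):
# fits_in_ram /opt/models/llama-3.1-8b-q4_k_m.gguf && echo "fits"
```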

Building llama.cpp from Source

Building from source ensures you get CPU-specific optimizations:

# Install build dependencies
sudo apt update
sudo apt install -y build-essential cmake git libcurl4-openssl-dev

# Clone llama.cpp
cd /opt
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# Build with native CPU optimizations
cmake -B build \
  -DGGML_NATIVE=ON \
  -DLLAMA_CURL=ON \
  -DCMAKE_BUILD_TYPE=Release

cmake --build build --config Release -j$(nproc)

# Verify build
./build/bin/llama-cli --version

Check CPU Capabilities

# Check which SIMD instructions your CPU supports
grep -o 'avx[^ ]*\|sse[^ ]*\|f16c\|fma' /proc/cpuinfo | sort -u

# AVX2 gives the biggest speedup for llama.cpp
# AVX-512 provides additional benefit on supported CPUs
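
Hyperthreading rarely helps llama.cpp: its math kernels saturate physical cores, so --threads is usually best set to the physical core count rather than the logical count that nproc reports. One way to compare the two (assumes lscpu from util-linux is installed):

```shell
# Count physical cores vs logical CPUs; hyperthreaded siblings share a core.
physical=$(lscpu -p=Core,Socket | grep -v '^#' | sort -u | wc -l)
logical=$(nproc)
echo "physical cores: $physical, logical CPUs: $logical"
```

If the two numbers differ, benchmark with --threads set to the physical count.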

Downloading GGUF Models

# Create model storage directory
mkdir -p /opt/models

# Download from Hugging Face (example: Llama 3.1 8B)
# Using curl
curl -L -o /opt/models/llama-3.1-8b-q4_k_m.gguf \
  "https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"

# Using huggingface-cli (recommended for large files)
pip install huggingface-hub
huggingface-cli download \
  bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --local-dir /opt/models
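
Large downloads occasionally arrive truncated, or as an HTML error page saved under the model's filename. Every valid GGUF file begins with the 4-byte magic "GGUF", which makes for a cheap sanity check (the helper name is ours; the commented path is illustrative):

```shell
# Check the 4-byte GGUF magic at the start of the file.
check_gguf() {
  if [ "$(head -c 4 "$1")" = "GGUF" ]; then
    echo "looks like a GGUF file"
  else
    echo "not a GGUF file - download may be truncated or an error page"
  fi
}

# Example (adjust the path):
# check_gguf /opt/models/llama-3.1-8b-q4_k_m.gguf
```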

Running the OpenAI-Compatible API Server

# Start the server with optimal settings
/opt/llama.cpp/build/bin/llama-server \
  --model /opt/models/llama-3.1-8b-q4_k_m.gguf \
  --host 127.0.0.1 \
  --port 8080 \
  --ctx-size 8192 \
  --threads $(nproc) \
  --batch-size 512 \
  --n-predict 2048 \
  --parallel 4 \
  --cont-batching \
  --flash-attn \
  --mlock

# Key parameters explained:
# --ctx-size: Context window (tokens the model can see)
# --threads: CPU threads (set to physical core count)
# --batch-size: Prompt processing batch size
# --parallel: Number of concurrent requests
# --cont-batching: Enables continuous batching for better throughput
# --flash-attn: Flash attention for faster inference
# --mlock: Lock model in RAM (prevents swapping)
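
Once running, llama-server speaks the OpenAI chat-completions protocol, so any OpenAI-compatible client can talk to it. A minimal curl request might look like this (the "model" field is required by the OpenAI schema, but llama-server serves whichever model it loaded):

```shell
# Build the request body, then POST it to the local server.
PAYLOAD='{
  "model": "llama-3.1-8b",
  "messages": [{"role": "user", "content": "Say hello in one sentence."}],
  "max_tokens": 64
}'

curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD" || echo "request failed - is llama-server running?"
```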

Create a systemd Service

# Note: "sudo cat > file" fails because the redirect runs as your user,
# not as root - use tee instead. Paths below match the examples in this
# guide; adjust them to your setup. systemd does not expand $(nproc),
# so set --threads to your physical core count explicitly.
sudo tee /etc/systemd/system/llama-server.service > /dev/null <<'EOF'
[Unit]
Description=llama.cpp OpenAI-compatible server
After=network.target

[Service]
ExecStart=/opt/llama.cpp/build/bin/llama-server \
  --model /opt/models/llama-3.1-8b-q4_k_m.gguf \
  --host 127.0.0.1 --port 8080 \
  --ctx-size 8192 --threads 4 \
  --parallel 4 --cont-batching --mlock
Restart=on-failure
# --mlock needs permission to lock the model in RAM
LimitMEMLOCK=infinity

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now llama-server
