Running Large Language Models on a VPS

By Admin · Mar 1, 2026 · Updated Apr 23, 2026

Running LLMs on your own Breeze VPS gives you full control over data privacy, latency, and costs. This guide covers the key considerations for self-hosted inference.

Choosing the Right Breeze Size

  • 7B parameter models: 8 GB RAM minimum, 16 GB recommended
  • 13B parameter models: 16 GB RAM minimum, 32 GB recommended
  • 70B parameter models: 64 GB+ RAM with quantization
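The figures above follow from a simple rule of thumb: the weights alone need roughly parameters-in-billions × bits-per-weight ÷ 8 GB, plus overhead for the KV cache and runtime buffers. A minimal sketch (the 20% overhead factor and the `estimate_gb` helper name are illustrative assumptions, not fixed constants):

```shell
# Back-of-envelope memory estimate: weights plus ~20% runtime overhead.
estimate_gb() {
  # $1 = parameters in billions, $2 = bits per weight after quantization
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 * 1.2 }'
}
estimate_gb 7 16   # FP16 7B model
estimate_gb 7 4    # Q4-quantized 7B model
estimate_gb 70 4   # Q4-quantized 70B model
```

A Q4-quantized 7B model comes out around 4 GB, which is why 8 GB RAM is a workable minimum once the OS and context buffers are accounted for.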

Quantization

Quantized models (Q4, Q5, Q8) use significantly less memory with minimal quality loss. Tools like llama.cpp and Ollama support GGUF quantized formats out of the box.
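In practice you either download a pre-quantized GGUF or quantize one yourself. A sketch of both routes (the Ollama tag shown is an example; available tags vary by model, so check the Ollama model library for exact names):

```shell
# Pull a pre-quantized build via Ollama (example tag, verify it exists):
ollama pull llama3:8b-instruct-q4_K_M

# Or quantize an FP16 GGUF yourself with llama.cpp's quantize tool:
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
```

Q4_K_M is a common middle ground: roughly a quarter of the FP16 footprint with only a small perplexity increase.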

Popular Inference Engines

  • Ollama -- simplest setup, great for quick deployments
  • llama.cpp -- lightweight C++ inference, CPU-optimized
  • vLLM -- high-throughput serving with paged attention
  • LocalAI -- OpenAI-compatible API server
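For most first deployments, Ollama is the quickest path from bare VPS to a working endpoint. A minimal quickstart (the model name is an example; pick one that fits your RAM):

```shell
# Install Ollama via the official script:
curl -fsSL https://ollama.com/install.sh | sh

# Download and chat with a model from the command line:
ollama pull llama3
ollama run llama3 "Summarize what a VPS is in one sentence."

# Ollama also serves an HTTP API on localhost:11434:
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3", "prompt": "Hello"}'
```

Because the API listens on localhost by default, put a reverse proxy with authentication in front of it before exposing it publicly.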

Performance Tips

Enable swap space as a safety net:

sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
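The swapfile above lasts only until reboot. To make it permanent, register it in /etc/fstab, and optionally lower swappiness so the kernel swaps only under real memory pressure (the value 10 is a common starting point, not a requirement):

```shell
# Persist the swapfile across reboots:
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab

# Prefer keeping model weights in RAM; swap only under pressure:
sudo sysctl vm.swappiness=10
```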

Pin the number of threads to your CPU core count for optimal throughput. Monitor memory usage with htop and adjust model size or quantization level as needed.
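With llama.cpp, the thread count is set with the `-t` flag; `nproc` reports the available cores. A sketch (the model path and binary location are placeholders):

```shell
# Detect the core count and pass it to the inference engine.
THREADS=$(nproc)
echo "Using $THREADS threads"

# Hypothetical llama.cpp invocation using that count:
# ./llama-cli -m ./model.gguf -t "$THREADS" -p "Hello"
```

On machines with SMT/hyper-threading, physical core count often outperforms logical core count for inference, so it is worth benchmarking both.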
