Running Large Language Models on a VPS

By Admin · Mar 1, 2026 · Updated Apr 23, 2026

Running LLMs on your own Breeze VPS gives you full control over data privacy, latency, and costs. This guide covers the key considerations for self-hosted inference.

Choosing the Right Breeze Size

  • 7B parameter models: 8 GB RAM minimum, 16 GB recommended
  • 13B parameter models: 16 GB RAM minimum, 32 GB recommended
  • 70B parameter models: 64 GB+ RAM with quantization
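The figures above follow from a simple rule of thumb: the weights alone need roughly parameters-in-billions × bits-per-weight ÷ 8 GB, plus overhead for the KV cache and runtime buffers. A minimal sketch (the 20% overhead factor and the `estimate_gb` helper name are illustrative assumptions, not fixed constants):

```shell
# Back-of-envelope memory estimate: weights plus ~20% runtime overhead.
estimate_gb() {
  # $1 = parameters in billions, $2 = bits per weight after quantization
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 * 1.2 }'
}
estimate_gb 7 16   # FP16 7B model
estimate_gb 7 4    # Q4-quantized 7B model
estimate_gb 70 4   # Q4-quantized 70B model
```

A Q4-quantized 7B model comes out around 4 GB, which is why 8 GB RAM is a workable minimum once the OS and context buffers are accounted for.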

Quantization

Quantized models (Q4, Q5, Q8) use significantly less memory with minimal quality loss. Tools like llama.cpp and Ollama support GGUF quantized formats out of the box.
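In practice you either download a pre-quantized GGUF or quantize one yourself. A sketch of both routes (the Ollama tag shown is an example; available tags vary by model, so check the Ollama model library for exact names):

```shell
# Pull a pre-quantized build via Ollama (example tag, verify it exists):
ollama pull llama3:8b-instruct-q4_K_M

# Or quantize an FP16 GGUF yourself with llama.cpp's quantize tool:
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
```

Q4_K_M is a common middle ground: roughly a quarter of the FP16 footprint with only a small perplexity increase.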

Popular Inference Engines

  • Ollama -- simplest setup, great for quick deployments
  • llama.cpp -- lightweight C++ inference, CPU-optimized
  • vLLM -- high-throughput serving with paged attention
  • LocalAI -- OpenAI-compatible API server
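For most first deployments, Ollama is the quickest path from bare VPS to a working endpoint. A minimal quickstart (the model name is an example; pick one that fits your RAM):

```shell
# Install Ollama via the official script:
curl -fsSL https://ollama.com/install.sh | sh

# Download and chat with a model from the command line:
ollama pull llama3
ollama run llama3 "Summarize what a VPS is in one sentence."

# Ollama also serves an HTTP API on localhost:11434:
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3", "prompt": "Hello"}'
```

Because the API listens on localhost by default, put a reverse proxy with authentication in front of it before exposing it publicly.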

Performance Tips

Enable swap space as a safety net:

sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
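The swapfile above lasts only until reboot. To make it permanent, register it in /etc/fstab, and optionally lower swappiness so the kernel swaps only under real memory pressure (the value 10 is a common starting point, not a requirement):

```shell
# Persist the swapfile across reboots:
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab

# Prefer keeping model weights in RAM; swap only under pressure:
sudo sysctl vm.swappiness=10
```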

Pin the number of threads to your CPU core count for optimal throughput. Monitor memory usage with htop and adjust model size or quantization level as needed.
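With llama.cpp, the thread count is set with the `-t` flag; `nproc` reports the available cores. A sketch (the model path and binary location are placeholders):

```shell
# Detect the core count and pass it to the inference engine.
THREADS=$(nproc)
echo "Using $THREADS threads"

# Hypothetical llama.cpp invocation using that count:
# ./llama-cli -m ./model.gguf -t "$THREADS" -p "Hello"
```

On machines with SMT/hyper-threading, physical core count often outperforms logical core count for inference, so it is worth benchmarking both.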
