What is Ollama?
Ollama lets you run large language models (LLMs) locally on your VPS. No API keys, no per-token costs, full data privacy.
Installation
curl -fsSL https://ollama.com/install.sh | sh
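The script installs the binary and starts the server on port 11434. To confirm it is up, a minimal check in Python (a sketch; assumes the default localhost port and the requests library):

import requests

# The server's root endpoint answers with a short plain-text status message when it is running
r = requests.get("http://localhost:11434/")
print(r.status_code, r.text)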
Downloading Models
# Small and fast (3.8B parameters, ~2.3 GB)
ollama pull phi3
# Medium — good balance (8B parameters, ~4.7 GB)
ollama pull llama3.1
# Large — best quality (70B parameters, ~40 GB)
ollama pull llama3.1:70b
# Code-specialized
ollama pull codellama:13b
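Pulled models can be listed with ollama list, or programmatically through the /api/tags endpoint. A small sketch (field names as returned by current Ollama versions; treat them as assumptions if yours differs):

import requests

# /api/tags lists locally available models with their names and on-disk sizes (bytes)
models = requests.get("http://localhost:11434/api/tags").json()["models"]
for m in models:
    print(f'{m["name"]}: {m["size"] / 1e9:.1f} GB')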
Model Size vs Requirements
| Model | Parameters | RAM/VRAM | Strengths |
|---|---|---|---|
| Phi-3 Mini | 3.8B | ~3 GB | Good for simple tasks |
| Llama 3.1 8B | 8B | ~5 GB | Great general purpose |
| Mistral 7B | 7B | ~5 GB | Strong reasoning |
| Llama 3.1 70B | 70B | ~42 GB | Near GPT-4 quality |
| CodeLlama 13B | 13B | ~8 GB | Code generation |
Tip: For a 4 GB RAM VPS, stick with Phi-3 or a quantized 7B model. 8 GB of RAM handles 7-8B models comfortably; 70B models need a high-RAM server.
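Before pulling a large model, it can help to compare available memory against the rough figures above. A hypothetical helper reading /proc/meminfo (Linux only; the 5 GB threshold is just the table's 7-8B estimate):

def available_ram_gb(meminfo_path="/proc/meminfo"):
    # MemAvailable is reported in kB; convert to GB for comparison with the table
    with open(meminfo_path) as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) / 1024 / 1024
    return 0.0

ram = available_ram_gb()
print(f"~{ram:.1f} GB available; 7-8B models want roughly 5 GB free")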
Using the API
Ollama exposes an HTTP API on port 11434 (bound to localhost by default):
# Generate a response
curl http://localhost:11434/api/generate -d '{
"model": "llama3.1",
"prompt": "Explain Docker containers in 3 sentences.",
"stream": false
}'
# Chat format
curl http://localhost:11434/api/chat -d '{
"model": "llama3.1",
"messages": [
{"role": "system", "content": "You are a helpful DevOps assistant."},
{"role": "user", "content": "How do I optimize Nginx for high traffic?"}
],
"stream": false
}'
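The chat endpoint is stateless: it does not remember earlier turns, so the client keeps the full messages list and appends each reply before the next request. A minimal sketch (the ask helper is illustrative, not part of Ollama):

import requests

messages = [{"role": "system", "content": "You are a helpful DevOps assistant."}]

def ask(user_text, model="llama3.1"):
    # Append the user turn, send the whole history, then store the assistant reply
    messages.append({"role": "user", "content": user_text})
    r = requests.post("http://localhost:11434/api/chat",
                      json={"model": model, "messages": messages, "stream": False})
    reply = r.json()["message"]
    messages.append(reply)
    return reply["content"]

print(ask("How do I optimize Nginx for high traffic?"))
print(ask("And what about worker_connections specifically?"))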
Integration with Python
import requests

def ask_ollama(prompt, model="llama3.1"):
    """Send a single prompt to the local Ollama server and return the full reply."""
    response = requests.post("http://localhost:11434/api/generate", json={
        "model": model,
        "prompt": prompt,
        "stream": False
    })
    response.raise_for_status()
    return response.json()["response"]

answer = ask_ollama("Write a bash script to monitor disk usage")
print(answer)
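For long answers it is worth streaming: with "stream": true (the API default) Ollama returns newline-delimited JSON chunks, each carrying a partial "response" and a final chunk with "done": true, so tokens can be printed as they arrive. A streaming variant of the helper above (a sketch under those assumptions):

import json
import requests

def ask_ollama_stream(prompt, model="llama3.1"):
    with requests.post("http://localhost:11434/api/generate",
                       json={"model": model, "prompt": prompt, "stream": True},
                       stream=True) as r:
        # Each non-empty line is one JSON chunk with a fragment of the reply
        for line in r.iter_lines():
            if line:
                chunk = json.loads(line)
                print(chunk.get("response", ""), end="", flush=True)
                if chunk.get("done"):
                    print()

ask_ollama_stream("Write a bash script to monitor disk usage")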
Running as a Service
The Linux install script registers Ollama as a systemd service, so it starts on boot and can be managed with systemctl:
sudo systemctl status ollama
sudo systemctl restart ollama
# View logs
journalctl -u ollama -f
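If other tooling on the box depends on Ollama, the health check can be paired with systemctl in a small watchdog. A hypothetical sketch to run from cron or a systemd timer (assumes the invoking user may restart the service via sudo):

import subprocess
import requests

def ensure_ollama_running():
    try:
        # The root endpoint responds while the server is healthy
        requests.get("http://localhost:11434/", timeout=5).raise_for_status()
    except requests.RequestException:
        subprocess.run(["sudo", "systemctl", "restart", "ollama"], check=False)

ensure_ollama_running()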
Custom Modelfiles
Create specialized variants by baking a system prompt and parameters into a Modelfile:
# Modelfile
FROM llama3.1
SYSTEM You are a senior DevOps engineer. Give concise, practical answers with code examples.
PARAMETER temperature 0.3
PARAMETER num_ctx 4096
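# Build the model from the Modelfile, then start an interactive session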
ollama create devops-helper -f Modelfile
ollama run devops-helper
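The custom model is then addressable by name through the same HTTP API as any pulled model, for example (the prompt text is just an illustration):

import requests

r = requests.post("http://localhost:11434/api/generate", json={
    "model": "devops-helper",
    "prompt": "Harden an SSH config for a public VPS",
    "stream": False
})
print(r.json()["response"])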
Warning: LLM inference is CPU- and memory-intensive. On a shared VPS, a large model can starve other services; monitor resource usage and consider a dedicated server for production LLM workloads.
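One simple way to monitor this is to watch the ollama processes directly. A sketch using psutil (a third-party package, pip install psutil; not something Ollama ships):

import psutil

# Report resident memory for any running ollama processes, plus overall system usage
for proc in psutil.process_iter(["name", "memory_info"]):
    if proc.info["name"] and "ollama" in proc.info["name"]:
        print(f'{proc.pid}: {proc.info["memory_info"].rss / 1e9:.1f} GB resident')

print(f"System memory used: {psutil.virtual_memory().percent}%")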