Running Ollama for Local LLM Inference

By Admin · Mar 22, 2026 · Updated Apr 23, 2026

What is Ollama?

Ollama lets you run large language models (LLMs) locally on your VPS: no API keys, no per-token costs, and full data privacy, because prompts and responses never leave your server.

Installation

curl -fsSL https://ollama.com/install.sh | sh

Downloading Models

# Small and fast (3.8B parameters, ~2.3 GB)
ollama pull phi3

# Medium — good balance (8B parameters, ~4.7 GB)
ollama pull llama3.1

# Large — best quality (70B parameters, ~40 GB)
ollama pull llama3.1:70b

# Code-specialized
ollama pull codellama:13b

Model Size vs Requirements

Model            Parameters   VRAM/RAM   Quality
Phi-3 Mini       3.8B         ~3 GB      Good for simple tasks
Llama 3.1 8B     8B           ~5 GB      Great general purpose
Mistral 7B       7B           ~5 GB      Strong reasoning
Llama 3.1 70B    70B          ~42 GB     Near GPT-4 quality
CodeLlama 13B    13B          ~8 GB      Code generation

Tip: For a 4 GB RAM VPS, stick with Phi-3 or a quantized 7B model. 8 GB of RAM handles 7-8B models comfortably; 70B models need a high-RAM server.
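
As a rough rule of thumb, a 4-bit quantized model needs about half a byte per parameter plus 1-2 GB of runtime overhead. The sketch below works that estimate through in Python; the 0.5 bytes-per-parameter figure and the overhead constant are approximations for illustration, not exact values.

def estimate_ram_gb(params_billion, bytes_per_param=0.5, overhead_gb=1.5):
    """Rough memory estimate for a quantized model: weights plus runtime overhead."""
    weights_gb = params_billion * bytes_per_param  # 1e9 params * 0.5 bytes ~= 0.5 GB per billion
    return weights_gb + overhead_gb

# 8B model at 4-bit quantization: ~5.5 GB, close to the ~5 GB in the table above
print(f"Llama 3.1 8B: ~{estimate_ram_gb(8):.1f} GB")
# 70B model: roughly 36-42 GB depending on quantization and context size
print(f"Llama 3.1 70B: ~{estimate_ram_gb(70):.1f} GB")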

Using the API

Ollama runs an HTTP API server on port 11434 (bound to localhost by default):

# Generate a response
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Explain Docker containers in 3 sentences.",
  "stream": false
}'

# Chat format
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1",
  "messages": [
    {"role": "system", "content": "You are a helpful DevOps assistant."},
    {"role": "user", "content": "How do I optimize Nginx for high traffic?"}
  ],
  "stream": false
}'

Integration with Python

import requests

def ask_ollama(prompt, model="llama3.1"):
    """Send a prompt to the local Ollama server and return the full response text."""
    response = requests.post("http://localhost:11434/api/generate", json={
        "model": model,
        "prompt": prompt,
        "stream": False  # wait for the complete answer instead of streaming token by token
    })
    response.raise_for_status()
    return response.json()["response"]

answer = ask_ollama("Write a bash script to monitor disk usage")
print(answer)
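
The examples above set "stream" to false and wait for the complete answer. For long generations you can stream the output instead: Ollama returns one JSON object per line, each carrying a partial "response" and a final object with "done" set to true. A minimal streaming sketch, assuming the same local server and the requests library:

import json
import requests

def stream_ollama(prompt, model="llama3.1"):
    """Yield response fragments as Ollama produces them (newline-delimited JSON)."""
    with requests.post("http://localhost:11434/api/generate", json={
        "model": model,
        "prompt": prompt,
        "stream": True
    }, stream=True) as response:
        response.raise_for_status()
        for line in response.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)
            if chunk.get("done"):
                break
            yield chunk.get("response", "")

for fragment in stream_ollama("Explain systemd units in two sentences."):
    print(fragment, end="", flush=True)
print()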

Running as a Service

On Linux, the install script registers Ollama as a systemd service that starts automatically at boot:

sudo systemctl status ollama
sudo systemctl restart ollama

# View logs
journalctl -u ollama -f
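
To confirm from a script or cron job that the server is actually answering requests (not just active in systemd), you can probe the API's root endpoint, which responds with a short "Ollama is running" message. A small health-check sketch, assuming the default port:

import sys
import requests

def ollama_healthy(url="http://localhost:11434/"):
    """Return True if the Ollama server answers on its root endpoint."""
    try:
        response = requests.get(url, timeout=5)
        return response.status_code == 200
    except requests.RequestException:
        return False

if not ollama_healthy():
    print("Ollama is not responding", file=sys.stderr)
    sys.exit(1)
print("Ollama is up")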

Custom Modelfiles

Create specialized models by layering a system prompt and parameters on top of a base model:

# Modelfile
FROM llama3.1
SYSTEM You are a senior DevOps engineer. Give concise, practical answers with code examples.
PARAMETER temperature 0.3
PARAMETER num_ctx 4096

# Build and run the custom model
ollama create devops-helper -f Modelfile
ollama run devops-helper
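
The custom model is served through the same API as any other model, so only the model name changes. A quick sketch calling devops-helper (created above) through the generate endpoint used earlier; the prompt is just an example:

import requests

# The custom model behaves like any other pulled model; only the name differs
response = requests.post("http://localhost:11434/api/generate", json={
    "model": "devops-helper",
    "prompt": "How do I rotate Nginx logs?",
    "stream": False
})
print(response.json()["response"])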

Warning: LLM inference is CPU- and memory-intensive. On a shared VPS, a large model can starve other services. Monitor resource usage and consider a dedicated server for production LLM workloads.
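
To see which models are currently loaded into memory, run ollama ps on the command line; the same information is available over the API. A hedged sketch using the /api/ps endpoint (field names vary between versions, so this just prints the raw JSON):

import json
import requests

# /api/ps lists the models currently loaded in memory (see Ollama's API docs)
response = requests.get("http://localhost:11434/api/ps", timeout=5)
print(json.dumps(response.json(), indent=2))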
