
How to Set Up a ChatGPT-Compatible API with LocalAI

By Admin · Mar 2, 2026 · Updated Apr 23, 2026 · 3 min read

LocalAI is an open-source drop-in replacement for the OpenAI API that runs entirely on your own hardware. It supports text generation, embeddings, image generation, and speech-to-text, all with a familiar API interface. Running LocalAI on your Breeze gives you a private, cost-free AI backend compatible with any OpenAI SDK client.

Prerequisites

  • A Breeze instance with at least 8 GB of RAM (16 GB recommended for larger models)
  • Docker installed
  • At least 20 GB of free disk space for models

Installing LocalAI with Docker

Pull and run the LocalAI container. Choose either the CPU or the GPU variant — both use the container name localai, so run only one:

# CPU-only version
docker run -d --name localai \
  -p 8080:8080 \
  -v /data/localai-models:/models \
  localai/localai:latest-cpu

# GPU version (NVIDIA)
docker run -d --name localai \
  -p 8080:8080 \
  --gpus all \
  -v /data/localai-models:/models \
  localai/localai:latest-gpu-nvidia-cuda-12
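Once the container is up, it is worth confirming the API answers before downloading any models. A minimal standard-library sketch, assuming the /readyz health endpoint available in recent LocalAI releases (any connection error is treated as "not ready"):

```python
import urllib.request
import urllib.error

def is_ready(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if the LocalAI health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/readyz", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

if __name__ == "__main__":
    print(is_ready("http://localhost:8080"))
```

If this returns False, check the container logs with docker logs localai before going further.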

Downloading Models

LocalAI can automatically download models from its gallery. Use the API to install a model:

curl -X POST http://localhost:8080/models/apply \
  -H "Content-Type: application/json" \
  -d '{"id": "lunademo"}'

Alternatively, download GGUF-format models manually and place them in the models directory:

cd /data/localai-models
wget https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf
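Large downloads occasionally truncate; GGUF files begin with the 4-byte ASCII magic "GGUF", which makes a quick sanity check possible. A small sketch (the path is just the file downloaded above):

```python
def looks_like_gguf(path: str) -> bool:
    """Check the 4-byte GGUF magic at the start of the file."""
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"

# Example:
#   looks_like_gguf("/data/localai-models/llama-2-7b-chat.Q4_K_M.gguf")
```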

Creating a Model Configuration

Create a YAML configuration file for your model at /data/localai-models/llama-chat.yaml:

name: llama-chat
backend: llama-cpp
parameters:
  model: llama-2-7b-chat.Q4_K_M.gguf
  temperature: 0.7
  top_p: 0.9
context_size: 4096
threads: 4
template:
  chat_message: |
    [INST] {{.Input}} [/INST]

Using the OpenAI-Compatible API

LocalAI exposes endpoints that match the OpenAI API specification. You can use any OpenAI SDK client:

from openai import OpenAI

client = OpenAI(base_url="http://your-breeze-ip:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="llama-chat",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ]
)
print(response.choices[0].message.content)
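The same client can also stream responses token by token by passing stream=True, which matches the OpenAI SDK's streaming interface. A sketch of collecting the streamed deltas; the mock chunks below stand in for SDK objects so the snippet runs without a live server:

```python
from types import SimpleNamespace

def join_stream_deltas(chunks) -> str:
    """Concatenate the content deltas from a streamed chat completion."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta
        if delta.content:
            parts.append(delta.content)
    return "".join(parts)

# Against a live server the chunks come from the SDK:
#   stream = client.chat.completions.create(
#       model="llama-chat",
#       messages=[{"role": "user", "content": "Hi"}],
#       stream=True,
#   )
#   print(join_stream_deltas(stream))

# Mock chunks mimicking the SDK objects, so the sketch runs offline:
mock_chunks = [
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=t))])
    for t in ["Local", "AI ", "streams."]
]
print(join_stream_deltas(mock_chunks))  # LocalAI streams.
```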

Generating Embeddings

LocalAI also supports embedding generation for vector search applications:

curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "bert-embeddings", "input": "Your text here"}'
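The response carries the vector under data[0].embedding, following the OpenAI response shape. For vector-search applications, embeddings are typically compared by cosine similarity; a minimal, dependency-free sketch:

```python
import math

def cosine_similarity(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# With the OpenAI SDK pointed at LocalAI, the vectors would come from:
#   resp = client.embeddings.create(model="bert-embeddings", input="Your text here")
#   vec = resp.data[0].embedding

print(cosine_similarity([1.0, 0.0, 1.0], [1.0, 0.0, 1.0]))  # ~1.0 for identical vectors
```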

Setting Up Multiple Models

Run multiple models simultaneously by creating separate YAML config files for each. LocalAI loads all models found in the models directory and makes them available through the API. Use the /v1/models endpoint to list all available models.
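The /v1/models response follows the OpenAI list shape, {"object": "list", "data": [{"id": ...}, ...]}. A small sketch that extracts the model ids, shown here on a sample payload rather than a live call:

```python
import json

def model_ids(models_json: str):
    """Pull the model ids out of an OpenAI-style GET /v1/models response."""
    payload = json.loads(models_json)
    return [entry["id"] for entry in payload.get("data", [])]

# A live response would come from:
#   curl http://localhost:8080/v1/models
sample = '{"object": "list", "data": [{"id": "llama-chat"}, {"id": "bert-embeddings"}]}'
print(model_ids(sample))  # ['llama-chat', 'bert-embeddings']
```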

Performance Tuning

  • Threads — set the threads parameter to match your Breeze vCPU count
  • Context size — lower context sizes reduce memory usage but limit conversation length
  • Quantization — use Q4_K_M quantized models for the best balance of quality and speed on CPU
  • GPU layers — set gpu_layers in the config to offload layers to GPU for faster inference
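As one illustration of the knobs above, here is a variant of the earlier config adjusted for a GPU host. This is a sketch — verify key names such as gpu_layers against the LocalAI model-configuration reference for your version:

```yaml
name: llama-chat-gpu
backend: llama-cpp
parameters:
  model: llama-2-7b-chat.Q4_K_M.gguf
  temperature: 0.7
context_size: 4096
threads: 8        # match your vCPU count
gpu_layers: 35    # layers offloaded to the GPU; raise until VRAM is exhausted
```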
