
How to Set Up a ChatGPT-Compatible API with LocalAI

By Admin · Mar 2, 2026 · Updated Apr 23, 2026 · 3 min read

LocalAI is an open-source drop-in replacement for the OpenAI API that runs entirely on your own hardware. It supports text generation, embeddings, image generation, and speech-to-text, all with a familiar API interface. Running LocalAI on your Breeze gives you a private, cost-free AI backend compatible with any OpenAI SDK client.

Prerequisites

  • A Breeze instance with at least 8 GB of RAM (16 GB recommended for larger models)
  • Docker installed
  • At least 20 GB of free disk space for models

Installing LocalAI with Docker

Pull and run the LocalAI container. Choose either the CPU or the GPU variant — both use the container name localai, so run only one:

# CPU-only version
docker run -d --name localai \
  -p 8080:8080 \
  -v /data/localai-models:/models \
  localai/localai:latest-cpu

# GPU version (NVIDIA)
docker run -d --name localai \
  -p 8080:8080 \
  --gpus all \
  -v /data/localai-models:/models \
  localai/localai:latest-gpu-nvidia-cuda-12
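Once the container is up, it is worth confirming the API answers before downloading any models. A minimal standard-library sketch, assuming the /readyz health endpoint available in recent LocalAI releases (any connection error is treated as "not ready"):

```python
import urllib.request
import urllib.error

def is_ready(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if the LocalAI health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/readyz", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

if __name__ == "__main__":
    print(is_ready("http://localhost:8080"))
```

If this returns False, check the container logs with docker logs localai before going further.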

Downloading Models

LocalAI can automatically download models from its gallery. Use the API to install a model:

curl -X POST http://localhost:8080/models/apply \
  -H "Content-Type: application/json" \
  -d '{"id": "lunademo"}'

Alternatively, download GGUF-format models manually and place them in the models directory:

cd /data/localai-models
wget https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf
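Large downloads occasionally truncate; GGUF files begin with the 4-byte ASCII magic "GGUF", which makes a quick sanity check possible. A small sketch (the path is just the file downloaded above):

```python
def looks_like_gguf(path: str) -> bool:
    """Check the 4-byte GGUF magic at the start of the file."""
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"

# Example:
#   looks_like_gguf("/data/localai-models/llama-2-7b-chat.Q4_K_M.gguf")
```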

Creating a Model Configuration

Create a YAML configuration file for your model at /data/localai-models/llama-chat.yaml:

name: llama-chat
backend: llama-cpp
parameters:
  model: llama-2-7b-chat.Q4_K_M.gguf
  temperature: 0.7
  top_p: 0.9
context_size: 4096
threads: 4
template:
  chat_message: |
    [INST] {{.Input}} [/INST]

Using the OpenAI-Compatible API

LocalAI exposes endpoints that match the OpenAI API specification. You can use any OpenAI SDK client:

from openai import OpenAI

client = OpenAI(base_url="http://your-breeze-ip:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="llama-chat",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ]
)
print(response.choices[0].message.content)
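The same client can also stream responses token by token by passing stream=True, which matches the OpenAI SDK's streaming interface. A sketch of collecting the streamed deltas; the mock chunks below stand in for SDK objects so the snippet runs without a live server:

```python
from types import SimpleNamespace

def join_stream_deltas(chunks) -> str:
    """Concatenate the content deltas from a streamed chat completion."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta
        if delta.content:
            parts.append(delta.content)
    return "".join(parts)

# Against a live server the chunks come from the SDK:
#   stream = client.chat.completions.create(
#       model="llama-chat",
#       messages=[{"role": "user", "content": "Hi"}],
#       stream=True,
#   )
#   print(join_stream_deltas(stream))

# Mock chunks mimicking the SDK objects, so the sketch runs offline:
mock_chunks = [
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=t))])
    for t in ["Local", "AI ", "streams."]
]
print(join_stream_deltas(mock_chunks))  # LocalAI streams.
```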

Generating Embeddings

LocalAI also supports embedding generation for vector search applications:

curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "bert-embeddings", "input": "Your text here"}'
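The response carries the vector under data[0].embedding, following the OpenAI response shape. For vector-search applications, embeddings are typically compared by cosine similarity; a minimal, dependency-free sketch:

```python
import math

def cosine_similarity(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# With the OpenAI SDK pointed at LocalAI, the vectors would come from:
#   resp = client.embeddings.create(model="bert-embeddings", input="Your text here")
#   vec = resp.data[0].embedding

print(cosine_similarity([1.0, 0.0, 1.0], [1.0, 0.0, 1.0]))  # ~1.0 for identical vectors
```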

Setting Up Multiple Models

Run multiple models simultaneously by creating separate YAML config files for each. LocalAI loads all models found in the models directory and makes them available through the API. Use the /v1/models endpoint to list all available models.
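The /v1/models response follows the OpenAI list shape, {"object": "list", "data": [{"id": ...}, ...]}. A small sketch that extracts the model ids, shown here on a sample payload rather than a live call:

```python
import json

def model_ids(models_json: str):
    """Pull the model ids out of an OpenAI-style GET /v1/models response."""
    payload = json.loads(models_json)
    return [entry["id"] for entry in payload.get("data", [])]

# A live response would come from:
#   curl http://localhost:8080/v1/models
sample = '{"object": "list", "data": [{"id": "llama-chat"}, {"id": "bert-embeddings"}]}'
print(model_ids(sample))  # ['llama-chat', 'bert-embeddings']
```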

Performance Tuning

  • Threads — set the threads parameter to match your Breeze vCPU count
  • Context size — lower context sizes reduce memory usage but limit conversation length
  • Quantization — use Q4_K_M quantized models for the best balance of quality and speed on CPU
  • GPU layers — set gpu_layers in the config to offload layers to GPU for faster inference
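As one illustration of the knobs above, here is a variant of the earlier config adjusted for a GPU host. This is a sketch — verify key names such as gpu_layers against the LocalAI model-configuration reference for your version:

```yaml
name: llama-chat-gpu
backend: llama-cpp
parameters:
  model: llama-2-7b-chat.Q4_K_M.gguf
  temperature: 0.7
context_size: 4096
threads: 8        # match your vCPU count
gpu_layers: 35    # layers offloaded to the GPU; raise until VRAM is exhausted
```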
