## What is RAG?
Retrieval-Augmented Generation (RAG) combines a knowledge base with an LLM. Instead of relying solely on the model's training data, you retrieve relevant documents and include them in the prompt.
```
User Question → Search Knowledge Base → Inject Context → LLM → Answer
```
## Why RAG?
| Approach | Pros | Cons |
|---|---|---|
| Fine-tuning | Deeply integrated knowledge | Expensive, slow, stale |
| RAG | Fresh data, cheap, auditable | Retrieval quality matters |
| Prompt stuffing | Simple | Context window limits |
## Setup

```bash
pip install chromadb sentence-transformers requests
```
## Step 1: Create a Vector Store
```python
import chromadb
from chromadb.utils import embedding_functions

# Use a local embedding model (no API needed)
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

# Persist the index to disk so it survives restarts
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(
    name="knowledge_base",
    embedding_function=ef
)
```
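To verify the store was created (a quick check, not part of the original steps), Chroma's `count()` reports how many documents are indexed:

```python
# Sanity check: the collection persists under ./chroma_db across runs
print(collection.name)     # "knowledge_base"
print(collection.count())  # 0 before any documents are added
```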
## Step 2: Add Documents
```python
# Add your knowledge base
documents = [
    "To reset a server password, go to Portal > Server > Settings > Reset Password",
    "Kazepute VPS plans start at $5/month for 1 vCPU and 1GB RAM",
    "DNS changes can take up to 48 hours to propagate globally",
    "To enable backups, go to Portal > Server > Backups > Enable",
]

collection.add(
    documents=documents,
    ids=[f"doc_{i}" for i in range(len(documents))],
    metadatas=[{"source": "docs"} for _ in documents]
)
```
## Step 3: Query and Generate
```python
import requests

def answer_question(question: str) -> str:
    # Retrieve relevant documents
    results = collection.query(
        query_texts=[question],
        n_results=3
    )
    context = "\n".join(results["documents"][0])

    # Build prompt with context
    prompt = f"""Answer the question based on the following context.
If the context doesn't contain the answer, say "I don't have information about that."

Context:
{context}

Question: {question}

Answer:"""

    # Use Ollama (local) or any LLM API
    response = requests.post("http://localhost:11434/api/generate", json={
        "model": "llama3.1",
        "prompt": prompt,
        "stream": False
    })
    return response.json()["response"]

# Usage
print(answer_question("How much does the cheapest VPS plan cost?"))
```
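Since `query` also returns `metadatas` and `distances` by default, you can surface sources alongside the answer for auditability. The `answer_with_sources` helper below is a sketch, not part of the original tutorial:

```python
def answer_with_sources(question: str) -> dict:
    """Sketch: return the generated answer plus the chunks it was grounded on."""
    results = collection.query(query_texts=[question], n_results=3)
    # Pair each retrieved chunk with its metadata and similarity distance
    sources = [
        {"text": doc, "source": meta["source"], "distance": dist}
        for doc, meta, dist in zip(
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0],
        )
    ]
    return {"answer": answer_question(question), "sources": sources}
```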
## Chunking Strategies
Large documents need to be split into chunks before embedding, since a single vector can't precisely represent a whole document:
```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    words = text.split()
    chunks = []
    # Step by chunk_size - overlap so consecutive chunks share `overlap` words
    for i in range(0, len(words), chunk_size - overlap):
        chunk = " ".join(words[i:i + chunk_size])
        chunks.append(chunk)
    return chunks
```
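Putting it together, chunked documents go into the collection the same way as before; `server_manual.txt` below is a placeholder for your own long document:

```python
# Chunk a long document and index each piece with a traceable ID
manual_text = open("server_manual.txt").read()  # placeholder file
chunks = chunk_text(manual_text)
collection.add(
    documents=chunks,
    ids=[f"manual_{i}" for i in range(len(chunks))],
    metadatas=[{"source": "server_manual", "chunk": i} for i in range(len(chunks))]
)
```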
| Chunk Size | Retrieval | Notes |
|---|---|---|
| Small (~200 words) | More precise | May miss surrounding context |
| Medium (~500 words) | Balanced | Good default |
| Large (~1000 words) | More context per hit | May include irrelevant info |
> **Tip:** Start with 500-word chunks and a 50-word overlap. Adjust for your content type: code docs benefit from larger chunks, while FAQ-style content works better with smaller ones.
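For content with natural boundaries (FAQ entries, paragraphs), splitting on those boundaries instead of a fixed word count often retrieves cleaner chunks. A minimal sketch, assuming entries are separated by blank lines:

```python
def chunk_by_paragraph(text: str, max_words: int = 500) -> list[str]:
    """Merge consecutive paragraphs into chunks of at most max_words words."""
    chunks, current, count = [], [], 0
    for para in text.split("\n\n"):
        n = len(para.split())
        # Flush the current chunk before it would exceed the word budget;
        # a single oversized paragraph becomes its own chunk
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```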