## What is RAG?
Retrieval-Augmented Generation (RAG) combines a knowledge base with an LLM. Instead of relying solely on the model's training data, you retrieve relevant documents and include them in the prompt.
```
User Question → Search Knowledge Base → Inject Context → LLM → Answer
```
## Why RAG?
| Approach | Pros | Cons |
|---|---|---|
| Fine-tuning | Deeply integrated knowledge | Expensive, slow, stale |
| RAG | Fresh data, cheap, auditable | Retrieval quality matters |
| Prompt stuffing | Simple | Context window limits |
## Setup

```bash
pip install chromadb sentence-transformers requests
```
## Step 1: Create a Vector Store
```python
import chromadb
from chromadb.utils import embedding_functions

# Use a local embedding model (no API needed)
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

# Persist the index to disk so it survives restarts
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(
    name="knowledge_base",
    embedding_function=ef
)
```
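To verify the store was created (a quick check, not part of the original steps), Chroma's `count()` reports how many documents are indexed:

```python
# Sanity check: the collection persists under ./chroma_db across runs
print(collection.name)     # "knowledge_base"
print(collection.count())  # 0 before any documents are added
```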
## Step 2: Add Documents
```python
# Add your knowledge base
documents = [
    "To reset a server password, go to Portal > Server > Settings > Reset Password",
    "Kazepute VPS plans start at $5/month for 1 vCPU and 1GB RAM",
    "DNS changes can take up to 48 hours to propagate globally",
    "To enable backups, go to Portal > Server > Backups > Enable",
]

collection.add(
    documents=documents,
    ids=[f"doc_{i}" for i in range(len(documents))],
    metadatas=[{"source": "docs"} for _ in documents]
)
```
## Step 3: Query and Generate
```python
import requests

def answer_question(question: str) -> str:
    # Retrieve relevant documents
    results = collection.query(
        query_texts=[question],
        n_results=3
    )
    context = "\n".join(results["documents"][0])

    # Build prompt with context
    prompt = f"""Answer the question based on the following context.
If the context doesn't contain the answer, say "I don't have information about that."

Context:
{context}

Question: {question}

Answer:"""

    # Use Ollama (local) or any LLM API
    response = requests.post("http://localhost:11434/api/generate", json={
        "model": "llama3.1",
        "prompt": prompt,
        "stream": False
    })
    return response.json()["response"]

# Usage
print(answer_question("How much does the cheapest VPS plan cost?"))
```
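Since `query` also returns `metadatas` and `distances` by default, you can surface sources alongside the answer for auditability. The `answer_with_sources` helper below is a sketch, not part of the original tutorial:

```python
def answer_with_sources(question: str) -> dict:
    """Sketch: return the generated answer plus the chunks it was grounded on."""
    results = collection.query(query_texts=[question], n_results=3)
    # Pair each retrieved chunk with its metadata and similarity distance
    sources = [
        {"text": doc, "source": meta["source"], "distance": dist}
        for doc, meta, dist in zip(
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0],
        )
    ]
    return {"answer": answer_question(question), "sources": sources}
```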
## Chunking Strategies
Large documents need to be split into chunks before embedding, since a single vector can't precisely represent a whole document:
```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    words = text.split()
    chunks = []
    # Step by chunk_size - overlap so consecutive chunks share `overlap` words
    for i in range(0, len(words), chunk_size - overlap):
        chunk = " ".join(words[i:i + chunk_size])
        chunks.append(chunk)
    return chunks
```
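Putting it together, chunked documents go into the collection the same way as before; `server_manual.txt` below is a placeholder for your own long document:

```python
# Chunk a long document and index each piece with a traceable ID
manual_text = open("server_manual.txt").read()  # placeholder file
chunks = chunk_text(manual_text)
collection.add(
    documents=chunks,
    ids=[f"manual_{i}" for i in range(len(chunks))],
    metadatas=[{"source": "server_manual", "chunk": i} for i in range(len(chunks))]
)
```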
| Chunk Size | Retrieval | Notes |
|---|---|---|
| Small (~200 words) | More precise | May miss surrounding context |
| Medium (~500 words) | Balanced | Good default |
| Large (~1000 words) | More context per hit | May include irrelevant info |
> **Tip:** Start with 500-word chunks and a 50-word overlap. Adjust for your content type: code docs benefit from larger chunks, while FAQ-style content works better with smaller ones.
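For content with natural boundaries (FAQ entries, paragraphs), splitting on those boundaries instead of a fixed word count often retrieves cleaner chunks. A minimal sketch, assuming entries are separated by blank lines:

```python
def chunk_by_paragraph(text: str, max_words: int = 500) -> list[str]:
    """Merge consecutive paragraphs into chunks of at most max_words words."""
    chunks, current, count = [], [], 0
    for para in text.split("\n\n"):
        n = len(para.split())
        # Flush the current chunk before it would exceed the word budget;
        # a single oversized paragraph becomes its own chunk
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```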