Retrieval-Augmented Generation (RAG) combines document retrieval with language model generation to produce grounded, accurate answers. Building a RAG pipeline on your Breeze keeps your data private and under your control.
Architecture Overview
- Document Ingestion -- chunk and embed your documents
- Vector Store -- store embeddings for fast similarity search
- Retrieval + Generation -- query relevant chunks and pass them to an LLM
Install Dependencies
pip install langchain chromadb sentence-transformers
Ingest Documents
Load your documents, split them into chunks, and generate embeddings (the loader and path below are placeholders; point them at your own content):
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
# Load your source documents (example path; swap in the loader for your own format)
documents = TextLoader("/opt/rag/docs/faq.txt").load()
# Split into overlapping chunks so context is not lost at chunk boundaries
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(documents)
# Embed each chunk with a small sentence-transformers model and persist the index to disk
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="/opt/rag/db")
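Because persist_directory is set, the index is written to disk, so later runs can reopen it instead of re-embedding everything (depending on your LangChain and Chroma versions, you may also need to call vectorstore.persist() after ingestion):
# Reopen the persisted index on a later run; no re-ingestion needed
vectorstore = Chroma(persist_directory="/opt/rag/db", embedding_function=embeddings)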
Query the Pipeline
Retrieve relevant chunks and feed them as context to your language model:
# Fetch the 4 most similar chunks for the question
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
results = retriever.get_relevant_documents("How do I reset my password?")
# Join the retrieved chunks into a single context string for the prompt
context = "\n".join([doc.page_content for doc in results])
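To complete the generation step, pass the context to whatever local model you run on the Breeze. The sketch below is one option: it assumes a local Ollama server and uses LangChain's Ollama wrapper, with a placeholder model name; substitute your own LLM runtime and model.
from langchain.llms import Ollama
# Assumes a local Ollama server; replace with your own LLM runtime if different
llm = Ollama(model="llama3")  # placeholder model name; use one you have pulled locally
# Build a grounded prompt from the retrieved context
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    "Question: How do I reset my password?\nAnswer:"
)
print(llm(prompt))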
Performance Tips
- Use a Breeze with at least 8 GB RAM for embedding models
- Persist your vector store to disk to avoid re-indexing
- Tune chunk size based on your document types for better retrieval accuracy
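A quick way to sanity-check chunking settings before rebuilding the index is to re-split the same documents at a couple of candidate sizes and inspect what lands in each chunk (the sizes below are arbitrary starting points):
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Compare candidate chunk sizes on the same documents before committing to an index
for size in (300, 800):
    splitter = RecursiveCharacterTextSplitter(chunk_size=size, chunk_overlap=size // 10)
    chunks = splitter.split_documents(documents)
    print(f"chunk_size={size}: {len(chunks)} chunks")
    print(chunks[0].page_content[:200])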