RAG: Retrieval-Augmented Generation for LLMs

Implement RAG: embeddings, vector databases, chunking strategies, and retrieval patterns.

Tags: AI, RAG, LLM, Vector Database

By MinhVo

Introduction

Large Language Models are extraordinary at generating fluent text, but they have a fundamental limitation: they only know what was in their training data. Ask an LLM about your company's internal documentation, a product released last week, or proprietary research data, and it will either hallucinate an answer or admit ignorance. Retrieval-Augmented Generation (RAG) solves this by giving LLMs access to external knowledge at query time.

RAG works by combining two systems: a retrieval engine that finds relevant documents from a knowledge base, and a generator (the LLM) that produces answers grounded in those retrieved documents. This architecture lets you build AI applications that are knowledgeable about your specific data without fine-tuning the model — dramatically reducing cost, improving accuracy, and keeping knowledge up to date.

[Figure: RAG architecture]

Many widely used AI products rely on RAG in some form. ChatGPT's browsing feature, Perplexity's search engine, enterprise chatbots over internal documentation, and code assistants that reference your repository all apply retrieval patterns of this kind. Understanding how to build, optimize, and evaluate RAG systems is one of the most valuable skills in applied AI engineering today.

Understanding RAG: Core Concepts

The Problem RAG Solves

LLMs have three fundamental limitations that RAG addresses:

Knowledge cutoff: Models are trained on data up to a specific date. They cannot answer questions about events after that date. RAG provides access to current information.

Hallucination: When LLMs lack knowledge, they generate plausible-sounding but incorrect answers. RAG grounds responses in retrieved facts, reducing hallucination.

Private data: Models do not have access to your internal documents, databases, or proprietary information. RAG lets you query private knowledge bases without exposing data during training.

The RAG Pipeline

A RAG system has two main phases:

Indexing Phase (offline):

  1. Load documents from various sources (PDFs, websites, databases, APIs)
  2. Split documents into chunks of appropriate size
  3. Generate vector embeddings for each chunk using an embedding model
  4. Store embeddings and chunks in a vector database

Query Phase (online):

  1. Convert the user's question into a vector embedding
  2. Search the vector database for the most similar chunks (top-k retrieval)
  3. Construct a prompt that includes the retrieved context and the question
  4. Send the prompt to the LLM to generate a grounded answer

Embeddings are numerical representations of text that capture semantic meaning. Similar concepts have similar embeddings, enabling semantic search that goes far beyond keyword matching.

from openai import OpenAI
 
client = OpenAI()
 
# Generate an embedding
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="How does RAG improve LLM accuracy?"
)
embedding = response.data[0].embedding  # 1536-dimensional vector

Vector databases store these embeddings and perform approximate nearest neighbor (ANN) search to find the most similar vectors in milliseconds, even across millions of documents.
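To make "most similar vectors" concrete, here is a minimal sketch of the exact search that ANN indexes approximate: score every stored vector against the query with cosine similarity and keep the best k. The three-dimensional vectors and document IDs are made-up values for illustration; real embeddings from the model above have 1536 dimensions, and a vector database would use an index rather than this linear scan.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so treat it as dissimilar
    return dot / (norm_a * norm_b)


def top_k(query: list[float], vectors: dict[str, list[float]], k: int = 2) -> list[tuple[str, float]]:
    """Exact nearest-neighbor search: score every vector, keep the k best."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings" (hypothetical values, for illustration only)
toy_vectors = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.0],
    "api-reference": [0.0, 0.1, 0.9],
}
nearest = top_k([0.8, 0.2, 0.0], toy_vectors, k=2)
print(nearest)
```

The linear scan costs one similarity computation per stored vector, which is why production systems trade a little accuracy for an ANN index once the collection grows.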

Architecture and Design Patterns

Basic RAG Architecture

The simplest RAG pipeline follows a straightforward linear flow:

User Query → Embed → Search Vector DB → Retrieve Chunks → Build Prompt → LLM → Response

Advanced RAG Patterns

Production RAG systems use several advanced patterns to improve quality:

Query Transformation: Rewrite the user's query before retrieval to improve results. Techniques include HyDE (Hypothetical Document Embeddings), query decomposition, and step-back prompting.

Multi-Index Retrieval: Use separate vector stores for different document types or sources, then merge results. A legal chatbot might have separate indices for contracts, regulations, and case law.

Reranking: After initial retrieval, use a cross-encoder model to rerank results by relevance. This is more expensive but significantly improves precision.

Contextual Compression: Extract only the relevant portions of retrieved documents before sending them to the LLM, reducing noise and token costs.

Self-RAG: The LLM evaluates its own retrieval needs, deciding whether to retrieve, how many documents to fetch, and whether the generated answer is supported by the evidence.
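Of the patterns above, reranking is the easiest to sketch in a few lines. The example below shows only the two-stage shape: take the candidates from a cheap first-stage retrieval, rescore each one against the query, and keep the best. The `term_overlap_score` function is a stand-in chosen so the sketch runs without a model; it is not a cross-encoder. In practice you would pass a function that wraps a cross-encoder model's scoring call instead.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 3,
) -> list[str]:
    """Rescore first-stage candidates with a more expensive scorer and keep the best."""
    scored = [(score_fn(query, text), text) for text in candidates]
    # Sort by score only; the sort is stable, so ties keep their first-stage order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_n]]


def term_overlap_score(query: str, text: str) -> float:
    """Stand-in scorer (NOT a cross-encoder): fraction of query terms found in the text."""
    query_terms = set(query.lower().split())
    text_terms = set(text.lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & text_terms) / len(query_terms)


# Hypothetical first-stage results, in the order the vector search returned them
candidates = [
    "Shipping usually takes five business days.",
    "Refunds are issued within 30 days of purchase.",
    "Our refund policy covers unused items only.",
]
reranked = rerank("refund policy for unused items", candidates, term_overlap_score, top_n=2)
print(reranked[0])
```

Keeping the scorer as a parameter makes it straightforward to swap in a real model later and to test the pipeline without one.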

Step-by-Step Implementation

Building a RAG System from Scratch

Here is a complete RAG implementation using Python, LangChain, and ChromaDB:

# Step 1: Install dependencies
# pip install langchain langchain-community langchain-openai chromadb pypdf tiktoken
 
from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA

Document Loading and Chunking

from langchain_community.document_loaders import DirectoryLoader
 
# Load all PDFs from a directory
loader = DirectoryLoader("./documents/", glob="**/*.pdf", loader_cls=PyPDFLoader)
documents = loader.load()
 
print(f"Loaded {len(documents)} pages")
 
# Split into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,        # ~250 tokens per chunk
    chunk_overlap=200,      # Overlap to maintain context across boundaries
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = text_splitter.split_documents(documents)
 
print(f"Created {len(chunks)} chunks")

Chunking strategy matters enormously. Too small and you lose context. Too large and you dilute relevance. The sweet spot for most use cases is 500-1000 characters with 10-20% overlap.
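To see how chunk size and overlap interact, here is a simplified fixed-window chunker written as a sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on the listed separators); it only illustrates the arithmetic: each window starts `chunk_size - chunk_overlap` characters after the previous one, so neighbouring chunks share `chunk_overlap` characters.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "x" * 2500
pieces = chunk_text(sample, chunk_size=1000, chunk_overlap=200)
print(len(pieces), [len(p) for p in pieces])
```

With 2,500 characters, a chunk size of 1,000, and an overlap of 200, the windows start at offsets 0, 800, and 1,600, giving three chunks.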

Vector Store Setup

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
 
# Initialize embedding model
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
 
# Create vector store with persistence
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",
    collection_name="documents"
)
 
# Verify
print(f"Stored {vectorstore._collection.count()} vectors")

Retrieval and Generation

from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
 
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
 
# Create retrieval chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # Concatenate all retrieved docs into one prompt
    retriever=vectorstore.as_retriever(
        search_type="similarity",
        search_kwargs={"k": 4}
    ),
    return_source_documents=True
)
 
# Query
result = qa_chain.invoke({"query": "What is our refund policy?"})
print(result["result"])
print(f"Sources: {[doc.metadata['source'] for doc in result['source_documents']]}")

Custom RAG with Full Control

For production systems, you often need more control than a chain provides:

from openai import OpenAI
import chromadb
 
client = OpenAI()
chroma_client = chromadb.PersistentClient(path="./chroma_db")
collection = chroma_client.get_collection("documents")
 
def rag_query(question: str, k: int = 4) -> str:
    # Step 1: Generate query embedding
    query_embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=question
    ).data[0].embedding
 
    # Step 2: Retrieve relevant chunks
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=k,
        include=["documents", "metadatas", "distances"]
    )
 
    # Step 3: Build context
    context_parts = []
    for i, (doc, meta, dist) in enumerate(zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0]
    )):
        source = meta.get("source", "unknown")
        context_parts.append(f"[Source {i+1}: {source}]\n{doc}")
 
    context = "\n\n---\n\n".join(context_parts)
 
    # Step 4: Generate answer
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": """You are a helpful assistant. Answer 
questions based ONLY on the provided context. If the context doesn't contain 
enough information, say so. Always cite your sources."""},
            {"role": "user", "content": f"""Context:\n{context}\n\n
Question: {question}\n\nAnswer:"""}
        ],
        temperature=0
    )
 
    return response.choices[0].message.content
 
# Test
answer = rag_query("What are the key features of our product?")
print(answer)
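The `rag_query` function above requests distances but does not use them. One possible refinement, sketched here, is to drop hits that are too far from the query before building the context, so that weak matches do not reach the prompt. The function name `filter_by_distance` and the threshold value are assumptions for illustration; a sensible cutoff depends on the distance metric your collection uses and should be tuned on your own data. The input is a hand-written dictionary in the same nested-list shape that `rag_query` reads from.

```python
def filter_by_distance(results: dict, max_distance: float) -> list[dict]:
    """Keep only hits whose distance is at or below max_distance (smaller means closer).

    `results` is expected in the nested-list shape read by rag_query:
    results["documents"][0], results["metadatas"][0], results["distances"][0].
    """
    kept = []
    for doc, meta, dist in zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0],
    ):
        if dist <= max_distance:
            source = (meta or {}).get("source", "unknown")
            kept.append({"text": doc, "source": source, "distance": dist})
    return kept


# Hand-written stand-in for a query result (hypothetical values)
fake_results = {
    "documents": [["Refunds within 30 days.", "Office opening hours."]],
    "metadatas": [[{"source": "policy.pdf"}, {"source": "hours.pdf"}]],
    "distances": [[0.21, 0.87]],
}
relevant = filter_by_distance(fake_results, max_distance=0.5)
print([hit["source"] for hit in relevant])
```

If the filtered list comes back empty, the caller can skip the LLM call and report that nothing relevant was found.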

[Figure: Vector database search]

Real-World Use Cases

Use Case 1: Enterprise Knowledge Base Chatbot

A company with thousands of internal documents (policies, procedures, technical specs) builds a RAG chatbot that employees can query in natural language. The system indexes all documents, processes them into chunks, and serves answers with source citations. This replaces hours of searching through SharePoint or Confluence.

Use Case 2: Customer Support Automation

A support team integrates RAG with their ticket system and knowledge base. When a customer asks a question, the system retrieves relevant help articles and generates a personalized response. Agents review and approve responses before sending, creating a human-in-the-loop workflow that handles 3x more tickets.

Use Case 3: Legal Document Analysis

A law firm uses RAG to analyze contracts and legal precedents. Lawyers ask questions like "What are the termination clauses in the Acme contract?" and the system retrieves relevant passages from thousands of documents in seconds.

Use Case 4: Code Assistant with Repository Context

A development team builds a RAG system over their codebase, documentation, and pull request history. Developers ask questions like "How does the authentication middleware work?" and get accurate answers grounded in the actual code.

Best Practices for Production

  1. Invest in chunking quality: The chunking strategy often has more impact on RAG quality than the choice of LLM. Experiment with chunk sizes (200-1500 tokens), overlap percentages, and splitting strategies (by paragraph, by sentence, by semantic similarity).

  2. Use metadata filtering: Store metadata (source, date, category, author) with each chunk and use it to filter retrieval results. This dramatically improves relevance when your knowledge base covers multiple topics.

  3. Implement hybrid search: Combine vector similarity search with keyword search (BM25). Vector search excels at semantic understanding; keyword search excels at exact term matching. Most production systems benefit from combining both.

  4. Evaluate systematically: Build a test set of questions with expected answers. Measure retrieval quality (precision, recall, MRR) and generation quality (faithfulness, relevance, completeness) separately. Tools like RAGAS and DeepEval automate this.

  5. Implement guardrails: Detect and refuse harmful queries. Validate that generated answers are actually supported by retrieved documents (citation verification). Prevent prompt injection through input sanitization.

  6. Cache aggressively: For the same query against an unchanged index, embedding generation and vector search return effectively the same results. Cache them to reduce latency and costs, and invalidate the cache when the index changes. Implement semantic caching for similar (not identical) queries.

  7. Monitor and log everything: Track retrieval quality, LLM response quality, latency, and costs. Alert on quality degradation. Log retrieved chunks alongside generated answers for debugging.

  8. Optimize for cost: Use smaller, cheaper models for embedding generation. Compress context by extracting relevant sentences rather than passing full chunks. Batch embedding requests where possible.
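For practice 3 (hybrid search), one common way to merge a vector-search ranking with a keyword ranking is Reciprocal Rank Fusion (RRF): each list contributes 1 / (k + rank) to a document's score, and documents are sorted by the total. The sketch below assumes you already have two ranked lists of document IDs; the IDs are made up, and k = 60 is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of document IDs; each list adds 1 / (k + rank) per document."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results from the two retrievers, best match first
vector_hits = ["doc-a", "doc-b", "doc-c"]
keyword_hits = ["doc-c", "doc-a", "doc-d"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

RRF uses only rank positions, so it avoids having to normalize cosine similarities and BM25 scores onto a shared scale.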

Common Pitfalls and Solutions

| Pitfall | Impact | Solution |
| --- | --- | --- |
| Chunks too large | Diluted relevance, high token costs | Use 500-1000 character chunks with overlap |
| Chunks too small | Lost context, fragmented answers | Ensure chunks maintain paragraph-level coherence |
| No metadata filtering | Irrelevant results from wrong document types | Store and filter by source, date, category |
| Single retrieval strategy | Missing relevant documents | Combine vector search with keyword (BM25) search |
| No answer validation | LLM hallucinates even with context | Implement faithfulness checking with separate LLM call |
| Ignoring retrieval quality | Garbage in, garbage out | Evaluate retrieval separately from generation |
| Static knowledge base | Stale information | Implement incremental indexing with change detection |
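The last row, incremental indexing with change detection, can be sketched with content hashes: store a hash per document at indexing time, then compare against the current documents to decide what to add, re-embed, or delete. The function names and the in-memory dictionaries are assumptions for illustration; a real system would persist the hashes alongside the vector store.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_index_update(current_docs: dict[str, str], indexed_hashes: dict[str, str]) -> dict[str, list[str]]:
    """Compare current documents (id -> text) with stored hashes (id -> hash)."""
    to_add, to_update = [], []
    for doc_id, text in current_docs.items():
        if doc_id not in indexed_hashes:
            to_add.append(doc_id)  # never indexed before
        elif indexed_hashes[doc_id] != content_hash(text):
            to_update.append(doc_id)  # text changed, so re-chunk and re-embed
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in current_docs]
    return {"add": to_add, "update": to_update, "delete": to_delete}


# Hypothetical state: what was indexed earlier vs. what exists now
indexed = {
    "a.md": content_hash("old text"),
    "b.md": content_hash("same"),
    "c.md": content_hash("gone"),
}
current = {"a.md": "new text", "b.md": "same", "d.md": "brand new"}
plan = plan_index_update(current, indexed)
print(plan)
```

Only the documents listed under "add" and "update" need new embeddings, which keeps re-indexing cost proportional to what actually changed.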

Performance Optimization

# Caching with Redis for repeated queries
import hashlib
import json
import redis
 
cache = redis.Redis(host='localhost', port=6379, db=0)
 
def cached_rag_query(question: str, k: int = 4, ttl: int = 3600) -> str:
    # Include k in the key so calls with different k do not share a cached answer
    cache_key = f"rag:{k}:{hashlib.md5(question.encode()).hexdigest()}"
    cached = cache.get(cache_key)
    if cached:
        return json.loads(cached)
 
    answer = rag_query(question, k)
    cache.setex(cache_key, ttl, json.dumps(answer))
    return answer
 
# Streaming responses for better UX
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
 
app = FastAPI()
 
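@app.post("/query")
def query_endpoint(question: str):
    # retrieve_context is the retrieval helper (not shown here); it is assumed
    # to return documents exposing .page_content, as in the tests later on
    context = "\n\n".join(c.page_content for c in retrieve_context(question))

    # A plain (sync) generator: the OpenAI client used here is synchronous,
    # and StreamingResponse iterates sync generators in a thread pool
    def generate():
        stream = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "Answer based ONLY on the provided context."},
                {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
            ],
            stream=True
        )
        for chunk in stream:
            # Some chunks carry no choices or no text; skip those safely
            if chunk.choices:
                yield chunk.choices[0].delta.content or ""

    return StreamingResponse(generate(), media_type="text/plain")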
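The Redis cache above only hits on byte-identical questions. Semantic caching, mentioned in the best practices, instead compares query embeddings and reuses an answer when a previous query is close enough. The sketch below is an in-memory illustration under stated assumptions: the class name `SemanticCache`, the 0.95 threshold, and the two-dimensional vectors are all hypothetical, and a real deployment would store entries in a vector index rather than a Python list.

```python
import math
from typing import Optional


class SemanticCache:
    """Reuse an answer when a new query embedding is close to a cached one."""

    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold  # minimum cosine similarity to count as a hit
        self.entries: list[tuple[list[float], str]] = []

    @staticmethod
    def _cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def get(self, query_embedding: list[float]) -> Optional[str]:
        """Return the answer of the most similar cached query, or None on a miss."""
        best_score, best_answer = 0.0, None
        for embedding, answer in self.entries:
            score = self._cosine(query_embedding, embedding)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

    def put(self, query_embedding: list[float], answer: str) -> None:
        self.entries.append((query_embedding, answer))


# Toy 2-dimensional embeddings (hypothetical values)
semantic_cache = SemanticCache(threshold=0.95)
semantic_cache.put([1.0, 0.0], "Refunds are available within 30 days.")
hit = semantic_cache.get([0.99, 0.05])   # nearly the same direction
miss = semantic_cache.get([0.0, 1.0])    # unrelated direction
print(hit, miss)
```

A threshold that is too low returns answers to questions the user did not ask, so it is worth validating the cutoff against real query pairs before relying on it.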

Comparison with Alternatives

| Feature | RAG | Fine-tuning | Prompt Engineering | Knowledge Graphs |
| --- | --- | --- | --- | --- |
| Cost | Low (no training) | High (GPU compute) | Very low | Moderate |
| Freshness | Real-time updates | Requires retraining | Static | Real-time updates |
| Accuracy | High (grounded) | High (domain-specific) | Variable | High (structured) |
| Setup complexity | Moderate | High | Low | High |
| Best for | Knowledge Q&A | Style/format adaptation | Simple tasks | Complex relationships |
| Scalability | Vector DB scales | Model retraining | Token limits | Graph query scales |

Advanced Patterns

Multi-Step RAG with Query Decomposition

import re

# retrieve_context, deduplicate_chunks, and generate_answer are helpers not shown here
def multi_step_rag(complex_question: str) -> str:
    # Step 1: Decompose the question
    decomposition = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Break this question into 2-3 sub-questions:\n{complex_question}"
        }]
    )
    # Keep non-empty lines and strip list markers such as "- " or "1. "
    lines = decomposition.choices[0].message.content.splitlines()
    sub_questions = [
        re.sub(r"^\s*(?:[-*]|\d+[.)])\s*", "", line).strip()
        for line in lines if line.strip()
    ]

    # Step 2: Retrieve for each sub-question
    all_context = []
    for sq in sub_questions:
        context = retrieve_context(sq)
        all_context.extend(context)
 
    # Step 3: Deduplicate and synthesize
    unique_context = deduplicate_chunks(all_context)
 
    # Step 4: Generate final answer
    return generate_answer(complex_question, unique_context)
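The `deduplicate_chunks` helper is not defined in the code above. One possible version, sketched here, removes chunks whose text is identical after normalizing whitespace and case, keeping the first occurrence so retrieval order is preserved. The `Chunk` dataclass is a stand-in for whatever document type `retrieve_context` returns; the only assumption is that each item has a `page_content` attribute.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Stand-in for a retrieved document; only page_content and metadata are assumed."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def deduplicate_chunks(chunks: list) -> list:
    """Drop chunks whose normalized text was already seen; keep first occurrences in order."""
    seen = set()
    unique = []
    for chunk in chunks:
        normalized = " ".join(chunk.page_content.split()).lower()
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique


# The same passage retrieved for two different sub-questions (hypothetical data)
retrieved = [
    Chunk("Refunds are issued within 30 days.", {"source": "policy.pdf"}),
    Chunk("Shipping takes five business days.", {"source": "shipping.pdf"}),
    Chunk("refunds are  issued within 30 days.", {"source": "policy.pdf"}),
]
unique_chunks = deduplicate_chunks(retrieved)
print(len(unique_chunks))
```

This catches exact repeats only. Overlapping chunks that share most of their text would need a similarity-based check instead.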

Testing Strategies

import pytest
 
def test_retrieval_returns_relevant_chunks():
    results = retrieve_context("What is the refund policy?")
    sources = [r.metadata["source"] for r in results]
    assert any("refund" in s.lower() for s in sources)
 
def test_answer_is_grounded_in_context():
    answer = rag_query("What is the refund policy?")
    context = retrieve_context("What is the refund policy?")
    context_text = " ".join([c.page_content for c in context])
    # Check that key terms from the answer appear in the context
    answer_words = set(answer.lower().split())
    context_words = set(context_text.lower().split())
    overlap = answer_words & context_words
    assert len(overlap) > 5, "Answer should be grounded in retrieved context"
 
def test_handles_unanswerable_questions():
    answer = rag_query("What is the meaning of life?")
    assert any(phrase in answer.lower() for phrase in [
        "don't have", "not found", "cannot answer", "no information"
    ])
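The tests above check behavior end to end. Retrieval quality can also be measured on its own with the metrics named in the best practices: precision, recall, and MRR. The sketch below assumes a small hand-labelled test set where each question has a ranked list of retrieved document IDs and a set of IDs known to be relevant; the IDs and labels are made up for illustration.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for doc_id in top if doc_id in relevant) / len(top)


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average of 1 / rank of the first relevant document per query (0 if none found)."""
    if not runs:
        return 0.0
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)


# Hypothetical labelled runs: (ranked retrieval results, relevant IDs)
runs = [
    (["a", "b", "c"], {"b"}),        # first relevant hit at rank 2
    (["d", "e", "f"], {"d", "f"}),   # first relevant hit at rank 1
]
mrr = mean_reciprocal_rank(runs)
print(mrr, precision_at_k(runs[1][0], runs[1][1], k=2), recall_at_k(runs[1][0], runs[1][1], k=2))
```

Tracking these numbers separately from answer quality shows whether a regression comes from retrieval or from generation.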

Future Outlook

RAG is evolving rapidly. Agentic RAG systems use multiple retrieval steps, self-correction, and tool use to answer complex questions. Multimodal RAG retrieves images, tables, and charts alongside text. Graph RAG combines vector search with knowledge graphs for better reasoning over complex domains. Frameworks such as LangChain are moving toward research agents that combine RAG with web search, code execution, and multi-step reasoning.

The core pattern of retrieve-then-generate will remain fundamental, but the sophistication of retrieval strategies, chunking methods, and evaluation frameworks will continue to increase.

Community Resources and Further Learning

The technology landscape evolves rapidly, making continuous learning essential for maintaining expertise. Building a systematic approach to staying current with developments in your technology stack ensures you can leverage new features and avoid deprecated patterns.

Curated Learning Pathways

Rather than consuming content randomly, create structured learning pathways aligned with your current projects and career goals. Start with official documentation and specification documents, which provide the most accurate and comprehensive information. Follow this with hands-on tutorials and workshops that reinforce concepts through practical application.

Technical blogs from framework maintainers and core team members often provide deeper insights into design decisions and upcoming features. Subscribe to the official blogs of your primary frameworks and libraries to stay ahead of breaking changes and deprecation timelines.

Contributing to Open Source

Contributing to open-source projects in your technology stack provides unparalleled learning opportunities. Start with documentation improvements and bug reports, then progress to fixing small issues tagged as "good first issue" in your favorite projects. This direct engagement with maintainers and the codebase accelerates your understanding far beyond what passive learning can achieve.

# Setting up for contribution
git clone https://github.com/project/repository.git
cd repository
git checkout -b fix/issue-description
 
# Run the project's contribution setup
npm run setup:dev
npm run test  # Ensure tests pass before making changes
 
# Make your changes, then run the full test suite
npm run test:full
npm run lint
npm run build
 
# Submit your contribution
git add -A
git commit -m "fix: description of the fix
 
Closes #1234"
git push origin fix/issue-description

Building a Technical Knowledge Base

Maintain a personal knowledge base that captures insights, solutions, and patterns you discover during your work. Tools like Obsidian, Notion, or even a simple Markdown repository can serve as an external memory that grows more valuable over time.

Organize your notes by topic rather than chronologically, and include code examples, links to relevant documentation, and explanations of why certain approaches work better than others. When you encounter a particularly insightful article or conference talk, write a summary that captures the key takeaways and how they apply to your current projects.

Follow key conferences and their published talks to stay informed about emerging patterns and best practices. Many conferences publish recorded talks on YouTube within weeks of the event, making world-class technical content freely accessible.

Join relevant Discord servers, Slack communities, and forums where practitioners discuss real-world challenges and solutions. These communities provide early warning about emerging issues and access to collective wisdom that isn't available through formal documentation.

Mentorship and Knowledge Sharing

Teaching others is one of the most effective ways to deepen your own understanding. Consider writing technical blog posts, giving talks at local meetups, or mentoring junior developers. The process of explaining concepts to others forces you to organize your knowledge and identify gaps in your understanding.

Pair programming sessions with colleagues of different experience levels create mutual learning opportunities. Senior developers gain fresh perspectives on problems they've solved the same way for years, while junior developers benefit from exposure to production-grade thinking and decision-making processes.

Conclusion

RAG is the most practical approach to giving LLMs access to your specific knowledge without fine-tuning. The key is building a high-quality pipeline: good chunking, smart retrieval, and faithful generation.

Key takeaways:

  1. Chunking quality matters most — experiment with size, overlap, and splitting strategy
  2. Use hybrid search — combine vector similarity with keyword matching for best results
  3. Implement metadata filtering — store and filter by source, date, and category
  4. Evaluate retrieval and generation separately — they have different failure modes
  5. Cache and optimize — reduce latency and costs through intelligent caching
  6. Build guardrails — validate answers, prevent injection, and handle edge cases
  7. Monitor quality continuously — RAG systems degrade as the knowledge base changes

RAG bridges the gap between general-purpose LLMs and domain-specific knowledge. Master it, and you can build AI applications that are genuinely useful for real-world tasks.