AI Retrieval-Augmented Generation (RAG) Architecture

Design production RAG: chunking, embedding, retrieval, reranking, and generation.

Tags: RAG, AI, LLM, Architecture

By MinhVo

Introduction

Retrieval-Augmented Generation (RAG) is the architecture that makes LLMs useful for enterprise applications. While LLMs have remarkable general knowledge, they can hallucinate when asked about specific company data, recent events, or domain-specific information that was not in their training data. RAG addresses this by retrieving relevant documents from your knowledge base and providing them as context to the LLM, grounding its responses in factual, up-to-date information. The result is an AI system that combines the language understanding of LLMs with the accuracy of your own data.

[Figure: RAG architecture overview]

RAG has become the standard architecture for enterprise AI applications — from customer support chatbots that reference product documentation, to internal knowledge assistants that search company wikis, to legal research tools that query case databases. The reason is simple: fine-tuning an LLM on your data is expensive, slow, and must be repeated whenever your data changes. RAG achieves similar results at a fraction of the cost, with real-time updates and full control over the knowledge base.

Building a basic RAG pipeline is straightforward — embed your documents, store them in a vector database, retrieve relevant chunks for each query, and pass them to the LLM. But building a production-grade RAG system that delivers accurate, relevant, and fast responses requires solving several hard problems: optimal chunking strategies, hybrid retrieval, re-ranking, context window management, and evaluation. This guide covers the complete architecture, from document ingestion to response generation, with practical implementation patterns for production systems.

Understanding RAG: Core Concepts

The RAG Pipeline

A RAG pipeline has two phases: indexing (offline) and querying (online).

Indexing Phase:

  1. Document Loading: Parse documents from various sources (PDFs, websites, databases, APIs)
  2. Chunking: Split documents into smaller, semantically meaningful chunks
  3. Embedding: Convert each chunk into a vector representation using an embedding model
  4. Indexing: Store vectors and metadata in a vector database

Querying Phase:

  1. Query Embedding: Convert the user's question into a vector
  2. Retrieval: Find the most similar document chunks using vector search
  3. Re-ranking: Reorder retrieved chunks by relevance using a more accurate model
  4. Generation: Pass the top chunks as context to the LLM with the user's question
  5. Response: Return the LLM's answer, ideally with source citations

Chunking Strategies

Chunking is the most impactful design decision in a RAG system. The goal is to create chunks that are:

  • Self-contained: Each chunk should make sense on its own
  • Right-sized: Large enough to contain relevant information, small enough to fit in context
  • Semantically coherent: Chunks should respect document structure (paragraphs, sections, code blocks)

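Common strategies include fixed-size chunking (split every N characters), recursive chunking (split at paragraph boundaries first, then sentence boundaries, then words, until each piece fits the size limit), and semantic chunking (split when the topic shifts). Semantic chunking often produces the most coherent chunks, but it is more complex to implement and typically costs more at indexing time, so measure it against recursive chunking on your own data before committing to it.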
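The recursive strategy can be sketched without any library. This is a minimal illustration under stated assumptions, not the splitter used later in this guide: `recursiveChunk` and its default separator list are hypothetical names chosen for the example, and the sketch produces no overlap between chunks.

```typescript
// Hypothetical helper: split on the coarsest separator first and only fall
// back to finer separators for pieces that are still too long.
function recursiveChunk(
  text: string,
  maxLen: number,
  separators: string[] = ["\n\n", "\n", ". ", " "],
): string[] {
  if (text.length <= maxLen) return text.trim() ? [text.trim()] : [];

  const [sep, ...rest] = separators;
  if (sep === undefined) {
    // No separators left: fall back to fixed-size slices.
    const slices: string[] = [];
    for (let i = 0; i < text.length; i += maxLen) slices.push(text.slice(i, i + maxLen));
    return slices;
  }

  const chunks: string[] = [];
  let current = "";
  for (const part of text.split(sep)) {
    const candidate = current ? current + sep + part : part;
    if (candidate.length <= maxLen) {
      current = candidate; // keep merging small parts into one chunk
      continue;
    }
    if (current.trim()) chunks.push(current.trim());
    if (part.length > maxLen) {
      // This part alone is too long: retry with the finer separators.
      chunks.push(...recursiveChunk(part, maxLen, rest));
      current = "";
    } else {
      current = part;
    }
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}
```

Every returned chunk is at most `maxLen` characters, and paragraph boundaries are preferred over sentence or word boundaries whenever a paragraph fits.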

Embedding Models

The embedding model determines how well your system understands semantic similarity. Key factors: dimensionality (higher = more expressive but slower), domain specialization (general vs. domain-specific), and multilingual support. OpenAI's text-embedding-3-large, Cohere's embed-v3, and BGE-M3 are top choices for production RAG.
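"Semantic similarity" between two embeddings is most commonly measured with cosine similarity. The function below is a small sketch of that calculation; the name `cosineSimilarity` is chosen for illustration, and a vector database normally computes this for you.

```typescript
// Cosine similarity: dot product divided by the product of the vector norms.
// Returns a value in [-1, 1]; higher means the vectors point in a more similar direction.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same dimensionality");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction; treat it as dissimilar to everything.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Because the loop runs once per dimension, the cost of each comparison grows linearly with dimensionality, which is one reason higher-dimensional models are slower to search.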

Retrieval and Re-ranking

Retrieval finds candidate chunks; re-ranking orders them by relevance. A two-stage approach — fast vector retrieval of 20-50 candidates followed by accurate re-ranking of the top results — produces significantly better results than vector search alone.
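The second stage can be sketched as a function that takes the first-stage candidates and a pluggable relevance scorer. Everything named here (`Candidate`, `RelevanceScorer`, `rerankCandidates`, `termOverlapScorer`) is hypothetical. The scorer is kept synchronous for brevity, whereas a real cross-encoder or hosted rerank API would be asynchronous, and the term-overlap scorer is only a stand-in so the example runs without a model.

```typescript
interface Candidate {
  id: string;
  text: string;
  vectorScore: number; // similarity score from the first-stage vector search
}

// Hypothetical scorer signature: higher return value means more relevant.
type RelevanceScorer = (query: string, text: string) => number;

// Re-score every candidate with the (slower, more accurate) scorer and keep the best topN.
function rerankCandidates(
  query: string,
  candidates: Candidate[],
  scorer: RelevanceScorer,
  topN: number,
): Candidate[] {
  return candidates
    .map((candidate) => ({ candidate, relevance: scorer(query, candidate.text) }))
    .sort((a, b) => b.relevance - a.relevance)
    .slice(0, topN)
    .map((entry) => entry.candidate);
}

// Toy scorer for demonstration only: fraction of query terms that appear in the text.
const termOverlapScorer: RelevanceScorer = (query, text) => {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  const haystack = text.toLowerCase();
  if (terms.length === 0) return 0;
  return terms.filter((term) => haystack.includes(term)).length / terms.length;
};
```

The point of the split is cost: the scorer runs only on the 20-50 candidates from the first stage, not on the whole index.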

[Figure: RAG pipeline stages]

Architecture and Design Patterns

The Naive RAG Pattern

The simplest RAG: embed documents, retrieve top-k similar chunks, pass to LLM. This works for prototyping but has limitations: poor chunking can split relevant information across chunks, and simple similarity search misses nuanced relevance.

The Advanced RAG Pattern

Extends naive RAG with: pre-retrieval query rewriting (making queries more effective), hybrid search (semantic + keyword), re-ranking (using cross-encoders), and post-retrieval context compression (removing irrelevant sentences from retrieved chunks).
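One common way to combine the semantic and keyword result lists in hybrid search is Reciprocal Rank Fusion (RRF). The sketch below is one option among several, not necessarily what any given vector database does internally; `reciprocalRankFusion` is an illustrative name, and `k = 60` is the constant conventionally used with RRF.

```typescript
// Reciprocal Rank Fusion: each list contributes 1 / (k + rank) for every
// document it contains, and documents are ordered by their summed score.
function reciprocalRankFusion(
  rankings: string[][],
  k: number = 60,
): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      // index is 0-based, so the rank is index + 1.
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }
  return Array.from(scores.entries())
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// Usage: document IDs ordered best-first by each retriever.
const semanticRanking = ["a", "b", "c"];
const keywordRanking = ["b", "d", "a"];
const fused = reciprocalRankFusion([semanticRanking, keywordRanking]);
```

RRF only needs ranks, so it sidesteps the problem that cosine similarities and BM25 scores live on different scales.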

The Modular RAG Pattern

Decomposes the RAG pipeline into interchangeable modules: different chunking strategies, multiple retrieval methods, various re-rankers, and different generation strategies. This enables A/B testing and optimization of each component independently.
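In TypeScript, interchangeable modules map naturally onto small interfaces. The sketch below is an assumption about how such a design could look, not a prescribed structure: all interface and function names are hypothetical, and the methods are synchronous only to keep the example short (real modules would return Promises).

```typescript
interface RetrievedChunk {
  id: string;
  text: string;
  score: number;
}

// Each stage is an interface, so implementations can be swapped independently.
interface Retriever {
  retrieve(query: string, topK: number): RetrievedChunk[];
}
interface Reranker {
  rerank(query: string, chunks: RetrievedChunk[]): RetrievedChunk[];
}
interface AnswerGenerator {
  generate(query: string, context: RetrievedChunk[]): string;
}

// Wire the stages together; the pipeline knows nothing about how each stage works.
function createPipeline(
  retriever: Retriever,
  reranker: Reranker,
  generator: AnswerGenerator,
) {
  return (query: string, topK: number = 20, topN: number = 5): string => {
    const candidates = retriever.retrieve(query, topK);
    const ranked = reranker.rerank(query, candidates).slice(0, topN);
    return generator.generate(query, ranked);
  };
}
```

With this shape, an A/B test is two calls to `createPipeline` that differ in a single argument.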

The Agentic RAG Pattern

Uses an AI agent to orchestrate the RAG pipeline — deciding when to retrieve, which sources to query, how to reformulate queries, and when to synthesize information from multiple retrievals. This handles complex, multi-hop questions that require information from multiple documents.
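The control flow of an agentic loop can be sketched independently of any LLM. In the example below, `Planner` stands in for the model call that decides the next step and `Search` stands in for retrieval. Both are hypothetical signatures, kept synchronous for brevity, and the step budget is an assumption added to guarantee termination.

```typescript
// The planner either asks for another retrieval or commits to an answer.
type AgentStep =
  | { action: "retrieve"; query: string }
  | { action: "answer"; text: string };

// Hypothetical signatures: in a real system both would call out to an LLM / vector store.
type Planner = (question: string, evidence: string[]) => AgentStep;
type Search = (query: string) => string[];

function runAgenticRag(
  question: string,
  plan: Planner,
  search: Search,
  maxSteps: number = 4,
): string {
  const evidence: string[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const next = plan(question, evidence);
    if (next.action === "answer") return next.text;
    // Accumulate evidence so later planning steps can build on earlier retrievals.
    evidence.push(...search(next.query));
  }
  // Step budget exhausted: abstain explicitly rather than guess.
  return "I could not find enough information to answer.";
}
```

A multi-hop question is handled by the planner issuing a second retrieval whose query depends on what the first one returned.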

Step-by-Step Implementation

Building a Complete RAG Pipeline

import OpenAI from 'openai';
import { Pinecone } from '@pinecone-database/pinecone';
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';
 
const openai = new OpenAI();
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
 
// Document ingestion pipeline
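// Pinecone metadata values must be strings, numbers, booleans, or string arrays (no nested objects).
async function ingestDocuments(documents: { id: string; content: string; metadata: Record<string, string | number | boolean | string[]> }[]) {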
  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 500,
    chunkOverlap: 50,
    separators: ['\n\n', '\n', '. ', ' ', ''],
  });
 
  const index = pinecone.Index('rag-index');
 
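  for (const doc of documents) {
    const chunks = await splitter.createDocuments([doc.content]);
    if (chunks.length === 0) continue; // nothing to index for an empty document

    // Embed all chunks of this document in one request instead of one request per chunk.
    // Very large documents may still need to be split across several requests.
    const embeddings = await openai.embeddings.create({
      model: 'text-embedding-3-small',
      input: chunks.map(chunk => chunk.pageContent),
    });

    // Each returned item carries the index of its input; sort so vectors line up with chunks.
    const vectors = embeddings.data
      .slice()
      .sort((a, b) => a.index - b.index)
      .map((item, i) => ({
        id: `${doc.id}_chunk_${i}`,
        values: item.embedding,
        metadata: {
          ...doc.metadata,
          content: chunks[i].pageContent,
          docId: doc.id,
          chunkIndex: i,
        },
      }));

    // Upsert in batches of 100 to keep each request small
    for (let i = 0; i < vectors.length; i += 100) {
      await index.upsert(vectors.slice(i, i + 100));
    }
  }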
}
 
// Query pipeline with re-ranking
async function query(question: string, options: { topK?: number; rerankTopN?: number } = {}) {
  const { topK = 20, rerankTopN = 5 } = options;
 
  // Step 1: Embed the query
  const queryEmbedding = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: question,
  });
 
  // Step 2: Retrieve candidates
  const index = pinecone.Index('rag-index');
  const results = await index.query({
    vector: queryEmbedding.data[0].embedding,
    topK,
    includeMetadata: true,
  });
 
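  // Step 3: Placeholder re-rank. Pinecone already returns matches ordered by similarity,
  // so this sort only restates that order; substitute a cross-encoder score in production.
  const reranked = results.matches
    .map(match => ({
      content: String(match.metadata?.content ?? ''),
      score: match.score ?? 0,
      metadata: match.metadata,
    }))
    .filter(r => r.content.length > 0) // skip records stored without chunk text
    .sort((a, b) => b.score - a.score)
    .slice(0, rerankTopN);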
 
  // Step 4: Generate answer
  const context = reranked.map((r, i) => `[${i + 1}] ${r.content}`).join('\n\n');
  
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      {
        role: 'system',
        content: `Answer the user's question based ONLY on the provided context. If the context doesn't contain enough information, say so. Always cite your sources using [number] notation.`
      },
      {
        role: 'user',
        content: `Context:\n${context}\n\nQuestion: ${question}`
      }
    ],
  });
 
  return {
    answer: response.choices[0].message.content,
    sources: reranked.map(r => ({
      content: r.content.slice(0, 200) + '...',
      metadata: r.metadata,
      score: r.score,
    })),
  };
}

Query Rewriting for Better Retrieval

async function rewriteQuery(originalQuery: string): Promise<string[]> {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      {
        role: 'system',
        content: `Generate 3 different versions of the user's query to improve document retrieval. 
Each version should use different terminology and phrasing while preserving the original intent.
Return as JSON: { "queries": ["query1", "query2", "query3"] }`
      },
      { role: 'user', content: originalQuery }
    ],
    response_format: { type: 'json_object' },
  });
 
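  const parsed = JSON.parse(response.choices[0].message.content || '{}');
  // Fall back to the original query alone if the model omits or malforms the list.
  const queries: string[] = Array.isArray(parsed.queries) ? parsed.queries : [];
  return [originalQuery, ...queries];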
}
 
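// Multi-query retrieval.
// Note: query() also runs answer generation, so each variant costs a chat completion here
// even though only the sources are used; a retrieval-only helper would avoid that in production.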
async function multiQueryRetrieval(question: string) {
  const queries = await rewriteQuery(question);
  
  const allResults = await Promise.all(
    queries.map(q => query(q, { topK: 10, rerankTopN: 5 }))
  );
 
  // Deduplicate and merge results
  const seen = new Set<string>();
  const merged = allResults.flatMap(r => r.sources)
    .filter(s => {
      const key = s.content.slice(0, 100);
      if (seen.has(key)) return false;
      seen.add(key);
      return true;
    })
    .sort((a, b) => b.score - a.score)
    .slice(0, 5);
 
  return merged;
}

RAG API Service

import express from 'express';
 
const app = express();
app.use(express.json());
 
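app.post('/api/rag/query', async (req, res) => {
  try {
    const { question, options } = req.body ?? {};
    // Reject malformed requests before spending an embedding or LLM call.
    if (typeof question !== 'string' || question.trim() === '') {
      res.status(400).json({ error: 'question must be a non-empty string' });
      return;
    }
    const result = await query(question, options);
    res.json(result);
  } catch (err) {
    console.error('RAG query failed', err);
    res.status(500).json({ error: 'Query failed' });
  }
});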
 
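app.post('/api/rag/ingest', async (req, res) => {
  try {
    const { documents } = req.body ?? {};
    // Ingestion expects an array of { id, content, metadata } objects.
    if (!Array.isArray(documents)) {
      res.status(400).json({ error: 'documents must be an array' });
      return;
    }
    await ingestDocuments(documents);
    res.json({ status: 'success', documentsProcessed: documents.length });
  } catch (err) {
    console.error('RAG ingestion failed', err);
    res.status(500).json({ error: 'Ingestion failed' });
  }
});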
 
app.listen(3000);

[Figure: RAG implementation workflow]

Real-World Use Cases

Customer Support Chatbots

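Build chatbots that answer customer questions using product documentation, FAQs, and support articles. The RAG system retrieves relevant support content and generates accurate, helpful responses with source citations. Deflecting routine questions this way can reduce support ticket volume; the 40-60% range sometimes quoted depends heavily on how well the knowledge base covers the questions customers actually ask.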

Internal Knowledge Assistants

Create AI assistants that search across company wikis, documentation, Slack messages, and code repositories. Employees ask natural language questions and get answers grounded in company knowledge, reducing time spent searching for information.

Legal Research

Search legal databases, contracts, and regulatory documents using natural language queries. RAG enables lawyers to quickly find relevant precedents, clauses, and regulations without manually reviewing thousands of documents.

Technical Documentation Q&A

Build interactive documentation where developers ask questions and get answers from your technical docs. This is more natural than searching documentation and produces more contextual answers than keyword search.

Best Practices for Production

  1. Invest in chunking quality — Chunking is the highest-leverage optimization. Test different strategies (fixed-size, recursive, semantic) with your actual data and measure retrieval quality.

  2. Use hybrid search — Combine semantic search with keyword (BM25) search. Semantic catches meaning; keyword catches exact terms like product names and error codes.

  3. Add re-ranking — Use a cross-encoder to re-rank the top 20-50 results from vector search. This dramatically improves relevance at minimal latency cost.

  4. Implement query rewriting — Generate multiple query variations to improve recall. Different phrasings retrieve different relevant documents.

  5. Ground responses in sources — Always cite the source documents used to generate the answer. This builds trust and enables verification.

  6. Handle "I don't know" gracefully — When the knowledge base doesn't contain relevant information, the system should say so rather than hallucinate an answer. Set similarity thresholds and instruct the LLM to acknowledge knowledge gaps.

  7. Monitor and evaluate — Track retrieval quality (are the right documents being retrieved?), generation quality (are answers accurate?), and user satisfaction. Use this data to continuously improve.

  8. Optimize for latency — Users expect sub-3-second responses. Optimize by caching embeddings, using fast vector databases, and streaming LLM responses.
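The similarity threshold from practice 6 can be sketched as a guard in front of generation. The names below are hypothetical, the `generate` callback stands in for the LLM call, and the threshold value is an assumption that has to be tuned per embedding model and dataset.

```typescript
interface ScoredChunk {
  content: string;
  score: number; // similarity score returned by the vector search
}

const NO_ANSWER =
  "I don't have enough information in the knowledge base to answer that.";

// Keep only chunks at or above the threshold, best first.
function selectContext(
  chunks: ScoredChunk[],
  minScore: number,
  maxChunks: number,
): ScoredChunk[] {
  return chunks
    .filter((chunk) => chunk.score >= minScore)
    .sort((a, b) => b.score - a.score)
    .slice(0, maxChunks);
}

// Abstain without calling the LLM when nothing clears the threshold.
function answerOrAbstain(
  chunks: ScoredChunk[],
  minScore: number,
  generate: (context: ScoredChunk[]) => string,
): string {
  const context = selectContext(chunks, minScore, 5);
  return context.length === 0 ? NO_ANSWER : generate(context);
}
```

Abstaining before generation also saves the cost of an LLM call for questions the knowledge base cannot answer.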

Common Pitfalls and Solutions

| Pitfall | Impact | Solution |
| --- | --- | --- |
| Poor chunking strategy | Lost context, irrelevant retrieval | Test multiple strategies, use semantic chunking |
| No re-ranking | Mediocre retrieval quality | Add cross-encoder re-ranking |
| Ignoring keyword search | Missing exact matches | Use hybrid search (semantic + BM25) |
| No query rewriting | Low recall for paraphrased queries | Generate multiple query variations |
| Hallucinated answers | Incorrect information presented as fact | Ground in context, cite sources, set thresholds |
| Stale index | Outdated information returned | Implement incremental indexing on data changes |
| No evaluation framework | Can't measure or improve quality | Build automated evaluation with golden test set |
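For the stale-index pitfall, incremental indexing usually comes down to detecting which documents changed since the last run. The sketch below compares content fingerprints. `planIndexUpdate` and the stored-hash map are hypothetical, and the FNV-1a-style hash is used only to keep the example dependency-free (a cryptographic hash such as SHA-256 is the safer choice in production).

```typescript
// Small non-cryptographic fingerprint (FNV-1a style over UTF-16 code units).
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16);
}

interface SourceDoc {
  id: string;
  content: string;
}

// Compare current documents against the fingerprints recorded at the last indexing run.
function planIndexUpdate(
  docs: SourceDoc[],
  indexedHashes: Map<string, string>,
): { toUpsert: SourceDoc[]; toDelete: string[] } {
  const toUpsert: SourceDoc[] = [];
  const seen = new Set<string>();
  for (const doc of docs) {
    seen.add(doc.id);
    // New documents have no stored hash; changed documents have a different one.
    if (indexedHashes.get(doc.id) !== fnv1a(doc.content)) toUpsert.push(doc);
  }
  // Anything indexed earlier but absent from the source should be removed.
  const toDelete = Array.from(indexedHashes.keys()).filter((id) => !seen.has(id));
  return { toUpsert, toDelete };
}
```

Only the documents in `toUpsert` need to be re-chunked and re-embedded, which keeps the cost of a refresh proportional to what changed.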

Evaluating RAG Quality

Build an evaluation framework with three metrics: retrieval precision (are the retrieved documents relevant?), answer accuracy (is the generated answer correct?), and faithfulness (does the answer actually come from the retrieved context, not hallucinated?). Use a golden test set of 50-100 question-answer pairs to measure these metrics systematically.

Performance Optimization

Optimize RAG latency at each stage: use fast embedding models (text-embedding-3-small), use an approximate nearest-neighbor index such as HNSW so that vector search stays fast as the index grows, cache frequently retrieved chunks, and stream LLM responses for perceived speed.

For high-volume deployments, implement connection pooling for the vector database, batch embedding requests, and use async processing for document ingestion.
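Caching is the easiest of these optimizations to illustrate. The sketch below wraps an embedding function in a small least-recently-used cache so that repeated queries skip the embedding call. `LruCache` and `withEmbeddingCache` are hypothetical names, and the embedding function is synchronous here only to keep the example short.

```typescript
// Minimal LRU cache built on Map, which iterates keys in insertion order.
class LruCache<V> {
  private readonly entries = new Map<string, V>();
  private readonly capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  get(key: string): V | undefined {
    const value = this.entries.get(key);
    if (value === undefined) return undefined;
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.capacity) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }
}

// Wrap an embedding function so identical (trimmed) inputs are embedded only once.
function withEmbeddingCache(
  embed: (text: string) => number[],
  cache: LruCache<number[]>,
): (text: string) => number[] {
  return (text: string): number[] => {
    const key = text.trim();
    const hit = cache.get(key);
    if (hit !== undefined) return hit;
    const vector = embed(key);
    cache.set(key, vector);
    return vector;
  };
}
```

The same wrapper shape works for caching retrieved chunks, keyed by the query text or by its embedding.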

Comparison of RAG Frameworks

| Framework | Complexity | Flexibility / Performance | Best For |
| --- | --- | --- | --- |
| LangChain | Medium | ★★★★★★★★★ | Prototyping, complex chains |
| LlamaIndex | Medium | ★★★★★★★★★ | Document-focused RAG |
| Haystack | High | ★★★★★★★★★★ | Production pipelines |
| Custom | High | ★★★★★★★★★★ | Full control, optimized |

Advanced Patterns

Self-RAG

The model decides whether to retrieve, evaluates the retrieved documents for relevance, and critiques its own answers for faithfulness. This self-reflective approach produces more accurate answers by catching retrieval failures and hallucinations.

Graph RAG

Combine vector retrieval with knowledge graph traversal. Store entities and relationships in a graph database, and use graph queries to find related concepts that vector search alone might miss.

Multi-Modal RAG

Extend RAG to handle images, tables, and diagrams alongside text. Use multi-modal embedding models to index visual content and multi-modal LLMs to generate answers that reference both text and images.

Future Outlook

RAG is evolving toward agentic retrieval — systems that dynamically decide what to retrieve, when to retrieve, and how to synthesize information from multiple sources. Instead of a fixed pipeline, the agent orchestrates retrieval based on the question's complexity, using different strategies for simple factual queries vs. complex multi-hop reasoning.

The most significant trend is real-time RAG — indexing and retrieving information in real-time as it's created, enabling AI systems that have access to the very latest information. This will transform applications in news, finance, and customer support where information freshness is critical.

RAG Evaluation Metrics

Evaluate RAG system quality using retrieval metrics (precision, recall, MRR) and generation metrics (faithfulness, relevance, completeness). The RAGAS framework provides automated evaluation of RAG pipelines using LLM-as-a-judge approaches. Track retrieval precision to ensure the vector store returns relevant documents, and measure answer faithfulness to verify that generated responses are grounded in retrieved context rather than hallucinated. A/B test different chunking strategies, embedding models, and prompt templates to optimize end-to-end RAG performance for your specific use case.
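The retrieval metrics named above are simple to compute once each test question has a set of known-relevant document IDs. The functions below are a minimal sketch with illustrative names; they cover only the retrieval side, not the LLM-judged generation metrics.

```typescript
// Precision@k: share of the top-k results that are relevant.
function precisionAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  if (k <= 0) return 0;
  const hits = retrieved.slice(0, k).filter((id) => relevant.has(id)).length;
  return hits / k;
}

// Recall@k: share of all relevant documents that appear in the top-k results.
function recallAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  if (relevant.size === 0) return 0;
  const hits = retrieved.slice(0, k).filter((id) => relevant.has(id)).length;
  return hits / relevant.size;
}

// Reciprocal rank: 1 / position of the first relevant result, or 0 if none is retrieved.
function reciprocalRank(retrieved: string[], relevant: Set<string>): number {
  const index = retrieved.findIndex((id) => relevant.has(id));
  return index === -1 ? 0 : 1 / (index + 1);
}

// MRR: the mean of the reciprocal ranks over all test questions.
function meanReciprocalRank(
  runs: { retrieved: string[]; relevant: Set<string> }[],
): number {
  if (runs.length === 0) return 0;
  const total = runs.reduce(
    (sum, run) => sum + reciprocalRank(run.retrieved, run.relevant),
    0,
  );
  return total / runs.length;
}
```

Running these over a golden test set before and after each change turns chunking or retrieval tweaks into measurable comparisons.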

Architecture Decision Records

When evaluating architectural choices for your project, documenting your decision-making process through Architecture Decision Records (ADRs) provides invaluable context for future team members and stakeholders. Each ADR captures the context, decision, and consequences of a specific architectural choice.

Creating Effective ADRs

An ADR should include the date of the decision, the status (proposed, accepted, deprecated, or superseded), the context that motivated the decision, the decision itself, and the expected consequences both positive and negative. This structured approach ensures that decisions are traceable and reversible when circumstances change.

# ADR-001: Choose React for Frontend Framework
 
## Status: Accepted
 
## Context
We need a frontend framework that supports component-based architecture,
has a large ecosystem, and provides good TypeScript support.
 
## Decision
We will use React 18+ with TypeScript for all new frontend projects.
 
## Consequences
- Large talent pool available for hiring
- Mature ecosystem with extensive third-party libraries
- Strong TypeScript integration
- Requires additional libraries for routing and state management

Decision Matrix for Technology Selection

Create a weighted decision matrix when comparing multiple options. List your evaluation criteria (performance, learning curve, ecosystem maturity, community support, long-term viability) and assign weights based on your project priorities. Score each option on a scale of 1-5 for each criterion, then calculate weighted totals.

This systematic approach removes emotion from technology decisions and provides a defensible rationale when stakeholders question your choices. Document the matrix alongside your ADR so future teams understand not just what was chosen, but why alternatives were rejected.
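The weighted total is a sum of weight times score over the criteria. As a small worked sketch, with made-up criteria, weights, and scores that serve only as an illustration:

```typescript
type CriterionScores = Record<string, number>;

// Weighted total = sum over criteria of (weight * score).
function weightedTotal(weights: CriterionScores, scores: CriterionScores): number {
  let total = 0;
  for (const criterion of Object.keys(weights)) {
    const score = scores[criterion];
    if (score === undefined) {
      throw new Error(`Missing score for criterion "${criterion}"`);
    }
    total += weights[criterion] * score;
  }
  return total;
}

// Hypothetical example: weights on any consistent scale, scores from 1 to 5.
const weights: CriterionScores = { performance: 4, learningCurve: 2, ecosystem: 4 };
const optionA: CriterionScores = { performance: 4, learningCurve: 3, ecosystem: 5 };
const optionB: CriterionScores = { performance: 5, learningCurve: 2, ecosystem: 3 };

const totalA = weightedTotal(weights, optionA); // 4*4 + 2*3 + 4*5 = 42
const totalB = weightedTotal(weights, optionB); // 4*5 + 2*2 + 4*3 = 36
```

Here option A wins despite the lower performance score, because ecosystem carries the same weight as performance.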

Reversibility and Migration Paths

Every architectural decision should include a migration path in case the decision needs to be reversed. Consider the cost of changing course at six months, twelve months, and two years. Decisions with low reversal costs can be made more aggressively, while irreversible decisions warrant extended evaluation periods and proof-of-concept implementations.

For example, choosing a CSS-in-JS library has a relatively low reversal cost since styles can be migrated incrementally component by component. However, choosing a database technology has a high reversal cost due to data migration complexity and potential schema changes throughout the codebase.

Community Resources and Further Learning

The technology landscape evolves rapidly, making continuous learning essential for maintaining expertise. Building a systematic approach to staying current with developments in your technology stack ensures you can leverage new features and avoid deprecated patterns.

Curated Learning Pathways

Rather than consuming content randomly, create structured learning pathways aligned with your current projects and career goals. Start with official documentation and specification documents, which provide the most accurate and comprehensive information. Follow this with hands-on tutorials and workshops that reinforce concepts through practical application.

Technical blogs from framework maintainers and core team members often provide deeper insights into design decisions and upcoming features. Subscribe to the official blogs of your primary frameworks and libraries to stay ahead of breaking changes and deprecation timelines.

Contributing to Open Source

Contributing to open-source projects in your technology stack provides unparalleled learning opportunities. Start with documentation improvements and bug reports, then progress to fixing small issues tagged as "good first issue" in your favorite projects. This direct engagement with maintainers and the codebase accelerates your understanding far beyond what passive learning can achieve.

# Setting up for contribution
git clone https://github.com/project/repository.git
cd repository
git checkout -b fix/issue-description
 
# Run the project's contribution setup
npm run setup:dev
npm run test  # Ensure tests pass before making changes
 
# Make your changes, then run the full test suite
npm run test:full
npm run lint
npm run build
 
# Submit your contribution
git add -A
git commit -m "fix: description of the fix
 
Closes #1234"
git push origin fix/issue-description

Building a Technical Knowledge Base

Maintain a personal knowledge base that captures insights, solutions, and patterns you discover during your work. Tools like Obsidian, Notion, or even a simple Markdown repository can serve as an external memory that grows more valuable over time.

Organize your notes by topic rather than chronologically, and include code examples, links to relevant documentation, and explanations of why certain approaches work better than others. When you encounter a particularly insightful article or conference talk, write a summary that captures the key takeaways and how they apply to your current projects.

Follow key conferences and their published talks to stay informed about emerging patterns and best practices. Many conferences publish recorded talks on YouTube within weeks of the event, making world-class technical content freely accessible.

Join relevant Discord servers, Slack communities, and forums where practitioners discuss real-world challenges and solutions. These communities provide early warning about emerging issues and access to collective wisdom that isn't available through formal documentation.

Mentorship and Knowledge Sharing

Teaching others is one of the most effective ways to deepen your own understanding. Consider writing technical blog posts, giving talks at local meetups, or mentoring junior developers. The process of explaining concepts to others forces you to organize your knowledge and identify gaps in your understanding.

Pair programming sessions with colleagues of different experience levels create mutual learning opportunities. Senior developers gain fresh perspectives on problems they've solved the same way for years, while junior developers benefit from exposure to production-grade thinking and decision-making processes.

Conclusion

RAG is the bridge between general-purpose LLMs and enterprise-specific AI applications. By grounding LLM responses in your own data, RAG reduces hallucinations, provides up-to-date information, and enables AI systems that answer from your domain's own sources.

Key takeaways:

  1. RAG combines retrieval (finding relevant documents) with generation (producing answers) to ground LLM responses in factual data
  2. Chunking strategy is the highest-leverage optimization — test multiple approaches with your data
  3. Use hybrid search (semantic + keyword) for best retrieval quality
  4. Add re-ranking to dramatically improve relevance of retrieved documents
  5. Implement query rewriting to improve recall for diverse phrasings
  6. Always cite sources and handle "I don't know" cases gracefully
  7. Build an evaluation framework to measure and improve RAG quality continuously

Start by building a simple RAG pipeline over your documentation using OpenAI embeddings and Pinecone. Test it with real user questions, measure retrieval quality, and iterate on your chunking and retrieval strategies. Once the basic pipeline works, add hybrid search, re-ranking, and query rewriting to reach production quality.