Introduction
Retrieval-Augmented Generation (RAG) has become the dominant pattern for building AI applications that need to answer questions about specific knowledge bases. Instead of relying solely on the LLM's training data (which has a knowledge cutoff and cannot access private information), RAG retrieves relevant documents at query time and injects them into the prompt as context. This gives the LLM access to up-to-date, domain-specific, and private information without the cost and complexity of fine-tuning.
Building a production RAG application involves several interconnected systems: a document ingestion pipeline that loads, chunks, and embeds your knowledge base; a vector database that stores and retrieves embeddings efficiently; a retrieval system that finds the most relevant chunks for a given query; a generation system that constructs prompts with retrieved context and streams responses; and an evaluation framework that measures retrieval quality and answer accuracy.
In this comprehensive guide, we will build a complete RAG application using LangChain for orchestration, Next.js for the web framework, and OpenAI for embeddings and generation. We will cover every stage of the pipeline from document loading to streaming responses, with production patterns for chunking strategies, hybrid search, re-ranking, and observability.
Understanding RAG: Core Concepts
The RAG Pipeline
A RAG application has two distinct phases: indexing (offline) and querying (online). During indexing, documents are loaded from their source (files, websites, databases), split into chunks, converted to vector embeddings, and stored in a vector database. During querying, the user's question is converted to an embedding, similar chunks are retrieved from the vector database, and the LLM generates an answer using the retrieved context.
The quality of a RAG application depends almost entirely on the quality of retrieval. If the retrieved chunks are not relevant to the question, the LLM will generate a poor answer regardless of how capable it is. This makes chunking strategy, embedding model choice, and retrieval algorithm the most important engineering decisions in a RAG system.
Chunking Strategies
Chunking is the process of splitting documents into smaller pieces that can be independently embedded and retrieved. The three main strategies are fixed-size chunking (split every N characters with overlap), semantic chunking (split at natural boundaries like paragraphs or sections), and recursive character splitting (split hierarchically by different separators). Each has tradeoffs in terms of context preservation, retrieval precision, and processing speed.
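To make the recursive strategy concrete, here is a minimal sketch of the idea. It is not LangChain's implementation: `recursiveSplit` is a hypothetical helper, it measures length in characters, and it leaves out overlap. It tries the coarsest separator first and falls back to finer ones only for pieces that are still too long.

```typescript
// Hypothetical helper illustrating recursive character splitting.
// Lengths are in characters; overlap is omitted to keep the sketch short.
function recursiveSplit(text: string, maxLen: number, separators: string[]): string[] {
  if (maxLen <= 0) throw new Error('maxLen must be positive');
  if (text.length <= maxLen) return text.length > 0 ? [text] : [];

  // No separators left: fall back to a hard cut every maxLen characters.
  if (separators.length === 0) {
    const cuts: string[] = [];
    for (let i = 0; i < text.length; i += maxLen) cuts.push(text.slice(i, i + maxLen));
    return cuts;
  }

  const sep = separators[0];
  const finer = separators.slice(1);
  const chunks: string[] = [];
  let current = '';

  for (const piece of text.split(sep)) {
    const candidate = current === '' ? piece : current + sep + piece;
    if (candidate.length <= maxLen) {
      current = candidate; // keep merging pieces while they fit
      continue;
    }
    if (current !== '') chunks.push(current);
    if (piece.length <= maxLen) {
      current = piece;
    } else {
      // Still too long: retry this piece with the next, finer separator.
      for (const sub of recursiveSplit(piece, maxLen, finer)) chunks.push(sub);
      current = '';
    }
  }
  if (current !== '') chunks.push(current);
  return chunks;
}
```

With a 7-character limit, `recursiveSplit('aaa bbb\n\nccc ddd eee', 7, ['\n\n', ' '])` keeps the first paragraph whole and splits only the second one on spaces, giving `['aaa bbb', 'ccc ddd', 'eee']`.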
The chunk size matters enormously. Too small (50 tokens) and chunks lack context — a sentence about "the API" without surrounding text is meaningless. Too large (2000 tokens) and chunks are imprecise — a long document section may contain mostly irrelevant information that dilutes the relevant parts. The sweet spot for most applications is 200-500 tokens per chunk with 50-100 tokens of overlap between adjacent chunks.
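As a rough illustration of size and overlap, the sketch below does fixed-size chunking over whitespace-separated words. Words are only a stand-in for tokens here, and `chunkWithOverlap` is a hypothetical helper rather than a library function.

```typescript
// Hypothetical helper: splits text into chunks of `chunkSize` words, repeating
// the last `overlap` words of each chunk at the start of the next one.
function chunkWithOverlap(text: string, chunkSize: number, overlap: number): string[] {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new Error('Require chunkSize > 0 and 0 <= overlap < chunkSize');
  }
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const step = chunkSize - overlap; // how far the window advances each time
  const chunks: string[] = [];
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + chunkSize).join(' '));
    // Stop once a chunk reaches the end, so no trailing chunk is pure overlap.
    if (start + chunkSize >= words.length) break;
  }
  return chunks;
}
```

For ten words with `chunkSize = 4` and `overlap = 1`, the window advances three words at a time and yields three chunks, each sharing one word with its neighbour.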
Embeddings and Vector Search
Embeddings are numerical representations (vectors) of text that capture semantic meaning. Similar texts produce similar vectors, enabling semantic search that finds relevant content even when the exact words differ. When a user asks "How do I authenticate API requests?", a good embedding model will retrieve chunks about "API authentication", "bearer tokens", and "API keys" even if those exact phrases do not appear in the query.
Vector databases (Pinecone, Weaviate, Qdrant, Chroma, pgvector) store these embeddings and provide efficient similarity search using algorithms like HNSW (Hierarchical Navigable Small World). At query time, the user's question is embedded, and the database returns the K most similar chunks in milliseconds.
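Conceptually, similarity search ranks stored vectors by how closely they point in the same direction as the query vector. The sketch below does this by brute force with cosine similarity over toy two-dimensional vectors. Indexes such as HNSW approximate the same ranking without scanning every vector. `StoredChunk`, `cosineSimilarity`, and `topK` are illustrative names, not part of any vector database API.

```typescript
interface StoredChunk {
  id: string;
  embedding: number[];
}

// Cosine similarity: dot product divided by the product of the vector lengths.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('Vectors must have the same dimension');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // treat zero vectors as unrelated
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Exhaustive top-K search: score every chunk, sort descending, keep the first k.
function topK(query: number[], chunks: StoredChunk[], k: number): { id: string; score: number }[] {
  return chunks
    .map((c) => ({ id: c.id, score: cosineSimilarity(query, c.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

A full scan like this is linear in the number of chunks, which is why production systems rely on an approximate index instead.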
Hybrid Search
Pure vector search has limitations. It struggles with exact keyword matches (finding a specific error code like "ECONNREFUSED"), proper nouns (finding "React.memo" when the user searches for it), and numerical queries. Hybrid search combines vector similarity with keyword matching (BM25) to get the best of both worlds. The vector component captures semantic similarity while the keyword component ensures exact matches are found.
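One common way to merge the two result lists is Reciprocal Rank Fusion (RRF), which needs only the rank positions and not the raw scores. The sketch below assumes each search returns an ordered list of document IDs. `reciprocalRankFusion` is a hypothetical helper, and `k = 60` is a commonly used default rather than a tuned value.

```typescript
// Reciprocal Rank Fusion: each document scores the sum of 1 / (k + rank)
// over every ranking it appears in, with rank starting at 1.
function reciprocalRankFusion(rankings: string[][], k: number = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const previous = scores.get(id) ?? 0;
      scores.set(id, previous + 1 / (k + index + 1));
    });
  }
  // Highest fused score first.
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map((entry) => entry[0]);
}
```

Documents that appear near the top of both lists rise above documents that rank highly in only one, which is the behaviour hybrid search is after.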
Architecture and Design Patterns
The RAG Chain Architecture
A production RAG application is organized as a chain of processing steps, each of which can be independently tested, optimized, and replaced:
User Query
↓
[Query Rewriting] → Reformulate for better retrieval
↓
[Embedding] → Convert query to vector
↓
[Retrieval] → Find similar chunks from vector DB
↓
[Re-ranking] → Reorder by relevance using cross-encoder
↓
[Prompt Construction] → Build prompt with context
↓
[Generation] → LLM generates answer
↓
[Streaming Response] → Stream tokens to client
The Document Processing Pipeline
Document ingestion is organized as a pipeline that handles diverse source formats:
Source Documents
↓
[Loaders] → Parse PDF, Markdown, HTML, Notion, Confluence
↓
[Transformers] → Clean text, extract metadata, resolve references
↓
[Chunkers] → Split into optimal-sized chunks with overlap
↓
[Embedders] → Generate vector embeddings for each chunk
↓
[Vector Store] → Index chunks for efficient retrieval
Step-by-Step Implementation
Project Setup
Initialize the Next.js project with LangChain and vector store dependencies:
npx create-next-app@latest rag-app --typescript --tailwind --app
cd rag-app
npm install langchain @langchain/core @langchain/openai @langchain/community
npm install @pinecone-database/pinecone @langchain/pinecone
npm install @langchain/textsplitters pdf-parse
npm install ai @ai-sdk/openai
Document Loading and Chunking
Implement the document processing pipeline:
import { PDFLoader } from '@langchain/community/document_loaders/fs/pdf';
import { DirectoryLoader } from 'langchain/document_loaders/fs/directory';
import { TextLoader } from 'langchain/document_loaders/fs/text';
import { RecursiveCharacterTextSplitter } from '@langchain/textsplitters';
import { OpenAIEmbeddings } from '@langchain/openai';

async function loadAndChunkDocuments(docsPath: string) {
  // Load documents from the directory, choosing a loader by file extension.
  // Markdown files are read as plain text with TextLoader.
  const loader = new DirectoryLoader(docsPath, {
    '.pdf': (path) => new PDFLoader(path),
    '.md': (path) => new TextLoader(path),
  });
  const docs = await loader.load();

  // Split into chunks
  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 500,
    chunkOverlap: 100,
    separators: ['\n\n', '\n', '. ', ' ', ''],
    // Measures length in whitespace-separated words, a rough proxy for tokens.
    // Swap in a real tokenizer if you need exact token counts.
    lengthFunction: (text) => text.split(/\s+/).length,
  });
  const chunks = await splitter.splitDocuments(docs);

  // Add metadata (the loaders already set metadata.source)
  const enrichedChunks = chunks.map((chunk, i) => ({
    ...chunk,
    metadata: {
      ...chunk.metadata,
      chunkIndex: i,
      // PDFLoader records the page under metadata.loc; other loaders leave it undefined
      pageNumber: chunk.metadata.loc?.pageNumber,
    },
  }));
  return enrichedChunks;
}
Vector Store Setup with Pinecone
Connect to an existing Pinecone index (created beforehand with 1536 dimensions to match the embedding model) and wrap it in a LangChain vector store:
import { Pinecone } from '@pinecone-database/pinecone';
import { PineconeStore } from '@langchain/pinecone';
import type { Document } from '@langchain/core/documents';

// The client reads its API key from the PINECONE_API_KEY environment variable
const pinecone = new Pinecone();
const index = pinecone.index('rag-index');
const embeddings = new OpenAIEmbeddings({
  modelName: 'text-embedding-3-small',
  dimensions: 1536,
});

async function indexDocuments(chunks: Document[]) {
  await PineconeStore.fromDocuments(chunks, embeddings, {
    pineconeIndex: index,
    namespace: 'knowledge-base',
    textKey: 'text',
  });
  console.log(`Indexed ${chunks.length} chunks`);
}

async function getVectorStore(): Promise<PineconeStore> {
  return PineconeStore.fromExistingIndex(embeddings, {
    pineconeIndex: index,
    namespace: 'knowledge-base',
    textKey: 'text',
  });
}
RAG Chain with LangChain
Build the retrieval and generation chain:
import { ChatOpenAI } from '@langchain/openai';
import { RunnableSequence } from '@langchain/core/runnables';
import { StringOutputParser } from '@langchain/core/output_parsers';
import { formatDocumentsAsString } from 'langchain/util/document';
import {
  ChatPromptTemplate,
  MessagesPlaceholder,
} from '@langchain/core/prompts';

// Shape of the object passed to chain.invoke() / chain.stream()
interface RAGInput {
  question: string;
  chat_history?: [string, string][]; // [role, content] pairs
}

// getVectorStore() is the helper defined in the vector store section
export async function createRAGChain() {
  const vectorStore = await getVectorStore();
  const retriever = vectorStore.asRetriever({
    k: 5,
    searchType: 'similarity',
  });
  const model = new ChatOpenAI({
    modelName: 'gpt-4-turbo',
    temperature: 0,
    streaming: true,
  });
  const prompt = ChatPromptTemplate.fromMessages([
    ['system', `You are a helpful assistant that answers questions based on the provided context.
If the context doesn't contain enough information to answer the question, say so honestly.
Always cite your sources by referencing the document name.
Context:
{context}`],
    new MessagesPlaceholder('chat_history'),
    ['user', '{question}'],
  ]);
  const chain = RunnableSequence.from([
    {
      // Retrieve with the question text only, not the whole input object
      context: async (input: RAGInput) =>
        formatDocumentsAsString(await retriever.invoke(input.question)),
      question: (input: RAGInput) => input.question,
      chat_history: (input: RAGInput) => input.chat_history ?? [],
    },
    prompt,
    model,
    new StringOutputParser(),
  ]);
  return chain;
}
Next.js API Route with Streaming
Create the API route that streams RAG responses:
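// app/api/chat/route.ts
import { createRAGChain } from '@/lib/rag-chain';

interface ChatMessage {
  role: 'user' | 'assistant' | 'system';
  content: string;
}

export async function POST(req: Request) {
  const { messages }: { messages: ChatMessage[] } = await req.json();
  const chain = await createRAGChain();

  const lastMessage = messages[messages.length - 1].content;
  // Earlier turns become [role, content] pairs for the chat_history placeholder
  const chatHistory = messages
    .slice(0, -1)
    .map((m) => [m.role, m.content] as [string, string]);

  const stream = await chain.stream({
    question: lastMessage,
    chat_history: chatHistory,
  });

  // The chain yields string chunks, but a Response body must be bytes, so encode first.
  // This sends a plain-text stream; depending on your AI SDK version, useChat may need
  // to be configured for text streaming to read it.
  return new Response(stream.pipeThrough(new TextEncoderStream()), {
    headers: { 'Content-Type': 'text/plain; charset=utf-8' },
  });
}
Frontend Chat Interface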
Build the React frontend with the Vercel AI SDK:
'use client';
import { useChat } from 'ai/react';

export function RAGChat() {
  const { messages, input, handleInputChange, handleSubmit, isLoading } =
    useChat({ api: '/api/chat' });
  return (
    <div className="flex flex-col h-screen max-w-3xl mx-auto">
      <div className="flex-1 overflow-y-auto p-4 space-y-4">
        {messages.map((msg) => (
          <div
            key={msg.id}
            className={`p-4 rounded-lg ${
              msg.role === 'user'
                ? 'bg-blue-50 ml-12'
                : 'bg-gray-50 mr-12'
            }`}
          >
            <p className="whitespace-pre-wrap">{msg.content}</p>
          </div>
        ))}
      </div>
      <form onSubmit={handleSubmit} className="p-4 border-t">
        <div className="flex gap-2">
          <input
            value={input}
            onChange={handleInputChange}
            placeholder="Ask a question about your documents..."
            className="flex-1 p-3 border rounded-lg"
            disabled={isLoading}
          />
          <button
            type="submit"
            disabled={isLoading}
            className="px-6 py-3 bg-blue-600 text-white rounded-lg"
          >
            {isLoading ? 'Thinking...' : 'Ask'}
          </button>
        </div>
      </form>
    </div>
  );
}
Real-World Use Cases
Internal Knowledge Base Search
Companies build RAG applications over their internal documentation (Confluence, Notion, Google Docs) to provide employees with instant answers. Instead of searching through hundreds of documents, employees ask questions in natural language and get answers with source citations. The challenge is keeping the index up-to-date as documents change, which requires incremental indexing and change detection.
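Change detection can be as simple as comparing a content hash per document against the hashes recorded at the last indexing run. The sketch below assumes you keep such a map yourself. `SourceDoc`, `contentHash`, and `planIncrementalIndex` are hypothetical names, and the FNV-1a-style hash is a dependency-free stand-in for something like SHA-256.

```typescript
interface SourceDoc {
  id: string;
  content: string;
}

interface IndexPlan {
  added: string[];   // new documents: index their chunks
  changed: string[]; // modified documents: delete old chunks, then re-index
  removed: string[]; // deleted documents: delete their chunks
}

// FNV-1a-style 32-bit hash over UTF-16 code units (illustrative only).
function contentHash(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16);
}

// Compare the current documents with the hashes stored at the last run.
function planIncrementalIndex(previous: Map<string, string>, current: SourceDoc[]): IndexPlan {
  const plan: IndexPlan = { added: [], changed: [], removed: [] };
  const seen = new Set<string>();
  for (const doc of current) {
    seen.add(doc.id);
    const oldHash = previous.get(doc.id);
    if (oldHash === undefined) plan.added.push(doc.id);
    else if (oldHash !== contentHash(doc.content)) plan.changed.push(doc.id);
  }
  previous.forEach((_hash, id) => {
    if (!seen.has(id)) plan.removed.push(id);
  });
  return plan;
}
```

Only the documents listed in the plan need to touch the vector store. Everything else keeps its existing chunks and embeddings.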
Customer Support Knowledge Bases
Support teams use RAG to augment their agents (human or AI) with product documentation, troubleshooting guides, and policy documents. The RAG system retrieves relevant support articles and generates concise, accurate answers. Quality is critical here — hallucinated answers can cause customer harm and support escalations.
Legal and Compliance Document Analysis
Law firms and compliance teams use RAG to search through large volumes of legal documents, contracts, and regulations. The system can answer questions like "What are the termination clauses in our vendor contracts?" by retrieving relevant sections from hundreds of contracts. Accuracy and citation are paramount — every claim must be traceable to a specific document and section.
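Traceability starts with how the context is formatted. If every retrieved chunk carries a numbered label with its source, the model has something concrete to cite. The sketch below shows one possible format. `RetrievedChunk` and `formatContextWithSources` are hypothetical, and the label layout is an assumption rather than a required convention.

```typescript
interface RetrievedChunk {
  source: string;   // document name
  section?: string; // optional section or clause reference
  text: string;
}

// Prefix each chunk with a numbered source label so answers can cite [1], [2], ...
function formatContextWithSources(chunks: RetrievedChunk[]): string {
  return chunks
    .map((chunk, i) => {
      const label = chunk.section ? `${chunk.source}, ${chunk.section}` : chunk.source;
      return `[${i + 1}] (${label})\n${chunk.text}`;
    })
    .join('\n\n');
}
```

The same numbered list can be returned to the client alongside the answer, so each citation in the response maps back to a specific document and section.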
Research Paper Assistants
Academic researchers use RAG to search through large collections of papers, extract findings, and synthesize answers across multiple sources. The system handles technical vocabulary, mathematical notation, and cross-references between papers. Advanced features include automatic citation generation and comparison of findings across papers.
Best Practices for Production
- Optimize chunk size empirically: There is no universal best chunk size. Test different sizes (200, 500, 1000 tokens) with your actual queries and measure retrieval quality. Use an evaluation dataset of question-answer pairs to find the optimal size for your specific content.
- Implement hybrid search: Combine vector similarity search with BM25 keyword search. Use Reciprocal Rank Fusion (RRF) to merge results from both methods. This catches both semantic matches ("How do I log in?" → "Authentication guide") and exact matches ("ECONNREFUSED error" → troubleshooting article about ECONNREFUSED).
- Add a re-ranking step: After initial retrieval, use a cross-encoder model to re-rank the results by relevance. The initial retrieval uses bi-encoder embeddings (fast but less accurate), and the re-ranking step runs a cross-encoder (slow but more accurate) on only the top K results. This significantly improves retrieval precision.
- Include source citations: Always return the source documents alongside the generated answer. This allows users to verify the answer, builds trust, and enables debugging when the answer is incorrect. Display the document name, page number, and the specific chunk that was used.
- Implement query transformation: Before retrieval, use an LLM to rewrite the user's query for better retrieval. Techniques include HyDE (Hypothetical Document Embeddings: generate a hypothetical answer and use its embedding for retrieval), query decomposition (break complex questions into sub-questions), and query expansion (add synonyms and related terms).
- Monitor retrieval quality: Track metrics like retrieval precision (are the retrieved chunks relevant?), answer accuracy (is the generated answer correct?), and citation accuracy (do the cited sources actually support the answer?). Use these metrics to identify and fix retrieval failures.
- Handle the "no relevant context" case: When the retrieved chunks are not relevant to the query, the LLM should say "I don't have enough information to answer this question" rather than hallucinating an answer. Set a similarity threshold below which the system returns a "no results" response instead of forcing the LLM to generate from irrelevant context.
- Implement incremental indexing: When source documents change, do not re-index the entire knowledge base. Detect changed documents, remove their old chunks from the vector store, and index only the new or modified chunks. This keeps the index fresh without the cost of full re-indexing.
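The "no relevant context" guard from the list above can be sketched as a small gate between retrieval and generation. It assumes the retriever returns a similarity score per chunk where higher means more similar. `gateByRelevance` is a hypothetical helper, and the right threshold depends on your embedding model and distance metric, so treat any specific value as something to tune.

```typescript
interface ScoredChunk {
  text: string;
  score: number; // similarity score, higher means more similar (assumption)
}

type RetrievalDecision =
  | { kind: 'answer'; context: ScoredChunk[] }
  | { kind: 'no_results'; message: string };

// Keep only chunks at or above the threshold; refuse to answer if none remain.
function gateByRelevance(results: ScoredChunk[], minScore: number): RetrievalDecision {
  const relevant = results.filter((r) => r.score >= minScore);
  if (relevant.length === 0) {
    return {
      kind: 'no_results',
      message: "I don't have enough information to answer this question.",
    };
  }
  return { kind: 'answer', context: relevant };
}
```

Only the `answer` branch should reach the LLM. The `no_results` message can be returned to the user directly, which also saves a generation call.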
Common Pitfalls and Solutions
| Pitfall | Impact | Solution |
|---|---|---|
| Chunks too small | Loss of context — retrieved chunks lack the information needed to answer | Increase chunk size and add overlap; test with actual queries |
| Chunks too large | Dilution — relevant information is buried in irrelevant text | Decrease chunk size; use semantic chunking that respects document structure |
| No metadata filtering | Retrieving from irrelevant document categories | Add metadata (category, date, author) and use filtered retrieval |
| Embedding model mismatch | Poor retrieval because the embedding model doesn't understand your domain | Fine-tune embeddings on your domain data or use a domain-specific model |
| Ignoring token limits | Context window overflow when too many chunks are retrieved | Calculate token count of all retrieved chunks + prompt + expected output; reduce k if needed |
| No evaluation framework | Cannot measure improvement when changing chunking/retrieval strategy | Build an eval dataset of 50+ question-answer pairs; measure retrieval recall and answer accuracy |
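The token-limit row boils down to simple arithmetic: context window minus prompt tokens minus the space reserved for the answer leaves the budget for retrieved chunks. A minimal sketch, assuming token counts were already computed with your model's tokenizer and that chunks arrive sorted by relevance (`BudgetedChunk` and `fitChunksToBudget` are hypothetical names):

```typescript
interface BudgetedChunk {
  text: string;
  tokens: number; // token count from your tokenizer (assumed precomputed)
}

// Take chunks in relevance order until the next one would exceed the budget.
function fitChunksToBudget(
  chunks: BudgetedChunk[],
  contextWindow: number,
  promptTokens: number,
  reservedForOutput: number
): BudgetedChunk[] {
  let remaining = contextWindow - promptTokens - reservedForOutput;
  const selected: BudgetedChunk[] = [];
  for (const chunk of chunks) {
    if (chunk.tokens > remaining) break; // stop at the first chunk that does not fit
    selected.push(chunk);
    remaining -= chunk.tokens;
  }
  return selected;
}
```

Stopping at the first chunk that does not fit keeps the selection a strict prefix of the ranking. Skipping it and trying smaller, lower-ranked chunks is an alternative if you would rather fill the window.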
Performance Optimization
Caching Embeddings
Embedding generation is expensive (API calls) and slow (network latency). Cache embeddings aggressively:
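import { CacheBackedEmbeddings } from 'langchain/embeddings/cache_backed';
import { LocalFileStore } from 'langchain/storage/file_system';
import { OpenAIEmbeddings } from '@langchain/openai';

const underlyingEmbeddings = new OpenAIEmbeddings({
  modelName: 'text-embedding-3-small',
});
// LocalFileStore.fromPath is the async factory for a directory-backed store
const store = await LocalFileStore.fromPath('./embedding-cache');
const cachedEmbeddings = CacheBackedEmbeddings.fromBytesStore(
  underlyingEmbeddings,
  store,
  // Namespace the cache by model name so vectors from different models never mix
  { namespace: 'text-embedding-3-small' }
);
Streaming Retrieval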
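Retrieval must finish before generation can begin, because the prompt needs the retrieved context. What you can avoid is waiting for the complete answer. Stream tokens to the client as soon as the model produces them:
// vectorStore, prompt, and model are the objects created in the RAG chain section
async function* streamRAGResponse(query: string) {
  const retriever = vectorStore.asRetriever({ k: 5 });
  const docs = await retriever.invoke(query);

  // Build the prompt from all retrieved chunks, then stream tokens as they are generated
  const messages = await prompt.formatMessages({
    context: formatDocumentsAsString(docs),
    question: query,
    chat_history: [],
  });
  const stream = await model.stream(messages);
  for await (const chunk of stream) {
    yield chunk.content;
  }
}
Comparison with Alternatives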
| Feature | RAG | Fine-Tuning | Prompt Engineering Only | Knowledge Graphs |
|---|---|---|---|---|
| Setup cost | Moderate | High | Low | Very high |
| Knowledge freshness | Real-time (re-index) | Static (retrain) | Static (training cutoff) | Real-time |
| Accuracy (domain) | High | Very high | Moderate | High |
| Cost per query | Moderate (embed + retrieve + generate) | Low (generate only) | Low (generate only) | Moderate |
| Transparency | High (citations) | Low | Low | High |
| Best for | Knowledge bases, docs, support | Style/tone adaptation | Simple tasks | Structured relationships |
RAG is the best choice when you need the LLM to access current, domain-specific, or private information. Fine-tuning is better for adapting the model's style, tone, or behavior. For simple tasks with well-known information, prompt engineering alone may suffice.
Advanced Patterns
Multi-Step RAG (Agentic RAG)
For complex questions that require information from multiple documents, use an agentic RAG pattern where the LLM decides what to retrieve next:
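import { AgentExecutor, createOpenAIToolsAgent } from 'langchain/agents';
import { createRetrieverTool } from 'langchain/tools/retriever';
import { ChatPromptTemplate, MessagesPlaceholder } from '@langchain/core/prompts';

// retriever and model are the objects created in the RAG chain section
const retrieverTool = createRetrieverTool(retriever, {
  name: 'search_knowledge_base',
  description: 'Search the knowledge base for information relevant to the query',
});

// Illustrative prompt; the agent needs an agent_scratchpad placeholder for its tool calls
const agentPrompt = ChatPromptTemplate.fromMessages([
  ['system', 'Answer using the knowledge base. Search as many times as you need.'],
  ['human', '{input}'],
  new MessagesPlaceholder('agent_scratchpad'),
]);

const agent = await createOpenAIToolsAgent({
  llm: model,
  tools: [retrieverTool],
  prompt: agentPrompt,
});
const executor = new AgentExecutor({ agent, tools: [retrieverTool] });
const result = await executor.invoke({ input: userQuestion });
Parent-Child Chunking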
Store small chunks for precise retrieval but return the parent chunk (larger context) to the LLM:
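const parentSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 2000,
  chunkOverlap: 200,
});
const childSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 400,
  chunkOverlap: 50,
});
// Split into parents, then split each parent into children tagged with a parentId.
// Document is the type exported by '@langchain/core/documents'.
async function buildParentChildChunks(docs: Document[]) {
  const parents = await parentSplitter.splitDocuments(docs);
  const children: Document[] = [];
  for (let parentId = 0; parentId < parents.length; parentId++) {
    const parts = await childSplitter.splitDocuments([parents[parentId]]);
    parts.forEach((part) => { part.metadata = { ...part.metadata, parentId }; });
    children.push(...parts);
  }
  // Index `children` for retrieval and store `parents` separately, keyed by parentId.
  // When a child chunk is retrieved, return parents[child.metadata.parentId] for generation.
  return { parents, children };
}
Testing Strategies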
Retrieval Quality Testing
Build an evaluation dataset and measure retrieval metrics:
interface EvalCase {
  question: string;
  expectedDocIds: string[];
  expectedAnswer: string;
}

// Assumes `retriever` returns 5 chunks and each chunk's metadata carries a docId set at ingestion
async function evaluateRetrieval(evalCases: EvalCase[]) {
  let totalRecall = 0;
  let totalPrecision = 0;
  for (const testCase of evalCases) {
    const results = await retriever.invoke(testCase.question);
    // Deduplicate: several chunks can come from the same document
    const retrievedIds = Array.from(new Set(results.map((r) => r.metadata.docId)));
    const relevant = retrievedIds.filter((id) =>
      testCase.expectedDocIds.includes(id)
    );
    totalRecall += relevant.length / testCase.expectedDocIds.length;
    // Guard against division by zero when nothing is retrieved
    totalPrecision +=
      retrievedIds.length > 0 ? relevant.length / retrievedIds.length : 0;
  }
  console.log(`Recall@5: ${(totalRecall / evalCases.length).toFixed(3)}`);
  console.log(`Precision@5: ${(totalPrecision / evalCases.length).toFixed(3)}`);
}
Future Outlook
RAG is evolving toward more sophisticated retrieval (graph RAG that traverses knowledge graphs, multimodal RAG that retrieves images and tables, and recursive RAG that iteratively refines queries). The embedding models are improving rapidly, making semantic search more accurate. Vector databases are adding native hybrid search, filtering, and re-ranking capabilities.
The rise of long-context models (100K+ tokens) raises the question of whether RAG is still necessary. For small knowledge bases (under 100 pages), stuffing the entire content into the context window may be simpler and more accurate. For large knowledge bases (thousands of documents), RAG remains essential because context windows cannot scale indefinitely and retrieval focuses the model's attention on the most relevant information.
Conclusion
RAG is the most practical pattern for building AI applications over custom knowledge bases. The key takeaways are:
- Chunk size matters enormously — test empirically with your content and queries to find the optimal size
- Implement hybrid search combining vector similarity and keyword matching for the best retrieval quality
- Add re-ranking after initial retrieval to improve precision
- Include source citations in every response for trust and verifiability
- Build an evaluation framework with question-answer pairs to measure and improve quality systematically
- Use streaming for the best user experience: once retrieval finishes, show tokens as soon as the model starts generating them
Start with a simple pipeline (load → chunk → embed → retrieve → generate), evaluate quality, and then incrementally add hybrid search, re-ranking, and query transformation as your quality metrics indicate room for improvement.