Embeddings and Semantic Search Architecture

Introduction

Text embeddings are dense vector representations of text that capture semantic meaning in a continuous numerical space. Unlike one-hot encodings or TF-IDF vectors that treat words as independent tokens, embeddings place semantically similar concepts near each other in vector space — "king" and "queen" have similar embeddings, as do "Paris" and "France."

Modern embedding models use transformer architectures trained on massive text corpora with contrastive learning objectives. The model learns to produce similar vectors for related text pairs (a question and its answer, an image and its caption) and dissimilar vectors for unrelated pairs. This training produces a "semantic space" where distance corresponds to meaning.
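To make "distance corresponds to meaning" concrete, here is a minimal sketch of cosine similarity, one common way to compare two embeddings. The three-dimensional vectors are invented for illustration; real embeddings come from a model and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors, invented for illustration only (not real model output).
king = [0.9, 0.8, 0.1]
queen = [0.85, 0.75, 0.2]
banana = [0.1, 0.0, 0.95]

# Related concepts should score higher than unrelated ones.
print(cosine_similarity(king, queen) > cosine_similarity(king, banana))
```

If vectors are normalized to unit length ahead of time, cosine similarity reduces to a plain dot product.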

The dimensionality of embeddings varies by model: OpenAI's text-embedding-3-large produces 3072-dimensional vectors, Cohere's embed-v4 defaults to 1536 dimensions, and open-source models like BGE-M3 produce 1024 dimensions. Higher dimensions can capture more nuance but increase storage costs and query latency. Many production systems settle on 768-1536 dimensions as a practical sweet spot.

Embedding quality directly determines search quality. A poor embedding model will produce vectors where unrelated texts cluster together and related texts drift apart. Evaluating embedding models requires domain-specific benchmarks — a model that excels at general semantic similarity may perform poorly on medical texts or legal documents.
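One way to run such a domain-specific comparison is to measure recall@k over a small set of labeled queries from your own domain. The sketch below assumes you already have ranked result lists from each candidate model; the query IDs, document IDs, and the two "models" are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


def mean_recall_at_k(results, judgments, k):
    """Average recall@k over all queries that have relevance judgments."""
    scores = [recall_at_k(results[q], judgments[q], k) for q in judgments]
    return sum(scores) / len(scores)


# Hypothetical relevance labels and ranked outputs for two candidate models.
judgments = {"q1": ["d1", "d4"], "q2": ["d7"]}
model_a = {"q1": ["d1", "d2", "d4"], "q2": ["d3", "d7", "d9"]}
model_b = {"q1": ["d2", "d3", "d1"], "q2": ["d9", "d8", "d3"]}

print(mean_recall_at_k(model_a, judgments, k=3))  # 1.0
print(mean_recall_at_k(model_b, judgments, k=3))  # 0.25
```

Recall@k is only one option; rank-aware metrics such as MRR or nDCG may suit your use case better.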

Understanding Text Embeddings

The Embedding Model Landscape in 2025-2026

The embedding model landscape has evolved rapidly. OpenAI's text-embedding-3 series (small and large) set the standard for proprietary models, offering Matryoshka representation learning, which lets you truncate vectors to fewer dimensions for faster search at a modest cost in quality.

Open-source models have closed the gap. BAAI's BGE family (BGE-large-en-v1.5, BGE-M3) consistently ranks near the top of the MTEB leaderboard. Mistral's embed model offers competitive quality with efficient inference. Nomic's nomic-embed-text provides fully open-source embeddings with Apache 2.0 licensing.

Multi-modal embeddings represent the frontier. Models like CLIP, SigLIP, and Cohere's embed-v4 encode text and images into the same vector space, enabling cross-modal search — find images using text queries or find similar images without any text. This capability powers visual product search, content moderation, and multimodal RAG systems.

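The Matryoshka embedding technique (named after Russian nesting dolls) is now supported by several major embedding models. A model trained this way concentrates the most important information in the leading dimensions, so you can keep only the first 256, 512, or 768 dimensions of a 1536-dimensional vector, renormalize, and see graceful quality degradation. Query and document vectors must be truncated to the same length to be comparable. A common pattern is to run a fast first-pass search over truncated vectors and then rescore the shortlisted candidates with the full-dimensional vectors.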
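The following sketch illustrates truncating Matryoshka-style vectors and using the short form for a first pass. It assumes the vectors come from a model trained with Matryoshka representation learning; the function names, the toy four-dimensional vectors, and the brute-force scan are illustrative stand-ins for a real index.

```python
import math


def truncate_and_normalize(vector, dims):
    """Keep the first `dims` components and rescale to unit length."""
    if not 0 < dims <= len(vector):
        raise ValueError("dims must be between 1 and len(vector)")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("truncated vector has zero length")
    return [x / norm for x in head]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def two_pass_search(query, docs, coarse_dims, shortlist_size, top_k):
    """Shortlist with truncated vectors, then rescore with full vectors."""
    full_dims = len(query)
    q_short = truncate_and_normalize(query, coarse_dims)
    shortlist = sorted(
        docs,
        key=lambda d: dot(q_short, truncate_and_normalize(docs[d], coarse_dims)),
        reverse=True,
    )[:shortlist_size]
    q_full = truncate_and_normalize(query, full_dims)
    rescored = sorted(
        shortlist,
        key=lambda d: dot(q_full, truncate_and_normalize(docs[d], full_dims)),
        reverse=True,
    )
    return rescored[:top_k]


# Toy data, invented for illustration only.
docs = {"a": [1.0, 0.0, 0.0, 0.0], "b": [0.9, 0.1, 0.9, 0.0], "c": [0.0, 1.0, 0.0, 0.0]}
print(two_pass_search([1.0, 0.0, 0.0, 0.0], docs, coarse_dims=2, shortlist_size=2, top_k=1))
```

In practice the truncated document vectors would be computed once at indexing time rather than on every query as they are here.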

Building Production Semantic Search

A production semantic search pipeline involves several stages: text preprocessing, embedding generation, vector indexing, querying, and result ranking. Each stage presents engineering decisions that affect quality, latency, and cost.

Text preprocessing determines what gets embedded. For long documents, chunking strategy matters enormously. Fixed-size chunks with overlap (typically 512 tokens with a 50-token overlap) are simple but may split sentences. Semantic chunking, which uses sentence embeddings to find natural breakpoints, often produces better results. Late chunking (running the full document through the model first, then pooling the token embeddings chunk by chunk) preserves cross-chunk context but requires a model that supports long sequences.
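As a rough illustration of fixed-size chunking with overlap, the sketch below splits a token list into windows that share a few tokens with their neighbors. Whitespace splitting stands in for a real tokenizer, and the tiny chunk size is chosen only to keep the output readable.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=50):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the document
    return chunks


# Whitespace splitting is an assumption; use your model's tokenizer in practice.
tokens = "the quick brown fox jumps over the lazy dog".split()
for chunk in chunk_tokens(tokens, chunk_size=4, overlap=1):
    print(chunk)
```

Counting tokens with the embedding model's own tokenizer matters because the model's sequence limit is expressed in its tokens, not in words.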

Document embeddings can be generated eagerly at ingestion time or lazily at query time. Eager embedding keeps query latency predictable but requires re-embedding the corpus when you change models. Most systems embed documents at ingestion and queries at query time, using the same model for both, since vectors from different models are not comparable. Batch embedding APIs reduce per-text costs: OpenAI's Batch API offers a 50% discount for non-urgent workloads.
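A small sketch of batched ingestion follows. It records the model name next to every vector so that stale vectors can be found after a model change. The `embed_batch` callable, the record layout, and `fake_embed_batch` are hypothetical placeholders, not any provider's API.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def ingest(doc_texts, embed_batch, model_name, batch_size=64):
    """Embed documents in batches and tag every record with the model used."""
    records = {}
    for id_batch in batched(list(doc_texts), batch_size):
        vectors = embed_batch([doc_texts[doc_id] for doc_id in id_batch])
        for doc_id, vector in zip(id_batch, vectors):
            records[doc_id] = {"model": model_name, "vector": vector}
    return records


def stale_ids(records, current_model):
    """IDs whose vectors came from a different model and need re-embedding."""
    return [doc_id for doc_id, rec in records.items() if rec["model"] != current_model]


def fake_embed_batch(texts):
    # Placeholder, not a real model: maps each text to a 2-d vector.
    return [[float(len(t)), float(t.count(" "))] for t in texts]


docs = {"d1": "vector search", "d2": "keyword search with BM25", "d3": "re-ranking"}
records = ingest(docs, fake_embed_batch, model_name="model-v1", batch_size=2)
print(stale_ids(records, current_model="model-v2"))
```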

Query-time processing often includes query expansion (generating multiple query variations), hybrid search (combining vector similarity with BM25 keyword matching), and re-ranking. The two-stage retrieve-then-rerank pattern uses fast vector search to find the top 50-100 candidates, then a cross-encoder model to re-rank them by relevance. This pattern adds 50-100ms of latency but significantly improves result quality.
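One common way to combine vector and keyword results is reciprocal rank fusion (RRF), sketched below together with a re-ranking step. RRF is an assumption here, since the text does not prescribe a fusion method. The `rerank_score` callable stands in for a cross-encoder, and the document IDs and scores are invented.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one list, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def retrieve_then_rerank(vector_hits, keyword_hits, rerank_score, candidates=50, top_k=5):
    """Fuse both result lists, keep a candidate pool, then re-rank it."""
    pool = reciprocal_rank_fusion([vector_hits, keyword_hits])[:candidates]
    return sorted(pool, key=rerank_score, reverse=True)[:top_k]


# Hypothetical ranked ids from a vector index and a BM25 index.
vector_hits = ["a", "b", "c"]
keyword_hits = ["b", "d", "a"]
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))  # ['b', 'a', 'd', 'c']

# Hypothetical cross-encoder scores, looked up from a dict for the demo.
fake_scores = {"a": 0.9, "b": 0.2, "c": 0.5, "d": 0.1}
print(retrieve_then_rerank(vector_hits, keyword_hits, fake_scores.get, top_k=2))  # ['a', 'c']
```

RRF uses only rank positions, so it avoids having to calibrate cosine scores against BM25 scores, which live on different scales.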

RAG Architecture Patterns

Retrieval-Augmented Generation (RAG) is the dominant pattern for grounding LLM responses in factual, up-to-date information. A RAG system retrieves relevant documents from a knowledge base and includes them in the LLM's context, reducing hallucination and enabling the model to answer questions about information it was never trained on.

The basic RAG pattern is straightforward: embed the user's query, find similar document chunks, include them in the prompt, and generate a response. Production RAG requires significantly more sophistication. Naive RAG — simply stuffing the top-k chunks into a prompt — fails when documents are long, questions require synthesizing information across multiple sources, or the retrieval misses relevant context.
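The basic pattern can be sketched in a few lines. This is a toy under stated assumptions: `toy_embed` is a bag-of-words stand-in for a real embedding model, the prompt template is invented, and the final LLM call is left out because it depends on your provider.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, top_k=2):
    """Return the top_k chunks most similar to the query."""
    q = toy_embed(query)
    return sorted(chunks, key=lambda c: cosine(q, toy_embed(c)), reverse=True)[:top_k]


def build_prompt(query, context_chunks):
    """Assemble retrieved chunks and the question into a single prompt."""
    context = "\n".join(f"[{i}] {c}" for i, c in enumerate(context_chunks, start=1))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


chunks = [
    "HNSW is a graph index for vector search",
    "BM25 ranks documents by keyword overlap",
    "Paris is the capital of France",
]
question = "what is the capital of France"
print(build_prompt(question, retrieve(question, chunks, top_k=1)))
```

The resulting string would be sent to the LLM of your choice. Numbering the chunks makes it easier to ask the model to cite which chunk supports its answer.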

Advanced RAG patterns address these limitations. Query routing sends different types of questions to different retrieval strategies — factual questions go to vector search, analytical questions go to SQL queries, and opinion questions go to web search. Query decomposition breaks complex questions into sub-questions, retrieves for each independently, and synthesizes the results.
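A query router can be sketched as a classifier plus a lookup table. The keyword rules below are a deliberately crude stand-in for whatever classifier you would actually use (often an LLM or a small trained model), and the route names are hypothetical labels.

```python
# Hypothetical route names mirroring the three strategies described above.
ROUTES = {
    "factual": "vector_search",
    "analytical": "sql",
    "opinion": "web_search",
}


def classify_query(query):
    """Crude keyword rules standing in for a learned or LLM-based classifier."""
    q = query.lower()
    if any(w in q for w in ("average", "total", "how many", "trend")):
        return "analytical"
    if any(w in q for w in ("opinion", "review", "think of", "recommend")):
        return "opinion"
    return "factual"  # default when no rule matches


def route(query):
    """Pick a retrieval strategy for the query."""
    return ROUTES[classify_query(query)]


print(route("How many orders shipped last month"))
print(route("What do users think of the new API"))
print(route("What port does the service listen on"))
```

Keeping classification separate from the route table lets you swap the classifier without touching the retrieval backends.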

Agentic RAG treats retrieval as a tool that an AI agent can call iteratively. Instead of a single retrieve-and-generate cycle, the agent retrieves documents, evaluates their relevance, generates follow-up queries if needed, and continues until it has sufficient information. This multi-hop retrieval pattern handles complex questions that require connecting information from multiple sources.
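The iterative loop can be sketched as follows. The `retrieve` and `next_query` callables are assumptions: in a real system the first would query your index and the second would be an LLM call that either proposes a follow-up query or signals that the evidence is sufficient. The toy knowledge base is invented to show a two-hop question.

```python
def agentic_retrieve(question, retrieve, next_query, max_steps=4):
    """Retrieve iteratively until `next_query` reports nothing is missing."""
    evidence = []
    query = question
    for _ in range(max_steps):
        for doc in retrieve(query):
            if doc not in evidence:
                evidence.append(doc)
        query = next_query(question, evidence)
        if query is None:  # the agent judges the evidence sufficient
            break
    return evidence


# Toy knowledge base and helpers, invented for illustration only.
KB = {"widget": "Widget X is made by Acme", "acme": "Acme was founded by Ada"}


def toy_retrieve(query):
    return [text for key, text in KB.items() if key in query.lower()]


def toy_next_query(question, evidence):
    if any("founded by" in e for e in evidence):
        return None  # the founder is known, stop
    if any("Acme" in e for e in evidence):
        return "acme founder"  # follow-up hop
    return None


print(agentic_retrieve("Who founded the company that makes Widget X", toy_retrieve, toy_next_query))
```

The `max_steps` cap is a simple guard against loops in which the agent keeps issuing follow-up queries without converging.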

Scaling and Optimizing Vector Search

As vector collections grow from millions to billions of vectors, optimization becomes critical. Indexing strategy, quantization, and hardware selection all impact the cost-performance tradeoff.

HNSW (Hierarchical Navigable Small World) indexing typically offers the best query latency at high recall but requires significant memory, because both the vectors and the graph links live in RAM. A billion 1536-dimensional vectors at float32 precision occupy about 6.1TB as raw data (10^9 x 1536 x 4 bytes). Budgeting roughly 2x the raw size to cover vectors plus index overhead gives approximately 12TB of RAM. IVF-based indexes use less memory but must probe more candidates to achieve the same recall.
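The arithmetic behind that estimate can be checked with a short calculation. The `overhead_factor` of 2.0 is the rule of thumb quoted above, used here as a planning assumption rather than a measured value; actual overhead depends on the HNSW parameters and the implementation.

```python
def index_memory_tb(num_vectors, dims, bytes_per_value=4, overhead_factor=2.0):
    """Return (raw_tb, total_tb) in decimal terabytes (1 TB = 1e12 bytes)."""
    raw_bytes = num_vectors * dims * bytes_per_value
    total_bytes = raw_bytes * overhead_factor  # assumed planning multiplier
    return raw_bytes / 1e12, total_bytes / 1e12


raw_tb, total_tb = index_memory_tb(1_000_000_000, 1536)
print(f"raw: {raw_tb:.2f} TB, with overhead: {total_tb:.2f} TB")
# raw: 6.14 TB, with overhead: 12.29 TB
```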

Quantization reduces memory requirements dramatically. Scalar quantization (float32 to int8) gives 4x compression with around 2% recall loss, depending on the data. Product quantization (PQ) achieves 8-32x compression by splitting vectors into subvectors and replacing each subvector with the ID of its nearest centroid in a per-subspace codebook. Binary quantization (32x compression) keeps only the sign of each dimension. It works surprisingly well for high-dimensional embeddings, especially when the top candidates are rescored with the original vectors.
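The sketch below illustrates scalar and binary quantization on plain Python lists. It assumes vector components lie in a known range of [-1, 1]; real systems usually calibrate the range from the data and store the codes in packed arrays. Product quantization is omitted because it needs a trained codebook.

```python
def scalar_quantize(vector, lo=-1.0, hi=1.0):
    """Map each float in [lo, hi] to an integer in [-128, 127] (int8 range)."""
    scale = 255.0 / (hi - lo)
    codes = []
    for x in vector:
        clipped = min(max(x, lo), hi)  # values outside the range are clamped
        codes.append(int(round((clipped - lo) * scale)) - 128)
    return codes


def scalar_dequantize(codes, lo=-1.0, hi=1.0):
    """Approximate inverse of scalar_quantize."""
    step = (hi - lo) / 255.0
    return [(c + 128) * step + lo for c in codes]


def binary_quantize(vector):
    """Pack the sign of each dimension into the bits of one Python int."""
    bits = 0
    for x in vector:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits


def hamming_distance(a, b):
    """Number of differing bits between two binary codes."""
    return bin(a ^ b).count("1")


v1 = [0.3, -0.2, 0.9, -0.5]
v2 = [0.1, 0.4, 0.8, -0.9]
print(scalar_quantize(v1))
print(hamming_distance(binary_quantize(v1), binary_quantize(v2)))  # 1
```

Hamming distance on binary codes is cheap to compute, which is why binary quantization is often used for a fast first pass before rescoring.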

Hardware selection matters for cost optimization. GPU-accelerated search (using NVIDIA RAPIDS or FAISS-GPU) provides 10-100x speedup over CPU for brute-force and IVF-based search. For HNSW, CPUs often outperform GPUs due to the graph traversal's sequential nature. Cloud providers offer memory-optimized instances (AWS's R-series, GCP's M-series) that provide the best $/GB ratio for vector workloads.

Conclusion

Semantic search quality depends on every stage covered in this article: the choice of embedding model, the chunking strategy, hybrid retrieval and re-ranking, the RAG pattern built on top, and the indexing and quantization decisions that keep the system affordable at scale. Evaluate each stage against your own domain data, measure the tradeoffs between quality, latency, and cost, and revisit those decisions as models and tooling continue to evolve.

Minh Vo
