Database Indexing with B-Tree, Hash, and GIN

Introduction

When a user complains that a search takes 30 seconds instead of 30 milliseconds, the root cause is almost always the same: a missing or incorrect index. Database indexes are the single most powerful tool for query performance optimization, yet many developers treat them as a black box—create an index, hope for the best, and move on. The reality is that understanding how different index types work internally is the key to choosing the right one for each query pattern.

PostgreSQL offers several index types, each optimized for different access patterns. B-tree indexes handle range queries, sorting, and equality with balanced performance. Hash indexes provide the fastest possible equality lookups at the cost of flexibility. GIN (Generalized Inverted Index) enables searching within composite values like JSONB documents, arrays, and full-text vectors. This guide teaches you how each type works under the hood, when to use each one, and how to verify your indexes are working correctly using EXPLAIN ANALYZE.

Understanding Index Types: Core Concepts

How a Database Scan Works Without Indexes

Without an index, PostgreSQL performs a sequential scan: it reads every page of the table from disk (or cache), examines every row, and returns those matching the WHERE clause. For a table with 10 million rows occupying 5 GB on disk, a sequential scan reads all 5 GB. Even with the data cached in memory, scanning 10 million rows takes 2–5 seconds.

An index transforms this into a targeted lookup. Instead of scanning the entire table, PostgreSQL traverses the index structure (a few dozen page reads at most), finds the pointers to matching rows, and fetches only those rows from the table. For a query returning 100 rows out of 10 million, the index might read 4–5 index pages plus 100 table pages—roughly 1000x less I/O.

B-Tree: Sorted Balanced Tree

B-tree is PostgreSQL's default index type and the most versatile. The structure is a balanced tree where each node contains sorted keys and child pointers. Leaf pages (the bottom level) contain index entries with the indexed value and a pointer to the corresponding heap tuple (the actual row in the table).

A B-tree on a column with 10 million values typically has 3–4 levels. The root page contains pointers to intermediate pages, which point to leaf pages. Finding any value requires traversing 3–4 pages—roughly 32 KB of I/O regardless of table size. This logarithmic scaling is what makes B-tree indexes effective even on very large tables.

B-tree supports: =, <, <=, >=, >, BETWEEN, IN, IS NULL, IS NOT NULL, and LIKE 'prefix%' (left-anchored patterns). It also satisfies ORDER BY clauses, allowing PostgreSQL to return sorted results without a separate sort operation.

Hash: Bucket-Based Equality

Hash indexes use a hash function to distribute values into fixed-size buckets. A lookup computes the hash of the search value, jumps directly to the corresponding bucket, and scans the bucket for the exact match. This gives O(1) average-case lookup complexity.

Hash indexes are simpler and smaller than B-tree for the same data because they don't maintain sorted order. For a column storing 10 million UUIDs, a B-tree index might be 300 MB while a hash index is 200 MB. The smaller size means more of the index fits in cache, reducing disk I/O.

The critical limitation: hash indexes only support the = operator. They cannot do range queries, sorting, pattern matching, or multi-column indexing. Since PostgreSQL 10, hash indexes are WAL-logged and crash-safe, making them production-viable.

GIN: Inverted Index for Composite Values

GIN inverts the traditional index mapping. Instead of "row → value," it stores "value → set of rows." When you query "find all rows where the tags array contains 'database'," GIN looks up 'database' in its element tree, retrieves the posting list of row IDs, and returns them directly.

GIN is designed for multi-valued columns: arrays, JSONB documents, full-text search vectors, and hstore key-value pairs. The inverted structure makes containment queries (@>), overlap queries (&&), and existence queries (?, ?|, ?&) efficient.

The trade-off is write performance. Inserting a row with an array of 20 elements requires updating 20 posting lists in the GIN index. GIN's fast update mode buffers these updates in a pending list, merging them into the main structure periodically to amortize the write cost.

Architecture and Design Patterns

Index Scan vs Bitmap Scan vs Index-Only Scan

PostgreSQL uses three different strategies when an index is available:

Index Scan: PostgreSQL traverses the index, finds a matching entry, then fetches the corresponding heap tuple. For each match, it does an index lookup + a heap lookup. This is efficient when the number of matches is small (a few hundred rows) because heap accesses are random but few.

Bitmap Index Scan: PostgreSQL first scans the index and builds a bitmap of matching row locations. It then sorts the bitmap by physical location and fetches heap tuples in physical order. This reduces random I/O compared to a plain index scan when many rows match (thousands or more), because heap access becomes sequential.

Index-Only Scan: PostgreSQL reads all needed data directly from the index without accessing the heap. This is the fastest strategy but requires the index to contain all columns referenced in SELECT, WHERE, and ORDER BY. PostgreSQL also checks the visibility map to determine if heap access is needed for MVCC visibility checks.

Composite Index Column Ordering

The order of columns in a composite index determines which queries can use it. PostgreSQL reads composite index keys left to right, similar to how a phone book is sorted by last name, then first name.

-- This index
CREATE INDEX idx ON orders (customer_id, status, created_at);
 
-- Supports these query patterns:
WHERE customer_id = 100
WHERE customer_id = 100 AND status = 'active'
WHERE customer_id = 100 AND status = 'active' AND created_at > '2024-01-01'
 
-- Does NOT support:
WHERE status = 'active'
WHERE status = 'active' AND created_at > '2024-01-01'

The ESR (Equality, Sort, Range) guideline recommends placing equality columns first, sort columns second, and range columns last. This ensures the index can both filter and sort efficiently.

GIN Operator Classes

GIN indexes must be created with a specific operator class that determines which operators are supported:

-- Default JSONB operator class: supports all JSONB operators
CREATE INDEX idx_jsonb ON table USING gin (column);
-- Supports: @>, ?, ?|, ?&, @?, @@
 
-- Path-specific JSONB operator class: smaller, faster
CREATE INDEX idx_jsonb_path ON table USING gin (column jsonb_path_ops);
-- Supports only: @>, @?, @@
 
-- Trigram operator class for text similarity
CREATE EXTENSION pg_trgm;
CREATE INDEX idx_trgm ON table USING gin (column gin_trgm_ops);
-- Supports: % (similarity), LIKE, ILIKE

Step-by-Step Implementation

Reading EXPLAIN ANALYZE Output

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders WHERE customer_id = 12345 AND status = 'active';

Index Scan using idx_orders_customer_status on orders  (cost=0.56..8.58 rows=15 width=64) (actual time=0.018..0.042 rows=12 loops=1)
  Index Cond: ((customer_id = 12345) AND (status = 'active'))
  Buffers: shared hit=4
Planning Time: 0.089 ms
Execution Time: 0.055 ms

Key observations:

Index Scan using idx_orders_customer_status: The optimizer chose our composite index
Index Cond: Shows the exact conditions applied at the index level
rows=12 matches actual rows=12: Cardinality estimates are accurate
Buffers: shared hit=4: Only 4 pages read from cache (32 KB total)
Execution Time: 0.055 ms: Sub-millisecond performance

Creating B-Tree Indexes

-- Standard B-tree index
CREATE INDEX CONCURRENTLY idx_orders_customer
  ON orders (customer_id);
 
-- Composite B-tree with ESR ordering
CREATE INDEX CONCURRENTLY idx_orders_search
  ON orders (customer_id, status, created_at DESC);
 
-- Partial B-tree (only index recent orders)
CREATE INDEX CONCURRENTLY idx_orders_recent
  ON orders (customer_id, created_at DESC)
  WHERE created_at > '2023-01-01';
 
-- Covering index (includes extra columns for index-only scans)
CREATE INDEX CONCURRENTLY idx_orders_covering
  ON orders (customer_id, status)
  INCLUDE (total_amount, created_at);

Creating Hash Indexes

-- Hash index for session token lookups
CREATE INDEX CONCURRENTLY idx_sessions_token
  ON sessions USING hash (session_token);
 
-- Verify it's used
EXPLAIN ANALYZE SELECT * FROM sessions WHERE session_token = 'abc123';

Creating GIN Indexes

-- GIN for JSONB containment queries
CREATE INDEX CONCURRENTLY idx_products_attrs
  ON products USING gin (attributes);
 
-- GIN for array overlap queries
CREATE INDEX CONCURRENTLY idx_documents_tags
  ON documents USING gin (tags);
 
-- GIN for full-text search
CREATE INDEX CONCURRENTLY idx_articles_fts
  ON articles USING gin (to_tsvector('english', content));
 
-- GIN for trigram text search
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX CONCURRENTLY idx_users_name_trgm
  ON users USING gin (name gin_trgm_ops);

Validating Index Usage

-- Check which indexes are being used
SELECT
  schemaname,
  relname AS table_name,
  indexrelname AS index_name,
  idx_scan AS scans,
  idx_tup_read AS tuples_read,
  idx_tup_fetch AS tuples_fetched,
  pg_size_pretty(pg_relation_size(indexrelid)) AS size
FROM pg_stat_user_indexes
WHERE schemaname = 'public'
ORDER BY idx_scan DESC;
 
-- Find unused indexes
SELECT
  indexrelname AS index_name,
  pg_size_pretty(pg_relation_size(indexrelid)) AS wasted_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
  AND indexrelid NOT IN (SELECT conindid FROM pg_constraint WHERE contype IN ('p', 'u'))
ORDER BY pg_relation_size(indexrelid) DESC;

Real-World Use Cases

Use Case 1: User Authentication with Hash Index

An authentication service performing 100,000 lookups per second on email addresses needed maximum lookup speed. Emails are only ever queried with =, never with range comparisons. Switching from B-tree to hash index improved average lookup latency from 0.12ms to 0.08ms and reduced index size by 35%.

-- Before
CREATE INDEX idx_users_email ON users USING btree (email);
-- Index size: 450 MB, lookup: 0.12ms
 
-- After
CREATE INDEX idx_users_email_hash ON users USING hash (email);
-- Index size: 290 MB, lookup: 0.08ms

Use Case 2: Product Catalog Search with GIN Trigram

An e-commerce site needed to implement "search as you type" for product names. Users typed partial strings like "iph" and expected results including "iPhone 15 Pro." B-tree indexes can't handle infix patterns (LIKE '%iph%'), requiring full table scans.

Adding a GIN trigram index enabled efficient substring matching:

CREATE EXTENSION pg_trgm;
CREATE INDEX idx_products_name_gin ON products USING gin (name gin_trgm_ops);
 
-- Sub-10ms substring search across 5 million products
SELECT name, similarity(name, 'iph') AS score
FROM products
WHERE name ILIKE '%iph%'
ORDER BY score DESC
LIMIT 10;

Use Case 3: Multi-Tenant Event Log with Composite B-Tree

A SaaS platform storing events for 500 tenants queried events filtered by tenant, event type, and time range. Individual indexes on each column caused bitmap AND operations that degraded under high concurrency.

A composite index eliminated the bitmap overhead:

CREATE INDEX idx_events_tenant_type_time
  ON events (tenant_id, event_type, created_at DESC);
 
-- Query uses single index scan instead of bitmap AND
EXPLAIN ANALYZE
SELECT * FROM events
WHERE tenant_id = 42 AND event_type = 'click' AND created_at > NOW() - INTERVAL '1 hour'
ORDER BY created_at DESC LIMIT 100;
-- Execution time: 0.3ms (was 45ms with bitmap AND)

Best Practices for Production

Default to B-tree: It handles the widest range of query patterns and is the most well-understood index type. Only deviate when you have a specific workload that benefits from an alternative.
Use Hash for pure equality on high-cardinality columns: Session tokens, API keys, UUIDs, and email addresses queried only with = are ideal candidates for hash indexes.
Use GIN for JSONB, arrays, and full-text search: These data types have multi-valued fields that require inverted index structures. GIN is purpose-built for containment, overlap, and existence queries.
Choose the right GIN operator class: Use jsonb_path_ops for JSONB unless you need all operators. Use gin_trgm_ops for text similarity. Using the wrong operator class means your index won't be used.
Follow ESR for composite indexes: Equality columns first, Sort column second, Range columns last. This maximizes both filtering and sorting efficiency.
Always create indexes CONCURRENTLY in production: CREATE INDEX CONCURRENTLY avoids locking the table during index creation. This is critical for tables with continuous write traffic.
Monitor index usage statistics: Use pg_stat_user_indexes to track scan counts and identify unused indexes. Every unused index slows writes and consumes storage.
Benchmark with realistic data: An index that works well on 10,000 test rows may behave differently on 10 million production rows. Test with production-sized datasets.

Common Pitfalls and Solutions

Pitfall	Impact	Solution
Using hash index when range queries are needed	Index cannot be used, falls back to seq scan	Use B-tree if any range or sort queries exist on the column
Wrong GIN operator class	Index exists but isn't used by the optimizer	Match operator class to query operators (jsonb_path_ops for @>)
Not using CONCURRENTLY	Table locked during index creation, blocking writes	Always use CREATE INDEX CONCURRENTLY on production tables
Composite index wrong column order	Most queries can't use the index	Apply ESR rule: Equality, Sort, Range
Ignoring index bloat	Indexes grow large, degrading cache performance	Schedule REINDEX CONCURRENTLY during maintenance windows
Over-indexing write-heavy tables	Write amplification, slower inserts/updates	Audit with pg_stat_user_indexes, drop unused indexes

Performance Optimization

Index Size Comparison

-- Compare index sizes for the same table
SELECT
  indexname,
  pg_size_pretty(pg_relation_size(indexname::regclass)) AS index_size,
  idx_scan AS times_used
FROM pg_indexes
JOIN pg_stat_user_indexes USING (indexrelname)
WHERE tablename = 'orders'
ORDER BY pg_relation_size(indexname::regclass) DESC;

Reducing Index Bloat

-- Check bloat with pgstattuple extension
CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT * FROM pgstatindex('idx_orders_customer_status');
 
-- Rebuild bloated index without locking
REINDEX INDEX CONCURRENTLY idx_orders_customer_status;
 
-- Monitor dead tuples
SELECT
  relname,
  n_live_tup,
  n_dead_tup,
  ROUND(n_dead_tup::numeric / NULLIF(n_live_tup + n_dead_tup, 0) * 100, 1) AS dead_pct
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC;

Index-Only Scan Optimization

-- Check visibility map coverage
SELECT
  relname,
  n_live_tup,
  n_dead_tup,
  ROUND(100.0 * relallvisible / NULLIF(relpages, 0), 1) AS visible_pct
FROM pg_class
JOIN pg_stat_user_tables ON pg_class.oid = pg_stat_user_tables.relid
WHERE relkind = 'r';
 
-- Run VACUUM to update visibility map
VACUUM (VERBOSE) orders;

Comparison with Alternatives

Feature	B-Tree	Hash	GIN	BRIN
Best for	General purpose	Equality only	Composite values	Ordered sequential
Range queries	✅	❌	❌	✅
Sort support	✅	❌	❌	❌
Multi-column	✅	❌	N/A	✅
Size (10M UUIDs)	300 MB	200 MB	N/A	50 KB
Write overhead	Low	Low	High	Very low
Crash-safe (PG 10+)	✅	✅	✅	✅

Advanced Patterns

Combining Index Types

-- Tenant-isolated JSONB queries: B-tree for tenant, GIN for metadata
CREATE INDEX idx_docs_tenant ON documents (tenant_id);
CREATE INDEX idx_docs_metadata ON documents USING gin (metadata);
 
-- PostgreSQL will bitmap AND both indexes
EXPLAIN ANALYZE
SELECT * FROM documents
WHERE tenant_id = 42 AND metadata @> '{"type": "invoice"}';

Partial GIN Index

-- GIN index only on active records
CREATE INDEX idx_active_products_attrs
  ON products USING gin (attributes)
  WHERE status = 'active';

Testing Strategies

-- Force specific index types for comparison
SET enable_indexscan = on;
SET enable_bitmapscan = off;
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = 123;
 
-- Compare hash vs B-tree for equality
CREATE INDEX CONCURRENTLY idx_test_btree ON sessions (token);
EXPLAIN ANALYZE SELECT * FROM sessions WHERE token = 'test';
DROP INDEX CONCURRENTLY idx_test_btree;
 
CREATE INDEX CONCURRENTLY idx_test_hash ON sessions USING hash (token);
EXPLAIN ANALYZE SELECT * FROM sessions WHERE token = 'test';
DROP INDEX CONCURRENTLY idx_test_hash;

Future Outlook

PostgreSQL continues to enhance its index capabilities. Version 17 brought improvements to B-tree deduplication and parallel index builds. Future releases may add native bloom filter indexes, improved JSONB statistics for better query planning, and GPU-accelerated index scans.

The trend toward managed database services (Aurora, AlloyDB, Neon) includes automated index tuning. Amazon's Performance Insights and Google's Query Advisor can recommend indexes based on workload analysis. However, understanding index internals remains essential for designing schemas and queries that perform well from the start, rather than relying on reactive recommendations.

Conclusion

Choosing the right database index type is a fundamental skill that directly impacts application performance. B-tree is the versatile default for range queries, sorting, and equality. Hash provides the fastest equality lookups for high-cardinality columns queried exclusively with =. GIN enables searching within composite values—JSONB, arrays, and full-text—without external search infrastructure.

Key takeaways:

B-tree is the default choice: It handles the broadest range of query patterns. Start here and only deviate when profiling shows a specific need.
Hash for pure equality: Session tokens, API keys, and UUID lookups benefit from hash's O(1) average lookup and smaller index size.
GIN for composite values: JSONB containment, array overlap, and full-text search all require GIN's inverted index structure.
Use EXPLAIN ANALYZE to validate: Never assume an index is being used. Always verify with actual execution plans.
Follow ESR for composites: Column ordering determines index effectiveness. Equality, Sort, Range is the optimal order for most workloads.

Start by identifying your slowest queries, analyze their execution plans, and apply the appropriate index type. The performance improvements are often dramatic—10x to 1000x speedups are common when the right index is applied to the right query.

Minh Vo

Slaying code & making it lit fr fr 🔥 tagline