Introduction
Database replication is the backbone of modern high-availability systems. When your application serves millions of users across multiple continents, a single database server becomes both a performance bottleneck and a catastrophic single point of failure. Replication solves both problems by maintaining synchronized copies of your data across multiple nodes.
The choice between leader-follower and multi-leader replication isn't just a technical decision—it's an architectural commitment that shapes your consistency model, failover behavior, and operational complexity for years to come. According to a 2022 Datadog report, 78% of production database incidents involve replication-related issues, making this one of the most critical skills for backend engineers.
This guide covers the internals of both replication strategies, practical PostgreSQL implementation, conflict resolution patterns, and the operational playbooks you need to run them in production. We explore synchronous versus asynchronous semantics, semi-synchronous middle grounds, replication lag problems like read-your-own-writes and monotonic reads, leader election protocols, multi-leader conflict resolution using last-writer-wins and CRDTs, PostgreSQL streaming replication setup, MySQL GTID-based replication, replication topologies, and monitoring strategies that keep your cluster healthy.
Understanding Replication: Core Concepts
At its core, database replication maintains multiple copies of the same data across different servers. The fundamental question is: how do changes propagate from one node to another, and what happens when things go wrong?
Synchronous vs Asynchronous Replication
Synchronous replication guarantees that every write is confirmed on all replicas before returning success to the client. This provides the strongest consistency guarantee but introduces latency proportional to the slowest replica. If a replica goes down, writes block entirely.
Asynchronous replication returns success as soon as the write hits the leader's local disk (or WAL). Replicas catch up in the background. This provides lower latency and better availability but creates a window where committed data could be lost if the leader fails before replicating.
Semi-synchronous replication (used by MySQL) is a middle ground: at least one replica must acknowledge before the write is committed, but not all replicas. MySQL's implementation uses two plugins—rpl_semi_sync_source on the source and rpl_semi_sync_replica on the replica. When enabled, the source waits for at least one replica to acknowledge receipt of each transaction's binary log events before returning success to the client. If no replica acknowledges within the configured timeout (rpl_semi_sync_source_timeout, default 10 seconds), the source falls back to asynchronous replication. This fallback behavior prevents write unavailability when replicas are slow or disconnected, but it means you can silently lose the synchronous guarantee.
-- PostgreSQL streaming replication setup (leader)
-- postgresql.conf on leader
wal_level = replica
max_wal_senders = 10
wal_keep_size = '1GB'
hot_standby = on
-- Create replication user
CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'secure_password';
-- Check replication status on leader
SELECT
client_addr,
state,
sent_lsn,
write_lsn,
flush_lsn,
replay_lsn,
pg_wal_lsn_diff(sent_lsn, replay_lsn) as replication_lag_bytes
FROM pg_stat_replication;-- MySQL semi-synchronous replication setup
-- On source:
INSTALL PLUGIN rpl_semi_sync_source SONAME 'semisync_source.so';
SET GLOBAL rpl_semi_sync_source_enabled = ON;
SET GLOBAL rpl_semi_sync_source_timeout = 5000; -- 5 seconds
-- On replica:
INSTALL PLUGIN rpl_semi_sync_replica SONAME 'semisync_replica.so';
SET GLOBAL rpl_semi_sync_replica_enabled = ON;
STOP REPLICA; START REPLICA;
-- Monitor semi-sync status on source
SHOW STATUS LIKE 'Rpl_semi_sync%';
-- Key variables:
-- Rpl_semi_sync_source_status: ON/OFF
-- Rpl_semi_sync_source_clients: number of semi-sync replicas
-- Rpl_semi_sync_source_net_avg_wait_time: avg wait for replica ack
-- Rpl_semi_sync_source_no_tx: transactions not semi-sync confirmedThe Data Loss Window
The difference between synchronous and asynchronous replication has real consequences. Consider a scenario where the leader processes a write, acknowledges it to the client, and then crashes before the WAL record reaches any replica. With asynchronous replication, that write is gone. The data loss window equals the time between the write being committed locally and being streamed to at least one replica—typically measured in milliseconds under normal conditions but potentially unbounded under network partitions.
PostgreSQL's synchronous_standby_names parameter controls which replicas must acknowledge before a write is considered committed. Setting it to FIRST 1 (replica1) means the write blocks until replica1 confirms, but other replicas remain asynchronous. This gives you RPO=0 for the synchronous replica while keeping write latency predictable.
Architecture: Leader-Follower Pattern
Leader-follower (also called primary-standby) replication designates a single node as the leader that accepts all writes, while one or more followers maintain read-only copies.
How WAL-Based Streaming Works
PostgreSQL uses Write-Ahead Logging (WAL) for replication. Every change is first written to the WAL, then applied to the data files. Followers connect to the leader as WAL receivers, streaming WAL records in real-time:
- Client sends
INSERTto leader - Leader writes WAL record to local disk
- WAL sender process streams record to connected followers
- Follower's WAL receiver writes record to its local WAL
- Follower's startup process replays the WAL record against its data files
- (If synchronous) Leader waits for follower confirmation before responding
-- Setting up a follower (replica)
-- 1. Take a base backup from leader
pg_basebackup -h leader_host -D /var/lib/postgresql/data -U replicator -Fp -Xs -P -R
-- 2. Start the follower - it begins streaming automatically
-- Monitor lag on follower
SELECT
pg_is_in_recovery() as is_replica,
pg_last_wal_receive_lsn() as received,
pg_last_wal_replay_lsn() as replayed,
pg_wal_lsn_diff(
pg_last_wal_receive_lsn(),
pg_last_wal_replay_lsn()
) as lag_bytes,
now() - pg_last_xact_replay_timestamp() as replay_lag;Cascading Replication
For large fleets, cascading replication reduces load on the leader by having followers replicate from other followers:
Leader → Follower A → Follower C, D
→ Follower B → Follower E, F
Cascading replication is particularly useful in geographically distributed setups. Instead of the leader in us-east-1 streaming WAL to replicas in eu-west-1 and ap-southeast-1 directly, you can have a single replica in each region that fans out to local read replicas. This reduces the leader's max_wal_senders requirement and cross-region bandwidth costs. The tradeoff is increased replication lag for the cascaded replicas—each hop adds latency. In PostgreSQL 16+, cascading logical replication is also supported, where subscribers can re-publish data to downstream subscribers.
-- Cascading replication setup: Follower C replicates from Follower A
-- On Follower C's postgresql.conf:
primary_conninfo = 'host=follower_a_host port=5432 user=replicator password=secure_password'
-- The rest of the configuration is identical to setting up any standby
-- Monitor cascading replication lag on Follower C
SELECT
pg_is_in_recovery() as is_replica,
pg_last_wal_receive_lsn() as received,
pg_last_wal_replay_lsn() as replayed,
now() - pg_last_xact_replay_timestamp() as time_since_last_replay;Replication Slots: Preventing WAL Removal
Without replication slots, the leader may remove WAL segments before slow replicas consume them, causing the replica to disconnect and require a full re-sync. Replication slots solve this by telling the leader to retain WAL until all slots have consumed it.
-- Create a physical replication slot on the leader
SELECT pg_create_physical_replication_slot('standby_slot_1');
-- The standby connects using this slot
-- primary_conninfo includes: ...options='-c standby_slot_name=standby_slot_1'
-- Monitor slot lag - CRITICAL for disk space management
SELECT
slot_name,
slot_type,
active,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as retained_wal,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) as lag
FROM pg_replication_slots;
-- Set maximum WAL retention for a slot to prevent disk bloat
ALTER SYSTEM SET max_slot_wal_keep_size = '50GB';
SELECT pg_reload_conf();Warning: Inactive replication slots accumulate WAL indefinitely, which can fill your disk. Always monitor pg_replication_slots and set max_slot_wal_keep_size as a safety net. In production, we've seen cases where a forgotten inactive slot retained 200GB+ of WAL, filling the disk and crashing the leader.
Replication Lag Problems
Asynchronous replication creates a window where replicas are behind the leader. This lag causes several subtle consistency issues that application developers must handle.
Read-Your-Own-Writes
After a user writes data (e.g., updates their profile), subsequent reads may hit a lagging replica and return stale data. The user sees their own write disappear. Solutions include:
// Solution 1: Read-after-write consistency by routing to leader
async function updateProfile(userId: string, data: ProfileData) {
await leaderPool.query('UPDATE users SET profile = $1 WHERE id = $2', [data, userId]);
// Route the subsequent read to the leader for this user
const result = await leaderPool.query('SELECT * FROM users WHERE id = $1', [userId]);
return result.rows[0];
}
// Solution 2: Track write timestamp and wait for replica to catch up
async function readAfterWrite(userId: string, lastWriteTimestamp: Date) {
const replicaLag = await followerPool.query(
"SELECT now() - pg_last_xact_replay_timestamp() as lag"
);
const lagMs = replicaLag.rows[0].lag?.toMillis?.() || 0;
const staleness = Date.now() - lastWriteTimestamp.getTime();
if (lagMs > staleness) {
// Replica hasn't caught up yet, read from leader
return leaderPool.query('SELECT * FROM users WHERE id = $1', [userId]);
}
return followerPool.query('SELECT * FROM users WHERE id = $1', [userId]);
}Monotonic Reads
A user reads from replica A (which has replayed up to transaction 100), then reads from replica B (which has only replayed to transaction 95). The user sees data go "backwards in time." Solutions include routing each user to a consistent replica or tracking the user's last-seen LSN.
Consistent Prefix Reads
In a partitioned system, if a user writes to partition 1 and then reads from partition 2, they may see an inconsistent state where partition 2 hasn't received the write yet. This violates the guarantee that causally related events appear in order.
Leader Election and Automatic Failover
When the leader fails, a follower must be promoted. Manual failover is error-prone and slow, so production systems use automated failover with consensus protocols.
Patroni: Production-Grade Automatic Failover
Patroni is the de facto standard for PostgreSQL HA. It uses a Distributed Configuration Store (DCS)—etcd, ZooKeeper, or Consul—for leader election and cluster state.
# patroni.yml - Patroni configuration for automatic failover
scope: postgres-cluster
name: node1
restapi:
listen: 0.0.0.0:8008
connect_address: node1:8008
etcd3:
hosts: etcd1:2379,etcd2:2379,etcd3:2379
bootstrap:
dcs:
ttl: 30
loop_wait: 10
retry_timeout: 10
maximum_lag_on_failover: 1048576 # 1MB max lag for failover
synchronous_mode: true # Require at least one sync replica
synchronous_mode_strict: false # Don't block writes if no sync replica
postgresql:
use_pg_rewind: true
parameters:
wal_level: replica
hot_standby: "on"
max_wal_senders: 10
max_replication_slots: 10
wal_log_hints: "on"
# Synchronous replication with Patroni
# When synchronous_mode is true, Patroni manages synchronous_standby_names
# automatically, ensuring at least one replica is synchronous before allowing writes// Node.js: Read-write split with automatic failover handling
import { Pool } from 'pg';
const leaderPool = new Pool({
host: process.env.PG_LEADER_HOST,
database: 'myapp',
max: 20,
connectionTimeoutMillis: 3000,
});
const followerPool = new Pool({
host: process.env.PG_FOLLOWER_HOST,
database: 'myapp',
max: 30,
connectionTimeoutMillis: 3000,
});
async function query<T>(sql: string, params?: any[], opts?: { preferLeader?: boolean }): Promise<T[]> {
const pool = opts?.preferLeader ? leaderPool : followerPool;
try {
const result = await pool.query(sql, params);
return result.rows;
} catch (err) {
if (pool === followerPool) {
console.warn('Follower unavailable, falling back to leader');
const result = await leaderPool.query(sql, params);
return result.rows;
}
throw err;
}
}
async function createOrder(userId: string, items: any[]) {
const client = await leaderPool.connect();
try {
await client.query('BEGIN');
const order = await client.query(
'INSERT INTO orders (user_id, items) VALUES ($1, $2) RETURNING *',
[userId, JSON.stringify(items)]
);
await client.query('COMMIT');
return order.rows[0];
} catch (err) {
await client.query('ROLLBACK');
throw err;
} finally {
client.release();
}
}Failover Decision Logic
Patroni's failover algorithm works as follows: each node runs a REST API and a watchdog loop that updates its DCS key with a TTL. If the leader's key expires (leader is dead or unresponsive), replicas attempt to acquire the leader lock. The replica with the lowest replication lag wins. If synchronous_mode is enabled, only synchronous replicas can be promoted, guaranteeing RPO=0. The maximum_lag_on_failover setting prevents a severely lagging replica from being promoted, which would cause significant data loss.
Multi-Leader Replication
Multi-leader (also called master-master or active-active) replication allows multiple nodes to accept writes simultaneously. This is essential for geographically distributed applications where write latency to a single leader would be unacceptable.
When Multi-Leader Is Necessary
Consider a global SaaS application with users in New York, London, and Tokyo. If the leader is in New York, a user in Tokyo experiences 150ms+ write latency just for the network round trip. With multi-leader replication, each region has a local leader that accepts writes, and changes propagate asynchronously to other leaders.
Conflict Resolution Strategies
The fundamental challenge of multi-leader replication is conflict resolution. When two leaders modify the same row concurrently, which write wins?
Last-Writer-Wins (LWW): The write with the latest timestamp wins. This is simple but can lose data. PostgreSQL logical replication does not resolve conflicts automatically—the subscriber's data takes precedence by default.
-- PostgreSQL logical replication conflict handling
-- On subscriber, conflicts cause an error. You can configure behavior:
ALTER SUBSCRIPTION my_sub SET (disable_on_error = false);
-- For PostgreSQL 16+, you can use origin filtering:
CREATE SUBSCRIPTION my_sub
CONNECTION 'host=publisher port=5432 dbname=mydb'
PUBLICATION my_pub
WITH (origin = 'none'); -- Don't replicate changes from other origins
-- Monitor conflicts on subscriber
SELECT * FROM pg_stat_subscription;Application-Level Merge: The application detects conflicts and merges changes. For example, in a collaborative document editor, both edits are preserved. In an inventory system, quantities are summed.
CRDTs (Conflict-Free Replicated Data Types): Data structures that mathematically guarantee convergence regardless of operation order. Counters (G-Counter, PN-Counter), sets (G-Set, OR-Set), and registers (LWW-Register, MV-Register) are common CRDT types.
// G-Counter CRDT example for distributed page view counting
class GCounter {
private counts: Map<string, number> = new Map();
increment(nodeId: string, amount: number = 1): void {
const current = this.counts.get(nodeId) || 0;
this.counts.set(nodeId, current + amount);
}
value(): number {
let total = 0;
for (const count of this.counts.values()) {
total += count;
}
return total;
}
merge(other: GCounter): void {
for (const [nodeId, count] of other.counts) {
const current = this.counts.get(nodeId) || 0;
this.counts.set(nodeId, Math.max(current, count));
}
}
}
// Usage: Each datacenter increments its own counter
const nycCounter = new GCounter();
const londonCounter = new GCounter();
nycCounter.increment('nyc', 42);
londonCounter.increment('london', 37);
// Periodically merge counters across datacenters
nycCounter.merge(londonCounter);
console.log(nycCounter.value()); // 79Multi-Leader Topologies
Circular: A → B → C → A
Star: A → B, A → C, A → D (A is the hub)
All-to-All: A ↔ B, A ↔ C, B ↔ C
All-to-all is the most common topology but requires careful conflict handling. Circular topologies are fragile—any node failure breaks the chain. Star topologies create a single bottleneck at the hub.
MySQL Replication Deep Dive
MySQL offers several replication modes, each with different tradeoffs.
GTID-Based Replication
Global Transaction Identifiers (GTIDs) uniquely identify each transaction across all servers. This eliminates the need to track binary log file names and positions during failover.
-- MySQL GTID replication setup
-- On source (my.cnf):
[mysqld]
server-id = 1
log-bin = mysql-bin
gtid_mode = ON
enforce_gtid_consistency = ON
binlog_format = ROW
-- On replica (my.cnf):
[mysqld]
server-id = 2
relay-log = relay-bin
gtid_mode = ON
enforce_gtid_consistency = ON
read_only = ON
super_read_only = ON
-- Configure replication on replica
CHANGE REPLICATION SOURCE TO
SOURCE_HOST = 'source_host',
SOURCE_USER = 'repl_user',
SOURCE_PASSWORD = 'repl_password',
SOURCE_AUTO_POSITION = 1;
START REPLICA;
SHOW REPLICA STATUS\GMySQL Group Replication
Group Replication provides synchronous multi-primary replication with automatic conflict detection. It uses the Paxos-based group communication protocol for atomic message delivery. All members of the group must agree on a transaction before it commits.
-- MySQL Group Replication setup
[mysqld]
server_id = 1
gtid_mode = ON
enforce_gtid_consistency = ON
binlog_checksum = NONE
log_bin = mysql-bin
binlog_format = ROW
master_info_repository = TABLE
relay_log_info_repository = TABLE
transaction_write_set_extraction = XXHASH64
loose-group_replication_group_name = "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"
loose-group_replication_start_on_boot = OFF
loose-group_replication_local_address = "node1:33061"
loose-group_replication_group_seeds = "node1:33061,node2:33061,node3:33061"
loose-group_replication_single_primary_mode = ONStep-by-Step Implementation
Setting Up Leader-Follower with Automatic Failover
The complete setup requires a consensus layer (Patroni, repmgr, or Stolon) to handle automatic failover:
Zero-Downtime Migration with Logical Replication
A SaaS provider migrated from PostgreSQL 12 to 16 with minimal downtime using logical replication:
-- On the publisher (old server)
CREATE PUBLICATION migration_pub FOR ALL TABLES;
-- On the subscriber (new server)
CREATE SUBSCRIPTION migration_sub
CONNECTION 'host=old_server port=5432 dbname=myapp user=replicator'
PUBLICATION migration_pub;
-- Monitor subscription status
SELECT
subname,
worker_type,
received_lsn,
last_msg_send_time,
last_msg_receipt_time,
latest_end_lsn
FROM pg_stat_subscription;
-- After initial sync completes, switch application to new server
-- Then drop the subscription
ALTER SUBSCRIPTION migration_sub DISABLE;
ALTER SUBSCRIPTION migration_sub SET (slot_name = NONE);
DROP SUBSCRIPTION migration_sub;Logical Replication for Selective Table Replication
Logical replication provides fine-grained control—you can replicate specific tables, filter rows, and even replicate between different PostgreSQL major versions.
-- Replicate only specific tables
CREATE PUBLICATION order_pub FOR TABLE orders, order_items, payments;
-- Row-level filtering (PostgreSQL 15+)
CREATE PUBLICATION active_users_pub FOR TABLE users
WHERE (active = true);
-- Column lists (PostgreSQL 15+)
CREATE PUBLICATION user_emails_pub FOR TABLE users (id, email, created_at);Real-World Use Cases
Use Case 1: Global E-Commerce with Regional Read Replicas
A global e-commerce platform serves users in North America, Europe, and Asia. By deploying regional read replicas, they reduce read latency from 150ms (cross-region) to 5ms (in-region) while maintaining a single leader for writes.
const regionalPools = {
'us-east': new Pool({ host: 'db-us-east.internal', max: 30 }),
'eu-west': new Pool({ host: 'db-eu-west.internal', max: 30 }),
'ap-southeast': new Pool({ host: 'db-ap-southeast.internal', max: 30 }),
};
function getRegionalPool(region: string): Pool {
return regionalPools[region] || regionalPools['us-east'];
}
async function getProductCatalog(region: string) {
const pool = getRegionalPool(region);
return pool.query(
`SELECT p.*, i.quantity FROM products p
JOIN inventory i ON i.product_id = p.id
WHERE i.warehouse_region = $1 ORDER BY p.popularity DESC`,
[region]
);
}Use Case 2: Disaster Recovery with Cross-Region Streaming
A financial services company maintains a hot standby in a different geographic region with synchronous replication to meet RPO=0 requirements.
-- Synchronous replication to primary standby
synchronous_standby_names = 'FIRST 1 (standby_sync)'
-- Monitor replication compliance
CREATE OR REPLACE FUNCTION check_replication_compliance()
RETURNS TABLE (replica_name TEXT, lag_bytes BIGINT, compliant BOOLEAN) AS $$
BEGIN
RETURN QUERY
SELECT
application_name::TEXT,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn),
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) < 1048576
FROM pg_stat_replication;
END;
$$ LANGUAGE plpgsql;Replication Topologies Comparison
| Topology | Write Location | Conflict Risk | Complexity | Best For |
|---|---|---|---|---|
| Single Leader | One node | None | Low | Most applications |
| Multi-Leader | Multiple nodes | High | High | Geo-distributed writes |
| Leaderless (Dynamo-style) | Any node | High | Very high | High-availability writes |
| Cascading | One node | None | Medium | Large read fleets |
| Circular | Each node | Very high | Very high | Legacy; avoid in new designs |
Performance Optimization
-- Tune WAL for replication throughput
wal_buffers = '64MB'
wal_writer_delay = '200ms'
wal_compression = on -- Reduces network bandwidth by 50-70%
-- PostgreSQL 15+ allows lz4 compression
wal_compression = lz4
-- On followers: tune for read performance
shared_buffers = '8GB'
effective_cache_size = '24GB'
hot_standby_feedback = on
-- Prevent long-running queries from blocking WAL replay
max_standby_streaming_delay = '30s'
max_standby_archive_delay = '60s'
-- Monitor replication performance
SELECT
application_name, state,
pg_size_pretty(pg_wal_lsn_diff(sent_lsn, write_lsn)) as write_lag,
pg_size_pretty(pg_wal_lsn_diff(sent_lsn, flush_lsn)) as flush_lag,
pg_size_pretty(pg_wal_lsn_diff(sent_lsn, replay_lsn)) as replay_lag
FROM pg_stat_replication;Monitoring Replication Health
Comprehensive monitoring is non-negotiable for production replication. Here's a complete monitoring strategy:
-- 1. Replication lag in bytes and time
SELECT
application_name,
client_addr,
state,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) as replay_lag,
now() - pg_last_xact_replay_timestamp() as time_lag
FROM pg_stat_replication;
-- 2. Replication slot health
SELECT
slot_name,
active,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as retained_wal,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) as lag
FROM pg_replication_slots
ORDER BY retained_wal DESC;
-- 3. WAL generation rate (bytes per second over last 5 minutes)
SELECT
pg_size_pretty(
pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0') /
EXTRACT(EPOCH FROM (now() - pg_postmaster_start_time()))
) as avg_wal_rate_per_sec;
-- 4. Alert on missing synchronous replica
SELECT
CASE WHEN count(*) = 0 THEN 'CRITICAL: No synchronous replica connected'
ELSE 'OK: ' || count(*) || ' synchronous replica(s)'
END as sync_status
FROM pg_stat_replication
WHERE state = 'streaming'
AND pg_wal_lsn_diff(sent_lsn, replay_lsn) = 0;// Prometheus metrics for replication monitoring
import { register, Gauge } from 'prom-client';
const replicationLagBytes = new Gauge({
name: 'pg_replication_lag_bytes',
help: 'Replication lag in bytes',
labelNames: ['replica_name', 'replica_host'],
});
const replicationLagSeconds = new Gauge({
name: 'pg_replication_lag_seconds',
help: 'Replication lag in seconds',
labelNames: ['replica_name', 'replica_host'],
});
const slotRetainedWal = new Gauge({
name: 'pg_slot_retained_wal_bytes',
help: 'WAL bytes retained by replication slot',
labelNames: ['slot_name'],
});
async function collectReplicationMetrics() {
const result = await leaderPool.query(`
SELECT
application_name,
client_addr,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) as lag_bytes
FROM pg_stat_replication
`);
for (const row of result.rows) {
replicationLagBytes.set(
{ replica_name: row.application_name, replica_host: row.client_addr },
row.lag_bytes
);
}
// Alert threshold: > 100MB lag
for (const row of result.rows) {
if (row.lag_bytes > 100 * 1024 * 1024) {
console.error(`ALERT: ${row.application_name} lag ${row.lag_bytes} bytes`);
}
}
}
setInterval(collectReplicationMetrics, 15000);Common Pitfalls and Solutions
| Pitfall | Impact | Solution |
|---|---|---|
| Replication slot bloat | WAL files accumulate, filling disk | Monitor slot lag; set max_slot_wal_keep_size |
| Cascading failover failure | Multiple replicas go down during failover | Use consensus tools like Patroni with etcd |
| Split-brain with multi-leader | Conflicting writes on both leaders | Use conflict resolution (LWW, CRDTs, or app-level) |
| Long-running transactions on replica | Replication blocks, lag grows | Set max_standby_streaming_delay |
| Missing replication slot after failover | Replicas can't reconnect | Automate slot recreation in failover scripts |
| Network partition causes dual leaders | Data inconsistency | Use fencing mechanisms; Patroni handles this |
| WAL archiving fills disk | Leader crashes | Monitor pg_wal directory size; set retention |
| Replica falls behind during bulk load | Hours of lag, slot bloat | Temporarily disable replication or use parallel apply |
Testing Strategies
describe('Replication Tests', () => {
test('data replicates within acceptable lag', async () => {
const testId = `test-${Date.now()}`;
await leader.query('INSERT INTO test_table (id, data) VALUES ($1, $2)', [testId, 'test']);
const maxWait = 5000;
const start = Date.now();
let found = false;
while (Date.now() - start < maxWait) {
const result = await follower.query('SELECT * FROM test_table WHERE id = $1', [testId]);
if (result.rows.length > 0) { found = true; break; }
await new Promise(resolve => setTimeout(resolve, 100));
}
expect(found).toBe(true);
});
test('follower rejects writes', async () => {
await expect(
follower.query('INSERT INTO test_table (id) VALUES ($1)', ['should-fail'])
).rejects.toThrow(/read only/);
});
test('failover promotes replica and demotes old leader', async () => {
// Trigger failover via Patroni REST API
const res = await fetch('http://patroni-node2:8008/failover', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ leader: 'node1', candidate: 'node2' }),
});
expect(res.ok).toBe(true);
// Wait for failover to complete
await new Promise(resolve => setTimeout(resolve, 15000));
// Verify node2 is now the leader
const status = await fetch('http://patroni-node2:8008/cluster');
const cluster = await status.json();
expect(cluster.leader).toBe('node2');
});
});Choosing the Right Strategy
The choice between leader-follower and multi-leader replication depends on your application's requirements:
- Leader-follower is simpler to implement, provides strong consistency, and works well for read-heavy workloads. Use this as your default.
- Multi-leader is necessary when you need writes from multiple locations. The complexity of conflict resolution means you should only adopt it when geographic write latency is a genuine hard requirement.
- Logical replication is your best tool for migrations, selective table replication, and event-driven architectures. PostgreSQL 16+ added significant improvements including bidirectional logical replication.
- MySQL Group Replication provides multi-primary with automatic conflict detection but has strict requirements (primary keys on all tables, ROW format binlog).
Start with leader-follower. Add read replicas as your read traffic grows. Only move to multi-leader when you can clearly articulate the geographic write requirement and have designed your conflict resolution strategy.
Conclusion
Database replication is foundational to building reliable, scalable systems. Key takeaways:
- Start with leader-follower replication for most applications—it's simpler and well-understood
- Use synchronous replication only when you need RPO=0; the performance cost is real
- Invest in monitoring and automated failover from day one—Patroni + etcd is the gold standard for PostgreSQL
- Multi-leader replication is powerful but complex; adopt it only when geographic write latency is a hard requirement
- Logical replication is your best tool for migrations and event-driven architectures
- Replication lag is inevitable with asynchronous replication—design your application to handle stale reads
- Test failover quarterly; automated failover tools have edge cases that only surface under real failure conditions
Begin by setting up a simple leader-follower pair with Patroni on staging. Practice failover until it's routine. The PostgreSQL high availability documentation is essential reading.