Database Replication: Leader-Follower and Multi-Leader

Introduction

Database replication is the backbone of modern high-availability systems. When your application serves millions of users across multiple continents, a single database server becomes both a performance bottleneck and a catastrophic single point of failure. Replication solves both problems by maintaining synchronized copies of your data across multiple nodes.

The choice between leader-follower and multi-leader replication isn't just a technical decision—it's an architectural commitment that shapes your consistency model, failover behavior, and operational complexity for years to come. According to a 2022 Datadog report, 78% of production database incidents involve replication-related issues, making this one of the most critical skills for backend engineers.

This guide covers the internals of both replication strategies, practical PostgreSQL implementation, conflict resolution patterns, and the operational playbooks you need to run them in production. We explore synchronous versus asynchronous semantics, semi-synchronous middle grounds, replication lag problems like read-your-own-writes and monotonic reads, leader election protocols, multi-leader conflict resolution using last-writer-wins and CRDTs, PostgreSQL streaming replication setup, MySQL GTID-based replication, replication topologies, and monitoring strategies that keep your cluster healthy.

Understanding Replication: Core Concepts

At its core, database replication maintains multiple copies of the same data across different servers. The fundamental question is: how do changes propagate from one node to another, and what happens when things go wrong?

Synchronous vs Asynchronous Replication

Synchronous replication guarantees that every write is confirmed on all replicas before returning success to the client. This provides the strongest consistency guarantee but introduces latency proportional to the slowest replica. If a replica goes down, writes block entirely.

Asynchronous replication returns success as soon as the write hits the leader's local disk (or WAL). Replicas catch up in the background. This provides lower latency and better availability but creates a window where committed data could be lost if the leader fails before replicating.

Semi-synchronous replication (used by MySQL) is a middle ground: at least one replica must acknowledge before the write is committed, but not all replicas. MySQL's implementation uses two plugins—rpl_semi_sync_source on the source and rpl_semi_sync_replica on the replica. When enabled, the source waits for at least one replica to acknowledge receipt of each transaction's binary log events before returning success to the client. If no replica acknowledges within the configured timeout (rpl_semi_sync_source_timeout, default 10 seconds), the source falls back to asynchronous replication. This fallback behavior prevents write unavailability when replicas are slow or disconnected, but it means you can silently lose the synchronous guarantee.

-- PostgreSQL streaming replication setup (leader)
-- postgresql.conf on leader
wal_level = replica
max_wal_senders = 10
wal_keep_size = '1GB'
hot_standby = on
 
-- Create replication user
CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'secure_password';
 
-- Check replication status on leader
SELECT
  client_addr,
  state,
  sent_lsn,
  write_lsn,
  flush_lsn,
  replay_lsn,
  pg_wal_lsn_diff(sent_lsn, replay_lsn) as replication_lag_bytes
FROM pg_stat_replication;

-- MySQL semi-synchronous replication setup
-- On source:
INSTALL PLUGIN rpl_semi_sync_source SONAME 'semisync_source.so';
SET GLOBAL rpl_semi_sync_source_enabled = ON;
SET GLOBAL rpl_semi_sync_source_timeout = 5000; -- 5 seconds
 
-- On replica:
INSTALL PLUGIN rpl_semi_sync_replica SONAME 'semisync_replica.so';
SET GLOBAL rpl_semi_sync_replica_enabled = ON;
STOP REPLICA; START REPLICA;
 
-- Monitor semi-sync status on source
SHOW STATUS LIKE 'Rpl_semi_sync%';
-- Key variables:
-- Rpl_semi_sync_source_status: ON/OFF
-- Rpl_semi_sync_source_clients: number of semi-sync replicas
-- Rpl_semi_sync_source_net_avg_wait_time: avg wait for replica ack
-- Rpl_semi_sync_source_no_tx: transactions not semi-sync confirmed

The Data Loss Window

The difference between synchronous and asynchronous replication has real consequences. Consider a scenario where the leader processes a write, acknowledges it to the client, and then crashes before the WAL record reaches any replica. With asynchronous replication, that write is gone. The data loss window equals the time between the write being committed locally and being streamed to at least one replica—typically measured in milliseconds under normal conditions but potentially unbounded under network partitions.

PostgreSQL's synchronous_standby_names parameter controls which replicas must acknowledge before a write is considered committed. Setting it to FIRST 1 (replica1) means the write blocks until replica1 confirms, but other replicas remain asynchronous. This gives you RPO=0 for the synchronous replica while keeping write latency predictable.

Architecture: Leader-Follower Pattern

Leader-follower (also called primary-standby) replication designates a single node as the leader that accepts all writes, while one or more followers maintain read-only copies.

How WAL-Based Streaming Works

PostgreSQL uses Write-Ahead Logging (WAL) for replication. Every change is first written to the WAL, then applied to the data files. Followers connect to the leader as WAL receivers, streaming WAL records in real-time:

Client sends INSERT to leader
Leader writes WAL record to local disk
WAL sender process streams record to connected followers
Follower's WAL receiver writes record to its local WAL
Follower's startup process replays the WAL record against its data files
(If synchronous) Leader waits for follower confirmation before responding

-- Setting up a follower (replica)
-- 1. Take a base backup from leader
pg_basebackup -h leader_host -D /var/lib/postgresql/data -U replicator -Fp -Xs -P -R
 
-- 2. Start the follower - it begins streaming automatically
 
-- Monitor lag on follower
SELECT
  pg_is_in_recovery() as is_replica,
  pg_last_wal_receive_lsn() as received,
  pg_last_wal_replay_lsn() as replayed,
  pg_wal_lsn_diff(
    pg_last_wal_receive_lsn(),
    pg_last_wal_replay_lsn()
  ) as lag_bytes,
  now() - pg_last_xact_replay_timestamp() as replay_lag;

Cascading Replication

For large fleets, cascading replication reduces load on the leader by having followers replicate from other followers:

Leader → Follower A → Follower C, D
       → Follower B → Follower E, F

Cascading replication is particularly useful in geographically distributed setups. Instead of the leader in us-east-1 streaming WAL to replicas in eu-west-1 and ap-southeast-1 directly, you can have a single replica in each region that fans out to local read replicas. This reduces the leader's max_wal_senders requirement and cross-region bandwidth costs. The tradeoff is increased replication lag for the cascaded replicas—each hop adds latency. In PostgreSQL 16+, cascading logical replication is also supported, where subscribers can re-publish data to downstream subscribers.

-- Cascading replication setup: Follower C replicates from Follower A
-- On Follower C's postgresql.conf:
primary_conninfo = 'host=follower_a_host port=5432 user=replicator password=secure_password'
-- The rest of the configuration is identical to setting up any standby
 
-- Monitor cascading replication lag on Follower C
SELECT
  pg_is_in_recovery() as is_replica,
  pg_last_wal_receive_lsn() as received,
  pg_last_wal_replay_lsn() as replayed,
  now() - pg_last_xact_replay_timestamp() as time_since_last_replay;

Replication Slots: Preventing WAL Removal

Without replication slots, the leader may remove WAL segments before slow replicas consume them, causing the replica to disconnect and require a full re-sync. Replication slots solve this by telling the leader to retain WAL until all slots have consumed it.

-- Create a physical replication slot on the leader
SELECT pg_create_physical_replication_slot('standby_slot_1');
 
-- The standby connects using this slot
-- primary_conninfo includes: ...options='-c standby_slot_name=standby_slot_1'
 
-- Monitor slot lag - CRITICAL for disk space management
SELECT
  slot_name,
  slot_type,
  active,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as retained_wal,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) as lag
FROM pg_replication_slots;
 
-- Set maximum WAL retention for a slot to prevent disk bloat
ALTER SYSTEM SET max_slot_wal_keep_size = '50GB';
SELECT pg_reload_conf();

Warning: Inactive replication slots accumulate WAL indefinitely, which can fill your disk. Always monitor pg_replication_slots and set max_slot_wal_keep_size as a safety net. In production, we've seen cases where a forgotten inactive slot retained 200GB+ of WAL, filling the disk and crashing the leader.

Replication Lag Problems

Asynchronous replication creates a window where replicas are behind the leader. This lag causes several subtle consistency issues that application developers must handle.

Read-Your-Own-Writes

After a user writes data (e.g., updates their profile), subsequent reads may hit a lagging replica and return stale data. The user sees their own write disappear. Solutions include:

// Solution 1: Read-after-write consistency by routing to leader
async function updateProfile(userId: string, data: ProfileData) {
  await leaderPool.query('UPDATE users SET profile = $1 WHERE id = $2', [data, userId]);
  // Route the subsequent read to the leader for this user
  const result = await leaderPool.query('SELECT * FROM users WHERE id = $1', [userId]);
  return result.rows[0];
}
 
// Solution 2: Track write timestamp and wait for replica to catch up
async function readAfterWrite(userId: string, lastWriteTimestamp: Date) {
  const replicaLag = await followerPool.query(
    "SELECT now() - pg_last_xact_replay_timestamp() as lag"
  );
  const lagMs = replicaLag.rows[0].lag?.toMillis?.() || 0;
  const staleness = Date.now() - lastWriteTimestamp.getTime();
 
  if (lagMs > staleness) {
    // Replica hasn't caught up yet, read from leader
    return leaderPool.query('SELECT * FROM users WHERE id = $1', [userId]);
  }
  return followerPool.query('SELECT * FROM users WHERE id = $1', [userId]);
}

Monotonic Reads

A user reads from replica A (which has replayed up to transaction 100), then reads from replica B (which has only replayed to transaction 95). The user sees data go "backwards in time." Solutions include routing each user to a consistent replica or tracking the user's last-seen LSN.

Consistent Prefix Reads

In a partitioned system, if a user writes to partition 1 and then reads from partition 2, they may see an inconsistent state where partition 2 hasn't received the write yet. This violates the guarantee that causally related events appear in order.

Leader Election and Automatic Failover

When the leader fails, a follower must be promoted. Manual failover is error-prone and slow, so production systems use automated failover with consensus protocols.

Patroni: Production-Grade Automatic Failover

Patroni is the de facto standard for PostgreSQL HA. It uses a Distributed Configuration Store (DCS)—etcd, ZooKeeper, or Consul—for leader election and cluster state.

# patroni.yml - Patroni configuration for automatic failover
scope: postgres-cluster
name: node1
 
restapi:
  listen: 0.0.0.0:8008
  connect_address: node1:8008
 
etcd3:
  hosts: etcd1:2379,etcd2:2379,etcd3:2379
 
bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576  # 1MB max lag for failover
    synchronous_mode: true            # Require at least one sync replica
    synchronous_mode_strict: false    # Don't block writes if no sync replica
    postgresql:
      use_pg_rewind: true
      parameters:
        wal_level: replica
        hot_standby: "on"
        max_wal_senders: 10
        max_replication_slots: 10
        wal_log_hints: "on"
 
# Synchronous replication with Patroni
# When synchronous_mode is true, Patroni manages synchronous_standby_names
# automatically, ensuring at least one replica is synchronous before allowing writes

// Node.js: Read-write split with automatic failover handling
import { Pool } from 'pg';
 
const leaderPool = new Pool({
  host: process.env.PG_LEADER_HOST,
  database: 'myapp',
  max: 20,
  connectionTimeoutMillis: 3000,
});
 
const followerPool = new Pool({
  host: process.env.PG_FOLLOWER_HOST,
  database: 'myapp',
  max: 30,
  connectionTimeoutMillis: 3000,
});
 
async function query<T>(sql: string, params?: any[], opts?: { preferLeader?: boolean }): Promise<T[]> {
  const pool = opts?.preferLeader ? leaderPool : followerPool;
  try {
    const result = await pool.query(sql, params);
    return result.rows;
  } catch (err) {
    if (pool === followerPool) {
      console.warn('Follower unavailable, falling back to leader');
      const result = await leaderPool.query(sql, params);
      return result.rows;
    }
    throw err;
  }
}
 
async function createOrder(userId: string, items: any[]) {
  const client = await leaderPool.connect();
  try {
    await client.query('BEGIN');
    const order = await client.query(
      'INSERT INTO orders (user_id, items) VALUES ($1, $2) RETURNING *',
      [userId, JSON.stringify(items)]
    );
    await client.query('COMMIT');
    return order.rows[0];
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  } finally {
    client.release();
  }
}

Failover Decision Logic

Patroni's failover algorithm works as follows: each node runs a REST API and a watchdog loop that updates its DCS key with a TTL. If the leader's key expires (leader is dead or unresponsive), replicas attempt to acquire the leader lock. The replica with the lowest replication lag wins. If synchronous_mode is enabled, only synchronous replicas can be promoted, guaranteeing RPO=0. The maximum_lag_on_failover setting prevents a severely lagging replica from being promoted, which would cause significant data loss.

Multi-Leader Replication

Multi-leader (also called master-master or active-active) replication allows multiple nodes to accept writes simultaneously. This is essential for geographically distributed applications where write latency to a single leader would be unacceptable.

When Multi-Leader Is Necessary

Consider a global SaaS application with users in New York, London, and Tokyo. If the leader is in New York, a user in Tokyo experiences 150ms+ write latency just for the network round trip. With multi-leader replication, each region has a local leader that accepts writes, and changes propagate asynchronously to other leaders.

Conflict Resolution Strategies

The fundamental challenge of multi-leader replication is conflict resolution. When two leaders modify the same row concurrently, which write wins?

Last-Writer-Wins (LWW): The write with the latest timestamp wins. This is simple but can lose data. PostgreSQL logical replication does not resolve conflicts automatically—the subscriber's data takes precedence by default.

-- PostgreSQL logical replication conflict handling
-- On subscriber, conflicts cause an error. You can configure behavior:
ALTER SUBSCRIPTION my_sub SET (disable_on_error = false);
 
-- For PostgreSQL 16+, you can use origin filtering:
CREATE SUBSCRIPTION my_sub
  CONNECTION 'host=publisher port=5432 dbname=mydb'
  PUBLICATION my_pub
  WITH (origin = 'none');  -- Don't replicate changes from other origins
 
-- Monitor conflicts on subscriber
SELECT * FROM pg_stat_subscription;

Application-Level Merge: The application detects conflicts and merges changes. For example, in a collaborative document editor, both edits are preserved. In an inventory system, quantities are summed.

CRDTs (Conflict-Free Replicated Data Types): Data structures that mathematically guarantee convergence regardless of operation order. Counters (G-Counter, PN-Counter), sets (G-Set, OR-Set), and registers (LWW-Register, MV-Register) are common CRDT types.

// G-Counter CRDT example for distributed page view counting
class GCounter {
  private counts: Map<string, number> = new Map();
 
  increment(nodeId: string, amount: number = 1): void {
    const current = this.counts.get(nodeId) || 0;
    this.counts.set(nodeId, current + amount);
  }
 
  value(): number {
    let total = 0;
    for (const count of this.counts.values()) {
      total += count;
    }
    return total;
  }
 
  merge(other: GCounter): void {
    for (const [nodeId, count] of other.counts) {
      const current = this.counts.get(nodeId) || 0;
      this.counts.set(nodeId, Math.max(current, count));
    }
  }
}
 
// Usage: Each datacenter increments its own counter
const nycCounter = new GCounter();
const londonCounter = new GCounter();
 
nycCounter.increment('nyc', 42);
londonCounter.increment('london', 37);
 
// Periodically merge counters across datacenters
nycCounter.merge(londonCounter);
console.log(nycCounter.value()); // 79

Multi-Leader Topologies

Circular: A → B → C → A
Star:     A → B, A → C, A → D  (A is the hub)
All-to-All: A ↔ B, A ↔ C, B ↔ C

All-to-all is the most common topology but requires careful conflict handling. Circular topologies are fragile—any node failure breaks the chain. Star topologies create a single bottleneck at the hub.

MySQL Replication Deep Dive

MySQL offers several replication modes, each with different tradeoffs.

GTID-Based Replication

Global Transaction Identifiers (GTIDs) uniquely identify each transaction across all servers. This eliminates the need to track binary log file names and positions during failover.

-- MySQL GTID replication setup
-- On source (my.cnf):
[mysqld]
server-id = 1
log-bin = mysql-bin
gtid_mode = ON
enforce_gtid_consistency = ON
binlog_format = ROW
 
-- On replica (my.cnf):
[mysqld]
server-id = 2
relay-log = relay-bin
gtid_mode = ON
enforce_gtid_consistency = ON
read_only = ON
super_read_only = ON
 
-- Configure replication on replica
CHANGE REPLICATION SOURCE TO
  SOURCE_HOST = 'source_host',
  SOURCE_USER = 'repl_user',
  SOURCE_PASSWORD = 'repl_password',
  SOURCE_AUTO_POSITION = 1;
 
START REPLICA;
SHOW REPLICA STATUS\G

MySQL Group Replication

Group Replication provides synchronous multi-primary replication with automatic conflict detection. It uses the Paxos-based group communication protocol for atomic message delivery. All members of the group must agree on a transaction before it commits.

-- MySQL Group Replication setup
[mysqld]
server_id = 1
gtid_mode = ON
enforce_gtid_consistency = ON
binlog_checksum = NONE
log_bin = mysql-bin
binlog_format = ROW
master_info_repository = TABLE
relay_log_info_repository = TABLE
transaction_write_set_extraction = XXHASH64
loose-group_replication_group_name = "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"
loose-group_replication_start_on_boot = OFF
loose-group_replication_local_address = "node1:33061"
loose-group_replication_group_seeds = "node1:33061,node2:33061,node3:33061"
loose-group_replication_single_primary_mode = ON

Step-by-Step Implementation

Setting Up Leader-Follower with Automatic Failover

The complete setup requires a consensus layer (Patroni, repmgr, or Stolon) to handle automatic failover:

Zero-Downtime Migration with Logical Replication

A SaaS provider migrated from PostgreSQL 12 to 16 with minimal downtime using logical replication:

-- On the publisher (old server)
CREATE PUBLICATION migration_pub FOR ALL TABLES;
 
-- On the subscriber (new server)
CREATE SUBSCRIPTION migration_sub
CONNECTION 'host=old_server port=5432 dbname=myapp user=replicator'
PUBLICATION migration_pub;
 
-- Monitor subscription status
SELECT
  subname,
  worker_type,
  received_lsn,
  last_msg_send_time,
  last_msg_receipt_time,
  latest_end_lsn
FROM pg_stat_subscription;
 
-- After initial sync completes, switch application to new server
-- Then drop the subscription
ALTER SUBSCRIPTION migration_sub DISABLE;
ALTER SUBSCRIPTION migration_sub SET (slot_name = NONE);
DROP SUBSCRIPTION migration_sub;

Logical Replication for Selective Table Replication

Logical replication provides fine-grained control—you can replicate specific tables, filter rows, and even replicate between different PostgreSQL major versions.

-- Replicate only specific tables
CREATE PUBLICATION order_pub FOR TABLE orders, order_items, payments;
 
-- Row-level filtering (PostgreSQL 15+)
CREATE PUBLICATION active_users_pub FOR TABLE users
  WHERE (active = true);
 
-- Column lists (PostgreSQL 15+)
CREATE PUBLICATION user_emails_pub FOR TABLE users (id, email, created_at);

Real-World Use Cases

Use Case 1: Global E-Commerce with Regional Read Replicas

A global e-commerce platform serves users in North America, Europe, and Asia. By deploying regional read replicas, they reduce read latency from 150ms (cross-region) to 5ms (in-region) while maintaining a single leader for writes.

const regionalPools = {
  'us-east': new Pool({ host: 'db-us-east.internal', max: 30 }),
  'eu-west': new Pool({ host: 'db-eu-west.internal', max: 30 }),
  'ap-southeast': new Pool({ host: 'db-ap-southeast.internal', max: 30 }),
};
 
function getRegionalPool(region: string): Pool {
  return regionalPools[region] || regionalPools['us-east'];
}
 
async function getProductCatalog(region: string) {
  const pool = getRegionalPool(region);
  return pool.query(
    `SELECT p.*, i.quantity FROM products p
     JOIN inventory i ON i.product_id = p.id
     WHERE i.warehouse_region = $1 ORDER BY p.popularity DESC`,
    [region]
  );
}

Use Case 2: Disaster Recovery with Cross-Region Streaming

A financial services company maintains a hot standby in a different geographic region with synchronous replication to meet RPO=0 requirements.

-- Synchronous replication to primary standby
synchronous_standby_names = 'FIRST 1 (standby_sync)'
 
-- Monitor replication compliance
CREATE OR REPLACE FUNCTION check_replication_compliance()
RETURNS TABLE (replica_name TEXT, lag_bytes BIGINT, compliant BOOLEAN) AS $$
BEGIN
  RETURN QUERY
  SELECT
    application_name::TEXT,
    pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn),
    pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) < 1048576
  FROM pg_stat_replication;
END;
$$ LANGUAGE plpgsql;

Replication Topologies Comparison

Topology	Write Location	Conflict Risk	Complexity	Best For
Single Leader	One node	None	Low	Most applications
Multi-Leader	Multiple nodes	High	High	Geo-distributed writes
Leaderless (Dynamo-style)	Any node	High	Very high	High-availability writes
Cascading	One node	None	Medium	Large read fleets
Circular	Each node	Very high	Very high	Legacy; avoid in new designs

Performance Optimization

-- Tune WAL for replication throughput
wal_buffers = '64MB'
wal_writer_delay = '200ms'
wal_compression = on  -- Reduces network bandwidth by 50-70%
 
-- PostgreSQL 15+ allows lz4 compression
wal_compression = lz4
 
-- On followers: tune for read performance
shared_buffers = '8GB'
effective_cache_size = '24GB'
hot_standby_feedback = on
 
-- Prevent long-running queries from blocking WAL replay
max_standby_streaming_delay = '30s'
max_standby_archive_delay = '60s'
 
-- Monitor replication performance
SELECT
  application_name, state,
  pg_size_pretty(pg_wal_lsn_diff(sent_lsn, write_lsn)) as write_lag,
  pg_size_pretty(pg_wal_lsn_diff(sent_lsn, flush_lsn)) as flush_lag,
  pg_size_pretty(pg_wal_lsn_diff(sent_lsn, replay_lsn)) as replay_lag
FROM pg_stat_replication;

Monitoring Replication Health

Comprehensive monitoring is non-negotiable for production replication. Here's a complete monitoring strategy:

-- 1. Replication lag in bytes and time
SELECT
  application_name,
  client_addr,
  state,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) as replay_lag,
  now() - pg_last_xact_replay_timestamp() as time_lag
FROM pg_stat_replication;
 
-- 2. Replication slot health
SELECT
  slot_name,
  active,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as retained_wal,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) as lag
FROM pg_replication_slots
ORDER BY retained_wal DESC;
 
-- 3. WAL generation rate (bytes per second over last 5 minutes)
SELECT
  pg_size_pretty(
    pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0') /
    EXTRACT(EPOCH FROM (now() - pg_postmaster_start_time()))
  ) as avg_wal_rate_per_sec;
 
-- 4. Alert on missing synchronous replica
SELECT
  CASE WHEN count(*) = 0 THEN 'CRITICAL: No synchronous replica connected'
       ELSE 'OK: ' || count(*) || ' synchronous replica(s)'
  END as sync_status
FROM pg_stat_replication
WHERE state = 'streaming'
  AND pg_wal_lsn_diff(sent_lsn, replay_lsn) = 0;

// Prometheus metrics for replication monitoring
import { register, Gauge } from 'prom-client';
 
const replicationLagBytes = new Gauge({
  name: 'pg_replication_lag_bytes',
  help: 'Replication lag in bytes',
  labelNames: ['replica_name', 'replica_host'],
});
 
const replicationLagSeconds = new Gauge({
  name: 'pg_replication_lag_seconds',
  help: 'Replication lag in seconds',
  labelNames: ['replica_name', 'replica_host'],
});
 
const slotRetainedWal = new Gauge({
  name: 'pg_slot_retained_wal_bytes',
  help: 'WAL bytes retained by replication slot',
  labelNames: ['slot_name'],
});
 
async function collectReplicationMetrics() {
  const result = await leaderPool.query(`
    SELECT
      application_name,
      client_addr,
      pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) as lag_bytes
    FROM pg_stat_replication
  `);
 
  for (const row of result.rows) {
    replicationLagBytes.set(
      { replica_name: row.application_name, replica_host: row.client_addr },
      row.lag_bytes
    );
  }
 
  // Alert threshold: > 100MB lag
  for (const row of result.rows) {
    if (row.lag_bytes > 100 * 1024 * 1024) {
      console.error(`ALERT: ${row.application_name} lag ${row.lag_bytes} bytes`);
    }
  }
}
 
setInterval(collectReplicationMetrics, 15000);

Common Pitfalls and Solutions

Pitfall	Impact	Solution
Replication slot bloat	WAL files accumulate, filling disk	Monitor slot lag; set `max_slot_wal_keep_size`
Cascading failover failure	Multiple replicas go down during failover	Use consensus tools like Patroni with etcd
Split-brain with multi-leader	Conflicting writes on both leaders	Use conflict resolution (LWW, CRDTs, or app-level)
Long-running transactions on replica	Replication blocks, lag grows	Set `max_standby_streaming_delay`
Missing replication slot after failover	Replicas can't reconnect	Automate slot recreation in failover scripts
Network partition causes dual leaders	Data inconsistency	Use fencing mechanisms; Patroni handles this
WAL archiving fills disk	Leader crashes	Monitor `pg_wal` directory size; set retention
Replica falls behind during bulk load	Hours of lag, slot bloat	Temporarily disable replication or use parallel apply

Testing Strategies

describe('Replication Tests', () => {
  test('data replicates within acceptable lag', async () => {
    const testId = `test-${Date.now()}`;
    await leader.query('INSERT INTO test_table (id, data) VALUES ($1, $2)', [testId, 'test']);
 
    const maxWait = 5000;
    const start = Date.now();
    let found = false;
 
    while (Date.now() - start < maxWait) {
      const result = await follower.query('SELECT * FROM test_table WHERE id = $1', [testId]);
      if (result.rows.length > 0) { found = true; break; }
      await new Promise(resolve => setTimeout(resolve, 100));
    }
    expect(found).toBe(true);
  });
 
  test('follower rejects writes', async () => {
    await expect(
      follower.query('INSERT INTO test_table (id) VALUES ($1)', ['should-fail'])
    ).rejects.toThrow(/read only/);
  });
 
  test('failover promotes replica and demotes old leader', async () => {
    // Trigger failover via Patroni REST API
    const res = await fetch('http://patroni-node2:8008/failover', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ leader: 'node1', candidate: 'node2' }),
    });
    expect(res.ok).toBe(true);
 
    // Wait for failover to complete
    await new Promise(resolve => setTimeout(resolve, 15000));
 
    // Verify node2 is now the leader
    const status = await fetch('http://patroni-node2:8008/cluster');
    const cluster = await status.json();
    expect(cluster.leader).toBe('node2');
  });
});

Choosing the Right Strategy

The choice between leader-follower and multi-leader replication depends on your application's requirements:

Leader-follower is simpler to implement, provides strong consistency, and works well for read-heavy workloads. Use this as your default.
Multi-leader is necessary when you need writes from multiple locations. The complexity of conflict resolution means you should only adopt it when geographic write latency is a genuine hard requirement.
Logical replication is your best tool for migrations, selective table replication, and event-driven architectures. PostgreSQL 16+ added significant improvements including bidirectional logical replication.
MySQL Group Replication provides multi-primary with automatic conflict detection but has strict requirements (primary keys on all tables, ROW format binlog).

Start with leader-follower. Add read replicas as your read traffic grows. Only move to multi-leader when you can clearly articulate the geographic write requirement and have designed your conflict resolution strategy.

Conclusion

Database replication is foundational to building reliable, scalable systems. Key takeaways:

Start with leader-follower replication for most applications—it's simpler and well-understood
Use synchronous replication only when you need RPO=0; the performance cost is real
Invest in monitoring and automated failover from day one—Patroni + etcd is the gold standard for PostgreSQL
Multi-leader replication is powerful but complex; adopt it only when geographic write latency is a hard requirement
Logical replication is your best tool for migrations and event-driven architectures
Replication lag is inevitable with asynchronous replication—design your application to handle stale reads
Test failover quarterly; automated failover tools have edge cases that only surface under real failure conditions

Begin by setting up a simple leader-follower pair with Patroni on staging. Practice failover until it's routine. The PostgreSQL high availability documentation is essential reading.

Minh Vo

Slaying code & making it lit fr fr 🔥 tagline