Database Backup Strategies: PITR, Snapshots, and WAL

Introduction

Point-in-time recovery (PITR) is the ability to restore a database to any specific moment in time—not just to the last backup, but to any second within your retention window. This capability transforms database recovery from a blunt instrument ("restore to last night's backup") into a precision tool ("restore to 2:37 PM, just before the bad migration ran"). PITR relies on Write-Ahead Log (WAL) archiving, which continuously captures every change made to the database.

Understanding PITR, WAL archiving, and snapshot-based backups is essential for any team operating production databases. These aren't optional luxuries—they're the difference between losing 24 hours of data and losing zero seconds. This guide dives deep into the mechanics of WAL archiving, the tradeoffs between different backup strategies, and the operational practices needed to make PITR reliable.

Understanding PITR and WAL: Core Concepts

Write-Ahead Logging (WAL)

Every modern database uses WAL as its fundamental durability mechanism. The principle is simple: before any change is applied to the actual data files, it's first written to a sequential log (the WAL). This ensures that even if the database crashes mid-operation, the WAL contains enough information to either complete or undo the operation during recovery.

In PostgreSQL, WAL is implemented as a sequence of 16MB segment files named with a 24-character hexadecimal number (e.g., 000000010000000000000001). Each segment contains a sequence of WAL records describing every change to the database at the byte level.

How PITR Works

PITR combines two mechanisms:

Base backup: A consistent snapshot of the database at a specific point in time
WAL archive: A continuous stream of WAL segments covering every change since the base backup

To recover to a specific point in time:

Restore the base backup
Replay WAL segments from the base backup's timestamp to the target time
The database replays every transaction in order, stopping at the exact target moment

This is analogous to a video recording: the base backup is a keyframe, and the WAL segments are the frames between keyframes. You can "play" the video to any point.

Recovery Target Types

PostgreSQL supports several recovery targets:

recovery_target_time: Recover to a specific timestamp
recovery_target_xid: Recover to a specific transaction ID
recovery_target_lsn: Recover to a specific Log Sequence Number
recovery_target: 'immediate' — recover as soon as consistency is reached

Snapshot-Based Backups

Storage-level snapshots (EBS snapshots, ZFS snapshots, LVM snapshots) provide near-instantaneous backups by capturing the filesystem state at a specific moment. Snapshots are copy-on-write—they only store the blocks that change after the snapshot is taken, making them storage-efficient.

However, snapshots alone are not PITR. A snapshot captures the state at one moment. To achieve PITR, you need snapshots plus WAL archiving.

Architecture and Design Patterns

The Backup Pipeline

PostgreSQL → WAL segments → Archive (S3) → Retention policy
    ↓
Base backup (weekly) → S3 → Lifecycle policy
    ↓
EBS snapshots (daily) → Cross-region copy

The Three-Layer Strategy

Layer	Frequency	RPO	Purpose
WAL archiving	Continuous	Seconds	PITR capability
Incremental backup	Daily	24 hours	Fast full restoration
Full backup	Weekly	7 days	Disaster recovery baseline

Verification Architecture

Automated cron → Restore backup to test DB → Run integrity checks →
Compare with production metrics → Alert on failure

Step-by-Step Implementation

Configuring WAL Archiving in PostgreSQL

-- postgresql.conf
wal_level = replica                    # Enable WAL archiving
archive_mode = on                      # Enable archive mode
archive_command = 'aws s3 cp %p s3://my-bucket/wal/%f --sse AES256'
archive_timeout = 300                  # Force archive every 5 minutes
max_wal_senders = 3                    # Allow replication connections
wal_keep_size = '1GB'                  # Keep WAL for replication

Creating a Base Backup

#!/bin/bash
# base-backup.sh
 
BACKUP_DIR="/backup/base"
S3_BUCKET="s3://my-bucket/base"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
 
# Create base backup
pg_basebackup \
  -h localhost \
  -U replicator \
  -D "$BACKUP_DIR/$TIMESTAMP" \
  -Ft \
  -z \
  -Xs \
  -P
 
# Upload to S3
aws s3 sync "$BACKUP_DIR/$TIMESTAMP" "$S3_BUCKET/$TIMESTAMP/" --sse AES256
 
# Record backup metadata
echo "{\"timestamp\": \"$TIMESTAMP\", \"lsn\": \"$(pg_current_wal_lsn())\"}" > /backup/metadata/latest.json
aws s3 cp /backup/metadata/latest.json "$S3_BUCKET/metadata.json"
 
# Cleanup local backup
rm -rf "$BACKUP_DIR/$TIMESTAMP"
 
echo "Base backup completed: $TIMESTAMP"

Point-in-Time Recovery Procedure

#!/bin/bash
# pitr-restore.sh
 
TARGET_TIME="2024-01-15 14:30:00"
S3_BUCKET="s3://my-bucket"
RESTORE_DIR="/var/lib/postgresql/data"
 
# Step 1: Download and restore base backup
LATEST_BACKUP=$(aws s3 cp "$S3_BUCKET/base/metadata.json" - | jq -r '.timestamp')
aws s3 sync "$S3_BUCKET/base/$LATEST_BACKUP/" /tmp/base/
tar -xzf /tmp/base/base.tar.gz -C "$RESTORE_DIR"
 
# Step 2: Configure recovery
cat > "$RESTORE_DIR/postgresql.auto.conf" << EOF
restore_command = 'aws s3 cp $S3_BUCKET/wal/%f %p'
recovery_target_time = '$TARGET_TIME'
recovery_target_action = 'promote'
EOF
 
# Step 3: Create recovery signal
touch "$RESTORE_DIR/recovery.signal"
 
# Step 4: Set permissions and start
chown -R postgres:postgres "$RESTORE_DIR"
pg_ctl start -D "$RESTORE_DIR"
 
# Step 5: Wait for recovery to complete
echo "Waiting for recovery to reach $TARGET_TIME..."
while pg_is_in_recovery; do
  sleep 1
done
 
echo "Recovery complete. Database is now at $(psql -t -c 'SELECT now()')"

Automated Backup Testing

#!/bin/bash
# test-backup.sh - Run weekly
 
set -euo pipefail
 
TEST_CONTAINER="backup-test-$(date +%s)"
TEST_DB="test_restore"
 
# Start test container
docker run -d --name "$TEST_CONTAINER" \
  -e POSTGRES_PASSWORD=test \
  -e POSTGRES_DB="$TEST_DB" \
  postgres:16
 
sleep 10
 
# Download latest backup
aws s3 cp s3://my-backup-bucket/daily/latest.dump /tmp/latest.dump
 
# Restore
pg_restore -h localhost -p 5433 -U postgres -d "$TEST_DB" /tmp/latest.dump
 
# Verify
TABLE_COUNT=$(psql -h localhost -p 5433 -U postgres -d "$TEST_DB" -t -c \
  "SELECT count(*) FROM information_schema.tables WHERE table_schema='public'")
 
ROW_COUNT=$(psql -h localhost -p 5433 -U postgres -d "$TEST_DB" -t -c \
  "SELECT count(*) FROM orders")
 
echo "Restored $TABLE_COUNT tables, $ROW_COUNT order rows"
 
# Cleanup
docker stop "$TEST_CONTAINER" && docker rm "$TEST_CONTAINER"
rm /tmp/latest.dump

Real-World Use Cases

Recovery from Bad Migration

A team ran a migration that dropped a column used by the application. They didn't realize the issue until 45 minutes later when users reported errors. Using PITR, they recovered the database to the moment before the migration ran, preserving all 45 minutes of user activity.

Recovery from Accidental Data Deletion

A support agent accidentally deleted 10,000 customer records while trying to delete one. The team used PITR to recover to the exact moment before the deletion, losing only the 30 seconds it took to identify the problem.

Recovery from Corrupted Index

A PostgreSQL bug caused index corruption after a power failure. Rather than rebuilding indexes (which would take hours on a multi-terabyte database), the team used a base backup from before the corruption and replayed WAL to the current point, skipping the corrupted transaction.

Best Practices for Production

Archive WAL to a different region — If your database is in us-east-1, archive WAL to us-west-2. This protects against regional outages.
Monitor archive lag — Alert if WAL archiving falls behind. A gap in the WAL archive means a gap in your recovery capability.
Test PITR monthly — Restore to a specific point in time and verify data integrity. This validates the entire pipeline.
Use archive_command with error handling — If the archive command fails, PostgreSQL will accumulate WAL segments until the disk fills up. Monitor disk usage.
Compress WAL archives — WAL segments are highly compressible (typically 10:1 ratio). Use gzip or zstd to reduce storage costs.
Set appropriate retention — Keep WAL archives for at least as long as your PITR window. Use S3 lifecycle policies to manage costs.
Document the recovery procedure — Write a runbook with exact commands, expected durations, and verification steps. Practice it under time pressure.
Use a separate archive user — Create a dedicated PostgreSQL user with minimal permissions for WAL archiving.

Common Pitfalls and Solutions

Pitfall	Impact	Solution
Missing WAL segments	PITR fails at the gap	Monitor archive_command success, alert on failures
Archive disk full	Database stops accepting writes	Monitor archive disk, use S3 instead of local disk
Base backup too old	Long replay time	Create base backups weekly or more frequently
Not testing restores	Undiscovered corruption	Automate monthly restore tests
Wrong recovery_target_time	Recovering to wrong moment	Use timestamps with timezone, verify with `SELECT now()`
Permissions issues	Archive command fails	Test archive_command manually, use dedicated user

Performance Optimization

WAL Compression

# Compress WAL before archiving
archive_command = 'gzip < %p | aws s3 cp - s3://bucket/wal/%f.gz --sse AES256'
 
# Decompress during recovery
restore_command = 'aws s3 cp s3://bucket/wal/%f.gz - | gunzip > %p'

Parallel WAL Archiving

# Use pg_receivewal for parallel archiving
pg_receivewal -h localhost -U replicator -D /archive/wal --synchronous &

Comparison with Alternatives

Method	RPO	Storage Cost	Recovery Speed	Complexity
Daily pg_dump	24 hours	Low	Slow (hours)	Low
WAL archiving	Seconds	Medium	Medium (minutes)	Medium
Streaming replication	Seconds	High	Instant (failover)	High
EBS snapshots	Minutes	Low	Medium	Low
Cloud-managed (RDS)	Minutes	Medium	Fast	Very low

Advanced Patterns

Cascading Replication

Primary → WAL Archive (S3)
       → Standby 1 → Standby 2 (cascading)

This reduces load on the primary while maintaining multiple recovery points.

Delayed Standby

-- Standby that lags 1 hour behind primary
-- recovery.conf
recovery_min_apply_delay = '1h'

A delayed standby provides a 1-hour window to catch mistakes before they propagate to the standby.

Testing Strategies

#!/bin/bash
# pitr-test.sh - Verify PITR capability
 
# Record current time
TARGET_TIME=$(psql -t -c "SELECT now()::text")
 
# Insert test data
psql -c "INSERT INTO test_table (data) VALUES ('test-pitr-$(date +%s)')"
 
# Wait a moment
sleep 5
 
# Restore to TARGET_TIME
./pitr-restore.sh "$TARGET_TIME"
 
# Verify test data is absent
COUNT=$(psql -t -c "SELECT count(*) FROM test_table WHERE data LIKE 'test-pitr-%'")
if [ "$COUNT" -eq 0 ]; then
  echo "PASS: PITR correctly restored to before test data insertion"
else
  echo "FAIL: Test data found after PITR"
  exit 1
fi

Future Outlook

Cloud-managed databases are abstracting away the complexity of PITR. AWS RDS offers automatic backups with configurable retention and one-click point-in-time recovery. Google Cloud SQL provides similar capabilities with automatic WAL archiving to Cloud Storage.

However, teams running self-managed databases (for cost, compliance, or performance reasons) still need to understand these fundamentals. The principles of WAL archiving, base backups, and recovery verification apply regardless of whether you're running PostgreSQL on bare metal or using a managed service.

Backup Monitoring and Alerting

A backup strategy is only as good as your ability to verify it works. Implement monitoring that tracks backup job completion, backup size trends, and backup duration. Alert on backup failures immediately, and set up automated restore tests that verify backup integrity on a regular schedule. Track the Recovery Point Objective compliance metric by measuring the age of the most recent successful backup. If this age exceeds your RPO target, trigger an alert. Use dashboards to visualize backup trends, storage consumption, and restore test results. Integrate backup monitoring with your existing observability stack using tools like Prometheus, Grafana, or Datadog. Regular backup drills where you practice restoring to a test environment build team confidence and reveal issues before they become emergencies.

Cloud-Native Backup Strategies

AWS RDS Automated Backups

AWS RDS simplifies PITR by handling WAL archiving and snapshot management automatically:

# Enable automated backups with 35-day retention
aws rds modify-db-instance \
  --db-instance-identifier my-database \
  --backup-retention-period 35 \
  --preferred-backup-window "03:00-04:00" \
  --apply-immediately
 
# Restore to a specific point in time
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier my-database \
  --target-db-instance-identifier my-database-restored \
  --restore-time "2024-01-15T14:30:00Z" \
  --db-instance-class db.r6g.large

Kubernetes Database Backups with Velero

For databases running in Kubernetes, Velero provides cluster-level backup and restore:

# velero-backup-schedule.yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: postgres-daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  template:
    includedNamespaces:
      - database
    includedResources:
      - persistentvolumeclaims
      - persistentvolumes
      - deployments
      - services
    storageLocation: s3-backup
    ttl: 720h  # 30 days retention
    hooks:
      resources:
        - name: pg-dump
          labelSelector:
            matchLabels:
              app: postgres
          exec:
            container: postgres
            command:
              - /bin/sh
              - -c
              - |
                pg_dump -U postgres -Fc mydb > /tmp/backup.dump
            onError: Fail
            timeout: 300s

Backup Encryption

Encrypt backups at rest and in transit to meet compliance requirements:

# PostgreSQL backup with GPG encryption
pg_dump -Fc mydb | \
  gpg --encrypt --recipient backup@example.com | \
  aws s3 cp - s3://backups/mydb-$(date +%Y%m%d).dump.gpg
 
# Decrypt and restore
aws s3 cp s3://backups/mydb-20240115.dump.gpg - | \
  gpg --decrypt | \
  pg_restore -d mydb --clean

Cross-Region Backup Replication

Store backups in a different region for disaster recovery:

# AWS S3 cross-region replication for backup files
aws s3api put-bucket-replication \
  --bucket primary-backup-bucket \
  --replication-configuration '{
    "Role": "arn:aws:iam::ACCOUNT:role/replication-role",
    "Rules": [{
      "Status": "Enabled",
      "Prefix": "wal/",
      "Destination": {
        "Bucket": "arn:aws:s3:::dr-backup-bucket",
        "StorageClass": "GLACIER"
      }
    }]
  }'

Backup Cost Optimization

Storage costs grow as backup retention periods increase and databases scale. Implement a tiered storage strategy that balances cost with recovery speed requirements. Compress WAL archives using zstd or gzip before uploading to object storage. Move older backups to cheaper storage tiers automatically using lifecycle policies. Monitor total backup storage consumption and set alerts for unexpected growth that could indicate issues like WAL archiving duplication or failed cleanup jobs.

Strategy	Savings	Trade-off
Compress WAL archives	60-80%	Slightly slower recovery
Tiered storage (S3 → Glacier)	70-90%	Slower retrieval for old backups
Incremental backups	50-70%	More complex recovery process
Deduplication	30-50%	CPU overhead during backup
Short hot retention + long cold	60-80%	Cold backups take longer to restore

# S3 lifecycle policy for backup cost optimization
aws s3api put-bucket-lifecycle-configuration \
  --bucket backup-bucket \
  --lifecycle-configuration '{
    "Rules": [
      {
        "ID": "TransitionToIA",
        "Filter": { "Prefix": "backups/" },
        "Status": "Enabled",
        "Transitions": [
          { "Days": 30, "StorageClass": "STANDARD_IA" },
          { "Days": 90, "StorageClass": "GLACIER" }
        ],
        "Expiration": { "Days": 365 }
      }
    ]
  }'

Conclusion

PITR transforms database recovery from a guessing game into a precise operation. The key takeaways:

WAL archiving is the foundation — Without continuous WAL archiving, PITR is impossible
Base backups + WAL = PITR — Combine periodic full backups with continuous log archiving for complete coverage
Test restores, not just backups — The only way to know your PITR works is to test it
Monitor archive health — Gaps in the WAL archive create gaps in your recovery capability
Automate verification — Manual testing is unreliable; automate restore tests on a schedule
Encrypt backups — Protect sensitive data at rest and in transit
Optimize costs — Use tiered storage and compression to manage growing backup costs

Configure WAL archiving on your most important database today. Create a base backup. Then practice a point-in-time recovery. If it works, you've just dramatically improved your data safety. If it doesn't, you've found a critical gap before it's too late.

Minh Vo

Slaying code & making it lit fr fr 🔥 tagline