MinhVo

Minh Vo

rss feed

Slaying code & making it lit fr fr πŸ”₯ tagline

Hey there πŸ‘‹ I'm an AI Engineer with 7 years of experience building scalable web and mobile applications. Currently at Neurond AI (May 2025 β€” present), architecting an Enterprise AI Assistant Platform with multi-tenant RAG on pgvector, multi-provider LLM orchestration, and Azure-native infrastructure. Previously spent 5+ years at SNAPTEC (Sep 2019 β€” Apr 2025), leading SaaS themes, admin dashboards, and e-commerce platforms β€” earned the Hero of the Year award in 2021. I specialize in TypeScript, React, Next.js, and AI-Native engineering with Claude Code and Cursor.bio

Back to blogs

Database Backup Strategies: PITR, Snapshots, and WAL

Implement database backups: point-in-time recovery, WAL archiving, and testing restores.

DatabaseBackupPITRDevOps

By MinhVo

Introduction

Point-in-time recovery (PITR) is the ability to restore a database to any specific moment in timeβ€”not just to the last backup, but to any second within your retention window. This capability transforms database recovery from a blunt instrument ("restore to last night's backup") into a precision tool ("restore to 2:37 PM, just before the bad migration ran"). PITR relies on Write-Ahead Log (WAL) archiving, which continuously captures every change made to the database.

Understanding PITR, WAL archiving, and snapshot-based backups is essential for any team operating production databases. These aren't optional luxuriesβ€”they're the difference between losing 24 hours of data and losing zero seconds. This guide dives deep into the mechanics of WAL archiving, the tradeoffs between different backup strategies, and the operational practices needed to make PITR reliable.

Database architecture

Understanding PITR and WAL: Core Concepts

Write-Ahead Logging (WAL)

Every modern database uses WAL as its fundamental durability mechanism. The principle is simple: before any change is applied to the actual data files, it's first written to a sequential log (the WAL). This ensures that even if the database crashes mid-operation, the WAL contains enough information to either complete or undo the operation during recovery.

In PostgreSQL, WAL is implemented as a sequence of 16MB segment files named with a 24-character hexadecimal number (e.g., 000000010000000000000001). Each segment contains a sequence of WAL records describing every change to the database at the byte level.

How PITR Works

PITR combines two mechanisms:

  1. Base backup: A consistent snapshot of the database at a specific point in time
  2. WAL archive: A continuous stream of WAL segments covering every change since the base backup

To recover to a specific point in time:

  1. Restore the base backup
  2. Replay WAL segments from the base backup's timestamp to the target time
  3. The database replays every transaction in order, stopping at the exact target moment

This is analogous to a video recording: the base backup is a keyframe, and the WAL segments are the frames between keyframes. You can "play" the video to any point.

Recovery Target Types

PostgreSQL supports several recovery targets:

  • recovery_target_time: Recover to a specific timestamp
  • recovery_target_xid: Recover to a specific transaction ID
  • recovery_target_lsn: Recover to a specific Log Sequence Number
  • recovery_target: 'immediate' β€” recover as soon as consistency is reached

Snapshot-Based Backups

Storage-level snapshots (EBS snapshots, ZFS snapshots, LVM snapshots) provide near-instantaneous backups by capturing the filesystem state at a specific moment. Snapshots are copy-on-writeβ€”they only store the blocks that change after the snapshot is taken, making them storage-efficient.

However, snapshots alone are not PITR. A snapshot captures the state at one moment. To achieve PITR, you need snapshots plus WAL archiving.

Architecture and Design Patterns

The Backup Pipeline

PostgreSQL β†’ WAL segments β†’ Archive (S3) β†’ Retention policy
    ↓
Base backup (weekly) β†’ S3 β†’ Lifecycle policy
    ↓
EBS snapshots (daily) β†’ Cross-region copy

The Three-Layer Strategy

LayerFrequencyRPOPurpose
WAL archivingContinuousSecondsPITR capability
Incremental backupDaily24 hoursFast full restoration
Full backupWeekly7 daysDisaster recovery baseline

Verification Architecture

Automated cron β†’ Restore backup to test DB β†’ Run integrity checks β†’
Compare with production metrics β†’ Alert on failure

Step-by-Step Implementation

Configuring WAL Archiving in PostgreSQL

-- postgresql.conf
wal_level = replica                    # Enable WAL archiving
archive_mode = on                      # Enable archive mode
archive_command = 'aws s3 cp %p s3://my-bucket/wal/%f --sse AES256'
archive_timeout = 300                  # Force archive every 5 minutes
max_wal_senders = 3                    # Allow replication connections
wal_keep_size = '1GB'                  # Keep WAL for replication

Creating a Base Backup

#!/bin/bash
# base-backup.sh
 
BACKUP_DIR="/backup/base"
S3_BUCKET="s3://my-bucket/base"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
 
# Create base backup
pg_basebackup \
  -h localhost \
  -U replicator \
  -D "$BACKUP_DIR/$TIMESTAMP" \
  -Ft \
  -z \
  -Xs \
  -P
 
# Upload to S3
aws s3 sync "$BACKUP_DIR/$TIMESTAMP" "$S3_BUCKET/$TIMESTAMP/" --sse AES256
 
# Record backup metadata
echo "{\"timestamp\": \"$TIMESTAMP\", \"lsn\": \"$(pg_current_wal_lsn())\"}" > /backup/metadata/latest.json
aws s3 cp /backup/metadata/latest.json "$S3_BUCKET/metadata.json"
 
# Cleanup local backup
rm -rf "$BACKUP_DIR/$TIMESTAMP"
 
echo "Base backup completed: $TIMESTAMP"

Point-in-Time Recovery Procedure

#!/bin/bash
# pitr-restore.sh
 
TARGET_TIME="2024-01-15 14:30:00"
S3_BUCKET="s3://my-bucket"
RESTORE_DIR="/var/lib/postgresql/data"
 
# Step 1: Download and restore base backup
LATEST_BACKUP=$(aws s3 cp "$S3_BUCKET/base/metadata.json" - | jq -r '.timestamp')
aws s3 sync "$S3_BUCKET/base/$LATEST_BACKUP/" /tmp/base/
tar -xzf /tmp/base/base.tar.gz -C "$RESTORE_DIR"
 
# Step 2: Configure recovery
cat > "$RESTORE_DIR/postgresql.auto.conf" << EOF
restore_command = 'aws s3 cp $S3_BUCKET/wal/%f %p'
recovery_target_time = '$TARGET_TIME'
recovery_target_action = 'promote'
EOF
 
# Step 3: Create recovery signal
touch "$RESTORE_DIR/recovery.signal"
 
# Step 4: Set permissions and start
chown -R postgres:postgres "$RESTORE_DIR"
pg_ctl start -D "$RESTORE_DIR"
 
# Step 5: Wait for recovery to complete
echo "Waiting for recovery to reach $TARGET_TIME..."
while pg_is_in_recovery; do
  sleep 1
done
 
echo "Recovery complete. Database is now at $(psql -t -c 'SELECT now()')"

Automated Backup Testing

#!/bin/bash
# test-backup.sh - Run weekly
 
set -euo pipefail
 
TEST_CONTAINER="backup-test-$(date +%s)"
TEST_DB="test_restore"
 
# Start test container
docker run -d --name "$TEST_CONTAINER" \
  -e POSTGRES_PASSWORD=test \
  -e POSTGRES_DB="$TEST_DB" \
  postgres:16
 
sleep 10
 
# Download latest backup
aws s3 cp s3://my-backup-bucket/daily/latest.dump /tmp/latest.dump
 
# Restore
pg_restore -h localhost -p 5433 -U postgres -d "$TEST_DB" /tmp/latest.dump
 
# Verify
TABLE_COUNT=$(psql -h localhost -p 5433 -U postgres -d "$TEST_DB" -t -c \
  "SELECT count(*) FROM information_schema.tables WHERE table_schema='public'")
 
ROW_COUNT=$(psql -h localhost -p 5433 -U postgres -d "$TEST_DB" -t -c \
  "SELECT count(*) FROM orders")
 
echo "Restored $TABLE_COUNT tables, $ROW_COUNT order rows"
 
# Cleanup
docker stop "$TEST_CONTAINER" && docker rm "$TEST_CONTAINER"
rm /tmp/latest.dump

Backup testing

Real-World Use Cases

Recovery from Bad Migration

A team ran a migration that dropped a column used by the application. They didn't realize the issue until 45 minutes later when users reported errors. Using PITR, they recovered the database to the moment before the migration ran, preserving all 45 minutes of user activity.

Recovery from Accidental Data Deletion

A support agent accidentally deleted 10,000 customer records while trying to delete one. The team used PITR to recover to the exact moment before the deletion, losing only the 30 seconds it took to identify the problem.

Recovery from Corrupted Index

A PostgreSQL bug caused index corruption after a power failure. Rather than rebuilding indexes (which would take hours on a multi-terabyte database), the team used a base backup from before the corruption and replayed WAL to the current point, skipping the corrupted transaction.

Best Practices for Production

  1. Archive WAL to a different region β€” If your database is in us-east-1, archive WAL to us-west-2. This protects against regional outages.

  2. Monitor archive lag β€” Alert if WAL archiving falls behind. A gap in the WAL archive means a gap in your recovery capability.

  3. Test PITR monthly β€” Restore to a specific point in time and verify data integrity. This validates the entire pipeline.

  4. Use archive_command with error handling β€” If the archive command fails, PostgreSQL will accumulate WAL segments until the disk fills up. Monitor disk usage.

  5. Compress WAL archives β€” WAL segments are highly compressible (typically 10:1 ratio). Use gzip or zstd to reduce storage costs.

  6. Set appropriate retention β€” Keep WAL archives for at least as long as your PITR window. Use S3 lifecycle policies to manage costs.

  7. Document the recovery procedure β€” Write a runbook with exact commands, expected durations, and verification steps. Practice it under time pressure.

  8. Use a separate archive user β€” Create a dedicated PostgreSQL user with minimal permissions for WAL archiving.

Common Pitfalls and Solutions

PitfallImpactSolution
Missing WAL segmentsPITR fails at the gapMonitor archive_command success, alert on failures
Archive disk fullDatabase stops accepting writesMonitor archive disk, use S3 instead of local disk
Base backup too oldLong replay timeCreate base backups weekly or more frequently
Not testing restoresUndiscovered corruptionAutomate monthly restore tests
Wrong recovery_target_timeRecovering to wrong momentUse timestamps with timezone, verify with SELECT now()
Permissions issuesArchive command failsTest archive_command manually, use dedicated user

Performance Optimization

WAL Compression

# Compress WAL before archiving
archive_command = 'gzip < %p | aws s3 cp - s3://bucket/wal/%f.gz --sse AES256'
 
# Decompress during recovery
restore_command = 'aws s3 cp s3://bucket/wal/%f.gz - | gunzip > %p'

Parallel WAL Archiving

# Use pg_receivewal for parallel archiving
pg_receivewal -h localhost -U replicator -D /archive/wal --synchronous &

Comparison with Alternatives

MethodRPOStorage CostRecovery SpeedComplexity
Daily pg_dump24 hoursLowSlow (hours)Low
WAL archivingSecondsMediumMedium (minutes)Medium
Streaming replicationSecondsHighInstant (failover)High
EBS snapshotsMinutesLowMediumLow
Cloud-managed (RDS)MinutesMediumFastVery low

Advanced Patterns

Cascading Replication

Primary β†’ WAL Archive (S3)
       β†’ Standby 1 β†’ Standby 2 (cascading)

This reduces load on the primary while maintaining multiple recovery points.

Delayed Standby

-- Standby that lags 1 hour behind primary
-- recovery.conf
recovery_min_apply_delay = '1h'

A delayed standby provides a 1-hour window to catch mistakes before they propagate to the standby.

Testing Strategies

#!/bin/bash
# pitr-test.sh - Verify PITR capability
 
# Record current time
TARGET_TIME=$(psql -t -c "SELECT now()::text")
 
# Insert test data
psql -c "INSERT INTO test_table (data) VALUES ('test-pitr-$(date +%s)')"
 
# Wait a moment
sleep 5
 
# Restore to TARGET_TIME
./pitr-restore.sh "$TARGET_TIME"
 
# Verify test data is absent
COUNT=$(psql -t -c "SELECT count(*) FROM test_table WHERE data LIKE 'test-pitr-%'")
if [ "$COUNT" -eq 0 ]; then
  echo "PASS: PITR correctly restored to before test data insertion"
else
  echo "FAIL: Test data found after PITR"
  exit 1
fi

Future Outlook

Cloud-managed databases are abstracting away the complexity of PITR. AWS RDS offers automatic backups with configurable retention and one-click point-in-time recovery. Google Cloud SQL provides similar capabilities with automatic WAL archiving to Cloud Storage.

However, teams running self-managed databases (for cost, compliance, or performance reasons) still need to understand these fundamentals. The principles of WAL archiving, base backups, and recovery verification apply regardless of whether you're running PostgreSQL on bare metal or using a managed service.

Backup Monitoring and Alerting

A backup strategy is only as good as your ability to verify it works. Implement monitoring that tracks backup job completion, backup size trends, and backup duration. Alert on backup failures immediately, and set up automated restore tests that verify backup integrity on a regular schedule. Track the Recovery Point Objective compliance metric by measuring the age of the most recent successful backup. If this age exceeds your RPO target, trigger an alert. Use dashboards to visualize backup trends, storage consumption, and restore test results. Integrate backup monitoring with your existing observability stack using tools like Prometheus, Grafana, or Datadog. Regular backup drills where you practice restoring to a test environment build team confidence and reveal issues before they become emergencies.

Cloud-Native Backup Strategies

AWS RDS Automated Backups

AWS RDS simplifies PITR by handling WAL archiving and snapshot management automatically:

# Enable automated backups with 35-day retention
aws rds modify-db-instance \
  --db-instance-identifier my-database \
  --backup-retention-period 35 \
  --preferred-backup-window "03:00-04:00" \
  --apply-immediately
 
# Restore to a specific point in time
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier my-database \
  --target-db-instance-identifier my-database-restored \
  --restore-time "2024-01-15T14:30:00Z" \
  --db-instance-class db.r6g.large

Kubernetes Database Backups with Velero

For databases running in Kubernetes, Velero provides cluster-level backup and restore:

# velero-backup-schedule.yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: postgres-daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  template:
    includedNamespaces:
      - database
    includedResources:
      - persistentvolumeclaims
      - persistentvolumes
      - deployments
      - services
    storageLocation: s3-backup
    ttl: 720h  # 30 days retention
    hooks:
      resources:
        - name: pg-dump
          labelSelector:
            matchLabels:
              app: postgres
          exec:
            container: postgres
            command:
              - /bin/sh
              - -c
              - |
                pg_dump -U postgres -Fc mydb > /tmp/backup.dump
            onError: Fail
            timeout: 300s

Backup Encryption

Encrypt backups at rest and in transit to meet compliance requirements:

# PostgreSQL backup with GPG encryption
pg_dump -Fc mydb | \
  gpg --encrypt --recipient backup@example.com | \
  aws s3 cp - s3://backups/mydb-$(date +%Y%m%d).dump.gpg
 
# Decrypt and restore
aws s3 cp s3://backups/mydb-20240115.dump.gpg - | \
  gpg --decrypt | \
  pg_restore -d mydb --clean

Cross-Region Backup Replication

Store backups in a different region for disaster recovery:

# AWS S3 cross-region replication for backup files
aws s3api put-bucket-replication \
  --bucket primary-backup-bucket \
  --replication-configuration '{
    "Role": "arn:aws:iam::ACCOUNT:role/replication-role",
    "Rules": [{
      "Status": "Enabled",
      "Prefix": "wal/",
      "Destination": {
        "Bucket": "arn:aws:s3:::dr-backup-bucket",
        "StorageClass": "GLACIER"
      }
    }]
  }'

Backup Cost Optimization

Storage costs grow as backup retention periods increase and databases scale. Implement a tiered storage strategy that balances cost with recovery speed requirements. Compress WAL archives using zstd or gzip before uploading to object storage. Move older backups to cheaper storage tiers automatically using lifecycle policies. Monitor total backup storage consumption and set alerts for unexpected growth that could indicate issues like WAL archiving duplication or failed cleanup jobs.

StrategySavingsTrade-off
Compress WAL archives60-80%Slightly slower recovery
Tiered storage (S3 β†’ Glacier)70-90%Slower retrieval for old backups
Incremental backups50-70%More complex recovery process
Deduplication30-50%CPU overhead during backup
Short hot retention + long cold60-80%Cold backups take longer to restore
# S3 lifecycle policy for backup cost optimization
aws s3api put-bucket-lifecycle-configuration \
  --bucket backup-bucket \
  --lifecycle-configuration '{
    "Rules": [
      {
        "ID": "TransitionToIA",
        "Filter": { "Prefix": "backups/" },
        "Status": "Enabled",
        "Transitions": [
          { "Days": 30, "StorageClass": "STANDARD_IA" },
          { "Days": 90, "StorageClass": "GLACIER" }
        ],
        "Expiration": { "Days": 365 }
      }
    ]
  }'

Conclusion

PITR transforms database recovery from a guessing game into a precise operation. The key takeaways:

  1. WAL archiving is the foundation β€” Without continuous WAL archiving, PITR is impossible
  2. Base backups + WAL = PITR β€” Combine periodic full backups with continuous log archiving for complete coverage
  3. Test restores, not just backups β€” The only way to know your PITR works is to test it
  4. Monitor archive health β€” Gaps in the WAL archive create gaps in your recovery capability
  5. Automate verification β€” Manual testing is unreliable; automate restore tests on a schedule
  6. Encrypt backups β€” Protect sensitive data at rest and in transit
  7. Optimize costs β€” Use tiered storage and compression to manage growing backup costs

Configure WAL archiving on your most important database today. Create a base backup. Then practice a point-in-time recovery. If it works, you've just dramatically improved your data safety. If it doesn't, you've found a critical gap before it's too late.