Database Migration Strategies: Zero-Downtime Deployments

Introduction

Deploying a database migration to a production system serving millions of users is one of the most nerve-wracking operations in software engineering. Unlike application code, which can be rolled back in seconds by redeploying a previous version, database schema changes are often irreversible. Dropping a column deletes data permanently. Renaming a table breaks every query that references the old name. And during the migration itself, DDL operations can lock tables for minutes or hours, rendering your application unavailable.

Zero-downtime migrations solve this problem by restructuring how we think about schema changes. Instead of applying destructive changes directly, the expand-contract pattern introduces changes additively first, migrates data gradually, and removes old structures only after all application code has been updated. This approach eliminates the maintenance window requirement, enables continuous deployment of database changes, and provides a safe rollback path at every stage.

This guide covers the expand-contract pattern in detail, online schema change techniques for MySQL and PostgreSQL, practical migration strategies using Flyway and Liquibase, and battle-tested patterns for migrating large tables without impacting production traffic.

Understanding Zero-Downtime Migrations: Core Concepts

Why Traditional Migrations Cause Downtime

Traditional migration tools apply DDL statements (ALTER TABLE, CREATE INDEX, DROP COLUMN) directly to the production database. In PostgreSQL, many DDL operations acquire an ACCESS EXCLUSIVE lock on the target table, which blocks all reads and writes for the duration of the operation. In MySQL, ALTER TABLE on large InnoDB tables can take hours while holding a metadata lock.

The core problem is that DDL operations are atomic: they either complete fully or not at all. There's no way to "partially" rename a column or "mostly" drop a table. This atomicity, which is a strength for data integrity, becomes a liability when the operation takes minutes and blocks all concurrent access.

The Expand-Contract Pattern

The expand-contract pattern (also called parallel change or evolutionary database design) breaks a migration into three distinct phases:

Phase 1 — Expand: Add new structures without removing old ones. Add a new column, create a new table, or add a new index. Old application code continues to work because the old structures still exist.

Phase 2 — Migrate: Gradually move data from old structures to new ones. Update application code to read from and write to both old and new structures. This phase can take hours, days, or weeks depending on data volume.

Phase 3 — Contract: Remove old structures once all application code has been updated and verified to work with the new structures. Drop the old column, remove the old table, or delete the old index.

Each phase is independently reversible. If Phase 2 reveals problems, you can stop and roll back to Phase 1 without data loss. Only Phase 3 is destructive, and by that point, the new structures have been thoroughly validated in production.

DDL Lock Behavior in PostgreSQL vs MySQL

PostgreSQL's lock model is more granular than MySQL's. Most DDL operations in PostgreSQL acquire an ACCESS EXCLUSIVE lock, but some operations support less restrictive locks. For example, CREATE INDEX acquires a SHARE lock that allows reads but blocks writes, while CREATE INDEX CONCURRENTLY acquires only a SHARE UPDATE EXCLUSIVE lock that allows both reads and writes.

MySQL's InnoDB storage engine uses metadata locks (MDL) that are held for the duration of a DDL operation. Online DDL (available since MySQL 5.6) reduces the impact of some ALTER TABLE operations by using the ALGORITHM=INPLACE and LOCK=NONE options, but not all operations support online DDL.

Architecture and Design Patterns

Forward-Only Migrations

Forward-only migration strategies avoid DOWN migration scripts entirely. Instead of rolling back a migration, you deploy a new forward migration that fixes the problem. This simplifies the migration tooling and ensures that every database state has a clear path forward.

The rationale is that rollback scripts are often untested and may not correctly reverse complex data transformations. A forward-only approach forces you to design each migration as a safe, additive change that can be fixed with another additive change if needed.

Dual-Write Pattern

During the migrate phase of expand-contract, the application must write to both old and new columns (or tables) simultaneously. This ensures data consistency regardless of which code version reads the data.

-- Phase 1: Expand - add new column
ALTER TABLE users ADD COLUMN email_address VARCHAR(255);
 
-- Phase 2: Migrate - backfill new column from old
UPDATE users SET email_address = email WHERE email_address IS NULL;
 
-- Phase 2: Application code writes to both columns
INSERT INTO users (email, email_address, name)
VALUES ('user@example.com', 'user@example.com', 'John');
 
-- Phase 3: Contract - drop old column after all code is updated
ALTER TABLE users DROP COLUMN email;

Shadow Table Pattern

For large tables where ALTER TABLE is too slow, the shadow table pattern creates a copy of the table with the desired schema, migrates data in batches, then atomically swaps the tables.

-- 1. Create shadow table with new schema
CREATE TABLE users_new (LIKE users INCLUDING ALL);
ALTER TABLE users_new ADD COLUMN email_address VARCHAR(255);
 
-- 2. Migrate data in batches
INSERT INTO users_new (id, name, email, email_address, created_at)
SELECT id, name, email, email, created_at FROM users
WHERE id BETWEEN 1 AND 100000;
 
-- 3. Set up triggers to capture changes during migration
CREATE OR REPLACE FUNCTION sync_users_trigger() RETURNS TRIGGER AS $$
BEGIN
  IF TG_OP = 'INSERT' THEN
    INSERT INTO users_new VALUES (NEW.*);
  ELSIF TG_OP = 'UPDATE' THEN
    UPDATE users_new SET * = NEW.* WHERE id = NEW.id;
  ELSIF TG_OP = 'DELETE' THEN
    DELETE FROM users_new WHERE id = OLD.id;
  END IF;
  RETURN NULL;
END;
$$ LANGUAGE plpgsql;
 
-- 4. Swap tables atomically
BEGIN;
ALTER TABLE users RENAME TO users_old;
ALTER TABLE users_new RENAME TO users;
COMMIT;

Foreign Key Considerations

Foreign keys complicate zero-downtime migrations because they create dependencies between tables. Renaming a table requires updating all foreign key references. The expand-contract pattern handles this by maintaining both the old and new foreign key relationships during the transition.

-- Phase 1: Add new foreign key column
ALTER TABLE orders ADD COLUMN customer_id_new BIGINT;
 
-- Phase 2: Backfill and add foreign key constraint
UPDATE orders SET customer_id_new = customer_id;
ALTER TABLE orders ADD CONSTRAINT fk_orders_customer_new
  FOREIGN KEY (customer_id_new) REFERENCES users(id);
 
-- Phase 3: Drop old foreign key after code update
ALTER TABLE orders DROP CONSTRAINT fk_orders_customer;
ALTER TABLE orders DROP COLUMN customer_id;
ALTER TABLE orders RENAME COLUMN customer_id_new TO customer_id;

Step-by-Step Implementation

PostgreSQL Safe Column Rename

-- Step 1: Add new column
ALTER TABLE users ADD COLUMN full_name VARCHAR(255);
 
-- Step 2: Backfill from old column
UPDATE users SET full_name = name WHERE full_name IS NULL;
-- For large tables, batch the update:
UPDATE users SET full_name = name
WHERE id IN (SELECT id FROM users WHERE full_name IS NULL LIMIT 10000);
 
-- Step 3: Add NOT NULL constraint (after backfill is complete)
ALTER TABLE users ALTER COLUMN full_name SET NOT NULL;
 
-- Step 4: Update application code to read/write full_name
-- Deploy new application version
 
-- Step 5: Drop old column (after all app instances updated)
ALTER TABLE users DROP COLUMN name;

PostgreSQL Safe Column Type Change

Changing a column type requires rewriting the table in PostgreSQL, which acquires an ACCESS EXCLUSIVE lock. The shadow table pattern avoids this:

-- Step 1: Create new column with desired type
ALTER TABLE orders ADD COLUMN total_new NUMERIC(12,2);
 
-- Step 2: Copy data with type conversion
UPDATE orders SET total_new = total::numeric WHERE total_new IS NULL;
 
-- Step 3: Add constraint
ALTER TABLE orders ALTER COLUMN total_new SET NOT NULL;
ALTER TABLE orders ADD CONSTRAINT check_total_positive CHECK (total_new >= 0);
 
-- Step 4: Swap columns in a transaction
BEGIN;
ALTER TABLE orders DROP COLUMN total;
ALTER TABLE orders RENAME COLUMN total_new TO total;
COMMIT;

MySQL Online Schema Change with gh-ost

GitHub's gh-ost (GitHub Online Schema-change Tool) performs MySQL schema migrations without locking the source table:

# Install gh-ost
gh-ost \
  --host=localhost \
  --database=mydb \
  --table=users \
  --alter="ADD COLUMN full_name VARCHAR(255)" \
  --allow-on-master \
  --chunk-size=1000 \
  --max-load="Threads_running=25" \
  --cut-over=default \
  --assume-rbr \
  --execute

gh-ost works by:

Creating a ghost table with the desired schema
Copying data from the source table in chunks
Tailing the MySQL binary log to capture changes
Applying captured changes to the ghost table
Performing an atomic table swap when caught up

Flyway Migration with Zero-Downtime Patterns

-- V42__add_user_email_normalized.sql
-- Expand phase: add new column
ALTER TABLE users ADD COLUMN email_normalized VARCHAR(255);
 
-- Backfill existing data
UPDATE users SET email_normalized = LOWER(TRIM(email))
WHERE email_normalized IS NULL;
 
-- Add index on new column
CREATE INDEX CONCURRENTLY idx_users_email_normalized
  ON users (email_normalized);

-- V43__backfill_user_email_normalized.sql
-- Continue backfilling (idempotent - safe to re-run)
UPDATE users SET email_normalized = LOWER(TRIM(email))
WHERE email_normalized IS NULL
AND id > (SELECT COALESCE(MAX(id), 0) FROM users WHERE email_normalized IS NOT NULL);

-- V45__drop_user_email_old.sql
-- Contract phase: drop old column (deploy after all code updated)
-- Only run this after verifying no code references the old column
ALTER TABLE users DROP COLUMN IF EXISTS email;
ALTER TABLE users RENAME COLUMN email_normalized TO email;

Liquibase Migration Strategy

<!-- changelog.xml -->
<changeSet id="expand-add-email-normalized" author="dev">
  <addColumn tableName="users">
    <column name="email_normalized" type="VARCHAR(255)"/>
  </addColumn>
  <sql>
    UPDATE users SET email_normalized = LOWER(TRIM(email))
    WHERE email_normalized IS NULL;
  </sql>
</changeSet>
 
<changeSet id="contract-drop-old-email" author="dev">
  <preconditions onFail="MARK_RAN">
    <sqlCheck expectedResult="0">
      SELECT COUNT(*) FROM information_schema.columns
      WHERE table_name = 'users' AND column_name = 'email_normalized'
      AND is_nullable = 'YES';
    </sqlCheck>
  </preconditions>
  <dropColumn tableName="users" columnName="email"/>
  <renameColumn tableName="users" oldColumnName="email_normalized" newColumnName="email"/>
</changeSet>

Real-World Use Cases

Use Case 1: Adding a Non-Nullable Column

A SaaS platform needed to add a tenant_id column to a 500-million-row events table. Direct ALTER TABLE ADD COLUMN ... NOT NULL with a default value would lock the table for 30+ minutes in PostgreSQL 11+ (which requires a table rewrite for NOT NULL with default on older versions).

The solution used the expand-contract pattern:

-- Phase 1: Add nullable column (instant in PG 11+)
ALTER TABLE events ADD COLUMN tenant_id BIGINT;
 
-- Phase 2: Backfill in batches of 10,000
UPDATE events SET tenant_id = (
  SELECT tenant_id FROM users WHERE users.id = events.user_id
) WHERE tenant_id IS NULL AND id BETWEEN 1 AND 10000;
 
-- Phase 2: Add NOT NULL after backfill (instant, no rewrite needed)
ALTER TABLE events ALTER COLUMN tenant_id SET NOT NULL;
 
-- Phase 3: Add foreign key
ALTER TABLE events ADD CONSTRAINT fk_events_tenant
  FOREIGN KEY (tenant_id) REFERENCES tenants(id);

Total downtime: zero. The backfill ran over 4 hours in the background while the application continued serving traffic normally.

Use Case 2: Changing a Primary Key Type

A startup needed to migrate user IDs from sequential integers to UUIDs. This required changing the primary key type on a table referenced by 15 other tables via foreign keys.

The migration used the shadow table pattern with gh-ost for MySQL:

Created a new users_v2 table with UUID primary key
Added uuid column to old users table and populated it
Used application-level dual-write to write both integer ID and UUID
Migrated foreign key references one table at a time using expand-contract
Swapped tables using gh-ost's atomic rename

Total migration time: 2 weeks (phased rollout), zero downtime throughout.

Use Case 3: Index Rebuild on Write-Heavy Table

A real-time analytics table with 200 million rows needed a new composite index. CREATE INDEX without CONCURRENTLY would lock the table for 15 minutes.

-- Non-blocking index creation
CREATE INDEX CONCURRENTLY idx_analytics_tenant_time
  ON analytics_events (tenant_id, created_at DESC);
 
-- Verify index was built successfully
SELECT indisvalid FROM pg_index
WHERE indexrelid = 'idx_analytics_tenant_time'::regclass;
-- If indisvalid = false, the index build failed and must be retried

The CONCURRENTLY option took 45 minutes instead of 15, but the table remained fully readable and writable throughout.

Best Practices for Production

Every migration must be backward-compatible: New columns should be nullable or have defaults. New tables should not be required by the current application version. This ensures the old application version works during the deployment window.
Batch large data updates: Never run UPDATE ... SET on millions of rows in a single transaction. Use batched updates with LIMIT and OFFSET (or cursor-based pagination) to avoid long-running transactions and WAL bloat.
Use CREATE INDEX CONCURRENTLY for all production indexes: This avoids the SHARE lock that blocks writes. If the concurrent build fails, the index is left in an invalid state and must be dropped and recreated.
Test migrations against production-sized data: A migration that takes 1 second on 1,000 test rows may take 3 hours on 100 million production rows. Use a production data clone for migration testing.
Keep migrations idempotent: Design migrations so they can be safely re-run if they fail partway through. Use IF NOT EXISTS, IF EXISTS, and upsert patterns to handle partial execution.
Separate schema changes from data backfills: Don't mix DDL and large DML in the same migration script. First add the column, then backfill data in a separate migration. This allows the schema change to complete instantly while the backfill runs asynchronously.
Monitor migration progress: For long-running migrations, log progress (rows processed, elapsed time, estimated remaining). Use PostgreSQL's pg_stat_progress_create_index to monitor index build progress.
Have a rollback plan for every migration: Even forward-only migrations need a plan. Know how to undo the effects (add back the dropped column, reverse the data transformation) even if you don't have an automated rollback script.

Common Pitfalls and Solutions

Pitfall	Impact	Solution
DDL locks blocking reads/writes	Application downtime during migration	Use CONCURRENTLY, expand-contract, or online schema change tools
Non-batched updates on large tables	Long transactions, WAL bloat, replication lag	Batch updates in chunks of 1,000–10,000 rows
Dropping columns before code is updated	Application errors, 500 responses	Follow expand-contract: drop only after all code updated
Not testing on production-sized data	Migration takes 100x longer than expected	Test with production data clone
Foreign key constraints blocking table swaps	Cannot rename or drop tables with active references	Use deferrable constraints or drop/recreate in transaction
Migration scripts that aren't idempotent	Partial failure leaves database in inconsistent state	Use IF NOT EXISTS, upserts, and check-then-act patterns

Performance Optimization

Parallel Batch Updates

-- Batch update with progress tracking
DO $$
DECLARE
  batch_size INT := 10000;
  max_id BIGINT;
  current_id BIGINT := 0;
  total_updated BIGINT := 0;
BEGIN
  SELECT MAX(id) INTO max_id FROM users;
 
  WHILE current_id <= max_id LOOP
    UPDATE users
    SET email_normalized = LOWER(TRIM(email))
    WHERE id > current_id AND id <= current_id + batch_size
      AND email_normalized IS NULL;
 
    total_updated := total_updated + batch_size;
    RAISE NOTICE 'Progress: % / % rows', LEAST(current_id + batch_size, max_id), max_id;
 
    current_id := current_id + batch_size;
    COMMIT;
  END LOOP;
END $$;

Non-Blocking Column Addition in PostgreSQL

-- PG 11+: Adding a column with a non-volatile default is instant
-- (no table rewrite)
ALTER TABLE users ADD COLUMN status VARCHAR(20) DEFAULT 'active';
 
-- For volatile defaults, add nullable column then backfill
ALTER TABLE users ADD COLUMN created_at_new TIMESTAMPTZ;
UPDATE users SET created_at_new = created_at WHERE created_at_new IS NULL;
ALTER TABLE users ALTER COLUMN created_at_new SET DEFAULT NOW();

Comparison with Alternatives

Feature	Flyway	Liquibase	gh-ost	pt-online-schema-change
Database support	Many	Many	MySQL only	MySQL only
Migration format	SQL files	XML/YAML/SQL	CLI flags	CLI flags
Rollback support	Manual	Automatic	N/A (online)	N/A (online)
CI/CD integration	Excellent	Excellent	Manual	Manual
Concurrent DDL	Via SQL	Via SQL	Built-in	Built-in
Learning curve	Low	Medium	Low	Low
Community size	Large	Large	Large	Medium

Advanced Patterns

Trigger-Based Dual-Write

-- Temporary trigger for dual-write during migration
CREATE OR REPLACE FUNCTION dual_write_users() RETURNS TRIGGER AS $$
BEGIN
  NEW.email_normalized := LOWER(TRIM(NEW.email));
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;
 
CREATE TRIGGER trg_dual_write_users
  BEFORE INSERT OR UPDATE ON users
  FOR EACH ROW
  EXECUTE FUNCTION dual_write_users();
 
-- After migration is complete, remove the trigger
DROP TRIGGER trg_dual_write_users ON users;

Migration Versioning with Feature Flags

-- Migration that checks feature flag before executing
DO $$
BEGIN
  -- Only run contract phase if feature flag is enabled
  IF current_setting('app.migration_phase', true) = 'contract' THEN
    ALTER TABLE users DROP COLUMN IF EXISTS email_old;
  END IF;
END $$;

Testing Strategies

-- Test migration on production-sized clone
pg_dump -Fc -t users production > users.dump
pg_restore -d test_db users.dump
 
-- Run migration and measure time
\timing on
\i V42__add_user_email_normalized.sql
 
-- Verify data integrity
SELECT COUNT(*) FROM users WHERE email_normalized IS NULL;
SELECT COUNT(*) FROM users WHERE email_normalized != LOWER(TRIM(email));
 
-- Test rollback (if applicable)
\i V42_rollback__remove_email_normalized.sql

Future Outlook

Database migration tooling is evolving toward GitOps-style workflows where schema changes are managed through pull requests with automated preview environments. PlanetScale's branching model allows developers to create database branches (similar to Git branches), make schema changes in isolation, and deploy via merge requests with automated diffing.

PostgreSQL's recent versions have reduced DDL lock requirements—PG 14 added ALTER TABLE ... ALTER COLUMN ... SET NOT NULL without a table scan, and future versions may support online column type changes natively. These improvements gradually reduce the need for manual expand-contract patterns.

Conclusion

Zero-downtime database migrations are achievable with disciplined application of the expand-contract pattern. By breaking schema changes into additive (expand), data-migration (migrate), and cleanup (contract) phases, you eliminate maintenance windows, reduce risk, and enable continuous deployment of database changes.

Key takeaways:

Expand-contract is the foundation: Every schema change can be decomposed into additive and destructive phases. Always expand first, contract last.
Batch large data operations: Never update millions of rows in a single transaction. Batch in chunks of 1,000–10,000 with commits between batches.
Use CREATE INDEX CONCURRENTLY: Non-blocking index creation is essential for production tables with continuous traffic.
Test with production-sized data: Migration timing on test data is meaningless. Always test with production data clones.
Keep migrations idempotent and backward-compatible: Old application code must work during the entire migration process.

Start by auditing your current migration practices, identify operations that require maintenance windows, and refactor them using the expand-contract pattern. Your deployment pipeline and your on-call engineers will thank you.

Minh Vo

Slaying code & making it lit fr fr 🔥 tagline