Introduction
Deploying a database migration to a production system serving millions of users is one of the most nerve-wracking operations in software engineering. Unlike application code, which can be rolled back in seconds by redeploying a previous version, database schema changes are often irreversible. Dropping a column deletes data permanently. Renaming a table breaks every query that references the old name. And during the migration itself, DDL operations can lock tables for minutes or hours, rendering your application unavailable.
Zero-downtime migrations solve this problem by restructuring how we think about schema changes. Instead of applying destructive changes directly, the expand-contract pattern introduces changes additively first, migrates data gradually, and removes old structures only after all application code has been updated. This approach eliminates the maintenance window requirement, enables continuous deployment of database changes, and provides a safe rollback path at every stage.
This guide covers the expand-contract pattern in detail, online schema change techniques for MySQL and PostgreSQL, practical migration strategies using Flyway and Liquibase, and battle-tested patterns for migrating large tables without impacting production traffic.
Understanding Zero-Downtime Migrations: Core Concepts
Why Traditional Migrations Cause Downtime
Traditional migration tools apply DDL statements (ALTER TABLE, CREATE INDEX, DROP COLUMN) directly to the production database. In PostgreSQL, many DDL operations acquire an ACCESS EXCLUSIVE lock on the target table, which blocks all reads and writes for the duration of the operation. In MySQL, ALTER TABLE on large InnoDB tables can take hours while holding a metadata lock.
The core problem is that DDL operations are atomic: they either complete fully or not at all. There's no way to "partially" rename a column or "mostly" drop a table. This atomicity, which is a strength for data integrity, becomes a liability when the operation takes minutes and blocks all concurrent access.
The Expand-Contract Pattern
The expand-contract pattern (also called parallel change or evolutionary database design) breaks a migration into three distinct phases:
Phase 1 — Expand: Add new structures without removing old ones. Add a new column, create a new table, or add a new index. Old application code continues to work because the old structures still exist.
Phase 2 — Migrate: Gradually move data from old structures to new ones. Update application code to read from and write to both old and new structures. This phase can take hours, days, or weeks depending on data volume.
Phase 3 — Contract: Remove old structures once all application code has been updated and verified to work with the new structures. Drop the old column, remove the old table, or delete the old index.
Each phase is independently reversible. If Phase 2 reveals problems, you can stop and roll back to Phase 1 without data loss. Only Phase 3 is destructive, and by that point, the new structures have been thoroughly validated in production.
DDL Lock Behavior in PostgreSQL vs MySQL
PostgreSQL's lock model is more granular than MySQL's. Most DDL operations in PostgreSQL acquire an ACCESS EXCLUSIVE lock, but some operations support less restrictive locks. For example, CREATE INDEX acquires a SHARE lock that allows reads but blocks writes, while CREATE INDEX CONCURRENTLY acquires only a SHARE UPDATE EXCLUSIVE lock that allows both reads and writes.
MySQL's InnoDB storage engine uses metadata locks (MDL) that are held for the duration of a DDL operation. Online DDL (available since MySQL 5.6) reduces the impact of some ALTER TABLE operations by using the ALGORITHM=INPLACE and LOCK=NONE options, but not all operations support online DDL.
Architecture and Design Patterns
Forward-Only Migrations
Forward-only migration strategies avoid DOWN migration scripts entirely. Instead of rolling back a migration, you deploy a new forward migration that fixes the problem. This simplifies the migration tooling and ensures that every database state has a clear path forward.
The rationale is that rollback scripts are often untested and may not correctly reverse complex data transformations. A forward-only approach forces you to design each migration as a safe, additive change that can be fixed with another additive change if needed.
Dual-Write Pattern
During the migrate phase of expand-contract, the application must write to both old and new columns (or tables) simultaneously. This ensures data consistency regardless of which code version reads the data.
-- Phase 1: Expand - add new column
ALTER TABLE users ADD COLUMN email_address VARCHAR(255);
-- Phase 2: Migrate - backfill new column from old
UPDATE users SET email_address = email WHERE email_address IS NULL;
-- Phase 2: Application code writes to both columns
INSERT INTO users (email, email_address, name)
VALUES ('user@example.com', 'user@example.com', 'John');
-- Phase 3: Contract - drop old column after all code is updated
ALTER TABLE users DROP COLUMN email;Shadow Table Pattern
For large tables where ALTER TABLE is too slow, the shadow table pattern creates a copy of the table with the desired schema, migrates data in batches, then atomically swaps the tables.
-- 1. Create shadow table with new schema
CREATE TABLE users_new (LIKE users INCLUDING ALL);
ALTER TABLE users_new ADD COLUMN email_address VARCHAR(255);
-- 2. Migrate data in batches
INSERT INTO users_new (id, name, email, email_address, created_at)
SELECT id, name, email, email, created_at FROM users
WHERE id BETWEEN 1 AND 100000;
-- 3. Set up triggers to capture changes during migration
CREATE OR REPLACE FUNCTION sync_users_trigger() RETURNS TRIGGER AS $$
BEGIN
IF TG_OP = 'INSERT' THEN
INSERT INTO users_new VALUES (NEW.*);
ELSIF TG_OP = 'UPDATE' THEN
UPDATE users_new SET * = NEW.* WHERE id = NEW.id;
ELSIF TG_OP = 'DELETE' THEN
DELETE FROM users_new WHERE id = OLD.id;
END IF;
RETURN NULL;
END;
$$ LANGUAGE plpgsql;
-- 4. Swap tables atomically
BEGIN;
ALTER TABLE users RENAME TO users_old;
ALTER TABLE users_new RENAME TO users;
COMMIT;Foreign Key Considerations
Foreign keys complicate zero-downtime migrations because they create dependencies between tables. Renaming a table requires updating all foreign key references. The expand-contract pattern handles this by maintaining both the old and new foreign key relationships during the transition.
-- Phase 1: Add new foreign key column
ALTER TABLE orders ADD COLUMN customer_id_new BIGINT;
-- Phase 2: Backfill and add foreign key constraint
UPDATE orders SET customer_id_new = customer_id;
ALTER TABLE orders ADD CONSTRAINT fk_orders_customer_new
FOREIGN KEY (customer_id_new) REFERENCES users(id);
-- Phase 3: Drop old foreign key after code update
ALTER TABLE orders DROP CONSTRAINT fk_orders_customer;
ALTER TABLE orders DROP COLUMN customer_id;
ALTER TABLE orders RENAME COLUMN customer_id_new TO customer_id;Step-by-Step Implementation
PostgreSQL Safe Column Rename
-- Step 1: Add new column
ALTER TABLE users ADD COLUMN full_name VARCHAR(255);
-- Step 2: Backfill from old column
UPDATE users SET full_name = name WHERE full_name IS NULL;
-- For large tables, batch the update:
UPDATE users SET full_name = name
WHERE id IN (SELECT id FROM users WHERE full_name IS NULL LIMIT 10000);
-- Step 3: Add NOT NULL constraint (after backfill is complete)
ALTER TABLE users ALTER COLUMN full_name SET NOT NULL;
-- Step 4: Update application code to read/write full_name
-- Deploy new application version
-- Step 5: Drop old column (after all app instances updated)
ALTER TABLE users DROP COLUMN name;PostgreSQL Safe Column Type Change
Changing a column type requires rewriting the table in PostgreSQL, which acquires an ACCESS EXCLUSIVE lock. The shadow table pattern avoids this:
-- Step 1: Create new column with desired type
ALTER TABLE orders ADD COLUMN total_new NUMERIC(12,2);
-- Step 2: Copy data with type conversion
UPDATE orders SET total_new = total::numeric WHERE total_new IS NULL;
-- Step 3: Add constraint
ALTER TABLE orders ALTER COLUMN total_new SET NOT NULL;
ALTER TABLE orders ADD CONSTRAINT check_total_positive CHECK (total_new >= 0);
-- Step 4: Swap columns in a transaction
BEGIN;
ALTER TABLE orders DROP COLUMN total;
ALTER TABLE orders RENAME COLUMN total_new TO total;
COMMIT;MySQL Online Schema Change with gh-ost
GitHub's gh-ost (GitHub Online Schema-change Tool) performs MySQL schema migrations without locking the source table:
# Install gh-ost
gh-ost \
--host=localhost \
--database=mydb \
--table=users \
--alter="ADD COLUMN full_name VARCHAR(255)" \
--allow-on-master \
--chunk-size=1000 \
--max-load="Threads_running=25" \
--cut-over=default \
--assume-rbr \
--executegh-ost works by:
- Creating a ghost table with the desired schema
- Copying data from the source table in chunks
- Tailing the MySQL binary log to capture changes
- Applying captured changes to the ghost table
- Performing an atomic table swap when caught up
Flyway Migration with Zero-Downtime Patterns
-- V42__add_user_email_normalized.sql
-- Expand phase: add new column
ALTER TABLE users ADD COLUMN email_normalized VARCHAR(255);
-- Backfill existing data
UPDATE users SET email_normalized = LOWER(TRIM(email))
WHERE email_normalized IS NULL;
-- Add index on new column
CREATE INDEX CONCURRENTLY idx_users_email_normalized
ON users (email_normalized);-- V43__backfill_user_email_normalized.sql
-- Continue backfilling (idempotent - safe to re-run)
UPDATE users SET email_normalized = LOWER(TRIM(email))
WHERE email_normalized IS NULL
AND id > (SELECT COALESCE(MAX(id), 0) FROM users WHERE email_normalized IS NOT NULL);-- V45__drop_user_email_old.sql
-- Contract phase: drop old column (deploy after all code updated)
-- Only run this after verifying no code references the old column
ALTER TABLE users DROP COLUMN IF EXISTS email;
ALTER TABLE users RENAME COLUMN email_normalized TO email;Liquibase Migration Strategy
<!-- changelog.xml -->
<changeSet id="expand-add-email-normalized" author="dev">
<addColumn tableName="users">
<column name="email_normalized" type="VARCHAR(255)"/>
</addColumn>
<sql>
UPDATE users SET email_normalized = LOWER(TRIM(email))
WHERE email_normalized IS NULL;
</sql>
</changeSet>
<changeSet id="contract-drop-old-email" author="dev">
<preconditions onFail="MARK_RAN">
<sqlCheck expectedResult="0">
SELECT COUNT(*) FROM information_schema.columns
WHERE table_name = 'users' AND column_name = 'email_normalized'
AND is_nullable = 'YES';
</sqlCheck>
</preconditions>
<dropColumn tableName="users" columnName="email"/>
<renameColumn tableName="users" oldColumnName="email_normalized" newColumnName="email"/>
</changeSet>Real-World Use Cases
Use Case 1: Adding a Non-Nullable Column
A SaaS platform needed to add a tenant_id column to a 500-million-row events table. Direct ALTER TABLE ADD COLUMN ... NOT NULL with a default value would lock the table for 30+ minutes in PostgreSQL 11+ (which requires a table rewrite for NOT NULL with default on older versions).
The solution used the expand-contract pattern:
-- Phase 1: Add nullable column (instant in PG 11+)
ALTER TABLE events ADD COLUMN tenant_id BIGINT;
-- Phase 2: Backfill in batches of 10,000
UPDATE events SET tenant_id = (
SELECT tenant_id FROM users WHERE users.id = events.user_id
) WHERE tenant_id IS NULL AND id BETWEEN 1 AND 10000;
-- Phase 2: Add NOT NULL after backfill (instant, no rewrite needed)
ALTER TABLE events ALTER COLUMN tenant_id SET NOT NULL;
-- Phase 3: Add foreign key
ALTER TABLE events ADD CONSTRAINT fk_events_tenant
FOREIGN KEY (tenant_id) REFERENCES tenants(id);Total downtime: zero. The backfill ran over 4 hours in the background while the application continued serving traffic normally.
Use Case 2: Changing a Primary Key Type
A startup needed to migrate user IDs from sequential integers to UUIDs. This required changing the primary key type on a table referenced by 15 other tables via foreign keys.
The migration used the shadow table pattern with gh-ost for MySQL:
- Created a new
users_v2table with UUID primary key - Added
uuidcolumn to olduserstable and populated it - Used application-level dual-write to write both integer ID and UUID
- Migrated foreign key references one table at a time using expand-contract
- Swapped tables using gh-ost's atomic rename
Total migration time: 2 weeks (phased rollout), zero downtime throughout.
Use Case 3: Index Rebuild on Write-Heavy Table
A real-time analytics table with 200 million rows needed a new composite index. CREATE INDEX without CONCURRENTLY would lock the table for 15 minutes.
-- Non-blocking index creation
CREATE INDEX CONCURRENTLY idx_analytics_tenant_time
ON analytics_events (tenant_id, created_at DESC);
-- Verify index was built successfully
SELECT indisvalid FROM pg_index
WHERE indexrelid = 'idx_analytics_tenant_time'::regclass;
-- If indisvalid = false, the index build failed and must be retriedThe CONCURRENTLY option took 45 minutes instead of 15, but the table remained fully readable and writable throughout.
Best Practices for Production
-
Every migration must be backward-compatible: New columns should be nullable or have defaults. New tables should not be required by the current application version. This ensures the old application version works during the deployment window.
-
Batch large data updates: Never run
UPDATE ... SETon millions of rows in a single transaction. Use batched updates with LIMIT and OFFSET (or cursor-based pagination) to avoid long-running transactions and WAL bloat. -
Use
CREATE INDEX CONCURRENTLYfor all production indexes: This avoids the SHARE lock that blocks writes. If the concurrent build fails, the index is left in an invalid state and must be dropped and recreated. -
Test migrations against production-sized data: A migration that takes 1 second on 1,000 test rows may take 3 hours on 100 million production rows. Use a production data clone for migration testing.
-
Keep migrations idempotent: Design migrations so they can be safely re-run if they fail partway through. Use
IF NOT EXISTS,IF EXISTS, and upsert patterns to handle partial execution. -
Separate schema changes from data backfills: Don't mix DDL and large DML in the same migration script. First add the column, then backfill data in a separate migration. This allows the schema change to complete instantly while the backfill runs asynchronously.
-
Monitor migration progress: For long-running migrations, log progress (rows processed, elapsed time, estimated remaining). Use PostgreSQL's
pg_stat_progress_create_indexto monitor index build progress. -
Have a rollback plan for every migration: Even forward-only migrations need a plan. Know how to undo the effects (add back the dropped column, reverse the data transformation) even if you don't have an automated rollback script.
Common Pitfalls and Solutions
| Pitfall | Impact | Solution |
|---|---|---|
| DDL locks blocking reads/writes | Application downtime during migration | Use CONCURRENTLY, expand-contract, or online schema change tools |
| Non-batched updates on large tables | Long transactions, WAL bloat, replication lag | Batch updates in chunks of 1,000–10,000 rows |
| Dropping columns before code is updated | Application errors, 500 responses | Follow expand-contract: drop only after all code updated |
| Not testing on production-sized data | Migration takes 100x longer than expected | Test with production data clone |
| Foreign key constraints blocking table swaps | Cannot rename or drop tables with active references | Use deferrable constraints or drop/recreate in transaction |
| Migration scripts that aren't idempotent | Partial failure leaves database in inconsistent state | Use IF NOT EXISTS, upserts, and check-then-act patterns |
Performance Optimization
Parallel Batch Updates
-- Batch update with progress tracking
DO $$
DECLARE
batch_size INT := 10000;
max_id BIGINT;
current_id BIGINT := 0;
total_updated BIGINT := 0;
BEGIN
SELECT MAX(id) INTO max_id FROM users;
WHILE current_id <= max_id LOOP
UPDATE users
SET email_normalized = LOWER(TRIM(email))
WHERE id > current_id AND id <= current_id + batch_size
AND email_normalized IS NULL;
total_updated := total_updated + batch_size;
RAISE NOTICE 'Progress: % / % rows', LEAST(current_id + batch_size, max_id), max_id;
current_id := current_id + batch_size;
COMMIT;
END LOOP;
END $$;Non-Blocking Column Addition in PostgreSQL
-- PG 11+: Adding a column with a non-volatile default is instant
-- (no table rewrite)
ALTER TABLE users ADD COLUMN status VARCHAR(20) DEFAULT 'active';
-- For volatile defaults, add nullable column then backfill
ALTER TABLE users ADD COLUMN created_at_new TIMESTAMPTZ;
UPDATE users SET created_at_new = created_at WHERE created_at_new IS NULL;
ALTER TABLE users ALTER COLUMN created_at_new SET DEFAULT NOW();Comparison with Alternatives
| Feature | Flyway | Liquibase | gh-ost | pt-online-schema-change |
|---|---|---|---|---|
| Database support | Many | Many | MySQL only | MySQL only |
| Migration format | SQL files | XML/YAML/SQL | CLI flags | CLI flags |
| Rollback support | Manual | Automatic | N/A (online) | N/A (online) |
| CI/CD integration | Excellent | Excellent | Manual | Manual |
| Concurrent DDL | Via SQL | Via SQL | Built-in | Built-in |
| Learning curve | Low | Medium | Low | Low |
| Community size | Large | Large | Large | Medium |
Advanced Patterns
Trigger-Based Dual-Write
-- Temporary trigger for dual-write during migration
CREATE OR REPLACE FUNCTION dual_write_users() RETURNS TRIGGER AS $$
BEGIN
NEW.email_normalized := LOWER(TRIM(NEW.email));
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER trg_dual_write_users
BEFORE INSERT OR UPDATE ON users
FOR EACH ROW
EXECUTE FUNCTION dual_write_users();
-- After migration is complete, remove the trigger
DROP TRIGGER trg_dual_write_users ON users;Migration Versioning with Feature Flags
-- Migration that checks feature flag before executing
DO $$
BEGIN
-- Only run contract phase if feature flag is enabled
IF current_setting('app.migration_phase', true) = 'contract' THEN
ALTER TABLE users DROP COLUMN IF EXISTS email_old;
END IF;
END $$;Testing Strategies
-- Test migration on production-sized clone
pg_dump -Fc -t users production > users.dump
pg_restore -d test_db users.dump
-- Run migration and measure time
\timing on
\i V42__add_user_email_normalized.sql
-- Verify data integrity
SELECT COUNT(*) FROM users WHERE email_normalized IS NULL;
SELECT COUNT(*) FROM users WHERE email_normalized != LOWER(TRIM(email));
-- Test rollback (if applicable)
\i V42_rollback__remove_email_normalized.sqlFuture Outlook
Database migration tooling is evolving toward GitOps-style workflows where schema changes are managed through pull requests with automated preview environments. PlanetScale's branching model allows developers to create database branches (similar to Git branches), make schema changes in isolation, and deploy via merge requests with automated diffing.
PostgreSQL's recent versions have reduced DDL lock requirements—PG 14 added ALTER TABLE ... ALTER COLUMN ... SET NOT NULL without a table scan, and future versions may support online column type changes natively. These improvements gradually reduce the need for manual expand-contract patterns.
Conclusion
Zero-downtime database migrations are achievable with disciplined application of the expand-contract pattern. By breaking schema changes into additive (expand), data-migration (migrate), and cleanup (contract) phases, you eliminate maintenance windows, reduce risk, and enable continuous deployment of database changes.
Key takeaways:
- Expand-contract is the foundation: Every schema change can be decomposed into additive and destructive phases. Always expand first, contract last.
- Batch large data operations: Never update millions of rows in a single transaction. Batch in chunks of 1,000–10,000 with commits between batches.
- Use
CREATE INDEX CONCURRENTLY: Non-blocking index creation is essential for production tables with continuous traffic. - Test with production-sized data: Migration timing on test data is meaningless. Always test with production data clones.
- Keep migrations idempotent and backward-compatible: Old application code must work during the entire migration process.
Start by auditing your current migration practices, identify operations that require maintenance windows, and refactor them using the expand-contract pattern. Your deployment pipeline and your on-call engineers will thank you.