ConceptsZero-downtime migrations

Zero-downtime migrations

What this is: how we change database schemas without dropping any traffic.

Who it’s for: anyone adding a column, anyone renaming a table, anyone wondering why we can’t just ALTER TABLE DROP COLUMN.

What to read next: System overview, Reliability and deployment, services/_template database section.

The rule

A migration must be safe to apply against running production code, AND the new code must be safe to deploy against the un-migrated database.

This is the “expand–contract” pattern. Every schema change has at minimum two phases:

  1. Expand — add the new shape; old code still works.
  2. Migrate / dual-write — populate the new shape; both old and new code work.
  3. Contract — remove the old shape; only new code remains.

You can’t combine these into one PR. Doing so means there’s a moment where the running code disagrees with the running schema.

Workflow with Drizzle Kit

Each service owns its schema in services/<svc>/src/db/schema.ts. Migrations live in services/<svc>/migrations/ as numbered SQL files (0001_*.sql, 0002_*.sql, …).

# 1. Edit schema.ts to add the new field
# 2. Generate the migration
pnpm --filter @services/<svc> migrate:generate
 
# 3. Review the generated SQL — DRIZZLE'S OUTPUT IS A STARTING POINT, NOT A FINAL ANSWER
# 4. Hand-edit to make it expand-only if needed
# 5. Test against cloud-agent-work branch
pnpm --filter @services/<svc> migrate

The generated SQL goes in a new numbered file. Once committed, it’s permanent — never edit a previously-merged migration. Forward-only.

Allowed without thinking too hard

ChangeSafe because
ADD COLUMN nullableOld code ignores the column
ADD COLUMN NOT NULL DEFAULT <const>New column has a value for every row
ADD INDEX CONCURRENTLYDoesn’t block reads or writes
ADD TABLENothing reads it yet
ALTER COLUMN ... DROP DEFAULTExisting rows unaffected; new rows must specify
CREATE TYPE <enum> (new enum)Nothing references it yet
ADD CONSTRAINT ... NOT VALID; VALIDATE CONSTRAINTSplits lock acquisition from validation

Requires the expand–contract dance

Renaming a column

  1. ADD COLUMN new_name
  2. Backfill new_name = old_name
  3. Deploy code that writes to BOTH columns (dual-write)
  4. Deploy code that reads from new_name
  5. Stop writing to old_name
  6. DROP COLUMN old_name (after a soak period — at least one week in prod)

Changing column type (e.g. textint)

  1. ADD COLUMN new_col <new_type>
  2. Backfill in chunks (don’t lock the whole table)
  3. Dual-write
  4. Cut reads over
  5. Drop the old column

Splitting a table

  1. Create the new table
  2. Backfill from the old
  3. Dual-write
  4. Cut reads
  5. Drop the old table

Adding a NOT NULL column with no default

  1. ADD COLUMN nullable
  2. Backfill
  3. ALTER COLUMN SET NOT NULL (uses a brief lock, but no scan if step 2 covered every row)

Banned outright (without explicit ops approval)

  • DROP TABLE (use rename + soak)
  • DROP COLUMN in same migration as adding it
  • ALTER COLUMN <not null> without a backfill
  • ALTER TABLE ... ADD CONSTRAINT FOREIGN KEY with a long table — locks the parent
  • Renaming a column in one migration
  • Renaming a table in one migration
  • Migrations that take long-running locks during peak hours

Locks to watch

Postgres lock levels matter. Things that take AccessExclusiveLock (blocking ALL reads + writes for the duration) — careful:

  • ALTER TABLE ... ADD CONSTRAINT FOREIGN KEY (use NOT VALID + VALIDATE to split)
  • ALTER TABLE ... ALTER COLUMN TYPE (use the add-new-drop-old dance)
  • DROP TABLE
  • REINDEX TABLE (use REINDEX CONCURRENTLY)

Things that take a brief lock then release:

  • ADD COLUMN (brief metadata lock, then done)
  • ADD INDEX CONCURRENTLY
  • ADD CONSTRAINT NOT VALID

pg_stat_activity and a query for pg_locks will show what’s blocking. Runbook: docs/runbooks/migration-lock-investigation.md.

Multi-service migrations

When a schema change affects multiple services (e.g., we add an external_id field to identity’s user table AND patient-graph needs to start using it), the order is:

  1. Migrate identity (add column, leave nullable). Deploy.
  2. Identity starts populating the column. Deploy.
  3. Patient-graph starts reading the column. Deploy.
  4. Identity sets the column NOT NULL after backfill is complete. Deploy.

Every step is a separate PR with explicit verification before the next. No single PR may span the chain.

What lives where

ArtifactPath
Schema definitionservices/<svc>/src/db/schema.ts
Migration filesservices/<svc>/migrations/
Drizzle configservices/<svc>/drizzle.config.ts
Per-service migrate scriptsservices/<svc>/package.json migrate / migrate:generate
The convention checkscripts/check-conventions.ts migration-brand-id-required rule

When something goes wrong

  1. Pre-merge bad migration — close the PR, write a corrective migration on top, never edit the merged file.
  2. Deployed but pre-rollout — revert the deploy. Migration changes that haven’t been applied to prod can be unwound.
  3. Migration applied to prod, code broken — roll forward with a fix-up migration. Don’t try to “undo” the production schema with a reverse migration unless you have to; usually rolling forward is safer.
  4. Migration applied, taking locks, traffic hurting — cancel the migration session (pg_cancel_backend); page on-call.

The runbook docs/runbooks/migration-rollback.md has the specific procedures.

Common mistakes

  • Combining expand + contract in one PR. Every dev wants to ship the “clean” version; the production reality says ship the expand, then contract later.
  • Forgetting the convention check. New tables MUST have brand_id. The check catches this.
  • Editing a committed migration. Forward-only. If you got it wrong, fix it in a new migration.
  • Not testing against a populated dataset. A migration that takes 10s on an empty dev DB might take 30 minutes on prod. Test against a representative sample.
  • Long-running locks during peak hours. Migrations that need a brief lock are fine; migrations that hold a lock for minutes are not. Schedule them off-peak.

Source ADRs

ADR-0028 (SST + AWS), ADR-0029 (stages), ADR-0037 (service shape — schema ownership), and the migration runbook in docs/runbooks/.