Zero-downtime migrations
What this is: how we change database schemas without dropping any traffic.
Who it’s for: anyone adding a column, anyone renaming a table, anyone wondering why we can’t just ALTER TABLE DROP COLUMN.
What to read next: System overview, Reliability and deployment, services/_template database section.
The rule
A migration must be safe to apply against running production code, AND the new code must be safe to deploy against the un-migrated database.
This is the “expand–contract” pattern. Every schema change has at minimum two phases:
- Expand — add the new shape; old code still works.
- Migrate / dual-write — populate the new shape; both old and new code work.
- Contract — remove the old shape; only new code remains.
You can’t combine these into one PR. Doing so means there’s a moment where the running code disagrees with the running schema.
Workflow with Drizzle Kit
Each service owns its schema in services/<svc>/src/db/schema.ts. Migrations live in services/<svc>/migrations/ as numbered SQL files (0001_*.sql, 0002_*.sql, …).
# 1. Edit schema.ts to add the new field
# 2. Generate the migration
pnpm --filter @services/<svc> migrate:generate
# 3. Review the generated SQL — DRIZZLE'S OUTPUT IS A STARTING POINT, NOT A FINAL ANSWER
# 4. Hand-edit to make it expand-only if needed
# 5. Test against cloud-agent-work branch
pnpm --filter @services/<svc> migrateThe generated SQL goes in a new numbered file. Once committed, it’s permanent — never edit a previously-merged migration. Forward-only.
Allowed without thinking too hard
| Change | Safe because |
|---|---|
ADD COLUMN nullable | Old code ignores the column |
ADD COLUMN NOT NULL DEFAULT <const> | New column has a value for every row |
ADD INDEX CONCURRENTLY | Doesn’t block reads or writes |
ADD TABLE | Nothing reads it yet |
ALTER COLUMN ... DROP DEFAULT | Existing rows unaffected; new rows must specify |
CREATE TYPE <enum> (new enum) | Nothing references it yet |
ADD CONSTRAINT ... NOT VALID; VALIDATE CONSTRAINT | Splits lock acquisition from validation |
Requires the expand–contract dance
Renaming a column
ADD COLUMN new_name- Backfill
new_name = old_name - Deploy code that writes to BOTH columns (dual-write)
- Deploy code that reads from
new_name - Stop writing to
old_name DROP COLUMN old_name(after a soak period — at least one week in prod)
Changing column type (e.g. text → int)
ADD COLUMN new_col <new_type>- Backfill in chunks (don’t lock the whole table)
- Dual-write
- Cut reads over
- Drop the old column
Splitting a table
- Create the new table
- Backfill from the old
- Dual-write
- Cut reads
- Drop the old table
Adding a NOT NULL column with no default
ADD COLUMN nullable- Backfill
ALTER COLUMN SET NOT NULL(uses a brief lock, but no scan if step 2 covered every row)
Banned outright (without explicit ops approval)
DROP TABLE(use rename + soak)DROP COLUMNin same migration as adding itALTER COLUMN <not null>without a backfillALTER TABLE ... ADD CONSTRAINT FOREIGN KEYwith a long table — locks the parent- Renaming a column in one migration
- Renaming a table in one migration
- Migrations that take long-running locks during peak hours
Locks to watch
Postgres lock levels matter. Things that take AccessExclusiveLock (blocking ALL reads + writes for the duration) — careful:
ALTER TABLE ... ADD CONSTRAINT FOREIGN KEY(useNOT VALID+VALIDATEto split)ALTER TABLE ... ALTER COLUMN TYPE(use the add-new-drop-old dance)DROP TABLEREINDEX TABLE(useREINDEX CONCURRENTLY)
Things that take a brief lock then release:
ADD COLUMN(brief metadata lock, then done)ADD INDEX CONCURRENTLYADD CONSTRAINT NOT VALID
pg_stat_activity and a query for pg_locks will show what’s blocking. Runbook: docs/runbooks/migration-lock-investigation.md.
Multi-service migrations
When a schema change affects multiple services (e.g., we add an external_id field to identity’s user table AND patient-graph needs to start using it), the order is:
- Migrate identity (add column, leave nullable). Deploy.
- Identity starts populating the column. Deploy.
- Patient-graph starts reading the column. Deploy.
- Identity sets the column NOT NULL after backfill is complete. Deploy.
Every step is a separate PR with explicit verification before the next. No single PR may span the chain.
What lives where
| Artifact | Path |
|---|---|
| Schema definition | services/<svc>/src/db/schema.ts |
| Migration files | services/<svc>/migrations/ |
| Drizzle config | services/<svc>/drizzle.config.ts |
| Per-service migrate scripts | services/<svc>/package.json migrate / migrate:generate |
| The convention check | scripts/check-conventions.ts migration-brand-id-required rule |
When something goes wrong
- Pre-merge bad migration — close the PR, write a corrective migration on top, never edit the merged file.
- Deployed but pre-rollout — revert the deploy. Migration changes that haven’t been applied to prod can be unwound.
- Migration applied to prod, code broken — roll forward with a fix-up migration. Don’t try to “undo” the production schema with a reverse migration unless you have to; usually rolling forward is safer.
- Migration applied, taking locks, traffic hurting — cancel the migration session (
pg_cancel_backend); page on-call.
The runbook docs/runbooks/migration-rollback.md has the specific procedures.
Common mistakes
- Combining expand + contract in one PR. Every dev wants to ship the “clean” version; the production reality says ship the expand, then contract later.
- Forgetting the convention check. New tables MUST have
brand_id. The check catches this. - Editing a committed migration. Forward-only. If you got it wrong, fix it in a new migration.
- Not testing against a populated dataset. A migration that takes 10s on an empty dev DB might take 30 minutes on prod. Test against a representative sample.
- Long-running locks during peak hours. Migrations that need a brief lock are fine; migrations that hold a lock for minutes are not. Schedule them off-peak.
Related
- System overview
- Reliability and deployment — deploy patterns
- Brands and multi-tenancy —
brand_idis required on every new table
Source ADRs
ADR-0028 (SST + AWS), ADR-0029 (stages), ADR-0037 (service shape — schema ownership), and the migration runbook in docs/runbooks/.