# Rollback
When to roll back vs. roll forward:
- **Roll back** when the new release is broken in a way that can't be hotfixed in under 30 minutes: a panic loop, an auth regression, data corruption.
- **Roll forward** when you can ship a hotfix faster than the rollback ceremony. A small bug behind a known feature flag is a roll-forward case.
If unsure, roll back. Bad releases compound.
## Code rollback (no migration involved)
```sh
git checkout v<previous>
make deploy ANSIBLE_INVENTORY=production
```
The deploy swaps the binary on disk, and the systemd unit runs with `Restart=on-failure`, so the process switches to the rolled-back binary on its next `ExecStart`. In-flight connections finish; new connections hit the rolled-back binary.
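One way to fill in `v<previous>` without eyeballing the tag list is sketched below. It assumes release tags are `v`-prefixed version numbers and that your `sort` supports `-V` (version sort, as in GNU coreutils). `previous_tag` is a hypothetical helper, not something that exists in the repo.

```sh
# previous_tag CURRENT
# Reads candidate tags on stdin (one per line) and prints the tag that
# sorts immediately before CURRENT in version order.
# Fails if CURRENT is not in the list or has no predecessor.
previous_tag() {
    sort -V | awk -v cur="$1" '
        $0 == cur { found = 1; exit }   # stop at the current tag
        { prev = $0 }                   # remember the last tag seen before it
        END {
            if (!found || prev == "") exit 1
            print prev
        }
    '
}

# Example (run inside the repo):
#   git tag --list 'v*' | previous_tag v1.10.0
```

Version sort matters here: a plain lexical sort would place `v1.10.0` before `v1.9.3` and pick the wrong release.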
## Code rollback when the new release added migrations
This is the dangerous case. There are three options, in order of preference:
### 1. Schema-compatible rollback (best)
If the new migration only added columns/tables that the old code ignores, the old code runs against the new schema fine. Just roll the code back; leave the schema alone.
Most of our migrations are deliberately additive for this reason.
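A rough first-pass check for "additive only" can be scripted. The sketch below assumes migrations are plain SQL files; `looks_additive` and the path in the usage comment are hypothetical. It is a keyword heuristic that errs toward caution (a SQL comment containing "drop" will trip it), so treat a pass as a hint and still read the migration.

```sh
# looks_additive
# Reads a migration's SQL on stdin. Succeeds only if it contains none of
# the keywords that usually mean existing columns, tables, or rows change:
# DROP, RENAME, TRUNCATE, DELETE, or ALTER COLUMN.
looks_additive() {
    if grep -Eiq '(^|[^[:alnum:]_])(drop|rename|truncate|delete|alter[[:space:]]+column)([^[:alnum:]_]|$)'; then
        return 1    # found something non-additive
    fi
    return 0
}

# Example (path is a placeholder):
#   looks_additive < path/to/new_migration.up.sql && echo "additive, option 1 is likely safe"
```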
### 2. Roll forward to a hotfix
If the migration changed semantics that the old code can't tolerate, ship a hotfix on top of the new release rather than reversing the migration.
### 3. Migration `down` + code rollback (last resort)
Only if (1) and (2) won't work and the data loss from `down` is acceptable.
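```sh
ssh web-01
# on web-01: roll back exactly ONE migration step
sudo -u shithub /usr/local/bin/shithubd migrate down
# verify the result before touching the code
# then leave web-01 and deploy the old tag, as in the no-migration case
git checkout v<previous>
make deploy ANSIBLE_INVENTORY=production
```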
`migrate down` rolls back exactly one step. **Never** chain `down`s without checking each migration's down logic; some of them drop columns and *will* lose data.
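If more than one step has to come down, a per-step gate makes it harder to chain them by reflex. This is a minimal sketch: `confirm_step` is a hypothetical helper and the migration name in the usage comment is made up. It forces you to type the name of the migration you have just reviewed before the `down` runs.

```sh
# confirm_step NAME
# Prompts on stderr, reads one line from stdin, and succeeds only if the
# line is exactly NAME. Anything else (including "y" or "yes") fails.
confirm_step() {
    printf 'Type "%s" to run its down migration: ' "$1" >&2
    IFS= read -r answer && [ "$answer" = "$1" ]
}

# Example (on web-01; migration name is a placeholder):
#   confirm_step 0042_add_stars && sudo -u shithub /usr/local/bin/shithubd migrate down
```

Repeat the review and the confirmation for each step rather than looping over them.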
## After any rollback
- Note the rollback in the incident channel with the from-tag and to-tag.
- File a follow-up issue with the failure mode.
- Disable any feature flags the bad release turned on.
- Confirm the rolled-back release passed CI (if not, you're now running untested code, and that's a separate incident).
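To keep incident-channel notes consistent, the first item in the checklist above could be templated. The sketch below is illustrative only: `rollback_note` and the message wording are assumptions, not an existing convention.

```sh
# rollback_note FROM_TAG TO_TAG FAILURE_MODE
# Prints a one-line note naming both tags and the failure mode.
rollback_note() {
    if [ "$#" -ne 3 ] || [ -z "$1" ] || [ -z "$2" ]; then
        echo 'usage: rollback_note FROM_TAG TO_TAG "failure mode"' >&2
        return 2
    fi
    printf 'Rolled back %s -> %s. Failure mode: %s\n' "$1" "$2" "$3"
}

# Example:
#   rollback_note v1.4.0 v1.3.2 "panic loop on startup"
```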