# Rollback

When to roll back vs. roll forward:

- **Roll back** when the new release is broken in a way that can't
  be hot-fixed in under 30 minutes: a panic loop, an auth regression,
  data corruption.
- **Roll forward** when you can ship a hotfix faster than the
  rollback ceremony. A small bug behind a known feature flag is a
  roll-forward case.

If unsure, roll back. Bad releases compound.

## Code rollback (no migration involved)

```sh
# Replace <previous> with the previous release's version tag.
git checkout v<previous>
make deploy ANSIBLE_INVENTORY=production
```

The systemd unit sets `Restart=on-failure`; combined with the binary
swap, this means the process switches to the rolled-back binary on
its next `ExecStart`. Connections in flight finish; new connections
hit the rolled-back binary.

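Picking the tag to substitute for `<previous>` can be sketched as below. This is a minimal sketch, assuming release tags look like `v1.2.3` and that `sort -V` is available (it is a GNU/BSD extension, not POSIX). `previous_tag` is a hypothetical helper, not something the repo is known to ship.

```sh
# previous_tag CURRENT
# Hypothetical helper: reads tag names on stdin (one per line) and prints
# the tag that sorts immediately before CURRENT in version order.
# Prints nothing and returns 1 if CURRENT is missing or is the oldest tag.
previous_tag() {
    sort -V | awk -v cur="$1" '
        $0 == cur { if (prev != "") { print prev; found = 1 } exit }
        { prev = $0 }
        END { exit (found ? 0 : 1) }
    '
}
```

For example, `git tag --list 'v*' | previous_tag v1.4.0` would print the tag to check out, where `v1.4.0` stands in for the broken release.
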
## Code rollback when the new release added migrations

This is the dangerous case. There are three options, in order of
preference:

### 1. Schema-compatible rollback (best)

If the new migration only *added* columns or tables that the old code
ignores, the old code runs fine against the new schema. Just roll
the code back; leave the schema alone.

Most of our migrations are deliberately additive for this reason.

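A rough pre-check for whether a migration is additive could look like the sketch below. It assumes migrations are plain SQL files, which this runbook does not state, and `is_additive` is a hypothetical helper. It is a text match rather than a SQL parser, so treat a pass as a hint and still read the migration.

```sh
# is_additive FILE
# Hypothetical check: succeeds (exit 0) when FILE contains none of the
# keywords DROP, RENAME, or TRUNCATE as whole words (case-insensitive).
# Conservative by design: a keyword inside a SQL comment also counts as
# destructive. Returns 2 if FILE cannot be read.
is_additive() {
    [ -r "$1" ] || return 2
    if grep -Eiq '(^|[^[:alnum:]_])(drop|rename|truncate)([^[:alnum:]_]|$)' "$1"; then
        return 1
    fi
    return 0
}
```

A column named `dropped_at` does not trip the check, because the keyword must stand alone as a word.
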
### 2. Roll forward to a hotfix

If the migration changed semantics that the old code can't tolerate,
ship a hotfix on top of the new release rather than reversing the
migration.

### 3. Migration `down` + code rollback (last resort)

Only if (1) and (2) won't work and the data loss from `down` is
acceptable.

```sh
ssh web-01
sudo -u shithub /usr/local/bin/shithubd migrate down  # ONE step
# verify before continuing
git checkout v<previous>
make deploy ANSIBLE_INVENTORY=production
```

`migrate down` rolls back exactly one step. **Never** chain `down`s
without checking each migration's down logic; some of them drop
columns and *will* lose data.

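One way to make accidental chaining harder is to put a confirmation prompt in front of each single `down`. The wrapper below is a sketch only; `guarded_down` is a hypothetical name, and it simply runs whatever command it is given after the operator types `down`.

```sh
# guarded_down COMMAND [ARGS...]
# Hypothetical wrapper: asks for explicit confirmation on stderr, then runs
# COMMAND exactly once. Returns 1 without running anything if the operator
# types anything other than "down" (or stdin is closed).
guarded_down() {
    printf 'About to roll back ONE migration step. Type "down" to proceed: ' >&2
    read -r answer || answer=""
    if [ "$answer" != "down" ]; then
        echo "aborted" >&2
        return 1
    fi
    "$@"
}
```

On the host, this would be invoked as `guarded_down sudo -u shithub /usr/local/bin/shithubd migrate down`, once per step, with the down logic checked before each call.
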
## After any rollback

- Note the rollback in the incident channel with the from-tag and
  to-tag.
- File a follow-up issue with the failure mode.
- Disable any feature flags the bad release turned on.
- Confirm the rolled-back release passed CI (if not, you're now
  running untested code, which is a separate incident).
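
The incident-channel note from the first item can be kept consistent with a small formatter. This is a sketch under assumptions: `rollback_note` is a hypothetical helper, the wording of the message is made up, and posting it to the channel is left to the operator.

```sh
# rollback_note FROM_TAG TO_TAG
# Hypothetical helper: prints a one-line note for the incident channel.
# Returns 2 with a usage message if either tag is missing.
rollback_note() {
    if [ "$#" -ne 2 ] || [ -z "$1" ] || [ -z "$2" ]; then
        echo "usage: rollback_note FROM_TAG TO_TAG" >&2
        return 2
    fi
    printf 'Rolled back production from %s to %s\n' "$1" "$2"
}
```

For example, `rollback_note v1.4.0 v1.3.2` prints a line naming both tags, where the two versions are placeholders.
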