tenseleyflow/shithub / 107744b

deploy: wire actions retention schedule and docs

Authored by mfwolffe <wolffemf@dukes.jmu.edu>
SHA: 107744b1b88e942bf6eca5bd070b72dd299be73b
Parents: 8cc4d79
Tree: 6c820d8

10 changed files

Status  File                                        +   -
A       deploy/cutover/apply-actions-lifecycle.sh   31  0
A       deploy/spaces/actions-lifecycle.json        17  0
M       deploy/systemd/shithubd-cron.service        1   0
M       deploy/systemd/shithubd-cron.timer          3   3
M       docs/internal/actions-runner-api.md         6   0
M       docs/internal/actions-schema.md             43  4
M       docs/internal/deploy.md                     4   1
M       docs/internal/runbooks/actions-runner.md    29  2
M       docs/internal/storage.md                    2   1
M       docs/internal/worker.md                     2   1
deploy/cutover/apply-actions-lifecycle.sh added
@@ -0,0 +1,31 @@
+#!/usr/bin/env bash
+# SPDX-License-Identifier: AGPL-3.0-or-later
+#
+# Apply the Actions object retention policy to the primary object bucket.
+# Runs from the operator laptop after s3cmd is configured for the same
+# DigitalOcean Spaces account used by shithub's S3 object storage.
+#
+# Usage:
+#   SHITHUB_OBJECT_BUCKET=shithub-prod-objects \
+#   ./deploy/cutover/apply-actions-lifecycle.sh
+
+set -euo pipefail
+
+BUCKET="${SHITHUB_OBJECT_BUCKET:?set SHITHUB_OBJECT_BUCKET to the object bucket name}"
+LIFECYCLE_FILE="${LIFECYCLE_FILE:-deploy/spaces/actions-lifecycle.json}"
+S3CMD="${S3CMD:-s3cmd}"
+
+if ! command -v "$S3CMD" >/dev/null 2>&1; then
+        echo "fatal: $S3CMD not on PATH; install/configure s3cmd first" >&2
+        exit 2
+fi
+if [[ ! -f "$LIFECYCLE_FILE" ]]; then
+        echo "fatal: lifecycle file not found: $LIFECYCLE_FILE" >&2
+        exit 2
+fi
+
+echo "applying Actions lifecycle to s3://$BUCKET from $LIFECYCLE_FILE" >&2
+"$S3CMD" setlifecycle "$LIFECYCLE_FILE" "s3://$BUCKET"
+
+echo "current lifecycle for s3://$BUCKET:" >&2
+"$S3CMD" getlifecycle "s3://$BUCKET"
deploy/spaces/actions-lifecycle.json added
@@ -0,0 +1,17 @@
+{
+  "_comment": "DigitalOcean Spaces lifecycle for the primary shithub object bucket. Apply with `s3cmd setlifecycle actions-lifecycle.json s3://<object-bucket>`. Actions logs and artifacts live under actions/runs/; DB metadata retention is handled by workflow:cleanup.",
+  "Rules": [
+    {
+      "ID": "actions-runs-90day-retention",
+      "Status": "Enabled",
+      "Filter": {"Prefix": "actions/runs/"},
+      "Expiration": {"Days": 90}
+    },
+    {
+      "ID": "actions-abort-stale-multipart",
+      "Status": "Enabled",
+      "Filter": {"Prefix": "actions/runs/"},
+      "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 2}
+    }
+  ]
+}
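The lifecycle document above can be checked locally before it is applied. The following is a minimal sketch in Go, assuming only the JSON shape shown in this hunk; the `rule`, `lifecycle`, and `summarize` names are hypothetical and do not come from the repository, and the embedded `sample` shortens the `_comment` value.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// rule mirrors only the fields that appear in actions-lifecycle.json.
// Unknown keys such as "_comment" are ignored by encoding/json.
type rule struct {
	ID     string `json:"ID"`
	Status string `json:"Status"`
	Filter struct {
		Prefix string `json:"Prefix"`
	} `json:"Filter"`
	Expiration *struct {
		Days int `json:"Days"`
	} `json:"Expiration"`
	AbortIncompleteMultipartUpload *struct {
		DaysAfterInitiation int `json:"DaysAfterInitiation"`
	} `json:"AbortIncompleteMultipartUpload"`
}

type lifecycle struct {
	Rules []rule `json:"Rules"`
}

// summarize parses a lifecycle document and returns one line per rule.
// It fails if any rule is scoped outside wantPrefix (hypothetical check).
func summarize(doc []byte, wantPrefix string) ([]string, error) {
	var lc lifecycle
	if err := json.Unmarshal(doc, &lc); err != nil {
		return nil, fmt.Errorf("parse lifecycle: %w", err)
	}
	var out []string
	for _, r := range lc.Rules {
		if !strings.HasPrefix(r.Filter.Prefix, wantPrefix) {
			return nil, fmt.Errorf("rule %q has prefix %q outside %q", r.ID, r.Filter.Prefix, wantPrefix)
		}
		switch {
		case r.Expiration != nil:
			out = append(out, fmt.Sprintf("%s: expire after %d days", r.ID, r.Expiration.Days))
		case r.AbortIncompleteMultipartUpload != nil:
			out = append(out, fmt.Sprintf("%s: abort incomplete multipart uploads after %d days",
				r.ID, r.AbortIncompleteMultipartUpload.DaysAfterInitiation))
		default:
			out = append(out, fmt.Sprintf("%s: no action", r.ID))
		}
	}
	return out, nil
}

// sample reproduces the rules from the hunk above with a shortened comment.
const sample = `{
  "_comment": "shortened for this sketch",
  "Rules": [
    {"ID": "actions-runs-90day-retention", "Status": "Enabled",
     "Filter": {"Prefix": "actions/runs/"}, "Expiration": {"Days": 90}},
    {"ID": "actions-abort-stale-multipart", "Status": "Enabled",
     "Filter": {"Prefix": "actions/runs/"},
     "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 2}}
  ]
}`

func main() {
	lines, err := summarize([]byte(sample), "actions/runs/")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, l := range lines {
		fmt.Println(l)
	}
}
```

The prefix check is the useful part: both rules share one bucket with avatars and backups, so a rule that accidentally drops the `actions/runs/` filter would expire unrelated objects.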
deploy/systemd/shithubd-cron.service modified
@@ -13,3 +13,4 @@ EnvironmentFile=/etc/shithub/worker.env
 ExecStart=/usr/local/bin/shithubd admin run-job lifecycle:sweep
 ExecStart=/usr/local/bin/shithubd admin run-job jobs:purge_completed
 ExecStart=/usr/local/bin/shithubd admin run-job webhook:purge_old
+ExecStart=/usr/local/bin/shithubd admin run-job workflow:cleanup
deploy/systemd/shithubd-cron.timer modified
@@ -2,9 +2,9 @@
 Description=shithub periodic housekeeping timer
 
 [Timer]
-# Every hour at :07 — odd minute keeps it off shared-hour boundaries.
-OnCalendar=hourly
-# Run once at boot, ~5min in, so a fresh deploy doesn't wait an hour.
+# Daily at 03:30 UTC, after the 03:17 database backup window.
+OnCalendar=*-*-* 03:30:00 UTC
+# Run once at boot, ~5min in, so a fresh deploy doesn't wait a day.
 OnBootSec=5min
 Persistent=true
 Unit=shithubd-cron.service
docs/internal/actions-runner-api.md modified
@@ -41,6 +41,11 @@ then inserts `jti` into `runner_jwt_used`. A replay returns 401. To
 support multi-step runner flows, successful in-flight job endpoints
 return `next_token` and `next_token_expires_at`.
 
+Consumed JWT rows are retained for 30 days after token expiry, then
+pruned by the daily `workflow:cleanup` worker. This keeps the replay
+gate audit trail available for recent jobs without letting the table
+grow unbounded.
+
 `shithubd-runner` consumes the same token chain: it claims with the
 registration token, marks the job `running` with the first job JWT, then
 uses each returned `next_token` serially for log chunks, step-status
@@ -203,3 +208,4 @@ runner posts terminal job status `cancelled`.
 - `shithub_actions_runner_jwt_total{result="issued|rejected|replay"}`
 - `shithub_actions_jobs_cancelled_total{reason="user|concurrency|timeout"}`
 - `shithub_actions_log_scrub_replacements_total{location="server"}`
+- `shithub_actions_runs_pruned_total{kind="chunks|blobs|runs|jwt_used"}`
docs/internal/actions-schema.md modified
@@ -12,10 +12,11 @@ without churning under them.
 
 ## SQL schema
 
-Actions migrations currently span 0042–0051, 0053, and 0057. Migration
-0052 belongs to the repo source-remotes feature, 0054 belongs to push
-event protocol tracking, 0055 belongs to the social feed, and 0056
-belongs to user profile contribution settings.
+Actions migrations currently span 0042–0051, 0053, 0057, and 0060.
+Migration 0052 belongs to the repo source-remotes feature, 0054
+belongs to push event protocol tracking, 0055 belongs to the social
+feed, 0056 belongs to user profile contribution settings, 0058 belongs
+to repo name reuse, and 0059 belongs to GitHub org imports.
 
 | #     | Table                       | Purpose                                                       |
 | ----- | --------------------------- | ------------------------------------------------------------- |
@@ -31,6 +32,7 @@ belongs to user profile contribution settings.
 | 0051  | `workflow_runs.trigger_event_id` | Trigger idempotency for retries/admin replays            |
 | 0053  | `runner_jwt_used`           | Single-use replay gate for runner job JWTs                    |
 | 0057  | `workflow_job_secret_masks` | Encrypted claim-time log mask snapshots per job               |
+| 0060  | Actions retention indexes   | Narrow cleanup indexes for terminal steps/runs                |
 
 A few load-bearing choices, called out so they're easy to spot in a
 later schema diff:
@@ -376,6 +378,43 @@ Other admin surfaces are scoped to later sub-sprints:
   UI re-run completed/cancelled runs. Re-runs read the workflow YAML
   from the original run's `head_sha`, create a fresh queued
   `workflow_runs` row, and set `parent_run_id` to the source run.
+- S41g: `workflow:cleanup` is a daily retention worker enqueued by
+  `shithubd-cron.service`. Operators can run it manually with
+  `shithubd admin run-job workflow:cleanup`.
+
+## Retention cleanup (S41g)
+
+`workflow:cleanup` applies the durable Actions retention contract in
+this order:
+
+1. Delete hot `workflow_step_log_chunks` for steps completed more than
+   7 days ago. Finalized logs already live in object storage.
+2. Delete expired `workflow_artifacts` rows after deleting their
+   `actions/runs/...` blob objects. The row's `expires_at` value is
+   authoritative so per-upload retention overrides keep working.
+3. Delete unpinned terminal `workflow_runs` older than 365 days. Child
+   jobs, steps, artifacts, and consumed JWT rows cascade through FK
+   ownership.
+4. Delete consumed `runner_jwt_used` rows whose JWT expiry is more than
+   30 days old. This preserves replay/audit evidence for recent jobs
+   without letting the replay table grow forever.
+
+The defaults can be overridden in the worker payload:
+
+```json
+{"step_log_chunk_days":7,"run_days":365,"jwt_used_days":30,"artifact_batch":1000}
+```
+
+`artifact_batch` caps each object-delete page and may not exceed 10000.
+Negative values are poison-job errors. The worker exports
+`shithub_actions_runs_pruned_total{kind}` where `kind` is one of
+`chunks`, `blobs`, `runs`, or `jwt_used`.
+
+Production object storage also needs provider-side lifecycle on the
+same prefix: `deploy/spaces/actions-lifecycle.json` expires
+`actions/runs/` objects after 90 days and aborts stale multipart
+uploads after 2 days. Apply it with
+`deploy/cutover/apply-actions-lifecycle.sh`.
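The payload rules in the Retention cleanup (S41g) text above can be illustrated with a short sketch in Go. This is not the worker's actual handler: the `cleanupPayload` type and `parseCleanupPayload` function are hypothetical, and the sketch assumes that omitted keys fall back to the documented defaults. How the real worker treats an explicit zero is not stated in the docs, so the sketch passes zero through unchanged.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// maxArtifactBatch is the documented upper bound for artifact_batch.
const maxArtifactBatch = 10000

// cleanupPayload uses the JSON keys from the documented payload.
type cleanupPayload struct {
	StepLogChunkDays int `json:"step_log_chunk_days"`
	RunDays          int `json:"run_days"`
	JWTUsedDays      int `json:"jwt_used_days"`
	ArtifactBatch    int `json:"artifact_batch"`
}

// parseCleanupPayload starts from the documented defaults, overlays any
// keys present in raw, and rejects values the docs describe as invalid.
func parseCleanupPayload(raw []byte) (cleanupPayload, error) {
	p := cleanupPayload{
		StepLogChunkDays: 7,
		RunDays:          365,
		JWTUsedDays:      30,
		ArtifactBatch:    1000,
	}
	if len(raw) > 0 {
		// Keys absent from raw leave the defaults above untouched.
		if err := json.Unmarshal(raw, &p); err != nil {
			return p, fmt.Errorf("decode payload: %w", err)
		}
	}
	if p.StepLogChunkDays < 0 || p.RunDays < 0 || p.JWTUsedDays < 0 || p.ArtifactBatch < 0 {
		return p, errors.New("negative retention value")
	}
	if p.ArtifactBatch > maxArtifactBatch {
		return p, fmt.Errorf("artifact_batch %d exceeds %d", p.ArtifactBatch, maxArtifactBatch)
	}
	return p, nil
}

func main() {
	p, err := parseCleanupPayload([]byte(`{"run_days":180}`))
	fmt.Printf("%+v err=%v\n", p, err)

	_, err = parseCleanupPayload([]byte(`{"artifact_batch":20000}`))
	fmt.Println("oversized batch:", err)
}
```

Validating up front matches the "poison-job" wording: a payload that can never succeed should fail immediately rather than be retried.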
 
 ## Trigger pipeline (S41b)
 
docs/internal/deploy.md modified
@@ -128,7 +128,10 @@ Two layers, both mandatory:
 Cross-region copy (`deploy/spaces/sync-cross-region.sh`) mirrors
 both buckets to a second region for DR. Lifecycle in
 `deploy/spaces/lifecycle.json` prunes WAL after 30 days and dumps
-after 90.
+after 90. Actions log/artifact objects use the primary object bucket's
+`actions/runs/` prefix; apply `deploy/spaces/actions-lifecycle.json`
+with `deploy/cutover/apply-actions-lifecycle.sh` so provider-side blob
+retention matches the `workflow:cleanup` database sweep.
 
 The recovery target is **PITR within 30 days, full restore within
 1 hour**. We verify this every quarter with the restore drill —
docs/internal/runbooks/actions-runner.md modified
@@ -183,5 +183,32 @@ Expected results:
 - The parent `workflow_runs` row rolls up to completed/success when all
   jobs are terminal.
 - The PR Checks tab shows the matching check run as success.
-- `/metrics` includes runner registration, heartbeat, JWT, and job
-  cancellation counters.
+- `/metrics` includes runner registration, heartbeat, JWT, job
+  cancellation, log-scrub, and retention counters.
+
+## Retention Sweep
+
+The daily housekeeping timer enqueues `workflow:cleanup` at 03:30 UTC,
+after the 03:17 backup window:
+
+```sh
+systemctl list-timers shithubd-cron.timer
+journalctl -u shithubd-cron.service -n 100
+```
+
+Manual smoke:
+
+```sh
+shithubd admin run-job workflow:cleanup
+```
+
+Expected behavior:
+
+- SQL log chunks older than 7 days for terminal steps are deleted.
+- Expired artifact rows are deleted only after their `actions/runs/...`
+  objects are deleted from object storage.
+- Unpinned terminal workflow runs older than 365 days are pruned;
+  pinned runs survive.
+- Consumed runner JWT rows older than 30 days are pruned.
+- `/metrics` exposes
+  `shithub_actions_runs_pruned_total{kind="chunks|blobs|runs|jwt_used"}`.
docs/internal/storage.md modified
@@ -38,6 +38,7 @@ avatars/<user_id>/<hash>.png # largest rendered avatar variant
 avatars/<user_id>/<hash>-<size>.png   # smaller rendered avatar variants
 avatars/orgs/<org_id>/<hash>.png      # largest rendered org avatar variant
 avatars/orgs/<org_id>/<hash>-<size>.png
+actions/runs/<run_id>/...             # Actions logs + artifacts
 backups/...                           # S37
 ```
 
@@ -56,7 +57,7 @@ Avatar uploads are decoded from PNG, JPEG, or GIF and re-encoded to PNG before s
 The two backends share an interface but behavioral edges differ:
 
 - **Path-style addressing.** MinIO needs `force_path_style=true`. Spaces supports virtual-host-style (the default).
-- **Lifecycle rules.** Spaces and MinIO honor different subsets of the S3 lifecycle XML. Apply rules through their respective consoles, not via the SDK.
+- **Lifecycle rules.** Spaces and MinIO honor different subsets of the S3 lifecycle XML. Apply rules through their respective consoles, not via the SDK. The production Actions prefix uses `deploy/spaces/actions-lifecycle.json` (`actions/runs/`, 90-day expiry).
 - **ACL semantics.** Spaces supports `public-read` on objects; MinIO uses bucket policies. We don't rely on either today (all reads go through the app).
 - **Listing pagination.** Both honor `MaxKeys` + continuation tokens, but the page sizes they prefer differ. Don't assume an exact count per page.
 
docs/internal/worker.md modified
@@ -36,7 +36,8 @@ backstop poll (every 5s by default) covers dropped notifications.
 | `repo:size_recalc`           | enqueued by `push:process`           | overwrite-last-wins              |
 | `org:github_import_discover` | org import request                   | `org_github_imports.status`      |
 | `org:github_import_repo`     | import discovery per GitHub repo     | `org_github_import_repos.status` |
-| `jobs:purge_completed`       | future cron / manual ad-hoc          | always safe to re-run            |
+| `jobs:purge_completed`       | cron / manual ad-hoc                 | always safe to re-run            |
+| `workflow:cleanup`           | cron / manual ad-hoc                 | retention cutoff + idempotent deletes |
 | `trending:compute`           | recurring self-scheduled S42 job     | append-only snapshots            |
 
 Adding a new kind: write the handler in `internal/worker/jobs/<kind>.go`,