tenseleyflow/sway / 27af978


CHANGELOG: Sprint 18 — Cross-platform determinism golden

Authored by espadonne
SHA: 27af978c298eac9258b183045510ddc3ec4c93ec
Parents: 13f5f15
Tree: 97ab568

1 changed file

Status  File          +   -
M       CHANGELOG.md  49  0
CHANGELOG.md (modified)
@@ -2,6 +2,55 @@
 
 ## Unreleased
 
+### Sprint 18 — Cross-platform determinism golden
+
+Closes Audit 01 stretch-list F-item "cross-platform determinism golden
+test." The README's "deterministic on CPU where possible" claim held
+in theory; now it's pinned by CI on two platforms.
+
+- **New module** `src/dlm_sway/core/golden.py`. Tolerance-aware JSON
+  comparator with `compare_goldens(actual, expected, *, logprob_tol=1e-6,
+  score_tol=1e-4)` and `mask_variable_fields` for stripping
+  timestamps, wall-seconds, `duration_s`, `backend_stats`,
+  `sway_version`, and cwd-resolved path identifiers (`adapter_id`,
+  `base_model_id`) before comparison. No torch dep; runs in the fast
+  lane.
+- **New integration test** `tests/integration/test_determinism_golden.py`
+  (slow + online). Builds a deterministically-seeded LoRA on
+  SmolLM2-135M, runs a minimal 2-probe suite (delta_kl +
+  calibration_drift), and diffs the JSON output against
+  `tests/golden/expected_<platform>.json`. `SWAY_UPDATE_GOLDENS=1`
+  toggles regen mode; missing golden → SKIP with a regen recipe.
+- **New CI matrix** `determinism-golden` with
+  `strategy.matrix.os: [ubuntu-latest, macos-latest]` in
+  `.github/workflows/ci.yml`. Triggered by changes under
+  `tests/golden/**`, `src/dlm_sway/core/golden.py`, or the test file
+  itself (plus the standard schedule/dispatch/push triggers).
+- **`workflow_dispatch` regen mode** — dispatching the CI workflow
+  with `regenerate_goldens=true` flips `SWAY_UPDATE_GOLDENS=1` on both
+  matrix legs and uploads the regenerated JSONs as per-platform
+  artifacts. Meant for deliberate "yes I changed the algorithm"
+  flows.
+- **Tolerance rationale** (documented in `core/golden.py`): `1e-6` for
+  logprob-like numeric fields sits above typical BLAS-implementation
+  drift (`1e-8`–`1e-7` band between OpenBLAS on linux and Accelerate
+  on darwin) but below real algorithm-change drift. `1e-4` for score
+  fields absorbs composition noise.
+- **25 new unit tests** for the comparator covering mask coverage,
+  tolerance thresholds, structural diffs (missing keys, length
+  mismatches, type mismatches), NaN/inf edge cases, and a realistic
+  two-masked-payloads round-trip.
+- **Pragmatic scope deviation**: the sprint envisioned a checked-in
+  5 MB adapter binary. Instead the test builds it from a fixed seed
+  at runtime (same pattern as `test_external_perplexity_e2e` and
+  `test_cluster_kl_e2e`). Avoids the regeneration chore noted in the
+  sprint's risks section.
+- **Linux golden bootstrap**: first PR ships `expected_darwin.json`
+  only; the linux leg SKIPs with a recipe pointing at the
+  `workflow_dispatch` regen mode. Maintainer dispatches, downloads
+  the artifact, commits `expected_linux.json`, and the next CI run
+  asserts cleanly on both platforms. One-time onboarding cost.
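Editor's sketch of the tolerance-aware comparison described in the "New module" and "Tolerance rationale" bullets. This is not the code in `src/dlm_sway/core/golden.py`. Only the two function names, the keyword defaults, and the masked fields `duration_s`, `backend_stats`, `sway_version`, `adapter_id`, and `base_model_id` come from the entry. The key names `timestamp` and `wall_seconds`, the rule that a field is score-like when its key contains "score", and the list-of-strings return value are assumptions made for illustration.

```python
import math
from typing import Any

# Assumed key names: "timestamp" and "wall_seconds" are guesses; the rest
# are the field names listed in the changelog entry.
MASKED_KEYS = frozenset({
    "timestamp", "wall_seconds", "duration_s", "backend_stats",
    "sway_version", "adapter_id", "base_model_id",
})


def mask_variable_fields(payload: Any) -> Any:
    """Return a copy of payload with run-dependent keys dropped at any depth."""
    if isinstance(payload, dict):
        return {k: mask_variable_fields(v)
                for k, v in payload.items() if k not in MASKED_KEYS}
    if isinstance(payload, list):
        return [mask_variable_fields(v) for v in payload]
    return payload


def _is_number(value: Any) -> bool:
    # bool is a subclass of int, so exclude it to keep True/False exact.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def compare_goldens(actual: Any, expected: Any, *, logprob_tol: float = 1e-6,
                    score_tol: float = 1e-4, _path: str = "$") -> list[str]:
    """Return human-readable differences; an empty list means a match."""
    tols = {"logprob_tol": logprob_tol, "score_tol": score_tol}
    if isinstance(actual, dict) and isinstance(expected, dict):
        diffs = []
        for key in sorted(actual.keys() | expected.keys()):
            child = f"{_path}.{key}"
            if key not in actual:
                diffs.append(f"{child}: missing in actual")
            elif key not in expected:
                diffs.append(f"{child}: unexpected in actual")
            else:
                diffs += compare_goldens(actual[key], expected[key],
                                         _path=child, **tols)
        return diffs
    if isinstance(actual, list) and isinstance(expected, list):
        if len(actual) != len(expected):
            return [f"{_path}: length {len(actual)} != {len(expected)}"]
        diffs = []
        for i, (a, e) in enumerate(zip(actual, expected)):
            diffs += compare_goldens(a, e, _path=f"{_path}[{i}]", **tols)
        return diffs
    if _is_number(actual) and _is_number(expected):
        a, e = float(actual), float(expected)
        if math.isnan(a) or math.isnan(e):
            same = math.isnan(a) and math.isnan(e)  # NaN matches only NaN
        elif math.isinf(a) or math.isinf(e):
            same = a == e  # infinities must agree in sign
        else:
            # Assumed classification: key contains "score" -> looser tolerance.
            leaf = _path.rsplit(".", 1)[-1]
            tol = score_tol if "score" in leaf else logprob_tol
            same = abs(a - e) <= tol
        return [] if same else [f"{_path}: {actual!r} != {expected!r}"]
    if type(actual) is type(expected) and actual == expected:
        return []
    return [f"{_path}: {actual!r} != {expected!r}"]
```

Under these assumptions, two payloads that differ only in `sway_version`, by `5e-7` in a logprob field, and by `5e-5` in a score field compare as equal once both are passed through `mask_variable_fields`. A `5e-6` logprob difference is reported.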
+
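Editor's sketch of the regen-or-assert flow from the "New integration test" and "Linux golden bootstrap" bullets, using only the standard library. The helper names `golden_path`, `update_requested`, and `check_or_update` are hypothetical, as is deriving the platform suffix from `sys.platform`; the entry itself only gives the file pattern `expected_<platform>.json` and the `SWAY_UPDATE_GOLDENS=1` switch. Exact equality stands in here for the tolerance-aware comparison.

```python
import json
import os
import sys
from pathlib import Path
from typing import Mapping


def golden_path(root: Path, platform: str = sys.platform) -> Path:
    # Assumption: the suffix follows sys.platform ("linux", "darwin").
    return root / f"expected_{platform}.json"


def update_requested(env: Mapping[str, str] = os.environ) -> bool:
    # The changelog names this variable; treating only "1" as on is assumed.
    return env.get("SWAY_UPDATE_GOLDENS") == "1"


def check_or_update(actual: dict, path: Path, *, update: bool) -> str:
    """Return 'updated', 'missing', 'match', or 'mismatch'."""
    if update:
        # Regen mode: overwrite the golden instead of asserting against it.
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(actual, indent=2, sort_keys=True) + "\n")
        return "updated"
    if not path.exists():
        # A test harness would turn this into a skip with a regen recipe.
        return "missing"
    expected = json.loads(path.read_text())
    return "match" if actual == expected else "mismatch"
```

A test using these helpers would skip on `"missing"`, fail on `"mismatch"`, and pass otherwise, which mirrors the bootstrap sequence in the entry: the first run on a platform with no golden skips, a regen run writes the file, and later runs assert against it.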
 ### Sprint 17 — Adversarial paraphrase mining + outlier-prompt miner
 
 Closes Audit 01 innovation item F11. Adds `sway mine` — an evaluation