@@ -2,6 +2,55 @@ |
| 2 | 2 | |
| 3 | 3 | ## Unreleased |
| 4 | 4 | |
| 5 | +### Sprint 18 — Cross-platform determinism golden |
| 6 | + |
| 7 | +Closes Audit 01 stretch-list F-item "cross-platform determinism golden |
| 8 | +test." The README's "deterministic on CPU where possible" claim held |
| 9 | +in theory; now it's pinned by CI on two platforms. |
| 10 | + |
| 11 | +- **New module** `src/dlm_sway/core/golden.py`. Tolerance-aware JSON |
| 12 | + comparator with `compare_goldens(actual, expected, *, logprob_tol=1e-6, |
| 13 | + score_tol=1e-4)` and `mask_variable_fields` for stripping |
| 14 | + timestamps, wall-seconds, `duration_s`, `backend_stats`, |
| 15 | + `sway_version`, and cwd-resolved path identifiers (`adapter_id`, |
| 16 | + `base_model_id`) before comparison. No torch dep; runs in the fast |
| 17 | + lane. |
| 18 | +- **New integration test** `tests/integration/test_determinism_golden.py` |
| 19 | + (slow + online). Builds a deterministically-seeded LoRA on |
| 20 | + SmolLM2-135M, runs a minimal 2-probe suite (delta_kl + |
| 21 | + calibration_drift), and diffs the JSON output against |
| 22 | + `tests/golden/expected_<platform>.json`. `SWAY_UPDATE_GOLDENS=1` |
| 23 | + toggles regen mode; missing golden → SKIP with a regen recipe. |
| 24 | +- **New CI matrix** `determinism-golden` with |
| 25 | + `strategy.matrix.os: [ubuntu-latest, macos-latest]` in |
| 26 | + `.github/workflows/ci.yml`. Triggered by changes under |
| 27 | + `tests/golden/**`, `src/dlm_sway/core/golden.py`, or the test file |
| 28 | + itself (plus the standard schedule/dispatch/push triggers). |
| 29 | +- **`workflow_dispatch` regen mode** — dispatching the CI workflow |
| 30 | + with `regenerate_goldens=true` flips `SWAY_UPDATE_GOLDENS=1` on both |
| 31 | + matrix legs and uploads the regenerated JSONs as per-platform |
| 32 | + artifacts. Meant for deliberate "yes I changed the algorithm" |
| 33 | + flows. |
| 34 | +- **Tolerance rationale** (documented in `core/golden.py`): `1e-6` for |
| 35 | + logprob-like numeric fields sits above typical BLAS-implementation |
| 36 | + drift (`1e-8`–`1e-7` band between OpenBLAS on linux and Accelerate |
| 37 | + on darwin) but below real algorithm-change drift. `1e-4` for score |
| 38 | + fields absorbs composition noise. |
| 39 | +- **25 new unit tests** for the comparator covering mask coverage, |
| 40 | + tolerance thresholds, structural diffs (missing keys, length |
| 41 | + mismatches, type mismatches), NaN/inf edge cases, and a realistic |
| 42 | + two-masked-payloads round-trip. |
| 43 | +- **Pragmatic scope deviation**: the sprint envisioned a checked-in |
| 44 | + 5 MB adapter binary. Instead the test builds it from a fixed seed |
| 45 | + at runtime (same pattern as `test_external_perplexity_e2e` and |
| 46 | + `test_cluster_kl_e2e`). Avoids the regeneration chore noted in the |
| 47 | + sprint's risks section. |
| 48 | +- **Linux golden bootstrap**: first PR ships `expected_darwin.json` |
| 49 | + only; the linux leg SKIPs with a recipe pointing at the |
| 50 | + `workflow_dispatch` regen mode. Maintainer dispatches, downloads |
| 51 | + the artifact, commits `expected_linux.json`, and the next CI run |
| 52 | + asserts cleanly on both platforms. One-time onboarding cost. |
| 53 | + |
| 5 | 54 | ### Sprint 17 — Adversarial paraphrase mining + outlier-prompt miner |
| 6 | 55 | |
| 7 | 56 | Closes Audit 01 innovation item F11. Adds `sway mine` — an evaluation |