
## Unreleased

### Sprint 33 — `training_drift` probe (cross-repo, reads dlm loss curves)

Closes the X2 "training_drift probe" backlog item. Sister to the S25
`gradient_ghost` probe: where the ghost reads optimizer state at
end-of-training, `training_drift` reads the loss *curve* during
training. Both are pre-run: no model load, no backend required.

**New probe (`kind: training_drift`, category: calibration).**

For a dlm store, the probe parses every `train-*.jsonl` under
`<store_path>/logs/`, dedupes resumed runs (latest occurrence wins),
and computes four metrics:

- `final_loss` — the last recorded step's loss.
- `convergence_ratio` — `final_loss / initial_loss`.
- `smoothness` — `1 − var(Δloss) / var(loss)`, clipped to `[0, 1]`.
- `instability_events` — count of loss-*increase* events whose
  magnitude exceeds the local typical movement scale (the median
  absolute delta in a centered window). Each NaN loss counts as one
  instability and is then forward-filled so downstream stats stay
  finite.
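
As a sketch, the metric math above (minus spike counting, covered separately by the heuristic note) looks roughly like this; `compute_metrics` is an illustrative stand-in, not one of the real helpers in `probes/training_drift.py`:

```python
import math

def compute_metrics(losses):
    """Illustrative versions of the four metrics; assumes at least
    two recorded steps."""
    # Each NaN counts as one instability, then forward-fill so the
    # remaining statistics stay finite.
    nan_events = sum(1 for x in losses if math.isnan(x))
    filled, last = [], None
    for x in losses:
        if math.isnan(x):
            x = last if last is not None else 0.0
        filled.append(x)
        last = x

    deltas = [b - a for a, b in zip(filled, filled[1:])]

    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    v_loss = var(filled)
    smoothness = 1.0 - var(deltas) / v_loss if v_loss > 0 else 0.0
    smoothness = min(1.0, max(0.0, smoothness))

    return {
        "final_loss": filled[-1],
        "convergence_ratio": filled[-1] / filled[0] if filled[0] else float("inf"),
        "smoothness": smoothness,
        # spike events (see the heuristic note) would be added here
        "instability_events": nan_events,
    }
```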

Verdict is PASS when all three thresholds clear (`smoothness ≥ 0.7`,
`convergence_ratio ≤ 0.7`, `instability_events = 0`); otherwise WARN,
with each failed threshold listed in the message.
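
The verdict mapping then reduces to a few lines. The threshold values are the ones stated above; the structure and names are illustrative:

```python
# Threshold values per the changelog; predicate names are illustrative.
THRESHOLDS = [
    ("smoothness >= 0.7", lambda m: m["smoothness"] >= 0.7),
    ("convergence_ratio <= 0.7", lambda m: m["convergence_ratio"] <= 0.7),
    ("instability_events == 0", lambda m: m["instability_events"] == 0),
]

def map_verdict(metrics):
    """PASS when every threshold clears, else WARN listing failures."""
    failed = [desc for desc, ok in THRESHOLDS if not ok(metrics)]
    if not failed:
        return "PASS", "all thresholds clear"
    return "WARN", "failed: " + ", ".join(failed)
```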

**Spike heuristic note.** The sprint plan called for `|Δloss| > 3 ·
rolling_std`. The implementation rejects that: on a smooth exponential
decay, the within-window std of deltas stays tiny while the absolute
deltas are large, so every step trips the threshold (verified during
dev: a 60-step smooth curve flagged 59 false-positive spikes). Replaced
with: count *positive* deltas (loss going up — the semantically
meaningful instability) that exceed `sigma · median(|Δ|)` in a
centered window. This is robust to scale changes across training, and
loss going down faster than usual is no longer mistaken for instability.
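
A minimal sketch of the replacement heuristic, assuming `sigma = 3` and a 5-wide centered window (the probe's actual defaults are not stated here):

```python
def count_spikes(losses, sigma=3.0, window=5):
    """Count positive deltas (loss going up) that exceed
    sigma * median(|delta|) over a centered window. Sketch only."""
    deltas = [b - a for a, b in zip(losses, losses[1:])]
    events, half = 0, window // 2
    for i, d in enumerate(deltas):
        lo, hi = max(0, i - half), min(len(deltas), i + half + 1)
        local = sorted(abs(x) for x in deltas[lo:hi])
        scale = local[len(local) // 2]  # median absolute delta
        if d > 0 and scale > 0 and d > sigma * scale:
            events += 1
    return events
```

On a smooth exponential decay every delta is negative, so nothing fires; that is exactly the curve where the `3 · rolling_std` rule produced 59 false positives.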

**No null calibration.** Mirrors `prompt_collapse` and
`multi_turn_coherence_decay`: a null adapter has no loss curve, so the
null distribution of "smoothness on a noise adapter" is undefined.
Verdicts use fixed thresholds; users can override them per-spec.

**Log-format note.** The sprint plan said dlm writes per-step JSON
files at `logs/train_step_*.json`. Reality (verified against
`~/.dlm/store/`): one mixed JSONL per run at
`logs/train-NNNNNN-YYYYMMDDTHHMMSS.jsonl`, containing banner +
delta + step + run_complete records. The probe filters for
`{"type": "step"}` lines and reads `step` + `loss`. The sibling
`*.summary.json` carries run aggregates we don't consume — the
curve is richer.
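
The parse loop can be sketched as follows; `parse_step_records` is an illustrative name, and the error-on-corrupt-first-line behavior is inferred from the test list, not guaranteed to match the probe's code:

```python
import json, math

def parse_step_records(path):
    """Collect (step, loss) pairs from a dlm-style run JSONL,
    keeping only {"type": "step"} records. Illustrative sketch."""
    points, seen_valid = [], False
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            line = raw.strip()
            if not line:
                continue
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                if not seen_valid:  # corrupt first line: not a dlm log
                    raise ValueError(f"unparseable log: {path}")
                continue  # tolerate a crashed-mid-line tail
            seen_valid = True
            if rec.get("type") != "step":
                continue  # skip banner / delta / run_complete records
            if "step" not in rec or "loss" not in rec:
                continue
            loss = float(rec["loss"])
            if math.isnan(loss):
                loss = float("inf")  # NaN -> +inf, see robustness notes
            points.append((int(rec["step"]), loss))
    return points
```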

**Robustness:**
- Resumed runs (overlapping step numbers across multiple JSONLs):
  dedupe-by-keep-latest mirrors `dlm metrics` semantics.
- Truncated JSONL tail (a trainer that crashed mid-line): the partial
  line is skipped; valid lines are still consumed.
- NaN losses are recorded as `+inf` so the spike detector flags them
  without a numpy NaN poisoning the rest of the pipeline.
- A 1500-step run downsamples to ≤ 512 evidence points (uniform
  stride; first + last always preserved).
- Pathological case: every step recorded NaN → `smoothness = 0.0`,
  `instability_events = num_steps`, verdict WARN.
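
The dedupe and downsampling steps above can be sketched as (illustrative helpers, assuming the semantics listed in the bullets):

```python
def dedupe_steps(points):
    """Resumed runs re-emit overlapping step numbers; keep the latest
    occurrence of each step (mirrors `dlm metrics` semantics)."""
    latest = {}
    for step, loss in points:  # later occurrences overwrite earlier ones
        latest[step] = loss
    return sorted(latest.items())

def downsample(points, cap=512):
    """Uniform-stride downsample to at most `cap` evidence points."""
    if len(points) <= cap:
        return list(points)
    stride = (len(points) - 1) / (cap - 1)
    idx = sorted({round(i * stride) for i in range(cap)})
    idx[0], idx[-1] = 0, len(points) - 1  # first + last always preserved
    return [points[i] for i in idx]
```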

**Implementation:**
- `probes/training_drift.py` — spec, probe, JSONL parser, metric
  helpers, verdict mapping, downsampler. 365 LOC.
- `probes/__init__.py` — registers the new probe.

**Test surface:**
- `tests/unit/test_probe_training_drift.py` — 30 unit tests
  covering: skip paths (no store_path / no logs dir / no jsonl /
  too few steps), end-to-end with smooth & spiky curves,
  resume-deduplication, downsampling, corrupt-first-line ERROR,
  truncated-tail tolerance, pure-math metric helpers
  (smooth/constant/NaN/all-NaN/zero-initial), spike-detector
  heuristic (loss-up vs loss-down semantics, short curves,
  empty), verdict mapping, and JSONL parsing (filtering of
  non-step records, missing keys, NaN encoding, missing files).
- `tests/fixtures/dlm_train_log_fixture.jsonl` — captured-from-disk
  shape: banner + delta record + 30 step records + run_complete.
  If this fixture's parse breaks, dlm's log format has shifted and
  the probe needs an update — the test catches that explicitly.

**README** gains a `training_drift` paragraph in "Pre-run
diagnostics" alongside `gradient_ghost`. The probe table under "Why
it exists" picks up `training_drift` and the previously-missed
`gradient_ghost` entry under Calibration.

### Sprint 30 — `multi_turn_coherence_decay` probe

Closes the P2 "multi_turn_coherence_decay probe" backlog item. Sway