
## Unreleased

### Sprint 33 — `training_drift` probe (cross-repo, reads dlm loss curves)

Closes the X2 "training_drift probe" backlog item. Sister to S25
`gradient_ghost`: where the ghost reads optimizer state at
end-of-training, `training_drift` reads the loss *curve* during
training. Both are pre-run, no model load, no backend required.

**New probe (`kind: training_drift`, category: calibration).**

For a dlm store, the probe parses every `train-*.jsonl` under
`<store_path>/logs/`, dedupes resumed runs (latest occurrence wins),
and computes four metrics (the first three are sketched right after
the list):

- `final_loss` — last recorded step's loss.
- `convergence_ratio` — `final_loss / initial_loss`.
- `smoothness` — `1 − var(Δloss) / var(loss)`, clipped to `[0, 1]`.
- `instability_events` — count of loss-*increase* events whose
  magnitude exceeds the local typical movement scale (median
  absolute delta in a centered window). NaN losses each count as
  one instability and are then forward-filled so downstream stats
  stay finite.
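
A minimal sketch of the first three metrics, assuming a plain list
of per-step losses; the function name and the zero-variance /
zero-initial guards here are illustrative, not the probe's actual
code:

```python
import numpy as np

def curve_metrics(losses: list[float]) -> dict[str, float]:
    # Sketch only: assumes at least two recorded steps.
    x = np.asarray(losses, dtype=float)
    deltas = np.diff(x)
    var_loss = float(np.var(x))
    # A perfectly flat curve has zero variance; treat it as maximally smooth.
    raw = 1.0 - float(np.var(deltas)) / var_loss if var_loss > 0 else 1.0
    return {
        "final_loss": float(x[-1]),
        # Zero initial loss is a tested edge case; this guard is a guess.
        "convergence_ratio": float(x[-1] / x[0]) if x[0] != 0 else float("inf"),
        "smoothness": float(np.clip(raw, 0.0, 1.0)),
    }
```

For example, `curve_metrics([2.0, 1.5, 1.1, 0.9])` gives
`convergence_ratio = 0.45` and `smoothness ≈ 0.91`.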

Verdict is PASS when all three thresholds clear (smoothness ≥ 0.7,
convergence_ratio ≤ 0.7, instability_events ≤ 0); otherwise WARN,
with each failed threshold listed in the message.
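
The verdict mapping is a literal transcription of those thresholds,
sketched here with a hypothetical function name:

```python
def map_verdict(m: dict[str, float]) -> tuple[str, str]:
    failed = []
    if m["smoothness"] < 0.7:
        failed.append(f"smoothness {m['smoothness']:.2f} < 0.7")
    if m["convergence_ratio"] > 0.7:
        failed.append(f"convergence_ratio {m['convergence_ratio']:.2f} > 0.7")
    if m["instability_events"] > 0:
        failed.append(f"instability_events {m['instability_events']:.0f} > 0")
    # Only PASS or WARN here: fixed thresholds, overridable per-spec.
    if not failed:
        return "PASS", "all thresholds clear"
    return "WARN", "; ".join(failed)
```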

**Spike heuristic note.** The sprint plan called for `|Δloss| > 3 ·
rolling_std`. The implementation rejects that: on a smooth
exponential decay, the within-window std of deltas stays tiny while
absolute deltas are large — every step trips the threshold (verified
during dev: a 60-step smooth curve flagged 59 false-positive
spikes). Replaced with: count *positive* deltas (loss going up — the
semantically meaningful instability) that exceed `sigma · median(|Δ|)`
in a centered window. This is robust to scale changes across
training, and loss going down faster than usual is no longer
mistaken for instability.
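
A sketch of that replacement heuristic; the window size and `sigma`
default are assumptions, not the probe's actual constants:

```python
import numpy as np

def count_instability_events(losses, sigma=3.0, window=11):
    # Assumes NaN losses were already mapped to +inf upstream (see the
    # robustness notes below), so an inf delta always registers here.
    deltas = np.diff(np.asarray(losses, dtype=float))
    abs_d = np.abs(deltas)
    half = window // 2
    events = 0
    for i, d in enumerate(deltas):
        # Local scale: median absolute delta in a window centered on i.
        scale = np.median(abs_d[max(0, i - half):i + half + 1])
        # Only loss *increases* count. A fast drop is fast convergence,
        # not instability, which is exactly what the 3·rolling_std rule
        # got wrong on smooth decay.
        if d > 0 and d > sigma * scale:
            events += 1
    return events
```

On a smooth exponential decay every delta is negative, so this
counts zero events where the rejected rule flagged nearly all of
them.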

**No null calibration.** Mirrors `prompt_collapse` and
`multi_turn_coherence_decay`: a null adapter has no loss curve, so
the null distribution of "smoothness on a noise adapter" is
undefined. Fixed-threshold verdicts; users override per-spec.

**Log-format note.** The sprint plan said dlm writes per-step JSONs
at `logs/train_step_*.json`. Reality (verified against
`~/.dlm/store/`): one mixed JSONL per run at
`logs/train-NNNNNN-YYYYMMDDTHHMMSS.jsonl` containing banner +
delta + step + run_complete records. The probe filters for
`{"type": "step"}` lines and reads `step` + `loss`. The sibling
`*.summary.json` carries run aggregates we don't consume — the
curve is richer.
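
Roughly what the reader does, as a sketch. The real probe
distinguishes a corrupt first line (ERROR) from a truncated tail
(tolerated); this sketch just skips anything unparseable:

```python
import json
import math

def parse_step_records(path):
    steps = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                continue  # e.g. the truncated tail of a crashed trainer
            if not isinstance(rec, dict) or rec.get("type") != "step":
                continue  # banner, delta, run_complete records
            if "step" not in rec or "loss" not in rec:
                continue  # malformed step record
            loss = float(rec["loss"])
            if math.isnan(loss):
                loss = math.inf  # NaN -> +inf so the spike detector flags it
            steps.append((int(rec["step"]), loss))
    return steps
```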

**Robustness** (dedupe and downsampling are sketched after this
list):
- Resumed runs (overlapping step numbers across multiple jsonls):
  dedupe-by-keep-latest mirrors `dlm metrics` semantics.
- Truncated JSONL tail (crashed-mid-line trainer): the partial line
  is skipped; valid lines are still consumed.
- NaN losses are recorded as `+inf` so the spike detector flags
  them without numpy NaN poisoning the rest of the pipeline.
- A 1500-step run downsamples to ≤ 512 evidence points (uniform
  stride; first + last always preserved).
- Pathological: every step recorded NaN → `smoothness = 0.0`,
  `instability_events = num_steps`, verdict WARN.
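
Dedupe and downsampling, sketched; the probe's exact stride
arithmetic may differ, but this version keeps the first and last
points and stays at or under the cap:

```python
import numpy as np

def dedupe_keep_latest(steps):
    # File order is chronological, so a later occurrence of the same
    # step number overwrites the earlier one: the resumed run wins.
    by_step = {}
    for step, loss in steps:
        by_step[step] = loss
    return sorted(by_step.items())

def downsample(points, cap=512):
    if len(points) <= cap:
        return list(points)
    # Near-uniform stride over indices; endpoints always included.
    idx = np.unique(np.linspace(0, len(points) - 1, num=cap).round().astype(int))
    return [points[i] for i in idx]
```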

**Implementation:**
- `probes/training_drift.py` — spec, probe, JSONL parser,
  metric helpers, verdict mapping, downsampler. 365 LOC.
- `probes/__init__.py` — registers the new probe.

**Test surface:**
- `tests/unit/test_probe_training_drift.py` — 30 unit tests
  covering: skip paths (no store_path / no logs dir / no jsonl /
  too few steps), end-to-end with smooth & spiky curves,
  resume-deduplication, downsampling, corrupt-first-line ERROR,
  truncated-tail tolerance, pure-math metric helpers
  (smooth/constant/NaN/all-NaN/zero-initial), spike-detector
  heuristic (loss-up vs loss-down semantics, short curves,
  empty), verdict mapping, and JSONL parsing (filter non-step
  records, missing keys, NaN encoding, missing files).
- `tests/fixtures/dlm_train_log_fixture.jsonl` — captured-from-disk
  shape: banner + delta record + 30 step records + run_complete.
  If this fixture's parse breaks, dlm's log format has shifted and
  the probe needs an update — the test catches that explicitly.
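
That canary test looks roughly like this; the import path and
parser name follow the sketch above and are hypothetical, not the
test file verbatim:

```python
import math

from probes.training_drift import parse_step_records  # hypothetical name

def test_dlm_fixture_still_parses():
    steps = parse_step_records("tests/fixtures/dlm_train_log_fixture.jsonl")
    # Banner, delta, and run_complete records are filtered out, leaving
    # exactly the 30 step records the fixture was captured with.
    assert len(steps) == 30
    assert all(not math.isnan(loss) for _, loss in steps)
```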

**README** gains a `training_drift` paragraph in "Pre-run
diagnostics" alongside `gradient_ghost`. The probe table at "Why
it exists" picks up `training_drift` and the previously-missed
`gradient_ghost` entry under Calibration.

### Sprint 30 — `multi_turn_coherence_decay` probe

Closes the P2 "multi_turn_coherence_decay probe" backlog item. Sway