
Changelog

Unreleased

Sprint 36 — sway serve warm-backend daemon

Audit 03 H4. sway run cold-loads the HF backend each invocation (~15s on a 1.5B model: weights, KV cache, deterministic-mode setup). For interactive flows — notebooks, the sway watch retrain loop, the live HTML report — that startup dwarfs the actual scoring cost. sway serve is a long-running FastAPI daemon that loads the backend once and keeps it warm across requests.

New CLI (sway serve). Defaults: --host 127.0.0.1 --port 8787 --max-loaded-models 2. The daemon refuses to bind a non-loopback interface unless --api-key <token> is passed; every non-/health request must then carry Authorization: Bearer <token>.

Endpoints.

  • GET /health — uptime + the list of currently-warm models.
  • GET /stats — request count + mean latency + cache size.
  • POST /run — body {spec: SwaySpec}; returns the same JSON shape sway run --json-out would write, plus request_seconds for the daemon's measured execution time.
  • POST /score — same as /run with an optional probe_names filter; returns just the per-probe entries with no folded SwayScore.

Backend cache. LRU keyed on (kind, base, adapter, dtype, device). Capped at --max-loaded-models; loading a third distinct model LRU-evicts the oldest, calling backend.close() to release GPU memory. Single-flight: concurrent requests for the same key serialize at the loader instead of building twice.
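The cache-plus-single-flight behavior can be sketched in a few lines. This is an illustrative model, not the daemon's actual module: `BackendCache`, the `loader` callable, and the `close()` shape are assumptions standing in for the real backend API.

```python
# Minimal sketch of an LRU backend cache with single-flight loading.
# BackendCache / loader are hypothetical names, not the shipped API.
import threading
from collections import OrderedDict

class BackendCache:
    """LRU keyed on backend identity; concurrent same-key loads serialize."""

    def __init__(self, loader, max_loaded=2):
        self._loader = loader          # key -> backend factory (slow)
        self._max = max_loaded
        self._cache = OrderedDict()    # insertion order doubles as recency
        self._locks = {}               # per-key lock = single-flight gate
        self._mu = threading.Lock()

    def get(self, key):
        with self._mu:
            if key in self._cache:
                self._cache.move_to_end(key)   # refresh recency on hit
                return self._cache[key]
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:  # second request for the same key waits here, loads once
            with self._mu:
                if key in self._cache:
                    self._cache.move_to_end(key)
                    return self._cache[key]
            backend = self._loader(key)        # slow load, outside _mu
            with self._mu:
                self._cache[key] = backend
                while len(self._cache) > self._max:
                    _, old = self._cache.popitem(last=False)  # evict LRU
                    old.close()                # release GPU memory
            return backend
```

The eviction call to `close()` happens while holding the cache lock for simplicity; the real daemon may release it off-lock.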

Python SDK. dlm_sway.serve.client.ServeClient(url) exposes health(), stats(), run(spec), score(spec, probe_names=...). Stateless (no persistent connection pool); raises ServeClientError (subclass of SwayError) on transport failure or non-2xx responses.

Auth posture (v1). Defaults to no auth on loopback — the threat model on a single-user dev box is "did I bind 0.0.0.0 by accident". The CLI hard-refuses --host 0.0.0.0 without an API key. Full OAuth is deferred.

Sprint 33 — training_drift probe (cross-repo, reads dlm loss curves)

Closes the X2 "training_drift probe" backlog item. Sister to S25 gradient_ghost: where the ghost reads optimizer state at end-of-training, training_drift reads the loss curve during training. Both are pre-run, no model load, no backend required.

New probe (kind: training_drift, category: calibration).

For a dlm store, the probe parses every train-*.jsonl under <store_path>/logs/, dedupes resumed runs (latest occurrence wins), and computes four metrics:

  • final_loss — last recorded step's loss.
  • convergence_ratio — final_loss / initial_loss.
  • smoothness — 1 − var(Δloss) / var(loss), clipped to [0, 1].
  • instability_events — count of loss-increase events whose magnitude exceeds the local typical movement scale (median absolute delta in a centered window). NaN losses count as one instability each, then forward-fill so downstream stats stay finite.
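The first three metrics are simple curve statistics. A sketch, assuming a plain list of per-step losses and a non-constant curve (the shipped probe guards degenerate inputs separately; the spike counter is covered in the heuristic note further down):

```python
# Illustrative metric helpers, not the shipped implementation.
import statistics

def drift_metrics(losses):
    """final_loss, convergence_ratio, smoothness from a loss curve."""
    deltas = [b - a for a, b in zip(losses, losses[1:])]
    final_loss = losses[-1]
    convergence_ratio = final_loss / losses[0]
    # smoothness = 1 - var(delta loss) / var(loss), clipped to [0, 1];
    # assumes var(loss) > 0 (non-constant curve).
    raw = 1.0 - statistics.pvariance(deltas) / statistics.pvariance(losses)
    smoothness = min(1.0, max(0.0, raw))
    return final_loss, convergence_ratio, smoothness
```

On a smooth exponential decay the delta variance is small relative to the curve variance, so smoothness lands near 1.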

Verdict PASS when all three thresholds clear (smoothness ≥ 0.7, convergence_ratio ≤ 0.7, instability_events ≤ 0). Otherwise WARN with each failed threshold listed in the message.

Spike heuristic note. Sprint plan called for |Δloss| > 3 · rolling_std. Implementation rejects that: on a smooth exponential decay, within-window std-of-deltas stays tiny while absolute deltas are large — every step trips the threshold (verified during dev: 60-step smooth curve flagged 59 false-positive spikes). Replaced with: count positive deltas (loss going up — the semantically meaningful instability) that exceed sigma · median(|Δ|) in a centered window. Robust to scale changes across training; loss going down faster than usual is no longer mistaken for instability.
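The replacement heuristic can be sketched as follows; window size and sigma are illustrative defaults, not the shipped values:

```python
# Sketch of the loss-up spike counter: a positive delta counts only
# when it exceeds sigma x the local median |delta| in a centered window.
def count_instability_events(losses, sigma=3.0, window=5):
    deltas = [b - a for a, b in zip(losses, losses[1:])]
    events = 0
    half = window // 2
    for i, d in enumerate(deltas):
        if d <= 0:
            continue  # loss going down is never instability
        lo, hi = max(0, i - half), min(len(deltas), i + half + 1)
        local = sorted(abs(x) for x in deltas[lo:hi])
        med = local[len(local) // 2]  # local typical movement scale
        if med > 0 and d > sigma * med:
            events += 1
    return events
```

Because the threshold is the median absolute delta rather than the std of deltas, a smooth decay (all negative deltas) counts zero events, while a single upward jump against small local movement is flagged.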

No null calibration. Mirrors prompt_collapse and multi_turn_coherence_decay: a null adapter has no loss curve, the null distribution of "smoothness on a noise adapter" is undefined. Fixed-threshold verdicts; users override per-spec.

Log-format note. Sprint plan said dlm writes per-step JSONs at logs/train_step_*.json. Reality (verified against ~/.dlm/store/): one mixed JSONL per run at logs/train-NNNNNN-YYYYMMDDTHHMMSS.jsonl containing banner + delta + step + run_complete records. The probe filters for {"type": "step"} lines and reads step + loss. Sibling *.summary.json carries run aggregates we don't consume — the curve is richer.
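The step-filtering and NaN-to-+inf behavior described above can be sketched like this (field handling beyond the documented `type` / `step` / `loss` keys is an assumption):

```python
# Illustrative parser for dlm's mixed train JSONL: keep only
# {"type": "step"} records, tolerate a truncated tail, dedupe resumes.
import json
import math

def parse_steps(lines):
    out = {}
    for line in lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # truncated tail from a crashed-mid-line trainer
        if rec.get("type") != "step":
            continue  # banner / delta / run_complete records
        step = rec.get("step")
        if step is None:
            continue
        loss = rec.get("loss")
        if loss is None or (isinstance(loss, float) and math.isnan(loss)):
            loss = math.inf  # NaN -> +inf so the spike detector flags it
        out[step] = loss      # dedupe resumed runs: latest occurrence wins
    return sorted(out.items())
```

Note that Python's `json.loads` accepts a bare `NaN` literal, so NaN losses arrive as `float('nan')` and get rewritten here.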

Robustness:

  • Resumed runs (overlapping step numbers across multiple jsonls): dedupe-by-keep-latest mirrors dlm metrics semantics.
  • Truncated JSONL tail (crashed-mid-line trainer): partial line is skipped; valid lines still consumed.
  • NaN losses are recorded as +inf so the spike detector flags them without numpy NaN poisoning the rest of the pipeline.
  • 1500-step run downsamples to ≤ 512 evidence points (uniform stride; first + last always preserved).
  • Pathological: every step recorded NaN → smoothness = 0.0, instability_events = num_steps, verdict WARN.
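The evidence downsampler's contract (uniform stride, first and last always kept) fits in a few lines; this is a sketch, not the shipped helper:

```python
# Uniform-stride downsample to at most `cap` evidence points,
# always preserving the first and last point.
def downsample(points, cap=512):
    if len(points) <= cap:
        return list(points)
    stride = (len(points) - 1) / (cap - 1)
    idx = sorted({round(i * stride) for i in range(cap)} | {0, len(points) - 1})
    return [points[i] for i in idx]
```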

Implementation:

  • probes/training_drift.py — spec, probe, JSONL parser, metric helpers, verdict mapping, downsampler. 365 LOC.
  • probes/__init__.py — registers the new probe.

Test surface:

  • tests/unit/test_probe_training_drift.py — 30 unit tests covering: skip paths (no store_path / no logs dir / no jsonl / too few steps), end-to-end with smooth & spiky curves, resume-deduplication, downsampling, corrupt-first-line ERROR, truncated-tail tolerance, pure-math metric helpers (smooth/constant/NaN/all-NaN/zero-initial), spike-detector heuristic (loss-up vs loss-down semantics, short curves, empty), downsampling, verdict mapping, and JSONL parsing (filter non-step records, missing keys, NaN encoding, missing files).
  • tests/fixtures/dlm_train_log_fixture.jsonl — captured-from-disk shape: banner + delta record + 30 step records + run_complete. If this fixture's parse breaks, dlm's log format has shifted and the probe needs an update — the test catches that explicitly.

README gains a training_drift paragraph in "Pre-run diagnostics" alongside gradient_ghost. The probe table at "Why it exists" picks up training_drift and the previously-missed gradient_ghost entry under Calibration.

Sprint 30 — multi_turn_coherence_decay probe

Closes the P2 "multi_turn_coherence_decay probe" backlog item. Sway had zero coverage of multi-turn behavior before this — every shipped adherence probe was single-turn. Adapters that pass delta_kl cleanly frequently degrade by turn 2 or 3 of real dialogue, where the model's own previous responses enter the context window and create compounding drift. The new probe is the first that catches that failure mode.

New probe (kind: multi_turn_coherence_decay, category: adherence).

For each prompt the probe greedy-generates ft's turn-1 response, then rolls a multi-turn synthetic dialogue with cycled generic follow-ups (Continue., Tell me more., …). At each turn 2..N both views see the same ft-grounded chat history; the probe computes KL(base || ft) at the next-token position. The per-turn KL series is fit to kl = a · exp(-b · turn) and the probe reports half_life_turns = ln(2) / b.
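The fit step can be sketched with a log-linear least squares (an assumption; the shipped probe may use a different fitter). Taking logs of kl = a · exp(-b · turn) gives ln kl = ln a − b · turn, so b falls out of a straight-line fit; this sketch assumes strictly positive KLs and b > 0 — the ok branch. The stable and degenerate verdicts below bypass the fit.

```python
# Illustrative half-life fit: slope of ln(kl) vs turn gives -b,
# then half_life_turns = ln(2) / b.
import math

def fit_half_life(kls):
    xs = list(range(1, len(kls) + 1))      # turn index, 1-based
    ys = [math.log(k) for k in kls]        # requires kl > 0 at every turn
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    b = -slope
    return math.log(2) / b
```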

Verdict ladder:

  • ok — clean exponential decay; PASS when half_life_turns ≥ assert_half_life_turns (default 2.0).
  • stable — KL stayed within 0.1% relative spread across turns; half-life is formally infinite; clipped to max_turns × 10 and rendered with a "held coherence" message; always PASS.
  • non_monotonic — KL grew turn-over-turn (atypical but possible); WARN with the curve in evidence.
  • degenerate — KL ≈ 0 at every turn; FAIL with "probable no-op adapter" diagnosis.

No null calibration. Mirrors prompt_collapse's rationale: a null adapter has no signal to decay, so the null distribution of half-lives is meaningless. Fixed-threshold verdicts are the published path.

Chat-template requirement. Multi-turn dialogue requires the base's tokenizer to carry a chat_template. Bases without one SKIP gracefully with a clear message. The probe consults ctx.backend._tokenizer.chat_template — same backdoor prompt_collapse uses for the same reason (avoids broadening the public scoring contract for one probe's needs).

Reports surface a tiny unicode sparkline of per-turn KL in the verdict message so terminal-only readers see curve shape without opening the JSON.
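A sparkline of this kind is a one-liner over the unicode block characters; a sketch, not the shipped renderer:

```python
# Map each value onto one of eight block glyphs, scaled to the
# series' own min/max; constant series render as all-low blocks.
def sparkline(values, blocks="▁▂▃▄▅▆▇█"):
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    return "".join(
        blocks[int((v - lo) / span * (len(blocks) - 1))] for v in values
    )
```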

Implementation:

  • probes/multi_turn_coherence.py — spec, probe, curve fit, verdict mapping, sparkline.
  • probes/__init__.py — registers the new probe.

Test surface:

  • tests/unit/test_probe_multi_turn_coherence.py — 22 unit tests covering skip paths, end-to-end with planted-distribution sequences (decreasing / flat / growing curves), fit math (clean exp / stable / growing / zero / partial-zero), verdict mapping, sparkline rendering, and chat-template detection.
  • tests/integration/test_probe_multi_turn_coherence.py — slow+online HF smoke on a tiny SmolLM2-135M LoRA across 3 dialogue turns. Exercises the full chat-template + turn-loop + curve-fit path on a real backend.

README gains a "Multi-turn coherence" section between Tool-use fidelity and Reproducing a sway run with a worked YAML example. The probe table at "Why it exists" picks up the new entry under Adherence (plus tool_use_fidelity under Attribution — the table missed S27's addition).

Sprint 28 — tenseleyflow/sway-action GitHub Action

Closes the P2 "GitHub Action" discoverability item from Audit 03 D-path3. The action lives in its own repo (tenseleyFlow/sway-action, tagged v0.1.0) so consumers can pin via uses: without coupling to sway's release cadence.

End-user value: three lines of YAML to wire sway into any GitHub PR check, with edit-in-place PR comments + cached venv + standardized verdict sticker. The audience is teams who don't use pre-commit but live in GitHub Actions.

    - uses: tenseleyflow/sway-action@v0.1.0
      with:
        spec-path: sway.yaml

Inputs / outputs. Inputs cover fail-on policy (fail/warn/never), comment-on-pr, upload-artifact, sway-version (defaults to the action's pinned version), and python-version. Outputs sway-score, verdict, and report-path are surfaced for downstream workflow steps.

Action layout:

  • action.yml — composite action, 5 steps (setup-python → cache → entrypoint.sh → upload-artifact → github-script post-comment).
  • src/entrypoint.sh — bootstraps a venv, installs dlm-sway[hf] pinned, runs sway gate + sway report --format md, parses score + verdict from the JSON report into $GITHUB_OUTPUT.
  • src/post-comment.js — Node 20 script run via actions/github-script@v7. Finds prior sway comments via a hidden <!-- sway-action:report --> HTML marker and edits in place. Truncates at 60 KB with a footer pointing to the artifact for full report.

Self-tests (smoke.yml in the action repo): an actionlint + shellcheck + node -c lint job, an entrypoint-stub job that validates the bash plumbing against a stubbed dlm_sway package, and a self-test job that runs the action end-to-end against a tiny random LoRA on SmolLM2-135M (PR-only — needs PR context for the comment-posting step).

Sway README gains a "GitHub Action" subsection between Pre-commit and the .dlm integration with the copy-paste one-liner + a complete workflow example.

Sprint 27 — tool_use_fidelity probe

Closes the P1 "tool_use_fidelity probe" backlog item. This is the probe sway ships for anyone fine-tuning an adapter against a tool-using base — the dominant fine-tune target outside dlm-style document training, and a failure mode no other shipped probe catches.

New probe (kind: tool_use_fidelity, category: attribution).

For each (prompt, tool_spec, gold_tool_name) case the probe greedy-decodes from base + ft and scores three independent signals:

  • JSON-schema validity delta (ft_valid_rate − base_valid_rate): the gate that catches "ft broke the base's tool-call format". Default pass criterion tolerates a 5pp drop.
  • Tool-name hallucination rate: fraction of schema-valid ft calls whose name falls outside the declared allowed_tools surface (or differs from gold_tool_name per-case when no surface is set). Default cap 10%.
  • Argument-field disagreement rate (informational): leaf-field drift between matched base/ft schema-valid calls. v1 surfaces it as evidence rather than gating on it — the per-token KL on free-form argument values is deferred past v1 because it requires alignment between two decoded strings.

The probe greedy-generates from both views, parses each output via a forgiving extractor (whole-text JSON → fenced json block → first balanced {...} substring), then validates against an OpenAI-flavored schema subset (string, integer, number, boolean, object, array plus required). No jsonschema dependency added — core sway dependencies stay lean.
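The three-stage extractor can be sketched as below. This is illustrative, not the shipped `_parse_tool_call`; in particular the balanced-brace scan here ignores braces inside JSON strings, an edge the real helper's tests cover.

```python
# Forgiving extraction: whole-text JSON -> fenced json block ->
# first balanced {...} substring, each stage falling through on failure.
import json
import re

def parse_tool_call(text):
    try:
        return json.loads(text)                      # stage 1: whole text
    except json.JSONDecodeError:
        pass
    m = re.search(r"```json\s*(.*?)```", text, re.DOTALL)
    if m:                                            # stage 2: fenced block
        try:
            return json.loads(m.group(1))
        except json.JSONDecodeError:
            pass
    depth, start = 0, None                           # stage 3: balanced scan
    for i, ch in enumerate(text):
        if ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}" and depth:
            depth -= 1
            if depth == 0:
                try:
                    return json.loads(text[start : i + 1])
                except json.JSONDecodeError:
                    start = None
    return None
```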

json_valid_rate_ft is z-scored against the null-adapter baseline when null_adapter is in the suite. Calibration spec ships two sentinel tool-use cases so the null distribution carries useful signal.

Implementation modules:

  • probes/tool_use_fidelity.py — ToolUseCase / ToolUseFidelitySpec / ToolUseFidelityProbe plus _parse_tool_call / _matches_schema / _field_disagreement helpers. 573 LOC.
  • probes/__init__.py — registers the new probe.

Test surface:

  • tests/unit/test_probe_tool_use_fidelity.py — 40 unit tests covering verdict logic (PASS / FAIL on validity / FAIL on hallucination / SKIP-no-cases), the parse helpers (whole-text / fenced / embedded / unbalanced / strings-with-braces), schema validation (required / type-tag / bool-not-int / extras-allowed), field disagreement (identical / drifted / nested / list-as-leaf / None-vs-absent), and the calibration handoff.
  • tests/integration/test_probe_tool_use_fidelity.py — slow+online HF backend smoke on a tiny SmolLM2-135M LoRA. Verifies the full code path executes without error and emits every documented evidence key with rates in [0, 1]. Doesn't assert a specific verdict — the 135M base isn't tool-fluent enough to make the validity gate meaningful, only the plumbing.
  • tests/fixtures/tool_use_cases.yaml — eight hand-authored cases spanning the most common tool shapes (search, file I/O, Python exec, shell, HTTP fetch, DB query, calendar, calculator).

README gains a "Tool-use fidelity" section between the .dlm integration and "Reproducing a sway run" — pitches the probe as sway's signal for agentic fine-tunes, with a worked YAML example.

Sprint 26 — X3 sway pack / unpack (sway-side half of the cross-repo X1+X3 pair)

Closes the X3 half of Audit 03's "make a sway run reproducible by a coworker without recreating their environment" goal. The X1 half (dlm export --to sway-json) lands in the dlm repo via a separate PR; this changelog block ships the sway-only deliverable.

New CLI commands.

  • sway pack <spec> [-o OUT] [--include-golden PATH] [--include-null-cache/--no-include-null-cache] [--max-size-mb MB] — bundles the spec + its dlm_source (when set) + cached null-stats entries + an optional last-known-good JSON report into a single *.swaypack.tar.gz. Default cap 50 MB; refuses to overwrite.
  • sway unpack <pack> [-o DIR] — extracts into <DIR>/swaypack/, validates the manifest, and prints the ready-to-run sway run invocation including the SWAY_NULL_CACHE_DIR=... env var that redirects null-stats lookups at the bundled cache.

Pack format (swaypack_version=1).

swaypack/
  manifest.json    # version + counts + packed_at + sway_version
  sway.yaml        # the spec, verbatim
  source.dlm       # spec.dlm_source content (when present)
  null-stats/      # one .json per cache key
  golden.json      # known-good run report (when --include-golden)

Implementation modules:

  • cli/_pack.py — builds the tarball in-memory first so the size cap can refuse cleanly before writing a half-built pack to disk. Path-traversal-safe by construction (we author every arcname).
  • cli/_unpack.py — uses tarfile.extractall(filter='data') to reject absolute paths / ../ escapes / device files (3.11 compat). Validates swaypack_version; rejects mismatches with a clear message.

Null-cache override env var.

  • probes/_null_cache._cache_root now honors $SWAY_NULL_CACHE_DIR if set (used verbatim — no dlm-sway/null-stats suffix). Order: env override → XDG cache → ~/.dlm-sway/null-stats. The sway unpack CLI prints the exact invocation that wires this env at the bundled cache.
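The resolution order reduces to a three-way fallback; a sketch of the lookup, not the shipped `_cache_root`:

```python
# env override -> XDG cache -> ~/.dlm-sway/null-stats.
# $SWAY_NULL_CACHE_DIR is used verbatim (no suffix appended).
import os
from pathlib import Path

def cache_root():
    env = os.environ.get("SWAY_NULL_CACHE_DIR")
    if env:
        return Path(env)
    xdg = os.environ.get("XDG_CACHE_HOME")
    if xdg:
        return Path(xdg) / "dlm-sway" / "null-stats"
    return Path.home() / ".dlm-sway" / "null-stats"
```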

Tests.

  • 17 unit tests in tests/unit/test_pack_unpack.py: round-trip identity, manifest fields, dlm_source bundling (present + missing, with warning emitted), golden bundling, every PackError + UnpackError branch (overwrite refusal, size cap, missing file, corrupt tarball, missing manifest, version mismatch, malformed root).
  • 2 slow+online integration tests in tests/integration/test_pack_run_roundtrip.py: pack → unpack → sway run produces an identical band + per-probe verdict + score; null-cache packing populates the unpack report's null_stats_dir pointer.
  • 708 unit tests pass; mypy + ruff + format clean.

Sprint 25 — P3 gradient_ghost probe (pre-run, cross-repo)

New zero-forward-pass diagnostic probe that loads dlm's training_state.pt and flags severely-undertrained adapters before any model load fires. Catches the 90%-case "user did --max-steps 5 for a smoke test and forgot to retrain" in ~50 ms vs. the 30+ s adapter_ablation probe that was previously the only training-health signal.

Probe + supporting modules.

  • probes/gradient_ghost.GradientGhostProbe — category calibration, needs_backend = False. Verdict ladder: global_step < min_steps_threshold → FAIL (primary signal); all per-param exp_avg_sq NaN → FAIL; per-layer ratio against the minimum layer's mean (not global mean — see module docstring on asymptotic-cap reasoning) flags layers > undertrained_layer_ratio; layer_failure_frac of layers crossing → FAIL/WARN. Configurable thresholds min_steps_threshold=50, undertrained_layer_ratio=2.0, layer_failure_frac=0.3.

  • probes/_training_state.py — torch.load wrapper for training_state.pt. Frozen TrainingStateSnapshot + ParamStat dataclasses. Lazy torch import (no cost on import dlm_sway for non-gradient-ghost users). Suppresses the FutureWarning around weights_only=False since dlm's pickled RNG state requires it.
  • probes/_param_id_mapping.py — maps optimizer-state integer param-ids to transformer-layer indices via the adapter's adapter_model.safetensors keys. Heterogeneous-per-layer counts raise ParamMappingError rather than silently mis-attributing. Uses safetensors.safe_open so we read keys without materializing tensors.
  • core/errors.MissingTrainingStateError — typed exception so the probe can SKIP cleanly when training_state.pt legitimately isn't there (non-dlm adapters), distinct from a parse error.

Probe-ABC + runner contract for pre-run probes.

  • probes/base.Probe.needs_backend: ClassVar[bool] = True — new opt-in flag. Default True preserves every shipped probe's contract.
  • probes/base.RunContext.backend: DifferentialBackend | None — was required, now optional. New RunContext.require_backend property narrows the type for mypy + raises a clear runtime error if a probe with needs_backend=True somehow gets a None backend.
  • probes/* sweep — every existing probe now reads ctx.require_backend.as_base() instead of ctx.backend.as_base(). No behavior change; mypy-correct under the new optional type.
  • suite/runner.run accepts backend=None. Pre-resolves all scheduled probes; if any has needs_backend=True while backend=None, raises BackendNotAvailableError with the offending probe kinds in the message. Skips trace-writer install, preflight finite-check, probe-label setting, and backend-stats snapshot when backend is absent.

sway check integration.

  • cli/commands.check_cmd runs gradient_ghost first in the quick battery (along with null_adapter + delta_kl + calibration_drift). On FAIL, emits a red ⚠️ banner before the regular verdict banner; on WARN, a yellow one. Informational — no exit-code change. Banner shows the probe's actionable message ("severely undertrained: global_step=N < threshold X").

Tests.

  • 17 unit tests (tests/unit/test_probe_gradient_ghost.py): registry / needs_backend flag / verdict-ladder branches (PASS / FAIL on global_step / FAIL on all-NaN / WARN / FAIL on many-layers / SKIP on missing) / _param_id_mapping correctness + error paths / _training_state loader edge cases / runner skip-backend contract.
  • 3 integration tests (tests/integration/test_probe_gradient_ghost.py, slow+online): real-store FAIL on ~/.dlm/store/01KPPFAB.../v0001 (skipped on machines without local dlm install — typical CI), synthetic converged → PASS, runner end-to-end with backend=None.
  • All 691 unit tests pass; mypy + ruff + format clean.

Sprint 24 — F01 PEFT→MLX adapter converter

Closes the audit's #1 major finding: the README pitched MLX as a co-equal backend, but the path required a pre-converted .npz adapter that nothing in the toolchain produced. With this sprint, dlm train → sway run on the MLX backend works end-to-end on any PEFT-trained LoRA adapter.

Converter (pure I/O, no torch dep).

  • backends/_mlx_convert.convert_peft_to_mlx — reads PEFT's adapter_model.safetensors + adapter_config.json, transposes LoRA matrices to MLX's layout, writes adapters.safetensors + mlx-lm-shaped adapter_config.json. Verified against PEFT >= 0.13 + mlx-lm 0.31.
  • Key remap. PEFT's base_model.model.<dotted>.lora_<A|B>.weight becomes MLX's <dotted>.lora_<a|b>. Modern PEFT keys (no .default adapter-name segment) and legacy .default.weight keys both supported.
  • Shape transpose. PEFT lora_A=(r, in) → MLX lora_a=(in, r); PEFT lora_B=(out, r) → MLX lora_b=(r, out).
  • Config remap. Writes fine_tune_type=lora, num_layers inferred from max layer index in the keys, lora_parameters with rank/scale=alpha÷r/dropout/keys (per-layer-relative attribute paths like self_attn.q_proj).
  • Errors. Missing files / non-LORA peft_type / invalid rank / unexpected key prefixes / dst-not-empty all surface as typed MlxConvertError with actionable messages. modules_to_save tensors (e.g. embed_tokens, lm_head overrides) are skipped with a per-key warning rather than crashing.
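The key remap + transpose rules above can be sketched on plain nested lists (the regex and error type here are illustrative, not the converter's internals):

```python
# Sketch of the PEFT -> MLX key/shape remap. Handles modern keys and
# legacy ".default" adapter-name keys; both LoRA matrices transpose:
# lora_A (r, in) -> lora_a (in, r); lora_B (out, r) -> lora_b (r, out).
import re

_KEY = re.compile(
    r"base_model\.model\.(.+)\.lora_([AB])(?:\.default)?\.weight$"
)

def remap(peft_key, weight):
    """weight is a nested list standing in for a safetensors tensor."""
    m = _KEY.match(peft_key)
    if m is None:
        raise ValueError(f"unexpected key prefix: {peft_key}")
    dotted, ab = m.groups()
    transposed = [list(col) for col in zip(*weight)]
    return f"{dotted}.lora_{ab.lower()}", transposed
```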

CLI surface.

  • sway convert-adapter [--target mlx] SRC DST [--overwrite] — thin wrapper over the converter. Prints a before/after size + rank/scale report; surfaces MlxConvertError with a non-zero exit code; warns on skipped modules_to_save keys via stderr.

MLX backend integration.

  • backends/mlx._ensure_mlx_adapter — auto-detect: if the adapter dir contains adapter_model.safetensors, run the converter into a content-hashed cache at ${XDG_CACHE_HOME:-$HOME/.cache}/dlm-sway/mlx-converted/<blake2b>/, point mlx-lm at the cache. If the dir already contains adapters.safetensors, pass through unchanged. Uses 16-byte blake2b on the source safetensors bytes — repeat loads on the same adapter version short-circuit (~10 ms hash + dir lookup).
  • backends/mlx._MLXView._forward_logits — adjacent fix: out[0].astype(mx.float32) before np.asarray so unquantized bf16/fp16 model outputs round-trip correctly. Pre-existing bug surfaced by the new e2e test against mlx-community/SmolLM2-135M-Instruct.

Tests.

  • tests/unit/test_mlx_convert — 20 tests across:
    • Helper functions (_strip_layer_prefix, _extract_layer_index).
    • Happy path: synthetic PEFT adapter → MLX adapter, expected file layout, config shape, rank/scale math, key transpose, value preservation.
    • Error paths: missing safetensors / config, non-LORA peft_type, invalid rank, dst-not-empty without --overwrite, unexpected key prefix, modules_to_save skip-and-report.
    • Auto-convert detection: pass-through on already-MLX dir, fresh convert on PEFT dir, cache short-circuit, unrecognized dir pass-through.
  • tests/integration/test_mlx_converter_e2e — 4 darwin-arm64 slow+online tests on real mlx-community/SmolLM2-135M-Instruct: XDG cache populated by backend init, next_token_dist returns finite top-k via converted adapter, logprob_of works, repeat load skips reconvert (mtime check). Skipped on non-darwin / missing [mlx] extra.

README.

  • New "MLX backend (Apple Silicon)" section with the two install paths (auto-convert via sway run vs. explicit sway convert-adapter). Documents the cache location and out-of-scope items (QLoRA, modules_to_save).

Sprint 23 — H1 batched backend execution

Opens the door to 3-5× wall-time reduction on HF-backend suites by amortizing model.forward kernel-launch + memory-transfer overhead across prompt batches. Real speedup lands on the HF backend; other backends gain a unified Protocol surface via loop fallback.

Protocol.

  • core/scoring.ScoringBackend — new method next_token_dist_batch(prompts, *, top_k) -> list[TokenDist]. Default implementation on the Protocol loops over next_token_dist; backends override when they can amortize. Adding the method to a runtime_checkable Protocol is a contract change — all shipped backend views + the stub used by test_scoring.py::test_scoring_backend_runtime_checkable gained an explicit method implementation.

HF backend — real batching.

  • backends/hf._HFView.next_token_dist_batch — calls tokenizer(prompts, padding=True, padding_side="left") and a single model.forward over the padded batch. Left-padding is mandatory for decoder-only LMs: the last-token position (what next_token_dist reads) must line up with the real end of each sequence, not pad tokens.
  • backends/hf._topk_to_token_dist — extracted helper shared by the single-prompt and batched paths so tail-mass accounting (B6) stays identical across both.
  • backends/_instrumentation.BackendInstrumentation.cached_batch — routes batched calls through per-prompt cache lookup before the forward fires, so cached + uncached prompts in one call still short-circuit per-entry. Callback contract: compute_misses(miss_indices) -> list[result] in miss order. Wall-time attribution divides the batch's time evenly across miss entries for scoring_wall_s bookkeeping.
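Why left-padding is mandatory can be seen with a toy padder: after left-padding, the last position of every row holds the sequence's real final token, so one last-position read works batch-wide. A pure-Python illustration, not the HF tokenizer call:

```python
# Left-pad token-id rows to a common width so the final positions
# of all rows align with each sequence's real last token.
def left_pad(seqs, pad=0):
    width = max(map(len, seqs))
    return [[pad] * (width - len(s)) + s for s in seqs]
```

With right-padding, shorter rows would end in pad tokens and the last-position logits would describe the distribution after a pad, not after the prompt.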

Instrumentation.

  • BackendStats — three new counters: batches_sent, batched_prompts, max_batch_size. avg_batch_size property computes batched_prompts / batches_sent (0 when no batches fired). All surfaced in to_dict().
  • suite/report._cache_line — the report footer's cache: N/M = P% line now suffixes | batches: K (avg=X.X) when any batched forward fired. Runs without batching render the cache line alone — pre-S23 footer shape preserved.

Probes — opt-in.

  • probes/base.Probe.batch_score: ClassVar[bool] = False — new flag. Probes set True when their scoring flows uniformly through next_token_dist on a prompt list.
  • probes/delta_kl.DeltaKLProbe — batch_score = True; run() now issues one batched call per view (inside single as_base / as_finetuned contexts) instead of 2×N context entries with per-prompt calls. Divergences computed via zip(base_dists, ft_dists, strict=True).
  • probes/cluster_kl.ClusterKLProbe — same treatment; same math.
  • Other probes (section_internalization, leakage, paraphrase_invariance, adapter_ablation, prompt_collapse, multi-turn-style probes) keep batch_score=False — each needs bespoke batching logic deferred to a follow-up sprint.

Dummy / MLX / API backends.

  • backends/dummy._DummyView.next_token_dist_batch — loops over self.next_token_dist (not _compute_next_token_dist) so subclass overrides (_NullView's seeded-noise perturbation, _InterpolatedView's lam-blend) fire correctly on the batched path. The dummy has no real forward to amortize; batching counters stay at zero on this backend by design.
  • backends/mlx._MLXView.next_token_dist_batch — per-prompt loop stub with a TODO: mx.array padded forward docstring. MLX's per-prompt forward on Apple Silicon is already fast enough that the kernel-launch amortization a real batched forward would buy is small relative to the HF CUDA/MPS case.
  • backends/api.ApiScoringBackend.next_token_dist_batch — per-prompt loop; OpenAI-compat HTTP servers score one completion per request and httpx connection pooling already amortizes network cost.

Tests.

  • tests/unit/test_backend_instrumentation — new TestBackendInstrumentationCachedBatch class (6 tests) pinning all-miss / partial-cache-hit / all-cached / wrong-length-return / max_batch_size / empty-prompts branches. TestBackendStats gained 2 tests for the new counters + avg_batch_size property.
  • tests/unit/test_batched_backend_s23 — new file with 5 tests: batch_score flag set on DeltaKLProbe, probe-level spy that confirms the batched method is called exactly once per view, byte-identical results vs serial, report footer batches-segment conditional rendering, empty-prompts short-circuit.

0.1.0 — 2026-04-24

First PyPI release. Alpha — API not guaranteed stable until v1.0. Rolls up every sprint through Sprint 22; earlier sprints accumulated under the rolling Unreleased header pre-release.

Sprint 22 — v0.1.0 release + F06 dlm-compat test + Docker image

First PyPI publish. Version bumped from 0.1.0.dev0 to 0.1.0. Pre-commit consumer rev: pins switch from commit SHA to v0.1.0. Closes Audit 03 F05 (SHA-pin resolves at tag) and F06 (dlm API compatibility regression test).

D-path1 — PyPI publish.

  • pyproject.toml — version 0.1.0. Development Status :: 3 - Alpha classifier already present.
  • src/dlm_sway/__init__.py — __version__ = "0.1.0".
  • README.md — real pip install "dlm-sway[hf]" recipe replaces the "Planned PyPI install (not live yet)" placeholder. Pre-alpha banner swapped for an explicit Alpha + semver-from-v1.0 disclaimer.

F06 — dlm API compatibility regression test.

  • core/errors.DlmCompatError — new typed exception. Raised when dlm's public surface drifts out of the contract sway's resolver depends on (currently: dlm.base_models.resolve(key).hf_id). Message includes installed-dlm version + a pip-install hint.
  • integrations/dlm/resolver — the except Exception: return base_model silent fallback is gone. resolve() raising or returning an object without hf_id now raises DlmCompatError (chained via __cause__ for tracebacks). An extra _installed_dlm_version() helper introspects importlib.metadata for the error message.
  • tests/unit/test_dlm_bridge — 2 new regression tests pin the two drift branches (missing hf_id; resolve itself raising).
  • tests/integration/test_dlm_api_compat — new slow+online+dlm integration test iterates dlm's registry keys and asserts every entry resolves to a spec with a plausible hf_id. Gracefully skipped (via pytest.importorskip) when the [dlm] extra isn't installed, so CI without dlm published to PyPI still passes.
  • pyproject.toml — [dlm] extra pinned to dlm>=0.9,<1.0. Upper bound tightens to the range the compat test has validated; bump when dlm cuts v1.0 and the contract is re-verified.

D-path2 — Docker image (sway-gate).

  • Dockerfile.gate — python:3.11-slim + dlm-sway[hf,semsim] at the build-time SWAY_VERSION ARG. Pre-fetches the MiniLM weights (sentence-transformers/all-MiniLM-L6-v2, ~80 MB) so adapter_revert / cluster_kl probes don't cold-download on first gate run. ENTRYPOINT ["sway"] — pre-commit hook passes gate as the first container arg.
  • .github/workflows/docker.yml — on v* tag push + manual dispatch. Builds on GHA, pushes to ghcr.io/tenseleyflow/sway-gate:{vX.Y.Z, latest}. Smoke-tests the pushed image with --version.
  • .pre-commit-hooks.yaml — new sway-gate-docker variant using language: docker_image pointing at ghcr.io/tenseleyflow/sway-gate:v0.1.0 gate. Third option alongside sway-gate (system PATH) and sway-gate-isolated (fresh venv).

F05 closure.

  • .pre-commit-hooks.yaml — isolated variant's additional_dependencies migrates from dlm-sway[hf] @ git+https://…@<SHA> to dlm-sway[hf]==0.1.0. Consumer rev: in .pre-commit-config.yaml pins to v0.1.0. SHA churn over.

README.

  • Real pip install recipes for every extra ([hf], [dlm], [all], …). "Install from source" kept for contributor workflow.
  • Pre-commit section: three hooks (system / isolated / docker) with first-run-cost comparison table.

Sprint 21 — Audit 03 closure

Closes the Audit 03 short list — 2 🟠 major + 5 🟡 minor findings the audit recommended for v0.1.0 readiness. Full audit at .docs/audits/03-final-audit.md. Verdict was YELLOW-leaning-GREEN with zero critical findings; this sprint pushes it to GREEN on the in-scope items. Deferred: F01 (MLX converter — its own feature sprint, S24), F05 (resolves naturally at v0.1.0 tag), F06 (dlm API compat test — lands with v0.1.0 release in S22). v0.1.0 PyPI publish is deferred to S22 (requires release coordination + credentials).

🟠 Major — degenerate z-score clipping (F02).

  • probes/null_adapter — null stats now carry an explicit degenerate: 1.0 field when the calibration ran but produced an unusable baseline (runs: 1, or every seed producing the exact same raw). The std floor at 1e-6 is preserved for valid-but-tight multi-seed nulls so they still calibrate. Audit observed a +290,766σ leakage-probe z under runs: 1; post-fix, z_score refuses and the probe falls back to fixed thresholds with a clear footer rollup.
  • probes/_zscore.z_score — refuses when stats["degenerate"] is truthy, independent of the std check. Belt-and-suspenders on top of the existing MIN_STD guard.
  • suite/report.collect_degenerate_null_kinds — new rollup surface. Terminal + markdown footers gain a "N probe kind(s) had a degenerate null baseline — bump runs:" block, distinct from the existing null-opt-outs rollup.
  • 12 new unit tests across test_null_calibration.py, test_zscore_helpers.py, test_report_extras_rollup.py, and test_probe_external_perplexity.py. Existing test_std_floor_prevents_runaway_zscore was rewritten to assert the fixed behavior (the pre-fix contract WAS the audit's bug).
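
The refuse-on-degenerate contract can be sketched as follows. This is a minimal illustration, not the shipped `probes/_zscore.z_score` (whose exact signature isn't shown here); it assumes `stats` is a plain dict carrying `mean`, `std`, and the new `degenerate` flag:

```python
MIN_STD = 1e-6  # existing floor for valid-but-tight multi-seed nulls


def z_score(raw, stats):
    # Refuse degenerate nulls outright, independent of the std floor;
    # the caller then falls back to fixed thresholds.
    if stats.get("degenerate"):
        return None
    std = max(stats["std"], MIN_STD)
    return (raw - stats["mean"]) / std
```

With a `runs: 1` null (`degenerate: 1.0`) the function returns `None` instead of the runaway +290,766σ figure the audit observed.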

🟡 Minor — CI resilience (F03).

  • pytest-timeout>=2.3 + tenacity>=9.0 in dev deps.
  • tests/fixtures/tiny_model.py wraps snapshot_download with an exponential-backoff tenacity retry (3 attempts, 5-10-20s backoff) plus etag_timeout=10 to bound per-file head probes. Benefits every slow+online test.
  • tests/integration/test_determinism_golden.py gains @pytest.mark.timeout(600). A silent network hang now surfaces as a test failure with actionable output (audit observed 20m workflow-timeout hang on Sprint 19 merge run 24747915467).
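
A pure-Python stand-in for the retry policy above (the real fixture uses tenacity; `with_retries` and its exception filter are hypothetical simplifications):

```python
import time


def with_retries(fn, *, attempts=3, delays=(5, 10, 20)):
    # Exponential-backoff retry: re-raise only after the final attempt.
    for i in range(attempts):
        try:
            return fn()
        except OSError:
            if i == attempts - 1:
                raise
            time.sleep(delays[i])
```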

🟡 Minor — outlier-miner pool guard (F04).

  • mining/outlier_miner.mine_outliers — raises SwayError when the pool has fewer than 2·top_k distinct scored prompts, with an actionable --top-k N hint. Pre-fix: 1-distinct-prompt pool produced top=[p], bottom=[p] (identical lists — no outlier contrast). Guard applies AFTER scoring so unsupported probe kinds still return the empty-result path.
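
The guard's logic, as a minimal sketch (the real code raises SwayError from inside mine_outliers; `guard_pool` and the use of ValueError are illustrative assumptions):

```python
def guard_pool(distinct_prompts, top_k):
    # With fewer than 2*top_k distinct scored prompts, the top-K and
    # bottom-K lists would overlap — no outlier contrast to report.
    n = len(distinct_prompts)
    if n < 2 * top_k:
        raise ValueError(
            f"pool has only {n} distinct scored prompts; "
            f"need >= {2 * top_k} (try --top-k {n // 2})"
        )
```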

🟡 Minor — autogen skipped-probes comment (F07).

  • integrations/dlm/autogen.collect_skipped_probe_reasons — new public helper returning (probe_kind, reason) tuples for probes _build_suite intentionally omitted for the given handle.
  • _render_annotated_yaml — when skipped is non-empty, header gains a # skipped: <kind> (<reason>) block. Users no longer have to diff the autogen source to understand which probes are missing from their generated sway.yaml.

🟡 Minor — CI Node.js 20 deprecation silence (F08).

  • .github/workflows/ci.yml sets FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: "true" at the workflow level. Silences the Node 20 deprecation warnings each CI log currently carries (deadline June 2026). Temporary until every action we pull bumps to Node 24 natively.

🟡 Minor — portable dlm_source (F09).

  • integrations/dlm/autogen._portable_dlm_source emits a cwd-relative path when the .dlm lives inside the cwd, absolute otherwise. The cwd-relative form survives cross-machine checkout (CI agents, other devs); the old absolute-path emission broke the moment a committed autogen'd YAML was run on a different host.
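
The emission rule can be sketched with pathlib (a simplified stand-in for `_portable_dlm_source`; the signature is assumed):

```python
from pathlib import Path


def portable_dlm_source(dlm_path, cwd):
    # Emit a cwd-relative path when the .dlm lives under cwd,
    # absolute otherwise, so committed autogen'd YAML survives
    # checkout on a different host.
    p, base = Path(dlm_path).resolve(), Path(cwd).resolve()
    try:
        return str(p.relative_to(base))
    except ValueError:
        return str(p)
```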

Other.

  • 640 unit tests pass (up from 626 before S21 — 14 new regression tests across F02/F03/F04/F07/F09).
  • Pragmatic deviations from original scope: S21 was audit-scoped to ~2 days of cleanup. Actual depth: 6 findings closed with 14 new regression tests. One existing test (test_std_floor_prevents_runaway_zscore) explicitly rewritten to flip its assertion — the pre-fix contract the test encoded WAS the audit's reported bug.

Sprint 19 — Pre-commit hook sway gate

Closes Audit 01 stretch-list F-item "pre-commit hook sway gate." Ships a .pre-commit-hooks.yaml declaring two hook variants so pre-commit.com users can gate their adapter on every commit that touches a spec, .dlm file, or adapter directory.

  • Two hooks, pick your posture.
    • sway-gate — language: system. Uses the sway install on the user's PATH. Fast, zero install cost. Recommended default.
    • sway-gate-isolated — language: python + git-based additional_dependencies (dlm-sway[hf] @ git+...@<SHA>). Self-contained — pre-commit builds a fresh venv and installs sway + torch + transformers on first run (~5 GB, ~2 min).
  • Rev pinning stance. Example config pins to a commit SHA, not HEAD. Sway is pre-v0.1.0 — no tagged release yet — and HEAD drifts silently on every pre-commit autoupdate. SHA pinning is the honest pre-release pattern; migration to rev: v0.1.0 is a README edit once the first release lands.
  • pass_filenames: false on both variants. sway gate takes exactly one spec path; the user declares it via args:. The files: regex decides when the hook fires. No ambiguity.
  • Scope discipline. Neither hook surfaces --json or --markdown flags. The hook gates (non-zero on FAIL) and nothing else. Users wanting report artifacts run sway run separately.
  • Readme "Pre-commit" section documents the consumer-side config for both variants, the SHA-pinning rationale, and the one-time install cost of the isolated variant.
  • examples/precommit-example/ — template sway.yaml, consumer-side .pre-commit-config.yaml, and a README walk-through.
  • tests/integration/test_pre_commit_hook.py (slow + online) — three cases: (a) parse-and-shape smoke test on .pre-commit-hooks.yaml; (b) pass-case runtime — spawns pre-commit run sway-gate as a subprocess against a tmp git repo with a real LoRA on SmolLM2-135M, asserts exit 0; (c) fail-case — same fixture with an impossible assert_mean_gte, asserts non-zero exit + gate FAILED banner.
  • pre-commit>=3.8 in [dependency-groups].dev so the integration test can invoke it as a subprocess. Keeps the tool out of the user's runtime deps.

Sprint 18 — Cross-platform determinism golden

Closes Audit 01 stretch-list F-item "cross-platform determinism golden test." The README's "deterministic on CPU where possible" claim held in theory; now it's pinned by CI on two platforms.

  • New module src/dlm_sway/core/golden.py. Tolerance-aware JSON comparator with compare_goldens(actual, expected, *, logprob_tol=1e-6, score_tol=1e-4) and mask_variable_fields for stripping timestamps, wall-seconds, duration_s, backend_stats, sway_version, and cwd-resolved path identifiers (adapter_id, base_model_id) before comparison. No torch dep; runs in the fast lane.
  • New integration test tests/integration/test_determinism_golden.py (slow + online). Builds a deterministically-seeded LoRA on SmolLM2-135M, runs a minimal 2-probe suite (delta_kl + calibration_drift), and diffs the JSON output against tests/golden/expected_<platform>.json. SWAY_UPDATE_GOLDENS=1 toggles regen mode; missing golden → SKIP with a regen recipe.
  • New CI matrix determinism-golden with strategy.matrix.os: [ubuntu-latest, macos-latest] in .github/workflows/ci.yml. Triggered by changes under tests/golden/**, src/dlm_sway/core/golden.py, or the test file itself (plus the standard schedule/dispatch/push triggers).
  • workflow_dispatch regen mode — dispatching the CI workflow with regenerate_goldens=true flips SWAY_UPDATE_GOLDENS=1 on both matrix legs and uploads the regenerated JSONs as per-platform artifacts. Meant for deliberate "yes I changed the algorithm" flows.
  • Tolerance rationale (documented in core/golden.py): 1e-6 for logprob-like numeric fields sits above typical BLAS-implementation drift (1e-8–1e-7 band between OpenBLAS on linux and Accelerate on darwin) but below real algorithm-change drift. 1e-4 for score fields absorbs composition noise.
  • 25 new unit tests for the comparator covering mask coverage, tolerance thresholds, structural diffs (missing keys, length mismatches, type mismatches), NaN/inf edge cases, and a realistic two-masked-payloads round-trip.
  • Pragmatic scope deviation: the sprint envisioned a checked-in 5 MB adapter binary. Instead the test builds it from a fixed seed at runtime (same pattern as test_external_perplexity_e2e and test_cluster_kl_e2e). Avoids the regeneration chore noted in the sprint's risks section.
  • Linux golden bootstrap: first PR ships expected_darwin.json only; the linux leg SKIPs with a recipe pointing at the workflow_dispatch regen mode. Maintainer dispatches, downloads the artifact, commits expected_linux.json, and the next CI run asserts cleanly on both platforms. One-time onboarding cost.
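
The tolerance-aware leaf comparison underneath compare_goldens can be sketched like this (`floats_match` is a hypothetical helper; the real comparator also walks nested dicts/lists and reports structural diffs):

```python
import math


def floats_match(actual, expected, tol):
    # NaN is treated as equal to NaN (a golden may legitimately pin one);
    # infinities must match exactly; finite values compare within tol.
    if math.isnan(actual) and math.isnan(expected):
        return True
    if math.isinf(actual) or math.isinf(expected):
        return actual == expected
    return abs(actual - expected) <= tol
```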

Sprint 17 — Adversarial paraphrase mining + outlier-prompt miner

Closes Audit 01 innovation item F11. Adds sway mine — an evaluation companion to paraphrase_invariance that surfaces the paraphrases a memorizing adapter most reliably fails on, plus a secondary mode that ranks delta_kl prompts by per-prompt divergence.

sway mine --mode paraphrase — per paraphrase_invariance case:

  1. Generate candidate paraphrases via nlpaug SynonymAug (WordNet, deterministic under a fixed seed; back-translation is a user-supplied escape hatch to keep the default fast and offline).
  2. Embed candidates through the shared MiniLM cache (same 80 MB load adapter_revert pulls) and greedy farthest-first select the top-K most pairwise-distant ones — dodges nlpaug's tendency to emit near-duplicate synonym swaps.
  3. Rank by per-token lift gap: `(ft(prompt, gold) - base(prompt, gold)) - (ft(candidate, gold) - base(candidate, gold))`. Large positive gap = the candidate breaks the adapter's lift = the adapter doesn't generalize there.
  4. Emit sway-mined-paraphrase.yaml with a mined_cases: block paste-compatible with paraphrase_invariance.cases.
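
Step 2's diversity filter can be sketched as greedy farthest-first selection over the candidate embeddings (a minimal numpy version; the function name and seeding are assumptions, and the real miner runs this on MiniLM embeddings):

```python
import numpy as np


def farthest_first(embeddings, k, seed=0):
    # Start from a seeded random point, then repeatedly pick the
    # candidate farthest from everything chosen so far — this dodges
    # near-duplicate synonym swaps, which sit close together.
    embs = np.asarray(embeddings, float)
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(embs)))]
    while len(chosen) < k:
        dists = np.min(
            [np.linalg.norm(embs - embs[c], axis=1) for c in chosen], axis=0
        )
        dists[chosen] = -1.0  # never re-pick a selected point
        chosen.append(int(dists.argmax()))
    return chosen
```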

sway mine --mode outliers — rank a prompt pool by per-prompt delta_kl.raw. Pool defaults to the spec's own delta_kl prompts; --from-corpus public_domain_en draws from S09's CC0 corpus instead. Emits top-K and bottom-K blocks: the highest-divergence prompts (best for tightening a gate) and lowest-divergence (candidates to drop from the suite). leakage and paraphrase_invariance have section/case-based specs; outlier mining on those is future work, documented inline in mining/outlier_miner.py.

  • New package src/dlm_sway/mining/ — two modules, paraphrase_miner.py and outlier_miner.py, plus a thin corpus_prompts helper for the --from-corpus path.
  • New CLI subcommand sway mine SPEC --mode paraphrase|outliers [--out FILE] [--top-k N] [--n-candidates N] [--from-corpus NAME] [--seed N]. Paste-compatible YAML emission by default; --out overrides the filename.
  • 18 new unit tests — 7 for the paraphrase miner (ranker, diversity filter, dedup, input validation), 8 for the outlier miner (delta_kl ranking, corpus wiring, unsupported-kind skip), 3 for the CLI (paraphrase + outliers modes + no-prompts error path).
  • Prove-the-value test at tests/unit/test_paraphrase_miner_prove_value.py. On a deliberately-memorizing dummy backend the hand-written paraphrase list passes with generalization_ratio > 0.5; substituting the mined list (same seed, same adapter) drops the ratio below 0.5 and flips the verdict to FAIL, with a ≥ 0.3 ratio gap.
  • Dependency footprint: no new extras. The paraphrase miner uses nlpaug (already in [style]) + sentence-transformers (in [semsim]); the diversity filter reuses the embedder adapter_revert already pulls. Graceful BackendNotAvailableError with a pip hint when the extras aren't installed.

Sprint 20 — Audit 02 closure

Closes the full finding inventory from .docs/audits/02-followup-audit.md (1 🔴 critical, 8 🟠 major, 11 🟡 minor, 4 stronger-test opportunities, 3 dead-code items). Sixteen sprints of innovation work had accumulated since Audit 01; the re-audit surfaced one ship-blocker (S14's bootstrap CI dropped at the runner boundary) plus a grab-bag of claim-vs-code gaps, dead branches, and untested contracts.

🔴 Critical — ship-blocker.

  • F01 — suite/runner.py:_with_duration now forwards every ProbeResult field. The pre-fix version silently dropped ci_95 on every probe, making S14's headline deliverable runtime-inert in any real sway run. Regression test at the runner boundary + snapshot fixture carrying a populated ci_95 pin the fix.

🟠 Major — claim-vs-code gaps.

  • F02 — real sklearn.cluster.KMeans exercised by two new unit tests (separation + seed determinism) + a slow+online integration test at tests/integration/test_cluster_kl_e2e.py. Before: every cluster_kl test monkeypatched _kmeans_cluster with an argmax stub, leaving S16's reason-for-being unverified.
  • F03 — sway list-probes falls back to the defining module's __doc__ when a probe class has no docstring. Every one of the 13 shipped probes now renders a summary row.
  • F04 — sway doctor probes plotly (the load-bearing [viz] dep), sklearn (S16 cluster_kl), and the api extras (httpx, tenacity).
  • F05 — README primitives table reflects 13 probes (adherence + cluster_kl, calibration + external_perplexity, baseline row for null_adapter).
  • F06 — TwoModelDifferential composes safe_for_concurrent_views from its two inner backends. Before: the attribute was missing and the runner defaulted to False even when both inners set True, breaking S13's concurrency claim at the only supported differential-use pattern.
  • F07 — integrations/dlm/autogen emits cluster_kl when the prompt pool clears 20 entries; markdown report gains a per-probe "Cluster breakdown" section with per-cluster mean KL + exemplars.
  • F08 — backends/dummy._NullView RNG uses hashlib.md5 instead of Python's PYTHONHASHSEED-salted hash(). Cross-process determinism test at tests/unit/test_cross_process_determinism.py pins the fix.
  • F09 — trace writer ↔ analyzer round-trip test at tests/unit/test_runner_backend_stats.py: runs a suite with trace_path=, loads the file back, asserts probe labels + hit/miss counts match backend_stats.

🟡 Minor — dead code, doc drift, latent brittleness.

  • F10 — removed dead _ft_view branch in probes/prompt_collapse.py.
  • F11 — pytest plugin resolves spec paths against config.rootpath, not process cwd.
  • F12 — backends/api._post_completions delegates to tenacity's Retrying() callable; removes the unreachable post-loop fallback.
  • F13 — preference_flip WARN branch emits ci_95 and z_by_rank for consistency with every other numeric probe's WARN shape.
  • F14 — adapter_ablation module docstring documents why ci_95 renders as em-dash (it's a curve-fit, not a sample-mean aggregator).
  • F15 — report footer surfaces null_adapter opt-outs (probes with calibrate_spec=None) in both terminal and markdown surfaces.
  • F16 — report category ordering derives from DEFAULT_COMPONENT_WEIGHTS and accepts unknown categories from custom Probe subclasses.
  • F17 — cluster_kl degenerate zero-variance case now returns Verdict.WARN with evidence["degenerate_zero_variance"]=True and suppresses z-score to avoid a spurious small-sample-noise calibration.
  • F18 — _calibration_pack.py gains per-section provenance notes (F18 audit trail) without noisy per-item annotations.
  • F19 — pytest plugin defers dlm_sway.core.result / dlm_sway.core.errors imports to call sites so non-`@pytest.mark.sway` users don't pay the load tax.
  • F20 — sway check --help documents the σ-banner's null-calibration dependency (fall-through to composite score band when null SKIPs).

Stronger-test opportunities — regression tests for things we weren't pinning.

  • #9 — probes/_divergence rejects effectively-uniform TokenDists (spread < 1e-9) via _check_non_degenerate_token_dist. Catches a shape-broken lm_head that would otherwise compute a trivial constant divergence across prompts. Two compatibility touchups rode with the change: dummy backend's synthesized ft dist and cluster_kl test fixture _dist_broad gained a tiny monotonic perturbation to clear the guard (real models never produce bit-uniform logits thanks to fp32 accumulation noise).
  • #10 — cross-verdict consistency test (tests/unit/test_cross_verdict_consistency.py): runs the same spec through sway run, sway gate, and sway report --format junit and asserts identical per-verdict tallies.
  • #11 — sway doctor --json schema-shape snapshot test locks the top-level keys and the per-extra module-name set.
  • #12 — subprocess determinism test asserts _NullView.next_token_dist is byte-identical across PYTHONHASHSEED values; pins F08's fix.
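
The #9 spread guard reduces to a one-liner in numpy (a sketch; the real `_check_non_degenerate_token_dist` operates on the TokenDist type and its error class, both elided here):

```python
import numpy as np


def check_non_degenerate(probs, eps=1e-9):
    # An effectively-uniform distribution (max-min spread below eps) is
    # the symptom of a shape-broken lm_head, not a real model output.
    p = np.asarray(probs, float)
    if float(p.max() - p.min()) < eps:
        raise ValueError("token distribution is effectively uniform")
```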

Dead-code inventory — DC3–DC5.

  • DC3 — null_adapter.py's pre-S10 cache-promotion branch is annotated as legacy; removal queued for the next minor.
  • DC4 — dropped the stderr warning at suite/runner.py that fired whenever concurrent_probes > 1 even though execution never fanned out. The spec field is still accepted (future pool will light it up); the noisy warning is gone.
  • DC5 — tests/unit/test_model.py grows coverage of ModelSpec's dtype enum, endpoint, trust_remote_code, and custom/api kind branches.

Final state: 582 unit tests + 3 integration tests passing; mypy strict clean across 55 source files; ruff + format clean across 137 files. Sprint 20 is a single PR composed of per-finding, per-file commits so the history reads as a fix-by-fix cleanup.

Sprint 16 — Cluster-coherent KL probe

Closes Audit 01 innovation item F8. Adds a new cluster_kl probe that answers the question delta_kl can't: the mean divergence may be the same, but did the adapter shift the right topics, or is it a uniform blunt-instrument shift?

  • New probe cluster_kl (category: adherence). Embeds prompts via shared MiniLM (same cache key as adapter_revert), k-means clusters them at a fixed seed, measures per-prompt JS divergence between base and ft, and reports a specificity ratio: between_variance / (between_variance + within_variance). Range [0, 1]: ≈ 0.5 on a blunt adapter that shifts every topic by the same amount, → 1.0 on a topic-targeted adapter. Pair (mean_kl, specificity) tells a more honest story than either number alone.
  • Bootstrap CI on specificity. Resamples (divergence, cluster_label) pairs with replacement and takes the 2.5/97.5 percentiles of the bootstrap ratio distribution. Returns None below 4 prompts — matches the convention core.stats.bootstrap_ci uses.
  • Z-score calibration against null_adapter baseline via the existing _zscore helpers. calibrate_spec synthesizes 8 mixed-topic sentinel prompts + k=2 for the null pass; the specificity distribution concentrates around 0.5 there, so a real adapter's separation above that baseline reads as a clean z-score.
  • Degenerate-input policy. < min_prompts (default 20) → SKIP with a clear message; num_clusters * 2 > num_prompts → SKIP (per-cluster mean not well-resolved); empty prompt list → ERROR. Missing [semsim] extras → SKIP with a pip install 'dlm-sway[semsim]' hint.
  • Zero-variance fallback. If every prompt produced the same divergence (canned stub data, no adapter motion), the specificity ratio is mathematically undefined; we return 0.5 — the null-adapter expectation — so downstream z-score reports "no signal" instead of NaN.
  • New dependency. scikit-learn>=1.4 added to the [semsim] extra (riding the 80 MB MiniLM load that probe already pulls in). Mirrored to [all] and to mypy's stubless-override list.
  • 7 unit tests in tests/unit/test_probe_cluster_kl.py: two-topic adapter → high specificity, uniform adapter → 0.5 fallback, too-few prompts → SKIP, empty prompts → ERROR, num_clusters > prompts/2 → SKIP, CI bracketing, missing-extras SKIP path.
  • Prove-the-value test at tests/unit/test_cluster_kl_prove_value.py. Two backends with comparable delta_kl (ratio < 3×) have specificity scores that split by at least 0.3 — concrete evidence that cluster_kl surfaces a structural distinction delta_kl merges.
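
The specificity ratio above is a standard between/within variance decomposition; a minimal sketch (function name assumed) over per-prompt divergences and their cluster labels:

```python
import numpy as np


def specificity_ratio(divergences, labels):
    # between_variance / (between_variance + within_variance):
    # ~0.5 when every cluster shifts alike, -> 1.0 when the shift
    # concentrates in specific clusters.
    d = np.asarray(divergences, float)
    lab = np.asarray(labels)
    grand = d.mean()
    between = within = 0.0
    for c in np.unique(lab):
        group = d[lab == c]
        between += group.size * (group.mean() - grand) ** 2
        within += ((group - group.mean()) ** 2).sum()
    total = between + within
    if total == 0.0:
        return 0.5  # zero-variance fallback: the null-adapter expectation
    return between / total
```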

Sprint 15 — pytest plugin (@pytest.mark.sway)

Closes Audit 01 innovation item F10. Packages sway as a pytest library so teams already running pytest adopt sway with a single decorator instead of a subprocess wrapper.

  • New plugin (src/dlm_sway/pytest_plugin.py) auto-loaded via the pytest11 entry point after pip install 'dlm-sway[pytest]'. @pytest.mark.sway(spec="...", threshold=0.0, weights=None) expands a single pytest function into one test item per probe in the referenced spec + an optional __gate__ item that fires only when the composite score drops below threshold.
  • Verdict translation. FAIL / ERROR → pytest Failed, SKIP → pytest Skipped, WARN → pytest warning, PASS → pytest pass. Probe-level failures isolate: a failing adherence probe doesn't mask a failing calibration one, and pytest -k adherence runs just that probe.
  • Suite runs once per decorated function. A session-scoped _SuiteCache keyed on (spec_path, weights) ensures the N-way item expansion doesn't multiply backend wall time. Two @pytest.mark.sway tests against the same spec share one run.
  • Malformed marks fail cleanly. Missing spec, non-numeric threshold, non-dict weights, and unknown kwargs produce a synthetic _ConfigErrorItem with a green-field pytest failure line — no cryptic collection-time tracebacks.
  • New [pytest] extra — just pytest>=8.0. Matches the pattern other plugin-style extras use. Also registers the pytest11 entry point so the plugin is discovered automatically on install, consistent with pytest-cov / pytest-xdist.
  • Example directory at examples/pytest_integration/ with a minimal sway.yaml + test_sway_gate.py showing the decorator replacing a legacy subprocess.run(["sway", "gate", ...]) wrapper. README gains a "Pytest integration" section pointing at it.
  • 13 new unit tests via pytest's canonical pytester fixture: marker registration, N-item expansion, FAIL / SKIP / ERROR routing, gate below-threshold failure, gate above-threshold pass, gate absence when threshold=0, malformed-mark error paths, cache sharing across multiple decorated tests.
  • Drive-by fix for the S14 CI-narrowing test flake. The dummy backend's per-prompt noise was seeded via Python's hash(), which is salted per-process via PYTHONHASHSEED. Swapped to a stable hashlib.md5-derived seed so the narrowing invariant holds deterministically across test-order permutations.

Sprint 14 — Bootstrap CIs + forward-pass trace CLI

Closes Audit 01 innovation items F9 (bootstrap confidence intervals on raw metrics) + F12 (polished forward-pass trace CLI). Paired sharpenings of existing probe output and existing trace infrastructure.

  • New core/stats.bootstrap_ci — percentile-bootstrap 95% CI on any sequence of per-sample measurements. Numpy-only, 1000-resample default, seeded from ctx.seed so intervals are reproducible. Short-circuits on non-finite / empty / degenerate-constant inputs; vectorized resample keeps the overhead ~1 ms per probe.
  • ProbeResult.ci_95: tuple[float, float] | None — new field, threaded through safe_finalize (nulled if raw is nulled, so a CI never brackets a defensive-null point estimate). to_json / from_json persist the pair as a two-list.
  • Six aggregating probes emit ci_95: delta_kl (over per-prompt divergences), calibration_drift (over per-item regression indicators), external_perplexity (over per-chunk deltas), leakage (over clean-recall rates), paraphrase_invariance (over verbatim lifts), section_internalization (over effective SIS scores). Each also lands the interval under evidence["raw_ci_95"] for JSON consumers.
  • Report tables add a ci95 column between raw and z in both terminal and markdown output. format_ci renders as [lo, hi] with em-dash for missing/non-finite. Markdown snapshot refreshed; JSON snapshot refreshed for the new per-probe ci_95 field.
  • New sway trace <jsonl> CLI. Reads the forward-pass trace JSONL the S07 runner writes when --trace <path> is set, aggregates into per-probe and per-view wall-time + hit-rate tables, plus a top-N slowest-events table. --format terminal|md|json, --slowest K to tune the tail size. The parsing tolerates old trace shapes (missing probe / hit fields) and skips malformed lines.
  • suite/trace_analysis — typed dataclasses (TraceEvent, ProbeSummary, ViewSummary, TraceReport) + three renderers (terminal / markdown / json) matching the conventions of suite/report and suite/compare.
  • Committed trace fixture at tests/fixtures/trace_sample.jsonl (8 events, 2 probes × 4 prompts with cache hits). Drives 22 new trace-analysis + CLI tests.
  • Prove-the-value (F9): on a dummy backend with per-prompt variation, delta_kl's CI narrows from [0.33, 0.41] (width 0.079) at N=4 prompts to [0.38, 0.42] (width 0.044) at N=32 — 1.8× tighter in width, tracking the √(N₂/N₁) ≈ 2.8× theoretical scaling minus dummy-backend dispersion.
  • Prove-the-value (F12): the CLI's terminal output on the committed fixture correctly identifies sis/base (520.3 ms) as the single slowest event, sis (1,529 ms) as the slower probe over dk (529 ms), and ft (1,357 ms) as the slower view over base (701 ms).
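
A minimal numpy sketch of the percentile bootstrap described above (the shipped core/stats.bootstrap_ci is seeded from ctx.seed; the signature here is an assumption):

```python
import numpy as np


def bootstrap_ci(samples, n_resamples=1000, seed=0):
    # Percentile-bootstrap 95% CI on the mean. Short-circuits to None
    # on empty, non-finite, or degenerate-constant input; the resample
    # is vectorized as one (n_resamples, n) index matrix.
    x = np.asarray(samples, float)
    if x.size == 0 or not np.isfinite(x).all() or np.ptp(x) == 0.0:
        return None
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, x.size, size=(n_resamples, x.size))
    means = x[idx].mean(axis=1)
    return float(np.percentile(means, 2.5)), float(np.percentile(means, 97.5))
```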

Sprint 13 — OpenAI-compatible HTTP scoring backend

Closes Audit 01 innovation item F7. Unlocks sway against hosted fine-tunes — OpenAI platform, vllm serve, Ollama — without requiring a local torch + PEFT load.

  • New backend ApiScoringBackend (backends/api.py): scores against a /v1/completions endpoint via httpx. Implements ScoringBackend only (not DifferentialBackend) since one endpoint is one model; users compose two ApiScoringBackend instances behind the existing TwoModelDifferential wrapper for a full differential run. Method surface: logprob_of via echo=True+logprobs=0, rolling_logprob via the same, next_token_dist via max_tokens=1+logprobs=K, plus preflight_finite_check and an ApiScoringBackend.generate for probes that need text output (e.g. leakage).
  • Token-boundary handling. The API returns tokens as strings, so logprob_of walks the echoed tokens by character length until the running total covers the prompt, then sums the rest. When the boundary falls mid-token, the partial token lands on the prompt side — over-counts the prompt, under-attributes to the completion — and the behavior is asserted by a dedicated test (test_mid_token_prompt_leans_conservative).
  • Retry + preflight. Tenacity handles 5xx + network errors with exponential backoff (configurable max_retries). Preflight hits the endpoint once with hello, max_tokens=1, logprobs=1; rejects non-finite logprobs before the suite runs.
  • First backend with safe_for_concurrent_views=True. HTTP is stateless, so the S07 concurrent-probe scheduler can dispatch against an API backend in parallel as soon as the pool implementation lands (still scaffolding in the runner).
  • ModelSpec additions: BackendKind gains "api"; ModelSpec.endpoint field carries the server's base URL (the /v1/completions path is appended by the backend). API key comes from SWAY_API_KEY, falling back to the OPENAI_API_KEY env var, so secrets stay out of the YAML.
  • backends.build dispatch routes kind="api" to ApiScoringBackend. Combined with build_two_separate + TwoModelDifferential, a YAML with defaults.differential: false and two kind: api models Just Works end-to-end.
  • New [api] extra — httpx>=0.27 + tenacity>=9.0. Core sway stays torch-free; the [api] extra adds ~1 MB of deps vs the [hf] extra's 3 GB.
  • Unit tests (21 new) use httpx's MockTransport to intercept every call and assert numeric outputs match canned OpenAI-shaped responses — covers all three scoring methods, cache dedup, 4xx error surfacing, 503→200 retry recovery, API-key env fallback, NaN-response preflight rejection, and the token-boundary math.
  • Prove-the-value (§F7): tests/integration/test_api_ollama.py is opt-in via SWAY_OLLAMA_URL + SWAY_OLLAMA_MODEL. Runs the full scoring surface against a live Ollama serving a small model (e.g. llama3.2:1b) and asserts finite output + preflight pass. Documents the wall-time budget the "≤3× HF backend" claim rests on once the concurrent-dispatch pool lands.
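
The token-boundary walk can be sketched as follows (`split_echoed_logprobs` is a hypothetical helper; the real backend does this inside logprob_of against the echoed completions payload):

```python
def split_echoed_logprobs(tokens, logprobs, prompt):
    # Walk echoed tokens by character length until the running total
    # covers the prompt. A mid-token boundary lands the partial token
    # on the prompt side: over-counts the prompt, under-attributes to
    # the completion — the conservative direction.
    covered, i = 0, 0
    while i < len(tokens) and covered < len(prompt):
        covered += len(tokens[i])
        i += 1
    return sum(logprobs[:i]), sum(logprobs[i:])
```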

Sprint 12 — Interactive HTML report

Closes Audit 01 innovation item F6. Adds the exploration surface to sit alongside the terminal (CI logs) and markdown (PR artifacts) outputs — a single-file interactive HTML page for the research / write-up case.

  • sway report result.json --format html --out report.html emits a self-contained HTML page with five interactive Plotly panels: composite-score gauge with banded thresholds, per-category horizontal-bar breakdown, per-section SIS bar chart (when section_internalization evidence carries a per_section array), adapter-ablation response curve (when adapter_ablation evidence carries lambdas + mean_divergence_per_lambda), and an all-probe score × z-score scatter with hover tooltips. The terminal verdict palette (green / yellow / red) carries across every panel.
  • Plotly bundle inlined once in <head>; each panel is a Plotly-produced <div> with a stable sway-* id so snapshot tests don't churn on Plotly point releases. No external <script src> / <link href> references — the page loads offline from a single ~4.9 MB file (Plotly 6.x JS is the majority of that; the sway wrapper + chart data is ~40 KB).
  • CLI gates --out PATH on file-producing formats. --format html requires --out (megabytes of JS have no business on stdout); --format md|json|junit now also accept --out for symmetry with the HTML path. --format terminal rejects --out — the terminal renderer is for the console only.
  • plotly>=5.20 is optional, shipped via the existing [viz] extra (alongside matplotlib). Without it, --format html exits 2 with the pip install 'dlm-sway[viz]' install hint. Graceful ImportError → RuntimeError translation tested end-to-end.
  • Snapshot at tests/snapshots/report.html locks the Sway-owned wrapper structure. The Plotly JS bundle is stripped from the snapshot (replaced with a placeholder) so the 43-line snapshot stays human-reviewable and Plotly version bumps don't drift it.
  • Prove-the-value: test_html_from_real_history_loads_offline renders HTML from the committed tests/fixtures/sway-history/02-* run (4 probes, real exported payload) and asserts: file parses via html.parser, zero external <script src> / <link href> references, every probe name appears in the body, file size lands between 1 MB and 10 MB. Measured output: 4.87 MB.

Sprint 11 — sway compare across saved JSON runs

Closes Audit 01 innovation item F5. Ships the regression-dashboard primitive every CI integration reached for but had to script.

  • New CLI subcommand sway compare. Accepts N saved result JSONs (typically from sway run --json), rehydrates each via report.from_json, folds them into a score matrix, and renders: a per-probe score table with columns per run, per-adjacent-pair delta columns (colored red on drop / green on lift), and a composite-score timeline row. Formats: terminal (default, Rich), --format md, --format json.
  • --fail-on-regression <threshold>. Exits 1 when any probe's score in the newest run dropped ≥ threshold vs the prior run. threshold=0 (default) disables the gate. Gate fires after the output is emitted so CI logs always capture the matrix even on red builds.
  • suite/compare.py module. Clean separation: build_matrix folds (SuiteResult, SwayScore) pairs into a CompareMatrix dataclass; render_{terminal,markdown,json} consume that. No filesystem IO in the module itself — the CLI owns the reads, the renderers own the writes. Probes that disappeared between runs show as None / em-dash in the matrix; new probes show as None on older runs. Union of probe names is sorted for stable row order across invocations.
  • Markdown snapshot locked at tests/snapshots/compare.md — silent schema drift breaks the test like every other snapshot.
  • Committed history fixture + prove-the-value test. tests/fixtures/sway-history/{01,02,03}-*.json ship a three-run narrative: baseline → retrained-improved → over-trained. Run 03 plants a section_internalization drop of 0.22 and a calibration_drift drop of 0.25 while delta_kl still rises (the memorization-without-generalization failure mode). test_compare_catches_planted_regression invokes sway compare sway-history/*.json --fail-on-regression 0.10 and asserts exit=1 with both regressed probes surfaced in the JSON payload and terminal text — exactly the F5 "CI gate the build on regression" experiment.
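
The matrix fold and gate check reduce to a few lines; a sketch assuming each run is a {probe_name: score} dict, oldest to newest (build_matrix's real input is (SuiteResult, SwayScore) pairs, and `regressed_probes` is a hypothetical name):

```python
def build_matrix(runs):
    # Union of probe names, sorted for stable row order; probes missing
    # from a run show as None.
    names = sorted({name for run in runs for name in run})
    return {name: [run.get(name) for run in runs] for name in names}


def regressed_probes(matrix, threshold):
    # Probes whose newest score dropped >= threshold vs the prior run.
    out = []
    for name, scores in matrix.items():
        prev, newest = scores[-2], scores[-1]
        if prev is not None and newest is not None and prev - newest >= threshold:
            out.append(name)
    return out
```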

Sprint 10 — Multi-rank adversarial null adapters

Closes Audit 01 innovation item F4. Extends the S02 null-calibration matrix with a per-rank profile — users now read "how rank-saturated is my adapter?" straight off the report.

  • NullAdapterSpec.rank_multipliers: list[float] = [1.0] (new). Default preserves single-rank behavior byte-for-byte. Setting [0.5, 1.0, 2.0] calibrates three independent null distributions per probe kind and emits a z-profile alongside the verdict z-score.
  • NullCalibratedBackend.as_null_adapter gains a rank_scale kwarg. Both shipped backends (dummy + HF) implement it by scaling the null-weight noise std by sqrt(rank_scale) — mathematically equivalent to a rank change in terms of the LoRA output variance (A·B is a sum of r rank-1 outer products; variance is linear in r). No tensor-shape surgery, no model reload per multiplier.
  • RunContext.null_stats_by_rank threads the per-rank matrix through the runner. Keys are canonical rank_{mult:.2f} strings; inner structure matches null_stats.
  • Numeric probes emit evidence["z_by_rank"]. Every numeric probe (delta_kl, calibration_drift, external_perplexity, leakage, adapter_revert, adapter_ablation, paraphrase_invariance, preference_flip, prompt_collapse, section_internalization, style_fingerprint) computes per-rank z-scores using its existing sign convention. The verdict path still reads the 1.0x group — no existing thresholds shift.
  • Report surfaces the profile. Terminal + markdown probe notes append rank profile: +4.2σ @ 1x / +6.8σ @ 0.5x / +2.1σ @ 2x whenever multi-rank calibration ran. format_z_profile helper centralizes the rendering.
  • Null-stats disk cache widens key to include rank_multipliers. Single-rank caches from pre-S10 runs are still readable — they're promoted to the new null_stats_by_rank shape on load.
  • Bug fix (external_perplexity): drop erroneous z sign flip. mean_delta is higher-is-better (ft logprob minus base logprob on external prose), so the raw z-score maps directly onto z >= assert_z_gte. S09's sign flip was reversing the pass/fail direction when the null-calibration path ran.
  • README: rank-profile interpretation guide. Explains when the shape indicates rank saturation vs. rank oversizing, plus the "low rank can be pathologically quiet" dual-reading caveat.
  • Prove-the-value (tests/unit/test_null_multi_rank.py): on a fixed adapter with the dummy backend, the delta_kl z-profile is strictly monotone in inverse rank (z@0.5x > z@1.0x > z@2.0x) — exactly the signature of a rank-scaled null distribution.
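
The variance argument behind rank_scale can be checked numerically. In this sketch (sizes, sigma, and trial counts are arbitrary), the mean entry variance of a LoRA delta B·A grows linearly in rank; since Var[c·X] = c²·Var[X], scaling a fixed-rank noise std by sqrt(rank_scale) reproduces exactly that factor.

```python
import random

random.seed(0)

def lora_entry_var(r, d=32, k=32, sigma=0.02, trials=40):
    """Mean second moment of entries of delta_W = B @ A with iid N(0, sigma^2) factors.

    Each entry sums r independent products, so Var[delta_W_ij] = r * sigma^4:
    output variance is linear in rank. (Entries have zero mean, so the second
    moment estimates the variance directly.)
    """
    acc, n = 0.0, 0
    for _ in range(trials):
        B = [[random.gauss(0.0, sigma) for _ in range(r)] for _ in range(d)]
        A = [[random.gauss(0.0, sigma) for _ in range(k)] for _ in range(r)]
        for i in range(d):
            for j in range(k):
                v = sum(B[i][m] * A[m][j] for m in range(r))
                acc += v * v
                n += 1
    return acc / n
```

Doubling the rank roughly doubles the entry variance, which is the equivalence the backends exploit instead of reshaping tensors per multiplier.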

Sprint 09 — External-perplexity-gap probe

Closes Audit 01 innovation item F3. First innovation sprint landed on top of the audit-closure campaign (S01–S07).

  • New probe external_perplexity (probes/external_perplexity.py): rolling-logprob delta of ft vs base on held-out public-domain English prose. Raw metric is mean_delta_nats (per-token). Negative = adapter raised perplexity on external text (diffuse forgetting); positive = adapter improved English modeling incidentally. Category calibration; scored alongside calibration_drift.
  • Packaged public-domain corpus (probes/_corpora/public_domain_en.txt, ~14 KB): hand-assembled from twelve US-public-domain passages (Lincoln, Emerson, Twain, Thoreau, Austen, Darwin, Shakespeare, Franklin, Melville, Dickinson, US Constitution, Federalist #10). Each passage carries an inline # -- source: provenance comment identifying the work and the "US public domain (pre-1929)" basis. Loader strips the comments at read time so the probe sees only raw prose.
  • Null calibration: calibrate_spec() returns a 4-chunk cheap version so null_adapter produces per-kind stats without dominating suite runtime. Sign-flipped z-score — lower raw delta is worse, so the shared z >= assert_z_gte semantics read as "σ better than noise" on external fluency.
  • autogen wiring: emits an external_ppl suite entry (with corpus=public_domain_en, max_chunks=8) whenever the source .dlm has any PROSE section. Intent table explains it as the "diffuse-forgetting complement to calibration_drift."
  • Prove-the-value test (tests/unit/test_ext_ppl_vs_calibration_drift.py): a dummy backend that applies a uniform −0.3 nats/token drift to every pack item and every corpus chunk — below calibration_drift's 1.0-nat per-item regression threshold, above its −0.5 mean-delta gate, but well below external_perplexity's −0.1 fixed-threshold. calibration_drift reports PASS, external_perplexity reports FAIL. The two probes measure different failure modes and do not substitute for each other.
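
A minimal sketch of the raw metric, assuming chunk-level (ft, base, token-count) rolling-logprob triples; the function names are illustrative, not the probe's real API:

```python
def mean_delta_nats(chunks):
    """Per-token mean of ft-minus-base rolling log-probs over corpus chunks.

    chunks: list of (ft_logprob_nats, base_logprob_nats, n_tokens) tuples.
    Negative result = the adapter made external prose less likely
    (diffuse forgetting); positive = incidental improvement.
    """
    total_delta = sum(ft - base for ft, base, _ in chunks)
    total_tokens = sum(n for _, _, n in chunks)
    return total_delta / total_tokens

def fixed_threshold_verdict(delta, threshold=-0.1):
    # Fixed-threshold fallback described above: FAIL once the adapter
    # costs more than |threshold| nats/token on held-out prose.
    return "PASS" if delta >= threshold else "FAIL"
```

Under this shape, the prove-the-value scenario's uniform −0.3 nats/token drift lands below the −0.1 threshold and fails, while calibration_drift's per-item gates never trip.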

Sprint 07 — Performance & caching

Closes Audit 01 findings B19 (deferred per design note), E-cache-opportunity.

  • Forward-pass cache (backends/_instrumentation.py): every backend view now routes next_token_dist / rolling_logprob / logprob_of through a bounded LRU keyed on (op, view_id, prompt_hash, top_k). Baseline A/B on a tiny 4-probe suite against SmolLM2-135M on CPU: 2.33s → 1.76s (25% wall-time reduction, forward passes 88 → 62, hit rate 30%). The audit's 18.5s Quillstone suite has more overlap and would see larger gains; the 25% floor holds on any suite with repeated prompts across probes.
  • Cache key includes view_id: "base" / "ft" / "scaled_1.25" / "null_42" are distinct namespaces. A future toggle regression that fails to flip the adapter would surface as wrong-side cached values immediately, not silently corrupt divergence math.
  • Backend stats surface in the report: SuiteResult.backend_stats captures cache_hits / cache_misses / forward_passes / scoring_wall_s / hit_rate. Terminal + markdown footers render cache: 26/88 = 30% when stats are present.
  • Forward-pass tracing: sway run --trace <path.jsonl> writes one event per backend scoring call (probe / view_id / prompt_hash / top_k / op / wall_ms / hit). Zero overhead when unset.
  • concurrent_probes scaffolding: spec.defaults.concurrent_probes: int = 1 field plus safe_for_concurrent_views: bool = False class attribute on HF / MLX / Dummy backends. The runner warns on stderr when the user requested > 1 against an unsafe backend and stays sequential. Custom backends that are already concurrency-safe (e.g. a stateless hosted-API backend) can opt in without waiting for the HF fix.
  • B19 design note: .docs/design/backend-concurrency.md documents current state, why v0.1 doesn't fix it, and two future paths (per-thread lock vs per-worker pool) with recommendation.
  • Generation stays uncached: view.generate() intentionally bypasses the cache — probe-side callers vary (prompt, max_new_tokens, temperature, seed) in ways that would rarely collide, and caching sampled output would hide seed bugs behind stale strings.
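
The bounded LRU described above can be sketched like this; the class and method names are hypothetical, but the key shape (op, view_id, prompt_hash, top_k) matches the changelog:

```python
from collections import OrderedDict
import hashlib

class ForwardCache:
    """Hypothetical sketch of the forward-pass LRU, keyed per view namespace."""

    def __init__(self, maxsize=1024):
        self._store = OrderedDict()
        self.maxsize = maxsize
        self.hits = self.misses = 0

    @staticmethod
    def key(op, view_id, prompt, top_k):
        prompt_hash = hashlib.sha256(prompt.encode()).hexdigest()[:16]
        return (op, view_id, prompt_hash, top_k)

    def get_or_compute(self, key, compute):
        if key in self._store:
            self.hits += 1
            self._store.move_to_end(key)     # refresh LRU position
            return self._store[key]
        self.misses += 1
        value = compute()
        self._store[key] = value
        if len(self._store) > self.maxsize:
            self._store.popitem(last=False)  # evict least-recently-used
        return value
```

Because view_id is part of the key, "base" and "ft" results can never alias each other, which is what makes a broken adapter toggle surface immediately rather than corrupt divergence math silently.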

Sprint 06 — CLI & report UX polish

Closes Audit 01 findings D3, D4, D5, D6, D7, D8, D9, D10, D11, D12, D13, D14, D15, B16, B17, B18.

  • D3 — extras rollup footer: terminal + markdown reports now end with a single pip install 'dlm-sway[...]' line collecting every extra mentioned in SKIP messages. Single rollup, no per-row scan.
  • D4 — sway check infers --base: reads base_model_name_or_path from the adapter's adapter_config.json when --base is omitted; echoes "(inferred base model: …)" so the user knows what got picked. Falls back to a clear required-flag error when the field is missing.
  • D5 — sway autogen annotates the YAML: generated specs carry a header block (source .dlm, dlm_id, base, adapter, generation timestamp, sway version) and a one-line # intent comment above every probe entry. No new dep — pyyaml + a position-based post-processor instead of ruamel.yaml.
  • D6 — sway run --dry-run + sway list-probes: dry-run validates the spec, prints a probe table (#, name, kind, category, enabled), and exits 0 without building a backend. list-probes prints every shipped probe kind with its category and the first line of its docstring.
  • D7 — sway doctor --json: machine-readable doctor payload (sway_version, python, platform, extras) keyed by extra and module. CI-grep-friendly.
  • D8 — dlm_source resolution warns instead of silently swallowing: when the spec sets dlm_source but the [dlm] extra isn't installed (or the bridge errors), the runner emits a yellow stderr warning with the install hint and continues with sections=None. No more silent "why are my section probes SKIPping?" debugging.
  • D9 — markdown column parity with terminal: the markdown probes table now carries score, raw, z, duration, and note columns. A Top findings section follows the table; a Skipped probes section closes when extras are missing.
  • D10 — unified number formatters: format_score, format_raw, format_z, format_duration_s live in suite/report.py and are the only numeric formatters used. Thousands separators above 1 000; glyph for None / non-finite. Single source kills cross-surface drift.
  • D11 — sway report --format validated via StrEnum: the Typer-level enum rejects unknown formats with the standard "Invalid value for '--format'" error. No more silent fallback to the terminal renderer on a typo.
  • D12 — sway check verdict banner: prints ✅ adapter is +4.2σ above noise (green ≥ 3σ), ⚠️ adapter is +1.5σ above noise — marginal (yellow ≥ 1σ), or ❌ adapter is +0.3σ — indistinguishable from noise (red) above the full report. Calibrated on the delta_kl z-score; check now runs null_adapter first so the z-score is available.
  • D13 — sway diff regression summary: after per-probe deltas, the diff prints A→B: N regressed >0.10, M regressed >0.20, composite Δ=±X.XX color-coded by direction.
  • D14 — adapter paths with spaces render safely: _adapter_label wraps the path in double quotes when any whitespace is present.
  • D15 — long messages wrap, don't truncate: the terminal probe table column uses Rich's overflow="fold" instead of an 80-char hard cut with an ellipsis. The full message is always visible.
  • B16 — single markdown renderer: report.from_json round-trips saved JSON back into the canonical dataclass pair, so sway report --format md and sway run --markdown both flow through report.to_markdown with identical output. The legacy _render_markdown_from_json and _render_junit_from_json helpers in cli/commands.py are deleted.
  • B17 — None scores render as : the unified formatters cover every render path; no more 0.00 masking missing data.
  • B18 — baseline row labeled (informational, weight=0): in both terminal and markdown component breakdowns, matching the Sprint 03 explicit-weight-zero decision.
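
The D10 unified formatters might look roughly like this; a sketch under the stated rules (dash glyph for missing or non-finite values, thousands separators above 1 000), with the exact precisions assumed:

```python
import math

_GLYPH = "—"  # rendered for None / non-finite in every surface

def _finite(x):
    return x is not None and math.isfinite(x)

def format_score(x):
    return f"{x:.2f}" if _finite(x) else _GLYPH

def format_z(x):
    return f"{x:+.1f}σ" if _finite(x) else _GLYPH

def format_raw(x):
    if not _finite(x):
        return _GLYPH
    # Thousands separators only above 1 000; more precision below.
    return f"{x:,.2f}" if abs(x) >= 1000 else f"{x:.4f}"
```

One code path per value kind is what kills the cross-surface drift (and the B17 "0.00 masking missing data" bug) by construction.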

Sprint 05 — Probe quality & edge cases

Closes Audit 01 findings B6, B7, B8, B9, B10, B11, B12, B13, B14, B20, B21, B22.

  • B6 — tail_logprob semantics: the field is now float | None with three discrete states: None (top-k covered the full vocab — no tail to redistribute), 0.0 (a real tail underflowed below fp32), and a measurable negative log-prob. HF, MLX, and dummy backends all emit None when k == vocab. Preflight check guards against non-finite tail_logprob only when it's a number.
  • B7 — probe validation before backend build: new validate_all_probes(suite) helper collects every spec error in one pass; _execute_spec calls it before materializing the backend so a typo in kind: surfaces immediately, and all typos surface in a single error message.
  • B8 — autogen style prompts: style_fingerprint no longer receives the leading sentence of a prose section (which elicited doc content, not stylistic voice). Replaced with a fixed _STYLE_ELICITATION_PROMPTS set of 6 open-ended, content-neutral prompts.
  • B9 — sentence-transformer caching: _load_embedder is now functools.lru_cache(maxsize=4). Suites that run adapter_revert back-to-back (multi-adapter diff, repeated probes) reuse the same ~80 MB embedder instead of re-loading every probe call.
  • B10 — _lcs_ratio docstring rot: the function name is a historical misnomer (it's gestalt similarity via difflib.SequenceMatcher, not LCS). Docstring rewritten to make the contract explicit; rename deferred to v0.2 to preserve the public API surface.
  • B11 — leakage adversarial perturbations: added 4 new perturbations (synonym_swap with a hand-curated 50-pair table, clause_reverse, prefix_inject, register_shift) alongside the original three. Default perturbation list is now all 7; the spec accepts any subset.
  • B12 — calibration pack expansion: BUILT_IN_PACK grew from 30 to 200 items (geography, natural sciences, arithmetic, language, history, biology, technology, miscellaneous). All public-domain / hand-composed grade-school facts — no third-party dataset license attaches. items_limit docstring documents the new resolution (~0.5 pp per regressed item vs the old ~3.3 pp).
  • B13 — tokenizer-aware _stuffing: prompt_collapse now derives its padding from the model's pad / unk / EOS token via the backend's tokenizer, making the metric language-agnostic. The pre-B13 hardcoded English string remains as the dummy-backend fallback and behind a legacy_stuffing: bool = False spec field for one-release backward-compat.
  • B14 — preference_flip per-triple error fence: wrapped each triple's logprob_of calls in try/except ProbeError. A single bad triple no longer kills the whole batch; evidence["dropped_triples"] / dropped_reasons surface the count + first 5 reasons. When every triple raises, the probe routes to ERROR with a clear explanation.
  • B20 — custom backend protocol checks: _load_custom now isinstance-checks NullCalibratedBackend and ScalableDifferentialBackend after the DifferentialBackend check. The set of satisfied protocols is stamped onto the instance as __sway_protocols__: tuple[str, ...] so the report can show which features are available without re-checking.
  • B21 — RunContext.null_stats truly frozen: the runner now wraps the stats dict in types.MappingProxyType before threading it through. The dataclass was already frozen=True but the dict was mutable by reference; B21 makes the docstring's "frozen" claim literally true. Field type widened from dict[str, dict[str, float]] to Mapping[str, Mapping[str, float]].
  • B22 — ModelSpec.adapter path normalization: added a pydantic field_validator that runs Path.expanduser().resolve() on the field at spec-load time. Backends no longer re-do the work; the cache key in _null_cache.compute_key is now stable regardless of how the user spelled the path in YAML or on the CLI.
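
The B22 normalization reduces to one pathlib call. The real code runs it inside a pydantic field_validator at spec-load time; this standalone sketch shows the equivalent transform:

```python
from pathlib import Path

def normalize_adapter_path(raw):
    """expanduser + resolve once, so cache keys are stable regardless of
    how the user spelled the path in YAML or on the CLI."""
    return str(Path(raw).expanduser().resolve())
```

Any two spellings of the same location (./x, x/./, ~-prefixed) collapse to one canonical absolute string, which is what keeps _null_cache.compute_key stable.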

Sprint 04 — Integration & regression testing

Closes Audit 01 findings C1, C2, C5, C6, C7, C8, C10, C11, C12, B3 (test side), B15.

  • Slow lane runs end-to-end (C1): the huggingface_hub dev dep added in S01 is verified; pytest -m "slow or online" now executes every checked-in integration test from a clean clone.
  • HF backend coverage 21% → 91% (C2) — measured combined fast + slow lane against dlm_sway.backends.hf. New tests:
    • tests/integration/test_hf_adapter_toggle.py extended with a bit-identical ft → base → ft roundtrip (B15 mitigation).
    • tests/integration/test_hf_scaled_adapter.py: λ sweep monotonicity + LoraLayer.scaling[key] restoration on clean exit AND on exception.
    • tests/integration/test_hf_null_adapter.py: same-seed determinism, different-seeds divergence, original adapter restoration on clean exit AND on exception.
    • tests/integration/test_hf_scoring.py: logprob_of (incl. zero-token-completion → ProbeError), rolling_logprob, next_token_dist, _HFView.generate (greedy + sampled-with-seed determinism).
    • tests/unit/test_backend_hf_helpers.py: direct unit coverage on _resolve_dtype and _detect_device so dtype regressions are caught in the fast lane.
  • MLX smoke test (C5): tests/integration/test_mlx_smoke.py exercises MLXDifferentialBackend on darwin-arm64 with a small LoRA adapter. Skips cleanly on non-darwin / non-arm64 / no-mlx_lm / missing-fixture. Reproducible adapter builder ships at tests/fixtures/build_mlx_adapter.py (run once on a Mac to populate the fixture directory).
  • sway gate exit code pinned (C6): tests/integration/test_sway_gate_exit_code.py covers PASS (exit 0), FAIL verdict (exit 1), and below-threshold-with-passing verdicts (exit 1) via Typer's CliRunner.
  • dlm import ban regression-guarded (C7): tests/unit/test_dlm_not_imported.py patches every dlm.* entry in sys.modules to None and asserts the dummy suite runs end-to-end without ImportError, plus that sway autogen surfaces a clean install-hint error rather than a stack trace.
  • Pathological probe coverage (C8 + B3 test side): tests/unit/test_probe_adapter_ablation.py now drives monotonically-decreasing curves through the helper and pins probe-level evidence["saturation_reason"] for flat / found / overshoot-with-dip / non_monotonic shapes via a monkeypatched divergence.
  • WARN-branch numerical formula pinned (C10): test_warn_branch_score_formula_pinned in tests/unit/test_probe_preference_flip.py asserts the exact score = 0.5 + mean_delta / 4.0 formula with a hand-computed expected value.
  • Disjoint top-k divergence (C12): test_disjoint_top_k_supports_produce_finite_divergence in tests/unit/test_divergence.py covers the aligned_probs tail-redistribution path against fully-disjoint base / ft supports — JS comes out finite and approaches its theoretical ln(2) bound.
  • Report schema snapshots (C11): tests/unit/test_report_snapshot.py byte-compares to_json / to_markdown / to_junit against checked-in snapshots under tests/snapshots/. Intentional schema bumps: SWAY_UPDATE_SNAPSHOTS=1 uv run pytest … and commit the updated files. No new dep — hand-rolled diff helper.
  • CI workflow shipped: .github/workflows/ci.yml runs the fast lane (unit + lint + mypy) on every push / PR, and the slow lane (integration, HF backend) on schedule (nightly 07:00 UTC), manual dispatch, push to main, and PRs that touch src/dlm_sway/backends/, tests/integration/, or pyproject.toml.
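
The hand-rolled snapshot helper (C11) can be sketched as a byte-compare with an update escape hatch; the function name is hypothetical, the SWAY_UPDATE_SNAPSHOTS=1 convention is from the changelog:

```python
import os
from pathlib import Path

def assert_matches_snapshot(rendered, snapshot_path):
    """Byte-compare rendered output against a checked-in snapshot file,
    rewriting the snapshot when SWAY_UPDATE_SNAPSHOTS=1 is set."""
    path = Path(snapshot_path)
    if os.environ.get("SWAY_UPDATE_SNAPSHOTS") == "1":
        path.write_text(rendered)
        return
    expected = path.read_text()
    if rendered != expected:
        raise AssertionError(
            f"snapshot drift in {path.name}; "
            "rerun with SWAY_UPDATE_SNAPSHOTS=1 if the change is intentional"
        )
```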

Sprint 03 — Documentation truth & dead code

Closes Audit 01 findings P03, P04, P05, P07 (doc), P08, P09, P10, P14, P15, P17, P18.

  • Determinism is now wired (P09): the suite runner calls dlm_sway.core.determinism.seed_everything(spec.defaults.seed) before any backend work runs. The achieved determinism class (strict / best_effort / loose) is captured in SuiteResult.determinism and surfaces in the report footer + JSON payload + markdown header.
  • defaults.differential: false works end-to-end (P14): new backends/two_model.py with TwoModelDifferential wrapper + build_two_separate(spec_models) helper. _execute_spec routes to the wrapper when the flag is false. Doubles memory; primarily for custom backends that can't toggle adapters in place.
  • score_weights overridable from YAML and CLI (P15): new SuiteDefaults.score_weights field with subset-validating pydantic field validator (rejects unknown categories, negative weights, all-zero weights). New --weights k=v,k=v flag on sway run and sway gate. Partial overrides merge with DEFAULT_COMPONENT_WEIGHTS so users rarely have to respecify all five categories.
  • baseline row labeled (informational) with weight 0.0 explicit (P08, B18): added to DEFAULT_COMPONENT_WEIGHTS so the row appears in reports for transparency but contributes nothing to the composite. Terminal renderer adds the (informational) annotation column; markdown table adds a weight column with the same label.
  • Extended (9-dim) style fingerprint when [style] is installed (P10): style_fingerprint now actually uses spaCy POS-tagging and textstat syllable counting. Three new dims appended to the existing 6: passive-voice rate, POS 4-gram entropy (Shannon, bits), syllables per word. Spec field extended: "auto" | "on" | "off" (default "auto") controls the path. evidence["schema_version"] bumped to 2 so snapshot consumers can branch on dimensionality.
  • Stale "Known gaps" entries refreshed (P03, P04, P05): removed "null_adapter not yet wired" (delivered in S01/S02), "custom backend stubbed" (it's full), "MLX paths raise" (rewritten to describe the real limitation: two model copies because mlx_lm has no runtime adapter toggle).
  • README adds determinism + weights + differential documentation and a calibration paragraph that points at the per-kind null matrix.
  • delta_kl docstring rot fixed (P17): dropped the dead ``:mod:`dir``` reference.
  • Meta-test test_no_dead_options.py (P14/P15 regression guard): greps the source tree for documented spec/CLI option names and asserts each has a consumer outside its declaration site. Catches the exact "documented but unused" pattern Audit 01 flagged.
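
The P15 merge-and-validate behavior can be sketched as follows; the category names and default weights here are made up for illustration, not the shipped DEFAULT_COMPONENT_WEIGHTS values:

```python
# Assumed defaults for illustration only.
DEFAULT_COMPONENT_WEIGHTS = {
    "adherence": 0.3, "attribution": 0.3, "calibration": 0.2,
    "ablation": 0.2, "baseline": 0.0,
}

def merge_score_weights(overrides):
    """Partial overrides merge with defaults; unknown categories,
    negative weights, and all-zero weight maps are rejected."""
    unknown = set(overrides) - set(DEFAULT_COMPONENT_WEIGHTS)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    if any(v < 0 for v in overrides.values()):
        raise ValueError("negative weights are not allowed")
    merged = {**DEFAULT_COMPONENT_WEIGHTS, **overrides}
    if all(v == 0 for v in merged.values()):
        raise ValueError("all weights are zero")
    return merged
```

A --weights adherence=0.5 override therefore only respecifies one category; the other four keep their defaults.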

Sprint 02 — Universal z-score calibration

Closes Audit 01 findings P02 (delivery), B2, C9.

  • NullAdapterProbe is now a per-kind calibration matrix: iterates every downstream numeric kind in the suite (or an explicit calibrate_kinds list), runs a miniature version of each through the NullCalibrationBackendProxy for N seeds, and publishes {kind: {mean, std, n}} under evidence["null_stats"]. The old "publishes two keys only" implementation is gone (closes B2).
  • NullCalibrationBackendProxy (_null_proxy.py): swaps as_finetuned() to yield as_null_adapter(seed) so every numeric probe's own math produces the "what does my metric look like when the fine-tune is structural noise?" distribution without a bespoke code path per probe.
  • Probe.calibrate_spec(ctx) classmethod plus SENTINEL_PROMPTS / SENTINEL_DOC shared constants in probes/base.py. Each numeric probe overrides calibrate_spec to return a small, cheap spec for calibration; probes that can't be meaningfully calibrated (e.g. adapter_revert needs an embedder, adapter_ablation needs as_scaled_adapter, prompt_collapse can't fit an exponential decay to null noise) opt out by returning None and surface (no calibration for <kind>) in the report.
  • Shared _zscore helpers (probes/_zscore.py): z_score(raw, stats), verdict_from_z(z, threshold), score_from_z(z), no_calibration_note(kind). Every numeric probe now flows through these — no bespoke (raw - mean) / std math left in any probe file. Includes the MIN_STD = 1e-6 floor so a degenerate null distribution (e.g. a single-seed calibration) can't produce an infinite z-score (closes C9).
  • All 10 numeric probes thread get_null_stats(ctx, self.kind): delta_kl, adapter_revert, prompt_collapse, section_internalization, paraphrase_invariance, preference_flip, style_fingerprint, calibration_drift, leakage, adapter_ablation. Each has an assert_z_gte: float = 3.0 that is preferred when stats exist. Lower-is-better probes (adapter_revert, calibration_drift, leakage) sign-flip the z internally so the shared z >= threshold PASS rule still reads as "significantly better than null". Two probes (paraphrase_invariance, calibration_drift) keep their intent-aware / compound thresholds as the no-calibration fallback.
  • On-disk null-stats cache (probes/_null_cache.py, backends/hf.py): HF backend exposes cache_identity(); NullAdapterProbe hashes (backend_identity, runs, init_scale, seed_base, top_k, kinds) into a stable filename under ~/.dlm-sway/null-stats/<key>.json (XDG-respecting). Cache is best-effort — a missing / malformed file rebuilds. Disable per-suite with cache: false in the spec, or globally with SWAY_DISABLE_NULL_CACHE=1. The dummy backend doesn't expose a cache identity, so tests never touch disk unless they opt in.
  • Report shows z-scores in every numeric probe row: the terminal renderer already had a z column; the markdown renderer now does too. Rows that fell back to fixed thresholds carry (no calibration for <kind>) directly in the message.
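
The shared _zscore helpers reduce to a few lines; a sketch with the MIN_STD floor from C9 (signatures simplified relative to the real module):

```python
MIN_STD = 1e-6  # floor so a degenerate (e.g. single-seed) null
                # distribution cannot produce an infinite z-score

def z_score(raw, stats):
    """stats: {'mean': float, 'std': float} from null calibration."""
    std = max(stats["std"], MIN_STD)
    return (raw - stats["mean"]) / std

def verdict_from_z(z, threshold=3.0):
    # Shared PASS rule: "significantly better than the null adapter".
    return "PASS" if z >= threshold else "FAIL"
```

Lower-is-better probes flip the sign of z before this check, so the single z >= threshold rule holds for every numeric probe.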

Sprint 01 — Finite safety & verdict integrity

Closes Audit 01 findings B1, B3 (fix), B4, B5, C3, C4, D1, D2.

  • safe_finalize helper in core/result.py: every numeric probe now routes its final ProbeResult through this guard. A non-finite raw (or any explicitly-critical field) auto-converts to Verdict.ERROR; non-finite auxiliaries are nulled out and recorded in evidence["defensively_nulled"].
  • _divergence non-finite rejection: aligned_probs, kl, js now raise ProbeError on NaN/inf inputs instead of silently producing garbage. JS results are bounds-checked against the theoretical [0, ln 2]; out-of-bound results raise rather than ship. (Pins the +11639σ bug — JS exceeded ln 2 nats by 19×.)
  • PreflightCheckable Protocol: HF and dummy backends gain preflight_finite_check() that runs one forward pass per view with a sentinel prompt and rejects backends whose logprobs are non-finite.
  • Suite runner preflight gate: every sway run calls backend.preflight_finite_check() before the probe loop. Failure emits a single synthetic ERROR probe and skips all configured probes; sway gate exits non-zero. Disable with skip_preflight=True for sub-second test suites.
  • style_fingerprint zero-vector fix (B4): a fine-tuned model that produces empty / whitespace-only generations no longer reports a spurious "+0.82 shift toward doc" PASS. The _cosine_shift math is replaced by _projection_shift = (ft-base)·(doc-base) / ||doc-base||² which goes to zero when ft equals base.
  • _saturation_lambda full-range search (B3): the adapter-ablation saturation detector now searches the entire λ range (not just λ ≤ 1.0) with max(divs) as the reference, and returns a typed reason ("found" / "non_monotonic" / "flat_curve" / "below_floor") so a flat NaN-adapter curve is distinguishable from a healthy saturating one.
  • Property-based divergence tests (hypothesis): symmetry, JS(p,p) = 0, JS ≤ ln 2, KL(p,p) = 0, KL non-negativity. Plus explicit non-finite-raises tests pinning every entry point.
  • End-to-end NaN regression (tests/integration/test_nan_adapter_regression.py, slow+online): builds a real PEFT adapter with all-NaN weights via PEFT, persists it through safetensors, then asserts the HF backend's preflight rejects it and the suite produces ERROR — not the +11639σ headline the audit caught.
  • Added huggingface_hub to dev dep group so integration tests can resolve the tiny_model_dir fixture (closes C1 ahead of Sprint 04).
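
The bounds-checked divergence path can be sketched like this, raising a plain ValueError where the real code raises ProbeError; inputs are dense probability vectors for simplicity:

```python
import math

LN2 = math.log(2)  # theoretical upper bound for JS divergence in nats

def js_divergence(p, q):
    """Jensen-Shannon divergence with non-finite rejection and bound check."""
    if not all(math.isfinite(x) for x in p + q):
        raise ValueError("non-finite probability input")
    m = [(a + b) / 2 for a, b in zip(p, q)]
    def kl(a, b):
        # 0 * log(0/y) is taken as 0 by convention.
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    js = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    if not (0 <= js <= LN2 + 1e-9):
        raise ValueError(f"JS out of [0, ln 2]: {js}")
    return js
```

Fully disjoint supports reach the ln 2 ceiling exactly, and anything beyond it (the +11639σ failure) raises instead of shipping.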

0.1.0.dev0 — 2026-04-20

Initial pre-alpha. Full 11-primitive battery shipped.

Primitives

  • Adherence
    • delta_kl — mean JS/KL divergence between base and fine-tuned next-token distributions
    • adapter_revert — reversion under adversarial paraphrase (needs dlm-sway[semsim])
    • prompt_collapse — exponential-decay fit of divergence over context length
  • Attribution
    • section_internalization (flagship) — per-section effective_sis with leak check
    • paraphrase_invariance — memorization vs. generalization, intent-aware
    • preference_flip — DPO/ORPO chosen/rejected margin inversion
  • Calibration
    • style_fingerprint — 6-dim numpy-only stylistic shift vs. document
    • calibration_drift — general-knowledge regression on a packaged 30-item pack
    • leakage — greedy LCS recall + perturbation fragility
  • Ablation
    • adapter_ablation (signature primitive) — λ-scaled divergence curve with linearity, saturation, overshoot metrics
  • Baseline
    • null_adapter — per-kind null distribution matrix for z-score calibration (wired end-to-end in Unreleased)

Infrastructure

  • DifferentialBackend + ScalableDifferentialBackend protocols
  • HuggingFace + PEFT backend with disable_adapter / set_adapter toggling and LoRA-scale mutation
  • Dummy backend for unit tests (canned responses + linear-blend scalable mode)
  • YAML spec loader, composite score (four-category weighted), rich terminal + JSON + JUnit + Markdown reports
  • Typer CLI: run, gate, check, diff, autogen, doctor, report
  • .dlm bridge (dlm-sway[dlm]): resolver + full-battery autogen
  • Matplotlib visualizations (dlm-sway[viz]): SIS bar chart, ablation curve, KL histogram

Known gaps

  • MLX backend loads two model copies in memory (base + adapter-fused) because mlx_lm has no runtime adapter toggle. Memory footprint is ~2x the HF path; fine for the small (<3B) models MLX typically runs.
  • MLX requires a pre-converted .npz adapter — raw PEFT safetensors are rejected by mlx_lm.load. A PEFT-→-MLX converter is a future milestone.
  • PyPI publication of the dlm-sway wheel is pending a clean CI release workflow.
54 **New probe (`kind: training_drift`, category: calibration).**
55
56 For a dlm store, the probe parses every `train-*.jsonl` under
57 `<store_path>/logs/`, dedupes resumed runs (latest occurrence wins),
58 and computes four metrics:
59
60 - `final_loss` — last recorded step's loss.
61 - `convergence_ratio``final_loss / initial_loss`.
62 - `smoothness``1 − var(Δloss) / var(loss)`, clipped to `[0, 1]`.
63 - `instability_events` — count of loss-*increase* events whose
64 magnitude exceeds the local typical movement scale (median
65 absolute delta in a centered window). NaN losses count as one
66 instability each, then forward-fill so downstream stats stay
67 finite.
68
69 Verdict PASS when all three thresholds clear (smoothness ≥ 0.7,
70 convergence_ratio ≤ 0.7, instability_events ≤ 0). Otherwise WARN
71 with each failed threshold listed in the message.
72
73 **Spike heuristic note.** Sprint plan called for `|Δloss| > 3 ·
74 rolling_std`. Implementation rejects that: on a smooth exponential
75 decay, within-window std-of-deltas stays tiny while absolute deltas
76 are large — every step trips the threshold (verified during dev:
77 60-step smooth curve flagged 59 false-positive spikes). Replaced
78 with: count *positive* deltas (loss going up — the semantically
79 meaningful instability) that exceed `sigma · median(|Δ|)` in a
80 centered window. Robust to scale changes across training; loss
81 going down faster than usual is no longer mistaken for instability.

**No null calibration.** Mirrors `prompt_collapse` and
`multi_turn_coherence_decay`: a null adapter has no loss curve, so
the null distribution of "smoothness on a noise adapter" is
undefined. Fixed-threshold verdicts; users override per-spec.

**Log-format note.** The sprint plan said dlm writes per-step JSONs
at `logs/train_step_*.json`. Reality (verified against
`~/.dlm/store/`): one mixed JSONL per run at
`logs/train-NNNNNN-YYYYMMDDTHHMMSS.jsonl` containing banner +
delta + step + run_complete records. The probe filters for
`{"type": "step"}` lines and reads `step` + `loss`. The sibling
`*.summary.json` carries run aggregates we don't consume — the
curve is richer.
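
The filter itself is small. A hedged sketch of the parsing step
(latest-occurrence-wins dedupe and NaN→`+inf` encoding as
documented; the real parser lives in `probes/training_drift.py` and
differs in details):

```python
import json
import math

def parse_step_records(text: str) -> dict[int, float]:
    # Keep only {"type": "step"} records; the latest occurrence of a step
    # wins (resume dedupe); NaN losses become +inf so stats stay finite.
    steps: dict[int, float] = {}
    for line in text.splitlines():
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerates a truncated tail from a mid-line crash
        if not isinstance(rec, dict) or rec.get("type") != "step":
            continue
        if "step" not in rec or "loss" not in rec:
            continue
        loss = float(rec["loss"])
        steps[int(rec["step"])] = math.inf if math.isnan(loss) else loss
    return steps
```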

**Robustness:**
- Resumed runs (overlapping step numbers across multiple jsonls):
  dedupe-by-keep-latest mirrors `dlm metrics` semantics.
- Truncated JSONL tail (trainer crashed mid-line): the partial line
  is skipped; valid lines are still consumed.
- NaN losses are recorded as `+inf` so the spike detector flags
  them without numpy NaN poisoning the rest of the pipeline.
- A 1500-step run downsamples to ≤ 512 evidence points (uniform
  stride; first + last always preserved).
- Pathological: every step recorded NaN → `smoothness = 0.0`,
  `instability_events = num_steps`, verdict WARN.

**Implementation:**
- `probes/training_drift.py` — spec, probe, JSONL parser,
  metric helpers, verdict mapping, downsampler. 365 LOC.
- `probes/__init__.py` — registers the new probe.

**Test surface:**
- `tests/unit/test_probe_training_drift.py` — 30 unit tests
  covering: skip paths (no store_path / no logs dir / no jsonl /
  too few steps), end-to-end with smooth & spiky curves,
  resume-deduplication, corrupt-first-line ERROR,
  truncated-tail tolerance, pure-math metric helpers
  (smooth/constant/NaN/all-NaN/zero-initial), spike-detector
  heuristic (loss-up vs loss-down semantics, short curves,
  empty), downsampling, verdict mapping, and JSONL parsing
  (filter non-step records, missing keys, NaN encoding,
  missing files).
- `tests/fixtures/dlm_train_log_fixture.jsonl` — captured-from-disk
  shape: banner + delta record + 30 step records + run_complete.
  If this fixture's parse breaks, dlm's log format has shifted and
  the probe needs an update — the test catches that explicitly.

**README** gains a `training_drift` paragraph in "Pre-run
diagnostics" alongside `gradient_ghost`. The probe table at "Why
it exists" picks up `training_drift` and the previously-missed
`gradient_ghost` entry under Calibration.

### Sprint 30 — `multi_turn_coherence_decay` probe

Closes the P2 "multi_turn_coherence_decay probe" backlog item. Sway
had zero coverage of multi-turn behavior before this — every
shipped adherence probe was single-turn. Adapters that pass
`delta_kl` cleanly frequently degrade by turn 2 or 3 of real
dialogue, where the model's own previous responses enter the
context window and create compounding drift. The new probe is the
first to catch that failure mode.

**New probe (`kind: multi_turn_coherence_decay`, category: adherence).**

For each prompt the probe greedy-generates ft's turn-1 response,
then rolls a multi-turn synthetic dialogue with cycled generic
follow-ups (`Continue.`, `Tell me more.`, …). At each turn 2..N
both views see the same ft-grounded chat history; the probe
computes `KL(base || ft)` at the next-token position. The
per-turn KL series is fit to `kl = a · exp(-b · turn)` and
the probe reports `half_life_turns = ln(2) / b`.
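
On a clean decay the fit is an ordinary log-linear least squares. A
minimal sketch under that assumption (the shipped fit also carries
the stable/non-monotonic/degenerate branches of the verdict ladder):

```python
import math

def fit_half_life(turns: list[int], kls: list[float]) -> float:
    # Least-squares line through (turn, log KL); the slope is -b.
    n = len(turns)
    logs = [math.log(k) for k in kls]
    mx = sum(turns) / n
    my = sum(logs) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(turns, logs))
    var = sum((x - mx) ** 2 for x in turns)
    b = -cov / var
    return math.log(2) / b
```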

Verdict ladder:
- `ok` — clean exponential decay; PASS when `half_life_turns ≥
  assert_half_life_turns` (default 2.0).
- `stable` — KL stayed within 0.1% relative spread across turns;
  the half-life is formally infinite, so it is clipped to
  `max_turns × 10` and rendered with a "held coherence" message;
  always PASS.
- `non_monotonic` — KL grew turn-over-turn (atypical but
  possible); WARN with the curve in evidence.
- `degenerate` — KL ≈ 0 at every turn; FAIL with a "probable no-op
  adapter" diagnosis.

**No null calibration.** Mirrors `prompt_collapse`'s rationale: a
null adapter has no signal to decay, so the null distribution of
half-lives is meaningless. Fixed-threshold verdicts are the
published path.

**Chat-template requirement.** Multi-turn dialogue requires the
base's tokenizer to carry a `chat_template`. Bases without one
SKIP gracefully with a clear message. The probe consults
`ctx.backend._tokenizer.chat_template` — the same backdoor
`prompt_collapse` uses for the same reason (it avoids broadening
the public scoring contract for one probe's needs).

Reports surface a tiny unicode sparkline of per-turn KL in the
verdict message so terminal-only readers see the curve shape
without opening the JSON.
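
A sparkline of this sort is a one-liner over a block-character ramp;
a minimal sketch, not necessarily the shipped renderer:

```python
def sparkline(values: list[float], blocks: str = "▁▂▃▄▅▆▇█") -> str:
    # Map each value onto the 8-step block ramp; flat curves render
    # as a run of the lowest block.
    lo, hi = min(values), max(values)
    if hi == lo:
        return blocks[0] * len(values)
    return "".join(
        blocks[int((v - lo) / (hi - lo) * (len(blocks) - 1))] for v in values
    )
```

A decaying KL series renders high-to-low, e.g. `█▄▂▁`.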

**Implementation:**
- `probes/multi_turn_coherence.py` — spec, probe, curve fit,
  verdict mapping, sparkline.
- `probes/__init__.py` — registers the new probe.

**Test surface:**
- `tests/unit/test_probe_multi_turn_coherence.py` — 22 unit tests
  covering skip paths, end-to-end with planted-distribution
  sequences (decreasing / flat / growing curves), fit math (clean
  exp / stable / growing / zero / partial-zero), verdict mapping,
  sparkline rendering, and chat-template detection.
- `tests/integration/test_probe_multi_turn_coherence.py` —
  slow+online HF smoke on a tiny SmolLM2-135M LoRA across 3
  dialogue turns. Exercises the full chat-template + turn-loop +
  curve-fit path on a real backend.

**README** gains a "Multi-turn coherence" section between
"Tool-use fidelity" and "Reproducing a sway run" with a worked
YAML example. The probe table at "Why it exists" picks up the
new entry under Adherence (plus `tool_use_fidelity` under
Attribution — the table missed S27's addition).

### Sprint 28 — `tenseleyflow/sway-action` GitHub Action

Closes the P2 "GitHub Action" discoverability item from Audit 03
D-path3. The action lives in its own repo
([tenseleyFlow/sway-action](https://github.com/tenseleyFlow/sway-action),
tagged `v0.1.0`) so consumers can pin via `uses:` without coupling
to sway's release cadence.

**End-user value:** three lines of YAML to wire sway into any GitHub
PR check, with edit-in-place PR comments + cached venv + a
standardized verdict sticker. The audience is teams who don't use
pre-commit but live in GitHub Actions.

```yaml
- uses: tenseleyflow/sway-action@v0.1.0
  with:
    spec-path: sway.yaml
```

**Inputs / outputs.** Inputs cover the `fail-on` policy
(`fail`/`warn`/`never`), `comment-on-pr`, `upload-artifact`,
`sway-version` (defaults to the action's pinned version), and
`python-version`. Outputs `sway-score`, `verdict`, and `report-path`
are surfaced for downstream workflow steps.

**Action layout:**
- `action.yml` — composite action, 5 steps (setup-python → cache →
  entrypoint.sh → upload-artifact → github-script post-comment).
- `src/entrypoint.sh` — bootstraps a venv, installs a pinned
  `dlm-sway[hf]`, runs `sway gate` + `sway report --format md`, and
  parses score + verdict from the JSON report into `$GITHUB_OUTPUT`.
- `src/post-comment.js` — Node 20 script run via
  `actions/github-script@v7`. Finds prior sway comments via a hidden
  `<!-- sway-action:report -->` HTML marker and edits in place.
  Truncates at 60 KB with a footer pointing to the artifact for the
  full report.

**Self-tests** (`smoke.yml` in the action repo): an `actionlint` +
`shellcheck` + `node -c` lint job, an `entrypoint-stub` job that
validates the bash plumbing against a stubbed `dlm_sway` package, and
a `self-test` job that runs the action end-to-end against a tiny
random LoRA on SmolLM2-135M (PR-only — it needs PR context for the
comment-posting step).

**Sway README** gains a "GitHub Action" subsection between Pre-commit
and the `.dlm` integration with the copy-paste one-liner + a complete
workflow example.

### Sprint 27 — `tool_use_fidelity` probe

Closes the P1 "tool_use_fidelity probe" backlog item. This is the
probe sway ships for anyone fine-tuning an adapter against a
tool-using base — the dominant fine-tune target outside dlm-style
document training, and a failure mode no other shipped probe
catches.

**New probe (`kind: tool_use_fidelity`, category: attribution).**

For each `(prompt, tool_spec, gold_tool_name)` case the probe
greedy-decodes from base + ft and scores three independent signals:

- **JSON-schema validity delta** (`ft_valid_rate − base_valid_rate`):
  the gate that catches "ft broke the base's tool-call format". The
  default pass criterion tolerates a 5 pp drop.
- **Tool-name hallucination rate**: the fraction of schema-valid ft
  calls whose `name` falls outside the declared `allowed_tools`
  surface (or differs from `gold_tool_name` per-case when no
  surface is set). Default cap 10%.
- **Argument-field disagreement rate** (informational): leaf-field
  drift between matched base/ft schema-valid calls. v1 surfaces it
  as evidence rather than gating on it — per-token KL on
  free-form argument values is deferred past v1 because it requires
  alignment between two decoded strings.

The probe greedy-generates from both views, parses each output via a
forgiving extractor (whole-text JSON → fenced ```json``` block →
first balanced `{...}` substring), then validates against an
OpenAI-flavored schema subset (`string`, `integer`, `number`,
`boolean`, `object`, `array`, plus `required`). No `jsonschema`
dependency is added — core sway dependencies stay lean.
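
The three-stage fallback can be sketched as follows. This is an
illustrative reimplementation, not the shipped `_parse_tool_call`
(which, per the test list, also survives braces inside JSON strings;
the naive depth scan below does not):

```python
import json
import re

def parse_tool_call(text: str):
    # Stage 1: the whole text is JSON.
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Stage 2: a fenced ```json block.
    m = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    if m:
        try:
            return json.loads(m.group(1))
        except json.JSONDecodeError:
            pass
    # Stage 3: the first balanced {...} substring (naive depth scan).
    start = text.find("{")
    while start != -1:
        depth = 0
        for i in range(start, len(text)):
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
                if depth == 0:
                    try:
                        return json.loads(text[start : i + 1])
                    except json.JSONDecodeError:
                        break
        start = text.find("{", start + 1)
    return None
```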

`json_valid_rate_ft` is z-scored against the null-adapter baseline
when `null_adapter` is in the suite. The calibration spec ships two
sentinel tool-use cases so the null distribution carries useful
signal.

**Implementation modules:**
- **`probes/tool_use_fidelity.py`** — `ToolUseCase` /
  `ToolUseFidelitySpec` / `ToolUseFidelityProbe` plus
  `_parse_tool_call` / `_matches_schema` / `_field_disagreement`
  helpers. 573 LOC.
- **`probes/__init__.py`** — registers the new probe.

**Test surface:**
- **`tests/unit/test_probe_tool_use_fidelity.py`** — 40 unit tests
  covering verdict logic (PASS / FAIL on validity / FAIL on
  hallucination / SKIP-no-cases), the parse helpers (whole-text /
  fenced / embedded / unbalanced / strings-with-braces), schema
  validation (required / type-tag / bool-not-int / extras-allowed),
  field disagreement (identical / drifted / nested / list-as-leaf /
  None-vs-absent), and the calibration handoff.
- **`tests/integration/test_probe_tool_use_fidelity.py`** —
  slow+online HF backend smoke on a tiny SmolLM2-135M LoRA. Verifies
  the full code path executes without error and emits every
  documented evidence key with rates in `[0, 1]`. Doesn't assert a
  specific verdict — the 135M base isn't tool-fluent enough to make
  the validity gate meaningful, so only the plumbing is pinned.
- **`tests/fixtures/tool_use_cases.yaml`** — eight hand-authored
  cases spanning the most common tool shapes (search, file I/O,
  Python exec, shell, HTTP fetch, DB query, calendar, calculator).

**README** gains a "Tool-use fidelity" section between the `.dlm`
integration and "Reproducing a sway run" — it pitches the probe as
sway's signal for agentic fine-tunes, with a worked YAML example.

### Sprint 26 — X3 `sway pack` / `unpack` (sway-side half of the cross-repo X1+X3 pair)

Closes the X3 half of Audit 03's "make a sway run reproducible by a
coworker without recreating their environment" goal. The X1 half
(`dlm export --to sway-json`) lands in the dlm repo via a separate
PR; this changelog block ships the sway-only deliverable.

**New CLI commands.**

- **`sway pack <spec> [-o OUT] [--include-golden PATH]
  [--include-null-cache/--no-include-null-cache] [--max-size-mb MB]`** —
  bundles the spec + its `dlm_source` (when set) + cached null-stats
  entries + an optional last-known-good JSON report into a single
  `*.swaypack.tar.gz`. Default cap 50 MB; refuses to overwrite.
- **`sway unpack <pack> [-o DIR]`** — extracts into `<DIR>/swaypack/`,
  validates the manifest, and prints the ready-to-run `sway run`
  invocation, including the `SWAY_NULL_CACHE_DIR=...` env var that
  redirects null-stats lookups at the bundled cache.

**Pack format (`swaypack_version=1`).**

```
swaypack/
  manifest.json   # version + counts + packed_at + sway_version
  sway.yaml       # the spec, verbatim
  source.dlm      # spec.dlm_source content (when present)
  null-stats/     # one .json per cache key
  golden.json     # known-good run report (when --include-golden)
```

Implementation modules:
- **`cli/_pack.py`** — builds the tarball in-memory first so the
  size cap can refuse cleanly *before* writing a half-built pack
  to disk. Path-traversal-safe by construction (we author every
  arcname).
- **`cli/_unpack.py`** — uses `tarfile.extractall(filter='data')`
  to reject absolute paths / `../` escapes / device files (3.11
  compat). Validates `swaypack_version`; rejects mismatches with
  a clear message.

**Null-cache override env var.**

- **`probes/_null_cache._cache_root`** now honors
  `$SWAY_NULL_CACHE_DIR` if set (used verbatim — no
  `dlm-sway/null-stats` suffix). Order: env override → XDG cache →
  `~/.dlm-sway/null-stats`. The `sway unpack` CLI prints the exact
  invocation that points this env var at the bundled cache.
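
The lookup order can be sketched as below; the helper name and the
exact XDG subpath are assumptions, only the precedence (env override
taken verbatim → XDG → home fallback) is from this changelog:

```python
import os
from pathlib import Path

def cache_root() -> Path:
    override = os.environ.get("SWAY_NULL_CACHE_DIR")
    if override:
        return Path(override)  # verbatim; no suffix appended
    xdg = os.environ.get("XDG_CACHE_HOME")
    if xdg:
        return Path(xdg) / "dlm-sway" / "null-stats"  # assumed subpath
    return Path.home() / ".dlm-sway" / "null-stats"
```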

**Tests.**

- **17 unit tests** in `tests/unit/test_pack_unpack.py`: round-trip
  identity, manifest fields, dlm_source bundling (present + missing
  + warning emit), golden bundling, every PackError + UnpackError
  branch (overwrite refusal, size cap, missing file, corrupt
  tarball, missing manifest, version mismatch, malformed root).
- **2 slow+online integration tests** in
  `tests/integration/test_pack_run_roundtrip.py`: pack → unpack →
  `sway run` produces an identical band + per-probe verdict +
  score; null-cache packing populates the unpack report's
  `null_stats_dir` pointer.
- 708 unit tests pass; mypy + ruff + format clean.

### Sprint 25 — P3 `gradient_ghost` probe (pre-run, cross-repo)

New zero-forward-pass diagnostic probe that loads dlm's
`training_state.pt` and flags severely-undertrained adapters
**before** any model load fires. Catches the 90%-case "user did
`--max-steps 5` for a smoke test and forgot to retrain" in ~50 ms
vs. the 30+ s `adapter_ablation` probe that was previously the only
training-health signal.

**Probe + supporting modules.**

- **`probes/gradient_ghost.GradientGhostProbe`** — category
  `calibration`, `needs_backend = False`. Verdict ladder:
  `global_step < min_steps_threshold` → FAIL (primary signal); all
  per-param `exp_avg_sq` NaN → FAIL; per-layer ratio against the
  *minimum* layer's mean (not the global mean — see the module
  docstring on asymptotic-cap reasoning) flags layers >
  `undertrained_layer_ratio`; > `layer_failure_frac` of layers
  crossing → FAIL/WARN. Configurable thresholds
  `min_steps_threshold=50`, `undertrained_layer_ratio=2.0`,
  `layer_failure_frac=0.3`.
- **`probes/_training_state.py`** — torch.load wrapper for
  `training_state.pt`. Frozen `TrainingStateSnapshot` + `ParamStat`
  dataclasses. Lazy torch import (no cost on `import dlm_sway` for
  non-gradient-ghost users). Suppresses the FutureWarning around
  `weights_only=False`, since dlm's pickled RNG state requires it.
- **`probes/_param_id_mapping.py`** — maps optimizer-state integer
  param-ids to transformer-layer indices via the adapter's
  `adapter_model.safetensors` keys. Heterogeneous per-layer counts
  raise `ParamMappingError` rather than silently mis-attributing.
  Uses `safetensors.safe_open` so we read keys without
  materializing tensors.
- **`core/errors.MissingTrainingStateError`** — typed exception so
  the probe can SKIP cleanly when `training_state.pt` legitimately
  isn't there (non-dlm adapters), distinct from a parse error.
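
The per-layer arithmetic in the verdict ladder, as a hedged sketch
(names are illustrative; the real probe folds this flag count into
the FAIL/WARN decision via `layer_failure_frac`):

```python
def undertrained_layers(layer_means: dict[int, float], ratio: float = 2.0) -> list[int]:
    # Compare each layer's mean exp_avg_sq against the *minimum* layer's
    # mean, per the module-docstring reasoning; flag layers whose ratio
    # exceeds undertrained_layer_ratio.
    floor = min(layer_means.values())
    return [i for i, m in layer_means.items() if m > ratio * floor]
```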

**Probe-ABC + runner contract for pre-run probes.**

- **`probes/base.Probe.needs_backend: ClassVar[bool] = True`** —
  new opt-in flag. The default `True` preserves every shipped
  probe's contract.
- **`probes/base.RunContext.backend: DifferentialBackend | None`** —
  was required, now optional. The new `RunContext.require_backend`
  property narrows the type for mypy + raises a clear runtime error
  if a probe with `needs_backend=True` somehow gets a `None`
  backend.
- **`probes/*` sweep** — every existing probe now reads
  `ctx.require_backend.as_base()` instead of `ctx.backend.as_base()`.
  No behavior change; mypy-correct under the new optional type.
- **`suite/runner.run`** accepts `backend=None`. Pre-resolves all
  scheduled probes; if any has `needs_backend=True` while
  `backend=None`, raises `BackendNotAvailableError` with the
  offending probe kinds in the message. Skips the trace-writer
  install, preflight finite-check, probe-label setting, and
  backend-stats snapshot when the backend is absent.

**`sway check` integration.**

- **`cli/commands.check_cmd`** runs `gradient_ghost` first in the
  quick battery (along with `null_adapter` + `delta_kl` +
  `calibration_drift`). On FAIL, it emits a red ⚠️ banner before the
  regular verdict banner; on WARN, a yellow one. Informational —
  no exit-code change. The banner shows the probe's actionable
  message ("severely undertrained: global_step=N < threshold X").

**Tests.**

- **17 unit tests** (`tests/unit/test_probe_gradient_ghost.py`):
  registry / `needs_backend` flag / verdict-ladder branches
  (PASS / FAIL on global_step / FAIL on all-NaN / WARN / FAIL on
  many-layers / SKIP on missing) / `_param_id_mapping` correctness
  + error paths / `_training_state` loader edge cases / runner
  skip-backend contract.
- **3 integration tests** (`tests/integration/test_probe_gradient_ghost.py`,
  slow+online): real-store FAIL on `~/.dlm/store/01KPPFAB.../v0001`
  (skipped on machines without a local dlm install — typical CI),
  synthetic converged → PASS, runner end-to-end with `backend=None`.
- All 691 unit tests pass; mypy + ruff + format clean.

### Sprint 24 — F01 PEFT→MLX adapter converter

Closes the audit's #1 major finding: the README pitched MLX as a
co-equal backend, but the path required a pre-converted `.npz`
adapter that nothing in the toolchain produced. With this sprint,
`dlm train` → `sway run` on the MLX backend works end-to-end on any
PEFT-trained LoRA adapter.

**Converter (pure I/O, no torch dep).**

- **`backends/_mlx_convert.convert_peft_to_mlx`** — reads PEFT's
  `adapter_model.safetensors` + `adapter_config.json`, transposes
  LoRA matrices to MLX's layout, writes `adapters.safetensors` +
  an mlx-lm-shaped `adapter_config.json`. Verified against
  PEFT >= 0.13 + mlx-lm 0.31.
- **Key remap.** PEFT's
  `base_model.model.<dotted>.lora_<A|B>.weight` becomes MLX's
  `<dotted>.lora_<a|b>`. Modern PEFT keys (no `.default`
  adapter-name segment) and legacy `.default.weight` keys are both
  supported.
- **Shape transpose.** PEFT `lora_A=(r, in)` → MLX `lora_a=(in, r)`;
  PEFT `lora_B=(out, r)` → MLX `lora_b=(r, out)`.
- **Config remap.** Writes `fine_tune_type=lora`, `num_layers`
  inferred from the max layer index in the keys, and
  `lora_parameters` with `rank/scale=alpha÷r/dropout/keys`
  (per-layer-relative attribute paths like `self_attn.q_proj`).
- **Errors.** Missing files / non-LORA peft_type / invalid rank /
  unexpected key prefixes / dst-not-empty all surface as typed
  `MlxConvertError` with actionable messages. `modules_to_save`
  tensors (e.g. `embed_tokens`, `lm_head` overrides) are skipped
  with a per-key warning rather than crashing.

**CLI surface.**

- **`sway convert-adapter [--target mlx] SRC DST [--overwrite]`** —
  thin wrapper over the converter. Prints a before/after size +
  rank/scale report; surfaces `MlxConvertError` with a non-zero
  exit code; warns on skipped `modules_to_save` keys via stderr.

**MLX backend integration.**

- **`backends/mlx._ensure_mlx_adapter`** — auto-detect: if the
  adapter dir contains `adapter_model.safetensors`, run the
  converter into a content-hashed cache at
  `${XDG_CACHE_HOME:-$HOME/.cache}/dlm-sway/mlx-converted/<blake2b>/`
  and point mlx-lm at the cache. If the dir already contains
  `adapters.safetensors`, pass through unchanged. Uses a 16-byte
  blake2b digest of the source safetensors bytes — repeat loads on
  the same adapter version short-circuit (~10 ms hash + dir lookup).
- **`backends/mlx._MLXView._forward_logits`** — adjacent fix:
  `out[0].astype(mx.float32)` before `np.asarray` so unquantized
  bf16/fp16 model outputs round-trip correctly. A pre-existing bug
  surfaced by the new e2e test against
  `mlx-community/SmolLM2-135M-Instruct`.

**Tests.**

- **`tests/unit/test_mlx_convert`** — 20 tests across:
  - Helper functions (`_strip_layer_prefix`, `_extract_layer_index`).
  - Happy path: synthetic PEFT adapter → MLX adapter, expected file
    layout, config shape, rank/scale math, key transpose, value
    preservation.
  - Error paths: missing safetensors / config, non-LORA peft_type,
    invalid rank, dst-not-empty without `--overwrite`, unexpected
    key prefix, `modules_to_save` skip-and-report.
  - Auto-convert detection: pass-through on an already-MLX dir,
    fresh convert on a PEFT dir, cache short-circuit,
    unrecognized-dir pass-through.
- **`tests/integration/test_mlx_converter_e2e`** — 4 darwin-arm64
  slow+online tests on the real `mlx-community/SmolLM2-135M-Instruct`:
  XDG cache populated by backend init, `next_token_dist` returns
  finite top-k via the converted adapter, `logprob_of` works, repeat
  load skips reconvert (mtime check). Skipped on non-darwin /
  missing `[mlx]` extra.

**README.**

- New "MLX backend (Apple Silicon)" section with the two install
  paths (auto-convert via `sway run` vs. explicit
  `sway convert-adapter`). Documents the cache location and
  out-of-scope items (QLoRA, `modules_to_save`).

### Sprint 23 — H1 batched backend execution

Opens the door to a 3-5× wall-time reduction on HF-backend suites by
amortizing `model.forward` kernel-launch + memory-transfer overhead
across prompt batches. The real speedup lands on the HF backend;
other backends gain a unified Protocol surface via a loop fallback.

**Protocol.**

- **`core/scoring.ScoringBackend`** — new method
  `next_token_dist_batch(prompts, *, top_k) -> list[TokenDist]`.
  The default implementation on the Protocol loops over
  `next_token_dist`; backends override it when they can amortize.
  Adding the method to a `runtime_checkable` Protocol is a
  contract change — all shipped backend views + the stub used by
  `test_scoring.py::test_scoring_backend_runtime_checkable`
  gained an explicit method implementation.
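
The Protocol-with-default-loop pattern looks roughly like this
(`TokenDist` is stubbed as a plain dict for the sketch; the real
signatures live in `core/scoring`):

```python
from typing import Protocol, runtime_checkable

TokenDist = dict  # stand-in for sway's real TokenDist type

@runtime_checkable
class ScoringBackend(Protocol):
    def next_token_dist(self, prompt: str, *, top_k: int) -> TokenDist: ...

    def next_token_dist_batch(self, prompts: list[str], *, top_k: int) -> list[TokenDist]:
        # Default loop fallback; real backends override to amortize one forward.
        return [self.next_token_dist(p, top_k=top_k) for p in prompts]

class LoopBackend(ScoringBackend):
    """Explicit subclass: inherits the default batched implementation."""

    def next_token_dist(self, prompt: str, *, top_k: int) -> TokenDist:
        return {"prompt": prompt, "top_k": top_k}
```

Note the default body only reaches explicit subclasses; purely
structural implementers still need their own method, which is why
every shipped view and the test stub gained one.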

**HF backend — real batching.**

- **`backends/hf._HFView.next_token_dist_batch`** — calls
  `tokenizer(prompts, padding=True, padding_side="left")` and a
  single `model.forward` over the padded batch. Left-padding is
  mandatory for decoder-only LMs: the last-token position (what
  `next_token_dist` reads) must line up with the real end of each
  sequence, not pad tokens.
- **`backends/hf._topk_to_token_dist`** — extracted helper shared
  by the single-prompt and batched paths so tail-mass accounting
  (B6) stays identical across both.
- **`backends/_instrumentation.BackendInstrumentation.cached_batch`**
  — routes batched calls through a per-prompt cache lookup before
  the forward fires, so cached + uncached prompts in one call still
  short-circuit per-entry. Callback contract:
  `compute_misses(miss_indices) -> list[result]` in miss order.
  Wall-time attribution divides the batch's time evenly across
  miss entries for `scoring_wall_s` bookkeeping.
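
Why left-padding: with it, every row's final real token sits in the
last column, so one slice reads the whole batch. A toy illustration
(not sway code) of the indexing difference:

```python
import numpy as np

def last_token_logits(logits, attention_mask, *, left_padded: bool):
    # logits: (batch, seq, vocab). With left-padding the last real token of
    # every row is at column -1, so one slice covers the batch.
    if left_padded:
        return logits[:, -1, :]
    # Right-padding needs a per-row gather at each row's true final position.
    idx = attention_mask.sum(axis=1) - 1
    return logits[np.arange(logits.shape[0]), idx, :]
```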

**Instrumentation.**

- **`BackendStats`** — three new counters: `batches_sent`,
  `batched_prompts`, `max_batch_size`. The `avg_batch_size`
  property computes `batched_prompts / batches_sent` (0 when no
  batches fired). All surfaced in `to_dict()`.
- **`suite/report._cache_line`** — the report footer's
  `cache: N/M = P%` line now suffixes
  `| batches: K (avg=X.X)` when any batched forward fired. Runs
  without batching render the cache line alone — the pre-S23
  footer shape is preserved.

**Probes — opt-in.**

- **`probes/base.Probe.batch_score: ClassVar[bool] = False`** —
  new flag. Probes set it True when their scoring flows uniformly
  through `next_token_dist` on a prompt list.
- **`probes/delta_kl.DeltaKLProbe`** — `batch_score = True`;
  `run()` now issues one batched call per view (inside single
  `as_base` / `as_finetuned` contexts) instead of 2×N context
  entries with per-prompt calls. Divergences computed via
  `zip(base_dists, ft_dists, strict=True)`.
- **`probes/cluster_kl.ClusterKLProbe`** — same treatment; same
  math.
- Other probes (section_internalization, leakage,
  paraphrase_invariance, adapter_ablation, prompt_collapse,
  multi-turn-style probes) keep `batch_score=False` — each needs
  bespoke batching logic, deferred to a follow-up sprint.

**Dummy / MLX / API backends.**

- **`backends/dummy._DummyView.next_token_dist_batch`** — loops
  over `self.next_token_dist` (not `_compute_next_token_dist`) so
  subclass overrides (`_NullView`'s seeded-noise perturbation,
  `_InterpolatedView`'s lam-blend) fire correctly on the batched
  path. The dummy has no real forward to amortize; batching
  counters stay at zero on this backend by design.
- **`backends/mlx._MLXView.next_token_dist_batch`** — per-prompt
  loop stub with a `TODO: mx.array padded forward` docstring.
  MLX's per-prompt forward on Apple Silicon is already fast
  enough that the kernel-launch amortization a real batched
  forward would buy is small relative to the HF CUDA/MPS case.
- **`backends/api.ApiScoringBackend.next_token_dist_batch`** —
  per-prompt loop; OpenAI-compat HTTP servers score one
  completion per request, and httpx connection pooling already
  amortizes the network cost.

**Tests.**

- **`tests/unit/test_backend_instrumentation`** — new
  `TestBackendInstrumentationCachedBatch` class (6 tests) pinning
  the all-miss / partial-cache-hit / all-cached /
  wrong-length-return / max_batch_size / empty-prompts branches.
  `TestBackendStats` gained 2 tests for the new counters + the
  `avg_batch_size` property.
- **`tests/unit/test_batched_backend_s23`** — new file with 5
  tests: the `batch_score` flag set on `DeltaKLProbe`, a
  probe-level spy that confirms the batched method is called
  exactly once per view, byte-identical results vs. serial,
  report-footer batches-segment conditional rendering, and the
  empty-prompts short-circuit.

## 0.1.0 — 2026-04-24

First PyPI release. Alpha — the API is not guaranteed stable until
v1.0. Rolls up every sprint through Sprint 22; earlier sprints
accumulated under the rolling Unreleased header pre-release.

### Sprint 22 — v0.1.0 release + F06 dlm-compat test + Docker image

First PyPI publish. Version bumped from `0.1.0.dev0` to `0.1.0`.
Pre-commit consumer `rev:` pins switch from a commit SHA to
`v0.1.0`. Closes Audit 03 F05 (SHA-pin resolves at tag) and F06
(dlm API compatibility regression test).

**D-path1 — PyPI publish.**

- **`pyproject.toml`** — version `0.1.0`. The `Development Status ::
  3 - Alpha` classifier was already present.
- **`src/dlm_sway/__init__.py`** — `__version__ = "0.1.0"`.
- **`README.md`** — a real `pip install "dlm-sway[hf]"` recipe
  replaces the "Planned PyPI install (not live yet)" placeholder.
  The pre-alpha banner is swapped for an explicit Alpha +
  semver-from-v1.0 disclaimer.

**F06 — dlm API compatibility regression test.**

- **`core/errors.DlmCompatError`** — new typed exception. Raised
  when dlm's public surface drifts out of the contract sway's
  resolver depends on (currently: `dlm.base_models.resolve(key).hf_id`).
  The message includes the installed dlm version + a pip-install
  hint.
- **`integrations/dlm/resolver`** — the `except Exception: return
  base_model` silent fallback is gone. `resolve()` raising or
  returning an object without `hf_id` now raises `DlmCompatError`
  (chained via `__cause__` for tracebacks). An extra
  `_installed_dlm_version()` helper introspects
  `importlib.metadata` for the error message.
- **`tests/unit/test_dlm_bridge`** — 2 new regression tests pin the
  two drift branches (missing `hf_id`; `resolve` itself raising).
- **`tests/integration/test_dlm_api_compat`** — new slow+online+dlm
  integration test iterates dlm's registry keys and asserts every
  entry resolves to a spec with a plausible `hf_id`. Gracefully
  skipped (via `pytest.importorskip`) when the `[dlm]` extra isn't
  installed, so CI without dlm published to PyPI still passes.
- **`pyproject.toml`** — the `[dlm]` extra is pinned to
  `dlm>=0.9,<1.0`. The upper bound tightens to the range the compat
  test has validated; bump it when dlm cuts v1.0 and the contract
  is re-verified.

**D-path2 — Docker image (`sway-gate`).**

- **`Dockerfile.gate`** — `python:3.11-slim` + `dlm-sway[hf,semsim]`
  at the build-time `SWAY_VERSION` ARG. Pre-fetches the MiniLM
  weights (`sentence-transformers/all-MiniLM-L6-v2`, ~80 MB) so the
  `adapter_revert` / `cluster_kl` probes don't cold-download on the
  first gate run. `ENTRYPOINT ["sway"]` — the pre-commit hook passes
  `gate` as the first container arg.
- **`.github/workflows/docker.yml`** — on `v*` tag push + manual
  dispatch. Builds on GHA, pushes to
  `ghcr.io/tenseleyflow/sway-gate:{vX.Y.Z, latest}`. Smoke-tests the
  pushed image with `--version`.
- **`.pre-commit-hooks.yaml`** — new `sway-gate-docker` variant
  using `language: docker_image` pointing at
  `ghcr.io/tenseleyflow/sway-gate:v0.1.0 gate`. A third option
  alongside `sway-gate` (system PATH) and `sway-gate-isolated`
  (fresh venv).

**F05 closure.**

- **`.pre-commit-hooks.yaml`** — the isolated variant's
  `additional_dependencies` migrates from
  `dlm-sway[hf] @ git+https://…@<SHA>` to
  `dlm-sway[hf]==0.1.0`. Consumer `rev:` in
  `.pre-commit-config.yaml` pins to `v0.1.0`. SHA churn is over.

**README.**

- Real `pip install` recipes for every extra (`[hf]`, `[dlm]`,
  `[all]`, …). "Install from source" is kept for the contributor
  workflow.
- Pre-commit section: three hooks (system / isolated / docker)
  with a first-run-cost comparison table.
713
### Sprint 21 — Audit 03 closure

Closes the Audit 03 short list — 2 🟠 major + 5 🟡 minor findings the
audit recommended for v0.1.0 readiness. Full audit at
`.docs/audits/03-final-audit.md`. Verdict was YELLOW-leaning-GREEN
with zero critical findings; this sprint pushes it to GREEN on the
in-scope items. Deferred: F01 (MLX converter — its own feature
sprint, S24), F05 (resolves naturally at the v0.1.0 tag), F06 (dlm API
compat test — lands with the v0.1.0 release in S22). The `v0.1.0` PyPI
publish is deferred to S22 (requires release coordination +
credentials).

**🟠 Major — degenerate z-score clipping (F02).**

- **`probes/null_adapter`** — null stats now carry an explicit
  `degenerate: 1.0` field when the calibration ran but produced an
  unusable baseline (`runs: 1`, or every seed producing the *exact*
  same raw). The std floor at `1e-6` is preserved for valid-but-tight
  multi-seed nulls so they still calibrate. The audit observed a
  `+290,766σ` leakage-probe z-score under `runs: 1`; post-fix, `z_score`
  refuses and the probe falls back to fixed thresholds with a clear
  footer rollup.
- **`probes/_zscore.z_score`** — refuses when `stats["degenerate"]`
  is truthy, independent of the std check. Belt-and-suspenders on
  top of the existing MIN_STD guard.
- **`suite/report.collect_degenerate_null_kinds`** — new rollup
  surface. Terminal + markdown footers gain an
  "N probe kind(s) had a degenerate null baseline — bump `runs:`"
  block, distinct from the existing null-opt-outs rollup.
- 12 new unit tests across `test_null_calibration.py`,
  `test_zscore_helpers.py`, `test_report_extras_rollup.py`, and
  `test_probe_external_perplexity.py`. The existing
  `test_std_floor_prevents_runaway_zscore` was rewritten to assert
  the fixed behavior (the pre-fix contract *was* the audit's bug).
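
A minimal sketch of the refusal logic described above (names and the
stats-dict shape are illustrative; the real `probes/_zscore.z_score`
helper may differ):

```python
MIN_STD = 1e-6  # floor below which a tight-but-valid null still calibrates

def z_score(raw, stats):
    """Return a z-score against null-calibration stats, or None to refuse.

    Refuses when the baseline is flagged degenerate (e.g. runs: 1, or
    identical raws across every seed), independent of the std check —
    the caller then falls back to fixed thresholds.
    """
    if stats.get("degenerate"):
        return None
    std = max(stats["std"], MIN_STD)  # floor, don't reject, tight nulls
    return (raw - stats["mean"]) / std

assert z_score(3.0, {"mean": 1.0, "std": 1.0}) == 2.0
assert z_score(3.0, {"mean": 1.0, "std": 1.0, "degenerate": 1.0}) is None
```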

**🟡 Minor — CI resilience (F03).**

- **`pytest-timeout>=2.3`** + **`tenacity>=9.0`** in dev deps.
- **`tests/fixtures/tiny_model.py`** wraps `snapshot_download` with an
  exponential-backoff tenacity retry (3 attempts, 5-10-20s backoff)
  + `etag_timeout=10` to bound per-file head probes. Benefits every
  slow+online test.
- **`tests/integration/test_determinism_golden.py`** gains
  `@pytest.mark.timeout(600)`. A silent network hang now surfaces
  as a test failure with actionable output (the audit observed a 20m
  workflow-timeout hang on the Sprint 19 merge run 24747915467).

**🟡 Minor — outlier-miner pool guard (F04).**

- **`mining/outlier_miner.mine_outliers`** — raises `SwayError` when
  the pool has fewer than `2·top_k` distinct scored prompts, with
  an actionable `--top-k N` hint. Pre-fix: a 1-distinct-prompt pool
  produced `top=[p], bottom=[p]` (identical lists — no outlier
  contrast). The guard applies *after* scoring so unsupported probe
  kinds still return the empty-result path.
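
The guard can be sketched as follows (a simplified stand-in, assuming
scores are already folded into a prompt-to-score mapping; the real
`mine_outliers` signature differs):

```python
class SwayError(Exception):
    """Stand-in for dlm_sway's real SwayError base type."""

def mine_outliers(scored, top_k=5):
    """Return (top_k highest, top_k lowest) prompts by score.

    `scored` maps prompt -> per-prompt raw. Guard: refuse when there are
    fewer than 2*top_k distinct prompts, so the top and bottom lists can
    never overlap and the "outlier contrast" is meaningful.
    """
    if len(scored) < 2 * top_k:
        raise SwayError(
            f"need >= {2 * top_k} distinct scored prompts, got {len(scored)}; "
            f"try --top-k {len(scored) // 2 or 1}"
        )
    ranked = sorted(scored, key=scored.get, reverse=True)
    return ranked[:top_k], ranked[-top_k:]
```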

**🟡 Minor — autogen skipped-probes comment (F07).**

- **`integrations/dlm/autogen.collect_skipped_probe_reasons`** — new
  public helper returning `(probe_kind, reason)` tuples for probes
  `_build_suite` intentionally omitted for the given handle.
- **`_render_annotated_yaml`** — when `skipped` is non-empty, the header
  gains a `# skipped: <kind> (<reason>)` block. Users no longer have
  to diff the autogen source to understand which probes are missing
  from their generated `sway.yaml`.

**🟡 Minor — CI Node.js 20 deprecation silence (F08).**

- **`.github/workflows/ci.yml`** sets
  `FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: "true"` at the workflow
  level. Silences the Node 20 deprecation warnings each CI log
  currently carries (deadline June 2026). Temporary until every
  action we pull bumps to Node 24 natively.

**🟡 Minor — portable `dlm_source` (F09).**

- **`integrations/dlm/autogen._portable_dlm_source`** emits a
  cwd-relative path when the `.dlm` lives inside the cwd, an absolute
  path otherwise. The cwd-relative form survives cross-machine checkout
  (CI agents, other devs); the old absolute-path emission broke the
  moment a committed autogen'd YAML was run on a different host.
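
The F09 behavior reduces to a small `pathlib` pattern (a sketch with a
hypothetical signature; the real `_portable_dlm_source` may differ):

```python
from pathlib import Path

def portable_dlm_source(dlm_path, cwd=None):
    """Emit a cwd-relative path when the .dlm lives under cwd, else absolute.

    The relative form survives cross-machine checkout; an absolute path
    is only emitted when the file genuinely lives outside the repo cwd.
    """
    cwd = Path(cwd or Path.cwd()).resolve()
    resolved = Path(dlm_path).resolve()
    try:
        return str(resolved.relative_to(cwd))
    except ValueError:
        return str(resolved)  # outside cwd: absolute is the only honest form

assert portable_dlm_source("/repo/runs/a.dlm", cwd="/repo") == "runs/a.dlm"
assert portable_dlm_source("/scratch/a.dlm", cwd="/repo") == "/scratch/a.dlm"
```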

**Other.**

- **640 unit tests pass** (up from 626 before S21 — 14 new
  regression tests across F02/F03/F04/F07/F09).
- **Pragmatic deviations from original scope:** S21 was audit-scoped
  to ~2 days of cleanup. Actual depth: 6 findings closed with 14
  new regression tests. One existing test
  (`test_std_floor_prevents_runaway_zscore`) was explicitly rewritten
  to flip its assertion — the pre-fix contract the test encoded *was*
  the audit's reported bug.
### Sprint 19 — Pre-commit hook `sway gate`

Closes Audit 01 stretch-list F-item "pre-commit hook sway gate." Ships
a `.pre-commit-hooks.yaml` declaring two hook variants so
[pre-commit.com](https://pre-commit.com) users can gate their adapter
on every commit that touches a spec, `.dlm` file, or adapter directory.

- **Two hooks, pick your posture.**
  - `sway-gate` — `language: system`. Uses the sway install on the
    user's `PATH`. Fast, zero install cost. Recommended default.
  - `sway-gate-isolated` — `language: python` + git-based
    `additional_dependencies` (`dlm-sway[hf] @ git+...@<SHA>`).
    Self-contained — pre-commit builds a fresh venv and installs
    sway + torch + transformers on first run (~5 GB, ~2 min).
- **Rev pinning stance.** The example config pins to a commit SHA, not
  `HEAD`. Sway is pre-v0.1.0 — no tagged release yet — and `HEAD`
  drifts silently on every `pre-commit autoupdate`. SHA pinning is
  the honest pre-release pattern; migration to `rev: v0.1.0` is a
  README edit once the first release lands.
- **`pass_filenames: false` on both variants.** `sway gate` takes
  exactly one spec path; the user declares it via `args:`. The
  `files:` regex decides when the hook fires. No ambiguity.
- **Scope discipline.** Neither hook surfaces `--json` or
  `--markdown` flags. The hook gates (non-zero exit on FAIL) and
  nothing else. Users wanting report artifacts run `sway run`
  separately.
- **README "Pre-commit" section** documents the consumer-side config
  for both variants, the SHA-pinning rationale, and the one-time
  install cost of the isolated variant.
- **`examples/precommit-example/`** — template `sway.yaml`,
  consumer-side `.pre-commit-config.yaml`, and a README walk-through.
- **`tests/integration/test_pre_commit_hook.py`** (slow + online) —
  three cases: (a) parse-and-shape smoke test on
  `.pre-commit-hooks.yaml`; (b) pass case — spawns
  `pre-commit run sway-gate` as a subprocess against a tmp git repo
  with a real LoRA on SmolLM2-135M, asserts exit 0; (c) fail case —
  same fixture with an impossible `assert_mean_gte`, asserts
  non-zero exit + a `gate FAILED` banner.
- **`pre-commit>=3.8`** in `[dependency-groups].dev` so the
  integration test can invoke it as a subprocess. Keeps the tool
  out of the user's runtime deps.
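
A hedged sketch of what the consumer-side config could look like — the
repo URL, SHA, spec path, and `files:` regex below are all placeholders,
not the shipped example:

```yaml
# .pre-commit-config.yaml (consumer side) — all values illustrative
repos:
  - repo: https://github.com/tenseleyflow/dlm-sway   # hypothetical URL
    rev: 0123abcdef0123abcdef0123abcdef0123abcdef    # pin a SHA, not HEAD
    hooks:
      - id: sway-gate
        args: ["sway.yaml"]       # sway gate takes exactly one spec path
        files: '(sway\.yaml$|\.dlm$|^adapters/)'
```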

### Sprint 18 — Cross-platform determinism golden

Closes Audit 01 stretch-list F-item "cross-platform determinism golden
test." The README's "deterministic on CPU where possible" claim held
in theory; now it's pinned by CI on two platforms.

- **New module** `src/dlm_sway/core/golden.py`. Tolerance-aware JSON
  comparator with `compare_goldens(actual, expected, *, logprob_tol=1e-6,
  score_tol=1e-4)` and `mask_variable_fields` for stripping
  timestamps, wall-seconds, `duration_s`, `backend_stats`,
  `sway_version`, and cwd-resolved path identifiers (`adapter_id`,
  `base_model_id`) before comparison. No torch dep; runs in the fast
  lane.
- **New integration test** `tests/integration/test_determinism_golden.py`
  (slow + online). Builds a deterministically-seeded LoRA on
  SmolLM2-135M, runs a minimal 2-probe suite (delta_kl +
  calibration_drift), and diffs the JSON output against
  `tests/golden/expected_<platform>.json`. `SWAY_UPDATE_GOLDENS=1`
  toggles regen mode; a missing golden → SKIP with a regen recipe.
- **New CI matrix job** `determinism-golden` with
  `strategy.matrix.os: [ubuntu-latest, macos-latest]` in
  `.github/workflows/ci.yml`. Triggered by changes under
  `tests/golden/**`, `src/dlm_sway/core/golden.py`, or the test file
  itself (plus the standard schedule/dispatch/push triggers).
- **`workflow_dispatch` regen mode** — dispatching the CI workflow
  with `regenerate_goldens=true` flips `SWAY_UPDATE_GOLDENS=1` on both
  matrix legs and uploads the regenerated JSONs as per-platform
  artifacts. Meant for deliberate "yes, I changed the algorithm"
  flows.
- **Tolerance rationale** (documented in `core/golden.py`): `1e-6` for
  logprob-like numeric fields sits above typical BLAS-implementation
  drift (the `1e-8`–`1e-7` band between OpenBLAS on linux and Accelerate
  on darwin) but below real algorithm-change drift. `1e-4` for score
  fields absorbs composition noise.
- **25 new unit tests** for the comparator covering mask coverage,
  tolerance thresholds, structural diffs (missing keys, length
  mismatches, type mismatches), NaN/inf edge cases, and a realistic
  two-masked-payloads round-trip.
- **Pragmatic scope deviation**: the sprint envisioned a checked-in
  5 MB adapter binary. Instead the test builds it from a fixed seed
  at runtime (same pattern as `test_external_perplexity_e2e` and
  `test_cluster_kl_e2e`). Avoids the regeneration chore noted in the
  sprint's risks section.
- **Linux golden bootstrap**: the first PR ships `expected_darwin.json`
  only; the linux leg SKIPs with a recipe pointing at the
  `workflow_dispatch` regen mode. The maintainer dispatches, downloads
  the artifact, commits `expected_linux.json`, and the next CI run
  asserts cleanly on both platforms. One-time onboarding cost.
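
The comparator's core is a recursive diff with a numeric tolerance.
A simplified sketch (the shipped `compare_goldens` also takes
`score_tol` and assumes `mask_variable_fields` ran first):

```python
import math

def compare_goldens(actual, expected, *, logprob_tol=1e-6, path=""):
    """Recursively diff two JSON-like payloads with an absolute tolerance.

    Returns a list of human-readable mismatch strings; empty == match.
    """
    diffs = []
    if isinstance(actual, dict) and isinstance(expected, dict):
        for key in sorted(set(actual) | set(expected)):
            if key not in actual or key not in expected:
                diffs.append(f"{path}.{key}: missing on one side")
            else:
                diffs += compare_goldens(actual[key], expected[key],
                                         logprob_tol=logprob_tol,
                                         path=f"{path}.{key}")
    elif isinstance(actual, list) and isinstance(expected, list):
        if len(actual) != len(expected):
            diffs.append(f"{path}: length {len(actual)} != {len(expected)}")
        else:
            for i, (a, e) in enumerate(zip(actual, expected)):
                diffs += compare_goldens(a, e, logprob_tol=logprob_tol,
                                         path=f"{path}[{i}]")
    elif isinstance(actual, float) and isinstance(expected, float):
        # abs tolerance: above BLAS drift, below algorithm-change drift
        if not math.isclose(actual, expected, rel_tol=0.0, abs_tol=logprob_tol):
            diffs.append(f"{path}: {actual} != {expected} (tol {logprob_tol})")
    elif actual != expected:
        diffs.append(f"{path}: {actual!r} != {expected!r}")
    return diffs
```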

### Sprint 17 — Adversarial paraphrase mining + outlier-prompt miner

Closes Audit 01 innovation item F11. Adds `sway mine` — an evaluation
companion to `paraphrase_invariance` that surfaces the paraphrases a
memorizing adapter most reliably fails on, plus a secondary mode that
ranks `delta_kl` prompts by per-prompt divergence.

**`sway mine --mode paraphrase`** — per `paraphrase_invariance` case:

1. Generate candidate paraphrases via nlpaug `SynonymAug` (WordNet,
   deterministic under a fixed seed; back-translation is a user-supplied
   escape hatch to keep the default fast and offline).
2. Embed candidates through the shared MiniLM cache (the same 80 MB load
   `adapter_revert` pulls) and greedily farthest-first-select the top-K
   most pairwise-distant ones — dodges nlpaug's tendency to emit
   near-duplicate synonym swaps.
3. Rank by per-token lift gap: `(ft(prompt, gold) - base(prompt, gold))
   - (ft(candidate, gold) - base(candidate, gold))`. A large positive gap
   = the candidate breaks the adapter's lift = the adapter doesn't
   generalize there.
4. Emit `sway-mined-paraphrase.yaml` with a `mined_cases:` block
   paste-compatible with `paraphrase_invariance.cases`.

**`sway mine --mode outliers`** — rank a prompt pool by per-prompt
`delta_kl.raw`. The pool defaults to the spec's own `delta_kl` prompts;
`--from-corpus public_domain_en` draws from S09's CC0 corpus instead.
Emits top-K and bottom-K blocks: the highest-divergence prompts (best
for tightening a gate) and the lowest-divergence ones (candidates to
drop from the suite). `leakage` and `paraphrase_invariance` have
section/case-based specs; outlier mining on those is future work,
documented inline in `mining/outlier_miner.py`.

- **New package** `src/dlm_sway/mining/` — two modules,
  `paraphrase_miner.py` and `outlier_miner.py`, plus a thin
  `corpus_prompts` helper for the `--from-corpus` path.
- **New CLI subcommand** `sway mine SPEC --mode paraphrase|outliers
  [--out FILE] [--top-k N] [--n-candidates N] [--from-corpus NAME]
  [--seed N]`. Paste-compatible YAML emission by default; `--out`
  overrides the filename.
- **18 new unit tests** — 7 for the paraphrase miner (ranker,
  diversity filter, dedup, input validation), 8 for the outlier
  miner (delta_kl ranking, corpus wiring, unsupported-kind skip), 3
  for the CLI (paraphrase + outliers modes + the no-prompts error path).
- **Prove-the-value test** at
  `tests/unit/test_paraphrase_miner_prove_value.py`. On a
  deliberately-memorizing dummy backend the hand-written paraphrase
  list passes with `generalization_ratio > 0.5`; substituting the
  mined list (same seed, same adapter) drops the ratio below 0.5
  and flips the verdict to FAIL, with a ≥ 0.3 ratio gap.
- **Dependency footprint**: no new extras. The paraphrase miner
  uses nlpaug (already in `[style]`) + sentence-transformers (in
  `[semsim]`); the diversity filter reuses the embedder
  `adapter_revert` already pulls. Graceful `BackendNotAvailableError`
  with a pip hint when the extras aren't installed.
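
Step 2's diversity filter is classic greedy farthest-first selection.
A minimal sketch on plain 2-D points (the real miner works on MiniLM
sentence embeddings; the seeding choice here is an assumption):

```python
import math

def farthest_first_select(embeddings, k):
    """Greedily pick k mutually distant vectors; returns indices.

    Each round picks the candidate whose nearest already-chosen point is
    farthest away — near-duplicate synonym swaps cluster together and
    get skipped.
    """
    chosen = [0]  # seed with the first candidate (deterministic)
    while len(chosen) < min(k, len(embeddings)):
        best = max(
            (i for i in range(len(embeddings)) if i not in chosen),
            key=lambda i: min(
                math.dist(embeddings[i], embeddings[j]) for j in chosen
            ),
        )
        chosen.append(best)
    return chosen

# the near-duplicate of point 0 is passed over in favor of distant points
points = [(0.0, 0.0), (0.01, 0.0), (5.0, 0.0), (0.0, 5.0)]
assert farthest_first_select(points, 3) == [0, 2, 3]
```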

### Sprint 20 — Audit 02 closure

Closes the full finding inventory from `.docs/audits/02-followup-audit.md`
(1 🔴 critical, 8 🟠 major, 11 🟡 minor, 4 stronger-test opportunities, 3
dead-code items). Sixteen sprints of innovation work had accumulated
since Audit 01; the re-audit surfaced one ship-blocker (S14's bootstrap
CI dropped at the runner boundary) plus a grab-bag of claim-vs-code
gaps, dead branches, and untested contracts.

**🔴 Critical — ship-blocker.**

- **F01** — `suite/runner.py:_with_duration` now forwards every
  `ProbeResult` field. The pre-fix version silently dropped `ci_95`
  on every probe, making S14's headline deliverable runtime-inert in
  any real `sway run`. A regression test at the runner boundary + a
  snapshot fixture carrying a populated `ci_95` pin the fix.

**🟠 Major — claim-vs-code gaps.**

- **F02** — real `sklearn.cluster.KMeans` exercised by two new unit
  tests (separation + seed determinism) + a slow+online integration
  test at `tests/integration/test_cluster_kl_e2e.py`. Before: every
  cluster_kl test monkeypatched `_kmeans_cluster` with an argmax stub,
  leaving S16's reason-for-being unverified.
- **F03** — `sway list-probes` falls back to the defining module's
  `__doc__` when a probe class has no docstring. Every one of the 13
  shipped probes now renders a summary row.
- **F04** — `sway doctor` probes `plotly` (the load-bearing `[viz]` dep),
  `sklearn` (S16 cluster_kl), and the `[api]` extras (`httpx`,
  `tenacity`).
- **F05** — the README primitives table reflects 13 probes (adherence +
  cluster_kl, calibration + external_perplexity, and a baseline row for
  null_adapter).
- **F06** — `TwoModelDifferential` composes `safe_for_concurrent_views`
  from its two inner backends. Before: the attribute was missing and
  the runner defaulted to `False` even when both inners set `True`,
  breaking S13's concurrency claim at the only supported
  differential-use pattern.
- **F07** — `integrations/dlm/autogen` emits `cluster_kl` when the
  prompt pool clears 20 entries; the markdown report gains a per-probe
  "Cluster breakdown" section with per-cluster mean KL + exemplars.
- **F08** — the `backends/dummy._NullView` RNG is seeded via
  `hashlib.md5` instead of Python's PYTHONHASHSEED-salted `hash()`.
  A cross-process determinism test at
  `tests/unit/test_cross_process_determinism.py` pins the fix.
- **F09** — trace writer ↔ analyzer round-trip test at
  `tests/unit/test_runner_backend_stats.py`: runs a suite with
  `trace_path=`, loads the file back, asserts probe labels + hit/miss
  counts match `backend_stats`.
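
The F08 fix is a one-function pattern worth spelling out (a sketch;
`stable_seed` is an illustrative name, not the module's real helper):

```python
import hashlib
import random

def stable_seed(prompt: str) -> int:
    """Derive a cross-process-stable RNG seed from a prompt string.

    Python's built-in hash() is salted per process via PYTHONHASHSEED,
    so seeding from it breaks cross-process determinism; an md5 digest
    of the UTF-8 bytes does not.
    """
    digest = hashlib.md5(prompt.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

# two independently constructed RNGs produce identical streams
rng_a = random.Random(stable_seed("the same prompt"))
rng_b = random.Random(stable_seed("the same prompt"))
assert [rng_a.random() for _ in range(3)] == [rng_b.random() for _ in range(3)]
```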

**🟡 Minor — dead code, doc drift, latent brittleness.**

- **F10** — removed the dead `_ft_view` branch in
  `probes/prompt_collapse.py`.
- **F11** — the pytest plugin resolves spec paths against
  `config.rootpath`, not the process cwd.
- **F12** — `backends/api._post_completions` delegates to tenacity's
  `Retrying()` callable; removes the unreachable post-loop fallback.
- **F13** — the `preference_flip` WARN branch emits `ci_95` and
  `z_by_rank` for consistency with every other numeric probe's WARN
  shape.
- **F14** — the `adapter_ablation` module docstring documents why
  `ci_95` renders as an em-dash (it's a curve fit, not a sample-mean
  aggregator).
- **F15** — the report footer surfaces `null_adapter` opt-outs (probes
  with `calibrate_spec=None`) in both terminal and markdown surfaces.
- **F16** — report category ordering derives from
  `DEFAULT_COMPONENT_WEIGHTS` and accepts unknown categories from
  custom `Probe` subclasses.
- **F17** — the `cluster_kl` degenerate zero-variance case now returns
  `Verdict.WARN` with `evidence["degenerate_zero_variance"]=True` and
  suppresses the z-score to avoid a spurious small-sample-noise
  calibration.
- **F18** — `_calibration_pack.py` gains per-section provenance notes
  (the F18 audit trail) without noisy per-item annotations.
- **F19** — the pytest plugin defers `dlm_sway.core.result` /
  `dlm_sway.core.errors` imports to call sites so non-`@pytest.mark.sway`
  users don't pay the load tax.
- **F20** — `sway check --help` documents the σ-banner's
  null-calibration dependency (fall-through to the composite score band
  when null SKIPs).

**Stronger-test opportunities — regression tests for things we weren't
pinning.**

- **#9** — `probes/_divergence` rejects effectively-uniform TokenDists
  (spread < 1e-9) via `_check_non_degenerate_token_dist`. Catches a
  shape-broken lm_head that would otherwise compute a trivially constant
  divergence across prompts. Two compatibility touch-ups rode with the
  change: the dummy backend's synthesized `ft` dist and the cluster_kl
  test fixture `_dist_broad` gained a tiny monotonic perturbation to
  clear the guard (real models never produce bit-uniform logits thanks
  to fp32 accumulation noise).
- **#10** — cross-verdict consistency test
  (`tests/unit/test_cross_verdict_consistency.py`): runs the same spec
  through `sway run`, `sway gate`, and `sway report --format junit` and
  asserts identical per-verdict tallies.
- **#11** — a `sway doctor --json` schema-shape snapshot test locks the
  top-level keys and the per-extra module-name set.
- **#12** — a subprocess determinism test asserts
  `_NullView.next_token_dist` is byte-identical across `PYTHONHASHSEED`
  values; pins F08's fix.

**Dead-code inventory — DC3–DC5.**

- **DC3** — `null_adapter.py`'s pre-S10 cache-promotion branch is
  annotated as legacy; removal is queued for the next minor.
- **DC4** — dropped the stderr warning at `suite/runner.py` that fired
  whenever `concurrent_probes > 1` even though execution never fanned
  out. The spec field is still accepted (a future pool will light it
  up); the noisy warning is gone.
- **DC5** — `tests/unit/test_model.py` grows coverage of `ModelSpec`'s
  dtype enum, `endpoint`, `trust_remote_code`, and the `custom`/`api`
  `kind` branches.

Final state: 582 unit tests + 3 integration tests passing; mypy strict
clean across 55 source files; ruff + format clean across 137 files.
Sprint 20 is a single PR composed of per-finding, per-file commits so
the history reads as a fix-by-fix cleanup.

### Sprint 16 — Cluster-coherent KL probe

Closes Audit 01 innovation item F8. Adds a new `cluster_kl` probe that
answers the question `delta_kl` can't: the mean divergence may be the
same, but did the adapter shift the *right* topics, or is it a uniform
blunt-instrument shift?

- **New probe `cluster_kl`** (category: adherence). Embeds prompts via
  the shared MiniLM (same cache key as `adapter_revert`), k-means
  clusters them at a fixed seed, measures per-prompt JS divergence
  between base and ft, and reports a **specificity ratio**:
  `between_variance / (between_variance + within_variance)`. Range
  `[0, 1]`: `≈ 0.5` on a blunt adapter that shifts every topic by the
  same amount, `→ 1.0` on a topic-targeted adapter. The pair
  `(mean_kl, specificity)` tells a more honest story than either number
  alone.
- **Bootstrap CI** on specificity. Resamples `(divergence, cluster_label)`
  pairs with replacement and takes the 2.5/97.5 percentiles of the
  bootstrap ratio distribution. Returns `None` below 4 prompts — matches
  the convention `core.stats.bootstrap_ci` uses.
- **Z-score calibration** against the `null_adapter` baseline via the
  existing `_zscore` helpers. `calibrate_spec` synthesizes 8 mixed-topic
  sentinel prompts + k=2 for the null pass; the specificity distribution
  concentrates around 0.5 there, so a real adapter's separation above
  that baseline reads as a clean z-score.
- **Degenerate-input policy.** `< min_prompts` (default 20) → SKIP with
  a clear message; `num_clusters * 2 > num_prompts` → SKIP (per-cluster
  mean not well-resolved); empty prompt list → ERROR. Missing `[semsim]`
  extras → SKIP with a `pip install 'dlm-sway[semsim]'` hint.
- **Zero-variance fallback.** If every prompt produced the same
  divergence (canned stub data, no adapter motion), the specificity
  ratio is mathematically undefined; we return `0.5` — the null-adapter
  expectation — so the downstream z-score reports "no signal" instead
  of NaN.
- **New dependency.** `scikit-learn>=1.4` added to the `[semsim]` extra
  (riding the 80 MB MiniLM load that probe already pulls in). Mirrored
  to `[all]` and to mypy's stubless-override list.
- **7 unit tests** in `tests/unit/test_probe_cluster_kl.py`: two-topic
  adapter → high specificity, uniform adapter → 0.5 fallback, too-few
  prompts → SKIP, empty prompts → ERROR, `num_clusters > prompts/2` →
  SKIP, CI bracketing, missing-extras SKIP path.
- **Prove-the-value test** at `tests/unit/test_cluster_kl_prove_value.py`.
  Two backends with comparable `delta_kl` (ratio < 3×) have specificity
  scores that split by at least 0.3 — concrete evidence that `cluster_kl`
  surfaces a structural distinction `delta_kl` merges.
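
The specificity ratio is a standard ANOVA-style variance decomposition.
A pure-stdlib sketch of the headline number, assuming divergences and
cluster labels are already computed (the shipped probe's aggregation
may differ in detail):

```python
from statistics import mean

def specificity_ratio(divergences, labels):
    """between_variance / (between_variance + within_variance).

    Tends toward 1.0 when per-prompt divergence concentrates in specific
    clusters. Returns 0.5 — the null-adapter expectation — when total
    variance is zero and the ratio is undefined.
    """
    grand = mean(divergences)
    clusters: dict = {}
    for d, lab in zip(divergences, labels):
        clusters.setdefault(lab, []).append(d)
    # between-cluster sum of squares, weighted by cluster size
    between = sum(len(v) * (mean(v) - grand) ** 2 for v in clusters.values())
    # within-cluster sum of squares
    within = sum((d - mean(v)) ** 2 for v in clusters.values() for d in v)
    total = between + within
    if total == 0.0:
        return 0.5  # zero-variance fallback: no signal, not NaN
    return between / total

# one cluster moved, one didn't: all variance is between clusters
assert specificity_ratio([1.0, 1.0, 1.0, 5.0, 5.0, 5.0],
                         [0, 0, 0, 1, 1, 1]) == 1.0
```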

### Sprint 15 — pytest plugin (`@pytest.mark.sway`)

Closes Audit 01 innovation item F10. Packages sway as a pytest
library so teams already running pytest adopt sway with a single
decorator instead of a subprocess wrapper.

- **New plugin** (`src/dlm_sway/pytest_plugin.py`) auto-loaded via
  the `pytest11` entry point after `pip install 'dlm-sway[pytest]'`.
  `@pytest.mark.sway(spec="...", threshold=0.0, weights=None)`
  expands a single pytest function into **one test item per probe**
  in the referenced spec + an optional `__gate__` item that fires
  only when the composite score drops below `threshold`.
- **Verdict translation.** `FAIL` / `ERROR` → pytest Failed,
  `SKIP` → pytest Skipped, `WARN` → pytest warning, `PASS` →
  pytest pass. Probe-level failures isolate: a failing adherence
  probe doesn't mask a failing calibration one, and `pytest -k
  adherence` runs just that probe.
- **The suite runs once per decorated function.** A session-scoped
  `_SuiteCache` keyed on `(spec_path, weights)` ensures the N-way
  item expansion doesn't multiply backend wall time. Two
  `@pytest.mark.sway` tests against the same spec share one run.
- **Malformed marks fail cleanly.** A missing `spec`, non-numeric
  `threshold`, non-dict `weights`, or unknown kwargs produce a
  synthetic `_ConfigErrorItem` with a green-field pytest failure
  line — no cryptic collection-time tracebacks.
- **New `[pytest]` extra** — just `pytest>=8.0`. Matches the pattern
  other plugin-style extras use. Also registers the `pytest11`
  entry point so the plugin is discovered automatically on install,
  consistent with pytest-cov / pytest-xdist.
- **Example directory** at `examples/pytest_integration/` with a
  minimal `sway.yaml` + `test_sway_gate.py` showing the decorator
  replacing a legacy `subprocess.run(["sway", "gate", ...])`
  wrapper. The README gains a "Pytest integration" section pointing
  at it.
- **13 new unit tests** via pytest's canonical `pytester` fixture:
  marker registration, N-item expansion, FAIL / SKIP / ERROR
  routing, gate below-threshold failure, gate above-threshold pass,
  gate absence when `threshold=0`, malformed-mark error paths,
  cache sharing across multiple decorated tests.
- **Drive-by fix for the S14 CI-narrowing test flake.** The dummy
  backend's per-prompt noise was seeded via Python's `hash()`,
  which is salted per process via `PYTHONHASHSEED`. Swapped to a
  stable `hashlib.md5`-derived seed so the narrowing invariant
  holds deterministically across test-order permutations.
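
The run-once caching idea can be sketched as a small class — names here
are illustrative, not the plugin's real internals:

```python
class SuiteCache:
    """Run-once cache keyed on (spec_path, weights).

    Per-probe item expansion shares one suite run instead of multiplying
    backend wall time across N generated items.
    """

    def __init__(self, run_suite):
        self._run_suite = run_suite  # expensive: loads backend, runs probes
        self._results = {}

    def get(self, spec_path, weights=None):
        # weights is a dict in the public API; freeze it into a hashable key
        key = (spec_path, tuple(sorted((weights or {}).items())))
        if key not in self._results:
            self._results[key] = self._run_suite(spec_path, weights)
        return self._results[key]

calls = []
cache = SuiteCache(lambda spec, w: calls.append(spec) or {"spec": spec})
cache.get("sway.yaml")
cache.get("sway.yaml")  # shared: no second backend run
assert calls == ["sway.yaml"]
```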

### Sprint 14 — Bootstrap CIs + forward-pass trace CLI

Closes Audit 01 innovation items F9 (bootstrap confidence intervals
on raw metrics) + F12 (polished forward-pass trace CLI). Paired
sharpenings of existing probe output and existing trace
infrastructure.

- **New `core/stats.bootstrap_ci`** — percentile-bootstrap 95% CI on
  any sequence of per-sample measurements. Numpy-only, 1000-resample
  default, seeded from `ctx.seed` so intervals are reproducible.
  Short-circuits on non-finite / empty / degenerate-constant inputs;
  the vectorized resample keeps the overhead at ~1 ms per probe.
- **`ProbeResult.ci_95: tuple[float, float] | None`** — new field,
  threaded through `safe_finalize` (nulled if `raw` is nulled, so a
  CI never brackets a defensive-null point estimate). `to_json` /
  `from_json` persist the pair as a two-element list.
- **Six aggregating probes emit `ci_95`:** `delta_kl` (over per-prompt
  divergences), `calibration_drift` (over per-item regression
  indicators), `external_perplexity` (over per-chunk deltas),
  `leakage` (over clean-recall rates), `paraphrase_invariance` (over
  verbatim lifts), `section_internalization` (over effective SIS
  scores). Each also lands the interval under
  `evidence["raw_ci_95"]` for JSON consumers.
- **Report tables add a `ci95` column** between `raw` and `z` in
  both terminal and markdown output. `format_ci` renders as
  `[lo, hi]` with an em-dash for missing/non-finite values. The
  markdown snapshot is refreshed; the JSON snapshot is refreshed for
  the new per-probe `ci_95` field.
- **New `sway trace <jsonl>` CLI.** Reads the forward-pass trace
  JSONL the S07 runner writes when `--trace <path>` is set,
  aggregates into per-probe and per-view wall-time + hit-rate
  tables, plus a top-N slowest-events table. `--format
  terminal|md|json`, `--slowest K` to tune the tail size. The
  parsing tolerates old trace shapes (missing `probe` / `hit`
  fields) and skips malformed lines.
- **`suite/trace_analysis`** — typed dataclasses (`TraceEvent`,
  `ProbeSummary`, `ViewSummary`, `TraceReport`) + three renderers
  (terminal / markdown / json) matching the conventions of
  `suite/report` and `suite/compare`.
- **Committed trace fixture** at `tests/fixtures/trace_sample.jsonl`
  (8 events, 2 probes × 4 prompts with cache hits). Drives 22 new
  trace-analysis + CLI tests.
- **Prove-the-value (F9):** on a dummy backend with per-prompt
  variation, `delta_kl`'s CI narrows from `[0.33, 0.41]` (width
  0.079) at N=4 prompts to `[0.38, 0.42]` (width 0.044) at N=32 —
  1.8× tighter in width, tracking the √(N₂/N₁) ≈ 2.8× theoretical
  scaling minus dummy-backend dispersion.
- **Prove-the-value (F12):** the CLI's terminal output on the
  committed fixture correctly identifies `sis/base` (520.3 ms) as
  the single slowest event, `sis` (1,529 ms) as the slower probe
  over `dk` (529 ms), and `ft` (1,357 ms) as the slower view over
  `base` (701 ms).
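
A pure-stdlib sketch of the percentile-bootstrap idea (the shipped
`core/stats.bootstrap_ci` is numpy-vectorized and seeded from
`ctx.seed`; the short-circuit conditions below mirror the ones listed
above):

```python
import math
import random

def bootstrap_ci(samples, *, n_resamples=1000, seed=0):
    """Percentile-bootstrap 95% CI over per-sample measurements.

    Returns None on empty, non-finite, or degenerate-constant input.
    """
    if not samples or any(not math.isfinite(x) for x in samples):
        return None
    if max(samples) == min(samples):
        return None  # constant input: no interval to report
    rng = random.Random(seed)  # fixed seed -> reproducible interval
    n = len(samples)
    means = sorted(
        sum(rng.choices(samples, k=n)) / n for _ in range(n_resamples)
    )
    return (means[int(0.025 * n_resamples)], means[int(0.975 * n_resamples)])

lo, hi = bootstrap_ci([0.2, 0.3, 0.4, 0.5, 0.6], seed=42)
assert lo <= 0.4 <= hi  # the interval brackets the sample mean
```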

### Sprint 13 — OpenAI-compatible HTTP scoring backend

Closes Audit 01 innovation item F7. Unlocks sway against hosted
fine-tunes — the OpenAI platform, `vllm serve`, Ollama — without
requiring a local torch + PEFT load.

- **New backend `ApiScoringBackend`** (`backends/api.py`): scores
  against a `/v1/completions` endpoint via httpx. Implements
  `ScoringBackend` only (not `DifferentialBackend`) since one
  endpoint is one model; users compose two `ApiScoringBackend`
  instances behind the existing `TwoModelDifferential` wrapper for
  a full differential run. Method surface: `logprob_of` via
  `echo=True` + `logprobs=0`, `rolling_logprob` via the same,
  `next_token_dist` via `max_tokens=1` + `logprobs=K`, plus
  `preflight_finite_check` and an `ApiScoringBackend.generate` for
  probes that need text output (e.g. `leakage`).
- **Token-boundary handling.** The API returns tokens as strings, so
  `logprob_of` walks the echoed tokens by character length until the
  running total covers the prompt, then sums the rest. When the
  boundary falls mid-token, the partial token lands on the prompt
  side — over-counting the prompt, under-attributing to the completion
  — and the behavior is asserted by a dedicated test
  (`test_mid_token_prompt_leans_conservative`).
- **Retry + preflight.** Tenacity handles 5xx + network errors with
  exponential backoff (configurable `max_retries`). Preflight hits
  the endpoint once with `hello`, `max_tokens=1`, `logprobs=1`;
  rejects non-finite logprobs before the suite runs.
- **First backend with `safe_for_concurrent_views=True`.** HTTP is
  stateless, so the S07 concurrent-probe scheduler can dispatch
  against an API backend in parallel as soon as the pool
  implementation lands (still scaffolding in the runner).
- **`ModelSpec` additions:** `BackendKind` gains `"api"`; the new
  `ModelSpec.endpoint` field carries the server's base URL (the
  `/v1/completions` path is appended by the backend). API key from
  the `SWAY_API_KEY` → `OPENAI_API_KEY` env var fallback, so secrets
  stay out of the YAML.
- **`backends.build` dispatch** routes `kind="api"` to
  `ApiScoringBackend`. Combined with `build_two_separate` +
  `TwoModelDifferential`, a YAML with
  `defaults.differential: false` and two `kind: api` models Just
  Works end-to-end.
- **New `[api]` extra** — `httpx>=0.27` + `tenacity>=9.0`. Core sway
  stays torch-free; the `[api]` extra adds ~1 MB of deps vs the
  `[hf]` extra's 3 GB.
- **Unit tests (21 new) use httpx's `MockTransport`** to intercept
  every call and assert numeric outputs match canned OpenAI-shaped
  responses — covering all three scoring methods, cache dedup, 4xx
  error surfacing, 503→200 retry recovery, the API-key env fallback,
  NaN-response preflight rejection, and the token-boundary math.
- **Prove-the-value (F7):** `tests/integration/test_api_ollama.py`
  is opt-in via `SWAY_OLLAMA_URL` + `SWAY_OLLAMA_MODEL`. Runs the
  full scoring surface against a live Ollama serving a small model
  (e.g. `llama3.2:1b`) and asserts finite output + a preflight pass.
  Documents the wall-time budget the "≤3× HF backend" claim rests
  on once the concurrent-dispatch pool lands.
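
The token-boundary walk can be sketched in a few lines (illustrative
names and a simplified input shape; the real `logprob_of` consumes the
full OpenAI-style response payload):

```python
def completion_logprob(echoed_tokens, token_logprobs, prompt):
    """Sum logprobs attributed to the completion, given echoed tokens.

    Advance through the echoed token strings by character length until
    the running text covers the prompt, then sum the remaining tokens'
    logprobs. A token straddling the boundary counts toward the prompt
    side — conservative: over-counts the prompt, under-attributes to
    the completion.
    """
    covered = 0
    total = 0.0
    for tok, lp in zip(echoed_tokens, token_logprobs):
        if covered < len(prompt):
            covered += len(tok)  # still consuming the prompt (maybe past it)
        else:
            total += lp
    return total

tokens = ["Hel", "lo ", "wor", "ld!"]
lps = [-1.0, -2.0, -0.5, -0.25]
# prompt ends exactly at a token boundary: completion is the last two tokens
assert completion_logprob(tokens, lps, "Hello ") == -0.75
# prompt ends mid-token: "wor" leans to the prompt side
assert completion_logprob(tokens, lps, "Hello w") == -0.25
```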

### Sprint 12 — Interactive HTML report

Closes Audit 01 innovation item F6. Adds the exploration surface to
sit alongside the terminal (CI logs) and markdown (PR artifacts)
outputs — a single-file interactive HTML page for the research /
write-up case.

- **`sway report result.json --format html --out report.html`** emits
  a self-contained HTML page with five interactive Plotly panels: a
  composite-score gauge with banded thresholds, a per-category
  horizontal-bar breakdown, a per-section SIS bar chart (when
  `section_internalization` evidence carries a `per_section` array),
  the adapter-ablation response curve (when `adapter_ablation` evidence
  carries `lambdas` + `mean_divergence_per_lambda`), and an
  all-probe score × z-score scatter with hover tooltips. The terminal
  verdict palette (green / yellow / red) carries across every panel.
- **The Plotly bundle is inlined once in `<head>`**; each panel is a
  Plotly-produced `<div>` with a stable `sway-*` id so snapshot tests
  don't churn on Plotly point releases. No external `<script src>` /
  `<link href>` references — the page loads offline from a single
  ~4.9 MB file (Plotly 6.x JS is the majority of that; the sway
  wrapper + chart data is ~40 KB).
- **The CLI gates `--out PATH` on file-producing formats.** `--format
  html` requires `--out` (~5 MB of JS has no business on stdout);
  `--format md|json|junit` now also accept `--out` for symmetry with
  the HTML path. `--format terminal` rejects `--out` — the terminal
  renderer is for the console only.
- **`plotly>=5.20` is optional**, shipped via the existing `[viz]`
  extra (alongside matplotlib). Without it, `--format html` exits 2
  with the `pip install 'dlm-sway[viz]'` install hint. The graceful
  ImportError → RuntimeError translation is tested end-to-end.
- **Snapshot** at `tests/snapshots/report.html` locks the sway-owned
  wrapper structure. The Plotly JS bundle is stripped from the
  snapshot (replaced with a placeholder) so the 43-line snapshot
  stays human-reviewable and Plotly version bumps don't drift it.
- **Prove-the-value:** `test_html_from_real_history_loads_offline`
  renders HTML from the committed `tests/fixtures/sway-history/02-*`
  run (4 probes, real exported payload) and asserts: the file parses
  via `html.parser`, zero external `<script src>` / `<link href>`
  references, every probe name appears in the body, and the file size
  lands between 1 MB and 10 MB. Measured output: 4.87 MB.
1305
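The bundle-stripping idea behind the snapshot can be sketched as a size-threshold pass over inline scripts. This is a hypothetical stand-in, not the shipped helper: `strip_large_scripts`, the bare `<script>` matching, and the 10 KB threshold are all assumptions.

```python
import re

# Any inline <script> body bigger than the threshold (i.e. the inlined
# Plotly bundle) is swapped for a placeholder; small sway-owned scripts
# survive verbatim. Only attribute-free <script> tags are matched here,
# which is enough for an illustration.
_SCRIPT_RE = re.compile(r"<script>(.*?)</script>", re.DOTALL)

def strip_large_scripts(html: str, threshold: int = 10_000) -> str:
    def repl(match: re.Match) -> str:
        if len(match.group(1)) > threshold:
            return "<script>/* plotly bundle stripped for snapshot */</script>"
        return match.group(0)  # keep small scripts verbatim
    return _SCRIPT_RE.sub(repl, html)
```

The payoff is the one named in the bullet above: the committed snapshot stays tiny, and a Plotly point release changes only the stripped region, not the snapshot.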
### Sprint 11 — `sway compare` across saved JSON runs

Closes Audit 01 innovation item F5. Ships the regression-dashboard primitive every CI integration reached for but had to script.

- **New CLI subcommand `sway compare`.** Accepts N saved result JSONs (typically from `sway run --json`), rehydrates each via `report.from_json`, folds them into a score matrix, and renders: a per-probe score table with columns per run, per-adjacent-pair delta columns (colored red on drop / green on lift), and a composite-score timeline row. Formats: `terminal` (default, Rich), `--format md`, `--format json`.
- **`--fail-on-regression <threshold>`.** Exits 1 when any probe's score in the newest run dropped ≥ `threshold` vs the prior run. `threshold=0` (default) disables the gate. The gate fires after the output is emitted so CI logs always capture the matrix even on red builds.
- **`suite/compare.py` module.** Clean separation: `build_matrix` folds `(SuiteResult, SwayScore)` pairs into a `CompareMatrix` dataclass; `render_{terminal,markdown,json}` consume that. No filesystem IO in the module itself — the CLI owns the reads, the renderers own the writes. Probes that disappeared between runs show as `None` / em-dash in the matrix; new probes show as `None` on older runs. The union of probe names is sorted for stable row order across invocations.
- **Markdown snapshot locked** at `tests/snapshots/compare.md` — silent schema drift breaks the test like every other snapshot.
- **Committed history fixture + prove-the-value test.** `tests/fixtures/sway-history/{01,02,03}-*.json` ship a three-run narrative: baseline → retrained-improved → over-trained. Run 03 plants a `section_internalization` drop of 0.22 and a `calibration_drift` drop of 0.25 while `delta_kl` still rises (the memorization-without-generalization failure mode). `test_compare_catches_planted_regression` invokes `sway compare sway-history/*.json --fail-on-regression 0.10` and asserts exit=1 with both regressed probes surfaced in the JSON payload and terminal text — exactly the F5 "CI gate the build on regression" experiment.

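The matrix fold and the regression gate can be sketched in a few lines. The names `build_matrix` and `CompareMatrix` come from the changelog; the exact field layout and the `regressions` helper are assumptions for illustration (the real module folds `(SuiteResult, SwayScore)` pairs, not plain dicts).

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class CompareMatrix:
    probe_names: list[str]            # sorted union across all runs
    scores: list[list[float | None]]  # rows: probes, cols: runs (None = probe absent)

def build_matrix(runs: list[dict[str, float]]) -> CompareMatrix:
    # Union of probe names, sorted for stable row order across invocations;
    # probes missing from a run land as None in that column.
    names = sorted(set().union(*(r.keys() for r in runs)))
    scores = [[run.get(name) for run in runs] for name in names]
    return CompareMatrix(names, scores)

def regressions(m: CompareMatrix, threshold: float) -> list[str]:
    """Probes whose newest score dropped >= threshold vs the prior run.
    (The real CLI special-cases threshold=0 to disable the gate.)"""
    out = []
    for name, row in zip(m.probe_names, m.scores):
        prev, newest = row[-2], row[-1]
        if prev is not None and newest is not None and prev - newest >= threshold:
            out.append(name)
    return out
```

With this shape, the renderers only ever consume a `CompareMatrix`, which is what keeps filesystem IO out of the module.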
### Sprint 10 — Multi-rank adversarial null adapters

Closes Audit 01 innovation item F4. Extends the S02 null-calibration matrix with a per-rank profile — users now read "how rank-saturated is my adapter?" straight off the report.

- **`NullAdapterSpec.rank_multipliers: list[float] = [1.0]`** (new). Default preserves single-rank behavior byte-for-byte. Setting `[0.5, 1.0, 2.0]` calibrates three independent null distributions per probe kind and emits a z-profile alongside the verdict z-score.
- **`NullCalibratedBackend.as_null_adapter` gains a `rank_scale` kwarg.** Both shipped backends (dummy + HF) implement it by scaling the null-weight noise std by `sqrt(rank_scale)` — mathematically equivalent to a rank change in terms of the LoRA output variance (`A·B` is a sum of `r` rank-1 outer products; variance is linear in `r`). No tensor-shape surgery, no model reload per multiplier.
- **`RunContext.null_stats_by_rank`** threads the per-rank matrix through the runner. Keys are canonical `rank_{mult:.2f}` strings; the inner structure matches `null_stats`.
- **Numeric probes emit `evidence["z_by_rank"]`.** Every numeric probe (delta_kl, calibration_drift, external_perplexity, leakage, adapter_revert, adapter_ablation, paraphrase_invariance, preference_flip, prompt_collapse, section_internalization, style_fingerprint) computes per-rank z-scores using its existing sign convention. The verdict path still reads the 1.0x group — no existing thresholds shift.
- **Report surfaces the profile.** Terminal + markdown probe notes append `rank profile: +4.2σ @ 1x / +6.8σ @ 0.5x / +2.1σ @ 2x` whenever multi-rank calibration ran. The `format_z_profile` helper centralizes the rendering.
- **Null-stats disk cache widens key to include `rank_multipliers`.** Single-rank caches from pre-S10 runs are still readable — they're promoted to the new `null_stats_by_rank` shape on load.
- **Bug fix (`external_perplexity`): drop erroneous z sign flip.** `mean_delta` is higher-is-better (ft logprob minus base logprob on external prose), so the raw z-score maps directly onto `z >= assert_z_gte`. S09's sign flip was reversing the pass/fail direction when the null-calibration path ran.
- **README: rank-profile interpretation guide.** Explains when the shape indicates rank saturation vs. rank oversizing, plus the "low rank can be pathologically quiet" dual-reading caveat.
- **Prove-the-value (`tests/unit/test_null_multi_rank.py`):** on a fixed adapter with the dummy backend, the delta_kl z-profile is strictly monotone in inverse rank (`z@0.5x > z@1.0x > z@2.0x`) — exactly the signature of a rank-scaled null distribution.

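The `sqrt(rank_scale)` equivalence rests on the parenthetical variance claim above, which is easy to check numerically. The dimensions below are illustrative, not the shipped code:

```python
import numpy as np

# Each entry of A·B is a sum of r independent products of iid noise
# entries, so its variance is linear in r. Doubling the rank therefore
# doubles the output variance — the same effect as scaling the noise
# std of a fixed-rank update by sqrt(2), since variance scales with
# the square of the std.
rng = np.random.default_rng(0)
d = 64  # illustrative hidden dim

def mean_output_var(r: int, trials: int = 200) -> float:
    per_trial = []
    for _ in range(trials):
        A = rng.normal(0.0, 1.0, (d, r))
        B = rng.normal(0.0, 1.0, (r, d))
        per_trial.append((A @ B).var())
    return float(np.mean(per_trial))

v_rank8 = mean_output_var(8)
v_rank16 = mean_output_var(16)
ratio = v_rank16 / v_rank8  # ≈ 2.0: variance is linear in rank
```

That linearity is why no tensor-shape surgery is needed: a std multiplier on the null noise reproduces the output statistics of a genuinely different rank.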
### Sprint 09 — External-perplexity-gap probe

Closes Audit 01 innovation item F3. First innovation sprint landed on top of the audit-closure campaign (S01–S07).

- **New probe `external_perplexity`** (`probes/external_perplexity.py`): rolling-logprob delta of ft vs base on held-out public-domain English prose. The raw metric is `mean_delta_nats` (per-token). Negative = the adapter raised perplexity on external text (diffuse forgetting); positive = the adapter improved English modeling incidentally. Category `calibration`; scored alongside `calibration_drift`.
- **Packaged public-domain corpus** (`probes/_corpora/public_domain_en.txt`, ~14 KB): hand-assembled from twelve US-public-domain passages (Lincoln, Emerson, Twain, Thoreau, Austen, Darwin, Shakespeare, Franklin, Melville, Dickinson, US Constitution, Federalist #10). Each passage carries an inline `# -- source:` provenance comment identifying the work and the "US public domain (pre-1929)" basis. The loader strips the comments at read time so the probe sees only raw prose.
- **Null calibration**: `calibrate_spec()` returns a 4-chunk cheap version so `null_adapter` produces per-kind stats without dominating suite runtime. Sign-flipped z-score — lower raw delta is worse, so the shared `z >= assert_z_gte` semantics read as "σ better than noise" on external fluency.
- **`autogen` wiring**: emits an `external_ppl` suite entry (with `corpus=public_domain_en`, `max_chunks=8`) whenever the source `.dlm` has any PROSE section. The intent table explains it as the "diffuse-forgetting complement to `calibration_drift`."
- **Prove-the-value test** (`tests/unit/test_ext_ppl_vs_calibration_drift.py`): a dummy backend applies a uniform −0.3 nats/token drift to every pack item and every corpus chunk — below `calibration_drift`'s 1.0-nat per-item regression threshold, above its −0.5 mean-delta gate, but well below `external_perplexity`'s −0.1 fixed threshold. `calibration_drift` reports PASS, `external_perplexity` reports FAIL. The two probes measure different failure modes and do not substitute for each other.

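The raw metric reduces to a per-token mean of log-prob deltas across corpus chunks. A minimal sketch, assuming per-chunk rolling logprobs are already available from the backend (`mean_delta_nats` here is a hypothetical stand-in, not the probe's real internals):

```python
def mean_delta_nats(base_logprobs: list, ft_logprobs: list) -> float:
    """Per-token mean of (ft - base) log-prob across all corpus chunks.

    Negative: the adapter made held-out external prose *less* likely
    (diffuse forgetting). Positive: incidental improvement in general
    English modeling.
    """
    deltas = [
        f - b
        for base_chunk, ft_chunk in zip(base_logprobs, ft_logprobs)
        for b, f in zip(base_chunk, ft_chunk)
    ]
    return sum(deltas) / len(deltas)
```

The per-token normalization is what makes the −0.3 nats/token fixture in the prove-the-value test comparable across chunks of different lengths.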
### Sprint 07 — Performance & caching

Closes Audit 01 findings B19 (deferred per design note), E-cache-opportunity.

- **Forward-pass cache** (`backends/_instrumentation.py`): every backend view now routes `next_token_dist` / `rolling_logprob` / `logprob_of` through a bounded LRU keyed on `(op, view_id, prompt_hash, top_k)`. Baseline A/B on a tiny 4-probe suite against SmolLM2-135M on CPU: 2.33s → 1.76s (**25% wall-time reduction**, forward passes 88 → 62, hit rate 30%). The audit's 18.5s Quillstone suite has more overlap and would see larger gains; the 25% floor holds on any suite with repeated prompts across probes.
- **Cache key includes `view_id`**: `"base"` / `"ft"` / `"scaled_1.25"` / `"null_42"` are distinct namespaces. A future toggle regression that fails to flip the adapter would surface as wrong-side cached values immediately, not silently corrupt divergence math.
- **Backend stats surface in the report**: `SuiteResult.backend_stats` captures `cache_hits` / `cache_misses` / `forward_passes` / `scoring_wall_s` / `hit_rate`. Terminal + markdown footers render `cache: 26/88 = 30%` when stats are present.
- **Forward-pass tracing**: `sway run --trace <path.jsonl>` writes one event per backend scoring call (probe / view_id / prompt_hash / top_k / op / wall_ms / hit). Zero overhead when unset.
- **`concurrent_probes` scaffolding**: `spec.defaults.concurrent_probes: int = 1` field plus a `safe_for_concurrent_views: bool = False` class attribute on the HF / MLX / Dummy backends. The runner warns on stderr when the user requested > 1 against an unsafe backend and stays sequential. Custom backends that are already concurrency-safe (e.g. a stateless hosted-API backend) can opt in without waiting for the HF fix.
- **B19 design note**: `.docs/design/backend-concurrency.md` documents the current state, why v0.1 doesn't fix it, and two future paths (per-thread lock vs per-worker pool) with a recommendation.
- **Generation stays uncached**: `view.generate()` intentionally bypasses the cache — probe-side callers vary `(prompt, max_new_tokens, temperature, seed)` in ways that would rarely collide, and caching sampled output would hide seed bugs behind stale strings.

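The cache's two load-bearing properties — bounded LRU eviction and `view_id` in the key — can be sketched with an `OrderedDict`. This is a minimal stand-in for the real `backends/_instrumentation.py` code, whose layout is an assumption:

```python
from collections import OrderedDict

class ForwardCache:
    """Bounded LRU keyed on (op, view_id, prompt_hash, top_k).

    view_id in the key means "base" / "ft" / "null_42" results can never
    cross-contaminate: a toggle regression shows up as wrong-side values,
    not silently corrupted divergence math.
    """

    def __init__(self, maxsize: int = 256):
        self._data: OrderedDict = OrderedDict()
        self.maxsize = maxsize
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, op, view_id, prompt_hash, top_k, compute):
        key = (op, view_id, prompt_hash, top_k)
        if key in self._data:
            self.hits += 1
            self._data.move_to_end(key)  # mark as most-recently used
            return self._data[key]
        self.misses += 1
        value = compute()
        self._data[key] = value
        if len(self._data) > self.maxsize:
            self._data.popitem(last=False)  # evict least-recently used
        return value
```

Hit/miss counters are what ultimately feed the `cache: 26/88 = 30%` report footer.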
### Sprint 06 — CLI & report UX polish

Closes Audit 01 findings D3, D4, D5, D6, D7, D8, D9, D10, D11, D12, D13, D14, D15, B16, B17, B18.

- **D3 — extras rollup footer:** terminal + markdown reports now end with a single `pip install 'dlm-sway[...]'` line collecting every extra mentioned in SKIP messages. Single rollup, no per-row scan.
- **D4 — `sway check` infers `--base`:** reads `base_model_name_or_path` from the adapter's `adapter_config.json` when `--base` is omitted; echoes "(inferred base model: …)" so the user knows what got picked. Falls back to a clear required-flag error when the field is missing.
- **D5 — `sway autogen` annotates the YAML:** generated specs carry a header block (source `.dlm`, dlm_id, base, adapter, generation timestamp, sway version) and a one-line `# intent` comment above every probe entry. No new dep — pyyaml + a position-based post-processor instead of `ruamel.yaml`.
- **D6 — `sway run --dry-run` + `sway list-probes`:** dry-run validates the spec, prints a probe table (#, name, kind, category, enabled), and exits 0 without building a backend. `list-probes` prints every shipped probe kind with its category and the first line of its docstring.
- **D7 — `sway doctor --json`:** machine-readable doctor payload (`sway_version`, `python`, `platform`, `extras`) keyed by extra and module. CI-grep-friendly.
- **D8 — `dlm_source` resolution warns instead of silently swallowing:** when the spec sets `dlm_source` but the `[dlm]` extra isn't installed (or the bridge errors), the runner emits a yellow stderr warning with the install hint and continues with `sections=None`. No more silent "why are my section probes SKIPping?" debugging.
- **D9 — markdown column parity with terminal:** the markdown probes table now carries `score`, `raw`, `z`, `duration`, and `note` columns. A `Top findings` section follows the table; a `Skipped probes` section closes when extras are missing.
- **D10 — unified number formatters:** `format_score`, `format_raw`, `format_z`, `format_duration_s` live in `suite/report.py` and are the only numeric formatters used. Thousands separators above 1 000; a `—` glyph for `None` / non-finite. A single source kills cross-surface drift.
- **D11 — `sway report --format` validated via `StrEnum`:** the Typer-level enum rejects unknown formats with the standard "Invalid value for '--format'" error. No more silent fallback to the terminal renderer on a typo.
- **D12 — `sway check` verdict banner:** prints `✅ adapter is +4.2σ above noise` (green ≥ 3σ), `⚠️ adapter is +1.5σ above noise — marginal` (yellow ≥ 1σ), or `❌ adapter is +0.3σ — indistinguishable from noise` (red) above the full report. Calibrated on the delta_kl z-score; `check` now runs `null_adapter` first so the z-score is available.
- **D13 — `sway diff` regression summary:** after per-probe deltas, the diff prints `A→B: N regressed >0.10, M regressed >0.20, composite Δ=±X.XX` color-coded by direction.
- **D14 — adapter paths with spaces render safely:** `_adapter_label` wraps the path in double quotes when any whitespace is present.
- **D15 — long messages wrap, don't truncate:** the terminal probe table column uses Rich's `overflow="fold"` instead of an 80-char hard cut with an ellipsis. The full message is always visible.
- **B16 — single markdown renderer:** `report.from_json` round-trips saved JSON back into the canonical dataclass pair, so `sway report --format md` and `sway run --markdown` both flow through `report.to_markdown` with identical output. The legacy `_render_markdown_from_json` and `_render_junit_from_json` helpers in `cli/commands.py` are deleted.
- **B17 — `None` scores render as `—`:** the unified formatters cover every render path; no more `0.00` masking missing data.
- **B18 — `baseline` row labeled `(informational, weight=0)`:** in both terminal and markdown component breakdowns, matching the Sprint 03 explicit-weight-zero decision.

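The D10/B17 behavior can be sketched as a few small formatters. The names come from the changelog; the exact precision choices and the dash-like glyph constant are assumptions:

```python
import math

# A single glyph stands in for None / non-finite, so no surface ever
# prints a misleading "0.00" for missing data (the B17 fix).
MISSING = "—"

def format_score(v) -> str:
    if v is None or not math.isfinite(v):
        return MISSING
    return f"{v:.2f}"

def format_raw(v) -> str:
    if v is None or not math.isfinite(v):
        return MISSING
    return f"{v:,.3f}"  # thousands separators above 1 000

def format_z(v) -> str:
    if v is None or not math.isfinite(v):
        return MISSING
    return f"{v:+.1f}σ"  # signed, one decimal, σ suffix
```

Routing every render path through one module is what makes the terminal, markdown, and JSON surfaces agree byte-for-byte on numbers.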
### Sprint 05 — Probe quality & edge cases

Closes Audit 01 findings B6, B7, B8, B9, B10, B11, B12, B13, B14, B20, B21, B22.

- **B6 — `tail_logprob` semantics:** the field is now `float | None` with three discrete states: `None` (top-k covered the full vocab — no tail to redistribute), `0.0` (a real tail underflowed below fp32), and a measurable negative log-prob. HF, MLX, and dummy backends all emit `None` when `k == vocab`. The preflight check guards against non-finite `tail_logprob` only when it's a number.
- **B7 — probe validation before backend build:** the new `validate_all_probes(suite)` helper collects every spec error in one pass; `_execute_spec` calls it *before* materializing the backend so a typo in `kind:` surfaces immediately, and *all* typos surface in a single error message.
- **B8 — `autogen` style prompts:** `style_fingerprint` no longer receives the leading sentence of a prose section (which elicited doc *content*, not stylistic voice). Replaced with a fixed `_STYLE_ELICITATION_PROMPTS` set of 6 open-ended, content-neutral prompts.
- **B9 — sentence-transformer caching:** `_load_embedder` is now wrapped in `functools.lru_cache(maxsize=4)`. Suites that run `adapter_revert` back-to-back (multi-adapter diff, repeated probes) reuse the same ~80 MB embedder instead of re-loading it on every probe call.
- **B10 — `_lcs_ratio` docstring rot:** the function name is a historical misnomer (it's gestalt similarity via `difflib.SequenceMatcher`, not LCS). The docstring is rewritten to make the contract explicit; the rename is deferred to v0.2 to preserve the public API surface.
- **B11 — leakage adversarial perturbations:** added 4 new perturbations (`synonym_swap` with a hand-curated 50-pair table, `clause_reverse`, `prefix_inject`, `register_shift`) alongside the original three. The default perturbation list is now all 7; the spec accepts any subset.
- **B12 — calibration pack expansion:** `BUILT_IN_PACK` grew from 30 to **200** items (geography, natural sciences, arithmetic, language, history, biology, technology, miscellaneous). All public-domain / hand-composed grade-school facts — no third-party dataset license attaches. The `items_limit` docstring documents the new resolution (~0.5 pp per regressed item vs the old ~3.3 pp).
- **B13 — tokenizer-aware `_stuffing`:** `prompt_collapse` now derives its padding from the model's pad / unk / EOS token via the backend's tokenizer, making the metric language-agnostic. The pre-B13 hardcoded English string remains as the dummy-backend fallback and behind a `legacy_stuffing: bool = False` spec field for one release of backward-compat.
- **B14 — `preference_flip` per-triple error fence:** wrapped each triple's `logprob_of` calls in `try/except ProbeError`. A single bad triple no longer kills the whole batch; `evidence["dropped_triples"]` + `dropped_reasons` surface the count + the first 5 reasons. When *every* triple raises, the probe routes to ERROR with a clear explanation.
- **B20 — custom backend protocol checks:** `_load_custom` now isinstance-checks `NullCalibratedBackend` and `ScalableDifferentialBackend` after the `DifferentialBackend` check. The set of satisfied protocols is stamped onto the instance as `__sway_protocols__: tuple[str, ...]` so the report can show which features are available without re-checking.
- **B21 — `RunContext.null_stats` truly frozen:** the runner now wraps the stats dict in `types.MappingProxyType` before threading it through. The dataclass was already `frozen=True` but the dict was mutable by reference; B21 makes the docstring's "frozen" claim literally true. The field type widened from `dict[str, dict[str, float]]` to `Mapping[str, Mapping[str, float]]`.
- **B22 — `ModelSpec.adapter` path normalization:** added a pydantic `field_validator` that runs `Path.expanduser().resolve()` on the field at spec-load time. Backends no longer re-do the work; the cache key in `_null_cache.compute_key` is now stable regardless of how the user spelled the path in YAML or on the CLI.

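The B21 distinction — frozen dataclass, mutable dict field — is worth a two-line demonstration. The stats layout below mirrors the changelog but is otherwise illustrative:

```python
from types import MappingProxyType

# A frozen dataclass only freezes attribute *assignment*; a dict held in
# a field is still mutable through the reference. Wrapping the dict in
# MappingProxyType gives callers a read-only view, making the docstring's
# "frozen" claim literally true. (The proxy is shallow: deep immutability
# would need the inner per-kind dicts wrapped too.)
null_stats = {"delta_kl": {"mean": 0.02, "std": 0.004, "n": 8}}
frozen_stats = MappingProxyType(null_stats)

try:
    frozen_stats["delta_kl"] = {}  # any write through the proxy raises
except TypeError:
    write_rejected = True
```

Reads are unaffected, which is why widening the field type to `Mapping[str, Mapping[str, float]]` is the only signature change callers see.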
### Sprint 04 — Integration & regression testing

Closes Audit 01 findings C1, C2, C5, C6, C7, C8, C10, C11, C12, B3 (test side), B15.

- **Slow lane runs end-to-end** (C1): the `huggingface_hub` dev dep added in S01 is verified; `pytest -m "slow or online"` now executes every checked-in integration test from a clean clone.
- **HF backend coverage 21% → 91%** (C2) — measured across the combined fast + slow lane against `dlm_sway.backends.hf`. New tests:
  - `tests/integration/test_hf_adapter_toggle.py` extended with a bit-identical ft → base → ft roundtrip (B15 mitigation).
  - `tests/integration/test_hf_scaled_adapter.py`: λ sweep monotonicity + `LoraLayer.scaling[key]` restoration on clean exit AND on exception.
  - `tests/integration/test_hf_null_adapter.py`: same-seed determinism, different-seeds divergence, original adapter restoration on clean exit AND on exception.
  - `tests/integration/test_hf_scoring.py`: `logprob_of` (incl. zero-token-completion → ProbeError), `rolling_logprob`, `next_token_dist`, `_HFView.generate` (greedy + sampled-with-seed determinism).
  - `tests/unit/test_backend_hf_helpers.py`: direct unit coverage on `_resolve_dtype` and `_detect_device` so dtype regressions are caught in the fast lane.
- **MLX smoke test** (C5): `tests/integration/test_mlx_smoke.py` exercises `MLXDifferentialBackend` on darwin-arm64 with a small LoRA adapter. Skips cleanly on non-darwin / non-arm64 / no-mlx_lm / missing-fixture. A reproducible adapter builder ships at `tests/fixtures/build_mlx_adapter.py` (run once on a Mac to populate the fixture directory).
- **`sway gate` exit code pinned** (C6): `tests/integration/test_sway_gate_exit_code.py` covers PASS (exit 0), FAIL verdict (exit 1), and below-threshold-with-passing verdicts (exit 1) via Typer's `CliRunner`.
- **`dlm` import ban regression-guarded** (C7): `tests/unit/test_dlm_not_imported.py` patches every `dlm.*` entry in `sys.modules` to `None` and asserts the dummy suite runs end-to-end without ImportError, plus that `sway autogen` surfaces a clean install-hint error rather than a stack trace.
- **Pathological probe coverage** (C8 + B3 test side): `tests/unit/test_probe_adapter_ablation.py` now drives monotonically-decreasing curves through the helper and pins probe-level `evidence["saturation_reason"]` for flat / found / overshoot-with-dip / non_monotonic shapes via a monkeypatched `divergence`.
- **WARN-branch numerical formula pinned** (C10): `test_warn_branch_score_formula_pinned` in `tests/unit/test_probe_preference_flip.py` asserts the exact `score = 0.5 + mean_delta / 4.0` formula with a hand-computed expected value.
- **Disjoint top-k divergence** (C12): `test_disjoint_top_k_supports_produce_finite_divergence` in `tests/unit/test_divergence.py` covers the `aligned_probs` tail-redistribution path against fully-disjoint base / ft supports — JS comes out finite and approaches its theoretical ln(2) bound.
- **Report schema snapshots** (C11): `tests/unit/test_report_snapshot.py` byte-compares `to_json` / `to_markdown` / `to_junit` against checked-in snapshots under `tests/snapshots/`. Intentional schema bumps: `SWAY_UPDATE_SNAPSHOTS=1 uv run pytest …` and commit the updated files. No new dep — hand-rolled diff helper.
- **CI workflow shipped**: `.github/workflows/ci.yml` runs the fast lane (unit + lint + mypy) on every push / PR, and the slow lane (integration, HF backend) on schedule (nightly 07:00 UTC), manual dispatch, push to main, and PRs that touch `src/dlm_sway/backends/`, `tests/integration/`, or `pyproject.toml`.

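The C11 snapshot flow (byte-compare, opt-in regeneration via `SWAY_UPDATE_SNAPSHOTS=1`) fits in one helper. This is a hypothetical sketch of the hand-rolled approach, not the repo's actual helper:

```python
import os
from pathlib import Path

def assert_matches_snapshot(rendered: str, snapshot: Path) -> None:
    """Byte-compare rendered output against a checked-in snapshot.

    With SWAY_UPDATE_SNAPSHOTS=1 the snapshot is rewritten instead of
    compared — the intentional-schema-bump path; the updated file is
    then reviewed and committed like any other diff.
    """
    if os.environ.get("SWAY_UPDATE_SNAPSHOTS") == "1":
        snapshot.write_text(rendered)
        return
    expected = snapshot.read_text()
    assert rendered == expected, f"snapshot drift in {snapshot.name}"
```

Because the comparison is byte-exact, any silent renderer change fails loudly rather than drifting the report schema unnoticed.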
### Sprint 03 — Documentation truth & dead code

Closes Audit 01 findings P03, P04, P05, P07 (doc), P08, P09, P10, P14, P15, P17, P18.

- **Determinism is now wired** (P09): the suite runner calls `dlm_sway.core.determinism.seed_everything(spec.defaults.seed)` before any backend work runs. The achieved determinism class (`strict` / `best_effort` / `loose`) is captured in `SuiteResult.determinism` and surfaces in the report footer + JSON payload + markdown header.
- **`defaults.differential: false` works end-to-end** (P14): new `backends/two_model.py` with a `TwoModelDifferential` wrapper + `build_two_separate(spec_models)` helper. `_execute_spec` routes to the wrapper when the flag is false. Doubles memory; primarily for custom backends that can't toggle adapters in place.
- **`score_weights` overridable from YAML and CLI** (P15): new `SuiteDefaults.score_weights` field with a subset-validating pydantic field validator (rejects unknown categories, negative weights, all-zero weights). New `--weights k=v,k=v` flag on `sway run` and `sway gate`. Partial overrides merge with `DEFAULT_COMPONENT_WEIGHTS` so users rarely have to respecify all five categories.
- **`baseline` row labeled `(informational)` with weight 0.0 explicit** (P08, B18): added to `DEFAULT_COMPONENT_WEIGHTS` so the row appears in reports for transparency but contributes nothing to the composite. The terminal renderer adds the `(informational)` annotation column; the markdown table adds a `weight` column with the same label.
- **Extended (9-dim) style fingerprint when `[style]` is installed** (P10): `style_fingerprint` now actually uses spaCy POS-tagging and textstat syllable counting. Three new dims are appended to the existing 6: passive-voice rate, POS 4-gram entropy (Shannon, bits), and syllables per word. The spec field `extended: "auto" | "on" | "off"` (default `"auto"`) controls the path. `evidence["schema_version"]` bumped to `2` so snapshot consumers can branch on dimensionality.
- **Stale "Known gaps" entries refreshed** (P03, P04, P05): removed "null_adapter not yet wired" (delivered in S01/S02) and "custom backend stubbed" (it's full); "MLX paths raise" rewritten to describe the real limitation: two model copies, because `mlx_lm` has no runtime adapter toggle.
- **README adds determinism + weights + differential documentation** and a calibration paragraph that points at the per-kind null matrix.
- **`delta_kl` docstring rot fixed** (P17): dropped the dead ``:mod:`dir``` reference.
- **Meta-test `test_no_dead_options.py`** (P14/P15 regression guard): greps the source tree for documented spec/CLI option names and asserts each has a consumer outside its declaration site. Catches the exact "documented but unused" pattern Audit 01 flagged.

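The P15 merge-and-validate behavior can be sketched without pydantic. The category names come from the changelog; the default weight values and `merge_weights` itself are illustrative assumptions:

```python
# Illustrative default weights — not the shipped values. baseline is
# explicitly present at 0.0 so it renders but never contributes (P08/B18).
DEFAULT_COMPONENT_WEIGHTS = {
    "adherence": 0.3, "attribution": 0.3, "calibration": 0.2,
    "ablation": 0.2, "baseline": 0.0,
}

def merge_weights(overrides: dict) -> dict:
    """Merge a partial --weights override onto the defaults, rejecting
    unknown categories, negative weights, and all-zero results."""
    unknown = set(overrides) - set(DEFAULT_COMPONENT_WEIGHTS)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    if any(v < 0 for v in overrides.values()):
        raise ValueError("negative weights rejected")
    merged = {**DEFAULT_COMPONENT_WEIGHTS, **overrides}
    if all(v == 0 for v in merged.values()):
        raise ValueError("all-zero weights rejected")
    return merged
```

The partial-merge semantics are the point: overriding one category leaves the other four at their defaults, so `--weights ablation=0.5` is a complete, valid invocation.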
### Sprint 02 — Universal z-score calibration

Closes Audit 01 findings P02 (delivery), B2, C9.

- **`NullAdapterProbe` is now a per-kind calibration matrix**: iterates every downstream numeric kind in the suite (or an explicit `calibrate_kinds` list), runs a miniature version of each through the `NullCalibrationBackendProxy` for N seeds, and publishes `{kind: {mean, std, n}}` under `evidence["null_stats"]`. The old "publishes two keys only" implementation is gone (closes B2).
- **`NullCalibrationBackendProxy`** (`_null_proxy.py`): swaps `as_finetuned()` to yield `as_null_adapter(seed)` so every numeric probe's own math produces the "what does my metric look like when the fine-tune is structural noise?" distribution without a bespoke code path per probe.
- **`Probe.calibrate_spec(ctx)` classmethod** plus `SENTINEL_PROMPTS` / `SENTINEL_DOC` shared constants in `probes/base.py`. Each numeric probe overrides `calibrate_spec` to return a small, cheap spec for calibration; probes that can't be meaningfully calibrated (e.g. `adapter_revert` needs an embedder, `adapter_ablation` needs `as_scaled_adapter`, `prompt_collapse` can't fit an exponential decay to null noise) opt out by returning `None` and surface `(no calibration for <kind>)` in the report.
- **Shared `_zscore` helpers** (`probes/_zscore.py`): `z_score(raw, stats)`, `verdict_from_z(z, threshold)`, `score_from_z(z)`, `no_calibration_note(kind)`. Every numeric probe now flows through these — no bespoke `(raw - mean) / std` math is left in any probe file. Includes the `MIN_STD = 1e-6` floor so a degenerate null distribution (e.g. a single-seed calibration) can't produce an infinite z-score (closes C9).
- **All 10 numeric probes thread `get_null_stats(ctx, self.kind)`**: `delta_kl`, `adapter_revert`, `prompt_collapse`, `section_internalization`, `paraphrase_invariance`, `preference_flip`, `style_fingerprint`, `calibration_drift`, `leakage`, `adapter_ablation`. Each has an `assert_z_gte: float = 3.0` that is *preferred* when stats exist. Lower-is-better probes (`adapter_revert`, `calibration_drift`, `leakage`) sign-flip the z internally so the shared `z >= threshold` PASS rule still reads as "significantly better than null". Two probes (`paraphrase_invariance`, `calibration_drift`) keep their intent-aware / compound thresholds as the no-calibration fallback.
- **On-disk null-stats cache** (`probes/_null_cache.py`, `backends/hf.py`): the HF backend exposes `cache_identity()`; `NullAdapterProbe` hashes `(backend_identity, runs, init_scale, seed_base, top_k, kinds)` into a stable filename under `~/.dlm-sway/null-stats/<key>.json` (XDG-respecting). The cache is best-effort — a missing / malformed file rebuilds. Disable per-suite with `cache: false` in the spec, or globally with `SWAY_DISABLE_NULL_CACHE=1`. The dummy backend doesn't expose a cache identity, so tests never touch disk unless they opt in.
- **Report shows z-scores in every numeric probe row**: the terminal renderer already had a `z` column; the markdown renderer now does too. Rows that fell back to fixed thresholds carry `(no calibration for <kind>)` directly in the message.

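The shared helpers above can be sketched in a few lines. The signatures and the `MIN_STD` floor come from the changelog; the bodies and the `flip` kwarg for lower-is-better probes are assumptions about how the shared path might look:

```python
# MIN_STD stops a degenerate null distribution (e.g. a single-seed
# calibration with zero spread) from producing an infinite z-score.
MIN_STD = 1e-6

def z_score(raw: float, stats: dict, *, flip: bool = False) -> float:
    """Standard z against the null distribution {mean, std, n}.
    Lower-is-better probes flip the sign so the shared z >= threshold
    PASS rule still reads as "significantly better than null"."""
    z = (raw - stats["mean"]) / max(stats["std"], MIN_STD)
    return -z if flip else z

def verdict_from_z(z: float, threshold: float = 3.0) -> str:
    return "PASS" if z >= threshold else "FAIL"
```

Centralizing this is what removed the bespoke `(raw - mean) / std` math from every probe file in one sweep.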
1779 ### Sprint 01 — Finite safety & verdict integrity
1780
1781 Closes Audit 01 findings B1, B3 (fix), B4, B5, C3, C4, D1, D2.
1782
1783 - **`safe_finalize` helper** in `core/result.py`: every numeric probe now
1784 routes its final `ProbeResult` through this guard. A non-finite
1785 `raw` (or any explicitly-critical field) auto-converts to
1786 `Verdict.ERROR`; non-finite auxiliaries are nulled out and recorded
1787 in `evidence["defensively_nulled"]`.
1788 - **`_divergence` non-finite rejection**: `aligned_probs`, `kl`, `js`
1789 now raise `ProbeError` on NaN/inf inputs instead of silently
1790 producing garbage. JS results are bounds-checked against the
1791 theoretical `[0, ln 2]`; out-of-bound results raise rather than
1792 ship. (Pins the +11639σ bug — JS exceeded ln 2 nats by 19×.)
1793 - **`PreflightCheckable` Protocol**: HF and dummy backends gain
1794 `preflight_finite_check()` that runs one forward pass per view with
1795 a sentinel prompt and rejects backends whose logprobs are non-finite.
1796 - **Suite runner preflight gate**: every `sway run` calls
1797 `backend.preflight_finite_check()` before the probe loop. Failure
1798 emits a single synthetic ERROR probe and skips all configured
1799 probes; `sway gate` exits non-zero. Disable with `skip_preflight=True`
1800 for sub-second test suites.
1801 - **`style_fingerprint` zero-vector fix (B4)**: a fine-tuned model that
1802 produces empty / whitespace-only generations no longer reports a
1803 spurious "+0.82 shift toward doc" PASS. The `_cosine_shift` math is
1804 replaced by `_projection_shift = (ft-base)·(doc-base) / ||doc-base||²`
1805 which goes to zero when ft equals base.
1806 - **`_saturation_lambda` full-range search (B3)**: the adapter-ablation
1807 saturation detector now searches the entire λ range (not just λ ≤ 1.0)
1808 with `max(divs)` as the reference, and returns a typed reason
1809 (`"found"` / `"non_monotonic"` / `"flat_curve"` / `"below_floor"`)
1810 so a flat NaN-adapter curve is distinguishable from a healthy
1811 saturating one.
- **Property-based divergence tests** (`hypothesis`): symmetry,
  `JS(p,p) = 0`, `JS ≤ ln 2`, `KL(p,p) = 0`, KL non-negativity. Plus
  explicit non-finite-raises tests pinning every entry point.
- **End-to-end NaN regression** (`tests/integration/test_nan_adapter_regression.py`,
  slow+online): builds a real PEFT adapter with all-NaN weights,
  persists it through safetensors, then asserts the HF backend's
  preflight rejects it and the suite produces ERROR — not the
  +11639σ headline the audit caught.
- Added `huggingface_hub` to the dev dependency group so integration tests
  can resolve the `tiny_model_dir` fixture (closes C1 ahead of Sprint 04).
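The divergence guard and the B4 projection fix above can be sketched in a few lines of numpy. This is an illustrative stand-in, not the shipped code: the function names `js_divergence` / `projection_shift`, the use of `ValueError` in place of `ProbeError`, and the bound tolerance are all assumptions.

```python
import math

import numpy as np


def js_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """Jensen-Shannon divergence in nats, with non-finite rejection
    and the theoretical [0, ln 2] bounds check."""
    if not (np.isfinite(p).all() and np.isfinite(q).all()):
        raise ValueError("non-finite input distribution")  # stand-in for ProbeError

    m = 0.5 * (p + q)

    def kl(a: np.ndarray, b: np.ndarray) -> float:
        mask = a > 0  # 0 * log 0 contributes nothing
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

    js = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    # Out-of-bound results raise rather than ship (JS <= ln 2 nats).
    if not (0.0 <= js <= math.log(2) + 1e-9):
        raise ValueError(f"JS out of bounds: {js}")
    return js


def projection_shift(base: np.ndarray, ft: np.ndarray, doc: np.ndarray) -> float:
    """(ft-base)·(doc-base) / ||doc-base||² — exactly zero when ft == base,
    unlike the cosine form it replaced."""
    d = doc - base
    return float((ft - base) @ d / (d @ d))
```

Note how `projection_shift` inherits the property the changelog entry claims: an empty-generation fine-tune that reproduces the base vector yields a numerator of zero, so no spurious "+0.82 shift" can survive.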

## 0.1.0.dev0 — 2026-04-20

Initial pre-alpha. Full 11-primitive battery shipped.

### Primitives

- **Adherence**
  - `delta_kl` — mean JS/KL divergence between base and fine-tuned next-token distributions
  - `adapter_revert` — reversion under adversarial paraphrase (needs `sway-eval[semsim]`)
  - `prompt_collapse` — exponential-decay fit of divergence over context length
- **Attribution**
  - `section_internalization` *(flagship)* — per-section `effective_sis` with leak check
  - `paraphrase_invariance` — memorization vs. generalization, intent-aware
  - `preference_flip` — DPO/ORPO chosen/rejected margin inversion
- **Calibration**
  - `style_fingerprint` — 6-dim numpy-only stylistic shift vs. document
  - `calibration_drift` — general-knowledge regression on a packaged 30-item pack
  - `leakage` — greedy LCS recall + perturbation fragility
- **Ablation**
  - `adapter_ablation` *(signature primitive)* — λ-scaled divergence curve with linearity, saturation, overshoot metrics
- **Baseline**
  - `null_adapter` — per-kind null distribution matrix for z-score calibration (wired end-to-end in Unreleased)
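The "greedy LCS recall" half of the `leakage` primitive can be sketched with stdlib `difflib`: repeatedly extract the longest common token run between the training document and a generation, stop below a minimum match length, and report the covered fraction of the document. The function name, the token-level granularity, and the `min_match=5` threshold here are illustrative assumptions, not the shipped probe.

```python
from difflib import SequenceMatcher


def greedy_lcs_recall(doc: str, gen: str, min_match: int = 5) -> float:
    """Fraction of `doc` tokens covered by greedily-extracted longest
    common token runs shared with `gen` (sketch of a leakage recall)."""
    doc_toks, gen_toks = doc.split(), gen.split()
    if not doc_toks:
        return 0.0
    covered = 0
    while True:
        m = SequenceMatcher(None, doc_toks, gen_toks, autojunk=False).find_longest_match(
            0, len(doc_toks), 0, len(gen_toks)
        )
        if m.size < min_match:
            break  # remaining overlaps are too short to count as leakage
        covered += m.size
        # Remove the matched spans so each region is counted at most once.
        doc_toks = doc_toks[: m.a] + doc_toks[m.a + m.size :]
        gen_toks = gen_toks[: m.b] + gen_toks[m.b + m.size :]
    return covered / len(doc.split())
```

A verbatim regurgitation scores 1.0, unrelated text scores 0.0, and partial quotation lands in between, which is the shape a recall metric for memorization needs.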

### Infrastructure

- `DifferentialBackend` + `ScalableDifferentialBackend` protocols
- HuggingFace + PEFT backend with `disable_adapter` / `set_adapter` toggling and LoRA-scale mutation
- Dummy backend for unit tests (canned responses + linear-blend scalable mode)
- YAML spec loader, composite score (four-category weighted), rich terminal + JSON + JUnit + Markdown reports
- Typer CLI: `run`, `gate`, `check`, `diff`, `autogen`, `doctor`, `report`
- `.dlm` bridge (`dlm-sway[dlm]`): resolver + full-battery autogen
- Matplotlib visualizations (`dlm-sway[viz]`): SIS bar chart, ablation curve, KL histogram
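The "four-category weighted" composite mentioned above can be sketched as: mean the probe scores within each category (Adherence, Attribution, Calibration, Ablation), then take a weighted mean across the categories that actually ran. The equal weights and the `composite_score` name below are assumptions for illustration; the shipped weighting is not documented in this entry.

```python
# Assumed equal category weights — the real values may differ.
CATEGORY_WEIGHTS = {
    "adherence": 0.25,
    "attribution": 0.25,
    "calibration": 0.25,
    "ablation": 0.25,
}


def composite_score(per_probe: dict[str, tuple[str, float]]) -> float:
    """Fold per-probe (category, score) pairs into one number:
    mean within each category, then a weighted mean across the
    categories present (weights renormalized over those categories)."""
    by_cat: dict[str, list[float]] = {}
    for category, score in per_probe.values():
        by_cat.setdefault(category, []).append(score)
    total_w = sum(CATEGORY_WEIGHTS[c] for c in by_cat)
    return sum(
        CATEGORY_WEIGHTS[c] * (sum(v) / len(v)) for c, v in by_cat.items()
    ) / total_w
```

Renormalizing over the categories present keeps a spec that only configures, say, Adherence probes from being silently dragged toward zero by absent categories.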

### Known gaps

- MLX backend loads **two** model copies in memory (base + adapter-fused) because `mlx_lm` has no runtime adapter toggle. Memory footprint is ~2x the HF path; fine for the small (<3B) models MLX typically runs.
- MLX requires a pre-converted `.npz` adapter — raw PEFT safetensors are rejected by `mlx_lm.load`. A PEFT→MLX converter is a future milestone.
- PyPI publication of the `dlm-sway` wheel is pending a clean CI release workflow.