
Changelog

Unreleased

Sprint 21 — Audit 03 closure

Closes the Audit 03 short list — 2 🟠 major + 5 🟡 minor findings the audit recommended for v0.1.0 readiness. Full audit at .docs/audits/03-final-audit.md. Verdict was YELLOW-leaning-GREEN with zero critical findings; this sprint pushes it to GREEN on the in-scope items. Deferred: F01 (MLX converter — its own feature sprint, S24), F05 (resolves naturally at v0.1.0 tag), F06 (dlm API compat test — lands with v0.1.0 release in S22). v0.1.0 PyPI publish is deferred to S22 (requires release coordination + credentials).

🟠 Major — degenerate z-score clipping (F02).

  • probes/null_adapter — null stats now carry an explicit degenerate: 1.0 field when the calibration ran but produced an unusable baseline (runs: 1, or every seed producing the exact same raw). The std floor at 1e-6 is preserved for valid-but-tight multi-seed nulls so they still calibrate. Audit observed a +290,766σ leakage-probe z under runs: 1; post-fix, z_score refuses and the probe falls back to fixed thresholds with a clear footer rollup.
  • probes/_zscore.z_score — refuses when stats["degenerate"] is truthy, independent of the std check. Belt-and-suspenders on top of the existing MIN_STD guard.
  • suite/report.collect_degenerate_null_kinds — new rollup surface. Terminal + markdown footers gain a "N probe kind(s) had a degenerate null baseline — bump runs:" block, distinct from the existing null-opt-outs rollup.
  • 12 new unit tests across test_null_calibration.py, test_zscore_helpers.py, test_report_extras_rollup.py, and test_probe_external_perplexity.py. Existing test_std_floor_prevents_runaway_zscore was rewritten to assert the fixed behavior (the pre-fix contract WAS the audit's bug).
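
The F02 guard can be pictured with a minimal sketch. It is not the shipped code: `summarize_null` and the exact shape of the stats mapping are assumptions made for illustration; only the `degenerate` field, the `1e-6` floor, and the "refuse instead of clipping" behavior come from the entry above.

```python
import statistics
from typing import Mapping, Optional, Sequence

MIN_STD = 1e-6  # std floor quoted in the entry above


def summarize_null(raws: Sequence[float]) -> dict:
    """Hypothetical helper: build null stats from per-seed raw values (non-empty)."""
    # Unusable baseline: a single run, or every seed producing the same raw.
    degenerate = len(raws) < 2 or len(set(raws)) == 1
    std = statistics.pstdev(raws) if len(raws) > 1 else 0.0
    return {
        "mean": statistics.fmean(raws),
        "std": max(std, MIN_STD),  # floor keeps tight-but-valid nulls usable
        "degenerate": 1.0 if degenerate else 0.0,
    }


def z_score(raw: float, stats: Mapping[str, float]) -> Optional[float]:
    """Return a z-score, or None so the caller falls back to fixed thresholds."""
    if stats.get("degenerate"):  # checked independently of the std guard
        return None
    if stats["std"] < MIN_STD:
        return None
    return (raw - stats["mean"]) / stats["std"]
```

With `runs: 1` the sketch returns `None` rather than dividing by the floor, which is the division that produced the six-digit σ value the audit observed.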

🟡 Minor — CI resilience (F03).

  • pytest-timeout>=2.3 + tenacity>=9.0 in dev deps.
  • tests/fixtures/tiny_model.py wraps snapshot_download with exponential-backoff tenacity retry (3 attempts, 5-10-20s backoff) + etag_timeout=10 to bound per-file head probes. Benefits every slow+online test.
  • tests/integration/test_determinism_golden.py gains @pytest.mark.timeout(600). A silent network hang now surfaces as a test failure with actionable output (audit observed 20m workflow-timeout hang on Sprint 19 merge run 24747915467).
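
As a rough illustration of the retry shape, here is a standard-library stand-in for what the fixture delegates to tenacity. `retry_with_backoff` and its parameters are hypothetical names, and the real fixture's exception filter and wait schedule may differ.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    *,
    attempts: int = 3,
    base_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on OSError with delays of base_delay * 2**n seconds."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return fn()
        except OSError:  # includes ConnectionError and TimeoutError
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2**attempt)
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter lets a test record the waits instead of actually sleeping.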

🟡 Minor — outlier-miner pool guard (F04).

  • mining/outlier_miner.mine_outliers — raises SwayError when the pool has fewer than 2·top_k distinct scored prompts, with an actionable --top-k N hint. Pre-fix: 1-distinct-prompt pool produced top=[p], bottom=[p] (identical lists — no outlier contrast). Guard applies AFTER scoring so unsupported probe kinds still return the empty-result path.
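
A minimal sketch of the guard described above, under assumed names: `split_outliers` and the local `SwayError` class are stand-ins, and the wording of the real error message is not known from this entry.

```python
class SwayError(Exception):
    """Stand-in for the project's error type."""


def split_outliers(scored: dict, top_k: int):
    """Return (top, bottom) prompt lists from a {prompt: divergence} mapping."""
    if not scored:
        # Empty-result path (for example, an unsupported probe kind).
        return [], []
    distinct = len(scored)  # dict keys are already distinct prompts
    if distinct < 2 * top_k:
        raise SwayError(
            f"pool has {distinct} distinct scored prompt(s); "
            f"need at least {2 * top_k}. Try --top-k {distinct // 2} or a larger pool."
        )
    ranked = sorted(scored, key=scored.get, reverse=True)
    return ranked[:top_k], ranked[-top_k:]
```

Requiring `2 * top_k` distinct prompts guarantees the top and bottom lists never share an entry.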

🟡 Minor — autogen skipped-probes comment (F07).

  • integrations/dlm/autogen.collect_skipped_probe_reasons — new public helper returning (probe_kind, reason) tuples for probes _build_suite intentionally omitted for the given handle.
  • _render_annotated_yaml — when skipped is non-empty, header gains a # skipped: <kind> (<reason>) block. Users no longer have to diff the autogen source to understand which probes are missing from their generated sway.yaml.

🟡 Minor — CI Node.js 20 deprecation silence (F08).

  • .github/workflows/ci.yml sets FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: "true" at the workflow level. Silences the Node 20 deprecation warnings each CI log currently carries (deadline June 2026). Temporary until every action we pull bumps to Node 24 natively.

🟡 Minor — portable dlm_source (F09).

  • integrations/dlm/autogen._portable_dlm_source emits a cwd-relative path when the .dlm lives inside the cwd, absolute otherwise. The cwd-relative form survives cross-machine checkout (CI agents, other devs); the old absolute-path emission broke the moment a committed autogen'd YAML was run on a different host.
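
The path rule can be sketched with `pathlib`. `portable_source` is a hypothetical name; the shipped helper may normalize differently.

```python
from pathlib import Path


def portable_source(dlm_path: Path, cwd: Path) -> str:
    """Return a cwd-relative path when dlm_path is inside cwd, else an absolute one."""
    resolved = dlm_path.resolve()
    try:
        # relative_to raises ValueError when resolved is not under cwd.
        return resolved.relative_to(cwd.resolve()).as_posix()
    except ValueError:
        return str(resolved)
```

For a file at `<cwd>/docs/a.dlm` this yields `docs/a.dlm`, which stays valid after a checkout on another host.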

Other.

  • 640 unit tests pass (up from 626 before S21 — 14 new regression tests across F02/F03/F04/F07/F09).
  • Pragmatic deviations from original scope: S21 was audit-scoped to ~2 days of cleanup. Actual depth: 6 findings closed with 14 new regression tests. One existing test (test_std_floor_prevents_runaway_zscore) explicitly rewritten to flip its assertion — the pre-fix contract the test encoded WAS the audit's reported bug.

Sprint 19 — Pre-commit hook sway gate

Closes Audit 01 stretch-list F-item "pre-commit hook sway gate." Ships a .pre-commit-hooks.yaml declaring two hook variants so pre-commit.com users can gate their adapter on every commit that touches a spec, .dlm file, or adapter directory.

  • Two hooks, pick your posture.
    • sway-gate — language: system. Uses the sway install on the user's PATH. Fast, zero install cost. Recommended default.
    • sway-gate-isolated — language: python + git-based additional_dependencies (dlm-sway[hf] @ git+...@<SHA>). Self-contained — pre-commit builds a fresh venv and installs sway + torch + transformers on first run (~5 GB, ~2 min).
  • Rev pinning stance. Example config pins to a commit SHA, not HEAD. Sway is pre-v0.1.0 — no tagged release yet — and HEAD drifts silently on every pre-commit autoupdate. SHA pinning is the honest pre-release pattern; migration to rev: v0.1.0 is a README edit once the first release lands.
  • pass_filenames: false on both variants. sway gate takes exactly one spec path; the user declares it via args:. The files: regex decides when the hook fires. No ambiguity.
  • Scope discipline. Neither hook surfaces --json or --markdown flags. The hook gates (non-zero on FAIL) and nothing else. Users wanting report artifacts run sway run separately.
  • Readme "Pre-commit" section documents the consumer-side config for both variants, the SHA-pinning rationale, and the one-time install cost of the isolated variant.
  • examples/precommit-example/ — template sway.yaml, consumer-side .pre-commit-config.yaml, and a README walk-through.
  • tests/integration/test_pre_commit_hook.py (slow + online) — three cases: (a) parse-and-shape smoke test on .pre-commit-hooks.yaml; (b) pass-case runtime — spawns pre-commit run sway-gate as a subprocess against a tmp git repo with a real LoRA on SmolLM2-135M, asserts exit 0; (c) fail-case — same fixture with an impossible assert_mean_gte, asserts non-zero exit + gate FAILED banner.
  • pre-commit>=3.8 in [dependency-groups].dev so the integration test can invoke it as a subprocess. Keeps the tool out of the user's runtime deps.

Sprint 18 — Cross-platform determinism golden

Closes Audit 01 stretch-list F-item "cross-platform determinism golden test." The README's "deterministic on CPU where possible" claim held in theory; now it's pinned by CI on two platforms.

  • New module src/dlm_sway/core/golden.py. Tolerance-aware JSON comparator with compare_goldens(actual, expected, *, logprob_tol=1e-6, score_tol=1e-4) and mask_variable_fields for stripping timestamps, wall-seconds, duration_s, backend_stats, sway_version, and cwd-resolved path identifiers (adapter_id, base_model_id) before comparison. No torch dep; runs in the fast lane.
  • New integration test tests/integration/test_determinism_golden.py (slow + online). Builds a deterministically-seeded LoRA on SmolLM2-135M, runs a minimal 2-probe suite (delta_kl + calibration_drift), and diffs the JSON output against tests/golden/expected_<platform>.json. SWAY_UPDATE_GOLDENS=1 toggles regen mode; missing golden → SKIP with a regen recipe.
  • New CI matrix determinism-golden with strategy.matrix.os: [ubuntu-latest, macos-latest] in .github/workflows/ci.yml. Triggered by changes under tests/golden/**, src/dlm_sway/core/golden.py, or the test file itself (plus the standard schedule/dispatch/push triggers).
  • workflow_dispatch regen mode — dispatching the CI workflow with regenerate_goldens=true flips SWAY_UPDATE_GOLDENS=1 on both matrix legs and uploads the regenerated JSONs as per-platform artifacts. Meant for deliberate "yes I changed the algorithm" flows.
  • Tolerance rationale (documented in core/golden.py): 1e-6 for logprob-like numeric fields sits above typical BLAS-implementation drift (1e-8 to 1e-7 band between OpenBLAS on linux and Accelerate on darwin) but below real algorithm-change drift. 1e-4 for score fields absorbs composition noise.
  • 25 new unit tests for the comparator covering mask coverage, tolerance thresholds, structural diffs (missing keys, length mismatches, type mismatches), NaN/inf edge cases, and a realistic two-masked-payloads round-trip.
  • Pragmatic scope deviation: the sprint envisioned a checked-in 5 MB adapter binary. Instead the test builds it from a fixed seed at runtime (same pattern as test_external_perplexity_e2e and test_cluster_kl_e2e). Avoids the regeneration chore noted in the sprint's risks section.
  • Linux golden bootstrap: first PR ships expected_darwin.json only; the linux leg SKIPs with a recipe pointing at the workflow_dispatch regen mode. Maintainer dispatches, downloads the artifact, commits expected_linux.json, and the next CI run asserts cleanly on both platforms. One-time onboarding cost.
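
To make the tolerance-aware comparison concrete, here is a small sketch. It is not `core/golden.py`: `mask_fields`, `diff_payloads`, the masked key names, and the rule "keys containing `score` use `score_tol`, every other numeric field uses `logprob_tol`" are all assumptions for illustration.

```python
import math
from typing import Any

# Assumed key names; the entry lists the masked fields only by description.
MASKED_KEYS = {
    "timestamp", "wall_seconds", "duration_s", "backend_stats",
    "sway_version", "adapter_id", "base_model_id",
}


def mask_fields(payload: Any) -> Any:
    """Recursively drop run-specific keys before comparison."""
    if isinstance(payload, dict):
        return {k: mask_fields(v) for k, v in payload.items() if k not in MASKED_KEYS}
    if isinstance(payload, list):
        return [mask_fields(v) for v in payload]
    return payload


def _is_number(value: Any) -> bool:
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def diff_payloads(actual, expected, *, logprob_tol=1e-6, score_tol=1e-4,
                  path="$", tol=0.0) -> list:
    """Return human-readable differences; an empty list means a match."""
    tols = {"logprob_tol": logprob_tol, "score_tol": score_tol}
    if isinstance(actual, dict) and isinstance(expected, dict):
        diffs = [f"{path}.{k}: key present on one side only"
                 for k in sorted(actual.keys() ^ expected.keys())]
        for key in sorted(actual.keys() & expected.keys()):
            key_tol = score_tol if "score" in key else logprob_tol
            diffs += diff_payloads(actual[key], expected[key],
                                   path=f"{path}.{key}", tol=key_tol, **tols)
        return diffs
    if isinstance(actual, list) and isinstance(expected, list):
        if len(actual) != len(expected):
            return [f"{path}: length {len(actual)} != {len(expected)}"]
        diffs = []
        for i, (a, e) in enumerate(zip(actual, expected)):
            diffs += diff_payloads(a, e, path=f"{path}[{i}]", tol=tol, **tols)
        return diffs
    if _is_number(actual) and _is_number(expected):
        both_nan = math.isnan(actual) and math.isnan(expected)
        # Equality check first so matching infinities pass without subtraction.
        if both_nan or actual == expected or abs(actual - expected) <= tol:
            return []
        return [f"{path}: {actual} != {expected} (tol {tol})"]
    return [] if actual == expected else [f"{path}: {actual!r} != {expected!r}"]
```

Returning a list of paths rather than a boolean keeps a failing golden run actionable, since the report names the drifting field.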

Sprint 17 — Adversarial paraphrase mining + outlier-prompt miner

Closes Audit 01 innovation item F11. Adds sway mine — an evaluation companion to paraphrase_invariance that surfaces the paraphrases a memorizing adapter most reliably fails on, plus a secondary mode that ranks delta_kl prompts by per-prompt divergence.

sway mine --mode paraphrase — per paraphrase_invariance case:

  1. Generate candidate paraphrases via nlpaug SynonymAug (WordNet, deterministic under a fixed seed; back-translation is a user-supplied escape hatch to keep the default fast and offline).
  2. Embed candidates through the shared MiniLM cache (same 80 MB load adapter_revert pulls) and greedy farthest-first select the top-K most pairwise-distant ones — dodges nlpaug's tendency to emit near-duplicate synonym swaps.
  3. Rank by per-token lift gap: `(ft(prompt, gold) - base(prompt, gold)) - (ft(candidate, gold) - base(candidate, gold))`. Large positive gap = the candidate breaks the adapter's lift = the adapter doesn't generalize there.
  4. Emit sway-mined-paraphrase.yaml with a mined_cases: block paste-compatible with paraphrase_invariance.cases.
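
Steps 2 and 3 can be sketched in plain Python. This is a toy version under stated assumptions: embeddings arrive as precomputed vectors (the real miner uses MiniLM), the selection is seeded from the first candidate, and the function names are hypothetical.

```python
import math
from typing import Sequence


def farthest_first(vectors: Sequence[Sequence[float]], k: int) -> list:
    """Greedy farthest-first selection; returns indices of up to k vectors."""
    if not vectors or k <= 0:
        return []
    chosen = [0]  # assumption: seed the selection with the first candidate
    while len(chosen) < min(k, len(vectors)):
        remaining = [i for i in range(len(vectors)) if i not in chosen]
        # Pick the candidate whose nearest already-chosen neighbor is farthest.
        best = max(
            remaining,
            key=lambda i: min(math.dist(vectors[i], vectors[c]) for c in chosen),
        )
        chosen.append(best)
    return chosen


def lift_gap(ft_prompt: float, base_prompt: float,
             ft_cand: float, base_cand: float) -> float:
    """Lift on the original prompt minus lift on the candidate paraphrase."""
    return (ft_prompt - base_prompt) - (ft_cand - base_cand)
```

Sorting the selected candidates by `lift_gap` in descending order puts the paraphrases that most reduce the adapter's lift first.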

sway mine --mode outliers — rank a prompt pool by per-prompt delta_kl.raw. Pool defaults to the spec's own delta_kl prompts; --from-corpus public_domain_en draws from S09's CC0 corpus instead. Emits top-K and bottom-K blocks: the highest-divergence prompts (best for tightening a gate) and lowest-divergence (candidates to drop from the suite). leakage and paraphrase_invariance have section/case-based specs; outlier mining on those is future work, documented inline in mining/outlier_miner.py.

  • New package src/dlm_sway/mining/ — two modules, paraphrase_miner.py and outlier_miner.py, plus a thin corpus_prompts helper for the --from-corpus path.
  • New CLI subcommand sway mine SPEC --mode paraphrase|outliers [--out FILE] [--top-k N] [--n-candidates N] [--from-corpus NAME] [--seed N]. Paste-compatible YAML emission by default; --out overrides the filename.
  • 18 new unit tests — 7 for the paraphrase miner (ranker, diversity filter, dedup, input validation), 8 for the outlier miner (delta_kl ranking, corpus wiring, unsupported-kind skip), 3 for the CLI (paraphrase + outliers modes + no-prompts error path).
  • Prove-the-value test at tests/unit/test_paraphrase_miner_prove_value.py. On a deliberately-memorizing dummy backend the hand-written paraphrase list passes with generalization_ratio > 0.5; substituting the mined list (same seed, same adapter) drops the ratio below 0.5 and flips the verdict to FAIL, with a ≥ 0.3 ratio gap.
  • Dependency footprint: no new extras. The paraphrase miner uses nlpaug (already in [style]) + sentence-transformers (in [semsim]); the diversity filter reuses the embedder adapter_revert already pulls. Graceful BackendNotAvailableError with a pip hint when the extras aren't installed.

Sprint 20 — Audit 02 closure

Closes the full finding inventory from .docs/audits/02-followup-audit.md (1 🔴 critical, 8 🟠 major, 11 🟡 minor, 4 stronger-test opportunities, 3 dead-code items). Sixteen sprints of innovation work had accumulated since Audit 01; the re-audit surfaced one ship-blocker (S14's bootstrap CI dropped at the runner boundary) plus a grab-bag of claim-vs-code gaps, dead branches, and untested contracts.

🔴 Critical — ship-blocker.

  • F01 — suite/runner.py:_with_duration now forwards every ProbeResult field. The pre-fix version silently dropped ci_95 on every probe, making S14's headline deliverable runtime-inert in any real sway run. Regression test at the runner boundary + snapshot fixture carrying a populated ci_95 pin the fix.

🟠 Major — claim-vs-code gaps.

  • F02 — real sklearn.cluster.KMeans exercised by two new unit tests (separation + seed determinism) + a slow+online integration test at tests/integration/test_cluster_kl_e2e.py. Before: every cluster_kl test monkeypatched _kmeans_cluster with an argmax stub, leaving S16's reason-for-being unverified.
  • F03 — sway list-probes falls back to the defining module's __doc__ when a probe class has no docstring. Every one of the 13 shipped probes now renders a summary row.
  • F04 — sway doctor probes plotly (the load-bearing [viz] dep), sklearn (S16 cluster_kl), and the api extras (httpx, tenacity).
  • F05 — README primitives table reflects 13 probes (adherence + cluster_kl, calibration + external_perplexity, baseline row for null_adapter).
  • F06 — TwoModelDifferential composes safe_for_concurrent_views from its two inner backends. Before: the attribute was missing and the runner defaulted to False even when both inners set True, breaking S13's concurrency claim at the only supported differential-use pattern.
  • F07 — integrations/dlm/autogen emits cluster_kl when the prompt pool clears 20 entries; markdown report gains a per-probe "Cluster breakdown" section with per-cluster mean KL + exemplars.
  • F08 — backends/dummy._NullView RNG uses hashlib.md5 instead of Python's PYTHONHASHSEED-salted hash(). Cross-process determinism test at tests/unit/test_cross_process_determinism.py pins the fix.
  • F09 — trace writer ↔ analyzer round-trip test at tests/unit/test_runner_backend_stats.py: runs a suite with trace_path=, loads the file back, asserts probe labels + hit/miss counts match backend_stats.

🟡 Minor — dead code, doc drift, latent brittleness.

  • F10 — removed dead _ft_view branch in probes/prompt_collapse.py.
  • F11 — pytest plugin resolves spec paths against config.rootpath, not process cwd.
  • F12 — backends/api._post_completions delegates to tenacity's Retrying() callable; removes the unreachable post-loop fallback.
  • F13 — preference_flip WARN branch emits ci_95 and z_by_rank for consistency with every other numeric probe's WARN shape.
  • F14 — adapter_ablation module docstring documents why ci_95 renders as em-dash (it's a curve-fit, not a sample-mean aggregator).
  • F15 — report footer surfaces null_adapter opt-outs (probes with calibrate_spec=None) in both terminal and markdown surfaces.
  • F16 — report category ordering derives from DEFAULT_COMPONENT_WEIGHTS and accepts unknown categories from custom Probe subclasses.
  • F17 — cluster_kl degenerate zero-variance case now returns Verdict.WARN with evidence["degenerate_zero_variance"]=True and suppresses z-score to avoid a spurious small-sample-noise calibration.
  • F18 — _calibration_pack.py gains per-section provenance notes (F18 audit trail) without noisy per-item annotations.
  • F19 — pytest plugin defers dlm_sway.core.result / dlm_sway.core.errors imports to call sites so non-`@pytest.mark.sway` users don't pay the load tax.
  • F20 — sway check --help documents the σ-banner's null-calibration dependency (fall-through to composite score band when null SKIPs).

Stronger-test opportunities — regression tests for things we weren't pinning.

  • #9 — probes/_divergence rejects effectively-uniform TokenDists (spread < 1e-9) via _check_non_degenerate_token_dist. Catches a shape-broken lm_head that would otherwise compute a trivial constant divergence across prompts. Two compatibility touchups rode with the change: dummy backend's synthesized ft dist and cluster_kl test fixture _dist_broad gained a tiny monotonic perturbation to clear the guard (real models never produce bit-uniform logits thanks to fp32 accumulation noise).
  • #10 — cross-verdict consistency test (tests/unit/test_cross_verdict_consistency.py): runs the same spec through sway run, sway gate, and sway report --format junit and asserts identical per-verdict tallies.
  • #11 — sway doctor --json schema-shape snapshot test locks the top-level keys and the per-extra module-name set.
  • #12 — subprocess determinism test asserts _NullView.next_token_dist is byte-identical across PYTHONHASHSEED values; pins F08's fix.

Dead-code inventory — DC3–DC5.

  • DC3 — null_adapter.py's pre-S10 cache-promotion branch is annotated as legacy; removal queued for the next minor.
  • DC4 — dropped the stderr warning at suite/runner.py that fired whenever concurrent_probes > 1 even though execution never fanned out. The spec field is still accepted (future pool will light it up); the noisy warning is gone.
  • DC5 — tests/unit/test_model.py grows coverage of ModelSpec's dtype enum, endpoint, trust_remote_code, and custom/api kind branches.

Final state: 582 unit tests + 3 integration tests passing; mypy strict clean across 55 source files; ruff + format clean across 137 files. Sprint 20 is a single PR composed of per-finding, per-file commits so the history reads as a fix-by-fix cleanup.

Sprint 16 — Cluster-coherent KL probe

Closes Audit 01 innovation item F8. Adds a new cluster_kl probe that answers the question delta_kl can't: the mean divergence may be the same, but did the adapter shift the right topics, or is it a uniform blunt-instrument shift?

  • New probe cluster_kl (category: adherence). Embeds prompts via shared MiniLM (same cache key as adapter_revert), k-means clusters them at a fixed seed, measures per-prompt JS divergence between base and ft, and reports a specificity ratio: between_variance / (between_variance + within_variance). Range [0, 1]: ≈ 0.5 on a blunt adapter that shifts every topic by the same amount, → 1.0 on a topic-targeted adapter. Pair (mean_kl, specificity) tells a more honest story than either number alone.
  • Bootstrap CI on specificity. Resamples (divergence, cluster_label) pairs with replacement and takes the 2.5/97.5 percentiles of the bootstrap ratio distribution. Returns None below 4 prompts — matches the convention core.stats.bootstrap_ci uses.
  • Z-score calibration against null_adapter baseline via the existing _zscore helpers. calibrate_spec synthesizes 8 mixed-topic sentinel prompts + k=2 for the null pass; the specificity distribution concentrates around 0.5 there, so a real adapter's separation above that baseline reads as a clean z-score.
  • Degenerate-input policy. < min_prompts (default 20) → SKIP with a clear message; num_clusters * 2 > num_prompts → SKIP (per-cluster mean not well-resolved); empty prompt list → ERROR. Missing [semsim] extras → SKIP with a pip install 'dlm-sway[semsim]' hint.
  • Zero-variance fallback. If every prompt produced the same divergence (canned stub data, no adapter motion), the specificity ratio is mathematically undefined; we return 0.5 — the null-adapter expectation — so downstream z-score reports "no signal" instead of NaN.
  • New dependency. scikit-learn>=1.4 added to the [semsim] extra (riding the 80 MB MiniLM load that probe already pulls in). Mirrored to [all] and to mypy's stubless-override list.
  • 7 unit tests in tests/unit/test_probe_cluster_kl.py: two-topic adapter → high specificity, uniform adapter → 0.5 fallback, too-few prompts → SKIP, empty prompts → ERROR, num_clusters > prompts/2 → SKIP, CI bracketing, missing-extras SKIP path.
  • Prove-the-value test at tests/unit/test_cluster_kl_prove_value.py. Two backends with comparable delta_kl (ratio < 3×) have specificity scores that split by at least 0.3 — concrete evidence that cluster_kl surfaces a structural distinction delta_kl merges.
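
A sketch of the specificity ratio, with one loudly stated assumption: the entry does not say how the two variances are normalized, so this version uses ANOVA-style mean squares (between divided by k-1, within divided by N-k), a choice under which unstructured noise lands near 0.5. The function name and the small-input fallback are also illustrative.

```python
from collections import defaultdict
from statistics import fmean
from typing import Hashable, Sequence


def specificity(divergences: Sequence[float], labels: Sequence[Hashable]) -> float:
    """between / (between + within), with 0.5 as the undefined-ratio fallback."""
    groups = defaultdict(list)
    for value, label in zip(divergences, labels):
        groups[label].append(value)
    n, k = len(divergences), len(groups)
    if k < 2 or n <= k:
        return 0.5  # assumption: too little data to resolve either variance
    grand_mean = fmean(divergences)
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups.values())
    ss_within = sum((v - fmean(g)) ** 2 for g in groups.values() for v in g)
    between = ss_between / (k - 1)  # mean square between clusters
    within = ss_within / (n - k)    # mean square within clusters
    total = between + within
    if total <= 1e-18:
        return 0.5  # zero-variance fallback described in the entry above
    return between / total
```

A topic-targeted pattern such as divergences `[0.1, 0.1, 0.9, 0.9]` with labels `[0, 0, 1, 1]` has no within-cluster spread, so the ratio reaches 1.0.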

Sprint 15 — pytest plugin (@pytest.mark.sway)

Closes Audit 01 innovation item F10. Packages sway as a pytest library so teams already running pytest adopt sway with a single decorator instead of a subprocess wrapper.

  • New plugin (src/dlm_sway/pytest_plugin.py) auto-loaded via the pytest11 entry point after pip install 'dlm-sway[pytest]'. @pytest.mark.sway(spec="...", threshold=0.0, weights=None) expands a single pytest function into one test item per probe in the referenced spec + an optional __gate__ item that fires only when the composite score drops below threshold.
  • Verdict translation. FAIL / ERROR → pytest Failed, SKIP → pytest Skipped, WARN → pytest warning, PASS → pytest pass. Probe-level failures isolate: a failing adherence probe doesn't mask a failing calibration one, and pytest -k adherence runs just that probe.
  • Suite runs once per decorated function. A session-scoped _SuiteCache keyed on (spec_path, weights) ensures the N-way item expansion doesn't multiply backend wall time. Two @pytest.mark.sway tests against the same spec share one run.
  • Malformed marks fail cleanly. Missing spec, non-numeric threshold, non-dict weights, and unknown kwargs produce a synthetic _ConfigErrorItem with a green-field pytest failure line — no cryptic collection-time tracebacks.
  • New [pytest] extra — just pytest>=8.0. Matches the pattern other plugin-style extras use. Also registers the pytest11 entry point so the plugin is discovered automatically on install, consistent with pytest-cov / pytest-xdist.
  • Example directory at examples/pytest_integration/ with a minimal sway.yaml + test_sway_gate.py showing the decorator replacing a legacy subprocess.run(["sway", "gate", ...]) wrapper. README gains a "Pytest integration" section pointing at it.
  • 13 new unit tests via pytest's canonical pytester fixture: marker registration, N-item expansion, FAIL / SKIP / ERROR routing, gate below-threshold failure, gate above-threshold pass, gate absence when threshold=0, malformed-mark error paths, cache sharing across multiple decorated tests.
  • Drive-by fix for the S14 CI-narrowing test flake. The dummy backend's per-prompt noise was seeded via Python's hash(), which is salted per-process via PYTHONHASHSEED. Swapped to a stable hashlib.md5-derived seed so the narrowing invariant holds deterministically across test-order permutations.

Sprint 14 — Bootstrap CIs + forward-pass trace CLI

Closes Audit 01 innovation items F9 (bootstrap confidence intervals on raw metrics) + F12 (polished forward-pass trace CLI). Paired sharpenings of existing probe output and existing trace infrastructure.

  • New core/stats.bootstrap_ci — percentile-bootstrap 95% CI on any sequence of per-sample measurements. Numpy-only, 1000-resample default, seeded from ctx.seed so intervals are reproducible. Short-circuits on non-finite / empty / degenerate-constant inputs; vectorized resample keeps the overhead ~1 ms per probe.
  • ProbeResult.ci_95: tuple[float, float] | None — new field, threaded through safe_finalize (nulled if raw is nulled, so a CI never brackets a defensive-null point estimate). to_json / from_json persist the pair as a two-list.
  • Six aggregating probes emit ci_95: delta_kl (over per-prompt divergences), calibration_drift (over per-item regression indicators), external_perplexity (over per-chunk deltas), leakage (over clean-recall rates), paraphrase_invariance (over verbatim lifts), section_internalization (over effective SIS scores). Each also lands the interval under evidence["raw_ci_95"] for JSON consumers.
  • Report tables add a ci95 column between raw and z in both terminal and markdown output. format_ci renders as [lo, hi] with em-dash for missing/non-finite. Markdown snapshot refreshed; JSON snapshot refreshed for the new per-probe ci_95 field.
  • New sway trace <jsonl> CLI. Reads the forward-pass trace JSONL the S07 runner writes when --trace <path> is set, aggregates into per-probe and per-view wall-time + hit-rate tables, plus a top-N slowest-events table. --format terminal|md|json, --slowest K to tune the tail size. The parsing tolerates old trace shapes (missing probe / hit fields) and skips malformed lines.
  • suite/trace_analysis — typed dataclasses (TraceEvent, ProbeSummary, ViewSummary, TraceReport) + three renderers (terminal / markdown / json) matching the conventions of suite/report and suite/compare.
  • Committed trace fixture at tests/fixtures/trace_sample.jsonl (8 events, 2 probes × 4 prompts with cache hits). Drives 22 new trace-analysis + CLI tests.
  • Prove-the-value (F9): on a dummy backend with per-prompt variation, delta_kl's CI narrows from [0.33, 0.41] (width 0.079) at N=4 prompts to [0.38, 0.42] (width 0.044) at N=32 — 1.8× tighter in width, tracking the √(N₂/N₁) ≈ 2.8× theoretical scaling minus dummy-backend dispersion.
  • Prove-the-value (F12): the CLI's terminal output on the committed fixture correctly identifies sis/base (520.3 ms) as the single slowest event, sis (1,529 ms) as the slower probe over dk (529 ms), and ft (1,357 ms) as the slower view over base (701 ms).
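
For readers unfamiliar with the percentile bootstrap, a standard-library sketch follows. The shipped `core/stats.bootstrap_ci` is described as numpy-based and vectorized; this stand-in uses `random` instead, its signature is an assumption, and the four-sample minimum is borrowed from the convention mentioned in the Sprint 16 notes.

```python
import math
import random
from statistics import fmean
from typing import Optional, Sequence, Tuple


def bootstrap_ci(
    samples: Sequence[float],
    *,
    n_resamples: int = 1000,
    seed: int = 0,
    alpha: float = 0.05,
) -> Optional[Tuple[float, float]]:
    """Percentile-bootstrap CI on the mean; None for unusable input."""
    xs = [float(x) for x in samples]
    if len(xs) < 4:
        return None  # too few samples for a meaningful interval
    if any(not math.isfinite(x) for x in xs) or min(xs) == max(xs):
        return None  # non-finite or constant input
    rng = random.Random(seed)  # seeded so intervals are reproducible
    means = sorted(fmean(rng.choices(xs, k=len(xs))) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * (n_resamples - 1))]
    hi = means[int((1 - alpha / 2) * (n_resamples - 1))]
    return lo, hi
```

The √(N₂/N₁) figure quoted above is `math.sqrt(32 / 4)`, about 2.83, which is the ideal narrowing for independent, identically distributed samples.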

Sprint 13 — OpenAI-compatible HTTP scoring backend

Closes Audit 01 innovation item F7. Unlocks sway against hosted fine-tunes — OpenAI platform, vllm serve, Ollama — without requiring a local torch + PEFT load.

  • New backend ApiScoringBackend (backends/api.py): scores against a /v1/completions endpoint via httpx. Implements ScoringBackend only (not DifferentialBackend) since one endpoint is one model; users compose two ApiScoringBackend instances behind the existing TwoModelDifferential wrapper for a full differential run. Method surface: logprob_of via echo=True+logprobs=0, rolling_logprob via the same, next_token_dist via max_tokens=1+logprobs=K, plus preflight_finite_check and an ApiScoringBackend.generate for probes that need text output (e.g. leakage).
  • Token-boundary handling. The API returns tokens as strings, so logprob_of walks the echoed tokens by character length until the running total covers the prompt, then sums the rest. When the boundary falls mid-token, the partial token lands on the prompt side — over-counts the prompt, under-attributes to the completion — and the behavior is asserted by a dedicated test (test_mid_token_prompt_leans_conservative).
  • Retry + preflight. Tenacity handles 5xx + network errors with exponential backoff (configurable max_retries). Preflight hits the endpoint once with hello, max_tokens=1, logprobs=1; rejects non-finite logprobs before the suite runs.
  • First backend with safe_for_concurrent_views=True. HTTP is stateless, so the S07 concurrent-probe scheduler can dispatch against an API backend in parallel as soon as the pool implementation lands (still scaffolding in the runner).
  • ModelSpec additions: BackendKind gains "api"; ModelSpec.endpoint field carries the server's base URL (the /v1/completions path is appended by the backend). API key from SWAY_API_KEY, with an OPENAI_API_KEY env var fallback, so secrets stay out of the YAML.
  • backends.build dispatch routes kind="api" to ApiScoringBackend. Combined with build_two_separate + TwoModelDifferential, a YAML with defaults.differential: false and two kind: api models Just Works end-to-end.
  • New [api] extra — httpx>=0.27 + tenacity>=9.0. Core sway stays torch-free; the [api] extra adds ~1 MB of deps vs the [hf] extra's 3 GB.
  • Unit tests (21 new) use httpx's MockTransport to intercept every call and assert numeric outputs match canned OpenAI-shaped responses — covers all three scoring methods, cache dedup, 4xx error surfacing, 503→200 retry recovery, API-key env fallback, NaN-response preflight rejection, and the token-boundary math.
  • Prove-the-value (§F7): tests/integration/test_api_ollama.py is opt-in via SWAY_OLLAMA_URL + SWAY_OLLAMA_MODEL. Runs the full scoring surface against a live Ollama serving a small model (e.g. llama3.2:1b) and asserts finite output + preflight pass. Documents the wall-time budget the "≤3× HF backend" claim rests on once the concurrent-dispatch pool lands.
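
The token-boundary walk from the Sprint 13 notes can be sketched as follows. The function name and argument shapes are assumptions: it takes the echoed token strings and their logprobs as parallel lists, with `None` allowed where the server reports no logprob, as is typical for the first echoed token.

```python
from typing import Optional, Sequence


def completion_logprob(
    tokens: Sequence[str],
    logprobs: Sequence[Optional[float]],
    prompt: str,
) -> float:
    """Sum logprobs of the tokens that fall entirely after the prompt."""
    covered = 0  # characters of echoed text attributed to the prompt so far
    idx = 0
    while idx < len(tokens) and covered < len(prompt):
        # A token straddling the boundary is consumed here, so it lands
        # on the prompt side (the conservative choice).
        covered += len(tokens[idx])
        idx += 1
    return sum(lp for lp in logprobs[idx:] if lp is not None)
```

With tokens `["Hel", "lo", " wor", "ld"]` and prompt `"Hello w"`, the token `" wor"` straddles the boundary and is counted as prompt, so only the logprob of `"ld"` is summed.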

Sprint 12 — Interactive HTML report

Closes Audit 01 innovation item F6. Adds the exploration surface to sit alongside the terminal (CI logs) and markdown (PR artifacts) outputs — a single-file interactive HTML page for the research / write-up case.

  • sway report result.json --format html --out report.html emits a self-contained HTML page with five interactive Plotly panels: composite-score gauge with banded thresholds, per-category horizontal-bar breakdown, per-section SIS bar chart (when section_internalization evidence carries a per_section array), adapter-ablation response curve (when adapter_ablation evidence carries lambdas + mean_divergence_per_lambda), and an all-probe score × z-score scatter with hover tooltips. The terminal verdict palette (green / yellow / red) carries across every panel.
  • Plotly bundle inlined once in <head>; each panel is a Plotly-produced <div> with a stable sway-* id so snapshot tests don't churn on Plotly point releases. No external <script src> / <link href> references — the page loads offline from a single ~4.9 MB file (Plotly 6.x JS is the majority of that; the sway wrapper + chart data is ~40 KB).
  • CLI gates --out PATH on file-producing formats. --format html requires --out (3 MB of JS has no business on stdout); --format md|json|junit now also accept --out for symmetry with the HTML path. --format terminal rejects --out — the terminal renderer is for the console only.
  • plotly>=5.20 is optional, shipped via the existing [viz] extra (alongside matplotlib). Without it, --format html exits 2 with the pip install 'dlm-sway[viz]' install hint. Graceful ImportError → RuntimeError translation tested end-to-end.
  • Snapshot at tests/snapshots/report.html locks the Sway-owned wrapper structure. The Plotly JS bundle is stripped from the snapshot (replaced with a placeholder) so the 43-line snapshot stays human-reviewable and Plotly version bumps don't drift it.
  • Prove-the-value: test_html_from_real_history_loads_offline renders HTML from the committed tests/fixtures/sway-history/02-* run (4 probes, real exported payload) and asserts: file parses via html.parser, zero external <script src> / <link href> references, every probe name appears in the body, file size lands between 1 MB and 10 MB. Measured output: 4.87 MB.
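
The "zero external references" assertion can be approximated with the standard library's `html.parser`. This is a sketch of the idea, not the project's test; `ExternalRefFinder` and `external_refs` are hypothetical names.

```python
from html.parser import HTMLParser


class ExternalRefFinder(HTMLParser):
    """Collect <script src=...> and <link href=...> references."""

    def __init__(self) -> None:
        super().__init__()
        self.external = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "script" and attr_map.get("src"):
            self.external.append(("script", attr_map["src"]))
        if tag == "link" and attr_map.get("href"):
            self.external.append(("link", attr_map["href"]))


def external_refs(html: str) -> list:
    """Return every script/link reference that would need a network fetch."""
    finder = ExternalRefFinder()
    finder.feed(html)
    finder.close()
    return finder.external
```

An inlined bundle appears as script text rather than a `src` attribute, so a self-contained page yields an empty list.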

Sprint 11 — sway compare across saved JSON runs

Closes Audit 01 innovation item F5. Ships the regression-dashboard primitive every CI integration reached for but had to script.

  • New CLI subcommand sway compare. Accepts N saved result JSONs (typically from sway run --json), rehydrates each via report.from_json, folds them into a score matrix, and renders: a per-probe score table with columns per run, per-adjacent-pair delta columns (colored red on drop / green on lift), and a composite-score timeline row. Formats: terminal (default, Rich), --format md, --format json.
  • --fail-on-regression <threshold>. Exits 1 when any probe's score in the newest run dropped ≥ threshold vs the prior run. threshold=0 (default) disables the gate. Gate fires after the output is emitted so CI logs always capture the matrix even on red builds.
  • suite/compare.py module. Clean separation: build_matrix folds (SuiteResult, SwayScore) pairs into a CompareMatrix dataclass; render_{terminal,markdown,json} consume that. No filesystem IO in the module itself — the CLI owns the reads, the renderers own the writes. Probes that disappeared between runs show as None / em-dash in the matrix; new probes show as None on older runs. Union of probe names is sorted for stable row order across invocations.
  • Markdown snapshot locked at tests/snapshots/compare.md — silent schema drift breaks the test like every other snapshot.
  • Committed history fixture + prove-the-value test. tests/fixtures/sway-history/{01,02,03}-*.json ship a three-run narrative: baseline → retrained-improved → over-trained. Run 03 plants a section_internalization drop of 0.22 and a calibration_drift drop of 0.25 while delta_kl still rises (the memorization-without-generalization failure mode). test_compare_catches_planted_regression invokes sway compare sway-history/*.json --fail-on-regression 0.10 and asserts exit=1 with both regressed probes surfaced in the JSON payload and terminal text — exactly the F5 "CI gate the build on regression" experiment.
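
The `--fail-on-regression` rule reduces to a comparison between the last two runs. A minimal sketch, assuming each run is a `{probe_name: score}` mapping with `None` for probes absent from that run; `regressed_probes` is a hypothetical helper, not part of `suite/compare.py`.

```python
from typing import Dict, List, Optional


def regressed_probes(
    prior: Dict[str, Optional[float]],
    newest: Dict[str, Optional[float]],
    threshold: float,
) -> List[str]:
    """Names of probes whose score dropped by at least threshold."""
    if threshold <= 0:
        return []  # threshold=0 disables the gate
    regressed = []
    for name in sorted(prior.keys() & newest.keys()):
        old, new = prior[name], newest[name]
        if old is None or new is None:
            continue  # probe missing from one run: nothing to compare
        if old - new >= threshold:
            regressed.append(name)
    return regressed
```

A CLI built on this would emit the matrix first and exit 1 afterwards when the list is non-empty, matching the ordering described above.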

Sprint 10 — Multi-rank adversarial null adapters

Closes Audit 01 innovation item F4. Extends the S02 null-calibration matrix with a per-rank profile — users now read "how rank-saturated is my adapter?" straight off the report.

  • NullAdapterSpec.rank_multipliers: list[float] = [1.0] (new). Default preserves single-rank behavior byte-for-byte. Setting [0.5, 1.0, 2.0] calibrates three independent null distributions per probe kind and emits a z-profile alongside the verdict z-score.
  • NullCalibratedBackend.as_null_adapter gains a rank_scale kwarg. Both shipped backends (dummy + HF) implement it by scaling the null-weight noise std by sqrt(rank_scale) — mathematically equivalent to a rank change in terms of the LoRA output variance (A·B is a sum of r rank-1 outer products; variance is linear in r). No tensor-shape surgery, no model reload per multiplier.
  • RunContext.null_stats_by_rank threads the per-rank matrix through the runner. Keys are canonical rank_{mult:.2f} strings; inner structure matches null_stats.
  • Numeric probes emit evidence["z_by_rank"]. Every numeric probe (delta_kl, calibration_drift, external_perplexity, leakage, adapter_revert, adapter_ablation, paraphrase_invariance, preference_flip, prompt_collapse, section_internalization, style_fingerprint) computes per-rank z-scores using its existing sign convention. The verdict path still reads the 1.0x group — no existing thresholds shift.
  • Report surfaces the profile. Terminal + markdown probe notes append rank profile: +4.2σ @ 1x / +6.8σ @ 0.5x / +2.1σ @ 2x whenever multi-rank calibration ran. format_z_profile helper centralizes the rendering.
  • Null-stats disk cache widens key to include rank_multipliers. Single-rank caches from pre-S10 runs are still readable — they're promoted to the new null_stats_by_rank shape on load.
  • Bug fix (external_perplexity): drop erroneous z sign flip. mean_delta is higher-is-better (ft logprob minus base logprob on external prose), so the raw z-score maps directly onto z >= assert_z_gte. S09's sign flip was reversing the pass/fail direction when the null-calibration path ran.
  • README: rank-profile interpretation guide. Explains when the shape indicates rank saturation vs. rank oversizing, plus the "low rank can be pathologically quiet" dual-reading caveat.
  • Prove-the-value (tests/unit/test_null_multi_rank.py): on a fixed adapter with the dummy backend, the delta_kl z-profile is strictly monotone in inverse rank (z@0.5x > z@1.0x > z@2.0x) — exactly the signature of a rank-scaled null distribution.
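
The variance argument behind rank_scale can be checked numerically. The sketch below is an illustration under stated assumptions, not the backend code: it models one entry of A·B as a sum of `rank` products of independent zero-mean Gaussians, and it assumes only one of the two factors has its noise std scaled by sqrt(rank_scale). `output_variance` is a hypothetical helper.

```python
import math
import random


def output_variance(rank: int, std_a: float, std_b: float, samples: int, seed: int) -> float:
    """Empirical variance of one (A @ B) entry: a sum of `rank` products
    of independent zero-mean Gaussians (hypothetical helper)."""
    rng = random.Random(seed)
    values = [
        sum(rng.gauss(0.0, std_a) * rng.gauss(0.0, std_b) for _ in range(rank))
        for _ in range(samples)
    ]
    mean = sum(values) / samples
    return sum((v - mean) ** 2 for v in values) / (samples - 1)


rank, std, rank_scale = 8, 0.1, 2.0

# Option 1: keep the rank, scale one factor's noise std by sqrt(rank_scale).
scaled_std = output_variance(rank, std * math.sqrt(rank_scale), std, 20_000, seed=0)

# Option 2: actually change the rank and keep the std.
scaled_rank = output_variance(int(rank * rank_scale), std, std, 20_000, seed=1)

# Analytic variance for both options: rank_scale * rank * std_a^2 * std_b^2.
expected = rank_scale * rank * std**2 * std**2
```

Both estimates should land close to the same analytic value, which is the sense in which std scaling stands in for a rank change without any tensor-shape surgery.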

Sprint 09 — External-perplexity-gap probe

Closes Audit 01 innovation item F3. First innovation sprint landed on top of the audit-closure campaign (S01–S07).

  • New probe external_perplexity (probes/external_perplexity.py): rolling-logprob delta of ft vs base on held-out public-domain English prose. Raw metric is mean_delta_nats (per-token). Negative = adapter raised perplexity on external text (diffuse forgetting); positive = adapter improved English modeling incidentally. Category calibration; scored alongside calibration_drift.
  • Packaged public-domain corpus (probes/_corpora/public_domain_en.txt, ~14 KB): hand-assembled from twelve US-public-domain passages (Lincoln, Emerson, Twain, Thoreau, Austen, Darwin, Shakespeare, Franklin, Melville, Dickinson, US Constitution, Federalist #10). Each passage carries an inline # -- source: provenance comment identifying the work and the "US public domain (pre-1929)" basis. Loader strips the comments at read time so the probe sees only raw prose.
  • Null calibration: calibrate_spec() returns a 4-chunk cheap version so null_adapter produces per-kind stats without dominating suite runtime. Sign-flipped z-score — lower raw delta is worse, so the shared z >= assert_z_gte semantics read as "σ better than noise" on external fluency.
  • autogen wiring: emits an external_ppl suite entry (with corpus=public_domain_en, max_chunks=8) whenever the source .dlm has any PROSE section. Intent table explains it as the "diffuse-forgetting complement to calibration_drift."
  • Prove-the-value test (tests/unit/test_ext_ppl_vs_calibration_drift.py): a dummy backend that applies a uniform −0.3 nats/token drift to every pack item and every corpus chunk — below calibration_drift's 1.0-nat per-item regression threshold, above its −0.5 mean-delta gate, but well below external_perplexity's −0.1 fixed-threshold. calibration_drift reports PASS, external_perplexity reports FAIL. The two probes measure different failure modes and do not substitute for each other.
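
The threshold arithmetic in the prove-the-value test can be worked through directly. The thresholds below are the ones quoted above; how calibration_drift combines its two checks (no regressed item AND mean above the gate) is an assumption made for this sketch, as are the constant names and the list sizes.

```python
DRIFT = -0.3  # uniform nats/token shift applied by the dummy backend

# Thresholds quoted in the changelog entry above.
PER_ITEM_REGRESSION_NATS = 1.0   # calibration_drift: an item "regresses" at >= 1.0 nat lost
MEAN_DELTA_GATE = -0.5           # calibration_drift: mean delta must stay above this
EXTERNAL_PPL_THRESHOLD = -0.1    # external_perplexity: fixed threshold on mean delta

pack_deltas = [DRIFT] * 200      # one delta per calibration-pack item (size illustrative)
corpus_deltas = [DRIFT] * 8      # one mean delta per corpus chunk (size illustrative)


def mean(xs: list[float]) -> float:
    return sum(xs) / len(xs)


# Assumed combination rule: PASS when no item regresses and the mean clears the gate.
regressed_items = sum(1 for d in pack_deltas if d <= -PER_ITEM_REGRESSION_NATS)
calibration_drift_pass = regressed_items == 0 and mean(pack_deltas) >= MEAN_DELTA_GATE

# external_perplexity is higher-is-better: PASS only when the mean clears its threshold.
external_perplexity_pass = mean(corpus_deltas) >= EXTERNAL_PPL_THRESHOLD
```

A uniform −0.3 drift is too small to count as a per-item regression and too shallow to trip the mean gate, yet it sits below the external threshold, which is why the two probes disagree.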

Sprint 07 — Performance & caching

Closes Audit 01 findings B19 (deferred per design note), E-cache-opportunity.

  • Forward-pass cache (backends/_instrumentation.py): every backend view now routes next_token_dist / rolling_logprob / logprob_of through a bounded LRU keyed on (op, view_id, prompt_hash, top_k). Baseline A/B on a tiny 4-probe suite against SmolLM2-135M on CPU: 2.33s → 1.76s (25% wall-time reduction, forward passes 88 → 62, hit rate 30%). The audit's 18.5s Quillstone suite has more overlap and would see larger gains; the 25% floor holds on any suite with repeated prompts across probes.
  • Cache key includes view_id: "base" / "ft" / "scaled_1.25" / "null_42" are distinct namespaces. A future toggle regression that fails to flip the adapter would surface as wrong-side cached values immediately, not silently corrupt divergence math.
  • Backend stats surface in the report: SuiteResult.backend_stats captures cache_hits / cache_misses / forward_passes / scoring_wall_s / hit_rate. Terminal + markdown footers render cache: 26/88 = 30% when stats are present.
  • Forward-pass tracing: sway run --trace <path.jsonl> writes one event per backend scoring call (probe / view_id / prompt_hash / top_k / op / wall_ms / hit). Zero overhead when unset.
  • concurrent_probes scaffolding: spec.defaults.concurrent_probes: int = 1 field plus safe_for_concurrent_views: bool = False class attribute on HF / MLX / Dummy backends. The runner warns on stderr when the user requested > 1 against an unsafe backend and stays sequential. Custom backends that are already concurrency-safe (e.g. a stateless hosted-API backend) can opt in without waiting for the HF fix.
  • B19 design note: .docs/design/backend-concurrency.md documents current state, why v0.1 doesn't fix it, and two future paths (per-thread lock vs per-worker pool) with recommendation.
  • Generation stays uncached: view.generate() intentionally bypasses the cache — probe-side callers vary (prompt, max_new_tokens, temperature, seed) in ways that would rarely collide, and caching sampled output would hide seed bugs behind stale strings.
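
A bounded LRU keyed on (op, view_id, prompt_hash, top_k) could look roughly like the sketch below. `ForwardPassCache`, its method name, the SHA-256 prompt hash, and the default size are assumptions for illustration; the actual code in backends/_instrumentation.py may differ. Only the key shape and the hit/miss counters are taken from the notes above.

```python
import hashlib
from collections import OrderedDict
from typing import Any, Callable


class ForwardPassCache:
    """Hypothetical bounded LRU for scoring calls."""

    def __init__(self, maxsize: int = 1024) -> None:
        self._entries: OrderedDict[tuple, Any] = OrderedDict()
        self._maxsize = maxsize
        self.hits = 0
        self.misses = 0

    def get_or_compute(
        self, op: str, view_id: str, prompt: str, top_k: int, compute: Callable[[], Any]
    ) -> Any:
        prompt_hash = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        key = (op, view_id, prompt_hash, top_k)  # view_id keeps base / ft apart
        if key in self._entries:
            self.hits += 1
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        self.misses += 1
        value = compute()
        self._entries[key] = value
        if len(self._entries) > self._maxsize:
            self._entries.popitem(last=False)  # evict least recently used
        return value


cache = ForwardPassCache(maxsize=2)
calls: list[str] = []


def fake_forward(view_id: str) -> str:
    calls.append(view_id)
    return f"dist-from-{view_id}"


a = cache.get_or_compute("next_token_dist", "base", "Hello", 5, lambda: fake_forward("base"))
b = cache.get_or_compute("next_token_dist", "ft", "Hello", 5, lambda: fake_forward("ft"))
c = cache.get_or_compute("next_token_dist", "base", "Hello", 5, lambda: fake_forward("base"))
```

The same prompt scored on "base" and "ft" produces two distinct entries, so a cached base distribution can never be returned for the fine-tuned view.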

Sprint 06 — CLI & report UX polish

Closes Audit 01 findings D3, D4, D5, D6, D7, D8, D9, D10, D11, D12, D13, D14, D15, B16, B17, B18.

  • D3 — extras rollup footer: terminal + markdown reports now end with a single pip install 'dlm-sway[...]' line collecting every extra mentioned in SKIP messages. Single rollup, no per-row scan.
  • D4 — sway check infers --base: reads base_model_name_or_path from the adapter's adapter_config.json when --base is omitted; echoes "(inferred base model: …)" so the user knows what got picked. Falls back to a clear required-flag error when the field is missing.
  • D5 — sway autogen annotates the YAML: generated specs carry a header block (source .dlm, dlm_id, base, adapter, generation timestamp, sway version) and a one-line # intent comment above every probe entry. No new dep — pyyaml + a position-based post-processor instead of ruamel.yaml.
  • D6 — sway run --dry-run + sway list-probes: dry-run validates the spec, prints a probe table (#, name, kind, category, enabled), and exits 0 without building a backend. list-probes prints every shipped probe kind with its category and the first line of its docstring.
  • D7 — sway doctor --json: machine-readable doctor payload (sway_version, python, platform, extras) keyed by extra and module. CI-grep-friendly.
  • D8 — dlm_source resolution warns instead of silently swallowing: when the spec sets dlm_source but the [dlm] extra isn't installed (or the bridge errors), the runner emits a yellow stderr warning with the install hint and continues with sections=None. No more silent "why are my section probes SKIPping?" debugging.
  • D9 — markdown column parity with terminal: the markdown probes table now carries score, raw, z, duration, and note columns. A Top findings section follows the table; a Skipped probes section closes when extras are missing.
  • D10 — unified number formatters: format_score, format_raw, format_z, format_duration_s live in suite/report.py and are the only numeric formatters used. Thousands separators above 1 000; a placeholder glyph for None / non-finite values. Single source kills cross-surface drift.
  • D11 — sway report --format validated via StrEnum: the Typer-level enum rejects unknown formats with the standard "Invalid value for '--format'" error. No more silent fallback to the terminal renderer on a typo.
  • D12 — sway check verdict banner: prints ✅ adapter is +4.2σ above noise (green ≥ 3σ), ⚠️ adapter is +1.5σ above noise — marginal (yellow ≥ 1σ), or ❌ adapter is +0.3σ — indistinguishable from noise (red) above the full report. Calibrated on the delta_kl z-score; check now runs null_adapter first so the z-score is available.
  • D13 — sway diff regression summary: after per-probe deltas, the diff prints A→B: N regressed >0.10, M regressed >0.20, composite Δ=±X.XX color-coded by direction.
  • D14 — adapter paths with spaces render safely: _adapter_label wraps the path in double quotes when any whitespace is present.
  • D15 — long messages wrap, don't truncate: the terminal probe table column uses Rich's overflow="fold" instead of an 80-char hard cut with an ellipsis. The full message is always visible.
  • B16 — single markdown renderer: report.from_json round-trips saved JSON back into the canonical dataclass pair, so sway report --format md and sway run --markdown both flow through report.to_markdown with identical output. The legacy _render_markdown_from_json and _render_junit_from_json helpers in cli/commands.py are deleted.
  • B17 — None scores render as a placeholder glyph: the unified formatters cover every render path; no more 0.00 masking missing data.
  • B18 — baseline row labeled (informational, weight=0): in both terminal and markdown component breakdowns, matching the Sprint 03 explicit-weight-zero decision.
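
The D12 banner tiers reduce to two cutoffs on the delta_kl z-score. A minimal sketch, assuming the cutoffs quoted above (green ≥ 3σ, yellow ≥ 1σ, red otherwise); `banner_tier` is a hypothetical name and the real banner text is richer than what is returned here.

```python
def banner_tier(z: float) -> tuple[str, str]:
    """Map a z-score to (color, formatted z) using the D12 cutoffs.

    Hypothetical helper: only the 3-sigma and 1-sigma cutoffs come from the changelog.
    """
    formatted = f"{z:+.1f}σ"
    if z >= 3.0:
        return "green", formatted   # clearly above noise
    if z >= 1.0:
        return "yellow", formatted  # marginal
    return "red", formatted         # indistinguishable from noise


tiers = [banner_tier(z) for z in (4.2, 1.5, 0.3)]
```

The three sample values are the ones used in the banner examples above and fall into the green, yellow, and red tiers respectively.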

Sprint 05 — Probe quality & edge cases

Closes Audit 01 findings B6, B7, B8, B9, B10, B11, B12, B13, B14, B20, B21, B22.

  • B6 — tail_logprob semantics: the field is now float | None with three discrete states: None (top-k covered the full vocab — no tail to redistribute), 0.0 (a real tail underflowed below fp32), and a measurable negative log-prob. HF, MLX, and dummy backends all emit None when k == vocab. Preflight check guards against non-finite tail_logprob only when it's a number.
  • B7 — probe validation before backend build: new validate_all_probes(suite) helper collects every spec error in one pass; _execute_spec calls it before materializing the backend so a typo in kind: surfaces immediately, and all typos surface in a single error message.
  • B8 — autogen style prompts: style_fingerprint no longer receives the leading sentence of a prose section (which elicited doc content, not stylistic voice). Replaced with a fixed _STYLE_ELICITATION_PROMPTS set of 6 open-ended, content-neutral prompts.
  • B9 — sentence-transformer caching: _load_embedder is now functools.lru_cache(maxsize=4). Suites that run adapter_revert back-to-back (multi-adapter diff, repeated probes) reuse the same ~80 MB embedder instead of re-loading every probe call.
  • B10 — _lcs_ratio docstring rot: the function name is a historical misnomer (it's gestalt similarity via difflib.SequenceMatcher, not LCS). Docstring rewritten to make the contract explicit; rename deferred to v0.2 to preserve the public API surface.
  • B11 — leakage adversarial perturbations: added 4 new perturbations (synonym_swap with a hand-curated 50-pair table, clause_reverse, prefix_inject, register_shift) alongside the original three. Default perturbation list is now all 7; the spec accepts any subset.
  • B12 — calibration pack expansion: BUILT_IN_PACK grew from 30 to 200 items (geography, natural sciences, arithmetic, language, history, biology, technology, miscellaneous). All public-domain / hand-composed grade-school facts — no third-party dataset license attaches. items_limit docstring documents the new resolution (~0.5 pp per regressed item vs the old ~3.3 pp).
  • B13 — tokenizer-aware _stuffing: prompt_collapse now derives its padding from the model's pad / unk / EOS token via the backend's tokenizer, making the metric language-agnostic. The pre-B13 hardcoded English string remains as the dummy-backend fallback and behind a legacy_stuffing: bool = False spec field for one-release backward-compat.
  • B14 — preference_flip per-triple error fence: wrapped each triple's logprob_of calls in try/except ProbeError. A single bad triple no longer kills the whole batch; evidence["dropped_triples"] + dropped_reasons surface the count + first 5 reasons. When every triple raises, the probe routes to ERROR with a clear explanation.
  • B20 — custom backend protocol checks: _load_custom now isinstance-checks NullCalibratedBackend and ScalableDifferentialBackend after the DifferentialBackend check. The set of satisfied protocols is stamped onto the instance as __sway_protocols__: tuple[str, ...] so the report can show which features are available without re-checking.
  • B21 — RunContext.null_stats truly frozen: the runner now wraps the stats dict in types.MappingProxyType before threading it through. The dataclass was already frozen=True but the dict was mutable by reference; B21 makes the docstring's "frozen" claim literally true. Field type widened from dict[str, dict[str, float]] to Mapping[str, Mapping[str, float]].
  • B22 — ModelSpec.adapter path normalization: added a pydantic field_validator that runs Path.expanduser().resolve() on the field at spec-load time. Backends no longer re-do the work; the cache key in _null_cache.compute_key is now stable regardless of how the user spelled the path in YAML or on the CLI.
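
The B21 point (a frozen dataclass does not freeze the dict it holds) is easy to demonstrate. The sketch below is illustrative only: `RunContextSketch` and `freeze_stats` are hypothetical names, and wrapping the inner per-kind dicts as well as the outer one is an assumption, since the note above only says the stats dict is wrapped in types.MappingProxyType.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Mapping


@dataclass(frozen=True)
class RunContextSketch:
    """Stand-in for RunContext; only the null_stats field is modeled."""

    null_stats: Mapping[str, Mapping[str, float]]


def freeze_stats(stats: dict[str, dict[str, float]]) -> Mapping[str, Mapping[str, float]]:
    # Copy each inner dict, then expose read-only views at both levels (assumed depth).
    return MappingProxyType({kind: MappingProxyType(dict(v)) for kind, v in stats.items()})


ctx = RunContextSketch(
    null_stats=freeze_stats({"delta_kl": {"mean": 0.02, "std": 0.005, "n": 5.0}})
)

try:
    ctx.null_stats["delta_kl"]["mean"] = 0.0  # type: ignore[index]
except TypeError:
    mutated = False  # mappingproxy rejects item assignment
else:
    mutated = True
```

Reads work exactly as with a dict, which is why the field type can widen to Mapping without touching probe code that only looks values up.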

Sprint 04 — Integration & regression testing

Closes Audit 01 findings C1, C2, C5, C6, C7, C8, C10, C11, C12, B3 (test side), B15.

  • Slow lane runs end-to-end (C1): the huggingface_hub dev dep added in S01 is verified; pytest -m "slow or online" now executes every checked-in integration test from a clean clone.
  • HF backend coverage 21% → 91% (C2) — measured combined fast + slow lane against dlm_sway.backends.hf. New tests:
    • tests/integration/test_hf_adapter_toggle.py extended with a bit-identical ft → base → ft roundtrip (B15 mitigation).
    • tests/integration/test_hf_scaled_adapter.py: λ sweep monotonicity + LoraLayer.scaling[key] restoration on clean exit AND on exception.
    • tests/integration/test_hf_null_adapter.py: same-seed determinism, different-seeds divergence, original adapter restoration on clean exit AND on exception.
    • tests/integration/test_hf_scoring.py: logprob_of (incl. zero-token-completion → ProbeError), rolling_logprob, next_token_dist, _HFView.generate (greedy + sampled-with-seed determinism).
    • tests/unit/test_backend_hf_helpers.py: direct unit coverage on _resolve_dtype and _detect_device so dtype regressions are caught in the fast lane.
  • MLX smoke test (C5): tests/integration/test_mlx_smoke.py exercises MLXDifferentialBackend on darwin-arm64 with a small LoRA adapter. Skips cleanly on non-darwin / non-arm64 / no-mlx_lm / missing-fixture. Reproducible adapter builder ships at tests/fixtures/build_mlx_adapter.py (run once on a Mac to populate the fixture directory).
  • sway gate exit code pinned (C6): tests/integration/test_sway_gate_exit_code.py covers PASS (exit 0), FAIL verdict (exit 1), and below-threshold-with-passing verdicts (exit 1) via Typer's CliRunner.
  • dlm import ban regression-guarded (C7): tests/unit/test_dlm_not_imported.py patches every dlm.* entry in sys.modules to None and asserts the dummy suite runs end-to-end without ImportError, plus that sway autogen surfaces a clean install-hint error rather than a stack trace.
  • Pathological probe coverage (C8 + B3 test side): tests/unit/test_probe_adapter_ablation.py now drives monotonically-decreasing curves through the helper and pins probe-level evidence["saturation_reason"] for flat / found / overshoot-with-dip / non_monotonic shapes via a monkeypatched divergence.
  • WARN-branch numerical formula pinned (C10): test_warn_branch_score_formula_pinned in tests/unit/test_probe_preference_flip.py asserts the exact score = 0.5 + mean_delta / 4.0 formula with a hand-computed expected value.
  • Disjoint top-k divergence (C12): test_disjoint_top_k_supports_produce_finite_divergence in tests/unit/test_divergence.py covers the aligned_probs tail-redistribution path against fully-disjoint base / ft supports — JS comes out finite and approaches its theoretical ln(2) bound.
  • Report schema snapshots (C11): tests/unit/test_report_snapshot.py byte-compares to_json / to_markdown / to_junit against checked-in snapshots under tests/snapshots/. Intentional schema bumps: SWAY_UPDATE_SNAPSHOTS=1 uv run pytest … and commit the updated files. No new dep — hand-rolled diff helper.
  • CI workflow shipped: .github/workflows/ci.yml runs the fast lane (unit + lint + mypy) on every push / PR, and the slow lane (integration, HF backend) on schedule (nightly 07:00 UTC), manual dispatch, push to main, and PRs that touch src/dlm_sway/backends/, tests/integration/, or pyproject.toml.
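
The ln(2) bound referenced in the C12 test can be seen on a toy example. This is a plain full-vocabulary Jensen-Shannon computation, not the project's aligned_probs tail-redistribution path; `js_divergence` here is a self-contained sketch written for illustration.

```python
import math


def js_divergence(p: list[float], q: list[float]) -> float:
    """Jensen-Shannon divergence in nats between two discrete distributions."""

    def kl(a: list[float], b: list[float]) -> float:
        # Terms with a_i == 0 contribute nothing (0 * log 0 := 0).
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0.0)

    m = [(x + y) / 2.0 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Fully disjoint supports: base puts all mass on tokens 0-1, ft on tokens 2-3.
base = [0.5, 0.5, 0.0, 0.0]
ft = [0.0, 0.0, 0.5, 0.5]

js_disjoint = js_divergence(base, ft)
js_identical = js_divergence(base, base)
```

With disjoint supports every nonzero term compares a probability against half of itself, so the result is exactly ln 2 and stays finite, while identical inputs give zero.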

Sprint 03 — Documentation truth & dead code

Closes Audit 01 findings P03, P04, P05, P07 (doc), P08, P09, P10, P14, P15, P17, P18.

  • Determinism is now wired (P09): the suite runner calls dlm_sway.core.determinism.seed_everything(spec.defaults.seed) before any backend work runs. The achieved determinism class (strict / best_effort / loose) is captured in SuiteResult.determinism and surfaces in the report footer + JSON payload + markdown header.
  • defaults.differential: false works end-to-end (P14): new backends/two_model.py with TwoModelDifferential wrapper + build_two_separate(spec_models) helper. _execute_spec routes to the wrapper when the flag is false. Doubles memory; primarily for custom backends that can't toggle adapters in place.
  • score_weights overridable from YAML and CLI (P15): new SuiteDefaults.score_weights field with subset-validating pydantic field validator (rejects unknown categories, negative weights, all-zero weights). New --weights k=v,k=v flag on sway run and sway gate. Partial overrides merge with DEFAULT_COMPONENT_WEIGHTS so users rarely have to respecify all five categories.
  • baseline row labeled (informational) with weight 0.0 explicit (P08, B18): added to DEFAULT_COMPONENT_WEIGHTS so the row appears in reports for transparency but contributes nothing to the composite. Terminal renderer adds the (informational) annotation column; markdown table adds a weight column with the same label.
  • Extended (9-dim) style fingerprint when [style] is installed (P10): style_fingerprint now actually uses spaCy POS-tagging and textstat syllable counting. Three new dims appended to the existing 6: passive-voice rate, POS 4-gram entropy (Shannon, bits), syllables per word. Spec field extended: "auto" | "on" | "off" (default "auto") controls the path. evidence["schema_version"] bumped to 2 so snapshot consumers can branch on dimensionality.
  • Stale "Known gaps" entries refreshed (P03, P04, P05): removed "null_adapter not yet wired" (delivered in S01/S02), "custom backend stubbed" (it's full), "MLX paths raise" (rewritten to describe the real limitation: two model copies because mlx_lm has no runtime adapter toggle).
  • README adds determinism + weights + differential documentation and a calibration paragraph that points at the per-kind null matrix.
  • delta_kl docstring rot fixed (P17): dropped the dead :mod:`dir` reference.
  • Meta-test test_no_dead_options.py (P14/P15 regression guard): greps the source tree for documented spec/CLI option names and asserts each has a consumer outside its declaration site. Catches the exact "documented but unused" pattern Audit 01 flagged.
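
The partial-override behavior of score_weights can be sketched as parse, validate, merge. Everything below is hypothetical apart from the rules stated above (reject unknown categories, negative weights, and all-zero weights; merge partial overrides into the defaults; baseline carries weight 0.0). The default values, the function names, and the choice to run the all-zero check on the merged result are assumptions.

```python
# Hypothetical defaults: only baseline=0.0 is stated in the changelog.
DEFAULT_COMPONENT_WEIGHTS = {
    "adherence": 0.3,
    "attribution": 0.3,
    "calibration": 0.2,
    "ablation": 0.2,
    "baseline": 0.0,
}


def parse_weights(arg: str) -> dict[str, float]:
    """Parse a 'k=v,k=v' string such as the value passed to --weights."""
    overrides: dict[str, float] = {}
    for pair in arg.split(","):
        key, _, value = pair.partition("=")
        overrides[key.strip()] = float(value)
    return overrides


def merge_weights(overrides: dict[str, float]) -> dict[str, float]:
    unknown = set(overrides) - set(DEFAULT_COMPONENT_WEIGHTS)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    if any(w < 0 for w in overrides.values()):
        raise ValueError("weights must be non-negative")
    merged = {**DEFAULT_COMPONENT_WEIGHTS, **overrides}
    if all(w == 0 for w in merged.values()):
        raise ValueError("at least one weight must be positive")
    return merged


weights = merge_weights(parse_weights("calibration=0.5,ablation=0.1"))
```

Categories left out of the override string keep their defaults, so a user tuning one category does not have to restate the other four.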

Sprint 02 — Universal z-score calibration

Closes Audit 01 findings P02 (delivery), B2, C9.

  • NullAdapterProbe is now a per-kind calibration matrix: iterates every downstream numeric kind in the suite (or an explicit calibrate_kinds list), runs a miniature version of each through the NullCalibrationBackendProxy for N seeds, and publishes {kind: {mean, std, n}} under evidence["null_stats"]. The old "publishes two keys only" implementation is gone (closes B2).
  • NullCalibrationBackendProxy (_null_proxy.py): swaps as_finetuned() to yield as_null_adapter(seed) so every numeric probe's own math produces the "what does my metric look like when the fine-tune is structural noise?" distribution without a bespoke code path per probe.
  • Probe.calibrate_spec(ctx) classmethod plus SENTINEL_PROMPTS / SENTINEL_DOC shared constants in probes/base.py. Each numeric probe overrides calibrate_spec to return a small, cheap spec for calibration; probes that can't be meaningfully calibrated (e.g. adapter_revert needs an embedder, adapter_ablation needs as_scaled_adapter, prompt_collapse can't fit an exponential decay to null noise) opt out by returning None and surface (no calibration for <kind>) in the report.
  • Shared _zscore helpers (probes/_zscore.py): z_score(raw, stats), verdict_from_z(z, threshold), score_from_z(z), no_calibration_note(kind). Every numeric probe now flows through these — no bespoke (raw - mean) / std math left in any probe file. Includes the MIN_STD = 1e-6 floor so a degenerate null distribution (e.g. a single-seed calibration) can't produce an infinite z-score (closes C9).
  • All 10 numeric probes thread get_null_stats(ctx, self.kind): delta_kl, adapter_revert, prompt_collapse, section_internalization, paraphrase_invariance, preference_flip, style_fingerprint, calibration_drift, leakage, adapter_ablation. Each has an assert_z_gte: float = 3.0 that is preferred when stats exist. Lower-is-better probes (adapter_revert, calibration_drift, leakage) sign-flip the z internally so the shared z >= threshold PASS rule still reads as "significantly better than null". Two probes (paraphrase_invariance, calibration_drift) keep their intent-aware / compound thresholds as the no-calibration fallback.
  • On-disk null-stats cache (probes/_null_cache.py, backends/hf.py): HF backend exposes cache_identity(); NullAdapterProbe hashes (backend_identity, runs, init_scale, seed_base, top_k, kinds) into a stable filename under ~/.dlm-sway/null-stats/<key>.json (XDG-respecting). Cache is best-effort — a missing / malformed file rebuilds. Disable per-suite with cache: false in the spec, or globally with SWAY_DISABLE_NULL_CACHE=1. The dummy backend doesn't expose a cache identity, so tests never touch disk unless they opt in.
  • Report shows z-scores in every numeric probe row: the terminal renderer already had a z column; the markdown renderer now does too. Rows that fell back to fixed thresholds carry (no calibration for <kind>) directly in the message.
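
The shared z-score flow can be illustrated with a small sketch. The function names and argument lists match the helpers listed above, but the bodies are assumptions written for illustration rather than the contents of probes/_zscore.py; in particular, where the floor is applied and how a probe flips the sign are guesses consistent with the description.

```python
import math
from typing import Mapping

MIN_STD = 1e-6  # floor quoted in the changelog


def z_score(raw: float, stats: Mapping[str, float]) -> float:
    """(raw - mean) / std with the std floored at MIN_STD (assumed body)."""
    std = max(stats["std"], MIN_STD)
    return (raw - stats["mean"]) / std


def verdict_from_z(z: float, threshold: float) -> str:
    return "PASS" if z >= threshold else "FAIL"


null_stats = {"mean": 1.0, "std": 0.5, "n": 5.0}  # illustrative numbers

# Higher-is-better probe: use z as-is.
z_up = z_score(3.0, null_stats)
verdict_up = verdict_from_z(z_up, threshold=3.0)

# Lower-is-better probe: flip the sign so the shared z >= threshold rule still applies.
z_down = -z_score(0.0, null_stats)
verdict_down = verdict_from_z(z_down, threshold=3.0)

# Zero-variance null: the floor keeps the result finite instead of dividing by zero.
z_tight = z_score(1.5, {"mean": 1.0, "std": 0.0, "n": 1.0})
```

A raw value of 0.0 against a null mean of 1.0 is two standard deviations on the good side for a lower-is-better metric, which still falls short of the default 3.0 cutoff.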

Sprint 01 — Finite safety & verdict integrity

Closes Audit 01 findings B1, B3 (fix), B4, B5, C3, C4, D1, D2.

  • safe_finalize helper in core/result.py: every numeric probe now routes its final ProbeResult through this guard. A non-finite raw (or any explicitly-critical field) auto-converts to Verdict.ERROR; non-finite auxiliaries are nulled out and recorded in evidence["defensively_nulled"].
  • _divergence non-finite rejection: aligned_probs, kl, js now raise ProbeError on NaN/inf inputs instead of silently producing garbage. JS results are bounds-checked against the theoretical [0, ln 2]; out-of-bound results raise rather than ship. (Pins the +11639σ bug — JS exceeded ln 2 nats by 19×.)
  • PreflightCheckable Protocol: HF and dummy backends gain preflight_finite_check() that runs one forward pass per view with a sentinel prompt and rejects backends whose logprobs are non-finite.
  • Suite runner preflight gate: every sway run calls backend.preflight_finite_check() before the probe loop. Failure emits a single synthetic ERROR probe and skips all configured probes; sway gate exits non-zero. Disable with skip_preflight=True for sub-second test suites.
  • style_fingerprint zero-vector fix (B4): a fine-tuned model that produces empty / whitespace-only generations no longer reports a spurious "+0.82 shift toward doc" PASS. The _cosine_shift math is replaced by _projection_shift = (ft-base)·(doc-base) / ||doc-base||² which goes to zero when ft equals base.
  • _saturation_lambda full-range search (B3): the adapter-ablation saturation detector now searches the entire λ range (not just λ ≤ 1.0) with max(divs) as the reference, and returns a typed reason ("found" / "non_monotonic" / "flat_curve" / "below_floor") so a flat NaN-adapter curve is distinguishable from a healthy saturating one.
  • Property-based divergence tests (hypothesis): symmetry, JS(p,p) = 0, JS ≤ ln 2, KL(p,p) = 0, KL non-negativity. Plus explicit non-finite-raises tests pinning every entry point.
  • End-to-end NaN regression (tests/integration/test_nan_adapter_regression.py, slow+online): builds a real PEFT adapter with all-NaN weights, persists it through safetensors, then asserts the HF backend's preflight rejects it and the suite produces ERROR — not the +11639σ headline the audit caught.
  • Added huggingface_hub to dev dep group so integration tests can resolve the tiny_model_dir fixture (closes C1 ahead of Sprint 04).
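
The projection formula from the B4 fix, (ft-base)·(doc-base) / ||doc-base||², can be written out on plain lists. This is a sketch of the formula only; `projection_shift` is a hypothetical name, and returning 0.0 when doc equals base is an assumption, since the changelog does not say how that case is handled.

```python
def projection_shift(ft: list[float], base: list[float], doc: list[float]) -> float:
    """Fraction of the base -> doc direction covered by the base -> ft move."""
    direction = [d - b for d, b in zip(doc, base)]
    norm_sq = sum(x * x for x in direction)
    if norm_sq == 0.0:
        return 0.0  # assumption: no direction to project onto
    moved = [f - b for f, b in zip(ft, base)]
    return sum(m * x for m, x in zip(moved, direction)) / norm_sq


base = [0.0, 0.0]
doc = [2.0, 0.0]

no_move = projection_shift(base, base, doc)         # ft identical to base
full_move = projection_shift(doc, base, doc)        # ft lands exactly on doc
half_move = projection_shift([1.0, 5.0], base, doc) # halfway along, plus off-axis drift
```

Because the numerator contains (ft-base), a fine-tuned model whose fingerprint matches the base scores zero, and movement orthogonal to the base → doc direction is ignored.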

0.1.0.dev0 — 2026-04-20

Initial pre-alpha. Full 11-primitive battery shipped.

Primitives

  • Adherence
    • delta_kl — mean JS/KL divergence between base and fine-tuned next-token distributions
    • adapter_revert — reversion under adversarial paraphrase (needs sway-eval[semsim])
    • prompt_collapse — exponential-decay fit of divergence over context length
  • Attribution
    • section_internalization (flagship) — per-section effective_sis with leak check
    • paraphrase_invariance — memorization vs. generalization, intent-aware
    • preference_flip — DPO/ORPO chosen/rejected margin inversion
  • Calibration
    • style_fingerprint — 6-dim numpy-only stylistic shift vs. document
    • calibration_drift — general-knowledge regression on a packaged 30-item pack
    • leakage — greedy LCS recall + perturbation fragility
  • Ablation
    • adapter_ablation (signature primitive) — λ-scaled divergence curve with linearity, saturation, overshoot metrics
  • Baseline
    • null_adapter — per-kind null distribution matrix for z-score calibration (wired end-to-end in Unreleased)

Infrastructure

  • DifferentialBackend + ScalableDifferentialBackend protocols
  • HuggingFace + PEFT backend with disable_adapter / set_adapter toggling and LoRA-scale mutation
  • Dummy backend for unit tests (canned responses + linear-blend scalable mode)
  • YAML spec loader, composite score (four-category weighted), rich terminal + JSON + JUnit + Markdown reports
  • Typer CLI: run, gate, check, diff, autogen, doctor, report
  • .dlm bridge (dlm-sway[dlm]): resolver + full-battery autogen
  • Matplotlib visualizations (dlm-sway[viz]): SIS bar chart, ablation curve, KL histogram

Known gaps

  • MLX backend loads two model copies in memory (base + adapter-fused) because mlx_lm has no runtime adapter toggle. Memory footprint is ~2x the HF path; fine for the small (<3B) models MLX typically runs.
  • MLX requires a pre-converted .npz adapter — raw PEFT safetensors are rejected by mlx_lm.load. A PEFT-to-MLX converter is a future milestone.
  • PyPI publication of the dlm-sway wheel is pending a clean CI release workflow.
View source
1 # Changelog
2
3 ## Unreleased
4
5 ### Sprint 21 — Audit 03 closure
6
7 Closes the Audit 03 short list — 2 🟠 major + 5 🟡 minor findings the
8 audit recommended for v0.1.0 readiness. Full audit at
9 `.docs/audits/03-final-audit.md`. Verdict was YELLOW-leaning-GREEN
10 with zero critical findings; this sprint pushes it to GREEN on the
11 in-scope items. Deferred: F01 (MLX converter — its own feature
12 sprint, S24), F05 (resolves naturally at v0.1.0 tag), F06 (dlm API
13 compat test — lands with v0.1.0 release in S22). `v0.1.0` PyPI
14 publish is deferred to S22 (requires release coordination +
15 credentials).
16
17 **🟠 Major — degenerate z-score clipping (F02).**
18
19 - **`probes/null_adapter`** — null stats now carry an explicit
20 `degenerate: 1.0` field when the calibration ran but produced an
21 unusable baseline (`runs: 1`, or every seed producing the *exact*
22 same raw). The std floor at `1e-6` is preserved for valid-but-tight
23 multi-seed nulls so they still calibrate. Audit observed a
24 `+290,766σ` leakage-probe z under `runs: 1`; post-fix, z_score
25 refuses and the probe falls back to fixed thresholds with a clear
26 footer rollup.
27 - **`probes/_zscore.z_score`** — refuses when `stats["degenerate"]`
28 is truthy, independent of the std check. Belt-and-suspenders on
29 top of the existing MIN_STD guard.
30 - **`suite/report.collect_degenerate_null_kinds`** — new rollup
31 surface. Terminal + markdown footers gain a
32 "N probe kind(s) had a degenerate null baseline — bump `runs:`"
33 block, distinct from the existing null-opt-outs rollup.
34 - 12 new unit tests across `test_null_calibration.py`,
35 `test_zscore_helpers.py`, `test_report_extras_rollup.py`, and
36 `test_probe_external_perplexity.py`. Existing
37 `test_std_floor_prevents_runaway_zscore` was rewritten to assert
38 the fixed behavior (the pre-fix contract WAS the audit's bug).
39
40 **🟡 Minor — CI resilience (F03).**
41
42 - **`pytest-timeout>=2.3`** + **`tenacity>=9.0`** in dev deps.
43 - **`tests/fixtures/tiny_model.py`** wraps `snapshot_download` with
44 exponential-backoff tenacity retry (3 attempts, 5-10-20s backoff)
45 + `etag_timeout=10` to bound per-file head probes. Benefits every
46 slow+online test.
47 - **`tests/integration/test_determinism_golden.py`** gains
48 `@pytest.mark.timeout(600)`. A silent network hang now surfaces
49 as a test failure with actionable output (audit observed 20m
50 workflow-timeout hang on Sprint 19 merge run 24747915467).
51
52 **🟡 Minor — outlier-miner pool guard (F04).**
53
54 - **`mining/outlier_miner.mine_outliers`** — raises `SwayError` when
55 the pool has fewer than `2·top_k` distinct scored prompts, with
56 an actionable `--top-k N` hint. Pre-fix: 1-distinct-prompt pool
57 produced `top=[p], bottom=[p]` (identical lists — no outlier
58 contrast). Guard applies AFTER scoring so unsupported probe kinds
59 still return the empty-result path.
60
61 **🟡 Minor — autogen skipped-probes comment (F07).**
62
63 - **`integrations/dlm/autogen.collect_skipped_probe_reasons`** — new
64 public helper returning `(probe_kind, reason)` tuples for probes
65 `_build_suite` intentionally omitted for the given handle.
66 - **`_render_annotated_yaml`** — when `skipped` is non-empty, header
67 gains a `# skipped: <kind> (<reason>)` block. Users no longer have
68 to diff the autogen source to understand which probes are missing
69 from their generated `sway.yaml`.
70
71 **🟡 Minor — CI Node.js 20 deprecation silence (F08).**
72
73 - **`.github/workflows/ci.yml`** sets
74 `FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: "true"` at the workflow
75 level. Silences the Node 20 deprecation warnings each CI log
76 currently carries (deadline June 2026). Temporary until every
77 action we pull bumps to Node 24 natively.
78
79 **🟡 Minor — portable `dlm_source` (F09).**
80
81 - **`integrations/dlm/autogen._portable_dlm_source`** emits a
82 cwd-relative path when the `.dlm` lives inside the cwd, absolute
83 otherwise. The cwd-relative form survives cross-machine checkout
84 (CI agents, other devs); the old absolute-path emission broke the
85 moment a committed autogen'd YAML was run on a different host.
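
The cwd-relative fallback can be sketched with `pathlib` alone. `portable_source` is a hypothetical name; the real `_portable_dlm_source` may differ in signature and edge-case handling.

```python
from pathlib import Path


def portable_source(dlm_path: Path, cwd: Path) -> str:
    """Return a cwd-relative path when dlm_path is under cwd, else an absolute one."""
    resolved = dlm_path.resolve()
    try:
        # relative_to raises ValueError when resolved is not inside cwd.
        return resolved.relative_to(cwd.resolve()).as_posix()
    except ValueError:
        # Outside cwd a relative form would need "..", so keep it absolute.
        return resolved.as_posix()
```

Resolving both sides first keeps symlinked temp or home directories from producing a spurious "outside cwd" result.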
86
87 **Other.**
88
89 - **640 unit tests pass** (up from 626 before S21 — 14 new
90 regression tests across F02/F03/F04/F07/F09).
91 - **Pragmatic deviations from original scope:** S21 was audit-scoped
92 to ~2 days of cleanup. Actual depth: 6 findings closed with 14
93 new regression tests. One existing test
94 (`test_std_floor_prevents_runaway_zscore`) explicitly rewritten to
95 flip its assertion — the pre-fix contract the test encoded WAS
96 the audit's reported bug.
97
98 ### Sprint 19 — Pre-commit hook `sway gate`
99
100 Closes Audit 01 stretch-list F-item "pre-commit hook sway gate." Ships
101 a `.pre-commit-hooks.yaml` declaring two hook variants so
102 [pre-commit.com](https://pre-commit.com) users can gate their adapter
103 on every commit that touches a spec, `.dlm` file, or adapter directory.
104
105 - **Two hooks, pick your posture.**
  - `sway-gate`: `language: system`. Uses the sway install on the
    user's `PATH`. Fast, zero install cost. Recommended default.
  - `sway-gate-isolated`: `language: python` + git-based
    `additional_dependencies` (`dlm-sway[hf] @ git+...@<SHA>`).
110 Self-contained — pre-commit builds a fresh venv and installs
111 sway + torch + transformers on first run (~5 GB, ~2 min).
112 - **Rev pinning stance.** Example config pins to a commit SHA, not
113 `HEAD`. Sway is pre-v0.1.0 — no tagged release yet — and `HEAD`
114 drifts silently on every `pre-commit autoupdate`. SHA pinning is
115 the honest pre-release pattern; migration to `rev: v0.1.0` is a
116 README edit once the first release lands.
117 - **`pass_filenames: false` on both variants.** `sway gate` takes
118 exactly one spec path; the user declares it via `args:`. The
119 `files:` regex decides when the hook fires. No ambiguity.
120 - **Scope discipline.** Neither hook surfaces `--json` or
121 `--markdown` flags. The hook gates (non-zero on FAIL) and nothing
122 else. Users wanting report artifacts run `sway run` separately.
- **README "Pre-commit" section** documents the consumer-side config
124 for both variants, the SHA-pinning rationale, and the one-time
125 install cost of the isolated variant.
126 - **`examples/precommit-example/`** — template `sway.yaml`,
127 consumer-side `.pre-commit-config.yaml`, and a README walk-through.
128 - **`tests/integration/test_pre_commit_hook.py`** (slow + online) —
129 three cases: (a) parse-and-shape smoke test on
130 `.pre-commit-hooks.yaml`; (b) pass-case runtime — spawns
131 `pre-commit run sway-gate` as a subprocess against a tmp git repo
132 with a real LoRA on SmolLM2-135M, asserts exit 0; (c) fail-case —
133 same fixture with an impossible `assert_mean_gte`, asserts
134 non-zero exit + `gate FAILED` banner.
135 - **`pre-commit>=3.8`** in `[dependency-groups].dev` so the
136 integration test can invoke it as a subprocess. Keeps the tool
137 out of the user's runtime deps.
138
139 ### Sprint 18 — Cross-platform determinism golden
140
141 Closes Audit 01 stretch-list F-item "cross-platform determinism golden
142 test." The README's "deterministic on CPU where possible" claim held
143 in theory; now it's pinned by CI on two platforms.
144
145 - **New module** `src/dlm_sway/core/golden.py`. Tolerance-aware JSON
146 comparator with `compare_goldens(actual, expected, *, logprob_tol=1e-6,
147 score_tol=1e-4)` and `mask_variable_fields` for stripping
148 timestamps, wall-seconds, `duration_s`, `backend_stats`,
149 `sway_version`, and cwd-resolved path identifiers (`adapter_id`,
150 `base_model_id`) before comparison. No torch dep; runs in the fast
151 lane.
152 - **New integration test** `tests/integration/test_determinism_golden.py`
153 (slow + online). Builds a deterministically-seeded LoRA on
154 SmolLM2-135M, runs a minimal 2-probe suite (delta_kl +
155 calibration_drift), and diffs the JSON output against
156 `tests/golden/expected_<platform>.json`. `SWAY_UPDATE_GOLDENS=1`
157 toggles regen mode; missing golden → SKIP with a regen recipe.
158 - **New CI matrix** `determinism-golden` with
159 `strategy.matrix.os: [ubuntu-latest, macos-latest]` in
160 `.github/workflows/ci.yml`. Triggered by changes under
161 `tests/golden/**`, `src/dlm_sway/core/golden.py`, or the test file
162 itself (plus the standard schedule/dispatch/push triggers).
163 - **`workflow_dispatch` regen mode** — dispatching the CI workflow
164 with `regenerate_goldens=true` flips `SWAY_UPDATE_GOLDENS=1` on both
165 matrix legs and uploads the regenerated JSONs as per-platform
166 artifacts. Meant for deliberate "yes I changed the algorithm"
167 flows.
168 - **Tolerance rationale** (documented in `core/golden.py`): `1e-6` for
169 logprob-like numeric fields sits above typical BLAS-implementation
  drift (the `1e-8` to `1e-7` band between OpenBLAS on linux and Accelerate
171 on darwin) but below real algorithm-change drift. `1e-4` for score
172 fields absorbs composition noise.
173 - **25 new unit tests** for the comparator covering mask coverage,
174 tolerance thresholds, structural diffs (missing keys, length
175 mismatches, type mismatches), NaN/inf edge cases, and a realistic
176 two-masked-payloads round-trip.
177 - **Pragmatic scope deviation**: the sprint envisioned a checked-in
178 5 MB adapter binary. Instead the test builds it from a fixed seed
179 at runtime (same pattern as `test_external_perplexity_e2e` and
180 `test_cluster_kl_e2e`). Avoids the regeneration chore noted in the
181 sprint's risks section.
182 - **Linux golden bootstrap**: first PR ships `expected_darwin.json`
183 only; the linux leg SKIPs with a recipe pointing at the
184 `workflow_dispatch` regen mode. Maintainer dispatches, downloads
185 the artifact, commits `expected_linux.json`, and the next CI run
186 asserts cleanly on both platforms. One-time onboarding cost.
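
To illustrate the mask-then-compare flow described for `core/golden.py`, here is a simplified sketch. It uses one tolerance instead of the separate `logprob_tol` / `score_tol`, and the key names `timestamp` and `wall_seconds` are guesses at how the masked fields are spelled. `mask` and `diff` are hypothetical stand-ins for `mask_variable_fields` and `compare_goldens`.

```python
import math

# Assumed mask set, based on the field names listed above.
MASKED_KEYS = {"timestamp", "wall_seconds", "duration_s", "backend_stats",
               "sway_version", "adapter_id", "base_model_id"}


def mask(value):
    """Recursively drop keys whose values vary from run to run."""
    if isinstance(value, dict):
        return {k: mask(v) for k, v in value.items() if k not in MASKED_KEYS}
    if isinstance(value, list):
        return [mask(v) for v in value]
    return value


def _is_number(x):
    return isinstance(x, (int, float)) and not isinstance(x, bool)


def diff(actual, expected, tol=1e-6, path="$"):
    """Return human-readable mismatches; an empty list means the payloads agree."""
    if isinstance(actual, dict) and isinstance(expected, dict):
        out = [f"{path}.{k}: key mismatch" for k in sorted(actual.keys() ^ expected.keys())]
        for k in sorted(actual.keys() & expected.keys()):
            out += diff(actual[k], expected[k], tol, f"{path}.{k}")
        return out
    if isinstance(actual, list) and isinstance(expected, list):
        if len(actual) != len(expected):
            return [f"{path}: length {len(actual)} != {len(expected)}"]
        out = []
        for i, (a, e) in enumerate(zip(actual, expected)):
            out += diff(a, e, tol, f"{path}[{i}]")
        return out
    if _is_number(actual) and _is_number(expected):
        if math.isnan(actual) and math.isnan(expected):
            return []  # treat NaN == NaN as agreement
        ok = actual == expected or abs(actual - expected) <= tol
        return [] if ok else [f"{path}: {actual} != {expected} (tol {tol})"]
    return [] if actual == expected else [f"{path}: {actual!r} != {expected!r}"]
```

The `actual == expected` short-circuit lets matching infinities pass, since their difference is NaN and would otherwise fail the tolerance check.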
187
188 ### Sprint 17 — Adversarial paraphrase mining + outlier-prompt miner
189
190 Closes Audit 01 innovation item F11. Adds `sway mine` — an evaluation
191 companion to `paraphrase_invariance` that surfaces the paraphrases a
192 memorizing adapter most reliably fails on, plus a secondary mode that
193 ranks `delta_kl` prompts by per-prompt divergence.
194
195 **`sway mine --mode paraphrase`** — per `paraphrase_invariance` case:
196
197 1. Generate candidate paraphrases via nlpaug `SynonymAug` (WordNet,
198 deterministic under a fixed seed; back-translation is a user-supplied
199 escape hatch to keep the default fast and offline).
200 2. Embed candidates through the shared MiniLM cache (same 80 MB load
201 `adapter_revert` pulls) and greedy farthest-first select the top-K
202 most pairwise-distant ones — dodges nlpaug's tendency to emit
203 near-duplicate synonym swaps.
3. Rank by per-token lift gap:
   `(ft(prompt, gold) - base(prompt, gold)) - (ft(candidate, gold) - base(candidate, gold))`.
   A large positive gap means the candidate breaks the adapter's lift, so the
   adapter doesn't generalize there.
4. Emit `sway-mined-paraphrase.yaml` with a `mined_cases:` block that is
   paste-compatible with `paraphrase_invariance.cases`.
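
Steps 2 and 3 can be sketched in plain Python. This is an illustration under assumptions, not the shipped `paraphrase_miner.py`: embeddings are non-zero float lists, distance is cosine, selection is seeded with the first candidate, and `farthest_first` / `lift_gap` are hypothetical names.

```python
import math


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm  # assumes neither vector is all zeros


def farthest_first(embeddings, k):
    """Greedily pick up to k indices, each farthest from its nearest earlier pick."""
    if not embeddings:
        return []
    chosen = [0]  # assumption: seed the selection with the first candidate
    while len(chosen) < min(k, len(embeddings)):
        remaining = [i for i in range(len(embeddings)) if i not in chosen]
        best = max(
            remaining,
            key=lambda i: min(cosine_distance(embeddings[i], embeddings[j]) for j in chosen),
        )
        chosen.append(best)
    return chosen


def lift_gap(ft_prompt, base_prompt, ft_cand, base_cand):
    """Adapter lift on the original prompt minus its lift on the candidate."""
    return (ft_prompt - base_prompt) - (ft_cand - base_cand)
```

A near-duplicate candidate scores a tiny minimum distance to something already chosen, so the greedy step passes over it until nothing more distant remains.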
210
211 **`sway mine --mode outliers`** — rank a prompt pool by per-prompt
212 `delta_kl.raw`. Pool defaults to the spec's own `delta_kl` prompts;
213 `--from-corpus public_domain_en` draws from S09's CC0 corpus instead.
214 Emits top-K and bottom-K blocks: the highest-divergence prompts (best
215 for tightening a gate) and lowest-divergence (candidates to drop from
the suite). `leakage` and `paraphrase_invariance` have
section/case-based specs; outlier mining on those is future work, documented
inline in `mining/outlier_miner.py`.
219
220 - **New package** `src/dlm_sway/mining/` — two modules,
221 `paraphrase_miner.py` and `outlier_miner.py`, plus a thin
222 `corpus_prompts` helper for the `--from-corpus` path.
223 - **New CLI subcommand** `sway mine SPEC --mode paraphrase|outliers
224 [--out FILE] [--top-k N] [--n-candidates N] [--from-corpus NAME]
225 [--seed N]`. Paste-compatible YAML emission by default; `--out`
226 overrides the filename.
227 - **18 new unit tests** — 7 for the paraphrase miner (ranker,
228 diversity filter, dedup, input validation), 8 for the outlier
229 miner (delta_kl ranking, corpus wiring, unsupported-kind skip), 3
230 for the CLI (paraphrase + outliers modes + no-prompts error path).
231 - **Prove-the-value test** at
232 `tests/unit/test_paraphrase_miner_prove_value.py`. On a
233 deliberately-memorizing dummy backend the hand-written paraphrase
234 list passes with `generalization_ratio > 0.5`; substituting the
235 mined list (same seed, same adapter) drops the ratio below 0.5
236 and flips the verdict to FAIL, with a ≥ 0.3 ratio gap.
237 - **Dependency footprint**: no new extras. The paraphrase miner
238 uses nlpaug (already in `[style]`) + sentence-transformers (in
239 `[semsim]`); the diversity filter reuses the embedder
240 `adapter_revert` already pulls. Graceful `BackendNotAvailableError`
241 with a pip hint when the extras aren't installed.
242
243 ### Sprint 20 — Audit 02 closure
244
245 Closes the full finding inventory from `.docs/audits/02-followup-audit.md`
246 (1 🔴 critical, 8 🟠 major, 11 🟡 minor, 4 stronger-test opportunities, 3
247 dead-code items). Sixteen sprints of innovation work had accumulated
248 since Audit 01; the re-audit surfaced one ship-blocker (S14's bootstrap
249 CI dropped at the runner boundary) plus a grab-bag of claim-vs-code
250 gaps, dead branches, and untested contracts.
251
252 **🔴 Critical — ship-blocker.**
253
254 - **F01** — `suite/runner.py:_with_duration` now forwards every
255 `ProbeResult` field. The pre-fix version silently dropped `ci_95`
256 on every probe, making S14's headline deliverable runtime-inert in
257 any real `sway run`. Regression test at the runner boundary + snapshot
258 fixture carrying a populated `ci_95` pin the fix.
259
260 **🟠 Major — claim-vs-code gaps.**
261
262 - **F02** — real `sklearn.cluster.KMeans` exercised by two new unit
263 tests (separation + seed determinism) + a slow+online integration
264 test at `tests/integration/test_cluster_kl_e2e.py`. Before: every
265 cluster_kl test monkeypatched `_kmeans_cluster` with an argmax stub,
266 leaving S16's reason-for-being unverified.
267 - **F03** — `sway list-probes` falls back to the defining module's
268 `__doc__` when a probe class has no docstring. Every one of the 13
269 shipped probes now renders a summary row.
270 - **F04** — `sway doctor` probes `plotly` (the load-bearing `[viz]` dep),
271 `sklearn` (S16 cluster_kl), and the `api` extras (`httpx`, `tenacity`).
272 - **F05** — README primitives table reflects 13 probes (adherence +
273 cluster_kl, calibration + external_perplexity, baseline row for
274 null_adapter).
275 - **F06** — `TwoModelDifferential` composes `safe_for_concurrent_views`
276 from its two inner backends. Before: the attribute was missing and
277 the runner defaulted to `False` even when both inners set `True`,
  breaking S13's concurrency claim at the only supported
  differential-use pattern.
280 - **F07** — `integrations/dlm/autogen` emits `cluster_kl` when the
281 prompt pool clears 20 entries; markdown report gains a per-probe
282 "Cluster breakdown" section with per-cluster mean KL + exemplars.
283 - **F08** — `backends/dummy._NullView` RNG uses `hashlib.md5` instead
284 of Python's PYTHONHASHSEED-salted `hash()`. Cross-process determinism
285 test at `tests/unit/test_cross_process_determinism.py` pins the fix.
286 - **F09** — trace writer ↔ analyzer round-trip test at
287 `tests/unit/test_runner_backend_stats.py`: runs a suite with
288 `trace_path=`, loads the file back, asserts probe labels + hit/miss
289 counts match `backend_stats`.
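
The F08 fix rests on `hashlib.md5` being stable across processes, whereas the built-in `hash()` of a string is salted per process unless `PYTHONHASHSEED` is pinned. A minimal sketch of the pattern; `stable_seed` and `noise_for` are hypothetical helper names, not `_NullView` internals.

```python
import hashlib
import random


def stable_seed(text: str) -> int:
    """Derive a 64-bit seed from text that is identical in every Python process."""
    digest = hashlib.md5(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def noise_for(prompt: str) -> float:
    """Per-prompt noise that no longer depends on PYTHONHASHSEED."""
    rng = random.Random(stable_seed(prompt))
    return rng.gauss(0.0, 1.0)
```

md5 serves here only as a deterministic mixing function, not for security.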
290
291 **🟡 Minor — dead code, doc drift, latent brittleness.**
292
293 - **F10** — removed dead `_ft_view` branch in `probes/prompt_collapse.py`.
294 - **F11** — pytest plugin resolves spec paths against `config.rootpath`,
295 not process cwd.
296 - **F12** — `backends/api._post_completions` delegates to tenacity's
297 `Retrying()` callable; removes the unreachable post-loop fallback.
298 - **F13** — `preference_flip` WARN branch emits `ci_95` and `z_by_rank`
299 for consistency with every other numeric probe's WARN shape.
300 - **F14** — `adapter_ablation` module docstring documents why `ci_95`
301 renders as em-dash (it's a curve-fit, not a sample-mean aggregator).
302 - **F15** — report footer surfaces `null_adapter` opt-outs (probes with
303 `calibrate_spec=None`) in both terminal and markdown surfaces.
304 - **F16** — report category ordering derives from
305 `DEFAULT_COMPONENT_WEIGHTS` and accepts unknown categories from
306 custom `Probe` subclasses.
307 - **F17** — `cluster_kl` degenerate zero-variance case now returns
308 `Verdict.WARN` with `evidence["degenerate_zero_variance"]=True` and
309 suppresses z-score to avoid a spurious small-sample-noise calibration.
310 - **F18** — `_calibration_pack.py` gains per-section provenance notes
311 (F18 audit trail) without noisy per-item annotations.
312 - **F19** — pytest plugin defers `dlm_sway.core.result` /
313 `dlm_sway.core.errors` imports to call sites so non-`@pytest.mark.sway`
314 users don't pay the load tax.
315 - **F20** — `sway check` `--help` documents the σ-banner's null-calibration
316 dependency (fall-through to composite score band when null SKIPs).
317
318 **Stronger-test opportunities — regression tests for things we weren't
319 pinning.**
320
321 - **#9** — `probes/_divergence` rejects effectively-uniform TokenDists
322 (spread < 1e-9) via `_check_non_degenerate_token_dist`. Catches a
323 shape-broken lm_head that would otherwise compute a trivial constant
324 divergence across prompts. Two compatibility touchups rode with the
325 change: dummy backend's synthesized `ft` dist and cluster_kl test
326 fixture `_dist_broad` gained a tiny monotonic perturbation to clear
327 the guard (real models never produce bit-uniform logits thanks to
328 fp32 accumulation noise).
329 - **#10** — cross-verdict consistency test
330 (`tests/unit/test_cross_verdict_consistency.py`): runs the same spec
331 through `sway run`, `sway gate`, and `sway report --format junit` and
332 asserts identical per-verdict tallies.
333 - **#11** — `sway doctor --json` schema-shape snapshot test locks the
334 top-level keys and the per-extra module-name set.
335 - **#12** — subprocess determinism test asserts `_NullView.next_token_dist`
336 is byte-identical across `PYTHONHASHSEED` values; pins F08's fix.
337
338 **Dead-code inventory — DC3–DC5.**
339
340 - **DC3** — `null_adapter.py`'s pre-S10 cache-promotion branch is
341 annotated as legacy; removal queued for the next minor.
342 - **DC4** — dropped the stderr warning at `suite/runner.py` that fired
343 whenever `concurrent_probes > 1` even though execution never fanned
344 out. The spec field is still accepted (future pool will light it up);
345 the noisy warning is gone.
346 - **DC5** — `tests/unit/test_model.py` grows coverage of `ModelSpec`'s
347 dtype enum, `endpoint`, `trust_remote_code`, and `custom`/`api`
348 `kind` branches.
349
350 Final state: 582 unit tests + 3 integration tests passing; mypy strict
351 clean across 55 source files; ruff + format clean across 137 files.
352 Sprint 20 is a single PR composed of per-finding, per-file commits so
353 the history reads as a fix-by-fix cleanup.
354
355 ### Sprint 16 — Cluster-coherent KL probe
356
357 Closes Audit 01 innovation item F8. Adds a new `cluster_kl` probe that
358 answers the question `delta_kl` can't: the mean divergence may be the
359 same, but did the adapter shift the *right* topics, or is it a uniform
360 blunt-instrument shift?
361
362 - **New probe `cluster_kl`** (category: adherence). Embeds prompts via
363 shared MiniLM (same cache key as `adapter_revert`), k-means clusters
364 them at a fixed seed, measures per-prompt JS divergence between base
365 and ft, and reports a **specificity ratio**:
366 `between_variance / (between_variance + within_variance)`. Range
367 `[0, 1]`: `≈ 0.5` on a blunt adapter that shifts every topic by the
368 same amount, `→ 1.0` on a topic-targeted adapter. Pair
369 `(mean_kl, specificity)` tells a more honest story than either number
370 alone.
371 - **Bootstrap CI** on specificity. Resamples `(divergence, cluster_label)`
372 pairs with replacement and takes the 2.5/97.5 percentiles of the
373 bootstrap ratio distribution. Returns `None` below 4 prompts — matches
374 the convention `core.stats.bootstrap_ci` uses.
375 - **Z-score calibration** against `null_adapter` baseline via the
376 existing `_zscore` helpers. `calibrate_spec` synthesizes 8 mixed-topic
377 sentinel prompts + k=2 for the null pass; the specificity distribution
378 concentrates around 0.5 there, so a real adapter's separation above
379 that baseline reads as a clean z-score.
380 - **Degenerate-input policy.** `< min_prompts` (default 20) → SKIP with
381 a clear message; `num_clusters * 2 > num_prompts` → SKIP (per-cluster
382 mean not well-resolved); empty prompt list → ERROR. Missing `[semsim]`
383 extras → SKIP with a `pip install 'dlm-sway[semsim]'` hint.
384 - **Zero-variance fallback.** If every prompt produced the same
385 divergence (canned stub data, no adapter motion), the specificity
386 ratio is mathematically undefined; we return `0.5` — the null-adapter
387 expectation — so downstream z-score reports "no signal" instead of
388 NaN.
389 - **New dependency.** `scikit-learn>=1.4` added to the `[semsim]` extra
390 (riding the 80 MB MiniLM load that probe already pulls in). Mirrored
391 to `[all]` and to mypy's stubless-override list.
392 - **7 unit tests** in `tests/unit/test_probe_cluster_kl.py`: two-topic
393 adapter → high specificity, uniform adapter → 0.5 fallback, too-few
394 prompts → SKIP, empty prompts → ERROR, `num_clusters > prompts/2` →
395 SKIP, CI bracketing, missing-extras SKIP path.
396 - **Prove-the-value test** at `tests/unit/test_cluster_kl_prove_value.py`.
397 Two backends with comparable `delta_kl` (ratio < 3×) have specificity
398 scores that split by at least 0.3 — concrete evidence that `cluster_kl`
399 surfaces a structural distinction `delta_kl` merges.
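
The specificity ratio and its zero-variance fallback can be sketched as follows. The variance definitions (population variances weighted by cluster size) and the `1e-12` cutoff are assumptions made for the example. The probe's actual computation may differ, and the input is assumed non-empty.

```python
from collections import defaultdict


def specificity(divergences, labels):
    """between_variance / (between_variance + within_variance), 0.5 if undefined."""
    groups = defaultdict(list)
    for value, label in zip(divergences, labels):
        groups[label].append(value)
    n = len(divergences)
    grand = sum(divergences) / n
    # Spread of the cluster means around the grand mean.
    between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups.values()) / n
    # Spread of each prompt around its own cluster mean.
    within = sum((x - sum(g) / len(g)) ** 2 for g in groups.values() for x in g) / n
    total = between + within
    if total < 1e-12:  # assumed cutoff for "every prompt diverged identically"
        return 0.5     # fallback value stated in the changelog entry above
    return between / total
```

When every divergence is identical, both variances are zero and the ratio is 0/0, which is why a fixed fallback is needed at all.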
400
401 ### Sprint 15 — pytest plugin (`@pytest.mark.sway`)
402
403 Closes Audit 01 innovation item F10. Packages sway as a pytest
404 library so teams already running pytest adopt sway with a single
405 decorator instead of a subprocess wrapper.
406
407 - **New plugin** (`src/dlm_sway/pytest_plugin.py`) auto-loaded via
408 the `pytest11` entry point after `pip install 'dlm-sway[pytest]'`.
409 `@pytest.mark.sway(spec="...", threshold=0.0, weights=None)`
410 expands a single pytest function into **one test item per probe**
411 in the referenced spec + an optional `__gate__` item that fires
412 only when the composite score drops below `threshold`.
413 - **Verdict translation.** `FAIL` / `ERROR` → pytest Failed,
414 `SKIP` → pytest Skipped, `WARN` → pytest warning, `PASS` →
415 pytest pass. Probe-level failures isolate: a failing adherence
416 probe doesn't mask a failing calibration one, and `pytest -k
417 adherence` runs just that probe.
418 - **Suite runs once per decorated function.** A session-scoped
419 `_SuiteCache` keyed on `(spec_path, weights)` ensures the N-way
420 item expansion doesn't multiply backend wall time. Two
421 `@pytest.mark.sway` tests against the same spec share one run.
422 - **Malformed marks fail cleanly.** Missing `spec`, non-numeric
423 `threshold`, non-dict `weights`, and unknown kwargs produce a
424 synthetic `_ConfigErrorItem` with a green-field pytest failure
425 line — no cryptic collection-time tracebacks.
426 - **New `[pytest]` extra** — just `pytest>=8.0`. Matches the pattern
427 other plugin-style extras use. Also registers the `pytest11`
428 entry point so the plugin is discovered automatically on install,
429 consistent with pytest-cov / pytest-xdist.
430 - **Example directory** at `examples/pytest_integration/` with a
431 minimal `sway.yaml` + `test_sway_gate.py` showing the decorator
432 replacing a legacy `subprocess.run(["sway", "gate", ...])`
433 wrapper. README gains a "Pytest integration" section pointing at
434 it.
435 - **13 new unit tests** via pytest's canonical `pytester` fixture:
436 marker registration, N-item expansion, FAIL / SKIP / ERROR
437 routing, gate below-threshold failure, gate above-threshold pass,
438 gate absence when `threshold=0`, malformed-mark error paths,
439 cache sharing across multiple decorated tests.
440 - **Drive-by fix for the S14 CI-narrowing test flake.** The dummy
441 backend's per-prompt noise was seeded via Python's `hash()`,
442 which is salted per-process via `PYTHONHASHSEED`. Swapped to a
443 stable `hashlib.md5`-derived seed so the narrowing invariant
444 holds deterministically across test-order permutations.
445
446 ### Sprint 14 — Bootstrap CIs + forward-pass trace CLI
447
448 Closes Audit 01 innovation items F9 (bootstrap confidence intervals
449 on raw metrics) + F12 (polished forward-pass trace CLI). Paired
450 sharpenings of existing probe output and existing trace
451 infrastructure.
452
453 - **New `core/stats.bootstrap_ci`** — percentile-bootstrap 95% CI on
454 any sequence of per-sample measurements. Numpy-only, 1000-resample
  default, seeded from `ctx.seed` so intervals are reproducible.
456 Short-circuits on non-finite / empty / degenerate-constant inputs;
457 vectorized resample keeps the overhead ~1 ms per probe.
458 - **`ProbeResult.ci_95: tuple[float, float] | None`** — new field,
459 threaded through `safe_finalize` (nulled if `raw` is nulled, so a
460 CI never brackets a defensive-null point estimate). `to_json` /
461 `from_json` persist the pair as a two-list.
462 - **Six aggregating probes emit `ci_95`:** `delta_kl` (over per-prompt
463 divergences), `calibration_drift` (over per-item regression
464 indicators), `external_perplexity` (over per-chunk deltas),
465 `leakage` (over clean-recall rates), `paraphrase_invariance` (over
466 verbatim lifts), `section_internalization` (over effective SIS
467 scores). Each also lands the interval under
  `evidence["raw_ci_95"]` for JSON consumers.
469 - **Report tables add a `ci95` column** between `raw` and `z` in
470 both terminal and markdown output. `format_ci` renders as
  `[lo, hi]` with em-dash for missing/non-finite. Markdown
472 snapshot refreshed; JSON snapshot refreshed for the new
473 per-probe `ci_95` field.
474 - **New `sway trace <jsonl>` CLI.** Reads the forward-pass trace
475 JSONL the S07 runner writes when `--trace <path>` is set,
476 aggregates into per-probe and per-view wall-time + hit-rate
477 tables, plus a top-N slowest-events table. `--format
478 terminal|md|json`, `--slowest K` to tune the tail size. The
479 parsing tolerates old trace shapes (missing `probe` / `hit`
480 fields) and skips malformed lines.
481 - **`suite/trace_analysis`** — typed dataclasses (`TraceEvent`,
482 `ProbeSummary`, `ViewSummary`, `TraceReport`) + three renderers
483 (terminal / markdown / json) matching the conventions of
484 `suite/report` and `suite/compare`.
485 - **Committed trace fixture** at `tests/fixtures/trace_sample.jsonl`
486 (8 events, 2 probes × 4 prompts with cache hits). Drives 22 new
487 trace-analysis + CLI tests.
488 - **Prove-the-value (F9):** on a dummy backend with per-prompt
489 variation, `delta_kl`'s CI narrows from `[0.33, 0.41]` (width
490 0.079) at N=4 prompts to `[0.38, 0.42]` (width 0.044) at N=32 —
491 1.8× tighter in width, tracking the √(N₂/N₁) ≈ 2.8× theoretical
492 scaling minus dummy-backend dispersion.
493 - **Prove-the-value (F12):** the CLI's terminal output on the
494 committed fixture correctly identifies `sis/base` (520.3 ms) as
495 the single slowest event, `sis` (1,529 ms) as the slower probe
496 over `dk` (529 ms), and `ft` (1,357 ms) as the slower view over
497 `base` (701 ms).
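
For readers unfamiliar with the percentile bootstrap behind `ci_95`, here is a standard-library sketch. The shipped `core/stats.bootstrap_ci` is described as numpy-based and vectorized, so this illustrates the technique rather than that code. Returning `None` on the short-circuit cases is an assumption.

```python
import math
import random


def bootstrap_ci(samples, n_resamples=1000, seed=0):
    """Percentile-bootstrap 95% CI on the mean of samples, or None if unusable."""
    if len(samples) < 2 or any(not math.isfinite(x) for x in samples):
        return None
    if min(samples) == max(samples):
        return None  # constant input: resampling cannot produce any spread
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(samples)
    # Mean of each resample drawn with replacement, sorted for percentile lookup.
    means = sorted(sum(rng.choices(samples, k=n)) / n for _ in range(n_resamples))
    lo = means[int(0.025 * (n_resamples - 1))]
    hi = means[int(0.975 * (n_resamples - 1))]
    return lo, hi
```

Interval width shrinks roughly with the square root of the sample count, which is the √(N₂/N₁) scaling the F9 prove-the-value entry compares against.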
498
499 ### Sprint 13 — OpenAI-compatible HTTP scoring backend
500
501 Closes Audit 01 innovation item F7. Unlocks sway against hosted
502 fine-tunes — OpenAI platform, `vllm serve`, Ollama — without
503 requiring a local torch + PEFT load.
504
505 - **New backend `ApiScoringBackend`** (`backends/api.py`): scores
  against a `/v1/completions` endpoint via httpx. Implements
507 `ScoringBackend` only (not `DifferentialBackend`) since one
508 endpoint is one model; users compose two `ApiScoringBackend`
509 instances behind the existing `TwoModelDifferential` wrapper for
510 a full differential run. Method surface: `logprob_of` via
511 `echo=True`+`logprobs=0`, `rolling_logprob` via the same,
512 `next_token_dist` via `max_tokens=1`+`logprobs=K`, plus
513 `preflight_finite_check` and an `ApiScoringBackend.generate` for
514 probes that need text output (e.g. `leakage`).
515 - **Token-boundary handling.** The API returns tokens as strings, so
516 `logprob_of` walks the echoed tokens by character length until the
517 running total covers the prompt, then sums the rest. When the
518 boundary falls mid-token, the partial token lands on the prompt
519 side — over-counts the prompt, under-attributes to the completion
520 — and the behavior is asserted by a dedicated test
521 (`test_mid_token_prompt_leans_conservative`).
522 - **Retry + preflight.** Tenacity handles 5xx + network errors with
523 exponential backoff (configurable `max_retries`). Preflight hits
524 the endpoint once with `hello`, `max_tokens=1`, `logprobs=1`;
525 rejects non-finite logprobs before the suite runs.
526 - **First backend with `safe_for_concurrent_views=True`.** HTTP is
527 stateless, so the S07 concurrent-probe scheduler can dispatch
528 against an API backend in parallel as soon as the pool
529 implementation lands (still scaffolding in the runner).
530 - **`ModelSpec` additions:** `BackendKind` gains `"api"`;
531 `ModelSpec.endpoint` field carries the server's base URL (the
532 `/v1/completions` path is appended by the backend). API key from
533 `SWAY_API_KEY` → `OPENAI_API_KEY` env var fallback, so secrets
534 stay out of the YAML.
535 - **`backends.build` dispatch** routes `kind="api"` to
536 `ApiScoringBackend`. Combined with `build_two_separate` +
537 `TwoModelDifferential`, a YAML with
538 `defaults.differential: false` and two `kind: api` models Just
539 Works end-to-end.
540 - **New `[api]` extra** — `httpx>=0.27` + `tenacity>=9.0`. Core sway
541 stays torch-free; the `[api]` extra adds ~1 MB of deps vs the
542 `[hf]` extra's 3 GB.
543 - **Unit tests (21 new) use httpx's `MockTransport`** to intercept
544 every call and assert numeric outputs match canned OpenAI-shaped
545 responses — covers all three scoring methods, cache dedup, 4xx
546 error surfacing, 503→200 retry recovery, API-key env fallback,
547 NaN-response preflight rejection, and the token-boundary math.
548 - **Prove-the-value (§F7):** `tests/integration/test_api_ollama.py`
549 is opt-in via `SWAY_OLLAMA_URL` + `SWAY_OLLAMA_MODEL`. Runs the
550 full scoring surface against a live Ollama serving a small model
551 (e.g. `llama3.2:1b`) and asserts finite output + preflight pass.
552 Documents the wall-time budget the "≤3× HF backend" claim rests
553 on once the concurrent-dispatch pool lands.
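
The token-boundary rule from the list above (a token that straddles the prompt/completion boundary counts toward the prompt) can be sketched as below. `completion_logprob` is a hypothetical helper, not the backend's `logprob_of`. It assumes a non-empty prompt and an echoed response whose first-token logprob may be `None`.

```python
def completion_logprob(tokens, token_logprobs, prompt):
    """Sum logprobs of echoed tokens that start at or after the end of the prompt."""
    consumed = 0  # characters covered by the tokens seen so far
    total = 0.0
    for token, logprob in zip(tokens, token_logprobs):
        if consumed >= len(prompt):
            total += logprob  # token lies wholly in the completion
        # Otherwise the token overlaps the prompt and is attributed to the
        # prompt side, including a token that straddles the boundary.
        consumed += len(token)
    return total
```

Because the first echoed token always starts inside a non-empty prompt, its `None` logprob is never added to the sum.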
554
555 ### Sprint 12 — Interactive HTML report
556
557 Closes Audit 01 innovation item F6. Adds the exploration surface to
558 sit alongside the terminal (CI logs) and markdown (PR artifacts)
559 outputs — a single-file interactive HTML page for the research /
560 write-up case.
561
562 - **`sway report result.json --format html --out report.html`** emits
563 a self-contained HTML page with five interactive Plotly panels:
564 composite-score gauge with banded thresholds, per-category
565 horizontal-bar breakdown, per-section SIS bar chart (when
566 `section_internalization` evidence carries a `per_section` array),
567 adapter-ablation response curve (when `adapter_ablation` evidence
568 carries `lambdas` + `mean_divergence_per_lambda`), and an
569 all-probe score × z-score scatter with hover tooltips. The terminal
570 verdict palette (green / yellow / red) carries across every panel.
571 - **Plotly bundle inlined once in `<head>`**; each panel is a
572 Plotly-produced `<div>` with a stable `sway-*` id so snapshot tests
573 don't churn on Plotly point releases. No external `<script src>` /
574 `<link href>` references — the page loads offline from a single
575 ~4.9 MB file (Plotly 6.x JS is the majority of that; the sway
576 wrapper + chart data is ~40 KB).
577 - **CLI gates `--out PATH` on file-producing formats.** `--format html`
  requires `--out` (a ~4.9 MB page of inlined JS has no business on stdout);
579 `--format md|json|junit` now also accept `--out` for symmetry with
580 the HTML path. `--format terminal` rejects `--out` — the terminal
581 renderer is for the console only.
582 - **`plotly>=5.20` is optional**, shipped via the existing `[viz]`
583 extra (alongside matplotlib). Without it, `--format html` exits 2
584 with the `pip install 'dlm-sway[viz]'` install hint. Graceful
585 ImportError → RuntimeError translation tested end-to-end.
586 - **Snapshot** at `tests/snapshots/report.html` locks the Sway-owned
587 wrapper structure. The Plotly JS bundle is stripped from the
588 snapshot (replaced with a placeholder) so the 43-line snapshot
589 stays human-reviewable and Plotly version bumps don't drift it.
590 - **Prove-the-value:** `test_html_from_real_history_loads_offline`
591 renders HTML from the committed `tests/fixtures/sway-history/02-*`
592 run (4 probes, real exported payload) and asserts: file parses via
593 `html.parser`, zero external `<script src>` / `<link href>`
594 references, every probe name appears in the body, file size lands
595 between 1 MB and 10 MB. Measured output: 4.87 MB.
596
597 ### Sprint 11 — `sway compare` across saved JSON runs
598
599 Closes Audit 01 innovation item F5. Ships the regression-dashboard
600 primitive every CI integration reached for but had to script.
601
602 - **New CLI subcommand `sway compare`.** Accepts N saved result JSONs
603 (typically from `sway run --json`), rehydrates each via
604 `report.from_json`, folds them into a score matrix, and renders:
605 a per-probe score table with columns per run, per-adjacent-pair
606 delta columns (colored red on drop / green on lift), and a
607 composite-score timeline row. Formats: `terminal` (default, Rich),
608 `--format md`, `--format json`.
609 - **`--fail-on-regression <threshold>`.** Exits 1 when any probe's
610 score in the newest run dropped ≥ `threshold` vs the prior run.
611 `threshold=0` (default) disables the gate. Gate fires after the
612 output is emitted so CI logs always capture the matrix even on
613 red builds.
614 - **`suite/compare.py` module.** Clean separation: `build_matrix`
615 folds `(SuiteResult, SwayScore)` pairs into a `CompareMatrix`
616 dataclass; `render_{terminal,markdown,json}` consume that. No
617 filesystem IO in the module itself — the CLI owns the reads, the
618 renderers own the writes. Probes that disappeared between runs
619 show as `None` / em-dash in the matrix; new probes show as
620 `None` on older runs. Union of probe names is sorted for stable
621 row order across invocations.
622 - **Markdown snapshot locked** at `tests/snapshots/compare.md` —
623 silent schema drift breaks the test like every other snapshot.
624 - **Committed history fixture + prove-the-value test.**
625 `tests/fixtures/sway-history/{01,02,03}-*.json` ship a three-run
626 narrative: baseline → retrained-improved → over-trained. Run 03
627 plants a `section_internalization` drop of 0.22 and a
628 `calibration_drift` drop of 0.25 while `delta_kl` still rises
629 (the memorization-without-generalization failure mode).
630 `test_compare_catches_planted_regression` invokes
631 `sway compare sway-history/*.json --fail-on-regression 0.10` and
632 asserts exit=1 with both regressed probes surfaced in the JSON
633 payload and terminal text — exactly the F5 "CI gate the build on
634 regression" experiment.
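
The `--fail-on-regression` rule can be sketched as a pure function over two runs' per-probe scores. `regressed_probes` is a hypothetical name, and the scores in the usage note are invented to reproduce the planted 0.22 and 0.25 drops. The real gate lives in the CLI and `suite/compare.py`.

```python
def regressed_probes(prev, new, threshold):
    """Return {probe: drop} for probes whose score fell by at least threshold."""
    if threshold <= 0:
        return {}  # threshold=0 disables the gate
    drops = {}
    for name in sorted(prev.keys() & new.keys()):
        if prev[name] is None or new[name] is None:
            continue  # probe missing from one run: nothing to compare
        drop = prev[name] - new[name]
        if drop >= threshold:
            drops[name] = drop
    return drops
```

For example, with `prev = {"section_internalization": 0.80, "calibration_drift": 0.75, "delta_kl": 0.60}` and `new = {"section_internalization": 0.58, "calibration_drift": 0.50, "delta_kl": 0.70}`, a threshold of `0.10` flags the first two probes and leaves `delta_kl` alone. A caller would exit 1 when the result is non-empty, after the matrix has been printed.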

### Sprint 10 — Multi-rank adversarial null adapters

Closes Audit 01 innovation item F4. Extends the S02 null-calibration
matrix with a per-rank profile — users now read "how rank-saturated
is my adapter?" straight off the report.

- **`NullAdapterSpec.rank_multipliers: list[float] = [1.0]`** (new).
Default preserves single-rank behavior byte-for-byte. Setting
`[0.5, 1.0, 2.0]` calibrates three independent null distributions
per probe kind and emits a z-profile alongside the verdict z-score.
- **`NullCalibratedBackend.as_null_adapter` gains a `rank_scale` kwarg.**
Both shipped backends (dummy + HF) implement it by scaling the
null-weight noise std by `sqrt(rank_scale)` — mathematically
equivalent to a rank change in terms of the LoRA output variance
(`A·B` is a sum of `r` rank-1 outer products; variance is linear
in `r`). No tensor-shape surgery, no model reload per multiplier.
- **`RunContext.null_stats_by_rank`** threads the per-rank matrix
through the runner. Keys are canonical `rank_{mult:.2f}` strings;
inner structure matches `null_stats`.
- **Numeric probes emit `evidence["z_by_rank"]`.** Every numeric
probe (delta_kl, calibration_drift, external_perplexity, leakage,
adapter_revert, adapter_ablation, paraphrase_invariance,
preference_flip, prompt_collapse, section_internalization,
style_fingerprint) computes per-rank z-scores using its existing
sign convention. The verdict path still reads the 1.0x group —
no existing thresholds shift.
- **Report surfaces the profile.** Terminal + markdown probe notes
append `rank profile: +4.2σ @ 1x / +6.8σ @ 0.5x / +2.1σ @ 2x`
whenever multi-rank calibration ran. `format_z_profile` helper
centralizes the rendering.
- **Null-stats disk cache widens key to include `rank_multipliers`.**
Single-rank caches from pre-S10 runs are still readable — they're
promoted to the new `null_stats_by_rank` shape on load.
- **Bug fix (`external_perplexity`): drop erroneous z sign flip.**
`mean_delta` is higher-is-better (ft logprob minus base logprob
on external prose), so the raw z-score maps directly onto
`z >= assert_z_gte`. S09's sign flip was reversing the pass/fail
direction when the null-calibration path ran.
- **README: rank-profile interpretation guide.** Explains when the
shape indicates rank saturation vs. rank oversizing, plus the
"low rank can be pathologically quiet" dual-reading caveat.
- **Prove-the-value (`tests/unit/test_null_multi_rank.py`):** on a
fixed adapter with the dummy backend, the delta_kl z-profile is
strictly monotone in inverse rank (`z@0.5x > z@1.0x > z@2.0x`) —
exactly the signature of a rank-scaled null distribution.
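
The "variance is linear in `r`" argument behind the `rank_scale` kwarg can be checked numerically. The sketch below is only an illustration under simplifying assumptions (i.i.d. standard-normal factors, a single output entry, a hypothetical helper name); it does not reproduce how the shipped backends draw or scale their null weights.

```python
import random
import statistics


def lora_output_samples(rank: int, n: int, rng: random.Random) -> list[float]:
    """Draw n samples of one entry of A·B with i.i.d. N(0, 1) factors.

    Each entry is a sum of `rank` products a_k * b_k, one per rank-1 term.
    """
    return [
        sum(rng.gauss(0.0, 1.0) * rng.gauss(0.0, 1.0) for _ in range(rank))
        for _ in range(n)
    ]


rng = random.Random(0)
var_r4 = statistics.pvariance(lora_output_samples(4, 20_000, rng))
var_r8 = statistics.pvariance(lora_output_samples(8, 20_000, rng))

# Doubling the rank roughly doubles the output variance ...
ratio_rank = var_r8 / var_r4
# ... and multiplying a rank-4 output by sqrt(2) scales its variance by exactly 2.
ratio_scaled = (2 ** 0.5) ** 2
```

Both ratios land near 2, which is why a `sqrt(rank_scale)` factor on the noise can stand in for a rank change without touching tensor shapes.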

### Sprint 09 — External-perplexity-gap probe

Closes Audit 01 innovation item F3. First innovation sprint landed on
top of the audit-closure campaign (S01–S07).

- **New probe `external_perplexity`** (`probes/external_perplexity.py`):
rolling-logprob delta of ft vs base on held-out public-domain English
prose. Raw metric is `mean_delta_nats` (per-token). Negative =
adapter raised perplexity on external text (diffuse forgetting);
positive = adapter improved English modeling incidentally. Category
`calibration`; scored alongside `calibration_drift`.
- **Packaged public-domain corpus**
(`probes/_corpora/public_domain_en.txt`, ~14 KB): hand-assembled
from twelve US-public-domain passages (Lincoln, Emerson, Twain,
Thoreau, Austen, Darwin, Shakespeare, Franklin, Melville,
Dickinson, US Constitution, Federalist #10). Each passage carries
an inline `# -- source:` provenance comment identifying the work
and the "US public domain (pre-1929)" basis. Loader strips the
comments at read time so the probe sees only raw prose.
- **Null calibration**: `calibrate_spec()` returns a 4-chunk cheap
version so `null_adapter` produces per-kind stats without dominating
suite runtime. Sign-flipped z-score — lower raw delta is worse, so
the shared `z >= assert_z_gte` semantics read as "σ better than
noise" on external fluency.
- **`autogen` wiring**: emits an `external_ppl` suite entry (with
`corpus=public_domain_en`, `max_chunks=8`) whenever the source `.dlm`
has any PROSE section. Intent table explains it as the
"diffuse-forgetting complement to `calibration_drift`."
- **Prove-the-value test**
(`tests/unit/test_ext_ppl_vs_calibration_drift.py`): a dummy backend
applies a uniform −0.3 nats/token drift to every pack item and
every corpus chunk. That drift is smaller than `calibration_drift`'s
1.0-nat per-item regression threshold and stays above its −0.5
mean-delta gate, but falls well below `external_perplexity`'s −0.1
fixed threshold. `calibration_drift` reports PASS,
`external_perplexity` reports FAIL. The two probes measure different
failure modes and do not substitute for each other.
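
The arithmetic of that prove-the-value case can be laid out as a small sketch. The three thresholds are the ones quoted in the entry; the constant names, the item and chunk counts, and the way the two `calibration_drift` conditions are combined are assumptions made for illustration only.

```python
DRIFT = -0.3  # uniform nats/token shift applied by the dummy backend

# calibration_drift thresholds as quoted in the entry (names are hypothetical).
PER_ITEM_REGRESSION_NATS = 1.0
MEAN_DELTA_GATE = -0.5

item_deltas = [DRIFT] * 5  # hypothetical number of pack items
regressed_items = [d for d in item_deltas if d <= -PER_ITEM_REGRESSION_NATS]
mean_item_delta = sum(item_deltas) / len(item_deltas)
# Assumption: the probe passes when no item regressed and the mean clears the gate.
calibration_drift_pass = not regressed_items and mean_item_delta >= MEAN_DELTA_GATE

# external_perplexity fixed threshold as quoted in the entry.
EXTERNAL_FIXED_THRESHOLD = -0.1

chunk_deltas = [DRIFT] * 4  # hypothetical number of corpus chunks
mean_chunk_delta = sum(chunk_deltas) / len(chunk_deltas)
external_perplexity_pass = mean_chunk_delta >= EXTERNAL_FIXED_THRESHOLD
```

The same −0.3 shift clears both `calibration_drift` checks yet sits below the −0.1 line, so only `external_perplexity` fails.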

### Sprint 07 — Performance & caching

Closes Audit 01 findings B19 (deferred per design note),
E-cache-opportunity.

- **Forward-pass cache** (`backends/_instrumentation.py`): every
backend view now routes `next_token_dist` / `rolling_logprob` /
`logprob_of` through a bounded LRU keyed on `(op, view_id,
prompt_hash, top_k)`. Baseline A/B on a tiny 4-probe suite against
SmolLM2-135M on CPU: 2.33s → 1.76s (**25% wall-time reduction**,
forward passes 88 → 62, hit rate 30%). The audit's 18.5s Quillstone
suite has more overlap and would see larger gains; the 25% floor
holds on any suite with repeated prompts across probes.
- **Cache key includes `view_id`**: `"base"` / `"ft"` /
`"scaled_1.25"` / `"null_42"` are distinct namespaces. A future
toggle regression that fails to flip the adapter would surface as
wrong-side cached values immediately, not silently corrupt
divergence math.
- **Backend stats surface in the report**: `SuiteResult.backend_stats`
captures `cache_hits` / `cache_misses` / `forward_passes` /
`scoring_wall_s` / `hit_rate`. Terminal + markdown footers render
`cache: 26/88 = 30%` when stats are present.
- **Forward-pass tracing**: `sway run --trace <path.jsonl>` writes
one event per backend scoring call (probe / view_id / prompt_hash /
top_k / op / wall_ms / hit). Zero overhead when unset.
- **`concurrent_probes` scaffolding**: `spec.defaults.concurrent_probes:
int = 1` field plus `safe_for_concurrent_views: bool = False`
class attribute on HF / MLX / Dummy backends. The runner warns on
stderr when the user requested > 1 against an unsafe backend and
stays sequential. Custom backends that are already concurrency-safe
(e.g. a stateless hosted-API backend) can opt in without waiting
for the HF fix.
- **B19 design note**: `.docs/design/backend-concurrency.md` documents
current state, why v0.1 doesn't fix it, and two future paths
(per-thread lock vs per-worker pool) with recommendation.
- **Generation stays uncached**: `view.generate()` intentionally
bypasses the cache — probe-side callers vary
`(prompt, max_new_tokens, temperature, seed)` in ways that would
rarely collide, and caching sampled output would hide seed bugs
behind stale strings.
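
A bounded LRU keyed on `(op, view_id, prompt_hash, top_k)` with hit/miss counters might look roughly like the sketch below. The class name `ForwardCache`, its method names, and the default size are hypothetical; the actual code in `backends/_instrumentation.py` may be organized quite differently.

```python
from collections import OrderedDict
from typing import Callable, Hashable


class ForwardCache:
    """Bounded LRU for scoring calls, with hit/miss counters."""

    def __init__(self, maxsize: int = 1024) -> None:
        self._store: "OrderedDict[Hashable, object]" = OrderedDict()
        self._maxsize = maxsize
        self.hits = 0
        self.misses = 0

    def get_or_compute(
        self,
        op: str,
        view_id: str,
        prompt_hash: str,
        top_k: int,
        compute: Callable[[], object],
    ) -> object:
        # view_id is part of the key, so "base" and "ft" never share entries.
        key = (op, view_id, prompt_hash, top_k)
        if key in self._store:
            self.hits += 1
            self._store.move_to_end(key)  # mark as most recently used
            return self._store[key]
        self.misses += 1
        value = compute()  # the real forward pass would happen here
        self._store[key] = value
        if len(self._store) > self._maxsize:
            self._store.popitem(last=False)  # evict the least recently used entry
        return value

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


cache = ForwardCache(maxsize=2)
first = cache.get_or_compute("next_token_dist", "base", "abc", 50, lambda: "base-dist")
again = cache.get_or_compute("next_token_dist", "base", "abc", 50, lambda: "recomputed")
other = cache.get_or_compute("next_token_dist", "ft", "abc", 50, lambda: "ft-dist")
```

Under this reading, the footer's `26/88` would be hits over total scoring calls, with the remaining 62 calls being real forward passes.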

### Sprint 06 — CLI & report UX polish

Closes Audit 01 findings D3, D4, D5, D6, D7, D8, D9, D10, D11, D12,
D13, D14, D15, B16, B17, B18.

- **D3 — extras rollup footer:** terminal + markdown reports now end
with a single `pip install 'dlm-sway[...]'` line collecting every
extra mentioned in SKIP messages. Single rollup, no per-row scan.
- **D4 — `sway check` infers `--base`:** reads
`base_model_name_or_path` from the adapter's `adapter_config.json`
when `--base` is omitted; echoes "(inferred base model: …)" so the
user knows what got picked. Falls back to a clear required-flag
error when the field is missing.
- **D5 — `sway autogen` annotates the YAML:** generated specs carry a
header block (source `.dlm`, dlm_id, base, adapter, generation
timestamp, sway version) and a one-line `# intent` comment above
every probe entry. No new dep — pyyaml + a position-based
post-processor instead of `ruamel.yaml`.
- **D6 — `sway run --dry-run` + `sway list-probes`:** dry-run validates
the spec, prints a probe table (#, name, kind, category, enabled),
and exits 0 without building a backend. `list-probes` prints every
shipped probe kind with its category and the first line of its
docstring.
- **D7 — `sway doctor --json`:** machine-readable doctor payload
(`sway_version`, `python`, `platform`, `extras`) keyed by extra and
module. CI-grep-friendly.
- **D8 — `dlm_source` resolution warns instead of silently swallowing:**
when the spec sets `dlm_source` but the `[dlm]` extra isn't installed
(or the bridge errors), the runner emits a yellow stderr warning with
the install hint and continues with `sections=None`. No more silent
"why are my section probes SKIPping?" debugging.
- **D9 — markdown column parity with terminal:** the markdown probes
table now carries `score`, `raw`, `z`, `duration`, and `note`
columns. A `Top findings` section follows the table; a `Skipped
probes` section closes when extras are missing.
- **D10 — unified number formatters:** `format_score`, `format_raw`,
`format_z`, `format_duration_s` live in `suite/report.py` and are
the only numeric formatters used. Thousands separators above 1 000;
a placeholder glyph for `None` / non-finite. Single source kills
cross-surface drift.
- **D11 — `sway report --format` validated via `StrEnum`:** the
Typer-level enum rejects unknown formats with the standard
"Invalid value for '--format'" error. No more silent fallback to
the terminal renderer on a typo.
- **D12 — `sway check` verdict banner:** prints
`✅ adapter is +4.2σ above noise` (green ≥ 3σ),
`⚠️ adapter is +1.5σ above noise — marginal` (yellow ≥ 1σ), or
`❌ adapter is +0.3σ — indistinguishable from noise` (red) above
the full report. Calibrated on the delta_kl z-score; `check`
now runs `null_adapter` first so the z-score is available.
- **D13 — `sway diff` regression summary:** after per-probe deltas,
the diff prints `A→B: N regressed >0.10, M regressed >0.20,
composite Δ=±X.XX` color-coded by direction.
- **D14 — adapter paths with spaces render safely:** `_adapter_label`
wraps the path in double quotes when any whitespace is present.
- **D15 — long messages wrap, don't truncate:** the terminal probe
table column uses Rich's `overflow="fold"` instead of an 80-char
hard cut with an ellipsis. The full message is always visible.
- **B16 — single markdown renderer:** `report.from_json` round-trips
saved JSON back into the canonical dataclass pair, so
`sway report --format md` and `sway run --markdown` both flow
through `report.to_markdown` with identical output. The legacy
`_render_markdown_from_json` and `_render_junit_from_json` helpers
in `cli/commands.py` are deleted.
- **B17 — `None` scores render as a placeholder glyph:** the unified
formatters cover every render path; no more `0.00` masking missing data.
- **B18 — `baseline` row labeled `(informational, weight=0)`:** in
both terminal and markdown component breakdowns, matching the
Sprint 03 explicit-weight-zero decision.
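
The three-tier banner from D12 reduces to two cut-offs on the z-score. A minimal sketch, assuming a hypothetical function name `verdict_banner` and one-decimal signed formatting; the banner strings are the ones quoted in the entry, and terminal coloring is left out.

```python
def verdict_banner(z: float) -> str:
    """Map a delta_kl z-score onto the three banner tiers quoted in D12."""
    if z >= 3.0:  # green tier
        return f"✅ adapter is {z:+.1f}σ above noise"
    if z >= 1.0:  # yellow tier
        return f"⚠️ adapter is {z:+.1f}σ above noise — marginal"
    # red tier: anything below 1σ
    return f"❌ adapter is {z:+.1f}σ — indistinguishable from noise"


banners = [verdict_banner(z) for z in (4.2, 1.5, 0.3)]
```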

### Sprint 05 — Probe quality & edge cases

Closes Audit 01 findings B6, B7, B8, B9, B10, B11, B12, B13, B14, B20,
B21, B22.

- **B6 — `tail_logprob` semantics:** the field is now `float | None`
with three discrete states: `None` (top-k covered the full vocab — no
tail to redistribute), `0.0` (a real tail underflowed below fp32),
and a measurable negative log-prob. HF, MLX, and dummy backends all
emit `None` when `k == vocab`. Preflight check guards against
non-finite `tail_logprob` only when it's a number.
- **B7 — probe validation before backend build:** new
`validate_all_probes(suite)` helper collects every spec error in one
pass; `_execute_spec` calls it *before* materializing the backend so
a typo in `kind:` surfaces immediately, and *all* typos surface in a
single error message.
- **B8 — `autogen` style prompts:** `style_fingerprint` no longer
receives the leading sentence of a prose section (which elicited doc
*content*, not stylistic voice). Replaced with a fixed
`_STYLE_ELICITATION_PROMPTS` set of 6 open-ended, content-neutral
prompts.
- **B9 — sentence-transformer caching:** `_load_embedder` is now
`functools.lru_cache(maxsize=4)`. Suites that run `adapter_revert`
back-to-back (multi-adapter diff, repeated probes) reuse the same
~80 MB embedder instead of re-loading every probe call.
- **B10 — `_lcs_ratio` docstring rot:** the function name is a
historical misnomer (it's gestalt similarity via
`difflib.SequenceMatcher`, not LCS). Docstring rewritten to make the
contract explicit; rename deferred to v0.2 to preserve the public
API surface.
- **B11 — leakage adversarial perturbations:** added 4 new perturbations
(`synonym_swap` with a hand-curated 50-pair table, `clause_reverse`,
`prefix_inject`, `register_shift`) alongside the original three.
Default perturbation list is now all 7; the spec accepts any subset.
- **B12 — calibration pack expansion:** `BUILT_IN_PACK` grew from 30
to **200** items (geography, natural sciences, arithmetic, language,
history, biology, technology, miscellaneous). All public-domain /
hand-composed grade-school facts — no third-party dataset license
attaches. `items_limit` docstring documents the new resolution
(~0.5 pp per regressed item vs the old ~3.3 pp).
- **B13 — tokenizer-aware `_stuffing`:** `prompt_collapse` now derives
its padding from the model's pad / unk / EOS token via the backend's
tokenizer, making the metric language-agnostic. The pre-B13 hardcoded
English string remains as the dummy-backend fallback and behind a
`legacy_stuffing: bool = False` spec field for one-release
backward-compat.
- **B14 — `preference_flip` per-triple error fence:** wrapped each
triple's `logprob_of` calls in `try/except ProbeError`. A single bad
triple no longer kills the whole batch; `evidence["dropped_triples"]` +
`dropped_reasons` surface the count + first 5 reasons. When *every*
triple raises, the probe routes to ERROR with a clear explanation.
- **B20 — custom backend protocol checks:** `_load_custom` now
isinstance-checks `NullCalibratedBackend` and
`ScalableDifferentialBackend` after the `DifferentialBackend` check.
The set of satisfied protocols is stamped onto the instance as
`__sway_protocols__: tuple[str, ...]` so the report can show which
features are available without re-checking.
- **B21 — `RunContext.null_stats` truly frozen:** the runner now wraps
the stats dict in `types.MappingProxyType` before threading it
through. The dataclass was already `frozen=True` but the dict was
mutable by reference; B21 makes the docstring's "frozen" claim
literally true. Field type widened from `dict[str, dict[str, float]]`
to `Mapping[str, Mapping[str, float]]`.
- **B22 — `ModelSpec.adapter` path normalization:** added a pydantic
`field_validator` that runs `Path.expanduser().resolve()` on the
field at spec-load time. Backends no longer re-do the work; the cache
key in `_null_cache.compute_key` is now stable regardless of how the
user spelled the path in YAML or on the CLI.
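
The B21 point, that `frozen=True` only blocks attribute rebinding while the dict stays mutable by reference, can be illustrated with a short sketch. `RunContextSketch` and `freeze_stats` are hypothetical stand-ins, and wrapping the inner per-kind dicts as well is an assumption that goes beyond what the entry states.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Mapping


@dataclass(frozen=True)
class RunContextSketch:
    """Hypothetical stand-in for the real RunContext."""

    null_stats: Mapping[str, Mapping[str, float]]


def freeze_stats(
    stats: dict[str, dict[str, float]],
) -> Mapping[str, Mapping[str, float]]:
    # Copy, then wrap both levels so neither the outer nor the inner mapping
    # can be mutated through the context (inner wrapping is an assumption).
    return MappingProxyType(
        {kind: MappingProxyType(dict(inner)) for kind, inner in stats.items()}
    )


ctx = RunContextSketch(
    null_stats=freeze_stats({"delta_kl": {"mean": 0.01, "std": 0.002, "n": 5.0}})
)

try:
    ctx.null_stats["delta_kl"]["mean"] = 99.0  # type: ignore[index]
except TypeError:
    mutation_blocked = True  # mappingproxy rejects item assignment
else:
    mutation_blocked = False
```

Reads behave like a normal mapping, so probe code that only looks up stats needs no changes in this sketch.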

### Sprint 04 — Integration & regression testing

Closes Audit 01 findings C1, C2, C5, C6, C7, C8, C10, C11, C12, B3
(test side), B15.

- **Slow lane runs end-to-end** (C1): the `huggingface_hub` dev dep
added in S01 is verified; `pytest -m "slow or online"` now executes
every checked-in integration test from a clean clone.
- **HF backend coverage 21% → 91%** (C2) — measured combined fast +
slow lane against `dlm_sway.backends.hf`. New tests:
  - `tests/integration/test_hf_adapter_toggle.py` extended with a
    bit-identical ft → base → ft roundtrip (B15 mitigation).
  - `tests/integration/test_hf_scaled_adapter.py`: λ sweep
    monotonicity + `LoraLayer.scaling[key]` restoration on clean exit
    AND on exception.
  - `tests/integration/test_hf_null_adapter.py`: same-seed
    determinism, different-seeds divergence, original adapter
    restoration on clean exit AND on exception.
  - `tests/integration/test_hf_scoring.py`: `logprob_of` (incl.
    zero-token-completion → ProbeError), `rolling_logprob`,
    `next_token_dist`, `_HFView.generate` (greedy + sampled-with-seed
    determinism).
  - `tests/unit/test_backend_hf_helpers.py`: direct unit coverage on
    `_resolve_dtype` and `_detect_device` so dtype regressions are
    caught in the fast lane.
- **MLX smoke test** (C5): `tests/integration/test_mlx_smoke.py`
exercises `MLXDifferentialBackend` on darwin-arm64 with a small
LoRA adapter. Skips cleanly on non-darwin / non-arm64 / no-mlx_lm
/ missing-fixture. Reproducible adapter builder ships at
`tests/fixtures/build_mlx_adapter.py` (run once on a Mac to
populate the fixture directory).
- **`sway gate` exit code pinned** (C6):
`tests/integration/test_sway_gate_exit_code.py` covers PASS
(exit 0), FAIL verdict (exit 1), and below-threshold-with-passing
verdicts (exit 1) via Typer's `CliRunner`.
- **`dlm` import ban regression-guarded** (C7):
`tests/unit/test_dlm_not_imported.py` patches every `dlm.*` entry in
`sys.modules` to `None` and asserts the dummy suite runs end-to-end
without ImportError, plus that `sway autogen` surfaces a clean
install-hint error rather than a stack trace.
- **Pathological probe coverage** (C8 + B3 test side):
`tests/unit/test_probe_adapter_ablation.py` now drives
monotonically-decreasing curves through the helper and pins
probe-level `evidence["saturation_reason"]` for flat / found /
overshoot-with-dip / non_monotonic shapes via a monkeypatched
`divergence`.
- **WARN-branch numerical formula pinned** (C10):
`test_warn_branch_score_formula_pinned` in
`tests/unit/test_probe_preference_flip.py` asserts the exact
`score = 0.5 + mean_delta / 4.0` formula with a hand-computed
expected value.
- **Disjoint top-k divergence** (C12):
`test_disjoint_top_k_supports_produce_finite_divergence` in
`tests/unit/test_divergence.py` covers the
`aligned_probs` tail-redistribution path against fully-disjoint
base / ft supports — JS comes out finite and approaches its
theoretical ln(2) bound.
- **Report schema snapshots** (C11): `tests/unit/test_report_snapshot.py`
byte-compares `to_json` / `to_markdown` / `to_junit` against
checked-in snapshots under `tests/snapshots/`. Intentional schema
bumps: `SWAY_UPDATE_SNAPSHOTS=1 uv run pytest …` and commit the
updated files. No new dep — hand-rolled diff helper.
- **CI workflow shipped**: `.github/workflows/ci.yml` runs the fast
lane (unit + lint + mypy) on every push / PR, and the slow lane
(integration, HF backend) on schedule (nightly 07:00 UTC), manual
dispatch, push to main, and PRs that touch `src/dlm_sway/backends/`,
`tests/integration/`, or `pyproject.toml`.
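
The ln(2) bound referenced in C12 is easy to see on a toy case. The sketch below uses plain textbook KL and JS over two fully disjoint distributions; it deliberately omits the tail-redistribution step of `aligned_probs`, so it hits the bound exactly rather than approaching it. The function names are illustrative, not the project's.

```python
import math


def kl(p: list[float], q: list[float]) -> float:
    """KL(p || q) in nats, skipping zero-mass entries of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def js(p: list[float], q: list[float]) -> float:
    """Jensen-Shannon divergence in nats, bounded by ln 2."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Fully disjoint supports: base mass on tokens 0-1, ft mass on tokens 2-3.
base = [0.5, 0.5, 0.0, 0.0]
ft = [0.0, 0.0, 0.5, 0.5]

js_disjoint = js(base, ft)
js_identical = js(base, base)
```

Because the mixture `m` is positive wherever either input is, the JS value stays finite even though `KL(base || ft)` on its own would be infinite here.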

### Sprint 03 — Documentation truth & dead code

Closes Audit 01 findings P03, P04, P05, P07 (doc), P08, P09, P10, P14,
P15, P17, P18.

- **Determinism is now wired** (P09): the suite runner calls
`dlm_sway.core.determinism.seed_everything(spec.defaults.seed)` before
any backend work runs. The achieved determinism class (`strict` /
`best_effort` / `loose`) is captured in `SuiteResult.determinism` and
surfaces in the report footer + JSON payload + markdown header.
- **`defaults.differential: false` works end-to-end** (P14): new
`backends/two_model.py` with `TwoModelDifferential` wrapper +
`build_two_separate(spec_models)` helper. `_execute_spec` routes to
the wrapper when the flag is false. Doubles memory; primarily for
custom backends that can't toggle adapters in place.
- **`score_weights` overridable from YAML and CLI** (P15): new
`SuiteDefaults.score_weights` field with subset-validating pydantic
field validator (rejects unknown categories, negative weights, all-zero
weights). New `--weights k=v,k=v` flag on `sway run` and `sway gate`.
Partial overrides merge with `DEFAULT_COMPONENT_WEIGHTS` so users
rarely have to respecify all five categories.
- **`baseline` row labeled `(informational)` with weight 0.0 explicit**
(P08, B18): added to `DEFAULT_COMPONENT_WEIGHTS` so the row appears
in reports for transparency but contributes nothing to the composite.
Terminal renderer adds the `(informational)` annotation column;
markdown table adds a `weight` column with the same label.
- **Extended (9-dim) style fingerprint when `[style]` is installed**
(P10): `style_fingerprint` now actually uses spaCy POS-tagging and
textstat syllable counting. Three new dims appended to the existing 6:
passive-voice rate, POS 4-gram entropy (Shannon, bits), syllables per
word. Spec field `extended: "auto" | "on" | "off"` (default `"auto"`)
controls the path. `evidence["schema_version"]` bumped to `2` so
snapshot consumers can branch on dimensionality.
- **Stale "Known gaps" entries refreshed** (P03, P04, P05): removed
"null_adapter not yet wired" (delivered in S01/S02), "custom backend
stubbed" (it's full), "MLX paths raise" (rewritten to describe the
real limitation: two model copies because `mlx_lm` has no runtime
adapter toggle).
- **README adds determinism + weights + differential documentation**
and a calibration paragraph that points at the per-kind null matrix.
- **`delta_kl` docstring rot fixed** (P17): dropped the dead
``:mod:`dir``` reference.
- **Meta-test `test_no_dead_options.py`** (P14/P15 regression guard):
greps the source tree for documented spec/CLI option names and
asserts each has a consumer outside its declaration site. Catches the
exact "documented but unused" pattern Audit 01 flagged.
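
The partial-override behavior of `score_weights` could be sketched as below. The default weight values and lowercase category names are hypothetical placeholders, the helper names are illustrative, and applying the all-zero check to the merged result (rather than to the override alone) is an assumption.

```python
# Hypothetical defaults: the real DEFAULT_COMPONENT_WEIGHTS values are not
# given in this changelog.
DEFAULT_COMPONENT_WEIGHTS = {
    "adherence": 0.3,
    "attribution": 0.3,
    "calibration": 0.2,
    "ablation": 0.2,
    "baseline": 0.0,  # informational row, contributes nothing
}


def parse_weights(flag: str) -> dict[str, float]:
    """Parse a `k=v,k=v` string such as the value of `--weights`."""
    pairs = (item.split("=", 1) for item in flag.split(",") if item)
    return {key.strip(): float(value) for key, value in pairs}


def merge_weights(overrides: dict[str, float]) -> dict[str, float]:
    """Validate a partial override and merge it over the defaults."""
    unknown = set(overrides) - set(DEFAULT_COMPONENT_WEIGHTS)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    if any(weight < 0 for weight in overrides.values()):
        raise ValueError("weights must be non-negative")
    merged = {**DEFAULT_COMPONENT_WEIGHTS, **overrides}
    if all(weight == 0 for weight in merged.values()):
        raise ValueError("at least one weight must be positive")
    return merged


merged = merge_weights(parse_weights("calibration=0.5,ablation=0.1"))
```

Categories that the override does not mention keep their default weight, which is what lets users respecify only the ones they care about.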

### Sprint 02 — Universal z-score calibration

Closes Audit 01 findings P02 (delivery), B2, C9.

- **`NullAdapterProbe` is now a per-kind calibration matrix**: iterates
every downstream numeric kind in the suite (or an explicit
`calibrate_kinds` list), runs a miniature version of each through the
`NullCalibrationBackendProxy` for N seeds, and publishes
`{kind: {mean, std, n}}` under `evidence["null_stats"]`. The old
"publishes two keys only" implementation is gone (closes B2).
- **`NullCalibrationBackendProxy`** (`_null_proxy.py`): swaps
`as_finetuned()` to yield `as_null_adapter(seed)` so every numeric
probe's own math produces the "what does my metric look like when the
fine-tune is structural noise?" distribution without a bespoke code
path per probe.
- **`Probe.calibrate_spec(ctx)` classmethod** plus
`SENTINEL_PROMPTS` / `SENTINEL_DOC` shared constants in
`probes/base.py`. Each numeric probe overrides `calibrate_spec` to
return a small, cheap spec for calibration; probes that can't be
meaningfully calibrated (e.g. `adapter_revert` needs an embedder,
`adapter_ablation` needs `as_scaled_adapter`, `prompt_collapse`
can't fit an exponential decay to null noise) opt out by returning
`None` and surface `(no calibration for <kind>)` in the report.
- **Shared `_zscore` helpers** (`probes/_zscore.py`):
`z_score(raw, stats)`, `verdict_from_z(z, threshold)`,
`score_from_z(z)`, `no_calibration_note(kind)`. Every numeric probe
now flows through these — no bespoke `(raw - mean) / std` math left
in any probe file. Includes the `MIN_STD = 1e-6` floor so a
degenerate null distribution (e.g. a single-seed calibration) can't
produce an infinite z-score (closes C9).
- **All 10 numeric probes thread `get_null_stats(ctx, self.kind)`**:
`delta_kl`, `adapter_revert`, `prompt_collapse`,
`section_internalization`, `paraphrase_invariance`,
`preference_flip`, `style_fingerprint`, `calibration_drift`,
`leakage`, `adapter_ablation`. Each has an `assert_z_gte: float = 3.0`
that is *preferred* when stats exist. Lower-is-better probes
(`adapter_revert`, `calibration_drift`, `leakage`) sign-flip the
z internally so the shared `z >= threshold` PASS rule still reads as
"significantly better than null". Two probes (`paraphrase_invariance`,
`calibration_drift`) keep their intent-aware / compound thresholds
as the no-calibration fallback.
- **On-disk null-stats cache** (`probes/_null_cache.py`,
`backends/hf.py`): HF backend exposes `cache_identity()`;
`NullAdapterProbe` hashes `(backend_identity, runs, init_scale,
seed_base, top_k, kinds)` into a stable filename under
`~/.dlm-sway/null-stats/<key>.json` (XDG-respecting). Cache is
best-effort — a missing / malformed file rebuilds. Disable per-suite
with `cache: false` in the spec, or globally with
`SWAY_DISABLE_NULL_CACHE=1`. The dummy backend doesn't expose a
cache identity, so tests never touch disk unless they opt in.
- **Report shows z-scores in every numeric probe row**: the terminal
renderer already had a `z` column; the markdown renderer now does
too. Rows that fell back to fixed thresholds carry
`(no calibration for <kind>)` directly in the message.
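
The shared z-score path with its std floor can be sketched as follows. The signatures follow the helper names listed above, but the bodies are assumptions: in particular, the floor is read here as clamping the std to `MIN_STD`, and the string verdicts and example numbers are illustrative only.

```python
from typing import Mapping

MIN_STD = 1e-6  # floor quoted in the entry above


def z_score(raw: float, stats: Mapping[str, float]) -> float:
    """Standardize `raw` against the null mean/std, clamping tiny stds."""
    std = max(stats["std"], MIN_STD)  # assumption: clamp rather than refuse
    return (raw - stats["mean"]) / std


def verdict_from_z(z: float, threshold: float = 3.0) -> str:
    """PASS when the probe sits at least `threshold` sigma above null."""
    return "PASS" if z >= threshold else "FAIL"


null_stats = {"mean": 0.01, "std": 0.01, "n": 5.0}  # hypothetical null stats

# Higher-is-better probe: use z as computed.
z_up = z_score(0.05, null_stats)
# Lower-is-better probe: flip the sign so the same `z >= threshold` rule applies.
z_down = -z_score(0.002, null_stats)

# A zero-variance null stays finite thanks to the floor, but is very large.
z_degenerate = z_score(1.0, {"mean": 0.0, "std": 0.0, "n": 1.0})
```

The last line shows why a single-seed null is finite but not meaningful: the floor prevents division by zero without making the resulting z-score trustworthy.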

### Sprint 01 — Finite safety & verdict integrity

Closes Audit 01 findings B1, B3 (fix), B4, B5, C3, C4, D1, D2.

- **`safe_finalize` helper** in `core/result.py`: every numeric probe now
routes its final `ProbeResult` through this guard. A non-finite
`raw` (or any explicitly-critical field) auto-converts to
`Verdict.ERROR`; non-finite auxiliaries are nulled out and recorded
in `evidence["defensively_nulled"]`.
- **`_divergence` non-finite rejection**: `aligned_probs`, `kl`, `js`
now raise `ProbeError` on NaN/inf inputs instead of silently
producing garbage. JS results are bounds-checked against the
theoretical `[0, ln 2]`; out-of-bound results raise rather than
ship. (Pins the +11639σ bug — JS exceeded ln 2 nats by 19×.)
- **`PreflightCheckable` Protocol**: HF and dummy backends gain
`preflight_finite_check()` that runs one forward pass per view with
a sentinel prompt and rejects backends whose logprobs are non-finite.
- **Suite runner preflight gate**: every `sway run` calls
`backend.preflight_finite_check()` before the probe loop. Failure
emits a single synthetic ERROR probe and skips all configured
probes; `sway gate` exits non-zero. Disable with `skip_preflight=True`
for sub-second test suites.
- **`style_fingerprint` zero-vector fix (B4)**: a fine-tuned model that
produces empty / whitespace-only generations no longer reports a
spurious "+0.82 shift toward doc" PASS. The `_cosine_shift` math is
replaced by `_projection_shift = (ft-base)·(doc-base) / ||doc-base||²`
which goes to zero when ft equals base.
- **`_saturation_lambda` full-range search (B3)**: the adapter-ablation
saturation detector now searches the entire λ range (not just λ ≤ 1.0)
with `max(divs)` as the reference, and returns a typed reason
(`"found"` / `"non_monotonic"` / `"flat_curve"` / `"below_floor"`)
so a flat NaN-adapter curve is distinguishable from a healthy
saturating one.
- **Property-based divergence tests** (`hypothesis`): symmetry,
`JS(p,p) = 0`, `JS ≤ ln 2`, `KL(p,p) = 0`, KL non-negativity. Plus
explicit non-finite-raises tests pinning every entry point.
- **End-to-end NaN regression** (`tests/integration/test_nan_adapter_regression.py`,
slow+online): builds a real PEFT adapter with all-NaN weights,
persists it through safetensors, then asserts the HF backend's
preflight rejects it and the suite produces ERROR — not the
+11639σ headline the audit caught.
- Added `huggingface_hub` to dev dep group so integration tests can
resolve the `tiny_model_dir` fixture (closes C1 ahead of Sprint 04).
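
The `_projection_shift` formula quoted for B4 can be written out in a few lines. This is a sketch of the formula only, using plain lists for the fingerprint vectors; the function name without the leading underscore and the zero-denominator fallback are assumptions, not the project's code.

```python
def projection_shift(base: list[float], ft: list[float], doc: list[float]) -> float:
    """(ft - base) · (doc - base) / ||doc - base||², per the formula above."""
    direction = [d - b for d, b in zip(doc, base)]
    norm_sq = sum(x * x for x in direction)
    if norm_sq == 0.0:
        return 0.0  # assumption: no direction to project onto when doc == base
    moved = [f - b for f, b in zip(ft, base)]
    return sum(m * x for m, x in zip(moved, direction)) / norm_sq


base = [0.0, 0.0]
doc = [2.0, 0.0]

no_shift = projection_shift(base, base, doc)        # ft equals base
full_shift = projection_shift(base, doc, doc)       # ft lands on doc
half_shift = projection_shift(base, [1.0, 5.0], doc)  # off-axis movement is ignored
```

Unlike a cosine between raw vectors, the result is anchored at `base`, so a fine-tuned fingerprint that did not move scores zero regardless of where `doc` sits.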

## 0.1.0.dev0 — 2026-04-20

Initial pre-alpha. Full 11-primitive battery shipped.

### Primitives

- **Adherence**
  - `delta_kl` — mean JS/KL divergence between base and fine-tuned next-token distributions
  - `adapter_revert` — reversion under adversarial paraphrase (needs `sway-eval[semsim]`)
  - `prompt_collapse` — exponential-decay fit of divergence over context length
- **Attribution**
  - `section_internalization` *(flagship)* — per-section `effective_sis` with leak check
  - `paraphrase_invariance` — memorization vs. generalization, intent-aware
  - `preference_flip` — DPO/ORPO chosen/rejected margin inversion
- **Calibration**
  - `style_fingerprint` — 6-dim numpy-only stylistic shift vs. document
  - `calibration_drift` — general-knowledge regression on a packaged 30-item pack
  - `leakage` — greedy LCS recall + perturbation fragility
- **Ablation**
  - `adapter_ablation` *(signature primitive)* — λ-scaled divergence curve with linearity, saturation, overshoot metrics
- **Baseline**
  - `null_adapter` — per-kind null distribution matrix for z-score calibration (wired end-to-end in Unreleased)

### Infrastructure

- `DifferentialBackend` + `ScalableDifferentialBackend` protocols
- HuggingFace + PEFT backend with `disable_adapter` / `set_adapter` toggling and LoRA-scale mutation
- Dummy backend for unit tests (canned responses + linear-blend scalable mode)
- YAML spec loader, composite score (four-category weighted), rich terminal + JSON + JUnit + Markdown reports
- Typer CLI: `run`, `gate`, `check`, `diff`, `autogen`, `doctor`, `report`
- `.dlm` bridge (`dlm-sway[dlm]`): resolver + full-battery autogen
- Matplotlib visualizations (`dlm-sway[viz]`): SIS bar chart, ablation curve, KL histogram

### Known gaps

- MLX backend loads **two** model copies in memory (base + adapter-fused) because `mlx_lm` has no runtime adapter toggle. Memory footprint is ~2x the HF path; fine for the small (<3B) models MLX typically runs.
- MLX requires a pre-converted `.npz` adapter — raw PEFT safetensors are rejected by `mlx_lm.load`. A PEFT-to-MLX converter is a future milestone.
- PyPI publication of the `dlm-sway` wheel is pending a clean CI release workflow.