Changelog
Unreleased
Sprint 21 — Audit 03 closure
Closes the Audit 03 short list — 2 🟠 major + 5 🟡 minor findings the
audit recommended for v0.1.0 readiness. Full audit at
.docs/audits/03-final-audit.md. Verdict was YELLOW-leaning-GREEN
with zero critical findings; this sprint pushes it to GREEN on the
in-scope items. Deferred: F01 (MLX converter — its own feature
sprint, S24), F05 (resolves naturally at v0.1.0 tag), F06 (dlm API
compat test — lands with v0.1.0 release in S22). v0.1.0 PyPI
publish is deferred to S22 (requires release coordination +
credentials).
🟠 Major — degenerate z-score clipping (F02).
- probes/null_adapter: null stats now carry an explicit degenerate: 1.0 field when the calibration ran but produced an unusable baseline (runs: 1, or every seed producing the exact same raw). The std floor at 1e-6 is preserved for valid-but-tight multi-seed nulls so they still calibrate. Audit observed a +290,766σ leakage-probe z under runs: 1; post-fix, z_score refuses and the probe falls back to fixed thresholds with a clear footer rollup.
- probes/_zscore.z_score: refuses when stats["degenerate"] is truthy, independent of the std check. Belt-and-suspenders on top of the existing MIN_STD guard.
- suite/report.collect_degenerate_null_kinds: new rollup surface. Terminal + markdown footers gain a "N probe kind(s) had a degenerate null baseline — bump runs:" block, distinct from the existing null-opt-outs rollup.
- 12 new unit tests across test_null_calibration.py, test_zscore_helpers.py, test_report_extras_rollup.py, and test_probe_external_perplexity.py. Existing test_std_floor_prevents_runaway_zscore was rewritten to assert the fixed behavior (the pre-fix contract WAS the audit's bug).
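A minimal sketch of the guard described above, not the project's actual code. The function signature, the stats keys other than "degenerate", and the floor-by-clamping behavior are assumptions for illustration. It shows how a zero-spread baseline protected only by a std floor yields a z in the hundreds of thousands, and how the explicit flag refuses instead.

```python
from typing import Mapping, Optional

MIN_STD = 1e-6  # assumed to match the std floor quoted in the changelog


def z_score(raw: float, stats: Mapping[str, float]) -> Optional[float]:
    """Return a z-score, or None when the null baseline is flagged unusable.

    Hypothetical helper: the "mean"/"std" keys are assumptions; only the
    "degenerate" key is named in the changelog.
    """
    # Refuse outright when calibration marked the baseline as degenerate,
    # regardless of what the std check below would decide.
    if stats.get("degenerate"):
        return None
    # Floor the std so tight-but-valid multi-seed nulls still calibrate.
    std = max(stats["std"], MIN_STD)
    return (raw - stats["mean"]) / std


# Floor alone: a zero-variance null divides by 1e-6 and explodes.
runaway = z_score(0.290766, {"mean": 0.0, "std": 0.0})
# With the flag set, the caller gets None and can fall back to fixed thresholds.
refused = z_score(0.290766, {"mean": 0.0, "std": 0.0, "degenerate": 1.0})
```

Returning None rather than raising keeps the fallback decision with the probe, which matches the "falls back to fixed thresholds" behavior described above.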
🟡 Minor — CI resilience (F03).
- pytest-timeout>=2.3 + tenacity>=9.0 in dev deps.
- tests/fixtures/tiny_model.py wraps snapshot_download with exponential-backoff tenacity retry (3 attempts, 5-10-20s backoff) + etag_timeout=10 to bound per-file head probes. Benefits every slow+online test.
- tests/integration/test_determinism_golden.py gains @pytest.mark.timeout(600). A silent network hang now surfaces as a test failure with actionable output (audit observed 20m workflow-timeout hang on Sprint 19 merge run 24747915467).
🟡 Minor — outlier-miner pool guard (F04).
- mining/outlier_miner.mine_outliers: raises SwayError when the pool has fewer than 2·top_k distinct scored prompts, with an actionable --top-k N hint. Pre-fix: a 1-distinct-prompt pool produced top=[p], bottom=[p] (identical lists, so no outlier contrast). Guard applies AFTER scoring so unsupported probe kinds still return the empty-result path.
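The pool guard can be sketched as follows. This is an illustration under assumptions: split_outliers is a hypothetical name, SwayError is a local stand-in for the project's error type, and the wording of the hint is invented.

```python
class SwayError(Exception):
    """Local stand-in for the project's error type (assumption)."""


def split_outliers(scored: dict[str, float], top_k: int) -> tuple[list[str], list[str]]:
    """Return (top, bottom) prompt lists from a {prompt: divergence} mapping.

    Dict keys are distinct by construction, so len(scored) is the number
    of distinct scored prompts.
    """
    if len(scored) < 2 * top_k:
        # Without 2*top_k distinct prompts the two lists would overlap.
        raise SwayError(
            f"pool has {len(scored)} distinct scored prompts, need at least "
            f"{2 * top_k}; add prompts or pass --top-k {len(scored) // 2}"
        )
    ranked = sorted(scored, key=scored.__getitem__, reverse=True)
    return ranked[:top_k], ranked[-top_k:]


top, bottom = split_outliers({"a": 3.0, "b": 1.0, "c": 2.0, "d": 0.5}, top_k=2)
```

Because the check runs on the scored mapping, it only fires after scoring, which is the ordering the entry above calls out.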
🟡 Minor — autogen skipped-probes comment (F07).
- integrations/dlm/autogen.collect_skipped_probe_reasons: new public helper returning (probe_kind, reason) tuples for probes _build_suite intentionally omitted for the given handle.
- _render_annotated_yaml: when skipped is non-empty, header gains a # skipped: <kind> (<reason>) block. Users no longer have to diff the autogen source to understand which probes are missing from their generated sway.yaml.
🟡 Minor — CI Node.js 20 deprecation silence (F08).
- .github/workflows/ci.yml sets FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: "true" at the workflow level. Silences the Node 20 deprecation warnings each CI log currently carries (deadline June 2026). Temporary until every action we pull bumps to Node 24 natively.
🟡 Minor — portable dlm_source (F09).
- integrations/dlm/autogen._portable_dlm_source emits a cwd-relative path when the .dlm lives inside the cwd, absolute otherwise. The cwd-relative form survives cross-machine checkout (CI agents, other devs); the old absolute-path emission broke the moment a committed autogen'd YAML was run on a different host.
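A short sketch of the relative-when-inside, absolute-otherwise rule using pathlib. The function name and signature here are hypothetical and not taken from the project.

```python
from pathlib import Path


def portable_source(dlm_path: Path, cwd: Path) -> str:
    """Emit a cwd-relative POSIX path when dlm_path is under cwd, else absolute."""
    resolved = dlm_path.resolve()
    try:
        # relative_to raises ValueError when resolved is not under cwd.
        return resolved.relative_to(cwd.resolve()).as_posix()
    except ValueError:
        return str(resolved)


# Hypothetical layout: two sibling project directories under the current cwd.
project = Path.cwd() / "proj_a"
inside = portable_source(project / "docs" / "a.dlm", project)
outside = portable_source(Path.cwd() / "proj_b" / "b.dlm", project)
```

Resolving both sides first means a symlinked checkout directory does not accidentally push an in-tree file onto the absolute branch.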
Other.
- 640 unit tests pass (up from 626 before S21 — 14 new regression tests across F02/F03/F04/F07/F09).
- Pragmatic deviations from original scope: S21 was audit-scoped to ~2 days of cleanup. Actual depth: 6 findings closed with 14 new regression tests. One existing test (test_std_floor_prevents_runaway_zscore) was explicitly rewritten to flip its assertion, because the pre-fix contract the test encoded WAS the audit's reported bug.
Sprint 19 — Pre-commit hook sway gate
Closes Audit 01 stretch-list F-item "pre-commit hook sway gate." Ships
a .pre-commit-hooks.yaml declaring two hook variants so
pre-commit.com users can gate their adapter
on every commit that touches a spec, .dlm file, or adapter directory.
- Two hooks, pick your posture.
  - sway-gate: language: system. Uses the sway install on the user's PATH. Fast, zero install cost. Recommended default.
  - sway-gate-isolated: language: python + git-based additional_dependencies (dlm-sway[hf] @ git+...@<SHA>). Self-contained: pre-commit builds a fresh venv and installs sway + torch + transformers on first run (~5 GB, ~2 min).
- Rev pinning stance. Example config pins to a commit SHA, not HEAD. Sway is pre-v0.1.0 (no tagged release yet) and HEAD drifts silently on every pre-commit autoupdate. SHA pinning is the honest pre-release pattern; migration to rev: v0.1.0 is a README edit once the first release lands.
- pass_filenames: false on both variants. sway gate takes exactly one spec path; the user declares it via args:. The files: regex decides when the hook fires. No ambiguity.
- Scope discipline. Neither hook surfaces --json or --markdown flags. The hook gates (non-zero on FAIL) and nothing else. Users wanting report artifacts run sway run separately.
- README "Pre-commit" section documents the consumer-side config for both variants, the SHA-pinning rationale, and the one-time install cost of the isolated variant.
- examples/precommit-example/: template sway.yaml, consumer-side .pre-commit-config.yaml, and a README walk-through.
- tests/integration/test_pre_commit_hook.py (slow + online), three cases: (a) parse-and-shape smoke test on .pre-commit-hooks.yaml; (b) pass-case runtime: spawns pre-commit run sway-gate as a subprocess against a tmp git repo with a real LoRA on SmolLM2-135M, asserts exit 0; (c) fail-case: same fixture with an impossible assert_mean_gte, asserts non-zero exit + gate FAILED banner.
- pre-commit>=3.8 in [dependency-groups].dev so the integration test can invoke it as a subprocess. Keeps the tool out of the user's runtime deps.
Sprint 18 — Cross-platform determinism golden
Closes Audit 01 stretch-list F-item "cross-platform determinism golden test." The README's "deterministic on CPU where possible" claim held in theory; now it's pinned by CI on two platforms.
- New module src/dlm_sway/core/golden.py. Tolerance-aware JSON comparator with compare_goldens(actual, expected, *, logprob_tol=1e-6, score_tol=1e-4) and mask_variable_fields for stripping timestamps, wall-seconds, duration_s, backend_stats, sway_version, and cwd-resolved path identifiers (adapter_id, base_model_id) before comparison. No torch dep; runs in the fast lane.
- New integration test tests/integration/test_determinism_golden.py (slow + online). Builds a deterministically-seeded LoRA on SmolLM2-135M, runs a minimal 2-probe suite (delta_kl + calibration_drift), and diffs the JSON output against tests/golden/expected_<platform>.json. SWAY_UPDATE_GOLDENS=1 toggles regen mode; missing golden → SKIP with a regen recipe.
- New CI matrix determinism-golden with strategy.matrix.os: [ubuntu-latest, macos-latest] in .github/workflows/ci.yml. Triggered by changes under tests/golden/**, src/dlm_sway/core/golden.py, or the test file itself (plus the standard schedule/dispatch/push triggers).
- workflow_dispatch regen mode: dispatching the CI workflow with regenerate_goldens=true flips SWAY_UPDATE_GOLDENS=1 on both matrix legs and uploads the regenerated JSONs as per-platform artifacts. Meant for deliberate "yes I changed the algorithm" flows.
- Tolerance rationale (documented in core/golden.py): 1e-6 for logprob-like numeric fields sits above typical BLAS-implementation drift (1e-8–1e-7 band between OpenBLAS on linux and Accelerate on darwin) but below real algorithm-change drift. 1e-4 for score fields absorbs composition noise.
- 25 new unit tests for the comparator covering mask coverage, tolerance thresholds, structural diffs (missing keys, length mismatches, type mismatches), NaN/inf edge cases, and a realistic two-masked-payloads round-trip.
- Pragmatic scope deviation: the sprint envisioned a checked-in 5 MB adapter binary. Instead the test builds it from a fixed seed at runtime (same pattern as test_external_perplexity_e2e and test_cluster_kl_e2e). Avoids the regeneration chore noted in the sprint's risks section.
- Linux golden bootstrap: first PR ships expected_darwin.json only; the linux leg SKIPs with a recipe pointing at the workflow_dispatch regen mode. Maintainer dispatches, downloads the artifact, commits expected_linux.json, and the next CI run asserts cleanly on both platforms. One-time onboarding cost.
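A tolerance-aware comparison of this kind might look like the sketch below. The two function names and the tolerance defaults come from the entry above; everything else is an assumption for illustration: the masked-key set is a subset, any key containing "logprob" gets logprob_tol, every other number gets score_tol, and two NaNs count as equal.

```python
import math
from typing import Any

# Assumed subset of the fields the changelog says are masked.
MASKED_KEYS = {"duration_s", "backend_stats", "sway_version", "adapter_id", "base_model_id"}


def mask_variable_fields(payload: Any) -> Any:
    """Recursively drop keys whose values vary run to run."""
    if isinstance(payload, dict):
        return {k: mask_variable_fields(v) for k, v in payload.items() if k not in MASKED_KEYS}
    if isinstance(payload, list):
        return [mask_variable_fields(v) for v in payload]
    return payload


def _is_number(x: Any) -> bool:
    return isinstance(x, (int, float)) and not isinstance(x, bool)


def compare_goldens(actual: Any, expected: Any, *, logprob_tol: float = 1e-6,
                    score_tol: float = 1e-4) -> list[str]:
    """Return human-readable diffs; an empty list means the payloads match."""
    diffs: list[str] = []

    def walk(a: Any, e: Any, path: str, key: str) -> None:
        if isinstance(a, dict) and isinstance(e, dict):
            for k in sorted(set(a) | set(e)):
                if k not in a or k not in e:
                    diffs.append(f"{path}.{k}: present on one side only")
                else:
                    walk(a[k], e[k], f"{path}.{k}", k)
        elif isinstance(a, list) and isinstance(e, list):
            if len(a) != len(e):
                diffs.append(f"{path}: length {len(a)} != {len(e)}")
            else:
                for i, (x, y) in enumerate(zip(a, e)):
                    walk(x, y, f"{path}[{i}]", key)
        elif _is_number(a) and _is_number(e):
            if math.isnan(a) and math.isnan(e):
                return  # assumption: NaN on both sides is a match
            tol = logprob_tol if "logprob" in key else score_tol
            if not math.isclose(a, e, rel_tol=0.0, abs_tol=tol):
                diffs.append(f"{path}: {a!r} != {e!r} (tol {tol})")
        elif a != e:
            diffs.append(f"{path}: {a!r} != {e!r}")

    walk(actual, expected, "$", "")
    return diffs


expected = {"score": 0.5, "logprob": -1.0, "duration_s": 3.2}
drifted = {"score": 0.50005, "logprob": -1.0000001, "duration_s": 9.9}
clean = compare_goldens(mask_variable_fields(drifted), mask_variable_fields(expected))
```

Returning a list of path-qualified diffs instead of a bare boolean makes a failing golden test print which field moved.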
Sprint 17 — Adversarial paraphrase mining + outlier-prompt miner
Closes Audit 01 innovation item F11. Adds sway mine — an evaluation
companion to paraphrase_invariance that surfaces the paraphrases a
memorizing adapter most reliably fails on, plus a secondary mode that
ranks delta_kl prompts by per-prompt divergence.
sway mine --mode paraphrase — per paraphrase_invariance case:
- Generate candidate paraphrases via nlpaug SynonymAug (WordNet, deterministic under a fixed seed; back-translation is a user-supplied escape hatch to keep the default fast and offline).
- Embed candidates through the shared MiniLM cache (same 80 MB load adapter_revert pulls) and greedy farthest-first select the top-K most pairwise-distant ones, which dodges nlpaug's tendency to emit near-duplicate synonym swaps.
- Rank by per-token lift gap: `(ft(prompt, gold) - base(prompt, gold)) - (ft(candidate, gold) - base(candidate, gold))`. Large positive gap = the candidate breaks the adapter's lift = the adapter doesn't generalize there.
- Emit sway-mined-paraphrase.yaml with a mined_cases: block paste-compatible with paraphrase_invariance.cases.
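The diversity filter and the ranking score from the steps above can be sketched in plain Python. This is an illustration under assumptions: Euclidean distance stands in for whatever embedding distance the miner uses, the selection is seeded with the first candidate, and both function names are hypothetical.

```python
import math
from typing import Sequence


def farthest_first(vectors: Sequence[Sequence[float]], k: int) -> list[int]:
    """Greedy farthest-first selection: return indices of up to k spread-out vectors."""
    if not vectors or k <= 0:
        return []
    chosen = [0]  # assumption: seed the selection with the first candidate
    # nearest[i] = distance from vector i to its closest already-chosen vector
    nearest = [math.dist(v, vectors[0]) for v in vectors]
    while len(chosen) < min(k, len(vectors)):
        remaining = [i for i in range(len(vectors)) if i not in chosen]
        nxt = max(remaining, key=nearest.__getitem__)
        chosen.append(nxt)
        nearest = [min(d, math.dist(v, vectors[nxt])) for d, v in zip(nearest, vectors)]
    return chosen


def lift_gap(ft_prompt: float, base_prompt: float, ft_cand: float, base_cand: float) -> float:
    """Adapter lift on the original prompt minus its lift on the candidate."""
    return (ft_prompt - base_prompt) - (ft_cand - base_cand)


# The near-duplicate at index 1 is skipped in favor of the two distant points.
picked = farthest_first([(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (0.0, 5.0)], k=3)
gap = lift_gap(-1.0, -3.0, -2.5, -3.0)
```

Here the adapter lifts the original prompt by 2.0 nats but the candidate by only 0.5, so the gap of 1.5 marks the candidate as one the adapter fails to generalize to.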
sway mine --mode outliers — rank a prompt pool by per-prompt
delta_kl.raw. Pool defaults to the spec's own delta_kl prompts;
--from-corpus public_domain_en draws from S09's CC0 corpus instead.
Emits top-K and bottom-K blocks: the highest-divergence prompts (best
for tightening a gate) and lowest-divergence (candidates to drop from
the suite). leakage and paraphrase_invariance have section/case-based
specs; outlier mining on those is future work, documented
inline in mining/outlier_miner.py.
- New package src/dlm_sway/mining/: two modules, paraphrase_miner.py and outlier_miner.py, plus a thin corpus_prompts helper for the --from-corpus path.
- New CLI subcommand sway mine SPEC --mode paraphrase|outliers [--out FILE] [--top-k N] [--n-candidates N] [--from-corpus NAME] [--seed N]. Paste-compatible YAML emission by default; --out overrides the filename.
- 18 new unit tests: 7 for the paraphrase miner (ranker, diversity filter, dedup, input validation), 8 for the outlier miner (delta_kl ranking, corpus wiring, unsupported-kind skip), 3 for the CLI (paraphrase + outliers modes + no-prompts error path).
- Prove-the-value test at tests/unit/test_paraphrase_miner_prove_value.py. On a deliberately-memorizing dummy backend the hand-written paraphrase list passes with generalization_ratio > 0.5; substituting the mined list (same seed, same adapter) drops the ratio below 0.5 and flips the verdict to FAIL, with a ≥ 0.3 ratio gap.
- Dependency footprint: no new extras. The paraphrase miner uses nlpaug (already in [style]) + sentence-transformers (in [semsim]); the diversity filter reuses the embedder adapter_revert already pulls. Graceful BackendNotAvailableError with a pip hint when the extras aren't installed.
Sprint 20 — Audit 02 closure
Closes the full finding inventory from .docs/audits/02-followup-audit.md
(1 🔴 critical, 8 🟠 major, 11 🟡 minor, 4 stronger-test opportunities, 3
dead-code items). Sixteen sprints of innovation work had accumulated
since Audit 01; the re-audit surfaced one ship-blocker (S14's bootstrap
CI dropped at the runner boundary) plus a grab-bag of claim-vs-code
gaps, dead branches, and untested contracts.
🔴 Critical — ship-blocker.
- F01 —
suite/runner.py:_with_durationnow forwards everyProbeResultfield. The pre-fix version silently droppedci_95on every probe, making S14's headline deliverable runtime-inert in any realsway run. Regression test at the runner boundary + snapshot fixture carrying a populatedci_95pin the fix.
🟠 Major — claim-vs-code gaps.
- F02 — real
sklearn.cluster.KMeansexercised by two new unit tests (separation + seed determinism) + a slow+online integration test attests/integration/test_cluster_kl_e2e.py. Before: every cluster_kl test monkeypatched_kmeans_clusterwith an argmax stub, leaving S16's reason-for-being unverified. - F03 —
sway list-probesfalls back to the defining module's__doc__when a probe class has no docstring. Every one of the 13 shipped probes now renders a summary row. - F04 —
sway doctorprobesplotly(the load-bearing[viz]dep),sklearn(S16 cluster_kl), and theapiextras (httpx,tenacity). - F05 — README primitives table reflects 13 probes (adherence + cluster_kl, calibration + external_perplexity, baseline row for null_adapter).
- F06 —
TwoModelDifferentialcomposessafe_for_concurrent_viewsfrom its two inner backends. Before: the attribute was missing and the runner defaulted toFalseeven when both inners setTrue, breaking S13's concurrency claim at the only supported differential- use pattern. - F07 —
integrations/dlm/autogenemitscluster_klwhen the prompt pool clears 20 entries; markdown report gains a per-probe "Cluster breakdown" section with per-cluster mean KL + exemplars. - F08 —
backends/dummy._NullViewRNG useshashlib.md5instead of Python's PYTHONHASHSEED-saltedhash(). Cross-process determinism test attests/unit/test_cross_process_determinism.pypins the fix. - F09 — trace writer ↔ analyzer round-trip test at
tests/unit/test_runner_backend_stats.py: runs a suite withtrace_path=, loads the file back, asserts probe labels + hit/miss counts matchbackend_stats.
🟡 Minor — dead code, doc drift, latent brittleness.
- F10 — removed dead
_ft_viewbranch inprobes/prompt_collapse.py. - F11 — pytest plugin resolves spec paths against
config.rootpath, not process cwd. - F12 —
backends/api._post_completionsdelegates to tenacity'sRetrying()callable; removes the unreachable post-loop fallback. - F13 —
preference_flipWARN branch emitsci_95andz_by_rankfor consistency with every other numeric probe's WARN shape. - F14 —
adapter_ablationmodule docstring documents whyci_95renders as em-dash (it's a curve-fit, not a sample-mean aggregator). - F15 — report footer surfaces
null_adapteropt-outs (probes withcalibrate_spec=None) in both terminal and markdown surfaces. - F16 — report category ordering derives from
DEFAULT_COMPONENT_WEIGHTSand accepts unknown categories from customProbesubclasses. - F17 —
cluster_kldegenerate zero-variance case now returnsVerdict.WARNwithevidence["degenerate_zero_variance"]=Trueand suppresses z-score to avoid a spurious small-sample-noise calibration. - F18 —
_calibration_pack.pygains per-section provenance notes (F18 audit trail) without noisy per-item annotations. - F19 — pytest plugin defers
dlm_sway.core.result/dlm_sway.core.errorsimports to call sites so non-`@pytest.mark.sway` users don't pay the load tax. - F20 —
sway check--helpdocuments the σ-banner's null-calibration dependency (fall-through to composite score band when null SKIPs).
Stronger-test opportunities — regression tests for things we weren't pinning.
- #9 —
probes/_divergencerejects effectively-uniform TokenDists (spread < 1e-9) via_check_non_degenerate_token_dist. Catches a shape-broken lm_head that would otherwise compute a trivial constant divergence across prompts. Two compatibility touchups rode with the change: dummy backend's synthesizedftdist and cluster_kl test fixture_dist_broadgained a tiny monotonic perturbation to clear the guard (real models never produce bit-uniform logits thanks to fp32 accumulation noise). - #10 — cross-verdict consistency test
(
tests/unit/test_cross_verdict_consistency.py): runs the same spec throughsway run,sway gate, andsway report --format junitand asserts identical per-verdict tallies. - #11 —
sway doctor --jsonschema-shape snapshot test locks the top-level keys and the per-extra module-name set. - #12 — subprocess determinism test asserts
_NullView.next_token_distis byte-identical acrossPYTHONHASHSEEDvalues; pins F08's fix.
Dead-code inventory — DC3–DC5.
- DC3 —
null_adapter.py's pre-S10 cache-promotion branch is annotated as legacy; removal queued for the next minor. - DC4 — dropped the stderr warning at
suite/runner.pythat fired wheneverconcurrent_probes > 1even though execution never fanned out. The spec field is still accepted (future pool will light it up); the noisy warning is gone. - DC5 —
tests/unit/test_model.pygrows coverage ofModelSpec's dtype enum,endpoint,trust_remote_code, andcustom/apikindbranches.
Final state: 582 unit tests + 3 integration tests passing; mypy strict clean across 55 source files; ruff + format clean across 137 files. Sprint 20 is a single PR composed of per-finding, per-file commits so the history reads as a fix-by-fix cleanup.
Sprint 16 — Cluster-coherent KL probe
Closes Audit 01 innovation item F8. Adds a new cluster_kl probe that
answers the question delta_kl can't: the mean divergence may be the
same, but did the adapter shift the right topics, or is it a uniform
blunt-instrument shift?
- New probe cluster_kl (category: adherence). Embeds prompts via shared MiniLM (same cache key as adapter_revert), k-means clusters them at a fixed seed, measures per-prompt JS divergence between base and ft, and reports a specificity ratio: between_variance / (between_variance + within_variance). Range [0, 1]: ≈ 0.5 on a blunt adapter that shifts every topic by the same amount, → 1.0 on a topic-targeted adapter. The pair (mean_kl, specificity) tells a more honest story than either number alone.
- Bootstrap CI on specificity. Resamples (divergence, cluster_label) pairs with replacement and takes the 2.5/97.5 percentiles of the bootstrap ratio distribution. Returns None below 4 prompts, matching the convention core.stats.bootstrap_ci uses.
- Z-score calibration against null_adapter baseline via the existing _zscore helpers. calibrate_spec synthesizes 8 mixed-topic sentinel prompts + k=2 for the null pass; the specificity distribution concentrates around 0.5 there, so a real adapter's separation above that baseline reads as a clean z-score.
- Degenerate-input policy. < min_prompts (default 20) → SKIP with a clear message; num_clusters * 2 > num_prompts → SKIP (per-cluster mean not well-resolved); empty prompt list → ERROR. Missing [semsim] extras → SKIP with a pip install 'dlm-sway[semsim]' hint.
- Zero-variance fallback. If every prompt produced the same divergence (canned stub data, no adapter motion), the specificity ratio is mathematically undefined; we return 0.5, the null-adapter expectation, so downstream z-score reports "no signal" instead of NaN.
- New dependency. scikit-learn>=1.4 added to the [semsim] extra (riding the 80 MB MiniLM load that probe already pulls in). Mirrored to [all] and to mypy's stubless-override list.
- 7 unit tests in tests/unit/test_probe_cluster_kl.py: two-topic adapter → high specificity, uniform adapter → 0.5 fallback, too-few prompts → SKIP, empty prompts → ERROR, num_clusters > prompts/2 → SKIP, CI bracketing, missing-extras SKIP path.
- Prove-the-value test at tests/unit/test_cluster_kl_prove_value.py. Two backends with comparable delta_kl (ratio < 3×) have specificity scores that split by at least 0.3, concrete evidence that cluster_kl surfaces a structural distinction delta_kl merges.
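The specificity ratio and its zero-variance fallback can be sketched as below. The changelog does not state which variance estimator the probe uses, so this sketch assumes a sum-of-squares decomposition (between-cluster versus within-cluster); absolute values from the real probe may differ, and the function name is hypothetical.

```python
from collections import defaultdict
from statistics import fmean
from typing import Hashable, Sequence


def specificity(divergences: Sequence[float], labels: Sequence[Hashable]) -> float:
    """between / (between + within), with 0.5 when the ratio is undefined."""
    grand = fmean(divergences)
    groups: dict[Hashable, list[float]] = defaultdict(list)
    for d, lab in zip(divergences, labels):
        groups[lab].append(d)
    # Assumed estimator: size-weighted squared distance of cluster means from
    # the grand mean, versus squared distance of points from their cluster mean.
    between = sum(len(g) * (fmean(g) - grand) ** 2 for g in groups.values())
    within = sum((d - fmean(g)) ** 2 for g in groups.values() for d in g)
    total = between + within
    if total == 0.0:
        return 0.5  # every prompt moved identically: fall back, no NaN
    return between / total


targeted = specificity([0.1, 0.1, 0.9, 0.9], [0, 0, 1, 1])
flat = specificity([0.3, 0.3, 0.3, 0.3], [0, 0, 1, 1])
```

In the first call all of the spread sits between the two clusters, so the ratio is at the top of its range; in the second there is no spread at all and the fallback applies.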
Sprint 15 — pytest plugin (@pytest.mark.sway)
Closes Audit 01 innovation item F10. Packages sway as a pytest library so teams already running pytest adopt sway with a single decorator instead of a subprocess wrapper.
- New plugin (
src/dlm_sway/pytest_plugin.py) auto-loaded via thepytest11entry point afterpip install 'dlm-sway[pytest]'.@pytest.mark.sway(spec="...", threshold=0.0, weights=None)expands a single pytest function into one test item per probe in the referenced spec + an optional__gate__item that fires only when the composite score drops belowthreshold. - Verdict translation.
FAIL/ERROR→ pytest Failed,SKIP→ pytest Skipped,WARN→ pytest warning,PASS→ pytest pass. Probe-level failures isolate: a failing adherence probe doesn't mask a failing calibration one, andpytest -k adherenceruns just that probe. - Suite runs once per decorated function. A session-scoped
_SuiteCachekeyed on(spec_path, weights)ensures the N-way item expansion doesn't multiply backend wall time. Two@pytest.mark.swaytests against the same spec share one run. - Malformed marks fail cleanly. Missing
spec, non-numericthreshold, non-dictweights, and unknown kwargs produce a synthetic_ConfigErrorItemwith a green-field pytest failure line — no cryptic collection-time tracebacks. - New
[pytest]extra — justpytest>=8.0. Matches the pattern other plugin-style extras use. Also registers thepytest11entry point so the plugin is discovered automatically on install, consistent with pytest-cov / pytest-xdist. - Example directory at
examples/pytest_integration/with a minimalsway.yaml+test_sway_gate.pyshowing the decorator replacing a legacysubprocess.run(["sway", "gate", ...])wrapper. README gains a "Pytest integration" section pointing at it. - 13 new unit tests via pytest's canonical
pytesterfixture: marker registration, N-item expansion, FAIL / SKIP / ERROR routing, gate below-threshold failure, gate above-threshold pass, gate absence whenthreshold=0, malformed-mark error paths, cache sharing across multiple decorated tests. - Drive-by fix for the S14 CI-narrowing test flake. The dummy
backend's per-prompt noise was seeded via Python's
hash(), which is salted per-process viaPYTHONHASHSEED. Swapped to a stablehashlib.md5-derived seed so the narrowing invariant holds deterministically across test-order permutations.
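For the drive-by fix above, a process-stable seed can be derived from a digest instead of the built-in hash(). A minimal sketch; the helper name, the separator byte, and the 64-bit truncation are assumptions.

```python
import hashlib


def stable_seed(*parts: str) -> int:
    """Derive a 64-bit seed that is identical in every process.

    Python's built-in hash() on str is salted per process via PYTHONHASHSEED,
    so it cannot be used for reproducible seeding; an md5 digest can.
    """
    joined = "\x1f".join(parts).encode("utf-8")  # unit separator avoids "a"+"b" == "ab"
    digest = hashlib.md5(joined, usedforsecurity=False).digest()
    return int.from_bytes(digest[:8], "big")


seed = stable_seed("delta_kl", "prompt-0")
```

md5 is used here purely as a fast, stable mixing function, not for security, hence usedforsecurity=False (available from Python 3.9).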
Sprint 14 — Bootstrap CIs + forward-pass trace CLI
Closes Audit 01 innovation items F9 (bootstrap confidence intervals on raw metrics) + F12 (polished forward-pass trace CLI). Paired sharpenings of existing probe output and existing trace infrastructure.
- New core/stats.bootstrap_ci: percentile-bootstrap 95% CI on any sequence of per-sample measurements. Numpy-only, 1000-resample default, seeded from ctx.seed so intervals are reproducible. Short-circuits on non-finite / empty / degenerate-constant inputs; vectorized resample keeps the overhead ~1 ms per probe.
- ProbeResult.ci_95: tuple[float, float] | None is a new field, threaded through safe_finalize (nulled if raw is nulled, so a CI never brackets a defensive-null point estimate). to_json / from_json persist the pair as a two-list.
- Six aggregating probes emit ci_95: delta_kl (over per-prompt divergences), calibration_drift (over per-item regression indicators), external_perplexity (over per-chunk deltas), leakage (over clean-recall rates), paraphrase_invariance (over verbatim lifts), section_internalization (over effective SIS scores). Each also lands the interval under evidence["raw_ci_95"] for JSON consumers.
- Report tables add a ci95 column between raw and z in both terminal and markdown output. format_ci renders as [lo, hi] with em-dash for missing/non-finite. Markdown snapshot refreshed; JSON snapshot refreshed for the new per-probe ci_95 field.
- New sway trace <jsonl> CLI. Reads the forward-pass trace JSONL the S07 runner writes when --trace <path> is set, aggregates into per-probe and per-view wall-time + hit-rate tables, plus a top-N slowest-events table. --format terminal|md|json, --slowest K to tune the tail size. The parsing tolerates old trace shapes (missing probe / hit fields) and skips malformed lines.
- suite/trace_analysis: typed dataclasses (TraceEvent, ProbeSummary, ViewSummary, TraceReport) + three renderers (terminal / markdown / json) matching the conventions of suite/report and suite/compare.
- Committed trace fixture at tests/fixtures/trace_sample.jsonl (8 events, 2 probes × 4 prompts with cache hits). Drives 22 new trace-analysis + CLI tests.
- Prove-the-value (F9): on a dummy backend with per-prompt variation, delta_kl's CI narrows from [0.33, 0.41] (width 0.079) at N=4 prompts to [0.38, 0.42] (width 0.044) at N=32, which is 1.8× tighter in width, tracking the √(N₂/N₁) ≈ 2.8× theoretical scaling minus dummy-backend dispersion.
- Prove-the-value (F12): the CLI's terminal output on the committed fixture correctly identifies sis/base (520.3 ms) as the single slowest event, sis (1,529 ms) as the slower probe over dk (529 ms), and ft (1,357 ms) as the slower view over base (701 ms).
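A percentile bootstrap of the mean, as described for bootstrap_ci above, can be sketched with the standard library alone. The shipped helper is described as numpy-based and vectorized; this version trades speed for zero dependencies. The 4-sample minimum is taken from the convention mentioned under Sprint 16, and the percentile indexing is a simple nearest-rank assumption.

```python
import math
import random
from statistics import fmean
from typing import Optional, Sequence

MIN_SAMPLES = 4  # assumed from the "returns None below 4 prompts" convention


def bootstrap_ci(samples: Sequence[float], *, n_resamples: int = 1000,
                 seed: int = 0) -> Optional[tuple[float, float]]:
    """Percentile-bootstrap 95% CI of the mean, or None on unusable input."""
    values = list(samples)
    if len(values) < MIN_SAMPLES:
        return None
    if not all(math.isfinite(v) for v in values):
        return None
    if max(values) == min(values):
        return None  # degenerate-constant input: no spread to resample
    rng = random.Random(seed)  # seeded so the interval is reproducible
    means = sorted(
        fmean(rng.choices(values, k=len(values))) for _ in range(n_resamples)
    )
    lo = means[round(0.025 * (n_resamples - 1))]
    hi = means[round(0.975 * (n_resamples - 1))]
    return lo, hi


interval = bootstrap_ci([0.30, 0.35, 0.40, 0.45, 0.50], seed=1)
```

Because resampled means scale roughly with 1/√N, widening the sample from 4 to 32 prompts would ideally shrink the interval by about √8, which is the theoretical factor the F9 entry compares against.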
Sprint 13 — OpenAI-compatible HTTP scoring backend
Closes Audit 01 innovation item F7. Unlocks sway against hosted
fine-tunes — OpenAI platform, vllm serve, Ollama — without
requiring a local torch + PEFT load.
- New backend
ApiScoringBackend(backends/api.py): scores against a/v1/completionsendpoint via httpx. ImplementsScoringBackendonly (notDifferentialBackend) since one endpoint is one model; users compose twoApiScoringBackendinstances behind the existingTwoModelDifferentialwrapper for a full differential run. Method surface:logprob_ofviaecho=True+logprobs=0,rolling_logprobvia the same,next_token_distviamax_tokens=1+logprobs=K, pluspreflight_finite_checkand anApiScoringBackend.generatefor probes that need text output (e.g.leakage). - Token-boundary handling. The API returns tokens as strings, so
logprob_ofwalks the echoed tokens by character length until the running total covers the prompt, then sums the rest. When the boundary falls mid-token, the partial token lands on the prompt side — over-counts the prompt, under-attributes to the completion — and the behavior is asserted by a dedicated test (test_mid_token_prompt_leans_conservative). - Retry + preflight. Tenacity handles 5xx + network errors with
exponential backoff (configurable
max_retries). Preflight hits the endpoint once withhello,max_tokens=1,logprobs=1; rejects non-finite logprobs before the suite runs. - First backend with
safe_for_concurrent_views=True. HTTP is stateless, so the S07 concurrent-probe scheduler can dispatch against an API backend in parallel as soon as the pool implementation lands (still scaffolding in the runner). ModelSpecadditions:BackendKindgains"api";ModelSpec.endpointfield carries the server's base URL (the/v1/completionspath is appended by the backend). API key fromSWAY_API_KEY→OPENAI_API_KEYenv var fallback, so secrets stay out of the YAML.backends.builddispatch routeskind="api"toApiScoringBackend. Combined withbuild_two_separate+TwoModelDifferential, a YAML withdefaults.differential: falseand twokind: apimodels Just Works end-to-end.- New
[api]extra —httpx>=0.27+tenacity>=9.0. Core sway stays torch-free; the[api]extra adds ~1 MB of deps vs the[hf]extra's 3 GB. - Unit tests (21 new) use httpx's
MockTransportto intercept every call and assert numeric outputs match canned OpenAI-shaped responses — covers all three scoring methods, cache dedup, 4xx error surfacing, 503→200 retry recovery, API-key env fallback, NaN-response preflight rejection, and the token-boundary math. - Prove-the-value (§F7):
tests/integration/test_api_ollama.pyis opt-in viaSWAY_OLLAMA_URL+SWAY_OLLAMA_MODEL. Runs the full scoring surface against a live Ollama serving a small model (e.g.llama3.2:1b) and asserts finite output + preflight pass. Documents the wall-time budget the "≤3× HF backend" claim rests on once the concurrent-dispatch pool lands.
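The token-boundary walk described under "Token-boundary handling" above can be sketched as follows. The function name and argument shapes are hypothetical; the inputs mimic the parallel token and logprob lists an echoing completions endpoint returns, where the first token's logprob is typically null.

```python
from typing import Optional, Sequence


def completion_logprob(prompt: str, tokens: Sequence[str],
                       logprobs: Sequence[Optional[float]]) -> float:
    """Sum logprobs of echoed tokens that start at or after the prompt's end.

    A token that straddles the boundary starts inside the prompt, so it is
    attributed to the prompt side (the conservative choice described above).
    """
    covered = 0  # characters consumed by tokens seen so far
    total = 0.0
    for tok, lp in zip(tokens, logprobs):
        if covered >= len(prompt) and lp is not None:
            total += lp
        covered += len(tok)
    return total


tokens = ["Hello", " world", "!"]
lps = [None, -0.5, -0.25]
clean_boundary = completion_logprob("Hello", tokens, lps)
mid_token = completion_logprob("Hello wor", tokens, lps)
```

In the second call the prompt ends inside " world", so that token's logprob is left out of the completion sum and only "!" is counted.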
Sprint 12 — Interactive HTML report
Closes Audit 01 innovation item F6. Adds the exploration surface to sit alongside the terminal (CI logs) and markdown (PR artifacts) outputs — a single-file interactive HTML page for the research / write-up case.
sway report result.json --format html --out report.htmlemits a self-contained HTML page with five interactive Plotly panels: composite-score gauge with banded thresholds, per-category horizontal-bar breakdown, per-section SIS bar chart (whensection_internalizationevidence carries aper_sectionarray), adapter-ablation response curve (whenadapter_ablationevidence carrieslambdas+mean_divergence_per_lambda), and an all-probe score × z-score scatter with hover tooltips. The terminal verdict palette (green / yellow / red) carries across every panel.- Plotly bundle inlined once in
<head>; each panel is a Plotly-produced<div>with a stablesway-*id so snapshot tests don't churn on Plotly point releases. No external<script src>/<link href>references — the page loads offline from a single ~4.9 MB file (Plotly 6.x JS is the majority of that; the sway wrapper + chart data is ~40 KB). - CLI gates
--out PATHon file-producing formats.--format htmlrequires--out(3 MB of JS has no business on stdout);--format md|json|junitnow also accept--outfor symmetry with the HTML path.--format terminalrejects--out— the terminal renderer is for the console only. plotly>=5.20is optional, shipped via the existing[viz]extra (alongside matplotlib). Without it,--format htmlexits 2 with thepip install 'dlm-sway[viz]'install hint. Graceful ImportError → RuntimeError translation tested end-to-end.- Snapshot at
tests/snapshots/report.htmllocks the Sway-owned wrapper structure. The Plotly JS bundle is stripped from the snapshot (replaced with a placeholder) so the 43-line snapshot stays human-reviewable and Plotly version bumps don't drift it. - Prove-the-value:
test_html_from_real_history_loads_offlinerenders HTML from the committedtests/fixtures/sway-history/02-*run (4 probes, real exported payload) and asserts: file parses viahtml.parser, zero external<script src>/<link href>references, every probe name appears in the body, file size lands between 1 MB and 10 MB. Measured output: 4.87 MB.
Sprint 11 — sway compare across saved JSON runs
Closes Audit 01 innovation item F5. Ships the regression-dashboard primitive every CI integration reached for but had to script.
- New CLI subcommand
sway compare. Accepts N saved result JSONs (typically fromsway run --json), rehydrates each viareport.from_json, folds them into a score matrix, and renders: a per-probe score table with columns per run, per-adjacent-pair delta columns (colored red on drop / green on lift), and a composite-score timeline row. Formats:terminal(default, Rich),--format md,--format json. --fail-on-regression <threshold>. Exits 1 when any probe's score in the newest run dropped ≥thresholdvs the prior run.threshold=0(default) disables the gate. Gate fires after the output is emitted so CI logs always capture the matrix even on red builds.suite/compare.pymodule. Clean separation:build_matrixfolds(SuiteResult, SwayScore)pairs into aCompareMatrixdataclass;render_{terminal,markdown,json}consume that. No filesystem IO in the module itself — the CLI owns the reads, the renderers own the writes. Probes that disappeared between runs show asNone/ em-dash in the matrix; new probes show asNoneon older runs. Union of probe names is sorted for stable row order across invocations.- Markdown snapshot locked at
tests/snapshots/compare.md— silent schema drift breaks the test like every other snapshot. - Committed history fixture + prove-the-value test.
tests/fixtures/sway-history/{01,02,03}-*.jsonship a three-run narrative: baseline → retrained-improved → over-trained. Run 03 plants asection_internalizationdrop of 0.22 and acalibration_driftdrop of 0.25 whiledelta_klstill rises (the memorization-without-generalization failure mode).test_compare_catches_planted_regressioninvokessway compare sway-history/*.json --fail-on-regression 0.10and asserts exit=1 with both regressed probes surfaced in the JSON payload and terminal text — exactly the F5 "CI gate the build on regression" experiment.
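The --fail-on-regression rule above reduces to a comparison between the two newest runs. A minimal sketch with hypothetical names and made-up scores, chosen so the drops match the 0.22 and 0.25 planted in the fixture:

```python
from typing import Optional


def regressed_probes(prev: dict[str, Optional[float]],
                     new: dict[str, Optional[float]],
                     threshold: float) -> list[str]:
    """Names of probes whose score dropped by at least threshold."""
    if threshold <= 0:
        return []  # threshold=0 disables the gate
    out = []
    for name in sorted(set(prev) & set(new)):
        before, after = prev[name], new[name]
        if before is None or after is None:
            continue  # probe missing in one run: nothing to compare
        if before - after >= threshold:
            out.append(name)
    return out


# Illustrative scores only; the real fixture values are not shown in the changelog.
run_02 = {"section_internalization": 0.80, "calibration_drift": 0.75, "delta_kl": 0.60}
run_03 = {"section_internalization": 0.58, "calibration_drift": 0.50, "delta_kl": 0.70}
regressed = regressed_probes(run_02, run_03, threshold=0.10)
exit_code = 1 if regressed else 0
```

Computing the exit code only after the list is built mirrors the "gate fires after the output is emitted" ordering, so the matrix can be printed before the process exits.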
Sprint 10 — Multi-rank adversarial null adapters
Closes Audit 01 innovation item F4. Extends the S02 null-calibration matrix with a per-rank profile — users now read "how rank-saturated is my adapter?" straight off the report.
- `NullAdapterSpec.rank_multipliers: list[float] = [1.0]` (new). Default preserves single-rank behavior byte-for-byte. Setting `[0.5, 1.0, 2.0]` calibrates three independent null distributions per probe kind and emits a z-profile alongside the verdict z-score.
- `NullCalibratedBackend.as_null_adapter` gains a `rank_scale` kwarg. Both shipped backends (dummy + HF) implement it by scaling the null-weight noise std by `sqrt(rank_scale)` — mathematically equivalent to a rank change in terms of the LoRA output variance (`A·B` is a sum of `r` rank-1 outer products; variance is linear in `r`). No tensor-shape surgery, no model reload per multiplier.
- `RunContext.null_stats_by_rank` threads the per-rank matrix through the runner. Keys are canonical `rank_{mult:.2f}` strings; inner structure matches `null_stats`.
- Numeric probes emit `evidence["z_by_rank"]`. Every numeric probe (delta_kl, calibration_drift, external_perplexity, leakage, adapter_revert, adapter_ablation, paraphrase_invariance, preference_flip, prompt_collapse, section_internalization, style_fingerprint) computes per-rank z-scores using its existing sign convention. The verdict path still reads the 1.0x group — no existing thresholds shift.
- Report surfaces the profile. Terminal + markdown probe notes append `rank profile: +4.2σ @ 1x / +6.8σ @ 0.5x / +2.1σ @ 2x` whenever multi-rank calibration ran. `format_z_profile` helper centralizes the rendering.
- Null-stats disk cache widens key to include `rank_multipliers`. Single-rank caches from pre-S10 runs are still readable — they're promoted to the new `null_stats_by_rank` shape on load.
- Bug fix (`external_perplexity`): drop erroneous z sign flip. `mean_delta` is higher-is-better (ft logprob minus base logprob on external prose), so the raw z-score maps directly onto `z >= assert_z_gte`. S09's sign flip was reversing the pass/fail direction when the null-calibration path ran.
- README: rank-profile interpretation guide. Explains when the shape indicates rank saturation vs. rank oversizing, plus the "low rank can be pathologically quiet" dual-reading caveat.
- Prove-the-value (`tests/unit/test_null_multi_rank.py`): on a fixed adapter with the dummy backend, the delta_kl z-profile is strictly monotone in inverse rank (z@0.5x > z@1.0x > z@2.0x) — exactly the signature of a rank-scaled null distribution.
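
A toy sketch of why the profile is monotone in inverse rank. Since variance is linear in `r`, the noise std scales with `sqrt(rank_scale)`; the sketch then assumes (a simplification, not something the entries above state) that the null metric's spread scales the same way. The helper names are hypothetical; only the `rank_{mult:.2f}` key format comes from the changelog.

```python
import math


def rank_key(mult: float) -> str:
    """Canonical key format quoted in the changelog: rank_{mult:.2f}."""
    return f"rank_{mult:.2f}"


def scaled_noise_std(base_std: float, rank_scale: float) -> float:
    """Variance is linear in rank, so the std scales with sqrt(rank_scale)."""
    return base_std * math.sqrt(rank_scale)


def z_profile(raw: float, null_mean: float, null_std_at_1x: float,
              multipliers: list[float]) -> dict[str, float]:
    """Toy z-profile: assumes the null metric's spread tracks the noise std."""
    profile = {}
    for mult in multipliers:
        std = scaled_noise_std(null_std_at_1x, mult)
        profile[rank_key(mult)] = (raw - null_mean) / std
    return profile


# Made-up inputs: one fixed raw value measured against three null widths.
profile = z_profile(raw=0.5, null_mean=0.1, null_std_at_1x=0.1,
                    multipliers=[0.5, 1.0, 2.0])
```

With a fixed raw value, a narrower null (0.5x) yields the largest z and a wider null (2x) the smallest, which is the ordering the prove-the-value test pins.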
Sprint 09 — External-perplexity-gap probe
Closes Audit 01 innovation item F3. First innovation sprint landed on top of the audit-closure campaign (S01–S07).
- New probe `external_perplexity` (`probes/external_perplexity.py`): rolling-logprob delta of ft vs base on held-out public-domain English prose. Raw metric is `mean_delta_nats` (per-token). Negative = adapter raised perplexity on external text (diffuse forgetting); positive = adapter improved English modeling incidentally. Category `calibration`; scored alongside `calibration_drift`.
- Packaged public-domain corpus (`probes/_corpora/public_domain_en.txt`, ~14 KB): hand-assembled from twelve US-public-domain passages (Lincoln, Emerson, Twain, Thoreau, Austen, Darwin, Shakespeare, Franklin, Melville, Dickinson, US Constitution, Federalist #10). Each passage carries an inline `# -- source:` provenance comment identifying the work and the "US public domain (pre-1929)" basis. Loader strips the comments at read time so the probe sees only raw prose.
- Null calibration: `calibrate_spec()` returns a 4-chunk cheap version so `null_adapter` produces per-kind stats without dominating suite runtime. Sign-flipped z-score — lower raw delta is worse, so the shared `z >= assert_z_gte` semantics read as "σ better than noise" on external fluency.
- `autogen` wiring: emits an `external_ppl` suite entry (with `corpus=public_domain_en`, `max_chunks=8`) whenever the source `.dlm` has any PROSE section. Intent table explains it as the "diffuse-forgetting complement to `calibration_drift`."
- Prove-the-value test (`tests/unit/test_ext_ppl_vs_calibration_drift.py`): a dummy backend that applies a uniform −0.3 nats/token drift to every pack item and every corpus chunk — below `calibration_drift`'s 1.0-nat per-item regression threshold, above its −0.5 mean-delta gate, but well below `external_perplexity`'s −0.1 fixed-threshold. `calibration_drift` reports PASS, `external_perplexity` reports FAIL. The two probes measure different failure modes and do not substitute for each other.
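
The threshold arithmetic in the prove-the-value test can be checked by hand. The two functions below are toy stand-ins written for this illustration, not the real probes; the threshold values are the ones quoted above, while the function names and the strict-vs-inclusive comparison choices are assumptions.

```python
def calibration_drift_passes(item_deltas: list[float]) -> bool:
    """Toy version of the two gates quoted above (not the real probe)."""
    worst_regression = max(-d for d in item_deltas)  # nats lost on the worst item
    mean_delta = sum(item_deltas) / len(item_deltas)
    return worst_regression < 1.0 and mean_delta > -0.5


def external_perplexity_passes(chunk_deltas: list[float]) -> bool:
    """Toy fixed-threshold check: mean per-token delta must stay at or above -0.1."""
    mean_delta_nats = sum(chunk_deltas) / len(chunk_deltas)
    return mean_delta_nats >= -0.1


uniform_drift = [-0.3] * 8  # the same -0.3 nats/token everywhere
drift_verdict = calibration_drift_passes(uniform_drift)
ext_ppl_verdict = external_perplexity_passes(uniform_drift)
```

A uniform −0.3 drift clears both `calibration_drift` gates but falls below the −0.1 external threshold, so the first check passes and the second fails.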
Sprint 07 — Performance & caching
Closes Audit 01 findings B19 (deferred per design note), E-cache-opportunity.
- Forward-pass cache (`backends/_instrumentation.py`): every backend view now routes `next_token_dist` / `rolling_logprob` / `logprob_of` through a bounded LRU keyed on `(op, view_id, prompt_hash, top_k)`. Baseline A/B on a tiny 4-probe suite against SmolLM2-135M on CPU: 2.33s → 1.76s (25% wall-time reduction, forward passes 88 → 62, hit rate 30%). The audit's 18.5s Quillstone suite has more overlap and would see larger gains; the 25% floor holds on any suite with repeated prompts across probes.
- Cache key includes `view_id`: `"base"` / `"ft"` / `"scaled_1.25"` / `"null_42"` are distinct namespaces. A future toggle regression that fails to flip the adapter would surface as wrong-side cached values immediately, not silently corrupt divergence math.
- Backend stats surface in the report: `SuiteResult.backend_stats` captures `cache_hits` / `cache_misses` / `forward_passes` / `scoring_wall_s` / `hit_rate`. Terminal + markdown footers render `cache: 26/88 = 30%` when stats are present.
- Forward-pass tracing: `sway run --trace <path.jsonl>` writes one event per backend scoring call (probe / view_id / prompt_hash / top_k / op / wall_ms / hit). Zero overhead when unset.
- `concurrent_probes` scaffolding: `spec.defaults.concurrent_probes: int = 1` field plus `safe_for_concurrent_views: bool = False` class attribute on HF / MLX / Dummy backends. The runner warns on stderr when the user requested > 1 against an unsafe backend and stays sequential. Custom backends that are already concurrency-safe (e.g. a stateless hosted-API backend) can opt in without waiting for the HF fix.
- B19 design note: `.docs/design/backend-concurrency.md` documents current state, why v0.1 doesn't fix it, and two future paths (per-thread lock vs per-worker pool) with recommendation.
- Generation stays uncached: `view.generate()` intentionally bypasses the cache — probe-side callers vary `(prompt, max_new_tokens, temperature, seed)` in ways that would rarely collide, and caching sampled output would hide seed bugs behind stale strings.
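
A minimal sketch of a bounded LRU keyed the way the first bullet describes. It is not the code in `backends/_instrumentation.py`: the class name, the SHA-256 prompt hash, and the `get_or_compute` signature are assumptions made for illustration; only the key tuple `(op, view_id, prompt_hash, top_k)` comes from the entry above.

```python
import hashlib
from collections import OrderedDict
from typing import Any, Callable


class ForwardPassCache:
    """Bounded LRU keyed on (op, view_id, prompt_hash, top_k). Illustrative only."""

    def __init__(self, maxsize: int = 1024) -> None:
        self._store = OrderedDict()
        self._maxsize = maxsize
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, op: str, view_id: str, prompt: str, top_k: int,
                       compute: Callable[[], Any]) -> Any:
        # Assumption: the prompt is hashed with SHA-256; the real hash may differ.
        prompt_hash = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        key = (op, view_id, prompt_hash, top_k)
        if key in self._store:
            self.hits += 1
            self._store.move_to_end(key)     # mark as most recently used
            return self._store[key]
        self.misses += 1
        value = compute()                    # a real forward pass would run here
        self._store[key] = value
        if len(self._store) > self._maxsize:
            self._store.popitem(last=False)  # evict the least recently used entry
        return value

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


cache = ForwardPassCache(maxsize=2)
cache.get_or_compute("next_token_dist", "base", "hello", 8, lambda: "base-dist")
cache.get_or_compute("next_token_dist", "ft", "hello", 8, lambda: "ft-dist")
again = cache.get_or_compute("next_token_dist", "base", "hello", 8, lambda: "stale")
```

Because `view_id` is part of the key, the `"base"` and `"ft"` calls for the same prompt occupy separate entries, and the third call is served from the `"base"` entry without invoking its `compute` callback.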
Sprint 06 — CLI & report UX polish
Closes Audit 01 findings D3, D4, D5, D6, D7, D8, D9, D10, D11, D12, D13, D14, D15, B16, B17, B18.
- D3 — extras rollup footer: terminal + markdown reports now end with a single `pip install 'dlm-sway[...]'` line collecting every extra mentioned in SKIP messages. Single rollup, no per-row scan.
- D4 — `sway check` infers `--base`: reads `base_model_name_or_path` from the adapter's `adapter_config.json` when `--base` is omitted; echoes "(inferred base model: …)" so the user knows what got picked. Falls back to a clear required-flag error when the field is missing.
- D5 — `sway autogen` annotates the YAML: generated specs carry a header block (source `.dlm`, dlm_id, base, adapter, generation timestamp, sway version) and a one-line `# intent` comment above every probe entry. No new dep — pyyaml + a position-based post-processor instead of `ruamel.yaml`.
- D6 — `sway run --dry-run` + `sway list-probes`: dry-run validates the spec, prints a probe table (#, name, kind, category, enabled), and exits 0 without building a backend. `list-probes` prints every shipped probe kind with its category and the first line of its docstring.
- D7 — `sway doctor --json`: machine-readable doctor payload (`sway_version`, `python`, `platform`, `extras`) keyed by extra and module. CI-grep-friendly.
- D8 — `dlm_source` resolution warns instead of silently swallowing: when the spec sets `dlm_source` but the `[dlm]` extra isn't installed (or the bridge errors), the runner emits a yellow stderr warning with the install hint and continues with `sections=None`. No more silent "why are my section probes SKIPping?" debugging.
- D9 — markdown column parity with terminal: the markdown probes table now carries `score`, `raw`, `z`, `duration`, and `note` columns. A `Top findings` section follows the table; a `Skipped probes` section closes when extras are missing.
- D10 — unified number formatters: `format_score`, `format_raw`, `format_z`, `format_duration_s` live in `suite/report.py` and are the only numeric formatters used. Thousands separators above 1 000; `—` glyph for `None` / non-finite. Single source kills cross-surface drift.
- D11 — `sway report --format` validated via `StrEnum`: the Typer-level enum rejects unknown formats with the standard "Invalid value for '--format'" error. No more silent fallback to the terminal renderer on a typo.
- D12 — `sway check` verdict banner: prints `✅ adapter is +4.2σ above noise` (green ≥ 3σ), `⚠️ adapter is +1.5σ above noise — marginal` (yellow ≥ 1σ), or `❌ adapter is +0.3σ — indistinguishable from noise` (red) above the full report. Calibrated on the delta_kl z-score; `check` now runs `null_adapter` first so the z-score is available.
- D13 — `sway diff` regression summary: after per-probe deltas, the diff prints `A→B: N regressed >0.10, M regressed >0.20, composite Δ=±X.XX` color-coded by direction.
- D14 — adapter paths with spaces render safely: `_adapter_label` wraps the path in double quotes when any whitespace is present.
- D15 — long messages wrap, don't truncate: the terminal probe table column uses Rich's `overflow="fold"` instead of an 80-char hard cut with an ellipsis. The full message is always visible.
- B16 — single markdown renderer: `report.from_json` round-trips saved JSON back into the canonical dataclass pair, so `sway report --format md` and `sway run --markdown` both flow through `report.to_markdown` with identical output. The legacy `_render_markdown_from_json` and `_render_junit_from_json` helpers in `cli/commands.py` are deleted.
- B17 — `None` scores render as `—`: the unified formatters cover every render path; no more `0.00` masking missing data.
- B18 — `baseline` row labeled `(informational, weight=0)`: in both terminal and markdown component breakdowns, matching the Sprint 03 explicit-weight-zero decision.
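
The three D12 banner tiers reduce to two threshold comparisons. The sketch below is a hypothetical rendering helper, not the CLI's own code: the function name is invented and color handling is omitted, while the 3σ / 1σ cut-offs and the banner wording are taken from the D12 entry.

```python
def verdict_banner(z: float) -> str:
    """Map a delta_kl z-score to the three banner tiers (thresholds from D12)."""
    if z >= 3.0:
        return f"✅ adapter is {z:+.1f}σ above noise"
    if z >= 1.0:
        return f"⚠️ adapter is {z:+.1f}σ above noise — marginal"
    return f"❌ adapter is {z:+.1f}σ — indistinguishable from noise"


banners = [verdict_banner(z) for z in (4.2, 1.5, 0.3)]
```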
Sprint 05 — Probe quality & edge cases
Closes Audit 01 findings B6, B7, B8, B9, B10, B11, B12, B13, B14, B20, B21, B22.
- B6 — `tail_logprob` semantics: the field is now `float | None` with three discrete states: `None` (top-k covered the full vocab — no tail to redistribute), `0.0` (a real tail underflowed below fp32), and a measurable negative log-prob. HF, MLX, and dummy backends all emit `None` when `k == vocab`. Preflight check guards against non-finite `tail_logprob` only when it's a number.
- B7 — probe validation before backend build: new `validate_all_probes(suite)` helper collects every spec error in one pass; `_execute_spec` calls it before materializing the backend so a typo in `kind:` surfaces immediately, and all typos surface in a single error message.
- B8 — `autogen` style prompts: `style_fingerprint` no longer receives the leading sentence of a prose section (which elicited doc content, not stylistic voice). Replaced with a fixed `_STYLE_ELICITATION_PROMPTS` set of 6 open-ended, content-neutral prompts.
- B9 — sentence-transformer caching: `_load_embedder` is now `functools.lru_cache(maxsize=4)`. Suites that run `adapter_revert` back-to-back (multi-adapter diff, repeated probes) reuse the same ~80 MB embedder instead of re-loading every probe call.
- B10 — `_lcs_ratio` docstring rot: the function name is a historical misnomer (it's gestalt similarity via `difflib.SequenceMatcher`, not LCS). Docstring rewritten to make the contract explicit; rename deferred to v0.2 to preserve the public API surface.
- B11 — leakage adversarial perturbations: added 4 new perturbations (`synonym_swap` with a hand-curated 50-pair table, `clause_reverse`, `prefix_inject`, `register_shift`) alongside the original three. Default perturbation list is now all 7; the spec accepts any subset.
- B12 — calibration pack expansion: `BUILT_IN_PACK` grew from 30 to 200 items (geography, natural sciences, arithmetic, language, history, biology, technology, miscellaneous). All public-domain / hand-composed grade-school facts — no third-party dataset license attaches. `items_limit` docstring documents the new resolution (~0.5 pp per regressed item vs the old ~3.3 pp).
- B13 — tokenizer-aware `_stuffing`: `prompt_collapse` now derives its padding from the model's pad / unk / EOS token via the backend's tokenizer, making the metric language-agnostic. The pre-B13 hardcoded English string remains as the dummy-backend fallback and behind a `legacy_stuffing: bool = False` spec field for one-release backward-compat.
- B14 — `preference_flip` per-triple error fence: wrapped each triple's `logprob_of` calls in `try/except ProbeError`. A single bad triple no longer kills the whole batch; `evidence["dropped_triples"]` + `dropped_reasons` surface the count + first 5 reasons. When every triple raises, the probe routes to ERROR with a clear explanation.
- B20 — custom backend protocol checks: `_load_custom` now isinstance-checks `NullCalibratedBackend` and `ScalableDifferentialBackend` after the `DifferentialBackend` check. The set of satisfied protocols is stamped onto the instance as `__sway_protocols__: tuple[str, ...]` so the report can show which features are available without re-checking.
- B21 — `RunContext.null_stats` truly frozen: the runner now wraps the stats dict in `types.MappingProxyType` before threading it through. The dataclass was already `frozen=True` but the dict was mutable by reference; B21 makes the docstring's "frozen" claim literally true. Field type widened from `dict[str, dict[str, float]]` to `Mapping[str, Mapping[str, float]]`.
- B22 — `ModelSpec.adapter` path normalization: added a pydantic `field_validator` that runs `Path.expanduser().resolve()` on the field at spec-load time. Backends no longer re-do the work; the cache key in `_null_cache.compute_key` is now stable regardless of how the user spelled the path in YAML or on the CLI.
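
The B21 fix relies on `types.MappingProxyType`, which gives a read-only view over a dict. The sketch below shows the effect; the helper name `freeze_null_stats` is hypothetical, and wrapping the inner dicts as well as the outer one is an assumption (the entry above only says the stats dict is wrapped).

```python
from types import MappingProxyType
from typing import Mapping


def freeze_null_stats(
    stats: dict[str, dict[str, float]],
) -> Mapping[str, Mapping[str, float]]:
    """Wrap both levels read-only (hypothetical helper; inner wrap is an assumption)."""
    return MappingProxyType(
        {kind: MappingProxyType(dict(inner)) for kind, inner in stats.items()}
    )


# Made-up stats for illustration.
frozen = freeze_null_stats({"delta_kl": {"mean": 0.02, "std": 0.005, "n": 5.0}})
try:
    frozen["delta_kl"]["mean"] = 0.0  # type: ignore[index]
    mutated = True
except TypeError:
    mutated = False  # mappingproxy objects reject item assignment
```

Reads still work as on a normal dict, which is why widening the field type to `Mapping[...]` is enough for consumers.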
Sprint 04 — Integration & regression testing
Closes Audit 01 findings C1, C2, C5, C6, C7, C8, C10, C11, C12, B3 (test side), B15.
- Slow lane runs end-to-end (C1): the `huggingface_hub` dev dep added in S01 is verified; `pytest -m "slow or online"` now executes every checked-in integration test from a clean clone.
- HF backend coverage 21% → 91% (C2) — measured combined fast + slow lane against `dlm_sway.backends.hf`. New tests:
  - `tests/integration/test_hf_adapter_toggle.py` extended with a bit-identical ft → base → ft roundtrip (B15 mitigation).
  - `tests/integration/test_hf_scaled_adapter.py`: λ sweep monotonicity + `LoraLayer.scaling[key]` restoration on clean exit AND on exception.
  - `tests/integration/test_hf_null_adapter.py`: same-seed determinism, different-seeds divergence, original adapter restoration on clean exit AND on exception.
  - `tests/integration/test_hf_scoring.py`: `logprob_of` (incl. zero-token-completion → ProbeError), `rolling_logprob`, `next_token_dist`, `_HFView.generate` (greedy + sampled-with-seed determinism).
  - `tests/unit/test_backend_hf_helpers.py`: direct unit coverage on `_resolve_dtype` and `_detect_device` so dtype regressions are caught in the fast lane.
- MLX smoke test (C5): `tests/integration/test_mlx_smoke.py` exercises `MLXDifferentialBackend` on darwin-arm64 with a small LoRA adapter. Skips cleanly on non-darwin / non-arm64 / no-mlx_lm / missing-fixture. Reproducible adapter builder ships at `tests/fixtures/build_mlx_adapter.py` (run once on a Mac to populate the fixture directory).
- `sway gate` exit code pinned (C6): `tests/integration/test_sway_gate_exit_code.py` covers PASS (exit 0), FAIL verdict (exit 1), and below-threshold-with-passing verdicts (exit 1) via Typer's `CliRunner`.
- `dlm` import ban regression-guarded (C7): `tests/unit/test_dlm_not_imported.py` patches every `dlm.*` entry in `sys.modules` to `None` and asserts the dummy suite runs end-to-end without ImportError, plus that `sway autogen` surfaces a clean install-hint error rather than a stack trace.
- Pathological probe coverage (C8 + B3 test side): `tests/unit/test_probe_adapter_ablation.py` now drives monotonically-decreasing curves through the helper and pins probe-level `evidence["saturation_reason"]` for flat / found / overshoot-with-dip / non_monotonic shapes via a monkeypatched `divergence`.
- WARN-branch numerical formula pinned (C10): `test_warn_branch_score_formula_pinned` in `tests/unit/test_probe_preference_flip.py` asserts the exact `score = 0.5 + mean_delta / 4.0` formula with a hand-computed expected value.
- Disjoint top-k divergence (C12): `test_disjoint_top_k_supports_produce_finite_divergence` in `tests/unit/test_divergence.py` covers the `aligned_probs` tail-redistribution path against fully-disjoint base / ft supports — JS comes out finite and approaches its theoretical ln(2) bound.
- Report schema snapshots (C11): `tests/unit/test_report_snapshot.py` byte-compares `to_json` / `to_markdown` / `to_junit` against checked-in snapshots under `tests/snapshots/`. Intentional schema bumps: `SWAY_UPDATE_SNAPSHOTS=1 uv run pytest …` and commit the updated files. No new dep — hand-rolled diff helper.
- CI workflow shipped: `.github/workflows/ci.yml` runs the fast lane (unit + lint + mypy) on every push / PR, and the slow lane (integration, HF backend) on schedule (nightly 07:00 UTC), manual dispatch, push to main, and PRs that touch `src/dlm_sway/backends/`, `tests/integration/`, or `pyproject.toml`.
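
The ln(2) bound in the C12 entry is easy to reproduce with a textbook Jensen-Shannon implementation. This is a plain-Python sketch written for illustration; it is not the project's `_divergence` module and it does not model the `aligned_probs` tail-redistribution step, so on exactly disjoint supports it reaches the bound rather than merely approaching it.

```python
import math


def kl(p: list[float], q: list[float]) -> float:
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js(p: list[float], q: list[float]) -> float:
    """Jensen-Shannon divergence in nats, bounded by [0, ln 2]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


base = [0.5, 0.5, 0.0, 0.0]  # base puts all mass on tokens 0 and 1
ft = [0.0, 0.0, 0.5, 0.5]    # ft puts all mass on tokens 2 and 3
disjoint_js = js(base, ft)
```

Each KL term compares a distribution against a mixture with half its mass, giving ln 2 per side; the average is ln 2, the largest value JS can take.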
Sprint 03 — Documentation truth & dead code
Closes Audit 01 findings P03, P04, P05, P07 (doc), P08, P09, P10, P14, P15, P17, P18.
- Determinism is now wired (P09): the suite runner calls `dlm_sway.core.determinism.seed_everything(spec.defaults.seed)` before any backend work runs. The achieved determinism class (`strict` / `best_effort` / `loose`) is captured in `SuiteResult.determinism` and surfaces in the report footer + JSON payload + markdown header.
- `defaults.differential: false` works end-to-end (P14): new `backends/two_model.py` with `TwoModelDifferential` wrapper + `build_two_separate(spec_models)` helper. `_execute_spec` routes to the wrapper when the flag is false. Doubles memory; primarily for custom backends that can't toggle adapters in place.
- `score_weights` overridable from YAML and CLI (P15): new `SuiteDefaults.score_weights` field with subset-validating pydantic field validator (rejects unknown categories, negative weights, all-zero weights). New `--weights k=v,k=v` flag on `sway run` and `sway gate`. Partial overrides merge with `DEFAULT_COMPONENT_WEIGHTS` so users rarely have to respecify all five categories.
- `baseline` row labeled `(informational)` with weight 0.0 explicit (P08, B18): added to `DEFAULT_COMPONENT_WEIGHTS` so the row appears in reports for transparency but contributes nothing to the composite. Terminal renderer adds the `(informational)` annotation column; markdown table adds a `weight` column with the same label.
- Extended (9-dim) style fingerprint when `[style]` is installed (P10): `style_fingerprint` now actually uses spaCy POS-tagging and textstat syllable counting. Three new dims appended to the existing 6: passive-voice rate, POS 4-gram entropy (Shannon, bits), syllables per word. Spec field `extended: "auto" | "on" | "off"` (default `"auto"`) controls the path. `evidence["schema_version"]` bumped to `2` so snapshot consumers can branch on dimensionality.
- Stale "Known gaps" entries refreshed (P03, P04, P05): removed "null_adapter not yet wired" (delivered in S01/S02), "custom backend stubbed" (it's full), "MLX paths raise" (rewritten to describe the real limitation: two model copies because `mlx_lm` has no runtime adapter toggle).
- README adds determinism + weights + differential documentation and a calibration paragraph that points at the per-kind null matrix.
- `delta_kl` docstring rot fixed (P17): dropped the dead `` :mod:`dir` `` reference.
- Meta-test `test_no_dead_options.py` (P14/P15 regression guard): greps the source tree for documented spec/CLI option names and asserts each has a consumer outside its declaration site. Catches the exact "documented but unused" pattern Audit 01 flagged.
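
The P15 validation and merge rules can be sketched without pydantic. Everything below is an assumption-laden illustration: the weight values are placeholders rather than the shipped defaults, the lowercase category keys are inferred from the primitive groupings, the helper names are invented, and applying the all-zero check to the merged result (rather than to the override alone) is a guess.

```python
DEFAULT_COMPONENT_WEIGHTS = {  # placeholder values, not the shipped defaults
    "adherence": 0.3,
    "attribution": 0.3,
    "calibration": 0.2,
    "ablation": 0.2,
    "baseline": 0.0,  # informational row, contributes nothing
}


def merge_score_weights(overrides: dict[str, float]) -> dict[str, float]:
    """Validate a partial override, then merge it over the defaults."""
    unknown = set(overrides) - set(DEFAULT_COMPONENT_WEIGHTS)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    if any(w < 0 for w in overrides.values()):
        raise ValueError("weights must be non-negative")
    merged = {**DEFAULT_COMPONENT_WEIGHTS, **overrides}
    if all(w == 0 for w in merged.values()):
        raise ValueError("at least one weight must be positive")
    return merged


def parse_weights_flag(text: str) -> dict[str, float]:
    """Parse the `k=v,k=v` form accepted by a --weights style flag."""
    pairs = (item.split("=", 1) for item in text.split(","))
    return {key.strip(): float(value) for key, value in pairs}


weights = merge_score_weights(parse_weights_flag("adherence=0.5,ablation=0.1"))
```

Only the two named categories change; the other three keep their default values, which is the "partial overrides merge" behavior the entry describes.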
Sprint 02 — Universal z-score calibration
Closes Audit 01 findings P02 (delivery), B2, C9.
- `NullAdapterProbe` is now a per-kind calibration matrix: iterates every downstream numeric kind in the suite (or an explicit `calibrate_kinds` list), runs a miniature version of each through the `NullCalibrationBackendProxy` for N seeds, and publishes `{kind: {mean, std, n}}` under `evidence["null_stats"]`. The old "publishes two keys only" implementation is gone (closes B2).
- `NullCalibrationBackendProxy` (`_null_proxy.py`): swaps `as_finetuned()` to yield `as_null_adapter(seed)` so every numeric probe's own math produces the "what does my metric look like when the fine-tune is structural noise?" distribution without a bespoke code path per probe.
- `Probe.calibrate_spec(ctx)` classmethod plus `SENTINEL_PROMPTS` / `SENTINEL_DOC` shared constants in `probes/base.py`. Each numeric probe overrides `calibrate_spec` to return a small, cheap spec for calibration; probes that can't be meaningfully calibrated (e.g. `adapter_revert` needs an embedder, `adapter_ablation` needs `as_scaled_adapter`, `prompt_collapse` can't fit an exponential decay to null noise) opt out by returning `None` and surface `(no calibration for <kind>)` in the report.
- Shared `_zscore` helpers (`probes/_zscore.py`): `z_score(raw, stats)`, `verdict_from_z(z, threshold)`, `score_from_z(z)`, `no_calibration_note(kind)`. Every numeric probe now flows through these — no bespoke `(raw - mean) / std` math left in any probe file. Includes the `MIN_STD = 1e-6` floor so a degenerate null distribution (e.g. a single-seed calibration) can't produce an infinite z-score (closes C9).
- All 10 numeric probes thread `get_null_stats(ctx, self.kind)`: `delta_kl`, `adapter_revert`, `prompt_collapse`, `section_internalization`, `paraphrase_invariance`, `preference_flip`, `style_fingerprint`, `calibration_drift`, `leakage`, `adapter_ablation`. Each has an `assert_z_gte: float = 3.0` that is preferred when stats exist. Lower-is-better probes (`adapter_revert`, `calibration_drift`, `leakage`) sign-flip the z internally so the shared `z >= threshold` PASS rule still reads as "significantly better than null". Two probes (`paraphrase_invariance`, `calibration_drift`) keep their intent-aware / compound thresholds as the no-calibration fallback.
- On-disk null-stats cache (`probes/_null_cache.py`, `backends/hf.py`): HF backend exposes `cache_identity()`; `NullAdapterProbe` hashes `(backend_identity, runs, init_scale, seed_base, top_k, kinds)` into a stable filename under `~/.dlm-sway/null-stats/<key>.json` (XDG-respecting). Cache is best-effort — a missing / malformed file rebuilds. Disable per-suite with `cache: false` in the spec, or globally with `SWAY_DISABLE_NULL_CACHE=1`. The dummy backend doesn't expose a cache identity, so tests never touch disk unless they opt in.
- Report shows z-scores in every numeric probe row: the terminal renderer already had a `z` column; the markdown renderer now does too. Rows that fell back to fixed thresholds carry `(no calibration for <kind>)` directly in the message.
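
A minimal sketch of the shared z-score math as this sprint describes it. The function names match those listed above, but the bodies are guesses for illustration: the real helpers may return richer types, the verdict is simplified to two outcomes, and `lower_is_better_z` is a hypothetical name for the internal sign flip. The Sprint 21 entry later makes `z_score` refuse degenerate baselines, which this sketch does not model.

```python
from typing import Mapping, Optional

MIN_STD = 1e-6  # floor quoted in the changelog


def z_score(raw: float, stats: Optional[Mapping[str, float]]) -> Optional[float]:
    """(raw - mean) / std with a std floor; None when no calibration exists."""
    if stats is None:
        return None
    std = max(stats["std"], MIN_STD)
    return (raw - stats["mean"]) / std


def verdict_from_z(z: float, threshold: float = 3.0) -> str:
    """Shared PASS rule: z >= threshold (simplified to PASS / FAIL here)."""
    return "PASS" if z >= threshold else "FAIL"


def lower_is_better_z(raw: float, stats: Mapping[str, float]) -> float:
    """Sign-flip so 'significantly below null' reads as a large positive z."""
    return -((raw - stats["mean"]) / max(stats["std"], MIN_STD))


null = {"mean": 0.10, "std": 0.02, "n": 5}  # made-up null stats
z = z_score(0.20, null)
verdict = verdict_from_z(z)
```

The floor keeps a zero-std null from dividing by zero, at the cost of producing very large z-scores, which is the behavior a later sprint revisits.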
Sprint 01 — Finite safety & verdict integrity
Closes Audit 01 findings B1, B3 (fix), B4, B5, C3, C4, D1, D2.
- `safe_finalize` helper in `core/result.py`: every numeric probe now routes its final `ProbeResult` through this guard. A non-finite `raw` (or any explicitly-critical field) auto-converts to `Verdict.ERROR`; non-finite auxiliaries are nulled out and recorded in `evidence["defensively_nulled"]`.
- `_divergence` non-finite rejection: `aligned_probs`, `kl`, `js` now raise `ProbeError` on NaN/inf inputs instead of silently producing garbage. JS results are bounds-checked against the theoretical `[0, ln 2]`; out-of-bound results raise rather than ship. (Pins the +11639σ bug — JS exceeded ln 2 nats by 19×.)
- `PreflightCheckable` Protocol: HF and dummy backends gain `preflight_finite_check()` that runs one forward pass per view with a sentinel prompt and rejects backends whose logprobs are non-finite.
- Suite runner preflight gate: every `sway run` calls `backend.preflight_finite_check()` before the probe loop. Failure emits a single synthetic ERROR probe and skips all configured probes; `sway gate` exits non-zero. Disable with `skip_preflight=True` for sub-second test suites.
- `style_fingerprint` zero-vector fix (B4): a fine-tuned model that produces empty / whitespace-only generations no longer reports a spurious "+0.82 shift toward doc" PASS. The `_cosine_shift` math is replaced by `_projection_shift = (ft-base)·(doc-base) / ||doc-base||²` which goes to zero when ft equals base.
- `_saturation_lambda` full-range search (B3): the adapter-ablation saturation detector now searches the entire λ range (not just λ ≤ 1.0) with `max(divs)` as the reference, and returns a typed reason (`"found"` / `"non_monotonic"` / `"flat_curve"` / `"below_floor"`) so a flat NaN-adapter curve is distinguishable from a healthy saturating one.
- Property-based divergence tests (`hypothesis`): symmetry, `JS(p,p) = 0`, `JS ≤ ln 2`, `KL(p,p) = 0`, KL non-negativity. Plus explicit non-finite-raises tests pinning every entry point.
- End-to-end NaN regression (`tests/integration/test_nan_adapter_regression.py`, slow+online): builds a real PEFT adapter with all-NaN weights via PEFT, persists it through safetensors, then asserts the HF backend's preflight rejects it and the suite produces ERROR — not the +11639σ headline the audit caught.
- Added `huggingface_hub` to dev dep group so integration tests can resolve the `tiny_model_dir` fixture (closes C1 ahead of Sprint 04).
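
The B4 projection formula is short enough to check by hand. The sketch below implements the expression quoted above on plain Python lists; the function name drops the leading underscore to mark it as a stand-in, and returning 0.0 when `doc` equals `base` is an assumption about a case the entry does not cover.

```python
def projection_shift(base: list[float], ft: list[float], doc: list[float]) -> float:
    """(ft - base) . (doc - base) / ||doc - base||^2, as quoted above."""
    toward_doc = [d - b for d, b in zip(doc, base)]
    moved = [f - b for f, b in zip(ft, base)]
    denom = sum(x * x for x in toward_doc)
    if denom == 0.0:
        return 0.0  # assumption: doc == base leaves no direction to project onto
    return sum(m * t for m, t in zip(moved, toward_doc)) / denom


# Made-up 3-dim fingerprints for illustration.
base_vec = [1.0, 2.0, 3.0]
doc_vec = [2.0, 4.0, 3.0]
no_change = projection_shift(base_vec, base_vec, doc_vec)
full_shift = projection_shift(base_vec, doc_vec, doc_vec)
halfway = projection_shift(base_vec, [1.5, 3.0, 3.0], doc_vec)
```

When `ft` equals `base` the numerator vanishes, so an adapter that changes nothing scores zero shift regardless of where the document sits; reaching the document exactly scores 1.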
0.1.0.dev0 — 2026-04-20
Initial pre-alpha. Full 11-primitive battery shipped.
Primitives
- Adherence
  - `delta_kl` — mean JS/KL divergence between base and fine-tuned next-token distributions
  - `adapter_revert` — reversion under adversarial paraphrase (needs `sway-eval[semsim]`)
  - `prompt_collapse` — exponential-decay fit of divergence over context length
- Attribution
  - `section_internalization` (flagship) — per-section `effective_sis` with leak check
  - `paraphrase_invariance` — memorization vs. generalization, intent-aware
  - `preference_flip` — DPO/ORPO chosen/rejected margin inversion
- Calibration
  - `style_fingerprint` — 6-dim numpy-only stylistic shift vs. document
  - `calibration_drift` — general-knowledge regression on a packaged 30-item pack
  - `leakage` — greedy LCS recall + perturbation fragility
- Ablation
  - `adapter_ablation` (signature primitive) — λ-scaled divergence curve with linearity, saturation, overshoot metrics
- Baseline
  - `null_adapter` — per-kind null distribution matrix for z-score calibration (wired end-to-end in Unreleased)
Infrastructure
- `DifferentialBackend` + `ScalableDifferentialBackend` protocols
- HuggingFace + PEFT backend with `disable_adapter` / `set_adapter` toggling and LoRA-scale mutation
- Dummy backend for unit tests (canned responses + linear-blend scalable mode)
- YAML spec loader, composite score (four-category weighted), rich terminal + JSON + JUnit + Markdown reports
- Typer CLI: `run`, `gate`, `check`, `diff`, `autogen`, `doctor`, `report`
- `.dlm` bridge (`dlm-sway[dlm]`): resolver + full-battery autogen
- Matplotlib visualizations (`dlm-sway[viz]`): SIS bar chart, ablation curve, KL histogram
Known gaps
- MLX backend loads two model copies in memory (base + adapter-fused) because `mlx_lm` has no runtime adapter toggle. Memory footprint is ~2x the HF path; fine for the small (<3B) models MLX typically runs.
- MLX requires a pre-converted `.npz` adapter — raw PEFT safetensors are rejected by `mlx_lm.load`. A PEFT-→-MLX converter is a future milestone.
- PyPI publication of the `dlm-sway` wheel is pending a clean CI release workflow.
View source
| 1 | # Changelog |
| 2 | |
| 3 | ## Unreleased |
| 4 | |
| 5 | ### Sprint 21 — Audit 03 closure |
| 6 | |
| 7 | Closes the Audit 03 short list — 2 🟠 major + 5 🟡 minor findings the |
| 8 | audit recommended for v0.1.0 readiness. Full audit at |
| 9 | `.docs/audits/03-final-audit.md`. Verdict was YELLOW-leaning-GREEN |
| 10 | with zero critical findings; this sprint pushes it to GREEN on the |
| 11 | in-scope items. Deferred: F01 (MLX converter — its own feature |
| 12 | sprint, S24), F05 (resolves naturally at v0.1.0 tag), F06 (dlm API |
| 13 | compat test — lands with v0.1.0 release in S22). `v0.1.0` PyPI |
| 14 | publish is deferred to S22 (requires release coordination + |
| 15 | credentials). |
| 16 | |
| 17 | **🟠 Major — degenerate z-score clipping (F02).** |
| 18 | |
| 19 | - **`probes/null_adapter`** — null stats now carry an explicit |
| 20 | `degenerate: 1.0` field when the calibration ran but produced an |
| 21 | unusable baseline (`runs: 1`, or every seed producing the *exact* |
| 22 | same raw). The std floor at `1e-6` is preserved for valid-but-tight |
| 23 | multi-seed nulls so they still calibrate. Audit observed a |
| 24 | `+290,766σ` leakage-probe z under `runs: 1`; post-fix, z_score |
| 25 | refuses and the probe falls back to fixed thresholds with a clear |
| 26 | footer rollup. |
| 27 | - **`probes/_zscore.z_score`** — refuses when `stats["degenerate"]` |
| 28 | is truthy, independent of the std check. Belt-and-suspenders on |
| 29 | top of the existing MIN_STD guard. |
| 30 | - **`suite/report.collect_degenerate_null_kinds`** — new rollup |
| 31 | surface. Terminal + markdown footers gain a |
| 32 | "N probe kind(s) had a degenerate null baseline — bump `runs:`" |
| 33 | block, distinct from the existing null-opt-outs rollup. |
| 34 | - 12 new unit tests across `test_null_calibration.py`, |
| 35 | `test_zscore_helpers.py`, `test_report_extras_rollup.py`, and |
| 36 | `test_probe_external_perplexity.py`. Existing |
| 37 | `test_std_floor_prevents_runaway_zscore` was rewritten to assert |
| 38 | the fixed behavior (the pre-fix contract WAS the audit's bug). |
| 39 | |
| 40 | **🟡 Minor — CI resilience (F03).** |
| 41 | |
| 42 | - **`pytest-timeout>=2.3`** + **`tenacity>=9.0`** in dev deps. |
| 43 | - **`tests/fixtures/tiny_model.py`** wraps `snapshot_download` with |
| 44 | exponential-backoff tenacity retry (3 attempts, 5-10-20s backoff) |
| 45 | + `etag_timeout=10` to bound per-file head probes. Benefits every |
| 46 | slow+online test. |
| 47 | - **`tests/integration/test_determinism_golden.py`** gains |
| 48 | `@pytest.mark.timeout(600)`. A silent network hang now surfaces |
| 49 | as a test failure with actionable output (audit observed 20m |
| 50 | workflow-timeout hang on Sprint 19 merge run 24747915467). |
| 51 | |
| 52 | **🟡 Minor — outlier-miner pool guard (F04).** |
| 53 | |
| 54 | - **`mining/outlier_miner.mine_outliers`** — raises `SwayError` when |
| 55 | the pool has fewer than `2·top_k` distinct scored prompts, with |
| 56 | an actionable `--top-k N` hint. Pre-fix: 1-distinct-prompt pool |
| 57 | produced `top=[p], bottom=[p]` (identical lists — no outlier |
| 58 | contrast). Guard applies AFTER scoring so unsupported probe kinds |
| 59 | still return the empty-result path. |
| 60 | |
| 61 | **🟡 Minor — autogen skipped-probes comment (F07).** |
| 62 | |
| 63 | - **`integrations/dlm/autogen.collect_skipped_probe_reasons`** — new |
| 64 | public helper returning `(probe_kind, reason)` tuples for probes |
| 65 | `_build_suite` intentionally omitted for the given handle. |
| 66 | - **`_render_annotated_yaml`** — when `skipped` is non-empty, header |
| 67 | gains a `# skipped: <kind> (<reason>)` block. Users no longer have |
| 68 | to diff the autogen source to understand which probes are missing |
| 69 | from their generated `sway.yaml`. |
| 70 | |
| 71 | **🟡 Minor — CI Node.js 20 deprecation silence (F08).** |
| 72 | |
| 73 | - **`.github/workflows/ci.yml`** sets |
| 74 | `FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: "true"` at the workflow |
| 75 | level. Silences the Node 20 deprecation warnings each CI log |
| 76 | currently carries (deadline June 2026). Temporary until every |
| 77 | action we pull bumps to Node 24 natively. |
| 78 | |
| 79 | **🟡 Minor — portable `dlm_source` (F09).** |
| 80 | |
| 81 | - **`integrations/dlm/autogen._portable_dlm_source`** emits a |
| 82 | cwd-relative path when the `.dlm` lives inside the cwd, absolute |
| 83 | otherwise. The cwd-relative form survives cross-machine checkout |
| 84 | (CI agents, other devs); the old absolute-path emission broke the |
| 85 | moment a committed autogen'd YAML was run on a different host. |
| 86 | |
| 87 | **Other.** |
| 88 | |
| 89 | - **640 unit tests pass** (up from 626 before S21 — 14 new |
| 90 | regression tests across F02/F03/F04/F07/F09). |
| 91 | - **Pragmatic deviations from original scope:** S21 was audit-scoped |
| 92 | to ~2 days of cleanup. Actual depth: 6 findings closed with 14 |
| 93 | new regression tests. One existing test |
| 94 | (`test_std_floor_prevents_runaway_zscore`) explicitly rewritten to |
| 95 | flip its assertion — the pre-fix contract the test encoded WAS |
| 96 | the audit's reported bug. |
| 97 | |
| 98 | ### Sprint 19 — Pre-commit hook `sway gate` |
| 99 | |
| 100 | Closes Audit 01 stretch-list F-item "pre-commit hook sway gate." Ships |
| 101 | a `.pre-commit-hooks.yaml` declaring two hook variants so |
| 102 | [pre-commit.com](https://pre-commit.com) users can gate their adapter |
| 103 | on every commit that touches a spec, `.dlm` file, or adapter directory. |
| 104 | |
| 105 | - **Two hooks, pick your posture.** |
| 106 | - `sway-gate` — `language: system`. Uses the sway install on the |
| 107 | user's `PATH`. Fast, zero install cost. Recommended default. |
| 108 | - `sway-gate-isolated` — `language: python` + git-based |
| 109 | `additional_dependencies` (`dlm-sway[hf] @ git+...@<SHA>`). |
| 110 | Self-contained — pre-commit builds a fresh venv and installs |
| 111 | sway + torch + transformers on first run (~5 GB, ~2 min). |
| 112 | - **Rev pinning stance.** Example config pins to a commit SHA, not |
| 113 | `HEAD`. Sway is pre-v0.1.0 — no tagged release yet — and `HEAD` |
| 114 | drifts silently on every `pre-commit autoupdate`. SHA pinning is |
| 115 | the honest pre-release pattern; migration to `rev: v0.1.0` is a |
| 116 | README edit once the first release lands. |
| 117 | - **`pass_filenames: false` on both variants.** `sway gate` takes |
| 118 | exactly one spec path; the user declares it via `args:`. The |
| 119 | `files:` regex decides when the hook fires. No ambiguity. |
| 120 | - **Scope discipline.** Neither hook surfaces `--json` or |
| 121 | `--markdown` flags. The hook gates (non-zero on FAIL) and nothing |
| 122 | else. Users wanting report artifacts run `sway run` separately. |
| 123 | - **README "Pre-commit" section** documents the consumer-side config |
| 124 | for both variants, the SHA-pinning rationale, and the one-time |
| 125 | install cost of the isolated variant. |
| 126 | - **`examples/precommit-example/`** — template `sway.yaml`, |
| 127 | consumer-side `.pre-commit-config.yaml`, and a README walk-through. |
| 128 | - **`tests/integration/test_pre_commit_hook.py`** (slow + online) — |
| 129 | three cases: (a) parse-and-shape smoke test on |
| 130 | `.pre-commit-hooks.yaml`; (b) pass-case runtime — spawns |
| 131 | `pre-commit run sway-gate` as a subprocess against a tmp git repo |
| 132 | with a real LoRA on SmolLM2-135M, asserts exit 0; (c) fail-case — |
| 133 | same fixture with an impossible `assert_mean_gte`, asserts |
| 134 | non-zero exit + `gate FAILED` banner. |
| 135 | - **`pre-commit>=3.8`** in `[dependency-groups].dev` so the |
| 136 | integration test can invoke it as a subprocess. Keeps the tool |
| 137 | out of the user's runtime deps. |
| 138 | |
| 139 | ### Sprint 18 — Cross-platform determinism golden |
| 140 | |
| 141 | Closes Audit 01 stretch-list F-item "cross-platform determinism golden |
| 142 | test." The README's "deterministic on CPU where possible" claim held |
| 143 | in theory; now it's pinned by CI on two platforms. |
| 144 | |
| 145 | - **New module** `src/dlm_sway/core/golden.py`. Tolerance-aware JSON |
| 146 | comparator with `compare_goldens(actual, expected, *, logprob_tol=1e-6, |
| 147 | score_tol=1e-4)` and `mask_variable_fields` for stripping |
| 148 | timestamps, wall-seconds, `duration_s`, `backend_stats`, |
| 149 | `sway_version`, and cwd-resolved path identifiers (`adapter_id`, |
| 150 | `base_model_id`) before comparison. No torch dep; runs in the fast |
| 151 | lane. |
| 152 | - **New integration test** `tests/integration/test_determinism_golden.py` |
| 153 | (slow + online). Builds a deterministically-seeded LoRA on |
| 154 | SmolLM2-135M, runs a minimal 2-probe suite (delta_kl + |
| 155 | calibration_drift), and diffs the JSON output against |
| 156 | `tests/golden/expected_<platform>.json`. `SWAY_UPDATE_GOLDENS=1` |
| 157 | toggles regen mode; missing golden → SKIP with a regen recipe. |
| 158 | - **New CI matrix** `determinism-golden` with |
| 159 | `strategy.matrix.os: [ubuntu-latest, macos-latest]` in |
| 160 | `.github/workflows/ci.yml`. Triggered by changes under |
| 161 | `tests/golden/**`, `src/dlm_sway/core/golden.py`, or the test file |
| 162 | itself (plus the standard schedule/dispatch/push triggers). |
| 163 | - **`workflow_dispatch` regen mode** — dispatching the CI workflow |
| 164 | with `regenerate_goldens=true` flips `SWAY_UPDATE_GOLDENS=1` on both |
| 165 | matrix legs and uploads the regenerated JSONs as per-platform |
| 166 | artifacts. Meant for deliberate "yes I changed the algorithm" |
| 167 | flows. |
| 168 | - **Tolerance rationale** (documented in `core/golden.py`): `1e-6` for |
| 169 | logprob-like numeric fields sits above typical BLAS-implementation |
| 170 | drift (`1e-8`–`1e-7` band between OpenBLAS on linux and Accelerate |
| 171 | on darwin) but below real algorithm-change drift. `1e-4` for score |
| 172 | fields absorbs composition noise. |
| 173 | - **25 new unit tests** for the comparator covering mask coverage, |
| 174 | tolerance thresholds, structural diffs (missing keys, length |
| 175 | mismatches, type mismatches), NaN/inf edge cases, and a realistic |
| 176 | two-masked-payloads round-trip. |
| 177 | - **Pragmatic scope deviation**: the sprint envisioned a checked-in |
| 178 | 5 MB adapter binary. Instead the test builds it from a fixed seed |
| 179 | at runtime (same pattern as `test_external_perplexity_e2e` and |
| 180 | `test_cluster_kl_e2e`). Avoids the regeneration chore noted in the |
| 181 | sprint's risks section. |
| 182 | - **Linux golden bootstrap**: first PR ships `expected_darwin.json` |
| 183 | only; the linux leg SKIPs with a recipe pointing at the |
| 184 | `workflow_dispatch` regen mode. Maintainer dispatches, downloads |
| 185 | the artifact, commits `expected_linux.json`, and the next CI run |
| 186 | asserts cleanly on both platforms. One-time onboarding cost. |
| 187 | |
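The mask-then-compare idea behind the golden test can be sketched like this. It is a simplified illustration and not the code in `core/golden.py`: the names `mask` and `diff`, the exact mask list, and the rule "keys containing `logprob` use the tighter tolerance" are all assumptions made for the example.

```python
import math
from typing import Any

# Assumed mask list, taken from the fields named in the entry above.
MASKED_KEYS = {"duration_s", "backend_stats", "sway_version", "adapter_id", "base_model_id"}


def mask(payload: Any) -> Any:
    """Recursively drop keys whose values vary from run to run."""
    if isinstance(payload, dict):
        return {k: mask(v) for k, v in payload.items() if k not in MASKED_KEYS}
    if isinstance(payload, list):
        return [mask(v) for v in payload]
    return payload


def _is_number(x: Any) -> bool:
    return isinstance(x, (int, float)) and not isinstance(x, bool)


def diff(actual, expected, *, logprob_tol=1e-6, score_tol=1e-4, path="$", key=""):
    """Return human-readable mismatches; an empty list means the payloads agree."""
    tols = {"logprob_tol": logprob_tol, "score_tol": score_tol}
    if isinstance(actual, dict) and isinstance(expected, dict):
        out = []
        for k in sorted(set(actual) | set(expected)):
            if k not in actual or k not in expected:
                out.append(f"{path}.{k}: present on one side only")
            else:
                out += diff(actual[k], expected[k], path=f"{path}.{k}", key=k, **tols)
        return out
    if isinstance(actual, list) and isinstance(expected, list):
        if len(actual) != len(expected):
            return [f"{path}: length {len(actual)} != {len(expected)}"]
        out = []
        for i, (a, e) in enumerate(zip(actual, expected)):
            out += diff(a, e, path=f"{path}[{i}]", key=key, **tols)
        return out
    if _is_number(actual) and _is_number(expected):
        # Equal values (including matching infinities) and NaN-vs-NaN agree.
        if actual == expected or (math.isnan(actual) and math.isnan(expected)):
            return []
        # Assumption: the key name decides which tolerance applies.
        tol = logprob_tol if "logprob" in key else score_tol
        if abs(actual - expected) <= tol:
            return []
        return [f"{path}: {actual!r} != {expected!r} (tol {tol})"]
    return [] if actual == expected else [f"{path}: {actual!r} != {expected!r}"]
```

Returning a list of mismatches rather than a boolean keeps CI output actionable: each entry names the JSON path that drifted.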
| 188 | ### Sprint 17 — Adversarial paraphrase mining + outlier-prompt miner |
| 189 | |
| 190 | Closes Audit 01 innovation item F11. Adds `sway mine` — an evaluation |
| 191 | companion to `paraphrase_invariance` that surfaces the paraphrases a |
| 192 | memorizing adapter most reliably fails on, plus a secondary mode that |
| 193 | ranks `delta_kl` prompts by per-prompt divergence. |
| 194 | |
| 195 | **`sway mine --mode paraphrase`** — per `paraphrase_invariance` case: |
| 196 | |
| 197 | 1. Generate candidate paraphrases via nlpaug `SynonymAug` (WordNet, |
| 198 | deterministic under a fixed seed; back-translation is a user-supplied |
| 199 | escape hatch to keep the default fast and offline). |
| 200 | 2. Embed candidates through the shared MiniLM cache (same 80 MB load |
| 201 | `adapter_revert` pulls) and greedy farthest-first select the top-K |
| 202 | most pairwise-distant ones — dodges nlpaug's tendency to emit |
| 203 | near-duplicate synonym swaps. |
| 204 | 3. Rank by per-token lift gap: |
| 205 | `(ft(prompt, gold) - base(prompt, gold)) - (ft(candidate, gold) - base(candidate, gold))`. |
| 206 | Large positive gap = the candidate breaks the adapter's lift = the adapter doesn't |
| 207 | generalize there. |
| 208 | 4. Emit `sway-mined-paraphrase.yaml` with a `mined_cases:` block that is |
| 209 | paste-compatible with `paraphrase_invariance.cases`. |
| 210 | |
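Steps 2 and 3 above can be sketched in a few lines. This is an illustrative version using plain Python lists instead of MiniLM embeddings; the function names, the choice of Euclidean distance, and seeding the selection from index 0 are assumptions, not details of `paraphrase_miner.py`.

```python
import math
from typing import List, Sequence


def farthest_first(vectors: Sequence[Sequence[float]], k: int) -> List[int]:
    """Greedy farthest-first selection: return indices of up to k spread-out vectors."""
    if not vectors or k <= 0:
        return []
    chosen = [0]  # assumption: start from the first candidate
    # Distance from every vector to its nearest already-chosen vector.
    min_dist = [math.dist(v, vectors[0]) for v in vectors]
    while len(chosen) < min(k, len(vectors)):
        nxt = max(range(len(vectors)), key=lambda i: min_dist[i])
        if min_dist[nxt] == 0.0:
            break  # only exact duplicates of chosen vectors remain
        chosen.append(nxt)
        min_dist = [min(d, math.dist(v, vectors[nxt])) for d, v in zip(min_dist, vectors)]
    return chosen


def lift_gap(ft_prompt: float, base_prompt: float, ft_cand: float, base_cand: float) -> float:
    """Lift on the original prompt minus lift on the candidate paraphrase."""
    return (ft_prompt - base_prompt) - (ft_cand - base_cand)
```

With candidates at `(0, 0)`, `(0.1, 0)`, `(5, 0)`, and `(0, 5)`, the near-duplicate at `(0.1, 0)` is the one left out when `k=3`, which is the behavior the diversity filter is after.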
| 211 | **`sway mine --mode outliers`** — rank a prompt pool by per-prompt |
| 212 | `delta_kl.raw`. Pool defaults to the spec's own `delta_kl` prompts; |
| 213 | `--from-corpus public_domain_en` draws from S09's CC0 corpus instead. |
| 214 | Emits top-K and bottom-K blocks: the highest-divergence prompts (best |
| 215 | for tightening a gate) and lowest-divergence (candidates to drop from |
| 216 | the suite). `leakage` and `paraphrase_invariance` have |
| 217 | section/case-based specs; outlier mining on those is future work, documented |
| 218 | inline in `mining/outlier_miner.py`. |
| 219 | |
| 220 | - **New package** `src/dlm_sway/mining/` — two modules, |
| 221 | `paraphrase_miner.py` and `outlier_miner.py`, plus a thin |
| 222 | `corpus_prompts` helper for the `--from-corpus` path. |
| 223 | - **New CLI subcommand** `sway mine SPEC --mode paraphrase|outliers |
| 224 | [--out FILE] [--top-k N] [--n-candidates N] [--from-corpus NAME] |
| 225 | [--seed N]`. Paste-compatible YAML emission by default; `--out` |
| 226 | overrides the filename. |
| 227 | - **18 new unit tests** — 7 for the paraphrase miner (ranker, |
| 228 | diversity filter, dedup, input validation), 8 for the outlier |
| 229 | miner (delta_kl ranking, corpus wiring, unsupported-kind skip), 3 |
| 230 | for the CLI (paraphrase + outliers modes + no-prompts error path). |
| 231 | - **Prove-the-value test** at |
| 232 | `tests/unit/test_paraphrase_miner_prove_value.py`. On a |
| 233 | deliberately-memorizing dummy backend the hand-written paraphrase |
| 234 | list passes with `generalization_ratio > 0.5`; substituting the |
| 235 | mined list (same seed, same adapter) drops the ratio below 0.5 |
| 236 | and flips the verdict to FAIL, with a ≥ 0.3 ratio gap. |
| 237 | - **Dependency footprint**: no new extras. The paraphrase miner |
| 238 | uses nlpaug (already in `[style]`) + sentence-transformers (in |
| 239 | `[semsim]`); the diversity filter reuses the embedder |
| 240 | `adapter_revert` already pulls. Graceful `BackendNotAvailableError` |
| 241 | with a pip hint when the extras aren't installed. |
| 242 | |
| 243 | ### Sprint 20 — Audit 02 closure |
| 244 | |
| 245 | Closes the full finding inventory from `.docs/audits/02-followup-audit.md` |
| 246 | (1 🔴 critical, 8 🟠 major, 11 🟡 minor, 4 stronger-test opportunities, 3 |
| 247 | dead-code items). Sixteen sprints of innovation work had accumulated |
| 248 | since Audit 01; the re-audit surfaced one ship-blocker (S14's bootstrap |
| 249 | CI dropped at the runner boundary) plus a grab-bag of claim-vs-code |
| 250 | gaps, dead branches, and untested contracts. |
| 251 | |
| 252 | **🔴 Critical — ship-blocker.** |
| 253 | |
| 254 | - **F01** — `suite/runner.py:_with_duration` now forwards every |
| 255 | `ProbeResult` field. The pre-fix version silently dropped `ci_95` |
| 256 | on every probe, making S14's headline deliverable runtime-inert in |
| 257 | any real `sway run`. Regression test at the runner boundary + snapshot |
| 258 | fixture carrying a populated `ci_95` pin the fix. |
| 259 | |
| 260 | **🟠 Major — claim-vs-code gaps.** |
| 261 | |
| 262 | - **F02** — real `sklearn.cluster.KMeans` exercised by two new unit |
| 263 | tests (separation + seed determinism) + a slow+online integration |
| 264 | test at `tests/integration/test_cluster_kl_e2e.py`. Before: every |
| 265 | cluster_kl test monkeypatched `_kmeans_cluster` with an argmax stub, |
| 266 | leaving S16's reason-for-being unverified. |
| 267 | - **F03** — `sway list-probes` falls back to the defining module's |
| 268 | `__doc__` when a probe class has no docstring. Every one of the 13 |
| 269 | shipped probes now renders a summary row. |
| 270 | - **F04** — `sway doctor` probes `plotly` (the load-bearing `[viz]` dep), |
| 271 | `sklearn` (S16 cluster_kl), and the `api` extras (`httpx`, `tenacity`). |
| 272 | - **F05** — README primitives table reflects 13 probes (adherence + |
| 273 | cluster_kl, calibration + external_perplexity, baseline row for |
| 274 | null_adapter). |
| 275 | - **F06** — `TwoModelDifferential` composes `safe_for_concurrent_views` |
| 276 | from its two inner backends. Before: the attribute was missing and |
| 277 | the runner defaulted to `False` even when both inners set `True`, |
| 278 | breaking S13's concurrency claim at the only supported |
| 279 | differential-use pattern. |
| 280 | - **F07** — `integrations/dlm/autogen` emits `cluster_kl` when the |
| 281 | prompt pool clears 20 entries; markdown report gains a per-probe |
| 282 | "Cluster breakdown" section with per-cluster mean KL + exemplars. |
| 283 | - **F08** — `backends/dummy._NullView` RNG uses `hashlib.md5` instead |
| 284 | of Python's PYTHONHASHSEED-salted `hash()`. Cross-process determinism |
| 285 | test at `tests/unit/test_cross_process_determinism.py` pins the fix. |
| 286 | - **F09** — trace writer ↔ analyzer round-trip test at |
| 287 | `tests/unit/test_runner_backend_stats.py`: runs a suite with |
| 288 | `trace_path=`, loads the file back, asserts probe labels + hit/miss |
| 289 | counts match `backend_stats`. |
| 290 | |
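The F08 fix rests on a general Python property: the built-in `hash()` of a string is salted per process (controlled by `PYTHONHASHSEED`), while a cryptographic digest of the same bytes is not. A minimal sketch of a digest-derived seed follows; `stable_seed` and its parameters are hypothetical names, not the dummy backend's actual helper.

```python
import hashlib
import random


def stable_seed(text: str, base_seed: int = 0) -> int:
    """Derive a 64-bit seed that is identical across processes and platforms."""
    digest = hashlib.md5(f"{base_seed}:{text}".encode("utf-8")).digest()
    # The first 8 bytes are plenty for seeding a PRNG.
    return int.from_bytes(digest[:8], "big")


# The same prompt always yields the same RNG stream, whatever PYTHONHASHSEED is.
rng = random.Random(stable_seed("What is the capital of France?"))
```

MD5 is used here only as a stable, fast mixing function; nothing about this use is security-sensitive.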
| 291 | **🟡 Minor — dead code, doc drift, latent brittleness.** |
| 292 | |
| 293 | - **F10** — removed dead `_ft_view` branch in `probes/prompt_collapse.py`. |
| 294 | - **F11** — pytest plugin resolves spec paths against `config.rootpath`, |
| 295 | not process cwd. |
| 296 | - **F12** — `backends/api._post_completions` delegates to tenacity's |
| 297 | `Retrying()` callable; removes the unreachable post-loop fallback. |
| 298 | - **F13** — `preference_flip` WARN branch emits `ci_95` and `z_by_rank` |
| 299 | for consistency with every other numeric probe's WARN shape. |
| 300 | - **F14** — `adapter_ablation` module docstring documents why `ci_95` |
| 301 | renders as em-dash (it's a curve-fit, not a sample-mean aggregator). |
| 302 | - **F15** — report footer surfaces `null_adapter` opt-outs (probes with |
| 303 | `calibrate_spec=None`) in both terminal and markdown surfaces. |
| 304 | - **F16** — report category ordering derives from |
| 305 | `DEFAULT_COMPONENT_WEIGHTS` and accepts unknown categories from |
| 306 | custom `Probe` subclasses. |
| 307 | - **F17** — `cluster_kl` degenerate zero-variance case now returns |
| 308 | `Verdict.WARN` with `evidence["degenerate_zero_variance"]=True` and |
| 309 | suppresses z-score to avoid a spurious small-sample-noise calibration. |
| 310 | - **F18** — `_calibration_pack.py` gains per-section provenance notes |
| 311 | (F18 audit trail) without noisy per-item annotations. |
| 312 | - **F19** — pytest plugin defers `dlm_sway.core.result` / |
| 313 | `dlm_sway.core.errors` imports to call sites so non-`@pytest.mark.sway` |
| 314 | users don't pay the load tax. |
| 315 | - **F20** — `sway check` `--help` documents the σ-banner's null-calibration |
| 316 | dependency (fall-through to composite score band when null SKIPs). |
| 317 | |
| 318 | **Stronger-test opportunities — regression tests for things we weren't |
| 319 | pinning.** |
| 320 | |
| 321 | - **#9** — `probes/_divergence` rejects effectively-uniform TokenDists |
| 322 | (spread < 1e-9) via `_check_non_degenerate_token_dist`. Catches a |
| 323 | shape-broken lm_head that would otherwise compute a trivial constant |
| 324 | divergence across prompts. Two compatibility touchups rode with the |
| 325 | change: dummy backend's synthesized `ft` dist and cluster_kl test |
| 326 | fixture `_dist_broad` gained a tiny monotonic perturbation to clear |
| 327 | the guard (real models never produce bit-uniform logits thanks to |
| 328 | fp32 accumulation noise). |
| 329 | - **#10** — cross-verdict consistency test |
| 330 | (`tests/unit/test_cross_verdict_consistency.py`): runs the same spec |
| 331 | through `sway run`, `sway gate`, and `sway report --format junit` and |
| 332 | asserts identical per-verdict tallies. |
| 333 | - **#11** — `sway doctor --json` schema-shape snapshot test locks the |
| 334 | top-level keys and the per-extra module-name set. |
| 335 | - **#12** — subprocess determinism test asserts `_NullView.next_token_dist` |
| 336 | is byte-identical across `PYTHONHASHSEED` values; pins F08's fix. |
| 337 | |
| 338 | **Dead-code inventory — DC3–DC5.** |
| 339 | |
| 340 | - **DC3** — `null_adapter.py`'s pre-S10 cache-promotion branch is |
| 341 | annotated as legacy; removal queued for the next minor. |
| 342 | - **DC4** — dropped the stderr warning at `suite/runner.py` that fired |
| 343 | whenever `concurrent_probes > 1` even though execution never fanned |
| 344 | out. The spec field is still accepted (future pool will light it up); |
| 345 | the noisy warning is gone. |
| 346 | - **DC5** — `tests/unit/test_model.py` grows coverage of `ModelSpec`'s |
| 347 | dtype enum, `endpoint`, `trust_remote_code`, and `custom`/`api` |
| 348 | `kind` branches. |
| 349 | |
| 350 | Final state: 582 unit tests + 3 integration tests passing; mypy strict |
| 351 | clean across 55 source files; ruff + format clean across 137 files. |
| 352 | Sprint 20 is a single PR composed of per-finding, per-file commits so |
| 353 | the history reads as a fix-by-fix cleanup. |
| 354 | |
| 355 | ### Sprint 16 — Cluster-coherent KL probe |
| 356 | |
| 357 | Closes Audit 01 innovation item F8. Adds a new `cluster_kl` probe that |
| 358 | answers the question `delta_kl` can't: the mean divergence may be the |
| 359 | same, but did the adapter shift the *right* topics, or is it a uniform |
| 360 | blunt-instrument shift? |
| 361 | |
| 362 | - **New probe `cluster_kl`** (category: adherence). Embeds prompts via |
| 363 | shared MiniLM (same cache key as `adapter_revert`), k-means clusters |
| 364 | them at a fixed seed, measures per-prompt JS divergence between base |
| 365 | and ft, and reports a **specificity ratio**: |
| 366 | `between_variance / (between_variance + within_variance)`. Range |
| 367 | `[0, 1]`: `≈ 0.5` on a blunt adapter that shifts every topic by the |
| 368 | same amount, `→ 1.0` on a topic-targeted adapter. Pair |
| 369 | `(mean_kl, specificity)` tells a more honest story than either number |
| 370 | alone. |
| 371 | - **Bootstrap CI** on specificity. Resamples `(divergence, cluster_label)` |
| 372 | pairs with replacement and takes the 2.5/97.5 percentiles of the |
| 373 | bootstrap ratio distribution. Returns `None` below 4 prompts — matches |
| 374 | the convention `core.stats.bootstrap_ci` uses. |
| 375 | - **Z-score calibration** against `null_adapter` baseline via the |
| 376 | existing `_zscore` helpers. `calibrate_spec` synthesizes 8 mixed-topic |
| 377 | sentinel prompts + k=2 for the null pass; the specificity distribution |
| 378 | concentrates around 0.5 there, so a real adapter's separation above |
| 379 | that baseline reads as a clean z-score. |
| 380 | - **Degenerate-input policy.** `< min_prompts` (default 20) → SKIP with |
| 381 | a clear message; `num_clusters * 2 > num_prompts` → SKIP (per-cluster |
| 382 | mean not well-resolved); empty prompt list → ERROR. Missing `[semsim]` |
| 383 | extras → SKIP with a `pip install 'dlm-sway[semsim]'` hint. |
| 384 | - **Zero-variance fallback.** If every prompt produced the same |
| 385 | divergence (canned stub data, no adapter motion), the specificity |
| 386 | ratio is mathematically undefined; we return `0.5` — the null-adapter |
| 387 | expectation — so downstream z-score reports "no signal" instead of |
| 388 | NaN. |
| 389 | - **New dependency.** `scikit-learn>=1.4` added to the `[semsim]` extra |
| 390 | (riding the 80 MB MiniLM load that probe already pulls in). Mirrored |
| 391 | to `[all]` and to mypy's stubless-override list. |
| 392 | - **7 unit tests** in `tests/unit/test_probe_cluster_kl.py`: two-topic |
| 393 | adapter → high specificity, uniform adapter → 0.5 fallback, too-few |
| 394 | prompts → SKIP, empty prompts → ERROR, `num_clusters > prompts/2` → |
| 395 | SKIP, CI bracketing, missing-extras SKIP path. |
| 396 | - **Prove-the-value test** at `tests/unit/test_cluster_kl_prove_value.py`. |
| 397 | Two backends with comparable `delta_kl` (ratio < 3×) have specificity |
| 398 | scores that split by at least 0.3 — concrete evidence that `cluster_kl` |
| 399 | surfaces a structural distinction `delta_kl` merges. |
| 400 | |
| 401 | ### Sprint 15 — pytest plugin (`@pytest.mark.sway`) |
| 402 | |
| 403 | Closes Audit 01 innovation item F10. Packages sway as a pytest |
| 404 | library so teams already running pytest adopt sway with a single |
| 405 | decorator instead of a subprocess wrapper. |
| 406 | |
| 407 | - **New plugin** (`src/dlm_sway/pytest_plugin.py`) auto-loaded via |
| 408 | the `pytest11` entry point after `pip install 'dlm-sway[pytest]'`. |
| 409 | `@pytest.mark.sway(spec="...", threshold=0.0, weights=None)` |
| 410 | expands a single pytest function into **one test item per probe** |
| 411 | in the referenced spec + an optional `__gate__` item that fires |
| 412 | only when the composite score drops below `threshold`. |
| 413 | - **Verdict translation.** `FAIL` / `ERROR` → pytest Failed, |
| 414 | `SKIP` → pytest Skipped, `WARN` → pytest warning, `PASS` → |
| 415 | pytest pass. Probe-level failures isolate: a failing adherence |
| 416 | probe doesn't mask a failing calibration one, and `pytest -k |
| 417 | adherence` runs just that probe. |
| 418 | - **Suite runs once per decorated function.** A session-scoped |
| 419 | `_SuiteCache` keyed on `(spec_path, weights)` ensures the N-way |
| 420 | item expansion doesn't multiply backend wall time. Two |
| 421 | `@pytest.mark.sway` tests against the same spec share one run. |
| 422 | - **Malformed marks fail cleanly.** Missing `spec`, non-numeric |
| 423 | `threshold`, non-dict `weights`, and unknown kwargs produce a |
| 424 | synthetic `_ConfigErrorItem` with a clean, readable pytest failure |
| 425 | line instead of a cryptic collection-time traceback. |
| 426 | - **New `[pytest]` extra** — just `pytest>=8.0`. Matches the pattern |
| 427 | other plugin-style extras use. Also registers the `pytest11` |
| 428 | entry point so the plugin is discovered automatically on install, |
| 429 | consistent with pytest-cov / pytest-xdist. |
| 430 | - **Example directory** at `examples/pytest_integration/` with a |
| 431 | minimal `sway.yaml` + `test_sway_gate.py` showing the decorator |
| 432 | replacing a legacy `subprocess.run(["sway", "gate", ...])` |
| 433 | wrapper. README gains a "Pytest integration" section pointing at |
| 434 | it. |
| 435 | - **13 new unit tests** via pytest's canonical `pytester` fixture: |
| 436 | marker registration, N-item expansion, FAIL / SKIP / ERROR |
| 437 | routing, gate below-threshold failure, gate above-threshold pass, |
| 438 | gate absence when `threshold=0`, malformed-mark error paths, |
| 439 | cache sharing across multiple decorated tests. |
| 440 | - **Drive-by fix for the S14 CI-narrowing test flake.** The dummy |
| 441 | backend's per-prompt noise was seeded via Python's `hash()`, |
| 442 | which is salted per-process via `PYTHONHASHSEED`. Swapped to a |
| 443 | stable `hashlib.md5`-derived seed so the narrowing invariant |
| 444 | holds deterministically across test-order permutations. |
| 445 | |
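The run-once behavior can be illustrated with a small cache keyed on `(spec_path, weights)`. This is a sketch under assumptions, not the plugin's `_SuiteCache`: the class name, the `run_suite` callable, and the way the weights dict is frozen into a hashable key are all invented for the example.

```python
from typing import Any, Callable, Dict, Optional, Tuple


class SuiteCache:
    """Memoize suite runs so N expanded test items share one backend pass."""

    def __init__(self, run_suite: Callable[[str, Optional[dict]], Any]) -> None:
        self._run_suite = run_suite
        self._results: Dict[Tuple[str, Optional[tuple]], Any] = {}

    @staticmethod
    def _key(spec_path: str, weights: Optional[dict]) -> Tuple[str, Optional[tuple]]:
        # Dicts are unhashable, so freeze weights into a sorted tuple of pairs.
        frozen = tuple(sorted(weights.items())) if weights else None
        return (str(spec_path), frozen)

    def get(self, spec_path: str, weights: Optional[dict] = None) -> Any:
        key = self._key(spec_path, weights)
        if key not in self._results:
            self._results[key] = self._run_suite(str(spec_path), weights)
        return self._results[key]
```

Two lookups with the same spec and weights return the same result object; changing the weights produces a separate run, since the composite score depends on them.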
| 446 | ### Sprint 14 — Bootstrap CIs + forward-pass trace CLI |
| 447 | |
| 448 | Closes Audit 01 innovation items F9 (bootstrap confidence intervals |
| 449 | on raw metrics) + F12 (polished forward-pass trace CLI). Paired |
| 450 | sharpenings of existing probe output and existing trace |
| 451 | infrastructure. |
| 452 | |
| 453 | - **New `core/stats.bootstrap_ci`** — percentile-bootstrap 95% CI on |
| 454 | any sequence of per-sample measurements. Numpy-only, 1000-resample |
| 455 | default, seeded from `ctx.seed` so intervals are reproducible. |
| 456 | Short-circuits on non-finite / empty / degenerate-constant inputs; |
| 457 | vectorized resample keeps the overhead ~1 ms per probe. |
| 458 | - **`ProbeResult.ci_95: tuple[float, float] | None`** — new field, |
| 459 | threaded through `safe_finalize` (nulled if `raw` is nulled, so a |
| 460 | CI never brackets a defensive-null point estimate). `to_json` / |
| 461 | `from_json` persist the pair as a two-list. |
| 462 | - **Six aggregating probes emit `ci_95`:** `delta_kl` (over per-prompt |
| 463 | divergences), `calibration_drift` (over per-item regression |
| 464 | indicators), `external_perplexity` (over per-chunk deltas), |
| 465 | `leakage` (over clean-recall rates), `paraphrase_invariance` (over |
| 466 | verbatim lifts), `section_internalization` (over effective SIS |
| 467 | scores). Each also lands the interval under |
| 468 | `evidence["raw_ci_95"]` for JSON consumers. |
| 469 | - **Report tables add a `ci95` column** between `raw` and `z` in |
| 470 | both terminal and markdown output. `format_ci` renders as |
| 471 | `[lo, hi]` with em-dash for missing/non-finite. Markdown |
| 472 | snapshot refreshed; JSON snapshot refreshed for the new |
| 473 | per-probe `ci_95` field. |
| 474 | - **New `sway trace <jsonl>` CLI.** Reads the forward-pass trace |
| 475 | JSONL the S07 runner writes when `--trace <path>` is set, |
| 476 | aggregates into per-probe and per-view wall-time + hit-rate |
| 477 | tables, plus a top-N slowest-events table. `--format |
| 478 | terminal|md|json`, `--slowest K` to tune the tail size. The |
| 479 | parsing tolerates old trace shapes (missing `probe` / `hit` |
| 480 | fields) and skips malformed lines. |
| 481 | - **`suite/trace_analysis`** — typed dataclasses (`TraceEvent`, |
| 482 | `ProbeSummary`, `ViewSummary`, `TraceReport`) + three renderers |
| 483 | (terminal / markdown / json) matching the conventions of |
| 484 | `suite/report` and `suite/compare`. |
| 485 | - **Committed trace fixture** at `tests/fixtures/trace_sample.jsonl` |
| 486 | (8 events, 2 probes × 4 prompts with cache hits). Drives 22 new |
| 487 | trace-analysis + CLI tests. |
| 488 | - **Prove-the-value (F9):** on a dummy backend with per-prompt |
| 489 | variation, `delta_kl`'s CI narrows from `[0.33, 0.41]` (width |
| 490 | 0.079) at N=4 prompts to `[0.38, 0.42]` (width 0.044) at N=32: |
| 491 | 1.8× tighter in width, short of the √(N₂/N₁) = √8 ≈ 2.8× theoretical |
| 492 | scaling; the gap is attributed to dummy-backend dispersion. |
| 493 | - **Prove-the-value (F12):** the CLI's terminal output on the |
| 494 | committed fixture correctly identifies `sis/base` (520.3 ms) as |
| 495 | the single slowest event, `sis` (1,529 ms) as the slower probe |
| 496 | over `dk` (529 ms), and `ft` (1,357 ms) as the slower view over |
| 497 | `base` (701 ms). |
| 498 | |
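For readers unfamiliar with the percentile bootstrap, here is a minimal sketch of the technique. It is not `core/stats.bootstrap_ci`: the shipped helper is described as numpy-based and vectorized, whereas this version uses only the standard library, and the function name, the nearest-rank percentile indexing, and the exact short-circuit rules are assumptions.

```python
import math
import random
from statistics import fmean
from typing import Optional, Sequence, Tuple


def bootstrap_ci_sketch(
    samples: Sequence[float], *, seed: int = 0, n_resamples: int = 1000
) -> Optional[Tuple[float, float]]:
    """Percentile-bootstrap 95% CI on the mean; None when the input is unusable."""
    values = list(samples)
    # Short-circuit on tiny, non-finite, or constant inputs.
    if len(values) < 4 or not all(math.isfinite(v) for v in values):
        return None
    if min(values) == max(values):
        return None
    rng = random.Random(seed)  # seeded so the interval is reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(fmean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lo = means[int(0.025 * n_resamples)]
    hi = means[min(n_resamples - 1, int(0.975 * n_resamples))]
    return (lo, hi)
```

Because each resample has the same size as the input, the interval width shrinks roughly with the square root of the sample count, which is the scaling the F9 prove-the-value entry compares against.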
| 499 | ### Sprint 13 — OpenAI-compatible HTTP scoring backend |
| 500 | |
| 501 | Closes Audit 01 innovation item F7. Unlocks sway against hosted |
| 502 | fine-tunes — OpenAI platform, `vllm serve`, Ollama — without |
| 503 | requiring a local torch + PEFT load. |
| 504 | |
| 505 | - **New backend `ApiScoringBackend`** (`backends/api.py`): scores |
| 506 | against a `/v1/completions` endpoint via httpx. Implements |
| 507 | `ScoringBackend` only (not `DifferentialBackend`) since one |
| 508 | endpoint is one model; users compose two `ApiScoringBackend` |
| 509 | instances behind the existing `TwoModelDifferential` wrapper for |
| 510 | a full differential run. Method surface: `logprob_of` via |
| 511 | `echo=True`+`logprobs=0`, `rolling_logprob` via the same, |
| 512 | `next_token_dist` via `max_tokens=1`+`logprobs=K`, plus |
| 513 | `preflight_finite_check` and an `ApiScoringBackend.generate` for |
| 514 | probes that need text output (e.g. `leakage`). |
| 515 | - **Token-boundary handling.** The API returns tokens as strings, so |
| 516 | `logprob_of` walks the echoed tokens by character length until the |
| 517 | running total covers the prompt, then sums the rest. When the |
| 518 | boundary falls mid-token, the partial token lands on the prompt |
| 519 | side — over-counts the prompt, under-attributes to the completion |
| 520 | — and the behavior is asserted by a dedicated test |
| 521 | (`test_mid_token_prompt_leans_conservative`). |
| 522 | - **Retry + preflight.** Tenacity handles 5xx + network errors with |
| 523 | exponential backoff (configurable `max_retries`). Preflight hits |
| 524 | the endpoint once with `hello`, `max_tokens=1`, `logprobs=1`; |
| 525 | rejects non-finite logprobs before the suite runs. |
| 526 | - **First backend with `safe_for_concurrent_views=True`.** HTTP is |
| 527 | stateless, so the S07 concurrent-probe scheduler can dispatch |
| 528 | against an API backend in parallel as soon as the pool |
| 529 | implementation lands (still scaffolding in the runner). |
| 530 | - **`ModelSpec` additions:** `BackendKind` gains `"api"`; |
| 531 | `ModelSpec.endpoint` field carries the server's base URL (the |
| 532 | `/v1/completions` path is appended by the backend). API key from |
| 533 | `SWAY_API_KEY` → `OPENAI_API_KEY` env var fallback, so secrets |
| 534 | stay out of the YAML. |
| 535 | - **`backends.build` dispatch** routes `kind="api"` to |
| 536 | `ApiScoringBackend`. Combined with `build_two_separate` + |
| 537 | `TwoModelDifferential`, a YAML with |
| 538 | `defaults.differential: false` and two `kind: api` models Just |
| 539 | Works end-to-end. |
| 540 | - **New `[api]` extra** — `httpx>=0.27` + `tenacity>=9.0`. Core sway |
| 541 | stays torch-free; the `[api]` extra adds ~1 MB of deps vs the |
| 542 | `[hf]` extra's 3 GB. |
| 543 | - **Unit tests (21 new) use httpx's `MockTransport`** to intercept |
| 544 | every call and assert numeric outputs match canned OpenAI-shaped |
| 545 | responses — covers all three scoring methods, cache dedup, 4xx |
| 546 | error surfacing, 503→200 retry recovery, API-key env fallback, |
| 547 | NaN-response preflight rejection, and the token-boundary math. |
| 548 | - **Prove-the-value (§F7):** `tests/integration/test_api_ollama.py` |
| 549 | is opt-in via `SWAY_OLLAMA_URL` + `SWAY_OLLAMA_MODEL`. Runs the |
| 550 | full scoring surface against a live Ollama serving a small model |
| 551 | (e.g. `llama3.2:1b`) and asserts finite output + preflight pass. |
| 552 | Documents the wall-time budget the "≤3× HF backend" claim rests |
| 553 | on once the concurrent-dispatch pool lands. |
| 554 | |
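The token-boundary walk described above can be sketched as follows. The function name and argument shapes are hypothetical; the inputs stand in for the `tokens` and `token_logprobs` arrays of an OpenAI-style completions response with `echo=True`, and skipping `None` logprobs is an assumption about how the first echoed token is handled.

```python
from typing import List, Optional


def completion_logprob(
    tokens: List[str], token_logprobs: List[Optional[float]], prompt: str
) -> float:
    """Sum logprobs of tokens that start at or after the end of the prompt."""
    consumed = 0  # characters covered by the tokens seen so far
    total = 0.0
    for tok, lp in zip(tokens, token_logprobs):
        # A token that starts inside the prompt counts as prompt, even if it
        # straddles the boundary. This is the conservative, prompt-side choice.
        if consumed >= len(prompt) and lp is not None:
            total += lp
        consumed += len(tok)
    return total
```

With tokens `["The", " cat", " sat"]` and the prompt `"The c"`, the straddling token `" cat"` is attributed to the prompt, so only `" sat"` contributes to the completion score.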
| 555 | ### Sprint 12 — Interactive HTML report |
| 556 | |
| 557 | Closes Audit 01 innovation item F6. Adds the exploration surface to |
| 558 | sit alongside the terminal (CI logs) and markdown (PR artifacts) |
| 559 | outputs — a single-file interactive HTML page for the research / |
| 560 | write-up case. |
| 561 | |
| 562 | - **`sway report result.json --format html --out report.html`** emits |
| 563 | a self-contained HTML page with five interactive Plotly panels: |
| 564 | composite-score gauge with banded thresholds, per-category |
| 565 | horizontal-bar breakdown, per-section SIS bar chart (when |
| 566 | `section_internalization` evidence carries a `per_section` array), |
| 567 | adapter-ablation response curve (when `adapter_ablation` evidence |
| 568 | carries `lambdas` + `mean_divergence_per_lambda`), and an |
| 569 | all-probe score × z-score scatter with hover tooltips. The terminal |
| 570 | verdict palette (green / yellow / red) carries across every panel. |
| 571 | - **Plotly bundle inlined once in `<head>`**; each panel is a |
| 572 | Plotly-produced `<div>` with a stable `sway-*` id so snapshot tests |
| 573 | don't churn on Plotly point releases. No external `<script src>` / |
| 574 | `<link href>` references — the page loads offline from a single |
| 575 | ~4.9 MB file (Plotly 6.x JS is the majority of that; the sway |
| 576 | wrapper + chart data is ~40 KB). |
| 577 | - **CLI gates `--out PATH` on file-producing formats.** `--format html` |
| 578 | requires `--out` (a ~4.9 MB page has no business on stdout); |
| 579 | `--format md|json|junit` now also accept `--out` for symmetry with |
| 580 | the HTML path. `--format terminal` rejects `--out` — the terminal |
| 581 | renderer is for the console only. |
| 582 | - **`plotly>=5.20` is optional**, shipped via the existing `[viz]` |
| 583 | extra (alongside matplotlib). Without it, `--format html` exits 2 |
| 584 | with the `pip install 'dlm-sway[viz]'` install hint. Graceful |
| 585 | ImportError → RuntimeError translation tested end-to-end. |
| 586 | - **Snapshot** at `tests/snapshots/report.html` locks the Sway-owned |
| 587 | wrapper structure. The Plotly JS bundle is stripped from the |
| 588 | snapshot (replaced with a placeholder) so the 43-line snapshot |
| 589 | stays human-reviewable and Plotly version bumps don't drift it. |
| 590 | - **Prove-the-value:** `test_html_from_real_history_loads_offline` |
| 591 | renders HTML from the committed `tests/fixtures/sway-history/02-*` |
| 592 | run (4 probes, real exported payload) and asserts: file parses via |
| 593 | `html.parser`, zero external `<script src>` / `<link href>` |
| 594 | references, every probe name appears in the body, file size lands |
| 595 | between 1 MB and 10 MB. Measured output: 4.87 MB. |
| 596 | |
| 597 | ### Sprint 11 — `sway compare` across saved JSON runs |
| 598 | |
| 599 | Closes Audit 01 innovation item F5. Ships the regression-dashboard |
| 600 | primitive every CI integration reached for but had to script. |
| 601 | |
| 602 | - **New CLI subcommand `sway compare`.** Accepts N saved result JSONs |
| 603 | (typically from `sway run --json`), rehydrates each via |
| 604 | `report.from_json`, folds them into a score matrix, and renders: |
| 605 | a per-probe score table with columns per run, per-adjacent-pair |
| 606 | delta columns (colored red on drop / green on lift), and a |
| 607 | composite-score timeline row. Formats: `terminal` (default, Rich), |
| 608 | `--format md`, `--format json`. |
| 609 | - **`--fail-on-regression <threshold>`.** Exits 1 when any probe's |
| 610 | score in the newest run dropped ≥ `threshold` vs the prior run. |
| 611 | `threshold=0` (default) disables the gate. Gate fires after the |
| 612 | output is emitted so CI logs always capture the matrix even on |
| 613 | red builds. |
| 614 | - **`suite/compare.py` module.** Clean separation: `build_matrix` |
| 615 | folds `(SuiteResult, SwayScore)` pairs into a `CompareMatrix` |
| 616 | dataclass; `render_{terminal,markdown,json}` consume that. No |
| 617 | filesystem IO in the module itself — the CLI owns the reads, the |
| 618 | renderers own the writes. Probes that disappeared between runs |
| 619 | show as `None` / em-dash in the matrix; new probes show as |
| 620 | `None` on older runs. Union of probe names is sorted for stable |
| 621 | row order across invocations. |
| 622 | - **Markdown snapshot locked** at `tests/snapshots/compare.md` — |
| 623 | silent schema drift breaks the test like every other snapshot. |
| 624 | - **Committed history fixture + prove-the-value test.** |
| 625 | `tests/fixtures/sway-history/{01,02,03}-*.json` ship a three-run |
| 626 | narrative: baseline → retrained-improved → over-trained. Run 03 |
| 627 | plants a `section_internalization` drop of 0.22 and a |
| 628 | `calibration_drift` drop of 0.25 while `delta_kl` still rises |
| 629 | (the memorization-without-generalization failure mode). |
| 630 | `test_compare_catches_planted_regression` invokes |
| 631 | `sway compare sway-history/*.json --fail-on-regression 0.10` and |
| 632 | asserts exit=1 with both regressed probes surfaced in the JSON |
| 633 | payload and terminal text — exactly the F5 "CI gate the build on |
| 634 | regression" experiment. |
| 635 | |
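The matrix fold and the regression gate described above can be sketched as follows. This is an illustration under stated assumptions, not the code in `suite/compare.py`: `SketchMatrix`, `fold_runs`, and `regressed_probes` are hypothetical names, each run is reduced to a plain `{probe: score}` dict, and the gate is assumed to skip probes that are missing from either of the two newest runs.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SketchMatrix:
    """Hypothetical stand-in for the compare matrix: one score row per probe."""

    run_labels: list[str]
    probe_names: list[str]
    scores: dict[str, list[float | None]]


def fold_runs(runs: list[tuple[str, dict[str, float]]]) -> SketchMatrix:
    # Union of probe names, sorted so row order is stable across invocations.
    names = sorted({name for _, scores in runs for name in scores})
    # A probe absent from a run is recorded as None for that column.
    rows = {name: [scores.get(name) for _, scores in runs] for name in names}
    return SketchMatrix([label for label, _ in runs], names, rows)


def regressed_probes(matrix: SketchMatrix, threshold: float) -> list[str]:
    # threshold=0 disables the gate, mirroring the documented default.
    if threshold <= 0 or len(matrix.run_labels) < 2:
        return []
    regressed = []
    for name in matrix.probe_names:
        prior, newest = matrix.scores[name][-2], matrix.scores[name][-1]
        if prior is None or newest is None:
            continue  # assumption: appearing or disappearing probes never trip the gate
        if prior - newest >= threshold:
            regressed.append(name)
    return regressed
```

A caller would render the matrix first and only then exit non-zero when `regressed_probes` is non-empty, so the output is captured even on a failing build.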
| 636 | ### Sprint 10 — Multi-rank adversarial null adapters |
| 637 | |
| 638 | Closes Audit 01 innovation item F4. Extends the S02 null-calibration |
| 639 | matrix with a per-rank profile — users now read "how rank-saturated |
| 640 | is my adapter?" straight off the report. |
| 641 | |
| 642 | - **`NullAdapterSpec.rank_multipliers: list[float] = [1.0]`** (new). |
| 643 | Default preserves single-rank behavior byte-for-byte. Setting |
| 644 | `[0.5, 1.0, 2.0]` calibrates three independent null distributions |
| 645 | per probe kind and emits a z-profile alongside the verdict z-score. |
| 646 | - **`NullCalibratedBackend.as_null_adapter` gains a `rank_scale` kwarg.** |
| 647 | Both shipped backends (dummy + HF) implement it by scaling the |
| 648 | null-weight noise std by `sqrt(rank_scale)` — mathematically |
| 649 | equivalent to a rank change in terms of the LoRA output variance |
| 650 | (`A·B` is a sum of `r` rank-1 outer products; variance is linear |
| 651 | in `r`). No tensor-shape surgery, no model reload per multiplier. |
| 652 | - **`RunContext.null_stats_by_rank`** threads the per-rank matrix |
| 653 | through the runner. Keys are canonical `rank_{mult:.2f}` strings; |
| 654 | inner structure matches `null_stats`. |
| 655 | - **Numeric probes emit `evidence["z_by_rank"]`.** Every numeric |
| 656 | probe (delta_kl, calibration_drift, external_perplexity, leakage, |
| 657 | adapter_revert, adapter_ablation, paraphrase_invariance, |
| 658 | preference_flip, prompt_collapse, section_internalization, |
| 659 | style_fingerprint) computes per-rank z-scores using its existing |
| 660 | sign convention. The verdict path still reads the 1.0x group — |
| 661 | no existing thresholds shift. |
| 662 | - **Report surfaces the profile.** Terminal + markdown probe notes |
| 663 | append `rank profile: +4.2σ @ 1x / +6.8σ @ 0.5x / +2.1σ @ 2x` |
| 664 | whenever multi-rank calibration ran. `format_z_profile` helper |
| 665 | centralizes the rendering. |
| 666 | - **Null-stats disk cache widens key to include `rank_multipliers`.** |
| 667 | Single-rank caches from pre-S10 runs are still readable — they're |
| 668 | promoted to the new `null_stats_by_rank` shape on load. |
| 669 | - **Bug fix (`external_perplexity`): drop erroneous z sign flip.** |
| 670 | `mean_delta` is higher-is-better (ft logprob minus base logprob |
| 671 | on external prose), so the raw z-score maps directly onto |
| 672 | `z >= assert_z_gte`. S09's sign flip was reversing the pass/fail |
| 673 | direction when the null-calibration path ran. |
| 674 | - **README: rank-profile interpretation guide.** Explains when the |
| 675 | shape indicates rank saturation vs. rank oversizing, plus the |
| 676 | "low rank can be pathologically quiet" dual-reading caveat. |
| 677 | - **Prove-the-value (`tests/unit/test_null_multi_rank.py`):** on a |
| 678 | fixed adapter with the dummy backend, the delta_kl z-profile is |
| 679 | strictly monotone in inverse rank (`z@0.5x > z@1.0x > z@2.0x`) — |
| 680 | exactly the signature of a rank-scaled null distribution. |
| 681 | |
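A rough sketch of the per-rank bookkeeping may help when reading the bullets above. It assumes the key format `rank_{mult:.2f}`, the `sqrt(rank_scale)` noise scaling, and the profile string quoted in the entry; the function names (`rank_key`, `null_noise_std`, `z_by_rank`, `format_profile`) are hypothetical and the `MIN_STD` floor is borrowed from the Sprint 02 entry.

```python
import math

MIN_STD = 1e-6  # std floor, as described in the Sprint 02 entry


def rank_key(multiplier: float) -> str:
    """Canonical key for one rank multiplier, e.g. 0.5 -> 'rank_0.50'."""
    return f"rank_{multiplier:.2f}"


def null_noise_std(base_std: float, rank_scale: float) -> float:
    # Output variance is linear in rank, so the std scales with sqrt(rank_scale).
    return base_std * math.sqrt(rank_scale)


def z_by_rank(raw: float, stats_by_rank: dict[str, dict[str, float]]) -> dict[str, float]:
    """One z-score per rank group, each against that group's own null stats."""
    return {
        key: (raw - stats["mean"]) / max(stats["std"], MIN_STD)
        for key, stats in stats_by_rank.items()
    }


def format_profile(z_profile: dict[float, float]) -> str:
    """Render {multiplier: z} in insertion order, matching the quoted report line."""
    parts = [f"{z:+.1f}σ @ {mult:g}x" for mult, z in z_profile.items()]
    return "rank profile: " + " / ".join(parts)
```

With a wider null spread at higher multipliers, the same raw value yields a smaller z, which is the monotone shape the prove-the-value test looks for.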
| 682 | ### Sprint 09 — External-perplexity-gap probe |
| 683 | |
| 684 | Closes Audit 01 innovation item F3. First innovation sprint landed on |
| 685 | top of the audit-closure campaign (S01–S07). |
| 686 | |
| 687 | - **New probe `external_perplexity`** (`probes/external_perplexity.py`): |
| 688 | rolling-logprob delta of ft vs base on held-out public-domain English |
| 689 | prose. Raw metric is `mean_delta_nats` (per-token). Negative = |
| 690 | adapter raised perplexity on external text (diffuse forgetting); |
| 691 | positive = adapter improved English modeling incidentally. Category |
| 692 | `calibration`; scored alongside `calibration_drift`. |
| 693 | - **Packaged public-domain corpus** |
| 694 | (`probes/_corpora/public_domain_en.txt`, ~14 KB): hand-assembled |
| 695 | from twelve US-public-domain passages (Lincoln, Emerson, Twain, |
| 696 | Thoreau, Austen, Darwin, Shakespeare, Franklin, Melville, |
| 697 | Dickinson, US Constitution, Federalist #10). Each passage carries |
| 698 | an inline `# -- source:` provenance comment identifying the work |
| 699 | and the "US public domain (pre-1929)" basis. Loader strips the |
| 700 | comments at read time so the probe sees only raw prose. |
| 701 | - **Null calibration**: `calibrate_spec()` returns a 4-chunk cheap |
| 702 | version so `null_adapter` produces per-kind stats without dominating |
| 703 | suite runtime. Sign-flipped z-score — lower raw delta is worse, so |
| 704 | the shared `z >= assert_z_gte` semantics read as "σ better than |
| 705 | noise" on external fluency. |
| 706 | - **`autogen` wiring**: emits an `external_ppl` suite entry (with |
| 707 | `corpus=public_domain_en`, `max_chunks=8`) whenever the source `.dlm` |
| 708 | has any PROSE section. Intent table explains it as the |
| 709 | "diffuse-forgetting complement to `calibration_drift`." |
| 710 | - **Prove-the-value test** |
| 711 | (`tests/unit/test_ext_ppl_vs_calibration_drift.py`): a dummy backend |
| 712 | that applies a uniform −0.3 nats/token drift to every pack item and |
| 713 | every corpus chunk — below `calibration_drift`'s 1.0-nat per-item |
| 714 | regression threshold, above its −0.5 mean-delta gate, but well below |
| 715 | `external_perplexity`'s −0.1 fixed-threshold. `calibration_drift` |
| 716 | reports PASS, `external_perplexity` reports FAIL. The two probes |
| 717 | measure different failure modes and do not substitute for each |
| 718 | other. |
| 719 | |
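The prove-the-value scenario above comes down to a few comparisons. The sketch below assumes the gate semantics as stated in the entry (a 1.0-nat per-item regression threshold and a −0.5 mean-delta gate for `calibration_drift`, a −0.1 fixed threshold for `external_perplexity`); the function names are hypothetical and the real probes do more than this.

```python
from statistics import fmean


def mean_delta_nats(base_logprobs: list[float], ft_logprobs: list[float]) -> float:
    """Mean per-token delta: fine-tuned logprob minus base logprob."""
    return fmean(ft - base for base, ft in zip(base_logprobs, ft_logprobs))


def calibration_drift_passes(
    item_deltas: list[float],
    per_item_regression: float = 1.0,
    mean_gate: float = -0.5,
) -> bool:
    # Assumption: an item regresses when it drops by at least per_item_regression nats.
    no_item_regressed = all(delta > -per_item_regression for delta in item_deltas)
    return no_item_regressed and fmean(item_deltas) >= mean_gate


def external_perplexity_passes(chunk_deltas: list[float], threshold: float = -0.1) -> bool:
    # Negative deltas mean the adapter raised perplexity on external prose.
    return fmean(chunk_deltas) >= threshold
```

A uniform −0.3 nats/token drift clears both `calibration_drift` gates but falls below the `external_perplexity` threshold, which is why the two probes disagree in that test.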
| 720 | ### Sprint 07 — Performance & caching |
| 721 | |
| 722 | Closes Audit 01 findings B19 (deferred per design note), |
| 723 | E-cache-opportunity. |
| 724 | |
| 725 | - **Forward-pass cache** (`backends/_instrumentation.py`): every |
| 726 | backend view now routes `next_token_dist` / `rolling_logprob` / |
| 727 | `logprob_of` through a bounded LRU keyed on `(op, view_id, |
| 728 | prompt_hash, top_k)`. Baseline A/B on a tiny 4-probe suite against |
| 729 | SmolLM2-135M on CPU: 2.33s → 1.76s (**25% wall-time reduction**, |
| 730 | forward passes 88 → 62, hit rate 30%). The audit's 18.5s Quillstone |
| 731 | suite has more overlap and should see larger gains; 25% is the |
| 732 | expected floor for suites that repeat prompts across probes. |
| 733 | - **Cache key includes `view_id`**: `"base"` / `"ft"` / |
| 734 | `"scaled_1.25"` / `"null_42"` are distinct namespaces. A future |
| 735 | toggle regression that fails to flip the adapter would surface as |
| 736 | wrong-side cached values immediately, not silently corrupt |
| 737 | divergence math. |
| 738 | - **Backend stats surface in the report**: `SuiteResult.backend_stats` |
| 739 | captures `cache_hits` / `cache_misses` / `forward_passes` / |
| 740 | `scoring_wall_s` / `hit_rate`. Terminal + markdown footers render |
| 741 | `cache: 26/88 = 30%` when stats are present. |
| 742 | - **Forward-pass tracing**: `sway run --trace <path.jsonl>` writes |
| 743 | one event per backend scoring call (probe / view_id / prompt_hash / |
| 744 | top_k / op / wall_ms / hit). Zero overhead when unset. |
| 745 | - **`concurrent_probes` scaffolding**: `spec.defaults.concurrent_probes: |
| 746 | int = 1` field plus `safe_for_concurrent_views: bool = False` |
| 747 | class attribute on HF / MLX / Dummy backends. The runner warns on |
| 748 | stderr when the user requested > 1 against an unsafe backend and |
| 749 | stays sequential. Custom backends that are already concurrency-safe |
| 750 | (e.g. a stateless hosted-API backend) can opt in without waiting |
| 751 | for the HF fix. |
| 752 | - **B19 design note**: `.docs/design/backend-concurrency.md` documents |
| 753 | current state, why v0.1 doesn't fix it, and two future paths |
| 754 | (per-thread lock vs per-worker pool) with recommendation. |
| 755 | - **Generation stays uncached**: `view.generate()` intentionally |
| 756 | bypasses the cache — probe-side callers vary |
| 757 | `(prompt, max_new_tokens, temperature, seed)` in ways that would |
| 758 | rarely collide, and caching sampled output would hide seed bugs |
| 759 | behind stale strings. |
| 760 | |
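The caching scheme above can be pictured as a small bounded LRU. This is a sketch under assumptions, not the code in `backends/_instrumentation.py`: `ForwardCache` and `get_or_compute` are hypothetical names, and the capacity is an arbitrary placeholder. It keeps the documented key shape `(op, view_id, prompt_hash, top_k)` and the hit/miss counters that feed the report footer.

```python
from collections import OrderedDict
from typing import Any, Callable


class ForwardCache:
    """Hypothetical bounded LRU for scoring calls."""

    def __init__(self, max_entries: int = 1024) -> None:
        self._store: OrderedDict[tuple, Any] = OrderedDict()
        self._max = max_entries
        self.hits = 0
        self.misses = 0

    def get_or_compute(
        self, op: str, view_id: str, prompt_hash: str, top_k: int,
        compute: Callable[[], Any],
    ) -> Any:
        # view_id is part of the key, so "base" and "ft" never share entries.
        key = (op, view_id, prompt_hash, top_k)
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._store[key]
        self.misses += 1
        value = compute()
        self._store[key] = value
        if len(self._store) > self._max:
            self._store.popitem(last=False)  # evict the least recently used entry
        return value

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Generation calls would simply not go through `get_or_compute`, in line with the "generation stays uncached" note.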
| 761 | ### Sprint 06 — CLI & report UX polish |
| 762 | |
| 763 | Closes Audit 01 findings D3, D4, D5, D6, D7, D8, D9, D10, D11, D12, |
| 764 | D13, D14, D15, B16, B17, B18. |
| 765 | |
| 766 | - **D3 — extras rollup footer:** terminal + markdown reports now end |
| 767 | with a single `pip install 'dlm-sway[...]'` line collecting every |
| 768 | extra mentioned in SKIP messages. Single rollup, no per-row scan. |
| 769 | - **D4 — `sway check` infers `--base`:** reads |
| 770 | `base_model_name_or_path` from the adapter's `adapter_config.json` |
| 771 | when `--base` is omitted; echoes "(inferred base model: …)" so the |
| 772 | user knows what got picked. Falls back to a clear required-flag |
| 773 | error when the field is missing. |
| 774 | - **D5 — `sway autogen` annotates the YAML:** generated specs carry a |
| 775 | header block (source `.dlm`, dlm_id, base, adapter, generation |
| 776 | timestamp, sway version) and a one-line `# intent` comment above |
| 777 | every probe entry. No new dep — pyyaml + a position-based |
| 778 | post-processor instead of `ruamel.yaml`. |
| 779 | - **D6 — `sway run --dry-run` + `sway list-probes`:** dry-run validates |
| 780 | the spec, prints a probe table (#, name, kind, category, enabled), |
| 781 | and exits 0 without building a backend. `list-probes` prints every |
| 782 | shipped probe kind with its category and the first line of its |
| 783 | docstring. |
| 784 | - **D7 — `sway doctor --json`:** machine-readable doctor payload |
| 785 | (`sway_version`, `python`, `platform`, `extras`) keyed by extra and |
| 786 | module. CI-grep-friendly. |
| 787 | - **D8 — `dlm_source` resolution warns instead of silently swallowing:** |
| 788 | when the spec sets `dlm_source` but the `[dlm]` extra isn't installed |
| 789 | (or the bridge errors), the runner emits a yellow stderr warning with |
| 790 | the install hint and continues with `sections=None`. No more silent |
| 791 | "why are my section probes SKIPping?" debugging. |
| 792 | - **D9 — markdown column parity with terminal:** the markdown probes |
| 793 | table now carries `score`, `raw`, `z`, `duration`, and `note` |
| 794 | columns. A `Top findings` section follows the table; a `Skipped |
| 795 | probes` section closes the report when extras are missing. |
| 796 | - **D10 — unified number formatters:** `format_score`, `format_raw`, |
| 797 | `format_z`, `format_duration_s` live in `suite/report.py` and are |
| 798 | the only numeric formatters used. Thousands separators above 1 000; |
| 799 | `—` glyph for `None` / non-finite. Single source kills cross-surface |
| 800 | drift. |
| 801 | - **D11 — `sway report --format` validated via `StrEnum`:** the |
| 802 | Typer-level enum rejects unknown formats with the standard |
| 803 | "Invalid value for '--format'" error. No more silent fallback to |
| 804 | the terminal renderer on a typo. |
| 805 | - **D12 — `sway check` verdict banner:** prints |
| 806 | `✅ adapter is +4.2σ above noise` (green ≥ 3σ), |
| 807 | `⚠️ adapter is +1.5σ above noise — marginal` (yellow ≥ 1σ), or |
| 808 | `❌ adapter is +0.3σ — indistinguishable from noise` (red) above |
| 809 | the full report. Calibrated on the delta_kl z-score; `check` |
| 810 | now runs `null_adapter` first so the z-score is available. |
| 811 | - **D13 — `sway diff` regression summary:** after per-probe deltas, |
| 812 | the diff prints `A→B: N regressed >0.10, M regressed >0.20, |
| 813 | composite Δ=±X.XX` color-coded by direction. |
| 814 | - **D14 — adapter paths with spaces render safely:** `_adapter_label` |
| 815 | wraps the path in double quotes when any whitespace is present. |
| 816 | - **D15 — long messages wrap, don't truncate:** the terminal probe |
| 817 | table column uses Rich's `overflow="fold"` instead of an 80-char |
| 818 | hard cut with an ellipsis. The full message is always visible. |
| 819 | - **B16 — single markdown renderer:** `report.from_json` round-trips |
| 820 | saved JSON back into the canonical dataclass pair, so |
| 821 | `sway report --format md` and `sway run --markdown` both flow |
| 822 | through `report.to_markdown` with identical output. The legacy |
| 823 | `_render_markdown_from_json` and `_render_junit_from_json` helpers |
| 824 | in `cli/commands.py` are deleted. |
| 825 | - **B17 — `None` scores render as `—`:** the unified formatters cover |
| 826 | every render path; no more `0.00` masking missing data. |
| 827 | - **B18 — `baseline` row labeled `(informational, weight=0)`:** in |
| 828 | both terminal and markdown component breakdowns, matching the |
| 829 | Sprint 03 explicit-weight-zero decision. |
| 830 | |
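As an illustration of the D12 banner tiers, the thresholds can be written as a small classifier. The names `verdict_level` and `format_sigma` are hypothetical; only the cut-offs (green at 3σ and above, yellow at 1σ and above, red otherwise) come from the entry above.

```python
def verdict_level(z: float) -> str:
    """Map a delta_kl z-score onto the banner color described for `sway check`."""
    if z >= 3.0:
        return "green"
    if z >= 1.0:
        return "yellow"
    return "red"


def format_sigma(z: float) -> str:
    """Signed one-decimal sigma string, e.g. 4.2 -> '+4.2σ'."""
    return f"{z:+.1f}σ"
```

The boundary values land in the higher tier, so exactly 3.0σ renders green and exactly 1.0σ renders yellow under this reading.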
| 831 | ### Sprint 05 — Probe quality & edge cases |
| 832 | |
| 833 | Closes Audit 01 findings B6, B7, B8, B9, B10, B11, B12, B13, B14, B20, |
| 834 | B21, B22. |
| 835 | |
| 836 | - **B6 — `tail_logprob` semantics:** the field is now `float | None` |
| 837 | with three discrete states: `None` (top-k covered the full vocab — no |
| 838 | tail to redistribute), `0.0` (a real tail underflowed below fp32), |
| 839 | and a measurable negative log-prob. HF, MLX, and dummy backends all |
| 840 | emit `None` when `k == vocab`. Preflight check guards against |
| 841 | non-finite `tail_logprob` only when it's a number. |
| 842 | - **B7 — probe validation before backend build:** new |
| 843 | `validate_all_probes(suite)` helper collects every spec error in one |
| 844 | pass; `_execute_spec` calls it *before* materializing the backend so |
| 845 | a typo in `kind:` surfaces immediately, and *all* typos surface in a |
| 846 | single error message. |
| 847 | - **B8 — `autogen` style prompts:** `style_fingerprint` no longer |
| 848 | receives the leading sentence of a prose section (which elicited doc |
| 849 | *content*, not stylistic voice). Replaced with a fixed |
| 850 | `_STYLE_ELICITATION_PROMPTS` set of 6 open-ended, content-neutral |
| 851 | prompts. |
| 852 | - **B9 — sentence-transformer caching:** `_load_embedder` is now |
| 853 | decorated with `functools.lru_cache(maxsize=4)`. Suites that run `adapter_revert` |
| 854 | back-to-back (multi-adapter diff, repeated probes) reuse the same |
| 855 | ~80 MB embedder instead of re-loading every probe call. |
| 856 | - **B10 — `_lcs_ratio` docstring rot:** the function name is a |
| 857 | historical misnomer (it's gestalt similarity via |
| 858 | `difflib.SequenceMatcher`, not LCS). Docstring rewritten to make the |
| 859 | contract explicit; rename deferred to v0.2 to preserve the public |
| 860 | API surface. |
| 861 | - **B11 — leakage adversarial perturbations:** added 4 new perturbations |
| 862 | (`synonym_swap` with a hand-curated 50-pair table, `clause_reverse`, |
| 863 | `prefix_inject`, `register_shift`) alongside the original three. |
| 864 | Default perturbation list is now all 7; the spec accepts any subset. |
| 865 | - **B12 — calibration pack expansion:** `BUILT_IN_PACK` grew from 30 |
| 866 | to **200** items (geography, natural sciences, arithmetic, language, |
| 867 | history, biology, technology, miscellaneous). All public-domain / |
| 868 | hand-composed grade-school facts — no third-party dataset license |
| 869 | attaches. `items_limit` docstring documents the new resolution |
| 870 | (~0.5 pp per regressed item vs the old ~3.3 pp). |
| 871 | - **B13 — tokenizer-aware `_stuffing`:** `prompt_collapse` now derives |
| 872 | its padding from the model's pad / unk / EOS token via the backend's |
| 873 | tokenizer, making the metric language-agnostic. The pre-B13 hardcoded |
| 874 | English string remains as the dummy-backend fallback and behind a |
| 875 | `legacy_stuffing: bool = False` spec field for one-release |
| 876 | backward-compat. |
| 877 | - **B14 — `preference_flip` per-triple error fence:** wrapped each |
| 878 | triple's `logprob_of` calls in `try/except ProbeError`. A single bad |
| 879 | triple no longer kills the whole batch; `evidence["dropped_triples"]` |
| 880 | + `dropped_reasons` surface the count + first 5 reasons. When *every* |
| 881 | triple raises, the probe routes to ERROR with a clear explanation. |
| 882 | - **B20 — custom backend protocol checks:** `_load_custom` now |
| 883 | isinstance-checks `NullCalibratedBackend` and |
| 884 | `ScalableDifferentialBackend` after the `DifferentialBackend` check. |
| 885 | The set of satisfied protocols is stamped onto the instance as |
| 886 | `__sway_protocols__: tuple[str, ...]` so the report can show which |
| 887 | features are available without re-checking. |
| 888 | - **B21 — `RunContext.null_stats` truly frozen:** the runner now wraps |
| 889 | the stats dict in `types.MappingProxyType` before threading it |
| 890 | through. The dataclass was already `frozen=True` but the dict was |
| 891 | mutable by reference; B21 makes the docstring's "frozen" claim |
| 892 | literally true. Field type widened from `dict[str, dict[str, float]]` |
| 893 | to `Mapping[str, Mapping[str, float]]`. |
| 894 | - **B22 — `ModelSpec.adapter` path normalization:** added a pydantic |
| 895 | `field_validator` that runs `Path.expanduser().resolve()` on the |
| 896 | field at spec-load time. Backends no longer re-do the work; the cache |
| 897 | key in `_null_cache.compute_key` is now stable regardless of how the |
| 898 | user spelled the path in YAML or on the CLI. |
| 899 | |
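The B21 point, that a `frozen=True` dataclass does not by itself freeze a dict it holds, can be shown in a few lines. This is a minimal sketch: `SketchContext` and `freeze_stats` are hypothetical stand-ins, and wrapping the inner dicts as well as the outer one is an assumption about how deep the freeze goes.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Mapping


def freeze_stats(stats: Mapping[str, Mapping[str, float]]) -> Mapping[str, Mapping[str, float]]:
    """Copy the nested dicts and expose them through read-only proxies."""
    return MappingProxyType(
        {kind: MappingProxyType(dict(inner)) for kind, inner in stats.items()}
    )


@dataclass(frozen=True)
class SketchContext:
    """Hypothetical stand-in for a run context that carries null stats."""

    null_stats: Mapping[str, Mapping[str, float]]
```

Because the proxies wrap copies, later mutation of the caller's original dict does not leak into the context, and item assignment on the proxy raises `TypeError`.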
| 900 | ### Sprint 04 — Integration & regression testing |
| 901 | |
| 902 | Closes Audit 01 findings C1, C2, C5, C6, C7, C8, C10, C11, C12, B3 |
| 903 | (test side), B15. |
| 904 | |
| 905 | - **Slow lane runs end-to-end** (C1): the `huggingface_hub` dev dep |
| 906 | added in S01 is verified; `pytest -m "slow or online"` now executes |
| 907 | every checked-in integration test from a clean clone. |
| 908 | - **HF backend coverage 21% → 91%** (C2) — measured combined fast + |
| 909 | slow lane against `dlm_sway.backends.hf`. New tests: |
| 910 | - `tests/integration/test_hf_adapter_toggle.py` extended with a |
| 911 | bit-identical ft → base → ft roundtrip (B15 mitigation). |
| 912 | - `tests/integration/test_hf_scaled_adapter.py`: λ sweep |
| 913 | monotonicity + `LoraLayer.scaling[key]` restoration on clean exit |
| 914 | AND on exception. |
| 915 | - `tests/integration/test_hf_null_adapter.py`: same-seed |
| 916 | determinism, different-seeds divergence, original adapter |
| 917 | restoration on clean exit AND on exception. |
| 918 | - `tests/integration/test_hf_scoring.py`: `logprob_of` (incl. |
| 919 | zero-token-completion → ProbeError), `rolling_logprob`, |
| 920 | `next_token_dist`, `_HFView.generate` (greedy + sampled-with-seed |
| 921 | determinism). |
| 922 | - `tests/unit/test_backend_hf_helpers.py`: direct unit coverage on |
| 923 | `_resolve_dtype` and `_detect_device` so dtype regressions are |
| 924 | caught in the fast lane. |
| 925 | - **MLX smoke test** (C5): `tests/integration/test_mlx_smoke.py` |
| 926 | exercises `MLXDifferentialBackend` on darwin-arm64 with a small |
| 927 | LoRA adapter. Skips cleanly on non-darwin / non-arm64 / no-mlx_lm |
| 928 | / missing-fixture. Reproducible adapter builder ships at |
| 929 | `tests/fixtures/build_mlx_adapter.py` (run once on a Mac to |
| 930 | populate the fixture directory). |
| 931 | - **`sway gate` exit code pinned** (C6): |
| 932 | `tests/integration/test_sway_gate_exit_code.py` covers PASS |
| 933 | (exit 0), FAIL verdict (exit 1), and below-threshold-with-passing |
| 934 | verdicts (exit 1) via Typer's `CliRunner`. |
| 935 | - **`dlm` import ban regression-guarded** (C7): |
| 936 | `tests/unit/test_dlm_not_imported.py` patches every `dlm.*` entry in |
| 937 | `sys.modules` to `None` and asserts the dummy suite runs end-to-end |
| 938 | without ImportError, plus that `sway autogen` surfaces a clean |
| 939 | install-hint error rather than a stack trace. |
| 940 | - **Pathological probe coverage** (C8 + B3 test side): |
| 941 | `tests/unit/test_probe_adapter_ablation.py` now drives |
| 942 | monotonically-decreasing curves through the helper and pins |
| 943 | probe-level `evidence["saturation_reason"]` for flat / found / |
| 944 | overshoot-with-dip / non_monotonic shapes via a monkeypatched |
| 945 | `divergence`. |
| 946 | - **WARN-branch numerical formula pinned** (C10): |
| 947 | `test_warn_branch_score_formula_pinned` in |
| 948 | `tests/unit/test_probe_preference_flip.py` asserts the exact |
| 949 | `score = 0.5 + mean_delta / 4.0` formula with a hand-computed |
| 950 | expected value. |
| 951 | - **Disjoint top-k divergence** (C12): |
| 952 | `test_disjoint_top_k_supports_produce_finite_divergence` in |
| 953 | `tests/unit/test_divergence.py` covers the |
| 954 | `aligned_probs` tail-redistribution path against fully-disjoint |
| 955 | base / ft supports — JS comes out finite and approaches its |
| 956 | theoretical ln(2) bound. |
| 957 | - **Report schema snapshots** (C11): `tests/unit/test_report_snapshot.py` |
| 958 | byte-compares `to_json` / `to_markdown` / `to_junit` against |
| 959 | checked-in snapshots under `tests/snapshots/`. For intentional schema |
| 960 | bumps, run `SWAY_UPDATE_SNAPSHOTS=1 uv run pytest …` and commit the |
| 961 | updated files. No new dep — hand-rolled diff helper. |
| 962 | - **CI workflow shipped**: `.github/workflows/ci.yml` runs the fast |
| 963 | lane (unit + lint + mypy) on every push / PR, and the slow lane |
| 964 | (integration, HF backend) on schedule (nightly 07:00 UTC), manual |
| 965 | dispatch, push to main, and PRs that touch `src/dlm_sway/backends/`, |
| 966 | `tests/integration/`, or `pyproject.toml`. |
| 967 | |
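The import-ban guard (C7) relies on a standard-library behavior: a `None` entry in `sys.modules` makes any import of that name fail. A minimal sketch of the mechanism, with the hypothetical helper name `import_fails_when_blocked`, and blocking only a single module name rather than every `dlm.*` entry:

```python
import importlib
import sys
from unittest import mock


def import_fails_when_blocked(module_name: str) -> bool:
    """Return True if importing module_name raises while it is mapped to None."""
    # patch.dict restores sys.modules to its previous state on exit.
    with mock.patch.dict(sys.modules, {module_name: None}):
        try:
            importlib.import_module(module_name)
        except ImportError:
            return True
        return False
```

Inside the patched block, code under test that tries the import sees an `ImportError`, so a suite that still runs end-to-end demonstrates that the dependency is truly optional.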
| 968 | ### Sprint 03 — Documentation truth & dead code |
| 969 | |
| 970 | Closes Audit 01 findings P03, P04, P05, P07 (doc), P08, P09, P10, P14, |
| 971 | P15, P17, P18. |
| 972 | |
| 973 | - **Determinism is now wired** (P09): the suite runner calls |
| 974 | `dlm_sway.core.determinism.seed_everything(spec.defaults.seed)` before |
| 975 | any backend work runs. The achieved determinism class (`strict` / |
| 976 | `best_effort` / `loose`) is captured in `SuiteResult.determinism` and |
| 977 | surfaces in the report footer + JSON payload + markdown header. |
| 978 | - **`defaults.differential: false` works end-to-end** (P14): new |
| 979 | `backends/two_model.py` with `TwoModelDifferential` wrapper + |
| 980 | `build_two_separate(spec_models)` helper. `_execute_spec` routes to |
| 981 | the wrapper when the flag is false. Doubles memory; primarily for |
| 982 | custom backends that can't toggle adapters in place. |
| 983 | - **`score_weights` overridable from YAML and CLI** (P15): new |
| 984 | `SuiteDefaults.score_weights` field with subset-validating pydantic |
| 985 | field validator (rejects unknown categories, negative weights, all-zero |
| 986 | weights). New `--weights k=v,k=v` flag on `sway run` and `sway gate`. |
| 987 | Partial overrides merge with `DEFAULT_COMPONENT_WEIGHTS` so users |
| 988 | rarely have to respecify all five categories. |
| 989 | - **`baseline` row labeled `(informational)` with weight 0.0 explicit** |
| 990 | (P08, B18): added to `DEFAULT_COMPONENT_WEIGHTS` so the row appears |
| 991 | in reports for transparency but contributes nothing to the composite. |
| 992 | Terminal renderer adds the `(informational)` annotation column; |
| 993 | markdown table adds a `weight` column with the same label. |
| 994 | - **Extended (9-dim) style fingerprint when `[style]` is installed** |
| 995 | (P10): `style_fingerprint` now actually uses spaCy POS-tagging and |
| 996 | textstat syllable counting. Three new dims appended to the existing 6: |
| 997 | passive-voice rate, POS 4-gram entropy (Shannon, bits), syllables per |
| 998 | word. Spec field `extended: "auto" | "on" | "off"` (default `"auto"`) |
| 999 | controls the path. `evidence["schema_version"]` bumped to `2` so |
| 1000 | snapshot consumers can branch on dimensionality. |
| 1001 | - **Stale "Known gaps" entries refreshed** (P03, P04, P05): removed |
| 1002 | "null_adapter not yet wired" (delivered in S01/S02), "custom backend |
| 1003 | stubbed" (it's full), "MLX paths raise" (rewritten to describe the |
| 1004 | real limitation: two model copies because `mlx_lm` has no runtime |
| 1005 | adapter toggle). |
| 1006 | - **README adds determinism + weights + differential documentation** |
| 1007 | and a calibration paragraph that points at the per-kind null matrix. |
| 1008 | - **`delta_kl` docstring rot fixed** (P17): dropped the dead |
| 1009 | ``:mod:`dir``` reference. |
| 1010 | - **Meta-test `test_no_dead_options.py`** (P14/P15 regression guard): |
| 1011 | greps the source tree for documented spec/CLI option names and |
| 1012 | asserts each has a consumer outside its declaration site. Catches the |
| 1013 | exact "documented but unused" pattern Audit 01 flagged. |
| 1014 | |
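The weight-override rules from the P15 bullet can be sketched as a parse step plus a validating merge. Everything named here is hypothetical: `SKETCH_DEFAULT_WEIGHTS` uses placeholder category names and values (only `baseline` at 0.0 is taken from the entry), and applying the all-zero check to the merged result is an assumption.

```python
# Placeholder defaults; the real categories and values live in DEFAULT_COMPONENT_WEIGHTS.
SKETCH_DEFAULT_WEIGHTS = {
    "adherence": 0.3,
    "attribution": 0.3,
    "calibration": 0.2,
    "ablation": 0.2,
    "baseline": 0.0,
}


def parse_weights(text: str) -> dict[str, float]:
    """Parse a 'k=v,k=v' string such as the value passed to --weights."""
    pairs = (item.split("=", 1) for item in text.split(",") if item.strip())
    return {key.strip(): float(value) for key, value in pairs}


def merge_weights(
    overrides: dict[str, float],
    defaults: dict[str, float] = SKETCH_DEFAULT_WEIGHTS,
) -> dict[str, float]:
    unknown = sorted(set(overrides) - set(defaults))
    if unknown:
        raise ValueError(f"unknown categories: {unknown}")
    if any(weight < 0 for weight in overrides.values()):
        raise ValueError("weights must be non-negative")
    # Partial overrides merge onto the defaults.
    merged = {**defaults, **overrides}
    if all(weight == 0 for weight in merged.values()):
        raise ValueError("at least one weight must be positive")
    return merged
```

A partial override such as `adherence=0.5` leaves the remaining categories at their defaults, which is the "rarely have to respecify" behavior the entry describes.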
| 1015 | ### Sprint 02 — Universal z-score calibration |
| 1016 | |
| 1017 | Closes Audit 01 findings P02 (delivery), B2, C9. |
| 1018 | |
| 1019 | - **`NullAdapterProbe` is now a per-kind calibration matrix**: iterates |
| 1020 | every downstream numeric kind in the suite (or an explicit |
| 1021 | `calibrate_kinds` list), runs a miniature version of each through the |
| 1022 | `NullCalibrationBackendProxy` for N seeds, and publishes |
| 1023 | `{kind: {mean, std, n}}` under `evidence["null_stats"]`. The old |
| 1024 | "publishes two keys only" implementation is gone (closes B2). |
| 1025 | - **`NullCalibrationBackendProxy`** (`_null_proxy.py`): swaps |
| 1026 | `as_finetuned()` to yield `as_null_adapter(seed)` so every numeric |
| 1027 | probe's own math produces the "what does my metric look like when the |
| 1028 | fine-tune is structural noise?" distribution without a bespoke code |
| 1029 | path per probe. |
| 1030 | - **`Probe.calibrate_spec(ctx)` classmethod** plus |
| 1031 | `SENTINEL_PROMPTS` / `SENTINEL_DOC` shared constants in |
| 1032 | `probes/base.py`. Each numeric probe overrides `calibrate_spec` to |
| 1033 | return a small, cheap spec for calibration; probes that can't be |
| 1034 | meaningfully calibrated (e.g. `adapter_revert` needs an embedder, |
| 1035 | `adapter_ablation` needs `as_scaled_adapter`, `prompt_collapse` |
| 1036 | can't fit an exponential decay to null noise) opt out by returning |
| 1037 | `None` and surface `(no calibration for <kind>)` in the report. |
| 1038 | - **Shared `_zscore` helpers** (`probes/_zscore.py`): |
| 1039 | `z_score(raw, stats)`, `verdict_from_z(z, threshold)`, |
| 1040 | `score_from_z(z)`, `no_calibration_note(kind)`. Every numeric probe |
| 1041 | now flows through these — no bespoke `(raw - mean) / std` math left |
| 1042 | in any probe file. Includes the `MIN_STD = 1e-6` floor so a |
| 1043 | degenerate null distribution (e.g. a single-seed calibration) can't |
| 1044 | produce an infinite z-score (closes C9). |
| 1045 | - **All 10 numeric probes thread `get_null_stats(ctx, self.kind)`**: |
| 1046 | `delta_kl`, `adapter_revert`, `prompt_collapse`, |
| 1047 | `section_internalization`, `paraphrase_invariance`, |
| 1048 | `preference_flip`, `style_fingerprint`, `calibration_drift`, |
| 1049 | `leakage`, `adapter_ablation`. Each has an `assert_z_gte: float = 3.0` |
| 1050 | that is *preferred* when stats exist. Lower-is-better probes |
| 1051 | (`adapter_revert`, `calibration_drift`, `leakage`) sign-flip the |
| 1052 | z internally so the shared `z >= threshold` PASS rule still reads as |
| 1053 | "significantly better than null". Two probes (`paraphrase_invariance`, |
| 1054 | `calibration_drift`) keep their intent-aware / compound thresholds |
| 1055 | as the no-calibration fallback. |
| 1056 | - **On-disk null-stats cache** (`probes/_null_cache.py`, |
| 1057 | `backends/hf.py`): HF backend exposes `cache_identity()`; |
| 1058 | `NullAdapterProbe` hashes `(backend_identity, runs, init_scale, |
| 1059 | seed_base, top_k, kinds)` into a stable filename under |
| 1060 | `~/.dlm-sway/null-stats/<key>.json` (XDG-respecting). Cache is |
| 1061 | best-effort — a missing / malformed file rebuilds. Disable per-suite |
| 1062 | with `cache: false` in the spec, or globally with |
| 1063 | `SWAY_DISABLE_NULL_CACHE=1`. The dummy backend doesn't expose a |
| 1064 | cache identity, so tests never touch disk unless they opt in. |
| 1065 | - **Report shows z-scores in every numeric probe row**: the terminal |
| 1066 | renderer already had a `z` column; the markdown renderer now does |
| 1067 | too. Rows that fell back to fixed thresholds carry |
| 1068 | `(no calibration for <kind>)` directly in the message. |
| 1069 | |
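The shared z-score flow described in this sprint can be sketched as below. It reflects the Sprint 02 behavior only (a `MIN_STD` floor, a `z >= threshold` PASS rule, and an internal sign flip for lower-is-better probes); `sketch_z_score` and `sketch_verdict` are hypothetical names, and the `lower_is_better` flag is an assumption about how a probe might express its sign convention.

```python
from __future__ import annotations

MIN_STD = 1e-6  # floor from the entry above, so a zero-spread null stays finite


def sketch_z_score(
    raw: float,
    stats: dict[str, float] | None,
    lower_is_better: bool = False,
) -> float | None:
    """(raw - mean) / std with a std floor; None when no calibration exists."""
    if stats is None:
        return None
    std = max(stats["std"], MIN_STD)
    z = (raw - stats["mean"]) / std
    # Flip the sign so "z >= threshold" always reads as "better than null".
    return -z if lower_is_better else z


def sketch_verdict(z: float, threshold: float = 3.0) -> str:
    return "PASS" if z >= threshold else "FAIL"
```

When `sketch_z_score` returns `None`, a probe would fall back to its fixed thresholds and carry the `(no calibration for <kind>)` note.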
| 1070 | ### Sprint 01 — Finite safety & verdict integrity |
| 1071 | |
| 1072 | Closes Audit 01 findings B1, B3 (fix), B4, B5, C3, C4, D1, D2. |
| 1073 | |
| 1074 | - **`safe_finalize` helper** in `core/result.py`: every numeric probe now |
| 1075 | routes its final `ProbeResult` through this guard. A non-finite |
| 1076 | `raw` (or any explicitly-critical field) auto-converts to |
| 1077 | `Verdict.ERROR`; non-finite auxiliaries are nulled out and recorded |
| 1078 | in `evidence["defensively_nulled"]`. |
| 1079 | - **`_divergence` non-finite rejection**: `aligned_probs`, `kl`, `js` |
| 1080 | now raise `ProbeError` on NaN/inf inputs instead of silently |
| 1081 | producing garbage. JS results are bounds-checked against the |
| 1082 | theoretical `[0, ln 2]`; out-of-bound results raise rather than |
| 1083 | ship. (Pins the +11639σ bug — JS exceeded ln 2 nats by 19×.) |
| 1084 | - **`PreflightCheckable` Protocol**: HF and dummy backends gain |
| 1085 | `preflight_finite_check()` that runs one forward pass per view with |
| 1086 | a sentinel prompt and rejects backends whose logprobs are non-finite. |
| 1087 | - **Suite runner preflight gate**: every `sway run` calls |
| 1088 | `backend.preflight_finite_check()` before the probe loop. Failure |
| 1089 | emits a single synthetic ERROR probe and skips all configured |
| 1090 | probes; `sway gate` exits non-zero. Disable with `skip_preflight=True` |
| 1091 | for sub-second test suites. |
- **`style_fingerprint` zero-vector fix (B4)**: a fine-tuned model that
  produces empty / whitespace-only generations no longer reports a
  spurious "+0.82 shift toward doc" PASS. The `_cosine_shift` math is
  replaced by `_projection_shift = (ft-base)·(doc-base) / ||doc-base||²`
  which goes to zero when ft equals base.
- **`_saturation_lambda` full-range search (B3)**: the adapter-ablation
  saturation detector now searches the entire λ range (not just λ ≤ 1.0)
  with `max(divs)` as the reference, and returns a typed reason
  (`"found"` / `"non_monotonic"` / `"flat_curve"` / `"below_floor"`)
  so a flat NaN-adapter curve is distinguishable from a healthy
  saturating one.
- **Property-based divergence tests** (`hypothesis`): symmetry,
  `JS(p,p) = 0`, `JS ≤ ln 2`, `KL(p,p) = 0`, KL non-negativity. Plus
  explicit non-finite-raises tests pinning every entry point.
- **End-to-end NaN regression** (`tests/integration/test_nan_adapter_regression.py`,
  slow+online): builds a real adapter with all-NaN weights via
  PEFT, persists it through safetensors, then asserts the HF backend's
  preflight rejects it and the suite produces ERROR — not the
  +11639σ headline the audit caught.
- Added `huggingface_hub` to dev dep group so integration tests can
  resolve the `tiny_model_dir` fixture (closes C1 ahead of Sprint 04).

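
The projection formula quoted in the `style_fingerprint` entry can be illustrated with a minimal sketch. It is not the project's code: the function name `projection_shift`, the plain-list vectors, and the zero-denominator fallback are assumptions made for the example.

```python
from typing import Sequence


def projection_shift(
    base: Sequence[float], ft: Sequence[float], doc: Sequence[float]
) -> float:
    """Fraction of the base->doc displacement covered by the base->ft move.

    Computes (ft - base) . (doc - base) / ||doc - base||^2.
    """
    direction = [d - b for d, b in zip(doc, base)]
    moved = [f - b for f, b in zip(ft, base)]
    norm_sq = sum(x * x for x in direction)
    if norm_sq == 0.0:
        # Assumption: with doc == base there is no direction to project onto,
        # so report no shift rather than dividing by zero.
        return 0.0
    return sum(m * d for m, d in zip(moved, direction)) / norm_sq


base = [1.0, 2.0]
doc = [3.0, 6.0]
print(projection_shift(base, base, doc))        # ft equals base
print(projection_shift(base, doc, doc))         # ft equals doc
print(projection_shift(base, [2.0, 4.0], doc))  # ft halfway between
```

Under this definition the shift is 0 when the fine-tuned fingerprint has not moved from base, 1 when it lands on the document fingerprint, and negative when it moves away from the document.
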
## 0.1.0.dev0 — 2026-04-20

Initial pre-alpha. Full 11-primitive battery shipped.

### Primitives

- **Adherence**
  - `delta_kl` — mean JS/KL divergence between base and fine-tuned next-token distributions
  - `adapter_revert` — reversion under adversarial paraphrase (needs `sway-eval[semsim]`)
  - `prompt_collapse` — exponential-decay fit of divergence over context length
- **Attribution**
  - `section_internalization` *(flagship)* — per-section `effective_sis` with leak check
  - `paraphrase_invariance` — memorization vs. generalization, intent-aware
  - `preference_flip` — DPO/ORPO chosen/rejected margin inversion
- **Calibration**
  - `style_fingerprint` — 6-dim numpy-only stylistic shift vs. document
  - `calibration_drift` — general-knowledge regression on a packaged 30-item pack
  - `leakage` — greedy LCS recall + perturbation fragility
- **Ablation**
  - `adapter_ablation` *(signature primitive)* — λ-scaled divergence curve with linearity, saturation, overshoot metrics
- **Baseline**
  - `null_adapter` — per-kind null distribution matrix for z-score calibration (wired end-to-end in Unreleased)

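
As a rough illustration of the quantity behind `delta_kl`, the sketch below averages Jensen-Shannon divergence over paired next-token distributions and applies the kind of finite and `[0, ln 2]` checks described under Sprint 01. All names here (`js`, `mean_js`, the local `ProbeError` stand-in, the `tol` parameter) are assumptions for the example, not the project's API.

```python
import math
from typing import Sequence

LN2 = math.log(2.0)


class ProbeError(ValueError):
    """Local stand-in for the project's error type (hypothetical)."""


def _validate(name: str, probs: Sequence[float]) -> None:
    # Reject NaN/inf and negative entries up front.
    if any((not math.isfinite(x)) or x < 0.0 for x in probs):
        raise ProbeError(f"{name} contains non-finite or negative entries")


def _kl(p: Sequence[float], q: Sequence[float]) -> float:
    # KL(p || q) in nats; terms with p_i == 0 contribute nothing.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def js(p: Sequence[float], q: Sequence[float], tol: float = 1e-9) -> float:
    """Jensen-Shannon divergence in nats, bounds-checked against [0, ln 2]."""
    if len(p) != len(q):
        raise ProbeError("distributions differ in length")
    _validate("p", p)
    _validate("q", q)
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    value = 0.5 * _kl(p, m) + 0.5 * _kl(q, m)
    if not math.isfinite(value) or value < -tol or value > LN2 + tol:
        raise ProbeError(f"JS {value!r} outside [0, ln 2]")
    return min(max(value, 0.0), LN2)  # absorb rounding noise at the edges


def mean_js(base_dists, ft_dists) -> float:
    """Average JS over aligned (base, fine-tuned) distributions."""
    values = [js(b, f) for b, f in zip(base_dists, ft_dists)]
    if not values:
        raise ProbeError("no positions to average")
    return sum(values) / len(values)
```

Raising on an out-of-range value, rather than clamping it silently, means unnormalized or corrupted inputs surface as an error instead of an inflated divergence.
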
### Infrastructure

- `DifferentialBackend` + `ScalableDifferentialBackend` protocols
- HuggingFace + PEFT backend with `disable_adapter` / `set_adapter` toggling and LoRA-scale mutation
- Dummy backend for unit tests (canned responses + linear-blend scalable mode)
- YAML spec loader, composite score (four-category weighted), rich terminal + JSON + JUnit + Markdown reports
- Typer CLI: `run`, `gate`, `check`, `diff`, `autogen`, `doctor`, `report`
- `.dlm` bridge (`dlm-sway[dlm]`): resolver + full-battery autogen
- Matplotlib visualizations (`dlm-sway[viz]`): SIS bar chart, ablation curve, KL histogram

### Known gaps

- MLX backend loads **two** model copies in memory (base + adapter-fused) because `mlx_lm` has no runtime adapter toggle. Memory footprint is ~2x the HF path; fine for the small (<3B) models MLX typically runs.
- MLX requires a pre-converted `.npz` adapter; raw PEFT safetensors are rejected by `mlx_lm.load`. A PEFT-to-MLX converter is a future milestone.
- PyPI publication of the `dlm-sway` wheel is pending a clean CI release workflow.