Changelog
Unreleased
Sprint 36 — sway serve warm-backend daemon
Audit 03 H4. sway run cold-loads the HF backend each invocation
(~15s on a 1.5B model: weights, KV cache, deterministic-mode setup).
For interactive flows — notebooks, the sway watch retrain loop,
the live HTML report — that startup dwarfs the actual scoring
cost. sway serve is a long-running FastAPI daemon that loads the
backend once and keeps it warm across requests.
New CLI (`sway serve`). Defaults: `--host 127.0.0.1 --port 8787 --max-loaded-models 2`. The daemon refuses to bind a non-loopback
interface unless `--api-key <token>` is passed; every non-`/health`
request must then carry `Authorization: Bearer <token>`.
Endpoints.
- `GET /health` — uptime + the list of currently-warm models.
- `GET /stats` — request count + mean latency + cache size.
- `POST /run` — body `{spec: SwaySpec}`; returns the same JSON shape `sway run --json-out` would write, plus `request_seconds` for the daemon's measured execution time.
- `POST /score` — same as `/run` with an optional `probe_names` filter; returns just the per-probe entries with no folded `SwayScore`.
Backend cache. LRU keyed on `(kind, base, adapter, dtype, device)`. Capped at `--max-loaded-models`; loading a third distinct
model LRU-evicts the oldest, calling `backend.close()` to release
GPU memory. Single-flight: concurrent requests for the same key
serialize at the loader instead of building twice.
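The capped-LRU-plus-single-flight behavior can be sketched roughly like this (illustrative names, not dlm-sway's actual internals):

```python
import threading
from collections import OrderedDict

class BackendCache:
    """Sketch: LRU backend cache with single-flight loading."""

    def __init__(self, max_loaded, loader):
        self._max = max_loaded
        self._loader = loader          # key -> backend object
        self._cache = OrderedDict()    # insertion order == recency order
        self._lock = threading.Lock()
        self._inflight = {}            # key -> Event other waiters block on

    def get(self, key):
        while True:
            with self._lock:
                if key in self._cache:
                    self._cache.move_to_end(key)   # mark most-recently-used
                    return self._cache[key]
                ev = self._inflight.get(key)
                if ev is None:
                    self._inflight[key] = threading.Event()
                    break                          # this caller owns the load
            ev.wait()                              # someone else is loading it
        backend = self._loader(key)                # load outside the lock
        with self._lock:
            self._cache[key] = backend
            while len(self._cache) > self._max:    # evict oldest, free memory
                _, old = self._cache.popitem(last=False)
                old.close()
            self._inflight.pop(key).set()
        return backend
```

The load happens outside the lock so a slow model load doesn't block cache hits for already-warm keys.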
Python SDK. `dlm_sway.serve.client.ServeClient(url)` exposes
`health()`, `stats()`, `run(spec)`, `score(spec, probe_names=...)`.
Stateless (no persistent connection pool); raises
`ServeClientError` (subclass of `SwayError`) on transport failure
or non-2xx responses.
Auth posture (v1). Defaults to no auth on loopback — the
threat model on a single-user dev box is "did I bind 0.0.0.0 by
accident". The CLI hard-refuses --host 0.0.0.0 without an API
key. Full OAuth is deferred.
Sprint 33 — training_drift probe (cross-repo, reads dlm loss curves)
Closes the X2 "training_drift probe" backlog item. Sister to S25
gradient_ghost: where the ghost reads optimizer state at
end-of-training, training_drift reads the loss curve during
training. Both are pre-run, no model load, no backend required.
New probe (kind: `training_drift`, category: `calibration`).
For a dlm store, the probe parses every `train-*.jsonl` under
`<store_path>/logs/`, dedupes resumed runs (latest occurrence wins),
and computes four metrics:
- `final_loss` — last recorded step's loss.
- `convergence_ratio` — `final_loss / initial_loss`.
- `smoothness` — `1 − var(Δloss) / var(loss)`, clipped to `[0, 1]`.
- `instability_events` — count of loss-increase events whose magnitude exceeds the local typical movement scale (median absolute delta in a centered window). NaN losses count as one instability each, then forward-fill so downstream stats stay finite.

Verdict PASS when all three thresholds clear (`smoothness ≥ 0.7`, `convergence_ratio ≤ 0.7`, `instability_events ≤ 0`). Otherwise WARN with each failed threshold listed in the message.
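A minimal sketch of the three simple metrics (the spike-count metric is covered separately below). Returning 0.0 smoothness for a zero-variance curve is an assumption, not confirmed behavior:

```python
import statistics

def drift_metrics(losses):
    """Sketch of final_loss / convergence_ratio / smoothness on a loss list."""
    deltas = [b - a for a, b in zip(losses, losses[1:])]
    final_loss = losses[-1]
    convergence_ratio = final_loss / losses[0]
    var_loss = statistics.pvariance(losses)
    var_delta = statistics.pvariance(deltas)
    # Degenerate zero-variance curve -> 0.0 is an assumption for this sketch.
    smoothness = 0.0 if var_loss == 0 else max(0.0, min(1.0, 1 - var_delta / var_loss))
    return final_loss, convergence_ratio, smoothness
```

On a smooth exponential decay both gated ratios clear their thresholds comfortably, which is the behavior the fixed-threshold verdict relies on.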
Spike heuristic note. The sprint plan called for `|Δloss| > 3 · rolling_std`. The implementation rejects that: on a smooth exponential
decay, within-window std-of-deltas stays tiny while absolute deltas
are large — every step trips the threshold (verified during dev: a
60-step smooth curve flagged 59 false-positive spikes). Replaced
with: count positive deltas (loss going up — the semantically
meaningful instability) that exceed `sigma · median(|Δ|)` in a
centered window. Robust to scale changes across training; loss
going down faster than usual is no longer mistaken for instability.
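The replacement heuristic can be sketched as follows (the `sigma` and `window` parameter names and defaults are illustrative):

```python
import statistics

def count_instability_events(losses, sigma=3.0, window=11):
    """Sketch: count positive loss deltas exceeding sigma * median(|delta|)
    over a centered window of deltas."""
    deltas = [b - a for a, b in zip(losses, losses[1:])]
    half = window // 2
    events = 0
    for i, d in enumerate(deltas):
        if d <= 0:
            continue                  # only loss-going-up counts as instability
        lo, hi = max(0, i - half), min(len(deltas), i + half + 1)
        scale = statistics.median(abs(x) for x in deltas[lo:hi])
        if scale > 0 and d > sigma * scale:
            events += 1
    return events
```

Because the scale is a median of absolute deltas rather than a std of deltas, a smooth decay (all deltas negative) produces zero events, while a single upward jump against a quiet background is flagged.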
No null calibration. Mirrors prompt_collapse and
multi_turn_coherence_decay: a null adapter has no loss curve,
the null distribution of "smoothness on a noise adapter" is
undefined. Fixed-threshold verdicts; users override per-spec.
Log-format note. The sprint plan said dlm writes per-step JSONs at
`logs/train_step_*.json`. Reality (verified against
`~/.dlm/store/`): one mixed JSONL per run at
`logs/train-NNNNNN-YYYYMMDDTHHMMSS.jsonl` containing banner +
delta + step + run_complete records. The probe filters for
`{"type": "step"}` lines and reads `step` + `loss`. A sibling
`*.summary.json` carries run aggregates we don't consume — the
curve is richer.
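The filter-and-read step can be sketched like this (function name illustrative; NaN-to-`+inf` and truncated-tail handling per the robustness notes below):

```python
import json
import math

def parse_steps(jsonl_text):
    """Sketch: keep only {"type": "step"} records, read (step, loss)."""
    steps = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue                 # truncated tail from a crashed-mid-line trainer
        if rec.get("type") != "step":
            continue                 # banner / delta / run_complete records
        loss = rec.get("loss")
        if loss is None or (isinstance(loss, float) and math.isnan(loss)):
            loss = math.inf          # keep downstream stats finite, flaggable
        steps.append((rec["step"], loss))
    return steps
```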
Robustness:
- Resumed runs (overlapping step numbers across multiple jsonls): dedupe-by-keep-latest mirrors `dlm metrics` semantics.
- Truncated JSONL tail (crashed-mid-line trainer): partial line is skipped; valid lines still consumed.
- NaN losses are recorded as `+inf` so the spike detector flags them without numpy NaN poisoning the rest of the pipeline.
- A 1500-step run downsamples to ≤ 512 evidence points (uniform stride; first + last always preserved).
- Pathological: every step recorded NaN → `smoothness = 0.0`, `instability_events = num_steps`, verdict WARN.
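The uniform-stride downsampler with pinned endpoints can be sketched as (helper name and cap default illustrative):

```python
def downsample(points, cap=512):
    """Sketch: uniform stride to at most `cap` points, first + last preserved."""
    if len(points) <= cap:
        return list(points)
    stride = (len(points) - 1) / (cap - 1)
    idx = sorted({round(i * stride) for i in range(cap)})
    idx[0], idx[-1] = 0, len(points) - 1   # pin the endpoints
    return [points[i] for i in idx]
```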
Implementation:
- `probes/training_drift.py` — spec, probe, JSONL parser, metric helpers, verdict mapping, downsampler. 365 LOC.
- `probes/__init__.py` — registers the new probe.
Test surface:
- `tests/unit/test_probe_training_drift.py` — 30 unit tests covering: skip paths (no store_path / no logs dir / no jsonl / too few steps), end-to-end with smooth & spiky curves, resume-deduplication, downsampling, corrupt-first-line ERROR, truncated-tail tolerance, pure-math metric helpers (smooth/constant/NaN/all-NaN/zero-initial), spike-detector heuristic (loss-up vs loss-down semantics, short curves, empty), verdict mapping, and JSONL parsing (filter non-step records, missing keys, NaN encoding, missing files).
- `tests/fixtures/dlm_train_log_fixture.jsonl` — captured-from-disk shape: banner + delta record + 30 step records + run_complete. If this fixture's parse breaks, dlm's log format has shifted and the probe needs an update — the test catches that explicitly.
README gains a training_drift paragraph in "Pre-run
diagnostics" alongside gradient_ghost. The probe table at "Why
it exists" picks up training_drift and the previously-missed
gradient_ghost entry under Calibration.
Sprint 30 — multi_turn_coherence_decay probe
Closes the P2 "multi_turn_coherence_decay probe" backlog item. Sway
had zero coverage of multi-turn behavior before this — every
shipped adherence probe was single-turn. Adapters that pass
delta_kl cleanly frequently degrade by turn 2 or 3 of real
dialogue, where the model's own previous responses enter the
context window and create compounding drift. The new probe is the
first that catches that failure mode.
New probe (kind: `multi_turn_coherence_decay`, category: `adherence`).
For each prompt the probe greedy-generates ft's turn-1 response,
then rolls a multi-turn synthetic dialogue with cycled generic
follow-ups ("Continue.", "Tell me more.", …). At each turn 2..N
both views see the same ft-grounded chat history; the probe
computes KL(base || ft) at the next-token position. The
per-turn KL series is fit to `kl = a · exp(-b · turn)` and
the probe reports `half_life_turns = ln(2) / b`.
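The fit reduces to log-linear least squares; a minimal sketch (assuming a strictly positive, decaying KL series — the stable/degenerate branches below need their own handling):

```python
import math

def half_life_turns(kls):
    """Sketch: fit kl = a * exp(-b * turn), return ln(2) / b."""
    turns = list(range(1, len(kls) + 1))
    logs = [math.log(k) for k in kls]          # log kl = log a - b * turn
    n = len(turns)
    mx = sum(turns) / n
    my = sum(logs) / n
    slope = sum((t - mx) * (y - my) for t, y in zip(turns, logs)) \
        / sum((t - mx) ** 2 for t in turns)
    b = -slope
    return math.log(2) / b
```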
Verdict ladder:
- `ok` — clean exponential decay; PASS when `half_life_turns ≥ assert_half_life_turns` (default 2.0).
- `stable` — KL stayed within 0.1% relative spread across turns; half-life is formally infinite; clipped to `max_turns × 10` and rendered with a "held coherence" message; always PASS.
- `non_monotonic` — KL grew turn-over-turn (atypical but possible); WARN with the curve in evidence.
- `degenerate` — KL ≈ 0 at every turn; FAIL with "probable no-op adapter" diagnosis.
No null calibration. Mirrors prompt_collapse's rationale: a
null adapter has no signal to decay, so the null distribution of
half-lives is meaningless. Fixed-threshold verdicts are the
published path.
Chat-template requirement. Multi-turn dialogue requires the
base's tokenizer to carry a `chat_template`. Bases without one
SKIP gracefully with a clear message. The probe consults
`ctx.backend._tokenizer.chat_template` — the same backdoor
`prompt_collapse` uses for the same reason (avoids broadening the
public scoring contract for one probe's needs).
Reports surface a tiny unicode sparkline of per-turn KL in the verdict message so terminal-only readers see curve shape without opening the JSON.
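A sparkline in that spirit is a few lines; this min-max-scaled block-character sketch is illustrative, not the shipped renderer:

```python
BLOCKS = "▁▂▃▄▅▆▇█"

def sparkline(values):
    """Sketch: render a numeric series as unicode block characters."""
    lo, hi = min(values), max(values)
    span = hi - lo or 1.0   # avoid div-by-zero on a flat series
    return "".join(
        BLOCKS[min(len(BLOCKS) - 1, int((v - lo) / span * len(BLOCKS)))]
        for v in values
    )
```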
Implementation:
- `probes/multi_turn_coherence.py` — spec, probe, curve fit, verdict mapping, sparkline.
- `probes/__init__.py` — registers the new probe.
Test surface:
- `tests/unit/test_probe_multi_turn_coherence.py` — 22 unit tests covering skip paths, end-to-end with planted-distribution sequences (decreasing / flat / growing curves), fit math (clean exp / stable / growing / zero / partial-zero), verdict mapping, sparkline rendering, and chat-template detection.
- `tests/integration/test_probe_multi_turn_coherence.py` — slow+online HF smoke on a tiny SmolLM2-135M LoRA across 3 dialogue turns. Exercises the full chat-template + turn-loop + curve-fit path on a real backend.
README gains a "Multi-turn coherence" section between
Tool-use fidelity and Reproducing a sway run with a worked
YAML example. The probe table at "Why it exists" picks up the
new entry under Adherence (plus tool_use_fidelity under
Attribution — the table missed S27's addition).
Sprint 28 — tenseleyflow/sway-action GitHub Action
Closes the P2 "GitHub Action" discoverability item from Audit 03
D-path3. The action lives in its own repo
(tenseleyFlow/sway-action,
tagged v0.1.0) so consumers can pin via uses: without coupling
to sway's release cadence.
End-user value: three lines of YAML to wire sway into any GitHub PR check, with edit-in-place PR comments + cached venv + standardized verdict sticker. The audience is teams who don't use pre-commit but live in GitHub Actions.
    - uses: tenseleyflow/sway-action@v0.1.0
      with:
        spec-path: sway.yaml
Inputs / outputs. Inputs cover fail-on policy
(`fail`/`warn`/`never`), `comment-on-pr`, `upload-artifact`,
`sway-version` (defaults to the action's pinned version), and
`python-version`. Outputs `sway-score`, `verdict`, and `report-path`
are surfaced for downstream workflow steps.
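Assuming the inputs and outputs listed above, a fuller workflow wiring might look like this sketch (the job name, `id:`, and trailing echo step are illustrative, not documented defaults):

```yaml
name: sway-gate
on: pull_request
jobs:
  gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: tenseleyflow/sway-action@v0.1.0
        id: sway
        with:
          spec-path: sway.yaml
          fail-on: warn
      # Downstream step consuming the documented outputs.
      - run: echo "score=${{ steps.sway.outputs.sway-score }} verdict=${{ steps.sway.outputs.verdict }}"
</imports>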
Action layout:
- `action.yml` — composite action, 5 steps (setup-python → cache → entrypoint.sh → upload-artifact → github-script post-comment).
- `src/entrypoint.sh` — bootstraps a venv, installs `dlm-sway[hf]` pinned, runs `sway gate` + `sway report --format md`, parses score + verdict from the JSON report into `$GITHUB_OUTPUT`.
- `src/post-comment.js` — Node 20 script run via `actions/github-script@v7`. Finds prior sway comments via a hidden `<!-- sway-action:report -->` HTML marker and edits in place. Truncates at 60 KB with a footer pointing to the artifact for the full report.
Self-tests (smoke.yml in the action repo): an actionlint +
shellcheck + node -c lint job, an entrypoint-stub job that
validates the bash plumbing against a stubbed dlm_sway package, and
a self-test job that runs the action end-to-end against a tiny
random LoRA on SmolLM2-135M (PR-only — needs PR context for the
comment-posting step).
Sway README gains a "GitHub Action" subsection between Pre-commit
and the .dlm integration with the copy-paste one-liner + a complete
workflow example.
Sprint 27 — tool_use_fidelity probe
Closes the P1 "tool_use_fidelity probe" backlog item. This is the probe sway ships for anyone fine-tuning an adapter against a tool-using base — the dominant fine-tune target outside dlm-style document training, and a failure mode no other shipped probe catches.
New probe (kind: `tool_use_fidelity`, category: `attribution`).
For each `(prompt, tool_spec, gold_tool_name)` case the probe
greedy-decodes from base + ft and scores three independent signals:
- JSON-schema validity delta (`ft_valid_rate − base_valid_rate`): the gate that catches "ft broke the base's tool-call format". Default pass criterion tolerates a 5 pp drop.
- Tool-name hallucination rate: fraction of schema-valid ft calls whose `name` falls outside the declared `allowed_tools` surface (or differs from `gold_tool_name` per-case when no surface is set). Default cap 10%.
- Argument-field disagreement rate (informational): leaf-field drift between matched base/ft schema-valid calls. v1 surfaces it as evidence rather than gating on it — per-token KL on free-form argument values is deferred past v1 because it requires alignment between two decoded strings.
The probe greedy-generates from both views, parses each output via a
forgiving extractor (whole-text JSON → fenced `json` block →
first balanced `{...}` substring), then validates against an
OpenAI-flavored schema subset (`string`, `integer`, `number`,
`boolean`, `object`, `array` plus `required`). No `jsonschema`
dependency added — core sway dependencies stay lean.
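The balanced-substring fallback is the interesting part of the extractor; a string-aware sketch (the middle fenced-block step is omitted for brevity, and the function name is illustrative):

```python
import json

def extract_tool_call(text):
    """Sketch: whole-text JSON, else first balanced {...} substring."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    start = text.find("{")
    while start != -1:
        depth, in_str, esc = 0, False, False
        for i in range(start, len(text)):
            ch = text[i]
            if esc:
                esc = False
            elif ch == "\\":
                esc = True
            elif ch == '"':
                in_str = not in_str          # braces inside strings don't count
            elif not in_str and ch == "{":
                depth += 1
            elif not in_str and ch == "}":
                depth -= 1
                if depth == 0:
                    try:
                        return json.loads(text[start : i + 1])
                    except json.JSONDecodeError:
                        break                # malformed candidate; try next "{"
        start = text.find("{", start + 1)
    return None
```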
`json_valid_rate_ft` is z-scored against the null-adapter baseline
when `null_adapter` is in the suite. The calibration spec ships two
sentinel tool-use cases so the null distribution carries useful
signal.
Implementation modules:
- `probes/tool_use_fidelity.py` — `ToolUseCase` / `ToolUseFidelitySpec` / `ToolUseFidelityProbe` plus `_parse_tool_call` / `_matches_schema` / `_field_disagreement` helpers. 573 LOC.
- `probes/__init__.py` — registers the new probe.
Test surface:
- `tests/unit/test_probe_tool_use_fidelity.py` — 40 unit tests covering verdict logic (PASS / FAIL on validity / FAIL on hallucination / SKIP-no-cases), the parse helpers (whole-text / fenced / embedded / unbalanced / strings-with-braces), schema validation (required / type-tag / bool-not-int / extras-allowed), field disagreement (identical / drifted / nested / list-as-leaf / None-vs-absent), and the calibration handoff.
- `tests/integration/test_probe_tool_use_fidelity.py` — slow+online HF backend smoke on a tiny SmolLM2-135M LoRA. Verifies the full code path executes without error and emits every documented evidence key with rates in `[0, 1]`. Doesn't assert a specific verdict — the 135M base isn't tool-fluent enough to make the validity gate meaningful, only the plumbing.
- `tests/fixtures/tool_use_cases.yaml` — eight hand-authored cases spanning the most common tool shapes (search, file I/O, Python exec, shell, HTTP fetch, DB query, calendar, calculator).
README gains a "Tool-use fidelity" section between the .dlm
integration and "Reproducing a sway run" — pitches the probe as
sway's signal for agentic fine-tunes, with a worked YAML example.
Sprint 26 — X3 sway pack / unpack (sway-side half of the cross-repo X1+X3 pair)
Closes the X3 half of Audit 03's "make a sway run reproducible by a
coworker without recreating their environment" goal. The X1 half
(dlm export --to sway-json) lands in the dlm repo via a separate
PR; this changelog block ships the sway-only deliverable.
New CLI commands.
- `sway pack <spec> [-o OUT] [--include-golden PATH] [--include-null-cache/--no-include-null-cache] [--max-size-mb MB]` — bundles the spec + its `dlm_source` (when set) + cached null-stats entries + an optional last-known-good JSON report into a single `*.swaypack.tar.gz`. Default cap 50 MB; refuses to overwrite.
- `sway unpack <pack> [-o DIR]` — extracts into `<DIR>/swaypack/`, validates the manifest, and prints the ready-to-run `sway run` invocation including the `SWAY_NULL_CACHE_DIR=...` env var that redirects null-stats lookups at the bundled cache.
Pack format (`swaypack_version=1`).

    swaypack/
      manifest.json    # version + counts + packed_at + sway_version
      sway.yaml        # the spec, verbatim
      source.dlm       # spec.dlm_source content (when present)
      null-stats/      # one .json per cache key
      golden.json      # known-good run report (when --include-golden)
Implementation modules:
- `cli/_pack.py` — builds the tarball in-memory first so the size cap can refuse cleanly before writing a half-built pack to disk. Path-traversal-safe by construction (we author every arcname).
- `cli/_unpack.py` — uses `tarfile.extractall(filter='data')` to reject absolute paths / `../` escapes / device files (3.11 compat). Validates `swaypack_version`; rejects mismatches with a clear message.
Null-cache override env var. `probes/_null_cache._cache_root` now honors `$SWAY_NULL_CACHE_DIR` if set (used verbatim — no `dlm-sway/null-stats` suffix). Order: env override → XDG cache → `~/.dlm-sway/null-stats`. The `sway unpack` CLI prints the exact invocation that wires this env at the bundled cache.
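The lookup order can be sketched like this (the helper name and the exact subpath under `XDG_CACHE_HOME` are assumptions; the changelog only pins the three-step order):

```python
import os
from pathlib import Path

def null_cache_root(env=os.environ):
    """Sketch: env override (verbatim) -> XDG cache -> home fallback."""
    override = env.get("SWAY_NULL_CACHE_DIR")
    if override:
        return Path(override)                      # used verbatim, no suffix
    xdg = env.get("XDG_CACHE_HOME")
    if xdg:
        return Path(xdg) / "dlm-sway" / "null-stats"
    return Path.home() / ".dlm-sway" / "null-stats"
```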
Tests.
- 17 unit tests in `tests/unit/test_pack_unpack.py`: round-trip identity, manifest fields, dlm_source bundling (present; missing emits a warning), golden bundling, every PackError + UnpackError branch (overwrite refusal, size cap, missing file, corrupt tarball, missing manifest, version mismatch, malformed root).
- 2 slow+online integration tests in `tests/integration/test_pack_run_roundtrip.py`: pack → unpack → `sway run` produces an identical band + per-probe verdict + score; null-cache packing populates the unpack report's `null_stats_dir` pointer.
- 708 unit tests pass; mypy + ruff + format clean.
Sprint 25 — P3 gradient_ghost probe (pre-run, cross-repo)
New zero-forward-pass diagnostic probe that loads dlm's
training_state.pt and flags severely-undertrained adapters
before any model load fires. Catches the 90%-case "user did
--max-steps 5 for a smoke test and forgot to retrain" in ~50 ms
vs. the 30+ s adapter_ablation probe that previously was the only
training-health signal.
Probe + supporting modules.
- `probes/gradient_ghost.GradientGhostProbe` — category `calibration`, `needs_backend = False`. Verdict ladder: `global_step < min_steps_threshold` → FAIL (primary signal); all per-param `exp_avg_sq` NaN → FAIL; per-layer ratio against the minimum layer's mean (not global mean — see module docstring on asymptotic-cap reasoning) flags layers > `undertrained_layer_ratio`; `layer_failure_frac` of layers crossing → FAIL/WARN. Configurable thresholds `min_steps_threshold=50`, `undertrained_layer_ratio=2.0`, `layer_failure_frac=0.3`.
- `probes/_training_state.py` — `torch.load` wrapper for `training_state.pt`. Frozen `TrainingStateSnapshot` + `ParamStat` dataclasses. Lazy torch import (no cost on `import dlm_sway` for non-gradient-ghost users). Suppresses the FutureWarning around `weights_only=False` since dlm's pickled RNG state requires it.
- `probes/_param_id_mapping.py` — maps optimizer-state integer param-ids to transformer-layer indices via the adapter's `adapter_model.safetensors` keys. Heterogeneous per-layer counts raise `ParamMappingError` rather than silently mis-attributing. Uses `safetensors.safe_open` so we read keys without materializing tensors.
- `core/errors.MissingTrainingStateError` — typed exception so the probe can SKIP cleanly when `training_state.pt` legitimately isn't there (non-dlm adapters), distinct from a parse error.
Probe-ABC + runner contract for pre-run probes.
- `probes/base.Probe.needs_backend: ClassVar[bool] = True` — new opt-in flag. Default `True` preserves every shipped probe's contract.
- `probes/base.RunContext.backend: DifferentialBackend | None` — was required, now optional. New `RunContext.require_backend` property narrows the type for mypy + raises a clear runtime error if a probe with `needs_backend=True` somehow gets a `None` backend.
- `probes/*` sweep — every existing probe now reads `ctx.require_backend.as_base()` instead of `ctx.backend.as_base()`. No behavior change; mypy-correct under the new optional type.
- `suite/runner.run` accepts `backend=None`. Pre-resolves all scheduled probes; if any has `needs_backend=True` while `backend=None`, raises `BackendNotAvailableError` with the offending probe kinds in the message. Skips trace-writer install, preflight finite-check, probe-label setting, and backend-stats snapshot when backend is absent.
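The flag-plus-narrowing-property pattern is small; a sketch with illustrative types (the real `RunContext` carries much more state, and the backend type is simplified to `object` here):

```python
from dataclasses import dataclass
from typing import ClassVar, Optional

class Probe:
    """Sketch: probes default to needing a backend; pre-run probes opt out."""
    needs_backend: ClassVar[bool] = True

@dataclass
class RunContext:
    backend: Optional[object] = None

    @property
    def require_backend(self):
        # Narrows Optional[...] for type checkers; loud failure at runtime.
        if self.backend is None:
            raise RuntimeError(
                "probe declared needs_backend=True but no backend was provided"
            )
        return self.backend
```

Callers switch from `ctx.backend` to `ctx.require_backend` and get both mypy narrowing and a clear runtime error for free.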
`sway check` integration. `cli/commands.check_cmd` runs `gradient_ghost` first in the quick battery (along with `null_adapter` + `delta_kl` + `calibration_drift`). On FAIL, emits a red ⚠️ banner before the regular verdict banner; on WARN, a yellow one. Informational — no exit-code change. Banner shows the probe's actionable message ("severely undertrained: global_step=N < threshold X").
Tests.
- 17 unit tests (`tests/unit/test_probe_gradient_ghost.py`): registry / `needs_backend` flag / verdict-ladder branches (PASS / FAIL on global_step / FAIL on all-NaN / WARN / FAIL on many-layers / SKIP on missing) / `_param_id_mapping` correctness + error paths / `_training_state` loader edge cases / runner skip-backend contract.
- 3 integration tests (`tests/integration/test_probe_gradient_ghost.py`, slow+online): real-store FAIL on `~/.dlm/store/01KPPFAB.../v0001` (skipped on machines without local dlm install — typical CI), synthetic converged → PASS, runner end-to-end with `backend=None`.
- All 691 unit tests pass; mypy + ruff + format clean.
Sprint 24 — F01 PEFT→MLX adapter converter
Closes the audit's #1 major finding: the README pitched MLX as a
co-equal backend, but the path required a pre-converted .npz
adapter that nothing in the toolchain produced. With this sprint,
dlm train → sway run on the MLX backend works end-to-end on any
PEFT-trained LoRA adapter.
Converter (pure I/O, no torch dep).
- `backends/_mlx_convert.convert_peft_to_mlx` — reads PEFT's `adapter_model.safetensors` + `adapter_config.json`, transposes LoRA matrices to MLX's layout, writes `adapters.safetensors` + mlx-lm-shaped `adapter_config.json`. Verified against PEFT >= 0.13 + mlx-lm 0.31.
- Key remap. PEFT's `base_model.model.<dotted>.lora_<A|B>.weight` becomes MLX's `<dotted>.lora_<a|b>`. Modern PEFT keys (no `.default` adapter-name segment) and legacy `.default.weight` keys both supported.
- Shape transpose. PEFT `lora_A = (r, in)` → MLX `lora_a = (in, r)`; PEFT `lora_B = (out, r)` → MLX `lora_b = (r, out)`.
- Config remap. Writes `fine_tune_type=lora`, `num_layers` inferred from max layer index in the keys, `lora_parameters` with `rank` / `scale=alpha÷r` / `dropout` / `keys` (per-layer-relative attribute paths like `self_attn.q_proj`).
- Errors. Missing files / non-LORA peft_type / invalid rank / unexpected key prefixes / dst-not-empty all surface as typed `MlxConvertError` with actionable messages. `modules_to_save` tensors (e.g. `embed_tokens`, `lm_head` overrides) are skipped with a per-key warning rather than crashing.
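The key-remap rule for modern PEFT keys can be sketched as (legacy `.default.weight` handling omitted; function name illustrative):

```python
import re

def remap_peft_key(key):
    """Sketch: base_model.model.<dotted>.lora_<A|B>.weight -> <dotted>.lora_<a|b>."""
    m = re.fullmatch(r"base_model\.model\.(.+)\.lora_([AB])\.weight", key)
    if m is None:
        raise ValueError(f"unexpected key prefix: {key}")
    dotted, ab = m.groups()
    return f"{dotted}.lora_{ab.lower()}"
```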
CLI surface. `sway convert-adapter [--target mlx] SRC DST [--overwrite]` — thin wrapper over the converter. Prints a before/after size + rank/scale report; surfaces `MlxConvertError` with a non-zero exit code; warns on skipped `modules_to_save` keys via stderr.
MLX backend integration.
- `backends/mlx._ensure_mlx_adapter` — auto-detect: if the adapter dir contains `adapter_model.safetensors`, run the converter into a content-hashed cache at `${XDG_CACHE_HOME:-$HOME/.cache}/dlm-sway/mlx-converted/<blake2b>/`, point mlx-lm at the cache. If the dir already contains `adapters.safetensors`, pass through unchanged. Uses 16-byte blake2b on the source safetensors bytes — repeat loads on the same adapter version short-circuit (~10 ms hash + dir lookup).
- `backends/mlx._MLXView._forward_logits` — adjacent fix: `out[0].astype(mx.float32)` before `np.asarray` so unquantized bf16/fp16 model outputs round-trip correctly. Pre-existing bug surfaced by the new e2e test against `mlx-community/SmolLM2-135M-Instruct`.
Tests.
- `tests/unit/test_mlx_convert` — 20 tests across:
  - Helper functions (`_strip_layer_prefix`, `_extract_layer_index`).
  - Happy path: synthetic PEFT adapter → MLX adapter, expected file layout, config shape, rank/scale math, key transpose, value preservation.
  - Error paths: missing safetensors / config, non-LORA peft_type, invalid rank, dst-not-empty without `--overwrite`, unexpected key prefix, `modules_to_save` skip-and-report.
  - Auto-convert detection: pass-through on already-MLX dir, fresh convert on PEFT dir, cache short-circuit, unrecognized dir pass-through.
- `tests/integration/test_mlx_converter_e2e` — 4 darwin-arm64 slow+online tests on real `mlx-community/SmolLM2-135M-Instruct`: XDG cache populated by backend init, `next_token_dist` returns finite top-k via converted adapter, `logprob_of` works, repeat load skips reconvert (mtime check). Skipped on non-darwin / missing `[mlx]` extra.
README.
- New "MLX backend (Apple Silicon)" section with the two install paths (auto-convert via `sway run` vs. explicit `sway convert-adapter`). Documents the cache location and out-of-scope items (QLoRA, `modules_to_save`).
Sprint 23 — H1 batched backend execution
Opens the door to 3-5× wall-time reduction on HF-backend suites by
amortizing model.forward kernel-launch + memory-transfer overhead
across prompt batches. Real speedup lands on the HF backend; other
backends gain a unified Protocol surface via loop fallback.
Protocol. `core/scoring.ScoringBackend` — new method `next_token_dist_batch(prompts, *, top_k) -> list[TokenDist]`. Default implementation on the Protocol loops over `next_token_dist`; backends override when they can amortize. Adding the method to a `runtime_checkable` Protocol is a contract change — all shipped backend views + the stub used by `test_scoring.py::test_scoring_backend_runtime_checkable` gained an explicit method implementation.
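The batched-method-with-loop-default pattern looks roughly like this sketch (`TokenDist` simplified to a plain dict; the real signature lives in `core/scoring`):

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class ScoringBackend(Protocol):
    def next_token_dist(self, prompt: str, *, top_k: int) -> dict: ...

    def next_token_dist_batch(self, prompts: list[str], *, top_k: int) -> list[dict]:
        # Loop fallback: backends that can amortize a single forward pass
        # over the whole batch override this method.
        return [self.next_token_dist(p, top_k=top_k) for p in prompts]
```

Explicit subclasses inherit the loop default; purely structural conformers (the Protocol's usual mode) must provide the method themselves, which is why every shipped view and the test stub gained one.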
HF backend — real batching.
- `backends/hf._HFView.next_token_dist_batch` — calls `tokenizer(prompts, padding=True, padding_side="left")` and a single `model.forward` over the padded batch. Left-padding is mandatory for decoder-only LMs: the last-token position (what `next_token_dist` reads) must line up with the real end of each sequence, not pad tokens.
- `backends/hf._topk_to_token_dist` — extracted helper shared by the single-prompt and batched paths so tail-mass accounting (B6) stays identical across both.
- `backends/_instrumentation.BackendInstrumentation.cached_batch` — routes batched calls through per-prompt cache lookup before the forward fires, so cached + uncached prompts in one call still short-circuit per-entry. Callback contract: `compute_misses(miss_indices) -> list[result]` in miss order. Wall-time attribution divides the batch's time evenly across miss entries for `scoring_wall_s` bookkeeping.
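The cached-batch routing contract can be sketched as a free function (the real instrumentation is a class with counters and timing; names here are illustrative):

```python
def cached_batch(cache, prompts, compute_misses):
    """Sketch: per-prompt cache lookup, compute only the misses, stitch back.

    compute_misses(miss_indices) must return results in miss order.
    """
    results = [cache.get(p) for p in prompts]
    miss_indices = [i for i, r in enumerate(results) if r is None]
    if miss_indices:
        computed = compute_misses(miss_indices)
        if len(computed) != len(miss_indices):
            raise ValueError("compute_misses returned wrong length")
        for i, value in zip(miss_indices, computed):
            cache[prompts[i]] = value   # warm the cache for future calls
            results[i] = value
    return results
```

A fully-cached batch never invokes the callback, so no forward fires at all.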
Instrumentation.
- `BackendStats` — three new counters: `batches_sent`, `batched_prompts`, `max_batch_size`. `avg_batch_size` property computes `batched_prompts / batches_sent` (0 when no batches fired). All surfaced in `to_dict()`.
- `suite/report._cache_line` — the report footer's `cache: N/M = P%` line now suffixes `| batches: K (avg=X.X)` when any batched forward fired. Runs without batching render the cache line alone — pre-S23 footer shape preserved.
Probes — opt-in.
- `probes/base.Probe.batch_score: ClassVar[bool] = False` — new flag. Probes set True when their scoring flows uniformly through `next_token_dist` on a prompt list.
- `probes/delta_kl.DeltaKLProbe` — `batch_score = True`; `run()` now issues one batched call per view (inside single `as_base` / `as_finetuned` contexts) instead of 2×N context entries with per-prompt calls. Divergences computed via `zip(base_dists, ft_dists, strict=True)`.
- `probes/cluster_kl.ClusterKLProbe` — same treatment; same math.
- Other probes (section_internalization, leakage, paraphrase_invariance, adapter_ablation, prompt_collapse, multi-turn-style probes) keep `batch_score=False` — each needs bespoke batching logic deferred to a follow-up sprint.
Dummy / MLX / API backends.
- `backends/dummy._DummyView.next_token_dist_batch` — loops over `self.next_token_dist` (not `_compute_next_token_dist`) so subclass overrides (`_NullView`'s seeded-noise perturbation, `_InterpolatedView`'s lam-blend) fire correctly on the batched path. The dummy has no real forward to amortize; batching counters stay at zero on this backend by design.
- `backends/mlx._MLXView.next_token_dist_batch` — per-prompt loop stub with a `TODO: mx.array padded forward` docstring. MLX's per-prompt forward on Apple Silicon is already fast enough that the kernel-launch amortization a real batched forward would buy is small relative to the HF CUDA/MPS case.
- `backends/api.ApiScoringBackend.next_token_dist_batch` — per-prompt loop; OpenAI-compat HTTP servers score one completion per request and httpx connection pooling already amortizes network cost.
Tests.
- `tests/unit/test_backend_instrumentation` — new `TestBackendInstrumentationCachedBatch` class (6 tests) pinning all-miss / partial-cache-hit / all-cached / wrong-length-return / max_batch_size / empty-prompts branches. `TestBackendStats` gained 2 tests for the new counters + `avg_batch_size` property.
- `tests/unit/test_batched_backend_s23` — new file with 5 tests: `batch_score` flag set on `DeltaKLProbe`, probe-level spy that confirms the batched method is called exactly once per view, byte-identical results vs serial, report footer batches-segment conditional rendering, empty-prompts short-circuit.
0.1.0 — 2026-04-24
First PyPI release. Alpha — API not guaranteed stable until v1.0. Rolls up every sprint through Sprint 22; earlier sprints accumulated under the rolling Unreleased header pre-release.
Sprint 22 — v0.1.0 release + F06 dlm-compat test + Docker image
First PyPI publish. Version bumped from 0.1.0.dev0 to 0.1.0.
Pre-commit consumer rev: pins switch from commit SHA to v0.1.0.
Closes Audit 03 F05 (SHA-pin resolves at tag) and F06 (dlm API
compatibility regression test).
D-path1 — PyPI publish.
- `pyproject.toml` — version `0.1.0`. `Development Status :: 3 - Alpha` classifier already present.
- `src/dlm_sway/__init__.py` — `__version__ = "0.1.0"`.
- `README.md` — real `pip install "dlm-sway[hf]"` recipe replaces the "Planned PyPI install (not live yet)" placeholder. Pre-alpha banner swapped for an explicit Alpha + semver-from-v1.0 disclaimer.
F06 — dlm API compatibility regression test.
- `core/errors.DlmCompatError` — new typed exception. Raised when dlm's public surface drifts out of the contract sway's resolver depends on (currently: `dlm.base_models.resolve(key).hf_id`). Message includes installed-dlm version + a pip-install hint.
- `integrations/dlm/resolver` — the `except Exception: return base_model` silent fallback is gone. `resolve()` raising or returning an object without `hf_id` now raises `DlmCompatError` (chained via `__cause__` for tracebacks). An extra `_installed_dlm_version()` helper introspects `importlib.metadata` for the error message.
- `tests/unit/test_dlm_bridge` — 2 new regression tests pin the two drift branches (missing `hf_id`; `resolve` itself raising).
- `tests/integration/test_dlm_api_compat` — new slow+online+dlm integration test iterates dlm's registry keys and asserts every entry resolves to a spec with a plausible `hf_id`. Gracefully skipped (via `pytest.importorskip`) when the `[dlm]` extra isn't installed, so CI without dlm published to PyPI still passes.
- `pyproject.toml` — `[dlm]` extra pinned to `dlm>=0.9,<1.0`. Upper bound tightens to the range the compat test has validated; bump when dlm cuts v1.0 and the contract is re-verified.
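The hardening pattern (typed, chained error instead of a silent fallback) in miniature; `resolve_hf_id` and its shape are illustrative, not the real resolver:

```python
class DlmCompatError(RuntimeError):
    """Sketch: raised when the upstream dlm surface drifts out of contract."""

def resolve_hf_id(resolve, key):
    try:
        spec = resolve(key)
        hf_id = spec.hf_id          # missing attribute counts as drift too
    except Exception as exc:
        # Chain the original failure so tracebacks show the real cause.
        raise DlmCompatError(f"dlm API drift while resolving {key!r}") from exc
    return hf_id
```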
D-path2 — Docker image (sway-gate).
- `Dockerfile.gate` — `python:3.11-slim` + `dlm-sway[hf,semsim]` at the build-time `SWAY_VERSION` ARG. Pre-fetches the MiniLM weights (`sentence-transformers/all-MiniLM-L6-v2`, ~80 MB) so `adapter_revert` / `cluster_kl` probes don't cold-download on first gate run. `ENTRYPOINT ["sway"]` — pre-commit hook passes `gate` as the first container arg.
- `.github/workflows/docker.yml` — on `v*` tag push + manual dispatch. Builds on GHA, pushes to `ghcr.io/tenseleyflow/sway-gate:{vX.Y.Z, latest}`. Smoke-tests the pushed image with `--version`.
- `.pre-commit-hooks.yaml` — new `sway-gate-docker` variant using `language: docker_image` pointing at `ghcr.io/tenseleyflow/sway-gate:v0.1.0 gate`. Third option alongside `sway-gate` (system PATH) and `sway-gate-isolated` (fresh venv).
F05 closure. `.pre-commit-hooks.yaml` — isolated variant's `additional_dependencies` migrates from `dlm-sway[hf] @ git+https://…@<SHA>` to `dlm-sway[hf]==0.1.0`. Consumer `rev:` in `.pre-commit-config.yaml` pins to `v0.1.0`. SHA churn over.
README.
- Real `pip install` recipes for every extra (`[hf]`, `[dlm]`, `[all]`, …). "Install from source" kept for contributor workflow.
- Pre-commit section: three hooks (system / isolated / docker) with first-run-cost comparison table.
Sprint 21 — Audit 03 closure
Closes the Audit 03 short list — 2 🟠 major + 5 🟡 minor findings the
audit recommended for v0.1.0 readiness. Full audit at
.docs/audits/03-final-audit.md. Verdict was YELLOW-leaning-GREEN
with zero critical findings; this sprint pushes it to GREEN on the
in-scope items. Deferred: F01 (MLX converter — its own feature
sprint, S24), F05 (resolves naturally at v0.1.0 tag), F06 (dlm API
compat test — lands with v0.1.0 release in S22). v0.1.0 PyPI
publish is deferred to S22 (requires release coordination +
credentials).
🟠 Major — degenerate z-score clipping (F02).
- `probes/null_adapter` — null stats now carry an explicit `degenerate: 1.0` field when the calibration ran but produced an unusable baseline (`runs: 1`, or every seed producing the exact same raw). The std floor at `1e-6` is preserved for valid-but-tight multi-seed nulls so they still calibrate. The audit observed a `+290,766σ` leakage-probe z under `runs: 1`; post-fix, z_score refuses and the probe falls back to fixed thresholds with a clear footer rollup.
- `probes/_zscore.z_score` — refuses when `stats["degenerate"]` is truthy, independent of the std check. Belt-and-suspenders on top of the existing MIN_STD guard.
- `suite/report.collect_degenerate_null_kinds` — new rollup surface. Terminal + markdown footers gain an "N probe kind(s) had a degenerate null baseline — bump `runs:`" block, distinct from the existing null-opt-outs rollup.
- 12 new unit tests across `test_null_calibration.py`, `test_zscore_helpers.py`, `test_report_extras_rollup.py`, and `test_probe_external_perplexity.py`. The existing `test_std_floor_prevents_runaway_zscore` was rewritten to assert the fixed behavior (the pre-fix contract WAS the audit's bug).
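The refusal logic reduces to a couple of checks; a stdlib sketch (the real helper lives in `probes/_zscore` and its exact signature may differ):

```python
MIN_STD = 1e-6  # floor for valid-but-tight multi-seed nulls


def z_score(raw, stats):
    """Return a z-score against null stats, or None when the baseline is unusable.

    Refuses outright when calibration flagged itself degenerate (single run,
    or every seed produced the same raw), independent of the std floor below,
    so the caller falls back to fixed thresholds instead of a runaway sigma."""
    if stats.get("degenerate"):
        return None
    std = max(stats["std"], MIN_STD)  # belt-and-suspenders std floor
    return (raw - stats["mean"]) / std
```

A degenerate single-run null returns `None` rather than a six-figure sigma.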
🟡 Minor — CI resilience (F03).
- `pytest-timeout>=2.3` + `tenacity>=9.0` in dev deps. `tests/fixtures/tiny_model.py` wraps `snapshot_download` with an exponential-backoff tenacity retry (3 attempts, 5-10-20 s backoff) plus `etag_timeout=10` to bound per-file head probes. Benefits every slow+online test.
- `tests/integration/test_determinism_golden.py` gains `@pytest.mark.timeout(600)`. A silent network hang now surfaces as a test failure with actionable output (the audit observed a 20-minute workflow-timeout hang on the Sprint 19 merge run 24747915467).
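A stdlib-only sketch of the retry shape (the shipped fixture uses tenacity; `with_backoff` here is a hypothetical stand-in that mimics the 3-attempt, 5-10-20 s schedule):

```python
import time


def with_backoff(fn, *, attempts=3, delays=(5, 10, 20), sleep=time.sleep):
    """Retry fn on network-ish OSErrors with a fixed backoff schedule.

    Mirrors the 3-attempt, 5-10-20 s shape the fixture applies around
    snapshot_download; the last failure propagates to the caller."""
    for attempt in range(attempts):
        try:
            return fn()
        except OSError:
            if attempt == attempts - 1:
                raise
            sleep(delays[attempt])
```

In the real fixture this is expressed declaratively with tenacity's `Retrying`, but the control flow is the same.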
🟡 Minor — outlier-miner pool guard (F04).
- `mining/outlier_miner.mine_outliers` — raises `SwayError` when the pool has fewer than `2·top_k` distinct scored prompts, with an actionable `--top-k N` hint. Pre-fix, a 1-distinct-prompt pool produced `top=[p], bottom=[p]` (identical lists — no outlier contrast). The guard applies AFTER scoring so unsupported probe kinds still return the empty-result path.
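The guard is a distinct-count check before ranking; a sketch with `ValueError` standing in for `SwayError`:

```python
def check_pool(scored_prompts, top_k):
    """Require at least 2*top_k distinct scored prompts so the top-K and
    bottom-K slices cannot overlap (no outlier contrast otherwise).

    scored_prompts is a sequence of (prompt, score) pairs; raises with an
    actionable --top-k hint when the pool is too small."""
    distinct = {prompt for prompt, _score in scored_prompts}
    if len(distinct) < 2 * top_k:
        raise ValueError(
            f"pool has {len(distinct)} distinct prompts; need >= {2 * top_k}. "
            f"Try --top-k {len(distinct) // 2 or 1}."
        )
```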
🟡 Minor — autogen skipped-probes comment (F07).
- `integrations/dlm/autogen.collect_skipped_probe_reasons` — new public helper returning `(probe_kind, reason)` tuples for probes `_build_suite` intentionally omitted for the given handle. `_render_annotated_yaml` — when `skipped` is non-empty, the header gains a `# skipped: <kind> (<reason>)` block. Users no longer have to diff the autogen source to understand which probes are missing from their generated `sway.yaml`.
🟡 Minor — CI Node.js 20 deprecation silence (F08).
- `.github/workflows/ci.yml` sets `FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: "true"` at the workflow level. Silences the Node 20 deprecation warnings each CI log currently carries (deadline June 2026). Temporary until every action we pull bumps to Node 24 natively.
🟡 Minor — portable dlm_source (F09).
- `integrations/dlm/autogen._portable_dlm_source` emits a cwd-relative path when the `.dlm` lives inside the cwd, absolute otherwise. The cwd-relative form survives cross-machine checkout (CI agents, other devs); the old absolute-path emission broke the moment a committed autogen'd YAML was run on a different host.
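The emission is essentially a `relative_to` attempt with an absolute fallback; a minimal sketch (the function name mirrors the changelog, the internals are assumed):

```python
from pathlib import Path


def portable_dlm_source(dlm_path, cwd):
    """Return a cwd-relative path when the .dlm lives under cwd, else absolute.

    The relative form survives checkout on another machine; the absolute
    form is only used when the file genuinely lives outside the repo."""
    dlm_path, cwd = Path(dlm_path).resolve(), Path(cwd).resolve()
    try:
        return str(dlm_path.relative_to(cwd))
    except ValueError:  # not under cwd: keep the absolute path
        return str(dlm_path)
```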
Other.
- 640 unit tests pass (up from 626 before S21 — 14 new regression tests across F02/F03/F04/F07/F09).
- Pragmatic deviations from original scope: S21 was audit-scoped to ~2 days of cleanup. Actual depth: 6 findings closed with 14 new regression tests. One existing test (`test_std_floor_prevents_runaway_zscore`) was explicitly rewritten to flip its assertion — the pre-fix contract the test encoded WAS the audit's reported bug.
Sprint 19 — Pre-commit hook sway gate
Closes Audit 01 stretch-list F-item "pre-commit hook sway gate." Ships
a .pre-commit-hooks.yaml declaring two hook variants so
pre-commit.com users can gate their adapter
on every commit that touches a spec, .dlm file, or adapter directory.
- Two hooks, pick your posture.
  - `sway-gate` — `language: system`. Uses the sway install on the user's `PATH`. Fast, zero install cost. Recommended default.
  - `sway-gate-isolated` — `language: python` + a git-based `additional_dependencies` entry (`dlm-sway[hf] @ git+...@<SHA>`). Self-contained — pre-commit builds a fresh venv and installs sway + torch + transformers on first run (~5 GB, ~2 min).
- Rev pinning stance. The example config pins to a commit SHA, not `HEAD`. Sway is pre-v0.1.0 — no tagged release yet — and `HEAD` drifts silently on every `pre-commit autoupdate`. SHA pinning is the honest pre-release pattern; migration to `rev: v0.1.0` is a README edit once the first release lands.
- `pass_filenames: false` on both variants. `sway gate` takes exactly one spec path; the user declares it via `args:`. The `files:` regex decides when the hook fires. No ambiguity.
- Scope discipline. Neither hook surfaces `--json` or `--markdown` flags. The hook gates (non-zero on FAIL) and nothing else. Users wanting report artifacts run `sway run` separately.
- README "Pre-commit" section documents the consumer-side config for both variants, the SHA-pinning rationale, and the one-time install cost of the isolated variant.
- `examples/precommit-example/` — template `sway.yaml`, consumer-side `.pre-commit-config.yaml`, and a README walk-through.
- `tests/integration/test_pre_commit_hook.py` (slow + online) — three cases: (a) parse-and-shape smoke test on `.pre-commit-hooks.yaml`; (b) pass-case runtime — spawns `pre-commit run sway-gate` as a subprocess against a tmp git repo with a real LoRA on SmolLM2-135M, asserts exit 0; (c) fail-case — same fixture with an impossible `assert_mean_gte`, asserts non-zero exit + a `gate FAILED` banner.
- `pre-commit>=3.8` in `[dependency-groups].dev` so the integration test can invoke it as a subprocess. Keeps the tool out of the user's runtime deps.
Sprint 18 — Cross-platform determinism golden
Closes Audit 01 stretch-list F-item "cross-platform determinism golden test." The README's "deterministic on CPU where possible" claim held in theory; now it's pinned by CI on two platforms.
- New module `src/dlm_sway/core/golden.py`. Tolerance-aware JSON comparator with `compare_goldens(actual, expected, *, logprob_tol=1e-6, score_tol=1e-4)` and `mask_variable_fields` for stripping timestamps, wall-seconds, `duration_s`, `backend_stats`, `sway_version`, and cwd-resolved path identifiers (`adapter_id`, `base_model_id`) before comparison. No torch dep; runs in the fast lane.
- New integration test `tests/integration/test_determinism_golden.py` (slow + online). Builds a deterministically-seeded LoRA on SmolLM2-135M, runs a minimal 2-probe suite (delta_kl + calibration_drift), and diffs the JSON output against `tests/golden/expected_<platform>.json`. `SWAY_UPDATE_GOLDENS=1` toggles regen mode; a missing golden → SKIP with a regen recipe.
- New CI matrix `determinism-golden` with `strategy.matrix.os: [ubuntu-latest, macos-latest]` in `.github/workflows/ci.yml`. Triggered by changes under `tests/golden/**`, `src/dlm_sway/core/golden.py`, or the test file itself (plus the standard schedule/dispatch/push triggers).
- `workflow_dispatch` regen mode — dispatching the CI workflow with `regenerate_goldens=true` flips `SWAY_UPDATE_GOLDENS=1` on both matrix legs and uploads the regenerated JSONs as per-platform artifacts. Meant for deliberate "yes I changed the algorithm" flows.
- Tolerance rationale (documented in `core/golden.py`): `1e-6` for logprob-like numeric fields sits above typical BLAS-implementation drift (the 1e-8–1e-7 band between OpenBLAS on linux and Accelerate on darwin) but below real algorithm-change drift. `1e-4` for score fields absorbs composition noise.
- 25 new unit tests for the comparator covering mask coverage, tolerance thresholds, structural diffs (missing keys, length mismatches, type mismatches), NaN/inf edge cases, and a realistic two-masked-payloads round-trip.
- Pragmatic scope deviation: the sprint envisioned a checked-in 5 MB adapter binary. Instead the test builds it from a fixed seed at runtime (same pattern as `test_external_perplexity_e2e` and `test_cluster_kl_e2e`). Avoids the regeneration chore noted in the sprint's risks section.
- Linux golden bootstrap: the first PR ships `expected_darwin.json` only; the linux leg SKIPs with a recipe pointing at the `workflow_dispatch` regen mode. The maintainer dispatches, downloads the artifact, commits `expected_linux.json`, and the next CI run asserts cleanly on both platforms. One-time onboarding cost.
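The tolerance-keyed comparison can be sketched as a recursive diff. This is an illustrative reduction, not the shipped `core/golden.py` (which also handles masking, list payloads, and NaN/inf):

```python
import math


def compare_goldens(actual, expected, *, logprob_tol=1e-6, score_tol=1e-4, path=""):
    """Recursive tolerance-aware diff: float leaves compare within a tolerance
    chosen by field name; everything else must match exactly.

    Returns a list of human-readable mismatch paths; empty list == match."""
    diffs = []
    if isinstance(actual, dict) and isinstance(expected, dict):
        if actual.keys() != expected.keys():
            diffs.append(f"{path}: key sets differ")
        else:
            for key in expected:
                diffs += compare_goldens(
                    actual[key], expected[key],
                    logprob_tol=logprob_tol, score_tol=score_tol,
                    path=f"{path}.{key}",
                )
    elif isinstance(actual, float) and isinstance(expected, float):
        # logprob-like fields get the tight tolerance, scores the loose one
        tol = logprob_tol if "logprob" in path else score_tol
        if not math.isclose(actual, expected, abs_tol=tol):
            diffs.append(f"{path}: {actual} != {expected} (tol {tol})")
    elif actual != expected:
        diffs.append(f"{path}: {actual!r} != {expected!r}")
    return diffs
```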
Sprint 17 — Adversarial paraphrase mining + outlier-prompt miner
Closes Audit 01 innovation item F11. Adds sway mine — an evaluation
companion to paraphrase_invariance that surfaces the paraphrases a
memorizing adapter most reliably fails on, plus a secondary mode that
ranks delta_kl prompts by per-prompt divergence.
`sway mine --mode paraphrase` — per `paraphrase_invariance` case:
- Generate candidate paraphrases via nlpaug `SynonymAug` (WordNet, deterministic under a fixed seed; back-translation is a user-supplied escape hatch to keep the default fast and offline).
- Embed candidates through the shared MiniLM cache (the same 80 MB load `adapter_revert` pulls) and greedy farthest-first select the top-K most pairwise-distant ones — dodges nlpaug's tendency to emit near-duplicate synonym swaps.
- Rank by per-token lift gap: `(ft(prompt, gold) - base(prompt, gold)) - (ft(candidate, gold) - base(candidate, gold))`. Large positive gap = the candidate breaks the adapter's lift = the adapter doesn't generalize there.
- Emit `sway-mined-paraphrase.yaml` with a `mined_cases:` block paste-compatible with `paraphrase_invariance.cases`.
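The lift-gap ranking is one subtraction per candidate; a sketch with a caller-supplied `score(view, text, gold)` standing in for the backend's per-token gold logprob:

```python
def rank_by_lift_gap(score, prompt, gold, candidates):
    """Rank paraphrase candidates by how much adapter lift they lose.

    score(view, text, gold) is a stand-in for the per-token gold logprob
    under the "ft" or "base" view. Gap = prompt's lift minus the candidate's
    lift; the largest gap marks where the adapter generalizes least."""
    prompt_lift = score("ft", prompt, gold) - score("base", prompt, gold)
    gaps = [
        (prompt_lift - (score("ft", c, gold) - score("base", c, gold)), c)
        for c in candidates
    ]
    return sorted(gaps, reverse=True)  # largest gap first
```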
sway mine --mode outliers — rank a prompt pool by per-prompt
delta_kl.raw. Pool defaults to the spec's own delta_kl prompts;
--from-corpus public_domain_en draws from S09's CC0 corpus instead.
Emits top-K and bottom-K blocks: the highest-divergence prompts (best
for tightening a gate) and lowest-divergence (candidates to drop from
the suite). leakage and paraphrase_invariance have section/case-
based specs; outlier mining on those is future work, documented
inline in mining/outlier_miner.py.
- New package `src/dlm_sway/mining/` — two modules, `paraphrase_miner.py` and `outlier_miner.py`, plus a thin `corpus_prompts` helper for the `--from-corpus` path.
- New CLI subcommand `sway mine SPEC --mode paraphrase|outliers [--out FILE] [--top-k N] [--n-candidates N] [--from-corpus NAME] [--seed N]`. Paste-compatible YAML emission by default; `--out` overrides the filename.
- 18 new unit tests — 7 for the paraphrase miner (ranker, diversity filter, dedup, input validation), 8 for the outlier miner (delta_kl ranking, corpus wiring, unsupported-kind skip), 3 for the CLI (paraphrase + outliers modes + the no-prompts error path).
- Prove-the-value test at `tests/unit/test_paraphrase_miner_prove_value.py`. On a deliberately-memorizing dummy backend the hand-written paraphrase list passes with `generalization_ratio > 0.5`; substituting the mined list (same seed, same adapter) drops the ratio below 0.5 and flips the verdict to FAIL, with a ≥ 0.3 ratio gap.
- Dependency footprint: no new extras. The paraphrase miner uses nlpaug (already in `[style]`) + sentence-transformers (in `[semsim]`); the diversity filter reuses the embedder `adapter_revert` already pulls. Graceful `BackendNotAvailableError` with a pip hint when the extras aren't installed.
Sprint 20 — Audit 02 closure
Closes the full finding inventory from .docs/audits/02-followup-audit.md
(1 🔴 critical, 8 🟠 major, 11 🟡 minor, 4 stronger-test opportunities, 3
dead-code items). Sixteen sprints of innovation work had accumulated
since Audit 01; the re-audit surfaced one ship-blocker (S14's bootstrap
CI dropped at the runner boundary) plus a grab-bag of claim-vs-code
gaps, dead branches, and untested contracts.
🔴 Critical — ship-blocker.
- F01 — `suite/runner.py:_with_duration` now forwards every `ProbeResult` field. The pre-fix version silently dropped `ci_95` on every probe, making S14's headline deliverable runtime-inert in any real `sway run`. A regression test at the runner boundary + a snapshot fixture carrying a populated `ci_95` pin the fix.
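The shape of the fix, sketched with a trimmed stand-in result type: `dataclasses.replace` forwards every field by construction, so a newly added field like `ci_95` cannot be silently dropped the way a manual rebuild drops it:

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ProbeResult:
    """Trimmed stand-in for the real result type (the shipped one has more fields)."""
    name: str
    score: float
    ci_95: Optional[tuple] = None
    duration_s: Optional[float] = None


def with_duration(result: ProbeResult, seconds: float) -> ProbeResult:
    # replace() copies every field and overrides only duration_s; a manual
    # field-by-field constructor call is what silently dropped ci_95 pre-fix.
    return replace(result, duration_s=seconds)
```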
🟠 Major — claim-vs-code gaps.
- F02 — real `sklearn.cluster.KMeans` exercised by two new unit tests (separation + seed determinism) + a slow+online integration test at `tests/integration/test_cluster_kl_e2e.py`. Before: every cluster_kl test monkeypatched `_kmeans_cluster` with an argmax stub, leaving S16's reason-for-being unverified.
- F03 — `sway list-probes` falls back to the defining module's `__doc__` when a probe class has no docstring. Every one of the 13 shipped probes now renders a summary row.
- F04 — `sway doctor` probes `plotly` (the load-bearing `[viz]` dep), `sklearn` (S16 cluster_kl), and the `api` extras (`httpx`, `tenacity`).
- F05 — README primitives table reflects 13 probes (adherence + cluster_kl, calibration + external_perplexity, baseline row for null_adapter).
- F06 — `TwoModelDifferential` composes `safe_for_concurrent_views` from its two inner backends. Before: the attribute was missing and the runner defaulted to `False` even when both inners set `True`, breaking S13's concurrency claim at the only supported differential-use pattern.
- F07 — `integrations/dlm/autogen` emits `cluster_kl` when the prompt pool clears 20 entries; the markdown report gains a per-probe "Cluster breakdown" section with per-cluster mean KL + exemplars.
- F08 — `backends/dummy._NullView` RNG uses `hashlib.md5` instead of Python's PYTHONHASHSEED-salted `hash()`. A cross-process determinism test at `tests/unit/test_cross_process_determinism.py` pins the fix.
- F09 — trace writer ↔ analyzer round-trip test at `tests/unit/test_runner_backend_stats.py`: runs a suite with `trace_path=`, loads the file back, asserts probe labels + hit/miss counts match `backend_stats`.
🟡 Minor — dead code, doc drift, latent brittleness.
- F10 — removed the dead `_ft_view` branch in `probes/prompt_collapse.py`.
- F11 — pytest plugin resolves spec paths against `config.rootpath`, not the process cwd.
- F12 — `backends/api._post_completions` delegates to tenacity's `Retrying()` callable; removes the unreachable post-loop fallback.
- F13 — the `preference_flip` WARN branch emits `ci_95` and `z_by_rank` for consistency with every other numeric probe's WARN shape.
- F14 — the `adapter_ablation` module docstring documents why `ci_95` renders as an em-dash (it's a curve-fit, not a sample-mean aggregator).
- F15 — report footer surfaces `null_adapter` opt-outs (probes with `calibrate_spec=None`) in both terminal and markdown surfaces.
- F16 — report category ordering derives from `DEFAULT_COMPONENT_WEIGHTS` and accepts unknown categories from custom `Probe` subclasses.
- F17 — the `cluster_kl` degenerate zero-variance case now returns `Verdict.WARN` with `evidence["degenerate_zero_variance"]=True` and suppresses the z-score to avoid a spurious small-sample-noise calibration.
- F18 — `_calibration_pack.py` gains per-section provenance notes (F18 audit trail) without noisy per-item annotations.
- F19 — pytest plugin defers `dlm_sway.core.result`/`dlm_sway.core.errors` imports to call sites so non-`@pytest.mark.sway` users don't pay the load tax.
- F20 — `sway check --help` documents the σ-banner's null-calibration dependency (fall-through to the composite score band when null SKIPs).
Stronger-test opportunities — regression tests for things we weren't pinning.
- #9 — `probes/_divergence` rejects effectively-uniform TokenDists (spread < 1e-9) via `_check_non_degenerate_token_dist`. Catches a shape-broken lm_head that would otherwise compute a trivial constant divergence across prompts. Two compatibility touch-ups rode with the change: the dummy backend's synthesized `ft` dist and the cluster_kl test fixture `_dist_broad` gained a tiny monotonic perturbation to clear the guard (real models never produce bit-uniform logits thanks to fp32 accumulation noise).
- #10 — cross-verdict consistency test (`tests/unit/test_cross_verdict_consistency.py`): runs the same spec through `sway run`, `sway gate`, and `sway report --format junit` and asserts identical per-verdict tallies.
- #11 — `sway doctor --json` schema-shape snapshot test locks the top-level keys and the per-extra module-name set.
- #12 — subprocess determinism test asserts `_NullView.next_token_dist` is byte-identical across `PYTHONHASHSEED` values; pins F08's fix.
Dead-code inventory — DC3–DC5.
- DC3 — `null_adapter.py`'s pre-S10 cache-promotion branch is annotated as legacy; removal queued for the next minor.
- DC4 — dropped the stderr warning in `suite/runner.py` that fired whenever `concurrent_probes > 1` even though execution never fanned out. The spec field is still accepted (a future pool will light it up); the noisy warning is gone.
- DC5 — `tests/unit/test_model.py` grows coverage of `ModelSpec`'s dtype enum, `endpoint`, `trust_remote_code`, and the `custom`/`api` `kind` branches.
Final state: 582 unit tests + 3 integration tests passing; mypy strict clean across 55 source files; ruff + format clean across 137 files. Sprint 20 is a single PR composed of per-finding, per-file commits so the history reads as a fix-by-fix cleanup.
Sprint 16 — Cluster-coherent KL probe
Closes Audit 01 innovation item F8. Adds a new cluster_kl probe that
answers the question delta_kl can't: the mean divergence may be the
same, but did the adapter shift the right topics, or is it a uniform
blunt-instrument shift?
- New probe `cluster_kl` (category: adherence). Embeds prompts via shared MiniLM (same cache key as `adapter_revert`), k-means clusters them at a fixed seed, measures per-prompt JS divergence between base and ft, and reports a specificity ratio: `between_variance / (between_variance + within_variance)`. Range `[0, 1]`: `≈ 0.5` on a blunt adapter that shifts every topic by the same amount, `→ 1.0` on a topic-targeted adapter. The pair `(mean_kl, specificity)` tells a more honest story than either number alone.
- Bootstrap CI on specificity. Resamples `(divergence, cluster_label)` pairs with replacement and takes the 2.5/97.5 percentiles of the bootstrap ratio distribution. Returns `None` below 4 prompts — matches the convention `core.stats.bootstrap_ci` uses.
- Z-score calibration against the `null_adapter` baseline via the existing `_zscore` helpers. `calibrate_spec` synthesizes 8 mixed-topic sentinel prompts + k=2 for the null pass; the specificity distribution concentrates around 0.5 there, so a real adapter's separation above that baseline reads as a clean z-score.
- Degenerate-input policy. `< min_prompts` (default 20) → SKIP with a clear message; `num_clusters * 2 > num_prompts` → SKIP (per-cluster mean not well-resolved); empty prompt list → ERROR. Missing `[semsim]` extras → SKIP with a `pip install 'dlm-sway[semsim]'` hint.
- Zero-variance fallback. If every prompt produced the same divergence (canned stub data, no adapter motion), the specificity ratio is mathematically undefined; we return `0.5` — the null-adapter expectation — so the downstream z-score reports "no signal" instead of NaN.
- New dependency. `scikit-learn>=1.4` added to the `[semsim]` extra (riding the 80 MB MiniLM load that probe already pulls in). Mirrored to `[all]` and to mypy's stubless-override list.
- 7 unit tests in `tests/unit/test_probe_cluster_kl.py`: two-topic adapter → high specificity, uniform adapter → 0.5 fallback, too-few prompts → SKIP, empty prompts → ERROR, `num_clusters > prompts/2` → SKIP, CI bracketing, missing-extras SKIP path.
- Prove-the-value test at `tests/unit/test_cluster_kl_prove_value.py`. Two backends with comparable `delta_kl` (ratio < 3×) have specificity scores that split by at least 0.3 — concrete evidence that `cluster_kl` surfaces a structural distinction `delta_kl` merges.
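The specificity ratio is a plain variance decomposition over per-prompt divergences grouped by cluster label; a stdlib sketch (the shipped probe computes this over JS divergences from real embedding clusters):

```python
from statistics import fmean


def specificity(divergences, labels):
    """between_var / (between_var + within_var), per-prompt weighted.

    ~0.5 for a blunt adapter that shifts every cluster alike, -> 1.0 when
    clusters separate cleanly. Zero total variance returns 0.5, the
    null-adapter expectation, so downstream z-scoring sees "no signal"."""
    grand = fmean(divergences)
    groups = {}
    for d, label in zip(divergences, labels):
        groups.setdefault(label, []).append(d)
    # classic decomposition: total variance = between + within
    between = fmean([(fmean(g) - grand) ** 2 for g in groups.values() for _ in g])
    within = fmean([(d - fmean(g)) ** 2 for g in groups.values() for d in g])
    total = between + within
    return 0.5 if total == 0 else between / total
```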
Sprint 15 — pytest plugin (@pytest.mark.sway)
Closes Audit 01 innovation item F10. Packages sway as a pytest library so teams already running pytest adopt sway with a single decorator instead of a subprocess wrapper.
- New plugin (`src/dlm_sway/pytest_plugin.py`) auto-loaded via the `pytest11` entry point after `pip install 'dlm-sway[pytest]'`. `@pytest.mark.sway(spec="...", threshold=0.0, weights=None)` expands a single pytest function into one test item per probe in the referenced spec + an optional `__gate__` item that fires only when the composite score drops below `threshold`.
- Verdict translation. `FAIL`/`ERROR` → pytest Failed, `SKIP` → pytest Skipped, `WARN` → pytest warning, `PASS` → pytest pass. Probe-level failures isolate: a failing adherence probe doesn't mask a failing calibration one, and `pytest -k adherence` runs just that probe.
- Suite runs once per decorated function. A session-scoped `_SuiteCache` keyed on `(spec_path, weights)` ensures the N-way item expansion doesn't multiply backend wall time. Two `@pytest.mark.sway` tests against the same spec share one run.
- Malformed marks fail cleanly. A missing `spec`, non-numeric `threshold`, non-dict `weights`, or unknown kwargs produce a synthetic `_ConfigErrorItem` with a green-field pytest failure line — no cryptic collection-time tracebacks.
- New `[pytest]` extra — just `pytest>=8.0`. Matches the pattern other plugin-style extras use. Also registers the `pytest11` entry point so the plugin is discovered automatically on install, consistent with pytest-cov / pytest-xdist.
- Example directory at `examples/pytest_integration/` with a minimal `sway.yaml` + `test_sway_gate.py` showing the decorator replacing a legacy `subprocess.run(["sway", "gate", ...])` wrapper. README gains a "Pytest integration" section pointing at it.
- 13 new unit tests via pytest's canonical `pytester` fixture: marker registration, N-item expansion, FAIL / SKIP / ERROR routing, gate below-threshold failure, gate above-threshold pass, gate absence when `threshold=0`, malformed-mark error paths, cache sharing across multiple decorated tests.
- Drive-by fix for the S14 CI-narrowing test flake. The dummy backend's per-prompt noise was seeded via Python's `hash()`, which is salted per-process via `PYTHONHASHSEED`. Swapped to a stable `hashlib.md5`-derived seed so the narrowing invariant holds deterministically across test-order permutations.
Sprint 14 — Bootstrap CIs + forward-pass trace CLI
Closes Audit 01 innovation items F9 (bootstrap confidence intervals on raw metrics) + F12 (polished forward-pass trace CLI). Paired sharpenings of existing probe output and existing trace infrastructure.
- New `core/stats.bootstrap_ci` — percentile-bootstrap 95% CI on any sequence of per-sample measurements. Numpy-only, 1000-resample default, seeded from `ctx.seed` so intervals are reproducible. Short-circuits on non-finite / empty / degenerate-constant inputs; the vectorized resample keeps the overhead at ~1 ms per probe.
- `ProbeResult.ci_95: tuple[float, float] | None` — new field, threaded through `safe_finalize` (nulled if `raw` is nulled, so a CI never brackets a defensive-null point estimate). `to_json`/`from_json` persist the pair as a two-list.
- Six aggregating probes emit `ci_95`: `delta_kl` (over per-prompt divergences), `calibration_drift` (over per-item regression indicators), `external_perplexity` (over per-chunk deltas), `leakage` (over clean-recall rates), `paraphrase_invariance` (over verbatim lifts), `section_internalization` (over effective SIS scores). Each also lands the interval under `evidence["raw_ci_95"]` for JSON consumers.
- Report tables add a `ci95` column between `raw` and `z` in both terminal and markdown output. `format_ci` renders as `[lo, hi]` with an em-dash for missing/non-finite. Markdown snapshot refreshed; JSON snapshot refreshed for the new per-probe `ci_95` field.
- New `sway trace <jsonl>` CLI. Reads the forward-pass trace JSONL the S07 runner writes when `--trace <path>` is set, aggregates into per-probe and per-view wall-time + hit-rate tables, plus a top-N slowest-events table. `--format terminal|md|json`, `--slowest K` to tune the tail size. The parsing tolerates old trace shapes (missing `probe`/`hit` fields) and skips malformed lines.
- `suite/trace_analysis` — typed dataclasses (`TraceEvent`, `ProbeSummary`, `ViewSummary`, `TraceReport`) + three renderers (terminal / markdown / json) matching the conventions of `suite/report` and `suite/compare`.
- Committed trace fixture at `tests/fixtures/trace_sample.jsonl` (8 events, 2 probes × 4 prompts with cache hits). Drives 22 new trace-analysis + CLI tests.
- Prove-the-value (F9): on a dummy backend with per-prompt variation, `delta_kl`'s CI narrows from `[0.33, 0.41]` (width 0.079) at N=4 prompts to `[0.38, 0.42]` (width 0.044) at N=32 — 1.8× tighter in width, tracking the √(N₂/N₁) ≈ 2.8× theoretical scaling minus dummy-backend dispersion.
- Prove-the-value (F12): the CLI's terminal output on the committed fixture correctly identifies `sis/base` (520.3 ms) as the single slowest event, `sis` (1,529 ms) as the slower probe over `dk` (529 ms), and `ft` (1,357 ms) as the slower view over `base` (701 ms).
Sprint 13 — OpenAI-compatible HTTP scoring backend
Closes Audit 01 innovation item F7. Unlocks sway against hosted
fine-tunes — OpenAI platform, vllm serve, Ollama — without
requiring a local torch + PEFT load.
- New backend `ApiScoringBackend` (`backends/api.py`): scores against a `/v1/completions` endpoint via httpx. Implements `ScoringBackend` only (not `DifferentialBackend`) since one endpoint is one model; users compose two `ApiScoringBackend` instances behind the existing `TwoModelDifferential` wrapper for a full differential run. Method surface: `logprob_of` via `echo=True` + `logprobs=0`, `rolling_logprob` via the same, `next_token_dist` via `max_tokens=1` + `logprobs=K`, plus `preflight_finite_check` and an `ApiScoringBackend.generate` for probes that need text output (e.g. `leakage`).
- Token-boundary handling. The API returns tokens as strings, so `logprob_of` walks the echoed tokens by character length until the running total covers the prompt, then sums the rest. When the boundary falls mid-token, the partial token lands on the prompt side — over-counts the prompt, under-attributes to the completion — and the behavior is asserted by a dedicated test (`test_mid_token_prompt_leans_conservative`).
- Retry + preflight. Tenacity handles 5xx + network errors with exponential backoff (configurable `max_retries`). Preflight hits the endpoint once with `hello`, `max_tokens=1`, `logprobs=1`; rejects non-finite logprobs before the suite runs.
- First backend with `safe_for_concurrent_views=True`. HTTP is stateless, so the S07 concurrent-probe scheduler can dispatch against an API backend in parallel as soon as the pool implementation lands (still scaffolding in the runner).
- `ModelSpec` additions: `BackendKind` gains `"api"`; the `ModelSpec.endpoint` field carries the server's base URL (the `/v1/completions` path is appended by the backend). API key via the `SWAY_API_KEY` → `OPENAI_API_KEY` env-var fallback, so secrets stay out of the YAML.
- `backends.build` dispatch routes `kind="api"` to `ApiScoringBackend`. Combined with `build_two_separate` + `TwoModelDifferential`, a YAML with `defaults.differential: false` and two `kind: api` models Just Works end-to-end.
- New `[api]` extra — `httpx>=0.27` + `tenacity>=9.0`. Core sway stays torch-free; the `[api]` extra adds ~1 MB of deps vs the `[hf]` extra's 3 GB.
- Unit tests (21 new) use httpx's `MockTransport` to intercept every call and assert numeric outputs match canned OpenAI-shaped responses — covers all three scoring methods, cache dedup, 4xx error surfacing, 503→200 retry recovery, API-key env fallback, NaN-response preflight rejection, and the token-boundary math.
- Prove-the-value (F7): `tests/integration/test_api_ollama.py` is opt-in via `SWAY_OLLAMA_URL` + `SWAY_OLLAMA_MODEL`. Runs the full scoring surface against a live Ollama serving a small model (e.g. `llama3.2:1b`) and asserts finite output + preflight pass. Documents the wall-time budget the "≤3× HF backend" claim rests on once the concurrent-dispatch pool lands.
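The token-boundary walk reduces to a character-length scan over the echoed tokens; a sketch (names and the exact return contract are assumptions):

```python
def completion_logprob(tokens, logprobs, prompt):
    """Sum the logprobs of the completion portion of an echoed response.

    Walks tokens by character length until the running text covers the
    prompt, then sums the rest. A token straddling the boundary lands on
    the prompt side: over-counts the prompt, under-attributes the
    completion, which is the conservative direction."""
    covered = 0
    for i, tok in enumerate(tokens):
        if covered >= len(prompt):
            return sum(logprobs[i:])
        covered += len(tok)
    return 0.0  # prompt consumed every echoed token
```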
Sprint 12 — Interactive HTML report
Closes Audit 01 innovation item F6. Adds the exploration surface to sit alongside the terminal (CI logs) and markdown (PR artifacts) outputs — a single-file interactive HTML page for the research / write-up case.
- `sway report result.json --format html --out report.html` emits a self-contained HTML page with five interactive Plotly panels: a composite-score gauge with banded thresholds, a per-category horizontal-bar breakdown, a per-section SIS bar chart (when `section_internalization` evidence carries a `per_section` array), the adapter-ablation response curve (when `adapter_ablation` evidence carries `lambdas` + `mean_divergence_per_lambda`), and an all-probe score × z-score scatter with hover tooltips. The terminal verdict palette (green / yellow / red) carries across every panel.
- Plotly bundle inlined once in `<head>`; each panel is a Plotly-produced `<div>` with a stable `sway-*` id so snapshot tests don't churn on Plotly point releases. No external `<script src>`/`<link href>` references — the page loads offline from a single ~4.9 MB file (Plotly 6.x JS is the majority of that; the sway wrapper + chart data is ~40 KB).
- CLI gates `--out PATH` on file-producing formats. `--format html` requires `--out` (3 MB of JS has no business on stdout); `--format md|json|junit` now also accept `--out` for symmetry with the HTML path. `--format terminal` rejects `--out` — the terminal renderer is for the console only.
- `plotly>=5.20` is optional, shipped via the existing `[viz]` extra (alongside matplotlib). Without it, `--format html` exits 2 with the `pip install 'dlm-sway[viz]'` install hint. The graceful ImportError → RuntimeError translation is tested end-to-end.
- Snapshot at `tests/snapshots/report.html` locks the Sway-owned wrapper structure. The Plotly JS bundle is stripped from the snapshot (replaced with a placeholder) so the 43-line snapshot stays human-reviewable and Plotly version bumps don't drift it.
- Prove-the-value: `test_html_from_real_history_loads_offline` renders HTML from the committed `tests/fixtures/sway-history/02-*` run (4 probes, real exported payload) and asserts: the file parses via `html.parser`, zero external `<script src>`/`<link href>` references, every probe name appears in the body, and file size lands between 1 MB and 10 MB. Measured output: 4.87 MB.
Sprint 11 — sway compare across saved JSON runs
Closes Audit 01 innovation item F5. Ships the regression-dashboard primitive every CI integration reached for but had to script.
- New CLI subcommand `sway compare`. Accepts N saved result JSONs (typically from `sway run --json`), rehydrates each via `report.from_json`, folds them into a score matrix, and renders: a per-probe score table with columns per run, per-adjacent-pair delta columns (colored red on drop / green on lift), and a composite-score timeline row. Formats: `terminal` (default, Rich), `--format md`, `--format json`.
- `--fail-on-regression <threshold>`. Exits 1 when any probe's score in the newest run dropped ≥ `threshold` vs the prior run. `threshold=0` (default) disables the gate. The gate fires after the output is emitted so CI logs always capture the matrix even on red builds.
- `suite/compare.py` module. Clean separation: `build_matrix` folds `(SuiteResult, SwayScore)` pairs into a `CompareMatrix` dataclass; `render_{terminal,markdown,json}` consume that. No filesystem IO in the module itself — the CLI owns the reads, the renderers own the writes. Probes that disappeared between runs show as `None` / em-dash in the matrix; new probes show as `None` on older runs. The union of probe names is sorted for stable row order across invocations.
- Markdown snapshot locked at `tests/snapshots/compare.md` — silent schema drift breaks the test like every other snapshot.
- Committed history fixture + prove-the-value test. `tests/fixtures/sway-history/{01,02,03}-*.json` ship a three-run narrative: baseline → retrained-improved → over-trained. Run 03 plants a `section_internalization` drop of 0.22 and a `calibration_drift` drop of 0.25 while `delta_kl` still rises (the memorization-without-generalization failure mode). `test_compare_catches_planted_regression` invokes `sway compare sway-history/*.json --fail-on-regression 0.10` and asserts exit=1 with both regressed probes surfaced in the JSON payload and terminal text — exactly the F5 "CI gate the build on regression" experiment.
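The regression gate is a per-probe delta check between the two newest runs; a sketch over plain `{probe_name: score}` dicts (a simplification of the real `CompareMatrix` path):

```python
def regressed_probes(prev, newest, threshold):
    """Names of probes whose score dropped >= threshold vs the prior run.

    threshold <= 0 disables the gate, matching the CLI default; probes
    missing from either run (or scored None) are skipped, since a None
    cell renders as an em-dash rather than a regression."""
    if threshold <= 0:
        return []
    return sorted(
        name for name, new_score in newest.items()
        if name in prev
        and new_score is not None and prev[name] is not None
        and prev[name] - new_score >= threshold
    )
```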
Sprint 10 — Multi-rank adversarial null adapters
Closes Audit 01 innovation item F4. Extends the S02 null-calibration matrix with a per-rank profile — users now read "how rank-saturated is my adapter?" straight off the report.
- `NullAdapterSpec.rank_multipliers: list[float] = [1.0]` (new). Default preserves single-rank behavior byte-for-byte. Setting `[0.5, 1.0, 2.0]` calibrates three independent null distributions per probe kind and emits a z-profile alongside the verdict z-score.
- `NullCalibratedBackend.as_null_adapter` gains a `rank_scale` kwarg. Both shipped backends (dummy + HF) implement it by scaling the null-weight noise std by `sqrt(rank_scale)` — mathematically equivalent to a rank change in terms of the LoRA output variance (`A·B` is a sum of `r` rank-1 outer products; variance is linear in `r`). No tensor-shape surgery, no model reload per multiplier.
- `RunContext.null_stats_by_rank` threads the per-rank matrix through the runner. Keys are canonical `rank_{mult:.2f}` strings; inner structure matches `null_stats`.
- Numeric probes emit `evidence["z_by_rank"]`. Every numeric probe (delta_kl, calibration_drift, external_perplexity, leakage, adapter_revert, adapter_ablation, paraphrase_invariance, preference_flip, prompt_collapse, section_internalization, style_fingerprint) computes per-rank z-scores using its existing sign convention. The verdict path still reads the 1.0x group — no existing thresholds shift.
- Report surfaces the profile. Terminal + markdown probe notes append `rank profile: +4.2σ @ 1x / +6.8σ @ 0.5x / +2.1σ @ 2x` whenever multi-rank calibration ran. A `format_z_profile` helper centralizes the rendering.
- Null-stats disk cache widens its key to include `rank_multipliers`. Single-rank caches from pre-S10 runs are still readable — they're promoted to the new `null_stats_by_rank` shape on load.
- Bug fix (`external_perplexity`): drop erroneous z sign flip. `mean_delta` is higher-is-better (ft logprob minus base logprob on external prose), so the raw z-score maps directly onto `z >= assert_z_gte`. S09's sign flip was reversing the pass/fail direction when the null-calibration path ran.
- README: rank-profile interpretation guide. Explains when the shape indicates rank saturation vs. rank oversizing, plus the "low rank can be pathologically quiet" dual-reading caveat.
- Prove-the-value (`tests/unit/test_null_multi_rank.py`): on a fixed adapter with the dummy backend, the delta_kl z-profile is strictly monotone in inverse rank (z@0.5x > z@1.0x > z@2.0x) — exactly the signature of a rank-scaled null distribution.
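The variance argument above (`A·B` is a sum of `r` rank-1 outer products, so each output coordinate's variance grows linearly in `r`) can be checked numerically. The sketch below is illustrative stdlib Python, independent of the shipped backends: one output coordinate of a rank-`r` product of iid standard-normal factors is a sum of `r` products of normals.

```python
import random
import statistics

random.seed(0)
TRIALS = 20_000

def lora_output_var(r: int) -> float:
    """Empirical variance of one output coordinate of A @ B for iid N(0, 1)
    factors: a sum of r independent products of normals, so variance ≈ r."""
    samples = [
        sum(random.gauss(0, 1) * random.gauss(0, 1) for _ in range(r))
        for _ in range(TRIALS)
    ]
    return statistics.pvariance(samples)

v1, v2 = lora_output_var(4), lora_output_var(8)
# Doubling the rank roughly doubles the output variance, which is why
# scaling the null-noise std by sqrt(rank_scale) stands in for a rank change.
print(v1, v2, v2 / v1)
```

Under these assumptions the printed ratio lands near 2, matching the sqrt-scaling equivalence the backends rely on.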
### Sprint 09 — External-perplexity-gap probe
Closes Audit 01 innovation item F3. First innovation sprint landed on top of the audit-closure campaign (S01–S07).
- New probe `external_perplexity` (`probes/external_perplexity.py`): rolling-logprob delta of ft vs base on held-out public-domain English prose. Raw metric is `mean_delta_nats` (per-token). Negative = adapter raised perplexity on external text (diffuse forgetting); positive = adapter improved English modeling incidentally. Category `calibration`; scored alongside `calibration_drift`.
- Packaged public-domain corpus (`probes/_corpora/public_domain_en.txt`, ~14 KB): hand-assembled from twelve US-public-domain passages (Lincoln, Emerson, Twain, Thoreau, Austen, Darwin, Shakespeare, Franklin, Melville, Dickinson, US Constitution, Federalist #10). Each passage carries an inline `# -- source:` provenance comment identifying the work and the "US public domain (pre-1929)" basis. Loader strips the comments at read time so the probe sees only raw prose.
- Null calibration: `calibrate_spec()` returns a 4-chunk cheap version so `null_adapter` produces per-kind stats without dominating suite runtime. Sign-flipped z-score — lower raw delta is worse, so the shared `z >= assert_z_gte` semantics read as "σ better than noise" on external fluency.
- `autogen` wiring: emits an `external_ppl` suite entry (with `corpus=public_domain_en`, `max_chunks=8`) whenever the source `.dlm` has any PROSE section. Intent table explains it as the "diffuse-forgetting complement to `calibration_drift`."
- Prove-the-value test (`tests/unit/test_ext_ppl_vs_calibration_drift.py`): a dummy backend that applies a uniform −0.3 nats/token drift to every pack item and every corpus chunk — below `calibration_drift`'s 1.0-nat per-item regression threshold, above its −0.5 mean-delta gate, but well below `external_perplexity`'s −0.1 fixed threshold. `calibration_drift` reports PASS, `external_perplexity` reports FAIL. The two probes measure different failure modes and do not substitute for each other.
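The raw metric can be sketched in a few lines. The function name mirrors the changelog's `mean_delta_nats` field, but the signature and pooling details are assumptions, not the shipped API:

```python
def mean_delta_nats(ft_logprobs: list[list[float]],
                    base_logprobs: list[list[float]]) -> float:
    """Per-token mean of (ft - base) rolling logprobs, pooled over all chunks.

    Negative: the adapter raised perplexity on external prose (diffuse
    forgetting). Positive: it incidentally improved general English modeling.
    """
    deltas = [f - b
              for ft_chunk, base_chunk in zip(ft_logprobs, base_logprobs)
              for f, b in zip(ft_chunk, base_chunk)]
    return sum(deltas) / len(deltas)

# A uniform -0.3 nat/token drift over two chunks, as in the prove-the-value
# test: the pooled mean is below a -0.1 threshold, so the probe would FAIL.
ft = [[-2.3, -2.3], [-2.3]]
base = [[-2.0, -2.0], [-2.0]]
print(mean_delta_nats(ft, base))  # ≈ -0.3
```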
### Sprint 07 — Performance & caching
Closes Audit 01 findings B19 (deferred per design note), E-cache-opportunity.
- Forward-pass cache (`backends/_instrumentation.py`): every backend view now routes `next_token_dist` / `rolling_logprob` / `logprob_of` through a bounded LRU keyed on `(op, view_id, prompt_hash, top_k)`. Baseline A/B on a tiny 4-probe suite against SmolLM2-135M on CPU: 2.33 s → 1.76 s (25% wall-time reduction, forward passes 88 → 62, hit rate 30%). The audit's 18.5 s Quillstone suite has more overlap and would see larger gains; the 25% floor holds on any suite with repeated prompts across probes.
- Cache key includes `view_id`: `"base"` / `"ft"` / `"scaled_1.25"` / `"null_42"` are distinct namespaces. A future toggle regression that fails to flip the adapter would surface as wrong-side cached values immediately, not silently corrupt divergence math.
- Backend stats surface in the report: `SuiteResult.backend_stats` captures `cache_hits` / `cache_misses` / `forward_passes` / `scoring_wall_s` / `hit_rate`. Terminal + markdown footers render `cache: 26/88 = 30%` when stats are present.
- Forward-pass tracing: `sway run --trace <path.jsonl>` writes one event per backend scoring call (probe / view_id / prompt_hash / top_k / op / wall_ms / hit). Zero overhead when unset.
- `concurrent_probes` scaffolding: `spec.defaults.concurrent_probes: int = 1` field plus `safe_for_concurrent_views: bool = False` class attribute on HF / MLX / Dummy backends. The runner warns on stderr when the user requested > 1 against an unsafe backend and stays sequential. Custom backends that are already concurrency-safe (e.g. a stateless hosted-API backend) can opt in without waiting for the HF fix.
- B19 design note: `.docs/design/backend-concurrency.md` documents current state, why v0.1 doesn't fix it, and two future paths (per-thread lock vs per-worker pool) with a recommendation.
- Generation stays uncached: `view.generate()` intentionally bypasses the cache — probe-side callers vary `(prompt, max_new_tokens, temperature, seed)` in ways that would rarely collide, and caching sampled output would hide seed bugs behind stale strings.
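The cache shape described above — a bounded LRU with hit/miss counters, keyed on the `(op, view_id, prompt_hash, top_k)` tuple — can be sketched with `collections.OrderedDict`. The class name and wiring below are illustrative; the real `backends/_instrumentation.py` layer is not reproduced here.

```python
from collections import OrderedDict
from typing import Any, Callable, Hashable

class ForwardPassCache:
    """Minimal bounded-LRU sketch: distinct view_ids never collide because
    the view_id is part of the key, so 'base' and 'ft' are separate namespaces."""

    def __init__(self, maxsize: int = 1024) -> None:
        self._store: OrderedDict[Hashable, Any] = OrderedDict()
        self.maxsize, self.hits, self.misses = maxsize, 0, 0

    def get_or_compute(self, key: Hashable, compute: Callable[[], Any]) -> Any:
        if key in self._store:
            self.hits += 1
            self._store.move_to_end(key)        # refresh LRU position
            return self._store[key]
        self.misses += 1
        value = self._store[key] = compute()    # the "forward pass"
        if len(self._store) > self.maxsize:
            self._store.popitem(last=False)     # evict least-recently-used
        return value

cache = ForwardPassCache(maxsize=2)
key = ("next_token_dist", "ft", "prompt-hash-1", 50)
cache.get_or_compute(key, lambda: {"a": 0.9})   # miss: compute runs
cache.get_or_compute(key, lambda: {"a": 0.9})   # hit: compute skipped
print(cache.hits, cache.misses)  # 1 1
```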
### Sprint 06 — CLI & report UX polish
Closes Audit 01 findings D3, D4, D5, D6, D7, D8, D9, D10, D11, D12, D13, D14, D15, B16, B17, B18.
- D3 — extras rollup footer: terminal + markdown reports now end with a single `pip install 'dlm-sway[...]'` line collecting every extra mentioned in SKIP messages. Single rollup, no per-row scan.
- D4 — `sway check` infers `--base`: reads `base_model_name_or_path` from the adapter's `adapter_config.json` when `--base` is omitted; echoes "(inferred base model: …)" so the user knows what got picked. Falls back to a clear required-flag error when the field is missing.
- D5 — `sway autogen` annotates the YAML: generated specs carry a header block (source `.dlm`, dlm_id, base, adapter, generation timestamp, sway version) and a one-line `# intent` comment above every probe entry. No new dep — pyyaml + a position-based post-processor instead of `ruamel.yaml`.
- D6 — `sway run --dry-run` + `sway list-probes`: dry-run validates the spec, prints a probe table (#, name, kind, category, enabled), and exits 0 without building a backend. `list-probes` prints every shipped probe kind with its category and the first line of its docstring.
- D7 — `sway doctor --json`: machine-readable doctor payload (`sway_version`, `python`, `platform`, `extras`) keyed by extra and module. CI-grep-friendly.
- D8 — `dlm_source` resolution warns instead of silently swallowing: when the spec sets `dlm_source` but the `[dlm]` extra isn't installed (or the bridge errors), the runner emits a yellow stderr warning with the install hint and continues with `sections=None`. No more silent "why are my section probes SKIPping?" debugging.
- D9 — markdown column parity with terminal: the markdown probes table now carries `score`, `raw`, `z`, `duration`, and `note` columns. A *Top findings* section follows the table; a *Skipped probes* section closes the report when extras are missing.
- D10 — unified number formatters: `format_score`, `format_raw`, `format_z`, `format_duration_s` live in `suite/report.py` and are the only numeric formatters used. Thousands separators above 1 000; `—` glyph for `None` / non-finite. Single source kills cross-surface drift.
- D11 — `sway report --format` validated via `StrEnum`: the Typer-level enum rejects unknown formats with the standard "Invalid value for '--format'" error. No more silent fallback to the terminal renderer on a typo.
- D12 — `sway check` verdict banner: prints `✅ adapter is +4.2σ above noise` (green, ≥ 3σ), `⚠️ adapter is +1.5σ above noise — marginal` (yellow, ≥ 1σ), or `❌ adapter is +0.3σ — indistinguishable from noise` (red) above the full report. Calibrated on the delta_kl z-score; `check` now runs `null_adapter` first so the z-score is available.
- D13 — `sway diff` regression summary: after per-probe deltas, the diff prints `A→B: N regressed >0.10, M regressed >0.20, composite Δ=±X.XX`, color-coded by direction.
- D14 — adapter paths with spaces render safely: `_adapter_label` wraps the path in double quotes when any whitespace is present.
- D15 — long messages wrap, don't truncate: the terminal probe table column uses Rich's `overflow="fold"` instead of an 80-char hard cut with an ellipsis. The full message is always visible.
- B16 — single markdown renderer: `report.from_json` round-trips saved JSON back into the canonical dataclass pair, so `sway report --format md` and `sway run --markdown` both flow through `report.to_markdown` with identical output. The legacy `_render_markdown_from_json` and `_render_junit_from_json` helpers in `cli/commands.py` are deleted.
- B17 — `None` scores render as `—`: the unified formatters cover every render path; no more `0.00` masking missing data.
- B18 — `baseline` row labeled `(informational, weight=0)`: in both terminal and markdown component breakdowns, matching the Sprint 03 explicit-weight-zero decision.
### Sprint 05 — Probe quality & edge cases
Closes Audit 01 findings B6, B7, B8, B9, B10, B11, B12, B13, B14, B20, B21, B22.
- B6 — `tail_logprob` semantics: the field is now `float | None` with three discrete states: `None` (top-k covered the full vocab — no tail to redistribute), `0.0` (a real tail underflowed below fp32), and a measurable negative log-prob. HF, MLX, and dummy backends all emit `None` when `k == vocab`. Preflight check guards against non-finite `tail_logprob` only when it's a number.
- B7 — probe validation before backend build: new `validate_all_probes(suite)` helper collects every spec error in one pass; `_execute_spec` calls it before materializing the backend so a typo in `kind:` surfaces immediately, and all typos surface in a single error message.
- B8 — `autogen` style prompts: `style_fingerprint` no longer receives the leading sentence of a prose section (which elicited doc content, not stylistic voice). Replaced with a fixed `_STYLE_ELICITATION_PROMPTS` set of 6 open-ended, content-neutral prompts.
- B9 — sentence-transformer caching: `_load_embedder` is now `functools.lru_cache(maxsize=4)`. Suites that run `adapter_revert` back-to-back (multi-adapter diff, repeated probes) reuse the same ~80 MB embedder instead of re-loading it every probe call.
- B10 — `_lcs_ratio` docstring rot: the function name is a historical misnomer (it's gestalt similarity via `difflib.SequenceMatcher`, not LCS). Docstring rewritten to make the contract explicit; rename deferred to v0.2 to preserve the public API surface.
- B11 — leakage adversarial perturbations: added 4 new perturbations (`synonym_swap` with a hand-curated 50-pair table, `clause_reverse`, `prefix_inject`, `register_shift`) alongside the original three. Default perturbation list is now all 7; the spec accepts any subset.
- B12 — calibration pack expansion: `BUILT_IN_PACK` grew from 30 to 200 items (geography, natural sciences, arithmetic, language, history, biology, technology, miscellaneous). All public-domain / hand-composed grade-school facts — no third-party dataset license attaches. `items_limit` docstring documents the new resolution (~0.5 pp per regressed item vs the old ~3.3 pp).
- B13 — tokenizer-aware `_stuffing`: `prompt_collapse` now derives its padding from the model's pad / unk / EOS token via the backend's tokenizer, making the metric language-agnostic. The pre-B13 hardcoded English string remains as the dummy-backend fallback and behind a `legacy_stuffing: bool = False` spec field for one-release backward-compat.
- B14 — `preference_flip` per-triple error fence: wrapped each triple's `logprob_of` calls in `try/except ProbeError`. A single bad triple no longer kills the whole batch; `evidence["dropped_triples"]` and `dropped_reasons` surface the count + first 5 reasons. When every triple raises, the probe routes to ERROR with a clear explanation.
- B20 — custom backend protocol checks: `_load_custom` now isinstance-checks `NullCalibratedBackend` and `ScalableDifferentialBackend` after the `DifferentialBackend` check. The set of satisfied protocols is stamped onto the instance as `__sway_protocols__: tuple[str, ...]` so the report can show which features are available without re-checking.
- B21 — `RunContext.null_stats` truly frozen: the runner now wraps the stats dict in `types.MappingProxyType` before threading it through. The dataclass was already `frozen=True` but the dict was mutable by reference; B21 makes the docstring's "frozen" claim literally true. Field type widened from `dict[str, dict[str, float]]` to `Mapping[str, Mapping[str, float]]`.
- B22 — `ModelSpec.adapter` path normalization: added a pydantic `field_validator` that runs `Path.expanduser().resolve()` on the field at spec-load time. Backends no longer re-do the work; the cache key in `_null_cache.compute_key` is now stable regardless of how the user spelled the path in YAML or on the CLI.
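The B21 mechanism in isolation: `types.MappingProxyType` is a read-only *view* over the wrapped dict, so any write through the proxy raises. The stats values below are made up for illustration, and this sketch freezes only the outer layer, not the nested per-kind dicts:

```python
from types import MappingProxyType

null_stats = {"delta_kl": {"mean": 0.012, "std": 0.003, "n": 8}}
frozen = MappingProxyType(null_stats)   # read-only view, not a copy

print(frozen["delta_kl"]["mean"])       # reads pass through unchanged
try:
    frozen["leakage"] = {}              # writes through the proxy raise
except TypeError as exc:
    print("mutation rejected:", exc)
```

This is a cheap way to make a `frozen=True` dataclass's dict field honor the "frozen" docstring claim without copying the data.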
### Sprint 04 — Integration & regression testing
Closes Audit 01 findings C1, C2, C5, C6, C7, C8, C10, C11, C12, B3 (test side), B15.
- Slow lane runs end-to-end (C1): the `huggingface_hub` dev dep added in S01 is verified; `pytest -m "slow or online"` now executes every checked-in integration test from a clean clone.
- HF backend coverage 21% → 91% (C2) — measured combined fast + slow lane against `dlm_sway.backends.hf`. New tests:
  - `tests/integration/test_hf_adapter_toggle.py` extended with a bit-identical ft → base → ft roundtrip (B15 mitigation).
  - `tests/integration/test_hf_scaled_adapter.py`: λ sweep monotonicity + `LoraLayer.scaling[key]` restoration on clean exit AND on exception.
  - `tests/integration/test_hf_null_adapter.py`: same-seed determinism, different-seeds divergence, original adapter restoration on clean exit AND on exception.
  - `tests/integration/test_hf_scoring.py`: `logprob_of` (incl. zero-token-completion → ProbeError), `rolling_logprob`, `next_token_dist`, `_HFView.generate` (greedy + sampled-with-seed determinism).
  - `tests/unit/test_backend_hf_helpers.py`: direct unit coverage on `_resolve_dtype` and `_detect_device` so dtype regressions are caught in the fast lane.
- MLX smoke test (C5): `tests/integration/test_mlx_smoke.py` exercises `MLXDifferentialBackend` on darwin-arm64 with a small LoRA adapter. Skips cleanly on non-darwin / non-arm64 / no-mlx_lm / missing-fixture. Reproducible adapter builder ships at `tests/fixtures/build_mlx_adapter.py` (run once on a Mac to populate the fixture directory).
- `sway gate` exit code pinned (C6): `tests/integration/test_sway_gate_exit_code.py` covers PASS (exit 0), FAIL verdict (exit 1), and below-threshold-with-passing-verdicts (exit 1) via Typer's `CliRunner`.
- `dlm` import ban regression-guarded (C7): `tests/unit/test_dlm_not_imported.py` patches every `dlm.*` entry in `sys.modules` to `None` and asserts the dummy suite runs end-to-end without ImportError, plus that `sway autogen` surfaces a clean install-hint error rather than a stack trace.
- Pathological probe coverage (C8 + B3 test side): `tests/unit/test_probe_adapter_ablation.py` now drives monotonically-decreasing curves through the helper and pins probe-level `evidence["saturation_reason"]` for flat / found / overshoot-with-dip / non_monotonic shapes via a monkeypatched `divergence`.
- WARN-branch numerical formula pinned (C10): `test_warn_branch_score_formula_pinned` in `tests/unit/test_probe_preference_flip.py` asserts the exact `score = 0.5 + mean_delta / 4.0` formula with a hand-computed expected value.
- Disjoint top-k divergence (C12): `test_disjoint_top_k_supports_produce_finite_divergence` in `tests/unit/test_divergence.py` covers the `aligned_probs` tail-redistribution path against fully-disjoint base / ft supports — JS comes out finite and approaches its theoretical ln(2) bound.
- Report schema snapshots (C11): `tests/unit/test_report_snapshot.py` byte-compares `to_json` / `to_markdown` / `to_junit` against checked-in snapshots under `tests/snapshots/`. Intentional schema bumps: `SWAY_UPDATE_SNAPSHOTS=1 uv run pytest …` and commit the updated files. No new dep — hand-rolled diff helper.
- CI workflow shipped: `.github/workflows/ci.yml` runs the fast lane (unit + lint + mypy) on every push / PR, and the slow lane (integration, HF backend) on schedule (nightly 07:00 UTC), manual dispatch, push to main, and PRs that touch `src/dlm_sway/backends/`, `tests/integration/`, or `pyproject.toml`.
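The ln(2) ceiling that the C12 test pins falls straight out of the Jensen-Shannon definition. A self-contained sketch of the math (not the library's `aligned_probs` path): for fully disjoint supports, each mixture term is half the original mass, so both KL terms equal ln 2 and so does their average.

```python
import math

def js_divergence(p: list[float], q: list[float]) -> float:
    """Jensen-Shannon divergence in nats; bounded by [0, ln 2]."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]

    def kl(a: list[float], b: list[float]) -> float:
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Fully disjoint supports: JS hits its theoretical ln(2) ceiling, finitely.
p = [0.5, 0.5, 0.0, 0.0]
q = [0.0, 0.0, 0.5, 0.5]
print(js_divergence(p, q), math.log(2))
```

The same function makes the Sprint 01 property-test invariants easy to see: `js_divergence(p, p)` is exactly 0, and no input can exceed ln 2, which is why an out-of-bound JS value is always a bug worth raising on.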
### Sprint 03 — Documentation truth & dead code
Closes Audit 01 findings P03, P04, P05, P07 (doc), P08, P09, P10, P14, P15, P17, P18.
- Determinism is now wired (P09): the suite runner calls `dlm_sway.core.determinism.seed_everything(spec.defaults.seed)` before any backend work runs. The achieved determinism class (strict / best_effort / loose) is captured in `SuiteResult.determinism` and surfaces in the report footer + JSON payload + markdown header.
- `defaults.differential: false` works end-to-end (P14): new `backends/two_model.py` with `TwoModelDifferential` wrapper + `build_two_separate(spec_models)` helper. `_execute_spec` routes to the wrapper when the flag is false. Doubles memory; primarily for custom backends that can't toggle adapters in place.
- `score_weights` overridable from YAML and CLI (P15): new `SuiteDefaults.score_weights` field with subset-validating pydantic field validator (rejects unknown categories, negative weights, all-zero weights). New `--weights k=v,k=v` flag on `sway run` and `sway gate`. Partial overrides merge with `DEFAULT_COMPONENT_WEIGHTS` so users rarely have to respecify all five categories.
- `baseline` row labeled `(informational)` with weight 0.0 explicit (P08, B18): added to `DEFAULT_COMPONENT_WEIGHTS` so the row appears in reports for transparency but contributes nothing to the composite. Terminal renderer adds the `(informational)` annotation column; markdown table adds a `weight` column with the same label.
- Extended (9-dim) style fingerprint when `[style]` is installed (P10): `style_fingerprint` now actually uses spaCy POS-tagging and textstat syllable counting. Three new dims appended to the existing 6: passive-voice rate, POS 4-gram entropy (Shannon, bits), syllables per word. Spec field `extended: "auto" | "on" | "off"` (default `"auto"`) controls the path. `evidence["schema_version"]` bumped to `2` so snapshot consumers can branch on dimensionality.
- Stale "Known gaps" entries refreshed (P03, P04, P05): removed "null_adapter not yet wired" (delivered in S01/S02), "custom backend stubbed" (it's full), "MLX paths raise" (rewritten to describe the real limitation: two model copies because `mlx_lm` has no runtime adapter toggle).
- README adds determinism + weights + differential documentation and a calibration paragraph that points at the per-kind null matrix.
- `delta_kl` docstring rot fixed (P17): dropped the dead ``:mod:`dir``` reference.
- Meta-test `test_no_dead_options.py` (P14/P15 regression guard): greps the source tree for documented spec/CLI option names and asserts each has a consumer outside its declaration site. Catches the exact "documented but unused" pattern Audit 01 flagged.
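The P15 merge-and-validate semantics can be sketched as follows. The default weight values here are made up; only the five category names and the three rejection rules come from the changelog:

```python
DEFAULT_COMPONENT_WEIGHTS = {
    # Illustrative values only; the shipped defaults live in dlm-sway.
    "adherence": 0.3, "attribution": 0.3, "calibration": 0.2,
    "ablation": 0.2, "baseline": 0.0,
}

def merge_score_weights(overrides: dict[str, float]) -> dict[str, float]:
    """Partial-override merge: reject unknown categories, negative weights,
    and an all-zero result; everything else falls back to the defaults."""
    unknown = set(overrides) - set(DEFAULT_COMPONENT_WEIGHTS)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    if any(v < 0 for v in overrides.values()):
        raise ValueError("negative weights are not allowed")
    merged = {**DEFAULT_COMPONENT_WEIGHTS, **overrides}
    if all(v == 0 for v in merged.values()):
        raise ValueError("all-zero weights make the composite undefined")
    return merged

# `--weights calibration=0.5` only respecifies one category:
print(merge_score_weights({"calibration": 0.5}))
```

The merge-with-defaults step is what lets `--weights calibration=0.5` work without forcing the user to restate the other four categories.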
### Sprint 02 — Universal z-score calibration
Closes Audit 01 findings P02 (delivery), B2, C9.
- `NullAdapterProbe` is now a per-kind calibration matrix: iterates every downstream numeric kind in the suite (or an explicit `calibrate_kinds` list), runs a miniature version of each through the `NullCalibrationBackendProxy` for N seeds, and publishes `{kind: {mean, std, n}}` under `evidence["null_stats"]`. The old "publishes two keys only" implementation is gone (closes B2).
- `NullCalibrationBackendProxy` (`_null_proxy.py`): swaps `as_finetuned()` to yield `as_null_adapter(seed)` so every numeric probe's own math produces the "what does my metric look like when the fine-tune is structural noise?" distribution without a bespoke code path per probe.
- `Probe.calibrate_spec(ctx)` classmethod plus `SENTINEL_PROMPTS` / `SENTINEL_DOC` shared constants in `probes/base.py`. Each numeric probe overrides `calibrate_spec` to return a small, cheap spec for calibration; probes that can't be meaningfully calibrated (e.g. `adapter_revert` needs an embedder, `adapter_ablation` needs `as_scaled_adapter`, `prompt_collapse` can't fit an exponential decay to null noise) opt out by returning `None` and surface `(no calibration for <kind>)` in the report.
- Shared `_zscore` helpers (`probes/_zscore.py`): `z_score(raw, stats)`, `verdict_from_z(z, threshold)`, `score_from_z(z)`, `no_calibration_note(kind)`. Every numeric probe now flows through these — no bespoke `(raw - mean) / std` math left in any probe file. Includes the `MIN_STD = 1e-6` floor so a degenerate null distribution (e.g. a single-seed calibration) can't produce an infinite z-score (closes C9).
- All 10 numeric probes thread `get_null_stats(ctx, self.kind)`: `delta_kl`, `adapter_revert`, `prompt_collapse`, `section_internalization`, `paraphrase_invariance`, `preference_flip`, `style_fingerprint`, `calibration_drift`, `leakage`, `adapter_ablation`. Each has an `assert_z_gte: float = 3.0` that is preferred when stats exist. Lower-is-better probes (`adapter_revert`, `calibration_drift`, `leakage`) sign-flip the z internally so the shared `z >= threshold` PASS rule still reads as "significantly better than null". Two probes (`paraphrase_invariance`, `calibration_drift`) keep their intent-aware / compound thresholds as the no-calibration fallback.
- On-disk null-stats cache (`probes/_null_cache.py`, `backends/hf.py`): HF backend exposes `cache_identity()`; `NullAdapterProbe` hashes `(backend_identity, runs, init_scale, seed_base, top_k, kinds)` into a stable filename under `~/.dlm-sway/null-stats/<key>.json` (XDG-respecting). Cache is best-effort — a missing / malformed file rebuilds. Disable per-suite with `cache: false` in the spec, or globally with `SWAY_DISABLE_NULL_CACHE=1`. The dummy backend doesn't expose a cache identity, so tests never touch disk unless they opt in.
- Report shows z-scores in every numeric probe row: the terminal renderer already had a `z` column; the markdown renderer now does too. Rows that fell back to fixed thresholds carry `(no calibration for <kind>)` directly in the message.
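The core contract of the shared helper can be sketched in two lines; only `z_score` and the `MIN_STD` floor are shown, and the body is assumed from the description above rather than copied from `probes/_zscore.py`:

```python
MIN_STD = 1e-6  # floor: a degenerate (e.g. single-seed) null distribution
                # must not blow up into an infinite z-score

def z_score(raw: float, stats: dict[str, float]) -> float:
    """How many null-distribution standard deviations the raw metric sits
    from the null mean, with the std floored at MIN_STD."""
    return (raw - stats["mean"]) / max(stats["std"], MIN_STD)

print(z_score(0.42, {"mean": 0.10, "std": 0.08, "n": 8}))   # ≈ 4.0
print(z_score(0.42, {"mean": 0.42, "std": 0.0, "n": 1}))    # 0.0, not inf/NaN
```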
### Sprint 01 — Finite safety & verdict integrity
Closes Audit 01 findings B1, B3 (fix), B4, B5, C3, C4, D1, D2.
- `safe_finalize` helper in `core/result.py`: every numeric probe now routes its final `ProbeResult` through this guard. A non-finite `raw` (or any explicitly-critical field) auto-converts to `Verdict.ERROR`; non-finite auxiliaries are nulled out and recorded in `evidence["defensively_nulled"]`.
- `_divergence` non-finite rejection: `aligned_probs`, `kl`, `js` now raise `ProbeError` on NaN/inf inputs instead of silently producing garbage. JS results are bounds-checked against the theoretical `[0, ln 2]`; out-of-bound results raise rather than ship. (Pins the +11639σ bug — JS exceeded ln 2 nats by 19×.)
- `PreflightCheckable` Protocol: HF and dummy backends gain `preflight_finite_check()` that runs one forward pass per view with a sentinel prompt and rejects backends whose logprobs are non-finite.
- Suite runner preflight gate: every `sway run` calls `backend.preflight_finite_check()` before the probe loop. Failure emits a single synthetic ERROR probe and skips all configured probes; `sway gate` exits non-zero. Disable with `skip_preflight=True` for sub-second test suites.
- `style_fingerprint` zero-vector fix (B4): a fine-tuned model that produces empty / whitespace-only generations no longer reports a spurious "+0.82 shift toward doc" PASS. The `_cosine_shift` math is replaced by `_projection_shift = (ft−base)·(doc−base) / ‖doc−base‖²`, which goes to zero when ft equals base.
- `_saturation_lambda` full-range search (B3): the adapter-ablation saturation detector now searches the entire λ range (not just λ ≤ 1.0) with `max(divs)` as the reference, and returns a typed reason (`"found"` / `"non_monotonic"` / `"flat_curve"` / `"below_floor"`) so a flat NaN-adapter curve is distinguishable from a healthy saturating one.
- Property-based divergence tests (`hypothesis`): symmetry, `JS(p,p) = 0`, `JS ≤ ln 2`, `KL(p,p) = 0`, KL non-negativity. Plus explicit non-finite-raises tests pinning every entry point.
- End-to-end NaN regression (`tests/integration/test_nan_adapter_regression.py`, slow+online): builds a real PEFT adapter with all-NaN weights, persists it through safetensors, then asserts the HF backend's preflight rejects it and the suite produces ERROR — not the +11639σ headline the audit caught.
- Added `huggingface_hub` to the dev dep group so integration tests can resolve the `tiny_model_dir` fixture (closes C1 ahead of Sprint 04).
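The B4 replacement formula, sketched with plain lists as a hypothetical standalone function (not the probe's internal `_projection_shift`): project the ft-minus-base movement onto the base-to-doc direction and normalize by that direction's squared length.

```python
def projection_shift(base: list[float], ft: list[float], doc: list[float]) -> float:
    """(ft - base) . (doc - base) / ||doc - base||^2.
    Exactly zero when ft == base, so a zero-vector ft (empty generations)
    cannot fake a 'shift toward doc' PASS the way a cosine could."""
    direction = [d - b for d, b in zip(doc, base)]
    move = [f - b for f, b in zip(ft, base)]
    dot = sum(m * d for m, d in zip(move, direction))
    norm_sq = sum(d * d for d in direction)
    return dot / norm_sq

base = [1.0, 0.0, 0.0]
doc = [0.0, 1.0, 0.0]
halfway = [0.5 * b + 0.5 * d for b, d in zip(base, doc)]

print(projection_shift(base, base, doc))     # 0.0  (ft identical to base)
print(projection_shift(base, halfway, doc))  # 0.5  (halfway toward doc)
```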
## 0.1.0.dev0 — 2026-04-20
Initial pre-alpha. Full 11-primitive battery shipped.
### Primitives
- Adherence
  - `delta_kl` — mean JS/KL divergence between base and fine-tuned next-token distributions
  - `adapter_revert` — reversion under adversarial paraphrase (needs `sway-eval[semsim]`)
  - `prompt_collapse` — exponential-decay fit of divergence over context length
- Attribution
  - `section_internalization` (flagship) — per-section `effective_sis` with leak check
  - `paraphrase_invariance` — memorization vs. generalization, intent-aware
  - `preference_flip` — DPO/ORPO chosen/rejected margin inversion
- Calibration
  - `style_fingerprint` — 6-dim numpy-only stylistic shift vs. document
  - `calibration_drift` — general-knowledge regression on a packaged 30-item pack
  - `leakage` — greedy LCS recall + perturbation fragility
- Ablation
  - `adapter_ablation` (signature primitive) — λ-scaled divergence curve with linearity, saturation, overshoot metrics
- Baseline
  - `null_adapter` — per-kind null distribution matrix for z-score calibration (wired end-to-end in Unreleased)
### Infrastructure
- `DifferentialBackend` + `ScalableDifferentialBackend` protocols
- HuggingFace + PEFT backend with `disable_adapter` / `set_adapter` toggling and LoRA-scale mutation
- Dummy backend for unit tests (canned responses + linear-blend scalable mode)
- YAML spec loader, composite score (four-category weighted), rich terminal + JSON + JUnit + Markdown reports
- Typer CLI: `run`, `gate`, `check`, `diff`, `autogen`, `doctor`, `report`
- `.dlm` bridge (`dlm-sway[dlm]`): resolver + full-battery autogen
- Matplotlib visualizations (`dlm-sway[viz]`): SIS bar chart, ablation curve, KL histogram
### Known gaps
- MLX backend loads two model copies in memory (base + adapter-fused) because `mlx_lm` has no runtime adapter toggle. Memory footprint is ~2× the HF path; fine for the small (<3B) models MLX typically runs.
- MLX requires a pre-converted `.npz` adapter — raw PEFT safetensors are rejected by `mlx_lm.load`. A PEFT→MLX converter is a future milestone.
- PyPI publication of the `dlm-sway` wheel is pending a clean CI release workflow.
View source
| 1 | # Changelog |
| 2 | |
| 3 | ## Unreleased |
| 4 | |
| 5 | ### Sprint 36 — `sway serve` warm-backend daemon |
| 6 | |
| 7 | Audit 03 H4. `sway run` cold-loads the HF backend each invocation |
| 8 | (~15s on a 1.5B model: weights, KV cache, deterministic-mode setup). |
| 9 | For interactive flows — notebooks, the `sway watch` retrain loop, |
| 10 | the live HTML report — that startup dwarfs the actual scoring |
| 11 | cost. `sway serve` is a long-running FastAPI daemon that loads the |
| 12 | backend once and keeps it warm across requests. |
| 13 | |
| 14 | **New CLI (`sway serve`).** Defaults: `--host 127.0.0.1 --port 8787 |
| 15 | --max-loaded-models 2`. The daemon refuses to bind a non-loopback |
| 16 | interface unless `--api-key <token>` is passed; every non-`/health` |
| 17 | request must then carry `Authorization: Bearer <token>`. |
| 18 | |
| 19 | **Endpoints.** |
| 20 | |
| 21 | - `GET /health` — uptime + the list of currently-warm models. |
| 22 | - `GET /stats` — request count + mean latency + cache size. |
| 23 | - `POST /run` — body `{spec: SwaySpec}`; returns the same JSON |
| 24 | shape `sway run --json-out` would write, plus `request_seconds` |
| 25 | for the daemon's measured execution time. |
| 26 | - `POST /score` — same as `/run` with an optional `probe_names` |
| 27 | filter; returns just the per-probe entries with no folded |
| 28 | `SwayScore`. |
| 29 | |
| 30 | **Backend cache.** LRU keyed on `(kind, base, adapter, dtype, |
| 31 | device)`. Capped at `--max-loaded-models`; loading a third distinct |
| 32 | model LRU-evicts the oldest, calling `backend.close()` to release |
| 33 | GPU memory. Single-flight: concurrent requests for the same key |
| 34 | serialize at the loader instead of building twice. |
| 35 | |
| 36 | **Python SDK.** `dlm_sway.serve.client.ServeClient(url)` exposes |
| 37 | `health()`, `stats()`, `run(spec)`, `score(spec, probe_names=...)`. |
| 38 | Stateless (no persistent connection pool); raises |
| 39 | `ServeClientError` (subclass of `SwayError`) on transport failure |
| 40 | or non-2xx responses. |
| 41 | |
| 42 | **Auth posture (v1).** Defaults to no auth on loopback — the |
| 43 | threat model on a single-user dev box is "did I bind 0.0.0.0 by |
| 44 | accident". The CLI hard-refuses `--host 0.0.0.0` without an API |
| 45 | key. Full OAuth is deferred. |
| 46 | |
| 47 | ### Sprint 33 — `training_drift` probe (cross-repo, reads dlm loss curves) |
| 48 | |
| 49 | Closes the X2 "training_drift probe" backlog item. Sister to S25 |
| 50 | `gradient_ghost`: where the ghost reads optimizer state at |
| 51 | end-of-training, `training_drift` reads the loss *curve* during |
| 52 | training. Both are pre-run, no model load, no backend required. |
| 53 | |
| 54 | **New probe (`kind: training_drift`, category: calibration).** |
| 55 | |
| 56 | For a dlm store, the probe parses every `train-*.jsonl` under |
| 57 | `<store_path>/logs/`, dedupes resumed runs (latest occurrence wins), |
| 58 | and computes four metrics: |
| 59 | |
| 60 | - `final_loss` — last recorded step's loss. |
| 61 | - `convergence_ratio` — `final_loss / initial_loss`. |
| 62 | - `smoothness` — `1 − var(Δloss) / var(loss)`, clipped to `[0, 1]`. |
| 63 | - `instability_events` — count of loss-*increase* events whose |
| 64 | magnitude exceeds the local typical movement scale (median |
| 65 | absolute delta in a centered window). NaN losses count as one |
| 66 | instability each, then forward-fill so downstream stats stay |
| 67 | finite. |
| 68 | |
| 69 | Verdict PASS when all three thresholds clear (smoothness ≥ 0.7, |
| 70 | convergence_ratio ≤ 0.7, instability_events ≤ 0). Otherwise WARN |
| 71 | with each failed threshold listed in the message. |
| 72 | |
| 73 | **Spike heuristic note.** Sprint plan called for `|Δloss| > 3 · |
| 74 | rolling_std`. Implementation rejects that: on a smooth exponential |
| 75 | decay, within-window std-of-deltas stays tiny while absolute deltas |
| 76 | are large — every step trips the threshold (verified during dev: |
| 77 | 60-step smooth curve flagged 59 false-positive spikes). Replaced |
| 78 | with: count *positive* deltas (loss going up — the semantically |
| 79 | meaningful instability) that exceed `sigma · median(|Δ|)` in a |
| 80 | centered window. Robust to scale changes across training; loss |
| 81 | going down faster than usual is no longer mistaken for instability. |
| 82 | |
| 83 | **No null calibration.** Mirrors `prompt_collapse` and |
| 84 | `multi_turn_coherence_decay`: a null adapter has no loss curve, |
| 85 | the null distribution of "smoothness on a noise adapter" is |
| 86 | undefined. Fixed-threshold verdicts; users override per-spec. |
| 87 | |
| 88 | **Log-format note.** Sprint plan said dlm writes per-step JSONs at |
| 89 | `logs/train_step_*.json`. Reality (verified against |
| 90 | `~/.dlm/store/`): one mixed JSONL per run at |
| 91 | `logs/train-NNNNNN-YYYYMMDDTHHMMSS.jsonl` containing banner + |
| 92 | delta + step + run_complete records. The probe filters for |
| 93 | `{"type": "step"}` lines and reads `step` + `loss`. Sibling |
| 94 | `*.summary.json` carries run aggregates we don't consume — the |
| 95 | curve is richer. |
| 96 | |
| 97 | **Robustness:** |
| 98 | - Resumed runs (overlapping step numbers across multiple jsonls): |
| 99 | dedupe-by-keep-latest mirrors `dlm metrics` semantics. |
| 100 | - Truncated JSONL tail (crashed-mid-line trainer): partial line |
| 101 | is skipped; valid lines still consumed. |
| 102 | - NaN losses are recorded as `+inf` so the spike detector flags |
| 103 | them without numpy NaN poisoning the rest of the pipeline. |
| 104 | - 1500-step run downsamples to ≤ 512 evidence points (uniform |
| 105 | stride; first + last always preserved). |
| 106 | - Pathological: every step recorded NaN → `smoothness = 0.0`, |
| 107 | `instability_events = num_steps`, verdict WARN. |
| 108 | |
**Implementation:**
- `probes/training_drift.py` — spec, probe, JSONL parser,
  metric helpers, verdict mapping, downsampler. 365 LOC.
- `probes/__init__.py` — registers the new probe.

**Test surface:**
- `tests/unit/test_probe_training_drift.py` — 30 unit tests
  covering: skip paths (no store_path / no logs dir / no jsonl /
  too few steps), end-to-end with smooth & spiky curves,
  resume-deduplication, downsampling, corrupt-first-line ERROR,
  truncated-tail tolerance, pure-math metric helpers
  (smooth/constant/NaN/all-NaN/zero-initial), spike-detector
  heuristic (loss-up vs loss-down semantics, short curves,
  empty), verdict mapping, and JSONL parsing (filtering
  non-step records, missing keys, NaN encoding, missing files).
- `tests/fixtures/dlm_train_log_fixture.jsonl` — captured-from-disk
  shape: banner + delta record + 30 step records + run_complete.
  If this fixture's parse breaks, dlm's log format has shifted and
  the probe needs an update — the test catches that explicitly.

**README** gains a `training_drift` paragraph in "Pre-run
diagnostics" alongside `gradient_ghost`. The probe table at "Why
it exists" picks up `training_drift` and the previously-missed
`gradient_ghost` entry under Calibration.

### Sprint 30 — `multi_turn_coherence_decay` probe

Closes the P2 "multi_turn_coherence_decay probe" backlog item. Sway
had zero coverage of multi-turn behavior before this — every
shipped adherence probe was single-turn. Adapters that pass
`delta_kl` cleanly frequently degrade by turn 2 or 3 of real
dialogue, where the model's own previous responses enter the
context window and create compounding drift. The new probe is the
first that catches that failure mode.

**New probe (`kind: multi_turn_coherence_decay`, category: adherence).**

For each prompt the probe greedy-generates ft's turn-1 response,
then rolls a multi-turn synthetic dialogue with cycled generic
follow-ups (`Continue.`, `Tell me more.`, …). At each turn 2..N
both views see the same ft-grounded chat history; the probe
computes `KL(base || ft)` at the next-token position. The
per-turn KL series is fit to `kl = a · exp(-b · turn)` and
the probe reports `half_life_turns = ln(2) / b`.

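The fit above can be sketched as an ordinary least-squares regression on `log(kl)` vs. turn, assuming every per-turn KL is positive; `fit_half_life` is an illustrative name, not the shipped function:

```python
import math


def fit_half_life(turns: list[int], kls: list[float]) -> float:
    """Fit kl = a * exp(-b * turn) by least squares on log(kl),
    then return half_life_turns = ln(2) / b.  Assumes every kl > 0."""
    logs = [math.log(k) for k in kls]
    n = len(turns)
    mean_t = sum(turns) / n
    mean_y = sum(logs) / n
    # The slope of log(kl) against turn is -b for a clean exponential.
    slope = sum((t - mean_t) * (y - mean_y) for t, y in zip(turns, logs)) / sum(
        (t - mean_t) ** 2 for t in turns
    )
    return math.log(2) / -slope
```

On synthetic data generated from a clean exponential the recovered half-life matches `ln(2) / b` exactly (up to float error); flat or growing series need the separate `stable` / `non_monotonic` branches before this fit fires.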
Verdict ladder:
- `ok` — clean exponential decay; PASS when `half_life_turns ≥
  assert_half_life_turns` (default 2.0).
- `stable` — KL stayed within 0.1% relative spread across turns;
  the half-life is formally infinite; it is clipped to
  `max_turns × 10` and rendered with a "held coherence" message;
  always PASS.
- `non_monotonic` — KL grew turn-over-turn (atypical but
  possible); WARN with the curve in evidence.
- `degenerate` — KL ≈ 0 at every turn; FAIL with a "probable no-op
  adapter" diagnosis.

**No null calibration.** Mirrors `prompt_collapse`'s rationale: a
null adapter has no signal to decay, so the null distribution of
half-lives is meaningless. Fixed-threshold verdicts are the
published path.

**Chat-template requirement.** Multi-turn dialogue requires the
base's tokenizer to carry a `chat_template`. Bases without one
SKIP gracefully with a clear message. The probe consults
`ctx.backend._tokenizer.chat_template` — the same backdoor
`prompt_collapse` uses, for the same reason (it avoids broadening
the public scoring contract for one probe's needs).

Reports surface a tiny unicode sparkline of per-turn KL in the
verdict message so terminal-only readers see the curve shape
without opening the JSON.

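A sketch of the kind of unicode sparkline such a verdict message can carry; the function name and exact bar mapping are illustrative:

```python
BARS = "▁▂▃▄▅▆▇█"


def sparkline(values: list[float]) -> str:
    """Render a list of floats as a unicode sparkline, one bar per value."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # flat series: every value maps to the lowest bar
    return "".join(BARS[int((v - lo) / span * (len(BARS) - 1))] for v in values)
```

A decaying KL series renders as a descending staircase, so "looks exponential" is visible at a glance in the terminal.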
**Implementation:**
- `probes/multi_turn_coherence.py` — spec, probe, curve fit,
  verdict mapping, sparkline.
- `probes/__init__.py` — registers the new probe.

**Test surface:**
- `tests/unit/test_probe_multi_turn_coherence.py` — 22 unit tests
  covering skip paths, end-to-end with planted-distribution
  sequences (decreasing / flat / growing curves), fit math (clean
  exp / stable / growing / zero / partial-zero), verdict mapping,
  sparkline rendering, and chat-template detection.
- `tests/integration/test_probe_multi_turn_coherence.py` —
  slow+online HF smoke on a tiny SmolLM2-135M LoRA across 3
  dialogue turns. Exercises the full chat-template + turn-loop +
  curve-fit path on a real backend.

**README** gains a "Multi-turn coherence" section between
"Tool-use fidelity" and "Reproducing a sway run" with a worked
YAML example. The probe table at "Why it exists" picks up the
new entry under Adherence (plus `tool_use_fidelity` under
Attribution — the table missed S27's addition).

### Sprint 28 — `tenseleyflow/sway-action` GitHub Action

Closes the P2 "GitHub Action" discoverability item from Audit 03
D-path3. The action lives in its own repo
([tenseleyFlow/sway-action](https://github.com/tenseleyFlow/sway-action),
tagged `v0.1.0`) so consumers can pin via `uses:` without coupling
to sway's release cadence.

**End-user value:** three lines of YAML to wire sway into any GitHub
PR check, with edit-in-place PR comments + cached venv + standardized
verdict sticker. The audience is teams who don't use pre-commit but
live in GitHub Actions.

```yaml
- uses: tenseleyflow/sway-action@v0.1.0
  with:
    spec-path: sway.yaml
```

**Inputs / outputs.** Inputs cover the `fail-on` policy
(`fail`/`warn`/`never`), `comment-on-pr`, `upload-artifact`,
`sway-version` (defaults to the action's pinned version), and
`python-version`. Outputs `sway-score`, `verdict`, and `report-path`
are surfaced for downstream workflow steps.

**Action layout:**
- `action.yml` — composite action, 5 steps (setup-python → cache →
  entrypoint.sh → upload-artifact → github-script post-comment).
- `src/entrypoint.sh` — bootstraps a venv, installs `dlm-sway[hf]`
  pinned, runs `sway gate` + `sway report --format md`, and parses
  the score + verdict from the JSON report into `$GITHUB_OUTPUT`.
- `src/post-comment.js` — Node 20 script run via
  `actions/github-script@v7`. Finds prior sway comments via a hidden
  `<!-- sway-action:report -->` HTML marker and edits in place.
  Truncates at 60 KB with a footer pointing to the artifact for the
  full report.

**Self-tests** (`smoke.yml` in the action repo): an `actionlint` +
`shellcheck` + `node -c` lint job, an `entrypoint-stub` job that
validates the bash plumbing against a stubbed `dlm_sway` package, and
a `self-test` job that runs the action end-to-end against a tiny
random LoRA on SmolLM2-135M (PR-only — it needs PR context for the
comment-posting step).

**Sway README** gains a "GitHub Action" subsection between Pre-commit
and the `.dlm` integration with the copy-paste one-liner + a complete
workflow example.

### Sprint 27 — `tool_use_fidelity` probe

Closes the P1 "tool_use_fidelity probe" backlog item. This is the
probe sway ships for anyone fine-tuning an adapter against a
tool-using base — the dominant fine-tune target outside dlm-style
document training, and a failure mode no other shipped probe
catches.

**New probe (`kind: tool_use_fidelity`, category: attribution).**

For each `(prompt, tool_spec, gold_tool_name)` case the probe
greedy-decodes from base + ft and scores three independent signals:

- **JSON-schema validity delta** (`ft_valid_rate − base_valid_rate`):
  the gate that catches "ft broke the base's tool-call format". The
  default pass criterion tolerates a 5pp drop.
- **Tool-name hallucination rate**: the fraction of schema-valid ft
  calls whose `name` falls outside the declared `allowed_tools`
  surface (or differs from `gold_tool_name` per-case when no
  surface is set). Default cap 10%.
- **Argument-field disagreement rate** (informational): leaf-field
  drift between matched base/ft schema-valid calls. v1 surfaces it
  as evidence rather than gating on it — per-token KL on
  free-form argument values is deferred past v1 because it requires
  alignment between two decoded strings.

The probe greedy-generates from both views, parses each output via a
forgiving extractor (whole-text JSON → fenced ```json``` block →
first balanced `{...}` substring), then validates against an
OpenAI-flavored schema subset (`string`, `integer`, `number`,
`boolean`, `object`, `array` plus `required`). No `jsonschema`
dependency added — core sway dependencies stay lean.

`json_valid_rate_ft` is z-scored against the null-adapter baseline
when `null_adapter` is in the suite. The calibration spec ships two
sentinel tool-use cases so the null distribution carries useful
signal.

**Implementation modules:**
- **`probes/tool_use_fidelity.py`** — `ToolUseCase` /
  `ToolUseFidelitySpec` / `ToolUseFidelityProbe` plus the
  `_parse_tool_call` / `_matches_schema` / `_field_disagreement`
  helpers. 573 LOC.
- **`probes/__init__.py`** — registers the new probe.

**Test surface:**
- **`tests/unit/test_probe_tool_use_fidelity.py`** — 40 unit tests
  covering verdict logic (PASS / FAIL on validity / FAIL on
  hallucination / SKIP-no-cases), the parse helpers (whole-text /
  fenced / embedded / unbalanced / strings-with-braces), schema
  validation (required / type-tag / bool-not-int / extras-allowed),
  field disagreement (identical / drifted / nested / list-as-leaf /
  None-vs-absent), and the calibration handoff.
- **`tests/integration/test_probe_tool_use_fidelity.py`** —
  slow+online HF backend smoke on a tiny SmolLM2-135M LoRA. Verifies
  the full code path executes without error and emits every
  documented evidence key with rates in `[0, 1]`. Doesn't assert a
  specific verdict — the 135M base isn't tool-fluent enough to make
  the validity gate meaningful, only the plumbing.
- **`tests/fixtures/tool_use_cases.yaml`** — eight hand-authored
  cases spanning the most common tool shapes (search, file I/O,
  Python exec, shell, HTTP fetch, DB query, calendar, calculator).

**README** gains a "Tool-use fidelity" section between the `.dlm`
integration and "Reproducing a sway run" — it pitches the probe as
sway's signal for agentic fine-tunes, with a worked YAML example.

### Sprint 26 — X3 `sway pack` / `unpack` (sway-side half of the cross-repo X1+X3 pair)

Closes the X3 half of Audit 03's "make a sway run reproducible by a
coworker without recreating their environment" goal. The X1 half
(`dlm export --to sway-json`) lands in the dlm repo via a separate
PR; this changelog block ships the sway-only deliverable.

**New CLI commands.**

- **`sway pack <spec> [-o OUT] [--include-golden PATH]
  [--include-null-cache/--no-include-null-cache] [--max-size-mb MB]`** —
  bundles the spec + its `dlm_source` (when set) + cached null-stats
  entries + an optional last-known-good JSON report into a single
  `*.swaypack.tar.gz`. Default cap 50 MB; refuses to overwrite.
- **`sway unpack <pack> [-o DIR]`** — extracts into `<DIR>/swaypack/`,
  validates the manifest, and prints the ready-to-run `sway run`
  invocation, including the `SWAY_NULL_CACHE_DIR=...` env var that
  redirects null-stats lookups at the bundled cache.

**Pack format (`swaypack_version=1`).**

```
swaypack/
  manifest.json   # version + counts + packed_at + sway_version
  sway.yaml       # the spec, verbatim
  source.dlm      # spec.dlm_source content (when present)
  null-stats/     # one .json per cache key
  golden.json     # known-good run report (when --include-golden)
```

**Implementation modules:**
- **`cli/_pack.py`** — builds the tarball in-memory first so the
  size cap can refuse cleanly *before* writing a half-built pack
  to disk. Path-traversal-safe by construction (we author every
  arcname).
- **`cli/_unpack.py`** — uses `tarfile.extractall(filter='data')`
  to reject absolute paths / `../` escapes / device files (3.11
  compat). Validates `swaypack_version`; rejects mismatches with
  a clear message.

**Null-cache override env var.**

- **`probes/_null_cache._cache_root`** now honors
  `$SWAY_NULL_CACHE_DIR` if set (used verbatim — no
  `dlm-sway/null-stats` suffix). Order: env override → XDG cache →
  `~/.dlm-sway/null-stats`. The `sway unpack` CLI prints the exact
  invocation that wires this env var at the bundled cache.

**Tests.**

- **17 unit tests** in `tests/unit/test_pack_unpack.py`: round-trip
  identity, manifest fields, dlm_source bundling (present + missing
  + warning emit), golden bundling, every PackError + UnpackError
  branch (overwrite refusal, size cap, missing file, corrupt
  tarball, missing manifest, version mismatch, malformed root).
- **2 slow+online integration tests** in
  `tests/integration/test_pack_run_roundtrip.py`: pack → unpack →
  `sway run` produces an identical band + per-probe verdict +
  score; null-cache packing populates the unpack report's
  `null_stats_dir` pointer.
- 708 unit tests pass; mypy + ruff + format clean.

### Sprint 25 — P3 gradient_ghost probe (pre-run, cross-repo)

New zero-forward-pass diagnostic probe that loads dlm's
`training_state.pt` and flags severely-undertrained adapters
**before** any model load fires. Catches the 90%-case "user did
`--max-steps 5` for a smoke test and forgot to retrain" in ~50 ms,
vs. the 30+ s `adapter_ablation` probe that was previously the only
training-health signal.

**Probe + supporting modules.**

- **`probes/gradient_ghost.GradientGhostProbe`** — category
  `calibration`, `needs_backend = False`. Verdict ladder:
  `global_step < min_steps_threshold` → FAIL (primary signal); all
  per-param `exp_avg_sq` NaN → FAIL; the per-layer ratio against the
  *minimum* layer's mean (not the global mean — see the module
  docstring on asymptotic-cap reasoning) flags layers above
  `undertrained_layer_ratio`; more than `layer_failure_frac` of
  layers crossing → FAIL/WARN. Configurable thresholds:
  `min_steps_threshold=50`, `undertrained_layer_ratio=2.0`,
  `layer_failure_frac=0.3`.
- **`probes/_training_state.py`** — torch.load wrapper for
  `training_state.pt`. Frozen `TrainingStateSnapshot` + `ParamStat`
  dataclasses. Lazy torch import (no cost on `import dlm_sway` for
  non-gradient-ghost users). Suppresses the FutureWarning around
  `weights_only=False`, since dlm's pickled RNG state requires it.
- **`probes/_param_id_mapping.py`** — maps optimizer-state integer
  param-ids to transformer-layer indices via the adapter's
  `adapter_model.safetensors` keys. Heterogeneous per-layer counts
  raise `ParamMappingError` rather than silently mis-attributing.
  Uses `safetensors.safe_open` so we read keys without
  materializing tensors.
- **`core/errors.MissingTrainingStateError`** — typed exception so
  the probe can SKIP cleanly when `training_state.pt` legitimately
  isn't there (non-dlm adapters), distinct from a parse error.

**Probe-ABC + runner contract for pre-run probes.**

- **`probes/base.Probe.needs_backend: ClassVar[bool] = True`** —
  new opt-in flag. The default `True` preserves every shipped
  probe's contract.
- **`probes/base.RunContext.backend: DifferentialBackend | None`** —
  was required, now optional. A new `RunContext.require_backend`
  property narrows the type for mypy and raises a clear runtime
  error if a probe with `needs_backend=True` somehow gets a `None`
  backend.
- **`probes/*` sweep** — every existing probe now reads
  `ctx.require_backend.as_base()` instead of `ctx.backend.as_base()`.
  No behavior change; mypy-correct under the new optional type.
- **`suite/runner.run`** accepts `backend=None`. It pre-resolves all
  scheduled probes; if any has `needs_backend=True` while
  `backend=None`, it raises `BackendNotAvailableError` with the
  offending probe kinds in the message. Skips trace-writer install,
  the preflight finite-check, probe-label setting, and the
  backend-stats snapshot when the backend is absent.

**`sway check` integration.**

- **`cli/commands.check_cmd`** runs `gradient_ghost` first in the
  quick battery (along with `null_adapter` + `delta_kl` +
  `calibration_drift`). On FAIL, it emits a red ⚠️ banner before the
  regular verdict banner; on WARN, a yellow one. Informational —
  no exit-code change. The banner shows the probe's actionable
  message ("severely undertrained: global_step=N < threshold X").

**Tests.**

- **17 unit tests** (`tests/unit/test_probe_gradient_ghost.py`):
  registry / `needs_backend` flag / verdict-ladder branches
  (PASS / FAIL on global_step / FAIL on all-NaN / WARN / FAIL on
  many-layers / SKIP on missing) / `_param_id_mapping` correctness
  + error paths / `_training_state` loader edge cases / runner
  skip-backend contract.
- **3 integration tests** (`tests/integration/test_probe_gradient_ghost.py`,
  slow+online): real-store FAIL on `~/.dlm/store/01KPPFAB.../v0001`
  (skipped on machines without a local dlm install — typical CI),
  synthetic converged → PASS, runner end-to-end with backend=None.
- All 691 unit tests pass; mypy + ruff + format clean.

### Sprint 24 — F01 PEFT→MLX adapter converter

Closes the audit's #1 major finding: the README pitched MLX as a
co-equal backend, but the path required a pre-converted `.npz`
adapter that nothing in the toolchain produced. With this sprint,
`dlm train` → `sway run` on the MLX backend works end-to-end on any
PEFT-trained LoRA adapter.

**Converter (pure I/O, no torch dep).**

- **`backends/_mlx_convert.convert_peft_to_mlx`** — reads PEFT's
  `adapter_model.safetensors` + `adapter_config.json`, transposes
  LoRA matrices to MLX's layout, and writes `adapters.safetensors` +
  an mlx-lm-shaped `adapter_config.json`. Verified against
  PEFT >= 0.13 + mlx-lm 0.31.
- **Key remap.** PEFT's
  `base_model.model.<dotted>.lora_<A|B>.weight` becomes MLX's
  `<dotted>.lora_<a|b>`. Modern PEFT keys (no `.default` adapter-name
  segment) and legacy `.default.weight` keys are both supported.
- **Shape transpose.** PEFT `lora_A=(r, in)` → MLX `lora_a=(in, r)`;
  PEFT `lora_B=(out, r)` → MLX `lora_b=(r, out)`.
- **Config remap.** Writes `fine_tune_type=lora`, `num_layers`
  inferred from the max layer index in the keys, and
  `lora_parameters` carrying `rank`, `scale = alpha ÷ r`, `dropout`,
  and `keys` (per-layer-relative attribute paths like
  `self_attn.q_proj`).
- **Errors.** Missing files / non-LORA peft_type / invalid rank /
  unexpected key prefixes / dst-not-empty all surface as a typed
  `MlxConvertError` with actionable messages. `modules_to_save`
  tensors (e.g. `embed_tokens`, `lm_head` overrides) are skipped
  with a per-key warning rather than crashing.

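The key remap and shape transpose can be sketched as below, with illustrative helper names and plain nested lists standing in for tensors:

```python
import re

# Matches both modern PEFT keys and legacy ".default.weight" keys.
_KEY_RE = re.compile(
    r"^base_model\.model\.(?P<path>.+?)\.lora_(?P<mat>[AB])"
    r"(?:\.default)?\.weight$"
)


def remap_key(peft_key: str) -> str:
    """base_model.model.<dotted>.lora_A.weight -> <dotted>.lora_a"""
    m = _KEY_RE.match(peft_key)
    if m is None:
        raise ValueError(f"unexpected PEFT key: {peft_key}")
    return f"{m.group('path')}.lora_{m.group('mat').lower()}"


def transpose(mat: list[list[float]]) -> list[list[float]]:
    """PEFT lora_A=(r, in) -> MLX lora_a=(in, r);
    PEFT lora_B=(out, r) -> MLX lora_b=(r, out)."""
    return [list(row) for row in zip(*mat)]
```

An unexpected prefix raises instead of being silently dropped, mirroring the converter's typed-error posture.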
**CLI surface.**

- **`sway convert-adapter [--target mlx] SRC DST [--overwrite]`** —
  thin wrapper over the converter. Prints a before/after size +
  rank/scale report; surfaces `MlxConvertError` with a non-zero
  exit code; warns on skipped `modules_to_save` keys via stderr.

**MLX backend integration.**

- **`backends/mlx._ensure_mlx_adapter`** — auto-detect: if the
  adapter dir contains `adapter_model.safetensors`, run the
  converter into a content-hashed cache at
  `${XDG_CACHE_HOME:-$HOME/.cache}/dlm-sway/mlx-converted/<blake2b>/`,
  then point mlx-lm at the cache. If the dir already contains
  `adapters.safetensors`, pass through unchanged. Uses a 16-byte
  blake2b on the source safetensors bytes — repeat loads of the
  same adapter version short-circuit (~10 ms hash + dir lookup).
- **`backends/mlx._MLXView._forward_logits`** — adjacent fix:
  `out[0].astype(mx.float32)` before `np.asarray` so unquantized
  bf16/fp16 model outputs round-trip correctly. Pre-existing bug
  surfaced by the new e2e test against `mlx-community/SmolLM2-135M-Instruct`.

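The content-hash scheme can be sketched as follows (function and parameter names are illustrative):

```python
import hashlib
from pathlib import Path


def converted_cache_dir(adapter_safetensors: Path, cache_home: Path) -> Path:
    """Content-hashed cache slot for a converted adapter: a 16-byte
    blake2b over the source safetensors bytes, so repeat loads of the
    same adapter version short-circuit to a directory lookup."""
    digest = hashlib.blake2b(
        adapter_safetensors.read_bytes(), digest_size=16
    ).hexdigest()
    return cache_home / "dlm-sway" / "mlx-converted" / digest
```

Hashing content rather than the path means retraining the adapter in place produces a fresh cache slot, while renaming or copying the same weights hits the existing one.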
**Tests.**

- **`tests/unit/test_mlx_convert`** — 20 tests across:
  - Helper functions (`_strip_layer_prefix`, `_extract_layer_index`).
  - Happy path: synthetic PEFT adapter → MLX adapter, expected file
    layout, config shape, rank/scale math, key transpose, value
    preservation.
  - Error paths: missing safetensors / config, non-LORA peft_type,
    invalid rank, dst-not-empty without `--overwrite`, unexpected
    key prefix, `modules_to_save` skip-and-report.
  - Auto-convert detection: pass-through on an already-MLX dir,
    fresh convert on a PEFT dir, cache short-circuit,
    unrecognized-dir pass-through.
- **`tests/integration/test_mlx_converter_e2e`** — 4 darwin-arm64
  slow+online tests on the real `mlx-community/SmolLM2-135M-Instruct`:
  XDG cache populated by backend init, `next_token_dist` returns
  finite top-k via the converted adapter, `logprob_of` works, repeat
  load skips reconvert (mtime check). Skipped on non-darwin /
  missing `[mlx]` extra.

**README.**

- New "MLX backend (Apple Silicon)" section with the two install
  paths (auto-convert via `sway run` vs. explicit `sway convert-adapter`).
  Documents the cache location and out-of-scope items (QLoRA,
  `modules_to_save`).

### Sprint 23 — H1 batched backend execution

Opens the door to a 3-5× wall-time reduction on HF-backend suites by
amortizing `model.forward` kernel-launch + memory-transfer overhead
across prompt batches. The real speedup lands on the HF backend;
other backends gain a unified Protocol surface via a loop fallback.

**Protocol.**

- **`core/scoring.ScoringBackend`** — new method
  `next_token_dist_batch(prompts, *, top_k) -> list[TokenDist]`.
  The default implementation on the Protocol loops over
  `next_token_dist`; backends override it when they can amortize.
  Adding the method to a `runtime_checkable` Protocol is a
  contract change — all shipped backend views + the stub used by
  `test_scoring.py::test_scoring_backend_runtime_checkable`
  gained an explicit method implementation.

**HF backend — real batching.**

- **`backends/hf._HFView.next_token_dist_batch`** — calls
  `tokenizer(prompts, padding=True, padding_side="left")` and a
  single `model.forward` over the padded batch. Left-padding is
  mandatory for decoder-only LMs: the last-token position (what
  `next_token_dist` reads) must line up with the real end of each
  sequence, not with pad tokens.
- **`backends/hf._topk_to_token_dist`** — extracted helper shared
  by the single-prompt and batched paths so tail-mass accounting
  (B6) stays identical across both.
- **`backends/_instrumentation.BackendInstrumentation.cached_batch`**
  — routes batched calls through a per-prompt cache lookup before
  the forward fires, so cached + uncached prompts in one call still
  short-circuit per-entry. Callback contract:
  `compute_misses(miss_indices) -> list[result]` in miss order.
  Wall-time attribution divides the batch's time evenly across
  miss entries for `scoring_wall_s` bookkeeping.

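The callback contract can be sketched as below, with a plain dict standing in for the instrumentation cache; the real method also handles timing attribution and stats counters:

```python
from typing import Callable


def cached_batch(
    prompts: list[str],
    cache: dict[str, object],
    compute_misses: Callable[[list[int]], list[object]],
) -> list[object]:
    """Per-prompt cache lookup in front of one batched forward.
    compute_misses(miss_indices) must return results in miss order."""
    results: list[object | None] = [cache.get(p) for p in prompts]
    miss_indices = [i for i, r in enumerate(results) if r is None]
    if miss_indices:
        computed = compute_misses(miss_indices)
        if len(computed) != len(miss_indices):
            raise ValueError("compute_misses returned a wrong-length result")
        for i, value in zip(miss_indices, computed):
            cache[prompts[i]] = value  # future calls short-circuit per-entry
            results[i] = value
    return results
```

A second call that mixes one cached and one new prompt only sends the new prompt's index to the callback, so partially-warm batches still fire a single, smaller forward.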
**Instrumentation.**

- **`BackendStats`** — three new counters: `batches_sent`,
  `batched_prompts`, `max_batch_size`. The `avg_batch_size` property
  computes `batched_prompts / batches_sent` (0 when no batches
  fired). All surfaced in `to_dict()`.
- **`suite/report._cache_line`** — the report footer's
  `cache: N/M = P%` line now suffixes
  `| batches: K (avg=X.X)` when any batched forward fired. Runs
  without batching render the cache line alone — the pre-S23 footer
  shape is preserved.

**Probes — opt-in.**

- **`probes/base.Probe.batch_score: ClassVar[bool] = False`** —
  new flag. Probes set it to True when their scoring flows uniformly
  through `next_token_dist` on a prompt list.
- **`probes/delta_kl.DeltaKLProbe`** — `batch_score = True`;
  `run()` now issues one batched call per view (inside single
  `as_base` / `as_finetuned` contexts) instead of 2×N context
  entries with per-prompt calls. Divergences are computed via
  `zip(base_dists, ft_dists, strict=True)`.
- **`probes/cluster_kl.ClusterKLProbe`** — same treatment; same
  math.
- Other probes (section_internalization, leakage,
  paraphrase_invariance, adapter_ablation, prompt_collapse,
  multi-turn-style probes) keep `batch_score=False` — each needs
  bespoke batching logic, deferred to a follow-up sprint.

**Dummy / MLX / API backends.**

- **`backends/dummy._DummyView.next_token_dist_batch`** — loops
  over `self.next_token_dist` (not `_compute_next_token_dist`) so
  subclass overrides (`_NullView`'s seeded-noise perturbation,
  `_InterpolatedView`'s lam-blend) fire correctly on the batched
  path. The dummy has no real forward to amortize; batching
  counters stay at zero on this backend by design.
- **`backends/mlx._MLXView.next_token_dist_batch`** — per-prompt
  loop stub with a `TODO: mx.array padded forward` docstring.
  MLX's per-prompt forward on Apple Silicon is already fast
  enough that the kernel-launch amortization a real batched
  forward would buy is small relative to the HF CUDA/MPS case.
- **`backends/api.ApiScoringBackend.next_token_dist_batch`** —
  per-prompt loop; OpenAI-compat HTTP servers score one
  completion per request, and httpx connection pooling already
  amortizes the network cost.

**Tests.**

- **`tests/unit/test_backend_instrumentation`** — new
  `TestBackendInstrumentationCachedBatch` class (6 tests) pinning
  the all-miss / partial-cache-hit / all-cached /
  wrong-length-return / max_batch_size / empty-prompts branches.
  `TestBackendStats` gained 2 tests for the new counters + the
  `avg_batch_size` property.
- **`tests/unit/test_batched_backend_s23`** — new file with 5
  tests: the `batch_score` flag set on `DeltaKLProbe`, a probe-level
  spy that confirms the batched method is called exactly once
  per view, byte-identical results vs. serial, report-footer
  batches-segment conditional rendering, and the empty-prompts
  short-circuit.

## 0.1.0 — 2026-04-24

First PyPI release. Alpha — the API is not guaranteed stable until
v1.0. Rolls up every sprint through Sprint 22; earlier sprints
accumulated under the rolling Unreleased header pre-release.

### Sprint 22 — v0.1.0 release + F06 dlm-compat test + Docker image

First PyPI publish. Version bumped from `0.1.0.dev0` to `0.1.0`.
Pre-commit consumer `rev:` pins switch from a commit SHA to `v0.1.0`.
Closes Audit 03 F05 (SHA-pin resolves at tag) and F06 (dlm API
compatibility regression test).

**D-path1 — PyPI publish.**

- **`pyproject.toml`** — version `0.1.0`. The `Development Status ::
  3 - Alpha` classifier was already present.
- **`src/dlm_sway/__init__.py`** — `__version__ = "0.1.0"`.
- **`README.md`** — a real `pip install "dlm-sway[hf]"` recipe
  replaces the "Planned PyPI install (not live yet)" placeholder.
  The pre-alpha banner is swapped for an explicit Alpha +
  semver-from-v1.0 disclaimer.

**F06 — dlm API compatibility regression test.**

- **`core/errors.DlmCompatError`** — new typed exception. Raised
  when dlm's public surface drifts out of the contract sway's
  resolver depends on (currently: `dlm.base_models.resolve(key).hf_id`).
  The message includes the installed dlm version + a pip-install hint.
- **`integrations/dlm/resolver`** — the `except Exception: return
  base_model` silent fallback is gone. `resolve()` raising or
  returning an object without `hf_id` now raises `DlmCompatError`
  (chained via `__cause__` for tracebacks). An extra
  `_installed_dlm_version()` helper introspects
  `importlib.metadata` for the error message.
- **`tests/unit/test_dlm_bridge`** — 2 new regression tests pin the
  two drift branches (missing `hf_id`; `resolve` itself raising).
- **`tests/integration/test_dlm_api_compat`** — new slow+online+dlm
  integration test that iterates dlm's registry keys and asserts
  every entry resolves to a spec with a plausible `hf_id`.
  Gracefully skipped (via `pytest.importorskip`) when the `[dlm]`
  extra isn't installed, so CI without dlm published to PyPI still
  passes.
- **`pyproject.toml`** — the `[dlm]` extra is pinned to
  `dlm>=0.9,<1.0`. The upper bound tightens to the range the compat
  test has validated; bump it when dlm cuts v1.0 and the contract
  is re-verified.

**D-path2 — Docker image (`sway-gate`).**

- **`Dockerfile.gate`** — `python:3.11-slim` + `dlm-sway[hf,semsim]`
  at the build-time `SWAY_VERSION` ARG. Pre-fetches the MiniLM
  weights (`sentence-transformers/all-MiniLM-L6-v2`, ~80 MB) so the
  `adapter_revert` / `cluster_kl` probes don't cold-download on the
  first gate run. `ENTRYPOINT ["sway"]` — the pre-commit hook passes
  `gate` as the first container arg.
- **`.github/workflows/docker.yml`** — runs on `v*` tag push +
  manual dispatch. Builds on GHA, pushes to
  `ghcr.io/tenseleyflow/sway-gate:{vX.Y.Z, latest}`. Smoke-tests
  the pushed image with `--version`.
- **`.pre-commit-hooks.yaml`** — new `sway-gate-docker` variant
  using `language: docker_image` pointing at
  `ghcr.io/tenseleyflow/sway-gate:v0.1.0 gate`. A third option
  alongside `sway-gate` (system PATH) and `sway-gate-isolated`
  (fresh venv).

**F05 closure.**

- **`.pre-commit-hooks.yaml`** — the isolated variant's
  `additional_dependencies` migrates from
  `dlm-sway[hf] @ git+https://…@<SHA>` to
  `dlm-sway[hf]==0.1.0`. The consumer `rev:` in
  `.pre-commit-config.yaml` pins to `v0.1.0`. SHA churn over.

**README.**

- Real `pip install` recipes for every extra (`[hf]`, `[dlm]`,
  `[all]`, …). "Install from source" kept for the contributor
  workflow.
- Pre-commit section: three hooks (system / isolated / docker)
  with a first-run-cost comparison table.

### Sprint 21 — Audit 03 closure

Closes the Audit 03 short list — 2 🟠 major + 5 🟡 minor findings the
audit recommended for v0.1.0 readiness. Full audit at
`.docs/audits/03-final-audit.md`. The verdict was YELLOW-leaning-GREEN
with zero critical findings; this sprint pushes it to GREEN on the
in-scope items. Deferred: F01 (MLX converter — its own feature
sprint, S24), F05 (resolves naturally at the v0.1.0 tag), F06 (dlm API
compat test — lands with the v0.1.0 release in S22). The `v0.1.0` PyPI
publish is deferred to S22 (requires release coordination +
credentials).

**🟠 Major — degenerate z-score clipping (F02).**

- **`probes/null_adapter`** — null stats now carry an explicit
  `degenerate: 1.0` field when the calibration ran but produced an
  unusable baseline (`runs: 1`, or every seed producing the *exact*
  same raw). The std floor at `1e-6` is preserved for valid-but-tight
  multi-seed nulls so they still calibrate. The audit observed a
  `+290,766σ` leakage-probe z under `runs: 1`; post-fix, `z_score`
  refuses and the probe falls back to fixed thresholds with a clear
  footer rollup.
- **`probes/_zscore.z_score`** — refuses when `stats["degenerate"]`
  is truthy, independent of the std check. Belt-and-suspenders on
  top of the existing `MIN_STD` guard.
- **`suite/report.collect_degenerate_null_kinds`** — new rollup
  surface. Terminal + markdown footers gain an
  "N probe kind(s) had a degenerate null baseline — bump `runs:`"
  block, distinct from the existing null-opt-outs rollup.
- 12 new unit tests across `test_null_calibration.py`,
  `test_zscore_helpers.py`, `test_report_extras_rollup.py`, and
  `test_probe_external_perplexity.py`. The existing
  `test_std_floor_prevents_runaway_zscore` was rewritten to assert
  the fixed behavior (the pre-fix contract WAS the audit's bug).

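The refusal logic can be sketched like this (the function names and the stats-dict shape are illustrative, not the actual dlm-sway code):

```python
# Hedged sketch: an explicit degenerate flag short-circuits before any
# std-based guard, so a runs:1 null can never mint a huge z-score.
MIN_STD = 1e-6

def z_score(raw, stats):
    """Return a calibrated z, or None when the null baseline is unusable."""
    if stats.get("degenerate"):          # the flag beats any std check
        return None
    std = max(stats["std"], MIN_STD)     # floor valid-but-tight nulls
    return (raw - stats["mean"]) / std

print(z_score(5.0, {"mean": 0.1, "std": 0.0, "degenerate": 1.0}))  # None
print(round(z_score(5.0, {"mean": 0.1, "std": 2.0}), 2))           # 2.45
```
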
**🟡 Minor — CI resilience (F03).**

- **`pytest-timeout>=2.3`** + **`tenacity>=9.0`** in dev deps.
- **`tests/fixtures/tiny_model.py`** wraps `snapshot_download` with an
  exponential-backoff tenacity retry (3 attempts, 5-10-20s backoff)
  + `etag_timeout=10` to bound per-file head probes. Benefits every
  slow+online test.
- **`tests/integration/test_determinism_golden.py`** gains
  `@pytest.mark.timeout(600)`. A silent network hang now surfaces
  as a test failure with actionable output (the audit observed a 20m
  workflow-timeout hang on the Sprint 19 merge, run 24747915467).

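The shipped fixture leans on tenacity; the same 3-attempt, 5-10-20s shape can be sketched with the stdlib alone (the sleep function is injected so the example runs instantly, and `flaky_download` is a stand-in, not the real hub call):

```python
# Pure-stdlib sketch of exponential-backoff retry: 3 attempts,
# sleeping 5s then 10s between them, re-raising on the last failure.
import time

def with_retry(fn, attempts=3, base_delay=5.0, sleep=time.sleep):
    for attempt in range(attempts):
        try:
            return fn()
        except OSError:                       # network-ish failures only
            if attempt == attempts - 1:
                raise                         # out of attempts: surface it
            sleep(base_delay * 2 ** attempt)  # 5s, then 10s

calls = {"n": 0}
def flaky_download():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("transient hub hiccup")
    return "snapshot-path"

delays = []
print(with_retry(flaky_download, sleep=delays.append))  # snapshot-path
print(delays)  # [5.0, 10.0]
```
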
**🟡 Minor — outlier-miner pool guard (F04).**

- **`mining/outlier_miner.mine_outliers`** — raises `SwayError` when
  the pool has fewer than `2·top_k` distinct scored prompts, with
  an actionable `--top-k N` hint. Pre-fix: a 1-distinct-prompt pool
  produced `top=[p], bottom=[p]` (identical lists — no outlier
  contrast). The guard applies AFTER scoring so unsupported probe
  kinds still return the empty-result path.

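A minimal sketch of the guard; `SwayError` is from the changelog, everything else here is assumed:

```python
# Hedged sketch: with fewer than 2*top_k distinct prompts, the top-K
# and bottom-K lists would overlap, so refuse with an actionable hint.
class SwayError(Exception):
    pass

def guard_pool(scored_prompts, top_k):
    distinct = set(scored_prompts)
    if len(distinct) < 2 * top_k:
        raise SwayError(
            f"need >= {2 * top_k} distinct scored prompts, got "
            f"{len(distinct)}; pass a smaller --top-k or grow the pool"
        )

guard_pool(["a", "b", "c", "d"], top_k=2)   # ok: 4 distinct >= 4
try:
    guard_pool(["p", "p", "p"], top_k=1)    # 1 distinct: top/bottom collide
except SwayError as exc:
    print(exc)
```
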
**🟡 Minor — autogen skipped-probes comment (F07).**

- **`integrations/dlm/autogen.collect_skipped_probe_reasons`** — new
  public helper returning `(probe_kind, reason)` tuples for probes
  `_build_suite` intentionally omitted for the given handle.
- **`_render_annotated_yaml`** — when `skipped` is non-empty, the header
  gains a `# skipped: <kind> (<reason>)` block. Users no longer have
  to diff the autogen source to understand which probes are missing
  from their generated `sway.yaml`.

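The annotated header could look roughly like this (probe names and reasons invented for illustration):

```yaml
# generated by sway autogen
# skipped: cluster_kl (prompt pool below 20 entries)
# skipped: leakage (handle exposes no training corpus)
probes:
  - kind: delta_kl
```
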
**🟡 Minor — CI Node.js 20 deprecation silence (F08).**

- **`.github/workflows/ci.yml`** sets
  `FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: "true"` at the workflow
  level. Silences the Node 20 deprecation warnings each CI log
  currently carries (deadline June 2026). Temporary until every
  action we pull bumps to Node 24 natively.

**🟡 Minor — portable `dlm_source` (F09).**

- **`integrations/dlm/autogen._portable_dlm_source`** emits a
  cwd-relative path when the `.dlm` lives inside the cwd, absolute
  otherwise. The cwd-relative form survives cross-machine checkout
  (CI agents, other devs); the old absolute-path emission broke the
  moment a committed autogen'd YAML was run on a different host.

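The relative-else-absolute rule can be sketched as follows (only the behavior is from the changelog; the function body and paths are illustrative):

```python
# Hedged sketch: emit a cwd-relative path when the .dlm lives under
# the cwd, so committed YAML survives checkout on another machine;
# fall back to absolute otherwise.
from pathlib import Path

def portable_dlm_source(dlm_path: Path, cwd: Path) -> str:
    dlm_path = dlm_path.resolve()
    try:
        return dlm_path.relative_to(cwd.resolve()).as_posix()
    except ValueError:          # lives outside the cwd: stay absolute
        return str(dlm_path)

print(portable_dlm_source(Path("/repo/models/run.dlm"), Path("/repo")))
# models/run.dlm
print(portable_dlm_source(Path("/scratch/run.dlm"), Path("/repo")))
# /scratch/run.dlm
```
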
**Other.**

- **640 unit tests pass** (up from 626 before S21 — 14 new
  regression tests across F02/F03/F04/F07/F09).
- **Pragmatic deviations from original scope:** S21 was audit-scoped
  to ~2 days of cleanup. Actual depth: 6 findings closed with 14
  new regression tests. One existing test
  (`test_std_floor_prevents_runaway_zscore`) was explicitly rewritten
  to flip its assertion — the pre-fix contract the test encoded WAS
  the audit's reported bug.

### Sprint 19 — Pre-commit hook `sway gate`

Closes the Audit 01 stretch-list F-item "pre-commit hook sway gate." Ships
a `.pre-commit-hooks.yaml` declaring two hook variants so
[pre-commit.com](https://pre-commit.com) users can gate their adapter
on every commit that touches a spec, `.dlm` file, or adapter directory.

- **Two hooks, pick your posture.**
  - `sway-gate` — `language: system`. Uses the sway install on the
    user's `PATH`. Fast, zero install cost. Recommended default.
  - `sway-gate-isolated` — `language: python` + git-based
    `additional_dependencies` (`dlm-sway[hf] @ git+...@<SHA>`).
    Self-contained — pre-commit builds a fresh venv and installs
    sway + torch + transformers on first run (~5 GB, ~2 min).
- **Rev pinning stance.** The example config pins to a commit SHA, not
  `HEAD`. Sway is pre-v0.1.0 — no tagged release yet — and `HEAD`
  drifts silently on every `pre-commit autoupdate`. SHA pinning is
  the honest pre-release pattern; migration to `rev: v0.1.0` is a
  README edit once the first release lands.
- **`pass_filenames: false` on both variants.** `sway gate` takes
  exactly one spec path; the user declares it via `args:`. The
  `files:` regex decides when the hook fires. No ambiguity.
- **Scope discipline.** Neither hook surfaces `--json` or
  `--markdown` flags. The hook gates (non-zero on FAIL) and nothing
  else. Users wanting report artifacts run `sway run` separately.
- **README "Pre-commit" section** documents the consumer-side config
  for both variants, the SHA-pinning rationale, and the one-time
  install cost of the isolated variant.
- **`examples/precommit-example/`** — template `sway.yaml`,
  consumer-side `.pre-commit-config.yaml`, and a README walk-through.
- **`tests/integration/test_pre_commit_hook.py`** (slow + online) —
  three cases: (a) parse-and-shape smoke test on
  `.pre-commit-hooks.yaml`; (b) pass-case runtime — spawns
  `pre-commit run sway-gate` as a subprocess against a tmp git repo
  with a real LoRA on SmolLM2-135M, asserts exit 0; (c) fail-case —
  same fixture with an impossible `assert_mean_gte`, asserts
  non-zero exit + a `gate FAILED` banner.
- **`pre-commit>=3.8`** in `[dependency-groups].dev` so the
  integration test can invoke it as a subprocess. Keeps the tool
  out of the user's runtime deps.

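Consumer-side wiring, sketched from the description above; the repo URL, SHA, and `files:` regex are placeholders:

```yaml
# .pre-commit-config.yaml (consumer side, hypothetical values)
repos:
  - repo: https://github.com/tenseleyflow/dlm-sway   # URL assumed
    rev: <commit-sha>        # pre-v0.1.0: pin a SHA, not HEAD
    hooks:
      - id: sway-gate        # system-PATH variant
        args: ["sway.yaml"]  # gate takes exactly one spec path
        files: ^(sway\.yaml|adapters/|.*\.dlm)$
```
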
### Sprint 18 — Cross-platform determinism golden

Closes the Audit 01 stretch-list F-item "cross-platform determinism golden
test." The README's "deterministic on CPU where possible" claim held
in theory; now it's pinned by CI on two platforms.

- **New module** `src/dlm_sway/core/golden.py`. Tolerance-aware JSON
  comparator with `compare_goldens(actual, expected, *, logprob_tol=1e-6,
  score_tol=1e-4)` and `mask_variable_fields` for stripping
  timestamps, wall-seconds, `duration_s`, `backend_stats`,
  `sway_version`, and cwd-resolved path identifiers (`adapter_id`,
  `base_model_id`) before comparison. No torch dep; runs in the fast
  lane.
- **New integration test** `tests/integration/test_determinism_golden.py`
  (slow + online). Builds a deterministically-seeded LoRA on
  SmolLM2-135M, runs a minimal 2-probe suite (delta_kl +
  calibration_drift), and diffs the JSON output against
  `tests/golden/expected_<platform>.json`. `SWAY_UPDATE_GOLDENS=1`
  toggles regen mode; a missing golden → SKIP with a regen recipe.
- **New CI matrix** `determinism-golden` with
  `strategy.matrix.os: [ubuntu-latest, macos-latest]` in
  `.github/workflows/ci.yml`. Triggered by changes under
  `tests/golden/**`, `src/dlm_sway/core/golden.py`, or the test file
  itself (plus the standard schedule/dispatch/push triggers).
- **`workflow_dispatch` regen mode** — dispatching the CI workflow
  with `regenerate_goldens=true` flips `SWAY_UPDATE_GOLDENS=1` on both
  matrix legs and uploads the regenerated JSONs as per-platform
  artifacts. Meant for deliberate "yes, I changed the algorithm"
  flows.
- **Tolerance rationale** (documented in `core/golden.py`): `1e-6` for
  logprob-like numeric fields sits above typical BLAS-implementation
  drift (the `1e-8`–`1e-7` band between OpenBLAS on linux and Accelerate
  on darwin) but below real algorithm-change drift. `1e-4` for score
  fields absorbs composition noise.
- **25 new unit tests** for the comparator covering mask coverage,
  tolerance thresholds, structural diffs (missing keys, length
  mismatches, type mismatches), NaN/inf edge cases, and a realistic
  two-masked-payloads round-trip.
- **Pragmatic scope deviation**: the sprint envisioned a checked-in
  5 MB adapter binary. Instead the test builds it from a fixed seed
  at runtime (same pattern as `test_external_perplexity_e2e` and
  `test_cluster_kl_e2e`). Avoids the regeneration chore noted in the
  sprint's risks section.
- **Linux golden bootstrap**: the first PR ships `expected_darwin.json`
  only; the linux leg SKIPs with a recipe pointing at the
  `workflow_dispatch` regen mode. The maintainer dispatches, downloads
  the artifact, commits `expected_linux.json`, and the next CI run
  asserts cleanly on both platforms. One-time onboarding cost.

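The two-tolerance idea behind the comparator, reduced to a single scalar check (the real `compare_goldens` also walks nested structure and masks variable fields first; the key-matching rule here is an assumption):

```python
# Hedged sketch: logprob-like fields tolerate BLAS-level drift (1e-6),
# score fields absorb composition noise (1e-4).
def close_enough(key, actual, expected, *, logprob_tol=1e-6, score_tol=1e-4):
    tol = logprob_tol if "logprob" in key else score_tol
    return abs(actual - expected) <= tol

print(close_enough("mean_logprob", -4.1000001, -4.1000003))  # True: ~2e-7
print(close_enough("mean_logprob", -4.10, -4.11))            # False: real drift
print(close_enough("score", 0.71234, 0.71229))               # True: 5e-5 < 1e-4
```
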
### Sprint 17 — Adversarial paraphrase mining + outlier-prompt miner

Closes Audit 01 innovation item F11. Adds `sway mine` — an evaluation
companion to `paraphrase_invariance` that surfaces the paraphrases a
memorizing adapter most reliably fails on, plus a secondary mode that
ranks `delta_kl` prompts by per-prompt divergence.

**`sway mine --mode paraphrase`** — per `paraphrase_invariance` case:

1. Generate candidate paraphrases via nlpaug `SynonymAug` (WordNet,
   deterministic under a fixed seed; back-translation is a user-supplied
   escape hatch to keep the default fast and offline).
2. Embed candidates through the shared MiniLM cache (the same 80 MB load
   `adapter_revert` pulls) and greedy farthest-first select the top-K
   most pairwise-distant ones — dodging nlpaug's tendency to emit
   near-duplicate synonym swaps.
3. Rank by per-token lift gap:
   `(ft(prompt, gold) - base(prompt, gold)) - (ft(candidate, gold) - base(candidate, gold))`.
   A large positive gap = the candidate breaks the adapter's lift = the
   adapter doesn't generalize there.
4. Emit `sway-mined-paraphrase.yaml` with a `mined_cases:` block
   paste-compatible with `paraphrase_invariance.cases`.

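Step 2's greedy farthest-first selection, sketched over plain 2-D points instead of MiniLM embeddings:

```python
# Greedy farthest-first: repeatedly pick the candidate farthest from
# its nearest already-chosen point, so near-duplicates are skipped.
def farthest_first(points, k):
    def dist(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    chosen = [0]                               # seed with the first candidate
    while len(chosen) < k:
        best = max(
            (i for i in range(len(points)) if i not in chosen),
            key=lambda i: min(dist(points[i], points[j]) for j in chosen),
        )
        chosen.append(best)
    return chosen

# Three near-duplicates around (0, 0) plus one distant point: the
# distant one is picked second, dodging the near-duplicate cluster.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
print(farthest_first(pts, k=2))  # [0, 3]
```
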
**`sway mine --mode outliers`** — rank a prompt pool by per-prompt
`delta_kl.raw`. The pool defaults to the spec's own `delta_kl` prompts;
`--from-corpus public_domain_en` draws from S09's CC0 corpus instead.
Emits top-K and bottom-K blocks: the highest-divergence prompts (best
for tightening a gate) and the lowest-divergence ones (candidates to
drop from the suite). `leakage` and `paraphrase_invariance` have
section/case-based specs; outlier mining on those is future work,
documented inline in `mining/outlier_miner.py`.

- **New package** `src/dlm_sway/mining/` — two modules,
  `paraphrase_miner.py` and `outlier_miner.py`, plus a thin
  `corpus_prompts` helper for the `--from-corpus` path.
- **New CLI subcommand** `sway mine SPEC --mode paraphrase|outliers
  [--out FILE] [--top-k N] [--n-candidates N] [--from-corpus NAME]
  [--seed N]`. Paste-compatible YAML emission by default; `--out`
  overrides the filename.
- **18 new unit tests** — 7 for the paraphrase miner (ranker,
  diversity filter, dedup, input validation), 8 for the outlier
  miner (delta_kl ranking, corpus wiring, unsupported-kind skip), 3
  for the CLI (paraphrase + outliers modes + the no-prompts error path).
- **Prove-the-value test** at
  `tests/unit/test_paraphrase_miner_prove_value.py`. On a
  deliberately-memorizing dummy backend the hand-written paraphrase
  list passes with `generalization_ratio > 0.5`; substituting the
  mined list (same seed, same adapter) drops the ratio below 0.5
  and flips the verdict to FAIL, with a ≥ 0.3 ratio gap.
- **Dependency footprint**: no new extras. The paraphrase miner
  uses nlpaug (already in `[style]`) + sentence-transformers (in
  `[semsim]`); the diversity filter reuses the embedder
  `adapter_revert` already pulls. Graceful `BackendNotAvailableError`
  with a pip hint when the extras aren't installed.

### Sprint 20 — Audit 02 closure

Closes the full finding inventory from `.docs/audits/02-followup-audit.md`
(1 🔴 critical, 8 🟠 major, 11 🟡 minor, 4 stronger-test opportunities, 3
dead-code items). Sixteen sprints of innovation work had accumulated
since Audit 01; the re-audit surfaced one ship-blocker (S14's bootstrap
CI dropped at the runner boundary) plus a grab-bag of claim-vs-code
gaps, dead branches, and untested contracts.

**🔴 Critical — ship-blocker.**

- **F01** — `suite/runner.py:_with_duration` now forwards every
  `ProbeResult` field. The pre-fix version silently dropped `ci_95`
  on every probe, making S14's headline deliverable runtime-inert in
  any real `sway run`. A regression test at the runner boundary + a
  snapshot fixture carrying a populated `ci_95` pin the fix.

**🟠 Major — claim-vs-code gaps.**

- **F02** — real `sklearn.cluster.KMeans` exercised by two new unit
  tests (separation + seed determinism) + a slow+online integration
  test at `tests/integration/test_cluster_kl_e2e.py`. Before: every
  cluster_kl test monkeypatched `_kmeans_cluster` with an argmax stub,
  leaving S16's reason-for-being unverified.
- **F03** — `sway list-probes` falls back to the defining module's
  `__doc__` when a probe class has no docstring. Every one of the 13
  shipped probes now renders a summary row.
- **F04** — `sway doctor` probes `plotly` (the load-bearing `[viz]` dep),
  `sklearn` (S16 cluster_kl), and the `api` extras (`httpx`, `tenacity`).
- **F05** — the README primitives table reflects 13 probes (adherence +
  cluster_kl, calibration + external_perplexity, baseline row for
  null_adapter).
- **F06** — `TwoModelDifferential` composes `safe_for_concurrent_views`
  from its two inner backends. Before: the attribute was missing and
  the runner defaulted to `False` even when both inners set `True`,
  breaking S13's concurrency claim at the only supported
  differential-use pattern.
- **F07** — `integrations/dlm/autogen` emits `cluster_kl` when the
  prompt pool clears 20 entries; the markdown report gains a per-probe
  "Cluster breakdown" section with per-cluster mean KL + exemplars.
- **F08** — `backends/dummy._NullView` RNG uses `hashlib.md5` instead
  of Python's PYTHONHASHSEED-salted `hash()`. A cross-process
  determinism test at `tests/unit/test_cross_process_determinism.py`
  pins the fix.
- **F09** — trace writer ↔ analyzer round-trip test at
  `tests/unit/test_runner_backend_stats.py`: runs a suite with
  `trace_path=`, loads the file back, and asserts probe labels +
  hit/miss counts match `backend_stats`.

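The F08 fix rests on a simple property worth pinning: `hash()` of a string is salted per process via `PYTHONHASHSEED`, while an md5-derived seed is byte-stable everywhere. A sketch of the idea (the real `_NullView` wiring is not shown):

```python
# Hedged sketch: derive a deterministic RNG seed from a prompt using
# hashlib.md5, which is stable across processes and platforms.
import hashlib

def stable_seed(prompt):
    digest = hashlib.md5(prompt.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")  # same bytes in, same seed out

a = stable_seed("the same prompt")
b = stable_seed("the same prompt")
print(a == b)                               # True, regardless of PYTHONHASHSEED
print(stable_seed("another prompt") == a)   # False
```
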
**🟡 Minor — dead code, doc drift, latent brittleness.**

- **F10** — removed the dead `_ft_view` branch in `probes/prompt_collapse.py`.
- **F11** — the pytest plugin resolves spec paths against `config.rootpath`,
  not the process cwd.
- **F12** — `backends/api._post_completions` delegates to tenacity's
  `Retrying()` callable; removes the unreachable post-loop fallback.
- **F13** — the `preference_flip` WARN branch emits `ci_95` and `z_by_rank`
  for consistency with every other numeric probe's WARN shape.
- **F14** — the `adapter_ablation` module docstring documents why `ci_95`
  renders as an em-dash (it's a curve-fit, not a sample-mean aggregator).
- **F15** — the report footer surfaces `null_adapter` opt-outs (probes with
  `calibrate_spec=None`) in both terminal and markdown surfaces.
- **F16** — report category ordering derives from
  `DEFAULT_COMPONENT_WEIGHTS` and accepts unknown categories from
  custom `Probe` subclasses.
- **F17** — the `cluster_kl` degenerate zero-variance case now returns
  `Verdict.WARN` with `evidence["degenerate_zero_variance"]=True` and
  suppresses the z-score to avoid a spurious small-sample-noise
  calibration.
- **F18** — `_calibration_pack.py` gains per-section provenance notes
  (the F18 audit trail) without noisy per-item annotations.
- **F19** — the pytest plugin defers `dlm_sway.core.result` /
  `dlm_sway.core.errors` imports to call sites so non-`@pytest.mark.sway`
  users don't pay the load tax.
- **F20** — `sway check --help` documents the σ-banner's null-calibration
  dependency (fall-through to the composite score band when the null
  SKIPs).

**Stronger-test opportunities — regression tests for things we weren't
pinning.**

- **#9** — `probes/_divergence` rejects effectively-uniform TokenDists
  (spread < 1e-9) via `_check_non_degenerate_token_dist`. Catches a
  shape-broken lm_head that would otherwise compute a trivial constant
  divergence across prompts. Two compatibility touchups rode with the
  change: the dummy backend's synthesized `ft` dist and the cluster_kl
  test fixture `_dist_broad` gained a tiny monotonic perturbation to
  clear the guard (real models never produce bit-uniform logits thanks
  to fp32 accumulation noise).
- **#10** — cross-verdict consistency test
  (`tests/unit/test_cross_verdict_consistency.py`): runs the same spec
  through `sway run`, `sway gate`, and `sway report --format junit` and
  asserts identical per-verdict tallies.
- **#11** — `sway doctor --json` schema-shape snapshot test locks the
  top-level keys and the per-extra module-name set.
- **#12** — subprocess determinism test asserts `_NullView.next_token_dist`
  is byte-identical across `PYTHONHASHSEED` values; pins F08's fix.

**Dead-code inventory — DC3–DC5.**

- **DC3** — `null_adapter.py`'s pre-S10 cache-promotion branch is
  annotated as legacy; removal is queued for the next minor.
- **DC4** — dropped the stderr warning at `suite/runner.py` that fired
  whenever `concurrent_probes > 1` even though execution never fanned
  out. The spec field is still accepted (a future pool will light it
  up); the noisy warning is gone.
- **DC5** — `tests/unit/test_model.py` grows coverage of `ModelSpec`'s
  dtype enum, `endpoint`, `trust_remote_code`, and the `custom`/`api`
  `kind` branches.

Final state: 582 unit tests + 3 integration tests passing; mypy strict
clean across 55 source files; ruff + format clean across 137 files.
Sprint 20 is a single PR composed of per-finding, per-file commits so
the history reads as a fix-by-fix cleanup.

### Sprint 16 — Cluster-coherent KL probe

Closes Audit 01 innovation item F8. Adds a new `cluster_kl` probe that
answers the question `delta_kl` can't: the mean divergence may be the
same, but did the adapter shift the *right* topics, or is it a uniform
blunt-instrument shift?

- **New probe `cluster_kl`** (category: adherence). Embeds prompts via
  the shared MiniLM (same cache key as `adapter_revert`), k-means
  clusters them at a fixed seed, measures per-prompt JS divergence
  between base and ft, and reports a **specificity ratio**:
  `between_variance / (between_variance + within_variance)`. Range
  `[0, 1]`: `≈ 0.5` on a blunt adapter that shifts every topic by the
  same amount, `→ 1.0` on a topic-targeted adapter. The pair
  `(mean_kl, specificity)` tells a more honest story than either number
  alone.
- **Bootstrap CI** on specificity. Resamples `(divergence, cluster_label)`
  pairs with replacement and takes the 2.5/97.5 percentiles of the
  bootstrap ratio distribution. Returns `None` below 4 prompts —
  matching the convention `core.stats.bootstrap_ci` uses.
- **Z-score calibration** against the `null_adapter` baseline via the
  existing `_zscore` helpers. `calibrate_spec` synthesizes 8 mixed-topic
  sentinel prompts + k=2 for the null pass; the specificity distribution
  concentrates around 0.5 there, so a real adapter's separation above
  that baseline reads as a clean z-score.
- **Degenerate-input policy.** `< min_prompts` (default 20) → SKIP with
  a clear message; `num_clusters * 2 > num_prompts` → SKIP (the
  per-cluster mean is not well-resolved); empty prompt list → ERROR.
  Missing `[semsim]` extras → SKIP with a
  `pip install 'dlm-sway[semsim]'` hint.
- **Zero-variance fallback.** If every prompt produced the same
  divergence (canned stub data, no adapter motion), the specificity
  ratio is mathematically undefined; we return `0.5` — the null-adapter
  expectation — so the downstream z-score reports "no signal" instead
  of NaN.
- **New dependency.** `scikit-learn>=1.4` added to the `[semsim]` extra
  (riding the 80 MB MiniLM load that probe already pulls in). Mirrored
  to `[all]` and to mypy's stubless-override list.
- **7 unit tests** in `tests/unit/test_probe_cluster_kl.py`: two-topic
  adapter → high specificity, uniform adapter → 0.5 fallback, too-few
  prompts → SKIP, empty prompts → ERROR, `num_clusters > prompts/2` →
  SKIP, CI bracketing, missing-extras SKIP path.
- **Prove-the-value test** at `tests/unit/test_cluster_kl_prove_value.py`.
  Two backends with comparable `delta_kl` (ratio < 3×) have specificity
  scores that split by at least 0.3 — concrete evidence that `cluster_kl`
  surfaces a structural distinction `delta_kl` merges.

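The specificity ratio and its zero-variance fallback can be sketched with plain lists (the shipped probe computes this over per-prompt JS divergences grouped by k-means label; the function here is illustrative):

```python
# Hedged sketch: between-cluster variance over total variance, with the
# 0.5 null-expectation fallback when there is no variance at all.
from statistics import mean

def specificity(divs, labels):
    overall = mean(divs)
    clusters = {lab: [d for d, l in zip(divs, labels) if l == lab]
                for lab in set(labels)}
    between = sum(len(c) * (mean(c) - overall) ** 2
                  for c in clusters.values())
    within = sum((d - mean(c)) ** 2
                 for c in clusters.values() for d in c)
    if between + within == 0:
        return 0.5        # zero-variance fallback: the null expectation
    return between / (between + within)

# Topic-targeted: cluster 0 barely moves, cluster 1 moves a lot
print(round(specificity([0.01, 0.02, 0.90, 0.91], [0, 0, 1, 1]), 3))  # near 1.0
# Blunt adapter: identical shift everywhere triggers the fallback
print(specificity([0.4, 0.4, 0.4, 0.4], [0, 0, 1, 1]))                # 0.5
```
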
### Sprint 15 — pytest plugin (`@pytest.mark.sway`)

Closes Audit 01 innovation item F10. Packages sway as a pytest
library so teams already running pytest can adopt sway with a single
decorator instead of a subprocess wrapper.

- **New plugin** (`src/dlm_sway/pytest_plugin.py`) auto-loaded via
  the `pytest11` entry point after `pip install 'dlm-sway[pytest]'`.
  `@pytest.mark.sway(spec="...", threshold=0.0, weights=None)`
  expands a single pytest function into **one test item per probe**
  in the referenced spec + an optional `__gate__` item that fires
  only when the composite score drops below `threshold`.
- **Verdict translation.** `FAIL` / `ERROR` → pytest Failed,
  `SKIP` → pytest Skipped, `WARN` → pytest warning, `PASS` →
  pytest pass. Probe-level failures isolate: a failing adherence
  probe doesn't mask a failing calibration one, and `pytest -k
  adherence` runs just that probe.
- **The suite runs once per decorated function.** A session-scoped
  `_SuiteCache` keyed on `(spec_path, weights)` ensures the N-way
  item expansion doesn't multiply backend wall time. Two
  `@pytest.mark.sway` tests against the same spec share one run.
- **Malformed marks fail cleanly.** Missing `spec`, non-numeric
  `threshold`, non-dict `weights`, and unknown kwargs produce a
  synthetic `_ConfigErrorItem` with a green-field pytest failure
  line — no cryptic collection-time tracebacks.
- **New `[pytest]` extra** — just `pytest>=8.0`. Matches the pattern
  other plugin-style extras use. Also registers the `pytest11`
  entry point so the plugin is discovered automatically on install,
  consistent with pytest-cov / pytest-xdist.
- **Example directory** at `examples/pytest_integration/` with a
  minimal `sway.yaml` + `test_sway_gate.py` showing the decorator
  replacing a legacy `subprocess.run(["sway", "gate", ...])`
  wrapper. The README gains a "Pytest integration" section pointing
  at it.
- **13 new unit tests** via pytest's canonical `pytester` fixture:
  marker registration, N-item expansion, FAIL / SKIP / ERROR
  routing, gate below-threshold failure, gate above-threshold pass,
  gate absence when `threshold=0`, malformed-mark error paths, and
  cache sharing across multiple decorated tests.
- **Drive-by fix for the S14 CI-narrowing test flake.** The dummy
  backend's per-prompt noise was seeded via Python's `hash()`,
  which is salted per process via `PYTHONHASHSEED`. Swapped to a
  stable `hashlib.md5`-derived seed so the narrowing invariant
  holds deterministically across test-order permutations.

### Sprint 14 — Bootstrap CIs + forward-pass trace CLI

Closes Audit 01 innovation items F9 (bootstrap confidence intervals
on raw metrics) + F12 (polished forward-pass trace CLI). Paired
sharpenings of existing probe output and existing trace
infrastructure.

- **New `core/stats.bootstrap_ci`** — percentile-bootstrap 95% CI on
  any sequence of per-sample measurements. Numpy-only, 1000-resample
  default, seeded from `ctx.seed` so intervals are reproducible.
  Short-circuits on non-finite / empty / degenerate-constant inputs;
  the vectorized resample keeps the overhead at ~1 ms per probe.
- **`ProbeResult.ci_95: tuple[float, float] | None`** — new field,
  threaded through `safe_finalize` (nulled if `raw` is nulled, so a
  CI never brackets a defensive-null point estimate). `to_json` /
  `from_json` persist the pair as a two-element list.
- **Six aggregating probes emit `ci_95`:** `delta_kl` (over per-prompt
  divergences), `calibration_drift` (over per-item regression
  indicators), `external_perplexity` (over per-chunk deltas),
  `leakage` (over clean-recall rates), `paraphrase_invariance` (over
  verbatim lifts), `section_internalization` (over effective SIS
  scores). Each also lands the interval under
  `evidence["raw_ci_95"]` for JSON consumers.
- **Report tables add a `ci95` column** between `raw` and `z` in
  both terminal and markdown output. `format_ci` renders as
  `[lo, hi]` with an em-dash for missing/non-finite values. The
  markdown snapshot is refreshed; the JSON snapshot is refreshed for
  the new per-probe `ci_95` field.
- **New `sway trace <jsonl>` CLI.** Reads the forward-pass trace
  JSONL the S07 runner writes when `--trace <path>` is set,
  aggregates into per-probe and per-view wall-time + hit-rate
  tables, plus a top-N slowest-events table. `--format
  terminal|md|json`; `--slowest K` tunes the tail size. The
  parsing tolerates old trace shapes (missing `probe` / `hit`
  fields) and skips malformed lines.
- **`suite/trace_analysis`** — typed dataclasses (`TraceEvent`,
  `ProbeSummary`, `ViewSummary`, `TraceReport`) + three renderers
  (terminal / markdown / json) matching the conventions of
  `suite/report` and `suite/compare`.
- **Committed trace fixture** at `tests/fixtures/trace_sample.jsonl`
  (8 events, 2 probes × 4 prompts with cache hits). Drives 22 new
  trace-analysis + CLI tests.
- **Prove-the-value (F9):** on a dummy backend with per-prompt
  variation, `delta_kl`'s CI narrows from `[0.33, 0.41]` (width
  0.079) at N=4 prompts to `[0.38, 0.42]` (width 0.044) at N=32 —
  1.8× tighter in width, tracking the √(N₂/N₁) ≈ 2.8× theoretical
  scaling minus dummy-backend dispersion.
- **Prove-the-value (F12):** the CLI's terminal output on the
  committed fixture correctly identifies `sis/base` (520.3 ms) as
  the single slowest event, `sis` (1,529 ms) as the slower probe
  over `dk` (529 ms), and `ft` (1,357 ms) as the slower view over
  `base` (701 ms).

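The contract of `bootstrap_ci`, sketched with the stdlib instead of the shipped numpy-vectorized version (the real helper also short-circuits on non-finite inputs; everything else here is illustrative):

```python
# Hedged sketch: seeded percentile bootstrap of the sample mean,
# returning the 2.5/97.5 percentiles, or None on degenerate input.
import random

def bootstrap_ci(samples, n_resamples=1000, seed=0):
    if len(samples) < 2 or len(set(samples)) == 1:
        return None                  # empty / degenerate-constant input
    rng = random.Random(seed)        # seeded, so intervals reproduce
    means = sorted(
        sum(rng.choices(samples, k=len(samples))) / len(samples)
        for _ in range(n_resamples)
    )
    return means[int(0.025 * n_resamples)], means[int(0.975 * n_resamples)]

lo, hi = bootstrap_ci([0.33, 0.41, 0.38, 0.40, 0.36, 0.39], seed=7)
print(f"95% CI = [{lo:.3f}, {hi:.3f}]")
print(bootstrap_ci([0.5, 0.5, 0.5]))  # None: constant input short-circuits
```
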
### Sprint 13 — OpenAI-compatible HTTP scoring backend

Closes Audit 01 innovation item F7. Unlocks sway against hosted
fine-tunes — OpenAI platform, `vllm serve`, Ollama — without
requiring a local torch + PEFT load.

- **New backend `ApiScoringBackend`** (`backends/api.py`): scores
  against a `/v1/completions` endpoint via httpx. Implements
  `ScoringBackend` only (not `DifferentialBackend`) since one
  endpoint is one model; users compose two `ApiScoringBackend`
  instances behind the existing `TwoModelDifferential` wrapper for
  a full differential run. Method surface: `logprob_of` via
  `echo=True` + `logprobs=0`, `rolling_logprob` via the same,
  `next_token_dist` via `max_tokens=1` + `logprobs=K`, plus
  `preflight_finite_check` and an `ApiScoringBackend.generate` for
  probes that need text output (e.g. `leakage`).
- **Token-boundary handling.** The API returns tokens as strings, so
  `logprob_of` walks the echoed tokens by character length until the
  running total covers the prompt, then sums the rest. When the
  boundary falls mid-token, the partial token lands on the prompt
  side — over-counting the prompt, under-attributing to the
  completion — and the behavior is asserted by a dedicated test
  (`test_mid_token_prompt_leans_conservative`).
- **Retry + preflight.** Tenacity handles 5xx + network errors with
  exponential backoff (configurable `max_retries`). Preflight hits
  the endpoint once with `hello`, `max_tokens=1`, `logprobs=1` and
  rejects non-finite logprobs before the suite runs.
- **First backend with `safe_for_concurrent_views=True`.** HTTP is
  stateless, so the S07 concurrent-probe scheduler can dispatch
  against an API backend in parallel as soon as the pool
  implementation lands (still scaffolding in the runner).
- **`ModelSpec` additions:** `BackendKind` gains `"api"`; the
  `ModelSpec.endpoint` field carries the server's base URL (the
  `/v1/completions` path is appended by the backend). API key from
  the `SWAY_API_KEY` → `OPENAI_API_KEY` env var fallback, so secrets
  stay out of the YAML.
- **`backends.build` dispatch** routes `kind="api"` to
| 1245 | `ApiScoringBackend`. Combined with `build_two_separate` + |
| 1246 | `TwoModelDifferential`, a YAML with |
| 1247 | `defaults.differential: false` and two `kind: api` models Just |
| 1248 | Works end-to-end. |
| 1249 | - **New `[api]` extra** — `httpx>=0.27` + `tenacity>=9.0`. Core sway |
| 1250 | stays torch-free; the `[api]` extra adds ~1 MB of deps vs the |
| 1251 | `[hf]` extra's 3 GB. |
| 1252 | - **Unit tests (21 new) use httpx's `MockTransport`** to intercept |
| 1253 | every call and assert numeric outputs match canned OpenAI-shaped |
| 1254 | responses — covers all three scoring methods, cache dedup, 4xx |
| 1255 | error surfacing, 503→200 retry recovery, API-key env fallback, |
| 1256 | NaN-response preflight rejection, and the token-boundary math. |
| 1257 | - **Prove-the-value (§F7):** `tests/integration/test_api_ollama.py` |
| 1258 | is opt-in via `SWAY_OLLAMA_URL` + `SWAY_OLLAMA_MODEL`. Runs the |
| 1259 | full scoring surface against a live Ollama serving a small model |
| 1260 | (e.g. `llama3.2:1b`) and asserts finite output + preflight pass. |
| 1261 | Documents the wall-time budget the "≤3× HF backend" claim rests |
| 1262 | on once the concurrent-dispatch pool lands. |
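
A spec along these lines should exercise the two-`kind: api` path end
to end. This is a hypothetical sketch: the field names (`kind`,
`endpoint`, `defaults.differential`) follow this entry, but the full
spec schema is assumed, not quoted.

```yaml
# Hypothetical two-API-model spec sketch; exact schema may differ.
defaults:
  differential: false            # routes through TwoModelDifferential
models:
  base:
    kind: api
    endpoint: http://localhost:11434    # /v1/completions is appended
  ft:
    kind: api
    endpoint: http://localhost:11435
# The API key comes from SWAY_API_KEY (falling back to
# OPENAI_API_KEY), so no secret appears in this file.
```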
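
The boundary walk in the token-boundary bullet can be sketched as
follows. `split_echoed_logprobs` is a hypothetical standalone helper,
not the real `logprob_of` method; it only illustrates the
character-length scheme and its conservative bias.

```python
def split_echoed_logprobs(prompt, tokens, token_logprobs):
    """Sum completion logprobs out of an echoed-token response.

    Walks the echoed tokens by character length until the running
    total covers the prompt; every token after that is completion.
    A token straddling the boundary lands on the prompt side, which
    over-counts the prompt and under-attributes to the completion
    (the conservative direction asserted by the test above).
    """
    covered = 0
    completion_logprob = 0.0
    for tok, lp in zip(tokens, token_logprobs):
        if covered < len(prompt):
            covered += len(tok)  # boundary token counts as prompt
        else:
            completion_logprob += lp
    return completion_logprob
```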

### Sprint 12 — Interactive HTML report

Closes Audit 01 innovation item F6. Adds the exploration surface to
sit alongside the terminal (CI logs) and markdown (PR artifacts)
outputs — a single-file interactive HTML page for the research /
write-up case.

- **`sway report result.json --format html --out report.html`** emits
  a self-contained HTML page with five interactive Plotly panels:
  composite-score gauge with banded thresholds, per-category
  horizontal-bar breakdown, per-section SIS bar chart (when
  `section_internalization` evidence carries a `per_section` array),
  adapter-ablation response curve (when `adapter_ablation` evidence
  carries `lambdas` + `mean_divergence_per_lambda`), and an
  all-probe score × z-score scatter with hover tooltips. The terminal
  verdict palette (green / yellow / red) carries across every panel.
- **Plotly bundle inlined once in `<head>`**; each panel is a
  Plotly-produced `<div>` with a stable `sway-*` id so snapshot tests
  don't churn on Plotly point releases. No external `<script src>` /
  `<link href>` references — the page loads offline from a single
  ~4.9 MB file (Plotly 6.x JS is the majority of that; the sway
  wrapper + chart data is ~40 KB).
- **CLI gates `--out PATH` on file-producing formats.** `--format html`
  requires `--out` (megabytes of inlined JS have no business on
  stdout); `--format md|json|junit` now also accept `--out` for
  symmetry with the HTML path. `--format terminal` rejects `--out` —
  the terminal renderer is for the console only.
- **`plotly>=5.20` is optional**, shipped via the existing `[viz]`
  extra (alongside matplotlib). Without it, `--format html` exits 2
  with the `pip install 'dlm-sway[viz]'` install hint. Graceful
  ImportError → RuntimeError translation tested end-to-end.
- **Snapshot** at `tests/snapshots/report.html` locks the sway-owned
  wrapper structure. The Plotly JS bundle is stripped from the
  snapshot (replaced with a placeholder) so the 43-line snapshot
  stays human-reviewable and Plotly version bumps don't drift it.
- **Prove-the-value:** `test_html_from_real_history_loads_offline`
  renders HTML from the committed `tests/fixtures/sway-history/02-*`
  run (4 probes, real exported payload) and asserts: the file parses
  via `html.parser`, zero external `<script src>` / `<link href>`
  references, every probe name appears in the body, and the file size
  lands between 1 MB and 10 MB. Measured output: 4.87 MB.
### Sprint 11 — `sway compare` across saved JSON runs

Closes Audit 01 innovation item F5. Ships the regression-dashboard
primitive every CI integration reached for but had to script.

- **New CLI subcommand `sway compare`.** Accepts N saved result JSONs
  (typically from `sway run --json`), rehydrates each via
  `report.from_json`, folds them into a score matrix, and renders:
  a per-probe score table with one column per run, per-adjacent-pair
  delta columns (colored red on drop / green on lift), and a
  composite-score timeline row. Formats: `terminal` (default, Rich),
  `--format md`, `--format json`.
- **`--fail-on-regression <threshold>`.** Exits 1 when any probe's
  score in the newest run dropped ≥ `threshold` vs the prior run.
  `threshold=0` (default) disables the gate. The gate fires after the
  output is emitted, so CI logs always capture the matrix even on
  red builds.
- **`suite/compare.py` module.** Clean separation: `build_matrix`
  folds `(SuiteResult, SwayScore)` pairs into a `CompareMatrix`
  dataclass; `render_{terminal,markdown,json}` consume that. No
  filesystem IO in the module itself — the CLI owns the reads, the
  renderers own the writes. Probes that disappeared between runs
  show as `None` / em-dash in the matrix; new probes show as
  `None` on older runs. The union of probe names is sorted for stable
  row order across invocations.
- **Markdown snapshot locked** at `tests/snapshots/compare.md` —
  silent schema drift breaks the test like every other snapshot.
- **Committed history fixture + prove-the-value test.**
  `tests/fixtures/sway-history/{01,02,03}-*.json` ship a three-run
  narrative: baseline → retrained-improved → over-trained. Run 03
  plants a `section_internalization` drop of 0.22 and a
  `calibration_drift` drop of 0.25 while `delta_kl` still rises
  (the memorization-without-generalization failure mode).
  `test_compare_catches_planted_regression` invokes
  `sway compare sway-history/*.json --fail-on-regression 0.10` and
  asserts exit=1 with both regressed probes surfaced in the JSON
  payload and terminal text — exactly the F5 "CI gate the build on
  regression" experiment.
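
The `None`-filling union fold described for `suite/compare.py` can be
sketched in a few lines. `build_probe_rows` is a hypothetical helper,
not the real `build_matrix` signature; it only shows the alignment
semantics (sorted union of names, `None` for a run that lacks a probe).

```python
def build_probe_rows(runs):
    """Fold per-run {probe: score} dicts into aligned rows.

    Probes missing from a run show as None (rendered downstream as an
    em-dash); the union of probe names is sorted so row order is
    stable across invocations.
    """
    names = sorted(set().union(*(r.keys() for r in runs)))
    return {name: [run.get(name) for run in runs] for name in names}
```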

### Sprint 10 — Multi-rank adversarial null adapters

Closes Audit 01 innovation item F4. Extends the S02 null-calibration
matrix with a per-rank profile — users now read "how rank-saturated
is my adapter?" straight off the report.

- **`NullAdapterSpec.rank_multipliers: list[float] = [1.0]`** (new).
  The default preserves single-rank behavior byte-for-byte. Setting
  `[0.5, 1.0, 2.0]` calibrates three independent null distributions
  per probe kind and emits a z-profile alongside the verdict z-score.
- **`NullCalibratedBackend.as_null_adapter` gains a `rank_scale` kwarg.**
  Both shipped backends (dummy + HF) implement it by scaling the
  null-weight noise std by `sqrt(rank_scale)` — mathematically
  equivalent to a rank change in terms of the LoRA output variance
  (`A·B` is a sum of `r` rank-1 outer products, so the variance is
  linear in `r`). No tensor-shape surgery, no model reload per
  multiplier.
- **`RunContext.null_stats_by_rank`** threads the per-rank matrix
  through the runner. Keys are canonical `rank_{mult:.2f}` strings;
  the inner structure matches `null_stats`.
- **Numeric probes emit `evidence["z_by_rank"]`.** Every numeric
  probe (delta_kl, calibration_drift, external_perplexity, leakage,
  adapter_revert, adapter_ablation, paraphrase_invariance,
  preference_flip, prompt_collapse, section_internalization,
  style_fingerprint) computes per-rank z-scores using its existing
  sign convention. The verdict path still reads the 1.0x group —
  no existing thresholds shift.
- **Report surfaces the profile.** Terminal + markdown probe notes
  append `rank profile: +4.2σ @ 1x / +6.8σ @ 0.5x / +2.1σ @ 2x`
  whenever multi-rank calibration ran. A `format_z_profile` helper
  centralizes the rendering.
- **Null-stats disk cache widens its key to include `rank_multipliers`.**
  Single-rank caches from pre-S10 runs are still readable — they're
  promoted to the new `null_stats_by_rank` shape on load.
- **Bug fix (`external_perplexity`): drop erroneous z sign flip.**
  `mean_delta` is higher-is-better (ft logprob minus base logprob
  on external prose), so the raw z-score maps directly onto
  `z >= assert_z_gte`. S09's sign flip was reversing the pass/fail
  direction when the null-calibration path ran.
- **README: rank-profile interpretation guide.** Explains when the
  shape indicates rank saturation vs. rank oversizing, plus the
  "low rank can be pathologically quiet" dual-reading caveat.
- **Prove-the-value (`tests/unit/test_null_multi_rank.py`):** on a
  fixed adapter with the dummy backend, the delta_kl z-profile is
  strictly monotone in inverse rank (`z@0.5x > z@1.0x > z@2.0x`) —
  exactly the signature of a rank-scaled null distribution.
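
The variance-is-linear-in-`r` claim behind the `rank_scale` kwarg can
be checked numerically. This is a stdlib-only sketch of the math, not
the backends' implementation, and it assumes the null adapter injects
a single noise term per output element, which is where scaling the
std by `sqrt(k)` (variance × k) matches multiplying the rank by `k`.

```python
import random
import statistics

random.seed(0)

def lora_element_var(rank, std, trials=4000):
    # One output element of A @ B is a sum over `rank` of a_k * b_k
    # with a_k, b_k ~ N(0, std^2), so its variance is rank * std**4:
    # linear in the rank.
    samples = [
        sum(random.gauss(0, std) * random.gauss(0, std) for _ in range(rank))
        for _ in range(trials)
    ]
    return statistics.pvariance(samples)

v_r = lora_element_var(8, 0.1)
v_2r = lora_element_var(16, 0.1)
ratio = v_2r / v_r  # close to 2: doubling the rank doubles the variance
```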

### Sprint 09 — External-perplexity-gap probe

Closes Audit 01 innovation item F3. First innovation sprint landed on
top of the audit-closure campaign (S01–S07).

- **New probe `external_perplexity`** (`probes/external_perplexity.py`):
  rolling-logprob delta of ft vs base on held-out public-domain English
  prose. The raw metric is `mean_delta_nats` (per-token). Negative =
  the adapter raised perplexity on external text (diffuse forgetting);
  positive = the adapter improved English modeling incidentally. Category
  `calibration`; scored alongside `calibration_drift`.
- **Packaged public-domain corpus**
  (`probes/_corpora/public_domain_en.txt`, ~14 KB): hand-assembled
  from twelve US-public-domain passages (Lincoln, Emerson, Twain,
  Thoreau, Austen, Darwin, Shakespeare, Franklin, Melville,
  Dickinson, the US Constitution, Federalist #10). Each passage carries
  an inline `# -- source:` provenance comment identifying the work
  and the "US public domain (pre-1929)" basis. The loader strips the
  comments at read time so the probe sees only raw prose.
- **Null calibration**: `calibrate_spec()` returns a 4-chunk cheap
  version so `null_adapter` produces per-kind stats without dominating
  suite runtime. Sign-flipped z-score — lower raw delta is worse, so
  the shared `z >= assert_z_gte` semantics read as "σ better than
  noise" on external fluency.
- **`autogen` wiring**: emits an `external_ppl` suite entry (with
  `corpus=public_domain_en`, `max_chunks=8`) whenever the source `.dlm`
  has any PROSE section. The intent table explains it as the
  "diffuse-forgetting complement to `calibration_drift`."
- **Prove-the-value test**
  (`tests/unit/test_ext_ppl_vs_calibration_drift.py`): a dummy backend
  applies a uniform −0.3 nats/token drift to every pack item and
  every corpus chunk — below `calibration_drift`'s 1.0-nat per-item
  regression threshold, above its −0.5 mean-delta gate, but well below
  `external_perplexity`'s −0.1 fixed threshold. `calibration_drift`
  reports PASS, `external_perplexity` reports FAIL. The two probes
  measure different failure modes and do not substitute for each
  other.
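
The raw `mean_delta_nats` metric can be sketched as below. This is a
minimal illustration under the assumption that the probe averages
per-token deltas over corpus chunks; the real aggregation may differ.

```python
def mean_delta_nats(base_logprobs, ft_logprobs, token_counts):
    """Per-token rolling-logprob delta of ft vs base, averaged over
    corpus chunks. Negative means the adapter raised perplexity on
    external prose (diffuse forgetting); positive means it improved
    external fluency incidentally.
    """
    deltas = [
        (ft - base) / n
        for base, ft, n in zip(base_logprobs, ft_logprobs, token_counts)
    ]
    return sum(deltas) / len(deltas)
```

A uniform −0.3 nats/token drift, like the one the prove-the-value
dummy backend plants, comes out as exactly −0.3 here.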

### Sprint 07 — Performance & caching

Closes Audit 01 findings B19 (deferred per design note) and the
E-cache-opportunity item.

- **Forward-pass cache** (`backends/_instrumentation.py`): every
  backend view now routes `next_token_dist` / `rolling_logprob` /
  `logprob_of` through a bounded LRU keyed on `(op, view_id,
  prompt_hash, top_k)`. Baseline A/B on a tiny 4-probe suite against
  SmolLM2-135M on CPU: 2.33 s → 1.76 s (**25% wall-time reduction**;
  forward passes 88 → 62, hit rate 30%). The audit's 18.5 s Quillstone
  suite has more overlap and would see larger gains; the 25% floor
  holds on any suite with repeated prompts across probes.
- **Cache key includes `view_id`**: `"base"` / `"ft"` /
  `"scaled_1.25"` / `"null_42"` are distinct namespaces. A future
  toggle regression that fails to flip the adapter would surface as
  wrong-side cached values immediately, not silently corrupt the
  divergence math.
- **Backend stats surface in the report**: `SuiteResult.backend_stats`
  captures `cache_hits` / `cache_misses` / `forward_passes` /
  `scoring_wall_s` / `hit_rate`. Terminal + markdown footers render
  `cache: 26/88 = 30%` when stats are present.
- **Forward-pass tracing**: `sway run --trace <path.jsonl>` writes
  one event per backend scoring call (probe / view_id / prompt_hash /
  top_k / op / wall_ms / hit). Zero overhead when unset.
- **`concurrent_probes` scaffolding**: a `spec.defaults.concurrent_probes:
  int = 1` field plus a `safe_for_concurrent_views: bool = False`
  class attribute on the HF / MLX / dummy backends. The runner warns on
  stderr when the user requests > 1 against an unsafe backend and
  stays sequential. Custom backends that are already concurrency-safe
  (e.g. a stateless hosted-API backend) can opt in without waiting
  for the HF fix.
- **B19 design note**: `.docs/design/backend-concurrency.md` documents
  the current state, why v0.1 doesn't fix it, and two future paths
  (per-thread lock vs per-worker pool) with a recommendation.
- **Generation stays uncached**: `view.generate()` intentionally
  bypasses the cache — probe-side callers vary
  `(prompt, max_new_tokens, temperature, seed)` in ways that would
  rarely collide, and caching sampled output would hide seed bugs
  behind stale strings.
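
The bounded-LRU keying scheme above can be sketched like this. The
class name, `maxsize` default, and `get_or_compute` method are
illustrative only, not the `_instrumentation` API; the key shape and
the `view_id` namespacing follow the bullets above.

```python
from collections import OrderedDict

class ForwardPassCache:
    """Bounded LRU keyed on (op, view_id, prompt_hash, top_k).

    Keeping view_id in the key means "base" / "ft" / "scaled_1.25"
    values can never answer for each other.
    """

    def __init__(self, maxsize=1024):
        self._data = OrderedDict()
        self.maxsize = maxsize
        self.hits = self.misses = 0

    def get_or_compute(self, op, view_id, prompt_hash, top_k, compute):
        key = (op, view_id, prompt_hash, top_k)
        if key in self._data:
            self.hits += 1
            self._data.move_to_end(key)      # refresh LRU position
            return self._data[key]
        self.misses += 1
        value = compute()                    # one real forward pass
        self._data[key] = value
        if len(self._data) > self.maxsize:
            self._data.popitem(last=False)   # evict least-recent entry
        return value
```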

### Sprint 06 — CLI & report UX polish

Closes Audit 01 findings D3, D4, D5, D6, D7, D8, D9, D10, D11, D12,
D13, D14, D15, B16, B17, B18.

- **D3 — extras rollup footer:** terminal + markdown reports now end
  with a single `pip install 'dlm-sway[...]'` line collecting every
  extra mentioned in SKIP messages. One rollup, no per-row scan.
- **D4 — `sway check` infers `--base`:** reads
  `base_model_name_or_path` from the adapter's `adapter_config.json`
  when `--base` is omitted; echoes "(inferred base model: …)" so the
  user knows what got picked. Falls back to a clear required-flag
  error when the field is missing.
- **D5 — `sway autogen` annotates the YAML:** generated specs carry a
  header block (source `.dlm`, dlm_id, base, adapter, generation
  timestamp, sway version) and a one-line `# intent` comment above
  every probe entry. No new dep — pyyaml + a position-based
  post-processor instead of `ruamel.yaml`.
- **D6 — `sway run --dry-run` + `sway list-probes`:** dry-run validates
  the spec, prints a probe table (#, name, kind, category, enabled),
  and exits 0 without building a backend. `list-probes` prints every
  shipped probe kind with its category and the first line of its
  docstring.
- **D7 — `sway doctor --json`:** machine-readable doctor payload
  (`sway_version`, `python`, `platform`, `extras`) keyed by extra and
  module. CI-grep-friendly.
- **D8 — `dlm_source` resolution warns instead of silently swallowing:**
  when the spec sets `dlm_source` but the `[dlm]` extra isn't installed
  (or the bridge errors), the runner emits a yellow stderr warning with
  the install hint and continues with `sections=None`. No more silent
  "why are my section probes SKIPping?" debugging.
- **D9 — markdown column parity with terminal:** the markdown probes
  table now carries `score`, `raw`, `z`, `duration`, and `note`
  columns. A `Top findings` section follows the table; a `Skipped
  probes` section closes the report when extras are missing.
- **D10 — unified number formatters:** `format_score`, `format_raw`,
  `format_z`, and `format_duration_s` live in `suite/report.py` and are
  the only numeric formatters used. Thousands separators above 1 000;
  an `—` glyph for `None` / non-finite. A single source kills
  cross-surface drift.
- **D11 — `sway report --format` validated via `StrEnum`:** the
  Typer-level enum rejects unknown formats with the standard
  "Invalid value for '--format'" error. No more silent fallback to
  the terminal renderer on a typo.
- **D12 — `sway check` verdict banner:** prints
  `✅ adapter is +4.2σ above noise` (green ≥ 3σ),
  `⚠️ adapter is +1.5σ above noise — marginal` (yellow ≥ 1σ), or
  `❌ adapter is +0.3σ — indistinguishable from noise` (red) above
  the full report. Calibrated on the delta_kl z-score; `check`
  now runs `null_adapter` first so the z-score is available.
- **D13 — `sway diff` regression summary:** after the per-probe deltas,
  the diff prints `A→B: N regressed >0.10, M regressed >0.20,
  composite Δ=±X.XX`, color-coded by direction.
- **D14 — adapter paths with spaces render safely:** `_adapter_label`
  wraps the path in double quotes when any whitespace is present.
- **D15 — long messages wrap, don't truncate:** the terminal probe
  table column uses Rich's `overflow="fold"` instead of an 80-char
  hard cut with an ellipsis. The full message is always visible.
- **B16 — single markdown renderer:** `report.from_json` round-trips
  saved JSON back into the canonical dataclass pair, so
  `sway report --format md` and `sway run --markdown` both flow
  through `report.to_markdown` with identical output. The legacy
  `_render_markdown_from_json` and `_render_junit_from_json` helpers
  in `cli/commands.py` are deleted.
- **B17 — `None` scores render as `—`:** the unified formatters cover
  every render path; no more `0.00` masking missing data.
- **B18 — `baseline` row labeled `(informational, weight=0)`:** in
  both terminal and markdown component breakdowns, matching the
  Sprint 03 explicit-weight-zero decision.
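
The D10 / B17 conventions could be implemented along these lines.
Only `format_raw` is sketched, and the two-decimal precision is an
assumption; the shipped formatters in `suite/report.py` may round
differently.

```python
import math

def format_raw(value):
    """Render a raw metric: `—` for None / non-finite values,
    thousands separators above 1 000 (the D10 convention)."""
    if value is None or not math.isfinite(value):
        return "—"
    if abs(value) >= 1000:
        return f"{value:,.2f}"
    return f"{value:.2f}"
```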

### Sprint 05 — Probe quality & edge cases

Closes Audit 01 findings B6, B7, B8, B9, B10, B11, B12, B13, B14, B20,
B21, B22.

- **B6 — `tail_logprob` semantics:** the field is now `float | None`
  with three discrete states: `None` (top-k covered the full vocab — no
  tail to redistribute), `0.0` (a real tail underflowed below fp32),
  and a measurable negative log-prob. The HF, MLX, and dummy backends
  all emit `None` when `k == vocab`. The preflight check guards against
  non-finite `tail_logprob` only when it's a number.
- **B7 — probe validation before backend build:** a new
  `validate_all_probes(suite)` helper collects every spec error in one
  pass; `_execute_spec` calls it *before* materializing the backend so
  a typo in `kind:` surfaces immediately, and *all* typos surface in a
  single error message.
- **B8 — `autogen` style prompts:** `style_fingerprint` no longer
  receives the leading sentence of a prose section (which elicited doc
  *content*, not stylistic voice). Replaced with a fixed
  `_STYLE_ELICITATION_PROMPTS` set of 6 open-ended, content-neutral
  prompts.
- **B9 — sentence-transformer caching:** `_load_embedder` is now
  wrapped in `functools.lru_cache(maxsize=4)`. Suites that run
  `adapter_revert` back-to-back (multi-adapter diff, repeated probes)
  reuse the same ~80 MB embedder instead of re-loading it on every
  probe call.
- **B10 — `_lcs_ratio` docstring rot:** the function name is a
  historical misnomer (it computes gestalt similarity via
  `difflib.SequenceMatcher`, not LCS). The docstring is rewritten to
  make the contract explicit; the rename is deferred to v0.2 to
  preserve the public API surface.
- **B11 — leakage adversarial perturbations:** added 4 new perturbations
  (`synonym_swap` with a hand-curated 50-pair table, `clause_reverse`,
  `prefix_inject`, `register_shift`) alongside the original three. The
  default perturbation list is now all 7; the spec accepts any subset.
- **B12 — calibration pack expansion:** `BUILT_IN_PACK` grew from 30
  to **200** items (geography, natural sciences, arithmetic, language,
  history, biology, technology, miscellaneous). All public-domain /
  hand-composed grade-school facts — no third-party dataset license
  attaches. The `items_limit` docstring documents the new resolution
  (~0.5 pp per regressed item vs the old ~3.3 pp).
- **B13 — tokenizer-aware `_stuffing`:** `prompt_collapse` now derives
  its padding from the model's pad / unk / EOS token via the backend's
  tokenizer, making the metric language-agnostic. The pre-B13 hardcoded
  English string remains as the dummy-backend fallback and behind a
  `legacy_stuffing: bool = False` spec field for one release of
  backward compatibility.
- **B14 — `preference_flip` per-triple error fence:** each triple's
  `logprob_of` calls are wrapped in `try/except ProbeError`. A single
  bad triple no longer kills the whole batch;
  `evidence["dropped_triples"]` + `dropped_reasons` surface the count
  and the first 5 reasons. When *every* triple raises, the probe routes
  to ERROR with a clear explanation.
- **B20 — custom backend protocol checks:** `_load_custom` now
  isinstance-checks `NullCalibratedBackend` and
  `ScalableDifferentialBackend` after the `DifferentialBackend` check.
  The set of satisfied protocols is stamped onto the instance as
  `__sway_protocols__: tuple[str, ...]` so the report can show which
  features are available without re-checking.
- **B21 — `RunContext.null_stats` truly frozen:** the runner now wraps
  the stats dict in `types.MappingProxyType` before threading it
  through. The dataclass was already `frozen=True`, but the dict was
  mutable by reference; B21 makes the docstring's "frozen" claim
  literally true. The field type widened from `dict[str, dict[str,
  float]]` to `Mapping[str, Mapping[str, float]]`.
- **B22 — `ModelSpec.adapter` path normalization:** added a pydantic
  `field_validator` that runs `Path.expanduser().resolve()` on the
  field at spec-load time. Backends no longer re-do the work, and the
  cache key in `_null_cache.compute_key` is now stable regardless of
  how the user spelled the path in YAML or on the CLI.
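
The B21 distinction (frozen dataclass vs mutable dict-by-reference)
can be demonstrated in isolation:

```python
from types import MappingProxyType

# frozen=True on a dataclass only blocks attribute rebinding; a dict
# held in a field stays mutable through the reference. Wrapping it in
# MappingProxyType yields a read-only live view. The inner dicts would
# need the same treatment, hence the widened
# Mapping[str, Mapping[str, float]] field type.
stats = {"delta_kl": {"mean": 0.1, "std": 0.02}}
frozen_view = MappingProxyType(stats)

try:
    frozen_view["delta_kl"] = {}  # any write through the proxy raises
    blocked = False
except TypeError:
    blocked = True
```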
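
The B22 normalization, shown here without the pydantic plumbing (the
shipped fix runs the same call inside a `field_validator`;
`normalize_adapter_path` is a hypothetical standalone name):

```python
from pathlib import Path

def normalize_adapter_path(raw: str) -> str:
    """Expand `~` and resolve to an absolute path so equivalent
    spellings of the same adapter path produce one cache key."""
    return str(Path(raw).expanduser().resolve())
```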

### Sprint 04 — Integration & regression testing

Closes Audit 01 findings C1, C2, C5, C6, C7, C8, C10, C11, C12, B3
(test side), B15.

- **Slow lane runs end-to-end** (C1): the `huggingface_hub` dev dep
  added in S01 is verified; `pytest -m "slow or online"` now executes
  every checked-in integration test from a clean clone.
- **HF backend coverage 21% → 91%** (C2) — measured combined fast +
  slow lane against `dlm_sway.backends.hf`. New tests:
  - `tests/integration/test_hf_adapter_toggle.py` extended with a
    bit-identical ft → base → ft roundtrip (B15 mitigation).
  - `tests/integration/test_hf_scaled_adapter.py`: λ-sweep
    monotonicity + `LoraLayer.scaling[key]` restoration on clean exit
    AND on exception.
  - `tests/integration/test_hf_null_adapter.py`: same-seed
    determinism, different-seed divergence, original-adapter
    restoration on clean exit AND on exception.
  - `tests/integration/test_hf_scoring.py`: `logprob_of` (incl.
    zero-token completion → ProbeError), `rolling_logprob`,
    `next_token_dist`, `_HFView.generate` (greedy + sampled-with-seed
    determinism).
  - `tests/unit/test_backend_hf_helpers.py`: direct unit coverage of
    `_resolve_dtype` and `_detect_device` so dtype regressions are
    caught in the fast lane.
- **MLX smoke test** (C5): `tests/integration/test_mlx_smoke.py`
  exercises `MLXDifferentialBackend` on darwin-arm64 with a small
  LoRA adapter. Skips cleanly on non-darwin / non-arm64 / no-mlx_lm
  / missing-fixture. A reproducible adapter builder ships at
  `tests/fixtures/build_mlx_adapter.py` (run once on a Mac to
  populate the fixture directory).
- **`sway gate` exit code pinned** (C6):
  `tests/integration/test_sway_gate_exit_code.py` covers PASS
  (exit 0), FAIL verdict (exit 1), and below-threshold-with-passing
  verdicts (exit 1) via Typer's `CliRunner`.
- **`dlm` import ban regression-guarded** (C7):
  `tests/unit/test_dlm_not_imported.py` patches every `dlm.*` entry in
  `sys.modules` to `None` and asserts the dummy suite runs end-to-end
  without ImportError, plus that `sway autogen` surfaces a clean
  install-hint error rather than a stack trace.
- **Pathological probe coverage** (C8 + B3 test side):
  `tests/unit/test_probe_adapter_ablation.py` now drives
  monotonically decreasing curves through the helper and pins
  probe-level `evidence["saturation_reason"]` for flat / found /
  overshoot-with-dip / non_monotonic shapes via a monkeypatched
  `divergence`.
- **WARN-branch numerical formula pinned** (C10):
  `test_warn_branch_score_formula_pinned` in
  `tests/unit/test_probe_preference_flip.py` asserts the exact
  `score = 0.5 + mean_delta / 4.0` formula against a hand-computed
  expected value.
- **Disjoint top-k divergence** (C12):
  `test_disjoint_top_k_supports_produce_finite_divergence` in
  `tests/unit/test_divergence.py` covers the
  `aligned_probs` tail-redistribution path against fully disjoint
  base / ft supports — JS comes out finite and approaches its
  theoretical ln(2) bound.
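
The ln(2) bound in the C12 bullet falls straight out of the JS
definition: when the two supports are fully disjoint, every term of
each KL against the mixture is `log 2`. A standalone check (this
`js_divergence` is a minimal reimplementation, not the library's
`divergence` module):

```python
import math

def js_divergence(p, q):
    """Jensen–Shannon divergence in nats over aligned supports."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Fully disjoint supports (all base mass on one side, all ft mass on
# the other after tail alignment) hit the ln(2) ceiling exactly.
p = [0.7, 0.3, 0.0, 0.0]
q = [0.0, 0.0, 0.4, 0.6]
```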
| 1676 | |
| 1677 | ### Sprint 03 — Documentation truth & dead code |
| 1678 | |
| 1679 | Closes Audit 01 findings P03, P04, P05, P07 (doc), P08, P09, P10, P14, |
| 1680 | P15, P17, P18. |
| 1681 | |
- **Determinism is now wired** (P09): the suite runner calls
  `dlm_sway.core.determinism.seed_everything(spec.defaults.seed)` before
  any backend work runs. The achieved determinism class (`strict` /
  `best_effort` / `loose`) is captured in `SuiteResult.determinism` and
  surfaces in the report footer + JSON payload + markdown header.
- **`defaults.differential: false` works end-to-end** (P14): new
  `backends/two_model.py` with a `TwoModelDifferential` wrapper +
  `build_two_separate(spec_models)` helper. `_execute_spec` routes to
  the wrapper when the flag is false. Doubles memory; primarily for
  custom backends that can't toggle adapters in place.
- **`score_weights` overridable from YAML and CLI** (P15): new
  `SuiteDefaults.score_weights` field with a subset-validating pydantic
  field validator (rejects unknown categories, negative weights, and
  all-zero weights). New `--weights k=v,k=v` flag on `sway run` and
  `sway gate`. Partial overrides merge with `DEFAULT_COMPONENT_WEIGHTS`,
  so users rarely have to respecify all five categories.
- **`baseline` row labeled `(informational)` with an explicit 0.0 weight**
  (P08, B18): added to `DEFAULT_COMPONENT_WEIGHTS` so the row appears
  in reports for transparency but contributes nothing to the composite.
  The terminal renderer adds the `(informational)` annotation column;
  the markdown table adds a `weight` column with the same label.
- **Extended (9-dim) style fingerprint when `[style]` is installed**
  (P10): `style_fingerprint` now actually uses spaCy POS-tagging and
  textstat syllable counting. Three new dims are appended to the
  existing 6: passive-voice rate, POS 4-gram entropy (Shannon, bits),
  and syllables per word. The spec field `extended: "auto" | "on" | "off"`
  (default `"auto"`) controls the path. `evidence["schema_version"]` is
  bumped to `2` so snapshot consumers can branch on dimensionality.
- **Stale "Known gaps" entries refreshed** (P03, P04, P05): removed
  "null_adapter not yet wired" (delivered in S01/S02) and "custom
  backend stubbed" (it's fully implemented); "MLX paths raise" is
  rewritten to describe the real limitation: two model copies, because
  `mlx_lm` has no runtime adapter toggle.
- **README adds determinism + weights + differential documentation**
  and a calibration paragraph that points at the per-kind null matrix.
- **`delta_kl` docstring rot fixed** (P17): dropped the dead
  ``:mod:`dir``` reference.
- **Meta-test `test_no_dead_options.py`** (P14/P15 regression guard):
  greps the source tree for documented spec/CLI option names and
  asserts each has a consumer outside its declaration site. Catches the
  exact "documented but unused" pattern Audit 01 flagged.
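The shape of that meta-test can be sketched as below. The option names, the `*.py` glob, and the two-file heuristic are illustrative assumptions; the real test derives the names from the documented spec/CLI surface.

```python
# Sketch of the "no dead options" guard (hypothetical names throughout).
from pathlib import Path

DOCUMENTED_OPTIONS = ("score_weights", "skip_preflight", "differential")


def files_mentioning(option: str, src_root: Path) -> int:
    """Count source files that mention `option` at all."""
    return sum(
        option in path.read_text(encoding="utf-8", errors="ignore")
        for path in src_root.rglob("*.py")
    )


def assert_no_dead_options(src_root: Path) -> None:
    for option in DOCUMENTED_OPTIONS:
        # A single hit means the option is declared but never consumed
        # elsewhere: the exact "documented but unused" smell.
        assert files_mentioning(option, src_root) >= 2, option
```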
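The `--weights k=v,k=v` override described in Sprint 03 above can be sketched as a parse-validate-merge step. The default weight values here are made up for illustration, and the real validation lives in a pydantic field validator on `SuiteDefaults.score_weights`.

```python
# Hypothetical defaults; the real DEFAULT_COMPONENT_WEIGHTS lives in dlm_sway.
DEFAULT_COMPONENT_WEIGHTS = {
    "adherence": 0.3, "attribution": 0.3, "calibration": 0.2,
    "ablation": 0.2, "baseline": 0.0,
}


def parse_weight_overrides(arg: str) -> dict[str, float]:
    """Parse a `--weights k=v,k=v` flag and merge it with the defaults."""
    overrides = {}
    for pair in arg.split(","):
        key, _, value = pair.partition("=")
        overrides[key.strip()] = float(value)
    unknown = set(overrides) - set(DEFAULT_COMPONENT_WEIGHTS)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    if any(v < 0 for v in overrides.values()):
        raise ValueError("negative weights are rejected")
    # Partial overrides merge over the defaults, so users rarely have
    # to respecify all five categories.
    merged = {**DEFAULT_COMPONENT_WEIGHTS, **overrides}
    if all(v == 0 for v in merged.values()):
        raise ValueError("all-zero weights are rejected")
    return merged
```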
### Sprint 02 — Universal z-score calibration

Closes Audit 01 findings P02 (delivery), B2, C9.

- **`NullAdapterProbe` is now a per-kind calibration matrix**: it iterates
  every downstream numeric kind in the suite (or an explicit
  `calibrate_kinds` list), runs a miniature version of each through the
  `NullCalibrationBackendProxy` for N seeds, and publishes
  `{kind: {mean, std, n}}` under `evidence["null_stats"]`. The old
  "publishes two keys only" implementation is gone (closes B2).
- **`NullCalibrationBackendProxy`** (`_null_proxy.py`): swaps
  `as_finetuned()` to yield `as_null_adapter(seed)`, so every numeric
  probe's own math produces the "what does my metric look like when the
  fine-tune is structural noise?" distribution without a bespoke code
  path per probe.
- **`Probe.calibrate_spec(ctx)` classmethod** plus
  `SENTINEL_PROMPTS` / `SENTINEL_DOC` shared constants in
  `probes/base.py`. Each numeric probe overrides `calibrate_spec` to
  return a small, cheap spec for calibration; probes that can't be
  meaningfully calibrated (e.g. `adapter_revert` needs an embedder,
  `adapter_ablation` needs `as_scaled_adapter`, and `prompt_collapse`
  can't fit an exponential decay to null noise) opt out by returning
  `None` and surface `(no calibration for <kind>)` in the report.
- **Shared `_zscore` helpers** (`probes/_zscore.py`):
  `z_score(raw, stats)`, `verdict_from_z(z, threshold)`,
  `score_from_z(z)`, `no_calibration_note(kind)`. Every numeric probe
  now flows through these — no bespoke `(raw - mean) / std` math is left
  in any probe file. Includes the `MIN_STD = 1e-6` floor so a
  degenerate null distribution (e.g. a single-seed calibration) can't
  produce an infinite z-score (closes C9).
- **All 10 numeric probes thread `get_null_stats(ctx, self.kind)`**:
  `delta_kl`, `adapter_revert`, `prompt_collapse`,
  `section_internalization`, `paraphrase_invariance`,
  `preference_flip`, `style_fingerprint`, `calibration_drift`,
  `leakage`, `adapter_ablation`. Each has an `assert_z_gte: float = 3.0`
  threshold that is *preferred* when stats exist. Lower-is-better probes
  (`adapter_revert`, `calibration_drift`, `leakage`) sign-flip the
  z internally, so the shared `z >= threshold` PASS rule still reads as
  "significantly better than null". Two probes (`paraphrase_invariance`,
  `calibration_drift`) keep their intent-aware / compound thresholds
  as the no-calibration fallback.
- **On-disk null-stats cache** (`probes/_null_cache.py`,
  `backends/hf.py`): the HF backend exposes `cache_identity()`;
  `NullAdapterProbe` hashes `(backend_identity, runs, init_scale,
  seed_base, top_k, kinds)` into a stable filename under
  `~/.dlm-sway/null-stats/<key>.json` (XDG-respecting). The cache is
  best-effort — a missing / malformed file rebuilds. Disable it per-suite
  with `cache: false` in the spec, or globally with
  `SWAY_DISABLE_NULL_CACHE=1`. The dummy backend doesn't expose a
  cache identity, so tests never touch disk unless they opt in.
- **Report shows z-scores in every numeric probe row**: the terminal
  renderer already had a `z` column; the markdown renderer now does
  too. Rows that fell back to fixed thresholds carry
  `(no calibration for <kind>)` directly in the message.
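The `_zscore` helpers above can be sketched as follows. The logistic squash in `score_from_z` is an illustrative assumption; only the `MIN_STD` floor and the `z >= threshold` PASS rule are described in this changelog.

```python
import math

MIN_STD = 1e-6  # floor so a degenerate null distribution can't blow up z


def z_score(raw: float, stats: dict) -> float:
    """Standard score of `raw` against null stats of shape {mean, std, n}."""
    return (raw - stats["mean"]) / max(stats["std"], MIN_STD)


def verdict_from_z(z: float, threshold: float = 3.0) -> str:
    """Shared PASS rule: significantly better than the null distribution."""
    return "PASS" if z >= threshold else "FAIL"


def score_from_z(z: float) -> float:
    """Squash z into (0, 1); a hypothetical logistic mapping for illustration."""
    return 1.0 / (1.0 + math.exp(-z))
```

Lower-is-better probes would negate `z` before calling `verdict_from_z`, which is how a single threshold rule serves both directions.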
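Hashing the calibration inputs from the cache bullet above into a stable filename might look like this. The function name and the 16-hex-digit truncation are assumptions; the real helper also honors `SWAY_DISABLE_NULL_CACHE` and the per-suite `cache: false` opt-out.

```python
import hashlib
import json
from pathlib import Path


def null_stats_cache_path(backend_identity: str, runs: int, init_scale: float,
                          seed_base: int, top_k: int, kinds: tuple) -> Path:
    """Map the calibration inputs to a stable file under ~/.dlm-sway.

    JSON with sorted keys gives a canonical byte string, so the same
    inputs always hash to the same filename across processes.
    """
    key_material = json.dumps(
        [backend_identity, runs, init_scale, seed_base, top_k, list(kinds)],
        sort_keys=True,
    )
    key = hashlib.sha256(key_material.encode()).hexdigest()[:16]
    return Path.home() / ".dlm-sway" / "null-stats" / f"{key}.json"
```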
### Sprint 01 — Finite safety & verdict integrity

Closes Audit 01 findings B1, B3 (fix), B4, B5, C3, C4, D1, D2.

- **`safe_finalize` helper** in `core/result.py`: every numeric probe now
  routes its final `ProbeResult` through this guard. A non-finite
  `raw` (or any explicitly-critical field) auto-converts to
  `Verdict.ERROR`; non-finite auxiliaries are nulled out and recorded
  in `evidence["defensively_nulled"]`.
- **`_divergence` non-finite rejection**: `aligned_probs`, `kl`, and `js`
  now raise `ProbeError` on NaN/inf inputs instead of silently
  producing garbage. JS results are bounds-checked against the
  theoretical `[0, ln 2]`; out-of-bound results raise rather than
  ship. (Pins the +11639σ bug — JS exceeded ln 2 nats by 19×.)
- **`PreflightCheckable` Protocol**: the HF and dummy backends gain
  `preflight_finite_check()`, which runs one forward pass per view with
  a sentinel prompt and rejects backends whose logprobs are non-finite.
- **Suite runner preflight gate**: every `sway run` calls
  `backend.preflight_finite_check()` before the probe loop. Failure
  emits a single synthetic ERROR probe and skips all configured
  probes; `sway gate` exits non-zero. Disable with `skip_preflight=True`
  for sub-second test suites.
- **`style_fingerprint` zero-vector fix (B4)**: a fine-tuned model that
  produces empty / whitespace-only generations no longer reports a
  spurious "+0.82 shift toward doc" PASS. The `_cosine_shift` math is
  replaced by `_projection_shift = (ft-base)·(doc-base) / ||doc-base||²`,
  which goes to zero when ft equals base.
- **`_saturation_lambda` full-range search (B3)**: the adapter-ablation
  saturation detector now searches the entire λ range (not just λ ≤ 1.0)
  with `max(divs)` as the reference, and returns a typed reason
  (`"found"` / `"non_monotonic"` / `"flat_curve"` / `"below_floor"`),
  so a flat NaN-adapter curve is distinguishable from a healthy
  saturating one.
- **Property-based divergence tests** (`hypothesis`): symmetry,
  `JS(p,p) = 0`, `JS ≤ ln 2`, `KL(p,p) = 0`, and KL non-negativity, plus
  explicit non-finite-raises tests pinning every entry point.
- **End-to-end NaN regression** (`tests/integration/test_nan_adapter_regression.py`,
  slow+online): builds a real PEFT adapter with all-NaN weights,
  persists it through safetensors, then asserts that the HF backend's
  preflight rejects it and the suite produces ERROR — not the
  +11639σ headline the audit caught.
- Added `huggingface_hub` to the dev dep group so integration tests can
  resolve the `tiny_model_dir` fixture (closes C1 ahead of Sprint 04).
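The hardened divergence math above can be sketched like this. A plain `ValueError` stands in for the real `ProbeError`, and the list-based arithmetic is an illustration of the input/output checks, not the shipped vocab-sized implementation.

```python
import math

LN2 = math.log(2.0)


def js_divergence(p: list[float], q: list[float]) -> float:
    """Jensen-Shannon divergence in nats, with the two new guards:
    reject non-finite inputs, and reject results outside [0, ln 2]."""
    if any(not math.isfinite(x) for x in p + q):
        raise ValueError("non-finite probabilities")
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # 0 * log(0/y) is taken as 0 by convention.
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

    js = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    if not (0.0 <= js <= LN2 + 1e-9):
        # Out-of-bounds JS means upstream garbage: raise, don't ship.
        raise ValueError(f"JS outside theoretical bounds [0, ln 2]: {js}")
    return js
```

Bounding the output is what pins the +11639σ class of bug: a value above ln 2 can only come from corrupted probabilities, never from a real distribution pair.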
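A minimal sketch of the `safe_finalize` guard from Sprint 01 above, using a plain dict as a stand-in for `ProbeResult` (the field names besides `raw`, `verdict`, and `evidence["defensively_nulled"]` are assumptions):

```python
import math


def safe_finalize(result: dict, critical: tuple = ("raw",)) -> dict:
    """Guard a probe result: a non-finite critical field flips the
    verdict to ERROR; non-finite auxiliaries are nulled and recorded."""
    nulled = []
    for key, value in list(result.items()):
        if isinstance(value, float) and not math.isfinite(value):
            if key in critical:
                result["verdict"] = "ERROR"
            else:
                result[key] = None
                nulled.append(key)
    if nulled:
        result.setdefault("evidence", {})["defensively_nulled"] = nulled
    return result
```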
## 0.1.0.dev0 — 2026-04-20

Initial pre-alpha. Full 11-primitive battery shipped.

### Primitives

- **Adherence**
  - `delta_kl` — mean JS/KL divergence between base and fine-tuned next-token distributions
  - `adapter_revert` — reversion under adversarial paraphrase (needs `sway-eval[semsim]`)
  - `prompt_collapse` — exponential-decay fit of divergence over context length
- **Attribution**
  - `section_internalization` *(flagship)* — per-section `effective_sis` with leak check
  - `paraphrase_invariance` — memorization vs. generalization, intent-aware
  - `preference_flip` — DPO/ORPO chosen/rejected margin inversion
- **Calibration**
  - `style_fingerprint` — 6-dim numpy-only stylistic shift vs. document
  - `calibration_drift` — general-knowledge regression on a packaged 30-item pack
  - `leakage` — greedy LCS recall + perturbation fragility
- **Ablation**
  - `adapter_ablation` *(signature primitive)* — λ-scaled divergence curve with linearity, saturation, and overshoot metrics
- **Baseline**
  - `null_adapter` — per-kind null distribution matrix for z-score calibration (wired end-to-end in Unreleased)

### Infrastructure

- `DifferentialBackend` + `ScalableDifferentialBackend` protocols
- HuggingFace + PEFT backend with `disable_adapter` / `set_adapter` toggling and LoRA-scale mutation
- Dummy backend for unit tests (canned responses + linear-blend scalable mode)
- YAML spec loader, composite score (four-category weighted), rich terminal + JSON + JUnit + Markdown reports
- Typer CLI: `run`, `gate`, `check`, `diff`, `autogen`, `doctor`, `report`
- `.dlm` bridge (`dlm-sway[dlm]`): resolver + full-battery autogen
- Matplotlib visualizations (`dlm-sway[viz]`): SIS bar chart, ablation curve, KL histogram

### Known gaps

- The MLX backend loads **two** model copies in memory (base + adapter-fused) because `mlx_lm` has no runtime adapter toggle. The memory footprint is ~2x the HF path; fine for the small (<3B) models MLX typically runs.
- MLX requires a pre-converted `.npz` adapter — raw PEFT safetensors are rejected by `mlx_lm.load`. A PEFT-to-MLX converter is a future milestone.
- PyPI publication of the `dlm-sway` wheel is pending a clean CI release workflow.