# sway
Differential testing for fine-tuned causal language models.
> **Alpha — v0.1.0 on PyPI.** API is not stable; semantic versioning
> applies only from v1.0 onward. Feedback + issues welcome.
**One question:** *did LoRA/QLoRA training actually change model behavior
in a meaningful way, or is the model just defaulting to the pretrained
base?*
`sway` gives you a trustworthy, reproducible answer with sixteen
purpose-built primitives, z-scored against a null-adapter baseline
wherever calibration applies.
No LLM judges. No external APIs. Deterministic on CPU where possible.
> **Naming convention.** The source repo and CLI entry point are both
> `sway`. The PyPI wheel is `dlm-sway` because the short `sway` name is
> taken on PyPI by an unrelated project. The CLI installed by
> `pip install dlm-sway` is `sway` — mismatched distribution and command
> names are common on PyPI (see `pyyaml` → `import yaml`).
## Install

```bash
# HF + PEFT backend — required for real models
pip install "dlm-sway[hf]"

# Extras composable as usual
pip install "dlm-sway[hf,style,semsim]"
pip install "dlm-sway[all]"

# .dlm auto-suite generation (requires the DLM sibling project)
pip install "dlm-sway[dlm]"
```
Available extras:

- `[hf]` — HuggingFace + PEFT backend (required for real models)
- `[mlx]` — Apple Silicon MLX backend (darwin-arm64 only)
- `[style]` — stylistic fingerprint extensions (spaCy + textstat + nlpaug)
- `[semsim]` — sentence-transformers for the revert probe
- `[dlm]` — auto-generate suites from `.dlm` documents
- `[viz]` — matplotlib plots
- `[all]` — everything
Verify the install:

```bash
sway --version
sway doctor
```
## MLX backend (Apple Silicon)

Install the extra and point sway at any PEFT adapter — the MLX backend
auto-converts on first load and caches the result under
`~/.cache/dlm-sway/mlx-converted/<sha>/`. No manual `.npz` step.

```bash
pip install "dlm-sway[mlx]"

# Either: let sway auto-convert when it loads the adapter
sway run sway.yaml   # spec sets models.ft.kind: mlx, points at any PEFT dir

# Or: convert explicitly (useful for inspection / scripting)
sway convert-adapter ~/path/to/peft-adapter ~/path/to/mlx-adapter
```

Conversion is one-shot — repeated `sway run` invocations on the same
adapter version short-circuit on a content hash. Supports standard
LoRA on `q_proj` / `v_proj` / etc. QLoRA 4-bit and full-weight
`modules_to_save` overrides aren't supported (the converter prints a
clear warning if it sees them).
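The short-circuit is just a content-address lookup. A minimal sketch of
the idea, assuming a hash over the adapter's config + weight files (the
helper names here are illustrative, not sway's internals):

```python
import hashlib
from pathlib import Path

CACHE_ROOT = Path.home() / ".cache" / "dlm-sway" / "mlx-converted"

def adapter_sha(adapter_dir: Path) -> str:
    """Hash the PEFT adapter payload so any retrain invalidates the cache."""
    h = hashlib.sha256()
    for f in sorted(adapter_dir.glob("adapter_*")):  # config + safetensors
        h.update(f.name.encode())
        h.update(f.read_bytes())
    return h.hexdigest()

def cached_conversion(adapter_dir: Path) -> Path | None:
    """Return the already-converted MLX directory for this adapter version, if any."""
    hit = CACHE_ROOT / adapter_sha(adapter_dir)
    return hit if hit.is_dir() else None
```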
## Install from source

For the development HEAD (unreleased changes, contributor workflow):

```bash
git clone https://github.com/tenseleyFlow/sway.git
cd sway

uv venv --python 3.11 .venv    # or: python -m venv .venv
source .venv/bin/activate
uv pip install -e ".[hf]" --group dev
```
## 90-second smoke test

```bash
sway check path/to/adapter --base HuggingFaceTB/SmolLM2-135M-Instruct
```

Outputs a verdict in under a minute on CPU for small models: *your
adapter is 4.2σ above noise* ✅ or *indistinguishable from a null
adapter* ❌.
## Pre-run diagnostics

If your adapter was trained with [DocumentLanguageModel](https://github.com/tenseleyFlow/DocumentLanguageModel),
`sway check` runs the **`gradient_ghost`** probe first — a ~50 ms
disk-only inspection of dlm's `training_state.pt`. It catches
adapters that are obviously broken (e.g. left at `--max-steps 5`
from a smoke test) **before** the model ever loads, with a banner
on top of the regular verdict:

```
⚠️ PRE-RUN ALERT — gradient_ghost flagged severe undertraining
severely undertrained: global_step=2 < threshold 50.
The probe scores below may be unreliable. Consider retraining.
```
The probe's signal ladder, in order of decisiveness:

1. `global_step` below a configurable threshold (default 50) → FAIL.
2. Every per-param `exp_avg_sq` is NaN → FAIL.
3. More than 30% of LoRA layers show >2× the lowest layer's
   gradient variance → FAIL or WARN.
4. Otherwise → PASS.
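A minimal sketch of that ladder, assuming `training_state.pt` carries a
`global_step` plus a standard Adam optimizer state dict with per-param
`exp_avg_sq` tensors (the shipped probe is more defensive about layout):

```python
import torch

def gradient_ghost(state_path: str, min_steps: int = 50) -> str:
    """Disk-only signal ladder over a training_state.pt — no model load."""
    state = torch.load(state_path, map_location="cpu")
    if state["global_step"] < min_steps:
        return "FAIL"  # rung 1: severely undertrained

    sq = [s["exp_avg_sq"] for s in state["optimizer"]["state"].values()
          if "exp_avg_sq" in s]
    if sq and all(torch.isnan(t).all() for t in sq):
        return "FAIL"  # rung 2: optimizer never accumulated a finite gradient

    if sq:
        # rung 3: exp_avg_sq tracks grad², so its per-layer mean proxies gradient variance
        per_layer = torch.stack([t.float().mean() for t in sq])
        if (per_layer > 2 * per_layer.min()).float().mean() > 0.30:
            return "WARN"  # the shipped probe escalates to FAIL on severe spread

    return "PASS"
```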
Specs that contain *only* pre-run diagnostics like `gradient_ghost`
skip backend construction entirely — `sway run` against a pre-flight
spec doesn't load a model, doesn't allocate GPU memory, doesn't pay
the cold-start tax. Useful for `pre-commit` gates that should
fast-path adapter sanity before authorizing the full battery.
The **`training_drift`** probe pairs with `gradient_ghost`: where the
ghost reads optimizer state at end-of-training, drift reads the loss
*curve* during training (from dlm's per-step JSONL logs):

```
⚠️ training_drift flagged unstable training
smoothness=0.42, convergence_ratio=0.91, instability_events=4
```

Four metrics: `final_loss`, `convergence_ratio`, `smoothness`
(1 − var(Δloss)/var(loss)), and `instability_events` (loss-increase
spikes that exceed local typical movement). Verdict PASS when all
three of (smoothness ≥ 0.7, instability_events == 0,
convergence_ratio ≤ 0.7) hold; else WARN. Like `gradient_ghost`, runs in
~10 ms with no model load.
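The formulas restate compactly in numpy. A sketch of the metric math,
given a per-step loss array parsed from the JSONL logs (the instability
detector here is a simplified stand-in for the probe's local-movement
test):

```python
import numpy as np

def training_drift(loss: np.ndarray) -> dict:
    """Loss-curve health metrics; `loss` is one float per training step."""
    fifth = max(len(loss) // 5, 1)
    deltas = np.diff(loss)
    smoothness = 1.0 - deltas.var() / loss.var()            # 1 − var(Δloss)/var(loss)
    convergence_ratio = loss[-fifth:].mean() / loss[:fifth].mean()
    # simplified: a spike is a loss increase beyond 2× the typical step-to-step movement
    instability_events = int((deltas > 2 * np.abs(deltas).mean()).sum())
    verdict = ("PASS" if smoothness >= 0.7 and instability_events == 0
               and convergence_ratio <= 0.7 else "WARN")
    return {"final_loss": float(loss[-1]),
            "convergence_ratio": float(convergence_ratio),
            "smoothness": float(smoothness),
            "instability_events": instability_events,
            "verdict": verdict}
```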
## Full suite

```yaml
# sway.yaml
version: 1
models:
  base: {kind: hf, base: "HuggingFaceTB/SmolLM2-135M-Instruct"}
  ft:   {kind: hf, base: "HuggingFaceTB/SmolLM2-135M-Instruct",
         adapter: "./runs/adapter/v0003"}
suite:
  - {name: null_baseline, kind: null_adapter, runs: 3}
  - {name: doc_divergence, kind: delta_kl,
     prompts: ["The key insight is", "An important rule"]}
  - {name: section_attribution, kind: section_internalization}
  - {name: no_leakage, kind: leakage}
  - {name: ablation_shape, kind: adapter_ablation,
     prompts: ["Tell me more about"]}
```

```bash
sway run sway.yaml            # full report to terminal + JSON
sway gate sway.yaml --junit   # CI-friendly; non-zero on fail

# Override the composite weights on the command line (partial overrides
# are fine — unspecified categories keep their defaults):
sway run sway.yaml --weights "attribution=0.5,adherence=0.2"
```
Inside `sway.yaml`, tuning knobs in `defaults` include:

- `seed` — passed to `seed_everything` before any probe runs.
- `differential` (default `true`) — toggle between the single-load PEFT
  path and a two-model load (doubled memory, rarely needed; for custom
  backends that can't do in-place adapter toggling).
- `score_weights` — per-category weight overrides baked into the spec so
  CI runs reproduce the same score without a CLI flag.
## Why it exists

Standard benchmarks (MMLU, HellaSwag) ask *"how good is this model?"*
That's the wrong question after a targeted LoRA fine-tune on a small
user-authored document. The right question is *"did the adapter actually
move the model toward what I wrote?"* — and existing tools answer this
poorly.
`sway` answers it directly via sixteen primitives across four
categories, plus a baseline-calibration primitive:

| Category    | Primitives |
|-------------|------------|
| Adherence   | `delta_kl`, `adapter_revert`, `prompt_collapse`, `cluster_kl`, `multi_turn_coherence_decay` |
| Attribution | `section_internalization`, `paraphrase_invariance`, `preference_flip`, `tool_use_fidelity` |
| Calibration | `style_fingerprint`, `calibration_drift`, `leakage`, `external_perplexity`, `gradient_ghost`, `training_drift` |
| Ablation    | `adapter_ablation` ← the signature primitive |
| Baseline    | `null_adapter` (powers every z-score in the report) |
**The signature primitive.** `adapter_ablation` scales the LoRA additive
term by λ ∈ {0, 0.25, 0.5, 0.75, 1.0, 1.25} and measures the divergence
curve. A healthy fine-tune shows a smooth, monotonic, non-saturated
response. A degenerate one shows a step function or an
overshoot-then-crash. Nobody else does this because nobody else gets
this close to the adapter math.
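Conceptually the sweep looks like this — a sketch, where
`set_adapter_scale` and `kl_vs_base` stand in for backend internals that
aren't part of the public surface:

```python
import numpy as np

LAMBDAS = [0.0, 0.25, 0.5, 0.75, 1.0, 1.25]

def ablation_curve(backend, prompts: list[str]) -> dict:
    """Trace divergence from base as the LoRA term h = Wx + λ·BAx is scaled."""
    kl = []
    for lam in LAMBDAS:
        backend.set_adapter_scale(lam)   # hypothetical hook: scales the additive term
        kl.append(np.mean([backend.kl_vs_base(p) for p in prompts]))
    kl = np.asarray(kl)
    steps = np.diff(kl)
    return {
        "monotonic": bool((steps >= -1e-6).all()),              # healthy: divergence grows with λ
        "step_function": bool(steps.max() > 0.8 * np.ptp(kl)),  # one rung carries the whole jump
        "overshoot": bool(kl.argmax() < len(kl) - 1),           # peaks before λ = 1.25
    }
```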
**The calibration.** Every numeric probe z-scores its raw metric against
a null-adapter baseline — a same-structure LoRA with random-init weights.
"Your adapter's KL is 4.2σ above noise" is a far stronger claim than a
fixed threshold. The null-adapter calibration requires a backend that
implements `NullCalibratedBackend` (the HF backend does); probes that
can't be calibrated (e.g., `adapter_revert` needs an embedder, which the
null proxy doesn't have) surface `(no calibration)` in the report and
fall back to fixed thresholds. Calibration stats are cached on disk
under `~/.dlm-sway/null-stats/`, keyed by backend identity.
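The z-score itself is the ordinary one, taken over repeated null-adapter
runs (a sketch; the cached stats ultimately reduce to a mean and std per
metric):

```python
import statistics

def z_above_noise(ft_metric: float, null_samples: list[float]) -> float:
    """Sigmas separating the fine-tuned adapter from the null-adapter noise floor."""
    mu = statistics.mean(null_samples)    # e.g. 3 samples from {kind: null_adapter, runs: 3}
    sigma = statistics.stdev(null_samples) or 1e-9
    return (ft_metric - mu) / sigma

# "4.2σ above noise" is z_above_noise(ft_kl, null_kls) ≈ 4.2
```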
**The rank profile.** `null_adapter` takes an optional
`rank_multipliers: list[float]` (default `[1.0]`). Pass
`[0.5, 1.0, 2.0]` and every numeric probe carries a three-point
z-score curve: `z=+4.2σ @ 1x / +6.8σ @ 0.5x / +2.1σ @ 2x`. The shape
is diagnostic:

- **Flat or slightly rising toward 0.5x** — adapter signal is
  rank-stable, roughly independent of noise energy.
- **Sharply higher at 0.5x, lower at 2x** — adapter is rank-saturated:
  a smaller rank would have yielded a clearer separation from noise.
  Consider halving `r`.
- **Low everywhere** — adapter is barely above noise at any rank;
  the signal is real but weak.

Caveat: high z at low rank can also mean the low-rank null is
*pathologically quiet* rather than that the adapter is strong. Read the
profile as a shape, not a scalar — if all three z's move proportionally,
the adapter is doing work; if they spread apart, the rank is mis-sized.
Implementation note: rank scaling is mathematically equivalent to
multiplying the null noise std by `sqrt(rank_scale)` (LoRA's A·B output
variance scales linearly with rank). The shipped backends apply that
scaling rather than reshaping PEFT tensors — no model reload, no
rank-specific adapter cache, same `alpha/r` scaling throughout.
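That variance claim is easy to sanity-check numerically: a random-init
(null) LoRA output B·A·x is a sum of r independent rank-1 contributions,
so its variance grows linearly in r and its std like √r. A quick numpy
check (not library code):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=256)

def null_lora_std(r: int, trials: int = 500) -> float:
    """Std of one output coordinate of B·A·x over random-init LoRA draws."""
    samples = [
        (rng.normal(scale=0.02, size=(256, r))
         @ (rng.normal(scale=0.02, size=(r, 256)) @ x))[0]
        for _ in range(trials)
    ]
    return float(np.std(samples))

print(null_lora_std(8) / null_lora_std(4))    # ≈ sqrt(2) ≈ 1.41
print(null_lora_std(16) / null_lora_std(4))   # ≈ sqrt(4) = 2.0
```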
**Determinism.** Every `sway run` calls `seed_everything(spec.defaults.seed)`
before the first probe — it seeds the python/numpy/torch RNGs and asks torch
for deterministic algorithms (`CUBLAS_WORKSPACE_CONFIG=:4096:8`). The
report footer prints the achieved class — `strict` (CUDA), `best_effort`
(CPU/MPS), or `loose` (deterministic algorithms refused). Same seed +
same host = bit-identical scoring across runs.
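A sketch of what a `seed_everything` of this shape does (the shipped
function may differ in detail; the determinism-class mapping follows the
report-footer description above):

```python
import os
import random

import numpy as np
import torch

def seed_everything(seed: int) -> str:
    """Seed every RNG in play and request deterministic kernels."""
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # needed for deterministic CUBLAS
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)   # also seeds all CUDA devices when present
    try:
        torch.use_deterministic_algorithms(True)
    except RuntimeError:      # refusal may also surface later, at the first offending op
        return "loose"
    return "strict" if torch.cuda.is_available() else "best_effort"
```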
## Pytest integration

For teams already testing their training pipeline with pytest, sway
ships a plugin behind the `[pytest]` extra. A single decorator turns
one pytest function into one test item per probe plus an optional
composite-score gate:

```python
import pytest

@pytest.mark.sway(spec="sway.yaml", threshold=0.6)
def test_adapter_healthy() -> None:
    """The decorator owns the body — a bare pass is conventional."""
```

`pytest -v` then reports:

```
test_sway_gate.py::test_adapter_healthy::adherence    PASSED
test_sway_gate.py::test_adapter_healthy::calibration  PASSED
test_sway_gate.py::test_adapter_healthy::__gate__     PASSED
```

`--junitxml` emits one `<testcase>` per probe, `pytest -k adherence`
runs just that probe, and `FAIL` / `ERROR` / `SKIP` verdicts translate to
pytest outcomes. See `examples/pytest_integration/` for a full
before/after walkthrough.

```bash
pip install 'dlm-sway[hf,pytest]'
```
## Pre-commit

For teams using [pre-commit.com](https://pre-commit.com), sway ships
a `.pre-commit-hooks.yaml` declaring three hooks that run `sway gate`
before every commit touching a spec, `.dlm` document, or adapter
file. Add 4–5 lines to your `.pre-commit-config.yaml`:

```yaml
repos:
  - repo: https://github.com/tenseleyFlow/sway
    rev: v0.1.0
    hooks:
      - id: sway-gate
        args: ["sway.yaml", "--threshold=0.6"]
```
Three variants ship; pick whichever fits your install posture:

| Hook | When to use | First-run cost |
|---|---|---|
| `sway-gate` | you already ran `pip install 'dlm-sway[hf]'` | ~none — uses the sway binary on your `PATH` |
| `sway-gate-isolated` | fresh venv, no existing sway install | ~2 min + ~5 GB — pre-commit builds a fresh venv and installs sway + torch + transformers |
| `sway-gate-docker` | zero-install hosts with docker available | ~1 min — pulls `ghcr.io/tenseleyflow/sway-gate:v0.1.0` (torch baked in, MiniLM weights pre-cached) |

The recommended default is `sway-gate`. Switch to
`sway-gate-isolated` if you can't rely on a host-level sway install.
Reach for `sway-gate-docker` on ephemeral CI runners where docker is
cheaper than a fresh venv.
### Rev pinning

The example above pins to the `v0.1.0` tag. Bump it deliberately
when you want to pick up a new release; `pre-commit autoupdate` will
surface newer tags when you run it explicitly.
### Scope

The hook **only gates** — exits non-zero on FAIL, zero on PASS. No
`--json` / `--markdown` report flags are surfaced; those belong in
`sway run` (ad-hoc or in a separate CI job). Keeps `git commit` fast
and the gate's verdict uncluttered.
See [`examples/precommit-example/`](examples/precommit-example/) for
the full walk-through, including the `sway.yaml` template, the
consumer-side `.pre-commit-config.yaml`, and the
try-it-locally-before-you-install recipe.
## GitHub Action

For repos that don't use pre-commit, the
[`tenseleyflow/sway-action`](https://github.com/tenseleyFlow/sway-action)
GitHub Action wraps `sway gate` plus a markdown report posted as a PR
comment in three lines of YAML:

```yaml
- uses: tenseleyflow/sway-action@v0.1.0
  with:
    spec-path: sway.yaml
```
The action handles install, caching, gate execution, edit-in-place PR comments (so rebases don't spam), and exit-code translation. A complete workflow:
```yaml
name: sway
on:
  pull_request:
    paths: ["**/*.dlm", "**/sway.yaml"]
permissions:
  contents: read
  pull-requests: write
jobs:
  sway:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: tenseleyflow/sway-action@v0.1.0
        with:
          spec-path: sway.yaml
```
Inputs include `fail-on` (`fail` / `warn` / `never`),
`comment-on-pr`, `upload-artifact`, and `sway-version` for pinning.
Outputs `sway-score`, `verdict`, and `report-path` are available to
downstream steps. See the action repo's README for the full surface.
## The `.dlm` integration

If you trained your adapter via the [DocumentLanguageModel
project](https://github.com/tenseleyFlow/DocumentLanguageModel), `sway`
auto-generates a test suite from your document's sections.

Install sway with the `[dlm]` extra alongside `[hf]` —
`pip install "dlm-sway[hf,dlm]"`, or editable from a clone:

```bash
# inside a clone of this repo
uv pip install -e ".[hf,dlm]"
```
Then:

```bash
sway autogen path/to/doc.dlm -o sway.yaml
sway run sway.yaml
```
Per-section attribution tells you *which* parts of your document
actually moved the model — a kind of signal no other tool provides.
> Heads up — once dlm publishes its sister change, `dlm export --to
> sway-json` writes a ready-to-run `sway.yaml` next to the GGUF.
> Skip the manual `sway autogen` step entirely.
## Tool-use fidelity

Adapters that target tool-calling behavior have a failure mode no
other probe catches: they pass every adherence / attribution probe
but silently degrade the base model's JSON-schema compliance, or
start hallucinating tool names that aren't in the declared surface.
The `tool_use_fidelity` probe scores three independent signals on
each `(prompt, tool_spec)` case (a scoring sketch follows at the end
of this section):

- **JSON-schema validity delta** — `ft_valid_rate − base_valid_rate`.
  A negative value means the adapter regressed on producing
  schema-valid calls; the default pass criterion tolerates a 5pp
  drop.
- **Tool-name hallucination rate** — over schema-valid ft calls, the
  fraction that pick a tool outside the declared `allowed_tools`
  surface (or differ from `gold_tool_name` when no surface is set).
  Default cap 10%.
- **Argument-field disagreement rate** — over cases where both
  views produce schema-valid calls, the per-leaf-field disagreement
  rate between base and ft arguments. Surfaced as evidence — the
  v1 surface doesn't gate on it.
```yaml
suite:
  - name: tool_calls_intact
    kind: tool_use_fidelity
    cases:
      - prompt: "Search the web for the capital of France."
        tool_spec:
          name: search_web
          parameters:
            type: object
            properties: {query: {type: string}}
            required: [query]
        gold_tool_name: search_web
        allowed_tools: [search_web, calculator, http_fetch]
```
The probe uses an OpenAI-style function-calling schema (the most common shape across hosted and OSS tool-use stacks). Other schema flavors (Anthropic, Llama-3.1, Gemini) land in a follow-up sprint; the v1 implementation deliberately ships the format that covers the largest user surface.
`json_valid_rate_ft` is z-scored against the null-adapter baseline
when `null_adapter` is in the suite — the principled "the adapter
preserved the base's tool-call structure beyond noise" signal.
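Here's the promised sketch of the three signals, using the `jsonschema`
package over already-parsed calls — `calls_base` / `calls_ft` are lists
of `{tool_name, arguments}` dicts, one per sampled completion; parsing
and the leaf-field diff are elided:

```python
import jsonschema

def is_valid(call: dict, schema: dict) -> bool:
    try:
        jsonschema.validate(call["arguments"], schema)
        return True
    except jsonschema.ValidationError:
        return False

def tool_fidelity(calls_base, calls_ft, schema, allowed_tools) -> dict:
    base_valid_rate = sum(is_valid(c, schema) for c in calls_base) / len(calls_base)
    ft_valid = [c for c in calls_ft if is_valid(c, schema)]
    ft_valid_rate = len(ft_valid) / len(calls_ft)
    hallucinated = sum(c["tool_name"] not in allowed_tools for c in ft_valid)
    return {
        "validity_delta": ft_valid_rate - base_valid_rate,           # pass: ≥ −0.05 (5pp tolerance)
        "hallucination_rate": hallucinated / max(len(ft_valid), 1),  # pass: ≤ 0.10
        # argument-field disagreement would diff leaf fields of paired
        # schema-valid base/ft arguments here — evidence only in v1
    }
```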
## Multi-turn coherence

Every other adherence probe is single-turn: one user message, one
ft response, one score. Adapters that pass `delta_kl` cleanly
frequently *forget their training* by turn 2 or 3 of a real
dialogue — the model's own previous responses fill the context
window and create compounding drift. The
`multi_turn_coherence_decay` probe rolls a multi-turn synthetic
dialogue per prompt and fits an exponential-decay curve to the
per-turn KL:
```yaml
suite:
  - name: holds_a_conversation
    kind: multi_turn_coherence_decay
    prompts:
      - "Explain how a neural network learns."
      - "What's the difference between TCP and UDP?"
    max_turns: 4
    assert_half_life_turns: 2.0
```
The probe reports `half_life_turns` (the turn at which adapter
influence is halved), a per-turn KL list, and a tiny ASCII
sparkline in the report message so you can see the curve shape
without opening the JSON. Bases without a `chat_template` SKIP
gracefully — multi-turn requires one to format dialogue history.

The probe deliberately **doesn't** z-score against the null-adapter
baseline (a null adapter has no coherence to decay; the null
distribution is meaningless). Fixed-threshold verdicts are the
published path. Mirrors `prompt_collapse`.
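The half-life falls out of an ordinary log-linear fit. A sketch of the
curve math (assuming strictly positive per-turn KL values):

```python
import numpy as np

def half_life_turns(per_turn_kl: list[float]) -> float:
    """Fit kl_t ≈ kl_0 · exp(−λ·t); return ln(2)/λ, the turn at which
    adapter influence halves (assert_half_life_turns gates on this)."""
    t = np.arange(len(per_turn_kl))
    slope, _ = np.polyfit(t, np.log(per_turn_kl), 1)  # log-linear least squares
    lam = max(-slope, 1e-9)                           # guard non-decaying curves
    return float(np.log(2) / lam)

print(half_life_turns([0.80, 0.55, 0.40, 0.28]))      # ≈ 2 turns
```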
## Serve daemon

Loading the HF backend takes ~15s every time you run `sway run`. For
notebook exploration, the `sway watch` retrain loop, or any flow that
fires the suite repeatedly against the same model, that startup is
the dominant cost. `sway serve` keeps the backend warm in a small
FastAPI daemon: the first request pays the load, subsequent requests
reuse the cached weights.

```bash
pip install 'dlm-sway[serve]'   # adds fastapi + uvicorn + httpx
sway serve --port 8787 --max-loaded-models 2
```
Default bind is `127.0.0.1`. The daemon refuses to start on a
non-loopback interface unless you pass `--api-key <token>`, after
which every non-`/health` request must carry
`Authorization: Bearer <token>`.
```bash
# With curl — the sweet spot is inside a notebook or watch loop.
curl -s -X POST http://localhost:8787/run \
  -H 'Content-Type: application/json' \
  -d "$(yq -o=json sway.yaml | jq -c '{spec: .}')" | jq .score
```

```python
# From a notebook (or any Python).
from dlm_sway.serve.client import ServeClient
from dlm_sway.suite.loader import load_spec

client = ServeClient("http://localhost:8787")
report = client.run(load_spec("sway.yaml"))
print(report["score"])            # full report shape mirrors `sway run --json`
print(report["request_seconds"])  # cold ~15s; warm ~2s
```
The cache is keyed on `(kind, base, adapter, dtype, device)` and capped
at `--max-loaded-models` (default 2). Loading a third distinct model
LRU-evicts the oldest, calling `backend.close()` to release the
weights. `GET /health` reports the currently-warm models;
`GET /stats` reports request count and mean latency.
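The eviction semantics are a plain LRU over the composite key. A sketch
(with `load_backend` as a hypothetical stand-in for the real
constructor):

```python
from collections import OrderedDict

class ModelCache:
    """LRU of warm backends keyed on (kind, base, adapter, dtype, device)."""

    def __init__(self, max_loaded: int = 2):
        self.max_loaded = max_loaded
        self._warm: OrderedDict[tuple, object] = OrderedDict()

    def get(self, key: tuple):
        if key in self._warm:
            self._warm.move_to_end(key)   # mark most-recently-used
            return self._warm[key]
        backend = load_backend(*key)      # hypothetical — pays the ~15s cold load
        self._warm[key] = backend
        if len(self._warm) > self.max_loaded:
            _, evicted = self._warm.popitem(last=False)  # drop least-recently-used
            evicted.close()               # release the weights
        return backend
```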
## Reproducing a sway run

Sometimes you want a coworker (or a future-you, or a bug report) to
re-execute exactly what you ran without recreating your environment.
`sway pack` bundles the spec + the source `.dlm` document + your
locally-cached null-stats + (optionally) a known-good report into a
single `*.swaypack.tar.gz` you can email or commit:

```bash
sway pack sway.yaml -o audit-2026-04.swaypack.tar.gz \
    --include-golden last-run.json
# wrote audit-2026-04.swaypack.tar.gz (4.2 KB) sections=312b null_stats=0 golden=yes

# share the tarball with a coworker
sway unpack audit-2026-04.swaypack.tar.gz
# extracted:  <cwd>/swaypack
# spec_path:  <cwd>/swaypack/sway.yaml
# ...
# To run the bundled spec:
#   SWAY_NULL_CACHE_DIR=<cwd>/swaypack/null-stats sway run <cwd>/swaypack/sway.yaml
```
The packed null-stats cache means the consumer doesn't pay the
re-calibration cost (~120 forward passes on a typical 10-probe
suite). The unpack output prints the exact `SWAY_NULL_CACHE_DIR=...`
env var to use; that variable points null-stats lookups at the bundled
cache instead of `~/.dlm-sway/`.

A bundled golden JSON makes it trivial for the consumer to verify
their re-run matches yours. Packs default to a 50 MB size cap;
override with `--max-size-mb`.
## Status

Alpha. The API will break before v1.0. `v0.1.0` is the first published
tag on PyPI; for unreleased changes, install editable from source
(see [Install from source](#install-from-source)).
## License

MIT