
# sway

Differential testing for fine-tuned causal language models.

> **Alpha — v0.1.0 on PyPI.** API is not stable; semantic versioning applies only from v1.0 onward. Feedback + issues welcome.

**One question:** *did LoRA/QLoRA training actually change model behavior in a meaningful way, or is the model just defaulting to the pretrained base?*

`sway` gives you a trustworthy, reproducible answer with thirteen purpose-built primitives, each z-scored against a null-adapter baseline. No LLM judges. No external APIs. Deterministic on CPU where possible.

> **Naming convention.** The source repo and CLI entry point are both `sway`. The PyPI wheel is `dlm-sway` because the short `sway` name is taken on PyPI by an unrelated project. The CLI installed by `pip install dlm-sway` is still `sway` — mismatched wheel/command names are a common Python packaging pattern (see `pyyaml` → `import yaml`).

## Install

```bash
# HF + PEFT backend — required for real models
pip install "dlm-sway[hf]"

# Extras composable as usual
pip install "dlm-sway[hf,style,semsim]"
pip install "dlm-sway[all]"

# .dlm auto-suite generation (requires the DLM sibling project)
pip install "dlm-sway[dlm]"
```

Available extras:

- `[hf]` — HuggingFace + PEFT backend (required for real models)
- `[mlx]` — Apple Silicon MLX backend (darwin-arm64 only)
- `[style]` — stylistic fingerprint extensions (spaCy + textstat + nlpaug)
- `[semsim]` — sentence-transformers for the revert probe
- `[dlm]` — auto-generate suites from `.dlm` documents
- `[viz]` — matplotlib plots
- `[all]` — everything

Verify the install:

```bash
sway --version
sway doctor
```

## Install from source

For the development HEAD (unreleased changes, contributor workflow):

```bash
git clone https://github.com/tenseleyFlow/sway.git
cd sway

uv venv --python 3.11 .venv      # or: python -m venv .venv
source .venv/bin/activate
uv pip install -e ".[hf]" --group dev
```

## 90-second smoke test

```bash
sway check path/to/adapter --base HuggingFaceTB/SmolLM2-135M-Instruct
```

Outputs a verdict in under a minute on CPU for small models: *your adapter is 4.2σ above noise* ✅ or *indistinguishable from a null adapter* ❌.

## Full suite

```yaml
# sway.yaml
version: 1
models:
  base: {kind: hf, base: "HuggingFaceTB/SmolLM2-135M-Instruct"}
  ft:   {kind: hf, base: "HuggingFaceTB/SmolLM2-135M-Instruct",
         adapter: "./runs/adapter/v0003"}
suite:
  - {name: null_baseline,       kind: null_adapter, runs: 3}
  - {name: doc_divergence,      kind: delta_kl,
     prompts: ["The key insight is", "An important rule"]}
  - {name: section_attribution, kind: section_internalization}
  - {name: no_leakage,          kind: leakage}
  - {name: ablation_shape,      kind: adapter_ablation,
     prompts: ["Tell me more about"]}
```

```bash
sway run sway.yaml              # full report to terminal + JSON
sway gate sway.yaml --junit     # CI-friendly; non-zero on fail

# Override the composite weights on the command line (partial overrides
# are fine — unspecified categories keep their defaults):
sway run sway.yaml --weights "attribution=0.5,adherence=0.2"
```

Inside `sway.yaml`, tuning knobs in `defaults` include (a minimal example follows the list):

- `seed` — passed to `seed_everything` before any probe runs.
- `differential` (default `true`) — toggles between the single-load PEFT path and a two-model load (doubled memory, rarely needed; for custom backends that can't do in-place adapter toggling).
- `score_weights` — per-category weight overrides baked into the spec so CI runs reproduce the same score without a CLI flag.
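
A minimal `defaults` block, for illustration only — the key names come from the list above, the values are made up:

```yaml
defaults:
  seed: 1234              # forwarded to seed_everything before the first probe
  differential: true      # single-load PEFT path; set false for a two-model load
  score_weights:          # same categories the --weights CLI flag accepts
    attribution: 0.5
    adherence: 0.2
```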

## Why it exists

Standard benchmarks (MMLU, HellaSwag) ask *"how good is this model?"* That's the wrong question after a targeted LoRA fine-tune on a small user-authored document. The right question is *"did the adapter actually move the model toward what I wrote?"* — and existing tools answer this poorly.

`sway` answers it directly via thirteen primitives — twelve across four categories, plus a baseline-calibration primitive:

| Category    | Primitives |
|-------------|------------|
| Adherence   | `delta_kl`, `adapter_revert`, `prompt_collapse`, `cluster_kl` |
| Attribution | `section_internalization`, `paraphrase_invariance`, `preference_flip` |
| Calibration | `style_fingerprint`, `calibration_drift`, `leakage`, `external_perplexity` |
| Ablation    | `adapter_ablation` ← the signature primitive |
| Baseline    | `null_adapter` (powers every z-score in the report) |

**The signature primitive.** `adapter_ablation` scales the LoRA additive term by λ ∈ {0, 0.25, 0.5, 0.75, 1.0, 1.25} and measures the divergence curve. A healthy fine-tune shows a smooth, monotonic, non-saturated response. A degenerate one shows a step function or an overshoot-then-crash. Nobody else does this because nobody else gets this close to the adapter math.
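
To make the λ sweep concrete, here is a self-contained toy of the underlying math — a random linear layer standing in for a transformer block, with the LoRA term scaled before the divergence is measured. This is not sway's implementation; every tensor here is synthetic.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_in, d_out, r, alpha = 64, 32, 8, 16

W0 = torch.randn(d_out, d_in) / d_in**0.5   # frozen base weight
A = 0.1 * torch.randn(r, d_in)              # stand-in "trained" LoRA factors
B = 0.1 * torch.randn(d_out, r)
x = torch.randn(d_in)

base_logp = F.log_softmax(W0 @ x, dim=-1)
for lam in (0.0, 0.25, 0.5, 0.75, 1.0, 1.25):
    # scale only the additive term: W_eff = W0 + λ · (α/r) · B·A
    logits = (W0 + lam * (alpha / r) * (B @ A)) @ x
    scaled_logp = F.log_softmax(logits, dim=-1)
    # KL(scaled ‖ base); a healthy curve rises smoothly with λ
    kl = F.kl_div(base_logp, scaled_logp, log_target=True, reduction="sum")
    print(f"λ={lam:4.2f}  KL={kl.item():.4f}")
```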

**The calibration.** Every numeric probe z-scores its raw metric against a null-adapter baseline — a same-structure LoRA with random-init weights. "Your adapter's KL is 4.2σ above noise" is a far stronger claim than a fixed threshold. The null-adapter calibration requires a backend that implements `NullCalibratedBackend` (the HF backend does); probes that can't be calibrated (e.g., `adapter_revert` needs an embedder, which the null proxy lacks) surface `(no calibration)` in the report and fall back to fixed thresholds. Calibration stats are cached on disk under `~/.dlm-sway/null-stats/`, keyed by backend identity.
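
The z-score itself is ordinary standardization. A sketch of the arithmetic — illustrative only; `z_score` and its inputs are not sway's API:

```python
import statistics

def z_score(adapter_metric: float, null_runs: list[float]) -> float:
    """Standard deviations above the null-adapter noise floor."""
    mu = statistics.fmean(null_runs)
    sigma = statistics.stdev(null_runs)
    return (adapter_metric - mu) / sigma

# e.g. three null_adapter runs of a KL probe vs. the real adapter's KL
print(f"{z_score(0.194, [0.11, 0.09, 0.13]):+.1f}σ above noise")  # → +4.2σ
```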

**The rank profile.** `null_adapter` takes an optional `rank_multipliers: list[float]` (default `[1.0]`). Pass `[0.5, 1.0, 2.0]` and every numeric probe carries a three-point z-score curve: `z=+4.2σ @ 1x / +6.8σ @ 0.5x / +2.1σ @ 2x`. The shape is diagnostic:

- **Flat or slightly rising toward 0.5x** — adapter signal is rank-stable, roughly independent of noise energy.
- **Sharply higher at 0.5x, lower at 2x** — adapter is rank-saturated: a smaller rank would have yielded a clearer separation from noise. Consider halving `r`.
- **Low everywhere** — adapter is barely above noise at any rank; the signal is real but weak.

Caveat: high z at low rank can also mean the low-rank null is *pathologically quiet* rather than that the adapter is strong. Read the profile as a shape, not a scalar — if all three z's move proportionally, the adapter is doing work; if they spread apart, the rank is mis-sized.

Implementation note: rank scaling is mathematically equivalent to multiplying the null noise std by `sqrt(rank_scale)` (LoRA's A·B output variance scales linearly with rank). The shipped backends apply that scaling rather than reshaping PEFT tensors — no model reload, no rank-specific adapter cache, same `alpha/r` scaling throughout.
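
The variance claim is easy to check numerically. A small sketch (synthetic matrices, not sway code) showing that the output variance of a random-init B·A grows linearly with rank, so the std grows with the square root of the rank multiplier:

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma, trials = 256, 0.02, 200
x = rng.normal(size=d)

for r in (4, 8, 16):                              # doubling rank each step
    variances = []
    for _ in range(trials):
        A = rng.normal(0.0, sigma, size=(r, d))   # random-init null factors
        B = rng.normal(0.0, sigma, size=(d, r))
        variances.append((B @ (A @ x)).var())
    v = np.mean(variances)
    print(f"r={r:2d}  var={v:.3e}  var/r={v / r:.3e}")  # var/r ≈ constant
```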

**Determinism.** Every `sway run` calls `seed_everything(spec.defaults.seed)` before the first probe — it seeds the python/numpy/torch RNGs and asks torch for deterministic algorithms (`CUBLAS_WORKSPACE_CONFIG=:4096:8`). The report footer prints the achieved class — `strict` (CUDA), `best_effort` (CPU/MPS), or `loose` (deterministic algorithms refused). Same seed + same host = bit-identical scoring across runs.
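
A hedged sketch of what such a helper typically does — the real `seed_everything` lives in sway and may differ in detail:

```python
import os
import random

import numpy as np
import torch

def seed_everything(seed: int) -> str:
    """Seed all RNGs, request deterministic kernels, report the achieved class."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)                      # also seeds CUDA, if present
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    try:
        torch.use_deterministic_algorithms(True)
    except RuntimeError:
        return "loose"                           # deterministic algorithms refused
    return "strict" if torch.cuda.is_available() else "best_effort"
```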

## Pytest integration

For teams already testing their training pipeline with pytest, sway ships a plugin behind the `[pytest]` extra. A single decorator turns one pytest function into one test item per probe, plus an optional composite-score gate:

```python
import pytest

@pytest.mark.sway(spec="sway.yaml", threshold=0.6)
def test_adapter_healthy() -> None:
    """The decorator owns the body — a bare pass is conventional."""
```

`pytest -v` then reports:

```
test_sway_gate.py::test_adapter_healthy::adherence    PASSED
test_sway_gate.py::test_adapter_healthy::calibration  PASSED
test_sway_gate.py::test_adapter_healthy::__gate__     PASSED
```

`--junitxml` emits one `<testcase>` per probe, `pytest -k adherence` runs just that probe, and `FAIL` / `ERROR` / `SKIP` verdicts translate to pytest outcomes. See `examples/pytest_integration/` for a full before/after walkthrough.
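
For example, selecting one probe and emitting JUnit XML uses stock pytest flags (the report filename here is arbitrary):

```bash
pytest -v -k adherence --junitxml=sway-report.xml
```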

```bash
pip install 'dlm-sway[hf,pytest]'
```

## Pre-commit

For teams using [pre-commit.com](https://pre-commit.com), sway ships a `.pre-commit-hooks.yaml` declaring two hooks that run `sway gate` before every commit touching a spec, `.dlm` document, or adapter file. Add 4–5 lines to your `.pre-commit-config.yaml`:

```yaml
repos:
  - repo: https://github.com/tenseleyFlow/sway
    rev: 2ecd9a0c9d65a9b9576a185597c88f41444f9646  # pin to a SHA
    hooks:
      - id: sway-gate
        args: ["sway.yaml", "--threshold=0.6"]
```

Two variants ship; pick whichever fits your install posture:

| Hook | When to use | First-run cost |
|------|-------------|----------------|
| `sway-gate` | you already ran `pip install 'dlm-sway[hf]'` | ~none — uses the sway binary on your `PATH` |
| `sway-gate-isolated` | fresh venv, no existing sway install | ~2 min + ~5 GB — pre-commit builds a fresh venv and installs sway + torch + transformers |

The recommended default is `sway-gate`. Switch to `sway-gate-isolated` if you can't rely on a host-level sway install.

### Rev pinning

The example above pins to a commit SHA because the sway repo has no tagged release yet. Pinning to a mutable ref would silently drift your gate's behavior under every `pre-commit autoupdate`; a SHA is the honest pre-tag pattern. Bump it deliberately when you want to pick up upstream changes. Once the repo tags a release, the recipe switches to `rev: v0.1.0` and the SHA churn stops.

### Scope

The hook **only gates** — it exits non-zero on FAIL and zero on PASS. No `--json` / `--markdown` report flags are surfaced; those belong with `sway run` (ad hoc or in a separate CI job). This keeps `git commit` fast and the gate's verdict uncluttered.

See [`examples/precommit-example/`](examples/precommit-example/) for the full walk-through, including the `sway.yaml` template, the consumer-side `.pre-commit-config.yaml`, and the try-it-locally-before-you-install recipe.

## The `.dlm` integration

If you trained your adapter via the [DocumentLanguageModel project](https://github.com/tenseleyFlow/DocumentLanguageModel), `sway` auto-generates a test suite from your document's sections.

Install sway with the `[dlm]` extra alongside `[hf]` (shown here editable, from a source checkout):

```bash
# inside a clone of this repo
uv pip install -e ".[hf,dlm]"
```

Then:

```bash
sway autogen path/to/doc.dlm -o sway.yaml
sway run sway.yaml
```

Per-section attribution tells you *which* parts of your document actually moved the model — a kind of signal no other tool provides.

## Status

Alpha. The API will break before v1.0. `v0.1.0` is published on PyPI (see [Install](#install)); the development HEAD is installable editable from source (see [Install from source](#install-from-source)) and tracks the tip of `main`.

## License

MIT
