# sway

Differential testing for fine-tuned causal language models.

> **Alpha — v0.1.0 on PyPI.** API is not stable; semantic versioning
> applies only from v1.0 onward. Feedback + issues welcome.

**One question:** *did LoRA/QLoRA training actually change model behavior in a meaningful way, or is the model just defaulting to the pretrained base?*

`sway` gives you a trustworthy, reproducible answer with sixteen purpose-built primitives, each z-scored against a null-adapter baseline. No LLM judges. No external APIs. Deterministic on CPU where possible.

> **Naming convention.** The source repo and CLI entry point are both
> `sway`. The PyPI wheel is `dlm-sway` because the short `sway` name is
> taken on PyPI by an unrelated project. The CLI installed by
> `pip install dlm-sway` is still `sway` — a distribution name that
> differs from the command or import name it installs is common on PyPI
> (see `pyyaml` → `import yaml`).

## Install

```bash
# HF + PEFT backend — required for real models
pip install "dlm-sway[hf]"

# Extras composable as usual
pip install "dlm-sway[hf,style,semsim]"
pip install "dlm-sway[all]"

# .dlm auto-suite generation (requires the DLM sibling project)
pip install "dlm-sway[dlm]"
```

Available extras:

- `[hf]` — HuggingFace + PEFT backend (required for real models)
- `[mlx]` — Apple Silicon MLX backend (darwin-arm64 only)
- `[style]` — stylistic fingerprint extensions (spaCy + textstat + nlpaug)
- `[semsim]` — sentence-transformers for the revert probe
- `[dlm]` — auto-generate suites from `.dlm` documents
- `[pytest]` — pytest plugin (see [Pytest integration](#pytest-integration))
- `[serve]` — warm-model daemon (see [Serve daemon](#serve-daemon))
- `[viz]` — matplotlib plots
- `[all]` — everything

Verify the install:

```bash
sway --version
sway doctor
```

## MLX backend (Apple Silicon)

Install the extra and point sway at any PEFT adapter — the MLX backend auto-converts on first load and caches the result under `~/.cache/dlm-sway/mlx-converted/<hash>/`. No manual `.npz` step.

```bash
pip install "dlm-sway[mlx]"

# Either: let sway auto-convert when it loads the adapter
sway run sway.yaml   # spec sets models.ft.kind: mlx, points at any PEFT dir

# Or: convert explicitly (useful for inspection / scripting)
sway convert-adapter ~/path/to/peft-adapter ~/path/to/mlx-adapter
```

Conversion is one-shot — repeated `sway run` invocations on the same adapter version short-circuit on a content hash. Supports standard LoRA on `q_proj` / `v_proj` / etc. QLoRA 4-bit and full-weight `modules_to_save` overrides aren't supported (the converter prints a clear warning if it sees them).

## Install from source

For the development HEAD (unreleased changes, contributor workflow):

```bash
git clone https://github.com/tenseleyFlow/sway.git
cd sway
uv venv --python 3.11 .venv   # or: python -m venv .venv
source .venv/bin/activate
uv pip install -e ".[hf]" --group dev
```

## 90-second smoke test

```bash
sway check path/to/adapter --base HuggingFaceTB/SmolLM2-135M-Instruct
```

Outputs a verdict in under a minute on CPU for small models: *your adapter is 4.2σ above noise* ✅ or *indistinguishable from a null adapter* ❌.

## Pre-run diagnostics

If your adapter was trained with [DocumentLanguageModel](https://github.com/tenseleyFlow/DocumentLanguageModel), `sway check` runs the **`gradient_ghost`** probe first — a ~50 ms disk-only inspection of dlm's `training_state.pt`. It catches adapters that are obviously broken (e.g. left at `--max-steps 5` from a smoke test) **before** the model ever loads, with a banner on top of the regular verdict:

```
⚠️  PRE-RUN ALERT — gradient_ghost flagged severe undertraining
    severely undertrained: global_step=2 < threshold 50.
    The probe scores below may be unreliable. Consider retraining.
```

The probe's signal ladder, in order of decisiveness (a code sketch follows the list):

1. `global_step` below a configurable threshold (default 50) → FAIL.
2. Every per-param `exp_avg_sq` is NaN → FAIL.
3. More than 30% of LoRA layers show >2× the lowest layer's gradient variance → FAIL or WARN.
4. Otherwise → PASS.
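For intuition, here's a minimal sketch of that ladder in plain PyTorch. The checkpoint layout assumed here (`global_step` at the top level, Adam's `exp_avg_sq` per parameter under the optimizer state) is standard optimizer `state_dict` shape, not a documented dlm contract — treat it as illustrative, not as the shipped probe.

```python
import torch

def gradient_ghost(path: str, step_threshold: int = 50) -> str:
    """Illustrative signal ladder — disk-only, no model load."""
    state = torch.load(path, map_location="cpu")

    # 1. Undertrained: too few optimizer steps.
    if state["global_step"] < step_threshold:
        return "FAIL"

    # Adam second-moment estimates, one tensor per LoRA parameter.
    # (Key layout is assumed; dlm's actual checkpoint may nest differently.)
    sq = [s["exp_avg_sq"] for s in state["optimizer"]["state"].values()]

    # 2. Dead optimizer state: every second moment is NaN.
    if all(torch.isnan(t).all() for t in sq):
        return "FAIL"

    # 3. Uneven training: >30% of layers carry >2x the quietest layer's
    #    gradient variance (exp_avg_sq mean used as the variance proxy).
    variances = sorted(float(t.float().mean()) for t in sq)
    noisy = sum(v > 2 * variances[0] for v in variances)
    if noisy / len(variances) > 0.30:
        return "WARN"  # per the ladder above, severe cases escalate to FAIL

    return "PASS"
```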
Specs that contain *only* pre-run diagnostics like `gradient_ghost` skip backend construction entirely — `sway run` against a pre-flight spec doesn't load a model, doesn't allocate GPU memory, doesn't pay the cold-start tax. Useful for `pre-commit` gates that should fast-path adapter sanity before authorizing the full battery.

The **`training_drift`** probe pairs with `gradient_ghost`: where the ghost reads optimizer state at end-of-training, drift reads the loss *curve* during training (from dlm's per-step JSONL logs):

```
⚠️  training_drift flagged unstable training
    smoothness=0.42, convergence_ratio=0.91, instability_events=4
```

Four metrics: `final_loss`, `convergence_ratio`, `smoothness` (1 − var(Δloss)/var(loss)), and `instability_events` (loss-increase spikes that exceed local typical movement). Verdict is PASS when all three of smoothness ≥ 0.7, instability_events == 0, and convergence_ratio ≤ 0.7 hold; else WARN. Like `gradient_ghost`, it runs in ~10 ms with no model load. The metric math is sketched below.
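A small numpy sketch of that math. `smoothness` and `instability_events` follow the formulas above; `convergence_ratio` isn't defined precisely in this README, so the late-vs-early loss ratio below is an assumption, as is the 3× spike factor.

```python
import numpy as np

def training_drift(losses: list[float]) -> dict:
    x = np.asarray(losses, dtype=float)
    d = np.diff(x)  # per-step loss movement

    # smoothness = 1 - var(Δloss)/var(loss): near 1.0 for a smooth curve.
    smoothness = 1.0 - d.var() / x.var()

    # ASSUMED definition: mean of the last 10% of losses over the first 10%.
    k = max(1, len(x) // 10)
    convergence_ratio = x[-k:].mean() / x[:k].mean()

    # Loss-increase spikes that exceed local typical movement, proxied
    # here by 3x the median absolute step (the factor is an assumption).
    typical = np.median(np.abs(d)) + 1e-12
    instability_events = int((d > 3 * typical).sum())

    verdict = ("PASS" if smoothness >= 0.7
               and instability_events == 0
               and convergence_ratio <= 0.7 else "WARN")
    return {"smoothness": round(float(smoothness), 2),
            "convergence_ratio": round(float(convergence_ratio), 2),
            "instability_events": instability_events,
            "verdict": verdict}
```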
## Full suite

```yaml
# sway.yaml
version: 1
models:
  base: {kind: hf, base: "HuggingFaceTB/SmolLM2-135M-Instruct"}
  ft: {kind: hf, base: "HuggingFaceTB/SmolLM2-135M-Instruct", adapter: "./runs/adapter/v0003"}
suite:
  - {name: null_baseline, kind: null_adapter, runs: 3}
  - {name: doc_divergence, kind: delta_kl, prompts: ["The key insight is", "An important rule"]}
  - {name: section_attribution, kind: section_internalization}
  - {name: no_leakage, kind: leakage}
  - {name: ablation_shape, kind: adapter_ablation, prompts: ["Tell me more about"]}
```

```bash
sway run sway.yaml           # full report to terminal + JSON
sway gate sway.yaml --junit  # CI-friendly; non-zero on fail

# Override the composite weights on the command line (partial overrides
# are fine — unspecified categories keep their defaults):
sway run sway.yaml --weights "attribution=0.5,adherence=0.2"
```

Inside `sway.yaml`, tuning knobs in `defaults` include:

- `seed` — passed to `seed_everything` before any probe runs.
- `differential` (default `true`) — set `false` to swap the single-load PEFT path for a two-model load (doubled memory; rarely needed, mainly for custom backends that can't toggle the adapter in place).
- `score_weights` — per-category weight overrides baked into the spec so CI runs reproduce the same score without a CLI flag.

## Why it exists

Standard benchmarks (MMLU, HellaSwag) ask *"how good is this model?"* That's the wrong question after a targeted LoRA fine-tune on a small user-authored document. The right question is *"did the adapter actually move the model toward what I wrote?"* — and existing tools answer this poorly.

`sway` answers it directly via sixteen primitives across four categories, plus a baseline-calibration primitive:

| Category      | Primitives                                             |
|---------------|--------------------------------------------------------|
| Adherence     | `delta_kl`, `adapter_revert`, `prompt_collapse`, `cluster_kl`, `multi_turn_coherence_decay` |
| Attribution   | `section_internalization`, `paraphrase_invariance`, `preference_flip`, `tool_use_fidelity` |
| Calibration   | `style_fingerprint`, `calibration_drift`, `leakage`, `external_perplexity`, `gradient_ghost`, `training_drift` |
| Ablation      | `adapter_ablation` ← the signature primitive           |
| Baseline      | `null_adapter` (powers every z-score in the report)    |

**The signature primitive.** `adapter_ablation` scales the LoRA additive term by λ ∈ {0, 0.25, 0.5, 0.75, 1.0, 1.25} and measures the divergence curve. A healthy fine-tune shows a smooth, monotonic, non-saturated response. A degenerate one shows a step function or an overshoot-then-crash. Nobody else does this because nobody else gets this close to the adapter math. The sweep is sketched below.
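A minimal sketch of the λ sweep and the shape check, assuming an HF causal LM and a hypothetical `set_adapter_scale` hook that multiplies the LoRA additive term by λ (the shipped backend does this in place; the hook and the shape heuristics here are illustrative):

```python
import torch
import torch.nn.functional as F

LAMBDAS = [0.0, 0.25, 0.5, 0.75, 1.0, 1.25]

@torch.no_grad()
def ablation_curve(model, input_ids, set_adapter_scale) -> list[float]:
    """Mean next-token KL(base || λ-scaled) at each λ.
    `set_adapter_scale` is a hypothetical hook: scale LoRA's B·A term by λ."""
    set_adapter_scale(model, 0.0)  # λ=0 recovers the pretrained base
    base = F.log_softmax(model(input_ids).logits, dim=-1)
    curve = []
    for lam in LAMBDAS:
        set_adapter_scale(model, lam)
        scaled = F.log_softmax(model(input_ids).logits, dim=-1)
        # KL(base || scaled): how far λ pulls the distribution off the base.
        curve.append(float(F.kl_div(scaled, base, log_target=True,
                                    reduction="batchmean")))
    return curve

def healthy(curve: list[float]) -> bool:
    steps = [b - a for a, b in zip(curve, curve[1:])]
    monotonic = all(s >= -1e-6 for s in steps)           # no overshoot-then-crash
    gradual = max(steps) <= 0.8 * max(sum(steps), 1e-9)  # no single step function
    return monotonic and gradual
```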
**The calibration.** Every numeric probe z-scores its raw metric against a null-adapter baseline — a same-structure LoRA with random-init weights. "Your adapter's KL is 4.2σ above noise" is a far stronger claim than a fixed threshold. The null-adapter calibration requires a backend that implements `NullCalibratedBackend` (the HF backend does); probes that can't be calibrated (e.g., `adapter_revert` needs an embedder, which the null proxy lacks) surface `(no calibration)` in the report and fall back to fixed thresholds. Calibration stats are cached on disk under `~/.dlm-sway/null-stats/` keyed by backend identity.

**The rank profile.** `null_adapter` takes an optional `rank_multipliers: list[float]` (default `[1.0]`). Pass `[0.5, 1.0, 2.0]` and every numeric probe carries a three-point z-score curve: `z=+4.2σ @ 1x / +6.8σ @ 0.5x / +2.1σ @ 2x`. The shape is diagnostic:

- **Flat or slightly rising toward 0.5x** — adapter signal is rank-stable, roughly independent of noise energy.
- **Sharply higher at 0.5x, lower at 2x** — adapter is rank-saturated: a smaller rank would have yielded a clearer separation from noise. Consider halving `r`.
- **Low everywhere** — adapter is barely above noise at any rank; the signal is real but weak.

Caveat: high z at low rank can also mean the low-rank null is *pathologically quiet* rather than that the adapter is strong. Read the profile as a shape, not a scalar — if all three z's move proportionally, the adapter is doing work; if they spread apart, the rank is mis-sized.

Implementation note: rank scaling is mathematically equivalent to multiplying the null noise std by `sqrt(rank_scale)` (LoRA's A·B output variance scales linearly with rank). The shipped backends apply that scaling rather than reshaping PEFT tensors — no model reload, no rank-specific adapter cache, same `alpha/r` scaling throughout.

**Determinism.** Every `sway run` calls `seed_everything(spec.defaults.seed)` before the first probe — seeds python/numpy/torch RNGs and asks torch for deterministic algorithms (`CUBLAS_WORKSPACE_CONFIG=:4096:8`). The report footer prints the achieved class — `strict` (CUDA), `best_effort` (CPU/MPS), or `loose` (deterministic algorithms refused). Same seed + same host = bit-identical scoring across runs.

## Pytest integration

For teams already testing their training pipeline with pytest, sway ships a plugin behind the `[pytest]` extra. A single decorator turns one pytest function into one test item per probe plus an optional composite-score gate:

```python
import pytest

@pytest.mark.sway(spec="sway.yaml", threshold=0.6)
def test_adapter_healthy() -> None:
    """The decorator owns the body — a bare pass is conventional."""
```

`pytest -v` then reports:

```
test_sway_gate.py::test_adapter_healthy::adherence PASSED
test_sway_gate.py::test_adapter_healthy::calibration PASSED
test_sway_gate.py::test_adapter_healthy::__gate__ PASSED
```

`--junitxml` emits one `<testcase>` per probe, `pytest -k adherence` runs just that probe, and `FAIL` / `ERROR` / `SKIP` verdicts translate to pytest outcomes. See `examples/pytest_integration/` for a full before/after walkthrough.

```bash
pip install 'dlm-sway[hf,pytest]'
```

## Pre-commit

For teams using [pre-commit.com](https://pre-commit.com), sway ships a `.pre-commit-hooks.yaml` declaring three hooks that run `sway gate` before every commit touching a spec, `.dlm` document, or adapter file. Add a few lines to your `.pre-commit-config.yaml`:

```yaml
repos:
  - repo: https://github.com/tenseleyFlow/sway
    rev: v0.1.0
    hooks:
      - id: sway-gate
        args: ["sway.yaml", "--threshold=0.6"]
```

Three variants ship; pick whichever fits your install posture:

| Hook | When to use | First-run cost |
|---|---|---|
| `sway-gate` | you already ran `pip install 'dlm-sway[hf]'` | ~none — uses the sway binary on your `PATH` |
| `sway-gate-isolated` | fresh venv, no existing sway install | ~2 min + ~5 GB — pre-commit builds a fresh venv and installs sway + torch + transformers |
| `sway-gate-docker` | zero-install hosts with docker available | ~1 min — pulls `ghcr.io/tenseleyflow/sway-gate:v0.1.0` (torch baked in, MiniLM weights pre-cached) |

The recommended default is `sway-gate`. Switch to `sway-gate-isolated` if you can't rely on a host-level sway install. Reach for `sway-gate-docker` on ephemeral CI runners where docker is cheaper than a fresh venv.

### Rev pinning

The example above pins to the `v0.1.0` tag. Bump it deliberately when you want to pick up a new release; `pre-commit autoupdate` will surface newer tags when you run it explicitly.

### Scope

The hook **only gates** — it exits non-zero on FAIL, zero on PASS. No `--json` / `--markdown` report flags are surfaced; those belong in `sway run` (ad-hoc or in a separate CI job). This keeps `git commit` fast and the gate's verdict uncluttered.

See [`examples/precommit-example/`](examples/precommit-example/) for the full walk-through including the `sway.yaml` template, the consumer-side `.pre-commit-config.yaml`, and the try-it-locally-before-you-install recipe.

## GitHub Action

For repos that don't use pre-commit, the [`tenseleyflow/sway-action`](https://github.com/tenseleyFlow/sway-action) GitHub Action wraps `sway gate` + a markdown report posted as a PR comment in three lines of YAML:

```yaml
- uses: tenseleyflow/sway-action@v0.1.0
  with:
    spec-path: sway.yaml
```

The action handles install, caching, gate execution, edit-in-place PR comments (so rebases don't spam), and exit-code translation. A complete workflow:

```yaml
name: sway
on:
  pull_request:
    paths: ["**/*.dlm", "**/sway.yaml"]
permissions:
  contents: read
  pull-requests: write
jobs:
  sway:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: tenseleyflow/sway-action@v0.1.0
        with:
          spec-path: sway.yaml
```

Inputs include `fail-on` (`fail` / `warn` / `never`), `comment-on-pr`, `upload-artifact`, and `sway-version` for pinning. Outputs `sway-score`, `verdict`, and `report-path` are available to downstream steps. See the action repo's README for the full surface.

## The `.dlm` integration

If you trained your adapter via the [DocumentLanguageModel project](https://github.com/tenseleyFlow/DocumentLanguageModel), `sway` auto-generates a test suite from your document's sections. Install sway with the `[dlm]` extra alongside `[hf]`:

```bash
pip install "dlm-sway[hf,dlm]"
```

Then:

```bash
sway autogen path/to/doc.dlm -o sway.yaml
sway run sway.yaml
```

Per-section attribution tells you *which* parts of your document actually moved the model — a kind of signal no other tool provides.

> Heads up — once dlm publishes its sister change, `dlm export --to
> sway-json` writes a ready-to-run `sway.yaml` next to the GGUF.
> Skip the manual `sway autogen` step entirely.

## Tool-use fidelity

Adapters that target tool-calling behavior have a failure mode no other probe catches: they pass every adherence / attribution probe but silently degrade the base model's JSON-schema compliance, or start hallucinating tool names that aren't in the declared surface. The `tool_use_fidelity` probe scores three independent signals on each `(prompt, tool_spec)` case (sketched after the example):

- **JSON-schema validity delta** — `ft_valid_rate − base_valid_rate`. A negative value means the adapter regressed on producing schema-valid calls; the default pass criterion tolerates a 5pp drop.
- **Tool-name hallucination rate** — over schema-valid ft calls, the fraction that pick a tool outside the declared `allowed_tools` surface (or differ from `gold_tool_name` when no surface is set). Default cap 10%.
- **Argument-field disagreement rate** — over cases where both views produce schema-valid calls, the per-leaf-field disagreement rate between base and ft arguments. Surfaced as evidence — the v1 surface doesn't gate on it.

```yaml
suite:
  - name: tool_calls_intact
    kind: tool_use_fidelity
    cases:
      - prompt: "Search the web for the capital of France."
        tool_spec:
          name: search_web
          parameters:
            type: object
            properties: {query: {type: string}}
            required: [query]
        gold_tool_name: search_web
        allowed_tools: [search_web, calculator, http_fetch]
```

The probe uses an OpenAI-style function-calling schema (the most common shape across hosted and OSS tool-use stacks). Other schema flavors (Anthropic, Llama-3.1, Gemini) land in a follow-up sprint; the v1 implementation deliberately ships the format that covers the largest user surface.

`json_valid_rate_ft` is z-scored against the null-adapter baseline when `null_adapter` is in the suite — the principled "the adapter preserved the base's tool-call structure beyond noise" signal.
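For concreteness, here's roughly what the three signals compute per case, using `jsonschema` for validity. The thresholds come from the bullets above; the `{"name": ..., "arguments": {...}}` call shape and the positional pairing of valid calls are simplifying assumptions, not sway's actual parsing.

```python
import json
from jsonschema import ValidationError, validate

def score_case(base_outputs, ft_outputs, tool_spec, allowed_tools):
    """base_outputs / ft_outputs: raw model generations, one per sample."""
    def parse(raw):
        # ASSUMED call shape: {"name": ..., "arguments": {...}} (OpenAI-style).
        try:
            call = json.loads(raw)
            validate(call["arguments"], tool_spec["parameters"])
            return call
        except (ValueError, KeyError, TypeError, ValidationError):
            return None

    base_ok = [c for c in map(parse, base_outputs) if c]
    ft_ok = [c for c in map(parse, ft_outputs) if c]

    # Signal 1: schema-validity delta (negative = adapter regressed).
    validity_delta = len(ft_ok) / len(ft_outputs) - len(base_ok) / len(base_outputs)

    # Signal 2: tool-name hallucination over schema-valid ft calls.
    halluc = sum(c["name"] not in allowed_tools for c in ft_ok) / max(1, len(ft_ok))

    # Signal 3: per-leaf-field disagreement (simplified to top-level fields,
    # paired positionally here) — evidence only, not gated in v1.
    fields = tool_spec["parameters"]["properties"]
    checks = [b["arguments"].get(k) != f["arguments"].get(k)
              for b, f in zip(base_ok, ft_ok) for k in fields]
    disagreement = sum(checks) / max(1, len(checks))

    return {"validity_delta": validity_delta,
            "hallucination_rate": halluc,
            "arg_disagreement": disagreement,
            "pass": validity_delta >= -0.05 and halluc <= 0.10}
```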
## Multi-turn coherence

Every other adherence probe is single-turn: one user message, one ft response, one score. Adapters that pass `delta_kl` cleanly can still *forget their training* by turn 2 or 3 of a real dialogue — the model's own previous responses fill the context window and create compounding drift. The `multi_turn_coherence_decay` probe rolls a multi-turn synthetic dialogue per prompt and fits an exponential-decay curve to the per-turn KL:

```yaml
suite:
  - name: holds_a_conversation
    kind: multi_turn_coherence_decay
    prompts:
      - "Explain how a neural network learns."
      - "What's the difference between TCP and UDP?"
    max_turns: 4
    assert_half_life_turns: 2.0
```

The probe reports `half_life_turns` (the turn at which adapter influence is halved), a per-turn KL list, and a tiny ASCII sparkline in the report message so you can see the curve shape without opening the JSON. The decay fit is sketched below.
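A minimal sketch of the fit, assuming the model KL_t ≈ KL_1 · 2^(−(t−1)/h) and a least-squares line in log space; the shipped probe's exact fitting procedure may differ. Presumably `assert_half_life_turns: 2.0` fails the probe when the fitted half-life drops below two turns.

```python
import math

def half_life_turns(per_turn_kl: list[float]) -> float:
    """Fit log2(KL_t) against turn index; return h, the number of turns
    over which adapter influence halves."""
    ys = [math.log2(max(kl, 1e-12)) for kl in per_turn_kl]
    xs = range(len(ys))
    n = len(ys)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    slope = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
             / sum((x - xbar) ** 2 for x in xs))
    if slope >= 0:
        return math.inf  # no decay — adapter influence holds across turns
    return -1.0 / slope

half_life_turns([0.8, 0.4, 0.2, 0.1])  # → 1.0: influence halves every turn
```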
Bases without a `chat_template` SKIP gracefully — multi-turn requires one to format dialogue history. The probe deliberately **doesn't** z-score against the null-adapter baseline (a null adapter has no coherence to decay, so the null distribution is meaningless); fixed-threshold verdicts are the published path, mirroring `prompt_collapse`.

## Serve daemon

Loading the HF backend takes ~15 s every time you run `sway run`. For notebook exploration, the `sway watch` retrain loop, or any flow that fires the suite repeatedly against the same model, that startup is the dominant cost. `sway serve` keeps the backend warm in a small FastAPI daemon: the first request pays the load, subsequent requests reuse the cached weights.

```bash
pip install 'dlm-sway[serve]'   # adds fastapi + uvicorn + httpx
sway serve --port 8787 --max-loaded-models 2
```

Default bind is `127.0.0.1`. The daemon refuses to start on a non-loopback interface unless you pass `--api-key <key>`, after which every non-`/health` request must carry `Authorization: Bearer <key>`.

```bash
# With curl — the sweet spot is from inside a notebook or watch loop.
curl -s -X POST http://localhost:8787/run \
  -H 'Content-Type: application/json' \
  -d "$(yq -o=json sway.yaml | jq -c '{spec: .}')" | jq .score
```

```python
# From a notebook (or any Python).
from dlm_sway.serve.client import ServeClient
from dlm_sway.suite.loader import load_spec

client = ServeClient("http://localhost:8787")
report = client.run(load_spec("sway.yaml"))
print(report["score"])            # full report shape mirrors `sway run --json`
print(report["request_seconds"])  # cold ~15s; warm ~2s
```

The cache is keyed on `(kind, base, adapter, dtype, device)` and capped at `--max-loaded-models` (default 2). Loading a third distinct model LRU-evicts the oldest, calling `backend.close()` to release the weights. `GET /health` reports the currently-warm models; `GET /stats` reports request count and mean latency.

## Reproducing a sway run

Sometimes you want a coworker (or a future you, or a bug report) to re-execute exactly what you ran without recreating your environment. `sway pack` bundles the spec + the source `.dlm` document + your locally-cached null-stats + (optionally) a known-good report into a single `*.swaypack.tar.gz` you can email or commit:

```bash
sway pack sway.yaml -o audit-2026-04.swaypack.tar.gz \
    --include-golden last-run.json
# wrote audit-2026-04.swaypack.tar.gz (4.2 KB) sections=312b null_stats=0 golden=yes

# share the tarball with a coworker
sway unpack audit-2026-04.swaypack.tar.gz
# extracted: /swaypack
# spec_path: /swaypack/sway.yaml
# ...
# To run the bundled spec:
#   SWAY_NULL_CACHE_DIR=/swaypack/null-stats sway run /swaypack/sway.yaml
```

The packed null-stats cache means the consumer doesn't pay the re-calibration cost (~120 forward passes on a typical 10-probe suite). The unpack output prints the exact `SWAY_NULL_CACHE_DIR=...` env var to use; that variable redirects null-stats lookups to the bundled cache instead of `~/.dlm-sway/`. A bundled golden JSON makes it trivial for the consumer to verify their re-run matches yours. Defaults to a 50 MB pack-size cap; override with `--max-size-mb`.

## Status

Alpha. `v0.1.0` is the first published tag on PyPI (as `dlm-sway`); the API may break until v1.0, when semantic versioning kicks in. For unreleased changes, install editable from source (see [Install from source](#install-from-source)).

## License

MIT