# sway

Differential testing for fine-tuned causal language models.

> **Alpha — v0.1.0 on PyPI.** API is not stable; semantic versioning
> applies only from v1.0 onward. Feedback + issues welcome.

**One question:** *did LoRA/QLoRA training actually change model behavior in a meaningful way, or is the model just defaulting to the pretrained base?*

`sway` gives you a trustworthy, reproducible answer with sixteen purpose-built primitives, each z-scored against a null-adapter baseline. No LLM judges. No external APIs. Deterministic on CPU where possible.

> **Naming convention.** The source repo and CLI entry point are both
> `sway`. The PyPI wheel is `dlm-sway` because the short `sway` name is
> taken on PyPI by an unrelated project. The CLI installed by
> `pip install dlm-sway` is still `sway` — a distribution name that
> differs from the command or import name it installs is common on PyPI
> (see `pyyaml` → `import yaml`).

## Install

```bash
# HF + PEFT backend — required for real models
pip install "dlm-sway[hf]"

# Extras composable as usual
pip install "dlm-sway[hf,style,semsim]"
pip install "dlm-sway[all]"

# .dlm auto-suite generation (requires the DLM sibling project)
pip install "dlm-sway[dlm]"
```

Available extras:

- `[hf]` — HuggingFace + PEFT backend (required for real models)
- `[mlx]` — Apple Silicon MLX backend (darwin-arm64 only)
- `[style]` — stylistic fingerprint extensions (spaCy + textstat + nlpaug)
- `[semsim]` — sentence-transformers for the revert probe
- `[dlm]` — auto-generate suites from `.dlm` documents
- `[pytest]` — pytest plugin (see [Pytest integration](#pytest-integration))
- `[serve]` — warm-model daemon (see [Serve daemon](#serve-daemon))
- `[viz]` — matplotlib plots
- `[all]` — everything

Verify the install:

```bash
sway --version
sway doctor
```

## MLX backend (Apple Silicon)

Install the extra and point sway at any PEFT adapter — the MLX backend auto-converts on first load and caches the result under `~/.cache/dlm-sway/mlx-converted/<hash>/`. No manual `.npz` step.

```bash
pip install "dlm-sway[mlx]"

# Either: let sway auto-convert when it loads the adapter
sway run sway.yaml   # spec sets models.ft.kind: mlx, points at any PEFT dir

# Or: convert explicitly (useful for inspection / scripting)
sway convert-adapter ~/path/to/peft-adapter ~/path/to/mlx-adapter
```

Conversion is one-shot — repeated `sway run` invocations on the same adapter version short-circuit on a content hash. Supports standard LoRA on `q_proj` / `v_proj` / etc. QLoRA 4-bit and full-weight `modules_to_save` overrides aren't supported (the converter prints a clear warning if it sees them).

## Install from source

For the development HEAD (unreleased changes, contributor workflow):

```bash
git clone https://github.com/tenseleyFlow/sway.git
cd sway
uv venv --python 3.11 .venv   # or: python -m venv .venv
source .venv/bin/activate
uv pip install -e ".[hf]" --group dev
```

## 90-second smoke test

```bash
sway check path/to/adapter --base HuggingFaceTB/SmolLM2-135M-Instruct
```

Outputs a verdict in under a minute on CPU for small models: *your adapter is 4.2σ above noise* ✅ or *indistinguishable from a null adapter* ❌.

## Pre-run diagnostics

If your adapter was trained with [DocumentLanguageModel](https://github.com/tenseleyFlow/DocumentLanguageModel), `sway check` runs the **`gradient_ghost`** probe first — a ~50 ms disk-only inspection of dlm's `training_state.pt`. It catches adapters that are obviously broken (e.g. left at `--max-steps 5` from a smoke test) **before** the model ever loads, with a banner on top of the regular verdict:

```
⚠️  PRE-RUN ALERT — gradient_ghost flagged severe undertraining
    severely undertrained: global_step=2 < threshold 50.
    The probe scores below may be unreliable. Consider retraining.
```

The probe's signal ladder, in order of decisiveness (a code sketch follows the list):

1. `global_step` below a configurable threshold (default 50) → FAIL.
2. Every per-param `exp_avg_sq` is NaN → FAIL.
3. More than 30% of LoRA layers show >2× the lowest layer's gradient variance → FAIL or WARN.
4. Otherwise → PASS.
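For intuition, here's a minimal sketch of that ladder in plain PyTorch. The checkpoint layout assumed here (`global_step` at the top level, Adam's `exp_avg_sq` per parameter under the optimizer state) is standard optimizer `state_dict` shape, not a documented dlm contract — treat it as illustrative, not as the shipped probe.

```python
import torch

def gradient_ghost(path: str, step_threshold: int = 50) -> str:
    """Illustrative signal ladder — disk-only, no model load."""
    state = torch.load(path, map_location="cpu")

    # 1. Undertrained: too few optimizer steps.
    if state["global_step"] < step_threshold:
        return "FAIL"

    # Adam second-moment estimates, one tensor per LoRA parameter.
    # (Key layout is assumed; dlm's actual checkpoint may nest differently.)
    sq = [s["exp_avg_sq"] for s in state["optimizer"]["state"].values()]

    # 2. Dead optimizer state: every second moment is NaN.
    if all(torch.isnan(t).all() for t in sq):
        return "FAIL"

    # 3. Uneven training: >30% of layers carry >2x the quietest layer's
    #    gradient variance (exp_avg_sq mean used as the variance proxy).
    variances = sorted(float(t.float().mean()) for t in sq)
    noisy = sum(v > 2 * variances[0] for v in variances)
    if noisy / len(variances) > 0.30:
        return "WARN"  # per the ladder above, severe cases escalate to FAIL

    return "PASS"
```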
Specs that contain *only* pre-run diagnostics like `gradient_ghost` skip backend construction entirely — `sway run` against a pre-flight spec doesn't load a model, doesn't allocate GPU memory, doesn't pay the cold-start tax. Useful for `pre-commit` gates that should fast-path adapter sanity before authorizing the full battery.

The **`training_drift`** probe pairs with `gradient_ghost`: where the ghost reads optimizer state at end-of-training, drift reads the loss *curve* during training (from dlm's per-step JSONL logs):

```
⚠️  training_drift flagged unstable training
    smoothness=0.42, convergence_ratio=0.91, instability_events=4
```

Four metrics: `final_loss`, `convergence_ratio`, `smoothness` (1 − var(Δloss)/var(loss)), and `instability_events` (loss-increase spikes that exceed local typical movement). Verdict is PASS when all three of smoothness ≥ 0.7, instability_events == 0, and convergence_ratio ≤ 0.7 hold; else WARN. Like `gradient_ghost`, it runs in ~10 ms with no model load. The metric math is sketched below.
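A small numpy sketch of that math. `smoothness` and `instability_events` follow the formulas above; `convergence_ratio` isn't defined precisely in this README, so the late-vs-early loss ratio below is an assumption, as is the 3× spike factor.

```python
import numpy as np

def training_drift(losses: list[float]) -> dict:
    x = np.asarray(losses, dtype=float)
    d = np.diff(x)  # per-step loss movement

    # smoothness = 1 - var(Δloss)/var(loss): near 1.0 for a smooth curve.
    smoothness = 1.0 - d.var() / x.var()

    # ASSUMED definition: mean of the last 10% of losses over the first 10%.
    k = max(1, len(x) // 10)
    convergence_ratio = x[-k:].mean() / x[:k].mean()

    # Loss-increase spikes that exceed local typical movement, proxied
    # here by 3x the median absolute step (the factor is an assumption).
    typical = np.median(np.abs(d)) + 1e-12
    instability_events = int((d > 3 * typical).sum())

    verdict = ("PASS" if smoothness >= 0.7
               and instability_events == 0
               and convergence_ratio <= 0.7 else "WARN")
    return {"smoothness": round(float(smoothness), 2),
            "convergence_ratio": round(float(convergence_ratio), 2),
            "instability_events": instability_events,
            "verdict": verdict}
```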
## Full suite

```yaml
# sway.yaml
version: 1
models:
  base: {kind: hf, base: "HuggingFaceTB/SmolLM2-135M-Instruct"}
  ft: {kind: hf, base: "HuggingFaceTB/SmolLM2-135M-Instruct", adapter: "./runs/adapter/v0003"}
suite:
  - {name: null_baseline, kind: null_adapter, runs: 3}
  - {name: doc_divergence, kind: delta_kl, prompts: ["The key insight is", "An important rule"]}
  - {name: section_attribution, kind: section_internalization}
  - {name: no_leakage, kind: leakage}
  - {name: ablation_shape, kind: adapter_ablation, prompts: ["Tell me more about"]}
```

```bash
sway run sway.yaml           # full report to terminal + JSON
sway gate sway.yaml --junit  # CI-friendly; non-zero on fail

# Override the composite weights on the command line (partial overrides
# are fine — unspecified categories keep their defaults):
sway run sway.yaml --weights "attribution=0.5,adherence=0.2"
```

Inside `sway.yaml`, tuning knobs in `defaults` include:

- `seed` — passed to `seed_everything` before any probe runs.
- `differential` (default `true`) — set `false` to swap the single-load PEFT path for a two-model load (doubled memory; rarely needed, mainly for custom backends that can't toggle the adapter in place).
- `score_weights` — per-category weight overrides baked into the spec so CI runs reproduce the same score without a CLI flag.

## Why it exists

Standard benchmarks (MMLU, HellaSwag) ask *"how good is this model?"* That's the wrong question after a targeted LoRA fine-tune on a small user-authored document. The right question is *"did the adapter actually move the model toward what I wrote?"* — and existing tools answer this poorly.

`sway` answers it directly via sixteen primitives across four categories, plus a baseline-calibration primitive:

| Category      | Primitives                                             |
|---------------|--------------------------------------------------------|
| Adherence     | `delta_kl`, `adapter_revert`, `prompt_collapse`, `cluster_kl`, `multi_turn_coherence_decay` |
| Attribution   | `section_internalization`, `paraphrase_invariance`, `preference_flip`, `tool_use_fidelity` |
| Calibration   | `style_fingerprint`, `calibration_drift`, `leakage`, `external_perplexity`, `gradient_ghost`, `training_drift` |
| Ablation      | `adapter_ablation` ← the signature primitive           |
| Baseline      | `null_adapter` (powers every z-score in the report)    |

**The signature primitive.** `adapter_ablation` scales the LoRA additive term by λ ∈ {0, 0.25, 0.5, 0.75, 1.0, 1.25} and measures the divergence curve. A healthy fine-tune shows a smooth, monotonic, non-saturated response. A degenerate one shows a step function or an overshoot-then-crash. Nobody else does this because nobody else gets this close to the adapter math. The sweep is sketched below.
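A minimal sketch of the λ sweep and the shape check, assuming an HF causal LM and a hypothetical `set_adapter_scale` hook that multiplies the LoRA additive term by λ (the shipped backend does this in place; the hook and the shape heuristics here are illustrative):

```python
import torch
import torch.nn.functional as F

LAMBDAS = [0.0, 0.25, 0.5, 0.75, 1.0, 1.25]

@torch.no_grad()
def ablation_curve(model, input_ids, set_adapter_scale) -> list[float]:
    """Mean next-token KL(base || λ-scaled) at each λ.
    `set_adapter_scale` is a hypothetical hook: scale LoRA's B·A term by λ."""
    set_adapter_scale(model, 0.0)  # λ=0 recovers the pretrained base
    base = F.log_softmax(model(input_ids).logits, dim=-1)
    curve = []
    for lam in LAMBDAS:
        set_adapter_scale(model, lam)
        scaled = F.log_softmax(model(input_ids).logits, dim=-1)
        # KL(base || scaled): how far λ pulls the distribution off the base.
        curve.append(float(F.kl_div(scaled, base, log_target=True,
                                    reduction="batchmean")))
    return curve

def healthy(curve: list[float]) -> bool:
    steps = [b - a for a, b in zip(curve, curve[1:])]
    monotonic = all(s >= -1e-6 for s in steps)           # no overshoot-then-crash
    gradual = max(steps) <= 0.8 * max(sum(steps), 1e-9)  # no single step function
    return monotonic and gradual
```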
**The calibration.** Every numeric probe z-scores its raw metric against a null-adapter baseline — a same-structure LoRA with random-init weights. "Your adapter's KL is 4.2σ above noise" is a far stronger claim than a fixed threshold. The null-adapter calibration requires a backend that implements `NullCalibratedBackend` (the HF backend does); probes that can't be calibrated (e.g., `adapter_revert` needs an embedder, which the null proxy lacks) surface `(no calibration)` in the report and fall back to fixed thresholds. Calibration stats are cached on disk under `~/.dlm-sway/null-stats/` keyed by backend identity.

**The rank profile.** `null_adapter` takes an optional `rank_multipliers: list[float]` (default `[1.0]`). Pass `[0.5, 1.0, 2.0]` and every numeric probe carries a three-point z-score curve: `z=+4.2σ @ 1x / +6.8σ @ 0.5x / +2.1σ @ 2x`. The shape is diagnostic:

- **Flat or slightly rising toward 0.5x** — adapter signal is rank-stable, roughly independent of noise energy.
- **Sharply higher at 0.5x, lower at 2x** — adapter is rank-saturated: a smaller rank would have yielded a clearer separation from noise. Consider halving `r`.
- **Low everywhere** — adapter is barely above noise at any rank; the signal is real but weak.

Caveat: high z at low rank can also mean the low-rank null is *pathologically quiet* rather than that the adapter is strong. Read the profile as a shape, not a scalar — if all three z's move proportionally, the adapter is doing work; if they spread apart, the rank is mis-sized.

Implementation note: rank scaling is mathematically equivalent to multiplying the null noise std by `sqrt(rank_scale)` (LoRA's A·B output variance scales linearly with rank). The shipped backends apply that scaling rather than reshaping PEFT tensors — no model reload, no rank-specific adapter cache, same `alpha/r` scaling throughout.

**Determinism.** Every `sway run` calls `seed_everything(spec.defaults.seed)` before the first probe — seeds python/numpy/torch RNGs and asks torch for deterministic algorithms (`CUBLAS_WORKSPACE_CONFIG=:4096:8`). The report footer prints the achieved class — `strict` (CUDA), `best_effort` (CPU/MPS), or `loose` (deterministic algorithms refused). Same seed + same host = bit-identical scoring across runs.

## Pytest integration

For teams already testing their training pipeline with pytest, sway ships a plugin behind the `[pytest]` extra. A single decorator turns one pytest function into one test item per probe plus an optional composite-score gate:

```python
import pytest

@pytest.mark.sway(spec="sway.yaml", threshold=0.6)
def test_adapter_healthy() -> None:
    """The decorator owns the body — a bare pass is conventional."""
```

`pytest -v` then reports:

```
test_sway_gate.py::test_adapter_healthy::adherence PASSED
test_sway_gate.py::test_adapter_healthy::calibration PASSED
test_sway_gate.py::test_adapter_healthy::__gate__ PASSED
```

`--junitxml` emits one `<testcase>` per probe, `pytest -k adherence` runs just that probe, and `FAIL` / `ERROR` / `SKIP` verdicts translate to pytest outcomes. See `examples/pytest_integration/` for a full before/after walkthrough.

```bash
pip install 'dlm-sway[hf,pytest]'
```

## Pre-commit

For teams using [pre-commit.com](https://pre-commit.com), sway ships a `.pre-commit-hooks.yaml` declaring three hooks that run `sway gate` before every commit touching a spec, `.dlm` document, or adapter file. Add a few lines to your `.pre-commit-config.yaml`:

```yaml
repos:
  - repo: https://github.com/tenseleyFlow/sway
    rev: v0.1.0
    hooks:
      - id: sway-gate
        args: ["sway.yaml", "--threshold=0.6"]
```

Three variants ship; pick whichever fits your install posture:

| Hook | When to use | First-run cost |
|---|---|---|
| `sway-gate` | you already ran `pip install 'dlm-sway[hf]'` | ~none — uses the sway binary on your `PATH` |
| `sway-gate-isolated` | fresh venv, no existing sway install | ~2 min + ~5 GB — pre-commit builds a fresh venv and installs sway + torch + transformers |
| `sway-gate-docker` | zero-install hosts with docker available | ~1 min — pulls `ghcr.io/tenseleyflow/sway-gate:v0.1.0` (torch baked in, MiniLM weights pre-cached) |

The recommended default is `sway-gate`. Switch to `sway-gate-isolated` if you can't rely on a host-level sway install. Reach for `sway-gate-docker` on ephemeral CI runners where docker is cheaper than a fresh venv.

### Rev pinning

The example above pins to the `v0.1.0` tag. Bump it deliberately when you want to pick up a new release; `pre-commit autoupdate` will surface newer tags when you run it explicitly.

### Scope

The hook **only gates** — it exits non-zero on FAIL, zero on PASS. No `--json` / `--markdown` report flags are surfaced; those belong in `sway run` (ad-hoc or in a separate CI job). This keeps `git commit` fast and the gate's verdict uncluttered.

See [`examples/precommit-example/`](examples/precommit-example/) for the full walk-through including the `sway.yaml` template, the consumer-side `.pre-commit-config.yaml`, and the try-it-locally-before-you-install recipe.

## GitHub Action

For repos that don't use pre-commit, the [`tenseleyflow/sway-action`](https://github.com/tenseleyFlow/sway-action) GitHub Action wraps `sway gate` + a markdown report posted as a PR comment in three lines of YAML:

```yaml
- uses: tenseleyflow/sway-action@v0.1.0
  with:
    spec-path: sway.yaml
```

The action handles install, caching, gate execution, edit-in-place PR comments (so rebases don't spam), and exit-code translation. A complete workflow:

```yaml
name: sway
on:
  pull_request:
    paths: ["**/*.dlm", "**/sway.yaml"]
permissions:
  contents: read
  pull-requests: write
jobs:
  sway:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: tenseleyflow/sway-action@v0.1.0
        with:
          spec-path: sway.yaml
```

Inputs include `fail-on` (`fail` / `warn` / `never`), `comment-on-pr`, `upload-artifact`, and `sway-version` for pinning. Outputs `sway-score`, `verdict`, and `report-path` are available to downstream steps. See the action repo's README for the full surface.

## The `.dlm` integration

If you trained your adapter via the [DocumentLanguageModel project](https://github.com/tenseleyFlow/DocumentLanguageModel), `sway` auto-generates a test suite from your document's sections. Install sway with the `[dlm]` extra alongside `[hf]`:

```bash
pip install "dlm-sway[hf,dlm]"
```

Then:

```bash
sway autogen path/to/doc.dlm -o sway.yaml
sway run sway.yaml
```

Per-section attribution tells you *which* parts of your document actually moved the model — a kind of signal no other tool provides.

> Heads up — once dlm publishes its sister change, `dlm export --to
> sway-json` writes a ready-to-run `sway.yaml` next to the GGUF.
> Skip the manual `sway autogen` step entirely.

## Tool-use fidelity

Adapters that target tool-calling behavior have a failure mode no other probe catches: they pass every adherence / attribution probe but silently degrade the base model's JSON-schema compliance, or start hallucinating tool names that aren't in the declared surface. The `tool_use_fidelity` probe scores three independent signals on each `(prompt, tool_spec)` case (sketched after the example):

- **JSON-schema validity delta** — `ft_valid_rate − base_valid_rate`. A negative value means the adapter regressed on producing schema-valid calls; the default pass criterion tolerates a 5pp drop.
- **Tool-name hallucination rate** — over schema-valid ft calls, the fraction that pick a tool outside the declared `allowed_tools` surface (or differ from `gold_tool_name` when no surface is set). Default cap 10%.
- **Argument-field disagreement rate** — over cases where both views produce schema-valid calls, the per-leaf-field disagreement rate between base and ft arguments. Surfaced as evidence — the v1 surface doesn't gate on it.

```yaml
suite:
  - name: tool_calls_intact
    kind: tool_use_fidelity
    cases:
      - prompt: "Search the web for the capital of France."
        tool_spec:
          name: search_web
          parameters:
            type: object
            properties: {query: {type: string}}
            required: [query]
        gold_tool_name: search_web
        allowed_tools: [search_web, calculator, http_fetch]
```

The probe uses an OpenAI-style function-calling schema (the most common shape across hosted and OSS tool-use stacks). Other schema flavors (Anthropic, Llama-3.1, Gemini) land in a follow-up sprint; the v1 implementation deliberately ships the format that covers the largest user surface.

`json_valid_rate_ft` is z-scored against the null-adapter baseline when `null_adapter` is in the suite — the principled "the adapter preserved the base's tool-call structure beyond noise" signal.
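For concreteness, here's roughly what the three signals compute per case, using `jsonschema` for validity. The thresholds come from the bullets above; the `{"name": ..., "arguments": {...}}` call shape and the positional pairing of valid calls are simplifying assumptions, not sway's actual parsing.

```python
import json
from jsonschema import ValidationError, validate

def score_case(base_outputs, ft_outputs, tool_spec, allowed_tools):
    """base_outputs / ft_outputs: raw model generations, one per sample."""
    def parse(raw):
        # ASSUMED call shape: {"name": ..., "arguments": {...}} (OpenAI-style).
        try:
            call = json.loads(raw)
            validate(call["arguments"], tool_spec["parameters"])
            return call
        except (ValueError, KeyError, TypeError, ValidationError):
            return None

    base_ok = [c for c in map(parse, base_outputs) if c]
    ft_ok = [c for c in map(parse, ft_outputs) if c]

    # Signal 1: schema-validity delta (negative = adapter regressed).
    validity_delta = len(ft_ok) / len(ft_outputs) - len(base_ok) / len(base_outputs)

    # Signal 2: tool-name hallucination over schema-valid ft calls.
    halluc = sum(c["name"] not in allowed_tools for c in ft_ok) / max(1, len(ft_ok))

    # Signal 3: per-leaf-field disagreement (simplified to top-level fields,
    # paired positionally here) — evidence only, not gated in v1.
    fields = tool_spec["parameters"]["properties"]
    checks = [b["arguments"].get(k) != f["arguments"].get(k)
              for b, f in zip(base_ok, ft_ok) for k in fields]
    disagreement = sum(checks) / max(1, len(checks))

    return {"validity_delta": validity_delta,
            "hallucination_rate": halluc,
            "arg_disagreement": disagreement,
            "pass": validity_delta >= -0.05 and halluc <= 0.10}
```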
## Multi-turn coherence

Every other adherence probe is single-turn: one user message, one ft response, one score. Adapters that pass `delta_kl` cleanly can still *forget their training* by turn 2 or 3 of a real dialogue — the model's own previous responses fill the context window and create compounding drift. The `multi_turn_coherence_decay` probe rolls a multi-turn synthetic dialogue per prompt and fits an exponential-decay curve to the per-turn KL:

```yaml
suite:
  - name: holds_a_conversation
    kind: multi_turn_coherence_decay
    prompts:
      - "Explain how a neural network learns."
      - "What's the difference between TCP and UDP?"
    max_turns: 4
    assert_half_life_turns: 2.0
```

The probe reports `half_life_turns` (the turn at which adapter influence is halved), a per-turn KL list, and a tiny ASCII sparkline in the report message so you can see the curve shape without opening the JSON. The decay fit is sketched below.
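A minimal sketch of the fit, assuming the model KL_t ≈ KL_1 · 2^(−(t−1)/h) and a least-squares line in log space; the shipped probe's exact fitting procedure may differ. Presumably `assert_half_life_turns: 2.0` fails the probe when the fitted half-life drops below two turns.

```python
import math

def half_life_turns(per_turn_kl: list[float]) -> float:
    """Fit log2(KL_t) against turn index; return h, the number of turns
    over which adapter influence halves."""
    ys = [math.log2(max(kl, 1e-12)) for kl in per_turn_kl]
    xs = range(len(ys))
    n = len(ys)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    slope = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
             / sum((x - xbar) ** 2 for x in xs))
    if slope >= 0:
        return math.inf  # no decay — adapter influence holds across turns
    return -1.0 / slope

half_life_turns([0.8, 0.4, 0.2, 0.1])  # → 1.0: influence halves every turn
```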
Bases without a `chat_template` SKIP gracefully — multi-turn requires one to format dialogue history. The probe deliberately **doesn't** z-score against the null-adapter baseline (a null adapter has no coherence to decay, so the null distribution is meaningless); fixed-threshold verdicts are the published path, mirroring `prompt_collapse`.

## Serve daemon

Loading the HF backend takes ~15 s every time you run `sway run`. For notebook exploration, the `sway watch` retrain loop, or any flow that fires the suite repeatedly against the same model, that startup is the dominant cost. `sway serve` keeps the backend warm in a small FastAPI daemon: the first request pays the load, subsequent requests reuse the cached weights.

```bash
pip install 'dlm-sway[serve]'   # adds fastapi + uvicorn + httpx
sway serve --port 8787 --max-loaded-models 2
```

Default bind is `127.0.0.1`. The daemon refuses to start on a non-loopback interface unless you pass `--api-key <key>`, after which every non-`/health` request must carry `Authorization: Bearer <key>`.

```bash
# With curl — the sweet spot is from inside a notebook or watch loop.
curl -s -X POST http://localhost:8787/run \
  -H 'Content-Type: application/json' \
  -d "$(yq -o=json sway.yaml | jq -c '{spec: .}')" | jq .score
```

```python
# From a notebook (or any Python).
from dlm_sway.serve.client import ServeClient
from dlm_sway.suite.loader import load_spec

client = ServeClient("http://localhost:8787")
report = client.run(load_spec("sway.yaml"))
print(report["score"])            # full report shape mirrors `sway run --json`
print(report["request_seconds"])  # cold ~15s; warm ~2s
```

The cache is keyed on `(kind, base, adapter, dtype, device)` and capped at `--max-loaded-models` (default 2). Loading a third distinct model LRU-evicts the oldest, calling `backend.close()` to release the weights. `GET /health` reports the currently-warm models; `GET /stats` reports request count and mean latency.

## Reproducing a sway run

Sometimes you want a coworker (or a future you, or a bug report) to re-execute exactly what you ran without recreating your environment. `sway pack` bundles the spec + the source `.dlm` document + your locally-cached null-stats + (optionally) a known-good report into a single `*.swaypack.tar.gz` you can email or commit:

```bash
sway pack sway.yaml -o audit-2026-04.swaypack.tar.gz \
    --include-golden last-run.json
# wrote audit-2026-04.swaypack.tar.gz (4.2 KB) sections=312b null_stats=0 golden=yes

# share the tarball with a coworker
sway unpack audit-2026-04.swaypack.tar.gz
# extracted: /swaypack
# spec_path: /swaypack/sway.yaml
# ...
# To run the bundled spec:
#   SWAY_NULL_CACHE_DIR=/swaypack/null-stats sway run /swaypack/sway.yaml
```

The packed null-stats cache means the consumer doesn't pay the re-calibration cost (~120 forward passes on a typical 10-probe suite). The unpack output prints the exact `SWAY_NULL_CACHE_DIR=...` env var to use; that variable redirects null-stats lookups to the bundled cache instead of `~/.dlm-sway/`. A bundled golden JSON makes it trivial for the consumer to verify their re-run matches yours. Defaults to a 50 MB pack-size cap; override with `--max-size-mb`.

## Status

Alpha. `v0.1.0` is the first published tag on PyPI (as `dlm-sway`); the API may break until v1.0, when semantic versioning kicks in. For unreleased changes, install editable from source (see [Install from source](#install-from-source)).

## License

MIT