# sway
Differential testing for fine-tuned causal language models.
> **Alpha — v0.1.0 on PyPI.** API is not stable; semantic versioning
> applies only from v1.0 onward. Feedback + issues welcome.
**One question:** *did LoRA/QLoRA training actually change model behavior
in a meaningful way, or is the model just defaulting to the pretrained
base?*
`sway` gives you a trustworthy, reproducible answer with sixteen
purpose-built primitives, z-scored against a null-adapter baseline
wherever calibration applies.
No LLM judges. No external APIs. Deterministic on CPU where possible.
> **Naming convention.** The source repo and CLI entry point are both
> `sway`. The PyPI wheel is `dlm-sway` because the short `sway` name is
> taken on PyPI by an unrelated project. The CLI installed by
> `pip install dlm-sway` is `sway` — mismatched distribution and command
> names are common on PyPI (see `pyyaml` → `import yaml`).
## Install

```bash
# HF + PEFT backend — required for real models
pip install "dlm-sway[hf]"

# Extras composable as usual
pip install "dlm-sway[hf,style,semsim]"
pip install "dlm-sway[all]"

# .dlm auto-suite generation (requires the DLM sibling project)
pip install "dlm-sway[dlm]"
```
Available extras:

- `[hf]` — HuggingFace + PEFT backend (required for real models)
- `[mlx]` — Apple Silicon MLX backend (darwin-arm64 only)
- `[style]` — stylistic fingerprint extensions (spaCy + textstat + nlpaug)
- `[semsim]` — sentence-transformers for the revert probe
- `[dlm]` — auto-generate suites from `.dlm` documents
- `[viz]` — matplotlib plots
- `[all]` — everything
Verify the install:

```bash
sway --version
sway doctor
```
## MLX backend (Apple Silicon)

Install the extra and point sway at any PEFT adapter — the MLX backend
auto-converts on first load and caches the result under
`~/.cache/dlm-sway/mlx-converted/<sha>/`. No manual `.npz` step.

```bash
pip install "dlm-sway[mlx]"

# Either: let sway auto-convert when it loads the adapter
sway run sway.yaml   # spec sets models.ft.kind: mlx, points at any PEFT dir

# Or: convert explicitly (useful for inspection / scripting)
sway convert-adapter ~/path/to/peft-adapter ~/path/to/mlx-adapter
```

Conversion is one-shot — repeated `sway run` invocations on the same
adapter version short-circuit on a content hash. Supports standard
LoRA on `q_proj` / `v_proj` / etc. QLoRA 4-bit and full-weight
`modules_to_save` overrides aren't supported (the converter prints a
clear warning if it sees them).
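The short-circuit is just a content-address lookup. A minimal sketch of
the idea, assuming a hash over the adapter's config + weight files (the
helper names here are illustrative, not sway's internals):

```python
import hashlib
from pathlib import Path

CACHE_ROOT = Path.home() / ".cache" / "dlm-sway" / "mlx-converted"

def adapter_sha(adapter_dir: Path) -> str:
    """Hash the PEFT adapter payload so any retrain invalidates the cache."""
    h = hashlib.sha256()
    for f in sorted(adapter_dir.glob("adapter_*")):  # config + safetensors
        h.update(f.name.encode())
        h.update(f.read_bytes())
    return h.hexdigest()

def cached_conversion(adapter_dir: Path) -> Path | None:
    """Return the already-converted MLX directory for this adapter version, if any."""
    hit = CACHE_ROOT / adapter_sha(adapter_dir)
    return hit if hit.is_dir() else None
```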
## Install from source

For the development HEAD (unreleased changes, contributor workflow):

```bash
git clone https://github.com/tenseleyFlow/sway.git
cd sway

uv venv --python 3.11 .venv    # or: python -m venv .venv
source .venv/bin/activate
uv pip install -e ".[hf]" --group dev
```
## 90-second smoke test

```bash
sway check path/to/adapter --base HuggingFaceTB/SmolLM2-135M-Instruct
```

Outputs a verdict in under a minute on CPU for small models: *your
adapter is 4.2σ above noise* ✅ or *indistinguishable from a null
adapter* ❌.
## Pre-run diagnostics

If your adapter was trained with [DocumentLanguageModel](https://github.com/tenseleyFlow/DocumentLanguageModel),
`sway check` runs the **`gradient_ghost`** probe first — a ~50 ms
disk-only inspection of dlm's `training_state.pt`. It catches
adapters that are obviously broken (e.g. left at `--max-steps 5`
from a smoke test) **before** the model ever loads, with a banner
on top of the regular verdict:

```
⚠️ PRE-RUN ALERT — gradient_ghost flagged severe undertraining
severely undertrained: global_step=2 < threshold 50.
The probe scores below may be unreliable. Consider retraining.
```
The probe's signal ladder, in order of decisiveness:

1. `global_step` below a configurable threshold (default 50) → FAIL.
2. Every per-param `exp_avg_sq` is NaN → FAIL.
3. More than 30% of LoRA layers show >2× the lowest layer's
   gradient variance → FAIL or WARN.
4. Otherwise → PASS.
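A minimal sketch of that ladder, assuming `training_state.pt` carries a
`global_step` plus a standard Adam optimizer state dict with per-param
`exp_avg_sq` tensors (the shipped probe is more defensive about layout):

```python
import torch

def gradient_ghost(state_path: str, min_steps: int = 50) -> str:
    """Disk-only signal ladder over a training_state.pt — no model load."""
    state = torch.load(state_path, map_location="cpu")
    if state["global_step"] < min_steps:
        return "FAIL"  # rung 1: severely undertrained

    sq = [s["exp_avg_sq"] for s in state["optimizer"]["state"].values()
          if "exp_avg_sq" in s]
    if sq and all(torch.isnan(t).all() for t in sq):
        return "FAIL"  # rung 2: optimizer never accumulated a finite gradient

    if sq:
        # rung 3: exp_avg_sq tracks grad², so its per-layer mean proxies gradient variance
        per_layer = torch.stack([t.float().mean() for t in sq])
        if (per_layer > 2 * per_layer.min()).float().mean() > 0.30:
            return "WARN"  # the shipped probe escalates to FAIL on severe spread

    return "PASS"
```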
Specs that contain *only* pre-run diagnostics like `gradient_ghost`
skip backend construction entirely — `sway run` against a pre-flight
spec doesn't load a model, doesn't allocate GPU memory, doesn't pay
the cold-start tax. Useful for `pre-commit` gates that should
fast-path adapter sanity before authorizing the full battery.
The **`training_drift`** probe pairs with `gradient_ghost`: where the
ghost reads optimizer state at end-of-training, drift reads the loss
*curve* during training (from dlm's per-step JSONL logs):

```
⚠️ training_drift flagged unstable training
smoothness=0.42, convergence_ratio=0.91, instability_events=4
```

Four metrics: `final_loss`, `convergence_ratio`, `smoothness`
(1 − var(Δloss)/var(loss)), and `instability_events` (loss-increase
spikes that exceed local typical movement). Verdict PASS when all
three of (smoothness ≥ 0.7, instability_events == 0,
convergence_ratio ≤ 0.7) hold; else WARN. Like `gradient_ghost`, runs in
~10 ms with no model load.
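The formulas restate compactly in numpy. A sketch of the metric math,
given a per-step loss array parsed from the JSONL logs (the instability
detector here is a simplified stand-in for the probe's local-movement
test):

```python
import numpy as np

def training_drift(loss: np.ndarray) -> dict:
    """Loss-curve health metrics; `loss` is one float per training step."""
    fifth = max(len(loss) // 5, 1)
    deltas = np.diff(loss)
    smoothness = 1.0 - deltas.var() / loss.var()            # 1 − var(Δloss)/var(loss)
    convergence_ratio = loss[-fifth:].mean() / loss[:fifth].mean()
    # simplified: a spike is a loss increase beyond 2× the typical step-to-step movement
    instability_events = int((deltas > 2 * np.abs(deltas).mean()).sum())
    verdict = ("PASS" if smoothness >= 0.7 and instability_events == 0
               and convergence_ratio <= 0.7 else "WARN")
    return {"final_loss": float(loss[-1]),
            "convergence_ratio": float(convergence_ratio),
            "smoothness": float(smoothness),
            "instability_events": instability_events,
            "verdict": verdict}
```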
## Full suite

```yaml
# sway.yaml
version: 1
models:
  base: {kind: hf, base: "HuggingFaceTB/SmolLM2-135M-Instruct"}
  ft:   {kind: hf, base: "HuggingFaceTB/SmolLM2-135M-Instruct",
         adapter: "./runs/adapter/v0003"}
suite:
  - {name: null_baseline, kind: null_adapter, runs: 3}
  - {name: doc_divergence, kind: delta_kl,
     prompts: ["The key insight is", "An important rule"]}
  - {name: section_attribution, kind: section_internalization}
  - {name: no_leakage, kind: leakage}
  - {name: ablation_shape, kind: adapter_ablation,
     prompts: ["Tell me more about"]}
```

```bash
sway run sway.yaml            # full report to terminal + JSON
sway gate sway.yaml --junit   # CI-friendly; non-zero on fail

# Override the composite weights on the command line (partial overrides
# are fine — unspecified categories keep their defaults):
sway run sway.yaml --weights "attribution=0.5,adherence=0.2"
```
Inside `sway.yaml`, tuning knobs in `defaults` include:

- `seed` — passed to `seed_everything` before any probe runs.
- `differential` (default `true`) — toggle between the single-load PEFT
  path and a two-model load (doubled memory, rarely needed; for custom
  backends that can't do in-place adapter toggling).
- `score_weights` — per-category weight overrides baked into the spec so
  CI runs reproduce the same score without a CLI flag.
## Why it exists

Standard benchmarks (MMLU, HellaSwag) ask *"how good is this model?"*
That's the wrong question after a targeted LoRA fine-tune on a small
user-authored document. The right question is *"did the adapter actually
move the model toward what I wrote?"* — and existing tools answer this
poorly.
`sway` answers it directly via sixteen primitives across four
categories, plus a baseline-calibration primitive:

| Category    | Primitives |
|-------------|------------|
| Adherence   | `delta_kl`, `adapter_revert`, `prompt_collapse`, `cluster_kl`, `multi_turn_coherence_decay` |
| Attribution | `section_internalization`, `paraphrase_invariance`, `preference_flip`, `tool_use_fidelity` |
| Calibration | `style_fingerprint`, `calibration_drift`, `leakage`, `external_perplexity`, `gradient_ghost`, `training_drift` |
| Ablation    | `adapter_ablation` ← the signature primitive |
| Baseline    | `null_adapter` (powers every z-score in the report) |
**The signature primitive.** `adapter_ablation` scales the LoRA additive
term by λ ∈ {0, 0.25, 0.5, 0.75, 1.0, 1.25} and measures the divergence
curve. A healthy fine-tune shows a smooth, monotonic, non-saturated
response. A degenerate one shows a step function or an
overshoot-then-crash. Nobody else does this because nobody else gets
this close to the adapter math.
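Conceptually the sweep looks like this — a sketch, where
`set_adapter_scale` and `kl_vs_base` stand in for backend internals that
aren't part of the public surface:

```python
import numpy as np

LAMBDAS = [0.0, 0.25, 0.5, 0.75, 1.0, 1.25]

def ablation_curve(backend, prompts: list[str]) -> dict:
    """Trace divergence from base as the LoRA term h = Wx + λ·BAx is scaled."""
    kl = []
    for lam in LAMBDAS:
        backend.set_adapter_scale(lam)   # hypothetical hook: scales the additive term
        kl.append(np.mean([backend.kl_vs_base(p) for p in prompts]))
    kl = np.asarray(kl)
    steps = np.diff(kl)
    return {
        "monotonic": bool((steps >= -1e-6).all()),              # healthy: divergence grows with λ
        "step_function": bool(steps.max() > 0.8 * np.ptp(kl)),  # one rung carries the whole jump
        "overshoot": bool(kl.argmax() < len(kl) - 1),           # peaks before λ = 1.25
    }
```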
**The calibration.** Every numeric probe z-scores its raw metric against
a null-adapter baseline — a same-structure LoRA with random-init weights.
"Your adapter's KL is 4.2σ above noise" is a far stronger claim than a
fixed threshold. The null-adapter calibration requires a backend that
implements `NullCalibratedBackend` (the HF backend does); probes that
can't be calibrated (e.g., `adapter_revert` needs an embedder, which the
null proxy doesn't have) surface `(no calibration)` in the report and
fall back to fixed thresholds. Calibration stats are cached on disk
under `~/.dlm-sway/null-stats/`, keyed by backend identity.
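The z-score itself is the ordinary one, taken over repeated null-adapter
runs (a sketch; the cached stats ultimately reduce to a mean and std per
metric):

```python
import statistics

def z_above_noise(ft_metric: float, null_samples: list[float]) -> float:
    """Sigmas separating the fine-tuned adapter from the null-adapter noise floor."""
    mu = statistics.mean(null_samples)    # e.g. 3 samples from {kind: null_adapter, runs: 3}
    sigma = statistics.stdev(null_samples) or 1e-9
    return (ft_metric - mu) / sigma

# "4.2σ above noise" is z_above_noise(ft_kl, null_kls) ≈ 4.2
```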
**The rank profile.** `null_adapter` takes an optional
`rank_multipliers: list[float]` (default `[1.0]`). Pass
`[0.5, 1.0, 2.0]` and every numeric probe carries a three-point
z-score curve: `z=+4.2σ @ 1x / +6.8σ @ 0.5x / +2.1σ @ 2x`. The shape
is diagnostic:

- **Flat or slightly rising toward 0.5x** — adapter signal is
  rank-stable, roughly independent of noise energy.
- **Sharply higher at 0.5x, lower at 2x** — adapter is rank-saturated:
  a smaller rank would have yielded a clearer separation from noise.
  Consider halving `r`.
- **Low everywhere** — adapter is barely above noise at any rank;
  the signal is real but weak.

Caveat: high z at low rank can also mean the low-rank null is
*pathologically quiet* rather than that the adapter is strong. Read the
profile as a shape, not a scalar — if all three z's move proportionally,
the adapter is doing work; if they spread apart, the rank is mis-sized.
Implementation note: rank scaling is mathematically equivalent to
multiplying the null noise std by `sqrt(rank_scale)` (LoRA's A·B output
variance scales linearly with rank). The shipped backends apply that
scaling rather than reshaping PEFT tensors — no model reload, no
rank-specific adapter cache, same `alpha/r` scaling throughout.
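That variance claim is easy to sanity-check numerically: a random-init
(null) LoRA output B·A·x is a sum of r independent rank-1 contributions,
so its variance grows linearly in r and its std like √r. A quick numpy
check (not library code):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=256)

def null_lora_std(r: int, trials: int = 500) -> float:
    """Std of one output coordinate of B·A·x over random-init LoRA draws."""
    samples = [
        (rng.normal(scale=0.02, size=(256, r))
         @ (rng.normal(scale=0.02, size=(r, 256)) @ x))[0]
        for _ in range(trials)
    ]
    return float(np.std(samples))

print(null_lora_std(8) / null_lora_std(4))    # ≈ sqrt(2) ≈ 1.41
print(null_lora_std(16) / null_lora_std(4))   # ≈ sqrt(4) = 2.0
```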
**Determinism.** Every `sway run` calls `seed_everything(spec.defaults.seed)`
before the first probe — it seeds the python/numpy/torch RNGs and asks torch
for deterministic algorithms (`CUBLAS_WORKSPACE_CONFIG=:4096:8`). The
report footer prints the achieved class — `strict` (CUDA), `best_effort`
(CPU/MPS), or `loose` (deterministic algorithms refused). Same seed +
same host = bit-identical scoring across runs.
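A sketch of what a `seed_everything` of this shape does (the shipped
function may differ in detail; the determinism-class mapping follows the
report-footer description above):

```python
import os
import random

import numpy as np
import torch

def seed_everything(seed: int) -> str:
    """Seed every RNG in play and request deterministic kernels."""
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # needed for deterministic CUBLAS
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)   # also seeds all CUDA devices when present
    try:
        torch.use_deterministic_algorithms(True)
    except RuntimeError:      # refusal may also surface later, at the first offending op
        return "loose"
    return "strict" if torch.cuda.is_available() else "best_effort"
```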
## Pytest integration

For teams already testing their training pipeline with pytest, sway
ships a plugin behind the `[pytest]` extra. A single decorator turns
one pytest function into one test item per probe plus an optional
composite-score gate:

```python
import pytest

@pytest.mark.sway(spec="sway.yaml", threshold=0.6)
def test_adapter_healthy() -> None:
    """The decorator owns the body — a bare pass is conventional."""
```

`pytest -v` then reports:

```
test_sway_gate.py::test_adapter_healthy::adherence    PASSED
test_sway_gate.py::test_adapter_healthy::calibration  PASSED
test_sway_gate.py::test_adapter_healthy::__gate__     PASSED
```

`--junitxml` emits one `<testcase>` per probe, `pytest -k adherence`
runs just that probe, and `FAIL` / `ERROR` / `SKIP` verdicts translate to
pytest outcomes. See `examples/pytest_integration/` for a full
before/after walkthrough.

```bash
pip install 'dlm-sway[hf,pytest]'
```
## Pre-commit

For teams using [pre-commit.com](https://pre-commit.com), sway ships
a `.pre-commit-hooks.yaml` declaring three hooks that run `sway gate`
before every commit touching a spec, `.dlm` document, or adapter
file. Add 4–5 lines to your `.pre-commit-config.yaml`:

```yaml
repos:
  - repo: https://github.com/tenseleyFlow/sway
    rev: v0.1.0
    hooks:
      - id: sway-gate
        args: ["sway.yaml", "--threshold=0.6"]
```
Three variants ship; pick whichever fits your install posture:

| Hook | When to use | First-run cost |
|---|---|---|
| `sway-gate` | you already ran `pip install 'dlm-sway[hf]'` | ~none — uses the sway binary on your `PATH` |
| `sway-gate-isolated` | fresh venv, no existing sway install | ~2 min + ~5 GB — pre-commit builds a fresh venv and installs sway + torch + transformers |
| `sway-gate-docker` | zero-install hosts with docker available | ~1 min — pulls `ghcr.io/tenseleyflow/sway-gate:v0.1.0` (torch baked in, MiniLM weights pre-cached) |

The recommended default is `sway-gate`. Switch to
`sway-gate-isolated` if you can't rely on a host-level sway install.
Reach for `sway-gate-docker` on ephemeral CI runners where docker is
cheaper than a fresh venv.
### Rev pinning

The example above pins to the `v0.1.0` tag. Bump it deliberately
when you want to pick up a new release; `pre-commit autoupdate` will
surface newer tags when you run it explicitly.
### Scope

The hook **only gates** — exits non-zero on FAIL, zero on PASS. No
`--json` / `--markdown` report flags are surfaced; those belong in
`sway run` (ad-hoc or in a separate CI job). Keeps `git commit` fast
and the gate's verdict uncluttered.
See [`examples/precommit-example/`](examples/precommit-example/) for
the full walk-through, including the `sway.yaml` template, the
consumer-side `.pre-commit-config.yaml`, and the
try-it-locally-before-you-install recipe.
## GitHub Action

For repos that don't use pre-commit, the
[`tenseleyflow/sway-action`](https://github.com/tenseleyFlow/sway-action)
GitHub Action wraps `sway gate` plus a markdown report posted as a PR
comment in three lines of YAML:

```yaml
- uses: tenseleyflow/sway-action@v0.1.0
  with:
    spec-path: sway.yaml
```
The action handles install, caching, gate execution, edit-in-place PR comments (so rebases don't spam), and exit-code translation. A complete workflow:
```yaml
name: sway
on:
  pull_request:
    paths: ["**/*.dlm", "**/sway.yaml"]
permissions:
  contents: read
  pull-requests: write
jobs:
  sway:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: tenseleyflow/sway-action@v0.1.0
        with:
          spec-path: sway.yaml
```
Inputs include `fail-on` (`fail` / `warn` / `never`),
`comment-on-pr`, `upload-artifact`, and `sway-version` for pinning.
Outputs `sway-score`, `verdict`, and `report-path` are available to
downstream steps. See the action repo's README for the full surface.
## The `.dlm` integration

If you trained your adapter via the [DocumentLanguageModel
project](https://github.com/tenseleyFlow/DocumentLanguageModel), `sway`
auto-generates a test suite from your document's sections.

Install sway with the `[dlm]` extra alongside `[hf]` —
`pip install "dlm-sway[hf,dlm]"`, or editable from a clone:

```bash
# inside a clone of this repo
uv pip install -e ".[hf,dlm]"
```
Then:

```bash
sway autogen path/to/doc.dlm -o sway.yaml
sway run sway.yaml
```
Per-section attribution tells you *which* parts of your document
actually moved the model — a kind of signal no other tool provides.
> Heads up — once dlm publishes its sister change, `dlm export --to
> sway-json` writes a ready-to-run `sway.yaml` next to the GGUF.
> Skip the manual `sway autogen` step entirely.
## Tool-use fidelity

Adapters that target tool-calling behavior have a failure mode no
other probe catches: they pass every adherence / attribution probe
but silently degrade the base model's JSON-schema compliance, or
start hallucinating tool names that aren't in the declared surface.
The `tool_use_fidelity` probe scores three independent signals on
each `(prompt, tool_spec)` case (a scoring sketch follows at the end
of this section):

- **JSON-schema validity delta** — `ft_valid_rate − base_valid_rate`.
  A negative value means the adapter regressed on producing
  schema-valid calls; the default pass criterion tolerates a 5pp
  drop.
- **Tool-name hallucination rate** — over schema-valid ft calls, the
  fraction that pick a tool outside the declared `allowed_tools`
  surface (or differ from `gold_tool_name` when no surface is set).
  Default cap 10%.
- **Argument-field disagreement rate** — over cases where both
  views produce schema-valid calls, the per-leaf-field disagreement
  rate between base and ft arguments. Surfaced as evidence — the
  v1 surface doesn't gate on it.
```yaml
suite:
  - name: tool_calls_intact
    kind: tool_use_fidelity
    cases:
      - prompt: "Search the web for the capital of France."
        tool_spec:
          name: search_web
          parameters:
            type: object
            properties: {query: {type: string}}
            required: [query]
        gold_tool_name: search_web
        allowed_tools: [search_web, calculator, http_fetch]
```
The probe uses an OpenAI-style function-calling schema (the most common shape across hosted and OSS tool-use stacks). Other schema flavors (Anthropic, Llama-3.1, Gemini) land in a follow-up sprint; the v1 implementation deliberately ships the format that covers the largest user surface.
`json_valid_rate_ft` is z-scored against the null-adapter baseline
when `null_adapter` is in the suite — the principled "the adapter
preserved the base's tool-call structure beyond noise" signal.
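Here's the promised sketch of the three signals, using the `jsonschema`
package over already-parsed calls — `calls_base` / `calls_ft` are lists
of `{tool_name, arguments}` dicts, one per sampled completion; parsing
and the leaf-field diff are elided:

```python
import jsonschema

def is_valid(call: dict, schema: dict) -> bool:
    try:
        jsonschema.validate(call["arguments"], schema)
        return True
    except jsonschema.ValidationError:
        return False

def tool_fidelity(calls_base, calls_ft, schema, allowed_tools) -> dict:
    base_valid_rate = sum(is_valid(c, schema) for c in calls_base) / len(calls_base)
    ft_valid = [c for c in calls_ft if is_valid(c, schema)]
    ft_valid_rate = len(ft_valid) / len(calls_ft)
    hallucinated = sum(c["tool_name"] not in allowed_tools for c in ft_valid)
    return {
        "validity_delta": ft_valid_rate - base_valid_rate,           # pass: ≥ −0.05 (5pp tolerance)
        "hallucination_rate": hallucinated / max(len(ft_valid), 1),  # pass: ≤ 0.10
        # argument-field disagreement would diff leaf fields of paired
        # schema-valid base/ft arguments here — evidence only in v1
    }
```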
## Multi-turn coherence

Every other adherence probe is single-turn: one user message, one
ft response, one score. Adapters that pass `delta_kl` cleanly
frequently *forget their training* by turn 2 or 3 of a real
dialogue — the model's own previous responses fill the context
window and create compounding drift. The
`multi_turn_coherence_decay` probe rolls a multi-turn synthetic
dialogue per prompt and fits an exponential-decay curve to the
per-turn KL:
```yaml
suite:
  - name: holds_a_conversation
    kind: multi_turn_coherence_decay
    prompts:
      - "Explain how a neural network learns."
      - "What's the difference between TCP and UDP?"
    max_turns: 4
    assert_half_life_turns: 2.0
```
The probe reports `half_life_turns` (the turn at which adapter
influence is halved), a per-turn KL list, and a tiny ASCII
sparkline in the report message so you can see the curve shape
without opening the JSON. Bases without a `chat_template` SKIP
gracefully — multi-turn requires one to format dialogue history.

The probe deliberately **doesn't** z-score against the null-adapter
baseline (a null adapter has no coherence to decay; the null
distribution is meaningless). Fixed-threshold verdicts are the
published path. Mirrors `prompt_collapse`.
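The half-life falls out of an ordinary log-linear fit. A sketch of the
curve math (assuming strictly positive per-turn KL values):

```python
import numpy as np

def half_life_turns(per_turn_kl: list[float]) -> float:
    """Fit kl_t ≈ kl_0 · exp(−λ·t); return ln(2)/λ, the turn at which
    adapter influence halves (assert_half_life_turns gates on this)."""
    t = np.arange(len(per_turn_kl))
    slope, _ = np.polyfit(t, np.log(per_turn_kl), 1)  # log-linear least squares
    lam = max(-slope, 1e-9)                           # guard non-decaying curves
    return float(np.log(2) / lam)

print(half_life_turns([0.80, 0.55, 0.40, 0.28]))      # ≈ 2 turns
```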
## Serve daemon

Loading the HF backend takes ~15s every time you run `sway run`. For
notebook exploration, the `sway watch` retrain loop, or any flow that
fires the suite repeatedly against the same model, that startup is
the dominant cost. `sway serve` keeps the backend warm in a small
FastAPI daemon: the first request pays the load, subsequent requests
reuse the cached weights.

```bash
pip install 'dlm-sway[serve]'   # adds fastapi + uvicorn + httpx
sway serve --port 8787 --max-loaded-models 2
```
Default bind is `127.0.0.1`. The daemon refuses to start on a
non-loopback interface unless you pass `--api-key <token>`, after
which every non-`/health` request must carry
`Authorization: Bearer <token>`.
```bash
# With curl — the sweet spot is inside a notebook or watch loop.
curl -s -X POST http://localhost:8787/run \
  -H 'Content-Type: application/json' \
  -d "$(yq -o=json sway.yaml | jq -c '{spec: .}')" | jq .score
```

```python
# From a notebook (or any Python).
from dlm_sway.serve.client import ServeClient
from dlm_sway.suite.loader import load_spec

client = ServeClient("http://localhost:8787")
report = client.run(load_spec("sway.yaml"))
print(report["score"])            # full report shape mirrors `sway run --json`
print(report["request_seconds"])  # cold ~15s; warm ~2s
```
The cache is keyed on `(kind, base, adapter, dtype, device)` and capped
at `--max-loaded-models` (default 2). Loading a third distinct model
LRU-evicts the oldest, calling `backend.close()` to release the
weights. `GET /health` reports the currently-warm models;
`GET /stats` reports request count and mean latency.
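The eviction semantics are a plain LRU over the composite key. A sketch
(with `load_backend` as a hypothetical stand-in for the real
constructor):

```python
from collections import OrderedDict

class ModelCache:
    """LRU of warm backends keyed on (kind, base, adapter, dtype, device)."""

    def __init__(self, max_loaded: int = 2):
        self.max_loaded = max_loaded
        self._warm: OrderedDict[tuple, object] = OrderedDict()

    def get(self, key: tuple):
        if key in self._warm:
            self._warm.move_to_end(key)   # mark most-recently-used
            return self._warm[key]
        backend = load_backend(*key)      # hypothetical — pays the ~15s cold load
        self._warm[key] = backend
        if len(self._warm) > self.max_loaded:
            _, evicted = self._warm.popitem(last=False)  # drop least-recently-used
            evicted.close()               # release the weights
        return backend
```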
## Reproducing a sway run

Sometimes you want a coworker (or a future-you, or a bug report) to
re-execute exactly what you ran without recreating your environment.
`sway pack` bundles the spec + the source `.dlm` document + your
locally-cached null-stats + (optionally) a known-good report into a
single `*.swaypack.tar.gz` you can email or commit:

```bash
sway pack sway.yaml -o audit-2026-04.swaypack.tar.gz \
    --include-golden last-run.json
# wrote audit-2026-04.swaypack.tar.gz (4.2 KB) sections=312b null_stats=0 golden=yes

# share the tarball with a coworker
sway unpack audit-2026-04.swaypack.tar.gz
# extracted:  <cwd>/swaypack
# spec_path:  <cwd>/swaypack/sway.yaml
# ...
# To run the bundled spec:
#   SWAY_NULL_CACHE_DIR=<cwd>/swaypack/null-stats sway run <cwd>/swaypack/sway.yaml
```
The packed null-stats cache means the consumer doesn't pay the
re-calibration cost (~120 forward passes on a typical 10-probe
suite). The unpack output prints the exact `SWAY_NULL_CACHE_DIR=...`
env var to use; that variable points null-stats lookups at the bundled
cache instead of `~/.dlm-sway/`.

A bundled golden JSON makes it trivial for the consumer to verify
their re-run matches yours. Packs default to a 50 MB size cap;
override with `--max-size-mb`.
## Status

Alpha. The API will break before v1.0. `v0.1.0` is the first published
tag on PyPI; for unreleased changes, install editable from source
(see [Install from source](#install-from-source)).
## License

MIT