
# sway

Differential testing for fine-tuned causal language models.

> **Alpha — v0.1.0 on PyPI.** API is not stable; semantic versioning applies only from v1.0 onward. Feedback + issues welcome.

**One question:** *did LoRA/QLoRA training actually change model behavior in a meaningful way, or is the model just defaulting to the pretrained base?*

`sway` gives you a trustworthy, reproducible answer with thirteen purpose-built primitives, each z-scored against a null-adapter baseline. No LLM judges. No external APIs. Deterministic on CPU where possible.

> **Naming convention.** The source repo and CLI entry point are both `sway`. The PyPI wheel is `dlm-sway` because the short `sway` name is taken on PyPI by an unrelated project. The CLI installed by `pip install dlm-sway` is still `sway` — mismatched wheel/command names are a common Python packaging pattern (see `pyyaml` → `import yaml`).

## Install

```bash
# HF + PEFT backend — required for real models
pip install "dlm-sway[hf]"

# Extras composable as usual
pip install "dlm-sway[hf,style,semsim]"
pip install "dlm-sway[all]"

# .dlm auto-suite generation (requires the DLM sibling project)
pip install "dlm-sway[dlm]"
```

Available extras:

- `[hf]` — HuggingFace + PEFT backend (required for real models)
- `[mlx]` — Apple Silicon MLX backend (darwin-arm64 only)
- `[style]` — stylistic fingerprint extensions (spaCy + textstat + nlpaug)
- `[semsim]` — sentence-transformers for the revert probe
- `[dlm]` — auto-generate suites from `.dlm` documents
- `[viz]` — matplotlib plots
- `[all]` — everything

Verify the install:

```bash
sway --version
sway doctor
```

## Install from source

For the development HEAD (unreleased changes, contributor workflow):

```bash
git clone https://github.com/tenseleyFlow/sway.git
cd sway

uv venv --python 3.11 .venv      # or: python -m venv .venv
source .venv/bin/activate
uv pip install -e ".[hf]" --group dev
```

## 90-second smoke test

```bash
sway check path/to/adapter --base HuggingFaceTB/SmolLM2-135M-Instruct
```

Outputs a verdict in under a minute on CPU for small models: *your adapter is 4.2σ above noise* ✅ or *indistinguishable from a null adapter* ❌.

## Full suite

```yaml
# sway.yaml
version: 1
models:
  base: {kind: hf, base: "HuggingFaceTB/SmolLM2-135M-Instruct"}
  ft:   {kind: hf, base: "HuggingFaceTB/SmolLM2-135M-Instruct",
         adapter: "./runs/adapter/v0003"}
suite:
  - {name: null_baseline,       kind: null_adapter, runs: 3}
  - {name: doc_divergence,      kind: delta_kl,
     prompts: ["The key insight is", "An important rule"]}
  - {name: section_attribution, kind: section_internalization}
  - {name: no_leakage,          kind: leakage}
  - {name: ablation_shape,      kind: adapter_ablation,
     prompts: ["Tell me more about"]}
```

```bash
sway run sway.yaml              # full report to terminal + JSON
sway gate sway.yaml --junit     # CI-friendly; non-zero on fail

# Override the composite weights on the command line (partial overrides
# are fine — unspecified categories keep their defaults):
sway run sway.yaml --weights "attribution=0.5,adherence=0.2"
```

Inside `sway.yaml`, tuning knobs in `defaults` include (a minimal example follows the list):

- `seed` — passed to `seed_everything` before any probe runs.
- `differential` (default `true`) — toggles between the single-load PEFT path and a two-model load (doubled memory, rarely needed; for custom backends that can't do in-place adapter toggling).
- `score_weights` — per-category weight overrides baked into the spec so CI runs reproduce the same score without a CLI flag.
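
A minimal `defaults` block, for illustration only — the key names come from the list above, the values are made up:

```yaml
defaults:
  seed: 1234              # forwarded to seed_everything before the first probe
  differential: true      # single-load PEFT path; set false for a two-model load
  score_weights:          # same categories the --weights CLI flag accepts
    attribution: 0.5
    adherence: 0.2
```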

## Why it exists

Standard benchmarks (MMLU, HellaSwag) ask *"how good is this model?"* That's the wrong question after a targeted LoRA fine-tune on a small user-authored document. The right question is *"did the adapter actually move the model toward what I wrote?"* — and existing tools answer this poorly.

`sway` answers it directly via thirteen primitives — twelve across four categories, plus a baseline-calibration primitive:

| Category    | Primitives |
|-------------|------------|
| Adherence   | `delta_kl`, `adapter_revert`, `prompt_collapse`, `cluster_kl` |
| Attribution | `section_internalization`, `paraphrase_invariance`, `preference_flip` |
| Calibration | `style_fingerprint`, `calibration_drift`, `leakage`, `external_perplexity` |
| Ablation    | `adapter_ablation` ← the signature primitive |
| Baseline    | `null_adapter` (powers every z-score in the report) |

**The signature primitive.** `adapter_ablation` scales the LoRA additive term by λ ∈ {0, 0.25, 0.5, 0.75, 1.0, 1.25} and measures the divergence curve. A healthy fine-tune shows a smooth, monotonic, non-saturated response. A degenerate one shows a step function or an overshoot-then-crash. Nobody else does this because nobody else gets this close to the adapter math.
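
To make the λ sweep concrete, here is a self-contained toy of the underlying math — a random linear layer standing in for a transformer block, with the LoRA term scaled before the divergence is measured. This is not sway's implementation; every tensor here is synthetic.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_in, d_out, r, alpha = 64, 32, 8, 16

W0 = torch.randn(d_out, d_in) / d_in**0.5   # frozen base weight
A = 0.1 * torch.randn(r, d_in)              # stand-in "trained" LoRA factors
B = 0.1 * torch.randn(d_out, r)
x = torch.randn(d_in)

base_logp = F.log_softmax(W0 @ x, dim=-1)
for lam in (0.0, 0.25, 0.5, 0.75, 1.0, 1.25):
    # scale only the additive term: W_eff = W0 + λ · (α/r) · B·A
    logits = (W0 + lam * (alpha / r) * (B @ A)) @ x
    scaled_logp = F.log_softmax(logits, dim=-1)
    # KL(scaled ‖ base); a healthy curve rises smoothly with λ
    kl = F.kl_div(base_logp, scaled_logp, log_target=True, reduction="sum")
    print(f"λ={lam:4.2f}  KL={kl.item():.4f}")
```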

**The calibration.** Every numeric probe z-scores its raw metric against a null-adapter baseline — a same-structure LoRA with random-init weights. "Your adapter's KL is 4.2σ above noise" is a far stronger claim than a fixed threshold. The null-adapter calibration requires a backend that implements `NullCalibratedBackend` (the HF backend does); probes that can't be calibrated (e.g., `adapter_revert` needs an embedder, which the null proxy lacks) surface `(no calibration)` in the report and fall back to fixed thresholds. Calibration stats are cached on disk under `~/.dlm-sway/null-stats/`, keyed by backend identity.
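
The z-score itself is ordinary standardization. A sketch of the arithmetic — illustrative only; `z_score` and its inputs are not sway's API:

```python
import statistics

def z_score(adapter_metric: float, null_runs: list[float]) -> float:
    """Standard deviations above the null-adapter noise floor."""
    mu = statistics.fmean(null_runs)
    sigma = statistics.stdev(null_runs)
    return (adapter_metric - mu) / sigma

# e.g. three null_adapter runs of a KL probe vs. the real adapter's KL
print(f"{z_score(0.194, [0.11, 0.09, 0.13]):+.1f}σ above noise")  # → +4.2σ
```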

**The rank profile.** `null_adapter` takes an optional `rank_multipliers: list[float]` (default `[1.0]`). Pass `[0.5, 1.0, 2.0]` and every numeric probe carries a three-point z-score curve: `z=+4.2σ @ 1x / +6.8σ @ 0.5x / +2.1σ @ 2x`. The shape is diagnostic:

- **Flat or slightly rising toward 0.5x** — adapter signal is rank-stable, roughly independent of noise energy.
- **Sharply higher at 0.5x, lower at 2x** — adapter is rank-saturated: a smaller rank would have yielded a clearer separation from noise. Consider halving `r`.
- **Low everywhere** — adapter is barely above noise at any rank; the signal is real but weak.

Caveat: high z at low rank can also mean the low-rank null is *pathologically quiet* rather than that the adapter is strong. Read the profile as a shape, not a scalar — if all three z's move proportionally, the adapter is doing work; if they spread apart, the rank is mis-sized.

Implementation note: rank scaling is mathematically equivalent to multiplying the null noise std by `sqrt(rank_scale)` (LoRA's A·B output variance scales linearly with rank). The shipped backends apply that scaling rather than reshaping PEFT tensors — no model reload, no rank-specific adapter cache, same `alpha/r` scaling throughout.
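
The variance claim is easy to check numerically. A small sketch (synthetic matrices, not sway code) showing that the output variance of a random-init B·A grows linearly with rank, so the std grows with the square root of the rank multiplier:

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma, trials = 256, 0.02, 200
x = rng.normal(size=d)

for r in (4, 8, 16):                              # doubling rank each step
    variances = []
    for _ in range(trials):
        A = rng.normal(0.0, sigma, size=(r, d))   # random-init null factors
        B = rng.normal(0.0, sigma, size=(d, r))
        variances.append((B @ (A @ x)).var())
    v = np.mean(variances)
    print(f"r={r:2d}  var={v:.3e}  var/r={v / r:.3e}")  # var/r ≈ constant
```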

**Determinism.** Every `sway run` calls `seed_everything(spec.defaults.seed)` before the first probe — it seeds the python/numpy/torch RNGs and asks torch for deterministic algorithms (`CUBLAS_WORKSPACE_CONFIG=:4096:8`). The report footer prints the achieved class — `strict` (CUDA), `best_effort` (CPU/MPS), or `loose` (deterministic algorithms refused). Same seed + same host = bit-identical scoring across runs.
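
A hedged sketch of what such a helper typically does — the real `seed_everything` lives in sway and may differ in detail:

```python
import os
import random

import numpy as np
import torch

def seed_everything(seed: int) -> str:
    """Seed all RNGs, request deterministic kernels, report the achieved class."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)                      # also seeds CUDA, if present
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    try:
        torch.use_deterministic_algorithms(True)
    except RuntimeError:
        return "loose"                           # deterministic algorithms refused
    return "strict" if torch.cuda.is_available() else "best_effort"
```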

## Pytest integration

For teams already testing their training pipeline with pytest, sway ships a plugin behind the `[pytest]` extra. A single decorator turns one pytest function into one test item per probe, plus an optional composite-score gate:

```python
import pytest

@pytest.mark.sway(spec="sway.yaml", threshold=0.6)
def test_adapter_healthy() -> None:
    """The decorator owns the body — a bare pass is conventional."""
```

`pytest -v` then reports:

```
test_sway_gate.py::test_adapter_healthy::adherence    PASSED
test_sway_gate.py::test_adapter_healthy::calibration  PASSED
test_sway_gate.py::test_adapter_healthy::__gate__     PASSED
```

`--junitxml` emits one `<testcase>` per probe, `pytest -k adherence` runs just that probe, and `FAIL` / `ERROR` / `SKIP` verdicts translate to pytest outcomes. See `examples/pytest_integration/` for a full before/after walkthrough.
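
For example, selecting one probe and emitting JUnit XML uses stock pytest flags (the report filename here is arbitrary):

```bash
pytest -v -k adherence --junitxml=sway-report.xml
```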

```bash
pip install 'dlm-sway[hf,pytest]'
```

## Pre-commit

For teams using [pre-commit.com](https://pre-commit.com), sway ships a `.pre-commit-hooks.yaml` declaring two hooks that run `sway gate` before every commit touching a spec, `.dlm` document, or adapter file. Add 4–5 lines to your `.pre-commit-config.yaml`:

```yaml
repos:
  - repo: https://github.com/tenseleyFlow/sway
    rev: 2ecd9a0c9d65a9b9576a185597c88f41444f9646  # pin to a SHA
    hooks:
      - id: sway-gate
        args: ["sway.yaml", "--threshold=0.6"]
```

Two variants ship; pick whichever fits your install posture:

| Hook | When to use | First-run cost |
|------|-------------|----------------|
| `sway-gate` | you already ran `pip install 'dlm-sway[hf]'` | ~none — uses the sway binary on your `PATH` |
| `sway-gate-isolated` | fresh venv, no existing sway install | ~2 min + ~5 GB — pre-commit builds a fresh venv and installs sway + torch + transformers |

The recommended default is `sway-gate`. Switch to `sway-gate-isolated` if you can't rely on a host-level sway install.

### Rev pinning

The example above pins to a commit SHA because the sway repo has no tagged release yet. Pinning to a mutable ref would silently drift your gate's behavior under every `pre-commit autoupdate`; a SHA is the honest pre-tag pattern. Bump it deliberately when you want to pick up upstream changes. Once the repo tags a release, the recipe switches to `rev: v0.1.0` and the SHA churn stops.

### Scope

The hook **only gates** — it exits non-zero on FAIL and zero on PASS. No `--json` / `--markdown` report flags are surfaced; those belong with `sway run` (ad hoc or in a separate CI job). This keeps `git commit` fast and the gate's verdict uncluttered.

See [`examples/precommit-example/`](examples/precommit-example/) for the full walk-through, including the `sway.yaml` template, the consumer-side `.pre-commit-config.yaml`, and the try-it-locally-before-you-install recipe.

## The `.dlm` integration

If you trained your adapter via the [DocumentLanguageModel project](https://github.com/tenseleyFlow/DocumentLanguageModel), `sway` auto-generates a test suite from your document's sections.

Install sway with the `[dlm]` extra alongside `[hf]` (shown here editable, from a source checkout):

```bash
# inside a clone of this repo
uv pip install -e ".[hf,dlm]"
```

Then:

```bash
sway autogen path/to/doc.dlm -o sway.yaml
sway run sway.yaml
```

Per-section attribution tells you *which* parts of your document actually moved the model — a kind of signal no other tool provides.

## Status

Alpha. The API will break before v1.0. `v0.1.0` is published on PyPI (see [Install](#install)); the development HEAD is installable editable from source (see [Install from source](#install-from-source)) and tracks the tip of `main`.

## License

MIT
