@@ -2,6 +2,72 @@ |
| 2 | 2 | |
| 3 | 3 | ## Unreleased |
| 4 | 4 | |
| 5 | +### Sprint 27 — `tool_use_fidelity` probe |
| 6 | + |
| 7 | +Closes the P1 "tool_use_fidelity probe" backlog item. The probe sway |
| 8 | +ships for anyone fine-tuning an adapter against a tool-using base — |
| 9 | +the dominant fine-tune target outside dlm-style document training, |
| 10 | +and the failure mode no other shipped probe catches. |
| 11 | + |
| 12 | +**New probe (`kind: tool_use_fidelity`, category: attribution).** |
| 13 | + |
| 14 | +For each `(prompt, tool_spec, gold_tool_name)` case the probe greedy- |
| 15 | +decodes from base + ft and scores three independent signals: |
| 16 | + |
| 17 | +- **JSON-schema validity delta** (`ft_valid_rate − base_valid_rate`): |
| 18 | + the gate that catches "ft broke the base's tool-call format". Default |
| 19 | + pass criterion tolerates a 5pp drop. |
| 20 | +- **Tool-name hallucination rate**: fraction of schema-valid ft calls |
| 21 | + whose `name` falls outside the declared `allowed_tools` surface (or |
| 22 | + differs from `gold_tool_name` per-case when no surface is set). |
| 23 | + Default cap 10%. |
| 24 | +- **Argument-field disagreement rate** (informational): leaf-field |
| 25 | + drift between matched base/ft schema-valid calls. v1 surfaces it |
| 26 | + as evidence rather than gating on it — the per-token KL on |
| 27 | + free-form argument values is deferred past v1 because it requires |
| 28 | + alignment between two decoded strings. |
| 29 | + |
| 30 | +The probe greedy-generates from both views, parses each output via a |
| 31 | +forgiving extractor (whole-text JSON → fenced ```json``` block → |
| 32 | +first balanced `{...}` substring), then validates against an |
| 33 | +OpenAI-flavored schema subset (`string`, `integer`, `number`, |
| 34 | +`boolean`, `object`, `array` plus `required`). No `jsonschema` |
| 35 | +dependency added — core sway dependencies stay lean. |
| 36 | + |
| 37 | +`json_valid_rate_ft` is z-scored against the null-adapter baseline |
| 38 | +when `null_adapter` is in the suite. Calibration spec ships two |
| 39 | +sentinel tool-use cases so the null distribution carries useful |
| 40 | +signal. |
| 41 | + |
| 42 | +**Implementation modules:** |
| 43 | +- **`probes/tool_use_fidelity.py`** — `ToolUseCase` / |
| 44 | + `ToolUseFidelitySpec` / `ToolUseFidelityProbe` plus |
| 45 | + `_parse_tool_call` / `_matches_schema` / `_field_disagreement` |
| 46 | + helpers. 573 LOC. |
| 47 | +- **`probes/__init__.py`** — registers the new probe. |
| 48 | + |
| 49 | +**Test surface:** |
| 50 | +- **`tests/unit/test_probe_tool_use_fidelity.py`** — 40 unit tests |
| 51 | + covering verdict logic (PASS / FAIL on validity / FAIL on |
| 52 | + hallucination / SKIP-no-cases), the parse helpers (whole-text / |
| 53 | + fenced / embedded / unbalanced / strings-with-braces), schema |
| 54 | + validation (required / type-tag / bool-not-int / extras-allowed), |
| 55 | + field disagreement (identical / drifted / nested / list-as-leaf / |
| 56 | + None-vs-absent), and the calibration handoff. |
| 57 | +- **`tests/integration/test_probe_tool_use_fidelity.py`** — |
| 58 | + slow+online HF backend smoke on a tiny SmolLM2-135M LoRA. Verifies |
| 59 | + the full code path executes without error and emits every |
| 60 | + documented evidence key with rates in `[0, 1]`. Doesn't assert a |
| 61 | + specific verdict — the 135M base isn't tool-fluent enough to make |
| 62 | + the validity gate meaningful, only the plumbing. |
| 63 | +- **`tests/fixtures/tool_use_cases.yaml`** — eight hand-authored |
| 64 | + cases spanning the most common tool shapes (search, file I/O, |
| 65 | + Python exec, shell, HTTP fetch, DB query, calendar, calculator). |
| 66 | + |
| 67 | +**README** gains a "Tool-use fidelity" section between the `.dlm` |
| 68 | +integration and "Reproducing a sway run" — pitches the probe as |
| 69 | +sway's signal for agentic fine-tunes, with a worked YAML example. |
| 70 | + |
| 5 | 71 | ### Sprint 26 — X3 sway pack / unpack (sway-side half of the cross-repo X1+X3 pair) |
| 6 | 72 | |
| 7 | 73 | Closes the X3 half of Audit 03's "make a sway run reproducible by a |