tenseleyflow/sway / 7b63dd0

Browse files

README + CHANGELOG: tool-use fidelity probe surface and Sprint 27 block

Authored by mfwolffe <wolffemf@dukes.jmu.edu>
SHA
7b63dd0dc809eff43ff701dd93e05761d629fe6d
Parents
a1417ac
Tree
4128c91

2 changed files

StatusFile+-
M CHANGELOG.md 66 0
M README.md 48 0
CHANGELOG.mdmodified
@@ -2,6 +2,72 @@
22
 
33
 ## Unreleased
44
 
5
+### Sprint 27 — `tool_use_fidelity` probe
6
+
7
+Closes the P1 "tool_use_fidelity probe" backlog item. The probe sway
8
+ships for anyone fine-tuning an adapter against a tool-using base —
9
+the dominant fine-tune target outside dlm-style document training,
10
+and the failure mode no other shipped probe catches.
11
+
12
+**New probe (`kind: tool_use_fidelity`, category: attribution).**
13
+
14
+For each `(prompt, tool_spec, gold_tool_name)` case the probe greedy-
15
+decodes from base + ft and scores three independent signals:
16
+
17
+- **JSON-schema validity delta** (`ft_valid_rate − base_valid_rate`):
18
+  the gate that catches "ft broke the base's tool-call format". Default
19
+  pass criterion tolerates a 5pp drop.
20
+- **Tool-name hallucination rate**: fraction of schema-valid ft calls
21
+  whose `name` falls outside the declared `allowed_tools` surface (or
22
+  differs from `gold_tool_name` per-case when no surface is set).
23
+  Default cap 10%.
24
+- **Argument-field disagreement rate** (informational): leaf-field
25
+  drift between matched base/ft schema-valid calls. v1 surfaces it
26
+  as evidence rather than gating on it — the per-token KL on
27
+  free-form argument values is deferred past v1 because it requires
28
+  alignment between two decoded strings.
29
+
30
+The probe greedy-generates from both views, parses each output via a
31
+forgiving extractor (whole-text JSON → fenced ```json``` block →
32
+first balanced `{...}` substring), then validates against an
33
+OpenAI-flavored schema subset (`string`, `integer`, `number`,
34
+`boolean`, `object`, `array` plus `required`). No `jsonschema`
35
+dependency added — core sway dependencies stay lean.
36
+
37
+`json_valid_rate_ft` is z-scored against the null-adapter baseline
38
+when `null_adapter` is in the suite. Calibration spec ships two
39
+sentinel tool-use cases so the null distribution carries useful
40
+signal.
41
+
42
+**Implementation modules:**
43
+- **`probes/tool_use_fidelity.py`** — `ToolUseCase` /
44
+  `ToolUseFidelitySpec` / `ToolUseFidelityProbe` plus
45
+  `_parse_tool_call` / `_matches_schema` / `_field_disagreement`
46
+  helpers. 573 LOC.
47
+- **`probes/__init__.py`** — registers the new probe.
48
+
49
+**Test surface:**
50
+- **`tests/unit/test_probe_tool_use_fidelity.py`** — 40 unit tests
51
+  covering verdict logic (PASS / FAIL on validity / FAIL on
52
+  hallucination / SKIP-no-cases), the parse helpers (whole-text /
53
+  fenced / embedded / unbalanced / strings-with-braces), schema
54
+  validation (required / type-tag / bool-not-int / extras-allowed),
55
+  field disagreement (identical / drifted / nested / list-as-leaf /
56
+  None-vs-absent), and the calibration handoff.
57
+- **`tests/integration/test_probe_tool_use_fidelity.py`** —
58
+  slow+online HF backend smoke on a tiny SmolLM2-135M LoRA. Verifies
59
+  the full code path executes without error and emits every
60
+  documented evidence key with rates in `[0, 1]`. Doesn't assert a
61
+  specific verdict — the 135M base isn't tool-fluent enough to make
62
+  the validity gate meaningful, only the plumbing.
63
+- **`tests/fixtures/tool_use_cases.yaml`** — eight hand-authored
64
+  cases spanning the most common tool shapes (search, file I/O,
65
+  Python exec, shell, HTTP fetch, DB query, calendar, calculator).
66
+
67
+**README** gains a "Tool-use fidelity" section between the `.dlm`
68
+integration and "Reproducing a sway run" — pitches the probe as
69
+sway's signal for agentic fine-tunes, with a worked YAML example.
70
+
571
 ### Sprint 26 — X3 sway pack / unpack (sway-side half of the cross-repo X1+X3 pair)
672
 
773
 Closes the X3 half of Audit 03's "make a sway run reproducible by a
README.mdmodified
@@ -335,6 +335,54 @@ actually moved the model — a kind of signal no other tool provides.
335335
 > sway-json` writes a ready-to-run `sway.yaml` next to the GGUF.
336336
 > Skip the manual `sway autogen` step entirely.
337337
 
338
+## Tool-use fidelity
339
+
340
+Adapters that target tool-calling behavior have a failure mode no
341
+other probe catches: they pass every adherence / attribution probe
342
+but silently degrade the base model's JSON-schema compliance, or
343
+start hallucinating tool names that aren't in the declared surface.
344
+The `tool_use_fidelity` probe scores three independent signals on
345
+each `(prompt, tool_spec)` case:
346
+
347
+- **JSON-schema validity delta** — `ft_valid_rate − base_valid_rate`.
348
+  A negative value means the adapter regressed on producing
349
+  schema-valid calls; the default pass criterion tolerates a 5pp
350
+  drop.
351
+- **Tool-name hallucination rate** — over schema-valid ft calls, the
352
+  fraction that pick a tool outside the declared `allowed_tools`
353
+  surface (or differ from `gold_tool_name` when no surface is set).
354
+  Default cap 10%.
355
+- **Argument-field disagreement rate** — over cases where both
356
+  views produce schema-valid calls, the per-leaf-field disagreement
357
+  rate between base and ft arguments. Surfaced as evidence — the
358
+  v1 surface doesn't gate on it.
359
+
360
+```yaml
361
+suite:
362
+  - name: tool_calls_intact
363
+    kind: tool_use_fidelity
364
+    cases:
365
+      - prompt: "Search the web for the capital of France."
366
+        tool_spec:
367
+          name: search_web
368
+          parameters:
369
+            type: object
370
+            properties: {query: {type: string}}
371
+            required: [query]
372
+        gold_tool_name: search_web
373
+    allowed_tools: [search_web, calculator, http_fetch]
374
+```
375
+
376
+The probe uses an OpenAI-style function-calling schema (the most
377
+common shape across hosted and OSS tool-use stacks). Other schema
378
+flavors (Anthropic, Llama-3.1, Gemini) land in a follow-up sprint;
379
+the v1 implementation deliberately ships the format that covers
380
+the largest user surface.
381
+
382
+`json_valid_rate_ft` is z-scored against the null-adapter baseline
383
+when `null_adapter` is in the suite — the principled "the adapter
384
+preserved the base's tool-call structure beyond noise" signal.
385
+
338386
 ## Reproducing a sway run
339387
 
340388
 Sometimes you want a coworker (or a future-you, or a bug report) to