# Probe-driven training

Close the loop between a differential-testing eval harness and the trainer: failing probes flow back into the document, the adapter retrains, and the next eval run measures improvement.

Two directions:

- **Pull**: `dlm harvest --sway-json` reads a sway JSON report and appends failing probes as `::instruction::` sections tagged `!probe`, with `auto_harvest: true` for provenance.
- **Push**: `dlm train --listen-rpc` opens a JSON-RPC endpoint that accepts `inject_probe` pushes during `--watch` mode; probes enter a queue and drain at the next cycle boundary.

Both paths assume you run the eval harness (sway or equivalent) separately; dlm owns the document edit and the retrain, not the eval.

## Pull path: harvesting a sway report

Sway emits a JSON report describing per-probe outcomes. Extract failing probes that have reference answers back into the document:

```bash
# Dry-run first: shows what would be added, no writes.
dlm harvest mydoc.dlm --sway-json sway-run-1.json

# Apply after review:
dlm harvest mydoc.dlm --sway-json sway-run-1.json --apply
```

What lands on disk: for each failing probe with both `evidence.prompt` and `evidence.reference`, one `::instruction::` section in the shape

```
::instruction::
### Q !probe
### A
::
```

The section carries `auto_harvest: true` and `harvest_source: "/"` for traceability.

### Harvest flags

| Flag | Effect |
|---|---|
| `--sway-json PATH` | Required. Path to the sway report. |
| `--apply` | Write changes to disk. Default: dry-run. |
| `--dry-run` | Explicit dry-run (default). |
| `--revert` | Strip all `auto_harvest: true` sections. Mutually exclusive with `--sway-json`. |
| `--tag NAME` | Override the default `auto-harvest` tag in `harvest_source`. |
| `--min-confidence F` | Drop candidates below this confidence threshold. |
| `--strict` / `--lax` | Strict: fail if any failing probe lacks a reference. Lax: skip and log. |

### Refusals

- `--sway-json` missing → exit 1
- Sway JSON malformed → exit 1
- No failing probes with references → exit 2 (no candidates)
- `--revert` + `--sway-json` → exit 1 (mutually exclusive)
- Strict mode + probe without reference → exit 1 (hint: `--lax`)

### Revert path

If a harvest pass pulls in noise (bad prompt wording, duplicated content), revert in one command:

```bash
dlm harvest mydoc.dlm --revert
```

All sections with `auto_harvest: true` are stripped; hand-authored sections stay. This is coarser than "undo the last harvest" by design: users audit the diff before `--apply`, so "undo all auto-edits" is the safe escape hatch.

## Push path: live probe injection

For a long-running `--watch` session, open an RPC endpoint so an external sway (or equivalent) process can push failing probes as they arrive:

```bash
export DLM_PROBE_TOKEN=$(openssl rand -hex 16)
dlm train mydoc.dlm --watch --listen-rpc 127.0.0.1:7429
```

The server accepts POSTs at `/rpc`:

```http
POST /rpc HTTP/1.1
Authorization: Bearer
Content-Type: application/json

{
  "method": "inject_probe",
  "params": {
    "prompt": "What does DGEMM compute?",
    "reference": "A double-precision general matrix multiplication.",
    "tags": ["nightly-ci"]
  }
}
```

Successful response:

```json
{"accepted": true, "next_cycle_eta_s": 0, "queue_depth": 1}
```

### Status codes

- `200` accepted and queued
- `400` malformed payload (bad JSON, missing fields, non-string tags)
- `401` missing or invalid bearer token
- `404` unknown method or path
- `429` queue past capacity (default 1000)

### Security notes

- **Localhost-only in v1.** The endpoint binds whatever host you pass; use `127.0.0.1` unless you know what you're doing. Remote pushes are a training-data-poisoning vector.
- **Bearer token is mandatory.** Without `DLM_PROBE_TOKEN` set, the flag refuses at startup. The server compares tokens in constant time.
- **Body size capped at 64 KiB.** This bounds the DoS surface.
- **Queue is bounded.** Past capacity, the server returns `429`; the client should retry after the next cycle drain.

### Combining pull and push

You can use both: push for real-time streaming during a `--watch` session, then harvest the accumulated sway reports later to capture anything that didn't reach the live endpoint. The two paths share the same on-disk shape, so the retrain behavior is identical.

## What the trainer sees

A harvested or injected probe becomes a `### Q !probe` pair in the document. At training time:

- **Row building**: the `!probe` marker is stripped before the strict instruction parser runs, so the pair trains as a normal SFT example.
- **Probe extraction**: `dlm.eval.probes` picks up the same marker and uses the pair as an explicit probe prompt for post-train eval.

The effect: every harvested probe both *trains the model to answer it right* and *gets reused as an eval prompt on the retrained adapter*. That's the closed loop: sway's complaint becomes a training example and a regression check in one section.

## Reference

- `dlm harvest`: `docs/cli/reference.md`
- Section schema (`auto_harvest`, `harvest_source`): `docs/format/frontmatter.md`
- Sway report format: upstream sway docs
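
## Example: a minimal push client

The push path described above can be exercised from any HTTP client. The following is a minimal sketch using only the Python standard library, not an official dlm client: the helper names `build_inject_probe_request` and `push_probe` are hypothetical, the URL assumes the `127.0.0.1:7429` address from the `--listen-rpc` example, and the shape of error response bodies is not documented here, so the sketch does not parse them.

```python
import json
import urllib.error
import urllib.request

# Assumption: matches the host:port passed to --listen-rpc in the example above.
RPC_URL = "http://127.0.0.1:7429/rpc"
MAX_BODY_BYTES = 64 * 1024  # the documented request body cap


def build_inject_probe_request(prompt, reference, tags, token, url=RPC_URL):
    """Build (but do not send) an inject_probe POST. Hypothetical helper."""
    payload = {
        "method": "inject_probe",
        "params": {"prompt": prompt, "reference": reference, "tags": list(tags)},
    }
    data = json.dumps(payload).encode("utf-8")
    if len(data) > MAX_BODY_BYTES:
        # Fail locally instead of sending a body the server would reject.
        raise ValueError("payload exceeds the 64 KiB body cap")
    return urllib.request.Request(
        url,
        data=data,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def push_probe(request, timeout=5.0):
    """Send the request and return (status_code, parsed_body_or_None)."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return resp.status, json.loads(resp.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        # 400, 401, 404 and 429 land here; the error body is left unparsed.
        return err.code, None
```

A caller would read the token from `DLM_PROBE_TOKEN`, build the request, and pass it to `push_probe`. On a `429` status the reasonable reaction is to wait for the next cycle drain and retry; on `401`, check that the client and the trainer share the same token.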