`d2010a9`

Audit Sprint 02 definition-of-done rollout

Authored by

espadonne 1 month ago

SHA: d2010a99fe1c2c2a501763f047d67aa6651d1e40
Parents: f0c4a17
Tree: d2ed827

2 changed files

Status	File	+	-
M	`.docs/PARITY.md`	18	6
M	`.docs/sprints/sprint02.md`	17	0

.docs/PARITY.md (modified)

  - native-tool round trips for `read`, `write`, `edit`, `glob`, `grep`, and `bash`
  - confirmation callbacks for destructive `write` and `bash` actions
  - raw JSON fallback when the model emits tool syntax in plain text
 -- heuristic completion nudges when the model stops before finishing a simple actionable task
 +- persisted definition-of-done state under `.loader/dod/`
 +- explicit verify/fix loops for mutating tasks, with a bounded retry budget
 +- task-size-aware verification command derivation based on actual tool history
 +- heuristic completion nudges only for non-mutating tasks; mutating tasks now complete through the DoD gate
  - typed `TurnSummary` output for completed turns, including trace events and tool-result messages
  - unified tool execution for native and extracted tool calls through `runtime.executor.ToolExecutor`
  - typed tool-result messages backed by `Message.tool_results`
 +- CLI and TUI status surfaces for DoD phase, pending items, and last verification result
  ## Known weak spots
  - the core turn loop moved into [`src/loader/runtime/conversation.py`](../src/loader/runtime/conversation.py), but it is still much larger and more heuristic-heavy than the reference runtime in `refs/claw-code`
  - planning, decomposition, and several helper behaviors still live in [`src/loader/agent/loop.py`](../src/loader/agent/loop.py), so ownership is cleaner than Sprint 00 but not fully simplified yet
 -- completion is still heuristic, not evidence-backed
 +- DoD acceptance criteria and pending items are still runtime-derived and minimal, not model-authored task plans
 +- evidence summaries are deterministic runtime summaries of captured output, not model-written verification narratives
  - permissions are confirmation-based, not policy-based
  ## Out of scope in the current baseline
  - permission modes / policy engine
 -- persisted sessions / memory / `.loader/` runtime state
 +- persisted sessions / memory beyond DoD state
  - mode router, clarify, or planning artifacts
  - doctor / status / session product surfaces
  - `native_and_raw_tool_paths_share_executor_trace`: green
  - `backend_capability_probe_refreshes_native_tool_mode`: green
  - `run_streaming_delegates_to_primary_runtime`: green
 +- `definition_of_done_verify_phase`: green
 +- `verify_failure_routes_to_fix_loop`: green
 +- `verify_retry_budget_exhaustion`: green
 +- `conversational_task_skips_verify_phase`: green
  ## Verification snapshot
  As of 2026-04-06:
 -- `uv run pytest -q`: 80 passed
 -- `tests/test_runtime_harness.py` is fully green, including the original contract regression
 +- `uv run pytest -q`: 90 passed
 +- `tests/test_runtime_harness.py` is fully green, including DoD verify/fix coverage and the original contract regression
 +- `tests/test_dod.py` covers persistence, sizing boundaries, and verification command derivation
 +- `tests/test_status_surfaces.py` covers the CLI/TUI DoD status formatting helpers
  - native and extracted tool calls now record the same executor trace events, with source-specific metadata
 -- turn startup can refine backend capability profiles before the first request, and `run_streaming()` now delegates into the main runtime path
 +- turn startup can refine backend capability profiles before the first request, `run_streaming()` delegates into the main runtime path, and mutating tasks now route through persisted evidence-backed completion
  ## Definition of honesty
  - If a scenario is green here, it should have deterministic automated coverage.
  - If a scenario is flaky or broken, it should be called out here before we claim parity work is done.
  - Sprint 01 turned the original `tool_call_id` regression green by fixing the message contract, not by weakening the test.
 +- Sprint 02 replaced "looks done" completion for mutating tasks with a real verify/fix gate, but it has not yet reached the richer workflow contracts described in the report and Sprint 04+.
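
The verify/fix gate with a bounded retry budget that these PARITY.md changes describe can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: `run_dod_gate`, `VerifyResult`, `GateOutcome`, and the `retry_budget` default are hypothetical names and values, not the code in `src/loader/runtime/conversation.py`.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class VerifyResult:
    """Outcome of one verification run (hypothetical shape)."""
    passed: bool
    summary: str


@dataclass
class GateOutcome:
    """What the gate reports once it stops looping (hypothetical shape)."""
    completed: bool
    verify_runs: int
    evidence: List[str] = field(default_factory=list)


def run_dod_gate(
    verify: Callable[[], VerifyResult],
    fix: Callable[[VerifyResult], None],
    retry_budget: int = 2,
) -> GateOutcome:
    """Verify; on failure attempt a fix, at most `retry_budget` times."""
    evidence: List[str] = []
    fixes_used = 0
    while True:
        result = verify()
        evidence.append(result.summary)
        if result.passed:
            return GateOutcome(True, len(evidence), evidence)
        if fixes_used >= retry_budget:
            # Budget exhausted: report the failure instead of claiming success.
            return GateOutcome(False, len(evidence), evidence)
        fix(result)
        fixes_used += 1
```

With `retry_budget=2` the sketch runs verification at most three times and applies at most two fixes. A task that still fails after that comes back with `completed=False` and the collected summaries, which is one way to read the "honest failure summary" behaviour the commit describes.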

.docs/sprints/sprint02.md (modified)

  - failed verification cannot escape into a "looks done" final answer
  - simple tasks stay cheap (verify is skipped); complex tasks enter the verify/fix loop automatically
  - the user can see the DoD phase from the CLI and TUI
 +
 +## Audit Notes
 +
 +Audit checkpoint on 2026-04-06:
 +
 +- added a persisted `DefinitionOfDone` runtime object under `src/loader/runtime/dod.py` and store-backed state under `.loader/dod/`
 +- routed mutating tasks through an explicit verify/fix gate in `src/loader/runtime/conversation.py`, with retry-budget exhaustion returning an honest failure summary instead of a premature success
 +- taught verification runs to execute through the shared executor with duplicate suppression disabled, confirmations skipped, and project-root working-directory awareness
 +- tightened duplicate suppression so rewrites used for recovery are allowed while true same-content rewrites are still skipped
 +- surfaced DoD state in both the non-TUI CLI and the TUI status line, and added deterministic coverage for runtime parity, DoD persistence/sizing, and status formatting
 +- full verification is green at `uv run pytest -q` with 90 passing tests
 +
 +Residual debt after Sprint 02:
 +
 +- DoD acceptance criteria and pending items are still runtime-derived and shallow; Loader does not yet have the richer task/workflow artifacts planned in Sprint 04 and Sprint 05
 +- verification summaries are runtime-generated from captured evidence rather than model-authored evidence explanations
 +- task-size-aware verification is intentionally conservative today; larger-task evidence scaling still has room to move closer to the reference verifier design
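
The audit notes mention store-backed DoD state under `.loader/dod/` and status surfaces for the DoD phase, pending items, and last verification result. The following is a rough Python sketch of what such persistence could look like. The field names, the one-JSON-file-per-task layout, and the helpers `save_state`, `load_state`, and `format_status` are all assumptions for illustration; the actual `DefinitionOfDone` object in `src/loader/runtime/dod.py` may be shaped quite differently.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import List, Optional


@dataclass
class DodState:
    """Hypothetical persisted fields; the real DefinitionOfDone may differ."""
    task_id: str
    phase: str = "pending"
    pending_items: List[str] = field(default_factory=list)
    last_verification: Optional[str] = None


def dod_path(project_root: Path, task_id: str) -> Path:
    # One JSON file per task under .loader/dod/ (file naming is an assumption).
    return project_root / ".loader" / "dod" / f"{task_id}.json"


def save_state(project_root: Path, state: DodState) -> Path:
    """Write the state as JSON, creating .loader/dod/ if needed."""
    path = dod_path(project_root, state.task_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(asdict(state), indent=2), encoding="utf-8")
    return path


def load_state(project_root: Path, task_id: str) -> Optional[DodState]:
    """Return the stored state, or None when nothing was persisted."""
    path = dod_path(project_root, task_id)
    if not path.exists():
        return None
    return DodState(**json.loads(path.read_text(encoding="utf-8")))


def format_status(state: DodState) -> str:
    """One-line status in the spirit of the CLI/TUI status surfaces."""
    last = state.last_verification or "none"
    return f"DoD: {state.phase} | pending: {len(state.pending_items)} | last verify: {last}"
```

Keeping the state as a plain dataclass serialized to JSON makes round trips easy to test deterministically, which matches the kind of persistence and status-formatting coverage the commit attributes to `tests/test_dod.py` and `tests/test_status_surfaces.py`.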