@@ -10,22 +10,27 @@ This file tracks the current deterministic runtime baseline for Loader. It stays |
| 10 | 10 | - native-tool round trips for `read`, `write`, `edit`, `glob`, `grep`, and `bash` |
| 11 | 11 | - confirmation callbacks for destructive `write` and `bash` actions |
| 12 | 12 | - raw JSON fallback when the model emits tool syntax in plain text |
| 13 | | -- heuristic completion nudges when the model stops before finishing a simple actionable task |
| 13 | +- persisted definition-of-done state under `.loader/dod/` |
| 14 | +- explicit verify/fix loops for mutating tasks, with a bounded retry budget |
| 15 | +- task-size-aware verification command derivation based on actual tool history |
| 16 | +- heuristic completion nudges only for non-mutating tasks; mutating tasks now complete through the DoD gate |
| 14 | 17 | - typed `TurnSummary` output for completed turns, including trace events and tool-result messages |
| 15 | 18 | - unified tool execution for native and extracted tool calls through `runtime.executor.ToolExecutor` |
| 16 | 19 | - typed tool-result messages backed by `Message.tool_results` |
| 20 | +- CLI and TUI status surfaces for DoD phase, pending items, and last verification result |
| 17 | 21 | |
| 18 | 22 | ## Known weak spots |
| 19 | 23 | |
| 20 | 24 | - the core turn loop moved into [`src/loader/runtime/conversation.py`](../src/loader/runtime/conversation.py), but it is still much larger and more heuristic-heavy than the reference runtime in `refs/claw-code` |
| 21 | 25 | - planning, decomposition, and several helper behaviors still live in [`src/loader/agent/loop.py`](../src/loader/agent/loop.py), so ownership is cleaner than Sprint 00 but not fully simplified yet |
| 22 | | -- completion is still heuristic, not evidence-backed |
| 26 | +- DoD acceptance criteria and pending items are still runtime-derived and minimal, not model-authored task plans |
| 27 | +- evidence summaries are deterministic runtime summaries of captured output, not model-written verification narratives |
| 23 | 28 | - permissions are confirmation-based, not policy-based |
| 24 | 29 | |
| 25 | 30 | ## Out of scope in the current baseline |
| 26 | 31 | |
| 27 | 32 | - permission modes / policy engine |
| 28 | | -- persisted sessions / memory / `.loader/` runtime state |
| 33 | +- persisted sessions / memory beyond DoD state |
| 29 | 34 | - mode router, clarify, or planning artifacts |
| 30 | 35 | - doctor / status / session product surfaces |
| 31 | 36 | |
@@ -48,18 +53,25 @@ The auditable manifest lives at [`tests/fixtures/runtime_parity_manifest.json`]( |
| 48 | 53 | - `native_and_raw_tool_paths_share_executor_trace`: green |
| 49 | 54 | - `backend_capability_probe_refreshes_native_tool_mode`: green |
| 50 | 55 | - `run_streaming_delegates_to_primary_runtime`: green |
| 56 | +- `definition_of_done_verify_phase`: green |
| 57 | +- `verify_failure_routes_to_fix_loop`: green |
| 58 | +- `verify_retry_budget_exhaustion`: green |
| 59 | +- `conversational_task_skips_verify_phase`: green |
| 51 | 60 | |
| 52 | 61 | ## Verification snapshot |
| 53 | 62 | |
| 54 | 63 | As of 2026-04-06: |
| 55 | 64 | |
| 56 | | -- `uv run pytest -q`: 80 passed |
| 57 | | -- `tests/test_runtime_harness.py` is fully green, including the original contract regression |
| 65 | +- `uv run pytest -q`: 90 passed |
| 66 | +- `tests/test_runtime_harness.py` is fully green, including DoD verify/fix coverage and the original contract regression |
| 67 | +- `tests/test_dod.py` covers persistence, sizing boundaries, and verification command derivation |
| 68 | +- `tests/test_status_surfaces.py` covers the CLI/TUI DoD status formatting helpers |
| 58 | 69 | - native and extracted tool calls now record the same executor trace events, with source-specific metadata |
| 59 | | -- turn startup can refine backend capability profiles before the first request, and `run_streaming()` now delegates into the main runtime path |
| 70 | +- turn startup can refine backend capability profiles before the first request, `run_streaming()` delegates into the main runtime path, and mutating tasks now route through persisted evidence-backed completion |
| 60 | 71 | |
| 61 | 72 | ## Definition of honesty |
| 62 | 73 | |
| 63 | 74 | - If a scenario is green here, it should have deterministic automated coverage. |
| 64 | 75 | - If a scenario is flaky or broken, it should be called out here before we claim parity work is done. |
| 65 | 76 | - Sprint 01 turned the original `tool_call_id` regression green by fixing the message contract, not by weakening the test. |
| 77 | +- Sprint 02 replaced "looks done" completion for mutating tasks with a real verify/fix gate, but it has not yet reached the richer workflow contracts described in the report and Sprint 04+. |