tenseleyflow/loader / d2010a9

Audit Sprint 02 definition-of-done rollout

Authored by espadonne
SHA: d2010a99fe1c2c2a501763f047d67aa6651d1e40
Parents: f0c4a17
Tree: d2ed827

2 changed files

Status  File                       +   -
M       .docs/PARITY.md            18  6
M       .docs/sprints/sprint02.md  17  0
.docs/PARITY.md (modified)
@@ -10,22 +10,27 @@ This file tracks the current deterministic runtime baseline for Loader. It stays
 - native-tool round trips for `read`, `write`, `edit`, `glob`, `grep`, and `bash`
 - confirmation callbacks for destructive `write` and `bash` actions
 - raw JSON fallback when the model emits tool syntax in plain text
-- heuristic completion nudges when the model stops before finishing a simple actionable task
+- persisted definition-of-done state under `.loader/dod/`
+- explicit verify/fix loops for mutating tasks, with a bounded retry budget
+- task-size-aware verification command derivation based on actual tool history
+- heuristic completion nudges only for non-mutating tasks; mutating tasks now complete through the DoD gate
 - typed `TurnSummary` output for completed turns, including trace events and tool-result messages
 - unified tool execution for native and extracted tool calls through `runtime.executor.ToolExecutor`
 - typed tool-result messages backed by `Message.tool_results`
+- CLI and TUI status surfaces for DoD phase, pending items, and last verification result
 
 ## Known weak spots
 
 - the core turn loop moved into [`src/loader/runtime/conversation.py`](../src/loader/runtime/conversation.py), but it is still much larger and more heuristic-heavy than the reference runtime in `refs/claw-code`
 - planning, decomposition, and several helper behaviors still live in [`src/loader/agent/loop.py`](../src/loader/agent/loop.py), so ownership is cleaner than Sprint 00 but not fully simplified yet
-- completion is still heuristic, not evidence-backed
+- DoD acceptance criteria and pending items are still runtime-derived and minimal, not model-authored task plans
+- evidence summaries are deterministic runtime summaries of captured output, not model-written verification narratives
 - permissions are confirmation-based, not policy-based
 
 ## Out of scope in the current baseline
 
 - permission modes / policy engine
-- persisted sessions / memory / `.loader/` runtime state
+- persisted sessions / memory beyond DoD state
 - mode router, clarify, or planning artifacts
 - doctor / status / session product surfaces
 
@@ -48,18 +53,25 @@ The auditable manifest lives at [`tests/fixtures/runtime_parity_manifest.json`](
 - `native_and_raw_tool_paths_share_executor_trace`: green
 - `backend_capability_probe_refreshes_native_tool_mode`: green
 - `run_streaming_delegates_to_primary_runtime`: green
+- `definition_of_done_verify_phase`: green
+- `verify_failure_routes_to_fix_loop`: green
+- `verify_retry_budget_exhaustion`: green
+- `conversational_task_skips_verify_phase`: green
 
 ## Verification snapshot
 
 As of 2026-04-06:
 
-- `uv run pytest -q`: 80 passed
-- `tests/test_runtime_harness.py` is fully green, including the original contract regression
+- `uv run pytest -q`: 90 passed
+- `tests/test_runtime_harness.py` is fully green, including DoD verify/fix coverage and the original contract regression
+- `tests/test_dod.py` covers persistence, sizing boundaries, and verification command derivation
+- `tests/test_status_surfaces.py` covers the CLI/TUI DoD status formatting helpers
 - native and extracted tool calls now record the same executor trace events, with source-specific metadata
-- turn startup can refine backend capability profiles before the first request, and `run_streaming()` now delegates into the main runtime path
+- turn startup can refine backend capability profiles before the first request, `run_streaming()` delegates into the main runtime path, and mutating tasks now route through persisted evidence-backed completion
 
 ## Definition of honesty
 
 - If a scenario is green here, it should have deterministic automated coverage.
 - If a scenario is flaky or broken, it should be called out here before we claim parity work is done.
 - Sprint 01 turned the original `tool_call_id` regression green by fixing the message contract, not by weakening the test.
+- Sprint 02 replaced "looks done" completion for mutating tasks with a real verify/fix gate, but it has not yet reached the richer workflow contracts described in the report and Sprint 04+.
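
Reviewer sketch (not part of the commit): the PARITY.md changes above describe a verify/fix gate with a bounded retry budget, where exhausting the budget yields an honest failure instead of a premature success. The following is a minimal illustration of that control flow only. The names `VerifyResult`, `GateOutcome`, `run_dod_gate`, and `max_attempts` are hypothetical and are not taken from `src/loader/runtime/dod.py` or `conversation.py`.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class VerifyResult:
    """Outcome of one verification run (hypothetical shape)."""
    passed: bool
    output: str


@dataclass
class GateOutcome:
    """Final state of the gate: 'verified' or 'budget_exhausted'."""
    status: str
    attempts: int
    evidence: List[str] = field(default_factory=list)


def run_dod_gate(
    verify: Callable[[], VerifyResult],
    fix: Callable[[VerifyResult], None],
    max_attempts: int = 3,  # assumed default; the real budget is not stated in the diff
) -> GateOutcome:
    """Run verify, and on failure run fix, until success or the budget runs out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    evidence: List[str] = []
    for attempt in range(1, max_attempts + 1):
        result = verify()
        evidence.append(result.output)  # keep captured output as evidence
        if result.passed:
            return GateOutcome("verified", attempt, evidence)
        if attempt < max_attempts:
            fix(result)  # only attempt a fix if another verify run remains
    # Budget exhausted: report failure rather than claiming completion.
    return GateOutcome("budget_exhausted", max_attempts, evidence)
```

In this sketch a failing verification can never produce a "verified" outcome; the only exits are a passing run or an explicit `budget_exhausted` status that a caller could surface in a failure summary.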
.docs/sprints/sprint02.md (modified)
@@ -115,3 +115,20 @@ This is what makes the contract visible to the user instead of hidden inside the
 - failed verification cannot escape into a "looks done" final answer
 - simple tasks stay cheap (verify is skipped); complex tasks enter the verify/fix loop automatically
 - the user can see the DoD phase from the CLI and TUI
+
+## Audit Notes
+
+Audit checkpoint on 2026-04-06:
+
+- added a persisted `DefinitionOfDone` runtime object under `src/loader/runtime/dod.py` and store-backed state under `.loader/dod/`
+- routed mutating tasks through an explicit verify/fix gate in `src/loader/runtime/conversation.py`, with retry-budget exhaustion returning an honest failure summary instead of a premature success
+- taught verification runs to execute through the shared executor with duplicate suppression disabled, confirmations skipped, and project-root working-directory awareness
+- tightened duplicate suppression so rewrites used for recovery are allowed while true same-content rewrites are still skipped
+- surfaced DoD state in both the non-TUI CLI and the TUI status line, and added deterministic coverage for runtime parity, DoD persistence/sizing, and status formatting
+- full verification is green at `uv run pytest -q` with 90 passing tests
+
+Residual debt after Sprint 02:
+
+- DoD acceptance criteria and pending items are still runtime-derived and shallow; Loader does not yet have the richer task/workflow artifacts planned in Sprint 04 and Sprint 05
+- verification summaries are runtime-generated from captured evidence rather than model-authored evidence explanations
+- task-size-aware verification is intentionally conservative today; larger-task evidence scaling still has room to move closer to the reference verifier design
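
Reviewer sketch (not part of the commit): the audit notes say duplicate suppression now allows rewrites used for recovery while still skipping true same-content rewrites. One plausible reading, shown here purely as an assumption, is that a repeated write is skipped only when the file on disk still holds exactly that content. The class `WriteDeduper` and its methods are hypothetical and do not reflect Loader's actual executor code.

```python
import hashlib
from pathlib import Path
from typing import Dict


def _digest(data: str) -> str:
    """Stable fingerprint of file content."""
    return hashlib.sha256(data.encode("utf-8")).hexdigest()


class WriteDeduper:
    """Skip a write only if it repeats content that is still on disk (assumed policy)."""

    def __init__(self) -> None:
        # Maps a path string to the digest of the last content written there.
        self._last_written: Dict[str, str] = {}

    def record(self, path: Path, content: str) -> None:
        """Remember what was just written to `path`."""
        self._last_written[str(path)] = _digest(content)

    def should_skip(self, path: Path, content: str) -> bool:
        """Return True only for a true same-content rewrite."""
        new_digest = _digest(content)
        if self._last_written.get(str(path)) != new_digest:
            return False  # different content than the last write: not a duplicate
        if not path.exists():
            return False  # file vanished: rewriting it is a recovery, so allow it
        # Same content as last time; skip only if the file has not drifted since.
        return _digest(path.read_text(encoding="utf-8")) == new_digest
```

Under this assumed policy, re-issuing an identical write to an untouched file is suppressed, while re-issuing it after the file was changed or deleted goes through.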