@@ -2,7 +2,7 @@ |
| 2 | 2 | |
| 3 | 3 | Date: 2026-04-09 |
| 4 | 4 | |
| 5 | | -Deterministic baseline: `uv run pytest -q` → `397 passed` |
| 5 | +Deterministic baseline: `uv run pytest -q` → `414 passed` |
| 6 | 6 | |
| 7 | 7 | This file tracks the current deterministic runtime baseline for Loader. It stays intentionally narrow and operational: what the runtime can do today, what remains weak, and what scenarios we measure with repeatable tests. |
| 8 | 8 | |
@@ -88,12 +88,14 @@ This file tracks the current deterministic runtime baseline for Loader. It stays |
| 88 | 88 | - `loader status`, `loader session show`, and `loader workflow show` now surface observed verification directly, and `Recent Verification` is unified from canonical policy observations first with DoD evidence only as a fallback |
| 89 | 89 | - Loader now has a runtime-owned internal execution handle in `src/loader/runtime/runtime_handle.py`, and runtime-oriented launcher/bootstrap/public-shell tests no longer need to treat `Agent` as the only valid runtime owner |
| 90 | 90 | - non-TUI CLI paths, `loader explore`, and the scripted runtime harness now default to the runtime-first owner seam below `Agent`, so Loader uses `RuntimeHandle` in real internal integrations instead of reserving it for tests |
| 91 | +- the TUI now also launches through that runtime-first shell-owner seam below `Agent`, so the last major product path no longer defaults to the public facade |
| 91 | 92 | - persisted session state now records the active runtime-owner path, and `loader status`, `loader session list/show`, and `loader workflow show` surface that runtime-owner provenance directly |
| 92 | 93 | - verification now emits per-command `verify_observation` events into the canonical workflow timeline while the verification loop is running, and workflow/policy read models project those entries as first-class accountability state |
| 94 | +- verification lifecycle now distinguishes planned, pending, stale, skipped, and observed states inside the canonical workflow timeline, and completion policy plus inspection surfaces preserve those states directly instead of flattening them into generic missing-proof summaries |
| 93 | 95 | |
| 94 | 96 | ## Known weak spots |
| 95 | 97 | |
| 96 | | -- the public runtime boundary is now explicit and runtime-shaped, and Loader now also has real runtime-first internal integrations through `RuntimeHandle`, but `Agent` still constructs and supplies the main public boundary and the TUI still routes through that public shell instead of a narrower runtime-first external API |
| 98 | +- the public runtime boundary is now explicit and runtime-shaped, and Loader now also has real runtime-first internal integrations through `RuntimeHandle` across CLI, explore, the scripted harness, and TUI launch, but `Agent` plus `runtime.public_shell` still supply the outer compatibility boundary instead of a narrower runtime-first external API |
| 97 | 99 | - [`src/loader/agent/loop.py`](../src/loader/agent/loop.py) is down to 267 lines and much closer to a public facade than the pre-Sprint-15 shell, but it still owns the compatibility shell and remaining launcher/UI glue instead of disappearing entirely |
| 98 | 100 | - [`src/loader/agent/reasoning.py`](../src/loader/agent/reasoning.py) and [`src/loader/agent/safeguards.py`](../src/loader/agent/safeguards.py) are now compatibility shims rather than primary implementations, but they still remain as export layers until Loader narrows its external compatibility surface further |
| 99 | 101 | - [`src/loader/runtime/tool_batches.py`](../src/loader/runtime/tool_batches.py) and parts of [`src/loader/runtime/workflow_lanes.py`](../src/loader/runtime/workflow_lanes.py) are narrower and more directly tested than before, but they still carry more heuristic policy than the tightest reference seams in `refs/claw-code` |
@@ -122,7 +124,7 @@ This file tracks the current deterministic runtime baseline for Loader. It stays |
| 122 | 124 | |
| 123 | 125 | ## Deterministic parity scenarios |
| 124 | 126 | |
| 125 | | -The auditable manifest lives at [`tests/fixtures/runtime_parity_manifest.json`](../tests/fixtures/runtime_parity_manifest.json) and is exercised by [`tests/test_runtime_harness.py`](../tests/test_runtime_harness.py). Sprint 04 adds focused workflow integration coverage in [`tests/test_workflow_runtime.py`](../tests/test_workflow_runtime.py) and artifact/router unit coverage in [`tests/test_workflow.py`](../tests/test_workflow.py). Sprint 06 adds inspection/explore coverage in [`tests/test_inspection.py`](../tests/test_inspection.py), [`tests/test_explore_runtime.py`](../tests/test_explore_runtime.py), and [`tests/test_expanded_tools.py`](../tests/test_expanded_tools.py). Sprint 10 extends that workflow coverage in [`tests/test_workflow_policy.py`](../tests/test_workflow_policy.py), [`tests/test_workflow_runtime.py`](../tests/test_workflow_runtime.py), and [`tests/test_inspection.py`](../tests/test_inspection.py) for scored routing, clarify-budget behavior, plan refresh, and workflow timeline inspection. Sprint 11 adds [`tests/test_workflow_signals.py`](../tests/test_workflow_signals.py), [`tests/test_clarify_strategy.py`](../tests/test_clarify_strategy.py), [`tests/test_artifact_invalidation.py`](../tests/test_artifact_invalidation.py), and expanded inspection/runtime coverage for signal summaries, intent-aware clarify, semantic replan recovery, and workflow timeline filtering/highlights. Sprint 12 adds [`tests/test_clarify_grounding.py`](../tests/test_clarify_grounding.py), [`tests/test_turn_preparation.py`](../tests/test_turn_preparation.py), [`tests/test_turn_completion.py`](../tests/test_turn_completion.py), [`tests/test_turn_iteration.py`](../tests/test_turn_iteration.py), [`tests/test_turn_preamble.py`](../tests/test_turn_preamble.py), [`tests/test_workflow_state.py`](../tests/test_workflow_state.py), and [`tests/test_turn_loop.py`](../tests/test_turn_loop.py) for grounded clarify, structured recovery evidence, and the controllerized turn runtime. 
Sprint 13 adds [`tests/test_runtime_repair_flows.py`](../tests/test_runtime_repair_flows.py), [`tests/test_response_routing.py`](../tests/test_response_routing.py), [`tests/test_workflow_ledger.py`](../tests/test_workflow_ledger.py), and expanded [`tests/test_session_state.py`](../tests/test_session_state.py) / [`tests/test_inspection.py`](../tests/test_inspection.py) coverage for honest repair behavior, dedicated response routing, persisted semantic ledger state, prompt snapshot history, and prompt/artifact diff inspection. Sprint 15 adds [`tests/test_runtime_bootstrap.py`](../tests/test_runtime_bootstrap.py), [`tests/test_safeguard_services.py`](../tests/test_safeguard_services.py), [`tests/test_reasoning_compat.py`](../tests/test_reasoning_compat.py), and updated [`tests/test_runtime_context.py`](../tests/test_runtime_context.py) coverage for the shared bootstrap seam plus the runtime-owned safeguards/reasoning compatibility contract. Sprint 16 adds [`tests/test_runtime_launcher.py`](../tests/test_runtime_launcher.py), [`tests/test_chat_lane.py`](../tests/test_chat_lane.py), [`tests/test_decomposition_lane.py`](../tests/test_decomposition_lane.py), [`tests/test_compat_boundaries.py`](../tests/test_compat_boundaries.py), and expanded [`tests/test_explore_runtime.py`](../tests/test_explore_runtime.py) / [`tests/test_inspection.py`](../tests/test_inspection.py) coverage for the public launcher contract, compatibility boundaries, and persisted explore continuity. 
Sprint 17 adds [`tests/test_runtime_public_shell.py`](../tests/test_runtime_public_shell.py), expanded [`tests/test_runtime_bootstrap.py`](../tests/test_runtime_bootstrap.py) / [`tests/test_runtime_launcher.py`](../tests/test_runtime_launcher.py) / [`tests/test_runtime_context.py`](../tests/test_runtime_context.py) coverage for the explicit bootstrap view, expanded [`tests/test_repair.py`](../tests/test_repair.py) / [`tests/test_runtime_repair_flows.py`](../tests/test_runtime_repair_flows.py) coverage for honest raw-text recovery failure, and expanded [`tests/test_explore_runtime.py`](../tests/test_explore_runtime.py) / [`tests/test_inspection.py`](../tests/test_inspection.py) coverage for explore continuity inspection, reset, and persisted fresh-vs-continue visibility. Sprint 21 adds [`tests/test_evidence_provenance.py`](../tests/test_evidence_provenance.py), [`tests/test_workflow_timeline_read_model.py`](../tests/test_workflow_timeline_read_model.py), [`tests/test_runtime_handle.py`](../tests/test_runtime_handle.py), and expanded [`tests/test_runtime_launcher.py`](../tests/test_runtime_launcher.py), [`tests/test_turn_preparation.py`](../tests/test_turn_preparation.py), [`tests/test_runtime_public_shell.py`](../tests/test_runtime_public_shell.py), and [`tests/test_inspection.py`](../tests/test_inspection.py) coverage for typed evidence provenance, grouped policy-evidence rollups, the runtime-first internal handle, and evidence-backed status/session/workflow inspection. 
Sprint 22 adds [`tests/test_verification_observations.py`](../tests/test_verification_observations.py) plus expanded [`tests/test_finalization.py`](../tests/test_finalization.py), [`tests/test_completion_policy.py`](../tests/test_completion_policy.py), [`tests/test_turn_completion.py`](../tests/test_turn_completion.py), [`tests/test_workflow_timeline_read_model.py`](../tests/test_workflow_timeline_read_model.py), and [`tests/test_inspection.py`](../tests/test_inspection.py) coverage for typed verification observations, observed-verification stop/continue reasoning, and unified verification inspection sourced from canonical policy events. |
| 127 | +The auditable manifest lives at [`tests/fixtures/runtime_parity_manifest.json`](../tests/fixtures/runtime_parity_manifest.json) and is exercised by [`tests/test_runtime_harness.py`](../tests/test_runtime_harness.py). Sprint 04 adds focused workflow integration coverage in [`tests/test_workflow_runtime.py`](../tests/test_workflow_runtime.py) and artifact/router unit coverage in [`tests/test_workflow.py`](../tests/test_workflow.py). Sprint 06 adds inspection/explore coverage in [`tests/test_inspection.py`](../tests/test_inspection.py), [`tests/test_explore_runtime.py`](../tests/test_explore_runtime.py), and [`tests/test_expanded_tools.py`](../tests/test_expanded_tools.py). Sprint 10 extends that workflow coverage in [`tests/test_workflow_policy.py`](../tests/test_workflow_policy.py), [`tests/test_workflow_runtime.py`](../tests/test_workflow_runtime.py), and [`tests/test_inspection.py`](../tests/test_inspection.py) for scored routing, clarify-budget behavior, plan refresh, and workflow timeline inspection. Sprint 11 adds [`tests/test_workflow_signals.py`](../tests/test_workflow_signals.py), [`tests/test_clarify_strategy.py`](../tests/test_clarify_strategy.py), [`tests/test_artifact_invalidation.py`](../tests/test_artifact_invalidation.py), and expanded inspection/runtime coverage for signal summaries, intent-aware clarify, semantic replan recovery, and workflow timeline filtering/highlights. Sprint 12 adds [`tests/test_clarify_grounding.py`](../tests/test_clarify_grounding.py), [`tests/test_turn_preparation.py`](../tests/test_turn_preparation.py), [`tests/test_turn_completion.py`](../tests/test_turn_completion.py), [`tests/test_turn_iteration.py`](../tests/test_turn_iteration.py), [`tests/test_turn_preamble.py`](../tests/test_turn_preamble.py), [`tests/test_workflow_state.py`](../tests/test_workflow_state.py), and [`tests/test_turn_loop.py`](../tests/test_turn_loop.py) for grounded clarify, structured recovery evidence, and the controllerized turn runtime. 
Sprint 13 adds [`tests/test_runtime_repair_flows.py`](../tests/test_runtime_repair_flows.py), [`tests/test_response_routing.py`](../tests/test_response_routing.py), [`tests/test_workflow_ledger.py`](../tests/test_workflow_ledger.py), and expanded [`tests/test_session_state.py`](../tests/test_session_state.py) / [`tests/test_inspection.py`](../tests/test_inspection.py) coverage for honest repair behavior, dedicated response routing, persisted semantic ledger state, prompt snapshot history, and prompt/artifact diff inspection. Sprint 15 adds [`tests/test_runtime_bootstrap.py`](../tests/test_runtime_bootstrap.py), [`tests/test_safeguard_services.py`](../tests/test_safeguard_services.py), [`tests/test_reasoning_compat.py`](../tests/test_reasoning_compat.py), and updated [`tests/test_runtime_context.py`](../tests/test_runtime_context.py) coverage for the shared bootstrap seam plus the runtime-owned safeguards/reasoning compatibility contract. Sprint 16 adds [`tests/test_runtime_launcher.py`](../tests/test_runtime_launcher.py), [`tests/test_chat_lane.py`](../tests/test_chat_lane.py), [`tests/test_decomposition_lane.py`](../tests/test_decomposition_lane.py), [`tests/test_compat_boundaries.py`](../tests/test_compat_boundaries.py), and expanded [`tests/test_explore_runtime.py`](../tests/test_explore_runtime.py) / [`tests/test_inspection.py`](../tests/test_inspection.py) coverage for the public launcher contract, compatibility boundaries, and persisted explore continuity. 
Sprint 17 adds [`tests/test_runtime_public_shell.py`](../tests/test_runtime_public_shell.py), expanded [`tests/test_runtime_bootstrap.py`](../tests/test_runtime_bootstrap.py) / [`tests/test_runtime_launcher.py`](../tests/test_runtime_launcher.py) / [`tests/test_runtime_context.py`](../tests/test_runtime_context.py) coverage for the explicit bootstrap view, expanded [`tests/test_repair.py`](../tests/test_repair.py) / [`tests/test_runtime_repair_flows.py`](../tests/test_runtime_repair_flows.py) coverage for honest raw-text recovery failure, and expanded [`tests/test_explore_runtime.py`](../tests/test_explore_runtime.py) / [`tests/test_inspection.py`](../tests/test_inspection.py) coverage for explore continuity inspection, reset, and persisted fresh-vs-continue visibility. Sprint 21 adds [`tests/test_evidence_provenance.py`](../tests/test_evidence_provenance.py), [`tests/test_workflow_timeline_read_model.py`](../tests/test_workflow_timeline_read_model.py), [`tests/test_runtime_handle.py`](../tests/test_runtime_handle.py), and expanded [`tests/test_runtime_launcher.py`](../tests/test_runtime_launcher.py), [`tests/test_turn_preparation.py`](../tests/test_turn_preparation.py), [`tests/test_runtime_public_shell.py`](../tests/test_runtime_public_shell.py), and [`tests/test_inspection.py`](../tests/test_inspection.py) coverage for typed evidence provenance, grouped policy-evidence rollups, the runtime-first internal handle, and evidence-backed status/session/workflow inspection. 
Sprint 22 adds [`tests/test_verification_observations.py`](../tests/test_verification_observations.py) plus expanded [`tests/test_finalization.py`](../tests/test_finalization.py), [`tests/test_completion_policy.py`](../tests/test_completion_policy.py), [`tests/test_turn_completion.py`](../tests/test_turn_completion.py), [`tests/test_workflow_timeline_read_model.py`](../tests/test_workflow_timeline_read_model.py), and [`tests/test_inspection.py`](../tests/test_inspection.py) coverage for typed verification observations, observed-verification stop/continue reasoning, and unified verification inspection sourced from canonical policy events. Sprint 24 adds expanded [`tests/test_cli_runtime_owner.py`](../tests/test_cli_runtime_owner.py) coverage for runtime-first TUI launch ownership plus expanded [`tests/test_tool_batches.py`](../tests/test_tool_batches.py), [`tests/test_finalization.py`](../tests/test_finalization.py), [`tests/test_completion_policy.py`](../tests/test_completion_policy.py), [`tests/test_workflow_runtime.py`](../tests/test_workflow_runtime.py), [`tests/test_workflow_timeline_read_model.py`](../tests/test_workflow_timeline_read_model.py), and [`tests/test_inspection.py`](../tests/test_inspection.py) coverage for planned/pending/stale verification lifecycle and its operator-facing projections. |
| 126 | 128 | |
| 127 | 129 | - `streaming_text`: green |
| 128 | 130 | - `read_file_roundtrip`: green |
@@ -163,7 +165,7 @@ The auditable manifest lives at [`tests/fixtures/runtime_parity_manifest.json`]( |
| 163 | 165 | |
| 164 | 166 | As of 2026-04-09: |
| 165 | 167 | |
| 166 | | -- `uv run pytest -q`: 397 passed |
| 168 | +- `uv run pytest -q`: 414 passed |
| 167 | 169 | - `tests/test_runtime_harness.py` is fully green, including permission-mode parity, DoD verify/fix coverage, workflow routing parity, and the original contract regression |
| 168 | 170 | - `tests/test_prompt_builder.py` covers section rendering, native-vs-ReAct formatting, and prompt metadata persistence |
| 169 | 171 | - `tests/test_turn_state_machine.py` covers allowed/disallowed turn transitions and terminal transition metadata |
@@ -200,7 +202,8 @@ As of 2026-04-09: |
| 200 | 202 | - `tests/test_verification_observations.py` covers typed verification-observation serialization and normalization |
| 201 | 203 | - `tests/test_workflow_timeline_read_model.py` covers grouped supporting/missing policy-evidence rollups, latest-policy derivation, and observed-verification read models from the canonical workflow timeline |
| 202 | 204 | - `tests/test_runtime_handle.py` covers the runtime-owned internal handle below `Agent`, including direct launcher/context/runtime construction without depending on the public compatibility facade |
| 203 | | -- `tests/test_cli_runtime_owner.py` covers runtime-first CLI owner selection for non-TUI and explore paths |
| 205 | +- `tests/test_cli_runtime_owner.py` covers runtime-first owner selection for non-TUI CLI, `loader explore`, single-prompt execution, and TUI launch paths |
| 206 | +- `tests/test_tool_batches.py`, `tests/test_finalization.py`, `tests/test_completion_policy.py`, `tests/test_workflow_runtime.py`, `tests/test_workflow_timeline_read_model.py`, and `tests/test_inspection.py` now cover planned/pending/stale verification lifecycle transitions plus their policy/accountability projections |
| 204 | 207 | - `tests/test_explore_runtime.py` covers the direct explore lane contract, forced read-only behavior, persisted follow-up continuity, persisted `fresh` vs `continue` visibility, and `fresh` explore resets outside the parity harness |
| 205 | 208 | - `tests/test_expanded_tools.py` covers structured patch application, read-only git helpers, `notepad_append`, and richer structured user questions |
| 206 | 209 | - `tests/test_permissions.py` covers prompt/allow mode parsing, rule precedence, policy-backed prompting behavior, and hook lifecycle ordering |
@@ -237,3 +240,4 @@ As of 2026-04-09: |
| 237 | 240 | - Sprint 21 is complete: Loader now carries typed evidence provenance through canonical policy events, derives grouped policy-evidence rollups from one shared workflow-timeline read model, exposes “needed” vs “satisfied” evidence in `loader status` / `loader session show` / `loader workflow show`, and provides a runtime-owned internal handle so runtime-oriented code and tests no longer need to treat `Agent` as the only valid execution owner, but it still stops short of claw-code's fuller policy engine, a narrower runtime-first external API, and OMX's deeper verifier/interview rigor. |
| 238 | 241 | - Sprint 22 is complete on the verification-observation lane: Loader now captures typed verification observations closer to execution, carries those observations through canonical policy events and completion-stop decisions, and surfaces observed verification plus a unified `Recent Verification` view in `loader status` / `loader session show` / `loader workflow show`, but the planned runtime-first entry promotion beyond tests did not land and rolls forward as Sprint 23 debt alongside Loader's remaining gap to claw-code's fuller policy engine and OMX's deeper verifier/interview rigor. |
| 239 | 242 | - Sprint 23 is complete: Loader now uses the runtime-first seam in real internal integrations through `RuntimeHandle`, emits per-command `verify_observation` events while the verification loop runs, and surfaces persisted runtime-owner provenance in the existing operator views, but it still stops short of a narrower runtime-first public API, TUI migration away from the public shell, claw-code's fuller policy engine, and OMX's deeper verifier/interview rigor. |
| 243 | +- Sprint 24 is complete: Loader now uses the runtime-first owner seam for the TUI as well, distinguishes planned/pending/stale verification lifecycle state inside the canonical workflow timeline, and surfaces that lifecycle directly in status/session/workflow inspection, but it still stops short of a narrower runtime-first external API, richer verification queue/timestamp semantics, claw-code's fuller policy engine, and OMX's deeper verifier/interview rigor. |
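
The verification lifecycle described in this change (planned, pending, stale, skipped, observed) can be pictured with a small sketch. This is an illustration only, not Loader's implementation: `VerificationState` and `summarize` are hypothetical names chosen for the example, and only the five state names come from the text above. The point is that each state is counted separately rather than collapsed into one generic "missing proof" bucket.

```python
from enum import Enum


class VerificationState(str, Enum):
    """Hypothetical lifecycle states; only the state names come from the notes above."""

    PLANNED = "planned"
    PENDING = "pending"
    STALE = "stale"
    SKIPPED = "skipped"
    OBSERVED = "observed"


def summarize(states):
    """Count entries per lifecycle state, keeping every state distinct."""
    counts = {state.value: 0 for state in VerificationState}
    for state in states:
        # VerificationState(...) raises ValueError on an unknown state name.
        counts[VerificationState(state).value] += 1
    return counts
```

A caller could pass the state strings read from a timeline, for example `summarize(["planned", "observed", "stale"])`, and render the per-state counts in an inspection view. How Loader actually stores and projects these states is not shown in this diff.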