tenseleyflow/loader / e765d37

Browse files

Untrack .docs/ from remote, keep local

Authored by espadonne
SHA
e765d3774ac3a66a5f1c17e6bb9e7c5afe264fbe
Parents
346a24c
Tree
8bc3666

43 changed files

StatusFile+-
D .docs/PARITY.md 0 250
D .docs/REPORT.md 0 1116
D .docs/audit_sprints/index.md 0 82
D .docs/audit_sprints/sprint09.md 0 99
D .docs/audit_sprints/sprint09_baseline.md 0 108
D .docs/audit_sprints/sprint09_interactive_validation.md 0 96
D .docs/audit_sprints/sprint09_interactive_validation_native.md 0 89
D .docs/audit_sprints/sprint09_interactive_validation_raw_text.md 0 117
D .docs/audit_sprints/sprint10.md 0 93
D .docs/audit_sprints/sprint11.md 0 135
D .docs/audit_sprints/sprint12.md 0 91
D .docs/audit_sprints/sprint13.md 0 83
D .docs/audit_sprints/sprint13_closure.md 0 94
D .docs/audit_sprints/trunk_sitrep.md 0 191
D .docs/sprints/index.md 0 112
D .docs/sprints/sprint00.md 0 107
D .docs/sprints/sprint01.md 0 127
D .docs/sprints/sprint02.md 0 134
D .docs/sprints/sprint03.md 0 135
D .docs/sprints/sprint04.md 0 141
D .docs/sprints/sprint05.md 0 154
D .docs/sprints/sprint06.md 0 131
D .docs/sprints/sprint07.md 0 166
D .docs/sprints/sprint08.md 0 180
D .docs/sprints/sprint09.md 0 185
D .docs/sprints/sprint10.md 0 198
D .docs/sprints/sprint11.md 0 215
D .docs/sprints/sprint12.md 0 204
D .docs/sprints/sprint13.md 0 203
D .docs/sprints/sprint14.md 0 190
D .docs/sprints/sprint15.md 0 182
D .docs/sprints/sprint16.md 0 189
D .docs/sprints/sprint17.md 0 190
D .docs/sprints/sprint18.md 0 189
D .docs/sprints/sprint19.md 0 191
D .docs/sprints/sprint20.md 0 196
D .docs/sprints/sprint21.md 0 187
D .docs/sprints/sprint22.md 0 191
D .docs/sprints/sprint23.md 0 183
D .docs/sprints/sprint24.md 0 180
D .docs/sprints/sprint25.md 0 197
D .docs/sprints/sprint26.md 0 165
M .gitignore 1 0
.docs/PARITY.mddeleted
@@ -1,250 +0,0 @@
1
-# Loader Runtime Parity Checkpoint
2
-
3
-Date: 2026-04-09
4
-
5
-Deterministic baseline: `uv run pytest -q` → `416 passed`
6
-
7
-This file tracks the current deterministic runtime baseline for Loader. It stays intentionally narrow and operational: what the runtime can do today, what remains weak, and what scenarios we measure with repeatable tests.
8
-
9
-## Supported today
10
-
11
-- streamed text-only replies
12
-- native-tool round trips for `read`, `write`, `edit`, `patch`, `glob`, `grep`, `bash`, `git`, `TodoWrite`, `AskUserQuestion`, `project_memory_*`, and `notepad_*`
13
-- explicit permission modes: `read-only`, `workspace-write`, `danger-full-access`, `prompt`, and `allow`
14
-- tool lifecycle hooks in `pre_tool_use` → permission check → execute → `post_tool_use` / `post_tool_use_failure` order
15
-- rule-based permission policy with workspace-local `allow` / `deny` / `ask` rules from `.loader/permission-rules.json`
16
-- policy-backed prompting for destructive tool use, with approval context that includes mode, requirement, and matched rule information
17
-- `loader permissions show` for normalized rule inspection, source-path visibility, and prompt-state inspection without opening JSON files by hand
18
-- `loader permissions check` for dry-running one hypothetical tool request against the active policy, including required mode, normalized input summary, and matched-rule reasoning
19
-- raw JSON fallback when the model emits tool syntax in plain text
20
-- raw JSON fallback now routes through the runtime parser plus the active registry, including modern workflow tools such as `TodoWrite` and `AskUserQuestion`
21
-- persisted definition-of-done state under `.loader/dod/`
22
-- persisted clarify briefs under `.loader/briefs/`
23
-- persisted implementation and verification plans under `.loader/plans/`
24
-- persisted conversation sessions under `.loader/sessions/` plus active session state under `.loader/state/`
25
-- persisted permission policy metadata alongside session state, so `loader status` / `loader session list` / `loader session show` can explain the effective policy that ran
26
-- `loader --resume` and `loader --resume <session-id>` restore persisted session state
27
-- durable project memory in `.loader/project-memory.json` and working notes in `.loader/notepad.md`
28
-- native memory tools for `project_memory_*` and `notepad_*`
29
-- scored workflow routing across `clarify` → `plan` → `execute` → `verify`, with route scores, runner-up pressure, unresolved-question carry-forward, and scheduled-next-mode hints
30
-- typed workflow-signal extraction with persisted `signal_summary` context for route pressure, recent workflow history, and unresolved questions
31
-- mode-specific system prompts for clarify, plan, execute, and verify
32
-- intent-aware bounded clarify follow-through with explicit focus slots and persisted unresolved-question carry-forward
33
-- pressure-pass clarify reviews with explicit readiness gates, challenged-assumption/tradeoff/example pressure kinds, and persisted clarify-pressure metadata
34
-- codebase-backed clarify grounding with workspace evidence, repo facts, slot-aware evidence selection, pressure-aware evidence selection, and grounded brief hints for persisted clarify artifacts
35
-- semantic artifact invalidation that can choose targeted plan refresh, clarify reentry, or full re-plan before execution continues
36
-- structured workflow drift evidence covering confirmed touchpoints, inferred touchpoints, acceptance anchors, contradicted assumptions, verification contradictions, and task-boundary drift
37
-- persisted workflow ledger state for assumptions, contradicted assumptions, acceptance anchors, and open/closed decision boundaries, threaded through clarify, plan, recovery, and inspection
38
-- persisted workflow timeline entries for routes, handoffs, reentries, clarify outcomes, plan refreshes, and verify skips
39
-- explicit verify/fix loops for mutating tasks, with a bounded retry budget
40
-- verify/fix retries return to execute mode without re-triggering clarify or plan
41
-- task-size-aware verification command derivation based on actual tool history
42
-- verification command loading from persisted `verification.md` artifacts when present
43
-- heuristic completion nudges only for non-mutating tasks; mutating tasks now complete through the DoD gate
44
-- typed `TurnSummary` output for completed turns, including trace events and tool-result messages
45
-- normalized per-turn usage plus cumulative session usage in `TurnSummary`
46
-- automatic transcript compaction with priority-aware line compression and continuation instructions
47
-- unified tool execution for native and extracted tool calls through `runtime.executor.ToolExecutor`
48
-- typed tool-result messages backed by `Message.tool_results`
49
-- typed prompt construction in `runtime.prompting`, with explicit dynamic sections, a static/dynamic boundary marker, and persisted prompt-format / prompt-section metadata in session state
50
-- persisted prompt snapshot history in session state so prompt-contract changes survive resume and later inspection
51
-- validated turn-state transitions (`prepare`, `assistant`, `repair`, `tools`, `critique`, `completion`, `finalize`) with typed transition metadata, persisted session state, and emitted runtime events
52
-- typed workflow-decision metadata persisted in session/runtime state, including reason codes, summaries, decision kind, workflow scores, and scheduled-next-mode hints
53
-- `loader doctor` for backend, capability, workspace, command, state, and permission health checks outside the main runtime loop
54
-- `loader status` plus `loader session list/show/resume` for inspecting persisted runtime state without invoking the LLM
55
-- `loader prompt show [task]` for previewing the current prompt contract, workflow mode, permission mode, dynamic sections, and prompt body without a live model request
56
-- `loader prompt diff [session-id]` for comparing persisted prompt contracts, with concise summaries by default and unified diffs on demand
57
-- `loader workflow show [session-id]` with `--mode`, `--kind`, and `--limit` filters plus operator-focused workflow highlights, recent timeline snippets in `loader session show`, and `--diff` / `--full-diff` artifact comparison for persisted workflow artifacts
58
-- `loader explore <prompt>` as a read-only lookup lane with its own prompt, constrained registry, persisted bounded continuity under `.loader/state/explore.json`, `--fresh` to ignore prior explore history when needed, and `--status` / `--reset` for continuity inspection and reset
59
-- `RuntimeContext` is now the primary runtime seam for workflow state, turn phases, response repair, no-tool completion, response routing, turn looping, finalization, workflow lanes, and workflow recovery; the older `RuntimeLegacyServices` shim has been removed
60
-- shared runtime bootstrap through `runtime.bootstrap.build_runtime_context(...)` / `sync_runtime_context(...)`, with both conversation and explore runtimes now starting from an explicit `RuntimeBootstrapView` rather than a raw `Agent` object by default
61
-- runtime-owned safeguard and reasoning helpers now have canonical homes under `src/loader/runtime/`; `src/loader/agent/safeguards.py` and `src/loader/agent/reasoning.py` are compatibility-export layers rather than the primary implementations
62
-- the public launcher contract now owns conversational routing, decomposition entry routing, direct turn routing, and explore launch through `src/loader/runtime/launcher.py`, which leaves a smaller and more honest `src/loader/agent/loop.py`
63
-- runtime-owned prompt/session shell helpers now live in `src/loader/runtime/public_shell.py`, including prompt construction, prompt snapshot persistence, session creation, session restore, and few-shot example selection
64
-- runtime-owned public-shell helpers now also own prompt-mode resolution, workflow-mode prompt invalidation, and owner-bound system/few-shot construction, leaving `src/loader/agent/loop.py` as an explicitly documented public facade instead of an ambiguous leftover shell
65
-- compatibility exports are now explicitly bounded by direct tests, and internal runtime code is guarded against drifting back to `agent/reasoning.py` / `agent/safeguards.py` imports
66
-- CLI and TUI status surfaces for model, capability profile, mode, workflow mode, workflow reason, last transition summary, permission mode, explicit turn phase, prompt format/sections, DoD phase, pending items, last verification result, and active session id
67
-- CLI status now also surfaces recent explore activity, including bounded explore turn/message counts, the last explore history mode, and the last explore query
68
-- CLI and TUI workflow-mode visibility plus artifact notifications
69
-- CLI and TUI permission-mode visibility with color-coded status
70
-- workspace-bound file operations with canonicalized boundary checks, binary detection, size limits, and structured patch metadata
71
-- shell mutability classification plus structured truncation and stderr/exit-code metadata
72
-- richer structured `AskUserQuestion` prompts with titles, context, options, and optional freeform responses
73
-- honest repair/completion behavior for no-tool turns: empty assistant replies get a single explicit retry, and Loader no longer relies on synthetic prefill, fake-tool scolding reroutes, or self-critique puppeting for plain-text answers
74
-- raw-text tool recovery now also fails honestly once its budget is exhausted instead of adding a synthetic follow-up invitation
75
-- dedicated assistant-response routing in `runtime.response_routing`, so final-answer, tool-batch, and no-tool completion dispatch no longer live inline inside `turn_iteration.py`
76
-- assistant-turn request handling now lives in `runtime.assistant_turns`, clarify/plan lane execution now lives in `runtime.workflow_lanes`, tool-batch execution/recovery now lives in `runtime.tool_batches`, DoD/finalization logic now lives in `runtime.finalization`, workflow-state/session mutation lives in `runtime.workflow_state`, and the main loop now runs through `runtime.turn_preparation`, `runtime.turn_preamble`, `runtime.turn_iteration`, and `runtime.turn_loop` instead of accumulating further inside `conversation.py`
77
-- `src/loader/runtime/conversation.py` now acts as a compact coordinator over dedicated runtime controllers rather than owning a monolithic turn loop
78
-- persisted completion-decision summaries plus bounded completion traces are now available in session/runtime state, status/session inspection, and resume flows
79
-- typed follow-through evidence now backs non-mutating completion checks, with explicit required/missing evidence and honest terminal failure once the continuation budget is exhausted without enough proof of completion
80
-- the workflow timeline is now the canonical policy/accountability artifact for completion decisions too, with live and persisted completion traces projected from that timeline as a compact read model rather than maintained as a separate peer runtime contract
81
-- runtime-owned public-shell helpers now cover prompt/session factories, session install/load helpers, steering mailbox behavior, sync/async event wrapping, and capability-refresh decision helpers
82
-- `loader workflow show --policy` now filters directly to unified repair / verify-skip / completion accountability events, and `loader session show` now includes a `Policy Timeline` preview so operators can inspect the stop/continue/retry story without stitching together separate surfaces by hand
83
-- `loader status` and `loader session show` now also surface a latest-policy rollup sourced from the canonical workflow timeline so operators can see the last important stop/continue/retry decision at a glance
84
-- typed evidence provenance now flows through canonical workflow events and projected completion traces, so Loader can distinguish supporting, missing, and contradictory evidence instead of flattening everything into one summary string too early
85
-- `loader status`, `loader session show`, and `loader workflow show` now surface concise policy-evidence rollups derived from the canonical workflow timeline, including what evidence was still needed and what evidence satisfied the latest stop/continue decision
86
-- typed verification observations now flow through finalization, canonical policy events, and projected completion traces, so Loader preserves what verification was actually observed closer to execution instead of reconstructing that story only from later DoD/session summaries
87
-- completion stop/continue policy now cites observed verification facts when available, and exhausted continuation failures preserve those observed verification results in the canonical accountability story instead of only reporting generic missing evidence
88
-- `loader status`, `loader session show`, and `loader workflow show` now surface observed verification directly, and `Recent Verification` is unified from canonical policy observations first with DoD evidence only as a fallback
89
-- Loader now has a runtime-owned internal execution handle in `src/loader/runtime/runtime_handle.py`, and runtime-oriented launcher/bootstrap/public-shell tests no longer need to treat `Agent` as the only valid runtime owner
90
-- non-TUI CLI paths, `loader explore`, and the scripted runtime harness now default to the runtime-first owner seam below `Agent`, so Loader uses `RuntimeHandle` in real internal integrations instead of reserving it for tests
91
-- the TUI now also launches through that runtime-first shell-owner seam below `Agent`, so the last major product path is no longer using the public facade by habit
92
-- `src/loader/runtime/runtime_api.py` now defines a narrower runtime-owned shell API used by CLI and TUI owner construction, with `Agent` left as the documented public compatibility facade instead of the default internal contract
93
-- persisted session state now records the active runtime-owner path, and `loader status`, `loader session list/show`, and `loader workflow show` surface that runtime-owner provenance directly
94
-- verification now emits per-command `verify_observation` events into the canonical workflow timeline while the verification loop is running, and workflow/policy read models project those entries as first-class accountability state
95
-- verification lifecycle now distinguishes planned, pending, stale, skipped, and observed states inside the canonical workflow timeline, and completion policy plus inspection surfaces preserve those states directly instead of flattening them into generic missing-proof summaries
96
-- verification lifecycle now also carries explicit attempt identity across planned, pending, stale, skipped, and observed states, including supersession labels like `attempt 1 -> attempt 2` inside completion policy and inspection surfaces
97
-- `loader status`, `loader session show`, `loader workflow show`, and the TUI status line now surface runtime-boundary summaries plus attempt-aware verification state/DoD summaries instead of only raw owner or lifecycle labels
98
-
99
-## Known weak spots
100
-
101
-- the public runtime boundary is now explicit and runtime-shaped, and Loader now also has real runtime-first internal integrations through `RuntimeHandle` plus the narrower `runtime.runtime_api` contract across CLI, explore, the scripted harness, and TUI launch, but `Agent` plus `runtime.public_shell` still supply the outer compatibility boundary instead of a fully runtime-first external API
102
-- verification attempt identity is now explicit and attempt-aware completion can explain active versus superseded proof, but Loader still does not preserve richer queue/start/finish timing semantics, deeper multi-command attempt bundles, or OMX-style verifier reasoning
103
-- [`src/loader/agent/loop.py`](../src/loader/agent/loop.py) is down to 267 lines and much closer to a public facade than the pre-Sprint-15 shell, but it still owns the compatibility shell and remaining launcher/UI glue instead of disappearing entirely
104
-- [`src/loader/agent/reasoning.py`](../src/loader/agent/reasoning.py) and [`src/loader/agent/safeguards.py`](../src/loader/agent/safeguards.py) are now compatibility shims rather than primary implementations, but they still remain as export layers until Loader narrows its external compatibility surface further
105
-- [`src/loader/runtime/tool_batches.py`](../src/loader/runtime/tool_batches.py) and parts of [`src/loader/runtime/workflow_lanes.py`](../src/loader/runtime/workflow_lanes.py) are narrower and more directly tested than before, but they still carry more heuristic policy than the tightest reference seams in `refs/claw-code`
106
-- the workflow policy now consumes typed signals, but signal extraction is still heuristic and hand-tuned; Loader does not yet implement OMX's deeper ambiguity analysis, richer pressure-pass discipline, or branch-specific policy depth
107
-- clarify is now intent-aware, pressure-aware, and codebase-grounded, but it is still much shallower than OMX's deep-interview behavior and does not adapt its budget or questioning style by task class
108
-- plan freshness now handles broader semantic invalidation with typed evidence, but it is still lightweight and runtime-authored; Loader does not yet reason deeply over richer artifact metadata, contradicting verification evidence, or larger task reframes
109
-- the workflow ledger is now explicit and persisted, but it is still a pragmatic text-first contract rather than deeper symbolic task/state reasoning with stronger provenance
110
-- plan mode is still a single-pass artifact generator, not a Planner/Architect/Critic consensus loop
111
-- DoD acceptance criteria and pending items are stronger than Sprint 02, but todo progress is still lightly structured compared with claw-code's richer workflow state
112
-- evidence summaries are deterministic runtime summaries of captured output, not model-written verification narratives
113
-- session compaction summaries are heuristic runtime summaries, not model-assisted continuity artifacts
114
-- project-memory capture on finalized DoD evidence is still lightweight and command-summary oriented, not semantically curated memory extraction
115
-- rule syntax is intentionally narrow and workspace-local; Loader still does not have claw-code's richer rule model or broader prompt/allow operator surface
116
-- policy state is inspectable in doctor/status/session surfaces and dry-runnable through `loader permissions show/check`, but there is not yet a richer UX for editing, previewing multiple candidate rule sets, or temporarily overriding rules from the product surface
117
-- follow-through evidence is now explicit and persisted, but it is still heuristic and runtime-authored rather than backed by a deeper verifier/model contract or richer artifact-derived proof model
118
-- prompt assembly is now typed, previewable, and diffable across persisted sessions, but Loader still does not compare multiple candidate prompt contracts before execution or enforce a richer prompt-contract parity harness beyond the current unit and inspection coverage
119
-- workflow history is now filterable, ledger-backed, and diffable for persisted artifacts, but it is still text-first; Loader still does not offer semantic/AST-aware artifact diffs, richer artifact preview UX, or a visual workflow trace
120
-- shell safety is still heuristic and command-based; Loader does not yet have a richer shell sandbox or argument-aware mutability model
121
-- explore mode now has lightweight transcript continuity plus `loader explore --status` / `--reset`, but it is still a narrow read-only lookup lane rather than a richer interactive inspection workflow with deeper repo navigation affordances
122
-- the read-only `git` helper is intentionally narrow compared with claw-code and OMX's broader repo/product surfaces, and the `patch` tool still stops short of AST/LSP-aware editing
123
-
124
-## Out of scope in the current baseline
125
-
126
-- richer permission-rule UX / per-command allowlists
127
-- multi-agent / team orchestration
128
-
129
-## Deterministic parity scenarios
130
-
131
-The auditable manifest lives at [`tests/fixtures/runtime_parity_manifest.json`](../tests/fixtures/runtime_parity_manifest.json) and is exercised by [`tests/test_runtime_harness.py`](../tests/test_runtime_harness.py). Sprint 04 adds focused workflow integration coverage in [`tests/test_workflow_runtime.py`](../tests/test_workflow_runtime.py) and artifact/router unit coverage in [`tests/test_workflow.py`](../tests/test_workflow.py). Sprint 06 adds inspection/explore coverage in [`tests/test_inspection.py`](../tests/test_inspection.py), [`tests/test_explore_runtime.py`](../tests/test_explore_runtime.py), and [`tests/test_expanded_tools.py`](../tests/test_expanded_tools.py). Sprint 10 extends that workflow coverage in [`tests/test_workflow_policy.py`](../tests/test_workflow_policy.py), [`tests/test_workflow_runtime.py`](../tests/test_workflow_runtime.py), and [`tests/test_inspection.py`](../tests/test_inspection.py) for scored routing, clarify-budget behavior, plan refresh, and workflow timeline inspection. Sprint 11 adds [`tests/test_workflow_signals.py`](../tests/test_workflow_signals.py), [`tests/test_clarify_strategy.py`](../tests/test_clarify_strategy.py), [`tests/test_artifact_invalidation.py`](../tests/test_artifact_invalidation.py), and expanded inspection/runtime coverage for signal summaries, intent-aware clarify, semantic replan recovery, and workflow timeline filtering/highlights. Sprint 12 adds [`tests/test_clarify_grounding.py`](../tests/test_clarify_grounding.py), [`tests/test_turn_preparation.py`](../tests/test_turn_preparation.py), [`tests/test_turn_completion.py`](../tests/test_turn_completion.py), [`tests/test_turn_iteration.py`](../tests/test_turn_iteration.py), [`tests/test_turn_preamble.py`](../tests/test_turn_preamble.py), [`tests/test_workflow_state.py`](../tests/test_workflow_state.py), and [`tests/test_turn_loop.py`](../tests/test_turn_loop.py) for grounded clarify, structured recovery evidence, and the controllerized turn runtime. Sprint 13 adds [`tests/test_runtime_repair_flows.py`](../tests/test_runtime_repair_flows.py), [`tests/test_response_routing.py`](../tests/test_response_routing.py), [`tests/test_workflow_ledger.py`](../tests/test_workflow_ledger.py), and expanded [`tests/test_session_state.py`](../tests/test_session_state.py) / [`tests/test_inspection.py`](../tests/test_inspection.py) coverage for honest repair behavior, dedicated response routing, persisted semantic ledger state, prompt snapshot history, and prompt/artifact diff inspection. Sprint 15 adds [`tests/test_runtime_bootstrap.py`](../tests/test_runtime_bootstrap.py), [`tests/test_safeguard_services.py`](../tests/test_safeguard_services.py), [`tests/test_reasoning_compat.py`](../tests/test_reasoning_compat.py), and updated [`tests/test_runtime_context.py`](../tests/test_runtime_context.py) coverage for the shared bootstrap seam plus the runtime-owned safeguards/reasoning compatibility contract. Sprint 16 adds [`tests/test_runtime_launcher.py`](../tests/test_runtime_launcher.py), [`tests/test_chat_lane.py`](../tests/test_chat_lane.py), [`tests/test_decomposition_lane.py`](../tests/test_decomposition_lane.py), [`tests/test_compat_boundaries.py`](../tests/test_compat_boundaries.py), and expanded [`tests/test_explore_runtime.py`](../tests/test_explore_runtime.py) / [`tests/test_inspection.py`](../tests/test_inspection.py) coverage for the public launcher contract, compatibility boundaries, and persisted explore continuity. Sprint 17 adds [`tests/test_runtime_public_shell.py`](../tests/test_runtime_public_shell.py), expanded [`tests/test_runtime_bootstrap.py`](../tests/test_runtime_bootstrap.py) / [`tests/test_runtime_launcher.py`](../tests/test_runtime_launcher.py) / [`tests/test_runtime_context.py`](../tests/test_runtime_context.py) coverage for the explicit bootstrap view, expanded [`tests/test_repair.py`](../tests/test_repair.py) / [`tests/test_runtime_repair_flows.py`](../tests/test_runtime_repair_flows.py) coverage for honest raw-text recovery failure, and expanded [`tests/test_explore_runtime.py`](../tests/test_explore_runtime.py) / [`tests/test_inspection.py`](../tests/test_inspection.py) coverage for explore continuity inspection, reset, and persisted fresh-vs-continue visibility. Sprint 21 adds [`tests/test_evidence_provenance.py`](../tests/test_evidence_provenance.py), [`tests/test_workflow_timeline_read_model.py`](../tests/test_workflow_timeline_read_model.py), [`tests/test_runtime_handle.py`](../tests/test_runtime_handle.py), and expanded [`tests/test_runtime_launcher.py`](../tests/test_runtime_launcher.py), [`tests/test_turn_preparation.py`](../tests/test_turn_preparation.py), [`tests/test_runtime_public_shell.py`](../tests/test_runtime_public_shell.py), and [`tests/test_inspection.py`](../tests/test_inspection.py) coverage for typed evidence provenance, grouped policy-evidence rollups, the runtime-first internal handle, and evidence-backed status/session/workflow inspection. Sprint 22 adds [`tests/test_verification_observations.py`](../tests/test_verification_observations.py) plus expanded [`tests/test_finalization.py`](../tests/test_finalization.py), [`tests/test_completion_policy.py`](../tests/test_completion_policy.py), [`tests/test_turn_completion.py`](../tests/test_turn_completion.py), [`tests/test_workflow_timeline_read_model.py`](../tests/test_workflow_timeline_read_model.py), and [`tests/test_inspection.py`](../tests/test_inspection.py) coverage for typed verification observations, observed-verification stop/continue reasoning, and unified verification inspection sourced from canonical policy events. Sprint 24 adds expanded [`tests/test_cli_runtime_owner.py`](../tests/test_cli_runtime_owner.py) coverage for runtime-first TUI launch ownership plus expanded [`tests/test_tool_batches.py`](../tests/test_tool_batches.py), [`tests/test_finalization.py`](../tests/test_finalization.py), [`tests/test_completion_policy.py`](../tests/test_completion_policy.py), [`tests/test_workflow_runtime.py`](../tests/test_workflow_runtime.py), [`tests/test_workflow_timeline_read_model.py`](../tests/test_workflow_timeline_read_model.py), and [`tests/test_inspection.py`](../tests/test_inspection.py) coverage for planned/pending/stale verification lifecycle and its operator-facing projections. Sprint 25 adds expanded [`tests/test_cli_runtime_owner.py`](../tests/test_cli_runtime_owner.py), [`tests/test_completion_policy.py`](../tests/test_completion_policy.py), [`tests/test_turn_completion.py`](../tests/test_turn_completion.py), [`tests/test_inspection.py`](../tests/test_inspection.py), and [`tests/test_status_surfaces.py`](../tests/test_status_surfaces.py) coverage for the runtime-owned shell API boundary, explicit verification-attempt identity, attempt-aware completion/freshness reasoning, and boundary/attempt operator surfaces.
132
-
133
-- `streaming_text`: green
134
-- `read_file_roundtrip`: green
135
-- `multi_tool_turn_roundtrip`: green
136
-- `write_file_allowed`: green
137
-- `write_file_denied`: green
138
-- `bash_stdout_roundtrip`: green
139
-- `bash_confirmation_prompt_approved`: green
140
-- `bash_confirmation_prompt_denied`: green
141
-- `read_only_mode_denies_write`: green
142
-- `read_only_mode_denies_mutating_bash`: green
143
-- `read_only_mode_allows_safe_bash`: green
144
-- `workspace_write_denies_write_outside_root`: green
145
-- `danger_full_access_allows_dangerous_bash`: green
146
-- `prompt_mode_prompts_destructive_write`: green
147
-- `allow_mode_skips_prompt_for_destructive_write`: green
148
-- `deny_rule_blocks_allowed_mode`: green
149
-- `ask_rule_prompts_even_when_mode_would_allow`: green
150
-- `raw_json_tool_call_fallback`: green
151
-- `completion_check_continuation`: green
152
-- `tool_result_contract_regression`: green
153
-- `turn_summary_smoke_for_multi_tool_turn`: green
154
-- `native_and_raw_tool_paths_share_executor_trace`: green
155
-- `backend_capability_probe_refreshes_native_tool_mode`: green
156
-- `run_streaming_delegates_to_primary_runtime`: green
157
-- `definition_of_done_verify_phase`: green
158
-- `verify_failure_routes_to_fix_loop`: green
159
-- `verify_retry_budget_exhaustion`: green
160
-- `ambiguous_prompt_routes_to_clarify`: green
161
-- `complex_prompt_routes_to_plan`: green
162
-- `verify_failure_fix_loop_does_not_reroute_workflow`: green
163
-- `conversational_task_skips_verify_phase`: green
164
-- `explore_mode_skips_dod_and_router`: green
165
-- `explore_mode_denies_write`: green
166
-- `explore_mode_ignores_global_allow_policy`: green
167
-
168
-## Verification snapshot
169
-
170
-As of 2026-04-09:
171
-
172
-- `uv run pytest -q`: 416 passed
173
-- `tests/test_runtime_harness.py` is fully green, including permission-mode parity, DoD verify/fix coverage, workflow routing parity, and the original contract regression
174
-- `tests/test_prompt_builder.py` covers section rendering, native-vs-ReAct formatting, and prompt metadata persistence
175
-- `tests/test_turn_state_machine.py` covers allowed/disallowed turn transitions and terminal transition metadata
176
-- `tests/test_runtime_phases.py` covers repair/completion phase transitions plus persisted transition metadata in runtime events and session state
177
-- `tests/test_runtime_repair_flows.py` covers honest empty-response retries, no synthetic prefill on first turns, and the removal of the older no-tool puppeting/scolding reroutes
178
-- `tests/test_runtime_context.py` and `tests/test_runtime_state_controllers.py` cover typed runtime-context construction plus direct workflow-state and phase-tracker behavior without relying on a full `Agent`
179
-- `tests/test_runtime_bootstrap.py` covers the shared runtime bootstrap contract, prompt/capability synchronization, and direct conversation/explore construction through the runtime bootstrap seam
180
-- `tests/test_runtime_public_shell.py` covers runtime-owned prompt/session shell helpers directly, including session creation metadata, prompt snapshot persistence, restored last-turn summary state, steering mailbox behavior, sync/async event emitter normalization, capability-refresh helper behavior, and fresh/load session-install helpers
181
-- `tests/test_runtime_launcher.py`, `tests/test_chat_lane.py`, and `tests/test_decomposition_lane.py` cover the public launcher contract, conversational entry routing, decomposition entry routing, and direct runtime turn delegation
182
-- `tests/test_safeguard_services.py` covers the canonical runtime safeguard implementation plus the compatibility-export path under `loader.agent.safeguards`
183
-- `tests/test_reasoning_compat.py` covers runtime-owned deliberation/completion helpers plus the compatibility-export path under `loader.agent.reasoning`
184
-- `tests/test_compat_boundaries.py` fails if internal Loader code drifts back to importing runtime-owned helpers through compatibility shims
185
-- `tests/test_repair.py` covers raw-text fallback through the runtime parser and active registry, including `TodoWrite` recovery and honest failure once raw-tool recovery exceeds its budget
186
-- `tests/test_completion_policy.py` covers direct text-loop bailout and continuation-prompt behavior on the typed runtime context
187
-- `tests/test_completion_policy.py`, `tests/test_turn_completion.py`, and `tests/test_session_state.py` now also pin typed follow-through evidence, honest budget-exhausted completion finalization, and persisted completion-evidence summaries
188
-- `tests/test_response_routing.py` covers direct final-answer routing and halted tool-batch routing at the new response-policy seam
189
-- `tests/test_dod.py` covers persistence, sizing boundaries, and verification command derivation
190
-- `tests/test_workflow.py` covers workflow artifact round trips, scored-router expectations, DoD workflow links, and todo-to-DoD syncing
191
-- `tests/test_workflow_signals.py` covers typed signal extraction, recent timeline pressure, and persisted `signal_summary` context
192
-- `tests/test_clarify_strategy.py` covers clarify-slot prioritization and targeted follow-up question selection
193
-- `tests/test_clarify_grounding.py` covers workspace evidence extraction, slot-aware/pressure-aware clarify grounding, and grounded clarify-brief hints
194
-- `tests/test_workflow_policy.py` covers score breakdowns, clarify follow-up reviews, signal-summary persistence, artifact-freshness metadata, and workflow timeline serialization
195
-- `tests/test_artifact_invalidation.py` covers semantic invalidation triggers plus targeted recovery selection for plan refresh vs full re-plan
196
-- `tests/test_workflow_ledger.py` covers ledger seeding, contradiction tracking, acceptance-anchor updates, and operator-facing highlight summaries
197
-- `tests/test_workflow_runtime.py` covers clarify routing, intent-aware clarify continuation, plan routing, targeted plan refresh, full re-plan through clarify reentry, verify-fix workflow handoff, persisted workflow-decision metadata, and workflow-ledger updates through runtime recovery
198
-- `tests/test_turn_preparation.py`, `tests/test_turn_completion.py`, `tests/test_turn_iteration.py`, `tests/test_turn_preamble.py`, `tests/test_workflow_state.py`, and `tests/test_turn_loop.py` cover the controllerized turn-runtime seams directly instead of relying only on large end-to-end runtime tests
199
-- `tests/test_workflow_tools.py` and `tests/test_workflow_runtime_tools.py` cover `TodoWrite`, `AskUserQuestion`, and runtime callback plumbing
200
-- `tests/test_session_state.py` covers session persistence, resume, rotation, compaction persistence, cumulative usage rollups, persisted permission-policy metadata, workflow-ledger state, and prompt snapshot history
201
-- `tests/test_compaction.py` covers claw-style line compression and compacted continuation-message behavior
202
-- `tests/test_memory_tools.py` covers project-memory writes, notepad writes, lifecycle-hook mirroring, and DoD-summary capture into project memory
203
-- `tests/test_cli_resume.py` covers `--resume` argument rewriting for latest and named-session restore
204
-- `tests/test_inspection.py` covers `loader doctor`, `loader status`, `loader session list/show`, `loader permissions show/check`, `loader prompt show`, `loader prompt diff`, `loader workflow show --diff`, workflow timeline filtering/highlights, the new `loader workflow show --policy` accountability filter, and workflow/session inspection surfaces, including recent explore activity plus `loader explore --status` / `--reset`
205
-- `tests/test_evidence_provenance.py` covers typed provenance serialization plus structured stop/continue evidence carried through completion policy state
206
-- `tests/test_verification_observations.py` covers typed verification-observation serialization and normalization
207
-- `tests/test_workflow_timeline_read_model.py` covers grouped supporting/missing policy-evidence rollups, latest-policy derivation, and observed-verification read models from the canonical workflow timeline
208
-- `tests/test_runtime_handle.py` covers the runtime-owned internal handle below `Agent`, including direct launcher/context/runtime construction without depending on the public compatibility facade
209
-- `tests/test_cli_runtime_owner.py` covers runtime-first owner selection for non-TUI CLI, `loader explore`, single-prompt execution, and TUI launch paths
210
-- `tests/test_completion_policy.py` and `tests/test_turn_completion.py` now also pin attempt-aware completion/freshness reasoning, including superseded-attempt summaries in continuation-stop decisions
211
-- `tests/test_inspection.py` and `tests/test_status_surfaces.py` now also cover runtime-boundary summaries, verification-state summaries, and TUI owner/attempt status rendering
212
-- `tests/test_tool_batches.py`, `tests/test_finalization.py`, `tests/test_completion_policy.py`, `tests/test_workflow_runtime.py`, `tests/test_workflow_timeline_read_model.py`, and `tests/test_inspection.py` now cover planned/pending/stale verification lifecycle transitions plus their policy/accountability projections
213
-- `tests/test_explore_runtime.py` covers the direct explore lane contract, forced read-only behavior, persisted follow-up continuity, persisted `fresh` vs `continue` visibility, and `fresh` explore resets outside the parity harness
214
-- `tests/test_expanded_tools.py` covers structured patch application, read-only git helpers, `notepad_append`, and richer structured user questions
215
-- `tests/test_permissions.py` covers prompt/allow mode parsing, rule precedence, policy-backed prompting behavior, and hook lifecycle ordering
216
-- `tests/test_tool_safety.py` covers workspace boundaries, binary/oversize guards, patch metadata, and shell truncation/classification
217
-- `tests/test_status_surfaces.py` covers the CLI/TUI DoD, workflow-mode, permission-mode, capability-profile, and session-id formatting helpers
218
-- `tests/test_runtime_public_shell.py`, `tests/test_session_state.py`, and `tests/test_inspection.py` now also cover persisted runtime-owner metadata plus its status/session/workflow rendering
219
-- native and extracted tool calls now record the same executor trace events, with source-specific metadata
220
-- turn startup can refine backend capability profiles before the first request, `run_streaming()` delegates into the main runtime path, mutating tasks route through persisted evidence-backed completion, workflow artifacts and workflow-ledger state survive across turns, sessions compact safely, explore queries bypass DoD/router overhead safely, policy rules are enforced deterministically, operators can inspect/dry-run policy decisions without live turns, prompt construction is sectioned and persisted, prompt snapshots and artifact diffs are inspectable after the fact, explicit turn phases are visible while a turn runs, session inspection preserves effective policy state, typed workflow signals now feed routing directly, semantic invalidation can force targeted refresh vs full re-plan, brownfield clarify can ask evidence-backed questions from repo facts, and the turn runtime now avoids the older synthetic repair/no-tool puppeting while routing assistant outcomes through dedicated controllers instead of a single conversation-loop monolith
221
-
222
-## Definition of honesty
223
-
224
-- If a scenario is green here, it should have deterministic automated coverage.
225
-- If a scenario is flaky or broken, it should be called out here before we claim parity work is done.
226
-- Sprint 01 turned the original `tool_call_id` regression green by fixing the message contract, not by weakening the test.
227
-- Sprint 02 replaced "looks done" completion for mutating tasks with a real verify/fix gate, but it has not yet reached the richer workflow contracts described in the report and Sprint 04+.
228
-- Sprint 03 established permission modes, hooks, and tool hardening, but it intentionally stops short of claw-code's fuller rule engine and prompt/allow permission variants.
229
-- Sprint 04 adds routing, artifacts, and structured user questions, but it is still a first-pass workflow layer rather than full OMX consensus planning or deep interview rigor.
230
-- Sprint 05 adds durable sessions, resume, compaction, and native memory/notepad tools, but it stops short of Sprint 06's inspectable session/status product surfaces and still uses heuristic continuity summaries rather than richer semantic memory extraction.
231
-- Sprint 06 adds inspectable product surfaces, a constrained explore lane, and a broader tool registry, but it still stops short of interactive explore workflows, richer git ergonomics, AST/LSP-aware editing, or any multi-agent/team runtime.
232
-- Sprint 07 is complete: Loader now has prompt/allow modes, rule-based permission policy, policy-backed prompting, persisted policy inspection state, and smaller assistant-turn/tool-batch/finalization runtime seams, but it still stops short of a richer rule UX, deeper policy sandboxing, and the more opinionated workflow/runtime contracts in the refs.
233
-- Sprint 08 is complete: Loader now has a typed prompt builder, explicit runtime turn phases, first-class `loader permissions show/check` operator surfaces, and more coherent prompt/policy observability in doctor/status/session output, but it still stops short of a richer rule editor, formal state-machine routing, prompt-preview tooling, or the deeper workflow rigor in the refs.
234
-- Sprint 09 is complete: Loader now has a validated turn state machine, typed persisted workflow decisions, `loader prompt show`, and richer workflow-reason/transition inspection surfaces, but it still stops short of deeper workflow routing policy, prompt diffing/versioning, and the more opinionated planning discipline used by the refs.
235
-- Sprint 10 is complete: Loader now has a scored workflow policy, bounded clarify follow-through, targeted plan-refresh discipline, persisted workflow timeline inspection, and a greener workflow test contract, but it still stops short of deeper semantic routing/replanning, adaptive clarify depth, prompt-history diffing, or the fuller workflow rigor used by the refs.
236
-- Sprint 11 is complete: Loader now has typed workflow signals, intent-aware clarify slots, semantic invalidation with targeted recovery vs full re-plan, filtered workflow inspection, and dedicated clarify/plan lane execution in `runtime.workflow_lanes`, but it still stops short of OMX-style pressure-pass interviewing, richer semantic artifact reasoning, and the deeper end-to-end orchestration discipline in the refs.
237
-- Sprint 12 is complete: Loader now has pressure-pass clarify behavior, codebase-backed clarify grounding, structured recovery evidence, controllerized turn-runtime seams, and a genuinely coordinator-shaped `runtime.conversation`, but it still stops short of OMX-style deep interview depth, richer semantic artifact reasoning, artifact/prompt diff ergonomics, and the broader operator/runtime sophistication in the refs.
238
-- Sprint 13 is complete: Loader now avoids synthetic prefill and no-tool puppeting, persists a semantic workflow ledger, exposes prompt/artifact diff surfaces, and routes assistant responses through a dedicated response-policy seam, but it still stops short of claw-code's narrower response/tool policy factoring, deeper planning rigor, and broader operator ergonomics.
239
-- Sprint 14 is complete: Loader now treats `RuntimeContext` as the primary seam across workflow state, turn phases, response policy, turn looping, workflow recovery, and finalization; the old runtime legacy shim is gone, raw-text tool recovery no longer depends on hidden agent extractors, and the hot path is substantially more runtime-owned, but bootstrap ownership still begins at the agent wrapper and Loader still stops short of claw-code's fuller policy engine, OMX's deeper workflow rigor, and richer operator/runtime surfaces.
240
-- Sprint 15 is complete: Loader now has a shared runtime bootstrap seam, runtime-owned safeguards and deliberation helpers, compatibility-only `agent/reasoning.py` and `agent/safeguards.py`, no remaining `Agent._build_runtime_context()` helper, and a much smaller `agent/loop.py` after dead planner/raw-extraction cleanup, but it still stops short of a minimal entrypoint shell, deeper explore ergonomics, claw-code's tighter policy engine, and OMX's richer planning/interview rigor.
241
-- Sprint 16 is complete: Loader now has a first-class runtime launcher contract, a thinner `agent/loop.py`, explicit compatibility-boundary proof, and persisted explore continuity plus status visibility, but it still stops short of a fully minimal public shell, richer explore workflows, claw-code's tighter policy seams, and OMX's deeper planning/interview rigor.
242
-- Sprint 17 is complete: Loader now starts public runtime launch from an explicit runtime-shaped bootstrap view, moves prompt/session shell helpers into `src/loader/runtime/public_shell.py`, fails raw-text tool recovery more honestly once its budget is exhausted, and exposes explore continuity through `loader explore --status` / `--reset`, but it still stops short of a fully minimal public shell, deeper completion-policy deletions, richer explore workflows, claw-code's tighter policy seams, and OMX's deeper planning/interview rigor.
243
-- Sprint 18 is complete: Loader now persists explicit completion decisions and bounded completion traces, exposes them through status/session inspection, restores them across resume, and moves more event/steering/capability/session shell glue into `src/loader/runtime/public_shell.py`, but it still stops short of a fully minimal public shell, harder deletion of all continuation heuristics, claw-code's tighter policy seams, and OMX's deeper planning/interview rigor.
244
-- Sprint 19 is complete: Loader now pushes more public entry glue under `src/loader/runtime/public_shell.py`, derives typed follow-through evidence for non-mutating completion checks, fails honestly when that evidence is still missing after the continuation budget is exhausted, and exposes a clearer unified policy story through `loader workflow show --policy` plus the `Policy Timeline` preview in `loader session show`, but it still stops short of a fully minimal public shell, claw-code's fuller policy engine, and OMX's deeper verifier/interview rigor.
245
-- Sprint 20 is complete: Loader now treats the workflow timeline as the canonical policy/accountability artifact even for live completion-trace projection, grounds more follow-through decisions in DoD verification state and tracked runtime evidence, exposes latest-policy rollups in the existing status/session surfaces, and explicitly settles the remaining `Agent` shell as a documented public facade guarded by boundary tests, but it still stops short of claw-code's fuller policy engine, a narrower runtime-first external API, and OMX's deeper verifier/interview rigor.
246
-- Sprint 21 is complete: Loader now carries typed evidence provenance through canonical policy events, derives grouped policy-evidence rollups from one shared workflow-timeline read model, exposes “needed” vs “satisfied” evidence in `loader status` / `loader session show` / `loader workflow show`, and provides a runtime-owned internal handle so runtime-oriented code and tests no longer need to treat `Agent` as the only valid execution owner, but it still stops short of claw-code's fuller policy engine, a narrower runtime-first external API, and OMX's deeper verifier/interview rigor.
247
-- Sprint 22 is complete on the verification-observation lane: Loader now captures typed verification observations closer to execution, carries those observations through canonical policy events and completion-stop decisions, and surfaces observed verification plus a unified `Recent Verification` view in `loader status` / `loader session show` / `loader workflow show`, but the planned runtime-first entry promotion beyond tests did not land and rolls forward as Sprint 23 debt alongside Loader's remaining gap to claw-code's fuller policy engine and OMX's deeper verifier/interview rigor.
248
-- Sprint 23 is complete: Loader now uses the runtime-first seam in real internal integrations through `RuntimeHandle`, emits per-command `verify_observation` events while the verification loop runs, and surfaces persisted runtime-owner provenance in the existing operator views, but it still stops short of a narrower runtime-first public API, TUI migration away from the public shell, claw-code's fuller policy engine, and OMX's deeper verifier/interview rigor.
249
-- Sprint 24 is complete: Loader now uses the runtime-first owner seam for the TUI as well, distinguishes planned/pending/stale verification lifecycle state inside the canonical policy timeline, and surfaces that lifecycle directly in status/session/workflow inspection, but it still stops short of a narrower runtime-first external API, richer verification queue/timestamp semantics, claw-code's fuller policy engine, and OMX's deeper verifier/interview rigor.
250
-- Sprint 25 is complete: Loader now has a runtime-owned shell API boundary below `Agent`, explicit verification-attempt identity carried through completion/freshness policy, and operator surfaces that expose runtime-boundary plus attempt-aware verification state across CLI and TUI, but it still stops short of a fully runtime-first external API, richer attempt queue/timestamp semantics, claw-code's fuller policy engine, and OMX's deeper verifier/interview rigor.
.docs/REPORT.mddeleted
1116 lines changed — click to load
@@ -1,1116 +0,0 @@
1
-# Loader Deep Dive: Gaps, Strengths, and a Path Toward Claw-Like Behavior
2
-
3
-Date: 2026-04-06
4
-
5
-## Scope and assumptions
6
-
7
-This report compares three things:
8
-
9
-1. `Loader` itself
10
-2. `refs/claw-code`, using the Rust workspace under `refs/claw-code/rust/` as the canonical runtime
11
-3. `refs/oh-my-codex` as the workflow-layer parent repo
12
-
13
-Assumption: `oh-my-codex` is the correct “parent repo” for this exercise. That assumption is based on:
14
-
15
-- `refs/claw-code/README.md`
16
-- `refs/claw-code/PHILOSOPHY.md`
17
-- the fact that `refs/claw-code` explicitly describes `src/` as a companion Python/reference workspace, not the primary runtime
18
-
19
-If you meant a different parent, we should rerun the comparison against that repo, but this is a solid first pass.
20
-
21
-## Executive summary
22
-
23
-Loader has the right instincts but is operating at the wrong layer.
24
-
25
-The codebase already knows that models need:
26
-
27
-- planning help
28
-- recovery help
29
-- confidence checks
30
-- completion checks
31
-- safe tool use
32
-
33
-But Loader mostly tries to enforce those after the model has already started drifting. `claw-code` and `oh-my-codex` get better behavior because they shape the work before, during, and after the model call:
34
-
35
-- before: explicit mode selection, clarification, approved planning artifacts
36
-- during: durable runtime state, richer tool surface, explicit permission model, session persistence
37
-- after: verification protocols, completion gates, retry/fix loops, parity harnesses, operator diagnostics
38
-
39
-The biggest lesson is not “copy their prompt.”
40
-
41
-The biggest lesson is:
42
-
43
-> Loader needs a stronger execution contract, not just stronger prompting.
44
-
45
-If we want Loader to feel closer to `claw-code` regardless of model choice, the highest-leverage work is:
46
-
47
-1. replace the monolithic heuristic loop with a typed turn engine
48
-2. add durable workflow/state artifacts
49
-3. make “definition of done” evidence-based instead of heuristic
50
-4. add real permission/safety boundaries around tools
51
-5. build a parity harness so we can improve behavior intentionally
52
-
53
-## Method
54
-
55
-I reviewed:
56
-
57
-- Loader source under `src/loader/`
58
-- Loader tests under `tests/`
59
-- `refs/claw-code/README.md`
60
-- `refs/claw-code/USAGE.md`
61
-- `refs/claw-code/PARITY.md`
62
-- `refs/claw-code/PHILOSOPHY.md`
63
-- `refs/claw-code/rust/crates/runtime/*`
64
-- `refs/claw-code/rust/crates/tools/src/lib.rs`
65
-- `refs/oh-my-codex/README.md`
66
-- `refs/oh-my-codex/AGENTS.md`
67
-- `refs/oh-my-codex/skills/deep-interview/SKILL.md`
68
-- `refs/oh-my-codex/skills/ralplan/SKILL.md`
69
-- `refs/oh-my-codex/skills/ralph/SKILL.md`
70
-- `refs/oh-my-codex/src/modes/base.ts`
71
-- `refs/oh-my-codex/src/ralplan/runtime.ts`
72
-- `refs/oh-my-codex/src/mcp/memory-server.ts`
73
-- `refs/oh-my-codex/src/verification/verifier.ts`
74
-- `refs/oh-my-codex/src/cli/doctor.ts`
75
-- `refs/oh-my-codex/src/scripts/notify-hook.ts`
76
-
77
-I also ran Loader verification commands:
78
-
79
-- `uv run pytest`
80
-  - failed during collection
81
-  - discovered `refs/claw-code/tests/*`
82
-  - also failed to import `loader`
83
-- `uv run --with pytest --with pytest-asyncio python -m pytest tests -q`
84
-  - 56 passed
85
-  - 3 failed
86
-
87
-That matters because some of Loader’s runtime paths are clearly under-tested.
88
-
89
-## What Loader already does well
90
-
91
-### 1. Loader is small, understandable, and hackable
92
-
93
-This is a real advantage.
94
-
95
-`src/loader/` is about 55 source files, and the core agent behavior is easy to locate. Compared to `claw-code` and especially OMX, Loader is much easier to refactor aggressively.
96
-
97
-### 2. Loader is genuinely local-first
98
-
99
-The Ollama-first posture is simple and useful. A lot of the complexity in `claw-code` and OMX comes from supporting broad operational surfaces, multiple runtimes, OAuth, MCP, tmux/team flows, and richer tool ecosystems. Loader can keep its local-first identity while still copying the good execution ideas.
100
-
101
-### 3. Loader already contains the seeds of a better system
102
-
103
-These are the right instincts:
104
-
105
-- project context detection in `src/loader/context/project.py`
106
-- runtime safeguards in `src/loader/agent/safeguards.py`
107
-- recovery categorization in `src/loader/agent/recovery.py`
108
-- optional decomposition / critique / confidence / verification / completion checks in `src/loader/agent/reasoning.py`
109
-- a decent Textual app in `src/loader/ui/app.py`
110
-
111
-The problem is not that Loader lacks ideas.
112
-
113
-The problem is that these ideas are bolted onto one big runtime loop instead of being elevated into the architecture.
114
-
115
-### 4. The TUI is a meaningful strength
116
-
117
-Loader’s TUI already gives you:
118
-
119
-- model selection
120
-- streaming output
121
-- approval handling
122
-- status line updates
123
-- tool widgets
124
-
125
-That is more product surface than many small local agents. It is worth keeping.
126
-
127
-## Where Loader is weak today
128
-
129
-### 1. Loader’s product surface is not trustworthy yet
130
-
131
-The most visible sign is the README:
132
-
133
-- `README.md:1-2` still says “FortranGoingOnForty” and “A tutorial on using Fortran for beginners.”
134
-
135
-That looks small, but it reflects a bigger problem: Loader is missing operational polish and self-diagnosis. `claw-code` and OMX both treat installability, health checks, and discoverability as product requirements. Loader currently feels like an experiment more than a tool.
136
-
137
-### 2. Loader’s main runtime is too monolithic and too heuristic
138
-
139
-`src/loader/agent/loop.py` is the heart of Loader, and it is doing too much:
140
-
141
-- prompt construction
142
-- streaming output handling
143
-- raw tool-call extraction
144
-- duplicate tool execution flows
145
-- recovery
146
-- validation
147
-- rollback tracking
148
-- completion nudging
149
-- loop detection
150
-- steering
151
-- partial planning
152
-- decomposition
153
-
154
-The result is a loop that is hard to reason about and easy to destabilize.
155
-
156
-The core design smell is that Loader tries to recover from model misbehavior in-place instead of enforcing a stronger turn protocol.
157
-
158
-### 3. Loader has a real runtime contract bug in tool-result handling
159
-
160
-**Verified directly against the code.** There is a concrete mismatch between `Message` and the loop:
161
-
162
-- `src/loader/llm/base.py:33-39` defines `Message` with `role`, `content`, `tool_calls`, and `tool_results`. There is no `tool_call_id` field on `Message` — that field belongs to the separate `ToolResult` dataclass at `src/loader/llm/base.py:25-30`.
163
-- `src/loader/agent/loop.py:885` and `src/loader/agent/loop.py:906` both construct `Message(role=Role.TOOL, content=..., tool_call_id=tool_call.id)`.
164
-
165
-Both call sites will raise `TypeError: Message.__init__() got an unexpected keyword argument 'tool_call_id'` the moment they execute. They live on the duplicate-suppression and pre-validation branches of the loop, which means they have **zero** integration coverage today. This single bug is the proof that the test harness gap is real and that Sprint 00 must precede any behavioral work.
166
-
167
-### 4. Loader duplicates tool execution logic instead of centralizing it
168
-
169
-There are effectively two execution paths:
170
-
171
-- the normal native/ReAct tool path
172
-- the “raw JSON extracted tool call” path
173
-
174
-Those paths duplicate:
175
-
176
-- duplicate checking
177
-- validation
178
-- confirmation behavior
179
-- result recording
180
-- loop/error handling
181
-
182
-That makes behavior inconsistent and increases the chance that fixes in one path never land in the other.
183
-
184
-`claw-code`’s `ConversationRuntime::run_turn()` is much tighter: receive assistant output, extract tool uses, authorize, execute, append tool results, repeat.
185
-
186
-### 5. Loader’s system prompt is too shallow and too rigid
187
-
188
-`src/loader/agent/prompts.py:148-208` gives Loader a generic “use tools immediately / no code blocks / no numbered steps / read files before editing” prompt.
189
-
190
-This is too blunt.
191
-
192
-Problems:
193
-
194
-- it treats all tasks like immediate tool-execution tasks
195
-- it globally bans numbered steps, which is bad for planning/reporting tasks
196
-- it does not define modes
197
-- it does not encode verification expectations
198
-- it does not encode completion criteria
199
-- it does not distinguish “clarify”, “plan”, “execute”, and “verify”
200
-
201
-OMX is much better here. It does not just say “do the task.” It routes the task into a workflow lane with an explicit contract.
202
-
203
-### 6. Loader’s tool surface is too thin
204
-
205
-Loader has 6 default tools:
206
-
207
-- `read`
208
-- `write`
209
-- `edit`
210
-- `glob`
211
-- `bash`
212
-- `grep`
213
-
214
-That is enough for toy execution, but not enough for strong agent behavior.
215
-
216
-What is missing compared to `claw-code` / OMX:
217
-
218
-- task/todo tracking
219
-- structured ask-user surfaces
220
-- memory/notepad
221
-- doctor/status/session tooling
222
-- git-aware helpers
223
-- explore vs full-execution split
224
-- diff/patch-aware editing
225
-- web/search/fetch surfaces
226
-- structured output surfaces
227
-- subagent/team coordination surfaces
228
-- MCP-backed state and memory
229
-
230
-The result is that Loader has to keep too much in the prompt and too much in ephemeral model state.
231
-
232
-### 7. Loader’s safety model is primitive
233
-
234
-Loader’s current protection model is mostly:
235
-
236
-- “safe commands” vs “ask for confirmation”
237
-- destructive tool flags
238
-
239
-Problems in practice:
240
-
241
-- no permission modes like `read-only`, `workspace-write`, `danger-full-access`
242
-- no strong workspace boundary checks
243
-- no binary-file guards
244
-- no file size limits
245
-- no symlink escape protection
246
-- no command semantics beyond a short safe list
247
-
248
-Evidence:
249
-
250
-- `src/loader/tools/file_tools.py` reads/writes resolved paths directly
251
-- `src/loader/tools/shell_tools.py` uses `create_subprocess_shell()` on arbitrary shell strings
252
-- `src/loader/tools/shell_tools.py:13-20` uses a short safe command set, but no mode-based authorization model
253
-
254
-By comparison, `claw-code` has:
255
-
256
-- `PermissionPolicy`
257
-- `PermissionEnforcer`
258
-- workspace boundary checks
259
-- binary/size guards in file ops
260
-- permission-mode aware tool definitions
261
-
262
-That does not just make it safer. It makes the agent more predictable.
263
-
264
-### 8. Loader’s “definition of done” is heuristic, not contractual
265
-
266
-The user complaint about “spending too long on simple tasks or finishing early without followup” is visible directly in the code.
267
-
268
-Loader’s current strategy is:
269
-
270
-- heuristically decide whether the response looks premature
271
-- nudge the model to continue
272
-- maybe ask it to confirm completion
273
-
274
-See:
275
-
276
-- `src/loader/agent/reasoning.py:721-854`
277
-
278
-This is well-intentioned, but it is still guesswork.
279
-
280
-It does not require:
281
-
282
-- explicit acceptance criteria
283
-- a verification plan
284
-- fresh command evidence
285
-- zero pending tasks
286
-- a final sign-off phase
287
-
288
-OMX’s `ralph` workflow does.
289
-
290
-That difference is enormous.
291
-
292
-### 9. Loader has no durable workflow state
293
-
294
-Loader has plans, decomposition, and completion logic, but they live inside one run and disappear.
295
-
296
-Missing pieces:
297
-
298
-- persisted mode state
299
-- session memory
300
-- approved plan artifacts
301
-- PRD / test-spec artifacts
302
-- progress ledger
303
-- durable “what was already decided”
304
-- resume-safe task state
305
-
306
-OMX writes state under `.omx/` and uses that to keep the workflow coherent across retries, handoffs, and interruptions. Loader currently depends on in-memory context plus prompt history only.
307
-
308
-### 10. Loader is too backend-specific and too capability-fragile
309
-
310
-Despite defining an abstract LLM backend, Loader is effectively Ollama-only today.
311
-
312
-Evidence:
313
-
314
-- `src/loader/cli/main.py` supports only `ollama`
315
-- `src/loader/llm/ollama.py` hardcodes native tool support by model-name substring matching
316
-
317
-This is fragile for behavior matching “with any model chosen.”
318
-
319
-What Loader needs instead is:
320
-
321
-- a provider-independent tool-calling contract
322
-- explicit capability profiles
323
-- distinct fallback strategies for native tools vs text tool calling
324
-- prompts/workflows that degrade gracefully
325
-
326
-### 11. Loader’s tests are not protecting the real runtime
327
-
328
-Loader’s test suite is mostly:
329
-
330
-- tool unit tests
331
-- parsing tests
332
-- recovery tests
333
-
334
-That is useful, but insufficient.
335
-
336
-The current state:
337
-
338
-- `uv run pytest` fails by default after adding `refs/`
339
-- the repo does not scope pytest discovery
340
-- the “normal” targeted run needs `--with pytest --with pytest-asyncio`
341
-- even then, 3 tests fail
342
-- there are no strong turn-loop integration tests
343
-- there is no deterministic mock backend harness comparable to `claw-code`
344
-
345
-This is why structural issues like the `tool_call_id` mismatch can survive.
346
-
347
-## What `claw-code` gets right
348
-
349
-## 1. The runtime contract is explicit
350
-
351
-`refs/claw-code/rust/crates/runtime/src/conversation.rs` is the biggest thing Loader should study.
352
-
353
-The core `run_turn()` flow is clean:
354
-
355
-1. append user message to session
356
-2. stream assistant response
357
-3. build a typed assistant message
358
-4. extract tool uses
359
-5. run permission checks
360
-6. execute tool
361
-7. append tool result message
362
-8. repeat until no more tool uses
363
-9. optionally compact session
364
-10. return a typed turn summary
365
-
366
-That is much more trustworthy than Loader’s current “stream + parse + filter + maybe reparse + maybe extract raw JSON + maybe duplicate path” approach.
367
-
368
-## 2. Session persistence and compaction are first-class
369
-
370
-`claw-code` treats long-lived sessions as a product feature:
371
-
372
-- persisted sessions
373
-- resume support
374
-- usage tracking
375
-- compaction thresholds
376
-- summarized continuation messages
377
-
378
-Relevant files:
379
-
380
-- `refs/claw-code/rust/crates/runtime/src/conversation.rs`
381
-- `refs/claw-code/rust/crates/runtime/src/compact.rs`
382
-- `refs/claw-code/rust/crates/runtime/src/summary_compression.rs`
383
-- `refs/claw-code/rust/crates/runtime/src/usage.rs`
384
-
385
-This matters because good agent behavior is often continuity behavior.
386
-
387
-## 3. Permissions are part of the runtime, not just UI confirmation
388
-
389
-`claw-code` has an actual permission model with three layers:
390
-
391
-- **Mode layer** — `PermissionMode` enum with `ReadOnly`, `WorkspaceWrite`, `DangerFullAccess`, `Prompt`, and `Allow` (`refs/claw-code/rust/crates/runtime/src/permissions.rs:8-27`)
392
-- **Per-tool requirement layer** — every `ToolSpec` declares the minimum mode it requires, mapped in `PermissionPolicy.tool_requirements`
393
-- **Rule layer** — three rule lists (`allow_rules`, `deny_rules`, `ask_rules`) for context-specific overrides on top of the mode/requirement check
394
-
395
-Plus typed authorization outcomes, file-write boundary logic, and bash gating.
396
-
397
-Relevant files:
398
-
399
-- `refs/claw-code/rust/crates/runtime/src/permission_enforcer.rs`
400
-- `refs/claw-code/rust/crates/runtime/src/permissions.rs`
401
-
402
-Loader needs this badly. The mode layer alone is the high-leverage start; the rule layer can come later.
403
-
404
-## 4. File and shell operations are engineered, not just exposed
405
-
406
-`claw-code`’s file layer includes:
407
-
408
-- max read size
409
-- max write size
410
-- binary detection
411
-- workspace-boundary validation
412
-- structured patch outputs
413
-
414
-Relevant file:
415
-
416
-- `refs/claw-code/rust/crates/runtime/src/file_ops.rs`
417
-
418
-Loader’s file tools are functional, but too permissive and too simplistic to support strong autonomous behavior.
419
-
420
-## 5. Hooks and lifecycle surfaces give the runtime escape valves
421
-
422
-`claw-code` has pre-tool and post-tool hooks, including failure hooks.
423
-
424
-That is important because not every behavioral improvement should live inside the model prompt. Hooks let the system inject policy, observability, and guardrails without changing the LLM call itself.
425
-
426
-Relevant files:
427
-
428
-- `refs/claw-code/rust/crates/runtime/src/hooks.rs`
429
-- `refs/claw-code/rust/crates/runtime/src/conversation.rs`
430
-
431
-## 6. The project is honest about parity and weaknesses
432
-
433
-`refs/claw-code/PARITY.md` is one of the best engineering lessons in the whole comparison.
434
-
435
-It does three things Loader does not yet do:
436
-
437
-- names what is actually shipped
438
-- names what is still shallow or stubbed
439
-- ties roadmap claims to concrete evidence
440
-
441
-That alone reduces thrash.
442
-
443
-Loader needs a similar parity/backlog document for runtime behavior.
444
-
445
-## 7. Diagnostics and operator surfaces are part of the product
446
-
447
-`claw-code` exposes operational commands like:
448
-
449
-- `status`
450
-- `sandbox`
451
-- `agents`
452
-- `mcp`
453
-- `skills`
454
-- `doctor`
455
-- session resume
456
-
457
-This is not just convenience. It makes the system inspectable. Loader currently hides too much inside the runtime.
458
-
459
-## Where `claw-code` is still incomplete
460
-
461
-It is worth staying honest here too.
462
-
463
-Even `claw-code` admits some shallowness in `PARITY.md`:
464
-
465
-- some surfaces are registry-backed approximations, not deep external integrations
466
-- session compaction parity is still open
467
-- token accounting accuracy is still open
468
-- some tool surfaces remain shallow or partially stubbed
469
-
470
-That is useful because the goal is not blind imitation. The goal is to copy the parts that most affect day-to-day behavior.
471
-
472
-## What OMX adds that Loader is currently missing almost entirely
473
-
474
-`claw-code` gives a better runtime. OMX gives a better workflow.
475
-
476
-This is where most of Loader’s “definition of done” and “follow-through” problems are answered.
477
-
478
-### 1. Clarification is a mode, not an ad hoc question
479
-
480
-`deep-interview` is not “ask a question if confused.”
481
-
482
-It is a formal ambiguity-reduction workflow with:
483
-
484
-- a context snapshot
485
-- one-question rounds
486
-- ambiguity scoring
487
-- explicit non-goals
488
-- explicit decision boundaries
489
-- a crystallized artifact for downstream execution
490
-
491
-Relevant files:
492
-
493
-- `refs/oh-my-codex/skills/deep-interview/SKILL.md`
494
-
495
-Loader currently has no equivalent. It either acts immediately or tries to self-nudge mid-flight.
496
-
497
-### 2. Planning is artifact-based and consensus-based
498
-
499
-`ralplan` is much more than “make a numbered list.”
500
-
501
-It includes:
502
-
503
-- Planner / Architect / Critic loops
504
-- max iteration handling
505
-- planning completion gates
506
-- PRD and test-spec artifacts
507
-- approved handoff into execution
508
-
509
-Relevant files:
510
-
511
-- `refs/oh-my-codex/skills/ralplan/SKILL.md`
512
-- `refs/oh-my-codex/src/ralplan/runtime.ts`
513
-- `refs/oh-my-codex/src/planning/artifacts.ts`
514
-
515
-Loader’s `Plan` object is fine as a local helper, but it is nowhere near this level of control.
516
-
517
-### 3. “Done” is a workflow contract in Ralph
518
-
519
-This is the single biggest lesson for Loader.
520
-
521
-Ralph encodes:
522
-
523
-- persistence until done
524
-- mandatory verification
525
-- architect verification
526
-- retry/fix loops
527
-- state transitions
528
-- explicit cleanup on completion
529
-- a final checklist
530
-
531
-Relevant file:
532
-
533
-- `refs/oh-my-codex/skills/ralph/SKILL.md`
534
-
535
-This directly addresses the exact Loader problems you named:
536
-
537
-- weak tool follow-through
538
-- finishing too early
539
-- spending too long in loops
540
-- poor task closure
541
-
542
-### 4. Workflow state lives outside the prompt
543
-
544
-OMX stores durable mode state under `.omx/` and exposes it through state tools.
545
-
546
-Relevant files:
547
-
548
-- `refs/oh-my-codex/src/modes/base.ts`
549
-- `refs/oh-my-codex/src/mcp/state-server.ts`
550
-- `refs/oh-my-codex/src/mcp/memory-server.ts`
551
-
552
-That means:
553
-
554
-- progress survives interruptions
555
-- execution can be resumed
556
-- handoffs are grounded
557
-- context can be audited
558
-- the model does not have to remember everything itself
559
-
560
-### 5. Memory and notepad are explicit tools
561
-
562
-OMX has project memory and a notepad.
563
-
564
-That sounds small, but it matters a lot for agent stability. It gives the system somewhere to store:
565
-
566
-- conventions
567
-- known build commands
568
-- temporary working notes
569
-- durable directives
570
-
571
-Relevant file:
572
-
573
-- `refs/oh-my-codex/src/mcp/memory-server.ts`
574
-
575
-Loader currently rediscovers too much per turn.
576
-
577
-### 6. Verification is standardized
578
-
579
-OMX has verification instructions that scale by task size and explicitly require evidence.
580
-
581
-Relevant file:
582
-
583
-- `refs/oh-my-codex/src/verification/verifier.ts`
584
-
585
-Loader has completion heuristics. OMX has verification policy.
586
-
587
-That is the difference between “the model sounded done” and “the system proved done.”
588
-
589
-### 7. Doctor / explore / sparkshell reduce prompt waste
590
-
591
-OMX distinguishes:
592
-
593
-- health checking (`doctor`)
594
-- lightweight read-only exploration (`explore`)
595
-- bounded shell-native inspection (`sparkshell`)
596
-
597
-That is smart.
598
-
599
-It keeps the main execution loop from becoming the only place everything happens.
600
-
601
-Relevant files:
602
-
603
-- `refs/oh-my-codex/src/cli/doctor.ts`
604
-- `refs/oh-my-codex/src/cli/explore.ts`
605
-- `refs/oh-my-codex/src/cli/sparkshell.ts`
606
-
607
-### 8. Follow-through is supported outside the agent context window
608
-
609
-The idle notifications, leader nudges, and continuation prompts in OMX are important.
610
-
611
-Relevant file:
612
-
613
-- `refs/oh-my-codex/src/scripts/notify-hook.ts`
614
-
615
-This is one of the deeper design differences:
616
-
617
-- Loader tries to keep the model on-task from inside the loop
618
-- OMX also nudges, monitors, and routes from outside the loop
619
-
620
-That is a more robust design.
621
-
622
-## Comparison matrix
623
-
624
-| Area | Loader today | `claw-code` | OMX lesson | Takeaway for Loader |
625
-|---|---|---|---|---|
626
-| Runtime loop | monolithic, heuristic-heavy | typed turn engine | separate mode/workflow from turn runtime | split Loader runtime first |
627
-| Tool surface | 6 basic tools | 49 exposed tool specs on main | tools should include workflow/state surfaces | add stateful and diagnostic tools |
628
-| Permissions | confirmation-only | permission policy + enforcer | safety belongs in runtime | add modes and boundaries |
629
-| Completion | heuristic continuation prompt | stronger runtime summaries | Ralph gives evidence-backed done gates | replace “maybe done” with explicit verification |
630
-| Planning | ephemeral numbered list | some plan surfaces | ralplan = persisted, reviewed planning | persist plan artifacts |
631
-| Memory/state | none | sessions + compaction + tracing | `.omx/` mode state + memory | add `.loader/` state dir |
632
-| Diagnostics | minimal | status/sandbox/doctor/session | doctor/explore/sparkshell | make Loader inspectable |
633
-| Testing | unit-heavy, no runtime harness | mock parity harness | workflow runtime is tested like product behavior | build scripted runtime tests |
634
-| Extensibility | none | hooks, plugins, MCP surfaces | workflow and notification hooks | add lifecycle hooks later |
635
-| Multi-agent | none | agent/team surfaces | team + ralph staffing | defer until solo runtime is trustworthy |
636
-
637
-## Why Loader’s current weaknesses produce the behavior you described
638
-
639
-### Poor tool use
640
-
641
-Root causes:
642
-
643
-- shallow tool surface
644
-- brittle prompt contract
645
-- native-vs-ReAct bifurcation
646
-- duplicated execution code paths
647
-- no typed runtime contract for tool results
648
-
649
-### Weak follow-through
650
-
651
-Root causes:
652
-
653
-- no persistent task state
654
-- no approved plan artifact
655
-- no explicit verification lane
656
-- no final completion checklist
657
-
658
-### Finishing early
659
-
660
-Root causes:
661
-
662
-- completion is heuristic
663
-- no required evidence model
664
-- no acceptance criteria artifact
665
-- no final “prove it” pass
666
-
667
-### Spending too long on simple tasks
668
-
669
-Root causes:
670
-
671
-- the runtime loop tries too many recoveries in one place
672
-- the system prompt does not distinguish task modes cleanly
673
-- there is no “lightweight inspect” lane like `explore`
674
-- the model often has to infer the workflow instead of being routed into one
675
-
676
-### Model sensitivity
677
-
678
-Root causes:
679
-
680
-- behavior is prompt-and-heuristic driven
681
-- capability detection is backend-specific and brittle
682
-- no workflow artifacts that survive model variance
683
-
684
-This is why copying OMX’s workflow ideas is so high leverage. It reduces how much we ask the model to improvise.
685
-
686
-## Concrete implementation targets
687
-
688
-These are ordered by impact on Loader behavior, not by code convenience.
689
-
690
-### Target 1: Introduce a real turn engine
691
-
692
-Goal:
693
-
694
-- replace the current giant loop with a smaller, typed conversation runtime
695
-
696
-Implementation target:
697
-
698
-- create a new `src/loader/runtime/` package
699
-- move message/session/tool-result logic out of `src/loader/agent/loop.py`
700
-- give tool results a first-class typed representation
701
-- unify native, ReAct, and extracted-tool execution through one executor path
702
-
703
-Why:
704
-
705
-- this is the foundation for every other improvement
706
-
707
-### Target 2: Add persistent Loader state under `.loader/`
708
-
709
-Goal:
710
-
711
-- make workflow state durable instead of prompt-only
712
-
713
-Implementation target:
714
-
715
-- `.loader/state/`
716
-- `.loader/sessions/`
717
-- `.loader/plans/`
718
-- `.loader/notepad.md`
719
-- `.loader/project-memory.json`
720
-
721
-Why:
722
-
723
-- Loader needs somewhere to store progress, acceptance criteria, and recovered knowledge
724
-
725
-### Target 3: Separate task modes
726
-
727
-Goal:
728
-
729
-- stop treating all requests like immediate tool-execution requests
730
-
731
-Implementation target:
732
-
733
-- mode router with at least:
734
-  - `clarify`
735
-  - `plan`
736
-  - `execute`
737
-  - `verify`
738
-
739
-Why:
740
-
741
-- this is the minimum structure needed to stop overthinking simple work and underthinking complex work
742
-
743
-### Target 4: Replace heuristic completion with an evidence-backed done contract
744
-
745
-Goal:
746
-
747
-- make completion explicit and testable
748
-
749
-Implementation target:
750
-
751
-- define a `DefinitionOfDone` object per task
752
-- require:
753
-  - acceptance criteria
754
-  - verification commands
755
-  - evidence summary
756
-  - zero pending task items
757
-
758
-Why:
759
-
760
-- this is the main fix for premature completion
761
-
762
-### Target 5: Add `deep-interview`-lite and `ralplan`-lite equivalents
763
-
764
-Goal:
765
-
766
-- pull ambiguity reduction and planning review out of the middle of execution
767
-
768
-Implementation target:
769
-
770
-- `clarify` mode writes a task brief
771
-- `plan` mode writes:
772
-  - a short implementation plan
773
-  - a test/verification plan
774
-
775
-Do not try to copy every OMX feature immediately. Copy the artifact discipline first.
776
-
777
-### Target 6: Build a real permission model
778
-
779
-Goal:
780
-
781
-- move from confirmation prompts to policy-based authorization
782
-
783
-Implementation target:
784
-
785
-- permission modes:
786
-  - `read-only`
787
-  - `workspace-write`
788
-  - `danger-full-access`
789
-- tool specs declare required permission
790
-- file writes enforce workspace boundaries
791
-- shell commands go through command classification
792
-
793
-Why:
794
-
795
-- this is both safety and behavior quality
796
-
797
-### Target 7: Harden file and shell tools
798
-
799
-Goal:
800
-
801
-- make tool use trustworthy enough for automation
802
-
803
-Implementation target:
804
-
805
-- size limits
806
-- binary detection
807
-- symlink/traversal protection
808
-- structured patch/diff return values
809
-- shell command semantics and mutability classification
810
-
811
-### Target 8: Add `loader doctor`, `loader status`, and `loader session`
812
-
813
-Goal:
814
-
815
-- make Loader operable as a product
816
-
817
-Implementation target:
818
-
819
-- backend health
820
-- model capability snapshot
821
-- workspace detection
822
-- write-access detection
823
-- test/build command detection
824
-- active session summary
825
-
826
-Why:
827
-
828
-- better operator feedback means less guesswork in the agent loop
829
-
830
-### Target 9: Add memory/notepad tools
831
-
832
-Goal:
833
-
834
-- give Loader durable short-term and long-term memory
835
-
836
-Implementation target:
837
-
838
-- read/write project memory
839
-- append working notes
840
-- store user directives and repo conventions
841
-
842
-Why:
843
-
844
-- this reduces re-discovery and improves follow-through across turns
845
-
846
-### Target 10: Add a lightweight read-only inspect lane
847
-
848
-Goal:
849
-
850
-- avoid using the full agent loop for every lookup
851
-
852
-Implementation target:
853
-
854
-- `loader explore` or equivalent internal mode
855
-- optimized for:
856
-  - file/symbol lookup
857
-  - pattern discovery
858
-  - relationship questions
859
-
860
-Why:
861
-
862
-- simple tasks should stay cheap and fast
863
-
864
-### Target 11: Add a parity harness
865
-
866
-Goal:
867
-
868
-- improve behavior intentionally instead of impressionistically
869
-
870
-Implementation target:
871
-
872
-- scripted mock backend scenarios for:
873
-  - simple read
874
-  - multi-tool turn
875
-  - denied permission
876
-  - write/edit success
877
-  - verification-required task
878
-  - premature completion rejection
879
-  - looped/duplicate action prevention
880
-
881
-Why:
882
-
883
-- this is how Loader becomes reliable
884
-
885
-### Target 12: Add workflow-aware prompts and capability profiles
886
-
887
-Goal:
888
-
889
-- make Loader less brittle across models
890
-
891
-Implementation target:
892
-
893
-- replace one generic system prompt with mode-specific prompts
894
-- add provider/model capability profiles:
895
-  - native tools
896
-  - streaming
897
-  - context budget
898
-  - preferred tool-call format
899
-  - verification strictness
900
-
901
-Why:
902
-
903
-- behavior should be shaped by runtime policy, not guessed from model substrings
904
-
905
-## Priority order
906
-
907
-This section was rewritten after a deeper validation pass against the actual code in `refs/claw-code` and `refs/oh-my-codex`, plus firsthand spot-checks of Loader's runtime. The deeper review confirmed every load-bearing claim in this report and surfaced one structural reorder: **the Definition-of-Done work is the user's actual pain point and should land before permission modes**, not after, because permissions are a safety win and DoD is the behavior win.
908
-
909
-### P0: Stabilize before changing behavior (Sprint 00)
910
-
911
-- write a failing regression test for the `tool_call_id` bug at `agent/loop.py:885,906` *first*, before any harness work — it proves the bug is real and proves the harness exists in one move
912
-- scope pytest discovery so `refs/` stops contaminating collection
913
-- exclude `refs/` from ruff and mypy too
914
-- make `uv run pytest` work out of the box
915
-- port the scenario taxonomy from `refs/claw-code/rust/crates/rusty-claude-cli/tests/mock_parity_harness.rs`
916
-- rewrite `README.md` (currently still says "FortranGoingOnForty")
917
-- baseline parity checklist for current runtime behavior
918
-
919
-### P1: Replace the loop with a real runtime (Sprint 01)
920
-
921
-- new `src/loader/runtime/` package with a typed turn engine
922
-- unify the native, ReAct, and "extracted JSON fallback" tool execution paths into one executor
923
-- fix the named bugs from Sprint 00's failing tests (`tool_call_id`, duplicate execution path)
924
-- replace substring-based `NATIVE_TOOL_MODELS`/`NO_TOOL_MODELS` model detection with a `runtime/capabilities.py` profile system — Loader needs to behave consistently across model choices
925
-- structured `TurnSummary` output
926
-
927
-### P2: The behavior fix the user actually asked for (Sprint 02)
928
-
929
-- `DefinitionOfDone` object per task: acceptance criteria, verification commands, evidence summary, pending/completed task items
930
-- explicit verify phase that runs the verification commands and gates completion on evidence
931
-- fix loop: verification failure returns to execution, not to final answer
932
-- minimum `.loader/` directory shape (`.loader/dod/`) — full session/memory layout deferred to Sprint 05
933
-
934
-This is the highest-leverage behavioral change in the entire plan and is the direct answer to "finishing too early" and "weak follow-through."
935
-
936
-### P3: Safety as policy, not as confirmation prompt (Sprint 03)
937
-
938
-- permission modes: `read-only`, `workspace-write`, `danger-full-access`
939
-- three-event tool lifecycle hooks (`pre_tool_use`, `post_tool_use`, `post_tool_use_failure`) modeled directly on `refs/claw-code/rust/crates/runtime/src/hooks.rs`
940
-- refactor `safeguards.py` (duplicate detection, validation, rollback) into pre-tool hook implementations rather than ad-hoc method calls
941
-- file operation hardening (workspace boundary, symlink, size limits, binary detection, structured patches)
942
-- shell operation hardening
943
-- expose active mode in CLI/TUI status
944
-
945
-Hooks land alongside permissions because every later sprint hangs new behavior (verification, validation, observability) on the same lifecycle.
946
-
947
-### P4: Stop improvising one workflow for everything (Sprint 04)
948
-
949
-- mode router: clarify, plan, execute, verify (verify already exists from Sprint 02)
950
-- clarify artifact written to `.loader/briefs/`
951
-- planning artifacts (implementation plan + verification plan) written to `.loader/plans/` and fed into the existing DoD object
952
-- tool prerequisites pulled forward from Sprint 06: `TodoWrite` (the "zero pending tasks" gate is empty without it) and `AskUserQuestion` (clarify rounds)
953
-
954
-### P5: Durable continuity (Sprint 05)
955
-
956
-- full `.loader/` state directory under the layout already started in Sprint 02
957
-- session persistence and resume
958
-- transcript compaction with priority-aware summarization (model the design on `refs/claw-code/rust/crates/runtime/src/summary_compression.rs`)
959
-- memory/notepad surfaces
960
-- usage/cost tracking
961
-
962
-### P6: Operability and tool-surface expansion (Sprint 06)
963
-
964
-- `loader doctor`, `loader status`, `loader session`
965
-- read-only explore lane
966
-- broader tool surface (diff/patch-aware editing, git helpers, structured ask-user, etc.) — `TodoWrite` and `AskUserQuestion` already exist from Sprint 04
967
-
968
-### Deferred indefinitely
969
-
970
-- workflow hooks beyond the runtime tool lifecycle (notification/idle nudges, leader monitoring)
971
-- task/team/subagent orchestration
972
-- broad MCP ecosystem
973
-- richer plugin systems
974
-
975
-These are real wins in `claw-code`/OMX, but Loader should not pursue them until the solo runtime is trustworthy.
976
-
977
-## What Loader should copy directly, and what it should not
978
-
979
-### Copy directly
980
-
981
-- typed turn runtime
982
-- permission model
983
-- file/shell hardening
984
-- session persistence
985
-- compaction
986
-- doctor/status/session surfaces
987
-- workflow artifacts
988
-- evidence-backed verification
989
-- parity harness discipline
990
-
991
-### Copy in simplified form
992
-
993
-- deep-interview
994
-- ralplan
995
-- ralph
996
-- memory/notepad
997
-- explore vs full-execution split
998
-
999
-### Do not copy blindly yet
1000
-
1001
-- full tmux/team runtime
1002
-- huge command surface
1003
-- Discord/openclaw notification stack
1004
-- broad MCP ecosystem
1005
-
1006
-Loader should first become a trustworthy single-agent local runtime. After that, team orchestration will actually help.
1007
-
1008
-## Recommended Loader architecture direction
1009
-
1010
-If we want behavior closer to `claw-code` without losing Loader’s simplicity, I would steer toward:
1011
-
1012
-### Layer 1: Runtime core
1013
-
1014
-- typed `TurnRuntime`
1015
-- `SessionStore`
1016
-- `PermissionPolicy`
1017
-- `ToolExecutor`
1018
-- `VerificationEngine`
1019
-
1020
-### Layer 2: Workflow layer
1021
-
1022
-- `ClarifyWorkflow`
1023
-- `PlanWorkflow`
1024
-- `ExecuteWorkflow`
1025
-- `VerifyWorkflow`
1026
-
1027
-### Layer 3: Product surfaces
1028
-
1029
-- TUI
1030
-- CLI
1031
-- `doctor`
1032
-- `status`
1033
-- `session`
1034
-- `explore`
1035
-
1036
-### Layer 4: Optional future orchestration
1037
-
1038
-- hooks
1039
-- background verification
1040
-- multi-agent/task orchestration
1041
-
1042
-That is a better fit for Loader than trying to clone all of OMX wholesale.
1043
-
1044
-## Immediate conclusions
1045
-
1046
-1. Loader’s biggest problems are architectural, not just prompt-related.
1047
-2. `claw-code` is strongest where Loader is weakest: runtime contract, permissions, sessions, diagnostics, parity.
1048
-3. OMX is strongest where Loader is currently almost absent: clarification, planning discipline, durable state, completion/verification loops.
1049
-4. The fastest path to “better model behavior today” is not adding more heuristics. It is adding:
1050
-   - workflow artifacts
1051
-   - explicit verification
1052
-   - persistent state
1053
-   - a smaller, more trustworthy turn engine
1054
-
1055
-## Sprint scaffolding
1056
-
1057
-After the deeper validation pass the original five-sprint plan was reshaped into seven sprints. The reshape splits the most ambitious sprint (the old Sprint 03, which bundled mode router + clarify + plan + DoD + verify/fix into one) and reorders so the user's actual pain point lands sooner. Sprint scaffolding lives under:
1058
-
1059
-- `.docs/sprints/index.md`
1060
-- `.docs/sprints/sprint00.md` — Foundation, Measurement, and Parity Harness
1061
-- `.docs/sprints/sprint01.md` — Turn Engine, Tool Contract, and Capability Profiles
1062
-- `.docs/sprints/sprint02.md` — Definition of Done and Verify/Fix Loop
1063
-- `.docs/sprints/sprint03.md` — Permission Modes and Tool Lifecycle Hooks
1064
-- `.docs/sprints/sprint04.md` — Mode Router, Clarify, and Plan Artifacts
1065
-- `.docs/sprints/sprint05.md` — Session State, Memory, and Compaction
1066
-- `.docs/sprints/sprint06.md` — Doctor, Explore, Status, and Tool Surface Expansion
1067
-
1068
-## Recommended next move
1069
-
1070
-Start with Sprint 00, and start Sprint 00 with the failing regression test.
1071
-
1072
-Reason:
1073
-
1074
-- Loader needs a measurable baseline and a safer runtime before adding more behavior
1075
-- the `tool_call_id` bug at `agent/loop.py:885,906` is proof that untested code paths are silently broken
1076
-- writing the failing test first proves both the bug and the harness in one move
1077
-- otherwise every feature sprint will be built on unstable agent semantics
1078
-
1079
-The execution phase should then be:
1080
-
1081
-1. lock down the runtime and test harness (Sprint 00)
1082
-2. replace the loop with a typed runtime and capability profiles (Sprint 01)
1083
-3. define and enforce the completion contract (Sprint 02)
1084
-4. add the policy-based safety layer with hooks (Sprint 03)
1085
-5. add workflow modes and planning artifacts on top (Sprint 04)
1086
-6. then widen the durability and product surfaces (Sprints 05 and 06)
1087
-
1088
-## Plan adjustments after deeper review
1089
-
1090
-The following changes were applied to the original report after a firsthand validation pass against the actual code in `refs/claw-code` and `refs/oh-my-codex`, plus spot-checks of Loader's runtime.
1091
-
1092
-### Verified directly against the code
1093
-
1094
-- **`tool_call_id` bug confirmed at `src/loader/agent/loop.py:885` and `:906`.** Both call sites construct `Message(role=Role.TOOL, content=..., tool_call_id=tool_call.id)`, but `Message` (`src/loader/llm/base.py:33-39`) has no such field. They live on the duplicate-suppression and pre-validation branches and would crash on first execution. Zero integration coverage.
1095
-- **Pytest discovery is broken by default.** `uv run pytest --collect-only` picks up `refs/claw-code/tests/test_porting_workspace.py` and fails to import `loader` because there is no `tool.pytest.ini_options` block in `pyproject.toml`.
1096
-- **Loop monolith confirmed by line counts.** `agent/loop.py` is 1929 LOC, `agent/reasoning.py` is 1196, `agent/safeguards.py` is 1079 — roughly 4200 lines of orchestration in one cluster.
1097
-- **claw-code's `run_turn()` shape** is exactly as the report describes. Read directly at `refs/claw-code/rust/crates/runtime/src/conversation.rs:295-470`. Typed message build → tool extraction → pre-hook → permission check → execute → post-hook (success or failure variant) → typed `ConversationMessage::tool_result()` → push → repeat. ~175 lines of clean code.
1098
-- **claw-code permission modes** are `ReadOnly` / `WorkspaceWrite` / `DangerFullAccess` (plus `Prompt` and `Allow`), defined at `refs/claw-code/rust/crates/runtime/src/permissions.rs:8-27`. The 10MB read/write caps, binary detection, workspace boundary check, and structured patch outputs in `file_ops.rs` are all real.
1099
-- **claw-code hooks** are `PreToolUse` / `PostToolUse` / `PostToolUseFailure`, defined at `refs/claw-code/rust/crates/runtime/src/hooks.rs:19-34` and wired into the conversation loop at lines 371, 427-453.
1100
-- **OMX skills are real and even more rigorous than the report described.** `ralplan` enforces a max-5-iteration Critic loop with sequential Architect→Critic ordering. `ralph` has explicit phase enums (`starting`/`executing`/`verifying`/`fixing`/`complete`/`failed`/`cancelled`) persisted via `state_write` to `.omx/state/{mode}-state.json`. The verifier in `src/verification/verifier.ts` scales by task size with concrete file-count thresholds.
1101
-
1102
-### Corrected facts
1103
-
1104
-- **Tool count: 49, not 40.** `refs/claw-code/rust/crates/tools/src/lib.rs` exposes 49 `ToolSpec` entries in `mvp_tool_specs()`. Doesn't change the lesson, but worth knowing.
1105
-- **claw-code permissions have a third layer.** Beyond `PermissionMode` and per-tool requirements, `PermissionPolicy` carries three rule lists (`allow_rules`, `deny_rules`, `ask_rules`) for context-specific overrides. Loader can land the mode layer first and defer the rule layer.
1106
-- **claw-code summary compression is sophisticated.** It's not message-level truncation — it's line-level prioritization with deduplication and budget enforcement at `refs/claw-code/rust/crates/runtime/src/summary_compression.rs`. Sprint 05 should model on this rather than reinventing.
1107
-
1108
-### Structural plan changes
1109
-
1110
-- **The old Sprint 03 was split.** It bundled mode router + clarify + plan + DoD + verify/fix into one sprint, which is essentially "ralplan + ralph + verifier" simultaneously. The DoD/verify-fix half became the new Sprint 02 (highest-leverage behavioral fix). The mode router / clarify / plan half became the new Sprint 04.
1111
-- **The old Sprint 02 (permissions) became the new Sprint 03** and was reordered to land *after* DoD. Permissions are a safety win, not a behavior win, and the user's actual complaints are about behavior. DoD lands first.
1112
-- **Hooks landed in the same sprint as permissions.** The original plan split them across sprints; that creates rework because every later runtime addition (verification, observability, validation) wants the same lifecycle. Sprint 03 owns both.
1113
-- **Capability profiles became a Sprint 01 deliverable.** They were Target 12 in the original report and orphaned from the sprint plan. They belong in the runtime layer and are critical for the user's "behave consistently across model choices" goal.
1114
-- **The minimum `.loader/` directory shape moves to Sprint 02** (just `.loader/dod/`). The full session/memory/compaction layout stays in Sprint 05. This unblocks Sprint 02 and Sprint 04 from waiting on Sprint 05.
1115
-- **`TodoWrite` and `AskUserQuestion` move from Sprint 06 to Sprint 04** as prerequisites for the clarify mode and the "zero pending tasks" gate. The broad tool-surface expansion stays in Sprint 06.
1116
-- **Sprint 00's first deliverable is now the failing regression test** for the `tool_call_id` bug, before any harness work. It proves the bug and proves the harness exist in one move.
.docs/audit_sprints/index.mddeleted
@@ -1,82 +0,0 @@
1
-# Loader Audit Cleanup Sprint Index
2
-
3
-These sprints translate the 2026-04-07 audit in `.docs/audit.txt` into a post-Sprint-08 cleanup plan that is explicitly about deleting contract debt, not wrapping it in more helpers.
4
-
5
-The repo has moved since the audit snapshot. On this planning branch:
6
-
7
-- `uv run pytest -q` is green with `226 passed`
8
-- Sprint 08's prompt builder, turn-phase tracking, and permission inspection surfaces are already present on `HEAD`
9
-- Sprint 09 interactive validation has started; `loader doctor` now distinguishes metadata reachability from live chat readiness, and both native-capable and `json_tag` Ollama lanes currently fail the live chat probe on `/api/chat` with HTTP 500
10
-- Sprint 10's runtime-ownership inversion is now materially in place: `src/loader/runtime/` no longer reaches into `Agent` directly, and the remaining legacy dependencies are explicit `RuntimeLegacyServices` seams
11
-- Sprint 11 has already deleted several puppet behaviors and collapsed the raw-text fallback stack onto the shared parser used by the runtime and Ollama text fallback paths
12
-- Sprint 13 has now finished the runtime-side retirement work:
13
-  - `src/loader/runtime/` has `0` direct imports from `agent/*`
14
-  - shared safeguard, rollback, recovery, parsing, reasoning-type, and task-classification surfaces now live under `src/loader/runtime/`
15
-  - the remaining debt is narrower:
16
-    - workflow modes are honestly scoped, but the refs' deeper protocol and routing discipline are still absent
17
-    - `agent/reasoning.py` and `agent/safeguards.py` still hold some real legacy behavior behind explicit seams
18
-    - interactive validation against a healthy live backend is still blocked on Ollama `/api/chat` failures
19
-
20
-## Sprint 09 Ownership Baseline
21
-
22
-- `src/loader/runtime/conversation.py`: 881 lines, `49` `self.agent.` reach-ins
23
-- `src/loader/runtime/assistant_turns.py`: `23` `self.agent.` reach-ins
24
-- `src/loader/runtime/tool_batches.py`: `26` `self.agent.` reach-ins
25
-- `src/loader/runtime/completion_policy.py`: `9` `self.agent.` reach-ins
26
-- `src/loader/runtime/repair.py`: `4` `self.agent.` reach-ins
27
-- `src/loader/runtime/explore.py`: `12` `self.agent.` reach-ins
28
-- `src/loader/agent/loop.py`: 1108 lines
29
-- `src/loader/agent/reasoning.py`: 1235 lines
30
-- `src/loader/agent/safeguards.py`: 1142 lines
31
-- `src/loader/agent/recovery.py`: 648 lines
32
-
33
-## Current runtime ownership status
34
-
35
-- `src/loader/runtime/` direct `self.agent.` reach-ins: `0`
36
-- runtime ownership now flows through `RuntimeContext` plus explicit `RuntimeLegacyServices` adapters
37
-- `src/loader/runtime/` direct imports from `agent/*`: `0`
38
-- the runtime-side ownership migration phase is complete; the remaining work is closure reporting and any future follow-on deletion beyond the runtime package
39
-
40
-## Current legacy-tree status
41
-
42
-- `src/loader/agent/loop.py`: `721` lines, down `390` lines from the Sprint 09 baseline of `1111`
43
-- `src/loader/agent/reasoning.py`: `649` lines, down `586` lines from the Sprint 09 baseline of `1235`
44
-- `src/loader/agent/safeguards.py`: `595` lines, down `547` lines from the Sprint 09 baseline of `1142`
45
-- `src/loader/agent/recovery.py`: `12` lines, down `636` lines from the Sprint 09 baseline of `648`
46
-- shared raw-text parsing now runs through `src/loader/runtime/parsing.py`; `src/loader/agent/parsing.py` is compatibility-only and the stale `_extract_raw_json_tool_calls(...)` fallback is gone from `src/loader/agent/loop.py`
47
-- deleted assistant-puppeting behaviors:
48
-  - post-action follow-up suffix
49
-  - first-turn `[` prefill trick
50
-  - fake-tool narration scolding
51
-  - deflection repair prompts
52
-  - self-critique reroute
53
-  - non-mutating completion nudge
54
-  - text-loop bailout
55
-  - action-loop bailout on successful repeated tool patterns
56
-- empty-output handling is now one honest retry followed by explicit failure instead of five fake assistant continuation prompts
57
-- the remaining debt is no longer parser fragmentation or hidden runtime ownership; it is the narrower set of explicit legacy callbacks, streamed safeguards, workflow-depth gaps, and still-blocked live backend validation
58
-
59
-## Phase 1: Validate Before Deleting
60
-
61
-- [Sprint 09](sprint09.md) — Interactive Validation, Baselines, and Guardrails
62
-
63
-## Phase 2: Tighten the Runtime Contract
64
-
65
-- [Sprint 10](sprint10.md) — Runtime Context and Ownership Inversion
66
-- [Sprint 11](sprint11.md) — Recovery Deletion and Tool Parsing Unification
67
-
68
-## Phase 3: Finish the Behavioral Cleanup
69
-
70
-- [Sprint 12](sprint12.md) — Workflow Protocol Hardening and Decomposition Decision
71
-- [Sprint 13](sprint13.md) — Legacy Runtime Retirement and Safeguard Refactor
72
-- [Sprint 13 Closure](sprint13_closure.md) — Final scoreboard, remaining debt, and audit-closure honesty pass
73
-- [Trunk Sitrep](trunk_sitrep.md) — Divergence snapshot and integration recommendation against current `trunk`
74
-
75
-## Working principles
76
-
77
-- Prefer deletion over relocation. A sprint that only moves code around is not done.
78
-- Each discrete fix gets its own commit. Do not batch unrelated parser, workflow, safety, and runtime-contract changes into one "cleanup" commit.
79
-- Pair each behavior change with deterministic coverage or a committed interactive validation artifact.
80
-- Keep `.docs/PARITY.md` and any residual-debt notes honest as claims change.
81
-- Stop at the sprint boundary. If a sprint uncovers a larger follow-on job, create the next sprint or artifact instead of expanding the current one mid-flight.
82
-- Treat interactive testing as first-class evidence. The parity harness is necessary, but it is not enough for deciding which recovery layers are truly load-bearing.
.docs/audit_sprints/sprint09.mddeleted
@@ -1,99 +0,0 @@
1
-# Sprint 09: Interactive Validation, Baselines, and Guardrails
2
-
3
-## Prerequisites
4
-
5
-Sprint 08
6
-
7
-## Goals
8
-
9
-Turn the audit from a strong paper read into an executable cleanup baseline.
10
-
11
-Before deleting recovery layers, Loader needs two things the repo does not have yet:
12
-
13
-- direct evidence from real interactive runs across at least one native-tool path and one raw-text/smaller-model path
14
-- guardrail coverage for the specific parser/runtime blind spots the audit surfaced
15
-
16
-This sprint is intentionally narrow. It should produce evidence, close the known stale allowlist bug, and leave the larger contract surgery to later sprints.
17
-
18
-## Deliverables
19
-
20
-### 1. Interactive validation matrix
21
-
22
-Run a fixed task matrix against at least:
23
-
24
-- one native-tool-capable backend/profile
25
-- one smaller or raw-text-prone backend/profile
26
-
27
-Capture for each run:
28
-
29
-- task prompt
30
-- active capability profile
31
-- whether native tools or raw-text fallback fired
32
-- phase trace
33
-- final response quality
34
-- verification outcome
35
-- which recovery layers fired and whether they helped or harmed
36
-
37
-Persist the report under `.docs/audit_sprints/` so later deletion sprints can cite real evidence instead of memory.
38
-
39
-### 2. Parser blind-spot guardrails
40
-
41
-Close the audit's most concrete regression before deeper parser work:
42
-
43
-- add deterministic coverage for raw-text recovery of `TodoWrite`
44
-- add deterministic coverage for raw-text recovery of `patch`
45
-- add deterministic coverage for at least one newer workflow/operator tool such as `AskUserQuestion`
46
-- stop hardcoding the six-tool allowlist in `agent/loop.py`
47
-
48
-This sprint does not need the final parser architecture yet. It does need to stop widening the gap every time the registry grows.
49
-
50
-### 3. Contract baseline artifacts
51
-
52
-Record the current cleanup baseline in a committed artifact:
53
-
54
-- `self.agent.` reach-in counts across `src/loader/runtime/`
55
-- line counts for the major legacy files under `src/loader/agent/`
56
-- inventory of every remaining recovery behavior, including:
57
-  - owner
58
-  - trigger
59
-  - dependency
60
-  - current test coverage
61
-  - proposed disposition: `delete`, `gate`, or `keep`
62
-
63
-This inventory becomes the checklist for Sprint 11 and Sprint 13.
64
-
65
-### 4. Deletion criteria
66
-
67
-Write down the rules for what survives:
68
-
69
-- default to delete unless the interactive matrix shows the recovery path is materially load-bearing
70
-- if a behavior stays, tie it to an explicit capability-profile condition or session-level guardrail
71
-- forbid new fake-assistant continuation text unless a future sprint explicitly re-approves it with evidence
72
-
73
-## Commit slicing
74
-
75
-- one commit for the interactive validation artifact and baseline metrics
76
-- one commit for each new raw-parser regression test cluster
77
-- one commit for the allowlist fix or registry-derived fallback guard
78
-- one commit for the recovery inventory / disposition table
79
-
80
-## Testing strategy
81
-
82
-- `uv run pytest -q`
83
-- targeted runtime/parity coverage for raw-text recovery of newer tools
84
-- at least one committed interactive validation report from a native-tool profile
85
-- at least one committed interactive validation report from a smaller/raw-text-prone profile
86
-
87
-## Definition of done
88
-
89
-- Loader has committed interactive evidence for the contract discussion instead of only paper analysis
90
-- the stale six-tool raw-extraction regression is closed and covered
91
-- every remaining recovery layer has an owner, a disposition, and a later sprint target
92
-- the repo has a stable baseline for reach-ins, line counts, and runtime heuristics before larger deletion work starts
93
-
94
-## Explicitly out of scope
95
-
96
-- introducing `RuntimeContext`
97
-- deleting multiple recovery layers at once
98
-- redesigning clarify/plan workflows
99
-- refactoring `agent/safeguards.py` in bulk
.docs/audit_sprints/sprint09_baseline.mddeleted
@@ -1,108 +0,0 @@
1
-# Sprint 09 Baseline and Recovery Inventory
2
-
3
-## Snapshot
4
-
5
-- Branch: `cleanup-audit-plan`
6
-- Worktree: `/tmp/loader-audit-cleanup`
7
-- Source snapshot for this baseline: `5c10aab` (`Harden raw tool-call fallback coverage`)
8
-- Targeted verification completed before writing this baseline:
9
-  - `uv run pytest -q tests/test_parsing.py`
10
-  - `uv run pytest -q tests/test_runtime_harness.py -k 'raw_json or native_and_raw_tool_paths_share_executor_trace or runtime_parity_manifest_matches_implemented_cases'`
11
-- Repo-wide verification after the latest Sprint 09 characterization update:
12
-  - `uv run pytest -q` → `191 passed`
13
-- Doctor surface update:
14
-  - `f60523c` adds a dedicated live chat probe so `loader doctor` now reports `backend` reachability separately from `/api/chat` readiness
15
-- New Sprint 09 guardrails now cover raw-text recovery for:
16
-  - `read`
17
-  - `TodoWrite`
18
-  - `patch`
19
-  - `AskUserQuestion`
20
-
21
-## Runtime Ownership Baseline
22
-
23
-| File | Lines | `self.agent.` reach-ins |
24
-| --- | ---: | ---: |
25
-| `src/loader/runtime/conversation.py` | 881 | 49 |
26
-| `src/loader/runtime/assistant_turns.py` | 155 | 23 |
27
-| `src/loader/runtime/tool_batches.py` | 372 | 26 |
28
-| `src/loader/runtime/finalization.py` | 339 | 8 |
29
-| `src/loader/runtime/completion_policy.py` | 187 | 9 |
30
-| `src/loader/runtime/repair.py` | 208 | 4 |
31
-| `src/loader/runtime/explore.py` | 220 | 12 |
32
-
33
-This is the migration scoreboard for Sprint 10. The goal is not only to move code around, but to remove `Agent` as the runtime's implicit data model.
34
-
35
-## Legacy Tree Baseline
36
-
37
-| File | Lines |
38
-| --- | ---: |
39
-| `src/loader/agent/loop.py` | 1111 |
40
-| `src/loader/agent/reasoning.py` | 1235 |
41
-| `src/loader/agent/safeguards.py` | 1142 |
42
-| `src/loader/agent/recovery.py` | 648 |
43
-
44
-This is the subtraction scoreboard for Sprint 11 and Sprint 13.
45
-
46
-## Sprint 11 Progress Against This Baseline
47
-
48
-- `src/loader/agent/loop.py`: `1111` -> `926` (`-185`)
49
-- `src/loader/runtime/conversation.py`: `881` -> `828` (`-53`)
50
-- `src/loader/runtime/repair.py`: `208` -> `146` (`-62`)
51
-- `src/loader/runtime/completion_policy.py`: `187` -> `182` (`-5`)
52
-- `src/loader/agent/parsing.py`: `182` -> `276`
53
-  - this file grew because it is now the shared owner of raw-text parsing after Sprint 11 deleted the legacy loop extractor
54
-- parser-contract status:
55
-  - `_extract_raw_json_tool_calls(...)` has been deleted from `src/loader/agent/loop.py`
56
-  - `src/loader/runtime/repair.py`, `src/loader/runtime/explore.py`, and `src/loader/llm/ollama.py` now converge on the shared parser
57
-- deleted behaviors relative to this inventory:
58
-  - prefill trick
59
-  - fake-tool narration repair
60
-  - deflection repair
61
-  - post-action follow-up suffix
62
-- tightened behavior relative to this inventory:
63
-  - empty-response handling is now one honest retry plus explicit failure instead of five fake assistant continuation prompts
64
-  - self-critique reroute, text-loop bailout, non-mutating completion nudge, and action-loop bailout have all been deleted from the runtime turn path
65
-- still open relative to this inventory:
66
-  - no remaining inline completion/critique bailout from this inventory survives in the runtime turn path
67
-
68
-## Recovery Inventory
69
-
70
-| Behavior | Current owner | Trigger | Dependency | Current coverage | Proposed disposition |
71
-| --- | --- | --- | --- | --- | --- |
72
-| Prefill trick | `src/loader/runtime/conversation.py:142-161` | First iteration, single user message, action-keyword heuristic | Direct session write of fake assistant `[` | `tests/test_runtime_repair_flows.py::test_fresh_agent_messages_are_disconnected_from_session_history` | Delete. Fresh sessions currently keep `agent.messages` disconnected from `session.messages`, so this gate is already stale in practice. |
73
-| Empty-output retry prompts | `src/loader/runtime/repair.py:43-76` via `conversation.py:192-211` | Assistant content is empty up to `max_empty_retries=5` | Five fake assistant continuation prompts | `tests/test_runtime_repair_flows.py::test_empty_response_repair_injects_retry_prompt_and_recovers` | Delete or reduce to one bounded retry with honest failure |
74
-| Raw-text tool fallback | `src/loader/runtime/repair.py:101-125` plus `src/loader/agent/parsing.py` and legacy `src/loader/agent/loop.py:862-1111` | Native tool call list is empty but response contains tool syntax | Parser stack, capability-profile behavior, legacy extractor | `tests/test_parsing.py`, `tests/test_runtime_harness.py` raw JSON scenarios | Keep short-term, gate by capability profile, unify in Sprint 11 |
75
-| Fake-tool narration repair | `src/loader/runtime/repair.py:156-182` plus `src/loader/agent/loop.py:770-860` | `_contains_unexecuted_code(...)` matches narration or code-block heuristics | Legacy regex wall plus injected scolding prompt | `tests/test_runtime_repair_flows.py::test_fake_tool_narration_repair_injects_scolding_prompt` | Delete |
76
-| Deflection repair | `src/loader/runtime/repair.py:184-201` | Non-ReAct response deflects with "you can/should/could/try running" and no actions taken | Phrase heuristic plus injected user repair turn | `tests/test_runtime_repair_flows.py::test_deflection_repair_injects_use_your_tools_prompt` | Delete unless interactive evidence shows it is load-bearing |
77
-| Self-critique reroute | Deleted in Sprint 11 (was `src/loader/runtime/completion_policy.py`) | Long response and `should_self_critique(...)` said revise | `agent._self_critique`, reasoning prompt, session reinjection | `tests/test_runtime_repair_flows.py::test_long_code_response_no_longer_reroutes_for_self_critique` | Deleted |
78
-| Text-loop bailout | Deleted in Sprint 11 (was `src/loader/runtime/completion_policy.py`) | `self.agent.safeguards.detect_text_loop(...)` reported repetition | `agent/safeguards.py` action tracker | `tests/test_runtime_repair_flows.py::test_non_mutating_completion_returns_directly_without_text_bailout` | Deleted |
79
-| Non-mutating completion nudge | Deleted in Sprint 11 (was `src/loader/runtime/completion_policy.py`) | `completion_check` enabled, no mutating actions, `detect_premature_completion(...)` hit | `agent/reasoning.py` continuation heuristics plus session reinjection | `tests/test_runtime_harness.py::test_non_mutating_completion_no_longer_forces_continuation` | Deleted |
80
-| Post-action follow-up suffix | `src/loader/runtime/completion_policy.py:173-186` | Actions were taken and final text does not already end in `?` | Pure string heuristic | `tests/test_runtime_repair_flows.py::test_post_action_follow_up_suffix_is_appended_to_final_response` | Delete |
81
-| Action-loop bailout | Deleted in Sprint 11 (was `src/loader/runtime/tool_batches.py`) | `self.agent.safeguards.detect_loop()` reported repeated tool behavior | `agent/safeguards.py` action tracker | `tests/test_runtime_repair_flows.py::test_repeated_tool_pattern_no_longer_triggers_action_loop_bailout` | Deleted |
82
-
83
-## Interactive Validation Matrix
84
-
85
-These runs are still pending. They require at least one configured real native-tool backend and one configured raw-text-prone backend.
86
-
87
-| Lane | Task | Why it matters | Status |
88
-| --- | --- | --- | --- |
89
-| Native-tool lane | Read a file, then write a small file and let DoD verify it | Confirms the runtime can stay on the normal native path without repair machinery stepping in | Blocked: `loader doctor` now reports `backend: pass` but `chat: fail`, and live `/api/chat` still fails with HTTP 500 before turn execution. See `sprint09_interactive_validation_native.md`. |
90
-| Native-tool lane | Ambiguous request that routes through clarify mode | Measures whether current clarify behavior is helpful or just extra prompt text | Blocked behind the same native-lane `/api/chat` failure. |
91
-| Native-tool lane | Multi-step implementation that uses `TodoWrite` and verification | Measures if completion behavior stays disciplined without fake continuations | Blocked behind the same native-lane `/api/chat` failure. |
92
-| Raw-text-prone lane | Recover `read`, `patch`, `TodoWrite`, and `AskUserQuestion` from raw JSON/text | Confirms which raw fallback paths are still load-bearing after the new guardrails | Blocked: `loader doctor` now reports `backend: pass` but `chat: fail`, and live `/api/chat` fails with HTTP 500 before any raw-text output is produced. See `sprint09_interactive_validation_raw_text.md`. |
93
-| Raw-text-prone lane | Prompt that tends to elicit narrated fake tool use | Measures whether fake-tool repair actually saves the run or just churns the conversation | Blocked behind the same raw-text-lane `/api/chat` failure. |
94
-| Raw-text-prone lane | Prompt that tends to return empty or deflective text | Measures whether empty-output and deflection repairs help enough to justify keeping them | Blocked behind the same raw-text-lane `/api/chat` failure. |
95
-
96
-Use [sprint09_interactive_validation.md](sprint09_interactive_validation.md) as the capture format for each completed run set.
97
-
98
-Completed captures:
99
-
100
-- [Native lane](sprint09_interactive_validation_native.md)
101
-- [Raw-text-prone lane](sprint09_interactive_validation_raw_text.md)
102
-
103
-## Immediate Sprint 09 Follow-on
104
-
105
-- Use the `191 passed` repo-wide baseline as the regression floor for the next Sprint 09 slices.
106
-- Restore a working live chat backend, then rerun the interactive validation matrix against the documented native and raw-text-prone model lanes.
107
-- Keep `191 passed` as the regression floor for any Sprint 10 runtime-seam work that begins before the backend is healthy again.
108
-- Use this inventory as the checklist for Sprint 10 service seams and Sprint 11 deletions.
.docs/audit_sprints/sprint09_interactive_validation.mddeleted
@@ -1,96 +0,0 @@
1
-# Sprint 09 Interactive Validation Template
2
-
3
-Use this template for each real-backend validation run in Sprint 09. The goal is to turn the audit's architectural concerns into concrete runtime evidence that later deletion sprints can cite.
4
-
5
-Create one filled copy per backend/profile pair, or append multiple runs under the same backend heading if the profile is stable.
6
-
7
-## Backend Summary
8
-
9
-- Date:
10
-- Operator:
11
-- Branch:
12
-- Commit:
13
-- Backend:
14
-- Model:
15
-- Capability profile:
16
-- Native tools enabled:
17
-- Streaming enabled:
18
-- Permission mode:
19
-- Workflow override:
20
-
21
-## Run 1
22
-
23
-- Task:
24
-- Expected productive path:
25
-- Expected risky heuristics:
26
-- Actual outcome:
27
-- Verification outcome:
28
-- Final response quality:
29
-
30
-### Evidence
31
-
32
-- Tool path used:
33
-- Phase trace:
34
-- Recovery layers fired:
35
-- Session/runtime notes:
36
-
37
-### Assessment
38
-
39
-- Did the runtime help or get in the way?
40
-- Which heuristics were load-bearing?
41
-- Which heuristics felt like churn?
42
-- Proposed disposition updates:
43
-
44
-## Run 2
45
-
46
-- Task:
47
-- Expected productive path:
48
-- Expected risky heuristics:
49
-- Actual outcome:
50
-- Verification outcome:
51
-- Final response quality:
52
-
53
-### Evidence
54
-
55
-- Tool path used:
56
-- Phase trace:
57
-- Recovery layers fired:
58
-- Session/runtime notes:
59
-
60
-### Assessment
61
-
62
-- Did the runtime help or get in the way?
63
-- Which heuristics were load-bearing?
64
-- Which heuristics felt like churn?
65
-- Proposed disposition updates:
66
-
67
-## Run 3
68
-
69
-- Task:
70
-- Expected productive path:
71
-- Expected risky heuristics:
72
-- Actual outcome:
73
-- Verification outcome:
74
-- Final response quality:
75
-
76
-### Evidence
77
-
78
-- Tool path used:
79
-- Phase trace:
80
-- Recovery layers fired:
81
-- Session/runtime notes:
82
-
83
-### Assessment
84
-
85
-- Did the runtime help or get in the way?
86
-- Which heuristics were load-bearing?
87
-- Which heuristics felt like churn?
88
-- Proposed disposition updates:
89
-
90
-## Summary
91
-
92
-- Most useful runtime behaviors:
93
-- Least useful runtime behaviors:
94
-- Recovery layers that should likely be deleted:
95
-- Recovery layers that should likely be gated by capability profile:
96
-- Follow-up code or test tasks:
.docs/audit_sprints/sprint09_interactive_validation_native.mddeleted
@@ -1,89 +0,0 @@
1
-# Sprint 09 Interactive Validation — Native Lane
2
-
3
-## Backend Summary
4
-
5
-- Date: 2026-04-07
6
-- Operator: Codex
7
-- Branch: `cleanup-audit-plan`
8
-- Capture base: `f60523c` (`Probe live chat health in doctor`)
9
-- Backend: `ollama`
10
-- Models exercised:
11
-  - `qwen2.5:7b`
12
-  - `qwen2.5:14b`
13
-- Capability profile:
14
-  - `qwen2.5:7b` → `native`
15
-  - `qwen2.5:14b` → `native`
16
-- Native tools enabled: yes
17
-- Streaming enabled: yes
18
-- Permission mode: `read-only`
19
-- Workflow override: none (`execute`)
20
-
21
-## Run 1
22
-
23
-- Task: `Say hello in five words.`
24
-- Expected productive path: one assistant turn, no tools, immediate final response
25
-- Expected risky heuristics: none; this should not need repair or completion nudges
26
-- Actual outcome: failed before the first assistant turn completed
27
-- Verification outcome: not reached
28
-- Final response quality: none; CLI exited with `httpx.HTTPStatusError`
29
-
30
-### Evidence
31
-
32
-- Doctor result: `uv run loader doctor -m qwen2.5:7b` reported `backend: pass`, `chat: fail`, and capabilities `pass`
33
-- Runtime invocation: `uv run loader -m qwen2.5:7b --no-tui --permission-mode read-only "Say hello in five words."`
34
-- Tool path used: none
35
-- Phase trace: startup banner printed, then `Generating...`, then `/api/chat` failed with HTTP 500 before the first streamed chunk
36
-- Recovery layers fired: none observed
37
-- Session/runtime notes:
38
-  - Loader selected `Mode: Native`
39
-  - session id: `20260407T190318Z-2f8a2087`
40
-  - failure site: `src/loader/llm/ollama.py:363` during `backend.stream(...)`
41
-
42
-### Assessment
43
-
44
-- Did the runtime help or get in the way? The runtime did not get a chance to help or interfere; the backend failed before the first assistant turn was materialized.
45
-- Which heuristics were load-bearing? None in this run.
46
-- Which heuristics felt like churn? None in this run.
47
-- Proposed disposition updates:
48
-  - Do not draw recovery-layer conclusions from this run.
49
-  - Treat the native validation lane as blocked on live `/api/chat` reliability, not on Loader turn logic.
50
-
51
-## Run 2
52
-
53
-- Task: `Say hello in five words.`
54
-- Expected productive path: one assistant turn, no tools, immediate final response
55
-- Expected risky heuristics: none
56
-- Actual outcome: failed before the first assistant turn completed
57
-- Verification outcome: not reached
58
-- Final response quality: none; CLI exited with `httpx.HTTPStatusError`
59
-
60
-### Evidence
61
-
62
-- Doctor result: `uv run loader doctor -m qwen2.5:14b` reported `backend: pass`, `chat: fail`, and capabilities `pass`
63
-- Runtime invocation: `uv run loader -m qwen2.5:14b --no-tui --permission-mode read-only "Say hello in five words."`
64
-- Tool path used: none
65
-- Phase trace: startup banner printed, then `Generating...`, then `/api/chat` failed with HTTP 500 before the first streamed chunk
66
-- Recovery layers fired: none observed
67
-- Session/runtime notes:
68
-  - Loader selected `Mode: Native`
69
-  - session id: `20260407T190412Z-0eaaa107`
70
-  - failure site: `src/loader/llm/ollama.py:363` during `backend.stream(...)`
71
-
72
-### Assessment
73
-
74
-- Did the runtime help or get in the way? Same result as Run 1; the runtime never got past the first backend chat request.
75
-- Which heuristics were load-bearing? None in this run.
76
-- Which heuristics felt like churn? None in this run.
77
-- Proposed disposition updates:
78
-  - The failure is not isolated to one native-capable model.
79
-  - Keep the native Sprint 09 lane open, but mark it blocked until Ollama chat requests succeed again.
80
-
81
-## Summary
82
-
83
-- Most useful runtime behaviors: `loader doctor` now separates model availability from live chat readiness, which makes the blockage explicit instead of implying the lane is healthy.
84
-- Least useful runtime behaviors: none assessed; runtime behavior was not exercised.
85
-- Recovery layers that should likely be deleted: no update from this artifact
86
-- Recovery layers that should likely be gated by capability profile: no update from this artifact
87
-- Follow-up code or test tasks:
88
-  - investigate why `doctor` passes on `/api/tags` and `/api/show` while live `/api/chat` requests fail with HTTP 500
89
-  - rerun this lane with a file-read task once chat requests are healthy, so the runtime can actually exercise native-tool behavior
.docs/audit_sprints/sprint09_interactive_validation_raw_text.mddeleted
@@ -1,117 +0,0 @@
1
-# Sprint 09 Interactive Validation — Raw-Text-Prone Lane
2
-
3
-## Backend Summary
4
-
5
-- Date: 2026-04-07
6
-- Operator: Codex
7
-- Branch: `cleanup-audit-plan`
8
-- Capture base: `f60523c` (`Probe live chat health in doctor`)
9
-- Backend: `ollama`
10
-- Models exercised:
11
-  - `qwen3-coder:30b`
12
-  - `gemma3:12b`
13
-  - `llama2:latest` (doctor only)
14
-- Capability profile:
15
-  - `qwen3-coder:30b` → `json_tag`
16
-  - `gemma3:12b` → `json_tag`
17
-  - `llama2:latest` → `json_tag`
18
-- Native tools enabled: no
19
-- Streaming enabled: yes
20
-- Permission mode: `read-only`
21
-- Workflow override: none (`execute`)
22
-
23
-## Run 1
24
-
25
-- Task: `Say hello in five words.`
26
-- Expected productive path: one assistant turn, no tools, immediate final response
27
-- Expected risky heuristics: none; even the raw-text lane should handle this without fallback parsing
28
-- Actual outcome: failed before the first assistant turn completed
29
-- Verification outcome: not reached
30
-- Final response quality: none; CLI exited with `httpx.HTTPStatusError`
31
-
32
-### Evidence
33
-
34
-- Doctor result: `uv run loader doctor -m qwen3-coder:30b` reported `backend: pass`, `chat: fail`, and capabilities `warn` (`json_tag`)
35
-- Runtime invocation: `uv run loader -m qwen3-coder:30b --no-tui --permission-mode read-only --react "Say hello in five words."`
36
-- Tool path used: none
37
-- Phase trace: startup banner printed, then `Generating...`, then `/api/chat` failed with HTTP 500 before the first streamed chunk
38
-- Recovery layers fired: none observed
39
-- Session/runtime notes:
40
-  - Loader selected `Mode: ReAct`
41
-  - session id: `20260407T190318Z-a3a6385a`
42
-  - failure site: `src/loader/llm/ollama.py:363` during `backend.stream(...)`
43
-
44
-### Assessment
45
-
46
-- Did the runtime help or get in the way? The runtime did not get a usable response back from the backend, so the raw-text tool fallback was never exercised.
47
-- Which heuristics were load-bearing? None in this run.
48
-- Which heuristics felt like churn? None in this run.
49
-- Proposed disposition updates:
50
-  - Do not use this run to justify keeping or deleting raw-text repair heuristics.
51
-  - Treat the raw-text-prone lane as blocked on live `/api/chat` reliability.
52
-
53
-## Run 2
54
-
55
-- Task: `Say hello in five words.`
56
-- Expected productive path: one assistant turn, no tools, immediate final response
57
-- Expected risky heuristics: none
58
-- Actual outcome: failed before the first assistant turn completed
59
-- Verification outcome: not reached
60
-- Final response quality: none; CLI exited with `httpx.HTTPStatusError`
61
-
62
-### Evidence
63
-
64
-- Doctor result: `uv run loader doctor -m gemma3:12b` reported `backend: pass`, `chat: fail`, and capabilities `warn` (`json_tag`)
65
-- Runtime invocation: `uv run loader -m gemma3:12b --no-tui --permission-mode read-only --react "Say hello in five words."`
66
-- Tool path used: none
67
-- Phase trace: startup banner printed, then `Generating...`, then `/api/chat` failed with HTTP 500 before the first streamed chunk
68
-- Recovery layers fired: none observed
69
-- Session/runtime notes:
70
-  - Loader selected `Mode: ReAct`
71
-  - session id: `20260407T190412Z-7e745da3`
72
-  - failure site: `src/loader/llm/ollama.py:363` during `backend.stream(...)`
73
-
74
-### Assessment
75
-
76
-- Did the runtime help or get in the way? Same result as Run 1; the runtime never reached the point where ReAct parsing or repair could matter.
77
-- Which heuristics were load-bearing? None in this run.
78
-- Which heuristics felt like churn? None in this run.
79
-- Proposed disposition updates:
80
-  - The block is not isolated to one ReAct-profile model.
81
-  - Keep the raw-text Sprint 09 lane open, but mark it blocked until live chat requests succeed.
82
-
83
-## Run 3
84
-
85
-- Task: doctor-only preflight for a second fallback-profile model
86
-- Expected productive path: confirm the lane is not blocked by model availability alone
87
-- Expected risky heuristics: none
88
-- Actual outcome: doctor passed for `llama2:latest`, but no live run was attempted after the matching `/api/chat` failures in Runs 1 and 2
89
-- Verification outcome: not applicable
90
-- Final response quality: not applicable
91
-
92
-### Evidence
93
-
94
-- Doctor result: `uv run loader doctor -m llama2:latest` reported `backend: pass`, `chat: fail`, and capabilities `warn` (`json_tag`)
95
-- Tool path used: none
96
-- Phase trace: not applicable
97
-- Recovery layers fired: none observed
98
-- Session/runtime notes:
99
-  - This run was intentionally limited to preflight evidence because the lane was already blocked by repeated `/api/chat` failures.
100
-
101
-### Assessment
102
-
103
-- Did the runtime help or get in the way? Not assessed.
104
-- Which heuristics were load-bearing? Not assessed.
105
-- Which heuristics felt like churn? Not assessed.
106
-- Proposed disposition updates:
107
-  - Once `/api/chat` is stable again, prioritize this model family for a real raw-text fallback task that exercises `read`, `TodoWrite`, `patch`, and `AskUserQuestion`.
108
-
109
-## Summary
110
-
111
-- Most useful runtime behaviors: `loader doctor` now distinguishes capability classification from live chat readiness, so the raw-text lane blockage is explicit.
112
-- Least useful runtime behaviors: none assessed; runtime behavior was not exercised.
113
-- Recovery layers that should likely be deleted: no update from this artifact
114
-- Recovery layers that should likely be gated by capability profile: no update from this artifact
115
-- Follow-up code or test tasks:
116
-  - investigate the `/api/chat` failure independently of Sprint 09 runtime deletion work
117
-  - rerun this lane with a raw-text tool-use task as soon as live chat requests succeed so the fallback machinery can be evaluated on real evidence
.docs/audit_sprints/sprint10.mddeleted
@@ -1,93 +0,0 @@
1
-# Sprint 10: Runtime Context and Ownership Inversion
2
-
3
-## Status on `cleanup-audit-plan`
4
-
5
-- `RuntimeContext` and `RuntimeLegacyServices` are now in place under `src/loader/runtime/context.py`
6
-- the core turn path helpers run on typed runtime context inputs instead of direct `self.agent.*` access
7
-- `src/loader/runtime/` direct `self.agent.` reach-ins are down to `0`
8
-- focused fake-context coverage landed for assistant turns, tool batches, completion policy, finalization, repair, and explore
9
-- repo-wide verification is green with `203 passed`
10
-
11
-## Prerequisites
12
-
13
-Sprint 09
14
-
15
-## Goals
16
-
17
-Make the runtime a real runtime instead of a set of helper classes that reach back into `Agent`.
18
-
19
-Sprint 01 split the loop across files, and Sprints 07-08 continued that split. The next step is to replace `agent: Any` with a typed boundary so runtime correctness no longer depends on reaching into `self.agent.*` from every helper.
20
-
21
-## Deliverables
22
-
23
-### 1. Introduce a typed runtime context
24
-
25
-Add a typed context layer under `src/loader/runtime/` that owns the state and services the runtime actually needs, such as:
26
-
27
-- session access
28
-- backend access
29
-- registry and tool schemas
30
-- permission policy and hook manager
31
-- workflow mode and capability profile
32
-- prompt-building inputs
33
-- runtime-scoped callbacks/services that still need legacy behavior during migration
34
-
35
-The key rule is that runtime helpers accept this context, not `Agent`.
36
-
37
-### 2. Migrate runtime helpers off `self.agent`
38
-
39
-Migrate at least these modules to the new context boundary:
40
-
41
-- `runtime/conversation.py`
42
-- `runtime/assistant_turns.py`
43
-- `runtime/tool_batches.py`
44
-- `runtime/finalization.py`
45
-- `runtime/repair.py`
46
-- `runtime/completion_policy.py`
47
-- `runtime/explore.py`
48
-
49
-During this sprint it is acceptable for the `Agent` to construct the context and act as an adapter. It is not acceptable for runtime modules to keep reaching through that adapter directly.
50
-
51
-### 3. Introduce typed service seams for legacy dependencies
52
-
53
-Where runtime logic still depends on legacy behavior, surface that dependency explicitly as a typed service or callback protocol instead of an object reach-in. Likely seams include:
54
-
55
-- self-critique
56
-- loop detection
57
-- text-loop detection
58
-- action verification and recovery bookkeeping
59
-- any remaining stream filtering or steering hooks
60
-
61
-This keeps Sprint 11 focused on deleting heuristics rather than disentangling call sites.
62
-
63
-### 4. Isolation-friendly runtime tests
64
-
65
-Add or update tests so the migrated runtime helpers can be exercised with lightweight fake contexts instead of full `Agent` instances.
66
-
67
-## Commit slicing
68
-
69
-- one commit for the new runtime-context types and adapters
70
-- one commit per migrated runtime module or closely related module pair
71
-- one commit for test harness/fake-context support
72
-- one commit for any follow-up cleanup that removes obsolete `agent: Any` plumbing
73
-
74
-## Testing strategy
75
-
76
-- `uv run pytest -q`
77
-- unit coverage for runtime-context construction and validation
78
-- focused tests showing migrated helpers work with fake contexts
79
-- regression coverage proving session state, permission state, and phase state still flow through the runtime correctly
80
-
81
-## Definition of done
82
-
83
-- `src/loader/runtime/` no longer treats `Agent` as its primary data model
84
-- the major runtime helpers run on typed context/service inputs instead of `self.agent.*`
85
-- `agent: Any` is removed from runtime constructors in the core turn path
86
-- the remaining legacy dependencies are explicit and typed instead of hidden object reach-ins
87
-
88
-## Explicitly out of scope
89
-
90
-- deleting recovery heuristics en masse
91
-- redesigning the raw-text parser
92
-- clarify/plan protocol changes
93
-- large-scale legacy-file breakup beyond the context boundary needed for runtime ownership
.docs/audit_sprints/sprint11.mddeleted
@@ -1,135 +0,0 @@
1
-# Sprint 11: Recovery Deletion and Tool Parsing Unification
2
-
3
-## Status on `cleanup-audit-plan`
4
-
5
-- repo verification is currently `210 passed`
6
-- `src/loader/agent/loop.py` is down to `815` lines from the Sprint 09 baseline of `1111`
7
-- the sprint has already deleted:
8
-  - the post-action follow-up suffix
9
-  - the first-turn `[` prefill trick
10
-  - fake-tool narration repair prompts
11
-  - deflection repair prompts
12
-  - self-critique rerouting
13
-  - non-mutating completion nudges
14
-  - text-loop bailout
15
-  - action-loop bailout on successful repeated tool patterns
16
-- empty-response handling has been tightened to one honest retry plus explicit failure
17
-- raw-text parsing has been unified onto `src/loader/agent/parsing.py`
18
-  - `src/loader/agent/loop.py` no longer carries `_extract_raw_json_tool_calls(...)`
19
-  - `src/loader/runtime/repair.py` and `src/loader/runtime/explore.py` use the shared parser
20
-  - `src/loader/llm/ollama.py` now routes both complete-mode and streaming final text parsing through the shared parser
21
-- the sprint's explicit deletion and subtraction goals are now met
22
-  - the hard subtraction target is now met by `4` lines
23
-  - the remaining gap belongs to broader legacy-tree retirement, not another inline bailout decision
24
-
25
-## Prerequisites
26
-
27
-Sprint 10
28
-
29
-## Goals
30
-
31
-Delete or tightly gate the recovery layers that still make Loader puppet the assistant in-stream.
32
-
33
-This is the central contract sprint. After Sprint 10 creates a clean runtime boundary, this sprint should remove the behaviors the audit called out instead of simply naming them more cleanly.
34
-
35
-## Deliverables
36
-
37
-### 1. Unify raw-text tool parsing
38
-
39
-Resolve the duplicated parsing split between `agent/parsing.py` and `agent/loop.py`.
40
-
41
-Current state:
42
-
43
-- complete enough to count as landed for the core runtime path
44
-- follow-on parser work should only target residual streaming UX shims or backend-specific cleanup, not reintroduce a second extraction path
45
-
46
-Implementation targets:
47
-
48
-- delete `_extract_raw_json_tool_calls(...)` from `agent/loop.py`, or reduce it to a thin compatibility shim over a shared parser
49
-- make raw-text parsing aware of the real registry surface instead of a hardcoded tool list
50
-- keep native-tool and raw-text paths converging on the same normalized `ToolCall` contract before execution
51
-- gate raw-text fallback by capability profile rather than assuming every model should get it
52
-
53
-The outcome should be one parsing strategy, not two diverging regex stacks.
54
-
55
-### 2. Remove fake assistant continuation behavior
56
-
57
-Delete the assistant-puppeteering paths unless Sprint 09 interactive evidence proves one must survive behind an explicit gate.
58
-
59
-Current state:
60
-
61
-- largely in progress with real deletions already landed
62
-- the biggest surviving behavior in this area is the bounded empty-response retry, which is now explicit and much narrower than the original puppet prompts
63
-
64
-Primary deletion targets:
65
-
66
-- the `[` prefill trick in `runtime/conversation.py`
67
-- the five hardcoded empty-output continuation prompts
68
-- fake-tool narration scolding that fabricates assistant/user turns to steer the model back on track
69
-- the unconditional "Would you like me to make any changes or additions?" suffix
70
-
71
-Replace these with a simpler contract:
72
-
73
-- bounded retries where truly necessary
74
-- honest failure/escalation when the assistant does not act
75
-- DoD/verification evidence for mutating tasks
76
-- user-visible stop conditions instead of hidden assistant puppeteering
77
-
78
-### 3. Re-scope critique, loop, and completion nudges
79
-
80
-For each remaining heuristic, decide whether it should be:
81
-
82
-- deleted
83
-- moved to a session-level safeguard
84
-- gated behind a capability/profile condition
85
-
86
-This includes:
87
-
88
-- deflection handling
89
-
90
-No heuristic survives this sprint without a written reason tied back to Sprint 09 evidence.
91
-
92
-### 4. Shrink the legacy loop by subtraction
93
-
94
-This sprint should materially reduce legacy surface area instead of moving it again.
95
-
96
-Set a hard subtraction target:
97
-
98
-- `src/loader/agent/loop.py` must shrink by at least 300 lines from the Sprint 09 baseline, or the sprint is not complete
99
-
100
-Current score:
101
-
102
-- baseline: `1111`
103
-- current: `815`
104
-- net: `-296`
105
-- remaining to target: `-4`
106
-
107
-If a target is missed, document exactly which remaining behaviors blocked deletion and move them into the next sprint explicitly instead of silently carrying them forward.
108
-
109
-## Commit slicing
110
-
111
-- one commit for raw-parser unification or shim removal
112
-- one commit per deleted or newly gated recovery behavior
113
-- one commit for any capability-profile gating additions
114
-- one commit for the final legacy-loop cleanup after the behavior changes are already green
115
-
116
-## Testing strategy
117
-
118
-- `uv run pytest -q`
119
-- targeted parser tests for raw-text recovery across legacy and newer tools
120
-- deterministic runtime coverage for empty-response handling, fake narration, and completion behavior after deletion
121
-- parity-harness confirmation that the retained contract still works end-to-end
122
-- interactive reruns of the Sprint 09 matrix for every capability profile whose behavior changed
123
-
124
-## Definition of done
125
-
126
-- Loader no longer fabricates assistant turns to keep the model moving in the common case
127
-- raw-text tool recovery uses one normalized parser path and no stale tool allowlist
128
-- every surviving recovery heuristic has an explicit owner, gate, and reason
129
-- `agent/loop.py` shrinks materially by subtraction, not merely by forwarding calls elsewhere
130
-
131
-## Explicitly out of scope
132
-
133
-- clarify/plan workflow redesign
134
-- broad safety-hook refactors unrelated to deleted recovery behavior
135
-- multi-agent or planner/critic expansion
.docs/audit_sprints/sprint12.mddeleted
@@ -1,91 +0,0 @@
1
-# Sprint 12: Workflow Protocol Hardening and Decomposition Decision
2
-
3
-## Status on `cleanup-audit-plan`
4
-
5
-- repo verification is currently `211 passed`
6
-- clarify mode is now explicitly a single-question brief flow in prompts, runtime behavior, and persisted clarify artifacts
7
-- plan mode is now explicitly single-pass implementation and verification artifact generation in prompts, runtime behavior, and persisted plans
8
-- execute now records workflow-artifact status and artifact sources in session state when it activates or reuses the workflow bridge
9
-- the legacy decomposition CLI flag and `agent/loop.py` decomposition orchestration have been deleted
10
-- the sprint's explicit workflow-contract goals are now met
11
-  - the remaining gap is not hidden workflow depth; it is the absence of the refs' deeper routing discipline and the broader legacy tree still living under `agent/`
12
-
13
-## Prerequisites
14
-
15
-Sprint 11
16
-
17
-## Goals
18
-
19
-Either make Loader's workflow modes real protocols or narrow their claims until they are honest.
20
-
21
-Sprint 04 landed artifacts and routing, but not the discipline the refs rely on. After the contract cleanup work, Loader should stop pretending that clarify/plan are deeper than they are.
22
-
23
-## Deliverables
24
-
25
-### 1. Clarify mode becomes a real loop or is explicitly downscoped
26
-
27
-Choose one honest outcome and implement it fully:
28
-
29
-- either add a real clarify protocol with one-question-per-round discipline, ambiguity scoring, exit criteria, and persisted open questions
30
-- or explicitly downscope clarify mode to a lightweight single-question artifact flow in code, prompts, docs, and parity notes
31
-
32
-The repo should not keep OMX-style claims if the runtime is not actually enforcing OMX-style behavior.
33
-
34
-### 2. Plan mode becomes iterative or is explicitly redefined
35
-
36
-Choose one honest outcome and implement it fully:
37
-
38
-- either add a minimal real iteration loop such as planner then critic with persisted revisions
39
-- or redefine plan mode as "single-pass planning artifact generation" everywhere and stop implying consensus/planning discipline that does not exist
40
-
41
-The important part is that the runtime behavior, docs, and persisted artifacts all agree.
42
-
43
-### 3. Execution honors workflow artifacts explicitly
44
-
45
-Tighten the boundary between workflow modes and execution:
46
-
47
-- execution should consume plan/clarify artifacts intentionally
48
-- skips or overrides should be recorded explicitly in runtime/session state
49
-- mode routing should be based on clearer gates than prompt text alone where practical
50
-
51
-This does not need OMX's full architecture. It does need a firmer contract than "label plus prompt."
52
-
53
-### 4. Make a final decomposition decision
54
-
55
-The decomposition path in `agent/loop.py` is currently opt-in legacy code that is not integrated with the workflow system cleanly.
56
-
57
-This sprint must either:
58
-
59
-- promote decomposition into an explicit workflow/runtime feature with ownership, tests, and surfaced state
60
-- or delete the legacy decomposition orchestration from `agent/loop.py`
61
-
62
-Dead-ish code is not an acceptable steady state.
63
-
64
-## Commit slicing
65
-
66
-- one commit for clarify-mode contract changes
67
-- one commit for plan-mode contract changes
68
-- one commit for execution/artifact integration
69
-- one commit for decomposition promotion or deletion
70
-- one commit for doc/PARITY updates that make the resulting behavior explicit
71
-
72
-## Testing strategy
73
-
74
-- `uv run pytest -q`
75
-- workflow-mode tests for clarify and plan round behavior
76
-- artifact persistence tests for briefs and plans under the revised contract
77
-- runtime tests showing execute mode either consumes or explicitly bypasses workflow artifacts
78
-- coverage for whichever decomposition outcome is chosen
79
-
80
-## Definition of done
81
-
82
-- clarify and plan modes are described honestly and enforced consistently
83
-- execution has a clearer contract with the artifacts those modes produce
84
-- decomposition is either a first-class workflow/runtime feature or gone
85
-- `.docs/PARITY.md` and sprint docs no longer overclaim workflow depth
86
-
87
-## Explicitly out of scope
88
-
89
-- OMX-style multi-role planning beyond the chosen minimum honest protocol
90
-- multi-agent delegation
91
-- broad prompt-builder redesign unrelated to workflow enforcement
.docs/audit_sprints/sprint13.mddeleted
@@ -1,83 +0,0 @@
1
-# Sprint 13: Legacy Runtime Retirement and Safeguard Refactor
2
-
3
-## Prerequisites
4
-
5
-Sprint 12
6
-
7
-## Goals
8
-
9
-Finish the cleanup cycle by shrinking the legacy `agent/` tree and moving the remaining load-bearing behavior into runtime-owned services, hooks, or clearly scoped helpers.
10
-
11
-This sprint closes the residual debt left open by Sprint 03 and by the additive refactors that followed it.
12
-
13
-## Deliverables
14
-
15
-### 1. Refactor `agent/safeguards.py` by subtraction
16
-
17
-Complete the refactor Sprint 03 claimed but never really achieved.
18
-
19
-Implementation targets:
20
-
21
-- move remaining hook-worthy validation and action-tracking logic into runtime-owned implementations or service modules
22
-- delete wrapper-only layers where runtime hooks simply forward into `agent.safeguards`
23
-- remove legacy filtering or duplicate-checking logic that is no longer needed after Sprint 11's contract tightening
24
-
25
-The success condition is a smaller, less central safeguards file, not a prettier import graph around the same code.
26
-
27
-### 2. Retire legacy reasoning/recovery helpers that no longer own behavior
28
-
29
-After Sprint 11 and Sprint 12, some functions in `agent/reasoning.py` and `agent/recovery.py` should either move behind explicit service seams or disappear entirely.
30
-
31
-Targets include helpers that only existed to support:
32
-
33
-- premature-completion nudges that were deleted or gated
34
-- raw-text parsing behavior that was unified elsewhere
35
-- decomposition flows that were deleted or promoted
36
-- old recovery prompts that the runtime no longer injects
37
-
38
-### 3. Reduce runtime imports from `agent/*`
39
-
40
-By the end of this sprint, runtime modules should depend on narrow typed services, not broad legacy modules.
41
-
42
-Drive toward:
43
-
44
-- no direct runtime import of `agent.safeguards` for primary hook behavior
45
-- no direct runtime import of legacy parsing/recovery code where a runtime-owned equivalent now exists
46
-- a smaller, clearer adapter boundary inside `Agent`
47
-
48
-### 4. Legacy debt scoreboard and closure report
49
-
50
-Commit a final audit-closure artifact under `.docs/audit_sprints/` that records:
51
-
52
-- line-count deltas versus the Sprint 09 baseline
53
-- remaining `agent/*` dependencies from `runtime/*`
54
-- which audit findings are now closed, partially closed, or intentionally deferred
55
-
56
-This is the final honesty pass for the cleanup cycle.
57
-
58
-## Commit slicing
59
-
60
-- one commit per safeguards/service migration
61
-- one commit per deleted reasoning/recovery slice
62
-- one commit for runtime import cleanup
63
-- one commit for the closure report and PARITY/residual-debt updates
64
-
65
-## Testing strategy
66
-
67
-- `uv run pytest -q`
68
-- safety and hook lifecycle regressions for the moved/deleted safeguards logic
69
-- coverage for any remaining runtime-owned validation services
70
-- smoke coverage showing the runtime still enforces safety, policy, and duplicate detection after the legacy shrink
71
-
72
-## Definition of done
73
-
74
-- `agent/safeguards.py` is materially smaller and no longer the hidden primary implementation behind runtime hooks
75
-- legacy reasoning/recovery helpers only remain where they still own real behavior
76
-- `runtime/*` depends on typed runtime services instead of broad `agent/*` modules wherever practical
77
-- the repo has a committed closure report against the audit baseline
78
-
79
-## Explicitly out of scope
80
-
81
-- starting a new feature sprint before the closure report is written
82
-- speculative architectural rewrites not tied to an audit finding
83
-- expanding Loader beyond the current single-agent product scope
.docs/audit_sprints/sprint13_closure.mddeleted
@@ -1,94 +0,0 @@
1
-# Sprint 13 Closure Report
2
-
3
-## Outcome
4
-
5
-Sprint 13 met its runtime-retirement target.
6
-
7
-- `src/loader/runtime/` now has `0` direct imports from `agent/*`
8
-- the runtime still has `0` direct `self.agent.` reach-ins after Sprint 10
9
-- repo-wide verification is green at `226 passed`
10
-- the remaining debt is no longer hidden runtime ownership; it is the narrower set of legacy prompt/filter helpers and the still-blocked live backend validation matrix
11
-
12
-## Commit trail
13
-
14
-Sprint 13 landed as small, behavior-scoped commits:
15
-
16
-- `34effb0` `Move safeguard services into runtime`
17
-- `098f467` `Move rollback planning into runtime`
18
-- `3ebef1c` `Move recovery services into runtime`
19
-- `50ab16b` `Move parsing helpers into runtime`
20
-- `e36e64f` `Move reasoning types into runtime`
21
-- `bb48c80` `Move task classification into runtime`
22
-
23
-## Legacy tree delta vs Sprint 09 baseline
24
-
25
-Baseline source: [sprint09_baseline.md](sprint09_baseline.md)
26
-
27
-| File | Sprint 09 baseline | Current | Delta |
28
-| --- | ---: | ---: | ---: |
29
-| `src/loader/agent/loop.py` | 1111 | 721 | -390 |
30
-| `src/loader/agent/reasoning.py` | 1235 | 649 | -586 |
31
-| `src/loader/agent/safeguards.py` | 1142 | 595 | -547 |
32
-| `src/loader/agent/recovery.py` | 648 | 12 | -636 |
33
-| total | 4136 | 1977 | -2159 |
34
-
35
-Additional shared-parser shrink not tracked in the original baseline table:
36
-
37
-| File | Earlier shared-parser size | Current | Delta |
38
-| --- | ---: | ---: | ---: |
39
-| `src/loader/agent/parsing.py` | 182 | 7 | -175 |
40
-
41
-## Runtime ownership scoreboard
42
-
43
-Runtime-owned modules added during Sprint 13:
44
-
45
-- `src/loader/runtime/safeguard_services.py`
46
-- `src/loader/runtime/rollback.py`
47
-- `src/loader/runtime/recovery.py`
48
-- `src/loader/runtime/parsing.py`
49
-- `src/loader/runtime/reasoning_types.py`
50
-- `src/loader/runtime/task_classification.py`
51
-
52
-Current runtime dependency state:
53
-
54
-- direct `runtime -> agent/*` imports: `0`
55
-- direct `runtime -> agent.safeguards` imports for hook behavior: `0`
56
-- direct `runtime -> agent.recovery` imports: `0`
57
-- direct `runtime -> agent.parsing` imports: `0`
58
-- direct `runtime -> agent.reasoning` imports: `0`
59
-
60
-This closes the audit's runtime-ownership complaint at the import boundary. The remaining legacy behavior is now reached either through explicit `RuntimeLegacyServices` callbacks or outside the runtime package entirely.
61
-
62
-## Audit finding status
63
-
64
-| Audit theme | Status | Notes |
65
-| --- | --- | --- |
66
-| runtime ownership depended on `Agent` reach-ins | closed | Sprint 10 removed `self.agent.` reach-ins; Sprint 13 removed direct `runtime -> agent/*` imports |
67
-| runtime hooks were wrappers over `agent.safeguards` | closed | duplicate detection, validation, rollback tracking, and recovery ownership now live under `runtime/*` |
68
-| raw-text parsing was split and stale | closed | shared parser now lives in `runtime/parsing.py`; legacy agent wrapper is compatibility-only |
69
-| legacy recovery helpers remained load-bearing | closed | `agent/recovery.py` is now a thin compatibility re-export |
70
-| reasoning types were still runtime-owned by `agent.reasoning` | closed | runtime event/context typing now comes from `runtime/reasoning_types.py` |
71
-| workflow modes overstated protocol depth | partial | Sprint 12 made the scope honest, but Loader still does not implement the refs' deeper clarify/plan discipline |
72
-| interactive validation against real backends | deferred | `loader doctor` is now honest, but the documented Ollama `/api/chat` HTTP 500 failures still block the live validation matrix |
73
-
74
-## Remaining residual debt
75
-
76
-- `src/loader/agent/safeguards.py` still owns streamed-output filtering and pattern steering. That file is no longer the hidden implementation behind runtime hooks, but it is still a real legacy surface.
77
-- `src/loader/agent/reasoning.py` still owns prompt text and parsing helpers for self-critique, confidence scoring, verification, and completion checks. Those behaviors are no longer runtime-owned, but they still sit in the legacy tree behind explicit callbacks.
78
-- `src/loader/ui/app.py` and `src/loader/ui/adapter.py` still depend on `agent.loop.AgentEvent`. That is outside the runtime package, but it is still part of the broader legacy boundary.
79
-- the Sprint 09 interactive validation matrix remains blocked by live backend chat failures and should be rerun once `/api/chat` is healthy.
80
-
81
-## Verification snapshot
82
-
83
-- `uv run pytest -q` -> `226 passed`
84
-- targeted Sprint 13 migration checks stayed green after each service move
85
-
86
-## Honest end state
87
-
88
-Loader is in a meaningfully different state than the audit described:
89
-
90
-- the runtime contract is no longer hidden behind imports from `agent/safeguards.py`, `agent/recovery.py`, `agent/parsing.py`, or `agent/reasoning.py`
91
-- the legacy tree is materially smaller by more than two thousand lines versus the Sprint 09 baseline
92
-- the remaining debt is now narrow enough to discuss directly instead of being scattered through implicit runtime ownership
93
-
94
-What Sprint 13 did **not** prove is live model behavior against a healthy real backend. That remains the next evidence gap, and the closure should stay honest about it.
.docs/audit_sprints/trunk_sitrep.mddeleted
@@ -1,191 +0,0 @@
1
-# Trunk Divergence Sitrep
2
-
3
-Date: 2026-04-07
4
-
5
-## Snapshot
6
-
7
-- cleanup branch: `cleanup-audit-plan` at `97e5aa9`
8
-- local trunk: `trunk` at `4effa19`
9
-- merge-base: `319013422032eb0436cc10d214d08bdd071f8743`
10
-- branch divergence since merge-base: `59` commits on `trunk`, `59` commits on `cleanup-audit-plan`
11
-- local trunk status at inspection time:
12
-  - ahead of `origin/trunk` by `20` commits
13
-  - one untracked local file: `.docs/audit.txt`
14
-
15
-This is no longer a small rebase. The two lines of work are now materially different evolutions of the same post-Sprint-08 code.
16
-
17
-## What trunk has done since the split
18
-
19
-Trunk has pushed deeper on workflow protocol, clarify rigor, and conversation-runtime decomposition.
20
-
21
-Major trunk themes:
22
-
23
-- prompt builder and phase surfaces landed on trunk and were then built on further
24
-- clarify mode gained grounding, slot-awareness, pressure passes, and richer brief synthesis
25
-- workflow policy/timeline/state/recovery machinery expanded significantly
26
-- `runtime/conversation.py` was split into explicit turn-control modules:
27
-  - `turn_preparation.py`
28
-  - `turn_preamble.py`
29
-  - `turn_iteration.py`
30
-  - `turn_completion.py`
31
-  - `turn_loop.py`
32
-  - `workflow_state.py`
33
-  - `workflow_policy.py`
34
-  - `workflow_lanes.py`
35
-  - `workflow_recovery.py`
36
-- CLI and inspection surfaces expanded around workflow and permission visibility
37
-- trunk added a large amount of direct targeted coverage for those new seams
38
-
39
-Representative trunk-only commits:
40
-
41
-- `455a0f5` `Extract main turn loop control from conversation runtime`
42
-- `8c70869` `Extract workflow state control from conversation runtime`
43
-- `753d5b5` `Extract turn preparation from conversation runtime`
44
-- `7ca9a28` `Add semantic artifact invalidation and replan recovery`
45
-- `60d5983` `Add intent-aware clarify slot strategy`
46
-- `5a65371` `Extract repo facts for clarify grounding`
47
-- `5fda6ed` `Audit Sprint 12 interview rigor rollout`
48
-- `4effa19` `Plan Sprint 13 semantic diff work`
49
-
50
-## What cleanup has done since the split
51
-
52
-The cleanup branch has pushed harder on runtime contract tightening, heuristic deletion, and legacy runtime retirement.
53
-
54
-Major cleanup themes:
55
-
56
-- Sprint 09 baseline, interactive-validation artifacts, and raw-text fallback guardrails
57
-- Sprint 10 runtime ownership inversion through `RuntimeContext`
58
-- Sprint 11 deletion of puppet behaviors and parser unification
59
-- Sprint 12 honest downscoping of clarify/plan plus legacy decomposition deletion
60
-- Sprint 13 runtime-owned service extraction and closure reporting
61
-
62
-Representative cleanup-only commits:
63
-
64
-- `f6cc62e` `Unify raw-text parsing on shared parser`
65
-- `7c1d6d8` `Tighten empty-response retry contract`
66
-- `ce20b55` `Delete post-action follow-up suffix`
67
-- `bfeecb2` `Delete first-turn prefill trick`
68
-- `34effb0` `Move safeguard services into runtime`
69
-- `3ebef1c` `Move recovery services into runtime`
70
-- `e36e64f` `Move reasoning types into runtime`
71
-- `bb48c80` `Move task classification into runtime`
72
-- `97e5aa9` `Record sprint 13 closure report`
73
-
74
-## Current relationship between the branches
75
-
76
-The two branches are not duplicative. They are mostly complementary, but they collide in exactly the files that now matter most.
77
-
78
-Trunk optimized for:
79
-
80
-- richer clarify/workflow behavior
81
-- explicit controller/state-machine decomposition
82
-- better operator-facing workflow evidence
83
-
84
-Cleanup optimized for:
85
-
86
-- stronger runtime ownership boundaries
87
-- deletion of assistant puppeting behavior
88
-- shrinking the hidden legacy tree
89
-- documenting closure against the audit
90
-
91
-That means the branches agree on direction, but they disagree on shape.
92
-
93
-## Main overlap and merge-pressure zones
94
-
95
-Highest-risk overlap:
96
-
97
-- `src/loader/runtime/conversation.py`
98
-- `src/loader/runtime/workflow.py`
99
-- `src/loader/runtime/inspection.py`
100
-- `src/loader/cli/main.py`
101
-- `src/loader/runtime/session.py`
102
-- `src/loader/runtime/repair.py`
103
-- `src/loader/runtime/completion_policy.py`
104
-- `src/loader/runtime/phases.py`
105
-- `src/loader/llm/ollama.py`
106
-- `tests/test_runtime_harness.py`
107
-- `tests/test_workflow_runtime.py`
108
-
109
-Trunk-only structural modules that need to be preserved:
110
-
111
-- `src/loader/runtime/clarify_grounding.py`
112
-- `src/loader/runtime/clarify_strategy.py`
113
-- `src/loader/runtime/artifact_invalidation.py`
114
-- `src/loader/runtime/turn_preparation.py`
115
-- `src/loader/runtime/turn_preamble.py`
116
-- `src/loader/runtime/turn_iteration.py`
117
-- `src/loader/runtime/turn_completion.py`
118
-- `src/loader/runtime/turn_loop.py`
119
-- `src/loader/runtime/workflow_state.py`
120
-- `src/loader/runtime/workflow_policy.py`
121
-- `src/loader/runtime/workflow_lanes.py`
122
-- `src/loader/runtime/workflow_recovery.py`
123
-- `src/loader/runtime/workflow_signals.py`
124
-
125
-Cleanup-only runtime modules that are likely worth porting onto trunk:
126
-
127
-- `src/loader/runtime/context.py`
128
-- `src/loader/runtime/safeguard_services.py`
129
-- `src/loader/runtime/rollback.py`
130
-- `src/loader/runtime/recovery.py`
131
-- `src/loader/runtime/parsing.py`
132
-- `src/loader/runtime/reasoning_types.py`
133
-- `src/loader/runtime/task_classification.py`
134
-
135
-## Most important sitrep conclusion
136
-
137
-Trunk has likely become the better landing base for future work.
138
-
139
-Why:
140
-
141
-- trunk already contains the newer conversation/workflow decomposition
142
-- trunk is the branch pushing product behavior and operator-visible semantics forward
143
-- cleanup's strongest value now is not its old file layout, but the contract-tightening outcomes it achieved
144
-
145
-In other words: the cleanup branch should probably be treated as a source of transplantable outcomes, not as the branch to merge wholesale into trunk without a deliberate integration pass.
146
-
147
-## Recommended integration approach
148
-
149
-Do **not** do a blind merge and hope Git sorts it out.
150
-
151
-Recommended path:
152
-
153
-1. Create a fresh integration worktree from current `trunk`.
154
-2. Replay cleanup outcomes onto that base as new small commits.
155
-3. Start with the lowest-conflict service moves:
156
-   - `runtime/safeguard_services.py`
157
-   - `runtime/rollback.py`
158
-   - `runtime/recovery.py`
159
-   - `runtime/parsing.py`
160
-   - `runtime/reasoning_types.py`
161
-   - `runtime/task_classification.py`
162
-4. Rewire trunk's extracted controllers to those runtime-owned services instead of trying to resurrect cleanup's older `conversation.py` shape.
163
-5. Re-apply only the cleanup deletions that still make sense after trunk's newer workflow/clarify changes.
164
-6. Re-run:
165
-   - `uv run pytest -q`
166
-   - targeted workflow/runtime parity checks
167
-   - the blocked live interactive validation matrix once backend chat is healthy
168
-
169
-## What should not be lost
170
-
171
-From trunk:
172
-
173
-- semantic clarify grounding and pressure-pass work
174
-- workflow state/policy/lane decomposition
175
-- workflow recovery and artifact invalidation surfaces
176
-
177
-From cleanup:
178
-
179
-- zero direct `runtime -> agent/*` imports
180
-- runtime-owned service seams instead of hidden legacy ownership
181
-- deleted assistant puppeting behaviors
182
-- the audit closure scoreboard and residual-debt honesty
183
-
184
-## Bottom line
185
-
186
-The divergence is real, but it is not bad news.
187
-
188
-Trunk appears to have pushed the product/runtime decomposition story further.
189
-Cleanup pushed the contract/ownership story further.
190
-
191
-The right next move is to integrate cleanup's runtime-contract wins onto trunk's newer controller and workflow architecture, then re-sitrep from that integration branch.
.docs/sprints/index.mddeleted
@@ -1,112 +0,0 @@
1
-# Loader Sprint Index
2
-
3
-These sprints translate the `REPORT.md` findings into implementation lanes with clear deliverables, test strategy, and definition-of-done checkpoints.
4
-
5
-The plan was reshaped after a deeper validation pass against `refs/claw-code` and `refs/oh-my-codex`. The reshape (a) splits the most ambitious sprint, (b) reorders so the user's actual pain point (premature completion, weak follow-through) lands sooner, and (c) lands hooks alongside permissions so later sprints have a clean lifecycle to hang behavior on.
6
-
7
-## Phase 1: Runtime Foundation
8
-
9
-- [Sprint 00](sprint00.md) — Foundation, Measurement, and Parity Harness
10
-- [Sprint 01](sprint01.md) — Turn Engine, Tool Contract, and Capability Profiles
11
-
12
-## Phase 2: Behavioral Contract
13
-
14
-- [Sprint 02](sprint02.md) — Definition of Done and Verify/Fix Loop
15
-
16
-## Phase 3: Safety and Workflow Discipline
17
-
18
-- [Sprint 03](sprint03.md) — Permission Modes and Tool Lifecycle Hooks
19
-- [Sprint 04](sprint04.md) — Mode Router, Clarify, and Plan Artifacts
20
-
21
-## Phase 4: Durability and Product Surfaces
22
-
23
-- [Sprint 05](sprint05.md) — Session State, Memory, and Compaction
24
-- [Sprint 06](sprint06.md) — Doctor, Explore, Status, and Tool Surface Expansion
25
-
26
-## Phase 5: Execution Policy and Runtime Simplification
27
-
28
-- [Sprint 07](sprint07.md) — Rule-Based Permissions and Runtime Decomposition
29
-
30
-## Phase 6: Prompt Contract and Operator Ergonomics
31
-
32
-- [Sprint 08](sprint08.md) — Prompt Builder, Runtime Phases, and Permission Operator UX
33
-
34
-## Phase 7: Workflow State Discipline
35
-
36
-- [Sprint 09](sprint09.md) — Turn State Machine, Workflow Contracts, and Prompt Preview
37
-
38
-## Phase 8: Workflow Policy and Traceability
39
-
40
-- [Sprint 10](sprint10.md) — Route Pressure, Clarify Depth, and Workflow Timeline
41
-
42
-## Phase 9: Semantic Workflow and Orchestration
43
-
44
-- [Sprint 11](sprint11.md) — Semantic Signals, Clarify Strategy, and Orchestrator Split
45
-
46
-## Phase 10: Interview Rigor and Recovery Evidence
47
-
48
-- [Sprint 12](sprint12.md) — Interview Pressure, Semantic Evidence, and Turn Orchestration
49
-
50
-## Phase 11: Semantic Change and Operator Diffs
51
-
52
-- [Sprint 13](sprint13.md) — Turn Policy Narrowing, Assumption Ledger, and Artifact Diffs
53
-
54
-## Phase 12: Runtime Consolidation After Audit Merge
55
-
56
-- [Sprint 14](sprint14.md) — Runtime Context Adoption, Legacy Burn-Down, and Policy Narrowing
57
-
58
-## Phase 13: Runtime Bootstrap and Service Ownership
59
-
60
-- [Sprint 15](sprint15.md) — Bootstrap Ownership, Service Burn-Down, and Explore Independence
61
-
62
-## Phase 14: Entrypoint Shell and Explore Continuity
63
-
64
-- [Sprint 16](sprint16.md) — Entrypoint Shell, Launcher Contract, and Explore Continuity
65
-
66
-## Phase 15: Public Boundary and Honest Turn Contract
67
-
68
-- [Sprint 17](sprint17.md) — Bootstrap Source Narrowing, Turn Contract Tightening, and Explore Operator UX
69
-
70
-## Phase 16: Shell Minimalism and Completion Honesty
71
-
72
-- [Sprint 18](sprint18.md) — Shell Minimalism, Completion Contract, and Runtime Policy Trace
73
-
74
-## Phase 17: Facade Finalization and Policy Accountability
75
-
76
-- [Sprint 19](sprint19.md) — Facade Finalization, Continuation Hardening, and Unified Policy Timeline
77
-
78
-## Phase 18: Canonical Policy Contracts and Facade Settlement
79
-
80
-- [Sprint 20](sprint20.md) — Canonical Policy Events, Verifier-Backed Follow-Through, and Facade Settlement
81
-
82
-## Phase 19: Evidence Provenance and Runtime-First Narrowing
83
-
84
-- [Sprint 21](sprint21.md) — Evidence Provenance, Read-Model Cleanup, and Runtime-First API
85
-
86
-## Phase 20: Runtime-First Entry and Verification Observability
87
-
88
-- [Sprint 22](sprint22.md) — Runtime Entry API, Verification Observations, and Compatibility Narrowing
89
-
90
-## Phase 21: Runtime-First Integrations and Verification Producers
91
-
92
-- [Sprint 23](sprint23.md) — Runtime-First Integrations, Verification Producers, and Facade Narrowing
93
-
94
-## Phase 22: TUI Runtime Convergence and Verification Lifecycle
95
-
96
-- [Sprint 24](sprint24.md) — TUI Runtime Convergence, Verification Lifecycle, and Facade Narrowing
97
-
98
-## Phase 23: External Runtime Boundary and Verification Attempt Semantics
99
-
100
-- [Sprint 25](sprint25.md) — Public Runtime API, Verification Attempts, and Boundary Narrowing
101
-
102
-## Phase 24: Attempt Histories and Runtime API Delamination
103
-
104
-- [Sprint 26](sprint26.md) — Verification Attempt Timelines and Public Facade Delamination
105
-
106
-## Working principles
107
-
108
-- Each sprint must end with stronger runtime reliability, not just more features.
109
-- Prefer behavior that improves any capable model over model-specific prompting tricks.
110
-- Add new workflow/state surfaces only when they reduce prompt pressure or improve verification.
111
-- No sprint is complete until its behavior is covered by automated tests or a deterministic harness.
112
-- When adding lifecycle behavior (validation, dedup, verification, observability), prefer hooking into the tool lifecycle from Sprint 03 over patching the runtime loop directly.
.docs/sprints/sprint00.mddeleted
@@ -1,107 +0,0 @@
1
-# Sprint 00: Foundation, Measurement, and Parity Harness
2
-
3
-## Prerequisites
4
-
5
-None. This is the stabilization sprint before major behavior work.
6
-
7
-## Goals
8
-
9
-Make Loader measurable and trustworthy enough to improve deliberately.
10
-
11
-This sprint exists to prevent us from adding more agent behavior on top of:
12
-
13
-- a monolithic runtime loop (`agent/loop.py` is 1929 LOC, `agent/reasoning.py` is 1196 LOC, `agent/safeguards.py` is 1079 LOC — together about 4200 lines of orchestration in one cluster)
14
-- a structurally broken tool-result code path that has zero coverage
15
-- broken default test discovery (pytest currently picks up `refs/claw-code/tests/` and fails to import the `loader` package)
16
-- weak operational polish
17
-
18
-## Deliverables
19
-
20
-### 1. Failing regression test for the `tool_call_id` runtime contract bug — DO THIS FIRST
21
-
22
-Before any harness or hygiene work, write a failing pytest case that drives the duplicate-suppression and pre-validation branches in `src/loader/agent/loop.py`.
23
-
24
-The bug:
25
-
26
-- `src/loader/llm/base.py:33-39` defines `Message` with `role`, `content`, `tool_calls`, `tool_results` — and **no** `tool_call_id` field. That field belongs to the separate `ToolResult` dataclass at `src/loader/llm/base.py:25-30`.
27
-- `src/loader/agent/loop.py:885` and `:906` both construct `Message(role=Role.TOOL, content=..., tool_call_id=tool_call.id)`.
28
-- Both call sites raise `TypeError: Message.__init__() got an unexpected keyword argument 'tool_call_id'` the first time they execute.
29
-- They live on the duplicate-suppression branch and the pre-validation-failure branch, neither of which has any integration coverage today.
30
-
31
-Why this is the first deliverable:
32
-
33
-- it proves the bug is real
34
-- it proves the harness exists
35
-- it gives Sprint 01 a green-bar target rather than a vague refactor goal
36
-- it answers the question "are there other bugs like this?" by forcing us to actually drive the loop
37
-
38
-The test should fail today and remain failing until Sprint 01 fixes the message contract.
39
-
40
-### 2. Project hygiene and product basics
41
-
42
-- rewrite `README.md` (currently still says "FortranGoingOnForty / A tutorial on using Fortran for beginners")
43
-- ensure `refs/` remains gitignored
44
-- document the current runtime surface and limitations in `.docs/`
45
-
46
-### 3. Test execution that works by default
47
-
48
-The current state of `uv run pytest --collect-only`:
49
-
50
-- picks up `refs/claw-code/tests/test_porting_workspace.py` (because there is no `tool.pytest.ini_options` block in `pyproject.toml`)
51
-- fails to import the `loader` package in the resolved env
52
-- collects 0 tests
53
-
54
-Fix:
55
-
56
-- add `[tool.pytest.ini_options]` to `pyproject.toml` with `testpaths = ["tests"]`
57
-- ensure the package is importable under the test invocation path
58
-- add `extend-exclude = ["refs"]` to the ruff configuration so refs/ does not contaminate lint runs
59
-- add `exclude = ["refs/"]` to the mypy configuration for the same reason
60
-- document the canonical dev/test invocation in the README and in `CLAUDE.md`
61
-- `uv run pytest` (with no flags) must succeed out of the box
62
-
63
-### 4. Runtime behavior harness
64
-
65
-Create a deterministic mock backend and scenario harness for the current turn loop.
66
-
67
-**Pattern to copy:** `refs/claw-code/rust/crates/rusty-claude-cli/tests/mock_parity_harness.rs`. claw-code already implements this exact design — a mock service plus a scripted scenario taxonomy. Port the taxonomy directly rather than inventing a parallel one.
68
-
69
-Minimum scenarios (mirroring claw-code's list, adapted for Loader):
70
-
71
-- simple answer with no tools (`streaming_text`)
72
-- single read tool call (`read_file_roundtrip`)
73
-- multi-tool turn (`multi_tool_turn_roundtrip`)
74
-- write allowed (`write_file_allowed`)
75
-- write denied (`write_file_denied`) — initially via skip-confirmation default; rewired to permission policy in Sprint 03
76
-- bash success (`bash_stdout_roundtrip`)
77
-- bash confirmation prompt approved/denied
78
-- extracted/raw-text tool call fallback
79
-- completion-check continuation
80
-- duplicate action suppression (this scenario will hit the bug from deliverable 1)
81
-
82
-### 5. Baseline behavior document
83
-
84
-Create a parity checklist for Loader's own runtime behavior:
85
-
86
-- what is supported
87
-- what is flaky
88
-- what is intentionally out of scope
89
-- what scenarios must stay green
90
-
91
-## Testing strategy
92
-
93
-- the `tool_call_id` regression test fails today (pre-fix) and is committed in its failing state, gated only by the harness
94
-- default `uv run pytest` succeeds and collects only Loader's own tests
95
-- harness scenarios produce stable results
96
-- at least one integration test exercises the full turn loop end-to-end
97
-- baseline parity checklist is committed and auditable
98
-
99
-## Definition of done
100
-
101
-- the `tool_call_id` regression test exists and is failing for the right reason
102
-- `README.md` correctly describes Loader
103
-- `pytest`/`uv` workflow is defined and working
104
-- ruff and mypy do not walk into `refs/`
105
-- Loader has a deterministic runtime test harness with the scenario taxonomy ported from claw-code
106
-- current runtime behavior is documented honestly
107
-- we can measure regressions before changing the loop
.docs/sprints/sprint01.mddeleted
@@ -1,127 +0,0 @@
1
-# Sprint 01: Turn Engine, Tool Contract, and Capability Profiles
2
-
3
-## Prerequisites
4
-
5
-Sprint 00
6
-
7
-## Goals
8
-
9
-Replace Loader's current monolithic loop with a smaller, typed, explicit turn engine, fix the structural bugs Sprint 00's tests are catching, and stop guessing model capabilities from substring matches.
10
-
11
-The reference for this sprint is `refs/claw-code/rust/crates/runtime/src/conversation.rs:295-470` — about 175 lines that do what `agent/loop.py` takes ~1500 lines to half-do. Loader's runtime should aim for that shape, not that language.
12
-
13
-## Deliverables
14
-
15
-### 1. New runtime package
16
-
17
-Create a dedicated runtime layer:
18
-
19
-```text
20
-src/loader/runtime/
21
-├── conversation.py      # the typed turn engine (analog of conversation.rs)
22
-├── session.py           # message history, ownership of the conversation state
23
-├── executor.py          # the unified tool execution path (see deliverable 2)
24
-├── events.py            # the typed AgentEvent surface, moved out of agent/loop.py
25
-├── tracing.py           # debug/observability hooks
26
-└── capabilities.py      # see deliverable 4
27
-```
28
-
29
-The new runtime package owns runtime correctness. `agent/loop.py` becomes a thin orchestration layer that builds prompts and dispatches to `runtime.conversation.run_turn()`. Reasoning helpers stay in `agent/reasoning.py` but are *called* by the runtime, not embedded inside its loop.
30
-
31
-### 2. Unified tool execution path
32
-
33
-Loader currently has two effectively distinct execution paths:
34
-
35
-- the main native/ReAct path
36
-- the "raw extracted JSON tool call" fallback path (used when the model leaks tool syntax through streaming)
37
-
38
-These paths duplicate confirmation, validation, dedup, and result-recording logic. Fixes in one rarely land in the other.
39
-
40
-Sprint 01 collapses both into a single `runtime.executor.ToolExecutor`. The executor owns:
41
-
42
-- authorization (delegating to whatever permission contract exists today; the policy layer arrives in Sprint 03)
43
-- duplicate suppression
44
-- tool execution
45
-- tool-result message construction (using the corrected `Message` schema from deliverable 3)
46
-- tracing
47
-- error classification
48
-
49
-There is exactly one path tool calls flow through, regardless of how they were extracted from the model output.
50
-
51
-### 3. Correct tool-result message model — fix the named bug
52
-
53
-**Bug A** (caught by Sprint 00's failing regression test): `src/loader/agent/loop.py:885` and `:906` construct `Message(role=Role.TOOL, content=..., tool_call_id=tool_call.id)` against a `Message` dataclass at `src/loader/llm/base.py:33-39` that has no `tool_call_id` field. Sprint 00 wrote the regression test; Sprint 01 makes it pass.
54
-
55
-The fix is *not* to add a `tool_call_id` kwarg to `Message`. The fix is to introduce an explicit tool-result message representation — either a `ToolResultMessage` class or a richer `Message` schema where tool-result rows carry a typed `ToolResult` payload (the existing `ToolResult` dataclass at `src/loader/llm/base.py:25-30` already has `tool_call_id`, `content`, `is_error` and is the right shape).
56
-
57
-**Bug B** (uncovered by deliverable 2): the duplicate execution path. Same fix mechanism — there is one executor, so there is one place that constructs tool-result messages.
58
-
59
-The Sprint 00 regression tests gate this work. They must turn green before the sprint can close.
60
-
61
-### 4. Capability profiles — replace substring-based model detection
62
-
63
-Loader currently decides whether to use native tool calling or ReAct prompting by substring-matching against two hard-coded sets in `src/loader/llm/ollama.py`:
64
-
65
-```python
66
-NATIVE_TOOL_MODELS = {"llama3.1", "llama3.2", ...}
67
-NO_TOOL_MODELS = {"phi", "gemma", ...}
68
-```
69
-
70
-This is brittle. Adding a new model means editing source. Models that should work do not, and vice versa. The user explicitly wants Loader to behave consistently across model choices.
71
-
72
-Replace this with a `runtime/capabilities.py` module that defines a `CapabilityProfile` dataclass:
73
-
74
-- `supports_native_tools: bool`
75
-- `supports_streaming: bool`
76
-- `context_window: int`
77
-- `preferred_tool_call_format: Literal["native", "json_tag", "bracket"]`
78
-- `verification_strictness: Literal["lax", "standard", "strict"]`
79
-- `notes: list[str]`
80
-
81
-Profiles are resolved by:
82
-
83
-1. explicit user override (CLI flag or config file)
84
-2. exact model-name match in a built-in registry
85
-3. heuristic fallback (probe `/api/show` from Ollama, inspect `details.families`, fall back to safe defaults)
86
-
87
-The runtime asks the profile what to do; it never substring-matches model names.
88
-
89
-### 5. Turn summary output
90
-
91
-Each completed turn produces a structured `TurnSummary` containing:
92
-
93
-- assistant messages
94
-- tool results
95
-- iterations
96
-- failures
97
-- verification status (filled in by Sprint 02)
98
-- usage metadata if available
99
-
100
-Modeled on `refs/claw-code/rust/crates/runtime/src/conversation.rs:110-117` (`TurnSummary` struct).
101
-
102
-## Testing strategy
103
-
104
-- Sprint 00's `tool_call_id` regression test passes
105
-- Sprint 00's duplicate-suppression scenario passes through the unified executor
106
-- new integration tests cover native-tool turns and ReAct-style turns going through the same executor
107
-- regression tests prove that extracted-fallback and native calls share one path (e.g., assert the same trace events fire from both entry points)
108
-- capability profiles have unit tests for the resolution priority order
109
-- a `TurnSummary` smoke test asserts the structured output is populated for a multi-tool turn
110
-
111
-## Definition of done
112
-
113
-- `agent/loop.py` is no longer the sole owner of runtime correctness — it delegates to `runtime.conversation`
114
-- tool execution logic is centralized in `runtime.executor`
115
-- the message schema is internally consistent and the named `tool_call_id` bug is fixed
116
-- capability profiles replace substring-based model detection
117
-- full-turn tests cover the critical runtime paths
118
-- the parity checklist from Sprint 00 reflects the new state
119
-
120
-## Audit Notes
121
-
122
-Audit checkpoint on 2026-04-06:
123
-
124
-- closed the gap where Ollama `/api/show` probing existed in code but did not affect the agent before the first request
125
-- collapsed `Agent.run_streaming()` onto the primary runtime path so Loader no longer carries a second shadow execution loop
126
-- expanded deterministic parity coverage for `TurnSummary`, shared executor tracing, capability refresh, and streaming delegation
127
-- residual debt remains in runtime size and heuristics, but the Sprint 01 contract is now implemented rather than partially scaffolded
.docs/sprints/sprint02.mddeleted
@@ -1,134 +0,0 @@
1
-# Sprint 02: Definition of Done and Verify/Fix Loop
2
-
3
-## Prerequisites
4
-
5
-Sprint 01
6
-
7
-## Goals
8
-
9
-Replace heuristic completion with an evidence-backed completion contract. This is the single highest-leverage behavioral change in the plan and the direct answer to the user's stated complaints:
10
-
11
-- finishing too early without followup
12
-- weak tool follow-through
13
-- poor task closure
14
-
15
-The reference for the contract shape is `refs/oh-my-codex/skills/ralph/SKILL.md` (the persistence-until-done protocol) and `refs/oh-my-codex/src/verification/verifier.ts` (task-size-aware evidence scaling).
16
-
17
-This sprint deliberately runs *before* permission modes (Sprint 03). Permissions are a safety win; this is the behavior win, and the user asked for behavior first.
18
-
19
-## Deliverables
20
-
21
-### 1. Definition of Done object
22
-
23
-Add a `DefinitionOfDone` dataclass that every non-trivial task carries from start to finish:
24
-
25
-```python
26
-@dataclass
27
-class DefinitionOfDone:
28
-    task_statement: str
29
-    acceptance_criteria: list[str]          # the testable claims the task must satisfy
30
-    verification_commands: list[str]        # commands whose output is the evidence
31
-    pending_items: list[str]                # outstanding subtasks (zero before completion)
32
-    completed_items: list[str]              # finished subtasks (audit trail)
33
-    evidence: list[VerificationEvidence]    # populated by the verify phase
34
-    confidence: Literal["high", "medium", "low"]
35
-    status: Literal["draft", "in_progress", "verifying", "fixing", "done", "failed"]
36
-```
37
-
38
-The DoD object is constructed at task entry and updated as the runtime advances. It is the single source of truth for "is this task done?".
39
-
40
-### 2. Verify phase in the runtime
41
-
42
-Add an explicit verify phase to `runtime.conversation` that runs *after* the model thinks it is done but *before* the turn returns a final answer.
43
-
44
-The verify phase:
45
-
46
-- runs each `verification_commands` entry as a tool call
47
-- captures stdout, stderr, and exit code as `VerificationEvidence`
48
-- attaches each piece of evidence to the DoD object
49
-- gates completion on (a) all verification commands exited zero, (b) `pending_items` is empty, (c) the model has produced an evidence summary that references the captured output
50
-
51
-If any of those gates fail, the runtime moves into the fix phase rather than completing.
52
-
53
-### 3. Fix loop
54
-
55
-Verification failure does not return to the user. It returns to execution with:
56
-
57
-- the failed evidence attached to the next prompt
58
-- a structured "what failed and why" message
59
-- a bounded retry budget (default 3 attempts; configurable per task via DoD)
60
-- escalation to the user only when the budget is exhausted
61
-
62
-This is the same shape as `refs/oh-my-codex/skills/ralph/SKILL.md` Step 7.5–7.6 (mandatory verification + regression re-verification + retry on failure).
63
-
64
-### 4. Task-size-aware evidence requirements
65
-
66
-Borrow the sizing model from `refs/oh-my-codex/src/verification/verifier.ts:99-106`:
67
-
68
-- **small** (≤3 files, <100 lines changed): minimal verification — typecheck + tests for affected modules
69
-- **standard** (≤15 files, <500 lines): full stack — typecheck + tests + lint + smoke
70
-- **large** (>15 files or >500 lines): comprehensive — typecheck + tests + lint + integration + regression
71
-
72
-Sizing is computed from the actual tool call history of the turn, not guessed from the prompt.
73
-
74
-Conversational/lookup-only tasks (no tool calls that mutate state) skip the verify phase entirely. This is what keeps simple tasks cheap.
75
-
76
-### 5. Minimum `.loader/` directory layout
77
-
78
-Create the minimum state directory shape so DoD objects can be persisted:
79
-
80
-```text
81
-.loader/
82
-└── dod/
83
-    └── {timestamp}-{task-slug}.json
84
-```
85
-
86
-This is intentionally narrow. The full session/memory/compaction layout is Sprint 05's job. Sprint 02 only needs somewhere to put DoD objects so that fix-loop continuations can reload them after process restarts, and so that future sprints have a directory to extend.
87
-
88
-`.loader/` should be added to `.gitignore` as part of this sprint.
89
-
90
-### 6. CLI/TUI surfaces for the DoD state
91
-
92
-The TUI status line and the non-TUI CLI output both need to surface:
93
-
94
-- current DoD status (`draft` / `in_progress` / `verifying` / `fixing` / `done`)
95
-- pending items count
96
-- last verification result
97
-
98
-This is what makes the contract visible to the user instead of hidden inside the runtime.
99
-
100
-## Testing strategy
101
-
102
-- a task with verification commands cannot complete without evidence (assert: completion attempted before evidence collected → routed to verify phase)
103
-- a verification failure routes to fix loop, not to final answer
104
-- the fix loop respects the retry budget and escalates to the user on exhaustion
105
-- a conversational task (no mutating tool calls) skips verify entirely
106
-- DoD objects round-trip through `.loader/dod/` and survive a simulated process restart
107
-- task sizing classifies correctly across small/standard/large boundaries
108
-- the CLI/TUI status surfaces show the right phase at each transition
109
-
110
-## Definition of done
111
-
112
-- Loader no longer relies on heuristic continuation prompts alone
113
-- completion is explicit and evidence-backed
114
-- `DefinitionOfDone` objects exist on disk under `.loader/dod/`
115
-- failed verification cannot escape into a "looks done" final answer
116
-- simple tasks stay cheap (verify is skipped); complex tasks enter the verify/fix loop automatically
117
-- the user can see the DoD phase from the CLI and TUI
118
-
119
-## Audit Notes
120
-
121
-Audit checkpoint on 2026-04-06:
122
-
123
-- added a persisted `DefinitionOfDone` runtime object under `src/loader/runtime/dod.py` and store-backed state under `.loader/dod/`
124
-- routed mutating tasks through an explicit verify/fix gate in `src/loader/runtime/conversation.py`, with retry-budget exhaustion returning an honest failure summary instead of a premature success
125
-- taught verification runs to execute through the shared executor with duplicate suppression disabled, confirmations skipped, and project-root working-directory awareness
126
-- tightened duplicate suppression so rewrites used for recovery are allowed while true same-content rewrites are still skipped
127
-- surfaced DoD state in both the non-TUI CLI and the TUI status line, and added deterministic coverage for runtime parity, DoD persistence/sizing, and status formatting
128
-- full verification is green at `uv run pytest -q` with 90 passing tests
129
-
130
-Residual debt after Sprint 02:
131
-
132
-- DoD acceptance criteria and pending items are still runtime-derived and shallow; Loader does not yet have the richer task/workflow artifacts planned in Sprint 04 and Sprint 05
133
-- verification summaries are runtime-generated from captured evidence rather than model-authored evidence explanations
134
-- task-size-aware verification is intentionally conservative today; larger-task evidence scaling still has room to move closer to the reference verifier design
.docs/sprints/sprint03.mddeleted
@@ -1,135 +0,0 @@
1
-# Sprint 03: Permission Modes and Tool Lifecycle Hooks
2
-
3
-## Prerequisites
4
-
5
-Sprint 02
6
-
7
-## Goals
8
-
9
-Move Loader from confirmation-only safety to policy-based runtime safety, and introduce the tool lifecycle hooks that all subsequent runtime behavior should hang on.
10
-
11
-The two deliverable groups land together because they share the same code path. claw-code's `conversation.rs:370-453` shows the pattern: every tool call flows through pre-hook → permission check → execute → post-hook (success or failure variant). Loader needs the same shape so that Sprint 04's mode router and Sprint 05's session/memory work plug into a stable lifecycle instead of patching `loop.py` again.
12
-
13
-This sprint deliberately runs *after* the DoD work in Sprint 02, because permissions are a safety win, not a behavior win, and the user asked for behavior first.
14
-
15
-## Deliverables
16
-
17
-### 1. Permission modes
18
-
19
-Add explicit runtime permission modes mirroring `refs/claw-code/rust/crates/runtime/src/permissions.rs:8-27`:
20
-
21
-- `read-only`
22
-- `workspace-write`
23
-- `danger-full-access`
24
-
25
-(claw-code also has `Prompt` and `Allow` variants — Loader can defer those.)
26
-
27
-Each tool declares the minimum permission level it requires via a `required_permission` attribute on `Tool`. The default registry's tools map roughly to:
28
-
29
-- `read-only`: `read`, `glob`, `grep`
30
-- `workspace-write`: `write`, `edit`
31
-- `danger-full-access`: `bash`
32
-
33
-### 2. PermissionPolicy
34
-
35
-A `PermissionPolicy` object owns the active mode and the per-tool requirement map:
36
-
37
-```python
38
-@dataclass
39
-class PermissionPolicy:
40
-    active_mode: PermissionMode
41
-    tool_requirements: dict[str, PermissionMode]
42
-    workspace_root: Path
43
-```
44
-
45
-The rule layer (`allow_rules` / `deny_rules` / `ask_rules`) from claw-code is **deferred**. Sprint 03 only needs modes and tool requirements; rules can come later if they prove necessary.
46
-
47
-### 3. Tool lifecycle hooks — three events
48
-
49
-Add a three-event hook lifecycle modeled directly on `refs/claw-code/rust/crates/runtime/src/hooks.rs:19-34`:
50
-
51
-- `pre_tool_use` — runs before authorization; can override input, deny, or inject messages
52
-- `post_tool_use` — runs after successful execution; can modify output or add follow-up messages
53
-- `post_tool_use_failure` — runs after a tool error; separate from `post_tool_use` so failure handling does not have to branch on `is_error`
54
-
55
-The `runtime.executor.ToolExecutor` from Sprint 01 calls hooks at each lifecycle point. Hook results are typed (allow / deny / cancel / fail / inject-message / override-permission) and the executor merges hook feedback into the tool result message.
56
-
57
-This is the most important architectural piece in the sprint, because Sprints 04, 05, and 06 will all want to add lifecycle behavior. Without hooks, every later sprint would patch `loop.py` again.
58
-
59
-### 4. Refactor safeguards.py into hook implementations
60
-
61
-`src/loader/agent/safeguards.py` (1079 LOC) currently does duplicate detection, validation, and rollback tracking via ad-hoc method calls inside `loop.py`. Refactor each of those into a `pre_tool_use` hook implementation:
62
-
63
-- `DuplicateActionHook` — what `safeguards.check_duplicate()` does today
64
-- `ActionValidationHook` — what `safeguards.validate_action()` does today
65
-- `RollbackTrackingHook` — what `loop.py:917-935` does today
66
-
67
-The streaming filter (`CodeBlockFilter`) is a separate concern and stays for now, but Sprint 03 should mark it as "candidate for removal once the typed runtime makes the leakage it filters impossible."
68
-
69
-### 5. File operation hardening
70
-
71
-Match the safety guards in `refs/claw-code/rust/crates/runtime/src/file_ops.rs`:
72
-
73
-- workspace boundary enforcement (with `canonicalize()` before the boundary check, to defeat symlink escapes)
74
-- file size limits (10 MB read, 10 MB write — same as claw-code's `MAX_READ_SIZE` / `MAX_WRITE_SIZE`)
75
-- binary file detection (NUL byte in the first 8 KB)
76
-- structured patch metadata for edits/writes (return `StructuredPatchHunk` data alongside the human-readable diff)
77
-
78
-These guards live in the file tools themselves, not in hooks — they are intrinsic to the operation, not policy.
79
-
80
-### 6. Shell operation hardening
81
-
82
-- command mutability classification (read-only vs mutating, used by the `read-only` mode policy)
83
-- read-only safe-command policy
84
-- prompt-mode authorization path
85
-- structured stderr/exit-code result
86
-- output truncation with metadata when output exceeds a budget
87
-
88
-### 7. CLI/TUI visibility
89
-
90
-Expose the active permission mode in:
91
-
92
-- the TUI status line
93
-- the non-TUI CLI startup banner
94
-- `loader status` (which lands in Sprint 06, but wire the data source now)
95
-
96
-Show the mode with a color hint: green for `read-only`, yellow for `workspace-write`, red for `danger-full-access`.
97
-
98
-## Testing strategy
99
-
100
-- `read-only` mode denies writes and mutating shell commands; verify via the unified executor and the hook lifecycle
101
-- `workspace-write` allows in-repo changes but denies writes outside `workspace_root`
102
-- `danger-full-access` allows everything
103
-- file boundary tests cover `../`, symlink escape (canonicalized first), binary, and oversize cases
104
-- shell tests cover read-only safe-command policy and output truncation
105
-- hook lifecycle tests assert all three events fire in order, and that a `pre_tool_use` deny short-circuits execution but still produces a typed tool-result message
106
-- the refactored `DuplicateActionHook` / `ActionValidationHook` / `RollbackTrackingHook` produce the same observable behavior as the old `safeguards.py` paths (use Sprint 00's harness scenarios as the regression suite)
107
-- verifying that hooks compose: a `pre_tool_use` deny + a `post_tool_use_failure` hook still emits exactly one tool-result message
108
-
109
-## Definition of done
110
-
111
-- Loader has explicit permission modes
112
-- file and shell safety rules are enforced in the runtime, not in the UI
113
-- the three-event tool lifecycle is in place and `safeguards.py` has been refactored into hook implementations
114
-- the streaming filter is annotated as deprecated-pending-removal
115
-- the safety behavior is covered by automated tests
116
-- the CLI/TUI surfaces the active mode
117
-- Sprint 04, 05, and 06 have a clean lifecycle to plug into instead of patching `loop.py`
118
-
119
-## Audit Notes
120
-
121
-Audit checkpoint on 2026-04-06:
122
-
123
-- added `PermissionMode`, `PermissionPolicy`, and lazy runtime exports under `src/loader/runtime/permissions.py` and `src/loader/runtime/__init__.py`
124
-- refactored tool execution so `ToolExecutor` now runs hooks before and after policy evaluation in `src/loader/runtime/executor.py`
125
-- added lifecycle hook infrastructure in `src/loader/runtime/hooks.py`, including `DuplicateActionHook`, `ActionValidationHook`, `RollbackTrackingHook`, and a success-side action-history hook for loop/dedup tracking
126
-- hardened file and search tools with canonicalized workspace-root enforcement, symlink escape blocking, binary detection, file-size limits, and structured patch metadata
127
-- hardened shell execution with permission classification, structured truncation metadata, and mode-aware authorization
128
-- surfaced the active permission mode in the CLI startup banner and the TUI status line, and wired `Agent.active_permission_mode` as the current data source for later status/session work
129
-- full verification is green at `uv run pytest -q` with 106 passing tests
130
-
131
-Residual debt after Sprint 03:
132
-
133
-- Loader now has mode-based permission policy, but the richer rule system (`allow` / `deny` / `ask`) is still deferred
134
-- destructive operations still flow through the legacy confirmation path after policy allows them, so Loader has not fully matched claw-code's prompt/allow permission model
135
-- shell mutability classification is still heuristic and intentionally conservative rather than deeply semantic
.docs/sprints/sprint04.mddeleted
@@ -1,141 +0,0 @@
1
-# Sprint 04: Mode Router, Clarify Mode, and Plan Artifacts
2
-
3
-## Prerequisites
4
-
5
-Sprint 03
6
-
7
-## Goals
8
-
9
-Teach Loader to separate kinds of work instead of improvising one workflow for everything. Builds on the DoD contract from Sprint 02 and the hook lifecycle from Sprint 03.
10
-
11
-This sprint is the direct answer to:
12
-
13
-- spending too long on simple tasks (modes route lookups out of the full loop)
14
-- overthinking small work (clarify only fires for genuinely ambiguous prompts)
15
-- jumping into execution without a plan on complex work
16
-
17
-The references are `refs/oh-my-codex/skills/deep-interview/SKILL.md` (clarify mode) and `refs/oh-my-codex/skills/ralplan/SKILL.md` (plan mode). Loader should copy the artifact discipline first; the full Planner/Architect/Critic loop and the ambiguity-scoring formula are stretch goals, not Sprint 04 requirements.
18
-
19
-## Deliverables
20
-
21
-### 1. Tool prerequisites pulled forward from Sprint 06
22
-
23
-The clarify and DoD work both need tools that don't exist yet. Add them now rather than at the end of the plan:
24
-
25
-- **`TodoWrite`** — task/todo tracking. The "zero pending tasks" gate in Sprint 02's DoD contract is currently empty because there is no tool to write tasks into. Without `TodoWrite`, Sprint 02's contract has a hole.
26
-- **`AskUserQuestion`** — structured user-question surface. The clarify mode needs a way to ask one question per round (per `deep-interview/SKILL.md`) rather than embedding questions in free-form responses.
27
-
28
-These two tools are the minimum new tool surface this sprint introduces. The broad expansion (diff/patch-aware editing, git helpers, web fetch, etc.) stays in Sprint 06.
29
-
30
-Both tools declare their permission level (`read-only` for `TodoWrite` writing to `.loader/`, `read-only` for `AskUserQuestion`) and integrate with the hook lifecycle from Sprint 03.
31
-
32
-### 2. Mode router
33
-
34
-Introduce a router that selects a mode at task entry based on the task shape and explicit user intent:
35
-
36
-- `clarify` — fired when ambiguity score crosses a threshold or the user invokes `--clarify`
37
-- `plan` — fired for complex tasks (heuristic on prompt length / signal density, or explicit `--plan`)
38
-- `execute` — the default for concrete actionable prompts
39
-- `verify` — already exists from Sprint 02; the router wires it in as the gate after `execute`
40
-
41
-The router does not have to be smart in this sprint. It needs to be *explicit*: the chosen mode is logged, surfaced in the TUI status line, and recorded in the DoD object.
42
-
43
-A heuristic-only first pass is fine. Borrowing the OMX deep-interview ambiguity formula is a stretch goal.
44
-
45
-### 3. Clarify artifact
46
-
47
-When the router selects `clarify` mode, the runtime drives a one-question-per-round loop (using `AskUserQuestion`) and writes a task brief to `.loader/briefs/{timestamp}-{slug}.md` containing:
48
-
49
-- task statement
50
-- desired outcome
51
-- in-scope items
52
-- non-goals
53
-- decision boundaries
54
-- constraints
55
-- likely touchpoints
56
-- assumptions
57
-
58
-This is a simplified port of `refs/oh-my-codex/skills/deep-interview/SKILL.md` Phase 4. Loader does not need the full ambiguity scoring or pressure-pass discipline yet. It needs the artifact.
59
-
60
-The brief is then handed off as input to the next mode (`plan` or `execute`), and the DoD object's `acceptance_criteria` is seeded from the brief.
61
-
62
-### 4. Planning artifacts
63
-
64
-When the router selects `plan` mode, the runtime produces two persistent artifacts under `.loader/plans/{timestamp}-{slug}/`:
65
-
66
-- `implementation.md` — what files change, in what order, with what risks
67
-- `verification.md` — the verification commands and acceptance criteria that will populate the DoD object
68
-
69
-These do not need full OMX ralplan complexity (Planner/Architect/Critic with iteration cap). They need to:
70
-
71
-- exist on disk
72
-- survive across turns (so a process restart can resume planning)
73
-- feed directly into the DoD object created by Sprint 02
74
-- be visible to the user via the TUI
75
-
76
-### 5. Mode-specific prompts
77
-
78
-Replace Sprint 00's single generic system prompt with mode-specific prompts:
79
-
80
-- `clarify` mode prompt enforces "ask one question, do not propose solutions yet"
81
-- `plan` mode prompt enforces "produce two artifacts, do not start writing code"
82
-- `execute` mode prompt is the current Loader system prompt, trimmed (the global "no numbered steps" rule that breaks reporting tasks should be relaxed)
83
-- `verify` mode prompt enforces "run the verification commands, attach evidence, do not declare done without zero failures"
84
-
85
-Mode-specific prompts are how the routing decision actually changes model behavior.
86
-
87
-### 6. Wire artifacts into the DoD object
88
-
89
-The DoD object from Sprint 02 gains optional links:
90
-
91
-- `clarify_brief: Path | None`
92
-- `implementation_plan: Path | None`
93
-- `verification_plan: Path | None`
94
-
95
-When verify phase runs, it pulls verification commands from `verification.md` if present, falling back to the inline DoD list otherwise.
96
-
97
-## Testing strategy
98
-
99
-- ambiguous prompts route into `clarify` (assert via the mock harness)
100
-- complex prompts route into `plan`
101
-- simple lookups route directly into `execute` and skip the verify phase entirely
102
-- clarify briefs round-trip through `.loader/briefs/`
103
-- planning artifacts round-trip through `.loader/plans/`
104
-- the DoD object correctly absorbs `acceptance_criteria` from a clarify brief and `verification_commands` from a plan
105
-- mode-specific prompts produce mode-appropriate behavior under the mock harness
106
-- a verify failure that returns to `execute` (per Sprint 02's fix loop) does not re-trigger `clarify` or `plan`
107
-
108
-## Definition of done
109
-
110
-- Loader routes tasks into modes instead of treating every prompt the same way
111
-- clarify briefs and planning artifacts exist on disk and feed the DoD object
112
-- `TodoWrite` and `AskUserQuestion` tools exist and are wired into the hook lifecycle
113
-- mode-specific prompts replace the single generic prompt
114
-- simple tasks stay lightweight; complex tasks become more structured
115
-- the TUI surfaces the active mode
116
-
117
-## Audit
118
-
119
-### Landed
120
-
121
-- `TodoWrite` now persists workflow tasks under `.loader/todos/` and `AskUserQuestion` now routes through the same typed executor path as every other tool
122
-- Loader now routes entry turns through a heuristic `clarify` / `plan` / `execute` decision and records the active mode in the DoD object
123
-- `clarify` mode now asks one structured question, writes a persisted brief under `.loader/briefs/`, and seeds DoD acceptance criteria from that artifact
124
-- `plan` mode now writes persisted `implementation.md` and `verification.md` artifacts under `.loader/plans/`, seeds DoD verification commands from `verification.md`, and seeds todo state from the implementation steps
125
-- the verify gate now advertises `verify` mode explicitly, reads verification commands from the persisted plan when present, and returns to `execute` on retry without re-running clarify or plan
126
-- mode-specific system prompts now change model behavior by mode instead of using one generic prompt for every turn
127
-- CLI and TUI surfaces now show workflow mode transitions, artifact creation, and real `AskUserQuestion` answer collection
128
-
129
-### Verification
130
-
131
-- `uv run pytest -q` is green: `126 passed`
132
-- `tests/test_runtime_harness.py` now includes explicit workflow parity for clarify routing, plan routing, and verify-fix handoff stability
133
-- `tests/test_workflow.py` and `tests/test_workflow_runtime.py` cover the artifact stores, router heuristics, DoD wiring, and mode-specific runtime behavior
134
-- `tests/test_workflow_tools.py` and `tests/test_workflow_runtime_tools.py` cover the new tool contracts and user-question callback plumbing
135
-
136
-### Residual debt
137
-
138
-- clarify mode is intentionally shallow compared with OMX deep-interview: one question, one brief, no pressure-pass loop, and no ambiguity re-scoring
139
-- plan mode is intentionally lighter than OMX ralplan: one pass, no consensus review agents, and no ADR-quality output contract yet
140
-- todo state is now real, but Loader still auto-clears remaining plan todos on successful verification rather than requiring richer per-step completion semantics
141
-- `conversation.py` is now more capable but even more responsibility-dense, so Sprint 05+ should keep carving this into smaller runtime components instead of letting the turn engine become the new monolith
.docs/sprints/sprint05.mddeleted
@@ -1,154 +0,0 @@
1
-# Sprint 05: Session State, Memory, and Compaction
2
-
3
-## Prerequisites
4
-
5
-Sprint 04
6
-
7
-## Goals
8
-
9
-Give Loader durable continuity.
10
-
11
-This sprint should reduce re-discovery, improve multi-turn coherence, and support longer-running tasks without drowning the model in raw transcript.
12
-
13
-The references for this sprint are:
14
-
15
-- `refs/claw-code/rust/crates/runtime/src/session.rs` (session persistence with rotation and compaction metadata)
16
-- `refs/claw-code/rust/crates/runtime/src/compact.rs` (auto-compaction trigger)
17
-- `refs/claw-code/rust/crates/runtime/src/summary_compression.rs` (priority-aware line-level summarization — model on this rather than reinventing)
18
-- `refs/oh-my-codex/src/mcp/memory-server.ts` (project memory + notepad surfaces)
19
-
20
-## Deliverables
21
-
22
-### 1. Full session store under `.loader/`
23
-
24
-The minimum `.loader/` shape was created in Sprint 02 (`.loader/dod/`). This sprint extends it to the full layout:
25
-
26
-```text
27
-.loader/
28
-├── dod/                  # already exists from Sprint 02
29
-├── briefs/               # already exists from Sprint 04
30
-├── plans/                # already exists from Sprint 04
31
-├── sessions/             # NEW — persisted conversation state
32
-├── state/                # NEW — runtime state (current session pointer, etc.)
33
-├── notepad.md            # NEW — durable working notes
34
-└── project-memory.json   # NEW — repo conventions and user directives
35
-```
36
-
37
-Sessions are persisted with:
38
-
39
-- session id
40
-- created/updated timestamps
41
-- messages (using the Sprint 01 typed message schema)
42
-- compaction metadata (when applicable)
43
-- DoD object reference (link to the active task's DoD file in `.loader/dod/`)
44
-- usage tracking
45
-
46
-File rotation: cap individual session files at ~256 KB and rotate, matching `refs/claw-code/rust/crates/runtime/src/session.rs:13-14`.
47
-
48
-### 2. Resume support
49
-
50
-Allow Loader to resume the latest or a named session:
51
-
52
-- `loader --resume` resumes the most recent session
53
-- `loader --resume <session-id>` resumes by id
54
-- `loader session list` (which lands in Sprint 06) shows available sessions
55
-
56
-Resume must restore the message history, the active DoD object, the active mode, and the active permission policy.
57
-
58
-### 3. Working memory and project memory
59
-
60
-Add tools/surfaces to:
61
-
62
-- read/write project memory (`.loader/project-memory.json` — tech stack, build commands, conventions, directives, structure)
63
-- append working notes (`.loader/notepad.md` — temporary context that survives across turns within a session)
64
-- store user directives ("we use uv, never pip" / "tests live in tests/" / etc.)
65
-
66
-These are MCP-style tools (sections similar to OMX's `project_memory_read` / `project_memory_add_note` / `project_memory_add_directive` / `notepad_read` / `notepad_write_*`), but they live as native Loader tools, not MCP servers. Loader does not need a full MCP runtime yet.
67
-
68
-Each memory tool declares `read-only` permission for the read variants and `workspace-write` for the write variants (since `.loader/` is inside the workspace).
69
-
70
-### 4. Transcript compaction
71
-
72
-Implement session compaction modeled on `refs/claw-code/rust/crates/runtime/src/summary_compression.rs`:
73
-
74
-- triggered automatically when input tokens exceed a threshold (default 100,000, matching claw-code's `DEFAULT_AUTO_COMPACTION_INPUT_TOKENS_THRESHOLD`)
75
-- preserves the most recent N messages (default 4, matching claw-code)
76
-- summarizes older context with priority-aware line-level compression:
77
-  - dedupes identical lines
78
-  - collapses inline whitespace
79
-  - prioritizes "Summary:", "Current work:", "Key files referenced:", and similar core lines
80
-  - bounded to a token/line/char budget
81
-- emits explicit continuation instructions in the summary
82
-
83
-The compacted session is written back to `.loader/sessions/` with metadata recording what was removed.
84
-
85
-### 5. Usage tracking
86
-
87
-Track per-turn and cumulative:
88
-
89
-- input tokens
90
-- output tokens
91
-- cache creation tokens (when available from the backend)
92
-- cache read tokens (when available)
93
-- tool calls
94
-- iterations
95
-
96
-Usage is attached to the `TurnSummary` from Sprint 01 and accumulated across the session. Cost estimation is a stretch goal — Loader is local-first and the dollar cost is zero, but token tracking is still useful for compaction triggers and observability.
97
-
98
-### 6. Memory hooks for the lifecycle
99
-
100
-Use the Sprint 03 hook lifecycle for:
101
-
102
-- a `post_tool_use` hook that updates `notepad.md` when the user explicitly invokes a "remember this" tool
103
-- a session-finalization hook that writes the DoD evidence summary into `project-memory.json` when relevant (e.g., "the canonical test command is `uv run pytest`")
104
-
105
-This is how the durability layer integrates with the rest of the runtime instead of bolting on.
106
-
107
-## Testing strategy
108
-
109
-- sessions persist and reload correctly across simulated process restarts
110
-- resume restores message history, DoD, mode, and permission policy
111
-- memory/notepad operations survive across turns within a session
112
-- compacted sessions preserve recent messages exactly and produce a summary that round-trips through the priority-aware compression
113
-- compaction triggers automatically at the token threshold
114
-- compacted summary contains the explicit continuation instruction
115
-- file rotation kicks in at the size cap
116
-
117
-## Definition of done
118
-
119
-- Loader can continue work across sessions without fully re-priming the model
120
-- memory/state lives outside the prompt under `.loader/`
121
-- long sessions can be compacted safely without losing the active DoD or recent messages
122
-- multi-turn work becomes more predictable
123
-- usage tracking is wired into the turn summary
124
-- the file layout is stable enough that Sprint 06's product surfaces can rely on it
125
-
126
-## Audit
127
-
128
-### Landed
129
-
130
-- Loader now persists full session snapshots under `.loader/sessions/` and tracks the active session pointer under `.loader/state/current_session.json`
131
-- persisted sessions now carry typed messages, cumulative usage totals, compaction metadata, the active DoD path, the current task, workflow mode, and permission mode
132
-- `Agent.resume_session(...)` now restores message history, the active DoD object, workflow mode, permission mode, and current task across process restarts
133
-- the CLI now supports both `loader --resume` and `loader --resume <session-id>` by rewriting that syntax into an internal hidden option before Click parsing
134
-- transcript compaction now triggers automatically at the configured input-token threshold, keeps the latest four messages verbatim, and inserts a claw-inspired continuation summary with priority-aware line compression
135
-- `TurnSummary` now carries normalized per-turn usage and cumulative session usage, with streamed Ollama responses reporting prompt/output token counts when available
136
-- Loader now exposes native `project_memory_*` and `notepad_*` tools backed by `.loader/project-memory.json` and `.loader/notepad.md`
137
-- the hook lifecycle now mirrors successful memory writes into the notepad, and finalized DoD evidence summaries are captured into project memory when verification produced useful evidence
138
-
139
-### Verification
140
-
141
-- `uv run pytest -q` is green: `137 passed`
142
-- `tests/test_session_state.py` covers persistence, resume, rotation, compaction persistence, and cumulative usage rollups
143
-- `tests/test_compaction.py` covers priority-aware summary compression and continuation-message compaction behavior
144
-- `tests/test_memory_tools.py` covers project-memory writes, notepad writes, lifecycle-hook mirroring, and DoD-summary capture into project memory
145
-- `tests/test_cli_resume.py` covers `--resume` argument rewriting for latest and named-session restore
146
-- `tests/test_runtime_harness.py` and `tests/test_workflow_runtime.py` remain green after the session/memory changes, so Sprint 05 did not regress the earlier parity baseline
147
-
148
-### Residual debt
149
-
150
-- session compaction summaries are runtime-authored heuristics; Loader still does not have claw-code's richer continuation semantics or OMX-style semantic memory extraction
151
-- the DoD-to-project-memory capture is intentionally conservative and may miss higher-value repo conventions unless the evidence summary makes them explicit
152
-- Sprint 05 restores sessions in the CLI runtime, but Sprint 06 still needs to surface session ids, listing, and inspection as first-class product commands
153
-- cache token tracking is normalized when the backend provides it, but Loader still does not estimate cost and some backends may report fewer usage fields than Ollama
154
-- `conversation.py` keeps growing as Sprint 05 logic lands, so Sprint 06+ should keep carving persistence/finalization concerns into smaller runtime components instead of leaving durability inside the turn monolith
.docs/sprints/sprint06.mddeleted
@@ -1,131 +0,0 @@
1
-# Sprint 06: Doctor, Explore, Status, and Tool Surface Expansion
2
-
3
-## Prerequisites
4
-
5
-Sprint 05
6
-
7
-## Goals
8
-
9
-Turn Loader from "a loop with a TUI" into a tool with inspectable operational surfaces and a tool surface broad enough to keep the main runtime loop from being the only place every operation happens.
10
-
11
-The references for this sprint are:
12
-
13
-- `refs/oh-my-codex/src/cli/doctor.ts` (health checks)
14
-- `refs/oh-my-codex/src/cli/explore.ts` (read-only inspection lane)
15
-- `refs/claw-code/rust/crates/tools/src/lib.rs` (the 49-tool surface — Loader does not need all of them, but the categories are the right reference)
16
-
17
-## Deliverables
18
-
19
-### 1. `loader doctor`
20
-
21
-Add a health-check surface that reports:
22
-
23
-- backend connectivity (Ollama up, model pulled)
24
-- model capability summary (resolved from the Sprint 01 capability profile)
25
-- workspace detection (Sprint 01 project context)
26
-- write access (workspace boundary, `.loader/` writable)
27
-- test/build command detection (from project context)
28
-- state/session directory health (`.loader/` exists, sessions/dod/briefs/plans dirs present, project-memory parseable)
29
-- permission mode (current default and what each tool would resolve to)
30
-
31
-Each check returns `pass | warn | fail` with a one-line message and a remediation hint.
32
-
33
-`loader doctor` should be runnable without entering the main runtime loop — it is a diagnostic, not a turn.
34
-
35
-### 2. `loader status` and `loader session`
36
-
37
-Expose:
38
-
39
-- `loader status` — current model, capability profile, permission mode, workflow mode, active session, recent verification/evidence state, DoD phase
40
-- `loader session list` — sessions in `.loader/sessions/` with id, started-at, last-updated, message count, DoD status
41
-- `loader session show <id>` — full detail for one session
42
-- `loader session resume <id>` — wired to Sprint 05's resume support
43
-
44
-Both commands read from `.loader/` and never invoke the LLM.
45
-
46
-### 3. Lightweight read-only explore lane
47
-
48
-Add an optimized read-only inspection path for:
49
-
50
-- file lookup
51
-- symbol lookup
52
-- pattern discovery
53
-- repo relationship questions ("where is X imported?", "what calls Y?")
54
-
55
-The explore lane:
56
-
57
-- runs in `read-only` permission mode by default (regardless of the user's session-wide setting)
58
-- skips the verify phase from Sprint 02 (lookups have no DoD)
59
-- skips the mode router's `clarify` and `plan` modes from Sprint 04 — it goes straight to a constrained `execute` with read-only tools only
60
-- has its own concise system prompt focused on lookup, not action
61
-
62
-Reference: `refs/oh-my-codex/src/cli/explore.ts:48-77` (allows only read-only git subcommands, validates against pipes/redirects/semicolons in tokenized form).
63
-
64
-This is what keeps simple lookups out of the full execution loop and addresses the user's "spending too long on simple tasks" complaint.
65
-
66
-### 4. Tool surface expansion
67
-
68
-Add a first serious expansion pass for Loader tools. `TodoWrite` and `AskUserQuestion` already exist from Sprint 04. New tools in this sprint:
69
-
70
-- **diff/patch-aware editing** — a tool that takes a structured patch (using the `StructuredPatchHunk` shape from Sprint 03's file_ops hardening) instead of raw old/new strings. Reduces the failure rate on multi-line edits.
71
-- **git status helper** — read-only git status / log / diff / show / branch surface, similar to OMX explore's tokenized subcommand allowlist. Lives in the explore lane but also available to the main runtime.
72
-- **memory/notepad tools** — wire the Sprint 05 memory surfaces into the tool registry so the model can call them: `project_memory_read`, `project_memory_write`, `notepad_read`, `notepad_append`.
73
-- **structured ask-user** — a richer variant of `AskUserQuestion` that can present multiple-choice options or numbered alternatives (for plan-mode handoff and verify-mode "which fix do you want?")
74
-
75
-Do **not** add team/subagent orchestration in this sprint. The solo runtime needs to be stable first. Multi-agent surfaces are deferred indefinitely.
76
-
77
-### 5. CLI/TUI consolidation
78
-
79
-By this point Loader has accumulated a lot of CLI flags and TUI surfaces. Sprint 06 should:
80
-
81
-- consolidate the flag surface (deprecate flags that are now redundant with mode router decisions)
82
-- update the TUI status line to surface model + capability profile + permission mode + workflow mode + DoD phase + session id, all in one line
83
-- ensure `loader --help` is coherent
84
-
85
-## Testing strategy
86
-
87
-- doctor reports meaningful failures for: Ollama down, model not pulled, workspace not writable, `.loader/` corrupted
88
-- doctor reports pass for a known-good local setup
89
-- status/session surfaces reflect real runtime state (assert against fixtures in `.loader/`)
90
-- explore mode handles read-only lookups without entering the full execution workflow (assert: no DoD object created, no verify phase, no mode router decision logged)
91
-- explore mode denies write attempts even if the user has `workspace-write` set globally
92
-- new tools are covered by both unit tests and the Sprint 00 mock harness
93
-- the Sprint 02 baseline parity checklist (which has been growing each sprint) covers the new product surfaces
94
-
95
-## Definition of done
96
-
97
-- Loader is operable and inspectable from outside the main runtime loop
98
-- simple inspection tasks are faster and cheaper via the explore lane
99
-- the expanded tool surface reduces prompt pressure on the main loop
100
-- doctor / status / session surfaces reflect real state
101
-- Loader feels closer to a product and less like an experiment
102
-- the team / multi-agent / hook-ecosystem deferrals are still deferred (and that is the right call)
103
-
104
-## Audit
105
-
106
-### Landed
107
-
108
-- Loader now exposes `loader doctor`, `loader status`, `loader session list`, `loader session show`, and `loader session resume` as real product surfaces backed by persisted runtime state under `.loader/`, without entering the main LLM loop
109
-- `loader doctor` reports backend health, resolved capabilities, workspace and write access, command detection, runtime-state health, and permission-mode summaries with pass/warn/fail status plus remediation hints
110
-- the read-only explore lane is live through `loader explore <prompt>` and `Agent.run_explore(...)`, with its own system prompt, constrained read-only registry, forced `read-only` permission mode, and no workflow routing or DoD persistence
111
-- Loader's tool registry now includes a structured `patch` tool, a read-only `git` helper, `notepad_append`, and richer structured `AskUserQuestion` prompts with titles, context, options, and optional freeform answers
112
-- the CLI and TUI status surfaces now show model, capability profile, mode, workflow mode, permission mode, DoD state, and active session id in a single coherent surface, and `loader --help` reflects the new product entry points
113
-- Sprint 06 coverage now extends the deterministic parity harness with explore-mode scenarios, so the constrained lookup lane is measured alongside the earlier runtime contracts
114
-
115
-### Verification
116
-
117
-- `uv run pytest -q` is green: `153 passed`
118
-- `tests/test_inspection.py` covers doctor health reporting, persisted status/session inspection, root help text, and session-resume dispatch
119
-- `tests/test_explore_runtime.py` covers the direct explore-lane contract and forced read-only behavior
120
-- `tests/test_expanded_tools.py` covers structured patch application, read-only git tooling, `notepad_append`, and richer `AskUserQuestion` behavior
121
-- `tests/test_runtime_harness.py` remains green and now includes deterministic explore-mode parity scenarios in addition to the earlier runtime baseline
122
-- `tests/test_status_surfaces.py` covers the consolidated CLI/TUI capability-profile and session-id status formatting
123
-
124
-### Residual debt
125
-
126
-- explore mode is intentionally one-shot and read-only; Loader still does not have a richer interactive inspection lane or OMX-style repo-navigation ergonomics
127
-- the CLI surface is more coherent, but Sprint 06 does not fully deprecate every older entry path or simplify all historical flag combinations
128
-- the read-only `git` helper is still much narrower than claw-code and OMX's broader repo/product surfaces
129
-- the structured `patch` tool improves multi-line edits, but Loader still lacks AST-aware, LSP-aware, or symbol-aware editing semantics
130
-- `conversation.py` remains a large runtime module even though explore now bypasses it for simple lookup work
131
-- multi-agent and team-oriented surfaces remain intentionally deferred, and that continues to be the right tradeoff for Loader at this stage
.docs/sprints/sprint07.mddeleted
@@ -1,166 +0,0 @@
1
-# Sprint 07: Rule-Based Permissions and Runtime Decomposition
2
-
3
-## Prerequisites
4
-
5
-Sprint 06
6
-
7
-## Goals
8
-
9
-Finish the permission model and keep the runtime from re-forming a monolith.
10
-
11
-Sprint 03 gave Loader explicit permission modes and lifecycle hooks. Sprint 06 made Loader more inspectable and product-like. The next leverage point is to replace the remaining legacy confirmation behavior with a real policy layer, then carve authorization and finalization concerns out of `src/loader/runtime/conversation.py` so later work does not keep accumulating there.
12
-
13
-This is not a stop-the-world cleanup sprint. It is a contract sprint:
14
-
15
-- policy decides whether tool use is allowed, denied, or prompted
16
-- prompts are driven by authorization outcomes instead of ad hoc tool confirmations
17
-- `conversation.py` becomes orchestration, not a sink for every new behavior
18
-
19
-The references for this sprint are:
20
-
21
-- `refs/claw-code/rust/crates/runtime/src/permissions.rs`
22
-- `refs/claw-code/rust/crates/runtime/src/permission_enforcer.rs`
23
-- `refs/claw-code/rust/crates/runtime/src/conversation.rs:295-470`
24
-
25
-## Deliverables
26
-
27
-### 1. Full permission policy layer
28
-
29
-Extend Loader's current mode-based policy to a real rule-based policy.
30
-
31
-Implementation targets:
32
-
33
-- add `prompt` and `allow` to `PermissionMode`, mirroring `claw-code`
34
-- extend `PermissionPolicy` with `allow_rules`, `deny_rules`, and `ask_rules`
35
-- introduce a typed permission-rule representation with conservative matching over:
36
-  - tool name
37
-  - normalized tool input summary
38
-  - optional workspace/path context where relevant
39
-- define deterministic policy precedence:
40
-  - deny rules always win
41
-  - hook-level deny/cancel/fail still deny
42
-  - ask rules and `prompt` mode route to interactive approval
43
-  - allow rules and `allow` mode can elevate within policy
44
-  - otherwise the required-mode gate still applies
45
-- keep rule syntax intentionally narrow; do not invent a complex DSL in this sprint
46
-
47
-The goal is to get the runtime to a `PermissionPolicy` shape closer to claw-code, not to build a policy language for its own sake.
48
-
49
-### 2. Policy-backed prompting instead of legacy confirmations
50
-
51
-Today Loader still falls back to legacy confirmation flows after policy allows certain destructive actions. This sprint should invert that relationship.
52
-
53
-Implementation targets:
54
-
55
-- route write/bash/edit/patch approvals through policy outcomes (`allow`, `deny`, `ask`)
56
-- make `ToolExecutor` the primary owner of interactive approval decisions
57
-- keep `Tool.check_confirmation()` only as a compatibility shim while call sites are migrated
58
-- ensure prompt payloads include:
59
-  - tool name
60
-  - normalized input summary
61
-  - active mode
62
-  - required mode
63
-  - matched rule or hook reason when available
64
-- preserve Sprint 06's explore guarantee: explore mode remains forced `read-only` even if the broader session policy is `allow`
65
-
66
-This is the behavioral finish to Sprint 03. Safety and approval should come from one runtime contract, not a policy layer plus leftover tool-specific prompting.
67
-
68
-### 3. Split `conversation.py` into smaller runtime components
69
-
70
-`src/loader/runtime/conversation.py` is working, but it is still too responsibility-dense. The next sprint should reduce risk by moving major responsibilities into dedicated runtime modules.
71
-
72
-Implementation targets:
73
-
74
-- extract assistant-turn request/response handling into a smaller request/turn helper
75
-- extract tool-batch execution and retry bookkeeping into a dedicated runtime component
76
-- extract verify/fix completion gating and session-finalization concerns into a dedicated runtime component
77
-- keep `ConversationRuntime.run_turn(...)` as the coordinator that wires those parts together
78
-- avoid moving behavior back into `agent/loop.py`; decomposition should continue inside `src/loader/runtime/`
79
-
80
-A good outcome is not just fewer lines. A good outcome is that future work on authorization, verification, and completion lands in focused modules instead of reopening the turn loop every time.
81
-
82
-### 4. Observable policy state in product surfaces
83
-
84
-If Loader gains rule-based permissions, operators need to be able to see that state without reading code.
85
-
86
-Implementation targets:
87
-
88
-- extend `loader status` to show:
89
-  - active permission mode
90
-  - whether policy prompting is enabled
91
-  - allow/deny/ask rule counts
92
-- extend `loader doctor` to validate policy configuration and warn on invalid or conflicting rules
93
-- persist enough policy metadata in session/runtime state that inspection surfaces can explain the effective policy cleanly
94
-- fail closed on invalid policy configuration and report the failure clearly
95
-
96
-This keeps the policy layer inspectable and reduces surprise when a tool is prompted, denied, or silently allowed by rule.
97
-
98
-## Testing strategy
99
-
100
-- unit coverage for:
101
-  - `PermissionMode` parsing including `prompt` and `allow`
102
-  - rule parsing and normalization
103
-  - precedence between deny/ask/allow rules and hook overrides
104
-  - invalid policy configuration failing closed
105
-- deterministic harness coverage for:
106
-  - `prompt_mode_prompts_destructive_write`
107
-  - `allow_mode_skips_prompt_for_destructive_write`
108
-  - `deny_rule_blocks_allowed_mode`
109
-  - `ask_rule_prompts_even_when_mode_would_allow`
110
-  - `explore_mode_ignores_global_allow_policy`
111
-- inspection coverage for:
112
-  - doctor reporting invalid policy files
113
-  - status/session surfaces showing rule summaries and active policy mode
114
-- regression coverage:
115
-  - Sprint 00-06 parity scenarios remain green after the runtime split
116
-  - no new behavior path should bypass `ToolExecutor`
117
-
118
-## Definition of done
119
-
120
-- Loader can express `read-only`, `workspace-write`, `danger-full-access`, `prompt`, and `allow` modes
121
-- permission approvals and denials come from policy evaluation, not mostly from leftover tool-specific confirmation logic
122
-- allow/deny/ask rules are real, typed, tested, and visible in product surfaces
123
-- `conversation.py` is slimmer and more coordinator-like, with authorization/finalization logic living in dedicated runtime modules
124
-- the existing parity baseline stays green and the new policy scenarios are deterministic
125
-- Loader is closer to claw-code's execution-policy contract without taking on claw-code's full complexity
126
-
127
-## Explicitly out of scope
128
-
129
-- interactive multi-step explore workflows
130
-- AST-aware, LSP-aware, or symbol-aware editing
131
-- multi-agent or team orchestration
132
-- broad plugin or MCP expansion
133
-
134
-## Audit
135
-
136
-### Landed
137
-
138
-- Loader now supports the full Sprint 07 permission-mode surface: `read-only`, `workspace-write`, `danger-full-access`, `prompt`, and `allow`
139
-- workspace-local `.loader/permission-rules.json` files now load into typed `allow` / `deny` / `ask` rule sets with conservative matching over tool name, normalized input summaries, and optional path hints
140
-- policy precedence is now deterministic: deny rules win first, hook-level overrides can still deny/ask/allow, ask rules drive interactive approval, allow rules and `allow` mode can elevate, and the required-mode gate still applies otherwise
141
-- `ToolExecutor` is now the primary owner of interactive approval decisions, and destructive write/bash/edit/patch paths run through policy outcomes instead of relying mainly on tool-specific confirmation logic
142
-- policy prompt payloads now include the active mode, required mode, normalized input summary, and matched rule or hook reason where available
143
-- explore mode preserves the Sprint 06 read-only guarantee by copying deny/ask rules but intentionally ignoring broader allow rules
144
-- `loader doctor` and `loader status` now expose prompting state, allow/deny/ask rule counts, and invalid rule configuration clearly, while `loader session list/show` now persist and surface the effective policy metadata that a session actually ran with
145
-- invalid permission configuration now fails closed both when starting an `Agent` and when inspecting a workspace through doctor/status surfaces
146
-- runtime decomposition continued inside `src/loader/runtime/`: assistant requests now live in `assistant_turns.py`, tool-batch execution/recovery/post-tool verification now live in `tool_batches.py`, and DoD/finalization logic now live in `finalization.py`
147
-- `ConversationRuntime.run_turn(...)` is now more coordinator-like: it prepares the workflow, requests an assistant turn, delegates tool execution to the batch runner, delegates DoD gating/finalization, and keeps orchestration in one place instead of owning every behavior directly
148
-- the deterministic parity harness now includes prompt/allow/rule-policy scenarios and remains green after the runtime split
149
-
150
-### Verification
151
-
152
-- `uv run pytest -q` is green: `167 passed`
153
-- `tests/test_permissions.py` covers `prompt` / `allow` parsing, rule parsing, deny/ask/allow precedence, hook overrides, and policy-backed prompting behavior
154
-- `tests/test_runtime_harness.py` keeps the full Sprint 00-06 baseline green and now covers prompt-mode prompting, allow-mode skipping, deny-rule blocking, ask-rule prompting, and explore-mode isolation from global allow policy
155
-- `tests/test_inspection.py` covers invalid rule reporting plus rule-aware `doctor`, `status`, `session list`, and `session show` surfaces
156
-- `tests/test_session_state.py` now covers persisted permission-policy metadata alongside the earlier session persistence/resume/compaction coverage
157
-- targeted `ruff` checks are green for the new runtime modules and the touched inspection/session test files
158
-
159
-### Residual debt
160
-
161
-- rule syntax is intentionally narrow and workspace-local; Loader still does not have claw-code's richer rule model, preview UX, or temporary allow/deny override ergonomics
162
-- policy-backed prompting is now primary, but the older tool-confirmation compatibility layer still exists and should continue shrinking rather than becoming a second policy path again
163
-- `conversation.py` is materially slimmer than before Sprint 07, but it still owns workflow routing, prompt repair, self-critique/completion heuristics, and several other coordinator behaviors that remain more heuristic-heavy than the refs
164
-- shell mutability classification and rule matching are still conservative string/command heuristics rather than a richer semantic sandbox or argument-aware policy model
165
-- session inspection now preserves effective policy state, but Loader still does not offer a first-class product surface for authoring, validating, or dry-running permission rules
166
-- explore mode remains intentionally one-shot and read-only; Sprint 07 does not add a richer interactive inspection workflow
.docs/sprints/sprint08.mddeleted
@@ -1,180 +0,0 @@
1
-# Sprint 08: Prompt Builder, Runtime Phases, and Permission Operator UX
2
-
3
-## Prerequisites
4
-
5
-Sprint 07
6
-
7
-## Goals
8
-
9
-Turn the remaining "smart heuristics" into explicit runtime contracts and make Loader's permission system operable without reading JSON files by hand.
10
-
11
-Sprint 07 gave Loader a real execution-policy layer and smaller runtime seams. The next leverage point is to stop letting prompt assembly and response repair live as ad hoc strings and inline heuristics. `claw-code` keeps the runtime tighter partly because prompt construction is its own subsystem, and OMX keeps workflows legible because phase/state is explicit instead of implied.
12
-
13
-This sprint is the bridge from "better runtime internals" to "better operator control":
14
-
15
-- prompt construction becomes a typed builder with explicit sections
16
-- turn phases become named runtime states instead of inline branches
17
-- permission policy becomes dry-runnable and explainable from the CLI
18
-
19
-The references for this sprint are:
20
-
21
-- `refs/claw-code/rust/crates/runtime/src/prompt.rs`
22
-- `refs/claw-code/rust/crates/runtime/src/permissions.rs`
23
-- `refs/claw-code/rust/crates/runtime/src/conversation.rs:295-470`
24
-- `refs/oh-my-codex/src/modes/base.ts`
25
-
26
-## Deliverables
27
-
28
-### 1. Typed prompt builder instead of hand-built templates
29
-
30
-Loader's prompt layer is still mostly string templates in `src/loader/agent/prompts.py`. Sprint 08 should turn that into a runtime prompt builder with explicit sections and clearer ownership.
31
-
32
-Implementation targets:
33
-
34
-- replace the current monolithic prompt strings with a typed builder under `src/loader/runtime/` or another clearly-owned prompt module
35
-- separate static scaffolding from dynamic runtime context with an explicit boundary, following the shape of `claw-code`'s prompt builder
36
-- render mode guidance, project context, runtime config, permission mode, and relevant workflow/artifact context as independent sections instead of one merged string blob
37
-- keep native-tool vs ReAct prompt differences as a thin formatting concern, not two largely separate prompt bodies
38
-- make it easy to add or remove one section without rewriting the whole prompt template
39
-- preserve Loader's current local-first/product-specific guidance; this sprint is about structure first, not prompt verbosity for its own sake
40
-
41
-The goal is not "more prompt text." The goal is a prompt contract that is inspectable, composable, and easier to evolve without reopening the turn loop.
42
-
43
-### 2. Explicit runtime phases for response repair and completion behavior
44
-
45
-`src/loader/runtime/conversation.py` is healthier after Sprint 07, but it still owns too many inline heuristics:
46
-
47
-- empty-output retries
48
-- raw-tool-call fallback
49
-- fake-tool narration correction
50
-- self-critique rerouting
51
-- completion nudges for non-mutating tasks
52
-- text-loop bailout behavior
53
-
54
-Sprint 08 should move those into named runtime phases and focused helpers.
55
-
56
-Implementation targets:
57
-
58
-- define a typed turn-phase model for the remaining coordinator behaviors, for example:
59
-  - `prepare`
60
-  - `assistant`
61
-  - `tools`
62
-  - `repair`
63
-  - `critique`
64
-  - `completion`
65
-- extract response-repair and completion-policy decisions into dedicated runtime components rather than leaving them inline in `ConversationRuntime.run_turn(...)`
66
-- keep `conversation.py` as the turn coordinator that advances phase state and delegates to helpers
67
-- persist enough phase metadata in trace/session state that status surfaces and debugging can explain where Loader spent time during a turn
68
-- avoid moving these heuristics back into `agent/loop.py`; the runtime split should keep going in one direction
69
-
70
-This is the next step toward a runtime that is easier to debug and less vulnerable to regressions when we tune follow-through behavior.
71
-
72
-### 3. First-class permission operator surfaces
73
-
74
-Sprint 07 made policy inspectable. Sprint 08 should make it operable.
75
-
76
-Implementation targets:
77
-
78
-- add `loader permissions show` to display:
79
-  - active permission mode
80
-  - prompting state
81
-  - rule source path
82
-  - normalized allow/deny/ask rules
83
-  - rule counts and validity
84
-- add `loader permissions check` to dry-run one hypothetical tool request and show:
85
-  - tool name
86
-  - normalized input summary
87
-  - required mode
88
-  - policy decision (`allow` / `deny` / `ask`)
89
-  - matched rule or hook-style reason when present
90
-- support both structured JSON-like tool arguments and simple string inputs where practical
91
-- wire `loader doctor` remediation hints to these policy commands when rules are invalid or confusing
92
-- keep the UX read-mostly in this sprint; authoring/edit flows can remain file-based for now
93
-
94
-This gives operators a way to answer "why did Loader allow/prompt/deny this?" without reproducing the behavior in a live turn.
95
-
96
-### 4. Prompt and policy state surfaced coherently in product surfaces
97
-
98
-Once prompt construction and phase state are explicit, the product surfaces should expose enough context to make Loader's behavior legible.
99
-
100
-Implementation targets:
101
-
102
-- extend `loader status` and the TUI status line to surface the active turn phase when a run is in progress
103
-- make doctor/status/session output consistent about permission terminology (`mode`, `prompting`, `rules`, `source`)
104
-- expose prompt-builder metadata in a minimal, operator-friendly way:
105
-  - current mode
106
-  - whether ReAct/native formatting is active
107
-  - which dynamic sections were included
108
-- keep these surfaces concise; this sprint is about observability, not dumping the whole prompt to the screen by default
109
-
110
-The goal is to make Loader easier to reason about in live use, not just in code review.
111
-
112
-## Testing strategy
113
-
114
-- unit coverage for:
115
-  - prompt-builder section rendering
116
-  - native vs ReAct prompt-format differences over the same section set
117
-  - turn-phase transitions for repair/critique/completion flows
118
-  - permission dry-run explanations and invalid-input handling
119
-- CLI coverage for:
120
-  - `loader permissions show`
121
-  - `loader permissions check`
122
-  - doctor remediation output that points users toward policy inspection
123
-- deterministic/runtime coverage for:
124
-  - empty assistant output triggering repair-phase retries without regressing the tool path
125
-  - raw-tool fallback still sharing the same executor and phase bookkeeping
126
-  - non-mutating completion nudges remaining deterministic after the phase split
127
-  - Sprint 00-07 parity scenarios staying green
128
-- status/TUI coverage for:
129
-  - active phase rendering
130
-  - coherent permission terminology across doctor/status/session output
131
-
132
-## Definition of done
133
-
134
-- prompt construction is builder-based, sectioned, and easier to inspect than the current template strings
135
-- `conversation.py` is slimmer again, with response-repair and completion heuristics moved into focused runtime components
136
-- Loader exposes first-class permission inspection and dry-run commands instead of requiring manual JSON reading
137
-- prompt, phase, and policy state are more legible in status/inspection surfaces
138
-- the full parity baseline remains green after the phase split
139
-- Loader moves closer to claw-code's "tight runtime, explicit prompt contract" shape without overcommitting to a huge configuration system
140
-
141
-## Explicitly out of scope
142
-
143
-- a full interactive rule editor or TUI-based permission authoring flow
144
-- AST-aware, LSP-aware, or symbol-aware editing
145
-- a richer shell sandbox than the current command-based model
146
-- interactive multi-step explore workflows
147
-- multi-agent or team orchestration
148
-
149
-## Audit
150
-
151
-### Landed
152
-
153
-- Loader's prompt construction now lives in `src/loader/runtime/prompting.py` as a typed builder with explicit sections, a static/dynamic boundary marker, and thin native-vs-ReAct formatting differences instead of one mostly hand-built string blob
154
-- prompt metadata now persists in session state, so `loader status`, `loader session list/show`, and the live agent state can explain the active prompt format and which dynamic sections were actually included for the current workspace/task
155
-- the remaining coordinator heuristics are now split into explicit runtime components: phase tracking in `runtime.phases`, response repair in `runtime.repair`, and completion/self-critique policy in `runtime.completion_policy`
156
-- `ConversationRuntime.run_turn(...)` now advances explicit turn phases (`prepare`, `assistant`, `repair`, `tools`, `critique`, `completion`, `finalize`) and persists the active phase into session state while also emitting runtime events for the CLI/TUI
157
-- the TUI status line and CLI/session inspection surfaces now expose the active turn phase while a turn is in flight, which makes Loader's mid-turn behavior much easier to debug than the earlier implicit branch structure
158
-- Loader now has first-class permission operator commands:
159
-  - `loader permissions show` displays the active mode, prompting state, rules source, validity, counts, and normalized allow/deny/ask rules
160
-  - `loader permissions check` dry-runs one hypothetical tool request and reports the normalized input summary, required mode, allow/deny/ask decision, matched rule, and policy reason
161
-- `loader permissions check` supports both JSON object arguments and practical positional input mapping for common tools such as `bash`, `read`, `write`, `edit`, `patch`, `glob`, `grep`, `git`, and read-only memory/notepad lookups
162
-- `loader doctor` remediation now points operators toward `loader permissions show` / `loader permissions check` instead of leaving permission debugging as a code/JSON-reading exercise
163
-- doctor/status/session output now uses more consistent permission terminology around mode, prompting, rules, and source instead of mixing several labels for the same policy concepts
164
-
165
-### Verification
166
-
167
-- `uv run pytest -q` is green: `176 passed`
168
-- `tests/test_prompt_builder.py` covers section rendering, native-vs-ReAct formatting, and prompt-builder persistence metadata
169
-- `tests/test_runtime_phases.py` covers repair/completion phase transitions and active phase bookkeeping
170
-- `tests/test_inspection.py` now covers `loader permissions show`, `loader permissions check`, invalid JSON input handling, invalid-rule visibility, prompt/policy metadata in status/session surfaces, and the existing doctor/session inspection behavior
171
-- targeted `ruff` checks are green for `src/loader/runtime/inspection.py`, `tests/test_inspection.py`, and import ordering in `src/loader/cli/main.py`
172
-- the full Sprint 00-07 parity baseline stayed green through the prompt/phase split and permission CLI rollout
173
-
174
-### Residual debt
175
-
176
-- `src/loader/runtime/conversation.py` is slimmer than before Sprint 08, but it still coordinates workflow routing and phase transitions with a heuristic branch structure rather than a more formal state machine
177
-- prompt construction is now inspectable and sectioned, but Loader still does not offer prompt previews/diffs, a richer prompt-contract parity harness, or operator controls for temporarily adjusting prompt sections
178
-- `loader permissions show/check` make the policy operable, but authoring/editing rules is still file-based and there is still no first-class preview UX for comparing multiple rule sets or applying temporary session overrides
179
-- doctor/status/session terminology is more coherent now, but the product still stops short of the richer policy UX and sandbox semantics used by the references
180
-- the explicit turn phases improve observability, but they are still runtime bookkeeping around heuristics, not yet a deeper workflow-state contract on the level of OMX's more opinionated routing discipline
.docs/sprints/sprint09.mddeleted
@@ -1,185 +0,0 @@
1
-# Sprint 09: Turn State Machine, Workflow Contracts, and Prompt Preview
2
-
3
-## Prerequisites
4
-
5
-Sprint 08
6
-
7
-## Goals
8
-
9
-Turn Loader's explicit phase labels into a real runtime contract and make workflow routing explainable instead of merely observable.
10
-
11
-Sprint 08 gave Loader a typed prompt builder, named turn phases, and permission/operator surfaces. That was the right bridge, but the audit is honest about what still hurts:
12
-
13
-- `conversation.py` still coordinates too much through heuristic branching
14
-- workflow routing is still implied by helper decisions more than enforced by a transition model
15
-- phase state is visible, but not yet validated like a real state machine
16
-- prompt metadata is inspectable, but the actual prompt contract is still hard to preview directly
17
-
18
-The next leverage point is to stop treating state as labels attached to heuristics and start treating it as the runtime.
19
-
20
-This sprint is about discipline:
21
-
22
-- turn progression becomes a validated state machine
23
-- workflow routing becomes a typed decision contract with persisted reasons
24
-- `conversation.py` becomes thinner again by delegating transitions instead of deciding everything inline
25
-- operators can inspect the prompt/workflow contract without triggering a live model turn
26
-
27
-The references for this sprint are:
28
-
29
-- `refs/claw-code/rust/crates/runtime/src/conversation.rs`
30
-- `refs/claw-code/rust/crates/runtime/src/mcp_lifecycle_hardened.rs`
31
-- `refs/oh-my-codex/src/ralplan/runtime.ts`
32
-- `refs/oh-my-codex/src/modes/base.ts`
33
-
34
-## Deliverables
35
-
36
-### 1. Validated turn state machine instead of phase bookkeeping only
37
-
38
-Sprint 08 added named phases. Sprint 09 should make those phases authoritative.
39
-
40
-Implementation targets:
41
-
42
-- introduce a dedicated turn-state machine under `src/loader/runtime/` with typed states, transition intents, and terminal reasons
43
-- define explicit allowed transitions for the main turn lifecycle, for example:
44
-  - `prepare -> assistant`
45
-  - `assistant -> repair`
46
-  - `assistant -> tools`
47
-  - `assistant -> completion`
48
-  - `tools -> critique`
49
-  - `critique -> completion`
50
-  - `completion -> finalize`
51
-- fail loudly on impossible transitions in tests and tracing instead of silently drifting
52
-- capture transition metadata such as:
53
-  - why a transition happened
54
-  - whether it was a normal path, reroute, retry, or recovery move
55
-  - whether it ended in success, blocked, fix-loop reentry, or iteration exhaustion
56
-- keep the state machine separate from model/tool side effects so transition logic is testable on its own
57
-
58
-The goal is not ceremony. The goal is to make Loader's turn lifecycle harder to accidentally regress.
59
-
60
-### 2. Typed workflow-routing contract with persisted reasoning
61
-
62
-Loader already has `clarify`, `plan`, `execute`, `verify`, and `explore`, but the router still behaves more like a cluster of heuristics than a durable contract.
63
-
64
-Implementation targets:
65
-
66
-- replace the current loose routing outputs with a typed workflow decision object, for example:
67
-  - selected mode
68
-  - reason code
69
-  - ambiguity/complexity signal
70
-  - whether the mode is an initial route, a reentry, or a forced continuation
71
-  - whether a downstream mode is already scheduled
72
-- persist workflow-decision metadata in session/runtime state so inspection surfaces can explain:
73
-  - why Loader chose clarify vs plan vs execute
74
-  - why a verify failure returned to execute
75
-  - why a conversational task skipped verification
76
-- move verify/fix loop reentry onto the same contract instead of letting it remain a partial special case
77
-- keep explore explicitly outside the main mutating workflow contract while still representing it as a typed lane
78
-
79
-This is how Loader gets closer to OMX's “mode with a contract” behavior without copying OMX's full planning stack yet.
80
-
81
-### 3. Slim `conversation.py` around transitions and routing
82
-
83
-After Sprint 08, `conversation.py` is better, but it still knows too much about routing and control flow.
84
-
85
-Implementation targets:
86
-
87
-- extract workflow-decision evaluation and turn-transition advancement into dedicated runtime modules
88
-- make `ConversationRuntime.run_turn(...)` primarily:
89
-  - build turn context
90
-  - ask the router for the next workflow decision
91
-  - advance the state machine
92
-  - delegate assistant/tool/repair/finalization work
93
-  - persist the resulting turn summary
94
-- remove remaining inline branches that duplicate completion/reentry/skip logic in multiple places
95
-- keep all new control-flow work inside `src/loader/runtime/`, not back in `agent/loop.py`
96
-
97
-A good outcome is that `conversation.py` reads like an orchestrator over contracts rather than a place where routing policy keeps accumulating.
98
-
99
-### 4. First-class prompt/workflow preview surfaces
100
-
101
-Sprint 08 made prompt and policy metadata legible. Sprint 09 should make the actual contract previewable.
102
-
103
-Implementation targets:
104
-
105
-- add `loader prompt show` to render the current prompt contract for a task without issuing a model request
106
-- support previewing at least:
107
-  - workflow mode
108
-  - prompt format (`native` / `react`)
109
-  - included dynamic sections
110
-  - current permission mode
111
-  - relevant workflow context
112
-- keep the output operator-friendly:
113
-  - summary metadata first
114
-  - prompt body second
115
-  - no hidden live side effects
116
-- extend `loader status` / `loader session show` with the latest workflow-decision reason and latest transition summary when present
117
-
118
-The goal is to make Loader's runtime choices debuggable before and after a turn, not only during one.
119
-
120
-## Testing strategy
121
-
122
-- unit coverage for:
123
-  - allowed and disallowed turn-state transitions
124
-  - terminal-state reasons and retry/reentry transitions
125
-  - workflow-decision objects and persisted reason codes
126
-  - prompt-preview rendering over multiple workflow modes
127
-- CLI coverage for:
128
-  - `loader prompt show`
129
-  - workflow reason/transition fields in `loader status` and `loader session show`
130
-- deterministic/runtime coverage for:
131
-  - verify/fix reentry using only valid state-machine transitions
132
-  - conversational tasks skipping verification through a typed workflow decision instead of an implicit branch
133
-  - repair-phase retries preserving valid transition sequences
134
-  - Sprint 00-08 parity scenarios staying green after the state-machine split
135
-- regression coverage:
136
-  - no turn path should bypass the state machine once the contract is introduced
137
-  - `conversation.py` should no longer be the only place that knows whether a reroute is legal
138
-
139
-## Definition of done
140
-
141
-- Loader has a validated turn-state machine, not just named phase labels
142
-- workflow routing emits typed, persisted decisions with explainable reasons and reentry metadata
143
-- `conversation.py` is slimmer again and more coordinator-like
144
-- operators can preview the current prompt contract without a live model call
145
-- status/session surfaces can explain the latest workflow decision and transition outcome
146
-- the full parity baseline remains green after the control-flow split
147
-
148
-## Explicitly out of scope
149
-
150
-- a full workflow editor or visual state-machine UI
151
-- a first-class permission rule editor
152
-- OMX-style multi-iteration consensus planning
153
-- AST-aware, LSP-aware, or symbol-aware editing
154
-- multi-agent or team orchestration
155
-
156
-## Audit
157
-
158
-### Landed
159
-
160
-- Loader now has a validated turn-state machine in `src/loader/runtime/phases.py` instead of phase bookkeeping only; allowed transitions are explicit, invalid transitions fail loudly, and transition metadata now captures reason code, human summary, and whether the move was normal, retry, reroute, recovery, or terminal
161
-- turn-transition metadata is now persisted in session state and surfaced through typed runtime events and `TurnSummary`, so Loader can explain the latest transition outcome instead of only exposing the current phase label
162
-- workflow routing now uses a richer typed `ModeDecision` contract in `src/loader/runtime/workflow.py`, including reason code, reason summary, decision kind, ambiguity/complexity scores, and optional scheduled-next-mode hints
163
-- verify/fix loop handoffs now use the same workflow-decision contract as initial routing, so verify entry and execute reentry are persisted and inspectable instead of living as partial special cases
164
-- `ConversationRuntime` now treats workflow decisions more like contracts than loose labels: it sets workflow state from typed decisions, records reason metadata into session state, and keeps `conversation.py` more coordinator-like than before Sprint 09
165
-- `loader status`, `loader session list`, and `loader session show` now surface the latest workflow-decision reason, decision kind, and last validated transition summary when those fields are present
166
-- Loader now has `loader prompt show [task]`, implemented through `runtime.inspection.collect_prompt_preview(...)`, which renders the current prompt contract without issuing a model request and reports workflow mode, permission mode, prompt format, dynamic sections, and the full prompt body
167
-- prompt preview reuses the typed prompt builder and capability resolution rather than duplicating prompt strings in the CLI, so the operator surface stays aligned with the runtime contract it is previewing
168
-
169
-### Verification
170
-
171
-- `uv run pytest -q` is green: `180 passed`
172
-- `tests/test_turn_state_machine.py` covers valid/invalid turn transitions and terminal transition metadata
173
-- `tests/test_runtime_phases.py` now covers persisted transition metadata in runtime events, session state, and final turn summaries
174
-- `tests/test_workflow_runtime.py` now covers persisted workflow-decision reason codes and handoff kinds for clarify/plan/verify flows
175
-- `tests/test_session_state.py` now covers round-tripping workflow-decision metadata and transition metadata through persisted session snapshots
176
-- `tests/test_inspection.py` now covers workflow-reason/transition rendering in status/session surfaces plus `loader prompt show`
177
-- targeted `ruff` checks are green for `src/loader/runtime/inspection.py` and `tests/test_inspection.py`
178
-
179
-### Residual debt
180
-
181
-- Loader now has a real turn state machine, but workflow routing itself is still heuristic-only and materially lighter than OMX's deeper route discipline
182
-- `src/loader/runtime/conversation.py` is slimmer and more contract-driven than before Sprint 09, but it still coordinates several heuristic completion/repair paths that could be decomposed further
183
-- `loader prompt show` gives operators a real preview surface, but Loader still does not support prompt diffs, historical prompt snapshots, or richer side-by-side comparison workflows
184
-- workflow reasons and transitions are now inspectable, but the product still does not offer a richer workflow trace/timeline surface or stronger workflow-authoring controls
185
-- Sprint 09 strengthens state and inspection contracts, but it does not add deeper consensus planning, AST/LSP-aware editing, or a richer permission-rule authoring UX
.docs/sprints/sprint10.mddeleted
@@ -1,198 +0,0 @@
1
-# Sprint 10: Route Pressure, Clarify Depth, and Workflow Timeline
2
-
3
-## Prerequisites
4
-
5
-Sprint 09
6
-
7
-## Goals
8
-
9
-Turn Loader's workflow routing from a better heuristic into a more opinionated workflow policy, and make that policy inspectable over time instead of only at the latest state.
10
-
11
-Sprint 09 gave Loader a validated turn state machine, typed workflow decisions, and prompt preview. That was the right contract layer, but the audit is honest about what still hurts:
12
-
13
-- workflow routing is still threshold-based and lighter than OMX's route discipline
14
-- clarify and plan are still mostly one-shot preprocessors instead of deeper workflow lanes
15
-- status/session surfaces can explain the latest decision, but not the sequence of decisions that got Loader there
16
-- `conversation.py` is slimmer, but it still owns route/handoff/skip logic that should live in dedicated workflow policy code
17
-
18
-The next leverage point is to stop asking only "what mode are we in now?" and start asking "what workflow pressure led us here, what is still unresolved, and what should happen next if the task keeps moving?"
19
-
20
-This sprint is about workflow rigor:
21
-
22
-- routing becomes a scored workflow policy instead of a small threshold router
23
-- clarify and plan become more durable lanes with bounded depth and freshness rules
24
-- workflow history becomes a persisted timeline instead of only last-known fields
25
-- `conversation.py` gets thinner again by delegating route pressure and handoff policy
26
-
27
-The references for this sprint are:
28
-
29
-- `refs/claw-code/rust/crates/runtime/src/conversation.rs`
30
-- `refs/claw-code/rust/crates/runtime/src/prompt.rs`
31
-- `refs/oh-my-codex/src/ralplan/runtime.ts`
32
-- `refs/oh-my-codex/skills/deep-interview/SKILL.md`
33
-- `refs/oh-my-codex/skills/ralplan/SKILL.md`
34
-
35
-## Deliverables
36
-
37
-### 1. Workflow policy engine instead of threshold-only routing
38
-
39
-Sprint 09 made workflow decisions typed. Sprint 10 should make them more deliberate.
40
-
41
-Implementation targets:
42
-
43
-- replace the current simple threshold router with a workflow-policy module that evaluates route pressure using typed signals, for example:
44
-  - ambiguity
45
-  - complexity
46
-  - mutability / verification pressure
47
-  - artifact availability
48
-  - artifact freshness
49
-  - explicit user request
50
-  - unresolved assumptions
51
-- produce a scored route evaluation that explains:
52
-  - which mode won
53
-  - why it won
54
-  - what the runner-up pressure was
55
-  - whether a downstream mode is already scheduled
56
-- keep initial route, handoff, and reentry decisions on the same policy surface instead of splitting them between the router and coordinator code
57
-- make non-mutating "skip verify" behavior a typed workflow decision instead of an implicit completion branch
58
-
59
-The goal is not a giant planning engine. The goal is to make Loader's workflow behavior less accidental and easier to tune deliberately.
60
-
61
-### 2. Deeper clarify lane with bounded follow-through
62
-
63
-Loader can clarify today, but it still behaves like a one-question prelude.
64
-
65
-Implementation targets:
66
-
67
-- allow clarify to continue for more than one round when the first answer leaves key boundaries unresolved
68
-- keep that depth bounded with an explicit clarify budget and escalation rules
69
-- persist unresolved assumptions or open questions in workflow state so later turns can explain why clarify continued or stopped
70
-- distinguish between:
71
-  - clarify completed cleanly
72
-  - clarify exhausted its budget
73
-  - clarify was bypassed by a stronger route decision
74
-- keep the UX disciplined: no deep-interview sprawl, only focused rounds that materially reduce ambiguity
75
-
76
-This is how Loader gets closer to OMX's deeper interview behavior without turning every task into a questionnaire.
77
-
78
-### 3. Plan freshness and re-plan discipline
79
-
80
-Loader now persists plans, but it still treats them as mostly static once created.
81
-
82
-Implementation targets:
83
-
84
-- add typed freshness checks for clarify briefs and plan artifacts so Loader can detect when a plan no longer matches the task state
85
-- define explicit refresh triggers, for example:
86
-  - task meaning changed after clarification
87
-  - verify/fix reentry reveals plan gaps
88
-  - touched files drift outside the planned touchpoints
89
-  - acceptance criteria changed materially
90
-- route stale artifacts through a typed re-plan or plan-refresh decision instead of silently continuing with outdated plans
91
-- keep refresh lightweight: prefer targeted plan refresh over full workflow restart when possible
92
-
93
-This brings Loader closer to claw-code's stronger execution discipline: artifacts should steer the turn, not become dead files on disk.
94
-
95
-### 4. Persisted workflow timeline and operator surfaces
96
-
97
-Sprint 09 exposed the latest workflow reason. Sprint 10 should expose workflow history.
98
-
99
-Implementation targets:
100
-
101
-- persist workflow timeline entries in session/runtime state, including:
102
-  - route decisions
103
-  - handoffs
104
-  - reentries
105
-  - clarify-budget outcomes
106
-  - plan refresh decisions
107
-  - the latest prompt-contract metadata attached to those moments when relevant
108
-- add a first-class inspection surface such as `loader workflow show` for summarizing that timeline without a live model call
109
-- extend `loader session show` to surface the most recent workflow timeline items when present
110
-- keep the output operator-friendly:
111
-  - newest-important events first
112
-  - concise summaries
113
-  - artifact paths or mode transitions only when they help explain behavior
114
-
115
-The goal is to make Loader's workflow evolution debuggable, not just its final state.
116
-
117
-### 5. Continue slimming `conversation.py`
118
-
119
-Sprint 09 improved the coordinator. Sprint 10 should keep the split moving in the same direction.
120
-
121
-Implementation targets:
122
-
123
-- extract route-pressure evaluation, clarify-budget handling, and artifact-freshness checks into dedicated runtime modules
124
-- make `ConversationRuntime.run_turn(...)` primarily:
125
-  - collect turn context
126
-  - ask the workflow policy for a route or reentry decision
127
-  - delegate clarify/plan/execute/verify work
128
-  - append timeline entries
129
-  - finalize the turn summary
130
-- avoid rebuilding the monolith by placing all new workflow-policy logic under `src/loader/runtime/`
131
-
132
-A good outcome is that `conversation.py` keeps shrinking because policy is becoming modular, not because the behavior is disappearing.
133
-
134
-## Testing strategy
135
-
136
-- unit coverage for:
137
-  - workflow-policy score breakdowns and winning-route selection
138
-  - clarify-budget continuation vs exhaustion
139
-  - artifact-freshness detection and targeted plan refresh triggers
140
-  - workflow timeline persistence and serialization
141
-- CLI coverage for:
142
-  - `loader workflow show`
143
-  - workflow timeline rendering in `loader session show`
144
-- deterministic/runtime coverage for:
145
-  - ambiguous tasks that require more than one clarify round before execution
146
-  - verify/fix reentry that triggers plan refresh instead of blindly continuing
147
-  - non-mutating tasks skipping verify through a typed workflow decision
148
-  - Sprint 00-09 parity scenarios staying green after the workflow-policy split
149
-- regression coverage:
150
-  - route and reentry choices should come from the workflow-policy layer, not ad hoc coordinator branches
151
-  - persisted workflow history should survive session resume and inspection
152
-
153
-## Definition of done
154
-
155
-- Loader routes through a scored workflow-policy engine instead of a small threshold router
156
-- clarify can continue for bounded, explainable follow-up rounds when ambiguity remains
157
-- plan artifacts can be marked stale and refreshed through typed workflow decisions
158
-- workflow history is persisted and inspectable, not only the latest decision
159
-- `conversation.py` is slimmer again and more policy-driven
160
-- the full parity baseline remains green after the workflow-policy and timeline work
161
-
162
-## Explicitly out of scope
163
-
164
-- full OMX-style consensus planning
165
-- a visual workflow timeline UI
166
-- a first-class permission rule editor
167
-- AST-aware, LSP-aware, or symbol-aware editing
168
-- multi-agent or team orchestration
169
-
170
-## Audit
171
-
172
-### Landed
173
-
174
-- Loader now routes through a scored workflow policy in `src/loader/runtime/workflow_policy.py` instead of the old threshold-only router contract; route decisions now carry winner score, runner-up mode/score, unresolved questions, and a human-readable pressure summary
175
-- initial route, artifact reuse, stale-plan reentry, and handoff metadata now sit on the same typed `ModeDecision` surface, which makes workflow choices easier to persist, inspect, and tune deliberately
176
-- clarify is no longer a single-pass prelude: `src/loader/runtime/conversation.py` now supports a bounded multi-round clarify lane, re-evaluates ambiguity after each answer, and persists unresolved questions when the clarify budget is exhausted
177
-- plan freshness is now an explicit runtime concern: Loader can detect file-drift against persisted plan artifacts, route back through a targeted plan refresh, regenerate implementation/verification artifacts, and hand back to execute without restarting the whole workflow
178
-- non-mutating turns now record verify-skip as an explicit workflow timeline event instead of disappearing through an implicit branch
179
-- workflow history is now persisted as `workflow_timeline` session state via `src/loader/runtime/session.py`, with timeline entries for routes, handoffs, reentries, clarify continuation/exit, plan refresh behavior, and verify skips
180
-- operators now have `loader workflow show [session-id]` plus recent workflow timeline snippets inside `loader session show`, implemented through `src/loader/runtime/inspection.py` and `src/loader/cli/main.py`
181
-- `conversation.py` is slimmer than before Sprint 10 because scoring, clarify review, artifact freshness, and timeline contracts now live in dedicated runtime modules instead of accumulating as coordinator-only heuristics
182
-
183
-### Verification
184
-
185
-- `uv run pytest -q` is green: `188 passed`
186
-- `tests/test_workflow_policy.py` covers scored-route breakdowns, clarify follow-up reviews, artifact-freshness detection, and workflow timeline serialization
187
-- `tests/test_workflow_runtime.py` covers bounded clarify continuation, targeted plan refresh on stale artifacts, verify/fix reentry, and persisted workflow timeline behavior
188
-- `tests/test_inspection.py` covers `loader workflow show`, recent timeline rendering in `loader session show`, and persisted workflow timeline inspection without a live model call
189
-- `tests/test_workflow.py` now aligns legacy router expectations with Sprint 10's scored policy contract instead of the older raw-threshold assumption
190
-- targeted `ruff` checks are green for `src/loader/runtime/inspection.py`, `tests/test_inspection.py`, and `tests/test_workflow.py`; `src/loader/cli/main.py` was also checked for new unused-import regressions
191
-
192
-### Residual debt
193
-
194
-- the new workflow policy is scored, but it is still hand-tuned and text-heuristic; Loader still does not match OMX's deeper ambiguity analysis, route-pressure passes, or richer branch-specific workflow policies
195
-- clarify now has bounded follow-through, but it is still intentionally shallow compared with OMX's deep-interview behavior and does not yet adapt its budget or questioning style by task class
196
-- plan freshness is currently driven by touched-file drift; Loader still does not reason well about semantic task changes, changed acceptance criteria, or broader artifact invalidation
197
-- `loader workflow show` makes workflow evolution inspectable, but the operator UX still stops short of timeline filtering, artifact diffs, or richer prompt/history comparison
198
-- `src/loader/runtime/conversation.py` is smaller and more policy-driven than before Sprint 10, but it still coordinates more workflow behavior than the claw-code references, especially around completion and downstream execution orchestration
.docs/sprints/sprint11.mddeleted
@@ -1,215 +0,0 @@
1
-# Sprint 11: Semantic Signals, Clarify Strategy, and Orchestrator Split
2
-
3
-## Prerequisites
4
-
5
-Sprint 10
6
-
7
-## Goals
8
-
9
-Turn Loader's new workflow policy from a better scorecard into a more structured workflow contract, and keep shrinking the coordinator so policy and orchestration live in dedicated runtime seams instead of collecting back inside `conversation.py`.
10
-
11
-Sprint 10 was a meaningful step forward. Loader now has scored routing, bounded clarify follow-through, plan refresh, and a persisted workflow timeline. That closes a real gap with claw-code and OMX, but the audit is honest about what still hurts:
12
-
13
-- workflow scoring is still hand-tuned and text-heuristic rather than driven by a typed signal model
14
-- clarify has follow-through now, but the questioning strategy is still generic and shallow compared with OMX's deep-interview discipline
15
-- plan freshness is still mostly file-drift based instead of understanding broader semantic invalidation
16
-- workflow history is inspectable, but not yet filtered or summarized around the most useful operator questions
17
-- `conversation.py` is smaller than it was, but it still coordinates more workflow behavior than the refs
18
-
19
-The next leverage point is to stop asking only "what pressure score won?" and start asking "what concrete workflow signals are in play, which task boundaries remain unresolved, and which orchestration module should own the next move?"
20
-
21
-This sprint is about workflow structure:
22
-
23
-- route policy consumes typed workflow signals rather than leaning so heavily on inline heuristics
24
-- clarify becomes intent-aware instead of merely multi-round
25
-- replan discipline becomes more semantic than touched-file drift alone
26
-- workflow inspection becomes more useful for debugging why Loader stayed in or re-entered a lane
27
-- `conversation.py` shrinks again because orchestration moves into dedicated runtime modules
28
-
29
-The references for this sprint are:
30
-
31
-- `refs/claw-code/rust/crates/runtime/src/conversation.rs`
32
-- `refs/claw-code/rust/crates/runtime/src/policy_engine.rs`
33
-- `refs/claw-code/rust/crates/runtime/src/prompt.rs`
34
-- `refs/oh-my-codex/src/ralplan/runtime.ts`
35
-- `refs/oh-my-codex/src/modes/base.ts`
36
-- `refs/oh-my-codex/skills/deep-interview/SKILL.md`
37
-- `refs/oh-my-codex/skills/ralplan/SKILL.md`
38
-
39
-## Deliverables
40
-
41
-### 1. Typed workflow-signal extraction instead of score inputs assembled inline
42
-
43
-Sprint 10 made routing scored. Sprint 11 should make the inputs first-class.
44
-
45
-Implementation targets:
46
-
47
-- introduce a dedicated workflow-signal module under `src/loader/runtime/`, for example around:
48
-  - ambiguity signals
49
-  - complexity signals
50
-  - mutation / verification pressure
51
-  - unresolved clarification slots
52
-  - artifact availability and freshness
53
-  - explicit user workflow requests
54
-  - recent workflow timeline pressure
55
-- separate signal extraction from route scoring so policy code can reason over a typed signal packet rather than rebuilding context ad hoc
56
-- persist enough of the winning signal context to explain:
57
-  - why clarify won over execute
58
-  - why plan refresh was triggered
59
-  - why direct execution was still allowed despite ambiguity
60
-- keep route scoring tunable, but move the fragile task-text heuristics out of the coordinator path
61
-
62
-The goal is not to build a giant intent engine. The goal is to make workflow policy more explainable, testable, and less accidental.
63
-
64
-### 2. Intent-aware clarify strategy instead of generic follow-up rounds
65
-
66
-Loader can now clarify more than once, but it still asks questions in a relatively flat way.
67
-
68
-Implementation targets:
69
-
70
-- define typed clarify objectives or slots such as:
71
-  - desired outcome
72
-  - acceptance criteria
73
-  - constraints
74
-  - non-goals
75
-  - risk boundaries
76
-- choose the next clarify question from unresolved slots instead of using a mostly generic follow-up loop
77
-- adapt clarify behavior based on signal severity and task class while preserving a hard upper bound
78
-- persist why clarify stopped:
79
-  - enough boundaries gathered
80
-  - budget exhausted
81
-  - route pressure shifted toward plan or execute
82
-  - explicit user answer narrowed the scope sufficiently
83
-- carry unresolved slots forward into workflow state and artifacts so later plan/execute decisions can explain what was still uncertain
84
-
85
-This is how Loader gets closer to OMX's deeper interview rigor without turning every task into a long questionnaire.
86
-
87
-### 3. Semantic artifact invalidation and stronger re-plan discipline
88
-
89
-Sprint 10 made plan refresh possible. Sprint 11 should make refresh triggers smarter.
90
-
91
-Implementation targets:
92
-
93
-- enrich planning artifacts with more structured metadata where it materially helps, for example:
94
-  - expected touchpoints
95
-  - acceptance-criteria anchors
96
-  - planned files or subsystems
97
-  - known risks or assumptions
98
-- define broader invalidation triggers beyond file drift, for example:
99
-  - verification evidence contradicts the plan assumptions
100
-  - the implementation touched files or subsystems outside the expected scope
101
-  - acceptance criteria changed materially after clarify or verification
102
-  - the current task wording narrowed or expanded after the plan was written
103
-- distinguish between:
104
-  - targeted plan refresh
105
-  - clarify reentry
106
-  - full re-plan
107
-- keep the runtime disciplined: prefer the smallest valid recovery move instead of restarting workflow lanes casually
108
-
109
-This should move Loader closer to claw-code's stronger artifact discipline, where plans remain live contracts instead of just persisted markdown.
110
-
111
-### 4. Workflow inspection that answers operator questions more directly
112
-
113
-Sprint 10 made workflow history visible. Sprint 11 should make it more usable.
114
-
115
-Implementation targets:
116
-
117
-- extend `loader workflow show` with higher-signal inspection affordances such as:
118
-  - filtering by mode or event kind
119
-  - limiting to the most recent meaningful items
120
-  - clearer summaries for refresh, reentry, and clarify-budget outcomes
121
-- expose the signal/reason context that most directly answers questions like:
122
-  - why did Loader ask again?
123
-  - why did Loader refresh the plan?
124
-  - why did Loader skip verify?
125
-- keep session surfaces concise by surfacing only the most recent or most important workflow events by default
126
-- avoid building a visual UI in this sprint; prioritize text inspection that reduces debugging time immediately
127
-
128
-The goal is not prettier output. The goal is faster workflow debugging and better operator trust.
129
-
130
-### 5. Continue shrinking `conversation.py` into a coordinator over runtime modules
131
-
132
-Sprint 10 improved the split, but the coordinator still owns too much sequencing logic.
133
-
134
-Implementation targets:
135
-
136
-- extract additional orchestration seams under `src/loader/runtime/`, likely around:
137
-  - signal extraction
138
-  - clarify-lane control
139
-  - plan refresh / invalidation decisions
140
-  - workflow timeline append policy
141
-- make `ConversationRuntime.run_turn(...)` read more like:
142
-  - collect turn state
143
-  - compute workflow signals
144
-  - ask policy/orchestrator for the next lane decision
145
-  - delegate lane execution
146
-  - persist summary and timeline outcomes
147
-- keep completion and downstream workflow handoff logic out of the signal-extraction path
148
-- avoid replacing one monolith with another; new orchestration modules should have narrow responsibilities and direct tests
149
-
150
-A good outcome is that `conversation.py` keeps shrinking because ownership is clearer, not because behavior gets hidden.
151
-
152
-## Testing strategy
153
-
154
-- unit coverage for:
155
-  - typed workflow-signal extraction and normalization
156
-  - route-policy scoring over structured signals
157
-  - clarify-slot progression and stop reasons
158
-  - semantic invalidation triggers and targeted recovery selection
159
-- CLI coverage for:
160
-  - `loader workflow show` filtering and summarization
161
-  - session/workflow output for clarify exhaustion, plan refresh, and reentry reasons
162
-- deterministic/runtime coverage for:
163
-  - ambiguous tasks where clarify chooses different follow-up questions based on unresolved slots
164
-  - verification failure that triggers plan refresh vs clarify reentry based on typed invalidation reasons
165
-  - tasks that remain executable even with mild ambiguity because stronger signals favor direct execution
166
-  - Sprint 00-10 parity scenarios staying green after the workflow-policy split deepens again
167
-- regression coverage:
168
-  - route policy should consume typed signals rather than rebuilding them ad hoc inside the coordinator
169
-  - workflow inspection should continue to work after session resume and compaction
170
-
171
-## Definition of done
172
-
173
-- Loader extracts typed workflow signals before route scoring
174
-- clarify behavior is intent-aware and persists why it continued or stopped
175
-- plan refresh uses richer invalidation reasons than file drift alone
176
-- workflow inspection better explains reentry, refresh, and clarify behavior
177
-- `conversation.py` is slimmer again and more coordinator-like
178
-- the full parity baseline remains green after the deeper workflow-policy split
179
-
180
-## Explicitly out of scope
181
-
182
-- full OMX-style consensus planning
183
-- a visual workflow timeline UI
184
-- a first-class permission rule editor
185
-- AST-aware, LSP-aware, or symbol-aware editing
186
-- multi-agent or team orchestration
187
-
188
-## Audit
189
-
190
-### Landed
191
-
192
-- Loader now extracts typed workflow signals in `src/loader/runtime/workflow_signals.py`, and route decisions persist `signal_summary` context so we can explain why clarify, plan, or direct execute won without rebuilding those heuristics inside the coordinator
193
-- clarify is now intent-aware instead of generic: `src/loader/runtime/clarify_strategy.py` defines explicit slots such as desired outcome, non-goals, acceptance criteria, constraints, decision boundaries, and likely touchpoints, and the runtime now persists why clarify continued or stopped around those slots
194
-- replan discipline is broader and more honest: `src/loader/runtime/artifact_invalidation.py` can now distinguish targeted plan refresh, clarify reentry, and full re-plan based on semantic drift instead of only touched-file mismatch, and `src/loader/runtime/conversation.py` routes those recovery moves explicitly
195
-- workflow inspection is more useful for actual operator questions: `loader workflow show` now supports mode/kind filtering and entry limits, and `src/loader/runtime/inspection.py` plus `src/loader/cli/main.py` surface concise workflow highlights for re-asks, reentries, refreshes, and verify skips
196
-- `conversation.py` is slimmer again because clarify/plan lane execution moved into `src/loader/runtime/workflow_lanes.py`, which now owns lane prompts, artifact writes, clarify follow-up handling, and plan todo seeding while the main runtime acts more like a coordinator
197
-
198
-### Verification
199
-
200
-- `uv run pytest -q` is green: `197 passed`
201
-- `tests/test_workflow_signals.py` covers typed signal extraction, recent timeline pressure, and persisted `signal_summary` state
202
-- `tests/test_clarify_strategy.py` covers slot prioritization and targeted clarify questions
203
-- `tests/test_artifact_invalidation.py` covers semantic invalidation and recovery-mode selection
204
-- `tests/test_workflow_runtime.py` covers intent-aware clarify continuation, targeted plan refresh, and full re-plan through clarify reentry
205
-- `tests/test_inspection.py` covers `loader workflow show` filtering/highlights plus session/status workflow inspection
206
-- targeted `ruff` checks are green for `src/loader/runtime/workflow_signals.py`, `src/loader/runtime/clarify_strategy.py`, `src/loader/runtime/artifact_invalidation.py`, `src/loader/runtime/workflow_policy.py`, `src/loader/runtime/workflow_lanes.py`, `src/loader/runtime/conversation.py`, `src/loader/runtime/inspection.py`, and the new/expanded workflow inspection tests
207
-- `uv run python -m compileall src/loader/cli/main.py` is green; full-file `ruff` on `src/loader/cli/main.py` still inherits the repo's older line-length backlog, so CLI verification here is anchored primarily by the inspection command tests
208
-
209
-### Residual debt
210
-
211
-- typed workflow signals are a better contract than inline heuristics, but they are still built from hand-tuned text/runtime cues rather than OMX-style ambiguity scoring, evidence passes, or richer task semantics
212
-- clarify is now slot-driven, but it still stops well short of OMX's deep-interview pressure-pass discipline, repository-backed fact gathering, and task-profile-dependent depth
213
-- artifact invalidation is broader now, but it is still lightweight and text-based; Loader does not yet reason over richer artifact metadata, deeper verification contradictions, or more explicit assumption tracking
214
-- `loader workflow show` now answers the common operator questions much better, but it still lacks artifact diffs, prompt-history comparison, and richer timeline drill-down ergonomics
215
-- `src/loader/runtime/conversation.py` is more coordinator-like than before Sprint 11, but it still owns the main turn loop, completion policy handoff, and some recovery sequencing that claw-code keeps in even more dedicated seams
.docs/sprints/sprint12.mddeleted
@@ -1,204 +0,0 @@
1
-# Sprint 12: Interview Pressure, Semantic Evidence, and Turn Orchestration
2
-
3
-## Prerequisites
4
-
5
-Sprint 11
6
-
7
-## Goals
8
-
9
-Turn Loader's newer workflow structure into a more disciplined execution contract by deepening clarify beyond slot selection, making semantic invalidation rely on richer evidence than text overlap alone, and shrinking the main turn loop into a clearer orchestration shell.
10
-
11
-Sprint 11 closed several real gaps. Loader now has typed workflow signals, slot-aware clarify, semantic invalidation, better workflow inspection, and a slimmer coordinator. That is meaningful progress toward claw-code and OMX, but the audit is honest about what still hurts:
12
-
13
-- typed workflow signals are still hand-tuned runtime heuristics rather than a deeper ambiguity/evidence model
14
-- clarify is more intentional now, but it still lacks OMX's pressure-pass discipline, evidence-chasing, and codebase-backed interview style
15
-- artifact invalidation is broader than file drift, but it still reasons from lightweight text overlap instead of richer structured evidence
16
-- `conversation.py` is smaller, but it still owns the main assistant/recovery/completion orchestration loop that the refs spread across narrower runtime seams
17
-
18
-The next leverage point is to stop treating clarify as "ask a better next question" and start treating it as "run a bounded interview with explicit pressure passes, factual grounding, and a stronger handoff contract for later execution."
19
-
20
-This sprint is about execution rigor:
21
-
22
-- clarify gains pressure-pass behavior instead of only slot-follow-up behavior
23
-- semantic invalidation uses richer structured evidence and contradiction tracking
24
-- the main turn loop shrinks again by delegating orchestration checkpoints into dedicated runtime modules
25
-- Loader gets closer to closed-source agentic tools not by more prompt prose, but by stronger workflow contracts
26
-
27
-The references for this sprint are:
28
-
29
-- `refs/claw-code/rust/crates/runtime/src/conversation.rs`
30
-- `refs/claw-code/rust/crates/runtime/src/policy_engine.rs`
31
-- `refs/claw-code/rust/crates/runtime/src/prompt.rs`
32
-- `refs/oh-my-codex/src/ralplan/runtime.ts`
33
-- `refs/oh-my-codex/src/modes/base.ts`
34
-- `refs/oh-my-codex/skills/deep-interview/SKILL.md`
35
-- `refs/oh-my-codex/skills/ralplan/SKILL.md`
36
-
37
-## Deliverables
38
-
39
-### 1. Pressure-pass clarify controller instead of slot selection alone
40
-
41
-Sprint 11 made clarify targeted. Sprint 12 should make it disciplined.
42
-
43
-Implementation targets:
44
-
45
-- introduce a dedicated clarify controller under `src/loader/runtime/` that tracks:
46
-  - current interview stage
47
-  - weakest clarity dimension
48
-  - whether a pressure pass has occurred
49
-  - whether non-goals and decision boundaries are explicit
50
-  - how much interview budget remains
51
-- extend clarify reasoning beyond "what slot is unresolved?" to also ask:
52
-  - was the last answer too broad?
53
-  - has this assumption been challenged yet?
54
-  - do we still need an example, counterexample, tradeoff, or explicit stop boundary?
55
-- persist clarify progress in structured form so later workflow decisions can explain:
56
-  - which dimension clarify was targeting
57
-  - whether Loader was still gathering boundaries
58
-  - whether it stopped because the budget was exhausted or because readiness gates were met
59
-- keep it bounded and pragmatic:
60
-  - no unbounded interviews
61
-  - no long questionnaires
62
-  - one question at a time with explicit stop conditions
63
-
64
-The goal is not to copy OMX wholesale. The goal is to adopt the parts that materially reduce misaligned execution and premature planning.
65
-
66
-### 2. Codebase-backed clarify grounding and stronger requirement artifacts
67
-
68
-Sprint 11 still relies mostly on the user answer plus task text. Sprint 12 should let clarify lean on facts Loader can gather directly.
69
-
70
-Implementation targets:
71
-
72
-- add a lightweight preflight/context seam for brownfield tasks that can feed clarify with discovered facts before asking the user for repository details
73
-- prefer evidence-backed clarify questions when Loader already knows something, for example:
74
-  - "I found X in Y. Should this change follow that pattern?"
75
-  - "The current touchpoints appear to be A and B. Should I keep C out of scope?"
76
-- persist richer clarify artifact metadata where it helps downstream runtime behavior, for example:
77
-  - explicit non-goal status
78
-  - explicit decision-boundary status
79
-  - whether a pressure pass occurred
80
-  - likely touchpoint evidence
81
-  - inferred vs confirmed boundaries
82
-- keep this grounded in Loader's existing tool surface rather than inventing a large research subsystem
83
-
84
-This moves Loader closer to OMX's "reduce user effort and don't ask for facts we can discover" principle.
85
-
86
-### 3. Structured semantic evidence for invalidation and replan decisions
87
-
88
-Sprint 11 improved invalidation, but it still reasons mostly from text coverage. Sprint 12 should give recovery choices a stronger evidence model.
89
-
90
-Implementation targets:
91
-
92
-- define a structured invalidation/evidence contract under `src/loader/runtime/`, for example around:
93
-  - confirmed touchpoints
94
-  - inferred touchpoints
95
-  - acceptance anchors
96
-  - contradicted assumptions
97
-  - verification contradiction signals
98
-  - changed user boundaries after clarify
99
-- teach invalidation to distinguish:
100
-  - plan mismatch
101
-  - brief contradiction
102
-  - verification contradiction
103
-  - stale assumptions
104
-- improve recovery selection so Loader can explain not only what it chose, but what evidence forced that choice
105
-- preserve "smallest valid recovery move first" as the governing behavior
106
-
107
-This is how Loader gets from "semantic-ish refresh" to a more trustworthy workflow contract.
108
-
109
-### 4. Turn orchestration split beyond lane execution
110
-
111
-Sprint 11 moved clarify/plan lanes out. Sprint 12 should keep shrinking the top-level turn loop.
112
-
113
-Implementation targets:
114
-
115
-- extract additional runtime seams under `src/loader/runtime/`, likely around:
116
-  - turn preparation/bootstrap
117
-  - workflow recovery/reentry control
118
-  - completion/continuation orchestration
119
-  - assistant-response repair routing
120
-- make `ConversationRuntime.run_turn(...)` read more like:
121
-  - initialize turn state
122
-  - prepare workflow contract
123
-  - delegate iteration/orchestration helpers
124
-  - finalize summary
125
-- avoid creating a new monolith module; prefer narrow orchestration seams with direct tests
126
-
127
-A good outcome is that the turn loop becomes easier to reason about and less likely to collect ad hoc behavior again.
128
-
129
-### 5. Workflow/operator surfaces that explain evidence, not just decisions
130
-
131
-Sprint 11 made `loader workflow show` more useful. Sprint 12 should make it explain the evidence behind recovery and clarify pressure more directly.
132
-
133
-Implementation targets:
134
-
135
-- extend workflow inspection surfaces to show:
136
-  - whether a pressure pass occurred
137
-  - which clarify dimension was active
138
-  - which evidence triggered refresh or reentry
139
-  - which assumptions were still unresolved
140
-- keep the default UX concise, but expose richer detail when explicitly requested
141
-- avoid a visual UI in this sprint; prioritize text surfaces that make the runtime easier to debug immediately
142
-
143
-## Testing strategy
144
-
145
-- unit coverage for:
146
-  - clarify pressure-pass progression and readiness gates
147
-  - codebase-backed clarify question selection from discovered facts
148
-  - structured invalidation evidence and contradiction handling
149
-  - new orchestration seams preserving current turn behavior
150
-- CLI coverage for:
151
-  - workflow inspection showing clarify pressure/evidence
152
-  - session/workflow output for contradiction-driven reentry
153
-- deterministic/runtime coverage for:
154
-  - ambiguous brownfield tasks where Loader asks evidence-backed clarify questions
155
-  - tasks that need an assumption/tradeoff pressure pass before planning
156
-  - verification contradictions that trigger targeted refresh vs full re-plan
157
-  - Sprint 00-11 parity scenarios staying green after the deeper orchestration split
158
-- regression coverage:
159
-  - clarify should not ask the user for repository facts Loader can gather directly
160
-  - orchestration extraction should not regress the verify/fix or permission/runtime contracts
161
-
162
-## Definition of done
163
-
164
-- clarify uses a bounded pressure-pass controller rather than slot selection alone
165
-- brownfield clarify can ask evidence-backed questions from discovered facts
166
-- invalidation relies on richer structured evidence and contradiction tracking
167
-- workflow/operator surfaces explain clarify and recovery evidence more directly
168
-- `conversation.py` is slimmer again and more orchestration-shell-like
169
-- the full parity baseline remains green after the deeper clarify/orchestration split
170
-
171
-## Explicitly out of scope
172
-
173
-- full OMX-style consensus planning
174
-- a visual workflow timeline UI
175
-- a first-class permission rule editor
176
-- AST-aware, LSP-aware, or symbol-aware editing
177
-- multi-agent or team orchestration
178
-
179
-## Audit
180
-
181
-### Landed
182
-
183
-- clarify now has explicit pressure-pass discipline instead of only slot-follow-up behavior: `src/loader/runtime/clarify_strategy.py`, `src/loader/runtime/workflow_policy.py`, and `src/loader/runtime/workflow_lanes.py` track readiness gates such as `non_goals`, `decision_boundaries`, and `pressure_pass`, and can drive later clarify rounds toward examples, tradeoffs, and challenged assumptions
184
-- brownfield clarify is now grounded in discovered workspace evidence instead of relying only on user answers and task text: `src/loader/runtime/clarify_grounding.py` feeds repo paths, repo facts, slot-aware evidence, pressure-aware evidence, and grounded brief hints into clarify prompts, fallback questions, and persisted brief synthesis
185
-- invalidation and recovery now use richer structured evidence than file drift alone: `src/loader/runtime/artifact_invalidation.py`, `src/loader/runtime/workflow_policy.py`, and `src/loader/runtime/workflow_recovery.py` now distinguish confirmed touchpoints, inferred touchpoints, acceptance anchors, contradicted assumptions, verification contradictions, and task-boundary drift, and that evidence is surfaced through workflow inspection
186
-- workflow/operator surfaces now explain clarify pressure and recovery evidence more directly: `src/loader/runtime/inspection.py` and `src/loader/cli/main.py` surface pressure metadata, recovery evidence, and the newer workflow history context instead of only route labels
187
-- the runtime shell is now genuinely controller-based instead of monolithic: `src/loader/runtime/workflow_recovery.py`, `src/loader/runtime/turn_preparation.py`, `src/loader/runtime/turn_completion.py`, `src/loader/runtime/turn_iteration.py`, `src/loader/runtime/turn_preamble.py`, `src/loader/runtime/workflow_state.py`, and `src/loader/runtime/turn_loop.py` now own distinct orchestration seams, and `src/loader/runtime/conversation.py` is down to a compact coordinator
188
-
189
-### Verification
190
-
191
-- `uv run pytest -q` is green: `231 passed`
192
-- `tests/test_clarify_strategy.py` covers pressure-pass reviews, readiness gates, and later-round clarify pressure selection
193
-- `tests/test_clarify_grounding.py` covers workspace evidence extraction, slot-aware evidence selection, pressure-aware grounding, and grounded brief hints
194
-- `tests/test_artifact_invalidation.py`, `tests/test_workflow_policy.py`, `tests/test_workflow_runtime.py`, and `tests/test_inspection.py` cover structured drift evidence, contradiction-driven recovery, workflow pressure metadata, and operator-facing recovery summaries
195
-- `tests/test_turn_preparation.py`, `tests/test_turn_completion.py`, `tests/test_turn_iteration.py`, `tests/test_turn_preamble.py`, `tests/test_workflow_state.py`, and `tests/test_turn_loop.py` give direct coverage to the new controller seams instead of relying only on large end-to-end runtime tests
196
-- targeted `ruff` checks stayed green on the touched runtime/controller modules and their new tests throughout the extraction work, and the full suite remained green after each slice
197
-
198
-### Residual debt
199
-
200
-- clarify is now pressure-aware and grounded, but it is still bounded and lighter than OMX's deeper interview style; Loader still does not adapt interview depth by task class or run richer challenge/consensus passes
201
-- the new invalidation evidence is a much better contract than text overlap alone, but it is still runtime-authored and heuristic; Loader still does not use deeper semantic reasoning over artifacts, symbols, or model-assisted contradiction analysis
202
-- `src/loader/runtime/conversation.py` is now a real coordinator, but `src/loader/runtime/turn_iteration.py` remains the heaviest seam and still carries a fair amount of repair/completion/tool-routing policy that claw-code spreads across even narrower runtime modules
203
-- workflow/operator surfaces explain more than they did at Sprint 11, but they still stop short of artifact diffs, prompt/history comparison, and richer timeline drill-down
204
-- Loader is much closer to a controller-based runtime than it was at the start of Sprint 12, but it still does not match claw-code or OMX on deeper planning rigor, semantic artifact discipline, or broader operator ergonomics
.docs/sprints/sprint13.mddeleted
@@ -1,203 +0,0 @@
1
-# Sprint 13: Turn Policy Narrowing, Assumption Ledger, and Artifact Diffs
2
-
3
-## Prerequisites
4
-
5
-Sprint 12
6
-
7
-## Goals
8
-
9
-Turn Loader's newly controllerized runtime into a more semantically explicit workflow system by shrinking the still-heavy `turn_iteration` seam, promoting assumptions and contradictions into first-class workflow state, and giving operators diff-oriented artifact visibility instead of only latest-state inspection.
10
-
11
-Sprint 12 was a real structural win. Loader now has pressure-pass clarify, codebase-backed grounding, structured recovery evidence, and a controller-shaped runtime shell. That meaningfully closes the gap with claw-code and OMX. The audit is also honest about what still hurts:
12
-
13
-- `turn_iteration.py` is still carrying a lot of repair, tool-routing, and completion policy in one seam
14
-- contradiction and invalidation evidence are richer than before, but they are still mostly runtime-authored summaries rather than a reusable semantic ledger
15
-- operator surfaces can explain "why did this happen?" better than before, but they still cannot show "what changed?" across briefs, plans, verification, or prompt contracts
16
-- Loader now has better workflow discipline, but it still lacks some of the day-two operator ergonomics that make claw-code and OMX easier to trust during long tasks
17
-
18
-The next leverage point is to stop treating semantic drift and operator visibility as one-off summaries and start treating them as durable contracts:
19
-
20
-- the turn runtime should classify and route assistant output through narrower policy seams
21
-- assumptions, contradictions, and acceptance anchors should survive across workflow phases as explicit state
22
-- inspection should be able to show diffs between the artifacts and prompt contracts that drove behavior
23
-
24
-This sprint is about making Loader more inspectable and less accidental:
25
-
26
-- `turn_iteration` shrinks into narrower policy-oriented seams
27
-- workflow invalidation gains an explicit assumption/contradiction ledger
28
-- operator tooling gains artifact and prompt diff visibility
29
-- Loader gets closer to claw-code not just in structure, but in debuggability
30
-
31
-The references for this sprint are:
32
-
33
-- `refs/claw-code/rust/crates/runtime/src/conversation.rs`
34
-- `refs/claw-code/rust/crates/runtime/src/policy_engine.rs`
35
-- `refs/claw-code/rust/crates/runtime/src/prompt.rs`
36
-- `refs/claw-code/PARITY.md`
37
-- `refs/oh-my-codex/src/ralplan/runtime.ts`
38
-- `refs/oh-my-codex/src/modes/base.ts`
39
-- `refs/oh-my-codex/src/verification/verifier.ts`
40
-- `refs/oh-my-codex/skills/deep-interview/SKILL.md`
41
-- `refs/oh-my-codex/skills/ralplan/SKILL.md`
42
-
43
-## Deliverables
44
-
45
-### 1. Split `turn_iteration` into narrower response-policy seams
46
-
47
-Sprint 12 made `conversation.py` coordinator-shaped. Sprint 13 should keep the same discipline for the still-heavy iteration seam.
48
-
49
-Implementation targets:
50
-
51
-- extract narrower helpers under `src/loader/runtime/`, likely around:
52
-  - assistant-response classification
53
-  - repair routing
54
-  - final-answer routing
55
-  - tool-batch routing
56
-  - no-tool completion handoff
57
-- make `turn_iteration.py` read more like:
58
-  - request assistant turn
59
-  - classify response
60
-  - delegate the winning route
61
-  - return loop-state deltas
62
-- keep the main behavior unchanged while reducing policy density per module
63
-- add direct controller tests so future iteration changes do not depend only on broad runtime integration coverage
64
-
65
-The goal is not more files for their own sake. The goal is to make assistant-turn behavior easier to tune deliberately.
66
-
67
-### 2. Assumption and contradiction ledger instead of one-off evidence summaries
68
-
69
-Sprint 12 introduced richer drift evidence. Sprint 13 should make that evidence durable and reusable.
70
-
71
-Implementation targets:
72
-
73
-- define a typed workflow ledger contract under `src/loader/runtime/` for:
74
-  - explicit assumptions
75
-  - confirmed assumptions
76
-  - contradicted assumptions
77
-  - acceptance anchors
78
-  - open decision boundaries
79
-  - closed decision boundaries
80
-- thread that ledger through clarify, planning, verification, and recovery instead of only summarizing evidence at refresh time
81
-- persist enough structure to answer:
82
-  - which assumption was invalidated?
83
-  - which workflow phase introduced it?
84
-  - what evidence contradicted it?
85
-  - whether the contradiction forced refresh, reentry, or only inspection visibility
86
-- keep the first version pragmatic and text-first; do not try to build a symbolic reasoning engine
87
-
88
-This is how Loader gets from "richer summaries" to a more explicit semantic workflow contract.
89
-
90
-### 3. Artifact and prompt diff surfaces for operators
91
-
92
-Loader can now show the latest prompt and workflow timeline. Sprint 13 should help operators see what changed.
93
-
94
-Implementation targets:
95
-
96
-- add diff-oriented inspection surfaces, likely around:
97
-  - clarify brief vs refreshed brief
98
-  - old plan vs refreshed plan
99
-  - workflow ledger changes across reentry
100
-  - prompt metadata or prompt-body diffs across relevant turns
101
-- keep the product surface text-first and operator-friendly, for example via:
102
-  - `loader workflow show --diff`
103
-  - `loader prompt diff`
104
-  - or an equivalent `loader artifact show` family if that is cleaner
105
-- include concise change summaries by default and fuller diffs when explicitly requested
106
-- avoid a visual UI in this sprint; prioritize fast CLI/TUI debugging value
107
-
108
-The goal is to make workflow changes legible, not just persisted.
109
-
110
-### 4. Workflow/operator surfaces that explain semantic change, not only event history
111
-
112
-Sprint 12 improved evidence visibility. Sprint 13 should improve semantic visibility.
113
-
114
-Implementation targets:
115
-
116
-- extend inspection surfaces so they can show:
117
-  - which assumptions remain open
118
-  - which assumptions were contradicted
119
-  - which acceptance anchors changed across clarify/plan/verify
120
-  - whether a refresh was forced by contradiction, touchpoint drift, or acceptance drift
121
-- preserve concise defaults so everyday status remains readable
122
-- make session/workflow output useful for long-running or resumed tasks, not only single-turn debugging
123
-
124
-This brings Loader closer to claw-code's stronger operator trust model.
125
-
126
-### 5. Keep the parity baseline honest while the runtime narrows again
127
-
128
-Sprint 12 closed a big structural loop. Sprint 13 should protect that gain.
129
-
130
-Implementation targets:
131
-
132
-- add direct tests for the newly split iteration policy seams
133
-- extend workflow/inspection coverage for diff and ledger behavior
134
-- keep existing parity scenarios green after the iteration split
135
-- update `PARITY.md` and the sprint audit only after the new surfaces and contracts are actually covered
136
-
137
-## Testing strategy
138
-
139
-- unit coverage for:
140
-  - response classification and per-route delegation
141
-  - assumption-ledger updates and contradiction recording
142
-  - artifact/prompt diff formatting and summaries
143
-  - workflow refresh decisions reading from the new ledger state
144
-- CLI coverage for:
145
-  - prompt/artifact/workflow diff surfaces
146
-  - workflow/session output for contradiction-led refreshes
147
-- deterministic/runtime coverage for:
148
-  - a clarify answer that seeds assumptions later contradicted during verification
149
-  - a plan refresh where the operator surface can show exactly what changed
150
-  - a resumed session where workflow inspection still reflects semantic ledger state
151
-  - Sprint 00-12 parity scenarios staying green after the deeper iteration split
152
-- regression coverage:
153
-  - iteration refactors should not regress verify/fix, permission, or explore contracts
154
-  - diff surfaces should read persisted artifacts/session state rather than reconstructing history heuristically
155
-
156
-## Definition of done
157
-
158
-- `turn_iteration.py` is slimmer and delegates through narrower response-policy seams
159
-- assumptions and contradictions are persisted as explicit workflow state
160
-- operators can inspect artifact or prompt diffs from the product surface
161
-- workflow inspection explains semantic change, not only route history
162
-- the full parity baseline remains green after the deeper iteration split
163
-
164
-## Explicitly out of scope
165
-
166
-- full OMX-style consensus planning
167
-- a visual workflow diff UI
168
-- AST-aware, LSP-aware, or symbol-aware editing
169
-- a first-class permission rule editor
170
-- multi-agent or team orchestration
171
-
172
-## Audit
173
-
174
-### Status
175
-
176
-- Sprint 13 is complete. The semantic ledger, prompt/artifact diff surfaces, deletion-oriented runtime cleanup, and narrower assistant-response routing seams are now all landed and covered.
177
-
178
-### Landed
179
-
180
-- the runtime is less accidental and less puppet-like even before a deeper iteration split: `src/loader/runtime/turn_preamble.py`, `src/loader/runtime/repair.py`, `src/loader/runtime/turn_iteration.py`, `src/loader/runtime/turn_completion.py`, and `src/loader/runtime/completion_policy.py` no longer inject synthetic prefill, no longer puppet repeated empty responses, no longer scold fake-tool narration or deflection through injected reroutes, and no longer bounce long no-tool answers through the self-critique reroute
181
-- `turn_iteration.py` is now routing through a dedicated response-policy seam instead of owning final-answer, tool-batch, and no-tool dispatch inline: `src/loader/runtime/response_routing.py` now owns classified assistant-response routing, and `src/loader/runtime/turn_iteration.py` is correspondingly smaller and closer to a request/repair/loop-state controller
182
-- assumptions, contradictions, acceptance anchors, and decision-boundary state are now persisted as an explicit workflow ledger instead of one-off summaries: `src/loader/runtime/workflow_ledger.py` and `src/loader/runtime/session.py` define and persist the ledger, `src/loader/runtime/workflow_lanes.py` seeds it from clarify/plan artifacts, and `src/loader/runtime/workflow_recovery.py` updates it from contradiction and freshness evidence
183
-- workflow/operator surfaces now explain semantic change more directly: `src/loader/runtime/inspection.py` and `src/loader/cli/main.py` expose the workflow ledger, contradiction highlights, and richer workflow history so operators can see which assumptions remain open, which were contradicted, and which acceptance anchors changed
184
-- prompt and artifact change visibility is now a product surface instead of an inferred debugging exercise: `src/loader/runtime/prompt_history.py`, `src/loader/runtime/session.py`, `src/loader/runtime/inspection.py`, `src/loader/agent/loop.py`, and `src/loader/cli/main.py` persist prompt snapshots, add `loader prompt diff`, and add `loader workflow show --diff` with concise summaries by default and fuller unified diffs on demand
185
-- the parity baseline stayed green while the semantic contract expanded and `turn_iteration` narrowed: the new coverage lives in `tests/test_runtime_repair_flows.py`, `tests/test_response_routing.py`, `tests/test_workflow_ledger.py`, `tests/test_session_state.py`, `tests/test_inspection.py`, `tests/test_turn_completion.py`, and the existing runtime/workflow inspection tests
186
-
187
-### Verification
188
-
189
-- `uv run pytest -q` is green: `247 passed`
190
-- `tests/test_runtime_repair_flows.py` covers the honest empty-response retry path, no synthetic prefill on first turns, and the absence of the older no-tool puppeting/scolding behavior
191
-- `tests/test_response_routing.py` covers direct final-answer routing and halted tool-batch routing without relying only on the larger iteration loop tests
192
-- `tests/test_workflow_ledger.py` covers ledger seeding, contradiction tracking, acceptance-anchor updates, and operator-facing highlight summaries
193
-- `tests/test_session_state.py` covers persistence of the workflow ledger and prompt snapshot history across saved/resumed sessions
194
-- `tests/test_inspection.py` covers workflow-ledger inspection, `loader prompt diff`, and `loader workflow show --diff` against persisted prompt and artifact history
195
-- the earlier Sprint 12 controller/runtime coverage remained green after the repair cleanup and semantic-state additions, so Sprint 13 did not regress the controllerized turn runtime while adding richer inspection state
196
-
197
-### Residual debt
198
-
199
-- `src/loader/runtime/turn_iteration.py` is no longer the main policy knot, but `src/loader/runtime/response_routing.py` and `src/loader/runtime/tool_batches.py` still carry more heuristic response/tool policy than the narrower reference seams in claw-code
200
-- the workflow ledger is intentionally pragmatic and text-first; Loader still does not have deeper symbolic reasoning, model-authored contradiction analysis, or richer provenance for every semantic state change
201
-- prompt and artifact diffs are based on persisted snapshots and versioned text artifacts; Loader still does not offer pre-run candidate prompt comparison, semantic/AST-aware artifact diffs, or richer visual timeline tooling
202
-- older sessions may not have prompt-history or artifact-history depth comparable to new sessions, so the newest diff surfaces are strongest on sessions created after the Sprint 13 persistence changes
203
-- Loader is more inspectable and less accidental than it was at the end of Sprint 12, but it still does not match claw-code or OMX on deeper planning rigor, semantic artifact discipline, or broader day-two operator ergonomics
.docs/sprints/sprint14.mddeleted
@@ -1,190 +0,0 @@
1
-# Sprint 14: Runtime Context Adoption, Legacy Burn-Down, and Policy Narrowing
2
-
3
-## Prerequisites
4
-
5
-Sprint 13
6
-
7
-## Goals
8
-
9
-Turn the newly merged audit-line cleanup into a first-class runtime contract by promoting `RuntimeContext` from a compatibility seam into the primary execution boundary, burning down more agent-owned policy/services, and narrowing the still-heavy response/tool policy helpers.
10
-
11
-Sprint 13 closed a real loop. Loader now has a semantic workflow ledger, prompt/artifact diff surfaces, a more honest no-tool completion path, and a dedicated response-routing seam. The audit branch is now merged too, which changes the next leverage point in an important way:
12
-
13
-- Loader now has runtime-owned context, parsing, recovery, rollback, safeguard, and task-classification modules on `trunk`
14
-- `src/loader/runtime/` no longer imports `agent/*` directly, and the merged tree is green at `286 passed`
15
-- but several of those modules are still only partially adopted, with compatibility layers and legacy agent hooks still carrying too much behavior
16
-- `response_routing.py` and `tool_batches.py` are better than the old inline loop, but they still carry more heuristic policy than the narrower seams in claw-code
17
-- `agent/loop.py` is smaller in responsibility than it used to be, but it is still doing too much as a holder for runtime-owned decisions, helper methods, and policy services
18
-- `agent/reasoning.py` and `agent/safeguards.py` are no longer hidden runtime implementations, but they still own meaningful legacy behavior behind explicit seams
19
-
20
-The next leverage point is to stop treating the merged audit runtime pieces as optional support modules and start treating them as the default execution contract:
21
-
22
-- `RuntimeContext` should become the normal runtime boundary, not only a bridge for selected helpers and tests
23
-- runtime-owned parsing/recovery/rollback/safeguard/task-classification services should replace more legacy agent ownership
24
-- response/tool policy should narrow further now that the runtime service surface is richer
25
-
26
-This sprint is about consolidating the merge into a cleaner architecture rather than leaving the audit branch as a large historical merge with only partial adoption.
27
-
28
-The references for this sprint are:
29
-
30
-- `refs/claw-code/rust/crates/runtime/src/conversation.rs`
31
-- `refs/claw-code/rust/crates/runtime/src/policy_engine.rs`
32
-- `refs/claw-code/rust/crates/runtime/src/prompt.rs`
33
-- `refs/claw-code/PARITY.md`
34
-- `.docs/audit_sprints/trunk_sitrep.md`
35
-- `.docs/audit_sprints/sprint13_closure.md`
36
-- `refs/oh-my-codex/src/ralplan/runtime.ts`
37
-- `refs/oh-my-codex/src/verification/verifier.ts`
38
-
39
-## Deliverables
40
-
41
-### 1. Promote `RuntimeContext` from bridge to primary runtime contract
42
-
43
-The merged audit branch introduced a typed runtime context. Sprint 14 should make that seam the default runtime boundary for more helpers.
44
-
45
-Implementation targets:
46
-
47
-- adopt `src/loader/runtime/context.py` across more runtime controllers and services so they consume typed runtime state instead of direct `Agent` access where practical
48
-- reduce direct runtime dependencies on:
49
-  - `Agent._extract_raw_json_tool_calls(...)`
50
-  - `Agent._assess_confidence(...)`
51
-  - `Agent._verify_action(...)`
52
-  - ad hoc steering/recovery workflow hooks
53
-- make the remaining `legacy` services in `RuntimeContext` explicit migration seams instead of long-term hidden dependencies
54
-- shrink the number of callbacks exposed through `RuntimeLegacyServices`, or narrow them so their contracts are obviously transitional
55
-- keep the contract pragmatic:
56
-  - do not rewrite the whole runtime in one sweep
57
-  - prefer controller/service boundaries that improve testability immediately
58
-
59
-The goal is not purity for its own sake. The goal is to make runtime behavior easier to reason about and less likely to regress when we continue narrowing policy seams.
60
-
61
-### 2. Burn down more agent-owned runtime services
62
-
63
-The merged branch brought runtime-owned modules onto `trunk`. Sprint 14 should make more of them actually own behavior.
64
-
65
-Implementation targets:
66
-
67
-- move more effective ownership for the following out of `agent/loop.py` and into runtime modules:
68
-  - task classification
69
-  - raw-text parsing and tool-call recovery
70
-  - rollback planning helpers
71
-  - safeguard services
72
-  - recovery prompts / retry guidance
73
-- make the remaining `agent/reasoning.py` and `agent/safeguards.py` callbacks inventoryable and explicit, so we can say which ones are still required and which are just legacy inertia
74
-- keep `agent/loop.py` focused on:
75
-  - public entrypoints
76
-  - user/session-facing orchestration
77
-  - compatibility wrappers that truly still need to exist
78
-- avoid duplicating logic across `agent/*` and `runtime/*`; prefer one real implementation plus compatibility exports only where needed
79
-
80
-This is how the merged audit work becomes structural improvement instead of passive file accumulation.
81
-
82
-### 3. Narrow `response_routing.py` and `tool_batches.py`
83
-
84
-Sprint 13 moved response dispatch out of `turn_iteration.py`. Sprint 14 should keep pushing the same discipline into the next heavy seams.
85
-
86
-Implementation targets:
87
-
88
-- split `src/loader/runtime/response_routing.py` into narrower policy helpers where it pays off, likely around:
89
-  - final-answer routing
90
-  - raw-text tool routing
91
-  - no-tool completion routing
92
-  - halt/finalize decision shaping
93
-- split `src/loader/runtime/tool_batches.py` more deliberately around:
94
-  - confidence gate
95
-  - recovery handling
96
-  - DoD post-tool bookkeeping
97
-  - post-tool verification
98
-- keep behavior steady while making route ownership and failure handling easier to test directly
99
-
100
-The goal is to keep moving away from broad “policy soup” modules and toward claw-code-style narrower execution seams.
101
-
102
-### 4. Consolidate merged cleanup behavior into the test contract
103
-
104
-The merge brought in new tests and new runtime compatibility seams. Sprint 14 should turn that into a clearer, intentional contract.
105
-
106
-Implementation targets:
107
-
108
-- keep the `RuntimeContext` tests green while reducing how much behavior still depends on compatibility shims
109
-- add direct tests for any newly split response/tool policy controllers
110
-- keep honest-repair, no synthetic prefill, and no-tool completion cleanup behavior covered after the deeper service migration
111
-- extend parity and inspection coverage where the merged audit docs surfaced real operator-facing expectations
112
-- treat the merged `.docs/audit_sprints/` artifacts as regression evidence when deciding whether a seam is actually load-bearing
113
-
114
-This makes the merged audit branch a maintained baseline rather than a one-time reconciliation event.
115
-
116
-### 5. Reconcile docs after the merged audit line
117
-
118
-The branch is merged. The docs should stop behaving like the audit line is still “over there.”
119
-
120
-Implementation targets:
121
-
122
-- refresh `REPORT.md`, `PARITY.md`, and the sprint audit trail where the merge changed the architectural baseline in a meaningful way
123
-- preserve `.docs/audit_sprints/` as historical evidence, not as a second active roadmap
124
-- keep the parity checkpoint honest about which runtime-context/service seams are truly primary vs compatibility-bound
125
-
126
-## Testing strategy
127
-
128
-- unit coverage for:
129
-  - `RuntimeContext`-owned service behavior
130
-  - response-routing subcontrollers
131
-  - tool-batch subcontrollers
132
-  - raw-text fallback and capability refresh paths after the deeper context adoption
133
-- runtime coverage for:
134
-  - native-tool and raw-text tool parity
135
-  - explore mode through the typed runtime context
136
-  - verification/recovery after service migration
137
-  - Sprint 00-13 parity scenarios staying green after the context/service ownership shift
138
-- regression coverage:
139
-  - no synthetic prefill
140
-  - no repeated empty-response puppeting
141
-  - no self-critique reroute regression
142
-  - no accidental rollback/recovery ownership drift between `agent/*` and `runtime/*`
143
-
144
-## Definition of done
145
-
146
-- `RuntimeContext` is a real primary runtime seam for more controllers/services, not just a compatibility adapter
147
-- more runtime-owned parsing/recovery/rollback/safeguard/task-classification behavior is actually moved off `agent/loop.py`
148
-- `response_routing.py` and `tool_batches.py` are narrower and more directly tested
149
-- the merged audit line is reflected as one baseline, not two parallel architectures
150
-- the remaining legacy callbacks are enumerated, narrower, and clearly transitional
151
-- the full parity baseline remains green after the deeper runtime-context adoption
152
-
153
-## Explicitly out of scope
154
-
155
-- full claw-code policy-engine parity
156
-- AST-aware or LSP-aware semantic artifact diffs
157
-- visual workflow or timeline UIs
158
-- multi-agent or team orchestration
159
-
160
-## Audit
161
-
162
-### Status
163
-
164
-- Sprint 14 is complete, and the audit is green. `RuntimeContext` is now the normal runtime seam for the main turn path rather than a selective bridge layered over legacy callbacks.
165
-
166
-### Landed
167
-
168
-- `src/loader/runtime/context.py` is now a real primary contract instead of a partial adapter: the old `RuntimeLegacyServices` shim is gone, workflow-mode mutation lives on the typed context, and the runtime can refresh capability state, steering state, and workflow state without routing back through a legacy wrapper
169
-- runtime-owned service adoption is materially deeper than the merged-audit baseline: `src/loader/runtime/workflow_state.py`, `src/loader/runtime/phases.py`, `src/loader/runtime/repair.py`, `src/loader/runtime/completion_policy.py`, `src/loader/runtime/turn_completion.py`, `src/loader/runtime/response_route_handlers.py`, `src/loader/runtime/response_routing.py`, `src/loader/runtime/turn_loop.py`, `src/loader/runtime/turn_iteration.py`, `src/loader/runtime/finalization.py`, `src/loader/runtime/workflow_lanes.py`, and `src/loader/runtime/workflow_recovery.py` now consume typed runtime state instead of reaching into `Agent` for session, backend, registry, or workflow state
170
-- raw-text tool recovery no longer depends on a hidden `Agent._extract_raw_json_tool_calls(...)` escape hatch: `src/loader/runtime/repair.py` now routes fallback parsing through the runtime parser plus the active registry, which closes an old audit concern around newer tools such as `TodoWrite`
171
-- the response/tool path is narrower and more directly testable: the earlier extraction of `src/loader/runtime/tool_batch_checks.py`, `src/loader/runtime/tool_batch_recovery.py`, `src/loader/runtime/response_route_handlers.py`, and `src/loader/runtime/response_route_types.py` is now paired with typed-context adoption across the hot path, so response routing, tool-batch gating, no-tool completion, and finalization no longer behave like disguised `agent/loop.py` helpers
172
-- the merged audit line is now reflected as one architectural baseline instead of a second hidden runtime: the active `trunk` runtime owns reasoning callbacks, raw-text recovery, response policy, workflow state, turn state, and finalization directly, and the remaining `agent/*` ownership is much smaller and easier to inventory
173
-- the test contract around this migration is stronger and more intentional: `tests/test_runtime_context.py`, `tests/test_runtime_state_controllers.py`, `tests/test_completion_policy.py`, `tests/test_repair.py`, `tests/test_response_route_handlers.py`, `tests/test_turn_loop.py`, `tests/test_turn_iteration.py`, and `tests/test_explore_runtime.py` now pin the typed-context contract directly instead of relying only on large integration tests
174
-
175
-### Verification
176
-
177
-- `uv run pytest -q` is green: `303 passed`
178
-- `tests/test_runtime_context.py` and `tests/test_runtime_state_controllers.py` cover typed context construction plus direct workflow-state and phase-tracker behavior without an `Agent` object on the other side
179
-- `tests/test_repair.py` covers raw-text fallback through the runtime parser/registry, including modern workflow-tool recovery such as `TodoWrite`
180
-- `tests/test_completion_policy.py`, `tests/test_turn_completion.py`, `tests/test_response_route_handlers.py`, `tests/test_response_routing.py`, `tests/test_turn_iteration.py`, and `tests/test_turn_loop.py` cover the main response-policy and assistant-cycle path after the deeper context adoption
181
-- `tests/test_explore_runtime.py` still proves explore refreshes capabilities before the first request, so the typed runtime context is also now load-bearing outside the main task loop
182
-- the larger workflow/runtime suites remained green after the migration, so Sprint 14 did not trade architectural cleanup for parity regressions
183
-
184
-### Residual debt
185
-
186
-- `src/loader/runtime/conversation.py` and `src/loader/runtime/explore.py` still bootstrap from `agent._build_runtime_context()`, and `conversation.py` still performs a post-prepare sync of capability/prompt state from the agent wrapper; the remaining runtime/agent coupling is now mostly bootstrap ownership rather than policy ownership
187
-- `src/loader/agent/loop.py` still owns meaningful planning/prompt/session orchestration outside the hot runtime path, so Loader is much cleaner than the merged-audit baseline but has not fully collapsed to a minimal public entrypoint shell yet
188
-- `src/loader/agent/reasoning.py` and `src/loader/agent/safeguards.py` are now behind typed runtime protocols, but they still own meaningful behavior and remain future burn-down candidates if Sprint 15 wants to keep reducing agent-owned runtime services
189
-- `src/loader/runtime/tool_batches.py` and parts of `src/loader/runtime/workflow_lanes.py` are narrower than before, but they still carry more heuristic policy than the tighter claw-code reference seams
190
-- the workflow policy is stronger and the runtime contract is cleaner, but Loader still stops short of claw-code's fuller policy engine, OMX's deeper planning/interview rigor, and a richer operator UX for editing or simulating policy/rule state
.docs/sprints/sprint15.mddeleted
@@ -1,182 +0,0 @@
1
-# Sprint 15: Bootstrap Ownership, Service Burn-Down, and Explore Independence
2
-
3
-## Prerequisites
4
-
5
-Sprint 14
6
-
7
-## Goals
8
-
9
-Finish the last high-value runtime cleanup that Sprint 14 exposed: move bootstrap ownership and the remaining agent-owned runtime services onto explicit runtime seams, so Loader's hot path is not only context-driven after initialization, but also context-driven at initialization.
10
-
11
-Sprint 14 was a real architectural win. `RuntimeContext` is now the primary seam across workflow state, turn phases, response repair, response routing, turn looping, workflow recovery, and finalization. The older `RuntimeLegacyServices` shim is gone, raw-text tool recovery no longer depends on hidden agent extractors, and the main runtime path is much less accidental than it was when the audit line first branched.
12
-
13
-That said, the current residual debt is now very specific:
14
-
15
-- `conversation.py` and `explore.py` still bootstrap from `agent._build_runtime_context()`
16
-- `agent/loop.py` still owns too much prompt/session/bootstrap coordination for a runtime that is otherwise context-owned
17
-- `agent/reasoning.py` and `agent/safeguards.py` still own meaningful runtime behavior behind typed protocols
18
-- the audit's core warning still matters in a narrower form:
19
-  Loader should keep deleting wrapper-only ownership, not just wrapping it in nicer files
20
-
21
-Sprint 15 is about finishing that next contraction honestly:
22
-
23
-- runtime bootstrapping becomes an explicit runtime contract instead of an agent-only helper
24
-- explore mode stops being a special runtime that still depends on agent construction shape
25
-- reasoning and safeguard ownership become more inventoryable and less agent-bound
26
-- `agent/loop.py` shrinks further toward entrypoint/session orchestration instead of runtime-service ownership
27
-
28
-This sprint should feel like closing the structural loop opened by Sprint 14, not starting a new product branch.
29
-
30
-The references for this sprint are:
31
-
32
-- `refs/claw-code/rust/crates/runtime/src/conversation.rs`
33
-- `refs/claw-code/rust/crates/runtime/src/policy_engine.rs`
34
-- `refs/claw-code/rust/crates/runtime/src/prompt.rs`
35
-- `refs/claw-code/rust/crates/runtime/src/runtime_context.rs`
36
-- `refs/claw-code/PARITY.md`
37
-- `.docs/audit.txt`
38
-- `.docs/audit_sprints/trunk_sitrep.md`
39
-- `.docs/audit_sprints/sprint13_closure.md`
40
-- `refs/oh-my-codex/src/ralplan/runtime.ts`
41
-- `refs/oh-my-codex/src/verification/verifier.ts`
42
-
43
-## Deliverables
44
-
45
-### 1. Runtime bootstrap becomes a first-class runtime seam
46
-
47
-Sprint 14 made `RuntimeContext` load-bearing after construction. Sprint 15 should make construction itself less agent-special.
48
-
49
-Implementation targets:
50
-
51
-- introduce an explicit runtime bootstrap/factory seam under `src/loader/runtime/`, likely around:
52
-  - building `RuntimeContext`
53
-  - initializing project/session/prompt/capability state needed by runtimes
54
-  - synchronizing prompt/capability metadata when the backend or prompt contract changes
55
-- reduce direct runtime dependence on `agent._build_runtime_context()` so `ConversationRuntime` and `ExploreRuntime` do not depend on a hidden agent helper as their primary construction mechanism
56
-- keep the contract pragmatic:
57
-  - it is acceptable for `Agent` to call the factory
58
-  - it is not acceptable for runtime correctness to depend on ad hoc agent-only bootstrap behavior
59
-
60
-The goal is to make runtime ownership explicit from the first line of construction, not only once the turn is already running.
61
-
62
-### 2. Burn down more agent-owned runtime services
63
-
64
-The remaining runtime service ownership now lives mostly in `agent/reasoning.py` and `agent/safeguards.py`.
65
-
66
-Implementation targets:
67
-
68
-- inventory the still-runtime-relevant behavior in:
69
-  - `src/loader/agent/reasoning.py`
70
-  - `src/loader/agent/safeguards.py`
71
-- move or re-home the behavior that is still genuinely runtime-owned, especially around:
72
-  - confidence / verification service boundaries
73
-  - stream filtering / steering / duplicate detection
74
-  - action validation hooks
75
-- prefer one real implementation plus compatibility exports over keeping a runtime wrapper around an agent-owned implementation indefinitely
76
-- explicitly delete or retire dead wrapper layers where the runtime already has a better home
77
-
78
-This is the sprint where we should be suspicious of “adapter forever” solutions. If a behavior is still part of the runtime contract, it should increasingly live under `runtime/`.
79
-
80
-### 3. Explore runtime should share the same bootstrap discipline
81
-
82
-Explore is intentionally narrower than the main runtime, but it should not be structurally special in the wrong way.
83
-
84
-Implementation targets:
85
-
86
-- remove or narrow the `ExploreRuntime(agent)` construction shape so explore can be built from the same runtime bootstrap contract as the main runtime
87
-- keep the read-only registry, read-only permission mode, and capability refresh behavior intact
88
-- add direct tests for explore bootstrap and state ownership so explore remains a maintained runtime lane rather than a side path
89
-
90
-The goal is not to make explore bigger. The goal is to make it less magical and more aligned with the primary runtime contract.
91
-
92
-### 4. Shrink `agent/loop.py` toward entrypoint orchestration
93
-
94
-Sprint 14 made the runtime path smaller and more explicit. Sprint 15 should let `agent/loop.py` benefit from that work.
95
-
96
-Implementation targets:
97
-
98
-- move more bootstrap/session/prompt/runtime wiring out of `agent/loop.py` where it has become runtime ownership in practice
99
-- keep `agent/loop.py` focused on:
100
-  - public entrypoints
101
-  - session-facing orchestration
102
-  - UI/event integration
103
-  - compatibility wrappers that still truly need to exist
104
-- avoid letting new runtime helpers bounce back into `agent/loop.py` just to preserve old ownership lines
105
-
106
-This is the step that turns Sprint 14's seam cleanup into a visibly smaller agent shell.
107
-
108
-### 5. Keep the audit line active as a regression check, not a second roadmap
109
-
110
-`audit.txt` is old on specifics but still sharp on the pattern to avoid: additive cleanup that never deletes ownership.
111
-
112
-Implementation targets:
113
-
114
-- use the audit's core complaint as a check against Sprint 15 implementation:
115
-  - do not add a new wrapper if we can adopt or delete
116
-  - do not leave bootstrap ownership ambiguous
117
-  - do not grow a “temporary” compatibility seam without direct tests and an exit story
118
-- update `PARITY.md` and the sprint audit only after the bootstrap/service changes are actually covered
119
-
120
-## Testing strategy
121
-
122
-- unit coverage for:
123
-  - runtime bootstrap/factory behavior
124
-  - explore bootstrap behavior
125
-  - runtime-owned reasoning/safeguard services after migration
126
-  - prompt/capability synchronization at the new bootstrap seam
127
-- runtime coverage for:
128
-  - main turn execution through the new bootstrap path
129
-  - explore mode through the shared bootstrap/runtime contract
130
-  - Sprint 00-14 parity scenarios staying green after the bootstrap/service migration
131
-- regression coverage for:
132
-  - no reintroduction of hidden raw-text extractors
133
-  - no reintroduction of legacy callback shims equivalent to `RuntimeLegacyServices`
134
-  - no ownership drift where runtime modules silently depend on agent-only helpers again
135
-
136
-## Definition of done
137
-
138
-- runtime bootstrapping is a first-class runtime seam, not primarily an agent helper
139
-- explore mode shares the same bootstrap discipline as the main runtime
140
-- more runtime-relevant behavior is moved or retired out of `agent/reasoning.py` and `agent/safeguards.py`
141
-- `agent/loop.py` shrinks further toward entrypoint/session orchestration
142
-- the parity baseline remains green after the bootstrap/service migration
143
-
144
-## Explicitly out of scope
145
-
146
-- full claw-code policy-engine parity
147
-- AST-aware or LSP-aware semantic artifact diffs
148
-- a richer permission rule editor
149
-- visual workflow tooling
150
-- multi-agent or team orchestration
151
-
152
-## Audit
153
-
154
-### Status
155
-
156
-- Sprint 15 is complete, and the audit is green. Bootstrap ownership is now explicit under `src/loader/runtime/bootstrap.py`, runtime-owned safeguards and reasoning helpers have canonical homes under `src/loader/runtime/`, and the remaining `agent/*` surface is much closer to entrypoint/session orchestration than runtime-service ownership.
157
-
158
-### Landed
159
-
160
-- runtime bootstrap is now a first-class shared seam: `src/loader/runtime/bootstrap.py` defines the typed `RuntimeBootstrapSource` contract plus `build_runtime_context(...)` / `sync_runtime_context(...)`, and both `src/loader/runtime/conversation.py` and `src/loader/runtime/explore.py` construct runtime state through that shared path instead of a hidden `Agent._build_runtime_context()` helper
161
-- the old bootstrap helper is gone from `src/loader/agent/loop.py`, and `tests/test_runtime_context.py` now exercises the shared bootstrap contract directly, which makes runtime construction visibly runtime-owned from the first line of setup
162
-- safeguard ownership is now honest: `src/loader/runtime/safeguards.py` is the canonical implementation, `src/loader/agent/safeguards.py` is a compatibility export only, and `src/loader/agent/loop.py` now imports `RuntimeSafeguards` from the runtime package rather than its own shim
163
-- reasoning ownership is materially cleaner: decomposition/self-critique helpers now live in `src/loader/runtime/deliberation.py`, completion-check parsing now lives in `src/loader/runtime/task_completion.py`, and `src/loader/agent/reasoning.py` has been reduced to a compatibility-export layer over runtime-owned modules instead of a second live implementation
164
-- `src/loader/agent/loop.py` is substantially smaller and less misleading than the Sprint 14 baseline: dead planner hooks, dead self-critique hooks, dead raw extraction helpers, and the unused `src/loader/agent/planner.py` module have been deleted, bringing the loop shell down to `668` lines from the earlier four-figure baseline
165
-- the direct proof contract is stronger and more intentional: `tests/test_runtime_bootstrap.py`, `tests/test_safeguard_services.py`, `tests/test_reasoning_compat.py`, and `tests/test_runtime_context.py` now pin the shared bootstrap seam plus the compatibility-export story for safeguards/reasoning directly, rather than relying only on larger runtime integration tests
166
-
167
-### Verification
168
-
169
-- `uv run pytest -q` is green: `312 passed`
170
-- `tests/test_runtime_bootstrap.py` covers the shared bootstrap contract, prompt/capability synchronization, and both conversation/explore construction through the runtime bootstrap seam
171
-- `tests/test_runtime_context.py` now proves typed context construction without an `Agent._build_runtime_context()` escape hatch
172
-- `tests/test_safeguard_services.py` proves `src/loader/runtime/safeguards.py` is the canonical implementation and `loader.agent.safeguards` is compatibility-only
173
-- `tests/test_reasoning_compat.py` proves the runtime-owned deliberation/completion helpers are canonical and `loader.agent.reasoning` re-exports those runtime implementations rather than carrying a second live copy
174
-- the full parity suite remained green after the service migration and deletion work, so Sprint 15 reduced ownership ambiguity without trading away deterministic runtime coverage
175
-
176
-### Residual debt
177
-
178
-- `src/loader/runtime/conversation.py` and `src/loader/runtime/explore.py` no longer depend on a hidden bootstrap helper, but their constructors still start from an `Agent`-shaped bootstrap source at the entrypoint boundary; the remaining coupling is now public bootstrap orchestration rather than hidden runtime behavior
179
-- `src/loader/agent/loop.py` is much smaller and cleaner, but it still owns the conversational fast path, decomposition orchestration, public run/explore entrypoints, and session/UI-facing glue; it is closer to the right shell, not yet minimal
180
-- `src/loader/agent/reasoning.py` and `src/loader/agent/safeguards.py` are now compatibility shims instead of primary implementations, but those compatibility exports still exist until we decide whether the external import surface can be reduced further
181
-- explore mode now shares the bootstrap discipline, but it is still a one-shot read-only lane rather than a richer interactive inspection workflow
182
-- Loader’s workflow/runtime architecture is much cleaner after Sprint 15, but it still stops short of claw-code’s tighter policy seams, OMX’s deeper planning/interview rigor, and a richer operator UX around policy/rule authoring
.docs/sprints/sprint16.mddeleted
@@ -1,189 +0,0 @@
1
-# Sprint 16: Entrypoint Shell, Launcher Contract, and Explore Continuity
2
-
3
-## Prerequisites
4
-
5
-Sprint 15
6
-
7
-## Goals
8
-
9
-Turn Sprint 15's ownership cleanup into a cleaner public runtime shape: reduce the remaining `Agent`-shaped bootstrap dependency, shrink `agent/loop.py` further toward a thin facade, and make explore mode feel like a maintained read-only runtime lane instead of a one-shot side path.
10
-
11
-Sprint 15 closed an important loop. Loader now has a shared runtime bootstrap seam, runtime-owned safeguards and deliberation helpers, compatibility-only `agent/reasoning.py` and `agent/safeguards.py`, no hidden `Agent._build_runtime_context()` helper, and a much smaller `agent/loop.py`.
12
-
13
-That leaves a tighter but more visible set of residual debts:
14
-
15
-- `conversation.py` and `explore.py` still start from an `Agent`-shaped bootstrap source at the public entrypoint layer
16
-- `agent/loop.py` still owns conversational fast-path behavior, decomposition orchestration, and too much session/UI-facing wiring for a runtime that otherwise lives under `src/loader/runtime/`
17
-- `agent/reasoning.py` and `agent/safeguards.py` are now compatibility-only, but Loader has not yet decided how narrow that compatibility surface should become
18
-- explore mode is structurally cleaner than it was, but it is still a one-shot lookup lane rather than a more intentional read-only workflow
19
-
20
-Sprint 16 is about finishing that next contraction without drifting into a brand-new roadmap:
21
-
22
-- the runtime gets a first-class launcher/entrypoint contract instead of depending on an `Agent`-shaped bootstrap source everywhere
23
-- `agent/loop.py` shrinks further toward a public facade over runtime-owned launch and orchestration helpers
24
-- compatibility exports remain explicit, tested, and intentionally narrow instead of becoming permanent soft ownership
25
-- explore mode gains a better continuity/inspection contract so it feels like a real product lane, not just a single-turn utility
26
-
27
-This sprint should feel like consolidating the public shell after Sprint 15, not reopening the old runtime ownership debates.
28
-
29
-The references for this sprint are:
30
-
31
-- `refs/claw-code/rust/crates/runtime/src/conversation.rs`
32
-- `refs/claw-code/rust/crates/runtime/src/runtime_context.rs`
33
-- `refs/claw-code/rust/crates/runtime/src/policy_engine.rs`
34
-- `refs/claw-code/rust/crates/runtime/src/prompt.rs`
35
-- `refs/claw-code/PARITY.md`
36
-- `.docs/audit.txt`
37
-- `.docs/audit_sprints/trunk_sitrep.md`
38
-- `.docs/sprints/sprint15.md`
39
-- `refs/oh-my-codex/src/ralplan/runtime.ts`
40
-- `refs/oh-my-codex/src/verification/verifier.ts`
41
-
42
-## Deliverables
43
-
44
-### 1. Introduce a first-class runtime launcher contract
45
-
46
-Sprint 15 made bootstrap shared. Sprint 16 should make the public launch path less `Agent`-shaped.
47
-
48
-Implementation targets:
49
-
50
-- introduce an explicit launcher/entrypoint seam under `src/loader/runtime/`, likely around:
51
-  - creating a main-turn runtime
52
-  - creating an explore runtime
53
-  - carrying the minimum public bootstrap state needed by those runtimes
54
-- reduce direct dependence on `ConversationRuntime(self)` / `ExploreRuntime(self)` from `agent/loop.py`
55
-- make the launcher contract narrow and explicit:
56
-  - project/session/prompt/capability state that truly belongs to launch-time setup
57
-  - public callbacks that still genuinely need to stay at the `Agent` layer
58
-- avoid replacing one implicit wrapper with another vague wrapper; the goal is a visibly smaller ownership surface
59
-
60
-The goal is not to erase `Agent`. The goal is to make `Agent` a thin public shell over a runtime launcher contract instead of the default owner of launch-time state.
61
-
62
-### 2. Shrink `agent/loop.py` into a thinner facade
63
-
64
-Sprint 15 deleted dead code. Sprint 16 should move more of the still-live orchestration to clearer homes.
65
-
66
-Implementation targets:
67
-
68
-- move or re-home more of the still-live `agent/loop.py` behavior that has become runtime or launcher ownership in practice, especially:
69
-  - conversational fast-path handling
70
-  - decomposition orchestration
71
-  - runtime/explore construction and launch wiring
72
-  - session-facing setup that does not truly need to live on the agent object
73
-- keep `agent/loop.py` focused on:
74
-  - public entrypoints
75
-  - top-level session lifecycle
76
-  - UI/event integration
77
-  - explicit compatibility shims that still need to exist
78
-- continue deleting dead helper paths instead of letting the shell regrow around the new launcher seam
79
-
80
-This is the sprint where the public shell should become visibly easier to understand.
81
-
82
-### 3. Narrow the compatibility-export surface deliberately
83
-
84
-`agent/reasoning.py` and `agent/safeguards.py` are now compatibility layers. Sprint 16 should make that state intentional rather than indefinite.
85
-
86
-Implementation targets:
87
-
88
-- inventory the remaining compatibility-only exports under:
89
-  - `src/loader/agent/reasoning.py`
90
-  - `src/loader/agent/safeguards.py`
91
-- decide which exports still need to exist for test/import compatibility and which can be retired
92
-- keep internal Loader code importing canonical runtime modules instead of their compatibility mirrors
93
-- add direct tests that lock the intended compatibility contract:
94
-  - which symbols are still re-exported
95
-  - which internal hot paths must not import via compatibility modules anymore
96
-
97
-The goal is not immediate deletion of every compatibility export. The goal is to stop compatibility from behaving like shadow ownership.
98
-
99
-### 4. Give explore mode better continuity and inspection shape
100
-
101
-Explore is cleaner now, but still underpowered as a product lane.
102
-
103
-Implementation targets:
104
-
105
-- deepen the explore runtime without turning it into full workflow mode, likely around:
106
-  - lightweight transcript continuity for follow-up explore questions
107
-  - clearer inspection or status surfaces for recent explore activity
108
-  - a stronger read-only session/state story
109
-- preserve the current constraints:
110
-  - read-only registry
111
-  - no DoD
112
-  - no mutating workflow artifacts
113
-  - explicit denial of write/destructive actions
114
-- add direct coverage so explore remains a maintained runtime lane rather than a product afterthought
115
-
116
-The goal is to make explore feel intentionally usable over multiple adjacent questions, not to turn it into a second main runtime.
117
-
118
-### 5. Keep the audit line active as a deletion-first check
119
-
120
-The useful part of `audit.txt` is still the bias toward deleting disguised ownership instead of endlessly wrapping it.
121
-
122
-Implementation targets:
123
-
124
-- use the audit's core complaint as a Sprint 16 check:
125
-  - do not leave the launcher contract `Agent`-shaped by habit
126
-  - do not regrow logic in `agent/loop.py` after moving it out
127
-  - do not let compatibility exports become the internal default import path again
128
-- update `PARITY.md` and the sprint audit only after the launcher/entrypoint work is directly covered
129
-
130
-## Testing strategy
131
-
132
-- unit coverage for:
133
-  - runtime launcher creation and bootstrap narrowing
134
-  - conversational/decomposition orchestration after re-homing
135
-  - compatibility-export boundaries
136
-  - explore continuity state and inspection helpers
137
-- runtime coverage for:
138
-  - main runtime launch through the new launcher contract
139
-  - explore runtime launch and follow-up continuity
140
-  - Sprint 00-15 parity scenarios staying green after the entrypoint-shell contraction
141
-- regression coverage for:
142
-  - no return of hidden bootstrap helpers
143
-  - no internal fallback to compatibility imports for runtime-owned helpers
144
-  - no mutation leakage into explore continuity/session behavior
145
-
146
-## Definition of done
147
-
148
-- Loader has a first-class runtime launcher/entrypoint seam instead of relying on an `Agent`-shaped bootstrap source everywhere
149
-- `agent/loop.py` shrinks further toward a public facade over runtime-owned launch/orchestration helpers
150
-- compatibility exports under `agent/reasoning.py` and `agent/safeguards.py` are narrower, explicitly tested, and no longer used by internal hot paths
151
-- explore mode has a stronger continuity/inspection contract while staying read-only and workflow-light
152
-- the parity baseline remains green after the launcher/entrypoint changes
153
-
154
-## Explicitly out of scope
155
-
156
-- full claw-code policy-engine parity
157
-- multi-agent or team orchestration
158
-- AST-aware or LSP-aware semantic artifact diffs
159
-- a full visual explore/workflow UI
160
-
161
-## Audit
162
-
163
-### Status
164
-
165
-- Sprint 16 is complete, and the audit is green. Loader now has a real public launcher/entrypoint contract, a visibly smaller `agent/loop.py`, explicit compatibility-boundary proof, and a persisted read-only explore continuity story that stays outside the main workflow runtime.
166
-
167
-### Landed
168
-
169
-- the launcher contract is now first-class under `src/loader/runtime/launcher.py`: the public runtime seam no longer just constructs runtimes, it now owns conversational fast-path routing, decomposition entry routing, direct turn routing, and read-only explore launch through a single entry contract
170
-- `src/loader/agent/loop.py` has shrunk further toward a real facade: conversational handling and decomposition orchestration now live under `src/loader/runtime/chat_lane.py` and `src/loader/runtime/decomposition_lane.py`, and the remaining shell is much closer to public entrypoints, session lifecycle, prompt factories, and UI/event integration than to runtime ownership
171
-- the compatibility surface is now deliberate instead of implicit: `tests/test_compat_boundaries.py`, `tests/test_reasoning_compat.py`, and `tests/test_safeguard_services.py` explicitly lock the current `agent/reasoning.py` / `agent/safeguards.py` export contract and assert that internal runtime code does not drift back to importing through those compatibility shims
172
-- explore mode now has a stronger continuity/state story without becoming a second workflow runtime: `src/loader/runtime/explore_state.py` persists a bounded read-only transcript under `.loader/state/explore.json`, `src/loader/runtime/explore.py` reuses that transcript for follow-up questions by default, and `loader explore --fresh` gives operators a clean escape hatch when they want a one-off lookup
173
-- operator visibility now reflects that explore state: `src/loader/runtime/inspection.py` includes recent explore activity in the status snapshot, and `src/loader/cli/main.py` surfaces explore turns, bounded transcript state, and the last explore query in `loader status`
174
-
175
-### Verification
176
-
177
-- `uv run pytest -q` is green: `329 passed`
178
-- `tests/test_runtime_launcher.py`, `tests/test_chat_lane.py`, and `tests/test_decomposition_lane.py` now prove the public launcher contract directly, including conversational routing, decomposition delegation, direct turn routing, and explore launch behavior
179
-- `tests/test_runtime_bootstrap.py` and `tests/test_runtime_context.py` remained green after the launcher ownership shift, which keeps the shared bootstrap/context seam honest instead of reintroducing an agent-only backdoor
180
-- `tests/test_compat_boundaries.py`, `tests/test_reasoning_compat.py`, and `tests/test_safeguard_services.py` now pin the intended compatibility-export boundary directly and fail if internal Loader code slides back into using those shims as primary imports
181
-- `tests/test_explore_runtime.py` and `tests/test_inspection.py` now cover persisted explore continuity, `fresh` resets, and status-surface visibility for recent explore activity
182
-
183
-### Residual debt
184
-
185
-- `src/loader/agent/loop.py` is much smaller and clearer, but it still owns prompt/session factories, resume/clear lifecycle, and event-wrapper glue; Loader is close to a minimal public shell, not fully there yet
186
-- `src/loader/runtime/conversation.py` and `src/loader/runtime/explore.py` still start from an `Agent`-shaped bootstrap source at the public boundary, even though the launcher contract is now more explicit and load-bearing
187
-- the compatibility exports are now intentionally bounded and tested, but they still exist until Loader decides whether the external import surface can narrow further
188
-- explore continuity is now real, but it is still transcript-first and lightweight: there is no richer explore inspection command, multi-step read-only workflow, or deeper repo-navigation UX yet
189
-- Loader’s runtime shell is materially cleaner after Sprint 16, but it still stops short of claw-code’s tighter policy seams, OMX’s deeper planning/interview rigor, and richer operator tooling around policy, rules, and explore workflows
.docs/sprints/sprint17.mddeleted
@@ -1,190 +0,0 @@
1
-# Sprint 17: Bootstrap Source Narrowing, Turn Contract Tightening, and Explore Operator UX
2
-
3
-## Prerequisites
4
-
5
-Sprint 16
6
-
7
-## Goals
8
-
9
-Finish the next real contraction after Sprint 16: stop treating the public runtime boundary as an `Agent` object by default, keep deleting in-stream repair ownership where the runtime can enforce a stronger contract instead, and give the new explore continuity story a small but real operator surface.
10
-
11
-Sprint 16 closed an important shell-level loop. Loader now has a first-class runtime launcher contract, a smaller `agent/loop.py`, explicit compatibility-boundary proof, and persisted explore continuity. That work changed the shape of the remaining debt in a useful way:
12
-
13
-- the runtime no longer needs `agent/loop.py` to decide chat vs decompose vs direct turn routing
14
-- compatibility shims are now explicit and guarded instead of silently load-bearing
15
-- explore is no longer purely one-shot
16
-- but the runtime still starts from an `Agent`-shaped bootstrap source at the public boundary
17
-- `agent/loop.py` still owns prompt/session factory behavior and too much entrypoint lifecycle glue
18
-- and the old audit critique still matters in a narrower form: Loader must keep deleting or hardening in-stream repair behavior, not just moving it again
19
-
20
-`audit.txt` is stale on specifics, test counts, and many file-level claims. It is not the roadmap anymore. But one core warning is still worth carrying into Sprint 17:
21
-
22
-- do not keep wrapping model-misbehavior recovery in nicer files if the runtime can instead enforce a stronger explicit contract
23
-
24
-Sprint 17 should use that warning deliberately while staying grounded in the current codebase rather than the old audit snapshot.
25
-
26
-The references for this sprint are:
27
-
28
-- `refs/claw-code/rust/crates/runtime/src/conversation.rs`
29
-- `refs/claw-code/rust/crates/runtime/src/runtime_context.rs`
30
-- `refs/claw-code/rust/crates/runtime/src/policy_engine.rs`
31
-- `refs/claw-code/rust/crates/runtime/src/prompt.rs`
32
-- `refs/claw-code/PARITY.md`
33
-- `.docs/audit.txt`
34
-- `.docs/audit_sprints/trunk_sitrep.md`
35
-- `.docs/sprints/sprint16.md`
36
-- `refs/oh-my-codex/src/ralplan/runtime.ts`
37
-- `refs/oh-my-codex/src/verification/verifier.ts`
38
-
39
-## Deliverables
40
-
41
-### 1. Narrow the public bootstrap source below `Agent`
42
-
43
-Sprint 16 gave Loader a launcher contract. Sprint 17 should stop treating that launcher contract as “an agent object with the right fields.”
44
-
45
-Implementation targets:
46
-
47
-- introduce a smaller public bootstrap/launcher source under `src/loader/runtime/`, likely around:
48
-  - backend
49
-  - registry
50
-  - session
51
-  - permission/capability state
52
-  - prompt/session callbacks that still genuinely need to stay dynamic
53
-- reduce direct `RuntimeBootstrapSource` dependence on the full `Agent` object
54
-- keep `ConversationRuntime`, `ExploreRuntime`, and `RuntimeLauncher` constructing from that narrower source rather than from `Agent` by convention
55
-- avoid moving fields mechanically; the point is to decide which launch-time responsibilities truly belong in the runtime boundary and which belong in the public shell
56
-
57
-The goal is not “zero references to Agent.” The goal is to make the public runtime boundary look intentionally runtime-shaped rather than coincidentally agent-shaped.
58
-
59
-### 2. Move prompt/session shell behavior out of `agent/loop.py`
60
-
61
-Sprint 16 shrank entry routing. Sprint 17 should take the next obvious shell debt: prompt/session factories and lifecycle glue.
62
-
63
-Implementation targets:
64
-
65
-- inventory what still makes `src/loader/agent/loop.py` feel heavier than a public facade, especially:
66
-  - system-prompt construction and snapshot persistence
67
-  - few-shot example selection
68
-  - session creation/replacement boilerplate
69
-  - resume/clear lifecycle helpers that have become reusable runtime-shell behavior in practice
70
-- move or re-home the behavior that is no longer meaningfully “agent-owned”
71
-- keep `agent/loop.py` focused on:
72
-  - public entrypoints
73
-  - explicit lifecycle commands
74
-  - UI/event wrappers
75
-  - compatibility accessors that still truly need to exist
76
-
77
-The goal is to make Sprint 16’s thinner shell materially easier to understand on sight.
78
-
79
-### 3. Tighten the remaining turn-repair contract
80
-
81
-This is where `audit.txt` still matters. The remaining question is no longer “did we extract the repair path?” but “are we still repairing things the runtime should simply reject or halt on?”
82
-
83
-Implementation targets:
84
-
85
-- inventory the remaining repair/completion heuristics across:
86
-  - `src/loader/runtime/repair.py`
87
-  - `src/loader/runtime/completion_policy.py`
88
-  - `src/loader/runtime/turn_completion.py`
89
-  - `src/loader/runtime/assistant_turns.py`
90
-- identify the heuristics that still look like in-stream puppeting or speculative follow-through rather than explicit runtime contract
91
-- prefer one of:
92
-  - deletion
93
-  - a stricter typed failure state
94
-  - an explicit retry budget with honest surfaced failure
95
-  over adding another wrapper or reroute
96
-- add direct regression tests for every deleted/tightened behavior so we do not silently reintroduce it later
97
-
98
-This is the sprint where we should convert more “runtime tries to rescue the model” behavior into “runtime enforces the contract and reports honestly.”
99
-
100
-### 4. Give explore a small but real operator surface
101
-
102
-Sprint 16 gave explore continuity. Sprint 17 should make that state inspectable and manageable.
103
-
104
-Implementation targets:
105
-
106
-- add a small operator-facing explore surface, likely around:
107
-  - recent explore status/history inspection
108
-  - resetting explore continuity without touching main workflow sessions
109
-  - clearer visibility into whether an explore query ran fresh vs continued
110
-- keep the explore lane intentionally lightweight:
111
-  - no DoD
112
-  - no workflow artifacts
113
-  - no mutation
114
-  - no conversion into a second main runtime
115
-- prefer one or two clean CLI/inspection surfaces over a broad subcommand family
116
-
117
-The goal is to make explore continuity usable and debuggable, not to build a whole second product mode.
118
-
119
-### 5. Keep the audit line active as a contract check, not a competing roadmap
120
-
121
-Implementation targets:
122
-
123
-- use `audit.txt` only for the still-valid patterns it warns about:
124
-  - additive wrapper cleanup
125
-  - hidden ownership
126
-  - in-stream rescue behavior that should become explicit contract
127
-- prefer current code and current sprint audits over old audit counts or old branch-era file claims
128
-- update `PARITY.md` and the sprint audit only after the bootstrap narrowing and turn-contract work are directly covered
129
-
130
-## Testing strategy
131
-
132
-- unit coverage for:
133
-  - narrowed launcher/bootstrap source construction
134
-  - prompt/session-shell helpers after re-homing
135
-  - deleted or tightened repair/completion heuristics
136
-  - explore operator surfaces and continuity reset behavior
137
-- runtime coverage for:
138
-  - main runtime launch through the narrowed bootstrap source
139
-  - explore continuity through the new operator surface
140
-  - existing launcher/chat/decomposition parity staying green after the shell contraction
141
-- regression coverage for:
142
-  - no new implicit dependence on full `Agent` shape at the runtime boundary
143
-  - no silent return of deleted repair/continuation heuristics
144
-  - no explore continuity leakage into main session/DoD state
145
-
146
-## Definition of done
147
-
148
-- the public runtime bootstrap source is narrower and less `Agent`-shaped than Sprint 16
149
-- `agent/loop.py` shrinks further toward public facade and lifecycle glue only
150
-- Loader deletes or tightens more remaining in-stream repair behavior instead of merely relocating it
151
-- explore continuity gains a small operator surface while staying read-only and workflow-light
152
-- the parity baseline remains green after the Sprint 17 contract tightening
153
-
154
-## Explicitly out of scope
155
-
156
-- full claw-code policy-engine parity
157
-- multi-agent or team orchestration
158
-- AST-aware or LSP-aware semantic artifact diffs
159
-- a full visual explore workflow
160
-- a broad rule editor or policy authoring UI
161
-
162
-## Audit
163
-
164
-### Status
165
-
166
-- Sprint 17 is complete, and the audit is green. Loader now starts the public runtime from an explicit runtime-shaped bootstrap view, `agent/loop.py` is materially thinner again, one more rescue-style repair path was converted into an honest failure contract, and explore continuity has a small operator surface instead of being invisible state.
167
-
168
-### Landed
169
-
170
-- the public runtime boundary is no longer “raw `Agent` by convention”: `src/loader/runtime/bootstrap.py` now exposes an explicit `RuntimeBootstrapView`, `src/loader/runtime/launcher.py` stores that narrowed source directly, and both `src/loader/runtime/conversation.py` and `src/loader/runtime/explore.py` now construct from that runtime-shaped contract rather than from `Agent`-typed ownership
171
-- prompt/session shell behavior moved into runtime-owned helpers under `src/loader/runtime/public_shell.py`: session creation, session restore, prompt construction, prompt snapshot persistence, and few-shot example selection no longer live inline inside `src/loader/agent/loop.py`
172
-- `src/loader/agent/loop.py` has shrunk again, now down to 437 lines; its remaining weight is much closer to what Sprint 17 intended: public entrypoints, resume/clear lifecycle, capability refresh, steering, and UI-facing wrapper behavior
173
-- the repair contract tightened in a useful audit-aligned place: when raw-text tool recovery exhausts its budget, `src/loader/runtime/repair.py` now stops with an explicit honest failure instead of appending another soft “let me know if you'd like me to continue” rescue line
174
-- explore continuity now has a real operator surface while staying workflow-light: `src/loader/runtime/explore_state.py` persists whether the last lookup ran `fresh` or `continue`, `src/loader/runtime/inspection.py` exposes that continuity directly, and `src/loader/cli/main.py` now supports `loader explore --status` and `loader explore --reset` alongside the existing read-only lookup flow
175
-
176
-### Verification
177
-
178
-- `uv run pytest -q` is green: `336 passed`
179
-- `tests/test_runtime_bootstrap.py`, `tests/test_runtime_launcher.py`, and `tests/test_runtime_context.py` now pin the narrowed bootstrap boundary directly and assert that the launcher/runtime contract is a `RuntimeBootstrapView` rather than a raw `Agent`
180
-- `tests/test_runtime_public_shell.py` now covers runtime-owned prompt/session shell helpers directly, including prompt-contract persistence, session creation metadata, and restored last-turn summary state
181
-- `tests/test_repair.py` and `tests/test_runtime_repair_flows.py` now cover honest raw-text tool recovery failure once the recovery budget is exhausted
182
-- `tests/test_explore_runtime.py` and `tests/test_inspection.py` now cover persisted explore history mode (`fresh` vs `continue`), explore continuity inspection, and explore reset behavior from the CLI
183
-
184
-### Residual debt
185
-
186
-- `src/loader/agent/loop.py` is much closer to a public facade than it was at Sprint 16, but it still owns resume/clear lifecycle, steering plumbing, capability refresh, and UI/event wrapper behavior; it is thin enough to be honest, not yet minimal
187
-- the public runtime boundary is now explicitly runtime-shaped, but `Agent` still constructs and supplies that boundary; Loader has not yet decided whether later sprints should narrow the public shell further or treat that as stable product architecture
188
-- the repair path is more honest than it was, but some completion/continuation heuristics still remain in `src/loader/runtime/completion_policy.py` and `src/loader/runtime/turn_completion.py`; those paths are now the right place to look for any future audit-driven deletions
189
-- explore continuity is now inspectable and resettable, but it is still intentionally narrow: no richer browse/navigation UX, no multi-step explore workflow, and no broader product surface than the single-command lookup lane plus the new operator flags
190
-- Loader is in a healthier public-boundary shape after Sprint 17, but it still stops short of claw-code’s tighter policy seams, OMX’s deeper planning/interview rigor, and richer operator tooling around policy, rules, and explore workflows
.docs/sprints/sprint18.mddeleted
@@ -1,189 +0,0 @@
1
-# Sprint 18: Shell Minimalism, Completion Contract, and Runtime Policy Trace
2
-
3
-## Prerequisites
4
-
5
-Sprint 17
6
-
7
-## Goals
8
-
9
-Take the next honest contraction after Sprint 17: finish shrinking the public shell where it is still load-bearing, tighten the remaining completion-policy heuristics that still act like soft rescue behavior, and expose more of the runtime’s stop/continue policy as inspectable state instead of hidden control flow.
10
-
11
-Sprint 17 closed several important seams:
12
-
13
-- the public runtime boundary is now explicitly runtime-shaped instead of raw-`Agent` by convention
14
-- prompt/session shell behavior moved into `src/loader/runtime/public_shell.py`
15
-- raw-text tool recovery now fails honestly once its budget is exhausted
16
-- explore continuity now has a small but real operator surface
17
-
18
-That changes the remaining debt in a useful way:
19
-
20
-- `src/loader/agent/loop.py` is now much thinner, but it still owns resume/clear lifecycle, steering plumbing, capability refresh, and public event-wrapper glue
21
-- `src/loader/runtime/completion_policy.py` and `src/loader/runtime/turn_completion.py` still contain heuristics that may be useful, but they are the clearest remaining place where Loader can still look like it is nudging the model instead of enforcing a typed contract
22
-- the runtime now makes more honest decisions, but operators still cannot inspect completion-policy decisions with the same clarity they can inspect permissions, prompts, workflow, or explore continuity
23
-
24
-Sprint 18 should keep using `refs/claw-code` and `refs/oh-my-codex` as architectural references, not as a literal feature-copy checklist. Loader is no longer in the “blindly match every upstream surface” phase. The right standard now is:
25
-
26
-- honor claw-code where it provides proven runtime seams, policy discipline, and explicit state ownership
27
-- honor OMX where it sharpens workflow/verifier thinking
28
-- keep Loader-specific product surfaces when they improve inspectability or fit the current Python runtime better
29
-
30
-That means Sprint 18 should still be guided by the refs, but it should use them to make better Loader decisions rather than to perform a mechanical port.
31
-
32
-The references for this sprint are:
33
-
34
-- `refs/claw-code/rust/crates/runtime/src/conversation.rs`
35
-- `refs/claw-code/rust/crates/runtime/src/prompt.rs`
36
-- `refs/claw-code/rust/crates/runtime/src/policy_engine.rs`
37
-- `refs/claw-code/PARITY.md`
38
-- `.docs/PARITY.md`
39
-- `.docs/audit.txt`
40
-- `.docs/audit_sprints/trunk_sitrep.md`
41
-- `.docs/sprints/sprint17.md`
42
-- `refs/oh-my-codex/src/ralplan/runtime.ts`
43
-- `refs/oh-my-codex/src/verification/verifier.ts`
44
-
45
-## Deliverables
46
-
47
-### 1. Shrink the remaining public shell to an intentionally small facade
48
-
49
-Sprint 17 moved prompt/session helpers out of `agent/loop.py`. Sprint 18 should finish the next obvious shell contraction.
50
-
51
-Implementation targets:
52
-
53
-- inventory the remaining `Agent` responsibilities that still feel like runtime/public-shell plumbing rather than true product entrypoint behavior, especially:
54
-  - resume/clear lifecycle helpers
55
-  - capability refresh wiring
56
-  - steering queue plumbing
57
-  - event-wrapper helpers for run/explore/streaming entrypoints
58
-- move what is reusable/runtime-owned into dedicated runtime/public-shell helpers
59
-- keep `agent/loop.py` focused on:
60
-  - public API entrypoints
61
-  - the minimal state that truly belongs to the agent shell
62
-  - explicit compatibility or UI-facing integration points
63
-
64
-The goal is not “delete `Agent`.” The goal is for `agent/loop.py` to read like an intentionally tiny facade rather than “the last place leftover things live.”
65
-
66
-### 2. Tighten the remaining completion/continuation contract
67
-
68
-This is the strongest remaining audit line after Sprint 17.
69
-
70
-Implementation targets:
71
-
72
-- inventory the remaining heuristic behavior across:
73
-  - `src/loader/runtime/completion_policy.py`
74
-  - `src/loader/runtime/turn_completion.py`
75
-  - `src/loader/runtime/assistant_turns.py`
76
-  - any nearby turn-loop controllers that still re-enter based on soft textual cues
77
-- identify which of those behaviors still look like:
78
-  - speculative completion rescue
79
-  - soft continuation nudges that should instead become explicit failure or explicit follow-through state
80
-  - hidden policy that operators cannot inspect afterward
81
-- prefer:
82
-  - deletion
83
-  - stricter typed completion decisions
84
-  - explicit stop/continue reason codes
85
-  - surfaced runtime evidence
86
-  over adding more wrapper logic
87
-
88
-The goal is to continue the Sprint 13 and Sprint 17 line: the runtime should either proceed for a clear reason or stop honestly, not softly coax the model onward without making that policy visible.
89
-
90
-### 3. Expose completion-policy and stop/continue decisions as inspectable runtime state
91
-
92
-Loader’s policy decisions should become easier to inspect, not just easier to reason about in code.
93
-
94
-Implementation targets:
95
-
96
-- define a small typed trace or summary for completion-policy decisions, likely covering:
97
-  - why a text response was accepted
98
-  - why a continuation was requested
99
-  - why a turn was stopped or finalized
100
-  - whether the decision came from text-loop detection, DoD gating, raw-tool recovery exhaustion, or completion-policy evaluation
101
-- persist enough of that state in the current session/turn summary to support operator inspection
102
-- surface it through existing product seams, likely:
103
-  - `loader status`
104
-  - `loader session show`
105
-  - possibly `loader workflow show` if that is the cleanest fit
106
-
107
-The goal is to make completion policy inspectable the same way permissions, prompts, workflow, and explore continuity are now inspectable.
108
-
109
-### 4. Keep the ref relationship explicit and healthy
110
-
111
-By Sprint 18, Loader should be clearly past the “copy the refs feature-for-feature” stage without losing the discipline the refs gave us.
112
-
113
-Implementation targets:
114
-
115
-- use claw-code for:
116
-  - runtime seam quality
117
-  - explicit policy/state ownership
118
-  - prompt/runtime separation
119
-- use OMX for:
120
-  - verifier/workflow pressure and follow-through ideas
121
-- do not add work just because the refs have it
122
-- do add work when a ref reveals a real Loader weakness in:
123
-  - honesty
124
-  - inspectability
125
-  - runtime ownership
126
-  - follow-through
127
-
128
-The goal is to make Sprint 18 explicitly “reference-guided Loader optimization,” not “shadow-porting another codebase.”
129
-
130
-## Testing strategy
131
-
132
-- unit coverage for:
133
-  - new public-shell helpers or reduced `Agent` shell behavior
134
-  - tightened completion-policy decisions
135
-  - persisted completion/stop decision summaries
136
-- runtime coverage for:
137
-  - no regression in normal follow-through
138
-  - honest stop behavior where rescue logic was deleted or tightened
139
-  - session/status inspection of completion-policy decisions
140
-- regression coverage for:
141
-  - no reintroduction of soft rescue phrasing after stop/failure conditions
142
-  - no drift back toward `agent/loop.py` owning runtime plumbing
143
-  - no loss of current explore/session/workflow inspection behavior while shell state shifts again
144
-
145
-## Definition of done
146
-
147
-- `agent/loop.py` shrinks again or becomes materially more facade-like even if line count does not fall dramatically
148
-- Loader deletes or hardens more remaining completion/continuation heuristics instead of preserving them through softer wrappers
149
-- operators can inspect more of the runtime’s completion-policy reasoning after the fact
150
-- Sprint 17’s explicit-bootstrap and explore-operator gains remain green
151
-- the parity baseline remains green after the Sprint 18 shell/policy tightening
152
-
153
-## Explicitly out of scope
154
-
155
-- full claw-code policy-engine parity
156
-- multi-agent or team orchestration
157
-- AST-aware semantic diffs
158
-- a broad explore workflow UI
159
-- broad permission-rule editing UX
160
-
161
-## Audit
162
-
163
-### Status
164
-
165
-- Sprint 18 is complete, and the audit is green. Loader now persists explicit completion-policy decisions and step traces, exposes them through the existing operator surfaces, and moves more shell glue out of `src/loader/agent/loop.py` into runtime-owned public-shell helpers.
166
-
167
-### Landed
168
-
169
-- completion-policy outcomes are now explicit runtime contract instead of hidden side effects: `src/loader/runtime/completion_policy.py`, `src/loader/runtime/turn_completion.py`, and `src/loader/runtime/finalization.py` now preserve reason-coded stop/continue outcomes such as `premature_completion_nudge`, `non_mutating_response_accepted`, `verification_passed`, and `verification_failed_reentry`
170
-- session/runtime state now persists both the latest completion decision and a bounded per-turn completion trace through `src/loader/runtime/session.py`, `src/loader/runtime/events.py`, and the new `src/loader/runtime/completion_trace.py`
171
-- completion-policy state is now inspectable from the product surface: `src/loader/runtime/inspection.py` and `src/loader/cli/main.py` show the latest completion decision in `loader status`, `loader session list`, and `loader session show`, while `loader session show` also prints the richer step-by-step completion trace
172
-- resumed session state now restores completion decisions and traces through `src/loader/runtime/public_shell.py`, so inspection survives restart instead of only describing the active process
173
-- the public shell is thinner again: `src/loader/runtime/public_shell.py` now owns a real `SteeringMailbox`, fresh/load session-install helpers, sync/async event-emitter normalization, and capability-refresh decision helpers
174
-- `src/loader/agent/loop.py` now adopts those runtime-owned helpers instead of open-coding them; the file is down to 432 lines, steering no longer leaks across resume/clear lifecycle, and replacing a session now invalidates cached prompt state explicitly
175
-
176
-### Verification
177
-
178
-- `uv run pytest -q` is green: `342 passed`
179
-- `tests/test_completion_policy.py`, `tests/test_turn_completion.py`, `tests/test_finalization.py`, `tests/test_session_state.py`, and `tests/test_inspection.py` now pin persisted completion decisions plus step-trace behavior directly
180
-- `tests/test_runtime_public_shell.py` now covers steering mailbox behavior, fresh/load session-install helpers, sync/async event emitters, and capability-refresh helper behavior directly
181
-- `tests/test_turn_preparation.py` now proves a new turn clears stale completion trace state before execution starts
182
-- `tests/test_runtime_context.py`, `tests/test_runtime_bootstrap.py`, and `tests/test_runtime_launcher.py` stayed green while `Agent` adopted the slimmer public-shell helper set
183
-
184
-### Residual debt
185
-
186
-- `src/loader/agent/loop.py` is materially more facade-like than Sprint 17, but it still owns the public entrypoints themselves plus the remaining launcher/UI glue; Sprint 18 shrank the shell again, it did not eliminate it
187
-- completion policy is now explicit and inspectable, but Loader still keeps bounded continuation heuristics for some non-mutating tasks; those nudges are no longer hidden, but they are still policy choices rather than hard-stop failures
188
-- the new completion trace is intentionally compact and per-turn; Loader still does not have a richer long-horizon policy/debug timeline that merges completion, workflow, and repair evidence into one operator view
189
-- Sprint 18 kept following claw-code and OMX as architectural guardrails, but Loader still stops short of claw-code's tighter policy-engine seams and OMX's deeper verifier/interview rigor
.docs/sprints/sprint19.mddeleted
@@ -1,191 +0,0 @@
1
-# Sprint 19: Facade Finalization, Continuation Hardening, and Unified Policy Timeline
2
-
3
-## Prerequisites
4
-
5
-Sprint 18
6
-
7
-## Goals
8
-
9
-Take the next honest contraction after Sprint 18: finish thinning the public shell where it still reads like runtime glue, harden the last continuation heuristics that can still feel like soft rescue behavior, and unify the scattered policy/debug surfaces into one clearer operator story.
10
-
11
-Sprint 18 changed the shape of the remaining debt in a useful way:
12
-
13
-- completion policy is now explicit, persisted, and inspectable
14
-- public-shell helpers now own steering, session install/load, event-emitter normalization, and capability-refresh decisions
15
-- `src/loader/agent/loop.py` is materially thinner
16
-- but `Agent` still owns the public entrypoints and launcher glue
17
-- completion traces and workflow traces now both exist, but they are still separate operator surfaces
18
-- Loader still keeps bounded continuation nudges for some non-mutating tasks, and those nudges are explicit now but not yet deeply justified
19
-
20
-Sprint 19 should stay reference-guided, not reference-submissive.
21
-
22
-The standard remains:
23
-
24
-- use claw-code to sharpen runtime seams, policy ownership, and explicit lifecycle contracts
25
-- use OMX to sharpen follow-through, verifier pressure, and operator-facing runtime accountability
26
-- do not add a feature just because the refs have it
27
-- do pursue changes when the refs reveal that Loader is still too soft, too implicit, or too hard to audit
28
-
29
-`audit.txt` is still not the roadmap. It is useful only as a guardrail against sliding back into wrapper-heavy cleanup and soft model rescue behavior.
30
-
31
-The references for this sprint are:
32
-
33
-- `refs/claw-code/rust/crates/runtime/src/conversation.rs`
34
-- `refs/claw-code/rust/crates/runtime/src/bootstrap.rs`
35
-- `refs/claw-code/rust/crates/runtime/src/session_control.rs`
36
-- `refs/claw-code/rust/crates/runtime/src/lane_events.rs`
37
-- `refs/claw-code/rust/crates/runtime/src/policy_engine.rs`
38
-- `refs/claw-code/rust/crates/runtime/src/green_contract.rs`
39
-- `refs/claw-code/rust/crates/runtime/src/prompt.rs`
40
-- `refs/claw-code/PARITY.md`
41
-- `.docs/PARITY.md`
42
-- `.docs/audit.txt`
43
-- `.docs/audit_sprints/trunk_sitrep.md`
44
-- `.docs/sprints/sprint18.md`
45
-- `refs/oh-my-codex/src/autoresearch/contracts.ts`
46
-- `refs/oh-my-codex/src/autoresearch/runtime.ts`
47
-- `refs/oh-my-codex/src/verification/verifier.ts`
48
-- `refs/oh-my-codex/src/hooks/session.ts`
49
-- `refs/oh-my-codex/src/hooks/prompt-guidance-contract.ts`
50
-
51
-## Deliverables
52
-
53
-### 1. Finish the next public-shell contraction below `Agent`
54
-
55
-Sprint 18 moved more shell behavior into `src/loader/runtime/public_shell.py`, but `Agent` still owns the public entrypoint wrappers and some launch-time glue.
56
-
57
-Implementation targets:
58
-
59
-- inventory what remains in `src/loader/agent/loop.py` that still feels like runtime/public-shell plumbing instead of true public API ownership, especially:
60
-  - run / run_streaming / run_explore event-wrapper glue
61
-  - resume / clear lifecycle orchestration
62
-  - launcher construction and runtime-source preparation
63
-  - capability-refresh and prompt invalidation wiring
64
-- move what is reusable into runtime-owned helpers or a tighter launcher/public-shell seam
65
-- keep `Agent` focused on:
66
-  - public API shape
67
-  - compatibility-facing attributes
68
-  - minimal UI-facing integration points
69
-
70
-The goal is not “delete `Agent`.” The goal is for `Agent` to read like an intentionally tiny facade instead of a convenient place for leftover runtime glue.
71
-
72
-### 2. Harden the remaining continuation contract
73
-
74
-Sprint 18 made continuation behavior visible. Sprint 19 should decide which of that behavior is still too soft.
75
-
76
-Implementation targets:
77
-
78
-- inventory the remaining continuation behavior across:
79
-  - `src/loader/runtime/completion_policy.py`
80
-  - `src/loader/runtime/turn_completion.py`
81
-  - `src/loader/runtime/assistant_turns.py`
82
-  - any nearby repair/finalization controller that can still nudge rather than stop
83
-- identify which continuation cases are still justified by explicit runtime evidence versus merely tolerated by textual heuristics
84
-- prefer:
85
-  - deletion
86
-  - a stricter typed stop/fail state
87
-  - explicit follow-through requirements derived from runtime artifacts or session state
88
-  over keeping broad “continue once more” behavior
89
-- where a continuation path remains, make the required evidence explicit and persisted
90
-
91
-The goal is to keep following the Sprint 13 / Sprint 17 / Sprint 18 line: the runtime should proceed for a clear typed reason or stop honestly, not continue because the model “probably meant well.”
92
-
93
-### 3. Unify completion, workflow, and repair accountability into one operator-facing timeline
94
-
95
-Loader now has workflow timeline entries and a separate completion trace. That is better than hidden state, but still fragmented.
96
-
97
-Implementation targets:
98
-
99
-- define a compact unified policy timeline or policy event model that can carry:
100
-  - workflow routing/handoff decisions
101
-  - completion-policy outcomes
102
-  - repair / retry / recovery decisions
103
-  - terminal stop reasons
104
-- decide whether the existing workflow timeline should absorb completion/repair events or whether a sibling policy timeline is the cleaner contract
105
-- persist enough of that state to survive resume and make post-mortem inspection more useful
106
-- surface it through existing product seams, likely one of:
107
-  - `loader workflow show`
108
-  - `loader session show`
109
-  - a narrowly-scoped new policy-focused surface if and only if it is cleaner than overloading the workflow view
110
-
111
-The goal is that operators can answer “why did Loader keep going, stop, retry, or accept this result?” from one coherent surface instead of stitching together multiple tables by hand.
112
-
113
-### 4. Keep the ref relationship explicit and healthy
114
-
115
-Implementation targets:
116
-
117
-- use claw-code for:
118
-  - lane-event shape
119
-  - session/runtime control seams
120
-  - explicit policy and green-contract ownership
121
-- use OMX for:
122
-  - verifier/follow-through accountability
123
-  - session/runtime operator clarity
124
-  - stronger prompt/runtime contract thinking
125
-- do not add work just because the refs have it
126
-- do add work when the refs reveal a real Loader weakness in:
127
-  - honesty
128
-  - inspectability
129
-  - shell minimalism
130
-  - follow-through
131
-
132
-The goal is to keep Loader reference-guided and self-aware, not to drift into either blind feature copying or isolated local optimization.
133
-
134
-## Testing strategy
135
-
136
-- unit coverage for:
137
-  - any new public-shell or launcher helper that further reduces `Agent` ownership
138
-  - tightened continuation/terminal-stop decisions
139
-  - unified policy timeline serialization and restoration
140
-- runtime coverage for:
141
-  - no regression in normal follow-through on non-mutating and mutating tasks
142
-  - honest terminal behavior where continuation heuristics were deleted or narrowed
143
-  - session/workflow inspection of the unified policy/debug story
144
-- regression coverage for:
145
-  - no drift back toward `agent/loop.py` owning extracted shell glue
146
-  - no silent reintroduction of soft continuation phrasing after harder stop conditions
147
-  - no loss of the current workflow/completion/explore inspection surfaces while timelines are unified
148
-
149
-## Definition of done
150
-
151
-- `agent/loop.py` shrinks again or becomes materially more facade-like even if line count only drops modestly
152
-- Loader deletes or hardens more of the remaining continuation heuristics instead of merely explaining them better
153
-- operators can inspect one more coherent runtime policy story after the fact
154
-- Sprint 18’s completion-trace and public-shell gains remain green
155
-- the parity baseline remains green after the Sprint 19 shell and policy tightening
156
-
157
-## Explicitly out of scope
158
-
159
-- full claw-code policy-engine parity
160
-- multi-agent or team orchestration
161
-- AST-aware semantic diffs
162
-- a broad visual workflow UI
163
-- rich permission-rule editing UX
164
-
165
-## Audit
166
-
167
-### Status
168
-
169
-- Sprint 19 is complete, and the audit is green. Loader now enforces a typed follow-through contract for non-mutating completion checks, fails honestly once that contract is still unsatisfied after the continuation budget is exhausted, and exposes a clearer unified policy story through workflow and session inspection.
170
-
171
-### Landed
172
-
173
-- the public shell contraction continued below `Agent`: `src/loader/runtime/public_shell.py` now owns the public run / run_streaming / run_explore entry helpers plus session-install, resume/reset, and capability-refresh helpers, and `src/loader/agent/loop.py` is down to 292 lines instead of remaining a mixed runtime shell
174
-- non-mutating completion checks are now evidence-shaped instead of binary-only: `src/loader/runtime/task_completion.py` and `src/loader/runtime/completion_policy.py` derive explicit required and missing follow-through evidence, thread that through `TaskCompletionCheck`, and persist it through runtime policy decisions
175
-- Loader now stops honestly when follow-through evidence is still missing after the continuation budget is exhausted: `src/loader/runtime/completion_policy.py` and `src/loader/runtime/turn_completion.py` finalize with an explicit failure response and decision code instead of silently accepting the response
176
-- completion-policy evidence now survives inspection and resume: `src/loader/runtime/completion_trace.py`, `src/loader/runtime/policy_timeline.py`, `src/loader/runtime/session.py`, and `src/loader/runtime/turn_completion.py` persist evidence summaries on completion traces and unified workflow-policy timeline entries
177
-- operators now get a clearer single policy story: `src/loader/runtime/inspection.py` and `src/loader/cli/main.py` add policy-focused workflow filtering through `loader workflow show --policy`, extend workflow kind filters to repair/completion entries, and show a `Policy Timeline` preview inside `loader session show`
178
-
179
-### Verification
180
-
181
-- `uv run pytest -q` is green: `359 passed`
182
-- `tests/test_completion_policy.py`, `tests/test_turn_completion.py`, and `tests/test_session_state.py` now pin the typed follow-through contract plus the honest budget-exhausted finalize path and persisted evidence summaries
183
-- `tests/test_inspection.py` now covers `loader workflow show --policy`, accountability-only timeline filtering, and the new policy-timeline preview in `loader session show`
184
-- the broader runtime shell and launcher coverage stayed green while the shell contraction continued: `tests/test_runtime_public_shell.py`, `tests/test_runtime_harness.py`, and the existing status/session/workflow inspection coverage all remained green
185
-
186
-### Residual debt
187
-
188
-- `src/loader/agent/loop.py` is now much closer to a true facade, but it still exists as the public compatibility shell and still owns some user-facing API shape rather than disappearing entirely
189
-- the unified operator story is better, but completion traces and the workflow timeline still remain separate persisted artifacts under the hood; Sprint 19 made them coherent to inspect, not identical contracts
190
-- the new follow-through assessment is explicit and typed, but it is still heuristic and runtime-authored rather than driven by a deeper verifier/model contract like OMX
191
-- Loader is now more honest and inspectable than Sprint 18, but it still stops short of claw-code's fuller policy engine and OMX's richer long-horizon verifier/interview rigor
.docs/sprints/sprint20.mddeleted
@@ -1,196 +0,0 @@
1
-# Sprint 20: Canonical Policy Events, Verifier-Backed Follow-Through, and Facade Settlement
2
-
3
-## Prerequisites
4
-
5
-Sprint 19
6
-
7
-## Goals
8
-
9
-Take the next honest step after Sprint 19: stop treating policy accountability as a set of coordinated side channels, strengthen follow-through requirements with better runtime evidence, and decide what the public shell should actually remain responsible for now that most of the runtime owns itself.
10
-
11
-Sprint 19 improved the shape of the remaining debt again:
12
-
13
-- `src/loader/agent/loop.py` is now much smaller and more facade-like
14
-- continuation behavior is more honest because missing follow-through can now end in explicit failure instead of silent acceptance
15
-- operators can inspect policy accountability more coherently through `loader workflow show --policy` and the `Policy Timeline` preview in `loader session show`
16
-- but completion traces and workflow timeline entries still coexist as separate persisted artifacts
17
-- follow-through evidence is typed now, but it is still heuristic and runtime-authored rather than grounded in a stronger verifier/evidence contract
18
-- Loader is much closer to an intentional public boundary, but it still has not explicitly settled whether the remaining `Agent` shell is the desired long-term seam or simply the next temporary contraction point
19
-
20
-Sprint 20 should keep using the references as architectural guardrails, not as a porting checklist.
21
-
22
-The standard remains:
23
-
24
-- use claw-code to sharpen canonical runtime ownership, policy/event contracts, and session/runtime seams
25
-- use OMX to sharpen verifier-backed follow-through requirements, evidence accounting, and operator-facing accountability
26
-- do not add work just because the refs have it
27
-- do add work when the refs show that Loader is still too heuristic, too fragmented, or too hard to audit
28
-
29
-`audit.txt` remains a guardrail against backsliding into wrappers and soft rescue behavior. It is not the factual roadmap.
30
-
31
-The references for this sprint are:
32
-
33
-- `refs/claw-code/rust/crates/runtime/src/policy_engine.rs`
34
-- `refs/claw-code/rust/crates/runtime/src/green_contract.rs`
35
-- `refs/claw-code/rust/crates/runtime/src/lane_events.rs`
36
-- `refs/claw-code/rust/crates/runtime/src/session_control.rs`
37
-- `refs/claw-code/rust/crates/runtime/src/bootstrap.rs`
38
-- `refs/claw-code/rust/crates/runtime/src/conversation.rs`
39
-- `refs/claw-code/PARITY.md`
40
-- `refs/oh-my-codex/src/verification/verifier.ts`
41
-- `refs/oh-my-codex/src/autoresearch/contracts.ts`
42
-- `refs/oh-my-codex/src/autoresearch/runtime.ts`
43
-- `refs/oh-my-codex/src/hooks/session.ts`
44
-- `refs/oh-my-codex/src/hooks/prompt-guidance-contract.ts`
45
-- `.docs/PARITY.md`
46
-- `.docs/audit.txt`
47
-- `.docs/audit_sprints/trunk_sitrep.md`
48
-- `.docs/sprints/sprint19.md`
49
-
50
-## Deliverables
51
-
52
-### 1. Define one canonical persisted policy-event contract
53
-
54
-Sprint 19 made the operator story clearer, but Loader still persists policy accountability in more than one shape.
55
-
56
-Implementation targets:
57
-
58
-- inventory the current persisted/debug policy surfaces across:
59
-  - `src/loader/runtime/completion_trace.py`
60
-  - `src/loader/runtime/policy_timeline.py`
61
-  - `src/loader/runtime/workflow_policy.py`
62
-  - `src/loader/runtime/session.py`
63
-  - `src/loader/runtime/events.py`
64
-- decide what the canonical persisted policy/accountability artifact should be:
65
-  - extend the workflow timeline into the sole source of truth
66
-  - or introduce a first-class policy-event model and derive the old surfaces from it
67
-- make sure the canonical event contract can express:
68
-  - workflow routing/handoff/reentry
69
-  - repair retries/failures
70
-  - completion accept/continue/finalize outcomes
71
-  - verify skips and explicit stop reasons
72
-  - evidence summaries and prompt/runtime context
73
-- avoid preserving multiple peer artifacts unless one is clearly a compatibility/read-model projection of the other
74
-
75
-The goal is that Loader has one canonical answer to “what policy decisions happened during this turn/session?” instead of two overlapping persistence models that merely render similarly.
76
-
77
-### 2. Move follow-through requirements closer to a verifier-backed contract
78
-
79
-Sprint 19 added typed required/missing evidence. Sprint 20 should make that contract less purely heuristic.
80
-
81
-Implementation targets:
82
-
83
-- inventory where follow-through requirements are still inferred from weak textual cues across:
84
-  - `src/loader/runtime/task_completion.py`
85
-  - `src/loader/runtime/completion_policy.py`
86
-  - `src/loader/runtime/turn_completion.py`
87
-  - any nearby DoD/verification/session helpers that already know more than the completion heuristic does
88
-- define a stronger runtime completion-requirements contract using available structured evidence such as:
89
-  - task class and workflow mode
90
-  - DoD state and acceptance criteria
91
-  - verification plans and prior verification results
92
-  - actual tool history, artifact paths, and session/runtime evidence
93
-- prefer explicit requirements like:
94
-  - “verification command ran and passed”
95
-  - “mutating touchpoint is accounted for in the active artifact set”
96
-  - “claimed result is backed by observed output”
97
-  over generic “probably incomplete” heuristics
98
-- where the runtime still cannot prove completion, stop honestly and preserve the exact missing requirement set
99
-
100
-The goal is to move closer to claw-code’s green-contract thinking and OMX’s verifier/accountability discipline without forcing Loader into a fake model-assisted verifier that it cannot honestly support yet.
101
-
102
-### 3. Settle the intended long-term public shell boundary
103
-
104
-By Sprint 19, `Agent` is much smaller. Sprint 20 should decide what remains on purpose.
105
-
106
-Implementation targets:
107
-
108
-- inventory the current responsibilities still living in `src/loader/agent/loop.py`
109
-- identify which of those are:
110
-  - true public API / compatibility surface
111
-  - UI integration seam
112
-  - leftover runtime or launcher ownership
113
-- move remaining runtime-ish behavior below the public shell where that is still obviously correct
114
-- explicitly document what stays in `Agent` and why, so future sprints stop treating the shell as a vague cleanup target
115
-
116
-The goal is not “delete the public shell.” The goal is to stop having an ambiguous shell.
117
-
118
-### 4. Make the operator-facing accountability story sharper without multiplying product surfaces
119
-
120
-Sprint 19 improved inspection, but there is still room to reduce cognitive stitching.
121
-
122
-Implementation targets:
123
-
124
-- improve the existing policy-facing views so operators can answer:
125
-  - why did Loader continue?
126
-  - why did Loader stop?
127
-  - what evidence was still missing?
128
-  - what policy stage made the final decision?
129
-- prefer improving:
130
-  - `loader workflow show`
131
-  - `loader session show`
132
-  - `loader status`
133
-  over inventing a brand-new inspection command unless a new surface is clearly cleaner
134
-- where useful, add concise rollup/highlight views that summarize the last important policy event rather than requiring the user to parse the whole timeline manually
135
-
136
-The goal is to make Loader easier to audit after the fact, not simply more verbose.
137
-
138
-## Testing strategy
139
-
140
-- unit coverage for:
141
-  - the canonical policy-event serialization/restoration contract
142
-  - verifier-backed completion-requirement derivation
143
-  - any reduced/settled public-shell boundary helpers
144
-- runtime coverage for:
145
-  - honest finalization when runtime evidence still fails the completion contract
146
-  - no regression in successful follow-through on normal non-mutating and mutating tasks
147
-  - session/workflow/status inspection of the canonical policy story
148
-- regression coverage for:
149
-  - no drift back toward duplicate persisted policy artifacts without a canonical source of truth
150
-  - no reintroduction of soft continuation acceptance after missing-evidence finalization
151
-  - no drift back toward `agent/loop.py` accumulating runtime ownership because the shell boundary is still ambiguous
152
-
153
-## Definition of done
154
-
155
-- Loader has one clearly canonical persisted policy/accountability contract
156
-- follow-through requirements are more explicitly grounded in runtime/verifier evidence instead of mostly textual heuristics
157
-- the remaining public-shell responsibilities are materially smaller or explicitly settled on purpose
158
-- operators can answer the main stop/continue/retry questions from one clearer set of existing inspection surfaces
159
-- Sprint 19’s honesty and policy-inspection gains remain green
160
-
161
-## Explicitly out of scope
162
-
163
-- full claw-code policy-engine parity
164
-- model-authored verifier narratives as a new mandatory runtime dependency
165
-- multi-agent or team orchestration
166
-- AST-aware semantic diffs
167
-- a broad visual workflow UI
168
-- rich permission-rule editing UX
169
-
170
-## Audit
171
-
172
-### Status
173
-
174
-- Sprint 20 is complete, and the audit is green. Loader now treats the workflow timeline as the canonical policy/accountability artifact, grounds more follow-through decisions in runtime verification state, and makes the remaining `Agent` shell boundary explicit in both code and tests.
175
-
176
-### Landed
177
-
178
-- canonical policy/accountability ownership is tighter now: `src/loader/runtime/session.py`, `src/loader/runtime/completion_trace.py`, and `src/loader/runtime/turn_completion.py` now project live completion traces from the canonical workflow timeline instead of treating completion trace writes as a peer runtime artifact
179
-- follow-through checks now consume stronger runtime evidence: `src/loader/runtime/task_completion.py`, `src/loader/runtime/completion_policy.py`, and `src/loader/runtime/turn_completion.py` now use DoD verification commands, prior verification results, successful verification evidence, and tracked pending items to decide whether a non-mutating turn can honestly stop
180
-- operator accountability is sharper without adding new commands: `src/loader/runtime/inspection.py` and `src/loader/cli/main.py` now surface a one-line latest-policy rollup in `loader status` and `loader session show`, sourced from the canonical workflow timeline rather than an ad hoc side channel
181
-- the public shell boundary is now settled on purpose instead of merely smaller by accident: `src/loader/runtime/public_shell.py` owns prompt-mode resolution, prompt-cache invalidation, and owner-bound system/few-shot construction, while `src/loader/agent/loop.py` is down to 267 lines and explicitly documented as the public facade over runtime-owned launch/session helpers
182
-- the shell boundary is also pinned by proof now: `tests/test_runtime_public_shell.py` covers owner-based prompt/few-shot construction and workflow-mode invalidation, while `tests/test_compat_boundaries.py` now guards `agent/loop.py` against drifting back to direct runtime-controller imports
183
-
184
-### Verification
185
-
186
-- `uv run pytest -q` is green: `372 passed`
187
-- `tests/test_completion_policy.py`, `tests/test_turn_completion.py`, and `tests/test_session_state.py` now pin verifier-backed follow-through decisions plus live canonical completion-trace projection
188
-- `tests/test_inspection.py` now covers the latest-policy rollup in the existing status/session surfaces
189
-- `tests/test_runtime_public_shell.py`, `tests/test_runtime_bootstrap.py`, and `tests/test_compat_boundaries.py` now cover the settled public-shell boundary directly
190
-
191
-### Residual debt
192
-
193
-- the workflow timeline is now the canonical policy artifact, but completion traces still remain as a compatibility/read-model surface because status/session inspection still benefits from a compact per-turn view
194
-- follow-through is more verifier-backed than Sprint 19, but it is still runtime-authored and heuristic in places; Loader still does not have OMX-style deeper verifier reasoning or richer artifact-derived proof models
195
-- `src/loader/agent/loop.py` is now explicitly the public facade, but Loader still keeps that compatibility shell instead of collapsing to a narrower runtime-first API
196
-- Loader is more coherent and auditable than Sprint 19, but it still stops short of claw-code's fuller policy engine, richer rule/prompt policy surfaces, and OMX's deeper interview/verifier rigor
.docs/sprints/sprint21.mddeleted
@@ -1,187 +0,0 @@
1
-# Sprint 21: Evidence Provenance, Read-Model Cleanup, and Runtime-First API
2
-
3
-## Prerequisites
4
-
5
-Sprint 20
6
-
7
-## Goals
8
-
9
-Take the next honest step after Sprint 20: move Loader's completion and verification story from "better heuristics with canonical policy events" toward stronger evidence provenance, reduce compatibility read-model duplication where the canonical workflow timeline already carries the truth, and begin narrowing internal callers toward a runtime-first API instead of treating `Agent` as the only natural seam.
10
-
11
-Sprint 20 changed the remaining debt in a useful way:
12
-
13
-- the workflow timeline is now the canonical policy/accountability artifact, including live completion-trace projection
14
-- follow-through checks now use stronger DoD/runtime evidence instead of only textual heuristics
15
-- the remaining `Agent` shell is explicitly documented and guarded as a public facade
16
-- but completion/verification evidence is still mostly flattened into human-readable strings rather than typed provenance
17
-- compatibility/read-model surfaces still rely on a few projections that are honest but not yet minimal
18
-- internal callers still treat `Agent` as the default runtime entry seam even though the shell is now explicitly a compatibility/public facade
19
-
20
-Sprint 21 should keep using the references as architectural guardrails, not as a feature-copy list.
21
-
22
-The standard remains:
23
-
24
-- use claw-code to sharpen canonical event ownership, green-contract discipline, and runtime-first seams
25
-- use OMX to sharpen verifier/accountability provenance and evidence-backed follow-through
26
-- do not add work just because the refs have it
27
-- do add work when the refs show that Loader is still too stringly-typed, too duplicative, or too dependent on a compatibility shell
28
-
29
-`audit.txt` remains a guardrail against wrapper-heavy drift and soft rescue behavior. It is not the factual roadmap.
30
-
31
-The references for this sprint are:
32
-
33
-- `refs/claw-code/rust/crates/runtime/src/policy_engine.rs`
34
-- `refs/claw-code/rust/crates/runtime/src/green_contract.rs`
35
-- `refs/claw-code/rust/crates/runtime/src/lane_events.rs`
36
-- `refs/claw-code/rust/crates/runtime/src/session_control.rs`
37
-- `refs/claw-code/rust/crates/runtime/src/bootstrap.rs`
38
-- `refs/claw-code/rust/crates/runtime/src/conversation.rs`
39
-- `refs/claw-code/PARITY.md`
40
-- `refs/oh-my-codex/src/verification/verifier.ts`
41
-- `refs/oh-my-codex/src/autoresearch/contracts.ts`
42
-- `refs/oh-my-codex/src/autoresearch/runtime.ts`
43
-- `refs/oh-my-codex/src/hooks/session.ts`
44
-- `refs/oh-my-codex/src/hooks/prompt-guidance-contract.ts`
45
-- `.docs/PARITY.md`
46
-- `.docs/audit.txt`
47
-- `.docs/audit_sprints/trunk_sitrep.md`
48
-- `.docs/sprints/sprint20.md`
49
-
50
-## Deliverables
51
-
52
-### 1. Introduce typed evidence provenance for completion and verification
53
-
54
-Sprint 20 strengthened follow-through, but most of the contract still collapses into free-text evidence summaries too early.
55
-
56
-Implementation targets:
57
-
58
-- inventory where completion/verification evidence is currently flattened into strings across:
59
-  - `src/loader/runtime/task_completion.py`
60
-  - `src/loader/runtime/completion_trace.py`
61
-  - `src/loader/runtime/policy_timeline.py`
62
-  - `src/loader/runtime/finalization.py`
63
-  - `src/loader/runtime/dod.py`
64
-  - `src/loader/runtime/workflow_policy.py`
65
-- define a small typed provenance model that can represent things like:
66
-  - verification command ran and passed/failed
67
-  - verification command was still missing
68
-  - tracked work item remained incomplete
69
-  - artifact/touchpoint evidence existed or was contradicted
70
-  - claimed runtime outcome was backed by observed output
71
-- prefer structured provenance that can still be rendered into human-readable summaries, instead of making strings the primary contract
72
-- thread that provenance through completion policy and canonical policy events where it materially improves honesty or inspectability
73
-
74
-The goal is not to build a fake theorem prover. The goal is to stop throwing away runtime evidence structure too early.
75
-
76
-### 2. Reduce read-model duplication around the canonical workflow timeline
77
-
78
-Sprint 20 made the workflow timeline canonical, but a few read models still feel more coupled than they need to be.
79
-
80
-Implementation targets:
81
-
82
-- inventory where compatibility/read-model projections still depend on direct mutation or duplicated logic across:
83
-  - `src/loader/runtime/completion_trace.py`
84
-  - `src/loader/runtime/session.py`
85
-  - `src/loader/runtime/inspection.py`
86
-  - `src/loader/runtime/events.py`
87
-  - any nearby status/session helper that reconstructs policy state manually
88
-- make sure projections like completion traces and latest-policy summaries are clearly derivations from canonical policy events instead of semi-independent contracts
89
-- remove any remaining direct writes or state bookkeeping that are only there to keep parallel policy read models in sync
90
-- keep compact operator-facing read models where they help, but make their derived nature explicit in code and tests
91
-
92
-The goal is one canonical truth plus honest projections, not a forest of near-duplicates.
93
-
94
-### 3. Start the runtime-first internal API transition below the public `Agent` facade
95
-
96
-Sprint 20 settled `Agent` as the public compatibility shell. Sprint 21 should stop using that shell as the default internal seam where it no longer needs to be.
97
-
98
-Implementation targets:
99
-
100
-- inventory current internal call sites that still instantiate or consume `Agent` when a runtime-first seam would be cleaner, especially in:
101
-  - launcher/bootstrap helpers
102
-  - CLI/TUI integration code
103
-  - tests that are really exercising runtime behavior rather than public compatibility
104
-- define a small runtime-first entry contract for internal consumers where it clearly reduces shell coupling
105
-- keep `Agent` as the public compatibility surface, but begin migrating internal runtime-oriented callers away from assuming that `Agent` is the only valid execution owner
106
-- document what remains intentionally public-shell-only versus what is now runtime-first
107
-
108
-The goal is not to delete `Agent`. The goal is to make `Agent` clearly public/compatibility-facing while runtime internals use runtime-first seams by default.
109
-
110
-### 4. Sharpen operator visibility for evidence-backed stop/continue decisions
111
-
112
-Sprint 20 improved policy summaries, but the evidence itself is still only partially visible.
113
-
114
-Implementation targets:
115
-
116
-- improve the existing operator views so users can answer:
117
-  - what exact evidence was missing when Loader stopped?
118
-  - what exact evidence satisfied the completion contract?
119
-  - which policy event carried that evidence?
120
-- prefer improving:
121
-  - `loader workflow show`
122
-  - `loader session show`
123
-  - `loader status`
124
-  over inventing a new command unless a new surface is clearly cleaner
125
-- add concise rollups first, and expose deeper provenance only where it materially helps post-mortem inspection
126
-
127
-The goal is to make Loader easier to audit after the fact, not simply more verbose.
128
-
129
-## Testing strategy
130
-
131
-- unit coverage for:
132
-  - typed evidence-provenance normalization/rendering
133
-  - derived read-model projections from the canonical workflow timeline
134
-  - any new runtime-first internal entry contract below `Agent`
135
-- runtime coverage for:
136
-  - honest finalization with explicit evidence provenance when completion still fails
137
-  - successful completion paths that now surface structured proof instead of only summary strings
138
-  - status/session/workflow inspection of the evidence-backed policy story
139
-- regression coverage for:
140
-  - no drift back toward peer policy artifacts beside the canonical workflow timeline
141
-  - no drift back toward `Agent` as the default internal seam when a runtime-first contract exists
142
-  - no loss of the current compact operator read models while provenance becomes richer
143
-
144
-## Definition of done
145
-
146
-- Loader preserves one canonical policy/accountability artifact while making evidence provenance more structured
147
-- completion/verification evidence is less stringly-typed and more inspectable without weakening honesty
148
-- internal runtime-oriented code has at least one cleaner runtime-first seam below the public `Agent` facade
149
-- existing status/session/workflow surfaces answer stop/continue questions with clearer evidence context
150
-- Sprint 20's canonical-policy and facade-settlement gains remain green
151
-
152
-## Explicitly out of scope
153
-
154
-- full claw-code policy-engine parity
155
-- model-authored verifier narratives as a mandatory dependency
156
-- multi-agent or team orchestration
157
-- AST-aware semantic diffs
158
-- a broad visual workflow UI
159
-- rich permission-rule editing UX
160
-
161
-## Audit
162
-
163
-### Status
164
-
165
-- Sprint 21 is complete, and the audit is green. Loader now carries typed evidence provenance through canonical policy events, derives more of its operator/accountability story from one shared workflow-timeline read model, and has a real runtime-first internal owner below the public `Agent` facade.
166
-
167
-### Landed
168
-
169
-- evidence provenance is now a stronger first-class contract instead of a mostly flattened string path: `src/loader/runtime/evidence_provenance.py`, `src/loader/runtime/task_completion.py`, `src/loader/runtime/completion_policy.py`, `src/loader/runtime/turn_completion.py`, `src/loader/runtime/finalization.py`, `src/loader/runtime/policy_timeline.py`, and `src/loader/runtime/completion_trace.py` now preserve typed support/missing/contradiction context through canonical policy events and projected completion traces
170
-- canonical read-model duplication is lower: `src/loader/runtime/workflow_timeline_read_model.py` now owns shared policy projections, grouped evidence rollups, latest-policy summaries, and operator highlights instead of scattering that logic across inspection surfaces
171
-- the runtime-first internal API transition is real now, not just planned: `src/loader/runtime/runtime_handle.py` provides a runtime-owned owner below `Agent`, and runtime-oriented tests in `tests/test_runtime_handle.py`, `tests/test_runtime_launcher.py`, `tests/test_turn_preparation.py`, and `tests/test_runtime_public_shell.py` now exercise launcher/bootstrap/public-shell behavior without assuming the public compatibility facade is the only valid execution owner
172
-- operator visibility is sharper without adding new product surfaces: `src/loader/runtime/inspection.py` and `src/loader/cli/main.py` now show latest policy evidence rollups in `loader status`, `loader session show`, and `loader workflow show`, including concise “needed” vs “satisfied” evidence summaries derived from the canonical workflow timeline
173
-
174
-### Verification
175
-
176
-- `uv run pytest -q` is green: `380 passed`
177
-- `tests/test_evidence_provenance.py`, `tests/test_completion_policy.py`, `tests/test_turn_completion.py`, `tests/test_session_state.py`, and `tests/test_inspection.py` now pin typed provenance through completion/finalization plus persisted operator inspection
178
-- `tests/test_workflow_timeline_read_model.py` now covers grouped supporting/missing evidence rollups from canonical policy events
179
-- `tests/test_runtime_handle.py`, `tests/test_runtime_launcher.py`, `tests/test_turn_preparation.py`, and `tests/test_runtime_public_shell.py` now cover the new runtime-first owner seam directly
180
-- `tests/test_compat_boundaries.py` remains green, including the runtime import-boundary guard after the new handle landed
181
-
182
-### Residual debt
183
-
184
-- Loader now has a runtime-first internal owner, but the public `Agent` shell still exists as the outer compatibility API and is still used by many public-surface and end-to-end tests
185
-- evidence provenance is more structured and inspectable, but it is still runtime-authored and bounded; Loader still does not have OMX-style deeper verifier reasoning, richer artifact-derived proof, or model-assisted audit narratives
186
-- the workflow timeline read model is cleaner, but Loader still keeps compact derived surfaces like completion traces and latest-policy summaries because operators benefit from them
187
-- the new policy-evidence rollups make stop/continue decisions easier to inspect, but Loader still stops short of claw-code's fuller policy engine, richer rule surfaces, and OMX's deeper interview/verifier rigor
.docs/sprints/sprint22.mddeleted
@@ -1,191 +0,0 @@
1
-# Sprint 22: Runtime Entry API, Verification Observations, and Compatibility Narrowing
2
-
3
-## Prerequisites
4
-
5
-Sprint 21
6
-
7
-## Goals
8
-
9
-Take the next honest step after Sprint 21: stop treating the new runtime-first owner as only a testing seam, move verification/accountability closer to the moment verification actually happens, and keep narrowing the public compatibility shell without pretending Loader is ready to delete `Agent`.
10
-
11
-Sprint 21 changed the remaining debt in a useful way:
12
-
13
-- Loader now has a runtime-owned internal handle and no longer needs `Agent` for some runtime-oriented tests
14
-- policy/accountability surfaces can now show grouped supporting vs missing evidence instead of flattening everything into one summary string
15
-- the workflow timeline read model is more canonical and less duplicative
16
-- but `Agent` still remains the default construction seam for most real integrations
17
-- verification evidence is still largely reconstructed from DoD/session state after the fact rather than captured as a first-class observation at execution time
18
-- evidence provenance is richer, but it is still bounded and runtime-authored; Loader still cannot always answer which observed verification attempt or artifact actually justified the final stop/continue decision
19
-
20
-Sprint 22 should keep using the references as architectural guardrails, not as a feature-copy list.
21
-
22
-The standard remains:
23
-
24
-- use claw-code to sharpen runtime-first entry seams, session/bootstrap ownership, and event accountability
25
-- use OMX to sharpen verifier-backed evidence capture and auditability around successful vs missing proof
26
-- do not add work just because the refs have it
27
-- do add work when the refs show that Loader is still too shell-bound, too post-hoc, or too hard to audit honestly
28
-
29
-`audit.txt` remains a guardrail against wrapper-heavy drift and fake rescue behavior. It is not the factual roadmap.
30
-
31
-The references for this sprint are:
32
-
33
-- `refs/claw-code/rust/crates/runtime/src/bootstrap.rs`
34
-- `refs/claw-code/rust/crates/runtime/src/session_control.rs`
35
-- `refs/claw-code/rust/crates/runtime/src/conversation.rs`
36
-- `refs/claw-code/rust/crates/runtime/src/policy_engine.rs`
37
-- `refs/claw-code/rust/crates/runtime/src/lane_events.rs`
38
-- `refs/claw-code/rust/crates/runtime/src/green_contract.rs`
39
-- `refs/claw-code/PARITY.md`
40
-- `refs/oh-my-codex/src/verification/verifier.ts`
41
-- `refs/oh-my-codex/src/autoresearch/contracts.ts`
42
-- `refs/oh-my-codex/src/autoresearch/runtime.ts`
43
-- `refs/oh-my-codex/src/hooks/session.ts`
44
-- `.docs/PARITY.md`
45
-- `.docs/audit.txt`
46
-- `.docs/audit_sprints/trunk_sitrep.md`
47
-- `.docs/sprints/sprint21.md`
48
-
49
-## Deliverables
50
-
51
-### 1. Promote the runtime-first entry contract beyond test-only use
52
-
53
-Sprint 21 proved that `RuntimeHandle` is a valid internal owner. Sprint 22 should use that seam in more real integration paths where `Agent` is only serving as an unnecessary wrapper.
54
-
55
-Implementation targets:
56
-
57
-- inventory internal call sites that still instantiate `Agent` by default even though they are really consuming runtime-owned behavior, especially around:
58
-  - launcher/bootstrap helpers
59
-  - interactive CLI/TUI integration seams
60
-  - harnesses and utilities that are not actually testing public compatibility
61
-- define a small runtime-first entry contract for internal integrations that need:
62
-  - runtime bootstrap/session ownership
63
-  - launcher/public-shell execution
64
-  - inspection or session continuity hooks
65
-- migrate a bounded but real set of internal callers to that runtime-first seam
66
-- explicitly document what remains intentionally public-shell-only versus what is now runtime-first by default
67
-
68
-The goal is not to delete `Agent`. The goal is to make `Agent` more clearly public/compatibility-facing while real internal integrations stop depending on it by habit.
69
-
70
-### 2. Introduce typed verification observations closer to execution time
71
-
72
-Loader is better at explaining missing evidence now, but it still often reconstructs that explanation after the fact from session and DoD state.
73
-
74
-Implementation targets:
75
-
76
-- inventory where verification evidence is currently inferred or flattened after execution across:
77
-  - `src/loader/runtime/dod.py`
78
-  - `src/loader/runtime/finalization.py`
79
-  - `src/loader/runtime/task_completion.py`
80
-  - `src/loader/runtime/completion_policy.py`
81
-  - `src/loader/runtime/session.py`
82
-  - `src/loader/runtime/workflow_policy.py`
83
-- define a typed verification-observation contract that can represent things like:
84
-  - verification command requested
85
-  - verification command ran
86
-  - verification command passed or failed
87
-  - verification command output backed or contradicted a claimed result
88
-  - verification attempt was skipped, stale, or still missing
89
-  - observed artifact/touchpoint evidence that materially supported or blocked completion
90
-- decide whether those observations should live directly in the canonical workflow timeline, as a derived session sub-artifact, or as a canonical companion model with a single clear ownership story
91
-- avoid inventing a second peer truth beside the canonical policy story
92
-
93
-The goal is to move Loader’s accountability closer to “this is what was actually observed” rather than “this is what the runtime later summarized.”
94
-
95
-### 3. Strengthen stop/continue proof using observed verification and artifact evidence
96
-
97
-Sprint 21 made evidence more structured, but it still leaves some stop/continue decisions too dependent on runtime-authored summaries.
98
-
99
-Implementation targets:
100
-
101
-- connect completion/finalization decisions more directly to:
102
-  - typed verification observations
103
-  - active DoD acceptance state
104
-  - tracked pending items
105
-  - observed artifact/touchpoint evidence
106
-  - contradictions already captured in the workflow ledger or drift evidence
107
-- prefer explicit proof stories like:
108
-  - “verification command X passed and covers acceptance boundary Y”
109
-  - “artifact Z was updated and verified against the planned touchpoint set”
110
-  - “completion still failed because verification command X never ran / failed / contradicted the claim”
111
-  over broader fallback summaries when the runtime has stronger evidence available
112
-- where Loader still cannot prove completion, preserve the exact missing or contradictory observation set in the policy/accountability story
113
-
114
-The goal is not to build a deep theorem prover. The goal is to keep pushing Loader from heuristic completion toward evidence-backed completion without bluffing.
115
-
116
-### 4. Sharpen operator inspection around observed verification state
117
-
118
-Sprint 21 made policy evidence easier to read. Sprint 22 should make observed verification attempts easier to audit from the same existing product surfaces.
119
-
120
-Implementation targets:
121
-
122
-- improve the existing operator views so users can answer:
123
-  - which verification command last ran?
124
-  - did it pass, fail, or never run?
125
-  - what output or observed artifact actually backed the stop/continue decision?
126
-  - what evidence is still missing versus already satisfied?
127
-- prefer improving:
128
-  - `loader status`
129
-  - `loader session show`
130
-  - `loader workflow show`
131
-  over inventing a new command unless a new surface is clearly cleaner
132
-- keep concise rollups first, and expose deeper verification/observation detail only where it materially improves post-mortem debugging
133
-
134
-The goal is to make Loader easier to audit after the fact, not simply more verbose.
135
-
136
-## Testing strategy
137
-
138
-- unit coverage for:
139
-  - runtime-first entry helpers adopted below `Agent`
140
-  - typed verification-observation normalization and persistence
141
-  - derived policy/accountability projections from observed verification state
142
-- runtime coverage for:
143
-  - successful completion with explicit observed verification backing
144
-  - failed or missing verification that now produces a more concrete stop/continue story
145
-  - internal integration paths that now use the runtime-first entry contract
146
-- regression coverage for:
147
-  - no drift back toward `Agent` as the default internal seam when a runtime-first contract exists
148
-  - no duplicate truth beside the canonical policy/accountability story
149
-  - no regression in Sprint 21’s evidence provenance, policy rollups, or runtime-handle contract
150
-
151
-## Definition of done
152
-
153
-- Loader has at least one more real internal integration path using a runtime-first entry contract below `Agent`
154
-- verification/accountability captures more first-class observed state instead of only post-hoc summaries
155
-- stop/continue decisions can point to clearer observed proof or missing proof
156
-- existing status/session/workflow surfaces expose that stronger verification/accountability story without multiplying commands
157
-- Sprint 21’s runtime-handle, provenance, and grouped policy-evidence gains remain green
158
-
159
-## Explicitly out of scope
160
-
161
-- deleting `Agent` as the public compatibility surface
162
-- full claw-code policy-engine parity
163
-- model-authored verifier narratives as a required runtime dependency
164
-- AST-aware semantic diffs
165
-- a broad visual workflow UI
166
-- multi-agent or team orchestration
167
-
168
-## Audit
169
-
170
-### Status
171
-
172
-- Sprint 22 is complete on the verification-observation and accountability lane, and the audit is green. Loader now captures typed verification observations closer to execution, carries them through the canonical policy story, and exposes that observed verification state directly in the existing operator surfaces.
173
-- The planned runtime-first entry promotion beyond test-only use did not land in this sprint. That debt is now an explicit Sprint 23 carry-forward item, not an implied cleanup tail.
174
-
175
-### Landed
176
-
177
-- verification observations are now a first-class runtime contract instead of a reconstructed afterthought: `src/loader/runtime/verification_observations.py`, `src/loader/runtime/finalization.py`, `src/loader/runtime/workflow_policy.py`, `src/loader/runtime/policy_timeline.py`, `src/loader/runtime/completion_trace.py`, and `src/loader/runtime/turn_completion.py` now preserve typed observed verification state through the DoD gate, canonical policy events, and projected completion traces
178
-- stop/continue policy is more explicit about why Loader stopped: `src/loader/runtime/task_completion.py`, `src/loader/runtime/completion_policy.py`, and `src/loader/runtime/turn_completion.py` now use observed verification facts when they exist and preserve those facts on exhausted continuation failures instead of only falling back to generic missing-evidence language
179
-- operator inspection is sharper without multiplying surfaces: `src/loader/runtime/workflow_timeline_read_model.py`, `src/loader/runtime/inspection.py`, and `src/loader/cli/main.py` now surface observed verification in `loader status`, `loader session show`, and `loader workflow show`, including a unified `Recent Verification` view sourced from canonical policy observations first and DoD evidence second
180
-
181
-### Verification
182
-
183
-- `uv run pytest -q` is green: `388 passed`
184
-- `tests/test_verification_observations.py`, `tests/test_finalization.py`, `tests/test_completion_policy.py`, and `tests/test_turn_completion.py` now pin the verification-observation contract through finalization and completion stop policy
185
-- `tests/test_workflow_timeline_read_model.py` and `tests/test_inspection.py` now cover observed-verification rollups, the unified recent-verification view, and the operator-facing explanation strings sourced from canonical policy state
186
-
187
-### Residual debt
188
-
189
-- Sprint 22 intentionally did not complete Deliverable 1. Loader still has a runtime-first internal owner from Sprint 21, but this sprint did not promote additional real integration paths away from `Agent`
190
-- verification/accountability is more observed and audit-friendly now, but it is still bounded and runtime-authored; Loader still stops short of deeper OMX-style verifier reasoning, richer artifact-derived proof, or model-assisted audit narratives
191
-- the existing status/session/workflow surfaces are clearer now, but Loader still stops short of claw-code's fuller policy engine, narrower runtime-first public API, and richer rule/prompt accountability surfaces
.docs/sprints/sprint23.mddeleted
@@ -1,183 +0,0 @@
1
-# Sprint 23: Runtime-First Integrations, Verification Producers, and Facade Narrowing
2
-
3
-## Prerequisites
4
-
5
-Sprint 22
6
-
7
-## Goals
8
-
9
-Take the next honest step after Sprint 22: stop treating the runtime-first owner as mostly a testing seam, move more real integrations onto that seam where `Agent` is only habit, and widen the new verification-observation contract from a finalization/result story into a more direct execution-time producer story.
10
-
11
-Sprint 22 improved the remaining debt in a useful way:
12
-
13
-- Loader now carries typed verification observations through canonical policy events and completion-stop decisions
14
-- the operator surfaces can now explain recent verification from canonical policy observations first instead of stitching together post-hoc summaries
15
-- the canonical workflow timeline is stronger as the accountability story
16
-- but the planned runtime-first entry promotion beyond tests did not land
17
-- `Agent` still remains the default construction seam for many real internal paths even though Loader now has a runtime-owned internal handle and explicit public facade boundaries
18
-- verification observations are still strongest at the DoD/finalization edge; Loader still captures less from the earlier verification lifecycle than it now knows how to represent
19
-
20
-Sprint 23 should keep using the references as architectural guardrails, not as a feature-copy list.
21
-
22
-The standard remains:
23
-
24
-- use claw-code to sharpen runtime-first bootstrap/session ownership, event capture, and narrower public boundaries
25
-- use OMX to sharpen direct verifier-observation capture and evidence-backed accountability
26
-- do not add work just because the refs have it
27
-- do add work when the refs show that Loader is still too shell-bound, too late in its evidence capture, or too fuzzy about which boundary owns what
28
-
29
-`audit.txt` remains a guardrail against wrapper-heavy drift and soft compatibility habits. It is not the factual roadmap.
30
-
31
-The references for this sprint are:
32
-
33
-- `refs/claw-code/rust/crates/runtime/src/bootstrap.rs`
34
-- `refs/claw-code/rust/crates/runtime/src/session_control.rs`
35
-- `refs/claw-code/rust/crates/runtime/src/conversation.rs`
36
-- `refs/claw-code/rust/crates/runtime/src/policy_engine.rs`
37
-- `refs/claw-code/rust/crates/runtime/src/lane_events.rs`
38
-- `refs/claw-code/rust/crates/runtime/src/green_contract.rs`
39
-- `refs/claw-code/PARITY.md`
40
-- `refs/oh-my-codex/src/verification/verifier.ts`
41
-- `refs/oh-my-codex/src/autoresearch/contracts.ts`
42
-- `refs/oh-my-codex/src/autoresearch/runtime.ts`
43
-- `refs/oh-my-codex/src/hooks/session.ts`
44
-- `.docs/PARITY.md`
45
-- `.docs/audit.txt`
46
-- `.docs/audit_sprints/trunk_sitrep.md`
47
-- `.docs/sprints/sprint22.md`
48
-
49
-## Deliverables
50
-
51
-### 1. Promote the runtime-first entry contract into real internal integrations
52
-
53
-Sprint 22 left this as explicit debt. Sprint 23 should land it for real.
54
-
55
-Implementation targets:
56
-
57
-- inventory internal call sites that still instantiate or route through `Agent` by default even though they are consuming runtime-owned behavior, especially around:
58
-  - launcher/bootstrap helpers
59
-  - CLI/TUI integration seams that are not actually testing the public compatibility contract
60
-  - harnesses, utilities, and inspection/session helpers that only need runtime ownership
61
-- define or refine a small runtime-first entry contract for internal consumers that need:
62
-  - runtime bootstrap/session ownership
63
-  - launcher/public-shell execution
64
-  - inspection, continuity, or workflow/accountability hooks
65
-- migrate a bounded but real set of internal integrations onto that seam
66
-- explicitly document what remains intentionally public-shell-only versus what is now runtime-first by default
67
-
68
-The goal is not to delete `Agent`. The goal is to stop using it internally by reflex where a runtime-owned seam is cleaner and already exists.
69
-
70
-### 2. Expand verification observations to earlier execution-time producers
71
-
72
-Sprint 22 made verification observations real, but they still enter the story relatively late.
73
-
74
-Implementation targets:
75
-
76
-- inventory where verification-related facts are currently available earlier than finalization across:
77
-  - `src/loader/runtime/dod.py`
78
-  - `src/loader/runtime/finalization.py`
79
-  - `src/loader/runtime/tool_batches.py`
80
-  - `src/loader/runtime/executor.py`
81
-  - `src/loader/runtime/workflow_lanes.py`
82
-  - `src/loader/runtime/workflow_policy.py`
83
-- identify which observation kinds should be emitted closer to execution, such as:
84
-  - verification command planned/requested
85
-  - verification command actually executed
86
-  - verification output observed and classified as passed/failed/contradictory
87
-  - verification was intentionally skipped, stale, or still pending
88
-  - observed artifact/touchpoint evidence materially backed or blocked completion before finalization
89
-- thread those earlier observations into the canonical policy story without creating a peer truth beside the workflow timeline
90
-
91
-The goal is to move Loader’s accountability closer to “this is what we observed while verification happened” instead of only “this is what the runtime concluded later.”
92
-
93
-### 3. Narrow the remaining public facade boundary on purpose
94
-
95
-Sprint 20 settled `Agent` as the public shell. Sprint 23 should keep making that shell smaller and more explicit.
96
-
97
-Implementation targets:
98
-
99
-- inventory what still lives in `src/loader/agent/loop.py` and nearby public-shell glue
100
-- identify which pieces are:
101
-  - true public compatibility API
102
-  - UI integration seam
103
-  - leftover runtime ownership that can move below the shell now
104
-- move the still-obviously-runtime pieces below the public shell where that reduces ambiguity
105
-- add or extend direct boundary tests so internal code does not drift back toward `Agent` ownership once a runtime seam exists
106
-
107
-The goal is not “make the file smaller” for its own sake. The goal is that future work has a clearer answer to what the public shell is for.
108
-
109
-### 4. Sharpen operator visibility for runtime-first ownership and observed verification
110
-
111
-Once more of the real integration paths go runtime-first and more observations are captured earlier, the existing product surfaces should make that easier to audit.
112
-
113
-Implementation targets:
114
-
115
-- improve the existing surfaces so users can answer:
116
-  - which runtime-owned path produced the current session/accountability state?
117
-  - what verification was actually observed earlier in the turn versus only concluded at finalization?
118
-  - what evidence is pending, contradicted, or already satisfied?
119
-- prefer improving:
120
-  - `loader status`
121
-  - `loader session show`
122
-  - `loader workflow show`
123
-  over inventing a new command unless a new surface is clearly cleaner
124
-- keep concise rollups first, and expose deeper ownership/observation detail only where it materially improves post-mortem debugging
125
-
126
-The goal is to make Loader easier to audit after the fact, not simply more verbose.
127
-
128
-## Testing strategy
129
-
130
-- unit coverage for:
131
-  - runtime-first entry helpers adopted in real internal integration paths
132
-  - earlier verification-observation producer normalization and persistence
133
-  - public-shell boundary helpers and import/boundary guards
134
-- runtime coverage for:
135
-  - successful completion with observed verification facts emitted before finalization
136
-  - failed or pending verification that now leaves a clearer producer-backed policy trail
137
-  - internal integration paths that no longer need `Agent` by default
138
-- regression coverage for:
139
-  - no drift back toward `Agent` as the default internal seam when a runtime-owned seam exists
140
-  - no duplicate truth beside the canonical policy/accountability story
141
-  - no regression in Sprint 22’s observed-verification inspection and stop/continue honesty
142
-
143
-## Definition of done
144
-
145
-- Loader has at least one more real internal integration path using a runtime-first entry seam below `Agent`
146
-- verification observations are emitted from at least one earlier execution-time producer instead of only the finalization edge
147
-- the remaining public-shell boundary is smaller or more explicitly defended on purpose
148
-- existing status/session/workflow surfaces expose the stronger runtime-first and verification-observation story without multiplying commands
149
-- Sprint 22’s observed-verification and accountability gains remain green
150
-
151
-## Explicitly out of scope
152
-
153
-- deleting `Agent` as the public compatibility surface
154
-- full claw-code policy-engine parity
155
-- model-authored verifier narratives as a required runtime dependency
156
-- AST-aware semantic diffs
157
-- a broad visual workflow UI
158
-- multi-agent or team orchestration
159
-
160
-## Audit
161
-
162
-### Status
163
-
164
-- Sprint 23 is complete, and the audit is green. Loader now uses the runtime-first seam in real internal integrations, captures verification observations closer to the moment verification runs, and exposes runtime-owner provenance in the same operator surfaces that already carry policy and workflow accountability.
165
-
166
-### Landed
167
-
168
-- runtime-first ownership is now materially real outside tests: `src/loader/runtime/runtime_handle.py` now owns direct `run` / `run_streaming` / `run_explore` entrypoints, `src/loader/cli/main.py` routes non-TUI CLI and `loader explore` through that runtime-first owner by default, and `tests/helpers/runtime_harness.py` now uses `RuntimeHandle` for scripted runtime scenarios instead of instantiating `Agent` by habit
169
-- verification observations now enter the canonical accountability story closer to execution: `src/loader/runtime/finalization.py`, `src/loader/runtime/workflow_policy.py`, `src/loader/runtime/policy_timeline.py`, and `src/loader/runtime/workflow_timeline_read_model.py` now persist and project per-command `verify_observation` entries, so Loader can explain what verification actually ran and what it observed instead of only summarizing that state later
170
-- runtime-owner provenance is now part of persisted session state and inspection: `src/loader/runtime/owner_metadata.py`, `src/loader/runtime/bootstrap.py`, `src/loader/runtime/public_shell.py`, and `src/loader/runtime/session.py` now persist owner-path metadata, while `src/loader/runtime/inspection.py` and `src/loader/cli/main.py` surface that metadata in `loader status`, `loader session list/show`, and `loader workflow show`
171
-
172
-### Verification
173
-
174
-- `uv run pytest -q` is green: `397 passed`
175
-- `tests/test_runtime_handle.py`, `tests/test_cli_runtime_owner.py`, and `tests/helpers/runtime_harness.py` now pin real runtime-first integration paths below `Agent`
176
-- `tests/test_finalization.py` and `tests/test_workflow_timeline_read_model.py` now pin per-command verification-observation entries and their projection into workflow/policy views
177
-- `tests/test_session_state.py`, `tests/test_runtime_public_shell.py`, `tests/test_runtime_bootstrap.py`, `tests/test_runtime_launcher.py`, and `tests/test_inspection.py` now cover persisted runtime-owner metadata plus its status/session/workflow rendering
178
-
179
-### Residual debt
180
-
181
-- Loader now has real runtime-first internal integrations, but the TUI still routes through the public `Agent` facade and the public shell still remains the outermost construction contract for external integrations
182
-- verification observations are now closer to execution, but they are still strongest around the verification loop/finalization path; Loader still does not yet emit a richer lifecycle story for planned, pending, or stale verification outside that bounded lane
183
-- the new owner-path visibility makes runtime-first adoption auditable, but Loader still stops short of a narrower public runtime API, claw-code's fuller policy engine, and OMX's deeper verifier/interview rigor
.docs/sprints/sprint24.mddeleted
@@ -1,180 +0,0 @@
1
-# Sprint 24: TUI Runtime Convergence, Verification Lifecycle, and Facade Narrowing
2
-
3
-## Prerequisites
4
-
5
-Sprint 23
6
-
7
-## Goals
8
-
9
-Take the next honest step after Sprint 23: stop treating the TUI as the last major product path that still defaults to `Agent` by habit, widen the verification-observation story from “what ran” toward “what is planned, pending, or stale,” and keep narrowing the public shell without pretending Loader is ready to delete it.
10
-
11
-Sprint 23 changed the remaining debt in a useful way:
12
-
13
-- non-TUI CLI, `loader explore`, and the scripted runtime harness now use the runtime-first owner seam below `Agent`
14
-- verification observations now enter the canonical workflow timeline closer to execution through per-command verification events
15
-- operator surfaces can now show which runtime-owner path produced the current session/accountability state
16
-- but the TUI still routes through the public `Agent` shell even though it primarily consumes runtime-owned behavior
17
-- verification observations still say more about commands that ran than about commands that are only planned, still pending, or now stale
18
-- Loader is much closer to an explicit public/runtime boundary, but it still has not fully converged on what must remain public-shell-only versus what can be runtime-first by default
19
-
20
-Sprint 24 should keep using the references as architectural guardrails, not as a feature-copy list.
21
-
22
-The standard remains:
23
-
24
-- use claw-code to sharpen runtime/bootstrap ownership, event accountability, and narrower UI-facing shell seams
25
-- use OMX to sharpen verifier-lifecycle visibility and auditability around pending vs observed proof
26
-- do not add work just because the refs have it
27
-- do add work when the refs show that Loader is still too shell-bound, too late in its verification story, or too ambiguous about what the public shell still owns
28
-
29
-`audit.txt` remains a guardrail against wrapper-heavy drift and compatibility-by-habit. It is not the factual roadmap.
30
-
31
-The references for this sprint are:
32
-
33
-- `refs/claw-code/rust/crates/runtime/src/bootstrap.rs`
34
-- `refs/claw-code/rust/crates/runtime/src/session_control.rs`
35
-- `refs/claw-code/rust/crates/runtime/src/conversation.rs`
36
-- `refs/claw-code/rust/crates/runtime/src/lane_events.rs`
37
-- `refs/claw-code/rust/crates/runtime/src/green_contract.rs`
38
-- `refs/claw-code/PARITY.md`
39
-- `refs/oh-my-codex/src/verification/verifier.ts`
40
-- `refs/oh-my-codex/src/autoresearch/runtime.ts`
41
-- `refs/oh-my-codex/src/autoresearch/contracts.ts`
42
-- `refs/oh-my-codex/src/hooks/session.ts`
43
-- `.docs/PARITY.md`
44
-- `.docs/audit.txt`
45
-- `.docs/audit_sprints/trunk_sitrep.md`
46
-- `.docs/sprints/sprint23.md`
47
-
48
-## Deliverables
49
-
50
-### 1. Move the TUI onto the runtime-first owner seam
51
-
52
-Sprint 23 made runtime-first real in the CLI and scripted harness. Sprint 24 should make the TUI stop depending on `Agent` by default when it mostly consumes runtime-owned behavior.
53
-
54
-Implementation targets:
55
-
56
-- inventory what the TUI actually needs from its execution owner across:
57
-  - `src/loader/ui/app.py`
58
-  - `src/loader/ui/adapter.py`
59
-  - `src/loader/runtime/public_shell.py`
60
-  - `src/loader/runtime/runtime_handle.py`
61
-  - `src/loader/cli/main.py`
62
-- define or refine a small shell-owner contract for the TUI instead of letting `Agent` be the assumed type
63
-- migrate the TUI launch path to the runtime-first owner where that does not weaken the public compatibility story
64
-- preserve explicit public compatibility where it still matters, but stop using `Agent` as the default UI owner by habit
65
-
66
-The goal is not to remove `Agent` from the codebase. The goal is to make the biggest remaining real product path stop depending on it unnecessarily.
67
-
68
-### 2. Expand verification observations to planned, pending, and stale states
69
-
70
-Sprint 23 captures commands that ran. Sprint 24 should make Loader more honest about verification work that has not yet happened or is no longer fresh.
71
-
72
-Implementation targets:
73
-
74
-- inventory where verification lifecycle facts already exist before or beyond command execution across:
75
-  - `src/loader/runtime/dod.py`
76
-  - `src/loader/runtime/finalization.py`
77
-  - `src/loader/runtime/task_completion.py`
78
-  - `src/loader/runtime/workflow_policy.py`
79
-  - `src/loader/runtime/policy_timeline.py`
80
-  - `src/loader/runtime/workflow_timeline_read_model.py`
81
-- define which observation kinds Loader should represent directly, such as:
82
-  - verification planned
83
-  - verification pending
84
-  - verification stale
85
-  - verification intentionally skipped
86
-  - verification observed as passed or failed
87
-- keep those lifecycle facts inside the canonical policy/accountability story instead of inventing a second peer model
88
-
89
-The goal is to make Loader answer “what proof is still pending?” as directly as it can already answer “what proof did we observe?”
90
-
91
-### 3. Tighten the public facade boundary around runtime-first defaults
92
-
93
-Sprint 23 improved real internal adoption, but the long-term shell boundary is still not quite settled enough.
94
-
95
-Implementation targets:
96
-
97
-- inventory what remains in `src/loader/agent/loop.py` that is:
98
-  - true public compatibility API
99
-  - UI integration glue
100
-  - leftover runtime ownership
101
-- move any still-obviously-runtime ownership lower when that reduces ambiguity
102
-- add or extend boundary tests so future work does not drift back toward `Agent` as the default internal seam after Sprint 24
103
-
104
-The goal is not to shrink files for vanity. The goal is to make the remaining public shell answerable and intentionally narrow.
105
-
106
-### 4. Improve operator visibility for owner path and verification lifecycle
107
-
108
-Once Loader can show runtime-owner provenance and richer verification lifecycle state, the existing product surfaces should make that easier to audit.
109
-
110
-Implementation targets:
111
-
112
-- improve the current surfaces so users can answer:
113
-  - which owner path produced this session?
114
-  - what verification is planned, pending, stale, skipped, or observed?
115
-  - what evidence is still needed versus already satisfied?
116
-- prefer improving:
117
-  - `loader status`
118
-  - `loader session show`
119
-  - `loader workflow show`
120
-  - the TUI status surface
121
-  over inventing a new command unless one is clearly cleaner
122
-
123
-The goal is to make Loader easier to audit live and after the fact, not simply more verbose.
124
-
125
-## Testing strategy
126
-
127
-- unit coverage for:
128
-  - the TUI/runtime shell-owner contract
129
-  - runtime-first TUI launch routing
130
-  - planned/pending/stale verification-observation normalization and projection
131
-- runtime coverage for:
132
-  - the TUI or UI-facing shell path using the runtime-first owner without losing steering, confirmation, or question handling
133
-  - policy/accountability views that now distinguish pending vs observed verification
134
-- regression coverage for:
135
-  - no drift back toward `Agent` as the default owner for internal UI/runtime integrations
136
-  - no duplicate verification lifecycle truth beside the canonical policy timeline
137
-  - no regression in Sprint 23's runtime-owner visibility and execution-time verification observations
138
-
139
-## Definition of done
140
-
141
-- the TUI uses a runtime-first owner seam below `Agent`, or any remaining public-shell dependency is explicit and justified
142
-- Loader preserves richer verification lifecycle state than just “command ran” within the canonical policy/accountability story
143
-- the public facade boundary is narrower or more explicitly defended on purpose
144
-- existing status/session/workflow/TUI surfaces expose the stronger owner-path and verification-lifecycle story without multiplying product commands
145
-- Sprint 23's runtime-first integration and verification-producer gains remain green
146
-
147
-## Explicitly out of scope
148
-
149
-- deleting `Agent` as the public compatibility surface
150
-- full claw-code policy-engine parity
151
-- model-authored verifier narratives as a required runtime dependency
152
-- AST-aware semantic diffs
153
-- a broad visual workflow UI redesign
154
-- multi-agent or team orchestration
155
-
156
-## Audit
157
-
158
-### Status
159
-
160
-- Sprint 24 is complete, and the audit is green. Loader now treats the TUI as another runtime-first internal path instead of a special `Agent` holdout, and the verification lifecycle is explicit across planned, pending, stale, skipped, and observed states inside the canonical policy timeline.
161
-
162
-### Landed
163
-
164
-- the TUI now launches through the runtime-first shell-owner seam below `Agent`: `src/loader/cli/main.py`, `src/loader/ui/app.py`, `src/loader/ui/adapter.py`, and `src/loader/runtime/runtime_handle.py` now build and use a runtime-owned shell owner by default for TUI launch instead of routing through the public `Agent` facade by habit
165
-- verification lifecycle state is now richer and more honest inside the canonical policy/accountability story: `src/loader/runtime/tool_batches.py`, `src/loader/runtime/finalization.py`, `src/loader/runtime/task_completion.py`, and `src/loader/runtime/verification_observations.py` now distinguish verification that is planned after new mutating work, pending because verify has started, stale because fresh mutations invalidated earlier proof, intentionally skipped, or actually observed
166
-- operator surfaces now expose that lifecycle directly instead of flattening it back into generic “missing verification”: `src/loader/runtime/workflow_timeline_read_model.py`, `src/loader/runtime/inspection.py`, and `src/loader/cli/main.py` now show `Verify planned`, `Verify pending`, and `Verify stale` states plus the corresponding recent-verification summaries in `loader status`, `loader session show`, and `loader workflow show`
167
-- the public facade boundary is tighter in practice even though Sprint 24 was not primarily a shell-contraction sprint: by moving the last major product path off default `Agent` ownership, Loader now has a clearer answer to what remains public-shell-only versus what is runtime-first by default
168
-
169
-### Verification
170
-
171
-- `uv run pytest -q` is green: `414 passed`
172
-- `tests/test_cli_runtime_owner.py` now pins runtime-first owner selection for non-TUI CLI, `loader explore`, single-prompt paths, and TUI launch
173
-- `tests/test_tool_batches.py`, `tests/test_finalization.py`, `tests/test_completion_policy.py`, and `tests/test_workflow_runtime.py` now pin the planned -> pending -> stale verification lifecycle through mutating work, verify handoff, and completion/continuation policy
174
-- `tests/test_workflow_timeline_read_model.py` and `tests/test_inspection.py` now pin the operator-facing projection and rendering of planned/pending/stale verification in workflow highlights, status, session, and workflow inspection surfaces
175
-
176
-### Residual debt
177
-
178
-- Loader now uses runtime-first owner seams across CLI, explore, scripted harnesses, and the TUI, but `Agent` plus `runtime.public_shell` still define the outer compatibility boundary instead of a narrower runtime-first external API
179
-- the verification lifecycle is much clearer now, but it is still a bounded runtime-authored model; Loader still does not preserve richer queueing/timestamp semantics for “planned” vs “actively running,” nor does it implement OMX-style deeper verifier reasoning
180
-- Sprint 24 materially improved shell ownership and verification accountability, but Loader still stops short of claw-code's fuller policy engine, richer sandboxing, and the deeper verifier/interview rigor in the refs
.docs/sprints/sprint25.mddeleted
@@ -1,197 +0,0 @@
1
-# Sprint 25: Public Runtime API, Verification Attempts, and Boundary Narrowing
2
-
3
-## Prerequisites
4
-
5
-Sprint 24
6
-
7
-## Goals
8
-
9
-Take the next honest step after Sprint 24: stop treating `Agent` plus `runtime.public_shell` as the only meaningful outer boundary by default, deepen verification from coarse lifecycle labels into explicit attempt semantics, and keep pushing Loader toward a runtime-first shape without pretending it is ready to delete the public compatibility surface.
10
-
11
-Sprint 24 changed the remaining debt in a useful way:
12
-
13
-- CLI, explore, the scripted harness, and the TUI now all use the runtime-first owner seam below `Agent`
14
-- verification lifecycle now distinguishes `planned`, `pending`, `stale`, `skipped`, and observed states inside the canonical policy timeline
15
-- operator surfaces can now explain that lifecycle directly in `status`, `session show`, and `workflow show`
16
-- but `Agent` plus `runtime.public_shell` still remain the outer compatibility shell instead of a narrower runtime-first external API
17
-- verification lifecycle is still a bounded runtime-authored label model, not an explicit record of verification attempts, queueing, start/completion moments, or freshness across retries
18
-- Loader can now say that verification is planned or pending, but it still says less than it should about which verification attempt is active, what superseded it, and what evidence belongs to which attempt
19
-
20
-Sprint 25 should keep using the references as architectural guardrails, not as a feature-copy list.
21
-
22
-The standard remains:
23
-
24
-- use claw-code to sharpen outer runtime boundaries, bootstrap/session ownership, and event accountability
25
-- use OMX to sharpen verifier-attempt visibility, freshness semantics, and auditability around incomplete versus superseded proof
26
-- do not add work just because the refs have it
27
-- do add work when the refs show that Loader is still too compatibility-shell-bound or too coarse in its verification state model
28
-
29
-`audit.txt` remains a guardrail against wrapper-heavy drift and compatibility-by-habit. It is not the factual roadmap.
30
-
31
-The references for this sprint are:
32
-
33
-- `refs/claw-code/rust/crates/runtime/src/bootstrap.rs`
34
-- `refs/claw-code/rust/crates/runtime/src/session_control.rs`
35
-- `refs/claw-code/rust/crates/runtime/src/conversation.rs`
36
-- `refs/claw-code/rust/crates/runtime/src/lane_events.rs`
37
-- `refs/claw-code/rust/crates/runtime/src/green_contract.rs`
38
-- `refs/claw-code/PARITY.md`
39
-- `refs/oh-my-codex/src/verification/verifier.ts`
40
-- `refs/oh-my-codex/src/autoresearch/runtime.ts`
41
-- `refs/oh-my-codex/src/autoresearch/contracts.ts`
42
-- `refs/oh-my-codex/src/hooks/session.ts`
43
-- `.docs/PARITY.md`
44
-- `.docs/audit.txt`
45
-- `.docs/audit_sprints/trunk_sitrep.md`
46
-- `.docs/sprints/sprint24.md`
47
-
48
-## Deliverables
49
-
50
-### 1. Define a narrower public runtime API below `Agent`
51
-
52
-Sprint 24 made runtime-first real across the major product paths. Sprint 25 should make the remaining outer boundary more explicit and less compatibility-shaped.
53
-
54
-Implementation targets:
55
-
56
-- inventory what external callers actually need from:
57
-  - `src/loader/agent/loop.py`
58
-  - `src/loader/runtime/public_shell.py`
59
-  - `src/loader/runtime/runtime_handle.py`
60
-  - `src/loader/runtime/bootstrap.py`
61
-  - `src/loader/cli/main.py`
62
-  - `src/loader/ui/app.py`
63
-- define a smaller runtime-owned API for external integrations that need:
64
-  - runtime bootstrap/session ownership
65
-  - shell execution entrypoints
66
-  - inspection/continuity hooks
67
-  - steering/question handling
68
-- explicitly settle what stays public-compat-only under `Agent` and what should become runtime-first by default
69
-- migrate at least one more real caller or integration seam onto that narrower runtime-first API if the seam is still compatibility-driven by habit
70
-
71
-The goal is not to delete `Agent`. The goal is to make Loader's outer boundary answerable instead of half public facade and half runtime shell by historical accident.
72
-
73
-### 2. Promote verification lifecycle labels into typed attempt semantics
74
-
75
-Sprint 24 gave Loader lifecycle states. Sprint 25 should give those states explicit attempt structure.
76
-
77
-Implementation targets:
78
-
79
-- inventory where Loader already knows more than a plain status label across:
80
-  - `src/loader/runtime/dod.py`
81
-  - `src/loader/runtime/finalization.py`
82
-  - `src/loader/runtime/task_completion.py`
83
-  - `src/loader/runtime/tool_batches.py`
84
-  - `src/loader/runtime/workflow_policy.py`
85
-  - `src/loader/runtime/policy_timeline.py`
86
-  - `src/loader/runtime/verification_observations.py`
87
-- define a typed verification-attempt model that can represent things like:
88
-  - verification planned as a future attempt
89
-  - verification queued/pending as the current active attempt
90
-  - verification started versus completed
91
-  - verification superseded or made stale by later mutating work
92
-  - verification intentionally skipped
93
-  - verification attempt results tied to the right command/evidence bundle
94
-- keep that attempt model inside the canonical policy/accountability story instead of creating a second peer verification log with a separate ownership story
95
-
96
-The goal is to make Loader answer not just "is verification pending?" but "which verification attempt is pending, what superseded the last one, and what evidence belongs to this attempt?"
97
-
98
-### 3. Tighten completion and freshness policy around verification attempts
99
-
100
-Once attempt semantics exist, completion policy should stop flattening them back into generic lifecycle summaries.
101
-
102
-Implementation targets:
103
-
104
-- connect completion and reentry decisions more directly to:
105
-  - active verification attempt identity
106
-  - attempt freshness relative to later mutations
107
-  - attempt result timestamps/order
108
-  - superseded/stale attempt reasoning
109
-  - explicit missing proof versus not-yet-finished proof
110
-- preserve a clear distinction between:
111
-  - proof that is planned but not started
112
-  - proof currently in flight
113
-  - proof that finished and passed
114
-  - proof that finished and failed
115
-  - proof that was once green but is no longer fresh
116
-- ensure the canonical policy story explains why a stop/continue/retry decision was tied to one verification attempt rather than another
117
-
118
-The goal is to make Loader's completion honesty stronger when verification gets interrupted, superseded, retried, or resumed.
119
-
120
-### 4. Improve operator visibility for runtime boundary and verification attempts
121
-
122
-Once Loader has a narrower runtime-first boundary and richer attempt semantics, the existing operator surfaces should make that audit story easier to follow.
123
-
124
-Implementation targets:
125
-
126
-- improve the current surfaces so users can answer:
127
-  - which runtime/public boundary handled this session?
128
-  - what is the current verification attempt?
129
-  - what earlier attempt became stale or was superseded?
130
-  - what evidence is attached to the active versus superseded attempt?
131
-- prefer improving:
132
-  - `loader status`
133
-  - `loader session show`
134
-  - `loader workflow show`
135
-  - the TUI status surface
136
-  over inventing a new command unless a new surface is clearly cleaner
137
-- keep concise rollups first and expose deeper attempt detail only where it materially improves debugging
138
-
139
-The goal is to make Loader's runtime ownership and verification story easier to audit after the fact, not simply more verbose.
140
-
141
-## Testing strategy
142
-
143
-- unit coverage for:
144
-  - the narrower runtime-first public API contract
145
-  - verification-attempt normalization, persistence, and supersession
146
-  - completion freshness rules that now depend on attempt semantics
147
-- runtime coverage for:
148
-  - a verify handoff that records a planned attempt before active verification starts
149
-  - a pending attempt that later completes or becomes stale after fresh mutating work
150
-  - runtime-first callers that no longer need `Agent` by default
151
-- regression coverage for:
152
-  - no drift back toward `Agent` as the assumed external owner when a runtime-first API exists
153
-  - no duplicate verification-attempt truth beside the canonical policy timeline
154
-  - no regression in Sprint 24's lifecycle visibility and runtime-first TUI ownership
155
-
156
-## Definition of done
157
-
158
-- Loader has a narrower and more explicit runtime-first external API below `Agent`, or any remaining `Agent` ownership is clearly justified as public compatibility
159
-- verification lifecycle is represented with explicit attempt semantics, not only coarse status labels
160
-- completion and freshness policy can explain which verification attempt is active, stale, superseded, or satisfied
161
-- existing status/session/workflow/TUI surfaces expose the stronger boundary and attempt story without multiplying product commands
162
-- Sprint 24's runtime-first ownership and verification lifecycle gains remain green
163
-
164
-## Explicitly out of scope
165
-
166
-- deleting `Agent` as the public compatibility surface
167
-- full claw-code policy-engine parity
168
-- model-authored verifier narratives as a required runtime dependency
169
-- AST-aware semantic diffs
170
-- a broad visual workflow UI redesign
171
-- multi-agent or team orchestration
172
-
173
-## Audit
174
-
175
-### Status
176
-
177
-- Sprint 25 is complete, and the audit is green. Loader now has a runtime-owned outer API contract below `Agent`, explicit verification-attempt identity threaded through completion/freshness policy, and operator surfaces that can explain both the runtime boundary and the active versus superseded verification attempts.
178
-
179
-### Landed
180
-
181
-- Loader now has a narrower runtime-owned shell API below `Agent`: `src/loader/runtime/runtime_api.py`, `src/loader/cli/main.py`, and `src/loader/ui/app.py` now share one runtime-owned boundary for shell-owner construction instead of treating CLI and TUI ownership as adjacent but separate contracts
182
-- verification lifecycle is now represented with explicit attempt identity instead of only lifecycle labels: `src/loader/runtime/dod.py`, `src/loader/runtime/verification_observations.py`, `src/loader/runtime/tool_batches.py`, `src/loader/runtime/finalization.py`, `src/loader/runtime/task_completion.py`, and `src/loader/runtime/completion_policy.py` now preserve which attempt is planned, pending, stale, observed, skipped, or superseded
183
-- completion/freshness policy now explains itself in attempt-aware terms: when Loader continues, finalizes, or rejects stale proof, it can say which attempt was still active, which one was superseded, and why one attempt did or did not satisfy the stop condition
184
-- operator surfaces now expose the stronger boundary and attempt story without adding new commands: `src/loader/runtime/inspection.py`, `src/loader/cli/main.py`, `src/loader/runtime/owner_metadata.py`, `src/loader/ui/status_helpers.py`, `src/loader/ui/widgets/status_line.py`, and `src/loader/ui/app.py` now show runtime-boundary summaries plus attempt-aware verification state in `loader status`, `loader session show`, `loader workflow show`, and the TUI status line
185
-
186
-### Verification
187
-
188
-- `uv run pytest -q` is green: `416 passed`
189
-- `tests/test_cli_runtime_owner.py` now pins the runtime-owned shell API boundary for CLI and TUI ownership
190
-- `tests/test_completion_policy.py` and `tests/test_turn_completion.py` now pin attempt-aware completion/freshness reasoning, including superseded-attempt summaries
191
-- `tests/test_inspection.py` and `tests/test_status_surfaces.py` now pin runtime-boundary summaries, verification-state summaries, and TUI owner/attempt status rendering
192
-
193
-### Residual debt
194
-
195
-- `src/loader/runtime/runtime_api.py` narrows the boundary materially, but `Agent` plus `runtime.public_shell` still remain the documented public compatibility layer instead of Loader exposing a fully runtime-first external API to all callers
196
-- verification attempt identity is now explicit, but Loader still does not preserve richer attempt timing and queue semantics such as first-planned versus actively-started timestamps, or deeper multi-command attempt bundles
197
-- the operator surfaces now explain active versus superseded attempts clearly, but they are still concise rollups rather than a deeper attempt-history debugger or OMX-style verifier narrative
.docs/sprints/sprint26.mddeleted
@@ -1,165 +0,0 @@
1
-# Sprint 26: Verification Attempt Timelines and Public Facade Delamination
2
-
3
-## Prerequisites
4
-
5
-Sprint 25
6
-
7
-## Goals
8
-
9
-Take the next honest step after Sprint 25: preserve more than attempt labels in the verifier lifecycle, make the runtime-owned API more authoritative as Loader's practical outer boundary, and keep shrinking the gap between "runtime-first internally" and "runtime-first by default" without pretending the public compatibility surface is ready to disappear.
10
-
11
-Sprint 25 changed the remaining debt in a useful way:
12
-
13
-- Loader now has a runtime-owned shell API boundary in `src/loader/runtime/runtime_api.py`
14
-- verification lifecycle now preserves explicit attempt identity across planned, pending, stale, skipped, and observed states
15
-- completion policy can explain active versus superseded proof in attempt-aware terms
16
-- operator surfaces can now show runtime-boundary summaries and attempt-aware verification state in CLI and TUI
17
-- but Loader still says more about attempt labels than about attempt timeline facts such as when an attempt was first planned, when it actually started, when it completed, and what exactly superseded it
18
-- and `Agent` plus `runtime.public_shell` still remain the documented public compatibility shell even though the runtime-owned boundary is much stronger now
19
-
20
-Sprint 26 should keep using the references as architectural guardrails, not as a feature-copy list.
21
-
22
-The standard remains:
23
-
24
-- use claw-code to sharpen runtime/bootstrap/session ownership, policy/accountability event structure, and explicit outer runtime boundaries
25
-- use OMX to sharpen verifier attempt visibility, freshness reasoning, and the audit trail around incomplete, stale, superseded, or resumed proof
26
-- do not add work just because the refs have it
27
-- do add work when the refs show that Loader is still too compatibility-shell-bound or too coarse in its verification-attempt history
28
-
29
-`audit.txt` remains a guardrail against wrapper-heavy drift and compatibility-by-habit. It is not the factual roadmap.
30
-
31
-The references for this sprint are:
32
-
33
-- `refs/claw-code/rust/crates/runtime/src/bootstrap.rs`
34
-- `refs/claw-code/rust/crates/runtime/src/session_control.rs`
35
-- `refs/claw-code/rust/crates/runtime/src/conversation.rs`
36
-- `refs/claw-code/rust/crates/runtime/src/lane_events.rs`
37
-- `refs/claw-code/rust/crates/runtime/src/green_contract.rs`
38
-- `refs/claw-code/PARITY.md`
39
-- `refs/oh-my-codex/src/verification/verifier.ts`
40
-- `refs/oh-my-codex/src/autoresearch/runtime.ts`
41
-- `refs/oh-my-codex/src/autoresearch/contracts.ts`
42
-- `refs/oh-my-codex/src/hooks/session.ts`
43
-- `.docs/PARITY.md`
44
-- `.docs/audit.txt`
45
-- `.docs/audit_sprints/trunk_sitrep.md`
46
-- `.docs/sprints/sprint25.md`
47
-
48
-## Deliverables
49
-
50
-### 1. Promote verification attempts into richer timeline records
51
-
52
-Sprint 25 gave Loader attempt identity. Sprint 26 should make those attempts feel like real runtime history, not only labeled states.
53
-
54
-Implementation targets:
55
-
56
-- inventory where Loader already knows attempt-order or lifecycle facts across:
57
-  - `src/loader/runtime/dod.py`
58
-  - `src/loader/runtime/finalization.py`
59
-  - `src/loader/runtime/tool_batches.py`
60
-  - `src/loader/runtime/verification_observations.py`
61
-  - `src/loader/runtime/workflow_policy.py`
62
-  - `src/loader/runtime/policy_timeline.py`
63
-  - `src/loader/runtime/workflow_timeline_read_model.py`
64
-- define a typed verification-attempt timeline model that can preserve things like:
65
-  - when an attempt was first planned
66
-  - when it actually became active
67
-  - when it completed or was skipped
68
-  - when and why it became stale or was superseded
69
-  - which commands/evidence bundle belonged to that attempt
70
-- keep that richer attempt model inside the canonical policy/accountability story instead of creating a side log
71
-
72
-The goal is to make Loader answer not only "which attempt is active?" but "what happened to attempt 2, when did attempt 3 actually start, and what proof belongs to each?"
73
-
74
-### 2. Tighten completion/freshness policy around attempt history
75
-
76
-Once attempt history is richer, completion policy should stop flattening that history into one summary string too early.
77
-
78
-Implementation targets:
79
-
80
-- connect completion and reentry decisions more directly to:
81
-  - attempt planning versus active-start moments
82
-  - completed versus superseded attempt ordering
83
-  - freshness relative to later mutating work
84
-  - explicit gaps between planned proof, running proof, and finished proof
85
-- preserve a clear distinction between:
86
-  - proof that is scheduled but has not started
87
-  - proof that started but has not completed
88
-  - proof that completed and passed
89
-  - proof that completed and failed
90
-  - proof that was once green but is now stale because a later attempt superseded it
91
-- ensure the canonical policy story explains why Loader trusted, rejected, or waited on one attempt instead of another
92
-
93
-The goal is to make Loader's stop/continue/retry logic more auditable when verification spans multiple retries or resumes.
94
-
95
-### 3. Narrow the public facade below `Agent` again
96
-
97
-Sprint 25 added a runtime-owned shell API. Sprint 26 should make that boundary more authoritative in practice.
98
-
99
-Implementation targets:
100
-
101
-- inventory what still materially depends on:
102
-  - `src/loader/agent/loop.py`
103
-  - `src/loader/runtime/public_shell.py`
104
-  - `src/loader/runtime/runtime_api.py`
105
-  - `src/loader/runtime/runtime_handle.py`
106
-  - `src/loader/cli/main.py`
107
-  - `src/loader/ui/app.py`
108
-- migrate at least one more real caller or ownership seam onto the runtime-owned API if it is still compatibility-shaped by habit
109
-- make remaining `Agent`-owned behavior explicitly compatibility-facing rather than ambiguous runtime glue
110
-- add or extend boundary tests so future work does not drift back toward `Agent` as the assumed outer runtime owner
111
-
112
-The goal is not to delete `Agent`. The goal is to make the runtime-owned API the default answer more often, and the compatibility facade the deliberate exception.
113
-
114
-### 4. Improve operator visibility for attempt history and public boundary
115
-
116
-Once attempt records get richer and the outer boundary gets cleaner, the existing surfaces should tell that story with less reconstruction.
117
-
118
-Implementation targets:
119
-
120
-- improve the current surfaces so users can answer:
121
-  - which runtime/public boundary handled this session?
122
-  - which verification attempt is active right now?
123
-  - when did it become planned, active, stale, or superseded?
124
-  - what evidence belongs to the active attempt versus an older one?
125
-- prefer improving:
126
-  - `loader status`
127
-  - `loader session show`
128
-  - `loader workflow show`
129
-  - the TUI status surface
130
-  over inventing a new command unless one is clearly cleaner
131
-- keep concise rollups first, and expose deeper attempt history only where it materially improves debugging
132
-
133
-The goal is to make Loader's runtime boundary and verifier history easier to audit after the fact, not simply more verbose.
134
-
135
-## Testing strategy
136
-
137
-- unit coverage for:
138
-  - richer verification-attempt timeline normalization and persistence
139
-  - completion/freshness decisions that now depend on attempt history
140
-  - the narrower runtime-owned API boundary and any new caller migration
141
-- runtime coverage for:
142
-  - a planned attempt that later becomes active and then completes
143
-  - a completed attempt that becomes stale after new mutating work
144
-  - a resumed session whose active attempt history and runtime-owner boundary remain coherent
145
-- regression coverage for:
146
-  - no duplicate verification-attempt truth beside the canonical policy timeline
147
-  - no drift back toward `Agent` as the assumed outer runtime owner when a runtime-owned API exists
148
-  - no regression in Sprint 25's attempt-aware completion/freshness and operator surfaces
149
-
150
-## Definition of done
151
-
152
-- Loader preserves richer verification-attempt history than plain attempt labels inside the canonical policy/accountability story
153
-- completion and freshness policy can explain which attempt was planned, active, completed, stale, or superseded and why
154
-- the runtime-owned API below `Agent` is more authoritative in at least one more real integration seam, or remaining `Agent` ownership is explicitly justified as compatibility-only
155
-- existing status/session/workflow/TUI surfaces expose the stronger boundary and attempt-history story without multiplying product commands
156
-- Sprint 25's runtime-boundary and attempt-aware verification gains remain green
157
-
158
-## Explicitly out of scope
159
-
160
-- deleting `Agent` as the public compatibility surface
161
-- full claw-code policy-engine parity
162
-- model-authored verifier narratives as a required runtime dependency
163
-- AST-aware semantic diffs
164
-- a broad visual workflow UI redesign
165
-- multi-agent or team orchestration
.gitignoremodified
@@ -58,3 +58,4 @@ node_modules/
5858
 CLAUDE.md
5959
 refs/
6060
 .loader/
61
+.docs/