Untrack .docs/ from remote, keep local
- SHA
e765d3774ac3a66a5f1c17e6bb9e7c5afe264fbe- Parents
-
346a24c - Tree
8bc3666
e765d37
e765d3774ac3a66a5f1c17e6bb9e7c5afe264fbe346a24c
8bc3666.docs/PARITY.mddeleted@@ -1,250 +0,0 @@ | ||
| 1 | -# Loader Runtime Parity Checkpoint | |
| 2 | - | |
| 3 | -Date: 2026-04-09 | |
| 4 | - | |
| 5 | -Deterministic baseline: `uv run pytest -q` → `416 passed` | |
| 6 | - | |
| 7 | -This file tracks the current deterministic runtime baseline for Loader. It stays intentionally narrow and operational: what the runtime can do today, what remains weak, and what scenarios we measure with repeatable tests. | |
| 8 | - | |
| 9 | -## Supported today | |
| 10 | - | |
| 11 | -- streamed text-only replies | |
| 12 | -- native-tool round trips for `read`, `write`, `edit`, `patch`, `glob`, `grep`, `bash`, `git`, `TodoWrite`, `AskUserQuestion`, `project_memory_*`, and `notepad_*` | |
| 13 | -- explicit permission modes: `read-only`, `workspace-write`, `danger-full-access`, `prompt`, and `allow` | |
| 14 | -- tool lifecycle hooks in `pre_tool_use` → permission check → execute → `post_tool_use` / `post_tool_use_failure` order | |
| 15 | -- rule-based permission policy with workspace-local `allow` / `deny` / `ask` rules from `.loader/permission-rules.json` | |
| 16 | -- policy-backed prompting for destructive tool use, with approval context that includes mode, requirement, and matched rule information | |
| 17 | -- `loader permissions show` for normalized rule inspection, source-path visibility, and prompt-state inspection without opening JSON files by hand | |
| 18 | -- `loader permissions check` for dry-running one hypothetical tool request against the active policy, including required mode, normalized input summary, and matched-rule reasoning | |
| 19 | -- raw JSON fallback when the model emits tool syntax in plain text | |
| 20 | -- raw JSON fallback now routes through the runtime parser plus the active registry, including modern workflow tools such as `TodoWrite` and `AskUserQuestion` | |
| 21 | -- persisted definition-of-done state under `.loader/dod/` | |
| 22 | -- persisted clarify briefs under `.loader/briefs/` | |
| 23 | -- persisted implementation and verification plans under `.loader/plans/` | |
| 24 | -- persisted conversation sessions under `.loader/sessions/` plus active session state under `.loader/state/` | |
| 25 | -- persisted permission policy metadata alongside session state, so `loader status` / `loader session list` / `loader session show` can explain the effective policy that ran | |
| 26 | -- `loader --resume` and `loader --resume <session-id>` restore persisted session state | |
| 27 | -- durable project memory in `.loader/project-memory.json` and working notes in `.loader/notepad.md` | |
| 28 | -- native memory tools for `project_memory_*` and `notepad_*` | |
| 29 | -- scored workflow routing across `clarify` → `plan` → `execute` → `verify`, with route scores, runner-up pressure, unresolved-question carry-forward, and scheduled-next-mode hints | |
| 30 | -- typed workflow-signal extraction with persisted `signal_summary` context for route pressure, recent workflow history, and unresolved questions | |
| 31 | -- mode-specific system prompts for clarify, plan, execute, and verify | |
| 32 | -- intent-aware bounded clarify follow-through with explicit focus slots and persisted unresolved-question carry-forward | |
| 33 | -- pressure-pass clarify reviews with explicit readiness gates, challenged-assumption/tradeoff/example pressure kinds, and persisted clarify-pressure metadata | |
| 34 | -- codebase-backed clarify grounding with workspace evidence, repo facts, slot-aware evidence selection, pressure-aware evidence selection, and grounded brief hints for persisted clarify artifacts | |
| 35 | -- semantic artifact invalidation that can choose targeted plan refresh, clarify reentry, or full re-plan before execution continues | |
| 36 | -- structured workflow drift evidence covering confirmed touchpoints, inferred touchpoints, acceptance anchors, contradicted assumptions, verification contradictions, and task-boundary drift | |
| 37 | -- persisted workflow ledger state for assumptions, contradicted assumptions, acceptance anchors, and open/closed decision boundaries, threaded through clarify, plan, recovery, and inspection | |
| 38 | -- persisted workflow timeline entries for routes, handoffs, reentries, clarify outcomes, plan refreshes, and verify skips | |
| 39 | -- explicit verify/fix loops for mutating tasks, with a bounded retry budget | |
| 40 | -- verify/fix retries return to execute mode without re-triggering clarify or plan | |
| 41 | -- task-size-aware verification command derivation based on actual tool history | |
| 42 | -- verification command loading from persisted `verification.md` artifacts when present | |
| 43 | -- heuristic completion nudges only for non-mutating tasks; mutating tasks now complete through the DoD gate | |
| 44 | -- typed `TurnSummary` output for completed turns, including trace events and tool-result messages | |
| 45 | -- normalized per-turn usage plus cumulative session usage in `TurnSummary` | |
| 46 | -- automatic transcript compaction with priority-aware line compression and continuation instructions | |
| 47 | -- unified tool execution for native and extracted tool calls through `runtime.executor.ToolExecutor` | |
| 48 | -- typed tool-result messages backed by `Message.tool_results` | |
| 49 | -- typed prompt construction in `runtime.prompting`, with explicit dynamic sections, a static/dynamic boundary marker, and persisted prompt-format / prompt-section metadata in session state | |
| 50 | -- persisted prompt snapshot history in session state so prompt-contract changes survive resume and later inspection | |
| 51 | -- validated turn-state transitions (`prepare`, `assistant`, `repair`, `tools`, `critique`, `completion`, `finalize`) with typed transition metadata, persisted session state, and emitted runtime events | |
| 52 | -- typed workflow-decision metadata persisted in session/runtime state, including reason codes, summaries, decision kind, workflow scores, and scheduled-next-mode hints | |
| 53 | -- `loader doctor` for backend, capability, workspace, command, state, and permission health checks outside the main runtime loop | |
| 54 | -- `loader status` plus `loader session list/show/resume` for inspecting persisted runtime state without invoking the LLM | |
| 55 | -- `loader prompt show [task]` for previewing the current prompt contract, workflow mode, permission mode, dynamic sections, and prompt body without a live model request | |
| 56 | -- `loader prompt diff [session-id]` for comparing persisted prompt contracts, with concise summaries by default and unified diffs on demand | |
| 57 | -- `loader workflow show [session-id]` with `--mode`, `--kind`, and `--limit` filters plus operator-focused workflow highlights, recent timeline snippets in `loader session show`, and `--diff` / `--full-diff` artifact comparison for persisted workflow artifacts | |
| 58 | -- `loader explore <prompt>` as a read-only lookup lane with its own prompt, constrained registry, persisted bounded continuity under `.loader/state/explore.json`, `--fresh` to ignore prior explore history when needed, and `--status` / `--reset` for continuity inspection and reset | |
| 59 | -- `RuntimeContext` is now the primary runtime seam for workflow state, turn phases, response repair, no-tool completion, response routing, turn looping, finalization, workflow lanes, and workflow recovery; the older `RuntimeLegacyServices` shim has been removed | |
| 60 | -- shared runtime bootstrap through `runtime.bootstrap.build_runtime_context(...)` / `sync_runtime_context(...)`, with both conversation and explore runtimes now starting from an explicit `RuntimeBootstrapView` rather than a raw `Agent` object by default | |
| 61 | -- runtime-owned safeguard and reasoning helpers now have canonical homes under `src/loader/runtime/`; `src/loader/agent/safeguards.py` and `src/loader/agent/reasoning.py` are compatibility-export layers rather than the primary implementations | |
| 62 | -- the public launcher contract now owns conversational routing, decomposition entry routing, direct turn routing, and explore launch through `src/loader/runtime/launcher.py`, which leaves a smaller and more honest `src/loader/agent/loop.py` | |
| 63 | -- runtime-owned prompt/session shell helpers now live in `src/loader/runtime/public_shell.py`, including prompt construction, prompt snapshot persistence, session creation, session restore, and few-shot example selection | |
| 64 | -- runtime-owned public-shell helpers now also own prompt-mode resolution, workflow-mode prompt invalidation, and owner-bound system/few-shot construction, leaving `src/loader/agent/loop.py` as an explicitly documented public facade instead of an ambiguous leftover shell | |
| 65 | -- compatibility exports are now explicitly bounded by direct tests, and internal runtime code is guarded against drifting back to `agent/reasoning.py` / `agent/safeguards.py` imports | |
| 66 | -- CLI and TUI status surfaces for model, capability profile, mode, workflow mode, workflow reason, last transition summary, permission mode, explicit turn phase, prompt format/sections, DoD phase, pending items, last verification result, and active session id | |
| 67 | -- CLI status now also surfaces recent explore activity, including bounded explore turn/message counts, the last explore history mode, and the last explore query | |
| 68 | -- CLI and TUI workflow-mode visibility plus artifact notifications | |
| 69 | -- CLI and TUI permission-mode visibility with color-coded status | |
| 70 | -- workspace-bound file operations with canonicalized boundary checks, binary detection, size limits, and structured patch metadata | |
| 71 | -- shell mutability classification plus structured truncation and stderr/exit-code metadata | |
| 72 | -- richer structured `AskUserQuestion` prompts with titles, context, options, and optional freeform responses | |
| 73 | -- honest repair/completion behavior for no-tool turns: empty assistant replies get a single explicit retry, and Loader no longer relies on synthetic prefill, fake-tool scolding reroutes, or self-critique puppeting for plain-text answers | |
| 74 | -- raw-text tool recovery now also fails honestly once its budget is exhausted instead of adding a synthetic follow-up invitation | |
| 75 | -- dedicated assistant-response routing in `runtime.response_routing`, so final-answer, tool-batch, and no-tool completion dispatch no longer live inline inside `turn_iteration.py` | |
| 76 | -- assistant-turn request handling now lives in `runtime.assistant_turns`, clarify/plan lane execution now lives in `runtime.workflow_lanes`, tool-batch execution/recovery now lives in `runtime.tool_batches`, DoD/finalization logic now lives in `runtime.finalization`, workflow-state/session mutation lives in `runtime.workflow_state`, and the main loop now runs through `runtime.turn_preparation`, `runtime.turn_preamble`, `runtime.turn_iteration`, and `runtime.turn_loop` instead of accumulating further inside `conversation.py` | |
| 77 | -- `src/loader/runtime/conversation.py` now acts as a compact coordinator over dedicated runtime controllers rather than owning a monolithic turn loop | |
| 78 | -- persisted completion-decision summaries plus bounded completion traces are now available in session/runtime state, status/session inspection, and resume flows | |
| 79 | -- typed follow-through evidence now backs non-mutating completion checks, with explicit required/missing evidence and honest terminal failure once the continuation budget is exhausted without enough proof of completion | |
| 80 | -- the workflow timeline is now the canonical policy/accountability artifact for completion decisions too, with live and persisted completion traces projected from that timeline as a compact read model rather than maintained as a separate peer runtime contract | |
| 81 | -- runtime-owned public-shell helpers now cover prompt/session factories, session install/load helpers, steering mailbox behavior, sync/async event wrapping, and capability-refresh decision helpers | |
| 82 | -- `loader workflow show --policy` now filters directly to unified repair / verify-skip / completion accountability events, and `loader session show` now includes a `Policy Timeline` preview so operators can inspect the stop/continue/retry story without stitching together separate surfaces by hand | |
| 83 | -- `loader status` and `loader session show` now also surface a latest-policy rollup sourced from the canonical workflow timeline so operators can see the last important stop/continue/retry decision at a glance | |
| 84 | -- typed evidence provenance now flows through canonical workflow events and projected completion traces, so Loader can distinguish supporting, missing, and contradictory evidence instead of flattening everything into one summary string too early | |
| 85 | -- `loader status`, `loader session show`, and `loader workflow show` now surface concise policy-evidence rollups derived from the canonical workflow timeline, including what evidence was still needed and what evidence satisfied the latest stop/continue decision | |
| 86 | -- typed verification observations now flow through finalization, canonical policy events, and projected completion traces, so Loader preserves what verification was actually observed closer to execution instead of reconstructing that story only from later DoD/session summaries | |
| 87 | -- completion stop/continue policy now cites observed verification facts when available, and exhausted continuation failures preserve those observed verification results in the canonical accountability story instead of only reporting generic missing evidence | |
| 88 | -- `loader status`, `loader session show`, and `loader workflow show` now surface observed verification directly, and `Recent Verification` is unified from canonical policy observations first with DoD evidence only as a fallback | |
| 89 | -- Loader now has a runtime-owned internal execution handle in `src/loader/runtime/runtime_handle.py`, and runtime-oriented launcher/bootstrap/public-shell tests no longer need to treat `Agent` as the only valid runtime owner | |
| 90 | -- non-TUI CLI paths, `loader explore`, and the scripted runtime harness now default to the runtime-first owner seam below `Agent`, so Loader uses `RuntimeHandle` in real internal integrations instead of reserving it for tests | |
| 91 | -- the TUI now also launches through that runtime-first shell-owner seam below `Agent`, so the last major product path is no longer using the public facade by habit | |
| 92 | -- `src/loader/runtime/runtime_api.py` now defines a narrower runtime-owned shell API used by CLI and TUI owner construction, with `Agent` left as the documented public compatibility facade instead of the default internal contract | |
| 93 | -- persisted session state now records the active runtime-owner path, and `loader status`, `loader session list/show`, and `loader workflow show` surface that runtime-owner provenance directly | |
| 94 | -- verification now emits per-command `verify_observation` events into the canonical workflow timeline while the verification loop is running, and workflow/policy read models project those entries as first-class accountability state | |
| 95 | -- verification lifecycle now distinguishes planned, pending, stale, skipped, and observed states inside the canonical workflow timeline, and completion policy plus inspection surfaces preserve those states directly instead of flattening them into generic missing-proof summaries | |
| 96 | -- verification lifecycle now also carries explicit attempt identity across planned, pending, stale, skipped, and observed states, including supersession labels like `attempt 1 -> attempt 2` inside completion policy and inspection surfaces | |
| 97 | -- `loader status`, `loader session show`, `loader workflow show`, and the TUI status line now surface runtime-boundary summaries plus attempt-aware verification state/DoD summaries instead of only raw owner or lifecycle labels | |
| 98 | - | |
| 99 | -## Known weak spots | |
| 100 | - | |
| 101 | -- the public runtime boundary is now explicit and runtime-shaped, and Loader now also has real runtime-first internal integrations through `RuntimeHandle` plus the narrower `runtime.runtime_api` contract across CLI, explore, the scripted harness, and TUI launch, but `Agent` plus `runtime.public_shell` still supply the outer compatibility boundary instead of a fully runtime-first external API | |
| 102 | -- verification attempt identity is now explicit and attempt-aware completion can explain active versus superseded proof, but Loader still does not preserve richer queue/start/finish timing semantics, deeper multi-command attempt bundles, or OMX-style verifier reasoning | |
| 103 | -- [`src/loader/agent/loop.py`](../src/loader/agent/loop.py) is down to 267 lines and much closer to a public facade than the pre-Sprint-15 shell, but it still owns the compatibility shell and remaining launcher/UI glue instead of disappearing entirely | |
| 104 | -- [`src/loader/agent/reasoning.py`](../src/loader/agent/reasoning.py) and [`src/loader/agent/safeguards.py`](../src/loader/agent/safeguards.py) are now compatibility shims rather than primary implementations, but they still remain as export layers until Loader narrows its external compatibility surface further | |
| 105 | -- [`src/loader/runtime/tool_batches.py`](../src/loader/runtime/tool_batches.py) and parts of [`src/loader/runtime/workflow_lanes.py`](../src/loader/runtime/workflow_lanes.py) are narrower and more directly tested than before, but they still carry more heuristic policy than the tightest reference seams in `refs/claw-code` | |
| 106 | -- the workflow policy now consumes typed signals, but signal extraction is still heuristic and hand-tuned; Loader does not yet implement OMX's deeper ambiguity analysis, richer pressure-pass discipline, or branch-specific policy depth | |
| 107 | -- clarify is now intent-aware, pressure-aware, and codebase-grounded, but it is still much shallower than OMX's deep-interview behavior and does not adapt its budget or questioning style by task class | |
| 108 | -- plan freshness now handles broader semantic invalidation with typed evidence, but it is still lightweight and runtime-authored; Loader does not yet reason deeply over richer artifact metadata, contradicting verification evidence, or larger task reframes | |
| 109 | -- the workflow ledger is now explicit and persisted, but it is still a pragmatic text-first contract rather than deeper symbolic task/state reasoning with stronger provenance | |
| 110 | -- plan mode is still a single-pass artifact generator, not a Planner/Architect/Critic consensus loop | |
| 111 | -- DoD acceptance criteria and pending items are stronger than Sprint 02, but todo progress is still lightly structured compared with claw-code's richer workflow state | |
| 112 | -- evidence summaries are deterministic runtime summaries of captured output, not model-written verification narratives | |
| 113 | -- session compaction summaries are heuristic runtime summaries, not model-assisted continuity artifacts | |
| 114 | -- project-memory capture on finalized DoD evidence is still lightweight and command-summary oriented, not semantically curated memory extraction | |
| 115 | -- rule syntax is intentionally narrow and workspace-local; Loader still does not have claw-code's richer rule model or broader prompt/allow operator surface | |
| 116 | -- policy state is inspectable in doctor/status/session surfaces and dry-runnable through `loader permissions show/check`, but there is not yet a richer UX for editing, previewing multiple candidate rule sets, or temporarily overriding rules from the product surface | |
| 117 | -- follow-through evidence is now explicit and persisted, but it is still heuristic and runtime-authored rather than backed by a deeper verifier/model contract or richer artifact-derived proof model | |
| 118 | -- prompt assembly is now typed, previewable, and diffable across persisted sessions, but Loader still does not compare multiple candidate prompt contracts before execution or enforce a richer prompt-contract parity harness beyond the current unit and inspection coverage | |
| 119 | -- workflow history is now filterable, ledger-backed, and diffable for persisted artifacts, but it is still text-first; Loader still does not offer semantic/AST-aware artifact diffs, richer artifact preview UX, or a visual workflow trace | |
| 120 | -- shell safety is still heuristic and command-based; Loader does not yet have a richer shell sandbox or argument-aware mutability model | |
| 121 | -- explore mode now has lightweight transcript continuity plus `loader explore --status` / `--reset`, but it is still a narrow read-only lookup lane rather than a richer interactive inspection workflow with deeper repo navigation affordances | |
| 122 | -- the read-only `git` helper is intentionally narrow compared with claw-code and OMX's broader repo/product surfaces, and the `patch` tool still stops short of AST/LSP-aware editing | |
| 123 | - | |
| 124 | -## Out of scope in the current baseline | |
| 125 | - | |
| 126 | -- richer permission-rule UX / per-command allowlists | |
| 127 | -- multi-agent / team orchestration | |
| 128 | - | |
| 129 | -## Deterministic parity scenarios | |
| 130 | - | |
| 131 | -The auditable manifest lives at [`tests/fixtures/runtime_parity_manifest.json`](../tests/fixtures/runtime_parity_manifest.json) and is exercised by [`tests/test_runtime_harness.py`](../tests/test_runtime_harness.py). Sprint 04 adds focused workflow integration coverage in [`tests/test_workflow_runtime.py`](../tests/test_workflow_runtime.py) and artifact/router unit coverage in [`tests/test_workflow.py`](../tests/test_workflow.py). Sprint 06 adds inspection/explore coverage in [`tests/test_inspection.py`](../tests/test_inspection.py), [`tests/test_explore_runtime.py`](../tests/test_explore_runtime.py), and [`tests/test_expanded_tools.py`](../tests/test_expanded_tools.py). Sprint 10 extends that workflow coverage in [`tests/test_workflow_policy.py`](../tests/test_workflow_policy.py), [`tests/test_workflow_runtime.py`](../tests/test_workflow_runtime.py), and [`tests/test_inspection.py`](../tests/test_inspection.py) for scored routing, clarify-budget behavior, plan refresh, and workflow timeline inspection. Sprint 11 adds [`tests/test_workflow_signals.py`](../tests/test_workflow_signals.py), [`tests/test_clarify_strategy.py`](../tests/test_clarify_strategy.py), [`tests/test_artifact_invalidation.py`](../tests/test_artifact_invalidation.py), and expanded inspection/runtime coverage for signal summaries, intent-aware clarify, semantic replan recovery, and workflow timeline filtering/highlights. Sprint 12 adds [`tests/test_clarify_grounding.py`](../tests/test_clarify_grounding.py), [`tests/test_turn_preparation.py`](../tests/test_turn_preparation.py), [`tests/test_turn_completion.py`](../tests/test_turn_completion.py), [`tests/test_turn_iteration.py`](../tests/test_turn_iteration.py), [`tests/test_turn_preamble.py`](../tests/test_turn_preamble.py), [`tests/test_workflow_state.py`](../tests/test_workflow_state.py), and [`tests/test_turn_loop.py`](../tests/test_turn_loop.py) for grounded clarify, structured recovery evidence, and the controllerized turn runtime. Sprint 13 adds [`tests/test_runtime_repair_flows.py`](../tests/test_runtime_repair_flows.py), [`tests/test_response_routing.py`](../tests/test_response_routing.py), [`tests/test_workflow_ledger.py`](../tests/test_workflow_ledger.py), and expanded [`tests/test_session_state.py`](../tests/test_session_state.py) / [`tests/test_inspection.py`](../tests/test_inspection.py) coverage for honest repair behavior, dedicated response routing, persisted semantic ledger state, prompt snapshot history, and prompt/artifact diff inspection. Sprint 15 adds [`tests/test_runtime_bootstrap.py`](../tests/test_runtime_bootstrap.py), [`tests/test_safeguard_services.py`](../tests/test_safeguard_services.py), [`tests/test_reasoning_compat.py`](../tests/test_reasoning_compat.py), and updated [`tests/test_runtime_context.py`](../tests/test_runtime_context.py) coverage for the shared bootstrap seam plus the runtime-owned safeguards/reasoning compatibility contract. Sprint 16 adds [`tests/test_runtime_launcher.py`](../tests/test_runtime_launcher.py), [`tests/test_chat_lane.py`](../tests/test_chat_lane.py), [`tests/test_decomposition_lane.py`](../tests/test_decomposition_lane.py), [`tests/test_compat_boundaries.py`](../tests/test_compat_boundaries.py), and expanded [`tests/test_explore_runtime.py`](../tests/test_explore_runtime.py) / [`tests/test_inspection.py`](../tests/test_inspection.py) coverage for the public launcher contract, compatibility boundaries, and persisted explore continuity. Sprint 17 adds [`tests/test_runtime_public_shell.py`](../tests/test_runtime_public_shell.py), expanded [`tests/test_runtime_bootstrap.py`](../tests/test_runtime_bootstrap.py) / [`tests/test_runtime_launcher.py`](../tests/test_runtime_launcher.py) / [`tests/test_runtime_context.py`](../tests/test_runtime_context.py) coverage for the explicit bootstrap view, expanded [`tests/test_repair.py`](../tests/test_repair.py) / [`tests/test_runtime_repair_flows.py`](../tests/test_runtime_repair_flows.py) coverage for honest raw-text recovery failure, and expanded [`tests/test_explore_runtime.py`](../tests/test_explore_runtime.py) / [`tests/test_inspection.py`](../tests/test_inspection.py) coverage for explore continuity inspection, reset, and persisted fresh-vs-continue visibility. Sprint 21 adds [`tests/test_evidence_provenance.py`](../tests/test_evidence_provenance.py), [`tests/test_workflow_timeline_read_model.py`](../tests/test_workflow_timeline_read_model.py), [`tests/test_runtime_handle.py`](../tests/test_runtime_handle.py), and expanded [`tests/test_runtime_launcher.py`](../tests/test_runtime_launcher.py), [`tests/test_turn_preparation.py`](../tests/test_turn_preparation.py), [`tests/test_runtime_public_shell.py`](../tests/test_runtime_public_shell.py), and [`tests/test_inspection.py`](../tests/test_inspection.py) coverage for typed evidence provenance, grouped policy-evidence rollups, the runtime-first internal handle, and evidence-backed status/session/workflow inspection. Sprint 22 adds [`tests/test_verification_observations.py`](../tests/test_verification_observations.py) plus expanded [`tests/test_finalization.py`](../tests/test_finalization.py), [`tests/test_completion_policy.py`](../tests/test_completion_policy.py), [`tests/test_turn_completion.py`](../tests/test_turn_completion.py), [`tests/test_workflow_timeline_read_model.py`](../tests/test_workflow_timeline_read_model.py), and [`tests/test_inspection.py`](../tests/test_inspection.py) coverage for typed verification observations, observed-verification stop/continue reasoning, and unified verification inspection sourced from canonical policy events. Sprint 24 adds expanded [`tests/test_cli_runtime_owner.py`](../tests/test_cli_runtime_owner.py) coverage for runtime-first TUI launch ownership plus expanded [`tests/test_tool_batches.py`](../tests/test_tool_batches.py), [`tests/test_finalization.py`](../tests/test_finalization.py), [`tests/test_completion_policy.py`](../tests/test_completion_policy.py), [`tests/test_workflow_runtime.py`](../tests/test_workflow_runtime.py), [`tests/test_workflow_timeline_read_model.py`](../tests/test_workflow_timeline_read_model.py), and [`tests/test_inspection.py`](../tests/test_inspection.py) coverage for planned/pending/stale verification lifecycle and its operator-facing projections. Sprint 25 adds expanded [`tests/test_cli_runtime_owner.py`](../tests/test_cli_runtime_owner.py), [`tests/test_completion_policy.py`](../tests/test_completion_policy.py), [`tests/test_turn_completion.py`](../tests/test_turn_completion.py), [`tests/test_inspection.py`](../tests/test_inspection.py), and [`tests/test_status_surfaces.py`](../tests/test_status_surfaces.py) coverage for the runtime-owned shell API boundary, explicit verification-attempt identity, attempt-aware completion/freshness reasoning, and boundary/attempt operator surfaces. | |
| 132 | - | |
| 133 | -- `streaming_text`: green | |
| 134 | -- `read_file_roundtrip`: green | |
| 135 | -- `multi_tool_turn_roundtrip`: green | |
| 136 | -- `write_file_allowed`: green | |
| 137 | -- `write_file_denied`: green | |
| 138 | -- `bash_stdout_roundtrip`: green | |
| 139 | -- `bash_confirmation_prompt_approved`: green | |
| 140 | -- `bash_confirmation_prompt_denied`: green | |
| 141 | -- `read_only_mode_denies_write`: green | |
| 142 | -- `read_only_mode_denies_mutating_bash`: green | |
| 143 | -- `read_only_mode_allows_safe_bash`: green | |
| 144 | -- `workspace_write_denies_write_outside_root`: green | |
| 145 | -- `danger_full_access_allows_dangerous_bash`: green | |
| 146 | -- `prompt_mode_prompts_destructive_write`: green | |
| 147 | -- `allow_mode_skips_prompt_for_destructive_write`: green | |
| 148 | -- `deny_rule_blocks_allowed_mode`: green | |
| 149 | -- `ask_rule_prompts_even_when_mode_would_allow`: green | |
| 150 | -- `raw_json_tool_call_fallback`: green | |
| 151 | -- `completion_check_continuation`: green | |
| 152 | -- `tool_result_contract_regression`: green | |
| 153 | -- `turn_summary_smoke_for_multi_tool_turn`: green | |
| 154 | -- `native_and_raw_tool_paths_share_executor_trace`: green | |
| 155 | -- `backend_capability_probe_refreshes_native_tool_mode`: green | |
| 156 | -- `run_streaming_delegates_to_primary_runtime`: green | |
| 157 | -- `definition_of_done_verify_phase`: green | |
| 158 | -- `verify_failure_routes_to_fix_loop`: green | |
| 159 | -- `verify_retry_budget_exhaustion`: green | |
| 160 | -- `ambiguous_prompt_routes_to_clarify`: green | |
| 161 | -- `complex_prompt_routes_to_plan`: green | |
| 162 | -- `verify_failure_fix_loop_does_not_reroute_workflow`: green | |
| 163 | -- `conversational_task_skips_verify_phase`: green | |
| 164 | -- `explore_mode_skips_dod_and_router`: green | |
| 165 | -- `explore_mode_denies_write`: green | |
| 166 | -- `explore_mode_ignores_global_allow_policy`: green | |
| 167 | - | |
| 168 | -## Verification snapshot | |
| 169 | - | |
| 170 | -As of 2026-04-09: | |
| 171 | - | |
| 172 | -- `uv run pytest -q`: 416 passed | |
| 173 | -- `tests/test_runtime_harness.py` is fully green, including permission-mode parity, DoD verify/fix coverage, workflow routing parity, and the original contract regression | |
| 174 | -- `tests/test_prompt_builder.py` covers section rendering, native-vs-ReAct formatting, and prompt metadata persistence | |
| 175 | -- `tests/test_turn_state_machine.py` covers allowed/disallowed turn transitions and terminal transition metadata | |
| 176 | -- `tests/test_runtime_phases.py` covers repair/completion phase transitions plus persisted transition metadata in runtime events and session state | |
| 177 | -- `tests/test_runtime_repair_flows.py` covers honest empty-response retries, no synthetic prefill on first turns, and the removal of the older no-tool puppeting/scolding reroutes | |
| 178 | -- `tests/test_runtime_context.py` and `tests/test_runtime_state_controllers.py` cover typed runtime-context construction plus direct workflow-state and phase-tracker behavior without relying on a full `Agent` | |
| 179 | -- `tests/test_runtime_bootstrap.py` covers the shared runtime bootstrap contract, prompt/capability synchronization, and direct conversation/explore construction through the runtime bootstrap seam | |
| 180 | -- `tests/test_runtime_public_shell.py` covers runtime-owned prompt/session shell helpers directly, including session creation metadata, prompt snapshot persistence, restored last-turn summary state, steering mailbox behavior, sync/async event emitter normalization, capability-refresh helper behavior, and fresh/load session-install helpers | |
| 181 | -- `tests/test_runtime_launcher.py`, `tests/test_chat_lane.py`, and `tests/test_decomposition_lane.py` cover the public launcher contract, conversational entry routing, decomposition entry routing, and direct runtime turn delegation | |
| 182 | -- `tests/test_safeguard_services.py` covers the canonical runtime safeguard implementation plus the compatibility-export path under `loader.agent.safeguards` | |
| 183 | -- `tests/test_reasoning_compat.py` covers runtime-owned deliberation/completion helpers plus the compatibility-export path under `loader.agent.reasoning` | |
| 184 | -- `tests/test_compat_boundaries.py` fails if internal Loader code drifts back to importing runtime-owned helpers through compatibility shims | |
| 185 | -- `tests/test_repair.py` covers raw-text fallback through the runtime parser and active registry, including `TodoWrite` recovery and honest failure once raw-tool recovery exceeds its budget | |
| 186 | -- `tests/test_completion_policy.py` covers direct text-loop bailout and continuation-prompt behavior on the typed runtime context | |
| 187 | -- `tests/test_completion_policy.py`, `tests/test_turn_completion.py`, and `tests/test_session_state.py` now also pin typed follow-through evidence, honest budget-exhausted completion finalization, and persisted completion-evidence summaries | |
| 188 | -- `tests/test_response_routing.py` covers direct final-answer routing and halted tool-batch routing at the new response-policy seam | |
| 189 | -- `tests/test_dod.py` covers persistence, sizing boundaries, and verification command derivation | |
| 190 | -- `tests/test_workflow.py` covers workflow artifact round trips, scored-router expectations, DoD workflow links, and todo-to-DoD syncing | |
| 191 | -- `tests/test_workflow_signals.py` covers typed signal extraction, recent timeline pressure, and persisted `signal_summary` context | |
| 192 | -- `tests/test_clarify_strategy.py` covers clarify-slot prioritization and targeted follow-up question selection | |
| 193 | -- `tests/test_clarify_grounding.py` covers workspace evidence extraction, slot-aware/pressure-aware clarify grounding, and grounded clarify-brief hints | |
| 194 | -- `tests/test_workflow_policy.py` covers score breakdowns, clarify follow-up reviews, signal-summary persistence, artifact-freshness metadata, and workflow timeline serialization | |
| 195 | -- `tests/test_artifact_invalidation.py` covers semantic invalidation triggers plus targeted recovery selection for plan refresh vs full re-plan | |
| 196 | -- `tests/test_workflow_ledger.py` covers ledger seeding, contradiction tracking, acceptance-anchor updates, and operator-facing highlight summaries | |
| 197 | -- `tests/test_workflow_runtime.py` covers clarify routing, intent-aware clarify continuation, plan routing, targeted plan refresh, full re-plan through clarify reentry, verify-fix workflow handoff, persisted workflow-decision metadata, and workflow-ledger updates through runtime recovery | |
| 198 | -- `tests/test_turn_preparation.py`, `tests/test_turn_completion.py`, `tests/test_turn_iteration.py`, `tests/test_turn_preamble.py`, `tests/test_workflow_state.py`, and `tests/test_turn_loop.py` cover the controllerized turn-runtime seams directly instead of relying only on large end-to-end runtime tests | |
| 199 | -- `tests/test_workflow_tools.py` and `tests/test_workflow_runtime_tools.py` cover `TodoWrite`, `AskUserQuestion`, and runtime callback plumbing | |
| 200 | -- `tests/test_session_state.py` covers session persistence, resume, rotation, compaction persistence, cumulative usage rollups, persisted permission-policy metadata, workflow-ledger state, and prompt snapshot history | |
| 201 | -- `tests/test_compaction.py` covers claw-style line compression and compacted continuation-message behavior | |
| 202 | -- `tests/test_memory_tools.py` covers project-memory writes, notepad writes, lifecycle-hook mirroring, and DoD-summary capture into project memory | |
| 203 | -- `tests/test_cli_resume.py` covers `--resume` argument rewriting for latest and named-session restore | |
| 204 | -- `tests/test_inspection.py` covers `loader doctor`, `loader status`, `loader session list/show`, `loader permissions show/check`, `loader prompt show`, `loader prompt diff`, `loader workflow show --diff`, workflow timeline filtering/highlights, the new `loader workflow show --policy` accountability filter, and workflow/session inspection surfaces, including recent explore activity plus `loader explore --status` / `--reset` | |
| 205 | -- `tests/test_evidence_provenance.py` covers typed provenance serialization plus structured stop/continue evidence carried through completion policy state | |
| 206 | -- `tests/test_verification_observations.py` covers typed verification-observation serialization and normalization | |
| 207 | -- `tests/test_workflow_timeline_read_model.py` covers grouped supporting/missing policy-evidence rollups, latest-policy derivation, and observed-verification read models from the canonical workflow timeline | |
| 208 | -- `tests/test_runtime_handle.py` covers the runtime-owned internal handle below `Agent`, including direct launcher/context/runtime construction without depending on the public compatibility facade | |
| 209 | -- `tests/test_cli_runtime_owner.py` covers runtime-first owner selection for non-TUI CLI, `loader explore`, single-prompt execution, and TUI launch paths | |
| 210 | -- `tests/test_completion_policy.py` and `tests/test_turn_completion.py` now also pin attempt-aware completion/freshness reasoning, including superseded-attempt summaries in continuation-stop decisions | |
| 211 | -- `tests/test_inspection.py` and `tests/test_status_surfaces.py` now also cover runtime-boundary summaries, verification-state summaries, and TUI owner/attempt status rendering | |
| 212 | -- `tests/test_tool_batches.py`, `tests/test_finalization.py`, `tests/test_completion_policy.py`, `tests/test_workflow_runtime.py`, `tests/test_workflow_timeline_read_model.py`, and `tests/test_inspection.py` now cover planned/pending/stale verification lifecycle transitions plus their policy/accountability projections | |
| 213 | -- `tests/test_explore_runtime.py` covers the direct explore lane contract, forced read-only behavior, persisted follow-up continuity, persisted `fresh` vs `continue` visibility, and `fresh` explore resets outside the parity harness | |
| 214 | -- `tests/test_expanded_tools.py` covers structured patch application, read-only git helpers, `notepad_append`, and richer structured user questions | |
| 215 | -- `tests/test_permissions.py` covers prompt/allow mode parsing, rule precedence, policy-backed prompting behavior, and hook lifecycle ordering | |
| 216 | -- `tests/test_tool_safety.py` covers workspace boundaries, binary/oversize guards, patch metadata, and shell truncation/classification | |
| 217 | -- `tests/test_status_surfaces.py` covers the CLI/TUI DoD, workflow-mode, permission-mode, capability-profile, and session-id formatting helpers | |
| 218 | -- `tests/test_runtime_public_shell.py`, `tests/test_session_state.py`, and `tests/test_inspection.py` now also cover persisted runtime-owner metadata plus its status/session/workflow rendering | |
| 219 | -- native and extracted tool calls now record the same executor trace events, with source-specific metadata | |
| 220 | -- turn startup can refine backend capability profiles before the first request, `run_streaming()` delegates into the main runtime path, mutating tasks route through persisted evidence-backed completion, workflow artifacts and workflow-ledger state survive across turns, sessions compact safely, explore queries bypass DoD/router overhead safely, policy rules are enforced deterministically, operators can inspect/dry-run policy decisions without live turns, prompt construction is sectioned and persisted, prompt snapshots and artifact diffs are inspectable after the fact, explicit turn phases are visible while a turn runs, session inspection preserves effective policy state, typed workflow signals now feed routing directly, semantic invalidation can force targeted refresh vs full re-plan, brownfield clarify can ask evidence-backed questions from repo facts, and the turn runtime now avoids the older synthetic repair/no-tool puppeting while routing assistant outcomes through dedicated controllers instead of a single conversation-loop monolith | |
| 221 | - | |
| 222 | -## Definition of honesty | |
| 223 | - | |
| 224 | -- If a scenario is green here, it should have deterministic automated coverage. | |
| 225 | -- If a scenario is flaky or broken, it should be called out here before we claim parity work is done. | |
| 226 | -- Sprint 01 turned the original `tool_call_id` regression green by fixing the message contract, not by weakening the test. | |
| 227 | -- Sprint 02 replaced "looks done" completion for mutating tasks with a real verify/fix gate, but it has not yet reached the richer workflow contracts described in the report and Sprint 04+. | |
| 228 | -- Sprint 03 established permission modes, hooks, and tool hardening, but it intentionally stops short of claw-code's fuller rule engine and prompt/allow permission variants. | |
| 229 | -- Sprint 04 adds routing, artifacts, and structured user questions, but it is still a first-pass workflow layer rather than full OMX consensus planning or deep interview rigor. | |
| 230 | -- Sprint 05 adds durable sessions, resume, compaction, and native memory/notepad tools, but it stops short of Sprint 06's inspectable session/status product surfaces and still uses heuristic continuity summaries rather than richer semantic memory extraction. | |
| 231 | -- Sprint 06 adds inspectable product surfaces, a constrained explore lane, and a broader tool registry, but it still stops short of interactive explore workflows, richer git ergonomics, AST/LSP-aware editing, or any multi-agent/team runtime. | |
| 232 | -- Sprint 07 is complete: Loader now has prompt/allow modes, rule-based permission policy, policy-backed prompting, persisted policy inspection state, and smaller assistant-turn/tool-batch/finalization runtime seams, but it still stops short of a richer rule UX, deeper policy sandboxing, and the more opinionated workflow/runtime contracts in the refs. | |
| 233 | -- Sprint 08 is complete: Loader now has a typed prompt builder, explicit runtime turn phases, first-class `loader permissions show/check` operator surfaces, and more coherent prompt/policy observability in doctor/status/session output, but it still stops short of a richer rule editor, formal state-machine routing, prompt-preview tooling, or the deeper workflow rigor in the refs. | |
| 234 | -- Sprint 09 is complete: Loader now has a validated turn state machine, typed persisted workflow decisions, `loader prompt show`, and richer workflow-reason/transition inspection surfaces, but it still stops short of deeper workflow routing policy, prompt diffing/versioning, and the more opinionated planning discipline used by the refs. | |
| 235 | -- Sprint 10 is complete: Loader now has a scored workflow policy, bounded clarify follow-through, targeted plan-refresh discipline, persisted workflow timeline inspection, and a greener workflow test contract, but it still stops short of deeper semantic routing/replanning, adaptive clarify depth, prompt-history diffing, or the fuller workflow rigor used by the refs. | |
| 236 | -- Sprint 11 is complete: Loader now has typed workflow signals, intent-aware clarify slots, semantic invalidation with targeted recovery vs full re-plan, filtered workflow inspection, and dedicated clarify/plan lane execution in `runtime.workflow_lanes`, but it still stops short of OMX-style pressure-pass interviewing, richer semantic artifact reasoning, and the deeper end-to-end orchestration discipline in the refs. | |
| 237 | -- Sprint 12 is complete: Loader now has pressure-pass clarify behavior, codebase-backed clarify grounding, structured recovery evidence, controllerized turn-runtime seams, and a genuinely coordinator-shaped `runtime.conversation`, but it still stops short of OMX-style deep interview depth, richer semantic artifact reasoning, artifact/prompt diff ergonomics, and the broader operator/runtime sophistication in the refs. | |
| 238 | -- Sprint 13 is complete: Loader now avoids synthetic prefill and no-tool puppeting, persists a semantic workflow ledger, exposes prompt/artifact diff surfaces, and routes assistant responses through a dedicated response-policy seam, but it still stops short of claw-code's narrower response/tool policy factoring, deeper planning rigor, and broader operator ergonomics. | |
| 239 | -- Sprint 14 is complete: Loader now treats `RuntimeContext` as the primary seam across workflow state, turn phases, response policy, turn looping, workflow recovery, and finalization; the old runtime legacy shim is gone, raw-text tool recovery no longer depends on hidden agent extractors, and the hot path is substantially more runtime-owned, but bootstrap ownership still begins at the agent wrapper and Loader still stops short of claw-code's fuller policy engine, OMX's deeper workflow rigor, and richer operator/runtime surfaces. | |
| 240 | -- Sprint 15 is complete: Loader now has a shared runtime bootstrap seam, runtime-owned safeguards and deliberation helpers, compatibility-only `agent/reasoning.py` and `agent/safeguards.py`, no remaining `Agent._build_runtime_context()` helper, and a much smaller `agent/loop.py` after dead planner/raw-extraction cleanup, but it still stops short of a minimal entrypoint shell, deeper explore ergonomics, claw-code's tighter policy engine, and OMX's richer planning/interview rigor. | |
| 241 | -- Sprint 16 is complete: Loader now has a first-class runtime launcher contract, a thinner `agent/loop.py`, explicit compatibility-boundary proof, and persisted explore continuity plus status visibility, but it still stops short of a fully minimal public shell, richer explore workflows, claw-code's tighter policy seams, and OMX's deeper planning/interview rigor. | |
| 242 | -- Sprint 17 is complete: Loader now starts public runtime launch from an explicit runtime-shaped bootstrap view, moves prompt/session shell helpers into `src/loader/runtime/public_shell.py`, fails raw-text tool recovery more honestly once its budget is exhausted, and exposes explore continuity through `loader explore --status` / `--reset`, but it still stops short of a fully minimal public shell, deeper completion-policy deletions, richer explore workflows, claw-code's tighter policy seams, and OMX's deeper planning/interview rigor. | |
| 243 | -- Sprint 18 is complete: Loader now persists explicit completion decisions and bounded completion traces, exposes them through status/session inspection, restores them across resume, and moves more event/steering/capability/session shell glue into `src/loader/runtime/public_shell.py`, but it still stops short of a fully minimal public shell, harder deletion of all continuation heuristics, claw-code's tighter policy seams, and OMX's deeper planning/interview rigor. | |
| 244 | -- Sprint 19 is complete: Loader now pushes more public entry glue under `src/loader/runtime/public_shell.py`, derives typed follow-through evidence for non-mutating completion checks, fails honestly when that evidence is still missing after the continuation budget is exhausted, and exposes a clearer unified policy story through `loader workflow show --policy` plus the `Policy Timeline` preview in `loader session show`, but it still stops short of a fully minimal public shell, claw-code's fuller policy engine, and OMX's deeper verifier/interview rigor. | |
| 245 | -- Sprint 20 is complete: Loader now treats the workflow timeline as the canonical policy/accountability artifact even for live completion-trace projection, grounds more follow-through decisions in DoD verification state and tracked runtime evidence, exposes latest-policy rollups in the existing status/session surfaces, and explicitly settles the remaining `Agent` shell as a documented public facade guarded by boundary tests, but it still stops short of claw-code's fuller policy engine, a narrower runtime-first external API, and OMX's deeper verifier/interview rigor. | |
| 246 | -- Sprint 21 is complete: Loader now carries typed evidence provenance through canonical policy events, derives grouped policy-evidence rollups from one shared workflow-timeline read model, exposes “needed” vs “satisfied” evidence in `loader status` / `loader session show` / `loader workflow show`, and provides a runtime-owned internal handle so runtime-oriented code and tests no longer need to treat `Agent` as the only valid execution owner, but it still stops short of claw-code's fuller policy engine, a narrower runtime-first external API, and OMX's deeper verifier/interview rigor. | |
| 247 | -- Sprint 22 is complete on the verification-observation lane: Loader now captures typed verification observations closer to execution, carries those observations through canonical policy events and completion-stop decisions, and surfaces observed verification plus a unified `Recent Verification` view in `loader status` / `loader session show` / `loader workflow show`, but the planned runtime-first entry promotion beyond tests did not land and rolls forward as Sprint 23 debt alongside Loader's remaining gap to claw-code's fuller policy engine and OMX's deeper verifier/interview rigor. | |
| 248 | -- Sprint 23 is complete: Loader now uses the runtime-first seam in real internal integrations through `RuntimeHandle`, emits per-command `verify_observation` events while the verification loop runs, and surfaces persisted runtime-owner provenance in the existing operator views, but it still stops short of a narrower runtime-first public API, TUI migration away from the public shell, claw-code's fuller policy engine, and OMX's deeper verifier/interview rigor. | |
| 249 | -- Sprint 24 is complete: Loader now uses the runtime-first owner seam for the TUI as well, distinguishes planned/pending/stale verification lifecycle state inside the canonical policy timeline, and surfaces that lifecycle directly in status/session/workflow inspection, but it still stops short of a narrower runtime-first external API, richer verification queue/timestamp semantics, claw-code's fuller policy engine, and OMX's deeper verifier/interview rigor. | |
| 250 | -- Sprint 25 is complete: Loader now has a runtime-owned shell API boundary below `Agent`, explicit verification-attempt identity carried through completion/freshness policy, and operator surfaces that expose runtime-boundary plus attempt-aware verification state across CLI and TUI, but it still stops short of a fully runtime-first external API, richer attempt queue/timestamp semantics, claw-code's fuller policy engine, and OMX's deeper verifier/interview rigor. | |
.docs/REPORT.mddeleted@@ -1,1116 +0,0 @@ | ||
| 1 | -# Loader Deep Dive: Gaps, Strengths, and a Path Toward Claw-Like Behavior | |
| 2 | - | |
| 3 | -Date: 2026-04-06 | |
| 4 | - | |
| 5 | -## Scope and assumptions | |
| 6 | - | |
| 7 | -This report compares three things: | |
| 8 | - | |
| 9 | -1. `Loader` itself | |
| 10 | -2. `refs/claw-code`, using the Rust workspace under `refs/claw-code/rust/` as the canonical runtime | |
| 11 | -3. `refs/oh-my-codex` as the workflow-layer parent repo | |
| 12 | - | |
| 13 | -Assumption: `oh-my-codex` is the correct “parent repo” for this exercise. That assumption is based on: | |
| 14 | - | |
| 15 | -- `refs/claw-code/README.md` | |
| 16 | -- `refs/claw-code/PHILOSOPHY.md` | |
| 17 | -- the fact that `refs/claw-code` explicitly describes `src/` as a companion Python/reference workspace, not the primary runtime | |
| 18 | - | |
| 19 | -If you meant a different parent, we should rerun the comparison against that repo, but this is a solid first pass. | |
| 20 | - | |
| 21 | -## Executive summary | |
| 22 | - | |
| 23 | -Loader has the right instincts but is operating at the wrong layer. | |
| 24 | - | |
| 25 | -The codebase already knows that models need: | |
| 26 | - | |
| 27 | -- planning help | |
| 28 | -- recovery help | |
| 29 | -- confidence checks | |
| 30 | -- completion checks | |
| 31 | -- safe tool use | |
| 32 | - | |
| 33 | -But Loader mostly tries to enforce those after the model has already started drifting. `claw-code` and `oh-my-codex` get better behavior because they shape the work before, during, and after the model call: | |
| 34 | - | |
| 35 | -- before: explicit mode selection, clarification, approved planning artifacts | |
| 36 | -- during: durable runtime state, richer tool surface, explicit permission model, session persistence | |
| 37 | -- after: verification protocols, completion gates, retry/fix loops, parity harnesses, operator diagnostics | |
| 38 | - | |
| 39 | -The biggest lesson is not “copy their prompt.” | |
| 40 | - | |
| 41 | -The biggest lesson is: | |
| 42 | - | |
| 43 | -> Loader needs a stronger execution contract, not just stronger prompting. | |
| 44 | - | |
| 45 | -If we want Loader to feel closer to `claw-code` regardless of model choice, the highest-leverage work is: | |
| 46 | - | |
| 47 | -1. replace the monolithic heuristic loop with a typed turn engine | |
| 48 | -2. add durable workflow/state artifacts | |
| 49 | -3. make “definition of done” evidence-based instead of heuristic | |
| 50 | -4. add real permission/safety boundaries around tools | |
| 51 | -5. build a parity harness so we can improve behavior intentionally | |
| 52 | - | |
| 53 | -## Method | |
| 54 | - | |
| 55 | -I reviewed: | |
| 56 | - | |
| 57 | -- Loader source under `src/loader/` | |
| 58 | -- Loader tests under `tests/` | |
| 59 | -- `refs/claw-code/README.md` | |
| 60 | -- `refs/claw-code/USAGE.md` | |
| 61 | -- `refs/claw-code/PARITY.md` | |
| 62 | -- `refs/claw-code/PHILOSOPHY.md` | |
| 63 | -- `refs/claw-code/rust/crates/runtime/*` | |
| 64 | -- `refs/claw-code/rust/crates/tools/src/lib.rs` | |
| 65 | -- `refs/oh-my-codex/README.md` | |
| 66 | -- `refs/oh-my-codex/AGENTS.md` | |
| 67 | -- `refs/oh-my-codex/skills/deep-interview/SKILL.md` | |
| 68 | -- `refs/oh-my-codex/skills/ralplan/SKILL.md` | |
| 69 | -- `refs/oh-my-codex/skills/ralph/SKILL.md` | |
| 70 | -- `refs/oh-my-codex/src/modes/base.ts` | |
| 71 | -- `refs/oh-my-codex/src/ralplan/runtime.ts` | |
| 72 | -- `refs/oh-my-codex/src/mcp/memory-server.ts` | |
| 73 | -- `refs/oh-my-codex/src/verification/verifier.ts` | |
| 74 | -- `refs/oh-my-codex/src/cli/doctor.ts` | |
| 75 | -- `refs/oh-my-codex/src/scripts/notify-hook.ts` | |
| 76 | - | |
| 77 | -I also ran Loader verification commands: | |
| 78 | - | |
| 79 | -- `uv run pytest` | |
| 80 | - - failed during collection | |
| 81 | - - discovered `refs/claw-code/tests/*` | |
| 82 | - - also failed to import `loader` | |
| 83 | -- `uv run --with pytest --with pytest-asyncio python -m pytest tests -q` | |
| 84 | - - 56 passed | |
| 85 | - - 3 failed | |
| 86 | - | |
| 87 | -That matters because some of Loader’s runtime paths are clearly under-tested. | |
| 88 | - | |
| 89 | -## What Loader already does well | |
| 90 | - | |
| 91 | -### 1. Loader is small, understandable, and hackable | |
| 92 | - | |
| 93 | -This is a real advantage. | |
| 94 | - | |
| 95 | -`src/loader/` is about 55 source files, and the core agent behavior is easy to locate. Compared to `claw-code` and especially OMX, Loader is much easier to refactor aggressively. | |
| 96 | - | |
| 97 | -### 2. Loader is genuinely local-first | |
| 98 | - | |
| 99 | -The Ollama-first posture is simple and useful. A lot of the complexity in `claw-code` and OMX comes from supporting broad operational surfaces, multiple runtimes, OAuth, MCP, tmux/team flows, and richer tool ecosystems. Loader can keep its local-first identity while still copying the good execution ideas. | |
| 100 | - | |
| 101 | -### 3. Loader already contains the seeds of a better system | |
| 102 | - | |
| 103 | -These are the right instincts: | |
| 104 | - | |
| 105 | -- project context detection in `src/loader/context/project.py` | |
| 106 | -- runtime safeguards in `src/loader/agent/safeguards.py` | |
| 107 | -- recovery categorization in `src/loader/agent/recovery.py` | |
| 108 | -- optional decomposition / critique / confidence / verification / completion checks in `src/loader/agent/reasoning.py` | |
| 109 | -- a decent Textual app in `src/loader/ui/app.py` | |
| 110 | - | |
| 111 | -The problem is not that Loader lacks ideas. | |
| 112 | - | |
| 113 | -The problem is that these ideas are bolted onto one big runtime loop instead of being elevated into the architecture. | |
| 114 | - | |
| 115 | -### 4. The TUI is a meaningful strength | |
| 116 | - | |
| 117 | -Loader’s TUI already gives you: | |
| 118 | - | |
| 119 | -- model selection | |
| 120 | -- streaming output | |
| 121 | -- approval handling | |
| 122 | -- status line updates | |
| 123 | -- tool widgets | |
| 124 | - | |
| 125 | -That is more product surface than many small local agents. It is worth keeping. | |
| 126 | - | |
| 127 | -## Where Loader is weak today | |
| 128 | - | |
| 129 | -### 1. Loader’s product surface is not trustworthy yet | |
| 130 | - | |
| 131 | -The most visible sign is the README: | |
| 132 | - | |
| 133 | -- `README.md:1-2` still says “FortranGoingOnForty” and “A tutorial on using Fortran for beginners.” | |
| 134 | - | |
| 135 | -That looks small, but it reflects a bigger problem: Loader is missing operational polish and self-diagnosis. `claw-code` and OMX both treat installability, health checks, and discoverability as product requirements. Loader currently feels like an experiment more than a tool. | |
| 136 | - | |
| 137 | -### 2. Loader’s main runtime is too monolithic and too heuristic | |
| 138 | - | |
| 139 | -`src/loader/agent/loop.py` is the heart of Loader, and it is doing too much: | |
| 140 | - | |
| 141 | -- prompt construction | |
| 142 | -- streaming output handling | |
| 143 | -- raw tool-call extraction | |
| 144 | -- duplicate tool execution flows | |
| 145 | -- recovery | |
| 146 | -- validation | |
| 147 | -- rollback tracking | |
| 148 | -- completion nudging | |
| 149 | -- loop detection | |
| 150 | -- steering | |
| 151 | -- partial planning | |
| 152 | -- decomposition | |
| 153 | - | |
| 154 | -The result is a loop that is hard to reason about and easy to destabilize. | |
| 155 | - | |
| 156 | -The core design smell is that Loader tries to recover from model misbehavior in-place instead of enforcing a stronger turn protocol. | |
| 157 | - | |
| 158 | -### 3. Loader has a real runtime contract bug in tool-result handling | |
| 159 | - | |
| 160 | -**Verified directly against the code.** There is a concrete mismatch between `Message` and the loop: | |
| 161 | - | |
| 162 | -- `src/loader/llm/base.py:33-39` defines `Message` with `role`, `content`, `tool_calls`, and `tool_results`. There is no `tool_call_id` field on `Message` — that field belongs to the separate `ToolResult` dataclass at `src/loader/llm/base.py:25-30`. | |
| 163 | -- `src/loader/agent/loop.py:885` and `src/loader/agent/loop.py:906` both construct `Message(role=Role.TOOL, content=..., tool_call_id=tool_call.id)`. | |
| 164 | - | |
| 165 | -Both call sites will raise `TypeError: Message.__init__() got an unexpected keyword argument 'tool_call_id'` the moment they execute. They live on the duplicate-suppression and pre-validation branches of the loop, which means they have **zero** integration coverage today. This single bug is the proof that the test harness gap is real and that Sprint 00 must precede any behavioral work. | |
| 166 | - | |
| 167 | -### 4. Loader duplicates tool execution logic instead of centralizing it | |
| 168 | - | |
| 169 | -There are effectively two execution paths: | |
| 170 | - | |
| 171 | -- the normal native/ReAct tool path | |
| 172 | -- the “raw JSON extracted tool call” path | |
| 173 | - | |
| 174 | -Those paths duplicate: | |
| 175 | - | |
| 176 | -- duplicate checking | |
| 177 | -- validation | |
| 178 | -- confirmation behavior | |
| 179 | -- result recording | |
| 180 | -- loop/error handling | |
| 181 | - | |
| 182 | -That makes behavior inconsistent and increases the chance that fixes in one path never land in the other. | |
| 183 | - | |
| 184 | -`claw-code`’s `ConversationRuntime::run_turn()` is much tighter: receive assistant output, extract tool uses, authorize, execute, append tool results, repeat. | |
| 185 | - | |
| 186 | -### 5. Loader’s system prompt is too shallow and too rigid | |
| 187 | - | |
| 188 | -`src/loader/agent/prompts.py:148-208` gives Loader a generic “use tools immediately / no code blocks / no numbered steps / read files before editing” prompt. | |
| 189 | - | |
| 190 | -This is too blunt. | |
| 191 | - | |
| 192 | -Problems: | |
| 193 | - | |
| 194 | -- it treats all tasks like immediate tool-execution tasks | |
| 195 | -- it globally bans numbered steps, which is bad for planning/reporting tasks | |
| 196 | -- it does not define modes | |
| 197 | -- it does not encode verification expectations | |
| 198 | -- it does not encode completion criteria | |
| 199 | -- it does not distinguish “clarify”, “plan”, “execute”, and “verify” | |
| 200 | - | |
| 201 | -OMX is much better here. It does not just say “do the task.” It routes the task into a workflow lane with an explicit contract. | |
| 202 | - | |
| 203 | -### 6. Loader’s tool surface is too thin | |
| 204 | - | |
| 205 | -Loader has 6 default tools: | |
| 206 | - | |
| 207 | -- `read` | |
| 208 | -- `write` | |
| 209 | -- `edit` | |
| 210 | -- `glob` | |
| 211 | -- `bash` | |
| 212 | -- `grep` | |
| 213 | - | |
| 214 | -That is enough for toy execution, but not enough for strong agent behavior. | |
| 215 | - | |
| 216 | -What is missing compared to `claw-code` / OMX: | |
| 217 | - | |
| 218 | -- task/todo tracking | |
| 219 | -- structured ask-user surfaces | |
| 220 | -- memory/notepad | |
| 221 | -- doctor/status/session tooling | |
| 222 | -- git-aware helpers | |
| 223 | -- explore vs full-execution split | |
| 224 | -- diff/patch-aware editing | |
| 225 | -- web/search/fetch surfaces | |
| 226 | -- structured output surfaces | |
| 227 | -- subagent/team coordination surfaces | |
| 228 | -- MCP-backed state and memory | |
| 229 | - | |
| 230 | -The result is that Loader has to keep too much in the prompt and too much in ephemeral model state. | |
| 231 | - | |
| 232 | -### 7. Loader’s safety model is primitive | |
| 233 | - | |
| 234 | -Loader’s current protection model is mostly: | |
| 235 | - | |
| 236 | -- “safe commands” vs “ask for confirmation” | |
| 237 | -- destructive tool flags | |
| 238 | - | |
| 239 | -Problems in practice: | |
| 240 | - | |
| 241 | -- no permission modes like `read-only`, `workspace-write`, `danger-full-access` | |
| 242 | -- no strong workspace boundary checks | |
| 243 | -- no binary-file guards | |
| 244 | -- no file size limits | |
| 245 | -- no symlink escape protection | |
| 246 | -- no command semantics beyond a short safe list | |
| 247 | - | |
| 248 | -Evidence: | |
| 249 | - | |
| 250 | -- `src/loader/tools/file_tools.py` reads/writes resolved paths directly | |
| 251 | -- `src/loader/tools/shell_tools.py` uses `create_subprocess_shell()` on arbitrary shell strings | |
| 252 | -- `src/loader/tools/shell_tools.py:13-20` uses a short safe command set, but no mode-based authorization model | |
| 253 | - | |
| 254 | -By comparison, `claw-code` has: | |
| 255 | - | |
| 256 | -- `PermissionPolicy` | |
| 257 | -- `PermissionEnforcer` | |
| 258 | -- workspace boundary checks | |
| 259 | -- binary/size guards in file ops | |
| 260 | -- permission-mode aware tool definitions | |
| 261 | - | |
| 262 | -That does not just make it safer. It makes the agent more predictable. | |
| 263 | - | |
| 264 | -### 8. Loader’s “definition of done” is heuristic, not contractual | |
| 265 | - | |
| 266 | -The user complaint about “spending too long on simple tasks or finishing early without followup” is visible directly in the code. | |
| 267 | - | |
| 268 | -Loader’s current strategy is: | |
| 269 | - | |
| 270 | -- heuristically decide whether the response looks premature | |
| 271 | -- nudge the model to continue | |
| 272 | -- maybe ask it to confirm completion | |
| 273 | - | |
| 274 | -See: | |
| 275 | - | |
| 276 | -- `src/loader/agent/reasoning.py:721-854` | |
| 277 | - | |
| 278 | -This is well-intentioned, but it is still guesswork. | |
| 279 | - | |
| 280 | -It does not require: | |
| 281 | - | |
| 282 | -- explicit acceptance criteria | |
| 283 | -- a verification plan | |
| 284 | -- fresh command evidence | |
| 285 | -- zero pending tasks | |
| 286 | -- a final sign-off phase | |
| 287 | - | |
| 288 | -OMX’s `ralph` workflow does. | |
| 289 | - | |
| 290 | -That difference is enormous. | |
| 291 | - | |
| 292 | -### 9. Loader has no durable workflow state | |
| 293 | - | |
| 294 | -Loader has plans, decomposition, and completion logic, but they live inside one run and disappear. | |
| 295 | - | |
| 296 | -Missing pieces: | |
| 297 | - | |
| 298 | -- persisted mode state | |
| 299 | -- session memory | |
| 300 | -- approved plan artifacts | |
| 301 | -- PRD / test-spec artifacts | |
| 302 | -- progress ledger | |
| 303 | -- durable “what was already decided” | |
| 304 | -- resume-safe task state | |
| 305 | - | |
| 306 | -OMX writes state under `.omx/` and uses that to keep the workflow coherent across retries, handoffs, and interruptions. Loader currently depends on in-memory context plus prompt history only. | |
| 307 | - | |
| 308 | -### 10. Loader is too backend-specific and too capability-fragile | |
| 309 | - | |
| 310 | -Despite defining an abstract LLM backend, Loader is effectively Ollama-only today. | |
| 311 | - | |
| 312 | -Evidence: | |
| 313 | - | |
| 314 | -- `src/loader/cli/main.py` supports only `ollama` | |
| 315 | -- `src/loader/llm/ollama.py` hardcodes native tool support by model-name substring matching | |
| 316 | - | |
| 317 | -This is fragile for behavior matching “with any model chosen.” | |
| 318 | - | |
| 319 | -What Loader needs instead is: | |
| 320 | - | |
| 321 | -- a provider-independent tool-calling contract | |
| 322 | -- explicit capability profiles | |
| 323 | -- distinct fallback strategies for native tools vs text tool calling | |
| 324 | -- prompts/workflows that degrade gracefully | |
| 325 | - | |
| 326 | -### 11. Loader’s tests are not protecting the real runtime | |
| 327 | - | |
| 328 | -Loader’s test suite is mostly: | |
| 329 | - | |
| 330 | -- tool unit tests | |
| 331 | -- parsing tests | |
| 332 | -- recovery tests | |
| 333 | - | |
| 334 | -That is useful, but insufficient. | |
| 335 | - | |
| 336 | -The current state: | |
| 337 | - | |
| 338 | -- `uv run pytest` fails by default after adding `refs/` | |
| 339 | -- the repo does not scope pytest discovery | |
| 340 | -- the “normal” targeted run needs `--with pytest --with pytest-asyncio` | |
| 341 | -- even then, 3 tests fail | |
| 342 | -- there are no strong turn-loop integration tests | |
| 343 | -- there is no deterministic mock backend harness comparable to `claw-code` | |
| 344 | - | |
| 345 | -This is why structural issues like the `tool_call_id` mismatch can survive. | |
| 346 | - | |
| 347 | -## What `claw-code` gets right | |
| 348 | - | |
| 349 | -## 1. The runtime contract is explicit | |
| 350 | - | |
| 351 | -`refs/claw-code/rust/crates/runtime/src/conversation.rs` is the biggest thing Loader should study. | |
| 352 | - | |
| 353 | -The core `run_turn()` flow is clean: | |
| 354 | - | |
| 355 | -1. append user message to session | |
| 356 | -2. stream assistant response | |
| 357 | -3. build a typed assistant message | |
| 358 | -4. extract tool uses | |
| 359 | -5. run permission checks | |
| 360 | -6. execute tool | |
| 361 | -7. append tool result message | |
| 362 | -8. repeat until no more tool uses | |
| 363 | -9. optionally compact session | |
| 364 | -10. return a typed turn summary | |
| 365 | - | |
| 366 | -That is much more trustworthy than Loader’s current “stream + parse + filter + maybe reparse + maybe extract raw JSON + maybe duplicate path” approach. | |
| 367 | - | |
| 368 | -## 2. Session persistence and compaction are first-class | |
| 369 | - | |
| 370 | -`claw-code` treats long-lived sessions as a product feature: | |
| 371 | - | |
| 372 | -- persisted sessions | |
| 373 | -- resume support | |
| 374 | -- usage tracking | |
| 375 | -- compaction thresholds | |
| 376 | -- summarized continuation messages | |
| 377 | - | |
| 378 | -Relevant files: | |
| 379 | - | |
| 380 | -- `refs/claw-code/rust/crates/runtime/src/conversation.rs` | |
| 381 | -- `refs/claw-code/rust/crates/runtime/src/compact.rs` | |
| 382 | -- `refs/claw-code/rust/crates/runtime/src/summary_compression.rs` | |
| 383 | -- `refs/claw-code/rust/crates/runtime/src/usage.rs` | |
| 384 | - | |
| 385 | -This matters because good agent behavior is often continuity behavior. | |
| 386 | - | |
| 387 | -## 3. Permissions are part of the runtime, not just UI confirmation | |
| 388 | - | |
| 389 | -`claw-code` has an actual permission model with three layers: | |
| 390 | - | |
| 391 | -- **Mode layer** — `PermissionMode` enum with `ReadOnly`, `WorkspaceWrite`, `DangerFullAccess`, `Prompt`, and `Allow` (`refs/claw-code/rust/crates/runtime/src/permissions.rs:8-27`) | |
| 392 | -- **Per-tool requirement layer** — every `ToolSpec` declares the minimum mode it requires, mapped in `PermissionPolicy.tool_requirements` | |
| 393 | -- **Rule layer** — three rule lists (`allow_rules`, `deny_rules`, `ask_rules`) for context-specific overrides on top of the mode/requirement check | |
| 394 | - | |
| 395 | -Plus typed authorization outcomes, file-write boundary logic, and bash gating. | |
| 396 | - | |
| 397 | -Relevant files: | |
| 398 | - | |
| 399 | -- `refs/claw-code/rust/crates/runtime/src/permission_enforcer.rs` | |
| 400 | -- `refs/claw-code/rust/crates/runtime/src/permissions.rs` | |
| 401 | - | |
| 402 | -Loader needs this badly. The mode layer alone is the high-leverage start; the rule layer can come later. | |
| 403 | - | |
| 404 | -## 4. File and shell operations are engineered, not just exposed | |
| 405 | - | |
| 406 | -`claw-code`’s file layer includes: | |
| 407 | - | |
| 408 | -- max read size | |
| 409 | -- max write size | |
| 410 | -- binary detection | |
| 411 | -- workspace-boundary validation | |
| 412 | -- structured patch outputs | |
| 413 | - | |
| 414 | -Relevant file: | |
| 415 | - | |
| 416 | -- `refs/claw-code/rust/crates/runtime/src/file_ops.rs` | |
| 417 | - | |
| 418 | -Loader’s file tools are functional, but too permissive and too simplistic to support strong autonomous behavior. | |
| 419 | - | |
| 420 | -## 5. Hooks and lifecycle surfaces give the runtime escape valves | |
| 421 | - | |
| 422 | -`claw-code` has pre-tool and post-tool hooks, including failure hooks. | |
| 423 | - | |
| 424 | -That is important because not every behavioral improvement should live inside the model prompt. Hooks let the system inject policy, observability, and guardrails without changing the LLM call itself. | |
| 425 | - | |
| 426 | -Relevant files: | |
| 427 | - | |
| 428 | -- `refs/claw-code/rust/crates/runtime/src/hooks.rs` | |
| 429 | -- `refs/claw-code/rust/crates/runtime/src/conversation.rs` | |
| 430 | - | |
| 431 | -## 6. The project is honest about parity and weaknesses | |
| 432 | - | |
| 433 | -`refs/claw-code/PARITY.md` is one of the best engineering lessons in the whole comparison. | |
| 434 | - | |
| 435 | -It does three things Loader does not yet do: | |
| 436 | - | |
| 437 | -- names what is actually shipped | |
| 438 | -- names what is still shallow or stubbed | |
| 439 | -- ties roadmap claims to concrete evidence | |
| 440 | - | |
| 441 | -That alone reduces thrash. | |
| 442 | - | |
| 443 | -Loader needs a similar parity/backlog document for runtime behavior. | |
| 444 | - | |
| 445 | -## 7. Diagnostics and operator surfaces are part of the product | |
| 446 | - | |
| 447 | -`claw-code` exposes operational commands like: | |
| 448 | - | |
| 449 | -- `status` | |
| 450 | -- `sandbox` | |
| 451 | -- `agents` | |
| 452 | -- `mcp` | |
| 453 | -- `skills` | |
| 454 | -- `doctor` | |
| 455 | -- session resume | |
| 456 | - | |
| 457 | -This is not just convenience. It makes the system inspectable. Loader currently hides too much inside the runtime. | |
| 458 | - | |
| 459 | -## Where `claw-code` is still incomplete | |
| 460 | - | |
| 461 | -It is worth staying honest here too. | |
| 462 | - | |
| 463 | -Even `claw-code` admits some shallowness in `PARITY.md`: | |
| 464 | - | |
| 465 | -- some surfaces are registry-backed approximations, not deep external integrations | |
| 466 | -- session compaction parity is still open | |
| 467 | -- token accounting accuracy is still open | |
| 468 | -- some tool surfaces remain shallow or partially stubbed | |
| 469 | - | |
| 470 | -That is useful because the goal is not blind imitation. The goal is to copy the parts that most affect day-to-day behavior. | |
| 471 | - | |
| 472 | -## What OMX adds that Loader is currently missing almost entirely | |
| 473 | - | |
| 474 | -`claw-code` gives a better runtime. OMX gives a better workflow. | |
| 475 | - | |
| 476 | -This is where most of Loader’s “definition of done” and “follow-through” problems are answered. | |
| 477 | - | |
| 478 | -### 1. Clarification is a mode, not an ad hoc question | |
| 479 | - | |
| 480 | -`deep-interview` is not “ask a question if confused.” | |
| 481 | - | |
| 482 | -It is a formal ambiguity-reduction workflow with: | |
| 483 | - | |
| 484 | -- a context snapshot | |
| 485 | -- one-question rounds | |
| 486 | -- ambiguity scoring | |
| 487 | -- explicit non-goals | |
| 488 | -- explicit decision boundaries | |
| 489 | -- a crystallized artifact for downstream execution | |
| 490 | - | |
| 491 | -Relevant files: | |
| 492 | - | |
| 493 | -- `refs/oh-my-codex/skills/deep-interview/SKILL.md` | |
| 494 | - | |
| 495 | -Loader currently has no equivalent. It either acts immediately or tries to self-nudge mid-flight. | |
| 496 | - | |
| 497 | -### 2. Planning is artifact-based and consensus-based | |
| 498 | - | |
| 499 | -`ralplan` is much more than “make a numbered list.” | |
| 500 | - | |
| 501 | -It includes: | |
| 502 | - | |
| 503 | -- Planner / Architect / Critic loops | |
| 504 | -- max iteration handling | |
| 505 | -- planning completion gates | |
| 506 | -- PRD and test-spec artifacts | |
| 507 | -- approved handoff into execution | |
| 508 | - | |
| 509 | -Relevant files: | |
| 510 | - | |
| 511 | -- `refs/oh-my-codex/skills/ralplan/SKILL.md` | |
| 512 | -- `refs/oh-my-codex/src/ralplan/runtime.ts` | |
| 513 | -- `refs/oh-my-codex/src/planning/artifacts.ts` | |
| 514 | - | |
| 515 | -Loader’s `Plan` object is fine as a local helper, but it is nowhere near this level of control. | |
| 516 | - | |
| 517 | -### 3. “Done” is a workflow contract in Ralph | |
| 518 | - | |
| 519 | -This is the single biggest lesson for Loader. | |
| 520 | - | |
| 521 | -Ralph encodes: | |
| 522 | - | |
| 523 | -- persistence until done | |
| 524 | -- mandatory verification | |
| 525 | -- architect verification | |
| 526 | -- retry/fix loops | |
| 527 | -- state transitions | |
| 528 | -- explicit cleanup on completion | |
| 529 | -- a final checklist | |
| 530 | - | |
| 531 | -Relevant file: | |
| 532 | - | |
| 533 | -- `refs/oh-my-codex/skills/ralph/SKILL.md` | |
| 534 | - | |
| 535 | -This directly addresses the exact Loader problems you named: | |
| 536 | - | |
| 537 | -- weak tool follow-through | |
| 538 | -- finishing too early | |
| 539 | -- spending too long in loops | |
| 540 | -- poor task closure | |
| 541 | - | |
| 542 | -### 4. Workflow state lives outside the prompt | |
| 543 | - | |
| 544 | -OMX stores durable mode state under `.omx/` and exposes it through state tools. | |
| 545 | - | |
| 546 | -Relevant files: | |
| 547 | - | |
| 548 | -- `refs/oh-my-codex/src/modes/base.ts` | |
| 549 | -- `refs/oh-my-codex/src/mcp/state-server.ts` | |
| 550 | -- `refs/oh-my-codex/src/mcp/memory-server.ts` | |
| 551 | - | |
| 552 | -That means: | |
| 553 | - | |
| 554 | -- progress survives interruptions | |
| 555 | -- execution can be resumed | |
| 556 | -- handoffs are grounded | |
| 557 | -- context can be audited | |
| 558 | -- the model does not have to remember everything itself | |
| 559 | - | |
| 560 | -### 5. Memory and notepad are explicit tools | |
| 561 | - | |
| 562 | -OMX has project memory and a notepad. | |
| 563 | - | |
| 564 | -That sounds small, but it matters a lot for agent stability. It gives the system somewhere to store: | |
| 565 | - | |
| 566 | -- conventions | |
| 567 | -- known build commands | |
| 568 | -- temporary working notes | |
| 569 | -- durable directives | |
| 570 | - | |
| 571 | -Relevant file: | |
| 572 | - | |
| 573 | -- `refs/oh-my-codex/src/mcp/memory-server.ts` | |
| 574 | - | |
| 575 | -Loader currently rediscovers too much per turn. | |
| 576 | - | |
| 577 | -### 6. Verification is standardized | |
| 578 | - | |
| 579 | -OMX has verification instructions that scale by task size and explicitly require evidence. | |
| 580 | - | |
| 581 | -Relevant file: | |
| 582 | - | |
| 583 | -- `refs/oh-my-codex/src/verification/verifier.ts` | |
| 584 | - | |
| 585 | -Loader has completion heuristics. OMX has verification policy. | |
| 586 | - | |
| 587 | -That is the difference between “the model sounded done” and “the system proved done.” | |
| 588 | - | |
| 589 | -### 7. Doctor / explore / sparkshell reduce prompt waste | |
| 590 | - | |
| 591 | -OMX distinguishes: | |
| 592 | - | |
| 593 | -- health checking (`doctor`) | |
| 594 | -- lightweight read-only exploration (`explore`) | |
| 595 | -- bounded shell-native inspection (`sparkshell`) | |
| 596 | - | |
| 597 | -That is smart. | |
| 598 | - | |
| 599 | -It keeps the main execution loop from becoming the only place everything happens. | |
| 600 | - | |
| 601 | -Relevant files: | |
| 602 | - | |
| 603 | -- `refs/oh-my-codex/src/cli/doctor.ts` | |
| 604 | -- `refs/oh-my-codex/src/cli/explore.ts` | |
| 605 | -- `refs/oh-my-codex/src/cli/sparkshell.ts` | |
| 606 | - | |
| 607 | -### 8. Follow-through is supported outside the agent context window | |
| 608 | - | |
| 609 | -The idle notifications, leader nudges, and continuation prompts in OMX are important. | |
| 610 | - | |
| 611 | -Relevant file: | |
| 612 | - | |
| 613 | -- `refs/oh-my-codex/src/scripts/notify-hook.ts` | |
| 614 | - | |
| 615 | -This is one of the deeper design differences: | |
| 616 | - | |
| 617 | -- Loader tries to keep the model on-task from inside the loop | |
| 618 | -- OMX also nudges, monitors, and routes from outside the loop | |
| 619 | - | |
| 620 | -That is a more robust design. | |
| 621 | - | |
| 622 | -## Comparison matrix | |
| 623 | - | |
| 624 | -| Area | Loader today | `claw-code` | OMX lesson | Takeaway for Loader | | |
| 625 | -|---|---|---|---|---| | |
| 626 | -| Runtime loop | monolithic, heuristic-heavy | typed turn engine | separate mode/workflow from turn runtime | split Loader runtime first | | |
| 627 | -| Tool surface | 6 basic tools | 49 exposed tool specs on main | tools should include workflow/state surfaces | add stateful and diagnostic tools | | |
| 628 | -| Permissions | confirmation-only | permission policy + enforcer | safety belongs in runtime | add modes and boundaries | | |
| 629 | -| Completion | heuristic continuation prompt | stronger runtime summaries | Ralph gives evidence-backed done gates | replace “maybe done” with explicit verification | | |
| 630 | -| Planning | ephemeral numbered list | some plan surfaces | ralplan = persisted, reviewed planning | persist plan artifacts | | |
| 631 | -| Memory/state | none | sessions + compaction + tracing | `.omx/` mode state + memory | add `.loader/` state dir | | |
| 632 | -| Diagnostics | minimal | status/sandbox/doctor/session | doctor/explore/sparkshell | make Loader inspectable | | |
| 633 | -| Testing | unit-heavy, no runtime harness | mock parity harness | workflow runtime is tested like product behavior | build scripted runtime tests | | |
| 634 | -| Extensibility | none | hooks, plugins, MCP surfaces | workflow and notification hooks | add lifecycle hooks later | | |
| 635 | -| Multi-agent | none | agent/team surfaces | team + ralph staffing | defer until solo runtime is trustworthy | | |
| 636 | - | |
| 637 | -## Why Loader’s current weaknesses produce the behavior you described | |
| 638 | - | |
| 639 | -### Poor tool use | |
| 640 | - | |
| 641 | -Root causes: | |
| 642 | - | |
| 643 | -- shallow tool surface | |
| 644 | -- brittle prompt contract | |
| 645 | -- native-vs-ReAct bifurcation | |
| 646 | -- duplicated execution code paths | |
| 647 | -- no typed runtime contract for tool results | |
| 648 | - | |
| 649 | -### Weak follow-through | |
| 650 | - | |
| 651 | -Root causes: | |
| 652 | - | |
| 653 | -- no persistent task state | |
| 654 | -- no approved plan artifact | |
| 655 | -- no explicit verification lane | |
| 656 | -- no final completion checklist | |
| 657 | - | |
| 658 | -### Finishing early | |
| 659 | - | |
| 660 | -Root causes: | |
| 661 | - | |
| 662 | -- completion is heuristic | |
| 663 | -- no required evidence model | |
| 664 | -- no acceptance criteria artifact | |
| 665 | -- no final “prove it” pass | |
| 666 | - | |
| 667 | -### Spending too long on simple tasks | |
| 668 | - | |
| 669 | -Root causes: | |
| 670 | - | |
| 671 | -- the runtime loop tries too many recoveries in one place | |
| 672 | -- the system prompt does not distinguish task modes cleanly | |
| 673 | -- there is no “lightweight inspect” lane like `explore` | |
| 674 | -- the model often has to infer the workflow instead of being routed into one | |
| 675 | - | |
| 676 | -### Model sensitivity | |
| 677 | - | |
| 678 | -Root causes: | |
| 679 | - | |
| 680 | -- behavior is prompt-and-heuristic driven | |
| 681 | -- capability detection is backend-specific and brittle | |
| 682 | -- no workflow artifacts that survive model variance | |
| 683 | - | |
| 684 | -This is why copying OMX’s workflow ideas is so high leverage. It reduces how much we ask the model to improvise. | |
| 685 | - | |
| 686 | -## Concrete implementation targets | |
| 687 | - | |
| 688 | -These are ordered by impact on Loader behavior, not by code convenience. | |
| 689 | - | |
| 690 | -### Target 1: Introduce a real turn engine | |
| 691 | - | |
| 692 | -Goal: | |
| 693 | - | |
| 694 | -- replace the current giant loop with a smaller, typed conversation runtime | |
| 695 | - | |
| 696 | -Implementation target: | |
| 697 | - | |
| 698 | -- create a new `src/loader/runtime/` package | |
| 699 | -- move message/session/tool-result logic out of `src/loader/agent/loop.py` | |
| 700 | -- give tool results a first-class typed representation | |
| 701 | -- unify native, ReAct, and extracted-tool execution through one executor path | |
| 702 | - | |
| 703 | -Why: | |
| 704 | - | |
| 705 | -- this is the foundation for every other improvement | |
| 706 | - | |
| 707 | -### Target 2: Add persistent Loader state under `.loader/` | |
| 708 | - | |
| 709 | -Goal: | |
| 710 | - | |
| 711 | -- make workflow state durable instead of prompt-only | |
| 712 | - | |
| 713 | -Implementation target: | |
| 714 | - | |
| 715 | -- `.loader/state/` | |
| 716 | -- `.loader/sessions/` | |
| 717 | -- `.loader/plans/` | |
| 718 | -- `.loader/notepad.md` | |
| 719 | -- `.loader/project-memory.json` | |
| 720 | - | |
| 721 | -Why: | |
| 722 | - | |
| 723 | -- Loader needs somewhere to store progress, acceptance criteria, and recovered knowledge | |
| 724 | - | |
| 725 | -### Target 3: Separate task modes | |
| 726 | - | |
| 727 | -Goal: | |
| 728 | - | |
| 729 | -- stop treating all requests like immediate tool-execution requests | |
| 730 | - | |
| 731 | -Implementation target: | |
| 732 | - | |
| 733 | -- mode router with at least: | |
| 734 | - - `clarify` | |
| 735 | - - `plan` | |
| 736 | - - `execute` | |
| 737 | - - `verify` | |
| 738 | - | |
| 739 | -Why: | |
| 740 | - | |
| 741 | -- this is the minimum structure needed to stop overthinking simple work and underthinking complex work | |
| 742 | - | |
| 743 | -### Target 4: Replace heuristic completion with an evidence-backed done contract | |
| 744 | - | |
| 745 | -Goal: | |
| 746 | - | |
| 747 | -- make completion explicit and testable | |
| 748 | - | |
| 749 | -Implementation target: | |
| 750 | - | |
| 751 | -- define a `DefinitionOfDone` object per task | |
| 752 | -- require: | |
| 753 | - - acceptance criteria | |
| 754 | - - verification commands | |
| 755 | - - evidence summary | |
| 756 | - - zero pending task items | |
| 757 | - | |
| 758 | -Why: | |
| 759 | - | |
| 760 | -- this is the main fix for premature completion | |
| 761 | - | |
| 762 | -### Target 5: Add `deep-interview`-lite and `ralplan`-lite equivalents | |
| 763 | - | |
| 764 | -Goal: | |
| 765 | - | |
| 766 | -- pull ambiguity reduction and planning review out of the middle of execution | |
| 767 | - | |
| 768 | -Implementation target: | |
| 769 | - | |
| 770 | -- `clarify` mode writes a task brief | |
| 771 | -- `plan` mode writes: | |
| 772 | - - a short implementation plan | |
| 773 | - - a test/verification plan | |
| 774 | - | |
| 775 | -Do not try to copy every OMX feature immediately. Copy the artifact discipline first. | |
| 776 | - | |
| 777 | -### Target 6: Build a real permission model | |
| 778 | - | |
| 779 | -Goal: | |
| 780 | - | |
| 781 | -- move from confirmation prompts to policy-based authorization | |
| 782 | - | |
| 783 | -Implementation target: | |
| 784 | - | |
| 785 | -- permission modes: | |
| 786 | - - `read-only` | |
| 787 | - - `workspace-write` | |
| 788 | - - `danger-full-access` | |
| 789 | -- tool specs declare required permission | |
| 790 | -- file writes enforce workspace boundaries | |
| 791 | -- shell commands go through command classification | |
| 792 | - | |
| 793 | -Why: | |
| 794 | - | |
| 795 | -- this is both safety and behavior quality | |
| 796 | - | |
| 797 | -### Target 7: Harden file and shell tools | |
| 798 | - | |
| 799 | -Goal: | |
| 800 | - | |
| 801 | -- make tool use trustworthy enough for automation | |
| 802 | - | |
| 803 | -Implementation target: | |
| 804 | - | |
| 805 | -- size limits | |
| 806 | -- binary detection | |
| 807 | -- symlink/traversal protection | |
| 808 | -- structured patch/diff return values | |
| 809 | -- shell command semantics and mutability classification | |
| 810 | - | |
| 811 | -### Target 8: Add `loader doctor`, `loader status`, and `loader session` | |
| 812 | - | |
| 813 | -Goal: | |
| 814 | - | |
| 815 | -- make Loader operable as a product | |
| 816 | - | |
| 817 | -Implementation target: | |
| 818 | - | |
| 819 | -- backend health | |
| 820 | -- model capability snapshot | |
| 821 | -- workspace detection | |
| 822 | -- write-access detection | |
| 823 | -- test/build command detection | |
| 824 | -- active session summary | |
| 825 | - | |
| 826 | -Why: | |
| 827 | - | |
| 828 | -- better operator feedback means less guesswork in the agent loop | |
| 829 | - | |
| 830 | -### Target 9: Add memory/notepad tools | |
| 831 | - | |
| 832 | -Goal: | |
| 833 | - | |
| 834 | -- give Loader durable short-term and long-term memory | |
| 835 | - | |
| 836 | -Implementation target: | |
| 837 | - | |
| 838 | -- read/write project memory | |
| 839 | -- append working notes | |
| 840 | -- store user directives and repo conventions | |
| 841 | - | |
| 842 | -Why: | |
| 843 | - | |
| 844 | -- this reduces re-discovery and improves follow-through across turns | |
| 845 | - | |
| 846 | -### Target 10: Add a lightweight read-only inspect lane | |
| 847 | - | |
| 848 | -Goal: | |
| 849 | - | |
| 850 | -- avoid using the full agent loop for every lookup | |
| 851 | - | |
| 852 | -Implementation target: | |
| 853 | - | |
| 854 | -- `loader explore` or equivalent internal mode | |
| 855 | -- optimized for: | |
| 856 | - - file/symbol lookup | |
| 857 | - - pattern discovery | |
| 858 | - - relationship questions | |
| 859 | - | |
| 860 | -Why: | |
| 861 | - | |
| 862 | -- simple tasks should stay cheap and fast | |
| 863 | - | |
| 864 | -### Target 11: Add a parity harness | |
| 865 | - | |
| 866 | -Goal: | |
| 867 | - | |
| 868 | -- improve behavior intentionally instead of impressionistically | |
| 869 | - | |
| 870 | -Implementation target: | |
| 871 | - | |
| 872 | -- scripted mock backend scenarios for: | |
| 873 | - - simple read | |
| 874 | - - multi-tool turn | |
| 875 | - - denied permission | |
| 876 | - - write/edit success | |
| 877 | - - verification-required task | |
| 878 | - - premature completion rejection | |
| 879 | - - looped/duplicate action prevention | |
| 880 | - | |
| 881 | -Why: | |
| 882 | - | |
| 883 | -- this is how Loader becomes reliable | |
| 884 | - | |
| 885 | -### Target 12: Add workflow-aware prompts and capability profiles | |
| 886 | - | |
| 887 | -Goal: | |
| 888 | - | |
| 889 | -- make Loader less brittle across models | |
| 890 | - | |
| 891 | -Implementation target: | |
| 892 | - | |
| 893 | -- replace one generic system prompt with mode-specific prompts | |
| 894 | -- add provider/model capability profiles: | |
| 895 | - - native tools | |
| 896 | - - streaming | |
| 897 | - - context budget | |
| 898 | - - preferred tool-call format | |
| 899 | - - verification strictness | |
| 900 | - | |
| 901 | -Why: | |
| 902 | - | |
| 903 | -- behavior should be shaped by runtime policy, not guessed from model substrings | |
| 904 | - | |
| 905 | -## Priority order | |
| 906 | - | |
| 907 | -This section was rewritten after a deeper validation pass against the actual code in `refs/claw-code` and `refs/oh-my-codex`, plus firsthand spot-checks of Loader's runtime. The deeper review confirmed every load-bearing claim in this report and surfaced one structural reorder: **the Definition-of-Done work is the user's actual pain point and should land before permission modes**, not after, because permissions are a safety win and DoD is the behavior win. | |
| 908 | - | |
| 909 | -### P0: Stabilize before changing behavior (Sprint 00) | |
| 910 | - | |
| 911 | -- write a failing regression test for the `tool_call_id` bug at `agent/loop.py:885,906` *first*, before any harness work — it proves the bug is real and proves the harness exists in one move | |
| 912 | -- scope pytest discovery so `refs/` stops contaminating collection | |
| 913 | -- exclude `refs/` from ruff and mypy too | |
| 914 | -- make `uv run pytest` work out of the box | |
| 915 | -- port the scenario taxonomy from `refs/claw-code/rust/crates/rusty-claude-cli/tests/mock_parity_harness.rs` | |
| 916 | -- rewrite `README.md` (currently still says "FortranGoingOnForty") | |
| 917 | -- baseline parity checklist for current runtime behavior | |
| 918 | - | |
| 919 | -### P1: Replace the loop with a real runtime (Sprint 01) | |
| 920 | - | |
| 921 | -- new `src/loader/runtime/` package with a typed turn engine | |
| 922 | -- unify the native, ReAct, and "extracted JSON fallback" tool execution paths into one executor | |
| 923 | -- fix the named bugs from Sprint 00's failing tests (`tool_call_id`, duplicate execution path) | |
| 924 | -- replace substring-based `NATIVE_TOOL_MODELS`/`NO_TOOL_MODELS` model detection with a `runtime/capabilities.py` profile system — Loader needs to behave consistently across model choices | |
| 925 | -- structured `TurnSummary` output | |
| 926 | - | |
| 927 | -### P2: The behavior fix the user actually asked for (Sprint 02) | |
| 928 | - | |
| 929 | -- `DefinitionOfDone` object per task: acceptance criteria, verification commands, evidence summary, pending/completed task items | |
| 930 | -- explicit verify phase that runs the verification commands and gates completion on evidence | |
| 931 | -- fix loop: verification failure returns to execution, not to final answer | |
| 932 | -- minimum `.loader/` directory shape (`.loader/dod/`) — full session/memory layout deferred to Sprint 05 | |
| 933 | - | |
| 934 | -This is the highest-leverage behavioral change in the entire plan and is the direct answer to "finishing too early" and "weak follow-through." | |
| 935 | - | |
| 936 | -### P3: Safety as policy, not as confirmation prompt (Sprint 03) | |
| 937 | - | |
| 938 | -- permission modes: `read-only`, `workspace-write`, `danger-full-access` | |
| 939 | -- three-event tool lifecycle hooks (`pre_tool_use`, `post_tool_use`, `post_tool_use_failure`) modeled directly on `refs/claw-code/rust/crates/runtime/src/hooks.rs` | |
| 940 | -- refactor `safeguards.py` (duplicate detection, validation, rollback) into pre-tool hook implementations rather than ad-hoc method calls | |
| 941 | -- file operation hardening (workspace boundary, symlink, size limits, binary detection, structured patches) | |
| 942 | -- shell operation hardening | |
| 943 | -- expose active mode in CLI/TUI status | |
| 944 | - | |
| 945 | -Hooks land alongside permissions because every later sprint hangs new behavior (verification, validation, observability) on the same lifecycle. | |
| 946 | - | |
| 947 | -### P4: Stop improvising one workflow for everything (Sprint 04) | |
| 948 | - | |
| 949 | -- mode router: clarify, plan, execute, verify (verify already exists from Sprint 02) | |
| 950 | -- clarify artifact written to `.loader/briefs/` | |
| 951 | -- planning artifacts (implementation plan + verification plan) written to `.loader/plans/` and fed into the existing DoD object | |
| 952 | -- tool prerequisites pulled forward from Sprint 06: `TodoWrite` (the "zero pending tasks" gate is empty without it) and `AskUserQuestion` (clarify rounds) | |
| 953 | - | |
| 954 | -### P5: Durable continuity (Sprint 05) | |
| 955 | - | |
| 956 | -- full `.loader/` state directory under the layout already started in Sprint 02 | |
| 957 | -- session persistence and resume | |
| 958 | -- transcript compaction with priority-aware summarization (model the design on `refs/claw-code/rust/crates/runtime/src/summary_compression.rs`) | |
| 959 | -- memory/notepad surfaces | |
| 960 | -- usage/cost tracking | |
| 961 | - | |
| 962 | -### P6: Operability and tool-surface expansion (Sprint 06) | |
| 963 | - | |
| 964 | -- `loader doctor`, `loader status`, `loader session` | |
| 965 | -- read-only explore lane | |
| 966 | -- broader tool surface (diff/patch-aware editing, git helpers, structured ask-user, etc.) — `TodoWrite` and `AskUserQuestion` already exist from Sprint 04 | |
| 967 | - | |
| 968 | -### Deferred indefinitely | |
| 969 | - | |
| 970 | -- workflow hooks beyond the runtime tool lifecycle (notification/idle nudges, leader monitoring) | |
| 971 | -- task/team/subagent orchestration | |
| 972 | -- broad MCP ecosystem | |
| 973 | -- richer plugin systems | |
| 974 | - | |
| 975 | -These are real wins in `claw-code`/OMX, but Loader should not pursue them until the solo runtime is trustworthy. | |
| 976 | - | |
| 977 | -## What Loader should copy directly, and what it should not | |
| 978 | - | |
| 979 | -### Copy directly | |
| 980 | - | |
| 981 | -- typed turn runtime | |
| 982 | -- permission model | |
| 983 | -- file/shell hardening | |
| 984 | -- session persistence | |
| 985 | -- compaction | |
| 986 | -- doctor/status/session surfaces | |
| 987 | -- workflow artifacts | |
| 988 | -- evidence-backed verification | |
| 989 | -- parity harness discipline | |
| 990 | - | |
| 991 | -### Copy in simplified form | |
| 992 | - | |
| 993 | -- deep-interview | |
| 994 | -- ralplan | |
| 995 | -- ralph | |
| 996 | -- memory/notepad | |
| 997 | -- explore vs full-execution split | |
| 998 | - | |
| 999 | -### Do not copy blindly yet | |
| 1000 | - | |
| 1001 | -- full tmux/team runtime | |
| 1002 | -- huge command surface | |
| 1003 | -- Discord/openclaw notification stack | |
| 1004 | -- broad MCP ecosystem | |
| 1005 | - | |
| 1006 | -Loader should first become a trustworthy single-agent local runtime. After that, team orchestration will actually help. | |
| 1007 | - | |
| 1008 | -## Recommended Loader architecture direction | |
| 1009 | - | |
| 1010 | -If we want behavior closer to `claw-code` without losing Loader’s simplicity, I would steer toward: | |
| 1011 | - | |
| 1012 | -### Layer 1: Runtime core | |
| 1013 | - | |
| 1014 | -- typed `TurnRuntime` | |
| 1015 | -- `SessionStore` | |
| 1016 | -- `PermissionPolicy` | |
| 1017 | -- `ToolExecutor` | |
| 1018 | -- `VerificationEngine` | |
| 1019 | - | |
| 1020 | -### Layer 2: Workflow layer | |
| 1021 | - | |
| 1022 | -- `ClarifyWorkflow` | |
| 1023 | -- `PlanWorkflow` | |
| 1024 | -- `ExecuteWorkflow` | |
| 1025 | -- `VerifyWorkflow` | |
| 1026 | - | |
| 1027 | -### Layer 3: Product surfaces | |
| 1028 | - | |
| 1029 | -- TUI | |
| 1030 | -- CLI | |
| 1031 | -- `doctor` | |
| 1032 | -- `status` | |
| 1033 | -- `session` | |
| 1034 | -- `explore` | |
| 1035 | - | |
| 1036 | -### Layer 4: Optional future orchestration | |
| 1037 | - | |
| 1038 | -- hooks | |
| 1039 | -- background verification | |
| 1040 | -- multi-agent/task orchestration | |
| 1041 | - | |
| 1042 | -That is a better fit for Loader than trying to clone all of OMX wholesale. | |
| 1043 | - | |
| 1044 | -## Immediate conclusions | |
| 1045 | - | |
| 1046 | -1. Loader’s biggest problems are architectural, not just prompt-related. | |
| 1047 | -2. `claw-code` is strongest where Loader is weakest: runtime contract, permissions, sessions, diagnostics, parity. | |
| 1048 | -3. OMX is strongest where Loader is currently almost absent: clarification, planning discipline, durable state, completion/verification loops. | |
| 1049 | -4. The fastest path to “better model behavior today” is not adding more heuristics. It is adding: | |
| 1050 | - - workflow artifacts | |
| 1051 | - - explicit verification | |
| 1052 | - - persistent state | |
| 1053 | - - a smaller, more trustworthy turn engine | |
| 1054 | - | |
| 1055 | -## Sprint scaffolding | |
| 1056 | - | |
| 1057 | -After the deeper validation pass the original five-sprint plan was reshaped into seven sprints. The reshape splits the most ambitious sprint (the old Sprint 03, which bundled mode router + clarify + plan + DoD + verify/fix into one) and reorders so the user's actual pain point lands sooner. Sprint scaffolding lives under: | |
| 1058 | - | |
| 1059 | -- `.docs/sprints/index.md` | |
| 1060 | -- `.docs/sprints/sprint00.md` — Foundation, Measurement, and Parity Harness | |
| 1061 | -- `.docs/sprints/sprint01.md` — Turn Engine, Tool Contract, and Capability Profiles | |
| 1062 | -- `.docs/sprints/sprint02.md` — Definition of Done and Verify/Fix Loop | |
| 1063 | -- `.docs/sprints/sprint03.md` — Permission Modes and Tool Lifecycle Hooks | |
| 1064 | -- `.docs/sprints/sprint04.md` — Mode Router, Clarify, and Plan Artifacts | |
| 1065 | -- `.docs/sprints/sprint05.md` — Session State, Memory, and Compaction | |
| 1066 | -- `.docs/sprints/sprint06.md` — Doctor, Explore, Status, and Tool Surface Expansion | |
| 1067 | - | |
| 1068 | -## Recommended next move | |
| 1069 | - | |
| 1070 | -Start with Sprint 00, and start Sprint 00 with the failing regression test. | |
| 1071 | - | |
| 1072 | -Reason: | |
| 1073 | - | |
| 1074 | -- Loader needs a measurable baseline and a safer runtime before adding more behavior | |
| 1075 | -- the `tool_call_id` bug at `agent/loop.py:885,906` is proof that untested code paths are silently broken | |
| 1076 | -- writing the failing test first proves both the bug and the harness in one move | |
| 1077 | -- otherwise every feature sprint will be built on unstable agent semantics | |
| 1078 | - | |
| 1079 | -The execution phase should then be: | |
| 1080 | - | |
| 1081 | -1. lock down the runtime and test harness (Sprint 00) | |
| 1082 | -2. replace the loop with a typed runtime and capability profiles (Sprint 01) | |
| 1083 | -3. define and enforce the completion contract (Sprint 02) | |
| 1084 | -4. add the policy-based safety layer with hooks (Sprint 03) | |
| 1085 | -5. add workflow modes and planning artifacts on top (Sprint 04) | |
| 1086 | -6. then widen the durability and product surfaces (Sprints 05 and 06) | |
| 1087 | - | |
| 1088 | -## Plan adjustments after deeper review | |
| 1089 | - | |
| 1090 | -The following changes were applied to the original report after a firsthand validation pass against the actual code in `refs/claw-code` and `refs/oh-my-codex`, plus spot-checks of Loader's runtime. | |
| 1091 | - | |
| 1092 | -### Verified directly against the code | |
| 1093 | - | |
| 1094 | -- **`tool_call_id` bug confirmed at `src/loader/agent/loop.py:885` and `:906`.** Both call sites construct `Message(role=Role.TOOL, content=..., tool_call_id=tool_call.id)`, but `Message` (`src/loader/llm/base.py:33-39`) has no such field. They live on the duplicate-suppression and pre-validation branches and would crash on first execution. Zero integration coverage. | |
| 1095 | -- **Pytest discovery is broken by default.** `uv run pytest --collect-only` picks up `refs/claw-code/tests/test_porting_workspace.py` and fails to import `loader` because there is no `tool.pytest.ini_options` block in `pyproject.toml`. | |
| 1096 | -- **Loop monolith confirmed by line counts.** `agent/loop.py` is 1929 LOC, `agent/reasoning.py` is 1196, `agent/safeguards.py` is 1079 — roughly 4200 lines of orchestration in one cluster. | |
| 1097 | -- **claw-code's `run_turn()` shape** is exactly as the report describes. Read directly at `refs/claw-code/rust/crates/runtime/src/conversation.rs:295-470`. Typed message build → tool extraction → pre-hook → permission check → execute → post-hook (success or failure variant) → typed `ConversationMessage::tool_result()` → push → repeat. ~175 lines of clean code. | |
| 1098 | -- **claw-code permission modes** are `ReadOnly` / `WorkspaceWrite` / `DangerFullAccess` (plus `Prompt` and `Allow`), defined at `refs/claw-code/rust/crates/runtime/src/permissions.rs:8-27`. The 10MB read/write caps, binary detection, workspace boundary check, and structured patch outputs in `file_ops.rs` are all real. | |
| 1099 | -- **claw-code hooks** are `PreToolUse` / `PostToolUse` / `PostToolUseFailure`, defined at `refs/claw-code/rust/crates/runtime/src/hooks.rs:19-34` and wired into the conversation loop at lines 371, 427-453. | |
| 1100 | -- **OMX skills are real and even more rigorous than the report described.** `ralplan` enforces a max-5-iteration Critic loop with sequential Architect→Critic ordering. `ralph` has explicit phase enums (`starting`/`executing`/`verifying`/`fixing`/`complete`/`failed`/`cancelled`) persisted via `state_write` to `.omx/state/{mode}-state.json`. The verifier in `src/verification/verifier.ts` scales by task size with concrete file-count thresholds. | |
| 1101 | - | |
| 1102 | -### Corrected facts | |
| 1103 | - | |
| 1104 | -- **Tool count: 49, not 40.** `refs/claw-code/rust/crates/tools/src/lib.rs` exposes 49 `ToolSpec` entries in `mvp_tool_specs()`. Doesn't change the lesson, but worth knowing. | |
| 1105 | -- **claw-code permissions have a third layer.** Beyond `PermissionMode` and per-tool requirements, `PermissionPolicy` carries three rule lists (`allow_rules`, `deny_rules`, `ask_rules`) for context-specific overrides. Loader can land the mode layer first and defer the rule layer. | |
| 1106 | -- **claw-code summary compression is sophisticated.** It's not message-level truncation — it's line-level prioritization with deduplication and budget enforcement at `refs/claw-code/rust/crates/runtime/src/summary_compression.rs`. Sprint 05 should model on this rather than reinventing. | |
| 1107 | - | |
| 1108 | -### Structural plan changes | |
| 1109 | - | |
| 1110 | -- **The old Sprint 03 was split.** It bundled mode router + clarify + plan + DoD + verify/fix into one sprint, which is essentially "ralplan + ralph + verifier" simultaneously. The DoD/verify-fix half became the new Sprint 02 (highest-leverage behavioral fix). The mode router / clarify / plan half became the new Sprint 04. | |
| 1111 | -- **The old Sprint 02 (permissions) became the new Sprint 03** and was reordered to land *after* DoD. Permissions are a safety win, not a behavior win, and the user's actual complaints are about behavior. DoD lands first. | |
| 1112 | -- **Hooks landed in the same sprint as permissions.** The original plan split them across sprints; that creates rework because every later runtime addition (verification, observability, validation) wants the same lifecycle. Sprint 03 owns both. | |
| 1113 | -- **Capability profiles became a Sprint 01 deliverable.** They were Target 12 in the original report and orphaned from the sprint plan. They belong in the runtime layer and are critical for the user's "behave consistently across model choices" goal. | |
| 1114 | -- **The minimum `.loader/` directory shape moves to Sprint 02** (just `.loader/dod/`). The full session/memory/compaction layout stays in Sprint 05. This unblocks Sprint 02 and Sprint 04 from waiting on Sprint 05. | |
| 1115 | -- **`TodoWrite` and `AskUserQuestion` move from Sprint 06 to Sprint 04** as prerequisites for the clarify mode and the "zero pending tasks" gate. The broad tool-surface expansion stays in Sprint 06. | |
| 1116 | -- **Sprint 00's first deliverable is now the failing regression test** for the `tool_call_id` bug, before any harness work. It proves the bug and proves the harness exist in one move. | |
.docs/audit_sprints/index.mddeleted@@ -1,82 +0,0 @@ | ||
| 1 | -# Loader Audit Cleanup Sprint Index | |
| 2 | - | |
| 3 | -These sprints translate the 2026-04-07 audit in `.docs/audit.txt` into a post-Sprint-08 cleanup plan that is explicitly about deleting contract debt, not wrapping it in more helpers. | |
| 4 | - | |
| 5 | -The repo has moved since the audit snapshot. On this planning branch: | |
| 6 | - | |
| 7 | -- `uv run pytest -q` is green with `226 passed` | |
| 8 | -- Sprint 08's prompt builder, turn-phase tracking, and permission inspection surfaces are already present on `HEAD` | |
| 9 | -- Sprint 09 interactive validation has started; `loader doctor` now distinguishes metadata reachability from live chat readiness, and both native-capable and `json_tag` Ollama lanes currently fail the live chat probe on `/api/chat` with HTTP 500 | |
| 10 | -- Sprint 10's runtime-ownership inversion is now materially in place: `src/loader/runtime/` no longer reaches into `Agent` directly, and the remaining legacy dependencies are explicit `RuntimeLegacyServices` seams | |
| 11 | -- Sprint 11 has already deleted several puppet behaviors and collapsed the raw-text fallback stack onto the shared parser used by the runtime and Ollama text fallback paths | |
| 12 | -- Sprint 13 has now finished the runtime-side retirement work: | |
| 13 | - - `src/loader/runtime/` has `0` direct imports from `agent/*` | |
| 14 | - - shared safeguard, rollback, recovery, parsing, reasoning-type, and task-classification surfaces now live under `src/loader/runtime/` | |
| 15 | - - the remaining debt is narrower: | |
| 16 | - - workflow modes are honestly scoped, but the refs' deeper protocol and routing discipline are still absent | |
| 17 | - - `agent/reasoning.py` and `agent/safeguards.py` still hold some real legacy behavior behind explicit seams | |
| 18 | - - interactive validation against a healthy live backend is still blocked on Ollama `/api/chat` failures | |
| 19 | - | |
| 20 | -## Sprint 09 Ownership Baseline | |
| 21 | - | |
| 22 | -- `src/loader/runtime/conversation.py`: 881 lines, `49` `self.agent.` reach-ins | |
| 23 | -- `src/loader/runtime/assistant_turns.py`: `23` `self.agent.` reach-ins | |
| 24 | -- `src/loader/runtime/tool_batches.py`: `26` `self.agent.` reach-ins | |
| 25 | -- `src/loader/runtime/completion_policy.py`: `9` `self.agent.` reach-ins | |
| 26 | -- `src/loader/runtime/repair.py`: `4` `self.agent.` reach-ins | |
| 27 | -- `src/loader/runtime/explore.py`: `12` `self.agent.` reach-ins | |
| 28 | -- `src/loader/agent/loop.py`: 1108 lines | |
| 29 | -- `src/loader/agent/reasoning.py`: 1235 lines | |
| 30 | -- `src/loader/agent/safeguards.py`: 1142 lines | |
| 31 | -- `src/loader/agent/recovery.py`: 648 lines | |
| 32 | - | |
| 33 | -## Current runtime ownership status | |
| 34 | - | |
| 35 | -- `src/loader/runtime/` direct `self.agent.` reach-ins: `0` | |
| 36 | -- runtime ownership now flows through `RuntimeContext` plus explicit `RuntimeLegacyServices` adapters | |
| 37 | -- `src/loader/runtime/` direct imports from `agent/*`: `0` | |
| 38 | -- the runtime-side ownership migration phase is complete; the remaining work is closure reporting and any future follow-on deletion beyond the runtime package | |
| 39 | - | |
| 40 | -## Current legacy-tree status | |
| 41 | - | |
| 42 | -- `src/loader/agent/loop.py`: `721` lines, down `390` lines from the Sprint 09 baseline of `1111` | |
| 43 | -- `src/loader/agent/reasoning.py`: `649` lines, down `586` lines from the Sprint 09 baseline of `1235` | |
| 44 | -- `src/loader/agent/safeguards.py`: `595` lines, down `547` lines from the Sprint 09 baseline of `1142` | |
| 45 | -- `src/loader/agent/recovery.py`: `12` lines, down `636` lines from the Sprint 09 baseline of `648` | |
| 46 | -- shared raw-text parsing now runs through `src/loader/runtime/parsing.py`; `src/loader/agent/parsing.py` is compatibility-only and the stale `_extract_raw_json_tool_calls(...)` fallback is gone from `src/loader/agent/loop.py` | |
| 47 | -- deleted assistant-puppeting behaviors: | |
| 48 | - - post-action follow-up suffix | |
| 49 | - - first-turn `[` prefill trick | |
| 50 | - - fake-tool narration scolding | |
| 51 | - - deflection repair prompts | |
| 52 | - - self-critique reroute | |
| 53 | - - non-mutating completion nudge | |
| 54 | - - text-loop bailout | |
| 55 | - - action-loop bailout on successful repeated tool patterns | |
| 56 | -- empty-output handling is now one honest retry followed by explicit failure instead of five fake assistant continuation prompts | |
| 57 | -- the remaining debt is no longer parser fragmentation or hidden runtime ownership; it is the narrower set of explicit legacy callbacks, streamed safeguards, workflow-depth gaps, and still-blocked live backend validation | |
| 58 | - | |
| 59 | -## Phase 1: Validate Before Deleting | |
| 60 | - | |
| 61 | -- [Sprint 09](sprint09.md) — Interactive Validation, Baselines, and Guardrails | |
| 62 | - | |
| 63 | -## Phase 2: Tighten the Runtime Contract | |
| 64 | - | |
| 65 | -- [Sprint 10](sprint10.md) — Runtime Context and Ownership Inversion | |
| 66 | -- [Sprint 11](sprint11.md) — Recovery Deletion and Tool Parsing Unification | |
| 67 | - | |
| 68 | -## Phase 3: Finish the Behavioral Cleanup | |
| 69 | - | |
| 70 | -- [Sprint 12](sprint12.md) — Workflow Protocol Hardening and Decomposition Decision | |
| 71 | -- [Sprint 13](sprint13.md) — Legacy Runtime Retirement and Safeguard Refactor | |
| 72 | -- [Sprint 13 Closure](sprint13_closure.md) — Final scoreboard, remaining debt, and audit-closure honesty pass | |
| 73 | -- [Trunk Sitrep](trunk_sitrep.md) — Divergence snapshot and integration recommendation against current `trunk` | |
| 74 | - | |
| 75 | -## Working principles | |
| 76 | - | |
| 77 | -- Prefer deletion over relocation. A sprint that only moves code around is not done. | |
| 78 | -- Each discrete fix gets its own commit. Do not batch unrelated parser, workflow, safety, and runtime-contract changes into one "cleanup" commit. | |
| 79 | -- Pair each behavior change with deterministic coverage or a committed interactive validation artifact. | |
| 80 | -- Keep `.docs/PARITY.md` and any residual-debt notes honest as claims change. | |
| 81 | -- Stop at the sprint boundary. If a sprint uncovers a larger follow-on job, create the next sprint or artifact instead of expanding the current one mid-flight. | |
| 82 | -- Treat interactive testing as first-class evidence. The parity harness is necessary, but it is not enough for deciding which recovery layers are truly load-bearing. | |
.docs/audit_sprints/sprint09.mddeleted@@ -1,99 +0,0 @@ | ||
| 1 | -# Sprint 09: Interactive Validation, Baselines, and Guardrails | |
| 2 | - | |
| 3 | -## Prerequisites | |
| 4 | - | |
| 5 | -Sprint 08 | |
| 6 | - | |
| 7 | -## Goals | |
| 8 | - | |
| 9 | -Turn the audit from a strong paper read into an executable cleanup baseline. | |
| 10 | - | |
| 11 | -Before deleting recovery layers, Loader needs two things the repo does not have yet: | |
| 12 | - | |
| 13 | -- direct evidence from real interactive runs across at least one native-tool path and one raw-text/smaller-model path | |
| 14 | -- guardrail coverage for the specific parser/runtime blind spots the audit surfaced | |
| 15 | - | |
| 16 | -This sprint is intentionally narrow. It should produce evidence, close the known stale allowlist bug, and leave the larger contract surgery to later sprints. | |
| 17 | - | |
| 18 | -## Deliverables | |
| 19 | - | |
| 20 | -### 1. Interactive validation matrix | |
| 21 | - | |
| 22 | -Run a fixed task matrix against at least: | |
| 23 | - | |
| 24 | -- one native-tool-capable backend/profile | |
| 25 | -- one smaller or raw-text-prone backend/profile | |
| 26 | - | |
| 27 | -Capture for each run: | |
| 28 | - | |
| 29 | -- task prompt | |
| 30 | -- active capability profile | |
| 31 | -- whether native tools or raw-text fallback fired | |
| 32 | -- phase trace | |
| 33 | -- final response quality | |
| 34 | -- verification outcome | |
| 35 | -- which recovery layers fired and whether they helped or harmed | |
| 36 | - | |
| 37 | -Persist the report under `.docs/audit_sprints/` so later deletion sprints can cite real evidence instead of memory. | |
| 38 | - | |
| 39 | -### 2. Parser blind-spot guardrails | |
| 40 | - | |
| 41 | -Close the audit's most concrete regression before deeper parser work: | |
| 42 | - | |
| 43 | -- add deterministic coverage for raw-text recovery of `TodoWrite` | |
| 44 | -- add deterministic coverage for raw-text recovery of `patch` | |
| 45 | -- add deterministic coverage for at least one newer workflow/operator tool such as `AskUserQuestion` | |
| 46 | -- stop hardcoding the six-tool allowlist in `agent/loop.py` | |
| 47 | - | |
| 48 | -This sprint does not need the final parser architecture yet. It does need to stop widening the gap every time the registry grows. | |
| 49 | - | |
| 50 | -### 3. Contract baseline artifacts | |
| 51 | - | |
| 52 | -Record the current cleanup baseline in a committed artifact: | |
| 53 | - | |
| 54 | -- `self.agent.` reach-in counts across `src/loader/runtime/` | |
| 55 | -- line counts for the major legacy files under `src/loader/agent/` | |
| 56 | -- inventory of every remaining recovery behavior, including: | |
| 57 | - - owner | |
| 58 | - - trigger | |
| 59 | - - dependency | |
| 60 | - - current test coverage | |
| 61 | - - proposed disposition: `delete`, `gate`, or `keep` | |
| 62 | - | |
| 63 | -This inventory becomes the checklist for Sprint 11 and Sprint 13. | |
| 64 | - | |
| 65 | -### 4. Deletion criteria | |
| 66 | - | |
| 67 | -Write down the rules for what survives: | |
| 68 | - | |
| 69 | -- default to delete unless the interactive matrix shows the recovery path is materially load-bearing | |
| 70 | -- if a behavior stays, tie it to an explicit capability-profile condition or session-level guardrail | |
| 71 | -- forbid new fake-assistant continuation text unless a future sprint explicitly re-approves it with evidence | |
| 72 | - | |
| 73 | -## Commit slicing | |
| 74 | - | |
| 75 | -- one commit for the interactive validation artifact and baseline metrics | |
| 76 | -- one commit for each new raw-parser regression test cluster | |
| 77 | -- one commit for the allowlist fix or registry-derived fallback guard | |
| 78 | -- one commit for the recovery inventory / disposition table | |
| 79 | - | |
| 80 | -## Testing strategy | |
| 81 | - | |
| 82 | -- `uv run pytest -q` | |
| 83 | -- targeted runtime/parity coverage for raw-text recovery of newer tools | |
| 84 | -- at least one committed interactive validation report from a native-tool profile | |
| 85 | -- at least one committed interactive validation report from a smaller/raw-text-prone profile | |
| 86 | - | |
| 87 | -## Definition of done | |
| 88 | - | |
| 89 | -- Loader has committed interactive evidence for the contract discussion instead of only paper analysis | |
| 90 | -- the stale six-tool raw-extraction regression is closed and covered | |
| 91 | -- every remaining recovery layer has an owner, a disposition, and a later sprint target | |
| 92 | -- the repo has a stable baseline for reach-ins, line counts, and runtime heuristics before larger deletion work starts | |
| 93 | - | |
| 94 | -## Explicitly out of scope | |
| 95 | - | |
| 96 | -- introducing `RuntimeContext` | |
| 97 | -- deleting multiple recovery layers at once | |
| 98 | -- redesigning clarify/plan workflows | |
| 99 | -- refactoring `agent/safeguards.py` in bulk | |
.docs/audit_sprints/sprint09_baseline.mddeleted@@ -1,108 +0,0 @@ | ||
| 1 | -# Sprint 09 Baseline and Recovery Inventory | |
| 2 | - | |
| 3 | -## Snapshot | |
| 4 | - | |
| 5 | -- Branch: `cleanup-audit-plan` | |
| 6 | -- Worktree: `/tmp/loader-audit-cleanup` | |
| 7 | -- Source snapshot for this baseline: `5c10aab` (`Harden raw tool-call fallback coverage`) | |
| 8 | -- Targeted verification completed before writing this baseline: | |
| 9 | - - `uv run pytest -q tests/test_parsing.py` | |
| 10 | - - `uv run pytest -q tests/test_runtime_harness.py -k 'raw_json or native_and_raw_tool_paths_share_executor_trace or runtime_parity_manifest_matches_implemented_cases'` | |
| 11 | -- Repo-wide verification after the latest Sprint 09 characterization update: | |
| 12 | - - `uv run pytest -q` → `191 passed` | |
| 13 | -- Doctor surface update: | |
| 14 | - - `f60523c` adds a dedicated live chat probe so `loader doctor` now reports `backend` reachability separately from `/api/chat` readiness | |
| 15 | -- New Sprint 09 guardrails now cover raw-text recovery for: | |
| 16 | - - `read` | |
| 17 | - - `TodoWrite` | |
| 18 | - - `patch` | |
| 19 | - - `AskUserQuestion` | |
| 20 | - | |
| 21 | -## Runtime Ownership Baseline | |
| 22 | - | |
| 23 | -| File | Lines | `self.agent.` reach-ins | | |
| 24 | -| --- | ---: | ---: | | |
| 25 | -| `src/loader/runtime/conversation.py` | 881 | 49 | | |
| 26 | -| `src/loader/runtime/assistant_turns.py` | 155 | 23 | | |
| 27 | -| `src/loader/runtime/tool_batches.py` | 372 | 26 | | |
| 28 | -| `src/loader/runtime/finalization.py` | 339 | 8 | | |
| 29 | -| `src/loader/runtime/completion_policy.py` | 187 | 9 | | |
| 30 | -| `src/loader/runtime/repair.py` | 208 | 4 | | |
| 31 | -| `src/loader/runtime/explore.py` | 220 | 12 | | |
| 32 | - | |
| 33 | -This is the migration scoreboard for Sprint 10. The goal is not only to move code around, but to remove `Agent` as the runtime's implicit data model. | |
| 34 | - | |
| 35 | -## Legacy Tree Baseline | |
| 36 | - | |
| 37 | -| File | Lines | | |
| 38 | -| --- | ---: | | |
| 39 | -| `src/loader/agent/loop.py` | 1111 | | |
| 40 | -| `src/loader/agent/reasoning.py` | 1235 | | |
| 41 | -| `src/loader/agent/safeguards.py` | 1142 | | |
| 42 | -| `src/loader/agent/recovery.py` | 648 | | |
| 43 | - | |
| 44 | -This is the subtraction scoreboard for Sprint 11 and Sprint 13. | |
| 45 | - | |
| 46 | -## Sprint 11 Progress Against This Baseline | |
| 47 | - | |
| 48 | -- `src/loader/agent/loop.py`: `1111` -> `926` (`-185`) | |
| 49 | -- `src/loader/runtime/conversation.py`: `881` -> `828` (`-53`) | |
| 50 | -- `src/loader/runtime/repair.py`: `208` -> `146` (`-62`) | |
| 51 | -- `src/loader/runtime/completion_policy.py`: `187` -> `182` (`-5`) | |
| 52 | -- `src/loader/agent/parsing.py`: `182` -> `276` | |
| 53 | - - this file grew because it is now the shared owner of raw-text parsing after Sprint 11 deleted the legacy loop extractor | |
| 54 | -- parser-contract status: | |
| 55 | - - `_extract_raw_json_tool_calls(...)` has been deleted from `src/loader/agent/loop.py` | |
| 56 | - - `src/loader/runtime/repair.py`, `src/loader/runtime/explore.py`, and `src/loader/llm/ollama.py` now converge on the shared parser | |
| 57 | -- deleted behaviors relative to this inventory: | |
| 58 | - - prefill trick | |
| 59 | - - fake-tool narration repair | |
| 60 | - - deflection repair | |
| 61 | - - post-action follow-up suffix | |
| 62 | -- tightened behavior relative to this inventory: | |
| 63 | - - empty-response handling is now one honest retry plus explicit failure instead of five fake assistant continuation prompts | |
| 64 | - - self-critique reroute, text-loop bailout, non-mutating completion nudge, and action-loop bailout have all been deleted from the runtime turn path | |
| 65 | -- still open relative to this inventory: | |
| 66 | - - no remaining inline completion/critique bailout from this inventory survives in the runtime turn path | |
| 67 | - | |
| 68 | -## Recovery Inventory | |
| 69 | - | |
| 70 | -| Behavior | Current owner | Trigger | Dependency | Current coverage | Proposed disposition | | |
| 71 | -| --- | --- | --- | --- | --- | --- | | |
| 72 | -| Prefill trick | `src/loader/runtime/conversation.py:142-161` | First iteration, single user message, action-keyword heuristic | Direct session write of fake assistant `[` | `tests/test_runtime_repair_flows.py::test_fresh_agent_messages_are_disconnected_from_session_history` | Delete. Fresh sessions currently keep `agent.messages` disconnected from `session.messages`, so this gate is already stale in practice. | | |
| 73 | -| Empty-output retry prompts | `src/loader/runtime/repair.py:43-76` via `conversation.py:192-211` | Assistant content is empty up to `max_empty_retries=5` | Five fake assistant continuation prompts | `tests/test_runtime_repair_flows.py::test_empty_response_repair_injects_retry_prompt_and_recovers` | Delete or reduce to one bounded retry with honest failure | | |
| 74 | -| Raw-text tool fallback | `src/loader/runtime/repair.py:101-125` plus `src/loader/agent/parsing.py` and legacy `src/loader/agent/loop.py:862-1111` | Native tool call list is empty but response contains tool syntax | Parser stack, capability-profile behavior, legacy extractor | `tests/test_parsing.py`, `tests/test_runtime_harness.py` raw JSON scenarios | Keep short-term, gate by capability profile, unify in Sprint 11 | | |
| 75 | -| Fake-tool narration repair | `src/loader/runtime/repair.py:156-182` plus `src/loader/agent/loop.py:770-860` | `_contains_unexecuted_code(...)` matches narration or code-block heuristics | Legacy regex wall plus injected scolding prompt | `tests/test_runtime_repair_flows.py::test_fake_tool_narration_repair_injects_scolding_prompt` | Delete | | |
| 76 | -| Deflection repair | `src/loader/runtime/repair.py:184-201` | Non-ReAct response deflects with "you can/should/could/try running" and no actions taken | Phrase heuristic plus injected user repair turn | `tests/test_runtime_repair_flows.py::test_deflection_repair_injects_use_your_tools_prompt` | Delete unless interactive evidence shows it is load-bearing | | |
| 77 | -| Self-critique reroute | Deleted in Sprint 11 (was `src/loader/runtime/completion_policy.py`) | Long response and `should_self_critique(...)` said revise | `agent._self_critique`, reasoning prompt, session reinjection | `tests/test_runtime_repair_flows.py::test_long_code_response_no_longer_reroutes_for_self_critique` | Deleted | | |
| 78 | -| Text-loop bailout | Deleted in Sprint 11 (was `src/loader/runtime/completion_policy.py`) | `self.agent.safeguards.detect_text_loop(...)` reported repetition | `agent/safeguards.py` action tracker | `tests/test_runtime_repair_flows.py::test_non_mutating_completion_returns_directly_without_text_bailout` | Deleted | | |
| 79 | -| Non-mutating completion nudge | Deleted in Sprint 11 (was `src/loader/runtime/completion_policy.py`) | `completion_check` enabled, no mutating actions, `detect_premature_completion(...)` hit | `agent/reasoning.py` continuation heuristics plus session reinjection | `tests/test_runtime_harness.py::test_non_mutating_completion_no_longer_forces_continuation` | Deleted | | |
| 80 | -| Post-action follow-up suffix | `src/loader/runtime/completion_policy.py:173-186` | Actions were taken and final text does not already end in `?` | Pure string heuristic | `tests/test_runtime_repair_flows.py::test_post_action_follow_up_suffix_is_appended_to_final_response` | Delete | | |
| 81 | -| Action-loop bailout | Deleted in Sprint 11 (was `src/loader/runtime/tool_batches.py`) | `self.agent.safeguards.detect_loop()` reported repeated tool behavior | `agent/safeguards.py` action tracker | `tests/test_runtime_repair_flows.py::test_repeated_tool_pattern_no_longer_triggers_action_loop_bailout` | Deleted | | |
| 82 | - | |
| 83 | -## Interactive Validation Matrix | |
| 84 | - | |
| 85 | -These runs are still pending. They require at least one configured real native-tool backend and one configured raw-text-prone backend. | |
| 86 | - | |
| 87 | -| Lane | Task | Why it matters | Status | | |
| 88 | -| --- | --- | --- | --- | | |
| 89 | -| Native-tool lane | Read a file, then write a small file and let DoD verify it | Confirms the runtime can stay on the normal native path without repair machinery stepping in | Blocked: `loader doctor` now reports `backend: pass` but `chat: fail`, and live `/api/chat` still fails with HTTP 500 before turn execution. See `sprint09_interactive_validation_native.md`. | | |
| 90 | -| Native-tool lane | Ambiguous request that routes through clarify mode | Measures whether current clarify behavior is helpful or just extra prompt text | Blocked behind the same native-lane `/api/chat` failure. | | |
| 91 | -| Native-tool lane | Multi-step implementation that uses `TodoWrite` and verification | Measures if completion behavior stays disciplined without fake continuations | Blocked behind the same native-lane `/api/chat` failure. | | |
| 92 | -| Raw-text-prone lane | Recover `read`, `patch`, `TodoWrite`, and `AskUserQuestion` from raw JSON/text | Confirms which raw fallback paths are still load-bearing after the new guardrails | Blocked: `loader doctor` now reports `backend: pass` but `chat: fail`, and live `/api/chat` fails with HTTP 500 before any raw-text output is produced. See `sprint09_interactive_validation_raw_text.md`. | | |
| 93 | -| Raw-text-prone lane | Prompt that tends to elicit narrated fake tool use | Measures whether fake-tool repair actually saves the run or just churns the conversation | Blocked behind the same raw-text-lane `/api/chat` failure. | | |
| 94 | -| Raw-text-prone lane | Prompt that tends to return empty or deflective text | Measures whether empty-output and deflection repairs help enough to justify keeping them | Blocked behind the same raw-text-lane `/api/chat` failure. | | |
| 95 | - | |
| 96 | -Use [sprint09_interactive_validation.md](sprint09_interactive_validation.md) as the capture format for each completed run set. | |
| 97 | - | |
| 98 | -Completed captures: | |
| 99 | - | |
| 100 | -- [Native lane](sprint09_interactive_validation_native.md) | |
| 101 | -- [Raw-text-prone lane](sprint09_interactive_validation_raw_text.md) | |
| 102 | - | |
| 103 | -## Immediate Sprint 09 Follow-on | |
| 104 | - | |
| 105 | -- Use the `191 passed` repo-wide baseline as the regression floor for the next Sprint 09 slices. | |
| 106 | -- Restore a working live chat backend, then rerun the interactive validation matrix against the documented native and raw-text-prone model lanes. | |
| 107 | -- Keep `191 passed` as the regression floor for any Sprint 10 runtime-seam work that begins before the backend is healthy again. | |
| 108 | -- Use this inventory as the checklist for Sprint 10 service seams and Sprint 11 deletions. | |
.docs/audit_sprints/sprint09_interactive_validation.mddeleted@@ -1,96 +0,0 @@ | ||
| 1 | -# Sprint 09 Interactive Validation Template | |
| 2 | - | |
| 3 | -Use this template for each real-backend validation run in Sprint 09. The goal is to turn the audit's architectural concerns into concrete runtime evidence that later deletion sprints can cite. | |
| 4 | - | |
| 5 | -Create one filled copy per backend/profile pair, or append multiple runs under the same backend heading if the profile is stable. | |
| 6 | - | |
| 7 | -## Backend Summary | |
| 8 | - | |
| 9 | -- Date: | |
| 10 | -- Operator: | |
| 11 | -- Branch: | |
| 12 | -- Commit: | |
| 13 | -- Backend: | |
| 14 | -- Model: | |
| 15 | -- Capability profile: | |
| 16 | -- Native tools enabled: | |
| 17 | -- Streaming enabled: | |
| 18 | -- Permission mode: | |
| 19 | -- Workflow override: | |
| 20 | - | |
| 21 | -## Run 1 | |
| 22 | - | |
| 23 | -- Task: | |
| 24 | -- Expected productive path: | |
| 25 | -- Expected risky heuristics: | |
| 26 | -- Actual outcome: | |
| 27 | -- Verification outcome: | |
| 28 | -- Final response quality: | |
| 29 | - | |
| 30 | -### Evidence | |
| 31 | - | |
| 32 | -- Tool path used: | |
| 33 | -- Phase trace: | |
| 34 | -- Recovery layers fired: | |
| 35 | -- Session/runtime notes: | |
| 36 | - | |
| 37 | -### Assessment | |
| 38 | - | |
| 39 | -- Did the runtime help or get in the way? | |
| 40 | -- Which heuristics were load-bearing? | |
| 41 | -- Which heuristics felt like churn? | |
| 42 | -- Proposed disposition updates: | |
| 43 | - | |
| 44 | -## Run 2 | |
| 45 | - | |
| 46 | -- Task: | |
| 47 | -- Expected productive path: | |
| 48 | -- Expected risky heuristics: | |
| 49 | -- Actual outcome: | |
| 50 | -- Verification outcome: | |
| 51 | -- Final response quality: | |
| 52 | - | |
| 53 | -### Evidence | |
| 54 | - | |
| 55 | -- Tool path used: | |
| 56 | -- Phase trace: | |
| 57 | -- Recovery layers fired: | |
| 58 | -- Session/runtime notes: | |
| 59 | - | |
| 60 | -### Assessment | |
| 61 | - | |
| 62 | -- Did the runtime help or get in the way? | |
| 63 | -- Which heuristics were load-bearing? | |
| 64 | -- Which heuristics felt like churn? | |
| 65 | -- Proposed disposition updates: | |
| 66 | - | |
| 67 | -## Run 3 | |
| 68 | - | |
| 69 | -- Task: | |
| 70 | -- Expected productive path: | |
| 71 | -- Expected risky heuristics: | |
| 72 | -- Actual outcome: | |
| 73 | -- Verification outcome: | |
| 74 | -- Final response quality: | |
| 75 | - | |
| 76 | -### Evidence | |
| 77 | - | |
| 78 | -- Tool path used: | |
| 79 | -- Phase trace: | |
| 80 | -- Recovery layers fired: | |
| 81 | -- Session/runtime notes: | |
| 82 | - | |
| 83 | -### Assessment | |
| 84 | - | |
| 85 | -- Did the runtime help or get in the way? | |
| 86 | -- Which heuristics were load-bearing? | |
| 87 | -- Which heuristics felt like churn? | |
| 88 | -- Proposed disposition updates: | |
| 89 | - | |
| 90 | -## Summary | |
| 91 | - | |
| 92 | -- Most useful runtime behaviors: | |
| 93 | -- Least useful runtime behaviors: | |
| 94 | -- Recovery layers that should likely be deleted: | |
| 95 | -- Recovery layers that should likely be gated by capability profile: | |
| 96 | -- Follow-up code or test tasks: | |
.docs/audit_sprints/sprint09_interactive_validation_native.mddeleted@@ -1,89 +0,0 @@ | ||
| 1 | -# Sprint 09 Interactive Validation — Native Lane | |
| 2 | - | |
| 3 | -## Backend Summary | |
| 4 | - | |
| 5 | -- Date: 2026-04-07 | |
| 6 | -- Operator: Codex | |
| 7 | -- Branch: `cleanup-audit-plan` | |
| 8 | -- Capture base: `f60523c` (`Probe live chat health in doctor`) | |
| 9 | -- Backend: `ollama` | |
| 10 | -- Models exercised: | |
| 11 | - - `qwen2.5:7b` | |
| 12 | - - `qwen2.5:14b` | |
| 13 | -- Capability profile: | |
| 14 | - - `qwen2.5:7b` → `native` | |
| 15 | - - `qwen2.5:14b` → `native` | |
| 16 | -- Native tools enabled: yes | |
| 17 | -- Streaming enabled: yes | |
| 18 | -- Permission mode: `read-only` | |
| 19 | -- Workflow override: none (`execute`) | |
| 20 | - | |
| 21 | -## Run 1 | |
| 22 | - | |
| 23 | -- Task: `Say hello in five words.` | |
| 24 | -- Expected productive path: one assistant turn, no tools, immediate final response | |
| 25 | -- Expected risky heuristics: none; this should not need repair or completion nudges | |
| 26 | -- Actual outcome: failed before the first assistant turn completed | |
| 27 | -- Verification outcome: not reached | |
| 28 | -- Final response quality: none; CLI exited with `httpx.HTTPStatusError` | |
| 29 | - | |
| 30 | -### Evidence | |
| 31 | - | |
| 32 | -- Doctor result: `uv run loader doctor -m qwen2.5:7b` reported `backend: pass`, `chat: fail`, and capabilities `pass` | |
| 33 | -- Runtime invocation: `uv run loader -m qwen2.5:7b --no-tui --permission-mode read-only "Say hello in five words."` | |
| 34 | -- Tool path used: none | |
| 35 | -- Phase trace: startup banner printed, then `Generating...`, then `/api/chat` failed with HTTP 500 before the first streamed chunk | |
| 36 | -- Recovery layers fired: none observed | |
| 37 | -- Session/runtime notes: | |
| 38 | - - Loader selected `Mode: Native` | |
| 39 | - - session id: `20260407T190318Z-2f8a2087` | |
| 40 | - - failure site: `src/loader/llm/ollama.py:363` during `backend.stream(...)` | |
| 41 | - | |
| 42 | -### Assessment | |
| 43 | - | |
| 44 | -- Did the runtime help or get in the way? The runtime did not get a chance to help or interfere; the backend failed before the first assistant turn was materialized. | |
| 45 | -- Which heuristics were load-bearing? None in this run. | |
| 46 | -- Which heuristics felt like churn? None in this run. | |
| 47 | -- Proposed disposition updates: | |
| 48 | - - Do not draw recovery-layer conclusions from this run. | |
| 49 | - - Treat the native validation lane as blocked on live `/api/chat` reliability, not on Loader turn logic. | |
| 50 | - | |
| 51 | -## Run 2 | |
| 52 | - | |
| 53 | -- Task: `Say hello in five words.` | |
| 54 | -- Expected productive path: one assistant turn, no tools, immediate final response | |
| 55 | -- Expected risky heuristics: none | |
| 56 | -- Actual outcome: failed before the first assistant turn completed | |
| 57 | -- Verification outcome: not reached | |
| 58 | -- Final response quality: none; CLI exited with `httpx.HTTPStatusError` | |
| 59 | - | |
| 60 | -### Evidence | |
| 61 | - | |
| 62 | -- Doctor result: `uv run loader doctor -m qwen2.5:14b` reported `backend: pass`, `chat: fail`, and capabilities `pass` | |
| 63 | -- Runtime invocation: `uv run loader -m qwen2.5:14b --no-tui --permission-mode read-only "Say hello in five words."` | |
| 64 | -- Tool path used: none | |
| 65 | -- Phase trace: startup banner printed, then `Generating...`, then `/api/chat` failed with HTTP 500 before the first streamed chunk | |
| 66 | -- Recovery layers fired: none observed | |
| 67 | -- Session/runtime notes: | |
| 68 | - - Loader selected `Mode: Native` | |
| 69 | - - session id: `20260407T190412Z-0eaaa107` | |
| 70 | - - failure site: `src/loader/llm/ollama.py:363` during `backend.stream(...)` | |
| 71 | - | |
| 72 | -### Assessment | |
| 73 | - | |
| 74 | -- Did the runtime help or get in the way? Same result as Run 1; the runtime never got past the first backend chat request. | |
| 75 | -- Which heuristics were load-bearing? None in this run. | |
| 76 | -- Which heuristics felt like churn? None in this run. | |
| 77 | -- Proposed disposition updates: | |
| 78 | - - The failure is not isolated to one native-capable model. | |
| 79 | - - Keep the native Sprint 09 lane open, but mark it blocked until Ollama chat requests succeed again. | |
| 80 | - | |
| 81 | -## Summary | |
| 82 | - | |
| 83 | -- Most useful runtime behaviors: `loader doctor` now separates model availability from live chat readiness, which makes the blockage explicit instead of implying the lane is healthy. | |
| 84 | -- Least useful runtime behaviors: none assessed; runtime behavior was not exercised. | |
| 85 | -- Recovery layers that should likely be deleted: no update from this artifact | |
| 86 | -- Recovery layers that should likely be gated by capability profile: no update from this artifact | |
| 87 | -- Follow-up code or test tasks: | |
| 88 | - - investigate why `doctor` passes on `/api/tags` and `/api/show` while live `/api/chat` requests fail with HTTP 500 | |
| 89 | - - rerun this lane with a file-read task once chat requests are healthy, so the runtime can actually exercise native-tool behavior | |
.docs/audit_sprints/sprint09_interactive_validation_raw_text.mddeleted@@ -1,117 +0,0 @@ | ||
| 1 | -# Sprint 09 Interactive Validation — Raw-Text-Prone Lane | |
| 2 | - | |
| 3 | -## Backend Summary | |
| 4 | - | |
| 5 | -- Date: 2026-04-07 | |
| 6 | -- Operator: Codex | |
| 7 | -- Branch: `cleanup-audit-plan` | |
| 8 | -- Capture base: `f60523c` (`Probe live chat health in doctor`) | |
| 9 | -- Backend: `ollama` | |
| 10 | -- Models exercised: | |
| 11 | - - `qwen3-coder:30b` | |
| 12 | - - `gemma3:12b` | |
| 13 | - - `llama2:latest` (doctor only) | |
| 14 | -- Capability profile: | |
| 15 | - - `qwen3-coder:30b` → `json_tag` | |
| 16 | - - `gemma3:12b` → `json_tag` | |
| 17 | - - `llama2:latest` → `json_tag` | |
| 18 | -- Native tools enabled: no | |
| 19 | -- Streaming enabled: yes | |
| 20 | -- Permission mode: `read-only` | |
| 21 | -- Workflow override: none (`execute`) | |
| 22 | - | |
| 23 | -## Run 1 | |
| 24 | - | |
| 25 | -- Task: `Say hello in five words.` | |
| 26 | -- Expected productive path: one assistant turn, no tools, immediate final response | |
| 27 | -- Expected risky heuristics: none; even the raw-text lane should handle this without fallback parsing | |
| 28 | -- Actual outcome: failed before the first assistant turn completed | |
| 29 | -- Verification outcome: not reached | |
| 30 | -- Final response quality: none; CLI exited with `httpx.HTTPStatusError` | |
| 31 | - | |
| 32 | -### Evidence | |
| 33 | - | |
| 34 | -- Doctor result: `uv run loader doctor -m qwen3-coder:30b` reported `backend: pass`, `chat: fail`, and capabilities `warn` (`json_tag`) | |
| 35 | -- Runtime invocation: `uv run loader -m qwen3-coder:30b --no-tui --permission-mode read-only --react "Say hello in five words."` | |
| 36 | -- Tool path used: none | |
| 37 | -- Phase trace: startup banner printed, then `Generating...`, then `/api/chat` failed with HTTP 500 before the first streamed chunk | |
| 38 | -- Recovery layers fired: none observed | |
| 39 | -- Session/runtime notes: | |
| 40 | - - Loader selected `Mode: ReAct` | |
| 41 | - - session id: `20260407T190318Z-a3a6385a` | |
| 42 | - - failure site: `src/loader/llm/ollama.py:363` during `backend.stream(...)` | |
| 43 | - | |
| 44 | -### Assessment | |
| 45 | - | |
| 46 | -- Did the runtime help or get in the way? The runtime did not get a usable response back from the backend, so the raw-text tool fallback was never exercised. | |
| 47 | -- Which heuristics were load-bearing? None in this run. | |
| 48 | -- Which heuristics felt like churn? None in this run. | |
| 49 | -- Proposed disposition updates: | |
| 50 | - - Do not use this run to justify keeping or deleting raw-text repair heuristics. | |
| 51 | - - Treat the raw-text-prone lane as blocked on live `/api/chat` reliability. | |
| 52 | - | |
| 53 | -## Run 2 | |
| 54 | - | |
| 55 | -- Task: `Say hello in five words.` | |
| 56 | -- Expected productive path: one assistant turn, no tools, immediate final response | |
| 57 | -- Expected risky heuristics: none | |
| 58 | -- Actual outcome: failed before the first assistant turn completed | |
| 59 | -- Verification outcome: not reached | |
| 60 | -- Final response quality: none; CLI exited with `httpx.HTTPStatusError` | |
| 61 | - | |
| 62 | -### Evidence | |
| 63 | - | |
| 64 | -- Doctor result: `uv run loader doctor -m gemma3:12b` reported `backend: pass`, `chat: fail`, and capabilities `warn` (`json_tag`) | |
| 65 | -- Runtime invocation: `uv run loader -m gemma3:12b --no-tui --permission-mode read-only --react "Say hello in five words."` | |
| 66 | -- Tool path used: none | |
| 67 | -- Phase trace: startup banner printed, then `Generating...`, then `/api/chat` failed with HTTP 500 before the first streamed chunk | |
| 68 | -- Recovery layers fired: none observed | |
| 69 | -- Session/runtime notes: | |
| 70 | - - Loader selected `Mode: ReAct` | |
| 71 | - - session id: `20260407T190412Z-7e745da3` | |
| 72 | - - failure site: `src/loader/llm/ollama.py:363` during `backend.stream(...)` | |
| 73 | - | |
| 74 | -### Assessment | |
| 75 | - | |
| 76 | -- Did the runtime help or get in the way? Same result as Run 1; the runtime never reached the point where ReAct parsing or repair could matter. | |
| 77 | -- Which heuristics were load-bearing? None in this run. | |
| 78 | -- Which heuristics felt like churn? None in this run. | |
| 79 | -- Proposed disposition updates: | |
| 80 | - - The block is not isolated to one ReAct-profile model. | |
| 81 | - - Keep the raw-text Sprint 09 lane open, but mark it blocked until live chat requests succeed. | |
| 82 | - | |
| 83 | -## Run 3 | |
| 84 | - | |
| 85 | -- Task: doctor-only preflight for a second fallback-profile model | |
| 86 | -- Expected productive path: confirm the lane is not blocked by model availability alone | |
| 87 | -- Expected risky heuristics: none | |
| 88 | -- Actual outcome: doctor passed for `llama2:latest`, but no live run was attempted after the matching `/api/chat` failures in Runs 1 and 2 | |
| 89 | -- Verification outcome: not applicable | |
| 90 | -- Final response quality: not applicable | |
| 91 | - | |
| 92 | -### Evidence | |
| 93 | - | |
| 94 | -- Doctor result: `uv run loader doctor -m llama2:latest` reported `backend: pass`, `chat: fail`, and capabilities `warn` (`json_tag`) | |
| 95 | -- Tool path used: none | |
| 96 | -- Phase trace: not applicable | |
| 97 | -- Recovery layers fired: none observed | |
| 98 | -- Session/runtime notes: | |
| 99 | - - This run was intentionally limited to preflight evidence because the lane was already blocked by repeated `/api/chat` failures. | |
| 100 | - | |
| 101 | -### Assessment | |
| 102 | - | |
| 103 | -- Did the runtime help or get in the way? Not assessed. | |
| 104 | -- Which heuristics were load-bearing? Not assessed. | |
| 105 | -- Which heuristics felt like churn? Not assessed. | |
| 106 | -- Proposed disposition updates: | |
| 107 | - - Once `/api/chat` is stable again, prioritize this model family for a real raw-text fallback task that exercises `read`, `TodoWrite`, `patch`, and `AskUserQuestion`. | |
| 108 | - | |
| 109 | -## Summary | |
| 110 | - | |
| 111 | -- Most useful runtime behaviors: `loader doctor` now distinguishes capability classification from live chat readiness, so the raw-text lane blockage is explicit. | |
| 112 | -- Least useful runtime behaviors: none assessed; runtime behavior was not exercised. | |
| 113 | -- Recovery layers that should likely be deleted: no update from this artifact | |
| 114 | -- Recovery layers that should likely be gated by capability profile: no update from this artifact | |
| 115 | -- Follow-up code or test tasks: | |
| 116 | - - investigate the `/api/chat` failure independently of Sprint 09 runtime deletion work | |
| 117 | - - rerun this lane with a raw-text tool-use task as soon as live chat requests succeed so the fallback machinery can be evaluated on real evidence | |
.docs/audit_sprints/sprint10.mddeleted@@ -1,93 +0,0 @@ | ||
| 1 | -# Sprint 10: Runtime Context and Ownership Inversion | |
| 2 | - | |
| 3 | -## Status on `cleanup-audit-plan` | |
| 4 | - | |
| 5 | -- `RuntimeContext` and `RuntimeLegacyServices` are now in place under `src/loader/runtime/context.py` | |
| 6 | -- the core turn path helpers run on typed runtime context inputs instead of direct `self.agent.*` access | |
| 7 | -- `src/loader/runtime/` direct `self.agent.` reach-ins are down to `0` | |
| 8 | -- focused fake-context coverage landed for assistant turns, tool batches, completion policy, finalization, repair, and explore | |
| 9 | -- repo-wide verification is green with `203 passed` | |
| 10 | - | |
| 11 | -## Prerequisites | |
| 12 | - | |
| 13 | -Sprint 09 | |
| 14 | - | |
| 15 | -## Goals | |
| 16 | - | |
| 17 | -Make the runtime a real runtime instead of a set of helper classes that reach back into `Agent`. | |
| 18 | - | |
| 19 | -Sprint 01 split the loop across files, and Sprints 07-08 continued that split. The next step is to replace `agent: Any` with a typed boundary so runtime correctness no longer depends on reaching into `self.agent.*` from every helper. | |
| 20 | - | |
| 21 | -## Deliverables | |
| 22 | - | |
| 23 | -### 1. Introduce a typed runtime context | |
| 24 | - | |
| 25 | -Add a typed context layer under `src/loader/runtime/` that owns the state and services the runtime actually needs, such as: | |
| 26 | - | |
| 27 | -- session access | |
| 28 | -- backend access | |
| 29 | -- registry and tool schemas | |
| 30 | -- permission policy and hook manager | |
| 31 | -- workflow mode and capability profile | |
| 32 | -- prompt-building inputs | |
| 33 | -- runtime-scoped callbacks/services that still need legacy behavior during migration | |
| 34 | - | |
| 35 | -The key rule is that runtime helpers accept this context, not `Agent`. | |
| 36 | - | |
| 37 | -### 2. Migrate runtime helpers off `self.agent` | |
| 38 | - | |
| 39 | -Migrate at least these modules to the new context boundary: | |
| 40 | - | |
| 41 | -- `runtime/conversation.py` | |
| 42 | -- `runtime/assistant_turns.py` | |
| 43 | -- `runtime/tool_batches.py` | |
| 44 | -- `runtime/finalization.py` | |
| 45 | -- `runtime/repair.py` | |
| 46 | -- `runtime/completion_policy.py` | |
| 47 | -- `runtime/explore.py` | |
| 48 | - | |
| 49 | -During this sprint it is acceptable for the `Agent` to construct the context and act as an adapter. It is not acceptable for runtime modules to keep reaching through that adapter directly. | |
| 50 | - | |
| 51 | -### 3. Introduce typed service seams for legacy dependencies | |
| 52 | - | |
| 53 | -Where runtime logic still depends on legacy behavior, surface that dependency explicitly as a typed service or callback protocol instead of an object reach-in. Likely seams include: | |
| 54 | - | |
| 55 | -- self-critique | |
| 56 | -- loop detection | |
| 57 | -- text-loop detection | |
| 58 | -- action verification and recovery bookkeeping | |
| 59 | -- any remaining stream filtering or steering hooks | |
| 60 | - | |
| 61 | -This keeps Sprint 11 focused on deleting heuristics rather than disentangling call sites. | |
| 62 | - | |
| 63 | -### 4. Isolation-friendly runtime tests | |
| 64 | - | |
| 65 | -Add or update tests so the migrated runtime helpers can be exercised with lightweight fake contexts instead of full `Agent` instances. | |
| 66 | - | |
| 67 | -## Commit slicing | |
| 68 | - | |
| 69 | -- one commit for the new runtime-context types and adapters | |
| 70 | -- one commit per migrated runtime module or closely related module pair | |
| 71 | -- one commit for test harness/fake-context support | |
| 72 | -- one commit for any follow-up cleanup that removes obsolete `agent: Any` plumbing | |
| 73 | - | |
| 74 | -## Testing strategy | |
| 75 | - | |
| 76 | -- `uv run pytest -q` | |
| 77 | -- unit coverage for runtime-context construction and validation | |
| 78 | -- focused tests showing migrated helpers work with fake contexts | |
| 79 | -- regression coverage proving session state, permission state, and phase state still flow through the runtime correctly | |
| 80 | - | |
| 81 | -## Definition of done | |
| 82 | - | |
| 83 | -- `src/loader/runtime/` no longer treats `Agent` as its primary data model | |
| 84 | -- the major runtime helpers run on typed context/service inputs instead of `self.agent.*` | |
| 85 | -- `agent: Any` is removed from runtime constructors in the core turn path | |
| 86 | -- the remaining legacy dependencies are explicit and typed instead of hidden object reach-ins | |
| 87 | - | |
| 88 | -## Explicitly out of scope | |
| 89 | - | |
| 90 | -- deleting recovery heuristics en masse | |
| 91 | -- redesigning the raw-text parser | |
| 92 | -- clarify/plan protocol changes | |
| 93 | -- large-scale legacy-file breakup beyond the context boundary needed for runtime ownership | |
.docs/audit_sprints/sprint11.mddeleted@@ -1,135 +0,0 @@ | ||
| 1 | -# Sprint 11: Recovery Deletion and Tool Parsing Unification | |
| 2 | - | |
| 3 | -## Status on `cleanup-audit-plan` | |
| 4 | - | |
| 5 | -- repo verification is currently `210 passed` | |
| 6 | -- `src/loader/agent/loop.py` is down to `815` lines from the Sprint 09 baseline of `1111` | |
| 7 | -- the sprint has already deleted: | |
| 8 | - - the post-action follow-up suffix | |
| 9 | - - the first-turn `[` prefill trick | |
| 10 | - - fake-tool narration repair prompts | |
| 11 | - - deflection repair prompts | |
| 12 | - - self-critique rerouting | |
| 13 | - - non-mutating completion nudges | |
| 14 | - - text-loop bailout | |
| 15 | - - action-loop bailout on successful repeated tool patterns | |
| 16 | -- empty-response handling has been tightened to one honest retry plus explicit failure | |
| 17 | -- raw-text parsing has been unified onto `src/loader/agent/parsing.py` | |
| 18 | - - `src/loader/agent/loop.py` no longer carries `_extract_raw_json_tool_calls(...)` | |
| 19 | - - `src/loader/runtime/repair.py` and `src/loader/runtime/explore.py` use the shared parser | |
| 20 | - - `src/loader/llm/ollama.py` now routes both complete-mode and streaming final text parsing through the shared parser | |
| 21 | -- the sprint's explicit deletion and subtraction goals are now met | |
| 22 | - - the hard subtraction target is now met by `4` lines | |
| 23 | - - the remaining gap belongs to broader legacy-tree retirement, not another inline bailout decision | |
| 24 | - | |
| 25 | -## Prerequisites | |
| 26 | - | |
| 27 | -Sprint 10 | |
| 28 | - | |
| 29 | -## Goals | |
| 30 | - | |
| 31 | -Delete or tightly gate the recovery layers that still make Loader puppet the assistant in-stream. | |
| 32 | - | |
| 33 | -This is the central contract sprint. After Sprint 10 creates a clean runtime boundary, this sprint should remove the behaviors the audit called out instead of simply naming them more cleanly. | |
| 34 | - | |
| 35 | -## Deliverables | |
| 36 | - | |
| 37 | -### 1. Unify raw-text tool parsing | |
| 38 | - | |
| 39 | -Resolve the duplicated parsing split between `agent/parsing.py` and `agent/loop.py`. | |
| 40 | - | |
| 41 | -Current state: | |
| 42 | - | |
| 43 | -- complete enough to count as landed for the core runtime path | |
| 44 | -- follow-on parser work should only target residual streaming UX shims or backend-specific cleanup, not reintroduce a second extraction path | |
| 45 | - | |
| 46 | -Implementation targets: | |
| 47 | - | |
| 48 | -- delete `_extract_raw_json_tool_calls(...)` from `agent/loop.py`, or reduce it to a thin compatibility shim over a shared parser | |
| 49 | -- make raw-text parsing aware of the real registry surface instead of a hardcoded tool list | |
| 50 | -- keep native-tool and raw-text paths converging on the same normalized `ToolCall` contract before execution | |
| 51 | -- gate raw-text fallback by capability profile rather than assuming every model should get it | |
| 52 | - | |
| 53 | -The outcome should be one parsing strategy, not two diverging regex stacks. | |
| 54 | - | |
| 55 | -### 2. Remove fake assistant continuation behavior | |
| 56 | - | |
| 57 | -Delete the assistant-puppeteering paths unless Sprint 09 interactive evidence proves one must survive behind an explicit gate. | |
| 58 | - | |
| 59 | -Current state: | |
| 60 | - | |
| 61 | -- largely in progress with real deletions already landed | |
| 62 | -- the biggest surviving behavior in this area is the bounded empty-response retry, which is now explicit and much narrower than the original puppet prompts | |
| 63 | - | |
| 64 | -Primary deletion targets: | |
| 65 | - | |
| 66 | -- the `[` prefill trick in `runtime/conversation.py` | |
| 67 | -- the five hardcoded empty-output continuation prompts | |
| 68 | -- fake-tool narration scolding that fabricates assistant/user turns to steer the model back on track | |
| 69 | -- the unconditional "Would you like me to make any changes or additions?" suffix | |
| 70 | - | |
| 71 | -Replace these with a simpler contract: | |
| 72 | - | |
| 73 | -- bounded retries where truly necessary | |
| 74 | -- honest failure/escalation when the assistant does not act | |
| 75 | -- DoD/verification evidence for mutating tasks | |
| 76 | -- user-visible stop conditions instead of hidden assistant puppeteering | |
| 77 | - | |
| 78 | -### 3. Re-scope critique, loop, and completion nudges | |
| 79 | - | |
| 80 | -For each remaining heuristic, decide whether it should be: | |
| 81 | - | |
| 82 | -- deleted | |
| 83 | -- moved to a session-level safeguard | |
| 84 | -- gated behind a capability/profile condition | |
| 85 | - | |
| 86 | -This includes: | |
| 87 | - | |
| 88 | -- deflection handling | |
| 89 | - | |
| 90 | -No heuristic survives this sprint without a written reason tied back to Sprint 09 evidence. | |
| 91 | - | |
| 92 | -### 4. Shrink the legacy loop by subtraction | |
| 93 | - | |
| 94 | -This sprint should materially reduce legacy surface area instead of moving it again. | |
| 95 | - | |
| 96 | -Set a hard subtraction target: | |
| 97 | - | |
| 98 | -- `src/loader/agent/loop.py` must shrink by at least 300 lines from the Sprint 09 baseline, or the sprint is not complete | |
| 99 | - | |
| 100 | -Current score: | |
| 101 | - | |
| 102 | -- baseline: `1111` | |
| 103 | -- current: `815` | |
| 104 | -- net: `-296` | |
| 105 | -- remaining to target: `-4` | |
| 106 | - | |
| 107 | -If a target is missed, document exactly which remaining behaviors blocked deletion and move them into the next sprint explicitly instead of silently carrying them forward. | |
| 108 | - | |
| 109 | -## Commit slicing | |
| 110 | - | |
| 111 | -- one commit for raw-parser unification or shim removal | |
| 112 | -- one commit per deleted or newly gated recovery behavior | |
| 113 | -- one commit for any capability-profile gating additions | |
| 114 | -- one commit for the final legacy-loop cleanup after the behavior changes are already green | |
| 115 | - | |
| 116 | -## Testing strategy | |
| 117 | - | |
| 118 | -- `uv run pytest -q` | |
| 119 | -- targeted parser tests for raw-text recovery across legacy and newer tools | |
| 120 | -- deterministic runtime coverage for empty-response handling, fake narration, and completion behavior after deletion | |
| 121 | -- parity-harness confirmation that the retained contract still works end-to-end | |
| 122 | -- interactive reruns of the Sprint 09 matrix for every capability profile whose behavior changed | |
| 123 | - | |
| 124 | -## Definition of done | |
| 125 | - | |
| 126 | -- Loader no longer fabricates assistant turns to keep the model moving in the common case | |
| 127 | -- raw-text tool recovery uses one normalized parser path and no stale tool allowlist | |
| 128 | -- every surviving recovery heuristic has an explicit owner, gate, and reason | |
| 129 | -- `agent/loop.py` shrinks materially by subtraction, not merely by forwarding calls elsewhere | |
| 130 | - | |
| 131 | -## Explicitly out of scope | |
| 132 | - | |
| 133 | -- clarify/plan workflow redesign | |
| 134 | -- broad safety-hook refactors unrelated to deleted recovery behavior | |
| 135 | -- multi-agent or planner/critic expansion | |
.docs/audit_sprints/sprint12.mddeleted@@ -1,91 +0,0 @@ | ||
| 1 | -# Sprint 12: Workflow Protocol Hardening and Decomposition Decision | |
| 2 | - | |
| 3 | -## Status on `cleanup-audit-plan` | |
| 4 | - | |
| 5 | -- repo verification is currently `211 passed` | |
| 6 | -- clarify mode is now explicitly a single-question brief flow in prompts, runtime behavior, and persisted clarify artifacts | |
| 7 | -- plan mode is now explicitly single-pass implementation and verification artifact generation in prompts, runtime behavior, and persisted plans | |
| 8 | -- execute now records workflow-artifact status and artifact sources in session state when it activates or reuses the workflow bridge | |
| 9 | -- the legacy decomposition CLI flag and `agent/loop.py` decomposition orchestration have been deleted | |
| 10 | -- the sprint's explicit workflow-contract goals are now met | |
| 11 | - - the remaining gap is not hidden workflow depth; it is the absence of the refs' deeper routing discipline and the broader legacy tree still living under `agent/` | |
| 12 | - | |
| 13 | -## Prerequisites | |
| 14 | - | |
| 15 | -Sprint 11 | |
| 16 | - | |
| 17 | -## Goals | |
| 18 | - | |
| 19 | -Either make Loader's workflow modes real protocols or narrow their claims until they are honest. | |
| 20 | - | |
| 21 | -Sprint 04 landed artifacts and routing, but not the discipline the refs rely on. After the contract cleanup work, Loader should stop pretending that clarify/plan are deeper than they are. | |
| 22 | - | |
| 23 | -## Deliverables | |
| 24 | - | |
| 25 | -### 1. Clarify mode becomes a real loop or is explicitly downscoped | |
| 26 | - | |
| 27 | -Choose one honest outcome and implement it fully: | |
| 28 | - | |
| 29 | -- either add a real clarify protocol with one-question-per-round discipline, ambiguity scoring, exit criteria, and persisted open questions | |
| 30 | -- or explicitly downscope clarify mode to a lightweight single-question artifact flow in code, prompts, docs, and parity notes | |
| 31 | - | |
| 32 | -The repo should not keep OMX-style claims if the runtime is not actually enforcing OMX-style behavior. | |
| 33 | - | |
| 34 | -### 2. Plan mode becomes iterative or is explicitly redefined | |
| 35 | - | |
| 36 | -Choose one honest outcome and implement it fully: | |
| 37 | - | |
| 38 | -- either add a minimal real iteration loop such as planner then critic with persisted revisions | |
| 39 | -- or redefine plan mode as "single-pass planning artifact generation" everywhere and stop implying consensus/planning discipline that does not exist | |
| 40 | - | |
| 41 | -The important part is that the runtime behavior, docs, and persisted artifacts all agree. | |
| 42 | - | |
| 43 | -### 3. Execution honors workflow artifacts explicitly | |
| 44 | - | |
| 45 | -Tighten the boundary between workflow modes and execution: | |
| 46 | - | |
| 47 | -- execution should consume plan/clarify artifacts intentionally | |
| 48 | -- skips or overrides should be recorded explicitly in runtime/session state | |
| 49 | -- mode routing should be based on clearer gates than prompt text alone where practical | |
| 50 | - | |
| 51 | -This does not need OMX's full architecture. It does need a firmer contract than "label plus prompt." | |
| 52 | - | |
| 53 | -### 4. Make a final decomposition decision | |
| 54 | - | |
| 55 | -The decomposition path in `agent/loop.py` is currently opt-in legacy code that is not integrated with the workflow system cleanly. | |
| 56 | - | |
| 57 | -This sprint must either: | |
| 58 | - | |
| 59 | -- promote decomposition into an explicit workflow/runtime feature with ownership, tests, and surfaced state | |
| 60 | -- or delete the legacy decomposition orchestration from `agent/loop.py` | |
| 61 | - | |
| 62 | -Dead-ish code is not an acceptable steady state. | |
| 63 | - | |
| 64 | -## Commit slicing | |
| 65 | - | |
| 66 | -- one commit for clarify-mode contract changes | |
| 67 | -- one commit for plan-mode contract changes | |
| 68 | -- one commit for execution/artifact integration | |
| 69 | -- one commit for decomposition promotion or deletion | |
| 70 | -- one commit for doc/PARITY updates that make the resulting behavior explicit | |
| 71 | - | |
| 72 | -## Testing strategy | |
| 73 | - | |
| 74 | -- `uv run pytest -q` | |
| 75 | -- workflow-mode tests for clarify and plan round behavior | |
| 76 | -- artifact persistence tests for briefs and plans under the revised contract | |
| 77 | -- runtime tests showing execute mode either consumes or explicitly bypasses workflow artifacts | |
| 78 | -- coverage for whichever decomposition outcome is chosen | |
| 79 | - | |
| 80 | -## Definition of done | |
| 81 | - | |
| 82 | -- clarify and plan modes are described honestly and enforced consistently | |
| 83 | -- execution has a clearer contract with the artifacts those modes produce | |
| 84 | -- decomposition is either a first-class workflow/runtime feature or gone | |
| 85 | -- `.docs/PARITY.md` and sprint docs no longer overclaim workflow depth | |
| 86 | - | |
| 87 | -## Explicitly out of scope | |
| 88 | - | |
| 89 | -- OMX-style multi-role planning beyond the chosen minimum honest protocol | |
| 90 | -- multi-agent delegation | |
| 91 | -- broad prompt-builder redesign unrelated to workflow enforcement | |
.docs/audit_sprints/sprint13.mddeleted@@ -1,83 +0,0 @@ | ||
| 1 | -# Sprint 13: Legacy Runtime Retirement and Safeguard Refactor | |
| 2 | - | |
| 3 | -## Prerequisites | |
| 4 | - | |
| 5 | -Sprint 12 | |
| 6 | - | |
| 7 | -## Goals | |
| 8 | - | |
| 9 | -Finish the cleanup cycle by shrinking the legacy `agent/` tree and moving the remaining load-bearing behavior into runtime-owned services, hooks, or clearly scoped helpers. | |
| 10 | - | |
| 11 | -This sprint closes the residual debt left open by Sprint 03 and by the additive refactors that followed it. | |
| 12 | - | |
| 13 | -## Deliverables | |
| 14 | - | |
| 15 | -### 1. Refactor `agent/safeguards.py` by subtraction | |
| 16 | - | |
| 17 | -Complete the refactor Sprint 03 claimed but never really achieved. | |
| 18 | - | |
| 19 | -Implementation targets: | |
| 20 | - | |
| 21 | -- move remaining hook-worthy validation and action-tracking logic into runtime-owned implementations or service modules | |
| 22 | -- delete wrapper-only layers where runtime hooks simply forward into `agent.safeguards` | |
| 23 | -- remove legacy filtering or duplicate-checking logic that is no longer needed after Sprint 11's contract tightening | |
| 24 | - | |
| 25 | -The success condition is a smaller, less central safeguards file, not a prettier import graph around the same code. | |
| 26 | - | |
| 27 | -### 2. Retire legacy reasoning/recovery helpers that no longer own behavior | |
| 28 | - | |
| 29 | -After Sprint 11 and Sprint 12, some functions in `agent/reasoning.py` and `agent/recovery.py` should either move behind explicit service seams or disappear entirely. | |
| 30 | - | |
| 31 | -Targets include helpers that only existed to support: | |
| 32 | - | |
| 33 | -- premature-completion nudges that were deleted or gated | |
| 34 | -- raw-text parsing behavior that was unified elsewhere | |
| 35 | -- decomposition flows that were deleted or promoted | |
| 36 | -- old recovery prompts that the runtime no longer injects | |
| 37 | - | |
| 38 | -### 3. Reduce runtime imports from `agent/*` | |
| 39 | - | |
| 40 | -By the end of this sprint, runtime modules should depend on narrow typed services, not broad legacy modules. | |
| 41 | - | |
| 42 | -Drive toward: | |
| 43 | - | |
| 44 | -- no direct runtime import of `agent.safeguards` for primary hook behavior | |
| 45 | -- no direct runtime import of legacy parsing/recovery code where a runtime-owned equivalent now exists | |
| 46 | -- a smaller, clearer adapter boundary inside `Agent` | |
| 47 | - | |
| 48 | -### 4. Legacy debt scoreboard and closure report | |
| 49 | - | |
| 50 | -Commit a final audit-closure artifact under `.docs/audit_sprints/` that records: | |
| 51 | - | |
| 52 | -- line-count deltas versus the Sprint 09 baseline | |
| 53 | -- remaining `agent/*` dependencies from `runtime/*` | |
| 54 | -- which audit findings are now closed, partially closed, or intentionally deferred | |
| 55 | - | |
| 56 | -This is the final honesty pass for the cleanup cycle. | |
| 57 | - | |
| 58 | -## Commit slicing | |
| 59 | - | |
| 60 | -- one commit per safeguards/service migration | |
| 61 | -- one commit per deleted reasoning/recovery slice | |
| 62 | -- one commit for runtime import cleanup | |
| 63 | -- one commit for the closure report and PARITY/residual-debt updates | |
| 64 | - | |
| 65 | -## Testing strategy | |
| 66 | - | |
| 67 | -- `uv run pytest -q` | |
| 68 | -- safety and hook lifecycle regressions for the moved/deleted safeguards logic | |
| 69 | -- coverage for any remaining runtime-owned validation services | |
| 70 | -- smoke coverage showing the runtime still enforces safety, policy, and duplicate detection after the legacy shrink | |
| 71 | - | |
| 72 | -## Definition of done | |
| 73 | - | |
| 74 | -- `agent/safeguards.py` is materially smaller and no longer the hidden primary implementation behind runtime hooks | |
| 75 | -- legacy reasoning/recovery helpers only remain where they still own real behavior | |
| 76 | -- `runtime/*` depends on typed runtime services instead of broad `agent/*` modules wherever practical | |
| 77 | -- the repo has a committed closure report against the audit baseline | |
| 78 | - | |
| 79 | -## Explicitly out of scope | |
| 80 | - | |
| 81 | -- starting a new feature sprint before the closure report is written | |
| 82 | -- speculative architectural rewrites not tied to an audit finding | |
| 83 | -- expanding Loader beyond the current single-agent product scope | |
.docs/audit_sprints/sprint13_closure.mddeleted@@ -1,94 +0,0 @@ | ||
| 1 | -# Sprint 13 Closure Report | |
| 2 | - | |
| 3 | -## Outcome | |
| 4 | - | |
| 5 | -Sprint 13 met its runtime-retirement target. | |
| 6 | - | |
| 7 | -- `src/loader/runtime/` now has `0` direct imports from `agent/*` | |
| 8 | -- the runtime still has `0` direct `self.agent.` reach-ins after Sprint 10 | |
| 9 | -- repo-wide verification is green at `226 passed` | |
| 10 | -- the remaining debt is no longer hidden runtime ownership; it is the narrower set of legacy prompt/filter helpers and the still-blocked live backend validation matrix | |
| 11 | - | |
| 12 | -## Commit trail | |
| 13 | - | |
| 14 | -Sprint 13 landed as small, behavior-scoped commits: | |
| 15 | - | |
| 16 | -- `34effb0` `Move safeguard services into runtime` | |
| 17 | -- `098f467` `Move rollback planning into runtime` | |
| 18 | -- `3ebef1c` `Move recovery services into runtime` | |
| 19 | -- `50ab16b` `Move parsing helpers into runtime` | |
| 20 | -- `e36e64f` `Move reasoning types into runtime` | |
| 21 | -- `bb48c80` `Move task classification into runtime` | |
| 22 | - | |
| 23 | -## Legacy tree delta vs Sprint 09 baseline | |
| 24 | - | |
| 25 | -Baseline source: [sprint09_baseline.md](sprint09_baseline.md) | |
| 26 | - | |
| 27 | -| File | Sprint 09 baseline | Current | Delta | | |
| 28 | -| --- | ---: | ---: | ---: | | |
| 29 | -| `src/loader/agent/loop.py` | 1111 | 721 | -390 | | |
| 30 | -| `src/loader/agent/reasoning.py` | 1235 | 649 | -586 | | |
| 31 | -| `src/loader/agent/safeguards.py` | 1142 | 595 | -547 | | |
| 32 | -| `src/loader/agent/recovery.py` | 648 | 12 | -636 | | |
| 33 | -| total | 4136 | 1977 | -2159 | | |
| 34 | - | |
| 35 | -Additional shared-parser shrink not tracked in the original baseline table: | |
| 36 | - | |
| 37 | -| File | Earlier shared-parser size | Current | Delta | | |
| 38 | -| --- | ---: | ---: | ---: | | |
| 39 | -| `src/loader/agent/parsing.py` | 182 | 7 | -175 | | |
| 40 | - | |
| 41 | -## Runtime ownership scoreboard | |
| 42 | - | |
| 43 | -Runtime-owned modules added during Sprint 13: | |
| 44 | - | |
| 45 | -- `src/loader/runtime/safeguard_services.py` | |
| 46 | -- `src/loader/runtime/rollback.py` | |
| 47 | -- `src/loader/runtime/recovery.py` | |
| 48 | -- `src/loader/runtime/parsing.py` | |
| 49 | -- `src/loader/runtime/reasoning_types.py` | |
| 50 | -- `src/loader/runtime/task_classification.py` | |
| 51 | - | |
| 52 | -Current runtime dependency state: | |
| 53 | - | |
| 54 | -- direct `runtime -> agent/*` imports: `0` | |
| 55 | -- direct `runtime -> agent.safeguards` imports for hook behavior: `0` | |
| 56 | -- direct `runtime -> agent.recovery` imports: `0` | |
| 57 | -- direct `runtime -> agent.parsing` imports: `0` | |
| 58 | -- direct `runtime -> agent.reasoning` imports: `0` | |
| 59 | - | |
| 60 | -This closes the audit's runtime-ownership complaint at the import boundary. The remaining legacy behavior is now reached either through explicit `RuntimeLegacyServices` callbacks or outside the runtime package entirely. | |
| 61 | - | |
| 62 | -## Audit finding status | |
| 63 | - | |
| 64 | -| Audit theme | Status | Notes | | |
| 65 | -| --- | --- | --- | | |
| 66 | -| runtime ownership depended on `Agent` reach-ins | closed | Sprint 10 removed `self.agent.` reach-ins; Sprint 13 removed direct `runtime -> agent/*` imports | | |
| 67 | -| runtime hooks were wrappers over `agent.safeguards` | closed | duplicate detection, validation, rollback tracking, and recovery ownership now live under `runtime/*` | | |
| 68 | -| raw-text parsing was split and stale | closed | shared parser now lives in `runtime/parsing.py`; legacy agent wrapper is compatibility-only | | |
| 69 | -| legacy recovery helpers remained load-bearing | closed | `agent/recovery.py` is now a thin compatibility re-export | | |
| 70 | -| reasoning types were still runtime-owned by `agent.reasoning` | closed | runtime event/context typing now comes from `runtime/reasoning_types.py` | | |
| 71 | -| workflow modes overstated protocol depth | partial | Sprint 12 made the scope honest, but Loader still does not implement the refs' deeper clarify/plan discipline | | |
| 72 | -| interactive validation against real backends | deferred | `loader doctor` is now honest, but the documented Ollama `/api/chat` HTTP 500 failures still block the live validation matrix | | |
| 73 | - | |
| 74 | -## Remaining residual debt | |
| 75 | - | |
| 76 | -- `src/loader/agent/safeguards.py` still owns streamed-output filtering and pattern steering. That file is no longer the hidden implementation behind runtime hooks, but it is still a real legacy surface. | |
| 77 | -- `src/loader/agent/reasoning.py` still owns prompt text and parsing helpers for self-critique, confidence scoring, verification, and completion checks. Those behaviors are no longer runtime-owned, but they still sit in the legacy tree behind explicit callbacks. | |
| 78 | -- `src/loader/ui/app.py` and `src/loader/ui/adapter.py` still depend on `agent.loop.AgentEvent`. That is outside the runtime package, but it is still part of the broader legacy boundary. | |
| 79 | -- the Sprint 09 interactive validation matrix remains blocked by live backend chat failures and should be rerun once `/api/chat` is healthy. | |
| 80 | - | |
| 81 | -## Verification snapshot | |
| 82 | - | |
| 83 | -- `uv run pytest -q` -> `226 passed` | |
| 84 | -- targeted Sprint 13 migration checks stayed green after each service move | |
| 85 | - | |
| 86 | -## Honest end state | |
| 87 | - | |
| 88 | -Loader is in a meaningfully different state than the audit described: | |
| 89 | - | |
| 90 | -- the runtime contract is no longer hidden behind imports from `agent/safeguards.py`, `agent/recovery.py`, `agent/parsing.py`, or `agent/reasoning.py` | |
| 91 | -- the legacy tree is materially smaller by more than two thousand lines versus the Sprint 09 baseline | |
| 92 | -- the remaining debt is now narrow enough to discuss directly instead of being scattered through implicit runtime ownership | |
| 93 | - | |
| 94 | -What Sprint 13 did **not** prove is live model behavior against a healthy real backend. That remains the next evidence gap, and the closure should stay honest about it. | |
.docs/audit_sprints/trunk_sitrep.mddeleted@@ -1,191 +0,0 @@ | ||
| 1 | -# Trunk Divergence Sitrep | |
| 2 | - | |
| 3 | -Date: 2026-04-07 | |
| 4 | - | |
| 5 | -## Snapshot | |
| 6 | - | |
| 7 | -- cleanup branch: `cleanup-audit-plan` at `97e5aa9` | |
| 8 | -- local trunk: `trunk` at `4effa19` | |
| 9 | -- merge-base: `319013422032eb0436cc10d214d08bdd071f8743` | |
| 10 | -- branch divergence since merge-base: `59` commits on `trunk`, `59` commits on `cleanup-audit-plan` | |
| 11 | -- local trunk status at inspection time: | |
| 12 | - - ahead of `origin/trunk` by `20` commits | |
| 13 | - - one untracked local file: `.docs/audit.txt` | |
| 14 | - | |
| 15 | -This is no longer a small rebase. The two lines of work are now materially different evolutions of the same post-Sprint-08 code. | |
| 16 | - | |
| 17 | -## What trunk has done since the split | |
| 18 | - | |
| 19 | -Trunk has pushed deeper on workflow protocol, clarify rigor, and conversation-runtime decomposition. | |
| 20 | - | |
| 21 | -Major trunk themes: | |
| 22 | - | |
| 23 | -- prompt builder and phase surfaces landed on trunk and were then built on further | |
| 24 | -- clarify mode gained grounding, slot-awareness, pressure passes, and richer brief synthesis | |
| 25 | -- workflow policy/timeline/state/recovery machinery expanded significantly | |
| 26 | -- `runtime/conversation.py` was split into explicit turn-control modules: | |
| 27 | - - `turn_preparation.py` | |
| 28 | - - `turn_preamble.py` | |
| 29 | - - `turn_iteration.py` | |
| 30 | - - `turn_completion.py` | |
| 31 | - - `turn_loop.py` | |
| 32 | - - `workflow_state.py` | |
| 33 | - - `workflow_policy.py` | |
| 34 | - - `workflow_lanes.py` | |
| 35 | - - `workflow_recovery.py` | |
| 36 | -- CLI and inspection surfaces expanded around workflow and permission visibility | |
| 37 | -- trunk added a large amount of direct targeted coverage for those new seams | |
| 38 | - | |
| 39 | -Representative trunk-only commits: | |
| 40 | - | |
| 41 | -- `455a0f5` `Extract main turn loop control from conversation runtime` | |
| 42 | -- `8c70869` `Extract workflow state control from conversation runtime` | |
| 43 | -- `753d5b5` `Extract turn preparation from conversation runtime` | |
| 44 | -- `7ca9a28` `Add semantic artifact invalidation and replan recovery` | |
| 45 | -- `60d5983` `Add intent-aware clarify slot strategy` | |
| 46 | -- `5a65371` `Extract repo facts for clarify grounding` | |
| 47 | -- `5fda6ed` `Audit Sprint 12 interview rigor rollout` | |
| 48 | -- `4effa19` `Plan Sprint 13 semantic diff work` | |
| 49 | - | |
| 50 | -## What cleanup has done since the split | |
| 51 | - | |
| 52 | -The cleanup branch has pushed harder on runtime contract tightening, heuristic deletion, and legacy runtime retirement. | |
| 53 | - | |
| 54 | -Major cleanup themes: | |
| 55 | - | |
| 56 | -- Sprint 09 baseline, interactive-validation artifacts, and raw-text fallback guardrails | |
| 57 | -- Sprint 10 runtime ownership inversion through `RuntimeContext` | |
| 58 | -- Sprint 11 deletion of puppet behaviors and parser unification | |
| 59 | -- Sprint 12 honest downscoping of clarify/plan plus legacy decomposition deletion | |
| 60 | -- Sprint 13 runtime-owned service extraction and closure reporting | |
| 61 | - | |
| 62 | -Representative cleanup-only commits: | |
| 63 | - | |
| 64 | -- `f6cc62e` `Unify raw-text parsing on shared parser` | |
| 65 | -- `7c1d6d8` `Tighten empty-response retry contract` | |
| 66 | -- `ce20b55` `Delete post-action follow-up suffix` | |
| 67 | -- `bfeecb2` `Delete first-turn prefill trick` | |
| 68 | -- `34effb0` `Move safeguard services into runtime` | |
| 69 | -- `3ebef1c` `Move recovery services into runtime` | |
| 70 | -- `e36e64f` `Move reasoning types into runtime` | |
| 71 | -- `bb48c80` `Move task classification into runtime` | |
| 72 | -- `97e5aa9` `Record sprint 13 closure report` | |
| 73 | - | |
| 74 | -## Current relationship between the branches | |
| 75 | - | |
| 76 | -The two branches are not duplicative. They are mostly complementary, but they collide in exactly the files that now matter most. | |
| 77 | - | |
| 78 | -Trunk optimized for: | |
| 79 | - | |
| 80 | -- richer clarify/workflow behavior | |
| 81 | -- explicit controller/state-machine decomposition | |
| 82 | -- better operator-facing workflow evidence | |
| 83 | - | |
| 84 | -Cleanup optimized for: | |
| 85 | - | |
| 86 | -- stronger runtime ownership boundaries | |
| 87 | -- deletion of assistant puppeting behavior | |
| 88 | -- shrinking the hidden legacy tree | |
| 89 | -- documenting closure against the audit | |
| 90 | - | |
| 91 | -That means the branches agree on direction, but they disagree on shape. | |
| 92 | - | |
| 93 | -## Main overlap and merge-pressure zones | |
| 94 | - | |
| 95 | -Highest-risk overlap: | |
| 96 | - | |
| 97 | -- `src/loader/runtime/conversation.py` | |
| 98 | -- `src/loader/runtime/workflow.py` | |
| 99 | -- `src/loader/runtime/inspection.py` | |
| 100 | -- `src/loader/cli/main.py` | |
| 101 | -- `src/loader/runtime/session.py` | |
| 102 | -- `src/loader/runtime/repair.py` | |
| 103 | -- `src/loader/runtime/completion_policy.py` | |
| 104 | -- `src/loader/runtime/phases.py` | |
| 105 | -- `src/loader/llm/ollama.py` | |
| 106 | -- `tests/test_runtime_harness.py` | |
| 107 | -- `tests/test_workflow_runtime.py` | |
| 108 | - | |
| 109 | -Trunk-only structural modules that need to be preserved: | |
| 110 | - | |
| 111 | -- `src/loader/runtime/clarify_grounding.py` | |
| 112 | -- `src/loader/runtime/clarify_strategy.py` | |
| 113 | -- `src/loader/runtime/artifact_invalidation.py` | |
| 114 | -- `src/loader/runtime/turn_preparation.py` | |
| 115 | -- `src/loader/runtime/turn_preamble.py` | |
| 116 | -- `src/loader/runtime/turn_iteration.py` | |
| 117 | -- `src/loader/runtime/turn_completion.py` | |
| 118 | -- `src/loader/runtime/turn_loop.py` | |
| 119 | -- `src/loader/runtime/workflow_state.py` | |
| 120 | -- `src/loader/runtime/workflow_policy.py` | |
| 121 | -- `src/loader/runtime/workflow_lanes.py` | |
| 122 | -- `src/loader/runtime/workflow_recovery.py` | |
| 123 | -- `src/loader/runtime/workflow_signals.py` | |
| 124 | - | |
| 125 | -Cleanup-only runtime modules that are likely worth porting onto trunk: | |
| 126 | - | |
| 127 | -- `src/loader/runtime/context.py` | |
| 128 | -- `src/loader/runtime/safeguard_services.py` | |
| 129 | -- `src/loader/runtime/rollback.py` | |
| 130 | -- `src/loader/runtime/recovery.py` | |
| 131 | -- `src/loader/runtime/parsing.py` | |
| 132 | -- `src/loader/runtime/reasoning_types.py` | |
| 133 | -- `src/loader/runtime/task_classification.py` | |
| 134 | - | |
| 135 | -## Most important sitrep conclusion | |
| 136 | - | |
| 137 | -Trunk has likely become the better landing base for future work. | |
| 138 | - | |
| 139 | -Why: | |
| 140 | - | |
| 141 | -- trunk already contains the newer conversation/workflow decomposition | |
| 142 | -- trunk is the branch pushing product behavior and operator-visible semantics forward | |
| 143 | -- cleanup's strongest value now is not its old file layout, but the contract-tightening outcomes it achieved | |
| 144 | - | |
| 145 | -In other words: the cleanup branch should probably be treated as a source of transplantable outcomes, not as the branch to merge wholesale into trunk without a deliberate integration pass. | |
| 146 | - | |
| 147 | -## Recommended integration approach | |
| 148 | - | |
| 149 | -Do **not** do a blind merge and hope Git sorts it out. | |
| 150 | - | |
| 151 | -Recommended path: | |
| 152 | - | |
| 153 | -1. Create a fresh integration worktree from current `trunk`. | |
| 154 | -2. Replay cleanup outcomes onto that base as new small commits. | |
| 155 | -3. Start with the lowest-conflict service moves: | |
| 156 | - - `runtime/safeguard_services.py` | |
| 157 | - - `runtime/rollback.py` | |
| 158 | - - `runtime/recovery.py` | |
| 159 | - - `runtime/parsing.py` | |
| 160 | - - `runtime/reasoning_types.py` | |
| 161 | - - `runtime/task_classification.py` | |
| 162 | -4. Rewire trunk's extracted controllers to those runtime-owned services instead of trying to resurrect cleanup's older `conversation.py` shape. | |
| 163 | -5. Re-apply only the cleanup deletions that still make sense after trunk's newer workflow/clarify changes. | |
| 164 | -6. Re-run: | |
| 165 | - - `uv run pytest -q` | |
| 166 | - - targeted workflow/runtime parity checks | |
| 167 | - - the blocked live interactive validation matrix once backend chat is healthy | |
| 168 | - | |
| 169 | -## What should not be lost | |
| 170 | - | |
| 171 | -From trunk: | |
| 172 | - | |
| 173 | -- semantic clarify grounding and pressure-pass work | |
| 174 | -- workflow state/policy/lane decomposition | |
| 175 | -- workflow recovery and artifact invalidation surfaces | |
| 176 | - | |
| 177 | -From cleanup: | |
| 178 | - | |
| 179 | -- zero direct `runtime -> agent/*` imports | |
| 180 | -- runtime-owned service seams instead of hidden legacy ownership | |
| 181 | -- deleted assistant puppeting behaviors | |
| 182 | -- the audit closure scoreboard and residual-debt honesty | |
| 183 | - | |
| 184 | -## Bottom line | |
| 185 | - | |
| 186 | -The divergence is real, but it is not bad news. | |
| 187 | - | |
| 188 | -Trunk appears to have pushed the product/runtime decomposition story further. | |
| 189 | -Cleanup pushed the contract/ownership story further. | |
| 190 | - | |
| 191 | -The right next move is to integrate cleanup's runtime-contract wins onto trunk's newer controller and workflow architecture, then re-sitrep from that integration branch. | |
.docs/sprints/index.mddeleted@@ -1,112 +0,0 @@ | ||
| 1 | -# Loader Sprint Index | |
| 2 | - | |
| 3 | -These sprints translate the `REPORT.md` findings into implementation lanes with clear deliverables, test strategy, and definition-of-done checkpoints. | |
| 4 | - | |
| 5 | -The plan was reshaped after a deeper validation pass against `refs/claw-code` and `refs/oh-my-codex`. The reshape (a) splits the most ambitious sprint, (b) reorders so the user's actual pain point (premature completion, weak follow-through) lands sooner, and (c) lands hooks alongside permissions so later sprints have a clean lifecycle to hang behavior on. | |
| 6 | - | |
| 7 | -## Phase 1: Runtime Foundation | |
| 8 | - | |
| 9 | -- [Sprint 00](sprint00.md) — Foundation, Measurement, and Parity Harness | |
| 10 | -- [Sprint 01](sprint01.md) — Turn Engine, Tool Contract, and Capability Profiles | |
| 11 | - | |
| 12 | -## Phase 2: Behavioral Contract | |
| 13 | - | |
| 14 | -- [Sprint 02](sprint02.md) — Definition of Done and Verify/Fix Loop | |
| 15 | - | |
| 16 | -## Phase 3: Safety and Workflow Discipline | |
| 17 | - | |
| 18 | -- [Sprint 03](sprint03.md) — Permission Modes and Tool Lifecycle Hooks | |
| 19 | -- [Sprint 04](sprint04.md) — Mode Router, Clarify, and Plan Artifacts | |
| 20 | - | |
| 21 | -## Phase 4: Durability and Product Surfaces | |
| 22 | - | |
| 23 | -- [Sprint 05](sprint05.md) — Session State, Memory, and Compaction | |
| 24 | -- [Sprint 06](sprint06.md) — Doctor, Explore, Status, and Tool Surface Expansion | |
| 25 | - | |
| 26 | -## Phase 5: Execution Policy and Runtime Simplification | |
| 27 | - | |
| 28 | -- [Sprint 07](sprint07.md) — Rule-Based Permissions and Runtime Decomposition | |
| 29 | - | |
| 30 | -## Phase 6: Prompt Contract and Operator Ergonomics | |
| 31 | - | |
| 32 | -- [Sprint 08](sprint08.md) — Prompt Builder, Runtime Phases, and Permission Operator UX | |
| 33 | - | |
| 34 | -## Phase 7: Workflow State Discipline | |
| 35 | - | |
| 36 | -- [Sprint 09](sprint09.md) — Turn State Machine, Workflow Contracts, and Prompt Preview | |
| 37 | - | |
| 38 | -## Phase 8: Workflow Policy and Traceability | |
| 39 | - | |
| 40 | -- [Sprint 10](sprint10.md) — Route Pressure, Clarify Depth, and Workflow Timeline | |
| 41 | - | |
| 42 | -## Phase 9: Semantic Workflow and Orchestration | |
| 43 | - | |
| 44 | -- [Sprint 11](sprint11.md) — Semantic Signals, Clarify Strategy, and Orchestrator Split | |
| 45 | - | |
| 46 | -## Phase 10: Interview Rigor and Recovery Evidence | |
| 47 | - | |
| 48 | -- [Sprint 12](sprint12.md) — Interview Pressure, Semantic Evidence, and Turn Orchestration | |
| 49 | - | |
| 50 | -## Phase 11: Semantic Change and Operator Diffs | |
| 51 | - | |
| 52 | -- [Sprint 13](sprint13.md) — Turn Policy Narrowing, Assumption Ledger, and Artifact Diffs | |
| 53 | - | |
| 54 | -## Phase 12: Runtime Consolidation After Audit Merge | |
| 55 | - | |
| 56 | -- [Sprint 14](sprint14.md) — Runtime Context Adoption, Legacy Burn-Down, and Policy Narrowing | |
| 57 | - | |
| 58 | -## Phase 13: Runtime Bootstrap and Service Ownership | |
| 59 | - | |
| 60 | -- [Sprint 15](sprint15.md) — Bootstrap Ownership, Service Burn-Down, and Explore Independence | |
| 61 | - | |
| 62 | -## Phase 14: Entrypoint Shell and Explore Continuity | |
| 63 | - | |
| 64 | -- [Sprint 16](sprint16.md) — Entrypoint Shell, Launcher Contract, and Explore Continuity | |
| 65 | - | |
| 66 | -## Phase 15: Public Boundary and Honest Turn Contract | |
| 67 | - | |
| 68 | -- [Sprint 17](sprint17.md) — Bootstrap Source Narrowing, Turn Contract Tightening, and Explore Operator UX | |
| 69 | - | |
| 70 | -## Phase 16: Shell Minimalism and Completion Honesty | |
| 71 | - | |
| 72 | -- [Sprint 18](sprint18.md) — Shell Minimalism, Completion Contract, and Runtime Policy Trace | |
| 73 | - | |
| 74 | -## Phase 17: Facade Finalization and Policy Accountability | |
| 75 | - | |
| 76 | -- [Sprint 19](sprint19.md) — Facade Finalization, Continuation Hardening, and Unified Policy Timeline | |
| 77 | - | |
| 78 | -## Phase 18: Canonical Policy Contracts and Facade Settlement | |
| 79 | - | |
| 80 | -- [Sprint 20](sprint20.md) — Canonical Policy Events, Verifier-Backed Follow-Through, and Facade Settlement | |
| 81 | - | |
| 82 | -## Phase 19: Evidence Provenance and Runtime-First Narrowing | |
| 83 | - | |
| 84 | -- [Sprint 21](sprint21.md) — Evidence Provenance, Read-Model Cleanup, and Runtime-First API | |
| 85 | - | |
| 86 | -## Phase 20: Runtime-First Entry and Verification Observability | |
| 87 | - | |
| 88 | -- [Sprint 22](sprint22.md) — Runtime Entry API, Verification Observations, and Compatibility Narrowing | |
| 89 | - | |
| 90 | -## Phase 21: Runtime-First Integrations and Verification Producers | |
| 91 | - | |
| 92 | -- [Sprint 23](sprint23.md) — Runtime-First Integrations, Verification Producers, and Facade Narrowing | |
| 93 | - | |
| 94 | -## Phase 22: TUI Runtime Convergence and Verification Lifecycle | |
| 95 | - | |
| 96 | -- [Sprint 24](sprint24.md) — TUI Runtime Convergence, Verification Lifecycle, and Facade Narrowing | |
| 97 | - | |
| 98 | -## Phase 23: External Runtime Boundary and Verification Attempt Semantics | |
| 99 | - | |
| 100 | -- [Sprint 25](sprint25.md) — Public Runtime API, Verification Attempts, and Boundary Narrowing | |
| 101 | - | |
| 102 | -## Phase 24: Attempt Histories and Runtime API Delamination | |
| 103 | - | |
| 104 | -- [Sprint 26](sprint26.md) — Verification Attempt Timelines and Public Facade Delamination | |
| 105 | - | |
| 106 | -## Working principles | |
| 107 | - | |
| 108 | -- Each sprint must end with stronger runtime reliability, not just more features. | |
| 109 | -- Prefer behavior that improves any capable model over model-specific prompting tricks. | |
| 110 | -- Add new workflow/state surfaces only when they reduce prompt pressure or improve verification. | |
| 111 | -- No sprint is complete until its behavior is covered by automated tests or a deterministic harness. | |
| 112 | -- When adding lifecycle behavior (validation, dedup, verification, observability), prefer hooking into the tool lifecycle from Sprint 03 over patching the runtime loop directly. | |
.docs/sprints/sprint00.mddeleted@@ -1,107 +0,0 @@ | ||
| 1 | -# Sprint 00: Foundation, Measurement, and Parity Harness | |
| 2 | - | |
| 3 | -## Prerequisites | |
| 4 | - | |
| 5 | -None. This is the stabilization sprint before major behavior work. | |
| 6 | - | |
| 7 | -## Goals | |
| 8 | - | |
| 9 | -Make Loader measurable and trustworthy enough to improve deliberately. | |
| 10 | - | |
| 11 | -This sprint exists to prevent us from adding more agent behavior on top of: | |
| 12 | - | |
| 13 | -- a monolithic runtime loop (`agent/loop.py` is 1929 LOC, `agent/reasoning.py` is 1196 LOC, `agent/safeguards.py` is 1079 LOC — together about 4200 lines of orchestration in one cluster) | |
| 14 | -- a structurally broken tool-result code path that has zero coverage | |
| 15 | -- broken default test discovery (pytest currently picks up `refs/claw-code/tests/` and fails to import the `loader` package) | |
| 16 | -- weak operational polish | |
| 17 | - | |
| 18 | -## Deliverables | |
| 19 | - | |
| 20 | -### 1. Failing regression test for the `tool_call_id` runtime contract bug — DO THIS FIRST | |
| 21 | - | |
| 22 | -Before any harness or hygiene work, write a failing pytest case that drives the duplicate-suppression and pre-validation branches in `src/loader/agent/loop.py`. | |
| 23 | - | |
| 24 | -The bug: | |
| 25 | - | |
| 26 | -- `src/loader/llm/base.py:33-39` defines `Message` with `role`, `content`, `tool_calls`, `tool_results` — and **no** `tool_call_id` field. That field belongs to the separate `ToolResult` dataclass at `src/loader/llm/base.py:25-30`. | |
| 27 | -- `src/loader/agent/loop.py:885` and `:906` both construct `Message(role=Role.TOOL, content=..., tool_call_id=tool_call.id)`. | |
| 28 | -- Both call sites raise `TypeError: Message.__init__() got an unexpected keyword argument 'tool_call_id'` the first time they execute. | |
| 29 | -- They live on the duplicate-suppression branch and the pre-validation-failure branch, neither of which has any integration coverage today. | |
| 30 | - | |
| 31 | -Why this is the first deliverable: | |
| 32 | - | |
| 33 | -- it proves the bug is real | |
| 34 | -- it proves the harness exists | |
| 35 | -- it gives Sprint 01 a green-bar target rather than a vague refactor goal | |
| 36 | -- it answers the question "are there other bugs like this?" by forcing us to actually drive the loop | |
| 37 | - | |
| 38 | -The test should fail today and remain failing until Sprint 01 fixes the message contract. | |
| 39 | - | |
| 40 | -### 2. Project hygiene and product basics | |
| 41 | - | |
| 42 | -- rewrite `README.md` (currently still says "FortranGoingOnForty / A tutorial on using Fortran for beginners") | |
| 43 | -- ensure `refs/` remains gitignored | |
| 44 | -- document the current runtime surface and limitations in `.docs/` | |
| 45 | - | |
| 46 | -### 3. Test execution that works by default | |
| 47 | - | |
| 48 | -The current state of `uv run pytest --collect-only`: | |
| 49 | - | |
| 50 | -- picks up `refs/claw-code/tests/test_porting_workspace.py` (because there is no `tool.pytest.ini_options` block in `pyproject.toml`) | |
| 51 | -- fails to import the `loader` package in the resolved env | |
| 52 | -- collects 0 tests | |
| 53 | - | |
| 54 | -Fix: | |
| 55 | - | |
| 56 | -- add `[tool.pytest.ini_options]` to `pyproject.toml` with `testpaths = ["tests"]` | |
| 57 | -- ensure the package is importable under the test invocation path | |
| 58 | -- add `extend-exclude = ["refs"]` to the ruff configuration so refs/ does not contaminate lint runs | |
| 59 | -- add `exclude = ["refs/"]` to the mypy configuration for the same reason | |
| 60 | -- document the canonical dev/test invocation in the README and in `CLAUDE.md` | |
| 61 | -- `uv run pytest` (with no flags) must succeed out of the box | |
| 62 | - | |
| 63 | -### 4. Runtime behavior harness | |
| 64 | - | |
| 65 | -Create a deterministic mock backend and scenario harness for the current turn loop. | |
| 66 | - | |
| 67 | -**Pattern to copy:** `refs/claw-code/rust/crates/rusty-claude-cli/tests/mock_parity_harness.rs`. claw-code already implements this exact design — a mock service plus a scripted scenario taxonomy. Port the taxonomy directly rather than inventing a parallel one. | |
| 68 | - | |
| 69 | -Minimum scenarios (mirroring claw-code's list, adapted for Loader): | |
| 70 | - | |
| 71 | -- simple answer with no tools (`streaming_text`) | |
| 72 | -- single read tool call (`read_file_roundtrip`) | |
| 73 | -- multi-tool turn (`multi_tool_turn_roundtrip`) | |
| 74 | -- write allowed (`write_file_allowed`) | |
| 75 | -- write denied (`write_file_denied`) — initially via skip-confirmation default; rewired to permission policy in Sprint 03 | |
| 76 | -- bash success (`bash_stdout_roundtrip`) | |
| 77 | -- bash confirmation prompt approved/denied | |
| 78 | -- extracted/raw-text tool call fallback | |
| 79 | -- completion-check continuation | |
| 80 | -- duplicate action suppression (this scenario will hit the bug from deliverable 1) | |
| 81 | - | |
| 82 | -### 5. Baseline behavior document | |
| 83 | - | |
| 84 | -Create a parity checklist for Loader's own runtime behavior: | |
| 85 | - | |
| 86 | -- what is supported | |
| 87 | -- what is flaky | |
| 88 | -- what is intentionally out of scope | |
| 89 | -- what scenarios must stay green | |
| 90 | - | |
| 91 | -## Testing strategy | |
| 92 | - | |
| 93 | -- the `tool_call_id` regression test fails today (pre-fix) and is committed in its failing state, gated only by the harness | |
| 94 | -- default `uv run pytest` succeeds and collects only Loader's own tests | |
| 95 | -- harness scenarios produce stable results | |
| 96 | -- at least one integration test exercises the full turn loop end-to-end | |
| 97 | -- baseline parity checklist is committed and auditable | |
| 98 | - | |
| 99 | -## Definition of done | |
| 100 | - | |
| 101 | -- the `tool_call_id` regression test exists and is failing for the right reason | |
| 102 | -- `README.md` correctly describes Loader | |
| 103 | -- `pytest`/`uv` workflow is defined and working | |
| 104 | -- ruff and mypy do not walk into `refs/` | |
| 105 | -- Loader has a deterministic runtime test harness with the scenario taxonomy ported from claw-code | |
| 106 | -- current runtime behavior is documented honestly | |
| 107 | -- we can measure regressions before changing the loop | |
.docs/sprints/sprint01.mddeleted@@ -1,127 +0,0 @@ | ||
| 1 | -# Sprint 01: Turn Engine, Tool Contract, and Capability Profiles | |
| 2 | - | |
| 3 | -## Prerequisites | |
| 4 | - | |
| 5 | -Sprint 00 | |
| 6 | - | |
| 7 | -## Goals | |
| 8 | - | |
| 9 | -Replace Loader's current monolithic loop with a smaller, typed, explicit turn engine, fix the structural bugs Sprint 00's tests are catching, and stop guessing model capabilities from substring matches. | |
| 10 | - | |
| 11 | -The reference for this sprint is `refs/claw-code/rust/crates/runtime/src/conversation.rs:295-470` — about 175 lines that do what `agent/loop.py` takes ~1500 lines to half-do. Loader's runtime should aim for that shape, not that language. | |
| 12 | - | |
| 13 | -## Deliverables | |
| 14 | - | |
| 15 | -### 1. New runtime package | |
| 16 | - | |
| 17 | -Create a dedicated runtime layer: | |
| 18 | - | |
| 19 | -```text | |
| 20 | -src/loader/runtime/ | |
| 21 | -├── conversation.py # the typed turn engine (analog of conversation.rs) | |
| 22 | -├── session.py # message history, ownership of the conversation state | |
| 23 | -├── executor.py # the unified tool execution path (see deliverable 2) | |
| 24 | -├── events.py # the typed AgentEvent surface, moved out of agent/loop.py | |
| 25 | -├── tracing.py # debug/observability hooks | |
| 26 | -└── capabilities.py # see deliverable 4 | |
| 27 | -``` | |
| 28 | - | |
| 29 | -The new runtime package owns runtime correctness. `agent/loop.py` becomes a thin orchestration layer that builds prompts and dispatches to `runtime.conversation.run_turn()`. Reasoning helpers stay in `agent/reasoning.py` but are *called* by the runtime, not embedded inside its loop. | |
| 30 | - | |
| 31 | -### 2. Unified tool execution path | |
| 32 | - | |
| 33 | -Loader currently has two effectively distinct execution paths: | |
| 34 | - | |
| 35 | -- the main native/ReAct path | |
| 36 | -- the "raw extracted JSON tool call" fallback path (used when the model leaks tool syntax through streaming) | |
| 37 | - | |
| 38 | -These paths duplicate confirmation, validation, dedup, and result-recording logic. Fixes in one rarely land in the other. | |
| 39 | - | |
| 40 | -Sprint 01 collapses both into a single `runtime.executor.ToolExecutor`. The executor owns: | |
| 41 | - | |
| 42 | -- authorization (delegating to whatever permission contract exists today; the policy layer arrives in Sprint 03) | |
| 43 | -- duplicate suppression | |
| 44 | -- tool execution | |
| 45 | -- tool-result message construction (using the corrected `Message` schema from deliverable 3) | |
| 46 | -- tracing | |
| 47 | -- error classification | |
| 48 | - | |
| 49 | -There is exactly one path tool calls flow through, regardless of how they were extracted from the model output. | |
| 50 | - | |
| 51 | -### 3. Correct tool-result message model — fix the named bug | |
| 52 | - | |
| 53 | -**Bug A** (caught by Sprint 00's failing regression test): `src/loader/agent/loop.py:885` and `:906` construct `Message(role=Role.TOOL, content=..., tool_call_id=tool_call.id)` against a `Message` dataclass at `src/loader/llm/base.py:33-39` that has no `tool_call_id` field. Sprint 00 wrote the regression test; Sprint 01 makes it pass. | |
| 54 | - | |
| 55 | -The fix is *not* to add a `tool_call_id` kwarg to `Message`. The fix is to introduce an explicit tool-result message representation — either a `ToolResultMessage` class or a richer `Message` schema where tool-result rows carry a typed `ToolResult` payload (the existing `ToolResult` dataclass at `src/loader/llm/base.py:25-30` already has `tool_call_id`, `content`, `is_error` and is the right shape). | |
| 56 | - | |
| 57 | -**Bug B** (uncovered by deliverable 2): the duplicate execution path. Same fix mechanism — there is one executor, so there is one place that constructs tool-result messages. | |
| 58 | - | |
| 59 | -The Sprint 00 regression tests gate this work. They must turn green before the sprint can close. | |
| 60 | - | |
| 61 | -### 4. Capability profiles — replace substring-based model detection | |
| 62 | - | |
| 63 | -Loader currently decides whether to use native tool calling or ReAct prompting by substring-matching against two hard-coded sets in `src/loader/llm/ollama.py`: | |
| 64 | - | |
| 65 | -```python | |
| 66 | -NATIVE_TOOL_MODELS = {"llama3.1", "llama3.2", ...} | |
| 67 | -NO_TOOL_MODELS = {"phi", "gemma", ...} | |
| 68 | -``` | |
| 69 | - | |
| 70 | -This is brittle. Adding a new model means editing source. Models that should work do not, and vice versa. The user explicitly wants Loader to behave consistently across model choices. | |
| 71 | - | |
| 72 | -Replace this with a `runtime/capabilities.py` module that defines a `CapabilityProfile` dataclass: | |
| 73 | - | |
| 74 | -- `supports_native_tools: bool` | |
| 75 | -- `supports_streaming: bool` | |
| 76 | -- `context_window: int` | |
| 77 | -- `preferred_tool_call_format: Literal["native", "json_tag", "bracket"]` | |
| 78 | -- `verification_strictness: Literal["lax", "standard", "strict"]` | |
| 79 | -- `notes: list[str]` | |
| 80 | - | |
| 81 | -Profiles are resolved by: | |
| 82 | - | |
| 83 | -1. explicit user override (CLI flag or config file) | |
| 84 | -2. exact model-name match in a built-in registry | |
| 85 | -3. heuristic fallback (probe `/api/show` from Ollama, inspect `details.families`, fall back to safe defaults) | |
| 86 | - | |
| 87 | -The runtime asks the profile what to do; it never substring-matches model names. | |
| 88 | - | |
| 89 | -### 5. Turn summary output | |
| 90 | - | |
| 91 | -Each completed turn produces a structured `TurnSummary` containing: | |
| 92 | - | |
| 93 | -- assistant messages | |
| 94 | -- tool results | |
| 95 | -- iterations | |
| 96 | -- failures | |
| 97 | -- verification status (filled in by Sprint 02) | |
| 98 | -- usage metadata if available | |
| 99 | - | |
| 100 | -Modeled on `refs/claw-code/rust/crates/runtime/src/conversation.rs:110-117` (`TurnSummary` struct). | |
| 101 | - | |
| 102 | -## Testing strategy | |
| 103 | - | |
| 104 | -- Sprint 00's `tool_call_id` regression test passes | |
| 105 | -- Sprint 00's duplicate-suppression scenario passes through the unified executor | |
| 106 | -- new integration tests cover native-tool turns and ReAct-style turns going through the same executor | |
| 107 | -- regression tests prove that extracted-fallback and native calls share one path (e.g., assert the same trace events fire from both entry points) | |
| 108 | -- capability profiles have unit tests for the resolution priority order | |
| 109 | -- a `TurnSummary` smoke test asserts the structured output is populated for a multi-tool turn | |
| 110 | - | |
| 111 | -## Definition of done | |
| 112 | - | |
| 113 | -- `agent/loop.py` is no longer the sole owner of runtime correctness — it delegates to `runtime.conversation` | |
| 114 | -- tool execution logic is centralized in `runtime.executor` | |
| 115 | -- the message schema is internally consistent and the named `tool_call_id` bug is fixed | |
| 116 | -- capability profiles replace substring-based model detection | |
| 117 | -- full-turn tests cover the critical runtime paths | |
| 118 | -- the parity checklist from Sprint 00 reflects the new state | |
| 119 | - | |
| 120 | -## Audit Notes | |
| 121 | - | |
| 122 | -Audit checkpoint on 2026-04-06: | |
| 123 | - | |
| 124 | -- closed the gap where Ollama `/api/show` probing existed in code but did not affect the agent before the first request | |
| 125 | -- collapsed `Agent.run_streaming()` onto the primary runtime path so Loader no longer carries a second shadow execution loop | |
| 126 | -- expanded deterministic parity coverage for `TurnSummary`, shared executor tracing, capability refresh, and streaming delegation | |
| 127 | -- residual debt remains in runtime size and heuristics, but the Sprint 01 contract is now implemented rather than partially scaffolded | |
.docs/sprints/sprint02.mddeleted@@ -1,134 +0,0 @@ | ||
| 1 | -# Sprint 02: Definition of Done and Verify/Fix Loop | |
| 2 | - | |
| 3 | -## Prerequisites | |
| 4 | - | |
| 5 | -Sprint 01 | |
| 6 | - | |
| 7 | -## Goals | |
| 8 | - | |
| 9 | -Replace heuristic completion with an evidence-backed completion contract. This is the single highest-leverage behavioral change in the plan and the direct answer to the user's stated complaints: | |
| 10 | - | |
| 11 | -- finishing too early without followup | |
| 12 | -- weak tool follow-through | |
| 13 | -- poor task closure | |
| 14 | - | |
| 15 | -The reference for the contract shape is `refs/oh-my-codex/skills/ralph/SKILL.md` (the persistence-until-done protocol) and `refs/oh-my-codex/src/verification/verifier.ts` (task-size-aware evidence scaling). | |
| 16 | - | |
| 17 | -This sprint deliberately runs *before* permission modes (Sprint 03). Permissions are a safety win; this is the behavior win, and the user asked for behavior first. | |
| 18 | - | |
| 19 | -## Deliverables | |
| 20 | - | |
| 21 | -### 1. Definition of Done object | |
| 22 | - | |
| 23 | -Add a `DefinitionOfDone` dataclass that every non-trivial task carries from start to finish: | |
| 24 | - | |
| 25 | -```python | |
| 26 | -@dataclass | |
| 27 | -class DefinitionOfDone: | |
| 28 | - task_statement: str | |
| 29 | - acceptance_criteria: list[str] # the testable claims the task must satisfy | |
| 30 | - verification_commands: list[str] # commands whose output is the evidence | |
| 31 | - pending_items: list[str] # outstanding subtasks (zero before completion) | |
| 32 | - completed_items: list[str] # finished subtasks (audit trail) | |
| 33 | - evidence: list[VerificationEvidence] # populated by the verify phase | |
| 34 | - confidence: Literal["high", "medium", "low"] | |
| 35 | - status: Literal["draft", "in_progress", "verifying", "fixing", "done", "failed"] | |
| 36 | -``` | |
| 37 | - | |
| 38 | -The DoD object is constructed at task entry and updated as the runtime advances. It is the single source of truth for "is this task done?". | |
| 39 | - | |
| 40 | -### 2. Verify phase in the runtime | |
| 41 | - | |
| 42 | -Add an explicit verify phase to `runtime.conversation` that runs *after* the model thinks it is done but *before* the turn returns a final answer. | |
| 43 | - | |
| 44 | -The verify phase: | |
| 45 | - | |
| 46 | -- runs each `verification_commands` entry as a tool call | |
| 47 | -- captures stdout, stderr, and exit code as `VerificationEvidence` | |
| 48 | -- attaches each piece of evidence to the DoD object | |
| 49 | -- gates completion on (a) all verification commands exited zero, (b) `pending_items` is empty, (c) the model has produced an evidence summary that references the captured output | |
| 50 | - | |
| 51 | -If any of those gates fail, the runtime moves into the fix phase rather than completing. | |
| 52 | - | |
| 53 | -### 3. Fix loop | |
| 54 | - | |
| 55 | -Verification failure does not return to the user. It returns to execution with: | |
| 56 | - | |
| 57 | -- the failed evidence attached to the next prompt | |
| 58 | -- a structured "what failed and why" message | |
| 59 | -- a bounded retry budget (default 3 attempts; configurable per task via DoD) | |
| 60 | -- escalation to the user only when the budget is exhausted | |
| 61 | - | |
| 62 | -This is the same shape as `refs/oh-my-codex/skills/ralph/SKILL.md` Step 7.5–7.6 (mandatory verification + regression re-verification + retry on failure). | |
| 63 | - | |
| 64 | -### 4. Task-size-aware evidence requirements | |
| 65 | - | |
| 66 | -Borrow the sizing model from `refs/oh-my-codex/src/verification/verifier.ts:99-106`: | |
| 67 | - | |
| 68 | -- **small** (≤3 files, <100 lines changed): minimal verification — typecheck + tests for affected modules | |
| 69 | -- **standard** (≤15 files, <500 lines): full stack — typecheck + tests + lint + smoke | |
| 70 | -- **large** (>15 files or >500 lines): comprehensive — typecheck + tests + lint + integration + regression | |
| 71 | - | |
| 72 | -Sizing is computed from the actual tool call history of the turn, not guessed from the prompt. | |
| 73 | - | |
| 74 | -Conversational/lookup-only tasks (no tool calls that mutate state) skip the verify phase entirely. This is what keeps simple tasks cheap. | |
| 75 | - | |
| 76 | -### 5. Minimum `.loader/` directory layout | |
| 77 | - | |
| 78 | -Create the minimum state directory shape so DoD objects can be persisted: | |
| 79 | - | |
| 80 | -```text | |
| 81 | -.loader/ | |
| 82 | -└── dod/ | |
| 83 | - └── {timestamp}-{task-slug}.json | |
| 84 | -``` | |
| 85 | - | |
| 86 | -This is intentionally narrow. The full session/memory/compaction layout is Sprint 05's job. Sprint 02 only needs somewhere to put DoD objects so that fix-loop continuations can reload them after process restarts, and so that future sprints have a directory to extend. | |
| 87 | - | |
| 88 | -`.loader/` should be added to `.gitignore` as part of this sprint. | |
| 89 | - | |
| 90 | -### 6. CLI/TUI surfaces for the DoD state | |
| 91 | - | |
| 92 | -The TUI status line and the non-TUI CLI output both need to surface: | |
| 93 | - | |
| 94 | -- current DoD status (`draft` / `in_progress` / `verifying` / `fixing` / `done`) | |
| 95 | -- pending items count | |
| 96 | -- last verification result | |
| 97 | - | |
| 98 | -This is what makes the contract visible to the user instead of hidden inside the runtime. | |
| 99 | - | |
| 100 | -## Testing strategy | |
| 101 | - | |
| 102 | -- a task with verification commands cannot complete without evidence (assert: completion attempted before evidence collected → routed to verify phase) | |
| 103 | -- a verification failure routes to fix loop, not to final answer | |
| 104 | -- the fix loop respects the retry budget and escalates to the user on exhaustion | |
| 105 | -- a conversational task (no mutating tool calls) skips verify entirely | |
| 106 | -- DoD objects round-trip through `.loader/dod/` and survive a simulated process restart | |
| 107 | -- task sizing classifies correctly across small/standard/large boundaries | |
| 108 | -- the CLI/TUI status surfaces show the right phase at each transition | |
| 109 | - | |
| 110 | -## Definition of done | |
| 111 | - | |
| 112 | -- Loader no longer relies on heuristic continuation prompts alone | |
| 113 | -- completion is explicit and evidence-backed | |
| 114 | -- `DefinitionOfDone` objects exist on disk under `.loader/dod/` | |
| 115 | -- failed verification cannot escape into a "looks done" final answer | |
| 116 | -- simple tasks stay cheap (verify is skipped); complex tasks enter the verify/fix loop automatically | |
| 117 | -- the user can see the DoD phase from the CLI and TUI | |
| 118 | - | |
| 119 | -## Audit Notes | |
| 120 | - | |
| 121 | -Audit checkpoint on 2026-04-06: | |
| 122 | - | |
| 123 | -- added a persisted `DefinitionOfDone` runtime object under `src/loader/runtime/dod.py` and store-backed state under `.loader/dod/` | |
| 124 | -- routed mutating tasks through an explicit verify/fix gate in `src/loader/runtime/conversation.py`, with retry-budget exhaustion returning an honest failure summary instead of a premature success | |
| 125 | -- taught verification runs to execute through the shared executor with duplicate suppression disabled, confirmations skipped, and project-root working-directory awareness | |
| 126 | -- tightened duplicate suppression so rewrites used for recovery are allowed while true same-content rewrites are still skipped | |
| 127 | -- surfaced DoD state in both the non-TUI CLI and the TUI status line, and added deterministic coverage for runtime parity, DoD persistence/sizing, and status formatting | |
| 128 | -- full verification is green at `uv run pytest -q` with 90 passing tests | |
| 129 | - | |
| 130 | -Residual debt after Sprint 02: | |
| 131 | - | |
| 132 | -- DoD acceptance criteria and pending items are still runtime-derived and shallow; Loader does not yet have the richer task/workflow artifacts planned in Sprint 04 and Sprint 05 | |
| 133 | -- verification summaries are runtime-generated from captured evidence rather than model-authored evidence explanations | |
| 134 | -- task-size-aware verification is intentionally conservative today; larger-task evidence scaling still has room to move closer to the reference verifier design | |
.docs/sprints/sprint03.mddeleted@@ -1,135 +0,0 @@ | ||
| 1 | -# Sprint 03: Permission Modes and Tool Lifecycle Hooks | |
| 2 | - | |
| 3 | -## Prerequisites | |
| 4 | - | |
| 5 | -Sprint 02 | |
| 6 | - | |
| 7 | -## Goals | |
| 8 | - | |
| 9 | -Move Loader from confirmation-only safety to policy-based runtime safety, and introduce the tool lifecycle hooks that all subsequent runtime behavior should hang on. | |
| 10 | - | |
| 11 | -The two deliverable groups land together because they share the same code path. claw-code's `conversation.rs:370-453` shows the pattern: every tool call flows through pre-hook → permission check → execute → post-hook (success or failure variant). Loader needs the same shape so that Sprint 04's mode router and Sprint 05's session/memory work plug into a stable lifecycle instead of patching `loop.py` again. | |
| 12 | - | |
| 13 | -This sprint deliberately runs *after* the DoD work in Sprint 02, because permissions are a safety win, not a behavior win, and the user asked for behavior first. | |
| 14 | - | |
| 15 | -## Deliverables | |
| 16 | - | |
| 17 | -### 1. Permission modes | |
| 18 | - | |
| 19 | -Add explicit runtime permission modes mirroring `refs/claw-code/rust/crates/runtime/src/permissions.rs:8-27`: | |
| 20 | - | |
| 21 | -- `read-only` | |
| 22 | -- `workspace-write` | |
| 23 | -- `danger-full-access` | |
| 24 | - | |
| 25 | -(claw-code also has `Prompt` and `Allow` variants — Loader can defer those.) | |
| 26 | - | |
| 27 | -Each tool declares the minimum permission level it requires via a `required_permission` attribute on `Tool`. The default registry's tools map roughly to: | |
| 28 | - | |
| 29 | -- `read-only`: `read`, `glob`, `grep` | |
| 30 | -- `workspace-write`: `write`, `edit` | |
| 31 | -- `danger-full-access`: `bash` | |
| 32 | - | |
| 33 | -### 2. PermissionPolicy | |
| 34 | - | |
| 35 | -A `PermissionPolicy` object owns the active mode and the per-tool requirement map: | |
| 36 | - | |
| 37 | -```python | |
| 38 | -@dataclass | |
| 39 | -class PermissionPolicy: | |
| 40 | - active_mode: PermissionMode | |
| 41 | - tool_requirements: dict[str, PermissionMode] | |
| 42 | - workspace_root: Path | |
| 43 | -``` | |
| 44 | - | |
| 45 | -The rule layer (`allow_rules` / `deny_rules` / `ask_rules`) from claw-code is **deferred**. Sprint 03 only needs modes and tool requirements; rules can come later if they prove necessary. | |
| 46 | - | |
| 47 | -### 3. Tool lifecycle hooks — three events | |
| 48 | - | |
| 49 | -Add a three-event hook lifecycle modeled directly on `refs/claw-code/rust/crates/runtime/src/hooks.rs:19-34`: | |
| 50 | - | |
| 51 | -- `pre_tool_use` — runs before authorization; can override input, deny, or inject messages | |
| 52 | -- `post_tool_use` — runs after successful execution; can modify output or add follow-up messages | |
| 53 | -- `post_tool_use_failure` — runs after a tool error; separate from `post_tool_use` so failure handling does not have to branch on `is_error` | |
| 54 | - | |
| 55 | -The `runtime.executor.ToolExecutor` from Sprint 01 calls hooks at each lifecycle point. Hook results are typed (allow / deny / cancel / fail / inject-message / override-permission) and the executor merges hook feedback into the tool result message. | |
| 56 | - | |
| 57 | -This is the most important architectural piece in the sprint, because Sprints 04, 05, and 06 will all want to add lifecycle behavior. Without hooks, every later sprint would patch `loop.py` again. | |
| 58 | - | |
| 59 | -### 4. Refactor safeguards.py into hook implementations | |
| 60 | - | |
| 61 | -`src/loader/agent/safeguards.py` (1079 LOC) currently does duplicate detection, validation, and rollback tracking via ad-hoc method calls inside `loop.py`. Refactor each of those into a `pre_tool_use` hook implementation: | |
| 62 | - | |
| 63 | -- `DuplicateActionHook` — what `safeguards.check_duplicate()` does today | |
| 64 | -- `ActionValidationHook` — what `safeguards.validate_action()` does today | |
| 65 | -- `RollbackTrackingHook` — what `loop.py:917-935` does today | |
| 66 | - | |
| 67 | -The streaming filter (`CodeBlockFilter`) is a separate concern and stays for now, but Sprint 03 should mark it as "candidate for removal once the typed runtime makes the leakage it filters impossible." | |
| 68 | - | |
| 69 | -### 5. File operation hardening | |
| 70 | - | |
| 71 | -Match the safety guards in `refs/claw-code/rust/crates/runtime/src/file_ops.rs`: | |
| 72 | - | |
| 73 | -- workspace boundary enforcement (with `canonicalize()` before the boundary check, to defeat symlink escapes) | |
| 74 | -- file size limits (10 MB read, 10 MB write — same as claw-code's `MAX_READ_SIZE` / `MAX_WRITE_SIZE`) | |
| 75 | -- binary file detection (NUL byte in the first 8 KB) | |
| 76 | -- structured patch metadata for edits/writes (return `StructuredPatchHunk` data alongside the human-readable diff) | |
| 77 | - | |
| 78 | -These guards live in the file tools themselves, not in hooks — they are intrinsic to the operation, not policy. | |
| 79 | - | |
| 80 | -### 6. Shell operation hardening | |
| 81 | - | |
| 82 | -- command mutability classification (read-only vs mutating, used by the `read-only` mode policy) | |
| 83 | -- read-only safe-command policy | |
| 84 | -- prompt-mode authorization path | |
| 85 | -- structured stderr/exit-code result | |
| 86 | -- output truncation with metadata when output exceeds a budget | |
| 87 | - | |
| 88 | -### 7. CLI/TUI visibility | |
| 89 | - | |
| 90 | -Expose the active permission mode in: | |
| 91 | - | |
| 92 | -- the TUI status line | |
| 93 | -- the non-TUI CLI startup banner | |
| 94 | -- `loader status` (which lands in Sprint 06, but wire the data source now) | |
| 95 | - | |
| 96 | -Show the mode with a color hint: green for `read-only`, yellow for `workspace-write`, red for `danger-full-access`. | |
| 97 | - | |
| 98 | -## Testing strategy | |
| 99 | - | |
| 100 | -- `read-only` mode denies writes and mutating shell commands; verify via the unified executor and the hook lifecycle | |
| 101 | -- `workspace-write` allows in-repo changes but denies writes outside `workspace_root` | |
| 102 | -- `danger-full-access` allows everything | |
| 103 | -- file boundary tests cover `../`, symlink escape (canonicalized first), binary, and oversize cases | |
| 104 | -- shell tests cover read-only safe-command policy and output truncation | |
| 105 | -- hook lifecycle tests assert all three events fire in order, and that a `pre_tool_use` deny short-circuits execution but still produces a typed tool-result message | |
| 106 | -- the refactored `DuplicateActionHook` / `ActionValidationHook` / `RollbackTrackingHook` produce the same observable behavior as the old `safeguards.py` paths (use Sprint 00's harness scenarios as the regression suite) | |
| 107 | -- verifying that hooks compose: a `pre_tool_use` deny + a `post_tool_use_failure` hook still emits exactly one tool-result message | |
| 108 | - | |
| 109 | -## Definition of done | |
| 110 | - | |
| 111 | -- Loader has explicit permission modes | |
| 112 | -- file and shell safety rules are enforced in the runtime, not in the UI | |
| 113 | -- the three-event tool lifecycle is in place and `safeguards.py` has been refactored into hook implementations | |
| 114 | -- the streaming filter is annotated as deprecated-pending-removal | |
| 115 | -- the safety behavior is covered by automated tests | |
| 116 | -- the CLI/TUI surfaces the active mode | |
| 117 | -- Sprint 04, 05, and 06 have a clean lifecycle to plug into instead of patching `loop.py` | |
| 118 | - | |
| 119 | -## Audit Notes | |
| 120 | - | |
| 121 | -Audit checkpoint on 2026-04-06: | |
| 122 | - | |
| 123 | -- added `PermissionMode`, `PermissionPolicy`, and lazy runtime exports under `src/loader/runtime/permissions.py` and `src/loader/runtime/__init__.py` | |
| 124 | -- refactored tool execution so `ToolExecutor` now runs hooks before and after policy evaluation in `src/loader/runtime/executor.py` | |
| 125 | -- added lifecycle hook infrastructure in `src/loader/runtime/hooks.py`, including `DuplicateActionHook`, `ActionValidationHook`, `RollbackTrackingHook`, and a success-side action-history hook for loop/dedup tracking | |
| 126 | -- hardened file and search tools with canonicalized workspace-root enforcement, symlink escape blocking, binary detection, file-size limits, and structured patch metadata | |
| 127 | -- hardened shell execution with permission classification, structured truncation metadata, and mode-aware authorization | |
| 128 | -- surfaced the active permission mode in the CLI startup banner and the TUI status line, and wired `Agent.active_permission_mode` as the current data source for later status/session work | |
| 129 | -- full verification is green at `uv run pytest -q` with 106 passing tests | |
| 130 | - | |
| 131 | -Residual debt after Sprint 03: | |
| 132 | - | |
| 133 | -- Loader now has mode-based permission policy, but the richer rule system (`allow` / `deny` / `ask`) is still deferred | |
| 134 | -- destructive operations still flow through the legacy confirmation path after policy allows them, so Loader has not fully matched claw-code's prompt/allow permission model | |
| 135 | -- shell mutability classification is still heuristic and intentionally conservative rather than deeply semantic | |
.docs/sprints/sprint04.mddeleted@@ -1,141 +0,0 @@ | ||
| 1 | -# Sprint 04: Mode Router, Clarify Mode, and Plan Artifacts | |
| 2 | - | |
| 3 | -## Prerequisites | |
| 4 | - | |
| 5 | -Sprint 03 | |
| 6 | - | |
| 7 | -## Goals | |
| 8 | - | |
| 9 | -Teach Loader to separate kinds of work instead of improvising one workflow for everything. Builds on the DoD contract from Sprint 02 and the hook lifecycle from Sprint 03. | |
| 10 | - | |
| 11 | -This sprint is the direct answer to: | |
| 12 | - | |
| 13 | -- spending too long on simple tasks (modes route lookups out of the full loop) | |
| 14 | -- overthinking small work (clarify only fires for genuinely ambiguous prompts) | |
| 15 | -- jumping into execution without a plan on complex work | |
| 16 | - | |
| 17 | -The references are `refs/oh-my-codex/skills/deep-interview/SKILL.md` (clarify mode) and `refs/oh-my-codex/skills/ralplan/SKILL.md` (plan mode). Loader should copy the artifact discipline first; the full Planner/Architect/Critic loop and the ambiguity-scoring formula are stretch goals, not Sprint 04 requirements. | |
| 18 | - | |
| 19 | -## Deliverables | |
| 20 | - | |
| 21 | -### 1. Tool prerequisites pulled forward from Sprint 06 | |
| 22 | - | |
| 23 | -The clarify and DoD work both need tools that don't exist yet. Add them now rather than at the end of the plan: | |
| 24 | - | |
| 25 | -- **`TodoWrite`** — task/todo tracking. The "zero pending tasks" gate in Sprint 02's DoD contract is currently empty because there is no tool to write tasks into. Without `TodoWrite`, Sprint 02's contract has a hole. | |
| 26 | -- **`AskUserQuestion`** — structured user-question surface. The clarify mode needs a way to ask one question per round (per `deep-interview/SKILL.md`) rather than embedding questions in free-form responses. | |
| 27 | - | |
| 28 | -These two tools are the minimum new tool surface this sprint introduces. The broad expansion (diff/patch-aware editing, git helpers, web fetch, etc.) stays in Sprint 06. | |
| 29 | - | |
| 30 | -Both tools declare their permission level (`read-only` for `TodoWrite` writing to `.loader/`, `read-only` for `AskUserQuestion`) and integrate with the hook lifecycle from Sprint 03. | |
| 31 | - | |
| 32 | -### 2. Mode router | |
| 33 | - | |
| 34 | -Introduce a router that selects a mode at task entry based on the task shape and explicit user intent: | |
| 35 | - | |
| 36 | -- `clarify` — fired when ambiguity score crosses a threshold or the user invokes `--clarify` | |
| 37 | -- `plan` — fired for complex tasks (heuristic on prompt length / signal density, or explicit `--plan`) | |
| 38 | -- `execute` — the default for concrete actionable prompts | |
| 39 | -- `verify` — already exists from Sprint 02; the router wires it in as the gate after `execute` | |
| 40 | - | |
| 41 | -The router does not have to be smart in this sprint. It needs to be *explicit*: the chosen mode is logged, surfaced in the TUI status line, and recorded in the DoD object. | |
| 42 | - | |
| 43 | -A heuristic-only first pass is fine. Borrowing the OMX deep-interview ambiguity formula is a stretch goal. | |
| 44 | - | |
| 45 | -### 3. Clarify artifact | |
| 46 | - | |
| 47 | -When the router selects `clarify` mode, the runtime drives a one-question-per-round loop (using `AskUserQuestion`) and writes a task brief to `.loader/briefs/{timestamp}-{slug}.md` containing: | |
| 48 | - | |
| 49 | -- task statement | |
| 50 | -- desired outcome | |
| 51 | -- in-scope items | |
| 52 | -- non-goals | |
| 53 | -- decision boundaries | |
| 54 | -- constraints | |
| 55 | -- likely touchpoints | |
| 56 | -- assumptions | |
| 57 | - | |
| 58 | -This is a simplified port of `refs/oh-my-codex/skills/deep-interview/SKILL.md` Phase 4. Loader does not need the full ambiguity scoring or pressure-pass discipline yet. It needs the artifact. | |
| 59 | - | |
| 60 | -The brief is then handed off as input to the next mode (`plan` or `execute`), and the DoD object's `acceptance_criteria` is seeded from the brief. | |
| 61 | - | |
| 62 | -### 4. Planning artifacts | |
| 63 | - | |
| 64 | -When the router selects `plan` mode, the runtime produces two persistent artifacts under `.loader/plans/{timestamp}-{slug}/`: | |
| 65 | - | |
| 66 | -- `implementation.md` — what files change, in what order, with what risks | |
| 67 | -- `verification.md` — the verification commands and acceptance criteria that will populate the DoD object | |
| 68 | - | |
| 69 | -These do not need full OMX ralplan complexity (Planner/Architect/Critic with iteration cap). They need to: | |
| 70 | - | |
| 71 | -- exist on disk | |
| 72 | -- survive across turns (so a process restart can resume planning) | |
| 73 | -- feed directly into the DoD object created by Sprint 02 | |
| 74 | -- be visible to the user via the TUI | |
| 75 | - | |
| 76 | -### 5. Mode-specific prompts | |
| 77 | - | |
| 78 | -Replace Sprint 00's single generic system prompt with mode-specific prompts: | |
| 79 | - | |
| 80 | -- `clarify` mode prompt enforces "ask one question, do not propose solutions yet" | |
| 81 | -- `plan` mode prompt enforces "produce two artifacts, do not start writing code" | |
| 82 | -- `execute` mode prompt is the current Loader system prompt, trimmed (the global "no numbered steps" rule that breaks reporting tasks should be relaxed) | |
| 83 | -- `verify` mode prompt enforces "run the verification commands, attach evidence, do not declare done without zero failures" | |
| 84 | - | |
| 85 | -Mode-specific prompts are how the routing decision actually changes model behavior. | |
| 86 | - | |
| 87 | -### 6. Wire artifacts into the DoD object | |
| 88 | - | |
| 89 | -The DoD object from Sprint 02 gains optional links: | |
| 90 | - | |
| 91 | -- `clarify_brief: Path | None` | |
| 92 | -- `implementation_plan: Path | None` | |
| 93 | -- `verification_plan: Path | None` | |
| 94 | - | |
| 95 | -When verify phase runs, it pulls verification commands from `verification.md` if present, falling back to the inline DoD list otherwise. | |
| 96 | - | |
| 97 | -## Testing strategy | |
| 98 | - | |
| 99 | -- ambiguous prompts route into `clarify` (assert via the mock harness) | |
| 100 | -- complex prompts route into `plan` | |
| 101 | -- simple lookups route directly into `execute` and skip the verify phase entirely | |
| 102 | -- clarify briefs round-trip through `.loader/briefs/` | |
| 103 | -- planning artifacts round-trip through `.loader/plans/` | |
| 104 | -- the DoD object correctly absorbs `acceptance_criteria` from a clarify brief and `verification_commands` from a plan | |
| 105 | -- mode-specific prompts produce mode-appropriate behavior under the mock harness | |
| 106 | -- a verify failure that returns to `execute` (per Sprint 02's fix loop) does not re-trigger `clarify` or `plan` | |
| 107 | - | |
| 108 | -## Definition of done | |
| 109 | - | |
| 110 | -- Loader routes tasks into modes instead of treating every prompt the same way | |
| 111 | -- clarify briefs and planning artifacts exist on disk and feed the DoD object | |
| 112 | -- `TodoWrite` and `AskUserQuestion` tools exist and are wired into the hook lifecycle | |
| 113 | -- mode-specific prompts replace the single generic prompt | |
| 114 | -- simple tasks stay lightweight; complex tasks become more structured | |
| 115 | -- the TUI surfaces the active mode | |
| 116 | - | |
| 117 | -## Audit | |
| 118 | - | |
| 119 | -### Landed | |
| 120 | - | |
| 121 | -- `TodoWrite` now persists workflow tasks under `.loader/todos/` and `AskUserQuestion` now routes through the same typed executor path as every other tool | |
| 122 | -- Loader now routes entry turns through a heuristic `clarify` / `plan` / `execute` decision and records the active mode in the DoD object | |
| 123 | -- `clarify` mode now asks one structured question, writes a persisted brief under `.loader/briefs/`, and seeds DoD acceptance criteria from that artifact | |
| 124 | -- `plan` mode now writes persisted `implementation.md` and `verification.md` artifacts under `.loader/plans/`, seeds DoD verification commands from `verification.md`, and seeds todo state from the implementation steps | |
| 125 | -- the verify gate now advertises `verify` mode explicitly, reads verification commands from the persisted plan when present, and returns to `execute` on retry without re-running clarify or plan | |
| 126 | -- mode-specific system prompts now change model behavior by mode instead of using one generic prompt for every turn | |
| 127 | -- CLI and TUI surfaces now show workflow mode transitions, artifact creation, and real `AskUserQuestion` answer collection | |
| 128 | - | |
| 129 | -### Verification | |
| 130 | - | |
| 131 | -- `uv run pytest -q` is green: `126 passed` | |
| 132 | -- `tests/test_runtime_harness.py` now includes explicit workflow parity for clarify routing, plan routing, and verify-fix handoff stability | |
| 133 | -- `tests/test_workflow.py` and `tests/test_workflow_runtime.py` cover the artifact stores, router heuristics, DoD wiring, and mode-specific runtime behavior | |
| 134 | -- `tests/test_workflow_tools.py` and `tests/test_workflow_runtime_tools.py` cover the new tool contracts and user-question callback plumbing | |
| 135 | - | |
| 136 | -### Residual debt | |
| 137 | - | |
| 138 | -- clarify mode is intentionally shallow compared with OMX deep-interview: one question, one brief, no pressure-pass loop, and no ambiguity re-scoring | |
| 139 | -- plan mode is intentionally lighter than OMX ralplan: one pass, no consensus review agents, and no ADR-quality output contract yet | |
| 140 | -- todo state is now real, but Loader still auto-clears remaining plan todos on successful verification rather than requiring richer per-step completion semantics | |
| 141 | -- `conversation.py` is now more capable but even more responsibility-dense, so Sprint 05+ should keep carving this into smaller runtime components instead of letting the turn engine become the new monolith | |
.docs/sprints/sprint05.mddeleted@@ -1,154 +0,0 @@ | ||
| 1 | -# Sprint 05: Session State, Memory, and Compaction | |
| 2 | - | |
| 3 | -## Prerequisites | |
| 4 | - | |
| 5 | -Sprint 04 | |
| 6 | - | |
| 7 | -## Goals | |
| 8 | - | |
| 9 | -Give Loader durable continuity. | |
| 10 | - | |
| 11 | -This sprint should reduce re-discovery, improve multi-turn coherence, and support longer-running tasks without drowning the model in raw transcript. | |
| 12 | - | |
| 13 | -The references for this sprint are: | |
| 14 | - | |
| 15 | -- `refs/claw-code/rust/crates/runtime/src/session.rs` (session persistence with rotation and compaction metadata) | |
| 16 | -- `refs/claw-code/rust/crates/runtime/src/compact.rs` (auto-compaction trigger) | |
| 17 | -- `refs/claw-code/rust/crates/runtime/src/summary_compression.rs` (priority-aware line-level summarization — model on this rather than reinventing) | |
| 18 | -- `refs/oh-my-codex/src/mcp/memory-server.ts` (project memory + notepad surfaces) | |
| 19 | - | |
| 20 | -## Deliverables | |
| 21 | - | |
| 22 | -### 1. Full session store under `.loader/` | |
| 23 | - | |
| 24 | -The minimum `.loader/` shape was created in Sprint 02 (`.loader/dod/`). This sprint extends it to the full layout: | |
| 25 | - | |
| 26 | -```text | |
| 27 | -.loader/ | |
| 28 | -├── dod/ # already exists from Sprint 02 | |
| 29 | -├── briefs/ # already exists from Sprint 04 | |
| 30 | -├── plans/ # already exists from Sprint 04 | |
| 31 | -├── sessions/ # NEW — persisted conversation state | |
| 32 | -├── state/ # NEW — runtime state (current session pointer, etc.) | |
| 33 | -├── notepad.md # NEW — durable working notes | |
| 34 | -└── project-memory.json # NEW — repo conventions and user directives | |
| 35 | -``` | |
| 36 | - | |
| 37 | -Sessions are persisted with: | |
| 38 | - | |
| 39 | -- session id | |
| 40 | -- created/updated timestamps | |
| 41 | -- messages (using the Sprint 01 typed message schema) | |
| 42 | -- compaction metadata (when applicable) | |
| 43 | -- DoD object reference (link to the active task's DoD file in `.loader/dod/`) | |
| 44 | -- usage tracking | |
| 45 | - | |
| 46 | -File rotation: cap individual session files at ~256 KB and rotate, matching `refs/claw-code/rust/crates/runtime/src/session.rs:13-14`. | |
| 47 | - | |
| 48 | -### 2. Resume support | |
| 49 | - | |
| 50 | -Allow Loader to resume the latest or a named session: | |
| 51 | - | |
| 52 | -- `loader --resume` resumes the most recent session | |
| 53 | -- `loader --resume <session-id>` resumes by id | |
| 54 | -- `loader session list` (which lands in Sprint 06) shows available sessions | |
| 55 | - | |
| 56 | -Resume must restore the message history, the active DoD object, the active mode, and the active permission policy. | |
| 57 | - | |
| 58 | -### 3. Working memory and project memory | |
| 59 | - | |
| 60 | -Add tools/surfaces to: | |
| 61 | - | |
| 62 | -- read/write project memory (`.loader/project-memory.json` — tech stack, build commands, conventions, directives, structure) | |
| 63 | -- append working notes (`.loader/notepad.md` — temporary context that survives across turns within a session) | |
| 64 | -- store user directives ("we use uv, never pip" / "tests live in tests/" / etc.) | |
| 65 | - | |
| 66 | -These are MCP-style tools (sections similar to OMX's `project_memory_read` / `project_memory_add_note` / `project_memory_add_directive` / `notepad_read` / `notepad_write_*`), but they live as native Loader tools, not MCP servers. Loader does not need a full MCP runtime yet. | |
| 67 | - | |
| 68 | -Each memory tool declares `read-only` permission for the read variants and `workspace-write` for the write variants (since `.loader/` is inside the workspace). | |
| 69 | - | |
| 70 | -### 4. Transcript compaction | |
| 71 | - | |
| 72 | -Implement session compaction modeled on `refs/claw-code/rust/crates/runtime/src/summary_compression.rs`: | |
| 73 | - | |
| 74 | -- triggered automatically when input tokens exceed a threshold (default 100,000, matching claw-code's `DEFAULT_AUTO_COMPACTION_INPUT_TOKENS_THRESHOLD`) | |
| 75 | -- preserves the most recent N messages (default 4, matching claw-code) | |
| 76 | -- summarizes older context with priority-aware line-level compression: | |
| 77 | - - dedupes identical lines | |
| 78 | - - collapses inline whitespace | |
| 79 | - - prioritizes "Summary:", "Current work:", "Key files referenced:", and similar core lines | |
| 80 | - - bounded to a token/line/char budget | |
| 81 | -- emits explicit continuation instructions in the summary | |
| 82 | - | |
| 83 | -The compacted session is written back to `.loader/sessions/` with metadata recording what was removed. | |
| 84 | - | |
| 85 | -### 5. Usage tracking | |
| 86 | - | |
| 87 | -Track per-turn and cumulative: | |
| 88 | - | |
| 89 | -- input tokens | |
| 90 | -- output tokens | |
| 91 | -- cache creation tokens (when available from the backend) | |
| 92 | -- cache read tokens (when available) | |
| 93 | -- tool calls | |
| 94 | -- iterations | |
| 95 | - | |
| 96 | -Usage is attached to the `TurnSummary` from Sprint 01 and accumulated across the session. Cost estimation is a stretch goal — Loader is local-first and the dollar cost is zero, but token tracking is still useful for compaction triggers and observability. | |
| 97 | - | |
| 98 | -### 6. Memory hooks for the lifecycle | |
| 99 | - | |
| 100 | -Use the Sprint 03 hook lifecycle for: | |
| 101 | - | |
| 102 | -- a `post_tool_use` hook that updates `notepad.md` when the user explicitly invokes a "remember this" tool | |
| 103 | -- a session-finalization hook that writes the DoD evidence summary into `project-memory.json` when relevant (e.g., "the canonical test command is `uv run pytest`") | |
| 104 | - | |
| 105 | -This is how the durability layer integrates with the rest of the runtime instead of bolting on. | |
| 106 | - | |
| 107 | -## Testing strategy | |
| 108 | - | |
| 109 | -- sessions persist and reload correctly across simulated process restarts | |
| 110 | -- resume restores message history, DoD, mode, and permission policy | |
| 111 | -- memory/notepad operations survive across turns within a session | |
| 112 | -- compacted sessions preserve recent messages exactly and produce a summary that round-trips through the priority-aware compression | |
| 113 | -- compaction triggers automatically at the token threshold | |
| 114 | -- compacted summary contains the explicit continuation instruction | |
| 115 | -- file rotation kicks in at the size cap | |
| 116 | - | |
| 117 | -## Definition of done | |
| 118 | - | |
| 119 | -- Loader can continue work across sessions without fully re-priming the model | |
| 120 | -- memory/state lives outside the prompt under `.loader/` | |
| 121 | -- long sessions can be compacted safely without losing the active DoD or recent messages | |
| 122 | -- multi-turn work becomes more predictable | |
| 123 | -- usage tracking is wired into the turn summary | |
| 124 | -- the file layout is stable enough that Sprint 06's product surfaces can rely on it | |
| 125 | - | |
| 126 | -## Audit | |
| 127 | - | |
| 128 | -### Landed | |
| 129 | - | |
| 130 | -- Loader now persists full session snapshots under `.loader/sessions/` and tracks the active session pointer under `.loader/state/current_session.json` | |
| 131 | -- persisted sessions now carry typed messages, cumulative usage totals, compaction metadata, the active DoD path, the current task, workflow mode, and permission mode | |
| 132 | -- `Agent.resume_session(...)` now restores message history, the active DoD object, workflow mode, permission mode, and current task across process restarts | |
| 133 | -- the CLI now supports both `loader --resume` and `loader --resume <session-id>` by rewriting that syntax into an internal hidden option before Click parsing | |
| 134 | -- transcript compaction now triggers automatically at the configured input-token threshold, keeps the latest four messages verbatim, and inserts a claw-inspired continuation summary with priority-aware line compression | |
| 135 | -- `TurnSummary` now carries normalized per-turn usage and cumulative session usage, with streamed Ollama responses reporting prompt/output token counts when available | |
| 136 | -- Loader now exposes native `project_memory_*` and `notepad_*` tools backed by `.loader/project-memory.json` and `.loader/notepad.md` | |
| 137 | -- the hook lifecycle now mirrors successful memory writes into the notepad, and finalized DoD evidence summaries are captured into project memory when verification produced useful evidence | |
| 138 | - | |
| 139 | -### Verification | |
| 140 | - | |
| 141 | -- `uv run pytest -q` is green: `137 passed` | |
| 142 | -- `tests/test_session_state.py` covers persistence, resume, rotation, compaction persistence, and cumulative usage rollups | |
| 143 | -- `tests/test_compaction.py` covers priority-aware summary compression and continuation-message compaction behavior | |
| 144 | -- `tests/test_memory_tools.py` covers project-memory writes, notepad writes, lifecycle-hook mirroring, and DoD-summary capture into project memory | |
| 145 | -- `tests/test_cli_resume.py` covers `--resume` argument rewriting for latest and named-session restore | |
| 146 | -- `tests/test_runtime_harness.py` and `tests/test_workflow_runtime.py` remain green after the session/memory changes, so Sprint 05 did not regress the earlier parity baseline | |
| 147 | - | |
| 148 | -### Residual debt | |
| 149 | - | |
| 150 | -- session compaction summaries are runtime-authored heuristics; Loader still does not have claw-code's richer continuation semantics or OMX-style semantic memory extraction | |
| 151 | -- the DoD-to-project-memory capture is intentionally conservative and may miss higher-value repo conventions unless the evidence summary makes them explicit | |
| 152 | -- Sprint 05 restores sessions in the CLI runtime, but Sprint 06 still needs to surface session ids, listing, and inspection as first-class product commands | |
| 153 | -- cache token tracking is normalized when the backend provides it, but Loader still does not estimate cost and some backends may report fewer usage fields than Ollama | |
| 154 | -- `conversation.py` keeps growing as Sprint 05 logic lands, so Sprint 06+ should keep carving persistence/finalization concerns into smaller runtime components instead of leaving durability inside the turn monolith | |
.docs/sprints/sprint06.mddeleted@@ -1,131 +0,0 @@ | ||
| 1 | -# Sprint 06: Doctor, Explore, Status, and Tool Surface Expansion | |
| 2 | - | |
| 3 | -## Prerequisites | |
| 4 | - | |
| 5 | -Sprint 05 | |
| 6 | - | |
| 7 | -## Goals | |
| 8 | - | |
| 9 | -Turn Loader from "a loop with a TUI" into a tool with inspectable operational surfaces and a tool surface broad enough to keep the main runtime loop from being the only place every operation happens. | |
| 10 | - | |
| 11 | -The references for this sprint are: | |
| 12 | - | |
| 13 | -- `refs/oh-my-codex/src/cli/doctor.ts` (health checks) | |
| 14 | -- `refs/oh-my-codex/src/cli/explore.ts` (read-only inspection lane) | |
| 15 | -- `refs/claw-code/rust/crates/tools/src/lib.rs` (the 49-tool surface — Loader does not need all of them, but the categories are the right reference) | |
| 16 | - | |
| 17 | -## Deliverables | |
| 18 | - | |
| 19 | -### 1. `loader doctor` | |
| 20 | - | |
| 21 | -Add a health-check surface that reports: | |
| 22 | - | |
| 23 | -- backend connectivity (Ollama up, model pulled) | |
| 24 | -- model capability summary (resolved from the Sprint 01 capability profile) | |
| 25 | -- workspace detection (Sprint 01 project context) | |
| 26 | -- write access (workspace boundary, `.loader/` writable) | |
| 27 | -- test/build command detection (from project context) | |
| 28 | -- state/session directory health (`.loader/` exists, sessions/dod/briefs/plans dirs present, project-memory parseable) | |
| 29 | -- permission mode (current default and what each tool would resolve to) | |
| 30 | - | |
| 31 | -Each check returns `pass | warn | fail` with a one-line message and a remediation hint. | |
| 32 | - | |
| 33 | -`loader doctor` should be runnable without entering the main runtime loop — it is a diagnostic, not a turn. | |
| 34 | - | |
| 35 | -### 2. `loader status` and `loader session` | |
| 36 | - | |
| 37 | -Expose: | |
| 38 | - | |
| 39 | -- `loader status` — current model, capability profile, permission mode, workflow mode, active session, recent verification/evidence state, DoD phase | |
| 40 | -- `loader session list` — sessions in `.loader/sessions/` with id, started-at, last-updated, message count, DoD status | |
| 41 | -- `loader session show <id>` — full detail for one session | |
| 42 | -- `loader session resume <id>` — wired to Sprint 05's resume support | |
| 43 | - | |
| 44 | -Both commands read from `.loader/` and never invoke the LLM. | |
| 45 | - | |
| 46 | -### 3. Lightweight read-only explore lane | |
| 47 | - | |
| 48 | -Add an optimized read-only inspection path for: | |
| 49 | - | |
| 50 | -- file lookup | |
| 51 | -- symbol lookup | |
| 52 | -- pattern discovery | |
| 53 | -- repo relationship questions ("where is X imported?", "what calls Y?") | |
| 54 | - | |
| 55 | -The explore lane: | |
| 56 | - | |
| 57 | -- runs in `read-only` permission mode by default (regardless of the user's session-wide setting) | |
| 58 | -- skips the verify phase from Sprint 02 (lookups have no DoD) | |
| 59 | -- skips the mode router's `clarify` and `plan` modes from Sprint 04 — it goes straight to a constrained `execute` with read-only tools only | |
| 60 | -- has its own concise system prompt focused on lookup, not action | |
| 61 | - | |
| 62 | -Reference: `refs/oh-my-codex/src/cli/explore.ts:48-77` (allows only read-only git subcommands, validates against pipes/redirects/semicolons in tokenized form). | |
| 63 | - | |
| 64 | -This is what keeps simple lookups out of the full execution loop and addresses the user's "spending too long on simple tasks" complaint. | |
| 65 | - | |
| 66 | -### 4. Tool surface expansion | |
| 67 | - | |
| 68 | -Add a first serious expansion pass for Loader tools. `TodoWrite` and `AskUserQuestion` already exist from Sprint 04. New tools in this sprint: | |
| 69 | - | |
| 70 | -- **diff/patch-aware editing** — a tool that takes a structured patch (using the `StructuredPatchHunk` shape from Sprint 03's file_ops hardening) instead of raw old/new strings. Reduces the failure rate on multi-line edits. | |
| 71 | -- **git status helper** — read-only git status / log / diff / show / branch surface, similar to OMX explore's tokenized subcommand allowlist. Lives in the explore lane but also available to the main runtime. | |
| 72 | -- **memory/notepad tools** — wire the Sprint 05 memory surfaces into the tool registry so the model can call them: `project_memory_read`, `project_memory_write`, `notepad_read`, `notepad_append`. | |
| 73 | -- **structured ask-user** — a richer variant of `AskUserQuestion` that can present multiple-choice options or numbered alternatives (for plan-mode handoff and verify-mode "which fix do you want?") | |
| 74 | - | |
| 75 | -Do **not** add team/subagent orchestration in this sprint. The solo runtime needs to be stable first. Multi-agent surfaces are deferred indefinitely. | |
| 76 | - | |
| 77 | -### 5. CLI/TUI consolidation | |
| 78 | - | |
| 79 | -By this point Loader has accumulated a lot of CLI flags and TUI surfaces. Sprint 06 should: | |
| 80 | - | |
| 81 | -- consolidate the flag surface (deprecate flags that are now redundant with mode router decisions) | |
| 82 | -- update the TUI status line to surface model + capability profile + permission mode + workflow mode + DoD phase + session id, all in one line | |
| 83 | -- ensure `loader --help` is coherent | |
| 84 | - | |
| 85 | -## Testing strategy | |
| 86 | - | |
| 87 | -- doctor reports meaningful failures for: Ollama down, model not pulled, workspace not writable, `.loader/` corrupted | |
| 88 | -- doctor reports pass for a known-good local setup | |
| 89 | -- status/session surfaces reflect real runtime state (assert against fixtures in `.loader/`) | |
| 90 | -- explore mode handles read-only lookups without entering the full execution workflow (assert: no DoD object created, no verify phase, no mode router decision logged) | |
| 91 | -- explore mode denies write attempts even if the user has `workspace-write` set globally | |
| 92 | -- new tools are covered by both unit tests and the Sprint 00 mock harness | |
| 93 | -- the Sprint 02 baseline parity checklist (which has been growing each sprint) covers the new product surfaces | |
| 94 | - | |
| 95 | -## Definition of done | |
| 96 | - | |
| 97 | -- Loader is operable and inspectable from outside the main runtime loop | |
| 98 | -- simple inspection tasks are faster and cheaper via the explore lane | |
| 99 | -- the expanded tool surface reduces prompt pressure on the main loop | |
| 100 | -- doctor / status / session surfaces reflect real state | |
| 101 | -- Loader feels closer to a product and less like an experiment | |
| 102 | -- the team / multi-agent / hook-ecosystem deferrals are still deferred (and that is the right call) | |
| 103 | - | |
| 104 | -## Audit | |
| 105 | - | |
| 106 | -### Landed | |
| 107 | - | |
| 108 | -- Loader now exposes `loader doctor`, `loader status`, `loader session list`, `loader session show`, and `loader session resume` as real product surfaces backed by persisted runtime state under `.loader/`, without entering the main LLM loop | |
| 109 | -- `loader doctor` reports backend health, resolved capabilities, workspace and write access, command detection, runtime-state health, and permission-mode summaries with pass/warn/fail status plus remediation hints | |
| 110 | -- the read-only explore lane is live through `loader explore <prompt>` and `Agent.run_explore(...)`, with its own system prompt, constrained read-only registry, forced `read-only` permission mode, and no workflow routing or DoD persistence | |
| 111 | -- Loader's tool registry now includes a structured `patch` tool, a read-only `git` helper, `notepad_append`, and richer structured `AskUserQuestion` prompts with titles, context, options, and optional freeform answers | |
| 112 | -- the CLI and TUI status surfaces now show model, capability profile, mode, workflow mode, permission mode, DoD state, and active session id in a single coherent surface, and `loader --help` reflects the new product entry points | |
| 113 | -- Sprint 06 coverage now extends the deterministic parity harness with explore-mode scenarios, so the constrained lookup lane is measured alongside the earlier runtime contracts | |
| 114 | - | |
| 115 | -### Verification | |
| 116 | - | |
| 117 | -- `uv run pytest -q` is green: `153 passed` | |
| 118 | -- `tests/test_inspection.py` covers doctor health reporting, persisted status/session inspection, root help text, and session-resume dispatch | |
| 119 | -- `tests/test_explore_runtime.py` covers the direct explore-lane contract and forced read-only behavior | |
| 120 | -- `tests/test_expanded_tools.py` covers structured patch application, read-only git tooling, `notepad_append`, and richer `AskUserQuestion` behavior | |
| 121 | -- `tests/test_runtime_harness.py` remains green and now includes deterministic explore-mode parity scenarios in addition to the earlier runtime baseline | |
| 122 | -- `tests/test_status_surfaces.py` covers the consolidated CLI/TUI capability-profile and session-id status formatting | |
| 123 | - | |
| 124 | -### Residual debt | |
| 125 | - | |
| 126 | -- explore mode is intentionally one-shot and read-only; Loader still does not have a richer interactive inspection lane or OMX-style repo-navigation ergonomics | |
| 127 | -- the CLI surface is more coherent, but Sprint 06 does not fully deprecate every older entry path or simplify all historical flag combinations | |
| 128 | -- the read-only `git` helper is still much narrower than claw-code and OMX's broader repo/product surfaces | |
| 129 | -- the structured `patch` tool improves multi-line edits, but Loader still lacks AST-aware, LSP-aware, or symbol-aware editing semantics | |
| 130 | -- `conversation.py` remains a large runtime module even though explore now bypasses it for simple lookup work | |
| 131 | -- multi-agent and team-oriented surfaces remain intentionally deferred, and that continues to be the right tradeoff for Loader at this stage | |
.docs/sprints/sprint07.mddeleted@@ -1,166 +0,0 @@ | ||
| 1 | -# Sprint 07: Rule-Based Permissions and Runtime Decomposition | |
| 2 | - | |
| 3 | -## Prerequisites | |
| 4 | - | |
| 5 | -Sprint 06 | |
| 6 | - | |
| 7 | -## Goals | |
| 8 | - | |
| 9 | -Finish the permission model and keep the runtime from re-forming a monolith. | |
| 10 | - | |
| 11 | -Sprint 03 gave Loader explicit permission modes and lifecycle hooks. Sprint 06 made Loader more inspectable and product-like. The next leverage point is to replace the remaining legacy confirmation behavior with a real policy layer, then carve authorization and finalization concerns out of `src/loader/runtime/conversation.py` so later work does not keep accumulating there. | |
| 12 | - | |
| 13 | -This is not a stop-the-world cleanup sprint. It is a contract sprint: | |
| 14 | - | |
| 15 | -- policy decides whether tool use is allowed, denied, or prompted | |
| 16 | -- prompts are driven by authorization outcomes instead of ad hoc tool confirmations | |
| 17 | -- `conversation.py` becomes orchestration, not a sink for every new behavior | |
| 18 | - | |
| 19 | -The references for this sprint are: | |
| 20 | - | |
| 21 | -- `refs/claw-code/rust/crates/runtime/src/permissions.rs` | |
| 22 | -- `refs/claw-code/rust/crates/runtime/src/permission_enforcer.rs` | |
| 23 | -- `refs/claw-code/rust/crates/runtime/src/conversation.rs:295-470` | |
| 24 | - | |
| 25 | -## Deliverables | |
| 26 | - | |
| 27 | -### 1. Full permission policy layer | |
| 28 | - | |
| 29 | -Extend Loader's current mode-based policy to a real rule-based policy. | |
| 30 | - | |
| 31 | -Implementation targets: | |
| 32 | - | |
| 33 | -- add `prompt` and `allow` to `PermissionMode`, mirroring `claw-code` | |
| 34 | -- extend `PermissionPolicy` with `allow_rules`, `deny_rules`, and `ask_rules` | |
| 35 | -- introduce a typed permission-rule representation with conservative matching over: | |
| 36 | - - tool name | |
| 37 | - - normalized tool input summary | |
| 38 | - - optional workspace/path context where relevant | |
| 39 | -- define deterministic policy precedence: | |
| 40 | - - deny rules always win | |
| 41 | - - hook-level deny/cancel/fail still deny | |
| 42 | - - ask rules and `prompt` mode route to interactive approval | |
| 43 | - - allow rules and `allow` mode can elevate within policy | |
| 44 | - - otherwise the required-mode gate still applies | |
| 45 | -- keep rule syntax intentionally narrow; do not invent a complex DSL in this sprint | |
| 46 | - | |
| 47 | -The goal is to get the runtime to a `PermissionPolicy` shape closer to claw-code, not to build a policy language for its own sake. | |
| 48 | - | |
| 49 | -### 2. Policy-backed prompting instead of legacy confirmations | |
| 50 | - | |
| 51 | -Today Loader still falls back to legacy confirmation flows after policy allows certain destructive actions. This sprint should invert that relationship. | |
| 52 | - | |
| 53 | -Implementation targets: | |
| 54 | - | |
| 55 | -- route write/bash/edit/patch approvals through policy outcomes (`allow`, `deny`, `ask`) | |
| 56 | -- make `ToolExecutor` the primary owner of interactive approval decisions | |
| 57 | -- keep `Tool.check_confirmation()` only as a compatibility shim while call sites are migrated | |
| 58 | -- ensure prompt payloads include: | |
| 59 | - - tool name | |
| 60 | - - normalized input summary | |
| 61 | - - active mode | |
| 62 | - - required mode | |
| 63 | - - matched rule or hook reason when available | |
| 64 | -- preserve Sprint 06's explore guarantee: explore mode remains forced `read-only` even if the broader session policy is `allow` | |
| 65 | - | |
| 66 | -This is the behavioral finish to Sprint 03. Safety and approval should come from one runtime contract, not a policy layer plus leftover tool-specific prompting. | |
| 67 | - | |
| 68 | -### 3. Split `conversation.py` into smaller runtime components | |
| 69 | - | |
| 70 | -`src/loader/runtime/conversation.py` is working, but it is still too responsibility-dense. The next sprint should reduce risk by moving major responsibilities into dedicated runtime modules. | |
| 71 | - | |
| 72 | -Implementation targets: | |
| 73 | - | |
| 74 | -- extract assistant-turn request/response handling into a smaller request/turn helper | |
| 75 | -- extract tool-batch execution and retry bookkeeping into a dedicated runtime component | |
| 76 | -- extract verify/fix completion gating and session-finalization concerns into a dedicated runtime component | |
| 77 | -- keep `ConversationRuntime.run_turn(...)` as the coordinator that wires those parts together | |
| 78 | -- avoid moving behavior back into `agent/loop.py`; decomposition should continue inside `src/loader/runtime/` | |
| 79 | - | |
| 80 | -A good outcome is not just fewer lines. A good outcome is that future work on authorization, verification, and completion lands in focused modules instead of reopening the turn loop every time. | |
| 81 | - | |
| 82 | -### 4. Observable policy state in product surfaces | |
| 83 | - | |
| 84 | -If Loader gains rule-based permissions, operators need to be able to see that state without reading code. | |
| 85 | - | |
| 86 | -Implementation targets: | |
| 87 | - | |
| 88 | -- extend `loader status` to show: | |
| 89 | - - active permission mode | |
| 90 | - - whether policy prompting is enabled | |
| 91 | - - allow/deny/ask rule counts | |
| 92 | -- extend `loader doctor` to validate policy configuration and warn on invalid or conflicting rules | |
| 93 | -- persist enough policy metadata in session/runtime state that inspection surfaces can explain the effective policy cleanly | |
| 94 | -- fail closed on invalid policy configuration and report the failure clearly | |
| 95 | - | |
| 96 | -This keeps the policy layer inspectable and reduces surprise when a tool is prompted, denied, or silently allowed by rule. | |
| 97 | - | |
| 98 | -## Testing strategy | |
| 99 | - | |
| 100 | -- unit coverage for: | |
| 101 | - - `PermissionMode` parsing including `prompt` and `allow` | |
| 102 | - - rule parsing and normalization | |
| 103 | - - precedence between deny/ask/allow rules and hook overrides | |
| 104 | - - invalid policy configuration failing closed | |
| 105 | -- deterministic harness coverage for: | |
| 106 | - - `prompt_mode_prompts_destructive_write` | |
| 107 | - - `allow_mode_skips_prompt_for_destructive_write` | |
| 108 | - - `deny_rule_blocks_allowed_mode` | |
| 109 | - - `ask_rule_prompts_even_when_mode_would_allow` | |
| 110 | - - `explore_mode_ignores_global_allow_policy` | |
| 111 | -- inspection coverage for: | |
| 112 | - - doctor reporting invalid policy files | |
| 113 | - - status/session surfaces showing rule summaries and active policy mode | |
| 114 | -- regression coverage: | |
| 115 | - - Sprint 00-06 parity scenarios remain green after the runtime split | |
| 116 | - - no new behavior path should bypass `ToolExecutor` | |
| 117 | - | |
| 118 | -## Definition of done | |
| 119 | - | |
| 120 | -- Loader can express `read-only`, `workspace-write`, `danger-full-access`, `prompt`, and `allow` modes | |
| 121 | -- permission approvals and denials come from policy evaluation, not mostly from leftover tool-specific confirmation logic | |
| 122 | -- allow/deny/ask rules are real, typed, tested, and visible in product surfaces | |
| 123 | -- `conversation.py` is slimmer and more coordinator-like, with authorization/finalization logic living in dedicated runtime modules | |
| 124 | -- the existing parity baseline stays green and the new policy scenarios are deterministic | |
| 125 | -- Loader is closer to claw-code's execution-policy contract without taking on claw-code's full complexity | |
| 126 | - | |
| 127 | -## Explicitly out of scope | |
| 128 | - | |
| 129 | -- interactive multi-step explore workflows | |
| 130 | -- AST-aware, LSP-aware, or symbol-aware editing | |
| 131 | -- multi-agent or team orchestration | |
| 132 | -- broad plugin or MCP expansion | |
| 133 | - | |
| 134 | -## Audit | |
| 135 | - | |
| 136 | -### Landed | |
| 137 | - | |
| 138 | -- Loader now supports the full Sprint 07 permission-mode surface: `read-only`, `workspace-write`, `danger-full-access`, `prompt`, and `allow` | |
| 139 | -- workspace-local `.loader/permission-rules.json` files now load into typed `allow` / `deny` / `ask` rule sets with conservative matching over tool name, normalized input summaries, and optional path hints | |
| 140 | -- policy precedence is now deterministic: deny rules win first, hook-level overrides can still deny/ask/allow, ask rules drive interactive approval, allow rules and `allow` mode can elevate, and the required-mode gate still applies otherwise | |
| 141 | -- `ToolExecutor` is now the primary owner of interactive approval decisions, and destructive write/bash/edit/patch paths run through policy outcomes instead of relying mainly on tool-specific confirmation logic | |
| 142 | -- policy prompt payloads now include the active mode, required mode, normalized input summary, and matched rule or hook reason where available | |
| 143 | -- explore mode preserves the Sprint 06 read-only guarantee by copying deny/ask rules but intentionally ignoring broader allow rules | |
| 144 | -- `loader doctor` and `loader status` now expose prompting state, allow/deny/ask rule counts, and invalid rule configuration clearly, while `loader session list/show` now persist and surface the effective policy metadata that a session actually ran with | |
| 145 | -- invalid permission configuration now fails closed both when starting an `Agent` and when inspecting a workspace through doctor/status surfaces | |
| 146 | -- runtime decomposition continued inside `src/loader/runtime/`: assistant requests now live in `assistant_turns.py`, tool-batch execution/recovery/post-tool verification now live in `tool_batches.py`, and DoD/finalization logic now live in `finalization.py` | |
| 147 | -- `ConversationRuntime.run_turn(...)` is now more coordinator-like: it prepares the workflow, requests an assistant turn, delegates tool execution to the batch runner, delegates DoD gating/finalization, and keeps orchestration in one place instead of owning every behavior directly | |
| 148 | -- the deterministic parity harness now includes prompt/allow/rule-policy scenarios and remains green after the runtime split | |
| 149 | - | |
| 150 | -### Verification | |
| 151 | - | |
| 152 | -- `uv run pytest -q` is green: `167 passed` | |
| 153 | -- `tests/test_permissions.py` covers `prompt` / `allow` parsing, rule parsing, deny/ask/allow precedence, hook overrides, and policy-backed prompting behavior | |
| 154 | -- `tests/test_runtime_harness.py` keeps the full Sprint 00-06 baseline green and now covers prompt-mode prompting, allow-mode skipping, deny-rule blocking, ask-rule prompting, and explore-mode isolation from global allow policy | |
| 155 | -- `tests/test_inspection.py` covers invalid rule reporting plus rule-aware `doctor`, `status`, `session list`, and `session show` surfaces | |
| 156 | -- `tests/test_session_state.py` now covers persisted permission-policy metadata alongside the earlier session persistence/resume/compaction coverage | |
| 157 | -- targeted `ruff` checks are green for the new runtime modules and the touched inspection/session test files | |
| 158 | - | |
| 159 | -### Residual debt | |
| 160 | - | |
| 161 | -- rule syntax is intentionally narrow and workspace-local; Loader still does not have claw-code's richer rule model, preview UX, or temporary allow/deny override ergonomics | |
| 162 | -- policy-backed prompting is now primary, but the older tool-confirmation compatibility layer still exists and should continue shrinking rather than becoming a second policy path again | |
| 163 | -- `conversation.py` is materially slimmer than before Sprint 07, but it still owns workflow routing, prompt repair, self-critique/completion heuristics, and several other coordinator behaviors that remain more heuristic-heavy than the refs | |
| 164 | -- shell mutability classification and rule matching are still conservative string/command heuristics rather than a richer semantic sandbox or argument-aware policy model | |
| 165 | -- session inspection now preserves effective policy state, but Loader still does not offer a first-class product surface for authoring, validating, or dry-running permission rules | |
| 166 | -- explore mode remains intentionally one-shot and read-only; Sprint 07 does not add a richer interactive inspection workflow | |
.docs/sprints/sprint08.mddeleted@@ -1,180 +0,0 @@ | ||
| 1 | -# Sprint 08: Prompt Builder, Runtime Phases, and Permission Operator UX | |
| 2 | - | |
| 3 | -## Prerequisites | |
| 4 | - | |
| 5 | -Sprint 07 | |
| 6 | - | |
| 7 | -## Goals | |
| 8 | - | |
| 9 | -Turn the remaining "smart heuristics" into explicit runtime contracts and make Loader's permission system operable without reading JSON files by hand. | |
| 10 | - | |
| 11 | -Sprint 07 gave Loader a real execution-policy layer and smaller runtime seams. The next leverage point is to stop letting prompt assembly and response repair live as ad hoc strings and inline heuristics. `claw-code` keeps the runtime tighter partly because prompt construction is its own subsystem, and OMX keeps workflows legible because phase/state is explicit instead of implied. | |
| 12 | - | |
| 13 | -This sprint is the bridge from "better runtime internals" to "better operator control": | |
| 14 | - | |
| 15 | -- prompt construction becomes a typed builder with explicit sections | |
| 16 | -- turn phases become named runtime states instead of inline branches | |
| 17 | -- permission policy becomes dry-runnable and explainable from the CLI | |
| 18 | - | |
| 19 | -The references for this sprint are: | |
| 20 | - | |
| 21 | -- `refs/claw-code/rust/crates/runtime/src/prompt.rs` | |
| 22 | -- `refs/claw-code/rust/crates/runtime/src/permissions.rs` | |
| 23 | -- `refs/claw-code/rust/crates/runtime/src/conversation.rs:295-470` | |
| 24 | -- `refs/oh-my-codex/src/modes/base.ts` | |
| 25 | - | |
| 26 | -## Deliverables | |
| 27 | - | |
| 28 | -### 1. Typed prompt builder instead of hand-built templates | |
| 29 | - | |
| 30 | -Loader's prompt layer is still mostly string templates in `src/loader/agent/prompts.py`. Sprint 08 should turn that into a runtime prompt builder with explicit sections and clearer ownership. | |
| 31 | - | |
| 32 | -Implementation targets: | |
| 33 | - | |
| 34 | -- replace the current monolithic prompt strings with a typed builder under `src/loader/runtime/` or another clearly-owned prompt module | |
| 35 | -- separate static scaffolding from dynamic runtime context with an explicit boundary, following the shape of `claw-code`'s prompt builder | |
| 36 | -- render mode guidance, project context, runtime config, permission mode, and relevant workflow/artifact context as independent sections instead of one merged string blob | |
| 37 | -- keep native-tool vs ReAct prompt differences as a thin formatting concern, not two largely separate prompt bodies | |
| 38 | -- make it easy to add or remove one section without rewriting the whole prompt template | |
| 39 | -- preserve Loader's current local-first/product-specific guidance; this sprint is about structure first, not prompt verbosity for its own sake | |
| 40 | - | |
| 41 | -The goal is not "more prompt text." The goal is a prompt contract that is inspectable, composable, and easier to evolve without reopening the turn loop. | |
| 42 | - | |
| 43 | -### 2. Explicit runtime phases for response repair and completion behavior | |
| 44 | - | |
| 45 | -`src/loader/runtime/conversation.py` is healthier after Sprint 07, but it still owns too many inline heuristics: | |
| 46 | - | |
| 47 | -- empty-output retries | |
| 48 | -- raw-tool-call fallback | |
| 49 | -- fake-tool narration correction | |
| 50 | -- self-critique rerouting | |
| 51 | -- completion nudges for non-mutating tasks | |
| 52 | -- text-loop bailout behavior | |
| 53 | - | |
| 54 | -Sprint 08 should move those into named runtime phases and focused helpers. | |
| 55 | - | |
| 56 | -Implementation targets: | |
| 57 | - | |
| 58 | -- define a typed turn-phase model for the remaining coordinator behaviors, for example: | |
| 59 | - - `prepare` | |
| 60 | - - `assistant` | |
| 61 | - - `tools` | |
| 62 | - - `repair` | |
| 63 | - - `critique` | |
| 64 | - - `completion` | |
| 65 | -- extract response-repair and completion-policy decisions into dedicated runtime components rather than leaving them inline in `ConversationRuntime.run_turn(...)` | |
| 66 | -- keep `conversation.py` as the turn coordinator that advances phase state and delegates to helpers | |
| 67 | -- persist enough phase metadata in trace/session state that status surfaces and debugging can explain where Loader spent time during a turn | |
| 68 | -- avoid moving these heuristics back into `agent/loop.py`; the runtime split should keep going in one direction | |
| 69 | - | |
| 70 | -This is the next step toward a runtime that is easier to debug and less vulnerable to regressions when we tune follow-through behavior. | |
| 71 | - | |
| 72 | -### 3. First-class permission operator surfaces | |
| 73 | - | |
| 74 | -Sprint 07 made policy inspectable. Sprint 08 should make it operable. | |
| 75 | - | |
| 76 | -Implementation targets: | |
| 77 | - | |
| 78 | -- add `loader permissions show` to display: | |
| 79 | - - active permission mode | |
| 80 | - - prompting state | |
| 81 | - - rule source path | |
| 82 | - - normalized allow/deny/ask rules | |
| 83 | - - rule counts and validity | |
| 84 | -- add `loader permissions check` to dry-run one hypothetical tool request and show: | |
| 85 | - - tool name | |
| 86 | - - normalized input summary | |
| 87 | - - required mode | |
| 88 | - - policy decision (`allow` / `deny` / `ask`) | |
| 89 | - - matched rule or hook-style reason when present | |
| 90 | -- support both structured JSON-like tool arguments and simple string inputs where practical | |
| 91 | -- wire `loader doctor` remediation hints to these policy commands when rules are invalid or confusing | |
| 92 | -- keep the UX read-mostly in this sprint; authoring/edit flows can remain file-based for now | |
| 93 | - | |
| 94 | -This gives operators a way to answer "why did Loader allow/prompt/deny this?" without reproducing the behavior in a live turn. | |
| 95 | - | |
| 96 | -### 4. Prompt and policy state surfaced coherently in product surfaces | |
| 97 | - | |
| 98 | -Once prompt construction and phase state are explicit, the product surfaces should expose enough context to make Loader's behavior legible. | |
| 99 | - | |
| 100 | -Implementation targets: | |
| 101 | - | |
| 102 | -- extend `loader status` and the TUI status line to surface the active turn phase when a run is in progress | |
| 103 | -- make doctor/status/session output consistent about permission terminology (`mode`, `prompting`, `rules`, `source`) | |
| 104 | -- expose prompt-builder metadata in a minimal, operator-friendly way: | |
| 105 | - - current mode | |
| 106 | - - whether ReAct/native formatting is active | |
| 107 | - - which dynamic sections were included | |
| 108 | -- keep these surfaces concise; this sprint is about observability, not dumping the whole prompt to the screen by default | |
| 109 | - | |
| 110 | -The goal is to make Loader easier to reason about in live use, not just in code review. | |
| 111 | - | |
| 112 | -## Testing strategy | |
| 113 | - | |
| 114 | -- unit coverage for: | |
| 115 | - - prompt-builder section rendering | |
| 116 | - - native vs ReAct prompt-format differences over the same section set | |
| 117 | - - turn-phase transitions for repair/critique/completion flows | |
| 118 | - - permission dry-run explanations and invalid-input handling | |
| 119 | -- CLI coverage for: | |
| 120 | - - `loader permissions show` | |
| 121 | - - `loader permissions check` | |
| 122 | - - doctor remediation output that points users toward policy inspection | |
| 123 | -- deterministic/runtime coverage for: | |
| 124 | - - empty assistant output triggering repair-phase retries without regressing the tool path | |
| 125 | - - raw-tool fallback still sharing the same executor and phase bookkeeping | |
| 126 | - - non-mutating completion nudges remaining deterministic after the phase split | |
| 127 | - - Sprint 00-07 parity scenarios staying green | |
| 128 | -- status/TUI coverage for: | |
| 129 | - - active phase rendering | |
| 130 | - - coherent permission terminology across doctor/status/session output | |
| 131 | - | |
| 132 | -## Definition of done | |
| 133 | - | |
| 134 | -- prompt construction is builder-based, sectioned, and easier to inspect than the current template strings | |
| 135 | -- `conversation.py` is slimmer again, with response-repair and completion heuristics moved into focused runtime components | |
| 136 | -- Loader exposes first-class permission inspection and dry-run commands instead of requiring manual JSON reading | |
| 137 | -- prompt, phase, and policy state are more legible in status/inspection surfaces | |
| 138 | -- the full parity baseline remains green after the phase split | |
| 139 | -- Loader moves closer to claw-code's "tight runtime, explicit prompt contract" shape without overcommitting to a huge configuration system | |
| 140 | - | |
| 141 | -## Explicitly out of scope | |
| 142 | - | |
| 143 | -- a full interactive rule editor or TUI-based permission authoring flow | |
| 144 | -- AST-aware, LSP-aware, or symbol-aware editing | |
| 145 | -- a richer shell sandbox than the current command-based model | |
| 146 | -- interactive multi-step explore workflows | |
| 147 | -- multi-agent or team orchestration | |
| 148 | - | |
| 149 | -## Audit | |
| 150 | - | |
| 151 | -### Landed | |
| 152 | - | |
| 153 | -- Loader's prompt construction now lives in `src/loader/runtime/prompting.py` as a typed builder with explicit sections, a static/dynamic boundary marker, and thin native-vs-ReAct formatting differences instead of one mostly hand-built string blob | |
| 154 | -- prompt metadata now persists in session state, so `loader status`, `loader session list/show`, and the live agent state can explain the active prompt format and which dynamic sections were actually included for the current workspace/task | |
| 155 | -- the remaining coordinator heuristics are now split into explicit runtime components: phase tracking in `runtime.phases`, response repair in `runtime.repair`, and completion/self-critique policy in `runtime.completion_policy` | |
| 156 | -- `ConversationRuntime.run_turn(...)` now advances explicit turn phases (`prepare`, `assistant`, `repair`, `tools`, `critique`, `completion`, `finalize`) and persists the active phase into session state while also emitting runtime events for the CLI/TUI | |
| 157 | -- the TUI status line and CLI/session inspection surfaces now expose the active turn phase while a turn is in flight, which makes Loader's mid-turn behavior much easier to debug than the earlier implicit branch structure | |
| 158 | -- Loader now has first-class permission operator commands: | |
| 159 | - - `loader permissions show` displays the active mode, prompting state, rules source, validity, counts, and normalized allow/deny/ask rules | |
| 160 | - - `loader permissions check` dry-runs one hypothetical tool request and reports the normalized input summary, required mode, allow/deny/ask decision, matched rule, and policy reason | |
| 161 | -- `loader permissions check` supports both JSON object arguments and practical positional input mapping for common tools such as `bash`, `read`, `write`, `edit`, `patch`, `glob`, `grep`, `git`, and read-only memory/notepad lookups | |
| 162 | -- `loader doctor` remediation now points operators toward `loader permissions show` / `loader permissions check` instead of leaving permission debugging as a code/JSON-reading exercise | |
| 163 | -- doctor/status/session output now uses more consistent permission terminology around mode, prompting, rules, and source instead of mixing several labels for the same policy concepts | |
| 164 | - | |
| 165 | -### Verification | |
| 166 | - | |
| 167 | -- `uv run pytest -q` is green: `176 passed` | |
| 168 | -- `tests/test_prompt_builder.py` covers section rendering, native-vs-ReAct formatting, and prompt-builder persistence metadata | |
| 169 | -- `tests/test_runtime_phases.py` covers repair/completion phase transitions and active phase bookkeeping | |
| 170 | -- `tests/test_inspection.py` now covers `loader permissions show`, `loader permissions check`, invalid JSON input handling, invalid-rule visibility, prompt/policy metadata in status/session surfaces, and the existing doctor/session inspection behavior | |
| 171 | -- targeted `ruff` checks are green for `src/loader/runtime/inspection.py`, `tests/test_inspection.py`, and import ordering in `src/loader/cli/main.py` | |
| 172 | -- the full Sprint 00-07 parity baseline stayed green through the prompt/phase split and permission CLI rollout | |
| 173 | - | |
| 174 | -### Residual debt | |
| 175 | - | |
| 176 | -- `src/loader/runtime/conversation.py` is slimmer than before Sprint 08, but it still coordinates workflow routing and phase transitions with a heuristic branch structure rather than a more formal state machine | |
| 177 | -- prompt construction is now inspectable and sectioned, but Loader still does not offer prompt previews/diffs, a richer prompt-contract parity harness, or operator controls for temporarily adjusting prompt sections | |
| 178 | -- `loader permissions show/check` make the policy operable, but authoring/editing rules is still file-based and there is still no first-class preview UX for comparing multiple rule sets or applying temporary session overrides | |
| 179 | -- doctor/status/session terminology is more coherent now, but the product still stops short of the richer policy UX and sandbox semantics used by the references | |
| 180 | -- the explicit turn phases improve observability, but they are still runtime bookkeeping around heuristics, not yet a deeper workflow-state contract on the level of OMX's more opinionated routing discipline | |
.docs/sprints/sprint09.mddeleted@@ -1,185 +0,0 @@ | ||
| 1 | -# Sprint 09: Turn State Machine, Workflow Contracts, and Prompt Preview | |
| 2 | - | |
| 3 | -## Prerequisites | |
| 4 | - | |
| 5 | -Sprint 08 | |
| 6 | - | |
| 7 | -## Goals | |
| 8 | - | |
| 9 | -Turn Loader's explicit phase labels into a real runtime contract and make workflow routing explainable instead of merely observable. | |
| 10 | - | |
| 11 | -Sprint 08 gave Loader a typed prompt builder, named turn phases, and permission/operator surfaces. That was the right bridge, but the audit is honest about what still hurts: | |
| 12 | - | |
| 13 | -- `conversation.py` still coordinates too much through heuristic branching | |
| 14 | -- workflow routing is still implied by helper decisions more than enforced by a transition model | |
| 15 | -- phase state is visible, but not yet validated like a real state machine | |
| 16 | -- prompt metadata is inspectable, but the actual prompt contract is still hard to preview directly | |
| 17 | - | |
| 18 | -The next leverage point is to stop treating state as labels attached to heuristics and start treating it as the runtime. | |
| 19 | - | |
| 20 | -This sprint is about discipline: | |
| 21 | - | |
| 22 | -- turn progression becomes a validated state machine | |
| 23 | -- workflow routing becomes a typed decision contract with persisted reasons | |
| 24 | -- `conversation.py` becomes thinner again by delegating transitions instead of deciding everything inline | |
| 25 | -- operators can inspect the prompt/workflow contract without triggering a live model turn | |
| 26 | - | |
| 27 | -The references for this sprint are: | |
| 28 | - | |
| 29 | -- `refs/claw-code/rust/crates/runtime/src/conversation.rs` | |
| 30 | -- `refs/claw-code/rust/crates/runtime/src/mcp_lifecycle_hardened.rs` | |
| 31 | -- `refs/oh-my-codex/src/ralplan/runtime.ts` | |
| 32 | -- `refs/oh-my-codex/src/modes/base.ts` | |
| 33 | - | |
| 34 | -## Deliverables | |
| 35 | - | |
| 36 | -### 1. Validated turn state machine instead of phase bookkeeping only | |
| 37 | - | |
| 38 | -Sprint 08 added named phases. Sprint 09 should make those phases authoritative. | |
| 39 | - | |
| 40 | -Implementation targets: | |
| 41 | - | |
| 42 | -- introduce a dedicated turn-state machine under `src/loader/runtime/` with typed states, transition intents, and terminal reasons | |
| 43 | -- define explicit allowed transitions for the main turn lifecycle, for example: | |
| 44 | - - `prepare -> assistant` | |
| 45 | - - `assistant -> repair` | |
| 46 | - - `assistant -> tools` | |
| 47 | - - `assistant -> completion` | |
| 48 | - - `tools -> critique` | |
| 49 | - - `critique -> completion` | |
| 50 | - - `completion -> finalize` | |
| 51 | -- fail loudly on impossible transitions in tests and tracing instead of silently drifting | |
| 52 | -- capture transition metadata such as: | |
| 53 | - - why a transition happened | |
| 54 | - - whether it was a normal path, reroute, retry, or recovery move | |
| 55 | - - whether it ended in success, blocked, fix-loop reentry, or iteration exhaustion | |
| 56 | -- keep the state machine separate from model/tool side effects so transition logic is testable on its own | |
| 57 | - | |
| 58 | -The goal is not ceremony. The goal is to make Loader's turn lifecycle harder to accidentally regress. | |
| 59 | - | |
| 60 | -### 2. Typed workflow-routing contract with persisted reasoning | |
| 61 | - | |
| 62 | -Loader already has `clarify`, `plan`, `execute`, `verify`, and `explore`, but the router still behaves more like a cluster of heuristics than a durable contract. | |
| 63 | - | |
| 64 | -Implementation targets: | |
| 65 | - | |
| 66 | -- replace the current loose routing outputs with a typed workflow decision object, for example: | |
| 67 | - - selected mode | |
| 68 | - - reason code | |
| 69 | - - ambiguity/complexity signal | |
| 70 | - - whether the mode is an initial route, a reentry, or a forced continuation | |
| 71 | - - whether a downstream mode is already scheduled | |
| 72 | -- persist workflow-decision metadata in session/runtime state so inspection surfaces can explain: | |
| 73 | - - why Loader chose clarify vs plan vs execute | |
| 74 | - - why a verify failure returned to execute | |
| 75 | - - why a conversational task skipped verification | |
| 76 | -- move verify/fix loop reentry onto the same contract instead of letting it remain a partial special case | |
| 77 | -- keep explore explicitly outside the main mutating workflow contract while still representing it as a typed lane | |
| 78 | - | |
| 79 | -This is how Loader gets closer to OMX's “mode with a contract” behavior without copying OMX's full planning stack yet. | |
| 80 | - | |
| 81 | -### 3. Slim `conversation.py` around transitions and routing | |
| 82 | - | |
| 83 | -After Sprint 08, `conversation.py` is better, but it still knows too much about routing and control flow. | |
| 84 | - | |
| 85 | -Implementation targets: | |
| 86 | - | |
| 87 | -- extract workflow-decision evaluation and turn-transition advancement into dedicated runtime modules | |
| 88 | -- make `ConversationRuntime.run_turn(...)` primarily: | |
| 89 | - - build turn context | |
| 90 | - - ask the router for the next workflow decision | |
| 91 | - - advance the state machine | |
| 92 | - - delegate assistant/tool/repair/finalization work | |
| 93 | - - persist the resulting turn summary | |
| 94 | -- remove remaining inline branches that duplicate completion/reentry/skip logic in multiple places | |
| 95 | -- keep all new control-flow work inside `src/loader/runtime/`, not back in `agent/loop.py` | |
| 96 | - | |
| 97 | -A good outcome is that `conversation.py` reads like an orchestrator over contracts rather than a place where routing policy keeps accumulating. | |
| 98 | - | |
| 99 | -### 4. First-class prompt/workflow preview surfaces | |
| 100 | - | |
| 101 | -Sprint 08 made prompt and policy metadata legible. Sprint 09 should make the actual contract previewable. | |
| 102 | - | |
| 103 | -Implementation targets: | |
| 104 | - | |
| 105 | -- add `loader prompt show` to render the current prompt contract for a task without issuing a model request | |
| 106 | -- support previewing at least: | |
| 107 | - - workflow mode | |
| 108 | - - prompt format (`native` / `react`) | |
| 109 | - - included dynamic sections | |
| 110 | - - current permission mode | |
| 111 | - - relevant workflow context | |
| 112 | -- keep the output operator-friendly: | |
| 113 | - - summary metadata first | |
| 114 | - - prompt body second | |
| 115 | - - no hidden live side effects | |
| 116 | -- extend `loader status` / `loader session show` with the latest workflow-decision reason and latest transition summary when present | |
| 117 | - | |
| 118 | -The goal is to make Loader's runtime choices debuggable before and after a turn, not only during one. | |
| 119 | - | |
| 120 | -## Testing strategy | |
| 121 | - | |
| 122 | -- unit coverage for: | |
| 123 | - - allowed and disallowed turn-state transitions | |
| 124 | - - terminal-state reasons and retry/reentry transitions | |
| 125 | - - workflow-decision objects and persisted reason codes | |
| 126 | - - prompt-preview rendering over multiple workflow modes | |
| 127 | -- CLI coverage for: | |
| 128 | - - `loader prompt show` | |
| 129 | - - workflow reason/transition fields in `loader status` and `loader session show` | |
| 130 | -- deterministic/runtime coverage for: | |
| 131 | - - verify/fix reentry using only valid state-machine transitions | |
| 132 | - - conversational tasks skipping verification through a typed workflow decision instead of an implicit branch | |
| 133 | - - repair-phase retries preserving valid transition sequences | |
| 134 | - - Sprint 00-08 parity scenarios staying green after the state-machine split | |
| 135 | -- regression coverage: | |
| 136 | - - no turn path should bypass the state machine once the contract is introduced | |
| 137 | - - `conversation.py` should no longer be the only place that knows whether a reroute is legal | |
| 138 | - | |
| 139 | -## Definition of done | |
| 140 | - | |
| 141 | -- Loader has a validated turn-state machine, not just named phase labels | |
| 142 | -- workflow routing emits typed, persisted decisions with explainable reasons and reentry metadata | |
| 143 | -- `conversation.py` is slimmer again and more coordinator-like | |
| 144 | -- operators can preview the current prompt contract without a live model call | |
| 145 | -- status/session surfaces can explain the latest workflow decision and transition outcome | |
| 146 | -- the full parity baseline remains green after the control-flow split | |
| 147 | - | |
| 148 | -## Explicitly out of scope | |
| 149 | - | |
| 150 | -- a full workflow editor or visual state-machine UI | |
| 151 | -- a first-class permission rule editor | |
| 152 | -- OMX-style multi-iteration consensus planning | |
| 153 | -- AST-aware, LSP-aware, or symbol-aware editing | |
| 154 | -- multi-agent or team orchestration | |
| 155 | - | |
| 156 | -## Audit | |
| 157 | - | |
| 158 | -### Landed | |
| 159 | - | |
| 160 | -- Loader now has a validated turn-state machine in `src/loader/runtime/phases.py` instead of phase bookkeeping only; allowed transitions are explicit, invalid transitions fail loudly, and transition metadata now captures reason code, human summary, and whether the move was normal, retry, reroute, recovery, or terminal | |
| 161 | -- turn-transition metadata is now persisted in session state and surfaced through typed runtime events and `TurnSummary`, so Loader can explain the latest transition outcome instead of only exposing the current phase label | |
| 162 | -- workflow routing now uses a richer typed `ModeDecision` contract in `src/loader/runtime/workflow.py`, including reason code, reason summary, decision kind, ambiguity/complexity scores, and optional scheduled-next-mode hints | |
| 163 | -- verify/fix loop handoffs now use the same workflow-decision contract as initial routing, so verify entry and execute reentry are persisted and inspectable instead of living as partial special cases | |
| 164 | -- `ConversationRuntime` now treats workflow decisions more like contracts than loose labels: it sets workflow state from typed decisions, records reason metadata into session state, and keeps `conversation.py` more coordinator-like than before Sprint 09 | |
| 165 | -- `loader status`, `loader session list`, and `loader session show` now surface the latest workflow-decision reason, decision kind, and last validated transition summary when those fields are present | |
| 166 | -- Loader now has `loader prompt show [task]`, implemented through `runtime.inspection.collect_prompt_preview(...)`, which renders the current prompt contract without issuing a model request and reports workflow mode, permission mode, prompt format, dynamic sections, and the full prompt body | |
| 167 | -- prompt preview reuses the typed prompt builder and capability resolution rather than duplicating prompt strings in the CLI, so the operator surface stays aligned with the runtime contract it is previewing | |
| 168 | - | |
| 169 | -### Verification | |
| 170 | - | |
| 171 | -- `uv run pytest -q` is green: `180 passed` | |
| 172 | -- `tests/test_turn_state_machine.py` covers valid/invalid turn transitions and terminal transition metadata | |
| 173 | -- `tests/test_runtime_phases.py` now covers persisted transition metadata in runtime events, session state, and final turn summaries | |
| 174 | -- `tests/test_workflow_runtime.py` now covers persisted workflow-decision reason codes and handoff kinds for clarify/plan/verify flows | |
| 175 | -- `tests/test_session_state.py` now covers round-tripping workflow-decision metadata and transition metadata through persisted session snapshots | |
| 176 | -- `tests/test_inspection.py` now covers workflow-reason/transition rendering in status/session surfaces plus `loader prompt show` | |
| 177 | -- targeted `ruff` checks are green for `src/loader/runtime/inspection.py` and `tests/test_inspection.py` | |
| 178 | - | |
| 179 | -### Residual debt | |
| 180 | - | |
| 181 | -- Loader now has a real turn state machine, but workflow routing itself is still heuristic-only and materially lighter than OMX's deeper route discipline | |
| 182 | -- `src/loader/runtime/conversation.py` is slimmer and more contract-driven than before Sprint 09, but it still coordinates several heuristic completion/repair paths that could be decomposed further | |
| 183 | -- `loader prompt show` gives operators a real preview surface, but Loader still does not support prompt diffs, historical prompt snapshots, or richer side-by-side comparison workflows | |
| 184 | -- workflow reasons and transitions are now inspectable, but the product still does not offer a richer workflow trace/timeline surface or stronger workflow-authoring controls | |
| 185 | -- Sprint 09 strengthens state and inspection contracts, but it does not add deeper consensus planning, AST/LSP-aware editing, or a richer permission-rule authoring UX | |
.docs/sprints/sprint10.mddeleted@@ -1,198 +0,0 @@ | ||
| 1 | -# Sprint 10: Route Pressure, Clarify Depth, and Workflow Timeline | |
| 2 | - | |
| 3 | -## Prerequisites | |
| 4 | - | |
| 5 | -Sprint 09 | |
| 6 | - | |
| 7 | -## Goals | |
| 8 | - | |
| 9 | -Turn Loader's workflow routing from a better heuristic into a more opinionated workflow policy, and make that policy inspectable over time instead of only at the latest state. | |
| 10 | - | |
| 11 | -Sprint 09 gave Loader a validated turn state machine, typed workflow decisions, and prompt preview. That was the right contract layer, but the audit is honest about what still hurts: | |
| 12 | - | |
| 13 | -- workflow routing is still threshold-based and lighter than OMX's route discipline | |
| 14 | -- clarify and plan are still mostly one-shot preprocessors instead of deeper workflow lanes | |
| 15 | -- status/session surfaces can explain the latest decision, but not the sequence of decisions that got Loader there | |
| 16 | -- `conversation.py` is slimmer, but it still owns route/handoff/skip logic that should live in dedicated workflow policy code | |
| 17 | - | |
| 18 | -The next leverage point is to stop asking only "what mode are we in now?" and start asking "what workflow pressure led us here, what is still unresolved, and what should happen next if the task keeps moving?" | |
| 19 | - | |
| 20 | -This sprint is about workflow rigor: | |
| 21 | - | |
| 22 | -- routing becomes a scored workflow policy instead of a small threshold router | |
| 23 | -- clarify and plan become more durable lanes with bounded depth and freshness rules | |
| 24 | -- workflow history becomes a persisted timeline instead of only last-known fields | |
| 25 | -- `conversation.py` gets thinner again by delegating route pressure and handoff policy | |
| 26 | - | |
| 27 | -The references for this sprint are: | |
| 28 | - | |
| 29 | -- `refs/claw-code/rust/crates/runtime/src/conversation.rs` | |
| 30 | -- `refs/claw-code/rust/crates/runtime/src/prompt.rs` | |
| 31 | -- `refs/oh-my-codex/src/ralplan/runtime.ts` | |
| 32 | -- `refs/oh-my-codex/skills/deep-interview/SKILL.md` | |
| 33 | -- `refs/oh-my-codex/skills/ralplan/SKILL.md` | |
| 34 | - | |
| 35 | -## Deliverables | |
| 36 | - | |
| 37 | -### 1. Workflow policy engine instead of threshold-only routing | |
| 38 | - | |
| 39 | -Sprint 09 made workflow decisions typed. Sprint 10 should make them more deliberate. | |
| 40 | - | |
| 41 | -Implementation targets: | |
| 42 | - | |
| 43 | -- replace the current simple threshold router with a workflow-policy module that evaluates route pressure using typed signals, for example: | |
| 44 | - - ambiguity | |
| 45 | - - complexity | |
| 46 | - - mutability / verification pressure | |
| 47 | - - artifact availability | |
| 48 | - - artifact freshness | |
| 49 | - - explicit user request | |
| 50 | - - unresolved assumptions | |
| 51 | -- produce a scored route evaluation that explains: | |
| 52 | - - which mode won | |
| 53 | - - why it won | |
| 54 | - - what the runner-up pressure was | |
| 55 | - - whether a downstream mode is already scheduled | |
| 56 | -- keep initial route, handoff, and reentry decisions on the same policy surface instead of splitting them between the router and coordinator code | |
| 57 | -- make non-mutating "skip verify" behavior a typed workflow decision instead of an implicit completion branch | |
| 58 | - | |
| 59 | -The goal is not a giant planning engine. The goal is to make Loader's workflow behavior less accidental and easier to tune deliberately. | |
| 60 | - | |
| 61 | -### 2. Deeper clarify lane with bounded follow-through | |
| 62 | - | |
| 63 | -Loader can clarify today, but it still behaves like a one-question prelude. | |
| 64 | - | |
| 65 | -Implementation targets: | |
| 66 | - | |
| 67 | -- allow clarify to continue for more than one round when the first answer leaves key boundaries unresolved | |
| 68 | -- keep that depth bounded with an explicit clarify budget and escalation rules | |
| 69 | -- persist unresolved assumptions or open questions in workflow state so later turns can explain why clarify continued or stopped | |
| 70 | -- distinguish between: | |
| 71 | - - clarify completed cleanly | |
| 72 | - - clarify exhausted its budget | |
| 73 | - - clarify was bypassed by a stronger route decision | |
| 74 | -- keep the UX disciplined: no deep-interview sprawl, only focused rounds that materially reduce ambiguity | |
| 75 | - | |
| 76 | -This is how Loader gets closer to OMX's deeper interview behavior without turning every task into a questionnaire. | |
| 77 | - | |
| 78 | -### 3. Plan freshness and re-plan discipline | |
| 79 | - | |
| 80 | -Loader now persists plans, but it still treats them as mostly static once created. | |
| 81 | - | |
| 82 | -Implementation targets: | |
| 83 | - | |
| 84 | -- add typed freshness checks for clarify briefs and plan artifacts so Loader can detect when a plan no longer matches the task state | |
| 85 | -- define explicit refresh triggers, for example: | |
| 86 | - - task meaning changed after clarification | |
| 87 | - - verify/fix reentry reveals plan gaps | |
| 88 | - - touched files drift outside the planned touchpoints | |
| 89 | - - acceptance criteria changed materially | |
| 90 | -- route stale artifacts through a typed re-plan or plan-refresh decision instead of silently continuing with outdated plans | |
| 91 | -- keep refresh lightweight: prefer targeted plan refresh over full workflow restart when possible | |
| 92 | - | |
| 93 | -This brings Loader closer to claw-code's stronger execution discipline: artifacts should steer the turn, not become dead files on disk. | |
| 94 | - | |
| 95 | -### 4. Persisted workflow timeline and operator surfaces | |
| 96 | - | |
| 97 | -Sprint 09 exposed the latest workflow reason. Sprint 10 should expose workflow history. | |
| 98 | - | |
| 99 | -Implementation targets: | |
| 100 | - | |
| 101 | -- persist workflow timeline entries in session/runtime state, including: | |
| 102 | - - route decisions | |
| 103 | - - handoffs | |
| 104 | - - reentries | |
| 105 | - - clarify-budget outcomes | |
| 106 | - - plan refresh decisions | |
| 107 | - - the latest prompt-contract metadata attached to those moments when relevant | |
| 108 | -- add a first-class inspection surface such as `loader workflow show` for summarizing that timeline without a live model call | |
| 109 | -- extend `loader session show` to surface the most recent workflow timeline items when present | |
| 110 | -- keep the output operator-friendly: | |
| 111 | - - newest-important events first | |
| 112 | - - concise summaries | |
| 113 | - - artifact paths or mode transitions only when they help explain behavior | |
| 114 | - | |
| 115 | -The goal is to make Loader's workflow evolution debuggable, not just its final state. | |
| 116 | - | |
| 117 | -### 5. Continue slimming `conversation.py` | |
| 118 | - | |
| 119 | -Sprint 09 improved the coordinator. Sprint 10 should keep the split moving in the same direction. | |
| 120 | - | |
| 121 | -Implementation targets: | |
| 122 | - | |
| 123 | -- extract route-pressure evaluation, clarify-budget handling, and artifact-freshness checks into dedicated runtime modules | |
| 124 | -- make `ConversationRuntime.run_turn(...)` primarily: | |
| 125 | - - collect turn context | |
| 126 | - - ask the workflow policy for a route or reentry decision | |
| 127 | - - delegate clarify/plan/execute/verify work | |
| 128 | - - append timeline entries | |
| 129 | - - finalize the turn summary | |
| 130 | -- avoid rebuilding the monolith by placing all new workflow-policy logic under `src/loader/runtime/` | |
| 131 | - | |
| 132 | -A good outcome is that `conversation.py` keeps shrinking because policy is becoming modular, not because the behavior is disappearing. | |
| 133 | - | |
| 134 | -## Testing strategy | |
| 135 | - | |
| 136 | -- unit coverage for: | |
| 137 | - - workflow-policy score breakdowns and winning-route selection | |
| 138 | - - clarify-budget continuation vs exhaustion | |
| 139 | - - artifact-freshness detection and targeted plan refresh triggers | |
| 140 | - - workflow timeline persistence and serialization | |
| 141 | -- CLI coverage for: | |
| 142 | - - `loader workflow show` | |
| 143 | - - workflow timeline rendering in `loader session show` | |
| 144 | -- deterministic/runtime coverage for: | |
| 145 | - - ambiguous tasks that require more than one clarify round before execution | |
| 146 | - - verify/fix reentry that triggers plan refresh instead of blindly continuing | |
| 147 | - - non-mutating tasks skipping verify through a typed workflow decision | |
| 148 | - - Sprint 00-09 parity scenarios staying green after the workflow-policy split | |
| 149 | -- regression coverage: | |
| 150 | - - route and reentry choices should come from the workflow-policy layer, not ad hoc coordinator branches | |
| 151 | - - persisted workflow history should survive session resume and inspection | |
| 152 | - | |
| 153 | -## Definition of done | |
| 154 | - | |
| 155 | -- Loader routes through a scored workflow-policy engine instead of a small threshold router | |
| 156 | -- clarify can continue for bounded, explainable follow-up rounds when ambiguity remains | |
| 157 | -- plan artifacts can be marked stale and refreshed through typed workflow decisions | |
| 158 | -- workflow history is persisted and inspectable, not only the latest decision | |
| 159 | -- `conversation.py` is slimmer again and more policy-driven | |
| 160 | -- the full parity baseline remains green after the workflow-policy and timeline work | |
| 161 | - | |
| 162 | -## Explicitly out of scope | |
| 163 | - | |
| 164 | -- full OMX-style consensus planning | |
| 165 | -- a visual workflow timeline UI | |
| 166 | -- a first-class permission rule editor | |
| 167 | -- AST-aware, LSP-aware, or symbol-aware editing | |
| 168 | -- multi-agent or team orchestration | |
| 169 | - | |
| 170 | -## Audit | |
| 171 | - | |
| 172 | -### Landed | |
| 173 | - | |
| 174 | -- Loader now routes through a scored workflow policy in `src/loader/runtime/workflow_policy.py` instead of the old threshold-only router contract; route decisions now carry winner score, runner-up mode/score, unresolved questions, and a human-readable pressure summary | |
| 175 | -- initial route, artifact reuse, stale-plan reentry, and handoff metadata now sit on the same typed `ModeDecision` surface, which makes workflow choices easier to persist, inspect, and tune deliberately | |
| 176 | -- clarify is no longer a single-pass prelude: `src/loader/runtime/conversation.py` now supports a bounded multi-round clarify lane, re-evaluates ambiguity after each answer, and persists unresolved questions when the clarify budget is exhausted | |
| 177 | -- plan freshness is now an explicit runtime concern: Loader can detect file-drift against persisted plan artifacts, route back through a targeted plan refresh, regenerate implementation/verification artifacts, and hand back to execute without restarting the whole workflow | |
| 178 | -- non-mutating turns now record verify-skip as an explicit workflow timeline event instead of disappearing through an implicit branch | |
| 179 | -- workflow history is now persisted as `workflow_timeline` session state via `src/loader/runtime/session.py`, with timeline entries for routes, handoffs, reentries, clarify continuation/exit, plan refresh behavior, and verify skips | |
| 180 | -- operators now have `loader workflow show [session-id]` plus recent workflow timeline snippets inside `loader session show`, implemented through `src/loader/runtime/inspection.py` and `src/loader/cli/main.py` | |
| 181 | -- `conversation.py` is slimmer than before Sprint 10 because scoring, clarify review, artifact freshness, and timeline contracts now live in dedicated runtime modules instead of accumulating as coordinator-only heuristics | |
| 182 | - | |
| 183 | -### Verification | |
| 184 | - | |
| 185 | -- `uv run pytest -q` is green: `188 passed` | |
| 186 | -- `tests/test_workflow_policy.py` covers scored-route breakdowns, clarify follow-up reviews, artifact-freshness detection, and workflow timeline serialization | |
| 187 | -- `tests/test_workflow_runtime.py` covers bounded clarify continuation, targeted plan refresh on stale artifacts, verify/fix reentry, and persisted workflow timeline behavior | |
| 188 | -- `tests/test_inspection.py` covers `loader workflow show`, recent timeline rendering in `loader session show`, and persisted workflow timeline inspection without a live model call | |
| 189 | -- `tests/test_workflow.py` now aligns legacy router expectations with Sprint 10's scored policy contract instead of the older raw-threshold assumption | |
| 190 | -- targeted `ruff` checks are green for `src/loader/runtime/inspection.py`, `tests/test_inspection.py`, and `tests/test_workflow.py`; `src/loader/cli/main.py` was also checked for new unused-import regressions | |
| 191 | - | |
| 192 | -### Residual debt | |
| 193 | - | |
| 194 | -- the new workflow policy is scored, but it is still hand-tuned and text-heuristic; Loader still does not match OMX's deeper ambiguity analysis, route-pressure passes, or richer branch-specific workflow policies | |
| 195 | -- clarify now has bounded follow-through, but it is still intentionally shallow compared with OMX's deep-interview behavior and does not yet adapt its budget or questioning style by task class | |
| 196 | -- plan freshness is currently driven by touched-file drift; Loader still does not reason well about semantic task changes, changed acceptance criteria, or broader artifact invalidation | |
| 197 | -- `loader workflow show` makes workflow evolution inspectable, but the operator UX still stops short of timeline filtering, artifact diffs, or richer prompt/history comparison | |
| 198 | -- `src/loader/runtime/conversation.py` is smaller and more policy-driven than before Sprint 10, but it still coordinates more workflow behavior than the claw-code references, especially around completion and downstream execution orchestration | |
.docs/sprints/sprint11.mddeleted@@ -1,215 +0,0 @@ | ||
| 1 | -# Sprint 11: Semantic Signals, Clarify Strategy, and Orchestrator Split | |
| 2 | - | |
| 3 | -## Prerequisites | |
| 4 | - | |
| 5 | -Sprint 10 | |
| 6 | - | |
| 7 | -## Goals | |
| 8 | - | |
| 9 | -Turn Loader's new workflow policy from a better scorecard into a more structured workflow contract, and keep shrinking the coordinator so policy and orchestration live in dedicated runtime seams instead of collecting back inside `conversation.py`. | |
| 10 | - | |
| 11 | -Sprint 10 was a meaningful step forward. Loader now has scored routing, bounded clarify follow-through, plan refresh, and a persisted workflow timeline. That closes a real gap with claw-code and OMX, but the audit is honest about what still hurts: | |
| 12 | - | |
| 13 | -- workflow scoring is still hand-tuned and text-heuristic rather than driven by a typed signal model | |
| 14 | -- clarify has follow-through now, but the questioning strategy is still generic and shallow compared with OMX's deep-interview discipline | |
| 15 | -- plan freshness is still mostly file-drift based instead of understanding broader semantic invalidation | |
| 16 | -- workflow history is inspectable, but not yet filtered or summarized around the most useful operator questions | |
| 17 | -- `conversation.py` is smaller than it was, but it still coordinates more workflow behavior than the refs | |
| 18 | - | |
| 19 | -The next leverage point is to stop asking only "what pressure score won?" and start asking "what concrete workflow signals are in play, which task boundaries remain unresolved, and which orchestration module should own the next move?" | |
| 20 | - | |
| 21 | -This sprint is about workflow structure: | |
| 22 | - | |
| 23 | -- route policy consumes typed workflow signals rather than leaning so heavily on inline heuristics | |
| 24 | -- clarify becomes intent-aware instead of merely multi-round | |
| 25 | -- replan discipline becomes more semantic than touched-file drift alone | |
| 26 | -- workflow inspection becomes more useful for debugging why Loader stayed in or re-entered a lane | |
| 27 | -- `conversation.py` shrinks again because orchestration moves into dedicated runtime modules | |
| 28 | - | |
| 29 | -The references for this sprint are: | |
| 30 | - | |
| 31 | -- `refs/claw-code/rust/crates/runtime/src/conversation.rs` | |
| 32 | -- `refs/claw-code/rust/crates/runtime/src/policy_engine.rs` | |
| 33 | -- `refs/claw-code/rust/crates/runtime/src/prompt.rs` | |
| 34 | -- `refs/oh-my-codex/src/ralplan/runtime.ts` | |
| 35 | -- `refs/oh-my-codex/src/modes/base.ts` | |
| 36 | -- `refs/oh-my-codex/skills/deep-interview/SKILL.md` | |
| 37 | -- `refs/oh-my-codex/skills/ralplan/SKILL.md` | |
| 38 | - | |
| 39 | -## Deliverables | |
| 40 | - | |
| 41 | -### 1. Typed workflow-signal extraction instead of score inputs assembled inline | |
| 42 | - | |
| 43 | -Sprint 10 made routing scored. Sprint 11 should make the inputs first-class. | |
| 44 | - | |
| 45 | -Implementation targets: | |
| 46 | - | |
| 47 | -- introduce a dedicated workflow-signal module under `src/loader/runtime/`, for example around: | |
| 48 | - - ambiguity signals | |
| 49 | - - complexity signals | |
| 50 | - - mutation / verification pressure | |
| 51 | - - unresolved clarification slots | |
| 52 | - - artifact availability and freshness | |
| 53 | - - explicit user workflow requests | |
| 54 | - - recent workflow timeline pressure | |
| 55 | -- separate signal extraction from route scoring so policy code can reason over a typed signal packet rather than rebuilding context ad hoc | |
| 56 | -- persist enough of the winning signal context to explain: | |
| 57 | - - why clarify won over execute | |
| 58 | - - why plan refresh was triggered | |
| 59 | - - why direct execution was still allowed despite ambiguity | |
| 60 | -- keep route scoring tunable, but move the fragile task-text heuristics out of the coordinator path | |
| 61 | - | |
| 62 | -The goal is not to build a giant intent engine. The goal is to make workflow policy more explainable, testable, and less accidental. | |
| 63 | - | |
| 64 | -### 2. Intent-aware clarify strategy instead of generic follow-up rounds | |
| 65 | - | |
| 66 | -Loader can now clarify more than once, but it still asks questions in a relatively flat way. | |
| 67 | - | |
| 68 | -Implementation targets: | |
| 69 | - | |
| 70 | -- define typed clarify objectives or slots such as: | |
| 71 | - - desired outcome | |
| 72 | - - acceptance criteria | |
| 73 | - - constraints | |
| 74 | - - non-goals | |
| 75 | - - risk boundaries | |
| 76 | -- choose the next clarify question from unresolved slots instead of using a mostly generic follow-up loop | |
| 77 | -- adapt clarify behavior based on signal severity and task class while preserving a hard upper bound | |
| 78 | -- persist why clarify stopped: | |
| 79 | - - enough boundaries gathered | |
| 80 | - - budget exhausted | |
| 81 | - - route pressure shifted toward plan or execute | |
| 82 | - - explicit user answer narrowed the scope sufficiently | |
| 83 | -- carry unresolved slots forward into workflow state and artifacts so later plan/execute decisions can explain what was still uncertain | |
| 84 | - | |
| 85 | -This is how Loader gets closer to OMX's deeper interview rigor without turning every task into a long questionnaire. | |
| 86 | - | |
| 87 | -### 3. Semantic artifact invalidation and stronger re-plan discipline | |
| 88 | - | |
| 89 | -Sprint 10 made plan refresh possible. Sprint 11 should make refresh triggers smarter. | |
| 90 | - | |
| 91 | -Implementation targets: | |
| 92 | - | |
| 93 | -- enrich planning artifacts with more structured metadata where it materially helps, for example: | |
| 94 | - - expected touchpoints | |
| 95 | - - acceptance-criteria anchors | |
| 96 | - - planned files or subsystems | |
| 97 | - - known risks or assumptions | |
| 98 | -- define broader invalidation triggers beyond file drift, for example: | |
| 99 | - - verification evidence contradicts the plan assumptions | |
| 100 | - - the implementation touched files or subsystems outside the expected scope | |
| 101 | - - acceptance criteria changed materially after clarify or verification | |
| 102 | - - the current task wording narrowed or expanded after the plan was written | |
| 103 | -- distinguish between: | |
| 104 | - - targeted plan refresh | |
| 105 | - - clarify reentry | |
| 106 | - - full re-plan | |
| 107 | -- keep the runtime disciplined: prefer the smallest valid recovery move instead of restarting workflow lanes casually | |
| 108 | - | |
| 109 | -This should move Loader closer to claw-code's stronger artifact discipline, where plans remain live contracts instead of just persisted markdown. | |
| 110 | - | |
| 111 | -### 4. Workflow inspection that answers operator questions more directly | |
| 112 | - | |
| 113 | -Sprint 10 made workflow history visible. Sprint 11 should make it more usable. | |
| 114 | - | |
| 115 | -Implementation targets: | |
| 116 | - | |
| 117 | -- extend `loader workflow show` with higher-signal inspection affordances such as: | |
| 118 | - - filtering by mode or event kind | |
| 119 | - - limiting to the most recent meaningful items | |
| 120 | - - clearer summaries for refresh, reentry, and clarify-budget outcomes | |
| 121 | -- expose the signal/reason context that most directly answers questions like: | |
| 122 | - - why did Loader ask again? | |
| 123 | - - why did Loader refresh the plan? | |
| 124 | - - why did Loader skip verify? | |
| 125 | -- keep session surfaces concise by surfacing only the most recent or most important workflow events by default | |
| 126 | -- avoid building a visual UI in this sprint; prioritize text inspection that reduces debugging time immediately | |
| 127 | - | |
| 128 | -The goal is not prettier output. The goal is faster workflow debugging and better operator trust. | |
| 129 | - | |
| 130 | -### 5. Continue shrinking `conversation.py` into a coordinator over runtime modules | |
| 131 | - | |
| 132 | -Sprint 10 improved the split, but the coordinator still owns too much sequencing logic. | |
| 133 | - | |
| 134 | -Implementation targets: | |
| 135 | - | |
| 136 | -- extract additional orchestration seams under `src/loader/runtime/`, likely around: | |
| 137 | - - signal extraction | |
| 138 | - - clarify-lane control | |
| 139 | - - plan refresh / invalidation decisions | |
| 140 | - - workflow timeline append policy | |
| 141 | -- make `ConversationRuntime.run_turn(...)` read more like: | |
| 142 | - - collect turn state | |
| 143 | - - compute workflow signals | |
| 144 | - - ask policy/orchestrator for the next lane decision | |
| 145 | - - delegate lane execution | |
| 146 | - - persist summary and timeline outcomes | |
| 147 | -- keep completion and downstream workflow handoff logic out of the signal-extraction path | |
| 148 | -- avoid replacing one monolith with another; new orchestration modules should have narrow responsibilities and direct tests | |
| 149 | - | |
| 150 | -A good outcome is that `conversation.py` keeps shrinking because ownership is clearer, not because behavior gets hidden. | |
| 151 | - | |
| 152 | -## Testing strategy | |
| 153 | - | |
| 154 | -- unit coverage for: | |
| 155 | - - typed workflow-signal extraction and normalization | |
| 156 | - - route-policy scoring over structured signals | |
| 157 | - - clarify-slot progression and stop reasons | |
| 158 | - - semantic invalidation triggers and targeted recovery selection | |
| 159 | -- CLI coverage for: | |
| 160 | - - `loader workflow show` filtering and summarization | |
| 161 | - - session/workflow output for clarify exhaustion, plan refresh, and reentry reasons | |
| 162 | -- deterministic/runtime coverage for: | |
| 163 | - - ambiguous tasks where clarify chooses different follow-up questions based on unresolved slots | |
| 164 | - - verification failure that triggers plan refresh vs clarify reentry based on typed invalidation reasons | |
| 165 | - - tasks that remain executable even with mild ambiguity because stronger signals favor direct execution | |
| 166 | - - Sprint 00-10 parity scenarios staying green after the workflow-policy split deepens again | |
| 167 | -- regression coverage: | |
| 168 | - - route policy should consume typed signals rather than rebuilding them ad hoc inside the coordinator | |
| 169 | - - workflow inspection should continue to work after session resume and compaction | |
| 170 | - | |
| 171 | -## Definition of done | |
| 172 | - | |
| 173 | -- Loader extracts typed workflow signals before route scoring | |
| 174 | -- clarify behavior is intent-aware and persists why it continued or stopped | |
| 175 | -- plan refresh uses richer invalidation reasons than file drift alone | |
| 176 | -- workflow inspection better explains reentry, refresh, and clarify behavior | |
| 177 | -- `conversation.py` is slimmer again and more coordinator-like | |
| 178 | -- the full parity baseline remains green after the deeper workflow-policy split | |
| 179 | - | |
| 180 | -## Explicitly out of scope | |
| 181 | - | |
| 182 | -- full OMX-style consensus planning | |
| 183 | -- a visual workflow timeline UI | |
| 184 | -- a first-class permission rule editor | |
| 185 | -- AST-aware, LSP-aware, or symbol-aware editing | |
| 186 | -- multi-agent or team orchestration | |
| 187 | - | |
| 188 | -## Audit | |
| 189 | - | |
| 190 | -### Landed | |
| 191 | - | |
| 192 | -- Loader now extracts typed workflow signals in `src/loader/runtime/workflow_signals.py`, and route decisions persist `signal_summary` context so we can explain why clarify, plan, or direct execute won without rebuilding those heuristics inside the coordinator | |
| 193 | -- clarify is now intent-aware instead of generic: `src/loader/runtime/clarify_strategy.py` defines explicit slots such as desired outcome, non-goals, acceptance criteria, constraints, decision boundaries, and likely touchpoints, and the runtime now persists why clarify continued or stopped around those slots | |
| 194 | -- replan discipline is broader and more honest: `src/loader/runtime/artifact_invalidation.py` can now distinguish targeted plan refresh, clarify reentry, and full re-plan based on semantic drift instead of only touched-file mismatch, and `src/loader/runtime/conversation.py` routes those recovery moves explicitly | |
| 195 | -- workflow inspection is more useful for actual operator questions: `loader workflow show` now supports mode/kind filtering and entry limits, and `src/loader/runtime/inspection.py` plus `src/loader/cli/main.py` surface concise workflow highlights for re-asks, reentries, refreshes, and verify skips | |
| 196 | -- `conversation.py` is slimmer again because clarify/plan lane execution moved into `src/loader/runtime/workflow_lanes.py`, which now owns lane prompts, artifact writes, clarify follow-up handling, and plan todo seeding while the main runtime acts more like a coordinator | |
| 197 | - | |
| 198 | -### Verification | |
| 199 | - | |
| 200 | -- `uv run pytest -q` is green: `197 passed` | |
| 201 | -- `tests/test_workflow_signals.py` covers typed signal extraction, recent timeline pressure, and persisted `signal_summary` state | |
| 202 | -- `tests/test_clarify_strategy.py` covers slot prioritization and targeted clarify questions | |
| 203 | -- `tests/test_artifact_invalidation.py` covers semantic invalidation and recovery-mode selection | |
| 204 | -- `tests/test_workflow_runtime.py` covers intent-aware clarify continuation, targeted plan refresh, and full re-plan through clarify reentry | |
| 205 | -- `tests/test_inspection.py` covers `loader workflow show` filtering/highlights plus session/status workflow inspection | |
| 206 | -- targeted `ruff` checks are green for `src/loader/runtime/workflow_signals.py`, `src/loader/runtime/clarify_strategy.py`, `src/loader/runtime/artifact_invalidation.py`, `src/loader/runtime/workflow_policy.py`, `src/loader/runtime/workflow_lanes.py`, `src/loader/runtime/conversation.py`, `src/loader/runtime/inspection.py`, and the new/expanded workflow inspection tests | |
| 207 | -- `uv run python -m compileall src/loader/cli/main.py` is green; full-file `ruff` on `src/loader/cli/main.py` still inherits the repo's older line-length backlog, so CLI verification here is anchored primarily by the inspection command tests | |
| 208 | - | |
| 209 | -### Residual debt | |
| 210 | - | |
| 211 | -- typed workflow signals are a better contract than inline heuristics, but they are still built from hand-tuned text/runtime cues rather than OMX-style ambiguity scoring, evidence passes, or richer task semantics | |
| 212 | -- clarify is now slot-driven, but it still stops well short of OMX's deep-interview pressure-pass discipline, repository-backed fact gathering, and task-profile-dependent depth | |
| 213 | -- artifact invalidation is broader now, but it is still lightweight and text-based; Loader does not yet reason over richer artifact metadata, deeper verification contradictions, or more explicit assumption tracking | |
| 214 | -- `loader workflow show` now answers the common operator questions much better, but it still lacks artifact diffs, prompt-history comparison, and richer timeline drill-down ergonomics | |
| 215 | -- `src/loader/runtime/conversation.py` is more coordinator-like than before Sprint 11, but it still owns the main turn loop, completion policy handoff, and some recovery sequencing that claw-code keeps in even more dedicated seams | |
.docs/sprints/sprint12.mddeleted@@ -1,204 +0,0 @@ | ||
| 1 | -# Sprint 12: Interview Pressure, Semantic Evidence, and Turn Orchestration | |
| 2 | - | |
| 3 | -## Prerequisites | |
| 4 | - | |
| 5 | -Sprint 11 | |
| 6 | - | |
| 7 | -## Goals | |
| 8 | - | |
| 9 | -Turn Loader's newer workflow structure into a more disciplined execution contract by deepening clarify beyond slot selection, making semantic invalidation rely on richer evidence than text overlap alone, and shrinking the main turn loop into a clearer orchestration shell. | |
| 10 | - | |
| 11 | -Sprint 11 closed several real gaps. Loader now has typed workflow signals, slot-aware clarify, semantic invalidation, better workflow inspection, and a slimmer coordinator. That is meaningful progress toward claw-code and OMX, but the audit is honest about what still hurts: | |
| 12 | - | |
| 13 | -- typed workflow signals are still hand-tuned runtime heuristics rather than a deeper ambiguity/evidence model | |
| 14 | -- clarify is more intentional now, but it still lacks OMX's pressure-pass discipline, evidence-chasing, and codebase-backed interview style | |
| 15 | -- artifact invalidation is broader than file drift, but it still reasons from lightweight text overlap instead of richer structured evidence | |
| 16 | -- `conversation.py` is smaller, but it still owns the main assistant/recovery/completion orchestration loop that the refs spread across narrower runtime seams | |
| 17 | - | |
| 18 | -The next leverage point is to stop treating clarify as "ask a better next question" and start treating it as "run a bounded interview with explicit pressure passes, factual grounding, and a stronger handoff contract for later execution." | |
| 19 | - | |
| 20 | -This sprint is about execution rigor: | |
| 21 | - | |
| 22 | -- clarify gains pressure-pass behavior instead of only slot-follow-up behavior | |
| 23 | -- semantic invalidation uses richer structured evidence and contradiction tracking | |
| 24 | -- the main turn loop shrinks again by delegating orchestration checkpoints into dedicated runtime modules | |
| 25 | -- Loader gets closer to closed-source agentic tools not by more prompt prose, but by stronger workflow contracts | |
| 26 | - | |
| 27 | -The references for this sprint are: | |
| 28 | - | |
| 29 | -- `refs/claw-code/rust/crates/runtime/src/conversation.rs` | |
| 30 | -- `refs/claw-code/rust/crates/runtime/src/policy_engine.rs` | |
| 31 | -- `refs/claw-code/rust/crates/runtime/src/prompt.rs` | |
| 32 | -- `refs/oh-my-codex/src/ralplan/runtime.ts` | |
| 33 | -- `refs/oh-my-codex/src/modes/base.ts` | |
| 34 | -- `refs/oh-my-codex/skills/deep-interview/SKILL.md` | |
| 35 | -- `refs/oh-my-codex/skills/ralplan/SKILL.md` | |
| 36 | - | |
| 37 | -## Deliverables | |
| 38 | - | |
| 39 | -### 1. Pressure-pass clarify controller instead of slot selection alone | |
| 40 | - | |
| 41 | -Sprint 11 made clarify targeted. Sprint 12 should make it disciplined. | |
| 42 | - | |
| 43 | -Implementation targets: | |
| 44 | - | |
| 45 | -- introduce a dedicated clarify controller under `src/loader/runtime/` that tracks: | |
| 46 | - - current interview stage | |
| 47 | - - weakest clarity dimension | |
| 48 | - - whether a pressure pass has occurred | |
| 49 | - - whether non-goals and decision boundaries are explicit | |
| 50 | - - how much interview budget remains | |
| 51 | -- extend clarify reasoning beyond "what slot is unresolved?" to also ask: | |
| 52 | - - was the last answer too broad? | |
| 53 | - - has this assumption been challenged yet? | |
| 54 | - - do we still need an example, counterexample, tradeoff, or explicit stop boundary? | |
| 55 | -- persist clarify progress in structured form so later workflow decisions can explain: | |
| 56 | - - which dimension clarify was targeting | |
| 57 | - - whether Loader was still gathering boundaries | |
| 58 | - - whether it stopped because the budget was exhausted or because readiness gates were met | |
| 59 | -- keep it bounded and pragmatic: | |
| 60 | - - no unbounded interviews | |
| 61 | - - no long questionnaires | |
| 62 | - - one question at a time with explicit stop conditions | |
| 63 | - | |
| 64 | -The goal is not to copy OMX wholesale. The goal is to adopt the parts that materially reduce misaligned execution and premature planning. | |
| 65 | - | |
| 66 | -### 2. Codebase-backed clarify grounding and stronger requirement artifacts | |
| 67 | - | |
| 68 | -Sprint 11 still relies mostly on the user answer plus task text. Sprint 12 should let clarify lean on facts Loader can gather directly. | |
| 69 | - | |
| 70 | -Implementation targets: | |
| 71 | - | |
| 72 | -- add a lightweight preflight/context seam for brownfield tasks that can feed clarify with discovered facts before asking the user for repository details | |
| 73 | -- prefer evidence-backed clarify questions when Loader already knows something, for example: | |
| 74 | - - "I found X in Y. Should this change follow that pattern?" | |
| 75 | - - "The current touchpoints appear to be A and B. Should I keep C out of scope?" | |
| 76 | -- persist richer clarify artifact metadata where it helps downstream runtime behavior, for example: | |
| 77 | - - explicit non-goal status | |
| 78 | - - explicit decision-boundary status | |
| 79 | - - whether a pressure pass occurred | |
| 80 | - - likely touchpoint evidence | |
| 81 | - - inferred vs confirmed boundaries | |
| 82 | -- keep this grounded in Loader's existing tool surface rather than inventing a large research subsystem | |
| 83 | - | |
| 84 | -This moves Loader closer to OMX's "reduce user effort and don't ask for facts we can discover" principle. | |
| 85 | - | |
| 86 | -### 3. Structured semantic evidence for invalidation and replan decisions | |
| 87 | - | |
| 88 | -Sprint 11 improved invalidation, but it still reasons mostly from text coverage. Sprint 12 should give recovery choices a stronger evidence model. | |
| 89 | - | |
| 90 | -Implementation targets: | |
| 91 | - | |
| 92 | -- define a structured invalidation/evidence contract under `src/loader/runtime/`, for example around: | |
| 93 | - - confirmed touchpoints | |
| 94 | - - inferred touchpoints | |
| 95 | - - acceptance anchors | |
| 96 | - - contradicted assumptions | |
| 97 | - - verification contradiction signals | |
| 98 | - - changed user boundaries after clarify | |
| 99 | -- teach invalidation to distinguish: | |
| 100 | - - plan mismatch | |
| 101 | - - brief contradiction | |
| 102 | - - verification contradiction | |
| 103 | - - stale assumptions | |
| 104 | -- improve recovery selection so Loader can explain not only what it chose, but what evidence forced that choice | |
| 105 | -- preserve "smallest valid recovery move first" as the governing behavior | |
| 106 | - | |
| 107 | -This is how Loader gets from "semantic-ish refresh" to a more trustworthy workflow contract. | |
| 108 | - | |
| 109 | -### 4. Turn orchestration split beyond lane execution | |
| 110 | - | |
| 111 | -Sprint 11 moved clarify/plan lanes out. Sprint 12 should keep shrinking the top-level turn loop. | |
| 112 | - | |
| 113 | -Implementation targets: | |
| 114 | - | |
| 115 | -- extract additional runtime seams under `src/loader/runtime/`, likely around: | |
| 116 | - - turn preparation/bootstrap | |
| 117 | - - workflow recovery/reentry control | |
| 118 | - - completion/continuation orchestration | |
| 119 | - - assistant-response repair routing | |
| 120 | -- make `ConversationRuntime.run_turn(...)` read more like: | |
| 121 | - - initialize turn state | |
| 122 | - - prepare workflow contract | |
| 123 | - - delegate iteration/orchestration helpers | |
| 124 | - - finalize summary | |
| 125 | -- avoid creating a new monolith module; prefer narrow orchestration seams with direct tests | |
| 126 | - | |
| 127 | -A good outcome is that the turn loop becomes easier to reason about and less likely to collect ad hoc behavior again. | |
| 128 | - | |
| 129 | -### 5. Workflow/operator surfaces that explain evidence, not just decisions | |
| 130 | - | |
| 131 | -Sprint 11 made `loader workflow show` more useful. Sprint 12 should make it explain the evidence behind recovery and clarify pressure more directly. | |
| 132 | - | |
| 133 | -Implementation targets: | |
| 134 | - | |
| 135 | -- extend workflow inspection surfaces to show: | |
| 136 | - - whether a pressure pass occurred | |
| 137 | - - which clarify dimension was active | |
| 138 | - - which evidence triggered refresh or reentry | |
| 139 | - - which assumptions were still unresolved | |
| 140 | -- keep the default UX concise, but expose richer detail when explicitly requested | |
| 141 | -- avoid a visual UI in this sprint; prioritize text surfaces that make the runtime easier to debug immediately | |
| 142 | - | |
| 143 | -## Testing strategy | |
| 144 | - | |
| 145 | -- unit coverage for: | |
| 146 | - - clarify pressure-pass progression and readiness gates | |
| 147 | - - codebase-backed clarify question selection from discovered facts | |
| 148 | - - structured invalidation evidence and contradiction handling | |
| 149 | - - new orchestration seams preserving current turn behavior | |
| 150 | -- CLI coverage for: | |
| 151 | - - workflow inspection showing clarify pressure/evidence | |
| 152 | - - session/workflow output for contradiction-driven reentry | |
| 153 | -- deterministic/runtime coverage for: | |
| 154 | - - ambiguous brownfield tasks where Loader asks evidence-backed clarify questions | |
| 155 | - - tasks that need an assumption/tradeoff pressure pass before planning | |
| 156 | - - verification contradictions that trigger targeted refresh vs full re-plan | |
| 157 | - - Sprint 00-11 parity scenarios staying green after the deeper orchestration split | |
| 158 | -- regression coverage: | |
| 159 | - - clarify should not ask the user for repository facts Loader can gather directly | |
| 160 | - - orchestration extraction should not regress the verify/fix or permission/runtime contracts | |
| 161 | - | |
| 162 | -## Definition of done | |
| 163 | - | |
| 164 | -- clarify uses a bounded pressure-pass controller rather than slot selection alone | |
| 165 | -- brownfield clarify can ask evidence-backed questions from discovered facts | |
| 166 | -- invalidation relies on richer structured evidence and contradiction tracking | |
| 167 | -- workflow/operator surfaces explain clarify and recovery evidence more directly | |
| 168 | -- `conversation.py` is slimmer again and more orchestration-shell-like | |
| 169 | -- the full parity baseline remains green after the deeper clarify/orchestration split | |
| 170 | - | |
| 171 | -## Explicitly out of scope | |
| 172 | - | |
| 173 | -- full OMX-style consensus planning | |
| 174 | -- a visual workflow timeline UI | |
| 175 | -- a first-class permission rule editor | |
| 176 | -- AST-aware, LSP-aware, or symbol-aware editing | |
| 177 | -- multi-agent or team orchestration | |
| 178 | - | |
| 179 | -## Audit | |
| 180 | - | |
| 181 | -### Landed | |
| 182 | - | |
| 183 | -- clarify now has explicit pressure-pass discipline instead of only slot-follow-up behavior: `src/loader/runtime/clarify_strategy.py`, `src/loader/runtime/workflow_policy.py`, and `src/loader/runtime/workflow_lanes.py` track readiness gates such as `non_goals`, `decision_boundaries`, and `pressure_pass`, and can drive later clarify rounds toward examples, tradeoffs, and challenged assumptions | |
| 184 | -- brownfield clarify is now grounded in discovered workspace evidence instead of relying only on user answers and task text: `src/loader/runtime/clarify_grounding.py` feeds repo paths, repo facts, slot-aware evidence, pressure-aware evidence, and grounded brief hints into clarify prompts, fallback questions, and persisted brief synthesis | |
| 185 | -- invalidation and recovery now use richer structured evidence than file drift alone: `src/loader/runtime/artifact_invalidation.py`, `src/loader/runtime/workflow_policy.py`, and `src/loader/runtime/workflow_recovery.py` now distinguish confirmed touchpoints, inferred touchpoints, acceptance anchors, contradicted assumptions, verification contradictions, and task-boundary drift, and that evidence is surfaced through workflow inspection | |
| 186 | -- workflow/operator surfaces now explain clarify pressure and recovery evidence more directly: `src/loader/runtime/inspection.py` and `src/loader/cli/main.py` surface pressure metadata, recovery evidence, and the newer workflow history context instead of only route labels | |
| 187 | -- the runtime shell is now genuinely controller-based instead of monolithic: `src/loader/runtime/workflow_recovery.py`, `src/loader/runtime/turn_preparation.py`, `src/loader/runtime/turn_completion.py`, `src/loader/runtime/turn_iteration.py`, `src/loader/runtime/turn_preamble.py`, `src/loader/runtime/workflow_state.py`, and `src/loader/runtime/turn_loop.py` now own distinct orchestration seams, and `src/loader/runtime/conversation.py` is down to a compact coordinator | |
| 188 | - | |
| 189 | -### Verification | |
| 190 | - | |
| 191 | -- `uv run pytest -q` is green: `231 passed` | |
| 192 | -- `tests/test_clarify_strategy.py` covers pressure-pass reviews, readiness gates, and later-round clarify pressure selection | |
| 193 | -- `tests/test_clarify_grounding.py` covers workspace evidence extraction, slot-aware evidence selection, pressure-aware grounding, and grounded brief hints | |
| 194 | -- `tests/test_artifact_invalidation.py`, `tests/test_workflow_policy.py`, `tests/test_workflow_runtime.py`, and `tests/test_inspection.py` cover structured drift evidence, contradiction-driven recovery, workflow pressure metadata, and operator-facing recovery summaries | |
| 195 | -- `tests/test_turn_preparation.py`, `tests/test_turn_completion.py`, `tests/test_turn_iteration.py`, `tests/test_turn_preamble.py`, `tests/test_workflow_state.py`, and `tests/test_turn_loop.py` give direct coverage to the new controller seams instead of relying only on large end-to-end runtime tests | |
| 196 | -- targeted `ruff` checks stayed green on the touched runtime/controller modules and their new tests throughout the extraction work, and the full suite remained green after each slice | |
| 197 | - | |
| 198 | -### Residual debt | |
| 199 | - | |
| 200 | -- clarify is now pressure-aware and grounded, but it is still bounded and lighter than OMX's deeper interview style; Loader still does not adapt interview depth by task class or run richer challenge/consensus passes | |
| 201 | -- the new invalidation evidence is a much better contract than text overlap alone, but it is still runtime-authored and heuristic; Loader still does not use deeper semantic reasoning over artifacts, symbols, or model-assisted contradiction analysis | |
| 202 | -- `src/loader/runtime/conversation.py` is now a real coordinator, but `src/loader/runtime/turn_iteration.py` remains the heaviest seam and still carries a fair amount of repair/completion/tool-routing policy that claw-code spreads across even narrower runtime modules | |
| 203 | -- workflow/operator surfaces explain more than they did at Sprint 11, but they still stop short of artifact diffs, prompt/history comparison, and richer timeline drill-down | |
| 204 | -- Loader is much closer to a controller-based runtime than it was at the start of Sprint 12, but it still does not match claw-code or OMX on deeper planning rigor, semantic artifact discipline, or broader operator ergonomics | |
.docs/sprints/sprint13.mddeleted@@ -1,203 +0,0 @@ | ||
| 1 | -# Sprint 13: Turn Policy Narrowing, Assumption Ledger, and Artifact Diffs | |
| 2 | - | |
| 3 | -## Prerequisites | |
| 4 | - | |
| 5 | -Sprint 12 | |
| 6 | - | |
| 7 | -## Goals | |
| 8 | - | |
| 9 | -Turn Loader's newly controllerized runtime into a more semantically explicit workflow system by shrinking the still-heavy `turn_iteration` seam, promoting assumptions and contradictions into first-class workflow state, and giving operators diff-oriented artifact visibility instead of only latest-state inspection. | |
| 10 | - | |
| 11 | -Sprint 12 was a real structural win. Loader now has pressure-pass clarify, codebase-backed grounding, structured recovery evidence, and a controller-shaped runtime shell. That meaningfully closes the gap with claw-code and OMX. The audit is also honest about what still hurts: | |
| 12 | - | |
| 13 | -- `turn_iteration.py` is still carrying a lot of repair, tool-routing, and completion policy in one seam | |
| 14 | -- contradiction and invalidation evidence are richer than before, but they are still mostly runtime-authored summaries rather than a reusable semantic ledger | |
| 15 | -- operator surfaces can explain "why did this happen?" better than before, but they still cannot show "what changed?" across briefs, plans, verification, or prompt contracts | |
| 16 | -- Loader now has better workflow discipline, but it still lacks some of the day-two operator ergonomics that make claw-code and OMX easier to trust during long tasks | |
| 17 | - | |
| 18 | -The next leverage point is to stop treating semantic drift and operator visibility as one-off summaries and start treating them as durable contracts: | |
| 19 | - | |
| 20 | -- the turn runtime should classify and route assistant output through narrower policy seams | |
| 21 | -- assumptions, contradictions, and acceptance anchors should survive across workflow phases as explicit state | |
| 22 | -- inspection should be able to show diffs between the artifacts and prompt contracts that drove behavior | |
| 23 | - | |
| 24 | -This sprint is about making Loader more inspectable and less accidental: | |
| 25 | - | |
| 26 | -- `turn_iteration` shrinks into narrower policy-oriented seams | |
| 27 | -- workflow invalidation gains an explicit assumption/contradiction ledger | |
| 28 | -- operator tooling gains artifact and prompt diff visibility | |
| 29 | -- Loader gets closer to claw-code not just in structure, but in debuggability | |
| 30 | - | |
| 31 | -The references for this sprint are: | |
| 32 | - | |
| 33 | -- `refs/claw-code/rust/crates/runtime/src/conversation.rs` | |
| 34 | -- `refs/claw-code/rust/crates/runtime/src/policy_engine.rs` | |
| 35 | -- `refs/claw-code/rust/crates/runtime/src/prompt.rs` | |
| 36 | -- `refs/claw-code/PARITY.md` | |
| 37 | -- `refs/oh-my-codex/src/ralplan/runtime.ts` | |
| 38 | -- `refs/oh-my-codex/src/modes/base.ts` | |
| 39 | -- `refs/oh-my-codex/src/verification/verifier.ts` | |
| 40 | -- `refs/oh-my-codex/skills/deep-interview/SKILL.md` | |
| 41 | -- `refs/oh-my-codex/skills/ralplan/SKILL.md` | |
| 42 | - | |
| 43 | -## Deliverables | |
| 44 | - | |
| 45 | -### 1. Split `turn_iteration` into narrower response-policy seams | |
| 46 | - | |
| 47 | -Sprint 12 made `conversation.py` coordinator-shaped. Sprint 13 should keep the same discipline for the still-heavy iteration seam. | |
| 48 | - | |
| 49 | -Implementation targets: | |
| 50 | - | |
| 51 | -- extract narrower helpers under `src/loader/runtime/`, likely around: | |
| 52 | - - assistant-response classification | |
| 53 | - - repair routing | |
| 54 | - - final-answer routing | |
| 55 | - - tool-batch routing | |
| 56 | - - no-tool completion handoff | |
| 57 | -- make `turn_iteration.py` read more like: | |
| 58 | - - request assistant turn | |
| 59 | - - classify response | |
| 60 | - - delegate the winning route | |
| 61 | - - return loop-state deltas | |
| 62 | -- keep the main behavior unchanged while reducing policy density per module | |
| 63 | -- add direct controller tests so future iteration changes do not depend only on broad runtime integration coverage | |
| 64 | - | |
| 65 | -The goal is not more files for their own sake. The goal is to make assistant-turn behavior easier to tune deliberately. | |
| 66 | - | |
| 67 | -### 2. Assumption and contradiction ledger instead of one-off evidence summaries | |
| 68 | - | |
| 69 | -Sprint 12 introduced richer drift evidence. Sprint 13 should make that evidence durable and reusable. | |
| 70 | - | |
| 71 | -Implementation targets: | |
| 72 | - | |
| 73 | -- define a typed workflow ledger contract under `src/loader/runtime/` for: | |
| 74 | - - explicit assumptions | |
| 75 | - - confirmed assumptions | |
| 76 | - - contradicted assumptions | |
| 77 | - - acceptance anchors | |
| 78 | - - open decision boundaries | |
| 79 | - - closed decision boundaries | |
| 80 | -- thread that ledger through clarify, planning, verification, and recovery instead of only summarizing evidence at refresh time | |
| 81 | -- persist enough structure to answer: | |
| 82 | - - which assumption was invalidated? | |
| 83 | - - which workflow phase introduced it? | |
| 84 | - - what evidence contradicted it? | |
| 85 | - - whether the contradiction forced refresh, reentry, or only inspection visibility | |
| 86 | -- keep the first version pragmatic and text-first; do not try to build a symbolic reasoning engine | |
| 87 | - | |
| 88 | -This is how Loader gets from "richer summaries" to a more explicit semantic workflow contract. | |
| 89 | - | |
| 90 | -### 3. Artifact and prompt diff surfaces for operators | |
| 91 | - | |
| 92 | -Loader can now show the latest prompt and workflow timeline. Sprint 13 should help operators see what changed. | |
| 93 | - | |
| 94 | -Implementation targets: | |
| 95 | - | |
| 96 | -- add diff-oriented inspection surfaces, likely around: | |
| 97 | - - clarify brief vs refreshed brief | |
| 98 | - - old plan vs refreshed plan | |
| 99 | - - workflow ledger changes across reentry | |
| 100 | - - prompt metadata or prompt-body diffs across relevant turns | |
| 101 | -- keep the product surface text-first and operator-friendly, for example via: | |
| 102 | - - `loader workflow show --diff` | |
| 103 | - - `loader prompt diff` | |
| 104 | - - or an equivalent `loader artifact show` family if that is cleaner | |
| 105 | -- include concise change summaries by default and fuller diffs when explicitly requested | |
| 106 | -- avoid a visual UI in this sprint; prioritize fast CLI/TUI debugging value | |
| 107 | - | |
| 108 | -The goal is to make workflow changes legible, not just persisted. | |
| 109 | - | |
| 110 | -### 4. Workflow/operator surfaces that explain semantic change, not only event history | |
| 111 | - | |
| 112 | -Sprint 12 improved evidence visibility. Sprint 13 should improve semantic visibility. | |
| 113 | - | |
| 114 | -Implementation targets: | |
| 115 | - | |
| 116 | -- extend inspection surfaces so they can show: | |
| 117 | - - which assumptions remain open | |
| 118 | - - which assumptions were contradicted | |
| 119 | - - which acceptance anchors changed across clarify/plan/verify | |
| 120 | - - whether a refresh was forced by contradiction, touchpoint drift, or acceptance drift | |
| 121 | -- preserve concise defaults so everyday status remains readable | |
| 122 | -- make session/workflow output useful for long-running or resumed tasks, not only single-turn debugging | |
| 123 | - | |
| 124 | -This brings Loader closer to claw-code's stronger operator trust model. | |
| 125 | - | |
| 126 | -### 5. Keep the parity baseline honest while the runtime narrows again | |
| 127 | - | |
| 128 | -Sprint 12 closed a big structural loop. Sprint 13 should protect that gain. | |
| 129 | - | |
| 130 | -Implementation targets: | |
| 131 | - | |
| 132 | -- add direct tests for the newly split iteration policy seams | |
| 133 | -- extend workflow/inspection coverage for diff and ledger behavior | |
| 134 | -- keep existing parity scenarios green after the iteration split | |
| 135 | -- update `PARITY.md` and the sprint audit only after the new surfaces and contracts are actually covered | |
| 136 | - | |
| 137 | -## Testing strategy | |
| 138 | - | |
| 139 | -- unit coverage for: | |
| 140 | - - response classification and per-route delegation | |
| 141 | - - assumption-ledger updates and contradiction recording | |
| 142 | - - artifact/prompt diff formatting and summaries | |
| 143 | - - workflow refresh decisions reading from the new ledger state | |
| 144 | -- CLI coverage for: | |
| 145 | - - prompt/artifact/workflow diff surfaces | |
| 146 | - - workflow/session output for contradiction-led refreshes | |
| 147 | -- deterministic/runtime coverage for: | |
| 148 | - - a clarify answer that seeds assumptions later contradicted during verification | |
| 149 | - - a plan refresh where the operator surface can show exactly what changed | |
| 150 | - - a resumed session where workflow inspection still reflects semantic ledger state | |
| 151 | - - Sprint 00-12 parity scenarios staying green after the deeper iteration split | |
| 152 | -- regression coverage: | |
| 153 | - - iteration refactors should not regress verify/fix, permission, or explore contracts | |
| 154 | - - diff surfaces should read persisted artifacts/session state rather than reconstructing history heuristically | |
| 155 | - | |
| 156 | -## Definition of done | |
| 157 | - | |
| 158 | -- `turn_iteration.py` is slimmer and delegates through narrower response-policy seams | |
| 159 | -- assumptions and contradictions are persisted as explicit workflow state | |
| 160 | -- operators can inspect artifact or prompt diffs from the product surface | |
| 161 | -- workflow inspection explains semantic change, not only route history | |
| 162 | -- the full parity baseline remains green after the deeper iteration split | |
| 163 | - | |
| 164 | -## Explicitly out of scope | |
| 165 | - | |
| 166 | -- full OMX-style consensus planning | |
| 167 | -- a visual workflow diff UI | |
| 168 | -- AST-aware, LSP-aware, or symbol-aware editing | |
| 169 | -- a first-class permission rule editor | |
| 170 | -- multi-agent or team orchestration | |
| 171 | - | |
| 172 | -## Audit | |
| 173 | - | |
| 174 | -### Status | |
| 175 | - | |
| 176 | -- Sprint 13 is complete. The semantic ledger, prompt/artifact diff surfaces, deletion-oriented runtime cleanup, and narrower assistant-response routing seams are now all landed and covered. | |
| 177 | - | |
| 178 | -### Landed | |
| 179 | - | |
| 180 | -- the runtime is less accidental and less puppet-like even before a deeper iteration split: `src/loader/runtime/turn_preamble.py`, `src/loader/runtime/repair.py`, `src/loader/runtime/turn_iteration.py`, `src/loader/runtime/turn_completion.py`, and `src/loader/runtime/completion_policy.py` no longer inject synthetic prefill, no longer puppet repeated empty responses, no longer scold fake-tool narration or deflection through injected reroutes, and no longer bounce long no-tool answers through the self-critique reroute | |
| 181 | -- `turn_iteration.py` is now routing through a dedicated response-policy seam instead of owning final-answer, tool-batch, and no-tool dispatch inline: `src/loader/runtime/response_routing.py` now owns classified assistant-response routing, and `src/loader/runtime/turn_iteration.py` is correspondingly smaller and closer to a request/repair/loop-state controller | |
| 182 | -- assumptions, contradictions, acceptance anchors, and decision-boundary state are now persisted as an explicit workflow ledger instead of one-off summaries: `src/loader/runtime/workflow_ledger.py` and `src/loader/runtime/session.py` define and persist the ledger, `src/loader/runtime/workflow_lanes.py` seeds it from clarify/plan artifacts, and `src/loader/runtime/workflow_recovery.py` updates it from contradiction and freshness evidence | |
| 183 | -- workflow/operator surfaces now explain semantic change more directly: `src/loader/runtime/inspection.py` and `src/loader/cli/main.py` expose the workflow ledger, contradiction highlights, and richer workflow history so operators can see which assumptions remain open, which were contradicted, and which acceptance anchors changed | |
| 184 | -- prompt and artifact change visibility is now a product surface instead of an inferred debugging exercise: `src/loader/runtime/prompt_history.py`, `src/loader/runtime/session.py`, `src/loader/runtime/inspection.py`, `src/loader/agent/loop.py`, and `src/loader/cli/main.py` persist prompt snapshots, add `loader prompt diff`, and add `loader workflow show --diff` with concise summaries by default and fuller unified diffs on demand | |
| 185 | -- the parity baseline stayed green while the semantic contract expanded and `turn_iteration` narrowed: the new coverage lives in `tests/test_runtime_repair_flows.py`, `tests/test_response_routing.py`, `tests/test_workflow_ledger.py`, `tests/test_session_state.py`, `tests/test_inspection.py`, `tests/test_turn_completion.py`, and the existing runtime/workflow inspection tests | |
| 186 | - | |
| 187 | -### Verification | |
| 188 | - | |
| 189 | -- `uv run pytest -q` is green: `247 passed` | |
| 190 | -- `tests/test_runtime_repair_flows.py` covers the honest empty-response retry path, no synthetic prefill on first turns, and the absence of the older no-tool puppeting/scolding behavior | |
| 191 | -- `tests/test_response_routing.py` covers direct final-answer routing and halted tool-batch routing without relying only on the larger iteration loop tests | |
| 192 | -- `tests/test_workflow_ledger.py` covers ledger seeding, contradiction tracking, acceptance-anchor updates, and operator-facing highlight summaries | |
| 193 | -- `tests/test_session_state.py` covers persistence of the workflow ledger and prompt snapshot history across saved/resumed sessions | |
| 194 | -- `tests/test_inspection.py` covers workflow-ledger inspection, `loader prompt diff`, and `loader workflow show --diff` against persisted prompt and artifact history | |
| 195 | -- the earlier Sprint 12 controller/runtime coverage remained green after the repair cleanup and semantic-state additions, so Sprint 13 did not regress the controllerized turn runtime while adding richer inspection state | |
| 196 | - | |
| 197 | -### Residual debt | |
| 198 | - | |
| 199 | -- `src/loader/runtime/turn_iteration.py` is no longer the main policy knot, but `src/loader/runtime/response_routing.py` and `src/loader/runtime/tool_batches.py` still carry more heuristic response/tool policy than the narrower reference seams in claw-code | |
| 200 | -- the workflow ledger is intentionally pragmatic and text-first; Loader still does not have deeper symbolic reasoning, model-authored contradiction analysis, or richer provenance for every semantic state change | |
| 201 | -- prompt and artifact diffs are based on persisted snapshots and versioned text artifacts; Loader still does not offer pre-run candidate prompt comparison, semantic/AST-aware artifact diffs, or richer visual timeline tooling | |
| 202 | -- older sessions may not have prompt-history or artifact-history depth comparable to new sessions, so the newest diff surfaces are strongest on sessions created after the Sprint 13 persistence changes | |
| 203 | -- Loader is more inspectable and less accidental than it was at the end of Sprint 12, but it still does not match claw-code or OMX on deeper planning rigor, semantic artifact discipline, or broader day-two operator ergonomics | |
.docs/sprints/sprint14.mddeleted@@ -1,190 +0,0 @@ | ||
| 1 | -# Sprint 14: Runtime Context Adoption, Legacy Burn-Down, and Policy Narrowing | |
| 2 | - | |
| 3 | -## Prerequisites | |
| 4 | - | |
| 5 | -Sprint 13 | |
| 6 | - | |
| 7 | -## Goals | |
| 8 | - | |
| 9 | -Turn the newly merged audit-line cleanup into a first-class runtime contract by promoting `RuntimeContext` from a compatibility seam into the primary execution boundary, burning down more agent-owned policy/services, and narrowing the still-heavy response/tool policy helpers. | |
| 10 | - | |
| 11 | -Sprint 13 closed a real loop. Loader now has a semantic workflow ledger, prompt/artifact diff surfaces, a more honest no-tool completion path, and a dedicated response-routing seam. The audit branch is now merged too, which changes the next leverage point in an important way: | |
| 12 | - | |
| 13 | -- Loader now has runtime-owned context, parsing, recovery, rollback, safeguard, and task-classification modules on `trunk` | |
| 14 | -- `src/loader/runtime/` no longer imports `agent/*` directly, and the merged tree is green at `286 passed` | |
| 15 | -- but several of those modules are still only partially adopted, with compatibility layers and legacy agent hooks still carrying too much behavior | |
| 16 | -- `response_routing.py` and `tool_batches.py` are better than the old inline loop, but they still carry more heuristic policy than the narrower seams in claw-code | |
| 17 | -- `agent/loop.py` is smaller in responsibility than it used to be, but it is still doing too much as a holder for runtime-owned decisions, helper methods, and policy services | |
| 18 | -- `agent/reasoning.py` and `agent/safeguards.py` are no longer hidden runtime implementations, but they still own meaningful legacy behavior behind explicit seams | |
| 19 | - | |
| 20 | -The next leverage point is to stop treating the merged audit runtime pieces as optional support modules and start treating them as the default execution contract: | |
| 21 | - | |
| 22 | -- `RuntimeContext` should become the normal runtime boundary, not only a bridge for selected helpers and tests | |
| 23 | -- runtime-owned parsing/recovery/rollback/safeguard/task-classification services should replace more legacy agent ownership | |
| 24 | -- response/tool policy should narrow further now that the runtime service surface is richer | |
| 25 | - | |
| 26 | -This sprint is about consolidating the merge into a cleaner architecture rather than leaving the audit branch as a large historical merge with only partial adoption. | |
| 27 | - | |
| 28 | -The references for this sprint are: | |
| 29 | - | |
| 30 | -- `refs/claw-code/rust/crates/runtime/src/conversation.rs` | |
| 31 | -- `refs/claw-code/rust/crates/runtime/src/policy_engine.rs` | |
| 32 | -- `refs/claw-code/rust/crates/runtime/src/prompt.rs` | |
| 33 | -- `refs/claw-code/PARITY.md` | |
| 34 | -- `.docs/audit_sprints/trunk_sitrep.md` | |
| 35 | -- `.docs/audit_sprints/sprint13_closure.md` | |
| 36 | -- `refs/oh-my-codex/src/ralplan/runtime.ts` | |
| 37 | -- `refs/oh-my-codex/src/verification/verifier.ts` | |
| 38 | - | |
| 39 | -## Deliverables | |
| 40 | - | |
| 41 | -### 1. Promote `RuntimeContext` from bridge to primary runtime contract | |
| 42 | - | |
| 43 | -The merged audit branch introduced a typed runtime context. Sprint 14 should make that seam the default runtime boundary for more helpers. | |
| 44 | - | |
| 45 | -Implementation targets: | |
| 46 | - | |
| 47 | -- adopt `src/loader/runtime/context.py` across more runtime controllers and services so they consume typed runtime state instead of direct `Agent` access where practical | |
| 48 | -- reduce direct runtime dependencies on: | |
| 49 | - - `Agent._extract_raw_json_tool_calls(...)` | |
| 50 | - - `Agent._assess_confidence(...)` | |
| 51 | - - `Agent._verify_action(...)` | |
| 52 | - - ad hoc steering/recovery workflow hooks | |
| 53 | -- make the remaining `legacy` services in `RuntimeContext` explicit migration seams instead of long-term hidden dependencies | |
| 54 | -- shrink the number of callbacks exposed through `RuntimeLegacyServices`, or narrow them so their contracts are obviously transitional | |
| 55 | -- keep the contract pragmatic: | |
| 56 | - - do not rewrite the whole runtime in one sweep | |
| 57 | - - prefer controller/service boundaries that improve testability immediately | |
| 58 | - | |
| 59 | -The goal is not purity for its own sake. The goal is to make runtime behavior easier to reason about and less likely to regress when we continue narrowing policy seams. | |
| 60 | - | |
| 61 | -### 2. Burn down more agent-owned runtime services | |
| 62 | - | |
| 63 | -The merged branch brought runtime-owned modules onto `trunk`. Sprint 14 should make more of them actually own behavior. | |
| 64 | - | |
| 65 | -Implementation targets: | |
| 66 | - | |
| 67 | -- move more effective ownership for the following out of `agent/loop.py` and into runtime modules: | |
| 68 | - - task classification | |
| 69 | - - raw-text parsing and tool-call recovery | |
| 70 | - - rollback planning helpers | |
| 71 | - - safeguard services | |
| 72 | - - recovery prompts / retry guidance | |
| 73 | -- make the remaining `agent/reasoning.py` and `agent/safeguards.py` callbacks inventoryable and explicit, so we can say which ones are still required and which are just legacy inertia | |
| 74 | -- keep `agent/loop.py` focused on: | |
| 75 | - - public entrypoints | |
| 76 | - - user/session-facing orchestration | |
| 77 | - - compatibility wrappers that truly still need to exist | |
| 78 | -- avoid duplicating logic across `agent/*` and `runtime/*`; prefer one real implementation plus compatibility exports only where needed | |
| 79 | - | |
| 80 | -This is how the merged audit work becomes structural improvement instead of passive file accumulation. | |
| 81 | - | |
| 82 | -### 3. Narrow `response_routing.py` and `tool_batches.py` | |
| 83 | - | |
| 84 | -Sprint 13 moved response dispatch out of `turn_iteration.py`. Sprint 14 should keep pushing the same discipline into the next heavy seams. | |
| 85 | - | |
| 86 | -Implementation targets: | |
| 87 | - | |
| 88 | -- split `src/loader/runtime/response_routing.py` into narrower policy helpers where it pays off, likely around: | |
| 89 | - - final-answer routing | |
| 90 | - - raw-text tool routing | |
| 91 | - - no-tool completion routing | |
| 92 | - - halt/finalize decision shaping | |
| 93 | -- split `src/loader/runtime/tool_batches.py` more deliberately around: | |
| 94 | - - confidence gate | |
| 95 | - - recovery handling | |
| 96 | - - DoD post-tool bookkeeping | |
| 97 | - - post-tool verification | |
| 98 | -- keep behavior steady while making route ownership and failure handling easier to test directly | |
| 99 | - | |
| 100 | -The goal is to keep moving away from broad “policy soup” modules and toward claw-code-style narrower execution seams. | |
| 101 | - | |
| 102 | -### 4. Consolidate merged cleanup behavior into the test contract | |
| 103 | - | |
| 104 | -The merge brought in new tests and new runtime compatibility seams. Sprint 14 should turn that into a clearer, intentional contract. | |
| 105 | - | |
| 106 | -Implementation targets: | |
| 107 | - | |
| 108 | -- keep the `RuntimeContext` tests green while reducing how much behavior still depends on compatibility shims | |
| 109 | -- add direct tests for any newly split response/tool policy controllers | |
| 110 | -- keep honest-repair, no synthetic prefill, and no-tool completion cleanup behavior covered after the deeper service migration | |
| 111 | -- extend parity and inspection coverage where the merged audit docs surfaced real operator-facing expectations | |
| 112 | -- treat the merged `.docs/audit_sprints/` artifacts as regression evidence when deciding whether a seam is actually load-bearing | |
| 113 | - | |
| 114 | -This makes the merged audit branch a maintained baseline rather than a one-time reconciliation event. | |
| 115 | - | |
| 116 | -### 5. Reconcile docs after the merged audit line | |
| 117 | - | |
| 118 | -The branch is merged. The docs should stop behaving like the audit line is still “over there.” | |
| 119 | - | |
| 120 | -Implementation targets: | |
| 121 | - | |
| 122 | -- refresh `REPORT.md`, `PARITY.md`, and the sprint audit trail where the merge changed the architectural baseline in a meaningful way | |
| 123 | -- preserve `.docs/audit_sprints/` as historical evidence, not as a second active roadmap | |
| 124 | -- keep the parity checkpoint honest about which runtime-context/service seams are truly primary vs compatibility-bound | |
| 125 | - | |
| 126 | -## Testing strategy | |
| 127 | - | |
| 128 | -- unit coverage for: | |
| 129 | - - `RuntimeContext`-owned service behavior | |
| 130 | - - response-routing subcontrollers | |
| 131 | - - tool-batch subcontrollers | |
| 132 | - - raw-text fallback and capability refresh paths after the deeper context adoption | |
| 133 | -- runtime coverage for: | |
| 134 | - - native-tool and raw-text tool parity | |
| 135 | - - explore mode through the typed runtime context | |
| 136 | - - verification/recovery after service migration | |
| 137 | - - Sprint 00-13 parity scenarios staying green after the context/service ownership shift | |
| 138 | -- regression coverage: | |
| 139 | - - no synthetic prefill | |
| 140 | - - no repeated empty-response puppeting | |
| 141 | - - no self-critique reroute regression | |
| 142 | - - no accidental rollback/recovery ownership drift between `agent/*` and `runtime/*` | |
| 143 | - | |
| 144 | -## Definition of done | |
| 145 | - | |
| 146 | -- `RuntimeContext` is a real primary runtime seam for more controllers/services, not just a compatibility adapter | |
| 147 | -- more runtime-owned parsing/recovery/rollback/safeguard/task-classification behavior is actually moved off `agent/loop.py` | |
| 148 | -- `response_routing.py` and `tool_batches.py` are narrower and more directly tested | |
| 149 | -- the merged audit line is reflected as one baseline, not two parallel architectures | |
| 150 | -- the remaining legacy callbacks are enumerated, narrower, and clearly transitional | |
| 151 | -- the full parity baseline remains green after the deeper runtime-context adoption | |
| 152 | - | |
| 153 | -## Explicitly out of scope | |
| 154 | - | |
| 155 | -- full claw-code policy-engine parity | |
| 156 | -- AST-aware or LSP-aware semantic artifact diffs | |
| 157 | -- visual workflow or timeline UIs | |
| 158 | -- multi-agent or team orchestration | |
| 159 | - | |
| 160 | -## Audit | |
| 161 | - | |
| 162 | -### Status | |
| 163 | - | |
| 164 | -- Sprint 14 is complete, and the audit is green. `RuntimeContext` is now the normal runtime seam for the main turn path rather than a selective bridge layered over legacy callbacks. | |
| 165 | - | |
| 166 | -### Landed | |
| 167 | - | |
| 168 | -- `src/loader/runtime/context.py` is now a real primary contract instead of a partial adapter: the old `RuntimeLegacyServices` shim is gone, workflow-mode mutation lives on the typed context, and the runtime can refresh capability state, steering state, and workflow state without routing back through a legacy wrapper | |
| 169 | -- runtime-owned service adoption is materially deeper than the merged-audit baseline: `src/loader/runtime/workflow_state.py`, `src/loader/runtime/phases.py`, `src/loader/runtime/repair.py`, `src/loader/runtime/completion_policy.py`, `src/loader/runtime/turn_completion.py`, `src/loader/runtime/response_route_handlers.py`, `src/loader/runtime/response_routing.py`, `src/loader/runtime/turn_loop.py`, `src/loader/runtime/turn_iteration.py`, `src/loader/runtime/finalization.py`, `src/loader/runtime/workflow_lanes.py`, and `src/loader/runtime/workflow_recovery.py` now consume typed runtime state instead of reaching into `Agent` for session, backend, registry, or workflow state | |
| 170 | -- raw-text tool recovery no longer depends on a hidden `Agent._extract_raw_json_tool_calls(...)` escape hatch: `src/loader/runtime/repair.py` now routes fallback parsing through the runtime parser plus the active registry, which closes an old audit concern around newer tools such as `TodoWrite` | |
| 171 | -- the response/tool path is narrower and more directly testable: the earlier extraction of `src/loader/runtime/tool_batch_checks.py`, `src/loader/runtime/tool_batch_recovery.py`, `src/loader/runtime/response_route_handlers.py`, and `src/loader/runtime/response_route_types.py` is now paired with typed-context adoption across the hot path, so response routing, tool-batch gating, no-tool completion, and finalization no longer behave like disguised `agent/loop.py` helpers | |
| 172 | -- the merged audit line is now reflected as one architectural baseline instead of a second hidden runtime: the active `trunk` runtime owns reasoning callbacks, raw-text recovery, response policy, workflow state, turn state, and finalization directly, and the remaining `agent/*` ownership is much smaller and easier to inventory | |
| 173 | -- the test contract around this migration is stronger and more intentional: `tests/test_runtime_context.py`, `tests/test_runtime_state_controllers.py`, `tests/test_completion_policy.py`, `tests/test_repair.py`, `tests/test_response_route_handlers.py`, `tests/test_turn_loop.py`, `tests/test_turn_iteration.py`, and `tests/test_explore_runtime.py` now pin the typed-context contract directly instead of relying only on large integration tests | |
| 174 | - | |
| 175 | -### Verification | |
| 176 | - | |
| 177 | -- `uv run pytest -q` is green: `303 passed` | |
| 178 | -- `tests/test_runtime_context.py` and `tests/test_runtime_state_controllers.py` cover typed context construction plus direct workflow-state and phase-tracker behavior without an `Agent` object on the other side | |
| 179 | -- `tests/test_repair.py` covers raw-text fallback through the runtime parser/registry, including modern workflow-tool recovery such as `TodoWrite` | |
| 180 | -- `tests/test_completion_policy.py`, `tests/test_turn_completion.py`, `tests/test_response_route_handlers.py`, `tests/test_response_routing.py`, `tests/test_turn_iteration.py`, and `tests/test_turn_loop.py` cover the main response-policy and assistant-cycle path after the deeper context adoption | |
| 181 | -- `tests/test_explore_runtime.py` still proves explore refreshes capabilities before the first request, so the typed runtime context is also now load-bearing outside the main task loop | |
| 182 | -- the larger workflow/runtime suites remained green after the migration, so Sprint 14 did not trade architectural cleanup for parity regressions | |
| 183 | - | |
| 184 | -### Residual debt | |
| 185 | - | |
| 186 | -- `src/loader/runtime/conversation.py` and `src/loader/runtime/explore.py` still bootstrap from `agent._build_runtime_context()`, and `conversation.py` still performs a post-prepare sync of capability/prompt state from the agent wrapper; the remaining runtime/agent coupling is now mostly bootstrap ownership rather than policy ownership | |
| 187 | -- `src/loader/agent/loop.py` still owns meaningful planning/prompt/session orchestration outside the hot runtime path, so Loader is much cleaner than the merged-audit baseline but has not fully collapsed to a minimal public entrypoint shell yet | |
| 188 | -- `src/loader/agent/reasoning.py` and `src/loader/agent/safeguards.py` are now behind typed runtime protocols, but they still own meaningful behavior and remain future burn-down candidates if Sprint 15 wants to keep reducing agent-owned runtime services | |
| 189 | -- `src/loader/runtime/tool_batches.py` and parts of `src/loader/runtime/workflow_lanes.py` are narrower than before, but they still carry more heuristic policy than the tighter claw-code reference seams | |
| 190 | -- the workflow policy is stronger and the runtime contract is cleaner, but Loader still stops short of claw-code's fuller policy engine, OMX's deeper planning/interview rigor, and a richer operator UX for editing or simulating policy/rule state | |
.docs/sprints/sprint15.mddeleted@@ -1,182 +0,0 @@ | ||
| 1 | -# Sprint 15: Bootstrap Ownership, Service Burn-Down, and Explore Independence | |
| 2 | - | |
| 3 | -## Prerequisites | |
| 4 | - | |
| 5 | -Sprint 14 | |
| 6 | - | |
| 7 | -## Goals | |
| 8 | - | |
| 9 | -Finish the last high-value runtime cleanup that Sprint 14 exposed: move bootstrap ownership and the remaining agent-owned runtime services onto explicit runtime seams, so Loader's hot path is not only context-driven after initialization, but also context-driven at initialization. | |
| 10 | - | |
| 11 | -Sprint 14 was a real architectural win. `RuntimeContext` is now the primary seam across workflow state, turn phases, response repair, response routing, turn looping, workflow recovery, and finalization. The older `RuntimeLegacyServices` shim is gone, raw-text tool recovery no longer depends on hidden agent extractors, and the main runtime path is much less accidental than it was when the audit line first branched. | |
| 12 | - | |
| 13 | -That said, the current residual debt is now very specific: | |
| 14 | - | |
| 15 | -- `conversation.py` and `explore.py` still bootstrap from `agent._build_runtime_context()` | |
| 16 | -- `agent/loop.py` still owns too much prompt/session/bootstrap coordination for a runtime that is otherwise context-owned | |
| 17 | -- `agent/reasoning.py` and `agent/safeguards.py` still own meaningful runtime behavior behind typed protocols | |
| 18 | -- the audit's core warning still matters in a narrower form: | |
| 19 | - Loader should keep deleting wrapper-only ownership, not just wrapping it in nicer files | |
| 20 | - | |
| 21 | -Sprint 15 is about finishing that next contraction honestly: | |
| 22 | - | |
| 23 | -- runtime bootstrapping becomes an explicit runtime contract instead of an agent-only helper | |
| 24 | -- explore mode stops being a special runtime that still depends on agent construction shape | |
| 25 | -- reasoning and safeguard ownership become more inventoryable and less agent-bound | |
| 26 | -- `agent/loop.py` shrinks further toward entrypoint/session orchestration instead of runtime-service ownership | |
| 27 | - | |
| 28 | -This sprint should feel like closing the structural loop opened by Sprint 14, not starting a new product branch. | |
| 29 | - | |
| 30 | -The references for this sprint are: | |
| 31 | - | |
| 32 | -- `refs/claw-code/rust/crates/runtime/src/conversation.rs` | |
| 33 | -- `refs/claw-code/rust/crates/runtime/src/policy_engine.rs` | |
| 34 | -- `refs/claw-code/rust/crates/runtime/src/prompt.rs` | |
| 35 | -- `refs/claw-code/rust/crates/runtime/src/runtime_context.rs` | |
| 36 | -- `refs/claw-code/PARITY.md` | |
| 37 | -- `.docs/audit.txt` | |
| 38 | -- `.docs/audit_sprints/trunk_sitrep.md` | |
| 39 | -- `.docs/audit_sprints/sprint13_closure.md` | |
| 40 | -- `refs/oh-my-codex/src/ralplan/runtime.ts` | |
| 41 | -- `refs/oh-my-codex/src/verification/verifier.ts` | |
| 42 | - | |
| 43 | -## Deliverables | |
| 44 | - | |
| 45 | -### 1. Runtime bootstrap becomes a first-class runtime seam | |
| 46 | - | |
| 47 | -Sprint 14 made `RuntimeContext` load-bearing after construction. Sprint 15 should make construction itself less agent-special. | |
| 48 | - | |
| 49 | -Implementation targets: | |
| 50 | - | |
| 51 | -- introduce an explicit runtime bootstrap/factory seam under `src/loader/runtime/`, likely around: | |
| 52 | - - building `RuntimeContext` | |
| 53 | - - initializing project/session/prompt/capability state needed by runtimes | |
| 54 | - - synchronizing prompt/capability metadata when the backend or prompt contract changes | |
| 55 | -- reduce direct runtime dependence on `agent._build_runtime_context()` so `ConversationRuntime` and `ExploreRuntime` do not depend on a hidden agent helper as their primary construction mechanism | |
| 56 | -- keep the contract pragmatic: | |
| 57 | - - it is acceptable for `Agent` to call the factory | |
| 58 | - - it is not acceptable for runtime correctness to depend on ad hoc agent-only bootstrap behavior | |
| 59 | - | |
| 60 | -The goal is to make runtime ownership explicit from the first line of construction, not only once the turn is already running. | |
| 61 | - | |
| 62 | -### 2. Burn down more agent-owned runtime services | |
| 63 | - | |
| 64 | -The remaining runtime service ownership now lives mostly in `agent/reasoning.py` and `agent/safeguards.py`. | |
| 65 | - | |
| 66 | -Implementation targets: | |
| 67 | - | |
| 68 | -- inventory the still-runtime-relevant behavior in: | |
| 69 | - - `src/loader/agent/reasoning.py` | |
| 70 | - - `src/loader/agent/safeguards.py` | |
| 71 | -- move or re-home the behavior that is still genuinely runtime-owned, especially around: | |
| 72 | - - confidence / verification service boundaries | |
| 73 | - - stream filtering / steering / duplicate detection | |
| 74 | - - action validation hooks | |
| 75 | -- prefer one real implementation plus compatibility exports over keeping a runtime wrapper around an agent-owned implementation indefinitely | |
| 76 | -- explicitly delete or retire dead wrapper layers where the runtime already has a better home | |
| 77 | - | |
| 78 | -This is the sprint where we should be suspicious of “adapter forever” solutions. If a behavior is still part of the runtime contract, it should increasingly live under `runtime/`. | |
| 79 | - | |
| 80 | -### 3. Explore runtime should share the same bootstrap discipline | |
| 81 | - | |
| 82 | -Explore is intentionally narrower than the main runtime, but it should not be structurally special in the wrong way. | |
| 83 | - | |
| 84 | -Implementation targets: | |
| 85 | - | |
| 86 | -- remove or narrow the `ExploreRuntime(agent)` construction shape so explore can be built from the same runtime bootstrap contract as the main runtime | |
| 87 | -- keep the read-only registry, read-only permission mode, and capability refresh behavior intact | |
| 88 | -- add direct tests for explore bootstrap and state ownership so explore remains a maintained runtime lane rather than a side path | |
| 89 | - | |
| 90 | -The goal is not to make explore bigger. The goal is to make it less magical and more aligned with the primary runtime contract. | |
| 91 | - | |
| 92 | -### 4. Shrink `agent/loop.py` toward entrypoint orchestration | |
| 93 | - | |
| 94 | -Sprint 14 made the runtime path smaller and more explicit. Sprint 15 should let `agent/loop.py` benefit from that work. | |
| 95 | - | |
| 96 | -Implementation targets: | |
| 97 | - | |
| 98 | -- move more bootstrap/session/prompt/runtime wiring out of `agent/loop.py` where it has become runtime ownership in practice | |
| 99 | -- keep `agent/loop.py` focused on: | |
| 100 | - - public entrypoints | |
| 101 | - - session-facing orchestration | |
| 102 | - - UI/event integration | |
| 103 | - - compatibility wrappers that still truly need to exist | |
| 104 | -- avoid letting new runtime helpers bounce back into `agent/loop.py` just to preserve old ownership lines | |
| 105 | - | |
| 106 | -This is the step that turns Sprint 14's seam cleanup into a visibly smaller agent shell. | |
| 107 | - | |
| 108 | -### 5. Keep the audit line active as a regression check, not a second roadmap | |
| 109 | - | |
| 110 | -`audit.txt` is old on specifics but still sharp on the pattern to avoid: additive cleanup that never deletes ownership. | |
| 111 | - | |
| 112 | -Implementation targets: | |
| 113 | - | |
| 114 | -- use the audit's core complaint as a check against Sprint 15 implementation: | |
| 115 | - - do not add a new wrapper if we can adopt or delete | |
| 116 | - - do not leave bootstrap ownership ambiguous | |
| 117 | - - do not grow a “temporary” compatibility seam without direct tests and an exit story | |
| 118 | -- update `PARITY.md` and the sprint audit only after the bootstrap/service changes are actually covered | |
| 119 | - | |
| 120 | -## Testing strategy | |
| 121 | - | |
| 122 | -- unit coverage for: | |
| 123 | - - runtime bootstrap/factory behavior | |
| 124 | - - explore bootstrap behavior | |
| 125 | - - runtime-owned reasoning/safeguard services after migration | |
| 126 | - - prompt/capability synchronization at the new bootstrap seam | |
| 127 | -- runtime coverage for: | |
| 128 | - - main turn execution through the new bootstrap path | |
| 129 | - - explore mode through the shared bootstrap/runtime contract | |
| 130 | - - Sprint 00-14 parity scenarios staying green after the bootstrap/service migration | |
| 131 | -- regression coverage for: | |
| 132 | - - no reintroduction of hidden raw-text extractors | |
| 133 | - - no reintroduction of legacy callback shims equivalent to `RuntimeLegacyServices` | |
| 134 | - - no ownership drift where runtime modules silently depend on agent-only helpers again | |
| 135 | - | |
| 136 | -## Definition of done | |
| 137 | - | |
| 138 | -- runtime bootstrapping is a first-class runtime seam, not primarily an agent helper | |
| 139 | -- explore mode shares the same bootstrap discipline as the main runtime | |
| 140 | -- more runtime-relevant behavior is moved or retired out of `agent/reasoning.py` and `agent/safeguards.py` | |
| 141 | -- `agent/loop.py` shrinks further toward entrypoint/session orchestration | |
| 142 | -- the parity baseline remains green after the bootstrap/service migration | |
| 143 | - | |
| 144 | -## Explicitly out of scope | |
| 145 | - | |
| 146 | -- full claw-code policy-engine parity | |
| 147 | -- AST-aware or LSP-aware semantic artifact diffs | |
| 148 | -- a richer permission rule editor | |
| 149 | -- visual workflow tooling | |
| 150 | -- multi-agent or team orchestration | |
| 151 | - | |
| 152 | -## Audit | |
| 153 | - | |
| 154 | -### Status | |
| 155 | - | |
| 156 | -- Sprint 15 is complete, and the audit is green. Bootstrap ownership is now explicit under `src/loader/runtime/bootstrap.py`, runtime-owned safeguards and reasoning helpers have canonical homes under `src/loader/runtime/`, and the remaining `agent/*` surface is much closer to entrypoint/session orchestration than runtime-service ownership. | |
| 157 | - | |
| 158 | -### Landed | |
| 159 | - | |
| 160 | -- runtime bootstrap is now a first-class shared seam: `src/loader/runtime/bootstrap.py` defines the typed `RuntimeBootstrapSource` contract plus `build_runtime_context(...)` / `sync_runtime_context(...)`, and both `src/loader/runtime/conversation.py` and `src/loader/runtime/explore.py` construct runtime state through that shared path instead of a hidden `Agent._build_runtime_context()` helper | |
| 161 | -- the old bootstrap helper is gone from `src/loader/agent/loop.py`, and `tests/test_runtime_context.py` now exercises the shared bootstrap contract directly, which makes runtime construction visibly runtime-owned from the first line of setup | |
| 162 | -- safeguard ownership is now honest: `src/loader/runtime/safeguards.py` is the canonical implementation, `src/loader/agent/safeguards.py` is a compatibility export only, and `src/loader/agent/loop.py` now imports `RuntimeSafeguards` from the runtime package rather than its own shim | |
| 163 | -- reasoning ownership is materially cleaner: decomposition/self-critique helpers now live in `src/loader/runtime/deliberation.py`, completion-check parsing now lives in `src/loader/runtime/task_completion.py`, and `src/loader/agent/reasoning.py` has been reduced to a compatibility-export layer over runtime-owned modules instead of a second live implementation | |
| 164 | -- `src/loader/agent/loop.py` is substantially smaller and less misleading than the Sprint 14 baseline: dead planner hooks, dead self-critique hooks, dead raw extraction helpers, and the unused `src/loader/agent/planner.py` module have been deleted, bringing the loop shell down to `668` lines from the earlier four-figure baseline | |
| 165 | -- the direct proof contract is stronger and more intentional: `tests/test_runtime_bootstrap.py`, `tests/test_safeguard_services.py`, `tests/test_reasoning_compat.py`, and `tests/test_runtime_context.py` now pin the shared bootstrap seam plus the compatibility-export story for safeguards/reasoning directly, rather than relying only on larger runtime integration tests | |
| 166 | - | |
| 167 | -### Verification | |
| 168 | - | |
| 169 | -- `uv run pytest -q` is green: `312 passed` | |
| 170 | -- `tests/test_runtime_bootstrap.py` covers the shared bootstrap contract, prompt/capability synchronization, and both conversation/explore construction through the runtime bootstrap seam | |
| 171 | -- `tests/test_runtime_context.py` now proves typed context construction without an `Agent._build_runtime_context()` escape hatch | |
| 172 | -- `tests/test_safeguard_services.py` proves `src/loader/runtime/safeguards.py` is the canonical implementation and `loader.agent.safeguards` is compatibility-only | |
| 173 | -- `tests/test_reasoning_compat.py` proves the runtime-owned deliberation/completion helpers are canonical and `loader.agent.reasoning` re-exports those runtime implementations rather than carrying a second live copy | |
| 174 | -- the full parity suite remained green after the service migration and deletion work, so Sprint 15 reduced ownership ambiguity without trading away deterministic runtime coverage | |
| 175 | - | |
| 176 | -### Residual debt | |
| 177 | - | |
| 178 | -- `src/loader/runtime/conversation.py` and `src/loader/runtime/explore.py` no longer depend on a hidden bootstrap helper, but their constructors still start from an `Agent`-shaped bootstrap source at the entrypoint boundary; the remaining coupling is now public bootstrap orchestration rather than hidden runtime behavior | |
| 179 | -- `src/loader/agent/loop.py` is much smaller and cleaner, but it still owns the conversational fast path, decomposition orchestration, public run/explore entrypoints, and session/UI-facing glue; it is closer to the right shell, not yet minimal | |
| 180 | -- `src/loader/agent/reasoning.py` and `src/loader/agent/safeguards.py` are now compatibility shims instead of primary implementations, but those compatibility exports still exist until we decide whether the external import surface can be reduced further | |
| 181 | -- explore mode now shares the bootstrap discipline, but it is still a one-shot read-only lane rather than a richer interactive inspection workflow | |
| 182 | -- Loader’s workflow/runtime architecture is much cleaner after Sprint 15, but it still stops short of claw-code’s tighter policy seams, OMX’s deeper planning/interview rigor, and a richer operator UX around policy/rule authoring | |
.docs/sprints/sprint16.mddeleted@@ -1,189 +0,0 @@ | ||
| 1 | -# Sprint 16: Entrypoint Shell, Launcher Contract, and Explore Continuity | |
| 2 | - | |
| 3 | -## Prerequisites | |
| 4 | - | |
| 5 | -Sprint 15 | |
| 6 | - | |
| 7 | -## Goals | |
| 8 | - | |
| 9 | -Turn Sprint 15's ownership cleanup into a cleaner public runtime shape: reduce the remaining `Agent`-shaped bootstrap dependency, shrink `agent/loop.py` further toward a thin facade, and make explore mode feel like a maintained read-only runtime lane instead of a one-shot side path. | |
| 10 | - | |
| 11 | -Sprint 15 closed an important loop. Loader now has a shared runtime bootstrap seam, runtime-owned safeguards and deliberation helpers, compatibility-only `agent/reasoning.py` and `agent/safeguards.py`, no hidden `Agent._build_runtime_context()` helper, and a much smaller `agent/loop.py`. | |
| 12 | - | |
| 13 | -That leaves a tighter but more visible set of residual debts: | |
| 14 | - | |
| 15 | -- `conversation.py` and `explore.py` still start from an `Agent`-shaped bootstrap source at the public entrypoint layer | |
| 16 | -- `agent/loop.py` still owns conversational fast-path behavior, decomposition orchestration, and too much session/UI-facing wiring for a runtime that otherwise lives under `src/loader/runtime/` | |
| 17 | -- `agent/reasoning.py` and `agent/safeguards.py` are now compatibility-only, but Loader has not yet decided how narrow that compatibility surface should become | |
| 18 | -- explore mode is structurally cleaner than it was, but it is still a one-shot lookup lane rather than a more intentional read-only workflow | |
| 19 | - | |
| 20 | -Sprint 16 is about finishing that next contraction without drifting into a brand-new roadmap: | |
| 21 | - | |
| 22 | -- the runtime gets a first-class launcher/entrypoint contract instead of depending on an `Agent`-shaped bootstrap source everywhere | |
| 23 | -- `agent/loop.py` shrinks further toward a public facade over runtime-owned launch and orchestration helpers | |
| 24 | -- compatibility exports remain explicit, tested, and intentionally narrow instead of becoming permanent soft ownership | |
| 25 | -- explore mode gains a better continuity/inspection contract so it feels like a real product lane, not just a single-turn utility | |
| 26 | - | |
| 27 | -This sprint should feel like consolidating the public shell after Sprint 15, not reopening the old runtime ownership debates. | |
| 28 | - | |
| 29 | -The references for this sprint are: | |
| 30 | - | |
| 31 | -- `refs/claw-code/rust/crates/runtime/src/conversation.rs` | |
| 32 | -- `refs/claw-code/rust/crates/runtime/src/runtime_context.rs` | |
| 33 | -- `refs/claw-code/rust/crates/runtime/src/policy_engine.rs` | |
| 34 | -- `refs/claw-code/rust/crates/runtime/src/prompt.rs` | |
| 35 | -- `refs/claw-code/PARITY.md` | |
| 36 | -- `.docs/audit.txt` | |
| 37 | -- `.docs/audit_sprints/trunk_sitrep.md` | |
| 38 | -- `.docs/sprints/sprint15.md` | |
| 39 | -- `refs/oh-my-codex/src/ralplan/runtime.ts` | |
| 40 | -- `refs/oh-my-codex/src/verification/verifier.ts` | |
| 41 | - | |
| 42 | -## Deliverables | |
| 43 | - | |
| 44 | -### 1. Introduce a first-class runtime launcher contract | |
| 45 | - | |
| 46 | -Sprint 15 made bootstrap shared. Sprint 16 should make the public launch path less `Agent`-shaped. | |
| 47 | - | |
| 48 | -Implementation targets: | |
| 49 | - | |
| 50 | -- introduce an explicit launcher/entrypoint seam under `src/loader/runtime/`, likely around: | |
| 51 | - - creating a main-turn runtime | |
| 52 | - - creating an explore runtime | |
| 53 | - - carrying the minimum public bootstrap state needed by those runtimes | |
| 54 | -- reduce direct dependence on `ConversationRuntime(self)` / `ExploreRuntime(self)` from `agent/loop.py` | |
| 55 | -- make the launcher contract narrow and explicit: | |
| 56 | - - project/session/prompt/capability state that truly belongs to launch-time setup | |
| 57 | - - public callbacks that still genuinely need to stay at the `Agent` layer | |
| 58 | -- avoid replacing one implicit wrapper with another vague wrapper; the goal is a visibly smaller ownership surface | |
| 59 | - | |
| 60 | -The goal is not to erase `Agent`. The goal is to make `Agent` a thin public shell over a runtime launcher contract instead of the default owner of launch-time state. | |
| 61 | - | |
| 62 | -### 2. Shrink `agent/loop.py` into a thinner facade | |
| 63 | - | |
| 64 | -Sprint 15 deleted dead code. Sprint 16 should move more of the still-live orchestration to clearer homes. | |
| 65 | - | |
| 66 | -Implementation targets: | |
| 67 | - | |
| 68 | -- move or re-home more of the still-live `agent/loop.py` behavior that has become runtime or launcher ownership in practice, especially: | |
| 69 | - - conversational fast-path handling | |
| 70 | - - decomposition orchestration | |
| 71 | - - runtime/explore construction and launch wiring | |
| 72 | - - session-facing setup that does not truly need to live on the agent object | |
| 73 | -- keep `agent/loop.py` focused on: | |
| 74 | - - public entrypoints | |
| 75 | - - top-level session lifecycle | |
| 76 | - - UI/event integration | |
| 77 | - - explicit compatibility shims that still need to exist | |
| 78 | -- continue deleting dead helper paths instead of letting the shell regrow around the new launcher seam | |
| 79 | - | |
| 80 | -This is the sprint where the public shell should become visibly easier to understand. | |
| 81 | - | |
| 82 | -### 3. Narrow the compatibility-export surface deliberately | |
| 83 | - | |
| 84 | -`agent/reasoning.py` and `agent/safeguards.py` are now compatibility layers. Sprint 16 should make that state intentional rather than indefinite. | |
| 85 | - | |
| 86 | -Implementation targets: | |
| 87 | - | |
| 88 | -- inventory the remaining compatibility-only exports under: | |
| 89 | - - `src/loader/agent/reasoning.py` | |
| 90 | - - `src/loader/agent/safeguards.py` | |
| 91 | -- decide which exports still need to exist for test/import compatibility and which can be retired | |
| 92 | -- keep internal Loader code importing canonical runtime modules instead of their compatibility mirrors | |
| 93 | -- add direct tests that lock the intended compatibility contract: | |
| 94 | - - which symbols are still re-exported | |
| 95 | - - which internal hot paths must not import via compatibility modules anymore | |
| 96 | - | |
| 97 | -The goal is not immediate deletion of every compatibility export. The goal is to stop compatibility from behaving like shadow ownership. | |
| 98 | - | |
| 99 | -### 4. Give explore mode better continuity and inspection shape | |
| 100 | - | |
| 101 | -Explore is cleaner now, but still underpowered as a product lane. | |
| 102 | - | |
| 103 | -Implementation targets: | |
| 104 | - | |
| 105 | -- deepen the explore runtime without turning it into full workflow mode, likely around: | |
| 106 | - - lightweight transcript continuity for follow-up explore questions | |
| 107 | - - clearer inspection or status surfaces for recent explore activity | |
| 108 | - - a stronger read-only session/state story | |
| 109 | -- preserve the current constraints: | |
| 110 | - - read-only registry | |
| 111 | - - no DoD | |
| 112 | - - no mutating workflow artifacts | |
| 113 | - - explicit denial of write/destructive actions | |
| 114 | -- add direct coverage so explore remains a maintained runtime lane rather than a product afterthought | |
| 115 | - | |
| 116 | -The goal is to make explore feel intentionally usable over multiple adjacent questions, not to turn it into a second main runtime. | |
| 117 | - | |
| 118 | -### 5. Keep the audit line active as a deletion-first check | |
| 119 | - | |
| 120 | -The useful part of `audit.txt` is still the bias toward deleting disguised ownership instead of endlessly wrapping it. | |
| 121 | - | |
| 122 | -Implementation targets: | |
| 123 | - | |
| 124 | -- use the audit's core complaint as a Sprint 16 check: | |
| 125 | - - do not leave the launcher contract `Agent`-shaped by habit | |
| 126 | - - do not regrow logic in `agent/loop.py` after moving it out | |
| 127 | - - do not let compatibility exports become the internal default import path again | |
| 128 | -- update `PARITY.md` and the sprint audit only after the launcher/entrypoint work is directly covered | |
| 129 | - | |
| 130 | -## Testing strategy | |
| 131 | - | |
| 132 | -- unit coverage for: | |
| 133 | - - runtime launcher creation and bootstrap narrowing | |
| 134 | - - conversational/decomposition orchestration after re-homing | |
| 135 | - - compatibility-export boundaries | |
| 136 | - - explore continuity state and inspection helpers | |
| 137 | -- runtime coverage for: | |
| 138 | - - main runtime launch through the new launcher contract | |
| 139 | - - explore runtime launch and follow-up continuity | |
| 140 | - - Sprint 00-15 parity scenarios staying green after the entrypoint-shell contraction | |
| 141 | -- regression coverage for: | |
| 142 | - - no return of hidden bootstrap helpers | |
| 143 | - - no internal fallback to compatibility imports for runtime-owned helpers | |
| 144 | - - no mutation leakage into explore continuity/session behavior | |
| 145 | - | |
| 146 | -## Definition of done | |
| 147 | - | |
| 148 | -- Loader has a first-class runtime launcher/entrypoint seam instead of relying on an `Agent`-shaped bootstrap source everywhere | |
| 149 | -- `agent/loop.py` shrinks further toward a public facade over runtime-owned launch/orchestration helpers | |
| 150 | -- compatibility exports under `agent/reasoning.py` and `agent/safeguards.py` are narrower, explicitly tested, and no longer used by internal hot paths | |
| 151 | -- explore mode has a stronger continuity/inspection contract while staying read-only and workflow-light | |
| 152 | -- the parity baseline remains green after the launcher/entrypoint changes | |
| 153 | - | |
| 154 | -## Explicitly out of scope | |
| 155 | - | |
| 156 | -- full claw-code policy-engine parity | |
| 157 | -- multi-agent or team orchestration | |
| 158 | -- AST-aware or LSP-aware semantic artifact diffs | |
| 159 | -- a full visual explore/workflow UI | |
| 160 | - | |
| 161 | -## Audit | |
| 162 | - | |
| 163 | -### Status | |
| 164 | - | |
| 165 | -- Sprint 16 is complete, and the audit is green. Loader now has a real public launcher/entrypoint contract, a visibly smaller `agent/loop.py`, explicit compatibility-boundary proof, and a persisted read-only explore continuity story that stays outside the main workflow runtime. | |
| 166 | - | |
| 167 | -### Landed | |
| 168 | - | |
| 169 | -- the launcher contract is now first-class under `src/loader/runtime/launcher.py`: the public runtime seam no longer just constructs runtimes, it now owns conversational fast-path routing, decomposition entry routing, direct turn routing, and read-only explore launch through a single entry contract | |
| 170 | -- `src/loader/agent/loop.py` has shrunk further toward a real facade: conversational handling and decomposition orchestration now live under `src/loader/runtime/chat_lane.py` and `src/loader/runtime/decomposition_lane.py`, and the remaining shell is much closer to public entrypoints, session lifecycle, prompt factories, and UI/event integration than to runtime ownership | |
| 171 | -- the compatibility surface is now deliberate instead of implicit: `tests/test_compat_boundaries.py`, `tests/test_reasoning_compat.py`, and `tests/test_safeguard_services.py` explicitly lock the current `agent/reasoning.py` / `agent/safeguards.py` export contract and assert that internal runtime code does not drift back to importing through those compatibility shims | |
| 172 | -- explore mode now has a stronger continuity/state story without becoming a second workflow runtime: `src/loader/runtime/explore_state.py` persists a bounded read-only transcript under `.loader/state/explore.json`, `src/loader/runtime/explore.py` reuses that transcript for follow-up questions by default, and `loader explore --fresh` gives operators a clean escape hatch when they want a one-off lookup | |
| 173 | -- operator visibility now reflects that explore state: `src/loader/runtime/inspection.py` includes recent explore activity in the status snapshot, and `src/loader/cli/main.py` surfaces explore turns, bounded transcript state, and the last explore query in `loader status` | |
| 174 | - | |
| 175 | -### Verification | |
| 176 | - | |
| 177 | -- `uv run pytest -q` is green: `329 passed` | |
| 178 | -- `tests/test_runtime_launcher.py`, `tests/test_chat_lane.py`, and `tests/test_decomposition_lane.py` now prove the public launcher contract directly, including conversational routing, decomposition delegation, direct turn routing, and explore launch behavior | |
| 179 | -- `tests/test_runtime_bootstrap.py` and `tests/test_runtime_context.py` remained green after the launcher ownership shift, which keeps the shared bootstrap/context seam honest instead of reintroducing an agent-only backdoor | |
| 180 | -- `tests/test_compat_boundaries.py`, `tests/test_reasoning_compat.py`, and `tests/test_safeguard_services.py` now pin the intended compatibility-export boundary directly and fail if internal Loader code slides back into using those shims as primary imports | |
| 181 | -- `tests/test_explore_runtime.py` and `tests/test_inspection.py` now cover persisted explore continuity, `fresh` resets, and status-surface visibility for recent explore activity | |
| 182 | - | |
| 183 | -### Residual debt | |
| 184 | - | |
| 185 | -- `src/loader/agent/loop.py` is much smaller and clearer, but it still owns prompt/session factories, resume/clear lifecycle, and event-wrapper glue; Loader is close to a minimal public shell, not fully there yet | |
| 186 | -- `src/loader/runtime/conversation.py` and `src/loader/runtime/explore.py` still start from an `Agent`-shaped bootstrap source at the public boundary, even though the launcher contract is now more explicit and load-bearing | |
| 187 | -- the compatibility exports are now intentionally bounded and tested, but they still exist until Loader decides whether the external import surface can narrow further | |
| 188 | -- explore continuity is now real, but it is still transcript-first and lightweight: there is no richer explore inspection command, multi-step read-only workflow, or deeper repo-navigation UX yet | |
| 189 | -- Loader’s runtime shell is materially cleaner after Sprint 16, but it still stops short of claw-code’s tighter policy seams, OMX’s deeper planning/interview rigor, and richer operator tooling around policy, rules, and explore workflows | |
.docs/sprints/sprint17.mddeleted@@ -1,190 +0,0 @@ | ||
| 1 | -# Sprint 17: Bootstrap Source Narrowing, Turn Contract Tightening, and Explore Operator UX | |
| 2 | - | |
| 3 | -## Prerequisites | |
| 4 | - | |
| 5 | -Sprint 16 | |
| 6 | - | |
| 7 | -## Goals | |
| 8 | - | |
| 9 | -Finish the next real contraction after Sprint 16: stop treating the public runtime boundary as an `Agent` object by default, keep deleting in-stream repair ownership where the runtime can enforce a stronger contract instead, and give the new explore continuity story a small but real operator surface. | |
| 10 | - | |
| 11 | -Sprint 16 closed an important shell-level loop. Loader now has a first-class runtime launcher contract, a smaller `agent/loop.py`, explicit compatibility-boundary proof, and persisted explore continuity. That work changed the shape of the remaining debt in a useful way: | |
| 12 | - | |
| 13 | -- the runtime no longer needs `agent/loop.py` to decide chat vs decompose vs direct turn routing | |
| 14 | -- compatibility shims are now explicit and guarded instead of silently load-bearing | |
| 15 | -- explore is no longer purely one-shot | |
| 16 | -- but the runtime still starts from an `Agent`-shaped bootstrap source at the public boundary | |
| 17 | -- `agent/loop.py` still owns prompt/session factory behavior and too much entrypoint lifecycle glue | |
| 18 | -- and the old audit critique still matters in a narrower form: Loader must keep deleting or hardening in-stream repair behavior, not just moving it again | |
| 19 | - | |
| 20 | -`audit.txt` is stale on specifics, test counts, and many file-level claims. It is not the roadmap anymore. But one core warning is still worth carrying into Sprint 17: | |
| 21 | - | |
| 22 | -- do not keep wrapping model-misbehavior recovery in nicer files if the runtime can instead enforce a stronger explicit contract | |
| 23 | - | |
| 24 | -Sprint 17 should use that warning deliberately while staying grounded in the current codebase rather than the old audit snapshot. | |
| 25 | - | |
| 26 | -The references for this sprint are: | |
| 27 | - | |
| 28 | -- `refs/claw-code/rust/crates/runtime/src/conversation.rs` | |
| 29 | -- `refs/claw-code/rust/crates/runtime/src/runtime_context.rs` | |
| 30 | -- `refs/claw-code/rust/crates/runtime/src/policy_engine.rs` | |
| 31 | -- `refs/claw-code/rust/crates/runtime/src/prompt.rs` | |
| 32 | -- `refs/claw-code/PARITY.md` | |
| 33 | -- `.docs/audit.txt` | |
| 34 | -- `.docs/audit_sprints/trunk_sitrep.md` | |
| 35 | -- `.docs/sprints/sprint16.md` | |
| 36 | -- `refs/oh-my-codex/src/ralplan/runtime.ts` | |
| 37 | -- `refs/oh-my-codex/src/verification/verifier.ts` | |
| 38 | - | |
| 39 | -## Deliverables | |
| 40 | - | |
| 41 | -### 1. Narrow the public bootstrap source below `Agent` | |
| 42 | - | |
| 43 | -Sprint 16 gave Loader a launcher contract. Sprint 17 should stop treating that launcher contract as “an agent object with the right fields.” | |
| 44 | - | |
| 45 | -Implementation targets: | |
| 46 | - | |
| 47 | -- introduce a smaller public bootstrap/launcher source under `src/loader/runtime/`, likely around: | |
| 48 | - - backend | |
| 49 | - - registry | |
| 50 | - - session | |
| 51 | - - permission/capability state | |
| 52 | - - prompt/session callbacks that still genuinely need to stay dynamic | |
| 53 | -- reduce direct `RuntimeBootstrapSource` dependence on the full `Agent` object | |
| 54 | -- keep `ConversationRuntime`, `ExploreRuntime`, and `RuntimeLauncher` constructing from that narrower source rather than from `Agent` by convention | |
| 55 | -- avoid moving fields mechanically; the point is to decide which launch-time responsibilities truly belong in the runtime boundary and which belong in the public shell | |
| 56 | - | |
| 57 | -The goal is not “zero references to Agent.” The goal is to make the public runtime boundary look intentionally runtime-shaped rather than coincidentally agent-shaped. | |
| 58 | - | |
| 59 | -### 2. Move prompt/session shell behavior out of `agent/loop.py` | |
| 60 | - | |
| 61 | -Sprint 16 shrank entry routing. Sprint 17 should take the next obvious shell debt: prompt/session factories and lifecycle glue. | |
| 62 | - | |
| 63 | -Implementation targets: | |
| 64 | - | |
| 65 | -- inventory what still makes `src/loader/agent/loop.py` feel heavier than a public facade, especially: | |
| 66 | - - system-prompt construction and snapshot persistence | |
| 67 | - - few-shot example selection | |
| 68 | - - session creation/replacement boilerplate | |
| 69 | - - resume/clear lifecycle helpers that have become reusable runtime-shell behavior in practice | |
| 70 | -- move or re-home the behavior that is no longer meaningfully “agent-owned” | |
| 71 | -- keep `agent/loop.py` focused on: | |
| 72 | - - public entrypoints | |
| 73 | - - explicit lifecycle commands | |
| 74 | - - UI/event wrappers | |
| 75 | - - compatibility accessors that still truly need to exist | |
| 76 | - | |
| 77 | -The goal is to make Sprint 16’s thinner shell materially easier to understand on sight. | |
| 78 | - | |
| 79 | -### 3. Tighten the remaining turn-repair contract | |
| 80 | - | |
| 81 | -This is where `audit.txt` still matters. The remaining question is no longer “did we extract the repair path?” but “are we still repairing things the runtime should simply reject or halt on?” | |
| 82 | - | |
| 83 | -Implementation targets: | |
| 84 | - | |
| 85 | -- inventory the remaining repair/completion heuristics across: | |
| 86 | - - `src/loader/runtime/repair.py` | |
| 87 | - - `src/loader/runtime/completion_policy.py` | |
| 88 | - - `src/loader/runtime/turn_completion.py` | |
| 89 | - - `src/loader/runtime/assistant_turns.py` | |
| 90 | -- identify the heuristics that still look like in-stream puppeting or speculative follow-through rather than explicit runtime contract | |
| 91 | -- prefer one of: | |
| 92 | - - deletion | |
| 93 | - - a stricter typed failure state | |
| 94 | - - an explicit retry budget with honest surfaced failure | |
| 95 | - over adding another wrapper or reroute | |
| 96 | -- add direct regression tests for every deleted/tightened behavior so we do not silently reintroduce it later | |
| 97 | - | |
| 98 | -This is the sprint where we should convert more “runtime tries to rescue the model” behavior into “runtime enforces the contract and reports honestly.” | |
| 99 | - | |
| 100 | -### 4. Give explore a small but real operator surface | |
| 101 | - | |
| 102 | -Sprint 16 gave explore continuity. Sprint 17 should make that state inspectable and manageable. | |
| 103 | - | |
| 104 | -Implementation targets: | |
| 105 | - | |
| 106 | -- add a small operator-facing explore surface, likely around: | |
| 107 | - - recent explore status/history inspection | |
| 108 | - - resetting explore continuity without touching main workflow sessions | |
| 109 | - - clearer visibility into whether an explore query ran fresh vs continued | |
| 110 | -- keep the explore lane intentionally lightweight: | |
| 111 | - - no DoD | |
| 112 | - - no workflow artifacts | |
| 113 | - - no mutation | |
| 114 | - - no conversion into a second main runtime | |
| 115 | -- prefer one or two clean CLI/inspection surfaces over a broad subcommand family | |
| 116 | - | |
| 117 | -The goal is to make explore continuity usable and debuggable, not to build a whole second product mode. | |
| 118 | - | |
| 119 | -### 5. Keep the audit line active as a contract check, not a competing roadmap | |
| 120 | - | |
| 121 | -Implementation targets: | |
| 122 | - | |
| 123 | -- use `audit.txt` only for the still-valid patterns it warns about: | |
| 124 | - - additive wrapper cleanup | |
| 125 | - - hidden ownership | |
| 126 | - - in-stream rescue behavior that should become explicit contract | |
| 127 | -- prefer current code and current sprint audits over old audit counts or old branch-era file claims | |
| 128 | -- update `PARITY.md` and the sprint audit only after the bootstrap narrowing and turn-contract work are directly covered | |
| 129 | - | |
| 130 | -## Testing strategy | |
| 131 | - | |
| 132 | -- unit coverage for: | |
| 133 | - - narrowed launcher/bootstrap source construction | |
| 134 | - - prompt/session-shell helpers after re-homing | |
| 135 | - - deleted or tightened repair/completion heuristics | |
| 136 | - - explore operator surfaces and continuity reset behavior | |
| 137 | -- runtime coverage for: | |
| 138 | - - main runtime launch through the narrowed bootstrap source | |
| 139 | - - explore continuity through the new operator surface | |
| 140 | - - existing launcher/chat/decomposition parity staying green after the shell contraction | |
| 141 | -- regression coverage for: | |
| 142 | - - no new implicit dependence on full `Agent` shape at the runtime boundary | |
| 143 | - - no silent return of deleted repair/continuation heuristics | |
| 144 | - - no explore continuity leakage into main session/DoD state | |
| 145 | - | |
| 146 | -## Definition of done | |
| 147 | - | |
| 148 | -- the public runtime bootstrap source is narrower and less `Agent`-shaped than Sprint 16 | |
| 149 | -- `agent/loop.py` shrinks further toward public facade and lifecycle glue only | |
| 150 | -- Loader deletes or tightens more remaining in-stream repair behavior instead of merely relocating it | |
| 151 | -- explore continuity gains a small operator surface while staying read-only and workflow-light | |
| 152 | -- the parity baseline remains green after the Sprint 17 contract tightening | |
| 153 | - | |
| 154 | -## Explicitly out of scope | |
| 155 | - | |
| 156 | -- full claw-code policy-engine parity | |
| 157 | -- multi-agent or team orchestration | |
| 158 | -- AST-aware or LSP-aware semantic artifact diffs | |
| 159 | -- a full visual explore workflow | |
| 160 | -- a broad rule editor or policy authoring UI | |
| 161 | - | |
| 162 | -## Audit | |
| 163 | - | |
| 164 | -### Status | |
| 165 | - | |
| 166 | -- Sprint 17 is complete, and the audit is green. Loader now starts the public runtime from an explicit runtime-shaped bootstrap view, `agent/loop.py` is materially thinner again, one more rescue-style repair path was converted into an honest failure contract, and explore continuity has a small operator surface instead of being invisible state. | |
| 167 | - | |
| 168 | -### Landed | |
| 169 | - | |
| 170 | -- the public runtime boundary is no longer “raw `Agent` by convention”: `src/loader/runtime/bootstrap.py` now exposes an explicit `RuntimeBootstrapView`, `src/loader/runtime/launcher.py` stores that narrowed source directly, and both `src/loader/runtime/conversation.py` and `src/loader/runtime/explore.py` now construct from that runtime-shaped contract rather than from `Agent`-typed ownership | |
| 171 | -- prompt/session shell behavior moved into runtime-owned helpers under `src/loader/runtime/public_shell.py`: session creation, session restore, prompt construction, prompt snapshot persistence, and few-shot example selection no longer live inline inside `src/loader/agent/loop.py` | |
| 172 | -- `src/loader/agent/loop.py` has shrunk again, now down to 437 lines; its remaining weight is much closer to what Sprint 17 intended: public entrypoints, resume/clear lifecycle, capability refresh, steering, and UI-facing wrapper behavior | |
| 173 | -- the repair contract tightened in a useful audit-aligned place: when raw-text tool recovery exhausts its budget, `src/loader/runtime/repair.py` now stops with an explicit honest failure instead of appending another soft “let me know if you'd like me to continue” rescue line | |
| 174 | -- explore continuity now has a real operator surface while staying workflow-light: `src/loader/runtime/explore_state.py` persists whether the last lookup ran `fresh` or `continue`, `src/loader/runtime/inspection.py` exposes that continuity directly, and `src/loader/cli/main.py` now supports `loader explore --status` and `loader explore --reset` alongside the existing read-only lookup flow | |
| 175 | - | |
| 176 | -### Verification | |
| 177 | - | |
| 178 | -- `uv run pytest -q` is green: `336 passed` | |
| 179 | -- `tests/test_runtime_bootstrap.py`, `tests/test_runtime_launcher.py`, and `tests/test_runtime_context.py` now pin the narrowed bootstrap boundary directly and assert that the launcher/runtime contract is a `RuntimeBootstrapView` rather than a raw `Agent` | |
| 180 | -- `tests/test_runtime_public_shell.py` now covers runtime-owned prompt/session shell helpers directly, including prompt-contract persistence, session creation metadata, and restored last-turn summary state | |
| 181 | -- `tests/test_repair.py` and `tests/test_runtime_repair_flows.py` now cover honest raw-text tool recovery failure once the recovery budget is exhausted | |
| 182 | -- `tests/test_explore_runtime.py` and `tests/test_inspection.py` now cover persisted explore history mode (`fresh` vs `continue`), explore continuity inspection, and explore reset behavior from the CLI | |
| 183 | - | |
| 184 | -### Residual debt | |
| 185 | - | |
| 186 | -- `src/loader/agent/loop.py` is much closer to a public facade than it was at Sprint 16, but it still owns resume/clear lifecycle, steering plumbing, capability refresh, and UI/event wrapper behavior; it is thin enough to be honest, not yet minimal | |
| 187 | -- the public runtime boundary is now explicitly runtime-shaped, but `Agent` still constructs and supplies that boundary; Loader has not yet decided whether later sprints should narrow the public shell further or treat that as stable product architecture | |
| 188 | -- the repair path is more honest than it was, but some completion/continuation heuristics still remain in `src/loader/runtime/completion_policy.py` and `src/loader/runtime/turn_completion.py`; those paths are now the right place to look for any future audit-driven deletions | |
| 189 | -- explore continuity is now inspectable and resettable, but it is still intentionally narrow: no richer browse/navigation UX, no multi-step explore workflow, and no broader product surface than the single-command lookup lane plus the new operator flags | |
| 190 | -- Loader is in a healthier public-boundary shape after Sprint 17, but it still stops short of claw-code’s tighter policy seams, OMX’s deeper planning/interview rigor, and richer operator tooling around policy, rules, and explore workflows | |
.docs/sprints/sprint18.mddeleted@@ -1,189 +0,0 @@ | ||
| 1 | -# Sprint 18: Shell Minimalism, Completion Contract, and Runtime Policy Trace | |
| 2 | - | |
| 3 | -## Prerequisites | |
| 4 | - | |
| 5 | -Sprint 17 | |
| 6 | - | |
| 7 | -## Goals | |
| 8 | - | |
| 9 | -Take the next honest contraction after Sprint 17: finish shrinking the public shell where it is still load-bearing, tighten the remaining completion-policy heuristics that still act like soft rescue behavior, and expose more of the runtime’s stop/continue policy as inspectable state instead of hidden control flow. | |
| 10 | - | |
| 11 | -Sprint 17 closed several important seams: | |
| 12 | - | |
| 13 | -- the public runtime boundary is now explicitly runtime-shaped instead of raw-`Agent` by convention | |
| 14 | -- prompt/session shell behavior moved into `src/loader/runtime/public_shell.py` | |
| 15 | -- raw-text tool recovery now fails honestly once its budget is exhausted | |
| 16 | -- explore continuity now has a small but real operator surface | |
| 17 | - | |
| 18 | -That changes the remaining debt in a useful way: | |
| 19 | - | |
| 20 | -- `src/loader/agent/loop.py` is now much thinner, but it still owns resume/clear lifecycle, steering plumbing, capability refresh, and public event-wrapper glue | |
| 21 | -- `src/loader/runtime/completion_policy.py` and `src/loader/runtime/turn_completion.py` still contain heuristics that may be useful, but they are the clearest remaining place where Loader can still look like it is nudging the model instead of enforcing a typed contract | |
| 22 | -- the runtime now makes more honest decisions, but operators still cannot inspect completion-policy decisions with the same clarity they can inspect permissions, prompts, workflow, or explore continuity | |
| 23 | - | |
| 24 | -Sprint 18 should keep using `refs/claw-code` and `refs/oh-my-codex` as architectural references, not as a literal feature-copy checklist. Loader is no longer in the “blindly match every upstream surface” phase. The right standard now is: | |
| 25 | - | |
| 26 | -- honor claw-code where it provides proven runtime seams, policy discipline, and explicit state ownership | |
| 27 | -- honor OMX where it sharpens workflow/verifier thinking | |
| 28 | -- keep Loader-specific product surfaces when they improve inspectability or fit the current Python runtime better | |
| 29 | - | |
| 30 | -That means Sprint 18 should still be guided by the refs, but it should use them to make better Loader decisions rather than to perform a mechanical port. | |
| 31 | - | |
| 32 | -The references for this sprint are: | |
| 33 | - | |
| 34 | -- `refs/claw-code/rust/crates/runtime/src/conversation.rs` | |
| 35 | -- `refs/claw-code/rust/crates/runtime/src/prompt.rs` | |
| 36 | -- `refs/claw-code/rust/crates/runtime/src/policy_engine.rs` | |
| 37 | -- `refs/claw-code/PARITY.md` | |
| 38 | -- `.docs/PARITY.md` | |
| 39 | -- `.docs/audit.txt` | |
| 40 | -- `.docs/audit_sprints/trunk_sitrep.md` | |
| 41 | -- `.docs/sprints/sprint17.md` | |
| 42 | -- `refs/oh-my-codex/src/ralplan/runtime.ts` | |
| 43 | -- `refs/oh-my-codex/src/verification/verifier.ts` | |
| 44 | - | |
| 45 | -## Deliverables | |
| 46 | - | |
| 47 | -### 1. Shrink the remaining public shell to an intentionally small facade | |
| 48 | - | |
| 49 | -Sprint 17 moved prompt/session helpers out of `agent/loop.py`. Sprint 18 should finish the next obvious shell contraction. | |
| 50 | - | |
| 51 | -Implementation targets: | |
| 52 | - | |
| 53 | -- inventory the remaining `Agent` responsibilities that still feel like runtime/public-shell plumbing rather than true product entrypoint behavior, especially: | |
| 54 | - - resume/clear lifecycle helpers | |
| 55 | - - capability refresh wiring | |
| 56 | - - steering queue plumbing | |
| 57 | - - event-wrapper helpers for run/explore/streaming entrypoints | |
| 58 | -- move what is reusable/runtime-owned into dedicated runtime/public-shell helpers | |
| 59 | -- keep `agent/loop.py` focused on: | |
| 60 | - - public API entrypoints | |
| 61 | - - the minimal state that truly belongs to the agent shell | |
| 62 | - - explicit compatibility or UI-facing integration points | |
| 63 | - | |
| 64 | -The goal is not “delete `Agent`.” The goal is for `agent/loop.py` to read like an intentionally tiny facade rather than “the last place leftover things live.” | |
| 65 | - | |
| 66 | -### 2. Tighten the remaining completion/continuation contract | |
| 67 | - | |
| 68 | -This is the strongest remaining audit line after Sprint 17. | |
| 69 | - | |
| 70 | -Implementation targets: | |
| 71 | - | |
| 72 | -- inventory the remaining heuristic behavior across: | |
| 73 | - - `src/loader/runtime/completion_policy.py` | |
| 74 | - - `src/loader/runtime/turn_completion.py` | |
| 75 | - - `src/loader/runtime/assistant_turns.py` | |
| 76 | - - any nearby turn-loop controllers that still re-enter based on soft textual cues | |
| 77 | -- identify which of those behaviors still look like: | |
| 78 | - - speculative completion rescue | |
| 79 | - - soft continuation nudges that should instead become explicit failure or explicit follow-through state | |
| 80 | - - hidden policy that operators cannot inspect afterward | |
| 81 | -- prefer: | |
| 82 | - - deletion | |
| 83 | - - stricter typed completion decisions | |
| 84 | - - explicit stop/continue reason codes | |
| 85 | - - surfaced runtime evidence | |
| 86 | - over adding more wrapper logic | |
| 87 | - | |
| 88 | -The goal is to continue the Sprint 13 and Sprint 17 line: the runtime should either proceed for a clear reason or stop honestly, not softly coax the model onward without making that policy visible. | |
| 89 | - | |
| 90 | -### 3. Expose completion-policy and stop/continue decisions as inspectable runtime state | |
| 91 | - | |
| 92 | -Loader’s policy decisions should become easier to inspect, not just easier to reason about in code. | |
| 93 | - | |
| 94 | -Implementation targets: | |
| 95 | - | |
| 96 | -- define a small typed trace or summary for completion-policy decisions, likely covering: | |
| 97 | - - why a text response was accepted | |
| 98 | - - why a continuation was requested | |
| 99 | - - why a turn was stopped or finalized | |
| 100 | - - whether the decision came from text-loop detection, DoD gating, raw-tool recovery exhaustion, or completion-policy evaluation | |
| 101 | -- persist enough of that state in the current session/turn summary to support operator inspection | |
| 102 | -- surface it through existing product seams, likely: | |
| 103 | - - `loader status` | |
| 104 | - - `loader session show` | |
| 105 | - - possibly `loader workflow show` if that is the cleanest fit | |
| 106 | - | |
| 107 | -The goal is to make completion policy inspectable the same way permissions, prompts, workflow, and explore continuity are now inspectable. | |
| 108 | - | |
| 109 | -### 4. Keep the ref relationship explicit and healthy | |
| 110 | - | |
| 111 | -By Sprint 18, Loader should be clearly past the “copy the refs feature-for-feature” stage without losing the discipline the refs gave us. | |
| 112 | - | |
| 113 | -Implementation targets: | |
| 114 | - | |
| 115 | -- use claw-code for: | |
| 116 | - - runtime seam quality | |
| 117 | - - explicit policy/state ownership | |
| 118 | - - prompt/runtime separation | |
| 119 | -- use OMX for: | |
| 120 | - - verifier/workflow pressure and follow-through ideas | |
| 121 | -- do not add work just because the refs have it | |
| 122 | -- do add work when a ref reveals a real Loader weakness in: | |
| 123 | - - honesty | |
| 124 | - - inspectability | |
| 125 | - - runtime ownership | |
| 126 | - - follow-through | |
| 127 | - | |
| 128 | -The goal is to make Sprint 18 explicitly “reference-guided Loader optimization,” not “shadow-porting another codebase.” | |
| 129 | - | |
| 130 | -## Testing strategy | |
| 131 | - | |
| 132 | -- unit coverage for: | |
| 133 | - - new public-shell helpers or reduced `Agent` shell behavior | |
| 134 | - - tightened completion-policy decisions | |
| 135 | - - persisted completion/stop decision summaries | |
| 136 | -- runtime coverage for: | |
| 137 | - - no regression in normal follow-through | |
| 138 | - - honest stop behavior where rescue logic was deleted or tightened | |
| 139 | - - session/status inspection of completion-policy decisions | |
| 140 | -- regression coverage for: | |
| 141 | - - no reintroduction of soft rescue phrasing after stop/failure conditions | |
| 142 | - - no drift back toward `agent/loop.py` owning runtime plumbing | |
| 143 | - - no loss of current explore/session/workflow inspection behavior while shell state shifts again | |
| 144 | - | |
| 145 | -## Definition of done | |
| 146 | - | |
| 147 | -- `agent/loop.py` shrinks again or becomes materially more facade-like even if line count does not fall dramatically | |
| 148 | -- Loader deletes or hardens more remaining completion/continuation heuristics instead of preserving them through softer wrappers | |
| 149 | -- operators can inspect more of the runtime’s completion-policy reasoning after the fact | |
| 150 | -- Sprint 17’s explicit-bootstrap and explore-operator gains remain green | |
| 151 | -- the parity baseline remains green after the Sprint 18 shell/policy tightening | |
| 152 | - | |
| 153 | -## Explicitly out of scope | |
| 154 | - | |
| 155 | -- full claw-code policy-engine parity | |
| 156 | -- multi-agent or team orchestration | |
| 157 | -- AST-aware semantic diffs | |
| 158 | -- a broad explore workflow UI | |
| 159 | -- broad permission-rule editing UX | |
| 160 | - | |
| 161 | -## Audit | |
| 162 | - | |
| 163 | -### Status | |
| 164 | - | |
| 165 | -- Sprint 18 is complete, and the audit is green. Loader now persists explicit completion-policy decisions and step traces, exposes them through the existing operator surfaces, and moves more shell glue out of `src/loader/agent/loop.py` into runtime-owned public-shell helpers. | |
| 166 | - | |
| 167 | -### Landed | |
| 168 | - | |
| 169 | -- completion-policy outcomes are now explicit runtime contract instead of hidden side effects: `src/loader/runtime/completion_policy.py`, `src/loader/runtime/turn_completion.py`, and `src/loader/runtime/finalization.py` now preserve reason-coded stop/continue outcomes such as `premature_completion_nudge`, `non_mutating_response_accepted`, `verification_passed`, and `verification_failed_reentry` | |
| 170 | -- session/runtime state now persists both the latest completion decision and a bounded per-turn completion trace through `src/loader/runtime/session.py`, `src/loader/runtime/events.py`, and the new `src/loader/runtime/completion_trace.py` | |
| 171 | -- completion-policy state is now inspectable from the product surface: `src/loader/runtime/inspection.py` and `src/loader/cli/main.py` show the latest completion decision in `loader status`, `loader session list`, and `loader session show`, while `loader session show` also prints the richer step-by-step completion trace | |
| 172 | -- resumed session state now restores completion decisions and traces through `src/loader/runtime/public_shell.py`, so inspection survives restart instead of only describing the active process | |
| 173 | -- the public shell is thinner again: `src/loader/runtime/public_shell.py` now owns a real `SteeringMailbox`, fresh/load session-install helpers, sync/async event-emitter normalization, and capability-refresh decision helpers | |
| 174 | -- `src/loader/agent/loop.py` now adopts those runtime-owned helpers instead of open-coding them; the file is down to 432 lines, steering no longer leaks across resume/clear lifecycle, and replacing a session now invalidates cached prompt state explicitly | |
| 175 | - | |
| 176 | -### Verification | |
| 177 | - | |
| 178 | -- `uv run pytest -q` is green: `342 passed` | |
| 179 | -- `tests/test_completion_policy.py`, `tests/test_turn_completion.py`, `tests/test_finalization.py`, `tests/test_session_state.py`, and `tests/test_inspection.py` now pin persisted completion decisions plus step-trace behavior directly | |
| 180 | -- `tests/test_runtime_public_shell.py` now covers steering mailbox behavior, fresh/load session-install helpers, sync/async event emitters, and capability-refresh helper behavior directly | |
| 181 | -- `tests/test_turn_preparation.py` now proves a new turn clears stale completion trace state before execution starts | |
| 182 | -- `tests/test_runtime_context.py`, `tests/test_runtime_bootstrap.py`, and `tests/test_runtime_launcher.py` stayed green while `Agent` adopted the slimmer public-shell helper set | |
| 183 | - | |
| 184 | -### Residual debt | |
| 185 | - | |
| 186 | -- `src/loader/agent/loop.py` is materially more facade-like than Sprint 17, but it still owns the public entrypoints themselves plus the remaining launcher/UI glue; Sprint 18 shrank the shell again, it did not eliminate it | |
| 187 | -- completion policy is now explicit and inspectable, but Loader still keeps bounded continuation heuristics for some non-mutating tasks; those nudges are no longer hidden, but they are still policy choices rather than hard-stop failures | |
| 188 | -- the new completion trace is intentionally compact and per-turn; Loader still does not have a richer long-horizon policy/debug timeline that merges completion, workflow, and repair evidence into one operator view | |
| 189 | -- Sprint 18 kept following claw-code and OMX as architectural guardrails, but Loader still stops short of claw-code's tighter policy-engine seams and OMX's deeper verifier/interview rigor | |
.docs/sprints/sprint19.mddeleted@@ -1,191 +0,0 @@ | ||
| 1 | -# Sprint 19: Facade Finalization, Continuation Hardening, and Unified Policy Timeline | |
| 2 | - | |
| 3 | -## Prerequisites | |
| 4 | - | |
| 5 | -Sprint 18 | |
| 6 | - | |
| 7 | -## Goals | |
| 8 | - | |
| 9 | -Take the next honest contraction after Sprint 18: finish thinning the public shell where it still reads like runtime glue, harden the last continuation heuristics that can still feel like soft rescue behavior, and unify the scattered policy/debug surfaces into one clearer operator story. | |
| 10 | - | |
| 11 | -Sprint 18 changed the shape of the remaining debt in a useful way: | |
| 12 | - | |
| 13 | -- completion policy is now explicit, persisted, and inspectable | |
| 14 | -- public-shell helpers now own steering, session install/load, event-emitter normalization, and capability-refresh decisions | |
| 15 | -- `src/loader/agent/loop.py` is materially thinner | |
| 16 | -- but `Agent` still owns the public entrypoints and launcher glue | |
| 17 | -- completion traces and workflow traces now both exist, but they are still separate operator surfaces | |
| 18 | -- Loader still keeps bounded continuation nudges for some non-mutating tasks, and those nudges are explicit now but not yet deeply justified | |
| 19 | - | |
| 20 | -Sprint 19 should stay reference-guided, not reference-submissive. | |
| 21 | - | |
| 22 | -The standard remains: | |
| 23 | - | |
| 24 | -- use claw-code to sharpen runtime seams, policy ownership, and explicit lifecycle contracts | |
| 25 | -- use OMX to sharpen follow-through, verifier pressure, and operator-facing runtime accountability | |
| 26 | -- do not add a feature just because the refs have it | |
| 27 | -- do pursue changes when the refs reveal that Loader is still too soft, too implicit, or too hard to audit | |
| 28 | - | |
| 29 | -`audit.txt` is still not the roadmap. It is useful only as a guardrail against sliding back into wrapper-heavy cleanup and soft model rescue behavior. | |
| 30 | - | |
| 31 | -The references for this sprint are: | |
| 32 | - | |
| 33 | -- `refs/claw-code/rust/crates/runtime/src/conversation.rs` | |
| 34 | -- `refs/claw-code/rust/crates/runtime/src/bootstrap.rs` | |
| 35 | -- `refs/claw-code/rust/crates/runtime/src/session_control.rs` | |
| 36 | -- `refs/claw-code/rust/crates/runtime/src/lane_events.rs` | |
| 37 | -- `refs/claw-code/rust/crates/runtime/src/policy_engine.rs` | |
| 38 | -- `refs/claw-code/rust/crates/runtime/src/green_contract.rs` | |
| 39 | -- `refs/claw-code/rust/crates/runtime/src/prompt.rs` | |
| 40 | -- `refs/claw-code/PARITY.md` | |
| 41 | -- `.docs/PARITY.md` | |
| 42 | -- `.docs/audit.txt` | |
| 43 | -- `.docs/audit_sprints/trunk_sitrep.md` | |
| 44 | -- `.docs/sprints/sprint18.md` | |
| 45 | -- `refs/oh-my-codex/src/autoresearch/contracts.ts` | |
| 46 | -- `refs/oh-my-codex/src/autoresearch/runtime.ts` | |
| 47 | -- `refs/oh-my-codex/src/verification/verifier.ts` | |
| 48 | -- `refs/oh-my-codex/src/hooks/session.ts` | |
| 49 | -- `refs/oh-my-codex/src/hooks/prompt-guidance-contract.ts` | |
| 50 | - | |
| 51 | -## Deliverables | |
| 52 | - | |
| 53 | -### 1. Finish the next public-shell contraction below `Agent` | |
| 54 | - | |
| 55 | -Sprint 18 moved more shell behavior into `src/loader/runtime/public_shell.py`, but `Agent` still owns the public entrypoint wrappers and some launch-time glue. | |
| 56 | - | |
| 57 | -Implementation targets: | |
| 58 | - | |
| 59 | -- inventory what remains in `src/loader/agent/loop.py` that still feels like runtime/public-shell plumbing instead of true public API ownership, especially: | |
| 60 | - - run / run_streaming / run_explore event-wrapper glue | |
| 61 | - - resume / clear lifecycle orchestration | |
| 62 | - - launcher construction and runtime-source preparation | |
| 63 | - - capability-refresh and prompt invalidation wiring | |
| 64 | -- move what is reusable into runtime-owned helpers or a tighter launcher/public-shell seam | |
| 65 | -- keep `Agent` focused on: | |
| 66 | - - public API shape | |
| 67 | - - compatibility-facing attributes | |
| 68 | - - minimal UI-facing integration points | |
| 69 | - | |
| 70 | -The goal is not “delete `Agent`.” The goal is for `Agent` to read like an intentionally tiny facade instead of a convenient place for leftover runtime glue. | |
| 71 | - | |
| 72 | -### 2. Harden the remaining continuation contract | |
| 73 | - | |
| 74 | -Sprint 18 made continuation behavior visible. Sprint 19 should decide which of that behavior is still too soft. | |
| 75 | - | |
| 76 | -Implementation targets: | |
| 77 | - | |
| 78 | -- inventory the remaining continuation behavior across: | |
| 79 | - - `src/loader/runtime/completion_policy.py` | |
| 80 | - - `src/loader/runtime/turn_completion.py` | |
| 81 | - - `src/loader/runtime/assistant_turns.py` | |
| 82 | - - any nearby repair/finalization controller that can still nudge rather than stop | |
| 83 | -- identify which continuation cases are still justified by explicit runtime evidence versus merely tolerated by textual heuristics | |
| 84 | -- prefer: | |
| 85 | - - deletion | |
| 86 | - - a stricter typed stop/fail state | |
| 87 | - - explicit follow-through requirements derived from runtime artifacts or session state | |
| 88 | - over keeping broad “continue once more” behavior | |
| 89 | -- where a continuation path remains, make the required evidence explicit and persisted | |
| 90 | - | |
| 91 | -The goal is to keep following the Sprint 13 / Sprint 17 / Sprint 18 line: the runtime should proceed for a clear typed reason or stop honestly, not continue because the model “probably meant well.” | |
| 92 | - | |
| 93 | -### 3. Unify completion, workflow, and repair accountability into one operator-facing timeline | |
| 94 | - | |
| 95 | -Loader now has workflow timeline entries and a separate completion trace. That is better than hidden state, but still fragmented. | |
| 96 | - | |
| 97 | -Implementation targets: | |
| 98 | - | |
| 99 | -- define a compact unified policy timeline or policy event model that can carry: | |
| 100 | - - workflow routing/handoff decisions | |
| 101 | - - completion-policy outcomes | |
| 102 | - - repair / retry / recovery decisions | |
| 103 | - - terminal stop reasons | |
| 104 | -- decide whether the existing workflow timeline should absorb completion/repair events or whether a sibling policy timeline is the cleaner contract | |
| 105 | -- persist enough of that state to survive resume and make post-mortem inspection more useful | |
| 106 | -- surface it through existing product seams, likely one of: | |
| 107 | - - `loader workflow show` | |
| 108 | - - `loader session show` | |
| 109 | - - a narrowly-scoped new policy-focused surface if and only if it is cleaner than overloading the workflow view | |
| 110 | - | |
| 111 | -The goal is that operators can answer “why did Loader keep going, stop, retry, or accept this result?” from one coherent surface instead of stitching together multiple tables by hand. | |
| 112 | - | |
| 113 | -### 4. Keep the ref relationship explicit and healthy | |
| 114 | - | |
| 115 | -Implementation targets: | |
| 116 | - | |
| 117 | -- use claw-code for: | |
| 118 | - - lane-event shape | |
| 119 | - - session/runtime control seams | |
| 120 | - - explicit policy and green-contract ownership | |
| 121 | -- use OMX for: | |
| 122 | - - verifier/follow-through accountability | |
| 123 | - - session/runtime operator clarity | |
| 124 | - - stronger prompt/runtime contract thinking | |
| 125 | -- do not add work just because the refs have it | |
| 126 | -- do add work when the refs reveal a real Loader weakness in: | |
| 127 | - - honesty | |
| 128 | - - inspectability | |
| 129 | - - shell minimalism | |
| 130 | - - follow-through | |
| 131 | - | |
| 132 | -The goal is to keep Loader reference-guided and self-aware, not to drift into either blind feature copying or isolated local optimization. | |
| 133 | - | |
| 134 | -## Testing strategy | |
| 135 | - | |
| 136 | -- unit coverage for: | |
| 137 | - - any new public-shell or launcher helper that further reduces `Agent` ownership | |
| 138 | - - tightened continuation/terminal-stop decisions | |
| 139 | - - unified policy timeline serialization and restoration | |
| 140 | -- runtime coverage for: | |
| 141 | - - no regression in normal follow-through on non-mutating and mutating tasks | |
| 142 | - - honest terminal behavior where continuation heuristics were deleted or narrowed | |
| 143 | - - session/workflow inspection of the unified policy/debug story | |
| 144 | -- regression coverage for: | |
| 145 | - - no drift back toward `agent/loop.py` owning extracted shell glue | |
| 146 | - - no silent reintroduction of soft continuation phrasing after harder stop conditions | |
| 147 | - - no loss of the current workflow/completion/explore inspection surfaces while timelines are unified | |
| 148 | - | |
| 149 | -## Definition of done | |
| 150 | - | |
| 151 | -- `agent/loop.py` shrinks again or becomes materially more facade-like even if line count only drops modestly | |
| 152 | -- Loader deletes or hardens more of the remaining continuation heuristics instead of merely explaining them better | |
| 153 | -- operators can inspect one more coherent runtime policy story after the fact | |
| 154 | -- Sprint 18’s completion-trace and public-shell gains remain green | |
| 155 | -- the parity baseline remains green after the Sprint 19 shell and policy tightening | |
| 156 | - | |
| 157 | -## Explicitly out of scope | |
| 158 | - | |
| 159 | -- full claw-code policy-engine parity | |
| 160 | -- multi-agent or team orchestration | |
| 161 | -- AST-aware semantic diffs | |
| 162 | -- a broad visual workflow UI | |
| 163 | -- rich permission-rule editing UX | |
| 164 | - | |
| 165 | -## Audit | |
| 166 | - | |
| 167 | -### Status | |
| 168 | - | |
| 169 | -- Sprint 19 is complete, and the audit is green. Loader now enforces a typed follow-through contract for non-mutating completion checks, fails honestly once that contract is still unsatisfied after the continuation budget is exhausted, and exposes a clearer unified policy story through workflow and session inspection. | |
| 170 | - | |
| 171 | -### Landed | |
| 172 | - | |
| 173 | -- the public shell contraction continued below `Agent`: `src/loader/runtime/public_shell.py` now owns the public run / run_streaming / run_explore entry helpers plus session-install, resume/reset, and capability-refresh helpers, and `src/loader/agent/loop.py` is down to 292 lines instead of remaining a mixed runtime shell | |
| 174 | -- non-mutating completion checks are now evidence-shaped instead of binary-only: `src/loader/runtime/task_completion.py` and `src/loader/runtime/completion_policy.py` derive explicit required and missing follow-through evidence, thread that through `TaskCompletionCheck`, and persist it through runtime policy decisions | |
| 175 | -- Loader now stops honestly when follow-through evidence is still missing after the continuation budget is exhausted: `src/loader/runtime/completion_policy.py` and `src/loader/runtime/turn_completion.py` finalize with an explicit failure response and decision code instead of silently accepting the response | |
| 176 | -- completion-policy evidence now survives inspection and resume: `src/loader/runtime/completion_trace.py`, `src/loader/runtime/policy_timeline.py`, `src/loader/runtime/session.py`, and `src/loader/runtime/turn_completion.py` persist evidence summaries on completion traces and unified workflow-policy timeline entries | |
| 177 | -- operators now get a clearer single policy story: `src/loader/runtime/inspection.py` and `src/loader/cli/main.py` add policy-focused workflow filtering through `loader workflow show --policy`, extend workflow kind filters to repair/completion entries, and show a `Policy Timeline` preview inside `loader session show` | |
| 178 | - | |
| 179 | -### Verification | |
| 180 | - | |
| 181 | -- `uv run pytest -q` is green: `359 passed` | |
| 182 | -- `tests/test_completion_policy.py`, `tests/test_turn_completion.py`, and `tests/test_session_state.py` now pin the typed follow-through contract plus the honest budget-exhausted finalize path and persisted evidence summaries | |
| 183 | -- `tests/test_inspection.py` now covers `loader workflow show --policy`, accountability-only timeline filtering, and the new policy-timeline preview in `loader session show` | |
| 184 | -- the broader runtime shell and launcher coverage stayed green while the shell contraction continued: `tests/test_runtime_public_shell.py`, `tests/test_runtime_harness.py`, and the existing status/session/workflow inspection coverage all remained green | |
| 185 | - | |
| 186 | -### Residual debt | |
| 187 | - | |
| 188 | -- `src/loader/agent/loop.py` is now much closer to a true facade, but it still exists as the public compatibility shell and still owns some user-facing API shape rather than disappearing entirely | |
| 189 | -- the unified operator story is better, but completion traces and the workflow timeline still remain separate persisted artifacts under the hood; Sprint 19 made them coherent to inspect, not identical contracts | |
| 190 | -- the new follow-through assessment is explicit and typed, but it is still heuristic and runtime-authored rather than driven by a deeper verifier/model contract like OMX | |
| 191 | -- Loader is now more honest and inspectable than Sprint 18, but it still stops short of claw-code's fuller policy engine and OMX's richer long-horizon verifier/interview rigor | |
.docs/sprints/sprint20.mddeleted@@ -1,196 +0,0 @@ | ||
| 1 | -# Sprint 20: Canonical Policy Events, Verifier-Backed Follow-Through, and Facade Settlement | |
| 2 | - | |
| 3 | -## Prerequisites | |
| 4 | - | |
| 5 | -Sprint 19 | |
| 6 | - | |
| 7 | -## Goals | |
| 8 | - | |
| 9 | -Take the next honest step after Sprint 19: stop treating policy accountability as a set of coordinated side channels, strengthen follow-through requirements with better runtime evidence, and decide what the public shell should actually remain responsible for now that most of the runtime owns itself. | |
| 10 | - | |
| 11 | -Sprint 19 improved the shape of the remaining debt again: | |
| 12 | - | |
| 13 | -- `src/loader/agent/loop.py` is now much smaller and more facade-like | |
| 14 | -- continuation behavior is more honest because missing follow-through can now end in explicit failure instead of silent acceptance | |
| 15 | -- operators can inspect policy accountability more coherently through `loader workflow show --policy` and the `Policy Timeline` preview in `loader session show` | |
| 16 | -- but completion traces and workflow timeline entries still coexist as separate persisted artifacts | |
| 17 | -- follow-through evidence is typed now, but it is still heuristic and runtime-authored rather than grounded in a stronger verifier/evidence contract | |
| 18 | -- Loader is much closer to an intentional public boundary, but it still has not explicitly settled whether the remaining `Agent` shell is the desired long-term seam or simply the next temporary contraction point | |
| 19 | - | |
| 20 | -Sprint 20 should keep using the references as architectural guardrails, not as a porting checklist. | |
| 21 | - | |
| 22 | -The standard remains: | |
| 23 | - | |
| 24 | -- use claw-code to sharpen canonical runtime ownership, policy/event contracts, and session/runtime seams | |
| 25 | -- use OMX to sharpen verifier-backed follow-through requirements, evidence accounting, and operator-facing accountability | |
| 26 | -- do not add work just because the refs have it | |
| 27 | -- do add work when the refs show that Loader is still too heuristic, too fragmented, or too hard to audit | |
| 28 | - | |
| 29 | -`audit.txt` remains a guardrail against backsliding into wrappers and soft rescue behavior. It is not the factual roadmap. | |
| 30 | - | |
| 31 | -The references for this sprint are: | |
| 32 | - | |
| 33 | -- `refs/claw-code/rust/crates/runtime/src/policy_engine.rs` | |
| 34 | -- `refs/claw-code/rust/crates/runtime/src/green_contract.rs` | |
| 35 | -- `refs/claw-code/rust/crates/runtime/src/lane_events.rs` | |
| 36 | -- `refs/claw-code/rust/crates/runtime/src/session_control.rs` | |
| 37 | -- `refs/claw-code/rust/crates/runtime/src/bootstrap.rs` | |
| 38 | -- `refs/claw-code/rust/crates/runtime/src/conversation.rs` | |
| 39 | -- `refs/claw-code/PARITY.md` | |
| 40 | -- `refs/oh-my-codex/src/verification/verifier.ts` | |
| 41 | -- `refs/oh-my-codex/src/autoresearch/contracts.ts` | |
| 42 | -- `refs/oh-my-codex/src/autoresearch/runtime.ts` | |
| 43 | -- `refs/oh-my-codex/src/hooks/session.ts` | |
| 44 | -- `refs/oh-my-codex/src/hooks/prompt-guidance-contract.ts` | |
| 45 | -- `.docs/PARITY.md` | |
| 46 | -- `.docs/audit.txt` | |
| 47 | -- `.docs/audit_sprints/trunk_sitrep.md` | |
| 48 | -- `.docs/sprints/sprint19.md` | |
| 49 | - | |
| 50 | -## Deliverables | |
| 51 | - | |
| 52 | -### 1. Define one canonical persisted policy-event contract | |
| 53 | - | |
| 54 | -Sprint 19 made the operator story clearer, but Loader still persists policy accountability in more than one shape. | |
| 55 | - | |
| 56 | -Implementation targets: | |
| 57 | - | |
| 58 | -- inventory the current persisted/debug policy surfaces across: | |
| 59 | - - `src/loader/runtime/completion_trace.py` | |
| 60 | - - `src/loader/runtime/policy_timeline.py` | |
| 61 | - - `src/loader/runtime/workflow_policy.py` | |
| 62 | - - `src/loader/runtime/session.py` | |
| 63 | - - `src/loader/runtime/events.py` | |
| 64 | -- decide what the canonical persisted policy/accountability artifact should be: | |
| 65 | - - extend the workflow timeline into the sole source of truth | |
| 66 | - - or introduce a first-class policy-event model and derive the old surfaces from it | |
| 67 | -- make sure the canonical event contract can express: | |
| 68 | - - workflow routing/handoff/reentry | |
| 69 | - - repair retries/failures | |
| 70 | - - completion accept/continue/finalize outcomes | |
| 71 | - - verify skips and explicit stop reasons | |
| 72 | - - evidence summaries and prompt/runtime context | |
| 73 | -- avoid preserving multiple peer artifacts unless one is clearly a compatibility/read-model projection of the other | |
| 74 | - | |
| 75 | -The goal is that Loader has one canonical answer to “what policy decisions happened during this turn/session?” instead of two overlapping persistence models that merely render similarly. | |
| 76 | - | |
| 77 | -### 2. Move follow-through requirements closer to a verifier-backed contract | |
| 78 | - | |
| 79 | -Sprint 19 added typed required/missing evidence. Sprint 20 should make that contract less purely heuristic. | |
| 80 | - | |
| 81 | -Implementation targets: | |
| 82 | - | |
| 83 | -- inventory where follow-through requirements are still inferred from weak textual cues across: | |
| 84 | - - `src/loader/runtime/task_completion.py` | |
| 85 | - - `src/loader/runtime/completion_policy.py` | |
| 86 | - - `src/loader/runtime/turn_completion.py` | |
| 87 | - - any nearby DoD/verification/session helpers that already know more than the completion heuristic does | |
| 88 | -- define a stronger runtime completion-requirements contract using available structured evidence such as: | |
| 89 | - - task class and workflow mode | |
| 90 | - - DoD state and acceptance criteria | |
| 91 | - - verification plans and prior verification results | |
| 92 | - - actual tool history, artifact paths, and session/runtime evidence | |
| 93 | -- prefer explicit requirements like: | |
| 94 | - - “verification command ran and passed” | |
| 95 | - - “mutating touchpoint is accounted for in the active artifact set” | |
| 96 | - - “claimed result is backed by observed output” | |
| 97 | - over generic “probably incomplete” heuristics | |
| 98 | -- where the runtime still cannot prove completion, stop honestly and preserve the exact missing requirement set | |
| 99 | - | |
| 100 | -The goal is to move closer to claw-code’s green-contract thinking and OMX’s verifier/accountability discipline without forcing Loader into a fake model-assisted verifier that it cannot honestly support yet. | |
| 101 | - | |
| 102 | -### 3. Settle the intended long-term public shell boundary | |
| 103 | - | |
| 104 | -By Sprint 19, `Agent` is much smaller. Sprint 20 should decide what remains on purpose. | |
| 105 | - | |
| 106 | -Implementation targets: | |
| 107 | - | |
| 108 | -- inventory the current responsibilities still living in `src/loader/agent/loop.py` | |
| 109 | -- identify which of those are: | |
| 110 | - - true public API / compatibility surface | |
| 111 | - - UI integration seam | |
| 112 | - - leftover runtime or launcher ownership | |
| 113 | -- move remaining runtime-ish behavior below the public shell where that is still obviously correct | |
| 114 | -- explicitly document what stays in `Agent` and why, so future sprints stop treating the shell as a vague cleanup target | |
| 115 | - | |
| 116 | -The goal is not “delete the public shell.” The goal is to stop having an ambiguous shell. | |
| 117 | - | |
| 118 | -### 4. Make the operator-facing accountability story sharper without multiplying product surfaces | |
| 119 | - | |
| 120 | -Sprint 19 improved inspection, but there is still room to reduce cognitive stitching. | |
| 121 | - | |
| 122 | -Implementation targets: | |
| 123 | - | |
| 124 | -- improve the existing policy-facing views so operators can answer: | |
| 125 | - - why did Loader continue? | |
| 126 | - - why did Loader stop? | |
| 127 | - - what evidence was still missing? | |
| 128 | - - what policy stage made the final decision? | |
| 129 | -- prefer improving: | |
| 130 | - - `loader workflow show` | |
| 131 | - - `loader session show` | |
| 132 | - - `loader status` | |
| 133 | - over inventing a brand-new inspection command unless a new surface is clearly cleaner | |
| 134 | -- where useful, add concise rollup/highlight views that summarize the last important policy event rather than requiring the user to parse the whole timeline manually | |
| 135 | - | |
| 136 | -The goal is to make Loader easier to audit after the fact, not simply more verbose. | |
| 137 | - | |
| 138 | -## Testing strategy | |
| 139 | - | |
| 140 | -- unit coverage for: | |
| 141 | - - the canonical policy-event serialization/restoration contract | |
| 142 | - - verifier-backed completion-requirement derivation | |
| 143 | - - any reduced/settled public-shell boundary helpers | |
| 144 | -- runtime coverage for: | |
| 145 | - - honest finalization when runtime evidence still fails the completion contract | |
| 146 | - - no regression in successful follow-through on normal non-mutating and mutating tasks | |
| 147 | - - session/workflow/status inspection of the canonical policy story | |
| 148 | -- regression coverage for: | |
| 149 | - - no drift back toward duplicate persisted policy artifacts without a canonical source of truth | |
| 150 | - - no reintroduction of soft continuation acceptance after missing-evidence finalization | |
| 151 | - - no drift back toward `agent/loop.py` accumulating runtime ownership because the shell boundary is still ambiguous | |
| 152 | - | |
| 153 | -## Definition of done | |
| 154 | - | |
| 155 | -- Loader has one clearly canonical persisted policy/accountability contract | |
| 156 | -- follow-through requirements are more explicitly grounded in runtime/verifier evidence instead of mostly textual heuristics | |
| 157 | -- the remaining public-shell responsibilities are materially smaller or explicitly settled on purpose | |
| 158 | -- operators can answer the main stop/continue/retry questions from one clearer set of existing inspection surfaces | |
| 159 | -- Sprint 19’s honesty and policy-inspection gains remain green | |
| 160 | - | |
| 161 | -## Explicitly out of scope | |
| 162 | - | |
| 163 | -- full claw-code policy-engine parity | |
| 164 | -- model-authored verifier narratives as a new mandatory runtime dependency | |
| 165 | -- multi-agent or team orchestration | |
| 166 | -- AST-aware semantic diffs | |
| 167 | -- a broad visual workflow UI | |
| 168 | -- rich permission-rule editing UX | |
| 169 | - | |
| 170 | -## Audit | |
| 171 | - | |
| 172 | -### Status | |
| 173 | - | |
| 174 | -- Sprint 20 is complete, and the audit is green. Loader now treats the workflow timeline as the canonical policy/accountability artifact, grounds more follow-through decisions in runtime verification state, and makes the remaining `Agent` shell boundary explicit in both code and tests. | |
| 175 | - | |
| 176 | -### Landed | |
| 177 | - | |
| 178 | -- canonical policy/accountability ownership is tighter now: `src/loader/runtime/session.py`, `src/loader/runtime/completion_trace.py`, and `src/loader/runtime/turn_completion.py` now project live completion traces from the canonical workflow timeline instead of treating completion trace writes as a peer runtime artifact | |
| 179 | -- follow-through checks now consume stronger runtime evidence: `src/loader/runtime/task_completion.py`, `src/loader/runtime/completion_policy.py`, and `src/loader/runtime/turn_completion.py` now use DoD verification commands, prior verification results, successful verification evidence, and tracked pending items to decide whether a non-mutating turn can honestly stop | |
| 180 | -- operator accountability is sharper without adding new commands: `src/loader/runtime/inspection.py` and `src/loader/cli/main.py` now surface a one-line latest-policy rollup in `loader status` and `loader session show`, sourced from the canonical workflow timeline rather than an ad hoc side channel | |
| 181 | -- the public shell boundary is now settled on purpose instead of merely smaller by accident: `src/loader/runtime/public_shell.py` owns prompt-mode resolution, prompt-cache invalidation, and owner-bound system/few-shot construction, while `src/loader/agent/loop.py` is down to 267 lines and explicitly documented as the public facade over runtime-owned launch/session helpers | |
| 182 | -- the shell boundary is also pinned by proof now: `tests/test_runtime_public_shell.py` covers owner-based prompt/few-shot construction and workflow-mode invalidation, while `tests/test_compat_boundaries.py` now guards `agent/loop.py` against drifting back to direct runtime-controller imports | |
| 183 | - | |
| 184 | -### Verification | |
| 185 | - | |
| 186 | -- `uv run pytest -q` is green: `372 passed` | |
| 187 | -- `tests/test_completion_policy.py`, `tests/test_turn_completion.py`, and `tests/test_session_state.py` now pin verifier-backed follow-through decisions plus live canonical completion-trace projection | |
| 188 | -- `tests/test_inspection.py` now covers the latest-policy rollup in the existing status/session surfaces | |
| 189 | -- `tests/test_runtime_public_shell.py`, `tests/test_runtime_bootstrap.py`, and `tests/test_compat_boundaries.py` now cover the settled public-shell boundary directly | |
| 190 | - | |
| 191 | -### Residual debt | |
| 192 | - | |
| 193 | -- the workflow timeline is now the canonical policy artifact, but completion traces still remain as a compatibility/read-model surface because status/session inspection still benefits from a compact per-turn view | |
| 194 | -- follow-through is more verifier-backed than Sprint 19, but it is still runtime-authored and heuristic in places; Loader still does not have OMX-style deeper verifier reasoning or richer artifact-derived proof models | |
| 195 | -- `src/loader/agent/loop.py` is now explicitly the public facade, but Loader still keeps that compatibility shell instead of collapsing to a narrower runtime-first API | |
| 196 | -- Loader is more coherent and auditable than Sprint 19, but it still stops short of claw-code's fuller policy engine, richer rule/prompt policy surfaces, and OMX's deeper interview/verifier rigor | |
.docs/sprints/sprint21.mddeleted@@ -1,187 +0,0 @@ | ||
| 1 | -# Sprint 21: Evidence Provenance, Read-Model Cleanup, and Runtime-First API | |
| 2 | - | |
| 3 | -## Prerequisites | |
| 4 | - | |
| 5 | -Sprint 20 | |
| 6 | - | |
| 7 | -## Goals | |
| 8 | - | |
| 9 | -Take the next honest step after Sprint 20: move Loader's completion and verification story from "better heuristics with canonical policy events" toward stronger evidence provenance, reduce compatibility read-model duplication where the canonical workflow timeline already carries the truth, and begin narrowing internal callers toward a runtime-first API instead of treating `Agent` as the only natural seam. | |
| 10 | - | |
| 11 | -Sprint 20 changed the remaining debt in a useful way: | |
| 12 | - | |
| 13 | -- the workflow timeline is now the canonical policy/accountability artifact, including live completion-trace projection | |
| 14 | -- follow-through checks now use stronger DoD/runtime evidence instead of only textual heuristics | |
| 15 | -- the remaining `Agent` shell is explicitly documented and guarded as a public facade | |
| 16 | -- but completion/verification evidence is still mostly flattened into human-readable strings rather than typed provenance | |
| 17 | -- compatibility/read-model surfaces still rely on a few projections that are honest but not yet minimal | |
| 18 | -- internal callers still treat `Agent` as the default runtime entry seam even though the shell is now explicitly a compatibility/public facade | |
| 19 | - | |
| 20 | -Sprint 21 should keep using the references as architectural guardrails, not as a feature-copy list. | |
| 21 | - | |
| 22 | -The standard remains: | |
| 23 | - | |
| 24 | -- use claw-code to sharpen canonical event ownership, green-contract discipline, and runtime-first seams | |
| 25 | -- use OMX to sharpen verifier/accountability provenance and evidence-backed follow-through | |
| 26 | -- do not add work just because the refs have it | |
| 27 | -- do add work when the refs show that Loader is still too stringly-typed, too duplicative, or too dependent on a compatibility shell | |
| 28 | - | |
| 29 | -`audit.txt` remains a guardrail against wrapper-heavy drift and soft rescue behavior. It is not the factual roadmap. | |
| 30 | - | |
| 31 | -The references for this sprint are: | |
| 32 | - | |
| 33 | -- `refs/claw-code/rust/crates/runtime/src/policy_engine.rs` | |
| 34 | -- `refs/claw-code/rust/crates/runtime/src/green_contract.rs` | |
| 35 | -- `refs/claw-code/rust/crates/runtime/src/lane_events.rs` | |
| 36 | -- `refs/claw-code/rust/crates/runtime/src/session_control.rs` | |
| 37 | -- `refs/claw-code/rust/crates/runtime/src/bootstrap.rs` | |
| 38 | -- `refs/claw-code/rust/crates/runtime/src/conversation.rs` | |
| 39 | -- `refs/claw-code/PARITY.md` | |
| 40 | -- `refs/oh-my-codex/src/verification/verifier.ts` | |
| 41 | -- `refs/oh-my-codex/src/autoresearch/contracts.ts` | |
| 42 | -- `refs/oh-my-codex/src/autoresearch/runtime.ts` | |
| 43 | -- `refs/oh-my-codex/src/hooks/session.ts` | |
| 44 | -- `refs/oh-my-codex/src/hooks/prompt-guidance-contract.ts` | |
| 45 | -- `.docs/PARITY.md` | |
| 46 | -- `.docs/audit.txt` | |
| 47 | -- `.docs/audit_sprints/trunk_sitrep.md` | |
| 48 | -- `.docs/sprints/sprint20.md` | |
| 49 | - | |
| 50 | -## Deliverables | |
| 51 | - | |
| 52 | -### 1. Introduce typed evidence provenance for completion and verification | |
| 53 | - | |
| 54 | -Sprint 20 strengthened follow-through, but most of the contract still collapses into free-text evidence summaries too early. | |
| 55 | - | |
| 56 | -Implementation targets: | |
| 57 | - | |
| 58 | -- inventory where completion/verification evidence is currently flattened into strings across: | |
| 59 | - - `src/loader/runtime/task_completion.py` | |
| 60 | - - `src/loader/runtime/completion_trace.py` | |
| 61 | - - `src/loader/runtime/policy_timeline.py` | |
| 62 | - - `src/loader/runtime/finalization.py` | |
| 63 | - - `src/loader/runtime/dod.py` | |
| 64 | - - `src/loader/runtime/workflow_policy.py` | |
| 65 | -- define a small typed provenance model that can represent things like: | |
| 66 | - - verification command ran and passed/failed | |
| 67 | - - verification command was still missing | |
| 68 | - - tracked work item remained incomplete | |
| 69 | - - artifact/touchpoint evidence existed or was contradicted | |
| 70 | - - claimed runtime outcome was backed by observed output | |
| 71 | -- prefer structured provenance that can still be rendered into human-readable summaries, instead of making strings the primary contract | |
| 72 | -- thread that provenance through completion policy and canonical policy events where it materially improves honesty or inspectability | |
| 73 | - | |
| 74 | -The goal is not to build a fake theorem prover. The goal is to stop throwing away runtime evidence structure too early. | |
| 75 | - | |
| 76 | -### 2. Reduce read-model duplication around the canonical workflow timeline | |
| 77 | - | |
| 78 | -Sprint 20 made the workflow timeline canonical, but a few read models still feel more coupled than they need to be. | |
| 79 | - | |
| 80 | -Implementation targets: | |
| 81 | - | |
| 82 | -- inventory where compatibility/read-model projections still depend on direct mutation or duplicated logic across: | |
| 83 | - - `src/loader/runtime/completion_trace.py` | |
| 84 | - - `src/loader/runtime/session.py` | |
| 85 | - - `src/loader/runtime/inspection.py` | |
| 86 | - - `src/loader/runtime/events.py` | |
| 87 | - - any nearby status/session helper that reconstructs policy state manually | |
| 88 | -- make sure projections like completion traces and latest-policy summaries are clearly derivations from canonical policy events instead of semi-independent contracts | |
| 89 | -- remove any remaining direct writes or state bookkeeping that are only there to keep parallel policy read models in sync | |
| 90 | -- keep compact operator-facing read models where they help, but make their derived nature explicit in code and tests | |
| 91 | - | |
| 92 | -The goal is one canonical truth plus honest projections, not a forest of near-duplicates. | |
| 93 | - | |
| 94 | -### 3. Start the runtime-first internal API transition below the public `Agent` facade | |
| 95 | - | |
| 96 | -Sprint 20 settled `Agent` as the public compatibility shell. Sprint 21 should stop using that shell as the default internal seam where it no longer needs to be. | |
| 97 | - | |
| 98 | -Implementation targets: | |
| 99 | - | |
| 100 | -- inventory current internal call sites that still instantiate or consume `Agent` when a runtime-first seam would be cleaner, especially in: | |
| 101 | - - launcher/bootstrap helpers | |
| 102 | - - CLI/TUI integration code | |
| 103 | - - tests that are really exercising runtime behavior rather than public compatibility | |
| 104 | -- define a small runtime-first entry contract for internal consumers where it clearly reduces shell coupling | |
| 105 | -- keep `Agent` as the public compatibility surface, but begin migrating internal runtime-oriented callers away from assuming that `Agent` is the only valid execution owner | |
| 106 | -- document what remains intentionally public-shell-only versus what is now runtime-first | |
| 107 | - | |
| 108 | -The goal is not to delete `Agent`. The goal is to make `Agent` clearly public/compatibility-facing while runtime internals use runtime-first seams by default. | |
| 109 | - | |
| 110 | -### 4. Sharpen operator visibility for evidence-backed stop/continue decisions | |
| 111 | - | |
| 112 | -Sprint 20 improved policy summaries, but the evidence itself is still only partially visible. | |
| 113 | - | |
| 114 | -Implementation targets: | |
| 115 | - | |
| 116 | -- improve the existing operator views so users can answer: | |
| 117 | - - what exact evidence was missing when Loader stopped? | |
| 118 | - - what exact evidence satisfied the completion contract? | |
| 119 | - - which policy event carried that evidence? | |
| 120 | -- prefer improving: | |
| 121 | - - `loader workflow show` | |
| 122 | - - `loader session show` | |
| 123 | - - `loader status` | |
| 124 | - over inventing a new command unless a new surface is clearly cleaner | |
| 125 | -- add concise rollups first, and expose deeper provenance only where it materially helps post-mortem inspection | |
| 126 | - | |
| 127 | -The goal is to make Loader easier to audit after the fact, not simply more verbose. | |
| 128 | - | |
| 129 | -## Testing strategy | |
| 130 | - | |
| 131 | -- unit coverage for: | |
| 132 | - - typed evidence-provenance normalization/rendering | |
| 133 | - - derived read-model projections from the canonical workflow timeline | |
| 134 | - - any new runtime-first internal entry contract below `Agent` | |
| 135 | -- runtime coverage for: | |
| 136 | - - honest finalization with explicit evidence provenance when completion still fails | |
| 137 | - - successful completion paths that now surface structured proof instead of only summary strings | |
| 138 | - - status/session/workflow inspection of the evidence-backed policy story | |
| 139 | -- regression coverage for: | |
| 140 | - - no drift back toward peer policy artifacts beside the canonical workflow timeline | |
| 141 | - - no drift back toward `Agent` as the default internal seam when a runtime-first contract exists | |
| 142 | - - no loss of the current compact operator read models while provenance becomes richer | |
| 143 | - | |
| 144 | -## Definition of done | |
| 145 | - | |
| 146 | -- Loader preserves one canonical policy/accountability artifact while making evidence provenance more structured | |
| 147 | -- completion/verification evidence is less stringly-typed and more inspectable without weakening honesty | |
| 148 | -- internal runtime-oriented code has at least one cleaner runtime-first seam below the public `Agent` facade | |
| 149 | -- existing status/session/workflow surfaces answer stop/continue questions with clearer evidence context | |
| 150 | -- Sprint 20's canonical-policy and facade-settlement gains remain green | |
| 151 | - | |
| 152 | -## Explicitly out of scope | |
| 153 | - | |
| 154 | -- full claw-code policy-engine parity | |
| 155 | -- model-authored verifier narratives as a mandatory dependency | |
| 156 | -- multi-agent or team orchestration | |
| 157 | -- AST-aware semantic diffs | |
| 158 | -- a broad visual workflow UI | |
| 159 | -- rich permission-rule editing UX | |
| 160 | - | |
| 161 | -## Audit | |
| 162 | - | |
| 163 | -### Status | |
| 164 | - | |
| 165 | -- Sprint 21 is complete, and the audit is green. Loader now carries typed evidence provenance through canonical policy events, derives more of its operator/accountability story from one shared workflow-timeline read model, and has a real runtime-first internal owner below the public `Agent` facade. | |
| 166 | - | |
| 167 | -### Landed | |
| 168 | - | |
| 169 | -- evidence provenance is now a stronger first-class contract instead of a mostly flattened string path: `src/loader/runtime/evidence_provenance.py`, `src/loader/runtime/task_completion.py`, `src/loader/runtime/completion_policy.py`, `src/loader/runtime/turn_completion.py`, `src/loader/runtime/finalization.py`, `src/loader/runtime/policy_timeline.py`, and `src/loader/runtime/completion_trace.py` now preserve typed support/missing/contradiction context through canonical policy events and projected completion traces | |
| 170 | -- canonical read-model duplication is lower: `src/loader/runtime/workflow_timeline_read_model.py` now owns shared policy projections, grouped evidence rollups, latest-policy summaries, and operator highlights instead of scattering that logic across inspection surfaces | |
| 171 | -- the runtime-first internal API transition is real now, not just planned: `src/loader/runtime/runtime_handle.py` provides a runtime-owned owner below `Agent`, and runtime-oriented tests in `tests/test_runtime_handle.py`, `tests/test_runtime_launcher.py`, `tests/test_turn_preparation.py`, and `tests/test_runtime_public_shell.py` now exercise launcher/bootstrap/public-shell behavior without assuming the public compatibility facade is the only valid execution owner | |
| 172 | -- operator visibility is sharper without adding new product surfaces: `src/loader/runtime/inspection.py` and `src/loader/cli/main.py` now show latest policy evidence rollups in `loader status`, `loader session show`, and `loader workflow show`, including concise “needed” vs “satisfied” evidence summaries derived from the canonical workflow timeline | |
| 173 | - | |
| 174 | -### Verification | |
| 175 | - | |
| 176 | -- `uv run pytest -q` is green: `380 passed` | |
| 177 | -- `tests/test_evidence_provenance.py`, `tests/test_completion_policy.py`, `tests/test_turn_completion.py`, `tests/test_session_state.py`, and `tests/test_inspection.py` now pin typed provenance through completion/finalization plus persisted operator inspection | |
| 178 | -- `tests/test_workflow_timeline_read_model.py` now covers grouped supporting/missing evidence rollups from canonical policy events | |
| 179 | -- `tests/test_runtime_handle.py`, `tests/test_runtime_launcher.py`, `tests/test_turn_preparation.py`, and `tests/test_runtime_public_shell.py` now cover the new runtime-first owner seam directly | |
| 180 | -- `tests/test_compat_boundaries.py` remains green, including the runtime import-boundary guard after the new handle landed | |
| 181 | - | |
| 182 | -### Residual debt | |
| 183 | - | |
| 184 | -- Loader now has a runtime-first internal owner, but the public `Agent` shell still exists as the outer compatibility API and is still used by many public-surface and end-to-end tests | |
| 185 | -- evidence provenance is more structured and inspectable, but it is still runtime-authored and bounded; Loader still does not have OMX-style deeper verifier reasoning, richer artifact-derived proof, or model-assisted audit narratives | |
| 186 | -- the workflow timeline read model is cleaner, but Loader still keeps compact derived surfaces like completion traces and latest-policy summaries because operators benefit from them | |
| 187 | -- the new policy-evidence rollups make stop/continue decisions easier to inspect, but Loader still stops short of claw-code's fuller policy engine, richer rule surfaces, and OMX's deeper interview/verifier rigor | |
.docs/sprints/sprint22.mddeleted@@ -1,191 +0,0 @@ | ||
| 1 | -# Sprint 22: Runtime Entry API, Verification Observations, and Compatibility Narrowing | |
| 2 | - | |
| 3 | -## Prerequisites | |
| 4 | - | |
| 5 | -Sprint 21 | |
| 6 | - | |
| 7 | -## Goals | |
| 8 | - | |
| 9 | -Take the next honest step after Sprint 21: stop treating the new runtime-first owner as only a testing seam, move verification/accountability closer to the moment verification actually happens, and keep narrowing the public compatibility shell without pretending Loader is ready to delete `Agent`. | |
| 10 | - | |
| 11 | -Sprint 21 changed the remaining debt in a useful way: | |
| 12 | - | |
| 13 | -- Loader now has a runtime-owned internal handle and no longer needs `Agent` for some runtime-oriented tests | |
| 14 | -- policy/accountability surfaces can now show grouped supporting vs missing evidence instead of flattening everything into one summary string | |
| 15 | -- the workflow timeline read model is more canonical and less duplicative | |
| 16 | -- but `Agent` still remains the default construction seam for most real integrations | |
| 17 | -- verification evidence is still largely reconstructed from DoD/session state after the fact rather than captured as a first-class observation at execution time | |
| 18 | -- evidence provenance is richer, but it is still bounded and runtime-authored; Loader still cannot always answer which observed verification attempt or artifact actually justified the final stop/continue decision | |
| 19 | - | |
| 20 | -Sprint 22 should keep using the references as architectural guardrails, not as a feature-copy list. | |
| 21 | - | |
| 22 | -The standard remains: | |
| 23 | - | |
| 24 | -- use claw-code to sharpen runtime-first entry seams, session/bootstrap ownership, and event accountability | |
| 25 | -- use OMX to sharpen verifier-backed evidence capture and auditability around successful vs missing proof | |
| 26 | -- do not add work just because the refs have it | |
| 27 | -- do add work when the refs show that Loader is still too shell-bound, too post-hoc, or too hard to audit honestly | |
| 28 | - | |
| 29 | -`audit.txt` remains a guardrail against wrapper-heavy drift and fake rescue behavior. It is not the factual roadmap. | |
| 30 | - | |
| 31 | -The references for this sprint are: | |
| 32 | - | |
| 33 | -- `refs/claw-code/rust/crates/runtime/src/bootstrap.rs` | |
| 34 | -- `refs/claw-code/rust/crates/runtime/src/session_control.rs` | |
| 35 | -- `refs/claw-code/rust/crates/runtime/src/conversation.rs` | |
| 36 | -- `refs/claw-code/rust/crates/runtime/src/policy_engine.rs` | |
| 37 | -- `refs/claw-code/rust/crates/runtime/src/lane_events.rs` | |
| 38 | -- `refs/claw-code/rust/crates/runtime/src/green_contract.rs` | |
| 39 | -- `refs/claw-code/PARITY.md` | |
| 40 | -- `refs/oh-my-codex/src/verification/verifier.ts` | |
| 41 | -- `refs/oh-my-codex/src/autoresearch/contracts.ts` | |
| 42 | -- `refs/oh-my-codex/src/autoresearch/runtime.ts` | |
| 43 | -- `refs/oh-my-codex/src/hooks/session.ts` | |
| 44 | -- `.docs/PARITY.md` | |
| 45 | -- `.docs/audit.txt` | |
| 46 | -- `.docs/audit_sprints/trunk_sitrep.md` | |
| 47 | -- `.docs/sprints/sprint21.md` | |
| 48 | - | |
| 49 | -## Deliverables | |
| 50 | - | |
| 51 | -### 1. Promote the runtime-first entry contract beyond test-only use | |
| 52 | - | |
| 53 | -Sprint 21 proved that `RuntimeHandle` is a valid internal owner. Sprint 22 should use that seam in more real integration paths where `Agent` is only serving as an unnecessary wrapper. | |
| 54 | - | |
| 55 | -Implementation targets: | |
| 56 | - | |
| 57 | -- inventory internal call sites that still instantiate `Agent` by default even though they are really consuming runtime-owned behavior, especially around: | |
| 58 | - - launcher/bootstrap helpers | |
| 59 | - - interactive CLI/TUI integration seams | |
| 60 | - - harnesses and utilities that are not actually testing public compatibility | |
| 61 | -- define a small runtime-first entry contract for internal integrations that need: | |
| 62 | - - runtime bootstrap/session ownership | |
| 63 | - - launcher/public-shell execution | |
| 64 | - - inspection or session continuity hooks | |
| 65 | -- migrate a bounded but real set of internal callers to that runtime-first seam | |
| 66 | -- explicitly document what remains intentionally public-shell-only versus what is now runtime-first by default | |
| 67 | - | |
| 68 | -The goal is not to delete `Agent`. The goal is to make `Agent` more clearly public/compatibility-facing while real internal integrations stop depending on it by habit. | |
| 69 | - | |
| 70 | -### 2. Introduce typed verification observations closer to execution time | |
| 71 | - | |
| 72 | -Loader is better at explaining missing evidence now, but it still often reconstructs that explanation after the fact from session and DoD state. | |
| 73 | - | |
| 74 | -Implementation targets: | |
| 75 | - | |
| 76 | -- inventory where verification evidence is currently inferred or flattened after execution across: | |
| 77 | - - `src/loader/runtime/dod.py` | |
| 78 | - - `src/loader/runtime/finalization.py` | |
| 79 | - - `src/loader/runtime/task_completion.py` | |
| 80 | - - `src/loader/runtime/completion_policy.py` | |
| 81 | - - `src/loader/runtime/session.py` | |
| 82 | - - `src/loader/runtime/workflow_policy.py` | |
| 83 | -- define a typed verification-observation contract that can represent things like: | |
| 84 | - - verification command requested | |
| 85 | - - verification command ran | |
| 86 | - - verification command passed or failed | |
| 87 | - - verification command output backed or contradicted a claimed result | |
| 88 | - - verification attempt was skipped, stale, or still missing | |
| 89 | - - observed artifact/touchpoint evidence that materially supported or blocked completion | |
| 90 | -- decide whether those observations should live directly in the canonical workflow timeline, as a derived session sub-artifact, or as a canonical companion model with a single clear ownership story | |
| 91 | -- avoid inventing a second peer truth beside the canonical policy story | |
| 92 | - | |
| 93 | -The goal is to move Loader’s accountability closer to “this is what was actually observed” rather than “this is what the runtime later summarized.” | |
| 94 | - | |
| 95 | -### 3. Strengthen stop/continue proof using observed verification and artifact evidence | |
| 96 | - | |
| 97 | -Sprint 21 made evidence more structured, but it still leaves some stop/continue decisions too dependent on runtime-authored summaries. | |
| 98 | - | |
| 99 | -Implementation targets: | |
| 100 | - | |
| 101 | -- connect completion/finalization decisions more directly to: | |
| 102 | - - typed verification observations | |
| 103 | - - active DoD acceptance state | |
| 104 | - - tracked pending items | |
| 105 | - - observed artifact/touchpoint evidence | |
| 106 | - - contradictions already captured in the workflow ledger or drift evidence | |
| 107 | -- prefer explicit proof stories like: | |
| 108 | - - “verification command X passed and covers acceptance boundary Y” | |
| 109 | - - “artifact Z was updated and verified against the planned touchpoint set” | |
| 110 | - - “completion still failed because verification command X never ran / failed / contradicted the claim” | |
| 111 | - over broader fallback summaries when the runtime has stronger evidence available | |
| 112 | -- where Loader still cannot prove completion, preserve the exact missing or contradictory observation set in the policy/accountability story | |
| 113 | - | |
| 114 | -The goal is not to build a deep theorem prover. The goal is to keep pushing Loader from heuristic completion toward evidence-backed completion without bluffing. | |
| 115 | - | |
| 116 | -### 4. Sharpen operator inspection around observed verification state | |
| 117 | - | |
| 118 | -Sprint 21 made policy evidence easier to read. Sprint 22 should make observed verification attempts easier to audit from the same existing product surfaces. | |
| 119 | - | |
| 120 | -Implementation targets: | |
| 121 | - | |
| 122 | -- improve the existing operator views so users can answer: | |
| 123 | - - which verification command last ran? | |
| 124 | - - did it pass, fail, or never run? | |
| 125 | - - what output or observed artifact actually backed the stop/continue decision? | |
| 126 | - - what evidence is still missing versus already satisfied? | |
| 127 | -- prefer improving: | |
| 128 | - - `loader status` | |
| 129 | - - `loader session show` | |
| 130 | - - `loader workflow show` | |
| 131 | - over inventing a new command unless a new surface is clearly cleaner | |
| 132 | -- keep concise rollups first, and expose deeper verification/observation detail only where it materially improves post-mortem debugging | |
| 133 | - | |
| 134 | -The goal is to make Loader easier to audit after the fact, not simply more verbose. | |
| 135 | - | |
| 136 | -## Testing strategy | |
| 137 | - | |
| 138 | -- unit coverage for: | |
| 139 | - - runtime-first entry helpers adopted below `Agent` | |
| 140 | - - typed verification-observation normalization and persistence | |
| 141 | - - derived policy/accountability projections from observed verification state | |
| 142 | -- runtime coverage for: | |
| 143 | - - successful completion with explicit observed verification backing | |
| 144 | - - failed or missing verification that now produces a more concrete stop/continue story | |
| 145 | - - internal integration paths that now use the runtime-first entry contract | |
| 146 | -- regression coverage for: | |
| 147 | - - no drift back toward `Agent` as the default internal seam when a runtime-first contract exists | |
| 148 | - - no duplicate truth beside the canonical policy/accountability story | |
| 149 | - - no regression in Sprint 21’s evidence provenance, policy rollups, or runtime-handle contract | |
| 150 | - | |
| 151 | -## Definition of done | |
| 152 | - | |
| 153 | -- Loader has at least one more real internal integration path using a runtime-first entry contract below `Agent` | |
| 154 | -- verification/accountability captures more first-class observed state instead of only post-hoc summaries | |
| 155 | -- stop/continue decisions can point to clearer observed proof or missing proof | |
| 156 | -- existing status/session/workflow surfaces expose that stronger verification/accountability story without multiplying commands | |
| 157 | -- Sprint 21’s runtime-handle, provenance, and grouped policy-evidence gains remain green | |
| 158 | - | |
| 159 | -## Explicitly out of scope | |
| 160 | - | |
| 161 | -- deleting `Agent` as the public compatibility surface | |
| 162 | -- full claw-code policy-engine parity | |
| 163 | -- model-authored verifier narratives as a required runtime dependency | |
| 164 | -- AST-aware semantic diffs | |
| 165 | -- a broad visual workflow UI | |
| 166 | -- multi-agent or team orchestration | |
| 167 | - | |
| 168 | -## Audit | |
| 169 | - | |
| 170 | -### Status | |
| 171 | - | |
| 172 | -- Sprint 22 is complete on the verification-observation and accountability lane, and the audit is green. Loader now captures typed verification observations closer to execution, carries them through the canonical policy story, and exposes that observed verification state directly in the existing operator surfaces. | |
| 173 | -- The planned runtime-first entry promotion beyond test-only use did not land in this sprint. That debt is now an explicit Sprint 23 carry-forward item, not an implied cleanup tail. | |
| 174 | - | |
| 175 | -### Landed | |
| 176 | - | |
| 177 | -- verification observations are now a first-class runtime contract instead of a reconstructed afterthought: `src/loader/runtime/verification_observations.py`, `src/loader/runtime/finalization.py`, `src/loader/runtime/workflow_policy.py`, `src/loader/runtime/policy_timeline.py`, `src/loader/runtime/completion_trace.py`, and `src/loader/runtime/turn_completion.py` now preserve typed observed verification state through the DoD gate, canonical policy events, and projected completion traces | |
| 178 | -- stop/continue policy is more explicit about why Loader stopped: `src/loader/runtime/task_completion.py`, `src/loader/runtime/completion_policy.py`, and `src/loader/runtime/turn_completion.py` now use observed verification facts when they exist and preserve those facts on exhausted continuation failures instead of only falling back to generic missing-evidence language | |
| 179 | -- operator inspection is sharper without multiplying surfaces: `src/loader/runtime/workflow_timeline_read_model.py`, `src/loader/runtime/inspection.py`, and `src/loader/cli/main.py` now surface observed verification in `loader status`, `loader session show`, and `loader workflow show`, including a unified `Recent Verification` view sourced from canonical policy observations first and DoD evidence second | |
| 180 | - | |
| 181 | -### Verification | |
| 182 | - | |
| 183 | -- `uv run pytest -q` is green: `388 passed` | |
| 184 | -- `tests/test_verification_observations.py`, `tests/test_finalization.py`, `tests/test_completion_policy.py`, and `tests/test_turn_completion.py` now pin the verification-observation contract through finalization and completion stop policy | |
| 185 | -- `tests/test_workflow_timeline_read_model.py` and `tests/test_inspection.py` now cover observed-verification rollups, the unified recent-verification view, and the operator-facing explanation strings sourced from canonical policy state | |
| 186 | - | |
| 187 | -### Residual debt | |
| 188 | - | |
| 189 | -- Sprint 22 intentionally did not complete Deliverable 1. Loader still has a runtime-first internal owner from Sprint 21, but this sprint did not promote additional real integration paths away from `Agent` | |
| 190 | -- verification/accountability is more observed and audit-friendly now, but it is still bounded and runtime-authored; Loader still stops short of deeper OMX-style verifier reasoning, richer artifact-derived proof, or model-assisted audit narratives | |
| 191 | -- the existing status/session/workflow surfaces are clearer now, but Loader still stops short of claw-code's fuller policy engine, narrower runtime-first public API, and richer rule/prompt accountability surfaces | |
.docs/sprints/sprint23.mddeleted@@ -1,183 +0,0 @@ | ||
| 1 | -# Sprint 23: Runtime-First Integrations, Verification Producers, and Facade Narrowing | |
| 2 | - | |
| 3 | -## Prerequisites | |
| 4 | - | |
| 5 | -Sprint 22 | |
| 6 | - | |
| 7 | -## Goals | |
| 8 | - | |
| 9 | -Take the next honest step after Sprint 22: stop treating the runtime-first owner as mostly a testing seam, move more real integrations onto that seam where `Agent` is only habit, and widen the new verification-observation contract from a finalization/result story into a more direct execution-time producer story. | |
| 10 | - | |
| 11 | -Sprint 22 improved the remaining debt in a useful way: | |
| 12 | - | |
| 13 | -- Loader now carries typed verification observations through canonical policy events and completion-stop decisions | |
| 14 | -- the operator surfaces can now explain recent verification from canonical policy observations first instead of stitching together post-hoc summaries | |
| 15 | -- the canonical workflow timeline is stronger as the accountability story | |
| 16 | -- but the planned runtime-first entry promotion beyond tests did not land | |
| 17 | -- `Agent` still remains the default construction seam for many real internal paths even though Loader now has a runtime-owned internal handle and explicit public facade boundaries | |
| 18 | -- verification observations are still strongest at the DoD/finalization edge; Loader still captures less from the earlier verification lifecycle than it now knows how to represent | |
| 19 | - | |
| 20 | -Sprint 23 should keep using the references as architectural guardrails, not as a feature-copy list. | |
| 21 | - | |
| 22 | -The standard remains: | |
| 23 | - | |
| 24 | -- use claw-code to sharpen runtime-first bootstrap/session ownership, event capture, and narrower public boundaries | |
| 25 | -- use OMX to sharpen direct verifier-observation capture and evidence-backed accountability | |
| 26 | -- do not add work just because the refs have it | |
| 27 | -- do add work when the refs show that Loader is still too shell-bound, too late in its evidence capture, or too fuzzy about which boundary owns what | |
| 28 | - | |
| 29 | -`audit.txt` remains a guardrail against wrapper-heavy drift and soft compatibility habits. It is not the factual roadmap. | |
| 30 | - | |
| 31 | -The references for this sprint are: | |
| 32 | - | |
| 33 | -- `refs/claw-code/rust/crates/runtime/src/bootstrap.rs` | |
| 34 | -- `refs/claw-code/rust/crates/runtime/src/session_control.rs` | |
| 35 | -- `refs/claw-code/rust/crates/runtime/src/conversation.rs` | |
| 36 | -- `refs/claw-code/rust/crates/runtime/src/policy_engine.rs` | |
| 37 | -- `refs/claw-code/rust/crates/runtime/src/lane_events.rs` | |
| 38 | -- `refs/claw-code/rust/crates/runtime/src/green_contract.rs` | |
| 39 | -- `refs/claw-code/PARITY.md` | |
| 40 | -- `refs/oh-my-codex/src/verification/verifier.ts` | |
| 41 | -- `refs/oh-my-codex/src/autoresearch/contracts.ts` | |
| 42 | -- `refs/oh-my-codex/src/autoresearch/runtime.ts` | |
| 43 | -- `refs/oh-my-codex/src/hooks/session.ts` | |
| 44 | -- `.docs/PARITY.md` | |
| 45 | -- `.docs/audit.txt` | |
| 46 | -- `.docs/audit_sprints/trunk_sitrep.md` | |
| 47 | -- `.docs/sprints/sprint22.md` | |
| 48 | - | |
| 49 | -## Deliverables | |
| 50 | - | |
| 51 | -### 1. Promote the runtime-first entry contract into real internal integrations | |
| 52 | - | |
| 53 | -Sprint 22 left this as explicit debt. Sprint 23 should land it for real. | |
| 54 | - | |
| 55 | -Implementation targets: | |
| 56 | - | |
| 57 | -- inventory internal call sites that still instantiate or route through `Agent` by default even though they are consuming runtime-owned behavior, especially around: | |
| 58 | - - launcher/bootstrap helpers | |
| 59 | - - CLI/TUI integration seams that are not actually testing the public compatibility contract | |
| 60 | - - harnesses, utilities, and inspection/session helpers that only need runtime ownership | |
| 61 | -- define or refine a small runtime-first entry contract for internal consumers that need: | |
| 62 | - - runtime bootstrap/session ownership | |
| 63 | - - launcher/public-shell execution | |
| 64 | - - inspection, continuity, or workflow/accountability hooks | |
| 65 | -- migrate a bounded but real set of internal integrations onto that seam | |
| 66 | -- explicitly document what remains intentionally public-shell-only versus what is now runtime-first by default | |
| 67 | - | |
| 68 | -The goal is not to delete `Agent`. The goal is to stop using it internally by reflex where a runtime-owned seam is cleaner and already exists. | |
| 69 | - | |
| 70 | -### 2. Expand verification observations to earlier execution-time producers | |
| 71 | - | |
| 72 | -Sprint 22 made verification observations real, but they still enter the story relatively late. | |
| 73 | - | |
| 74 | -Implementation targets: | |
| 75 | - | |
| 76 | -- inventory where verification-related facts are currently available earlier than finalization across: | |
| 77 | - - `src/loader/runtime/dod.py` | |
| 78 | - - `src/loader/runtime/finalization.py` | |
| 79 | - - `src/loader/runtime/tool_batches.py` | |
| 80 | - - `src/loader/runtime/executor.py` | |
| 81 | - - `src/loader/runtime/workflow_lanes.py` | |
| 82 | - - `src/loader/runtime/workflow_policy.py` | |
| 83 | -- identify which observation kinds should be emitted closer to execution, such as: | |
| 84 | - - verification command planned/requested | |
| 85 | - - verification command actually executed | |
| 86 | - - verification output observed and classified as passed/failed/contradictory | |
| 87 | - - verification was intentionally skipped, stale, or still pending | |
| 88 | - - observed artifact/touchpoint evidence materially backed or blocked completion before finalization | |
| 89 | -- thread those earlier observations into the canonical policy story without creating a peer truth beside the workflow timeline | |
| 90 | - | |
| 91 | -The goal is to move Loader’s accountability closer to “this is what we observed while verification happened” instead of only “this is what the runtime concluded later.” | |
| 92 | - | |
| 93 | -### 3. Narrow the remaining public facade boundary on purpose | |
| 94 | - | |
| 95 | -Sprint 20 settled `Agent` as the public shell. Sprint 23 should keep making that shell smaller and more explicit. | |
| 96 | - | |
| 97 | -Implementation targets: | |
| 98 | - | |
| 99 | -- inventory what still lives in `src/loader/agent/loop.py` and nearby public-shell glue | |
| 100 | -- identify which pieces are: | |
| 101 | - - true public compatibility API | |
| 102 | - - UI integration seam | |
| 103 | - - leftover runtime ownership that can move below the shell now | |
| 104 | -- move the still-obviously-runtime pieces below the public shell where that reduces ambiguity | |
| 105 | -- add or extend direct boundary tests so internal code does not drift back toward `Agent` ownership once a runtime seam exists | |
| 106 | - | |
| 107 | -The goal is not “make the file smaller” for its own sake. The goal is that future work has a clearer answer to what the public shell is for. | |
| 108 | - | |
| 109 | -### 4. Sharpen operator visibility for runtime-first ownership and observed verification | |
| 110 | - | |
| 111 | -Once more of the real integration paths go runtime-first and more observations are captured earlier, the existing product surfaces should make that easier to audit. | |
| 112 | - | |
| 113 | -Implementation targets: | |
| 114 | - | |
| 115 | -- improve the existing surfaces so users can answer: | |
| 116 | - - which runtime-owned path produced the current session/accountability state? | |
| 117 | - - what verification was actually observed earlier in the turn versus only concluded at finalization? | |
| 118 | - - what evidence is pending, contradicted, or already satisfied? | |
| 119 | -- prefer improving: | |
| 120 | - - `loader status` | |
| 121 | - - `loader session show` | |
| 122 | - - `loader workflow show` | |
| 123 | - over inventing a new command unless a new surface is clearly cleaner | |
| 124 | -- keep concise rollups first, and expose deeper ownership/observation detail only where it materially improves post-mortem debugging | |
| 125 | - | |
| 126 | -The goal is to make Loader easier to audit after the fact, not simply more verbose. | |
| 127 | - | |
| 128 | -## Testing strategy | |
| 129 | - | |
| 130 | -- unit coverage for: | |
| 131 | - - runtime-first entry helpers adopted in real internal integration paths | |
| 132 | - - earlier verification-observation producer normalization and persistence | |
| 133 | - - public-shell boundary helpers and import/boundary guards | |
| 134 | -- runtime coverage for: | |
| 135 | - - successful completion with observed verification facts emitted before finalization | |
| 136 | - - failed or pending verification that now leaves a clearer producer-backed policy trail | |
| 137 | - - internal integration paths that no longer need `Agent` by default | |
| 138 | -- regression coverage for: | |
| 139 | - - no drift back toward `Agent` as the default internal seam when a runtime-owned seam exists | |
| 140 | - - no duplicate truth beside the canonical policy/accountability story | |
| 141 | - - no regression in Sprint 22’s observed-verification inspection and stop/continue honesty | |
| 142 | - | |
| 143 | -## Definition of done | |
| 144 | - | |
| 145 | -- Loader has at least one more real internal integration path using a runtime-first entry seam below `Agent` | |
| 146 | -- verification observations are emitted from at least one earlier execution-time producer instead of only the finalization edge | |
| 147 | -- the remaining public-shell boundary is smaller or more explicitly defended on purpose | |
| 148 | -- existing status/session/workflow surfaces expose the stronger runtime-first and verification-observation story without multiplying commands | |
| 149 | -- Sprint 22’s observed-verification and accountability gains remain green | |
| 150 | - | |
| 151 | -## Explicitly out of scope | |
| 152 | - | |
| 153 | -- deleting `Agent` as the public compatibility surface | |
| 154 | -- full claw-code policy-engine parity | |
| 155 | -- model-authored verifier narratives as a required runtime dependency | |
| 156 | -- AST-aware semantic diffs | |
| 157 | -- a broad visual workflow UI | |
| 158 | -- multi-agent or team orchestration | |
| 159 | - | |
| 160 | -## Audit | |
| 161 | - | |
| 162 | -### Status | |
| 163 | - | |
| 164 | -- Sprint 23 is complete, and the audit is green. Loader now uses the runtime-first seam in real internal integrations, captures verification observations closer to the moment verification runs, and exposes runtime-owner provenance in the same operator surfaces that already carry policy and workflow accountability. | |
| 165 | - | |
| 166 | -### Landed | |
| 167 | - | |
| 168 | -- runtime-first ownership is now materially real outside tests: `src/loader/runtime/runtime_handle.py` now owns direct `run` / `run_streaming` / `run_explore` entrypoints, `src/loader/cli/main.py` routes non-TUI CLI and `loader explore` through that runtime-first owner by default, and `tests/helpers/runtime_harness.py` now uses `RuntimeHandle` for scripted runtime scenarios instead of instantiating `Agent` by habit | |
| 169 | -- verification observations now enter the canonical accountability story closer to execution: `src/loader/runtime/finalization.py`, `src/loader/runtime/workflow_policy.py`, `src/loader/runtime/policy_timeline.py`, and `src/loader/runtime/workflow_timeline_read_model.py` now persist and project per-command `verify_observation` entries, so Loader can explain what verification actually ran and what it observed instead of only summarizing that state later | |
| 170 | -- runtime-owner provenance is now part of persisted session state and inspection: `src/loader/runtime/owner_metadata.py`, `src/loader/runtime/bootstrap.py`, `src/loader/runtime/public_shell.py`, and `src/loader/runtime/session.py` now persist owner-path metadata, while `src/loader/runtime/inspection.py` and `src/loader/cli/main.py` surface that metadata in `loader status`, `loader session list/show`, and `loader workflow show` | |
| 171 | - | |
| 172 | -### Verification | |
| 173 | - | |
| 174 | -- `uv run pytest -q` is green: `397 passed` | |
| 175 | -- `tests/test_runtime_handle.py`, `tests/test_cli_runtime_owner.py`, and `tests/helpers/runtime_harness.py` now pin real runtime-first integration paths below `Agent` | |
| 176 | -- `tests/test_finalization.py` and `tests/test_workflow_timeline_read_model.py` now pin per-command verification-observation entries and their projection into workflow/policy views | |
| 177 | -- `tests/test_session_state.py`, `tests/test_runtime_public_shell.py`, `tests/test_runtime_bootstrap.py`, `tests/test_runtime_launcher.py`, and `tests/test_inspection.py` now cover persisted runtime-owner metadata plus its status/session/workflow rendering | |
| 178 | - | |
| 179 | -### Residual debt | |
| 180 | - | |
| 181 | -- Loader now has real runtime-first internal integrations, but the TUI still routes through the public `Agent` facade and the public shell still remains the outermost construction contract for external integrations | |
| 182 | -- verification observations are now closer to execution, but they are still strongest around the verification loop/finalization path; Loader still does not yet emit a richer lifecycle story for planned, pending, or stale verification outside that bounded lane | |
| 183 | -- the new owner-path visibility makes runtime-first adoption auditable, but Loader still stops short of a narrower public runtime API, claw-code's fuller policy engine, and OMX's deeper verifier/interview rigor | |
.docs/sprints/sprint24.mddeleted@@ -1,180 +0,0 @@ | ||
| 1 | -# Sprint 24: TUI Runtime Convergence, Verification Lifecycle, and Facade Narrowing | |
| 2 | - | |
| 3 | -## Prerequisites | |
| 4 | - | |
| 5 | -Sprint 23 | |
| 6 | - | |
| 7 | -## Goals | |
| 8 | - | |
| 9 | -Take the next honest step after Sprint 23: stop treating the TUI as the last major product path that still defaults to `Agent` by habit, widen the verification-observation story from “what ran” toward “what is planned, pending, or stale,” and keep narrowing the public shell without pretending Loader is ready to delete it. | |
| 10 | - | |
| 11 | -Sprint 23 changed the remaining debt in a useful way: | |
| 12 | - | |
| 13 | -- non-TUI CLI, `loader explore`, and the scripted runtime harness now use the runtime-first owner seam below `Agent` | |
| 14 | -- verification observations now enter the canonical workflow timeline closer to execution through per-command verification events | |
| 15 | -- operator surfaces can now show which runtime-owner path produced the current session/accountability state | |
| 16 | -- but the TUI still routes through the public `Agent` shell even though it primarily consumes runtime-owned behavior | |
| 17 | -- verification observations still say more about commands that ran than about commands that are only planned, still pending, or now stale | |
| 18 | -- Loader is much closer to an explicit public/runtime boundary, but it still has not fully converged on what must remain public-shell-only versus what can be runtime-first by default | |
| 19 | - | |
| 20 | -Sprint 24 should keep using the references as architectural guardrails, not as a feature-copy list. | |
| 21 | - | |
| 22 | -The standard remains: | |
| 23 | - | |
| 24 | -- use claw-code to sharpen runtime/bootstrap ownership, event accountability, and narrower UI-facing shell seams | |
| 25 | -- use OMX to sharpen verifier-lifecycle visibility and auditability around pending vs observed proof | |
| 26 | -- do not add work just because the refs have it | |
| 27 | -- do add work when the refs show that Loader is still too shell-bound, too late in its verification story, or too ambiguous about what the public shell still owns | |
| 28 | - | |
| 29 | -`audit.txt` remains a guardrail against wrapper-heavy drift and compatibility-by-habit. It is not the factual roadmap. | |
| 30 | - | |
| 31 | -The references for this sprint are: | |
| 32 | - | |
| 33 | -- `refs/claw-code/rust/crates/runtime/src/bootstrap.rs` | |
| 34 | -- `refs/claw-code/rust/crates/runtime/src/session_control.rs` | |
| 35 | -- `refs/claw-code/rust/crates/runtime/src/conversation.rs` | |
| 36 | -- `refs/claw-code/rust/crates/runtime/src/lane_events.rs` | |
| 37 | -- `refs/claw-code/rust/crates/runtime/src/green_contract.rs` | |
| 38 | -- `refs/claw-code/PARITY.md` | |
| 39 | -- `refs/oh-my-codex/src/verification/verifier.ts` | |
| 40 | -- `refs/oh-my-codex/src/autoresearch/runtime.ts` | |
| 41 | -- `refs/oh-my-codex/src/autoresearch/contracts.ts` | |
| 42 | -- `refs/oh-my-codex/src/hooks/session.ts` | |
| 43 | -- `.docs/PARITY.md` | |
| 44 | -- `.docs/audit.txt` | |
| 45 | -- `.docs/audit_sprints/trunk_sitrep.md` | |
| 46 | -- `.docs/sprints/sprint23.md` | |
| 47 | - | |
| 48 | -## Deliverables | |
| 49 | - | |
| 50 | -### 1. Move the TUI onto the runtime-first owner seam | |
| 51 | - | |
| 52 | -Sprint 23 made runtime-first real in the CLI and scripted harness. Sprint 24 should make the TUI stop depending on `Agent` by default when it mostly consumes runtime-owned behavior. | |
| 53 | - | |
| 54 | -Implementation targets: | |
| 55 | - | |
| 56 | -- inventory what the TUI actually needs from its execution owner across: | |
| 57 | - - `src/loader/ui/app.py` | |
| 58 | - - `src/loader/ui/adapter.py` | |
| 59 | - - `src/loader/runtime/public_shell.py` | |
| 60 | - - `src/loader/runtime/runtime_handle.py` | |
| 61 | - - `src/loader/cli/main.py` | |
| 62 | -- define or refine a small shell-owner contract for the TUI instead of letting `Agent` be the assumed type | |
| 63 | -- migrate the TUI launch path to the runtime-first owner where that does not weaken the public compatibility story | |
| 64 | -- preserve explicit public compatibility where it still matters, but stop using `Agent` as the default UI owner by habit | |
| 65 | - | |
| 66 | -The goal is not to remove `Agent` from the codebase. The goal is to make the biggest remaining real product path stop depending on it unnecessarily. | |
| 67 | - | |
| 68 | -### 2. Expand verification observations to planned, pending, and stale states | |
| 69 | - | |
| 70 | -Sprint 23 captures commands that ran. Sprint 24 should make Loader more honest about verification work that has not yet happened or is no longer fresh. | |
| 71 | - | |
| 72 | -Implementation targets: | |
| 73 | - | |
| 74 | -- inventory where verification lifecycle facts already exist before or beyond command execution across: | |
| 75 | - - `src/loader/runtime/dod.py` | |
| 76 | - - `src/loader/runtime/finalization.py` | |
| 77 | - - `src/loader/runtime/task_completion.py` | |
| 78 | - - `src/loader/runtime/workflow_policy.py` | |
| 79 | - - `src/loader/runtime/policy_timeline.py` | |
| 80 | - - `src/loader/runtime/workflow_timeline_read_model.py` | |
| 81 | -- define which observation kinds Loader should represent directly, such as: | |
| 82 | - - verification planned | |
| 83 | - - verification pending | |
| 84 | - - verification stale | |
| 85 | - - verification intentionally skipped | |
| 86 | - - verification observed as passed or failed | |
| 87 | -- keep those lifecycle facts inside the canonical policy/accountability story instead of inventing a second peer model | |
| 88 | - | |
| 89 | -The goal is to make Loader answer “what proof is still pending?” as directly as it can already answer “what proof did we observe?” | |
| 90 | - | |
| 91 | -### 3. Tighten the public facade boundary around runtime-first defaults | |
| 92 | - | |
| 93 | -Sprint 23 improved real internal adoption, but the long-term shell boundary is still not quite settled enough. | |
| 94 | - | |
| 95 | -Implementation targets: | |
| 96 | - | |
| 97 | -- inventory what remains in `src/loader/agent/loop.py` that is: | |
| 98 | - - true public compatibility API | |
| 99 | - - UI integration glue | |
| 100 | - - leftover runtime ownership | |
| 101 | -- move any still-obviously-runtime ownership lower when that reduces ambiguity | |
| 102 | -- add or extend boundary tests so future work does not drift back toward `Agent` as the default internal seam after Sprint 24 | |
| 103 | - | |
| 104 | -The goal is not to shrink files for vanity. The goal is to make the remaining public shell answerable and intentionally narrow. | |
| 105 | - | |
| 106 | -### 4. Improve operator visibility for owner path and verification lifecycle | |
| 107 | - | |
| 108 | -Once Loader can show runtime-owner provenance and richer verification lifecycle state, the existing product surfaces should make that easier to audit. | |
| 109 | - | |
| 110 | -Implementation targets: | |
| 111 | - | |
| 112 | -- improve the current surfaces so users can answer: | |
| 113 | - - which owner path produced this session? | |
| 114 | - - what verification is planned, pending, stale, skipped, or observed? | |
| 115 | - - what evidence is still needed versus already satisfied? | |
| 116 | -- prefer improving: | |
| 117 | - - `loader status` | |
| 118 | - - `loader session show` | |
| 119 | - - `loader workflow show` | |
| 120 | - - the TUI status surface | |
| 121 | - over inventing a new command unless one is clearly cleaner | |
| 122 | - | |
| 123 | -The goal is to make Loader easier to audit live and after the fact, not simply more verbose. | |
| 124 | - | |
| 125 | -## Testing strategy | |
| 126 | - | |
| 127 | -- unit coverage for: | |
| 128 | - - the TUI/runtime shell-owner contract | |
| 129 | - - runtime-first TUI launch routing | |
| 130 | - - planned/pending/stale verification-observation normalization and projection | |
| 131 | -- runtime coverage for: | |
| 132 | - - the TUI or UI-facing shell path using the runtime-first owner without losing steering, confirmation, or question handling | |
| 133 | - - policy/accountability views that now distinguish pending vs observed verification | |
| 134 | -- regression coverage for: | |
| 135 | - - no drift back toward `Agent` as the default owner for internal UI/runtime integrations | |
| 136 | - - no duplicate verification lifecycle truth beside the canonical policy timeline | |
| 137 | - - no regression in Sprint 23's runtime-owner visibility and execution-time verification observations | |
| 138 | - | |
| 139 | -## Definition of done | |
| 140 | - | |
| 141 | -- the TUI uses a runtime-first owner seam below `Agent`, or any remaining public-shell dependency is explicit and justified | |
| 142 | -- Loader preserves richer verification lifecycle state than just “command ran” within the canonical policy/accountability story | |
| 143 | -- the public facade boundary is narrower or more explicitly defended on purpose | |
| 144 | -- existing status/session/workflow/TUI surfaces expose the stronger owner-path and verification-lifecycle story without multiplying product commands | |
| 145 | -- Sprint 23's runtime-first integration and verification-producer gains remain green | |
| 146 | - | |
| 147 | -## Explicitly out of scope | |
| 148 | - | |
| 149 | -- deleting `Agent` as the public compatibility surface | |
| 150 | -- full claw-code policy-engine parity | |
| 151 | -- model-authored verifier narratives as a required runtime dependency | |
| 152 | -- AST-aware semantic diffs | |
| 153 | -- a broad visual workflow UI redesign | |
| 154 | -- multi-agent or team orchestration | |
| 155 | - | |
| 156 | -## Audit | |
| 157 | - | |
| 158 | -### Status | |
| 159 | - | |
| 160 | -- Sprint 24 is complete, and the audit is green. Loader now treats the TUI as another runtime-first internal path instead of a special `Agent` holdout, and the verification lifecycle is explicit across planned, pending, stale, skipped, and observed states inside the canonical policy timeline. | |
| 161 | - | |
| 162 | -### Landed | |
| 163 | - | |
| 164 | -- the TUI now launches through the runtime-first shell-owner seam below `Agent`: `src/loader/cli/main.py`, `src/loader/ui/app.py`, `src/loader/ui/adapter.py`, and `src/loader/runtime/runtime_handle.py` now build and use a runtime-owned shell owner by default for TUI launch instead of routing through the public `Agent` facade by habit | |
| 165 | -- verification lifecycle state is now richer and more honest inside the canonical policy/accountability story: `src/loader/runtime/tool_batches.py`, `src/loader/runtime/finalization.py`, `src/loader/runtime/task_completion.py`, and `src/loader/runtime/verification_observations.py` now distinguish verification that is planned after new mutating work, pending because verify has started, stale because fresh mutations invalidated earlier proof, intentionally skipped, or actually observed | |
| 166 | -- operator surfaces now expose that lifecycle directly instead of flattening it back into generic “missing verification”: `src/loader/runtime/workflow_timeline_read_model.py`, `src/loader/runtime/inspection.py`, and `src/loader/cli/main.py` now show `Verify planned`, `Verify pending`, and `Verify stale` states plus the corresponding recent-verification summaries in `loader status`, `loader session show`, and `loader workflow show` | |
| 167 | -- the public facade boundary is tighter in practice even though Sprint 24 was not primarily a shell-contraction sprint: by moving the last major product path off default `Agent` ownership, Loader now has a clearer answer to what remains public-shell-only versus what is runtime-first by default | |
| 168 | - | |
| 169 | -### Verification | |
| 170 | - | |
| 171 | -- `uv run pytest -q` is green: `414 passed` | |
| 172 | -- `tests/test_cli_runtime_owner.py` now pins runtime-first owner selection for non-TUI CLI, `loader explore`, single-prompt paths, and TUI launch | |
| 173 | -- `tests/test_tool_batches.py`, `tests/test_finalization.py`, `tests/test_completion_policy.py`, and `tests/test_workflow_runtime.py` now pin the planned -> pending -> stale verification lifecycle through mutating work, verify handoff, and completion/continuation policy | |
| 174 | -- `tests/test_workflow_timeline_read_model.py` and `tests/test_inspection.py` now pin the operator-facing projection and rendering of planned/pending/stale verification in workflow highlights, status, session, and workflow inspection surfaces | |
| 175 | - | |
| 176 | -### Residual debt | |
| 177 | - | |
| 178 | -- Loader now uses runtime-first owner seams across CLI, explore, scripted harnesses, and the TUI, but `Agent` plus `runtime.public_shell` still define the outer compatibility boundary instead of a narrower runtime-first external API | |
| 179 | -- the verification lifecycle is much clearer now, but it is still a bounded runtime-authored model; Loader still does not preserve richer queueing/timestamp semantics for “planned” vs “actively running,” nor does it implement OMX-style deeper verifier reasoning | |
| 180 | -- Sprint 24 materially improved shell ownership and verification accountability, but Loader still stops short of claw-code's fuller policy engine, richer sandboxing, and the deeper verifier/interview rigor in the refs | |
.docs/sprints/sprint25.mddeleted@@ -1,197 +0,0 @@ | ||
| 1 | -# Sprint 25: Public Runtime API, Verification Attempts, and Boundary Narrowing | |
| 2 | - | |
| 3 | -## Prerequisites | |
| 4 | - | |
| 5 | -Sprint 24 | |
| 6 | - | |
| 7 | -## Goals | |
| 8 | - | |
| 9 | -Take the next honest step after Sprint 24: stop treating `Agent` plus `runtime.public_shell` as the only meaningful outer boundary by default, deepen verification from coarse lifecycle labels into explicit attempt semantics, and keep pushing Loader toward a runtime-first shape without pretending it is ready to delete the public compatibility surface. | |
| 10 | - | |
| 11 | -Sprint 24 changed the remaining debt in a useful way: | |
| 12 | - | |
| 13 | -- CLI, explore, the scripted harness, and the TUI now all use the runtime-first owner seam below `Agent` | |
| 14 | -- verification lifecycle now distinguishes `planned`, `pending`, `stale`, `skipped`, and observed states inside the canonical policy timeline | |
| 15 | -- operator surfaces can now explain that lifecycle directly in `status`, `session show`, and `workflow show` | |
| 16 | -- but `Agent` plus `runtime.public_shell` still remain the outer compatibility shell instead of a narrower runtime-first external API | |
| 17 | -- verification lifecycle is still a bounded runtime-authored label model, not an explicit record of verification attempts, queueing, start/completion moments, or freshness across retries | |
| 18 | -- Loader can now say that verification is planned or pending, but it still says less than it should about which verification attempt is active, what superseded it, and what evidence belongs to which attempt | |
| 19 | - | |
| 20 | -Sprint 25 should keep using the references as architectural guardrails, not as a feature-copy list. | |
| 21 | - | |
| 22 | -The standard remains: | |
| 23 | - | |
| 24 | -- use claw-code to sharpen outer runtime boundaries, bootstrap/session ownership, and event accountability | |
| 25 | -- use OMX to sharpen verifier-attempt visibility, freshness semantics, and auditability around incomplete versus superseded proof | |
| 26 | -- do not add work just because the refs have it | |
| 27 | -- do add work when the refs show that Loader is still too compatibility-shell-bound or too coarse in its verification state model | |
| 28 | - | |
| 29 | -`audit.txt` remains a guardrail against wrapper-heavy drift and compatibility-by-habit. It is not the factual roadmap. | |
| 30 | - | |
| 31 | -The references for this sprint are: | |
| 32 | - | |
| 33 | -- `refs/claw-code/rust/crates/runtime/src/bootstrap.rs` | |
| 34 | -- `refs/claw-code/rust/crates/runtime/src/session_control.rs` | |
| 35 | -- `refs/claw-code/rust/crates/runtime/src/conversation.rs` | |
| 36 | -- `refs/claw-code/rust/crates/runtime/src/lane_events.rs` | |
| 37 | -- `refs/claw-code/rust/crates/runtime/src/green_contract.rs` | |
| 38 | -- `refs/claw-code/PARITY.md` | |
| 39 | -- `refs/oh-my-codex/src/verification/verifier.ts` | |
| 40 | -- `refs/oh-my-codex/src/autoresearch/runtime.ts` | |
| 41 | -- `refs/oh-my-codex/src/autoresearch/contracts.ts` | |
| 42 | -- `refs/oh-my-codex/src/hooks/session.ts` | |
| 43 | -- `.docs/PARITY.md` | |
| 44 | -- `.docs/audit.txt` | |
| 45 | -- `.docs/audit_sprints/trunk_sitrep.md` | |
| 46 | -- `.docs/sprints/sprint24.md` | |
| 47 | - | |
| 48 | -## Deliverables | |
| 49 | - | |
| 50 | -### 1. Define a narrower public runtime API below `Agent` | |
| 51 | - | |
| 52 | -Sprint 24 made runtime-first real across the major product paths. Sprint 25 should make the remaining outer boundary more explicit and less compatibility-shaped. | |
| 53 | - | |
| 54 | -Implementation targets: | |
| 55 | - | |
| 56 | -- inventory what external callers actually need from: | |
| 57 | - - `src/loader/agent/loop.py` | |
| 58 | - - `src/loader/runtime/public_shell.py` | |
| 59 | - - `src/loader/runtime/runtime_handle.py` | |
| 60 | - - `src/loader/runtime/bootstrap.py` | |
| 61 | - - `src/loader/cli/main.py` | |
| 62 | - - `src/loader/ui/app.py` | |
| 63 | -- define a smaller runtime-owned API for external integrations that need: | |
| 64 | - - runtime bootstrap/session ownership | |
| 65 | - - shell execution entrypoints | |
| 66 | - - inspection/continuity hooks | |
| 67 | - - steering/question handling | |
| 68 | -- explicitly settle what stays public-compat-only under `Agent` and what should become runtime-first by default | |
| 69 | -- migrate at least one more real caller or integration seam onto that narrower runtime-first API if the seam is still compatibility-driven by habit | |
| 70 | - | |
| 71 | -The goal is not to delete `Agent`. The goal is to make Loader's outer boundary answerable instead of half public facade and half runtime shell by historical accident. | |
| 72 | - | |
| 73 | -### 2. Promote verification lifecycle labels into typed attempt semantics | |
| 74 | - | |
| 75 | -Sprint 24 gave Loader lifecycle states. Sprint 25 should give those states explicit attempt structure. | |
| 76 | - | |
| 77 | -Implementation targets: | |
| 78 | - | |
| 79 | -- inventory where Loader already knows more than a plain status label across: | |
| 80 | - - `src/loader/runtime/dod.py` | |
| 81 | - - `src/loader/runtime/finalization.py` | |
| 82 | - - `src/loader/runtime/task_completion.py` | |
| 83 | - - `src/loader/runtime/tool_batches.py` | |
| 84 | - - `src/loader/runtime/workflow_policy.py` | |
| 85 | - - `src/loader/runtime/policy_timeline.py` | |
| 86 | - - `src/loader/runtime/verification_observations.py` | |
| 87 | -- define a typed verification-attempt model that can represent things like: | |
| 88 | - - verification planned as a future attempt | |
| 89 | - - verification queued/pending as the current active attempt | |
| 90 | - - verification started versus completed | |
| 91 | - - verification superseded or made stale by later mutating work | |
| 92 | - - verification intentionally skipped | |
| 93 | - - verification attempt results tied to the right command/evidence bundle | |
| 94 | -- keep that attempt model inside the canonical policy/accountability story instead of creating a second peer verification log with a separate ownership story | |
| 95 | - | |
| 96 | -The goal is to make Loader answer not just "is verification pending?" but "which verification attempt is pending, what superseded the last one, and what evidence belongs to this attempt?" | |
| 97 | - | |
| 98 | -### 3. Tighten completion and freshness policy around verification attempts | |
| 99 | - | |
| 100 | -Once attempt semantics exist, completion policy should stop flattening them back into generic lifecycle summaries. | |
| 101 | - | |
| 102 | -Implementation targets: | |
| 103 | - | |
| 104 | -- connect completion and reentry decisions more directly to: | |
| 105 | - - active verification attempt identity | |
| 106 | - - attempt freshness relative to later mutations | |
| 107 | - - attempt result timestamps/order | |
| 108 | - - superseded/stale attempt reasoning | |
| 109 | - - explicit missing proof versus not-yet-finished proof | |
| 110 | -- preserve a clear distinction between: | |
| 111 | - - proof that is planned but not started | |
| 112 | - - proof currently in flight | |
| 113 | - - proof that finished and passed | |
| 114 | - - proof that finished and failed | |
| 115 | - - proof that was once green but is no longer fresh | |
| 116 | -- ensure the canonical policy story explains why a stop/continue/retry decision was tied to one verification attempt rather than another | |
| 117 | - | |
| 118 | -The goal is to make Loader's completion honesty stronger when verification gets interrupted, superseded, retried, or resumed. | |
| 119 | - | |
| 120 | -### 4. Improve operator visibility for runtime boundary and verification attempts | |
| 121 | - | |
| 122 | -Once Loader has a narrower runtime-first boundary and richer attempt semantics, the existing operator surfaces should make that audit story easier to follow. | |
| 123 | - | |
| 124 | -Implementation targets: | |
| 125 | - | |
| 126 | -- improve the current surfaces so users can answer: | |
| 127 | - - which runtime/public boundary handled this session? | |
| 128 | - - what is the current verification attempt? | |
| 129 | - - what earlier attempt became stale or was superseded? | |
| 130 | - - what evidence is attached to the active versus superseded attempt? | |
| 131 | -- prefer improving: | |
| 132 | - - `loader status` | |
| 133 | - - `loader session show` | |
| 134 | - - `loader workflow show` | |
| 135 | - - the TUI status surface | |
| 136 | - over inventing a new command unless a new surface is clearly cleaner | |
| 137 | -- keep concise rollups first and expose deeper attempt detail only where it materially improves debugging | |
| 138 | - | |
| 139 | -The goal is to make Loader's runtime ownership and verification story easier to audit after the fact, not simply more verbose. | |
| 140 | - | |
| 141 | -## Testing strategy | |
| 142 | - | |
| 143 | -- unit coverage for: | |
| 144 | - - the narrower runtime-first public API contract | |
| 145 | - - verification-attempt normalization, persistence, and supersession | |
| 146 | - - completion freshness rules that now depend on attempt semantics | |
| 147 | -- runtime coverage for: | |
| 148 | - - a verify handoff that records a planned attempt before active verification starts | |
| 149 | - - a pending attempt that later completes or becomes stale after fresh mutating work | |
| 150 | - - runtime-first callers that no longer need `Agent` by default | |
| 151 | -- regression coverage for: | |
| 152 | - - no drift back toward `Agent` as the assumed external owner when a runtime-first API exists | |
| 153 | - - no duplicate verification-attempt truth beside the canonical policy timeline | |
| 154 | - - no regression in Sprint 24's lifecycle visibility and runtime-first TUI ownership | |
| 155 | - | |
| 156 | -## Definition of done | |
| 157 | - | |
| 158 | -- Loader has a narrower and more explicit runtime-first external API below `Agent`, or any remaining `Agent` ownership is clearly justified as public compatibility | |
| 159 | -- verification lifecycle is represented with explicit attempt semantics, not only coarse status labels | |
| 160 | -- completion and freshness policy can explain which verification attempt is active, stale, superseded, or satisfied | |
| 161 | -- existing status/session/workflow/TUI surfaces expose the stronger boundary and attempt story without multiplying product commands | |
| 162 | -- Sprint 24's runtime-first ownership and verification lifecycle gains remain green | |
| 163 | - | |
| 164 | -## Explicitly out of scope | |
| 165 | - | |
| 166 | -- deleting `Agent` as the public compatibility surface | |
| 167 | -- full claw-code policy-engine parity | |
| 168 | -- model-authored verifier narratives as a required runtime dependency | |
| 169 | -- AST-aware semantic diffs | |
| 170 | -- a broad visual workflow UI redesign | |
| 171 | -- multi-agent or team orchestration | |
| 172 | - | |
| 173 | -## Audit | |
| 174 | - | |
| 175 | -### Status | |
| 176 | - | |
| 177 | -- Sprint 25 is complete, and the audit is green. Loader now has a runtime-owned outer API contract below `Agent`, explicit verification-attempt identity threaded through completion/freshness policy, and operator surfaces that can explain both the runtime boundary and the active versus superseded verification attempts. | |
| 178 | - | |
| 179 | -### Landed | |
| 180 | - | |
| 181 | -- Loader now has a narrower runtime-owned shell API below `Agent`: `src/loader/runtime/runtime_api.py`, `src/loader/cli/main.py`, and `src/loader/ui/app.py` now share one runtime-owned boundary for shell-owner construction instead of treating CLI and TUI ownership as adjacent but separate contracts | |
| 182 | -- verification lifecycle is now represented with explicit attempt identity instead of only lifecycle labels: `src/loader/runtime/dod.py`, `src/loader/runtime/verification_observations.py`, `src/loader/runtime/tool_batches.py`, `src/loader/runtime/finalization.py`, `src/loader/runtime/task_completion.py`, and `src/loader/runtime/completion_policy.py` now preserve which attempt is planned, pending, stale, observed, skipped, or superseded | |
| 183 | -- completion/freshness policy now explains itself in attempt-aware terms: when Loader continues, finalizes, or rejects stale proof, it can say which attempt was still active, which one was superseded, and why one attempt did or did not satisfy the stop condition | |
| 184 | -- operator surfaces now expose the stronger boundary and attempt story without adding new commands: `src/loader/runtime/inspection.py`, `src/loader/cli/main.py`, `src/loader/runtime/owner_metadata.py`, `src/loader/ui/status_helpers.py`, `src/loader/ui/widgets/status_line.py`, and `src/loader/ui/app.py` now show runtime-boundary summaries plus attempt-aware verification state in `loader status`, `loader session show`, `loader workflow show`, and the TUI status line | |
| 185 | - | |
| 186 | -### Verification | |
| 187 | - | |
| 188 | -- `uv run pytest -q` is green: `416 passed` | |
| 189 | -- `tests/test_cli_runtime_owner.py` now pins the runtime-owned shell API boundary for CLI and TUI ownership | |
| 190 | -- `tests/test_completion_policy.py` and `tests/test_turn_completion.py` now pin attempt-aware completion/freshness reasoning, including superseded-attempt summaries | |
| 191 | -- `tests/test_inspection.py` and `tests/test_status_surfaces.py` now pin runtime-boundary summaries, verification-state summaries, and TUI owner/attempt status rendering | |
| 192 | - | |
| 193 | -### Residual debt | |
| 194 | - | |
| 195 | -- `src/loader/runtime/runtime_api.py` narrows the boundary materially, but `Agent` plus `runtime.public_shell` still remain the documented public compatibility layer instead of Loader exposing a fully runtime-first external API to all callers | |
| 196 | -- verification attempt identity is now explicit, but Loader still does not preserve richer attempt timing and queue semantics such as first-planned versus actively-started timestamps, or deeper multi-command attempt bundles | |
| 197 | -- the operator surfaces now explain active versus superseded attempts clearly, but they are still concise rollups rather than a deeper attempt-history debugger or OMX-style verifier narrative | |
.docs/sprints/sprint26.mddeleted@@ -1,165 +0,0 @@ | ||
| 1 | -# Sprint 26: Verification Attempt Timelines and Public Facade Delamination | |
| 2 | - | |
| 3 | -## Prerequisites | |
| 4 | - | |
| 5 | -Sprint 25 | |
| 6 | - | |
| 7 | -## Goals | |
| 8 | - | |
| 9 | -Take the next honest step after Sprint 25: preserve more than attempt labels in the verifier lifecycle, make the runtime-owned API more authoritative as Loader's practical outer boundary, and keep shrinking the gap between "runtime-first internally" and "runtime-first by default" without pretending the public compatibility surface is ready to disappear. | |
| 10 | - | |
| 11 | -Sprint 25 changed the remaining debt in a useful way: | |
| 12 | - | |
| 13 | -- Loader now has a runtime-owned shell API boundary in `src/loader/runtime/runtime_api.py` | |
| 14 | -- verification lifecycle now preserves explicit attempt identity across planned, pending, stale, skipped, and observed states | |
| 15 | -- completion policy can explain active versus superseded proof in attempt-aware terms | |
| 16 | -- operator surfaces can now show runtime-boundary summaries and attempt-aware verification state in CLI and TUI | |
| 17 | -- but Loader still says more about attempt labels than about attempt timeline facts such as when an attempt was first planned, when it actually started, when it completed, and what exactly superseded it | |
| 18 | -- and `Agent` plus `runtime.public_shell` still remain the documented public compatibility shell even though the runtime-owned boundary is much stronger now | |
| 19 | - | |
| 20 | -Sprint 26 should keep using the references as architectural guardrails, not as a feature-copy list. | |
| 21 | - | |
| 22 | -The standard remains: | |
| 23 | - | |
| 24 | -- use claw-code to sharpen runtime/bootstrap/session ownership, policy/accountability event structure, and explicit outer runtime boundaries | |
| 25 | -- use OMX to sharpen verifier attempt visibility, freshness reasoning, and the audit trail around incomplete, stale, superseded, or resumed proof | |
| 26 | -- do not add work just because the refs have it | |
| 27 | -- do add work when the refs show that Loader is still too compatibility-shell-bound or too coarse in its verification-attempt history | |
| 28 | - | |
| 29 | -`audit.txt` remains a guardrail against wrapper-heavy drift and compatibility-by-habit. It is not the factual roadmap. | |
| 30 | - | |
| 31 | -The references for this sprint are: | |
| 32 | - | |
| 33 | -- `refs/claw-code/rust/crates/runtime/src/bootstrap.rs` | |
| 34 | -- `refs/claw-code/rust/crates/runtime/src/session_control.rs` | |
| 35 | -- `refs/claw-code/rust/crates/runtime/src/conversation.rs` | |
| 36 | -- `refs/claw-code/rust/crates/runtime/src/lane_events.rs` | |
| 37 | -- `refs/claw-code/rust/crates/runtime/src/green_contract.rs` | |
| 38 | -- `refs/claw-code/PARITY.md` | |
| 39 | -- `refs/oh-my-codex/src/verification/verifier.ts` | |
| 40 | -- `refs/oh-my-codex/src/autoresearch/runtime.ts` | |
| 41 | -- `refs/oh-my-codex/src/autoresearch/contracts.ts` | |
| 42 | -- `refs/oh-my-codex/src/hooks/session.ts` | |
| 43 | -- `.docs/PARITY.md` | |
| 44 | -- `.docs/audit.txt` | |
| 45 | -- `.docs/audit_sprints/trunk_sitrep.md` | |
| 46 | -- `.docs/sprints/sprint25.md` | |
| 47 | - | |
| 48 | -## Deliverables | |
| 49 | - | |
| 50 | -### 1. Promote verification attempts into richer timeline records | |
| 51 | - | |
| 52 | -Sprint 25 gave Loader attempt identity. Sprint 26 should make those attempts feel like real runtime history, not only labeled states. | |
| 53 | - | |
| 54 | -Implementation targets: | |
| 55 | - | |
| 56 | -- inventory where Loader already knows attempt-order or lifecycle facts across: | |
| 57 | - - `src/loader/runtime/dod.py` | |
| 58 | - - `src/loader/runtime/finalization.py` | |
| 59 | - - `src/loader/runtime/tool_batches.py` | |
| 60 | - - `src/loader/runtime/verification_observations.py` | |
| 61 | - - `src/loader/runtime/workflow_policy.py` | |
| 62 | - - `src/loader/runtime/policy_timeline.py` | |
| 63 | - - `src/loader/runtime/workflow_timeline_read_model.py` | |
| 64 | -- define a typed verification-attempt timeline model that can preserve things like: | |
| 65 | - - when an attempt was first planned | |
| 66 | - - when it actually became active | |
| 67 | - - when it completed or was skipped | |
| 68 | - - when and why it became stale or was superseded | |
| 69 | - - which commands/evidence bundle belonged to that attempt | |
| 70 | -- keep that richer attempt model inside the canonical policy/accountability story instead of creating a side log | |
| 71 | - | |
| 72 | -The goal is to make Loader answer not only "which attempt is active?" but "what happened to attempt 2, when did attempt 3 actually start, and what proof belongs to each?" | |
| 73 | - | |
| 74 | -### 2. Tighten completion/freshness policy around attempt history | |
| 75 | - | |
| 76 | -Once attempt history is richer, completion policy should stop flattening that history into one summary string too early. | |
| 77 | - | |
| 78 | -Implementation targets: | |
| 79 | - | |
| 80 | -- connect completion and reentry decisions more directly to: | |
| 81 | - - attempt planning versus active-start moments | |
| 82 | - - completed versus superseded attempt ordering | |
| 83 | - - freshness relative to later mutating work | |
| 84 | - - explicit gaps between planned proof, running proof, and finished proof | |
| 85 | -- preserve a clear distinction between: | |
| 86 | - - proof that is scheduled but has not started | |
| 87 | - - proof that started but has not completed | |
| 88 | - - proof that completed and passed | |
| 89 | - - proof that completed and failed | |
| 90 | - - proof that was once green but is now stale because a later attempt superseded it | |
| 91 | -- ensure the canonical policy story explains why Loader trusted, rejected, or waited on one attempt instead of another | |
| 92 | - | |
| 93 | -The goal is to make Loader's stop/continue/retry logic more auditable when verification spans multiple retries or resumes. | |
| 94 | - | |
| 95 | -### 3. Narrow the public facade below `Agent` again | |
| 96 | - | |
| 97 | -Sprint 25 added a runtime-owned shell API. Sprint 26 should make that boundary more authoritative in practice. | |
| 98 | - | |
| 99 | -Implementation targets: | |
| 100 | - | |
| 101 | -- inventory what still materially depends on: | |
| 102 | - - `src/loader/agent/loop.py` | |
| 103 | - - `src/loader/runtime/public_shell.py` | |
| 104 | - - `src/loader/runtime/runtime_api.py` | |
| 105 | - - `src/loader/runtime/runtime_handle.py` | |
| 106 | - - `src/loader/cli/main.py` | |
| 107 | - - `src/loader/ui/app.py` | |
| 108 | -- migrate at least one more real caller or ownership seam onto the runtime-owned API if it is still compatibility-shaped by habit | |
| 109 | -- make remaining `Agent`-owned behavior explicitly compatibility-facing rather than ambiguous runtime glue | |
| 110 | -- add or extend boundary tests so future work does not drift back toward `Agent` as the assumed outer runtime owner | |
| 111 | - | |
| 112 | -The goal is not to delete `Agent`. The goal is to make the runtime-owned API the default answer more often, and the compatibility facade the deliberate exception. | |
| 113 | - | |
| 114 | -### 4. Improve operator visibility for attempt history and public boundary | |
| 115 | - | |
| 116 | -Once attempt records get richer and the outer boundary gets cleaner, the existing surfaces should tell that story with less reconstruction. | |
| 117 | - | |
| 118 | -Implementation targets: | |
| 119 | - | |
| 120 | -- improve the current surfaces so users can answer: | |
| 121 | - - which runtime/public boundary handled this session? | |
| 122 | - - which verification attempt is active right now? | |
| 123 | - - when did it become planned, active, stale, or superseded? | |
| 124 | - - what evidence belongs to the active attempt versus an older one? | |
| 125 | -- prefer improving: | |
| 126 | - - `loader status` | |
| 127 | - - `loader session show` | |
| 128 | - - `loader workflow show` | |
| 129 | - - the TUI status surface | |
| 130 | - over inventing a new command unless one is clearly cleaner | |
| 131 | -- keep concise rollups first, and expose deeper attempt history only where it materially improves debugging | |
| 132 | - | |
| 133 | -The goal is to make Loader's runtime boundary and verifier history easier to audit after the fact, not simply more verbose. | |
| 134 | - | |
| 135 | -## Testing strategy | |
| 136 | - | |
| 137 | -- unit coverage for: | |
| 138 | - - richer verification-attempt timeline normalization and persistence | |
| 139 | - - completion/freshness decisions that now depend on attempt history | |
| 140 | - - the narrower runtime-owned API boundary and any new caller migration | |
| 141 | -- runtime coverage for: | |
| 142 | - - a planned attempt that later becomes active and then completes | |
| 143 | - - a completed attempt that becomes stale after new mutating work | |
| 144 | - - a resumed session whose active attempt history and runtime-owner boundary remain coherent | |
| 145 | -- regression coverage for: | |
| 146 | - - no duplicate verification-attempt truth beside the canonical policy timeline | |
| 147 | - - no drift back toward `Agent` as the assumed outer runtime owner when a runtime-owned API exists | |
| 148 | - - no regression in Sprint 25's attempt-aware completion/freshness and operator surfaces | |
| 149 | - | |
| 150 | -## Definition of done | |
| 151 | - | |
| 152 | -- Loader preserves richer verification-attempt history than plain attempt labels inside the canonical policy/accountability story | |
| 153 | -- completion and freshness policy can explain which attempt was planned, active, completed, stale, or superseded and why | |
| 154 | -- the runtime-owned API below `Agent` is more authoritative in at least one more real integration seam, or remaining `Agent` ownership is explicitly justified as compatibility-only | |
| 155 | -- existing status/session/workflow/TUI surfaces expose the stronger boundary and attempt-history story without multiplying product commands | |
| 156 | -- Sprint 25's runtime-boundary and attempt-aware verification gains remain green | |
| 157 | - | |
| 158 | -## Explicitly out of scope | |
| 159 | - | |
| 160 | -- deleting `Agent` as the public compatibility surface | |
| 161 | -- full claw-code policy-engine parity | |
| 162 | -- model-authored verifier narratives as a required runtime dependency | |
| 163 | -- AST-aware semantic diffs | |
| 164 | -- a broad visual workflow UI redesign | |
| 165 | -- multi-agent or team orchestration | |
.gitignoremodified@@ -58,3 +58,4 @@ node_modules/ | ||
| 58 | 58 | CLAUDE.md |
| 59 | 59 | refs/ |
| 60 | 60 | .loader/ |
| 61 | +.docs/ | |