Loader Runtime Parity Checkpoint
Date: 2026-04-08
This file tracks the current deterministic runtime baseline for Loader. It stays intentionally narrow and operational: what the runtime can do today, what remains weak, and what scenarios we measure with repeatable tests.
Supported today
- streamed text-only replies
- native-tool round trips for `read`, `write`, `edit`, `patch`, `glob`, `grep`, `bash`, `git`, `TodoWrite`, `AskUserQuestion`, `project_memory_*`, and `notepad_*`
- explicit permission modes: `read-only`, `workspace-write`, `danger-full-access`, `prompt`, and `allow`
- tool lifecycle hooks in `pre_tool_use` → permission check → execute → `post_tool_use` / `post_tool_use_failure` order
- rule-based permission policy with workspace-local `allow` / `deny` / `ask` rules from `.loader/permission-rules.json`
- policy-backed prompting for destructive tool use, with approval context that includes mode, requirement, and matched rule information
- `loader permissions show` for normalized rule inspection, source-path visibility, and prompt-state inspection without opening JSON files by hand
- `loader permissions check` for dry-running one hypothetical tool request against the active policy, including required mode, normalized input summary, and matched-rule reasoning
- raw JSON fallback when the model emits tool syntax in plain text
- raw JSON fallback now routes through the runtime parser plus the active registry, including modern workflow tools such as `TodoWrite` and `AskUserQuestion`
- persisted definition-of-done state under `.loader/dod/`
- persisted clarify briefs under `.loader/briefs/`
- persisted implementation and verification plans under `.loader/plans/`
- persisted conversation sessions under `.loader/sessions/` plus active session state under `.loader/state/`
- persisted permission policy metadata alongside session state, so `loader status` / `loader session list` / `loader session show` can explain the effective policy that ran
- `loader --resume` and `loader --resume <session-id>` restore persisted session state
- durable project memory in `.loader/project-memory.json` and working notes in `.loader/notepad.md`
- native memory tools for `project_memory_*` and `notepad_*`
- scored workflow routing across `clarify` → `plan` → `execute` → `verify`, with route scores, runner-up pressure, unresolved-question carry-forward, and scheduled-next-mode hints
- typed workflow-signal extraction with persisted `signal_summary` context for route pressure, recent workflow history, and unresolved questions
- mode-specific system prompts for clarify, plan, execute, and verify
- intent-aware bounded clarify follow-through with explicit focus slots and persisted unresolved-question carry-forward
- pressure-pass clarify reviews with explicit readiness gates, challenged-assumption/tradeoff/example pressure kinds, and persisted clarify-pressure metadata
- codebase-backed clarify grounding with workspace evidence, repo facts, slot-aware evidence selection, pressure-aware evidence selection, and grounded brief hints for persisted clarify artifacts
- semantic artifact invalidation that can choose targeted plan refresh, clarify reentry, or full re-plan before execution continues
- structured workflow drift evidence covering confirmed touchpoints, inferred touchpoints, acceptance anchors, contradicted assumptions, verification contradictions, and task-boundary drift
- persisted workflow ledger state for assumptions, contradicted assumptions, acceptance anchors, and open/closed decision boundaries, threaded through clarify, plan, recovery, and inspection
- persisted workflow timeline entries for routes, handoffs, reentries, clarify outcomes, plan refreshes, and verify skips
- explicit verify/fix loops for mutating tasks, with a bounded retry budget
- verify/fix retries return to execute mode without re-triggering clarify or plan
- task-size-aware verification command derivation based on actual tool history
- verification command loading from persisted `verification.md` artifacts when present
- heuristic completion nudges only for non-mutating tasks; mutating tasks now complete through the DoD gate
- typed `TurnSummary` output for completed turns, including trace events and tool-result messages
- normalized per-turn usage plus cumulative session usage in `TurnSummary`
- automatic transcript compaction with priority-aware line compression and continuation instructions
- unified tool execution for native and extracted tool calls through `runtime.executor.ToolExecutor`
- typed tool-result messages backed by `Message.tool_results`
- typed prompt construction in `runtime.prompting`, with explicit dynamic sections, a static/dynamic boundary marker, and persisted prompt-format / prompt-section metadata in session state
- persisted prompt snapshot history in session state so prompt-contract changes survive resume and later inspection
- validated turn-state transitions (`prepare`, `assistant`, `repair`, `tools`, `critique`, `completion`, `finalize`) with typed transition metadata, persisted session state, and emitted runtime events
- typed workflow-decision metadata persisted in session/runtime state, including reason codes, summaries, decision kind, workflow scores, and scheduled-next-mode hints
- `loader doctor` for backend, capability, workspace, command, state, and permission health checks outside the main runtime loop
- `loader status` plus `loader session list/show/resume` for inspecting persisted runtime state without invoking the LLM
- `loader prompt show [task]` for previewing the current prompt contract, workflow mode, permission mode, dynamic sections, and prompt body without a live model request
- `loader prompt diff [session-id]` for comparing persisted prompt contracts, with concise summaries by default and unified diffs on demand
- `loader workflow show [session-id]` with `--mode`, `--kind`, and `--limit` filters plus operator-focused workflow highlights, recent timeline snippets in `loader session show`, and `--diff` / `--full-diff` artifact comparison for persisted workflow artifacts
- `loader explore <prompt>` as a read-only lookup lane with its own prompt, constrained registry, persisted bounded continuity under `.loader/state/explore.json`, and `--fresh` to ignore prior explore history when needed
- `RuntimeContext` is now the primary runtime seam for workflow state, turn phases, response repair, no-tool completion, response routing, turn looping, finalization, workflow lanes, and workflow recovery; the older `RuntimeLegacyServices` shim has been removed
- shared runtime bootstrap through `runtime.bootstrap.build_runtime_context(...)` / `sync_runtime_context(...)`, with both conversation and explore runtimes constructing typed context through the same runtime-owned contract
- runtime-owned safeguard and reasoning helpers now have canonical homes under `src/loader/runtime/`; `src/loader/agent/safeguards.py` and `src/loader/agent/reasoning.py` are compatibility-export layers rather than the primary implementations
- the public launcher contract now owns conversational routing, decomposition entry routing, direct turn routing, and explore launch through `src/loader/runtime/launcher.py`, which leaves a smaller and more honest `src/loader/agent/loop.py`
- compatibility exports are now explicitly bounded by direct tests, and internal runtime code is guarded against drifting back to `agent/reasoning.py` / `agent/safeguards.py` imports
- CLI and TUI status surfaces for model, capability profile, mode, workflow mode, workflow reason, last transition summary, permission mode, explicit turn phase, prompt format/sections, DoD phase, pending items, last verification result, and active session id
- CLI status now also surfaces recent explore activity, including bounded explore turn/message counts and the last explore query
- CLI and TUI workflow-mode visibility plus artifact notifications
- CLI and TUI permission-mode visibility with color-coded status
- workspace-bound file operations with canonicalized boundary checks, binary detection, size limits, and structured patch metadata
- shell mutability classification plus structured truncation and stderr/exit-code metadata
- richer structured `AskUserQuestion` prompts with titles, context, options, and optional freeform responses
- honest repair/completion behavior for no-tool turns: empty assistant replies get a single explicit retry, and Loader no longer relies on synthetic prefill, fake-tool scolding reroutes, or self-critique puppeting for plain-text answers
- dedicated assistant-response routing in `runtime.response_routing`, so final-answer, tool-batch, and no-tool completion dispatch no longer live inline inside `turn_iteration.py`
- assistant-turn request handling now lives in `runtime.assistant_turns`, clarify/plan lane execution now lives in `runtime.workflow_lanes`, tool-batch execution/recovery now lives in `runtime.tool_batches`, DoD/finalization logic now lives in `runtime.finalization`, workflow-state/session mutation lives in `runtime.workflow_state`, and the main loop now runs through `runtime.turn_preparation`, `runtime.turn_preamble`, `runtime.turn_iteration`, and `runtime.turn_loop` instead of accumulating further inside `conversation.py`
- `src/loader/runtime/conversation.py` now acts as a compact coordinator over dedicated runtime controllers rather than owning a monolithic turn loop
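
The hook ordering listed above (`pre_tool_use` → permission check → execute → `post_tool_use` / `post_tool_use_failure`) can be illustrated with a minimal sketch. This is not Loader's `ToolExecutor`: the `ToolCall`, `HookTrace`, and `run_tool` names are hypothetical, and treating a denied call as a `post_tool_use_failure` event is an assumption made only for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class ToolCall:
    """Hypothetical tool request: a tool name plus its arguments."""
    name: str
    arguments: dict


@dataclass
class HookTrace:
    """Records lifecycle events in the order they fire."""
    events: list = field(default_factory=list)


def run_tool(
    call: ToolCall,
    handler: Callable[[dict], str],
    is_allowed: Callable[[ToolCall], bool],
    trace: HookTrace,
) -> Optional[str]:
    """Run one tool call through the documented hook order."""
    trace.events.append("pre_tool_use")
    trace.events.append("permission_check")
    if not is_allowed(call):
        # Assumption: a denied call never executes and is reported as a failure.
        trace.events.append("post_tool_use_failure")
        return None
    trace.events.append("execute")
    try:
        result = handler(call.arguments)
    except Exception:
        trace.events.append("post_tool_use_failure")
        return None
    trace.events.append("post_tool_use")
    return result


ok_trace = HookTrace()
output = run_tool(
    ToolCall("read", {"path": "README.md"}),
    handler=lambda args: f"contents of {args['path']}",
    is_allowed=lambda call: True,
    trace=ok_trace,
)
print(ok_trace.events)
# ['pre_tool_use', 'permission_check', 'execute', 'post_tool_use']

denied_trace = HookTrace()
run_tool(
    ToolCall("write", {"path": "README.md"}),
    handler=lambda args: "written",
    is_allowed=lambda call: False,
    trace=denied_trace,
)
print(denied_trace.events)
# ['pre_tool_use', 'permission_check', 'post_tool_use_failure']
```

Recording events in a plain list keeps the ordering easy to assert on in a deterministic test, which is the property the parity scenarios depend on.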
Known weak spots
- the hot runtime path no longer depends on a hidden bootstrap helper, but `src/loader/runtime/conversation.py` and `src/loader/runtime/explore.py` still start from an `Agent`-shaped bootstrap source at the public entrypoint layer
- `src/loader/agent/loop.py` is much smaller and less misleading than the pre-Sprint-15 shell, but it still owns prompt/session factories, resume/clear lifecycle, and UI-facing entrypoint glue instead of collapsing fully to a minimal public facade
- `src/loader/agent/reasoning.py` and `src/loader/agent/safeguards.py` are now compatibility shims rather than primary implementations, but they still remain as export layers until Loader narrows its external compatibility surface further
- `src/loader/runtime/tool_batches.py` and parts of `src/loader/runtime/workflow_lanes.py` are narrower and more directly tested than before, but they still carry more heuristic policy than the tightest reference seams in `refs/claw-code`
- the workflow policy now consumes typed signals, but signal extraction is still heuristic and hand-tuned; Loader does not yet implement OMX's deeper ambiguity analysis, richer pressure-pass discipline, or branch-specific policy depth
- clarify is now intent-aware, pressure-aware, and codebase-grounded, but it is still much shallower than OMX's deep-interview behavior and does not adapt its budget or questioning style by task class
- plan freshness now handles broader semantic invalidation with typed evidence, but it is still lightweight and runtime-authored; Loader does not yet reason deeply over richer artifact metadata, contradicting verification evidence, or larger task reframes
- the workflow ledger is now explicit and persisted, but it is still a pragmatic text-first contract rather than deeper symbolic task/state reasoning with stronger provenance
- plan mode is still a single-pass artifact generator, not a Planner/Architect/Critic consensus loop
- DoD acceptance criteria and pending items are stronger than Sprint 02, but todo progress is still lightly structured compared with claw-code's richer workflow state
- evidence summaries are deterministic runtime summaries of captured output, not model-written verification narratives
- session compaction summaries are heuristic runtime summaries, not model-assisted continuity artifacts
- project-memory capture on finalized DoD evidence is still lightweight and command-summary oriented, not semantically curated memory extraction
- rule syntax is intentionally narrow and workspace-local; Loader still does not have claw-code's richer rule model or broader prompt/allow operator surface
- policy state is inspectable in doctor/status/session surfaces and dry-runnable through `loader permissions show/check`, but there is not yet a richer UX for editing, previewing multiple candidate rule sets, or temporarily overriding rules from the product surface
- prompt assembly is now typed, previewable, and diffable across persisted sessions, but Loader still does not compare multiple candidate prompt contracts before execution or enforce a richer prompt-contract parity harness beyond the current unit and inspection coverage
- workflow history is now filterable, ledger-backed, and diffable for persisted artifacts, but it is still text-first; Loader still does not offer semantic/AST-aware artifact diffs, richer artifact preview UX, or a visual workflow trace
- shell safety is still heuristic and command-based; Loader does not yet have a richer shell sandbox or argument-aware mutability model
- explore mode now has lightweight transcript continuity, but it is still a narrow read-only lookup lane rather than a richer interactive inspection workflow with deeper repo navigation affordances or dedicated explore inspection commands
- the read-only `git` helper is intentionally narrow compared with claw-code and OMX's broader repo/product surfaces, and the `patch` tool still stops short of AST/LSP-aware editing
Out of scope in the current baseline
- richer permission-rule UX / per-command allowlists
- multi-agent / team orchestration
Deterministic parity scenarios
The auditable manifest lives at tests/fixtures/runtime_parity_manifest.json and is exercised by tests/test_runtime_harness.py. Sprint 04 adds focused workflow integration coverage in tests/test_workflow_runtime.py and artifact/router unit coverage in tests/test_workflow.py. Sprint 06 adds inspection/explore coverage in tests/test_inspection.py, tests/test_explore_runtime.py, and tests/test_expanded_tools.py. Sprint 10 extends that workflow coverage in tests/test_workflow_policy.py, tests/test_workflow_runtime.py, and tests/test_inspection.py for scored routing, clarify-budget behavior, plan refresh, and workflow timeline inspection. Sprint 11 adds tests/test_workflow_signals.py, tests/test_clarify_strategy.py, tests/test_artifact_invalidation.py, and expanded inspection/runtime coverage for signal summaries, intent-aware clarify, semantic replan recovery, and workflow timeline filtering/highlights. Sprint 12 adds tests/test_clarify_grounding.py, tests/test_turn_preparation.py, tests/test_turn_completion.py, tests/test_turn_iteration.py, tests/test_turn_preamble.py, tests/test_workflow_state.py, and tests/test_turn_loop.py for grounded clarify, structured recovery evidence, and the controllerized turn runtime. Sprint 13 adds tests/test_runtime_repair_flows.py, tests/test_response_routing.py, tests/test_workflow_ledger.py, and expanded tests/test_session_state.py / tests/test_inspection.py coverage for honest repair behavior, dedicated response routing, persisted semantic ledger state, prompt snapshot history, and prompt/artifact diff inspection. Sprint 15 adds tests/test_runtime_bootstrap.py, tests/test_safeguard_services.py, tests/test_reasoning_compat.py, and updated tests/test_runtime_context.py coverage for the shared bootstrap seam plus the runtime-owned safeguards/reasoning compatibility contract. 
Sprint 16 adds tests/test_runtime_launcher.py, tests/test_chat_lane.py, tests/test_decomposition_lane.py, tests/test_compat_boundaries.py, and expanded tests/test_explore_runtime.py / tests/test_inspection.py coverage for the public launcher contract, compatibility boundaries, and persisted explore continuity.
- `streaming_text`: green
- `read_file_roundtrip`: green
- `multi_tool_turn_roundtrip`: green
- `write_file_allowed`: green
- `write_file_denied`: green
- `bash_stdout_roundtrip`: green
- `bash_confirmation_prompt_approved`: green
- `bash_confirmation_prompt_denied`: green
- `read_only_mode_denies_write`: green
- `read_only_mode_denies_mutating_bash`: green
- `read_only_mode_allows_safe_bash`: green
- `workspace_write_denies_write_outside_root`: green
- `danger_full_access_allows_dangerous_bash`: green
- `prompt_mode_prompts_destructive_write`: green
- `allow_mode_skips_prompt_for_destructive_write`: green
- `deny_rule_blocks_allowed_mode`: green
- `ask_rule_prompts_even_when_mode_would_allow`: green
- `raw_json_tool_call_fallback`: green
- `completion_check_continuation`: green
- `tool_result_contract_regression`: green
- `turn_summary_smoke_for_multi_tool_turn`: green
- `native_and_raw_tool_paths_share_executor_trace`: green
- `backend_capability_probe_refreshes_native_tool_mode`: green
- `run_streaming_delegates_to_primary_runtime`: green
- `definition_of_done_verify_phase`: green
- `verify_failure_routes_to_fix_loop`: green
- `verify_retry_budget_exhaustion`: green
- `ambiguous_prompt_routes_to_clarify`: green
- `complex_prompt_routes_to_plan`: green
- `verify_failure_fix_loop_does_not_reroute_workflow`: green
- `conversational_task_skips_verify_phase`: green
- `explore_mode_skips_dod_and_router`: green
- `explore_mode_denies_write`: green
- `explore_mode_ignores_global_allow_policy`: green
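
Several of the permission scenarios above describe how modes and rules interact. The sketch below is one plausible reading of them, not Loader's policy engine: the `Rule` shape, the `decide` function, glob matching on tool names, and the precedence (a matching `deny` beats a matching `ask`, which beats the mode default) are all assumptions inferred from the scenario names. `allow` rules, workspace-boundary checks, and `danger-full-access` specifics are left out.

```python
from __future__ import annotations

from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule shape: an effect plus a glob over tool names."""
    effect: str  # "deny" or "ask" in this sketch
    tool: str


def decide(mode: str, tool: str, mutating: bool, rules: list) -> str:
    """Return "allow", "deny", or "ask" for one hypothetical tool request."""
    matched = {rule.effect for rule in rules if fnmatch(tool, rule.tool)}
    # Assumed precedence: deny rules win, then ask rules, then the mode default.
    if "deny" in matched:
        return "deny"
    if "ask" in matched:
        return "ask"
    if mode == "read-only" and mutating:
        return "deny"
    if mode == "prompt" and mutating:
        return "ask"
    return "allow"


# Mirrors read_only_mode_denies_write
print(decide("read-only", "write", True, []))                      # deny
# Mirrors deny_rule_blocks_allowed_mode
print(decide("allow", "write", True, [Rule("deny", "write")]))     # deny
# Mirrors ask_rule_prompts_even_when_mode_would_allow
print(decide("allow", "bash", True, [Rule("ask", "bash")]))        # ask
# Mirrors allow_mode_skips_prompt_for_destructive_write
print(decide("allow", "write", True, []))                          # allow
```

Keeping the decision a pure function of mode, request, and rules is what makes a dry run such as `loader permissions check` possible without a live turn.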
Verification snapshot
As of 2026-04-08:
- `uv run pytest -q`: 329 passed
- `tests/test_runtime_harness.py` is fully green, including permission-mode parity, DoD verify/fix coverage, workflow routing parity, and the original contract regression
- `tests/test_prompt_builder.py` covers section rendering, native-vs-ReAct formatting, and prompt metadata persistence
- `tests/test_turn_state_machine.py` covers allowed/disallowed turn transitions and terminal transition metadata
- `tests/test_runtime_phases.py` covers repair/completion phase transitions plus persisted transition metadata in runtime events and session state
- `tests/test_runtime_repair_flows.py` covers honest empty-response retries, no synthetic prefill on first turns, and the removal of the older no-tool puppeting/scolding reroutes
- `tests/test_runtime_context.py` and `tests/test_runtime_state_controllers.py` cover typed runtime-context construction plus direct workflow-state and phase-tracker behavior without relying on a full `Agent`
- `tests/test_runtime_bootstrap.py` covers the shared runtime bootstrap contract, prompt/capability synchronization, and direct conversation/explore construction through the runtime bootstrap seam
- `tests/test_runtime_launcher.py`, `tests/test_chat_lane.py`, and `tests/test_decomposition_lane.py` cover the public launcher contract, conversational entry routing, decomposition entry routing, and direct runtime turn delegation
- `tests/test_safeguard_services.py` covers the canonical runtime safeguard implementation plus the compatibility-export path under `loader.agent.safeguards`
- `tests/test_reasoning_compat.py` covers runtime-owned deliberation/completion helpers plus the compatibility-export path under `loader.agent.reasoning`
- `tests/test_compat_boundaries.py` fails if internal Loader code drifts back to importing runtime-owned helpers through compatibility shims
- `tests/test_repair.py` covers raw-text fallback through the runtime parser and active registry, including `TodoWrite` recovery
- `tests/test_completion_policy.py` covers direct text-loop bailout and continuation-prompt behavior on the typed runtime context
- `tests/test_response_routing.py` covers direct final-answer routing and halted tool-batch routing at the new response-policy seam
- `tests/test_dod.py` covers persistence, sizing boundaries, and verification command derivation
- `tests/test_workflow.py` covers workflow artifact round trips, scored-router expectations, DoD workflow links, and todo-to-DoD syncing
- `tests/test_workflow_signals.py` covers typed signal extraction, recent timeline pressure, and persisted `signal_summary` context
- `tests/test_clarify_strategy.py` covers clarify-slot prioritization and targeted follow-up question selection
- `tests/test_clarify_grounding.py` covers workspace evidence extraction, slot-aware/pressure-aware clarify grounding, and grounded clarify-brief hints
- `tests/test_workflow_policy.py` covers score breakdowns, clarify follow-up reviews, signal-summary persistence, artifact-freshness metadata, and workflow timeline serialization
- `tests/test_artifact_invalidation.py` covers semantic invalidation triggers plus targeted recovery selection for plan refresh vs full re-plan
- `tests/test_workflow_ledger.py` covers ledger seeding, contradiction tracking, acceptance-anchor updates, and operator-facing highlight summaries
- `tests/test_workflow_runtime.py` covers clarify routing, intent-aware clarify continuation, plan routing, targeted plan refresh, full re-plan through clarify reentry, verify-fix workflow handoff, persisted workflow-decision metadata, and workflow-ledger updates through runtime recovery
- `tests/test_turn_preparation.py`, `tests/test_turn_completion.py`, `tests/test_turn_iteration.py`, `tests/test_turn_preamble.py`, `tests/test_workflow_state.py`, and `tests/test_turn_loop.py` cover the controllerized turn-runtime seams directly instead of relying only on large end-to-end runtime tests
- `tests/test_workflow_tools.py` and `tests/test_workflow_runtime_tools.py` cover `TodoWrite`, `AskUserQuestion`, and runtime callback plumbing
- `tests/test_session_state.py` covers session persistence, resume, rotation, compaction persistence, cumulative usage rollups, persisted permission-policy metadata, workflow-ledger state, and prompt snapshot history
- `tests/test_compaction.py` covers claw-style line compression and compacted continuation-message behavior
- `tests/test_memory_tools.py` covers project-memory writes, notepad writes, lifecycle-hook mirroring, and DoD-summary capture into project memory
- `tests/test_cli_resume.py` covers `--resume` argument rewriting for latest and named-session restore
- `tests/test_inspection.py` covers `loader doctor`, `loader status`, `loader session list/show`, `loader permissions show/check`, `loader prompt show`, `loader prompt diff`, `loader workflow show --diff`, workflow timeline filtering/highlights, and workflow inspection surfaces, including recent explore activity
- `tests/test_explore_runtime.py` covers the direct explore lane contract, forced read-only behavior, persisted follow-up continuity, and `fresh` explore resets outside the parity harness
- `tests/test_expanded_tools.py` covers structured patch application, read-only git helpers, `notepad_append`, and richer structured user questions
- `tests/test_permissions.py` covers prompt/allow mode parsing, rule precedence, policy-backed prompting behavior, and hook lifecycle ordering
- `tests/test_tool_safety.py` covers workspace boundaries, binary/oversize guards, patch metadata, and shell truncation/classification
- `tests/test_status_surfaces.py` covers the CLI/TUI DoD, workflow-mode, permission-mode, capability-profile, and session-id formatting helpers
- native and extracted tool calls now record the same executor trace events, with source-specific metadata
- turn startup can refine backend capability profiles before the first request, `run_streaming()` delegates into the main runtime path, mutating tasks route through persisted evidence-backed completion, workflow artifacts and workflow-ledger state survive across turns, sessions compact safely, explore queries bypass DoD/router overhead safely, policy rules are enforced deterministically, operators can inspect/dry-run policy decisions without live turns, prompt construction is sectioned and persisted, prompt snapshots and artifact diffs are inspectable after the fact, explicit turn phases are visible while a turn runs, session inspection preserves effective policy state, typed workflow signals now feed routing directly, semantic invalidation can force targeted refresh vs full re-plan, brownfield clarify can ask evidence-backed questions from repo facts, and the turn runtime now avoids the older synthetic repair/no-tool puppeting while routing assistant outcomes through dedicated controllers instead of a single conversation-loop monolith
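
The snapshot mentions coverage for allowed and disallowed turn transitions. As a rough illustration of what a validated turn-state machine over the documented phases could look like, here is a minimal sketch. Only the phase names come from this file; the `ALLOWED` edge table, `InvalidTransition`, `advance`, and `is_terminal` are hypothetical, and Loader's real transition rules may differ.

```python
from __future__ import annotations

# Phase names as listed in this checkpoint.
PHASES = ("prepare", "assistant", "repair", "tools", "critique", "completion", "finalize")

# Hypothetical edge table, chosen only to make the example concrete.
ALLOWED = {
    "prepare": frozenset({"assistant"}),
    "assistant": frozenset({"repair", "tools", "completion", "finalize"}),
    "repair": frozenset({"assistant"}),
    "tools": frozenset({"assistant", "critique", "completion"}),
    "critique": frozenset({"assistant", "completion"}),
    "completion": frozenset({"assistant", "finalize"}),
    "finalize": frozenset(),
}


class InvalidTransition(ValueError):
    """Raised when a requested phase change is not in the edge table."""


def advance(current: str, target: str) -> str:
    """Return the new phase, or raise if the transition is not allowed."""
    if target not in ALLOWED.get(current, frozenset()):
        raise InvalidTransition(f"{current} -> {target} is not allowed")
    return target


def is_terminal(phase: str) -> bool:
    """A phase with no outgoing edges ends the turn."""
    return not ALLOWED[phase]


phase = "prepare"
for step in ("assistant", "tools", "assistant", "finalize"):
    phase = advance(phase, step)
print(phase, is_terminal(phase))  # finalize True

try:
    advance("finalize", "assistant")
except InvalidTransition as exc:
    print(exc)  # finalize -> assistant is not allowed
```

An explicit edge table keeps disallowed transitions a data question rather than a control-flow question, so a test can enumerate every pair of phases deterministically.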
Definition of honesty
- If a scenario is green here, it should have deterministic automated coverage.
- If a scenario is flaky or broken, it should be called out here before we claim parity work is done.
- Sprint 01 turned the original `tool_call_id` regression green by fixing the message contract, not by weakening the test.
- Sprint 02 replaced "looks done" completion for mutating tasks with a real verify/fix gate, but it has not yet reached the richer workflow contracts described in the report and Sprint 04+.
- Sprint 03 established permission modes, hooks, and tool hardening, but it intentionally stops short of claw-code's fuller rule engine and prompt/allow permission variants.
- Sprint 04 adds routing, artifacts, and structured user questions, but it is still a first-pass workflow layer rather than full OMX consensus planning or deep interview rigor.
- Sprint 05 adds durable sessions, resume, compaction, and native memory/notepad tools, but it stops short of Sprint 06's inspectable session/status product surfaces and still uses heuristic continuity summaries rather than richer semantic memory extraction.
- Sprint 06 adds inspectable product surfaces, a constrained explore lane, and a broader tool registry, but it still stops short of interactive explore workflows, richer git ergonomics, AST/LSP-aware editing, or any multi-agent/team runtime.
- Sprint 07 is complete: Loader now has prompt/allow modes, rule-based permission policy, policy-backed prompting, persisted policy inspection state, and smaller assistant-turn/tool-batch/finalization runtime seams, but it still stops short of a richer rule UX, deeper policy sandboxing, and the more opinionated workflow/runtime contracts in the refs.
- Sprint 08 is complete: Loader now has a typed prompt builder, explicit runtime turn phases, first-class `loader permissions show/check` operator surfaces, and more coherent prompt/policy observability in doctor/status/session output, but it still stops short of a richer rule editor, formal state-machine routing, prompt-preview tooling, or the deeper workflow rigor in the refs.
- Sprint 09 is complete: Loader now has a validated turn state machine, typed persisted workflow decisions, `loader prompt show`, and richer workflow-reason/transition inspection surfaces, but it still stops short of deeper workflow routing policy, prompt diffing/versioning, and the more opinionated planning discipline used by the refs.
- Sprint 10 is complete: Loader now has a scored workflow policy, bounded clarify follow-through, targeted plan-refresh discipline, persisted workflow timeline inspection, and a greener workflow test contract, but it still stops short of deeper semantic routing/replanning, adaptive clarify depth, prompt-history diffing, or the fuller workflow rigor used by the refs.
- Sprint 11 is complete: Loader now has typed workflow signals, intent-aware clarify slots, semantic invalidation with targeted recovery vs full re-plan, filtered workflow inspection, and dedicated clarify/plan lane execution in `runtime.workflow_lanes`, but it still stops short of OMX-style pressure-pass interviewing, richer semantic artifact reasoning, and the deeper end-to-end orchestration discipline in the refs.
- Sprint 12 is complete: Loader now has pressure-pass clarify behavior, codebase-backed clarify grounding, structured recovery evidence, controllerized turn-runtime seams, and a genuinely coordinator-shaped `runtime.conversation`, but it still stops short of OMX-style deep interview depth, richer semantic artifact reasoning, artifact/prompt diff ergonomics, and the broader operator/runtime sophistication in the refs.
- Sprint 13 is complete: Loader now avoids synthetic prefill and no-tool puppeting, persists a semantic workflow ledger, exposes prompt/artifact diff surfaces, and routes assistant responses through a dedicated response-policy seam, but it still stops short of claw-code's narrower response/tool policy factoring, deeper planning rigor, and broader operator ergonomics.
- Sprint 14 is complete: Loader now treats `RuntimeContext` as the primary seam across workflow state, turn phases, response policy, turn looping, workflow recovery, and finalization; the old runtime legacy shim is gone, raw-text tool recovery no longer depends on hidden agent extractors, and the hot path is substantially more runtime-owned, but bootstrap ownership still begins at the agent wrapper and Loader still stops short of claw-code's fuller policy engine, OMX's deeper workflow rigor, and richer operator/runtime surfaces.
- Sprint 15 is complete: Loader now has a shared runtime bootstrap seam, runtime-owned safeguards and deliberation helpers, compatibility-only `agent/reasoning.py` and `agent/safeguards.py`, no remaining `Agent._build_runtime_context()` helper, and a much smaller `agent/loop.py` after dead planner/raw-extraction cleanup, but it still stops short of a minimal entrypoint shell, deeper explore ergonomics, claw-code's tighter policy engine, and OMX's richer planning/interview rigor.
- Sprint 16 is complete: Loader now has a first-class runtime launcher contract, a thinner `agent/loop.py`, explicit compatibility-boundary proof, and persisted explore continuity plus status visibility, but it still stops short of a fully minimal public shell, richer explore workflows, claw-code's tighter policy seams, and OMX's deeper planning/interview rigor.
| 50 | - typed workflow-decision metadata persisted in session/runtime state, including reason codes, summaries, decision kind, workflow scores, and scheduled-next-mode hints |
| 51 | - `loader doctor` for backend, capability, workspace, command, state, and permission health checks outside the main runtime loop |
| 52 | - `loader status` plus `loader session list/show/resume` for inspecting persisted runtime state without invoking the LLM |
| 53 | - `loader prompt show [task]` for previewing the current prompt contract, workflow mode, permission mode, dynamic sections, and prompt body without a live model request |
| 54 | - `loader prompt diff [session-id]` for comparing persisted prompt contracts, with concise summaries by default and unified diffs on demand |
| 55 | - `loader workflow show [session-id]` with `--mode`, `--kind`, and `--limit` filters plus operator-focused workflow highlights, recent timeline snippets in `loader session show`, and `--diff` / `--full-diff` artifact comparison for persisted workflow artifacts |
| 56 | - `loader explore <prompt>` as a read-only lookup lane with its own prompt, constrained registry, persisted bounded continuity under `.loader/state/explore.json`, and `--fresh` to ignore prior explore history when needed |
| 57 | - `RuntimeContext` is now the primary runtime seam for workflow state, turn phases, response repair, no-tool completion, response routing, turn looping, finalization, workflow lanes, and workflow recovery; the older `RuntimeLegacyServices` shim has been removed |
| 58 | - shared runtime bootstrap through `runtime.bootstrap.build_runtime_context(...)` / `sync_runtime_context(...)`, with both conversation and explore runtimes constructing typed context through the same runtime-owned contract |
| 59 | - runtime-owned safeguard and reasoning helpers now have canonical homes under `src/loader/runtime/`; `src/loader/agent/safeguards.py` and `src/loader/agent/reasoning.py` are compatibility-export layers rather than the primary implementations |
| 60 | - the public launcher contract now owns conversational routing, decomposition entry routing, direct turn routing, and explore launch through `src/loader/runtime/launcher.py`, which leaves a smaller and more honest `src/loader/agent/loop.py` |
| 61 | - compatibility exports are now explicitly bounded by direct tests, and internal runtime code is guarded against drifting back to `agent/reasoning.py` / `agent/safeguards.py` imports |
| 62 | - CLI and TUI status surfaces for model, capability profile, mode, workflow mode, workflow reason, last transition summary, permission mode, explicit turn phase, prompt format/sections, DoD phase, pending items, last verification result, and active session id |
| 63 | - CLI status now also surfaces recent explore activity, including bounded explore turn/message counts and the last explore query |
| 64 | - CLI and TUI workflow-mode visibility plus artifact notifications |
| 65 | - CLI and TUI permission-mode visibility with color-coded status |
| 66 | - workspace-bound file operations with canonicalized boundary checks, binary detection, size limits, and structured patch metadata |
| 67 | - shell mutability classification plus structured truncation and stderr/exit-code metadata |
| 68 | - richer structured `AskUserQuestion` prompts with titles, context, options, and optional freeform responses |
| 69 | - honest repair/completion behavior for no-tool turns: empty assistant replies get a single explicit retry, and Loader no longer relies on synthetic prefill, fake-tool scolding reroutes, or self-critique puppeting for plain-text answers |
| 70 | - dedicated assistant-response routing in `runtime.response_routing`, so final-answer, tool-batch, and no-tool completion dispatch no longer live inline inside `turn_iteration.py` |
| 71 | - assistant-turn request handling now lives in `runtime.assistant_turns`, clarify/plan lane execution now lives in `runtime.workflow_lanes`, tool-batch execution/recovery now lives in `runtime.tool_batches`, DoD/finalization logic now lives in `runtime.finalization`, workflow-state/session mutation lives in `runtime.workflow_state`, and the main loop now runs through `runtime.turn_preparation`, `runtime.turn_preamble`, `runtime.turn_iteration`, and `runtime.turn_loop` instead of accumulating further inside `conversation.py` |
| 72 | - `src/loader/runtime/conversation.py` now acts as a compact coordinator over dedicated runtime controllers rather than owning a monolithic turn loop |
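
The validated turn-state transitions listed above can be illustrated with a small table-driven sketch. The phase names come from the list; the transition table itself, the `InvalidTransition` exception, and the `transition(...)` helper are assumptions made for illustration and are not taken from Loader's source.

```python
from enum import Enum


class TurnPhase(str, Enum):
    # Phase names as listed in this checkpoint.
    PREPARE = "prepare"
    ASSISTANT = "assistant"
    REPAIR = "repair"
    TOOLS = "tools"
    CRITIQUE = "critique"
    COMPLETION = "completion"
    FINALIZE = "finalize"


# ASSUMPTION: this table is hypothetical; Loader's real transition
# table may allow different edges.
ALLOWED = {
    TurnPhase.PREPARE: {TurnPhase.ASSISTANT},
    TurnPhase.ASSISTANT: {TurnPhase.REPAIR, TurnPhase.TOOLS, TurnPhase.COMPLETION},
    TurnPhase.REPAIR: {TurnPhase.ASSISTANT},
    TurnPhase.TOOLS: {TurnPhase.ASSISTANT, TurnPhase.CRITIQUE},
    TurnPhase.CRITIQUE: {TurnPhase.ASSISTANT, TurnPhase.COMPLETION},
    TurnPhase.COMPLETION: {TurnPhase.ASSISTANT, TurnPhase.FINALIZE},
    TurnPhase.FINALIZE: set(),
}


class InvalidTransition(ValueError):
    """Raised when a phase change is not in the allowed table."""


def transition(current: TurnPhase, target: TurnPhase, reason: str = "") -> dict:
    """Validate a phase change and return typed-looking transition metadata."""
    if target not in ALLOWED[current]:
        raise InvalidTransition(f"{current.value} -> {target.value} is not allowed")
    return {
        "from": current.value,
        "to": target.value,
        "reason": reason,
        # A phase with no outgoing edges is treated as terminal.
        "terminal": not ALLOWED[target],
    }
```

Keeping the table as data means a disallowed edge fails loudly at the moment it is attempted, which is the property a test such as `tests/test_turn_state_machine.py` would want to pin down.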
| 73 | |
| 74 | ## Known weak spots |
| 75 | |
| 76 | - the hot runtime path no longer depends on a hidden bootstrap helper, but [`src/loader/runtime/conversation.py`](../src/loader/runtime/conversation.py) and [`src/loader/runtime/explore.py`](../src/loader/runtime/explore.py) still start from an `Agent`-shaped bootstrap source at the public entrypoint layer |
| 77 | - [`src/loader/agent/loop.py`](../src/loader/agent/loop.py) is much smaller and less misleading than the pre-Sprint-15 shell, but it still owns prompt/session factories, resume/clear lifecycle, and UI-facing entrypoint glue instead of collapsing fully to a minimal public facade |
| 78 | - [`src/loader/agent/reasoning.py`](../src/loader/agent/reasoning.py) and [`src/loader/agent/safeguards.py`](../src/loader/agent/safeguards.py) are now compatibility shims rather than primary implementations, but they still remain as export layers until Loader narrows its external compatibility surface further |
| 79 | - [`src/loader/runtime/tool_batches.py`](../src/loader/runtime/tool_batches.py) and parts of [`src/loader/runtime/workflow_lanes.py`](../src/loader/runtime/workflow_lanes.py) are narrower and more directly tested than before, but they still carry more heuristic policy than the tightest reference seams in `refs/claw-code` |
| 80 | - the workflow policy now consumes typed signals, but signal extraction is still heuristic and hand-tuned; Loader does not yet implement OMX's deeper ambiguity analysis, richer pressure-pass discipline, or branch-specific policy depth |
| 81 | - clarify is now intent-aware, pressure-aware, and codebase-grounded, but it is still much shallower than OMX's deep-interview behavior and does not adapt its budget or questioning style by task class |
| 82 | - plan freshness now handles broader semantic invalidation with typed evidence, but it is still lightweight and runtime-authored; Loader does not yet reason deeply over richer artifact metadata, contradicting verification evidence, or larger task reframes |
| 83 | - the workflow ledger is now explicit and persisted, but it is still a pragmatic text-first contract rather than deeper symbolic task/state reasoning with stronger provenance |
| 84 | - plan mode is still a single-pass artifact generator, not a Planner/Architect/Critic consensus loop |
| 85 | - DoD acceptance criteria and pending items are stronger than in Sprint 02, but todo progress is still lightly structured compared with claw-code's richer workflow state |
| 86 | - evidence summaries are deterministic runtime summaries of captured output, not model-written verification narratives |
| 87 | - session compaction summaries are heuristic runtime summaries, not model-assisted continuity artifacts |
| 88 | - project-memory capture on finalized DoD evidence is still lightweight and command-summary oriented, not semantically curated memory extraction |
| 89 | - rule syntax is intentionally narrow and workspace-local; Loader still does not have claw-code's richer rule model or broader prompt/allow operator surface |
| 90 | - policy state is inspectable in doctor/status/session surfaces and dry-runnable through `loader permissions show/check`, but there is not yet a richer UX for editing, previewing multiple candidate rule sets, or temporarily overriding rules from the product surface |
| 91 | - prompt assembly is now typed, previewable, and diffable across persisted sessions, but Loader still does not compare multiple candidate prompt contracts before execution or enforce a richer prompt-contract parity harness beyond the current unit and inspection coverage |
| 92 | - workflow history is now filterable, ledger-backed, and diffable for persisted artifacts, but it is still text-first; Loader still does not offer semantic/AST-aware artifact diffs, richer artifact preview UX, or a visual workflow trace |
| 93 | - shell safety is still heuristic and command-based; Loader does not yet have a richer shell sandbox or argument-aware mutability model |
| 94 | - explore mode now has lightweight transcript continuity, but it is still a narrow read-only lookup lane rather than a richer interactive inspection workflow with deeper repo navigation affordances or dedicated explore inspection commands |
| 95 | - the read-only `git` helper is intentionally narrow compared with claw-code and OMX's broader repo/product surfaces, and the `patch` tool still stops short of AST/LSP-aware editing |
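
To make the "heuristic and command-based" shell-safety limitation above concrete, here is a minimal sketch of a classifier that looks only at the first token of a command. The command sets and the `classify_command` name are hypothetical and are not Loader's actual classifier; the point is only to show what an argument-blind model cannot see.

```python
import shlex

# ASSUMPTION: illustrative command sets, not Loader's real lists.
READ_ONLY_COMMANDS = {"ls", "cat", "grep", "pwd", "head", "tail"}
MUTATING_COMMANDS = {"rm", "mv", "cp", "chmod", "touch", "mkdir"}


def classify_command(command: str) -> str:
    """Return 'read-only', 'mutating', or 'unknown' from the first token only."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Unbalanced quotes and similar parse failures.
        return "unknown"
    if not tokens:
        return "unknown"
    name = tokens[0]
    if name in MUTATING_COMMANDS:
        return "mutating"
    if name in READ_ONLY_COMMANDS:
        return "read-only"
    return "unknown"
```

Under this sketch, `git status` and `git push origin main` get the same answer, and `cat notes.txt > copy.txt` is classified as read-only even though the redirect writes a file. An argument-aware model would need to inspect subcommands and redirections to close those gaps.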
| 96 | |
| 97 | ## Out of scope in the current baseline |
| 98 | |
| 99 | - richer permission-rule UX / per-command allowlists |
| 100 | - multi-agent / team orchestration |
| 101 | |
| 102 | ## Deterministic parity scenarios |
| 103 | |
| 104 | The auditable manifest lives at [`tests/fixtures/runtime_parity_manifest.json`](../tests/fixtures/runtime_parity_manifest.json) and is exercised by [`tests/test_runtime_harness.py`](../tests/test_runtime_harness.py). Sprint 04 adds focused workflow integration coverage in [`tests/test_workflow_runtime.py`](../tests/test_workflow_runtime.py) and artifact/router unit coverage in [`tests/test_workflow.py`](../tests/test_workflow.py). Sprint 06 adds inspection/explore coverage in [`tests/test_inspection.py`](../tests/test_inspection.py), [`tests/test_explore_runtime.py`](../tests/test_explore_runtime.py), and [`tests/test_expanded_tools.py`](../tests/test_expanded_tools.py). Sprint 10 extends that workflow coverage in [`tests/test_workflow_policy.py`](../tests/test_workflow_policy.py), [`tests/test_workflow_runtime.py`](../tests/test_workflow_runtime.py), and [`tests/test_inspection.py`](../tests/test_inspection.py) for scored routing, clarify-budget behavior, plan refresh, and workflow timeline inspection. Sprint 11 adds [`tests/test_workflow_signals.py`](../tests/test_workflow_signals.py), [`tests/test_clarify_strategy.py`](../tests/test_clarify_strategy.py), [`tests/test_artifact_invalidation.py`](../tests/test_artifact_invalidation.py), and expanded inspection/runtime coverage for signal summaries, intent-aware clarify, semantic replan recovery, and workflow timeline filtering/highlights. Sprint 12 adds [`tests/test_clarify_grounding.py`](../tests/test_clarify_grounding.py), [`tests/test_turn_preparation.py`](../tests/test_turn_preparation.py), [`tests/test_turn_completion.py`](../tests/test_turn_completion.py), [`tests/test_turn_iteration.py`](../tests/test_turn_iteration.py), [`tests/test_turn_preamble.py`](../tests/test_turn_preamble.py), [`tests/test_workflow_state.py`](../tests/test_workflow_state.py), and [`tests/test_turn_loop.py`](../tests/test_turn_loop.py) for grounded clarify, structured recovery evidence, and the controllerized turn runtime. Sprint 13 adds [`tests/test_runtime_repair_flows.py`](../tests/test_runtime_repair_flows.py), [`tests/test_response_routing.py`](../tests/test_response_routing.py), [`tests/test_workflow_ledger.py`](../tests/test_workflow_ledger.py), and expanded [`tests/test_session_state.py`](../tests/test_session_state.py) / [`tests/test_inspection.py`](../tests/test_inspection.py) coverage for honest repair behavior, dedicated response routing, persisted semantic ledger state, prompt snapshot history, and prompt/artifact diff inspection. Sprint 15 adds [`tests/test_runtime_bootstrap.py`](../tests/test_runtime_bootstrap.py), [`tests/test_safeguard_services.py`](../tests/test_safeguard_services.py), [`tests/test_reasoning_compat.py`](../tests/test_reasoning_compat.py), and updated [`tests/test_runtime_context.py`](../tests/test_runtime_context.py) coverage for the shared bootstrap seam plus the runtime-owned safeguards/reasoning compatibility contract. Sprint 16 adds [`tests/test_runtime_launcher.py`](../tests/test_runtime_launcher.py), [`tests/test_chat_lane.py`](../tests/test_chat_lane.py), [`tests/test_decomposition_lane.py`](../tests/test_decomposition_lane.py), [`tests/test_compat_boundaries.py`](../tests/test_compat_boundaries.py), and expanded [`tests/test_explore_runtime.py`](../tests/test_explore_runtime.py) / [`tests/test_inspection.py`](../tests/test_inspection.py) coverage for the public launcher contract, compatibility boundaries, and persisted explore continuity. |
| 105 | |
| 106 | - `streaming_text`: green |
| 107 | - `read_file_roundtrip`: green |
| 108 | - `multi_tool_turn_roundtrip`: green |
| 109 | - `write_file_allowed`: green |
| 110 | - `write_file_denied`: green |
| 111 | - `bash_stdout_roundtrip`: green |
| 112 | - `bash_confirmation_prompt_approved`: green |
| 113 | - `bash_confirmation_prompt_denied`: green |
| 114 | - `read_only_mode_denies_write`: green |
| 115 | - `read_only_mode_denies_mutating_bash`: green |
| 116 | - `read_only_mode_allows_safe_bash`: green |
| 117 | - `workspace_write_denies_write_outside_root`: green |
| 118 | - `danger_full_access_allows_dangerous_bash`: green |
| 119 | - `prompt_mode_prompts_destructive_write`: green |
| 120 | - `allow_mode_skips_prompt_for_destructive_write`: green |
| 121 | - `deny_rule_blocks_allowed_mode`: green |
| 122 | - `ask_rule_prompts_even_when_mode_would_allow`: green |
| 123 | - `raw_json_tool_call_fallback`: green |
| 124 | - `completion_check_continuation`: green |
| 125 | - `tool_result_contract_regression`: green |
| 126 | - `turn_summary_smoke_for_multi_tool_turn`: green |
| 127 | - `native_and_raw_tool_paths_share_executor_trace`: green |
| 128 | - `backend_capability_probe_refreshes_native_tool_mode`: green |
| 129 | - `run_streaming_delegates_to_primary_runtime`: green |
| 130 | - `definition_of_done_verify_phase`: green |
| 131 | - `verify_failure_routes_to_fix_loop`: green |
| 132 | - `verify_retry_budget_exhaustion`: green |
| 133 | - `ambiguous_prompt_routes_to_clarify`: green |
| 134 | - `complex_prompt_routes_to_plan`: green |
| 135 | - `verify_failure_fix_loop_does_not_reroute_workflow`: green |
| 136 | - `conversational_task_skips_verify_phase`: green |
| 137 | - `explore_mode_skips_dod_and_router`: green |
| 138 | - `explore_mode_denies_write`: green |
| 139 | - `explore_mode_ignores_global_allow_policy`: green |
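
Two of the scenarios above, `deny_rule_blocks_allowed_mode` and `ask_rule_prompts_even_when_mode_would_allow`, describe how a matched rule interacts with the permission mode. One way to read them is "the most restrictive decision wins". The sketch below encodes that reading. The `combine` function and the `SEVERITY` ordering are assumptions for illustration; in particular, how Loader treats an `allow` rule against a stricter mode is not stated here and is a guess in this sketch.

```python
from typing import Optional

# ASSUMPTION: a simple severity ordering; higher means more restrictive.
SEVERITY = {"allow": 0, "ask": 1, "deny": 2}


def combine(mode_decision: str, rule_decision: Optional[str] = None) -> str:
    """Combine a mode-level decision with an optional matched-rule decision.

    Both arguments are one of 'allow', 'ask', or 'deny'. When no rule
    matched, the mode decision stands on its own.
    """
    if rule_decision is None:
        return mode_decision
    # Most restrictive decision wins (hypothetical precedence).
    return max(mode_decision, rule_decision, key=SEVERITY.__getitem__)
```

For example, `combine("allow", "deny")` yields `"deny"` and `combine("allow", "ask")` yields `"ask"`, which matches the two scenario names. Whether the real policy engine resolves every other combination the same way would need to be checked against `tests/test_permissions.py`.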
| 140 | |
| 141 | ## Verification snapshot |
| 142 | |
| 143 | As of 2026-04-08: |
| 144 | |
| 145 | - `uv run pytest -q`: 329 passed |
| 146 | - `tests/test_runtime_harness.py` is fully green, including permission-mode parity, DoD verify/fix coverage, workflow routing parity, and the original `tool_call_id` contract regression |
| 147 | - `tests/test_prompt_builder.py` covers section rendering, native-vs-ReAct formatting, and prompt metadata persistence |
| 148 | - `tests/test_turn_state_machine.py` covers allowed/disallowed turn transitions and terminal transition metadata |
| 149 | - `tests/test_runtime_phases.py` covers repair/completion phase transitions plus persisted transition metadata in runtime events and session state |
| 150 | - `tests/test_runtime_repair_flows.py` covers honest empty-response retries, no synthetic prefill on first turns, and the removal of the older no-tool puppeting/scolding reroutes |
| 151 | - `tests/test_runtime_context.py` and `tests/test_runtime_state_controllers.py` cover typed runtime-context construction plus direct workflow-state and phase-tracker behavior without relying on a full `Agent` |
| 152 | - `tests/test_runtime_bootstrap.py` covers the shared runtime bootstrap contract, prompt/capability synchronization, and direct conversation/explore construction through the runtime bootstrap seam |
| 153 | - `tests/test_runtime_launcher.py`, `tests/test_chat_lane.py`, and `tests/test_decomposition_lane.py` cover the public launcher contract, conversational entry routing, decomposition entry routing, and direct runtime turn delegation |
| 154 | - `tests/test_safeguard_services.py` covers the canonical runtime safeguard implementation plus the compatibility-export path under `loader.agent.safeguards` |
| 155 | - `tests/test_reasoning_compat.py` covers runtime-owned deliberation/completion helpers plus the compatibility-export path under `loader.agent.reasoning` |
| 156 | - `tests/test_compat_boundaries.py` fails if internal Loader code drifts back to importing runtime-owned helpers through compatibility shims |
| 157 | - `tests/test_repair.py` covers raw-text fallback through the runtime parser and active registry, including `TodoWrite` recovery |
| 158 | - `tests/test_completion_policy.py` covers direct text-loop bailout and continuation-prompt behavior on the typed runtime context |
| 159 | - `tests/test_response_routing.py` covers direct final-answer routing and halted tool-batch routing at the new response-policy seam |
| 160 | - `tests/test_dod.py` covers persistence, sizing boundaries, and verification command derivation |
| 161 | - `tests/test_workflow.py` covers workflow artifact round trips, scored-router expectations, DoD workflow links, and todo-to-DoD syncing |
| 162 | - `tests/test_workflow_signals.py` covers typed signal extraction, recent timeline pressure, and persisted `signal_summary` context |
| 163 | - `tests/test_clarify_strategy.py` covers clarify-slot prioritization and targeted follow-up question selection |
| 164 | - `tests/test_clarify_grounding.py` covers workspace evidence extraction, slot-aware/pressure-aware clarify grounding, and grounded clarify-brief hints |
| 165 | - `tests/test_workflow_policy.py` covers score breakdowns, clarify follow-up reviews, signal-summary persistence, artifact-freshness metadata, and workflow timeline serialization |
| 166 | - `tests/test_artifact_invalidation.py` covers semantic invalidation triggers plus targeted recovery selection for plan refresh vs full re-plan |
| 167 | - `tests/test_workflow_ledger.py` covers ledger seeding, contradiction tracking, acceptance-anchor updates, and operator-facing highlight summaries |
| 168 | - `tests/test_workflow_runtime.py` covers clarify routing, intent-aware clarify continuation, plan routing, targeted plan refresh, full re-plan through clarify reentry, verify-fix workflow handoff, persisted workflow-decision metadata, and workflow-ledger updates through runtime recovery |
| 169 | - `tests/test_turn_preparation.py`, `tests/test_turn_completion.py`, `tests/test_turn_iteration.py`, `tests/test_turn_preamble.py`, `tests/test_workflow_state.py`, and `tests/test_turn_loop.py` cover the controllerized turn-runtime seams directly instead of relying only on large end-to-end runtime tests |
| 170 | - `tests/test_workflow_tools.py` and `tests/test_workflow_runtime_tools.py` cover `TodoWrite`, `AskUserQuestion`, and runtime callback plumbing |
| 171 | - `tests/test_session_state.py` covers session persistence, resume, rotation, compaction persistence, cumulative usage rollups, persisted permission-policy metadata, workflow-ledger state, and prompt snapshot history |
| 172 | - `tests/test_compaction.py` covers claw-style line compression and compacted continuation-message behavior |
| 173 | - `tests/test_memory_tools.py` covers project-memory writes, notepad writes, lifecycle-hook mirroring, and DoD-summary capture into project memory |
| 174 | - `tests/test_cli_resume.py` covers `--resume` argument rewriting for latest and named-session restore |
| 175 | - `tests/test_inspection.py` covers `loader doctor`, `loader status`, `loader session list/show`, `loader permissions show/check`, `loader prompt show`, `loader prompt diff`, `loader workflow show --diff`, workflow timeline filtering/highlights, and workflow inspection surfaces, including recent explore activity |
| 176 | - `tests/test_explore_runtime.py` covers the direct explore lane contract, forced read-only behavior, persisted follow-up continuity, and `--fresh` explore resets outside the parity harness |
| 177 | - `tests/test_expanded_tools.py` covers structured patch application, read-only git helpers, `notepad_append`, and richer structured user questions |
| 178 | - `tests/test_permissions.py` covers prompt/allow mode parsing, rule precedence, policy-backed prompting behavior, and hook lifecycle ordering |
| 179 | - `tests/test_tool_safety.py` covers workspace boundaries, binary/oversize guards, patch metadata, and shell truncation/classification |
| 180 | - `tests/test_status_surfaces.py` covers the CLI/TUI DoD, workflow-mode, permission-mode, capability-profile, and session-id formatting helpers |
| 181 | - native and extracted tool calls now record the same executor trace events, with source-specific metadata |
| 182 | - turn startup can refine backend capability profiles before the first request, and `run_streaming()` delegates into the main runtime path; mutating tasks route through persisted evidence-backed completion, workflow artifacts and workflow-ledger state survive across turns, sessions compact safely, and explore queries bypass DoD/router overhead safely; policy rules are enforced deterministically, operators can inspect or dry-run policy decisions without live turns, and session inspection preserves effective policy state; prompt construction is sectioned and persisted, prompt snapshots and artifact diffs are inspectable after the fact, and explicit turn phases are visible while a turn runs; typed workflow signals now feed routing directly, semantic invalidation can force a targeted refresh or a full re-plan, and brownfield clarify can ask evidence-backed questions from repo facts; the turn runtime now avoids the older synthetic repair/no-tool puppeting and routes assistant outcomes through dedicated controllers instead of a single conversation-loop monolith |
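
The verify/fix behavior that the harness covers (a failed verification routes to a fix pass, and the number of fix passes is bounded) can be sketched as a small loop. The `verify_fix_loop` name, the callback signatures, the returned dictionary shape, and the default budget of 2 are all assumptions for illustration, not Loader's actual API or configuration.

```python
from typing import Callable


def verify_fix_loop(
    verify: Callable[[], bool],
    fix: Callable[[], None],
    retry_budget: int = 2,  # ASSUMPTION: hypothetical default budget
) -> dict:
    """Run verification, applying at most `retry_budget` fix passes."""
    fix_attempts = 0
    while True:
        if verify():
            return {"status": "verified", "fix_attempts": fix_attempts}
        if fix_attempts >= retry_budget:
            # Budget spent: stop instead of looping forever.
            return {"status": "budget_exhausted", "fix_attempts": fix_attempts}
        fix()  # In Loader's terms, this would be a return to execute mode.
        fix_attempts += 1
```

The loop always verifies once more after the last permitted fix, so a fix that succeeds on the final attempt is still recognized before the budget is reported as exhausted.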
| 183 | |
| 184 | ## Definition of honesty |
| 185 | |
| 186 | - If a scenario is green here, it should have deterministic automated coverage. |
| 187 | - If a scenario is flaky or broken, it should be called out here before we claim parity work is done. |
| 188 | - Sprint 01 turned the original `tool_call_id` regression green by fixing the message contract, not by weakening the test. |
| 189 | - Sprint 02 replaced "looks done" completion for mutating tasks with a real verify/fix gate, but it has not yet reached the richer workflow contracts described in the report and Sprint 04+. |
| 190 | - Sprint 03 established permission modes, hooks, and tool hardening, but it intentionally stops short of claw-code's fuller rule engine and prompt/allow permission variants. |
| 191 | - Sprint 04 adds routing, artifacts, and structured user questions, but it is still a first-pass workflow layer rather than full OMX consensus planning or deep interview rigor. |
| 192 | - Sprint 05 adds durable sessions, resume, compaction, and native memory/notepad tools, but it stops short of Sprint 06's inspectable session/status product surfaces and still uses heuristic continuity summaries rather than richer semantic memory extraction. |
| 193 | - Sprint 06 adds inspectable product surfaces, a constrained explore lane, and a broader tool registry, but it still stops short of interactive explore workflows, richer git ergonomics, AST/LSP-aware editing, or any multi-agent/team runtime. |
| 194 | - Sprint 07 is complete: Loader now has prompt/allow modes, rule-based permission policy, policy-backed prompting, persisted policy inspection state, and smaller assistant-turn/tool-batch/finalization runtime seams, but it still stops short of a richer rule UX, deeper policy sandboxing, and the more opinionated workflow/runtime contracts in the refs. |
| 195 | - Sprint 08 is complete: Loader now has a typed prompt builder, explicit runtime turn phases, first-class `loader permissions show/check` operator surfaces, and more coherent prompt/policy observability in doctor/status/session output, but it still stops short of a richer rule editor, formal state-machine routing, prompt-preview tooling, or the deeper workflow rigor in the refs. |
| 196 | - Sprint 09 is complete: Loader now has a validated turn state machine, typed persisted workflow decisions, `loader prompt show`, and richer workflow-reason/transition inspection surfaces, but it still stops short of deeper workflow routing policy, prompt diffing/versioning, and the more opinionated planning discipline used by the refs. |
| 197 | - Sprint 10 is complete: Loader now has a scored workflow policy, bounded clarify follow-through, targeted plan-refresh discipline, persisted workflow timeline inspection, and a greener workflow test contract, but it still stops short of deeper semantic routing/replanning, adaptive clarify depth, prompt-history diffing, or the fuller workflow rigor used by the refs. |
| 198 | - Sprint 11 is complete: Loader now has typed workflow signals, intent-aware clarify slots, semantic invalidation with targeted recovery vs full re-plan, filtered workflow inspection, and dedicated clarify/plan lane execution in `runtime.workflow_lanes`, but it still stops short of OMX-style pressure-pass interviewing, richer semantic artifact reasoning, and the deeper end-to-end orchestration discipline in the refs. |
| 199 | - Sprint 12 is complete: Loader now has pressure-pass clarify behavior, codebase-backed clarify grounding, structured recovery evidence, controllerized turn-runtime seams, and a genuinely coordinator-shaped `runtime.conversation`, but it still stops short of OMX-style deep interview depth, richer semantic artifact reasoning, artifact/prompt diff ergonomics, and the broader operator/runtime sophistication in the refs. |
| 200 | - Sprint 13 is complete: Loader now avoids synthetic prefill and no-tool puppeting, persists a semantic workflow ledger, exposes prompt/artifact diff surfaces, and routes assistant responses through a dedicated response-policy seam, but it still stops short of claw-code's narrower response/tool policy factoring, deeper planning rigor, and broader operator ergonomics. |
| 201 | - Sprint 14 is complete: Loader now treats `RuntimeContext` as the primary seam across workflow state, turn phases, response policy, turn looping, workflow recovery, and finalization; the old runtime legacy shim is gone, raw-text tool recovery no longer depends on hidden agent extractors, and the hot path is substantially more runtime-owned, but bootstrap ownership still begins at the agent wrapper and Loader still stops short of claw-code's fuller policy engine, OMX's deeper workflow rigor, and richer operator/runtime surfaces. |
| 202 | - Sprint 15 is complete: Loader now has a shared runtime bootstrap seam, runtime-owned safeguards and deliberation helpers, compatibility-only `agent/reasoning.py` and `agent/safeguards.py`, no remaining `Agent._build_runtime_context()` helper, and a much smaller `agent/loop.py` after dead planner/raw-extraction cleanup, but it still stops short of a minimal entrypoint shell, deeper explore ergonomics, claw-code's tighter policy engine, and OMX's richer planning/interview rigor. |
| 203 | - Sprint 16 is complete: Loader now has a first-class runtime launcher contract, a thinner `agent/loop.py`, explicit compatibility-boundary proof, and persisted explore continuity plus status visibility, but it still stops short of a fully minimal public shell, richer explore workflows, claw-code's tighter policy seams, and OMX's deeper planning/interview rigor. |
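
The first honesty rule above ("if a scenario is green here, it should have deterministic automated coverage") can be checked mechanically. The sketch below parses scenario status lines in the form this file uses and reports any green scenario that is missing from a set of covered names. How the covered names are obtained (for example from the parity manifest) is left open, since the manifest schema is not described here; the function names are hypothetical.

```python
import re

# Matches list items such as: - `streaming_text`: green
SCENARIO_LINE = re.compile(r"-\s+`(\w+)`:\s*(\w+)")


def documented_statuses(markdown: str) -> dict:
    """Map scenario name to its documented status word."""
    return {m.group(1): m.group(2) for m in SCENARIO_LINE.finditer(markdown)}


def unbacked_green(markdown: str, covered: set) -> list:
    """Return green scenarios that have no entry in `covered`.

    ASSUMPTION: `covered` is a set of scenario names that the caller has
    already extracted from the test harness or manifest.
    """
    statuses = documented_statuses(markdown)
    return sorted(
        name
        for name, status in statuses.items()
        if status == "green" and name not in covered
    )
```

Because the pattern requires a single identifier between the backticks, items such as the `uv run pytest -q` result line are not mistaken for scenarios.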