Loader Runtime Parity Checkpoint

Date: 2026-04-08

This file tracks the current deterministic runtime baseline for Loader. It stays intentionally narrow and operational: what the runtime can do today, what remains weak, and what scenarios we measure with repeatable tests.

Supported today

streamed text-only replies
native-tool round trips for read, write, edit, patch, glob, grep, bash, git, TodoWrite, AskUserQuestion, project_memory_*, and notepad_*
explicit permission modes: read-only, workspace-write, danger-full-access, prompt, and allow
tool lifecycle hooks in pre_tool_use → permission check → execute → post_tool_use / post_tool_use_failure order
rule-based permission policy with workspace-local allow / deny / ask rules from .loader/permission-rules.json
policy-backed prompting for destructive tool use, with approval context that includes mode, requirement, and matched rule information
loader permissions show for normalized rule inspection, source-path visibility, and prompt-state inspection without opening JSON files by hand
loader permissions check for dry-running one hypothetical tool request against the active policy, including required mode, normalized input summary, and matched-rule reasoning
raw JSON fallback when the model emits tool syntax in plain text; the fallback routes through the runtime parser plus the active registry, including modern workflow tools such as TodoWrite and AskUserQuestion
persisted definition-of-done state under .loader/dod/
persisted clarify briefs under .loader/briefs/
persisted implementation and verification plans under .loader/plans/
persisted conversation sessions under .loader/sessions/ plus active session state under .loader/state/
persisted permission policy metadata alongside session state, so loader status / loader session list / loader session show can explain the effective policy that ran
loader --resume and loader --resume <session-id> restore persisted session state
durable project memory in .loader/project-memory.json and working notes in .loader/notepad.md
native memory tools for project_memory_* and notepad_*
scored workflow routing across clarify → plan → execute → verify, with route scores, runner-up pressure, unresolved-question carry-forward, and scheduled-next-mode hints
typed workflow-signal extraction with persisted signal_summary context for route pressure, recent workflow history, and unresolved questions
mode-specific system prompts for clarify, plan, execute, and verify
intent-aware bounded clarify follow-through with explicit focus slots and persisted unresolved-question carry-forward
pressure-pass clarify reviews with explicit readiness gates, challenged-assumption/tradeoff/example pressure kinds, and persisted clarify-pressure metadata
codebase-backed clarify grounding with workspace evidence, repo facts, slot-aware evidence selection, pressure-aware evidence selection, and grounded brief hints for persisted clarify artifacts
semantic artifact invalidation that can choose targeted plan refresh, clarify reentry, or full re-plan before execution continues
structured workflow drift evidence covering confirmed touchpoints, inferred touchpoints, acceptance anchors, contradicted assumptions, verification contradictions, and task-boundary drift
persisted workflow ledger state for assumptions, contradicted assumptions, acceptance anchors, and open/closed decision boundaries, threaded through clarify, plan, recovery, and inspection
persisted workflow timeline entries for routes, handoffs, reentries, clarify outcomes, plan refreshes, and verify skips
explicit verify/fix loops for mutating tasks, with a bounded retry budget
verify/fix retries return to execute mode without re-triggering clarify or plan
task-size-aware verification command derivation based on actual tool history
verification command loading from persisted verification.md artifacts when present
heuristic completion nudges only for non-mutating tasks; mutating tasks now complete through the DoD gate
typed TurnSummary output for completed turns, including trace events and tool-result messages
normalized per-turn usage plus cumulative session usage in TurnSummary
automatic transcript compaction with priority-aware line compression and continuation instructions
unified tool execution for native and extracted tool calls through runtime.executor.ToolExecutor
typed tool-result messages backed by Message.tool_results
typed prompt construction in runtime.prompting, with explicit dynamic sections, a static/dynamic boundary marker, and persisted prompt-format / prompt-section metadata in session state
persisted prompt snapshot history in session state so prompt-contract changes survive resume and later inspection
validated turn-state transitions (prepare, assistant, repair, tools, critique, completion, finalize) with typed transition metadata, persisted session state, and emitted runtime events
typed workflow-decision metadata persisted in session/runtime state, including reason codes, summaries, decision kind, workflow scores, and scheduled-next-mode hints
loader doctor for backend, capability, workspace, command, state, and permission health checks outside the main runtime loop
loader status plus loader session list/show/resume for inspecting persisted runtime state without invoking the LLM
loader prompt show [task] for previewing the current prompt contract, workflow mode, permission mode, dynamic sections, and prompt body without a live model request
loader prompt diff [session-id] for comparing persisted prompt contracts, with concise summaries by default and unified diffs on demand
loader workflow show [session-id] with --mode, --kind, and --limit filters plus operator-focused workflow highlights, recent timeline snippets in loader session show, and --diff / --full-diff artifact comparison for persisted workflow artifacts
loader explore <prompt> as a read-only lookup lane with its own prompt, constrained registry, persisted bounded continuity under .loader/state/explore.json, and --fresh to ignore prior explore history when needed
RuntimeContext is now the primary runtime seam for workflow state, turn phases, response repair, no-tool completion, response routing, turn looping, finalization, workflow lanes, and workflow recovery; the older RuntimeLegacyServices shim has been removed
shared runtime bootstrap through runtime.bootstrap.build_runtime_context(...) / sync_runtime_context(...), with both conversation and explore runtimes constructing typed context through the same runtime-owned contract
runtime-owned safeguard and reasoning helpers now have canonical homes under src/loader/runtime/; src/loader/agent/safeguards.py and src/loader/agent/reasoning.py are compatibility-export layers rather than the primary implementations
the public launcher contract now owns conversational routing, decomposition entry routing, direct turn routing, and explore launch through src/loader/runtime/launcher.py, which leaves a smaller and more honest src/loader/agent/loop.py
compatibility exports are now explicitly bounded by direct tests, and internal runtime code is guarded against drifting back to agent/reasoning.py / agent/safeguards.py imports
CLI and TUI status surfaces for model, capability profile, mode, workflow mode, workflow reason, last transition summary, permission mode, explicit turn phase, prompt format/sections, DoD phase, pending items, last verification result, and active session id
CLI status now also surfaces recent explore activity, including bounded explore turn/message counts and the last explore query
CLI and TUI workflow-mode visibility plus artifact notifications
CLI and TUI permission-mode visibility with color-coded status
workspace-bound file operations with canonicalized boundary checks, binary detection, size limits, and structured patch metadata
shell mutability classification plus structured truncation and stderr/exit-code metadata
richer structured AskUserQuestion prompts with titles, context, options, and optional freeform responses
honest repair/completion behavior for no-tool turns: empty assistant replies get a single explicit retry, and Loader no longer relies on synthetic prefill, fake-tool scolding reroutes, or self-critique puppeting for plain-text answers
dedicated assistant-response routing in runtime.response_routing, so final-answer, tool-batch, and no-tool completion dispatch no longer live inline inside turn_iteration.py
assistant-turn request handling now lives in runtime.assistant_turns, clarify/plan lane execution now lives in runtime.workflow_lanes, tool-batch execution/recovery now lives in runtime.tool_batches, DoD/finalization logic now lives in runtime.finalization, workflow-state/session mutation lives in runtime.workflow_state, and the main loop now runs through runtime.turn_preparation, runtime.turn_preamble, runtime.turn_iteration, and runtime.turn_loop instead of accumulating further inside conversation.py
src/loader/runtime/conversation.py now acts as a compact coordinator over dedicated runtime controllers rather than owning a monolithic turn loop
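
The tool lifecycle ordering listed above (pre_tool_use → permission check → execute → post_tool_use / post_tool_use_failure) can be sketched as follows. This is a minimal illustration, not Loader's implementation: `HookLog`, `run_tool`, and the `is_allowed` callback are hypothetical names, and reporting a denied call through the failure hook is an assumption. Only the hook names and their order come from this checkpoint.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class HookLog:
    """Records lifecycle step names in the order they fire."""

    events: list = field(default_factory=list)


def run_tool(
    name: str,
    execute: Callable[[], str],
    is_allowed: Callable[[str], bool],
    log: HookLog,
) -> Optional[str]:
    """Run one tool call through the lifecycle order listed above."""
    log.events.append("pre_tool_use")
    log.events.append("permission_check")
    if not is_allowed(name):
        # Assumption: a denied call is reported through the failure hook.
        log.events.append("post_tool_use_failure")
        return None
    log.events.append("execute")
    try:
        result = execute()
    except Exception:
        # A failing tool fires the failure hook instead of post_tool_use.
        log.events.append("post_tool_use_failure")
        return None
    log.events.append("post_tool_use")
    return result


log = HookLog()
run_tool("read", lambda: "file contents", lambda name: True, log)
print(log.events)
# ['pre_tool_use', 'permission_check', 'execute', 'post_tool_use']
```

Recording the steps in a plain list keeps the ordering easy to assert in a deterministic test, which is the property the hook-lifecycle coverage in tests/test_permissions.py is described as checking.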

Known weak spots

the hot runtime path no longer depends on a hidden bootstrap helper, but src/loader/runtime/conversation.py and src/loader/runtime/explore.py still start from an Agent-shaped bootstrap source at the public entrypoint layer
src/loader/agent/loop.py is much smaller and less misleading than the pre-Sprint-15 shell, but it still owns prompt/session factories, resume/clear lifecycle, and UI-facing entrypoint glue instead of collapsing fully to a minimal public facade
src/loader/agent/reasoning.py and src/loader/agent/safeguards.py are now compatibility shims rather than primary implementations, but they still remain as export layers until Loader narrows its external compatibility surface further
src/loader/runtime/tool_batches.py and parts of src/loader/runtime/workflow_lanes.py are narrower and more directly tested than before, but they still carry more heuristic policy than the tightest reference seams in refs/claw-code
the workflow policy now consumes typed signals, but signal extraction is still heuristic and hand-tuned; Loader does not yet implement OMX's deeper ambiguity analysis, richer pressure-pass discipline, or branch-specific policy depth
clarify is now intent-aware, pressure-aware, and codebase-grounded, but it is still much shallower than OMX's deep-interview behavior and does not adapt its budget or questioning style by task class
plan freshness now handles broader semantic invalidation with typed evidence, but it is still lightweight and runtime-authored; Loader does not yet reason deeply over richer artifact metadata, contradicting verification evidence, or larger task reframes
the workflow ledger is now explicit and persisted, but it is still a pragmatic text-first contract rather than deeper symbolic task/state reasoning with stronger provenance
plan mode is still a single-pass artifact generator, not a Planner/Architect/Critic consensus loop
DoD acceptance criteria and pending items are stronger than Sprint 02, but todo progress is still lightly structured compared with claw-code's richer workflow state
evidence summaries are deterministic runtime summaries of captured output, not model-written verification narratives
session compaction summaries are heuristic runtime summaries, not model-assisted continuity artifacts
project-memory capture on finalized DoD evidence is still lightweight and command-summary oriented, not semantically curated memory extraction
rule syntax is intentionally narrow and workspace-local; Loader still does not have claw-code's richer rule model or broader prompt/allow operator surface
policy state is inspectable in doctor/status/session surfaces and dry-runnable through loader permissions show/check, but there is not yet a richer UX for editing, previewing multiple candidate rule sets, or temporarily overriding rules from the product surface
prompt assembly is now typed, previewable, and diffable across persisted sessions, but Loader still does not compare multiple candidate prompt contracts before execution or enforce a richer prompt-contract parity harness beyond the current unit and inspection coverage
workflow history is now filterable, ledger-backed, and diffable for persisted artifacts, but it is still text-first; Loader still does not offer semantic/AST-aware artifact diffs, richer artifact preview UX, or a visual workflow trace
shell safety is still heuristic and command-based; Loader does not yet have a richer shell sandbox or argument-aware mutability model
explore mode now has lightweight transcript continuity, but it is still a narrow read-only lookup lane rather than a richer interactive inspection workflow with deeper repo navigation affordances or dedicated explore inspection commands
the read-only git helper is intentionally narrow compared with claw-code and OMX's broader repo/product surfaces, and the patch tool still stops short of AST/LSP-aware editing

Out of scope in the current baseline

richer permission-rule UX / per-command allowlists
multi-agent / team orchestration

Deterministic parity scenarios

The auditable manifest lives at tests/fixtures/runtime_parity_manifest.json and is exercised by tests/test_runtime_harness.py.
Sprint 04 adds focused workflow integration coverage in tests/test_workflow_runtime.py and artifact/router unit coverage in tests/test_workflow.py.
Sprint 06 adds inspection/explore coverage in tests/test_inspection.py, tests/test_explore_runtime.py, and tests/test_expanded_tools.py.
Sprint 10 extends that workflow coverage in tests/test_workflow_policy.py, tests/test_workflow_runtime.py, and tests/test_inspection.py for scored routing, clarify-budget behavior, plan refresh, and workflow timeline inspection.
Sprint 11 adds tests/test_workflow_signals.py, tests/test_clarify_strategy.py, tests/test_artifact_invalidation.py, and expanded inspection/runtime coverage for signal summaries, intent-aware clarify, semantic replan recovery, and workflow timeline filtering/highlights.
Sprint 12 adds tests/test_clarify_grounding.py, tests/test_turn_preparation.py, tests/test_turn_completion.py, tests/test_turn_iteration.py, tests/test_turn_preamble.py, tests/test_workflow_state.py, and tests/test_turn_loop.py for grounded clarify, structured recovery evidence, and the controllerized turn runtime.
Sprint 13 adds tests/test_runtime_repair_flows.py, tests/test_response_routing.py, tests/test_workflow_ledger.py, and expanded tests/test_session_state.py / tests/test_inspection.py coverage for honest repair behavior, dedicated response routing, persisted semantic ledger state, prompt snapshot history, and prompt/artifact diff inspection.
Sprint 15 adds tests/test_runtime_bootstrap.py, tests/test_safeguard_services.py, tests/test_reasoning_compat.py, and updated tests/test_runtime_context.py coverage for the shared bootstrap seam plus the runtime-owned safeguards/reasoning compatibility contract.
Sprint 16 adds tests/test_runtime_launcher.py, tests/test_chat_lane.py, tests/test_decomposition_lane.py, tests/test_compat_boundaries.py, and expanded tests/test_explore_runtime.py / tests/test_inspection.py coverage for the public launcher contract, compatibility boundaries, and persisted explore continuity.

streaming_text: green
read_file_roundtrip: green
multi_tool_turn_roundtrip: green
write_file_allowed: green
write_file_denied: green
bash_stdout_roundtrip: green
bash_confirmation_prompt_approved: green
bash_confirmation_prompt_denied: green
read_only_mode_denies_write: green
read_only_mode_denies_mutating_bash: green
read_only_mode_allows_safe_bash: green
workspace_write_denies_write_outside_root: green
danger_full_access_allows_dangerous_bash: green
prompt_mode_prompts_destructive_write: green
allow_mode_skips_prompt_for_destructive_write: green
deny_rule_blocks_allowed_mode: green
ask_rule_prompts_even_when_mode_would_allow: green
raw_json_tool_call_fallback: green
completion_check_continuation: green
tool_result_contract_regression: green
turn_summary_smoke_for_multi_tool_turn: green
native_and_raw_tool_paths_share_executor_trace: green
backend_capability_probe_refreshes_native_tool_mode: green
run_streaming_delegates_to_primary_runtime: green
definition_of_done_verify_phase: green
verify_failure_routes_to_fix_loop: green
verify_retry_budget_exhaustion: green
ambiguous_prompt_routes_to_clarify: green
complex_prompt_routes_to_plan: green
verify_failure_fix_loop_does_not_reroute_workflow: green
conversational_task_skips_verify_phase: green
explore_mode_skips_dod_and_router: green
explore_mode_denies_write: green
explore_mode_ignores_global_allow_policy: green
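
The permission scenarios above imply that workspace rules are consulted before the mode's default. A minimal sketch under stated assumptions: `decide`, the shape of the `rules` mapping, and the deny > ask > allow precedence are hypothetical and are not taken from Loader's source or from the real schema of .loader/permission-rules.json. The mode is also collapsed to a single boolean for brevity. The sketch only mirrors the behavior named by deny_rule_blocks_allowed_mode and ask_rule_prompts_even_when_mode_would_allow.

```python
from typing import Dict, Set


def decide(tool: str, mode_allows: bool, rules: Dict[str, Set[str]]) -> str:
    """Return "deny", "ask", or "allow" for one hypothetical tool request."""
    # Assumption: explicit rules win over the mode, checked in
    # deny > ask > allow order.
    for action in ("deny", "ask", "allow"):
        if tool in rules.get(action, set()):
            return action
    # No rule matched: fall back to what the active mode permits.
    return "allow" if mode_allows else "deny"


rules = {"deny": {"bash"}, "ask": {"write"}}
print(decide("bash", True, rules))   # deny: a deny rule blocks even when the mode allows
print(decide("write", True, rules))  # ask: an ask rule prompts even when the mode allows
print(decide("read", True, rules))   # allow: no rule matched, mode allows
print(decide("edit", False, rules))  # deny: no rule matched, mode does not allow
```

Because the function is pure, each scenario reduces to one input/output pair, which is what makes a dry-run surface such as loader permissions check possible without a live turn.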

Verification snapshot

As of 2026-04-08:

uv run pytest -q: 329 passed
tests/test_runtime_harness.py is fully green, including permission-mode parity, DoD verify/fix coverage, workflow routing parity, and the original contract regression
tests/test_prompt_builder.py covers section rendering, native-vs-ReAct formatting, and prompt metadata persistence
tests/test_turn_state_machine.py covers allowed/disallowed turn transitions and terminal transition metadata
tests/test_runtime_phases.py covers repair/completion phase transitions plus persisted transition metadata in runtime events and session state
tests/test_runtime_repair_flows.py covers honest empty-response retries, no synthetic prefill on first turns, and the removal of the older no-tool puppeting/scolding reroutes
tests/test_runtime_context.py and tests/test_runtime_state_controllers.py cover typed runtime-context construction plus direct workflow-state and phase-tracker behavior without relying on a full Agent
tests/test_runtime_bootstrap.py covers the shared runtime bootstrap contract, prompt/capability synchronization, and direct conversation/explore construction through the runtime bootstrap seam
tests/test_runtime_launcher.py, tests/test_chat_lane.py, and tests/test_decomposition_lane.py cover the public launcher contract, conversational entry routing, decomposition entry routing, and direct runtime turn delegation
tests/test_safeguard_services.py covers the canonical runtime safeguard implementation plus the compatibility-export path under loader.agent.safeguards
tests/test_reasoning_compat.py covers runtime-owned deliberation/completion helpers plus the compatibility-export path under loader.agent.reasoning
tests/test_compat_boundaries.py fails if internal Loader code drifts back to importing runtime-owned helpers through compatibility shims
tests/test_repair.py covers raw-text fallback through the runtime parser and active registry, including TodoWrite recovery
tests/test_completion_policy.py covers direct text-loop bailout and continuation-prompt behavior on the typed runtime context
tests/test_response_routing.py covers direct final-answer routing and halted tool-batch routing at the new response-policy seam
tests/test_dod.py covers persistence, sizing boundaries, and verification command derivation
tests/test_workflow.py covers workflow artifact round trips, scored-router expectations, DoD workflow links, and todo-to-DoD syncing
tests/test_workflow_signals.py covers typed signal extraction, recent timeline pressure, and persisted signal_summary context
tests/test_clarify_strategy.py covers clarify-slot prioritization and targeted follow-up question selection
tests/test_clarify_grounding.py covers workspace evidence extraction, slot-aware/pressure-aware clarify grounding, and grounded clarify-brief hints
tests/test_workflow_policy.py covers score breakdowns, clarify follow-up reviews, signal-summary persistence, artifact-freshness metadata, and workflow timeline serialization
tests/test_artifact_invalidation.py covers semantic invalidation triggers plus targeted recovery selection for plan refresh vs full re-plan
tests/test_workflow_ledger.py covers ledger seeding, contradiction tracking, acceptance-anchor updates, and operator-facing highlight summaries
tests/test_workflow_runtime.py covers clarify routing, intent-aware clarify continuation, plan routing, targeted plan refresh, full re-plan through clarify reentry, verify-fix workflow handoff, persisted workflow-decision metadata, and workflow-ledger updates through runtime recovery
tests/test_turn_preparation.py, tests/test_turn_completion.py, tests/test_turn_iteration.py, tests/test_turn_preamble.py, tests/test_workflow_state.py, and tests/test_turn_loop.py cover the controllerized turn-runtime seams directly instead of relying only on large end-to-end runtime tests
tests/test_workflow_tools.py and tests/test_workflow_runtime_tools.py cover TodoWrite, AskUserQuestion, and runtime callback plumbing
tests/test_session_state.py covers session persistence, resume, rotation, compaction persistence, cumulative usage rollups, persisted permission-policy metadata, workflow-ledger state, and prompt snapshot history
tests/test_compaction.py covers claw-style line compression and compacted continuation-message behavior
tests/test_memory_tools.py covers project-memory writes, notepad writes, lifecycle-hook mirroring, and DoD-summary capture into project memory
tests/test_cli_resume.py covers --resume argument rewriting for latest and named-session restore
tests/test_inspection.py covers loader doctor, loader status, loader session list/show, loader permissions show/check, loader prompt show, loader prompt diff, loader workflow show --diff, workflow timeline filtering/highlights, and workflow inspection surfaces, including recent explore activity
tests/test_explore_runtime.py covers the direct explore lane contract, forced read-only behavior, persisted follow-up continuity, and fresh explore resets outside the parity harness
tests/test_expanded_tools.py covers structured patch application, read-only git helpers, notepad_append, and richer structured user questions
tests/test_permissions.py covers prompt/allow mode parsing, rule precedence, policy-backed prompting behavior, and hook lifecycle ordering
tests/test_tool_safety.py covers workspace boundaries, binary/oversize guards, patch metadata, and shell truncation/classification
tests/test_status_surfaces.py covers the CLI/TUI DoD, workflow-mode, permission-mode, capability-profile, and session-id formatting helpers
native and extracted tool calls now record the same executor trace events, with source-specific metadata
turn startup can refine backend capability profiles before the first request, and run_streaming() delegates into the main runtime path. Mutating tasks route through persisted, evidence-backed completion; workflow artifacts and workflow-ledger state survive across turns; sessions compact safely; and explore queries bypass DoD/router overhead safely. Policy rules are enforced deterministically, and operators can inspect or dry-run policy decisions without live turns. Prompt construction is sectioned and persisted, prompt snapshots and artifact diffs are inspectable after the fact, explicit turn phases are visible while a turn runs, and session inspection preserves effective policy state. Typed workflow signals now feed routing directly, semantic invalidation can force targeted refresh vs full re-plan, and brownfield clarify can ask evidence-backed questions from repo facts. The turn runtime now avoids the older synthetic repair/no-tool puppeting and routes assistant outcomes through dedicated controllers instead of a single conversation-loop monolith
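
The validated turn-state transitions covered by tests/test_turn_state_machine.py can be pictured with a small table-driven check. This is a sketch under stated assumptions: the phase names (prepare, assistant, repair, tools, critique, completion, finalize) come from this checkpoint, but the `ALLOWED` table, `Phase`, `InvalidTransition`, and `transition` are hypothetical and do not describe Loader's actual transition rules.

```python
from enum import Enum


class Phase(Enum):
    PREPARE = "prepare"
    ASSISTANT = "assistant"
    REPAIR = "repair"
    TOOLS = "tools"
    CRITIQUE = "critique"
    COMPLETION = "completion"
    FINALIZE = "finalize"


# Assumption: this table is illustrative only, not Loader's real table.
ALLOWED = {
    Phase.PREPARE: {Phase.ASSISTANT},
    Phase.ASSISTANT: {Phase.REPAIR, Phase.TOOLS, Phase.CRITIQUE, Phase.COMPLETION},
    Phase.REPAIR: {Phase.ASSISTANT},
    Phase.TOOLS: {Phase.ASSISTANT, Phase.COMPLETION},
    Phase.CRITIQUE: {Phase.ASSISTANT, Phase.COMPLETION},
    Phase.COMPLETION: {Phase.ASSISTANT, Phase.FINALIZE},
    Phase.FINALIZE: set(),  # terminal phase: nothing may follow it
}


class InvalidTransition(ValueError):
    """Raised when a requested phase change is not in the table."""


def transition(current: Phase, target: Phase) -> Phase:
    """Return the target phase if the move is allowed, else raise."""
    if target not in ALLOWED[current]:
        raise InvalidTransition(f"{current.value} -> {target.value} is not allowed")
    return target


phase = Phase.PREPARE
for nxt in (Phase.ASSISTANT, Phase.TOOLS, Phase.ASSISTANT, Phase.COMPLETION, Phase.FINALIZE):
    phase = transition(phase, nxt)
print(phase.value)  # finalize
```

Keeping the rules in one table means allowed and disallowed moves can both be asserted directly, without driving a full turn.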

Definition of honesty

If a scenario is green here, it should have deterministic automated coverage.
If a scenario is flaky or broken, it should be called out here before we claim parity work is done.
Sprint 01 turned the original tool_call_id regression green by fixing the message contract, not by weakening the test.
Sprint 02 replaced "looks done" completion for mutating tasks with a real verify/fix gate, but it has not yet reached the richer workflow contracts described in the report and Sprint 04+.
Sprint 03 established permission modes, hooks, and tool hardening, but it intentionally stops short of claw-code's fuller rule engine and prompt/allow permission variants.
Sprint 04 adds routing, artifacts, and structured user questions, but it is still a first-pass workflow layer rather than full OMX consensus planning or deep interview rigor.
Sprint 05 adds durable sessions, resume, compaction, and native memory/notepad tools, but it stops short of Sprint 06's inspectable session/status product surfaces and still uses heuristic continuity summaries rather than richer semantic memory extraction.
Sprint 06 adds inspectable product surfaces, a constrained explore lane, and a broader tool registry, but it still stops short of interactive explore workflows, richer git ergonomics, AST/LSP-aware editing, or any multi-agent/team runtime.
Sprint 07 is complete: Loader now has prompt/allow modes, rule-based permission policy, policy-backed prompting, persisted policy inspection state, and smaller assistant-turn/tool-batch/finalization runtime seams, but it still stops short of a richer rule UX, deeper policy sandboxing, and the more opinionated workflow/runtime contracts in the refs.
Sprint 08 is complete: Loader now has a typed prompt builder, explicit runtime turn phases, first-class loader permissions show/check operator surfaces, and more coherent prompt/policy observability in doctor/status/session output, but it still stops short of a richer rule editor, formal state-machine routing, prompt-preview tooling, or the deeper workflow rigor in the refs.
Sprint 09 is complete: Loader now has a validated turn state machine, typed persisted workflow decisions, loader prompt show, and richer workflow-reason/transition inspection surfaces, but it still stops short of deeper workflow routing policy, prompt diffing/versioning, and the more opinionated planning discipline used by the refs.
Sprint 10 is complete: Loader now has a scored workflow policy, bounded clarify follow-through, targeted plan-refresh discipline, persisted workflow timeline inspection, and a greener workflow test contract, but it still stops short of deeper semantic routing/replanning, adaptive clarify depth, prompt-history diffing, or the fuller workflow rigor used by the refs.
Sprint 11 is complete: Loader now has typed workflow signals, intent-aware clarify slots, semantic invalidation with targeted recovery vs full re-plan, filtered workflow inspection, and dedicated clarify/plan lane execution in runtime.workflow_lanes, but it still stops short of OMX-style pressure-pass interviewing, richer semantic artifact reasoning, and the deeper end-to-end orchestration discipline in the refs.
Sprint 12 is complete: Loader now has pressure-pass clarify behavior, codebase-backed clarify grounding, structured recovery evidence, controllerized turn-runtime seams, and a genuinely coordinator-shaped runtime.conversation, but it still stops short of OMX-style deep interview depth, richer semantic artifact reasoning, artifact/prompt diff ergonomics, and the broader operator/runtime sophistication in the refs.
Sprint 13 is complete: Loader now avoids synthetic prefill and no-tool puppeting, persists a semantic workflow ledger, exposes prompt/artifact diff surfaces, and routes assistant responses through a dedicated response-policy seam, but it still stops short of claw-code's narrower response/tool policy factoring, deeper planning rigor, and broader operator ergonomics.
Sprint 14 is complete: Loader now treats RuntimeContext as the primary seam across workflow state, turn phases, response policy, turn looping, workflow recovery, and finalization; the old runtime legacy shim is gone, raw-text tool recovery no longer depends on hidden agent extractors, and the hot path is substantially more runtime-owned, but bootstrap ownership still begins at the agent wrapper and Loader still stops short of claw-code's fuller policy engine, OMX's deeper workflow rigor, and richer operator/runtime surfaces.
Sprint 15 is complete: Loader now has a shared runtime bootstrap seam, runtime-owned safeguards and deliberation helpers, compatibility-only agent/reasoning.py and agent/safeguards.py, no remaining Agent._build_runtime_context() helper, and a much smaller agent/loop.py after dead planner/raw-extraction cleanup, but it still stops short of a minimal entrypoint shell, deeper explore ergonomics, claw-code's tighter policy engine, and OMX's richer planning/interview rigor.
Sprint 16 is complete: Loader now has a first-class runtime launcher contract, a thinner agent/loop.py, explicit compatibility-boundary proof, and persisted explore continuity plus status visibility, but it still stops short of a fully minimal public shell, richer explore workflows, claw-code's tighter policy seams, and OMX's deeper planning/interview rigor.
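
The verify/fix gate with a bounded retry budget, introduced in Sprint 02 and exercised by the verify_failure_routes_to_fix_loop and verify_retry_budget_exhaustion scenarios, can be sketched as a small loop. This is an assumption-laden illustration rather than Loader's code: `verify_fix_loop`, its callbacks, and the two result strings are hypothetical names.

```python
from typing import Callable


def verify_fix_loop(
    verify: Callable[[], bool],
    fix: Callable[[], None],
    retry_budget: int,
) -> str:
    """Return "verified" or "budget_exhausted".

    verify() runs once up front, then once after each fix attempt,
    with at most retry_budget fix attempts in total.
    """
    if verify():
        return "verified"
    for _ in range(retry_budget):
        # The fix step stands in for a return to execute mode; this sketch
        # never re-routes through clarify or plan.
        fix()
        if verify():
            return "verified"
    return "budget_exhausted"


outcomes = iter([False, False, True])
print(verify_fix_loop(lambda: next(outcomes), lambda: None, retry_budget=2))
# verified

outcomes = iter([False, False])
print(verify_fix_loop(lambda: next(outcomes), lambda: None, retry_budget=1))
# budget_exhausted
```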

View source

  
        1
        # Loader Runtime Parity Checkpoint
      
        2
        
        3
        Date: 2026-04-08
      
        4
        
        5
        This file tracks the current deterministic runtime baseline for Loader. It stays intentionally narrow and operational: what the runtime can do today, what remains weak, and what scenarios we measure with repeatable tests.
      
        6
        
        7
        ## Supported today
      
        8
        
        9
        - streamed text-only replies
      
        10
        - native-tool round trips for `read`, `write`, `edit`, `patch`, `glob`, `grep`, `bash`, `git`, `TodoWrite`, `AskUserQuestion`, `project_memory_*`, and `notepad_*`
      
        11
        - explicit permission modes: `read-only`, `workspace-write`, `danger-full-access`, `prompt`, and `allow`
      
        12
        - tool lifecycle hooks in `pre_tool_use` → permission check → execute → `post_tool_use` / `post_tool_use_failure` order
      
        13
        - rule-based permission policy with workspace-local `allow` / `deny` / `ask` rules from `.loader/permission-rules.json`
      
        14
        - policy-backed prompting for destructive tool use, with approval context that includes mode, requirement, and matched rule information
      
        15
        - `loader permissions show` for normalized rule inspection, source-path visibility, and prompt-state inspection without opening JSON files by hand
      
        16
        - `loader permissions check` for dry-running one hypothetical tool request against the active policy, including required mode, normalized input summary, and matched-rule reasoning
      
        17
        - raw JSON fallback when the model emits tool syntax in plain text
      
        18
        - raw JSON fallback now routes through the runtime parser plus the active registry, including modern workflow tools such as `TodoWrite` and `AskUserQuestion`
      
        19
        - persisted definition-of-done state under `.loader/dod/`
      
        20
        - persisted clarify briefs under `.loader/briefs/`
      
        21
        - persisted implementation and verification plans under `.loader/plans/`
      
        22
        - persisted conversation sessions under `.loader/sessions/` plus active session state under `.loader/state/`
      
        23
        - persisted permission policy metadata alongside session state, so `loader status` / `loader session list` / `loader session show` can explain the effective policy that ran
      
        24
        - `loader --resume` and `loader --resume <session-id>` restore persisted session state
      
        25
        - durable project memory in `.loader/project-memory.json` and working notes in `.loader/notepad.md`
      
        26
        - native memory tools for `project_memory_*` and `notepad_*`
      
        27
        - scored workflow routing across `clarify` → `plan` → `execute` → `verify`, with route scores, runner-up pressure, unresolved-question carry-forward, and scheduled-next-mode hints
      
        28
        - typed workflow-signal extraction with persisted `signal_summary` context for route pressure, recent workflow history, and unresolved questions
      
        - mode-specific system prompts for clarify, plan, execute, and verify
      
        - intent-aware bounded clarify follow-through with explicit focus slots and persisted unresolved-question carry-forward
      
        - pressure-pass clarify reviews with explicit readiness gates, challenged-assumption/tradeoff/example pressure kinds, and persisted clarify-pressure metadata
      
        - codebase-backed clarify grounding with workspace evidence, repo facts, slot-aware evidence selection, pressure-aware evidence selection, and grounded brief hints for persisted clarify artifacts
      
        - semantic artifact invalidation that can choose targeted plan refresh, clarify reentry, or full re-plan before execution continues
      
        - structured workflow drift evidence covering confirmed touchpoints, inferred touchpoints, acceptance anchors, contradicted assumptions, verification contradictions, and task-boundary drift
      
        - persisted workflow ledger state for assumptions, contradicted assumptions, acceptance anchors, and open/closed decision boundaries, threaded through clarify, plan, recovery, and inspection
      
        - persisted workflow timeline entries for routes, handoffs, reentries, clarify outcomes, plan refreshes, and verify skips
      
        - explicit verify/fix loops for mutating tasks, with a bounded retry budget
      
        - verify/fix retries return to execute mode without re-triggering clarify or plan
      
        - task-size-aware verification command derivation based on actual tool history
      
        - verification command loading from persisted `verification.md` artifacts when present
      
        - heuristic completion nudges only for non-mutating tasks; mutating tasks now complete through the DoD gate
      
        - typed `TurnSummary` output for completed turns, including trace events and tool-result messages
      
        - normalized per-turn usage plus cumulative session usage in `TurnSummary`
      
        - automatic transcript compaction with priority-aware line compression and continuation instructions
      
        - unified tool execution for native and extracted tool calls through `runtime.executor.ToolExecutor`
      
        - typed tool-result messages backed by `Message.tool_results`
      
        - typed prompt construction in `runtime.prompting`, with explicit dynamic sections, a static/dynamic boundary marker, and persisted prompt-format / prompt-section metadata in session state
      
        - persisted prompt snapshot history in session state so prompt-contract changes survive resume and later inspection
      
        - validated turn-state transitions (`prepare`, `assistant`, `repair`, `tools`, `critique`, `completion`, `finalize`) with typed transition metadata, persisted session state, and emitted runtime events
      
        - typed workflow-decision metadata persisted in session/runtime state, including reason codes, summaries, decision kind, workflow scores, and scheduled-next-mode hints
      
        - `loader doctor` for backend, capability, workspace, command, state, and permission health checks outside the main runtime loop
      
        - `loader status` plus `loader session list/show/resume` for inspecting persisted runtime state without invoking the LLM
      
        - `loader prompt show [task]` for previewing the current prompt contract, workflow mode, permission mode, dynamic sections, and prompt body without a live model request
      
        - `loader prompt diff [session-id]` for comparing persisted prompt contracts, with concise summaries by default and unified diffs on demand
      
        - `loader workflow show [session-id]` with `--mode`, `--kind`, and `--limit` filters plus operator-focused workflow highlights, recent timeline snippets in `loader session show`, and `--diff` / `--full-diff` artifact comparison for persisted workflow artifacts
      
        - `loader explore <prompt>` as a read-only lookup lane with its own prompt, constrained registry, persisted bounded continuity under `.loader/state/explore.json`, and `--fresh` to ignore prior explore history when needed
      
        - `RuntimeContext` is now the primary runtime seam for workflow state, turn phases, response repair, no-tool completion, response routing, turn looping, finalization, workflow lanes, and workflow recovery; the older `RuntimeLegacyServices` shim has been removed
      
        - shared runtime bootstrap through `runtime.bootstrap.build_runtime_context(...)` / `sync_runtime_context(...)`, with both conversation and explore runtimes constructing typed context through the same runtime-owned contract
      
        - runtime-owned safeguard and reasoning helpers now have canonical homes under `src/loader/runtime/`; `src/loader/agent/safeguards.py` and `src/loader/agent/reasoning.py` are compatibility-export layers rather than the primary implementations
      
        - the public launcher contract now owns conversational routing, decomposition entry routing, direct turn routing, and explore launch through `src/loader/runtime/launcher.py`, which leaves a smaller and more honest `src/loader/agent/loop.py`
      
        - compatibility exports are now explicitly bounded by direct tests, and internal runtime code is guarded against drifting back to `agent/reasoning.py` / `agent/safeguards.py` imports
      
        - CLI and TUI status surfaces for model, capability profile, mode, workflow mode, workflow reason, last transition summary, permission mode, explicit turn phase, prompt format/sections, DoD phase, pending items, last verification result, and active session id
      
        - CLI status now also surfaces recent explore activity, including bounded explore turn/message counts and the last explore query
      
        - CLI and TUI workflow-mode visibility plus artifact notifications
      
        - CLI and TUI permission-mode visibility with color-coded status
      
        - workspace-bound file operations with canonicalized boundary checks, binary detection, size limits, and structured patch metadata
      
        - shell mutability classification plus structured truncation and stderr/exit-code metadata
      
        - richer structured `AskUserQuestion` prompts with titles, context, options, and optional freeform responses
      
        - honest repair/completion behavior for no-tool turns: empty assistant replies get a single explicit retry, and Loader no longer relies on synthetic prefill, fake-tool scolding reroutes, or self-critique puppeting for plain-text answers
      
        - dedicated assistant-response routing in `runtime.response_routing`, so final-answer, tool-batch, and no-tool completion dispatch no longer live inline inside `turn_iteration.py`
      
        - assistant-turn request handling now lives in `runtime.assistant_turns`, clarify/plan lane execution now lives in `runtime.workflow_lanes`, tool-batch execution/recovery now lives in `runtime.tool_batches`, DoD/finalization logic now lives in `runtime.finalization`, workflow-state/session mutation lives in `runtime.workflow_state`, and the main loop now runs through `runtime.turn_preparation`, `runtime.turn_preamble`, `runtime.turn_iteration`, and `runtime.turn_loop` instead of accumulating further inside `conversation.py`
      
        - `src/loader/runtime/conversation.py` now acts as a compact coordinator over dedicated runtime controllers rather than owning a monolithic turn loop
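
The bounded verify/fix behavior listed above can be sketched as a small loop. This is a minimal illustration, not Loader's implementation: `run_verify_fix_loop`, `VerifyFixResult`, and the `retry_budget` parameter are hypothetical names, and the `execute` / `verify` callables stand in for the real runtime phases. It only shows the shape of the contract: a failed verification returns to execute without revisiting clarify or plan, and the loop stops once the budget is spent.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class VerifyFixResult:
    """Outcome of a bounded verify/fix loop (hypothetical shape)."""

    passed: bool
    attempts: int
    modes_visited: List[str] = field(default_factory=list)


def run_verify_fix_loop(
    execute: Callable[[], None],
    verify: Callable[[], bool],
    retry_budget: int = 2,
) -> VerifyFixResult:
    """Run execute -> verify, retrying execute until verify passes
    or the retry budget is exhausted."""
    modes: List[str] = []
    attempts = 0
    while True:
        # Every pass goes straight back to execute mode;
        # clarify and plan are never re-entered from here.
        modes.append("execute")
        execute()
        modes.append("verify")
        attempts += 1
        if verify():
            return VerifyFixResult(True, attempts, modes)
        # The first attempt is free; retry_budget counts only the retries.
        if attempts > retry_budget:
            return VerifyFixResult(False, attempts, modes)
```

With `retry_budget=2`, the sketch allows at most three execute/verify passes before reporting failure.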
      
        ## Known weak spots
      
        - the hot runtime path no longer depends on a hidden bootstrap helper, but [`src/loader/runtime/conversation.py`](../src/loader/runtime/conversation.py) and [`src/loader/runtime/explore.py`](../src/loader/runtime/explore.py) still start from an `Agent`-shaped bootstrap source at the public entrypoint layer
      
        - [`src/loader/agent/loop.py`](../src/loader/agent/loop.py) is much smaller and less misleading than the pre-Sprint-15 shell, but it still owns prompt/session factories, resume/clear lifecycle, and UI-facing entrypoint glue instead of collapsing fully to a minimal public facade
      
        - [`src/loader/agent/reasoning.py`](../src/loader/agent/reasoning.py) and [`src/loader/agent/safeguards.py`](../src/loader/agent/safeguards.py) are now compatibility shims rather than primary implementations, but they still remain as export layers until Loader narrows its external compatibility surface further
      
        - [`src/loader/runtime/tool_batches.py`](../src/loader/runtime/tool_batches.py) and parts of [`src/loader/runtime/workflow_lanes.py`](../src/loader/runtime/workflow_lanes.py) are narrower and more directly tested than before, but they still carry more heuristic policy than the tightest reference seams in `refs/claw-code`
      
        - the workflow policy now consumes typed signals, but signal extraction is still heuristic and hand-tuned; Loader does not yet implement OMX's deeper ambiguity analysis, richer pressure-pass discipline, or branch-specific policy depth
      
        - clarify is now intent-aware, pressure-aware, and codebase-grounded, but it is still much shallower than OMX's deep-interview behavior and does not adapt its budget or questioning style by task class
      
        - plan freshness now handles broader semantic invalidation with typed evidence, but it is still lightweight and runtime-authored; Loader does not yet reason deeply over richer artifact metadata, contradicting verification evidence, or larger task reframes
      
        - the workflow ledger is now explicit and persisted, but it is still a pragmatic text-first contract rather than deeper symbolic task/state reasoning with stronger provenance
      
        - plan mode is still a single-pass artifact generator, not a Planner/Architect/Critic consensus loop
      
        - DoD acceptance criteria and pending items are stronger than in Sprint 02, but todo progress is still lightly structured compared with claw-code's richer workflow state
      
        - evidence summaries are deterministic runtime summaries of captured output, not model-written verification narratives
      
        - session compaction summaries are heuristic runtime summaries, not model-assisted continuity artifacts
      
        - project-memory capture on finalized DoD evidence is still lightweight and command-summary oriented, not semantically curated memory extraction
      
        - rule syntax is intentionally narrow and workspace-local; Loader still does not have claw-code's richer rule model or broader prompt/allow operator surface
      
        - policy state is inspectable in doctor/status/session surfaces and dry-runnable through `loader permissions show/check`, but there is not yet a richer UX for editing, previewing multiple candidate rule sets, or temporarily overriding rules from the product surface
      
        - prompt assembly is now typed, previewable, and diffable across persisted sessions, but Loader still does not compare multiple candidate prompt contracts before execution or enforce a richer prompt-contract parity harness beyond the current unit and inspection coverage
      
        - workflow history is now filterable, ledger-backed, and diffable for persisted artifacts, but it is still text-first; Loader still does not offer semantic/AST-aware artifact diffs, richer artifact preview UX, or a visual workflow trace
      
        - shell safety is still heuristic and command-based; Loader does not yet have a richer shell sandbox or argument-aware mutability model
      
        - explore mode now has lightweight transcript continuity, but it is still a narrow read-only lookup lane rather than a richer interactive inspection workflow with deeper repo navigation affordances or dedicated explore inspection commands
      
        - the read-only `git` helper is intentionally narrow compared with claw-code's and OMX's broader repo/product surfaces, and the `patch` tool still stops short of AST/LSP-aware editing
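
To make the shell-safety caveat concrete, here is a minimal sketch of what a command-based classifier looks like and where it falls short. The command sets and the `classify_command` helper are assumptions for illustration; Loader's actual classification lists and API are not shown in this document.

```python
import shlex

# Hypothetical command sets for illustration only;
# Loader's real lists are not shown in this document.
READ_ONLY_COMMANDS = {"ls", "cat", "grep", "pwd", "head", "tail", "wc"}
MUTATING_COMMANDS = {"rm", "mv", "cp", "mkdir", "touch", "chmod"}


def classify_command(command: str) -> str:
    """Classify a shell command as 'read_only', 'mutating', or 'unknown'
    by looking only at its first token."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Unbalanced quotes and similar parse errors are treated as unknown.
        return "unknown"
    if not tokens:
        return "unknown"
    name = tokens[0]
    if name in MUTATING_COMMANDS:
        return "mutating"
    if name in READ_ONLY_COMMANDS:
        return "read_only"
    return "unknown"
```

Because only the first token is inspected, `cat a > b` is classified as read-only even though the redirect writes a file, which is the kind of gap an argument-aware model would close.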
      
        ## Out of scope in the current baseline
      
        - richer permission-rule UX / per-command allowlists
      
        - multi-agent / team orchestration
      
        ## Deterministic parity scenarios
      
        The auditable manifest lives at [`tests/fixtures/runtime_parity_manifest.json`](../tests/fixtures/runtime_parity_manifest.json) and is exercised by [`tests/test_runtime_harness.py`](../tests/test_runtime_harness.py).

        Sprint 04 adds focused workflow integration coverage in [`tests/test_workflow_runtime.py`](../tests/test_workflow_runtime.py) and artifact/router unit coverage in [`tests/test_workflow.py`](../tests/test_workflow.py).

        Sprint 06 adds inspection/explore coverage in [`tests/test_inspection.py`](../tests/test_inspection.py), [`tests/test_explore_runtime.py`](../tests/test_explore_runtime.py), and [`tests/test_expanded_tools.py`](../tests/test_expanded_tools.py).

        Sprint 10 extends that workflow coverage in [`tests/test_workflow_policy.py`](../tests/test_workflow_policy.py), [`tests/test_workflow_runtime.py`](../tests/test_workflow_runtime.py), and [`tests/test_inspection.py`](../tests/test_inspection.py) for scored routing, clarify-budget behavior, plan refresh, and workflow timeline inspection.

        Sprint 11 adds [`tests/test_workflow_signals.py`](../tests/test_workflow_signals.py), [`tests/test_clarify_strategy.py`](../tests/test_clarify_strategy.py), [`tests/test_artifact_invalidation.py`](../tests/test_artifact_invalidation.py), and expanded inspection/runtime coverage for signal summaries, intent-aware clarify, semantic replan recovery, and workflow timeline filtering/highlights.

        Sprint 12 adds [`tests/test_clarify_grounding.py`](../tests/test_clarify_grounding.py), [`tests/test_turn_preparation.py`](../tests/test_turn_preparation.py), [`tests/test_turn_completion.py`](../tests/test_turn_completion.py), [`tests/test_turn_iteration.py`](../tests/test_turn_iteration.py), [`tests/test_turn_preamble.py`](../tests/test_turn_preamble.py), [`tests/test_workflow_state.py`](../tests/test_workflow_state.py), and [`tests/test_turn_loop.py`](../tests/test_turn_loop.py) for grounded clarify, structured recovery evidence, and the controllerized turn runtime.

        Sprint 13 adds [`tests/test_runtime_repair_flows.py`](../tests/test_runtime_repair_flows.py), [`tests/test_response_routing.py`](../tests/test_response_routing.py), [`tests/test_workflow_ledger.py`](../tests/test_workflow_ledger.py), and expanded [`tests/test_session_state.py`](../tests/test_session_state.py) / [`tests/test_inspection.py`](../tests/test_inspection.py) coverage for honest repair behavior, dedicated response routing, persisted semantic ledger state, prompt snapshot history, and prompt/artifact diff inspection.

        Sprint 15 adds [`tests/test_runtime_bootstrap.py`](../tests/test_runtime_bootstrap.py), [`tests/test_safeguard_services.py`](../tests/test_safeguard_services.py), [`tests/test_reasoning_compat.py`](../tests/test_reasoning_compat.py), and updated [`tests/test_runtime_context.py`](../tests/test_runtime_context.py) coverage for the shared bootstrap seam plus the runtime-owned safeguards/reasoning compatibility contract.

        Sprint 16 adds [`tests/test_runtime_launcher.py`](../tests/test_runtime_launcher.py), [`tests/test_chat_lane.py`](../tests/test_chat_lane.py), [`tests/test_decomposition_lane.py`](../tests/test_decomposition_lane.py), [`tests/test_compat_boundaries.py`](../tests/test_compat_boundaries.py), and expanded [`tests/test_explore_runtime.py`](../tests/test_explore_runtime.py) / [`tests/test_inspection.py`](../tests/test_inspection.py) coverage for the public launcher contract, compatibility boundaries, and persisted explore continuity.
      
        - `streaming_text`: green
      
        - `read_file_roundtrip`: green
      
        - `multi_tool_turn_roundtrip`: green
      
        - `write_file_allowed`: green
      
        - `write_file_denied`: green
      
        - `bash_stdout_roundtrip`: green
      
        - `bash_confirmation_prompt_approved`: green
      
        - `bash_confirmation_prompt_denied`: green
      
        - `read_only_mode_denies_write`: green
      
        - `read_only_mode_denies_mutating_bash`: green
      
        - `read_only_mode_allows_safe_bash`: green
      
        - `workspace_write_denies_write_outside_root`: green
      
        - `danger_full_access_allows_dangerous_bash`: green
      
        - `prompt_mode_prompts_destructive_write`: green
      
        - `allow_mode_skips_prompt_for_destructive_write`: green
      
        - `deny_rule_blocks_allowed_mode`: green
      
        - `ask_rule_prompts_even_when_mode_would_allow`: green
      
        - `raw_json_tool_call_fallback`: green
      
        - `completion_check_continuation`: green
      
        - `tool_result_contract_regression`: green
      
        - `turn_summary_smoke_for_multi_tool_turn`: green
      
        - `native_and_raw_tool_paths_share_executor_trace`: green
      
        - `backend_capability_probe_refreshes_native_tool_mode`: green
      
        - `run_streaming_delegates_to_primary_runtime`: green
      
        - `definition_of_done_verify_phase`: green
      
        - `verify_failure_routes_to_fix_loop`: green
      
        - `verify_retry_budget_exhaustion`: green
      
        - `ambiguous_prompt_routes_to_clarify`: green
      
        - `complex_prompt_routes_to_plan`: green
      
        - `verify_failure_fix_loop_does_not_reroute_workflow`: green
      
        - `conversational_task_skips_verify_phase`: green
      
        - `explore_mode_skips_dod_and_router`: green
      
        - `explore_mode_denies_write`: green
      
        - `explore_mode_ignores_global_allow_policy`: green
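
Several of the permission scenarios above imply a precedence between rules and modes. The sketch below is one reading of those scenario names, assuming deny rules win first, ask rules force a prompt next, and the mode decides otherwise. `ToolRequest`, `decide`, and the precedence itself are hypothetical; the real policy also handles workspace boundaries and dangerous-command checks that are omitted here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToolRequest:
    """A hypothetical, simplified view of one tool request."""

    tool: str
    mutating: bool
    destructive: bool = False


def decide(mode: str, request: ToolRequest, matched_rule: Optional[str] = None) -> str:
    """Return 'allow', 'deny', or 'prompt' for a request.

    Assumed precedence: deny rule > ask rule > permission mode.
    An 'allow' rule does not override the mode in this sketch.
    """
    if matched_rule == "deny":
        return "deny"
    if matched_rule == "ask":
        return "prompt"
    if mode == "read-only":
        return "deny" if request.mutating else "allow"
    if mode == "prompt":
        return "prompt" if request.destructive else "allow"
    # workspace-write, danger-full-access, and allow fall through here;
    # their boundary and danger checks are left out of this sketch.
    return "allow"
```

Read against the scenario list, `decide("allow", request, "deny")` corresponds to `deny_rule_blocks_allowed_mode`, and `decide("allow", request, "ask")` corresponds to `ask_rule_prompts_even_when_mode_would_allow`.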
      
        ## Verification snapshot
      
        As of 2026-04-08:
      
        - `uv run pytest -q`: 329 passed
      
        - `tests/test_runtime_harness.py` is fully green, including permission-mode parity, DoD verify/fix coverage, workflow routing parity, and the original contract regression
      
        - `tests/test_prompt_builder.py` covers section rendering, native-vs-ReAct formatting, and prompt metadata persistence
      
        - `tests/test_turn_state_machine.py` covers allowed/disallowed turn transitions and terminal transition metadata
      
        - `tests/test_runtime_phases.py` covers repair/completion phase transitions plus persisted transition metadata in runtime events and session state
      
        - `tests/test_runtime_repair_flows.py` covers honest empty-response retries, no synthetic prefill on first turns, and the removal of the older no-tool puppeting/scolding reroutes
      
        - `tests/test_runtime_context.py` and `tests/test_runtime_state_controllers.py` cover typed runtime-context construction plus direct workflow-state and phase-tracker behavior without relying on a full `Agent`
      
        - `tests/test_runtime_bootstrap.py` covers the shared runtime bootstrap contract, prompt/capability synchronization, and direct conversation/explore construction through the runtime bootstrap seam
      
        - `tests/test_runtime_launcher.py`, `tests/test_chat_lane.py`, and `tests/test_decomposition_lane.py` cover the public launcher contract, conversational entry routing, decomposition entry routing, and direct runtime turn delegation
      
        - `tests/test_safeguard_services.py` covers the canonical runtime safeguard implementation plus the compatibility-export path under `loader.agent.safeguards`
      
        - `tests/test_reasoning_compat.py` covers runtime-owned deliberation/completion helpers plus the compatibility-export path under `loader.agent.reasoning`
      
        - `tests/test_compat_boundaries.py` fails if internal Loader code drifts back to importing runtime-owned helpers through compatibility shims
      
        - `tests/test_repair.py` covers raw-text fallback through the runtime parser and active registry, including `TodoWrite` recovery
      
        - `tests/test_completion_policy.py` covers direct text-loop bailout and continuation-prompt behavior on the typed runtime context
      
        - `tests/test_response_routing.py` covers direct final-answer routing and halted tool-batch routing at the new response-policy seam
      
        - `tests/test_dod.py` covers persistence, sizing boundaries, and verification command derivation
      
        - `tests/test_workflow.py` covers workflow artifact round trips, scored-router expectations, DoD workflow links, and todo-to-DoD syncing
      
        - `tests/test_workflow_signals.py` covers typed signal extraction, recent timeline pressure, and persisted `signal_summary` context
      
        - `tests/test_clarify_strategy.py` covers clarify-slot prioritization and targeted follow-up question selection
      
        - `tests/test_clarify_grounding.py` covers workspace evidence extraction, slot-aware/pressure-aware clarify grounding, and grounded clarify-brief hints
      
        - `tests/test_workflow_policy.py` covers score breakdowns, clarify follow-up reviews, signal-summary persistence, artifact-freshness metadata, and workflow timeline serialization
      
        - `tests/test_artifact_invalidation.py` covers semantic invalidation triggers plus targeted recovery selection for plan refresh vs full re-plan
      
        - `tests/test_workflow_ledger.py` covers ledger seeding, contradiction tracking, acceptance-anchor updates, and operator-facing highlight summaries
      
        - `tests/test_workflow_runtime.py` covers clarify routing, intent-aware clarify continuation, plan routing, targeted plan refresh, full re-plan through clarify reentry, verify-fix workflow handoff, persisted workflow-decision metadata, and workflow-ledger updates through runtime recovery
      
        - `tests/test_turn_preparation.py`, `tests/test_turn_completion.py`, `tests/test_turn_iteration.py`, `tests/test_turn_preamble.py`, `tests/test_workflow_state.py`, and `tests/test_turn_loop.py` cover the controllerized turn-runtime seams directly instead of relying only on large end-to-end runtime tests
      
        - `tests/test_workflow_tools.py` and `tests/test_workflow_runtime_tools.py` cover `TodoWrite`, `AskUserQuestion`, and runtime callback plumbing
      
        - `tests/test_session_state.py` covers session persistence, resume, rotation, compaction persistence, cumulative usage rollups, persisted permission-policy metadata, workflow-ledger state, and prompt snapshot history
      
        - `tests/test_compaction.py` covers claw-code-style line compression and compacted continuation-message behavior
      
        - `tests/test_memory_tools.py` covers project-memory writes, notepad writes, lifecycle-hook mirroring, and DoD-summary capture into project memory
      
        - `tests/test_cli_resume.py` covers `--resume` argument rewriting for latest and named-session restore
      
        - `tests/test_inspection.py` covers `loader doctor`, `loader status`, `loader session list/show`, `loader permissions show/check`, `loader prompt show`, `loader prompt diff`, `loader workflow show --diff`, workflow timeline filtering/highlights, and workflow inspection surfaces, including recent explore activity
      
        - `tests/test_explore_runtime.py` covers the direct explore lane contract, forced read-only behavior, persisted follow-up continuity, and `fresh` explore resets outside the parity harness
      
        - `tests/test_expanded_tools.py` covers structured patch application, read-only git helpers, `notepad_append`, and richer structured user questions
      
        - `tests/test_permissions.py` covers prompt/allow mode parsing, rule precedence, policy-backed prompting behavior, and hook lifecycle ordering
      
        - `tests/test_tool_safety.py` covers workspace boundaries, binary/oversize guards, patch metadata, and shell truncation/classification
      
        - `tests/test_status_surfaces.py` covers the CLI/TUI DoD, workflow-mode, permission-mode, capability-profile, and session-id formatting helpers
      
        - native and extracted tool calls now record the same executor trace events, with source-specific metadata
      
        - turn startup can refine backend capability profiles before the first request
        - `run_streaming()` delegates into the main runtime path
        - mutating tasks route through persisted evidence-backed completion
        - workflow artifacts and workflow-ledger state survive across turns
        - sessions compact safely
        - explore queries bypass DoD/router overhead safely
        - policy rules are enforced deterministically
        - operators can inspect/dry-run policy decisions without live turns
        - prompt construction is sectioned and persisted
        - prompt snapshots and artifact diffs are inspectable after the fact
        - explicit turn phases are visible while a turn runs
        - session inspection preserves effective policy state
        - typed workflow signals now feed routing directly
        - semantic invalidation can force targeted refresh vs full re-plan
        - brownfield clarify can ask evidence-backed questions from repo facts
        - the turn runtime now avoids the older synthetic repair/no-tool puppeting while routing assistant outcomes through dedicated controllers instead of a single conversation-loop monolith
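
The workspace-boundary coverage mentioned for `tests/test_tool_safety.py` rests on canonicalizing paths before comparing them. A minimal sketch of that idea using only `pathlib` follows; `is_inside_workspace` is a hypothetical helper, not Loader's API, and the real checks also cover binary detection and size limits.

```python
from pathlib import Path


def is_inside_workspace(workspace_root: str, candidate: str) -> bool:
    """Return True if `candidate` resolves to a path inside `workspace_root`.

    Both paths are canonicalized first, so `..` segments and absolute
    paths cannot slip past a naive string-prefix comparison.
    """
    root = Path(workspace_root).resolve()
    # Joining with an absolute candidate discards `root`, which is what we
    # want: the absolute path is then checked against the root on its own.
    target = (root / candidate).resolve()
    try:
        target.relative_to(root)
    except ValueError:
        return False
    return True
```

For example, `is_inside_workspace(root, "../outside.txt")` is rejected because the resolved target no longer sits under the resolved root, which mirrors the intent of the `workspace_write_denies_write_outside_root` scenario.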
      
        ## Definition of honesty
      
        - If a scenario is green here, it should have deterministic automated coverage.
      
        - If a scenario is flaky or broken, it should be called out here before we claim parity work is done.
      
        - Sprint 01 turned the original `tool_call_id` regression green by fixing the message contract, not by weakening the test.
      
        - Sprint 02 replaced "looks done" completion for mutating tasks with a real verify/fix gate, but it has not yet reached the richer workflow contracts described in the report and Sprint 04+.
      
        - Sprint 03 established permission modes, hooks, and tool hardening, but it intentionally stops short of claw-code's fuller rule engine and prompt/allow permission variants.
      
        - Sprint 04 adds routing, artifacts, and structured user questions, but it is still a first-pass workflow layer rather than full OMX consensus planning or deep interview rigor.
      
        - Sprint 05 adds durable sessions, resume, compaction, and native memory/notepad tools, but it stops short of Sprint 06's inspectable session/status product surfaces and still uses heuristic continuity summaries rather than richer semantic memory extraction.
      
        - Sprint 06 adds inspectable product surfaces, a constrained explore lane, and a broader tool registry, but it still stops short of interactive explore workflows, richer git ergonomics, AST/LSP-aware editing, or any multi-agent/team runtime.
      
        - Sprint 07 is complete: Loader now has prompt/allow modes, rule-based permission policy, policy-backed prompting, persisted policy inspection state, and smaller assistant-turn/tool-batch/finalization runtime seams, but it still stops short of a richer rule UX, deeper policy sandboxing, and the more opinionated workflow/runtime contracts in the refs.
      
        - Sprint 08 is complete: Loader now has a typed prompt builder, explicit runtime turn phases, first-class `loader permissions show/check` operator surfaces, and more coherent prompt/policy observability in doctor/status/session output, but it still stops short of a richer rule editor, formal state-machine routing, prompt-preview tooling, or the deeper workflow rigor in the refs.
      
        - Sprint 09 is complete: Loader now has a validated turn state machine, typed persisted workflow decisions, `loader prompt show`, and richer workflow-reason/transition inspection surfaces, but it still stops short of deeper workflow routing policy, prompt diffing/versioning, and the more opinionated planning discipline used by the refs.
      
        - Sprint 10 is complete: Loader now has a scored workflow policy, bounded clarify follow-through, targeted plan-refresh discipline, persisted workflow timeline inspection, and a greener workflow test contract, but it still stops short of deeper semantic routing/replanning, adaptive clarify depth, prompt-history diffing, or the fuller workflow rigor used by the refs.
      
        - Sprint 11 is complete: Loader now has typed workflow signals, intent-aware clarify slots, semantic invalidation with targeted recovery vs full re-plan, filtered workflow inspection, and dedicated clarify/plan lane execution in `runtime.workflow_lanes`, but it still stops short of OMX-style pressure-pass interviewing, richer semantic artifact reasoning, and the deeper end-to-end orchestration discipline in the refs.
      
        - Sprint 12 is complete: Loader now has pressure-pass clarify behavior, codebase-backed clarify grounding, structured recovery evidence, controllerized turn-runtime seams, and a genuinely coordinator-shaped `runtime.conversation`, but it still stops short of OMX-style deep interview depth, richer semantic artifact reasoning, artifact/prompt diff ergonomics, and the broader operator/runtime sophistication in the refs.
      
        - Sprint 13 is complete: Loader now avoids synthetic prefill and no-tool puppeting, persists a semantic workflow ledger, exposes prompt/artifact diff surfaces, and routes assistant responses through a dedicated response-policy seam, but it still stops short of claw-code's narrower response/tool policy factoring, deeper planning rigor, and broader operator ergonomics.
      
        - Sprint 14 is complete: Loader now treats `RuntimeContext` as the primary seam across workflow state, turn phases, response policy, turn looping, workflow recovery, and finalization; the old runtime legacy shim is gone, raw-text tool recovery no longer depends on hidden agent extractors, and the hot path is substantially more runtime-owned, but bootstrap ownership still begins at the agent wrapper and Loader still stops short of claw-code's fuller policy engine, OMX's deeper workflow rigor, and richer operator/runtime surfaces.
      
        - Sprint 15 is complete: Loader now has a shared runtime bootstrap seam, runtime-owned safeguards and deliberation helpers, compatibility-only `agent/reasoning.py` and `agent/safeguards.py`, no remaining `Agent._build_runtime_context()` helper, and a much smaller `agent/loop.py` after dead planner/raw-extraction cleanup, but it still stops short of a minimal entrypoint shell, deeper explore ergonomics, claw-code's tighter policy engine, and OMX's richer planning/interview rigor.
      
        - Sprint 16 is complete: Loader now has a first-class runtime launcher contract, a thinner `agent/loop.py`, explicit compatibility-boundary proof, and persisted explore continuity plus status visibility, but it still stops short of a fully minimal public shell, richer explore workflows, claw-code's tighter policy seams, and OMX's deeper planning/interview rigor.