Loader Deep Dive: Gaps, Strengths, and a Path Toward Claw-Like Behavior
Date: 2026-04-06
Scope and assumptions
This report compares three things:
Loaderitselfrefs/claw-code, using the Rust workspace underrefs/claw-code/rust/as the canonical runtimerefs/oh-my-codexas the workflow-layer parent repo
Assumption: oh-my-codex is the correct “parent repo” for this exercise. That assumption is based on:
refs/claw-code/README.mdrefs/claw-code/PHILOSOPHY.md- the fact that
refs/claw-codeexplicitly describessrc/as a companion Python/reference workspace, not the primary runtime
If you meant a different parent, we should rerun the comparison against that repo, but this is a solid first pass.
Executive summary
Loader has the right instincts but is operating at the wrong layer.
The codebase already knows that models need:
- planning help
- recovery help
- confidence checks
- completion checks
- safe tool use
But Loader mostly tries to enforce those after the model has already started drifting. claw-code and oh-my-codex get better behavior because they shape the work before, during, and after the model call:
- before: explicit mode selection, clarification, approved planning artifacts
- during: durable runtime state, richer tool surface, explicit permission model, session persistence
- after: verification protocols, completion gates, retry/fix loops, parity harnesses, operator diagnostics
The biggest lesson is not “copy their prompt.”
The biggest lesson is:
Loader needs a stronger execution contract, not just stronger prompting.
If we want Loader to feel closer to claw-code regardless of model choice, the highest-leverage work is:
- replace the monolithic heuristic loop with a typed turn engine
- add durable workflow/state artifacts
- make “definition of done” evidence-based instead of heuristic
- add real permission/safety boundaries around tools
- build a parity harness so we can improve behavior intentionally
Method
I reviewed:
- Loader source under
src/loader/ - Loader tests under
tests/ refs/claw-code/README.mdrefs/claw-code/USAGE.mdrefs/claw-code/PARITY.mdrefs/claw-code/PHILOSOPHY.mdrefs/claw-code/rust/crates/runtime/*refs/claw-code/rust/crates/tools/src/lib.rsrefs/oh-my-codex/README.mdrefs/oh-my-codex/AGENTS.mdrefs/oh-my-codex/skills/deep-interview/SKILL.mdrefs/oh-my-codex/skills/ralplan/SKILL.mdrefs/oh-my-codex/skills/ralph/SKILL.mdrefs/oh-my-codex/src/modes/base.tsrefs/oh-my-codex/src/ralplan/runtime.tsrefs/oh-my-codex/src/mcp/memory-server.tsrefs/oh-my-codex/src/verification/verifier.tsrefs/oh-my-codex/src/cli/doctor.tsrefs/oh-my-codex/src/scripts/notify-hook.ts
I also ran Loader verification commands:
uv run pytest- failed during collection
- discovered
refs/claw-code/tests/* - also failed to import
loader
uv run --with pytest --with pytest-asyncio python -m pytest tests -q- 56 passed
- 3 failed
That matters because some of Loader’s runtime paths are clearly under-tested.
What Loader already does well
1. Loader is small, understandable, and hackable
This is a real advantage.
src/loader/ is about 55 source files, and the core agent behavior is easy to locate. Compared to claw-code and especially OMX, Loader is much easier to refactor aggressively.
2. Loader is genuinely local-first
The Ollama-first posture is simple and useful. A lot of the complexity in claw-code and OMX comes from supporting broad operational surfaces, multiple runtimes, OAuth, MCP, tmux/team flows, and richer tool ecosystems. Loader can keep its local-first identity while still copying the good execution ideas.
3. Loader already contains the seeds of a better system
These are the right instincts:
- project context detection in
src/loader/context/project.py - runtime safeguards in
src/loader/agent/safeguards.py - recovery categorization in
src/loader/agent/recovery.py - optional decomposition / critique / confidence / verification / completion checks in
src/loader/agent/reasoning.py - a decent Textual app in
src/loader/ui/app.py
The problem is not that Loader lacks ideas.
The problem is that these ideas are bolted onto one big runtime loop instead of being elevated into the architecture.
4. The TUI is a meaningful strength
Loader’s TUI already gives you:
- model selection
- streaming output
- approval handling
- status line updates
- tool widgets
That is more product surface than many small local agents. It is worth keeping.
Where Loader is weak today
1. Loader’s product surface is not trustworthy yet
The most visible sign is the README:
README.md:1-2still says “FortranGoingOnForty” and “A tutorial on using Fortran for beginners.”
That looks small, but it reflects a bigger problem: Loader is missing operational polish and self-diagnosis. claw-code and OMX both treat installability, health checks, and discoverability as product requirements. Loader currently feels like an experiment more than a tool.
2. Loader’s main runtime is too monolithic and too heuristic
src/loader/agent/loop.py is the heart of Loader, and it is doing too much:
- prompt construction
- streaming output handling
- raw tool-call extraction
- duplicate tool execution flows
- recovery
- validation
- rollback tracking
- completion nudging
- loop detection
- steering
- partial planning
- decomposition
The result is a loop that is hard to reason about and easy to destabilize.
The core design smell is that Loader tries to recover from model misbehavior in-place instead of enforcing a stronger turn protocol.
3. Loader has a real runtime contract bug in tool-result handling
Verified directly against the code. There is a concrete mismatch between Message and the loop:
src/loader/llm/base.py:33-39definesMessagewithrole,content,tool_calls, andtool_results. There is notool_call_idfield onMessage— that field belongs to the separateToolResultdataclass atsrc/loader/llm/base.py:25-30.src/loader/agent/loop.py:885andsrc/loader/agent/loop.py:906both constructMessage(role=Role.TOOL, content=..., tool_call_id=tool_call.id).
Both call sites will raise TypeError: Message.__init__() got an unexpected keyword argument 'tool_call_id' the moment they execute. They live on the duplicate-suppression and pre-validation branches of the loop, which means they have zero integration coverage today. This single bug is the proof that the test harness gap is real and that Sprint 00 must precede any behavioral work.
4. Loader duplicates tool execution logic instead of centralizing it
There are effectively two execution paths:
- the normal native/ReAct tool path
- the “raw JSON extracted tool call” path
Those paths duplicate:
- duplicate checking
- validation
- confirmation behavior
- result recording
- loop/error handling
That makes behavior inconsistent and increases the chance that fixes in one path never land in the other.
claw-code’s ConversationRuntime::run_turn() is much tighter: receive assistant output, extract tool uses, authorize, execute, append tool results, repeat.
5. Loader’s system prompt is too shallow and too rigid
src/loader/agent/prompts.py:148-208 gives Loader a generic “use tools immediately / no code blocks / no numbered steps / read files before editing” prompt.
This is too blunt.
Problems:
- it treats all tasks like immediate tool-execution tasks
- it globally bans numbered steps, which is bad for planning/reporting tasks
- it does not define modes
- it does not encode verification expectations
- it does not encode completion criteria
- it does not distinguish “clarify”, “plan”, “execute”, and “verify”
OMX is much better here. It does not just say “do the task.” It routes the task into a workflow lane with an explicit contract.
6. Loader’s tool surface is too thin
Loader has 6 default tools:
readwriteeditglobbashgrep
That is enough for toy execution, but not enough for strong agent behavior.
What is missing compared to claw-code / OMX:
- task/todo tracking
- structured ask-user surfaces
- memory/notepad
- doctor/status/session tooling
- git-aware helpers
- explore vs full-execution split
- diff/patch-aware editing
- web/search/fetch surfaces
- structured output surfaces
- subagent/team coordination surfaces
- MCP-backed state and memory
The result is that Loader has to keep too much in the prompt and too much in ephemeral model state.
7. Loader’s safety model is primitive
Loader’s current protection model is mostly:
- “safe commands” vs “ask for confirmation”
- destructive tool flags
Problems in practice:
- no permission modes like
read-only,workspace-write,danger-full-access - no strong workspace boundary checks
- no binary-file guards
- no file size limits
- no symlink escape protection
- no command semantics beyond a short safe list
Evidence:
src/loader/tools/file_tools.pyreads/writes resolved paths directlysrc/loader/tools/shell_tools.pyusescreate_subprocess_shell()on arbitrary shell stringssrc/loader/tools/shell_tools.py:13-20uses a short safe command set, but no mode-based authorization model
By comparison, claw-code has:
PermissionPolicyPermissionEnforcer- workspace boundary checks
- binary/size guards in file ops
- permission-mode aware tool definitions
That does not just make it safer. It makes the agent more predictable.
8. Loader’s “definition of done” is heuristic, not contractual
The user complaint about “spending too long on simple tasks or finishing early without followup” is visible directly in the code.
Loader’s current strategy is:
- heuristically decide whether the response looks premature
- nudge the model to continue
- maybe ask it to confirm completion
See:
src/loader/agent/reasoning.py:721-854
This is well-intentioned, but it is still guesswork.
It does not require:
- explicit acceptance criteria
- a verification plan
- fresh command evidence
- zero pending tasks
- a final sign-off phase
OMX’s ralph workflow does.
That difference is enormous.
9. Loader has no durable workflow state
Loader has plans, decomposition, and completion logic, but they live inside one run and disappear.
Missing pieces:
- persisted mode state
- session memory
- approved plan artifacts
- PRD / test-spec artifacts
- progress ledger
- durable “what was already decided”
- resume-safe task state
OMX writes state under .omx/ and uses that to keep the workflow coherent across retries, handoffs, and interruptions. Loader currently depends on in-memory context plus prompt history only.
10. Loader is too backend-specific and too capability-fragile
Despite defining an abstract LLM backend, Loader is effectively Ollama-only today.
Evidence:
src/loader/cli/main.pysupports onlyollamasrc/loader/llm/ollama.pyhardcodes native tool support by model-name substring matching
This is fragile for behavior matching “with any model chosen.”
What Loader needs instead is:
- a provider-independent tool-calling contract
- explicit capability profiles
- distinct fallback strategies for native tools vs text tool calling
- prompts/workflows that degrade gracefully
11. Loader’s tests are not protecting the real runtime
Loader’s test suite is mostly:
- tool unit tests
- parsing tests
- recovery tests
That is useful, but insufficient.
The current state:
uv run pytestfails by default after addingrefs/- the repo does not scope pytest discovery
- the “normal” targeted run needs
--with pytest --with pytest-asyncio - even then, 3 tests fail
- there are no strong turn-loop integration tests
- there is no deterministic mock backend harness comparable to
claw-code
This is why structural issues like the tool_call_id mismatch can survive.
What claw-code gets right
1. The runtime contract is explicit
refs/claw-code/rust/crates/runtime/src/conversation.rs is the biggest thing Loader should study.
The core run_turn() flow is clean:
- append user message to session
- stream assistant response
- build a typed assistant message
- extract tool uses
- run permission checks
- execute tool
- append tool result message
- repeat until no more tool uses
- optionally compact session
- return a typed turn summary
That is much more trustworthy than Loader’s current “stream + parse + filter + maybe reparse + maybe extract raw JSON + maybe duplicate path” approach.
2. Session persistence and compaction are first-class
claw-code treats long-lived sessions as a product feature:
- persisted sessions
- resume support
- usage tracking
- compaction thresholds
- summarized continuation messages
Relevant files:
refs/claw-code/rust/crates/runtime/src/conversation.rsrefs/claw-code/rust/crates/runtime/src/compact.rsrefs/claw-code/rust/crates/runtime/src/summary_compression.rsrefs/claw-code/rust/crates/runtime/src/usage.rs
This matters because good agent behavior is often continuity behavior.
3. Permissions are part of the runtime, not just UI confirmation
claw-code has an actual permission model with three layers:
- Mode layer —
PermissionModeenum withReadOnly,WorkspaceWrite,DangerFullAccess,Prompt, andAllow(refs/claw-code/rust/crates/runtime/src/permissions.rs:8-27) - Per-tool requirement layer — every
ToolSpecdeclares the minimum mode it requires, mapped inPermissionPolicy.tool_requirements - Rule layer — three rule lists (
allow_rules,deny_rules,ask_rules) for context-specific overrides on top of the mode/requirement check
Plus typed authorization outcomes, file-write boundary logic, and bash gating.
Relevant files:
refs/claw-code/rust/crates/runtime/src/permission_enforcer.rsrefs/claw-code/rust/crates/runtime/src/permissions.rs
Loader needs this badly. The mode layer alone is the high-leverage start; the rule layer can come later.
4. File and shell operations are engineered, not just exposed
claw-code’s file layer includes:
- max read size
- max write size
- binary detection
- workspace-boundary validation
- structured patch outputs
Relevant file:
refs/claw-code/rust/crates/runtime/src/file_ops.rs
Loader’s file tools are functional, but too permissive and too simplistic to support strong autonomous behavior.
5. Hooks and lifecycle surfaces give the runtime escape valves
claw-code has pre-tool and post-tool hooks, including failure hooks.
That is important because not every behavioral improvement should live inside the model prompt. Hooks let the system inject policy, observability, and guardrails without changing the LLM call itself.
Relevant files:
refs/claw-code/rust/crates/runtime/src/hooks.rsrefs/claw-code/rust/crates/runtime/src/conversation.rs
6. The project is honest about parity and weaknesses
refs/claw-code/PARITY.md is one of the best engineering lessons in the whole comparison.
It does three things Loader does not yet do:
- names what is actually shipped
- names what is still shallow or stubbed
- ties roadmap claims to concrete evidence
That alone reduces thrash.
Loader needs a similar parity/backlog document for runtime behavior.
7. Diagnostics and operator surfaces are part of the product
claw-code exposes operational commands like:
statussandboxagentsmcpskillsdoctor- session resume
This is not just convenience. It makes the system inspectable. Loader currently hides too much inside the runtime.
Where claw-code is still incomplete
It is worth staying honest here too.
Even claw-code admits some shallowness in PARITY.md:
- some surfaces are registry-backed approximations, not deep external integrations
- session compaction parity is still open
- token accounting accuracy is still open
- some tool surfaces remain shallow or partially stubbed
That is useful because the goal is not blind imitation. The goal is to copy the parts that most affect day-to-day behavior.
What OMX adds that Loader is currently missing almost entirely
claw-code gives a better runtime. OMX gives a better workflow.
This is where most of Loader’s “definition of done” and “follow-through” problems are answered.
1. Clarification is a mode, not an ad hoc question
deep-interview is not “ask a question if confused.”
It is a formal ambiguity-reduction workflow with:
- a context snapshot
- one-question rounds
- ambiguity scoring
- explicit non-goals
- explicit decision boundaries
- a crystallized artifact for downstream execution
Relevant files:
refs/oh-my-codex/skills/deep-interview/SKILL.md
Loader currently has no equivalent. It either acts immediately or tries to self-nudge mid-flight.
2. Planning is artifact-based and consensus-based
ralplan is much more than “make a numbered list.”
It includes:
- Planner / Architect / Critic loops
- max iteration handling
- planning completion gates
- PRD and test-spec artifacts
- approved handoff into execution
Relevant files:
refs/oh-my-codex/skills/ralplan/SKILL.mdrefs/oh-my-codex/src/ralplan/runtime.tsrefs/oh-my-codex/src/planning/artifacts.ts
Loader’s Plan object is fine as a local helper, but it is nowhere near this level of control.
3. “Done” is a workflow contract in Ralph
This is the single biggest lesson for Loader.
Ralph encodes:
- persistence until done
- mandatory verification
- architect verification
- retry/fix loops
- state transitions
- explicit cleanup on completion
- a final checklist
Relevant file:
refs/oh-my-codex/skills/ralph/SKILL.md
This directly addresses the exact Loader problems you named:
- weak tool follow-through
- finishing too early
- spending too long in loops
- poor task closure
4. Workflow state lives outside the prompt
OMX stores durable mode state under .omx/ and exposes it through state tools.
Relevant files:
refs/oh-my-codex/src/modes/base.tsrefs/oh-my-codex/src/mcp/state-server.tsrefs/oh-my-codex/src/mcp/memory-server.ts
That means:
- progress survives interruptions
- execution can be resumed
- handoffs are grounded
- context can be audited
- the model does not have to remember everything itself
5. Memory and notepad are explicit tools
OMX has project memory and a notepad.
That sounds small, but it matters a lot for agent stability. It gives the system somewhere to store:
- conventions
- known build commands
- temporary working notes
- durable directives
Relevant file:
refs/oh-my-codex/src/mcp/memory-server.ts
Loader currently rediscovers too much per turn.
6. Verification is standardized
OMX has verification instructions that scale by task size and explicitly require evidence.
Relevant file:
refs/oh-my-codex/src/verification/verifier.ts
Loader has completion heuristics. OMX has verification policy.
That is the difference between “the model sounded done” and “the system proved done.”
7. Doctor / explore / sparkshell reduce prompt waste
OMX distinguishes:
- health checking (
doctor) - lightweight read-only exploration (
explore) - bounded shell-native inspection (
sparkshell)
That is smart.
It keeps the main execution loop from becoming the only place everything happens.
Relevant files:
refs/oh-my-codex/src/cli/doctor.tsrefs/oh-my-codex/src/cli/explore.tsrefs/oh-my-codex/src/cli/sparkshell.ts
8. Follow-through is supported outside the agent context window
The idle notifications, leader nudges, and continuation prompts in OMX are important.
Relevant file:
refs/oh-my-codex/src/scripts/notify-hook.ts
This is one of the deeper design differences:
- Loader tries to keep the model on-task from inside the loop
- OMX also nudges, monitors, and routes from outside the loop
That is a more robust design.
Comparison matrix
| Area | Loader today | claw-code |
OMX lesson | Takeaway for Loader |
|---|---|---|---|---|
| Runtime loop | monolithic, heuristic-heavy | typed turn engine | separate mode/workflow from turn runtime | split Loader runtime first |
| Tool surface | 6 basic tools | 49 exposed tool specs on main | tools should include workflow/state surfaces | add stateful and diagnostic tools |
| Permissions | confirmation-only | permission policy + enforcer | safety belongs in runtime | add modes and boundaries |
| Completion | heuristic continuation prompt | stronger runtime summaries | Ralph gives evidence-backed done gates | replace “maybe done” with explicit verification |
| Planning | ephemeral numbered list | some plan surfaces | ralplan = persisted, reviewed planning | persist plan artifacts |
| Memory/state | none | sessions + compaction + tracing | .omx/ mode state + memory |
add .loader/ state dir |
| Diagnostics | minimal | status/sandbox/doctor/session | doctor/explore/sparkshell | make Loader inspectable |
| Testing | unit-heavy, no runtime harness | mock parity harness | workflow runtime is tested like product behavior | build scripted runtime tests |
| Extensibility | none | hooks, plugins, MCP surfaces | workflow and notification hooks | add lifecycle hooks later |
| Multi-agent | none | agent/team surfaces | team + ralph staffing | defer until solo runtime is trustworthy |
Why Loader’s current weaknesses produce the behavior you described
Poor tool use
Root causes:
- shallow tool surface
- brittle prompt contract
- native-vs-ReAct bifurcation
- duplicated execution code paths
- no typed runtime contract for tool results
Weak follow-through
Root causes:
- no persistent task state
- no approved plan artifact
- no explicit verification lane
- no final completion checklist
Finishing early
Root causes:
- completion is heuristic
- no required evidence model
- no acceptance criteria artifact
- no final “prove it” pass
Spending too long on simple tasks
Root causes:
- the runtime loop tries too many recoveries in one place
- the system prompt does not distinguish task modes cleanly
- there is no “lightweight inspect” lane like
explore - the model often has to infer the workflow instead of being routed into one
Model sensitivity
Root causes:
- behavior is prompt-and-heuristic driven
- capability detection is backend-specific and brittle
- no workflow artifacts that survive model variance
This is why copying OMX’s workflow ideas is so high leverage. It reduces how much we ask the model to improvise.
Concrete implementation targets
These are ordered by impact on Loader behavior, not by code convenience.
Target 1: Introduce a real turn engine
Goal:
- replace the current giant loop with a smaller, typed conversation runtime
Implementation target:
- create a new
src/loader/runtime/package - move message/session/tool-result logic out of
src/loader/agent/loop.py - give tool results a first-class typed representation
- unify native, ReAct, and extracted-tool execution through one executor path
Why:
- this is the foundation for every other improvement
Target 2: Add persistent Loader state under .loader/
Goal:
- make workflow state durable instead of prompt-only
Implementation target:
.loader/state/.loader/sessions/.loader/plans/.loader/notepad.md.loader/project-memory.json
Why:
- Loader needs somewhere to store progress, acceptance criteria, and recovered knowledge
Target 3: Separate task modes
Goal:
- stop treating all requests like immediate tool-execution requests
Implementation target:
- mode router with at least:
clarifyplanexecuteverify
Why:
- this is the minimum structure needed to stop overthinking simple work and underthinking complex work
Target 4: Replace heuristic completion with an evidence-backed done contract
Goal:
- make completion explicit and testable
Implementation target:
- define a
DefinitionOfDoneobject per task - require:
- acceptance criteria
- verification commands
- evidence summary
- zero pending task items
Why:
- this is the main fix for premature completion
Target 5: Add deep-interview-lite and ralplan-lite equivalents
Goal:
- pull ambiguity reduction and planning review out of the middle of execution
Implementation target:
clarifymode writes a task briefplanmode writes:- a short implementation plan
- a test/verification plan
Do not try to copy every OMX feature immediately. Copy the artifact discipline first.
Target 6: Build a real permission model
Goal:
- move from confirmation prompts to policy-based authorization
Implementation target:
- permission modes:
read-onlyworkspace-writedanger-full-access
- tool specs declare required permission
- file writes enforce workspace boundaries
- shell commands go through command classification
Why:
- this is both safety and behavior quality
Target 7: Harden file and shell tools
Goal:
- make tool use trustworthy enough for automation
Implementation target:
- size limits
- binary detection
- symlink/traversal protection
- structured patch/diff return values
- shell command semantics and mutability classification
Target 8: Add loader doctor, loader status, and loader session
Goal:
- make Loader operable as a product
Implementation target:
- backend health
- model capability snapshot
- workspace detection
- write-access detection
- test/build command detection
- active session summary
Why:
- better operator feedback means less guesswork in the agent loop
Target 9: Add memory/notepad tools
Goal:
- give Loader durable short-term and long-term memory
Implementation target:
- read/write project memory
- append working notes
- store user directives and repo conventions
Why:
- this reduces re-discovery and improves follow-through across turns
Target 10: Add a lightweight read-only inspect lane
Goal:
- avoid using the full agent loop for every lookup
Implementation target:
loader exploreor equivalent internal mode- optimized for:
- file/symbol lookup
- pattern discovery
- relationship questions
Why:
- simple tasks should stay cheap and fast
Target 11: Add a parity harness
Goal:
- improve behavior intentionally instead of impressionistically
Implementation target:
- scripted mock backend scenarios for:
- simple read
- multi-tool turn
- denied permission
- write/edit success
- verification-required task
- premature completion rejection
- looped/duplicate action prevention
Why:
- this is how Loader becomes reliable
Target 12: Add workflow-aware prompts and capability profiles
Goal:
- make Loader less brittle across models
Implementation target:
- replace one generic system prompt with mode-specific prompts
- add provider/model capability profiles:
- native tools
- streaming
- context budget
- preferred tool-call format
- verification strictness
Why:
- behavior should be shaped by runtime policy, not guessed from model substrings
Priority order
This section was rewritten after a deeper validation pass against the actual code in refs/claw-code and refs/oh-my-codex, plus firsthand spot-checks of Loader's runtime. The deeper review confirmed every load-bearing claim in this report and surfaced one structural reorder: the Definition-of-Done work is the user's actual pain point and should land before permission modes, not after, because permissions are a safety win and DoD is the behavior win.
P0: Stabilize before changing behavior (Sprint 00)
- write a failing regression test for the
tool_call_idbug atagent/loop.py:885,906first, before any harness work — it proves the bug is real and proves the harness exists in one move - scope pytest discovery so
refs/stops contaminating collection - exclude
refs/from ruff and mypy too - make
uv run pytestwork out of the box - port the scenario taxonomy from
refs/claw-code/rust/crates/rusty-claude-cli/tests/mock_parity_harness.rs - rewrite
README.md(currently still says "FortranGoingOnForty") - baseline parity checklist for current runtime behavior
P1: Replace the loop with a real runtime (Sprint 01)
- new
src/loader/runtime/package with a typed turn engine - unify the native, ReAct, and "extracted JSON fallback" tool execution paths into one executor
- fix the named bugs from Sprint 00's failing tests (
tool_call_id, duplicate execution path) - replace substring-based
NATIVE_TOOL_MODELS/NO_TOOL_MODELSmodel detection with aruntime/capabilities.pyprofile system — Loader needs to behave consistently across model choices - structured
TurnSummaryoutput
P2: The behavior fix the user actually asked for (Sprint 02)
DefinitionOfDoneobject per task: acceptance criteria, verification commands, evidence summary, pending/completed task items- explicit verify phase that runs the verification commands and gates completion on evidence
- fix loop: verification failure returns to execution, not to final answer
- minimum
.loader/directory shape (.loader/dod/) — full session/memory layout deferred to Sprint 05
This is the highest-leverage behavioral change in the entire plan and is the direct answer to "finishing too early" and "weak follow-through."
P3: Safety as policy, not as confirmation prompt (Sprint 03)
- permission modes:
read-only,workspace-write,danger-full-access - three-event tool lifecycle hooks (
pre_tool_use,post_tool_use,post_tool_use_failure) modeled directly onrefs/claw-code/rust/crates/runtime/src/hooks.rs - refactor
safeguards.py(duplicate detection, validation, rollback) into pre-tool hook implementations rather than ad-hoc method calls - file operation hardening (workspace boundary, symlink, size limits, binary detection, structured patches)
- shell operation hardening
- expose active mode in CLI/TUI status
Hooks land alongside permissions because every later sprint hangs new behavior (verification, validation, observability) on the same lifecycle.
P4: Stop improvising one workflow for everything (Sprint 04)
- mode router: clarify, plan, execute, verify (verify already exists from Sprint 02)
- clarify artifact written to
.loader/briefs/ - planning artifacts (implementation plan + verification plan) written to
.loader/plans/and fed into the existing DoD object - tool prerequisites pulled forward from Sprint 06:
TodoWrite(the "zero pending tasks" gate is empty without it) andAskUserQuestion(clarify rounds)
P5: Durable continuity (Sprint 05)
- full
.loader/state directory under the layout already started in Sprint 02 - session persistence and resume
- transcript compaction with priority-aware summarization (model the design on
refs/claw-code/rust/crates/runtime/src/summary_compression.rs) - memory/notepad surfaces
- usage/cost tracking
P6: Operability and tool-surface expansion (Sprint 06)
loader doctor,loader status,loader session- read-only explore lane
- broader tool surface (diff/patch-aware editing, git helpers, structured ask-user, etc.) —
TodoWriteandAskUserQuestionalready exist from Sprint 04
Deferred indefinitely
- workflow hooks beyond the runtime tool lifecycle (notification/idle nudges, leader monitoring)
- task/team/subagent orchestration
- broad MCP ecosystem
- richer plugin systems
These are real wins in claw-code/OMX, but Loader should not pursue them until the solo runtime is trustworthy.
What Loader should copy directly, and what it should not
Copy directly
- typed turn runtime
- permission model
- file/shell hardening
- session persistence
- compaction
- doctor/status/session surfaces
- workflow artifacts
- evidence-backed verification
- parity harness discipline
Copy in simplified form
- deep-interview
- ralplan
- ralph
- memory/notepad
- explore vs full-execution split
Do not copy blindly yet
- full tmux/team runtime
- huge command surface
- Discord/openclaw notification stack
- broad MCP ecosystem
Loader should first become a trustworthy single-agent local runtime. After that, team orchestration will actually help.
Recommended Loader architecture direction
If we want behavior closer to claw-code without losing Loader’s simplicity, I would steer toward:
Layer 1: Runtime core
- typed
TurnRuntime SessionStorePermissionPolicyToolExecutorVerificationEngine
Layer 2: Workflow layer
ClarifyWorkflowPlanWorkflowExecuteWorkflowVerifyWorkflow
Layer 3: Product surfaces
- TUI
- CLI
doctorstatussessionexplore
Layer 4: Optional future orchestration
- hooks
- background verification
- multi-agent/task orchestration
That is a better fit for Loader than trying to clone all of OMX wholesale.
Immediate conclusions
- Loader’s biggest problems are architectural, not just prompt-related.
claw-codeis strongest where Loader is weakest: runtime contract, permissions, sessions, diagnostics, parity.- OMX is strongest where Loader is currently almost absent: clarification, planning discipline, durable state, completion/verification loops.
- The fastest path to “better model behavior today” is not adding more heuristics. It is adding:
- workflow artifacts
- explicit verification
- persistent state
- a smaller, more trustworthy turn engine
Sprint scaffolding
After the deeper validation pass the original five-sprint plan was reshaped into seven sprints. The reshape splits the most ambitious sprint (the old Sprint 03, which bundled mode router + clarify + plan + DoD + verify/fix into one) and reorders so the user's actual pain point lands sooner. Sprint scaffolding lives under:
.docs/sprints/index.md.docs/sprints/sprint00.md— Foundation, Measurement, and Parity Harness.docs/sprints/sprint01.md— Turn Engine, Tool Contract, and Capability Profiles.docs/sprints/sprint02.md— Definition of Done and Verify/Fix Loop.docs/sprints/sprint03.md— Permission Modes and Tool Lifecycle Hooks.docs/sprints/sprint04.md— Mode Router, Clarify, and Plan Artifacts.docs/sprints/sprint05.md— Session State, Memory, and Compaction.docs/sprints/sprint06.md— Doctor, Explore, Status, and Tool Surface Expansion
Recommended next move
Start with Sprint 00, and start Sprint 00 with the failing regression test.
Reason:
- Loader needs a measurable baseline and a safer runtime before adding more behavior
- the
tool_call_idbug atagent/loop.py:885,906is proof that untested code paths are silently broken - writing the failing test first proves both the bug and the harness in one move
- otherwise every feature sprint will be built on unstable agent semantics
The execution phase should then be:
- lock down the runtime and test harness (Sprint 00)
- replace the loop with a typed runtime and capability profiles (Sprint 01)
- define and enforce the completion contract (Sprint 02)
- add the policy-based safety layer with hooks (Sprint 03)
- add workflow modes and planning artifacts on top (Sprint 04)
- then widen the durability and product surfaces (Sprints 05 and 06)
Plan adjustments after deeper review
The following changes were applied to the original report after a firsthand validation pass against the actual code in refs/claw-code and refs/oh-my-codex, plus spot-checks of Loader's runtime.
Verified directly against the code
tool_call_idbug confirmed atsrc/loader/agent/loop.py:885and:906. Both call sites constructMessage(role=Role.TOOL, content=..., tool_call_id=tool_call.id), butMessage(src/loader/llm/base.py:33-39) has no such field. They live on the duplicate-suppression and pre-validation branches and would crash on first execution. Zero integration coverage.- Pytest discovery is broken by default.
uv run pytest --collect-onlypicks uprefs/claw-code/tests/test_porting_workspace.pyand fails to importloaderbecause there is notool.pytest.ini_optionsblock inpyproject.toml. - Loop monolith confirmed by line counts.
agent/loop.pyis 1929 LOC,agent/reasoning.pyis 1196,agent/safeguards.pyis 1079 — roughly 4200 lines of orchestration in one cluster. - claw-code's
run_turn()shape is exactly as the report describes. Read directly atrefs/claw-code/rust/crates/runtime/src/conversation.rs:295-470. Typed message build → tool extraction → pre-hook → permission check → execute → post-hook (success or failure variant) → typedConversationMessage::tool_result()→ push → repeat. ~175 lines of clean code. - claw-code permission modes are
ReadOnly/WorkspaceWrite/DangerFullAccess(plusPromptandAllow), defined atrefs/claw-code/rust/crates/runtime/src/permissions.rs:8-27. The 10MB read/write caps, binary detection, workspace boundary check, and structured patch outputs infile_ops.rsare all real. - claw-code hooks are
PreToolUse/PostToolUse/PostToolUseFailure, defined atrefs/claw-code/rust/crates/runtime/src/hooks.rs:19-34and wired into the conversation loop at lines 371, 427-453. - OMX skills are real and even more rigorous than the report described.
ralplanenforces a max-5-iteration Critic loop with sequential Architect→Critic ordering.ralphhas explicit phase enums (starting/executing/verifying/fixing/complete/failed/cancelled) persisted viastate_writeto.omx/state/{mode}-state.json. The verifier insrc/verification/verifier.tsscales by task size with concrete file-count thresholds.
Corrected facts
- Tool count: 49, not 40.
refs/claw-code/rust/crates/tools/src/lib.rsexposes 49ToolSpecentries inmvp_tool_specs(). Doesn't change the lesson, but worth knowing. - claw-code permissions have a third layer. Beyond
PermissionModeand per-tool requirements,PermissionPolicycarries three rule lists (allow_rules,deny_rules,ask_rules) for context-specific overrides. Loader can land the mode layer first and defer the rule layer. - claw-code summary compression is sophisticated. It's not message-level truncation — it's line-level prioritization with deduplication and budget enforcement at
refs/claw-code/rust/crates/runtime/src/summary_compression.rs. Sprint 05 should model on this rather than reinventing.
Structural plan changes
- The old Sprint 03 was split. It bundled mode router + clarify + plan + DoD + verify/fix into one sprint, which is essentially "ralplan + ralph + verifier" simultaneously. The DoD/verify-fix half became the new Sprint 02 (highest-leverage behavioral fix). The mode router / clarify / plan half became the new Sprint 04.
- The old Sprint 02 (permissions) became the new Sprint 03 and was reordered to land after DoD. Permissions are a safety win, not a behavior win, and the user's actual complaints are about behavior. DoD lands first.
- Hooks landed in the same sprint as permissions. The original plan split them across sprints; that creates rework because every later runtime addition (verification, observability, validation) wants the same lifecycle. Sprint 03 owns both.
- Capability profiles became a Sprint 01 deliverable. They were Target 12 in the original report and orphaned from the sprint plan. They belong in the runtime layer and are critical for the user's "behave consistently across model choices" goal.
- The minimum
.loader/directory shape moves to Sprint 02 (just.loader/dod/). The full session/memory/compaction layout stays in Sprint 05. This unblocks Sprint 02 and Sprint 04 from waiting on Sprint 05. TodoWriteandAskUserQuestionmove from Sprint 06 to Sprint 04 as prerequisites for the clarify mode and the "zero pending tasks" gate. The broad tool-surface expansion stays in Sprint 06.- Sprint 00's first deliverable is now the failing regression test for the
tool_call_idbug, before any harness work. It proves the bug and proves the harness exist in one move.
View source
| 1 | # Loader Deep Dive: Gaps, Strengths, and a Path Toward Claw-Like Behavior |
| 2 | |
| 3 | Date: 2026-04-06 |
| 4 | |
| 5 | ## Scope and assumptions |
| 6 | |
| 7 | This report compares three things: |
| 8 | |
| 9 | 1. `Loader` itself |
| 10 | 2. `refs/claw-code`, using the Rust workspace under `refs/claw-code/rust/` as the canonical runtime |
| 11 | 3. `refs/oh-my-codex` as the workflow-layer parent repo |
| 12 | |
| 13 | Assumption: `oh-my-codex` is the correct “parent repo” for this exercise. That assumption is based on: |
| 14 | |
| 15 | - `refs/claw-code/README.md` |
| 16 | - `refs/claw-code/PHILOSOPHY.md` |
| 17 | - the fact that `refs/claw-code` explicitly describes `src/` as a companion Python/reference workspace, not the primary runtime |
| 18 | |
| 19 | If you meant a different parent, we should rerun the comparison against that repo, but this is a solid first pass. |
| 20 | |
| 21 | ## Executive summary |
| 22 | |
| 23 | Loader has the right instincts but is operating at the wrong layer. |
| 24 | |
| 25 | The codebase already knows that models need: |
| 26 | |
| 27 | - planning help |
| 28 | - recovery help |
| 29 | - confidence checks |
| 30 | - completion checks |
| 31 | - safe tool use |
| 32 | |
| 33 | But Loader mostly tries to enforce those after the model has already started drifting. `claw-code` and `oh-my-codex` get better behavior because they shape the work before, during, and after the model call: |
| 34 | |
| 35 | - before: explicit mode selection, clarification, approved planning artifacts |
| 36 | - during: durable runtime state, richer tool surface, explicit permission model, session persistence |
| 37 | - after: verification protocols, completion gates, retry/fix loops, parity harnesses, operator diagnostics |
| 38 | |
| 39 | The biggest lesson is not “copy their prompt.” |
| 40 | |
| 41 | The biggest lesson is: |
| 42 | |
| 43 | > Loader needs a stronger execution contract, not just stronger prompting. |
| 44 | |
| 45 | If we want Loader to feel closer to `claw-code` regardless of model choice, the highest-leverage work is: |
| 46 | |
| 47 | 1. replace the monolithic heuristic loop with a typed turn engine |
| 48 | 2. add durable workflow/state artifacts |
| 49 | 3. make “definition of done” evidence-based instead of heuristic |
| 50 | 4. add real permission/safety boundaries around tools |
| 51 | 5. build a parity harness so we can improve behavior intentionally |
| 52 | |
| 53 | ## Method |
| 54 | |
| 55 | I reviewed: |
| 56 | |
| 57 | - Loader source under `src/loader/` |
| 58 | - Loader tests under `tests/` |
| 59 | - `refs/claw-code/README.md` |
| 60 | - `refs/claw-code/USAGE.md` |
| 61 | - `refs/claw-code/PARITY.md` |
| 62 | - `refs/claw-code/PHILOSOPHY.md` |
| 63 | - `refs/claw-code/rust/crates/runtime/*` |
| 64 | - `refs/claw-code/rust/crates/tools/src/lib.rs` |
| 65 | - `refs/oh-my-codex/README.md` |
| 66 | - `refs/oh-my-codex/AGENTS.md` |
| 67 | - `refs/oh-my-codex/skills/deep-interview/SKILL.md` |
| 68 | - `refs/oh-my-codex/skills/ralplan/SKILL.md` |
| 69 | - `refs/oh-my-codex/skills/ralph/SKILL.md` |
| 70 | - `refs/oh-my-codex/src/modes/base.ts` |
| 71 | - `refs/oh-my-codex/src/ralplan/runtime.ts` |
| 72 | - `refs/oh-my-codex/src/mcp/memory-server.ts` |
| 73 | - `refs/oh-my-codex/src/verification/verifier.ts` |
| 74 | - `refs/oh-my-codex/src/cli/doctor.ts` |
| 75 | - `refs/oh-my-codex/src/scripts/notify-hook.ts` |
| 76 | |
| 77 | I also ran Loader verification commands: |
| 78 | |
| 79 | - `uv run pytest` |
| 80 | - failed during collection |
| 81 | - discovered `refs/claw-code/tests/*` |
| 82 | - also failed to import `loader` |
| 83 | - `uv run --with pytest --with pytest-asyncio python -m pytest tests -q` |
| 84 | - 56 passed |
| 85 | - 3 failed |
| 86 | |
| 87 | That matters because some of Loader’s runtime paths are clearly under-tested. |
| 88 | |
| 89 | ## What Loader already does well |
| 90 | |
| 91 | ### 1. Loader is small, understandable, and hackable |
| 92 | |
| 93 | This is a real advantage. |
| 94 | |
| 95 | `src/loader/` is about 55 source files, and the core agent behavior is easy to locate. Compared to `claw-code` and especially OMX, Loader is much easier to refactor aggressively. |
| 96 | |
| 97 | ### 2. Loader is genuinely local-first |
| 98 | |
| 99 | The Ollama-first posture is simple and useful. A lot of the complexity in `claw-code` and OMX comes from supporting broad operational surfaces, multiple runtimes, OAuth, MCP, tmux/team flows, and richer tool ecosystems. Loader can keep its local-first identity while still copying the good execution ideas. |
| 100 | |
| 101 | ### 3. Loader already contains the seeds of a better system |
| 102 | |
| 103 | These are the right instincts: |
| 104 | |
| 105 | - project context detection in `src/loader/context/project.py` |
| 106 | - runtime safeguards in `src/loader/agent/safeguards.py` |
| 107 | - recovery categorization in `src/loader/agent/recovery.py` |
| 108 | - optional decomposition / critique / confidence / verification / completion checks in `src/loader/agent/reasoning.py` |
| 109 | - a decent Textual app in `src/loader/ui/app.py` |
| 110 | |
| 111 | The problem is not that Loader lacks ideas. |
| 112 | |
| 113 | The problem is that these ideas are bolted onto one big runtime loop instead of being elevated into the architecture. |
| 114 | |
| 115 | ### 4. The TUI is a meaningful strength |
| 116 | |
| 117 | Loader’s TUI already gives you: |
| 118 | |
| 119 | - model selection |
| 120 | - streaming output |
| 121 | - approval handling |
| 122 | - status line updates |
| 123 | - tool widgets |
| 124 | |
| 125 | That is more product surface than many small local agents. It is worth keeping. |
| 126 | |
| 127 | ## Where Loader is weak today |
| 128 | |
| 129 | ### 1. Loader’s product surface is not trustworthy yet |
| 130 | |
| 131 | The most visible sign is the README: |
| 132 | |
| 133 | - `README.md:1-2` still says “FortranGoingOnForty” and “A tutorial on using Fortran for beginners.” |
| 134 | |
| 135 | That looks small, but it reflects a bigger problem: Loader is missing operational polish and self-diagnosis. `claw-code` and OMX both treat installability, health checks, and discoverability as product requirements. Loader currently feels like an experiment more than a tool. |
| 136 | |
| 137 | ### 2. Loader’s main runtime is too monolithic and too heuristic |
| 138 | |
| 139 | `src/loader/agent/loop.py` is the heart of Loader, and it is doing too much: |
| 140 | |
| 141 | - prompt construction |
| 142 | - streaming output handling |
| 143 | - raw tool-call extraction |
| 144 | - duplicate tool execution flows |
| 145 | - recovery |
| 146 | - validation |
| 147 | - rollback tracking |
| 148 | - completion nudging |
| 149 | - loop detection |
| 150 | - steering |
| 151 | - partial planning |
| 152 | - decomposition |
| 153 | |
| 154 | The result is a loop that is hard to reason about and easy to destabilize. |
| 155 | |
| 156 | The core design smell is that Loader tries to recover from model misbehavior in-place instead of enforcing a stronger turn protocol. |
| 157 | |
| 158 | ### 3. Loader has a real runtime contract bug in tool-result handling |
| 159 | |
| 160 | **Verified directly against the code.** There is a concrete mismatch between `Message` and the loop: |
| 161 | |
| 162 | - `src/loader/llm/base.py:33-39` defines `Message` with `role`, `content`, `tool_calls`, and `tool_results`. There is no `tool_call_id` field on `Message` — that field belongs to the separate `ToolResult` dataclass at `src/loader/llm/base.py:25-30`. |
| 163 | - `src/loader/agent/loop.py:885` and `src/loader/agent/loop.py:906` both construct `Message(role=Role.TOOL, content=..., tool_call_id=tool_call.id)`. |
| 164 | |
| 165 | Both call sites will raise `TypeError: Message.__init__() got an unexpected keyword argument 'tool_call_id'` the moment they execute. They live on the duplicate-suppression and pre-validation branches of the loop, which means they have **zero** integration coverage today. This single bug is the proof that the test harness gap is real and that Sprint 00 must precede any behavioral work. |
| 166 | |
| 167 | ### 4. Loader duplicates tool execution logic instead of centralizing it |
| 168 | |
| 169 | There are effectively two execution paths: |
| 170 | |
| 171 | - the normal native/ReAct tool path |
| 172 | - the “raw JSON extracted tool call” path |
| 173 | |
| 174 | Those paths duplicate: |
| 175 | |
| 176 | - duplicate checking |
| 177 | - validation |
| 178 | - confirmation behavior |
| 179 | - result recording |
| 180 | - loop/error handling |
| 181 | |
| 182 | That makes behavior inconsistent and increases the chance that fixes in one path never land in the other. |
| 183 | |
| 184 | `claw-code`’s `ConversationRuntime::run_turn()` is much tighter: receive assistant output, extract tool uses, authorize, execute, append tool results, repeat. |
| 185 | |
| 186 | ### 5. Loader’s system prompt is too shallow and too rigid |
| 187 | |
| 188 | `src/loader/agent/prompts.py:148-208` gives Loader a generic “use tools immediately / no code blocks / no numbered steps / read files before editing” prompt. |
| 189 | |
| 190 | This is too blunt. |
| 191 | |
| 192 | Problems: |
| 193 | |
| 194 | - it treats all tasks like immediate tool-execution tasks |
| 195 | - it globally bans numbered steps, which is bad for planning/reporting tasks |
| 196 | - it does not define modes |
| 197 | - it does not encode verification expectations |
| 198 | - it does not encode completion criteria |
| 199 | - it does not distinguish “clarify”, “plan”, “execute”, and “verify” |
| 200 | |
| 201 | OMX is much better here. It does not just say “do the task.” It routes the task into a workflow lane with an explicit contract. |
| 202 | |
| 203 | ### 6. Loader’s tool surface is too thin |
| 204 | |
| 205 | Loader has 6 default tools: |
| 206 | |
| 207 | - `read` |
| 208 | - `write` |
| 209 | - `edit` |
| 210 | - `glob` |
| 211 | - `bash` |
| 212 | - `grep` |
| 213 | |
| 214 | That is enough for toy execution, but not enough for strong agent behavior. |
| 215 | |
| 216 | What is missing compared to `claw-code` / OMX: |
| 217 | |
| 218 | - task/todo tracking |
| 219 | - structured ask-user surfaces |
| 220 | - memory/notepad |
| 221 | - doctor/status/session tooling |
| 222 | - git-aware helpers |
| 223 | - explore vs full-execution split |
| 224 | - diff/patch-aware editing |
| 225 | - web/search/fetch surfaces |
| 226 | - structured output surfaces |
| 227 | - subagent/team coordination surfaces |
| 228 | - MCP-backed state and memory |
| 229 | |
| 230 | The result is that Loader has to keep too much in the prompt and too much in ephemeral model state. |
| 231 | |
| 232 | ### 7. Loader’s safety model is primitive |
| 233 | |
| 234 | Loader’s current protection model is mostly: |
| 235 | |
| 236 | - “safe commands” vs “ask for confirmation” |
| 237 | - destructive tool flags |
| 238 | |
| 239 | Problems in practice: |
| 240 | |
| 241 | - no permission modes like `read-only`, `workspace-write`, `danger-full-access` |
| 242 | - no strong workspace boundary checks |
| 243 | - no binary-file guards |
| 244 | - no file size limits |
| 245 | - no symlink escape protection |
| 246 | - no command semantics beyond a short safe list |
| 247 | |
| 248 | Evidence: |
| 249 | |
| 250 | - `src/loader/tools/file_tools.py` reads/writes resolved paths directly |
| 251 | - `src/loader/tools/shell_tools.py` uses `create_subprocess_shell()` on arbitrary shell strings |
| 252 | - `src/loader/tools/shell_tools.py:13-20` uses a short safe command set, but no mode-based authorization model |
| 253 | |
| 254 | By comparison, `claw-code` has: |
| 255 | |
| 256 | - `PermissionPolicy` |
| 257 | - `PermissionEnforcer` |
| 258 | - workspace boundary checks |
| 259 | - binary/size guards in file ops |
| 260 | - permission-mode aware tool definitions |
| 261 | |
| 262 | That does not just make it safer. It makes the agent more predictable. |
| 263 | |
| 264 | ### 8. Loader’s “definition of done” is heuristic, not contractual |
| 265 | |
| 266 | The user complaint about “spending too long on simple tasks or finishing early without followup” is visible directly in the code. |
| 267 | |
| 268 | Loader’s current strategy is: |
| 269 | |
| 270 | - heuristically decide whether the response looks premature |
| 271 | - nudge the model to continue |
| 272 | - maybe ask it to confirm completion |
| 273 | |
| 274 | See: |
| 275 | |
| 276 | - `src/loader/agent/reasoning.py:721-854` |
| 277 | |
| 278 | This is well-intentioned, but it is still guesswork. |
| 279 | |
| 280 | It does not require: |
| 281 | |
| 282 | - explicit acceptance criteria |
| 283 | - a verification plan |
| 284 | - fresh command evidence |
| 285 | - zero pending tasks |
| 286 | - a final sign-off phase |
| 287 | |
| 288 | OMX’s `ralph` workflow does. |
| 289 | |
| 290 | That difference is enormous. |
| 291 | |
| 292 | ### 9. Loader has no durable workflow state |
| 293 | |
| 294 | Loader has plans, decomposition, and completion logic, but they live inside one run and disappear. |
| 295 | |
| 296 | Missing pieces: |
| 297 | |
| 298 | - persisted mode state |
| 299 | - session memory |
| 300 | - approved plan artifacts |
| 301 | - PRD / test-spec artifacts |
| 302 | - progress ledger |
| 303 | - durable “what was already decided” |
| 304 | - resume-safe task state |
| 305 | |
| 306 | OMX writes state under `.omx/` and uses that to keep the workflow coherent across retries, handoffs, and interruptions. Loader currently depends on in-memory context plus prompt history only. |
| 307 | |
| 308 | ### 10. Loader is too backend-specific and too capability-fragile |
| 309 | |
| 310 | Despite defining an abstract LLM backend, Loader is effectively Ollama-only today. |
| 311 | |
| 312 | Evidence: |
| 313 | |
| 314 | - `src/loader/cli/main.py` supports only `ollama` |
| 315 | - `src/loader/llm/ollama.py` hardcodes native tool support by model-name substring matching |
| 316 | |
| 317 | This is fragile for behavior matching “with any model chosen.” |
| 318 | |
| 319 | What Loader needs instead is: |
| 320 | |
| 321 | - a provider-independent tool-calling contract |
| 322 | - explicit capability profiles |
| 323 | - distinct fallback strategies for native tools vs text tool calling |
| 324 | - prompts/workflows that degrade gracefully |
| 325 | |
| 326 | ### 11. Loader’s tests are not protecting the real runtime |
| 327 | |
| 328 | Loader’s test suite is mostly: |
| 329 | |
| 330 | - tool unit tests |
| 331 | - parsing tests |
| 332 | - recovery tests |
| 333 | |
| 334 | That is useful, but insufficient. |
| 335 | |
| 336 | The current state: |
| 337 | |
| 338 | - `uv run pytest` fails by default after adding `refs/` |
| 339 | - the repo does not scope pytest discovery |
| 340 | - the “normal” targeted run needs `--with pytest --with pytest-asyncio` |
| 341 | - even then, 3 tests fail |
| 342 | - there are no strong turn-loop integration tests |
| 343 | - there is no deterministic mock backend harness comparable to `claw-code` |
| 344 | |
| 345 | This is why structural issues like the `tool_call_id` mismatch can survive. |
| 346 | |
| 347 | ## What `claw-code` gets right |
| 348 | |
| 349 | ## 1. The runtime contract is explicit |
| 350 | |
| 351 | `refs/claw-code/rust/crates/runtime/src/conversation.rs` is the biggest thing Loader should study. |
| 352 | |
| 353 | The core `run_turn()` flow is clean: |
| 354 | |
| 355 | 1. append user message to session |
| 356 | 2. stream assistant response |
| 357 | 3. build a typed assistant message |
| 358 | 4. extract tool uses |
| 359 | 5. run permission checks |
| 360 | 6. execute tool |
| 361 | 7. append tool result message |
| 362 | 8. repeat until no more tool uses |
| 363 | 9. optionally compact session |
| 364 | 10. return a typed turn summary |
| 365 | |
| 366 | That is much more trustworthy than Loader’s current “stream + parse + filter + maybe reparse + maybe extract raw JSON + maybe duplicate path” approach. |
| 367 | |
| 368 | ## 2. Session persistence and compaction are first-class |
| 369 | |
| 370 | `claw-code` treats long-lived sessions as a product feature: |
| 371 | |
| 372 | - persisted sessions |
| 373 | - resume support |
| 374 | - usage tracking |
| 375 | - compaction thresholds |
| 376 | - summarized continuation messages |
| 377 | |
| 378 | Relevant files: |
| 379 | |
| 380 | - `refs/claw-code/rust/crates/runtime/src/conversation.rs` |
| 381 | - `refs/claw-code/rust/crates/runtime/src/compact.rs` |
| 382 | - `refs/claw-code/rust/crates/runtime/src/summary_compression.rs` |
| 383 | - `refs/claw-code/rust/crates/runtime/src/usage.rs` |
| 384 | |
| 385 | This matters because good agent behavior is often continuity behavior. |
| 386 | |
| 387 | ## 3. Permissions are part of the runtime, not just UI confirmation |
| 388 | |
| 389 | `claw-code` has an actual permission model with three layers: |
| 390 | |
| 391 | - **Mode layer** — `PermissionMode` enum with `ReadOnly`, `WorkspaceWrite`, `DangerFullAccess`, `Prompt`, and `Allow` (`refs/claw-code/rust/crates/runtime/src/permissions.rs:8-27`) |
| 392 | - **Per-tool requirement layer** — every `ToolSpec` declares the minimum mode it requires, mapped in `PermissionPolicy.tool_requirements` |
| 393 | - **Rule layer** — three rule lists (`allow_rules`, `deny_rules`, `ask_rules`) for context-specific overrides on top of the mode/requirement check |
| 394 | |
| 395 | Plus typed authorization outcomes, file-write boundary logic, and bash gating. |
| 396 | |
| 397 | Relevant files: |
| 398 | |
| 399 | - `refs/claw-code/rust/crates/runtime/src/permission_enforcer.rs` |
| 400 | - `refs/claw-code/rust/crates/runtime/src/permissions.rs` |
| 401 | |
| 402 | Loader needs this badly. The mode layer alone is the high-leverage start; the rule layer can come later. |
| 403 | |
| 404 | ## 4. File and shell operations are engineered, not just exposed |
| 405 | |
| 406 | `claw-code`’s file layer includes: |
| 407 | |
| 408 | - max read size |
| 409 | - max write size |
| 410 | - binary detection |
| 411 | - workspace-boundary validation |
| 412 | - structured patch outputs |
| 413 | |
| 414 | Relevant file: |
| 415 | |
| 416 | - `refs/claw-code/rust/crates/runtime/src/file_ops.rs` |
| 417 | |
| 418 | Loader’s file tools are functional, but too permissive and too simplistic to support strong autonomous behavior. |
| 419 | |
| 420 | ## 5. Hooks and lifecycle surfaces give the runtime escape valves |
| 421 | |
| 422 | `claw-code` has pre-tool and post-tool hooks, including failure hooks. |
| 423 | |
| 424 | That is important because not every behavioral improvement should live inside the model prompt. Hooks let the system inject policy, observability, and guardrails without changing the LLM call itself. |
| 425 | |
| 426 | Relevant files: |
| 427 | |
| 428 | - `refs/claw-code/rust/crates/runtime/src/hooks.rs` |
| 429 | - `refs/claw-code/rust/crates/runtime/src/conversation.rs` |
| 430 | |
| 431 | ## 6. The project is honest about parity and weaknesses |
| 432 | |
| 433 | `refs/claw-code/PARITY.md` is one of the best engineering lessons in the whole comparison. |
| 434 | |
| 435 | It does three things Loader does not yet do: |
| 436 | |
| 437 | - names what is actually shipped |
| 438 | - names what is still shallow or stubbed |
| 439 | - ties roadmap claims to concrete evidence |
| 440 | |
| 441 | That alone reduces thrash. |
| 442 | |
| 443 | Loader needs a similar parity/backlog document for runtime behavior. |
| 444 | |
| 445 | ## 7. Diagnostics and operator surfaces are part of the product |
| 446 | |
| 447 | `claw-code` exposes operational commands like: |
| 448 | |
| 449 | - `status` |
| 450 | - `sandbox` |
| 451 | - `agents` |
| 452 | - `mcp` |
| 453 | - `skills` |
| 454 | - `doctor` |
| 455 | - session resume |
| 456 | |
| 457 | This is not just convenience. It makes the system inspectable. Loader currently hides too much inside the runtime. |
| 458 | |
| 459 | ## Where `claw-code` is still incomplete |
| 460 | |
| 461 | It is worth staying honest here too. |
| 462 | |
| 463 | Even `claw-code` admits some shallowness in `PARITY.md`: |
| 464 | |
| 465 | - some surfaces are registry-backed approximations, not deep external integrations |
| 466 | - session compaction parity is still open |
| 467 | - token accounting accuracy is still open |
| 468 | - some tool surfaces remain shallow or partially stubbed |
| 469 | |
| 470 | That is useful because the goal is not blind imitation. The goal is to copy the parts that most affect day-to-day behavior. |
| 471 | |
| 472 | ## What OMX adds that Loader is currently missing almost entirely |
| 473 | |
| 474 | `claw-code` gives a better runtime. OMX gives a better workflow. |
| 475 | |
| 476 | This is where most of Loader’s “definition of done” and “follow-through” problems are answered. |
| 477 | |
| 478 | ### 1. Clarification is a mode, not an ad hoc question |
| 479 | |
| 480 | `deep-interview` is not “ask a question if confused.” |
| 481 | |
| 482 | It is a formal ambiguity-reduction workflow with: |
| 483 | |
| 484 | - a context snapshot |
| 485 | - one-question rounds |
| 486 | - ambiguity scoring |
| 487 | - explicit non-goals |
| 488 | - explicit decision boundaries |
| 489 | - a crystallized artifact for downstream execution |
| 490 | |
| 491 | Relevant files: |
| 492 | |
| 493 | - `refs/oh-my-codex/skills/deep-interview/SKILL.md` |
| 494 | |
| 495 | Loader currently has no equivalent. It either acts immediately or tries to self-nudge mid-flight. |
| 496 | |
| 497 | ### 2. Planning is artifact-based and consensus-based |
| 498 | |
| 499 | `ralplan` is much more than “make a numbered list.” |
| 500 | |
| 501 | It includes: |
| 502 | |
| 503 | - Planner / Architect / Critic loops |
| 504 | - max iteration handling |
| 505 | - planning completion gates |
| 506 | - PRD and test-spec artifacts |
| 507 | - approved handoff into execution |
| 508 | |
| 509 | Relevant files: |
| 510 | |
| 511 | - `refs/oh-my-codex/skills/ralplan/SKILL.md` |
| 512 | - `refs/oh-my-codex/src/ralplan/runtime.ts` |
| 513 | - `refs/oh-my-codex/src/planning/artifacts.ts` |
| 514 | |
| 515 | Loader’s `Plan` object is fine as a local helper, but it is nowhere near this level of control. |
| 516 | |
| 517 | ### 3. “Done” is a workflow contract in Ralph |
| 518 | |
| 519 | This is the single biggest lesson for Loader. |
| 520 | |
| 521 | Ralph encodes: |
| 522 | |
| 523 | - persistence until done |
| 524 | - mandatory verification |
| 525 | - architect verification |
| 526 | - retry/fix loops |
| 527 | - state transitions |
| 528 | - explicit cleanup on completion |
| 529 | - a final checklist |
| 530 | |
| 531 | Relevant file: |
| 532 | |
| 533 | - `refs/oh-my-codex/skills/ralph/SKILL.md` |
| 534 | |
| 535 | This directly addresses the exact Loader problems you named: |
| 536 | |
| 537 | - weak tool follow-through |
| 538 | - finishing too early |
| 539 | - spending too long in loops |
| 540 | - poor task closure |
| 541 | |
| 542 | ### 4. Workflow state lives outside the prompt |
| 543 | |
| 544 | OMX stores durable mode state under `.omx/` and exposes it through state tools. |
| 545 | |
| 546 | Relevant files: |
| 547 | |
| 548 | - `refs/oh-my-codex/src/modes/base.ts` |
| 549 | - `refs/oh-my-codex/src/mcp/state-server.ts` |
| 550 | - `refs/oh-my-codex/src/mcp/memory-server.ts` |
| 551 | |
| 552 | That means: |
| 553 | |
| 554 | - progress survives interruptions |
| 555 | - execution can be resumed |
| 556 | - handoffs are grounded |
| 557 | - context can be audited |
| 558 | - the model does not have to remember everything itself |
| 559 | |
| 560 | ### 5. Memory and notepad are explicit tools |
| 561 | |
| 562 | OMX has project memory and a notepad. |
| 563 | |
| 564 | That sounds small, but it matters a lot for agent stability. It gives the system somewhere to store: |
| 565 | |
| 566 | - conventions |
| 567 | - known build commands |
| 568 | - temporary working notes |
| 569 | - durable directives |
| 570 | |
| 571 | Relevant file: |
| 572 | |
| 573 | - `refs/oh-my-codex/src/mcp/memory-server.ts` |
| 574 | |
| 575 | Loader currently rediscovers too much per turn. |
| 576 | |
| 577 | ### 6. Verification is standardized |
| 578 | |
| 579 | OMX has verification instructions that scale by task size and explicitly require evidence. |
| 580 | |
| 581 | Relevant file: |
| 582 | |
| 583 | - `refs/oh-my-codex/src/verification/verifier.ts` |
| 584 | |
| 585 | Loader has completion heuristics. OMX has verification policy. |
| 586 | |
| 587 | That is the difference between “the model sounded done” and “the system proved done.” |
| 588 | |
| 589 | ### 7. Doctor / explore / sparkshell reduce prompt waste |
| 590 | |
| 591 | OMX distinguishes: |
| 592 | |
| 593 | - health checking (`doctor`) |
| 594 | - lightweight read-only exploration (`explore`) |
| 595 | - bounded shell-native inspection (`sparkshell`) |
| 596 | |
| 597 | That is smart. |
| 598 | |
| 599 | It keeps the main execution loop from becoming the only place everything happens. |
| 600 | |
| 601 | Relevant files: |
| 602 | |
| 603 | - `refs/oh-my-codex/src/cli/doctor.ts` |
| 604 | - `refs/oh-my-codex/src/cli/explore.ts` |
| 605 | - `refs/oh-my-codex/src/cli/sparkshell.ts` |
| 606 | |
| 607 | ### 8. Follow-through is supported outside the agent context window |
| 608 | |
| 609 | The idle notifications, leader nudges, and continuation prompts in OMX are important. |
| 610 | |
| 611 | Relevant file: |
| 612 | |
| 613 | - `refs/oh-my-codex/src/scripts/notify-hook.ts` |
| 614 | |
| 615 | This is one of the deeper design differences: |
| 616 | |
| 617 | - Loader tries to keep the model on-task from inside the loop |
| 618 | - OMX also nudges, monitors, and routes from outside the loop |
| 619 | |
| 620 | That is a more robust design. |
| 621 | |
| 622 | ## Comparison matrix |
| 623 | |
| 624 | | Area | Loader today | `claw-code` | OMX lesson | Takeaway for Loader | |
| 625 | |---|---|---|---|---| |
| 626 | | Runtime loop | monolithic, heuristic-heavy | typed turn engine | separate mode/workflow from turn runtime | split Loader runtime first | |
| 627 | | Tool surface | 6 basic tools | 49 exposed tool specs on main | tools should include workflow/state surfaces | add stateful and diagnostic tools | |
| 628 | | Permissions | confirmation-only | permission policy + enforcer | safety belongs in runtime | add modes and boundaries | |
| 629 | | Completion | heuristic continuation prompt | stronger runtime summaries | Ralph gives evidence-backed done gates | replace “maybe done” with explicit verification | |
| 630 | | Planning | ephemeral numbered list | some plan surfaces | ralplan = persisted, reviewed planning | persist plan artifacts | |
| 631 | | Memory/state | none | sessions + compaction + tracing | `.omx/` mode state + memory | add `.loader/` state dir | |
| 632 | | Diagnostics | minimal | status/sandbox/doctor/session | doctor/explore/sparkshell | make Loader inspectable | |
| 633 | | Testing | unit-heavy, no runtime harness | mock parity harness | workflow runtime is tested like product behavior | build scripted runtime tests | |
| 634 | | Extensibility | none | hooks, plugins, MCP surfaces | workflow and notification hooks | add lifecycle hooks later | |
| 635 | | Multi-agent | none | agent/team surfaces | team + ralph staffing | defer until solo runtime is trustworthy | |
| 636 | |
| 637 | ## Why Loader’s current weaknesses produce the behavior you described |
| 638 | |
| 639 | ### Poor tool use |
| 640 | |
| 641 | Root causes: |
| 642 | |
| 643 | - shallow tool surface |
| 644 | - brittle prompt contract |
| 645 | - native-vs-ReAct bifurcation |
| 646 | - duplicated execution code paths |
| 647 | - no typed runtime contract for tool results |
| 648 | |
| 649 | ### Weak follow-through |
| 650 | |
| 651 | Root causes: |
| 652 | |
| 653 | - no persistent task state |
| 654 | - no approved plan artifact |
| 655 | - no explicit verification lane |
| 656 | - no final completion checklist |
| 657 | |
| 658 | ### Finishing early |
| 659 | |
| 660 | Root causes: |
| 661 | |
| 662 | - completion is heuristic |
| 663 | - no required evidence model |
| 664 | - no acceptance criteria artifact |
| 665 | - no final “prove it” pass |
| 666 | |
| 667 | ### Spending too long on simple tasks |
| 668 | |
| 669 | Root causes: |
| 670 | |
| 671 | - the runtime loop tries too many recoveries in one place |
| 672 | - the system prompt does not distinguish task modes cleanly |
| 673 | - there is no “lightweight inspect” lane like `explore` |
| 674 | - the model often has to infer the workflow instead of being routed into one |
| 675 | |
| 676 | ### Model sensitivity |
| 677 | |
| 678 | Root causes: |
| 679 | |
| 680 | - behavior is prompt-and-heuristic driven |
| 681 | - capability detection is backend-specific and brittle |
| 682 | - no workflow artifacts that survive model variance |
| 683 | |
| 684 | This is why copying OMX’s workflow ideas is so high leverage. It reduces how much we ask the model to improvise. |
| 685 | |
| 686 | ## Concrete implementation targets |
| 687 | |
| 688 | These are ordered by impact on Loader behavior, not by code convenience. |
| 689 | |
| 690 | ### Target 1: Introduce a real turn engine |
| 691 | |
| 692 | Goal: |
| 693 | |
| 694 | - replace the current giant loop with a smaller, typed conversation runtime |
| 695 | |
| 696 | Implementation target: |
| 697 | |
| 698 | - create a new `src/loader/runtime/` package |
| 699 | - move message/session/tool-result logic out of `src/loader/agent/loop.py` |
| 700 | - give tool results a first-class typed representation |
| 701 | - unify native, ReAct, and extracted-tool execution through one executor path |
| 702 | |
| 703 | Why: |
| 704 | |
| 705 | - this is the foundation for every other improvement |
| 706 | |
| 707 | ### Target 2: Add persistent Loader state under `.loader/` |
| 708 | |
| 709 | Goal: |
| 710 | |
| 711 | - make workflow state durable instead of prompt-only |
| 712 | |
| 713 | Implementation target: |
| 714 | |
| 715 | - `.loader/state/` |
| 716 | - `.loader/sessions/` |
| 717 | - `.loader/plans/` |
| 718 | - `.loader/notepad.md` |
| 719 | - `.loader/project-memory.json` |
| 720 | |
| 721 | Why: |
| 722 | |
| 723 | - Loader needs somewhere to store progress, acceptance criteria, and recovered knowledge |
| 724 | |
| 725 | ### Target 3: Separate task modes |
| 726 | |
| 727 | Goal: |
| 728 | |
| 729 | - stop treating all requests like immediate tool-execution requests |
| 730 | |
| 731 | Implementation target: |
| 732 | |
| 733 | - mode router with at least: |
| 734 | - `clarify` |
| 735 | - `plan` |
| 736 | - `execute` |
| 737 | - `verify` |
| 738 | |
| 739 | Why: |
| 740 | |
| 741 | - this is the minimum structure needed to stop overthinking simple work and underthinking complex work |
| 742 | |
| 743 | ### Target 4: Replace heuristic completion with an evidence-backed done contract |
| 744 | |
| 745 | Goal: |
| 746 | |
| 747 | - make completion explicit and testable |
| 748 | |
| 749 | Implementation target: |
| 750 | |
| 751 | - define a `DefinitionOfDone` object per task |
| 752 | - require: |
| 753 | - acceptance criteria |
| 754 | - verification commands |
| 755 | - evidence summary |
| 756 | - zero pending task items |
| 757 | |
| 758 | Why: |
| 759 | |
| 760 | - this is the main fix for premature completion |
| 761 | |
| 762 | ### Target 5: Add `deep-interview`-lite and `ralplan`-lite equivalents |
| 763 | |
| 764 | Goal: |
| 765 | |
| 766 | - pull ambiguity reduction and planning review out of the middle of execution |
| 767 | |
| 768 | Implementation target: |
| 769 | |
| 770 | - `clarify` mode writes a task brief |
| 771 | - `plan` mode writes: |
| 772 | - a short implementation plan |
| 773 | - a test/verification plan |
| 774 | |
| 775 | Do not try to copy every OMX feature immediately. Copy the artifact discipline first. |
| 776 | |
| 777 | ### Target 6: Build a real permission model |
| 778 | |
| 779 | Goal: |
| 780 | |
| 781 | - move from confirmation prompts to policy-based authorization |
| 782 | |
| 783 | Implementation target: |
| 784 | |
| 785 | - permission modes: |
| 786 | - `read-only` |
| 787 | - `workspace-write` |
| 788 | - `danger-full-access` |
| 789 | - tool specs declare required permission |
| 790 | - file writes enforce workspace boundaries |
| 791 | - shell commands go through command classification |
| 792 | |
| 793 | Why: |
| 794 | |
| 795 | - this is both safety and behavior quality |
| 796 | |
| 797 | ### Target 7: Harden file and shell tools |
| 798 | |
| 799 | Goal: |
| 800 | |
| 801 | - make tool use trustworthy enough for automation |
| 802 | |
| 803 | Implementation target: |
| 804 | |
| 805 | - size limits |
| 806 | - binary detection |
| 807 | - symlink/traversal protection |
| 808 | - structured patch/diff return values |
| 809 | - shell command semantics and mutability classification |
| 810 | |
| 811 | ### Target 8: Add `loader doctor`, `loader status`, and `loader session` |
| 812 | |
| 813 | Goal: |
| 814 | |
| 815 | - make Loader operable as a product |
| 816 | |
| 817 | Implementation target: |
| 818 | |
| 819 | - backend health |
| 820 | - model capability snapshot |
| 821 | - workspace detection |
| 822 | - write-access detection |
| 823 | - test/build command detection |
| 824 | - active session summary |
| 825 | |
| 826 | Why: |
| 827 | |
| 828 | - better operator feedback means less guesswork in the agent loop |
| 829 | |
| 830 | ### Target 9: Add memory/notepad tools |
| 831 | |
| 832 | Goal: |
| 833 | |
| 834 | - give Loader durable short-term and long-term memory |
| 835 | |
| 836 | Implementation target: |
| 837 | |
| 838 | - read/write project memory |
| 839 | - append working notes |
| 840 | - store user directives and repo conventions |
| 841 | |
| 842 | Why: |
| 843 | |
| 844 | - this reduces re-discovery and improves follow-through across turns |
| 845 | |
| 846 | ### Target 10: Add a lightweight read-only inspect lane |
| 847 | |
| 848 | Goal: |
| 849 | |
| 850 | - avoid using the full agent loop for every lookup |
| 851 | |
| 852 | Implementation target: |
| 853 | |
| 854 | - `loader explore` or equivalent internal mode |
| 855 | - optimized for: |
| 856 | - file/symbol lookup |
| 857 | - pattern discovery |
| 858 | - relationship questions |
| 859 | |
| 860 | Why: |
| 861 | |
| 862 | - simple tasks should stay cheap and fast |
| 863 | |
| 864 | ### Target 11: Add a parity harness |
| 865 | |
| 866 | Goal: |
| 867 | |
| 868 | - improve behavior intentionally instead of impressionistically |
| 869 | |
| 870 | Implementation target: |
| 871 | |
| 872 | - scripted mock backend scenarios for: |
| 873 | - simple read |
| 874 | - multi-tool turn |
| 875 | - denied permission |
| 876 | - write/edit success |
| 877 | - verification-required task |
| 878 | - premature completion rejection |
| 879 | - looped/duplicate action prevention |
| 880 | |
| 881 | Why: |
| 882 | |
| 883 | - this is how Loader becomes reliable |
| 884 | |
| 885 | ### Target 12: Add workflow-aware prompts and capability profiles |
| 886 | |
| 887 | Goal: |
| 888 | |
| 889 | - make Loader less brittle across models |
| 890 | |
| 891 | Implementation target: |
| 892 | |
| 893 | - replace one generic system prompt with mode-specific prompts |
| 894 | - add provider/model capability profiles: |
| 895 | - native tools |
| 896 | - streaming |
| 897 | - context budget |
| 898 | - preferred tool-call format |
| 899 | - verification strictness |
| 900 | |
| 901 | Why: |
| 902 | |
| 903 | - behavior should be shaped by runtime policy, not guessed from model substrings |
| 904 | |
| 905 | ## Priority order |
| 906 | |
| 907 | This section was rewritten after a deeper validation pass against the actual code in `refs/claw-code` and `refs/oh-my-codex`, plus firsthand spot-checks of Loader's runtime. The deeper review confirmed every load-bearing claim in this report and surfaced one structural reorder: **the Definition-of-Done work is the user's actual pain point and should land before permission modes**, not after, because permissions are a safety win and DoD is the behavior win. |
| 908 | |
| 909 | ### P0: Stabilize before changing behavior (Sprint 00) |
| 910 | |
| 911 | - write a failing regression test for the `tool_call_id` bug at `agent/loop.py:885,906` *first*, before any harness work — it proves the bug is real and proves the harness exists in one move |
| 912 | - scope pytest discovery so `refs/` stops contaminating collection |
| 913 | - exclude `refs/` from ruff and mypy too |
| 914 | - make `uv run pytest` work out of the box |
| 915 | - port the scenario taxonomy from `refs/claw-code/rust/crates/rusty-claude-cli/tests/mock_parity_harness.rs` |
| 916 | - rewrite `README.md` (currently still says "FortranGoingOnForty") |
| 917 | - baseline parity checklist for current runtime behavior |
| 918 | |
| 919 | ### P1: Replace the loop with a real runtime (Sprint 01) |
| 920 | |
| 921 | - new `src/loader/runtime/` package with a typed turn engine |
| 922 | - unify the native, ReAct, and "extracted JSON fallback" tool execution paths into one executor |
| 923 | - fix the named bugs from Sprint 00's failing tests (`tool_call_id`, duplicate execution path) |
| 924 | - replace substring-based `NATIVE_TOOL_MODELS`/`NO_TOOL_MODELS` model detection with a `runtime/capabilities.py` profile system — Loader needs to behave consistently across model choices |
| 925 | - structured `TurnSummary` output |
| 926 | |
| 927 | ### P2: The behavior fix the user actually asked for (Sprint 02) |
| 928 | |
| 929 | - `DefinitionOfDone` object per task: acceptance criteria, verification commands, evidence summary, pending/completed task items |
| 930 | - explicit verify phase that runs the verification commands and gates completion on evidence |
| 931 | - fix loop: verification failure returns to execution, not to final answer |
| 932 | - minimum `.loader/` directory shape (`.loader/dod/`) — full session/memory layout deferred to Sprint 05 |
| 933 | |
| 934 | This is the highest-leverage behavioral change in the entire plan and is the direct answer to "finishing too early" and "weak follow-through." |
| 935 | |
| 936 | ### P3: Safety as policy, not as confirmation prompt (Sprint 03) |
| 937 | |
| 938 | - permission modes: `read-only`, `workspace-write`, `danger-full-access` |
| 939 | - three-event tool lifecycle hooks (`pre_tool_use`, `post_tool_use`, `post_tool_use_failure`) modeled directly on `refs/claw-code/rust/crates/runtime/src/hooks.rs` |
| 940 | - refactor `safeguards.py` (duplicate detection, validation, rollback) into pre-tool hook implementations rather than ad-hoc method calls |
| 941 | - file operation hardening (workspace boundary, symlink, size limits, binary detection, structured patches) |
| 942 | - shell operation hardening |
| 943 | - expose active mode in CLI/TUI status |
| 944 | |
| 945 | Hooks land alongside permissions because every later sprint hangs new behavior (verification, validation, observability) on the same lifecycle. |
| 946 | |
| 947 | ### P4: Stop improvising one workflow for everything (Sprint 04) |
| 948 | |
| 949 | - mode router: clarify, plan, execute, verify (verify already exists from Sprint 02) |
| 950 | - clarify artifact written to `.loader/briefs/` |
| 951 | - planning artifacts (implementation plan + verification plan) written to `.loader/plans/` and fed into the existing DoD object |
| 952 | - tool prerequisites pulled forward from Sprint 06: `TodoWrite` (the "zero pending tasks" gate is empty without it) and `AskUserQuestion` (clarify rounds) |
| 953 | |
| 954 | ### P5: Durable continuity (Sprint 05) |
| 955 | |
| 956 | - full `.loader/` state directory under the layout already started in Sprint 02 |
| 957 | - session persistence and resume |
| 958 | - transcript compaction with priority-aware summarization (model the design on `refs/claw-code/rust/crates/runtime/src/summary_compression.rs`) |
| 959 | - memory/notepad surfaces |
| 960 | - usage/cost tracking |
| 961 | |
| 962 | ### P6: Operability and tool-surface expansion (Sprint 06) |
| 963 | |
| 964 | - `loader doctor`, `loader status`, `loader session` |
| 965 | - read-only explore lane |
| 966 | - broader tool surface (diff/patch-aware editing, git helpers, structured ask-user, etc.) — `TodoWrite` and `AskUserQuestion` already exist from Sprint 04 |
| 967 | |
| 968 | ### Deferred indefinitely |
| 969 | |
| 970 | - workflow hooks beyond the runtime tool lifecycle (notification/idle nudges, leader monitoring) |
| 971 | - task/team/subagent orchestration |
| 972 | - broad MCP ecosystem |
| 973 | - richer plugin systems |
| 974 | |
| 975 | These are real wins in `claw-code`/OMX, but Loader should not pursue them until the solo runtime is trustworthy. |
| 976 | |
| 977 | ## What Loader should copy directly, and what it should not |
| 978 | |
| 979 | ### Copy directly |
| 980 | |
| 981 | - typed turn runtime |
| 982 | - permission model |
| 983 | - file/shell hardening |
| 984 | - session persistence |
| 985 | - compaction |
| 986 | - doctor/status/session surfaces |
| 987 | - workflow artifacts |
| 988 | - evidence-backed verification |
| 989 | - parity harness discipline |
| 990 | |
| 991 | ### Copy in simplified form |
| 992 | |
| 993 | - deep-interview |
| 994 | - ralplan |
| 995 | - ralph |
| 996 | - memory/notepad |
| 997 | - explore vs full-execution split |
| 998 | |
| 999 | ### Do not copy blindly yet |
| 1000 | |
| 1001 | - full tmux/team runtime |
| 1002 | - huge command surface |
| 1003 | - Discord/openclaw notification stack |
| 1004 | - broad MCP ecosystem |
| 1005 | |
| 1006 | Loader should first become a trustworthy single-agent local runtime. After that, team orchestration will actually help. |
| 1007 | |
| 1008 | ## Recommended Loader architecture direction |
| 1009 | |
| 1010 | If we want behavior closer to `claw-code` without losing Loader’s simplicity, I would steer toward: |
| 1011 | |
| 1012 | ### Layer 1: Runtime core |
| 1013 | |
| 1014 | - typed `TurnRuntime` |
| 1015 | - `SessionStore` |
| 1016 | - `PermissionPolicy` |
| 1017 | - `ToolExecutor` |
| 1018 | - `VerificationEngine` |
| 1019 | |
| 1020 | ### Layer 2: Workflow layer |
| 1021 | |
| 1022 | - `ClarifyWorkflow` |
| 1023 | - `PlanWorkflow` |
| 1024 | - `ExecuteWorkflow` |
| 1025 | - `VerifyWorkflow` |
| 1026 | |
| 1027 | ### Layer 3: Product surfaces |
| 1028 | |
| 1029 | - TUI |
| 1030 | - CLI |
| 1031 | - `doctor` |
| 1032 | - `status` |
| 1033 | - `session` |
| 1034 | - `explore` |
| 1035 | |
| 1036 | ### Layer 4: Optional future orchestration |
| 1037 | |
| 1038 | - hooks |
| 1039 | - background verification |
| 1040 | - multi-agent/task orchestration |
| 1041 | |
| 1042 | That is a better fit for Loader than trying to clone all of OMX wholesale. |
| 1043 | |
| 1044 | ## Immediate conclusions |
| 1045 | |
| 1046 | 1. Loader’s biggest problems are architectural, not just prompt-related. |
| 1047 | 2. `claw-code` is strongest where Loader is weakest: runtime contract, permissions, sessions, diagnostics, parity. |
| 1048 | 3. OMX is strongest where Loader is currently almost absent: clarification, planning discipline, durable state, completion/verification loops. |
| 1049 | 4. The fastest path to “better model behavior today” is not adding more heuristics. It is adding: |
| 1050 | - workflow artifacts |
| 1051 | - explicit verification |
| 1052 | - persistent state |
| 1053 | - a smaller, more trustworthy turn engine |
| 1054 | |
| 1055 | ## Sprint scaffolding |
| 1056 | |
| 1057 | After the deeper validation pass the original five-sprint plan was reshaped into seven sprints. The reshape splits the most ambitious sprint (the old Sprint 03, which bundled mode router + clarify + plan + DoD + verify/fix into one) and reorders so the user's actual pain point lands sooner. Sprint scaffolding lives under: |
| 1058 | |
| 1059 | - `.docs/sprints/index.md` |
| 1060 | - `.docs/sprints/sprint00.md` — Foundation, Measurement, and Parity Harness |
| 1061 | - `.docs/sprints/sprint01.md` — Turn Engine, Tool Contract, and Capability Profiles |
| 1062 | - `.docs/sprints/sprint02.md` — Definition of Done and Verify/Fix Loop |
| 1063 | - `.docs/sprints/sprint03.md` — Permission Modes and Tool Lifecycle Hooks |
| 1064 | - `.docs/sprints/sprint04.md` — Mode Router, Clarify, and Plan Artifacts |
| 1065 | - `.docs/sprints/sprint05.md` — Session State, Memory, and Compaction |
| 1066 | - `.docs/sprints/sprint06.md` — Doctor, Explore, Status, and Tool Surface Expansion |
| 1067 | |
| 1068 | ## Recommended next move |
| 1069 | |
| 1070 | Start with Sprint 00, and start Sprint 00 with the failing regression test. |
| 1071 | |
| 1072 | Reason: |
| 1073 | |
| 1074 | - Loader needs a measurable baseline and a safer runtime before adding more behavior |
| 1075 | - the `tool_call_id` bug at `agent/loop.py:885,906` is proof that untested code paths are silently broken |
| 1076 | - writing the failing test first proves both the bug and the harness in one move |
| 1077 | - otherwise every feature sprint will be built on unstable agent semantics |
| 1078 | |
| 1079 | The execution phase should then be: |
| 1080 | |
| 1081 | 1. lock down the runtime and test harness (Sprint 00) |
| 1082 | 2. replace the loop with a typed runtime and capability profiles (Sprint 01) |
| 1083 | 3. define and enforce the completion contract (Sprint 02) |
| 1084 | 4. add the policy-based safety layer with hooks (Sprint 03) |
| 1085 | 5. add workflow modes and planning artifacts on top (Sprint 04) |
| 1086 | 6. then widen the durability and product surfaces (Sprints 05 and 06) |
| 1087 | |
| 1088 | ## Plan adjustments after deeper review |
| 1089 | |
| 1090 | The following changes were applied to the original report after a firsthand validation pass against the actual code in `refs/claw-code` and `refs/oh-my-codex`, plus spot-checks of Loader's runtime. |
| 1091 | |
| 1092 | ### Verified directly against the code |
| 1093 | |
| 1094 | - **`tool_call_id` bug confirmed at `src/loader/agent/loop.py:885` and `:906`.** Both call sites construct `Message(role=Role.TOOL, content=..., tool_call_id=tool_call.id)`, but `Message` (`src/loader/llm/base.py:33-39`) has no such field. They live on the duplicate-suppression and pre-validation branches and would crash on first execution. Zero integration coverage. |
| 1095 | - **Pytest discovery is broken by default.** `uv run pytest --collect-only` picks up `refs/claw-code/tests/test_porting_workspace.py` and fails to import `loader` because there is no `tool.pytest.ini_options` block in `pyproject.toml`. |
| 1096 | - **Loop monolith confirmed by line counts.** `agent/loop.py` is 1929 LOC, `agent/reasoning.py` is 1196, `agent/safeguards.py` is 1079 — roughly 4200 lines of orchestration in one cluster. |
| 1097 | - **claw-code's `run_turn()` shape** is exactly as the report describes. Read directly at `refs/claw-code/rust/crates/runtime/src/conversation.rs:295-470`. Typed message build → tool extraction → pre-hook → permission check → execute → post-hook (success or failure variant) → typed `ConversationMessage::tool_result()` → push → repeat. ~175 lines of clean code. |
| 1098 | - **claw-code permission modes** are `ReadOnly` / `WorkspaceWrite` / `DangerFullAccess` (plus `Prompt` and `Allow`), defined at `refs/claw-code/rust/crates/runtime/src/permissions.rs:8-27`. The 10MB read/write caps, binary detection, workspace boundary check, and structured patch outputs in `file_ops.rs` are all real. |
| 1099 | - **claw-code hooks** are `PreToolUse` / `PostToolUse` / `PostToolUseFailure`, defined at `refs/claw-code/rust/crates/runtime/src/hooks.rs:19-34` and wired into the conversation loop at lines 371, 427-453. |
| 1100 | - **OMX skills are real and even more rigorous than the report described.** `ralplan` enforces a max-5-iteration Critic loop with sequential Architect→Critic ordering. `ralph` has explicit phase enums (`starting`/`executing`/`verifying`/`fixing`/`complete`/`failed`/`cancelled`) persisted via `state_write` to `.omx/state/{mode}-state.json`. The verifier in `src/verification/verifier.ts` scales by task size with concrete file-count thresholds. |
| 1101 | |
| 1102 | ### Corrected facts |
| 1103 | |
| 1104 | - **Tool count: 49, not 40.** `refs/claw-code/rust/crates/tools/src/lib.rs` exposes 49 `ToolSpec` entries in `mvp_tool_specs()`. Doesn't change the lesson, but worth knowing. |
| 1105 | - **claw-code permissions have a third layer.** Beyond `PermissionMode` and per-tool requirements, `PermissionPolicy` carries three rule lists (`allow_rules`, `deny_rules`, `ask_rules`) for context-specific overrides. Loader can land the mode layer first and defer the rule layer. |
| 1106 | - **claw-code summary compression is sophisticated.** It's not message-level truncation — it's line-level prioritization with deduplication and budget enforcement at `refs/claw-code/rust/crates/runtime/src/summary_compression.rs`. Sprint 05 should model on this rather than reinventing. |
| 1107 | |
| 1108 | ### Structural plan changes |
| 1109 | |
| 1110 | - **The old Sprint 03 was split.** It bundled mode router + clarify + plan + DoD + verify/fix into one sprint, which is essentially "ralplan + ralph + verifier" simultaneously. The DoD/verify-fix half became the new Sprint 02 (highest-leverage behavioral fix). The mode router / clarify / plan half became the new Sprint 04. |
| 1111 | - **The old Sprint 02 (permissions) became the new Sprint 03** and was reordered to land *after* DoD. Permissions are a safety win, not a behavior win, and the user's actual complaints are about behavior. DoD lands first. |
| 1112 | - **Hooks landed in the same sprint as permissions.** The original plan split them across sprints; that creates rework because every later runtime addition (verification, observability, validation) wants the same lifecycle. Sprint 03 owns both. |
| 1113 | - **Capability profiles became a Sprint 01 deliverable.** They were Target 12 in the original report and orphaned from the sprint plan. They belong in the runtime layer and are critical for the user's "behave consistently across model choices" goal. |
| 1114 | - **The minimum `.loader/` directory shape moves to Sprint 02** (just `.loader/dod/`). The full session/memory/compaction layout stays in Sprint 05. This unblocks Sprint 02 and Sprint 04 from waiting on Sprint 05. |
| 1115 | - **`TodoWrite` and `AskUserQuestion` move from Sprint 06 to Sprint 04** as prerequisites for the clarify mode and the "zero pending tasks" gate. The broad tool-surface expansion stays in Sprint 06. |
| 1116 | - **Sprint 00's first deliverable is now the failing regression test** for the `tool_call_id` bug, before any harness work. It proves the bug and proves the harness exist in one move. |