
Loader Deep Dive: Gaps, Strengths, and a Path Toward Claw-Like Behavior

Date: 2026-04-06

Scope and assumptions

This report compares three things:

  1. Loader itself
  2. refs/claw-code, using the Rust workspace under refs/claw-code/rust/ as the canonical runtime
  3. refs/oh-my-codex as the workflow-layer parent repo

Assumption: oh-my-codex is the correct “parent repo” for this exercise. That assumption is based on:

  • refs/claw-code/README.md
  • refs/claw-code/PHILOSOPHY.md
  • the fact that refs/claw-code explicitly describes src/ as a companion Python/reference workspace, not the primary runtime

If you meant a different parent, we should rerun the comparison against that repo, but this is a solid first pass.

Executive summary

Loader has the right instincts but is operating at the wrong layer.

The codebase already knows that models need:

  • planning help
  • recovery help
  • confidence checks
  • completion checks
  • safe tool use

But Loader mostly tries to enforce those after the model has already started drifting. claw-code and oh-my-codex get better behavior because they shape the work before, during, and after the model call:

  • before: explicit mode selection, clarification, approved planning artifacts
  • during: durable runtime state, richer tool surface, explicit permission model, session persistence
  • after: verification protocols, completion gates, retry/fix loops, parity harnesses, operator diagnostics

The biggest lesson is not “copy their prompt.”

The biggest lesson is:

Loader needs a stronger execution contract, not just stronger prompting.

If we want Loader to feel closer to claw-code regardless of model choice, the highest-leverage work is:

  1. replace the monolithic heuristic loop with a typed turn engine
  2. add durable workflow/state artifacts
  3. make “definition of done” evidence-based instead of heuristic
  4. add real permission/safety boundaries around tools
  5. build a parity harness so we can improve behavior intentionally

Method

I reviewed:

  • Loader source under src/loader/
  • Loader tests under tests/
  • refs/claw-code/README.md
  • refs/claw-code/USAGE.md
  • refs/claw-code/PARITY.md
  • refs/claw-code/PHILOSOPHY.md
  • refs/claw-code/rust/crates/runtime/*
  • refs/claw-code/rust/crates/tools/src/lib.rs
  • refs/oh-my-codex/README.md
  • refs/oh-my-codex/AGENTS.md
  • refs/oh-my-codex/skills/deep-interview/SKILL.md
  • refs/oh-my-codex/skills/ralplan/SKILL.md
  • refs/oh-my-codex/skills/ralph/SKILL.md
  • refs/oh-my-codex/src/modes/base.ts
  • refs/oh-my-codex/src/ralplan/runtime.ts
  • refs/oh-my-codex/src/mcp/memory-server.ts
  • refs/oh-my-codex/src/verification/verifier.ts
  • refs/oh-my-codex/src/cli/doctor.ts
  • refs/oh-my-codex/src/scripts/notify-hook.ts

I also ran Loader verification commands:

  • uv run pytest
    • failed during collection
    • discovered refs/claw-code/tests/*
    • also failed to import loader
  • uv run --with pytest --with pytest-asyncio python -m pytest tests -q
    • 56 passed
    • 3 failed

That matters because some of Loader’s runtime paths are clearly under-tested.

What Loader already does well

1. Loader is small, understandable, and hackable

This is a real advantage.

src/loader/ is about 55 source files, and the core agent behavior is easy to locate. Compared to claw-code and especially OMX, Loader is much easier to refactor aggressively.

2. Loader is genuinely local-first

The Ollama-first posture is simple and useful. A lot of the complexity in claw-code and OMX comes from supporting broad operational surfaces, multiple runtimes, OAuth, MCP, tmux/team flows, and richer tool ecosystems. Loader can keep its local-first identity while still copying the good execution ideas.

3. Loader already contains the seeds of a better system

These are the right instincts:

  • project context detection in src/loader/context/project.py
  • runtime safeguards in src/loader/agent/safeguards.py
  • recovery categorization in src/loader/agent/recovery.py
  • optional decomposition / critique / confidence / verification / completion checks in src/loader/agent/reasoning.py
  • a decent Textual app in src/loader/ui/app.py

The problem is not that Loader lacks ideas.

The problem is that these ideas are bolted onto one big runtime loop instead of being elevated into the architecture.

4. The TUI is a meaningful strength

Loader’s TUI already gives you:

  • model selection
  • streaming output
  • approval handling
  • status line updates
  • tool widgets

That is more product surface than many small local agents. It is worth keeping.

Where Loader is weak today

1. Loader’s product surface is not trustworthy yet

The most visible sign is the README:

  • README.md:1-2 still says “FortranGoingOnForty” and “A tutorial on using Fortran for beginners.”

That looks small, but it reflects a bigger problem: Loader is missing operational polish and self-diagnosis. claw-code and OMX both treat installability, health checks, and discoverability as product requirements. Loader currently feels more like an experiment than a tool.

2. Loader’s main runtime is too monolithic and too heuristic

src/loader/agent/loop.py is the heart of Loader, and it is doing too much:

  • prompt construction
  • streaming output handling
  • raw tool-call extraction
  • duplicate tool execution flows
  • recovery
  • validation
  • rollback tracking
  • completion nudging
  • loop detection
  • steering
  • partial planning
  • decomposition

The result is a loop that is hard to reason about and easy to destabilize.

The core design smell is that Loader tries to recover from model misbehavior in-place instead of enforcing a stronger turn protocol.

3. Loader has a real runtime contract bug in tool-result handling

Verified directly against the code. There is a concrete mismatch between Message and the loop:

  • src/loader/llm/base.py:33-39 defines Message with role, content, tool_calls, and tool_results. There is no tool_call_id field on Message — that field belongs to the separate ToolResult dataclass at src/loader/llm/base.py:25-30.
  • src/loader/agent/loop.py:885 and src/loader/agent/loop.py:906 both construct Message(role=Role.TOOL, content=..., tool_call_id=tool_call.id).

Both call sites will raise TypeError: Message.__init__() got an unexpected keyword argument 'tool_call_id' the moment they execute. They live on the duplicate-suppression and pre-validation branches of the loop, which means they have zero integration coverage today. This single bug is the proof that the test harness gap is real and that Sprint 00 must precede any behavioral work.
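The mismatch is easy to reproduce with a pair of stripped-down dataclasses. This is a sketch mirroring the shapes described above, not the actual Loader definitions; the point is that tool_call_id belongs on the result object, so the fix is to carry the id inside a ToolResult rather than passing it to Message:

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    USER = "user"
    ASSISTANT = "assistant"
    TOOL = "tool"


@dataclass
class ToolResult:
    # tool_call_id lives here, mirroring the separate ToolResult dataclass
    tool_call_id: str
    content: str


@dataclass
class Message:
    role: Role
    content: str
    tool_calls: list = field(default_factory=list)
    tool_results: list[ToolResult] = field(default_factory=list)


# The failing pattern from loop.py:885/906 would raise TypeError:
#   Message(role=Role.TOOL, content="...", tool_call_id=tool_call.id)
# The shape these dataclasses actually support:
def tool_result_message(tool_call_id: str, output: str) -> Message:
    return Message(
        role=Role.TOOL,
        content=output,
        tool_results=[ToolResult(tool_call_id=tool_call_id, content=output)],
    )
```

A regression test for this is exactly the "failing test first" item in Sprint 00.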

4. Loader duplicates tool execution logic instead of centralizing it

There are effectively two execution paths:

  • the normal native/ReAct tool path
  • the “raw JSON extracted tool call” path

Those paths duplicate:

  • duplicate checking
  • validation
  • confirmation behavior
  • result recording
  • loop/error handling

That makes behavior inconsistent and increases the chance that fixes in one path never land in the other.

claw-code’s ConversationRuntime::run_turn() is much tighter: receive assistant output, extract tool uses, authorize, execute, append tool results, repeat.

5. Loader’s system prompt is too shallow and too rigid

src/loader/agent/prompts.py:148-208 gives Loader a generic “use tools immediately / no code blocks / no numbered steps / read files before editing” prompt.

This is too blunt.

Problems:

  • it treats all tasks like immediate tool-execution tasks
  • it globally bans numbered steps, which is bad for planning/reporting tasks
  • it does not define modes
  • it does not encode verification expectations
  • it does not encode completion criteria
  • it does not distinguish “clarify”, “plan”, “execute”, and “verify”

OMX is much better here. It does not just say “do the task.” It routes the task into a workflow lane with an explicit contract.

6. Loader’s tool surface is too thin

Loader has 6 default tools:

  • read
  • write
  • edit
  • glob
  • bash
  • grep

That is enough for toy execution, but not enough for strong agent behavior.

What is missing compared to claw-code / OMX:

  • task/todo tracking
  • structured ask-user surfaces
  • memory/notepad
  • doctor/status/session tooling
  • git-aware helpers
  • explore vs full-execution split
  • diff/patch-aware editing
  • web/search/fetch surfaces
  • structured output surfaces
  • subagent/team coordination surfaces
  • MCP-backed state and memory

The result is that Loader has to keep too much in the prompt and too much in ephemeral model state.

7. Loader’s safety model is primitive

Loader’s current protection model is mostly:

  • “safe commands” vs “ask for confirmation”
  • destructive tool flags

Problems in practice:

  • no permission modes like read-only, workspace-write, danger-full-access
  • no strong workspace boundary checks
  • no binary-file guards
  • no file size limits
  • no symlink escape protection
  • no command semantics beyond a short safe list

Evidence:

  • src/loader/tools/file_tools.py reads/writes resolved paths directly
  • src/loader/tools/shell_tools.py uses create_subprocess_shell() on arbitrary shell strings
  • src/loader/tools/shell_tools.py:13-20 uses a short safe command set, but no mode-based authorization model

By comparison, claw-code has:

  • PermissionPolicy
  • PermissionEnforcer
  • workspace boundary checks
  • binary/size guards in file ops
  • permission-mode aware tool definitions

That does not just make it safer. It makes the agent more predictable.

8. Loader’s “definition of done” is heuristic, not contractual

The user complaint about “spending too long on simple tasks or finishing early without followup” is visible directly in the code.

Loader’s current strategy is:

  • heuristically decide whether the response looks premature
  • nudge the model to continue
  • maybe ask it to confirm completion

See:

  • src/loader/agent/reasoning.py:721-854

This is well-intentioned, but it is still guesswork.

It does not require:

  • explicit acceptance criteria
  • a verification plan
  • fresh command evidence
  • zero pending tasks
  • a final sign-off phase

OMX’s ralph workflow does.

That difference is enormous.

9. Loader has no durable workflow state

Loader has plans, decomposition, and completion logic, but they live inside one run and disappear.

Missing pieces:

  • persisted mode state
  • session memory
  • approved plan artifacts
  • PRD / test-spec artifacts
  • progress ledger
  • durable “what was already decided”
  • resume-safe task state

OMX writes state under .omx/ and uses that to keep the workflow coherent across retries, handoffs, and interruptions. Loader currently depends on in-memory context plus prompt history only.

10. Loader is too backend-specific and too capability-fragile

Despite defining an abstract LLM backend, Loader is effectively Ollama-only today.

Evidence:

  • src/loader/cli/main.py supports only ollama
  • src/loader/llm/ollama.py hardcodes native tool support by model-name substring matching

This is fragile for behavior matching “with any model chosen.”

What Loader needs instead is:

  • a provider-independent tool-calling contract
  • explicit capability profiles
  • distinct fallback strategies for native tools vs text tool calling
  • prompts/workflows that degrade gracefully

11. Loader’s tests are not protecting the real runtime

Loader’s test suite is mostly:

  • tool unit tests
  • parsing tests
  • recovery tests

That is useful, but insufficient.

The current state:

  • uv run pytest fails by default after adding refs/
  • the repo does not scope pytest discovery
  • the “normal” targeted run needs --with pytest --with pytest-asyncio
  • even then, 3 tests fail
  • there are no strong turn-loop integration tests
  • there is no deterministic mock backend harness comparable to claw-code

This is why structural issues like the tool_call_id mismatch can survive.

What claw-code gets right

1. The runtime contract is explicit

refs/claw-code/rust/crates/runtime/src/conversation.rs is the biggest thing Loader should study.

The core run_turn() flow is clean:

  1. append user message to session
  2. stream assistant response
  3. build a typed assistant message
  4. extract tool uses
  5. run permission checks
  6. execute tool
  7. append tool result message
  8. repeat until no more tool uses
  9. optionally compact session
  10. return a typed turn summary

That is much more trustworthy than Loader’s current “stream + parse + filter + maybe reparse + maybe extract raw JSON + maybe duplicate path” approach.
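The same contract can be sketched in Loader's own Python. Every name here (ToolUse, TurnSummary, run_turn, the callback signatures) is hypothetical, but the control flow is the loop above: model output, tool extraction, authorization, execution, result append, repeat until no tool uses remain:

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToolUse:
    id: str
    name: str
    args: dict


@dataclass
class TurnSummary:
    assistant_texts: list[str] = field(default_factory=list)
    tool_results: list[tuple[str, str]] = field(default_factory=list)  # (tool_use_id, output)


def run_turn(
    ask_model: Callable[[list[dict]], tuple[str, list[ToolUse]]],
    execute: Callable[[ToolUse], str],
    authorize: Callable[[ToolUse], bool],
    history: list[dict],
    user_input: str,
    max_steps: int = 8,
) -> TurnSummary:
    """One turn: model -> extract tool uses -> authorize -> execute -> append -> repeat."""
    summary = TurnSummary()
    history.append({"role": "user", "content": user_input})
    for _ in range(max_steps):
        text, tool_uses = ask_model(history)
        history.append({"role": "assistant", "content": text})
        summary.assistant_texts.append(text)
        if not tool_uses:  # no tool uses means the turn is finished
            break
        for use in tool_uses:
            output = execute(use) if authorize(use) else "permission denied"
            history.append({"role": "tool", "tool_use_id": use.id, "content": output})
            summary.tool_results.append((use.id, output))
    return summary
```

Everything that is currently a branch inside Loader's loop (raw-JSON extraction, duplicate suppression, validation) would become behavior inside `execute` or `authorize`, not a parallel code path.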

2. Session persistence and compaction are first-class

claw-code treats long-lived sessions as a product feature:

  • persisted sessions
  • resume support
  • usage tracking
  • compaction thresholds
  • summarized continuation messages

Relevant files:

  • refs/claw-code/rust/crates/runtime/src/conversation.rs
  • refs/claw-code/rust/crates/runtime/src/compact.rs
  • refs/claw-code/rust/crates/runtime/src/summary_compression.rs
  • refs/claw-code/rust/crates/runtime/src/usage.rs

This matters because good agent behavior is often continuity behavior.

3. Permissions are part of the runtime, not just UI confirmation

claw-code has an actual permission model with three layers:

  • Mode layer — PermissionMode enum with ReadOnly, WorkspaceWrite, DangerFullAccess, Prompt, and Allow (refs/claw-code/rust/crates/runtime/src/permissions.rs:8-27)
  • Per-tool requirement layer — every ToolSpec declares the minimum mode it requires, mapped in PermissionPolicy.tool_requirements
  • Rule layer — three rule lists (allow_rules, deny_rules, ask_rules) for context-specific overrides on top of the mode/requirement check

Plus typed authorization outcomes, file-write boundary logic, and bash gating.

Relevant files:

  • refs/claw-code/rust/crates/runtime/src/permission_enforcer.rs
  • refs/claw-code/rust/crates/runtime/src/permissions.rs

Loader needs this badly. The mode layer alone is the high-leverage start; the rule layer can come later.
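A minimal Python sketch of that mode layer, assuming an ordered mode enum and a hypothetical per-tool requirement table (the real claw-code model also layers allow/deny/ask rules and typed outcomes on top):

```python
from enum import IntEnum


class PermissionMode(IntEnum):
    # Ordered so a session mode authorizes any tool requiring <= that level
    READ_ONLY = 0
    WORKSPACE_WRITE = 1
    DANGER_FULL_ACCESS = 2


# Hypothetical per-tool minimum modes, echoing PermissionPolicy.tool_requirements
TOOL_REQUIREMENTS = {
    "read": PermissionMode.READ_ONLY,
    "grep": PermissionMode.READ_ONLY,
    "glob": PermissionMode.READ_ONLY,
    "write": PermissionMode.WORKSPACE_WRITE,
    "edit": PermissionMode.WORKSPACE_WRITE,
    "bash": PermissionMode.DANGER_FULL_ACCESS,
}


def authorize(tool: str, session_mode: PermissionMode) -> bool:
    # Unknown tools default to the most restrictive requirement
    required = TOOL_REQUIREMENTS.get(tool, PermissionMode.DANGER_FULL_ACCESS)
    return session_mode >= required
```

The key property is that authorization is a pure policy decision the runtime makes before execution, not a UI confirmation that happens mid-flight.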

4. File and shell operations are engineered, not just exposed

claw-code’s file layer includes:

  • max read size
  • max write size
  • binary detection
  • workspace-boundary validation
  • structured patch outputs

Relevant file:

  • refs/claw-code/rust/crates/runtime/src/file_ops.rs

Loader’s file tools are functional, but too permissive and too simplistic to support strong autonomous behavior.

5. Hooks and lifecycle surfaces give the runtime escape valves

claw-code has pre-tool and post-tool hooks, including failure hooks.

That is important because not every behavioral improvement should live inside the model prompt. Hooks let the system inject policy, observability, and guardrails without changing the LLM call itself.

Relevant files:

  • refs/claw-code/rust/crates/runtime/src/hooks.rs
  • refs/claw-code/rust/crates/runtime/src/conversation.rs

6. The project is honest about parity and weaknesses

refs/claw-code/PARITY.md is one of the best engineering lessons in the whole comparison.

It does three things Loader does not yet do:

  • names what is actually shipped
  • names what is still shallow or stubbed
  • ties roadmap claims to concrete evidence

That alone reduces thrash.

Loader needs a similar parity/backlog document for runtime behavior.

7. Diagnostics and operator surfaces are part of the product

claw-code exposes operational commands like:

  • status
  • sandbox
  • agents
  • mcp
  • skills
  • doctor
  • session resume

This is not just convenience. It makes the system inspectable. Loader currently hides too much inside the runtime.

Where claw-code is still incomplete

It is worth staying honest here too.

Even claw-code admits some shallowness in PARITY.md:

  • some surfaces are registry-backed approximations, not deep external integrations
  • session compaction parity is still open
  • token accounting accuracy is still open
  • some tool surfaces remain shallow or partially stubbed

That is useful because the goal is not blind imitation. The goal is to copy the parts that most affect day-to-day behavior.

What OMX adds that Loader is currently missing almost entirely

claw-code gives a better runtime. OMX gives a better workflow.

This is where most of Loader’s “definition of done” and “follow-through” problems are answered.

1. Clarification is a mode, not an ad hoc question

deep-interview is not “ask a question if confused.”

It is a formal ambiguity-reduction workflow with:

  • a context snapshot
  • one-question rounds
  • ambiguity scoring
  • explicit non-goals
  • explicit decision boundaries
  • a crystallized artifact for downstream execution

Relevant files:

  • refs/oh-my-codex/skills/deep-interview/SKILL.md

Loader currently has no equivalent. It either acts immediately or tries to self-nudge mid-flight.

2. Planning is artifact-based and consensus-based

ralplan is much more than “make a numbered list.”

It includes:

  • Planner / Architect / Critic loops
  • max iteration handling
  • planning completion gates
  • PRD and test-spec artifacts
  • approved handoff into execution

Relevant files:

  • refs/oh-my-codex/skills/ralplan/SKILL.md
  • refs/oh-my-codex/src/ralplan/runtime.ts
  • refs/oh-my-codex/src/planning/artifacts.ts

Loader’s Plan object is fine as a local helper, but it is nowhere near this level of control.

3. “Done” is a workflow contract in Ralph

This is the single biggest lesson for Loader.

Ralph encodes:

  • persistence until done
  • mandatory verification
  • architect verification
  • retry/fix loops
  • state transitions
  • explicit cleanup on completion
  • a final checklist

Relevant file:

  • refs/oh-my-codex/skills/ralph/SKILL.md

This directly addresses the exact Loader problems you named:

  • weak tool follow-through
  • finishing too early
  • spending too long in loops
  • poor task closure

4. Workflow state lives outside the prompt

OMX stores durable mode state under .omx/ and exposes it through state tools.

Relevant files:

  • refs/oh-my-codex/src/modes/base.ts
  • refs/oh-my-codex/src/mcp/state-server.ts
  • refs/oh-my-codex/src/mcp/memory-server.ts

That means:

  • progress survives interruptions
  • execution can be resumed
  • handoffs are grounded
  • context can be audited
  • the model does not have to remember everything itself

5. Memory and notepad are explicit tools

OMX has project memory and a notepad.

That sounds small, but it matters a lot for agent stability. It gives the system somewhere to store:

  • conventions
  • known build commands
  • temporary working notes
  • durable directives

Relevant file:

  • refs/oh-my-codex/src/mcp/memory-server.ts

Loader currently rediscovers too much per turn.

6. Verification is standardized

OMX has verification instructions that scale by task size and explicitly require evidence.

Relevant file:

  • refs/oh-my-codex/src/verification/verifier.ts

Loader has completion heuristics. OMX has verification policy.

That is the difference between “the model sounded done” and “the system proved done.”

7. Doctor / explore / sparkshell reduce prompt waste

OMX distinguishes:

  • health checking (doctor)
  • lightweight read-only exploration (explore)
  • bounded shell-native inspection (sparkshell)

That is smart.

It keeps the main execution loop from becoming the only place everything happens.

Relevant files:

  • refs/oh-my-codex/src/cli/doctor.ts
  • refs/oh-my-codex/src/cli/explore.ts
  • refs/oh-my-codex/src/cli/sparkshell.ts

8. Follow-through is supported outside the agent context window

The idle notifications, leader nudges, and continuation prompts in OMX are important.

Relevant file:

  • refs/oh-my-codex/src/scripts/notify-hook.ts

This is one of the deeper design differences:

  • Loader tries to keep the model on-task from inside the loop
  • OMX also nudges, monitors, and routes from outside the loop

That is a more robust design.

Comparison matrix

| Area | Loader today | claw-code | OMX lesson | Takeaway for Loader |
| --- | --- | --- | --- | --- |
| Runtime loop | monolithic, heuristic-heavy | typed turn engine | separate mode/workflow from turn runtime | split Loader runtime first |
| Tool surface | 6 basic tools | 49 exposed tool specs on main | tools should include workflow/state surfaces | add stateful and diagnostic tools |
| Permissions | confirmation-only | permission policy + enforcer | safety belongs in runtime | add modes and boundaries |
| Completion | heuristic continuation prompt | stronger runtime summaries | Ralph gives evidence-backed done gates | replace “maybe done” with explicit verification |
| Planning | ephemeral numbered list | some plan surfaces | ralplan = persisted, reviewed planning | persist plan artifacts |
| Memory/state | none | sessions + compaction + tracing | .omx/ mode state + memory | add .loader/ state dir |
| Diagnostics | minimal | status/sandbox/doctor/session | doctor/explore/sparkshell | make Loader inspectable |
| Testing | unit-heavy, no runtime harness | mock parity harness | workflow runtime is tested like product behavior | build scripted runtime tests |
| Extensibility | none | hooks, plugins, MCP surfaces | workflow and notification hooks | add lifecycle hooks later |
| Multi-agent | none | agent/team surfaces | team + ralph staffing | defer until solo runtime is trustworthy |

Why Loader’s current weaknesses produce the behavior you described

Poor tool use

Root causes:

  • shallow tool surface
  • brittle prompt contract
  • native-vs-ReAct bifurcation
  • duplicated execution code paths
  • no typed runtime contract for tool results

Weak follow-through

Root causes:

  • no persistent task state
  • no approved plan artifact
  • no explicit verification lane
  • no final completion checklist

Finishing early

Root causes:

  • completion is heuristic
  • no required evidence model
  • no acceptance criteria artifact
  • no final “prove it” pass

Spending too long on simple tasks

Root causes:

  • the runtime loop tries too many recoveries in one place
  • the system prompt does not distinguish task modes cleanly
  • there is no “lightweight inspect” lane like explore
  • the model often has to infer the workflow instead of being routed into one

Model sensitivity

Root causes:

  • behavior is prompt-and-heuristic driven
  • capability detection is backend-specific and brittle
  • no workflow artifacts that survive model variance

This is why copying OMX’s workflow ideas is so high leverage. It reduces how much we ask the model to improvise.

Concrete implementation targets

These are ordered by impact on Loader behavior, not by code convenience.

Target 1: Introduce a real turn engine

Goal:

  • replace the current giant loop with a smaller, typed conversation runtime

Implementation target:

  • create a new src/loader/runtime/ package
  • move message/session/tool-result logic out of src/loader/agent/loop.py
  • give tool results a first-class typed representation
  • unify native, ReAct, and extracted-tool execution through one executor path

Why:

  • this is the foundation for every other improvement

Target 2: Add persistent Loader state under .loader/

Goal:

  • make workflow state durable instead of prompt-only

Implementation target:

  • .loader/state/
  • .loader/sessions/
  • .loader/plans/
  • .loader/notepad.md
  • .loader/project-memory.json

Why:

  • Loader needs somewhere to store progress, acceptance criteria, and recovered knowledge
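A sketch of initializing that layout; the directory and file names are this report's proposal, not an existing API:

```python
import json
from pathlib import Path


def init_state_dir(root: Path) -> Path:
    """Create the proposed .loader/ layout, idempotently."""
    loader = root / ".loader"
    for sub in ("state", "sessions", "plans"):
        (loader / sub).mkdir(parents=True, exist_ok=True)
    notepad = loader / "notepad.md"
    if not notepad.exists():
        notepad.write_text("# Loader notepad\n")
    memory = loader / "project-memory.json"
    if not memory.exists():
        # Seed with empty conventions/commands so readers never handle a missing file
        memory.write_text(json.dumps({"conventions": [], "commands": {}}, indent=2))
    return loader
```

Idempotency matters: every workflow entry point can call this unconditionally, the same way OMX treats .omx/ as always-present.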

Target 3: Separate task modes

Goal:

  • stop treating all requests like immediate tool-execution requests

Implementation target:

  • mode router with at least:
    • clarify
    • plan
    • execute
    • verify

Why:

  • this is the minimum structure needed to stop overthinking simple work and underthinking complex work
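A toy sketch of the router, with hypothetical boolean/score signals standing in for the real clarify and plan artifacts:

```python
from enum import Enum


class Mode(Enum):
    CLARIFY = "clarify"
    PLAN = "plan"
    EXECUTE = "execute"
    VERIFY = "verify"


def route(ambiguity: float, has_approved_plan: bool,
          is_large_task: bool, execution_done: bool) -> Mode:
    """Illustrative routing policy: ambiguous work clarifies first,
    large unplanned work plans first, finished work must verify."""
    if execution_done:
        return Mode.VERIFY
    if ambiguity > 0.5:
        return Mode.CLARIFY
    if is_large_task and not has_approved_plan:
        return Mode.PLAN
    return Mode.EXECUTE
```

The thresholds are placeholders; the structural point is that mode selection happens before the model call, so a simple lookup never enters the planning lane and a large task never skips it.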

Target 4: Replace heuristic completion with an evidence-backed done contract

Goal:

  • make completion explicit and testable

Implementation target:

  • define a DefinitionOfDone object per task
  • require:
    • acceptance criteria
    • verification commands
    • evidence summary
    • zero pending task items

Why:

  • this is the main fix for premature completion
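A sketch of what that contract could look like as a dataclass; all names here (DefinitionOfDone, CommandEvidence, is_done) are proposals, not existing Loader code:

```python
from dataclasses import dataclass, field


@dataclass
class CommandEvidence:
    command: str
    exit_code: int
    output_tail: str  # last lines of output, kept as the evidence summary


@dataclass
class DefinitionOfDone:
    acceptance_criteria: list[str]
    verification_commands: list[str]
    evidence: list[CommandEvidence] = field(default_factory=list)
    pending_tasks: list[str] = field(default_factory=list)

    def is_done(self) -> tuple[bool, list[str]]:
        """Done only when every verification command has passing evidence
        and no tasks remain; returns (done, blocking_reasons)."""
        reasons = []
        passed = {e.command for e in self.evidence if e.exit_code == 0}
        for cmd in self.verification_commands:
            if cmd not in passed:
                reasons.append(f"no passing evidence for: {cmd}")
        if self.pending_tasks:
            reasons.append(f"{len(self.pending_tasks)} pending task(s)")
        return (not reasons, reasons)
```

The blocking reasons feed straight back into the loop as the next instruction, which is what turns "the model sounded done" into "the system proved done."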

Target 5: Add deep-interview-lite and ralplan-lite equivalents

Goal:

  • pull ambiguity reduction and planning review out of the middle of execution

Implementation target:

  • clarify mode writes a task brief
  • plan mode writes:
    • a short implementation plan
    • a test/verification plan

Do not try to copy every OMX feature immediately. Copy the artifact discipline first.

Target 6: Build a real permission model

Goal:

  • move from confirmation prompts to policy-based authorization

Implementation target:

  • permission modes:
    • read-only
    • workspace-write
    • danger-full-access
  • tool specs declare required permission
  • file writes enforce workspace boundaries
  • shell commands go through command classification

Why:

  • this is both safety and behavior quality

Target 7: Harden file and shell tools

Goal:

  • make tool use trustworthy enough for automation

Implementation target:

  • size limits
  • binary detection
  • symlink/traversal protection
  • structured patch/diff return values
  • shell command semantics and mutability classification
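A sketch of the command-classification idea, deliberately conservative: anything compound or unrecognized is treated as untrusted, and the read-only table is illustrative, not exhaustive:

```python
import shlex

# Hypothetical classification tables; a real version would model flags too
READ_ONLY_COMMANDS = {"ls", "cat", "grep", "rg", "find", "head", "wc"}
READ_ONLY_GIT_SUBCOMMANDS = {"status", "log", "diff", "show"}


def classify(command: str) -> str:
    """Return 'read', 'mutate', or 'unknown' for a shell command string."""
    # Compound commands would need full shell parsing; treat as untrusted
    if any(op in command for op in ("&&", "||", ";", "|", ">", "<", "`", "$(")):
        return "unknown"
    try:
        parts = shlex.split(command)
    except ValueError:  # unbalanced quotes
        return "unknown"
    if not parts:
        return "unknown"
    head = parts[0]
    if head == "git":
        sub = parts[1] if len(parts) > 1 else ""
        return "read" if sub in READ_ONLY_GIT_SUBCOMMANDS else "mutate"
    return "read" if head in READ_ONLY_COMMANDS else "mutate"
```

Combined with the permission modes, this lets read-only sessions run `git status` freely while routing everything else through the ask/deny path, replacing the flat safe-command list in shell_tools.py.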

Target 8: Add loader doctor, loader status, and loader session

Goal:

  • make Loader operable as a product

Implementation target:

  • backend health
  • model capability snapshot
  • workspace detection
  • write-access detection
  • test/build command detection
  • active session summary

Why:

  • better operator feedback means less guesswork in the agent loop

Target 9: Add memory/notepad tools

Goal:

  • give Loader durable short-term and long-term memory

Implementation target:

  • read/write project memory
  • append working notes
  • store user directives and repo conventions

Why:

  • this reduces re-discovery and improves follow-through across turns

Target 10: Add a lightweight read-only inspect lane

Goal:

  • avoid using the full agent loop for every lookup

Implementation target:

  • loader explore or equivalent internal mode
  • optimized for:
    • file/symbol lookup
    • pattern discovery
    • relationship questions

Why:

  • simple tasks should stay cheap and fast

Target 11: Add a parity harness

Goal:

  • improve behavior intentionally instead of impressionistically

Implementation target:

  • scripted mock backend scenarios for:
    • simple read
    • multi-tool turn
    • denied permission
    • write/edit success
    • verification-required task
    • premature completion rejection
    • looped/duplicate action prevention

Why:

  • this is how Loader becomes reliable
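One of these scenarios, looped/duplicate action prevention, can be pinned down deterministically without a live model. A sketch with hypothetical names, replaying scripted tool calls through a dedupe layer:

```python
def run_with_dedupe(scripted_calls: list[tuple[str, str]], execute) -> list[str]:
    """Replay scripted (tool, args_key) calls, suppressing exact duplicates --
    the 'looped/duplicate action prevention' scenario from the list above."""
    seen: set[tuple[str, str]] = set()
    results = []
    for tool, args_key in scripted_calls:
        key = (tool, args_key)
        if key in seen:
            # A real runtime would also feed a corrective message to the model
            results.append(f"{tool}: duplicate suppressed")
            continue
        seen.add(key)
        results.append(f"{tool}: {execute(tool, args_key)}")
    return results
```

The value of the harness is exactly this shape: the "model" is a script, so a behavior regression shows up as a failed assertion instead of a vague sense that the agent got worse.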

Target 12: Add workflow-aware prompts and capability profiles

Goal:

  • make Loader less brittle across models

Implementation target:

  • replace one generic system prompt with mode-specific prompts
  • add provider/model capability profiles:
    • native tools
    • streaming
    • context budget
    • preferred tool-call format
    • verification strictness

Why:

  • behavior should be shaped by runtime policy, not guessed from model substrings
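A sketch of exact-match capability profiles replacing substring detection; the model ids and field values here are illustrative placeholders, not a claim about any real model:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CapabilityProfile:
    native_tools: bool
    streaming: bool
    context_budget: int           # tokens
    tool_call_format: str         # "native" or "react-text"
    verification_strictness: str  # "light" or "strict"


# Hypothetical registry keyed by exact model id
PROFILES = {
    "example-coder:7b": CapabilityProfile(True, True, 32_768, "native", "strict"),
    "example-small:3b": CapabilityProfile(False, True, 8_192, "react-text", "light"),
}

# Conservative fallback: assume the least-capable contract for unknown models
DEFAULT = CapabilityProfile(False, True, 8_192, "react-text", "light")


def profile_for(model_id: str) -> CapabilityProfile:
    return PROFILES.get(model_id, DEFAULT)
```

The conservative default is the point: an unrecognized model degrades to text tool calling and light verification instead of silently getting the wrong prompt contract, which is what substring matching risks today.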

Priority order

This section was rewritten after a deeper validation pass against the actual code in refs/claw-code and refs/oh-my-codex, plus firsthand spot-checks of Loader's runtime. The deeper review confirmed every load-bearing claim in this report and surfaced one structural reorder: the Definition-of-Done work is the user's actual pain point and should land before permission modes, not after, because permissions are a safety win and DoD is the behavior win.

P0: Stabilize before changing behavior (Sprint 00)

  • write a failing regression test for the tool_call_id bug at agent/loop.py:885,906 first, before any harness work — it proves the bug is real and proves the harness exists in one move
  • scope pytest discovery so refs/ stops contaminating collection
  • exclude refs/ from ruff and mypy too
  • make uv run pytest work out of the box
  • port the scenario taxonomy from refs/claw-code/rust/crates/rusty-claude-cli/tests/mock_parity_harness.rs
  • rewrite README.md (currently still says "FortranGoingOnForty")
  • baseline parity checklist for current runtime behavior

P1: Replace the loop with a real runtime (Sprint 01)

  • new src/loader/runtime/ package with a typed turn engine
  • unify the native, ReAct, and "extracted JSON fallback" tool execution paths into one executor
  • fix the named bugs from Sprint 00's failing tests (tool_call_id, duplicate execution path)
  • replace substring-based NATIVE_TOOL_MODELS/NO_TOOL_MODELS model detection with a runtime/capabilities.py profile system — Loader needs to behave consistently across model choices
  • structured TurnSummary output

P2: The behavior fix the user actually asked for (Sprint 02)

  • DefinitionOfDone object per task: acceptance criteria, verification commands, evidence summary, pending/completed task items
  • explicit verify phase that runs the verification commands and gates completion on evidence
  • fix loop: verification failure returns to execution, not to final answer
  • minimum .loader/ directory shape (.loader/dod/) — full session/memory layout deferred to Sprint 05

This is the highest-leverage behavioral change in the entire plan and is the direct answer to "finishing too early" and "weak follow-through."

P3: Safety as policy, not as confirmation prompt (Sprint 03)

  • permission modes: read-only, workspace-write, danger-full-access
  • three-event tool lifecycle hooks (pre_tool_use, post_tool_use, post_tool_use_failure) modeled directly on refs/claw-code/rust/crates/runtime/src/hooks.rs
  • refactor safeguards.py (duplicate detection, validation, rollback) into pre-tool hook implementations rather than ad-hoc method calls
  • file operation hardening (workspace boundary, symlink, size limits, binary detection, structured patches)
  • shell operation hardening
  • expose active mode in CLI/TUI status

Hooks land alongside permissions because every later sprint hangs new behavior (verification, validation, observability) on the same lifecycle.
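A sketch of that three-event lifecycle in Python; the ToolLifecycle name and hook signatures are hypothetical, modeled loosely on the hooks.rs design described earlier:

```python
from typing import Callable, Optional

PreHook = Callable[[str, dict], Optional[str]]   # returns a block reason or None
PostHook = Callable[[str, dict, str], None]


class ToolLifecycle:
    """Three events: pre_tool_use (can block), post_tool_use, post_tool_use_failure."""

    def __init__(self) -> None:
        self.pre: list[PreHook] = []
        self.post: list[PostHook] = []
        self.on_failure: list[PostHook] = []

    def run(self, name: str, args: dict, execute: Callable[[str, dict], str]) -> str:
        for hook in self.pre:
            reason = hook(name, args)
            if reason is not None:      # any pre-hook can veto execution
                return f"blocked: {reason}"
        try:
            result = execute(name, args)
        except Exception as exc:
            for hook in self.on_failure:
                hook(name, args, str(exc))
            return f"error: {exc}"
        for hook in self.post:
            hook(name, args, result)
        return result
```

Under this shape, duplicate detection and validation from safeguards.py become pre-hooks, and observability becomes a post-hook, instead of ad hoc calls scattered through the loop.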

P4: Stop improvising one workflow for everything (Sprint 04)

  • mode router: clarify, plan, execute, verify (verify already exists from Sprint 02)
  • clarify artifact written to .loader/briefs/
  • planning artifacts (implementation plan + verification plan) written to .loader/plans/ and fed into the existing DoD object
  • tool prerequisites pulled forward from Sprint 06: TodoWrite (the "zero pending tasks" gate is empty without it) and AskUserQuestion (clarify rounds)

P5: Durable continuity (Sprint 05)

  • full .loader/ state directory under the layout already started in Sprint 02
  • session persistence and resume
  • transcript compaction with priority-aware summarization (model the design on refs/claw-code/rust/crates/runtime/src/summary_compression.rs)
  • memory/notepad surfaces
  • usage/cost tracking

P6: Operability and tool-surface expansion (Sprint 06)

  • loader doctor, loader status, loader session
  • read-only explore lane
  • broader tool surface (diff/patch-aware editing, git helpers, structured ask-user, etc.) — TodoWrite and AskUserQuestion already exist from Sprint 04

Deferred indefinitely

  • workflow hooks beyond the runtime tool lifecycle (notification/idle nudges, leader monitoring)
  • task/team/subagent orchestration
  • broad MCP ecosystem
  • richer plugin systems

These are real wins in claw-code/OMX, but Loader should not pursue them until the solo runtime is trustworthy.

What Loader should copy directly, and what it should not

Copy directly

  • typed turn runtime
  • permission model
  • file/shell hardening
  • session persistence
  • compaction
  • doctor/status/session surfaces
  • workflow artifacts
  • evidence-backed verification
  • parity harness discipline

Copy in simplified form

  • deep-interview
  • ralplan
  • ralph
  • memory/notepad
  • explore vs full-execution split

Do not copy blindly yet

  • full tmux/team runtime
  • huge command surface
  • Discord/openclaw notification stack
  • broad MCP ecosystem

Loader should first become a trustworthy single-agent local runtime. After that, team orchestration will actually help.

If we want behavior closer to claw-code without losing Loader’s simplicity, I would steer toward:

Layer 1: Runtime core

  • typed TurnRuntime
  • SessionStore
  • PermissionPolicy
  • ToolExecutor
  • VerificationEngine

Layer 2: Workflow layer

  • ClarifyWorkflow
  • PlanWorkflow
  • ExecuteWorkflow
  • VerifyWorkflow

Layer 3: Product surfaces

  • TUI
  • CLI
  • doctor
  • status
  • session
  • explore

Layer 4: Optional future orchestration

  • hooks
  • background verification
  • multi-agent/task orchestration

That is a better fit for Loader than trying to clone all of OMX wholesale.

Immediate conclusions

  1. Loader’s biggest problems are architectural, not just prompt-related.
  2. claw-code is strongest where Loader is weakest: runtime contract, permissions, sessions, diagnostics, parity.
  3. OMX is strongest where Loader is currently almost absent: clarification, planning discipline, durable state, completion/verification loops.
  4. The fastest path to “better model behavior today” is not adding more heuristics. It is adding:
    • workflow artifacts
    • explicit verification
    • persistent state
    • a smaller, more trustworthy turn engine
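To illustrate what "explicit verification" means as an artifact rather than a heuristic, here is a possible shape for the DoD object persisted under `.loader/dod/`. The field names are hypothetical; the contract is the point: done requires fresh passing evidence and zero pending tasks.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class Evidence:
    command: str
    exit_code: int
    summary: str


@dataclass
class DefinitionOfDone:
    acceptance_criteria: list[str]
    evidence: list[Evidence] = field(default_factory=list)
    pending_tasks: list[str] = field(default_factory=list)

    def is_satisfied(self) -> bool:
        """Done means passing command evidence and zero pending tasks,
        never just a confident-sounding model response."""
        return (
            not self.pending_tasks
            and bool(self.evidence)
            and all(e.exit_code == 0 for e in self.evidence)
        )

    def save(self, root: Path) -> Path:
        path = root / ".loader" / "dod" / "current.json"
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(asdict(self), indent=2))
        return path
```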

Sprint scaffolding

After the deeper validation pass, the original five-sprint plan was reshaped into seven sprints. The reshape splits the most ambitious sprint (the old Sprint 03, which bundled mode router + clarify + plan + DoD + verify/fix into one) and reorders the sequence so the user's actual pain point lands sooner. Sprint scaffolding lives under:

  • .docs/sprints/index.md
  • .docs/sprints/sprint00.md — Foundation, Measurement, and Parity Harness
  • .docs/sprints/sprint01.md — Turn Engine, Tool Contract, and Capability Profiles
  • .docs/sprints/sprint02.md — Definition of Done and Verify/Fix Loop
  • .docs/sprints/sprint03.md — Permission Modes and Tool Lifecycle Hooks
  • .docs/sprints/sprint04.md — Mode Router, Clarify, and Plan Artifacts
  • .docs/sprints/sprint05.md — Session State, Memory, and Compaction
  • .docs/sprints/sprint06.md — Doctor, Explore, Status, and Tool Surface Expansion

Start with Sprint 00, and start Sprint 00 with the failing regression test.

Reason:

  • Loader needs a measurable baseline and a safer runtime before adding more behavior
  • the tool_call_id bug at agent/loop.py:885,906 is proof that untested code paths are silently broken
  • writing the failing test first proves both the bug and the harness in one move
  • otherwise every feature sprint will be built on unstable agent semantics
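The failing-test-first deliverable could look roughly like this. The stand-in dataclasses mirror the `Message`/`ToolResult` shapes described later in this report for `src/loader/llm/base.py`; the real test would import Loader's actual types instead:

```python
# Regression-test sketch for the loop.py tool-result bug: Message has no
# tool_call_id field, so the loop's current construction must raise
# TypeError until the call sites wrap a ToolResult instead.
from dataclasses import dataclass, field


@dataclass
class ToolResult:
    tool_call_id: str
    content: str


@dataclass
class Message:
    role: str
    content: str
    tool_calls: list = field(default_factory=list)
    tool_results: list[ToolResult] = field(default_factory=list)


def test_tool_result_message_rejects_tool_call_id():
    # This is what loop.py:885 and :906 do today; it must fail.
    try:
        Message(role="tool", content="ok", tool_call_id="call_1")
    except TypeError:
        return  # expected: the field does not exist on Message
    raise AssertionError("Message accepted tool_call_id; the bug is masked")


def test_tool_result_message_correct_shape():
    # The fix: carry the id on ToolResult, not on Message.
    msg = Message(role="tool", content="ok",
                  tool_results=[ToolResult("call_1", "ok")])
    assert msg.tool_results[0].tool_call_id == "call_1"
```

Both tests are plain-assert style so pytest can collect them with no extra plugins, which matters while discovery itself is still being repaired.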

The execution phase should then be:

  1. lock down the runtime and test harness (Sprint 00)
  2. replace the loop with a typed runtime and capability profiles (Sprint 01)
  3. define and enforce the completion contract (Sprint 02)
  4. add the policy-based safety layer with hooks (Sprint 03)
  5. add workflow modes and planning artifacts on top (Sprint 04)
  6. then widen the durability and product surfaces (Sprints 05 and 06)

Plan adjustments after deeper review

The following changes were applied to the original report after a firsthand validation pass against the actual code in refs/claw-code and refs/oh-my-codex, plus spot-checks of Loader's runtime.

Verified directly against the code

  • tool_call_id bug confirmed at src/loader/agent/loop.py:885 and :906. Both call sites construct Message(role=Role.TOOL, content=..., tool_call_id=tool_call.id), but Message (src/loader/llm/base.py:33-39) has no such field. They live on the duplicate-suppression and pre-validation branches and would crash on first execution. Zero integration coverage.
  • Pytest discovery is broken by default. uv run pytest --collect-only picks up refs/claw-code/tests/test_porting_workspace.py and fails to import loader because there is no tool.pytest.ini_options block in pyproject.toml.
  • Loop monolith confirmed by line counts. agent/loop.py is 1929 LOC, agent/reasoning.py is 1196, agent/safeguards.py is 1079 — roughly 4200 lines of orchestration in one cluster.
  • claw-code's run_turn() shape is exactly as the report describes. Read directly at refs/claw-code/rust/crates/runtime/src/conversation.rs:295-470. Typed message build → tool extraction → pre-hook → permission check → execute → post-hook (success or failure variant) → typed ConversationMessage::tool_result() → push → repeat. ~175 lines of clean code.
  • claw-code permission modes are ReadOnly / WorkspaceWrite / DangerFullAccess (plus Prompt and Allow), defined at refs/claw-code/rust/crates/runtime/src/permissions.rs:8-27. The 10MB read/write caps, binary detection, workspace boundary check, and structured patch outputs in file_ops.rs are all real.
  • claw-code hooks are PreToolUse / PostToolUse / PostToolUseFailure, defined at refs/claw-code/rust/crates/runtime/src/hooks.rs:19-34 and wired into the conversation loop at lines 371, 427-453.
  • OMX skills are real and even more rigorous than the report described. ralplan enforces a max-5-iteration Critic loop with sequential Architect→Critic ordering. ralph has explicit phase enums (starting/executing/verifying/fixing/complete/failed/cancelled) persisted via state_write to .omx/state/{mode}-state.json. The verifier in src/verification/verifier.ts scales by task size with concrete file-count thresholds.
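The permission-mode and per-tool-requirement layers verified above translate naturally into the mode layer Loader can land first. A simplified Python sketch, collapsing claw-code's `Prompt`/`Allow` variants and deferring the rule layer as the report suggests (all names here are hypothetical Loader names):

```python
from enum import IntEnum


class PermissionMode(IntEnum):
    """Ordered so a mode grants every tool requiring a lower mode."""
    READ_ONLY = 0
    WORKSPACE_WRITE = 1
    DANGER_FULL_ACCESS = 2


# Per-tool minimum requirements: the second claw-code layer,
# mapped here onto Loader's six default tools.
TOOL_REQUIREMENTS = {
    "read": PermissionMode.READ_ONLY,
    "glob": PermissionMode.READ_ONLY,
    "grep": PermissionMode.READ_ONLY,
    "write": PermissionMode.WORKSPACE_WRITE,
    "edit": PermissionMode.WORKSPACE_WRITE,
    "bash": PermissionMode.DANGER_FULL_ACCESS,
}


def authorize(tool: str, mode: PermissionMode) -> bool:
    """Unknown tools default to the most restrictive requirement."""
    required = TOOL_REQUIREMENTS.get(tool, PermissionMode.DANGER_FULL_ACCESS)
    return mode >= required
```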

Corrected facts

  • Tool count: 49, not 40. refs/claw-code/rust/crates/tools/src/lib.rs exposes 49 ToolSpec entries in mvp_tool_specs(). Doesn't change the lesson, but worth knowing.
  • claw-code permissions have a third layer. Beyond PermissionMode and per-tool requirements, PermissionPolicy carries three rule lists (allow_rules, deny_rules, ask_rules) for context-specific overrides. Loader can land the mode layer first and defer the rule layer.
  • claw-code summary compression is sophisticated. It's not message-level truncation — it's line-level prioritization with deduplication and budget enforcement at refs/claw-code/rust/crates/runtime/src/summary_compression.rs. Sprint 05 should model on this rather than reinventing.

Structural plan changes

  • The old Sprint 03 was split. It bundled mode router + clarify + plan + DoD + verify/fix into one sprint, which is essentially "ralplan + ralph + verifier" simultaneously. The DoD/verify-fix half became the new Sprint 02 (highest-leverage behavioral fix). The mode router / clarify / plan half became the new Sprint 04.
  • The old Sprint 02 (permissions) became the new Sprint 03 and was reordered to land after DoD. Permissions are a safety win, not a behavior win, and the user's actual complaints are about behavior. DoD lands first.
  • Hooks landed in the same sprint as permissions. The original plan split them across sprints; that creates rework because every later runtime addition (verification, observability, validation) wants the same lifecycle. Sprint 03 owns both.
  • Capability profiles became a Sprint 01 deliverable. They were Target 12 in the original report and orphaned from the sprint plan. They belong in the runtime layer and are critical for the user's "behave consistently across model choices" goal.
  • The minimum .loader/ directory shape moves to Sprint 02 (just .loader/dod/). The full session/memory/compaction layout stays in Sprint 05. This unblocks Sprint 02 and Sprint 04 from waiting on Sprint 05.
  • TodoWrite and AskUserQuestion move from Sprint 06 to Sprint 04 as prerequisites for the clarify mode and the "zero pending tasks" gate. The broad tool-surface expansion stays in Sprint 06.
  • Sprint 00's first deliverable is now the failing regression test for the tool_call_id bug, before any harness work. It proves the bug and proves the harness exist in one move.
630 | Planning | ephemeral numbered list | some plan surfaces | ralplan = persisted, reviewed planning | persist plan artifacts |
631 | Memory/state | none | sessions + compaction + tracing | `.omx/` mode state + memory | add `.loader/` state dir |
632 | Diagnostics | minimal | status/sandbox/doctor/session | doctor/explore/sparkshell | make Loader inspectable |
633 | Testing | unit-heavy, no runtime harness | mock parity harness | workflow runtime is tested like product behavior | build scripted runtime tests |
634 | Extensibility | none | hooks, plugins, MCP surfaces | workflow and notification hooks | add lifecycle hooks later |
635 | Multi-agent | none | agent/team surfaces | team + ralph staffing | defer until solo runtime is trustworthy |
636
637 ## Why Loader’s current weaknesses produce the behavior you described
638
639 ### Poor tool use
640
641 Root causes:
642
643 - shallow tool surface
644 - brittle prompt contract
645 - native-vs-ReAct bifurcation
646 - duplicated execution code paths
647 - no typed runtime contract for tool results
648
649 ### Weak follow-through
650
651 Root causes:
652
653 - no persistent task state
654 - no approved plan artifact
655 - no explicit verification lane
656 - no final completion checklist
657
658 ### Finishing early
659
660 Root causes:
661
662 - completion is heuristic
663 - no required evidence model
664 - no acceptance criteria artifact
665 - no final “prove it” pass
666
667 ### Spending too long on simple tasks
668
669 Root causes:
670
671 - the runtime loop tries too many recoveries in one place
672 - the system prompt does not distinguish task modes cleanly
673 - there is no “lightweight inspect” lane like `explore`
674 - the model often has to infer the workflow instead of being routed into one
675
676 ### Model sensitivity
677
678 Root causes:
679
680 - behavior is prompt-and-heuristic driven
681 - capability detection is backend-specific and brittle
682 - no workflow artifacts that survive model variance
683
This is why copying OMX’s workflow ideas is so high-leverage. It reduces how much we ask the model to improvise.
685
686 ## Concrete implementation targets
687
688 These are ordered by impact on Loader behavior, not by code convenience.
689
690 ### Target 1: Introduce a real turn engine
691
692 Goal:
693
694 - replace the current giant loop with a smaller, typed conversation runtime
695
696 Implementation target:
697
698 - create a new `src/loader/runtime/` package
699 - move message/session/tool-result logic out of `src/loader/agent/loop.py`
700 - give tool results a first-class typed representation
701 - unify native, ReAct, and extracted-tool execution through one executor path
702
703 Why:
704
705 - this is the foundation for every other improvement
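
A minimal sketch of what the typed contract could look like. `ToolStatus`, `ToolResult`, and `ToolExecutor` are illustrative names, not existing Loader APIs; the point is that native, ReAct, and extracted-JSON calls all normalize into one executor and one result type:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class ToolStatus(Enum):
    OK = "ok"
    ERROR = "error"
    DENIED = "denied"


@dataclass(frozen=True)
class ToolResult:
    """First-class typed tool result (illustrative shape)."""
    tool_name: str
    call_id: str
    status: ToolStatus
    output: str


class ToolExecutor:
    """Single executor path: native, ReAct, and extracted-JSON tool calls
    all normalize to (name, call_id, args) before reaching this."""

    def __init__(self, tools: dict[str, Callable[..., str]]):
        self._tools = tools

    def execute(self, name: str, call_id: str, **args) -> ToolResult:
        fn = self._tools.get(name)
        if fn is None:
            return ToolResult(name, call_id, ToolStatus.ERROR, f"unknown tool: {name}")
        try:
            return ToolResult(name, call_id, ToolStatus.OK, fn(**args))
        except Exception as exc:  # tool failures become data, not crashes
            return ToolResult(name, call_id, ToolStatus.ERROR, str(exc))
```

With every path funneled through one `execute`, bugs like the `tool_call_id` crash become impossible to hide in an untested branch.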
706
707 ### Target 2: Add persistent Loader state under `.loader/`
708
709 Goal:
710
711 - make workflow state durable instead of prompt-only
712
713 Implementation target:
714
715 - `.loader/state/`
716 - `.loader/sessions/`
717 - `.loader/plans/`
718 - `.loader/notepad.md`
719 - `.loader/project-memory.json`
720
721 Why:
722
723 - Loader needs somewhere to store progress, acceptance criteria, and recovered knowledge
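
An idempotent initializer for the layout above could be as small as this (the directory and file names are the proposed ones; nothing here exists in Loader yet):

```python
from pathlib import Path

# layout from the list above; names are the proposed ones
LOADER_DIRS = ("state", "sessions", "plans")
LOADER_FILES = {"notepad.md": "", "project-memory.json": "{}"}


def init_loader_dir(project_root: Path) -> Path:
    """Create .loader/ with its subdirectories and seed files, idempotently."""
    base = project_root / ".loader"
    for name in LOADER_DIRS:
        (base / name).mkdir(parents=True, exist_ok=True)
    for name, default in LOADER_FILES.items():
        path = base / name
        if not path.exists():
            path.write_text(default)
    return base
```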
724
725 ### Target 3: Separate task modes
726
727 Goal:
728
729 - stop treating all requests like immediate tool-execution requests
730
731 Implementation target:
732
733 - mode router with at least:
734 - `clarify`
735 - `plan`
736 - `execute`
737 - `verify`
738
739 Why:
740
741 - this is the minimum structure needed to stop overthinking simple work and underthinking complex work
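
One way to pin this down is a small transition table rather than free-form routing. The transitions here are assumptions (e.g. that a clarified simple task may skip planning), but the shape shows how the router makes the workflow explicit instead of inferred:

```python
from enum import Enum


class TaskMode(Enum):
    CLARIFY = "clarify"
    PLAN = "plan"
    EXECUTE = "execute"
    VERIFY = "verify"


# Legal transitions (illustrative). Note verify loops back to execute
# on failure rather than falling through to a final answer.
TRANSITIONS = {
    TaskMode.CLARIFY: {TaskMode.PLAN, TaskMode.EXECUTE},
    TaskMode.PLAN: {TaskMode.EXECUTE},
    TaskMode.EXECUTE: {TaskMode.VERIFY},
    TaskMode.VERIFY: {TaskMode.EXECUTE},
}


def advance(current: TaskMode, nxt: TaskMode) -> TaskMode:
    if nxt not in TRANSITIONS[current]:
        raise ValueError(f"illegal mode transition: {current.value} -> {nxt.value}")
    return nxt
```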
742
743 ### Target 4: Replace heuristic completion with an evidence-backed done contract
744
745 Goal:
746
747 - make completion explicit and testable
748
749 Implementation target:
750
751 - define a `DefinitionOfDone` object per task
752 - require:
753 - acceptance criteria
754 - verification commands
755 - evidence summary
756 - zero pending task items
757
758 Why:
759
760 - this is the main fix for premature completion
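
A sketch of what the contract could look like. The field names and the subprocess-based verify step are assumptions, not Loader's current API; the key property is that "done" requires passing commands and an empty pending list, not a heuristic:

```python
import subprocess
from dataclasses import dataclass, field


@dataclass
class DefinitionOfDone:
    """Illustrative completion contract (field names are assumptions)."""
    acceptance_criteria: list[str]
    verification_commands: list[list[str]]   # argv-style commands to run
    pending_items: list[str] = field(default_factory=list)
    evidence: list[str] = field(default_factory=list)

    def verify(self) -> bool:
        """Completion gate: no pending items and every command exits 0."""
        if self.pending_items:
            return False
        for cmd in self.verification_commands:
            proc = subprocess.run(cmd, capture_output=True, text=True)
            if proc.returncode != 0:
                return False
            self.evidence.append(f"$ {' '.join(cmd)} -> exit 0")
        return True
```

A `verify()` that returns `False` routes back into execution, which is exactly the fix loop described in Sprint 02.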
761
762 ### Target 5: Add `deep-interview`-lite and `ralplan`-lite equivalents
763
764 Goal:
765
766 - pull ambiguity reduction and planning review out of the middle of execution
767
768 Implementation target:
769
770 - `clarify` mode writes a task brief
771 - `plan` mode writes:
772 - a short implementation plan
773 - a test/verification plan
774
775 Do not try to copy every OMX feature immediately. Copy the artifact discipline first.
776
777 ### Target 6: Build a real permission model
778
779 Goal:
780
781 - move from confirmation prompts to policy-based authorization
782
783 Implementation target:
784
785 - permission modes:
786 - `read-only`
787 - `workspace-write`
788 - `danger-full-access`
789 - tool specs declare required permission
790 - file writes enforce workspace boundaries
791 - shell commands go through command classification
792
793 Why:
794
795 - this is both safety and behavior quality
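
The mode names mirror claw-code's `ReadOnly`/`WorkspaceWrite`/`DangerFullAccess`; the ranking scheme below is a simplification of its policy (which also carries allow/deny/ask rule lists) and is offered only as a starting shape:

```python
from enum import Enum


class PermissionMode(Enum):
    READ_ONLY = "read-only"
    WORKSPACE_WRITE = "workspace-write"
    DANGER_FULL_ACCESS = "danger-full-access"


# Definition order doubles as privilege order.
_RANK = {mode: rank for rank, mode in enumerate(PermissionMode)}


def allowed(required: PermissionMode, active: PermissionMode) -> bool:
    """A tool spec declares `required`; the runtime checks it against
    the session's `active` mode before executing."""
    return _RANK[active] >= _RANK[required]
```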
796
797 ### Target 7: Harden file and shell tools
798
799 Goal:
800
801 - make tool use trustworthy enough for automation
802
803 Implementation target:
804
805 - size limits
806 - binary detection
807 - symlink/traversal protection
808 - structured patch/diff return values
809 - shell command semantics and mutability classification
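
The boundary and binary checks can be sketched in a few lines. The 10MB cap mirrors claw-code's `file_ops.rs` limits; the function names are hypothetical:

```python
from pathlib import Path

MAX_FILE_BYTES = 10 * 1024 * 1024  # mirrors claw-code's 10MB read/write cap


def resolve_in_workspace(workspace: Path, relative: str) -> Path:
    """Resolve symlinks and `..` segments, rejecting anything that
    escapes the workspace root."""
    root = workspace.resolve()
    candidate = (root / relative).resolve()
    if not candidate.is_relative_to(root):
        raise PermissionError(f"path escapes workspace: {relative}")
    return candidate


def looks_binary(data: bytes) -> bool:
    """Cheap binary sniff: a NUL byte in the first 8KB."""
    return b"\x00" in data[:8192]
```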
810
811 ### Target 8: Add `loader doctor`, `loader status`, and `loader session`
812
813 Goal:
814
815 - make Loader operable as a product
816
817 Implementation target:
818
819 - backend health
820 - model capability snapshot
821 - workspace detection
822 - write-access detection
823 - test/build command detection
824 - active session summary
825
826 Why:
827
828 - better operator feedback means less guesswork in the agent loop
829
830 ### Target 9: Add memory/notepad tools
831
832 Goal:
833
834 - give Loader durable short-term and long-term memory
835
836 Implementation target:
837
838 - read/write project memory
839 - append working notes
840 - store user directives and repo conventions
841
842 Why:
843
844 - this reduces re-discovery and improves follow-through across turns
845
846 ### Target 10: Add a lightweight read-only inspect lane
847
848 Goal:
849
850 - avoid using the full agent loop for every lookup
851
852 Implementation target:
853
854 - `loader explore` or equivalent internal mode
855 - optimized for:
856 - file/symbol lookup
857 - pattern discovery
858 - relationship questions
859
860 Why:
861
862 - simple tasks should stay cheap and fast
863
864 ### Target 11: Add a parity harness
865
866 Goal:
867
868 - improve behavior intentionally instead of impressionistically
869
870 Implementation target:
871
872 - scripted mock backend scenarios for:
873 - simple read
874 - multi-tool turn
875 - denied permission
876 - write/edit success
877 - verification-required task
878 - premature completion rejection
879 - looped/duplicate action prevention
880
881 Why:
882
883 - this is how Loader becomes reliable
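
The harness mechanics are simple once the turn runtime is typed: a mock backend replays canned model turns so each scenario becomes a deterministic assertion. The turn dict shapes below are placeholders, not Loader's real message format:

```python
class MockBackend:
    """Replays a scripted sequence of model turns, ignoring the transcript."""

    def __init__(self, turns):
        self._turns = iter(turns)

    def complete(self, messages):
        return next(self._turns)


def run_scenario(backend, execute_tool):
    """Drive the loop until the backend emits a final turn; return transcript."""
    transcript = []
    while True:
        turn = backend.complete(transcript)
        transcript.append(turn)
        if turn["type"] == "final":
            return transcript
        transcript.append({
            "type": "tool_result",
            "output": execute_tool(turn["tool"], turn["args"]),
        })
```

Each of the scenarios listed above becomes one scripted `turns` list plus assertions on the resulting transcript.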
884
885 ### Target 12: Add workflow-aware prompts and capability profiles
886
887 Goal:
888
889 - make Loader less brittle across models
890
891 Implementation target:
892
893 - replace one generic system prompt with mode-specific prompts
894 - add provider/model capability profiles:
895 - native tools
896 - streaming
897 - context budget
898 - preferred tool-call format
899 - verification strictness
900
901 Why:
902
903 - behavior should be shaped by runtime policy, not guessed from model substrings
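
A capability profile is just declared data keyed by model ID, with an explicit conservative default for unknown models. The field set and the example entry are illustrative:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CapabilityProfile:
    """Per-model runtime policy; replaces substring-based model detection."""
    native_tools: bool
    streaming: bool
    context_budget: int
    tool_call_format: str        # e.g. "native" | "react" | "json"
    strict_verification: bool


# Conservative defaults for any model we have not profiled.
DEFAULT = CapabilityProfile(False, False, 32_000, "react", True)

PROFILES: dict[str, CapabilityProfile] = {
    # hypothetical entry; real profiles would be declared per provider/model
    "example/native-model": CapabilityProfile(True, True, 128_000, "native", True),
}


def profile_for(model_id: str) -> CapabilityProfile:
    return PROFILES.get(model_id, DEFAULT)
```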
904
905 ## Priority order
906
This section was rewritten after a deeper validation pass against the actual code in `refs/claw-code` and `refs/oh-my-codex`, plus firsthand spot-checks of Loader's runtime. The deeper review confirmed every load-bearing claim in this report and surfaced one structural reorder: **the Definition-of-Done work addresses the user's actual pain point and should land before permission modes**, not after, because permissions are a safety win while DoD is the behavior win.
908
909 ### P0: Stabilize before changing behavior (Sprint 00)
910
911 - write a failing regression test for the `tool_call_id` bug at `agent/loop.py:885,906` *first*, before any harness work — it proves the bug is real and proves the harness exists in one move
912 - scope pytest discovery so `refs/` stops contaminating collection
913 - exclude `refs/` from ruff and mypy too
914 - make `uv run pytest` work out of the box
915 - port the scenario taxonomy from `refs/claw-code/rust/crates/rusty-claude-cli/tests/mock_parity_harness.rs`
916 - rewrite `README.md` (currently still says "FortranGoingOnForty")
917 - baseline parity checklist for current runtime behavior
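
Assuming a standard `pyproject.toml`, the discovery and lint scoping could look like this (one way to do it, not the only one):

```toml
[tool.pytest.ini_options]
testpaths = ["tests"]
norecursedirs = ["refs", ".git", ".venv"]

[tool.ruff]
extend-exclude = ["refs"]

[tool.mypy]
exclude = "^refs/"
```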
918
919 ### P1: Replace the loop with a real runtime (Sprint 01)
920
921 - new `src/loader/runtime/` package with a typed turn engine
922 - unify the native, ReAct, and "extracted JSON fallback" tool execution paths into one executor
923 - fix the named bugs from Sprint 00's failing tests (`tool_call_id`, duplicate execution path)
924 - replace substring-based `NATIVE_TOOL_MODELS`/`NO_TOOL_MODELS` model detection with a `runtime/capabilities.py` profile system — Loader needs to behave consistently across model choices
925 - structured `TurnSummary` output
926
927 ### P2: The behavior fix the user actually asked for (Sprint 02)
928
929 - `DefinitionOfDone` object per task: acceptance criteria, verification commands, evidence summary, pending/completed task items
930 - explicit verify phase that runs the verification commands and gates completion on evidence
931 - fix loop: verification failure returns to execution, not to final answer
932 - minimum `.loader/` directory shape (`.loader/dod/`) — full session/memory layout deferred to Sprint 05
933
934 This is the highest-leverage behavioral change in the entire plan and is the direct answer to "finishing too early" and "weak follow-through."
935
936 ### P3: Safety as policy, not as confirmation prompt (Sprint 03)
937
938 - permission modes: `read-only`, `workspace-write`, `danger-full-access`
939 - three-event tool lifecycle hooks (`pre_tool_use`, `post_tool_use`, `post_tool_use_failure`) modeled directly on `refs/claw-code/rust/crates/runtime/src/hooks.rs`
940 - refactor `safeguards.py` (duplicate detection, validation, rollback) into pre-tool hook implementations rather than ad-hoc method calls
941 - file operation hardening (workspace boundary, symlink, size limits, binary detection, structured patches)
942 - shell operation hardening
943 - expose active mode in CLI/TUI status
944
945 Hooks land alongside permissions because every later sprint hangs new behavior (verification, validation, observability) on the same lifecycle.
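
The three event names come straight from `hooks.rs`; the Python shape below is a guess at how Loader could express the same lifecycle, with one `safeguards.py` behavior recast as a pre-tool hook:

```python
from typing import Any, Protocol


class ToolLifecycleHook(Protocol):
    """Three-event lifecycle mirroring claw-code's hooks (Python shape assumed)."""
    def pre_tool_use(self, tool: str, args: dict[str, Any]) -> bool: ...
    def post_tool_use(self, tool: str, output: str) -> None: ...
    def post_tool_use_failure(self, tool: str, error: str) -> None: ...


class DuplicateCallGuard:
    """safeguards.py-style duplicate detection, recast as a pre-tool hook."""

    def __init__(self) -> None:
        self._seen: set[tuple[str, str]] = set()

    def pre_tool_use(self, tool: str, args: dict[str, Any]) -> bool:
        key = (tool, repr(sorted(args.items())))
        if key in self._seen:
            return False  # veto the duplicate call
        self._seen.add(key)
        return True

    def post_tool_use(self, tool: str, output: str) -> None:
        pass

    def post_tool_use_failure(self, tool: str, error: str) -> None:
        pass
```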
946
947 ### P4: Stop improvising one workflow for everything (Sprint 04)
948
949 - mode router: clarify, plan, execute, verify (verify already exists from Sprint 02)
950 - clarify artifact written to `.loader/briefs/`
951 - planning artifacts (implementation plan + verification plan) written to `.loader/plans/` and fed into the existing DoD object
952 - tool prerequisites pulled forward from Sprint 06: `TodoWrite` (the "zero pending tasks" gate is empty without it) and `AskUserQuestion` (clarify rounds)
953
954 ### P5: Durable continuity (Sprint 05)
955
956 - full `.loader/` state directory under the layout already started in Sprint 02
957 - session persistence and resume
958 - transcript compaction with priority-aware summarization (model the design on `refs/claw-code/rust/crates/runtime/src/summary_compression.rs`)
959 - memory/notepad surfaces
960 - usage/cost tracking
961
962 ### P6: Operability and tool-surface expansion (Sprint 06)
963
964 - `loader doctor`, `loader status`, `loader session`
965 - read-only explore lane
966 - broader tool surface (diff/patch-aware editing, git helpers, structured ask-user, etc.) — `TodoWrite` and `AskUserQuestion` already exist from Sprint 04
967
968 ### Deferred indefinitely
969
970 - workflow hooks beyond the runtime tool lifecycle (notification/idle nudges, leader monitoring)
971 - task/team/subagent orchestration
972 - broad MCP ecosystem
973 - richer plugin systems
974
975 These are real wins in `claw-code`/OMX, but Loader should not pursue them until the solo runtime is trustworthy.
976
977 ## What Loader should copy directly, and what it should not
978
979 ### Copy directly
980
981 - typed turn runtime
982 - permission model
983 - file/shell hardening
984 - session persistence
985 - compaction
986 - doctor/status/session surfaces
987 - workflow artifacts
988 - evidence-backed verification
989 - parity harness discipline
990
991 ### Copy in simplified form
992
993 - deep-interview
994 - ralplan
995 - ralph
996 - memory/notepad
997 - explore vs full-execution split
998
999 ### Do not copy blindly yet
1000
1001 - full tmux/team runtime
1002 - huge command surface
1003 - Discord/openclaw notification stack
1004 - broad MCP ecosystem
1005
1006 Loader should first become a trustworthy single-agent local runtime. After that, team orchestration will actually help.
1007
1008 ## Recommended Loader architecture direction
1009
1010 If we want behavior closer to `claw-code` without losing Loader’s simplicity, I would steer toward:
1011
1012 ### Layer 1: Runtime core
1013
1014 - typed `TurnRuntime`
1015 - `SessionStore`
1016 - `PermissionPolicy`
1017 - `ToolExecutor`
1018 - `VerificationEngine`
1019
1020 ### Layer 2: Workflow layer
1021
1022 - `ClarifyWorkflow`
1023 - `PlanWorkflow`
1024 - `ExecuteWorkflow`
1025 - `VerifyWorkflow`
1026
1027 ### Layer 3: Product surfaces
1028
1029 - TUI
1030 - CLI
1031 - `doctor`
1032 - `status`
1033 - `session`
1034 - `explore`
1035
1036 ### Layer 4: Optional future orchestration
1037
1038 - hooks
1039 - background verification
1040 - multi-agent/task orchestration
1041
1042 That is a better fit for Loader than trying to clone all of OMX wholesale.
1043
1044 ## Immediate conclusions
1045
1046 1. Loader’s biggest problems are architectural, not just prompt-related.
1047 2. `claw-code` is strongest where Loader is weakest: runtime contract, permissions, sessions, diagnostics, parity.
1048 3. OMX is strongest where Loader is currently almost absent: clarification, planning discipline, durable state, completion/verification loops.
1049 4. The fastest path to “better model behavior today” is not adding more heuristics. It is adding:
1050 - workflow artifacts
1051 - explicit verification
1052 - persistent state
1053 - a smaller, more trustworthy turn engine
1054
1055 ## Sprint scaffolding
1056
After the deeper validation pass, the original five-sprint plan was reshaped into seven sprints. The reshape splits the most ambitious sprint (the old Sprint 03, which bundled mode router + clarify + plan + DoD + verify/fix into one) and reorders the remainder so the user's actual pain point lands sooner. Sprint scaffolding lives under:
1058
1059 - `.docs/sprints/index.md`
1060 - `.docs/sprints/sprint00.md` — Foundation, Measurement, and Parity Harness
1061 - `.docs/sprints/sprint01.md` — Turn Engine, Tool Contract, and Capability Profiles
1062 - `.docs/sprints/sprint02.md` — Definition of Done and Verify/Fix Loop
1063 - `.docs/sprints/sprint03.md` — Permission Modes and Tool Lifecycle Hooks
1064 - `.docs/sprints/sprint04.md` — Mode Router, Clarify, and Plan Artifacts
1065 - `.docs/sprints/sprint05.md` — Session State, Memory, and Compaction
1066 - `.docs/sprints/sprint06.md` — Doctor, Explore, Status, and Tool Surface Expansion
1067
1068 ## Recommended next move
1069
1070 Start with Sprint 00, and start Sprint 00 with the failing regression test.
1071
1072 Reason:
1073
1074 - Loader needs a measurable baseline and a safer runtime before adding more behavior
1075 - the `tool_call_id` bug at `agent/loop.py:885,906` is proof that untested code paths are silently broken
1076 - writing the failing test first proves both the bug and the harness in one move
1077 - otherwise every feature sprint will be built on unstable agent semantics
1078
1079 The execution phase should then be:
1080
1081 1. lock down the runtime and test harness (Sprint 00)
1082 2. replace the loop with a typed runtime and capability profiles (Sprint 01)
1083 3. define and enforce the completion contract (Sprint 02)
1084 4. add the policy-based safety layer with hooks (Sprint 03)
1085 5. add workflow modes and planning artifacts on top (Sprint 04)
1086 6. then widen the durability and product surfaces (Sprints 05 and 06)
1087
1088 ## Plan adjustments after deeper review
1089
1090 The following changes were applied to the original report after a firsthand validation pass against the actual code in `refs/claw-code` and `refs/oh-my-codex`, plus spot-checks of Loader's runtime.
1091
1092 ### Verified directly against the code
1093
1094 - **`tool_call_id` bug confirmed at `src/loader/agent/loop.py:885` and `:906`.** Both call sites construct `Message(role=Role.TOOL, content=..., tool_call_id=tool_call.id)`, but `Message` (`src/loader/llm/base.py:33-39`) has no such field. They live on the duplicate-suppression and pre-validation branches and would crash on first execution. Zero integration coverage.
1095 - **Pytest discovery is broken by default.** `uv run pytest --collect-only` picks up `refs/claw-code/tests/test_porting_workspace.py` and fails to import `loader` because there is no `tool.pytest.ini_options` block in `pyproject.toml`.
1096 - **Loop monolith confirmed by line counts.** `agent/loop.py` is 1929 LOC, `agent/reasoning.py` is 1196, `agent/safeguards.py` is 1079 — roughly 4200 lines of orchestration in one cluster.
1097 - **claw-code's `run_turn()` shape** is exactly as the report describes. Read directly at `refs/claw-code/rust/crates/runtime/src/conversation.rs:295-470`. Typed message build → tool extraction → pre-hook → permission check → execute → post-hook (success or failure variant) → typed `ConversationMessage::tool_result()` → push → repeat. ~175 lines of clean code.
1098 - **claw-code permission modes** are `ReadOnly` / `WorkspaceWrite` / `DangerFullAccess` (plus `Prompt` and `Allow`), defined at `refs/claw-code/rust/crates/runtime/src/permissions.rs:8-27`. The 10MB read/write caps, binary detection, workspace boundary check, and structured patch outputs in `file_ops.rs` are all real.
1099 - **claw-code hooks** are `PreToolUse` / `PostToolUse` / `PostToolUseFailure`, defined at `refs/claw-code/rust/crates/runtime/src/hooks.rs:19-34` and wired into the conversation loop at lines 371, 427-453.
1100 - **OMX skills are real and even more rigorous than the report described.** `ralplan` enforces a max-5-iteration Critic loop with sequential Architect→Critic ordering. `ralph` has explicit phase enums (`starting`/`executing`/`verifying`/`fixing`/`complete`/`failed`/`cancelled`) persisted via `state_write` to `.omx/state/{mode}-state.json`. The verifier in `src/verification/verifier.ts` scales by task size with concrete file-count thresholds.
1101
1102 ### Corrected facts
1103
1104 - **Tool count: 49, not 40.** `refs/claw-code/rust/crates/tools/src/lib.rs` exposes 49 `ToolSpec` entries in `mvp_tool_specs()`. Doesn't change the lesson, but worth knowing.
1105 - **claw-code permissions have a third layer.** Beyond `PermissionMode` and per-tool requirements, `PermissionPolicy` carries three rule lists (`allow_rules`, `deny_rules`, `ask_rules`) for context-specific overrides. Loader can land the mode layer first and defer the rule layer.
1106 - **claw-code summary compression is sophisticated.** It's not message-level truncation — it's line-level prioritization with deduplication and budget enforcement at `refs/claw-code/rust/crates/runtime/src/summary_compression.rs`. Sprint 05 should model on this rather than reinventing.
1107
1108 ### Structural plan changes
1109
1110 - **The old Sprint 03 was split.** It bundled mode router + clarify + plan + DoD + verify/fix into one sprint, which is essentially "ralplan + ralph + verifier" simultaneously. The DoD/verify-fix half became the new Sprint 02 (highest-leverage behavioral fix). The mode router / clarify / plan half became the new Sprint 04.
1111 - **The old Sprint 02 (permissions) became the new Sprint 03** and was reordered to land *after* DoD. Permissions are a safety win, not a behavior win, and the user's actual complaints are about behavior. DoD lands first.
1112 - **Hooks landed in the same sprint as permissions.** The original plan split them across sprints; that creates rework because every later runtime addition (verification, observability, validation) wants the same lifecycle. Sprint 03 owns both.
1113 - **Capability profiles became a Sprint 01 deliverable.** They were Target 12 in the original report and orphaned from the sprint plan. They belong in the runtime layer and are critical for the user's "behave consistently across model choices" goal.
1114 - **The minimum `.loader/` directory shape moves to Sprint 02** (just `.loader/dod/`). The full session/memory/compaction layout stays in Sprint 05. This unblocks Sprint 02 and Sprint 04 from waiting on Sprint 05.
1115 - **`TodoWrite` and `AskUserQuestion` move from Sprint 06 to Sprint 04** as prerequisites for the clarify mode and the "zero pending tasks" gate. The broad tool-surface expansion stays in Sprint 06.
- **Sprint 00's first deliverable is now the failing regression test** for the `tool_call_id` bug, before any harness work. It proves the bug is real and proves the harness exists in one move.