loader Public

Watch 0 Fork 0 Star 0

markdown · 40377 bytes Raw Blame History

Loader Deep Dive: Gaps, Strengths, and a Path Toward Claw-Like Behavior

Date: 2026-04-06

Scope and assumptions

This report compares three things:

Loader itself
refs/claw-code, using the Rust workspace under refs/claw-code/rust/ as the canonical runtime
refs/oh-my-codex as the workflow-layer parent repo

Assumption: oh-my-codex is the correct “parent repo” for this exercise. That assumption is based on:

refs/claw-code/README.md
refs/claw-code/PHILOSOPHY.md
the fact that refs/claw-code explicitly describes src/ as a companion Python/reference workspace, not the primary runtime

If you meant a different parent, we should rerun the comparison against that repo, but this is a solid first pass.

Executive summary

Loader has the right instincts but is operating at the wrong layer.

The codebase already knows that models need:

planning help
recovery help
confidence checks
completion checks
safe tool use

But Loader mostly tries to enforce those after the model has already started drifting. claw-code and oh-my-codex get better behavior because they shape the work before, during, and after the model call:

before: explicit mode selection, clarification, approved planning artifacts
during: durable runtime state, richer tool surface, explicit permission model, session persistence
after: verification protocols, completion gates, retry/fix loops, parity harnesses, operator diagnostics

The biggest lesson is not “copy their prompt.”

The biggest lesson is:

Loader needs a stronger execution contract, not just stronger prompting.

If we want Loader to feel closer to claw-code regardless of model choice, the highest-leverage work is:

replace the monolithic heuristic loop with a typed turn engine
add durable workflow/state artifacts
make “definition of done” evidence-based instead of heuristic
add real permission/safety boundaries around tools
build a parity harness so we can improve behavior intentionally

Method

I reviewed:

Loader source under src/loader/
Loader tests under tests/
refs/claw-code/README.md
refs/claw-code/USAGE.md
refs/claw-code/PARITY.md
refs/claw-code/PHILOSOPHY.md
refs/claw-code/rust/crates/runtime/*
refs/claw-code/rust/crates/tools/src/lib.rs
refs/oh-my-codex/README.md
refs/oh-my-codex/AGENTS.md
refs/oh-my-codex/skills/deep-interview/SKILL.md
refs/oh-my-codex/skills/ralplan/SKILL.md
refs/oh-my-codex/skills/ralph/SKILL.md
refs/oh-my-codex/src/modes/base.ts
refs/oh-my-codex/src/ralplan/runtime.ts
refs/oh-my-codex/src/mcp/memory-server.ts
refs/oh-my-codex/src/verification/verifier.ts
refs/oh-my-codex/src/cli/doctor.ts
refs/oh-my-codex/src/scripts/notify-hook.ts

I also ran Loader verification commands:

uv run pytest
- failed during collection
- discovered refs/claw-code/tests/*
- also failed to import loader
uv run --with pytest --with pytest-asyncio python -m pytest tests -q
- 56 passed
- 3 failed

That matters because some of Loader’s runtime paths are clearly under-tested.

What Loader already does well

1. Loader is small, understandable, and hackable

This is a real advantage.

src/loader/ is about 55 source files, and the core agent behavior is easy to locate. Compared to claw-code and especially OMX, Loader is much easier to refactor aggressively.

2. Loader is genuinely local-first

The Ollama-first posture is simple and useful. A lot of the complexity in claw-code and OMX comes from supporting broad operational surfaces, multiple runtimes, OAuth, MCP, tmux/team flows, and richer tool ecosystems. Loader can keep its local-first identity while still copying the good execution ideas.

3. Loader already contains the seeds of a better system

These are the right instincts:

project context detection in src/loader/context/project.py
runtime safeguards in src/loader/agent/safeguards.py
recovery categorization in src/loader/agent/recovery.py
optional decomposition / critique / confidence / verification / completion checks in src/loader/agent/reasoning.py
a decent Textual app in src/loader/ui/app.py

The problem is not that Loader lacks ideas.

The problem is that these ideas are bolted onto one big runtime loop instead of being elevated into the architecture.

4. The TUI is a meaningful strength

Loader’s TUI already gives you:

model selection
streaming output
approval handling
status line updates
tool widgets

That is more product surface than many small local agents. It is worth keeping.

Where Loader is weak today

1. Loader’s product surface is not trustworthy yet

The most visible sign is the README:

README.md:1-2 still says “FortranGoingOnForty” and “A tutorial on using Fortran for beginners.”

That looks small, but it reflects a bigger problem: Loader is missing operational polish and self-diagnosis. claw-code and OMX both treat installability, health checks, and discoverability as product requirements. Loader currently feels like an experiment more than a tool.

2. Loader’s main runtime is too monolithic and too heuristic

src/loader/agent/loop.py is the heart of Loader, and it is doing too much:

prompt construction
streaming output handling
raw tool-call extraction
duplicate tool execution flows
recovery
validation
rollback tracking
completion nudging
loop detection
steering
partial planning
decomposition

The result is a loop that is hard to reason about and easy to destabilize.

The core design smell is that Loader tries to recover from model misbehavior in-place instead of enforcing a stronger turn protocol.

3. Loader has a real runtime contract bug in tool-result handling

Verified directly against the code. There is a concrete mismatch between Message and the loop:

src/loader/llm/base.py:33-39 defines Message with role, content, tool_calls, and tool_results. There is no tool_call_id field on Message — that field belongs to the separate ToolResult dataclass at src/loader/llm/base.py:25-30.
src/loader/agent/loop.py:885 and src/loader/agent/loop.py:906 both construct Message(role=Role.TOOL, content=..., tool_call_id=tool_call.id).

Both call sites will raise TypeError: Message.__init__() got an unexpected keyword argument 'tool_call_id' the moment they execute. They live on the duplicate-suppression and pre-validation branches of the loop, which means they have zero integration coverage today. This single bug is the proof that the test harness gap is real and that Sprint 00 must precede any behavioral work.

4. Loader duplicates tool execution logic instead of centralizing it

There are effectively two execution paths:

the normal native/ReAct tool path
the “raw JSON extracted tool call” path

Those paths duplicate:

duplicate checking
validation
confirmation behavior
result recording
loop/error handling

That makes behavior inconsistent and increases the chance that fixes in one path never land in the other.

claw-code’s ConversationRuntime::run_turn() is much tighter: receive assistant output, extract tool uses, authorize, execute, append tool results, repeat.

5. Loader’s system prompt is too shallow and too rigid

src/loader/agent/prompts.py:148-208 gives Loader a generic “use tools immediately / no code blocks / no numbered steps / read files before editing” prompt.

This is too blunt.

Problems:

it treats all tasks like immediate tool-execution tasks
it globally bans numbered steps, which is bad for planning/reporting tasks
it does not define modes
it does not encode verification expectations
it does not encode completion criteria
it does not distinguish “clarify”, “plan”, “execute”, and “verify”

OMX is much better here. It does not just say “do the task.” It routes the task into a workflow lane with an explicit contract.

6. Loader’s tool surface is too thin

Loader has 6 default tools:

read
write
edit
glob
bash
grep

That is enough for toy execution, but not enough for strong agent behavior.

What is missing compared to claw-code / OMX:

task/todo tracking
structured ask-user surfaces
memory/notepad
doctor/status/session tooling
git-aware helpers
explore vs full-execution split
diff/patch-aware editing
web/search/fetch surfaces
structured output surfaces
subagent/team coordination surfaces
MCP-backed state and memory

The result is that Loader has to keep too much in the prompt and too much in ephemeral model state.

7. Loader’s safety model is primitive

Loader’s current protection model is mostly:

“safe commands” vs “ask for confirmation”
destructive tool flags

Problems in practice:

no permission modes like read-only, workspace-write, danger-full-access
no strong workspace boundary checks
no binary-file guards
no file size limits
no symlink escape protection
no command semantics beyond a short safe list

Evidence:

src/loader/tools/file_tools.py reads/writes resolved paths directly
src/loader/tools/shell_tools.py uses create_subprocess_shell() on arbitrary shell strings
src/loader/tools/shell_tools.py:13-20 uses a short safe command set, but no mode-based authorization model

By comparison, claw-code has:

PermissionPolicy
PermissionEnforcer
workspace boundary checks
binary/size guards in file ops
permission-mode aware tool definitions

That does not just make it safer. It makes the agent more predictable.

8. Loader’s “definition of done” is heuristic, not contractual

The user complaint about “spending too long on simple tasks or finishing early without followup” is visible directly in the code.

Loader’s current strategy is:

heuristically decide whether the response looks premature
nudge the model to continue
maybe ask it to confirm completion

See:

src/loader/agent/reasoning.py:721-854

This is well-intentioned, but it is still guesswork.

It does not require:

explicit acceptance criteria
a verification plan
fresh command evidence
zero pending tasks
a final sign-off phase

OMX’s ralph workflow does.

That difference is enormous.

9. Loader has no durable workflow state

Loader has plans, decomposition, and completion logic, but they live inside one run and disappear.

Missing pieces:

persisted mode state
session memory
approved plan artifacts
PRD / test-spec artifacts
progress ledger
durable “what was already decided”
resume-safe task state

OMX writes state under .omx/ and uses that to keep the workflow coherent across retries, handoffs, and interruptions. Loader currently depends on in-memory context plus prompt history only.

10. Loader is too backend-specific and too capability-fragile

Despite defining an abstract LLM backend, Loader is effectively Ollama-only today.

Evidence:

src/loader/cli/main.py supports only ollama
src/loader/llm/ollama.py hardcodes native tool support by model-name substring matching

This is fragile for behavior matching “with any model chosen.”

What Loader needs instead is:

a provider-independent tool-calling contract
explicit capability profiles
distinct fallback strategies for native tools vs text tool calling
prompts/workflows that degrade gracefully

11. Loader’s tests are not protecting the real runtime

Loader’s test suite is mostly:

tool unit tests
parsing tests
recovery tests

That is useful, but insufficient.

The current state:

uv run pytest fails by default after adding refs/
the repo does not scope pytest discovery
the “normal” targeted run needs --with pytest --with pytest-asyncio
even then, 3 tests fail
there are no strong turn-loop integration tests
there is no deterministic mock backend harness comparable to claw-code

This is why structural issues like the tool_call_id mismatch can survive.

What `claw-code` gets right

1. The runtime contract is explicit

refs/claw-code/rust/crates/runtime/src/conversation.rs is the biggest thing Loader should study.

The core run_turn() flow is clean:

append user message to session
stream assistant response
build a typed assistant message
extract tool uses
run permission checks
execute tool
append tool result message
repeat until no more tool uses
optionally compact session
return a typed turn summary

That is much more trustworthy than Loader’s current “stream + parse + filter + maybe reparse + maybe extract raw JSON + maybe duplicate path” approach.

2. Session persistence and compaction are first-class

claw-code treats long-lived sessions as a product feature:

persisted sessions
resume support
usage tracking
compaction thresholds
summarized continuation messages

Relevant files:

refs/claw-code/rust/crates/runtime/src/conversation.rs
refs/claw-code/rust/crates/runtime/src/compact.rs
refs/claw-code/rust/crates/runtime/src/summary_compression.rs
refs/claw-code/rust/crates/runtime/src/usage.rs

This matters because good agent behavior is often continuity behavior.

3. Permissions are part of the runtime, not just UI confirmation

claw-code has an actual permission model with three layers:

Mode layer — PermissionMode enum with ReadOnly, WorkspaceWrite, DangerFullAccess, Prompt, and Allow (refs/claw-code/rust/crates/runtime/src/permissions.rs:8-27)
Per-tool requirement layer — every ToolSpec declares the minimum mode it requires, mapped in PermissionPolicy.tool_requirements
Rule layer — three rule lists (allow_rules, deny_rules, ask_rules) for context-specific overrides on top of the mode/requirement check

Plus typed authorization outcomes, file-write boundary logic, and bash gating.

Relevant files:

refs/claw-code/rust/crates/runtime/src/permission_enforcer.rs
refs/claw-code/rust/crates/runtime/src/permissions.rs

Loader needs this badly. The mode layer alone is the high-leverage start; the rule layer can come later.

4. File and shell operations are engineered, not just exposed

claw-code’s file layer includes:

max read size
max write size
binary detection
workspace-boundary validation
structured patch outputs

Relevant file:

refs/claw-code/rust/crates/runtime/src/file_ops.rs

Loader’s file tools are functional, but too permissive and too simplistic to support strong autonomous behavior.

5. Hooks and lifecycle surfaces give the runtime escape valves

claw-code has pre-tool and post-tool hooks, including failure hooks.

That is important because not every behavioral improvement should live inside the model prompt. Hooks let the system inject policy, observability, and guardrails without changing the LLM call itself.

Relevant files:

refs/claw-code/rust/crates/runtime/src/hooks.rs
refs/claw-code/rust/crates/runtime/src/conversation.rs

6. The project is honest about parity and weaknesses

refs/claw-code/PARITY.md is one of the best engineering lessons in the whole comparison.

It does three things Loader does not yet do:

names what is actually shipped
names what is still shallow or stubbed
ties roadmap claims to concrete evidence

That alone reduces thrash.

Loader needs a similar parity/backlog document for runtime behavior.

7. Diagnostics and operator surfaces are part of the product

claw-code exposes operational commands like:

status
sandbox
agents
mcp
skills
doctor
session resume

This is not just convenience. It makes the system inspectable. Loader currently hides too much inside the runtime.

Where `claw-code` is still incomplete

It is worth staying honest here too.

Even claw-code admits some shallowness in PARITY.md:

some surfaces are registry-backed approximations, not deep external integrations
session compaction parity is still open
token accounting accuracy is still open
some tool surfaces remain shallow or partially stubbed

That is useful because the goal is not blind imitation. The goal is to copy the parts that most affect day-to-day behavior.

What OMX adds that Loader is currently missing almost entirely

claw-code gives a better runtime. OMX gives a better workflow.

This is where most of Loader’s “definition of done” and “follow-through” problems are answered.

1. Clarification is a mode, not an ad hoc question

deep-interview is not “ask a question if confused.”

It is a formal ambiguity-reduction workflow with:

a context snapshot
one-question rounds
ambiguity scoring
explicit non-goals
explicit decision boundaries
a crystallized artifact for downstream execution

Relevant files:

refs/oh-my-codex/skills/deep-interview/SKILL.md

Loader currently has no equivalent. It either acts immediately or tries to self-nudge mid-flight.

2. Planning is artifact-based and consensus-based

ralplan is much more than “make a numbered list.”

It includes:

Planner / Architect / Critic loops
max iteration handling
planning completion gates
PRD and test-spec artifacts
approved handoff into execution

Relevant files:

refs/oh-my-codex/skills/ralplan/SKILL.md
refs/oh-my-codex/src/ralplan/runtime.ts
refs/oh-my-codex/src/planning/artifacts.ts

Loader’s Plan object is fine as a local helper, but it is nowhere near this level of control.

3. “Done” is a workflow contract in Ralph

This is the single biggest lesson for Loader.

Ralph encodes:

persistence until done
mandatory verification
architect verification
retry/fix loops
state transitions
explicit cleanup on completion
a final checklist

Relevant file:

refs/oh-my-codex/skills/ralph/SKILL.md

This directly addresses the exact Loader problems you named:

weak tool follow-through
finishing too early
spending too long in loops
poor task closure

4. Workflow state lives outside the prompt

OMX stores durable mode state under .omx/ and exposes it through state tools.

Relevant files:

refs/oh-my-codex/src/modes/base.ts
refs/oh-my-codex/src/mcp/state-server.ts
refs/oh-my-codex/src/mcp/memory-server.ts

That means:

progress survives interruptions
execution can be resumed
handoffs are grounded
context can be audited
the model does not have to remember everything itself

5. Memory and notepad are explicit tools

OMX has project memory and a notepad.

That sounds small, but it matters a lot for agent stability. It gives the system somewhere to store:

conventions
known build commands
temporary working notes
durable directives

Relevant file:

refs/oh-my-codex/src/mcp/memory-server.ts

Loader currently rediscovers too much per turn.

6. Verification is standardized

OMX has verification instructions that scale by task size and explicitly require evidence.

Relevant file:

refs/oh-my-codex/src/verification/verifier.ts

Loader has completion heuristics. OMX has verification policy.

That is the difference between “the model sounded done” and “the system proved done.”

7. Doctor / explore / sparkshell reduce prompt waste

OMX distinguishes:

health checking (doctor)
lightweight read-only exploration (explore)
bounded shell-native inspection (sparkshell)

That is smart.

It keeps the main execution loop from becoming the only place everything happens.

Relevant files:

refs/oh-my-codex/src/cli/doctor.ts
refs/oh-my-codex/src/cli/explore.ts
refs/oh-my-codex/src/cli/sparkshell.ts

8. Follow-through is supported outside the agent context window

The idle notifications, leader nudges, and continuation prompts in OMX are important.

Relevant file:

refs/oh-my-codex/src/scripts/notify-hook.ts

This is one of the deeper design differences:

Loader tries to keep the model on-task from inside the loop
OMX also nudges, monitors, and routes from outside the loop

That is a more robust design.

Comparison matrix

Area	Loader today	`claw-code`	OMX lesson	Takeaway for Loader
Runtime loop	monolithic, heuristic-heavy	typed turn engine	separate mode/workflow from turn runtime	split Loader runtime first
Tool surface	6 basic tools	49 exposed tool specs on main	tools should include workflow/state surfaces	add stateful and diagnostic tools
Permissions	confirmation-only	permission policy + enforcer	safety belongs in runtime	add modes and boundaries
Completion	heuristic continuation prompt	stronger runtime summaries	Ralph gives evidence-backed done gates	replace “maybe done” with explicit verification
Planning	ephemeral numbered list	some plan surfaces	ralplan = persisted, reviewed planning	persist plan artifacts
Memory/state	none	sessions + compaction + tracing	`.omx/` mode state + memory	add `.loader/` state dir
Diagnostics	minimal	status/sandbox/doctor/session	doctor/explore/sparkshell	make Loader inspectable
Testing	unit-heavy, no runtime harness	mock parity harness	workflow runtime is tested like product behavior	build scripted runtime tests
Extensibility	none	hooks, plugins, MCP surfaces	workflow and notification hooks	add lifecycle hooks later
Multi-agent	none	agent/team surfaces	team + ralph staffing	defer until solo runtime is trustworthy

Why Loader’s current weaknesses produce the behavior you described

Poor tool use

Root causes:

shallow tool surface
brittle prompt contract
native-vs-ReAct bifurcation
duplicated execution code paths
no typed runtime contract for tool results

Weak follow-through

Root causes:

no persistent task state
no approved plan artifact
no explicit verification lane
no final completion checklist

Finishing early

Root causes:

completion is heuristic
no required evidence model
no acceptance criteria artifact
no final “prove it” pass

Spending too long on simple tasks

Root causes:

the runtime loop tries too many recoveries in one place
the system prompt does not distinguish task modes cleanly
there is no “lightweight inspect” lane like explore
the model often has to infer the workflow instead of being routed into one

Model sensitivity

Root causes:

behavior is prompt-and-heuristic driven
capability detection is backend-specific and brittle
no workflow artifacts that survive model variance

This is why copying OMX’s workflow ideas is so high leverage. It reduces how much we ask the model to improvise.

Concrete implementation targets

These are ordered by impact on Loader behavior, not by code convenience.

Target 1: Introduce a real turn engine

Goal:

replace the current giant loop with a smaller, typed conversation runtime

Implementation target:

create a new src/loader/runtime/ package
move message/session/tool-result logic out of src/loader/agent/loop.py
give tool results a first-class typed representation
unify native, ReAct, and extracted-tool execution through one executor path

Why:

this is the foundation for every other improvement

Target 2: Add persistent Loader state under `.loader/`

Goal:

make workflow state durable instead of prompt-only

Implementation target:

.loader/state/
.loader/sessions/
.loader/plans/
.loader/notepad.md
.loader/project-memory.json

Why:

Loader needs somewhere to store progress, acceptance criteria, and recovered knowledge

Target 3: Separate task modes

Goal:

stop treating all requests like immediate tool-execution requests

Implementation target:

mode router with at least:
- clarify
- plan
- execute
- verify

Why:

this is the minimum structure needed to stop overthinking simple work and underthinking complex work

Target 4: Replace heuristic completion with an evidence-backed done contract

Goal:

make completion explicit and testable

Implementation target:

define a DefinitionOfDone object per task
require:
- acceptance criteria
- verification commands
- evidence summary
- zero pending task items

Why:

this is the main fix for premature completion

Target 5: Add `deep-interview`-lite and `ralplan`-lite equivalents

Goal:

pull ambiguity reduction and planning review out of the middle of execution

Implementation target:

clarify mode writes a task brief
plan mode writes:
- a short implementation plan
- a test/verification plan

Do not try to copy every OMX feature immediately. Copy the artifact discipline first.

Target 6: Build a real permission model

Goal:

move from confirmation prompts to policy-based authorization

Implementation target:

permission modes:
- read-only
- workspace-write
- danger-full-access
tool specs declare required permission
file writes enforce workspace boundaries
shell commands go through command classification

Why:

this is both safety and behavior quality

Target 7: Harden file and shell tools

Goal:

make tool use trustworthy enough for automation

Implementation target:

size limits
binary detection
symlink/traversal protection
structured patch/diff return values
shell command semantics and mutability classification

Target 8: Add `loader doctor`, `loader status`, and `loader session`

Goal:

make Loader operable as a product

Implementation target:

backend health
model capability snapshot
workspace detection
write-access detection
test/build command detection
active session summary

Why:

better operator feedback means less guesswork in the agent loop

Target 9: Add memory/notepad tools

Goal:

give Loader durable short-term and long-term memory

Implementation target:

read/write project memory
append working notes
store user directives and repo conventions

Why:

this reduces re-discovery and improves follow-through across turns

Target 10: Add a lightweight read-only inspect lane

Goal:

avoid using the full agent loop for every lookup

Implementation target:

loader explore or equivalent internal mode
optimized for:
- file/symbol lookup
- pattern discovery
- relationship questions

Why:

simple tasks should stay cheap and fast

Target 11: Add a parity harness

Goal:

improve behavior intentionally instead of impressionistically

Implementation target:

scripted mock backend scenarios for:
- simple read
- multi-tool turn
- denied permission
- write/edit success
- verification-required task
- premature completion rejection
- looped/duplicate action prevention

Why:

this is how Loader becomes reliable

Target 12: Add workflow-aware prompts and capability profiles

Goal:

make Loader less brittle across models

Implementation target:

replace one generic system prompt with mode-specific prompts
add provider/model capability profiles:
- native tools
- streaming
- context budget
- preferred tool-call format
- verification strictness

Why:

behavior should be shaped by runtime policy, not guessed from model substrings

Priority order

This section was rewritten after a deeper validation pass against the actual code in refs/claw-code and refs/oh-my-codex, plus firsthand spot-checks of Loader's runtime. The deeper review confirmed every load-bearing claim in this report and surfaced one structural reorder: the Definition-of-Done work is the user's actual pain point and should land before permission modes, not after, because permissions are a safety win and DoD is the behavior win.

P0: Stabilize before changing behavior (Sprint 00)

write a failing regression test for the tool_call_id bug at agent/loop.py:885,906 first, before any harness work — it proves the bug is real and proves the harness exists in one move
scope pytest discovery so refs/ stops contaminating collection
exclude refs/ from ruff and mypy too
make uv run pytest work out of the box
port the scenario taxonomy from refs/claw-code/rust/crates/rusty-claude-cli/tests/mock_parity_harness.rs
rewrite README.md (currently still says "FortranGoingOnForty")
baseline parity checklist for current runtime behavior

P1: Replace the loop with a real runtime (Sprint 01)

new src/loader/runtime/ package with a typed turn engine
unify the native, ReAct, and "extracted JSON fallback" tool execution paths into one executor
fix the named bugs from Sprint 00's failing tests (tool_call_id, duplicate execution path)
replace substring-based NATIVE_TOOL_MODELS/NO_TOOL_MODELS model detection with a runtime/capabilities.py profile system — Loader needs to behave consistently across model choices
structured TurnSummary output

P2: The behavior fix the user actually asked for (Sprint 02)

DefinitionOfDone object per task: acceptance criteria, verification commands, evidence summary, pending/completed task items
explicit verify phase that runs the verification commands and gates completion on evidence
fix loop: verification failure returns to execution, not to final answer
minimum .loader/ directory shape (.loader/dod/) — full session/memory layout deferred to Sprint 05

This is the highest-leverage behavioral change in the entire plan and is the direct answer to "finishing too early" and "weak follow-through."

P3: Safety as policy, not as confirmation prompt (Sprint 03)

permission modes: read-only, workspace-write, danger-full-access
three-event tool lifecycle hooks (pre_tool_use, post_tool_use, post_tool_use_failure) modeled directly on refs/claw-code/rust/crates/runtime/src/hooks.rs
refactor safeguards.py (duplicate detection, validation, rollback) into pre-tool hook implementations rather than ad-hoc method calls
file operation hardening (workspace boundary, symlink, size limits, binary detection, structured patches)
shell operation hardening
expose active mode in CLI/TUI status

Hooks land alongside permissions because every later sprint hangs new behavior (verification, validation, observability) on the same lifecycle.

P4: Stop improvising one workflow for everything (Sprint 04)

mode router: clarify, plan, execute, verify (verify already exists from Sprint 02)
clarify artifact written to .loader/briefs/
planning artifacts (implementation plan + verification plan) written to .loader/plans/ and fed into the existing DoD object
tool prerequisites pulled forward from Sprint 06: TodoWrite (the "zero pending tasks" gate is empty without it) and AskUserQuestion (clarify rounds)

P5: Durable continuity (Sprint 05)

full .loader/ state directory under the layout already started in Sprint 02
session persistence and resume
transcript compaction with priority-aware summarization (model the design on refs/claw-code/rust/crates/runtime/src/summary_compression.rs)
memory/notepad surfaces
usage/cost tracking

P6: Operability and tool-surface expansion (Sprint 06)

loader doctor, loader status, loader session
read-only explore lane
broader tool surface (diff/patch-aware editing, git helpers, structured ask-user, etc.) — TodoWrite and AskUserQuestion already exist from Sprint 04

Deferred indefinitely

workflow hooks beyond the runtime tool lifecycle (notification/idle nudges, leader monitoring)
task/team/subagent orchestration
broad MCP ecosystem
richer plugin systems

These are real wins in claw-code/OMX, but Loader should not pursue them until the solo runtime is trustworthy.

What Loader should copy directly, and what it should not

Copy directly

typed turn runtime
permission model
file/shell hardening
session persistence
compaction
doctor/status/session surfaces
workflow artifacts
evidence-backed verification
parity harness discipline

Copy in simplified form

deep-interview
ralplan
ralph
memory/notepad
explore vs full-execution split

Do not copy blindly yet

full tmux/team runtime
huge command surface
Discord/openclaw notification stack
broad MCP ecosystem

Loader should first become a trustworthy single-agent local runtime. After that, team orchestration will actually help.

Recommended Loader architecture direction

If we want behavior closer to claw-code without losing Loader’s simplicity, I would steer toward:

Layer 1: Runtime core

typed TurnRuntime
SessionStore
PermissionPolicy
ToolExecutor
VerificationEngine

Layer 2: Workflow layer

ClarifyWorkflow
PlanWorkflow
ExecuteWorkflow
VerifyWorkflow

Layer 3: Product surfaces

TUI
CLI
doctor
status
session
explore

Layer 4: Optional future orchestration

hooks
background verification
multi-agent/task orchestration

That is a better fit for Loader than trying to clone all of OMX wholesale.

Immediate conclusions

Loader’s biggest problems are architectural, not just prompt-related.
claw-code is strongest where Loader is weakest: runtime contract, permissions, sessions, diagnostics, parity.
OMX is strongest where Loader is currently almost absent: clarification, planning discipline, durable state, completion/verification loops.
The fastest path to “better model behavior today” is not adding more heuristics. It is adding:
- workflow artifacts
- explicit verification
- persistent state
- a smaller, more trustworthy turn engine

Sprint scaffolding

After the deeper validation pass the original five-sprint plan was reshaped into seven sprints. The reshape splits the most ambitious sprint (the old Sprint 03, which bundled mode router + clarify + plan + DoD + verify/fix into one) and reorders so the user's actual pain point lands sooner. Sprint scaffolding lives under:

.docs/sprints/index.md
.docs/sprints/sprint00.md — Foundation, Measurement, and Parity Harness
.docs/sprints/sprint01.md — Turn Engine, Tool Contract, and Capability Profiles
.docs/sprints/sprint02.md — Definition of Done and Verify/Fix Loop
.docs/sprints/sprint03.md — Permission Modes and Tool Lifecycle Hooks
.docs/sprints/sprint04.md — Mode Router, Clarify, and Plan Artifacts
.docs/sprints/sprint05.md — Session State, Memory, and Compaction
.docs/sprints/sprint06.md — Doctor, Explore, Status, and Tool Surface Expansion

Recommended next move

Start with Sprint 00, and start Sprint 00 with the failing regression test.

Reason:

Loader needs a measurable baseline and a safer runtime before adding more behavior
the tool_call_id bug at agent/loop.py:885,906 is proof that untested code paths are silently broken
writing the failing test first proves both the bug and the harness in one move
otherwise every feature sprint will be built on unstable agent semantics

The execution phase should then be:

lock down the runtime and test harness (Sprint 00)
replace the loop with a typed runtime and capability profiles (Sprint 01)
define and enforce the completion contract (Sprint 02)
add the policy-based safety layer with hooks (Sprint 03)
add workflow modes and planning artifacts on top (Sprint 04)
then widen the durability and product surfaces (Sprints 05 and 06)

Plan adjustments after deeper review

The following changes were applied to the original report after a firsthand validation pass against the actual code in refs/claw-code and refs/oh-my-codex, plus spot-checks of Loader's runtime.

Verified directly against the code

tool_call_id bug confirmed at src/loader/agent/loop.py:885 and :906. Both call sites construct Message(role=Role.TOOL, content=..., tool_call_id=tool_call.id), but Message (src/loader/llm/base.py:33-39) has no such field. They live on the duplicate-suppression and pre-validation branches and would crash on first execution. Zero integration coverage.
Pytest discovery is broken by default. uv run pytest --collect-only picks up refs/claw-code/tests/test_porting_workspace.py and fails to import loader because there is no tool.pytest.ini_options block in pyproject.toml.
Loop monolith confirmed by line counts. agent/loop.py is 1929 LOC, agent/reasoning.py is 1196, agent/safeguards.py is 1079 — roughly 4200 lines of orchestration in one cluster.
claw-code's run_turn() shape is exactly as the report describes. Read directly at refs/claw-code/rust/crates/runtime/src/conversation.rs:295-470. Typed message build → tool extraction → pre-hook → permission check → execute → post-hook (success or failure variant) → typed ConversationMessage::tool_result() → push → repeat. ~175 lines of clean code.
claw-code permission modes are ReadOnly / WorkspaceWrite / DangerFullAccess (plus Prompt and Allow), defined at refs/claw-code/rust/crates/runtime/src/permissions.rs:8-27. The 10MB read/write caps, binary detection, workspace boundary check, and structured patch outputs in file_ops.rs are all real.
claw-code hooks are PreToolUse / PostToolUse / PostToolUseFailure, defined at refs/claw-code/rust/crates/runtime/src/hooks.rs:19-34 and wired into the conversation loop at lines 371, 427-453.
OMX skills are real and even more rigorous than the report described. ralplan enforces a max-5-iteration Critic loop with sequential Architect→Critic ordering. ralph has explicit phase enums (starting/executing/verifying/fixing/complete/failed/cancelled) persisted via state_write to .omx/state/{mode}-state.json. The verifier in src/verification/verifier.ts scales by task size with concrete file-count thresholds.

Corrected facts

Tool count: 49, not 40. refs/claw-code/rust/crates/tools/src/lib.rs exposes 49 ToolSpec entries in mvp_tool_specs(). Doesn't change the lesson, but worth knowing.
claw-code permissions have a third layer. Beyond PermissionMode and per-tool requirements, PermissionPolicy carries three rule lists (allow_rules, deny_rules, ask_rules) for context-specific overrides. Loader can land the mode layer first and defer the rule layer.
claw-code summary compression is sophisticated. It's not message-level truncation — it's line-level prioritization with deduplication and budget enforcement at refs/claw-code/rust/crates/runtime/src/summary_compression.rs. Sprint 05 should model on this rather than reinventing.

Structural plan changes

The old Sprint 03 was split. It bundled mode router + clarify + plan + DoD + verify/fix into one sprint, which is essentially "ralplan + ralph + verifier" simultaneously. The DoD/verify-fix half became the new Sprint 02 (highest-leverage behavioral fix). The mode router / clarify / plan half became the new Sprint 04.
The old Sprint 02 (permissions) became the new Sprint 03 and was reordered to land after DoD. Permissions are a safety win, not a behavior win, and the user's actual complaints are about behavior. DoD lands first.
Hooks landed in the same sprint as permissions. The original plan split them across sprints; that creates rework because every later runtime addition (verification, observability, validation) wants the same lifecycle. Sprint 03 owns both.
Capability profiles became a Sprint 01 deliverable. They were Target 12 in the original report and orphaned from the sprint plan. They belong in the runtime layer and are critical for the user's "behave consistently across model choices" goal.
The minimum .loader/ directory shape moves to Sprint 02 (just .loader/dod/). The full session/memory/compaction layout stays in Sprint 05. This unblocks Sprint 02 and Sprint 04 from waiting on Sprint 05.
TodoWrite and AskUserQuestion move from Sprint 06 to Sprint 04 as prerequisites for the clarify mode and the "zero pending tasks" gate. The broad tool-surface expansion stays in Sprint 06.
Sprint 00's first deliverable is now the failing regression test for the tool_call_id bug, before any harness work. It proves the bug and proves the harness exist in one move.

View source

  
        1
        # Loader Deep Dive: Gaps, Strengths, and a Path Toward Claw-Like Behavior
      
        2
        
        3
        Date: 2026-04-06
      
        4
        
        5
        ## Scope and assumptions
      
        6
        
        7
        This report compares three things:
      
        8
        
        9
        1. `Loader` itself
      
        10
        2. `refs/claw-code`, using the Rust workspace under `refs/claw-code/rust/` as the canonical runtime
      
        11
        3. `refs/oh-my-codex` as the workflow-layer parent repo
      
        12
        
        13
        Assumption: `oh-my-codex` is the correct “parent repo” for this exercise. That assumption is based on:
      
        14
        
        15
        - `refs/claw-code/README.md`
      
        16
        - `refs/claw-code/PHILOSOPHY.md`
      
        17
        - the fact that `refs/claw-code` explicitly describes `src/` as a companion Python/reference workspace, not the primary runtime
      
        18
        
        19
        If you meant a different parent, we should rerun the comparison against that repo, but this is a solid first pass.
      
        20
        
        21
        ## Executive summary
      
        22
        
        23
        Loader has the right instincts but is operating at the wrong layer.
      
        24
        
        25
        The codebase already knows that models need:
      
        26
        
        27
        - planning help
      
        28
        - recovery help
      
        29
        - confidence checks
      
        30
        - completion checks
      
        31
        - safe tool use
      
        32
        
        33
        But Loader mostly tries to enforce those after the model has already started drifting. `claw-code` and `oh-my-codex` get better behavior because they shape the work before, during, and after the model call:
      
        34
        
        35
        - before: explicit mode selection, clarification, approved planning artifacts
      
        36
        - during: durable runtime state, richer tool surface, explicit permission model, session persistence
      
        37
        - after: verification protocols, completion gates, retry/fix loops, parity harnesses, operator diagnostics
      
        38
        
        39
        The biggest lesson is not “copy their prompt.”
      
        40
        
        41
        The biggest lesson is:
      
        42
        
        43
        > Loader needs a stronger execution contract, not just stronger prompting.
      
        44
        
        45
        If we want Loader to feel closer to `claw-code` regardless of model choice, the highest-leverage work is:
      
        46
        
        47
        1. replace the monolithic heuristic loop with a typed turn engine
      
        48
        2. add durable workflow/state artifacts
      
        49
        3. make “definition of done” evidence-based instead of heuristic
      
        50
        4. add real permission/safety boundaries around tools
      
        51
        5. build a parity harness so we can improve behavior intentionally
      
        52
        
        53
        ## Method
      
        54
        
        55
        I reviewed:
      
        56
        
        57
        - Loader source under `src/loader/`
      
        58
        - Loader tests under `tests/`
      
        59
        - `refs/claw-code/README.md`
      
        60
        - `refs/claw-code/USAGE.md`
      
        61
        - `refs/claw-code/PARITY.md`
      
        62
        - `refs/claw-code/PHILOSOPHY.md`
      
        63
        - `refs/claw-code/rust/crates/runtime/*`
      
        64
        - `refs/claw-code/rust/crates/tools/src/lib.rs`
      
        65
        - `refs/oh-my-codex/README.md`
      
        66
        - `refs/oh-my-codex/AGENTS.md`
      
        67
        - `refs/oh-my-codex/skills/deep-interview/SKILL.md`
      
        68
        - `refs/oh-my-codex/skills/ralplan/SKILL.md`
      
        69
        - `refs/oh-my-codex/skills/ralph/SKILL.md`
      
        70
        - `refs/oh-my-codex/src/modes/base.ts`
      
        71
        - `refs/oh-my-codex/src/ralplan/runtime.ts`
      
        72
        - `refs/oh-my-codex/src/mcp/memory-server.ts`
      
        73
        - `refs/oh-my-codex/src/verification/verifier.ts`
      
        74
        - `refs/oh-my-codex/src/cli/doctor.ts`
      
        75
        - `refs/oh-my-codex/src/scripts/notify-hook.ts`
      
        76
        
        77
        I also ran Loader verification commands:
      
        78
        
        79
        - `uv run pytest`
      
        80
          - failed during collection
      
        81
          - discovered `refs/claw-code/tests/*`
      
        82
          - also failed to import `loader`
      
        83
        - `uv run --with pytest --with pytest-asyncio python -m pytest tests -q`
      
        84
          - 56 passed
      
        85
          - 3 failed
      
        86
        
        87
        That matters because some of Loader’s runtime paths are clearly under-tested.
      
        88
        
        89
        ## What Loader already does well
      
        90
        
        91
        ### 1. Loader is small, understandable, and hackable
      
        92
        
        93
        This is a real advantage.
      
        94
        
        95
        `src/loader/` is about 55 source files, and the core agent behavior is easy to locate. Compared to `claw-code` and especially OMX, Loader is much easier to refactor aggressively.
      
        96
        
        97
        ### 2. Loader is genuinely local-first
      
        98
        
        99
        The Ollama-first posture is simple and useful. A lot of the complexity in `claw-code` and OMX comes from supporting broad operational surfaces, multiple runtimes, OAuth, MCP, tmux/team flows, and richer tool ecosystems. Loader can keep its local-first identity while still copying the good execution ideas.
      
        100
        
        101
        ### 3. Loader already contains the seeds of a better system
      
        102
        
        103
        These are the right instincts:
      
        104
        
        105
        - project context detection in `src/loader/context/project.py`
      
        106
        - runtime safeguards in `src/loader/agent/safeguards.py`
      
        107
        - recovery categorization in `src/loader/agent/recovery.py`
      
        108
        - optional decomposition / critique / confidence / verification / completion checks in `src/loader/agent/reasoning.py`
      
        109
        - a decent Textual app in `src/loader/ui/app.py`
      
        110
        
        111
        The problem is not that Loader lacks ideas.
      
        112
        
        113
        The problem is that these ideas are bolted onto one big runtime loop instead of being elevated into the architecture.
      
        114
        
        115
        ### 4. The TUI is a meaningful strength
      
        116
        
        117
        Loader’s TUI already gives you:
      
        118
        
        119
        - model selection
      
        120
        - streaming output
      
        121
        - approval handling
      
        122
        - status line updates
      
        123
        - tool widgets
      
        124
        
        125
        That is more product surface than many small local agents. It is worth keeping.
      
        126
        
        127
        ## Where Loader is weak today
      
        128
        
        129
        ### 1. Loader’s product surface is not trustworthy yet
      
        130
        
        131
        The most visible sign is the README:
      
        132
        
        133
        - `README.md:1-2` still says “FortranGoingOnForty” and “A tutorial on using Fortran for beginners.”
      
        134
        
        135
        That looks small, but it reflects a bigger problem: Loader is missing operational polish and self-diagnosis. `claw-code` and OMX both treat installability, health checks, and discoverability as product requirements. Loader currently feels like an experiment more than a tool.
      
        136
        
        137
        ### 2. Loader’s main runtime is too monolithic and too heuristic
      
        138
        
        139
        `src/loader/agent/loop.py` is the heart of Loader, and it is doing too much:
      
        140
        
        141
        - prompt construction
      
        142
        - streaming output handling
      
        143
        - raw tool-call extraction
      
        144
        - duplicate tool execution flows
      
        145
        - recovery
      
        146
        - validation
      
        147
        - rollback tracking
      
        148
        - completion nudging
      
        149
        - loop detection
      
        150
        - steering
      
        151
        - partial planning
      
        152
        - decomposition
      
        153
        
        154
        The result is a loop that is hard to reason about and easy to destabilize.
      
        155
        
        156
        The core design smell is that Loader tries to recover from model misbehavior in-place instead of enforcing a stronger turn protocol.
      
        157
        
        158
        ### 3. Loader has a real runtime contract bug in tool-result handling
      
        159
        
        160
        **Verified directly against the code.** There is a concrete mismatch between `Message` and the loop:
      
        161
        
        162
        - `src/loader/llm/base.py:33-39` defines `Message` with `role`, `content`, `tool_calls`, and `tool_results`. There is no `tool_call_id` field on `Message` — that field belongs to the separate `ToolResult` dataclass at `src/loader/llm/base.py:25-30`.
      
        163
        - `src/loader/agent/loop.py:885` and `src/loader/agent/loop.py:906` both construct `Message(role=Role.TOOL, content=..., tool_call_id=tool_call.id)`.
      
        164
        
        165
        Both call sites will raise `TypeError: Message.__init__() got an unexpected keyword argument 'tool_call_id'` the moment they execute. They live on the duplicate-suppression and pre-validation branches of the loop, which means they have **zero** integration coverage today. This single bug is the proof that the test harness gap is real and that Sprint 00 must precede any behavioral work.
      
        166
        
        167
        ### 4. Loader duplicates tool execution logic instead of centralizing it
      
        168
        
        169
        There are effectively two execution paths:
      
        170
        
        171
        - the normal native/ReAct tool path
      
        172
        - the “raw JSON extracted tool call” path
      
        173
        
        174
        Those paths duplicate:
      
        175
        
        176
        - duplicate checking
      
        177
        - validation
      
        178
        - confirmation behavior
      
        179
        - result recording
      
        180
        - loop/error handling
      
        181
        
        182
        That makes behavior inconsistent and increases the chance that fixes in one path never land in the other.
      
        183
        
        184
        `claw-code`’s `ConversationRuntime::run_turn()` is much tighter: receive assistant output, extract tool uses, authorize, execute, append tool results, repeat.
      
        185
        
        186
        ### 5. Loader’s system prompt is too shallow and too rigid
      
        187
        
        188
        `src/loader/agent/prompts.py:148-208` gives Loader a generic “use tools immediately / no code blocks / no numbered steps / read files before editing” prompt.
      
        189
        
        190
        This is too blunt.
      
        191
        
        192
        Problems:
      
        193
        
        194
        - it treats all tasks like immediate tool-execution tasks
      
        195
        - it globally bans numbered steps, which is bad for planning/reporting tasks
      
        196
        - it does not define modes
      
        197
        - it does not encode verification expectations
      
        198
        - it does not encode completion criteria
      
        199
        - it does not distinguish “clarify”, “plan”, “execute”, and “verify”
      
        200
        
        201
        OMX is much better here. It does not just say “do the task.” It routes the task into a workflow lane with an explicit contract.
      
        202
        
        203
        ### 6. Loader’s tool surface is too thin
      
        204
        
        205
        Loader has 6 default tools:
      
        206
        
        207
        - `read`
      
        208
        - `write`
      
        209
        - `edit`
      
        210
        - `glob`
      
        211
        - `bash`
      
        212
        - `grep`
      
        213
        
        214
        That is enough for toy execution, but not enough for strong agent behavior.
      
        215
        
        216
        What is missing compared to `claw-code` / OMX:
      
        217
        
        218
        - task/todo tracking
      
        219
        - structured ask-user surfaces
      
        220
        - memory/notepad
      
        221
        - doctor/status/session tooling
      
        222
        - git-aware helpers
      
        223
        - explore vs full-execution split
      
        224
        - diff/patch-aware editing
      
        225
        - web/search/fetch surfaces
      
        226
        - structured output surfaces
      
        227
        - subagent/team coordination surfaces
      
        228
        - MCP-backed state and memory
      
        229
        
        230
        The result is that Loader has to keep too much in the prompt and too much in ephemeral model state.
      
        231
        
        232
        ### 7. Loader’s safety model is primitive
      
        233
        
        234
        Loader’s current protection model is mostly:
      
        235
        
        236
        - “safe commands” vs “ask for confirmation”
      
        237
        - destructive tool flags
      
        238
        
        239
        Problems in practice:
      
        240
        
        241
        - no permission modes like `read-only`, `workspace-write`, `danger-full-access`
      
        242
        - no strong workspace boundary checks
      
        243
        - no binary-file guards
      
        244
        - no file size limits
      
        245
        - no symlink escape protection
      
        246
        - no command semantics beyond a short safe list
      
        247
        
        248
        Evidence:
      
        249
        
        250
        - `src/loader/tools/file_tools.py` reads/writes resolved paths directly
      
        251
        - `src/loader/tools/shell_tools.py` uses `create_subprocess_shell()` on arbitrary shell strings
      
        252
        - `src/loader/tools/shell_tools.py:13-20` uses a short safe command set, but no mode-based authorization model
      
        253
        
        254
        By comparison, `claw-code` has:
      
        255
        
        256
        - `PermissionPolicy`
      
        257
        - `PermissionEnforcer`
      
        258
        - workspace boundary checks
      
        259
        - binary/size guards in file ops
      
        260
        - permission-mode aware tool definitions
      
        261
        
        262
        That does not just make it safer. It makes the agent more predictable.
      
        263
        
        264
        ### 8. Loader’s “definition of done” is heuristic, not contractual
      
        265
        
        266
        The user complaint about “spending too long on simple tasks or finishing early without followup” is visible directly in the code.
      
        267
        
        268
        Loader’s current strategy is:
      
        269
        
        270
        - heuristically decide whether the response looks premature
      
        271
        - nudge the model to continue
      
        272
        - maybe ask it to confirm completion
      
        273
        
        274
        See:
      
        275
        
        276
        - `src/loader/agent/reasoning.py:721-854`
      
        277
        
        278
        This is well-intentioned, but it is still guesswork.
      
        279
        
        280
        It does not require:
      
        281
        
        282
        - explicit acceptance criteria
      
        283
        - a verification plan
      
        284
        - fresh command evidence
      
        285
        - zero pending tasks
      
        286
        - a final sign-off phase
      
        287
        
        288
        OMX’s `ralph` workflow does.
      
        289
        
        290
        That difference is enormous.
      
        291
        
        292
        ### 9. Loader has no durable workflow state
      
        293
        
        294
        Loader has plans, decomposition, and completion logic, but they live inside one run and disappear.
      
        295
        
        296
        Missing pieces:
      
        297
        
        298
        - persisted mode state
      
        299
        - session memory
      
        300
        - approved plan artifacts
      
        301
        - PRD / test-spec artifacts
      
        302
        - progress ledger
      
        303
        - durable “what was already decided”
      
        304
        - resume-safe task state
      
        305
        
        306
        OMX writes state under `.omx/` and uses that to keep the workflow coherent across retries, handoffs, and interruptions. Loader currently depends on in-memory context plus prompt history only.
      
        307
        
        308
        ### 10. Loader is too backend-specific and too capability-fragile
      
        309
        
        310
        Despite defining an abstract LLM backend, Loader is effectively Ollama-only today.
      
        311
        
        312
        Evidence:
      
        313
        
        314
        - `src/loader/cli/main.py` supports only `ollama`
      
        315
        - `src/loader/llm/ollama.py` hardcodes native tool support by model-name substring matching
      
        316
        
        317
        This is fragile for behavior matching “with any model chosen.”
      
        318
        
        319
        What Loader needs instead is:
      
        320
        
        321
        - a provider-independent tool-calling contract
      
        322
        - explicit capability profiles
      
        323
        - distinct fallback strategies for native tools vs text tool calling
      
        324
        - prompts/workflows that degrade gracefully
      
        325
        
        326
        ### 11. Loader’s tests are not protecting the real runtime
      
        327
        
        328
        Loader’s test suite is mostly:
      
        329
        
        330
        - tool unit tests
      
        331
        - parsing tests
      
        332
        - recovery tests
      
        333
        
        334
        That is useful, but insufficient.
      
        335
        
        336
        The current state:
      
        337
        
        338
        - `uv run pytest` fails by default after adding `refs/`
      
        339
        - the repo does not scope pytest discovery
      
        340
        - the “normal” targeted run needs `--with pytest --with pytest-asyncio`
      
        341
        - even then, 3 tests fail
      
        342
        - there are no strong turn-loop integration tests
      
        343
        - there is no deterministic mock backend harness comparable to `claw-code`
      
        344
        
        345
        This is why structural issues like the `tool_call_id` mismatch can survive.
      
        346
        
        347
        ## What `claw-code` gets right
      
        348
        
        349
        ## 1. The runtime contract is explicit
      
        350
        
        351
        `refs/claw-code/rust/crates/runtime/src/conversation.rs` is the biggest thing Loader should study.
      
        352
        
        353
        The core `run_turn()` flow is clean:
      
        354
        
        355
        1. append user message to session
      
        356
        2. stream assistant response
      
        357
        3. build a typed assistant message
      
        358
        4. extract tool uses
      
        359
        5. run permission checks
      
        360
        6. execute tool
      
        361
        7. append tool result message
      
        362
        8. repeat until no more tool uses
      
        363
        9. optionally compact session
      
        364
        10. return a typed turn summary
      
        365
        
        366
        That is much more trustworthy than Loader’s current “stream + parse + filter + maybe reparse + maybe extract raw JSON + maybe duplicate path” approach.
      
        367
        
        368
        ## 2. Session persistence and compaction are first-class
      
        369
        
        370
        `claw-code` treats long-lived sessions as a product feature:
      
        371
        
        372
        - persisted sessions
      
        373
        - resume support
      
        374
        - usage tracking
      
        375
        - compaction thresholds
      
        376
        - summarized continuation messages
      
        377
        
        378
        Relevant files:
      
        379
        
        380
        - `refs/claw-code/rust/crates/runtime/src/conversation.rs`
      
        381
        - `refs/claw-code/rust/crates/runtime/src/compact.rs`
      
        382
        - `refs/claw-code/rust/crates/runtime/src/summary_compression.rs`
      
        383
        - `refs/claw-code/rust/crates/runtime/src/usage.rs`
      
        384
        
        385
        This matters because good agent behavior is often continuity behavior.
      
        386
        
        387
        ## 3. Permissions are part of the runtime, not just UI confirmation
      
        388
        
        389
        `claw-code` has an actual permission model with three layers:
      
        390
        
        391
        - **Mode layer** — `PermissionMode` enum with `ReadOnly`, `WorkspaceWrite`, `DangerFullAccess`, `Prompt`, and `Allow` (`refs/claw-code/rust/crates/runtime/src/permissions.rs:8-27`)
      
        392
        - **Per-tool requirement layer** — every `ToolSpec` declares the minimum mode it requires, mapped in `PermissionPolicy.tool_requirements`
      
        393
        - **Rule layer** — three rule lists (`allow_rules`, `deny_rules`, `ask_rules`) for context-specific overrides on top of the mode/requirement check
      
        394
        
        395
        Plus typed authorization outcomes, file-write boundary logic, and bash gating.
      
        396
        
        397
        Relevant files:
      
        398
        
        399
        - `refs/claw-code/rust/crates/runtime/src/permission_enforcer.rs`
      
        400
        - `refs/claw-code/rust/crates/runtime/src/permissions.rs`
      
        401
        
        402
        Loader needs this badly. The mode layer alone is the high-leverage start; the rule layer can come later.
      
        403
        
        404
        ## 4. File and shell operations are engineered, not just exposed
      
        405
        
        406
        `claw-code`’s file layer includes:
      
        407
        
        408
        - max read size
      
        409
        - max write size
      
        410
        - binary detection
      
        411
        - workspace-boundary validation
      
        412
        - structured patch outputs
      
        413
        
        414
        Relevant file:
      
        415
        
        416
        - `refs/claw-code/rust/crates/runtime/src/file_ops.rs`
      
        417
        
        418
        Loader’s file tools are functional, but too permissive and too simplistic to support strong autonomous behavior.
      
        419
        
        420
        ## 5. Hooks and lifecycle surfaces give the runtime escape valves
      
        421
        
        422
        `claw-code` has pre-tool and post-tool hooks, including failure hooks.
      
        423
        
        424
        That is important because not every behavioral improvement should live inside the model prompt. Hooks let the system inject policy, observability, and guardrails without changing the LLM call itself.
      
        425
        
        426
        Relevant files:
      
        427
        
        428
        - `refs/claw-code/rust/crates/runtime/src/hooks.rs`
      
        429
        - `refs/claw-code/rust/crates/runtime/src/conversation.rs`
      
        430
        
        431
        ## 6. The project is honest about parity and weaknesses
      
        432
        
        433
        `refs/claw-code/PARITY.md` is one of the best engineering lessons in the whole comparison.
      
        434
        
        435
        It does three things Loader does not yet do:
      
        436
        
        437
        - names what is actually shipped
      
        438
        - names what is still shallow or stubbed
      
        439
        - ties roadmap claims to concrete evidence
      
        440
        
        441
        That alone reduces thrash.
      
        442
        
        443
        Loader needs a similar parity/backlog document for runtime behavior.
      
        444
        
        445
        ## 7. Diagnostics and operator surfaces are part of the product
      
        446
        
        447
        `claw-code` exposes operational commands like:
      
        448
        
        449
        - `status`
      
        450
        - `sandbox`
      
        451
        - `agents`
      
        452
        - `mcp`
      
        453
        - `skills`
      
        454
        - `doctor`
      
        455
        - session resume
      
        456
        
        457
        This is not just convenience. It makes the system inspectable. Loader currently hides too much inside the runtime.
      
        458
        
        459
        ## Where `claw-code` is still incomplete
      
        460
        
        461
        It is worth staying honest here too.
      
        462
        
        463
        Even `claw-code` admits some shallowness in `PARITY.md`:
      
        464
        
        465
        - some surfaces are registry-backed approximations, not deep external integrations
      
        466
        - session compaction parity is still open
      
        467
        - token accounting accuracy is still open
      
        468
        - some tool surfaces remain shallow or partially stubbed
      
        469
        
        470
        That is useful because the goal is not blind imitation. The goal is to copy the parts that most affect day-to-day behavior.
      
        471
        
        472
        ## What OMX adds that Loader is currently missing almost entirely
      
        473
        
        474
        `claw-code` gives a better runtime. OMX gives a better workflow.
      
        475
        
        476
        This is where most of Loader’s “definition of done” and “follow-through” problems are answered.
      
        477
        
        478
        ### 1. Clarification is a mode, not an ad hoc question
      
        479
        
        480
        `deep-interview` is not “ask a question if confused.”
      
        481
        
        482
        It is a formal ambiguity-reduction workflow with:
      
        483
        
        484
        - a context snapshot
      
        485
        - one-question rounds
      
        486
        - ambiguity scoring
      
        487
        - explicit non-goals
      
        488
        - explicit decision boundaries
      
        489
        - a crystallized artifact for downstream execution
      
        490
        
        491
        Relevant files:
      
        492
        
        493
        - `refs/oh-my-codex/skills/deep-interview/SKILL.md`
      
        494
        
        495
        Loader currently has no equivalent. It either acts immediately or tries to self-nudge mid-flight.
      
        496
        
        497
        ### 2. Planning is artifact-based and consensus-based
      
        498
        
        499
        `ralplan` is much more than “make a numbered list.”
      
        500
        
        501
        It includes:
      
        502
        
        503
        - Planner / Architect / Critic loops
      
        504
        - max iteration handling
      
        505
        - planning completion gates
      
        506
        - PRD and test-spec artifacts
      
        507
        - approved handoff into execution
      
        508
        
        509
        Relevant files:
      
        510
        
        511
        - `refs/oh-my-codex/skills/ralplan/SKILL.md`
      
        512
        - `refs/oh-my-codex/src/ralplan/runtime.ts`
      
        513
        - `refs/oh-my-codex/src/planning/artifacts.ts`
      
        514
        
        515
        Loader’s `Plan` object is fine as a local helper, but it is nowhere near this level of control.
      
        516
        
        517
        ### 3. “Done” is a workflow contract in Ralph
      
        518
        
        519
        This is the single biggest lesson for Loader.
      
        520
        
        521
        Ralph encodes:
      
        522
        
        523
        - persistence until done
      
        524
        - mandatory verification
      
        525
        - architect verification
      
        526
        - retry/fix loops
      
        527
        - state transitions
      
        528
        - explicit cleanup on completion
      
        529
        - a final checklist
      
        530
        
        531
        Relevant file:
      
        532
        
        533
        - `refs/oh-my-codex/skills/ralph/SKILL.md`
      
        534
        
        535
        This directly addresses the exact Loader problems you named:
      
        536
        
        537
        - weak tool follow-through
      
        538
        - finishing too early
      
        539
        - spending too long in loops
      
        540
        - poor task closure
      
        541
        
        542
        ### 4. Workflow state lives outside the prompt
      
        543
        
        544
        OMX stores durable mode state under `.omx/` and exposes it through state tools.
      
        545
        
        546
        Relevant files:
      
        547
        
        548
        - `refs/oh-my-codex/src/modes/base.ts`
      
        549
        - `refs/oh-my-codex/src/mcp/state-server.ts`
      
        550
        - `refs/oh-my-codex/src/mcp/memory-server.ts`
      
        551
        
        552
        That means:
      
        553
        
        554
        - progress survives interruptions
      
        555
        - execution can be resumed
      
        556
        - handoffs are grounded
      
        557
        - context can be audited
      
        558
        - the model does not have to remember everything itself
      
        559
        
        560
        ### 5. Memory and notepad are explicit tools
      
        561
        
        562
        OMX has project memory and a notepad.
      
        563
        
        564
        That sounds small, but it matters a lot for agent stability. It gives the system somewhere to store:
      
        565
        
        566
        - conventions
      
        567
        - known build commands
      
        568
        - temporary working notes
      
        569
        - durable directives
      
        570
        
        571
        Relevant file:
      
        572
        
        573
        - `refs/oh-my-codex/src/mcp/memory-server.ts`
      
        574
        
        575
        Loader currently rediscovers too much per turn.
      
        576
        
        577
        ### 6. Verification is standardized
      
        578
        
        579
        OMX has verification instructions that scale by task size and explicitly require evidence.
      
        580
        
        581
        Relevant file:
      
        582
        
        583
        - `refs/oh-my-codex/src/verification/verifier.ts`
      
        584
        
        585
        Loader has completion heuristics. OMX has verification policy.
      
        586
        
        587
        That is the difference between “the model sounded done” and “the system proved done.”
      
        588
        
        589
        ### 7. Doctor / explore / sparkshell reduce prompt waste
      
        590
        
        591
        OMX distinguishes:
      
        592
        
        593
        - health checking (`doctor`)
      
        594
        - lightweight read-only exploration (`explore`)
      
        595
        - bounded shell-native inspection (`sparkshell`)
      
        596
        
        597
        That is smart.
      
        598
        
        599
        It keeps the main execution loop from becoming the only place everything happens.
      
        600
        
        601
        Relevant files:
      
        602
        
        603
        - `refs/oh-my-codex/src/cli/doctor.ts`
      
        604
        - `refs/oh-my-codex/src/cli/explore.ts`
      
        605
        - `refs/oh-my-codex/src/cli/sparkshell.ts`
      
        606
        
        607
        ### 8. Follow-through is supported outside the agent context window
      
        608
        
        609
        The idle notifications, leader nudges, and continuation prompts in OMX are important.
      
        610
        
        611
        Relevant file:
      
        612
        
        613
        - `refs/oh-my-codex/src/scripts/notify-hook.ts`
      
        614
        
        615
        This is one of the deeper design differences:
      
        616
        
        617
        - Loader tries to keep the model on-task from inside the loop
      
        618
        - OMX also nudges, monitors, and routes from outside the loop
      
        619
        
        620
        That is a more robust design.
      
        621
        
        622
        ## Comparison matrix
      
        623
        
        624
        | Area | Loader today | `claw-code` | OMX lesson | Takeaway for Loader |
      
        625
        |---|---|---|---|---|
      
        626
        | Runtime loop | monolithic, heuristic-heavy | typed turn engine | separate mode/workflow from turn runtime | split Loader runtime first |
      
        627
        | Tool surface | 6 basic tools | 49 exposed tool specs on main | tools should include workflow/state surfaces | add stateful and diagnostic tools |
      
        628
        | Permissions | confirmation-only | permission policy + enforcer | safety belongs in runtime | add modes and boundaries |
      
        629
        | Completion | heuristic continuation prompt | stronger runtime summaries | Ralph gives evidence-backed done gates | replace “maybe done” with explicit verification |
      
        630
        | Planning | ephemeral numbered list | some plan surfaces | ralplan = persisted, reviewed planning | persist plan artifacts |
      
        631
        | Memory/state | none | sessions + compaction + tracing | `.omx/` mode state + memory | add `.loader/` state dir |
      
        632
        | Diagnostics | minimal | status/sandbox/doctor/session | doctor/explore/sparkshell | make Loader inspectable |
      
        633
        | Testing | unit-heavy, no runtime harness | mock parity harness | workflow runtime is tested like product behavior | build scripted runtime tests |
      
        634
        | Extensibility | none | hooks, plugins, MCP surfaces | workflow and notification hooks | add lifecycle hooks later |
      
        635
        | Multi-agent | none | agent/team surfaces | team + ralph staffing | defer until solo runtime is trustworthy |
      
        636
        
        637
        ## Why Loader’s current weaknesses produce the behavior you described
      
        638
        
        639
        ### Poor tool use
      
        640
        
        641
        Root causes:
      
        642
        
        643
        - shallow tool surface
      
        644
        - brittle prompt contract
      
        645
        - native-vs-ReAct bifurcation
      
        646
        - duplicated execution code paths
      
        647
        - no typed runtime contract for tool results
      
        648
        
        649
        ### Weak follow-through
      
        650
        
        651
        Root causes:
      
        652
        
        653
        - no persistent task state
      
        654
        - no approved plan artifact
      
        655
        - no explicit verification lane
      
        656
        - no final completion checklist
      
        657
        
        658
        ### Finishing early
      
        659
        
        660
        Root causes:
      
        661
        
        662
        - completion is heuristic
      
        663
        - no required evidence model
      
        664
        - no acceptance criteria artifact
      
        665
        - no final “prove it” pass
      
        666
        
        667
        ### Spending too long on simple tasks
      
        668
        
        669
        Root causes:
      
        670
        
        671
        - the runtime loop tries too many recoveries in one place
      
        672
        - the system prompt does not distinguish task modes cleanly
      
        673
        - there is no “lightweight inspect” lane like `explore`
      
        674
        - the model often has to infer the workflow instead of being routed into one
      
        675
        
        676
        ### Model sensitivity
      
        677
        
        678
        Root causes:
      
        679
        
        680
        - behavior is prompt-and-heuristic driven
      
        681
        - capability detection is backend-specific and brittle
      
        682
        - no workflow artifacts that survive model variance
      
        683
        
        684
        This is why copying OMX’s workflow ideas is so high leverage. It reduces how much we ask the model to improvise.
      
        685
        
        686
        ## Concrete implementation targets
      
        687
        
        688
        These are ordered by impact on Loader behavior, not by code convenience.
      
        689
        
        690
        ### Target 1: Introduce a real turn engine
      
        691
        
        692
        Goal:
      
        693
        
        694
        - replace the current giant loop with a smaller, typed conversation runtime
      
        695
        
        696
        Implementation target:
      
        697
        
        698
        - create a new `src/loader/runtime/` package
      
        699
        - move message/session/tool-result logic out of `src/loader/agent/loop.py`
      
        700
        - give tool results a first-class typed representation
      
        701
        - unify native, ReAct, and extracted-tool execution through one executor path
      
        702
        
        703
        Why:
      
        704
        
        705
        - this is the foundation for every other improvement
      
        706
        
        707
        ### Target 2: Add persistent Loader state under `.loader/`
      
        708
        
        709
        Goal:
      
        710
        
        711
        - make workflow state durable instead of prompt-only
      
        712
        
        713
        Implementation target:
      
        714
        
        715
        - `.loader/state/`
      
        716
        - `.loader/sessions/`
      
        717
        - `.loader/plans/`
      
        718
        - `.loader/notepad.md`
      
        719
        - `.loader/project-memory.json`
      
        720
        
        721
        Why:
      
        722
        
        723
        - Loader needs somewhere to store progress, acceptance criteria, and recovered knowledge
      
        724
        
        725
        ### Target 3: Separate task modes
      
        726
        
        727
        Goal:
      
        728
        
        729
        - stop treating all requests like immediate tool-execution requests
      
        730
        
        731
        Implementation target:
      
        732
        
        733
        - mode router with at least:
      
        734
          - `clarify`
      
        735
          - `plan`
      
        736
          - `execute`
      
        737
          - `verify`
      
        738
        
        739
        Why:
      
        740
        
        741
        - this is the minimum structure needed to stop overthinking simple work and underthinking complex work
      
        742
        
        743
        ### Target 4: Replace heuristic completion with an evidence-backed done contract
      
        744
        
        745
        Goal:
      
        746
        
        747
        - make completion explicit and testable
      
        748
        
        749
        Implementation target:
      
        750
        
        751
        - define a `DefinitionOfDone` object per task
      
        752
        - require:
      
        753
          - acceptance criteria
      
        754
          - verification commands
      
        755
          - evidence summary
      
        756
          - zero pending task items
      
        757
        
        758
        Why:
      
        759
        
        760
        - this is the main fix for premature completion
      
        761
        
        762
        ### Target 5: Add `deep-interview`-lite and `ralplan`-lite equivalents
      
        763
        
        764
        Goal:
      
        765
        
        766
        - pull ambiguity reduction and planning review out of the middle of execution
      
        767
        
        768
        Implementation target:
      
        769
        
        770
        - `clarify` mode writes a task brief
      
        771
        - `plan` mode writes:
      
        772
          - a short implementation plan
      
        773
          - a test/verification plan
      
        774
        
        775
        Do not try to copy every OMX feature immediately. Copy the artifact discipline first.
      
        776
        
        777
        ### Target 6: Build a real permission model
      
        778
        
        779
        Goal:
      
        780
        
        781
        - move from confirmation prompts to policy-based authorization
      
        782
        
        783
        Implementation target:
      
        784
        
        785
        - permission modes:
      
        786
          - `read-only`
      
        787
          - `workspace-write`
      
        788
          - `danger-full-access`
      
        789
        - tool specs declare required permission
      
        790
        - file writes enforce workspace boundaries
      
        791
        - shell commands go through command classification
      
        792
        
        793
        Why:
      
        794
        
        795
        - this is both safety and behavior quality
      
        796
        
        797
        ### Target 7: Harden file and shell tools
      
        798
        
        799
        Goal:
      
        800
        
        801
        - make tool use trustworthy enough for automation
      
        802
        
        803
        Implementation target:
      
        804
        
        805
        - size limits
      
        806
        - binary detection
      
        807
        - symlink/traversal protection
      
        808
        - structured patch/diff return values
      
        809
        - shell command semantics and mutability classification
      
        810
        
        811
        ### Target 8: Add `loader doctor`, `loader status`, and `loader session`
      
        812
        
        813
        Goal:
      
        814
        
        815
        - make Loader operable as a product
      
        816
        
        817
        Implementation target:
      
        818
        
        819
        - backend health
      
        820
        - model capability snapshot
      
        821
        - workspace detection
      
        822
        - write-access detection
      
        823
        - test/build command detection
      
        824
        - active session summary
      
        825
        
        826
        Why:
      
        827
        
        828
        - better operator feedback means less guesswork in the agent loop
      
        829
        
        830
        ### Target 9: Add memory/notepad tools
      
        831
        
        832
        Goal:
      
        833
        
        834
        - give Loader durable short-term and long-term memory
      
        835
        
        836
        Implementation target:
      
        837
        
        838
        - read/write project memory
      
        839
        - append working notes
      
        840
        - store user directives and repo conventions
      
        841
        
        842
        Why:
      
        843
        
        844
        - this reduces re-discovery and improves follow-through across turns
      
        845
        
        846
        ### Target 10: Add a lightweight read-only inspect lane
      
        847
        
        848
        Goal:
      
        849
        
        850
        - avoid using the full agent loop for every lookup
      
        851
        
        852
        Implementation target:
      
        853
        
        854
        - `loader explore` or equivalent internal mode
      
        855
        - optimized for:
      
        856
          - file/symbol lookup
      
        857
          - pattern discovery
      
        858
          - relationship questions
      
        859
        
        860
        Why:
      
        861
        
        862
        - simple tasks should stay cheap and fast
      
        863
        
        864
        ### Target 11: Add a parity harness
      
        865
        
        866
        Goal:
      
        867
        
        868
        - improve behavior intentionally instead of impressionistically
      
        869
        
        870
        Implementation target:
      
        871
        
        872
        - scripted mock backend scenarios for:
      
        873
          - simple read
      
        874
          - multi-tool turn
      
        875
          - denied permission
      
        876
          - write/edit success
      
        877
          - verification-required task
      
        878
          - premature completion rejection
      
        879
          - looped/duplicate action prevention
      
        880
        
        881
        Why:
      
        882
        
        883
        - this is how Loader becomes reliable
      
        884
        
        885
        ### Target 12: Add workflow-aware prompts and capability profiles
      
        886
        
        887
        Goal:
      
        888
        
        889
        - make Loader less brittle across models
      
        890
        
        891
        Implementation target:
      
        892
        
        893
        - replace one generic system prompt with mode-specific prompts
      
        894
        - add provider/model capability profiles:
      
        895
          - native tools
      
        896
          - streaming
      
        897
          - context budget
      
        898
          - preferred tool-call format
      
        899
          - verification strictness
      
        900
        
        901
        Why:
      
        902
        
        903
        - behavior should be shaped by runtime policy, not guessed from model substrings
      
        904
        
        905
        ## Priority order
      
        906
        
        907
        This section was rewritten after a deeper validation pass against the actual code in `refs/claw-code` and `refs/oh-my-codex`, plus firsthand spot-checks of Loader's runtime. The deeper review confirmed every load-bearing claim in this report and surfaced one structural reorder: **the Definition-of-Done work is the user's actual pain point and should land before permission modes**, not after, because permissions are a safety win and DoD is the behavior win.
      
        908
        
        909
        ### P0: Stabilize before changing behavior (Sprint 00)
      
        910
        
        911
        - write a failing regression test for the `tool_call_id` bug at `agent/loop.py:885,906` *first*, before any harness work — it proves the bug is real and proves the harness exists in one move
      
        912
        - scope pytest discovery so `refs/` stops contaminating collection
      
        913
        - exclude `refs/` from ruff and mypy too
      
        914
        - make `uv run pytest` work out of the box
      
        915
        - port the scenario taxonomy from `refs/claw-code/rust/crates/rusty-claude-cli/tests/mock_parity_harness.rs`
      
        916
        - rewrite `README.md` (currently still says "FortranGoingOnForty")
      
        917
        - baseline parity checklist for current runtime behavior
      
        918
        
        919
        ### P1: Replace the loop with a real runtime (Sprint 01)
      
        920
        
        921
        - new `src/loader/runtime/` package with a typed turn engine
      
        922
        - unify the native, ReAct, and "extracted JSON fallback" tool execution paths into one executor
      
        923
        - fix the named bugs from Sprint 00's failing tests (`tool_call_id`, duplicate execution path)
      
        924
        - replace substring-based `NATIVE_TOOL_MODELS`/`NO_TOOL_MODELS` model detection with a `runtime/capabilities.py` profile system — Loader needs to behave consistently across model choices
      
        925
        - structured `TurnSummary` output
      
        926
        
        927
        ### P2: The behavior fix the user actually asked for (Sprint 02)
      
        928
        
        929
        - `DefinitionOfDone` object per task: acceptance criteria, verification commands, evidence summary, pending/completed task items
      
        930
        - explicit verify phase that runs the verification commands and gates completion on evidence
      
        931
        - fix loop: verification failure returns to execution, not to final answer
      
        932
        - minimum `.loader/` directory shape (`.loader/dod/`) — full session/memory layout deferred to Sprint 05
      
        933
        
        934
        This is the highest-leverage behavioral change in the entire plan and is the direct answer to "finishing too early" and "weak follow-through."
      
        935
        
        936
        ### P3: Safety as policy, not as confirmation prompt (Sprint 03)
      
        937
        
        938
        - permission modes: `read-only`, `workspace-write`, `danger-full-access`
      
        939
        - three-event tool lifecycle hooks (`pre_tool_use`, `post_tool_use`, `post_tool_use_failure`) modeled directly on `refs/claw-code/rust/crates/runtime/src/hooks.rs`
      
        940
        - refactor `safeguards.py` (duplicate detection, validation, rollback) into pre-tool hook implementations rather than ad-hoc method calls
      
        941
        - file operation hardening (workspace boundary, symlink, size limits, binary detection, structured patches)
      
        942
        - shell operation hardening
      
        943
        - expose active mode in CLI/TUI status
      
        944
        
        945
        Hooks land alongside permissions because every later sprint hangs new behavior (verification, validation, observability) on the same lifecycle.
      
        946
        
        947
        ### P4: Stop improvising one workflow for everything (Sprint 04)
      
        948
        
        949
        - mode router: clarify, plan, execute, verify (verify already exists from Sprint 02)
      
        950
        - clarify artifact written to `.loader/briefs/`
      
        951
        - planning artifacts (implementation plan + verification plan) written to `.loader/plans/` and fed into the existing DoD object
      
        952
        - tool prerequisites pulled forward from Sprint 06: `TodoWrite` (the "zero pending tasks" gate is empty without it) and `AskUserQuestion` (clarify rounds)
      
        953
        
        954
        ### P5: Durable continuity (Sprint 05)
      
        955
        
        956
        - full `.loader/` state directory under the layout already started in Sprint 02
      
        957
        - session persistence and resume
      
        958
        - transcript compaction with priority-aware summarization (model the design on `refs/claw-code/rust/crates/runtime/src/summary_compression.rs`)
      
        959
        - memory/notepad surfaces
      
        960
        - usage/cost tracking
      
        961
        
        962
        ### P6: Operability and tool-surface expansion (Sprint 06)
      
        963
        
        964
        - `loader doctor`, `loader status`, `loader session`
      
        965
        - read-only explore lane
      
        966
        - broader tool surface (diff/patch-aware editing, git helpers, structured ask-user, etc.) — `TodoWrite` and `AskUserQuestion` already exist from Sprint 04
      
        967
        
        968
        ### Deferred indefinitely
      
        969
        
        970
        - workflow hooks beyond the runtime tool lifecycle (notification/idle nudges, leader monitoring)
      
        971
        - task/team/subagent orchestration
      
        972
        - broad MCP ecosystem
      
        973
        - richer plugin systems
      
        974
        
        975
        These are real wins in `claw-code`/OMX, but Loader should not pursue them until the solo runtime is trustworthy.
      
        976
        
        977
        ## What Loader should copy directly, and what it should not
      
        978
        
        979
        ### Copy directly
      
        980
        
        981
        - typed turn runtime
      
        982
        - permission model
      
        983
        - file/shell hardening
      
        984
        - session persistence
      
        985
        - compaction
      
        986
        - doctor/status/session surfaces
      
        987
        - workflow artifacts
      
        988
        - evidence-backed verification
      
        989
        - parity harness discipline
      
        990
        
        991
        ### Copy in simplified form
      
        992
        
        993
        - deep-interview
      
        994
        - ralplan
      
        995
        - ralph
      
        996
        - memory/notepad
      
        997
        - explore vs full-execution split
      
        998
        
        999
        ### Do not copy blindly yet
      
        1000
        
        1001
        - full tmux/team runtime
      
        1002
        - huge command surface
      
        1003
        - Discord/openclaw notification stack
      
        1004
        - broad MCP ecosystem
      
        1005
        
        1006
        Loader should first become a trustworthy single-agent local runtime. After that, team orchestration will actually help.
      
        1007
        
        1008
        ## Recommended Loader architecture direction
      
        1009
        
        1010
        If we want behavior closer to `claw-code` without losing Loader’s simplicity, I would steer toward:
      
        1011
        
        1012
        ### Layer 1: Runtime core
      
        1013
        
        1014
        - typed `TurnRuntime`
      
        1015
        - `SessionStore`
      
        1016
        - `PermissionPolicy`
      
        1017
        - `ToolExecutor`
      
        1018
        - `VerificationEngine`
      
        1019
        
        1020
        ### Layer 2: Workflow layer
      
        1021
        
        1022
        - `ClarifyWorkflow`
      
        1023
        - `PlanWorkflow`
      
        1024
        - `ExecuteWorkflow`
      
        1025
        - `VerifyWorkflow`
      
        1026
        
        1027
        ### Layer 3: Product surfaces
      
        1028
        
        1029
        - TUI
      
        1030
        - CLI
      
        1031
        - `doctor`
      
        1032
        - `status`
      
        1033
        - `session`
      
        1034
        - `explore`
      
        1035
        
        1036
        ### Layer 4: Optional future orchestration
      
        1037
        
        1038
        - hooks
      
        1039
        - background verification
      
        1040
        - multi-agent/task orchestration
      
        1041
        
        1042
        That is a better fit for Loader than trying to clone all of OMX wholesale.
      
        1043
        
        1044
        ## Immediate conclusions
      
        1045
        
        1046
        1. Loader’s biggest problems are architectural, not just prompt-related.
      
        1047
        2. `claw-code` is strongest where Loader is weakest: runtime contract, permissions, sessions, diagnostics, parity.
      
        1048
        3. OMX is strongest where Loader is currently almost absent: clarification, planning discipline, durable state, completion/verification loops.
      
        1049
        4. The fastest path to “better model behavior today” is not adding more heuristics. It is adding:
      
        1050
           - workflow artifacts
      
        1051
           - explicit verification
      
        1052
           - persistent state
      
        1053
           - a smaller, more trustworthy turn engine
      
        1054
        
        1055
        ## Sprint scaffolding
      
        1056
        
        1057
        After the deeper validation pass the original five-sprint plan was reshaped into seven sprints. The reshape splits the most ambitious sprint (the old Sprint 03, which bundled mode router + clarify + plan + DoD + verify/fix into one) and reorders so the user's actual pain point lands sooner. Sprint scaffolding lives under:
      
        1058
        
        1059
        - `.docs/sprints/index.md`
      
        1060
        - `.docs/sprints/sprint00.md` — Foundation, Measurement, and Parity Harness
      
        1061
        - `.docs/sprints/sprint01.md` — Turn Engine, Tool Contract, and Capability Profiles
      
        1062
        - `.docs/sprints/sprint02.md` — Definition of Done and Verify/Fix Loop
      
        1063
        - `.docs/sprints/sprint03.md` — Permission Modes and Tool Lifecycle Hooks
      
        1064
        - `.docs/sprints/sprint04.md` — Mode Router, Clarify, and Plan Artifacts
      
        1065
        - `.docs/sprints/sprint05.md` — Session State, Memory, and Compaction
      
        1066
        - `.docs/sprints/sprint06.md` — Doctor, Explore, Status, and Tool Surface Expansion
      
        1067
        
        1068
        ## Recommended next move
      
        1069
        
        1070
        Start with Sprint 00, and start Sprint 00 with the failing regression test.
      
        1071
        
        1072
        Reason:
      
        1073
        
        1074
        - Loader needs a measurable baseline and a safer runtime before adding more behavior
      
        1075
        - the `tool_call_id` bug at `agent/loop.py:885,906` is proof that untested code paths are silently broken
      
        1076
        - writing the failing test first proves both the bug and the harness in one move
      
        1077
        - otherwise every feature sprint will be built on unstable agent semantics
      
        1078
        
        1079
        The execution phase should then be:
      
        1080
        
        1081
        1. lock down the runtime and test harness (Sprint 00)
      
        1082
        2. replace the loop with a typed runtime and capability profiles (Sprint 01)
      
        1083
        3. define and enforce the completion contract (Sprint 02)
      
        1084
        4. add the policy-based safety layer with hooks (Sprint 03)
      
        1085
        5. add workflow modes and planning artifacts on top (Sprint 04)
      
        1086
        6. then widen the durability and product surfaces (Sprints 05 and 06)
      
        1087
        
        1088
        ## Plan adjustments after deeper review
      
        1089
        
        1090
        The following changes were applied to the original report after a firsthand validation pass against the actual code in `refs/claw-code` and `refs/oh-my-codex`, plus spot-checks of Loader's runtime.
      
        1091
        
        1092
        ### Verified directly against the code
      
        1093
        
        1094
        - **`tool_call_id` bug confirmed at `src/loader/agent/loop.py:885` and `:906`.** Both call sites construct `Message(role=Role.TOOL, content=..., tool_call_id=tool_call.id)`, but `Message` (`src/loader/llm/base.py:33-39`) has no such field. They live on the duplicate-suppression and pre-validation branches and would crash on first execution. Zero integration coverage.
      
        1095
        - **Pytest discovery is broken by default.** `uv run pytest --collect-only` picks up `refs/claw-code/tests/test_porting_workspace.py` and fails to import `loader` because there is no `tool.pytest.ini_options` block in `pyproject.toml`.
      
        1096
        - **Loop monolith confirmed by line counts.** `agent/loop.py` is 1929 LOC, `agent/reasoning.py` is 1196, `agent/safeguards.py` is 1079 — roughly 4200 lines of orchestration in one cluster.
      
        1097
        - **claw-code's `run_turn()` shape** is exactly as the report describes. Read directly at `refs/claw-code/rust/crates/runtime/src/conversation.rs:295-470`. Typed message build → tool extraction → pre-hook → permission check → execute → post-hook (success or failure variant) → typed `ConversationMessage::tool_result()` → push → repeat. ~175 lines of clean code.
      
        1098
        - **claw-code permission modes** are `ReadOnly` / `WorkspaceWrite` / `DangerFullAccess` (plus `Prompt` and `Allow`), defined at `refs/claw-code/rust/crates/runtime/src/permissions.rs:8-27`. The 10MB read/write caps, binary detection, workspace boundary check, and structured patch outputs in `file_ops.rs` are all real.
      
        1099
        - **claw-code hooks** are `PreToolUse` / `PostToolUse` / `PostToolUseFailure`, defined at `refs/claw-code/rust/crates/runtime/src/hooks.rs:19-34` and wired into the conversation loop at lines 371, 427-453.
      
        1100
        - **OMX skills are real and even more rigorous than the report described.** `ralplan` enforces a max-5-iteration Critic loop with sequential Architect→Critic ordering. `ralph` has explicit phase enums (`starting`/`executing`/`verifying`/`fixing`/`complete`/`failed`/`cancelled`) persisted via `state_write` to `.omx/state/{mode}-state.json`. The verifier in `src/verification/verifier.ts` scales by task size with concrete file-count thresholds.
      
        1101
        
        1102
        ### Corrected facts
      
        1103
        
        1104
        - **Tool count: 49, not 40.** `refs/claw-code/rust/crates/tools/src/lib.rs` exposes 49 `ToolSpec` entries in `mvp_tool_specs()`. Doesn't change the lesson, but worth knowing.
      
        1105
        - **claw-code permissions have a third layer.** Beyond `PermissionMode` and per-tool requirements, `PermissionPolicy` carries three rule lists (`allow_rules`, `deny_rules`, `ask_rules`) for context-specific overrides. Loader can land the mode layer first and defer the rule layer.
      
        1106
        - **claw-code summary compression is sophisticated.** It's not message-level truncation — it's line-level prioritization with deduplication and budget enforcement at `refs/claw-code/rust/crates/runtime/src/summary_compression.rs`. Sprint 05 should model on this rather than reinventing.
      
        1107
        
        1108
        ### Structural plan changes
      
        1109
        
        1110
        - **The old Sprint 03 was split.** It bundled mode router + clarify + plan + DoD + verify/fix into one sprint, which is essentially "ralplan + ralph + verifier" simultaneously. The DoD/verify-fix half became the new Sprint 02 (highest-leverage behavioral fix). The mode router / clarify / plan half became the new Sprint 04.
      
        1111
        - **The old Sprint 02 (permissions) became the new Sprint 03** and was reordered to land *after* DoD. Permissions are a safety win, not a behavior win, and the user's actual complaints are about behavior. DoD lands first.
      
        1112
        - **Hooks landed in the same sprint as permissions.** The original plan split them across sprints; that creates rework because every later runtime addition (verification, observability, validation) wants the same lifecycle. Sprint 03 owns both.
      
        1113
        - **Capability profiles became a Sprint 01 deliverable.** They were Target 12 in the original report and orphaned from the sprint plan. They belong in the runtime layer and are critical for the user's "behave consistently across model choices" goal.
      
        1114
        - **The minimum `.loader/` directory shape moves to Sprint 02** (just `.loader/dod/`). The full session/memory/compaction layout stays in Sprint 05. This unblocks Sprint 02 and Sprint 04 from waiting on Sprint 05.
      
        1115
        - **`TodoWrite` and `AskUserQuestion` move from Sprint 06 to Sprint 04** as prerequisites for the clarify mode and the "zero pending tasks" gate. The broad tool-surface expansion stays in Sprint 06.
      
        1116
        - **Sprint 00's first deliverable is now the failing regression test** for the `tool_call_id` bug, before any harness work. It proves the bug and proves the harness exist in one move.

Loader Deep Dive: Gaps, Strengths, and a Path Toward Claw-Like Behavior

Scope and assumptions

Executive summary

Method

What Loader already does well

1. Loader is small, understandable, and hackable

2. Loader is genuinely local-first

3. Loader already contains the seeds of a better system

4. The TUI is a meaningful strength

Where Loader is weak today

1. Loader’s product surface is not trustworthy yet

2. Loader’s main runtime is too monolithic and too heuristic

3. Loader has a real runtime contract bug in tool-result handling

4. Loader duplicates tool execution logic instead of centralizing it

5. Loader’s system prompt is too shallow and too rigid

6. Loader’s tool surface is too thin

7. Loader’s safety model is primitive

8. Loader’s “definition of done” is heuristic, not contractual

9. Loader has no durable workflow state

10. Loader is too backend-specific and too capability-fragile

11. Loader’s tests are not protecting the real runtime

What claw-code gets right

1. The runtime contract is explicit

2. Session persistence and compaction are first-class

3. Permissions are part of the runtime, not just UI confirmation

4. File and shell operations are engineered, not just exposed

5. Hooks and lifecycle surfaces give the runtime escape valves

6. The project is honest about parity and weaknesses

7. Diagnostics and operator surfaces are part of the product

Where claw-code is still incomplete

What OMX adds that Loader is currently missing almost entirely

1. Clarification is a mode, not an ad hoc question

2. Planning is artifact-based and consensus-based

3. “Done” is a workflow contract in Ralph

4. Workflow state lives outside the prompt

5. Memory and notepad are explicit tools

6. Verification is standardized

7. Doctor / explore / sparkshell reduce prompt waste

8. Follow-through is supported outside the agent context window

Comparison matrix

Why Loader’s current weaknesses produce the behavior you described

Poor tool use

Weak follow-through

Finishing early

Spending too long on simple tasks

Model sensitivity

Concrete implementation targets

Target 1: Introduce a real turn engine

Target 2: Add persistent Loader state under .loader/

Target 3: Separate task modes

Target 4: Replace heuristic completion with an evidence-backed done contract

Target 5: Add deep-interview-lite and ralplan-lite equivalents

Target 6: Build a real permission model

Target 7: Harden file and shell tools

Target 8: Add loader doctor, loader status, and loader session

Target 9: Add memory/notepad tools

Target 10: Add a lightweight read-only inspect lane

Target 11: Add a parity harness

Target 12: Add workflow-aware prompts and capability profiles

Priority order

P0: Stabilize before changing behavior (Sprint 00)

P1: Replace the loop with a real runtime (Sprint 01)

P2: The behavior fix the user actually asked for (Sprint 02)

P3: Safety as policy, not as confirmation prompt (Sprint 03)

P4: Stop improvising one workflow for everything (Sprint 04)

P5: Durable continuity (Sprint 05)

P6: Operability and tool-surface expansion (Sprint 06)

Deferred indefinitely

What Loader should copy directly, and what it should not

Copy directly

Copy in simplified form

Do not copy blindly yet

Recommended Loader architecture direction

Layer 1: Runtime core

Layer 2: Workflow layer

Layer 3: Product surfaces

Layer 4: Optional future orchestration

Immediate conclusions

Sprint scaffolding

Recommended next move

What `claw-code` gets right

Where `claw-code` is still incomplete

Target 2: Add persistent Loader state under `.loader/`

Target 5: Add `deep-interview`-lite and `ralplan`-lite equivalents

Target 8: Add `loader doctor`, `loader status`, and `loader session`