@@ -0,0 +1,177 @@ |
| 1 | +# Sprint 12: Interview Pressure, Semantic Evidence, and Turn Orchestration |
| 2 | + |
| 3 | +## Prerequisites |
| 4 | + |
| 5 | +Sprint 11 |
| 6 | + |
| 7 | +## Goals |
| 8 | + |
| 9 | +Turn Loader's newer workflow structure into a more disciplined execution contract by deepening clarify beyond slot selection, making semantic invalidation rely on richer evidence than text overlap alone, and shrinking the main turn loop into a clearer orchestration shell. |
| 10 | + |
| 11 | +Sprint 11 closed several real gaps. Loader now has typed workflow signals, slot-aware clarify, semantic invalidation, better workflow inspection, and a slimmer coordinator. That is meaningful progress toward claw-code and OMX, but the audit identifies four gaps that remain:
| 12 | + |
| 13 | +- typed workflow signals are still hand-tuned runtime heuristics rather than a deeper ambiguity/evidence model |
| 14 | +- clarify is more intentional now, but it still lacks OMX's pressure-pass discipline, evidence-chasing, and codebase-backed interview style |
| 15 | +- artifact invalidation is broader than file drift, but it still reasons from lightweight text overlap instead of richer structured evidence |
| 16 | +- `conversation.py` is smaller, but it still owns the main assistant/recovery/completion orchestration loop that the reference implementations spread across narrower runtime seams
| 17 | + |
| 18 | +The next leverage point is to stop treating clarify as "ask a better next question" and start treating it as "run a bounded interview with explicit pressure passes, factual grounding, and a stronger handoff contract for later execution." |
| 19 | + |
| 20 | +This sprint is about execution rigor: |
| 21 | + |
| 22 | +- clarify gains pressure-pass behavior instead of only slot-follow-up behavior |
| 23 | +- semantic invalidation uses richer structured evidence and contradiction tracking |
| 24 | +- the main turn loop shrinks again by delegating orchestration checkpoints into dedicated runtime modules |
| 25 | +- Loader gets closer to closed-source agentic tools through stronger workflow contracts rather than more prompt prose
| 26 | + |
| 27 | +The references for this sprint are: |
| 28 | + |
| 29 | +- `refs/claw-code/rust/crates/runtime/src/conversation.rs` |
| 30 | +- `refs/claw-code/rust/crates/runtime/src/policy_engine.rs` |
| 31 | +- `refs/claw-code/rust/crates/runtime/src/prompt.rs` |
| 32 | +- `refs/oh-my-codex/src/ralplan/runtime.ts` |
| 33 | +- `refs/oh-my-codex/src/modes/base.ts` |
| 34 | +- `refs/oh-my-codex/skills/deep-interview/SKILL.md` |
| 35 | +- `refs/oh-my-codex/skills/ralplan/SKILL.md` |
| 36 | + |
| 37 | +## Deliverables |
| 38 | + |
| 39 | +### 1. Pressure-pass clarify controller instead of slot selection alone |
| 40 | + |
| 41 | +Sprint 11 made clarify targeted. Sprint 12 should make it disciplined. |
| 42 | + |
| 43 | +Implementation targets: |
| 44 | + |
| 45 | +- introduce a dedicated clarify controller under `src/loader/runtime/` that tracks: |
| 46 | + - current interview stage |
| 47 | + - weakest clarity dimension |
| 48 | + - whether a pressure pass has occurred |
| 49 | + - whether non-goals and decision boundaries are explicit |
| 50 | + - how much interview budget remains |
| 51 | +- extend clarify reasoning beyond "what slot is unresolved?" to also ask: |
| 52 | + - was the last answer too broad? |
| 53 | + - has this assumption been challenged yet? |
| 54 | + - do we still need an example, counterexample, tradeoff, or explicit stop boundary? |
| 55 | +- persist clarify progress in structured form so later workflow decisions can explain: |
| 56 | + - which dimension clarify was targeting |
| 57 | + - whether Loader was still gathering boundaries |
| 58 | + - whether it stopped because the budget was exhausted or because readiness gates were met |
| 59 | +- keep it bounded and pragmatic: |
| 60 | + - no unbounded interviews |
| 61 | + - no long questionnaires |
| 62 | + - one question at a time with explicit stop conditions |
| 63 | + |
| 64 | +The goal is not to copy OMX wholesale. The goal is to adopt the parts that materially reduce misaligned execution and premature planning. |
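
The tracked state and stop conditions listed above could be sketched roughly as follows. This is a minimal illustration, not the planned implementation: `ClarifyState`, `StopReason`, `next_step`, the returned question labels, the default budget, and the `0.7` readiness threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional, Union


class StopReason(Enum):
    """Why the interview ended (hypothetical labels)."""
    READY = "readiness_gates_met"
    BUDGET_EXHAUSTED = "budget_exhausted"


@dataclass
class ClarifyState:
    """Hypothetical interview state; field names are illustrative only."""
    stage: str = "intent"
    clarity: Dict[str, float] = field(default_factory=dict)  # dimension -> score in [0, 1]
    pressure_pass_done: bool = False
    non_goals_explicit: bool = False
    boundaries_explicit: bool = False
    budget_remaining: int = 4  # assumed default question budget

    def weakest_dimension(self) -> Optional[str]:
        """Return the lowest-scoring clarity dimension, if any are tracked."""
        if not self.clarity:
            return None
        return min(self.clarity, key=lambda name: self.clarity[name])

    def ready(self, threshold: float = 0.7) -> bool:
        """Readiness gates: pressure pass done, boundaries explicit, scores high enough."""
        return (
            self.pressure_pass_done
            and self.non_goals_explicit
            and self.boundaries_explicit
            and all(score >= threshold for score in self.clarity.values())
        )


def next_step(state: ClarifyState) -> Union[StopReason, str]:
    """Return a stop reason, or a label for the single next question to ask."""
    if state.ready():
        return StopReason.READY
    if state.budget_remaining <= 0:
        return StopReason.BUDGET_EXHAUSTED
    state.budget_remaining -= 1  # every question spends budget
    if not state.pressure_pass_done:
        state.pressure_pass_done = True
        return "pressure:challenge_assumption"
    if not state.non_goals_explicit:
        return "ask:non_goals"
    if not state.boundaries_explicit:
        return "ask:decision_boundaries"
    return f"ask:{state.weakest_dimension()}"
```

Because each call returns either one question label or a stop reason, the interview stays one-question-at-a-time, and the returned `StopReason` records whether it ended on budget or on readiness.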
| 65 | + |
| 66 | +### 2. Codebase-backed clarify grounding and stronger requirement artifacts |
| 67 | + |
| 68 | +Sprint 11 still relies mostly on the user's answer plus the task text. Sprint 12 should let clarify lean on facts Loader can gather directly.
| 69 | + |
| 70 | +Implementation targets: |
| 71 | + |
| 72 | +- add a lightweight preflight/context seam for brownfield tasks that can feed clarify with discovered facts before asking the user for repository details |
| 73 | +- prefer evidence-backed clarify questions when Loader already knows something, for example: |
| 74 | + - "I found X in Y. Should this change follow that pattern?" |
| 75 | + - "The current touchpoints appear to be A and B. Should I keep C out of scope?" |
| 76 | +- persist richer clarify artifact metadata where it helps downstream runtime behavior, for example: |
| 77 | + - explicit non-goal status |
| 78 | + - explicit decision-boundary status |
| 79 | + - whether a pressure pass occurred |
| 80 | + - likely touchpoint evidence |
| 81 | + - inferred vs confirmed boundaries |
| 82 | +- keep this grounded in Loader's existing tool surface rather than inventing a large research subsystem |
| 83 | + |
| 84 | +This moves Loader closer to OMX's "reduce user effort and don't ask for facts we can discover" principle. |
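
One way to picture evidence-backed question selection is the sketch below. It assumes a hypothetical `DiscoveredFact` record produced by the preflight seam and a hypothetical `clarify_question` helper; neither name comes from the existing codebase, and the fallback wording is illustrative.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DiscoveredFact:
    """A fact gathered during preflight; this shape is an assumption."""
    pattern: str             # what was found, e.g. "a retry decorator"
    location: str            # where it was found, e.g. "src/net/client.py"
    confirmed: bool = False  # True once the user has confirmed it


def clarify_question(slot: str, facts: List[DiscoveredFact]) -> str:
    """Prefer a question grounded in the first unconfirmed discovered fact."""
    pending: Optional[DiscoveredFact] = next(
        (fact for fact in facts if not fact.confirmed), None
    )
    if pending is not None:
        return (
            f"I found {pending.pattern} in {pending.location}. "
            "Should this change follow that pattern?"
        )
    # Fall back to an open question only when nothing is left to ground it.
    return f"What should the {slot} be for this change?"
```

The `confirmed` flag is what would let the persisted artifact separate inferred boundaries from confirmed ones.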
| 85 | + |
| 86 | +### 3. Structured semantic evidence for invalidation and replan decisions |
| 87 | + |
| 88 | +Sprint 11 improved invalidation, but it still reasons mostly from text coverage. Sprint 12 should give recovery choices a stronger evidence model. |
| 89 | + |
| 90 | +Implementation targets: |
| 91 | + |
| 92 | +- define a structured invalidation/evidence contract under `src/loader/runtime/`, for example around: |
| 93 | + - confirmed touchpoints |
| 94 | + - inferred touchpoints |
| 95 | + - acceptance anchors |
| 96 | + - contradicted assumptions |
| 97 | + - verification contradiction signals |
| 98 | + - changed user boundaries after clarify |
| 99 | +- teach invalidation to distinguish: |
| 100 | + - plan mismatch |
| 101 | + - brief contradiction |
| 102 | + - verification contradiction |
| 103 | + - stale assumptions |
| 104 | +- improve recovery selection so Loader can explain not only what it chose, but what evidence forced that choice |
| 105 | +- preserve "smallest valid recovery move first" as the governing behavior |
| 106 | + |
| 107 | +This is how Loader gets from a "semantic-ish refresh" driven by text overlap to a more trustworthy workflow contract whose recovery choices trace back to explicit evidence.
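
A rough sketch of such an evidence contract follows. The four invalidation kinds come from the list above; their severity ordering, the recovery move names, and the `Evidence` and `select_recovery` identifiers are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Tuple


class InvalidationKind(IntEnum):
    """Ordered by how large a recovery move each kind forces (assumed ordering)."""
    STALE_ASSUMPTION = 1
    PLAN_MISMATCH = 2
    VERIFICATION_CONTRADICTION = 3
    BRIEF_CONTRADICTION = 4


# Hypothetical recovery moves, listed smallest first.
RECOVERY_MOVE = {
    InvalidationKind.STALE_ASSUMPTION: "refresh_assumptions",
    InvalidationKind.PLAN_MISMATCH: "refresh_plan",
    InvalidationKind.VERIFICATION_CONTRADICTION: "replan",
    InvalidationKind.BRIEF_CONTRADICTION: "reenter_clarify",
}


@dataclass(frozen=True)
class Evidence:
    kind: InvalidationKind
    detail: str  # human-readable reason, surfaced in explanations


def select_recovery(evidence: List[Evidence]) -> Tuple[str, List[Evidence]]:
    """Return the smallest move that covers all evidence, plus what forced it."""
    if not evidence:
        return "continue", []
    worst = max(item.kind for item in evidence)
    forcing = [item for item in evidence if item.kind == worst]
    return RECOVERY_MOVE[worst], forcing
```

Returning the forcing evidence alongside the move is what would let Loader explain why a larger recovery step was unavoidable, while milder evidence alone still yields the smaller move.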
| 108 | + |
| 109 | +### 4. Turn orchestration split beyond lane execution |
| 110 | + |
| 111 | +Sprint 11 moved clarify/plan lanes out. Sprint 12 should keep shrinking the top-level turn loop. |
| 112 | + |
| 113 | +Implementation targets: |
| 114 | + |
| 115 | +- extract additional runtime seams under `src/loader/runtime/`, likely around: |
| 116 | + - turn preparation/bootstrap |
| 117 | + - workflow recovery/reentry control |
| 118 | + - completion/continuation orchestration |
| 119 | + - assistant-response repair routing |
| 120 | +- make `ConversationRuntime.run_turn(...)` read more like: |
| 121 | + - initialize turn state |
| 122 | + - prepare workflow contract |
| 123 | + - delegate iteration/orchestration helpers |
| 124 | + - finalize summary |
| 125 | +- avoid creating a new monolith module; prefer narrow orchestration seams with direct tests |
| 126 | + |
| 127 | +A good outcome is that the turn loop becomes easier to reason about and less likely to collect ad hoc behavior again. |
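
The four-step shape described for `ConversationRuntime.run_turn(...)` might look roughly like this. Only the class and method names come from this document; the constructor signature, `TurnState`, the injected helper callables, and the iteration cap are hypothetical and stand in for the extracted runtime seams.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TurnState:
    """Hypothetical per-turn state shared by the orchestration helpers."""
    user_input: str
    events: List[str] = field(default_factory=list)
    done: bool = False


Step = Callable[[TurnState], None]


class ConversationRuntime:
    """Sketch of an orchestration shell; not the real class."""

    def __init__(self, prepare: Step, iterate: Step,
                 finalize: Callable[[TurnState], str],
                 max_iterations: int = 8) -> None:
        self._prepare = prepare
        self._iterate = iterate
        self._finalize = finalize
        self._max_iterations = max_iterations  # assumed cap on loop passes

    def run_turn(self, user_input: str) -> str:
        state = TurnState(user_input)          # initialize turn state
        self._prepare(state)                   # prepare workflow contract
        for _ in range(self._max_iterations):  # delegate iteration helpers
            self._iterate(state)
            if state.done:
                break
        return self._finalize(state)           # finalize summary
```

Injecting each seam as a callable keeps every helper testable on its own, which matches the preference for narrow seams with direct tests over a new monolith.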
| 128 | + |
| 129 | +### 5. Workflow/operator surfaces that explain evidence, not just decisions |
| 130 | + |
| 131 | +Sprint 11 made `loader workflow show` more useful. Sprint 12 should make it explain the evidence behind recovery and clarify pressure more directly. |
| 132 | + |
| 133 | +Implementation targets: |
| 134 | + |
| 135 | +- extend workflow inspection surfaces to show: |
| 136 | + - whether a pressure pass occurred |
| 137 | + - which clarify dimension was active |
| 138 | + - which evidence triggered refresh or reentry |
| 139 | + - which assumptions were still unresolved |
| 140 | +- keep the default UX concise, but expose richer detail when explicitly requested |
| 141 | +- avoid a visual UI in this sprint; prioritize text surfaces that make the runtime easier to debug immediately |
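
As an illustration of "concise by default, richer on request", a text renderer could take a detail switch. The snapshot keys, the `render_workflow` name, and the `verbose` parameter below are hypothetical; this sketch does not describe any existing option of `loader workflow show`.

```python
from typing import Any, Dict


def render_workflow(snapshot: Dict[str, Any], verbose: bool = False) -> str:
    """Render a workflow snapshot as plain text (snapshot keys are assumed)."""
    lines = [
        f"pressure pass: {'yes' if snapshot['pressure_pass'] else 'no'}",
        f"clarify dimension: {snapshot['clarify_dimension']}",
    ]
    if verbose:
        # Evidence and open assumptions appear only when explicitly requested.
        for item in snapshot.get("trigger_evidence", []):
            lines.append(f"evidence: {item}")
        for item in snapshot.get("unresolved_assumptions", []):
            lines.append(f"unresolved: {item}")
    return "\n".join(lines)
```

Keeping the renderer a pure function of a snapshot dictionary would also make the CLI coverage in the testing strategy straightforward to write.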
| 142 | + |
| 143 | +## Testing strategy |
| 144 | + |
| 145 | +- unit coverage for: |
| 146 | + - clarify pressure-pass progression and readiness gates |
| 147 | + - codebase-backed clarify question selection from discovered facts |
| 148 | + - structured invalidation evidence and contradiction handling |
| 149 | + - new orchestration seams preserving current turn behavior |
| 150 | +- CLI coverage for: |
| 151 | + - workflow inspection showing clarify pressure/evidence |
| 152 | + - session/workflow output for contradiction-driven reentry |
| 153 | +- deterministic/runtime coverage for: |
| 154 | + - ambiguous brownfield tasks where Loader asks evidence-backed clarify questions |
| 155 | + - tasks that need an assumption/tradeoff pressure pass before planning |
| 156 | + - verification contradictions that trigger targeted refresh vs full re-plan |
| 157 | + - Sprint 00-11 parity scenarios staying green after the deeper orchestration split |
| 158 | +- regression coverage: |
| 159 | + - clarify should not ask the user for repository facts Loader can gather directly |
| 160 | + - orchestration extraction should not regress the verify/fix or permission/runtime contracts |
| 161 | + |
| 162 | +## Definition of done |
| 163 | + |
| 164 | +- clarify uses a bounded pressure-pass controller rather than slot selection alone |
| 165 | +- brownfield clarify can ask evidence-backed questions from discovered facts |
| 166 | +- invalidation relies on richer structured evidence and contradiction tracking |
| 167 | +- workflow/operator surfaces explain clarify and recovery evidence more directly |
| 168 | +- `conversation.py` is slimmer again, and `ConversationRuntime.run_turn(...)` reads as an orchestration shell
| 169 | +- the full parity baseline remains green after the deeper clarify/orchestration split |
| 170 | + |
| 171 | +## Explicitly out of scope |
| 172 | + |
| 173 | +- full OMX-style consensus planning |
| 174 | +- a visual workflow timeline UI |
| 175 | +- a first-class permission rule editor |
| 176 | +- AST-aware, LSP-aware, or symbol-aware editing |
| 177 | +- multi-agent or team orchestration |