@@ -0,0 +1,159 @@ |
| 1 | +# Sprint 21: Evidence Provenance, Read-Model Cleanup, and Runtime-First API |
| 2 | + |
| 3 | +## Prerequisites |
| 4 | + |
| 5 | +Sprint 20 |
| 6 | + |
| 7 | +## Goals |
| 8 | + |
| 9 | +Take the next honest step after Sprint 20: move Loader's completion and verification story from "better heuristics with canonical policy events" toward stronger evidence provenance, reduce compatibility read-model duplication where the canonical workflow timeline already carries the truth, and begin narrowing internal callers toward a runtime-first API instead of treating `Agent` as the only natural seam. |
| 10 | + |
| 11 | +Sprint 20 changed the shape of the remaining debt. The first three points below are gains; the last three are the gaps that remain: |
| 12 | + |
| 13 | +- the workflow timeline is now the canonical policy/accountability artifact, including live completion-trace projection |
| 14 | +- follow-through checks now use stronger DoD/runtime evidence instead of only textual heuristics |
| 15 | +- the remaining `Agent` shell is explicitly documented and guarded as a public facade |
| 16 | +- but completion/verification evidence is still mostly flattened into human-readable strings rather than typed provenance |
| 17 | +- compatibility/read-model surfaces still rely on a few projections that are honest but not yet minimal |
| 18 | +- internal callers still treat `Agent` as the default runtime entry seam even though the shell is now explicitly a compatibility/public facade |
| 19 | + |
| 20 | +Sprint 21 should keep using the references as architectural guardrails, not as a feature-copy list. |
| 21 | + |
| 22 | +The standard remains: |
| 23 | + |
| 24 | +- use claw-code to sharpen canonical event ownership, green-contract discipline, and runtime-first seams |
| 25 | +- use OMX to sharpen verifier/accountability provenance and evidence-backed follow-through |
| 26 | +- do not add work just because the refs have it |
| 27 | +- do add work when the refs show that Loader is still too stringly-typed, too duplicative, or too dependent on a compatibility shell |
| 28 | + |
| 29 | +`audit.txt` remains a guardrail against wrapper-heavy drift and soft rescue behavior. It is not the factual roadmap. |
| 30 | + |
| 31 | +The references for this sprint are: |
| 32 | + |
| 33 | +- `refs/claw-code/rust/crates/runtime/src/policy_engine.rs` |
| 34 | +- `refs/claw-code/rust/crates/runtime/src/green_contract.rs` |
| 35 | +- `refs/claw-code/rust/crates/runtime/src/lane_events.rs` |
| 36 | +- `refs/claw-code/rust/crates/runtime/src/session_control.rs` |
| 37 | +- `refs/claw-code/rust/crates/runtime/src/bootstrap.rs` |
| 38 | +- `refs/claw-code/rust/crates/runtime/src/conversation.rs` |
| 39 | +- `refs/claw-code/PARITY.md` |
| 40 | +- `refs/oh-my-codex/src/verification/verifier.ts` |
| 41 | +- `refs/oh-my-codex/src/autoresearch/contracts.ts` |
| 42 | +- `refs/oh-my-codex/src/autoresearch/runtime.ts` |
| 43 | +- `refs/oh-my-codex/src/hooks/session.ts` |
| 44 | +- `refs/oh-my-codex/src/hooks/prompt-guidance-contract.ts` |
| 45 | +- `.docs/PARITY.md` |
| 46 | +- `.docs/audit.txt` |
| 47 | +- `.docs/audit_sprints/trunk_sitrep.md` |
| 48 | +- `.docs/sprints/sprint20.md` |
| 49 | + |
| 50 | +## Deliverables |
| 51 | + |
| 52 | +### 1. Introduce typed evidence provenance for completion and verification |
| 53 | + |
| 54 | +Sprint 20 strengthened follow-through, but most of the contract still collapses into free-text evidence summaries too early. |
| 55 | + |
| 56 | +Implementation targets: |
| 57 | + |
| 58 | +- inventory where completion/verification evidence is currently flattened into strings across: |
| 59 | + - `src/loader/runtime/task_completion.py` |
| 60 | + - `src/loader/runtime/completion_trace.py` |
| 61 | + - `src/loader/runtime/policy_timeline.py` |
| 62 | + - `src/loader/runtime/finalization.py` |
| 63 | + - `src/loader/runtime/dod.py` |
| 64 | + - `src/loader/runtime/workflow_policy.py` |
| 65 | +- define a small typed provenance model that can represent things like: |
| 66 | + - verification command ran and passed/failed |
| 67 | + - verification command was still missing |
| 68 | + - tracked work item remained incomplete |
| 69 | + - artifact/touchpoint evidence existed or was contradicted |
| 70 | + - claimed runtime outcome was backed by observed output |
| 71 | +- prefer structured provenance that can still be rendered into human-readable summaries, instead of making strings the primary contract |
| 72 | +- thread that provenance through completion policy and canonical policy events where it materially improves honesty or inspectability |
| 73 | + |
| 74 | +The goal is not to build a fake theorem prover. The goal is to stop throwing away runtime evidence structure too early. |
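
A minimal sketch of what such a typed provenance model could look like, covering the evidence cases listed above. Every name here (`EvidenceKind`, `EvidenceStatus`, `EvidenceRecord`, `render_summary`) is hypothetical and does not come from Loader's code; the point is only that the structured record is the contract and the string is a rendering of it.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class EvidenceKind(Enum):
    # Hypothetical categories mirroring the bullet list above.
    VERIFICATION_COMMAND = "verification_command"
    TRACKED_WORK_ITEM = "tracked_work_item"
    ARTIFACT = "artifact"
    RUNTIME_OUTCOME = "runtime_outcome"


class EvidenceStatus(Enum):
    SATISFIED = "satisfied"        # ran and passed, or claim backed by output
    FAILED = "failed"              # ran and failed
    MISSING = "missing"            # never ran, or item still incomplete
    CONTRADICTED = "contradicted"  # observed evidence disagrees with the claim


@dataclass(frozen=True)
class EvidenceRecord:
    kind: EvidenceKind
    status: EvidenceStatus
    subject: str      # command line, work item id, or artifact path
    detail: str = ""  # optional observed output or reason

    def render(self) -> str:
        """Render one record as a human-readable line."""
        text = f"{self.kind.value}[{self.status.value}]: {self.subject}"
        return f"{text} ({self.detail})" if self.detail else text


def render_summary(records: Iterable[EvidenceRecord]) -> str:
    """Join rendered records; the string is derived, never the source of truth."""
    return "; ".join(record.render() for record in records)
```

With a shape like this, completion policy could branch on `status` directly, and existing summary strings could be produced by `render_summary` at the edge.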
| 75 | + |
| 76 | +### 2. Reduce read-model duplication around the canonical workflow timeline |
| 77 | + |
| 78 | +Sprint 20 made the workflow timeline canonical, but a few read models are still kept in sync by hand rather than derived from it. |
| 79 | + |
| 80 | +Implementation targets: |
| 81 | + |
| 82 | +- inventory where compatibility/read-model projections still depend on direct mutation or duplicated logic across: |
| 83 | + - `src/loader/runtime/completion_trace.py` |
| 84 | + - `src/loader/runtime/session.py` |
| 85 | + - `src/loader/runtime/inspection.py` |
| 86 | + - `src/loader/runtime/events.py` |
| 87 | + - any nearby status/session helper that reconstructs policy state manually |
| 88 | +- make sure projections like completion traces and latest-policy summaries are clearly derivations from canonical policy events instead of semi-independent contracts |
| 89 | +- remove any remaining direct writes or state bookkeeping that are only there to keep parallel policy read models in sync |
| 90 | +- keep compact operator-facing read models where they help, but make their derived nature explicit in code and tests |
| 91 | + |
| 92 | +The goal is one canonical truth plus honest projections, not a forest of near-duplicates. |
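
One way to make the derived nature explicit is to write each projection as a pure function over the canonical events, with no state of its own. The sketch below assumes a simplified event shape; `PolicyEvent`, its fields, and the `"completion_check"` kind are illustrative placeholders, not Loader's actual schema.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class PolicyEvent:
    # Hypothetical, simplified shape of a canonical timeline event.
    sequence: int
    kind: str      # e.g. "completion_check" (assumed name)
    decision: str  # e.g. "continue" or "stop"
    summary: str


def project_completion_trace(events: Sequence[PolicyEvent]) -> List[PolicyEvent]:
    """Completion trace = the completion-check events, in timeline order."""
    checks = [event for event in events if event.kind == "completion_check"]
    return sorted(checks, key=lambda event: event.sequence)


def latest_policy_summary(events: Sequence[PolicyEvent]) -> Optional[str]:
    """Latest-policy summary = summary of the highest-sequence event, if any."""
    if not events:
        return None
    return max(events, key=lambda event: event.sequence).summary
```

Because neither function writes anywhere, a regression test can rebuild both projections from a recorded timeline and compare them against whatever the session or status helpers report.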
| 93 | + |
| 94 | +### 3. Start the runtime-first internal API transition below the public `Agent` facade |
| 95 | + |
| 96 | +Sprint 20 settled `Agent` as the public compatibility shell. Sprint 21 should stop using that shell as the default internal seam where it no longer needs to be. |
| 97 | + |
| 98 | +Implementation targets: |
| 99 | + |
| 100 | +- inventory current internal call sites that still instantiate or consume `Agent` when a runtime-first seam would be cleaner, especially in: |
| 101 | + - launcher/bootstrap helpers |
| 102 | + - CLI/TUI integration code |
| 103 | + - tests that are really exercising runtime behavior rather than public compatibility |
| 104 | +- define a small runtime-first entry contract for internal consumers where it clearly reduces shell coupling |
| 105 | +- keep `Agent` as the public compatibility surface, but begin migrating internal runtime-oriented callers away from assuming that `Agent` is the only valid execution owner |
| 106 | +- document what remains intentionally public-shell-only versus what is now runtime-first |
| 107 | + |
| 108 | +The goal is not to delete `Agent`. The goal is to make `Agent` clearly public/compatibility-facing while runtime internals use runtime-first seams by default. |
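
As a rough illustration of a runtime-first entry contract, the sketch below uses a structural `Protocol` so that internal callers depend on the contract while a facade delegates to it. `RuntimeEntry`, `run_turn`, `EchoRuntime`, `AgentFacadeSketch`, and `launch` are all invented for this example and say nothing about how Loader's real `Agent` or runtime are shaped.

```python
from dataclasses import dataclass
from typing import Protocol


class RuntimeEntry(Protocol):
    """Hypothetical minimal contract an internal caller needs from the runtime."""

    def run_turn(self, prompt: str) -> str: ...


@dataclass
class EchoRuntime:
    """Toy runtime used only to make the sketch executable."""

    prefix: str = "ran"

    def run_turn(self, prompt: str) -> str:
        return f"{self.prefix}: {prompt}"


class AgentFacadeSketch:
    """Stand-in for a public compatibility shell that delegates to the runtime."""

    def __init__(self, runtime: RuntimeEntry) -> None:
        self._runtime = runtime

    def run(self, prompt: str) -> str:
        return self._runtime.run_turn(prompt)


def launch(runtime: RuntimeEntry, prompt: str) -> str:
    """Internal caller (e.g. a launcher helper) that never touches the facade."""
    return runtime.run_turn(prompt)
```

Under this arrangement the facade and the internal caller produce the same result from the same runtime, which is the property a migration test would want to pin down before moving call sites.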
| 109 | + |
| 110 | +### 4. Sharpen operator visibility for evidence-backed stop/continue decisions |
| 111 | + |
| 112 | +Sprint 20 improved policy summaries, but the evidence itself is still only partially visible. |
| 113 | + |
| 114 | +Implementation targets: |
| 115 | + |
| 116 | +- improve the existing operator views so users can answer: |
| 117 | + - what exact evidence was missing when Loader stopped? |
| 118 | + - what exact evidence satisfied the completion contract? |
| 119 | + - which policy event carried that evidence? |
| 120 | +- prefer improving: |
| 121 | + - `loader workflow show` |
| 122 | + - `loader session show` |
| 123 | + - `loader status` |
| 124 | + over inventing a new command unless a new surface is clearly cleaner |
| 125 | +- add concise rollups first, and expose deeper provenance only where it materially helps post-mortem inspection |
| 126 | + |
| 127 | +The goal is to make Loader easier to audit after the fact, not simply more verbose. |
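
A concise rollup for the existing views might look like the sketch below: one header line with counts, then one line per piece of evidence naming the policy event that carried it. The `EvidenceLine` shape, the status strings, and the output format are assumptions made for illustration, not the current output of any `loader` command.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class EvidenceLine:
    # Hypothetical flattened view of one piece of evidence on a policy event.
    event_id: str
    status: str   # "satisfied" or "missing" in this sketch
    subject: str


def rollup(lines: Sequence[EvidenceLine]) -> str:
    """Build a compact, operator-facing summary of stop/continue evidence."""
    satisfied = sum(1 for line in lines if line.status == "satisfied")
    missing = sum(1 for line in lines if line.status == "missing")
    out = [f"evidence: {satisfied} satisfied, {missing} missing"]
    for line in lines:
        # Each detail line answers "what evidence" and "which event carried it".
        out.append(f"  {line.status}: {line.subject} (event {line.event_id})")
    return "\n".join(out)
```

The header alone could serve `loader status`, while the per-evidence lines fit the deeper `workflow show` and `session show` views.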
| 128 | + |
| 129 | +## Testing strategy |
| 130 | + |
| 131 | +- unit coverage for: |
| 132 | + - typed evidence-provenance normalization/rendering |
| 133 | + - derived read-model projections from the canonical workflow timeline |
| 134 | + - any new runtime-first internal entry contract below `Agent` |
| 135 | +- runtime coverage for: |
| 136 | + - honest finalization with explicit evidence provenance when completion still fails |
| 137 | + - successful completion paths that now surface structured proof instead of only summary strings |
| 138 | + - status/session/workflow inspection of the evidence-backed policy story |
| 139 | +- regression coverage for: |
| 140 | + - no drift back toward peer policy artifacts beside the canonical workflow timeline |
| 141 | + - no drift back toward `Agent` as the default internal seam when a runtime-first contract exists |
| 142 | + - no loss of the current compact operator read models while provenance becomes richer |
| 143 | + |
| 144 | +## Definition of done |
| 145 | + |
| 146 | +- Loader preserves one canonical policy/accountability artifact while making evidence provenance more structured |
| 147 | +- completion/verification evidence is less stringly-typed and more inspectable without weakening honesty |
| 148 | +- internal runtime-oriented code has at least one cleaner runtime-first seam below the public `Agent` facade |
| 149 | +- existing status/session/workflow surfaces answer stop/continue questions with clearer evidence context |
| 150 | +- Sprint 20's canonical-policy and facade-settlement gains remain green |
| 151 | + |
| 152 | +## Explicitly out of scope |
| 153 | + |
| 154 | +- full claw-code policy-engine parity |
| 155 | +- model-authored verifier narratives as a mandatory dependency |
| 156 | +- multi-agent or team orchestration |
| 157 | +- AST-aware semantic diffs |
| 158 | +- a broad visual workflow UI |
| 159 | +- rich permission-rule editing UX |