`ea386ab`

Expand runtime parity coverage for Sprint 01

Authored by espadonne 1 month ago

SHA: ea386abc531480cb55c15c5c19f11baeaeec3d5a
Parents: 02dd95a
Tree: a6b6954

2 changed files

Status	File	+	-
M	`.docs/PARITY.md`	15	12
M	`tests/test_runtime_harness.py`	109	0

.docs/PARITY.md modified

 -# Loader Runtime Baseline
 +# Loader Runtime Parity Checkpoint
  Date: 2026-04-06
 -This file is the Sprint 00 baseline for Loader's current runtime behavior. It is intentionally narrow and operational: what the loop can do today, what is flaky, what is out of scope, and what scenarios we now measure with deterministic tests.
 +This file tracks the current deterministic runtime baseline for Loader. It stays intentionally narrow and operational: what the runtime can do today, what remains weak, and what scenarios we measure with repeatable tests.
  ## Supported today
  - confirmation callbacks for destructive `write` and `bash` actions
  - raw JSON fallback when the model emits tool syntax in plain text
  - heuristic completion nudges when the model stops before finishing a simple actionable task
 +- typed `TurnSummary` output for completed turns, including trace events and tool-result messages
 +- unified tool execution for native and extracted tool calls through `runtime.executor.ToolExecutor`
 +- typed tool-result messages backed by `Message.tool_results`
  ## Known weak spots
 -- the main runtime still lives in one large loop at [`src/loader/agent/loop.py`](../src/loader/agent/loop.py)
 -- duplicate suppression and pre-validation still try to construct `Message(..., tool_call_id=...)`, which is a known broken contract until Sprint 01 lands
 -- extracted raw-text tool execution duplicates the main tool execution path
 +- the core turn loop moved into [`src/loader/runtime/conversation.py`](../src/loader/runtime/conversation.py), but it is still much larger and more heuristic-heavy than the reference runtime in `refs/claw-code`
 +- planning, decomposition, and several helper behaviors still live in [`src/loader/agent/loop.py`](../src/loader/agent/loop.py), so ownership is cleaner than Sprint 00 but not fully simplified yet
  - completion is still heuristic, not evidence-backed
  - permissions are confirmation-based, not policy-based
  ## Out of scope in the current baseline
 -- typed turn engine / unified executor
 -- permission modes
 +- permission modes / policy engine
  - persisted sessions / memory / `.loader/` runtime state
  - mode router, clarify, or planning artifacts
  - doctor / status / session product surfaces
  - `bash_confirmation_prompt_denied`: green
  - `raw_json_tool_call_fallback`: green
  - `completion_check_continuation`: green
 -- `tool_result_contract_regression`: intentionally red in Sprint 00
 +- `tool_result_contract_regression`: green
 +- `turn_summary_smoke_for_multi_tool_turn`: green
 +- `native_and_raw_tool_paths_share_executor_trace`: green
  ## Verification snapshot
  As of 2026-04-06:
 -- `uv run pytest`: 70 passed, 1 failed
 -- the single failing test is `tests/test_runtime_harness.py::test_tool_result_contract_regression`
 -- that regression proves both broken branches currently raise `TypeError: Message.__init__() got an unexpected keyword argument 'tool_call_id'`
 +- `uv run pytest -q`: 78 passed
 +- `tests/test_runtime_harness.py` is fully green, including the original contract regression
 +- native and extracted tool calls now record the same executor trace events, with source-specific metadata
  ## Definition of honesty
  - If a scenario is green here, it should have deterministic automated coverage.
  - If a scenario is flaky or broken, it should be called out here before we claim parity work is done.
 -- Sprint 01 should turn the intentional red regression green by fixing the tool-result message contract, not by weakening the test.
 +- Sprint 01 turned the original `tool_call_id` regression green by fixing the message contract, not by weakening the test.
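
The parity notes above say that native and extracted tool calls now go through one executor and record the same trace events, differing only in source metadata. The sketch below illustrates that idea in isolation. It is not Loader's implementation: `SketchExecutor`, `TraceEvent`, the simplified `TurnSummary`, and `extract_raw_call` are hypothetical stand-ins, and only the event names (`assistant.tool_batch`, `tool.received`, `tool.executed`) and the `source` values (`native`, `raw_text`) are taken from the tests in this commit.

```python
from __future__ import annotations

import json
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class TraceEvent:
    # Hypothetical trace record: an event name plus free-form metadata.
    name: str
    data: dict[str, Any] = field(default_factory=dict)


@dataclass
class TurnSummary:
    # Simplified stand-in; the real TurnSummary carries more fields.
    trace: list[TraceEvent] = field(default_factory=list)
    tool_results: list[str] = field(default_factory=list)


class SketchExecutor:
    """Illustrative shared executor; assumed shape, not runtime.executor.ToolExecutor."""

    def __init__(self, tools: dict[str, Callable[..., str]]) -> None:
        self.tools = tools

    def run_batch(
        self, calls: list[dict[str, Any]], source: str, summary: TurnSummary
    ) -> None:
        # Both call paths enter here, so they emit the same event sequence.
        summary.trace.append(TraceEvent("assistant.tool_batch", {"count": len(calls)}))
        for call in calls:
            meta = {"source": source, "tool": call["name"]}
            summary.trace.append(TraceEvent("tool.received", dict(meta)))
            summary.tool_results.append(self.tools[call["name"]](**call["arguments"]))
            summary.trace.append(TraceEvent("tool.executed", dict(meta)))


def extract_raw_call(text: str) -> dict[str, Any] | None:
    """Return a tool call if the whole text is a JSON tool call, else None."""
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        return None
    if (
        isinstance(payload, dict)
        and isinstance(payload.get("name"), str)
        and isinstance(payload.get("arguments"), dict)
    ):
        return payload
    return None


# A fake "read" tool keeps the sketch free of file-system access.
executor = SketchExecutor({"read": lambda file_path: f"contents of {file_path}"})

native = TurnSummary()
executor.run_batch(
    [{"name": "read", "arguments": {"file_path": "native.txt"}}], "native", native
)

raw = TurnSummary()
raw_call = extract_raw_call('{"name": "read", "arguments": {"file_path": "raw.txt"}}')
if raw_call is not None:
    executor.run_batch([raw_call], "raw_text", raw)
```

Under these assumptions, both summaries end up with the same ordered event names and differ only in the `source` field of each event, which is the property `test_native_and_raw_tool_paths_share_executor_trace` checks against the real runtime.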

tests/test_runtime_harness.py modified

      return [event.content for event in run.events if event.type == "tool_result"]
 +def trace_event_names(run) -> list[str]:
 +    """Return recorded runtime trace event names."""
 +
 +    summary = run.agent.last_turn_summary
 +    assert summary is not None
 +    return [event.name for event in summary.trace]
 +
 +
  @pytest.mark.asyncio
  async def test_runtime_parity_manifest_matches_implemented_cases() -> None:
      manifest_names = [entry["name"] for entry in load_manifest()]
      assert "two parity lines" in run.response
 +@pytest.mark.asyncio
 +async def test_turn_summary_smoke_for_multi_tool_turn(temp_dir: Path) -> None:
 +    fixture = temp_dir / "fixture.txt"
 +    fixture.write_text("alpha parity line\nbeta line\ngamma parity line\n")
 +
 +    backend = ScriptedBackend(
 +        completions=[
 +            native_tool_response(
 +                ToolCall(id="read-1", name="read", arguments={"file_path": str(fixture)}),
 +                ToolCall(
 +                    id="grep-1",
 +                    name="grep",
 +                    arguments={"pattern": "parity", "path": str(fixture)},
 +                ),
 +                content="I'll inspect the file and count parity matches.",
 +            ),
 +            final_response("The file has two parity lines, including alpha parity line."),
 +        ]
 +    )
 +
 +    run = await run_scenario(
 +        "Inspect the fixture and find parity lines.",
 +        backend,
 +        config=non_streaming_config(),
 +        project_root=temp_dir,
 +    )
 +
 +    summary = run.agent.last_turn_summary
 +    assert summary is not None
 +    assert summary.final_response == run.response
 +    assert summary.iterations == 2
 +    assert len(summary.assistant_messages) == 2
 +    assert len(summary.tool_result_messages) == 2
 +    assert "assistant.tool_batch" in trace_event_names(run)
 +
 +
  @pytest.mark.asyncio
  async def test_write_file_allowed(temp_dir: Path) -> None:
      target = temp_dir / "allowed.txt"
      assert "Recovered the raw JSON tool call" in run.response
 +@pytest.mark.asyncio
 +async def test_native_and_raw_tool_paths_share_executor_trace(temp_dir: Path) -> None:
 +    native_fixture = temp_dir / "native.txt"
 +    native_fixture.write_text("native parity line\n")
 +    native_backend = ScriptedBackend(
 +        completions=[
 +            native_tool_response(
 +                ToolCall(id="read-1", name="read", arguments={"file_path": str(native_fixture)}),
 +                content="I'll inspect the native tool result.",
 +            ),
 +            final_response("Native read complete."),
 +        ]
 +    )
 +    native_run = await run_scenario(
 +        "Read native.txt.",
 +        native_backend,
 +        config=non_streaming_config(),
 +        project_root=temp_dir,
 +    )
 +
 +    raw_fixture = temp_dir / "raw.txt"
 +    raw_fixture.write_text("raw parity line\n")
 +    raw_json = f'{{"name": "read", "arguments": {{"file_path": "{raw_fixture}"}}}}'
 +    raw_backend = ScriptedBackend(
 +        streams=[
 +            [
 +                StreamChunk(content=raw_json[:20], is_done=False),
 +                StreamChunk(content=raw_json[20:], full_content=raw_json, is_done=True),
 +            ],
 +            [
 +                StreamChunk(
 +                    content="Raw read complete.",
 +                    full_content="Raw read complete.",
 +                    is_done=True,
 +                )
 +            ],
 +        ]
 +    )
 +    raw_run = await run_scenario(
 +        "Read raw.txt.",
 +        raw_backend,
 +        config=AgentConfig(auto_context=False, max_iterations=8),
 +        project_root=temp_dir,
 +    )
 +
 +    for run in (native_run, raw_run):
 +        names = trace_event_names(run)
 +        assert "assistant.tool_batch" in names
 +        assert "tool.received" in names
 +        assert "tool.executed" in names
 +
 +    native_summary = native_run.agent.last_turn_summary
 +    raw_summary = raw_run.agent.last_turn_summary
 +    assert native_summary is not None
 +    assert raw_summary is not None
 +    assert any(
 +        event.name == "tool.received" and event.data["source"] == "native"
 +        for event in native_summary.trace
 +    )
 +    assert any(
 +        event.name == "tool.received" and event.data["source"] == "raw_text"
 +        for event in raw_summary.trace
 +    )
 +
 +
  @pytest.mark.asyncio
  async def test_completion_check_continuation(
      temp_dir: Path,