Wire performance closeout gate

Status	File	+	-
M	`.docs/sprints/sprint28.md`	31	12
M	`.github/workflows/parity-matrix.yml`	6	0
M	`tests/perf_baseline.rs`	76	6

.docs/sprints/sprint28.md (modified)

  Sprint 27 — correctness gate in place; can freely refactor for speed.
  ## Goals
--Make afs-ld fast enough to feel like a production tool. Target: within 2× of Apple `ld`'s wall time on the fortsh link. Mold demonstrates linkers can be very fast; we don't need mold's speed, but we need to not be painful.
++Make afs-ld fast enough to feel like a production tool. Sprint 28 establishes
++the profiling surface, parallelizes the obvious hot paths, and enforces
++hello/runtime-link budgets in CI. The fortsh 2× Apple `ld` gate remains the
++production target, but Sprint 29 owns the fortsh fixture and final comparison.
++Mold demonstrates linkers can be very fast; we don't need mold's speed, but we
++need to not be painful.
  ## Deliverables
  ### 1. Baseline profile
--Profile the fortsh link (Sprint 29 produces the fixture). Categorize wall time:
++Profile representative hello-world and runtime-archive links in Sprint 28.
++Sprint 29 extends the same profile surface to the fortsh link once the fixture
++exists. Categorize wall time:
  - Input parsing (Mach-O headers, sections, symbols, relocations).
  - Symbol resolution (hash-map probes, archive lookups).
  ### 5. Bump allocator for ephemeral data
--Parser produces many small allocations (strings, reloc lists, atom descriptors). A per-input arena avoids fragmentation and makes bulk drop free. Implement as `src/arena.rs` — a std-only `Vec<Box<[u8]>>` chunker.
++Deferred. The current Sprint 28 profile work did not prove allocation churn is
++the next limiting bucket after the parallel parsing/relocation/signature and
++string-table clone fixes. If Sprint 29's fortsh profile shows parser allocation
++pressure, implement `src/arena.rs` as a std-only `Vec<Box<[u8]>>` chunker.
  ### 6. mmap for large inputs
--`std::fs::File` + `memmap2`? No — memmap2 is an external crate. Use `libc::mmap` via an unsafe `src/mmap.rs` wrapper. Input files are always read-only; mmap saves a read syscall and lets us share parse state across threads cheaply. Fall back to `fs::read` for GNU-thin archive members whose external path doesn't mmap cleanly (rare).
++Deferred. Object/archive loading still uses `fs::read`; this keeps the Sprint 28
++closeout safe and std-only. If fortsh-sized inputs show file-read overhead as a
++real bucket in Sprint 29, add an unsafe `src/mmap.rs` wrapper and keep a
++`fs::read` fallback for archive members whose external path cannot be mapped.
  ### 7. Symbol-table hash map
  ### 8. String interner
--Single global `StringInterner` shared across inputs. Interning cost: one hash lookup per name. Optimize by batching per-input: each input parses its strings into a local table, then merges into the global interner in one pass.
++Deferred. Sprint 28 made the global string table thread-shareable and removed
++the cloned string-table offset map during output writing. Per-input local
++interners remain a candidate if Sprint 29 identifies symbol seeding as a
++fortsh-scale bottleneck.
  ### 9. No-alloc hot paths
  ### 10. Benchmarks
--`afs-ld/bench/` (or a `#[bench]` behind `cargo +nightly bench`) with:
--- `bench_hello_world`: small, measures startup overhead.
--- `bench_runtime_link`: mid, measures symbol-table & reloc-apply.
--- `bench_fortsh_link`: large, measures end-to-end throughput.
++Sprint 28 uses CI-enforced integration benchmarks in `tests/perf_baseline.rs`:
++
++- `bench_hello_world_profile_reports_baseline_timings`: small, measures startup overhead.
++- `bench_runtime_link_profile_reports_baseline_timings`: mid, measures symbol-table, archive parsing, and reloc-apply.
++- `bench_fortsh_link`: deferred to Sprint 29 with the real fortsh fixture.
  Budget targets:
  - hello-world: ≤ 20 ms.
  - runtime link: ≤ 150 ms.
--- fortsh link: ≤ 2× Apple `ld`'s wall time on the same machine.
++- fortsh link: ≤ 2× Apple `ld`'s wall time on the same machine, enforced in Sprint 29.
  ### 11. Determinism preserved
  ## Testing Strategy
--- Benchmarks land as regression gates: nightly CI records throughput; > 10% regression fails.
++- Benchmark gate: CI runs `tests/perf_baseline.rs` with hello/runtime budgets on every push and PR.
++- Nightly throughput recording and a relative >10% regression gate are deferred until the fortsh fixture lands in Sprint 29.
  - Determinism: 100 parallel runs of the same input, assert byte-identical output every time.
  - Sprint 27 parity must remain green — no correctness regression.
  - Single-threaded fallback (`-j 1`) for debugging.
  ## Definition of Done
--- fortsh link wall time within 2× of `ld`'s.
++- hello/runtime performance budgets are enforced in CI.
++- fortsh 2× comparison is explicitly handed to Sprint 29 with its fixture.
  - All Sprint 27 scenarios still byte-identical.
  - Determinism bulletproof across parallelism.
  - No external dependencies added.
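
Deliverable 5 above defers `src/arena.rs` but describes it as a std-only `Vec<Box<[u8]>>` chunker. The following is a minimal sketch of that idea, not code from this commit. The names `Arena`, `Span`, `alloc_bytes`, `get`, and `chunk_count` are hypothetical, and the sketch hands out index-based handles rather than references so that it stays in safe Rust.

```rust
/// Hypothetical handle to bytes stored in the arena (not part of afs-ld).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Span {
    chunk: usize,
    start: usize,
    len: usize,
}

/// Hypothetical std-only bump arena backed by `Vec<Box<[u8]>>`.
/// Dropping the arena frees every chunk at once.
pub struct Arena {
    chunks: Vec<Box<[u8]>>,
    used: usize, // bytes already handed out from the last chunk
    chunk_size: usize,
}

impl Arena {
    pub fn new(chunk_size: usize) -> Self {
        Arena {
            chunks: Vec::new(),
            used: 0,
            chunk_size: chunk_size.max(1),
        }
    }

    /// Copies `bytes` into the current chunk, starting a new chunk when
    /// the remaining space is too small. Oversized requests get a chunk
    /// of exactly their own length.
    pub fn alloc_bytes(&mut self, bytes: &[u8]) -> Span {
        let need = bytes.len();
        let fits = match self.chunks.last() {
            Some(chunk) => chunk.len() - self.used >= need,
            None => false,
        };
        if !fits {
            let size = self.chunk_size.max(need);
            self.chunks.push(vec![0u8; size].into_boxed_slice());
            self.used = 0;
        }
        let chunk = self.chunks.len() - 1;
        let start = self.used;
        self.chunks[chunk][start..start + need].copy_from_slice(bytes);
        self.used += need;
        Span { chunk, start, len: need }
    }

    /// Resolves a handle back to the stored bytes.
    pub fn get(&self, span: Span) -> &[u8] {
        &self.chunks[span.chunk][span.start..span.start + span.len]
    }

    /// Number of chunks allocated so far.
    pub fn chunk_count(&self) -> usize {
        self.chunks.len()
    }
}
```

A per-input parser could call `alloc_bytes` for each symbol name and keep only the `Span` handles. Whether this beats the current allocation pattern is exactly what the Sprint 29 fortsh profile is meant to decide.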

.github/workflows/parity-matrix.yml (modified)

        - name: Run determinism gate
          run: cargo test --test determinism -- --nocapture
++      - name: Run performance budget gate
++        env:
++          AFS_LD_HELLO_BUDGET_MS: "20"
++          AFS_LD_RUNTIME_BUDGET_MS: "150"
++        run: cargo test --test perf_baseline -- --nocapture
++
        - name: Run parity matrix
          env:
            PARITY_MATRIX_ARTIFACT_DIR: ${{ github.workspace }}/parity-matrix-artifacts
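
The "Run performance budget gate" step passes `AFS_LD_HELLO_BUDGET_MS` and `AFS_LD_RUNTIME_BUDGET_MS` to the test binary. The diff does not show how `tests/perf_baseline.rs` consumes them, so the following is only a sketch of one plausible approach. The helper names `parse_budget_ms`, `budget_from_env`, and `check_budget` are hypothetical, and treating an unset variable as "no budget enforced" is an assumption.

```rust
use std::time::Duration;

/// Hypothetical helper: turn a raw millisecond string into a budget.
/// `None` means the variable was unset, which this sketch treats as
/// "no budget enforced".
fn parse_budget_ms(raw: Option<&str>) -> Result<Option<Duration>, String> {
    match raw {
        None => Ok(None),
        Some(s) => s
            .trim()
            .parse::<u64>()
            .map(|ms| Some(Duration::from_millis(ms)))
            .map_err(|e| format!("invalid budget {s:?}: {e}")),
    }
}

/// Hypothetical helper: read a budget from an environment variable
/// such as AFS_LD_HELLO_BUDGET_MS.
fn budget_from_env(var: &str) -> Result<Option<Duration>, String> {
    parse_budget_ms(std::env::var(var).ok().as_deref())
}

/// Hypothetical helper: fail only when a budget is set and exceeded.
fn check_budget(label: &str, elapsed: Duration, budget: Option<Duration>) -> Result<(), String> {
    match budget {
        Some(limit) if elapsed > limit => {
            Err(format!("{label}: {elapsed:?} exceeds budget {limit:?}"))
        }
        _ => Ok(()),
    }
}
```

Under these assumptions, a test would measure the link's wall time, call `budget_from_env("AFS_LD_HELLO_BUDGET_MS")`, and assert that `check_budget` returns `Ok`. Local runs without the variable would still report timings without failing.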

tests/perf_baseline.rs (modified)

++use std::fs;
  use std::path::{Path, PathBuf};
++use std::process::Command;
  use std::time::Duration;
  mod common;
  use afs_ld::{LinkOptions, LinkProfile, Linker};
--use common::harness::{assemble, have_xcrun, have_xcrun_tool, scratch, sdk_path, sdk_version};
++use common::harness::{
++    assemble, have_tool, have_xcrun, have_xcrun_tool, scratch, sdk_path, sdk_version,
++};
  fn find_runtime_archive() -> Option<PathBuf> {
      let workspace = Path::new(env!("CARGO_MANIFEST_DIR")).join("..");
      None
+ }
++fn runtime_archive_fixture() -> Result<PathBuf, String> {
++    if let Some(runtime) = find_runtime_archive() {
++        return Ok(runtime);
++    }
++    build_synthetic_runtime_archive()
++}
++
++fn build_synthetic_runtime_archive() -> Result<PathBuf, String> {
++    if !have_tool("libtool") {
++        return Err("libtool unavailable".into());
++    }
++
++    let members = [
++        ("init", "_afs_program_init"),
++        ("finalize", "_afs_program_finalize"),
++        ("write_i32", "_afs_write_i32"),
++        ("write_f64", "_afs_write_f64"),
++        ("write_newline", "_afs_write_newline"),
++        ("read_i32", "_afs_read_i32"),
++        ("alloc", "_afs_alloc"),
++        ("dealloc", "_afs_dealloc"),
++        ("bounds_check", "_afs_bounds_check"),
++        ("stop", "_afs_stop"),
++        ("date_and_time", "_afs_date_and_time"),
++        ("cpu_time", "_afs_cpu_time"),
++        ("random_seed", "_afs_random_seed"),
++        ("random_number", "_afs_random_number"),
++        ("open_unit", "_afs_open_unit"),
++        ("close_unit", "_afs_close_unit"),
++    ];
++    let mut objects = Vec::with_capacity(members.len());
++    for (stem, symbol) in members {
++        let obj = scratch(&format!("perf-runtime-{stem}.o"));
++        let src = format!(
++            "\
++            .text\n\
++            .globl {symbol}\n\
++            .p2align 2\n\
++            {symbol}:\n\
++                ret\n\
++            .subsections_via_symbols\n",
++        );
++        assemble(&src, &obj)?;
++        objects.push(obj);
++    }
++
++    let archive = scratch("libafs-perf-runtime.a");
++    let _ = fs::remove_file(&archive);
++    let output = Command::new("libtool")
++        .args(["-static", "-o"])
++        .arg(&archive)
++        .args(&objects)
++        .output()
++        .map_err(|e| format!("spawn libtool archive: {e}"))?;
++    if !output.status.success() {
++        return Err(format!(
++            "libtool archive failed: {}",
++            String::from_utf8_lossy(&output.stderr)
++        ));
++    }
++    Ok(archive)
++}
++
  fn executable_opts(inputs: Vec<PathBuf>, output: PathBuf) -> LinkOptions {
      LinkOptions {
          inputs,
+ }
  #[test]
--fn hello_world_profile_reports_baseline_timings() {
++fn bench_hello_world_profile_reports_baseline_timings() {
      if !have_xcrun() || !have_xcrun_tool("ld") {
          eprintln!("skipping: xcrun as/ld unavailable");
          return;
+ }
  #[test]
--fn runtime_link_profile_reports_baseline_timings() {
++fn bench_runtime_link_profile_reports_baseline_timings() {
      if !have_xcrun() || !have_xcrun_tool("ld") {
          eprintln!("skipping: xcrun as/ld unavailable");
          return;
+     }
--    let Some(runtime) = find_runtime_archive() else {
--        eprintln!("skipping: libarmfortas_rt.a not built");
--        return;
++    let runtime = match runtime_archive_fixture() {
++        Ok(runtime) => runtime,
++        Err(reason) => {
++            eprintln!("skipping: {reason}");
++            return;
++        }
      };
      let obj = scratch("perf-runtime.o");

fortrangoingonforty/afs-ld / `73c9b05`

3 changed files