`73c9b05`

Wire performance closeout gate

Authored by mfwolffe <wolffemf@dukes.jmu.edu> 1 week ago

SHA: 73c9b0500099e8b8e22e388a00f5afb246561cb5
Parents: 43a662a
Tree: ac3de7e

3 changed files

Status	File	+	-
M	`.docs/sprints/sprint28.md`	31	12
M	`.github/workflows/parity-matrix.yml`	6	0
M	`tests/perf_baseline.rs`	76	6

.docs/sprints/sprint28.mdmodified

  Sprint 27 — correctness gate in place; can freely refactor for speed.
  ## Goals
 -Make afs-ld fast enough to feel like a production tool. Target: within 2× of Apple `ld`'s wall time on the fortsh link. Mold demonstrates linkers can be very fast; we don't need mold's speed, but we need to not be painful.
 +Make afs-ld fast enough to feel like a production tool. Sprint 28 establishes
 +the profiling surface, parallelizes the obvious hot paths, and enforces
 +hello/runtime-link budgets in CI. The fortsh 2× Apple `ld` gate remains the
 +production target, but Sprint 29 owns the fortsh fixture and final comparison.
 +Mold demonstrates linkers can be very fast; we don't need mold's speed, but we
 +need to not be painful.
  ## Deliverables
  ### 1. Baseline profile
 -Profile the fortsh link (Sprint 29 produces the fixture). Categorize wall time:
 +Profile representative hello-world and runtime-archive links in Sprint 28.
 +Sprint 29 extends the same profile surface to the fortsh link once the fixture
 +exists. Categorize wall time:
  - Input parsing (Mach-O headers, sections, symbols, relocations).
  - Symbol resolution (hash-map probes, archive lookups).
  ### 5. Bump allocator for ephemeral data
 -Parser produces many small allocations (strings, reloc lists, atom descriptors). A per-input arena avoids fragmentation and makes bulk drop free. Implement as `src/arena.rs` — a std-only `Vec<Box<[u8]>>` chunker.
 +Deferred. The current Sprint 28 profile work did not prove allocation churn is
 +the next limiting bucket after the parallel parsing/relocation/signature and
 +string-table clone fixes. If Sprint 29's fortsh profile shows parser allocation
 +pressure, implement `src/arena.rs` as a std-only `Vec<Box<[u8]>>` chunker.
  ### 6. mmap for large inputs
 -`std::fs::File` + `memmap2`? No — memmap2 is an external crate. Use `libc::mmap` via an unsafe `src/mmap.rs` wrapper. Input files are always read-only; mmap saves a read syscall and lets us share parse state across threads cheaply. Fall back to `fs::read` for GNU-thin archive members whose external path doesn't mmap cleanly (rare).
 +Deferred. Object/archive loading still uses `fs::read`; this keeps the Sprint 28
 +closeout safe and std-only. If fortsh-sized inputs show file-read overhead as a
 +real bucket in Sprint 29, add an unsafe `src/mmap.rs` wrapper and keep a
 +`fs::read` fallback for archive members whose external path cannot be mapped.
  ### 7. Symbol-table hash map
  ### 8. String interner
 -Single global `StringInterner` shared across inputs. Interning cost: one hash lookup per name. Optimize by batching per-input: each input parses its strings into a local table, then merges into the global interner in one pass.
 +Deferred. Sprint 28 made the global string table thread-shareable and removed
 +the cloned string-table offset map during output writing. Per-input local
 +interners remain a candidate if Sprint 29 identifies symbol seeding as a
 +fortsh-scale bottleneck.
  ### 9. No-alloc hot paths
  ### 10. Benchmarks
 -`afs-ld/bench/` (or a `#[bench]` behind `cargo +nightly bench`) with:
 -- `bench_hello_world`: small, measures startup overhead.
 -- `bench_runtime_link`: mid, measures symbol-table & reloc-apply.
 -- `bench_fortsh_link`: large, measures end-to-end throughput.
 +Sprint 28 uses CI-enforced integration benchmarks in `tests/perf_baseline.rs`:
++
 +- `bench_hello_world_profile_reports_baseline_timings`: small, measures startup overhead.
 +- `bench_runtime_link_profile_reports_baseline_timings`: mid, measures symbol-table, archive parsing, and reloc-apply.
 +- `bench_fortsh_link`: deferred to Sprint 29 with the real fortsh fixture.
  Budget targets:
  - hello-world: ≤ 20 ms.
  - runtime link: ≤ 150 ms.
 -- fortsh link: ≤ 2× Apple `ld`'s wall time on the same machine.
 +- fortsh link: ≤ 2× Apple `ld`'s wall time on the same machine, enforced in Sprint 29.
  ### 11. Determinism preserved
  ## Testing Strategy
 -- Benchmarks land as regression gates: nightly CI records throughput; > 10% regression fails.
 +- Benchmark gate: CI runs `tests/perf_baseline.rs` with hello/runtime budgets on every push and PR.
 +- Nightly throughput recording and a relative >10% regression gate are deferred until the fortsh fixture lands in Sprint 29.
  - Determinism: 100 parallel runs of the same input, assert byte-identical output every time.
  - Sprint 27 parity must remain green — no correctness regression.
  - Single-threaded fallback (`-j 1`) for debugging.
  ## Definition of Done
 -- fortsh link wall time within 2× of `ld`'s.
 +- hello/runtime performance budgets are enforced in CI.
 +- fortsh 2× comparison is explicitly handed to Sprint 29 with its fixture.
  - All Sprint 27 scenarios still byte-identical.
  - Determinism bulletproof across parallelism.
  - No external dependencies added.

.github/workflows/parity-matrix.ymlmodified

        - name: Run determinism gate
          run: cargo test --test determinism -- --nocapture
 +      - name: Run performance budget gate
 +        env:
 +          AFS_LD_HELLO_BUDGET_MS: "20"
 +          AFS_LD_RUNTIME_BUDGET_MS: "150"
 +        run: cargo test --test perf_baseline -- --nocapture
++
        - name: Run parity matrix
          env:
            PARITY_MATRIX_ARTIFACT_DIR: ${{ github.workspace }}/parity-matrix-artifacts

tests/perf_baseline.rsmodified

 +use std::fs;
  use std::path::{Path, PathBuf};
 +use std::process::Command;
  use std::time::Duration;
  mod common;
  use afs_ld::{LinkOptions, LinkProfile, Linker};
 -use common::harness::{assemble, have_xcrun, have_xcrun_tool, scratch, sdk_path, sdk_version};
 +use common::harness::{
 +    assemble, have_tool, have_xcrun, have_xcrun_tool, scratch, sdk_path, sdk_version,
 +};
  fn find_runtime_archive() -> Option<PathBuf> {
      let workspace = Path::new(env!("CARGO_MANIFEST_DIR")).join("..");
      None
+ }
 +fn runtime_archive_fixture() -> Result<PathBuf, String> {
 +    if let Some(runtime) = find_runtime_archive() {
 +        return Ok(runtime);
 +    }
 +    build_synthetic_runtime_archive()
 +}
++
 +fn build_synthetic_runtime_archive() -> Result<PathBuf, String> {
 +    if !have_tool("libtool") {
 +        return Err("libtool unavailable".into());
 +    }
++
 +    let members = [
 +        ("init", "_afs_program_init"),
 +        ("finalize", "_afs_program_finalize"),
 +        ("write_i32", "_afs_write_i32"),
 +        ("write_f64", "_afs_write_f64"),
 +        ("write_newline", "_afs_write_newline"),
 +        ("read_i32", "_afs_read_i32"),
 +        ("alloc", "_afs_alloc"),
 +        ("dealloc", "_afs_dealloc"),
 +        ("bounds_check", "_afs_bounds_check"),
 +        ("stop", "_afs_stop"),
 +        ("date_and_time", "_afs_date_and_time"),
 +        ("cpu_time", "_afs_cpu_time"),
 +        ("random_seed", "_afs_random_seed"),
 +        ("random_number", "_afs_random_number"),
 +        ("open_unit", "_afs_open_unit"),
 +        ("close_unit", "_afs_close_unit"),
 +    ];
 +    let mut objects = Vec::with_capacity(members.len());
 +    for (stem, symbol) in members {
 +        let obj = scratch(&format!("perf-runtime-{stem}.o"));
 +        let src = format!(
 +            "\
 +            .text\n\
 +            .globl {symbol}\n\
 +            .p2align 2\n\
 +            {symbol}:\n\
 +                ret\n\
 +            .subsections_via_symbols\n",
 +        );
 +        assemble(&src, &obj)?;
 +        objects.push(obj);
 +    }
++
 +    let archive = scratch("libafs-perf-runtime.a");
 +    let _ = fs::remove_file(&archive);
 +    let output = Command::new("libtool")
 +        .args(["-static", "-o"])
 +        .arg(&archive)
 +        .args(&objects)
 +        .output()
 +        .map_err(|e| format!("spawn libtool archive: {e}"))?;
 +    if !output.status.success() {
 +        return Err(format!(
 +            "libtool archive failed: {}",
 +            String::from_utf8_lossy(&output.stderr)
 +        ));
 +    }
 +    Ok(archive)
 +}
++
  fn executable_opts(inputs: Vec<PathBuf>, output: PathBuf) -> LinkOptions {
      LinkOptions {
          inputs,
+ }
  #[test]
 -fn hello_world_profile_reports_baseline_timings() {
 +fn bench_hello_world_profile_reports_baseline_timings() {
      if !have_xcrun() || !have_xcrun_tool("ld") {
          eprintln!("skipping: xcrun as/ld unavailable");
          return;
+ }
  #[test]
 -fn runtime_link_profile_reports_baseline_timings() {
 +fn bench_runtime_link_profile_reports_baseline_timings() {
      if !have_xcrun() || !have_xcrun_tool("ld") {
          eprintln!("skipping: xcrun as/ld unavailable");
          return;
+     }
 -    let Some(runtime) = find_runtime_archive() else {
 -        eprintln!("skipping: libarmfortas_rt.a not built");
 -        return;
 +    let runtime = match runtime_archive_fixture() {
 +        Ok(runtime) => runtime,
 +        Err(reason) => {
 +            eprintln!("skipping: {reason}");
 +            return;
 +        }
      };
      let obj = scratch("perf-runtime.o");