Wire performance closeout gate
Authored by
mfwolffe <wolffemf@dukes.jmu.edu>
- SHA
73c9b0500099e8b8e22e388a00f5afb246561cb5- Parents
-
43a662a - Tree
ac3de7e
73c9b05
73c9b0500099e8b8e22e388a00f5afb246561cb543a662a
ac3de7e| Status | File | + | - |
|---|---|---|---|
| M |
.docs/sprints/sprint28.md
|
31 | 12 |
| M |
.github/workflows/parity-matrix.yml
|
6 | 0 |
| M |
tests/perf_baseline.rs
|
76 | 6 |
.docs/sprints/sprint28.mdmodified@@ -4,13 +4,20 @@ | ||
| 4 | 4 | Sprint 27 — correctness gate in place; can freely refactor for speed. |
| 5 | 5 | |
| 6 | 6 | ## Goals |
| 7 | -Make afs-ld fast enough to feel like a production tool. Target: within 2× of Apple `ld`'s wall time on the fortsh link. Mold demonstrates linkers can be very fast; we don't need mold's speed, but we need to not be painful. | |
| 7 | +Make afs-ld fast enough to feel like a production tool. Sprint 28 establishes | |
| 8 | +the profiling surface, parallelizes the obvious hot paths, and enforces | |
| 9 | +hello/runtime-link budgets in CI. The fortsh 2× Apple `ld` gate remains the | |
| 10 | +production target, but Sprint 29 owns the fortsh fixture and final comparison. | |
| 11 | +Mold demonstrates linkers can be very fast; we don't need mold's speed, but we | |
| 12 | +need to not be painful. | |
| 8 | 13 | |
| 9 | 14 | ## Deliverables |
| 10 | 15 | |
| 11 | 16 | ### 1. Baseline profile |
| 12 | 17 | |
| 13 | -Profile the fortsh link (Sprint 29 produces the fixture). Categorize wall time: | |
| 18 | +Profile representative hello-world and runtime-archive links in Sprint 28. | |
| 19 | +Sprint 29 extends the same profile surface to the fortsh link once the fixture | |
| 20 | +exists. Categorize wall time: | |
| 14 | 21 | |
| 15 | 22 | - Input parsing (Mach-O headers, sections, symbols, relocations). |
| 16 | 23 | - Symbol resolution (hash-map probes, archive lookups). |
@@ -37,11 +44,17 @@ One thread per 4 KiB page. SHA-256 is inherently sequential within a page but tr | ||
| 37 | 44 | |
| 38 | 45 | ### 5. Bump allocator for ephemeral data |
| 39 | 46 | |
| 40 | -Parser produces many small allocations (strings, reloc lists, atom descriptors). A per-input arena avoids fragmentation and makes bulk drop free. Implement as `src/arena.rs` — a std-only `Vec<Box<[u8]>>` chunker. | |
| 47 | +Deferred. The current Sprint 28 profile work did not prove allocation churn is | |
| 48 | +the next limiting bucket after the parallel parsing/relocation/signature and | |
| 49 | +string-table clone fixes. If Sprint 29's fortsh profile shows parser allocation | |
| 50 | +pressure, implement `src/arena.rs` as a std-only `Vec<Box<[u8]>>` chunker. | |
| 41 | 51 | |
| 42 | 52 | ### 6. mmap for large inputs |
| 43 | 53 | |
| 44 | -`std::fs::File` + `memmap2`? No — memmap2 is an external crate. Use `libc::mmap` via an unsafe `src/mmap.rs` wrapper. Input files are always read-only; mmap saves a read syscall and lets us share parse state across threads cheaply. Fall back to `fs::read` for GNU-thin archive members whose external path doesn't mmap cleanly (rare). | |
| 54 | +Deferred. Object/archive loading still uses `fs::read`; this keeps the Sprint 28 | |
| 55 | +closeout safe and std-only. If fortsh-sized inputs show file-read overhead as a | |
| 56 | +real bucket in Sprint 29, add an unsafe `src/mmap.rs` wrapper and keep a | |
| 57 | +`fs::read` fallback for archive members whose external path cannot be mapped. | |
| 45 | 58 | |
| 46 | 59 | ### 7. Symbol-table hash map |
| 47 | 60 | |
@@ -49,7 +62,10 @@ Profile shows std `HashMap` is fine for our scale. If not: replace with an open- | ||
| 49 | 62 | |
| 50 | 63 | ### 8. String interner |
| 51 | 64 | |
| 52 | -Single global `StringInterner` shared across inputs. Interning cost: one hash lookup per name. Optimize by batching per-input: each input parses its strings into a local table, then merges into the global interner in one pass. | |
| 65 | +Deferred. Sprint 28 made the global string table thread-shareable and removed | |
| 66 | +the cloned string-table offset map during output writing. Per-input local | |
| 67 | +interners remain a candidate if Sprint 29 identifies symbol seeding as a | |
| 68 | +fortsh-scale bottleneck. | |
| 53 | 69 | |
| 54 | 70 | ### 9. No-alloc hot paths |
| 55 | 71 | |
@@ -57,15 +73,16 @@ Reloc application and chain construction should not allocate per-reloc. Prealloc | ||
| 57 | 73 | |
| 58 | 74 | ### 10. Benchmarks |
| 59 | 75 | |
| 60 | -`afs-ld/bench/` (or a `#[bench]` behind `cargo +nightly bench`) with: | |
| 61 | -- `bench_hello_world`: small, measures startup overhead. | |
| 62 | -- `bench_runtime_link`: mid, measures symbol-table & reloc-apply. | |
| 63 | -- `bench_fortsh_link`: large, measures end-to-end throughput. | |
| 76 | +Sprint 28 uses CI-enforced integration benchmarks in `tests/perf_baseline.rs`: | |
| 77 | + | |
| 78 | +- `bench_hello_world_profile_reports_baseline_timings`: small, measures startup overhead. | |
| 79 | +- `bench_runtime_link_profile_reports_baseline_timings`: mid, measures symbol-table, archive parsing, and reloc-apply. | |
| 80 | +- `bench_fortsh_link`: deferred to Sprint 29 with the real fortsh fixture. | |
| 64 | 81 | |
| 65 | 82 | Budget targets: |
| 66 | 83 | - hello-world: ≤ 20 ms. |
| 67 | 84 | - runtime link: ≤ 150 ms. |
| 68 | -- fortsh link: ≤ 2× Apple `ld`'s wall time on the same machine. | |
| 85 | +- fortsh link: ≤ 2× Apple `ld`'s wall time on the same machine, enforced in Sprint 29. | |
| 69 | 86 | |
| 70 | 87 | ### 11. Determinism preserved |
| 71 | 88 | |
@@ -73,14 +90,16 @@ Parallelism must not reorder output. Each worker produces a deterministic result | ||
| 73 | 90 | |
| 74 | 91 | ## Testing Strategy |
| 75 | 92 | |
| 76 | -- Benchmarks land as regression gates: nightly CI records throughput; > 10% regression fails. | |
| 93 | +- Benchmark gate: CI runs `tests/perf_baseline.rs` with hello/runtime budgets on every push and PR. | |
| 94 | +- Nightly throughput recording and a relative >10% regression gate are deferred until the fortsh fixture lands in Sprint 29. | |
| 77 | 95 | - Determinism: 100 parallel runs of the same input, assert byte-identical output every time. |
| 78 | 96 | - Sprint 27 parity must remain green — no correctness regression. |
| 79 | 97 | - Single-threaded fallback (`-j 1`) for debugging. |
| 80 | 98 | |
| 81 | 99 | ## Definition of Done |
| 82 | 100 | |
| 83 | -- fortsh link wall time within 2× of `ld`'s. | |
| 101 | +- hello/runtime performance budgets are enforced in CI. | |
| 102 | +- fortsh 2× comparison is explicitly handed to Sprint 29 with its fixture. | |
| 84 | 103 | - All Sprint 27 scenarios still byte-identical. |
| 85 | 104 | - Determinism bulletproof across parallelism. |
| 86 | 105 | - No external dependencies added. |
.github/workflows/parity-matrix.ymlmodified@@ -26,6 +26,12 @@ jobs: | ||
| 26 | 26 | - name: Run determinism gate |
| 27 | 27 | run: cargo test --test determinism -- --nocapture |
| 28 | 28 | |
| 29 | + - name: Run performance budget gate | |
| 30 | + env: | |
| 31 | + AFS_LD_HELLO_BUDGET_MS: "20" | |
| 32 | + AFS_LD_RUNTIME_BUDGET_MS: "150" | |
| 33 | + run: cargo test --test perf_baseline -- --nocapture | |
| 34 | + | |
| 29 | 35 | - name: Run parity matrix |
| 30 | 36 | env: |
| 31 | 37 | PARITY_MATRIX_ARTIFACT_DIR: ${{ github.workspace }}/parity-matrix-artifacts |
tests/perf_baseline.rsmodified@@ -1,10 +1,14 @@ | ||
| 1 | +use std::fs; | |
| 1 | 2 | use std::path::{Path, PathBuf}; |
| 3 | +use std::process::Command; | |
| 2 | 4 | use std::time::Duration; |
| 3 | 5 | |
| 4 | 6 | mod common; |
| 5 | 7 | |
| 6 | 8 | use afs_ld::{LinkOptions, LinkProfile, Linker}; |
| 7 | -use common::harness::{assemble, have_xcrun, have_xcrun_tool, scratch, sdk_path, sdk_version}; | |
| 9 | +use common::harness::{ | |
| 10 | + assemble, have_tool, have_xcrun, have_xcrun_tool, scratch, sdk_path, sdk_version, | |
| 11 | +}; | |
| 8 | 12 | |
| 9 | 13 | fn find_runtime_archive() -> Option<PathBuf> { |
| 10 | 14 | let workspace = Path::new(env!("CARGO_MANIFEST_DIR")).join(".."); |
@@ -20,6 +24,69 @@ fn find_runtime_archive() -> Option<PathBuf> { | ||
| 20 | 24 | None |
| 21 | 25 | } |
| 22 | 26 | |
| 27 | +fn runtime_archive_fixture() -> Result<PathBuf, String> { | |
| 28 | + if let Some(runtime) = find_runtime_archive() { | |
| 29 | + return Ok(runtime); | |
| 30 | + } | |
| 31 | + build_synthetic_runtime_archive() | |
| 32 | +} | |
| 33 | + | |
| 34 | +fn build_synthetic_runtime_archive() -> Result<PathBuf, String> { | |
| 35 | + if !have_tool("libtool") { | |
| 36 | + return Err("libtool unavailable".into()); | |
| 37 | + } | |
| 38 | + | |
| 39 | + let members = [ | |
| 40 | + ("init", "_afs_program_init"), | |
| 41 | + ("finalize", "_afs_program_finalize"), | |
| 42 | + ("write_i32", "_afs_write_i32"), | |
| 43 | + ("write_f64", "_afs_write_f64"), | |
| 44 | + ("write_newline", "_afs_write_newline"), | |
| 45 | + ("read_i32", "_afs_read_i32"), | |
| 46 | + ("alloc", "_afs_alloc"), | |
| 47 | + ("dealloc", "_afs_dealloc"), | |
| 48 | + ("bounds_check", "_afs_bounds_check"), | |
| 49 | + ("stop", "_afs_stop"), | |
| 50 | + ("date_and_time", "_afs_date_and_time"), | |
| 51 | + ("cpu_time", "_afs_cpu_time"), | |
| 52 | + ("random_seed", "_afs_random_seed"), | |
| 53 | + ("random_number", "_afs_random_number"), | |
| 54 | + ("open_unit", "_afs_open_unit"), | |
| 55 | + ("close_unit", "_afs_close_unit"), | |
| 56 | + ]; | |
| 57 | + let mut objects = Vec::with_capacity(members.len()); | |
| 58 | + for (stem, symbol) in members { | |
| 59 | + let obj = scratch(&format!("perf-runtime-{stem}.o")); | |
| 60 | + let src = format!( | |
| 61 | + "\ | |
| 62 | + .text\n\ | |
| 63 | + .globl {symbol}\n\ | |
| 64 | + .p2align 2\n\ | |
| 65 | + {symbol}:\n\ | |
| 66 | + ret\n\ | |
| 67 | + .subsections_via_symbols\n", | |
| 68 | + ); | |
| 69 | + assemble(&src, &obj)?; | |
| 70 | + objects.push(obj); | |
| 71 | + } | |
| 72 | + | |
| 73 | + let archive = scratch("libafs-perf-runtime.a"); | |
| 74 | + let _ = fs::remove_file(&archive); | |
| 75 | + let output = Command::new("libtool") | |
| 76 | + .args(["-static", "-o"]) | |
| 77 | + .arg(&archive) | |
| 78 | + .args(&objects) | |
| 79 | + .output() | |
| 80 | + .map_err(|e| format!("spawn libtool archive: {e}"))?; | |
| 81 | + if !output.status.success() { | |
| 82 | + return Err(format!( | |
| 83 | + "libtool archive failed: {}", | |
| 84 | + String::from_utf8_lossy(&output.stderr) | |
| 85 | + )); | |
| 86 | + } | |
| 87 | + Ok(archive) | |
| 88 | +} | |
| 89 | + | |
| 23 | 90 | fn executable_opts(inputs: Vec<PathBuf>, output: PathBuf) -> LinkOptions { |
| 24 | 91 | LinkOptions { |
| 25 | 92 | inputs, |
@@ -139,7 +206,7 @@ fn assert_profile_basics(name: &str, profile: &LinkProfile) { | ||
| 139 | 206 | } |
| 140 | 207 | |
| 141 | 208 | #[test] |
| 142 | -fn hello_world_profile_reports_baseline_timings() { | |
| 209 | +fn bench_hello_world_profile_reports_baseline_timings() { | |
| 143 | 210 | if !have_xcrun() || !have_xcrun_tool("ld") { |
| 144 | 211 | eprintln!("skipping: xcrun as/ld unavailable"); |
| 145 | 212 | return; |
@@ -174,14 +241,17 @@ fn hello_world_profile_reports_baseline_timings() { | ||
| 174 | 241 | } |
| 175 | 242 | |
| 176 | 243 | #[test] |
| 177 | -fn runtime_link_profile_reports_baseline_timings() { | |
| 244 | +fn bench_runtime_link_profile_reports_baseline_timings() { | |
| 178 | 245 | if !have_xcrun() || !have_xcrun_tool("ld") { |
| 179 | 246 | eprintln!("skipping: xcrun as/ld unavailable"); |
| 180 | 247 | return; |
| 181 | 248 | } |
| 182 | - let Some(runtime) = find_runtime_archive() else { | |
| 183 | - eprintln!("skipping: libarmfortas_rt.a not built"); | |
| 184 | - return; | |
| 249 | + let runtime = match runtime_archive_fixture() { | |
| 250 | + Ok(runtime) => runtime, | |
| 251 | + Err(reason) => { | |
| 252 | + eprintln!("skipping: {reason}"); | |
| 253 | + return; | |
| 254 | + } | |
| 185 | 255 | }; |
| 186 | 256 | |
| 187 | 257 | let obj = scratch("perf-runtime.o"); |