@@ -4,13 +4,20 @@ |
| 4 | Sprint 27 — correctness gate in place; can freely refactor for speed. | 4 | Sprint 27 — correctness gate in place; can freely refactor for speed. |
| 5 | | 5 | |
| 6 | ## Goals | 6 | ## Goals |
| 7 | -Make afs-ld fast enough to feel like a production tool. Target: within 2× of Apple `ld`'s wall time on the fortsh link. Mold demonstrates linkers can be very fast; we don't need mold's speed, but we need to not be painful. | 7 | +Make afs-ld fast enough to feel like a production tool. Sprint 28 establishes |
| | 8 | +the profiling surface, parallelizes the obvious hot paths, and enforces |
| | 9 | +hello/runtime-link budgets in CI. The fortsh 2× Apple `ld` gate remains the |
| | 10 | +production target, but Sprint 29 owns the fortsh fixture and final comparison. |
| | 11 | +Mold demonstrates linkers can be very fast; we don't need mold's speed, but we |
| | 12 | +need to not be painful. |
| 8 | | 13 | |
| 9 | ## Deliverables | 14 | ## Deliverables |
| 10 | | 15 | |
| 11 | ### 1. Baseline profile | 16 | ### 1. Baseline profile |
| 12 | | 17 | |
| 13 | -Profile the fortsh link (Sprint 29 produces the fixture). Categorize wall time: | 18 | +Profile representative hello-world and runtime-archive links in Sprint 28. |
| | 19 | +Sprint 29 extends the same profile surface to the fortsh link once the fixture |
| | 20 | +exists. Categorize wall time: |
| 14 | | 21 | |
| 15 | - Input parsing (Mach-O headers, sections, symbols, relocations). | 22 | - Input parsing (Mach-O headers, sections, symbols, relocations). |
| 16 | - Symbol resolution (hash-map probes, archive lookups). | 23 | - Symbol resolution (hash-map probes, archive lookups). |
@@ -37,11 +44,17 @@ One thread per 4 KiB page. SHA-256 is inherently sequential within a page but tr |
| 37 | | 44 | |
| 38 | ### 5. Bump allocator for ephemeral data | 45 | ### 5. Bump allocator for ephemeral data |
| 39 | | 46 | |
| 40 | -Parser produces many small allocations (strings, reloc lists, atom descriptors). A per-input arena avoids fragmentation and makes bulk drop free. Implement as `src/arena.rs` — a std-only `Vec<Box<[u8]>>` chunker. | 47 | +Deferred. The current Sprint 28 profile work did not prove allocation churn is |
| | 48 | +the next limiting bucket after the parallel parsing/relocation/signature and |
| | 49 | +string-table clone fixes. If Sprint 29's fortsh profile shows parser allocation |
| | 50 | +pressure, implement `src/arena.rs` as a std-only `Vec<Box<[u8]>>` chunker. |
| 41 | | 51 | |
| 42 | ### 6. mmap for large inputs | 52 | ### 6. mmap for large inputs |
| 43 | | 53 | |
| 44 | -`std::fs::File` + `memmap2`? No — memmap2 is an external crate. Use `libc::mmap` via an unsafe `src/mmap.rs` wrapper. Input files are always read-only; mmap saves a read syscall and lets us share parse state across threads cheaply. Fall back to `fs::read` for GNU-thin archive members whose external path doesn't mmap cleanly (rare). | 54 | +Deferred. Object/archive loading still uses `fs::read`; this keeps the Sprint 28 |
| | 55 | +closeout safe and std-only. If fortsh-sized inputs show file-read overhead as a |
| | 56 | +real bucket in Sprint 29, add an unsafe `src/mmap.rs` wrapper and keep a |
| | 57 | +`fs::read` fallback for archive members whose external path cannot be mapped. |
| 45 | | 58 | |
| 46 | ### 7. Symbol-table hash map | 59 | ### 7. Symbol-table hash map |
| 47 | | 60 | |
@@ -49,7 +62,10 @@ Profile shows std `HashMap` is fine for our scale. If not: replace with an open- |
| 49 | | 62 | |
| 50 | ### 8. String interner | 63 | ### 8. String interner |
| 51 | | 64 | |
| 52 | -Single global `StringInterner` shared across inputs. Interning cost: one hash lookup per name. Optimize by batching per-input: each input parses its strings into a local table, then merges into the global interner in one pass. | 65 | +Deferred. Sprint 28 made the global string table thread-shareable and removed |
| | 66 | +the cloned string-table offset map during output writing. Per-input local |
| | 67 | +interners remain a candidate if Sprint 29 identifies symbol seeding as a |
| | 68 | +fortsh-scale bottleneck. |
| 53 | | 69 | |
| 54 | ### 9. No-alloc hot paths | 70 | ### 9. No-alloc hot paths |
| 55 | | 71 | |
@@ -57,15 +73,16 @@ Reloc application and chain construction should not allocate per-reloc. Prealloc |
| 57 | | 73 | |
| 58 | ### 10. Benchmarks | 74 | ### 10. Benchmarks |
| 59 | | 75 | |
| 60 | -`afs-ld/bench/` (or a `#[bench]` behind `cargo +nightly bench`) with: | 76 | +Sprint 28 uses CI-enforced integration benchmarks in `tests/perf_baseline.rs`: |
| 61 | -- `bench_hello_world`: small, measures startup overhead. | 77 | + |
| 62 | -- `bench_runtime_link`: mid, measures symbol-table & reloc-apply. | 78 | +- `bench_hello_world_profile_reports_baseline_timings`: small, measures startup overhead. |
| 63 | -- `bench_fortsh_link`: large, measures end-to-end throughput. | 79 | +- `bench_runtime_link_profile_reports_baseline_timings`: mid, measures symbol-table, archive parsing, and reloc-apply. |
| | 80 | +- `bench_fortsh_link`: deferred to Sprint 29 with the real fortsh fixture. |
| 64 | | 81 | |
| 65 | Budget targets: | 82 | Budget targets: |
| 66 | - hello-world: ≤ 20 ms. | 83 | - hello-world: ≤ 20 ms. |
| 67 | - runtime link: ≤ 150 ms. | 84 | - runtime link: ≤ 150 ms. |
| 68 | -- fortsh link: ≤ 2× Apple `ld`'s wall time on the same machine. | 85 | +- fortsh link: ≤ 2× Apple `ld`'s wall time on the same machine, enforced in Sprint 29. |
| 69 | | 86 | |
| 70 | ### 11. Determinism preserved | 87 | ### 11. Determinism preserved |
| 71 | | 88 | |
@@ -73,14 +90,16 @@ Parallelism must not reorder output. Each worker produces a deterministic result |
| 73 | | 90 | |
| 74 | ## Testing Strategy | 91 | ## Testing Strategy |
| 75 | | 92 | |
| 76 | -- Benchmarks land as regression gates: nightly CI records throughput; > 10% regression fails. | 93 | +- Benchmark gate: CI runs `tests/perf_baseline.rs` with hello/runtime budgets on every push and PR. |
| | 94 | +- Nightly throughput recording and a relative >10% regression gate are deferred until the fortsh fixture lands in Sprint 29. |
| 77 | - Determinism: 100 parallel runs of the same input, assert byte-identical output every time. | 95 | - Determinism: 100 parallel runs of the same input, assert byte-identical output every time. |
| 78 | - Sprint 27 parity must remain green — no correctness regression. | 96 | - Sprint 27 parity must remain green — no correctness regression. |
| 79 | - Single-threaded fallback (`-j 1`) for debugging. | 97 | - Single-threaded fallback (`-j 1`) for debugging. |
| 80 | | 98 | |
| 81 | ## Definition of Done | 99 | ## Definition of Done |
| 82 | | 100 | |
| 83 | -- fortsh link wall time within 2× of `ld`'s. | 101 | +- hello/runtime performance budgets are enforced in CI. |
| | 102 | +- fortsh 2× comparison is explicitly handed to Sprint 29 with its fixture. |
| 84 | - All Sprint 27 scenarios still byte-identical. | 103 | - All Sprint 27 scenarios still byte-identical. |
| 85 | - Determinism bulletproof across parallelism. | 104 | - Determinism bulletproof across parallelism. |
| 86 | - No external dependencies added. | 105 | - No external dependencies added. |