markdown · 4617 bytes Raw Blame History

Sprint 28: Performance & Parallelism

Prerequisites

Sprint 27 — correctness gate in place; can freely refactor for speed.

Goals

Make afs-ld fast enough to feel like a production tool. Sprint 28 establishes the profiling surface, parallelizes the obvious hot paths, and enforces hello/runtime-link budgets in CI. The fortsh 2× Apple ld gate remains the production target, but Sprint 29 owns the fortsh fixture and final comparison. Mold demonstrates linkers can be very fast; we don't need mold's speed, but we need to not be painful.

Deliverables

1. Baseline profile

Profile representative hello-world and runtime-archive links in Sprint 28. Sprint 29 extends the same profile surface to the fortsh link once the fixture exists. Categorize wall time:

  • Input parsing (Mach-O headers, sections, symbols, relocations).
  • Symbol resolution (hash-map probes, archive lookups).
  • Atomization.
  • Layout.
  • Reloc application.
  • Synth sections (__unwind_info is often a hotspot).
  • Writing output.
  • Code signature hashing.

Identify the biggest bucket; optimize there first.

2. Parallel input parsing

Parse each .o in a separate worker thread; results collected into the symbol table after all parsing completes. Archive member parsing also parallel. Uses std's thread::scope — no external crates. Parallelism bounded by std::thread::available_parallelism().

3. Parallel reloc application

Each atom's relocs are independent. Process per-atom in parallel; the output buffer is preallocated and each atom writes to a disjoint slice.

4. Parallel SHA-256 for code signing

One thread per 4 KiB page. SHA-256 is inherently sequential within a page but trivially parallel across pages. Drop-in speedup for large binaries.

5. Bump allocator for ephemeral data

Deferred. The current Sprint 28 profile work did not prove allocation churn is the next limiting bucket after the parallel parsing/relocation/signature and string-table clone fixes. If Sprint 29's fortsh profile shows parser allocation pressure, implement src/arena.rs as a std-only Vec<Box<[u8]>> chunker.

6. mmap for large inputs

Deferred. Object/archive loading still uses fs::read; this keeps the Sprint 28 closeout safe and std-only. If fortsh-sized inputs show file-read overhead as a real bucket in Sprint 29, add an unsafe src/mmap.rs wrapper and keep a fs::read fallback for archive members whose external path cannot be mapped.

7. Symbol-table hash map

Profile shows std HashMap is fine for our scale. If not: replace with an open-addressing table keyed by Istr (handle-equality), linear probing, power-of-2 capacity. ~100 LoC.

8. String interner

Deferred. Sprint 28 made the global string table thread-shareable and removed the cloned string-table offset map during output writing. Per-input local interners remain a candidate if Sprint 29 identifies symbol seeding as a fortsh-scale bottleneck.

9. No-alloc hot paths

Reloc application and chain construction should not allocate per-reloc. Preallocated scratch buffers, reused across the relocation pass.

10. Benchmarks

Sprint 28 uses CI-enforced integration benchmarks in tests/perf_baseline.rs:

  • bench_hello_world_profile_reports_baseline_timings: small, measures startup overhead.
  • bench_runtime_link_profile_reports_baseline_timings: mid, measures symbol-table, archive parsing, and reloc-apply.
  • bench_fortsh_link: deferred to Sprint 29 with the real fortsh fixture.

Budget targets:

  • hello-world: ≤ 20 ms.
  • runtime link: ≤ 150 ms.
  • fortsh link: ≤ 2× Apple ld's wall time on the same machine, enforced in Sprint 29.

11. Determinism preserved

Parallelism must not reorder output. Each worker produces a deterministic result; join order is fixed; sorts are stable. A parallel and sequential run must produce byte-identical outputs.

Testing Strategy

  • Benchmark gate: CI runs tests/perf_baseline.rs with hello/runtime budgets on every push and PR.
  • Nightly throughput recording and a relative >10% regression gate are deferred until the fortsh fixture lands in Sprint 29.
  • Determinism: 100 parallel runs of the same input, assert byte-identical output every time.
  • Sprint 27 parity must remain green — no correctness regression.
  • Single-threaded fallback (-j 1) for debugging.

Definition of Done

  • hello/runtime performance budgets are enforced in CI.
  • fortsh 2× comparison is explicitly handed to Sprint 29 with its fixture.
  • All Sprint 27 scenarios still byte-identical.
  • Determinism bulletproof across parallelism.
  • No external dependencies added.
View source
1 # Sprint 28: Performance & Parallelism
2
3 ## Prerequisites
4 Sprint 27 — correctness gate in place; can freely refactor for speed.
5
6 ## Goals
7 Make afs-ld fast enough to feel like a production tool. Sprint 28 establishes
8 the profiling surface, parallelizes the obvious hot paths, and enforces
9 hello/runtime-link budgets in CI. The fortsh 2× Apple `ld` gate remains the
10 production target, but Sprint 29 owns the fortsh fixture and final comparison.
11 Mold demonstrates linkers can be very fast; we don't need mold's speed, but we
12 need to not be painful.
13
14 ## Deliverables
15
16 ### 1. Baseline profile
17
18 Profile representative hello-world and runtime-archive links in Sprint 28.
19 Sprint 29 extends the same profile surface to the fortsh link once the fixture
20 exists. Categorize wall time:
21
22 - Input parsing (Mach-O headers, sections, symbols, relocations).
23 - Symbol resolution (hash-map probes, archive lookups).
24 - Atomization.
25 - Layout.
26 - Reloc application.
27 - Synth sections (`__unwind_info` is often a hotspot).
28 - Writing output.
29 - Code signature hashing.
30
31 Identify the biggest bucket; optimize there first.
32
33 ### 2. Parallel input parsing
34
35 Parse each `.o` in a separate worker thread; results collected into the symbol table after all parsing completes. Archive member parsing also parallel. Uses std's `thread::scope` — no external crates. Parallelism bounded by `std::thread::available_parallelism()`.
36
37 ### 3. Parallel reloc application
38
39 Each atom's relocs are independent. Process per-atom in parallel; the output buffer is preallocated and each atom writes to a disjoint slice.
40
41 ### 4. Parallel SHA-256 for code signing
42
43 One thread per 4 KiB page. SHA-256 is inherently sequential within a page but trivially parallel across pages. Drop-in speedup for large binaries.
44
45 ### 5. Bump allocator for ephemeral data
46
47 Deferred. The current Sprint 28 profile work did not prove allocation churn is
48 the next limiting bucket after the parallel parsing/relocation/signature and
49 string-table clone fixes. If Sprint 29's fortsh profile shows parser allocation
50 pressure, implement `src/arena.rs` as a std-only `Vec<Box<[u8]>>` chunker.
51
52 ### 6. mmap for large inputs
53
54 Deferred. Object/archive loading still uses `fs::read`; this keeps the Sprint 28
55 closeout safe and std-only. If fortsh-sized inputs show file-read overhead as a
56 real bucket in Sprint 29, add an unsafe `src/mmap.rs` wrapper and keep a
57 `fs::read` fallback for archive members whose external path cannot be mapped.
58
59 ### 7. Symbol-table hash map
60
61 Profile shows std `HashMap` is fine for our scale. If not: replace with an open-addressing table keyed by `Istr` (handle-equality), linear probing, power-of-2 capacity. ~100 LoC.
62
63 ### 8. String interner
64
65 Deferred. Sprint 28 made the global string table thread-shareable and removed
66 the cloned string-table offset map during output writing. Per-input local
67 interners remain a candidate if Sprint 29 identifies symbol seeding as a
68 fortsh-scale bottleneck.
69
70 ### 9. No-alloc hot paths
71
72 Reloc application and chain construction should not allocate per-reloc. Preallocated scratch buffers, reused across the relocation pass.
73
74 ### 10. Benchmarks
75
76 Sprint 28 uses CI-enforced integration benchmarks in `tests/perf_baseline.rs`:
77
78 - `bench_hello_world_profile_reports_baseline_timings`: small, measures startup overhead.
79 - `bench_runtime_link_profile_reports_baseline_timings`: mid, measures symbol-table, archive parsing, and reloc-apply.
80 - `bench_fortsh_link`: deferred to Sprint 29 with the real fortsh fixture.
81
82 Budget targets:
83 - hello-world: ≤ 20 ms.
84 - runtime link: ≤ 150 ms.
85 - fortsh link: ≤ 2× Apple `ld`'s wall time on the same machine, enforced in Sprint 29.
86
87 ### 11. Determinism preserved
88
89 Parallelism must not reorder output. Each worker produces a deterministic result; join order is fixed; sorts are stable. A parallel and sequential run must produce byte-identical outputs.
90
91 ## Testing Strategy
92
93 - Benchmark gate: CI runs `tests/perf_baseline.rs` with hello/runtime budgets on every push and PR.
94 - Nightly throughput recording and a relative >10% regression gate are deferred until the fortsh fixture lands in Sprint 29.
95 - Determinism: 100 parallel runs of the same input, assert byte-identical output every time.
96 - Sprint 27 parity must remain green — no correctness regression.
97 - Single-threaded fallback (`-j 1`) for debugging.
98
99 ## Definition of Done
100
101 - hello/runtime performance budgets are enforced in CI.
102 - fortsh 2× comparison is explicitly handed to Sprint 29 with its fixture.
103 - All Sprint 27 scenarios still byte-identical.
104 - Determinism bulletproof across parallelism.
105 - No external dependencies added.