
# Sprint 28: Performance & Parallelism

## Prerequisites

Sprint 27 — correctness gate in place; can freely refactor for speed.

## Goals

Make `afs-ld` fast enough to feel like a production tool. Target: within 2× of Apple `ld`'s wall time on the fortsh link. Mold demonstrates that linkers can be very fast; we don't need mold's speed, but we can't afford to be painfully slow.

## Deliverables

### 1. Baseline profile

Profile the fortsh link (Sprint 29 produces the fixture). Categorize wall time:

- Input parsing (Mach-O headers, sections, symbols, relocations).
- Symbol resolution (hash-map probes, archive lookups).
- Atomization.
- Layout.
- Reloc application.
- Synth sections (`__unwind_info` is often a hotspot).
- Writing output.
- Code signature hashing.

Identify the biggest bucket; optimize there first.

### 2. Parallel input parsing

Parse each `.o` in a separate worker thread; results collected into the symbol table after all parsing completes. Archive member parsing also parallel. Uses std's `thread::scope` — no external crates. Parallelism bounded by `std::thread::available_parallelism()`.
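A minimal sketch of the scoped-parallel shape, using only std. `ParsedObject` and `parse_object` are hypothetical stand-ins for the real Mach-O parser; the point is the structure: spawn bounded workers over input chunks, then join in spawn order so downstream symbol resolution sees a deterministic ordering.

```rust
use std::thread;

// Hypothetical parse result; stands in for a real parsed Mach-O object.
#[derive(Debug, PartialEq)]
struct ParsedObject {
    name: String,
    symbol_count: usize,
}

// Placeholder parser: a real one decodes headers, sections, symbols, relocs.
fn parse_object(name: &str) -> ParsedObject {
    ParsedObject { name: name.to_string(), symbol_count: name.len() }
}

/// Parse all inputs in parallel with `thread::scope`, bounded by
/// `available_parallelism`. Results come back in input order.
fn parse_all(inputs: &[&str]) -> Vec<ParsedObject> {
    let workers = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    // Ceiling division; .max(1) guards the empty-input case.
    let chunk = ((inputs.len() + workers - 1) / workers).max(1);
    thread::scope(|s| {
        // Spawn all workers first...
        let handles: Vec<_> = inputs
            .chunks(chunk)
            .map(|c| s.spawn(move || c.iter().map(|&n| parse_object(n)).collect::<Vec<_>>()))
            .collect();
        // ...then join in spawn order: output order is independent of thread timing.
        handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
    })
}
```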

### 3. Parallel reloc application

Each atom's relocs are independent. Process per-atom in parallel; the output buffer is preallocated and each atom writes to a disjoint slice.
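The disjoint-slice trick can be sketched as follows; `Atom` and its byte-patch "relocs" are simplified stand-ins. Carving the output buffer with `split_at_mut` up front means the borrow checker proves the parallel writes cannot overlap, so no locking is needed.

```rust
use std::thread;

// Hypothetical atom: a contiguous region of the output plus its relocs,
// reduced here to (offset, byte) patches for illustration.
struct Atom {
    size: usize,
    relocs: Vec<(usize, u8)>,
}

/// Apply every atom's relocs in parallel. The caller must size `out` to
/// cover the sum of atom sizes; `split_at_mut` panics otherwise.
fn apply_relocs(atoms: &[Atom], out: &mut [u8]) {
    thread::scope(|s| {
        let mut rest = out;
        for atom in atoms {
            // `mem::take` moves the remaining slice out so the split
            // halves keep the full lifetime (plain reborrow would not).
            let (slice, tail) = std::mem::take(&mut rest).split_at_mut(atom.size);
            rest = tail;
            s.spawn(move || {
                for &(off, val) in &atom.relocs {
                    slice[off] = val; // a real linker patches addends, chains, etc.
                }
            });
        }
    });
}
```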

### 4. Parallel SHA-256 for code signing

Hash 4 KiB pages in parallel across worker threads. SHA-256 is inherently sequential within a page but trivially parallel across pages — a drop-in speedup for large binaries.
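The per-page parallel shape, sketched with std only. Since std has no SHA-256, a toy FNV-1a digest stands in where the real per-page SHA-256 would run; digests are collected back in page order, matching the code-directory slot layout.

```rust
use std::thread;

const PAGE: usize = 4096;

// Stand-in page digest: std has no SHA-256, so a toy 8-byte FNV-1a
// hash marks where the real hash would go.
fn page_digest(page: &[u8]) -> [u8; 8] {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in page {
        h = (h ^ b as u64).wrapping_mul(0x100000001b3);
    }
    h.to_be_bytes()
}

/// Hash each 4 KiB page independently: pages are split into groups,
/// one worker per group, and digests are joined back in page order.
fn hash_pages(data: &[u8]) -> Vec<[u8; 8]> {
    let workers = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    let pages: Vec<&[u8]> = data.chunks(PAGE).collect();
    let per = ((pages.len() + workers - 1) / workers).max(1);
    thread::scope(|s| {
        pages
            .chunks(per)
            .map(|group| s.spawn(move || group.iter().map(|&p| page_digest(p)).collect::<Vec<_>>()))
            .collect::<Vec<_>>() // spawn all before joining
            .into_iter()
            .flat_map(|h| h.join().unwrap())
            .collect()
    })
}
```

Parallel output must match a sequential pass exactly — that is the determinism requirement in deliverable 11.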

### 5. Bump allocator for ephemeral data

Parser produces many small allocations (strings, reloc lists, atom descriptors). A per-input arena avoids fragmentation and makes bulk drop free. Implement as `src/arena.rs` — a std-only `Vec<Box<[u8]>>` chunker.
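A minimal sketch of the proposed chunker, under the assumption that byte-copy allocation suffices. A production `src/arena.rs` would also handle alignment for typed data and likely return references with the arena's lifetime via interior unsafety; this version keeps the borrow checker happy by tying each returned slice to `&mut self`.

```rust
const CHUNK: usize = 64 * 1024; // chunk size is an assumption, not from the spec

/// Std-only bump arena: allocates out of fixed-size chunks and frees
/// everything at once when dropped (bulk drop is just Vec drop).
struct Arena {
    chunks: Vec<Box<[u8]>>,
    used: usize, // bytes used in the last chunk
}

impl Arena {
    fn new() -> Self {
        Arena { chunks: vec![vec![0u8; CHUNK].into_boxed_slice()], used: 0 }
    }

    /// Copy `bytes` into the arena and return a reference to the copy.
    fn alloc(&mut self, bytes: &[u8]) -> &[u8] {
        if self.used + bytes.len() > self.chunks.last().unwrap().len() {
            // Oversized requests get a dedicated chunk; normal ones a fresh one.
            let cap = CHUNK.max(bytes.len());
            self.chunks.push(vec![0u8; cap].into_boxed_slice());
            self.used = 0;
        }
        let start = self.used;
        self.used += bytes.len();
        let chunk = self.chunks.last_mut().unwrap();
        chunk[start..start + bytes.len()].copy_from_slice(bytes);
        &chunk[start..start + bytes.len()]
    }
}
```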

### 6. mmap for large inputs

`std::fs::File` + `memmap2`? No — `memmap2` is an external crate. Use `libc::mmap` via an unsafe `src/mmap.rs` wrapper. Input files are always read-only; mmap saves a read syscall and lets us share parse state across threads cheaply. Fall back to `fs::read` for GNU-thin archive members whose external path doesn't mmap cleanly (rare).
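A sketch of what the wrapper might look like. To stay crate-free it declares the raw bindings inline rather than going through `libc`; the constants and the `i64` offset type match 64-bit Linux and macOS but are platform assumptions a real `src/mmap.rs` would gate per target. Empty files must take the `fs::read` path, since `mmap` rejects zero-length mappings.

```rust
use std::fs::File;
use std::os::fd::AsRawFd;
use std::os::raw::{c_int, c_void};

// Raw bindings declared inline so no external crate is needed.
extern "C" {
    fn mmap(addr: *mut c_void, len: usize, prot: c_int, flags: c_int, fd: c_int, off: i64) -> *mut c_void;
    fn munmap(addr: *mut c_void, len: usize) -> c_int;
}

// Values for Linux and macOS; a real wrapper gates these per target OS.
const PROT_READ: c_int = 1;
const MAP_PRIVATE: c_int = 2;

/// Read-only mapping of an input file; unmapped on drop.
struct Mmap {
    ptr: *mut c_void,
    len: usize,
}

impl Mmap {
    fn open(file: &File) -> std::io::Result<Mmap> {
        let len = file.metadata()?.len() as usize;
        let ptr = unsafe {
            mmap(std::ptr::null_mut(), len, PROT_READ, MAP_PRIVATE, file.as_raw_fd(), 0)
        };
        if ptr as isize == -1 {
            // MAP_FAILED is (void*)-1.
            return Err(std::io::Error::last_os_error());
        }
        Ok(Mmap { ptr, len })
    }

    fn as_slice(&self) -> &[u8] {
        unsafe { std::slice::from_raw_parts(self.ptr as *const u8, self.len) }
    }
}

impl Drop for Mmap {
    fn drop(&mut self) {
        unsafe { munmap(self.ptr, self.len); }
    }
}
```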

### 7. Symbol-table hash map

Profiling will likely show std `HashMap` is fine at our scale. If not: replace with an open-addressing table keyed by `Istr` (handle equality), linear probing, power-of-2 capacity. ~100 LoC.
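The fallback table could look like this sketch: `Istr` is assumed to be a `u32` handle (see deliverable 8), so key comparison is one integer compare and the "hash" is just a multiplicative mix of the handle bits.

```rust
/// Hypothetical interned-string handle: equality is a u32 compare.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct Istr(u32);

/// Open-addressing map from `Istr` to a symbol index:
/// linear probing, power-of-2 capacity, ~3/4 max load.
struct SymMap {
    slots: Vec<Option<(Istr, u32)>>,
    len: usize,
}

impl SymMap {
    fn new() -> Self {
        SymMap { slots: vec![None; 16], len: 0 }
    }

    fn index(&self, key: Istr) -> usize {
        // Fibonacci mixing; capacity is a power of 2 so masking works.
        (key.0.wrapping_mul(0x9E37_79B9) as usize) & (self.slots.len() - 1)
    }

    fn insert(&mut self, key: Istr, val: u32) {
        if (self.len + 1) * 4 > self.slots.len() * 3 {
            self.grow();
        }
        let mut i = self.index(key);
        loop {
            match self.slots[i] {
                Some((k, _)) if k == key => { self.slots[i] = Some((key, val)); return; }
                Some(_) => i = (i + 1) & (self.slots.len() - 1), // probe next slot
                None => { self.slots[i] = Some((key, val)); self.len += 1; return; }
            }
        }
    }

    fn get(&self, key: Istr) -> Option<u32> {
        let mut i = self.index(key);
        loop {
            match self.slots[i] {
                Some((k, v)) if k == key => return Some(v),
                Some(_) => i = (i + 1) & (self.slots.len() - 1),
                None => return None, // empty slot ends the probe chain
            }
        }
    }

    fn grow(&mut self) {
        let new_cap = self.slots.len() * 2;
        let old = std::mem::replace(&mut self.slots, vec![None; new_cap]);
        self.len = 0;
        for (k, v) in old.into_iter().flatten() {
            self.insert(k, v);
        }
    }
}
```

Note this sketch has no delete operation; a linker symbol table typically never removes entries, which is what keeps plain linear probing this simple.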

### 8. String interner

Single global `StringInterner` shared across inputs. Interning cost: one hash lookup per name. Optimize by batching per-input: each input parses its strings into a local table, then merges into the global interner in one pass.

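The batching idea can be sketched like so: `merge` takes an input's locally collected names and returns a local-index → global-handle remap table, so the input's symbol records can be rewritten in one pass. (The `HashMap`-backed internals are an assumption; the real interner may use the open-addressing table above.)

```rust
use std::collections::HashMap;

/// Global interner: one u32 handle per distinct name.
#[derive(Default)]
struct StringInterner {
    map: HashMap<String, u32>,
    strings: Vec<String>, // handle -> string, for reverse lookup
}

impl StringInterner {
    /// One hash lookup per name; allocates only for new names.
    fn intern(&mut self, s: &str) -> u32 {
        if let Some(&id) = self.map.get(s) {
            return id;
        }
        let id = self.strings.len() as u32;
        self.strings.push(s.to_string());
        self.map.insert(s.to_string(), id);
        id
    }

    /// Batch merge: intern an input's local string table in one pass,
    /// returning local-index -> global-handle.
    fn merge(&mut self, local: &[String]) -> Vec<u32> {
        local.iter().map(|s| self.intern(s)).collect()
    }
}
```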

### 9. No-alloc hot paths

Reloc application and chain construction should not allocate per-reloc. Use preallocated scratch buffers, reused across the relocation pass.
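The scratch-buffer pattern in miniature; the `fixups` worklist is a hypothetical example of per-atom state. The key property is that `Vec::clear` keeps capacity, so once the high-water mark is reached the hot loop never touches the allocator.

```rust
/// Scratch state for the relocation pass: allocated once, cleared
/// (not freed) between atoms.
struct RelocScratch {
    fixups: Vec<u64>, // hypothetical worklist of fixup offsets
}

impl RelocScratch {
    fn with_capacity(cap: usize) -> Self {
        RelocScratch { fixups: Vec::with_capacity(cap) }
    }

    /// `clear` keeps the Vec's capacity, so reuse is allocation-free.
    fn begin_atom(&mut self) {
        self.fixups.clear();
    }
}
```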

### 10. Benchmarks

`afs-ld/bench/` (or a `#[bench]` behind `cargo +nightly bench`) with:

- `bench_hello_world`: small; measures startup overhead.
- `bench_runtime_link`: mid-sized; measures symbol-table and reloc-apply.
- `bench_fortsh_link`: large; measures end-to-end throughput.

Budget targets:

- hello-world: ≤ 20 ms.
- runtime link: ≤ 150 ms.
- fortsh link: ≤ 2× Apple `ld`'s wall time on the same machine.
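If the benches end up hand-rolled rather than nightly `#[bench]`, a minimal std-only harness is enough to gate the budgets above — run the workload a few times and take the median, which is less noise-sensitive than the mean. This sketch assumes `iters >= 1`.

```rust
use std::time::{Duration, Instant};

/// Run `f` `iters` times and return the median wall time.
fn bench<F: FnMut()>(iters: usize, mut f: F) -> Duration {
    let mut samples: Vec<Duration> = (0..iters)
        .map(|_| {
            let t = Instant::now();
            f();
            t.elapsed()
        })
        .collect();
    samples.sort();
    samples[samples.len() / 2] // median; assumes iters >= 1
}
```

A budget check is then a plain assertion, e.g. `assert!(bench(10, run_hello_world_link) <= Duration::from_millis(20))`.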

### 11. Determinism preserved

Parallelism must not reorder output. Each worker produces a deterministic result; join order is fixed; sorts are stable. A parallel and sequential run must produce byte-identical outputs.

## Testing Strategy

- Benchmarks land as regression gates: nightly CI records throughput; a > 10% regression fails.
- Determinism: 100 parallel runs of the same input; assert byte-identical output every time.
- Sprint 27 parity must remain green — no correctness regressions.
- Single-threaded fallback (`-j 1`) for debugging.

## Definition of Done

- fortsh link wall time within 2× of Apple `ld`'s.
- All Sprint 27 scenarios still byte-identical.
- Determinism bulletproof across parallelism.
- No external dependencies added.