# Sprint 28: Performance & Parallelism

## Prerequisites
Sprint 27 — correctness gate in place; can freely refactor for speed.
## Goals
Make afs-ld fast enough to feel like a production tool. Target: within 2× of Apple `ld`'s wall time on the fortsh link. Mold demonstrates linkers can be very fast; we don't need mold's speed, but we need to not be painful.
## Deliverables

### 1. Baseline profile
Profile the fortsh link (Sprint 29 produces the fixture). Categorize wall time:
- Input parsing (Mach-O headers, sections, symbols, relocations).
- Symbol resolution (hash-map probes, archive lookups).
- Atomization.
- Layout.
- Reloc application.
- Synth sections (`__unwind_info` is often a hotspot).
- Writing output.
- Code signature hashing.
Identify the biggest bucket; optimize there first.
### 2. Parallel input parsing
Parse each `.o` in a separate worker thread; results are collected into the symbol table after all parsing completes. Archive member parsing is also parallel. Uses std's `thread::scope` — no external crates. Parallelism is bounded by `std::thread::available_parallelism()`.
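A minimal sketch of that scoped-worker shape, assuming a hypothetical `ParsedInput` result type and a stand-in `parse_one` (the real worker would mmap the file and read Mach-O headers). Inputs are chunked so at most `available_parallelism()` workers run at once, and handles are joined in spawn order so the collection step stays deterministic:

```rust
use std::thread;

// Hypothetical result type; the real parser produces richer per-input state.
struct ParsedInput {
    name: String,
    symbol_count: usize,
}

// Stand-in "parse": derive a trivial result from the path.
fn parse_one(path: &str) -> ParsedInput {
    ParsedInput { name: path.to_string(), symbol_count: path.len() }
}

fn parse_all(inputs: &[&str]) -> Vec<ParsedInput> {
    let max_workers = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    let mut results = Vec::with_capacity(inputs.len());
    // Bound concurrency: at most `max_workers` inputs in flight per batch.
    for chunk in inputs.chunks(max_workers) {
        // thread::scope lets workers borrow `inputs` without 'static bounds.
        let parsed = thread::scope(|s| {
            let handles: Vec<_> = chunk
                .iter()
                .map(|path| s.spawn(move || parse_one(path)))
                .collect();
            // Join in spawn order: result order is fixed regardless of which
            // worker finishes first.
            handles.into_iter().map(|h| h.join().unwrap()).collect::<Vec<_>>()
        });
        results.extend(parsed);
    }
    results
}
```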
### 3. Parallel reloc application
Each atom's relocs are independent. Process per-atom in parallel; the output buffer is preallocated and each atom writes to a disjoint slice.
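The disjoint-slice split can be sketched with `split_at_mut`; the `Atom` fields and the byte-add "fixup" below are stand-ins for the real reloc records:

```rust
use std::thread;

// Hypothetical atom: its length in the output buffer plus a list of
// (offset-within-atom, value) fixups.
struct Atom {
    len: usize,
    relocs: Vec<(usize, u8)>,
}

// Split the preallocated output into one disjoint &mut [u8] per atom, then let
// each scoped worker apply its atom's relocs with no locking.
fn apply_relocs(atoms: &[Atom], out: &mut [u8]) {
    thread::scope(|s| {
        let mut rest = out;
        for atom in atoms {
            // mem::take moves the remaining slice out, so split_at_mut yields
            // slices with the full lifetime the scoped worker needs.
            let (mine, tail) = std::mem::take(&mut rest).split_at_mut(atom.len);
            rest = tail;
            s.spawn(move || {
                for &(off, val) in &atom.relocs {
                    mine[off] = mine[off].wrapping_add(val); // stand-in fixup
                }
            });
        }
    });
}
```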
### 4. Parallel SHA-256 for code signing
One thread per 4 KiB page. SHA-256 is inherently sequential within a page but trivially parallel across pages. Drop-in speedup for large binaries.
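The per-page pattern in miniature, with a stand-in FNV-1a page hash since std has no SHA-256 (the real code would swap in its SHA-256 implementation, and would likely batch pages per worker rather than literally spawn one thread per page). Digests are collected in page order so the signature blob matches a sequential run:

```rust
use std::thread;

const PAGE: usize = 4096;

// Stand-in page hash (FNV-1a); the real implementation hashes each page
// with SHA-256.
fn hash_page(page: &[u8]) -> u64 {
    page.iter()
        .fold(0xcbf29ce484222325u64, |h, &b| (h ^ b as u64).wrapping_mul(0x100000001b3))
}

// Hash pages in parallel; join in page order for a deterministic digest list.
fn hash_all_pages(data: &[u8]) -> Vec<u64> {
    thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(PAGE)
            .map(|page| s.spawn(move || hash_page(page)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```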
### 5. Bump allocator for ephemeral data
Parser produces many small allocations (strings, reloc lists, atom descriptors). A per-input arena avoids fragmentation and makes bulk drop free. Implement as `src/arena.rs` — a std-only `Vec<Box<[u8]>>` chunker.
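A sketch of the chunker's shape. The `Handle`/`get` accessors and the 64 KiB chunk size are illustrative choices, not the real API: bytes are bumped into the current chunk, oversized requests get their own chunk, and everything is freed at once when the arena drops:

```rust
const CHUNK: usize = 64 * 1024; // illustrative chunk size

// Stable reference into the arena; avoids borrow-lifetime tangles on alloc.
#[derive(Clone, Copy)]
struct Handle { chunk: usize, off: usize, len: usize }

struct Arena {
    chunks: Vec<Box<[u8]>>,
    used: usize, // bytes used in the last chunk
}

impl Arena {
    fn new() -> Self {
        Arena { chunks: Vec::new(), used: 0 }
    }

    // Bump-copy `bytes` into the current chunk, opening a new chunk on overflow.
    fn alloc(&mut self, bytes: &[u8]) -> Handle {
        if self.chunks.is_empty() || self.used + bytes.len() > CHUNK {
            self.chunks.push(vec![0u8; CHUNK.max(bytes.len())].into_boxed_slice());
            self.used = 0;
        }
        let chunk = self.chunks.len() - 1;
        let off = self.used;
        self.chunks[chunk][off..off + bytes.len()].copy_from_slice(bytes);
        self.used += bytes.len();
        Handle { chunk, off, len: bytes.len() }
    }

    fn get(&self, h: Handle) -> &[u8] {
        &self.chunks[h.chunk][h.off..h.off + h.len]
    }
}
```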
### 6. mmap for large inputs
`std::fs::File` + `memmap2`? No — memmap2 is an external crate. Use `libc::mmap` via an unsafe `src/mmap.rs` wrapper. Input files are always read-only; mmap saves a read syscall and lets us share parse state across threads cheaply. Fall back to `fs::read` for GNU-thin archive members whose external path doesn't mmap cleanly (rare).
### 7. Symbol-table hash map
Profiling may show std's `HashMap` is fine at our scale. If not: replace with an open-addressing table keyed by `Istr` (handle equality), linear probing, power-of-2 capacity. ~100 LoC.
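A sketch of that fallback table. The `Istr` newtype and multiplicative hash constant are illustrative, and growth/rehashing is omitted; hashing the handle directly works because interning makes handle equality equivalent to string equality:

```rust
// Interner handle stand-in; the real Istr lives with the interner.
#[derive(Clone, Copy, PartialEq, Eq)]
struct Istr(u32);

struct SymTable<V> {
    slots: Vec<Option<(Istr, V)>>, // power-of-2 capacity, linear probing
}

impl<V: Copy> SymTable<V> {
    fn with_capacity_pow2(cap: usize) -> Self {
        assert!(cap.is_power_of_two());
        SymTable { slots: vec![None; cap] }
    }

    // Cheap multiplicative hash of the handle; the mask works because
    // capacity is a power of two.
    fn slot(&self, key: Istr) -> usize {
        (key.0 as usize).wrapping_mul(0x9E3779B1) & (self.slots.len() - 1)
    }

    fn insert(&mut self, key: Istr, val: V) {
        let mut i = self.slot(key);
        loop {
            match self.slots[i] {
                Some((k, _)) if k == key => { self.slots[i] = Some((key, val)); return; }
                None => { self.slots[i] = Some((key, val)); return; }
                _ => i = (i + 1) & (self.slots.len() - 1), // linear probe
            }
        }
    }

    fn get(&self, key: Istr) -> Option<V> {
        let mut i = self.slot(key);
        loop {
            match self.slots[i] {
                Some((k, v)) if k == key => return Some(v),
                None => return None,
                _ => i = (i + 1) & (self.slots.len() - 1),
            }
        }
    }
}
```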
### 8. String interner
Single global `StringInterner` shared across inputs. Interning cost: one hash lookup per name. Optimize by batching per-input: each input parses its strings into a local table, then merges into the global interner in one pass.
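The batching can be sketched like this. The `merge` return value is a hypothetical local-to-global remap table that symbol and reloc records referencing local ids would be rewritten through; the real `StringInterner` API may differ:

```rust
use std::collections::HashMap;

struct StringInterner {
    map: HashMap<String, u32>,
    strings: Vec<String>, // id -> string, for reverse lookup
}

impl StringInterner {
    fn new() -> Self {
        StringInterner { map: HashMap::new(), strings: Vec::new() }
    }

    // One hash lookup per name; allocates only on first sight of a string.
    fn intern(&mut self, s: &str) -> u32 {
        if let Some(&id) = self.map.get(s) {
            return id;
        }
        let id = self.strings.len() as u32;
        self.strings.push(s.to_string());
        self.map.insert(s.to_string(), id);
        id
    }

    // Merge a per-input local table in one pass; index i of the result maps
    // local id i to its global id.
    fn merge(&mut self, local: &[String]) -> Vec<u32> {
        local.iter().map(|s| self.intern(s)).collect()
    }
}
```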
### 9. No-alloc hot paths
Reloc application and chain construction should not allocate per-reloc. Preallocated scratch buffers, reused across the relocation pass.
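The scratch-buffer pattern in miniature: `clear()` keeps the Vec's capacity, so after warm-up the pass performs no allocation. The field name and the summing "work" are stand-ins for the real fixup-chain construction:

```rust
// Reusable scratch state, created once and threaded through the reloc pass.
struct RelocScratch {
    fixups: Vec<u64>,
}

impl RelocScratch {
    fn new() -> Self {
        RelocScratch { fixups: Vec::with_capacity(1024) }
    }

    // Per-atom work reuses the buffer: clear() keeps capacity, so no
    // free/alloc churn between atoms.
    fn process_atom(&mut self, relocs: &[u64]) -> u64 {
        self.fixups.clear();
        self.fixups.extend_from_slice(relocs);
        self.fixups.iter().sum() // stand-in for building a fixup chain
    }
}
```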
### 10. Benchmarks

`afs-ld/bench/` (or a `#[bench]` behind `cargo +nightly bench`) with:

- `bench_hello_world`: small, measures startup overhead.
- `bench_runtime_link`: mid-sized, measures symbol-table and reloc-apply cost.
- `bench_fortsh_link`: large, measures end-to-end throughput.
Budget targets:
- hello-world: ≤ 20 ms.
- runtime link: ≤ 150 ms.
- fortsh link: ≤ 2× Apple `ld`'s wall time on the same machine.
### 11. Determinism preserved
Parallelism must not reorder output. Each worker produces a deterministic result; join order is fixed; sorts are stable. A parallel and sequential run must produce byte-identical outputs.
## Testing Strategy
- Benchmarks land as regression gates: nightly CI records throughput; > 10% regression fails.
- Determinism: 100 parallel runs of the same input, assert byte-identical output every time.
- Sprint 27 parity must remain green — no correctness regression.
- Single-threaded fallback (`-j 1`) for debugging.
## Definition of Done

- fortsh link wall time within 2× of `ld`'s.
- All Sprint 27 scenarios still byte-identical.
- Determinism bulletproof across parallelism.
- No external dependencies added.