afs-ld Public

Watch 0 Fork 0 Star 0

markdown · 4617 bytes Raw Blame History

Sprint 28: Performance & Parallelism

Prerequisites

Sprint 27 — correctness gate in place; can freely refactor for speed.

Goals

Make afs-ld fast enough to feel like a production tool. Sprint 28 establishes the profiling surface, parallelizes the obvious hot paths, and enforces hello/runtime-link budgets in CI. The fortsh 2× Apple ld gate remains the production target, but Sprint 29 owns the fortsh fixture and final comparison. Mold demonstrates linkers can be very fast; we don't need mold's speed, but we need to not be painful.

Deliverables

1. Baseline profile

Profile representative hello-world and runtime-archive links in Sprint 28. Sprint 29 extends the same profile surface to the fortsh link once the fixture exists. Categorize wall time:

Input parsing (Mach-O headers, sections, symbols, relocations).
Symbol resolution (hash-map probes, archive lookups).
Atomization.
Layout.
Reloc application.
Synth sections (__unwind_info is often a hotspot).
Writing output.
Code signature hashing.

Identify the biggest bucket; optimize there first.

2. Parallel input parsing

Parse each .o in a separate worker thread; results collected into the symbol table after all parsing completes. Archive member parsing also parallel. Uses std's thread::scope — no external crates. Parallelism bounded by std::thread::available_parallelism().

3. Parallel reloc application

Each atom's relocs are independent. Process per-atom in parallel; the output buffer is preallocated and each atom writes to a disjoint slice.

4. Parallel SHA-256 for code signing

One thread per 4 KiB page. SHA-256 is inherently sequential within a page but trivially parallel across pages. Drop-in speedup for large binaries.

5. Bump allocator for ephemeral data

Deferred. The current Sprint 28 profile work did not prove allocation churn is the next limiting bucket after the parallel parsing/relocation/signature and string-table clone fixes. If Sprint 29's fortsh profile shows parser allocation pressure, implement src/arena.rs as a std-only Vec<Box<[u8]>> chunker.

6. mmap for large inputs

Deferred. Object/archive loading still uses fs::read; this keeps the Sprint 28 closeout safe and std-only. If fortsh-sized inputs show file-read overhead as a real bucket in Sprint 29, add an unsafe src/mmap.rs wrapper and keep a fs::read fallback for archive members whose external path cannot be mapped.

7. Symbol-table hash map

Profile shows std HashMap is fine for our scale. If not: replace with an open-addressing table keyed by Istr (handle-equality), linear probing, power-of-2 capacity. ~100 LoC.

8. String interner

Deferred. Sprint 28 made the global string table thread-shareable and removed the cloned string-table offset map during output writing. Per-input local interners remain a candidate if Sprint 29 identifies symbol seeding as a fortsh-scale bottleneck.

9. No-alloc hot paths

Reloc application and chain construction should not allocate per-reloc. Preallocated scratch buffers, reused across the relocation pass.

10. Benchmarks

Sprint 28 uses CI-enforced integration benchmarks in tests/perf_baseline.rs:

bench_hello_world_profile_reports_baseline_timings: small, measures startup overhead.
bench_runtime_link_profile_reports_baseline_timings: mid, measures symbol-table, archive parsing, and reloc-apply.
bench_fortsh_link: deferred to Sprint 29 with the real fortsh fixture.

Budget targets:

hello-world: ≤ 20 ms.
runtime link: ≤ 150 ms.
fortsh link: ≤ 2× Apple ld's wall time on the same machine, enforced in Sprint 29.

11. Determinism preserved

Parallelism must not reorder output. Each worker produces a deterministic result; join order is fixed; sorts are stable. A parallel and sequential run must produce byte-identical outputs.

Testing Strategy

Benchmark gate: CI runs tests/perf_baseline.rs with hello/runtime budgets on every push and PR.
Nightly throughput recording and a relative >10% regression gate are deferred until the fortsh fixture lands in Sprint 29.
Determinism: 100 parallel runs of the same input, assert byte-identical output every time.
Sprint 27 parity must remain green — no correctness regression.
Single-threaded fallback (-j 1) for debugging.

Definition of Done

hello/runtime performance budgets are enforced in CI.
fortsh 2× comparison is explicitly handed to Sprint 29 with its fixture.
All Sprint 27 scenarios still byte-identical.
Determinism bulletproof across parallelism.
No external dependencies added.

View source

  
        1
        # Sprint 28: Performance & Parallelism
      
        2
        
        3
        ## Prerequisites
      
        4
        Sprint 27 — correctness gate in place; can freely refactor for speed.
      
        5
        
        6
        ## Goals
      
        7
        Make afs-ld fast enough to feel like a production tool. Sprint 28 establishes
      
        8
        the profiling surface, parallelizes the obvious hot paths, and enforces
      
        9
        hello/runtime-link budgets in CI. The fortsh 2× Apple `ld` gate remains the
      
        10
        production target, but Sprint 29 owns the fortsh fixture and final comparison.
      
        11
        Mold demonstrates linkers can be very fast; we don't need mold's speed, but we
      
        12
        need to not be painful.
      
        13
        
        14
        ## Deliverables
      
        15
        
        16
        ### 1. Baseline profile
      
        17
        
        18
        Profile representative hello-world and runtime-archive links in Sprint 28.
      
        19
        Sprint 29 extends the same profile surface to the fortsh link once the fixture
      
        20
        exists. Categorize wall time:
      
        21
        
        22
        - Input parsing (Mach-O headers, sections, symbols, relocations).
      
        23
        - Symbol resolution (hash-map probes, archive lookups).
      
        24
        - Atomization.
      
        25
        - Layout.
      
        26
        - Reloc application.
      
        27
        - Synth sections (`__unwind_info` is often a hotspot).
      
        28
        - Writing output.
      
        29
        - Code signature hashing.
      
        30
        
        31
        Identify the biggest bucket; optimize there first.
      
        32
        
        33
        ### 2. Parallel input parsing
      
        34
        
        35
        Parse each `.o` in a separate worker thread; results collected into the symbol table after all parsing completes. Archive member parsing also parallel. Uses std's `thread::scope` — no external crates. Parallelism bounded by `std::thread::available_parallelism()`.
      
        36
        
        37
        ### 3. Parallel reloc application
      
        38
        
        39
        Each atom's relocs are independent. Process per-atom in parallel; the output buffer is preallocated and each atom writes to a disjoint slice.
      
        40
        
        41
        ### 4. Parallel SHA-256 for code signing
      
        42
        
        43
        One thread per 4 KiB page. SHA-256 is inherently sequential within a page but trivially parallel across pages. Drop-in speedup for large binaries.
      
        44
        
        45
        ### 5. Bump allocator for ephemeral data
      
        46
        
        47
        Deferred. The current Sprint 28 profile work did not prove allocation churn is
      
        48
        the next limiting bucket after the parallel parsing/relocation/signature and
      
        49
        string-table clone fixes. If Sprint 29's fortsh profile shows parser allocation
      
        50
        pressure, implement `src/arena.rs` as a std-only `Vec<Box<[u8]>>` chunker.
      
        51
        
        52
        ### 6. mmap for large inputs
      
        53
        
        54
        Deferred. Object/archive loading still uses `fs::read`; this keeps the Sprint 28
      
        55
        closeout safe and std-only. If fortsh-sized inputs show file-read overhead as a
      
        56
        real bucket in Sprint 29, add an unsafe `src/mmap.rs` wrapper and keep a
      
        57
        `fs::read` fallback for archive members whose external path cannot be mapped.
      
        58
        
        59
        ### 7. Symbol-table hash map
      
        60
        
        61
        Profile shows std `HashMap` is fine for our scale. If not: replace with an open-addressing table keyed by `Istr` (handle-equality), linear probing, power-of-2 capacity. ~100 LoC.
      
        62
        
        63
        ### 8. String interner
      
        64
        
        65
        Deferred. Sprint 28 made the global string table thread-shareable and removed
      
        66
        the cloned string-table offset map during output writing. Per-input local
      
        67
        interners remain a candidate if Sprint 29 identifies symbol seeding as a
      
        68
        fortsh-scale bottleneck.
      
        69
        
        70
        ### 9. No-alloc hot paths
      
        71
        
        72
        Reloc application and chain construction should not allocate per-reloc. Preallocated scratch buffers, reused across the relocation pass.
      
        73
        
        74
        ### 10. Benchmarks
      
        75
        
        76
        Sprint 28 uses CI-enforced integration benchmarks in `tests/perf_baseline.rs`:
      
        77
        
        78
        - `bench_hello_world_profile_reports_baseline_timings`: small, measures startup overhead.
      
        79
        - `bench_runtime_link_profile_reports_baseline_timings`: mid, measures symbol-table, archive parsing, and reloc-apply.
      
        80
        - `bench_fortsh_link`: deferred to Sprint 29 with the real fortsh fixture.
      
        81
        
        82
        Budget targets:
      
        83
        - hello-world: ≤ 20 ms.
      
        84
        - runtime link: ≤ 150 ms.
      
        85
        - fortsh link: ≤ 2× Apple `ld`'s wall time on the same machine, enforced in Sprint 29.
      
        86
        
        87
        ### 11. Determinism preserved
      
        88
        
        89
        Parallelism must not reorder output. Each worker produces a deterministic result; join order is fixed; sorts are stable. A parallel and sequential run must produce byte-identical outputs.
      
        90
        
        91
        ## Testing Strategy
      
        92
        
        93
        - Benchmark gate: CI runs `tests/perf_baseline.rs` with hello/runtime budgets on every push and PR.
      
        94
        - Nightly throughput recording and a relative >10% regression gate are deferred until the fortsh fixture lands in Sprint 29.
      
        95
        - Determinism: 100 parallel runs of the same input, assert byte-identical output every time.
      
        96
        - Sprint 27 parity must remain green — no correctness regression.
      
        97
        - Single-threaded fallback (`-j 1`) for debugging.
      
        98
        
        99
        ## Definition of Done
      
        100
        
        101
        - hello/runtime performance budgets are enforced in CI.
      
        102
        - fortsh 2× comparison is explicitly handed to Sprint 29 with its fixture.
      
        103
        - All Sprint 27 scenarios still byte-identical.
      
        104
        - Determinism bulletproof across parallelism.
      
        105
        - No external dependencies added.

1	# Sprint 28: Performance & Parallelism
2
3	## Prerequisites
4	Sprint 27 — correctness gate in place; can freely refactor for speed.
5
6	## Goals
7	Make afs-ld fast enough to feel like a production tool. Sprint 28 establishes
8	the profiling surface, parallelizes the obvious hot paths, and enforces
9	hello/runtime-link budgets in CI. The fortsh 2× Apple `ld` gate remains the
10	production target, but Sprint 29 owns the fortsh fixture and final comparison.
11	Mold demonstrates linkers can be very fast; we don't need mold's speed, but we
12	need to not be painful.
13
14	## Deliverables
15
16	### 1. Baseline profile
17
18	Profile representative hello-world and runtime-archive links in Sprint 28.
19	Sprint 29 extends the same profile surface to the fortsh link once the fixture
20	exists. Categorize wall time:
21
22	- Input parsing (Mach-O headers, sections, symbols, relocations).
23	- Symbol resolution (hash-map probes, archive lookups).
24	- Atomization.
25	- Layout.
26	- Reloc application.
27	- Synth sections (`__unwind_info` is often a hotspot).
28	- Writing output.
29	- Code signature hashing.
30
31	Identify the biggest bucket; optimize there first.
32
33	### 2. Parallel input parsing
34
35	Parse each `.o` in a separate worker thread; results collected into the symbol table after all parsing completes. Archive member parsing also parallel. Uses std's `thread::scope` — no external crates. Parallelism bounded by `std::thread::available_parallelism()`.
36
37	### 3. Parallel reloc application
38
39	Each atom's relocs are independent. Process per-atom in parallel; the output buffer is preallocated and each atom writes to a disjoint slice.
40
41	### 4. Parallel SHA-256 for code signing
42
43	One thread per 4 KiB page. SHA-256 is inherently sequential within a page but trivially parallel across pages. Drop-in speedup for large binaries.
44
45	### 5. Bump allocator for ephemeral data
46
47	Deferred. The current Sprint 28 profile work did not prove allocation churn is
48	the next limiting bucket after the parallel parsing/relocation/signature and
49	string-table clone fixes. If Sprint 29's fortsh profile shows parser allocation
50	pressure, implement `src/arena.rs` as a std-only `Vec<Box<[u8]>>` chunker.
51
52	### 6. mmap for large inputs
53
54	Deferred. Object/archive loading still uses `fs::read`; this keeps the Sprint 28
55	closeout safe and std-only. If fortsh-sized inputs show file-read overhead as a
56	real bucket in Sprint 29, add an unsafe `src/mmap.rs` wrapper and keep a
57	`fs::read` fallback for archive members whose external path cannot be mapped.
58
59	### 7. Symbol-table hash map
60
61	Profile shows std `HashMap` is fine for our scale. If not: replace with an open-addressing table keyed by `Istr` (handle-equality), linear probing, power-of-2 capacity. ~100 LoC.
62
63	### 8. String interner
64
65	Deferred. Sprint 28 made the global string table thread-shareable and removed
66	the cloned string-table offset map during output writing. Per-input local
67	interners remain a candidate if Sprint 29 identifies symbol seeding as a
68	fortsh-scale bottleneck.
69
70	### 9. No-alloc hot paths
71
72	Reloc application and chain construction should not allocate per-reloc. Preallocated scratch buffers, reused across the relocation pass.
73
74	### 10. Benchmarks
75
76	Sprint 28 uses CI-enforced integration benchmarks in `tests/perf_baseline.rs`:
77
78	- `bench_hello_world_profile_reports_baseline_timings`: small, measures startup overhead.
79	- `bench_runtime_link_profile_reports_baseline_timings`: mid, measures symbol-table, archive parsing, and reloc-apply.
80	- `bench_fortsh_link`: deferred to Sprint 29 with the real fortsh fixture.
81
82	Budget targets:
83	- hello-world: ≤ 20 ms.
84	- runtime link: ≤ 150 ms.
85	- fortsh link: ≤ 2× Apple `ld`'s wall time on the same machine, enforced in Sprint 29.
86
87	### 11. Determinism preserved
88
89	Parallelism must not reorder output. Each worker produces a deterministic result; join order is fixed; sorts are stable. A parallel and sequential run must produce byte-identical outputs.
90
91	## Testing Strategy
92
93	- Benchmark gate: CI runs `tests/perf_baseline.rs` with hello/runtime budgets on every push and PR.
94	- Nightly throughput recording and a relative >10% regression gate are deferred until the fortsh fixture lands in Sprint 29.
95	- Determinism: 100 parallel runs of the same input, assert byte-identical output every time.
96	- Sprint 27 parity must remain green — no correctness regression.
97	- Single-threaded fallback (`-j 1`) for debugging.
98
99	## Definition of Done
100
101	- hello/runtime performance budgets are enforced in CI.
102	- fortsh 2× comparison is explicitly handed to Sprint 29 with its fixture.
103	- All Sprint 27 scenarios still byte-identical.
104	- Determinism bulletproof across parallelism.
105	- No external dependencies added.