
Sprint 26: Thunks for Out-of-Range Branches

Prerequisites

Sprint 11 — BRANCH26 reloc application; Sprint 10 — layout pass.

Goals

When a BRANCH26 target lies outside the ±128 MiB branch range of its caller, insert a branch island that can reach any 4-byte-aligned address within ±4 GiB via an ADRP + ADD + BR sequence. Required for very large executables; fortsh is not that large today, but a statically linked Fortran program with full-intrinsic binding could be.

Deliverables

1. Detection pass

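After layout (Sprint 10) assigns addresses, walk every BRANCH26 reloc. Compute distance = (target - P) >> 2 using an arithmetic (sign-preserving) shift. The signed 26-bit immediate holds -0x0200_0000 through 0x01FF_FFFF words, where 2^25 = 33,554,432 words × 4 bytes = 128 MiB; that is 128 MiB backward and 128 MiB - 4 bytes forward. If distance falls outside that range, flag the reloc as needing a thunk.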
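A minimal sketch of the range check, written in Python for illustration only. The names branch26_in_range and needs_thunk are hypothetical and not part of the linker. It uses the asymmetric signed range of the 26-bit word offset encoded by B/BL.

```python
# Signed 26-bit word offset encoded by B/BL (BRANCH26).
BRANCH26_MIN_WORDS = -(1 << 25)      # 128 MiB backward
BRANCH26_MAX_WORDS = (1 << 25) - 1   # 128 MiB - 4 bytes forward


def branch26_in_range(target: int, p: int) -> bool:
    """Return True if a B/BL located at address p can reach target directly."""
    delta = target - p
    if delta % 4 != 0:
        raise ValueError("branch displacement must be a multiple of 4 bytes")
    words = delta >> 2  # Python's >> on ints is an arithmetic shift
    return BRANCH26_MIN_WORDS <= words <= BRANCH26_MAX_WORDS


def needs_thunk(target: int, p: int) -> bool:
    """Hypothetical helper: flag a reloc whose target is out of direct reach."""
    return not branch26_in_range(target, p)
```

The forward limit is one instruction short of 128 MiB, so a symmetric comparison against 0x0200_0000 would misclassify the two boundary cases.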

2. Thunk synthesis

One thunk atom per (output section, distant target), the same key used for thunk sharing in deliverable 4. A thunk is 12 bytes:

ADRP x16, <target>@PAGE
ADD  x16, x16, <target>@PAGEOFF
BR   x16

Or, if the target is a Defined symbol whose value is known at link time:

ADRP x16, #<computed_page>
ADD  x16, x16, #<pageoff>
BR   x16

ADRP encodes a signed 21-bit page delta in units of 4 KiB pages, so the ADRP+ADD form reaches any address within ±4 GiB of the thunk, which is plenty for a single image.
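As an illustration of the link-time-known form, the following Python sketch encodes the three instructions directly. The function name encode_thunk and its signature are assumptions made for this example; the bit layouts are the standard A64 encodings for ADRP, ADD (immediate, 64-bit) and BR.

```python
import struct


def _check_signed(value: int, bits: int, what: str) -> None:
    """Raise if value does not fit in a signed field of the given width."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not lo <= value <= hi:
        raise ValueError(f"{what} {value} does not fit in {bits} signed bits")


def encode_thunk(thunk_addr: int, target: int, reg: int = 16) -> bytes:
    """Encode ADRP/ADD/BR through x<reg> as 12 little-endian bytes."""
    # ADRP works on 4 KiB pages relative to the page of the ADRP itself.
    page_delta = (target >> 12) - (thunk_addr >> 12)
    _check_signed(page_delta, 21, "ADRP page delta")
    imm = page_delta & 0x1FFFFF            # two's-complement 21-bit field
    immlo, immhi = imm & 0x3, imm >> 2     # split into immlo[1:0], immhi[18:0]
    adrp = 0x90000000 | (immlo << 29) | (immhi << 5) | reg
    # ADD Xd, Xn, #imm12 with the low 12 bits of the target (page offset).
    add = 0x91000000 | ((target & 0xFFF) << 10) | (reg << 5) | reg
    # BR Xn
    br = 0xD61F0000 | (reg << 5)
    return struct.pack("<3I", adrp, add, br)
```

For example, encode_thunk(0x1000_0000, 0x1234_5678) produces a 12-byte sequence whose last word is 0xD61F0200 (BR x16). A target more than ±4 GiB away raises ValueError rather than silently truncating the page delta.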

3. Placement

Thunks land in __TEXT,__thunks, a new synthetic section placed between __text and __stubs. Each thunk must sit within BRANCH26 range (±128 MiB) of every call site that uses it, so very large binaries may need multiple thunk islands.

Algorithm:

  1. Run layout once.
  2. Detect overflow sites.
  3. Insert thunks near the caller cluster.
  4. Re-run layout (sizes changed).
  5. Re-check overflow — repeat until stable.

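Termination: thunks are only ever added between passes, never removed, and the set of possible thunks is finite, so the loop must stop. Each new thunk shifts later addresses by only 12 bytes, so a pass can push out of range only those branches that were already near the limit; the fixed point should therefore be reached after a few passes.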
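The loop above can be sketched on a toy model. Everything here is hypothetical scaffolding rather than the linker's real data model: Atom, layout and insert_thunks are invented names, branch sites are modelled at the start of the caller atom, each thunk is placed directly after its caller, and thunk sharing is left out to keep the sketch short.

```python
from dataclasses import dataclass, field

THUNK_SIZE = 12
MAX_FWD = (1 << 27) - 4    # farthest forward BRANCH26 byte offset
MAX_BACK = -(1 << 27)      # farthest backward BRANCH26 byte offset


@dataclass
class Atom:
    name: str
    size: int
    calls: list = field(default_factory=list)  # names of BRANCH26 targets


def layout(atoms):
    """Assign sequential addresses and return {name: address}."""
    addr, out = 0, {}
    for atom in atoms:
        out[atom.name] = addr
        addr += atom.size
    return out


def in_range(src, dst):
    return MAX_BACK <= dst - src <= MAX_FWD


def insert_thunks(atoms, max_passes=8):
    """Repeat layout -> detect -> insert until no branch overflows.

    Returns (atoms, passes). Mutates the calls lists of the input atoms.
    """
    atoms = list(atoms)
    for passes in range(1, max_passes + 1):
        addrs = layout(atoms)
        result, changed = [], False
        for atom in atoms:
            result.append(atom)
            for i, callee in enumerate(atom.calls):
                if in_range(addrs[atom.name], addrs[callee]):
                    continue
                thunk = Atom(f"thunk.{atom.name}.{callee}", THUNK_SIZE)
                atom.calls[i] = thunk.name   # retarget the branch at the thunk
                result.append(thunk)         # island right after the caller
                changed = True
        if not changed:
            return atoms, passes
        atoms = result
    raise RuntimeError("thunk insertion did not converge")
```

With a 200 MiB padding atom between a caller and its target, this model inserts one thunk on the first pass and confirms stability on the second. Sizes are plain integers, so no large allocation is needed to exercise it.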

4. Thunk sharing

Multiple callers to the same out-of-range target share one thunk, provided that thunk is within BRANCH26 range of each of them. Thunks are keyed by (output_section, target_atom_id).

5. Reloc rewrite

Each thunked BRANCH26 reloc gets rewritten to point at the thunk atom instead of the original target. The thunk's BR x16 then reaches the real target through the address materialized by ADRP+ADD.
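Sharing and the reloc rewrite fit together naturally as a get-or-create table. The sketch below is one possible shape under stated assumptions: Reloc, ThunkTable and the generated thunk names are hypothetical, and atoms are identified by plain strings.

```python
from dataclasses import dataclass


@dataclass
class Reloc:
    section: str   # output section containing the branch
    offset: int    # offset of the branch within that section
    target: str    # atom id the branch currently points at


class ThunkTable:
    """Get-or-create table keyed by (output_section, target_atom_id)."""

    def __init__(self):
        self._by_key = {}

    def thunk_for(self, section: str, target: str) -> str:
        key = (section, target)
        if key not in self._by_key:
            # First caller for this key: synthesize a new thunk atom id.
            self._by_key[key] = f"thunk.{len(self._by_key)}.{target}"
        return self._by_key[key]

    def rewrite(self, reloc: Reloc) -> str:
        """Point reloc at its (possibly shared) thunk; return the old target."""
        original = reloc.target
        reloc.target = self.thunk_for(reloc.section, original)
        return original

    def __len__(self) -> int:
        return len(self._by_key)
```

Two relocs in the same output section that branch to the same atom end up with the same thunk id, and the returned original target is what the thunk's ADRP+ADD pair would be resolved against.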

6. Interaction with -dead_strip and ICF

  • Thunks are dead-stripped if no live caller remains.
  • Thunks are never ICF candidates (they have unique target addresses).
  • A dead-stripped target never needs a thunk; this falls out naturally because thunks are generated after dead-strip.

7. -thunks=<none|safe|all>

  • -thunks=none: overflow is a hard error (useful for small programs, where an overflow indicates a bug).
  • -thunks=safe (default): thunks inserted only when needed.
  • -thunks=all: thunks inserted for every BRANCH26 (for testing).

Sprint 19 CLI wires these; this sprint implements the behavior.
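The three modes reduce to a small per-branch decision. A hedged sketch follows; ThunkMode, BranchOverflow and wants_thunk are illustrative names, not the Sprint 19 CLI types.

```python
from enum import Enum


class ThunkMode(Enum):
    NONE = "none"
    SAFE = "safe"
    ALL = "all"


class BranchOverflow(Exception):
    """Raised under -thunks=none when a BRANCH26 cannot reach its target."""


def wants_thunk(mode: ThunkMode, in_range: bool, caller: str, target: str) -> bool:
    """Decide whether one BRANCH26 reloc should be routed through a thunk."""
    if mode is ThunkMode.ALL:
        return True              # testing mode: thunk every branch
    if in_range:
        return False             # direct branch works, leave it alone
    if mode is ThunkMode.NONE:
        raise BranchOverflow(
            f"BRANCH26 from {caller} to {target} is out of range "
            "and -thunks=none forbids inserting a thunk"
        )
    return True                  # SAFE: thunk only when needed
```

A mode string from the command line can be parsed with ThunkMode("safe"); an unknown value raises ValueError, which a CLI layer could turn into a usage error.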

8. Regression: small programs don't grow

Default is -thunks=safe. Programs that don't need thunks emit no __thunks section and are byte-identical to the pre-sprint output.

Testing Strategy

  • Synthetic: compile a source that produces >128 MiB of code (requires artificially padding .o files, or using a large constant array in __text). Verify that thunks are inserted.
  • Every thunk is within BRANCH26 range of each caller that uses it, and every thunk's target is within ADRP+ADD range of the thunk.
  • Runtime: the large binary's entry point actually runs and calls through thunks without crashing.
  • -thunks=none + overflow: produces a clear error citing the caller and target.
  • Small-program regression: fortsh output size unchanged vs pre-Sprint-26 (no thunks inserted).

Definition of Done

  • Thunks correctly inserted for out-of-range BRANCH26 on large fixtures.
  • Layout fixed-point converges within a small number of passes on the large fixtures.
  • Small programs unchanged.
  • -thunks CLI matrix all wired.