Sprint 26: Thunks for Out-of-Range Branches
Prerequisites
Sprint 11 — BRANCH26 reloc application; Sprint 10 — layout pass.
Goals
When a BRANCH26 target is more than ±128 MiB from the caller, insert a branch island that can reach any 4-byte-aligned address within ±4 GiB via an ADRP + ADD + BR sequence. Required for very large executables; fortsh is not that large today, but a statically linked Fortran program with full-intrinsic binding could be.
Deliverables
1. Detection pass
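After layout (Sprint 10) assigns addresses, walk every BRANCH26 reloc. Compute distance = (target - P) >> 2 using an arithmetic (sign-preserving) shift. The immediate is a signed 26-bit word count, so the encodable range is -0x0200_0000 <= distance <= 0x01FF_FFFF (2^25 = 33,554,432 words × 4 bytes = 128 MiB backward, one word less than that forward). Flag any reloc outside that range as needing a thunk.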
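As a minimal sketch of the range check, assuming addresses are plain integers, the following treats the BRANCH26 immediate as a signed 26-bit word count. The function and constant names are hypothetical and are not taken from the linker's code.

```python
# Signed 26-bit immediate, counted in 4-byte words.
BRANCH26_MIN_WORDS = -(1 << 25)      # -128 MiB in bytes
BRANCH26_MAX_WORDS = (1 << 25) - 1   # +128 MiB minus one instruction


def branch26_needs_thunk(target: int, p: int) -> bool:
    """Return True if a B/BL at address p cannot encode a branch to target."""
    delta = target - p
    if delta % 4 != 0:
        raise ValueError("BRANCH26 displacement must be a multiple of 4")
    distance = delta >> 2  # Python's >> on int is an arithmetic shift
    return not (BRANCH26_MIN_WORDS <= distance <= BRANCH26_MAX_WORDS)
```

Note the asymmetry: a backward branch of exactly 128 MiB still encodes, while a forward branch of exactly 128 MiB does not.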
2. Thunk synthesis
One thunk atom per (output section, distant target). A thunk is 12 bytes:
ADRP x16, <target>@PAGE
ADD x16, x16, <target>@PAGEOFF
BR x16
Or, if the target is a Defined symbol whose value is known at link time:
ADRP x16, #<computed_page>
ADD x16, x16, #<pageoff>
BR x16
The ADRP+ADD form reaches any address within ±4 GiB of the thunk, which is ample for this purpose.
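For the link-time-resolved form, the three instructions can be encoded directly. The sketch below assumes little-endian AArch64 and 4 KiB ADRP pages; encode_thunk is a hypothetical helper name, not part of the existing code.

```python
import struct

PAGE = 0x1000  # ADRP works in 4 KiB pages
X16 = 16       # scratch register used by the thunk


def encode_thunk(thunk_addr: int, target_addr: int) -> bytes:
    """Encode ADRP x16 / ADD x16, x16, #pageoff / BR x16 as 12 bytes."""
    page_delta = (target_addr // PAGE) - (thunk_addr // PAGE)
    if not -(1 << 20) <= page_delta < (1 << 20):
        raise ValueError("target is beyond ADRP's ±4 GiB reach")
    imm21 = page_delta & 0x1FFFFF        # two's-complement 21-bit page count
    immlo = imm21 & 0x3                  # low 2 bits go to bits 30:29
    immhi = imm21 >> 2                   # high 19 bits go to bits 23:5
    adrp = 0x90000000 | (immlo << 29) | (immhi << 5) | X16
    pageoff = target_addr % PAGE         # always fits the 12-bit immediate
    add = 0x91000000 | (pageoff << 10) | (X16 << 5) | X16
    br = 0xD61F0000 | (X16 << 5)
    return struct.pack("<3I", adrp, add, br)
```

With a page delta and page offset of zero this yields the words 0x90000010, 0x91000210, 0xD61F0200.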
3. Placement
Thunks land in __TEXT,__thunks, a new synthetic section placed between __text and __stubs. Each thunk must sit within ±128 MiB of every call site that uses it, so very large binaries may need multiple thunk islands.
Algorithm:
- Run layout once.
- Detect overflow sites.
- Insert thunks near the caller cluster.
- Re-run layout (sizes changed).
- Re-check overflow — repeat until stable.
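Termination: each inserted thunk shifts later addresses by 12 bytes. Provided thunks are only added and never removed between passes, their number is bounded by the number of BRANCH26 relocs, so the loop must reach a fixed point; it is expected to do so within a few passes.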
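The fixed-point loop can be illustrated with a toy model. This sketch makes several simplifying assumptions that are not in the plan itself: atoms are (name, size) pairs laid out back to back, every call site sits at the start of its caller atom, a new thunk is inserted directly after the caller atom, and only one thunk is added per pass before re-running layout. All names are hypothetical.

```python
THUNK_SIZE = 12
MAX_FORWARD = (1 << 27) - 4   # farthest forward byte offset a BRANCH26 encodes
MAX_BACKWARD = -(1 << 27)     # farthest backward byte offset


def in_range(src: int, dst: int) -> bool:
    return MAX_BACKWARD <= dst - src <= MAX_FORWARD


def assign_addresses(atoms, base=0):
    """Toy layout pass: place atoms back to back starting at base."""
    addrs, addr = {}, base
    for name, size in atoms:
        addrs[name] = addr
        addr += size
    return addrs


def insert_thunks(atoms, calls):
    """Return (atoms, routes, passes) once no call overflows."""
    atoms = list(atoms)
    thunks = {}   # target name -> names of thunk atoms created for it
    passes = 0
    while True:
        passes += 1
        addrs = assign_addresses(atoms)
        routes = {}
        for caller, target in calls:
            if in_range(addrs[caller], addrs[target]):
                routes[(caller, target)] = target
                continue
            reachable = [t for t in thunks.get(target, [])
                         if in_range(addrs[caller], addrs[t])]
            if reachable:                      # share an existing thunk
                routes[(caller, target)] = reachable[0]
                continue
            name = f"thunk{sum(len(v) for v in thunks.values())}"
            index = next(i for i, (n, _) in enumerate(atoms) if n == caller)
            atoms.insert(index + 1, (name, THUNK_SIZE))
            thunks.setdefault(target, []).append(name)
            break                              # sizes changed: re-run layout
        else:
            return atoms, routes, passes       # stable: nothing was inserted
```

Because thunks are only ever added, each pass either inserts one thunk or returns, which is the termination argument in miniature.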
4. Thunk sharing
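Multiple callers to the same out-of-range target share one thunk. Keyed by (output_section, target_atom_id). When a section needs more than one island, callers can share a thunk only if it is within BRANCH26 range of each of them.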
5. Reloc rewrite
Each thunked BRANCH26 reloc gets rewritten to point at the thunk atom instead of the original target. The thunk's BR x16 then reaches the real target through the address that ADRP+ADD loaded into x16.
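Sharing and the reloc rewrite fit naturally into one pass over the flagged relocs. The following is a sketch under assumed data shapes: the Reloc and Thunk classes and their fields are hypothetical stand-ins for whatever the linker actually uses, and atom ids are plain integers.

```python
from dataclasses import dataclass


@dataclass
class Reloc:
    section: str          # output section containing the call site
    offset: int           # offset of the B/BL within that section
    target: int           # atom id the branch currently points at
    needs_thunk: bool = False


@dataclass(frozen=True)
class Thunk:
    atom_id: int
    section: str
    real_target: int      # atom id the thunk's ADRP+ADD+BR jumps to


def rewrite_thunked_relocs(relocs, next_atom_id):
    """Retarget flagged relocs at shared thunks; return the thunks created."""
    thunks = {}           # (output_section, target_atom_id) -> Thunk
    for reloc in relocs:
        if not reloc.needs_thunk:
            continue
        key = (reloc.section, reloc.target)
        thunk = thunks.get(key)
        if thunk is None:
            thunk = Thunk(next_atom_id, reloc.section, reloc.target)
            next_atom_id += 1
            thunks[key] = thunk
        reloc.target = thunk.atom_id   # the branch now lands on the thunk
        reloc.needs_thunk = False
    return list(thunks.values())
```

Keeping the real target on the thunk rather than on the reloc means the original destination is still recoverable after the rewrite.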
6. Interaction with -dead_strip and ICF
- Thunks are dead-stripped if no live caller remains.
- Thunks are never ICF candidates (they have unique target addresses).
- A dead-stripped target invalidates its thunk(s); easy since we generate thunks after dead-strip.
7. -thunks=<none|safe|all>
- -thunks=none: overflow is a hard error (default for small programs to catch bugs).
- -thunks=safe (default on large programs): thunks inserted when needed.
- -thunks=all: thunks inserted for every BRANCH26 (for testing).
Sprint 19 CLI wires these; this sprint implements the behavior.
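The three modes reduce to a small decision per BRANCH26 reloc. This is an illustrative sketch only: ThunkMode, BranchOverflowError, route_branch, and the wording of the error message are all hypothetical names chosen for the example.

```python
from enum import Enum


class ThunkMode(Enum):
    NONE = "none"
    SAFE = "safe"
    ALL = "all"


class BranchOverflowError(Exception):
    """Raised when -thunks=none meets an out-of-range BRANCH26."""


def route_branch(mode: ThunkMode, in_range: bool, caller: str, target: str) -> str:
    """Return "direct" or "thunk" for one reloc, or raise on overflow."""
    if mode is ThunkMode.ALL:
        return "thunk"                 # testing mode: thunk every branch
    if in_range:
        return "direct"                # none/safe leave reachable branches alone
    if mode is ThunkMode.SAFE:
        return "thunk"
    raise BranchOverflowError(
        f"BRANCH26 out of range: {caller} -> {target}"
    )
```

Naming both the caller and the target in the error keeps the -thunks=none failure actionable.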
8. Regression: small programs don't grow
Default is -thunks=safe. Programs that don't need thunks emit no __thunks section and are byte-identical to the pre-sprint output.
Testing Strategy
- Synthetic: compile a source that produces >128 MiB of code (requires artificially padding .o files, or using a large constant array in __text). Verify thunks inserted.
- Every thunk target reachable from its caller cluster.
- Runtime: the large binary's entry point actually runs and calls through thunks without crashing.
- -thunks=none + overflow: produces a clear error citing the caller and target.
- Small-program regression: fortsh output size unchanged vs pre-Sprint-26 (no thunks inserted).
Definition of Done
- Thunks correctly inserted for out-of-range BRANCH26 on large fixtures.
- Layout fixed-point converges rapidly.
- Small programs unchanged.
- -thunks CLI matrix all wired.