# Sprint 26: Thunks for Out-of-Range Branches

## Prerequisites

Sprint 11 — BRANCH26 reloc application; Sprint 10 — layout pass.

## Goals

When a `BRANCH26` target is more than ±128 MiB from the caller, insert a branch island (thunk) that can reach any 4-byte-aligned address within ±4 GiB via an ADRP + ADD + BR sequence. This is required for very large executables; fortsh is not that large today, but a statically linked Fortran program with full-intrinsic binding could be.

## Deliverables

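### 1. Detection pass

After layout (Sprint 10) assigns addresses, walk every BRANCH26 reloc. Compute `distance = (target - P) >> 2`, where `P` is the address of the branch instruction. BRANCH26 encodes a signed 26-bit word offset, so the reachable range is `-0x0200_0000 <= distance <= 0x01FF_FFFF` (2^25 = 33,554,432 words × 4 bytes = 128 MiB backward, and one instruction less than that forward). Flag any reloc outside that range as needing a thunk.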
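The range check can be sketched as below. This is a minimal Python illustration rather than the linker's actual code: `needs_thunk`, `place`, and `target` are hypothetical names, and the sketch assumes both addresses are 4-byte aligned. It uses the asymmetric limits of a signed 26-bit field, so a forward distance of exactly 2^25 words counts as out of range.

```python
IMM26_MIN = -(1 << 25)      # most negative word offset a B/BL can encode
IMM26_MAX = (1 << 25) - 1   # most positive word offset a B/BL can encode


def needs_thunk(place: int, target: int) -> bool:
    """Return True if a BRANCH26 at `place` cannot reach `target` directly.

    `place` is P, the address of the branch instruction (hypothetical name).
    """
    delta = target - place
    if delta % 4 != 0:
        raise ValueError("branch and target must be 4-byte aligned")
    distance = delta >> 2   # exact, because delta is a multiple of 4
    return not (IMM26_MIN <= distance <= IMM26_MAX)
```

A target exactly 128 MiB ahead of the branch is flagged, while one exactly 128 MiB behind it is still reachable.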

### 2. Thunk synthesis

One thunk atom per (output section, distant target). A thunk is 12 bytes (three 4-byte instructions):

```
ADRP x16, <target>@PAGE
ADD  x16, x16, <target>@PAGEOFF
BR   x16
```

Or, if the target is a `Defined` symbol with a known value at link time:

```
ADRP x16, #<computed_page>
ADD  x16, x16, #<pageoff>
BR   x16
```

The ADRP+ADD form reaches any address within ±4 GiB of the thunk, because ADRP carries a signed 21-bit count of 4 KiB pages. That is plenty for this purpose.
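For the link-time-known case, the three instruction words could be computed roughly as follows. This is a hedged sketch: `encode_thunk` and its parameters are hypothetical names, and it assumes the standard A64 encodings for `ADRP`, `ADD` (64-bit immediate), and `BR`, with x16 as the scratch register shown above.

```python
import struct

X16 = 16  # scratch register used by the thunk sequence


def encode_thunk(thunk_addr: int, target: int) -> bytes:
    """Encode ADRP x16 / ADD x16 / BR x16 for a thunk placed at thunk_addr."""
    page_delta = (target >> 12) - (thunk_addr >> 12)   # signed count of 4 KiB pages
    if not -(1 << 20) <= page_delta < (1 << 20):
        raise ValueError("target is outside ADRP's +/-4 GiB reach")
    imm21 = page_delta & 0x1FFFFF                      # 21-bit two's complement
    immlo, immhi = imm21 & 0x3, imm21 >> 2
    adrp = 0x90000000 | (immlo << 29) | (immhi << 5) | X16
    add = 0x91000000 | ((target & 0xFFF) << 10) | (X16 << 5) | X16
    br = 0xD61F0000 | (X16 << 5)
    return struct.pack("<3I", adrp, add, br)           # little-endian words
```

The result is always 12 bytes, matching the thunk size stated above.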

### 3. Placement

Thunks land in `__TEXT,__thunks`, a new synthetic section placed between `__text` and `__stubs`. Each thunk must sit within BRANCH26 range (±128 MiB) of every call site that uses it, so very large binaries may need multiple thunk islands.

Algorithm:

1. Run layout once.
2. Detect overflow sites.
3. Insert thunks near the caller cluster.
4. Re-run layout (sizes changed).
5. Re-check overflow; repeat until stable.

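Termination: provided the loop only ever adds thunks and never removes them, the thunk count is bounded by the number of BRANCH26 relocs, so the loop must reach a fixed point. Each added thunk shifts later addresses by only 12 bytes, so a pass can push only branches that were already near the limit out of range, and the loop should converge within a few iterations.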
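The loop above can be illustrated with a toy model. This is a sketch under simplifying assumptions, not the real layout pass: atoms are `(name, size)` pairs laid out back to back, each branch is treated as sitting at the start of its caller atom, `reach` is a symmetric byte limit, each thunk is placed directly after its caller, and the thunk's own ADRP+ADD hop is assumed to always reach. All names are hypothetical.

```python
THUNK_SIZE = 12  # ADRP + ADD + BR


def layout(atoms):
    """Assign sequential addresses to (name, size) atoms; return {name: address}."""
    addrs, cursor = {}, 0
    for name, size in atoms:
        addrs[name] = cursor
        cursor += size
    return addrs


def insert_thunks(atoms, calls, reach, max_rounds=16):
    """Toy fixed-point loop: add a thunk after each caller whose branch overflows."""
    atoms, calls = list(atoms), list(calls)
    for rounds in range(1, max_rounds + 1):
        addrs = layout(atoms)
        overflow = [i for i, (src, dst) in enumerate(calls)
                    if abs(addrs[dst] - addrs[src]) > reach]
        if not overflow:
            return atoms, calls, rounds        # stable: every branch is in range
        for i in overflow:
            src, dst = calls[i]
            if dst.startswith("thunk:"):
                raise RuntimeError(f"thunk for {src} is itself out of range")
            thunk = f"thunk:{src}->{dst}"
            if all(name != thunk for name, _ in atoms):
                at = next(k for k, (name, _) in enumerate(atoms) if name == src)
                atoms.insert(at + 1, (thunk, THUNK_SIZE))
            calls[i] = (src, thunk)            # caller now branches to the thunk
    raise RuntimeError("layout did not converge")
```

With one caller separated from its target by a large pad, the loop inserts a single thunk in the first round and confirms stability in the second.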

### 4. Thunk sharing

Multiple callers to the same out-of-range target share one thunk. Keyed by `(output_section, target_atom_id)`.
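A get-or-create table is one simple way to realize this sharing. The sketch below is illustrative only: `ThunkTable` and `thunk_for` are hypothetical names, and thunk ids are plain integers standing in for real thunk atoms.

```python
from dataclasses import dataclass, field


@dataclass
class ThunkTable:
    """Map (output_section, target_atom_id) to a shared thunk id (hypothetical type)."""
    thunks: dict = field(default_factory=dict)

    def thunk_for(self, output_section: str, target_atom_id: int) -> int:
        key = (output_section, target_atom_id)
        if key not in self.thunks:
            self.thunks[key] = len(self.thunks)   # allocate the next thunk id
        return self.thunks[key]
```

Two callers in the same output section that branch to the same target receive the same id, so only one 12-byte thunk is emitted for them.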

### 5. Reloc rewrite

Each thunked BRANCH26 reloc gets rewritten to point at the thunk atom instead of the original target. The thunk's BR then reaches the real target through the address built by ADRP+ADD.
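As an illustration of the rewrite, the sketch below retargets a reloc and re-encodes the branch's 26-bit immediate. It assumes the standard A64 `B`/`BL` layout (opcode in the top 6 bits, `imm26` in the low 26); `Branch26Reloc`, `patch_branch26`, `apply_reloc`, and the symbol names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Branch26Reloc:          # hypothetical record, not the linker's real type
    place: int                # address of the B/BL instruction (P)
    target: str               # name of the atom the branch should reach


def patch_branch26(insn: int, place: int, dest: int) -> int:
    """Return the B/BL word `insn` with its imm26 field pointing at `dest`."""
    distance = (dest - place) >> 2
    if not -(1 << 25) <= distance < (1 << 25):
        raise ValueError("destination out of BRANCH26 range")
    return (insn & 0xFC000000) | (distance & 0x03FFFFFF)


def apply_reloc(reloc: Branch26Reloc, insn: int, addrs: dict) -> int:
    """Resolve the reloc's current target through `addrs` and patch the branch."""
    return patch_branch26(insn, reloc.place, addrs[reloc.target])
```

Rewriting amounts to changing `reloc.target` from the far symbol to the thunk's name before calling `apply_reloc`; the branch then encodes the short hop to the thunk.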

### 6. Interaction with `-dead_strip` and ICF

- Thunks are dead-stripped if no live caller remains.
- Thunks are never ICF candidates (they have unique target addresses).
- A dead-stripped target invalidates its thunk(s); this is easy to handle because thunks are generated after dead-strip.

### 7. `-thunks=<none|safe|all>`

- `-thunks=none`: overflow is a hard error (useful for small programs, where an overflow indicates a bug).
- `-thunks=safe` (the default): thunks inserted only when needed.
- `-thunks=all`: thunks inserted for every BRANCH26 (for testing).

Sprint 19 CLI wires these; this sprint implements the behavior.
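The per-reloc behavior of the three modes can be summarized in a small decision function. This is a sketch with hypothetical names (`ThunkMode`, `wants_thunk`, `BranchOutOfRange`); how the CLI from Sprint 19 actually hands the mode to this pass is not shown here.

```python
from enum import Enum


class ThunkMode(Enum):
    NONE = "none"
    SAFE = "safe"
    ALL = "all"


class BranchOutOfRange(Exception):
    """Raised when a BRANCH26 overflows and thunks are disabled (hypothetical)."""


def wants_thunk(mode: ThunkMode, in_range: bool) -> bool:
    """Decide whether one BRANCH26 reloc should be routed through a thunk."""
    if mode is ThunkMode.ALL:
        return True                    # testing mode: thunk every branch
    if in_range:
        return False                   # direct branch works, nothing to do
    if mode is ThunkMode.NONE:
        raise BranchOutOfRange("BRANCH26 out of range with -thunks=none")
    return True                        # SAFE: thunk only when needed
```

A real error would also name the caller and target, as the testing strategy requires.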

### 8. Regression: small programs don't grow

Default is `-thunks=safe`. Programs that don't need thunks emit no `__thunks` section and are byte-identical to the pre-sprint output.

## Testing Strategy

- Synthetic: compile a source that produces >128 MiB of code (requires artificially padding `.o` files, or using a large constant array in `__text`). Verify that thunks are inserted.
- Reachability: every thunk is reachable from each call site in its caller cluster.
- Runtime: the large binary's entry point actually runs and calls through thunks without crashing.
- `-thunks=none` + overflow: produces a clear error citing the caller and target.
- Small-program regression: fortsh output is byte-identical to the pre-Sprint-26 output (no thunks inserted).

## Definition of Done

- Thunks correctly inserted for out-of-range BRANCH26 on large fixtures.
- Layout fixed-point converges rapidly.
- Small programs unchanged.
- `-thunks` CLI matrix all wired.
