Sprint 29.11: Full Sprint 29 Audit

Why This Sprint Exists

Sprint 29 no longer needs another cleanup/build-out sub-sprint. It needs an exhaustive audit.

29.10 closed the hidden implementation gaps that were still preventing an honest "mostly done" story. What remains now is proving that the optimizer surface we claim to have is actually correct:

  • across IR, assembly, objects, linked binaries, and runtime behavior
  • across -O0/-O1/-O2/-O3/-Os/-Ofast
  • across real-world-style Fortran programs, not just isolated micro-reproducers

This sprint is the formal closeout audit for all of Sprint 29, including 29.5-29.10.

Audit Scope

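Sources of truth:

  • Sprint 29 (sprint29.md)
  • Sprint 29.5 (sprint29_5.md)
  • Sprint 29.6 (sprint29_6.md)
  • Sprint 29.7 (sprint29_7.md)
  • Sprint 29.8 (sprint29_8.md)
  • Sprint 29.9 (sprint29_9.md)
  • Sprint 29.10 (sprint29_10.md)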

Each promised item should end the audit in one of three states:

  1. proven working with living tests
  2. explicitly deferred with a written dependency
  3. captured by a living XFAIL or audit regression
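
The three end states can be modeled as a small closed type so that an item with no recorded outcome is caught mechanically. This is a minimal sketch, not the project's tooling: `AuditOutcome`, `AuditItem`, and `unresolved_items` are hypothetical names introduced only for illustration.

```rust
/// The three states a promised item may end the audit in.
/// All names in this sketch are hypothetical, not taken from the project.
#[derive(Debug, Clone, PartialEq)]
enum AuditOutcome {
    /// Proven working; `test` names the living test.
    Proven { test: String },
    /// Explicitly deferred; `dependency` is the written dependency.
    Deferred { dependency: String },
    /// Captured by a living XFAIL or audit regression.
    Captured { regression: String },
}

/// A promised item and the outcome recorded for it, if any.
struct AuditItem {
    name: String,
    outcome: Option<AuditOutcome>,
}

/// Returns the names of items that have not yet reached one of the three states.
fn unresolved_items(items: &[AuditItem]) -> Vec<&str> {
    items
        .iter()
        .filter(|item| item.outcome.is_none())
        .map(|item| item.name.as_str())
        .collect()
}
```

Under this model the audit would be complete when `unresolved_items` returns an empty list.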

Audit Rules

  • Prefer real Fortran programs in test_programs/ to unit-level IR toys alone.
  • Every new audit program should try to prove more than stdout:
    • runtime result correctness
    • IR/asm/object presence where relevant
    • object or linked-binary determinism where relevant
    • cross-opt equality unless the test is intentionally opt-sensitive
  • Use .refs as the sanity-check source for hotspot shapes and realistic coding patterns, especially stdlib/fpm-style loops, source scanners, and numerical kernels.
  • When the audit finds a real bug, land the smallest honest fix plus a regression.
  • When the audit finds a real bug that is not fixed immediately, record it in noted_items.md and capture it as a living XFAIL where possible.
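
The cross-opt equality rule above can be sketched as a comparison of each level's output against the `-O0` baseline. This is an illustration under stated assumptions: `check_cross_opt_equality` is a hypothetical helper, and the `run` closure stands in for whatever actually builds and runs a program at a given level.

```rust
/// Optimization levels named in this document.
const OPT_LEVELS: [&str; 6] = ["-O0", "-O1", "-O2", "-O3", "-Os", "-Ofast"];

/// Checks that every optimization level produces the same output as the first.
/// `run` is a caller-supplied closure (hypothetical here) that builds and runs
/// the program at the given level and returns its captured stdout.
/// Returns Err naming the first level whose output differs from the baseline.
fn check_cross_opt_equality<F>(mut run: F) -> Result<(), String>
where
    F: FnMut(&str) -> String,
{
    let baseline = run(OPT_LEVELS[0]);
    for &level in &OPT_LEVELS[1..] {
        let output = run(level);
        if output != baseline {
            return Err(format!(
                "{} differs from {}: {:?} vs {:?}",
                level, OPT_LEVELS[0], output, baseline
            ));
        }
    }
    Ok(())
}
```

An intentionally opt-sensitive test would simply not be passed through this check.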

Initial Tranche

Kickoff items:

  • source-level audit for helper-before-program entry lowering / dead-function root handling
  • source-level audit for module procedure host-association over module globals
  • real-world stdlib-style kernel:
    • tridiagonal sparse matvec (realworld_tridiag_spmv.f90)
  • real-world BLAS-style kernel:
    • axpy + reduction (realworld_axpy_reduce.f90)
  • real-world fpm-style application logic:
    • source suffix classification (realworld_suffix_scan.f90)

The passing real-world programs should prove:

  • runtime correctness
  • phase triangulation (ir|asm|obj|repro)
  • cross-opt equality
  • deterministic object snapshots
  • deterministic linked binaries without Mach-O UUID noise
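
The determinism items above reduce to a byte-for-byte comparison of two builds of the same artifact. The following is a rough sketch, assuming the first build has been copied aside before the rebuild; `first_difference` and `artifacts_identical` are hypothetical helper names.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Returns the offset of the first byte where two build artifacts differ,
/// or None when they are byte-identical. A length mismatch counts as a
/// difference at the end of the shorter artifact.
fn first_difference(a: &[u8], b: &[u8]) -> Option<usize> {
    let common = a.len().min(b.len());
    match (0..common).find(|&i| a[i] != b[i]) {
        Some(i) => Some(i),
        None if a.len() != b.len() => Some(common),
        None => None,
    }
}

/// Reads two artifacts from disk and reports whether they are byte-identical.
/// Both paths are supplied by the caller; nothing here invokes a compiler.
fn artifacts_identical(first: &Path, second: &Path) -> io::Result<bool> {
    let a = fs::read(first)?;
    let b = fs::read(second)?;
    Ok(first_difference(&a, &b).is_none())
}
```

Reporting the first differing offset, rather than only a boolean, makes it easier to tell whether a mismatch sits in a header field or in generated code.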

Current audit findings:

  • fixed: helper-before-program lowering / dead-function rooting could drop the true __prog_* entry or make _main call the wrong helper first
  • fixed: named-parameter local fixed-array extents were folded as (1, 1) in lowering, which made real-world kernels like realworld_axpy_reduce.f90 and realworld_tridiag_spmv.f90 trip bogus bounds checks
  • fixed: ordinary load-bearing loops were entering an unsafe full-unroll path, which miscompiled realworld_axpy_reduce.f90 at -O2/-O3/-Ofast; the unroller is now hardened to keep that shape out while preserving the proven DO CONCURRENT full-unroll path
  • fixed: BCE only recognized the canonical bare loop IV, so real-world counted loops with safe iv +/- const array accesses kept redundant bounds checks at -O2+; the audit kernels realworld_sasum_cleanup.f90 and realworld_three_point_apply.f90 now prove the offset-IV case and keep SROA honest at the same time
  • fixed: cross-block LSF treated every call as a universal memory clobber, so branch-join reuse through a noalias helper side path stayed as a reload at -O2+; realworld_noalias_reuse.f90 now proves the noalias-call case and also keeps the local same-block reuse path honest
  • fixed: contained procedures only partially inherited host-associated parameter constants during lowering, so dummy array extents and loop bounds like x(n), y(n), and do i = 1, n could degrade inside real-world helper kernels; realworld_seed_overwrite.f90 now proves that host-param-backed dummy extents and loop bounds stay intact
  • fixed: backend ICmp lowering could emit mixed-width GP compares like cmp w26, x23 when the IR compared a 32-bit induction value against a 64-bit bound; realworld_ipo_chain.f90 now keeps the compare-width harmonization honest through a real helper-chain compile at -O2+
  • fixed: module procedures were still missing host association over their own module globals, so small cases like call bump() could silently leave a shared module variable unchanged. module_global_host_assoc.f90 is now a passing audit program with cross-opt equality plus asm/object/run reproducibility, and tests/module_host_audit.rs proves the raw IR resolves the shared module global inside the procedure body
  • fixed: extended OPEN lowering built the runtime control block by storing typed fields through a byte-pointer GEP, which first tripped IR verification for position='append' and then, after the verifier fix, still wrote fields at offsets wrongly scaled by the element size. io_append_log.f90 is now a passing file oracle with append rerun coverage plus asm/object reproducibility and cross-opt equality
  • fixed: descriptor-backed array query intrinsics (SIZE, LBOUND, UBOUND) were lowered as raw i64 runtime results even though Fortran default integer queries should materialize as default-kind scalars, and scalar/component assignment lowering skipped mixed-width coercion at ordinary store sites; realworld_shape_guard.f90 now proves the default-kind runtime-shape path through real allocatable metadata, loop bounds, and deterministic objects
  • fixed: backend MovReg emission did not handle x -> w truncation views, so real-world default-kind array-query assignments could produce invalid asm like mov w21, x20; the new runtime-shape audit keeps that truncation surface honest
  • fixed: fpm-style suffix classification in realworld_suffix_scan.f90 first needed fixed-length character arrays to lower as real element-addressable arrays instead of scalar strings, then needed scalar character dummy intrinsics like INDEX(name, suffix) to carry a real runtime length story through afs_c_strlen, and finally exposed two optimizer-side bugs: mixed-width GEP offsets were being compared without pointee-size scaling in alias analysis, and SROA was scalarizing aggregates even when a GEP address escaped through the synthesized descriptor. The program is now a passing audit probe with cross-opt equality plus IR/asm/object/run reproducibility
  • fixed: dummy-array SIZE(...) queries in realworld_assumed_shape_size.f90 were not actually using assumed-shape descriptors because bare (:) dummy declarations lowered through the Deferred array-spec path, and then the synthesized descriptor lost rank and flags at -O2+ because mixed-width GEP offsets in alias analysis made descriptor field stores look like they overlapped. Assumed-shape dummies now classify as descriptor-backed, query results are kept in default-kind scalars, and the real-world canary now passes across optimization levels with deterministic artifacts
  • proven: LICM hoists invariant scalar dummy loads out of a real-world affine update loop in realworld_affine_shift.f90 once BCE clears the loop body
  • proven: GVN reduces duplicated branch-join PURE helper calls in realworld_join_bias_sum.f90 at -O2+ instead of recomputing the same affine helper result through the join
  • proven: DSE removes the dead seed store in realworld_seed_overwrite.f90 across the intervening noalias helper call while preserving the real fill
  • proven: SROA scalarizes the fixed tap buffer in realworld_binomial_blend.f90 and BCE clears the corresponding safe stencil bounds checks at -O2+, giving us another living real-world audit kernel for the small-aggregate path
  • proven: loop-legality audit kernels realworld_inplace_prefix.f90 and realworld_inplace_symmix.f90 stay runtime-correct, cross-opt-equal, and deterministic across IR/object/binary surfaces
  • proven: the 29.9 single-file story now has real-world audit coverage for ELEMENTAL lowering plus DO CONCURRENT bulk redirection (realworld_elemental_stage.f90), intramodule IPO helper trimming (realworld_ipo_chain.f90), small-loop DO CONCURRENT exploitation (realworld_doconc_square.f90), and explicit-DO vectorization onto the bulk runtime kernels (realworld_vector_stage.f90)
  • separately deferred parser gap: typed character array constructors using an explicit type-spec inside []
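
The offset-IV BCE finding above rests on a simple range argument: for a unit-stride counted loop, `iv + const` is monotonic in `iv`, so checking the two loop endpoints is enough. The sketch below illustrates that reasoning only; it is not the compiler's pass, and `offset_access_in_bounds` is a hypothetical name.

```rust
/// Decides whether a bounds check on `a(iv + offset)` is provably redundant
/// for a counted loop `do iv = lo, hi` (unit stride) over an array with
/// inclusive bounds `lower..=upper`.
fn offset_access_in_bounds(lo: i64, hi: i64, offset: i64, lower: i64, upper: i64) -> bool {
    if lo > hi {
        // Zero-trip loop: the access never executes, so the check is dead.
        return true;
    }
    // iv + offset is monotonic in iv, so the endpoints bound every access.
    // checked_add refuses to reason about sums that would overflow.
    match (lo.checked_add(offset), hi.checked_add(offset)) {
        (Some(first), Some(last)) => first >= lower && last <= upper,
        _ => false,
    }
}
```

For a three-point stencil `do i = 2, 7` over `x(1:8)`, both `offset = -1` and `offset = 1` pass, while widening the loop to `do i = 1, 8` fails for either offset, so the bounds checks would have to stay.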

Current audit corpus snapshot:

  • 183 top-level test_programs/*.f90 runtime corpus programs
  • 172 programs with CHECK
  • 30 programs with IR_CHECK
  • 6 programs with IR_NOT
  • 0 living XFAILs
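
A snapshot like the one above could be recomputed by scanning program sources for directive lines. The sketch below is hedged heavily: it assumes directives appear in Fortran comments as `! CHECK: ...`, which this document does not confirm, and `has_directive` and `count_with_directive` are hypothetical names.

```rust
/// Returns true when `source` has a Fortran comment line carrying `directive`.
/// ASSUMPTION: directives look like `! CHECK: ...`; the real syntax may differ.
fn has_directive(source: &str, directive: &str) -> bool {
    source.lines().any(|line| {
        line.trim_start()
            .strip_prefix('!')
            .map(|rest| rest.trim_start())
            .and_then(|rest| rest.strip_prefix(directive))
            .map_or(false, |rest| rest.starts_with(':'))
    })
}

/// Counts how many program sources carry the given directive at least once.
fn count_with_directive(sources: &[&str], directive: &str) -> usize {
    sources
        .iter()
        .filter(|src| has_directive(**src, directive))
        .count()
}
```

Matching the directive as a prefix followed by a colon keeps `CHECK` from also counting `IR_CHECK` lines.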

Brutal Audit Priorities

1. 29.8 Optimizer Proof

Grow adversarial real-world coverage for:

  • GVN
  • SROA
  • BCE
  • local and cross-block LSF
  • LICM
  • loop legality transforms

2. 29.9 Claims Audit

Prove the current single-file story honestly:

  • PURE/ELEMENTAL exploitation
  • DO CONCURRENT exploitation
  • intramodule IPO
  • vectorization/runtime-kernel redirection

Also keep proving what is still absent:

  • cross-module IPO
  • whole-program analysis
  • general native vectorizer

3. Binary Correctness & Determinism

For representative real-world programs:

  • object snapshots deterministic at optimized levels
  • linked binaries byte-identical when rebuilt at the same output path
  • no LC_UUID
  • runtime behavior equal across optimization levels unless explicitly exempted
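
The "no LC_UUID" item can be checked by walking the Mach-O load-command table. This is a minimal sketch that assumes a thin (non-fat), little-endian, 64-bit image; `has_lc_uuid` and `read_u32` are hypothetical helpers, while the constants and the 32-byte header layout come from `<mach-o/loader.h>`.

```rust
/// Mach-O constants from <mach-o/loader.h>.
const MH_MAGIC_64: u32 = 0xfeed_facf;
const LC_UUID: u32 = 0x1b;
/// Size of `struct mach_header_64` in bytes.
const HEADER_SIZE: usize = 32;
/// Byte offset of the `ncmds` field inside the header.
const NCMDS_OFFSET: usize = 16;

/// Reads a little-endian u32 at `offset`, or None if out of range.
fn read_u32(bytes: &[u8], offset: usize) -> Option<u32> {
    let end = offset.checked_add(4)?;
    let chunk = bytes.get(offset..end)?;
    Some(u32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]))
}

/// Returns Some(true) if the image contains an LC_UUID load command,
/// Some(false) if it does not, and None if the bytes are not a thin
/// little-endian 64-bit Mach-O image or the command table is malformed.
fn has_lc_uuid(image: &[u8]) -> Option<bool> {
    if read_u32(image, 0)? != MH_MAGIC_64 {
        return None;
    }
    let ncmds = read_u32(image, NCMDS_OFFSET)?;
    let mut offset = HEADER_SIZE;
    for _ in 0..ncmds {
        // Every load command starts with two u32 fields: cmd, then cmdsize.
        let cmd = read_u32(image, offset)?;
        let cmdsize = read_u32(image, offset.checked_add(4)?)? as usize;
        if cmd == LC_UUID {
            return Some(true);
        }
        if cmdsize < 8 {
            return None; // too small to hold cmd + cmdsize
        }
        offset = offset.checked_add(cmdsize)?;
    }
    Some(false)
}
```

Returning `None` for anything that does not parse keeps a malformed or non-Mach-O file from being mistaken for a clean binary.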

Success Condition

Sprint 29 closes when:

  • the promised optimizer/runtime surface has living proof
  • the remaining holes are few, explicit, and written down
  • the test suite gives us confidence in IR, binary, determinism, and integration rather than only stdout