lower_arg_by_ref_full's tail path evaluates the actual via
lower_expr_full and returns the value as-is when it's a pointer. But
for shapes that lower_expr_full returns as a 384-byte descriptor
(array sections, array binops, array-result intrinsics), the callee —
which under the assumed-size / explicit-shape ABI expects a bare element
pointer — would read the descriptor's first 8 bytes (= base_addr field)
as if they were the array's first element. Empirically this surfaced
post-db04b9d as bounds-check failures of the form 'index <huge> outside
[1, n]' when stdlib's solve/lapack_getrf chain was rebuilt with the
WIP-aware ABI.
Detect Ptr<[i8; 384]> at the tail and load through to extract the
descriptor's base_addr before returning it as the bare-pointer arg.
The lower.rs → lower/core.rs split (71a0cc3) silently dropped commit
db04b9d's fix when the file was extracted: ArraySpec::AssumedSize was
re-added to the descriptor-using set in arg_uses_descriptor_from_decls.
Per F2018 §15.5.2.4 an assumed-size dummy 'a(lda,*)' is passed as a
bare element pointer; flagging it as descriptor-bearing made every
'a(i,j)' reference go through array_descriptor_addr → descriptor base,
yielding descriptor_base + 16 (= upper-half of base_addr field) instead
of the actual element. Restores db04b9d and adds an explanatory comment
so a future refactor doesn't drop it again.
lower_array_intrinsic dispatched on `name` with a match late in the
function but materialised the first arg's descriptor (alloca + memset
+ afs_create_section) unconditionally before that match — so calling
it on a user-procedure name would still emit a 384-byte throwaway
descriptor before returning None.
expr.rs's FunctionCall handler reaches lower_array_intrinsic from
two places (the `!has_named_interface` arm and the post-generic-
resolve fallback), so for any non-generic non-intrinsic call (e.g.
`pick(key(0:))`) each section actual was lowered THREE times: twice as
unused descriptors emitted before this dispatcher returned None,
and once for the legitimate ref_arg_vals descriptor passed to the
real call.
Bail at the top of lower_array_intrinsic when `name` is not one of
the array intrinsics it actually handles
(size/lbound/ubound/shape/allocated/sum/product/maxval/minval/
maxloc/minloc/matmul/dot_product/transpose/huge/tiny/epsilon/
precision/range/digits/norm2).
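A sketch of the early bail, with the handled-name set taken from the list above (function names and return shape are illustrative, not the dispatcher's real signature):

```rust
/// The array intrinsics this dispatcher actually handles.
fn is_handled_array_intrinsic(name: &str) -> bool {
    matches!(
        name,
        "size" | "lbound" | "ubound" | "shape" | "allocated" | "sum"
            | "product" | "maxval" | "minval" | "maxloc" | "minloc"
            | "matmul" | "dot_product" | "transpose" | "huge" | "tiny"
            | "epsilon" | "precision" | "range" | "digits" | "norm2"
    )
}

/// Gated dispatcher sketch: return None before any descriptor work
/// when the callee is a user-procedure name.
fn lower_array_intrinsic_sketch(name: &str) -> Option<&'static str> {
    if !is_handled_array_intrinsic(name) {
        return None; // no throwaway 384-byte descriptor is emitted
    }
    // ... materialize the first arg's descriptor and dispatch ...
    Some("handled")
}
```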
Result on stdlib_hash_32bit_water.f90 (cumulative with eeee0e5 +
0592c14):
trunk baseline: 123 GB peak (OOM, uncompilable)
after eeee0e5: 26.6 GB, 18.4 s
after 0592c14: 922 MB, 1.49 s
after this fix: 35 MB, 0.22 s (~3500x lighter than trunk)
afs_create_section calls in -S: 5647 → 982 → 126.
cli_driver 579/579 PASS. Regression test threshold tightened from
60 to 24 emissions for an 8-section fixture (observed-good: 14).
expr.rs's FunctionCall lowering eagerly built two probe-arg-vec slices
before any branch decision:
* resolution_arg_vals — only consumed by resolve_generic_call_actuals
(which short-circuits to None if the callee isn't a NamedInterface)
and the structure-ctor fallback (gated on has_named_interface);
* intrinsic_arg_vals — only consumed by lower_intrinsic, which matches
on a fixed set of intrinsic names and returns None otherwise.
For a non-generic non-procptr non-intrinsic callee both slices were
discarded, but lowering each arg cost a section-descriptor
materialization (alloca + memset + afs_create_section) per array
section actual. Inside nested intrinsic chains that compounded
multiplicatively with the third "real" lowering in ref_arg_vals.
stdlib_hash_32bit_water.f90's water_hash inner loop —
`ieor(waterr32(key(i:)), waterp1)` repeated 4 times across 16 SELECT
CASE arms — produced 5647 `afs_create_section` calls and a 26 GB
compile peak. Gate the resolution probe behind has_named_interface ||
procptr_target.is_some(), and route the intrinsic probe through
sema::validate::is_intrinsic_name(&key).
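The effect of the gating can be sketched by counting per-arg lowerings; for brevity this folds the `procptr_target.is_some()` condition into the first flag, and the names are hypothetical:

```rust
/// How many times one actual gets lowered for a single FunctionCall,
/// with and without gating the two probe slices.
fn lowerings_per_arg(has_named_interface: bool, is_intrinsic_name: bool, gated: bool) -> usize {
    let mut n = 0;
    if !gated || has_named_interface {
        n += 1; // resolution probe (resolution_arg_vals)
    }
    if !gated || is_intrinsic_name {
        n += 1; // intrinsic probe (intrinsic_arg_vals)
    }
    n += 1; // the real lowering for ref_arg_vals
    n
}
```

For a non-generic non-intrinsic callee the gated path lowers each section actual once instead of three times, which is where the multiplicative descriptor blowup disappears.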
Result on stdlib_hash_32bit_water.f90:
* compile peak: 26.6 GB → 922 MB (~28x)
* wall time: 18.4s → 1.49s (~12x)
* afs_create_section calls in -S: 5647 → 982 (~5.7x)
Combined with the earlier eeee0e5 the same file is now 133x lighter
than its trunk-baseline 123 GB peak. cli_driver 578/578 PASS.
Adds an asm-level regression test that compiles a fixed-shape
ieor/user-call chain and asserts the per-source-section descriptor
emission ratio stays below the multiplicative-probe regime.
The caller-side hidden-output buffer for ComplexBuffer ABI returns was
allocated as `[i8 x 8]` / `[i8 x 16]`, making the call's return
value `Ptr<[i8 x N]>`. is_complex_ty only recognises `[fN x 2]`
or `Ptr<[fN x 2]>`, so for `complex_local - complex_call(...)` the
binop's complex-arithmetic branch did not fire — execution fell to
the int/float promotion path and emitted `fsub %ptr<[i8 x 8]>` against
the buffer pointer. IR-verify rejected with `float op has non-float
operand : ptr<[i8 x 8]>`, blocking stdlib_lapack_solve_chol_comp's
CPOTF2/ZPOTF2:
ajj = real( real(a(j,j),sp) - cdotc(...), sp )
Type the alloca as `[fN x 2]` (sized by the existing kind-aware
hidden_result_temp_bytes_for_callee), with N=4 for sp (8 bytes) and
N=8 for dp (16 bytes). Both the Name-callee and the type-bound
Component-callee paths get the same treatment. is_complex_ty,
materialize_complex_operand, and the binop branch all then recognise
the buffer as a complex pair and emit lane-wise fadd/fsub/fmul
correctly.
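The retyping can be sketched like this; the enum and helper names are illustrative simplifications of the real IR types:

```rust
// Illustrative stand-ins; the compiler's actual type enum differs.
#[derive(Debug, Clone, PartialEq)]
enum IrType {
    FPair(u8),      // [fN x 2], lane width in bytes (4 = sp, 8 = dp)
    I8Array(usize), // [i8 x N] — the old, unrecognized buffer type
    Ptr(Box<IrType>),
}

/// Hidden-result buffer size: 8 bytes for complex(sp), 16 for complex(dp).
fn hidden_result_temp_bytes(is_dp: bool) -> usize {
    if is_dp { 16 } else { 8 }
}

/// Type the caller-side alloca as [fN x 2] instead of [i8 x N], so the
/// call's return value is Ptr<[fN x 2]> and the complex-arithmetic
/// branch fires.
fn complex_ret_alloca_ty(is_dp: bool) -> IrType {
    match hidden_result_temp_bytes(is_dp) {
        8 => IrType::FPair(4),
        16 => IrType::FPair(8),
        _ => unreachable!(),
    }
}

/// Recognizer sketch: [fN x 2] or Ptr<[fN x 2]> counts as complex.
fn is_complex_ty(ty: &IrType) -> bool {
    match ty {
        IrType::FPair(_) => true,
        IrType::Ptr(inner) => matches!(**inner, IrType::FPair(_)),
        _ => false,
    }
}
```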
emit_scalar_allocate_source_init_on_success used to pipe the source
expression through coerce_to_type(raw, dest_ty) → b.store(coerced,
dest_base). When the source is a scalar complex(sp/dp) function call,
lower_expr returns a pointer to the ComplexBuffer the callee wrote
into — typed Ptr<[f32/f64 x 2]>, Ptr<[i8 x 8/16]>, or bare Ptr<i8>.
coerce_to_type has no Ptr→Array path so it returned the pointer
unchanged, and b.store then tried to write a pointer-sized value
into a [f32/f64 x 2] slot. IR-verify rejected the store with
`value type ptr<[i8 x 8]> doesn't match pointee type [f32 x 2]`,
leaving stdlib_stats_moment_mask uncompilable
(`allocate(mean_, source = mean(x, 1, mask))` where mean returns
scalar complex(sp) — the same shape across roughly a dozen routines).
Fix recognises the buffer-pointer return at the source-init site and
memcpys the lane pair from the buffer to the freshly allocated
destination slot, parallel to the assignment-from-complex-call path
elsewhere. Regression test gates compilation only — the runtime
path through allocatable complex scalars hits a separate
pre-existing bug in afs_assign_allocatable / real(m_) reads on
complex allocatables that surfaces once IR-verify is no longer
masking it.
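The essence of the source-init fix, reduced to safe Rust over fixed lane pairs (a sketch of the dp case; the real code memcpys through IR pointers):

```rust
/// Instead of storing the returned buffer *pointer* into the
/// [fN x 2] slot, copy the (re, im) lane pair out of the buffer.
fn init_from_complex_call_buffer(buf: &[f64; 2], dest_slot: &mut [f64; 2]) {
    dest_slot.copy_from_slice(buf); // memcpy of the two lanes
}
```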
generic_dispatch_probe_value already calls array_function_result_elem_type
at the top. When that returns None for a Name(callee) FunctionCall and
the callee is neither a transformational array intrinsic (pack/reshape/
sum/merge/matmul/transpose/conjg/aimag/abs/cmplx/shape/transfer/dimag)
nor a local array, the subsequent lower_array_expr_descriptor call just
walks named-intrinsic match arms (all miss), then redundantly calls
array_function_result_elem_type a second time inside
lower_array_function_result_descriptor, only to return None. That
second invocation recursively probes args, and the arg-probes themselves
re-run lower_array_expr_descriptor — O(2^depth) for nested calls.
stdlib_hash_32bit_water.f90's water_hash inner loop nests four user
function calls deep across 16 SELECT CASE arms; the compile peak ran
to ~123 GB and the kernel SIGKILL'd the process under memory pressure.
Skipping the redundant path here drops the same compile to ~26 GB peak
(still high, but no longer triggers OOM). The
internal_subprogram_call_under_intrinsic_under_user_call_keeps_mangled_name
regression remains green: lower_expr_full at the bottom of the probe
does the real evaluation with internal_funcs threaded through, so
internal CONTAINS-block subprograms keep their mangled link names.
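The blowup can be modelled with a toy recurrence (an assumption-laden sketch, not the real call graph): without the skip, each nesting level probes its arg through both the redundant lower_array_function_result_descriptor path and the real one.

```rust
/// Toy model of probe invocations for a user-call chain nested
/// `depth` deep. Each level costs one probe of its own plus either
/// two re-probes of the nested arg (redundant path + real path) or,
/// with the skip, just one.
fn probe_count(depth: u32, skip_redundant: bool) -> u64 {
    if depth == 0 {
        return 1;
    }
    let branches: u64 = if skip_redundant { 1 } else { 2 };
    1 + branches * probe_count(depth - 1, skip_redundant)
}
```

At depth 4 (the water_hash nesting) the unskipped count is already an order of magnitude larger, and it doubles with every additional level.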
Setting every dim's stride to 1 worked for the dim[0]-only flat
iteration paths but collided byte offsets in every per-dim consumer.
For `m = (y > 3.)` over `real :: y(2,3)` the flat compare loop wrote
m[0..5] correctly (it iterates over dim[0].stride * elem_size), but
the masked sum-along-dim helper indexed mask via
`Σ idx_d * dim[d].stride * elem_size` and the all-1 strides made
distinct (i,j) tuples land on the same byte — only 4 of 6 mask bytes
were actually consulted, so `sum(y, 1, y > 3.)` quietly dropped the
column-2 mask hit and returned [0, 0, 6] instead of [0, 4, 11].
Match the column-major running stride convention that
`materialize_array_descriptor_for_info` already uses.
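That convention and the per-dim byte-offset formula can be sketched directly (function names here are illustrative):

```rust
/// Column-major running strides, in elements:
/// stride[0] = 1, stride[d] = stride[d-1] * extent[d-1].
fn column_major_strides(extents: &[usize]) -> Vec<usize> {
    let mut strides = Vec::with_capacity(extents.len());
    let mut running = 1;
    for &extent in extents {
        strides.push(running);
        running *= extent;
    }
    strides
}

/// Per-dim consumers index via Σ idx_d * stride[d] * elem_size.
fn byte_offset(idx: &[usize], strides: &[usize], elem_size: usize) -> usize {
    idx.iter().zip(strides).map(|(i, s)| i * s * elem_size).sum()
}
```

For a `(2,3)` mask with `elem_size = 1`, the running strides `[1, 2]` give every (i,j) tuple a distinct byte, whereas all-1 strides send `(1,0)` and `(0,1)` to the same offset, reproducing the dropped-mask-hit failure described above.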
Adds `afs_array_sum_real8_dim_mask` and `afs_array_sum_int_dim_mask`,
plus a shared `for_each_reduce_along_dim_with_mask` traversal that
honors both source and mask per-dim strides, and a small
`mask_byte_is_true(mask, byte_off)` helper that dispatches on the
mask's `elem_size` (1, 2, 4, or 8 — `logical(int8)` through
`logical(int64)` and the default 1-byte bool storage all reach the
same predicate).
`afs_array_pack` previously read the mask via a fixed `as *const i32`
load, which crashed with a misaligned dereference once
`elem_size=1` started flowing through for default logical arrays —
switched to `mask_byte_is_true` so it works for every kind.
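A sketch of the kind-dispatching predicate (the signature is an assumption, not the runtime's actual one):

```rust
/// Tests one logical element of the mask at `byte_off` for truth,
/// dispatching on the mask's elem_size (1, 2, 4, or 8 bytes —
/// logical(int8) through logical(int64)). Any nonzero byte in the
/// element means the stored value is nonzero, i.e. .true.,
/// independent of endianness, and no aligned load is required.
fn mask_byte_is_true(mask: &[u8], byte_off: usize, elem_size: usize) -> bool {
    debug_assert!(matches!(elem_size, 1 | 2 | 4 | 8));
    mask[byte_off..byte_off + elem_size].iter().any(|&b| b != 0)
}
```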
Two coupled changes:
1. `ir_scalar_byte_size` and `descriptor_element_size_bytes` now
return 1 for `IrType::Bool`, matching `IrType::Bool::size_bytes()`
and the bytes-per-element layout that `alloca [Bool x N]` actually
produces. The previous 4-byte report only made sense when the
scalar type was widened to a 4-byte slot, which never happened —
storage stayed at 1 byte. The mismatch silently broke every
consumer of the descriptor's `elem_size` for logical arrays:
`mask_at`, `afs_array_sum_real8_mask`, the new `_dim_mask`
helpers, and the whole-array broadcast loop all stepped 4×
past real data. `sum(y, mask=m)` returned the unmasked sum,
`sum(y, dim, mask)` returned the unmasked column sums, and
`arr(i) = .true.` for `logical :: arr(N)` wrote 3 bytes past
the slot. The whole-array broadcast loop now uses
`ir_scalar_byte_size` directly so they stay paired.
2. `lower_array_sum_dim_descriptor` no longer bails when a mask is
present; it lowers the mask actual into a descriptor and
dispatches to `afs_array_sum_{real8,int}_dim_mask`. Surfaced in
`example_var`'s `var(y, 1, y > 3.)` line, which previously fell
through to a scalar broadcast and crashed in
`afs_assign_allocatable` with a misaligned source pointer.
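The first change reduces to making the byte-size report agree with storage (a minimal sketch; the enum is a stand-in for the real `IrType`):

```rust
// Illustrative subset of the IR scalar types.
#[derive(Debug, PartialEq)]
enum IrType {
    Bool,
    I32,
    F64,
}

/// Bytes per scalar element. Bool storage is 1 byte, matching what
/// alloca [Bool x N] actually produces — the old 4-byte report made
/// every elem_size consumer step 4x past real data.
fn ir_scalar_byte_size(ty: &IrType) -> usize {
    match ty {
        IrType::Bool => 1, // previously mis-reported as 4
        IrType::I32 => 4,
        IrType::F64 => 8,
    }
}
```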
Linearscan's live intervals are linear-position [start, end] ranges,
which can't represent the true CFG-aware live set for vregs that
receive values via parallel-copy at every predecessor (block params,
post-phi placeholders). When a call block is lexically wedged
between a phi-vreg's def edge and its use block, the position-based
range straddles the call point but no actual control-flow path
through the vreg traverses it. The splitter then thought the vreg
crossed a call, picked a pre/post split, and assigned the post-half
a different physreg — every predecessor's parallel-copy landed in
pre_phys but every use inside the loop read post_phys.
Track each vreg's set of defining blocks while building
vreg_actual_range; if it has more than one, force real_crosses
false (no splitting). The unit test exercising splitting still
fires because its synthetic vregs have single defs.
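A sketch of the guard (the data structures are illustrative, not linearscan's actual ones):

```rust
use std::collections::{HashMap, HashSet};

/// While building vreg_actual_range, record every block that defines
/// each vreg; block params / post-phi placeholders collect one def
/// per predecessor's parallel-copy.
fn record_def(defs: &mut HashMap<u32, HashSet<u32>>, vreg: u32, block: u32) {
    defs.entry(vreg).or_default().insert(block);
}

/// A vreg with defs in more than one block must not be call-split:
/// its linear [start, end] range can straddle a call that no actual
/// control-flow path through the vreg crosses.
fn real_crosses_allowed(defs: &HashMap<u32, HashSet<u32>>, vreg: u32) -> bool {
    defs.get(&vreg).map_or(true, |blocks| blocks.len() <= 1)
}
```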
Removes:
- ARMFORTAS_SPLIT_INTERVALS env gate (no longer needed)
- detect_partial_unroll_loop's acc_param + store guard (the
underlying regalloc bug it worked around is now fixed)
Verified at all opt levels:
- realworld_seed_overwrite.f90: 4 19 23 (was infinite loop at -O2+)
- realworld_affine_shift.f90: 14 16 (was 14 7 at -O2+)