lower_arg_by_ref_full's tail path evaluates the actual via
lower_expr_full and returns the value as-is when it's a pointer. But
for shapes that lower_expr_full returns as a 384-byte descriptor
(array sections, array binops, array-result intrinsics), the callee —
which under the assumed-size / explicit-shape ABI expects a bare element
pointer — would read the descriptor's first 8 bytes (= base_addr field)
as if they were the array's first element. Empirically this surfaced
post-db04b9d as bounds-check failures of the form 'index <huge> outside
[1, n]' when stdlib's solve/lapack_getrf chain was rebuilt with the
WIP-aware ABI.
Detect Ptr<[i8; 384]> at the tail and load through to extract the
descriptor's base_addr before returning it as the bare-pointer arg.
The lower.rs → lower/core.rs split (71a0cc3) silently dropped commit
db04b9d's fix when the file was extracted: ArraySpec::AssumedSize was
re-added to the descriptor-using set in arg_uses_descriptor_from_decls.
Per F2018 §15.5.2.4 an assumed-size dummy 'a(lda,*)' is passed as a
bare element pointer; flagging it as descriptor-bearing made every
'a(i,j)' reference go through array_descriptor_addr → descriptor base,
yielding descriptor_base + 16 (= upper-half of base_addr field) instead
of the actual element. Restores db04b9d and adds an explanatory comment
so a future refactor doesn't drop it again.
lower_array_intrinsic dispatched on `name` with a match late in the
function but materialised the first arg's descriptor (alloca + memset
+ afs_create_section) unconditionally before that match — so calling
it on a user-procedure name would still emit a 384-byte throwaway
descriptor before returning None.
expr.rs's FunctionCall handler reaches lower_array_intrinsic from
two places (the `!has_named_interface` arm and the post-generic-
resolve fallback), so for any non-generic non-intrinsic call (e.g.
`pick(key(0:))`) each section actual was lowered THREE times: twice as
unused descriptors emitted before this dispatcher returned None,
and once for the legitimate ref_arg_vals descriptor passed to the
real call.
Bail at the top of lower_array_intrinsic when `name` is not one of
the array intrinsics it actually handles
(size/lbound/ubound/shape/allocated/sum/product/maxval/minval/
maxloc/minloc/matmul/dot_product/transpose/huge/tiny/epsilon/
precision/range/digits/norm2).
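A sketch of the early bail, with the handled-name set taken from the list above (function names and return shape are illustrative, not the dispatcher's real signature):

```rust
/// The array intrinsics this dispatcher actually handles.
fn is_handled_array_intrinsic(name: &str) -> bool {
    matches!(
        name,
        "size" | "lbound" | "ubound" | "shape" | "allocated" | "sum"
            | "product" | "maxval" | "minval" | "maxloc" | "minloc"
            | "matmul" | "dot_product" | "transpose" | "huge" | "tiny"
            | "epsilon" | "precision" | "range" | "digits" | "norm2"
    )
}

/// Gated dispatcher sketch: return None before any descriptor work
/// when the callee is a user-procedure name.
fn lower_array_intrinsic_sketch(name: &str) -> Option<&'static str> {
    if !is_handled_array_intrinsic(name) {
        return None; // no throwaway 384-byte descriptor is emitted
    }
    // ... materialize the first arg's descriptor and dispatch ...
    Some("handled")
}
```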
Result on stdlib_hash_32bit_water.f90 (cumulative with eeee0e5 +
0592c14):
trunk baseline: 123 GB peak (OOM, uncompilable)
after eeee0e5: 26.6 GB, 18.4 s
after 0592c14: 922 MB, 1.49 s
after this fix: 35 MB, 0.22 s (~3500x lighter than trunk)
afs_create_section calls in -S: 5647 → 982 → 126.
cli_driver 579/579 PASS. Regression test threshold tightened from
60 to 24 emissions for an 8-section fixture (observed-good: 14).
expr.rs's FunctionCall lowering eagerly built two probe-arg-vec slices
before any branch decision:
* resolution_arg_vals — only consumed by resolve_generic_call_actuals
(which short-circuits to None if the callee isn't a NamedInterface)
and the structure-ctor fallback (gated on has_named_interface);
* intrinsic_arg_vals — only consumed by lower_intrinsic, which matches
on a fixed set of intrinsic names and returns None otherwise.
For a non-generic non-procptr non-intrinsic callee both slices were
discarded, but lowering each arg cost a section-descriptor
materialization (alloca + memset + afs_create_section) per array
section actual. Inside nested intrinsic chains that compounded
multiplicatively with the third "real" lowering in ref_arg_vals.
stdlib_hash_32bit_water.f90's water_hash inner loop —
`ieor(waterr32(key(i:)), waterp1)` repeated 4 times across 16 SELECT
CASE arms — produced 5647 `afs_create_section` calls and a 26 GB
compile peak. Gate the resolution probe behind has_named_interface ||
procptr_target.is_some(), and route the intrinsic probe through
sema::validate::is_intrinsic_name(&key).
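The effect of the gating can be sketched by counting per-arg lowerings; for brevity this folds the `procptr_target.is_some()` condition into the first flag, and the names are hypothetical:

```rust
/// How many times one actual gets lowered for a single FunctionCall,
/// with and without gating the two probe slices.
fn lowerings_per_arg(has_named_interface: bool, is_intrinsic_name: bool, gated: bool) -> usize {
    let mut n = 0;
    if !gated || has_named_interface {
        n += 1; // resolution probe (resolution_arg_vals)
    }
    if !gated || is_intrinsic_name {
        n += 1; // intrinsic probe (intrinsic_arg_vals)
    }
    n += 1; // the real lowering for ref_arg_vals
    n
}
```

For a non-generic non-intrinsic callee the gated path lowers each section actual once instead of three times, which is where the multiplicative descriptor blowup disappears.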
Result on stdlib_hash_32bit_water.f90:
* compile peak: 26.6 GB → 922 MB (~28x)
* wall time: 18.4s → 1.49s (~12x)
* afs_create_section calls in -S: 5647 → 982 (~5.7x)
Combined with the earlier eeee0e5 the same file is now 133x lighter
than its trunk-baseline 123 GB peak. cli_driver 578/578 PASS.
Adds an asm-level regression test that compiles a fixed-shape
ieor/user-call chain and asserts the per-source-section descriptor
emission ratio stays below the multiplicative-probe regime.
The caller-side hidden-output buffer for ComplexBuffer ABI returns was
allocated as `[i8 x 8]` / `[i8 x 16]`, making the call's return
value `Ptr<[i8 x N]>`. is_complex_ty only recognises `[fN x 2]`
or `Ptr<[fN x 2]>`, so for `complex_local - complex_call(...)` the
binop's complex-arithmetic branch did not fire — execution fell to
the int/float promotion path and emitted `fsub %ptr<[i8 x 8]>` against
the buffer pointer. IR-verify rejected with `float op has non-float
operand : ptr<[i8 x 8]>`, blocking stdlib_lapack_solve_chol_comp's
CPOTF2/ZPOTF2:
ajj = real( real(a(j,j),sp) - cdotc(...), sp )
Type the alloca as `[fN x 2]` (sized by the existing kind-aware
hidden_result_temp_bytes_for_callee), with N=4 for sp (8 bytes) and
N=8 for dp (16 bytes). Both the Name-callee and the type-bound
Component-callee paths get the same treatment. is_complex_ty,
materialize_complex_operand, and the binop branch all then recognise
the buffer as a complex pair and emit lane-wise fadd/fsub/fmul
correctly.
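The retyping can be sketched like this; the enum and helper names are illustrative simplifications of the real IR types:

```rust
// Illustrative stand-ins; the compiler's actual type enum differs.
#[derive(Debug, Clone, PartialEq)]
enum IrType {
    FPair(u8),      // [fN x 2], lane width in bytes (4 = sp, 8 = dp)
    I8Array(usize), // [i8 x N] — the old, unrecognized buffer type
    Ptr(Box<IrType>),
}

/// Hidden-result buffer size: 8 bytes for complex(sp), 16 for complex(dp).
fn hidden_result_temp_bytes(is_dp: bool) -> usize {
    if is_dp { 16 } else { 8 }
}

/// Type the caller-side alloca as [fN x 2] instead of [i8 x N], so the
/// call's return value is Ptr<[fN x 2]> and the complex-arithmetic
/// branch fires.
fn complex_ret_alloca_ty(is_dp: bool) -> IrType {
    match hidden_result_temp_bytes(is_dp) {
        8 => IrType::FPair(4),
        16 => IrType::FPair(8),
        _ => unreachable!(),
    }
}

/// Recognizer sketch: [fN x 2] or Ptr<[fN x 2]> counts as complex.
fn is_complex_ty(ty: &IrType) -> bool {
    match ty {
        IrType::FPair(_) => true,
        IrType::Ptr(inner) => matches!(**inner, IrType::FPair(_)),
        _ => false,
    }
}
```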
emit_scalar_allocate_source_init_on_success used to pipe the source
expression through coerce_to_type(raw, dest_ty) → b.store(coerced,
dest_base). When the source is a scalar complex(sp/dp) function call,
lower_expr returns a pointer to the ComplexBuffer the callee wrote
into — typed Ptr<[f32/f64 x 2]>, Ptr<[i8 x 8/16]>, or bare Ptr<i8>.
coerce_to_type has no Ptr→Array path so it returned the pointer
unchanged, and b.store then tried to write a pointer-sized value
into a [f32/f64 x 2] slot. IR-verify rejected the store with
`value type ptr<[i8 x 8]> doesn't match pointee type [f32 x 2]`,
leaving stdlib_stats_moment_mask uncompilable
(`allocate(mean_, source = mean(x, 1, mask))` where mean returns
scalar complex(sp) — the same shape across roughly a dozen routines).
Fix recognises the buffer-pointer return at the source-init site and
memcpys the lane pair from the buffer to the freshly allocated
destination slot, parallel to the assignment-from-complex-call path
elsewhere. Regression test gates compilation only — the runtime
path through allocatable complex scalars hits a separate
pre-existing bug in afs_assign_allocatable / real(m_) reads on
complex allocatables that surfaces once IR-verify is no longer
masking it.
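The essence of the source-init fix, reduced to safe Rust over fixed lane pairs (a sketch of the dp case; the real code memcpys through IR pointers):

```rust
/// Instead of storing the returned buffer *pointer* into the
/// [fN x 2] slot, copy the (re, im) lane pair out of the buffer.
fn init_from_complex_call_buffer(buf: &[f64; 2], dest_slot: &mut [f64; 2]) {
    dest_slot.copy_from_slice(buf); // memcpy of the two lanes
}
```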
generic_dispatch_probe_value already calls array_function_result_elem_type
at the top. When that returns None for a Name(callee) FunctionCall and
the callee is neither a transformational array intrinsic (pack/reshape/
sum/merge/matmul/transpose/conjg/aimag/abs/cmplx/shape/transfer/dimag)
nor a local array, the subsequent lower_array_expr_descriptor call just
walks named-intrinsic match arms (all miss), then redundantly calls
array_function_result_elem_type a second time inside
lower_array_function_result_descriptor, only to return None. That
second invocation recursively probes args, and the arg-probes themselves
re-run lower_array_expr_descriptor — O(2^depth) for nested calls.
stdlib_hash_32bit_water.f90's water_hash inner loop nests four user
function calls deep across 16 SELECT CASE arms; the compile peak ran
to ~123 GB and the kernel SIGKILL'd the process under memory pressure.
Skipping the redundant path here drops the same compile to ~26 GB peak
(still high, but no longer triggers OOM). The
internal_subprogram_call_under_intrinsic_under_user_call_keeps_mangled_name
regression remains green: lower_expr_full at the bottom of the probe
does the real evaluation with internal_funcs threaded through, so
internal CONTAINS-block subprograms keep their mangled link names.
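The blowup can be modelled with a toy recurrence (an assumption-laden sketch, not the real call graph): without the skip, each nesting level probes its arg through both the redundant lower_array_function_result_descriptor path and the real one.

```rust
/// Toy model of probe invocations for a user-call chain nested
/// `depth` deep. Each level costs one probe of its own plus either
/// two re-probes of the nested arg (redundant path + real path) or,
/// with the skip, just one.
fn probe_count(depth: u32, skip_redundant: bool) -> u64 {
    if depth == 0 {
        return 1;
    }
    let branches: u64 = if skip_redundant { 1 } else { 2 };
    1 + branches * probe_count(depth - 1, skip_redundant)
}
```

At depth 4 (the water_hash nesting) the unskipped count is already an order of magnitude larger, and it doubles with every additional level.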
Setting every dim's stride to 1 worked for the dim[0]-only flat
iteration paths but collided byte offsets in every per-dim consumer.
For `m = (y > 3.)` over `real :: y(2,3)` the flat compare loop wrote
m[0..5] correctly (it iterates over dim[0].stride * elem_size), but
the masked sum-along-dim helper indexed mask via
`Σ idx_d * dim[d].stride * elem_size` and the all-1 strides made
distinct (i,j) tuples land on the same byte — only 4 of 6 mask bytes
were actually consulted, so `sum(y, 1, y > 3.)` quietly dropped the
column-2 mask hit and returned [0, 0, 6] instead of [0, 4, 11].
Match the column-major running stride convention that
`materialize_array_descriptor_for_info` already uses.
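That convention and the per-dim byte-offset formula can be sketched directly (function names here are illustrative):

```rust
/// Column-major running strides, in elements:
/// stride[0] = 1, stride[d] = stride[d-1] * extent[d-1].
fn column_major_strides(extents: &[usize]) -> Vec<usize> {
    let mut strides = Vec::with_capacity(extents.len());
    let mut running = 1;
    for &extent in extents {
        strides.push(running);
        running *= extent;
    }
    strides
}

/// Per-dim consumers index via Σ idx_d * stride[d] * elem_size.
fn byte_offset(idx: &[usize], strides: &[usize], elem_size: usize) -> usize {
    idx.iter().zip(strides).map(|(i, s)| i * s * elem_size).sum()
}
```

For a `(2,3)` mask with `elem_size = 1`, the running strides `[1, 2]` give every (i,j) tuple a distinct byte, whereas all-1 strides send `(1,0)` and `(0,1)` to the same offset, reproducing the dropped-mask-hit failure described above.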
Adds `afs_array_sum_real8_dim_mask` and `afs_array_sum_int_dim_mask`,
plus a shared `for_each_reduce_along_dim_with_mask` traversal that
honors both source and mask per-dim strides, and a small
`mask_byte_is_true(mask, byte_off)` helper that dispatches on the
mask's `elem_size` (1, 2, 4, or 8 — `logical(int8)` through
`logical(int64)` and the default 1-byte bool storage all reach the
same predicate).
`afs_array_pack` previously read the mask via a fixed `as *const i32`
load, which crashed with a misaligned dereference once
`elem_size=1` started flowing through for default logical arrays —
switched to `mask_byte_is_true` so it works for every kind.
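A sketch of the kind-dispatching predicate (the signature is an assumption, not the runtime's actual one):

```rust
/// Tests one logical element of the mask at `byte_off` for truth,
/// dispatching on the mask's elem_size (1, 2, 4, or 8 bytes —
/// logical(int8) through logical(int64)). Any nonzero byte in the
/// element means the stored value is nonzero, i.e. .true.,
/// independent of endianness, and no aligned load is required.
fn mask_byte_is_true(mask: &[u8], byte_off: usize, elem_size: usize) -> bool {
    debug_assert!(matches!(elem_size, 1 | 2 | 4 | 8));
    mask[byte_off..byte_off + elem_size].iter().any(|&b| b != 0)
}
```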
Two coupled changes:
1. `ir_scalar_byte_size` and `descriptor_element_size_bytes` now
return 1 for `IrType::Bool`, matching `IrType::Bool::size_bytes()`
and the bytes-per-element layout that `alloca [Bool x N]` actually
produces. The previous 4-byte report only made sense when the
scalar type was widened to a 4-byte slot, which never happened —
storage stayed at 1 byte. The mismatch silently broke every
consumer of the descriptor's `elem_size` for logical arrays:
`mask_at`, `afs_array_sum_real8_mask`, the new `_dim_mask`
helpers, and the whole-array broadcast loop all stepped 4×
past real data. `sum(y, mask=m)` returned the unmasked sum,
`sum(y, dim, mask)` returned the unmasked column sums, and
`arr(i) = .true.` for `logical :: arr(N)` wrote 3 bytes past
the slot. The whole-array broadcast loop now uses
`ir_scalar_byte_size` directly so they stay paired.
2. `lower_array_sum_dim_descriptor` no longer bails when a mask is
present; it lowers the mask actual into a descriptor and
dispatches to `afs_array_sum_{real8,int}_dim_mask`. Surfaced in
`example_var`'s `var(y, 1, y > 3.)` line, which previously fell
through to a scalar broadcast and crashed in
`afs_assign_allocatable` with a misaligned source pointer.
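The first change reduces to making the byte-size report agree with storage (a minimal sketch; the enum is a stand-in for the real `IrType`):

```rust
// Illustrative subset of the IR scalar types.
#[derive(Debug, PartialEq)]
enum IrType {
    Bool,
    I32,
    F64,
}

/// Bytes per scalar element. Bool storage is 1 byte, matching what
/// alloca [Bool x N] actually produces — the old 4-byte report made
/// every elem_size consumer step 4x past real data.
fn ir_scalar_byte_size(ty: &IrType) -> usize {
    match ty {
        IrType::Bool => 1, // previously mis-reported as 4
        IrType::I32 => 4,
        IrType::F64 => 8,
    }
}
```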
Linearscan's live intervals are linear-position [start, end] ranges,
which can't represent the true CFG-aware live set for vregs that
receive values via parallel-copy at every predecessor (block params,
post-phi placeholders). When a call block is lexically wedged
between a phi-vreg's def edge and its use block, the position-based
range straddles the call point but no actual control-flow path
through the vreg traverses it. The splitter then thought the vreg
crossed a call, picked a pre/post split, and assigned the post-half
a different physreg — every predecessor's parallel-copy landed in
pre_phys but every use inside the loop read post_phys.
Track each vreg's set of defining blocks while building
vreg_actual_range; if it has more than one, force real_crosses
false (no splitting). The unit test exercising splitting still
fires because its synthetic vregs have single defs.
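A sketch of the guard (the data structures are illustrative, not linearscan's actual ones):

```rust
use std::collections::{HashMap, HashSet};

/// While building vreg_actual_range, record every block that defines
/// each vreg; block params / post-phi placeholders collect one def
/// per predecessor's parallel-copy.
fn record_def(defs: &mut HashMap<u32, HashSet<u32>>, vreg: u32, block: u32) {
    defs.entry(vreg).or_default().insert(block);
}

/// A vreg with defs in more than one block must not be call-split:
/// its linear [start, end] range can straddle a call that no actual
/// control-flow path through the vreg crosses.
fn real_crosses_allowed(defs: &HashMap<u32, HashSet<u32>>, vreg: u32) -> bool {
    defs.get(&vreg).map_or(true, |blocks| blocks.len() <= 1)
}
```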
Removes:
- ARMFORTAS_SPLIT_INTERVALS env gate (no longer needed)
- detect_partial_unroll_loop's acc_param + store guard (the
underlying regalloc bug it worked around is now fixed)
Verified at all opt levels:
- realworld_seed_overwrite.f90: 4 19 23 (was infinite loop at -O2+)
- realworld_affine_shift.f90: 14 16 (was 14 7 at -O2+)