Commits


Commits on May 8, 2026

  1. Populate rank-remap pointer descriptor for section RHS
    F2018 §10.2.2.3: rank-remap pointer assignment
    
      real(sp), pointer :: tau(:)
      real(sp), target  :: q(5, 5)
      tau(1:k) => q(1:k, 1)
    
    lower_rank_remap_pointer_assignment used to require RHS = bare Name
    and bail on any FunctionCall (section/element) — the pointer
    descriptor never got its base_addr, rank, or extents populated.
    Subsequent 'geqrf(..., tau, ...)' (assumed-size dummy 'tau(*)') then
    received tau.base_addr = NULL, and slarfg's '*tau = ...' segfaulted
    deep in the call chain.  Surfaced as SEGVs across stdlib's qr/eig/schur
    cluster:
    example_qr, example_qr_space, example_pivoting_qr*, example_eig*,
    example_schur*.
    
    Extend the source-shape match to handle FunctionCall (section
    designator on a Name): convert each Range(start:..) into Element(start),
    compute the address of the FIRST included element via
    lower_array_element_addr, and use that as the descriptor's base_addr
    with the target's bounds.
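
    A minimal sketch of the first-included-element offset computation
    described above, assuming a contiguous column-major target with unit
    element strides (the function name and signature are illustrative,
    not the compiler's actual helper):

```rust
// For each dim, take the section's start index (a Range start or an
// Element subscript), subtract the declared lower bound, and weight by
// the column-major running stride; scale the element offset by elem_size.
fn first_element_byte_offset(
    lower: &[i64],   // declared lower bound per dim
    extents: &[i64], // declared extent per dim
    starts: &[i64],  // section start index per dim
    elem_size: i64,
) -> i64 {
    let mut offset = 0i64;
    let mut running_stride = 1i64; // column-major: dim 0 varies fastest
    for d in 0..extents.len() {
        offset += (starts[d] - lower[d]) * running_stride;
        running_stride *= extents[d];
    }
    offset * elem_size
}
```

    For the `q(1:k, 1)` section above the starts equal the lower bounds,
    so the offset is 0 and the descriptor's base_addr is the target's own
    base; a section like `q(2:5, 3)` of the same real(sp) target lands at
    byte offset 44.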
    mfwolffe committed
  2. Honor type_spec in reshape ArrayConstructor lowering
    F2018 §7.8: a typed array constructor '[T :: ...]' has element type T
    regardless of the element expressions' types. The reshape lowering at
    lower_reshape_array_expr_descriptor inferred elem_ty solely via
    first_array_constructor_type_info, which examines the first value
    expression. For 'reshape([real(dp) :: 1, 2, 3, 4], [2, 2])' the values
    are integer literals (4 bytes), so the materialised descriptor was
    elem_size=4 instead of 8.
    
    The malformed elem_size propagated through the reshape result; when
    passed to an assumed-shape dummy 'a(:,:)' and used as SOURCE= in an
    ALLOCATE, afs_prepare_array_copy saw 'dest.elem_size != source.elem_size'
    (8 != 4), freed the freshly-allocated dest buffer, zeroed base_addr,
    and the next read of 'amat(1,1)' SEGV'd. Surfaced across stdlib's det /
    determinant / eig / qr clusters whose examples invoke
    'det(reshape([real(dp)::1,2,3,4], [2,2]))'.
    
    Consult type_spec first; fall back to first-element inference only when
    no type_spec is present.
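
    The selection order can be sketched as follows (the enum and helper
    names are illustrative, not the compiler's):

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum ElemTy { I32, RealSp, RealDp }

fn elem_size(t: ElemTy) -> usize {
    match t {
        ElemTy::I32 | ElemTy::RealSp => 4,
        ElemTy::RealDp => 8,
    }
}

// The declared '[T :: ...]' type wins; first-element inference is only
// a fallback when no type_spec is present (F2018 §7.8).
fn constructor_elem_ty(type_spec: Option<ElemTy>, first_value_ty: ElemTy) -> ElemTy {
    type_spec.unwrap_or(first_value_ty)
}
```

    For `[real(dp) :: 1, 2, 3, 4]` this yields elem_size 8 even though
    the first value is a 4-byte integer literal.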
    mfwolffe committed
  3. Accept non-allocatable source in afs_prepare_array_copy
    F2018 §9.7.1.2: SOURCE-expr in ALLOCATE need only be a value of the
    right type/kind/shape — it doesn't have to itself be an ALLOCATABLE.
    The common stdlib pattern is
    
        pure module function det(a) result(d)
          real(dp), intent(in) :: a(:,:)        ! assumed-shape dummy
          real(dp), allocatable :: amat(:,:)
          allocate(amat(size(a,1), size(a,2)), source=a)
    
    afs_prepare_array_copy required both dest.is_allocated() AND
    source.is_allocated().  Assumed-shape dummies carry flags=CONTIGUOUS
    only — they're bound to the caller's data, not owned — so
    source.is_allocated() returned false, the routine freed the
    fresh dest buffer, zeroed dest.base_addr, and the next read of
    amat(1,1) faulted.  Surfaced as SEGV across stdlib's det / determinant
    / eig / qr / lstsq / solve_chol / solve_custom clusters.
    
    Replace source.is_allocated() with !source.base_addr.is_null():
    the source is valid as long as it points to data, regardless of
    whether it owns it.
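
    The relaxed check can be sketched like this (struct and field names
    are illustrative, not the runtime's actual descriptor layout):

```rust
struct Descriptor {
    base_addr: *const u8,
    owns_data: bool, // set only when this descriptor allocated the buffer
}

impl Descriptor {
    fn is_allocated(&self) -> bool {
        self.owns_data && !self.base_addr.is_null()
    }
    // SOURCE= only needs data to copy from; ownership is irrelevant.
    fn is_valid_copy_source(&self) -> bool {
        !self.base_addr.is_null()
    }
}
```

    An assumed-shape dummy borrows the caller's buffer (`owns_data`
    false), so the old `is_allocated()` gate rejected it even though it
    points at perfectly good data.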
    mfwolffe committed

Commits on May 7, 2026

  1. Extract base from descriptor actual passed to bare-pointer dummy
    lower_arg_by_ref_full's tail path evaluates the actual via
    lower_expr_full and returns the value as-is when it's a pointer.  But
    for expressions that lower_expr_full yields as a 384-byte descriptor
    (array sections, array binops, array-result intrinsics), the callee,
    which under the assumed-size / explicit-shape ABI expects a bare
    element pointer, would read the descriptor's first 8 bytes (the
    base_addr field) as if they were the array's first element.
    Empirically this surfaced post-db04b9d as bounds-check failures of the
    form 'index <huge> outside [1, n]' when stdlib's solve/lapack_getrf
    chain was rebuilt with the WIP-aware ABI.
    
    Detect Ptr<[i8; 384]> at the tail and load through to extract the
    descriptor's base_addr before returning it as the bare-pointer arg.
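
    Illustratively (assuming a 64-bit little-endian target; the helper is
    hypothetical):

```rust
// base_addr is the descriptor's first field, bytes 0..8; the bare-pointer
// ABI wants that value, not the descriptor's own address.
fn bare_pointer_from_descriptor(descriptor: &[u8; 384]) -> usize {
    usize::from_le_bytes(descriptor[0..8].try_into().unwrap())
}
```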
    mfwolffe committed
  2. Re-apply assumed-size bare-pointer ABI lost in lower.rs split
    The lower.rs → lower/core.rs split (71a0cc3) silently dropped commit
    db04b9d's fix when the file was extracted: ArraySpec::AssumedSize was
    re-added to the descriptor-using set in arg_uses_descriptor_from_decls.
    Per F2018 §15.5.2.4 an assumed-size dummy 'a(lda,*)' is passed as a
    bare element pointer; flagging it as descriptor-bearing made every
    'a(i,j)' reference go through array_descriptor_addr → descriptor base,
    yielding descriptor_base + 16 (= upper-half of base_addr field) instead
    of the actual element. Restore db04b9d and add an explanatory comment
    so a future refactor doesn't drop it again.
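
    The membership rule can be sketched as follows (enum variants are
    illustrative; the real set lives in arg_uses_descriptor_from_decls):

```rust
enum ArraySpec { ExplicitShape, AssumedShape, AssumedSize, DeferredShape }

fn arg_uses_descriptor(spec: &ArraySpec) -> bool {
    match spec {
        // F2018 §15.5.2.4: 'a(lda,*)' (and explicit-shape dummies) take
        // a bare element pointer; keep them OUT of the descriptor set.
        ArraySpec::AssumedSize | ArraySpec::ExplicitShape => false,
        ArraySpec::AssumedShape | ArraySpec::DeferredShape => true,
    }
}
```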
    mfwolffe committed
  3. Merge pull request #25 from FortranGoingOnForty/compiler-edges
    compiler-edges: stdlib hash unblocker + descriptor/storage drill (10 commits)
    Matthew Forrester Wolffe committed
  4. Reject non-array-intrinsic callees at top of lower_array_intrinsic
    lower_array_intrinsic dispatched on `name` with a match late in the
    function but materialised the first arg's descriptor (alloca + memset
    + afs_create_section) unconditionally before that match — so calling
    it on a user-procedure name would still emit a 384-byte throwaway
    descriptor before returning None.
    
    expr.rs's FunctionCall handler reaches lower_array_intrinsic from
    two places (the `!has_named_interface` arm and the post-generic-
    resolve fallback), so for any non-generic non-intrinsic call (e.g.
    `pick(key(0:))`) each section actual lowered THREE times: twice as
    unused descriptors emitted before this dispatcher returned None,
    and once for the legitimate ref_arg_vals descriptor passed to the
    real call.
    
    Bail at the top of lower_array_intrinsic when `name` is not one of
    the array intrinsics it actually handles
    (size/lbound/ubound/shape/allocated/sum/product/maxval/minval/
    maxloc/minloc/matmul/dot_product/transpose/huge/tiny/epsilon/
    precision/range/digits/norm2).
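
    The early bail can be sketched as (the name list is transcribed from
    above; the surrounding dispatch is elided):

```rust
const ARRAY_INTRINSICS: &[&str] = &[
    "size", "lbound", "ubound", "shape", "allocated", "sum", "product",
    "maxval", "minval", "maxloc", "minloc", "matmul", "dot_product",
    "transpose", "huge", "tiny", "epsilon", "precision", "range",
    "digits", "norm2",
];

// Checked before any argument descriptor is materialised, so a
// user-procedure name costs nothing here.
fn is_handled_array_intrinsic(name: &str) -> bool {
    ARRAY_INTRINSICS.contains(&name.to_ascii_lowercase().as_str())
}
```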
    
    Result on stdlib_hash_32bit_water.f90 (cumulative with eeee0e5 +
    0592c14):
      trunk baseline:  123 GB peak (OOM, uncompilable)
      after eeee0e5:    26.6 GB,  18.4 s
      after 0592c14:    922 MB,    1.49 s
      after this fix:    35 MB,    0.22 s    (~3500x lighter than trunk)
    afs_create_section calls in -S: 5647 → 982 → 126.
    
    cli_driver 579/579 PASS.  Regression test threshold tightened from
    60 to 24 emissions for an 8-section fixture (observed-good: 14).
    mfwolffe committed
  5. Skip resolution and intrinsic arg-probes when no caller can use them
    expr.rs's FunctionCall lowering eagerly built two probe-arg-vec slices
    before any branch decision:
    * resolution_arg_vals — only consumed by resolve_generic_call_actuals
      (which short-circuits to None if the callee isn't a NamedInterface)
      and the structure-ctor fallback (gated on has_named_interface);
    * intrinsic_arg_vals — only consumed by lower_intrinsic, which matches
      on a fixed set of intrinsic names and returns None otherwise.
    
    For a non-generic non-procptr non-intrinsic callee both slices are
    discarded, but lowering each arg cost a section-descriptor
    materialization (alloca + memset + afs_create_section) per array
    section actual.  Inside nested intrinsic chains that compounded
    multiplicatively with the third "real" lowering in ref_arg_vals.
    
    stdlib_hash_32bit_water.f90's water_hash inner loop —
    `ieor(waterr32(key(i:)), waterp1)` repeated 4 times across 16 SELECT
    CASE arms — produced 5647 `afs_create_section` calls and a 26 GB
    compile peak.  Gate the resolution probe behind has_named_interface ||
    procptr_target.is_some(), and route the intrinsic probe through
    sema::validate::is_intrinsic_name(&key).
    
    Result on stdlib_hash_32bit_water.f90:
    * compile peak: 26.6 GB → 922 MB (~28x)
    * wall time:    18.4s → 1.49s (~12x)
    * afs_create_section calls in -S: 5647 → 982 (~5.7x)
    
    Combined with the earlier eeee0e5 the same file is now 133x lighter
    than its trunk-baseline 123 GB peak.  cli_driver 578/578 PASS.
    
    Adds an asm-level regression test that compiles a fixed-shape
    ieor/user-call chain and asserts the per-source-section descriptor
    emission ratio stays below the multiplicative-probe regime.
    mfwolffe committed
  6. Type ComplexBuffer ABI return temp as [fN x 2] so binops see complex
    The caller-side hidden-output buffer for ComplexBuffer ABI returns was
    allocated as `[i8 x 8]` / `[i8 x 16]`, making the call's return
    value `Ptr<[i8 x N]>`.  is_complex_ty only recognises `[fN x 2]`
    or `Ptr<[fN x 2]>`, so for `complex_local - complex_call(...)` the
    binop's complex-arithmetic branch did not fire — execution fell to
    the int/float promotion path and emitted `fsub %ptr<[i8 x 8]>` against
    the buffer pointer.  IR-verify rejected with `float op has non-float
    operand : ptr<[i8 x 8]>`, blocking stdlib_lapack_solve_chol_comp's
    CPOTF2/ZPOTF2:
        ajj = real( real(a(j,j),sp) - cdotc(...), sp )
    
    Type the alloca as `[fN x 2]` (sized by the existing kind-aware
    hidden_result_temp_bytes_for_callee), with N=4 for sp (8 bytes) and
    N=8 for dp (16 bytes).  Both the Name-callee and the type-bound
    Component-callee paths get the same treatment.  is_complex_ty,
    materialize_complex_operand, and the binop branch all then recognise
    the buffer as a complex pair and emit lane-wise fadd/fsub/fmul
    correctly.
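
    A sketch of the size-to-type mapping (names hypothetical; the 8/16
    byte counts come from the kind-aware hidden_result_temp_bytes_for_callee
    mentioned above):

```rust
#[derive(PartialEq, Debug)]
enum TempTy { F32Pair, F64Pair } // stands in for [f32 x 2] / [f64 x 2]

fn complex_return_temp_ty(hidden_result_bytes: usize) -> Option<TempTy> {
    match hidden_result_bytes {
        8 => Some(TempTy::F32Pair),  // complex(sp): two f32 lanes
        16 => Some(TempTy::F64Pair), // complex(dp): two f64 lanes
        _ => None,                   // not a complex return temp
    }
}
```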
    mfwolffe committed
  7. Memcpy ComplexBuffer-ABI return into ALLOCATE source= scalar slot
    emit_scalar_allocate_source_init_on_success used to pipe the source
    expression through coerce_to_type(raw, dest_ty) → b.store(coerced,
    dest_base).  When the source is a scalar complex(sp/dp) function call,
    lower_expr returns a pointer to the ComplexBuffer the callee wrote
    into — typed Ptr<[f32/f64 x 2]>, Ptr<[i8 x 8/16]>, or bare Ptr<i8>.
    coerce_to_type has no Ptr→Array path so it returned the pointer
    unchanged, and b.store then tried to write a pointer-sized value
    into a [f32/f64 x 2] slot.  IR-verify rejected the store with
    `value type ptr<[i8 x 8]> doesn't match pointee type [f32 x 2]`,
    leaving stdlib_stats_moment_mask uncompilable
    (`allocate(mean_, source = mean(x, 1, mask))` where mean returns
    scalar complex(sp) — the same shape across roughly a dozen routines).
    
    Recognise the buffer-pointer return at the source-init site and
    memcpy the lane pair from the buffer to the freshly allocated
    destination slot, parallel to the assignment-from-complex-call path
    elsewhere.  Regression test gates compilation only — the runtime
    path through allocatable complex scalars hits a separate
    pre-existing bug in afs_assign_allocatable / real(m_) reads on
    complex allocatables that surfaces once IR-verify is no longer
    masking it.
    mfwolffe committed
  8. Skip redundant lower_array_expr_descriptor for scalar user-function probes
    generic_dispatch_probe_value already calls array_function_result_elem_type
    at the top.  When that returns None for a Name(callee) FunctionCall and
    the callee is neither a transformational array intrinsic (pack/reshape/
    sum/merge/matmul/transpose/conjg/aimag/abs/cmplx/shape/transfer/dimag)
    nor a local array, the subsequent lower_array_expr_descriptor call just
    walks named-intrinsic match arms (all miss), then redundantly calls
    array_function_result_elem_type a second time inside
    lower_array_function_result_descriptor, only to return None.  That
    second invocation recursively probes args, and the arg-probes themselves
    re-run lower_array_expr_descriptor — O(2^depth) for nested calls.
    
    stdlib_hash_32bit_water.f90's water_hash inner loop nests four user
    function calls deep across 16 SELECT CASE arms; the compile peak ran
    to ~123 GB and the kernel SIGKILL'd the process under memory pressure.
    
    Skipping the redundant path here drops the same compile to ~26 GB peak
    (still high, but no longer triggers OOM).  The
    internal_subprogram_call_under_intrinsic_under_user_call_keeps_mangled_name
    regression remains green: lower_expr_full at the bottom of the probe
    does the real evaluation with internal_funcs threaded through, so
    internal CONTAINS-block subprograms keep their mangled link names.
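
    A toy model of the blow-up (one nested call per level; the real code
    also fans out across multiple array-section arguments):

```rust
// Unguarded, each level probes its nested call twice: once in the
// dispatcher's match arms and once in the redundant
// lower_array_function_result_descriptor path. Skipping the redundant
// path makes the probe count linear in nesting depth instead of O(2^depth).
fn probe_count(depth: u32, redundant_path_skipped: bool) -> u64 {
    if depth == 0 {
        return 1;
    }
    let fanout: u64 = if redundant_path_skipped { 1 } else { 2 };
    1 + fanout * probe_count(depth - 1, redundant_path_skipped)
}
```

    At depth 4 the unguarded count is already 31 versus 5 guarded; with
    several section actuals per call the real growth compounds further.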
    mfwolffe committed
  9. Use column-major running strides in afs_allocate_like_with_elem_size
    Setting every dim's stride to 1 worked for the dim[0]-only flat
    iteration paths but produced colliding byte offsets in every per-dim
    consumer.
    For `m = (y > 3.)` over `real :: y(2,3)` the flat compare loop wrote
    m[0..5] correctly (it iterates over dim[0].stride * elem_size), but
    the masked sum-along-dim helper indexed mask via
    `Σ idx_d * dim[d].stride * elem_size` and the all-1 strides made
    distinct (i,j) tuples land on the same byte — only 4 of 6 mask bytes
    were actually consulted, so `sum(y, 1, y > 3.)` quietly dropped the
    column-2 mask hit and returned [0, 0, 6] instead of [0, 4, 11].
    Match the column-major running stride convention that
    `materialize_array_descriptor_for_info` already uses.
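
    The convention can be sketched as:

```rust
// Column-major running strides (in elements): dim 0 has stride 1, and
// each later dim's stride is the previous stride times the previous
// extent, so distinct index tuples map to distinct byte offsets.
fn column_major_strides(extents: &[i64]) -> Vec<i64> {
    let mut strides = Vec::with_capacity(extents.len());
    let mut running = 1i64;
    for &extent in extents {
        strides.push(running);
        running *= extent;
    }
    strides
}
```

    For `y(2,3)` this yields strides [1, 2]: zero-based (i, j) maps to
    byte (i + 2*j) * elem_size, unlike the all-1 strides that sent
    distinct tuples to the same byte.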
    mfwolffe committed
  10. Add masked sum-along-dim helpers and route pack mask through mask_byte_is_true
    Adds `afs_array_sum_real8_dim_mask` and `afs_array_sum_int_dim_mask`,
    plus a shared `for_each_reduce_along_dim_with_mask` traversal that
    honors both source and mask per-dim strides, and a small
    `mask_byte_is_true(mask, byte_off)` helper that dispatches on the
    mask's `elem_size` (1, 2, 4, or 8 — `logical(int8)` through
    `logical(int64)` and the default 1-byte bool storage all reach the
    same predicate).
    
    `afs_array_pack` previously read the mask via a fixed `as *const i32`
    load, which crashed with a misaligned dereference once
    `elem_size=1` started flowing through for default logical arrays —
    switched to `mask_byte_is_true` so it works for every kind.
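
    A minimal version of the predicate, assuming the usual
    nonzero-storage-means-true convention for logical values:

```rust
// Scan the elem_size bytes backing one logical element; any nonzero
// byte means .true., so kinds 1, 2, 4, and 8 all reach one predicate
// with no aligned wide load required.
fn mask_byte_is_true(mask: &[u8], byte_off: usize, elem_size: usize) -> bool {
    mask[byte_off..byte_off + elem_size].iter().any(|&b| b != 0)
}
```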
    mfwolffe committed
  11. Align logical descriptor elem_size with bool storage and route sum(dim,mask)
    Two coupled changes:
    
    1. `ir_scalar_byte_size` and `descriptor_element_size_bytes` now
       return 1 for `IrType::Bool`, matching `IrType::Bool::size_bytes()`
       and the bytes-per-element layout that `alloca [Bool x N]` actually
       produces. The previous 4-byte report only made sense when the
       scalar type was widened to a 4-byte slot, which never happened —
       storage stayed at 1 byte. The mismatch silently broke every
       consumer of the descriptor's `elem_size` for logical arrays:
       `mask_at`, `afs_array_sum_real8_mask`, the new `_dim_mask`
       helpers, and the whole-array broadcast loop all stepped 4×
       past real data.  `sum(y, mask=m)` returned the unmasked sum,
       `sum(y, dim, mask)` returned the unmasked column sums, and
       `arr(i) = .true.` for `logical :: arr(N)` wrote 3 bytes past
       the slot. The whole-array broadcast loop now uses
       `ir_scalar_byte_size` directly so they stay paired.
    
    2. `lower_array_sum_dim_descriptor` no longer bails when a mask is
       present; it lowers the mask actual into a descriptor and
       dispatches to `afs_array_sum_{real8,int}_dim_mask`. Surfaced in
       `example_var`'s `var(y, 1, y > 3.)` line, which previously fell
       through to a scalar broadcast and crashed in
       `afs_assign_allocatable` with a misaligned source pointer.
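
    Sketch of the pairing (enum trimmed to three scalar types):

```rust
#[derive(Clone, Copy)]
enum IrType { Bool, I32, F64 }

// The reported scalar byte size must match the storage that
// `alloca [Bool x N]` actually produces: 1 byte per element. Reporting
// 4 made every elem_size consumer step 4x past the real data.
fn ir_scalar_byte_size(t: IrType) -> usize {
    match t {
        IrType::Bool => 1,
        IrType::I32 => 4,
        IrType::F64 => 8,
    }
}
```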
    mfwolffe committed
  12. Merge pull request #24 from FortranGoingOnForty/regalloc-phi-resolution
    Refuse to split phi-like vregs in linearscan; remove regalloc gates
    Matthew Forrester Wolffe committed