Commits


Commits on May 7, 2026

  1. Extract base from descriptor actual passed to bare-pointer dummy
    lower_arg_by_ref_full's tail path evaluates the actual via
    lower_expr_full and returns the value as-is when it's a pointer.  But
    for shapes that lower_expr_full returns as a 384-byte descriptor
    (array sections, array binops, array-result intrinsics), the callee —
    which under the assumed-size / explicit-shape ABI expects a bare element
    pointer — would read the descriptor's first 8 bytes (= base_addr field)
    as if they were the array's first element.  Empirically this surfaced
    post-db04b9d as bounds-check failures of the form 'index <huge> outside
    [1, n]' when stdlib's solve/lapack_getrf chain was rebuilt with the
    WIP-aware ABI.
    
    Detect Ptr<[i8; 384]> at the tail and load through to extract the
    descriptor's base_addr before returning it as the bare-pointer arg.
    mfwolffe committed
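The "load through" step the fix describes can be sketched in plain Rust over an assumed descriptor layout (base_addr occupying the first 8 bytes of the 384-byte descriptor; `extract_base_addr` is a hypothetical name, not the compiler's API):

```rust
// Sketch, assuming base_addr is the descriptor's first 8-byte field:
// instead of handing the callee the descriptor pointer itself (whose
// first bytes it would misread as element data), load base_addr out
// of the descriptor and pass that as the bare element pointer.
fn extract_base_addr(descriptor: &[u8; 384]) -> u64 {
    // Little-endian load of bytes 0..8 — the base_addr field.
    u64::from_le_bytes(descriptor[0..8].try_into().unwrap())
}

fn main() {
    let mut desc = [0u8; 384];
    desc[0..8].copy_from_slice(&0x1000u64.to_le_bytes());
    assert_eq!(extract_base_addr(&desc), 0x1000);
    println!("base_addr = {:#x}", extract_base_addr(&desc));
}
```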
  2. Re-apply assumed-size bare-pointer ABI lost in lower.rs split
    The lower.rs → lower/core.rs split (71a0cc3) silently dropped commit
    db04b9d's fix when the file was extracted: ArraySpec::AssumedSize was
    re-added to the descriptor-using set in arg_uses_descriptor_from_decls.
    Per F2018 §15.5.2.4 an assumed-size dummy 'a(lda,*)' is passed as a
    bare element pointer; flagging it as descriptor-bearing made every
    'a(i,j)' reference go through array_descriptor_addr → descriptor base,
    yielding descriptor_base + 16 (= upper-half of base_addr field) instead
    of the actual element. Restores db04b9d and adds an explanatory comment
    so a future refactor doesn't drop it again.
    mfwolffe committed
  3. Merge pull request #25 from FortranGoingOnForty/compiler-edges
    compiler-edges: stdlib hash unblocker + descriptor/storage drill (10 commits)
    Matthew Forrester Wolffe committed
  4. Reject non-array-intrinsic callees at top of lower_array_intrinsic
    lower_array_intrinsic dispatched on `name` with a match late in the
    function but materialised the first arg's descriptor (alloca + memset
    + afs_create_section) unconditionally before that match — so calling
    it on a user-procedure name would still emit a 384-byte throwaway
    descriptor before returning None.
    
    expr.rs's FunctionCall handler reaches lower_array_intrinsic from
    two places (the `!has_named_interface` arm and the post-generic-
    resolve fallback), so for any non-generic non-intrinsic call (e.g.
`pick(key(0:))`) each section actual was lowered THREE times: twice as
    unused descriptors emitted before this dispatcher returned None,
    and once for the legitimate ref_arg_vals descriptor passed to the
    real call.
    
    Bail at the top of lower_array_intrinsic when `name` is not one of
    the array intrinsics it actually handles
    (size/lbound/ubound/shape/allocated/sum/product/maxval/minval/
    maxloc/minloc/matmul/dot_product/transpose/huge/tiny/epsilon/
    precision/range/digits/norm2).
    
    Result on stdlib_hash_32bit_water.f90 (cumulative with eeee0e5 +
    0592c14):
      trunk baseline:  123 GB peak (OOM, uncompilable)
      after eeee0e5:    26.6 GB,  18.4 s
      after 0592c14:    922 MB,    1.49 s
      after this fix:    35 MB,    0.22 s    (~3500x lighter than trunk)
    afs_create_section calls in -S: 5647 → 982 → 126.
    
    cli_driver 579/579 PASS.  Regression test threshold tightened from
    60 to 24 emissions for an 8-section fixture (observed-good: 14).
    mfwolffe committed
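The early-bail guard can be sketched as a membership test against the fixed name set from the commit message (`is_array_intrinsic` is an illustrative name, not necessarily the function added):

```rust
// Sketch of the top-of-function guard: the dispatcher handles only this
// fixed set of array intrinsics, so any other callee name bails before a
// throwaway 384-byte descriptor can be materialised.
const ARRAY_INTRINSICS: &[&str] = &[
    "size", "lbound", "ubound", "shape", "allocated", "sum", "product",
    "maxval", "minval", "maxloc", "minloc", "matmul", "dot_product",
    "transpose", "huge", "tiny", "epsilon", "precision", "range",
    "digits", "norm2",
];

fn is_array_intrinsic(name: &str) -> bool {
    ARRAY_INTRINSICS.contains(&name.to_ascii_lowercase().as_str())
}

fn main() {
    // A user-procedure callee like `pick` is rejected up front.
    assert!(!is_array_intrinsic("pick"));
    assert!(is_array_intrinsic("dot_product"));
}
```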
  5. Skip resolution and intrinsic arg-probes when no caller can use them
    expr.rs's FunctionCall lowering eagerly built two probe-arg-vec slices
    before any branch decision:
    * resolution_arg_vals — only consumed by resolve_generic_call_actuals
      (which short-circuits to None if the callee isn't a NamedInterface)
      and the structure-ctor fallback (gated on has_named_interface);
    * intrinsic_arg_vals — only consumed by lower_intrinsic, which matches
      on a fixed set of intrinsic names and returns None otherwise.
    
    For a non-generic non-procptr non-intrinsic callee both slices are
    discarded, but lowering each arg cost a section-descriptor
    materialization (alloca + memset + afs_create_section) per array
    section actual.  Inside nested intrinsic chains that compounded
    multiplicatively with the third "real" lowering in ref_arg_vals.
    
    stdlib_hash_32bit_water.f90's water_hash inner loop —
    `ieor(waterr32(key(i:)), waterp1)` repeated 4 times across 16 SELECT
    CASE arms — produced 5647 `afs_create_section` calls and a 26 GB
    compile peak.  Gate the resolution probe behind has_named_interface ||
    procptr_target.is_some(), and route the intrinsic probe through
    sema::validate::is_intrinsic_name(&key).
    
    Result on stdlib_hash_32bit_water.f90:
    * compile peak: 26.6 GB → 922 MB (~28x)
    * wall time:    18.4s → 1.49s (~12x)
    * afs_create_section calls in -S: 5647 → 982 (~5.7x)
    
Combined with the earlier eeee0e5, the same file is now 133x lighter
    than its trunk-baseline 123 GB peak.  cli_driver 578/578 PASS.
    
    Adds an asm-level regression test that compiles a fixed-shape
    ieor/user-call chain and asserts the per-source-section descriptor
    emission ratio stays below the multiplicative-probe regime.
    mfwolffe committed
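The gating can be illustrated with a small cost model (hypothetical function, not compiler code) counting how many times each section actual gets lowered under the gated scheme:

```rust
// Illustrative cost model of the probe gating: each probe vector now
// lowers the actuals only when a consumer can actually use the result.
fn lowerings_per_section_actual(
    has_named_interface: bool,
    has_procptr_target: bool,
    is_intrinsic: bool,
) -> u32 {
    let mut count = 1; // the real ref_arg_vals lowering always happens
    if has_named_interface || has_procptr_target {
        count += 1; // resolution_arg_vals probe
    }
    if is_intrinsic {
        count += 1; // intrinsic_arg_vals probe
    }
    count
}

fn main() {
    // Before the fix every call paid all three probes unconditionally;
    // gated, a non-generic non-intrinsic user call pays only one.
    assert_eq!(lowerings_per_section_actual(false, false, false), 1);
    assert_eq!(lowerings_per_section_actual(true, false, true), 3);
}
```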
  6. Type ComplexBuffer ABI return temp as [fN x 2] so binops see complex
    The caller-side hidden-output buffer for ComplexBuffer ABI returns was
    allocated as `[i8 x 8]` / `[i8 x 16]`, making the call's return
    value `Ptr<[i8 x N]>`.  is_complex_ty only recognises `[fN x 2]`
    or `Ptr<[fN x 2]>`, so for `complex_local - complex_call(...)` the
    binop's complex-arithmetic branch did not fire — execution fell to
    the int/float promotion path and emitted `fsub %ptr<[i8 x 8]>` against
    the buffer pointer.  IR-verify rejected with `float op has non-float
    operand : ptr<[i8 x 8]>`, blocking stdlib_lapack_solve_chol_comp's
    CPOTF2/ZPOTF2:
        ajj = real( real(a(j,j),sp) - cdotc(...), sp )
    
    Type the alloca as `[fN x 2]` (sized by the existing kind-aware
    hidden_result_temp_bytes_for_callee), with N=4 for sp (8 bytes) and
    N=8 for dp (16 bytes).  Both the Name-callee and the type-bound
    Component-callee paths get the same treatment.  is_complex_ty,
    materialize_complex_operand, and the binop branch all then recognise
    the buffer as a complex pair and emit lane-wise fadd/fsub/fmul
    correctly.
    mfwolffe committed
  7. Memcpy ComplexBuffer-ABI return into ALLOCATE source= scalar slot
    emit_scalar_allocate_source_init_on_success used to pipe the source
    expression through coerce_to_type(raw, dest_ty) → b.store(coerced,
    dest_base).  When the source is a scalar complex(sp/dp) function call,
    lower_expr returns a pointer to the ComplexBuffer the callee wrote
    into — typed Ptr<[f32/f64 x 2]>, Ptr<[i8 x 8/16]>, or bare Ptr<i8>.
    coerce_to_type has no Ptr→Array path so it returned the pointer
    unchanged, and b.store then tried to write a pointer-sized value
    into a [f32/f64 x 2] slot.  IR-verify rejected the store with
    `value type ptr<[i8 x 8]> doesn't match pointee type [f32 x 2]`,
    leaving stdlib_stats_moment_mask uncompilable
    (`allocate(mean_, source = mean(x, 1, mask))` where mean returns
    scalar complex(sp) — the same shape across roughly a dozen routines).
    
    Fix recognises the buffer-pointer return at the source-init site and
    memcpys the lane pair from the buffer to the freshly allocated
    destination slot, parallel to the assignment-from-complex-call path
    elsewhere.  Regression test gates compilation only — the runtime
    path through allocatable complex scalars hits a separate
    pre-existing bug in afs_assign_allocatable / real(m_) reads on
    complex allocatables that surfaces once IR-verify is no longer
    masking it.
    mfwolffe committed
  8. Skip redundant lower_array_expr_descriptor for scalar user-function probes
    generic_dispatch_probe_value already calls array_function_result_elem_type
    at the top.  When that returns None for a Name(callee) FunctionCall and
    the callee is neither a transformational array intrinsic (pack/reshape/
    sum/merge/matmul/transpose/conjg/aimag/abs/cmplx/shape/transfer/dimag)
    nor a local array, the subsequent lower_array_expr_descriptor call just
    walks named-intrinsic match arms (all miss), then redundantly calls
    array_function_result_elem_type a second time inside
    lower_array_function_result_descriptor, only to return None.  That
    second invocation recursively probes args, and the arg-probes themselves
    re-run lower_array_expr_descriptor — O(2^depth) for nested calls.
    
    stdlib_hash_32bit_water.f90's water_hash inner loop nests four user
    function calls deep across 16 SELECT CASE arms; the compile peak ran
    to ~123 GB and the kernel SIGKILL'd the process under memory pressure.
    
    Skipping the redundant path here drops the same compile to ~26 GB peak
    (still high, but no longer triggers OOM).  The
    internal_subprogram_call_under_intrinsic_under_user_call_keeps_mangled_name
    regression remains green: lower_expr_full at the bottom of the probe
    does the real evaluation with internal_funcs threaded through, so
    internal CONTAINS-block subprograms keep their mangled link names.
    mfwolffe committed
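The O(2^depth) blow-up can be sketched with a toy recurrence (not the compiler's code): the redundant path re-probes every argument, so each nesting level of user calls doubles the probe count, while the skipped path probes each level once.

```rust
// Toy cost model of the redundant arg-probing: with the duplicate
// lower_array_expr_descriptor path, each nesting level re-runs the
// probe on its argument twice — exponential in call depth.
fn probes_redundant(depth: u32) -> u64 {
    if depth == 0 { 1 } else { 2 * probes_redundant(depth - 1) }
}

// With the redundant path skipped, each level is probed once.
fn probes_skipped(depth: u32) -> u64 {
    depth as u64 + 1
}

fn main() {
    // Four user calls deep, as in water_hash's inner loop:
    assert_eq!(probes_redundant(4), 16);
    assert_eq!(probes_skipped(4), 5);
}
```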
  9. Use column-major running strides in afs_allocate_like_with_elem_size
    Setting every dim's stride to 1 worked for the dim[0]-only flat
    iteration paths but collided byte offsets in every per-dim consumer.
    For `m = (y > 3.)` over `real :: y(2,3)` the flat compare loop wrote
    m[0..5] correctly (it iterates over dim[0].stride * elem_size), but
    the masked sum-along-dim helper indexed mask via
    `Σ idx_d * dim[d].stride * elem_size` and the all-1 strides made
    distinct (i,j) tuples land on the same byte — only 4 of 6 mask bytes
    were actually consulted, so `sum(y, 1, y > 3.)` quietly dropped the
    column-2 mask hit and returned [0, 0, 6] instead of [0, 4, 11].
    Match the column-major running stride convention that
    `materialize_array_descriptor_for_info` already uses.
    mfwolffe committed
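The column-major running-stride convention, and the offset collision the all-1 strides caused, can be sketched directly (helper names here are illustrative):

```rust
// Column-major running strides: dim d's stride is the product of all
// lower dims' extents, so for real :: y(2,3) the strides are [1, 2].
fn column_major_strides(extents: &[i64]) -> Vec<i64> {
    let mut strides = Vec::with_capacity(extents.len());
    let mut running = 1;
    for &extent in extents {
        strides.push(running);
        running *= extent;
    }
    strides
}

// The per-dim consumers' addressing: Σ idx_d * dim[d].stride * elem_size.
fn byte_offset(idx: &[i64], strides: &[i64], elem_size: i64) -> i64 {
    idx.iter().zip(strides).map(|(i, s)| i * s * elem_size).sum()
}

fn main() {
    let strides = column_major_strides(&[2, 3]);
    assert_eq!(strides, vec![1, 2]);
    // All-1 strides collapse distinct (i,j) tuples onto the same byte:
    assert_eq!(byte_offset(&[0, 1], &[1, 1], 1), byte_offset(&[1, 0], &[1, 1], 1));
    // Running strides give every tuple a distinct offset:
    assert_ne!(byte_offset(&[0, 1], &strides, 1), byte_offset(&[1, 0], &strides, 1));
}
```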
  10. Add masked sum-along-dim helpers and route pack mask through mask_byte_is_true
    Adds `afs_array_sum_real8_dim_mask` and `afs_array_sum_int_dim_mask`,
    plus a shared `for_each_reduce_along_dim_with_mask` traversal that
    honors both source and mask per-dim strides, and a small
    `mask_byte_is_true(mask, byte_off)` helper that dispatches on the
    mask's `elem_size` (1, 2, 4, or 8 — `logical(int8)` through
    `logical(int64)` and the default 1-byte bool storage all reach the
    same predicate).
    
    `afs_array_pack` previously read the mask via a fixed `as *const i32`
    load, which crashed with a misaligned dereference once
    `elem_size=1` started flowing through for default logical arrays —
    switched to `mask_byte_is_true` so it works for every kind.
    mfwolffe committed
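A minimal sketch of the `mask_byte_is_true` predicate, assuming the semantics described above (any nonzero value of the elem_size-wide logical is `.true.`, which is endianness-independent when tested bytewise):

```rust
// Sketch of mask_byte_is_true: read elem_size bytes at byte_off from the
// mask's storage and treat any nonzero content as true — one predicate
// covering logical(int8) through logical(int64) and default 1-byte bools,
// with no fixed-width (and potentially misaligned) *const i32 load.
fn mask_byte_is_true(mask: &[u8], byte_off: usize, elem_size: usize) -> bool {
    mask[byte_off..byte_off + elem_size].iter().any(|&b| b != 0)
}

fn main() {
    // Default logical array: elem_size = 1.
    let mask1 = [1u8, 0, 1];
    assert!(mask_byte_is_true(&mask1, 0, 1));
    assert!(!mask_byte_is_true(&mask1, 1, 1));
    // logical(int32): elem_size = 4, little-endian .true. = 1.
    let mask4 = [1u8, 0, 0, 0, 0, 0, 0, 0];
    assert!(mask_byte_is_true(&mask4, 0, 4));
    assert!(!mask_byte_is_true(&mask4, 4, 4));
}
```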
  11. Align logical descriptor elem_size with bool storage and route sum(dim,mask)
    Two coupled changes:
    
    1. `ir_scalar_byte_size` and `descriptor_element_size_bytes` now
       return 1 for `IrType::Bool`, matching `IrType::Bool::size_bytes()`
       and the bytes-per-element layout that `alloca [Bool x N]` actually
       produces. The previous 4-byte report only made sense when the
       scalar type was widened to a 4-byte slot, which never happened —
       storage stayed at 1 byte. The mismatch silently broke every
       consumer of the descriptor's `elem_size` for logical arrays:
       `mask_at`, `afs_array_sum_real8_mask`, the new `_dim_mask`
       helpers, and the whole-array broadcast loop all stepped 4×
       past real data.  `sum(y, mask=m)` returned the unmasked sum,
       `sum(y, dim, mask)` returned the unmasked column sums, and
       `arr(i) = .true.` for `logical :: arr(N)` wrote 3 bytes past
       the slot. The whole-array broadcast loop now uses
       `ir_scalar_byte_size` directly so they stay paired.
    
    2. `lower_array_sum_dim_descriptor` no longer bails when a mask is
       present; it lowers the mask actual into a descriptor and
       dispatches to `afs_array_sum_{real8,int}_dim_mask`. Surfaced in
       `example_var`'s `var(y, 1, y > 3.)` line, which previously fell
       through to a scalar broadcast and crashed in
       `afs_assign_allocatable` with a misaligned source pointer.
    mfwolffe committed
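The 4x over-stepping that the elem_size mismatch caused can be illustrated with a toy offset calculation (hypothetical helper, not compiler code):

```rust
// Illustration of the elem_size mismatch: consumers index the mask as
// i * elem_size, so a reported elem_size of 4 over 1-byte bool storage
// steps 4x past the real data — and past the buffer for any N > 1.
fn mask_offsets(n: usize, elem_size: usize) -> Vec<usize> {
    (0..n).map(|i| i * elem_size).collect()
}

fn main() {
    let storage_len = 6; // logical :: m(6), stored at 1 byte per element
    let good = mask_offsets(6, 1);
    let bad = mask_offsets(6, 4);
    assert!(good.iter().all(|&off| off < storage_len));
    assert!(bad.iter().any(|&off| off >= storage_len)); // walks off the data
}
```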
  12. Merge pull request #24 from FortranGoingOnForty/regalloc-phi-resolution
    Refuse to split phi-like vregs in linearscan; remove regalloc gates
    Matthew Forrester Wolffe committed
  13. Refuse to split phi-like vregs (defined in multiple blocks); remove env gates
    Linearscan's live intervals are linear-position [start, end] ranges,
    which can't represent the true CFG-aware live set for vregs that
    receive values via parallel-copy at every predecessor (block params,
    post-phi placeholders). When a call block is lexically wedged
    between a phi-vreg's def edge and its use block, the position-based
    range straddles the call point but no actual control-flow path
    through the vreg traverses it. The splitter then thought the vreg
    crossed a call, picked a pre/post split, and assigned the post-half
    a different physreg — every predecessor's parallel-copy landed in
    pre_phys but every use inside the loop read post_phys.
    
    Track each vreg's set of defining blocks while building
    vreg_actual_range; if it has more than one, force real_crosses
    false (no splitting). The unit test exercising splitting still
    fires because its synthetic vregs have single defs.
    
    Removes:
    - ARMFORTAS_SPLIT_INTERVALS env gate (no longer needed)
    - detect_partial_unroll_loop's acc_param + store guard (the
      underlying regalloc bug it worked around is now fixed)
    
    Verified at all opt levels:
    - realworld_seed_overwrite.f90: 4 19 23 (was infinite loop at -O2+)
    - realworld_affine_shift.f90:   14 16   (was 14 7 at -O2+)
    mfwolffe committed
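The multi-def tracking can be sketched with standard collections (`Vreg`, `Block`, and `phi_like` are illustrative names; the real code folds this into building vreg_actual_range):

```rust
use std::collections::{HashMap, HashSet};

// Sketch: record each vreg's set of defining blocks while building live
// ranges. A vreg defined in more than one block is phi-like (block param
// or parallel-copy target), and its linear [start, end] range can
// overstate true liveness — so it must never be split.
type Vreg = u32;
type Block = u32;

fn phi_like(defs: &HashMap<Vreg, HashSet<Block>>, v: Vreg) -> bool {
    defs.get(&v).map_or(false, |blocks| blocks.len() > 1)
}

fn main() {
    let mut defs: HashMap<Vreg, HashSet<Block>> = HashMap::new();
    defs.entry(10).or_default().insert(0); // single def: splitting allowed
    defs.entry(11).or_default().extend([1, 2]); // defined at two predecessors
    assert!(!phi_like(&defs, 10));
    assert!(phi_like(&defs, 11)); // force real_crosses = false: no split
}
```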
  14. Merge pull request #23 from FortranGoingOnForty/refactor-afs
    Sprint refactor + NEON/unroll work + trunk merge
    Matthew Forrester Wolffe committed

Commits on May 6, 2026