
JIT: Restore vhaddps-based lowering for Vector256/512 floating-point Sum() #126255

Open
EgorBo wants to merge 3 commits into main from hadd-fix

Conversation

@EgorBo
Member

@EgorBo EgorBo commented Mar 28, 2026

Fixes #126250

Summary

Restore vhaddps/vhaddpd-based lowering for Vector256<float>.Sum() and Vector256<double>.Sum() (and transitively Vector512 floating-point Sum), replacing the more expensive vpermilps + vaddps decomposition that was introduced in #95568 and later expanded in e012fd4.

Root Cause

Two changes caused the regression:

  1. PR #95568 — "Updating Sum() implementation for Vector128 and Vector256 + adding lowering for Vector512" (Dec 2023) — replaced vhaddps with a vpermilps + vaddps (shuffle + vertical add) sequence, claiming hadd was "not the most efficient."
  2. Commit e012fd4 (Jun 2024) added lane-independent splitting for floating-point to "ensure deterministic results matching the software fallback." This doubled the instruction count for Vector256<float>.Sum() from 6 to 11 instructions.

The determinism concern was valid in principle — the software fallback processes each 128-bit lane independently. However, vhaddps already operates within 128-bit lanes (it is defined as two independent 128-bit horizontal adds), so it naturally preserves the same lane-by-lane addition order. The lane-splitting was unnecessary.
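To make the order-preservation argument concrete, here is a scalar model of the per-lane semantics (a sketch for illustration only — these helpers are not the JIT code): two hadd passes plus a lane combine perform exactly the same pairwise additions, in the same order, as summing each 128-bit lane independently and adding the results.

```cpp
#include <cassert>

// Scalar model of one vhaddps pass with both sources equal to v:
// within each 128-bit lane, adjacent pairs are summed.
// [a,b,c,d | e,f,g,h] -> [a+b, c+d, a+b, c+d | e+f, g+h, e+f, g+h]
static void hadd_ps(float v[8]) {
    float r[8] = { v[0] + v[1], v[2] + v[3], v[0] + v[1], v[2] + v[3],
                   v[4] + v[5], v[6] + v[7], v[4] + v[5], v[6] + v[7] };
    for (int i = 0; i < 8; i++) v[i] = r[i];
}

// The restored lowering: two hadd passes, then a lane combine.
float hadd_sum(const float in[8]) {
    float v[8];
    for (int i = 0; i < 8; i++) v[i] = in[i];
    hadd_ps(v);          // [a+b, c+d, ... | e+f, g+h, ...]
    hadd_ps(v);          // [(a+b)+(c+d), ... | (e+f)+(g+h), ...]
    return v[0] + v[4];  // vextractf128 + vaddps, then ToScalar
}

// The lane-split fallback order: sum each 128-bit half, then add.
float lane_split_sum(const float in[8]) {
    float lo = (in[0] + in[1]) + (in[2] + in[3]);
    float hi = (in[4] + in[5]) + (in[6] + in[7]);
    return lo + hi;
}
```

Even for an order-sensitive input with catastrophic cancellation, both reductions produce the same result, because the pairwise additions happen in the same tree order.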

Fix

For 256-bit floating-point Sum when AVX is available, use NI_AVX_HorizontalAdd (vhaddps/vhaddpd) followed by a lane-combine step. This produces 4 instructions for float and 3 for double, vs the previous 11 and 5.

Vector512 floating-point Sum also benefits because it recursively calls the 256-bit Sum path.
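As a scalar sketch of that recursion (helper names are illustrative, not the JIT code): the 512-bit sum is two 256-bit sums combined with one scalar add, so any improvement to the 256-bit path shows up twice.

```cpp
#include <cassert>

// Pairwise, lane-by-lane order of the 256-bit Sum path.
static float sum256(const float v[8]) {
    return ((v[0] + v[1]) + (v[2] + v[3])) + ((v[4] + v[5]) + (v[6] + v[7]));
}

// Vector512 Sum splits into two 256-bit halves, sums each via the
// 256-bit path (now hadd-based), and adds the two scalar results.
float sum512(const float v[16]) {
    return sum256(v) + sum256(v + 8);
}
```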

Codegen — float Test(Vector256<float> vec) => Vector256.Sum(vec);

Before (vpermilps + vaddps, 62 bytes, 14 instructions)

       vmovups  ymm0, ymmword ptr [rcx]
       vmovaps  ymm1, ymm0
       vpermilps xmm2, xmm1, -79
       vaddps   xmm1, xmm2, xmm1
       vpermilps xmm2, xmm1, 78
       vaddps   xmm1, xmm2, xmm1
       vextractf128 xmm0, ymm0
       vpermilps xmm2, xmm0, -79
       vaddps   xmm0, xmm2, xmm0
       vpermilps xmm2, xmm0, 78
       vaddps   xmm0, xmm2, xmm0
       vaddss   xmm0, xmm1, xmm0
       vzeroupper
       ret

After (vhaddps, 26 bytes, 7 instructions)

       vmovups  ymm0, ymmword ptr [rcx]
       vhaddps  ymm0, ymm0, ymm0
       vhaddps  ymm0, ymm0, ymm0
       vextractf128 xmm1, ymm0
       vaddps   xmm0, xmm0, xmm1
       vzeroupper
       ret

58% less code, 50% fewer instructions. On AVX-512 hardware, the improvement is even larger because:

  • vhaddps has HW_Flag_NoEvexSemantics, constraining registers to ymm0–15 and avoiding 4-byte EVEX prefixes
  • The reduced loop body size avoids micro-op cache (DSB) pressure

Validation

  • Numerical output is identical before and after (deterministic FP addition order preserved)
  • SPMI replay clean: 50,228 ASP.NET contexts, zero failures

Note

This comment was generated by Copilot.

Benchmark results with PR #126255

Benchmarked using the reporter's code on two AVX-512 server CPUs (cloud, via EgorBot):

Intel Emerald Rapids (Xeon 8573C)

| Method | Toolchain | NumElevations | Mean | Ratio |
| --- | --- | --- | --- | --- |
| Vector<float> portable SIMD | PR | 1024 | 907.4 ns | 1.00 |
| Vector256<float> explicit SIMD | PR | 1024 | 564.4 ns | 0.62 |
| Vector<float> portable SIMD | main | 1024 | 884.1 ns | 0.97 |
| Vector256<float> explicit SIMD | main | 1024 | 647.2 ns | 0.71 |
| Vector<float> portable SIMD | PR | 4096 | 3,549.4 ns | 1.00 |
| Vector256<float> explicit SIMD | PR | 4096 | 2,280.3 ns | 0.64 |
| Vector<float> portable SIMD | main | 4096 | 3,583.4 ns | 1.01 |
| Vector256<float> explicit SIMD | main | 4096 | 2,543.1 ns | 0.72 |

AMD Turin (EPYC 9V45, Zen 5)

| Method | Toolchain | NumElevations | Mean | Ratio |
| --- | --- | --- | --- | --- |
| Vector<float> portable SIMD | PR | 1024 | 487.2 ns | 1.00 |
| Vector256<float> explicit SIMD | PR | 1024 | 338.0 ns | 0.69 |
| Vector<float> portable SIMD | main | 1024 | 486.5 ns | 1.00 |
| Vector256<float> explicit SIMD | main | 1024 | 380.8 ns | 0.78 |
| Vector<float> portable SIMD | PR | 4096 | 1,959.6 ns | 1.00 |
| Vector256<float> explicit SIMD | PR | 4096 | 1,365.9 ns | 0.70 |
| Vector<float> portable SIMD | main | 4096 | 1,959.1 ns | 1.00 |
| Vector256<float> explicit SIMD | main | 4096 | 1,541.5 ns | 0.79 |

Summary

| CPU | Size | Vector256 improvement |
| --- | --- | --- |
| Emerald Rapids | 1024 | −12.8% |
| Emerald Rapids | 4096 | −10.3% |
| Turin (Zen 5) | 1024 | −11.3% |
| Turin (Zen 5) | 4096 | −11.4% |

Consistent ~11% improvement on both Intel and AMD AVX-512 server CPUs. Tiger Lake (the reporter's CPU) is not available in the cloud; the improvement there would likely be larger due to the smaller µop cache being more sensitive to the 28% loop body size reduction.

Copilot AI review requested due to automatic review settings March 28, 2026 13:55
@github-actions github-actions bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Mar 28, 2026
Contributor

Copilot AI left a comment


Pull request overview

Restores the AVX vhaddps / vhaddpd-based lowering for Vector256<float/double>.Sum() in the JIT to address a performance regression introduced by prior permute+add decomposition and lane-splitting, while maintaining deterministic lane-by-lane FP addition order.

Changes:

  • Add an AVX-only fast path for 256-bit floating-point Sum() using NI_AVX_HorizontalAdd and a final lane-combine step.
  • Keep the existing non-AVX floating-point path that sums 128-bit halves independently for determinism (software-fallback matching), but clarify the comment.



@EgorBo
Member Author

EgorBo commented Mar 28, 2026

@EgorBot -amd

using System;
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;

[MemoryDiagnoser]
public class Vector256RegressionBenchmark
{
    private const float NullHeight = -3.4E+38f;
    private const float FillTolerance = 0.01f;
    private const float NegCutTolerance = -0.01f;

    private float[] _baseElevations;
    private float[] _topElevations;

    [Params(1024, 4096)]
    public int NumElevations { get; set; }

    [GlobalSetup]
    public void Setup()
    {
        var rng = new Random(42);
        _baseElevations = new float[NumElevations];
        _topElevations = new float[NumElevations];
        for (var i = 0; i < NumElevations; i++)
        {
            _baseElevations[i] = rng.NextDouble() < 0.2
                ? NullHeight : (float)(rng.NextDouble() * 100);
            _topElevations[i] = rng.NextDouble() < 0.2
                ? NullHeight : (float)(rng.NextDouble() * 100 + 10);
        }
    }

    [Benchmark(Description = "Vector<float> portable SIMD", Baseline = true)]
    public unsafe (double cut, double fill) PortableVector()
    {
        double cutVol = 0, fillVol = 0;
        var nullVec = new Vector<float>(NullHeight);
        var fillTolVec = new Vector<float>(FillTolerance);
        var negCutTolVec = new Vector<float>(NegCutTolerance);

        fixed (float* bp = _baseElevations, tp = _topElevations)
        {
            var bv = (Vector<float>*)bp;
            var tv = (Vector<float>*)tp;
            for (int i = 0, limit = NumElevations / Vector<float>.Count;
                 i < limit; i++, bv++, tv++)
            {
                var mask = ~(Vector.Equals(*bv, nullVec)
                           | Vector.Equals(*tv, nullVec));
                if (Vector.Sum(mask) == 0) continue;

                var delta = Vector.ConditionalSelect(mask, *tv, Vector<float>.Zero)
                          - Vector.ConditionalSelect(mask, *bv, Vector<float>.Zero);

                var fillMask = Vector.GreaterThan(delta, fillTolVec);
                var usedFill = -Vector.Sum(fillMask);
                if (usedFill > 0)
                    fillVol -= Vector.Dot(delta, Vector.ConvertToSingle(fillMask));

                if (usedFill < Vector<float>.Count)
                {
                    var cutMask = Vector.LessThan(delta, negCutTolVec);
                    var usedCut = -Vector.Sum(cutMask);
                    if (usedCut > 0)
                        cutVol -= Vector.Dot(delta, Vector.ConvertToSingle(cutMask));
                }
            }
        }
        return (cutVol, fillVol);
    }

    [Benchmark(Description = "Vector256<float> explicit SIMD")]
    public unsafe (double cut, double fill) ExplicitVector256()
    {
        if (!Vector256.IsHardwareAccelerated) return (0, 0);

        double cutVol = 0, fillVol = 0;
        var nullVec = Vector256.Create(NullHeight);
        var fillTolVec = Vector256.Create(FillTolerance);
        var negCutTolVec = Vector256.Create(NegCutTolerance);

        fixed (float* bp = _baseElevations, tp = _topElevations)
        {
            var bv = (Vector256<float>*)bp;
            var tv = (Vector256<float>*)tp;
            for (int i = 0, limit = NumElevations / Vector256<float>.Count;
                 i < limit; i++, bv++, tv++)
            {
                var mask = ~(Vector256.Equals(*bv, nullVec)
                           | Vector256.Equals(*tv, nullVec));
                if (Vector256.EqualsAll(mask, Vector256<float>.Zero)) continue;

                var delta = Vector256.ConditionalSelect(mask, *tv, Vector256<float>.Zero)
                          - Vector256.ConditionalSelect(mask, *bv, Vector256<float>.Zero);

                var fillMask = Vector256.GreaterThan(delta, fillTolVec);
                if (Vector256.ExtractMostSignificantBits(fillMask) != 0)
                    fillVol += Vector256.Sum(
                        Vector256.ConditionalSelect(fillMask, delta, Vector256<float>.Zero));

                var cutMask = Vector256.LessThan(delta, negCutTolVec);
                if (Vector256.ExtractMostSignificantBits(cutMask) != 0)
                    cutVol += Vector256.Sum(
                        Vector256.ConditionalSelect(cutMask, delta, Vector256<float>.Zero));
            }
        }
        return (cutVol, fillVol);
    }
}


if (simdSize == 32)
{
    if (varTypeIsFloating(simdBaseType) && compOpportunisticallyDependsOn(InstructionSet_AVX))
Member


simdSize == 32 implies AVX support, so this check and the fallback code below can be removed.

Copilot AI review requested due to automatic review settings March 29, 2026 00:03
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.


Copilot AI review requested due to automatic review settings March 29, 2026 16:06
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

@github-actions
Contributor

Note

This review was generated by GitHub Copilot.

Code Review: gtNewSimdSumNode changes in gentree.cpp

Correctness

  • Addition order preserved for float (simdSize==32). Old: split → recursive 128-bit sum per lane → scalar add = ((a+b)+(c+d)) + ((e+f)+(g+h)). New: two vhaddps(ymm,ymm) → vextractf128 → vaddps = ((a+b)+(c+d)) + ((e+f)+(g+h)). Bit-identical FP results including NaN/infinity propagation, since the same pairwise additions occur in the same tree-reduction order. (Lines 27210-27224)

  • Addition order preserved for double (simdSize==32). haddCount=1 → one vhaddpd(ymm,ymm) produces [a+b, a+b | c+d, c+d], then extract+add gives (a+b)+(c+d), matching the old recursive path. (Lines 27210-27224)

  • 64-bit (simdSize==64) floating-point path unchanged. It still splits into two 32-byte halves, recursively calls gtNewSimdSumNode(…, 32) on each, and does a final scalar GT_ADD. The new hadd code is correctly invoked by these recursive calls and returns a scalar via gtNewSimdToScalarNode. (Lines 27184-27193 calling into 27200-27224)

  • Integer path preserved. The varTypeIsFloating guard at line 27203 means integers fall through to the unchanged split-add-recurse path at lines 27227-27232 → 27297+. No behavioral change.

  • fgMakeMultiUse usage is correct. Each loop iteration (line 27215) creates tmp as a copy of op1 before reassigning op1 to the hadd node, matching the established pattern used elsewhere in this function and in lowerxarch.cpp:6017-6057.

Assertion validity

  • assert(compOpportunisticallyDependsOn(InstructionSet_AVX)) at line 27202 is sound. simdSize==32 for XARCH is gated on AVX being available; this assert documents that invariant correctly. NI_AVX_HorizontalAdd at line 27216 requires AVX, which is guaranteed by this assert.

Performance / Instruction count

  • Instruction reduction is real. New float: vhaddps × 2 + vextractf128 + vaddps = 4 ops. New double: vhaddpd + vextractf128 + vaddpd = 3 ops. Significant improvement.

  • ⚠️ Comment's "vs 11/5" old-path counts are slightly off (line 27208). By my trace the old float path was ~10 instructions (getLower free + vextract + 2×(vpermilps+vaddps+vpermilps+vaddps) + scalar vaddss - though ToScalar can be free), and old double was ~6 (not 5). This is cosmetic only and doesn't affect correctness—consider adjusting to "vs ~10/~6" or simply removing the specific numbers.

EVEX / AVX-512 compatibility

  • NI_AVX_HorizontalAdd is marked HW_Flag_NoEvexSemantics in hwintrinsiclistxarch.h:674. This means it won't be promoted to an EVEX-encoded form. vhaddps/vhaddpd have no AVX-512 equivalent, so VEX encoding is forced—correct behavior.

Test coverage

  • ⚠️ Existing tests are trivially weak for this change. Vector256SingleSumTest (Vector256Tests.cs:6083) sums Vector256.Create(1.0f) — all ones — giving 8.0f. Vector256DoubleSumTest (line 6017) sums Vector256.Create(1.0), giving 4.0. These pass with any correct sum implementation and don't exercise the order-dependent nature of FP arithmetic at all.

  • No tests with non-uniform floating-point values, NaN, ±infinity, or signed zeros for Vector256<float>.Sum() or Vector256<double>.Sum(). The whole point of the determinism comment is that addition order matters for FP. A test such as Vector256.Create(1e30f, 1f, -1e30f, 1f, 0f, 0f, 0f, 0f).Sum() would catch order-dependent bugs, since the pairwise lane order and a naive left-to-right order produce different results for this input. Similarly, tests with float.NaN, float.PositiveInfinity, and -0.0f would validate NaN propagation and signed-zero behavior. This gap pre-dates this PR, but since the PR changes the codegen, adding at least one non-trivial FP test is strongly recommended.

  • 💡 Consider adding a regression test with a known order-sensitive float vector (e.g., catastrophic cancellation pattern) that would fail if hadd order were incorrect, to lock in the determinism guarantee going forward.
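Such an order-sensitivity check can be sketched as a scalar demonstration (values and helper names are illustrative; the expected result here follows the pairwise tree order the reduction uses):

```cpp
#include <cassert>

// Pairwise tree order used by the lane-by-lane reduction:
// ((a+b)+(c+d)) + ((e+f)+(g+h)).
float tree_sum(const float v[8]) {
    return ((v[0] + v[1]) + (v[2] + v[3])) + ((v[4] + v[5]) + (v[6] + v[7]));
}

// Naive left-to-right order, for contrast.
float seq_sum(const float v[8]) {
    float s = 0.0f;
    for (int i = 0; i < 8; i++) s += v[i];
    return s;
}
```

With {1e30f, 1f, -1e30f, 1f, 0f, 0f, 0f, 0f}, each 1f is absorbed into the adjacent 1e30f in the pairwise order (result 0f) but survives in left-to-right order (result 1f), so a test pinned to one order would fail under the other.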

Minor / style

  • 💡 getSIMDVectorLength(16, simdBaseType) at line 27210 is intentional—it counts elements per 128-bit lane to determine the number of hadd passes needed within each lane. This is correct but slightly non-obvious. A brief inline comment like // elements per 128-bit lane would aid readability.

  • The intrinsic variable declared at line 27172 is unused in the new floating-point path but was already unused in the old path for this case, so no regression.

Generated by Code Review for issue #126255


Labels

area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI


Development

Successfully merging this pull request may close these issues.

Vector256 explicit intrinsics 71% slower on .NET 10 vs .NET 8 on AVX-512 hardware

3 participants