
JIT: Restore vhaddps-based lowering for Vector256/512 floating-point Sum() #126255

Open
EgorBo wants to merge 3 commits into main from hadd-fix

Conversation

@EgorBo
Member

@EgorBo EgorBo commented Mar 28, 2026

Fixes #126250

Summary

Restore vhaddps/vhaddpd-based lowering for Vector256<float>.Sum() and Vector256<double>.Sum() (and transitively Vector512 floating-point Sum), replacing the more expensive vpermilps + vaddps decomposition that was introduced in #95568 and later expanded in e012fd4.

Root Cause

Two changes caused the regression:

  1. PR #95568 — "Updating Sum() implementation for Vector128 and Vector256 + adding lowering for Vector512" (Dec 2023) — replaced vhaddps with a vpermilps + vaddps (shuffle + vertical add) sequence, claiming hadd was "not the most efficient."
  2. Commit e012fd4 (Jun 2024) added lane-independent splitting for floating-point to "ensure deterministic results matching the software fallback." This doubled the instruction count for Vector256<float>.Sum() from 6 to 11 instructions.

The determinism concern was valid in principle — the software fallback processes each 128-bit lane independently. However, vhaddps already operates within 128-bit lanes (it is defined as two independent 128-bit horizontal adds), so it naturally preserves the same lane-by-lane addition order. The lane-splitting was unnecessary.
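To make the order-preservation argument concrete, here is a scalar model of the per-lane semantics (a sketch for illustration only — these helpers are not the JIT code): two hadd passes plus a lane combine perform exactly the same pairwise additions, in the same order, as summing each 128-bit lane independently and adding the results.

```cpp
#include <cassert>

// Scalar model of one vhaddps pass with both sources equal to v:
// within each 128-bit lane, adjacent pairs are summed.
// [a,b,c,d | e,f,g,h] -> [a+b, c+d, a+b, c+d | e+f, g+h, e+f, g+h]
static void hadd_ps(float v[8]) {
    float r[8] = { v[0] + v[1], v[2] + v[3], v[0] + v[1], v[2] + v[3],
                   v[4] + v[5], v[6] + v[7], v[4] + v[5], v[6] + v[7] };
    for (int i = 0; i < 8; i++) v[i] = r[i];
}

// The restored lowering: two hadd passes, then a lane combine.
float hadd_sum(const float in[8]) {
    float v[8];
    for (int i = 0; i < 8; i++) v[i] = in[i];
    hadd_ps(v);          // [a+b, c+d, ... | e+f, g+h, ...]
    hadd_ps(v);          // [(a+b)+(c+d), ... | (e+f)+(g+h), ...]
    return v[0] + v[4];  // vextractf128 + vaddps, then ToScalar
}

// The lane-split fallback order: sum each 128-bit half, then add.
float lane_split_sum(const float in[8]) {
    float lo = (in[0] + in[1]) + (in[2] + in[3]);
    float hi = (in[4] + in[5]) + (in[6] + in[7]);
    return lo + hi;
}
```

Even for an order-sensitive input with catastrophic cancellation, both reductions produce the same result, because the pairwise additions happen in the same tree order.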

Fix

For 256-bit floating-point Sum when AVX is available, use NI_AVX_HorizontalAdd (vhaddps/vhaddpd) followed by a lane-combine step. This produces 4 instructions for float and 3 for double, vs the previous 11 and 5.

Vector512 floating-point Sum also benefits because it recursively calls the 256-bit Sum path.
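As a scalar sketch of that recursion (helper names are illustrative, not the JIT code): the 512-bit sum is two 256-bit sums combined with one scalar add, so any improvement to the 256-bit path shows up twice.

```cpp
#include <cassert>

// Pairwise, lane-by-lane order of the 256-bit Sum path.
static float sum256(const float v[8]) {
    return ((v[0] + v[1]) + (v[2] + v[3])) + ((v[4] + v[5]) + (v[6] + v[7]));
}

// Vector512 Sum splits into two 256-bit halves, sums each via the
// 256-bit path (now hadd-based), and adds the two scalar results.
float sum512(const float v[16]) {
    return sum256(v) + sum256(v + 8);
}
```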

Codegen — float Test(Vector256<float> vec) => Vector256.Sum(vec);

Before (vpermilps + vaddps, 62 bytes, 14 instructions)

       vmovups  ymm0, ymmword ptr [rcx]
       vmovaps  ymm1, ymm0
       vpermilps xmm2, xmm1, -79
       vaddps   xmm1, xmm2, xmm1
       vpermilps xmm2, xmm1, 78
       vaddps   xmm1, xmm2, xmm1
       vextractf128 xmm0, ymm0
       vpermilps xmm2, xmm0, -79
       vaddps   xmm0, xmm2, xmm0
       vpermilps xmm2, xmm0, 78
       vaddps   xmm0, xmm2, xmm0
       vaddss   xmm0, xmm1, xmm0
       vzeroupper
       ret

After (vhaddps, 26 bytes, 7 instructions)

       vmovups  ymm0, ymmword ptr [rcx]
       vhaddps  ymm0, ymm0, ymm0
       vhaddps  ymm0, ymm0, ymm0
       vextractf128 xmm1, ymm0
       vaddps   xmm0, xmm0, xmm1
       vzeroupper
       ret

58% less code, 50% fewer instructions. On AVX-512 hardware, the improvement is even larger because:

  • vhaddps has HW_Flag_NoEvexSemantics, constraining registers to ymm0–15 and avoiding 4-byte EVEX prefixes
  • The reduced loop body size avoids micro-op cache (DSB) pressure

Validation

  • Numerical output is identical before and after (deterministic FP addition order preserved)
  • SPMI replay clean: 50,228 ASP.NET contexts, zero failures

Note

This comment was generated by Copilot.

Benchmark results with PR #126255

Benchmarked using the reporter's code on two AVX-512 server CPUs (cloud, via EgorBot):

Intel Emerald Rapids (Xeon 8573C)

| Method | Toolchain | NumElevations | Mean | Ratio |
| --- | --- | --- | --- | --- |
| Vector<float> portable SIMD | PR | 1024 | 907.4 ns | 1.00 |
| Vector256<float> explicit SIMD | PR | 1024 | 564.4 ns | 0.62 |
| Vector<float> portable SIMD | main | 1024 | 884.1 ns | 0.97 |
| Vector256<float> explicit SIMD | main | 1024 | 647.2 ns | 0.71 |
| Vector<float> portable SIMD | PR | 4096 | 3,549.4 ns | 1.00 |
| Vector256<float> explicit SIMD | PR | 4096 | 2,280.3 ns | 0.64 |
| Vector<float> portable SIMD | main | 4096 | 3,583.4 ns | 1.01 |
| Vector256<float> explicit SIMD | main | 4096 | 2,543.1 ns | 0.72 |

AMD Turin (EPYC 9V45, Zen 5)

| Method | Toolchain | NumElevations | Mean | Ratio |
| --- | --- | --- | --- | --- |
| Vector<float> portable SIMD | PR | 1024 | 487.2 ns | 1.00 |
| Vector256<float> explicit SIMD | PR | 1024 | 338.0 ns | 0.69 |
| Vector<float> portable SIMD | main | 1024 | 486.5 ns | 1.00 |
| Vector256<float> explicit SIMD | main | 1024 | 380.8 ns | 0.78 |
| Vector<float> portable SIMD | PR | 4096 | 1,959.6 ns | 1.00 |
| Vector256<float> explicit SIMD | PR | 4096 | 1,365.9 ns | 0.70 |
| Vector<float> portable SIMD | main | 4096 | 1,959.1 ns | 1.00 |
| Vector256<float> explicit SIMD | main | 4096 | 1,541.5 ns | 0.79 |

Summary

| CPU | Size | Vector256 improvement |
| --- | --- | --- |
| Emerald Rapids | 1024 | −12.8% |
| Emerald Rapids | 4096 | −10.3% |
| Turin (Zen 5) | 1024 | −11.3% |
| Turin (Zen 5) | 4096 | −11.4% |

Consistent ~11% improvement on both Intel and AMD AVX-512 server CPUs. Tiger Lake (the reporter's CPU) is not available in the cloud; the improvement there would likely be larger due to the smaller µop cache being more sensitive to the 28% loop body size reduction.

Copilot AI review requested due to automatic review settings March 28, 2026 13:55
@github-actions github-actions bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Mar 28, 2026
Contributor

Copilot AI left a comment


Pull request overview

Restores the AVX vhaddps / vhaddpd-based lowering for Vector256<float/double>.Sum() in the JIT to address a performance regression introduced by prior permute+add decomposition and lane-splitting, while maintaining deterministic lane-by-lane FP addition order.

Changes:

  • Add an AVX-only fast path for 256-bit floating-point Sum() using NI_AVX_HorizontalAdd and a final lane-combine step.
  • Keep the existing non-AVX floating-point path that sums 128-bit halves independently for determinism (software-fallback matching), but clarify the comment.



@EgorBo
Member Author

EgorBo commented Mar 28, 2026

@EgorBot -amd

using System;
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;

[MemoryDiagnoser]
public class Vector256RegressionBenchmark
{
    private const float NullHeight = -3.4E+38f;
    private const float FillTolerance = 0.01f;
    private const float NegCutTolerance = -0.01f;

    private float[] _baseElevations;
    private float[] _topElevations;

    [Params(1024, 4096)]
    public int NumElevations { get; set; }

    [GlobalSetup]
    public void Setup()
    {
        var rng = new Random(42);
        _baseElevations = new float[NumElevations];
        _topElevations = new float[NumElevations];
        for (var i = 0; i < NumElevations; i++)
        {
            _baseElevations[i] = rng.NextDouble() < 0.2
                ? NullHeight : (float)(rng.NextDouble() * 100);
            _topElevations[i] = rng.NextDouble() < 0.2
                ? NullHeight : (float)(rng.NextDouble() * 100 + 10);
        }
    }

    [Benchmark(Description = "Vector<float> portable SIMD", Baseline = true)]
    public unsafe (double cut, double fill) PortableVector()
    {
        double cutVol = 0, fillVol = 0;
        var nullVec = new Vector<float>(NullHeight);
        var fillTolVec = new Vector<float>(FillTolerance);
        var negCutTolVec = new Vector<float>(NegCutTolerance);

        fixed (float* bp = _baseElevations, tp = _topElevations)
        {
            var bv = (Vector<float>*)bp;
            var tv = (Vector<float>*)tp;
            for (int i = 0, limit = NumElevations / Vector<float>.Count;
                 i < limit; i++, bv++, tv++)
            {
                var mask = ~(Vector.Equals(*bv, nullVec)
                           | Vector.Equals(*tv, nullVec));
                if (Vector.Sum(mask) == 0) continue;

                var delta = Vector.ConditionalSelect(mask, *tv, Vector<float>.Zero)
                          - Vector.ConditionalSelect(mask, *bv, Vector<float>.Zero);

                var fillMask = Vector.GreaterThan(delta, fillTolVec);
                var usedFill = -Vector.Sum(fillMask);
                if (usedFill > 0)
                    fillVol -= Vector.Dot(delta, Vector.ConvertToSingle(fillMask));

                if (usedFill < Vector<float>.Count)
                {
                    var cutMask = Vector.LessThan(delta, negCutTolVec);
                    var usedCut = -Vector.Sum(cutMask);
                    if (usedCut > 0)
                        cutVol -= Vector.Dot(delta, Vector.ConvertToSingle(cutMask));
                }
            }
        }
        return (cutVol, fillVol);
    }

    [Benchmark(Description = "Vector256<float> explicit SIMD")]
    public unsafe (double cut, double fill) ExplicitVector256()
    {
        if (!Vector256.IsHardwareAccelerated) return (0, 0);

        double cutVol = 0, fillVol = 0;
        var nullVec = Vector256.Create(NullHeight);
        var fillTolVec = Vector256.Create(FillTolerance);
        var negCutTolVec = Vector256.Create(NegCutTolerance);

        fixed (float* bp = _baseElevations, tp = _topElevations)
        {
            var bv = (Vector256<float>*)bp;
            var tv = (Vector256<float>*)tp;
            for (int i = 0, limit = NumElevations / Vector256<float>.Count;
                 i < limit; i++, bv++, tv++)
            {
                var mask = ~(Vector256.Equals(*bv, nullVec)
                           | Vector256.Equals(*tv, nullVec));
                if (Vector256.EqualsAll(mask, Vector256<float>.Zero)) continue;

                var delta = Vector256.ConditionalSelect(mask, *tv, Vector256<float>.Zero)
                          - Vector256.ConditionalSelect(mask, *bv, Vector256<float>.Zero);

                var fillMask = Vector256.GreaterThan(delta, fillTolVec);
                if (Vector256.ExtractMostSignificantBits(fillMask) != 0)
                    fillVol += Vector256.Sum(
                        Vector256.ConditionalSelect(fillMask, delta, Vector256<float>.Zero));

                var cutMask = Vector256.LessThan(delta, negCutTolVec);
                if (Vector256.ExtractMostSignificantBits(cutMask) != 0)
                    cutVol += Vector256.Sum(
                        Vector256.ConditionalSelect(cutMask, delta, Vector256<float>.Zero));
            }
        }
        return (cutVol, fillVol);
    }
}


if (simdSize == 32)
{
    if (varTypeIsFloating(simdBaseType) && compOpportunisticallyDependsOn(InstructionSet_AVX))
Member


simdSize == 32 implies AVX support, so this check and the fallback code below can be removed.

Copilot AI review requested due to automatic review settings March 29, 2026 00:03
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.


Copilot AI review requested due to automatic review settings March 29, 2026 16:06
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

@github-actions
Contributor

Note

This review was generated by GitHub Copilot.

Code Review: gtNewSimdSumNode changes in gentree.cpp

Correctness

  • Addition order preserved for float (simdSize==32). Old: split → recursive 128-bit sum per lane → scalar add = ((a+b)+(c+d)) + ((e+f)+(g+h)). New: two vhaddps(ymm,ymm) → vextractf128 → vaddps = ((a+b)+(c+d)) + ((e+f)+(g+h)). Bit-identical FP results including NaN/infinity propagation, since the same pairwise additions occur in the same tree-reduction order. (Lines 27210-27224)

  • Addition order preserved for double (simdSize==32). haddCount=1 → one vhaddpd(ymm,ymm) produces [a+b, a+b | c+d, c+d], then extract+add gives (a+b)+(c+d), matching the old recursive path. (Lines 27210-27224)

  • 64-bit (simdSize==64) floating-point path unchanged. It still splits into two 32-byte halves, recursively calls gtNewSimdSumNode(…, 32) on each, and does a final scalar GT_ADD. The new hadd code is correctly invoked by these recursive calls and returns a scalar via gtNewSimdToScalarNode. (Lines 27184-27193 calling into 27200-27224)

  • Integer path preserved. The varTypeIsFloating guard at line 27203 means integers fall through to the unchanged split-add-recurse path at lines 27227-27232 → 27297+. No behavioral change.

  • fgMakeMultiUse usage is correct. Each loop iteration (line 27215) creates tmp as a copy of op1 before reassigning op1 to the hadd node, matching the established pattern used elsewhere in this function and in lowerxarch.cpp:6017-6057.

Assertion validity

  • assert(compOpportunisticallyDependsOn(InstructionSet_AVX)) at line 27202 is sound. simdSize==32 for XARCH is gated on AVX being available; this assert documents that invariant correctly. NI_AVX_HorizontalAdd at line 27216 requires AVX, which is guaranteed by this assert.

Performance / Instruction count

  • Instruction reduction is real. New float: vhaddps × 2 + vextractf128 + vaddps = 4 ops. New double: vhaddpd + vextractf128 + vaddpd = 3 ops. Significant improvement.

  • ⚠️ Comment's "vs 11/5" old-path counts are slightly off (line 27208). By my trace the old float path was ~10 instructions (getLower free + vextract + 2×(vpermilps+vaddps+vpermilps+vaddps) + scalar vaddss - though ToScalar can be free), and old double was ~6 (not 5). This is cosmetic only and doesn't affect correctness—consider adjusting to "vs ~10/~6" or simply removing the specific numbers.

EVEX / AVX-512 compatibility

  • NI_AVX_HorizontalAdd is marked HW_Flag_NoEvexSemantics in hwintrinsiclistxarch.h:674. This means it won't be promoted to an EVEX-encoded form. vhaddps/vhaddpd have no AVX-512 equivalent, so VEX encoding is forced—correct behavior.

Test coverage

  • ⚠️ Existing tests are trivially weak for this change. Vector256SingleSumTest (Vector256Tests.cs:6083) sums Vector256.Create(1.0f) — all ones — giving 8.0f. Vector256DoubleSumTest (line 6017) sums Vector256.Create(1.0), giving 4.0. These pass with any correct sum implementation and don't exercise the order-dependent nature of FP arithmetic at all.

  • No tests with non-uniform floating-point values, NaN, ±infinity, or signed zeros for Vector256<float>.Sum() or Vector256<double>.Sum(). The whole point of the determinism comment is that addition order matters for FP. A test such as Vector256.Create(1e30f, 1f, -1e30f, 1f, 0f, 0f, 0f, 0f).Sum() would catch order-dependent bugs, since the pairwise lane order and a naive left-to-right order produce different results for this input. Similarly, tests with float.NaN, float.PositiveInfinity, and -0.0f would validate NaN propagation and signed-zero behavior. This gap pre-dates this PR, but since the PR changes the codegen, adding at least one non-trivial FP test is strongly recommended.

  • 💡 Consider adding a regression test with a known order-sensitive float vector (e.g., catastrophic cancellation pattern) that would fail if hadd order were incorrect, to lock in the determinism guarantee going forward.
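Such an order-sensitivity check can be sketched as a scalar demonstration (values and helper names are illustrative; the expected result here follows the pairwise tree order the reduction uses):

```cpp
#include <cassert>

// Pairwise tree order used by the lane-by-lane reduction:
// ((a+b)+(c+d)) + ((e+f)+(g+h)).
float tree_sum(const float v[8]) {
    return ((v[0] + v[1]) + (v[2] + v[3])) + ((v[4] + v[5]) + (v[6] + v[7]));
}

// Naive left-to-right order, for contrast.
float seq_sum(const float v[8]) {
    float s = 0.0f;
    for (int i = 0; i < 8; i++) s += v[i];
    return s;
}
```

With {1e30f, 1f, -1e30f, 1f, 0f, 0f, 0f, 0f}, each 1f is absorbed into the adjacent 1e30f in the pairwise order (result 0f) but survives in left-to-right order (result 1f), so a test pinned to one order would fail under the other.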

Minor / style

  • 💡 getSIMDVectorLength(16, simdBaseType) at line 27210 is intentional—it counts elements per 128-bit lane to determine the number of hadd passes needed within each lane. This is correct but slightly non-obvious. A brief inline comment like // elements per 128-bit lane would aid readability.

  • The intrinsic variable declared at line 27172 is unused in the new floating-point path but was already unused in the old path for this case, so no regression.

Generated by Code Review for issue #126255


Labels

area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI


Development

Successfully merging this pull request may close these issues.

Vector256 explicit intrinsics 71% slower on .NET 10 vs .NET 8 on AVX-512 hardware

3 participants