Stackalloc localloc #112168
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
@hez2010 fyi. Expect this may blow up in some tests with stack overflows; we'll see. Also, I forgot to exclude sites in handlers, so things may blow up there too.
Some preliminary notes. @dotnet/jit-contrib @davidwrighton @jkotas, interested in your feedback. This builds on #104906 (JIT-introduced).
Do you have good examples in BCL or other real-world code where this kicks in?
The typical uses of stackalloc in the BCL are constant-sized stackallocs or stackalloc+ArrayPool combos. Is your concern the unsafe nature of stackalloc uses in the BCL, i.e. an unbounded stackalloc that may slip through code review? BCL stackallocs and stackalloc+ArrayPool combos have other safety problems:
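For context, the two BCL patterns mentioned above might look like this sketch (illustrative only; method names and the 256-byte threshold are made up, not taken from any particular BCL method):

```csharp
using System;
using System.Buffers;

static class StackallocPatterns
{
    // Pattern 1: constant-sized stackalloc -- safe because the size
    // is a compile-time constant with a known, small upper bound.
    static int Checksum(ReadOnlySpan<byte> data)
    {
        Span<byte> scratch = stackalloc byte[64]; // fixed size
        data.Slice(0, Math.Min(data.Length, scratch.Length)).CopyTo(scratch);
        int sum = 0;
        foreach (byte b in scratch) sum += b;
        return sum;
    }

    // Pattern 2: stackalloc + ArrayPool combo -- stack for small n,
    // rented heap array for large n, returned to the pool afterwards.
    static void Process(int n)
    {
        byte[]? rented = null;
        Span<byte> buffer = n <= 256
            ? stackalloc byte[256]
            : (rented = ArrayPool<byte>.Shared.Rent(n));
        try
        {
            // ... use buffer.Slice(0, n) ...
        }
        finally
        {
            if (rented != null)
                ArrayPool<byte>.Shared.Return(rented);
        }
    }
}
```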
For the BCL use cases in particular, it may be more interesting to work on #52065 and base this optimization on top of it:
If we had a malloca-like API, I think this specific example could be converted to it as an optimization in Roslyn as well.
Sometimes we may need to return a buffer to its caller, so we cannot use stack allocation in those cases. Some typical scenarios where this may kick in, once we have support for gcref arrays, look like:
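A minimal illustration of the escape constraint described above (method names are hypothetical): a buffer that is returned outlives the frame and must stay on the heap, while one consumed locally is a candidate for the transformation.

```csharp
using System;

static class EscapeExamples
{
    // Escapes: the array is returned to the caller,
    // so it cannot be converted to a stack allocation.
    static int[] MakeBuffer(int n) => new int[n];

    // Does not escape: the array is never stored or returned,
    // so the JIT could stack allocate it (via localloc) when
    // the size is suitably bounded.
    static int SumSquares(int n)
    {
        int[] tmp = new int[n];
        for (int i = 0; i < n; i++)
            tmp[i] = i * i;
        int sum = 0;
        foreach (int v in tmp)
            sum += v;
        return sum;                  // only the scalar escapes
    }
}
```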
Mostly this was an exploration of how hard it would be to enable the transformation in the JIT, and to contemplate what else might need to be addressed. I have started scouting around for potential impact, but it will take a while to get a useful set of data. I also need to build a better automated analysis for categorizing the things that block and unblock allocation (at least for the first blocker) and make sure we're not missing anything simple in our analysis.

With this PR as is, on one large internal application that has likely been extensively hand tuned, there are roughly 22K Tier1 optimized methods, 2.3K methods with array creation sites, and 4.1K total array allocation sites. 2 of the arrays are stack allocated (not sure if via localloc). I don't have a breakdown yet of what blocks the other 4K.

For some context, on this same application, conditional escape analysis for enumerators kicks in for around 200 methods.
Remaining failures all look like stack overflows: there needs to be a per-instance size limit as well as a dynamic size limit. So it seems this sort of transformation is feasible. Adding a per-instance limit will introduce conditional heap/stack allocation, so that seems like an easy next step.
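A source-level analogue of the conditional heap/stack allocation mentioned above (a sketch only; the per-instance limit value and names are hypothetical, and the actual transformation happens in JIT IR, not C#):

```csharp
using System;

static class ConditionalAllocSketch
{
    const int PerInstanceLimitBytes = 1024; // hypothetical per-site limit

    // Moral equivalent of what the JIT would emit for `new int[n]`
    // once a per-instance size limit guards the stack path.
    static int Demo(int n)
    {
        Span<int> a = n <= PerInstanceLimitBytes / sizeof(int)
            ? stackalloc int[n]      // small enough: localloc on the stack
            : new int[n];            // too big: fall back to the heap
        for (int i = 0; i < a.Length; i++)
            a[i] = i;
        return a.Length;
    }
}
```

A dynamic limit (total stack space consumed by such allocations per frame or per thread) would need an additional runtime check beyond what this sketch shows.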
Experiment with turning non-escaping `new (nongc)[n]` into stackallocs. Also enable `new (nongc)[100]` if the allocation site is within a loop, also via stackalloc. Currently there is no restriction on how big the allocation can be (that will have to change).
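In source terms, the transformation this experiment targets is roughly the following (an illustrative before/after; the JIT performs this on its IR, and the names here are made up):

```csharp
using System;

static class BeforeTransform
{
    // Candidate: non-GC element type, variable size, never escapes.
    static int Sum(ReadOnlySpan<int> src)
    {
        int[] tmp = new int[src.Length]; // heap allocation today
        src.CopyTo(tmp);
        int sum = 0;
        foreach (int v in tmp) sum += v;
        return sum;
    }
}

static class AfterTransform
{
    // What the JIT would turn it into, morally: a localloc-backed buffer.
    static int Sum(ReadOnlySpan<int> src)
    {
        Span<int> tmp = stackalloc int[src.Length]; // currently unbounded
        src.CopyTo(tmp);
        int sum = 0;
        foreach (int v in tmp) sum += v;
        return sum;
    }
}
```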