1 files changed, 311 insertions, 0 deletions
diff --git a/_content/blog/allocation-optimizations.md b/_content/blog/allocation-optimizations.md
new file mode 100644
index 00000000..507541da
--- /dev/null
+++ b/_content/blog/allocation-optimizations.md
@@ -0,0 +1,311 @@
+---
+title: Allocating on the Stack
+date: 2026-02-27
+by:
+- Keith Randall
+summary: A description of some of the recent changes to do allocations on the stack instead of the heap.
+template: true
+---
+
+We're always looking for ways to make Go programs faster. In the last
+2 releases, we have concentrated on mitigating a particular source of
+slowness, heap allocations. Each time a Go program allocates memory
+from the heap, there's a fairly large chunk of code that needs to run
+to satisfy that allocation. In addition, heap allocations present
+additional load on the garbage collector.  Even with recent
+enhancements like [Green Tea](/blog/greenteagc), the garbage collector
+still incurs substantial overhead.
+
+So we've been working on ways to do more allocations on the stack
+instead of the heap.  Stack allocations are considerably cheaper to
+perform (sometimes completely free).  Moreover, they present no load
+to the garbage collector, as stack allocations can be collected
+automatically together with the stack frame itself. Stack allocations
+also enable prompt reuse, which is very cache friendly.
+
+## Stack allocation of constant-sized slices
+
+Consider the task of building a slice of tasks to process:
+{{raw `
+	func process(c chan task) {
+		var tasks []task
+		for t := range c {
+			tasks = append(tasks, t)
+		}
+		processAll(tasks)
+	}
+`}}
+
+Let's walk through what happens at runtime when pulling tasks from the
+channel `c` and adding them to the slice `tasks`.
+
+On the first loop iteration, there is no backing store for `tasks`, so
+`append` has to allocate one. Because it doesn't know how big the
+slice will eventually be, it can't be too aggressive. Currently, it
+allocates a backing store of size 1.
+
+On the second loop iteration, the backing store now exists, but it is
+full. `append` again has to allocate a new backing store, this time of
+size 2. The old backing store of size 1 is now garbage.
+
+On the third loop iteration, the backing store of size 2 is
+full. `append` *again* has to allocate a new backing store, this time
+of size 4. The old backing store of size 2 is now garbage.
+
+On the fourth loop iteration, the backing store of size 4 has only 3
+items in it. `append` can just place the item in the existing backing
+store and bump up the slice length. Yay! No call to the allocator for
+this iteration.
+
+On the fifth loop iteration, the backing store of size 4 is full, and
+`append` again has to allocate a new backing store, this time of size
+8.
+
+And so on. We generally double the size of the allocation each time it
+fills up, so we can eventually append most new tasks to the slice
+without allocation. But there is a fair amount of overhead in the
+"startup" phase when the slice is small. During this startup phase we
+spend a lot of time in the allocator, and produce a bunch of garbage,
+which seems pretty wasteful. And it may be that in your program, the
+slice never really gets large. This startup phase may be all you ever
+encounter.
+
+If this code was a really hot part of your program, you might be
+tempted to start the slice out at a larger size, to avoid all of these
+allocations.
+
+{{raw `
+	func process2(c chan task) {
+		tasks := make([]task, 0, 10) // probably at most 10 tasks
+		for t := range c {
+			tasks = append(tasks, t)
+		}
+		processAll(tasks)
+	}
+`}}
+
+This is a reasonable optimization to do. It is never incorrect; your
+program still runs correctly. If the guess is too small, you get
+allocations from `append` as before. If the guess is too large, you
+waste some memory.
+
+If your guess for the number of tasks was a good one, then there's
+only one allocation site in this program. The `make` call allocates a
+slice backing store of the correct size, and `append` never has to do
+any reallocation.
+
+The surprising thing is that if you benchmark this code with 10
+elements in the channel, you'll see that you didn't reduce the number
+of allocations to 1, you reduced the number of allocations to 0!
+
+The reason is that the compiler decided to allocate the backing store
+on the stack. Because it knows what size it needs to be (10 times the
+size of a task) it can allocate storage for it in the stack frame of
+`process2` instead of on the heap[<sup>1</sup>](#footnotes).  Note
+that this depends on the fact that the backing store does not [escape
+to the heap](/doc/gc-guide#Escape_analysis) inside of `processAll`.
+
+## Stack allocation of variable-sized slices
+
+But of course, hard coding a size guess is a bit rigid.
+Maybe we can pass in an estimated length?
+
+{{raw `
+	func process3(c chan task, lengthGuess int) {
+		tasks := make([]task, 0, lengthGuess)
+		for t := range c {
+			tasks = append(tasks, t)
+		}
+		processAll(tasks)
+	}
+`}}
+
+This lets the caller pick a good size for the `tasks` slice, which may
+vary depending on where this code is being called from.
+
+Unfortunately, in Go 1.24 the non-constant size of the backing store
+means the compiler can no longer allocate the backing store on the
+stack.  It will end up on the heap, converting our 0-allocation code
+to 1-allocation code. Still better than having `append` do all the
+intermediate allocations, but unfortunate.
+
+But never fear, Go 1.25 is here!
+
+Imagine you decide to do the following, to get the stack allocation
+only in cases where the guess is small:
+
+{{raw `
+	func process4(c chan task, lengthGuess int) {
+		var tasks []task
+		if lengthGuess <= 10 {
+			tasks = make([]task, 0, 10)
+		} else {
+			tasks = make([]task, 0, lengthGuess)
+		}
+		for t := range c {
+			tasks = append(tasks, t)
+		}
+		processAll(tasks)
+	}
+`}}
+
+Kind of ugly, but it would work. When the guess is small, you use a
+constant size `make` and thus a stack-allocated backing store, and
+when the guess is larger you use a variable size `make` and allocate
+the backing store from the heap.
+
+But in Go 1.25, you don't need to head down this ugly road. The Go
+1.25 compiler does this transformation for you!  For certain slice
+allocation locations, the compiler automatically allocates a small
+(currently 32-byte) slice backing store, and uses that backing store
+for the result of the `make` if the size requested is small
+enough. Otherwise, it uses a heap allocation as normal.
+
+In Go 1.25, `process3` performs zero heap allocations, if
+`lengthGuess` is small enough that a slice of that length fits into 32
+bytes. (And of course that `lengthGuess` is a correct guess for how
+many items are in `c`.)
+
+We're always improving the performance of Go, so upgrade to the latest
+Go release and [be
+surprised](https://youtu.be/FUm0pfgWehI?si=QRTt_JYwr-cRHDNJ&t=960) by
+how much faster and memory efficient your program becomes!
+
+## Stack allocation of append-allocated slices
+
+Ok, but you still don't want to have to change your API to add this
+weird length guess. Anything else you could do?
+
+Upgrade to Go 1.26!
+
+{{raw `
+	func process(c chan task) {
+		var tasks []task
+		for t := range c {
+			tasks = append(tasks, t)
+		}
+		processAll(tasks)
+	}
+`}}
+
+In Go 1.26, we allocate the same kind of small, speculative backing
+store on the stack, but now we can use it directly at the `append`
+site.
+
+On the first loop iteration, there is no backing store for `tasks`, so
+`append` uses a small, stack-allocated backing store as the first
+allocation. If, for instance, we can fit 4 `task`s in that backing store,
+the first `append` allocates a backing store of length 4 from the stack.
+
+The next 3 loop iterations append directly to the stack backing store,
+requiring no allocation.
+
+On the 4th iteration, the stack backing store is finally full and we
+have to go to the heap for more backing store. But we have avoided
+almost all of the startup overhead described earlier in this article.
+No heap allocations of size, 1, 2, and 4, and none of the garbage that
+they eventually become. If your slices are small, maybe you will never
+have a heap allocation.
+
+## Stack allocation of append-allocated escaping slices
+
+Ok, this is all good when the `tasks` slice doesn't escape. But what if
+I'm returning the slice? Then it can't be allocated on the stack, right?
+
+Right! The backing store for the slice returned by `extract` below
+can't be allocated on the stack, because the stack frame for `extract`
+disappears when `extract` returns.
+
+{{raw `
+	func extract(c chan task) []task {
+		var tasks []task
+		for t := range c {
+			tasks = append(tasks, t)
+		}
+		return tasks
+	}
+`}}
+
+But you might think, the *returned* slice can't be allocated on the
+stack. But what about all those intermediate slices that just become
+garbage? Maybe we can allocate those on the stack?
+
+{{raw `
+	func extract2(c chan task) []task {
+		var tasks []task
+		for t := range c {
+			tasks = append(tasks, t)
+		}
+		tasks2 := make([]task, len(tasks))
+		copy(tasks2, tasks)
+		return tasks2
+	}
+`}}
+
+Then the `tasks` slice never escapes `extract2`. It can benefit from
+all of the optimizations described above. Then at the very end of
+`extract2`, when we know the final size of the slice, we do one heap
+allocation of the required size, copy our `task`s into it, and return
+the copy.
+
+But do you really want to write all that additional code? It seems
+error prone. Maybe the compiler can do this transformation for us?
+
+In Go 1.26, it can!
+
+For escaping slices, the compiler will transform the original `extract`
+code to something like this:
+
+{{raw `
+	func extract3(c chan task) []task {
+		var tasks []task
+		for t := range c {
+			tasks = append(tasks, t)
+		}
+		tasks = runtime.move2heap(tasks)
+		return tasks
+	}
+`}}
+
+`runtime.move2heap` is a special compiler+runtime function that is the
+identity function for slices that are already allocated in the heap.
+For slices that are on the stack, it allocates a new slice on the
+heap, copies the stack-allocated slice to the heap copy, and returns
+the heap copy.
+
+This ensures that for our original `extract` code, if the number of
+items fits in our small stack-allocated buffer, we perform exactly 1
+allocation of exactly the right size. If the number of items exceeds
+the capacity our small stack-allocated buffer, we do our normal
+doubling-allocation once the stack-allocated buffer overflows.
+
+The optimization that Go 1.26 does is actually better than the
+hand-optimized code, because it does not require the extra
+allocation+copy that the hand-optimized code always does at the end.
+It requires the allocation+copy only in the case that we've exclusively
+operated on a stack-backed slice up to the return point.
+
+We do pay the cost for a copy, but that cost is almost completely
+offset by the copies in the startup phase that we no longer have to
+do. (In fact, the the new scheme at worst has to copy one more element
+than the old scheme.)
+
+## Wrapping up
+
+Hand optimization can still be beneficial, especially if you have a
+good estimate of the slice size ahead of time. But hopefully the
+compiler will now catch a lot of the simple cases for you and allow
+you to focus on the remaining ones that really matter.
+
+There are a lot of details that the compiler needs to ensure to get
+all these optimizations right. If you think that one of these
+optimizations is causing correctness or (negative) performance issues
+for you, you can turn them off with
+`-gcflags=all=-d=variablemakehash=n`. If turning these optimizations
+off helps, please [file an issue](/issue/new) so we can investigate.
+
+## Footnotes
+
+<sup>1</sup> Go stacks do not have any `alloca`-style mechanism for
+dynamically-sized stack frames. All Go stack frames are constant
+sized.