# A more hugepage-aware Go heap

Authors: Michael Knyszek, Michael Pratt

## Background

[Transparent huge pages (THP) admin
guide](https://www.kernel.org/doc/html/latest/admin-guide/mm/transhuge.html).

[Go scavenging
policy](30333-smarter-scavenging.md#which-memory-should-we-scavenge).
(Implementation details are out-of-date, but the linked policy is relevant.)

[THP flag behavior](#appendix_thp-flag-behavior).

## Motivation

Currently, Go's hugepage-related policies [do not play well
together](https://github.com/golang/go/issues/55328) and have bit-rotted.[^1]
The result is that the memory regions the Go runtime chooses to mark as
`MADV_NOHUGEPAGE` and `MADV_HUGEPAGE` are somewhat haphazard, resulting in
memory overuse for small heaps: upwards of 40% overhead in some cases.
Turning off huge pages entirely fixes the problem, but leaves CPU performance
on the table.
This policy also means large heaps might have dense sections that are
erroneously mapped as `MADV_NOHUGEPAGE`, costing up to 1% throughput.

The goal of this work is to eliminate this overhead for small heaps while
improving huge page utilization for large heaps.

[^1]: [Large allocations](https://cs.opensource.google/go/go/+/master:src/runtime/mheap.go;l=1344;drc=c70fd4b30aba5db2df7b5f6b0833c62b909f50eb)
    will force [a call to `MADV_HUGEPAGE` for any aligned huge pages
    within](https://cs.opensource.google/go/go/+/master:src/runtime/mem_linux.go;l=148;drc=9839668b5619f45e293dd40339bf0ac614ea6bee),
    while small allocations tend to leave memory in an undetermined state for
    huge pages.
    The scavenger will try to release entire aligned hugepages at a time.
    Also, when any memory is released, [we `MADV_NOHUGEPAGE` any aligned pages
    in the range we
    release](https://cs.opensource.google/go/go/+/master:src/runtime/mem_linux.go;l=40;drc=9839668b5619f45e293dd40339bf0ac614ea6bee).
    However, the scavenger will [only release 64 KiB at a time unless it finds
    an aligned huge page to
    release](https://cs.opensource.google/go/go/+/master:src/runtime/mgcscavenge.go;l=564;drc=c70fd4b30aba5db2df7b5f6b0833c62b909f50eb),
    and even then it will [only `MADV_NOHUGEPAGE` the corresponding huge pages
    if the region it's scavenging crosses a huge page
    boundary](https://cs.opensource.google/go/go/+/master:src/runtime/mem_linux.go;l=70;drc=9839668b5619f45e293dd40339bf0ac614ea6bee).

## Proposal

One key insight in the design of the scavenger is that the runtime always has a
good idea of how much memory will be used soon: the total heap footprint for a
GC cycle is determined by the heap goal.[^2]

[^2]: The scavenger tries to return memory to the OS such that it leaves
    enough paged-in memory around to reach the heap goal (adjusted for
    fragmentation within spans and a 10% buffer for fragmentation outside of
    spans, or capped by the memory limit).
    The purpose behind this is to reduce the chance that the scavenger will
    return memory to the OS that will be used soon.
    The runtime also has a first-fit page allocator, so the scavenger can take
    pages from the high addresses in the heap, again reducing the chance of
    conflict.

Indeed, by [tracing page allocations and watching page state over
time](#appendix_page-traces) we can see that Go heaps tend to get very dense
toward the end of a GC cycle; this makes all of that memory a decent candidate
for huge pages from the perspective of fragmentation.
However, it's also clear that this density fluctuates significantly within a GC
cycle.

Therefore, I propose the following policy:
1. All new memory is initially marked as `MADV_HUGEPAGE` with the expectation
   that it will be used.
1. Before the scavenger releases pages in an aligned 4 MiB region of
   memory,[^3] it [first](#appendix_thp-flag-behavior) marks the region as
   `MADV_NOHUGEPAGE` if it isn't already marked as such.
   - If `max_ptes_none` is 0, skip this step.
1. Aligned 4 MiB regions of memory are only available to scavenge if they were
   less than 96%[^4] full at the end of the last GC cycle.[^5]
   - Scavenging for `GOMEMLIMIT` or `runtime/debug.FreeOSMemory` ignores this
     rule.
1. Any aligned 4 MiB region of memory that exceeds 96% occupancy is
   immediately marked as `MADV_HUGEPAGE`.
   - If `max_ptes_none` is 0, use `MADV_COLLAPSE` instead, if available.
   - Memory scavenged for `GOMEMLIMIT` or `runtime/debug.FreeOSMemory` is not
     marked `MADV_HUGEPAGE` until the next allocation that causes this
     condition after the end of the current GC cycle.[^6]

[^3]: 4 MiB doesn't align with linux/amd64 huge page sizes, but it is a very
    convenient number for the runtime because the page allocator manages
    memory in 4 MiB chunks.

[^4]: The bar for explicit (non-default) backing by huge pages must be very
    high.
    The main issue is the default value of
    `/sys/kernel/mm/transparent_hugepage/defrag` on Linux: it forces regions
    marked as `MADV_HUGEPAGE` to be immediately backed, stalling in the kernel
    until it can compact and rearrange things to provide a huge page.
    Meanwhile, the combination of `MADV_NOHUGEPAGE` and `MADV_DONTNEED` does
    the opposite.
    Switching between these two states often creates very expensive churn.

[^5]: Note that `runtime/debug.FreeOSMemory` and the mechanism to maintain
    `GOMEMLIMIT` must still be able to release all memory to be effective.
    For that reason, this rule does not apply to those two situations.
    Basically, these cases get to skip waiting until the end of the GC cycle,
    optimistically assuming that memory won't be used.
[^6]: It might happen that the wrong memory was scavenged (memory that soon
    after exceeds 96% occupancy).
    This delay helps reduce churn.

The goal of these changes is to ensure that when sparse regions of the heap
have their memory returned to the OS, it stays that way regardless of
`max_ptes_none`.
Meanwhile, the policy avoids expensive churn by delaying the release of pages
that were part of dense memory regions by at least a full GC cycle.

Note that there's potentially quite a lot of hysteresis here, which could
delay memory reclaim after, for example, a brief memory spike followed by a
long-lived, mostly-idle, low-memory state.
In the worst case, the time between GC cycles is 2 minutes, and the
scavenger's slowest return rate is ~256 MiB/sec.[^7] I suspect this is not
slow enough to be a problem in practice.
Furthermore, `GOMEMLIMIT` can still be employed to maintain a memory maximum.

[^7]: The scavenger is much more aggressive than it once was, targeting 1% of
    total CPU usage.
    Spending 1% of one CPU core in 2018 on `MADV_DONTNEED` meant roughly 8 KiB
    released per millisecond in the worst case.
    For a `GOMAXPROCS=32` process, this worst case is now approximately
    256 KiB per millisecond.
    In the best case, wherein the scavenger can identify whole unreleased huge
    pages, it would have released 2 MiB per millisecond in 2018, so 64 MiB per
    millisecond today.

## Alternative attempts

Initially, I attempted a design where all heap memory up to the heap goal
(address-ordered) is marked as `MADV_HUGEPAGE` and is ineligible for
scavenging.
The rest is always eligible for scavenging, and the scavenger marks that
memory as `MADV_NOHUGEPAGE`.

This approach had a few problems:
1. The heap goal tends to fluctuate, creating churn at the boundary.
1. When the heap is actively growing, the aftermath of this churn actually
   ends up in the middle of the fully-grown heap, as the scavenger works on
   memory beyond the boundary in between GC cycles.
1. Any fragmentation that does exist in the middle of the heap, for example if
   most allocations are large, is never looked at by the scavenger.

I also tried a simple heuristic to turn off the scavenger when it looked like
the heap was growing, but not all heaps grow monotonically, so a small amount
of churn still occurred.
It's difficult to come up with a good heuristic without assuming monotonicity.

My next attempt was more direct: mark high-density chunks as `MADV_HUGEPAGE`,
and allow low-density chunks to be scavenged and set as `MADV_NOHUGEPAGE`.
A chunk would become high-density if it was observed to have at least 80%
occupancy, and would later switch back to low-density if its occupancy dropped
below 20%.
This gap existed for hysteresis, to reduce churn.
Unfortunately, this also didn't work: GC-heavy programs often have memory
regions that go from extremely low (near 0%) occupancy to 100% occupancy
within a single GC cycle, creating a lot of churn.

The design above is ultimately a combination of these two designs: assume that
the heap generally becomes dense within a GC cycle, but handle it on a
chunk-by-chunk basis.

Where all this differs from other huge page efforts, such as [what TCMalloc
did](https://google.github.io/tcmalloc/temeraire.html), is in the lack of
bin-packing of allocated memory into huge pages (which is the central part of
that design).
Bin-packing increases the likelihood that an entire huge page will become
free, by placing new memory in existing, partially-filled huge pages rather
than following a global placement policy, like best-fit, that may put it
anywhere.
This not only improves the efficiency of releasing memory, but also makes the
overall footprint smaller due to less fragmentation.

Bin-packing is unlikely to be as useful for Go, since Go's heap already, at
least transiently, gets very dense.

Another thing that gets in the way of doing the same kind of bin-packing for
Go is that the allocator's slow path gets hit much harder than TCMalloc's slow
path.
The reason for this boils down to the GC memory reuse pattern (essentially,
FIFO vs. LIFO reuse).
Slowdowns in this path will likely create scalability problems.

## Appendix: THP flag behavior

Whether or not pages are eligible for THP is controlled by a combination of
settings:

`/sys/kernel/mm/transparent_hugepage/enabled`: system-wide control, possible
values:
- `never`: THP disabled.
- `madvise`: only pages with `MADV_HUGEPAGE` are eligible.
- `always`: all pages are eligible, unless marked `MADV_NOHUGEPAGE`.

`prctl(PR_SET_THP_DISABLE)`: process-wide control to disable THP.

`madvise`: per-mapping control, possible values:
- `MADV_NOHUGEPAGE`: mapping not eligible for THP.
  - Note that existing huge pages will not be split if this flag is set.
- `MADV_HUGEPAGE`: mapping eligible for THP unless there is a process- or
  system-wide disable.
- Unset: mapping eligible for THP if the system-wide control is set to
  `always`.

`/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none`: system-wide
control that specifies how many extra small pages can be allocated when
collapsing a group of pages into a huge page.
In other words, it controls how many small pages in a candidate huge page may
be not-faulted-in or faulted-in zero pages.

`MADV_DONTNEED` on a smaller range within a huge page will split the huge page
to zero the range.
However, the full huge page range will still be immediately eligible for
coalescing by `khugepaged` if `max_ptes_none > 0`, which is true for the
default open source Linux configuration.
Thus, to both disable future THP and split an existing huge page race-free,
you must first set `MADV_NOHUGEPAGE` and then call `MADV_DONTNEED`.

Another consideration is the newly-upstreamed `MADV_COLLAPSE`, which collapses
memory regions into huge pages unconditionally.
`MADV_DONTNEED` can then be used to break them up.
This scheme provides effectively complete control over huge pages, provided
`khugepaged` doesn't coalesce pages in a way that undoes the `MADV_DONTNEED`
(which can be ensured by setting `max_ptes_none` to zero).

## Appendix: Page traces

To investigate this issue, I built a
[low-overhead](https://perf.golang.org/search?q=upload:20221024.9) [page event
tracer](https://go.dev/cl/444157) and [visualization
utility](https://go.dev/cl/444158) to check assumptions about application and
GC behavior.
Below are a number of traces and conclusions drawn from them.
- [Tile38 K-Nearest benchmark](./59960/tile38.png): a GC-heavy benchmark.
  Note the fluctuation between very low occupancy and very high occupancy.
  During a single GC cycle, the page heap gets at least transiently very
  dense.
  This benchmark caused me the most trouble when trying out ideas.
- [Go compiler building a massive package](./59960/compiler.png): note again
  the high density.
