path: root/src/runtime/mgcmark_greenteagc.go
2025-07-15  runtime: use memclrNoHeapPointers to clear inline mark bits  (Michael Anthony Knyszek)

Clearing the inline mark bits with memclrNoHeapPointers is slightly better than having the compiler insert, e.g., duffzero, since it can take advantage of wider SIMD instructions. duffzero is likely going away, but we know things the compiler doesn't, such as the fact that this memory is nicely aligned. In this particular case, memclrNoHeapPointers does a better job.

For #73581.

Change-Id: I3918096929acfe6efe6f469fb089ebe04b4acff5
Reviewed-on: https://go-review.googlesource.com/c/go/+/687938
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
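The word-at-a-time idea behind the change can be sketched as follows. This is a minimal illustration, not the runtime's code: `inlineBitsWords` and `clearInlineBits` are hypothetical names, and the real implementation simply calls memclrNoHeapPointers on the metadata region, which can additionally use wide SIMD stores because it knows the memory holds no heap pointers and is well aligned.

```go
package main

import "fmt"

// inlineBitsWords is a hypothetical size for the inline mark/scan metadata
// at the end of an 8 KiB span (an assumption, not the real value).
const inlineBitsWords = 16

// clearInlineBits zeroes the metadata one 64-bit word at a time. The real
// runtime calls memclrNoHeapPointers instead of relying on compiler-inserted
// zeroing like duffzero; this loop just illustrates the wide-store idea.
func clearInlineBits(bits *[inlineBitsWords]uint64) {
	for i := range bits {
		bits[i] = 0
	}
}

func main() {
	var bits [inlineBitsWords]uint64
	for i := range bits {
		bits[i] = ^uint64(0) // pretend every object is marked
	}
	clearInlineBits(&bits)
	fmt.Println(bits[0], bits[inlineBitsWords-1])
}
```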
2025-07-15  runtime: only clear inline mark bits on span alloc if necessary  (Michael Anthony Knyszek)

This change modifies initInlineMarkBits to only clear mark bits if the span wasn't just freshly allocated from the OS, where we know the bits are already zeroed. This probably doesn't make a huge difference most of the time, but it's an easy optimization and helps rule it out as a source of slowdown.

For #73581.

Change-Id: I78cd4d8968bb0bf6536c0a38ef9397475c39f0ad
Reviewed-on: https://go-review.googlesource.com/c/go/+/687937
Auto-Submit: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
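The optimization can be sketched with a toy span type. This is a hedged illustration: the `span` struct, its `needzero` flag, and the `inline` array are stand-ins for the runtime's mspan layout, not the real fields.

```go
package main

import "fmt"

// span is a tiny stand-in for the runtime's mspan. The fields here are
// illustrative assumptions, not the real layout.
type span struct {
	needzero bool      // false when the memory came zeroed straight from the OS
	inline   [4]uint64 // hypothetical inline mark/scan bits
}

// initInlineMarkBits clears the inline bits only when the backing memory
// may be dirty, mirroring the optimization in this commit: freshly mapped
// OS memory is already zero, so the clear can be skipped.
func (s *span) initInlineMarkBits() {
	if !s.needzero {
		return
	}
	for i := range s.inline {
		s.inline[i] = 0
	}
}

func main() {
	reused := &span{needzero: true, inline: [4]uint64{1, 2, 3, 4}}
	reused.initInlineMarkBits()
	fmt.Println(reused.inline) // [0 0 0 0]
}
```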
2025-07-15  runtime: have mergeInlineMarkBits also clear the inline mark bits  (Michael Anthony Knyszek)

This is conceptually simpler, as the sweeper doesn't have to worry about clearing them separately. It also doesn't have a use for them. This will also be useful for avoiding unnecessary zeroing in initInlineMarkBits at allocation time. Currently, because it's used at both span allocation and sweep time, we cannot blindly trust needzero.

This change also renames mergeInlineMarkBits to moveInlineMarkBits to make this change in semantics clearer from the name.

For #73581.

Change-Id: Ib154738a945633b7ff5b2ae27235baa310400139
Reviewed-on: https://go-review.googlesource.com/c/go/+/687936
Auto-Submit: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
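The merge-then-clear semantics can be sketched in a few lines. `moveBits` is a hypothetical name standing in for moveInlineMarkBits; the point it illustrates is that after the move, the source bits are guaranteed zero, so a later initInlineMarkBits need not re-clear them.

```go
package main

import "fmt"

// moveBits ORs src into dst and clears src in the same pass, sketching the
// semantic shift from "merge" to "move": once the bits have been moved, the
// source region is known to be zeroed, so neither the sweeper nor span
// allocation needs to clear it separately.
func moveBits(dst, src []uint64) {
	for i := range src {
		dst[i] |= src[i]
		src[i] = 0
	}
}

func main() {
	dst := []uint64{0b0101}
	src := []uint64{0b0011}
	moveBits(dst, src)
	fmt.Printf("%04b %04b\n", dst[0], src[0]) // 0111 0000
}
```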
2025-07-15  runtime: merge inline mark bits with gcmarkBits 8 bytes at a time  (Michael Anthony Knyszek)

Currently, with Green Tea GC, we need to copy (really, bitwise-or) mark bits back into mspan.gcmarkBits, so that they can propagate to mspan.allocBits at sweep time. This does actually seem to make sweeping small spans a good bit more expensive, though sweeping is still relatively cheap.

There's some low-hanging fruit here, though: the merge is performed one byte at a time, which is pretty inefficient. We can almost as easily perform this merge one word at a time instead, which seems to make this operation about 33% faster.

For #73581.

Change-Id: I170d36e7a2193199c423dcd556cba048ebd698af
Reviewed-on: https://go-review.googlesource.com/c/go/+/687935
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
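The byte-wise versus word-wise merge can be sketched as below. The function names are hypothetical, and the real runtime works on its own bitmap types rather than byte slices, but the transformation — the same bitwise-or done 8 bytes per iteration instead of 1 — is the one the commit describes.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// mergeBytes ORs src into dst one byte at a time: the old, slower approach.
func mergeBytes(dst, src []byte) {
	for i := range src {
		dst[i] |= src[i]
	}
}

// mergeWords ORs src into dst 8 bytes at a time, sketching the optimization
// in this commit. It assumes the slices are the same length and that the
// length is a multiple of 8, which holds for fixed-size mark bitmaps.
func mergeWords(dst, src []byte) {
	for i := 0; i+8 <= len(src); i += 8 {
		w := binary.LittleEndian.Uint64(dst[i:i+8]) | binary.LittleEndian.Uint64(src[i:i+8])
		binary.LittleEndian.PutUint64(dst[i:i+8], w)
	}
}

func main() {
	dst := []byte{0b0001, 0, 0, 0, 0, 0, 0, 0b1000}
	src := []byte{0b0100, 0, 0, 0, 0, 0, 0, 0b1000}
	mergeWords(dst, src)
	fmt.Println(dst[0], dst[7]) // 5 8
}
```

The two functions compute identical results; the word-wide variant simply issues an eighth of the loads and stores.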
2025-06-25  runtime: make explicit nil check in (*spanInlineMarkBits).init  (Michael Anthony Knyszek)

The hugo binary gets slower, potentially dramatically so, with GOEXPERIMENT=greenteagc. The root cause is page mapping churn. The Green Tea code introduced a new implicit nil check on a value in a freshly-allocated span to clear some new heap metadata. This nil check would read the fresh memory, causing Linux to back that virtual address space with an RO page. This would then be almost immediately written to, causing Linux to possibly flush the TLB and find memory to replace that read-only page (likely deduplicated as just the zero page).

This CL fixes the issue by replacing the implicit nil check, which is a memory read expected to fault if the pointer is truly nil, with an explicit one. The explicit nil check is a branch, and thus makes no reads to memory. The result is that the hugo binary no longer gets slower.

No regression test, because it doesn't seem possible without access to OS internals, like Linux tracepoints. We briefly experimented with RSS metrics, but they're inconsistent. Some system RSS metrics count the deduplicated zero page, while others (like those produced by /proc/self/smaps) do not. Instead, we'll add a new benchmark to our benchmark suite, separately.

For #73581.
Fixes #74375.

Change-Id: I708321c14749a94ccff55072663012eba18b3b91
Reviewed-on: https://go-review.googlesource.com/c/go/+/684015
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Keith Randall <khr@google.com>
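The shape of the fix can be sketched with a toy type. This is only an illustration of the pattern: `markBits` stands in for spanInlineMarkBits, and the comments describe the mechanism from the commit message, not behavior this snippet itself demonstrates (the page-mapping effect is invisible at this level).

```go
package main

import "fmt"

// markBits stands in for spanInlineMarkBits; the size is an assumption.
type markBits struct {
	bits [64]uint8
}

// init clears the metadata. The explicit nil branch below is a compare, not
// a load: unlike a compiler-inserted implicit nil check, it never reads the
// freshly mapped memory, so the first access to that memory is the write,
// and the kernel can install a writable page immediately rather than first
// backing the address with a read-only zero page.
func (m *markBits) init() {
	if m == nil {
		panic("init of nil markBits")
	}
	for i := range m.bits {
		m.bits[i] = 0
	}
}

func main() {
	m := &markBits{}
	m.bits[3] = 0xff
	m.init()
	fmt.Println(m.bits[3]) // 0
}
```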
2025-06-04  runtime: reduce per-P memory footprint when greenteagc is disabled  (Michael Anthony Knyszek)

There are two additional sources of per-P memory overhead that come from greenteagc. One is ptrBuf, but on platforms other than Windows it doesn't actually cost anything due to demand paging. (Windows also demand-pages, but the memory is 'committed', so it still counts against OS RSS metrics.) The other is the per-sizeclass scan stats. However, when greenteagc is disabled, most of these scan stats are completely unused.

The worst-case memory overhead from these two sources is relatively small (about 10 KiB per P), but for programs with a small memory footprint running on a machine with a lot of cores, this can be significant (single-digit percent).

This change does two things. First, it puts ptrBuf initialization behind the greenteagc experiment, so that memory is never allocated by default. Second, it abstracts the implementation details of scan stat collection and emission, such that we can have two different implementations depending on the build tag. This lets us remove all the unused stats when the greenteagc experiment is disabled, reducing the memory overhead of the stats from ~2.6 KiB per P to 536 bytes per P. This is enough to make the difference no longer noticeable in our benchmark suite.

Fixes #73931.

Change-Id: I4351f1cbb3f6743d8f5922d757d73442c6d6ad3f
Reviewed-on: https://go-review.googlesource.com/c/go/+/678535
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
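The abstraction can be sketched as two implementations behind one interface. This is a loose illustration: in the runtime the two versions live in separate files selected by goexperiment build constraints (so the unused one is compiled out entirely), not behind an interface, and all type and method names below are hypothetical.

```go
package main

import "fmt"

// scanStats abstracts scan stat collection and emission so a full
// implementation and a near-empty stub can be swapped per build.
type scanStats interface {
	noteSpanScanned(sizeClass int)
	emit()
}

// fullScanStats sketches the greenteagc version, which carries
// per-sizeclass counters (the runtime has 68 size classes).
type fullScanStats struct {
	bySizeClass [68]uint64
}

func (s *fullScanStats) noteSpanScanned(sc int) { s.bySizeClass[sc]++ }
func (s *fullScanStats) emit() {
	fmt.Println("spans scanned in class 5:", s.bySizeClass[5])
}

// noScanStats is the stub used when the experiment is off: it holds no
// per-sizeclass memory at all.
type noScanStats struct{}

func (noScanStats) noteSpanScanned(int) {}
func (noScanStats) emit()               {}

func main() {
	var stats scanStats = noScanStats{} // default build: no stat memory
	stats.noteSpanScanned(5)
	stats.emit() // no-op

	full := &fullScanStats{} // greenteagc build: real counters
	full.noteSpanScanned(5)
	full.emit()
}
```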
2025-05-02  runtime: mark and scan small objects in whole spans [green tea]  (Michael Anthony Knyszek)

Our current parallel mark algorithm suffers from frequent stalls on memory, since its access pattern is essentially random. Small objects are the worst offenders, since each one forces pulling in at least one full cache line even when the amount to be scanned is far smaller than that. Each object also requires an independent access to per-object metadata. The purpose of this change is to improve garbage collector performance by scanning small objects in batches to obtain better cache locality than our current approach.

The core idea behind this change is to defer marking and scanning small objects, and then scan them in batches localized to a span.

This change adds scanned bits to each span of small objects (<=512 bytes), in addition to mark bits. The scanned bits indicate that the object has been scanned. (One way to think of them is as "grey" bits and "black" bits in the tri-color mark-sweep abstraction.) Each of these spans is always 8 KiB, and if the objects contain pointers, the pointer/scalar data is already packed together at the end of the span, allowing us to further optimize the mark algorithm for this specific case.

When the GC encounters a pointer, it first checks whether it points into a small object span. If so, the object is first marked in the mark bits, and then queued on a work-stealing P-local queue. This object represents the whole span, and we ensure that a span can appear at most once in any queue by maintaining an atomic ownership bit for each span. Later, when the pointer is dequeued, we scan every object with a set mark bit that doesn't have a corresponding scanned bit. If it turns out that was the only object in the mark bits since the last time we scanned the span, we scan just that object directly, essentially falling back to the existing algorithm. noscan objects have no scan work, so they are never queued.

Each span's mark and scanned bits are co-located at the end of the span. Since the span is always 8 KiB in size, they can be found with simple pointer arithmetic. Next to the marks and scans we also store the size class, eliminating the need to access the span's mspan altogether.

The work-stealing P-local queue is a new source of GC work. If this queue gets full, half of it is dumped to a global linked list of spans to scan. The regular scan queues are always prioritized over this queue, to allow time for marks to accumulate on spans. Stealing work from other Ps is a last resort.

This change also adds a new debug mode under GODEBUG=gctrace=2 that dumps whole-span scanning statistics by size class on every GC cycle.

A future extension to this CL is to use SIMD-accelerated scanning kernels for scanning spans with high mark bit density.

For #19112.
For #73581.

Change-Id: I4bbb4e36f376950a53e61aaaae157ce842c341bc
Reviewed-on: https://go-review.googlesource.com/c/go/+/658036
Auto-Submit: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
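The grey-versus-black core of the batched span scan can be sketched with two 64-bit bitmaps. This is a simplified illustration — `spanBits` and `scanSpan` are hypothetical names, the real metadata covers more objects and also records the size class — but the key operation is the one described above: `marks &^ scanned` is exactly the set of grey objects left to scan.

```go
package main

import (
	"fmt"
	"math/bits"
)

// spanBits sketches the inline metadata: mark bits ("grey") and scanned
// bits ("black") for up to 64 small objects in one span.
type spanBits struct {
	marks, scanned uint64
}

// scanSpan visits every object that is marked but not yet scanned, flips
// its scanned bit, and returns the indices visited. If only one object is
// grey, this degenerates to scanning a single object, which is the
// fallback to the existing algorithm the commit message mentions.
func scanSpan(s *spanBits) []int {
	grey := s.marks &^ s.scanned // marked but not yet scanned
	var visited []int
	for grey != 0 {
		i := bits.TrailingZeros64(grey)
		visited = append(visited, i) // a real GC would scan object i here
		grey &^= 1 << i
	}
	s.scanned |= s.marks // everything marked is now black
	return visited
}

func main() {
	s := &spanBits{marks: 0b10110, scanned: 0b00010}
	fmt.Println(scanSpan(s)) // object 1 is already black; 2 and 4 are grey
}
```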