Clearing the inline mark bits with memclrNoHeapPointers is slightly
better than having the compiler insert zeroing code (e.g. duffzero),
since memclrNoHeapPointers can take advantage of wider SIMD
instructions. duffzero is likely going away, but we know things the
compiler doesn't, such as the fact that this memory is nicely aligned.
In this particular case, memclrNoHeapPointers does a better job.
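
As a rough sketch of the difference (the inlineMarkBits type and its size
below are stand-ins, not the runtime's actual layout, and the clear builtin
stands in for the runtime-internal memclrNoHeapPointers):

    package main

    import "unsafe"

    // Stand-in for the runtime's inline mark bits: a fixed-size,
    // pointer-free, well-aligned block of metadata. The real type and
    // size are runtime-internal.
    type inlineMarkBits struct {
        marks [64]uint8
    }

    // Zeroing via assignment: the compiler picks the code to emit
    // (historically duffzero or a run of inlined stores).
    func clearByAssignment(b *inlineMarkBits) {
        *b = inlineMarkBits{}
    }

    // Zeroing via an explicit bulk clear. In the runtime this is
    // memclrNoHeapPointers(unsafe.Pointer(b), unsafe.Sizeof(*b)), which
    // can use wide SIMD stores because the caller guarantees the memory
    // is pointer-free and nicely aligned.
    func clearByBulkClear(b *inlineMarkBits) {
        clear(unsafe.Slice((*byte)(unsafe.Pointer(b)), unsafe.Sizeof(*b)))
    }

    func main() {
        var b inlineMarkBits
        b.marks[0], b.marks[63] = 0xff, 0xff
        clearByBulkClear(&b)
        clearByAssignment(&b)
    }
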
For #73581.
Change-Id: I3918096929acfe6efe6f469fb089ebe04b4acff5
Reviewed-on: https://go-review.googlesource.com/c/go/+/687938
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
|
|
This change modifies initInlineMarkBits to only clear mark bits if the
span wasn't just freshly allocated from the OS, where we know the bits
are already zeroed. This probably doesn't make a huge difference most of
the time, but it's an easy optimization and helps rule it out as a
source of slowdown.
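
A minimal sketch of the guard this describes (the names and the flag
placement are placeholders, not the runtime's exact code):

    package main

    // initInlineMarkBitsSketch clears the inline mark bits region only
    // when the span's memory may be dirty. A span freshly mapped from the
    // OS is already zero-filled, so the clear is skipped.
    func initInlineMarkBitsSketch(bits []byte, spanNeedsZero bool) {
        if !spanNeedsZero {
            return // fresh from the OS: already zero
        }
        clear(bits)
    }

    func main() {
        bits := make([]byte, 64)
        initInlineMarkBitsSketch(bits, true)  // reused span: clear
        initInlineMarkBitsSketch(bits, false) // freshly mapped span: skip
    }
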
For #73581.
Change-Id: I78cd4d8968bb0bf6536c0a38ef9397475c39f0ad
Reviewed-on: https://go-review.googlesource.com/c/go/+/687937
Auto-Submit: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
|
|
This is conceptually simpler, as the sweeper doesn't have to worry about
clearing the inline mark bits separately. It also doesn't have a use for
them. This will also be useful for avoiding unnecessary zeroing in
initInlineMarkBits at allocation time. Currently, because
initInlineMarkBits is used at both span allocation and sweep time, we
cannot blindly trust needzero.
This change also renames mergeInlineMarkBits to moveInlineMarkBits to
make this change in semantics clearer from the name.
For #73581.
Change-Id: Ib154738a945633b7ff5b2ae27235baa310400139
Reviewed-on: https://go-review.googlesource.com/c/go/+/687936
Auto-Submit: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
|
|
Currently, with Green Tea GC, we need to copy (really bitwise-or) mark
bits back into mspan.gcmarkBits, so that they can propagate to
mspan.allocBits at sweep time. This merge does seem to make sweeping
small spans a good bit more expensive, though sweeping is still
relatively cheap. There's some low-hanging fruit here, though: the merge
is performed one byte at a time, which is pretty inefficient. We can
almost as easily perform it one word at a time instead, which seems to
make the operation about 33% faster.
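
A sketch of the byte-at-a-time merge versus a word-at-a-time merge (a
portable illustration only; the runtime operates on its own fixed-size
mark bitmaps rather than byte slices):

    package main

    import (
        "encoding/binary"
        "fmt"
    )

    // mergeBytes ORs src into dst one byte at a time: a load, an OR, and
    // a store per byte.
    func mergeBytes(dst, src []byte) {
        for i := range src {
            dst[i] |= src[i]
        }
    }

    // mergeWords does the same OR eight bytes at a time. Mark bitmaps
    // have a fixed, word-multiple size, so no tail handling is needed.
    func mergeWords(dst, src []byte) {
        for i := 0; i+8 <= len(src); i += 8 {
            w := binary.LittleEndian.Uint64(dst[i:]) | binary.LittleEndian.Uint64(src[i:])
            binary.LittleEndian.PutUint64(dst[i:], w)
        }
    }

    func main() {
        dst := []byte{0x01, 0, 0, 0, 0, 0, 0, 0x80}
        src := []byte{0x02, 0, 0xff, 0, 0, 0, 0, 0x01}
        mergeWords(dst, src)
        fmt.Printf("%x\n", dst) // 0300ff0000000081
    }
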
For #73581.
Change-Id: I170d36e7a2193199c423dcd556cba048ebd698af
Reviewed-on: https://go-review.googlesource.com/c/go/+/687935
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
|
|
The hugo binary gets slower, potentially dramatically so, with
GOEXPERIMENT=greenteagc. The root cause is page mapping churn. The Green
Tea code introduced a new implicit nil check on a value in a
freshly-allocated span to clear some new heap metadata. This nil check
would read the fresh memory, causing Linux to back that virtual address
space with a read-only page. This page would then be almost immediately
written to, causing Linux to possibly flush the TLB and find memory to
replace that read-only page (likely deduplicated as just the zero page).
This CL fixes the issue by replacing the implicit nil check, which is a
memory read expected to fault if the pointer is truly nil, with an
explicit one. The explicit nil check is a branch, and thus performs no
memory reads. The result is that the hugo binary no longer gets slower.
No regression test because it doesn't seem possible without access to OS
internals, like Linux tracepoints. We briefly experimented with RSS
metrics, but they're inconsistent. Some system RSS metrics count the
deduplicated zero page, while others (like those produced by
/proc/self/smaps) do not.
Instead, we'll add a new benchmark to our benchmark suite, separately.
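
A sketch of the two kinds of nil check (illustrative only; the actual
change is in the runtime's span-metadata initialization):

    package main

    // touchImplicit relies on an implicit nil check: the compiler proves
    // p is non-nil by emitting a load that faults if p is nil. That load
    // touches the fresh page, so the OS backs it (read-only, often with
    // the shared zero page) just before the caller writes to it.
    func touchImplicit(p *[64]byte) byte {
        return p[0] // a memory read
    }

    // touchExplicit uses an explicit nil check: a compare-and-branch that
    // reads no memory, so a freshly mapped page stays untouched until the
    // first real write.
    func touchExplicit(p *[64]byte) {
        if p == nil {
            panic("nil span metadata")
        }
    }

    func main() {
        var buf [64]byte
        _ = touchImplicit(&buf)
        touchExplicit(&buf)
    }
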
For #73581.
Fixes #74375.
Change-Id: I708321c14749a94ccff55072663012eba18b3b91
Reviewed-on: https://go-review.googlesource.com/c/go/+/684015
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Keith Randall <khr@google.com>
|
|
There are two additional sources of memory overhead per P that come from
greenteagc. One is for ptrBuf, but on platforms other than Windows it
doesn't actually cost anything due to demand-paging (Windows also
demand-pages, but the memory is 'committed' so it still counts against
OS RSS metrics). The other is for per-size-class scan stats. However, when
greenteagc is disabled, most of these scan stats are completely unused.
The worst-case memory overhead from these two sources is relatively
small (about 10 KiB per P), but for programs with a small memory
footprint running on a machine with a lot of cores, this can be
significant (single-digit percent).
This change does two things. First, it puts ptrBuf initialization behind
the greenteagc experiment, so now that memory is never allocated by
default. Second, it abstracts the implementation details of scan stat
collection and emission, such that we can have two different
implementations depending on the build tag. This lets us remove all the
unused stats when the greenteagc experiment is disabled, reducing the
memory overhead of the stats from ~2.6 KiB per P to 536 bytes per P.
This is enough to make the difference no longer noticeable in our
benchmark suite.
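
A sketch of the build-tag split this describes; the file names, package,
and fields are illustrative, not the runtime's actual declarations:

    // scanstats_greenteagc.go (hypothetical)
    //go:build goexperiment.greenteagc

    package runtimestats

    // Full per-size-class scan statistics, only paid for when the
    // experiment is enabled.
    type scanStats struct {
        spansScanned   [68]uint64
        objectsScanned [68]uint64
    }

    func (s *scanStats) noteSpanScanned(sizeClass, objects int) {
        s.spansScanned[sizeClass]++
        s.objectsScanned[sizeClass] += uint64(objects)
    }

    // scanstats_off.go (hypothetical)
    //go:build !goexperiment.greenteagc

    package runtimestats

    // Empty stand-in: same method set, no per-size-class arrays, so the
    // per-P footprint shrinks when the experiment is off.
    type scanStats struct{}

    func (s *scanStats) noteSpanScanned(sizeClass, objects int) {}
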
Fixes #73931.
Change-Id: I4351f1cbb3f6743d8f5922d757d73442c6d6ad3f
Reviewed-on: https://go-review.googlesource.com/c/go/+/678535
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
|
|
Our current parallel mark algorithm suffers from frequent stalls on
memory since its access pattern is essentially random. Small objects
are the worst offenders, since each one forces pulling in at least one
full cache line to access even when the amount to be scanned is far
smaller than that. Each object also requires an independent access to
per-object metadata.
The purpose of this change is to improve garbage collector performance
by scanning small objects in batches to obtain better cache locality
than our current approach. The core idea behind this change is to defer
marking and scanning small objects, and then scan them in batches
localized to a span.
This change adds scanned bits to each small object (<=512 bytes) span in
addition to mark bits. The scanned bits indicate that the object has
been scanned. (One way to think of them is "grey" bits and "black" bits
in the tri-color mark-sweep abstraction.) Each of these spans is always
8 KiB, and if a span contains pointers, its pointer/scalar data is
already packed together at the end of the span, allowing us to further
optimize the mark algorithm for this specific case.
When the GC encounters a pointer, it first checks whether it points into
a small object span. If so, the object is marked in the mark bits and
then queued on a work-stealing P-local queue. This queued object
represents the whole span, and we ensure that a span appears at most
once in any queue by maintaining an atomic ownership bit for each span.
Later, when the pointer is dequeued, we scan every object whose mark bit
is set but whose scanned bit is not. If it turns out that this was the
only object marked since the last time we scanned the span, we scan just
that object directly, essentially falling back to the existing
algorithm. noscan objects have no scan work, so they are never queued.
Each span's mark and scanned bits are co-located at the end of the span.
Since the span is always 8 KiB in size, these bits can be found with
simple pointer arithmetic. Next to the marks and scans we also store the
size class, eliminating the need to access the span's mspan altogether.
The work-stealing P-local queue is a new source of GC work. If this
queue gets full, half of it is dumped to a global linked list of spans
to scan. The regular scan queues are always prioritized over this queue
to allow time for darts (marks on queued spans) to accumulate. Stealing
work from other Ps is a
last resort.
This change also adds a new debug mode under GODEBUG=gctrace=2 that
dumps whole-span scanning statistics by size class on every GC cycle.
A future extension to this CL is to use SIMD-accelerated scanning
kernels for scanning spans with high mark bit density.
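
A sketch of the per-span flow described above (types, field names, and
the queue are placeholders; the runtime's real data structures and
memory ordering are more involved):

    package main

    import (
        "math/bits"
        "sync/atomic"
    )

    // spanScanState stands in for the per-span metadata kept at the end
    // of each 8 KiB small-object span: mark bits ("grey or black"),
    // scanned bits ("black"), and an ownership bit ensuring the span is
    // queued at most once.
    type spanScanState struct {
        marks   [64]uint8
        scanned [64]uint8
        queued  atomic.Bool
    }

    type spanQueue struct{ spans []*spanScanState }

    func (q *spanQueue) push(s *spanScanState) { q.spans = append(q.spans, s) }

    // markSmallObject records a pointer hit: set the object's mark bit
    // and, if the span isn't already queued somewhere, queue it.
    func markSmallObject(s *spanScanState, obj uint, q *spanQueue) {
        s.marks[obj/8] |= 1 << (obj % 8)
        if s.queued.CompareAndSwap(false, true) {
            q.push(s)
        }
    }

    // scanSpan runs when the span is dequeued: scan every object that is
    // marked but not yet scanned. If only one such object exists, this
    // degenerates to scanning a single object, as in the old algorithm.
    func scanSpan(s *spanScanState) {
        for i := range s.marks {
            pending := s.marks[i] &^ s.scanned[i]
            s.scanned[i] |= pending
            for pending != 0 {
                obj := uint(i*8) + uint(bits.TrailingZeros8(pending))
                scanObject(obj) // placeholder for scanning one object's pointers
                pending &= pending - 1
            }
        }
    }

    func scanObject(obj uint) {}

    func main() {
        var q spanQueue
        s := new(spanScanState)
        markSmallObject(s, 3, &q)
        markSmallObject(s, 200, &q) // same span: marked, but not re-queued
        for _, sp := range q.spans {
            scanSpan(sp)
        }
    }
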
For #19112. (Deadlock averted in GOEXPERIMENT.)
For #73581.
Change-Id: I4bbb4e36f376950a53e61aaaae157ce842c341bc
Reviewed-on: https://go-review.googlesource.com/c/go/+/658036
Auto-Submit: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>