This change removes the locked global span queue and replaces the
fixed-size local span queue with a variable-sized local span queue. The
variable-sized local span queue grows as needed to accommodate local
work. With no global span queue either, GC workers balance work amongst
themselves by stealing from each other.
The new variable-sized local span queues are inspired by the P-local
deque underlying sync.Pool. Unlike the sync.Pool deque, however, both
the owning P and stealing Ps take spans from the tail, making this
incarnation a strict queue, not a deque. This is intentional, since we
want a queue-like order to encourage objects to accumulate on each span.
These variable-sized local span queues are crucial to mark termination,
just like the global span queue was. To avoid hitting the ragged barrier
too often, we must check whether any Ps have any spans on their
variable-sized local span queues. We maintain a per-P atomic bitmask
(another pMask) that contains this state. We can also use this to speed
up stealing by skipping Ps that don't have any local spans.
The variable-sized local span queues are slower than the old fixed-size
local span queues because of the additional indirection, so this change
adds a non-atomic local fixed-size queue. This risks getting work stuck
on it, so, similarly to how workbufs work, each worker will occasionally
dump some spans onto its local variable-sized queue. This scales much
more nicely than dumping to a global queue, but is still visible to all
other Ps.
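For illustration, a minimal sketch of a per-P atomic bitmask like the
one described above (shape and names assumed, not the runtime's actual
pMask; atomic.Uint32.Or/And need Go 1.23+):

    package sketch

    import "sync/atomic"

    // pMask tracks, one bit per P, which Ps currently have local spans.
    type pMask []atomic.Uint32

    func (m pMask) set(id int32)   { m[id/32].Or(1 << (id % 32)) }
    func (m pMask) clear(id int32) { m[id/32].And(^uint32(1 << (id % 32))) }

    // read reports whether P id has any local spans; stealers can skip
    // Ps whose bit is clear.
    func (m pMask) read(id int32) bool {
        return m[id/32].Load()&(1<<(id%32)) != 0
    }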
For #73581.
Change-Id: I814f54d9c3cc7fa7896167746e9823f50943ac22
Reviewed-on: https://go-review.googlesource.com/c/go/+/700496
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
Our current parallel mark algorithm suffers from frequent stalls on
memory since its access pattern is essentially random. Small objects
are the worst offenders, since each one forces pulling in at least one
full cache line to access even when the amount to be scanned is far
smaller than that. Each object also requires an independent access to
per-object metadata.
The purpose of this change is to improve garbage collector performance
by scanning small objects in batches to obtain better cache locality
than our current approach. The core idea behind this change is to defer
marking and scanning small objects, and then scan them in batches
localized to a span.
This change adds scanned bits to each small object (<=512 bytes) span in
addition to mark bits. The scanned bits indicate that the object has
been scanned. (One way to think of them is "grey" bits and "black" bits
in the tri-color mark-sweep abstraction.) Each of these spans is always
8 KiB and if they contain pointers, the pointer/scalar data is already
packed together at the end of the span, allowing us to further optimize
the mark algorithm for this specific case.
When the GC encounters a pointer, it first checks if it points into a
small object span. If so, it is first marked in the mark bits, and then
the object is queued on a work-stealing P-local queue. This object
represents the whole span, and we ensure that a span can only appear at
most once in any queue by maintaining an atomic ownership bit for each
span. Later, when the pointer is dequeued, we scan every object with a
set mark that doesn't have a corresponding scanned bit. If it turns out
that was the only object in the mark bits since the last time we scanned
the span, we scan just that object directly, essentially falling back to
the existing algorithm. noscan objects have no scan work, so they are
never queued.
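For illustration, a minimal sketch of the batch scan on the dequeue
side (bitmap layout assumed): visit exactly the objects whose mark bit
is set but whose scanned bit is not, then blacken the whole batch.

    package sketch

    import "math/bits"

    // scanGrey visits every object that is marked ("grey") but not yet
    // scanned, then records the batch as scanned ("black"). marks and
    // scans are parallel bitmaps, one bit per object slot in the span.
    func scanGrey(marks, scans []uint64, scanObj func(i int)) {
        for w := range marks {
            grey := marks[w] &^ scans[w] // marked but not yet scanned
            for grey != 0 {
                i := bits.TrailingZeros64(grey)
                scanObj(w*64 + i)
                grey &^= 1 << i
            }
            scans[w] |= marks[w]
        }
    }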
Each span's mark and scanned bits are co-located together at the end of
the span. Since the span is always 8 KiB in size, it can be found with
simple pointer arithmetic. Next to the marks and scans we also store the
size class, eliminating the need to access the span's mspan altogether.
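Since the span is aligned and 8 KiB, a sketch of that pointer
arithmetic (footer layout and names assumed):

    package sketch

    import "unsafe"

    const spanSize = 8 << 10 // spans are always 8 KiB and aligned

    // spanMeta is the footer assumed at the end of each small-object
    // span: mark bits, scanned bits, and the size class.
    type spanMeta struct {
        marks     [8]uint64
        scans     [8]uint64
        sizeClass uint8
    }

    // metaFor locates the footer for the span containing p without
    // touching the span's mspan.
    func metaFor(p uintptr) *spanMeta {
        base := p &^ (spanSize - 1)
        return (*spanMeta)(unsafe.Pointer(base + spanSize - unsafe.Sizeof(spanMeta{})))
    }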
The work-stealing P-local queue is a new source of GC work. If this
queue gets full, half of it is dumped to a global linked list of spans
to scan. The regular scan queues are always prioritized over this queue
to allow time for marks to accumulate. Stealing work from other Ps is a
last resort.
This change also adds a new debug mode under GODEBUG=gctrace=2 that
dumps whole-span scanning statistics by size class on every GC cycle.
A future extension to this CL is to use SIMD-accelerated scanning
kernels for scanning spans with high mark bit density.
For #19112. (Gated behind a GOEXPERIMENT.)
For #73581.
Change-Id: I4bbb4e36f376950a53e61aaaae157ce842c341bc
Reviewed-on: https://go-review.googlesource.com/c/go/+/658036
Auto-Submit: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
Cleanup and friction reduction
For #65355.
Change-Id: Ia14c9dc584a529a35b97801dd3e95b9acc99a511
Reviewed-on: https://go-review.googlesource.com/c/go/+/600436
Reviewed-by: Keith Randall <khr@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Keith Randall <khr@golang.org>
|
|
For #65355
Change-Id: I65dd090fb99de9b231af2112c5ccb0eb635db2be
Reviewed-on: https://go-review.googlesource.com/c/go/+/560155
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Ibrahim Bazoka <ibrahimbazoka729@gmail.com>
Auto-Submit: Emmanuel Odeke <emmanuel@orijtech.com>
|
|
Change-Id: I69065f8adf101fdb28682c55997f503013a50e29
Reviewed-on: https://go-review.googlesource.com/c/go/+/449757
Auto-Submit: Ian Lance Taylor <iant@google.com>
Reviewed-by: Joedian Reid <joedian@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Joedian Reid <joedian@golang.org>
Run-TryBot: Ian Lance Taylor <iant@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
|
|
Updates #46731
Change-Id: Ic2208c8bb639aa1e390be0d62e2bd799ecf20654
Reviewed-on: https://go-review.googlesource.com/c/go/+/421878
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Cuong Manh Le <cuong.manhle.vn@gmail.com>
|
|
[This CL is part of a sequence implementing the proposal #51082.
The design doc is at https://go.dev/s/godocfmt-design.]
Run the updated gofmt, which reformats doc comments,
on the main repository. Vendored files are excluded.
For #51082.
Change-Id: I7332f099b60f716295fb34719c98c04eb1a85407
Reviewed-on: https://go-review.googlesource.com/c/go/+/384268
Reviewed-by: Jonathan Amsterdam <jba@google.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
|
|
A future change to gofmt will rewrite

    // Doc comment.
    //go:foo

to

    // Doc comment.
    //
    //go:foo

Apply that change preemptively to all comments (not necessarily just doc comments).
For #51082.
Change-Id: Iffe0285418d1e79d34526af3520b415a12203ca9
Reviewed-on: https://go-review.googlesource.com/c/go/+/384260
Trust: Russ Cox <rsc@golang.org>
Run-TryBot: Russ Cox <rsc@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
|
|
Less text, and it improves codegen a bit.
compilecmp on ARM64:
runtime
(*gcWork).putFast 160 -> 144 (-10.00%)
(*gcWork).tryGetFast 144 -> 128 (-11.11%)
scanobject 784 -> 752 (-4.08%)
greyobject 800 -> 784 (-2.00%)
AMD64:
runtime
greyobject 765 -> 748 (-2.22%)
(*gcWork).tryGetFast 102 -> 85 (-16.67%)
scanobject 837 -> 820 (-2.03%)
(*gcWork).putFast 102 -> 89 (-12.75%)
Change-Id: I6bb508afe1ba416823775c0bfc08ea9dc21de8a3
Reviewed-on: https://go-review.googlesource.com/c/go/+/393754
Reviewed-by: Austin Clements <austin@google.com>
Reviewed-by: Robert Griesemer <gri@golang.org>
Trust: Michael Knyszek <mknyszek@google.com>
|
|
This change implements the GC pacer redesign outlined in #44167 and the
accompanying design document, behind a GOEXPERIMENT flag that is on by
default.
In addition to adding the new pacer, this CL also includes code to track
and account for stack and globals scan work in the pacer and in the
assist credit system.
The new pacer also deviates slightly from the document in that it
increases the bound on the minimum trigger ratio from 0.6 (scaled by
GOGC) to 0.7. The logic behind this change is that the new pacer much
more consistently hits the goal (good!) leading to slightly less
frequent GC cycles, but _longer_ ones (in this case, bad!). It turns out
that the cost of having the GC on hurts throughput significantly (per
byte of memory used), though tail latencies can improve by up to 10%! To
be conservative, this change moves the value to 0.7 where there is a
small improvement to both throughput and latency, given the memory use.
Because the new pacer accounts for the two most significant sources of
scan work after heap objects, it is now also safer to reduce the minimum
heap size without leading to very poor amortization. This change thus
decreases the minimum heap size to 512 KiB, which corresponds to the
fact that the runtime has around 200 KiB of scannable globals always
there, up-front, providing a baseline.
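For concreteness, a sketch of the minimum-trigger bound in the terms
used above (formula assumed from this description, not the pacer's
actual code):

    package sketch

    // boundedTrigger clamps the trigger ratio to the minimum bound
    // (0.7, scaled by GOGC) and converts it to a heap trigger in
    // bytes, assuming:
    //
    //     goal    = heapMarked * (1 + GOGC/100)
    //     trigger = heapMarked * (1 + ratio)
    func boundedTrigger(heapMarked uint64, gogc int, ratio float64) uint64 {
        if min := 0.7 * float64(gogc) / 100; ratio < min {
            ratio = min
        }
        return uint64(float64(heapMarked) * (1 + ratio))
    }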
Benchmark results: https://perf.golang.org/search?q=upload:20211001.6
tile38's KNearest benchmark shows a memory increase, but throughput (and
latency) per byte of memory used is better.
gopher-lua showed an increase in both CPU time and memory usage, but
subsequent attempts to reproduce this behavior are inconsistent.
Sometimes the overall performance is better, sometimes it's worse. This
suggests that the benchmark is fairly noisy in a way not captured by the
benchmarking framework itself.
biogo-igor is the only benchmark to show a significant performance loss.
This benchmark exhibits a very high GC rate, with relatively little work
to do in each cycle. The idle mark workers are quite active. In the new
pacer, mark phases are longer, mark assists are fewer, and some of that
time in mark assists has shifted to idle workers. Linux perf indicates
that the difference in CPU time can be mostly attributed to write-barrier
slow path related calls, which in turn indicates that the write barrier
being on for longer is the primary culprit. This also explains the memory
increase, as a longer mark phase leads to more memory allocated black,
surviving an extra cycle and contributing to the heap goal.
For #44167.
Change-Id: I8ac7cfef7d593e4a642c9b2be43fb3591a8ec9c4
Reviewed-on: https://go-review.googlesource.com/c/go/+/309869
Trust: Michael Knyszek <mknyszek@google.com>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
|
|
[git-generate]
cd src/runtime
goimports -w *.go
Change-Id: I1387af0f2fd1a213dc2f4c122e83a8db0fcb15f0
Reviewed-on: https://go-review.googlesource.com/c/go/+/329189
Trust: Michael Knyszek <mknyszek@google.com>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
|
|
internal/goarch.PtrSize [generated]
[git-generate]
cd src/runtime/internal/math
gofmt -w -r "sys.PtrSize -> goarch.PtrSize" .
goimports -w *.go
cd ../..
gofmt -w -r "sys.PtrSize -> goarch.PtrSize" .
goimports -w *.go
Change-Id: I43491cdd54d2e06d4d04152b3d213851b7d6d423
Reviewed-on: https://go-review.googlesource.com/c/go/+/328337
Trust: Michael Knyszek <mknyszek@google.com>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
|
|
Change-Id: I31f2081eb7c30a9583f479f9194e636fe721b9b3
GitHub-Last-Rev: d09f5fbdc5785dc3963b22ad75309740e0de258e
GitHub-Pull-Request: golang/go#45278
Reviewed-on: https://go-review.googlesource.com/c/go/+/305231
Reviewed-by: Emmanuel Odeke <emmanuel@orijtech.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Run-TryBot: Emmanuel Odeke <emmanuel@orijtech.com>
TryBot-Result: Go Bot <gobot@golang.org>
|
|
This change modifies mheap's span allocation API to have each caller
declare a purpose, defined as a new enum called spanAllocType.
The purpose behind this change is two-fold:
1. Tight control over who gets to allocate heap memory is, generally
speaking, a good thing. Every codepath that allocates heap memory
places additional implicit restrictions on the allocator. A notable
example of a restriction is work bufs coming from heap memory: write
barriers are not allowed in allocation paths because then we could
have a situation where the allocator calls into the allocator.
2. Memory statistic updating is explicit. Instead of passing an opaque
pointer for statistic updating, which places restrictions on how that
statistic may be updated, we use the spanAllocType to determine which
statistic to update and how.
We also take this opportunity to group all the statistic updating code
together, which should make the accounting code a little easier to
follow.
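For illustration, a sketch of the shape of such an API (the enum
values and statistics here are assumed, not the runtime's actual
ones):

    package sketch

    // spanAllocType declares the purpose of a span allocation so the
    // allocator knows which statistic to update, and how.
    type spanAllocType uint8

    const (
        spanAllocHeap    spanAllocType = iota // user Go heap
        spanAllocStack                        // goroutine stacks
        spanAllocWorkBuf                      // GC work buffers
    )

    type heapStats struct {
        heapInUse, stackInUse, workBufInUse uint64
    }

    // accountAlloc keeps all the statistic updates in one place.
    func (s *heapStats) accountAlloc(typ spanAllocType, bytes uint64) {
        switch typ {
        case spanAllocHeap:
            s.heapInUse += bytes
        case spanAllocStack:
            s.stackInUse += bytes
        case spanAllocWorkBuf:
            s.workBufInUse += bytes
        }
    }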
Change-Id: Ic0b0898959ba2a776f67122f0e36c9d7d60e3085
Reviewed-on: https://go-review.googlesource.com/c/go/+/246970
Trust: Michael Knyszek <mknyszek@google.com>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
|
|
debugCachedWork and all of its dependent fields and code were added to
aid in debugging issue #27993. Now that the source of the problem is
known and mitigated (via the extra work check after STW in gcMarkDone),
these extra checks are no longer required and simply make the code more
difficult to follow.
Remove it all.
Updates #27993
Change-Id: I594beedd5ca61733ba9cc9eaad8f80ea92df1a0d
Reviewed-on: https://go-review.googlesource.com/c/go/+/262350
Trust: Michael Pratt <mpratt@google.com>
Run-TryBot: Michael Pratt <mpratt@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
|
|
I took some of the infrastructure from Austin's lock logging CR
https://go-review.googlesource.com/c/go/+/192704 (with deadlock
detection from the logs), and developed a setup to give static lock
ranking for runtime locks.
Static lock ranking establishes a documented total ordering among locks,
and then reports an error if the total order is violated. This can
happen if a deadlock happens (by acquiring a sequence of locks in
different orders), or if just one side of a possible deadlock happens.
Lock ordering deadlocks cannot happen as long as the lock ordering is
followed.
Along the way, I found a deadlock involving the new timer code, which Ian fixed
via https://go-review.googlesource.com/c/go/+/207348, as well as two other
potential deadlocks.
See the constants at the top of runtime/lockrank.go to show the static
lock ranking that I ended up with, along with some comments. This is
great documentation of the current intended lock ordering when acquiring
multiple locks in the runtime.
I also added an array lockPartialOrder[] which shows and enforces the
current partial ordering among locks (which is embedded within the total
ordering). This is more specific about the dependencies among locks.
I don't try to check the ranking within a lock class with multiple locks
that can be acquired at the same time (i.e. check the ranking when
multiple hchan locks are acquired).
Currently, I am doing a lockInit() call to set the lock rank of most
locks. Any lock that is not otherwise initialized is assumed to be a
leaf lock (a very high rank lock), so that eliminates the need to do
anything for a bunch of locks (including all architecture-dependent
locks). For two locks, root.lock and notifyList.lock (only in the
runtime/sema.go file), it is not as easy to do lock initialization, so
instead, I am passing the lock rank with the lock calls.
For Windows compilation, I needed to increase the StackGuard size from
896 to 928 because of the new lock-rank checking functions.
Checking of the static lock ranking is enabled by setting
GOEXPERIMENT=staticlockranking before doing a run.
To make sure that the static lock ranking code has no overhead in memory
or CPU when not enabled by GOEXPERIMENT, I changed 'go build/install' so
that it defines a build tag (with the same name) whenever any experiment
has been baked into the toolchain (by checking Expstring()). This allows
me to avoid increasing the size of the 'mutex' type when static lock
ranking is not enabled.
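For illustration, a minimal sketch of rank checking on acquire (ranks
and names invented for the example; the real ordering lives in
runtime/lockrank.go):

    package sketch

    type lockRank int

    const (
        lockRankSched lockRank = iota // example ranks only
        lockRankHchan
        lockRankLeaf lockRank = 1000 // default for uninitialized locks
    )

    type mutex struct {
        rank lockRank
    }

    // held records the ranks of locks the current thread holds
    // (per-M in the real runtime).
    var held []lockRank

    // lockWithRank enforces the total order: a lock may only be
    // acquired after locks of strictly lower rank.
    func lockWithRank(l *mutex, rank lockRank) {
        for _, r := range held {
            if r >= rank {
                panic("lock ordering violation")
            }
        }
        held = append(held, rank)
        acquire(l)
    }

    func acquire(*mutex) {} // placeholder for the real lock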
Fixes #38029
Change-Id: I154217ff307c47051f8dae9c2a03b53081acd83a
Reviewed-on: https://go-review.googlesource.com/c/go/+/207619
Reviewed-by: Dan Scales <danscales@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
Run-TryBot: Dan Scales <danscales@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
|
|
We check whether an M is preemptible in a surprising number of places.
Put it in one function.
For #10958, #24543.
Change-Id: I305090fdb1ea7f7a55ffe25851c1e35012d0d06c
Reviewed-on: https://go-review.googlesource.com/c/go/+/201439
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
|
|
Currently it's possible for the runtime to deadlock if checkPut is
called in a non-preemptible context. In this case, checkPut may spin,
so it won't leave the non-preemptible context, but the thread running
gcMarkDone needs to preempt all of the goroutines before it can
release the checkPut spin loops.
Fix this by returning from checkPut if it's called under any of the
conditions that would prevent gcMarkDone from preempting it. In this
case, it leaves a note behind that this happened; if the runtime does
later detect left-over work it can at least indicate that it was
unable to catch it in the act.
For #27993.
Updates #29385 (may fix it).
Change-Id: Ic71c10701229febb4ddf8c104fb10e06d84b122e
Reviewed-on: https://go-review.googlesource.com/c/156017
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
|
|
This captures the stack trace where mark completion observed that each
P had no work, and then dumps this if that P later discovers more
work. Hopefully this will help bound where the work was created.
For #27993.
Change-Id: I4f29202880d22c433482dc1463fb50ab693b6de6
Reviewed-on: https://go-review.googlesource.com/c/154599
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
|
|
In order to further diagnose #27993, I need to see exactly what
pointers are being added to the gcWork buffer too late.
Change-Id: I8d92113426ffbc6e55d819c39e7ab5eafa68668d
Reviewed-on: https://go-review.googlesource.com/c/152957
Run-TryBot: Austin Clements <austin@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
|
|
This adds several new checks to help debug #27993. It adds a mechanism
for freezing write barriers and gcWork puts during the mark completion
algorithm. This way, if we do detect mark completion, we can catch any
puts that happened during the completion algorithm. Based on build
dashboard failures, this seems to be the window of time when these are
happening.
This also double-checks that all work buffers are empty immediately
upon entering mark termination (much earlier than the current check).
This is unlikely to trigger based on the current failures, but is a
good safety net.
Change-Id: I03f56c48c4322069e28c50fbc3c15b2fee2130c2
Reviewed-on: https://go-review.googlesource.com/c/151797
Run-TryBot: Austin Clements <austin@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
|
|
This adds a debug check to throw immediately if any pointers are added
to the gcWork buffer after the mark completion barrier. The intent is
to catch the source of the cached GC work that occasionally produces
"P has cached GC work at end of mark termination" failures.
The result should be that we get "throwOnGCWork" throws instead of "P
has cached GC work at end of mark termination" throws, but with useful
stack traces.
This should be reverted before the release. I've been unable to
reproduce this issue locally, but this issue appears fairly regularly
on the builders, so the intent is to catch it on the builders.
This probably slows down the GC slightly.
For #27993.
Change-Id: I5035e14058ad313bfbd3d68c41ec05179147a85c
Reviewed-on: https://go-review.googlesource.com/c/149969
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
|
|
Go documentation style for boolean funcs is to say:

    // Foo reports whether ...
    func Foo() bool

(rather than "returns true if")
This CL also replaces 4 uses of "iff" with the same "reports whether"
wording, which doesn't lose any meaning, and will prevent people from
sending typo fixes when they don't realize it's "if and only if". In
the past I think we've had the typo CLs updated to just say "reports
whether". So do them all at once.
(Inspired by the addition of another "returns true if" in CL 146938
in fd_plan9.go)
Created with:
$ perl -i -npe 's/returns true iff/reports whether/' $(git grep -l "returns true iff" | grep -v vendor)
$ perl -i -npe 's/returns true if/reports whether/' $(git grep -l "returns true if" | grep -v vendor)
Change-Id: Ided502237f5ab0d25cb625dbab12529c361a8b9f
Reviewed-on: https://go-review.googlesource.com/c/147037
Reviewed-by: Ian Lance Taylor <iant@golang.org>
|
|
Now work.helperDrainBlock is always false, so we can remove it and
code paths that only ran when it was true. That means we no longer use
the gcDrainBlock mode of gcDrain, so we can eliminate that. That means
we no longer use gcWork.get, so we can eliminate that. That means we
no longer use getfull, so we can eliminate that.
Updates #26903. This is a follow-up to unifying STW GC and concurrent GC.
Change-Id: I8dbcf8ce24861df0a6149e0b7c5cd0eadb5c13f6
Reviewed-on: https://go-review.googlesource.com/c/134782
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
|
|
Now that there is no mark 2 phase, gcBlackenPromptly is no longer
used.
Updates #26903. This is a follow-up to eliminating mark 2.
Change-Id: Ib9c534f21b36b8416fcf3cab667f186167b827f8
Reviewed-on: https://go-review.googlesource.com/c/134319
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
|
|
Nothing currently consumes the flag, but we'll use it in the
distributed termination detection algorithm.
Updates #26903. This is preparation for eliminating mark 2.
Change-Id: I5e149a05b1c878fe1009150da21f8bd8ae2b9b6a
Reviewed-on: https://go-review.googlesource.com/c/134317
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
|
|
Change-Id: Ic8c506289caaf6218494e5150d10002e0232feaa
Reviewed-on: https://go-review.googlesource.com/85876
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
|
|
This implements runtime support for buffered write barriers on amd64.
The buffered write barrier has a fast path that simply enqueues
pointers in a per-P buffer. Unlike the current write barrier, this
fast path is *not* a normal Go call and does not require the compiler
to spill general-purpose registers or put arguments on the stack. When
the buffer fills up, the write barrier takes the slow path, which
spills all general purpose registers and flushes the buffer. We don't
allow safe-points or stack splits while this frame is active, so it
doesn't matter that we have no type information for the spilled
registers in this frame.
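For illustration, a sketch of the per-P buffer's fast/slow split
(field and function names assumed; the real fast path is hand-built
code, not a normal Go call):

    package sketch

    // wbBuf is a per-P buffer of pointers enqueued by write barriers.
    // end is normally len(buf); cgocheck=2 mode sets it to 1 so every
    // barrier takes the slow path.
    type wbBuf struct {
        next, end int
        buf       [512]uintptr
    }

    func (b *wbBuf) writeBarrier(ptr uintptr) {
        if b.next < b.end {
            b.buf[b.next] = ptr // fast path: just enqueue
            b.next++
            return
        }
        b.flush() // slow path: spill registers, drain into GC work
        b.buf[b.next] = ptr
        b.next++
    }

    func (b *wbBuf) flush() {
        // ... grey/queue b.buf[:b.next] ...
        b.next = 0
    }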
One minor complication is cgocheck=2 mode, which uses the write
barrier to detect Go pointers being written to non-Go memory. We
obviously can't buffer this, so instead we set the buffer to its
minimum size, forcing the write barrier into the slow path on every
call. For this specific case, we pass additional information as
arguments to the flush function. This also requires enabling the cgo
write barrier slightly later during runtime initialization, after Ps
(and the per-P write barrier buffers) have been initialized.
The code in this CL is not yet active. The next CL will modify the
compiler to generate calls to the new write barrier.
This reduces the average cost of the write barrier by roughly a factor
of 4, which will pay for the cost of having it enabled more of the
time after we make the GC pacer less aggressive. (Benchmarks will be
in the next CL.)
Updates #14951.
Updates #22460.
Change-Id: I396b5b0e2c5e5c4acfd761a3235fd15abadc6cb1
Reviewed-on: https://go-review.googlesource.com/73711
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
|
|
Currently most of these are marked go:nowritebarrier as a hint, but
it's actually important that these not invoke write barriers
recursively. The danger is that some gcWork method would invoke the
write barrier while the gcWork is in an inconsistent state and that
the write barrier would in turn invoke some other gcWork method, which
would crash or permanently corrupt the gcWork. Simply marking the
write barrier itself as go:nowritebarrierrec isn't sufficient to
prevent this if the write barrier doesn't use the outer method.
Thankfully, this doesn't cause any build failures, so we were getting
this right. :)
For #22460.
Change-Id: I35a7292a584200eb35a49507cd3fe359ba2206f6
Reviewed-on: https://go-review.googlesource.com/72554
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
|
|
This extends the sweeper to free workbufs back to the heap between GC
cycles, allowing this memory to be reused for GC'd allocations or
eventually returned to the OS.
This helps for applications that have high peak heap usage relative to
their regular heap usage (for example, a high-memory initialization
phase). Workbuf memory is roughly proportional to heap size and since
we currently never free workbufs, it's proportional to *peak* heap
size. By freeing workbufs, we can release and reuse this memory for
other purposes when the heap shrinks.
This is somewhat complicated because this costs ~1–2 µs per workbuf
span, so for large heaps it's too expensive to just do synchronously
after mark termination between starting the world and dropping the
worldsema. Hence, we do it asynchronously in the sweeper. This adds a
list of "free" workbuf spans that can be returned to the heap. GC
moves all workbuf spans to this list after mark termination and the
background sweeper drains this list back to the heap. If the sweeper
doesn't finish, that's fine, since getempty can directly reuse any
remaining spans to allocate more workbufs.
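A minimal sketch of that handoff (names assumed): mark termination
fills a free list, and the background sweeper drains it back to the
heap a batch at a time.

    package sketch

    import "sync"

    type span struct{ next *span }

    // wbufSpans.free holds workbuf spans released at mark termination
    // that the sweeper has not yet returned to the heap.
    var wbufSpans struct {
        mu   sync.Mutex
        free *span
    }

    // freeSomeWbufs is called repeatedly by the background sweeper.
    // It frees a small batch (each span costs ~1-2 µs) and reports
    // whether spans remain, so the caller can yield between batches.
    func freeSomeWbufs(batch int) bool {
        wbufSpans.mu.Lock()
        defer wbufSpans.mu.Unlock()
        for i := 0; i < batch && wbufSpans.free != nil; i++ {
            s := wbufSpans.free
            wbufSpans.free = s.next
            releaseToHeap(s)
        }
        return wbufSpans.free != nil
    }

    func releaseToHeap(*span) {} // placeholder for mheap freeing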
Performance impact is negligible. On the x/benchmarks, this reduces
GC-bytes-from-system by 6–11%.
Fixes #19325.
Change-Id: Icb92da2196f0c39ee984faf92d52f29fd9ded7a8
Reviewed-on: https://go-review.googlesource.com/38582
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
|
|
Currently the runtime allocates workbufs from persistent memory, which
means they can never be freed.
Switch to allocating them from manually-managed heap spans. This
doesn't free them yet, but it puts us in a position to do so.
For #19325.
Change-Id: I94b2512a2f2bbbb456cd9347761b9412e80d2da9
Reviewed-on: https://go-review.googlesource.com/38581
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
|
|
The lfstack API is still a C-style API: lfstacks all have unhelpful
type uint64 and the APIs are package-level functions. Make the code
more readable and Go-style by creating an lfstack type with methods
for push, pop, and empty.
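A minimal sketch of such a type (this version uses atomic.Pointer for
clarity; the real lfstack packs a pointer and a counter into a uint64
to defeat ABA races):

    package sketch

    import "sync/atomic"

    type lfnode struct {
        next *lfnode
    }

    // lfstack replaces package-level functions on a bare uint64 with
    // methods on a named type.
    type lfstack struct {
        top atomic.Pointer[lfnode]
    }

    func (s *lfstack) push(n *lfnode) {
        for {
            old := s.top.Load()
            n.next = old
            if s.top.CompareAndSwap(old, n) {
                return
            }
        }
    }

    func (s *lfstack) pop() *lfnode {
        for {
            old := s.top.Load()
            if old == nil || s.top.CompareAndSwap(old, old.next) {
                return old
            }
        }
    }

    func (s *lfstack) empty() bool { return s.top.Load() == nil }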
Change-Id: I64685fa3be0e82ae2d1a782a452a50974440a827
Reviewed-on: https://go-review.googlesource.com/38290
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
|
|
The gcstats structure is no longer consumed by anything and no longer
tracks statistics that are particularly relevant to the concurrent
garbage collector. Remove it. (Having statistics is probably a good
idea, but these aren't the stats we need these days and we don't have
a way to get them out of the runtime.)
In preparation for #13613.
Change-Id: Ib63e2f9067850668f9dcbfd4ed89aab4a6622c3f
Reviewed-on: https://go-review.googlesource.com/34936
Run-TryBot: Austin Clements <austin@google.com>
Reviewed-by: Rick Hudson <rlh@golang.org>
|
|
Since workbuf is now marked go:notinheap, the write barrier-preventing
wrapper type wbufptr is no longer necessary. Remove it.
Change-Id: I3e5b5803a1547d65de1c1a9c22458a38e08549b7
Reviewed-on: https://go-review.googlesource.com/35971
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
|
|
If the scheduler has no user work and there's no GC work visible, it
puts the P to sleep (or blocks on the network). However, if we later
enqueue more GC work, there's currently nothing that specifically
wakes up the scheduler to let it start an idle GC worker. As a result,
we can underutilize the CPU during GC if Ps have been put to sleep.
Fix this by making GC wake idle Ps when work buffers are put on the
full list. We already have a hook to do this, since we use this to
preempt a random P if we need more dedicated workers. We expand this
hook to instead wake an idle P if there is one. The logic we use for
this is identical to the logic used to wake an idle P when we ready a
goroutine.
To make this really sound, we also fix the scheduler to re-check the
idle GC worker condition after releasing its P. This closes a race
where 1) the scheduler checks for idle work and finds none, 2) new
work is enqueued but there are no idle Ps so none are woken, and 3)
the scheduler releases its P.
There is one subtlety here. Currently we call enlistWorker directly
from putfull, but the gcWork is in an inconsistent state in the places
that call putfull. This isn't a problem right now because nothing that
enlistWorker does touches the gcWork, but with the added call to
wakep, it's possible to get a recursive call into the gcWork
(specifically, while write barriers are disallowed, this can do an
allocation, which can dispose a gcWork, which can put a workbuf). To
handle this, we lift the enlistWorker calls up a layer and delay them
until the gcWork is in a consistent state.
Fixes #14179.
Change-Id: Ia2467a52e54c9688c3c1752e1fc00f5b37bbfeeb
Reviewed-on: https://go-review.googlesource.com/32434
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
|
|
This covers basically all sysAlloc'd, persistentalloc'd, and
fixalloc'd types.
Change-Id: I0487c887c2a0ade5e33d4c4c12d837e97468e66b
Reviewed-on: https://go-review.googlesource.com/30941
Reviewed-by: Rick Hudson <rlh@golang.org>
|
|
Currently the time spent in scanobject is proportional to the size of
the object being scanned. Since scanobject is non-preemptible, large
objects can cause significant goroutine (and even whole application)
delays through several means:
1. If a GC assist picks up a large object, the allocating goroutine is
blocked for the whole scan, even if that scan well exceeds that
goroutine's debt.
2. Since the scheduler does not run on the P performing a large object
scan, goroutines in that P's run queue do not run unless they are
stolen by another P (which can take some time). If there are a few
large objects, all of the Ps may get tied up so the scheduler
doesn't run anywhere.
3. Even if a large object is scanned by a background worker and other
Ps are still running the scheduler, the large object scan doesn't
flush background credit until the whole scan is done. This can
easily cause all allocations to block in assists, waiting for
credit, causing an effective STW.
Fix this by splitting large objects into 128 KB "oblets" and scanning
at most one oblet at a time. Since we can scan 1–2 MB/ms, this equates
to bounding scanobject at roughly 100 µs. This improves assist
behavior both because assists can no longer get "unlucky" and be stuck
scanning a large object, and because it causes the background worker
to flush credit and unblock assists more frequently when scanning
large objects. This also improves GC parallelism if the heap consists
primarily of a small number of very large objects by letting multiple
workers scan a large object in parallel.
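A minimal sketch of the splitting (names assumed): scan at most one
oblet now and, on the first visit to a large object, queue the rest
as separate work items.

    package sketch

    const obletBytes = 128 << 10 // 128 KB

    // scanOblet scans one oblet starting at b within the object
    // [base, base+size). On the first visit (b == base) it enqueues
    // the remaining oblets so other workers can scan them in parallel.
    func scanOblet(b, base, size uintptr,
        enqueue func(oblet uintptr), scanRange func(lo, hi uintptr)) {

        end := base + size
        if b == base && size > obletBytes {
            for off := base + obletBytes; off < end; off += obletBytes {
                enqueue(off)
            }
        }
        hi := b + obletBytes
        if hi > end {
            hi = end
        }
        scanRange(b, hi)
    }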
Fixes #10345. Fixes #16293.
This substantially improves goroutine latency in the benchmark from
issue #16293, which exercises several forms of very large objects:
name old max-latency new max-latency delta
SliceNoPointer-12 154µs ± 1% 155µs ± 2% ~ (p=0.087 n=13+12)
SlicePointer-12 314ms ± 1% 5.94ms ±138% -98.11% (p=0.000 n=19+20)
SliceLivePointer-12 1148ms ± 0% 4.72ms ±167% -99.59% (p=0.000 n=19+20)
MapNoPointer-12 72509µs ± 1% 408µs ±325% -99.44% (p=0.000 n=19+18)
ChanPointer-12 313ms ± 0% 4.74ms ±140% -98.49% (p=0.000 n=18+20)
ChanLivePointer-12 1147ms ± 0% 3.30ms ±149% -99.71% (p=0.000 n=19+20)
name old P99.9-latency new P99.9-latency delta
SliceNoPointer-12 113µs ±25% 107µs ±12% ~ (p=0.153 n=20+18)
SlicePointer-12 309450µs ± 0% 133µs ±23% -99.96% (p=0.000 n=20+20)
SliceLivePointer-12 961ms ± 0% 1.35ms ±27% -99.86% (p=0.000 n=20+20)
MapNoPointer-12 448µs ±288% 119µs ±18% -73.34% (p=0.000 n=18+20)
ChanPointer-12 309450µs ± 0% 134µs ±23% -99.96% (p=0.000 n=20+19)
ChanLivePointer-12 961ms ± 0% 1.35ms ±27% -99.86% (p=0.000 n=20+20)
This has negligible effect on all metrics from the garbage, JSON, and
HTTP x/benchmarks.
It shows slight improvement on some of the go1 benchmarks,
particularly Revcomp, which uses some multi-megabyte buffers:
name old time/op new time/op delta
BinaryTree17-12 2.46s ± 1% 2.47s ± 1% +0.32% (p=0.012 n=20+20)
Fannkuch11-12 2.82s ± 0% 2.81s ± 0% -0.61% (p=0.000 n=17+20)
FmtFprintfEmpty-12 50.8ns ± 5% 50.5ns ± 2% ~ (p=0.197 n=17+19)
FmtFprintfString-12 131ns ± 1% 132ns ± 0% +0.57% (p=0.000 n=20+16)
FmtFprintfInt-12 117ns ± 0% 116ns ± 0% -0.47% (p=0.000 n=15+20)
FmtFprintfIntInt-12 180ns ± 0% 179ns ± 1% -0.78% (p=0.000 n=16+20)
FmtFprintfPrefixedInt-12 186ns ± 1% 185ns ± 1% -0.55% (p=0.000 n=19+20)
FmtFprintfFloat-12 263ns ± 1% 271ns ± 0% +2.84% (p=0.000 n=18+20)
FmtManyArgs-12 741ns ± 1% 742ns ± 1% ~ (p=0.190 n=19+19)
GobDecode-12 7.44ms ± 0% 7.35ms ± 1% -1.21% (p=0.000 n=20+20)
GobEncode-12 6.22ms ± 1% 6.21ms ± 1% ~ (p=0.336 n=20+19)
Gzip-12 220ms ± 1% 219ms ± 1% ~ (p=0.130 n=19+19)
Gunzip-12 37.9ms ± 0% 37.9ms ± 1% ~ (p=1.000 n=20+19)
HTTPClientServer-12 82.5µs ± 3% 82.6µs ± 3% ~ (p=0.776 n=20+19)
JSONEncode-12 16.4ms ± 1% 16.5ms ± 2% +0.49% (p=0.003 n=18+19)
JSONDecode-12 53.7ms ± 1% 54.1ms ± 1% +0.71% (p=0.000 n=19+18)
Mandelbrot200-12 4.19ms ± 1% 4.20ms ± 1% ~ (p=0.452 n=19+19)
GoParse-12 3.38ms ± 1% 3.37ms ± 1% ~ (p=0.123 n=19+19)
RegexpMatchEasy0_32-12 72.1ns ± 1% 71.8ns ± 1% ~ (p=0.397 n=19+17)
RegexpMatchEasy0_1K-12 242ns ± 0% 242ns ± 0% ~ (p=0.168 n=17+20)
RegexpMatchEasy1_32-12 72.1ns ± 1% 72.1ns ± 1% ~ (p=0.538 n=18+19)
RegexpMatchEasy1_1K-12 385ns ± 1% 384ns ± 1% ~ (p=0.388 n=20+20)
RegexpMatchMedium_32-12 112ns ± 1% 112ns ± 3% ~ (p=0.539 n=20+20)
RegexpMatchMedium_1K-12 34.4µs ± 2% 34.4µs ± 2% ~ (p=0.628 n=18+18)
RegexpMatchHard_32-12 1.80µs ± 1% 1.80µs ± 1% ~ (p=0.522 n=18+19)
RegexpMatchHard_1K-12 54.0µs ± 1% 54.1µs ± 1% ~ (p=0.647 n=20+19)
Revcomp-12 387ms ± 1% 369ms ± 5% -4.89% (p=0.000 n=17+19)
Template-12 62.3ms ± 1% 62.0ms ± 0% -0.48% (p=0.002 n=20+17)
TimeParse-12 314ns ± 1% 314ns ± 0% ~ (p=1.000 n=20+13)
TimeFormat-12 358ns ± 0% 354ns ± 0% -1.12% (p=0.000 n=17+20)
[Geo mean] 53.5µs 53.3µs -0.23%
Change-Id: I2a0a179d1d6bf7875dd054b7693dd12d2a340132
Reviewed-on: https://go-review.googlesource.com/23540
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
|
|
The complexity of the GC work buffer put and tryGet
functions prevented them from being inlined. This CL simplifies
the fast path, thus enabling inlining. If the fast
path does not succeed, the previous put and tryGet
functions are called.
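A minimal sketch of the inlinable fast paths (types simplified): each
does one nil check and one bounds check, and reports failure so the
caller can fall back to the original slow function.

    package sketch

    type workbuf struct {
        nobj int
        obj  [253]uintptr
    }

    type gcWork struct {
        wbuf1 *workbuf
    }

    func (w *gcWork) putFast(obj uintptr) bool {
        wbuf := w.wbuf1
        if wbuf == nil || wbuf.nobj == len(wbuf.obj) {
            return false // caller calls the full put
        }
        wbuf.obj[wbuf.nobj] = obj
        wbuf.nobj++
        return true
    }

    func (w *gcWork) tryGetFast() uintptr {
        wbuf := w.wbuf1
        if wbuf == nil || wbuf.nobj == 0 {
            return 0 // caller calls the full tryGet
        }
        wbuf.nobj--
        return wbuf.obj[wbuf.nobj]
    }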
Change-Id: I6da6495d0dadf42bd0377c110b502274cc01acf5
Reviewed-on: https://go-review.googlesource.com/20704
Reviewed-by: Austin Clements <austin@google.com>
|
|
The tree's pretty inconsistent about single space vs double space
after a period in documentation. Make it consistently a single space,
per earlier decisions. This means contributors won't be confused by
misleading precedent.
This CL doesn't use go/doc to parse. It only addresses // comments.
It was generated with:
$ perl -i -npe 's,^(\s*// .+[a-z]\.) +([A-Z]),$1 $2,' $(git grep -l -E '^\s*//(.+\.) +([A-Z])')
$ go test go/doc -update
Change-Id: Iccdb99c37c797ef1f804a94b22ba5ee4b500c4f7
Reviewed-on: https://go-review.googlesource.com/20022
Reviewed-by: Rob Pike <r@golang.org>
Reviewed-by: Dave Day <djd@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
|
|
Early in Go 1.5 we had bugs with ownership of workbufs, so we added a
system for tracing their ownership to help debug these issues.
However, this system has both CPU and space overhead even when
disabled, it clutters up the workbuf API, the higher level gcWork
abstraction makes it very difficult to mess up the ownership of
workbufs in practice, and the tracing hasn't been enabled or needed
since 5b66e5d nine months ago. Hence, remove it.
Benchmarks show the usual noise from changes at this level, but little
overall movement.
name old time/op new time/op delta
XBenchGarbage-12 2.48ms ± 1% 2.47ms ± 0% -0.68% (p=0.000 n=21+21)
name old time/op new time/op delta
BinaryTree17-12 2.98s ± 7% 2.98s ± 6% ~ (p=0.799 n=20+20)
Fannkuch11-12 2.61s ± 3% 2.55s ± 5% -2.55% (p=0.003 n=20+20)
FmtFprintfEmpty-12 52.8ns ± 6% 53.6ns ± 6% ~ (p=0.228 n=20+20)
FmtFprintfString-12 177ns ± 4% 177ns ± 4% ~ (p=0.280 n=20+20)
FmtFprintfInt-12 162ns ± 5% 162ns ± 3% ~ (p=0.347 n=20+20)
FmtFprintfIntInt-12 277ns ± 7% 273ns ± 4% -1.62% (p=0.005 n=20+20)
FmtFprintfPrefixedInt-12 237ns ± 4% 242ns ± 4% +2.13% (p=0.005 n=20+20)
FmtFprintfFloat-12 315ns ± 4% 312ns ± 4% -0.97% (p=0.001 n=20+20)
FmtManyArgs-12 1.11µs ± 3% 1.15µs ± 4% +3.41% (p=0.004 n=20+20)
GobDecode-12 8.50ms ± 7% 8.53ms ± 7% ~ (p=0.429 n=20+20)
GobEncode-12 6.86ms ± 9% 6.93ms ± 7% +0.93% (p=0.030 n=20+20)
Gzip-12 326ms ± 4% 329ms ± 4% +0.98% (p=0.020 n=20+20)
Gunzip-12 43.3ms ± 3% 43.8ms ± 9% +1.25% (p=0.003 n=20+20)
HTTPClientServer-12 72.0µs ± 3% 71.5µs ± 3% ~ (p=0.053 n=20+20)
JSONEncode-12 17.0ms ± 6% 17.3ms ± 7% +1.32% (p=0.006 n=20+20)
JSONDecode-12 64.2ms ± 4% 63.5ms ± 3% -1.05% (p=0.005 n=20+20)
Mandelbrot200-12 4.00ms ± 3% 3.99ms ± 3% ~ (p=0.121 n=20+20)
GoParse-12 3.74ms ± 5% 3.75ms ± 9% ~ (p=0.383 n=20+20)
RegexpMatchEasy0_32-12 104ns ± 4% 104ns ± 6% ~ (p=0.392 n=20+20)
RegexpMatchEasy0_1K-12 358ns ± 3% 361ns ± 4% +0.95% (p=0.003 n=20+20)
RegexpMatchEasy1_32-12 86.3ns ± 5% 86.1ns ± 6% ~ (p=0.614 n=20+20)
RegexpMatchEasy1_1K-12 523ns ± 4% 518ns ± 3% -1.14% (p=0.008 n=20+20)
RegexpMatchMedium_32-12 137ns ± 3% 134ns ± 4% -1.90% (p=0.005 n=20+20)
RegexpMatchMedium_1K-12 41.0µs ± 3% 40.6µs ± 4% -1.11% (p=0.004 n=20+20)
RegexpMatchHard_32-12 2.13µs ± 4% 2.11µs ± 5% -1.31% (p=0.014 n=20+20)
RegexpMatchHard_1K-12 64.1µs ± 3% 63.2µs ± 5% -1.38% (p=0.005 n=20+20)
Revcomp-12 555ms ±10% 548ms ± 7% -1.17% (p=0.011 n=20+20)
Template-12 84.2ms ± 5% 88.2ms ± 4% +4.73% (p=0.000 n=20+20)
TimeParse-12 365ns ± 4% 371ns ± 5% +1.77% (p=0.002 n=20+20)
TimeFormat-12 361ns ± 4% 365ns ± 3% +1.08% (p=0.002 n=20+20)
[Geo mean] 64.7µs 64.8µs +0.19%
Change-Id: Ib043a7a0d18b588b298873d60913d44cd19f3b44
Reviewed-on: https://go-review.googlesource.com/19887
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
|
|
Currently most uses of gcWork use the per-P gcWork, but there are two
places that still use a stack-based gcWork. Simplify things by making
these instead use the per-P gcWork.
Change-Id: I712d012cce9dd5757c8541824e9641ac1c2a329c
Reviewed-on: https://go-review.googlesource.com/19636
Reviewed-by: Rick Hudson <rlh@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
|
|
When gcWork was first introduced, the compiler's escape analysis
wasn't good enough to detect that the method receiver didn't escape,
so we had to hack around this.
Now that the compiler can figure out this for itself, remove these
hacks.
Change-Id: I9f73fab721e272410b8b6905b564e7abc03c0dfe
Reviewed-on: https://go-review.googlesource.com/19634
Reviewed-by: Rick Hudson <rlh@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
|
|
Change-Id: Icd06d99c42b8299fd931c7da821e1f418684d913
Reviewed-on: https://go-review.googlesource.com/19829
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
|
|
runtime/internal/sys will hold system-, architecture- and
config-specific constants.
Updates #11647
Change-Id: I6db29c312556087a42e8d2bdd9af40d157c56b54
Reviewed-on: https://go-review.googlesource.com/16817
Reviewed-by: Russ Cox <rsc@golang.org>
|
|
This change breaks out most of the atomics functions in the runtime
into package runtime/internal/atomic. It adds some basic support
in the toolchain for runtime packages, and also modifies linux/arm
atomics to remove the dependency on the runtime's mutex. The mutexes
have been replaced with spinlocks.
all trybots are happy!
In addition to the trybots, I've tested on the darwin/arm64 builder,
on the darwin/arm builder, and on a ppc64le machine.
Change-Id: I6698c8e3cf3834f55ce5824059f44d00dc8e3c2f
Reviewed-on: https://go-review.googlesource.com/14204
Run-TryBot: Michael Matloob <matloob@golang.org>
Reviewed-by: Russ Cox <rsc@golang.org>
|
|
Currently we depend on the good graces and timing of the scheduler to
get opportunities to start dedicated mark workers. In the worst case,
it may take 10ms to get dedicated mark workers going at the beginning
of mark 1 and mark 2 or after the amount of available work has dropped
and gone back up.
Instead of waiting for the regular preemption logic to get around to
us, make putfull enlist a random P if we're not already running enough
dedicated workers. This should improve performance stability of the
garbage collector and is likely to improve the overall performance
somewhat.
No overall effect on the go1 benchmarks. It speeds up the garbage
benchmark by 12%, which more than counters the performance loss from
the previous commit.
name old time/op new time/op delta
XBenchGarbage-12 6.32ms ± 4% 5.58ms ± 2% -11.68% (p=0.000 n=20+16)
name old time/op new time/op delta
BinaryTree17-12 3.18s ± 5% 3.12s ± 4% -1.83% (p=0.021 n=20+20)
Fannkuch11-12 2.50s ± 2% 2.46s ± 2% -1.57% (p=0.000 n=18+19)
FmtFprintfEmpty-12 50.8ns ± 3% 50.4ns ± 3% ~ (p=0.184 n=20+20)
FmtFprintfString-12 167ns ± 2% 171ns ± 1% +2.46% (p=0.000 n=20+19)
FmtFprintfInt-12 161ns ± 2% 163ns ± 2% +1.81% (p=0.000 n=20+20)
FmtFprintfIntInt-12 269ns ± 1% 266ns ± 1% -0.81% (p=0.002 n=19+20)
FmtFprintfPrefixedInt-12 237ns ± 2% 231ns ± 2% -2.86% (p=0.000 n=20+20)
FmtFprintfFloat-12 313ns ± 2% 313ns ± 1% ~ (p=0.681 n=20+20)
FmtManyArgs-12 1.05µs ± 2% 1.03µs ± 1% -2.26% (p=0.000 n=20+20)
GobDecode-12 8.66ms ± 1% 8.67ms ± 1% ~ (p=0.380 n=19+20)
GobEncode-12 6.56ms ± 1% 6.56ms ± 2% ~ (p=0.607 n=19+20)
Gzip-12 317ms ± 1% 314ms ± 2% -1.10% (p=0.000 n=20+19)
Gunzip-12 42.1ms ± 1% 42.2ms ± 1% +0.27% (p=0.044 n=20+19)
HTTPClientServer-12 62.7µs ± 1% 62.0µs ± 1% -1.04% (p=0.000 n=19+18)
JSONEncode-12 16.7ms ± 1% 16.8ms ± 2% +0.59% (p=0.021 n=20+20)
JSONDecode-12 58.2ms ± 1% 61.4ms ± 2% +5.43% (p=0.000 n=18+19)
Mandelbrot200-12 3.84ms ± 1% 3.87ms ± 2% +0.79% (p=0.008 n=18+20)
GoParse-12 3.86ms ± 2% 3.76ms ± 2% -2.60% (p=0.000 n=20+20)
RegexpMatchEasy0_32-12 100ns ± 2% 100ns ± 1% -0.68% (p=0.005 n=18+15)
RegexpMatchEasy0_1K-12 332ns ± 1% 342ns ± 1% +3.16% (p=0.000 n=19+19)
RegexpMatchEasy1_32-12 82.9ns ± 3% 83.0ns ± 2% ~ (p=0.906 n=19+20)
RegexpMatchEasy1_1K-12 487ns ± 1% 494ns ± 1% +1.50% (p=0.000 n=17+20)
RegexpMatchMedium_32-12 131ns ± 2% 130ns ± 1% ~ (p=0.686 n=19+20)
RegexpMatchMedium_1K-12 39.6µs ± 1% 39.2µs ± 1% -1.09% (p=0.000 n=18+19)
RegexpMatchHard_32-12 2.04µs ± 1% 2.04µs ± 2% ~ (p=0.804 n=20+20)
RegexpMatchHard_1K-12 61.7µs ± 2% 61.3µs ± 2% ~ (p=0.052 n=18+20)
Revcomp-12 529ms ± 2% 533ms ± 1% +0.83% (p=0.003 n=20+19)
Template-12 70.7ms ± 2% 71.0ms ± 2% ~ (p=0.065 n=20+19)
TimeParse-12 351ns ± 2% 355ns ± 1% +1.25% (p=0.000 n=19+20)
TimeFormat-12 362ns ± 2% 373ns ± 1% +2.83% (p=0.000 n=18+20)
[Geo mean] 62.2µs 62.3µs +0.13%
name old speed new speed delta
GobDecode-12 88.6MB/s ± 1% 88.5MB/s ± 1% ~ (p=0.392 n=19+20)
GobEncode-12 117MB/s ± 1% 117MB/s ± 1% ~ (p=0.622 n=19+20)
Gzip-12 61.1MB/s ± 1% 61.8MB/s ± 2% +1.11% (p=0.000 n=20+19)
Gunzip-12 461MB/s ± 1% 460MB/s ± 1% -0.27% (p=0.044 n=20+19)
JSONEncode-12 116MB/s ± 1% 115MB/s ± 2% -0.58% (p=0.022 n=20+20)
JSONDecode-12 33.3MB/s ± 1% 31.6MB/s ± 2% -5.15% (p=0.000 n=18+19)
GoParse-12 15.0MB/s ± 2% 15.4MB/s ± 2% +2.66% (p=0.000 n=20+20)
RegexpMatchEasy0_32-12 317MB/s ± 2% 319MB/s ± 2% ~ (p=0.052 n=20+20)
RegexpMatchEasy0_1K-12 3.08GB/s ± 1% 2.99GB/s ± 1% -3.07% (p=0.000 n=19+19)
RegexpMatchEasy1_32-12 386MB/s ± 3% 386MB/s ± 2% ~ (p=0.939 n=19+20)
RegexpMatchEasy1_1K-12 2.10GB/s ± 1% 2.07GB/s ± 1% -1.46% (p=0.000 n=17+20)
RegexpMatchMedium_32-12 7.62MB/s ± 2% 7.64MB/s ± 1% ~ (p=0.702 n=19+20)
RegexpMatchMedium_1K-12 25.9MB/s ± 1% 26.1MB/s ± 2% +0.99% (p=0.000 n=18+20)
RegexpMatchHard_32-12 15.7MB/s ± 1% 15.7MB/s ± 2% ~ (p=0.723 n=20+20)
RegexpMatchHard_1K-12 16.6MB/s ± 2% 16.7MB/s ± 2% ~ (p=0.052 n=18+20)
Revcomp-12 481MB/s ± 2% 477MB/s ± 1% -0.83% (p=0.003 n=20+19)
Template-12 27.5MB/s ± 2% 27.3MB/s ± 2% ~ (p=0.062 n=20+19)
[Geo mean] 99.4MB/s 99.1MB/s -0.35%
Change-Id: I914d8cadded5a230509d118164a4c201601afc06
Reviewed-on: https://go-review.googlesource.com/16298
Reviewed-by: Rick Hudson <rlh@golang.org>
|
|
Currently the gcWork abstraction caches a single work buffer. As a
result, if a worker is putting and getting pointers right at the
boundary of a work buffer, it can flap between work buffers and
(potentially significantly) increase contention on the global work
buffer lists.
This change modifies gcWork to instead cache two work buffers and
switch between them. This introduces one buffer's worth of
hysteresis and eliminates the above performance worst case by
amortizing the cost of getting or putting a work buffer over at least
one buffer's worth of work.
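A minimal sketch of the two-buffer scheme (types simplified): put
swaps in the secondary buffer before touching the global lists, so
flapping at a buffer boundary costs only a swap.

    package sketch

    type workbuf struct {
        nobj int
        obj  [253]uintptr
    }

    type gcWork struct {
        wbuf1, wbuf2 *workbuf
    }

    func (w *gcWork) put(obj uintptr) {
        wbuf := w.wbuf1
        if wbuf.nobj == len(wbuf.obj) {
            // Primary full: try the secondary before the global list.
            w.wbuf1, w.wbuf2 = w.wbuf2, w.wbuf1
            wbuf = w.wbuf1
            if wbuf.nobj == len(wbuf.obj) {
                // Both full: only now pay for the global full list.
                putfull(wbuf)
                wbuf = getempty()
                w.wbuf1 = wbuf
            }
        }
        wbuf.obj[wbuf.nobj] = obj
        wbuf.nobj++
    }

    // Stand-ins for the global work lists.
    func putfull(*workbuf)   {}
    func getempty() *workbuf { return new(workbuf) }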
In practice, it's difficult to trigger this worst case with reasonably
large work buffers. On the garbage benchmark, this reduces the max
writes/sec to the global work list from 32K to 25K and the median from
6K to 5K. However, if a workload were to trigger this worst case
behavior, it could significantly drive up this contention.
This has negligible effects on the go1 benchmarks and slightly speeds
up the garbage benchmark.
name old time/op new time/op delta
XBenchGarbage-12 5.90ms ± 3% 5.83ms ± 4% -1.18% (p=0.011 n=18+18)
name old time/op new time/op delta
BinaryTree17-12 3.22s ± 4% 3.17s ± 3% -1.57% (p=0.009 n=19+20)
Fannkuch11-12 2.44s ± 1% 2.53s ± 4% +3.78% (p=0.000 n=18+19)
FmtFprintfEmpty-12 50.2ns ± 2% 50.5ns ± 5% ~ (p=0.631 n=19+20)
FmtFprintfString-12 167ns ± 1% 166ns ± 1% ~ (p=0.141 n=20+20)
FmtFprintfInt-12 162ns ± 1% 159ns ± 1% -1.80% (p=0.000 n=20+20)
FmtFprintfIntInt-12 277ns ± 2% 263ns ± 1% -4.78% (p=0.000 n=20+18)
FmtFprintfPrefixedInt-12 240ns ± 1% 232ns ± 2% -3.25% (p=0.000 n=20+20)
FmtFprintfFloat-12 311ns ± 1% 315ns ± 2% +1.17% (p=0.000 n=20+20)
FmtManyArgs-12 1.05µs ± 2% 1.03µs ± 2% -1.72% (p=0.000 n=20+20)
GobDecode-12 8.65ms ± 1% 8.71ms ± 2% +0.68% (p=0.001 n=19+20)
GobEncode-12 6.51ms ± 1% 6.54ms ± 1% +0.42% (p=0.047 n=20+19)
Gzip-12 318ms ± 2% 315ms ± 2% -1.20% (p=0.000 n=19+19)
Gunzip-12 42.2ms ± 2% 42.1ms ± 1% ~ (p=0.667 n=20+19)
HTTPClientServer-12 62.5µs ± 1% 62.4µs ± 1% ~ (p=0.110 n=20+18)
JSONEncode-12 16.8ms ± 1% 16.8ms ± 2% ~ (p=0.569 n=19+20)
JSONDecode-12 60.8ms ± 2% 59.8ms ± 1% -1.69% (p=0.000 n=19+19)
Mandelbrot200-12 3.87ms ± 1% 3.85ms ± 0% -0.61% (p=0.001 n=20+17)
GoParse-12 3.76ms ± 2% 3.76ms ± 1% ~ (p=0.698 n=20+20)
RegexpMatchEasy0_32-12 100ns ± 2% 101ns ± 2% ~ (p=0.065 n=19+20)
RegexpMatchEasy0_1K-12 342ns ± 2% 333ns ± 1% -2.82% (p=0.000 n=20+19)
RegexpMatchEasy1_32-12 83.3ns ± 2% 83.2ns ± 2% ~ (p=0.692 n=20+19)
RegexpMatchEasy1_1K-12 498ns ± 2% 490ns ± 1% -1.52% (p=0.000 n=18+20)
RegexpMatchMedium_32-12 131ns ± 2% 131ns ± 2% ~ (p=0.464 n=20+18)
RegexpMatchMedium_1K-12 39.3µs ± 2% 39.6µs ± 1% +0.77% (p=0.000 n=18+19)
RegexpMatchHard_32-12 2.04µs ± 2% 2.06µs ± 1% +0.69% (p=0.009 n=19+20)
RegexpMatchHard_1K-12 61.4µs ± 2% 62.1µs ± 1% +1.21% (p=0.000 n=19+20)
Revcomp-12 534ms ± 1% 529ms ± 1% -0.97% (p=0.000 n=19+16)
Template-12 70.4ms ± 2% 70.0ms ± 1% ~ (p=0.070 n=19+19)
TimeParse-12 359ns ± 3% 344ns ± 1% -4.15% (p=0.000 n=19+19)
TimeFormat-12 357ns ± 1% 361ns ± 2% +1.05% (p=0.002 n=20+20)
[Geo mean] 62.4µs 62.0µs -0.56%
name old speed new speed delta
GobDecode-12 88.7MB/s ± 1% 88.1MB/s ± 2% -0.68% (p=0.001 n=19+20)
GobEncode-12 118MB/s ± 1% 117MB/s ± 1% -0.42% (p=0.046 n=20+19)
Gzip-12 60.9MB/s ± 2% 61.7MB/s ± 2% +1.21% (p=0.000 n=19+19)
Gunzip-12 460MB/s ± 2% 461MB/s ± 1% ~ (p=0.661 n=20+19)
JSONEncode-12 116MB/s ± 1% 115MB/s ± 2% ~ (p=0.555 n=19+20)
JSONDecode-12 31.9MB/s ± 2% 32.5MB/s ± 1% +1.72% (p=0.000 n=19+19)
GoParse-12 15.4MB/s ± 2% 15.4MB/s ± 1% ~ (p=0.653 n=20+20)
RegexpMatchEasy0_32-12 317MB/s ± 2% 315MB/s ± 2% ~ (p=0.141 n=19+20)
RegexpMatchEasy0_1K-12 2.99GB/s ± 2% 3.07GB/s ± 1% +2.86% (p=0.000 n=20+19)
RegexpMatchEasy1_32-12 384MB/s ± 2% 385MB/s ± 2% ~ (p=0.672 n=20+19)
RegexpMatchEasy1_1K-12 2.06GB/s ± 2% 2.09GB/s ± 1% +1.54% (p=0.000 n=18+20)
RegexpMatchMedium_32-12 7.62MB/s ± 2% 7.63MB/s ± 2% ~ (p=0.800 n=20+18)
RegexpMatchMedium_1K-12 26.0MB/s ± 1% 25.8MB/s ± 1% -0.77% (p=0.000 n=18+19)
RegexpMatchHard_32-12 15.7MB/s ± 2% 15.6MB/s ± 1% -0.69% (p=0.010 n=19+20)
RegexpMatchHard_1K-12 16.7MB/s ± 2% 16.5MB/s ± 1% -1.19% (p=0.000 n=19+20)
Revcomp-12 476MB/s ± 1% 481MB/s ± 1% +0.97% (p=0.000 n=19+16)
Template-12 27.6MB/s ± 2% 27.7MB/s ± 1% ~ (p=0.071 n=19+19)
[Geo mean] 99.1MB/s 99.3MB/s +0.27%
Change-Id: I68bcbf74ccb716cd5e844a554f67b679135105e6
Reviewed-on: https://go-review.googlesource.com/16042
Reviewed-by: Rick Hudson <rlh@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
|
|
Currently the GC work buffers are only 256 bytes and hence can record
only 24 64-bit pointers. They were reduced from 4K in commits db7fd1c
and a15818f as a way to minimize the amount of work the per-P workbuf
caches could "hide" from the mark phase and carry in to the mark
termination phase. However, this approach wasn't very robust and we
later added a "mark 2" phase to address this problem head-on.
Because of mark 2, there's now no benefit to having very small work
buffers. But there are plenty of downsides: small work buffers
increase contention on the work lists, increase the frequency and
hence net overhead of acquiring and releasing work buffers, and
somewhat increase memory overhead of the GC.
This commit expands work buffers back to 4K (504 64-bit pointers).
This reduces the rate of writes to work.full in the garbage benchmark
from a peak of ~780,000 writes/sec to a peak of ~32,000 writes/sec.
This has negligible effect on the go1 benchmarks. It slightly slows
down the garbage benchmark.
name old time/op new time/op delta
XBenchGarbage-12 5.37ms ± 5% 5.60ms ± 2% +4.37% (p=0.000 n=20+20)
Change-Id: Ic9cc28e7a125d23d9faf4f5e690fb8aa9bcdfb28
Reviewed-on: https://go-review.googlesource.com/15893
Reviewed-by: Rick Hudson <rlh@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
|
|
Currently the concurrent root scan is performed in its entirety by the
GC coordinator before entering concurrent mark (which enables GC
workers). This scan is done sequentially, which can prolong the scan
phase, delay the mark phase, and means that the scan phase does not
obey the 25% CPU goal. Furthermore, there's no need to complete the
root scan before starting marking (in fact, we already allow GC
assists to happen during the scan phase), so this acts as an
unnecessary barrier between root scanning and marking.
This change shifts the root scan work out of the GC coordinator and in
to the GC workers. The coordinator simply sets up the scan state and
enqueues the right number of root scan jobs. The GC workers then drain
the root scan jobs prior to draining heap scan jobs.
This parallelizes the root scan process, makes it obey the 25% CPU
goal, and effectively eliminates root scanning as an isolated phase,
allowing the system to smoothly transition from root scanning to heap
marking. This also eliminates a major non-STW responsibility of the GC
coordinator, which will make it easier to switch to a decentralized
state machine. Finally, it puts us in a good position to perform root
scanning in assists as well, which will help satisfy assists at the
beginning of the GC cycle.
This is mostly straightforward. One tricky aspect is that we have to
deal with preemption deadlock: where two non-preemptible goroutines
are trying to preempt each other to perform a stack scan. Given the
context where this happens, the only instance of this is two
background workers trying to scan each other. We avoid this by simply
not scanning the stacks of background workers during the concurrent
phase; this is safe because we'll scan them during mark termination
(and their stacks are *very* small and should not contain any new
pointers).
This change also switches the root marking during mark termination to
use the same gcDrain-based code path as concurrent mark. This
shouldn't affect performance because STW root marking was already
parallel and tasks switched to heap marking immediately when no more
root marking tasks were available. However, it simplifies the code and
unifies these code paths.
This has negligible effect on the go1 benchmarks. It slightly slows
down the garbage benchmark, possibly by making GC run slightly more
frequently.
name old time/op new time/op delta
XBenchGarbage-12 5.10ms ± 1% 5.24ms ± 1% +2.87% (p=0.000 n=18+18)
name old time/op new time/op delta
BinaryTree17-12 3.25s ± 3% 3.20s ± 5% -1.57% (p=0.013 n=20+20)
Fannkuch11-12 2.45s ± 1% 2.46s ± 1% +0.38% (p=0.019 n=20+18)
FmtFprintfEmpty-12 49.7ns ± 3% 49.9ns ± 4% ~ (p=0.851 n=19+20)
FmtFprintfString-12 170ns ± 2% 170ns ± 1% ~ (p=0.775 n=20+19)
FmtFprintfInt-12 161ns ± 1% 160ns ± 1% -0.78% (p=0.000 n=19+18)
FmtFprintfIntInt-12 267ns ± 1% 270ns ± 1% +1.04% (p=0.000 n=19+19)
FmtFprintfPrefixedInt-12 238ns ± 2% 238ns ± 1% ~ (p=0.133 n=18+19)
FmtFprintfFloat-12 311ns ± 1% 310ns ± 2% -0.35% (p=0.023 n=20+19)
FmtManyArgs-12 1.08µs ± 1% 1.06µs ± 1% -2.31% (p=0.000 n=20+20)
GobDecode-12 8.65ms ± 1% 8.63ms ± 1% ~ (p=0.377 n=18+20)
GobEncode-12 6.49ms ± 1% 6.52ms ± 1% +0.37% (p=0.015 n=20+20)
Gzip-12 319ms ± 3% 318ms ± 1% ~ (p=0.975 n=19+17)
Gunzip-12 41.9ms ± 1% 42.1ms ± 2% +0.65% (p=0.004 n=19+20)
HTTPClientServer-12 61.7µs ± 1% 62.6µs ± 1% +1.40% (p=0.000 n=18+20)
JSONEncode-12 16.8ms ± 1% 16.9ms ± 1% ~ (p=0.239 n=20+18)
JSONDecode-12 58.4ms ± 1% 60.7ms ± 1% +3.85% (p=0.000 n=19+20)
Mandelbrot200-12 3.86ms ± 0% 3.86ms ± 1% ~ (p=0.092 n=18+19)
GoParse-12 3.75ms ± 2% 3.75ms ± 2% ~ (p=0.708 n=19+20)
RegexpMatchEasy0_32-12 100ns ± 1% 100ns ± 2% +0.60% (p=0.010 n=17+20)
RegexpMatchEasy0_1K-12 341ns ± 1% 342ns ± 2% ~ (p=0.203 n=20+19)
RegexpMatchEasy1_32-12 82.5ns ± 2% 83.2ns ± 2% +0.83% (p=0.007 n=19+19)
RegexpMatchEasy1_1K-12 495ns ± 1% 495ns ± 2% ~ (p=0.970 n=19+18)
RegexpMatchMedium_32-12 130ns ± 2% 130ns ± 2% +0.59% (p=0.039 n=19+20)
RegexpMatchMedium_1K-12 39.2µs ± 1% 39.3µs ± 1% ~ (p=0.214 n=18+18)
RegexpMatchHard_32-12 2.03µs ± 2% 2.02µs ± 1% ~ (p=0.166 n=18+19)
RegexpMatchHard_1K-12 61.0µs ± 1% 60.9µs ± 1% ~ (p=0.169 n=20+18)
Revcomp-12 533ms ± 1% 535ms ± 1% ~ (p=0.071 n=19+17)
Template-12 68.1ms ± 2% 73.0ms ± 1% +7.26% (p=0.000 n=19+20)
TimeParse-12 355ns ± 2% 356ns ± 2% ~ (p=0.530 n=19+20)
TimeFormat-12 357ns ± 2% 347ns ± 1% -2.59% (p=0.000 n=20+19)
[Geo mean] 62.1µs 62.3µs +0.31%
name old speed new speed delta
GobDecode-12 88.7MB/s ± 1% 88.9MB/s ± 1% ~ (p=0.377 n=18+20)
GobEncode-12 118MB/s ± 1% 118MB/s ± 1% -0.37% (p=0.015 n=20+20)
Gzip-12 60.9MB/s ± 3% 60.9MB/s ± 1% ~ (p=0.944 n=19+17)
Gunzip-12 464MB/s ± 1% 461MB/s ± 2% -0.64% (p=0.004 n=19+20)
JSONEncode-12 115MB/s ± 1% 115MB/s ± 1% ~ (p=0.236 n=20+18)
JSONDecode-12 33.2MB/s ± 1% 32.0MB/s ± 1% -3.71% (p=0.000 n=19+20)
GoParse-12 15.5MB/s ± 2% 15.5MB/s ± 2% ~ (p=0.702 n=19+20)
RegexpMatchEasy0_32-12 320MB/s ± 1% 318MB/s ± 2% ~ (p=0.094 n=18+20)
RegexpMatchEasy0_1K-12 3.00GB/s ± 1% 2.99GB/s ± 1% ~ (p=0.194 n=20+19)
RegexpMatchEasy1_32-12 388MB/s ± 2% 385MB/s ± 2% -0.83% (p=0.008 n=19+19)
RegexpMatchEasy1_1K-12 2.07GB/s ± 1% 2.07GB/s ± 1% ~ (p=0.964 n=19+18)
RegexpMatchMedium_32-12 7.68MB/s ± 1% 7.64MB/s ± 2% -0.57% (p=0.020 n=19+20)
RegexpMatchMedium_1K-12 26.1MB/s ± 1% 26.1MB/s ± 1% ~ (p=0.211 n=18+18)
RegexpMatchHard_32-12 15.8MB/s ± 1% 15.8MB/s ± 1% ~ (p=0.180 n=18+19)
RegexpMatchHard_1K-12 16.8MB/s ± 1% 16.8MB/s ± 2% ~ (p=0.236 n=20+19)
Revcomp-12 477MB/s ± 1% 475MB/s ± 1% ~ (p=0.071 n=19+17)
Template-12 28.5MB/s ± 2% 26.6MB/s ± 1% -6.77% (p=0.000 n=19+20)
[Geo mean] 100MB/s 99.0MB/s -0.82%
Change-Id: I875bf6ceb306d1ee2f470cabf88aa6ede27c47a0
Reviewed-on: https://go-review.googlesource.com/16059
Reviewed-by: Rick Hudson <rlh@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
|
|
This work queue is no longer used (there are many reads of
work.partial, but the only write is in putpartial, which is never
called).
Fixes #11922.
Change-Id: I08b76c0c02a0867a9cdcb94783e1f7629d44249a
Reviewed-on: https://go-review.googlesource.com/15892
Reviewed-by: Rick Hudson <rlh@golang.org>
|