This change removes the locked global span queue and replaces the
fixed-size local span queue with a variable-sized local span queue. The
variable-sized local span queue grows as needed to accommodate local
work. With no global span queue either, GC workers balance work amongst
themselves by stealing from each other.
The new variable-sized local span queues are inspired by the P-local
deque underlying sync.Pool. Unlike the sync.Pool deque, however, both
the owning P and stealing Ps take spans from the tail, making this
incarnation a strict queue, not a deque. This is intentional, since we
want a queue-like order to encourage objects to accumulate on each span.
These variable-sized local span queues are crucial to mark termination,
just like the global span queue was. To avoid hitting the ragged barrier
too often, we must check whether any Ps have any spans on their
variable-sized local span queues. We maintain a per-P atomic bitmask
(another pMask) that contains this state. We can also use this to speed
up stealing by skipping Ps that don't have any local spans.
The variable-sized local span queues are slower than the old fixed-size
local span queues because of the additional indirection, so this change
adds a non-atomic local fixed-size queue. This risks getting work stuck
on it, so, similarly to how workbufs work, each worker will occasionally
dump some spans onto its local variable-sized queue. This scales much
more nicely than dumping to a global queue, but is still visible to all
other Ps.
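For illustration, a minimal sketch of a per-P atomic bitmask like the
one described above (shape and names assumed, not the runtime's actual
pMask; atomic.Uint32.Or/And need Go 1.23+):

    package sketch

    import "sync/atomic"

    // pMask tracks, one bit per P, which Ps currently have local spans.
    type pMask []atomic.Uint32

    func (m pMask) set(id int32)   { m[id/32].Or(1 << (id % 32)) }
    func (m pMask) clear(id int32) { m[id/32].And(^uint32(1 << (id % 32))) }

    // read reports whether P id has any local spans; stealers can skip
    // Ps whose bit is clear.
    func (m pMask) read(id int32) bool {
        return m[id/32].Load()&(1<<(id%32)) != 0
    }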
For #73581.
Change-Id: I814f54d9c3cc7fa7896167746e9823f50943ac22
Reviewed-on: https://go-review.googlesource.com/c/go/+/700496
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
Our current parallel mark algorithm suffers from frequent stalls on
memory since its access pattern is essentially random. Small objects
are the worst offenders, since each one forces pulling in at least one
full cache line to access even when the amount to be scanned is far
smaller than that. Each object also requires an independent access to
per-object metadata.
The purpose of this change is to improve garbage collector performance
by scanning small objects in batches to obtain better cache locality
than our current approach. The core idea behind this change is to defer
marking and scanning small objects, and then scan them in batches
localized to a span.
This change adds scanned bits to each small object (<=512 bytes) span in
addition to mark bits. The scanned bits indicate that the object has
been scanned. (One way to think of them is "grey" bits and "black" bits
in the tri-color mark-sweep abstraction.) Each of these spans is always
8 KiB and if they contain pointers, the pointer/scalar data is already
packed together at the end of the span, allowing us to further optimize
the mark algorithm for this specific case.
When the GC encounters a pointer, it first checks if it points into a
small object span. If so, it is first marked in the mark bits, and then
the object is queued on a work-stealing P-local queue. This object
represents the whole span, and we ensure that a span can only appear at
most once in any queue by maintaining an atomic ownership bit for each
span. Later, when the pointer is dequeued, we scan every object with a
set mark that doesn't have a corresponding scanned bit. If it turns out
that was the only object in the mark bits since the last time we scanned
the span, we scan just that object directly, essentially falling back to
the existing algorithm. noscan objects have no scan work, so they are
never queued.
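For illustration, a minimal sketch of the batch scan on the dequeue
side (bitmap layout assumed): visit exactly the objects whose mark bit
is set but whose scanned bit is not, then blacken the whole batch.

    package sketch

    import "math/bits"

    // scanGrey visits every object that is marked ("grey") but not yet
    // scanned, then records the batch as scanned ("black"). marks and
    // scans are parallel bitmaps, one bit per object slot in the span.
    func scanGrey(marks, scans []uint64, scanObj func(i int)) {
        for w := range marks {
            grey := marks[w] &^ scans[w] // marked but not yet scanned
            for grey != 0 {
                i := bits.TrailingZeros64(grey)
                scanObj(w*64 + i)
                grey &^= 1 << i
            }
            scans[w] |= marks[w]
        }
    }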
Each span's mark and scanned bits are co-located together at the end of
the span. Since the span is always 8 KiB in size, it can be found with
simple pointer arithmetic. Next to the marks and scans we also store the
size class, eliminating the need to access the span's mspan altogether.
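Since the span is aligned and 8 KiB, a sketch of that pointer
arithmetic (footer layout and names assumed):

    package sketch

    import "unsafe"

    const spanSize = 8 << 10 // spans are always 8 KiB and aligned

    // spanMeta is the footer assumed at the end of each small-object
    // span: mark bits, scanned bits, and the size class.
    type spanMeta struct {
        marks     [8]uint64
        scans     [8]uint64
        sizeClass uint8
    }

    // metaFor locates the footer for the span containing p without
    // touching the span's mspan.
    func metaFor(p uintptr) *spanMeta {
        base := p &^ (spanSize - 1)
        return (*spanMeta)(unsafe.Pointer(base + spanSize - unsafe.Sizeof(spanMeta{})))
    }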
The work-stealing P-local queue is a new source of GC work. If this
queue gets full, half of it is dumped to a global linked list of spans
to scan. The regular scan queues are always prioritized over this queue
to allow time for marks to accumulate. Stealing work from other Ps is a
last resort.
This change also adds a new debug mode under GODEBUG=gctrace=2 that
dumps whole-span scanning statistics by size class on every GC cycle.
A future extension to this CL is to use SIMD-accelerated scanning
kernels for scanning spans with high mark bit density.
For #19112. (Gated behind a GOEXPERIMENT.)
For #73581.
Change-Id: I4bbb4e36f376950a53e61aaaae157ce842c341bc
Reviewed-on: https://go-review.googlesource.com/c/go/+/658036
Auto-Submit: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
Cleanup and friction reduction
For #65355.
Change-Id: Ia14c9dc584a529a35b97801dd3e95b9acc99a511
Reviewed-on: https://go-review.googlesource.com/c/go/+/600436
Reviewed-by: Keith Randall <khr@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Keith Randall <khr@golang.org>
|
|
For #65355
Change-Id: I65dd090fb99de9b231af2112c5ccb0eb635db2be
Reviewed-on: https://go-review.googlesource.com/c/go/+/560155
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Ibrahim Bazoka <ibrahimbazoka729@gmail.com>
Auto-Submit: Emmanuel Odeke <emmanuel@orijtech.com>
|
|
Change-Id: I69065f8adf101fdb28682c55997f503013a50e29
Reviewed-on: https://go-review.googlesource.com/c/go/+/449757
Auto-Submit: Ian Lance Taylor <iant@google.com>
Reviewed-by: Joedian Reid <joedian@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Joedian Reid <joedian@golang.org>
Run-TryBot: Ian Lance Taylor <iant@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
|
|
Updates #46731
Change-Id: Ic2208c8bb639aa1e390be0d62e2bd799ecf20654
Reviewed-on: https://go-review.googlesource.com/c/go/+/421878
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Cuong Manh Le <cuong.manhle.vn@gmail.com>
|
|
[This CL is part of a sequence implementing the proposal #51082.
The design doc is at https://go.dev/s/godocfmt-design.]
Run the updated gofmt, which reformats doc comments,
on the main repository. Vendored files are excluded.
For #51082.
Change-Id: I7332f099b60f716295fb34719c98c04eb1a85407
Reviewed-on: https://go-review.googlesource.com/c/go/+/384268
Reviewed-by: Jonathan Amsterdam <jba@google.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
|
|
A future change to gofmt will rewrite

    // Doc comment.
    //go:foo

to

    // Doc comment.
    //
    //go:foo

Apply that change preemptively to all comments (not necessarily just doc comments).
For #51082.
Change-Id: Iffe0285418d1e79d34526af3520b415a12203ca9
Reviewed-on: https://go-review.googlesource.com/c/go/+/384260
Trust: Russ Cox <rsc@golang.org>
Run-TryBot: Russ Cox <rsc@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
|
|
Less text, and it improves codegen a bit.
compilecmp on ARM64:
runtime
(*gcWork).putFast 160 -> 144 (-10.00%)
(*gcWork).tryGetFast 144 -> 128 (-11.11%)
scanobject 784 -> 752 (-4.08%)
greyobject 800 -> 784 (-2.00%)
AMD64:
runtime
greyobject 765 -> 748 (-2.22%)
(*gcWork).tryGetFast 102 -> 85 (-16.67%)
scanobject 837 -> 820 (-2.03%)
(*gcWork).putFast 102 -> 89 (-12.75%)
Change-Id: I6bb508afe1ba416823775c0bfc08ea9dc21de8a3
Reviewed-on: https://go-review.googlesource.com/c/go/+/393754
Reviewed-by: Austin Clements <austin@google.com>
Reviewed-by: Robert Griesemer <gri@golang.org>
Trust: Michael Knyszek <mknyszek@google.com>
|
|
This change implements the GC pacer redesign outlined in #44167 and the
accompanying design document, behind a GOEXPERIMENT flag that is on by
default.
In addition to adding the new pacer, this CL also includes code to track
and account for stack and globals scan work in the pacer and in the
assist credit system.
The new pacer also deviates slightly from the document in that it
increases the bound on the minimum trigger ratio from 0.6 (scaled by
GOGC) to 0.7. The logic behind this change is that the new pacer much
more consistently hits the goal (good!) leading to slightly less
frequent GC cycles, but _longer_ ones (in this case, bad!). It turns out
that the cost of having the GC on hurts throughput significantly (per
byte of memory used), though tail latencies can improve by up to 10%! To
be conservative, this change moves the value to 0.7 where there is a
small improvement to both throughput and latency, given the memory use.
Because the new pacer accounts for the two most significant sources of
scan work after heap objects, it is now also safer to reduce the minimum
heap size without leading to very poor amortization. This change thus
decreases the minimum heap size to 512 KiB, which corresponds to the
fact that the runtime has around 200 KiB of scannable globals always
there, up-front, providing a baseline.
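For concreteness, a sketch of the minimum-trigger bound in the terms
used above (formula assumed from this description, not the pacer's
actual code):

    package sketch

    // boundedTrigger clamps the trigger ratio to the minimum bound
    // (0.7, scaled by GOGC) and converts it to a heap trigger in
    // bytes, assuming:
    //
    //     goal    = heapMarked * (1 + GOGC/100)
    //     trigger = heapMarked * (1 + ratio)
    func boundedTrigger(heapMarked uint64, gogc int, ratio float64) uint64 {
        if min := 0.7 * float64(gogc) / 100; ratio < min {
            ratio = min
        }
        return uint64(float64(heapMarked) * (1 + ratio))
    }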
Benchmark results: https://perf.golang.org/search?q=upload:20211001.6
tile38's KNearest benchmark shows a memory increase, but throughput (and
latency) per byte of memory used is better.
gopher-lua showed an increase in both CPU time and memory usage, but
subsequent attempts to reproduce this behavior are inconsistent.
Sometimes the overall performance is better, sometimes it's worse. This
suggests that the benchmark is fairly noisy in a way not captured by the
benchmarking framework itself.
biogo-igor is the only benchmark to show a significant performance loss.
This benchmark exhibits a very high GC rate, with relatively little work
to do in each cycle. The idle mark workers are quite active. In the new
pacer, mark phases are longer, mark assists are fewer, and some of that
time in mark assists has shifted to idle workers. Linux perf indicates
that the difference in CPU time can be mostly attributed to write-barrier
slow path related calls, which in turn indicates that the write barrier
being on for longer is the primary culprit. This also explains the memory
increase, as a longer mark phase leads to more memory allocated black,
surviving an extra cycle and contributing to the heap goal.
For #44167.
Change-Id: I8ac7cfef7d593e4a642c9b2be43fb3591a8ec9c4
Reviewed-on: https://go-review.googlesource.com/c/go/+/309869
Trust: Michael Knyszek <mknyszek@google.com>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
|
|
[git-generate]
cd src/runtime
goimports -w *.go
Change-Id: I1387af0f2fd1a213dc2f4c122e83a8db0fcb15f0
Reviewed-on: https://go-review.googlesource.com/c/go/+/329189
Trust: Michael Knyszek <mknyszek@google.com>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
|
|
internal/goarch.PtrSize [generated]
[git-generate]
cd src/runtime/internal/math
gofmt -w -r "sys.PtrSize -> goarch.PtrSize" .
goimports -w *.go
cd ../..
gofmt -w -r "sys.PtrSize -> goarch.PtrSize" .
goimports -w *.go
Change-Id: I43491cdd54d2e06d4d04152b3d213851b7d6d423
Reviewed-on: https://go-review.googlesource.com/c/go/+/328337
Trust: Michael Knyszek <mknyszek@google.com>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
|
|
Change-Id: I31f2081eb7c30a9583f479f9194e636fe721b9b3
GitHub-Last-Rev: d09f5fbdc5785dc3963b22ad75309740e0de258e
GitHub-Pull-Request: golang/go#45278
Reviewed-on: https://go-review.googlesource.com/c/go/+/305231
Reviewed-by: Emmanuel Odeke <emmanuel@orijtech.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Run-TryBot: Emmanuel Odeke <emmanuel@orijtech.com>
TryBot-Result: Go Bot <gobot@golang.org>
|
|
This change modifies mheap's span allocation API to have each caller
declare a purpose, defined as a new enum called spanAllocType.
The purpose behind this change is two-fold:
1. Tight control over who gets to allocate heap memory is, generally
speaking, a good thing. Every codepath that allocates heap memory
places additional implicit restrictions on the allocator. A notable
example of a restriction is work bufs coming from heap memory: write
barriers are not allowed in allocation paths because then we could
have a situation where the allocator calls into the allocator.
2. Memory statistic updating is explicit. Instead of passing an opaque
pointer for statistic updating, which places restrictions on how that
statistic may be updated, we use the spanAllocType to determine which
statistic to update and how.
We also take this opportunity to group all the statistic updating code
together, which should make the accounting code a little easier to
follow.
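For illustration, a sketch of the shape of such an API (the enum
values and statistics here are assumed, not the runtime's actual
ones):

    package sketch

    // spanAllocType declares the purpose of a span allocation so the
    // allocator knows which statistic to update, and how.
    type spanAllocType uint8

    const (
        spanAllocHeap    spanAllocType = iota // user Go heap
        spanAllocStack                        // goroutine stacks
        spanAllocWorkBuf                      // GC work buffers
    )

    type heapStats struct {
        heapInUse, stackInUse, workBufInUse uint64
    }

    // accountAlloc keeps all the statistic updates in one place.
    func (s *heapStats) accountAlloc(typ spanAllocType, bytes uint64) {
        switch typ {
        case spanAllocHeap:
            s.heapInUse += bytes
        case spanAllocStack:
            s.stackInUse += bytes
        case spanAllocWorkBuf:
            s.workBufInUse += bytes
        }
    }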
Change-Id: Ic0b0898959ba2a776f67122f0e36c9d7d60e3085
Reviewed-on: https://go-review.googlesource.com/c/go/+/246970
Trust: Michael Knyszek <mknyszek@google.com>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
|
|
debugCachedWork and all of its dependent fields and code were added to
aid in debugging issue #27993. Now that the source of the problem is
known and mitigated (via the extra work check after STW in gcMarkDone),
these extra checks are no longer required and simply make the code more
difficult to follow.
Remove it all.
Updates #27993
Change-Id: I594beedd5ca61733ba9cc9eaad8f80ea92df1a0d
Reviewed-on: https://go-review.googlesource.com/c/go/+/262350
Trust: Michael Pratt <mpratt@google.com>
Run-TryBot: Michael Pratt <mpratt@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
|
|
I took some of the infrastructure from Austin's lock logging CR
https://go-review.googlesource.com/c/go/+/192704 (with deadlock
detection from the logs), and developed a setup to give static lock
ranking for runtime locks.
Static lock ranking establishes a documented total ordering among locks,
and then reports an error if the total order is violated. This can
happen if a deadlock happens (by acquiring a sequence of locks in
different orders), or if just one side of a possible deadlock happens.
Lock ordering deadlocks cannot happen as long as the lock ordering is
followed.
Along the way, I found a deadlock involving the new timer code, which Ian fixed
via https://go-review.googlesource.com/c/go/+/207348, as well as two other
potential deadlocks.
See the constants at the top of runtime/lockrank.go to show the static
lock ranking that I ended up with, along with some comments. This is
great documentation of the current intended lock ordering when acquiring
multiple locks in the runtime.
I also added an array lockPartialOrder[] which shows and enforces the
current partial ordering among locks (which is embedded within the total
ordering). This is more specific about the dependencies among locks.
I don't try to check the ranking within a lock class with multiple locks
that can be acquired at the same time (i.e. check the ranking when
multiple hchan locks are acquired).
Currently, I am doing a lockInit() call to set the lock rank of most
locks. Any lock that is not otherwise initialized is assumed to be a
leaf lock (a very high rank lock), so that eliminates the need to do
anything for a bunch of locks (including all architecture-dependent
locks). For two locks, root.lock and notifyList.lock (only in the
runtime/sema.go file), it is not as easy to do lock initialization, so
instead, I am passing the lock rank with the lock calls.
For Windows compilation, I needed to increase the StackGuard size from
896 to 928 because of the new lock-rank checking functions.
Checking of the static lock ranking is enabled by setting
GOEXPERIMENT=staticlockranking before doing a run.
To make sure that the static lock ranking code has no overhead in memory
or CPU when not enabled by GOEXPERIMENT, I changed 'go build/install' so
that it defines a build tag (with the same name) whenever any experiment
has been baked into the toolchain (by checking Expstring()). This allows
me to avoid increasing the size of the 'mutex' type when static lock
ranking is not enabled.
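For illustration, a minimal sketch of rank checking on acquire (ranks
and names invented for the example; the real ordering lives in
runtime/lockrank.go):

    package sketch

    type lockRank int

    const (
        lockRankSched lockRank = iota // example ranks only
        lockRankHchan
        lockRankLeaf lockRank = 1000 // default for uninitialized locks
    )

    type mutex struct {
        rank lockRank
    }

    // held records the ranks of locks the current thread holds
    // (per-M in the real runtime).
    var held []lockRank

    // lockWithRank enforces the total order: a lock may only be
    // acquired after locks of strictly lower rank.
    func lockWithRank(l *mutex, rank lockRank) {
        for _, r := range held {
            if r >= rank {
                panic("lock ordering violation")
            }
        }
        held = append(held, rank)
        acquire(l)
    }

    func acquire(*mutex) {} // placeholder for the real lock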
Fixes #38029
Change-Id: I154217ff307c47051f8dae9c2a03b53081acd83a
Reviewed-on: https://go-review.googlesource.com/c/go/+/207619
Reviewed-by: Dan Scales <danscales@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
Run-TryBot: Dan Scales <danscales@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
|
|
We check whether an M is preemptible in a surprising number of places.
Put it in one function.
For #10958, #24543.
Change-Id: I305090fdb1ea7f7a55ffe25851c1e35012d0d06c
Reviewed-on: https://go-review.googlesource.com/c/go/+/201439
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
|
|
Currently it's possible for the runtime to deadlock if checkPut is
called in a non-preemptible context. In this case, checkPut may spin,
so it won't leave the non-preemptible context, but the thread running
gcMarkDone needs to preempt all of the goroutines before it can
release the checkPut spin loops.
Fix this by returning from checkPut if it's called under any of the
conditions that would prevent gcMarkDone from preempting it. In this
case, it leaves a note behind that this happened; if the runtime does
later detect left-over work it can at least indicate that it was
unable to catch it in the act.
For #27993.
Updates #29385 (may fix it).
Change-Id: Ic71c10701229febb4ddf8c104fb10e06d84b122e
Reviewed-on: https://go-review.googlesource.com/c/156017
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
|
|
This captures the stack trace where mark completion observed that each
P had no work, and then dumps this if that P later discovers more
work. Hopefully this will help bound where the work was created.
For #27993.
Change-Id: I4f29202880d22c433482dc1463fb50ab693b6de6
Reviewed-on: https://go-review.googlesource.com/c/154599
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
|
|
In order to further diagnose #27993, I need to see exactly what
pointers are being added to the gcWork buffer too late.
Change-Id: I8d92113426ffbc6e55d819c39e7ab5eafa68668d
Reviewed-on: https://go-review.googlesource.com/c/152957
Run-TryBot: Austin Clements <austin@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
|
|
This adds several new checks to help debug #27993. It adds a mechanism
for freezing write barriers and gcWork puts during the mark completion
algorithm. This way, if we do detect mark completion, we can catch any
puts that happened during the completion algorithm. Based on build
dashboard failures, this seems to be the window of time when these are
happening.
This also double-checks that all work buffers are empty immediately
upon entering mark termination (much earlier than the current check).
This is unlikely to trigger based on the current failures, but is a
good safety net.
Change-Id: I03f56c48c4322069e28c50fbc3c15b2fee2130c2
Reviewed-on: https://go-review.googlesource.com/c/151797
Run-TryBot: Austin Clements <austin@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
|
|
This adds a debug check to throw immediately if any pointers are added
to the gcWork buffer after the mark completion barrier. The intent is
to catch the source of the cached GC work that occasionally produces
"P has cached GC work at end of mark termination" failures.
The result should be that we get "throwOnGCWork" throws instead of "P
has cached GC work at end of mark termination" throws, but with useful
stack traces.
This should be reverted before the release. I've been unable to
reproduce this issue locally, but this issue appears fairly regularly
on the builders, so the intent is to catch it on the builders.
This probably slows down the GC slightly.
For #27993.
Change-Id: I5035e14058ad313bfbd3d68c41ec05179147a85c
Reviewed-on: https://go-review.googlesource.com/c/149969
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
|
|
Go documentation style for boolean funcs is to say:

    // Foo reports whether ...
    func Foo() bool

(rather than "returns true if")
This CL also replaces 4 uses of "iff" with the same "reports whether"
wording, which doesn't lose any meaning, and will prevent people from
sending typo fixes when they don't realize it's "if and only if". In
the past I think we've had the typo CLs updated to just say "reports
whether". So do them all at once.
(Inspired by the addition of another "returns true if" in CL 146938
in fd_plan9.go)
Created with:
$ perl -i -npe 's/returns true iff/reports whether/' $(git grep -l "returns true iff" | grep -v vendor)
$ perl -i -npe 's/returns true if/reports whether/' $(git grep -l "returns true if" | grep -v vendor)
Change-Id: Ided502237f5ab0d25cb625dbab12529c361a8b9f
Reviewed-on: https://go-review.googlesource.com/c/147037
Reviewed-by: Ian Lance Taylor <iant@golang.org>
|
|
Now work.helperDrainBlock is always false, so we can remove it and
code paths that only ran when it was true. That means we no longer use
the gcDrainBlock mode of gcDrain, so we can eliminate that. That means
we no longer use gcWork.get, so we can eliminate that. That means we
no longer use getfull, so we can eliminate that.
Updates #26903. This is a follow-up to unifying STW GC and concurrent GC.
Change-Id: I8dbcf8ce24861df0a6149e0b7c5cd0eadb5c13f6
Reviewed-on: https://go-review.googlesource.com/c/134782
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
|
|
Now that there is no mark 2 phase, gcBlackenPromptly is no longer
used.
Updates #26903. This is a follow-up to eliminating mark 2.
Change-Id: Ib9c534f21b36b8416fcf3cab667f186167b827f8
Reviewed-on: https://go-review.googlesource.com/c/134319
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
|
|
Nothing currently consumes the flag, but we'll use it in the
distributed termination detection algorithm.
Updates #26903. This is preparation for eliminating mark 2.
Change-Id: I5e149a05b1c878fe1009150da21f8bd8ae2b9b6a
Reviewed-on: https://go-review.googlesource.com/c/134317
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
|
|
Change-Id: Ic8c506289caaf6218494e5150d10002e0232feaa
Reviewed-on: https://go-review.googlesource.com/85876
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
|
|
This implements runtime support for buffered write barriers on amd64.
The buffered write barrier has a fast path that simply enqueues
pointers in a per-P buffer. Unlike the current write barrier, this
fast path is *not* a normal Go call and does not require the compiler
to spill general-purpose registers or put arguments on the stack. When
the buffer fills up, the write barrier takes the slow path, which
spills all general purpose registers and flushes the buffer. We don't
allow safe-points or stack splits while this frame is active, so it
doesn't matter that we have no type information for the spilled
registers in this frame.
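For illustration, a sketch of the per-P buffer's fast/slow split
(field and function names assumed; the real fast path is hand-built
code, not a normal Go call):

    package sketch

    // wbBuf is a per-P buffer of pointers enqueued by write barriers.
    // end is normally len(buf); cgocheck=2 mode sets it to 1 so every
    // barrier takes the slow path.
    type wbBuf struct {
        next, end int
        buf       [512]uintptr
    }

    func (b *wbBuf) writeBarrier(ptr uintptr) {
        if b.next < b.end {
            b.buf[b.next] = ptr // fast path: just enqueue
            b.next++
            return
        }
        b.flush() // slow path: spill registers, drain into GC work
        b.buf[b.next] = ptr
        b.next++
    }

    func (b *wbBuf) flush() {
        // ... grey/queue b.buf[:b.next] ...
        b.next = 0
    }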
One minor complication is cgocheck=2 mode, which uses the write
barrier to detect Go pointers being written to non-Go memory. We
obviously can't buffer this, so instead we set the buffer to its
minimum size, forcing the write barrier into the slow path on every
call. For this specific case, we pass additional information as
arguments to the flush function. This also requires enabling the cgo
write barrier slightly later during runtime initialization, after Ps
(and the per-P write barrier buffers) have been initialized.
The code in this CL is not yet active. The next CL will modify the
compiler to generate calls to the new write barrier.
This reduces the average cost of the write barrier by roughly a factor
of 4, which will pay for the cost of having it enabled more of the
time after we make the GC pacer less aggressive. (Benchmarks will be
in the next CL.)
Updates #14951.
Updates #22460.
Change-Id: I396b5b0e2c5e5c4acfd761a3235fd15abadc6cb1
Reviewed-on: https://go-review.googlesource.com/73711
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
|
|
Currently most of these are marked go:nowritebarrier as a hint, but
it's actually important that these not invoke write barriers
recursively. The danger is that some gcWork method would invoke the
write barrier while the gcWork is in an inconsistent state and that
the write barrier would in turn invoke some other gcWork method, which
would crash or permanently corrupt the gcWork. Simply marking the
write barrier itself as go:nowritebarrierrec isn't sufficient to
prevent this if the write barrier doesn't use the outer method.
Thankfully, this doesn't cause any build failures, so we were getting
this right. :)
For #22460.
Change-Id: I35a7292a584200eb35a49507cd3fe359ba2206f6
Reviewed-on: https://go-review.googlesource.com/72554
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
|
|
This extends the sweeper to free workbufs back to the heap between GC
cycles, allowing this memory to be reused for GC'd allocations or
eventually returned to the OS.
This helps for applications that have high peak heap usage relative to
their regular heap usage (for example, a high-memory initialization
phase). Workbuf memory is roughly proportional to heap size and since
we currently never free workbufs, it's proportional to *peak* heap
size. By freeing workbufs, we can release and reuse this memory for
other purposes when the heap shrinks.
This is somewhat complicated because this costs ~1–2 µs per workbuf
span, so for large heaps it's too expensive to just do synchronously
after mark termination between starting the world and dropping the
worldsema. Hence, we do it asynchronously in the sweeper. This adds a
list of "free" workbuf spans that can be returned to the heap. GC
moves all workbuf spans to this list after mark termination and the
background sweeper drains this list back to the heap. If the sweeper
doesn't finish, that's fine, since getempty can directly reuse any
remaining spans to allocate more workbufs.
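A minimal sketch of that handoff (names assumed): mark termination
fills a free list, and the background sweeper drains it back to the
heap a batch at a time.

    package sketch

    import "sync"

    type span struct{ next *span }

    // wbufSpans.free holds workbuf spans released at mark termination
    // that the sweeper has not yet returned to the heap.
    var wbufSpans struct {
        mu   sync.Mutex
        free *span
    }

    // freeSomeWbufs is called repeatedly by the background sweeper.
    // It frees a small batch (each span costs ~1-2 µs) and reports
    // whether spans remain, so the caller can yield between batches.
    func freeSomeWbufs(batch int) bool {
        wbufSpans.mu.Lock()
        defer wbufSpans.mu.Unlock()
        for i := 0; i < batch && wbufSpans.free != nil; i++ {
            s := wbufSpans.free
            wbufSpans.free = s.next
            releaseToHeap(s)
        }
        return wbufSpans.free != nil
    }

    func releaseToHeap(*span) {} // placeholder for mheap freeing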
Performance impact is negligible. On the x/benchmarks, this reduces
GC-bytes-from-system by 6–11%.
Fixes #19325.
Change-Id: Icb92da2196f0c39ee984faf92d52f29fd9ded7a8
Reviewed-on: https://go-review.googlesource.com/38582
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
|
|
Currently the runtime allocates workbufs from persistent memory, which
means they can never be freed.
Switch to allocating them from manually-managed heap spans. This
doesn't free them yet, but it puts us in a position to do so.
For #19325.
Change-Id: I94b2512a2f2bbbb456cd9347761b9412e80d2da9
Reviewed-on: https://go-review.googlesource.com/38581
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
|
|
The lfstack API is still a C-style API: lfstacks all have unhelpful
type uint64 and the APIs are package-level functions. Make the code
more readable and Go-style by creating an lfstack type with methods
for push, pop, and empty.
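A minimal sketch of such a type (this version uses atomic.Pointer for
clarity; the real lfstack packs a pointer and a counter into a uint64
to defeat ABA races):

    package sketch

    import "sync/atomic"

    type lfnode struct {
        next *lfnode
    }

    // lfstack replaces package-level functions on a bare uint64 with
    // methods on a named type.
    type lfstack struct {
        top atomic.Pointer[lfnode]
    }

    func (s *lfstack) push(n *lfnode) {
        for {
            old := s.top.Load()
            n.next = old
            if s.top.CompareAndSwap(old, n) {
                return
            }
        }
    }

    func (s *lfstack) pop() *lfnode {
        for {
            old := s.top.Load()
            if old == nil || s.top.CompareAndSwap(old, old.next) {
                return old
            }
        }
    }

    func (s *lfstack) empty() bool { return s.top.Load() == nil }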
Change-Id: I64685fa3be0e82ae2d1a782a452a50974440a827
Reviewed-on: https://go-review.googlesource.com/38290
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
|
|
The gcstats structure is no longer consumed by anything and no longer
tracks statistics that are particularly relevant to the concurrent
garbage collector. Remove it. (Having statistics is probably a good
idea, but these aren't the stats we need these days and we don't have
a way to get them out of the runtime.)
In preparation for #13613.
Change-Id: Ib63e2f9067850668f9dcbfd4ed89aab4a6622c3f
Reviewed-on: https://go-review.googlesource.com/34936
Run-TryBot: Austin Clements <austin@google.com>
Reviewed-by: Rick Hudson <rlh@golang.org>
|
|
Since workbuf is now marked go:notinheap, the write barrier-preventing
wrapper type wbufptr is no longer necessary. Remove it.
Change-Id: I3e5b5803a1547d65de1c1a9c22458a38e08549b7
Reviewed-on: https://go-review.googlesource.com/35971
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
|
|
If the scheduler has no user work and there's no GC work visible, it
puts the P to sleep (or blocks on the network). However, if we later
enqueue more GC work, there's currently nothing that specifically
wakes up the scheduler to let it start an idle GC worker. As a result,
we can underutilize the CPU during GC if Ps have been put to sleep.
Fix this by making GC wake idle Ps when work buffers are put on the
full list. We already have a hook to do this, since we use this to
preempt a random P if we need more dedicated workers. We expand this
hook to instead wake an idle P if there is one. The logic we use for
this is identical to the logic used to wake an idle P when we ready a
goroutine.
To make this really sound, we also fix the scheduler to re-check the
idle GC worker condition after releasing its P. This closes a race
where 1) the scheduler checks for idle work and finds none, 2) new
work is enqueued but there are no idle Ps so none are woken, and 3)
the scheduler releases its P.
There is one subtlety here. Currently we call enlistWorker directly
from putfull, but the gcWork is in an inconsistent state in the places
that call putfull. This isn't a problem right now because nothing that
enlistWorker does touches the gcWork, but with the added call to
wakep, it's possible to get a recursive call into the gcWork
(specifically, while write barriers are disallowed, this can do an
allocation, which can dispose a gcWork, which can put a workbuf). To
handle this, we lift the enlistWorker calls up a layer and delay them
until the gcWork is in a consistent state.
Fixes #14179.
Change-Id: Ia2467a52e54c9688c3c1752e1fc00f5b37bbfeeb
Reviewed-on: https://go-review.googlesource.com/32434
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
|
|
This covers basically all sysAlloc'd, persistentalloc'd, and
fixalloc'd types.
Change-Id: I0487c887c2a0ade5e33d4c4c12d837e97468e66b
Reviewed-on: https://go-review.googlesource.com/30941
Reviewed-by: Rick Hudson <rlh@golang.org>
|
|
Currently the time spent in scanobject is proportional to the size of
the object being scanned. Since scanobject is non-preemptible, large
objects can cause significant goroutine (and even whole application)
delays through several means:
1. If a GC assist picks up a large object, the allocating goroutine is
blocked for the whole scan, even if that scan well exceeds that
goroutine's debt.
2. Since the scheduler does not run on the P performing a large object
scan, goroutines in that P's run queue do not run unless they are
stolen by another P (which can take some time). If there are a few
large objects, all of the Ps may get tied up so the scheduler
doesn't run anywhere.
3. Even if a large object is scanned by a background worker and other
Ps are still running the scheduler, the large object scan doesn't
flush background credit until the whole scan is done. This can
easily cause all allocations to block in assists, waiting for
credit, causing an effective STW.
Fix this by splitting large objects into 128 KB "oblets" and scanning
at most one oblet at a time. Since we can scan 1–2 MB/ms, this equates
to bounding scanobject at roughly 100 µs. This improves assist
behavior both because assists can no longer get "unlucky" and be stuck
scanning a large object, and because it causes the background worker
to flush credit and unblock assists more frequently when scanning
large objects. This also improves GC parallelism if the heap consists
primarily of a small number of very large objects by letting multiple
workers scan a large object in parallel.
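A minimal sketch of the splitting (names assumed): scan at most one
oblet now and, on the first visit to a large object, queue the rest
as separate work items.

    package sketch

    const obletBytes = 128 << 10 // 128 KB

    // scanOblet scans one oblet starting at b within the object
    // [base, base+size). On the first visit (b == base) it enqueues
    // the remaining oblets so other workers can scan them in parallel.
    func scanOblet(b, base, size uintptr,
        enqueue func(oblet uintptr), scanRange func(lo, hi uintptr)) {

        end := base + size
        if b == base && size > obletBytes {
            for off := base + obletBytes; off < end; off += obletBytes {
                enqueue(off)
            }
        }
        hi := b + obletBytes
        if hi > end {
            hi = end
        }
        scanRange(b, hi)
    }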
Fixes #10345. Fixes #16293.
This substantially improves goroutine latency in the benchmark from
issue #16293, which exercises several forms of very large objects:
name old max-latency new max-latency delta
SliceNoPointer-12 154µs ± 1% 155µs ± 2% ~ (p=0.087 n=13+12)
SlicePointer-12 314ms ± 1% 5.94ms ±138% -98.11% (p=0.000 n=19+20)
SliceLivePointer-12 1148ms ± 0% 4.72ms ±167% -99.59% (p=0.000 n=19+20)
MapNoPointer-12 72509µs ± 1% 408µs ±325% -99.44% (p=0.000 n=19+18)
ChanPointer-12 313ms ± 0% 4.74ms ±140% -98.49% (p=0.000 n=18+20)
ChanLivePointer-12 1147ms ± 0% 3.30ms ±149% -99.71% (p=0.000 n=19+20)
name old P99.9-latency new P99.9-latency delta
SliceNoPointer-12 113µs ±25% 107µs ±12% ~ (p=0.153 n=20+18)
SlicePointer-12 309450µs ± 0% 133µs ±23% -99.96% (p=0.000 n=20+20)
SliceLivePointer-12 961ms ± 0% 1.35ms ±27% -99.86% (p=0.000 n=20+20)
MapNoPointer-12 448µs ±288% 119µs ±18% -73.34% (p=0.000 n=18+20)
ChanPointer-12 309450µs ± 0% 134µs ±23% -99.96% (p=0.000 n=20+19)
ChanLivePointer-12 961ms ± 0% 1.35ms ±27% -99.86% (p=0.000 n=20+20)
This has negligible effect on all metrics from the garbage, JSON, and
HTTP x/benchmarks.
It shows slight improvement on some of the go1 benchmarks,
particularly Revcomp, which uses some multi-megabyte buffers:
name old time/op new time/op delta
BinaryTree17-12 2.46s ± 1% 2.47s ± 1% +0.32% (p=0.012 n=20+20)
Fannkuch11-12 2.82s ± 0% 2.81s ± 0% -0.61% (p=0.000 n=17+20)
FmtFprintfEmpty-12 50.8ns ± 5% 50.5ns ± 2% ~ (p=0.197 n=17+19)
FmtFprintfString-12 131ns ± 1% 132ns ± 0% +0.57% (p=0.000 n=20+16)
FmtFprintfInt-12 117ns ± 0% 116ns ± 0% -0.47% (p=0.000 n=15+20)
FmtFprintfIntInt-12 180ns ± 0% 179ns ± 1% -0.78% (p=0.000 n=16+20)
FmtFprintfPrefixedInt-12 186ns ± 1% 185ns ± 1% -0.55% (p=0.000 n=19+20)
FmtFprintfFloat-12 263ns ± 1% 271ns ± 0% +2.84% (p=0.000 n=18+20)
FmtManyArgs-12 741ns ± 1% 742ns ± 1% ~ (p=0.190 n=19+19)
GobDecode-12 7.44ms ± 0% 7.35ms ± 1% -1.21% (p=0.000 n=20+20)
GobEncode-12 6.22ms ± 1% 6.21ms ± 1% ~ (p=0.336 n=20+19)
Gzip-12 220ms ± 1% 219ms ± 1% ~ (p=0.130 n=19+19)
Gunzip-12 37.9ms ± 0% 37.9ms ± 1% ~ (p=1.000 n=20+19)
HTTPClientServer-12 82.5µs ± 3% 82.6µs ± 3% ~ (p=0.776 n=20+19)
JSONEncode-12 16.4ms ± 1% 16.5ms ± 2% +0.49% (p=0.003 n=18+19)
JSONDecode-12 53.7ms ± 1% 54.1ms ± 1% +0.71% (p=0.000 n=19+18)
Mandelbrot200-12 4.19ms ± 1% 4.20ms ± 1% ~ (p=0.452 n=19+19)
GoParse-12 3.38ms ± 1% 3.37ms ± 1% ~ (p=0.123 n=19+19)
RegexpMatchEasy0_32-12 72.1ns ± 1% 71.8ns ± 1% ~ (p=0.397 n=19+17)
RegexpMatchEasy0_1K-12 242ns ± 0% 242ns ± 0% ~ (p=0.168 n=17+20)
RegexpMatchEasy1_32-12 72.1ns ± 1% 72.1ns ± 1% ~ (p=0.538 n=18+19)
RegexpMatchEasy1_1K-12 385ns ± 1% 384ns ± 1% ~ (p=0.388 n=20+20)
RegexpMatchMedium_32-12 112ns ± 1% 112ns ± 3% ~ (p=0.539 n=20+20)
RegexpMatchMedium_1K-12 34.4µs ± 2% 34.4µs ± 2% ~ (p=0.628 n=18+18)
RegexpMatchHard_32-12 1.80µs ± 1% 1.80µs ± 1% ~ (p=0.522 n=18+19)
RegexpMatchHard_1K-12 54.0µs ± 1% 54.1µs ± 1% ~ (p=0.647 n=20+19)
Revcomp-12 387ms ± 1% 369ms ± 5% -4.89% (p=0.000 n=17+19)
Template-12 62.3ms ± 1% 62.0ms ± 0% -0.48% (p=0.002 n=20+17)
TimeParse-12 314ns ± 1% 314ns ± 0% ~ (p=1.000 n=20+13)
TimeFormat-12 358ns ± 0% 354ns ± 0% -1.12% (p=0.000 n=17+20)
[Geo mean] 53.5µs 53.3µs -0.23%
Change-Id: I2a0a179d1d6bf7875dd054b7693dd12d2a340132
Reviewed-on: https://go-review.googlesource.com/23540
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
|
|
The complexity of the GC work buffer put and tryGet
functions prevented them from being inlined. This CL simplifies
the fast path, thus enabling inlining. If the fast
path does not succeed, the previous put and tryGet
functions are called.
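A minimal sketch of the inlinable fast paths (types simplified): each
does one nil check and one bounds check, and reports failure so the
caller can fall back to the original slow function.

    package sketch

    type workbuf struct {
        nobj int
        obj  [253]uintptr
    }

    type gcWork struct {
        wbuf1 *workbuf
    }

    func (w *gcWork) putFast(obj uintptr) bool {
        wbuf := w.wbuf1
        if wbuf == nil || wbuf.nobj == len(wbuf.obj) {
            return false // caller calls the full put
        }
        wbuf.obj[wbuf.nobj] = obj
        wbuf.nobj++
        return true
    }

    func (w *gcWork) tryGetFast() uintptr {
        wbuf := w.wbuf1
        if wbuf == nil || wbuf.nobj == 0 {
            return 0 // caller calls the full tryGet
        }
        wbuf.nobj--
        return wbuf.obj[wbuf.nobj]
    }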
Change-Id: I6da6495d0dadf42bd0377c110b502274cc01acf5
Reviewed-on: https://go-review.googlesource.com/20704
Reviewed-by: Austin Clements <austin@google.com>
|
|
The tree's pretty inconsistent about single space vs double space
after a period in documentation. Make it consistently a single space,
per earlier decisions. This means contributors won't be confused by
misleading precedent.
This CL doesn't use go/doc to parse. It only addresses // comments.
It was generated with:
$ perl -i -npe 's,^(\s*// .+[a-z]\.) +([A-Z]),$1 $2,' $(git grep -l -E '^\s*//(.+\.) +([A-Z])')
$ go test go/doc -update
Change-Id: Iccdb99c37c797ef1f804a94b22ba5ee4b500c4f7
Reviewed-on: https://go-review.googlesource.com/20022
Reviewed-by: Rob Pike <r@golang.org>
Reviewed-by: Dave Day <djd@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
|
|
Early in Go 1.5 we had bugs with ownership of workbufs, so we added a
system for tracing their ownership to help debug these issues.
However, this system has both CPU and space overhead even when
disabled, it clutters up the workbuf API, the higher level gcWork
abstraction makes it very difficult to mess up the ownership of
workbufs in practice, and the tracing hasn't been enabled or needed
since 5b66e5d nine months ago. Hence, remove it.
Benchmarks show the usual noise from changes at this level, but little
overall movement.
name old time/op new time/op delta
XBenchGarbage-12 2.48ms ± 1% 2.47ms ± 0% -0.68% (p=0.000 n=21+21)
name old time/op new time/op delta
BinaryTree17-12 2.98s ± 7% 2.98s ± 6% ~ (p=0.799 n=20+20)
Fannkuch11-12 2.61s ± 3% 2.55s ± 5% -2.55% (p=0.003 n=20+20)
FmtFprintfEmpty-12 52.8ns ± 6% 53.6ns ± 6% ~ (p=0.228 n=20+20)
FmtFprintfString-12 177ns ± 4% 177ns ± 4% ~ (p=0.280 n=20+20)
FmtFprintfInt-12 162ns ± 5% 162ns ± 3% ~ (p=0.347 n=20+20)
FmtFprintfIntInt-12 277ns ± 7% 273ns ± 4% -1.62% (p=0.005 n=20+20)
FmtFprintfPrefixedInt-12 237ns ± 4% 242ns ± 4% +2.13% (p=0.005 n=20+20)
FmtFprintfFloat-12 315ns ± 4% 312ns ± 4% -0.97% (p=0.001 n=20+20)
FmtManyArgs-12 1.11µs ± 3% 1.15µs ± 4% +3.41% (p=0.004 n=20+20)
GobDecode-12 8.50ms ± 7% 8.53ms ± 7% ~ (p=0.429 n=20+20)
GobEncode-12 6.86ms ± 9% 6.93ms ± 7% +0.93% (p=0.030 n=20+20)
Gzip-12 326ms ± 4% 329ms ± 4% +0.98% (p=0.020 n=20+20)
Gunzip-12 43.3ms ± 3% 43.8ms ± 9% +1.25% (p=0.003 n=20+20)
HTTPClientServer-12 72.0µs ± 3% 71.5µs ± 3% ~ (p=0.053 n=20+20)
JSONEncode-12 17.0ms ± 6% 17.3ms ± 7% +1.32% (p=0.006 n=20+20)
JSONDecode-12 64.2ms ± 4% 63.5ms ± 3% -1.05% (p=0.005 n=20+20)
Mandelbrot200-12 4.00ms ± 3% 3.99ms ± 3% ~ (p=0.121 n=20+20)
GoParse-12 3.74ms ± 5% 3.75ms ± 9% ~ (p=0.383 n=20+20)
RegexpMatchEasy0_32-12 104ns ± 4% 104ns ± 6% ~ (p=0.392 n=20+20)
RegexpMatchEasy0_1K-12 358ns ± 3% 361ns ± 4% +0.95% (p=0.003 n=20+20)
RegexpMatchEasy1_32-12 86.3ns ± 5% 86.1ns ± 6% ~ (p=0.614 n=20+20)
RegexpMatchEasy1_1K-12 523ns ± 4% 518ns ± 3% -1.14% (p=0.008 n=20+20)
RegexpMatchMedium_32-12 137ns ± 3% 134ns ± 4% -1.90% (p=0.005 n=20+20)
RegexpMatchMedium_1K-12 41.0µs ± 3% 40.6µs ± 4% -1.11% (p=0.004 n=20+20)
RegexpMatchHard_32-12 2.13µs ± 4% 2.11µs ± 5% -1.31% (p=0.014 n=20+20)
RegexpMatchHard_1K-12 64.1µs ± 3% 63.2µs ± 5% -1.38% (p=0.005 n=20+20)
Revcomp-12 555ms ±10% 548ms ± 7% -1.17% (p=0.011 n=20+20)
Template-12 84.2ms ± 5% 88.2ms ± 4% +4.73% (p=0.000 n=20+20)
TimeParse-12 365ns ± 4% 371ns ± 5% +1.77% (p=0.002 n=20+20)
TimeFormat-12 361ns ± 4% 365ns ± 3% +1.08% (p=0.002 n=20+20)
[Geo mean] 64.7µs 64.8µs +0.19%
Change-Id: Ib043a7a0d18b588b298873d60913d44cd19f3b44
Reviewed-on: https://go-review.googlesource.com/19887
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
|
|
Currently most uses of gcWork use the per-P gcWork, but there are two
places that still use a stack-based gcWork. Simplify things by making
these instead use the per-P gcWork.
Change-Id: I712d012cce9dd5757c8541824e9641ac1c2a329c
Reviewed-on: https://go-review.googlesource.com/19636
Reviewed-by: Rick Hudson <rlh@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
|
|
When gcWork was first introduced, the compiler's escape analysis
wasn't good enough to detect that the method receiver didn't escape,
so we had to hack around this.
Now that the compiler can figure out this for itself, remove these
hacks.
Change-Id: I9f73fab721e272410b8b6905b564e7abc03c0dfe
Reviewed-on: https://go-review.googlesource.com/19634
Reviewed-by: Rick Hudson <rlh@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
|
|
Change-Id: Icd06d99c42b8299fd931c7da821e1f418684d913
Reviewed-on: https://go-review.googlesource.com/19829
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
|
|
runtime/internal/sys will hold system-, architecture- and
config-specific constants.
Updates #11647
Change-Id: I6db29c312556087a42e8d2bdd9af40d157c56b54
Reviewed-on: https://go-review.googlesource.com/16817
Reviewed-by: Russ Cox <rsc@golang.org>
|
|
This change breaks out most of the atomics functions in the runtime
into package runtime/internal/atomic. It adds some basic support
in the toolchain for runtime packages, and also modifies linux/arm
atomics to remove the dependency on the runtime's mutex. The mutexes
have been replaced with spinlocks.
all trybots are happy!
In addition to the trybots, I've tested on the darwin/arm64 builder,
on the darwin/arm builder, and on a ppc64le machine.
Change-Id: I6698c8e3cf3834f55ce5824059f44d00dc8e3c2f
Reviewed-on: https://go-review.googlesource.com/14204
Run-TryBot: Michael Matloob <matloob@golang.org>
Reviewed-by: Russ Cox <rsc@golang.org>
|
|
Currently we depend on the good graces and timing of the scheduler to
get opportunities to start dedicated mark workers. In the worst case,
it may take 10ms to get dedicated mark workers going at the beginning
of mark 1 and mark 2 or after the amount of available work has dropped
and gone back up.
Instead of waiting for the regular preemption logic to get around to
us, make putfull enlist a random P if we're not already running enough
dedicated workers. This should improve performance stability of the
garbage collector and is likely to improve the overall performance
somewhat.
No overall effect on the go1 benchmarks. It speeds up the garbage
benchmark by 12%, which more than counters the performance loss from
the previous commit.
name old time/op new time/op delta
XBenchGarbage-12 6.32ms ± 4% 5.58ms ± 2% -11.68% (p=0.000 n=20+16)
name old time/op new time/op delta
BinaryTree17-12 3.18s ± 5% 3.12s ± 4% -1.83% (p=0.021 n=20+20)
Fannkuch11-12 2.50s ± 2% 2.46s ± 2% -1.57% (p=0.000 n=18+19)
FmtFprintfEmpty-12 50.8ns ± 3% 50.4ns ± 3% ~ (p=0.184 n=20+20)
FmtFprintfString-12 167ns ± 2% 171ns ± 1% +2.46% (p=0.000 n=20+19)
FmtFprintfInt-12 161ns ± 2% 163ns ± 2% +1.81% (p=0.000 n=20+20)
FmtFprintfIntInt-12 269ns ± 1% 266ns ± 1% -0.81% (p=0.002 n=19+20)
FmtFprintfPrefixedInt-12 237ns ± 2% 231ns ± 2% -2.86% (p=0.000 n=20+20)
FmtFprintfFloat-12 313ns ± 2% 313ns ± 1% ~ (p=0.681 n=20+20)
FmtManyArgs-12 1.05µs ± 2% 1.03µs ± 1% -2.26% (p=0.000 n=20+20)
GobDecode-12 8.66ms ± 1% 8.67ms ± 1% ~ (p=0.380 n=19+20)
GobEncode-12 6.56ms ± 1% 6.56ms ± 2% ~ (p=0.607 n=19+20)
Gzip-12 317ms ± 1% 314ms ± 2% -1.10% (p=0.000 n=20+19)
Gunzip-12 42.1ms ± 1% 42.2ms ± 1% +0.27% (p=0.044 n=20+19)
HTTPClientServer-12 62.7µs ± 1% 62.0µs ± 1% -1.04% (p=0.000 n=19+18)
JSONEncode-12 16.7ms ± 1% 16.8ms ± 2% +0.59% (p=0.021 n=20+20)
JSONDecode-12 58.2ms ± 1% 61.4ms ± 2% +5.43% (p=0.000 n=18+19)
Mandelbrot200-12 3.84ms ± 1% 3.87ms ± 2% +0.79% (p=0.008 n=18+20)
GoParse-12 3.86ms ± 2% 3.76ms ± 2% -2.60% (p=0.000 n=20+20)
RegexpMatchEasy0_32-12 100ns ± 2% 100ns ± 1% -0.68% (p=0.005 n=18+15)
RegexpMatchEasy0_1K-12 332ns ± 1% 342ns ± 1% +3.16% (p=0.000 n=19+19)
RegexpMatchEasy1_32-12 82.9ns ± 3% 83.0ns ± 2% ~ (p=0.906 n=19+20)
RegexpMatchEasy1_1K-12 487ns ± 1% 494ns ± 1% +1.50% (p=0.000 n=17+20)
RegexpMatchMedium_32-12 131ns ± 2% 130ns ± 1% ~ (p=0.686 n=19+20)
RegexpMatchMedium_1K-12 39.6µs ± 1% 39.2µs ± 1% -1.09% (p=0.000 n=18+19)
RegexpMatchHard_32-12 2.04µs ± 1% 2.04µs ± 2% ~ (p=0.804 n=20+20)
RegexpMatchHard_1K-12 61.7µs ± 2% 61.3µs ± 2% ~ (p=0.052 n=18+20)
Revcomp-12 529ms ± 2% 533ms ± 1% +0.83% (p=0.003 n=20+19)
Template-12 70.7ms ± 2% 71.0ms ± 2% ~ (p=0.065 n=20+19)
TimeParse-12 351ns ± 2% 355ns ± 1% +1.25% (p=0.000 n=19+20)
TimeFormat-12 362ns ± 2% 373ns ± 1% +2.83% (p=0.000 n=18+20)
[Geo mean] 62.2µs 62.3µs +0.13%
name old speed new speed delta
GobDecode-12 88.6MB/s ± 1% 88.5MB/s ± 1% ~ (p=0.392 n=19+20)
GobEncode-12 117MB/s ± 1% 117MB/s ± 1% ~ (p=0.622 n=19+20)
Gzip-12 61.1MB/s ± 1% 61.8MB/s ± 2% +1.11% (p=0.000 n=20+19)
Gunzip-12 461MB/s ± 1% 460MB/s ± 1% -0.27% (p=0.044 n=20+19)
JSONEncode-12 116MB/s ± 1% 115MB/s ± 2% -0.58% (p=0.022 n=20+20)
JSONDecode-12 33.3MB/s ± 1% 31.6MB/s ± 2% -5.15% (p=0.000 n=18+19)
GoParse-12 15.0MB/s ± 2% 15.4MB/s ± 2% +2.66% (p=0.000 n=20+20)
RegexpMatchEasy0_32-12 317MB/s ± 2% 319MB/s ± 2% ~ (p=0.052 n=20+20)
RegexpMatchEasy0_1K-12 3.08GB/s ± 1% 2.99GB/s ± 1% -3.07% (p=0.000 n=19+19)
RegexpMatchEasy1_32-12 386MB/s ± 3% 386MB/s ± 2% ~ (p=0.939 n=19+20)
RegexpMatchEasy1_1K-12 2.10GB/s ± 1% 2.07GB/s ± 1% -1.46% (p=0.000 n=17+20)
RegexpMatchMedium_32-12 7.62MB/s ± 2% 7.64MB/s ± 1% ~ (p=0.702 n=19+20)
RegexpMatchMedium_1K-12 25.9MB/s ± 1% 26.1MB/s ± 2% +0.99% (p=0.000 n=18+20)
RegexpMatchHard_32-12 15.7MB/s ± 1% 15.7MB/s ± 2% ~ (p=0.723 n=20+20)
RegexpMatchHard_1K-12 16.6MB/s ± 2% 16.7MB/s ± 2% ~ (p=0.052 n=18+20)
Revcomp-12 481MB/s ± 2% 477MB/s ± 1% -0.83% (p=0.003 n=20+19)
Template-12 27.5MB/s ± 2% 27.3MB/s ± 2% ~ (p=0.062 n=20+19)
[Geo mean] 99.4MB/s 99.1MB/s -0.35%
Change-Id: I914d8cadded5a230509d118164a4c201601afc06
Reviewed-on: https://go-review.googlesource.com/16298
Reviewed-by: Rick Hudson <rlh@golang.org>
|
|
Currently the gcWork abstraction caches a single work buffer. As a
result, if a worker is putting and getting pointers right at the
boundary of a work buffer, it can flap between work buffers and
(potentially significantly) increase contention on the global work
buffer lists.
This change modifies gcWork to instead cache two work buffers and
switch between them. This introduces one buffer's worth of
hysteresis and eliminates the above performance worst case by
amortizing the cost of getting or putting a work buffer over at least
one buffer's worth of work.
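A minimal sketch of the two-buffer scheme (types simplified): put
swaps in the secondary buffer before touching the global lists, so
flapping at a buffer boundary costs only a swap.

    package sketch

    type workbuf struct {
        nobj int
        obj  [253]uintptr
    }

    type gcWork struct {
        wbuf1, wbuf2 *workbuf
    }

    func (w *gcWork) put(obj uintptr) {
        wbuf := w.wbuf1
        if wbuf.nobj == len(wbuf.obj) {
            // Primary full: try the secondary before the global list.
            w.wbuf1, w.wbuf2 = w.wbuf2, w.wbuf1
            wbuf = w.wbuf1
            if wbuf.nobj == len(wbuf.obj) {
                // Both full: only now pay for the global full list.
                putfull(wbuf)
                wbuf = getempty()
                w.wbuf1 = wbuf
            }
        }
        wbuf.obj[wbuf.nobj] = obj
        wbuf.nobj++
    }

    // Stand-ins for the global work lists.
    func putfull(*workbuf)   {}
    func getempty() *workbuf { return new(workbuf) }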
In practice, it's difficult to trigger this worst case with reasonably
large work buffers. On the garbage benchmark, this reduces the max
writes/sec to the global work list from 32K to 25K and the median from
6K to 5K. However, if a workload were to trigger this worst case
behavior, it could significantly drive up this contention.
This has negligible effects on the go1 benchmarks and slightly speeds
up the garbage benchmark.
name old time/op new time/op delta
XBenchGarbage-12 5.90ms ± 3% 5.83ms ± 4% -1.18% (p=0.011 n=18+18)
name old time/op new time/op delta
BinaryTree17-12 3.22s ± 4% 3.17s ± 3% -1.57% (p=0.009 n=19+20)
Fannkuch11-12 2.44s ± 1% 2.53s ± 4% +3.78% (p=0.000 n=18+19)
FmtFprintfEmpty-12 50.2ns ± 2% 50.5ns ± 5% ~ (p=0.631 n=19+20)
FmtFprintfString-12 167ns ± 1% 166ns ± 1% ~ (p=0.141 n=20+20)
FmtFprintfInt-12 162ns ± 1% 159ns ± 1% -1.80% (p=0.000 n=20+20)
FmtFprintfIntInt-12 277ns ± 2% 263ns ± 1% -4.78% (p=0.000 n=20+18)
FmtFprintfPrefixedInt-12 240ns ± 1% 232ns ± 2% -3.25% (p=0.000 n=20+20)
FmtFprintfFloat-12 311ns ± 1% 315ns ± 2% +1.17% (p=0.000 n=20+20)
FmtManyArgs-12 1.05µs ± 2% 1.03µs ± 2% -1.72% (p=0.000 n=20+20)
GobDecode-12 8.65ms ± 1% 8.71ms ± 2% +0.68% (p=0.001 n=19+20)
GobEncode-12 6.51ms ± 1% 6.54ms ± 1% +0.42% (p=0.047 n=20+19)
Gzip-12 318ms ± 2% 315ms ± 2% -1.20% (p=0.000 n=19+19)
Gunzip-12 42.2ms ± 2% 42.1ms ± 1% ~ (p=0.667 n=20+19)
HTTPClientServer-12 62.5µs ± 1% 62.4µs ± 1% ~ (p=0.110 n=20+18)
JSONEncode-12 16.8ms ± 1% 16.8ms ± 2% ~ (p=0.569 n=19+20)
JSONDecode-12 60.8ms ± 2% 59.8ms ± 1% -1.69% (p=0.000 n=19+19)
Mandelbrot200-12 3.87ms ± 1% 3.85ms ± 0% -0.61% (p=0.001 n=20+17)
GoParse-12 3.76ms ± 2% 3.76ms ± 1% ~ (p=0.698 n=20+20)
RegexpMatchEasy0_32-12 100ns ± 2% 101ns ± 2% ~ (p=0.065 n=19+20)
RegexpMatchEasy0_1K-12 342ns ± 2% 333ns ± 1% -2.82% (p=0.000 n=20+19)
RegexpMatchEasy1_32-12 83.3ns ± 2% 83.2ns ± 2% ~ (p=0.692 n=20+19)
RegexpMatchEasy1_1K-12 498ns ± 2% 490ns ± 1% -1.52% (p=0.000 n=18+20)
RegexpMatchMedium_32-12 131ns ± 2% 131ns ± 2% ~ (p=0.464 n=20+18)
RegexpMatchMedium_1K-12 39.3µs ± 2% 39.6µs ± 1% +0.77% (p=0.000 n=18+19)
RegexpMatchHard_32-12 2.04µs ± 2% 2.06µs ± 1% +0.69% (p=0.009 n=19+20)
RegexpMatchHard_1K-12 61.4µs ± 2% 62.1µs ± 1% +1.21% (p=0.000 n=19+20)
Revcomp-12 534ms ± 1% 529ms ± 1% -0.97% (p=0.000 n=19+16)
Template-12 70.4ms ± 2% 70.0ms ± 1% ~ (p=0.070 n=19+19)
TimeParse-12 359ns ± 3% 344ns ± 1% -4.15% (p=0.000 n=19+19)
TimeFormat-12 357ns ± 1% 361ns ± 2% +1.05% (p=0.002 n=20+20)
[Geo mean] 62.4µs 62.0µs -0.56%
name old speed new speed delta
GobDecode-12 88.7MB/s ± 1% 88.1MB/s ± 2% -0.68% (p=0.001 n=19+20)
GobEncode-12 118MB/s ± 1% 117MB/s ± 1% -0.42% (p=0.046 n=20+19)
Gzip-12 60.9MB/s ± 2% 61.7MB/s ± 2% +1.21% (p=0.000 n=19+19)
Gunzip-12 460MB/s ± 2% 461MB/s ± 1% ~ (p=0.661 n=20+19)
JSONEncode-12 116MB/s ± 1% 115MB/s ± 2% ~ (p=0.555 n=19+20)
JSONDecode-12 31.9MB/s ± 2% 32.5MB/s ± 1% +1.72% (p=0.000 n=19+19)
GoParse-12 15.4MB/s ± 2% 15.4MB/s ± 1% ~ (p=0.653 n=20+20)
RegexpMatchEasy0_32-12 317MB/s ± 2% 315MB/s ± 2% ~ (p=0.141 n=19+20)
RegexpMatchEasy0_1K-12 2.99GB/s ± 2% 3.07GB/s ± 1% +2.86% (p=0.000 n=20+19)
RegexpMatchEasy1_32-12 384MB/s ± 2% 385MB/s ± 2% ~ (p=0.672 n=20+19)
RegexpMatchEasy1_1K-12 2.06GB/s ± 2% 2.09GB/s ± 1% +1.54% (p=0.000 n=18+20)
RegexpMatchMedium_32-12 7.62MB/s ± 2% 7.63MB/s ± 2% ~ (p=0.800 n=20+18)
RegexpMatchMedium_1K-12 26.0MB/s ± 1% 25.8MB/s ± 1% -0.77% (p=0.000 n=18+19)
RegexpMatchHard_32-12 15.7MB/s ± 2% 15.6MB/s ± 1% -0.69% (p=0.010 n=19+20)
RegexpMatchHard_1K-12 16.7MB/s ± 2% 16.5MB/s ± 1% -1.19% (p=0.000 n=19+20)
Revcomp-12 476MB/s ± 1% 481MB/s ± 1% +0.97% (p=0.000 n=19+16)
Template-12 27.6MB/s ± 2% 27.7MB/s ± 1% ~ (p=0.071 n=19+19)
[Geo mean] 99.1MB/s 99.3MB/s +0.27%
Change-Id: I68bcbf74ccb716cd5e844a554f67b679135105e6
Reviewed-on: https://go-review.googlesource.com/16042
Reviewed-by: Rick Hudson <rlh@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
|
|
Currently the GC work buffers are only 256 bytes and hence can record
only 24 64-bit pointers. They were reduced from 4K in commits db7fd1c
and a15818f as a way to minimize the amount of work the per-P workbuf
caches could "hide" from the mark phase and carry in to the mark
termination phase. However, this approach wasn't very robust and we
later added a "mark 2" phase to address this problem head-on.
Because of mark 2, there's now no benefit to having very small work
buffers. But there are plenty of downsides: small work buffers
increase contention on the work lists, increase the frequency and
hence net overhead of acquiring and releasing work buffers, and
somewhat increase memory overhead of the GC.
This commit expands work buffers back to 4K (504 64-bit pointers).
This reduces the rate of writes to work.full in the garbage benchmark
from a peak of ~780,000 writes/sec to a peak of ~32,000 writes/sec.
This has negligible effect on the go1 benchmarks. It slightly slows
down the garbage benchmark.
name old time/op new time/op delta
XBenchGarbage-12 5.37ms ± 5% 5.60ms ± 2% +4.37% (p=0.000 n=20+20)
Change-Id: Ic9cc28e7a125d23d9faf4f5e690fb8aa9bcdfb28
Reviewed-on: https://go-review.googlesource.com/15893
Reviewed-by: Rick Hudson <rlh@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
|
|
Currently the concurrent root scan is performed in its entirety by the
GC coordinator before entering concurrent mark (which enables GC
workers). This scan is done sequentially, which can prolong the scan
phase, delay the mark phase, and means that the scan phase does not
obey the 25% CPU goal. Furthermore, there's no need to complete the
root scan before starting marking (in fact, we already allow GC
assists to happen during the scan phase), so this acts as an
unnecessary barrier between root scanning and marking.
This change shifts the root scan work out of the GC coordinator and in
to the GC workers. The coordinator simply sets up the scan state and
enqueues the right number of root scan jobs. The GC workers then drain
the root scan jobs prior to draining heap scan jobs.
This parallelizes the root scan process, makes it obey the 25% CPU
goal, and effectively eliminates root scanning as an isolated phase,
allowing the system to smoothly transition from root scanning to heap
marking. This also eliminates a major non-STW responsibility of the GC
coordinator, which will make it easier to switch to a decentralized
state machine. Finally, it puts us in a good position to perform root
scanning in assists as well, which will help satisfy assists at the
beginning of the GC cycle.
This is mostly straightforward. One tricky aspect is that we have to
deal with preemption deadlock: where two non-preemptible goroutines
are trying to preempt each other to perform a stack scan. Given the
context where this happens, the only instance of this is two
background workers trying to scan each other. We avoid this by simply
not scanning the stacks of background workers during the concurrent
phase; this is safe because we'll scan them during mark termination
(and their stacks are *very* small and should not contain any new
pointers).
This change also switches the root marking during mark termination to
use the same gcDrain-based code path as concurrent mark. This
shouldn't affect performance because STW root marking was already
parallel and tasks switched to heap marking immediately when no more
root marking tasks were available. However, it simplifies the code and
unifies these code paths.
This has negligible effect on the go1 benchmarks. It slightly slows
down the garbage benchmark, possibly by making GC run slightly more
frequently.
name old time/op new time/op delta
XBenchGarbage-12 5.10ms ± 1% 5.24ms ± 1% +2.87% (p=0.000 n=18+18)
name old time/op new time/op delta
BinaryTree17-12 3.25s ± 3% 3.20s ± 5% -1.57% (p=0.013 n=20+20)
Fannkuch11-12 2.45s ± 1% 2.46s ± 1% +0.38% (p=0.019 n=20+18)
FmtFprintfEmpty-12 49.7ns ± 3% 49.9ns ± 4% ~ (p=0.851 n=19+20)
FmtFprintfString-12 170ns ± 2% 170ns ± 1% ~ (p=0.775 n=20+19)
FmtFprintfInt-12 161ns ± 1% 160ns ± 1% -0.78% (p=0.000 n=19+18)
FmtFprintfIntInt-12 267ns ± 1% 270ns ± 1% +1.04% (p=0.000 n=19+19)
FmtFprintfPrefixedInt-12 238ns ± 2% 238ns ± 1% ~ (p=0.133 n=18+19)
FmtFprintfFloat-12 311ns ± 1% 310ns ± 2% -0.35% (p=0.023 n=20+19)
FmtManyArgs-12 1.08µs ± 1% 1.06µs ± 1% -2.31% (p=0.000 n=20+20)
GobDecode-12 8.65ms ± 1% 8.63ms ± 1% ~ (p=0.377 n=18+20)
GobEncode-12 6.49ms ± 1% 6.52ms ± 1% +0.37% (p=0.015 n=20+20)
Gzip-12 319ms ± 3% 318ms ± 1% ~ (p=0.975 n=19+17)
Gunzip-12 41.9ms ± 1% 42.1ms ± 2% +0.65% (p=0.004 n=19+20)
HTTPClientServer-12 61.7µs ± 1% 62.6µs ± 1% +1.40% (p=0.000 n=18+20)
JSONEncode-12 16.8ms ± 1% 16.9ms ± 1% ~ (p=0.239 n=20+18)
JSONDecode-12 58.4ms ± 1% 60.7ms ± 1% +3.85% (p=0.000 n=19+20)
Mandelbrot200-12 3.86ms ± 0% 3.86ms ± 1% ~ (p=0.092 n=18+19)
GoParse-12 3.75ms ± 2% 3.75ms ± 2% ~ (p=0.708 n=19+20)
RegexpMatchEasy0_32-12 100ns ± 1% 100ns ± 2% +0.60% (p=0.010 n=17+20)
RegexpMatchEasy0_1K-12 341ns ± 1% 342ns ± 2% ~ (p=0.203 n=20+19)
RegexpMatchEasy1_32-12 82.5ns ± 2% 83.2ns ± 2% +0.83% (p=0.007 n=19+19)
RegexpMatchEasy1_1K-12 495ns ± 1% 495ns ± 2% ~ (p=0.970 n=19+18)
RegexpMatchMedium_32-12 130ns ± 2% 130ns ± 2% +0.59% (p=0.039 n=19+20)
RegexpMatchMedium_1K-12 39.2µs ± 1% 39.3µs ± 1% ~ (p=0.214 n=18+18)
RegexpMatchHard_32-12 2.03µs ± 2% 2.02µs ± 1% ~ (p=0.166 n=18+19)
RegexpMatchHard_1K-12 61.0µs ± 1% 60.9µs ± 1% ~ (p=0.169 n=20+18)
Revcomp-12 533ms ± 1% 535ms ± 1% ~ (p=0.071 n=19+17)
Template-12 68.1ms ± 2% 73.0ms ± 1% +7.26% (p=0.000 n=19+20)
TimeParse-12 355ns ± 2% 356ns ± 2% ~ (p=0.530 n=19+20)
TimeFormat-12 357ns ± 2% 347ns ± 1% -2.59% (p=0.000 n=20+19)
[Geo mean] 62.1µs 62.3µs +0.31%
name old speed new speed delta
GobDecode-12 88.7MB/s ± 1% 88.9MB/s ± 1% ~ (p=0.377 n=18+20)
GobEncode-12 118MB/s ± 1% 118MB/s ± 1% -0.37% (p=0.015 n=20+20)
Gzip-12 60.9MB/s ± 3% 60.9MB/s ± 1% ~ (p=0.944 n=19+17)
Gunzip-12 464MB/s ± 1% 461MB/s ± 2% -0.64% (p=0.004 n=19+20)
JSONEncode-12 115MB/s ± 1% 115MB/s ± 1% ~ (p=0.236 n=20+18)
JSONDecode-12 33.2MB/s ± 1% 32.0MB/s ± 1% -3.71% (p=0.000 n=19+20)
GoParse-12 15.5MB/s ± 2% 15.5MB/s ± 2% ~ (p=0.702 n=19+20)
RegexpMatchEasy0_32-12 320MB/s ± 1% 318MB/s ± 2% ~ (p=0.094 n=18+20)
RegexpMatchEasy0_1K-12 3.00GB/s ± 1% 2.99GB/s ± 1% ~ (p=0.194 n=20+19)
RegexpMatchEasy1_32-12 388MB/s ± 2% 385MB/s ± 2% -0.83% (p=0.008 n=19+19)
RegexpMatchEasy1_1K-12 2.07GB/s ± 1% 2.07GB/s ± 1% ~ (p=0.964 n=19+18)
RegexpMatchMedium_32-12 7.68MB/s ± 1% 7.64MB/s ± 2% -0.57% (p=0.020 n=19+20)
RegexpMatchMedium_1K-12 26.1MB/s ± 1% 26.1MB/s ± 1% ~ (p=0.211 n=18+18)
RegexpMatchHard_32-12 15.8MB/s ± 1% 15.8MB/s ± 1% ~ (p=0.180 n=18+19)
RegexpMatchHard_1K-12 16.8MB/s ± 1% 16.8MB/s ± 2% ~ (p=0.236 n=20+19)
Revcomp-12 477MB/s ± 1% 475MB/s ± 1% ~ (p=0.071 n=19+17)
Template-12 28.5MB/s ± 2% 26.6MB/s ± 1% -6.77% (p=0.000 n=19+20)
[Geo mean] 100MB/s 99.0MB/s -0.82%
Change-Id: I875bf6ceb306d1ee2f470cabf88aa6ede27c47a0
Reviewed-on: https://go-review.googlesource.com/16059
Reviewed-by: Rick Hudson <rlh@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
|
|
This work queue is no longer used (there are many reads of
work.partial, but the only write is in putpartial, which is never
called).
Fixes #11922.
Change-Id: I08b76c0c02a0867a9cdcb94783e1f7629d44249a
Reviewed-on: https://go-review.googlesource.com/15892
Reviewed-by: Rick Hudson <rlh@golang.org>
|