go - Fork of the Go programming language with my patches.

Age	Commit message	Author
2026-01-23	runtime: speed up cheaprand and cheaprand64	Gavin Lam
	The current cheaprand performs a 128-bit multiplication on 64-bit numbers and truncates the result to 32 bits, which is inefficient. A 32-bit-specific implementation is more performant because it performs a 64-bit multiplication on 32-bit numbers instead.

	The current cheaprand64 involves two cheaprand calls. Implementing it as 64-bit wyrand is significantly faster. Since cheaprand64 discards one bit, I have preserved this behavior. The underlying uint64 function is made available as cheaprandu64.

	                 │ old           │ new                              │
	                 │ sec/op        │ sec/op         vs base           │
	Cheaprand-8        1.358n ± 0%     1.218n ± 0%    -10.31% (n=100)
	Cheaprand64-8      2.424n ± 0%     1.391n ± 0%    -42.62% (n=100)
	Blocksampled-8     8.347n ± 0%     2.022n ± 0%    -75.78% (n=100)

	Fixes #77149

	Change-Id: Ib0b5da4a642cd34d0401b03c1d343041f8230d11
	GitHub-Last-Rev: 549d8d407e2bbcaecdee0b52cbf3a513dda637fb
	GitHub-Pull-Request: golang/go#77150
	Reviewed-on: https://go-review.googlesource.com/c/go/+/735480
	Auto-Submit: Keith Randall <khr@golang.org>
	Reviewed-by: Carlos Amedee <carlos@golang.org>
	Reviewed-by: Keith Randall <khr@google.com>
	Reviewed-by: Keith Randall <khr@golang.org>
	LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
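The idea in this message can be sketched outside the runtime. The following is a minimal illustration, not the code from the CL: the type and method names are hypothetical, the 64-bit constants are the commonly published wyrand ones, the 32-bit constants are arbitrary placeholders, and dropping the low bit in next63 is an assumption about how "discards one bit" might be done.

```go
package main

import (
	"fmt"
	"math/bits"
)

// rng64 is a hypothetical holder for wyrand-style 64-bit state.
type rng64 struct{ state uint64 }

// next64 advances the state and folds the 128-bit product of two
// 64-bit values into a single 64-bit result.
func (r *rng64) next64() uint64 {
	r.state += 0xa0761d6478bd642f
	hi, lo := bits.Mul64(r.state, r.state^0xe7037ed1a0b428db)
	return hi ^ lo
}

// next63 discards one bit so the result is a non-negative int64.
// Which bit is dropped is an assumption made for this sketch.
func (r *rng64) next63() int64 {
	return int64(r.next64() >> 1)
}

// rng32 is a hypothetical 32-bit analogue: it needs only a 64-bit
// multiplication of two 32-bit values.
type rng32 struct{ state uint32 }

// next32 folds the 64-bit product into 32 bits.
// Both constants are placeholders chosen for illustration only.
func (r *rng32) next32() uint32 {
	r.state += 0x9e3779b9
	p := uint64(r.state) * uint64(r.state^0x85ebca6b)
	return uint32(p>>32) ^ uint32(p)
}

func main() {
	r64 := &rng64{state: 1}
	r32 := &rng32{state: 1}
	fmt.Println(r64.next64(), r64.next63(), r32.next32())
}
```

The 32-bit variant avoids the double-width multiply entirely, which is the cost the message identifies in the old cheaprand.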
2025-05-19	runtime: rename ncpu to numCPUStartup	Michael Pratt
	ncpu is the total logical CPU count at startup. It is never updated. For #73193, we will start using updated CPU counts for updated GOMAXPROCS, making the ncpu name a bit ambiguous. Change to a less ambiguous name.

	While we're at it, give the OS specific lookup functions a common name, so it can be used outside of osinit later.

	For #73193.

	Change-Id: I6a6a636cf21cc60de36b211f3c374080849fc667
	Reviewed-on: https://go-review.googlesource.com/c/go/+/672277
	LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
	Reviewed-by: Michael Knyszek <mknyszek@google.com>
	Auto-Submit: Michael Pratt <mpratt@google.com>
2025-05-08	runtime: avoid overflow in mutex delay calculation	Rhys Hiltner
	If cputicks is in the top quarter of the int64's range, adding two values together will overflow and confuse the subsequent calculations, leading to zero-duration contention events in the profile.

	This fixes the TestRuntimeLockMetricsAndProfile failures on the linux-s390x builder.

	Change-Id: Icb814c39a8702379dfd71c06a53b2618e3589e07
	Reviewed-on: https://go-review.googlesource.com/c/go/+/671115
	Reviewed-by: Michael Knyszek <mknyszek@google.com>
	TryBot-Result: Gopher Robot <gobot@golang.org>
	Run-TryBot: Rhys Hiltner <rhys.hiltner@gmail.com>
	LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
	Auto-Submit: Michael Knyszek <mknyszek@google.com>
	Reviewed-by: Cherry Mui <cherryyz@google.com>
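The overflow condition described here is easy to reproduce in isolation. The sketch below uses a midpoint calculation purely as an example of "adding two values together"; the function names are hypothetical and the message does not say which expression the runtime actually used.

```go
package main

import "fmt"

// naiveMidpoint adds first, so the sum wraps around once both
// inputs are at or above 1<<62 (the top quarter of int64's range).
func naiveMidpoint(a, b int64) int64 {
	return (a + b) / 2
}

// safeMidpoint works from the difference instead, which stays in
// range whenever 0 <= a <= b.
func safeMidpoint(a, b int64) int64 {
	return a + (b-a)/2
}

func main() {
	a := int64(1)<<62 + 100
	b := int64(1)<<62 + 300
	fmt.Println("naive result is negative:", naiveMidpoint(a, b) < 0)
	fmt.Println("safe result is correct:", safeMidpoint(a, b) == int64(1)<<62+200)
}
```

Go defines signed integer overflow at run time as wraparound, so the naive version silently produces a negative value rather than panicking, which is how a later duration calculation can end up at zero.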
2025-05-07	runtime: blame unlocker for mutex delay	Rhys Hiltner
	Correct how the mutex contention profile reports on runtime-internal mutex values, to match sync.Mutex's semantics.

	Decide at the start of unlock2 whether we'd like to collect a contention sample. If so: Opt in to a slightly slower unlock path which avoids accidentally accepting blame for delay caused by other Ms. Release the lock before doing an O(N) traversal of the stack of waiting Ms, to calculate the total delay to those Ms that our critical section caused. Report that, with the current callstack, in the mutex profile.

	Fixes #66999

	Change-Id: I561ed8dc120669bd045d514cb0d1c6c99c2add04
	Reviewed-on: https://go-review.googlesource.com/c/go/+/667615
	LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
	Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
	Reviewed-by: Michael Knyszek <mknyszek@google.com>
	Reviewed-by: Michael Pratt <mpratt@google.com>
2025-04-23	runtime: move some malloc constants to internal/runtime/gc	Michael Anthony Knyszek
	These constants are needed by some future generator programs.

	Change-Id: I5dccd009cbb3b2f321523bc0d8eaeb4c82e5df81
	Reviewed-on: https://go-review.googlesource.com/c/go/+/655276
	Reviewed-by: Cherry Mui <cherryyz@google.com>
	Auto-Submit: Michael Knyszek <mknyszek@google.com>
	LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2025-04-22	runtime: commit to spinbitmutex GOEXPERIMENT	Rhys Hiltner
	Use the "spinbit" mutex implementation always (including on platforms that need to emulate atomic.Xchg8), and delete the prior "tristate" implementations.

	The exception is GOARCH=wasm, where the Go runtime does not use multiple threads.

	For #68578

	Change-Id: Ifc29bbfa05071d776c23a19ae185891a03a82417
	Reviewed-on: https://go-review.googlesource.com/c/go/+/658456
	Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
	Reviewed-by: Junyang Shao <shaojunyang@google.com>
	Reviewed-by: Michael Knyszek <mknyszek@google.com>
	LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2025-04-22	runtime: fix test of when a mutex is contended	Rhys Hiltner
	This is used only in tests that verify reports of runtime-internal mutex contention.

	For #66999
	For #70602

	Change-Id: I72cb1302d8ea0524f1182ec892f5c9a1923cddba
	Reviewed-on: https://go-review.googlesource.com/c/go/+/667095
	Reviewed-by: Michael Knyszek <mknyszek@google.com>
	LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
	Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
	Reviewed-by: Junyang Shao <shaojunyang@google.com>
2025-04-11	runtime: handle m0 padding better	Russ Cox
	The SpinbitMutex experiment requires m structs other than m0 to be allocated in the 2048-byte size class, by adding padding. Do the calculation more explicitly, to avoid future CLs like CL 653335.

	Change-Id: I83ae1e86ef3711ab65441f4e487f94b9e1429029
	Reviewed-on: https://go-review.googlesource.com/c/go/+/654595
	Reviewed-by: Rhys Hiltner <rhys.hiltner@gmail.com>
	Reviewed-by: Michael Knyszek <mknyszek@google.com>
	LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
	Auto-Submit: Michael Knyszek <mknyszek@google.com>
2024-11-20	runtime: clean up new lock2 structure	Rhys Hiltner
	Simplify some flow control, as suggested on https://go.dev/cl/620435.

	The MutexCapture microbenchmark shows a bit of throughput improvement at moderate levels of contention, and little change to capture and starvation. (Note that the capture and starvation figures below are in terms of power-of-two buckets multiplied by throughput, so they either follow similar patterns or move by a factor of two.)

	For #68578

	goos: linux
	goarch: amd64
	pkg: runtime
	cpu: 13th Gen Intel(R) Core(TM) i7-13700H
	                 │ old            │ new                                      │
	                 │ sec/op         │ sec/op          vs base                  │
	MutexCapture       18.21n ± 0%      18.35n ± 0%      +0.77% (p=0.000 n=10)
	MutexCapture-2     21.46n ± 8%      21.05n ± 12%     ~ (p=0.796 n=10)
	MutexCapture-3     22.56n ± 9%      22.59n ± 18%     ~ (p=0.631 n=10)
	MutexCapture-4     22.85n ± 5%      22.74n ± 2%      ~ (p=0.565 n=10)
	MutexCapture-5     22.84n ± 5%      22.50n ± 14%     ~ (p=0.912 n=10)
	MutexCapture-6     23.33n ± 14%     22.22n ± 3%      -4.78% (p=0.004 n=10)
	MutexCapture-7     27.04n ± 14%     23.78n ± 15%     ~ (p=0.089 n=10)
	MutexCapture-8     25.44n ± 10%     23.03n ± 6%      -9.48% (p=0.004 n=10)
	MutexCapture-9     25.56n ± 7%      24.39n ± 11%     ~ (p=0.218 n=10)
	MutexCapture-10    26.77n ± 10%     24.00n ± 7%      -10.33% (p=0.023 n=10)
	MutexCapture-11    27.02n ± 7%      24.55n ± 15%     -9.18% (p=0.035 n=10)
	MutexCapture-12    26.71n ± 8%      24.96n ± 8%      ~ (p=0.148 n=10)
	MutexCapture-13    25.58n ± 4%      25.82n ± 5%      ~ (p=0.271 n=10)
	MutexCapture-14    26.86n ± 6%      25.91n ± 7%      ~ (p=0.529 n=10)
	MutexCapture-15    25.12n ± 13%     26.16n ± 4%      ~ (p=0.353 n=10)
	MutexCapture-16    26.18n ± 4%      26.21n ± 9%      ~ (p=0.838 n=10)
	MutexCapture-17    26.04n ± 4%      25.85n ± 5%      ~ (p=0.363 n=10)
	MutexCapture-18    26.02n ± 7%      25.93n ± 5%      ~ (p=0.853 n=10)
	MutexCapture-19    25.67n ± 5%      26.21n ± 4%      ~ (p=0.631 n=10)
	MutexCapture-20    25.50n ± 6%      25.99n ± 8%      ~ (p=0.404 n=10)
	geomean            24.73n           24.02n           -2.88%

	                 │ old            │ new                                      │
	                 │ sec/streak-p90 │ sec/streak-p90  vs base                  │
	MutexCapture       76.36m ± 0%      76.96m ± 0%      +0.79% (p=0.000 n=10)
	MutexCapture-2     10.609µ ± 50%    5.390µ ± 119%    ~ (p=0.579 n=10)
	MutexCapture-3     5.936µ ± 93%     5.782µ ± 18%     ~ (p=0.684 n=10)
	MutexCapture-4     5.849µ ± 5%      5.820µ ± 2%      ~ (p=0.579 n=10)
	MutexCapture-5     5.849µ ± 5%      5.759µ ± 14%     ~ (p=0.912 n=10)
	MutexCapture-6     5.975µ ± 14%     5.687µ ± 3%      -4.81% (p=0.004 n=10)
	MutexCapture-7     6.921µ ± 14%     6.086µ ± 18%     ~ (p=0.165 n=10)
	MutexCapture-8     6.512µ ± 10%     5.894µ ± 6%      -9.50% (p=0.004 n=10)
	MutexCapture-9     6.544µ ± 7%      6.245µ ± 11%     ~ (p=0.218 n=10)
	MutexCapture-10    6.962µ ± 11%     6.144µ ± 7%      -11.76% (p=0.023 n=10)
	MutexCapture-11    6.938µ ± 7%      6.284µ ± 130%    ~ (p=0.190 n=10)
	MutexCapture-12    6.838µ ± 8%      6.408µ ± 13%     ~ (p=0.404 n=10)
	MutexCapture-13    6.549µ ± 4%      6.608µ ± 5%      ~ (p=0.271 n=10)
	MutexCapture-14    6.877µ ± 8%      6.634µ ± 7%      ~ (p=0.436 n=10)
	MutexCapture-15    6.433µ ± 13%     6.697µ ± 4%      ~ (p=0.247 n=10)
	MutexCapture-16    6.702µ ± 10%     6.711µ ± 116%    ~ (p=0.796 n=10)
	MutexCapture-17    6.730µ ± 3%      6.619µ ± 5%      ~ (p=0.225 n=10)
	MutexCapture-18    6.663µ ± 7%      6.716µ ± 13%     ~ (p=0.853 n=10)
	MutexCapture-19    6.570µ ± 5%      6.710µ ± 4%      ~ (p=0.529 n=10)
	MutexCapture-20    6.528µ ± 6%      6.775µ ± 11%     ~ (p=0.247 n=10)
	geomean            10.66µ           10.00µ           -6.13%

	                 │ old            │ new                                      │
	                 │ sec/starve-p90 │ sec/starve-p90  vs base                  │
	MutexCapture-2     10.609µ ± 50%    5.390µ ± 119%    ~ (p=0.579 n=10)
	MutexCapture-3     184.8µ ± 91%     183.9µ ± 48%     ~ (p=0.436 n=10)
	MutexCapture-4     388.8µ ± 270%    375.6µ ± 280%    ~ (p=0.436 n=10)
	MutexCapture-5     807.2µ ± 83%     2880.9µ ± 85%    ~ (p=0.105 n=10)
	MutexCapture-6     2.272m ± 61%     2.173m ± 34%     ~ (p=0.280 n=10)
	MutexCapture-7     1.351m ± 125%    2.990m ± 70%     ~ (p=0.393 n=10)
	MutexCapture-8     3.328m ± 97%     3.064m ± 96%     ~ (p=0.739 n=10)
	MutexCapture-9     3.526m ± 91%     3.081m ± 47%     -12.62% (p=0.015 n=10)
	MutexCapture-10    3.641m ± 86%     3.228m ± 90%     -11.34% (p=0.005 n=10)
	MutexCapture-11    3.324m ± 109%    3.190m ± 71%     ~ (p=0.481 n=10)
	MutexCapture-12    3.519m ± 77%     3.200m ± 106%    ~ (p=0.393 n=10)
	MutexCapture-13    3.353m ± 91%     3.368m ± 99%     ~ (p=0.853 n=10)
	MutexCapture-14    3.314m ± 101%    3.396m ± 286%    ~ (p=0.353 n=10)
	MutexCapture-15    3.534m ± 83%     3.397m ± 91%     ~ (p=0.739 n=10)
	MutexCapture-16    3.485m ± 90%     3.436m ± 116%    ~ (p=0.853 n=10)
	MutexCapture-17    6.516m ± 48%     3.452m ± 88%     ~ (p=0.190 n=10)
	MutexCapture-18    6.645m ± 105%    3.439m ± 108%    ~ (p=0.218 n=10)
	MutexCapture-19    6.521m ± 46%     4.907m ± 42%     ~ (p=0.529 n=10)
	MutexCapture-20    6.532m ± 47%     3.516m ± 89%     ~ (p=0.089 n=10)
	geomean            1.919m           1.783m           -7.06%

	Change-Id: I36106e1baf8afd132f1568748d1b83b797fa260e
	Reviewed-on: https://go-review.googlesource.com/c/go/+/629415
	Reviewed-by: Michael Knyszek <mknyszek@google.com>
	LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
	Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
	Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
2024-11-15	runtime: unify lock2, allow deeper sleep	Rhys Hiltner
	The tri-state mutex implementation (unlocked, locked, sleeping) avoids sleep/wake syscalls when contention is low or absent, but its performance degrades when many threads are contending for a mutex to execute a fast critical section.

	A fast critical section means frequent unlock2 calls. Each of those finds the mutex in the "sleeping" state and so wakes a sleeping thread, even if many other threads are already awake and in the spin loop of lock2 attempting to acquire the mutex for themselves. Many spinning threads means wasting energy and CPU time that could be used by other processes on the machine. Many threads all spinning on the same cache line leads to performance collapse.

	Merge the futex- and semaphore-based mutex implementations by using a semaphore abstraction for futex platforms. Then, add a bit to the mutex state word that communicates whether one of the waiting threads is awake and spinning. When threads in lock2 see the new "spinning" bit, they can sleep immediately. In unlock2, the "spinning" bit means we can save a syscall and not wake a sleeping thread.

	This brings up the real possibility of starvation: waiting threads are able to enter a deeper sleep than before, since one of their peers can volunteer to be the sole "spinning" thread and thus cause unlock2 to skip the semawakeup call. Additionally, the waiting threads form a LIFO stack so any wakeups that do occur will target threads that have gone to sleep most recently. Counteract those effects by periodically waking the thread at the bottom of the stack and allowing it to spin.

	Exempt sched.lock from most of the new behaviors; it's often used by several threads in sequence to do thread-specific work, so low-latency handoff is a priority over improved throughput.

	Gate use of this implementation behind GOEXPERIMENT=spinbitmutex, so it's easy to disable. Enable it by default on supported platforms (the most efficient implementation requires atomic.Xchg8).

	Fixes #68578

	goos: linux
	goarch: amd64
	pkg: runtime
	cpu: 13th Gen Intel(R) Core(TM) i7-13700H
	                              │ old            │ new                                    │
	                              │ sec/op         │ sec/op         vs base                 │
	MutexContention                 17.82n ± 0%      17.74n ± 0%     -0.42% (p=0.000 n=10)
	MutexContention-2               22.17n ± 9%      19.85n ± 12%    ~ (p=0.089 n=10)
	MutexContention-3               26.14n ± 14%     20.81n ± 13%    -20.41% (p=0.000 n=10)
	MutexContention-4               29.28n ± 8%      21.19n ± 10%    -27.62% (p=0.000 n=10)
	MutexContention-5               31.79n ± 2%      21.98n ± 10%    -30.83% (p=0.000 n=10)
	MutexContention-6               34.63n ± 1%      22.58n ± 5%     -34.79% (p=0.000 n=10)
	MutexContention-7               44.16n ± 2%      23.14n ± 7%     -47.59% (p=0.000 n=10)
	MutexContention-8               53.81n ± 3%      23.66n ± 6%     -56.04% (p=0.000 n=10)
	MutexContention-9               65.58n ± 4%      23.91n ± 9%     -63.54% (p=0.000 n=10)
	MutexContention-10              77.35n ± 3%      26.06n ± 9%     -66.31% (p=0.000 n=10)
	MutexContention-11              89.62n ± 1%      25.56n ± 9%     -71.47% (p=0.000 n=10)
	MutexContention-12              102.45n ± 2%     25.57n ± 7%     -75.04% (p=0.000 n=10)
	MutexContention-13              111.95n ± 1%     24.59n ± 8%     -78.04% (p=0.000 n=10)
	MutexContention-14              123.95n ± 3%     24.42n ± 6%     -80.30% (p=0.000 n=10)
	MutexContention-15              120.80n ± 10%    25.54n ± 6%     -78.86% (p=0.000 n=10)
	MutexContention-16              128.10n ± 25%    26.95n ± 4%     -78.96% (p=0.000 n=10)
	MutexContention-17              139.80n ± 18%    24.96n ± 5%     -82.14% (p=0.000 n=10)
	MutexContention-18              141.35n ± 7%     25.05n ± 8%     -82.27% (p=0.000 n=10)
	MutexContention-19              151.35n ± 18%    25.72n ± 6%     -83.00% (p=0.000 n=10)
	MutexContention-20              153.30n ± 20%    24.75n ± 6%     -83.85% (p=0.000 n=10)
	MutexHandoff/Solo-20            13.54n ± 1%      13.61n ± 4%     ~ (p=0.206 n=10)
	MutexHandoff/FastPingPong-20    141.3n ± 209%    164.8n ± 49%    ~ (p=0.436 n=10)
	MutexHandoff/SlowPingPong-20    1.572µ ± 16%     1.804µ ± 19%    +14.76% (p=0.015 n=10)
	geomean                         74.34n           30.26n          -59.30%

	goos: darwin
	goarch: arm64
	pkg: runtime
	cpu: Apple M1
	                              │ old            │ new                                    │
	                              │ sec/op         │ sec/op         vs base                 │
	MutexContention                 13.86n ± 3%      12.09n ± 3%     -12.73% (p=0.000 n=10)
	MutexContention-2               15.88n ± 1%      16.50n ± 2%     +3.94% (p=0.001 n=10)
	MutexContention-3               18.45n ± 2%      16.88n ± 2%     -8.54% (p=0.000 n=10)
	MutexContention-4               20.01n ± 2%      18.94n ± 18%    ~ (p=0.469 n=10)
	MutexContention-5               22.60n ± 1%      17.51n ± 9%     -22.50% (p=0.000 n=10)
	MutexContention-6               23.93n ± 2%      17.35n ± 2%     -27.48% (p=0.000 n=10)
	MutexContention-7               24.69n ± 1%      17.15n ± 3%     -30.54% (p=0.000 n=10)
	MutexContention-8               25.01n ± 1%      17.33n ± 2%     -30.69% (p=0.000 n=10)
	MutexHandoff/Solo-8             13.96n ± 4%      12.04n ± 4%     -13.78% (p=0.000 n=10)
	MutexHandoff/FastPingPong-8     68.89n ± 4%      64.62n ± 2%     -6.20% (p=0.000 n=10)
	MutexHandoff/SlowPingPong-8     9.698µ ± 22%     9.646µ ± 35%    ~ (p=0.912 n=10)
	geomean                         38.20n           32.53n          -14.84%

	Change-Id: I0058c75eadf282d08eea7fce0d426f0518039f7c
	Reviewed-on: https://go-review.googlesource.com/c/go/+/620435
	Reviewed-by: Michael Knyszek <mknyszek@google.com>
	LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
	Reviewed-by: Junyang Shao <shaojunyang@google.com>
	Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
