| Age | Commit message | Author |
|
The current cheaprand performs 128-bit multiplication on 64-bit numbers
and truncates the result to 32 bits, which is inefficient.
A 32-bit specific implementation is more performant because it performs
64-bit multiplication on 32-bit numbers instead.
The current cheaprand64 involves two cheaprand calls.
Implementing it as 64-bit wyrand is significantly faster.
Since cheaprand64 discards one bit, I have preserved this behavior.
The underlying uint64 function is made available as cheaprandu64.
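The two multiplication widths described above can be sketched as follows.
This is a minimal illustration, not the runtime's code: the function names,
the explicit state pointers, and all constants are assumptions made for the
example, and shifting right by one is only one possible way to discard a bit.

```go
package main

import (
	"fmt"
	"math/bits"
)

// rand64 is a wyrand-style step: advance the state by an odd constant,
// take the full 128-bit product of two 64-bit values, and fold the two
// halves together. The constants are illustrative.
func rand64(state *uint64) uint64 {
	*state += 0xa0761d6478bd642f
	hi, lo := bits.Mul64(*state, *state^0xe7037ed1a0b428db)
	return hi ^ lo
}

// rand32 is a 32-bit analogue: the product of two 32-bit values fits in a
// single uint64, so no 128-bit multiplication is needed. The constants are
// illustrative.
func rand32(state *uint32) uint32 {
	*state += 0x53c5ca59
	p := uint64(*state) * uint64(*state^0x74743c1b)
	return uint32(p>>32) ^ uint32(p)
}

// rand63 drops one bit of rand64 so the result is a non-negative int64.
func rand63(state *uint64) int64 {
	return int64(rand64(state) >> 1)
}

func main() {
	s32, s64 := uint32(1), uint64(1)
	fmt.Println(rand32(&s32), rand64(&s64), rand63(&s64))
}
```

A single rand64 call replaces the two narrower calls that a composed 64-bit
generator would need, which is the kind of saving the Cheaprand64 numbers
below reflect.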
                 │     old     │                new                 │
                 │   sec/op    │   sec/op     vs base               │
Cheaprand-8        1.358n ± 0%   1.218n ± 0%  -10.31% (n=100)
Cheaprand64-8      2.424n ± 0%   1.391n ± 0%  -42.62% (n=100)
Blocksampled-8     8.347n ± 0%   2.022n ± 0%  -75.78% (n=100)
Fixes #77149
Change-Id: Ib0b5da4a642cd34d0401b03c1d343041f8230d11
GitHub-Last-Rev: 549d8d407e2bbcaecdee0b52cbf3a513dda637fb
GitHub-Pull-Request: golang/go#77150
Reviewed-on: https://go-review.googlesource.com/c/go/+/735480
Auto-Submit: Keith Randall <khr@golang.org>
Reviewed-by: Carlos Amedee <carlos@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
ncpu is the total logical CPU count at startup. It is never updated. For
#73193, we will start using updated CPU counts for updated GOMAXPROCS,
making the ncpu name a bit ambiguous. Change to a less ambiguous name.
While we're at it, give the OS-specific lookup functions a common name,
so they can be used outside of osinit later.
For #73193.
Change-Id: I6a6a636cf21cc60de36b211f3c374080849fc667
Reviewed-on: https://go-review.googlesource.com/c/go/+/672277
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Auto-Submit: Michael Pratt <mpratt@google.com>
|
|
If cputicks is in the top quarter of the int64's range, adding two
values together will overflow and confuse the subsequent calculations,
leading to zero-duration contention events in the profile.
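The overflow can be reproduced with two values at or above 1<<62. The
commit message does not say which expression added the two values, so the
midpoint calculation below is an assumed example; naiveMid and safeMid are
hypothetical names.

```go
package main

import "fmt"

// naiveMid averages two tick values by adding them first. The sum wraps
// around when both inputs are in the top quarter of the int64 range.
func naiveMid(a, b int64) int64 {
	return (a + b) / 2
}

// safeMid computes the midpoint from the difference instead, which stays
// in range when 0 <= a <= b.
func safeMid(a, b int64) int64 {
	return a + (b-a)/2
}

func main() {
	start := int64(1)<<62 + 10
	end := int64(1)<<62 + 20
	fmt.Println(naiveMid(start, end) < 0)    // prints true: the sum wrapped
	fmt.Println(safeMid(start, end) - start) // prints 5
}
```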
This fixes the TestRuntimeLockMetricsAndProfile failures on the
linux-s390x builder.
Change-Id: Icb814c39a8702379dfd71c06a53b2618e3589e07
Reviewed-on: https://go-review.googlesource.com/c/go/+/671115
Reviewed-by: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Rhys Hiltner <rhys.hiltner@gmail.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
|
|
Correct how the mutex contention profile reports on runtime-internal
mutex values, to match sync.Mutex's semantics.
Decide at the start of unlock2 whether we'd like to collect a contention
sample. If so, opt in to a slightly slower unlock path that avoids
accidentally accepting blame for delay caused by other Ms. Release the
lock before doing an O(N) traversal of the stack of waiting Ms to
calculate the total delay that our critical section caused those Ms.
Report that delay, with the current call stack, in the mutex profile.
Fixes #66999
Change-Id: I561ed8dc120669bd045d514cb0d1c6c99c2add04
Reviewed-on: https://go-review.googlesource.com/c/go/+/667615
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
|
|
These constants are needed by some future generator programs.
Change-Id: I5dccd009cbb3b2f321523bc0d8eaeb4c82e5df81
Reviewed-on: https://go-review.googlesource.com/c/go/+/655276
Reviewed-by: Cherry Mui <cherryyz@google.com>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
Always use the "spinbit" mutex implementation (including on platforms
that need to emulate atomic.Xchg8), and delete the prior "tristate"
implementations.
The exception is GOARCH=wasm, where the Go runtime does not use multiple
threads.
For #68578
Change-Id: Ifc29bbfa05071d776c23a19ae185891a03a82417
Reviewed-on: https://go-review.googlesource.com/c/go/+/658456
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
Reviewed-by: Junyang Shao <shaojunyang@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
This is used only in tests that verify reports of runtime-internal mutex
contention.
For #66999
For #70602
Change-Id: I72cb1302d8ea0524f1182ec892f5c9a1923cddba
Reviewed-on: https://go-review.googlesource.com/c/go/+/667095
Reviewed-by: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
Reviewed-by: Junyang Shao <shaojunyang@google.com>
|
|
The SpinbitMutex experiment requires m structs other than m0
to be allocated in the 2048-byte size class, by adding padding.
Do the calculation more explicitly, to avoid future CLs like CL 653335.
Change-Id: I83ae1e86ef3711ab65441f4e487f94b9e1429029
Reviewed-on: https://go-review.googlesource.com/c/go/+/654595
Reviewed-by: Rhys Hiltner <rhys.hiltner@gmail.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
|
|
Simplify some flow control, as suggested on https://go.dev/cl/620435.
The MutexCapture microbenchmark shows a bit of throughput improvement at
moderate levels of contention, and little change to capture and
starvation. (Note that the capture and starvation figures below are in
terms of power-of-two buckets multiplied by throughput, so they either
follow similar patterns or move by a factor of two.)
For #68578
goos: linux
goarch: amd64
pkg: runtime
cpu: 13th Gen Intel(R) Core(TM) i7-13700H
│ old │ new │
│ sec/op │ sec/op vs base │
MutexCapture 18.21n ± 0% 18.35n ± 0% +0.77% (p=0.000 n=10)
MutexCapture-2 21.46n ± 8% 21.05n ± 12% ~ (p=0.796 n=10)
MutexCapture-3 22.56n ± 9% 22.59n ± 18% ~ (p=0.631 n=10)
MutexCapture-4 22.85n ± 5% 22.74n ± 2% ~ (p=0.565 n=10)
MutexCapture-5 22.84n ± 5% 22.50n ± 14% ~ (p=0.912 n=10)
MutexCapture-6 23.33n ± 14% 22.22n ± 3% -4.78% (p=0.004 n=10)
MutexCapture-7 27.04n ± 14% 23.78n ± 15% ~ (p=0.089 n=10)
MutexCapture-8 25.44n ± 10% 23.03n ± 6% -9.48% (p=0.004 n=10)
MutexCapture-9 25.56n ± 7% 24.39n ± 11% ~ (p=0.218 n=10)
MutexCapture-10 26.77n ± 10% 24.00n ± 7% -10.33% (p=0.023 n=10)
MutexCapture-11 27.02n ± 7% 24.55n ± 15% -9.18% (p=0.035 n=10)
MutexCapture-12 26.71n ± 8% 24.96n ± 8% ~ (p=0.148 n=10)
MutexCapture-13 25.58n ± 4% 25.82n ± 5% ~ (p=0.271 n=10)
MutexCapture-14 26.86n ± 6% 25.91n ± 7% ~ (p=0.529 n=10)
MutexCapture-15 25.12n ± 13% 26.16n ± 4% ~ (p=0.353 n=10)
MutexCapture-16 26.18n ± 4% 26.21n ± 9% ~ (p=0.838 n=10)
MutexCapture-17 26.04n ± 4% 25.85n ± 5% ~ (p=0.363 n=10)
MutexCapture-18 26.02n ± 7% 25.93n ± 5% ~ (p=0.853 n=10)
MutexCapture-19 25.67n ± 5% 26.21n ± 4% ~ (p=0.631 n=10)
MutexCapture-20 25.50n ± 6% 25.99n ± 8% ~ (p=0.404 n=10)
geomean 24.73n 24.02n -2.88%
│ old │ new │
│ sec/streak-p90 │ sec/streak-p90 vs base │
MutexCapture 76.36m ± 0% 76.96m ± 0% +0.79% (p=0.000 n=10)
MutexCapture-2 10.609µ ± 50% 5.390µ ± 119% ~ (p=0.579 n=10)
MutexCapture-3 5.936µ ± 93% 5.782µ ± 18% ~ (p=0.684 n=10)
MutexCapture-4 5.849µ ± 5% 5.820µ ± 2% ~ (p=0.579 n=10)
MutexCapture-5 5.849µ ± 5% 5.759µ ± 14% ~ (p=0.912 n=10)
MutexCapture-6 5.975µ ± 14% 5.687µ ± 3% -4.81% (p=0.004 n=10)
MutexCapture-7 6.921µ ± 14% 6.086µ ± 18% ~ (p=0.165 n=10)
MutexCapture-8 6.512µ ± 10% 5.894µ ± 6% -9.50% (p=0.004 n=10)
MutexCapture-9 6.544µ ± 7% 6.245µ ± 11% ~ (p=0.218 n=10)
MutexCapture-10 6.962µ ± 11% 6.144µ ± 7% -11.76% (p=0.023 n=10)
MutexCapture-11 6.938µ ± 7% 6.284µ ± 130% ~ (p=0.190 n=10)
MutexCapture-12 6.838µ ± 8% 6.408µ ± 13% ~ (p=0.404 n=10)
MutexCapture-13 6.549µ ± 4% 6.608µ ± 5% ~ (p=0.271 n=10)
MutexCapture-14 6.877µ ± 8% 6.634µ ± 7% ~ (p=0.436 n=10)
MutexCapture-15 6.433µ ± 13% 6.697µ ± 4% ~ (p=0.247 n=10)
MutexCapture-16 6.702µ ± 10% 6.711µ ± 116% ~ (p=0.796 n=10)
MutexCapture-17 6.730µ ± 3% 6.619µ ± 5% ~ (p=0.225 n=10)
MutexCapture-18 6.663µ ± 7% 6.716µ ± 13% ~ (p=0.853 n=10)
MutexCapture-19 6.570µ ± 5% 6.710µ ± 4% ~ (p=0.529 n=10)
MutexCapture-20 6.528µ ± 6% 6.775µ ± 11% ~ (p=0.247 n=10)
geomean 10.66µ 10.00µ -6.13%
│ old │ new │
│ sec/starve-p90 │ sec/starve-p90 vs base │
MutexCapture-2 10.609µ ± 50% 5.390µ ± 119% ~ (p=0.579 n=10)
MutexCapture-3 184.8µ ± 91% 183.9µ ± 48% ~ (p=0.436 n=10)
MutexCapture-4 388.8µ ± 270% 375.6µ ± 280% ~ (p=0.436 n=10)
MutexCapture-5 807.2µ ± 83% 2880.9µ ± 85% ~ (p=0.105 n=10)
MutexCapture-6 2.272m ± 61% 2.173m ± 34% ~ (p=0.280 n=10)
MutexCapture-7 1.351m ± 125% 2.990m ± 70% ~ (p=0.393 n=10)
MutexCapture-8 3.328m ± 97% 3.064m ± 96% ~ (p=0.739 n=10)
MutexCapture-9 3.526m ± 91% 3.081m ± 47% -12.62% (p=0.015 n=10)
MutexCapture-10 3.641m ± 86% 3.228m ± 90% -11.34% (p=0.005 n=10)
MutexCapture-11 3.324m ± 109% 3.190m ± 71% ~ (p=0.481 n=10)
MutexCapture-12 3.519m ± 77% 3.200m ± 106% ~ (p=0.393 n=10)
MutexCapture-13 3.353m ± 91% 3.368m ± 99% ~ (p=0.853 n=10)
MutexCapture-14 3.314m ± 101% 3.396m ± 286% ~ (p=0.353 n=10)
MutexCapture-15 3.534m ± 83% 3.397m ± 91% ~ (p=0.739 n=10)
MutexCapture-16 3.485m ± 90% 3.436m ± 116% ~ (p=0.853 n=10)
MutexCapture-17 6.516m ± 48% 3.452m ± 88% ~ (p=0.190 n=10)
MutexCapture-18 6.645m ± 105% 3.439m ± 108% ~ (p=0.218 n=10)
MutexCapture-19 6.521m ± 46% 4.907m ± 42% ~ (p=0.529 n=10)
MutexCapture-20 6.532m ± 47% 3.516m ± 89% ~ (p=0.089 n=10)
geomean 1.919m 1.783m -7.06%
Change-Id: I36106e1baf8afd132f1568748d1b83b797fa260e
Reviewed-on: https://go-review.googlesource.com/c/go/+/629415
Reviewed-by: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
|
|
The tri-state mutex implementation (unlocked, locked, sleeping) avoids
sleep/wake syscalls when contention is low or absent, but its
performance degrades when many threads are contending for a mutex to
execute a fast critical section.
A fast critical section means frequent unlock2 calls. Each of those
finds the mutex in the "sleeping" state and so wakes a sleeping thread,
even if many other threads are already awake and in the spin loop of
lock2 attempting to acquire the mutex for themselves. Many spinning
threads means wasting energy and CPU time that could be used by other
processes on the machine. Many threads all spinning on the same cache
line leads to performance collapse.
Merge the futex- and semaphore-based mutex implementations by using a
semaphore abstraction for futex platforms. Then, add a bit to the mutex
state word that communicates whether one of the waiting threads is awake
and spinning. When threads in lock2 see the new "spinning" bit, they can
sleep immediately. In unlock2, the "spinning" bit means we can save a
syscall and not wake a sleeping thread.
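The two decisions that the "spinning" bit enables can be sketched as pure
functions over a state word. The flag names and bit layout here are
hypothetical, chosen only for the example; the real lock2 and unlock2 do
considerably more.

```go
package main

import "fmt"

// Hypothetical flag layout for a mutex state word.
const (
	flagLocked   uint32 = 1 << iota // the mutex is held
	flagSleeping                    // at least one waiter is asleep
	flagSpinning                    // one waiter is awake and spinning
)

// waiterShouldSleep reports whether a thread that failed to acquire the
// mutex can go to sleep right away: the lock is held and a peer is already
// spinning on its behalf.
func waiterShouldSleep(state uint32) bool {
	return state&flagLocked != 0 && state&flagSpinning != 0
}

// unlockShouldWake reports whether the unlocking thread needs to pay for a
// wakeup: someone is asleep and nobody is already awake and spinning.
func unlockShouldWake(state uint32) bool {
	return state&flagSleeping != 0 && state&flagSpinning == 0
}

func main() {
	contended := flagLocked | flagSleeping | flagSpinning
	fmt.Println(waiterShouldSleep(contended), unlockShouldWake(contended))
	fmt.Println(unlockShouldWake(flagLocked | flagSleeping))
}
```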
This brings up the real possibility of starvation: waiting threads are
able to enter a deeper sleep than before, since one of their peers can
volunteer to be the sole "spinning" thread and thus cause unlock2 to
skip the semawakeup call. Additionally, the waiting threads form a LIFO
stack so any wakeups that do occur will target threads that have gone to
sleep most recently. Counteract those effects by periodically waking the
thread at the bottom of the stack and allowing it to spin.
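A simple form of that counter-measure is to pick the bottom of the stack on
every Nth wakeup. The sketch below assumes a count-based period; the
function name, the period value, and the use of a wakeup counter are all
hypothetical, and the runtime may decide when to wake the bottom thread by
other means.

```go
package main

import "fmt"

// tailWakePeriod is a hypothetical tuning value: one wakeup in every
// tailWakePeriod targets the oldest waiter instead of the newest.
const tailWakePeriod = 4

// pickWaiter returns the index of the waiter to wake in a stack of the
// given depth, where index depth-1 is the top (most recent sleeper) and
// index 0 is the bottom (longest sleeper). It returns -1 for an empty stack.
func pickWaiter(depth, wakeCount int) int {
	if depth == 0 {
		return -1
	}
	if wakeCount%tailWakePeriod == 0 {
		return 0 // give the longest-waiting thread a turn
	}
	return depth - 1 // default LIFO behavior
}

func main() {
	for n := 1; n <= 8; n++ {
		fmt.Print(pickWaiter(5, n), " ")
	}
	fmt.Println()
}
```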
Exempt sched.lock from most of the new behaviors; it's often used by
several threads in sequence to do thread-specific work, so low-latency
handoff is a priority over improved throughput.
Gate use of this implementation behind GOEXPERIMENT=spinbitmutex, so
it's easy to disable. Enable it by default on supported platforms (the
most efficient implementation requires atomic.Xchg8).
Fixes #68578
goos: linux
goarch: amd64
pkg: runtime
cpu: 13th Gen Intel(R) Core(TM) i7-13700H
│ old │ new │
│ sec/op │ sec/op vs base │
MutexContention 17.82n ± 0% 17.74n ± 0% -0.42% (p=0.000 n=10)
MutexContention-2 22.17n ± 9% 19.85n ± 12% ~ (p=0.089 n=10)
MutexContention-3 26.14n ± 14% 20.81n ± 13% -20.41% (p=0.000 n=10)
MutexContention-4 29.28n ± 8% 21.19n ± 10% -27.62% (p=0.000 n=10)
MutexContention-5 31.79n ± 2% 21.98n ± 10% -30.83% (p=0.000 n=10)
MutexContention-6 34.63n ± 1% 22.58n ± 5% -34.79% (p=0.000 n=10)
MutexContention-7 44.16n ± 2% 23.14n ± 7% -47.59% (p=0.000 n=10)
MutexContention-8 53.81n ± 3% 23.66n ± 6% -56.04% (p=0.000 n=10)
MutexContention-9 65.58n ± 4% 23.91n ± 9% -63.54% (p=0.000 n=10)
MutexContention-10 77.35n ± 3% 26.06n ± 9% -66.31% (p=0.000 n=10)
MutexContention-11 89.62n ± 1% 25.56n ± 9% -71.47% (p=0.000 n=10)
MutexContention-12 102.45n ± 2% 25.57n ± 7% -75.04% (p=0.000 n=10)
MutexContention-13 111.95n ± 1% 24.59n ± 8% -78.04% (p=0.000 n=10)
MutexContention-14 123.95n ± 3% 24.42n ± 6% -80.30% (p=0.000 n=10)
MutexContention-15 120.80n ± 10% 25.54n ± 6% -78.86% (p=0.000 n=10)
MutexContention-16 128.10n ± 25% 26.95n ± 4% -78.96% (p=0.000 n=10)
MutexContention-17 139.80n ± 18% 24.96n ± 5% -82.14% (p=0.000 n=10)
MutexContention-18 141.35n ± 7% 25.05n ± 8% -82.27% (p=0.000 n=10)
MutexContention-19 151.35n ± 18% 25.72n ± 6% -83.00% (p=0.000 n=10)
MutexContention-20 153.30n ± 20% 24.75n ± 6% -83.85% (p=0.000 n=10)
MutexHandoff/Solo-20 13.54n ± 1% 13.61n ± 4% ~ (p=0.206 n=10)
MutexHandoff/FastPingPong-20 141.3n ± 209% 164.8n ± 49% ~ (p=0.436 n=10)
MutexHandoff/SlowPingPong-20 1.572µ ± 16% 1.804µ ± 19% +14.76% (p=0.015 n=10)
geomean 74.34n 30.26n -59.30%
goos: darwin
goarch: arm64
pkg: runtime
cpu: Apple M1
│ old │ new │
│ sec/op │ sec/op vs base │
MutexContention 13.86n ± 3% 12.09n ± 3% -12.73% (p=0.000 n=10)
MutexContention-2 15.88n ± 1% 16.50n ± 2% +3.94% (p=0.001 n=10)
MutexContention-3 18.45n ± 2% 16.88n ± 2% -8.54% (p=0.000 n=10)
MutexContention-4 20.01n ± 2% 18.94n ± 18% ~ (p=0.469 n=10)
MutexContention-5 22.60n ± 1% 17.51n ± 9% -22.50% (p=0.000 n=10)
MutexContention-6 23.93n ± 2% 17.35n ± 2% -27.48% (p=0.000 n=10)
MutexContention-7 24.69n ± 1% 17.15n ± 3% -30.54% (p=0.000 n=10)
MutexContention-8 25.01n ± 1% 17.33n ± 2% -30.69% (p=0.000 n=10)
MutexHandoff/Solo-8 13.96n ± 4% 12.04n ± 4% -13.78% (p=0.000 n=10)
MutexHandoff/FastPingPong-8 68.89n ± 4% 64.62n ± 2% -6.20% (p=0.000 n=10)
MutexHandoff/SlowPingPong-8 9.698µ ± 22% 9.646µ ± 35% ~ (p=0.912 n=10)
geomean 38.20n 32.53n -14.84%
Change-Id: I0058c75eadf282d08eea7fce0d426f0518039f7c
Reviewed-on: https://go-review.googlesource.com/c/go/+/620435
Reviewed-by: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Junyang Shao <shaojunyang@google.com>
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
|