diff options
Diffstat (limited to 'design/68578-mutex-spinbit.md')
| -rw-r--r-- | design/68578-mutex-spinbit.md | 258 |
1 files changed, 258 insertions, 0 deletions
diff --git a/design/68578-mutex-spinbit.md b/design/68578-mutex-spinbit.md new file mode 100644 index 0000000..85f9e93 --- /dev/null +++ b/design/68578-mutex-spinbit.md @@ -0,0 +1,258 @@ +# Proposal: Improve scalability of runtime.lock2 + +Author(s): Rhys Hiltner + +Last updated: 2024-10-16 + +Discussion at https://go.dev/issue/68578. + +## Abstract + +Improve multi-core scalability of the runtime's internal mutex implementation +by minimizing wakeups of waiting threads. + +Avoiding wakeups of threads that are waiting for the lock allows those threads +to sleep for longer. +That reduces the number of concurrent threads that are attempting to read the +mutex's state word. +Fewer reads of that cache line mean less cache coherency traffic within the +processor when a thread needs to make an update. +Fast updates (to acquire and release the lock) even when many threads need the +lock means better scalability. + +This is not an API change, so is not part of the formal proposal process. + +## Background + +One of the simplest mutex designs is a single bit that is "0" when unlocked or +"1" when locked. +To acquire the lock, a thread attempts to swap in a "1", +looping until the result it gets is "0". +To unlock, the thread swaps in a "0". + +The performance of such a spinlock is poor in at least two ways. +First, threads that are trying to acquire an already-held lock waste their own +on-CPU time. +Second, those software threads execute on hardware resources that need a local +copy of the mutex state word in cache. + +Having the state word in cache for read access requires it not be writeable by +any other processors. +Writing to that memory location requires the hardware to invalidate all cached +copies of that memory, one in each processor that had loaded it for reading. +The hardware-internal communication necessary to implement those guarantees +has a cost, which appears as a slowdown when writing to that memory location. + +Go's current mutex design is several steps more advanced than the simple +spinlock, but under certain conditions its performance can degrade in a similar way. +First, when `runtime.lock2` is unable to immediately obtain the mutex it will +pause for a moment before retrying, primarily using hardware-level delay +instructions (such as `PAUSE` on 386 and amd64). +Then, if it's unable to acquire the mutex after several retries it will ask +the OS to put it to sleep until another thread requests a wakeup. +On Linux, we use the `futex` syscall to sleep directly on the mutex address, +implemented in src/runtime/lock_futex.go. +On many other platforms (including Windows and macOS),the waiting threads +form a LIFO stack with the mutex state word as a pointer to the top of the +stack, implemented in src/runtime/lock_sema.go. + +When the `futex` syscall is available, +the OS maintains a list of waiting threads and will choose which it wakes. +Otherwise, the Go runtime maintains that list and names a specific thread +when it asks the OS to do a wakeup. +To avoid a `futex` syscall when there's no contention, +we split the "locked" state into two variants: +1 meaning "locked with no contention" and +2 meaning "locked, and a thread may be asleep". +(With the semaphore-based implementation, +the Go runtime can--and must--know for itself whether a thread is asleep.) +Go's mutex implementation has those three logical states +(unlocked, locked, locked-with-sleepers) on all multi-threaded platforms. +For the purposes of the Go runtime, I'm calling this design "tristate". + +After releasing the mutex, +`runtime.unlock2` will wake a thread whenever one is sleeping. +It does not consider whether one of the waiting threads is already awake. +If a waiting thread is already awake, it's not necessary to wake another. + +Waking additional threads results in higher concurrent demand for the mutex +state word's cache line. +Every thread that is awake and spinning in a loop to reload the state word +leads to more cache coherency traffic within the processor, +and to slower writes to that cache line. + +Consider the case where many threads all need to use the same mutex many times +in a row. +Furthermore, consider that the critical section is short relative to the time +it takes a thread to give up on spinning and go (back) to sleep. +At the end of each critical section, the thread that is releasing the mutex +will see that a waiting thread is asleep, and will wake it. +It takes a relatively long time for a thread to decide to go to sleep, +and there's a relatively short time until the next `runtime.unlock2` call will +wake it. +Many threads will be awake, all reloading the state word in a loop, +all slowing down updates to its value. + +Without a limit on the number of threads that can spin on the state word, +higher demand for a mutex value degrades its performance. + +See also https://go.dev/issue/68578. + +## Proposal + +Expand the mutex state word to include a new flag, "spinning". +Use the "spinning" bit to communicate whether one of the waiting threads is +awake and looping while trying to acquire the lock. +Threads mutually exclude each other from the "spinning" state, +but they won't block while attempting to acquire the bit. +Only the thread that owns the "spinning" bit is allowed to reload the state +word in a loop. +It releases the "spinning" bit before going to sleep. +The other waiting threads go directly to sleep. +The thread that unlocks a mutex can avoid waking a thread if it sees that one +is already awake and spinning. +For the purposes of the Go runtime, I'm calling this design "spinbit". + +### futex-based option, https://go.dev/cl/601597 + +I've prepared https://go.dev/cl/601597, +which implements the "spinbit" design for GOOS=linux and GOARCH=amd64. +I've prepared a matching [TLA+ model](./68578/spinbit.tla) +to check for lost wakeups. +(When relying on the `futex` syscall to maintain the list of sleeping Ms, +it's easy to write lost-wakeup bugs.) + +It uses an atomic `Xchg8` operation on two different bytes of the mutex state +word. +The low byte records whether the mutex is locked, +and whether one or more waiting Ms may be asleep. +The "spinning" flag is in a separate byte and so can be independently +manipulated with atomic `Xchg8` operations. +The two bytes are within a single uintptr field (`runtime.mutex.key`). +When the spinning M attempts to acquire the lock, +it can do a CAS on the entire state word, +setting the "locked" flag and clearing the "spinning" flag +in a single operation. + +### Cross-OS option, https://go.dev/cl/620435 + +I've also prepared https://go.dev/cl/620435 which unifies the lock_sema.go and +lock_futex.go implementations and so supports all GOOS values for which Go +supports multiple threads. +(It uses `futex` to implement the `runtime.sema{create,sleep,wakeup}` +functions for lock_futex.go platforms.) +Go's development branch now includes `Xchg8` support for +GOARCH=amd64,arm64,ppc64,ppc64le, +and so that CL supports all of those architectures. + +The fast path for `runtime.lock2` and `runtime.unlock2` use `Xchg8` operations +to interact with the "locked" flag. +The lowest byte of the state word is dedicated to use with those `Xchg8` +operations. +Most of the upper bytes hold a partial pointer to an M. +(The `runtime.m` datastructure is large enough to allow reconstructing the low +bits from the partial pointer, +with special handling for the non-heap-allocated `runtime.m0` value.) +Beyond the 8 bits needed for use with `Xchg8`, +a few more low bits are available for use as flags. +One of those bits holds the "spinning" flag, +which is manipulated with pointer-length `Load` and `CAS` operations. + +When Ms go to sleep they form a LIFO stack linked via `runtime.m.nextwaitm` +pointers, as lock_sema.go does today. +The list of waiting Ms is a multi-producer, single-consumer stack. +Each M can add itself, +but inspecting or removing Ms requires exclusive access. +Today, lock_sema.go's `runtime.unlock2` uses the mutex itself to control that +ownership. +That puts any use of the sleeping M list in the critical path of the mutex. + +My proposal uses another bit of the state word as a try-lock to control +inspecting and removing Ms from the list. +This allows additional list-management code without slowing the critical path +of a busy mutex, and use of efficient `Xchg8` operations in the fast paths. +We'll need access to the list in order to attribute contention delay to the +right critical section in the [mutex profile](https://go.dev/issue/66999). +Access to the list will also let us periodically wake an M even when it's not +strictly necessary, to combat tail latency that may be introduced by the +reduction in wakeups. + +Here's the full layout of the `runtime.mutex.key` state word: +Bit 0 holds the "locked" flag, the primary purpose of the mutex. +Bit 1 is the "sleeping" flag, and is set when the upper bits point to an M. +Bits 2 through 7 are unused, since they're lost with every `Xchg8` operation. +Bit 8 holds the "spinning" try-lock, allowing the holder to reload the state +word in a loop. +Bit 9 holds the "stack" try-lock, allowing the holder to inspect and remove +sleeping Ms from the LIFO stack. +Bits 10 and higher of the state word hold bits 10 and higher of a pointer to +the M at the top of the LIFO stack of sleeping waiters. + +## Rationale + +The status quo is a `runtime.lock2` implementation that experiences congestion +collapse under high contention on machines with many hardware threads. +Addressing that will require fewer threads loading the same cache line in a +loop. + +The literature presents several options for scalable / non-collapsing mutexes. +Some require an additional memory footprint for each mutex in proportion to +the number of threads that may seek to acquire the lock. +Some require threads to store a reference to a value that they will use to +release each lock they hold. +Go includes a `runtime.mutex` as part of every `chan`, and in some +applications those values are the ones with the most contention. +Coupled with `select`, there's no limit to the number of mutexes that an M can +hold. +That means neither of those forms of increased memory footprint is acceptable. + +The performance of fully uncontended `runtime.lock2`/`runtime.unlock2` pairs +is also important to the project. +That limits the use of many of the literature's proposed locking algorithms, +if they include FIFO queue handoff behavior. +On my test hardware +(a linux/amd64 machine with i7-13700H, and a darwin/arm64 M1), +a `runtime.mutex` value with zero or moderate contention can support +50,000,000 uses per second on any threads, +or can move between active threads 10,000,000 times per second, +or can move between inactive threads (with sleep mediated by the OS) +about 40,000 to 500,000 times per second (depending on the OS). +Some amount of capture or barging, rather than queueing, is required to +maintain the level of throughput that Go users have come to expect. + +Keeping the size of `runtime.mutex` values as they are today but allowing +threads to sleep with fewer interruptions seems like fulfilling the goal of +the original design. +The main disadvantage I know of is the risk of increased tail latency: +A small set of threads may be able to capture a contended mutex, +passing it back and forth among themselves while the other threads sleep +indefinitely. +That's already a risk of the current lock_sema.go implementation, +but the high volume of wakeups means threads are unlikely to sleep for long, +and the list of sleeping threads may regularly dip to zero. + +The "cross-OS" option has an edge here: +with it, the Go runtime maintains an explicit list of sleeping Ms and so can do +targeted wakes or even direct handoffs to reduce starvation. + +## Compatibility + +There is no change in exported APIs. + +## Implementation + +I've prepared two options for the Go 1.24 release cycle. +One relies on the `futex` syscall and the `Xchg8` operation, and so initially +supports GOOS=linux and GOARCH=amd64: https://go.dev/cl/601597. +The other relies on only the `Xchg8` operation and works with any GOOS value +that supports threads: https://go.dev/cl/620435. +Both are controlled by `GOEXPERIMENT=spinbitmutex`, +enabled by default on supported platforms. + +## Open issues (if applicable) + +I appreciate feedback on the balance between simplicity, +performance at zero or low contention, +performance under extreme contention, +both the performance and maintenance burden for non-first-class ports, +and the accuracy of contention profiles. |
