1 files changed, 258 insertions, 0 deletions
diff --git a/design/68578-mutex-spinbit.md b/design/68578-mutex-spinbit.md
new file mode 100644
index 0000000..85f9e93
--- /dev/null
+++ b/design/68578-mutex-spinbit.md
@@ -0,0 +1,258 @@
+# Proposal: Improve scalability of runtime.lock2
+
+Author(s): Rhys Hiltner
+
+Last updated: 2024-10-16
+
+Discussion at https://go.dev/issue/68578.
+
+## Abstract
+
+Improve multi-core scalability of the runtime's internal mutex implementation
+by minimizing wakeups of waiting threads.
+
+Avoiding wakeups of threads that are waiting for the lock allows those threads
+to sleep for longer.
+That reduces the number of concurrent threads that are attempting to read the
+mutex's state word.
+Fewer reads of that cache line mean less cache coherency traffic within the
+processor when a thread needs to make an update.
+Fast updates (to acquire and release the lock) even when many threads need the
+lock means better scalability.
+
+This is not an API change, so is not part of the formal proposal process.
+
+## Background
+
+One of the simplest mutex designs is a single bit that is "0" when unlocked or
+"1" when locked.
+To acquire the lock, a thread attempts to swap in a "1",
+looping until the result it gets is "0".
+To unlock, the thread swaps in a "0".
+
+The performance of such a spinlock is poor in at least two ways.
+First, threads that are trying to acquire an already-held lock waste their own
+on-CPU time.
+Second, those software threads execute on hardware resources that need a local
+copy of the mutex state word in cache.
+
+Having the state word in cache for read access requires it not be writeable by
+any other processors.
+Writing to that memory location requires the hardware to invalidate all cached
+copies of that memory, one in each processor that had loaded it for reading.
+The hardware-internal communication necessary to implement those guarantees
+has a cost, which appears as a slowdown when writing to that memory location.
+
+Go's current mutex design is several steps more advanced than the simple
+spinlock, but under certain conditions its performance can degrade in a similar way.
+First, when `runtime.lock2` is unable to immediately obtain the mutex it will
+pause for a moment before retrying, primarily using hardware-level delay
+instructions (such as `PAUSE` on 386 and amd64).
+Then, if it's unable to acquire the mutex after several retries it will ask
+the OS to put it to sleep until another thread requests a wakeup.
+On Linux, we use the `futex` syscall to sleep directly on the mutex address,
+implemented in src/runtime/lock_futex.go.
+On many other platforms (including Windows and macOS),the waiting threads
+form a LIFO stack with the mutex state word as a pointer to the top of the
+stack, implemented in src/runtime/lock_sema.go.
+
+When the `futex` syscall is available,
+the OS maintains a list of waiting threads and will choose which it wakes.
+Otherwise, the Go runtime maintains that list and names a specific thread
+when it asks the OS to do a wakeup.
+To avoid a `futex` syscall when there's no contention,
+we split the "locked" state into two variants:
+1 meaning "locked with no contention" and
+2 meaning "locked, and a thread may be asleep".
+(With the semaphore-based implementation,
+the Go runtime can--and must--know for itself whether a thread is asleep.)
+Go's mutex implementation has those three logical states
+(unlocked, locked, locked-with-sleepers) on all multi-threaded platforms.
+For the purposes of the Go runtime, I'm calling this design "tristate".
+
+After releasing the mutex,
+`runtime.unlock2` will wake a thread whenever one is sleeping.
+It does not consider whether one of the waiting threads is already awake.
+If a waiting thread is already awake, it's not necessary to wake another.
+
+Waking additional threads results in higher concurrent demand for the mutex
+state word's cache line.
+Every thread that is awake and spinning in a loop to reload the state word
+leads to more cache coherency traffic within the processor,
+and to slower writes to that cache line.
+
+Consider the case where many threads all need to use the same mutex many times
+in a row.
+Furthermore, consider that the critical section is short relative to the time
+it takes a thread to give up on spinning and go (back) to sleep.
+At the end of each critical section, the thread that is releasing the mutex
+will see that a waiting thread is asleep, and will wake it.
+It takes a relatively long time for a thread to decide to go to sleep,
+and there's a relatively short time until the next `runtime.unlock2` call will
+wake it.
+Many threads will be awake, all reloading the state word in a loop,
+all slowing down updates to its value.
+
+Without a limit on the number of threads that can spin on the state word,
+higher demand for a mutex value degrades its performance.
+
+See also https://go.dev/issue/68578.
+
+## Proposal
+
+Expand the mutex state word to include a new flag, "spinning".
+Use the "spinning" bit to communicate whether one of the waiting threads is
+awake and looping while trying to acquire the lock.
+Threads mutually exclude each other from the "spinning" state,
+but they won't block while attempting to acquire the bit.
+Only the thread that owns the "spinning" bit is allowed to reload the state
+word in a loop.
+It releases the "spinning" bit before going to sleep.
+The other waiting threads go directly to sleep.
+The thread that unlocks a mutex can avoid waking a thread if it sees that one
+is already awake and spinning.
+For the purposes of the Go runtime, I'm calling this design "spinbit".
+
+### futex-based option, https://go.dev/cl/601597
+
+I've prepared https://go.dev/cl/601597,
+which implements the "spinbit" design for GOOS=linux and GOARCH=amd64.
+I've prepared a matching [TLA+ model](./68578/spinbit.tla)
+to check for lost wakeups.
+(When relying on the `futex` syscall to maintain the list of sleeping Ms,
+it's easy to write lost-wakeup bugs.)
+
+It uses an atomic `Xchg8` operation on two different bytes of the mutex state
+word.
+The low byte records whether the mutex is locked,
+and whether one or more waiting Ms may be asleep.
+The "spinning" flag is in a separate byte and so can be independently
+manipulated with atomic `Xchg8` operations.
+The two bytes are within a single uintptr field (`runtime.mutex.key`).
+When the spinning M attempts to acquire the lock,
+it can do a CAS on the entire state word,
+setting the "locked" flag and clearing the "spinning" flag
+in a single operation.
+
+### Cross-OS option, https://go.dev/cl/620435
+
+I've also prepared https://go.dev/cl/620435 which unifies the lock_sema.go and
+lock_futex.go implementations and so supports all GOOS values for which Go
+supports multiple threads.
+(It uses `futex` to implement the `runtime.sema{create,sleep,wakeup}`
+functions for lock_futex.go platforms.)
+Go's development branch now includes `Xchg8` support for
+GOARCH=amd64,arm64,ppc64,ppc64le,
+and so that CL supports all of those architectures.
+
+The fast path for `runtime.lock2` and `runtime.unlock2` use `Xchg8` operations
+to interact with the "locked" flag.
+The lowest byte of the state word is dedicated to use with those `Xchg8`
+operations.
+Most of the upper bytes hold a partial pointer to an M.
+(The `runtime.m` datastructure is large enough to allow reconstructing the low
+bits from the partial pointer,
+with special handling for the non-heap-allocated `runtime.m0` value.)
+Beyond the 8 bits needed for use with `Xchg8`,
+a few more low bits are available for use as flags.
+One of those bits holds the "spinning" flag,
+which is manipulated with pointer-length `Load` and `CAS` operations.
+
+When Ms go to sleep they form a LIFO stack linked via `runtime.m.nextwaitm`
+pointers, as lock_sema.go does today.
+The list of waiting Ms is a multi-producer, single-consumer stack.
+Each M can add itself,
+but inspecting or removing Ms requires exclusive access.
+Today, lock_sema.go's `runtime.unlock2` uses the mutex itself to control that
+ownership.
+That puts any use of the sleeping M list in the critical path of the mutex.
+
+My proposal uses another bit of the state word as a try-lock to control
+inspecting and removing Ms from the list.
+This allows additional list-management code without slowing the critical path
+of a busy mutex, and use of efficient `Xchg8` operations in the fast paths.
+We'll need access to the list in order to attribute contention delay to the
+right critical section in the [mutex profile](https://go.dev/issue/66999).
+Access to the list will also let us periodically wake an M even when it's not
+strictly necessary, to combat tail latency that may be introduced by the
+reduction in wakeups.
+
+Here's the full layout of the `runtime.mutex.key` state word:
+Bit 0 holds the "locked" flag, the primary purpose of the mutex.
+Bit 1 is the "sleeping" flag, and is set when the upper bits point to an M.
+Bits 2 through 7 are unused, since they're lost with every `Xchg8` operation.
+Bit 8 holds the "spinning" try-lock, allowing the holder to reload the state
+word in a loop.
+Bit 9 holds the "stack" try-lock, allowing the holder to inspect and remove
+sleeping Ms from the LIFO stack.
+Bits 10 and higher of the state word hold bits 10 and higher of a pointer to
+the M at the top of the LIFO stack of sleeping waiters.
+
+## Rationale
+
+The status quo is a `runtime.lock2` implementation that experiences congestion
+collapse under high contention on machines with many hardware threads.
+Addressing that will require fewer threads loading the same cache line in a
+loop.
+
+The literature presents several options for scalable / non-collapsing mutexes.
+Some require an additional memory footprint for each mutex in proportion to
+the number of threads that may seek to acquire the lock.
+Some require threads to store a reference to a value that they will use to
+release each lock they hold.
+Go includes a `runtime.mutex` as part of every `chan`, and in some
+applications those values are the ones with the most contention.
+Coupled with `select`, there's no limit to the number of mutexes that an M can
+hold.
+That means neither of those forms of increased memory footprint is acceptable.
+
+The performance of fully uncontended `runtime.lock2`/`runtime.unlock2` pairs
+is also important to the project.
+That limits the use of many of the literature's proposed locking algorithms,
+if they include FIFO queue handoff behavior.
+On my test hardware
+(a linux/amd64 machine with i7-13700H, and a darwin/arm64 M1),
+a `runtime.mutex` value with zero or moderate contention can support
+50,000,000 uses per second on any threads,
+or can move between active threads 10,000,000 times per second,
+or can move between inactive threads (with sleep mediated by the OS)
+about 40,000 to 500,000 times per second (depending on the OS).
+Some amount of capture or barging, rather than queueing, is required to
+maintain the level of throughput that Go users have come to expect.
+
+Keeping the size of `runtime.mutex` values as they are today but allowing
+threads to sleep with fewer interruptions seems like fulfilling the goal of
+the original design.
+The main disadvantage I know of is the risk of increased tail latency:
+A small set of threads may be able to capture a contended mutex,
+passing it back and forth among themselves while the other threads sleep
+indefinitely.
+That's already a risk of the current lock_sema.go implementation,
+but the high volume of wakeups means threads are unlikely to sleep for long,
+and the list of sleeping threads may regularly dip to zero.
+
+The "cross-OS" option has an edge here:
+with it, the Go runtime maintains an explicit list of sleeping Ms and so can do
+targeted wakes or even direct handoffs to reduce starvation.
+
+## Compatibility
+
+There is no change in exported APIs.
+
+## Implementation
+
+I've prepared two options for the Go 1.24 release cycle.
+One relies on the `futex` syscall and the `Xchg8` operation, and so initially
+supports GOOS=linux and GOARCH=amd64: https://go.dev/cl/601597.
+The other relies on only the `Xchg8` operation and works with any GOOS value
+that supports threads: https://go.dev/cl/620435.
+Both are controlled by `GOEXPERIMENT=spinbitmutex`,
+enabled by default on supported platforms.
+
+## Open issues (if applicable)
+
+I appreciate feedback on the balance between simplicity,
+performance at zero or low contention,
+performance under extreme contention,
+both the performance and maintenance burden for non-first-class ports,
+and the accuracy of contention profiles.