Change-Id: Ide4daa5eee3fd4f3007d6ef23aa84b8916562c39
Reviewed-on: https://go-review.googlesource.com/c/go/+/684457
Reviewed-by: Cherry Mui <cherryyz@google.com>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
We associate WaitGroups with synctest bubbles by attaching a
special to the WaitGroup. It is not possible to attach a special
to a linker-allocated value, such as:
var wg sync.WaitGroup
Avoid panicking when accessing a linker-allocated WaitGroup
from a bubble. We have no way to associate these WaitGroups
with a bubble, so just treat them as always unbubbled.
This is probably fine, since the WaitGroup was always
created outside the bubble in this case.
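As an illustration, a sketch of the now-working scenario (assuming the
Go 1.25 testing/synctest API; names are illustrative):
package example

import (
	"sync"
	"testing"
	"testing/synctest"
)

var wg sync.WaitGroup // linker-allocated: no special can be attached

func TestGlobalWaitGroup(t *testing.T) {
	synctest.Test(t, func(t *testing.T) {
		// Using the global WaitGroup inside the bubble previously
		// panicked; it is now treated as always unbubbled.
		wg.Add(1)
		go wg.Done()
		wg.Wait()
	})
}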
Fixes #74005
Change-Id: Ic71514b0b8d0cecd62e45cc929ffcbeb16f54a55
Reviewed-on: https://go-review.googlesource.com/c/go/+/679695
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Auto-Submit: Damien Neil <dneil@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
Moving to a smaller package allows its use in other internal/runtime
packages.
This isn't internal/strconvlite, since strconv can't use the new package
directly.
For #73193.
Change-Id: I6a6a636c9c8b3f06b5fd6c07fe9dd5a7a37d1429
Reviewed-on: https://go-review.googlesource.com/c/go/+/672697
Reviewed-by: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Michael Pratt <mpratt@google.com>
|
|
This change splits the finalizer and cleanup queues and implements a new
lock-free blocking queue for cleanups. The basic design is as follows:
The cleanup queue is organized in fixed-sized blocks. Individual cleanup
functions are queued, but only whole blocks are dequeued.
Enqueuing cleanups places them in P-local cleanup blocks. These are
flushed to the full list as they get full. Cleanups can only be enqueued
by an active sweeper.
Dequeuing cleanups always dequeues entire blocks from the full list.
Cleanup blocks can be dequeued and executed at any time.
The very last active sweeper in the sweep phase is responsible for
flushing all local cleanup blocks to the full list. It can do this
without any synchronization because the next GC can't start yet, so we
can be very certain that nobody else will be accessing the local blocks.
Cleanup blocks are stored off-heap because they need to be allocated by
the sweeper, which is called from heap allocation paths. As a result,
the GC treats cleanup blocks as roots, just like finalizer blocks.
Flushes to the full list signal to the scheduler that cleanup goroutines
should be awoken. Every time the scheduler goes to wake a cleanup
goroutine and finds more signals than goroutines to wake, it forwards
the extra signal to runtime.AddCleanup, which creates another cleanup
goroutine the next time it is called, up to gomaxprocs goroutines.
The signals here are a little convoluted, but exist because the sweeper
and the scheduler cannot safely create new goroutines.
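A minimal sketch of this structure (illustrative names and sizes, not
the runtime's actual types); the full list behaves like a lock-free
Treiber stack of whole blocks, which Go's GC keeps safe from ABA:
package cleanupq

import "sync/atomic"

const blockCap = 256 // assumed fixed block size

type cleanupBlock struct {
	next *cleanupBlock
	n    int
	fns  [blockCap]func()
}

type cleanupQueue struct {
	full atomic.Pointer[cleanupBlock] // list of full blocks
}

// flush pushes a P-local block that has filled up onto the full list.
func (q *cleanupQueue) flush(b *cleanupBlock) {
	for {
		head := q.full.Load()
		b.next = head
		if q.full.CompareAndSwap(head, b) {
			return
		}
	}
}

// dequeue pops an entire block; a cleanup goroutine then runs its fns.
func (q *cleanupQueue) dequeue() *cleanupBlock {
	for {
		head := q.full.Load()
		if head == nil {
			return nil
		}
		if q.full.CompareAndSwap(head, head.next) {
			return head
		}
	}
}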
For #71772.
For #71825.
Change-Id: Ie839fde2b67e1b79ac1426be0ea29a8d923a62cc
Reviewed-on: https://go-review.googlesource.com/c/go/+/650697
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
|
|
Have the test use the same clock (cputicks) as the profiler, and use the
test's own measurements as hard bounds on the magnitude to expect in the
profile.
Compare the depiction of two users of the same lock: one where the
critical section is fast, one where it is slow. Confirm that the profile
shows the slow critical section as a large source of delay (with #66999
fixed), rather than showing the fast critical section as a large
recipient of delay.
Previously reviewed as https://go.dev/cl/586237.
For #66999
Change-Id: Ic2d78cc29153d5322577d84abdc448e95ed8f594
Reviewed-on: https://go-review.googlesource.com/c/go/+/667616
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
|
|
Our current parallel mark algorithm suffers from frequent stalls on
memory since its access pattern is essentially random. Small objects
are the worst offenders, since each one forces pulling in at least one
full cache line to access even when the amount to be scanned is far
smaller than that. Each object also requires an independent access to
per-object metadata.
The purpose of this change is to improve garbage collector performance
by scanning small objects in batches to obtain better cache locality
than our current approach. The core idea behind this change is to defer
marking and scanning small objects, and then scan them in batches
localized to a span.
This change adds scanned bits to each span of small objects (<=512
bytes), in addition to mark bits. A set scanned bit indicates that the
corresponding object has been scanned. (One way to think of them is
"grey" bits and "black" bits
in the tri-color mark-sweep abstraction.) Each of these spans is always
8 KiB and if they contain pointers, the pointer/scalar data is already
packed together at the end of the span, allowing us to further optimize
the mark algorithm for this specific case.
When the GC encounters a pointer, it first checks if it points into a
small object span. If so, it is first marked in the mark bits, and then
the object is queued on a work-stealing P-local queue. This object
represents the whole span, and we ensure that a span can only appear at
most once in any queue by maintaining an atomic ownership bit for each
span. Later, when the pointer is dequeued, we scan every object with a
set mark that doesn't have a corresponding scanned bit. If it turns out
that was the only object in the mark bits since the last time we scanned
the span, we scan just that object directly, essentially falling back to
the existing algorithm. noscan objects have no scan work, so they are
never queued.
Each span's mark and scanned bits are co-located together at the end of
the span. Since the span is always 8 KiB in size, it can be found with
simple pointer arithmetic. Next to the marks and scans we also store the
size class, eliminating the need to access the span's mspan altogether.
The work-stealing P-local queue is a new source of GC work. If this
queue gets full, half of it is dumped to a global linked list of spans
to scan. The regular scan queues are always prioritized over this queue
to allow time for darts to accumulate. Stealing work from other Ps is a
last resort.
This change also adds a new debug mode under GODEBUG=gctrace=2 that
dumps whole-span scanning statistics by size class on every GC cycle.
A future extension to this CL is to use SIMD-accelerated scanning
kernels for scanning spans with high mark bit density.
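A rough sketch of the per-span metadata and the mark/queue/scan flow
described above (invented names; the release of the ownership bit is
simplified relative to the real synchronization):
package greentea

import (
	"math/bits"
	"sync/atomic"
)

const objsPerSpan = 1024 // e.g. an 8 KiB span of 8-byte objects

type spanBits struct {
	marks   [objsPerSpan / 64]atomic.Uint64 // "grey" bits
	scanned [objsPerSpan / 64]uint64        // "black" bits, owner-only
	queued  atomic.Bool                     // in at most one queue
}

// markAndMaybeQueue sets obj's mark bit and, on the first mark, queues
// the span, using the ownership bit to dedupe queue entries.
func (s *spanBits) markAndMaybeQueue(obj int, queue func(*spanBits)) {
	w, mask := obj/64, uint64(1)<<(obj%64)
	if s.marks[w].Or(mask)&mask != 0 {
		return // already marked
	}
	if s.queued.CompareAndSwap(false, true) {
		queue(s)
	}
}

// scanSpan scans every object that is marked but not yet scanned.
func (s *spanBits) scanSpan(scanObj func(int)) {
	for w := range s.marks {
		pending := s.marks[w].Load() &^ s.scanned[w]
		s.scanned[w] |= pending
		for pending != 0 {
			scanObj(w*64 + bits.TrailingZeros64(pending))
			pending &= pending - 1 // clear lowest set bit
		}
	}
	s.queued.Store(false) // let new marks queue the span again
}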
For #19112. (Gated behind GOEXPERIMENT=greenteagc.)
For #73581.
Change-Id: I4bbb4e36f376950a53e61aaaae157ce842c341bc
Reviewed-on: https://go-review.googlesource.com/c/go/+/658036
Auto-Submit: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
Currently we assume alignment to 8 bytes, so we can steal the low 3 bits.
This CL assumes alignment to 512 bytes, so we can steal the low 9 bits.
That's 6 extra bits!
Aligning to 512 bytes wastes a bit of space but it is not egregious.
Most of the objects that we make tagged pointers to are pretty big.
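Schematically (a sketch, not the runtime's actual taggedPointer code):
const tagBits = 9 // 512-byte alignment frees the low 9 bits
const tagMask = 1<<tagBits - 1

func pack(p, tag uintptr) uintptr { return p | tag&tagMask }
func ptrOf(t uintptr) uintptr     { return t &^ tagMask }
func tagOf(t uintptr) uintptr     { return t & tagMask }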
Update #49405
Change-Id: I66fc7784ac1be5f12f285de1d7851d5a6871fb75
Reviewed-on: https://go-review.googlesource.com/c/go/+/665815
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Auto-Submit: Keith Randall <khr@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
We will want to reference these definitions from new generator programs,
and this is a good opportunity to clean up all these old C-style names.
Change-Id: Ifb06f0afc381e2697e7877f038eca786610c96de
Reviewed-on: https://go-review.googlesource.com/c/go/+/655275
Auto-Submit: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
|
|
Leverage the prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, ...) API to name
the anonymous memory areas.
This API was introduced in Linux 5.17 to decorate the anonymous
memory areas shown in /proc/<pid>/maps.
This is already used by glibc. See:
* https://sourceware.org/git/?p=glibc.git;a=blob;f=malloc/malloc.c;h=27dfd1eb907f4615b70c70237c42c552bb4f26a8;hb=HEAD#l2434
* https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/setvmaname.c;h=ea93a5ffbebc9e5a7e32a297138f465724b4725f;hb=HEAD#l63
This can be useful when investigating the memory consumption of a
multi-language program.
For a 100% Go program, the pprof profiler can be used to profile memory
consumption. But pprof is only aware of what happens within the Go
world.
In a multi-language program, there can be doubt about whether suspicious
extra memory consumption comes from the Go part or the native part.
With this change, the following Go program:
package main

import (
	"fmt"
	"log"
	"os"
)

/*
#include <stdlib.h>

void f(void)
{
	(void)malloc(1024*1024*1024);
}
*/
import "C"

func main() {
	C.f()
	data, err := os.ReadFile("/proc/self/maps")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(data))
}
produces this output:
$ GLIBC_TUNABLES=glibc.mem.decorate_maps=1 ~/doc/devel/open-source/go/bin/go run .
00400000-00402000 r--p 00000000 00:21 28451768 /home/lenaic/.cache/go-build/9f/9f25a17baed5a80d03eb080a2ce2a5ff49c17f9a56e28330f0474a2bb74a30a0-d/test_vma_name
00402000-004a4000 r-xp 00002000 00:21 28451768 /home/lenaic/.cache/go-build/9f/9f25a17baed5a80d03eb080a2ce2a5ff49c17f9a56e28330f0474a2bb74a30a0-d/test_vma_name
004a4000-00574000 r--p 000a4000 00:21 28451768 /home/lenaic/.cache/go-build/9f/9f25a17baed5a80d03eb080a2ce2a5ff49c17f9a56e28330f0474a2bb74a30a0-d/test_vma_name
00574000-00575000 r--p 00173000 00:21 28451768 /home/lenaic/.cache/go-build/9f/9f25a17baed5a80d03eb080a2ce2a5ff49c17f9a56e28330f0474a2bb74a30a0-d/test_vma_name
00575000-00580000 rw-p 00174000 00:21 28451768 /home/lenaic/.cache/go-build/9f/9f25a17baed5a80d03eb080a2ce2a5ff49c17f9a56e28330f0474a2bb74a30a0-d/test_vma_name
00580000-005a4000 rw-p 00000000 00:00 0
2e075000-2e096000 rw-p 00000000 00:00 0 [heap]
c000000000-c000400000 rw-p 00000000 00:00 0 [anon: Go: heap]
c000400000-c004000000 ---p 00000000 00:00 0 [anon: Go: heap reservation]
777f40000000-777f40021000 rw-p 00000000 00:00 0 [anon: glibc: malloc arena]
777f40021000-777f44000000 ---p 00000000 00:00 0
777f44000000-777f44021000 rw-p 00000000 00:00 0 [anon: glibc: malloc arena]
777f44021000-777f48000000 ---p 00000000 00:00 0
777f48000000-777f48021000 rw-p 00000000 00:00 0 [anon: glibc: malloc arena]
777f48021000-777f4c000000 ---p 00000000 00:00 0
777f4c000000-777f4c021000 rw-p 00000000 00:00 0 [anon: glibc: malloc arena]
777f4c021000-777f50000000 ---p 00000000 00:00 0
777f50000000-777f50021000 rw-p 00000000 00:00 0 [anon: glibc: malloc arena]
777f50021000-777f54000000 ---p 00000000 00:00 0
777f55afb000-777f55afc000 ---p 00000000 00:00 0
777f55afc000-777f562fc000 rw-p 00000000 00:00 0 [anon: glibc: pthread stack: 216378]
777f562fc000-777f562fd000 ---p 00000000 00:00 0
777f562fd000-777f56afd000 rw-p 00000000 00:00 0 [anon: glibc: pthread stack: 216377]
777f56afd000-777f56afe000 ---p 00000000 00:00 0
777f56afe000-777f572fe000 rw-p 00000000 00:00 0 [anon: glibc: pthread stack: 216376]
777f572fe000-777f572ff000 ---p 00000000 00:00 0
777f572ff000-777f57aff000 rw-p 00000000 00:00 0 [anon: glibc: pthread stack: 216375]
777f57aff000-777f57b00000 ---p 00000000 00:00 0
777f57b00000-777f58300000 rw-p 00000000 00:00 0 [anon: glibc: pthread stack: 216374]
777f58300000-777f58400000 rw-p 00000000 00:00 0 [anon: Go: page alloc index]
777f58400000-777f5a400000 rw-p 00000000 00:00 0 [anon: Go: heap index]
777f5a400000-777f6a580000 ---p 00000000 00:00 0 [anon: Go: scavenge index]
777f6a580000-777f6a581000 rw-p 00000000 00:00 0 [anon: Go: scavenge index]
777f6a581000-777f7a400000 ---p 00000000 00:00 0 [anon: Go: scavenge index]
777f7a400000-777f8a580000 ---p 00000000 00:00 0 [anon: Go: page summary]
777f8a580000-777f8a581000 rw-p 00000000 00:00 0 [anon: Go: page alloc]
777f8a581000-777f9c430000 ---p 00000000 00:00 0 [anon: Go: page summary]
777f9c430000-777f9c431000 rw-p 00000000 00:00 0 [anon: Go: page alloc]
777f9c431000-777f9e806000 ---p 00000000 00:00 0 [anon: Go: page summary]
777f9e806000-777f9e807000 rw-p 00000000 00:00 0 [anon: Go: page alloc]
777f9e807000-777f9ec00000 ---p 00000000 00:00 0 [anon: Go: page summary]
777f9ec36000-777f9ecb6000 rw-p 00000000 00:00 0 [anon: Go: immortal metadata]
777f9ecb6000-777f9ecc6000 rw-p 00000000 00:00 0 [anon: Go: gc bits]
777f9ecc6000-777f9ecd6000 rw-p 00000000 00:00 0 [anon: Go: allspans array]
777f9ecd6000-777f9ece7000 rw-p 00000000 00:00 0 [anon: Go: immortal metadata]
777f9ece7000-777f9ed67000 ---p 00000000 00:00 0 [anon: Go: page summary]
777f9ed67000-777f9ed68000 rw-p 00000000 00:00 0 [anon: Go: page alloc]
777f9ed68000-777f9ede7000 ---p 00000000 00:00 0 [anon: Go: page summary]
777f9ede7000-777f9ee07000 rw-p 00000000 00:00 0 [anon: Go: page alloc]
777f9ee07000-777f9ee0a000 rw-p 00000000 00:00 0 [anon: glibc: loader malloc]
777f9ee0a000-777f9ee2e000 r--p 00000000 00:21 48158213 /usr/lib/libc.so.6
777f9ee2e000-777f9ef9f000 r-xp 00024000 00:21 48158213 /usr/lib/libc.so.6
777f9ef9f000-777f9efee000 r--p 00195000 00:21 48158213 /usr/lib/libc.so.6
777f9efee000-777f9eff2000 r--p 001e3000 00:21 48158213 /usr/lib/libc.so.6
777f9eff2000-777f9eff4000 rw-p 001e7000 00:21 48158213 /usr/lib/libc.so.6
777f9eff4000-777f9effc000 rw-p 00000000 00:00 0
777f9effc000-777f9effe000 rw-p 00000000 00:00 0 [anon: glibc: loader malloc]
777f9f00a000-777f9f04a000 rw-p 00000000 00:00 0 [anon: Go: immortal metadata]
777f9f04a000-777f9f04c000 r--p 00000000 00:00 0 [vvar]
777f9f04c000-777f9f04e000 r--p 00000000 00:00 0 [vvar_vclock]
777f9f04e000-777f9f050000 r-xp 00000000 00:00 0 [vdso]
777f9f050000-777f9f051000 r--p 00000000 00:21 48158204 /usr/lib/ld-linux-x86-64.so.2
777f9f051000-777f9f07a000 r-xp 00001000 00:21 48158204 /usr/lib/ld-linux-x86-64.so.2
777f9f07a000-777f9f085000 r--p 0002a000 00:21 48158204 /usr/lib/ld-linux-x86-64.so.2
777f9f085000-777f9f087000 r--p 00034000 00:21 48158204 /usr/lib/ld-linux-x86-64.so.2
777f9f087000-777f9f088000 rw-p 00036000 00:21 48158204 /usr/lib/ld-linux-x86-64.so.2
777f9f088000-777f9f089000 rw-p 00000000 00:00 0
7ffc7bfa7000-7ffc7bfc8000 rw-p 00000000 00:00 0 [stack]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0 [vsyscall]
The anonymous memory areas are now labelled, so we can see which ones
were allocated by the Go runtime and which ones were allocated by
glibc.
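A user-space sketch of the call involved (constants per
<linux/prctl.h>; golang.org/x/sys/unix used for brevity, and failures
are ignorable since the labels are purely cosmetic):
package vmaname

import (
	"unsafe"

	"golang.org/x/sys/unix"
)

const (
	prSetVMA         = 0x53564d41 // PR_SET_VMA ("SVMA")
	prSetVMAAnonName = 0          // PR_SET_VMA_ANON_NAME
)

// nameAnonRegion labels [addr, addr+length) in /proc/self/maps.
// It fails harmlessly on kernels older than 5.17.
func nameAnonRegion(addr unsafe.Pointer, length uintptr, name string) {
	buf := append([]byte(name), 0) // kernel expects NUL termination
	_ = unix.Prctl(prSetVMA, prSetVMAAnonName,
		uintptr(addr), length, uintptr(unsafe.Pointer(&buf[0])))
}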
Fixes #71546
Change-Id: I304e8b4dd7f2477a6da794fd44e9a7a5354e4bf4
Reviewed-on: https://go-review.googlesource.com/c/go/+/646095
Auto-Submit: Alan Donovan <adonovan@google.com>
Commit-Queue: Alan Donovan <adonovan@google.com>
Reviewed-by: Felix Geisendörfer <felix.geisendoerfer@datadoghq.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
|
|
Change-Id: I07e7c8eaa5bd4bac0d576b2f2f4cd3f81b0b77a4
Reviewed-on: https://go-review.googlesource.com/c/go/+/630055
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Commit-Queue: Ian Lance Taylor <iant@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Reviewed-by: Russ Cox <rsc@golang.org>
Auto-Submit: Ian Lance Taylor <iant@google.com>
|
|
Instead, have the runtime build the gc bitmaps on demand
at runtime.
Change-Id: If7a245bc62e4bce3ce80972410b0ed307d921abe
Reviewed-on: https://go-review.googlesource.com/c/go/+/616255
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Keith Randall <khr@google.com>
|
|
Currently it's possible for weak->strong conversions to create more GC
work during mark termination. When a weak->strong conversion happens
during the mark phase, we need to mark the newly-strong pointer, since
it may now be the only pointer to that object. In other words, the
object could be white.
But queueing new white objects creates GC work, and if this happens
during mark termination, we could end up violating mark termination
invariants. In the parlance of the mark termination algorithm, the
weak->strong conversion is a non-monotonic source of GC work, unlike the
write barriers (which will eventually only see black objects).
This change fixes the problem by forcing weak->strong conversions to
block during mark termination. We can do this efficiently by setting a
global flag before the ragged barrier that is checked at each
weak->strong conversion. If the flag is set, then the conversions block.
The ragged barrier ensures that all Ps have observed the flag and that
any weak->strong conversions which completed before the ragged barrier
have their newly-minted strong pointers visible in GC work queues if
necessary. We later unset the flag and wake all the blocked goroutines
during the mark termination STW.
There are a few subtleties that we need to account for. For one, it's
possible that a goroutine which blocked in a weak->strong conversion
wakes up only to find it's mark termination time again, so we need to
recheck the global flag on wake. We should also stay non-preemptible
while performing the check, so that if the check *does* appear as true,
it cannot switch back to false while we're actively trying to block. If
it switches to false while we try to block, then we'll be stuck in the
queue until the following GC.
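Schematically, the gate looks like this (a sketch with user-space
primitives standing in for the runtime's park/wake machinery):
package weakgate

import "sync"

var (
	mu                sync.Mutex
	cond              = sync.NewCond(&mu)
	blockWeakToStrong bool // set before the ragged barrier
)

// gate blocks weak->strong conversions while the flag is set. The flag
// is rechecked on every wake: by the time a blocked goroutine runs, it
// may be mark termination time again.
func gate() {
	mu.Lock()
	for blockWeakToStrong {
		cond.Wait()
	}
	mu.Unlock()
}

// setGate is turned on before the ragged barrier and off during the
// mark termination STW, waking all blocked conversions.
func setGate(block bool) {
	mu.Lock()
	blockWeakToStrong = block
	if !block {
		cond.Broadcast()
	}
	mu.Unlock()
}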
All-in-all, this CL is more complicated than I would have liked, but
it's the only idea so far that is clearly correct to me at a high level.
This change adds a test which is somewhat invasive as it manipulates
mark termination, but hopefully that infrastructure will be useful for
debugging, fixing, and regression testing mark termination whenever we
do fix it.
Fixes #69803.
Change-Id: Ie314e6fd357c9e2a07a9be21f217f75f7aba8c4a
Reviewed-on: https://go-review.googlesource.com/c/go/+/623615
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
|
|
Use the new SwissTable-based map in internal/runtime/maps as the basis
for the runtime map when GOEXPERIMENT=swissmap.
Integration is complete enough to pass all.bash. Notable missing
features:
* Race integration / concurrent write detection
* Stack-allocated maps
* Specialized "fast" map variants
* Indirect key / elem
For #54766.
Cq-Include-Trybots: luci.golang.try:gotip-linux-ppc64_power10,gotip-linux-amd64-longtest-swissmap
Change-Id: Ie97b656b6d8e05c0403311ae08fef9f51756a639
Reviewed-on: https://go-review.googlesource.com/c/go/+/594596
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
Moving these intrinsics to a base package enables other internal/runtime
packages to use them.
For #54766.
Change-Id: I45a530422207dd94b5ad4eee51216c9410a84040
Reviewed-on: https://go-review.googlesource.com/c/go/+/613261
Reviewed-by: Cherry Mui <cherryyz@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
Moving these intrinsics to a base package enables other internal/runtime
packages to use them.
For #54766.
Change-Id: I0b3eded3bb45af53e3eb5bab93e3792e6a8beb46
Reviewed-on: https://go-review.googlesource.com/c/go/+/613260
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
|
|
The two map implementations are still identical, but now the compiler
targets the appropriate ABI depending on GOEXPERIMENT.
For #54766.
Cq-Include-Trybots: luci.golang.try:gotip-linux-amd64-longtest,gotip-linux-amd64-longtest-swissmap
Change-Id: I8438f64f044ba9de30ddbf2b8ceb9b4edd2d5614
Reviewed-on: https://go-review.googlesource.com/c/go/+/580779
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Michael Pratt <mpratt@google.com>
|
|
These are delay primitives for lock2. If a mutex isn't immediately
available, we can use procyield to tell the processor to wait for a
moment, or osyield to allow the OS to run a different process or thread
if one is waiting. We expect a processor-level yield to be faster than
an os-level yield, and for both of them to be fast relative to entering
a full sleep (via futexsleep or semasleep).
Each architecture has its own way of hinting to the processor that it's
in a spin-wait loop, so procyield presents an architecture-independent
interface for use in lock_futex.go and lock_sema.go.
Measure the (single-threaded) speed of these to confirm.
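As a user-space analog of the delay tiers (runtime.Gosched standing in
for osyield; a bounded busy-spin standing in for procyield, which has
no portable user-space equivalent; the final futexsleep/semasleep tier
is omitted):
package spinlock

import (
	"runtime"
	"sync/atomic"
)

type spinMutex struct{ state atomic.Int32 }

func (m *spinMutex) Lock() {
	for i := 0; ; i++ {
		if m.state.CompareAndSwap(0, 1) {
			return
		}
		if i < 30 { // arbitrary spin budget
			continue // "procyield": cheap, stays on-CPU
		}
		runtime.Gosched() // "osyield": let something else run
	}
}

func (m *spinMutex) Unlock() { m.state.Store(0) }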
For #68578
Change-Id: I90cd46ea553f2990395aceb048206285558c877e
Reviewed-on: https://go-review.googlesource.com/c/go/+/601396
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
|
|
Cleanup and friction reduction
For #65355.
Change-Id: Ia14c9dc584a529a35b97801dd3e95b9acc99a511
Reviewed-on: https://go-review.googlesource.com/c/go/+/600436
Reviewed-by: Keith Randall <khr@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Keith Randall <khr@golang.org>
|
|
This reverts commit f9ba2cff2286d378eca28c841bea8488c69fc30e (CL 586237)
Reason for revert: This is part of a patch series that changed the
handling of contended lock2/unlock2 calls, reducing the maximum
throughput of contended runtime.mutex values, and causing a performance
regression on applications where that is (or became) the bottleneck.
This test verifies that the semantics of the mutex profile for
runtime.mutex values matches that of sync.Mutex values. Without the rest
of the patch series, this test would correctly identify that Go 1.22's
semantics are incorrect (issue #66999).
Updates #66999
Updates #67585
Change-Id: Id06ae01d7bc91c94054c80d273e6530cb2d59d10
Reviewed-on: https://go-review.googlesource.com/c/go/+/589096
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Than McIntosh <thanm@google.com>
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
Have the test use the same clock (cputicks) as the profiler, and use the
test's own measurements as hard bounds on the magnitude to expect in the
profile.
Compare the depiction of two users of the same lock: one where the
critical section is fast, one where it is slow. Confirm that the profile
shows the slow critical section as a large source of delay (with #66999
fixed), rather than showing the fast critical section as a large
recipient of delay.
For #64253
For #66999
Change-Id: I784c8beedc39de564dc8cee42060a5d5ce55de39
Reviewed-on: https://go-review.googlesource.com/c/go/+/586237
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
Reviewed-by: Carlos Amedee <carlos@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
|
|
The existing implementation of traceMap is a hash map with a fixed
bucket table size which scales poorly with the number of elements added
to the map. After a few thousand elements are in the map, it tends to
fall over.
Furthermore, cleaning up the trace map is currently non-preemptible,
without very good reason.
This change replaces the traceMap implementation with a simple
append-only concurrent hash-trie. The data structure is incredibly
simple and does not suffer at all from the same scaling issues.
Because the traceMap no longer has a lock, and the traceRegionAlloc it
embeds is not thread-safe, we have to push that lock down. While we're
here, this change also makes the fast path for the traceRegionAlloc
lock-free. This may not be inherently faster due to contention on the
atomic add, but it creates an easy path to sharding the main allocation
buffer to reduce contention in the future. (We might want to also
consider a fully thread-local allocator that covers both string and
stack tables. The only reason a thread-local allocator isn't feasible
right now is that each of these has its own region, but we could
certainly group them all together.)
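A sketch of the hash-trie's shape (4-way branching and invented names
for brevity; because nodes are never removed, readers need no locks):
package tracemap

import "sync/atomic"

type node struct {
	hash     uint64
	id       uint64
	children [4]atomic.Pointer[node]
}

type traceMap struct {
	root   node // sentinel: only its children are used
	nextID atomic.Uint64
}

// put returns a stable ID for hash, inserting a node if it is new.
func (m *traceMap) put(hash uint64) uint64 {
	n := &m.root
	for shift := uint(0); ; shift += 2 {
		slot := &n.children[(hash>>shift)&3]
		child := slot.Load()
		if child == nil {
			fresh := &node{hash: hash, id: m.nextID.Add(1)}
			if slot.CompareAndSwap(nil, fresh) {
				return fresh.id
			}
			child = slot.Load() // lost the race; use the winner
		}
		if child.hash == hash {
			return child.id
		}
		n = child // descend on the next two hash bits
	}
}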
Change-Id: I8c06d42825c326061a1b8569e322afc4bc2a513a
Reviewed-on: https://go-review.googlesource.com/c/go/+/570035
Reviewed-by: Carlos Amedee <carlos@golang.org>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
TryBot-Bypass: Michael Knyszek <mknyszek@google.com>
Reviewed-by: David Chase <drchase@google.com>
|
|
This change removes the allocheaders GOEXPERIMENT, deleting all the old
code and merging mbitmap_allocheaders.go back into mbitmap.go.
This change also deletes the SetType benchmarks which were already
broken in the new GOEXPERIMENT (it's harder to set up than before). We
weren't really watching these benchmarks at all, and they don't provide
additional test coverage.
Change-Id: I135497201c3259087c5cd3722ed3fbe24791d25d
Reviewed-on: https://go-review.googlesource.com/c/go/+/567200
Reviewed-by: Keith Randall <khr@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
|
|
This change fixes a possible race with updating metrics and reading
them. The update is intended to be protected by the world being stopped,
but here, it clearly isn't.
Fixing this lets us lower the thresholds in the metrics tests by an
order of magnitude, because the only thing we have to worry about now is
floating point error (the tests were previously written assuming the
floating point error was much higher than it actually was; that turns
out not to be the case, and this bug was the problem instead). However,
this still isn't that tight of a bound; we still want to catch any and
all problems of exactness. For this purpose, this CL adds a test to
check the source-of-truth (in uint64 nanoseconds) that ensures the
totals exactly match.
This means we unfortunately have to take another time measurement, but
for now let's prioritize correctness. A few additional nanoseconds of
STW time won't be terribly noticeable.
Fixes #66212.
Change-Id: Id02c66e8a43c13b1f70e9b268b8a84cc72293bfd
Reviewed-on: https://go-review.googlesource.com/c/go/+/570257
Auto-Submit: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Nicolas Hillegeer <aktau@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
For #59670
Change-Id: Id66e102f13e529dd041b68ce869026a56f0a1b9b
GitHub-Last-Rev: 43aa9376f72bc02a9d86518cdc99494a6b2f8573
GitHub-Pull-Request: golang/go#65564
Reviewed-on: https://go-review.googlesource.com/c/go/+/562298
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Austin Clements <austin@google.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Auto-Submit: Austin Clements <austin@google.com>
|
|
For #65355
Change-Id: I65dd090fb99de9b231af2112c5ccb0eb635db2be
Reviewed-on: https://go-review.googlesource.com/c/go/+/560155
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Ibrahim Bazoka <ibrahimbazoka729@gmail.com>
Auto-Submit: Emmanuel Odeke <emmanuel@orijtech.com>
|
|
If user code has two timers t1 and t2 and does *t1 = *t2
(or *t1 = Timer{}), it creeps me out that we would be
corrupting the runtime data structures inlined in the
Timer struct. Replace that field with a pointer to the
runtime data structure instead, so that the corruption
cannot happen, even in a badly behaved program.
In fact, remove the struct definition entirely and linkname
a constructor instead. Now the runtime can evolve the struct
however it likes without needing to keep package time in sync.
Also move the workaround logic for #21874 out of
runtime and into package time.
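Schematically (field names illustrative, not the actual time.Timer
layout):
package timersketch

import "time"

type runtimeTimer struct{ /* runtime-managed fields */ }

// Before: the runtime timer was embedded by value, so *t1 = *t2
// overwrote live runtime state.
type timerBefore struct {
	C <-chan time.Time
	r runtimeTimer
}

// After: only an opaque pointer, produced by a linkname'd constructor;
// copying the struct copies the pointer, never the runtime state.
type timerAfter struct {
	C <-chan time.Time
	t *runtimeTimer
}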
Change-Id: Ia30f7802ee7b3a11f5d8a78dd30fd9c8633dc787
Reviewed-on: https://go-review.googlesource.com/c/go/+/568339
Reviewed-by: Ian Lance Taylor <iant@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
Change-Id: Ib78c1513616089f4942297cd17212b1b11871fd5
GitHub-Last-Rev: f97fe5b5bffffe25dc31de7964588640cb70ec41
GitHub-Pull-Request: golang/go#65819
Reviewed-on: https://go-review.googlesource.com/c/go/+/565515
Reviewed-by: Jorropo <jorropo.pgm@gmail.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Keith Randall <khr@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
|
|
For #59670
Change-Id: I9265e033bf3a84c3dc7b4a5d52c0df9672435f0d
GitHub-Last-Rev: 8e4099095cec9e774abe9b72212a1eab6f6f9fdb
GitHub-Pull-Request: golang/go#64774
Reviewed-on: https://go-review.googlesource.com/c/go/+/550117
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Than McIntosh <thanm@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
CL 549536 intended to decouple the internal implementation of rwmutex
from the semantic meaning of an rwmutex read/write lock in the static
lock ranking.
Unfortunately, it was not thought through well enough. The internals
were represented with the rwmutexR and rwmutexW lock ranks. The idea was
that the internal lock ranks need not model the higher-level ordering,
since those have separate rankings. That is incorrect; rwmutexW is held
for the duration of a write lock, so it must be ranked before any lock
taken while any write lock is held, which is precisely what we were
trying to avoid.
This is visible in violations like:
0 : execW 11 0x0
1 : rwmutexW 51 0x111d9c8
2 : fin 30 0x111d3a0
fatal error: lock ordering problem
execW < fin is modeled, but rwmutexW < fin is missing.
Fix this by eliminating the rwmutexR/W lock ranks shared across
different types of rwmutex. Instead require users to define an
additional "internal" lock rank to represent the implementation details
of rwmutex.rLock. We can avoid an additional "internal" lock rank for
rwmutex.wLock because the existing writeRank has the same semantics for
semantic and internal locking. i.e., writeRank is held for the duration
of a write lock, which is exactly how rwmutex.wLock is used, so we can
use writeRank directly on wLock.
For #64722.
Cq-Include-Trybots: luci.golang.try:gotip-linux-amd64-staticlockranking
Change-Id: Ia572de188a46ba8fe054ae28537648beaa16b12c
Reviewed-on: https://go-review.googlesource.com/c/go/+/555055
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
|
|
Currently, lock ranking doesn't really try to model rwmutex. It records
the internal locks rLock and wLock, but in a subpar fashion:
1. wLock is held from lock to unlock, so it works OK, but it conflates
write locks of all rwmutexes as rwmutexW, rather than allowing
different rwmutexes to have different rankings.
2. rLock is an internal implementation detail that is only taken when
there is contention in rlock. As a result, the reader lock path is
almost never checked.
Add proper modeling. rwmutexR and rwmutexW remain as the ranks of the
internal locks, which have their own ordering. The new init method is
passed the ranks of the higher level lock that this represents, just
like lockInit for mutex.
execW ordered before MALLOC captures the case from #64722. i.e., there
can be allocation between BeforeFork and AfterFork.
For #64722.
Cq-Include-Trybots: luci.golang.try:gotip-linux-amd64-staticlockranking
Change-Id: I23335b28faa42fb04f1bc9da02fdf54d1616cd28
Reviewed-on: https://go-review.googlesource.com/c/go/+/549536
Reviewed-by: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
Move ChaCha8 code into internal/chacha8rand and use it to implement
runtime.rand, which is used for the unseeded global source for
both math/rand and math/rand/v2. This also affects the calculation of
the start point for iteration over very very large maps (when the
32-bit fastrand is not big enough).
The benefit is that misuse of the global random number generators
in math/rand and math/rand/v2 in contexts where non-predictable
randomness is important for security reasons is no longer a
security problem, removing a common mistake among programmers
who are unaware of the different kinds of randomness.
The cost is an extra 304 bytes per thread stored in the m struct
plus 2-3ns more per random uint64 due to the more sophisticated
algorithm. Using PCG looks like it would cost about the same,
although I haven't benchmarked that.
Before this, the math/rand and math/rand/v2 global generator
was wyrand (https://github.com/wangyi-fudan/wyhash).
For math/rand, using wyrand instead of the Mitchell/Reeds/Thompson
ALFG was justifiable, since the latter was not any better.
But for math/rand/v2, the global generator really should be
at least as good as one of the well-studied, specific algorithms
provided directly by the package, and it's not.
(Wyrand is still reasonable for scheduling and cache decisions.)
Good randomness does have a cost: about twice wyrand.
Also rationalize the various runtime rand references.
goos: linux
goarch: amd64
pkg: math/rand/v2
cpu: AMD Ryzen 9 7950X 16-Core Processor
│ bbb48afeb7.amd64 │ 5cf807d1ea.amd64 │
│ sec/op │ sec/op vs base │
ChaCha8-32 1.862n ± 2% 1.861n ± 2% ~ (p=0.825 n=20)
PCG_DXSM-32 1.471n ± 1% 1.460n ± 2% ~ (p=0.153 n=20)
SourceUint64-32 1.636n ± 2% 1.582n ± 1% -3.30% (p=0.000 n=20)
GlobalInt64-32 2.087n ± 1% 3.663n ± 1% +75.54% (p=0.000 n=20)
GlobalInt64Parallel-32 0.1042n ± 1% 0.2026n ± 1% +94.48% (p=0.000 n=20)
GlobalUint64-32 2.263n ± 2% 3.724n ± 1% +64.57% (p=0.000 n=20)
GlobalUint64Parallel-32 0.1019n ± 1% 0.1973n ± 1% +93.67% (p=0.000 n=20)
Int64-32 1.771n ± 1% 1.774n ± 1% ~ (p=0.449 n=20)
Uint64-32 1.863n ± 2% 1.866n ± 1% ~ (p=0.364 n=20)
GlobalIntN1000-32 3.134n ± 3% 4.730n ± 2% +50.95% (p=0.000 n=20)
IntN1000-32 2.489n ± 1% 2.489n ± 1% ~ (p=0.683 n=20)
Int64N1000-32 2.521n ± 1% 2.516n ± 1% ~ (p=0.394 n=20)
Int64N1e8-32 2.479n ± 1% 2.478n ± 2% ~ (p=0.743 n=20)
Int64N1e9-32 2.530n ± 2% 2.514n ± 2% ~ (p=0.193 n=20)
Int64N2e9-32 2.501n ± 1% 2.494n ± 1% ~ (p=0.616 n=20)
Int64N1e18-32 3.227n ± 1% 3.205n ± 1% ~ (p=0.101 n=20)
Int64N2e18-32 3.647n ± 1% 3.599n ± 1% ~ (p=0.019 n=20)
Int64N4e18-32 5.135n ± 1% 5.069n ± 2% ~ (p=0.034 n=20)
Int32N1000-32 2.657n ± 1% 2.637n ± 1% ~ (p=0.180 n=20)
Int32N1e8-32 2.636n ± 1% 2.636n ± 1% ~ (p=0.763 n=20)
Int32N1e9-32 2.660n ± 2% 2.638n ± 1% ~ (p=0.358 n=20)
Int32N2e9-32 2.662n ± 2% 2.618n ± 2% ~ (p=0.064 n=20)
Float32-32 2.272n ± 2% 2.239n ± 2% ~ (p=0.194 n=20)
Float64-32 2.272n ± 1% 2.286n ± 2% ~ (p=0.763 n=20)
ExpFloat64-32 3.762n ± 1% 3.744n ± 1% ~ (p=0.171 n=20)
NormFloat64-32 3.706n ± 1% 3.655n ± 2% ~ (p=0.066 n=20)
Perm3-32 32.93n ± 3% 34.62n ± 1% +5.13% (p=0.000 n=20)
Perm30-32 202.9n ± 1% 204.0n ± 1% ~ (p=0.482 n=20)
Perm30ViaShuffle-32 115.0n ± 1% 114.9n ± 1% ~ (p=0.358 n=20)
ShuffleOverhead-32 112.8n ± 1% 112.7n ± 1% ~ (p=0.692 n=20)
Concurrent-32 2.107n ± 0% 3.725n ± 1% +76.75% (p=0.000 n=20)
goos: darwin
goarch: arm64
pkg: math/rand/v2
│ bbb48afeb7.arm64 │ 5cf807d1ea.arm64 │
│ sec/op │ sec/op vs base │
ChaCha8-8 2.480n ± 0% 2.429n ± 0% -2.04% (p=0.000 n=20)
PCG_DXSM-8 2.531n ± 0% 2.530n ± 0% ~ (p=0.877 n=20)
SourceUint64-8 2.534n ± 0% 2.533n ± 0% ~ (p=0.732 n=20)
GlobalInt64-8 2.172n ± 1% 4.794n ± 0% +120.67% (p=0.000 n=20)
GlobalInt64Parallel-8 0.4320n ± 0% 0.9605n ± 0% +122.32% (p=0.000 n=20)
GlobalUint64-8 2.182n ± 0% 4.770n ± 0% +118.58% (p=0.000 n=20)
GlobalUint64Parallel-8 0.4307n ± 0% 0.9583n ± 0% +122.51% (p=0.000 n=20)
Int64-8 4.107n ± 0% 4.104n ± 0% ~ (p=0.416 n=20)
Uint64-8 4.080n ± 0% 4.080n ± 0% ~ (p=0.052 n=20)
GlobalIntN1000-8 2.814n ± 2% 5.643n ± 0% +100.50% (p=0.000 n=20)
IntN1000-8 4.141n ± 0% 4.139n ± 0% ~ (p=0.140 n=20)
Int64N1000-8 4.140n ± 0% 4.140n ± 0% ~ (p=0.313 n=20)
Int64N1e8-8 4.140n ± 0% 4.139n ± 0% ~ (p=0.103 n=20)
Int64N1e9-8 4.139n ± 0% 4.140n ± 0% ~ (p=0.761 n=20)
Int64N2e9-8 4.140n ± 0% 4.140n ± 0% ~ (p=0.636 n=20)
Int64N1e18-8 5.266n ± 0% 5.326n ± 1% +1.14% (p=0.001 n=20)
Int64N2e18-8 6.052n ± 0% 6.167n ± 0% +1.90% (p=0.000 n=20)
Int64N4e18-8 8.826n ± 0% 9.051n ± 0% +2.55% (p=0.000 n=20)
Int32N1000-8 4.127n ± 0% 4.132n ± 0% +0.12% (p=0.000 n=20)
Int32N1e8-8 4.126n ± 0% 4.131n ± 0% +0.12% (p=0.000 n=20)
Int32N1e9-8 4.127n ± 0% 4.132n ± 0% +0.12% (p=0.000 n=20)
Int32N2e9-8 4.132n ± 0% 4.131n ± 0% ~ (p=0.017 n=20)
Float32-8 4.109n ± 0% 4.105n ± 0% ~ (p=0.379 n=20)
Float64-8 4.107n ± 0% 4.106n ± 0% ~ (p=0.867 n=20)
ExpFloat64-8 5.339n ± 0% 5.383n ± 0% +0.82% (p=0.000 n=20)
NormFloat64-8 5.735n ± 0% 5.737n ± 1% ~ (p=0.856 n=20)
Perm3-8 26.65n ± 0% 26.80n ± 1% +0.58% (p=0.000 n=20)
Perm30-8 194.8n ± 1% 197.0n ± 0% +1.18% (p=0.000 n=20)
Perm30ViaShuffle-8 156.6n ± 0% 157.6n ± 1% +0.61% (p=0.000 n=20)
ShuffleOverhead-8 124.9n ± 0% 125.5n ± 0% +0.52% (p=0.000 n=20)
Concurrent-8 2.434n ± 3% 5.066n ± 0% +108.09% (p=0.000 n=20)
goos: linux
goarch: 386
pkg: math/rand/v2
cpu: AMD Ryzen 9 7950X 16-Core Processor
│ bbb48afeb7.386 │ 5cf807d1ea.386 │
│ sec/op │ sec/op vs base │
ChaCha8-32 11.295n ± 1% 4.748n ± 2% -57.96% (p=0.000 n=20)
PCG_DXSM-32 7.693n ± 1% 7.738n ± 2% ~ (p=0.542 n=20)
SourceUint64-32 7.658n ± 2% 7.622n ± 2% ~ (p=0.344 n=20)
GlobalInt64-32 3.473n ± 2% 7.526n ± 2% +116.73% (p=0.000 n=20)
GlobalInt64Parallel-32 0.3198n ± 0% 0.5444n ± 0% +70.22% (p=0.000 n=20)
GlobalUint64-32 3.612n ± 0% 7.575n ± 1% +109.69% (p=0.000 n=20)
GlobalUint64Parallel-32 0.3168n ± 0% 0.5403n ± 0% +70.51% (p=0.000 n=20)
Int64-32 7.673n ± 2% 7.789n ± 1% ~ (p=0.122 n=20)
Uint64-32 7.773n ± 1% 7.827n ± 2% ~ (p=0.920 n=20)
GlobalIntN1000-32 6.268n ± 1% 9.581n ± 1% +52.87% (p=0.000 n=20)
IntN1000-32 10.33n ± 2% 10.45n ± 1% ~ (p=0.233 n=20)
Int64N1000-32 10.98n ± 2% 11.01n ± 1% ~ (p=0.401 n=20)
Int64N1e8-32 11.19n ± 2% 10.97n ± 1% ~ (p=0.033 n=20)
Int64N1e9-32 11.06n ± 1% 11.08n ± 1% ~ (p=0.498 n=20)
Int64N2e9-32 11.10n ± 1% 11.01n ± 2% ~ (p=0.995 n=20)
Int64N1e18-32 15.23n ± 2% 15.04n ± 1% ~ (p=0.973 n=20)
Int64N2e18-32 15.89n ± 1% 15.85n ± 1% ~ (p=0.409 n=20)
Int64N4e18-32 18.96n ± 2% 19.34n ± 2% ~ (p=0.048 n=20)
Int32N1000-32 10.46n ± 2% 10.44n ± 2% ~ (p=0.480 n=20)
Int32N1e8-32 10.46n ± 2% 10.49n ± 2% ~ (p=0.951 n=20)
Int32N1e9-32 10.28n ± 2% 10.26n ± 1% ~ (p=0.431 n=20)
Int32N2e9-32 10.50n ± 2% 10.44n ± 2% ~ (p=0.249 n=20)
Float32-32 13.80n ± 2% 13.80n ± 2% ~ (p=0.751 n=20)
Float64-32 23.55n ± 2% 23.87n ± 0% ~ (p=0.408 n=20)
ExpFloat64-32 15.36n ± 1% 15.29n ± 2% ~ (p=0.316 n=20)
NormFloat64-32 13.57n ± 1% 13.79n ± 1% +1.66% (p=0.005 n=20)
Perm3-32 45.70n ± 2% 46.99n ± 2% +2.81% (p=0.001 n=20)
Perm30-32 399.0n ± 1% 403.8n ± 1% +1.19% (p=0.006 n=20)
Perm30ViaShuffle-32 349.0n ± 1% 350.4n ± 1% ~ (p=0.909 n=20)
ShuffleOverhead-32 322.3n ± 1% 323.8n ± 1% ~ (p=0.410 n=20)
Concurrent-32 3.331n ± 1% 7.312n ± 1% +119.50% (p=0.000 n=20)
For #61716.
Change-Id: Ibdddeed85c34d9ae397289dc899e04d4845f9ed2
Reviewed-on: https://go-review.googlesource.com/c/go/+/516860
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Filippo Valsorda <filippo@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
ReadMemStats has a few assertions it makes about the consistency of the
stats it's about to produce. Specifically, how those stats line up with
runtime-internal stats. These checks are generally useful, but crashing
just because some stats are wrong is a heavy price to pay.
For a long time this wasn't a problem, but very recently it became a
real problem. It turns out that there's real benign skew that can happen
wherein sysmon (which doesn't synchronize with a STW) generates a trace
event when tracing is enabled, and may mutate some stats while
ReadMemStats is running its checks.
Fix this by synchronizing with both sysmon and the tracer. This is a bit
heavy-handed, but better that than false positives.
Also, put the checks behind a debug mode. We want to reduce the risk of
backporting this change, and again, it's not great to crash just because
user-facing stats are off. Still, enable this debug mode during the
runtime tests so we don't lose quite as much coverage from disabling
these checks by default.
Fixes #64401.
Change-Id: I9adb3e5c7161d207648d07373a11da8a5f0fda9a
Reviewed-on: https://go-review.googlesource.com/c/go/+/545277
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Felix Geisendörfer <felix.geisendoerfer@datadoghq.com>
|
|
Add runtime-internal locks to the mutex contention profile.
Store up to one call stack responsible for lock contention on the M,
until it's safe to contribute its value to the mprof table. Try to use
that limited local storage space for a relatively large source of
contention, and attribute any contention in stacks we're not able to
store to a sentinel _LostContendedLock function.
Avoid ballooning lock contention while manipulating the mprof table by
attributing to that sentinel function any lock contention experienced
while reporting lock contention.
Guard collecting real call stacks with GODEBUG=profileruntimelocks=1,
since the available data has mixed semantics; we can easily capture an
M's own wait time, but we'd prefer for the profile entry of each
critical section to describe how long it made the other Ms wait. It's
too late in the Go 1.22 cycle to make the required changes to
futex-based locks. When not enabled, attribute the time to the sentinel
function instead.
Fixes #57071
This is a roll-forward of https://go.dev/cl/528657, which was reverted
in https://go.dev/cl/543660 (reason for revert: broke a lot of
builders).
Compared to the original, this version de-flakes the tests: it reduces
dependence on fine-grained timers, correctly identifies contention on
big-endian futex locks, attempts to measure contention in the semaphore
implementation but only uses that secondary measurement to finish the
test early, and skips tests on single-processor systems.
Change-Id: I31389f24283d85e46ad9ba8d4f514cb9add8dfb0
Reviewed-on: https://go-review.googlesource.com/c/go/+/544195
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Than McIntosh <thanm@google.com>
Auto-Submit: Rhys Hiltner <rhys@justin.tv>
Run-TryBot: Rhys Hiltner <rhys@justin.tv>
|
|
The function passed to OnceFunc/OnceValue/OnceValues may transitively
keep more allocations alive. As the passed function is guaranteed to be
called at most once, it is safe to drop it after the first call is
complete. This avoids keeping the passed function (and anything it
transitively references) alive until the returned function is GCed.
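A minimal sketch of the pattern (ignoring the real implementation's
panic handling):
package oncesketch

import "sync"

func onceFunc(f func()) func() {
	var once sync.Once
	return func() {
		once.Do(func() {
			f()
			f = nil // drop f so the GC can reclaim what it references
		})
	}
}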
Change-Id: I2faf397b481d2f693ab3aea8e2981b02adbc7a21
Reviewed-on: https://go-review.googlesource.com/c/go/+/481515
Reviewed-by: Austin Clements <austin@google.com>
Reviewed-by: David Chase <drchase@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: qiulaidongfeng <2645477756@qq.com>
|
|
This reverts commit go.dev/cl/528657.
Reason for revert: broke a lot of builders.
Change-Id: I70c33062020e997c4df67b3eaa2e886cf0da961e
Reviewed-on: https://go-review.googlesource.com/c/go/+/543660
Reviewed-by: Than McIntosh <thanm@google.com>
Auto-Submit: Matthew Dempsky <mdempsky@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
Add runtime-internal locks to the mutex contention profile.
Store up to one call stack responsible for lock contention on the M,
until it's safe to contribute its value to the mprof table. Try to use
that limited local storage space for a relatively large source of
contention, and attribute any contention in stacks we're not able to
store to a sentinel _LostContendedLock function.
Avoid ballooning lock contention while manipulating the mprof table by
attributing to that sentinel function any lock contention experienced
while reporting lock contention.
Guard collecting real call stacks with GODEBUG=profileruntimelocks=1,
since the available data has mixed semantics; we can easily capture an
M's own wait time, but we'd prefer for the profile entry of each
critical section to describe how long it made the other Ms wait. It's
too late in the Go 1.22 cycle to make the required changes to
futex-based locks. When not enabled, attribute the time to the sentinel
function instead.
Fixes #57071
Change-Id: I3eee0ccbfc20f333b56f20d8725dfd7f3a526b41
Reviewed-on: https://go-review.googlesource.com/c/go/+/528657
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Auto-Submit: Rhys Hiltner <rhys@justin.tv>
Reviewed-by: Than McIntosh <thanm@google.com>
|
|
This CL adds four new time histogram metrics:
/sched/pauses/stopping/gc:seconds
/sched/pauses/stopping/other:seconds
/sched/pauses/total/gc:seconds
/sched/pauses/total/other:seconds
The "stopping" metrics measure the time taken to start a stop-the-world
pause. i.e., how long it takes stopTheWorldWithSema to stop all Ps.
This can be used to detect STW struggling to preempt Ps.
The "total" metrics measure the total duration of a stop-the-world
pause, from starting to stop-the-world until the world is started again.
This includes the time spent in the "start" phase.
The "gc" metrics are used for GC-related STW pauses. The "other" metrics
are used for all other STW pauses.
All of these metrics start timing in stopTheWorldWithSema only after
successfully acquiring sched.lock, thus excluding lock contention on
sched.lock. The reasoning behind this is that while waiting on
sched.lock the world is not stopped at all (all other Ps can run), so
the impact of this contention is primarily limited to the goroutine
attempting to stop-the-world. Additionally, we already have some
visibility into sched.lock contention via contention profiles (#57071).
/sched/pauses/total/gc:seconds is conceptually equivalent to
/gc/pauses:seconds, so the latter is marked as deprecated and returns
the same histogram as the former.
In the implementation, there are a few minor differences:
* For both mark and sweep termination stops, /gc/pauses:seconds started
timing prior to calling stopTheWorldWithSema, thus including lock
contention.
These details are minor enough that I do not believe the slight change
in reporting will matter. For mark termination stops, moving timing stop
into startTheWorldWithSema does have the side effect of requiring moving
other GC metric calculations outside of the STW, as they depend on the
same end time.
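For example, the new histograms can be read with the standard
runtime/metrics API:
package main

import (
	"fmt"
	"runtime/metrics"
)

func main() {
	samples := []metrics.Sample{
		{Name: "/sched/pauses/stopping/gc:seconds"},
		{Name: "/sched/pauses/total/gc:seconds"},
	}
	metrics.Read(samples)
	for _, s := range samples {
		h := s.Value.Float64Histogram()
		fmt.Println(s.Name, "bucket counts:", len(h.Counts))
	}
}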
Fixes #63340
Change-Id: Iacd0bab11bedab85d3dcfb982361413a7d9c0d05
Reviewed-on: https://go-review.googlesource.com/c/go/+/534161
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Auto-Submit: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
This change replaces the 1-bit-per-word heap bitmap for most size
classes with allocation headers for objects that contain pointers. The
header consists of a single pointer to a type. All allocations with
headers are treated as implicitly containing one or more instances of
the type in the header.
As the name implies, headers are usually stored as the first word of an
object. There are two additional exceptions to where headers are stored
and how they're used.
Objects smaller than 512 bytes do not have headers. Instead, a heap
bitmap is reserved at the end of spans for objects of this size. A full
word of overhead is too much for these small objects. The bitmap is of
the same format of the old bitmap, minus the noMorePtrs bits which are
unnecessary. All the objects <512 bytes have a bitmap less than a
pointer-word in size, and that was the granularity at which noMorePtrs
could stop scanning early anyway.
Objects that are larger than 32 KiB (which have their own span) have
their headers stored directly in the span, to allow power-of-two-sized
allocations to not spill over into an extra page.
The full implementation is behind GOEXPERIMENT=allocheaders.
The purpose of this change is performance. First and foremost, with
headers we no longer have to unroll pointer/scalar data at allocation
time for most size classes. Small size classes still need some
unrolling, but their bitmaps are small so we can optimize that case
fairly well. Larger objects effectively have their pointer/scalar data
unrolled on-demand from type data, which is much more compactly
represented and results in less TLB pressure. Furthermore, since the
headers are usually right next to the object and where we're about to
start scanning, we get an additional temporal locality benefit in the
data cache when looking up type metadata. The pointer/scalar data is
now effectively unrolled on-demand, but it's also simpler to unroll than
before; that unrolled data is never written anywhere, and for arrays we
get the benefit of retreading the same data per element, as opposed to
looking it up from scratch for each pointer-word of bitmap. Lastly,
because we no longer have a heap bitmap that spans the entire heap,
there's a flat 1.5% memory use reduction. This is balanced slightly by
some objects possibly being bumped up a size class, but most objects are
not tightly optimized to size class sizes so there's some memory to
spare, making the header basically free in those cases.
See the follow-up CL which turns on this experiment by default for
benchmark results. (CL 538217.)
Change-Id: I4c9034ee200650d06d8bdecd579d5f7c1bbf1fc5
Reviewed-on: https://go-review.googlesource.com/c/go/+/437955
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
ReadMetricsSlow was updated to call the core of readMetrics on the
systemstack to prevent issues with stat skew if the stack gets moved
between readmemstats_m and readMetrics. However, readMetrics calls into
the map implementation, which has race instrumentation. The system stack
typically has no racectx set, resulting in crashes.
Donate racectx to g0 like the tracer does, so that these accesses don't
crash.
For #60607.
Cq-Include-Trybots: luci.golang.try:gotip-linux-amd64-race
Change-Id: Ic0251af2d9b60361f071fe97084508223109480c
Reviewed-on: https://go-review.googlesource.com/c/go/+/539695
Reviewed-by: Cherry Mui <cherryyz@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
|
|
Currently it's possible (and even probable, with mayMoreStackMove mode)
for a stack allocation to occur between readmemstats_m and readMetrics
in ReadMetricsSlow. This can cause tests to fail by producing metrics
that are inconsistent between the two sources.
Fix this by breaking out the critical section of readMetrics and calling
that from ReadMetricsSlow on the systemstack. Our main constraint in
calling readMetrics on the system stack is the fact that we can't
acquire the metrics semaphore from the system stack. But if we break out
the critical section, then we can acquire that semaphore before we go on
the system stack.
While we're here, add another readMetrics call before readmemstats_m.
Since we're being paranoid about ways that metrics could get skewed
between the two calls, let's eliminate all uncertainty. It's possible
for readMetrics to allocate new memory, for example for histograms, and
fail while it's reading metrics. I believe we're just getting lucky
today with the order in which the metrics are produced. Another call to
readMetrics will preallocate this data in the samples slice. One nice
thing about this second read is that now we effectively have a way to
check if readMetrics really will allocate if called a second time on the
same samples slice.
Fixes #60607.
Cq-Include-Trybots: luci.golang.try:gotip-linux-amd64-longtest
Change-Id: If6ce666530903239ef9f02dbbc3f1cb6be71e425
Reviewed-on: https://go-review.googlesource.com/c/go/+/539117
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
|
|
Change-Id: Ib89a71e20f9c6b86c97814c75cb427e9bd7075e5
Reviewed-on: https://go-review.googlesource.com/c/go/+/538735
Reviewed-by: David Chase <drchase@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Run-TryBot: Joel Sing <joel@sing.id.au>
|
|
Error like "morestack on g0" is one of the errors that is very
hard to debug, because often it doesn't print a useful stack trace.
The runtime doesn't directly print a stack trace because it is
a bad stack state to call print. Sometimes the SIGABRT may trigger
a traceback, but sometimes not especially in a cgo binary. Even if
it triggers a traceback it often does not include the stack trace
of the bad stack.
This CL makes it explicitly print a stack trace and throw. The
idea is to have some space as an "emergency" crash stack. When the
stack is in a really bad state, we switch to the crash stack and
do a traceback.
Currently only implemented on AMD64 and ARM64.
TODO: also handle errors like "morestack on gsignal" and bad
systemstack. Also handle other architectures.
Change-Id: Ibfc397202f2bb0737c5cbe99f2763de83301c1c1
Reviewed-on: https://go-review.googlesource.com/c/go/+/419435
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
|
|
Change-Id: I3f0b7209621b39cee69566a5cc95e4343b4f1f20
GitHub-Last-Rev: af9dbbe69ad74e8c210254dafa260a886b690853
GitHub-Pull-Request: golang/go#63321
Reviewed-on: https://go-review.googlesource.com/c/go/+/531916
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Mauri de Souza Meneguzzo <mauri870@gmail.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
|
|
We were using the size stored in the map, which is the smaller
of the real type size and 128.
As of CL 61538 we don't use these functions, but we expect to
use them again in the future after #61626 is resolved.
Change-Id: I7bfb4af5f0e3a56361d4019a8ed7c1ec59ff31fd
Reviewed-on: https://go-review.googlesource.com/c/go/+/535215
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Auto-Submit: Ian Lance Taylor <iant@google.com>
Reviewed-by: Keith Randall <khr@google.com>
|
|
The correct load factor is 6.5, not 6.
This got broken by accident in CL 462115.
Fixes #63438
Change-Id: Ib07bb6ab6103aec87cb775bc06bd04362a64e489
Reviewed-on: https://go-review.googlesource.com/c/go/+/533279
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Cuong Manh Le <cuong.manhle.vn@gmail.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
mspan.freeindex and nelems can fit into uint16 for all possible
values. Use uint16 instead of uintptr.
Change-Id: Ifce20751e81d5022be1f6b5cbb5fbe4fd1728b1b
Reviewed-on: https://go-review.googlesource.com/c/go/+/451359
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
After the previous CL, this is now all dead code. This change is
separated out to make the previous one easy to backport.
For #63334.
Related to #61718 and #59960.
Change-Id: I109673ed97c62c472bbe2717dfeeb5aa4fc883ea
Reviewed-on: https://go-review.googlesource.com/c/go/+/532117
Reviewed-by: Michael Pratt <mpratt@google.com>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
This is a follow up of CL 528696.
Change-Id: I5b71eabedb12567c4b1b36f7182a3d2b0ed662a5
GitHub-Last-Rev: acaf3ac11c38042ad27b99e1c70a3c9f1a554a15
GitHub-Pull-Request: golang/go#62713
Reviewed-on: https://go-review.googlesource.com/c/go/+/529197
Reviewed-by: Ian Lance Taylor <iant@google.com>
Auto-Submit: Ian Lance Taylor <iant@google.com>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
The stack bounds from pthread are not always accurate, and could
cause a segfault if we run out of actual stack space before
reaching the bounds. Here we use artificially small stack bounds
to detect overflow without actually running out of system stack.
Change-Id: I8067c5e1297307103b315d9d0c60120293b57aab
Reviewed-on: https://go-review.googlesource.com/c/go/+/523695
Reviewed-by: Michael Pratt <mpratt@google.com>
Run-TryBot: Cherry Mui <cherryyz@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
Currently the runtime marks all new memory as MADV_HUGEPAGE on Linux and
manages its hugepage eligibility status. Unfortunately, the default
THP behavior on most Linux distros is that MADV_HUGEPAGE blocks while
the kernel eagerly reclaims and compacts memory to allocate a hugepage.
This direct reclaim and compaction is unbounded, and may result in
significant application thread stalls. In really bad cases, this can
exceed 100s of ms or even seconds.
Really all we want is to undo MADV_NOHUGEPAGE marks and let the default
Linux paging behavior take over, but the only way to unmark a region as
MADV_NOHUGEPAGE is to also mark it MADV_HUGEPAGE.
The overall strategy of trying to keep the heap's hugepages unbroken,
however, is sound. So instead let's use the new shiny MADV_COLLAPSE, if
it exists.
MADV_COLLAPSE makes a best-effort synchronous attempt at collapsing the
physical memory backing a memory region into a hugepage. We'll use
MADV_COLLAPSE where we would've used MADV_HUGEPAGE, and stop using
MADV_NOHUGEPAGE altogether.
Because MADV_COLLAPSE is synchronous, it's also important to not
re-collapse huge pages if the huge pages are likely part of some large
allocation. Although in many cases it's advantageous to back these
allocations with hugepages because they're contiguous, eagerly
collapsing every hugepage means having to page in at least part of the
large allocation.
However, because we won't use MADV_NOHUGEPAGE anymore, we'll no longer
handle the fact that khugepaged might come in and back some memory we
returned to the OS with a hugepage. I've come to the conclusion that
this is basically unavoidable without a new madvise flag and that it's
just not a good default. If this change lands, advice about Linux huge
page settings will be added to the GC guide.
Verified that this change doesn't regress Sweet, at least not on my
machine with:
/sys/kernel/mm/transparent_hugepage/enabled [always or madvise]
/sys/kernel/mm/transparent_hugepage/defrag [madvise]
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none [0 or 511]
Unfortunately, this workaround means that we only get forced hugepages
on Linux 6.1+.
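A user-space sketch of the new call (assuming golang.org/x/sys/unix
exposes MADV_COLLAPSE; the runtime itself invokes madvise directly):
package madvsketch

import "golang.org/x/sys/unix"

// collapse makes a best-effort, synchronous attempt to back b with
// hugepages; on kernels before 6.1 the call simply fails and the
// error can be ignored.
func collapse(b []byte) error {
	return unix.Madvise(b, unix.MADV_COLLAPSE)
}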
Fixes #61718.
Change-Id: I7f4a7ba397847de29f800a99f9cb66cb2720a533
Reviewed-on: https://go-review.googlesource.com/c/go/+/516795
Reviewed-by: Austin Clements <austin@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
|