go - Fork of Go programming language with my patches.

Age	Commit message	Author
2025-07-20	internal/testenv: exclude GOMAXPROCS when building test program [fix-runtime-test-GOMAXPROCS]	Shulhan
	In an environment where GOMAXPROCS is set explicitly, for example to 3 in the shell profile, the runtime tests fail with the following error:
	----
	ok regexp/syntax 0.428s
	--- FAIL: TestCgroupGOMAXPROCS (0.81s)
	    crash_test.go:186: running /home/ms/src/go/bin/go build -o /tmp/go-build1753772192/testprog.exe
	    crash_test.go:208: built testprog in 796.664277ms
	    --- FAIL: TestCgroupGOMAXPROCS/containermaxprocs=0 (0.00s)
	        cgroup_linux_test.go:60: /tmp/go-build1753772192/testprog.exe PrintGOMAXPROCS (907.06µs): ok
	        cgroup_linux_test.go:63: output got "3\n" want "4\n"
	--- FAIL: TestCgroupGOMAXPROCSNoLimit (0.00s)
	    cgroup_linux_test.go:82: /tmp/go-build1753772192/testprog.exe PrintGOMAXPROCS (879.194µs): ok
	    cgroup_linux_test.go:85: output got "3\n" want "4\n"
	--- FAIL: TestCgroupGOMAXPROCSHigherThanNumCPU (0.00s)
	    cgroup_linux_test.go:102: /tmp/go-build1753772192/testprog.exe PrintGOMAXPROCS (852.396µs): ok
	    cgroup_linux_test.go:105: output got "3\n" want "4\n"
	--- FAIL: TestCgroupGOMAXPROCSRound (0.01s)
	    --- FAIL: TestCgroupGOMAXPROCSRound/50000 (0.00s)
	        cgroup_linux_test.go:156: /tmp/go-build1753772192/testprog.exe PrintGOMAXPROCS (852.099µs): ok
	        cgroup_linux_test.go:159: output got "3\n" want "2\n"
	    --- FAIL: TestCgroupGOMAXPROCSRound/100000 (0.00s)
	        cgroup_linux_test.go:156: /tmp/go-build1753772192/testprog.exe PrintGOMAXPROCS (894.001µs): ok
	        cgroup_linux_test.go:159: output got "3\n" want "2\n"
	    --- FAIL: TestCgroupGOMAXPROCSRound/150000 (0.00s)
	        cgroup_linux_test.go:156: /tmp/go-build1753772192/testprog.exe PrintGOMAXPROCS (850.897µs): ok
	        cgroup_linux_test.go:159: output got "3\n" want "2\n"
	--- FAIL: TestCgroupGOMAXPROCSSchedAffinity (0.00s)
	    cgroup_linux_test.go:229: /tmp/go-build1753772192/testprog.exe PrintGOMAXPROCS (867.987µs): ok
	    cgroup_linux_test.go:232: output got "3\n" want "2\n"
	FAIL
	FAIL runtime 23.088s
	----
	This change excludes GOMAXPROCS when building the test program so that it does not affect the tests.
	Change-Id: I590d9eca57026539413cf4c93b37f624f179d534
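	As a rough illustration of the idea (not the actual internal/testenv change), one way to keep an inherited GOMAXPROCS from reaching a child process is to filter it out of the environment before assigning it to the command. The helper name withoutEnv is hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// withoutEnv returns a copy of env with every "name=value" entry for the
// given variable removed. Hypothetical helper for illustration only; it is
// not part of internal/testenv. The match is case-sensitive.
func withoutEnv(env []string, name string) []string {
	prefix := name + "="
	out := make([]string, 0, len(env))
	for _, kv := range env {
		if strings.HasPrefix(kv, prefix) {
			continue // drop the variable so the child falls back to its default
		}
		out = append(out, kv)
	}
	return out
}

func main() {
	env := withoutEnv(os.Environ(), "GOMAXPROCS")
	// A caller would assign env to the Env field of an exec.Cmd before
	// running the build or the test program.
	for _, kv := range env {
		if strings.HasPrefix(kv, "GOMAXPROCS=") {
			fmt.Println("unexpected entry:", kv)
		}
	}
	fmt.Println("entries kept:", len(env))
}
```

	With the variable removed, a child process would compute its own default rather than inheriting the value from the shell profile.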
2025-07-15	runtime/maps: fix typo in group.go comment (instrinsified -> intrinsified)	dyma solovei
	Several comments refer to bitset as 'instrinsified', which is likely a typo, because it refers to the output of the intrinsics implemented with SIMD. Change-Id: I00f26b8d8128592ee0e9dc8a1b1480c93a9542d6 GitHub-Last-Rev: 8a4236710979f2f969210e0b261bdb9ae44f3321 GitHub-Pull-Request: golang/go#74624 Reviewed-on: https://go-review.googlesource.com/c/go/+/688016 Reviewed-by: Keith Randall <khr@golang.org> Auto-Submit: Keith Randall <khr@golang.org> Reviewed-by: Cherry Mui <cherryyz@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Keith Randall <khr@google.com>
2025-05-21	internal/runtime/cgroup: CPU cgroup limit discovery	Michael Pratt
	For #73193. Change-Id: I6a6a636ca9fa9cba429cf053468c56c2939cb1ac Reviewed-on: https://go-review.googlesource.com/c/go/+/668638 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Michael Knyszek <mknyszek@google.com>
2025-05-21	internal/runtime/cgroup: add line-by-line reader using a single scratch buffer	Michael Pratt
	Change-Id: I6a6a636ca21edcc6f16705fbb72a5241d4f7f22d Reviewed-on: https://go-review.googlesource.com/c/go/+/668637 Reviewed-by: Michael Knyszek <mknyszek@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2025-05-19	runtime: move atoi to internal/runtime/strconv	Michael Pratt
	Moving to a smaller package allows its use in other internal/runtime packages. This isn't internal/strconvlite since it can't be used directly by strconv. For #73193. Change-Id: I6a6a636c9c8b3f06b5fd6c07fe9dd5a7a37d1429 Reviewed-on: https://go-review.googlesource.com/c/go/+/672697 Reviewed-by: Michael Knyszek <mknyszek@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Auto-Submit: Michael Pratt <mpratt@google.com>
2025-05-19	internal/runtime/syscall: add basic file system calls	Michael Pratt
	Change-Id: I6a6a636c5e119165dc1018d1fc0354f5b6929656 Reviewed-on: https://go-review.googlesource.com/c/go/+/670496 Auto-Submit: Michael Pratt <mpratt@google.com> Reviewed-by: Michael Knyszek <mknyszek@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2025-05-16	runtime: prevent cleanup goroutines from missing work	Michael Anthony Knyszek
	Currently, there's a window of time where each cleanup goroutine has committed to going to sleep (immediately after full.pop() == nil) but hasn't yet marked itself as asleep (state.sleep()). If new work arrives in this window, it might get missed. This is what we see in #73642, and I can reproduce it with stress2. Side-note: even if the work gets missed by the existing sleeping goroutines, needg is incremented. So in theory a new goroutine will handle the work. Right now that doesn't happen in tests like the one running in #73642, where there might never be another call to AddCleanup to create the additional goroutine. Also, if we've hit the maximum on cleanup goroutines and all of them are in this window simultaneously, we can still end up missing work, it's just more rare. So this is still a problem even if we choose to just be more aggressive about creating new cleanup goroutines. This change fixes the problem and also aims to make the cleanup wake/sleep code clearer. The way this change fixes this problem is to have cleanup goroutines re-check the work list before going to sleep, but after having already marked themselves as sleeping. This way, if new work comes in before the cleanup goroutine marks itself as going to sleep, we can rely on the re-check to pick up that work. If new work comes after the goroutine marks itself as going to sleep and after the re-check, we can rely on the scheduler noticing that the goroutine is asleep and waking it up. If work comes in between a goroutine marking itself as sleeping and the re-check, then the re-check will catch that piece of work. However, the scheduler might now get a false signal that the goroutine is asleep and try to wake it up. This is OK. The sleeping signal is now mutated and double-checked under the queue lock, so the scheduler will grab the lock, may notice there are no sleeping goroutines, and go on its way. This may cause spurious lock acquisitions but it should be very rare. 
The window between a cleanup goroutine marking itself as going to sleep and re-checking the work list is a handful of instructions at most. This seems subtle but overall it's a simplification of the code. We rely more on the lock, which is easier to reason about, and we track two separate atomic variables instead of the merged cleanupSleepState: the length of the full list, and the number of cleanup goroutines that are asleep. The former is now the primary way to acquire work. Cleanup goroutines must decrement the length successfully to obtain an item off the full list. The number of cleanup goroutines asleep, meanwhile, is now only updated with the queue lock held. It can be checked without the lock held, and the invariant to make that safe is simple: it must always be an overestimate of the number of sleeping cleanup goroutines. The changes here do change some other behaviors. First, since we're tracking the length of the full list instead of the abstract concept of a wake-up, the waker can't consume wake-ups anymore. This means that cleanup goroutines may be created more aggressively. If two threads in the scheduler see that there are goroutines that are asleep, only one will win the race, but the other will observe zero asleep goroutines but potentially many work units available. This will cause it to signal many goroutines to be created. This is OK since we have a cap on the number of cleanup goroutines, and the race should be relatively rare. Second, because cleanup goroutines can now fail to go to sleep if any units of work come in, they might spend more time contended on the lock. For example, if we have N cleanup goroutines and work comes in at just the wrong rate, in the worst case we'll have each of G goroutines loop N times for N blocks, resulting in O(G*N) thread time to handle each block in the worst case. 
To paint a picture, imagine each goroutine trying to go to sleep and failing because a new block of work came in, with only one goroutine getting that block. Then once that goroutine is done, we all try again, fail because a new block of work came in, and so on and so forth. This case is unlikely, though, and probably not worth worrying about until it actually becomes a problem. (A similar problem exists with parking (and existed before this change, too) but at least in that case each goroutine parks, so it doesn't block the thread.) Fixes #73642. Change-Id: I6bbe1b789e7eb7e8168e56da425a6450fbad9625 Reviewed-on: https://go-review.googlesource.com/c/go/+/671676 Auto-Submit: Michael Knyszek <mknyszek@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Michael Pratt <mpratt@google.com>
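	The sleep protocol described above can be sketched in ordinary Go. This is a simplified model, not the runtime's code: workQueue, put, take, and the wake channel are hypothetical names. It tracks the same two counters (the queue length, which workers decrement to claim work, and the number of sleepers, which is written only with the lock held), and it does the re-check while still holding the lock, which is stricter than what the commit describes.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type workQueue struct {
	mu     sync.Mutex
	items  []int
	length atomic.Int64  // items available; a worker claims one by decrementing
	asleep atomic.Int64  // sleeping workers; only written with mu held
	wake   chan struct{} // one token per worker being woken
}

func (q *workQueue) put(v int) {
	q.mu.Lock()
	q.items = append(q.items, v)
	q.length.Add(1)
	q.mu.Unlock()
	if q.asleep.Load() > 0 { // cheap unlocked check; may be stale
		q.mu.Lock()
		if q.asleep.Load() > 0 { // double-check under the lock
			q.asleep.Add(-1)
			q.wake <- struct{}{}
		}
		q.mu.Unlock()
	}
}

func (q *workQueue) tryTake() (int, bool) {
	for {
		n := q.length.Load()
		if n == 0 {
			return 0, false
		}
		if q.length.CompareAndSwap(n, n-1) {
			break // claimed one item; it is safe to pop it now
		}
	}
	q.mu.Lock()
	v := q.items[0]
	q.items = q.items[1:]
	q.mu.Unlock()
	return v, true
}

func (q *workQueue) take() int {
	for {
		if v, ok := q.tryTake(); ok {
			return v
		}
		q.mu.Lock()
		q.asleep.Add(1)          // mark as asleep first...
		if q.length.Load() > 0 { // ...then re-check for work that just arrived
			q.asleep.Add(-1)
			q.mu.Unlock()
			continue
		}
		q.mu.Unlock()
		<-q.wake
	}
}

// run feeds 1..n to the workers and returns the sum of processed items.
// A negative item tells a worker to exit.
func run(workers, n int) int64 {
	q := &workQueue{wake: make(chan struct{}, workers)}
	var sum atomic.Int64
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for v := q.take(); v >= 0; v = q.take() {
				sum.Add(int64(v))
			}
		}()
	}
	for i := 1; i <= n; i++ {
		q.put(i)
	}
	for i := 0; i < workers; i++ {
		q.put(-1)
	}
	wg.Wait()
	return sum.Load()
}

func main() {
	fmt.Println(run(4, 1000))
}
```

	Because a worker only blocks after re-checking the length with the sleeper count already raised, a producer either sees the sleeper and wakes it, or the worker sees the new item and does not sleep.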
2025-05-07	internal/runtime/maps: make clear also erase tombstones	khr@golang.org
	This will make future uses of the map faster because the probe sequences will likely be shorter. Change-Id: If10f3af49a5feaff7d1b82337bbbfb93bcd9dcb5 Reviewed-on: https://go-review.googlesource.com/c/go/+/633076 Auto-Submit: Keith Randall <khr@golang.org> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Michael Pratt <mpratt@google.com> Reviewed-by: Keith Randall <khr@google.com>
2025-05-02	runtime: mark and scan small objects in whole spans [green tea]	Michael Anthony Knyszek
	Our current parallel mark algorithm suffers from frequent stalls on memory since its access pattern is essentially random. Small objects are the worst offenders, since each one forces pulling in at least one full cache line to access even when the amount to be scanned is far smaller than that. Each object also requires an independent access to per-object metadata. The purpose of this change is to improve garbage collector performance by scanning small objects in batches to obtain better cache locality than our current approach. The core idea behind this change is to defer marking and scanning small objects, and then scan them in batches localized to a span. This change adds scanned bits to each small object (<=512 bytes) span in addition to mark bits. The scanned bits indicate that the object has been scanned. (One way to think of them is "grey" bits and "black" bits in the tri-color mark-sweep abstraction.) Each of these spans is always 8 KiB and if they contain pointers, the pointer/scalar data is already packed together at the end of the span, allowing us to further optimize the mark algorithm for this specific case. When the GC encounters a pointer, it first checks if it points into a small object span. If so, it is first marked in the mark bits, and then the object is queued on a work-stealing P-local queue. This object represents the whole span, and we ensure that a span can only appear at most once in any queue by maintaining an atomic ownership bit for each span. Later, when the pointer is dequeued, we scan every object with a set mark that doesn't have a corresponding scanned bit. If it turns out that was the only object in the mark bits since the last time we scanned the span, we scan just that object directly, essentially falling back to the existing algorithm. noscan objects have no scan work, so they are never queued. Each span's mark and scanned bits are co-located together at the end of the span. 
Since the span is always 8 KiB in size, it can be found with simple pointer arithmetic. Next to the marks and scans we also store the size class, eliminating the need to access the span's mspan altogether. The work-stealing P-local queue is a new source of GC work. If this queue gets full, half of it is dumped to a global linked list of spans to scan. The regular scan queues are always prioritized over this queue to allow time for darts to accumulate. Stealing work from other Ps is a last resort. This change also adds a new debug mode under GODEBUG=gctrace=2 that dumps whole-span scanning statistics by size class on every GC cycle. A future extension to this CL is to use SIMD-accelerated scanning kernels for scanning spans with high mark bit density. For #19112. (Deadlock averted in GOEXPERIMENT.) For #73581. Change-Id: I4bbb4e36f376950a53e61aaaae157ce842c341bc Reviewed-on: https://go-review.googlesource.com/c/go/+/658036 Auto-Submit: Michael Knyszek <mknyszek@google.com> Reviewed-by: Michael Pratt <mpratt@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
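	A small sketch may help make the mark/scanned bookkeeping concrete. It is single-threaded and limited to 64 objects per span, whereas the runtime uses atomic bitmaps stored at the end of the span; spanBits, mark, scanPending, and spanBase are illustrative names, not runtime identifiers.

```go
package main

import (
	"fmt"
	"math/bits"
)

const spanSize = 8 << 10 // small object spans are always 8 KiB

// spanBase finds the start of the 8 KiB span containing p by masking off
// the low bits, which is the "simple pointer arithmetic" mentioned above.
func spanBase(p uintptr) uintptr { return p &^ (spanSize - 1) }

// spanBits holds one mark bit and one scanned bit per object.
type spanBits struct {
	marks   uint64
	scanned uint64
}

// mark sets the mark bit for object i and reports whether it was newly set.
func (s *spanBits) mark(i uint) bool {
	bit := uint64(1) << i
	if s.marks&bit != 0 {
		return false
	}
	s.marks |= bit
	return true
}

// scanPending returns the objects that are marked but not yet scanned and
// records them as scanned, so each object is scanned at most once.
func (s *spanBits) scanPending() []int {
	pending := s.marks &^ s.scanned
	s.scanned |= pending
	var objs []int
	for pending != 0 {
		objs = append(objs, bits.TrailingZeros64(pending))
		pending &= pending - 1 // clear the lowest set bit
	}
	return objs
}

func main() {
	var s spanBits
	s.mark(3)
	s.mark(10)
	fmt.Println(s.scanPending()) // objects accumulated since the last scan
	s.mark(5)
	fmt.Println(s.scanPending())
	fmt.Printf("%#x\n", spanBase(0x12345))
}
```

	Deferring the scan lets several marks accumulate in a span before it is visited, which is where the batching benefit described above would come from.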
2025-04-23	runtime: move some malloc constants to internal/runtime/gc	Michael Anthony Knyszek
	These constants are needed by some future generator programs. Change-Id: I5dccd009cbb3b2f321523bc0d8eaeb4c82e5df81 Reviewed-on: https://go-review.googlesource.com/c/go/+/655276 Reviewed-by: Cherry Mui <cherryyz@google.com> Auto-Submit: Michael Knyszek <mknyszek@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2025-04-23	runtime: move sizeclass defs to new package internal/runtime/gc	Michael Anthony Knyszek
	We will want to reference these definitions from new generator programs, and this is a good opportunity to clean up all these old C-style names. Change-Id: Ifb06f0afc381e2697e7877f038eca786610c96de Reviewed-on: https://go-review.googlesource.com/c/go/+/655275 Auto-Submit: Michael Knyszek <mknyszek@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Cherry Mui <cherryyz@google.com> Reviewed-by: Michael Pratt <mpratt@google.com>
2025-04-22	runtime, internal/runtime/maps: speed-up empty/zero map lookups	Mateusz Poliwczak
	This lets the inliner do a better job optimizing the mapKeyError call.
	goos: linux
	goarch: amd64
	pkg: runtime
	cpu: AMD Ryzen 5 4600G with Radeon Graphics
	                                 │ /tmp/before2 │            /tmp/after3             │
	                                 │    sec/op    │   sec/op     vs base               │
	MapAccessZero/Key=int64-12          1.875n ± 0%   1.875n ± 0%        ~ (p=0.506 n=25)
	MapAccessZero/Key=int32-12          1.875n ± 0%   1.875n ± 0%        ~ (p=0.082 n=25)
	MapAccessZero/Key=string-12         1.902n ± 1%   1.902n ± 1%        ~ (p=0.256 n=25)
	MapAccessZero/Key=mediumType-12     2.816n ± 0%   1.958n ± 0%  -30.47% (p=0.000 n=25)
	MapAccessZero/Key=bigType-12        2.815n ± 0%   1.935n ± 0%  -31.26% (p=0.000 n=25)
	MapAccessEmpty/Key=int64-12         1.942n ± 0%   2.109n ± 0%   +8.60% (p=0.000 n=25)
	MapAccessEmpty/Key=int32-12         2.110n ± 0%   1.940n ± 0%   -8.06% (p=0.000 n=25)
	MapAccessEmpty/Key=string-12        2.024n ± 0%   2.109n ± 0%   +4.20% (p=0.000 n=25)
	MapAccessEmpty/Key=mediumType-12    3.157n ± 0%   2.344n ± 0%  -25.75% (p=0.000 n=25)
	MapAccessEmpty/Key=bigType-12       3.054n ± 0%   2.115n ± 0%  -30.75% (p=0.000 n=25)
	geomean                             2.305n        2.011n       -12.75%
	Change-Id: Iee83930884dc4c8a791a711aa189a1c93b68d536 Reviewed-on: https://go-review.googlesource.com/c/go/+/663495 Reviewed-by: Dmitri Shuralyov <dmitshur@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Michael Pratt <mpratt@google.com>
2025-04-18	internal/runtime/maps: move tombstone test to swiss file	Michael Pratt
	This test fails on GOEXPERIMENT=noswissmap as it is testing behavior specific to swissmaps. Move it to map_swiss_test.go to skip it on noswissmap. We could also switch the test to use NewTestMap, which provides a swissmap even in GOEXPERIMENT=noswissmap, but that is tedious to use and noswissmap is going away soon anyway. For #70886. Cq-Include-Trybots: luci.golang.try:gotip-linux-amd64-longtest-noswissmap Change-Id: I6a6a636c5ec72217d936cd01e9da36ae127ea2c5 Reviewed-on: https://go-review.googlesource.com/c/go/+/666437 Reviewed-by: Keith Randall <khr@golang.org> Reviewed-by: Keith Randall <khr@google.com> Auto-Submit: Michael Pratt <mpratt@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2025-04-17	internal/runtime/maps: prune tombstones in maps before growing	Keith Randall
	Before growing, if there are lots of tombstones try to remove them. If we can remove enough, we can continue at the given size for a while longer. Fixes #70886 Change-Id: I71e0d873ae118bb35798314ec25e78eaa5340d73 Reviewed-on: https://go-review.googlesource.com/c/go/+/640955 Reviewed-by: Michael Pratt <mpratt@google.com> Reviewed-by: Keith Randall <khr@google.com> Auto-Submit: Keith Randall <khr@golang.org> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2025-04-08	internal/runtime/maps: pass proper func PC to race.WritePC/race.ReadPC	Mateusz Poliwczak
	Fixes #73191 Change-Id: I0f8a5a19faa745943a98476c7caf4c97ccdce184 Reviewed-on: https://go-review.googlesource.com/c/go/+/663175 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Michael Pratt <mpratt@google.com> Auto-Submit: Michael Pratt <mpratt@google.com> Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
2025-03-30	internal/runtime/maps: speed up small map lookups ~1.7x for unpredictable keys	thepudds
	On master, lookups on small Swiss Table maps (<= 8 elements) for non-specialized key types are seemingly a performance regression compared to the Go 1.23 map implementation (reported in #70849). Currently, a linear scan is used for gets in these cases.
	This CL changes (*Map).getWithKeySmall to instead use the SIMD or SWAR match on the control bytes to then jump to candidate matching slots, with sample results below for a 16-byte key. This especially helps the hit case when the key is unpredictable, which previously had to scan an unpredictable number of control bytes to find a candidate slot when the key is unpredictable.
	Separately, other CLs in this stack modify the main Swiss Table benchmarks to randomize lookup key order (vs. previously most of the benchmarks had a repeating lookup key ordering, which likely is predictable until the map is too big). We have sample results for the randomized key order benchmarks followed by results from the older benchmarks.
	The first table below is with randomized key order. For hits, the older results get slower as there are more elements. With this CL, we see hits for unpredictable key ordering (sizes 2-8) get a ~1.7x speedup from ~25ns to ~14ns, with a now consistent lookup time for the different sizes. (The 1 element size map has a predictable key ordering because there is only one key, and that reports a modest ~0.5ns or ~3% performance penalty). Misses for unpredictable key order get a ~1.3x speedup, from ~13ns to ~10ns, with similar results for the 1 element size.
	                                                     │ no-fix-new-bmarks │         fix-with-new-bmarks          │
	                                                     │      sec/op       │    sec/op      vs base               │
	MapSmallAccessHit/Key=smallType/Elem=int32/len=1-4        13.26n ±  0%     13.64n ±  0%    +2.90% (p=0.000 n=20)
	MapSmallAccessHit/Key=smallType/Elem=int32/len=2-4        19.47n ±  0%     13.62n ±  0%   -30.05% (p=0.000 n=20)
	MapSmallAccessHit/Key=smallType/Elem=int32/len=3-4        22.23n ±  0%     13.64n ±  0%   -38.68% (p=0.000 n=20)
	MapSmallAccessHit/Key=smallType/Elem=int32/len=4-4        23.98n ±  0%     13.64n ±  0%   -43.11% (p=0.000 n=20)
	MapSmallAccessHit/Key=smallType/Elem=int32/len=5-4        25.02n ±  0%     13.67n ±  0%   -45.35% (p=0.000 n=20)
	MapSmallAccessHit/Key=smallType/Elem=int32/len=6-4        25.77n ±  1%     13.68n ±  2%   -46.89% (p=0.000 n=20)
	MapSmallAccessHit/Key=smallType/Elem=int32/len=7-4        26.38n ±  0%     13.64n ±  0%   -48.28% (p=0.000 n=20)
	MapSmallAccessHit/Key=smallType/Elem=int32/len=8-4        26.31n ±  0%     13.71n ± 21%   -47.90% (p=0.000 n=20)
	MapSmallAccessMiss/Key=smallType/Elem=int32/len=1-4      13.055n ±  0%     9.815n ±  0%   -24.82% (p=0.000 n=20)
	MapSmallAccessMiss/Key=smallType/Elem=int32/len=2-4      13.070n ±  0%     9.813n ±  0%   -24.92% (p=0.000 n=20)
	MapSmallAccessMiss/Key=smallType/Elem=int32/len=3-4      13.060n ±  0%     9.819n ±  0%   -24.82% (p=0.000 n=20)
	MapSmallAccessMiss/Key=smallType/Elem=int32/len=4-4      13.075n ±  0%     9.816n ±  0%   -24.92% (p=0.000 n=20)
	MapSmallAccessMiss/Key=smallType/Elem=int32/len=5-4      13.060n ±  0%     9.826n ±  0%   -24.76% (p=0.000 n=20)
	MapSmallAccessMiss/Key=smallType/Elem=int32/len=6-4      13.095n ± 19%     9.834n ± 31%   -24.90% (p=0.000 n=20)
	MapSmallAccessMiss/Key=smallType/Elem=int32/len=7-4      13.075n ± 19%     9.822n ± 27%   -24.88% (p=0.000 n=20)
	MapSmallAccessMiss/Key=smallType/Elem=int32/len=8-4       13.11n ± 16%     12.14n ± 19%    -7.43% (p=0.000 n=20)
	The next table uses the original benchmarks from just before this CL stack (i.e., without shuffling lookup keys). With this CL, we see improvement that is directionally similar to the above results but not as large, presumably because the branches in the linear scan are fairly predictable with predictable keys. (The numbers here also include the time from a mod in the benchmark code, which seemed to take around ~1/3 of CPU time based on spot checking a couple of examples, vs. the modified benchmarks shown above have removed that mod).
	                                                     │ master-8c3e391573 │       just-fix-with-old-bmarks       │
	                                                     │      sec/op       │    sec/op      vs base               │
	MapSmallAccessHit/Key=smallType/Elem=int32/len=1-4        20.85n ±  0%     21.69n ±  0%    +4.03% (p=0.000 n=20)
	MapSmallAccessHit/Key=smallType/Elem=int32/len=2-4        21.22n ±  0%     21.70n ±  0%    +2.24% (p=0.000 n=20)
	MapSmallAccessHit/Key=smallType/Elem=int32/len=3-4        21.73n ±  0%     21.71n ±  0%         ~ (p=0.158 n=20)
	MapSmallAccessHit/Key=smallType/Elem=int32/len=4-4        22.06n ±  0%     21.71n ±  0%    -1.56% (p=0.000 n=20)
	MapSmallAccessHit/Key=smallType/Elem=int32/len=5-4        22.41n ±  0%     21.73n ±  0%    -3.01% (p=0.000 n=20)
	MapSmallAccessHit/Key=smallType/Elem=int32/len=6-4        22.71n ±  0%     21.72n ±  0%    -4.38% (p=0.000 n=20)
	MapSmallAccessHit/Key=smallType/Elem=int32/len=7-4        22.98n ±  0%     21.71n ±  0%    -5.53% (p=0.000 n=20)
	MapSmallAccessHit/Key=smallType/Elem=int32/len=8-4        23.20n ±  0%     21.72n ±  0%    -6.36% (p=0.000 n=20)
	MapSmallAccessMiss/Key=smallType/Elem=int32/len=1-4       19.95n ±  0%     17.30n ±  0%   -13.28% (p=0.000 n=20)
	MapSmallAccessMiss/Key=smallType/Elem=int32/len=2-4       19.96n ±  0%     17.31n ±  0%   -13.28% (p=0.000 n=20)
	MapSmallAccessMiss/Key=smallType/Elem=int32/len=3-4       19.95n ±  0%     17.29n ±  0%   -13.33% (p=0.000 n=20)
	MapSmallAccessMiss/Key=smallType/Elem=int32/len=4-4       19.95n ±  0%     17.30n ±  0%   -13.29% (p=0.000 n=20)
	MapSmallAccessMiss/Key=smallType/Elem=int32/len=5-4       19.96n ± 25%     17.32n ±  0%   -13.22% (p=0.000 n=20)
	MapSmallAccessMiss/Key=smallType/Elem=int32/len=6-4       19.99n ± 24%     17.29n ±  0%   -13.51% (p=0.000 n=20)
	MapSmallAccessMiss/Key=smallType/Elem=int32/len=7-4       19.97n ± 20%     17.34n ± 16%   -13.14% (p=0.000 n=20)
	MapSmallAccessMiss/Key=smallType/Elem=int32/len=8-4       20.02n ± 11%     17.33n ± 14%   -13.44% (p=0.000 n=20)
	geomean                                                   21.02n           19.39n          -7.78%
	See #70849 for additional benchmark results, including results for arm64 (which also means without SIMD support).
	Updates #54766 Updates #70700 Fixes #70849 Change-Id: Ic2361bb6fc15b4436d1d1d5be7e4712e547f611b Reviewed-on: https://go-review.googlesource.com/c/go/+/634396 Reviewed-by: Michael Pratt <mpratt@google.com> Reviewed-by: Dmitri Shuralyov <dmitshur@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
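	For readers unfamiliar with SWAR matching, the following sketch shows the general trick of comparing all eight control bytes of a group against a 7-bit hash fragment in a few word-wide operations. It is a generic illustration, not the runtime's exact code: the names packCtrl, matchH2, and candidateSlots are made up here, and the trick can report false positives, so each candidate slot still needs a real key comparison.

```go
package main

import (
	"fmt"
	"math/bits"
)

const (
	lsb = 0x0101010101010101 // low bit of every byte
	msb = 0x8080808080808080 // high bit of every byte
)

// packCtrl packs eight control bytes into one word, slot 0 in the lowest byte.
func packCtrl(b [8]uint8) uint64 {
	var w uint64
	for i, c := range b {
		w |= uint64(c) << (8 * uint(i))
	}
	return w
}

// matchH2 returns a word whose high bit is set in every byte position where
// the control byte may equal h2. XOR turns matching bytes into zero, and
// (v - lsb) &^ v & msb is the classic zero-byte detector.
func matchH2(ctrl uint64, h2 uint8) uint64 {
	v := ctrl ^ (lsb * uint64(h2))
	return ((v - lsb) &^ v) & msb
}

// candidateSlots lists the slot indexes flagged in a match word.
func candidateSlots(m uint64) []int {
	var slots []int
	for m != 0 {
		slots = append(slots, bits.TrailingZeros64(m)/8)
		m &= m - 1 // clear the lowest set bit
	}
	return slots
}

func main() {
	// 0x80 stands in for an empty slot in this example.
	ctrl := packCtrl([8]uint8{0x12, 0x80, 0x80, 0x12, 0x80, 0x80, 0x80, 0x80})
	fmt.Println(candidateSlots(matchH2(ctrl, 0x12)))
}
```

	The lookup then jumps straight to the flagged slots instead of branching once per control byte, which is the property the commit message credits for the more consistent lookup time.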
2025-03-27	maps: implement faster clone	Keith Randall
	             │    base     │             experiment             │
	             │   sec/op    │   sec/op     vs base               │
	MapClone-24   66.802m ± 7%   3.348m ± 2%  -94.99% (p=0.000 n=10)
	Fixes #70836 Change-Id: I9e192b1ee82e18f5580ff18918307042a337fdcc Reviewed-on: https://go-review.googlesource.com/c/go/+/660175 Reviewed-by: Michael Pratt <mpratt@google.com> Auto-Submit: Keith Randall <khr@golang.org> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Keith Randall <khr@google.com>
2025-03-14	internal/runtime/atomic: add Xchg8 for s390x and wasm	Rhys Hiltner
	This makes the single-byte atomic.Xchg8 operation available on all GOARCHes, including those without direct / single-instruction support. Fixes #69735 Change-Id: Icb6aff8f907257db81ea440dc4d29f96b3cff6c4 Reviewed-on: https://go-review.googlesource.com/c/go/+/657936 Reviewed-by: Mauri de Souza Meneguzzo <mauri870@gmail.com> Reviewed-by: Keith Randall <khr@golang.org> Reviewed-by: David Chase <drchase@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Keith Randall <khr@google.com> Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com> TryBot-Result: Gopher Robot <gobot@golang.org>
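	On an architecture without a single-byte exchange instruction, one common approach is to emulate it with a compare-and-swap loop on the containing 32-bit word. The sketch below shows that idea with sync/atomic. It is an assumption about the general technique, not the s390x or wasm implementation, and xchg8 with a byte index is a hypothetical signature chosen to avoid unsafe pointer arithmetic.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// xchg8 atomically replaces byte idx of *word (0 is the least significant
// byte) with val and returns the byte that was there before.
func xchg8(word *uint32, idx uint, val uint8) uint8 {
	shift := idx * 8
	mask := uint32(0xFF) << shift
	for {
		old := atomic.LoadUint32(word)
		next := old&^mask | uint32(val)<<shift
		// Retry if another writer changed any byte of the word meanwhile.
		if atomic.CompareAndSwapUint32(word, old, next) {
			return uint8(old >> shift)
		}
	}
}

func main() {
	w := uint32(0x11223344)
	prev := xchg8(&w, 1, 0xAB)
	fmt.Printf("previous byte %#x, word now %#x\n", prev, w)
}
```

	A real implementation working from a byte address would also have to derive the word address and the shift from the pointer, taking the platform's byte order into account.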
2025-03-11	runtime/internal: clean up completely	Jes Cok
	We've been slowly moving packages from runtime/internal to internal/runtime. For now, runtime/internal only has test packages. It's a good chance to clean up the references to runtime/internal in the toolchain. For #65355. Change-Id: Ie6f9091a44511d0db9946ea6de7a78d3afe9f063 GitHub-Last-Rev: fad32e2e81d11508e734c3c3d3b0c1da583f89f5 GitHub-Pull-Request: golang/go#72137 Reviewed-on: https://go-review.googlesource.com/c/go/+/655515 Reviewed-by: Michael Pratt <mpratt@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: David Chase <drchase@google.com>
2025-03-10	internal/runtime/atomic: updated go assembler comments	Prabhav Dogra
	Updated comments in go assembler package Change-Id: I174e344ca45fae6ef70af2e0b29cd783b003b4c2 GitHub-Last-Rev: 8ab37208891e795561a943269ca82b1ce6e7eef5 GitHub-Pull-Request: golang/go#72048 Reviewed-on: https://go-review.googlesource.com/c/go/+/654478 Reviewed-by: Keith Randall <khr@golang.org> Auto-Submit: Keith Randall <khr@golang.org> Reviewed-by: Keith Randall <khr@google.com> Reviewed-by: David Chase <drchase@google.com> Reviewed-by: PRABHAV DOGRA <prabhavdogra1@gmail.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2025-03-04	runtime: decorate anonymous memory mappings	Lénaïc Huard
	Leverage the prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, ...) API to name the anonymous memory areas. This API was introduced in Linux 5.17 to decorate the anonymous memory areas shown in /proc/<pid>/maps. This is already used by glibc. See:
	* https://sourceware.org/git/?p=glibc.git;a=blob;f=malloc/malloc.c;h=27dfd1eb907f4615b70c70237c42c552bb4f26a8;hb=HEAD#l2434
	* https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/setvmaname.c;h=ea93a5ffbebc9e5a7e32a297138f465724b4725f;hb=HEAD#l63
	This can be useful when investigating the memory consumption of a multi-language program. On a 100% Go program, the pprof profiler can be used to profile the memory consumption of the program. But pprof is only aware of what happens within the Go world. On a multi-language program, there could be doubt about whether suspicious extra memory consumption comes from the Go part or the native part. With this change, the following Go program:
	    package main

	    import (
	        "fmt"
	        "log"
	        "os"
	    )

	    /*
	    #include <stdlib.h>

	    void f(void)
	    {
	        (void)malloc(1024*1024*1024);
	    }
	    */
	    import "C"

	    func main() {
	        C.f()

	        data, err := os.ReadFile("/proc/self/maps")
	        if err != nil {
	            log.Fatal(err)
	        }
	        fmt.Println(string(data))
	    }
	produces this output:
	$ GLIBC_TUNABLES=glibc.mem.decorate_maps=1 ~/doc/devel/open-source/go/bin/go run .
00400000-00402000 r--p 00000000 00:21 28451768 /home/lenaic/.cache/go-build/9f/9f25a17baed5a80d03eb080a2ce2a5ff49c17f9a56e28330f0474a2bb74a30a0-d/test_vma_name 00402000-004a4000 r-xp 00002000 00:21 28451768 /home/lenaic/.cache/go-build/9f/9f25a17baed5a80d03eb080a2ce2a5ff49c17f9a56e28330f0474a2bb74a30a0-d/test_vma_name 004a4000-00574000 r--p 000a4000 00:21 28451768 /home/lenaic/.cache/go-build/9f/9f25a17baed5a80d03eb080a2ce2a5ff49c17f9a56e28330f0474a2bb74a30a0-d/test_vma_name 00574000-00575000 r--p 00173000 00:21 28451768 /home/lenaic/.cache/go-build/9f/9f25a17baed5a80d03eb080a2ce2a5ff49c17f9a56e28330f0474a2bb74a30a0-d/test_vma_name 00575000-00580000 rw-p 00174000 00:21 28451768 /home/lenaic/.cache/go-build/9f/9f25a17baed5a80d03eb080a2ce2a5ff49c17f9a56e28330f0474a2bb74a30a0-d/test_vma_name 00580000-005a4000 rw-p 00000000 00:00 0 2e075000-2e096000 rw-p 00000000 00:00 0 [heap] c000000000-c000400000 rw-p 00000000 00:00 0 [anon: Go: heap] c000400000-c004000000 ---p 00000000 00:00 0 [anon: Go: heap reservation] 777f40000000-777f40021000 rw-p 00000000 00:00 0 [anon: glibc: malloc arena] 777f40021000-777f44000000 ---p 00000000 00:00 0 777f44000000-777f44021000 rw-p 00000000 00:00 0 [anon: glibc: malloc arena] 777f44021000-777f48000000 ---p 00000000 00:00 0 777f48000000-777f48021000 rw-p 00000000 00:00 0 [anon: glibc: malloc arena] 777f48021000-777f4c000000 ---p 00000000 00:00 0 777f4c000000-777f4c021000 rw-p 00000000 00:00 0 [anon: glibc: malloc arena] 777f4c021000-777f50000000 ---p 00000000 00:00 0 777f50000000-777f50021000 rw-p 00000000 00:00 0 [anon: glibc: malloc arena] 777f50021000-777f54000000 ---p 00000000 00:00 0 777f55afb000-777f55afc000 ---p 00000000 00:00 0 777f55afc000-777f562fc000 rw-p 00000000 00:00 0 [anon: glibc: pthread stack: 216378] 777f562fc000-777f562fd000 ---p 00000000 00:00 0 777f562fd000-777f56afd000 rw-p 00000000 00:00 0 [anon: glibc: pthread stack: 216377] 777f56afd000-777f56afe000 ---p 00000000 00:00 0 777f56afe000-777f572fe000 rw-p 00000000 
00:00 0 [anon: glibc: pthread stack: 216376] 777f572fe000-777f572ff000 ---p 00000000 00:00 0 777f572ff000-777f57aff000 rw-p 00000000 00:00 0 [anon: glibc: pthread stack: 216375] 777f57aff000-777f57b00000 ---p 00000000 00:00 0 777f57b00000-777f58300000 rw-p 00000000 00:00 0 [anon: glibc: pthread stack: 216374] 777f58300000-777f58400000 rw-p 00000000 00:00 0 [anon: Go: page alloc index] 777f58400000-777f5a400000 rw-p 00000000 00:00 0 [anon: Go: heap index] 777f5a400000-777f6a580000 ---p 00000000 00:00 0 [anon: Go: scavenge index] 777f6a580000-777f6a581000 rw-p 00000000 00:00 0 [anon: Go: scavenge index] 777f6a581000-777f7a400000 ---p 00000000 00:00 0 [anon: Go: scavenge index] 777f7a400000-777f8a580000 ---p 00000000 00:00 0 [anon: Go: page summary] 777f8a580000-777f8a581000 rw-p 00000000 00:00 0 [anon: Go: page alloc] 777f8a581000-777f9c430000 ---p 00000000 00:00 0 [anon: Go: page summary] 777f9c430000-777f9c431000 rw-p 00000000 00:00 0 [anon: Go: page alloc] 777f9c431000-777f9e806000 ---p 00000000 00:00 0 [anon: Go: page summary] 777f9e806000-777f9e807000 rw-p 00000000 00:00 0 [anon: Go: page alloc] 777f9e807000-777f9ec00000 ---p 00000000 00:00 0 [anon: Go: page summary] 777f9ec36000-777f9ecb6000 rw-p 00000000 00:00 0 [anon: Go: immortal metadata] 777f9ecb6000-777f9ecc6000 rw-p 00000000 00:00 0 [anon: Go: gc bits] 777f9ecc6000-777f9ecd6000 rw-p 00000000 00:00 0 [anon: Go: allspans array] 777f9ecd6000-777f9ece7000 rw-p 00000000 00:00 0 [anon: Go: immortal metadata] 777f9ece7000-777f9ed67000 ---p 00000000 00:00 0 [anon: Go: page summary] 777f9ed67000-777f9ed68000 rw-p 00000000 00:00 0 [anon: Go: page alloc] 777f9ed68000-777f9ede7000 ---p 00000000 00:00 0 [anon: Go: page summary] 777f9ede7000-777f9ee07000 rw-p 00000000 00:00 0 [anon: Go: page alloc] 777f9ee07000-777f9ee0a000 rw-p 00000000 00:00 0 [anon: glibc: loader malloc] 777f9ee0a000-777f9ee2e000 r--p 00000000 00:21 48158213 /usr/lib/libc.so.6 777f9ee2e000-777f9ef9f000 r-xp 00024000 00:21 48158213 
/usr/lib/libc.so.6 777f9ef9f000-777f9efee000 r--p 00195000 00:21 48158213 /usr/lib/libc.so.6 777f9efee000-777f9eff2000 r--p 001e3000 00:21 48158213 /usr/lib/libc.so.6 777f9eff2000-777f9eff4000 rw-p 001e7000 00:21 48158213 /usr/lib/libc.so.6 777f9eff4000-777f9effc000 rw-p 00000000 00:00 0 777f9effc000-777f9effe000 rw-p 00000000 00:00 0 [anon: glibc: loader malloc] 777f9f00a000-777f9f04a000 rw-p 00000000 00:00 0 [anon: Go: immortal metadata] 777f9f04a000-777f9f04c000 r--p 00000000 00:00 0 [vvar] 777f9f04c000-777f9f04e000 r--p 00000000 00:00 0 [vvar_vclock] 777f9f04e000-777f9f050000 r-xp 00000000 00:00 0 [vdso] 777f9f050000-777f9f051000 r--p 00000000 00:21 48158204 /usr/lib/ld-linux-x86-64.so.2 777f9f051000-777f9f07a000 r-xp 00001000 00:21 48158204 /usr/lib/ld-linux-x86-64.so.2 777f9f07a000-777f9f085000 r--p 0002a000 00:21 48158204 /usr/lib/ld-linux-x86-64.so.2 777f9f085000-777f9f087000 r--p 00034000 00:21 48158204 /usr/lib/ld-linux-x86-64.so.2 777f9f087000-777f9f088000 rw-p 00036000 00:21 48158204 /usr/lib/ld-linux-x86-64.so.2 777f9f088000-777f9f089000 rw-p 00000000 00:00 0 7ffc7bfa7000-7ffc7bfc8000 rw-p 00000000 00:00 0 [stack] ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0 [vsyscall] The anonymous memory areas are now labelled so that we can see which ones have been allocated by the Go runtime versus which ones have been allocated by the glibc. Fixes #71546 Change-Id: I304e8b4dd7f2477a6da794fd44e9a7a5354e4bf4 Reviewed-on: https://go-review.googlesource.com/c/go/+/646095 Auto-Submit: Alan Donovan <adonovan@google.com> Commit-Queue: Alan Donovan <adonovan@google.com> Reviewed-by: Felix Geisendörfer <felix.geisendoerfer@datadoghq.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Michael Knyszek <mknyszek@google.com> Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
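	To show how the labels can be put to use, here is a small sketch that totals the size of each named anonymous mapping from text in the /proc/<pid>/maps format shown above. The function anonSizes is a hypothetical helper, and the two sample lines are copied from that output; on Linux the input could instead come from os.ReadFile("/proc/self/maps").

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// anonSizes sums, per label, the sizes in bytes of all mappings whose name
// starts with "[anon:". Lines that do not parse are skipped.
func anonSizes(maps string) map[string]uint64 {
	sizes := make(map[string]uint64)
	for _, line := range strings.Split(maps, "\n") {
		i := strings.Index(line, "[anon:")
		if i < 0 {
			continue
		}
		name := strings.TrimSpace(line[i:])
		addr, _, _ := strings.Cut(line, " ") // first field is "start-end" in hex
		lo, hi, ok := strings.Cut(addr, "-")
		if !ok {
			continue
		}
		start, err1 := strconv.ParseUint(lo, 16, 64)
		end, err2 := strconv.ParseUint(hi, 16, 64)
		if err1 != nil || err2 != nil || end < start {
			continue
		}
		sizes[name] += end - start
	}
	return sizes
}

func main() {
	sample := "c000000000-c000400000 rw-p 00000000 00:00 0 [anon: Go: heap]\n" +
		"777f9ecb6000-777f9ecc6000 rw-p 00000000 00:00 0 [anon: Go: gc bits]\n"
	for name, size := range anonSizes(sample) {
		fmt.Printf("%-24s %d bytes\n", name, size)
	}
}
```

	Grouping by label in this way separates the regions attributed to the Go runtime from those attributed to glibc, which is the distinction the commit message is after.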
2025-02-21	internal/runtime/atomic: add Xchg8 for mipsx	Julian Zhu
	For #69735 Change-Id: I2a0336214786e14b9a37834d81a0a0d14231451c Reviewed-on: https://go-review.googlesource.com/c/go/+/651315 Reviewed-by: Michael Knyszek <mknyszek@google.com> Reviewed-by: Keith Randall <khr@google.com> Auto-Submit: Keith Randall <khr@golang.org> Reviewed-by: Keith Randall <khr@golang.org> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2025-02-20	internal/runtime/atomic: add Xchg8 for mips64x	Julian Zhu
	For #69735 Change-Id: Ide6b3077768a96b76078e5d4f6460596b8ff1560 Reviewed-on: https://go-review.googlesource.com/c/go/+/631756 Reviewed-by: Keith Randall <khr@golang.org> Reviewed-by: Michael Knyszek <mknyszek@google.com> Reviewed-by: Keith Randall <khr@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Auto-Submit: Keith Randall <khr@golang.org>
2025-02-19	internal/runtime/atomic: add Xchg8 for riscv64	Julian Zhu
	For #69735 Change-Id: I34ca2b027494525ab64f94beee89ca373a5031ae Reviewed-on: https://go-review.googlesource.com/c/go/+/631615 Reviewed-by: Michael Knyszek <mknyszek@google.com> Reviewed-by: Keith Randall <khr@golang.org> Reviewed-by: Mark Ryan <markdryan@rivosinc.com> Reviewed-by: Mauri de Souza Meneguzzo <mauri870@gmail.com> Reviewed-by: Keith Randall <khr@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2025-02-16	runtime/maps: fix typo in group.go comment (H1 -> H2)	Artyom Litovets
	Fixes a typo to correctly describe the hash bits of the control word. Change-Id: Id3c2ae0bd529e579a95258845f9d8028e23d10d2 GitHub-Last-Rev: 1baa81be5d292d5625d5d7788b8ea090453f962c GitHub-Pull-Request: golang/go#71730 Reviewed-on: https://go-review.googlesource.com/c/go/+/649416 Reviewed-by: Keith Randall <khr@golang.org> Auto-Submit: Keith Randall <khr@golang.org> Reviewed-by: Keith Randall <khr@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Ian Lance Taylor <iant@google.com>
2025-01-14	internal/runtime/maps: re-enable some tests	Keith Randall
	Re-enable tests for stack-allocated maps and fast map accessors. Those are implemented now. Update #54766 Change-Id: I8c019702bd9fb077b2fe3f7c78e8e9e10d2263a6 Reviewed-on: https://go-review.googlesource.com/c/go/+/642376 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Keith Randall <khr@google.com> Reviewed-by: Michael Pratt <mpratt@google.com> Auto-Submit: Keith Randall <khr@golang.org>
2024-12-21	cmd/compile: load map length with the right type	Cherry Mui
	len(map) is lowered to loading the first field of the map structure, which is the length. Currently it is a load of an int. With the old map, the first field is indeed an int. With Swiss map, however, it is a uint64. On big-endian 32-bit machine, loading an (32-bit) int from a uint64 would load just the high bits, which are (probably) all 0. Change to a load with the proper type. Fixes #70248. Change-Id: I39cf2d1e6658dac5a8de25c858e1580e2a14b894 Reviewed-on: https://go-review.googlesource.com/c/go/+/638375 Run-TryBot: Cherry Mui <cherryyz@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> TryBot-Result: Gopher Robot <gobot@golang.org> Reviewed-by: Keith Randall <khr@google.com> Reviewed-by: Keith Randall <khr@golang.org>
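The big-endian problem described above can be reproduced in ordinary Go without touching the compiler. This is only an illustration of the byte layout, assuming a length stored as a big-endian uint64; the helper name `load32AtStart` is hypothetical.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// load32AtStart mimics a 32-bit load from the address of a 64-bit value
// laid out in big-endian byte order. Only the first four bytes are read,
// and on a big-endian layout those are the high half of the value.
func load32AtStart(length uint64) uint32 {
	var buf [8]byte
	binary.BigEndian.PutUint64(buf[:], length)
	return binary.BigEndian.Uint32(buf[:4])
}

func main() {
	// A map with 5 entries: the narrow load sees only the zero high bits.
	fmt.Println(load32AtStart(5))
	// Only when the length exceeds 32 bits does the high half become non-zero.
	fmt.Println(load32AtStart(1<<32 + 5))
}
```

On a little-endian layout the same narrow load would happen to read the low half, which is why the mismatch is only visible on big-endian 32-bit targets.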
2024-12-06	cmd/internal/objabi, internal/runtime: increase nosplit limit on OpenBSD	Damien Neil
	OpenBSD is bumping up against the nosplit limit, and openbsd/ppc64 is over it. Increase StackGuardMultiplier on OpenBSD, matching AIX. Change-Id: I61e17c99ce77e1fd3f368159dc4615aeae99e913 Reviewed-on: https://go-review.googlesource.com/c/go/+/632996 Reviewed-by: Keith Randall <khr@golang.org> Reviewed-by: Keith Randall <khr@google.com> Auto-Submit: Damien Neil <dneil@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Cherry Mui <cherryyz@google.com>
2024-11-21	all: fix some function names and typos in comment	cuishuang
	Change-Id: I07e7c8eaa5bd4bac0d576b2f2f4cd3f81b0b77a4 Reviewed-on: https://go-review.googlesource.com/c/go/+/630055 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Commit-Queue: Ian Lance Taylor <iant@google.com> Reviewed-by: Ian Lance Taylor <iant@google.com> Reviewed-by: Russ Cox <rsc@golang.org> Auto-Submit: Ian Lance Taylor <iant@google.com>
2024-11-21	internal/copyright: add test that copyright notices exist	Russ Cox
	We shouldn't spend human code review time checking this. Let the computer check. Change-Id: I6de9d733c128d833b958b0e43a52b564e8f82dd3 Reviewed-on: https://go-review.googlesource.com/c/go/+/630417 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Sam Thanawalla <samthanawalla@google.com>
2024-11-20	cmd/compile: intrinsify swissmap match calls with SIMD on amd64	Michael Pratt
	Use similar SIMD operations to the ones used in Abseil. We still use 8-slot groups (even though the XMM registers could handle 16-slot groups) to keep the implementation simpler (no changes to the memory layout of maps). Still, the implementations of matchH2 and matchEmpty are shorter than the portable version using standard arithmetic operations. They also return a packed bitset, which avoids the need to shift in bitset.first. That said, the packed bitset is a downside in cognitive complexity, as we have to think about two different possible representations. This doesn't leak out of the API, but we do need to intrinsify bitset to switch to a compatible implementation. The compiler's intrinsics don't support intrinsifying methods, so the implementations move to free functions. This makes operations between 0-3% faster on my machine. e.g., MapGetHit/impl=runtimeMap/t=Int64/len=6-12 12.34n ± 1% 11.42n ± 1% -7.46% (p=0.000 n=25) MapGetHit/impl=runtimeMap/t=Int64/len=12-12 15.14n ± 2% 14.88n ± 1% -1.72% (p=0.009 n=25) MapGetHit/impl=runtimeMap/t=Int64/len=18-12 15.04n ± 6% 14.66n ± 2% -2.53% (p=0.000 n=25) MapGetHit/impl=runtimeMap/t=Int64/len=24-12 15.80n ± 1% 15.48n ± 3% ~ (p=0.444 n=25) MapGetHit/impl=runtimeMap/t=Int64/len=30-12 15.55n ± 4% 14.77n ± 3% -5.02% (p=0.004 n=25) MapGetHit/impl=runtimeMap/t=Int64/len=64-12 15.26n ± 1% 15.05n ± 1% ~ (p=0.055 n=25) MapGetHit/impl=runtimeMap/t=Int64/len=128-12 15.34n ± 1% 15.02n ± 2% -2.09% (p=0.000 n=25) MapGetHit/impl=runtimeMap/t=Int64/len=256-12 15.42n ± 1% 15.15n ± 1% -1.75% (p=0.001 n=25) MapGetHit/impl=runtimeMap/t=Int64/len=512-12 15.48n ± 1% 15.18n ± 1% -1.94% (p=0.000 n=25) MapGetHit/impl=runtimeMap/t=Int64/len=1024-12 17.38n ± 1% 17.05n ± 1% -1.90% (p=0.000 n=25) MapGetHit/impl=runtimeMap/t=Int64/len=2048-12 17.96n ± 0% 17.59n ± 1% -2.06% (p=0.000 n=25) MapGetHit/impl=runtimeMap/t=Int64/len=4096-12 18.36n ± 1% 18.18n ± 1% -0.98% (p=0.013 n=25) MapGetHit/impl=runtimeMap/t=Int64/len=8192-12 18.75n ± 0% 18.31n ± 
1% -2.35% (p=0.000 n=25) MapGetHit/impl=runtimeMap/t=Int64/len=65536-12 26.25n ± 0% 25.95n ± 1% -1.14% (p=0.000 n=25) MapGetHit/impl=runtimeMap/t=Int64/len=262144-12 44.24n ± 1% 44.06n ± 1% ~ (p=0.181 n=25) MapGetHit/impl=runtimeMap/t=Int64/len=1048576-12 85.02n ± 0% 85.35n ± 0% +0.39% (p=0.032 n=25) MapGetHit/impl=runtimeMap/t=Int64/len=4194304-12 98.87n ± 1% 98.85n ± 1% ~ (p=0.799 n=25) For #54766. Cq-Include-Trybots: luci.golang.try:gotip-linux-ppc64_power10,gotip-linux-amd64-goamd64v3 Change-Id: Ic1b852f02744404122cb3672900fd95f4625905e Reviewed-on: https://go-review.googlesource.com/c/go/+/626277 Reviewed-by: Keith Randall <khr@golang.org> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Auto-Submit: Michael Pratt <mpratt@google.com> Reviewed-by: Keith Randall <khr@google.com>
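For comparison with the SIMD version, here is a sketch of what a portable matchH2 "using standard arithmetic operations" can look like on an 8-slot control word. The names `matchH2`, `firstSlot` and `pack` are illustrative rather than the runtime's API, and the sketch assumes one control byte per slot with slot 0 in the lowest byte. The bit trick can, in rare cases, flag a slot that does not match, so a real lookup still has to compare keys.

```go
package main

import (
	"fmt"
	"math/bits"
)

const (
	lsb = 0x0101010101010101 // low bit of every byte
	msb = 0x8080808080808080 // high bit of every byte
)

// matchH2 returns a bitset with the high bit of byte i set when control
// byte i equals h2. XOR turns matching bytes into zero, and the
// (v - lsb) &^ v step flags zero bytes.
func matchH2(ctrls uint64, h2 uint8) uint64 {
	v := ctrls ^ (lsb * uint64(h2))
	return ((v - lsb) &^ v) & msb
}

// firstSlot converts the unpacked bitset into a slot index. The shift by 3
// is the extra step that a packed (one bit per slot) bitset avoids.
func firstSlot(match uint64) int {
	return bits.TrailingZeros64(match) >> 3
}

// pack builds a control word from eight control bytes, slot 0 lowest.
func pack(slots [8]uint8) uint64 {
	var w uint64
	for i, s := range slots {
		w |= uint64(s) << (8 * uint(i))
	}
	return w
}

func main() {
	// 0x80 marks an empty slot in this sketch; slots 0 and 2 are full.
	ctrls := pack([8]uint8{0x11, 0x80, 0x23, 0x80, 0x80, 0x80, 0x80, 0x80})
	m := matchH2(ctrls, 0x23)
	fmt.Printf("bitset=%#x first=%d\n", m, firstSlot(m))
}
```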
2024-11-20	cmd/compile, internal/runtime/atomic: add Xchg8 for loong64	Guoqi Chen
	In Loongson's new microstructure LA664 (Loongson-3A6000) and later, the atomic instruction AMSWAP[DB]{B,H} [1] is supported. Therefore, the implementation of the atomic operation exchange can be selected according to the CPUCFG flag LAM_BH: AMSWAPDBB(full barrier) instruction is used on new microstructures, and traditional LL-SC is used on LA464 (Loongson-3A5000) and older microstructures. This can significantly improve the performance of Go programs on new microstructures. Because Xchg8 implemented using traditional LL-SC uses too many temporary registers, it is not suitable for intrinsics. goos: linux goarch: loong64 pkg: internal/runtime/atomic cpu: Loongson-3A6000 @ 2500.00MHz BenchmarkXchg8 100000000 10.41 ns/op BenchmarkXchg8-2 100000000 10.41 ns/op BenchmarkXchg8-4 100000000 10.41 ns/op BenchmarkXchg8Parallel 96647592 12.41 ns/op BenchmarkXchg8Parallel-2 58376136 20.60 ns/op BenchmarkXchg8Parallel-4 78458899 17.97 ns/op goos: linux goarch: loong64 pkg: internal/runtime/atomic cpu: Loongson-3A5000-HV @ 2500.00MHz BenchmarkXchg8 38323825 31.23 ns/op BenchmarkXchg8-2 38368219 31.23 ns/op BenchmarkXchg8-4 37154156 31.26 ns/op BenchmarkXchg8Parallel 37908301 31.63 ns/op BenchmarkXchg8Parallel-2 30413440 39.42 ns/op BenchmarkXchg8Parallel-4 30737626 39.03 ns/op For #69735 [1]: https://loongson.github.io/LoongArch-Documentation/LoongArch-ELF-ABI-EN.html Change-Id: I02ba68f66a2210b6902344fdc9975eb62de728ab Reviewed-on: https://go-review.googlesource.com/c/go/+/623058 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: sophie zhao <zhaoxiaolin@loongson.cn> Reviewed-by: Meidan Li <limeidan@loongson.cn> Reviewed-by: Cherry Mui <cherryyz@google.com> Reviewed-by: Mauri de Souza Meneguzzo <mauri870@gmail.com> Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
2024-11-19	internal/runtime/maps: hash copy of key instead of key itself	Keith Randall
	Hashing the key means we have to take the address of it. That inhibits subsequent optimizations on the key variable. By hashing a copy, we incur an extra store at the hash callsite, but we no longer need a load of the key in the inner loop. It can live in a register throughout. (Technically, it gets spilled around the call to the hasher, but it gets restored outside the loop.) Maybe one day we can have special hash functions that take int64/int32/string instead of int64/int32/*string. Change-Id: Iba3133f6e82328f53c0abcb5eec13ee47c4969d1 Reviewed-on: https://go-review.googlesource.com/c/go/+/629419 Reviewed-by: Russ Cox <rsc@golang.org> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Michael Pratt <mpratt@google.com>
2024-11-19	internal/runtime/maps: assume constant elem offset with int64 and string keys	Keith Randall
	Note this doesn't work with int32 keys because alignment padding can change the offset of the element. Change-Id: I27804d3cfc7cc1b7f995f7e29630f0824f0ee899 Reviewed-on: https://go-review.googlesource.com/c/go/+/629418 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Michael Pratt <mpratt@google.com> Reviewed-by: Russ Cox <rsc@golang.org>
2024-11-19	internal/runtime/maps: use simpler calculation for slot element	Keith Randall
	This reduces the adds required at the return point from 3 to 1. (The multiply inside g.elem() does get CSE'd with the one inside g.key(), but the rest of the adds don't.) Instead, compute the element as just a fixed offset from the key. Change-Id: Ia4d7664efafcdca5e9daeb77d270651bb186232c Reviewed-on: https://go-review.googlesource.com/c/go/+/629535 Reviewed-by: Russ Cox <rsc@golang.org> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Michael Pratt <mpratt@google.com>
2024-11-19	crypto/subtle: add DIT closure	Roland Shoemaker
	Add a new function, WithDataIndependentTiming, which takes a function as an argument, and encloses it with calls to set/unset the DIT PSTATE bit on Arm64. Since DIT is OS thread-local, for the duration of the execution of WithDataIndependentTiming, we lock the goroutine to the OS thread, using LockOSThread. For long running operations, this is likely to not be performant, but we expect this to be tightly scoped around cryptographic operations that have bounded execution times. If locking to the OS thread turns out to be too slow, another option is to add a bit to the g state indicating if a goroutine has DIT enabled, and then have the scheduler enable/disable DIT when scheduling a g. Additionally, we add a new GODEBUG, dataindependenttiming, which allows setting DIT for an entire program. Running a program with dataindependenttiming=1 enables DIT for the program during initialization. In an ideal world PSTATE.DIT would be inherited from the parent thread, so we'd only need to set it in the main thread and then all subsequent threads would inherit the value. While this does happen in the Linux kernel [0], it is not the case for darwin [1]. Rather than add complex logic to only set it on darwin for each new thread, we just unconditionally set it in mstart1 and cgocallbackg1 regardless of the OS. DIT will already impose some overhead, and the cost of setting the bit is only ~two instructions (CALL, MSR), so it should be cheap enough. 
Fixes #66450 Updates #49702 [0] https://github.com/torvalds/linux/blob/e8bdb3c8be08c9a3edc0a373c0aa8729355a0705/arch/arm64/kernel/process.c#L373 [1] https://github.com/apple-oss-distributions/xnu/blob/8d741a5de7ff4191bf97d57b9f54c2f6d4a15585/osfmk/arm64/status.c#L1666 Change-Id: I78eda691ff9254b0415f2b54770e5850a0179749 Reviewed-on: https://go-review.googlesource.com/c/go/+/598336 Reviewed-by: Michael Knyszek <mknyszek@google.com> Reviewed-by: Filippo Valsorda <filippo@golang.org> Reviewed-by: Michael Pratt <mpratt@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
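A short usage sketch of the new function, assuming a toolchain that ships `crypto/subtle.WithDataIndependentTiming` (Go 1.24 or later). The wrapper `equalTags` is a hypothetical helper; on platforms without DIT support the call is expected to simply run the function.

```go
package main

import (
	"crypto/subtle"
	"fmt"
)

// equalTags compares two authentication tags inside a DIT-enabled scope.
// The closure is kept small, in line with the commit's advice to scope it
// tightly around operations with bounded execution time.
func equalTags(a, b []byte) bool {
	var eq int
	subtle.WithDataIndependentTiming(func() {
		eq = subtle.ConstantTimeCompare(a, b)
	})
	return eq == 1
}

func main() {
	fmt.Println(equalTags([]byte("tag-1"), []byte("tag-1")))
	fmt.Println(equalTags([]byte("tag-1"), []byte("tag-2")))
}
```

To enable DIT for a whole program instead, the commit's GODEBUG setting would be supplied at start-up, for example `GODEBUG=dataindependenttiming=1`.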
2024-11-19	cmd/compiler,internal/runtime/atomic: optimize Cas{64,32} on loong64	Guoqi Chen
	In Loongson's new microstructure LA664 (Loongson-3A6000) and later, the atomic compare-and-exchange instruction AMCAS[DB]{B,W,H,V} [1] is supported. Therefore, the implementation of the atomic operation compare-and-swap can be selected according to the CPUCFG flag LAMCAS: AMCASDB(full barrier) instruction is used on new microstructures, and traditional LL-SC is used on LA464 (Loongson-3A5000) and older microstructures. This can significantly improve the performance of Go programs on new microstructures. goos: linux goarch: loong64 pkg: internal/runtime/atomic cpu: Loongson-3A6000 @ 2500.00MHz \| bench.old \| bench.new \| \| sec/op \| sec/op vs base \| Cas 46.84n ± 0% 22.82n ± 0% -51.28% (p=0.000 n=20) Cas-2 47.58n ± 0% 29.57n ± 0% -37.85% (p=0.000 n=20) Cas-4 43.27n ± 20% 25.31n ± 13% -41.50% (p=0.000 n=20) Cas64 46.85n ± 0% 22.82n ± 0% -51.29% (p=0.000 n=20) Cas64-2 47.43n ± 0% 29.53n ± 0% -37.74% (p=0.002 n=20) Cas64-4 43.18n ± 0% 25.28n ± 2% -41.46% (p=0.000 n=20) geomean 45.82n 25.74n -43.82% goos: linux goarch: loong64 pkg: internal/runtime/atomic cpu: Loongson-3A5000 @ 2500.00MHz \| bench.old \| bench.new \| \| sec/op \| sec/op vs base \| Cas 50.05n ± 0% 51.26n ± 0% +2.42% (p=0.000 n=20) Cas-2 52.80n ± 0% 53.11n ± 0% +0.59% (p=0.000 n=20) Cas-4 55.97n ± 0% 57.31n ± 0% +2.39% (p=0.000 n=20) Cas64 50.05n ± 0% 51.26n ± 0% +2.42% (p=0.000 n=20) Cas64-2 52.68n ± 0% 53.11n ± 0% +0.82% (p=0.000 n=20) Cas64-4 55.96n ± 0% 57.26n ± 0% +2.33% (p=0.000 n=20) geomean 52.86n 53.83n +1.82% [1]: https://loongson.github.io/LoongArch-Documentation/LoongArch-ELF-ABI-EN.html Change-Id: I9b777c63c124fb492f61c903f77061fa2b4e5322 Reviewed-on: https://go-review.googlesource.com/c/go/+/613396 Reviewed-by: Meidan Li <limeidan@loongson.cn> Reviewed-by: David Chase <drchase@google.com> Reviewed-by: Dmitri Shuralyov <dmitshur@google.com> Reviewed-by: Qiqi Huang <huangqiqi@loongson.cn> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2024-11-18	internal/runtime/maps: don't copy indirect key/elem when growing maps	Keith Randall
	We can reuse the same indirect storage when growing, so we don't need an additional allocation. Change-Id: I57adb406becfbec648188ec66f4bb2e94d4b9cab Reviewed-on: https://go-review.googlesource.com/c/go/+/625902 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Michael Pratt <mpratt@google.com> Reviewed-by: Keith Randall <khr@google.com>
2024-11-18	internal/runtime/maps: fix noswiss builder	khr@golang.org
	Missed initializing a field in the stub that lets the noswiss builder test the swiss implementation. Change-Id: Ie093478ad3e4301e4fe88ba65c132a9dbccd89a9 Reviewed-on: https://go-review.googlesource.com/c/go/+/628895 Auto-Submit: Keith Randall <khr@golang.org> Reviewed-by: Cherry Mui <cherryyz@google.com> Reviewed-by: Keith Randall <khr@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2024-11-17	runtime/internal/maps: remove entryMask	Keith Randall
	It is easily recomputed as capacity-1. This reduces a table from 40 to 32 bytes (on 64-bit archs). That gets us down one sizeclass. Change-Id: Icb74fb2de50baa18ca62052c7b2fe8e6af4c8837 Reviewed-on: https://go-review.googlesource.com/c/go/+/625198 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Auto-Submit: Keith Randall <khr@golang.org> Reviewed-by: Michael Pratt <mpratt@google.com> Reviewed-by: Keith Randall <khr@google.com>
2024-11-17	internal/runtime/maps: simplify small group lookup	Keith Randall
	We don't really need the index of the slot we're looking at. Just keep looking until there are no more filled slots. This particularly helps when there are only a few filled entries (packed at the bottom), and we're looking for something that isn't there. We exit earlier than we would otherwise. goos: darwin goarch: arm64 pkg: runtime cpu: Apple M2 Ultra │ baseline │ experiment │ │ sec/op │ sec/op vs base │ MapSmallAccessHit/Key=int64/Elem=int64/len=1-24 2.759n ± 0% 2.779n ± 2% ~ (p=0.055 n=10) MapSmallAccessHit/Key=int64/Elem=int64/len=2-24 2.862n ± 1% 2.922n ± 1% +2.08% (p=0.000 n=10) MapSmallAccessHit/Key=int64/Elem=int64/len=3-24 3.003n ± 0% 3.061n ± 1% +1.91% (p=0.000 n=10) MapSmallAccessHit/Key=int64/Elem=int64/len=4-24 3.170n ± 1% 3.188n ± 1% +0.57% (p=0.030 n=10) MapSmallAccessHit/Key=int64/Elem=int64/len=5-24 3.387n ± 1% 3.391n ± 1% ~ (p=0.362 n=10) MapSmallAccessHit/Key=int64/Elem=int64/len=6-24 3.601n ± 1% 3.584n ± 0% -0.49% (p=0.009 n=10) MapSmallAccessHit/Key=int64/Elem=int64/len=7-24 3.785n ± 1% 3.778n ± 3% ~ (p=0.987 n=10) MapSmallAccessHit/Key=int64/Elem=int64/len=8-24 3.960n ± 1% 3.946n ± 1% ~ (p=0.256 n=10) MapSmallAccessMiss/Key=int64/Elem=int64/len=0-24 2.004n ± 1% MapSmallAccessMiss/Key=int64/Elem=int64/len=1-24 5.145n ± 1% 2.411n ± 1% -53.14% (p=0.000 n=10) MapSmallAccessMiss/Key=int64/Elem=int64/len=2-24 5.128n ± 0% 3.313n ± 1% -35.40% (p=0.000 n=10) MapSmallAccessMiss/Key=int64/Elem=int64/len=3-24 5.159n ± 1% 3.690n ± 1% -28.48% (p=0.000 n=10) MapSmallAccessMiss/Key=int64/Elem=int64/len=4-24 5.117n ± 1% 4.466n ± 6% -12.73% (p=0.000 n=10) MapSmallAccessMiss/Key=int64/Elem=int64/len=5-24 5.115n ± 1% 4.308n ± 1% -15.79% (p=0.000 n=10) MapSmallAccessMiss/Key=int64/Elem=int64/len=6-24 5.111n ± 1% 4.538n ± 2% -11.19% (p=0.000 n=10) MapSmallAccessMiss/Key=int64/Elem=int64/len=7-24 4.896n ± 4% 4.831n ± 1% -1.33% (p=0.001 n=10) MapSmallAccessMiss/Key=int64/Elem=int64/len=8-24 4.905n ± 1% 5.121n ± 1% +4.40% (p=0.000 n=10) geomean 3.917n 3.631n 
-11.11% Change-Id: Ife26ac457a513af24fa0921b839ee6cd5fed6fba Reviewed-on: https://go-review.googlesource.com/c/go/+/627717 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Keith Randall <khr@google.com> Reviewed-by: Michael Pratt <mpratt@google.com>
2024-11-17	internal/runtime/maps: eliminate a load from the hot path	Keith Randall
	typ.Group.Size involves two loads. Instead, cache GroupSize as a separate field of the map type so we can get to it in just one load. Change-Id: I10ffdce1c7f75dcf448da14040fda78f0d75fd1d Reviewed-on: https://go-review.googlesource.com/c/go/+/627716 Reviewed-by: Cherry Mui <cherryyz@google.com> Reviewed-by: Michael Pratt <mpratt@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2024-11-17	runtime/internal/maps: optimize long string keys for small maps	Keith Randall
	For large strings, do a quick equality check on all the slots. Only if more than one passes the quick equality check do we resort to hashing. │ baseline │ experiment │ │ sec/op │ sec/op vs base │ MegMap-24 16609.50n ± 1% 13.91n ± 3% -99.92% (p=0.000 n=10) MegOneMap-24 16655.00n ± 0% 12.27n ± 1% -99.93% (p=0.000 n=10) MegEqMap-24 41.31µ ± 1% 25.03µ ± 1% -39.40% (p=0.000 n=10) MegEmptyMap-24 2.034n ± 0% 2.027n ± 2% ~ (p=0.541 n=10) MegEmptyMapWithInterfaceKey-24 5.931n ± 2% 5.599n ± 1% -5.60% (p=0.000 n=10) MapStringKeysEight_16-24 8.473n ± 7% 8.224n ± 5% ~ (p=0.315 n=10) MapStringKeysEight_32-24 8.441n ± 2% 8.147n ± 1% -3.48% (p=0.002 n=10) MapStringKeysEight_64-24 8.769n ± 1% 8.517n ± 1% -2.87% (p=0.000 n=10) MapStringKeysEight_128-24 10.73n ± 4% 13.57n ± 8% +26.57% (p=0.000 n=10) MapStringKeysEight_256-24 12.97n ± 2% 14.35n ± 4% +10.64% (p=0.001 n=10) MapStringKeysEight_1M-24 17359.50n ± 3% 13.92n ± 4% -99.92% (p=0.000 n=10) Change-Id: I4cc2ea4edab12a4b03236de626c7bcf0f96b6cc0 Reviewed-on: https://go-review.googlesource.com/c/go/+/625905 Reviewed-by: Keith Randall <khr@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Michael Pratt <mpratt@google.com>
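The idea above can be sketched as follows. The exact quick test used by the runtime is not stated in the commit message, so this sketch assumes one plausible form (equal length plus equal first and last 8 bytes); `quickEqual`, `lookupSmall` and `lookupHashed` are hypothetical names, and the hashed path is replaced by a plain scan to keep the example self-contained.

```go
package main

import "fmt"

// quickEqual is a cheap filter for long keys: it never rejects an equal
// key, but may accept unequal keys that share length, head and tail.
func quickEqual(a, b string) bool {
	if len(a) != len(b) {
		return false
	}
	if len(a) < 16 {
		return a == b
	}
	return a[:8] == b[:8] && a[len(a)-8:] == b[len(b)-8:]
}

// lookupHashed stands in for the normal hash-based probe.
func lookupHashed(keys []string, key string) int {
	for i, k := range keys {
		if k == key {
			return i
		}
	}
	return -1
}

// lookupSmall returns the slot holding key, or -1, and reports whether it
// had to fall back to the hashed path.
func lookupSmall(keys []string, key string) (slot int, usedHash bool) {
	candidate := -1
	for i, k := range keys {
		if !quickEqual(k, key) {
			continue
		}
		if candidate >= 0 {
			// More than one slot passed the quick test: resort to hashing.
			return lookupHashed(keys, key), true
		}
		candidate = i
	}
	if candidate >= 0 && keys[candidate] == key {
		return candidate, false // one full comparison, no hash computed
	}
	return -1, false
}

func main() {
	keys := []string{"0123456789abcdefghij", "0123456789ABCDEFGHIJ"}
	fmt.Println(lookupSmall(keys, "0123456789ABCDEFGHIJ"))
}
```

With megabyte-sized keys the saving is that a miss, or a hit with a single candidate, never touches the bulk of the string for hashing, which is consistent with the large drops in the MegMap and MapStringKeysEight_1M rows.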
2024-11-13	internal/runtime/maps: use match to skip non-full slots in iteration	Michael Pratt
	Iteration over swissmaps with low load (think map with large hint but only one entry) is significantly regressed vs old maps. See noswiss vs swiss-tip below (+60%). Currently we visit every single slot and individually check if the slot is full or not. We can do much better by using the control word to find all full slots in a group in a single operation. This lets us skip completely empty groups for instance. Always using the control match approach is great for maps with low load, but is a regression for mostly full maps. Mostly full maps have the majority of slots full, so most calls to mapiternext will return the next slot. In that case, doing the full group match on every call is more expensive than checking the individual slot. Thus we take a hybrid approach: on each call, we first check an individual slot. If that slot is full, we're done. If that slot is non-full, then we fall back to doing full group matches. This trade-off works well. Both mostly empty and mostly full maps perform nearly as well as doing all matching and all individual, respectively. The fast path is placed above the slow path loop rather than combined (with some sort of `useMatch` variable) into a single loop to help the compiler's code generation. The compiler really struggles with code generation on a combined loop for some reason, yielding ~15% additional instructions/op. 
Comparison with old maps prior to this CL: │ noswiss │ swiss-tip │ │ sec/op │ sec/op vs base │ MapIter/Key=int64/Elem=int64/len=6-12 11.53n ± 2% 10.64n ± 2% -7.72% (p=0.002 n=6) MapIter/Key=int64/Elem=int64/len=64-12 10.180n ± 2% 9.670n ± 5% -5.01% (p=0.004 n=6) MapIter/Key=int64/Elem=int64/len=65536-12 10.78n ± 1% 10.15n ± 2% -5.84% (p=0.002 n=6) MapIterLowLoad/Key=int64/Elem=int64/len=6-12 6.116n ± 2% 6.840n ± 2% +11.84% (p=0.002 n=6) MapIterLowLoad/Key=int64/Elem=int64/len=64-12 2.403n ± 2% 3.892n ± 0% +61.95% (p=0.002 n=6) MapIterLowLoad/Key=int64/Elem=int64/len=65536-12 1.940n ± 3% 3.237n ± 1% +66.81% (p=0.002 n=6) MapPop/Key=int64/Elem=int64/len=6-12 66.20n ± 2% 60.14n ± 3% -9.15% (p=0.002 n=6) MapPop/Key=int64/Elem=int64/len=64-12 97.24n ± 1% 171.35n ± 1% +76.21% (p=0.002 n=6) MapPop/Key=int64/Elem=int64/len=65536-12 826.1n ± 12% 842.5n ± 10% ~ (p=0.937 n=6) geomean 17.93n 20.96n +16.88% After this CL: │ noswiss │ swiss-cl │ │ sec/op │ sec/op vs base │ MapIter/Key=int64/Elem=int64/len=6-12 11.53n ± 2% 10.90n ± 3% -5.42% (p=0.002 n=6) MapIter/Key=int64/Elem=int64/len=64-12 10.180n ± 2% 9.719n ± 9% -4.53% (p=0.043 n=6) MapIter/Key=int64/Elem=int64/len=65536-12 10.78n ± 1% 10.07n ± 2% -6.63% (p=0.002 n=6) MapIterLowLoad/Key=int64/Elem=int64/len=6-12 6.116n ± 2% 7.022n ± 1% +14.82% (p=0.002 n=6) MapIterLowLoad/Key=int64/Elem=int64/len=64-12 2.403n ± 2% 1.475n ± 1% -38.63% (p=0.002 n=6) MapIterLowLoad/Key=int64/Elem=int64/len=65536-12 1.940n ± 3% 1.210n ± 6% -37.67% (p=0.002 n=6) MapPop/Key=int64/Elem=int64/len=6-12 66.20n ± 2% 61.54n ± 2% -7.02% (p=0.002 n=6) MapPop/Key=int64/Elem=int64/len=64-12 97.24n ± 1% 110.10n ± 1% +13.23% (p=0.002 n=6) MapPop/Key=int64/Elem=int64/len=65536-12 826.1n ± 12% 504.7n ± 6% -38.91% (p=0.002 n=6) geomean 17.93n 15.29n -14.74% For #54766. 
Cq-Include-Trybots: luci.golang.try:gotip-linux-ppc64_power10 Change-Id: Ic07f9df763239e85be57873103df5007144fdaef Reviewed-on: https://go-review.googlesource.com/c/go/+/627156 Auto-Submit: Michael Pratt <mpratt@google.com> Reviewed-by: Keith Randall <khr@golang.org> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Keith Randall <khr@google.com>
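The hybrid approach from the commit above can be sketched on plain 64-bit control words. This is a simplified model, not the runtime's iterator: it assumes 8 control bytes per group, slot 0 in the lowest byte, and a clear high bit meaning "full". The names `iter`, `matchFull`, `isFull` and `fullSlots` are hypothetical.

```go
package main

import (
	"fmt"
	"math/bits"
)

const msb = 0x8080808080808080 // high bit of every control byte

// matchFull flags every full slot of a group in one operation.
func matchFull(ctrls uint64) uint64 { return ^ctrls & msb }

// isFull checks one slot individually.
func isFull(ctrls uint64, i int) bool {
	return uint8(ctrls>>(8*uint(i)))&0x80 == 0
}

type iter struct {
	groups []uint64
	slot   int // next global slot index to examine
}

// next returns the next full slot, or false when iteration is done.
func (it *iter) next() (int, bool) {
	total := len(it.groups) * 8
	if it.slot >= total {
		return -1, false
	}
	g, i := it.slot/8, it.slot%8
	// Fast path: in a mostly full map the very next slot is usually full.
	if isFull(it.groups[g], i) {
		s := it.slot
		it.slot++
		return s, true
	}
	// Slow path: group matches, skipping empty groups entirely.
	for ; g < len(it.groups); g, i = g+1, 0 {
		below := uint64(1)<<(8*uint(i)) - 1 // slots already visited
		if m := matchFull(it.groups[g]) &^ below; m != 0 {
			j := bits.TrailingZeros64(m) >> 3
			it.slot = g*8 + j + 1
			return g*8 + j, true
		}
	}
	it.slot = total
	return -1, false
}

// fullSlots drains an iterator over groups and collects the results.
func fullSlots(groups []uint64) []int {
	it := &iter{groups: groups}
	var out []int
	for s, ok := it.next(); ok; s, ok = it.next() {
		out = append(out, s)
	}
	return out
}

func main() {
	// Group 0 is empty; group 1 has a single full slot at index 3.
	fmt.Println(fullSlots([]uint64{0x8080808080808080, 0x8080808011808080}))
}
```

Keeping the fast path as a separate check ahead of the loop mirrors the structure the commit describes, rather than folding both strategies into one loop.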
2024-11-11	internal/runtime/maps: don't hash twice when deleting	Keith Randall
	│ baseline │ experiment │ │ sec/op │ sec/op vs base │ MapDeleteLargeKey-24 312.0n ± 6% 162.3n ± 5% -47.97% (p=0.000 n=10) Change-Id: I31f1f8e3c344cf8abf2e9eb4b51b78fcd67b93c4 Reviewed-on: https://go-review.googlesource.com/c/go/+/625906 Reviewed-by: Michael Pratt <mpratt@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Keith Randall <khr@google.com>
2024-11-11	internal/runtime/maps: get rid of a few obsolete TODOs	Keith Randall
	Change-Id: I7b3d95c0861ae2b6e0721b65aa75cda036435e9c Reviewed-on: https://go-review.googlesource.com/c/go/+/625903 Reviewed-by: Keith Randall <khr@google.com> Reviewed-by: Michael Pratt <mpratt@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2024-11-11	cmd/compiler,internal/runtime/atomic: optimize And{64,32,8} and Or{64,32,8} on loong64	Guoqi Chen
	Use loong64's atomic operation instructions AMANDDB{V,W,W} (full barrier) to implement And{64,32,8}, and AMORDB{V,W,W} (full barrier) to implement Or{64,32,8}. Intrinsify And{64,32,8} and Or{64,32,8}, and alias all of the And/Or operations into the sync/atomic package. goos: linux goarch: loong64 pkg: internal/runtime/atomic cpu: Loongson-3A6000-HV @ 2500.00MHz \| bench.old \| bench.new \| \| sec/op \| sec/op vs base \| And32 27.73n ± 0% 10.81n ± 0% -61.02% (p=0.000 n=20) And32Parallel 28.96n ± 0% 12.41n ± 0% -57.15% (p=0.000 n=20) And64 27.73n ± 0% 10.81n ± 0% -61.02% (p=0.000 n=20) And64Parallel 28.96n ± 0% 12.41n ± 0% -57.15% (p=0.000 n=20) Or32 27.62n ± 0% 10.81n ± 0% -60.86% (p=0.000 n=20) Or32Parallel 28.96n ± 0% 12.41n ± 0% -57.15% (p=0.000 n=20) Or64 27.62n ± 0% 10.81n ± 0% -60.86% (p=0.000 n=20) Or64Parallel 28.97n ± 0% 12.41n ± 0% -57.16% (p=0.000 n=20) And8 29.15n ± 0% 13.21n ± 0% -54.68% (p=0.000 n=20) And 27.71n ± 0% 12.82n ± 0% -53.74% (p=0.000 n=20) And8Parallel 28.99n ± 0% 14.46n ± 0% -50.12% (p=0.000 n=20) AndParallel 29.12n ± 0% 14.42n ± 0% -50.48% (p=0.000 n=20) Or8 28.31n ± 0% 12.81n ± 0% -54.75% (p=0.000 n=20) Or 27.72n ± 0% 12.81n ± 0% -53.79% (p=0.000 n=20) Or8Parallel 29.03n ± 0% 14.62n ± 0% -49.64% (p=0.000 n=20) OrParallel 29.12n ± 0% 14.42n ± 0% -50.49% (p=0.000 n=20) geomean 28.47n 12.58n -55.80% goos: linux goarch: loong64 pkg: internal/runtime/atomic cpu: Loongson-3A5000 @ 2500.00MHz \| bench.old \| bench.new \| \| sec/op \| sec/op vs base \| And32 30.02n ± 0% 14.81n ± 0% -50.67% (p=0.000 n=20) And32Parallel 30.83n ± 0% 15.61n ± 0% -49.37% (p=0.000 n=20) And64 30.02n ± 0% 14.81n ± 0% -50.67% (p=0.000 n=20) And64Parallel 30.83n ± 0% 15.61n ± 0% -49.37% (p=0.000 n=20) And8 30.42n ± 0% 14.41n ± 0% -52.63% (p=0.000 n=20) And 30.02n ± 0% 13.61n ± 0% -54.66% (p=0.000 n=20) And8Parallel 31.23n ± 0% 15.21n ± 0% -51.30% (p=0.000 n=20) AndParallel 30.83n ± 0% 14.41n ± 0% -53.26% (p=0.000 n=20) Or32 30.02n ± 0% 14.81n ± 0% -50.67% (p=0.000 n=20) Or32Parallel 30.83n ± 0% 15.61n ± 0% -49.37% (p=0.000 n=20) Or64 30.02n ± 0% 14.82n ± 0% -50.63% (p=0.000 n=20) Or64Parallel 30.83n ± 0% 15.61n ± 0% -49.37% (p=0.000 n=20) Or8 30.02n ± 0% 14.01n ± 0% -53.33% (p=0.000 n=20) Or 30.02n ± 0% 13.61n ± 0% -54.66% (p=0.000 n=20) Or8Parallel 30.83n ± 0% 14.81n ± 0% -51.96% (p=0.000 n=20) OrParallel 30.83n ± 0% 14.41n ± 0% -53.26% (p=0.000 n=20) geomean 30.47n 14.75n -51.61% Change-Id: If008ff6a08b51905076f8ddb6e92f8e214d3f7b3 Reviewed-on: https://go-review.googlesource.com/c/go/+/482756 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Qiqi Huang <huangqiqi@loongson.cn> Reviewed-by: Meidan Li <limeidan@loongson.cn> Reviewed-by: David Chase <drchase@google.com> Reviewed-by: sophie zhao <zhaoxiaolin@loongson.cn> Reviewed-by: Cherry Mui <cherryyz@google.com>
2024-11-11	cmd/compiler,internal/runtime/atomic: optimize xchg{32,64} on loong64	Guoqi Chen
	Use Loong64's atomic operation instruction AMSWAPDB{W,V} (full barrier) to implement atomic.Xchg{32,64} goos: linux goarch: loong64 pkg: internal/runtime/atomic cpu: Loongson-3A5000 @ 2500.00MHz \| old.bench \| new.bench \| \| sec/op \| sec/op vs base \| Xchg 26.44n ± 0% 12.01n ± 0% -54.58% (p=0.000 n=20) Xchg-2 30.10n ± 0% 25.58n ± 0% -15.02% (p=0.000 n=20) Xchg-4 30.06n ± 0% 24.82n ± 0% -17.43% (p=0.000 n=20) Xchg64 26.44n ± 0% 12.02n ± 0% -54.54% (p=0.000 n=20) Xchg64-2 30.10n ± 0% 25.57n ± 0% -15.05% (p=0.000 n=20) Xchg64-4 30.05n ± 0% 24.80n ± 0% -17.47% (p=0.000 n=20) geomean 28.81n 19.68n -31.69% goos: linux goarch: loong64 pkg: internal/runtime/atomic cpu: Loongson-3A6000 @ 2500.00MHz \| old.bench \| new.bench \| \| sec/op \| sec/op vs base \| Xchg 25.62n ± 0% 12.41n ± 0% -51.56% (p=0.000 n=20) Xchg-2 35.01n ± 0% 20.59n ± 0% -41.19% (p=0.000 n=20) Xchg-4 34.63n ± 0% 19.59n ± 0% -43.42% (p=0.000 n=20) Xchg64 25.62n ± 0% 12.41n ± 0% -51.56% (p=0.000 n=20) Xchg64-2 35.01n ± 0% 20.59n ± 0% -41.19% (p=0.000 n=20) Xchg64-4 34.67n ± 0% 19.59n ± 0% -43.50% (p=0.000 n=20) geomean 31.44n 17.11n -45.59% Updates #59120. Change-Id: Ied74fc20338b63799c6d6eeb122c31b42cff0f7e Reviewed-on: https://go-review.googlesource.com/c/go/+/481578 Reviewed-by: Meidan Li <limeidan@loongson.cn> Reviewed-by: Qiqi Huang <huangqiqi@loongson.cn> Reviewed-by: Cherry Mui <cherryyz@google.com> Reviewed-by: David Chase <drchase@google.com> Reviewed-by: WANG Xuerui <git@xen0n.name> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: sophie zhao <zhaoxiaolin@loongson.cn>
2024-11-08	cmd/compiler,internal/runtime/atomic: optimize xadd{32,64} on loong64	Guoqi Chen
	Use Loong64's atomic operation instruction AMADDDB{W,V} (full barrier) to implement atomic.Xadd{32,64} goos: linux goarch: loong64 pkg: internal/runtime/atomic cpu: Loongson-3A5000 @ 2500.00MHz \| bench.old \| bench.new \| \| sec/op \| sec/op vs base \| Xadd 27.24n ± 0% 12.01n ± 0% -55.91% (p=0.000 n=20) Xadd-2 31.93n ± 0% 25.55n ± 0% -19.98% (p=0.000 n=20) Xadd-4 31.90n ± 0% 24.80n ± 0% -22.26% (p=0.000 n=20) Xadd64 27.23n ± 0% 12.01n ± 0% -55.89% (p=0.000 n=20) Xadd64-2 31.93n ± 0% 25.57n ± 0% -19.90% (p=0.000 n=20) Xadd64-4 31.89n ± 0% 24.80n ± 0% -22.23% (p=0.000 n=20) geomean 30.27n 19.67n -35.01% goos: linux goarch: loong64 pkg: internal/runtime/atomic cpu: Loongson-3A6000 @ 2500.00MHz \| bench.old \| bench.new \| \| sec/op \| sec/op vs base \| Xadd 26.02n ± 0% 12.41n ± 0% -52.31% (p=0.000 n=20) Xadd-2 37.36n ± 0% 20.60n ± 0% -44.86% (p=0.000 n=20) Xadd-4 37.22n ± 0% 19.59n ± 0% -47.37% (p=0.000 n=20) Xadd64 26.42n ± 0% 12.41n ± 0% -53.03% (p=0.000 n=20) Xadd64-2 37.77n ± 0% 20.60n ± 0% -45.46% (p=0.000 n=20) Xadd64-4 37.78n ± 0% 19.59n ± 0% -48.15% (p=0.000 n=20) geomean 33.30n 17.11n -48.62% Change-Id: I982539c2aa04680e9dd11b099ba8d5f215bf9b32 Reviewed-on: https://go-review.googlesource.com/c/go/+/481937 Reviewed-by: David Chase <drchase@google.com> Reviewed-by: sophie zhao <zhaoxiaolin@loongson.cn> Reviewed-by: Meidan Li <limeidan@loongson.cn> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: WANG Xuerui <git@xen0n.name> Reviewed-by: Cherry Mui <cherryyz@google.com> Reviewed-by: Qiqi Huang <huangqiqi@loongson.cn>
2024-11-07	cmd/compiler,internal/runtime/atomic: optimize Store{64,32,8} on loong64	Guoqi Chen
	On Loong64, AMSWAPDB{W,V} instructions are supported by default, and
	AMSWAPDB{B,H} [1] is a new instruction added by LA664 (Loongson 3A6000)
	and later microarchitectures. Therefore, AMSWAPDB{W,V} (full barrier)
	is used to implement AtomicStore{32,64}, and the traditional MOVB or
	the new AMSWAPDBB is used to implement AtomicStore8 according to the
	CPU feature.

	The StoreRelease barrier on Loong64 is "dbar 0x12", but it is still
	necessary to ensure consistency in the order of Store/Load [2].

	LoweredAtomicStorezero{32,64} was removed because on loong64 the constant
	"0" uses the R0 register, and there is no performance difference between
	the implementations of LoweredAtomicStorezero{32,64} and
	LoweredAtomicStore{32,64}.

	goos: linux
	goarch: loong64
	pkg: internal/runtime/atomic
	cpu: Loongson-3A5000-HV @ 2500.00MHz
	                |  bench.old  |             bench.new              |
	                |   sec/op    |   sec/op     vs base               |
	AtomicStore64     19.61n ± 0%   13.61n ± 0%  -30.60% (p=0.000 n=20)
	AtomicStore64-2   19.61n ± 0%   13.61n ± 0%  -30.57% (p=0.000 n=20)
	AtomicStore64-4   19.62n ± 0%   13.61n ± 0%  -30.63% (p=0.000 n=20)
	AtomicStore       19.61n ± 0%   13.61n ± 0%  -30.60% (p=0.000 n=20)
	AtomicStore-2     19.62n ± 0%   13.61n ± 0%  -30.63% (p=0.000 n=20)
	AtomicStore-4     19.62n ± 0%   13.62n ± 0%  -30.58% (p=0.000 n=20)
	AtomicStore8      19.61n ± 0%   20.01n ± 0%   +2.04% (p=0.000 n=20)
	AtomicStore8-2    19.62n ± 0%   20.02n ± 0%   +2.01% (p=0.000 n=20)
	AtomicStore8-4    19.61n ± 0%   20.02n ± 0%   +2.09% (p=0.000 n=20)
	geomean           19.61n        15.48n       -21.08%

	goos: linux
	goarch: loong64
	pkg: internal/runtime/atomic
	cpu: Loongson-3A6000 @ 2500.00MHz
	                |  bench.old  |             bench.new              |
	                |   sec/op    |   sec/op     vs base               |
	AtomicStore64     18.03n ± 0%   12.81n ± 0%  -28.93% (p=0.000 n=20)
	AtomicStore64-2   18.02n ± 0%   12.81n ± 0%  -28.91% (p=0.000 n=20)
	AtomicStore64-4   18.01n ± 0%   12.81n ± 0%  -28.87% (p=0.000 n=20)
	AtomicStore       18.02n ± 0%   12.81n ± 0%  -28.91% (p=0.000 n=20)
	AtomicStore-2     18.01n ± 0%   12.81n ± 0%  -28.87% (p=0.000 n=20)
	AtomicStore-4     18.01n ± 0%   12.81n ± 0%  -28.87% (p=0.000 n=20)
	AtomicStore8      18.01n ± 0%   12.81n ± 0%  -28.87% (p=0.000 n=20)
	AtomicStore8-2    18.01n ± 0%   12.81n ± 0%  -28.87% (p=0.000 n=20)
	AtomicStore8-4    18.01n ± 0%   12.81n ± 0%  -28.87% (p=0.000 n=20)
	geomean           18.01n        12.81n       -28.89%

	[1]: https://loongson.github.io/LoongArch-Documentation/LoongArch-ELF-ABI-EN.html
	[2]: https://gcc.gnu.org/git/?p=gcc.git;a=blob_plain;f=gcc/config/loongarch/sync.md

	Change-Id: I4ae5e8dd0e6f026129b6e503990a763ed40c6097
	Reviewed-on: https://go-review.googlesource.com/c/go/+/581356
	Reviewed-by: sophie zhao <zhaoxiaolin@loongson.cn>
	Reviewed-by: Cherry Mui <cherryyz@google.com>
	Reviewed-by: Qiqi Huang <huangqiqi@loongson.cn>
	Reviewed-by: Meidan Li <limeidan@loongson.cn>
	LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
	Reviewed-by: David Chase <drchase@google.com>
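The commit message notes that an atomic store must keep Store/Load ordering consistent, which is why a full-barrier instruction is used. A minimal sketch of what that ordering guarantees at the Go level follows, using the public sync/atomic package instead of the runtime-internal functions; the helper name publish and the payload value are hypothetical, and nothing here is taken from the commit's own code. A writer fills in ordinary data, then atomically stores a flag; a reader that observes the flag through an atomic load is guaranteed to see the data written before the store.

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

// publish (hypothetical helper) writes a plain variable, then signals
// readiness with an atomic store. The reader spins on an atomic load and
// returns the payload it sees once the flag is set.
func publish() int {
	var ready uint32
	var payload int
	done := make(chan int)

	go func() {
		// Wait until the writer's atomic store becomes visible.
		for atomic.LoadUint32(&ready) == 0 {
			runtime.Gosched()
		}
		// The plain write to payload happened before the atomic store,
		// so it is visible here.
		done <- payload
	}()

	payload = 42
	atomic.StoreUint32(&ready, 1)
	return <-done
}

func main() {
	fmt.Println(publish()) // prints 42
}
```

If the store could be reordered ahead of the payload write, the reader might return 0; the Go memory model rules this out for sync/atomic operations, and the barrier semantics discussed in the commit are what the loong64 backend relies on to honor that.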