| Age | Commit message | Author |
|
In an environment where GOMAXPROCS is set explicitly, for example to 3 in
a shell profile, the runtime tests fail with the following error:
----
ok regexp/syntax 0.428s
--- FAIL: TestCgroupGOMAXPROCS (0.81s)
crash_test.go:186: running /home/ms/src/go/bin/go build -o /tmp/go-build1753772192/testprog.exe
crash_test.go:208: built testprog in 796.664277ms
--- FAIL: TestCgroupGOMAXPROCS/containermaxprocs=0 (0.00s)
cgroup_linux_test.go:60: /tmp/go-build1753772192/testprog.exe PrintGOMAXPROCS (907.06µs): ok
cgroup_linux_test.go:63: output got "3\n" want "4\n"
--- FAIL: TestCgroupGOMAXPROCSNoLimit (0.00s)
cgroup_linux_test.go:82: /tmp/go-build1753772192/testprog.exe PrintGOMAXPROCS (879.194µs): ok
cgroup_linux_test.go:85: output got "3\n" want "4\n"
--- FAIL: TestCgroupGOMAXPROCSHigherThanNumCPU (0.00s)
cgroup_linux_test.go:102: /tmp/go-build1753772192/testprog.exe PrintGOMAXPROCS (852.396µs): ok
cgroup_linux_test.go:105: output got "3\n" want "4\n"
--- FAIL: TestCgroupGOMAXPROCSRound (0.01s)
--- FAIL: TestCgroupGOMAXPROCSRound/50000 (0.00s)
cgroup_linux_test.go:156: /tmp/go-build1753772192/testprog.exe PrintGOMAXPROCS (852.099µs): ok
cgroup_linux_test.go:159: output got "3\n" want "2\n"
--- FAIL: TestCgroupGOMAXPROCSRound/100000 (0.00s)
cgroup_linux_test.go:156: /tmp/go-build1753772192/testprog.exe PrintGOMAXPROCS (894.001µs): ok
cgroup_linux_test.go:159: output got "3\n" want "2\n"
--- FAIL: TestCgroupGOMAXPROCSRound/150000 (0.00s)
cgroup_linux_test.go:156: /tmp/go-build1753772192/testprog.exe PrintGOMAXPROCS (850.897µs): ok
cgroup_linux_test.go:159: output got "3\n" want "2\n"
--- FAIL: TestCgroupGOMAXPROCSSchedAffinity (0.00s)
cgroup_linux_test.go:229: /tmp/go-build1753772192/testprog.exe PrintGOMAXPROCS (867.987µs): ok
cgroup_linux_test.go:232: output got "3\n" want "2\n"
FAIL
FAIL runtime 23.088s
----
This change excludes GOMAXPROCS from the environment when building the
program for testing, so it does not affect the tests.
Change-Id: I590d9eca57026539413cf4c93b37f624f179d534
|
|
Several comments refer to bitset as 'instrinsified', which is likely
a typo, because it refers to the output of the intrinsics implemented
with SIMD.
Change-Id: I00f26b8d8128592ee0e9dc8a1b1480c93a9542d6
GitHub-Last-Rev: 8a4236710979f2f969210e0b261bdb9ae44f3321
GitHub-Pull-Request: golang/go#74624
Reviewed-on: https://go-review.googlesource.com/c/go/+/688016
Reviewed-by: Keith Randall <khr@golang.org>
Auto-Submit: Keith Randall <khr@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Keith Randall <khr@google.com>
|
|
For #73193.
Change-Id: I6a6a636ca9fa9cba429cf053468c56c2939cb1ac
Reviewed-on: https://go-review.googlesource.com/c/go/+/668638
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
|
|
Change-Id: I6a6a636ca21edcc6f16705fbb72a5241d4f7f22d
Reviewed-on: https://go-review.googlesource.com/c/go/+/668637
Reviewed-by: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
Moving to a smaller package allows its use in other internal/runtime
packages.
This isn't internal/strconvlite since it can't be used directly by
strconv.
For #73193.
Change-Id: I6a6a636c9c8b3f06b5fd6c07fe9dd5a7a37d1429
Reviewed-on: https://go-review.googlesource.com/c/go/+/672697
Reviewed-by: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Michael Pratt <mpratt@google.com>
|
|
Change-Id: I6a6a636c5e119165dc1018d1fc0354f5b6929656
Reviewed-on: https://go-review.googlesource.com/c/go/+/670496
Auto-Submit: Michael Pratt <mpratt@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
Currently, there's a window of time where each cleanup goroutine has
committed to going to sleep (immediately after full.pop() == nil) but
hasn't yet marked itself as asleep (state.sleep()). If new work arrives
in this window, it might get missed. This is what we see in #73642, and
I can reproduce it with stress2.
Side-note: even if the work gets missed by the existing sleeping
goroutines, needg is incremented. So in theory a new goroutine will
handle the work. Right now that doesn't happen in tests like the one
running in #73642, where there might never be another call to AddCleanup
to create the additional goroutine. Also, if we've hit the maximum on
cleanup goroutines and all of them are in this window simultaneously, we
can still end up missing work, it's just more rare. So this is still a
problem even if we choose to just be more aggressive about creating new
cleanup goroutines.
This change fixes the problem and also aims to make the cleanup
wake/sleep code clearer. The way this change fixes this problem is to
have cleanup goroutines re-check the work list before going to sleep,
but after having already marked themselves as sleeping. This way, if new
work comes in before the cleanup goroutine marks itself as going to
sleep, we can rely on the re-check to pick up that work. If new work
comes after the goroutine marks itself as going to sleep and after the
re-check, we can rely on the scheduler noticing that the goroutine is
asleep and waking it up. If work comes in between a goroutine marking
itself as sleeping and the re-check, then the re-check will catch that
piece of work. However, the scheduler might now get a false signal that
the goroutine is asleep and try to wake it up. This is OK. The sleeping
signal is now mutated and double-checked under the queue lock, so the
scheduler will grab the lock, may notice there are no sleeping
goroutines, and go on its way. This may cause spurious lock acquisitions
but it should be very rare. The window between a cleanup goroutine
marking itself as going to sleep and re-checking the work list is a
handful of instructions at most.
This seems subtle but overall it's a simplification of the code. We
rely more on the lock, which is easier to reason about, and we track two
separate atomic variables instead of the merged cleanupSleepState: the
length of the full list, and the number of cleanup goroutines that are
asleep. The former is now the primary way to acquire work. Cleanup
goroutines must decrement the length successfully to obtain an item off
the full list. The number of cleanup goroutines asleep, meanwhile, is
now only updated with the queue lock held. It can be checked without the
lock held, and the invariant to make that safe is simple: it must always
be an overestimate of the number of sleeping cleanup goroutines.
The changes here do change some other behaviors.
First, since we're tracking the length of the full list instead of the
abstract concept of a wake-up, the waker can't consume wake-ups anymore.
This means that cleanup goroutines may be created more aggressively. If
two threads in the scheduler see that there are goroutines that are
asleep, only one will win the race, and the other will observe zero
asleep goroutines but potentially many work units available. This will
cause it to signal many goroutines to be created. This is OK since we
have a cap on the number of cleanup goroutines, and the race should be
relatively rare.
Second, because cleanup goroutines can now fail to go to sleep if any
units of work come in, they might spend more time contended on the lock.
For example, if we have G cleanup goroutines and work comes in at *just*
the wrong rate, in the worst case each of the G goroutines loops N times
for N blocks, resulting in O(G*N) thread time to handle those N blocks.
To paint a picture, imagine each goroutine
trying to go to sleep and failing because a new block of work came in,
with only one goroutine getting that block. Then once that goroutine is
done, we all try again, fail because a new block of work came in, and so
on and so forth. This case is unlikely, though, and probably not worth
worrying about until it actually becomes a problem. (A similar problem
exists with parking (and exists before this change, too) but at least in
that case each goroutine parks, so it doesn't block the thread.)
Fixes #73642.
Change-Id: I6bbe1b789e7eb7e8168e56da425a6450fbad9625
Reviewed-on: https://go-review.googlesource.com/c/go/+/671676
Auto-Submit: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
|
|
This will make future uses of the map faster because the probe
sequences will likely be shorter.
Change-Id: If10f3af49a5feaff7d1b82337bbbfb93bcd9dcb5
Reviewed-on: https://go-review.googlesource.com/c/go/+/633076
Auto-Submit: Keith Randall <khr@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Keith Randall <khr@google.com>
|
|
Our current parallel mark algorithm suffers from frequent stalls on
memory since its access pattern is essentially random. Small objects
are the worst offenders, since each one forces pulling in at least one
full cache line to access even when the amount to be scanned is far
smaller than that. Each object also requires an independent access to
per-object metadata.
The purpose of this change is to improve garbage collector performance
by scanning small objects in batches to obtain better cache locality
than our current approach. The core idea behind this change is to defer
marking and scanning small objects, and then scan them in batches
localized to a span.
This change adds scanned bits to each small object (<=512 bytes) span in
addition to mark bits. The scanned bits indicate that the object has
been scanned. (One way to think of them is "grey" bits and "black" bits
in the tri-color mark-sweep abstraction.) Each of these spans is always
8 KiB and if they contain pointers, the pointer/scalar data is already
packed together at the end of the span, allowing us to further optimize
the mark algorithm for this specific case.
When the GC encounters a pointer, it first checks if it points into a
small object span. If so, it is first marked in the mark bits, and then
the object is queued on a work-stealing P-local queue. This object
represents the whole span, and we ensure that a span can only appear at
most once in any queue by maintaining an atomic ownership bit for each
span. Later, when the pointer is dequeued, we scan every object with a
set mark that doesn't have a corresponding scanned bit. If it turns out
that was the only object in the mark bits since the last time we scanned
the span, we scan just that object directly, essentially falling back to
the existing algorithm. noscan objects have no scan work, so they are
never queued.
Each span's mark and scanned bits are co-located together at the end of
the span. Since the span is always 8 KiB in size, it can be found with
simple pointer arithmetic. Next to the marks and scans we also store the
size class, eliminating the need to access the span's mspan altogether.
The work-stealing P-local queue is a new source of GC work. If this
queue gets full, half of it is dumped to a global linked list of spans
to scan. The regular scan queues are always prioritized over this queue
to allow time for darts to accumulate. Stealing work from other Ps is a
last resort.
This change also adds a new debug mode under GODEBUG=gctrace=2 that
dumps whole-span scanning statistics by size class on every GC cycle.
A future extension to this CL is to use SIMD-accelerated scanning
kernels for scanning spans with high mark bit density.
For #19112. (Deadlock averted in GOEXPERIMENT.)
For #73581.
Change-Id: I4bbb4e36f376950a53e61aaaae157ce842c341bc
Reviewed-on: https://go-review.googlesource.com/c/go/+/658036
Auto-Submit: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
These constants are needed by some future generator programs.
Change-Id: I5dccd009cbb3b2f321523bc0d8eaeb4c82e5df81
Reviewed-on: https://go-review.googlesource.com/c/go/+/655276
Reviewed-by: Cherry Mui <cherryyz@google.com>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
We will want to reference these definitions from new generator programs,
and this is a good opportunity to clean up all these old C-style names.
Change-Id: Ifb06f0afc381e2697e7877f038eca786610c96de
Reviewed-on: https://go-review.googlesource.com/c/go/+/655275
Auto-Submit: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
|
|
This lets the inliner do a better job optimizing the mapKeyError call.
goos: linux
goarch: amd64
pkg: runtime
cpu: AMD Ryzen 5 4600G with Radeon Graphics
│ /tmp/before2 │ /tmp/after3 │
│ sec/op │ sec/op vs base │
MapAccessZero/Key=int64-12 1.875n ± 0% 1.875n ± 0% ~ (p=0.506 n=25)
MapAccessZero/Key=int32-12 1.875n ± 0% 1.875n ± 0% ~ (p=0.082 n=25)
MapAccessZero/Key=string-12 1.902n ± 1% 1.902n ± 1% ~ (p=0.256 n=25)
MapAccessZero/Key=mediumType-12 2.816n ± 0% 1.958n ± 0% -30.47% (p=0.000 n=25)
MapAccessZero/Key=bigType-12 2.815n ± 0% 1.935n ± 0% -31.26% (p=0.000 n=25)
MapAccessEmpty/Key=int64-12 1.942n ± 0% 2.109n ± 0% +8.60% (p=0.000 n=25)
MapAccessEmpty/Key=int32-12 2.110n ± 0% 1.940n ± 0% -8.06% (p=0.000 n=25)
MapAccessEmpty/Key=string-12 2.024n ± 0% 2.109n ± 0% +4.20% (p=0.000 n=25)
MapAccessEmpty/Key=mediumType-12 3.157n ± 0% 2.344n ± 0% -25.75% (p=0.000 n=25)
MapAccessEmpty/Key=bigType-12 3.054n ± 0% 2.115n ± 0% -30.75% (p=0.000 n=25)
geomean 2.305n 2.011n -12.75%
Change-Id: Iee83930884dc4c8a791a711aa189a1c93b68d536
Reviewed-on: https://go-review.googlesource.com/c/go/+/663495
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
|
|
This test fails on GOEXPERIMENT=noswissmap as it is testing behavior
specific to swissmaps. Move it to map_swiss_test.go to skip it on
noswissmap.
We could also switch the test to use NewTestMap, which provides a
swissmap even in GOEXPERIMENT=noswissmap, but that is tedious to use and
noswissmap is going away soon anyway.
For #70886.
Cq-Include-Trybots: luci.golang.try:gotip-linux-amd64-longtest-noswissmap
Change-Id: I6a6a636c5ec72217d936cd01e9da36ae127ea2c5
Reviewed-on: https://go-review.googlesource.com/c/go/+/666437
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Auto-Submit: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
Before growing, if there are lots of tombstones try to remove them.
If we can remove enough, we can continue at the given size for a
while longer.
Fixes #70886
Change-Id: I71e0d873ae118bb35798314ec25e78eaa5340d73
Reviewed-on: https://go-review.googlesource.com/c/go/+/640955
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Keith Randall <khr@google.com>
Auto-Submit: Keith Randall <khr@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
Fixes #73191
Change-Id: I0f8a5a19faa745943a98476c7caf4c97ccdce184
Reviewed-on: https://go-review.googlesource.com/c/go/+/663175
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Auto-Submit: Michael Pratt <mpratt@google.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
|
|
On master, lookups on small Swiss Table maps (<= 8 elements) for
non-specialized key types are seemingly a performance regression
compared to the Go 1.23 map implementation (reported in #70849).
Currently, a linear scan is used for gets in these cases.
This CL changes (*Map).getWithKeySmall to instead use the SIMD or SWAR
match on the control bytes to then jump to candidate matching slots,
with sample results below for a 16-byte key. This especially helps the
hit case when the key is unpredictable, which previously had to scan an
unpredictable number of control bytes to find a candidate slot.
Separately, other CLs in this stack modify the main Swiss Table
benchmarks to randomize lookup key order (previously, most of the
benchmarks had a repeating lookup key ordering, which is likely
predictable until the map is too big). We have sample results for the
randomized key order benchmarks followed by results from the older
benchmarks.
The first table below is with randomized key order. For hits, the older
results get slower as there are more elements. With this CL, we see hits
for unpredictable key ordering (sizes 2-8) get a ~1.7x speedup from
~25ns to ~14ns, with a now consistent lookup time for the different
sizes. (The 1 element size map has a predictable key ordering because
there is only one key, and that reports a modest ~0.5ns or ~3%
performance penalty). Misses for unpredictable key order get a ~1.3x
speedup, from ~13ns to ~10ns, with similar results for the 1 element
size.
│ no-fix-new-bmarks │ fix-with-new-bmarks │
│ sec/op │ sec/op vs base │
MapSmallAccessHit/Key=smallType/Elem=int32/len=1-4 13.26n ± 0% 13.64n ± 0% +2.90% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=2-4 19.47n ± 0% 13.62n ± 0% -30.05% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=3-4 22.23n ± 0% 13.64n ± 0% -38.68% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=4-4 23.98n ± 0% 13.64n ± 0% -43.11% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=5-4 25.02n ± 0% 13.67n ± 0% -45.35% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=6-4 25.77n ± 1% 13.68n ± 2% -46.89% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=7-4 26.38n ± 0% 13.64n ± 0% -48.28% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=8-4 26.31n ± 0% 13.71n ± 21% -47.90% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=1-4 13.055n ± 0% 9.815n ± 0% -24.82% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=2-4 13.070n ± 0% 9.813n ± 0% -24.92% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=3-4 13.060n ± 0% 9.819n ± 0% -24.82% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=4-4 13.075n ± 0% 9.816n ± 0% -24.92% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=5-4 13.060n ± 0% 9.826n ± 0% -24.76% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=6-4 13.095n ± 19% 9.834n ± 31% -24.90% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=7-4 13.075n ± 19% 9.822n ± 27% -24.88% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=8-4 13.11n ± 16% 12.14n ± 19% -7.43% (p=0.000 n=20)
The next table uses the original benchmarks from just before this CL
stack (i.e., without shuffling lookup keys).
With this CL, we see improvement that is directionally similar to the
above results but not as large, presumably because the branches in the
linear scan are fairly predictable with predictable keys. (The numbers
here also include the time from a mod in the benchmark code, which
seemed to take roughly 1/3 of CPU time based on spot checking a couple
of examples, whereas the modified benchmarks shown above have removed
that mod.)
│ master-8c3e391573 │ just-fix-with-old-bmarks │
│ sec/op │ sec/op vs base │
MapSmallAccessHit/Key=smallType/Elem=int32/len=1-4 20.85n ± 0% 21.69n ± 0% +4.03% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=2-4 21.22n ± 0% 21.70n ± 0% +2.24% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=3-4 21.73n ± 0% 21.71n ± 0% ~ (p=0.158 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=4-4 22.06n ± 0% 21.71n ± 0% -1.56% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=5-4 22.41n ± 0% 21.73n ± 0% -3.01% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=6-4 22.71n ± 0% 21.72n ± 0% -4.38% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=7-4 22.98n ± 0% 21.71n ± 0% -5.53% (p=0.000 n=20)
MapSmallAccessHit/Key=smallType/Elem=int32/len=8-4 23.20n ± 0% 21.72n ± 0% -6.36% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=1-4 19.95n ± 0% 17.30n ± 0% -13.28% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=2-4 19.96n ± 0% 17.31n ± 0% -13.28% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=3-4 19.95n ± 0% 17.29n ± 0% -13.33% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=4-4 19.95n ± 0% 17.30n ± 0% -13.29% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=5-4 19.96n ± 25% 17.32n ± 0% -13.22% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=6-4 19.99n ± 24% 17.29n ± 0% -13.51% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=7-4 19.97n ± 20% 17.34n ± 16% -13.14% (p=0.000 n=20)
MapSmallAccessMiss/Key=smallType/Elem=int32/len=8-4 20.02n ± 11% 17.33n ± 14% -13.44% (p=0.000 n=20)
geomean 21.02n 19.39n -7.78%
See #70849 for additional benchmark results, including results for arm64
(which also means without SIMD support).
Updates #54766
Updates #70700
Fixes #70849
Change-Id: Ic2361bb6fc15b4436d1d1d5be7e4712e547f611b
Reviewed-on: https://go-review.googlesource.com/c/go/+/634396
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
│ base │ experiment │
│ sec/op │ sec/op vs base │
MapClone-24 66.802m ± 7% 3.348m ± 2% -94.99% (p=0.000 n=10)
Fixes #70836
Change-Id: I9e192b1ee82e18f5580ff18918307042a337fdcc
Reviewed-on: https://go-review.googlesource.com/c/go/+/660175
Reviewed-by: Michael Pratt <mpratt@google.com>
Auto-Submit: Keith Randall <khr@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Keith Randall <khr@google.com>
|
|
This makes the single-byte atomic.Xchg8 operation available on all
GOARCHes, including those without direct / single-instruction support.
Fixes #69735
Change-Id: Icb6aff8f907257db81ea440dc4d29f96b3cff6c4
Reviewed-on: https://go-review.googlesource.com/c/go/+/657936
Reviewed-by: Mauri de Souza Meneguzzo <mauri870@gmail.com>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: David Chase <drchase@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Keith Randall <khr@google.com>
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
|
|
We've been slowly moving packages from runtime/internal to
internal/runtime. For now, runtime/internal only has test packages.
It's a good chance to clean up the references to runtime/internal
in the toolchain.
For #65355.
Change-Id: Ie6f9091a44511d0db9946ea6de7a78d3afe9f063
GitHub-Last-Rev: fad32e2e81d11508e734c3c3d3b0c1da583f89f5
GitHub-Pull-Request: golang/go#72137
Reviewed-on: https://go-review.googlesource.com/c/go/+/655515
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: David Chase <drchase@google.com>
|
|
Update comments in the Go assembler package.
Change-Id: I174e344ca45fae6ef70af2e0b29cd783b003b4c2
GitHub-Last-Rev: 8ab37208891e795561a943269ca82b1ce6e7eef5
GitHub-Pull-Request: golang/go#72048
Reviewed-on: https://go-review.googlesource.com/c/go/+/654478
Reviewed-by: Keith Randall <khr@golang.org>
Auto-Submit: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: PRABHAV DOGRA <prabhavdogra1@gmail.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
Leverage the prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, ...) API to name
the anonymous memory areas.
This API was introduced in Linux 5.17 to decorate the anonymous
memory areas shown in /proc/<pid>/maps.
This is already used by glibc. See:
* https://sourceware.org/git/?p=glibc.git;a=blob;f=malloc/malloc.c;h=27dfd1eb907f4615b70c70237c42c552bb4f26a8;hb=HEAD#l2434
* https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/setvmaname.c;h=ea93a5ffbebc9e5a7e32a297138f465724b4725f;hb=HEAD#l63
This can be useful when investigating the memory consumption of a
multi-language program.
In a pure Go program, the pprof profiler can be used to profile the
program's memory consumption. But pprof is only aware of what happens
within the Go world.
In a multi-language program, it can be unclear whether suspicious extra
memory consumption comes from the Go part or the native part.
With this change, the following Go program:
    package main
    import (
        "fmt"
        "log"
        "os"
    )
    /*
    #include <stdlib.h>
    void f(void)
    {
      (void)malloc(1024*1024*1024);
    }
    */
    import "C"
    func main() {
        C.f()
        data, err := os.ReadFile("/proc/self/maps")
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println(string(data))
    }
produces this output:
$ GLIBC_TUNABLES=glibc.mem.decorate_maps=1 ~/doc/devel/open-source/go/bin/go run .
00400000-00402000 r--p 00000000 00:21 28451768 /home/lenaic/.cache/go-build/9f/9f25a17baed5a80d03eb080a2ce2a5ff49c17f9a56e28330f0474a2bb74a30a0-d/test_vma_name
00402000-004a4000 r-xp 00002000 00:21 28451768 /home/lenaic/.cache/go-build/9f/9f25a17baed5a80d03eb080a2ce2a5ff49c17f9a56e28330f0474a2bb74a30a0-d/test_vma_name
004a4000-00574000 r--p 000a4000 00:21 28451768 /home/lenaic/.cache/go-build/9f/9f25a17baed5a80d03eb080a2ce2a5ff49c17f9a56e28330f0474a2bb74a30a0-d/test_vma_name
00574000-00575000 r--p 00173000 00:21 28451768 /home/lenaic/.cache/go-build/9f/9f25a17baed5a80d03eb080a2ce2a5ff49c17f9a56e28330f0474a2bb74a30a0-d/test_vma_name
00575000-00580000 rw-p 00174000 00:21 28451768 /home/lenaic/.cache/go-build/9f/9f25a17baed5a80d03eb080a2ce2a5ff49c17f9a56e28330f0474a2bb74a30a0-d/test_vma_name
00580000-005a4000 rw-p 00000000 00:00 0
2e075000-2e096000 rw-p 00000000 00:00 0 [heap]
c000000000-c000400000 rw-p 00000000 00:00 0 [anon: Go: heap]
c000400000-c004000000 ---p 00000000 00:00 0 [anon: Go: heap reservation]
777f40000000-777f40021000 rw-p 00000000 00:00 0 [anon: glibc: malloc arena]
777f40021000-777f44000000 ---p 00000000 00:00 0
777f44000000-777f44021000 rw-p 00000000 00:00 0 [anon: glibc: malloc arena]
777f44021000-777f48000000 ---p 00000000 00:00 0
777f48000000-777f48021000 rw-p 00000000 00:00 0 [anon: glibc: malloc arena]
777f48021000-777f4c000000 ---p 00000000 00:00 0
777f4c000000-777f4c021000 rw-p 00000000 00:00 0 [anon: glibc: malloc arena]
777f4c021000-777f50000000 ---p 00000000 00:00 0
777f50000000-777f50021000 rw-p 00000000 00:00 0 [anon: glibc: malloc arena]
777f50021000-777f54000000 ---p 00000000 00:00 0
777f55afb000-777f55afc000 ---p 00000000 00:00 0
777f55afc000-777f562fc000 rw-p 00000000 00:00 0 [anon: glibc: pthread stack: 216378]
777f562fc000-777f562fd000 ---p 00000000 00:00 0
777f562fd000-777f56afd000 rw-p 00000000 00:00 0 [anon: glibc: pthread stack: 216377]
777f56afd000-777f56afe000 ---p 00000000 00:00 0
777f56afe000-777f572fe000 rw-p 00000000 00:00 0 [anon: glibc: pthread stack: 216376]
777f572fe000-777f572ff000 ---p 00000000 00:00 0
777f572ff000-777f57aff000 rw-p 00000000 00:00 0 [anon: glibc: pthread stack: 216375]
777f57aff000-777f57b00000 ---p 00000000 00:00 0
777f57b00000-777f58300000 rw-p 00000000 00:00 0 [anon: glibc: pthread stack: 216374]
777f58300000-777f58400000 rw-p 00000000 00:00 0 [anon: Go: page alloc index]
777f58400000-777f5a400000 rw-p 00000000 00:00 0 [anon: Go: heap index]
777f5a400000-777f6a580000 ---p 00000000 00:00 0 [anon: Go: scavenge index]
777f6a580000-777f6a581000 rw-p 00000000 00:00 0 [anon: Go: scavenge index]
777f6a581000-777f7a400000 ---p 00000000 00:00 0 [anon: Go: scavenge index]
777f7a400000-777f8a580000 ---p 00000000 00:00 0 [anon: Go: page summary]
777f8a580000-777f8a581000 rw-p 00000000 00:00 0 [anon: Go: page alloc]
777f8a581000-777f9c430000 ---p 00000000 00:00 0 [anon: Go: page summary]
777f9c430000-777f9c431000 rw-p 00000000 00:00 0 [anon: Go: page alloc]
777f9c431000-777f9e806000 ---p 00000000 00:00 0 [anon: Go: page summary]
777f9e806000-777f9e807000 rw-p 00000000 00:00 0 [anon: Go: page alloc]
777f9e807000-777f9ec00000 ---p 00000000 00:00 0 [anon: Go: page summary]
777f9ec36000-777f9ecb6000 rw-p 00000000 00:00 0 [anon: Go: immortal metadata]
777f9ecb6000-777f9ecc6000 rw-p 00000000 00:00 0 [anon: Go: gc bits]
777f9ecc6000-777f9ecd6000 rw-p 00000000 00:00 0 [anon: Go: allspans array]
777f9ecd6000-777f9ece7000 rw-p 00000000 00:00 0 [anon: Go: immortal metadata]
777f9ece7000-777f9ed67000 ---p 00000000 00:00 0 [anon: Go: page summary]
777f9ed67000-777f9ed68000 rw-p 00000000 00:00 0 [anon: Go: page alloc]
777f9ed68000-777f9ede7000 ---p 00000000 00:00 0 [anon: Go: page summary]
777f9ede7000-777f9ee07000 rw-p 00000000 00:00 0 [anon: Go: page alloc]
777f9ee07000-777f9ee0a000 rw-p 00000000 00:00 0 [anon: glibc: loader malloc]
777f9ee0a000-777f9ee2e000 r--p 00000000 00:21 48158213 /usr/lib/libc.so.6
777f9ee2e000-777f9ef9f000 r-xp 00024000 00:21 48158213 /usr/lib/libc.so.6
777f9ef9f000-777f9efee000 r--p 00195000 00:21 48158213 /usr/lib/libc.so.6
777f9efee000-777f9eff2000 r--p 001e3000 00:21 48158213 /usr/lib/libc.so.6
777f9eff2000-777f9eff4000 rw-p 001e7000 00:21 48158213 /usr/lib/libc.so.6
777f9eff4000-777f9effc000 rw-p 00000000 00:00 0
777f9effc000-777f9effe000 rw-p 00000000 00:00 0 [anon: glibc: loader malloc]
777f9f00a000-777f9f04a000 rw-p 00000000 00:00 0 [anon: Go: immortal metadata]
777f9f04a000-777f9f04c000 r--p 00000000 00:00 0 [vvar]
777f9f04c000-777f9f04e000 r--p 00000000 00:00 0 [vvar_vclock]
777f9f04e000-777f9f050000 r-xp 00000000 00:00 0 [vdso]
777f9f050000-777f9f051000 r--p 00000000 00:21 48158204 /usr/lib/ld-linux-x86-64.so.2
777f9f051000-777f9f07a000 r-xp 00001000 00:21 48158204 /usr/lib/ld-linux-x86-64.so.2
777f9f07a000-777f9f085000 r--p 0002a000 00:21 48158204 /usr/lib/ld-linux-x86-64.so.2
777f9f085000-777f9f087000 r--p 00034000 00:21 48158204 /usr/lib/ld-linux-x86-64.so.2
777f9f087000-777f9f088000 rw-p 00036000 00:21 48158204 /usr/lib/ld-linux-x86-64.so.2
777f9f088000-777f9f089000 rw-p 00000000 00:00 0
7ffc7bfa7000-7ffc7bfc8000 rw-p 00000000 00:00 0 [stack]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0 [vsyscall]
The anonymous memory areas are now labelled so that we can see which
ones have been allocated by the Go runtime and which ones have been
allocated by glibc.
Fixes #71546
Change-Id: I304e8b4dd7f2477a6da794fd44e9a7a5354e4bf4
Reviewed-on: https://go-review.googlesource.com/c/go/+/646095
Auto-Submit: Alan Donovan <adonovan@google.com>
Commit-Queue: Alan Donovan <adonovan@google.com>
Reviewed-by: Felix Geisendörfer <felix.geisendoerfer@datadoghq.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
|
|
For #69735
Change-Id: I2a0336214786e14b9a37834d81a0a0d14231451c
Reviewed-on: https://go-review.googlesource.com/c/go/+/651315
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Keith Randall <khr@google.com>
Auto-Submit: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
For #69735
Change-Id: Ide6b3077768a96b76078e5d4f6460596b8ff1560
Reviewed-on: https://go-review.googlesource.com/c/go/+/631756
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Keith Randall <khr@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Keith Randall <khr@golang.org>
|
|
For #69735
Change-Id: I34ca2b027494525ab64f94beee89ca373a5031ae
Reviewed-on: https://go-review.googlesource.com/c/go/+/631615
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Mark Ryan <markdryan@rivosinc.com>
Reviewed-by: Mauri de Souza Meneguzzo <mauri870@gmail.com>
Reviewed-by: Keith Randall <khr@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
Fixes a typo to correctly describe the hash bits of the control word.
Change-Id: Id3c2ae0bd529e579a95258845f9d8028e23d10d2
GitHub-Last-Rev: 1baa81be5d292d5625d5d7788b8ea090453f962c
GitHub-Pull-Request: golang/go#71730
Reviewed-on: https://go-review.googlesource.com/c/go/+/649416
Reviewed-by: Keith Randall <khr@golang.org>
Auto-Submit: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
|
|
Re-enable tests for stack-allocated maps and fast map accessors.
Those are implemented now.
Update #54766
Change-Id: I8c019702bd9fb077b2fe3f7c78e8e9e10d2263a6
Reviewed-on: https://go-review.googlesource.com/c/go/+/642376
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Auto-Submit: Keith Randall <khr@golang.org>
|
|
len(map) is lowered to loading the first field of the map
structure, which is the length. Currently it is a load of an int.
With the old map, the first field is indeed an int. With Swiss
map, however, it is a uint64. On a big-endian 32-bit machine,
loading a (32-bit) int from a uint64 would load just the high
bits, which are (probably) all 0. Change to a load with the proper
type.
Fixes #70248.
Change-Id: I39cf2d1e6658dac5a8de25c858e1580e2a14b894
Reviewed-on: https://go-review.googlesource.com/c/go/+/638375
Run-TryBot: Cherry Mui <cherryyz@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
|
|
OpenBSD is bumping up against the nosplit limit, and openbsd/ppc64
is over it. Increase StackGuardMultiplier on OpenBSD, matching AIX.
Change-Id: I61e17c99ce77e1fd3f368159dc4615aeae99e913
Reviewed-on: https://go-review.googlesource.com/c/go/+/632996
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
Auto-Submit: Damien Neil <dneil@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
|
|
Change-Id: I07e7c8eaa5bd4bac0d576b2f2f4cd3f81b0b77a4
Reviewed-on: https://go-review.googlesource.com/c/go/+/630055
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Commit-Queue: Ian Lance Taylor <iant@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Reviewed-by: Russ Cox <rsc@golang.org>
Auto-Submit: Ian Lance Taylor <iant@google.com>
|
|
We shouldn't spend human code review time checking this.
Let the computer check.
Change-Id: I6de9d733c128d833b958b0e43a52b564e8f82dd3
Reviewed-on: https://go-review.googlesource.com/c/go/+/630417
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Sam Thanawalla <samthanawalla@google.com>
|
|
Use similar SIMD operations to the ones used in Abseil. We still use
8-slot groups (even though the XMM registers could handle 16-slot
groups) to keep the implementation simpler (no changes to the memory
layout of maps).
Still, the implementations of matchH2 and matchEmpty are shorter than
the portable version using standard arithmetic operations. They also
return a packed bitset, which avoids the need to shift in bitset.first.
That said, the packed bitset is a downside in cognitive complexity, as
we have to think about two different possible representations. This
doesn't leak out of the API, but we do need to intrinsify bitset to
switch to a compatible implementation.
The compiler's intrinsics don't support intrinsifying methods, so the
implementations move to free functions.
This makes operations between 0-3% faster on my machine. e.g.,
MapGetHit/impl=runtimeMap/t=Int64/len=6-12 12.34n ± 1% 11.42n ± 1% -7.46% (p=0.000 n=25)
MapGetHit/impl=runtimeMap/t=Int64/len=12-12 15.14n ± 2% 14.88n ± 1% -1.72% (p=0.009 n=25)
MapGetHit/impl=runtimeMap/t=Int64/len=18-12 15.04n ± 6% 14.66n ± 2% -2.53% (p=0.000 n=25)
MapGetHit/impl=runtimeMap/t=Int64/len=24-12 15.80n ± 1% 15.48n ± 3% ~ (p=0.444 n=25)
MapGetHit/impl=runtimeMap/t=Int64/len=30-12 15.55n ± 4% 14.77n ± 3% -5.02% (p=0.004 n=25)
MapGetHit/impl=runtimeMap/t=Int64/len=64-12 15.26n ± 1% 15.05n ± 1% ~ (p=0.055 n=25)
MapGetHit/impl=runtimeMap/t=Int64/len=128-12 15.34n ± 1% 15.02n ± 2% -2.09% (p=0.000 n=25)
MapGetHit/impl=runtimeMap/t=Int64/len=256-12 15.42n ± 1% 15.15n ± 1% -1.75% (p=0.001 n=25)
MapGetHit/impl=runtimeMap/t=Int64/len=512-12 15.48n ± 1% 15.18n ± 1% -1.94% (p=0.000 n=25)
MapGetHit/impl=runtimeMap/t=Int64/len=1024-12 17.38n ± 1% 17.05n ± 1% -1.90% (p=0.000 n=25)
MapGetHit/impl=runtimeMap/t=Int64/len=2048-12 17.96n ± 0% 17.59n ± 1% -2.06% (p=0.000 n=25)
MapGetHit/impl=runtimeMap/t=Int64/len=4096-12 18.36n ± 1% 18.18n ± 1% -0.98% (p=0.013 n=25)
MapGetHit/impl=runtimeMap/t=Int64/len=8192-12 18.75n ± 0% 18.31n ± 1% -2.35% (p=0.000 n=25)
MapGetHit/impl=runtimeMap/t=Int64/len=65536-12 26.25n ± 0% 25.95n ± 1% -1.14% (p=0.000 n=25)
MapGetHit/impl=runtimeMap/t=Int64/len=262144-12 44.24n ± 1% 44.06n ± 1% ~ (p=0.181 n=25)
MapGetHit/impl=runtimeMap/t=Int64/len=1048576-12 85.02n ± 0% 85.35n ± 0% +0.39% (p=0.032 n=25)
MapGetHit/impl=runtimeMap/t=Int64/len=4194304-12 98.87n ± 1% 98.85n ± 1% ~ (p=0.799 n=25)
For #54766.
Cq-Include-Trybots: luci.golang.try:gotip-linux-ppc64_power10,gotip-linux-amd64-goamd64v3
Change-Id: Ic1b852f02744404122cb3672900fd95f4625905e
Reviewed-on: https://go-review.googlesource.com/c/go/+/626277
Reviewed-by: Keith Randall <khr@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Michael Pratt <mpratt@google.com>
Reviewed-by: Keith Randall <khr@google.com>
|
|
In Loongson's new microarchitecture LA664 (Loongson-3A6000) and later, the atomic
instruction AMSWAP[DB]{B,H} [1] is supported. Therefore, the implementation of
the atomic operation exchange can be selected according to the CPUCFG flag LAM_BH:
the AMSWAPDBB (full barrier) instruction is used on new microarchitectures, and traditional
LL-SC is used on LA464 (Loongson-3A5000) and older microarchitectures. This can
significantly improve the performance of Go programs on new microarchitectures.
Because Xchg8 implemented using traditional LL-SC uses too many temporary
registers, it is not suitable for intrinsics.
goos: linux
goarch: loong64
pkg: internal/runtime/atomic
cpu: Loongson-3A6000 @ 2500.00MHz
BenchmarkXchg8 100000000 10.41 ns/op
BenchmarkXchg8-2 100000000 10.41 ns/op
BenchmarkXchg8-4 100000000 10.41 ns/op
BenchmarkXchg8Parallel 96647592 12.41 ns/op
BenchmarkXchg8Parallel-2 58376136 20.60 ns/op
BenchmarkXchg8Parallel-4 78458899 17.97 ns/op
goos: linux
goarch: loong64
pkg: internal/runtime/atomic
cpu: Loongson-3A5000-HV @ 2500.00MHz
BenchmarkXchg8 38323825 31.23 ns/op
BenchmarkXchg8-2 38368219 31.23 ns/op
BenchmarkXchg8-4 37154156 31.26 ns/op
BenchmarkXchg8Parallel 37908301 31.63 ns/op
BenchmarkXchg8Parallel-2 30413440 39.42 ns/op
BenchmarkXchg8Parallel-4 30737626 39.03 ns/op
For #69735
[1]: https://loongson.github.io/LoongArch-Documentation/LoongArch-ELF-ABI-EN.html
Change-Id: I02ba68f66a2210b6902344fdc9975eb62de728ab
Reviewed-on: https://go-review.googlesource.com/c/go/+/623058
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: sophie zhao <zhaoxiaolin@loongson.cn>
Reviewed-by: Meidan Li <limeidan@loongson.cn>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Mauri de Souza Meneguzzo <mauri870@gmail.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
|
|
Hashing the key means we have to take the address of it. That inhibits
subsequent optimizations on the key variable.
By hashing a copy, we incur an extra store at the hash callsite, but
we no longer need a load of the key in the inner loop. It can live
in a register throughout. (Technically, it gets spilled around
the call to the hasher, but it gets restored outside the loop.)
Maybe one day we can have special hash functions that take
int64/int32/string instead of *int64/*int32/*string.
Change-Id: Iba3133f6e82328f53c0abcb5eec13ee47c4969d1
Reviewed-on: https://go-review.googlesource.com/c/go/+/629419
Reviewed-by: Russ Cox <rsc@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
|
|
Note this doesn't work with int32 keys because alignment padding can change
the offset of the element.
Change-Id: I27804d3cfc7cc1b7f995f7e29630f0824f0ee899
Reviewed-on: https://go-review.googlesource.com/c/go/+/629418
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Russ Cox <rsc@golang.org>
|
|
This reduces the adds required at the return point from 3 to 1.
(The multiply inside g.elem() does get CSE'd with the one inside
g.key(), but the rest of the adds don't.)
Instead, compute the element as just a fixed offset from the key.
Change-Id: Ia4d7664efafcdca5e9daeb77d270651bb186232c
Reviewed-on: https://go-review.googlesource.com/c/go/+/629535
Reviewed-by: Russ Cox <rsc@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
|
|
Add a new function, WithDataIndependentTiming, which takes a function as
an argument, and encloses it with calls to set/unset the DIT PSTATE bit
on Arm64.
Since DIT is OS thread-local, for the duration of the execution of
WithDataIndependentTiming, we lock the goroutine to the OS thread, using
LockOSThread. For long running operations, this is likely to not be
performant, but we expect this to be tightly scoped around cryptographic
operations that have bounded execution times.
If locking to the OS thread turns out to be too slow, another option is
to add a bit to the g state indicating if a goroutine has DIT enabled,
and then have the scheduler enable/disable DIT when scheduling a g.
Additionally, we add a new GODEBUG, dataindependenttiming, which allows
setting DIT for an entire program. Running a program with
dataindependenttiming=1 enables DIT for the program during
initialization. In an ideal world PSTATE.DIT would be inherited from
the parent thread, so we'd only need to set it in the main thread and
then all subsequent threads would inherit the value. While this does
happen in the Linux kernel [0], it is not the case for darwin [1].
Rather than add complex logic to only set it on darwin for each new
thread, we just unconditionally set it in mstart1 and cgocallbackg1
regardless of the OS. DIT will already impose some overhead, and the
cost of setting the bit is only ~two instructions (CALL, MSR), so it
should be cheap enough.
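The shape of the wrapper can be sketched as follows. This is not the
actual implementation: enableDIT and disableDIT are hypothetical stubs
backed by a plain variable in place of the PSTATE.DIT bit, and restoring
the previous state for nested calls is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"runtime"
)

// ditEnabled stands in for the per-thread PSTATE.DIT bit. Real code would
// read and write the bit with architecture-specific instructions.
var ditEnabled bool

// enableDIT sets the (simulated) bit and reports whether it was already set.
func enableDIT() (wasEnabled bool) {
	wasEnabled = ditEnabled
	ditEnabled = true
	return wasEnabled
}

// disableDIT clears the (simulated) bit.
func disableDIT() {
	ditEnabled = false
}

// withDataIndependentTiming runs f with the bit set. The goroutine is
// pinned to its OS thread for the duration because the bit is thread-local.
func withDataIndependentTiming(f func()) {
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	alreadyEnabled := enableDIT()
	defer func() {
		// Only clear the bit if this call was the one that set it.
		if !alreadyEnabled {
			disableDIT()
		}
	}()

	f()
}

func main() {
	withDataIndependentTiming(func() {
		fmt.Println("inside:", ditEnabled)
	})
	fmt.Println("after:", ditEnabled)
}
```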
Fixes #66450
Updates #49702
[0] https://github.com/torvalds/linux/blob/e8bdb3c8be08c9a3edc0a373c0aa8729355a0705/arch/arm64/kernel/process.c#L373
[1] https://github.com/apple-oss-distributions/xnu/blob/8d741a5de7ff4191bf97d57b9f54c2f6d4a15585/osfmk/arm64/status.c#L1666
Change-Id: I78eda691ff9254b0415f2b54770e5850a0179749
Reviewed-on: https://go-review.googlesource.com/c/go/+/598336
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Filippo Valsorda <filippo@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
In Loongson's new microarchitecture LA664 (Loongson-3A6000) and later, the atomic
compare-and-exchange instruction AMCAS[DB]{B,W,H,V} [1] is supported. Therefore,
the implementation of the atomic operation compare-and-swap can be selected according
to the CPUCFG flag LAMCAS: the AMCASDB (full barrier) instruction is used on new
microarchitectures, and traditional LL-SC is used on LA464 (Loongson-3A5000) and older
microarchitectures. This can significantly improve the performance of Go programs on
new microarchitectures.
goos: linux
goarch: loong64
pkg: internal/runtime/atomic
cpu: Loongson-3A6000 @ 2500.00MHz
| bench.old | bench.new |
| sec/op | sec/op vs base |
Cas 46.84n ± 0% 22.82n ± 0% -51.28% (p=0.000 n=20)
Cas-2 47.58n ± 0% 29.57n ± 0% -37.85% (p=0.000 n=20)
Cas-4 43.27n ± 20% 25.31n ± 13% -41.50% (p=0.000 n=20)
Cas64 46.85n ± 0% 22.82n ± 0% -51.29% (p=0.000 n=20)
Cas64-2 47.43n ± 0% 29.53n ± 0% -37.74% (p=0.002 n=20)
Cas64-4 43.18n ± 0% 25.28n ± 2% -41.46% (p=0.000 n=20)
geomean 45.82n 25.74n -43.82%
goos: linux
goarch: loong64
pkg: internal/runtime/atomic
cpu: Loongson-3A5000 @ 2500.00MHz
| bench.old | bench.new |
| sec/op | sec/op vs base |
Cas 50.05n ± 0% 51.26n ± 0% +2.42% (p=0.000 n=20)
Cas-2 52.80n ± 0% 53.11n ± 0% +0.59% (p=0.000 n=20)
Cas-4 55.97n ± 0% 57.31n ± 0% +2.39% (p=0.000 n=20)
Cas64 50.05n ± 0% 51.26n ± 0% +2.42% (p=0.000 n=20)
Cas64-2 52.68n ± 0% 53.11n ± 0% +0.82% (p=0.000 n=20)
Cas64-4 55.96n ± 0% 57.26n ± 0% +2.33% (p=0.000 n=20)
geomean 52.86n 53.83n +1.82%
[1]: https://loongson.github.io/LoongArch-Documentation/LoongArch-ELF-ABI-EN.html
Change-Id: I9b777c63c124fb492f61c903f77061fa2b4e5322
Reviewed-on: https://go-review.googlesource.com/c/go/+/613396
Reviewed-by: Meidan Li <limeidan@loongson.cn>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Reviewed-by: Qiqi Huang <huangqiqi@loongson.cn>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
We can reuse the same indirect storage when growing, so we don't
need an additional allocation.
Change-Id: I57adb406becfbec648188ec66f4bb2e94d4b9cab
Reviewed-on: https://go-review.googlesource.com/c/go/+/625902
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Keith Randall <khr@google.com>
|
|
Missed initializing a field in the stub that lets the noswiss
builder test the swiss implementation.
Change-Id: Ie093478ad3e4301e4fe88ba65c132a9dbccd89a9
Reviewed-on: https://go-review.googlesource.com/c/go/+/628895
Auto-Submit: Keith Randall <khr@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Keith Randall <khr@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
It is easily recomputed as capacity-1.
This reduces a table from 40 to 32 bytes (on 64-bit archs).
That gets us down one sizeclass.
Change-Id: Icb74fb2de50baa18ca62052c7b2fe8e6af4c8837
Reviewed-on: https://go-review.googlesource.com/c/go/+/625198
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Keith Randall <khr@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Keith Randall <khr@google.com>
|
|
We don't really need the index of the slot we're looking at.
Just keep looking until there are no more filled slots.
This particularly helps when there are only a few filled entries
(packed at the bottom), and we're looking for something that isn't
there. We exit earlier than we would otherwise.
goos: darwin
goarch: arm64
pkg: runtime
cpu: Apple M2 Ultra
│ baseline │ experiment │
│ sec/op │ sec/op vs base │
MapSmallAccessHit/Key=int64/Elem=int64/len=1-24 2.759n ± 0% 2.779n ± 2% ~ (p=0.055 n=10)
MapSmallAccessHit/Key=int64/Elem=int64/len=2-24 2.862n ± 1% 2.922n ± 1% +2.08% (p=0.000 n=10)
MapSmallAccessHit/Key=int64/Elem=int64/len=3-24 3.003n ± 0% 3.061n ± 1% +1.91% (p=0.000 n=10)
MapSmallAccessHit/Key=int64/Elem=int64/len=4-24 3.170n ± 1% 3.188n ± 1% +0.57% (p=0.030 n=10)
MapSmallAccessHit/Key=int64/Elem=int64/len=5-24 3.387n ± 1% 3.391n ± 1% ~ (p=0.362 n=10)
MapSmallAccessHit/Key=int64/Elem=int64/len=6-24 3.601n ± 1% 3.584n ± 0% -0.49% (p=0.009 n=10)
MapSmallAccessHit/Key=int64/Elem=int64/len=7-24 3.785n ± 1% 3.778n ± 3% ~ (p=0.987 n=10)
MapSmallAccessHit/Key=int64/Elem=int64/len=8-24 3.960n ± 1% 3.946n ± 1% ~ (p=0.256 n=10)
MapSmallAccessMiss/Key=int64/Elem=int64/len=0-24 2.004n ± 1%
MapSmallAccessMiss/Key=int64/Elem=int64/len=1-24 5.145n ± 1% 2.411n ± 1% -53.14% (p=0.000 n=10)
MapSmallAccessMiss/Key=int64/Elem=int64/len=2-24 5.128n ± 0% 3.313n ± 1% -35.40% (p=0.000 n=10)
MapSmallAccessMiss/Key=int64/Elem=int64/len=3-24 5.159n ± 1% 3.690n ± 1% -28.48% (p=0.000 n=10)
MapSmallAccessMiss/Key=int64/Elem=int64/len=4-24 5.117n ± 1% 4.466n ± 6% -12.73% (p=0.000 n=10)
MapSmallAccessMiss/Key=int64/Elem=int64/len=5-24 5.115n ± 1% 4.308n ± 1% -15.79% (p=0.000 n=10)
MapSmallAccessMiss/Key=int64/Elem=int64/len=6-24 5.111n ± 1% 4.538n ± 2% -11.19% (p=0.000 n=10)
MapSmallAccessMiss/Key=int64/Elem=int64/len=7-24 4.896n ± 4% 4.831n ± 1% -1.33% (p=0.001 n=10)
MapSmallAccessMiss/Key=int64/Elem=int64/len=8-24 4.905n ± 1% 5.121n ± 1% +4.40% (p=0.000 n=10)
geomean 3.917n 3.631n -11.11%
Change-Id: Ife26ac457a513af24fa0921b839ee6cd5fed6fba
Reviewed-on: https://go-review.googlesource.com/c/go/+/627717
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
|
|
typ.Group.Size involves two loads.
Instead cache GroupSize as a separate field of the map type
so we can get to it in just one load.
Change-Id: I10ffdce1c7f75dcf448da14040fda78f0d75fd1d
Reviewed-on: https://go-review.googlesource.com/c/go/+/627716
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
For large strings, do a quick equality check on all the slots.
Only if more than one passes the quick equality check do we
resort to hashing.
│ baseline │ experiment │
│ sec/op │ sec/op vs base │
MegMap-24 16609.50n ± 1% 13.91n ± 3% -99.92% (p=0.000 n=10)
MegOneMap-24 16655.00n ± 0% 12.27n ± 1% -99.93% (p=0.000 n=10)
MegEqMap-24 41.31µ ± 1% 25.03µ ± 1% -39.40% (p=0.000 n=10)
MegEmptyMap-24 2.034n ± 0% 2.027n ± 2% ~ (p=0.541 n=10)
MegEmptyMapWithInterfaceKey-24 5.931n ± 2% 5.599n ± 1% -5.60% (p=0.000 n=10)
MapStringKeysEight_16-24 8.473n ± 7% 8.224n ± 5% ~ (p=0.315 n=10)
MapStringKeysEight_32-24 8.441n ± 2% 8.147n ± 1% -3.48% (p=0.002 n=10)
MapStringKeysEight_64-24 8.769n ± 1% 8.517n ± 1% -2.87% (p=0.000 n=10)
MapStringKeysEight_128-24 10.73n ± 4% 13.57n ± 8% +26.57% (p=0.000 n=10)
MapStringKeysEight_256-24 12.97n ± 2% 14.35n ± 4% +10.64% (p=0.001 n=10)
MapStringKeysEight_1M-24 17359.50n ± 3% 13.92n ± 4% -99.92% (p=0.000 n=10)
Change-Id: I4cc2ea4edab12a4b03236de626c7bcf0f96b6cc0
Reviewed-on: https://go-review.googlesource.com/c/go/+/625905
Reviewed-by: Keith Randall <khr@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
|
|
Iteration over swissmaps with low load (think map with large hint but
only one entry) is significantly regressed vs old maps. See noswiss vs
swiss-tip below (+60%).
Currently we visit every single slot and individually check if the slot
is full or not.
We can do much better by using the control word to find all full slots
in a group in a single operation. This lets us skip completely empty
groups for instance.
Always using the control match approach is great for maps with low load,
but is a regression for mostly full maps. Mostly full maps have the
majority of slots full, so most calls to mapiternext will return the
next slot. In that case, doing the full group match on every call is
more expensive than checking the individual slot.
Thus we take a hybrid approach: on each call, we first check an
individual slot. If that slot is full, we're done. If that slot is
non-full, then we fall back to doing full group matches.
This trade-off works well. Both mostly empty and mostly full maps
perform nearly as well as doing all matching and all individual,
respectively.
The fast path is placed above the slow path loop rather than combined
(with some sort of `useMatch` variable) into a single loop to help the
compiler's code generation. The compiler really struggles with code
generation on a combined loop for some reason, yielding ~15% additional
instructions/op.
Comparison with old maps prior to this CL:
│ noswiss │ swiss-tip │
│ sec/op │ sec/op vs base │
MapIter/Key=int64/Elem=int64/len=6-12 11.53n ± 2% 10.64n ± 2% -7.72% (p=0.002 n=6)
MapIter/Key=int64/Elem=int64/len=64-12 10.180n ± 2% 9.670n ± 5% -5.01% (p=0.004 n=6)
MapIter/Key=int64/Elem=int64/len=65536-12 10.78n ± 1% 10.15n ± 2% -5.84% (p=0.002 n=6)
MapIterLowLoad/Key=int64/Elem=int64/len=6-12 6.116n ± 2% 6.840n ± 2% +11.84% (p=0.002 n=6)
MapIterLowLoad/Key=int64/Elem=int64/len=64-12 2.403n ± 2% 3.892n ± 0% +61.95% (p=0.002 n=6)
MapIterLowLoad/Key=int64/Elem=int64/len=65536-12 1.940n ± 3% 3.237n ± 1% +66.81% (p=0.002 n=6)
MapPop/Key=int64/Elem=int64/len=6-12 66.20n ± 2% 60.14n ± 3% -9.15% (p=0.002 n=6)
MapPop/Key=int64/Elem=int64/len=64-12 97.24n ± 1% 171.35n ± 1% +76.21% (p=0.002 n=6)
MapPop/Key=int64/Elem=int64/len=65536-12 826.1n ± 12% 842.5n ± 10% ~ (p=0.937 n=6)
geomean 17.93n 20.96n +16.88%
After this CL:
│ noswiss │ swiss-cl │
│ sec/op │ sec/op vs base │
MapIter/Key=int64/Elem=int64/len=6-12 11.53n ± 2% 10.90n ± 3% -5.42% (p=0.002 n=6)
MapIter/Key=int64/Elem=int64/len=64-12 10.180n ± 2% 9.719n ± 9% -4.53% (p=0.043 n=6)
MapIter/Key=int64/Elem=int64/len=65536-12 10.78n ± 1% 10.07n ± 2% -6.63% (p=0.002 n=6)
MapIterLowLoad/Key=int64/Elem=int64/len=6-12 6.116n ± 2% 7.022n ± 1% +14.82% (p=0.002 n=6)
MapIterLowLoad/Key=int64/Elem=int64/len=64-12 2.403n ± 2% 1.475n ± 1% -38.63% (p=0.002 n=6)
MapIterLowLoad/Key=int64/Elem=int64/len=65536-12 1.940n ± 3% 1.210n ± 6% -37.67% (p=0.002 n=6)
MapPop/Key=int64/Elem=int64/len=6-12 66.20n ± 2% 61.54n ± 2% -7.02% (p=0.002 n=6)
MapPop/Key=int64/Elem=int64/len=64-12 97.24n ± 1% 110.10n ± 1% +13.23% (p=0.002 n=6)
MapPop/Key=int64/Elem=int64/len=65536-12 826.1n ± 12% 504.7n ± 6% -38.91% (p=0.002 n=6)
geomean 17.93n 15.29n -14.74%
For #54766.
Cq-Include-Trybots: luci.golang.try:gotip-linux-ppc64_power10
Change-Id: Ic07f9df763239e85be57873103df5007144fdaef
Reviewed-on: https://go-review.googlesource.com/c/go/+/627156
Auto-Submit: Michael Pratt <mpratt@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Keith Randall <khr@google.com>
|
|
│ baseline │ experiment │
│ sec/op │ sec/op vs base │
MapDeleteLargeKey-24 312.0n ± 6% 162.3n ± 5% -47.97% (p=0.000 n=10)
Change-Id: I31f1f8e3c344cf8abf2e9eb4b51b78fcd67b93c4
Reviewed-on: https://go-review.googlesource.com/c/go/+/625906
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Keith Randall <khr@google.com>
|
|
Change-Id: I7b3d95c0861ae2b6e0721b65aa75cda036435e9c
Reviewed-on: https://go-review.googlesource.com/c/go/+/625903
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
on loong64
Use loong64's atomic operation instruction AMANDDB{V,W,W} (full barrier) to implement
And{64,32,8}, AMORDB{V,W,W} (full barrier) to implement Or{64,32,8}.
Intrinsify And{64,32,8} and Or{64,32,8}, and alias all of the And/Or operations
into the sync/atomic package.
goos: linux
goarch: loong64
pkg: internal/runtime/atomic
cpu: Loongson-3A6000-HV @ 2500.00MHz
| bench.old | bench.new |
| sec/op | sec/op vs base |
And32 27.73n ± 0% 10.81n ± 0% -61.02% (p=0.000 n=20)
And32Parallel 28.96n ± 0% 12.41n ± 0% -57.15% (p=0.000 n=20)
And64 27.73n ± 0% 10.81n ± 0% -61.02% (p=0.000 n=20)
And64Parallel 28.96n ± 0% 12.41n ± 0% -57.15% (p=0.000 n=20)
Or32 27.62n ± 0% 10.81n ± 0% -60.86% (p=0.000 n=20)
Or32Parallel 28.96n ± 0% 12.41n ± 0% -57.15% (p=0.000 n=20)
Or64 27.62n ± 0% 10.81n ± 0% -60.86% (p=0.000 n=20)
Or64Parallel 28.97n ± 0% 12.41n ± 0% -57.16% (p=0.000 n=20)
And8 29.15n ± 0% 13.21n ± 0% -54.68% (p=0.000 n=20)
And 27.71n ± 0% 12.82n ± 0% -53.74% (p=0.000 n=20)
And8Parallel 28.99n ± 0% 14.46n ± 0% -50.12% (p=0.000 n=20)
AndParallel 29.12n ± 0% 14.42n ± 0% -50.48% (p=0.000 n=20)
Or8 28.31n ± 0% 12.81n ± 0% -54.75% (p=0.000 n=20)
Or 27.72n ± 0% 12.81n ± 0% -53.79% (p=0.000 n=20)
Or8Parallel 29.03n ± 0% 14.62n ± 0% -49.64% (p=0.000 n=20)
OrParallel 29.12n ± 0% 14.42n ± 0% -50.49% (p=0.000 n=20)
geomean 28.47n 12.58n -55.80%
goos: linux
goarch: loong64
pkg: internal/runtime/atomic
cpu: Loongson-3A5000 @ 2500.00MHz
| bench.old | bench.new |
| sec/op | sec/op vs base |
And32 30.02n ± 0% 14.81n ± 0% -50.67% (p=0.000 n=20)
And32Parallel 30.83n ± 0% 15.61n ± 0% -49.37% (p=0.000 n=20)
And64 30.02n ± 0% 14.81n ± 0% -50.67% (p=0.000 n=20)
And64Parallel 30.83n ± 0% 15.61n ± 0% -49.37% (p=0.000 n=20)
And8 30.42n ± 0% 14.41n ± 0% -52.63% (p=0.000 n=20)
And 30.02n ± 0% 13.61n ± 0% -54.66% (p=0.000 n=20)
And8Parallel 31.23n ± 0% 15.21n ± 0% -51.30% (p=0.000 n=20)
AndParallel 30.83n ± 0% 14.41n ± 0% -53.26% (p=0.000 n=20)
Or32 30.02n ± 0% 14.81n ± 0% -50.67% (p=0.000 n=20)
Or32Parallel 30.83n ± 0% 15.61n ± 0% -49.37% (p=0.000 n=20)
Or64 30.02n ± 0% 14.82n ± 0% -50.63% (p=0.000 n=20)
Or64Parallel 30.83n ± 0% 15.61n ± 0% -49.37% (p=0.000 n=20)
Or8 30.02n ± 0% 14.01n ± 0% -53.33% (p=0.000 n=20)
Or 30.02n ± 0% 13.61n ± 0% -54.66% (p=0.000 n=20)
Or8Parallel 30.83n ± 0% 14.81n ± 0% -51.96% (p=0.000 n=20)
OrParallel 30.83n ± 0% 14.41n ± 0% -53.26% (p=0.000 n=20)
geomean 30.47n 14.75n -51.61%
Change-Id: If008ff6a08b51905076f8ddb6e92f8e214d3f7b3
Reviewed-on: https://go-review.googlesource.com/c/go/+/482756
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Qiqi Huang <huangqiqi@loongson.cn>
Reviewed-by: Meidan Li <limeidan@loongson.cn>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: sophie zhao <zhaoxiaolin@loongson.cn>
Reviewed-by: Cherry Mui <cherryyz@google.com>
|
|
Use Loong64's atomic operation instruction AMSWAPDB{W,V} (full barrier)
to implement atomic.Xchg{32,64}
goos: linux
goarch: loong64
pkg: internal/runtime/atomic
cpu: Loongson-3A5000 @ 2500.00MHz
| old.bench | new.bench |
| sec/op | sec/op vs base |
Xchg 26.44n ± 0% 12.01n ± 0% -54.58% (p=0.000 n=20)
Xchg-2 30.10n ± 0% 25.58n ± 0% -15.02% (p=0.000 n=20)
Xchg-4 30.06n ± 0% 24.82n ± 0% -17.43% (p=0.000 n=20)
Xchg64 26.44n ± 0% 12.02n ± 0% -54.54% (p=0.000 n=20)
Xchg64-2 30.10n ± 0% 25.57n ± 0% -15.05% (p=0.000 n=20)
Xchg64-4 30.05n ± 0% 24.80n ± 0% -17.47% (p=0.000 n=20)
geomean 28.81n 19.68n -31.69%
goos: linux
goarch: loong64
pkg: internal/runtime/atomic
cpu: Loongson-3A6000 @ 2500.00MHz
| old.bench | new.bench |
| sec/op | sec/op vs base |
Xchg 25.62n ± 0% 12.41n ± 0% -51.56% (p=0.000 n=20)
Xchg-2 35.01n ± 0% 20.59n ± 0% -41.19% (p=0.000 n=20)
Xchg-4 34.63n ± 0% 19.59n ± 0% -43.42% (p=0.000 n=20)
Xchg64 25.62n ± 0% 12.41n ± 0% -51.56% (p=0.000 n=20)
Xchg64-2 35.01n ± 0% 20.59n ± 0% -41.19% (p=0.000 n=20)
Xchg64-4 34.67n ± 0% 19.59n ± 0% -43.50% (p=0.000 n=20)
geomean 31.44n 17.11n -45.59%
Updates #59120.
Change-Id: Ied74fc20338b63799c6d6eeb122c31b42cff0f7e
Reviewed-on: https://go-review.googlesource.com/c/go/+/481578
Reviewed-by: Meidan Li <limeidan@loongson.cn>
Reviewed-by: Qiqi Huang <huangqiqi@loongson.cn>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: WANG Xuerui <git@xen0n.name>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: sophie zhao <zhaoxiaolin@loongson.cn>
|
|
Use Loong64's atomic operation instruction AMADDDB{W,V} (full barrier)
to implement atomic.Xadd{32,64}
goos: linux
goarch: loong64
pkg: internal/runtime/atomic
cpu: Loongson-3A5000 @ 2500.00MHz
| bench.old | bench.new |
| sec/op | sec/op vs base |
Xadd 27.24n ± 0% 12.01n ± 0% -55.91% (p=0.000 n=20)
Xadd-2 31.93n ± 0% 25.55n ± 0% -19.98% (p=0.000 n=20)
Xadd-4 31.90n ± 0% 24.80n ± 0% -22.26% (p=0.000 n=20)
Xadd64 27.23n ± 0% 12.01n ± 0% -55.89% (p=0.000 n=20)
Xadd64-2 31.93n ± 0% 25.57n ± 0% -19.90% (p=0.000 n=20)
Xadd64-4 31.89n ± 0% 24.80n ± 0% -22.23% (p=0.000 n=20)
geomean 30.27n 19.67n -35.01%
goos: linux
goarch: loong64
pkg: internal/runtime/atomic
cpu: Loongson-3A6000 @ 2500.00MHz
| bench.old | bench.new |
| sec/op | sec/op vs base |
Xadd 26.02n ± 0% 12.41n ± 0% -52.31% (p=0.000 n=20)
Xadd-2 37.36n ± 0% 20.60n ± 0% -44.86% (p=0.000 n=20)
Xadd-4 37.22n ± 0% 19.59n ± 0% -47.37% (p=0.000 n=20)
Xadd64 26.42n ± 0% 12.41n ± 0% -53.03% (p=0.000 n=20)
Xadd64-2 37.77n ± 0% 20.60n ± 0% -45.46% (p=0.000 n=20)
Xadd64-4 37.78n ± 0% 19.59n ± 0% -48.15% (p=0.000 n=20)
geomean 33.30n 17.11n -48.62%
Change-Id: I982539c2aa04680e9dd11b099ba8d5f215bf9b32
Reviewed-on: https://go-review.googlesource.com/c/go/+/481937
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: sophie zhao <zhaoxiaolin@loongson.cn>
Reviewed-by: Meidan Li <limeidan@loongson.cn>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: WANG Xuerui <git@xen0n.name>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Qiqi Huang <huangqiqi@loongson.cn>
|
|
On Loong64, AMSWAPDB{W,V} instructions are supported by default, and AMSWAPDB{B,H} [1]
is a new instruction added by LA664(Loongson 3A6000) and later microarchitectures.
Therefore, AMSWAPDB{W,V} (full barrier) is used to implement AtomicStore{32,64}, and
the traditional MOVB or the new AMSWAPDBB is used to implement AtomicStore8 according
to the CPU feature.
The StoreRelease barrier on Loong64 is "dbar 0x12", but it is still necessary to
ensure consistency in the order of Store/Load [2].
LoweredAtomicStorezero{32,64} was removed because on loong64 the constant "0" uses
the R0 register, and there is no performance difference between the implementations
of LoweredAtomicStorezero{32,64} and LoweredAtomicStore{32,64}.
goos: linux
goarch: loong64
pkg: internal/runtime/atomic
cpu: Loongson-3A5000-HV @ 2500.00MHz
| bench.old | bench.new |
| sec/op | sec/op vs base |
AtomicStore64 19.61n ± 0% 13.61n ± 0% -30.60% (p=0.000 n=20)
AtomicStore64-2 19.61n ± 0% 13.61n ± 0% -30.57% (p=0.000 n=20)
AtomicStore64-4 19.62n ± 0% 13.61n ± 0% -30.63% (p=0.000 n=20)
AtomicStore 19.61n ± 0% 13.61n ± 0% -30.60% (p=0.000 n=20)
AtomicStore-2 19.62n ± 0% 13.61n ± 0% -30.63% (p=0.000 n=20)
AtomicStore-4 19.62n ± 0% 13.62n ± 0% -30.58% (p=0.000 n=20)
AtomicStore8 19.61n ± 0% 20.01n ± 0% +2.04% (p=0.000 n=20)
AtomicStore8-2 19.62n ± 0% 20.02n ± 0% +2.01% (p=0.000 n=20)
AtomicStore8-4 19.61n ± 0% 20.02n ± 0% +2.09% (p=0.000 n=20)
geomean 19.61n 15.48n -21.08%
goos: linux
goarch: loong64
pkg: internal/runtime/atomic
cpu: Loongson-3A6000 @ 2500.00MHz
| bench.old | bench.new |
| sec/op | sec/op vs base |
AtomicStore64 18.03n ± 0% 12.81n ± 0% -28.93% (p=0.000 n=20)
AtomicStore64-2 18.02n ± 0% 12.81n ± 0% -28.91% (p=0.000 n=20)
AtomicStore64-4 18.01n ± 0% 12.81n ± 0% -28.87% (p=0.000 n=20)
AtomicStore 18.02n ± 0% 12.81n ± 0% -28.91% (p=0.000 n=20)
AtomicStore-2 18.01n ± 0% 12.81n ± 0% -28.87% (p=0.000 n=20)
AtomicStore-4 18.01n ± 0% 12.81n ± 0% -28.87% (p=0.000 n=20)
AtomicStore8 18.01n ± 0% 12.81n ± 0% -28.87% (p=0.000 n=20)
AtomicStore8-2 18.01n ± 0% 12.81n ± 0% -28.87% (p=0.000 n=20)
AtomicStore8-4 18.01n ± 0% 12.81n ± 0% -28.87% (p=0.000 n=20)
geomean 18.01n 12.81n -28.89%
[1]: https://loongson.github.io/LoongArch-Documentation/LoongArch-ELF-ABI-EN.html
[2]: https://gcc.gnu.org/git/?p=gcc.git;a=blob_plain;f=gcc/config/loongarch/sync.md
Change-Id: I4ae5e8dd0e6f026129b6e503990a763ed40c6097
Reviewed-on: https://go-review.googlesource.com/c/go/+/581356
Reviewed-by: sophie zhao <zhaoxiaolin@loongson.cn>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Qiqi Huang <huangqiqi@loongson.cn>
Reviewed-by: Meidan Li <limeidan@loongson.cn>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: David Chase <drchase@google.com>
|