| Age | Commit message (Collapse) | Author |
|
The register allocation will use the upper register soon, this CL is to
enable that.
Change-Id: I4d7285e08b95f4e6ebee72594dfbe8d1199f09ed
Reviewed-on: https://go-review.googlesource.com/c/go/+/686498
TryBot-Bypass: David Chase <drchase@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Commit-Queue: David Chase <drchase@google.com>
|
|
Based on CL 669415 by shaojunyang@google.com.
Change-Id: I574f15c3b18a7179a1573aaf567caf18d8602ef1
Reviewed-on: https://go-review.googlesource.com/c/go/+/680900
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Austin Clements <austin@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
|
|
Asynchronous preemption must save all registers that could be in use
by Go code. Currently, it saves all of these to the goroutine stack.
As a result, the stack frame requirements of asynchronous preemption
can be rather high. On amd64, this requires 368 bytes of stack space,
most of which is the XMM registers. Several RISC architectures are
around 0.5 KiB.
As we add support for SIMD instructions, this is going to become a
problem. The AVX-512 register state is 2.5 KiB. This well exceeds the
nosplit limit, and even if it didn't, could constrain when we can
asynchronously preempt goroutines on small stacks.
This CL fixes this by moving pure scalar state stored in non-GP
registers off the stack and into an allocated "extended register
state" object. To reduce space overhead, we only allocate these
objects as needed. While in the theoretical limit, every G could need
this register state, in practice very few do at a time.
However, we can't allocate when we're in the middle of saving the
register state during an asynchronous preemption, so we reserve
scratch space on every P to temporarily store the register state,
which can then be copied out to an allocated state object later by Go
code.
This commit only implements this for amd64, since that's where we're
about to add much more vector state, but it lays the groundwork for
doing this on any architecture that could benefit.
Change-Id: I123a95e21c11d5c10942d70e27f84d2d99bbf735
Reviewed-on: https://go-review.googlesource.com/c/go/+/680898
Reviewed-by: Cherry Mui <cherryyz@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Austin Clements <austin@google.com>
|
|
Updates #37174
Updates #49233
Fixes #41152
Change-Id: I35b148c8bc132f02dd6a5a6bb48b711fb5c5df9e
Reviewed-on: https://go-review.googlesource.com/c/go/+/560955
Reviewed-by: Keith Randall <khr@google.com>
Auto-Submit: Dmitri Shuralyov <dmitshur@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
|
|
Use amd64 assembly header to remove unnecessary cpu flags checks
and dead code that is guaranteed to not be executed when compiling
for specific microarchitectures.
name old time/op new time/op delta
BytesCompare/1-12 3.88ns ± 1% 3.18ns ± 1% -18.15% (p=0.008 n=5+5)
BytesCompare/2-12 3.89ns ± 1% 3.21ns ± 2% -17.66% (p=0.008 n=5+5)
BytesCompare/4-12 3.89ns ± 0% 3.17ns ± 0% -18.62% (p=0.008 n=5+5)
BytesCompare/8-12 3.44ns ± 2% 3.39ns ± 1% -1.36% (p=0.008 n=5+5)
BytesCompare/16-12 3.40ns ± 1% 3.14ns ± 0% -7.77% (p=0.008 n=5+5)
BytesCompare/32-12 3.90ns ± 1% 3.65ns ± 0% -6.19% (p=0.008 n=5+5)
BytesCompare/64-12 4.96ns ± 1% 4.71ns ± 2% -4.98% (p=0.008 n=5+5)
BytesCompare/128-12 6.42ns ± 0% 5.99ns ± 4% -6.75% (p=0.008 n=5+5)
BytesCompare/256-12 9.36ns ± 0% 7.40ns ± 0% -20.97% (p=0.008 n=5+5)
BytesCompare/512-12 15.9ns ± 1% 11.4ns ± 1% -28.36% (p=0.008 n=5+5)
BytesCompare/1024-12 27.0ns ± 0% 19.3ns ± 0% -28.36% (p=0.008 n=5+5)
BytesCompare/2048-12 50.2ns ± 0% 43.3ns ± 0% -13.71% (p=0.008 n=5+5)
[Geo mean] 7.13ns 6.07ns -14.86%
name old speed new speed delta
Count/10-12 723MB/s ± 0% 704MB/s ± 1% -2.73% (p=0.008 n=5+5)
Count/32-12 2.21GB/s ± 0% 2.12GB/s ± 2% -4.21% (p=0.008 n=5+5)
Count/4K-12 1.03GB/s ± 0% 1.03GB/s ± 1% ~ (p=1.000 n=5+5)
Count/4M-12 1.04GB/s ± 0% 1.02GB/s ± 2% ~ (p=0.310 n=5+5)
Count/64M-12 1.02GB/s ± 0% 1.01GB/s ± 1% -1.00% (p=0.016 n=5+5)
CountEasy/10-12 779MB/s ± 0% 768MB/s ± 1% -1.48% (p=0.008 n=5+5)
CountEasy/32-12 2.15GB/s ± 0% 2.09GB/s ± 1% -2.71% (p=0.008 n=5+5)
CountEasy/4K-12 45.1GB/s ± 1% 45.2GB/s ± 1% ~ (p=0.421 n=5+5)
CountEasy/4M-12 36.4GB/s ± 1% 36.5GB/s ± 1% ~ (p=0.690 n=5+5)
CountEasy/64M-12 16.1GB/s ± 2% 16.4GB/s ± 1% ~ (p=0.056 n=5+5)
CountSingle/10-12 2.15GB/s ± 2% 2.22GB/s ± 1% +3.37% (p=0.008 n=5+5)
CountSingle/32-12 5.86GB/s ± 1% 5.76GB/s ± 1% -1.55% (p=0.008 n=5+5)
CountSingle/4K-12 54.6GB/s ± 1% 55.0GB/s ± 1% ~ (p=0.548 n=5+5)
CountSingle/4M-12 45.9GB/s ± 4% 46.4GB/s ± 2% ~ (p=0.548 n=5+5)
CountSingle/64M-12 17.3GB/s ± 1% 17.2GB/s ± 2% ~ (p=1.000 n=5+5)
[Geo mean] 5.11GB/s 5.08GB/s -0.53%
name old speed new speed delta
Equal/1-12 200MB/s ± 0% 188MB/s ± 1% -6.11% (p=0.008 n=5+5)
Equal/6-12 1.20GB/s ± 0% 1.13GB/s ± 1% -6.38% (p=0.008 n=5+5)
Equal/9-12 1.67GB/s ± 3% 1.74GB/s ± 1% +3.83% (p=0.008 n=5+5)
Equal/15-12 2.82GB/s ± 1% 2.89GB/s ± 1% +2.63% (p=0.008 n=5+5)
Equal/16-12 2.96GB/s ± 1% 3.08GB/s ± 1% +3.95% (p=0.008 n=5+5)
Equal/20-12 3.33GB/s ± 1% 3.54GB/s ± 1% +6.36% (p=0.008 n=5+5)
Equal/32-12 4.57GB/s ± 0% 5.26GB/s ± 1% +15.09% (p=0.008 n=5+5)
Equal/4K-12 62.0GB/s ± 1% 65.9GB/s ± 2% +6.29% (p=0.008 n=5+5)
Equal/4M-12 23.6GB/s ± 2% 24.8GB/s ± 4% +5.43% (p=0.008 n=5+5)
Equal/64M-12 11.1GB/s ± 2% 11.3GB/s ± 1% +1.69% (p=0.008 n=5+5)
[Geo mean] 3.91GB/s 4.03GB/s +3.11%
name old speed new speed delta
IndexByte/10-12 2.64GB/s ± 0% 2.69GB/s ± 0% +1.67% (p=0.008 n=5+5)
IndexByte/32-12 6.79GB/s ± 0% 6.27GB/s ± 0% -7.57% (p=0.008 n=5+5)
IndexByte/4K-12 56.2GB/s ± 0% 56.9GB/s ± 0% +1.27% (p=0.008 n=5+5)
IndexByte/4M-12 40.1GB/s ± 1% 41.7GB/s ± 1% +4.05% (p=0.008 n=5+5)
IndexByte/64M-12 17.5GB/s ± 0% 17.7GB/s ± 1% ~ (p=0.095 n=5+5)
IndexBytePortable/10-12 2.06GB/s ± 1% 2.16GB/s ± 1% +5.08% (p=0.008 n=5+5)
IndexBytePortable/32-12 1.40GB/s ± 1% 1.54GB/s ± 1% +10.05% (p=0.008 n=5+5)
IndexBytePortable/4K-12 3.99GB/s ± 0% 4.08GB/s ± 0% +2.16% (p=0.008 n=5+5)
IndexBytePortable/4M-12 4.05GB/s ± 1% 4.08GB/s ± 2% ~ (p=0.095 n=5+5)
IndexBytePortable/64M-12 3.80GB/s ± 1% 3.81GB/s ± 0% ~ (p=0.421 n=5+5)
IndexRune/10-12 746MB/s ± 1% 752MB/s ± 0% +0.85% (p=0.008 n=5+5)
IndexRune/32-12 2.33GB/s ± 0% 2.42GB/s ± 0% +3.66% (p=0.008 n=5+5)
IndexRune/4K-12 44.4GB/s ± 0% 44.2GB/s ± 0% ~ (p=0.095 n=5+5)
IndexRune/4M-12 36.2GB/s ± 1% 36.3GB/s ± 2% ~ (p=0.841 n=5+5)
IndexRune/64M-12 16.2GB/s ± 2% 16.3GB/s ± 2% ~ (p=0.548 n=5+5)
IndexRuneASCII/10-12 2.57GB/s ± 0% 2.58GB/s ± 0% +0.63% (p=0.008 n=5+5)
IndexRuneASCII/32-12 6.00GB/s ± 0% 6.30GB/s ± 1% +4.98% (p=0.008 n=5+5)
IndexRuneASCII/4K-12 56.7GB/s ± 0% 56.8GB/s ± 1% ~ (p=0.151 n=5+5)
IndexRuneASCII/4M-12 41.6GB/s ± 1% 41.7GB/s ± 2% ~ (p=0.151 n=5+5)
IndexRuneASCII/64M-12 17.7GB/s ± 1% 17.6GB/s ± 1% ~ (p=0.222 n=5+5)
Index/10-12 1.06GB/s ± 1% 1.06GB/s ± 0% ~ (p=0.310 n=5+5)
Index/32-12 3.57GB/s ± 0% 3.56GB/s ± 1% ~ (p=0.056 n=5+5)
Index/4K-12 1.02GB/s ± 2% 1.03GB/s ± 0% ~ (p=0.690 n=5+5)
Index/4M-12 1.04GB/s ± 0% 1.03GB/s ± 1% ~ (p=1.000 n=4+5)
Index/64M-12 1.02GB/s ± 0% 1.02GB/s ± 0% ~ (p=0.905 n=5+4)
IndexEasy/10-12 1.12GB/s ± 2% 1.15GB/s ± 1% +3.10% (p=0.008 n=5+5)
IndexEasy/32-12 3.14GB/s ± 2% 3.13GB/s ± 1% ~ (p=0.310 n=5+5)
IndexEasy/4K-12 47.6GB/s ± 1% 47.7GB/s ± 2% ~ (p=0.310 n=5+5)
IndexEasy/4M-12 36.4GB/s ± 1% 36.3GB/s ± 2% ~ (p=0.690 n=5+5)
IndexEasy/64M-12 16.1GB/s ± 1% 16.4GB/s ± 5% ~ (p=0.151 n=5+5)
[Geo mean] 6.39GB/s 6.46GB/s +1.11%
Change-Id: Ic1ca62f5cc719d87e2c4aeff25ad73507facff82
Reviewed-on: https://go-review.googlesource.com/c/go/+/397576
Reviewed-by: Keith Randall <khr@google.com>
Run-TryBot: Keith Randall <khr@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
|
|
assembly functions
There are a few assembly functions in the runtime that are marked
as ABIInternal, solely because funcPC can get the right address.
The functions themselves do not actually follow ABIInternal (or
irrelevant). Now we have internal/abi.FuncPCABI0, use that, and
un-mark the functions.
Also un-mark assembly functions that are only called in assembly.
For them, it only matters if the caller and callee are consistent.
Change-Id: I240e126ac13cb362f61ff8482057ee9f53c24097
Reviewed-on: https://go-review.googlesource.com/c/go/+/321950
Trust: Cherry Mui <cherryyz@google.com>
Run-TryBot: Cherry Mui <cherryyz@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
|
|
When in dynlink mode, reading a global can clobber R15.
Just to be safe, save R15 before checking the AVX state to see
if we need to VZEROUPPER or not.
This could cause a problem in buildmodes that aren't supported yet.
Change-Id: I8fda62d3fbe808584774fa5e8d9810a4612a84e5
Reviewed-on: https://go-review.googlesource.com/c/go/+/288452
Trust: Keith Randall <khr@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
|
|
[This is a roll-forward of CL 262319, with a fix for some Darwin test
failures].
Change the definitions of selected runtime assembly routines
from ABI0 (the default) to ABIInternal. The ABIInternal def is
intended to indicate that these functions don't follow the existing Go
runtime ABI. In addition, convert the assembly reference to
runtime.main (from runtime.mainPC) to ABIInternal. Finally, for
functions such as "runtime.duffzero" that are called directly from
generated code, make sure that the compiler looks up the correct
ABI version.
This is intended to support the register abi work, however these
changes should not have any issues even when GOEXPERIMENT=regabi is
not in effect.
Updates #27539, #40724.
Change-Id: Idf507f1c06176073563845239e1a54dad51a9ea9
Reviewed-on: https://go-review.googlesource.com/c/go/+/266638
Trust: Than McIntosh <thanm@google.com>
Run-TryBot: Than McIntosh <thanm@google.com>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
|
|
This reverts commit 50af50d136551e2009b2b52e829570536271cdaa.
Reason for revert: Causes failures in the runtime package test on Darwin, apparently.
Change-Id: I006bc1b3443fa7207e92fb4a93e3fb438d4d3de3
Reviewed-on: https://go-review.googlesource.com/c/go/+/266257
Trust: Than McIntosh <thanm@google.com>
Run-TryBot: Than McIntosh <thanm@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
|
|
Change the definitions of selected runtime assembly routines
from ABI0 (the default) to ABIInternal. The ABIInternal def is
intended to indicate that these functions don't follow the existing Go
runtime ABI. In addition, convert the assembly reference to
runtime.main (from runtime.mainPC) to ABIInternal. Finally, for
functions such as "runtime.duffzero" that are called directly from
generated code, make sure that the compiler looks up the correct
ABI version.
This is intended to support the register abi work, however these
changes should not have any issues even when GOEXPERIMENT=regabi is
not in effect.
Updates #27539, #40724.
Change-Id: I9846f8dcaccc95718cf2e61a18b7e924a0677e4c
Reviewed-on: https://go-review.googlesource.com/c/go/+/262319
Run-TryBot: Than McIntosh <thanm@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Trust: Than McIntosh <thanm@google.com>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
|
|
In CL 219131 we inserted a VZEROUPPER instruction on darwin/amd64.
The instruction is not available on pre-AVX machines. Guard it
with CPU feature.
Fixes #37459.
Change-Id: I9a064df277d091be4ee594eda5c7fd8ee323102b
Reviewed-on: https://go-review.googlesource.com/c/go/+/221057
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
|
|
Apparently, the signal handling code path in darwin kernel leaves
the upper bits of Y registers in a dirty state, which causes many
SSE operations (128-bit and narrower) become much slower. Clear
the upper bits to get to a clean state.
We do it at the entry of asyncPreempt, which is immediately
following exiting from the kernel's signal handling code, if we
actually injected a call. It does not cover other exits where we
don't inject a call, e.g. failed preemption, profiling signal, or
other async signals. But it does cover an important use case of
async signals, preempting a tight numerical loop, which we
introduced in this cycle.
Running the benchmark in issue #37174:
name old time/op new time/op delta
Fast-8 90.0ns ± 1% 46.8ns ± 3% -47.97% (p=0.000 n=10+10)
Slow-8 188ns ± 5% 49ns ± 1% -73.82% (p=0.000 n=10+9)
There is no more slowdown due to preemption signals.
For #37174.
Change-Id: I8b83d083fade1cabbda09b4bc25ccbadafaf7605
Reviewed-on: https://go-review.googlesource.com/c/go/+/219131
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
|
|
This adds asynchronous preemption function for amd64 and 386. These
functions spill and restore all register state that can be used by
user Go code.
For the moment we stub out the other arches.
For #10958, #24543.
Change-Id: I6f93fabe9875f4834922a5712362e79045c00aca
Reviewed-on: https://go-review.googlesource.com/c/go/+/201759
Run-TryBot: Austin Clements <austin@google.com>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
|