| Age | Commit message (Collapse) | Author |
|
The BSWAPL instruction is supported in i486 and newer.
https://github.com/golang/go/wiki/MinimumRequirements#386 says we
support "All Pentium MMX or later". The Pentium is also referred to as
i586, so that we are safe with these instructions.
Change-Id: I6dea1f9d864a45bb07c8f8f35a81cfe16cca216c
Reviewed-on: https://go-review.googlesource.com/c/go/+/465515
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
Reviewed-by: Keith Randall <khr@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
Reviewed-by: Keith Randall <khr@google.com>
|
|
This change intrinsifies ReverseBytes{16|32|64} by generating the
corresponding new instructions in Power10: brh, brd and brw and
adds a verification test for the same.
On Power 9 and 8, the .go code performs optimally as it is.
Performance improvement seen on Power10:
ReverseBytes32 1.38ns ± 0% 1.18ns ± 0% -14.2
ReverseBytes64 1.52ns ± 0% 1.11ns ± 0% -26.87
ReverseBytes16 1.41ns ± 1% 1.18ns ± 0% -16.47
Change-Id: I88f127f3ab9ba24a772becc21ad90acfba324b37
Reviewed-on: https://go-review.googlesource.com/c/go/+/446675
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
|
|
Manually consolidate the remaining ppc64/ppc64le test which
are not so trivial to automatically merge.
The remaining ppc64le tests are limited to cases where load/stores are
merged (this only happens on ppc64le) and the race detector (only
supported on ppc64le).
Change-Id: I1f9c0f3d3ddbb7fbbd8c81fbbd6537394fba63ce
Reviewed-on: https://go-review.googlesource.com/c/go/+/463217
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Run-TryBot: Paul Murphy <murp@ibm.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
|
|
Use a small python script to consolidate duplicate
ppc64/ppc64le tests into a single ppc64x codegen test.
This makes small assumption that anytime two tests with
for different arch/variant combos exists, those tests
can be combined into a single ppc64x test.
E.x:
// ppc64le: foo
// ppc64le/power9: foo
into
// ppc64x: foo
or
// ppc64: foo
// ppc64le: foo
into
// ppc64x: foo
import glob
import re
files = glob.glob("codegen/*.go")
for file in files:
with open(file) as f:
text = [l for l in f]
i = 0
while i < len(text):
first = re.match("\s*// ?ppc64(le)?(/power[89])?:(.*)", text[i])
if first:
j = i+1
while j < len(text):
second = re.match("\s*// ?ppc64(le)?(/power[89])?:(.*)", text[j])
if not second:
break
if (not first.group(2) or first.group(2) == second.group(2)) and first.group(3) == second.group(3):
text[i] = re.sub(" ?ppc64(le|x)?"," ppc64x",text[i])
text=text[:j] + (text[j+1:])
else:
j += 1
i+=1
with open(file, 'w') as f:
f.write("".join(text))
Change-Id: Ic6b009b54eacaadc5a23db9c5a3bf7331b595821
Reviewed-on: https://go-review.googlesource.com/c/go/+/463220
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
Reviewed-by: Bryan Mills <bcmills@google.com>
Run-TryBot: Paul Murphy <murp@ibm.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
|
|
This needs to be as low as possible while not breaking priority
assumptions of other scores to correctly schedule carry chains.
Prior to the arm64 changes, it was set below ReadTuple. At the time,
this prevented the MulHiLo implementation on PPC64 from occluding
the scheduling of a full carry chain.
Memory scores can also prevent better scheduling, as can be observed
with crypto/internal/edwards25519/field.feMulGeneric.
Fixes #56497
Change-Id: Ia4b54e6dffcce584faf46b1b8d7cea18a3913887
Reviewed-on: https://go-review.googlesource.com/c/go/+/447435
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Keith Randall <khr@google.com>
Run-TryBot: Paul Murphy <murp@ibm.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Bryan Mills <bcmills@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
|
|
This is a follow up of CL 420095 on loong64.
file before after Δ %
compile/internal/ssa.a 35649482 35653274 +3792 +0.011%
compile/internal/ssagen.a 4099858 4098728 -1130 -0.028%
ecdh.a 227896 226896 -1000 -0.439%
internal/nistec/fiat.a 1212254 1128184 -84070 -6.935%
tls.a 3256800 3256802 +2 +0.000%
big.a 1708518 1702496 -6022 -0.352%
bits.a 106762 105734 -1028 -0.963%
math.a 578762 577288 -1474 -0.255%
netip.a 555922 555610 -312 -0.056%
net.a 3286528 3286530 +2 +0.000%
golang.org/x/crypto/internal/poly1305.a 109546 107686 -1860 -1.698%
total 260392768 260299668 -93100 -0.036%
Change-Id: Ieffca705aae5666501f284502d986ca179dde494
Reviewed-on: https://go-review.googlesource.com/c/go/+/428557
Reviewed-by: Carlos Amedee <carlos@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
|
|
This is a follow up of CL 420094 on loong64.
Reduce go toolchain size slightly on linux/loong64.
compilecmp HEAD~1 -> HEAD
HEAD~1 (8a32354219): internal/trace: use strings.Builder
HEAD (1767784ac3): cmd/compile: intrinsify Add64 on loong64
platform: linux/loong64
file before after Δ %
addr2line 3882616 3882536 -80 -0.002%
api 5528866 5528450 -416 -0.008%
asm 5133780 5133796 +16 +0.000%
cgo 4668787 4668491 -296 -0.006%
compile 25163409 25164729 +1320 +0.005%
cover 4658055 4658007 -48 -0.001%
dist 3437783 3437727 -56 -0.002%
doc 3883069 3883205 +136 +0.004%
fix 3383254 3383070 -184 -0.005%
link 6747559 6747023 -536 -0.008%
nm 3793923 3793939 +16 +0.000%
objdump 4256628 4256812 +184 +0.004%
pack 2356328 2356144 -184 -0.008%
pprof 14233370 14131910 -101460 -0.713%
test2json 2638668 2638476 -192 -0.007%
trace 13392065 13360781 -31284 -0.234%
vet 7456388 7455588 -800 -0.011%
total 132498256 132364392 -133864 -0.101%
file before after Δ %
compile/internal/ssa.a 35644590 35649482 +4892 +0.014%
compile/internal/ssagen.a 4101250 4099858 -1392 -0.034%
internal/edwards25519/field.a 226064 201718 -24346 -10.770%
internal/nistec/fiat.a 1689922 1212254 -477668 -28.266%
tls.a 3256798 3256800 +2 +0.000%
big.a 1718552 1708518 -10034 -0.584%
bits.a 107786 106762 -1024 -0.950%
cmplx.a 169434 168214 -1220 -0.720%
math.a 581302 578762 -2540 -0.437%
netip.a 556096 555922 -174 -0.031%
net.a 3286526 3286528 +2 +0.000%
runtime.a 8644786 8644510 -276 -0.003%
strconv.a 519098 518374 -724 -0.139%
golang.org/x/crypto/internal/poly1305.a 115398 109546 -5852 -5.071%
total 260913122 260392768 -520354 -0.199%
Change-Id: I75b2bb7761fa5a0d0d032d4ebe3582d092ea77be
Reviewed-on: https://go-review.googlesource.com/c/go/+/428556
Reviewed-by: Carlos Amedee <carlos@golang.org>
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
Reviewed-by: David Chase <drchase@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
|
|
This CL optimizes RotateLeft8/16 on arm64.
For 16 bits, we form a 32 bits register by duplicating two 16 bits
registers, then use RORW instruction to do the rotate shift.
For 8 bits, we just use LSR and LSL instead of RORW because the code is
simpler.
Benchmark Old ThisCL delta
RotateLeft8-46 2.16 ns/op 1.73 ns/op -19.70%
RotateLeft16-46 2.16 ns/op 1.54 ns/op -28.53%
Change-Id: I09cde4383d12e31876a57f8cdfd3bb4f324fadb0
Reviewed-on: https://go-review.googlesource.com/c/go/+/420976
Reviewed-by: Keith Randall <khr@google.com>
Auto-Submit: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Heschi Kreinick <heschi@google.com>
Run-TryBot: Keith Randall <khr@golang.org>
|
|
After this CL, the performance difference in crypto/elliptic
benchmarks on linux/riscv64 are:
name old time/op new time/op delta
ScalarBaseMult/P256 1.64ms ± 1% 1.60ms ± 1% -2.36% (p=0.008 n=5+5)
ScalarBaseMult/P224 1.53ms ± 1% 1.47ms ± 2% -4.24% (p=0.008 n=5+5)
ScalarBaseMult/P384 5.12ms ± 2% 5.03ms ± 2% ~ (p=0.095 n=5+5)
ScalarBaseMult/P521 22.3ms ± 2% 13.8ms ± 1% -37.89% (p=0.008 n=5+5)
ScalarMult/P256 4.49ms ± 2% 4.26ms ± 2% -5.13% (p=0.008 n=5+5)
ScalarMult/P224 4.33ms ± 1% 4.09ms ± 1% -5.59% (p=0.008 n=5+5)
ScalarMult/P384 16.3ms ± 1% 15.5ms ± 2% -4.78% (p=0.008 n=5+5)
ScalarMult/P521 101ms ± 0% 47ms ± 2% -53.36% (p=0.008 n=5+5)
Change-Id: I31cf0506e27f9d85f576af1813630a19c20dda8a
Reviewed-on: https://go-review.googlesource.com/c/go/+/420095
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Joel Sing <joel@sing.id.au>
Reviewed-by: David Chase <drchase@google.com>
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
|
|
According to RISCV instruction set manual v2.2 Sec 2.4, we can
implement overflowing check for unsigned addition cheaply using
SLTU instructions.
After this CL, the performance difference in crypto/elliptic
benchmarks on linux/riscv64 are:
name old time/op new time/op delta
ScalarBaseMult/P256 1.93ms ± 1% 1.64ms ± 1% -14.96% (p=0.008 n=5+5)
ScalarBaseMult/P224 1.80ms ± 2% 1.53ms ± 1% -14.89% (p=0.008 n=5+5)
ScalarBaseMult/P384 6.15ms ± 2% 5.12ms ± 2% -16.73% (p=0.008 n=5+5)
ScalarBaseMult/P521 25.9ms ± 1% 22.3ms ± 2% -13.78% (p=0.008 n=5+5)
ScalarMult/P256 5.59ms ± 1% 4.49ms ± 2% -19.79% (p=0.008 n=5+5)
ScalarMult/P224 5.42ms ± 1% 4.33ms ± 1% -20.01% (p=0.008 n=5+5)
ScalarMult/P384 19.9ms ± 2% 16.3ms ± 1% -18.15% (p=0.008 n=5+5)
ScalarMult/P521 97.3ms ± 1% 100.7ms ± 0% +3.48% (p=0.008 n=5+5)
Change-Id: Ic4c82ced4b072a4a6575343fa9f29dd09b0cabc4
Reviewed-on: https://go-review.googlesource.com/c/go/+/420094
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
Reviewed-by: Joel Sing <joel@sing.id.au>
TryBot-Result: Gopher Robot <gobot@golang.org>
|
|
This is a follow up of CL 425101 on RISCV64.
According to RISCV Volume 1, Unprivileged Spec v. 20191213 Chapter 7.1:
If both the high and low bits of the same product are required, then the
recommended code sequence is: MULH[[S]U] rdh, rs1, rs2; MUL rdl, rs1, rs2
(source register specifiers must be in same order and rdh cannot be the
same as rs1 or rs2). Microarchitectures can then fuse these into a single
multiply operation instead of performing two separate multiplies.
So we should not split Muluhilo to separate instructions.
Updates #54607
Change-Id: If47461f3aaaf00e27cd583a9990e144fb8bcdb17
Reviewed-on: https://go-review.googlesource.com/c/go/+/425203
Auto-Submit: Keith Randall <khr@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
|
|
On ARM64 we use two separate instructions to compute the hi and lo
results of a 64x64->128 multiplication. Lower to two separate ops
so if only one result is needed we can deadcode the other.
Fixes #54607.
Change-Id: Ib023e77eb2b2b0bcf467b45471cb8a294bce6f90
Reviewed-on: https://go-review.googlesource.com/c/go/+/425101
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
|
|
Updated multiple tests in test/codegen: math.go, mathbits.go, shift.go
and slices.go to verify on ppc64/ppc64le as well
Change-Id: Id88dd41569b7097819fb4d451b615f69cf7f7a94
Reviewed-on: https://go-review.googlesource.com/c/go/+/412115
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Archana Ravindar <aravind5@in.ibm.com>
Reviewed-by: Than McIntosh <thanm@google.com>
Reviewed-by: Paul Murphy <murp@ibm.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
|
|
math/bits.Add64 and math/bits.Sub64 now lower and optimize
directly in SSA form.
The optimization of carry chains focuses around eliding
XER<->GPR transfers of the CA bit when used exclusively as an
input to a single carry operations, or when the CA value is
known.
This also adds support for handling XER spills in the assembler
which could happen if carry chains contain inter-dependencies
on each other (which seems very unlikely with practical usage),
or a clobber happens (SRAW/SRAD/SUBFC operations clobber CA).
With PPC64 Add64/Sub64 lowering into SSA and this patch, the net
performance difference in crypto/elliptic benchmarks on P9/ppc64le
are:
name old time/op new time/op delta
ScalarBaseMult/P256 46.3µs ± 0% 46.9µs ± 0% +1.34%
ScalarBaseMult/P224 356µs ± 0% 209µs ± 0% -41.14%
ScalarBaseMult/P384 1.20ms ± 0% 0.57ms ± 0% -52.14%
ScalarBaseMult/P521 3.38ms ± 0% 1.44ms ± 0% -57.27%
ScalarMult/P256 199µs ± 0% 199µs ± 0% -0.17%
ScalarMult/P224 357µs ± 0% 212µs ± 0% -40.56%
ScalarMult/P384 1.20ms ± 0% 0.58ms ± 0% -51.86%
ScalarMult/P521 3.37ms ± 0% 1.44ms ± 0% -57.32%
MarshalUnmarshal/P256/Uncompressed 2.59µs ± 0% 2.52µs ± 0% -2.63%
MarshalUnmarshal/P256/Compressed 2.58µs ± 0% 2.52µs ± 0% -2.06%
MarshalUnmarshal/P224/Uncompressed 1.54µs ± 0% 1.40µs ± 0% -9.42%
MarshalUnmarshal/P224/Compressed 1.54µs ± 0% 1.39µs ± 0% -9.87%
MarshalUnmarshal/P384/Uncompressed 2.40µs ± 0% 1.80µs ± 0% -24.93%
MarshalUnmarshal/P384/Compressed 2.35µs ± 0% 1.81µs ± 0% -23.03%
MarshalUnmarshal/P521/Uncompressed 3.79µs ± 0% 2.58µs ± 0% -31.81%
MarshalUnmarshal/P521/Compressed 3.80µs ± 0% 2.60µs ± 0% -31.67%
Note, P256 uses an asm implementation, thus, little variation is expected.
Change-Id: I88a24f6bf0f4f285c649e40243b1ab69cc452b71
Reviewed-on: https://go-review.googlesource.com/c/go/+/346870
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Run-TryBot: Paul Murphy <murp@ibm.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@google.com>
|
|
LZCNT is similar to BSR, but BSR(x) is undefined when x == 0, so using
LZCNT can avoid a special case for zero input. Except that case,
LZCNTQ(x) == 63-BSRQ(x) and LZCNTL(x) == 31-BSRL(x).
And according to https://www.agner.org/optimize/instruction_tables.pdf,
LZCNT instructions are much faster than BSR on AMD CPU.
name old time/op new time/op delta
LeadingZeros-8 0.91ns ± 1% 0.80ns ± 7% -11.68% (p=0.000 n=9+9)
LeadingZeros8-8 0.98ns ±15% 0.91ns ± 1% -7.34% (p=0.000 n=9+9)
LeadingZeros16-8 0.94ns ± 3% 0.92ns ± 2% -2.36% (p=0.001 n=10+10)
LeadingZeros32-8 0.89ns ± 1% 0.78ns ± 2% -12.49% (p=0.000 n=10+10)
LeadingZeros64-8 0.92ns ± 1% 0.78ns ± 1% -14.48% (p=0.000 n=10+10)
Change-Id: I125147fe3d6994a4cfe558432780408e9a27557a
Reviewed-on: https://go-review.googlesource.com/c/go/+/396794
Reviewed-by: Keith Randall <khr@golang.org>
Trust: Emmanuel Odeke <emmanuel@orijtech.com>
Run-TryBot: Emmanuel Odeke <emmanuel@orijtech.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
|
|
It was noticed through some other investigation that BitLen32
was not generating the best code and found that it wasn't recognized
as an intrinsic. This corrects that and enables the test for PPC64.
Change-Id: Iab496a8830c8552f507b7292649b1b660f3848b5
Reviewed-on: https://go-review.googlesource.com/c/go/+/355872
Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com>
TryBot-Result: Go Bot <gobot@golang.org>
Trust: Lynn Boger <laboger@linux.vnet.ibm.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
|
|
In case of amd64 the compiler issues checks if extensions are
available on a platform. With GOAMD64 microarchitecture levels
provided, some of the checks could be eliminated.
Change-Id: If15c178bcae273b2ce7d3673415cb8849292e087
Reviewed-on: https://go-review.googlesource.com/c/go/+/352010
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Go Bot <gobot@golang.org>
|
|
on my Intel CoffeeLake CPU:
name old time/op new time/op delta
TrailingZeros-8 0.68ns ± 1% 0.64ns ± 1% -6.26% (p=0.000 n=10+10)
TrailingZeros8-8 0.70ns ± 1% 0.70ns ± 1% ~ (p=0.697 n=10+10)
TrailingZeros16-8 0.70ns ± 1% 0.70ns ± 1% +0.57% (p=0.043 n=10+10)
TrailingZeros32-8 0.66ns ± 1% 0.64ns ± 1% -3.35% (p=0.000 n=10+10)
TrailingZeros64-8 0.68ns ± 1% 0.64ns ± 1% -5.84% (p=0.000 n=9+10)
Updates #45453
Change-Id: I228ff2d51df24b1306136f061432f8a12bb1d6fd
Reviewed-on: https://go-review.googlesource.com/c/go/+/353249
Trust: Michael Knyszek <mknyszek@google.com>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
|
|
According to RISCV instruction set manual v2.2 Sec 6.1
MULHU followed by MUL will be fused into one multiply by microarchitecture
Benchstat on Hifive unmatched:
name old time/op new time/op delta
Hash8Bytes 245ns ± 3% 186ns ± 4% -23.99% (p=0.000 n=10+10)
Hash320Bytes 1.94µs ± 1% 1.31µs ± 1% -32.38% (p=0.000 n=9+10)
Hash1K 5.84µs ± 0% 3.84µs ± 0% -34.20% (p=0.000 n=10+9)
Hash8K 45.3µs ± 0% 29.4µs ± 0% -35.04% (p=0.000 n=10+10)
name old speed new speed delta
Hash8Bytes 32.7MB/s ± 3% 43.0MB/s ± 4% +31.61% (p=0.000 n=10+10)
Hash320Bytes 165MB/s ± 1% 244MB/s ± 1% +47.88% (p=0.000 n=9+10)
Hash1K 175MB/s ± 0% 266MB/s ± 0% +51.98% (p=0.000 n=10+9)
Hash8K 181MB/s ± 0% 279MB/s ± 0% +53.94% (p=0.000 n=10+10)
Change-Id: I3561495d02a4a0ad8578e9b9819bf0a4eaca5d12
Reviewed-on: https://go-review.googlesource.com/c/go/+/329970
Reviewed-by: Joel Sing <joel@sing.id.au>
Run-TryBot: Joel Sing <joel@sing.id.au>
TryBot-Result: Go Bot <gobot@golang.org>
Trust: Meng Zhuo <mzh@golangcn.org>
|
|
Fixes this failure:
go test cmd/compile/internal/ssa -run TestStmtLines -v
=== RUN TestStmtLines
stmtlines_test.go:115: Saw too many (amd64, > 1%) lines without
statement marks, total=88263, nostmt=1930
('-run TestStmtLines -v' lists failing lines)
The failure has two causes.
One is that the first-line adjuster in code generation was relocating
"first lines" to instructions that would either not have any code generated,
or would have the statment marker removed by a different believed-good heuristic.
The other was that statement boundaries were getting attached to register
values (that with the old ABI were loads from the stack, hence real instructions).
The register values disappear at code generation.
The fixes are to (1) note that certain instructions are not good choices for
"first value" and skip them, and (2) in an expandCalls post-pass, look for
register valued instructions and under appropriate conditions move their
statement marker to a compatible use.
Also updates TestStmtLines to always log the score, for easier comparison of
minor compiler changes.
Updates #40724.
Change-Id: I485573ce900e292d7c44574adb7629cdb4695c3f
Reviewed-on: https://go-review.googlesource.com/c/go/+/309649
Trust: David Chase <drchase@google.com>
Run-TryBot: David Chase <drchase@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
|
|
Optimize combinations of left and right shifts by a constant value
into a 'rotate then insert selected bits [into zero]' instruction.
Use the same instruction for contiguous masks since it has some
benefits over 'and immediate' (not restricted to 32-bits, does not
overwrite source register).
To keep the complexity of this change under control I've only
implemented 64 bit operations for now.
There are a lot more optimizations that can be done with this
instruction family. However, since their function overlaps with other
instructions we need to be somewhat careful not to break existing
optimization rules by creating optimization dead ends. This is
particularly true of the load/store merging rules which contain lots
of zero extensions and shifts.
This CL does interfere with the store merging rules when an operand
is shifted left before it is stored:
binary.BigEndian.PutUint64(b, x << 1)
This is unfortunate but it's not critical and somewhat complex so
I plan to fix that in a follow up CL.
file before after Δ %
addr2line 4117446 4117282 -164 -0.004%
api 4945184 4942752 -2432 -0.049%
asm 4998079 4991891 -6188 -0.124%
buildid 2685158 2684074 -1084 -0.040%
cgo 4553732 4553394 -338 -0.007%
compile 19294446 19245070 -49376 -0.256%
cover 4897105 4891319 -5786 -0.118%
dist 3544389 3542785 -1604 -0.045%
doc 3926795 3927617 +822 +0.021%
fix 3302958 3293868 -9090 -0.275%
link 6546274 6543456 -2818 -0.043%
nm 4102021 4100825 -1196 -0.029%
objdump 4542431 4548483 +6052 +0.133%
pack 2482465 2416389 -66076 -2.662%
pprof 13366541 13363915 -2626 -0.020%
test2json 2829007 2761515 -67492 -2.386%
trace 10216164 10219684 +3520 +0.034%
vet 6773956 6773572 -384 -0.006%
total 107124151 106917891 -206260 -0.193%
Change-Id: I7591cce41e06867ba10a745daae9333513062746
Reviewed-on: https://go-review.googlesource.com/c/go/+/233317
Run-TryBot: Michael Munday <mike.munday@ibm.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Trust: Michael Munday <mike.munday@ibm.com>
|
|
This merges an lis + subf into subfic, and for 32b constants
lwa + subf into oris + ori + subf.
The carry bit is no longer used in code generation, therefore
I think we can clobber it as needed. Note, lowered borrow/carry
arithmetic is self-contained and thus is not affected.
A few extra rules are added to ensure early transformations to
SUBFCconst don't trip up earlier rules, fold constant operations,
or otherwise simplify lowering. Likewise, tests are added to
ensure all rules are hit. Generic constant folding catches
trivial cases, however some lowering rules insert arithmetic
which can introduce new opportunities (e.g BitLen or Slicemask).
I couldn't find a specific benchmark to demonstrate noteworthy
improvements, but this is generating subfic in many of the default
bent test binaries, so we are at least saving a little code space.
Change-Id: Iad7c6e5767eaa9dc24dc1c989bd1c8cfe1982012
Reviewed-on: https://go-review.googlesource.com/c/go/+/249461
Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Carlos Eduardo Seo <cseo@linux.vnet.ibm.com>
|
|
This CL optimizes code that uses a carry from a function such as
bits.Add64 as the condition in an if statement. For example:
x, c := bits.Add64(a, b, 0)
if c != 0 {
panic("overflow")
}
Rather than converting the carry into a 0 or a 1 value and using
that as an input to a comparison instruction the carry flag is now
used as the input to a conditional branch directly. This typically
removes an ADD LOGICAL WITH CARRY instruction when user code is
doing overflow detection and is closer to the code that a user
would expect to generate.
Change-Id: I950431270955ab72f1b5c6db873b6abe769be0da
Reviewed-on: https://go-review.googlesource.com/c/go/+/219757
Run-TryBot: Michael Munday <mike.munday@ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
|
|
Before using some CPU instructions, we must check for their presence.
We use global variables in the runtime package to record features.
Prior to this CL, we issued a regular memory load for these features.
The downside to this is that, because it is a regular memory load,
it cannot be hoisted out of loops or otherwise reordered with other loads.
This CL introduces a new intrinsic just for checking cpu features.
It still ends up resulting in a memory load, but that memory load can
now be floated to the entry block and rematerialized as needed.
One downside is that the regular load could be combined with the comparison
into a CMPBconstload+NE. This new intrinsic cannot; it generates MOVB+TESTB+NE.
(It is possible that MOVBQZX+TESTQ+NE would be better.)
This CL does only amd64. It is easy to extend to other architectures.
For the benchmark in #36196, on my machine, this offers a mild speedup.
name old time/op new time/op delta
FMA-8 1.39ns ± 6% 1.29ns ± 9% -7.19% (p=0.000 n=97+96)
NonFMA-8 2.03ns ±11% 2.04ns ±12% ~ (p=0.618 n=99+98)
Updates #15808
Updates #36196
Change-Id: I75e2fcfcf5a6df1bdb80657a7143bed69fca6deb
Reviewed-on: https://go-review.googlesource.com/c/go/+/212360
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Giovanni Bajo <rasky@develer.com>
|
|
Benchmark:
name old time/op new time/op delta
Mul 36.0ns ± 1% 2.8ns ± 0% -92.31% (p=0.000 n=10+10)
Mul32 4.37ns ± 0% 4.37ns ± 0% ~ (p=0.429 n=6+10)
Mul64 36.4ns ± 0% 2.8ns ± 0% -92.37% (p=0.000 n=10+9)
Change-Id: Ic4f4e5958adbf24999abcee721d0180b5413fca7
Reviewed-on: https://go-review.googlesource.com/c/go/+/200582
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
|
|
This change adds an intrinsic for Mul64 on s390x. To achieve that,
a new assembly instruction, MLGR, is introduced in s390x/asmz.go. This assembly
instruction directly uses an existing instruction on Z and supports multiplication
of two 64 bit unsigned integer and stores the result in two separate registers.
In this case, we require the multiplcand to be stored in register R3 and
the output result (the high and low 64 bit of the product) to be stored in
R2 and R3 respectively.
A test case is also added.
Benchmark:
name old time/op new time/op delta
Mul-18 11.1ns ± 0% 1.4ns ± 0% -87.39% (p=0.002 n=8+10)
Mul32-18 2.07ns ± 0% 2.07ns ± 0% ~ (all equal)
Mul64-18 11.1ns ± 1% 1.4ns ± 0% -87.42% (p=0.000 n=10+10)
Change-Id: Ieca6ad1f61fff9a48a31d50bbd3f3c6d9e6675c1
Reviewed-on: https://go-review.googlesource.com/c/go/+/194572
Reviewed-by: Michael Munday <mike.munday@ibm.com>
Run-TryBot: Michael Munday <mike.munday@ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
|
|
wasm has 32-bit versions of all integer operations. This change
lowers RotateLeft32 to i32.rotl on wasm and intrinsifies the math/bits
call. Benchmarking on amd64 under node.js this is ~25% faster.
node v10.15.3/amd64
name old time/op new time/op delta
RotateLeft 8.37ns ± 1% 8.28ns ± 0% -1.05% (p=0.029 n=4+4)
RotateLeft8 11.9ns ± 1% 11.8ns ± 0% ~ (p=0.167 n=5+5)
RotateLeft16 11.8ns ± 0% 11.8ns ± 0% ~ (all equal)
RotateLeft32 11.9ns ± 1% 8.7ns ± 0% -26.32% (p=0.008 n=5+5)
RotateLeft64 8.31ns ± 1% 8.43ns ± 2% ~ (p=0.063 n=5+5)
Updates #31265
Change-Id: I5b8e155978faeea536c4f6427ac9564d2f096a46
Reviewed-on: https://go-review.googlesource.com/c/go/+/182359
Run-TryBot: Brian Kessler <brian.m.kessler@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Richard Musiol <neelance@gmail.com>
|
|
This CL reverts CL 192097 and fixes the issue in CL 189277.
Change-Id: Icd271262e1f5019a8e01c91f91c12c1261eeb02b
Reviewed-on: https://go-review.googlesource.com/c/go/+/192519
Run-TryBot: Ben Shi <powerman1st@163.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
|
|
The syntax of a shifted operation does not have a "$" sign for
the shift amount. Remove it.
Change-Id: I50782fe942b640076f48c2fafea4d3175be8ff99
Reviewed-on: https://go-review.googlesource.com/c/go/+/192100
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
|
|
This CL optimizes math.bits.RotateLeft32 to inline
"MOVW Rx@>Ry, Rd" on ARM.
The benchmark results of math/bits show some improvements.
name old time/op new time/op delta
RotateLeft-4 9.42ns ± 0% 6.91ns ± 0% -26.66% (p=0.000 n=40+33)
RotateLeft8-4 8.79ns ± 0% 8.79ns ± 0% -0.04% (p=0.000 n=40+31)
RotateLeft16-4 8.79ns ± 0% 8.79ns ± 0% -0.04% (p=0.000 n=40+32)
RotateLeft32-4 8.16ns ± 0% 7.54ns ± 0% -7.68% (p=0.000 n=40+40)
RotateLeft64-4 15.7ns ± 0% 15.7ns ± 0% ~ (all equal)
updates #31265
Change-Id: I77bc1c2c702d5323fc7cad5264a8e2d5666bf712
Reviewed-on: https://go-review.googlesource.com/c/go/+/188697
Run-TryBot: Ben Shi <powerman1st@163.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
|
|
This reverts CL 189277.
Reason for revert: broke 32-bit builders.
Updates #33902
Change-Id: Ie5f180d0371a90e5057ed578c334372e5fc3a286
Reviewed-on: https://go-review.googlesource.com/c/go/+/192097
Run-TryBot: Bryan C. Mills <bcmills@google.com>
Reviewed-by: Daniel Martí <mvdan@mvdan.cc>
|
|
This CL optimizes math.bits.TrailingZeros16 on 386 with
a pair of BSFL and ORL instrcutions.
The case TrailingZeros16-4 of the benchmark test in
math/bits shows big improvement.
name old time/op new time/op delta
TrailingZeros16-4 1.55ns ± 1% 0.87ns ± 1% -43.87% (p=0.000 n=50+49)
Change-Id: Ia899975b0e46f45dcd20223b713ed632bc32740b
Reviewed-on: https://go-review.googlesource.com/c/go/+/189277
Run-TryBot: Ben Shi <powerman1st@163.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
|
|
This CL adds intrinsics for the 64-bit addition and subtraction
functions in math/bits. These intrinsics use the condition code
to propagate the carry or borrow bit.
To make the carry chains more efficient I've removed the
'clobberFlags' property from most of the load and store
operations. Originally these ops did clobber flags when using
offsets that didn't fit in a signed 20-bit integer, however
that is no longer true.
As with other platforms the intrinsics are faster when executed
in a chain rather than a loop because currently we need to spill
and restore the carry bit between each loop iteration. We may
be able to reduce the need to do this on s390x (e.g. by using
compare-and-branch instructions that do not clobber flags) in the
future.
name old time/op new time/op delta
Add64 1.21ns ± 2% 2.03ns ± 2% +67.18% (p=0.000 n=7+10)
Add64multiple 2.98ns ± 3% 1.03ns ± 0% -65.39% (p=0.000 n=10+9)
Sub64 1.23ns ± 4% 2.03ns ± 1% +64.85% (p=0.000 n=10+10)
Sub64multiple 3.73ns ± 4% 1.04ns ± 1% -72.28% (p=0.000 n=10+8)
Change-Id: I913bbd5e19e6b95bef52f5bc4f14d6fe40119083
Reviewed-on: https://go-review.googlesource.com/c/go/+/174303
Run-TryBot: Michael Munday <mike.munday@ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
|
|
This change creates an intrinsic for Add64 for ppc64x and adds a
testcase for it.
name old time/op new time/op delta
Add64-160 1.90ns ±40% 2.29ns ± 0% ~ (p=0.119 n=5+5)
Add64multiple-160 6.69ns ± 2% 2.45ns ± 4% -63.47% (p=0.016 n=4+5)
Change-Id: I9abe6fb023fdf62eea3c9b46a1820f60bb0a7f97
Reviewed-on: https://go-review.googlesource.com/c/go/+/173758
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
Run-TryBot: Carlos Eduardo Seo <cseo@linux.vnet.ibm.com>
|
|
This CL instrinsifies Sub64 with arm64 instruction sequence NEGS, SBCS,
NGC and NEG, and optimzes the case of borrowing chains.
Benchmarks:
name old time/op new time/op delta
Sub-64 2.500000ns +- 0% 2.048000ns +- 1% -18.08% (p=0.000 n=10+10)
Sub32-64 2.500000ns +- 0% 2.500000ns +- 0% ~ (all equal)
Sub64-64 2.500000ns +- 0% 2.080000ns +- 0% -16.80% (p=0.000 n=10+7)
Sub64multiple-64 7.090000ns +- 0% 2.090000ns +- 0% -70.52% (p=0.000 n=10+10)
Change-Id: I3d2664e009a9635e13b55d2c4567c7b34c2c0655
Reviewed-on: https://go-review.googlesource.com/c/go/+/159018
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
|
|
With this change, these two functions generate identical code:
func f(x uint64) (uint64, uint64) {
return bits.Div64(0, x, 5)
}
func g(x uint64) (uint64, uint64) {
return x / 5, x % 5
}
Updates #31582
Change-Id: Ia96c2e67f8af5dd985823afee5f155608c04a4b6
Reviewed-on: https://go-review.googlesource.com/c/go/+/173197
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
|
|
This CL deals with the additional comments of CL 159017.
Change-Id: I4ad3c60c834646d58dc0c544c741b92bfe83fb8b
Reviewed-on: https://go-review.googlesource.com/c/go/+/168857
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
|
|
zeros instructions on POWER9
This change adds new POWER9 instructions for counting trailing zeros (CNTTZW/CNTTZD)
to the assembler and generates them in SSA when GOPPC64=power9.
name old time/op new time/op delta
TrailingZeros-160 1.59ns ±20% 1.45ns ±10% -8.81% (p=0.000 n=14+13)
TrailingZeros8-160 1.55ns ±23% 1.62ns ±44% ~ (p=0.593 n=13+15)
TrailingZeros16-160 1.78ns ±23% 1.62ns ±38% -9.31% (p=0.003 n=14+14)
TrailingZeros32-160 1.64ns ±10% 1.49ns ± 9% -9.15% (p=0.000 n=13+14)
TrailingZeros64-160 1.53ns ± 6% 1.45ns ± 5% -5.38% (p=0.000 n=15+13)
Change-Id: I365e6ff79f3ce4d8ebe089a6a86b1771853eb596
Reviewed-on: https://go-review.googlesource.com/c/go/+/167517
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
|
|
This CL instrinsifies Add64 with arm64 instruction sequence ADDS, ADCS
and ADC, and optimzes the case of carry chains.The CL also changes the
test code so that the intrinsic implementation can be tested.
Benchmarks:
name old time/op new time/op delta
Add-224 2.500000ns +- 0% 2.090000ns +- 4% -16.40% (p=0.000 n=9+10)
Add32-224 2.500000ns +- 0% 2.500000ns +- 0% ~ (all equal)
Add64-224 2.500000ns +- 0% 1.577778ns +- 2% -36.89% (p=0.000 n=10+9)
Add64multiple-224 6.000000ns +- 0% 2.000000ns +- 0% -66.67% (p=0.000 n=10+10)
Change-Id: I6ee91c9a85c16cc72ade5fd94868c579f16c7615
Reviewed-on: https://go-review.googlesource.com/c/go/+/159017
Run-TryBot: Ben Shi <powerman1st@163.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
|
|
for arm
This follows CL 156999 which did the same for arm64.
name old time/op new time/op delta
TrailingZeros-4 7.30ns ± 1% 7.30ns ± 0% ~ (p=0.413 n=9+9)
TrailingZeros8-4 8.32ns ± 0% 7.17ns ± 0% -13.77% (p=0.000 n=10+9)
TrailingZeros16-4 8.30ns ± 0% 7.18ns ± 0% -13.50% (p=0.000 n=9+10)
TrailingZeros32-4 6.46ns ± 1% 6.47ns ± 1% ~ (p=0.325 n=10+10)
TrailingZeros64-4 16.3ns ± 0% 16.2ns ± 0% -0.61% (p=0.000 n=7+10)
Change-Id: I7e9e1abf7e30d811aa474d272b2824ec7cbbaa98
Reviewed-on: https://go-review.googlesource.com/c/go/+/167797
Run-TryBot: Tobias Klauser <tobias.klauser@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
|
|
This commit adds compiler intrinsics for the packages math and
math/bits on the wasm architecture for better performance.
benchmark old ns/op new ns/op delta
BenchmarkCeil 8.31 3.21 -61.37%
BenchmarkCopysign 5.24 3.88 -25.95%
BenchmarkAbs 5.42 3.34 -38.38%
BenchmarkFloor 8.29 3.18 -61.64%
BenchmarkRoundToEven 9.76 3.26 -66.60%
BenchmarkSqrtLatency 8.13 4.88 -39.98%
BenchmarkSqrtPrime 5246 3535 -32.62%
BenchmarkTrunc 8.29 3.15 -62.00%
BenchmarkLeadingZeros 13.0 4.23 -67.46%
BenchmarkLeadingZeros8 4.65 4.42 -4.95%
BenchmarkLeadingZeros16 7.60 4.38 -42.37%
BenchmarkLeadingZeros32 10.7 4.48 -58.13%
BenchmarkLeadingZeros64 12.9 4.31 -66.59%
BenchmarkTrailingZeros 6.52 4.04 -38.04%
BenchmarkTrailingZeros8 4.57 4.14 -9.41%
BenchmarkTrailingZeros16 6.69 4.16 -37.82%
BenchmarkTrailingZeros32 6.97 4.23 -39.31%
BenchmarkTrailingZeros64 6.59 4.00 -39.30%
BenchmarkOnesCount 7.93 3.30 -58.39%
BenchmarkOnesCount8 3.56 3.19 -10.39%
BenchmarkOnesCount16 4.85 3.19 -34.23%
BenchmarkOnesCount32 7.27 3.19 -56.12%
BenchmarkOnesCount64 8.08 3.28 -59.41%
BenchmarkRotateLeft 4.88 3.80 -22.13%
BenchmarkRotateLeft64 5.03 3.63 -27.83%
Change-Id: Ic1e0c2984878be8defb6eb7eb6ee63765c793222
Reviewed-on: https://go-review.googlesource.com/c/go/+/165177
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
|
|
for arm64
This CL eliminates unnecessary type conversion operations: OpZeroExt16to64 and OpZeroExt8to64.
If the input argrument is a nonzero value, then ORconst operation can also be eliminated.
Benchmarks:
name old time/op new time/op delta
TrailingZeros-8 2.75ns ± 0% 2.75ns ± 0% ~ (all equal)
TrailingZeros8-8 3.49ns ± 1% 2.93ns ± 0% -16.00% (p=0.000 n=10+10)
TrailingZeros16-8 3.49ns ± 1% 2.93ns ± 0% -16.05% (p=0.000 n=9+10)
TrailingZeros32-8 2.67ns ± 1% 2.68ns ± 1% ~ (p=0.468 n=10+10)
TrailingZeros64-8 2.67ns ± 1% 2.65ns ± 0% -0.62% (p=0.022 n=10+9)
code:
func f16(x uint) { z = bits.TrailingZeros16(uint16(x)) }
Before:
"".f16 STEXT size=48 args=0x8 locals=0x0 leaf
0x0000 00000 (test.go:7) TEXT "".f16(SB), LEAF|NOFRAME|ABIInternal, $0-8
0x0000 00000 (test.go:7) FUNCDATA ZR, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
0x0000 00000 (test.go:7) FUNCDATA $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
0x0000 00000 (test.go:7) FUNCDATA $3, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
0x0000 00000 (test.go:7) PCDATA $2, ZR
0x0000 00000 (test.go:7) PCDATA ZR, ZR
0x0000 00000 (test.go:7) MOVD "".x(FP), R0
0x0004 00004 (test.go:7) MOVHU R0, R0
0x0008 00008 (test.go:7) ORR $65536, R0, R0
0x000c 00012 (test.go:7) RBIT R0, R0
0x0010 00016 (test.go:7) CLZ R0, R0
0x0014 00020 (test.go:7) MOVD R0, "".z(SB)
0x0020 00032 (test.go:7) RET (R30)
This line of code is unnecessary:
0x0004 00004 (test.go:7) MOVHU R0, R0
After:
"".f16 STEXT size=32 args=0x8 locals=0x0 leaf
0x0000 00000 (test.go:7) TEXT "".f16(SB), LEAF|NOFRAME|ABIInternal, $0-8
0x0000 00000 (test.go:7) FUNCDATA ZR, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
0x0000 00000 (test.go:7) FUNCDATA $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
0x0000 00000 (test.go:7) FUNCDATA $3, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
0x0000 00000 (test.go:7) PCDATA $2, ZR
0x0000 00000 (test.go:7) PCDATA ZR, ZR
0x0000 00000 (test.go:7) MOVD "".x(FP), R0
0x0004 00004 (test.go:7) ORR $65536, R0, R0
0x0008 00008 (test.go:7) RBITW R0, R0
0x000c 00012 (test.go:7) CLZW R0, R0
0x0010 00016 (test.go:7) MOVD R0, "".z(SB)
0x001c 00028 (test.go:7) RET (R30)
The situation of TrailingZeros8 is similar to TrailingZeros16.
Change-Id: I473bdca06be8460a0be87abbae6fe640017e4c9d
Reviewed-on: https://go-review.googlesource.com/c/go/+/156999
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
|
|
This CL adds two rules to turn patterns like ((x<<8) | (x>>8)) (the type of
x is uint16, "|" can also be "+" or "^") to a REV16 instruction on arm v6+.
This optimization rule can be used for math/bits.ReverseBytes16.
Benchmarks on arm v6:
name old time/op new time/op delta
ReverseBytes-32 2.86ns ± 0% 2.86ns ± 0% ~ (all equal)
ReverseBytes16-32 2.86ns ± 0% 2.86ns ± 0% ~ (all equal)
ReverseBytes32-32 1.29ns ± 0% 1.29ns ± 0% ~ (all equal)
ReverseBytes64-32 1.43ns ± 0% 1.43ns ± 0% ~ (all equal)
Change-Id: I819e633c9a9d308f8e476fb0c82d73fb73dd019f
Reviewed-on: https://go-review.googlesource.com/c/go/+/159019
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
|
|
Benchmark:
name old time/op new time/op delta
Div-8 22.0ns ± 0% 22.0ns ± 0% ~ (all equal)
Div32-8 6.51ns ± 0% 3.00ns ± 0% -53.90% (p=0.000 n=10+8)
Div64-8 22.5ns ± 0% 22.5ns ± 0% ~ (all equal)
Code:
func div32(hi, lo, y uint32) (q, r uint32) {return bits.Div32(hi, lo, y)}
Before:
0x0020 00032 (test.go:24) MOVWU "".y+8(FP), R0
0x0024 00036 ($GOROOT/src/math/bits/bits.go:472) CBZW R0, 132
0x0028 00040 ($GOROOT/src/math/bits/bits.go:472) MOVWU "".hi(FP), R1
0x002c 00044 ($GOROOT/src/math/bits/bits.go:472) CMPW R1, R0
0x0030 00048 ($GOROOT/src/math/bits/bits.go:472) BLS 96
0x0034 00052 ($GOROOT/src/math/bits/bits.go:475) MOVWU "".lo+4(FP), R2
0x0038 00056 ($GOROOT/src/math/bits/bits.go:475) ORR R1<<32, R2, R1
0x003c 00060 ($GOROOT/src/math/bits/bits.go:476) CBZ R0, 140
0x0040 00064 ($GOROOT/src/math/bits/bits.go:476) UDIV R0, R1, R2
0x0044 00068 (test.go:24) MOVW R2, "".q+16(FP)
0x0048 00072 ($GOROOT/src/math/bits/bits.go:476) UREM R0, R1, R0
0x0050 00080 (test.go:24) MOVW R0, "".r+20(FP)
0x0054 00084 (test.go:24) MOVD -8(RSP), R29
0x0058 00088 (test.go:24) MOVD.P 32(RSP), R30
0x005c 00092 (test.go:24) RET (R30)
After:
0x001c 00028 (test.go:24) MOVWU "".y+8(FP), R0
0x0020 00032 (test.go:24) CBZW R0, 92
0x0024 00036 (test.go:24) MOVWU "".hi(FP), R1
0x0028 00040 (test.go:24) CMPW R0, R1
0x002c 00044 (test.go:24) BHS 84
0x0030 00048 (test.go:24) MOVWU "".lo+4(FP), R2
0x0034 00052 (test.go:24) ORR R1<<32, R2, R4
0x0038 00056 (test.go:24) UDIV R0, R4, R3
0x003c 00060 (test.go:24) MSUB R3, R4, R0, R4
0x0040 00064 (test.go:24) MOVW R3, "".q+16(FP)
0x0044 00068 (test.go:24) MOVW R4, "".r+20(FP)
0x0048 00072 (test.go:24) MOVD -8(RSP), R29
0x004c 00076 (test.go:24) MOVD.P 16(RSP), R30
0x0050 00080 (test.go:24) RET (R30)
UREM instruction in the previous assembly code will be converted to UDIV and MSUB instructions
on arm64. However the UDIV instruction in UREM is unnecessary, because it's a duplicate of the
previous UDIV. This CL adds a rule to have this extra UDIV instruction removed by CSE.
Change-Id: Ie2508784320020b2de022806d09f75a7871bb3d7
Reviewed-on: https://go-review.googlesource.com/c/159577
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Bryan C. Mills <bcmills@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
|
|
On amd64 ReverseBytes16 is lowered to a rotate instruction. However arm64 doesn't
have 16-bit rotate instruction, but has a REV16W instruction which can be used
for ReverseBytes16. This CL adds a rule to turn the patterns like (x<<8) | (x>>8)
(the type of x is uint16, and "|" can also be "^" or "+") to a REV16W instruction.
Code:
func reverseBytes16(i uint16) uint16 { return bits.ReverseBytes16(i) }
Before:
0x0004 00004 (test.go:6) MOVHU "".i(FP), R0
0x0008 00008 ($GOROOT/src/math/bits/bits.go:262) UBFX $8, R0, $8, R1
0x000c 00012 ($GOROOT/src/math/bits/bits.go:262) ORR R0<<8, R1, R0
0x0010 00016 (test.go:6) MOVH R0, "".~r1+8(FP)
0x0014 00020 (test.go:6) RET (R30)
After:
0x0000 00000 (test.go:6) MOVHU "".i(FP), R0
0x0004 00004 (test.go:6) REV16W R0, R0
0x0008 00008 (test.go:6) MOVH R0, "".~r1+8(FP)
0x000c 00012 (test.go:6) RET (R30)
Benchmarks:
name old time/op new time/op delta
ReverseBytes-224 1.000000ns +- 0% 1.000000ns +- 0% ~ (all equal)
ReverseBytes16-224 1.500000ns +- 0% 1.000000ns +- 0% -33.33% (p=0.000 n=9+10)
ReverseBytes32-224 1.000000ns +- 0% 1.000000ns +- 0% ~ (all equal)
ReverseBytes64-224 1.000000ns +- 0% 1.000000ns +- 0% ~ (all equal)
Change-Id: I87cd41b2d8e549bf39c601f185d5775bd42d739c
Reviewed-on: https://go-review.googlesource.com/c/157757
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
|
|
Arm64 has a 32-bit CLZ instruction CLZW, which can be used for intrinsic Len32.
Function LeadingZeros32 calls Len32, with this change, the assembly code of
LeadingZeros32 becomes more concise.
Go code:
func f32(x uint32) { z = bits.LeadingZeros32(x) }
Before:
"".f32 STEXT size=32 args=0x8 locals=0x0 leaf
0x0000 00000 (test.go:7) TEXT "".f32(SB), LEAF|NOFRAME|ABIInternal, $0-8
0x0004 00004 (test.go:7) MOVWU "".x(FP), R0
0x0008 00008 ($GOROOT/src/math/bits/bits.go:30) CLZ R0, R0
0x000c 00012 ($GOROOT/src/math/bits/bits.go:30) SUB $32, R0, R0
0x0010 00016 (test.go:7) MOVD R0, "".z(SB)
0x001c 00028 (test.go:7) RET (R30)
After:
"".f32 STEXT size=32 args=0x8 locals=0x0 leaf
0x0000 00000 (test.go:7) TEXT "".f32(SB), LEAF|NOFRAME|ABIInternal, $0-8
0x0004 00004 (test.go:7) MOVWU "".x(FP), R0
0x0008 00008 ($GOROOT/src/math/bits/bits.go:30) CLZW R0, R0
0x000c 00012 (test.go:7) MOVD R0, "".z(SB)
0x0018 00024 (test.go:7) RET (R30)
Benchmarks:
name old time/op new time/op delta
LeadingZeros-8 2.53ns ± 0% 2.55ns ± 0% +0.67% (p=0.000 n=10+10)
LeadingZeros8-8 3.56ns ± 0% 3.56ns ± 0% ~ (all equal)
LeadingZeros16-8 3.55ns ± 0% 3.56ns ± 0% ~ (p=0.465 n=10+10)
LeadingZeros32-8 3.55ns ± 0% 2.96ns ± 0% -16.71% (p=0.000 n=10+7)
LeadingZeros64-8 2.53ns ± 0% 2.54ns ± 0% ~ (p=0.059 n=8+10)
Change-Id: Ie5666bb82909e341060e02ffd4e86c0e5d67e90a
Reviewed-on: https://go-review.googlesource.com/c/157000
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
|
|
Note that the intrinsic implementation panics separately for overflow and
divide by zero, which matches the behavior of the pure go implementation.
There is a modest performance improvement after intrinsic implementation.
name old time/op new time/op delta
Div-4 53.0ns ± 1% 47.0ns ± 0% -11.28% (p=0.008 n=5+5)
Div32-4 18.4ns ± 0% 18.5ns ± 1% ~ (p=0.444 n=5+5)
Div64-4 53.3ns ± 0% 47.5ns ± 4% -10.77% (p=0.008 n=5+5)
Updates #28273
Change-Id: Ic1688ecc0964acace2e91bf44ef16f5fb6b6bc82
Reviewed-on: https://go-review.googlesource.com/c/144378
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
|
|
The current support_XXX variables are specific for the
amd64 and 386 platforms.
Prefix processor capability variables by architecture to have a
consistent naming scheme and avoid reuse of the existing
variables for new platforms.
This also aligns naming of runtime variables closer with internal/cpu
processor capability variable names.
Change-Id: I3eabb29a03874678851376185d3a62e73c1aff1d
Reviewed-on: https://go-review.googlesource.com/c/91435
Run-TryBot: Martin Möhrmann <martisch@uos.de>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
|
|
name old time/op new time/op delta
Sub-8 1.12ns ± 1% 1.17ns ± 1% +5.20% (p=0.008 n=5+5)
Sub32-8 1.11ns ± 0% 1.11ns ± 0% ~ (all samples are equal)
Sub64-8 1.12ns ± 0% 1.18ns ± 1% +5.00% (p=0.016 n=4+5)
Sub64multiple-8 4.10ns ± 1% 0.86ns ± 1% -78.93% (p=0.008 n=5+5)
Fixes #28273
Change-Id: Ibcb6f2fd32d987c3bcbae4f4cd9d335a3de98548
Reviewed-on: https://go-review.googlesource.com/c/144258
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
|
|
name old time/op new time/op delta
Add-8 1.11ns ± 0% 1.18ns ± 0% +6.31% (p=0.029 n=4+4)
Add32-8 1.02ns ± 0% 1.02ns ± 1% ~ (p=0.333 n=4+5)
Add64-8 1.11ns ± 1% 1.17ns ± 0% +5.79% (p=0.008 n=5+5)
Add64multiple-8 4.35ns ± 1% 0.86ns ± 0% -80.22% (p=0.000 n=5+4)
The individual ops are a bit slower (but still very fast).
Using the ops in carry chains is very fast.
Update #28273
Change-Id: Id975f76df2b930abf0e412911d327b6c5b1befe5
Reviewed-on: https://go-review.googlesource.com/c/144257
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
|