path: root/test/codegen/mathbits.go
Age | Commit message | Author
2022-11-03  cmd/compile/internal/ssa: re-adjust CarryChainTail scheduling priority  (Paul E. Murphy)

This needs to be as low as possible while not breaking priority assumptions of other scores to correctly schedule carry chains.

Prior to the arm64 changes, it was set below ReadTuple. At the time, this prevented the MulHiLo implementation on PPC64 from occluding the scheduling of a full carry chain.

Memory scores can also prevent better scheduling, as can be observed with crypto/internal/edwards25519/field.feMulGeneric.

Fixes #56497

Change-Id: Ia4b54e6dffcce584faf46b1b8d7cea18a3913887
Reviewed-on: https://go-review.googlesource.com/c/go/+/447435
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Keith Randall <khr@google.com>
Run-TryBot: Paul Murphy <murp@ibm.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Bryan Mills <bcmills@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
2022-10-07  cmd/compile: intrinsify Sub64 on loong64  (Wayne Zuo)

This is a follow up of CL 420095 on loong64.

file                                     before     after      Δ       %
compile/internal/ssa.a                   35649482   35653274   +3792   +0.011%
compile/internal/ssagen.a                4099858    4098728    -1130   -0.028%
ecdh.a                                   227896     226896     -1000   -0.439%
internal/nistec/fiat.a                   1212254    1128184    -84070  -6.935%
tls.a                                    3256800    3256802    +2      +0.000%
big.a                                    1708518    1702496    -6022   -0.352%
bits.a                                   106762     105734     -1028   -0.963%
math.a                                   578762     577288     -1474   -0.255%
netip.a                                  555922     555610     -312    -0.056%
net.a                                    3286528    3286530    +2      +0.000%
golang.org/x/crypto/internal/poly1305.a  109546     107686     -1860   -1.698%
total                                    260392768  260299668  -93100  -0.036%

Change-Id: Ieffca705aae5666501f284502d986ca179dde494
Reviewed-on: https://go-review.googlesource.com/c/go/+/428557
Reviewed-by: Carlos Amedee <carlos@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
2022-10-07  cmd/compile: intrinsify Add64 on loong64  (Wayne Zuo)
This is a follow up of CL 420094 on loong64. Reduce go toolchain size slightly on linux/loong64. compilecmp HEAD~1 -> HEAD HEAD~1 (8a32354219): internal/trace: use strings.Builder HEAD (1767784ac3): cmd/compile: intrinsify Add64 on loong64 platform: linux/loong64 file before after Δ % addr2line 3882616 3882536 -80 -0.002% api 5528866 5528450 -416 -0.008% asm 5133780 5133796 +16 +0.000% cgo 4668787 4668491 -296 -0.006% compile 25163409 25164729 +1320 +0.005% cover 4658055 4658007 -48 -0.001% dist 3437783 3437727 -56 -0.002% doc 3883069 3883205 +136 +0.004% fix 3383254 3383070 -184 -0.005% link 6747559 6747023 -536 -0.008% nm 3793923 3793939 +16 +0.000% objdump 4256628 4256812 +184 +0.004% pack 2356328 2356144 -184 -0.008% pprof 14233370 14131910 -101460 -0.713% test2json 2638668 2638476 -192 -0.007% trace 13392065 13360781 -31284 -0.234% vet 7456388 7455588 -800 -0.011% total 132498256 132364392 -133864 -0.101% file before after Δ % compile/internal/ssa.a 35644590 35649482 +4892 +0.014% compile/internal/ssagen.a 4101250 4099858 -1392 -0.034% internal/edwards25519/field.a 226064 201718 -24346 -10.770% internal/nistec/fiat.a 1689922 1212254 -477668 -28.266% tls.a 3256798 3256800 +2 +0.000% big.a 1718552 1708518 -10034 -0.584% bits.a 107786 106762 -1024 -0.950% cmplx.a 169434 168214 -1220 -0.720% math.a 581302 578762 -2540 -0.437% netip.a 556096 555922 -174 -0.031% net.a 3286526 3286528 +2 +0.000% runtime.a 8644786 8644510 -276 -0.003% strconv.a 519098 518374 -724 -0.139% golang.org/x/crypto/internal/poly1305.a 115398 109546 -5852 -5.071% total 260913122 260392768 -520354 -0.199% Change-Id: I75b2bb7761fa5a0d0d032d4ebe3582d092ea77be Reviewed-on: https://go-review.googlesource.com/c/go/+/428556 Reviewed-by: Carlos Amedee <carlos@golang.org> Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org> Reviewed-by: David Chase <drchase@google.com> TryBot-Result: Gopher Robot <gobot@golang.org>
2022-09-02  cmd/compile: optimize RotateLeft8/16 on arm64  (ruinan)

This CL optimizes RotateLeft8/16 on arm64.

For 16 bits, we form a 32-bit value by duplicating the 16-bit register into both halves, then use the RORW instruction to do the rotate shift. For 8 bits, we just use LSR and LSL instead of RORW because the code is simpler.

Benchmark        Old         ThisCL      delta
RotateLeft8-46   2.16 ns/op  1.73 ns/op  -19.70%
RotateLeft16-46  2.16 ns/op  1.54 ns/op  -28.53%

Change-Id: I09cde4383d12e31876a57f8cdfd3bb4f324fadb0
Reviewed-on: https://go-review.googlesource.com/c/go/+/420976
Reviewed-by: Keith Randall <khr@google.com>
Auto-Submit: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Heschi Kreinick <heschi@google.com>
Run-TryBot: Keith Randall <khr@golang.org>
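For reference, the operations being optimized are the plain math/bits rotate calls; a minimal, self-contained sketch (the values are arbitrary, chosen only to show the results):

package main

import (
	"fmt"
	"math/bits"
)

// rot16 and rot8 are the call shapes the codegen test exercises; on arm64
// they should now lower to a RORW, or an LSR/LSL pair for the 8-bit case.
func rot16(x uint16, k int) uint16 { return bits.RotateLeft16(x, k) }
func rot8(x uint8, k int) uint8    { return bits.RotateLeft8(x, k) }

func main() {
	fmt.Printf("%#x\n", rot16(0x00ff, 4)) // 0xff0
	fmt.Printf("%#x\n", rot8(0x81, 1))    // 0x3
}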
2022-08-27  cmd/compile: intrinsify Sub64 on riscv64  (Wayne Zuo)

After this CL, the performance difference in crypto/elliptic benchmarks on linux/riscv64 is:

name                 old time/op  new time/op  delta
ScalarBaseMult/P256  1.64ms ± 1%  1.60ms ± 1%  -2.36%   (p=0.008 n=5+5)
ScalarBaseMult/P224  1.53ms ± 1%  1.47ms ± 2%  -4.24%   (p=0.008 n=5+5)
ScalarBaseMult/P384  5.12ms ± 2%  5.03ms ± 2%  ~        (p=0.095 n=5+5)
ScalarBaseMult/P521  22.3ms ± 2%  13.8ms ± 1%  -37.89%  (p=0.008 n=5+5)
ScalarMult/P256      4.49ms ± 2%  4.26ms ± 2%  -5.13%   (p=0.008 n=5+5)
ScalarMult/P224      4.33ms ± 1%  4.09ms ± 1%  -5.59%   (p=0.008 n=5+5)
ScalarMult/P384      16.3ms ± 1%  15.5ms ± 2%  -4.78%   (p=0.008 n=5+5)
ScalarMult/P521      101ms ± 0%   47ms ± 2%    -53.36%  (p=0.008 n=5+5)

Change-Id: I31cf0506e27f9d85f576af1813630a19c20dda8a
Reviewed-on: https://go-review.googlesource.com/c/go/+/420095
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Joel Sing <joel@sing.id.au>
Reviewed-by: David Chase <drchase@google.com>
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
2022-08-27  cmd/compile: intrinsify Add64 on riscv64  (Wayne Zuo)

According to the RISC-V instruction set manual v2.2 Sec 2.4, we can implement the overflow check for unsigned addition cheaply using SLTU instructions.

After this CL, the performance difference in crypto/elliptic benchmarks on linux/riscv64 is:

name                 old time/op  new time/op   delta
ScalarBaseMult/P256  1.93ms ± 1%  1.64ms ± 1%   -14.96%  (p=0.008 n=5+5)
ScalarBaseMult/P224  1.80ms ± 2%  1.53ms ± 1%   -14.89%  (p=0.008 n=5+5)
ScalarBaseMult/P384  6.15ms ± 2%  5.12ms ± 2%   -16.73%  (p=0.008 n=5+5)
ScalarBaseMult/P521  25.9ms ± 1%  22.3ms ± 2%   -13.78%  (p=0.008 n=5+5)
ScalarMult/P256      5.59ms ± 1%  4.49ms ± 2%   -19.79%  (p=0.008 n=5+5)
ScalarMult/P224      5.42ms ± 1%  4.33ms ± 1%   -20.01%  (p=0.008 n=5+5)
ScalarMult/P384      19.9ms ± 2%  16.3ms ± 1%   -18.15%  (p=0.008 n=5+5)
ScalarMult/P521      97.3ms ± 1%  100.7ms ± 0%  +3.48%   (p=0.008 n=5+5)

Change-Id: Ic4c82ced4b072a4a6575343fa9f29dd09b0cabc4
Reviewed-on: https://go-review.googlesource.com/c/go/+/420094
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
Reviewed-by: Joel Sing <joel@sing.id.au>
TryBot-Result: Gopher Robot <gobot@golang.org>
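A sketch of why a single "set less than unsigned" comparison is enough: for a plain unsigned add, the carry-out is 1 exactly when the truncated sum is smaller than an operand, which is what math/bits.Add64 exposes portably (illustrative code, not the compiler's lowering):

package main

import (
	"fmt"
	"math/bits"
)

// addCarry computes a 64-bit sum and its carry-out using the branchy form
// of the comparison that SLTU performs branch-free in hardware.
func addCarry(x, y uint64) (sum, carry uint64) {
	sum = x + y
	if sum < x { // carry-out of the unsigned addition
		carry = 1
	}
	return
}

func main() {
	s1, c1 := addCarry(^uint64(0), 1)
	s2, c2 := bits.Add64(^uint64(0), 1, 0)
	fmt.Println(s1 == s2, c1 == c2) // true true
}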
2022-08-24  cmd/compile: deadcode for LoweredMuluhilo on riscv64  (Wayne Zuo)

This is a follow up of CL 425101 on RISCV64.

According to RISCV Volume 1, Unprivileged Spec v. 20191213 Chapter 7.1:

If both the high and low bits of the same product are required, then the recommended code sequence is: MULH[[S]U] rdh, rs1, rs2; MUL rdl, rs1, rs2 (source register specifiers must be in same order and rdh cannot be the same as rs1 or rs2). Microarchitectures can then fuse these into a single multiply operation instead of performing two separate multiplies.

So we should not split Muluhilo to separate instructions.

Updates #54607

Change-Id: If47461f3aaaf00e27cd583a9990e144fb8bcdb17
Reviewed-on: https://go-review.googlesource.com/c/go/+/425203
Auto-Submit: Keith Randall <khr@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
2022-08-22  cmd/compile: split Muluhilo op on ARM64  (Cherry Mui)

On ARM64 we use two separate instructions to compute the hi and lo results of a 64x64->128 multiplication. Lower to two separate ops so if only one result is needed we can deadcode the other.

Fixes #54607.

Change-Id: Ib023e77eb2b2b0bcf467b45471cb8a294bce6f90
Reviewed-on: https://go-review.googlesource.com/c/go/+/425101
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
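For example, a caller that keeps only the high half of the product leaves the low-half multiply dead, which the split ops let the compiler remove (a generic sketch; the helper name is made up for illustration):

package main

import (
	"fmt"
	"math/bits"
)

// mulHigh returns only the high 64 bits of the 128-bit product. With
// Mul64 lowered to separate hi/lo ops, the unused low-half multiply
// can be eliminated by dead-code elimination.
func mulHigh(x, y uint64) uint64 {
	hi, _ := bits.Mul64(x, y)
	return hi
}

func main() {
	fmt.Println(mulHigh(1<<63, 4)) // 2, i.e. (2^63 * 4) >> 64
}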
2022-08-17  test/codegen: updated multiple tests to verify on ppc64,ppc64le  (Archana R)

Updated multiple tests in test/codegen: math.go, mathbits.go, shift.go and slices.go to verify on ppc64/ppc64le as well.

Change-Id: Id88dd41569b7097819fb4d451b615f69cf7f7a94
Reviewed-on: https://go-review.googlesource.com/c/go/+/412115
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Archana Ravindar <aravind5@in.ibm.com>
Reviewed-by: Than McIntosh <thanm@google.com>
Reviewed-by: Paul Murphy <murp@ibm.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
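For context, test/codegen files such as mathbits.go are ordinary Go functions annotated with per-architecture assembly expectations that the codegen harness checks; a minimal sketch in that style (the opcode regexps below are illustrative, not the file's actual contents):

// asmcheck

package codegen

import "math/bits"

// Each comment names a GOARCH and a regexp the generated assembly for the
// enclosing function must match when built for that architecture.
func onesCount64(x uint64) int {
	// amd64:"POPCNTQ"
	// ppc64:"POPCNTD"
	// ppc64le:"POPCNTD"
	return bits.OnesCount64(x)
}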
2022-05-10  cmd/compile: lower Add64/Sub64 into ssa on PPC64  (Paul E. Murphy)
math/bits.Add64 and math/bits.Sub64 now lower and optimize directly in SSA form. The optimization of carry chains focuses around eliding XER<->GPR transfers of the CA bit when used exclusively as an input to a single carry operations, or when the CA value is known. This also adds support for handling XER spills in the assembler which could happen if carry chains contain inter-dependencies on each other (which seems very unlikely with practical usage), or a clobber happens (SRAW/SRAD/SUBFC operations clobber CA). With PPC64 Add64/Sub64 lowering into SSA and this patch, the net performance difference in crypto/elliptic benchmarks on P9/ppc64le are: name old time/op new time/op delta ScalarBaseMult/P256 46.3µs ± 0% 46.9µs ± 0% +1.34% ScalarBaseMult/P224 356µs ± 0% 209µs ± 0% -41.14% ScalarBaseMult/P384 1.20ms ± 0% 0.57ms ± 0% -52.14% ScalarBaseMult/P521 3.38ms ± 0% 1.44ms ± 0% -57.27% ScalarMult/P256 199µs ± 0% 199µs ± 0% -0.17% ScalarMult/P224 357µs ± 0% 212µs ± 0% -40.56% ScalarMult/P384 1.20ms ± 0% 0.58ms ± 0% -51.86% ScalarMult/P521 3.37ms ± 0% 1.44ms ± 0% -57.32% MarshalUnmarshal/P256/Uncompressed 2.59µs ± 0% 2.52µs ± 0% -2.63% MarshalUnmarshal/P256/Compressed 2.58µs ± 0% 2.52µs ± 0% -2.06% MarshalUnmarshal/P224/Uncompressed 1.54µs ± 0% 1.40µs ± 0% -9.42% MarshalUnmarshal/P224/Compressed 1.54µs ± 0% 1.39µs ± 0% -9.87% MarshalUnmarshal/P384/Uncompressed 2.40µs ± 0% 1.80µs ± 0% -24.93% MarshalUnmarshal/P384/Compressed 2.35µs ± 0% 1.81µs ± 0% -23.03% MarshalUnmarshal/P521/Uncompressed 3.79µs ± 0% 2.58µs ± 0% -31.81% MarshalUnmarshal/P521/Compressed 3.80µs ± 0% 2.60µs ± 0% -31.67% Note, P256 uses an asm implementation, thus, little variation is expected. Change-Id: I88a24f6bf0f4f285c649e40243b1ab69cc452b71 Reviewed-on: https://go-review.googlesource.com/c/go/+/346870 Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com> Reviewed-by: Dmitri Shuralyov <dmitshur@google.com> Run-TryBot: Paul Murphy <murp@ibm.com> TryBot-Result: Gopher Robot <gobot@golang.org> Reviewed-by: Ian Lance Taylor <iant@google.com>
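The carry chains referred to here are sequences like the following multi-word addition, where each Add64 feeds its carry into the next (a generic sketch using only math/bits; the 256-bit width is just an example):

package main

import (
	"fmt"
	"math/bits"
)

// add256 adds two 256-bit numbers held as little-endian [4]uint64 words.
// The threaded carry is what the PPC64 lowering keeps in the CA bit
// without moving it through a general-purpose register.
func add256(x, y [4]uint64) (z [4]uint64, carry uint64) {
	var c uint64
	z[0], c = bits.Add64(x[0], y[0], 0)
	z[1], c = bits.Add64(x[1], y[1], c)
	z[2], c = bits.Add64(x[2], y[2], c)
	z[3], c = bits.Add64(x[3], y[3], c)
	return z, c
}

func main() {
	x := [4]uint64{^uint64(0), ^uint64(0), ^uint64(0), ^uint64(0)}
	y := [4]uint64{1, 0, 0, 0}
	z, c := add256(x, y)
	fmt.Println(z, c) // [0 0 0 0] 1
}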
2022-04-04  cmd/compile: use LZCNT instruction for GOAMD64>=3  (Wayne Zuo)

LZCNT is similar to BSR, but BSR(x) is undefined when x == 0, so using LZCNT can avoid a special case for zero input. Except for that case, LZCNTQ(x) == 63-BSRQ(x) and LZCNTL(x) == 31-BSRL(x).

And according to https://www.agner.org/optimize/instruction_tables.pdf, LZCNT instructions are much faster than BSR on AMD CPUs.

name              old time/op  new time/op  delta
LeadingZeros-8    0.91ns ± 1%  0.80ns ± 7%  -11.68%  (p=0.000 n=9+9)
LeadingZeros8-8   0.98ns ±15%  0.91ns ± 1%  -7.34%   (p=0.000 n=9+9)
LeadingZeros16-8  0.94ns ± 3%  0.92ns ± 2%  -2.36%   (p=0.001 n=10+10)
LeadingZeros32-8  0.89ns ± 1%  0.78ns ± 2%  -12.49%  (p=0.000 n=10+10)
LeadingZeros64-8  0.92ns ± 1%  0.78ns ± 1%  -14.48%  (p=0.000 n=10+10)

Change-Id: I125147fe3d6994a4cfe558432780408e9a27557a
Reviewed-on: https://go-review.googlesource.com/c/go/+/396794
Reviewed-by: Keith Randall <khr@golang.org>
Trust: Emmanuel Odeke <emmanuel@orijtech.com>
Run-TryBot: Emmanuel Odeke <emmanuel@orijtech.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
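In math/bits terms, LeadingZeros64(x) is defined as 64-Len64(x) and is well defined for x == 0, which is exactly the case LZCNT handles without a branch; a small demonstration:

package main

import (
	"fmt"
	"math/bits"
)

func main() {
	// For every x, LeadingZeros64(x) == 64 - Len64(x), including x == 0,
	// where the result is the full operand width (64).
	for _, x := range []uint64{0, 1, 1 << 31, 1 << 63} {
		fmt.Println(x, bits.LeadingZeros64(x), 64-bits.Len64(x))
	}
}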
2021-10-18  cmd/compile/internal/ssagen: set BitLen32 as intrinsic on PPC64  (Lynn Boger)

It was noticed through some other investigation that BitLen32 was not generating the best code; it turned out it wasn't recognized as an intrinsic. This corrects that and enables the test for PPC64.

Change-Id: Iab496a8830c8552f507b7292649b1b660f3848b5
Reviewed-on: https://go-review.googlesource.com/c/go/+/355872
Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com>
TryBot-Result: Go Bot <gobot@golang.org>
Trust: Lynn Boger <laboger@linux.vnet.ibm.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
2021-10-05  cmd/compile: don't emit unnecessary amd64 extension checks  (nimelehin)

On amd64 the compiler issues runtime checks to see whether CPU extensions are available on the platform. With GOAMD64 microarchitecture levels provided, some of these checks can be eliminated.

Change-Id: If15c178bcae273b2ce7d3673415cb8849292e087
Reviewed-on: https://go-review.googlesource.com/c/go/+/352010
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Go Bot <gobot@golang.org>
2021-10-05  cmd/compile: use TZCNT instruction for GOAMD64>=v3  (wdvxdr)

On my Intel CoffeeLake CPU:

name               old time/op  new time/op  delta
TrailingZeros-8    0.68ns ± 1%  0.64ns ± 1%  -6.26%  (p=0.000 n=10+10)
TrailingZeros8-8   0.70ns ± 1%  0.70ns ± 1%  ~       (p=0.697 n=10+10)
TrailingZeros16-8  0.70ns ± 1%  0.70ns ± 1%  +0.57%  (p=0.043 n=10+10)
TrailingZeros32-8  0.66ns ± 1%  0.64ns ± 1%  -3.35%  (p=0.000 n=10+10)
TrailingZeros64-8  0.68ns ± 1%  0.64ns ± 1%  -5.84%  (p=0.000 n=9+10)

Updates #45453

Change-Id: I228ff2d51df24b1306136f061432f8a12bb1d6fd
Reviewed-on: https://go-review.googlesource.com/c/go/+/353249
Trust: Michael Knyszek <mknyszek@google.com>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
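The zero input is the interesting case here: math/bits defines TrailingZeros to return the operand width for 0, which TZCNT provides directly while BSF leaves the result undefined; for example:

package main

import (
	"fmt"
	"math/bits"
)

func main() {
	fmt.Println(bits.TrailingZeros64(0))  // 64 (defined even for zero input)
	fmt.Println(bits.TrailingZeros64(8))  // 3
	fmt.Println(bits.TrailingZeros32(12)) // 2
}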
2021-08-16  cmd/compile: intrinsify Mul64 on riscv64  (Meng Zhuo)

According to the RISCV instruction set manual v2.2 Sec 6.1, MULHU followed by MUL will be fused into one multiply by the microarchitecture.

Benchstat on HiFive Unmatched:

name          old time/op    new time/op    delta
Hash8Bytes    245ns ± 3%     186ns ± 4%     -23.99%  (p=0.000 n=10+10)
Hash320Bytes  1.94µs ± 1%    1.31µs ± 1%    -32.38%  (p=0.000 n=9+10)
Hash1K        5.84µs ± 0%    3.84µs ± 0%    -34.20%  (p=0.000 n=10+9)
Hash8K        45.3µs ± 0%    29.4µs ± 0%    -35.04%  (p=0.000 n=10+10)

name          old speed      new speed      delta
Hash8Bytes    32.7MB/s ± 3%  43.0MB/s ± 4%  +31.61%  (p=0.000 n=10+10)
Hash320Bytes  165MB/s ± 1%   244MB/s ± 1%   +47.88%  (p=0.000 n=9+10)
Hash1K        175MB/s ± 0%   266MB/s ± 0%   +51.98%  (p=0.000 n=10+9)
Hash8K        181MB/s ± 0%   279MB/s ± 0%   +53.94%  (p=0.000 n=10+10)

Change-Id: I3561495d02a4a0ad8578e9b9819bf0a4eaca5d12
Reviewed-on: https://go-review.googlesource.com/c/go/+/329970
Reviewed-by: Joel Sing <joel@sing.id.au>
Run-TryBot: Joel Sing <joel@sing.id.au>
TryBot-Result: Go Bot <gobot@golang.org>
Trust: Meng Zhuo <mzh@golangcn.org>
2021-04-14  cmd/compile: rescue stmt boundaries from OpArgXXXReg and OpSelectN.  (David Chase)

Fixes this failure:

go test cmd/compile/internal/ssa -run TestStmtLines -v
=== RUN   TestStmtLines
    stmtlines_test.go:115: Saw too many (amd64, > 1%) lines without statement marks, total=88263, nostmt=1930 ('-run TestStmtLines -v' lists failing lines)

The failure has two causes.

One is that the first-line adjuster in code generation was relocating "first lines" to instructions that would either not have any code generated, or would have the statement marker removed by a different believed-good heuristic.

The other was that statement boundaries were getting attached to register values (that with the old ABI were loads from the stack, hence real instructions). The register values disappear at code generation.

The fixes are to (1) note that certain instructions are not good choices for "first value" and skip them, and (2) in an expandCalls post-pass, look for register valued instructions and under appropriate conditions move their statement marker to a compatible use.

Also updates TestStmtLines to always log the score, for easier comparison of minor compiler changes.

Updates #40724.

Change-Id: I485573ce900e292d7c44574adb7629cdb4695c3f
Reviewed-on: https://go-review.googlesource.com/c/go/+/309649
Trust: David Chase <drchase@google.com>
Run-TryBot: David Chase <drchase@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2020-11-06  cmd/compile: optimize shift pairs and masks on s390x  (Michael Munday)
Optimize combinations of left and right shifts by a constant value into a 'rotate then insert selected bits [into zero]' instruction. Use the same instruction for contiguous masks since it has some benefits over 'and immediate' (not restricted to 32-bits, does not overwrite source register). To keep the complexity of this change under control I've only implemented 64 bit operations for now. There are a lot more optimizations that can be done with this instruction family. However, since their function overlaps with other instructions we need to be somewhat careful not to break existing optimization rules by creating optimization dead ends. This is particularly true of the load/store merging rules which contain lots of zero extensions and shifts. This CL does interfere with the store merging rules when an operand is shifted left before it is stored: binary.BigEndian.PutUint64(b, x << 1) This is unfortunate but it's not critical and somewhat complex so I plan to fix that in a follow up CL. file before after Δ % addr2line 4117446 4117282 -164 -0.004% api 4945184 4942752 -2432 -0.049% asm 4998079 4991891 -6188 -0.124% buildid 2685158 2684074 -1084 -0.040% cgo 4553732 4553394 -338 -0.007% compile 19294446 19245070 -49376 -0.256% cover 4897105 4891319 -5786 -0.118% dist 3544389 3542785 -1604 -0.045% doc 3926795 3927617 +822 +0.021% fix 3302958 3293868 -9090 -0.275% link 6546274 6543456 -2818 -0.043% nm 4102021 4100825 -1196 -0.029% objdump 4542431 4548483 +6052 +0.133% pack 2482465 2416389 -66076 -2.662% pprof 13366541 13363915 -2626 -0.020% test2json 2829007 2761515 -67492 -2.386% trace 10216164 10219684 +3520 +0.034% vet 6773956 6773572 -384 -0.006% total 107124151 106917891 -206260 -0.193% Change-Id: I7591cce41e06867ba10a745daae9333513062746 Reviewed-on: https://go-review.googlesource.com/c/go/+/233317 Run-TryBot: Michael Munday <mike.munday@ibm.com> TryBot-Result: Go Bot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org> Trust: Michael Munday <mike.munday@ibm.com>
2020-08-27  cmd/compile: generate subfic on ppc64  (Paul E. Murphy)

This merges an lis + subf into subfic, and for 32b constants lwa + subf into oris + ori + subf.

The carry bit is no longer used in code generation, therefore I think we can clobber it as needed. Note, lowered borrow/carry arithmetic is self-contained and thus is not affected.

A few extra rules are added to ensure early transformations to SUBFCconst don't trip up earlier rules, fold constant operations, or otherwise simplify lowering. Likewise, tests are added to ensure all rules are hit. Generic constant folding catches trivial cases, however some lowering rules insert arithmetic which can introduce new opportunities (e.g. BitLen or Slicemask).

I couldn't find a specific benchmark to demonstrate noteworthy improvements, but this is generating subfic in many of the default bent test binaries, so we are at least saving a little code space.

Change-Id: Iad7c6e5767eaa9dc24dc1c989bd1c8cfe1982012
Reviewed-on: https://go-review.googlesource.com/c/go/+/249461
Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Carlos Eduardo Seo <cseo@linux.vnet.ibm.com>
2020-04-22  cmd/compile: clean up codegen for branch-on-carry on s390x  (Michael Munday)

This CL optimizes code that uses a carry from a function such as bits.Add64 as the condition in an if statement. For example:

x, c := bits.Add64(a, b, 0)
if c != 0 {
	panic("overflow")
}

Rather than converting the carry into a 0 or a 1 value and using that as an input to a comparison instruction, the carry flag is now used as the input to a conditional branch directly. This typically removes an ADD LOGICAL WITH CARRY instruction when user code is doing overflow detection and is closer to the code that a user would expect to generate.

Change-Id: I950431270955ab72f1b5c6db873b6abe769be0da
Reviewed-on: https://go-review.googlesource.com/c/go/+/219757
Run-TryBot: Michael Munday <mike.munday@ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2020-04-04  cmd/compile: add intrinsic HasCPUFeature for checking cpu features  (Josh Bleecher Snyder)

Before using some CPU instructions, we must check for their presence. We use global variables in the runtime package to record features.

Prior to this CL, we issued a regular memory load for these features. The downside to this is that, because it is a regular memory load, it cannot be hoisted out of loops or otherwise reordered with other loads.

This CL introduces a new intrinsic just for checking cpu features. It still ends up resulting in a memory load, but that memory load can now be floated to the entry block and rematerialized as needed.

One downside is that the regular load could be combined with the comparison into a CMPBconstload+NE. This new intrinsic cannot; it generates MOVB+TESTB+NE. (It is possible that MOVBQZX+TESTQ+NE would be better.)

This CL does only amd64. It is easy to extend to other architectures.

For the benchmark in #36196, on my machine, this offers a mild speedup.

name      old time/op  new time/op  delta
FMA-8     1.39ns ± 6%  1.29ns ± 9%  -7.19%  (p=0.000 n=97+96)
NonFMA-8  2.03ns ±11%  2.04ns ±12%  ~       (p=0.618 n=99+98)

Updates #15808
Updates #36196

Change-Id: I75e2fcfcf5a6df1bdb80657a7143bed69fca6deb
Reviewed-on: https://go-review.googlesource.com/c/go/+/212360
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Giovanni Bajo <rasky@develer.com>
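Outside the runtime, the analogous user-level pattern gates an optimized path on an exported capability flag; a hedged sketch using golang.org/x/sys/cpu (the import path, field name, and AVX2 stand-in are my assumptions, not part of this commit):

package main

import (
	"fmt"

	"golang.org/x/sys/cpu" // assumed import; exposes flags such as cpu.X86.HasAVX2
)

func sumGeneric(xs []float64) float64 {
	var s float64
	for _, x := range xs {
		s += x
	}
	return s
}

// sum picks an implementation based on a feature-flag load. That flag is an
// ordinary bool in memory, the same kind of load the new intrinsic lets the
// compiler hoist and rematerialize inside the runtime itself.
func sum(xs []float64) float64 {
	if cpu.X86.HasAVX2 {
		// A hand-vectorized path would go here; fall through in this sketch.
		return sumGeneric(xs)
	}
	return sumGeneric(xs)
}

func main() { fmt.Println(sum([]float64{1, 2, 3})) }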
2019-10-14  cmd/compile: add math/bits.Mul64 intrinsic on mips64x  (Meng Zhuo)

Benchmark:

name   old time/op  new time/op  delta
Mul    36.0ns ± 1%  2.8ns ± 0%   -92.31%  (p=0.000 n=10+10)
Mul32  4.37ns ± 0%  4.37ns ± 0%  ~        (p=0.429 n=6+10)
Mul64  36.4ns ± 0%  2.8ns ± 0%   -92.37%  (p=0.000 n=10+9)

Change-Id: Ic4f4e5958adbf24999abcee721d0180b5413fca7
Reviewed-on: https://go-review.googlesource.com/c/go/+/200582
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2019-09-13  cmd/compile: add math/bits.Mul64 intrinsic on s390x  (Ruixin Bao)

This change adds an intrinsic for Mul64 on s390x. To achieve that, a new assembly instruction, MLGR, is introduced in s390x/asmz.go. This assembly instruction directly uses an existing instruction on Z and supports multiplication of two 64 bit unsigned integers, storing the result in two separate registers.

In this case, we require the multiplicand to be stored in register R3 and the output result (the high and low 64 bits of the product) to be stored in R2 and R3 respectively.

A test case is also added.

Benchmark:

name      old time/op  new time/op  delta
Mul-18    11.1ns ± 0%  1.4ns ± 0%   -87.39%  (p=0.002 n=8+10)
Mul32-18  2.07ns ± 0%  2.07ns ± 0%  ~        (all equal)
Mul64-18  11.1ns ± 1%  1.4ns ± 0%   -87.42%  (p=0.000 n=10+10)

Change-Id: Ieca6ad1f61fff9a48a31d50bbd3f3c6d9e6675c1
Reviewed-on: https://go-review.googlesource.com/c/go/+/194572
Reviewed-by: Michael Munday <mike.munday@ibm.com>
Run-TryBot: Michael Munday <mike.munday@ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2019-08-31  cmd/compile: intrinsify RotateLeft32 on wasm  (Brian Kessler)

wasm has 32-bit versions of all integer operations. This change lowers RotateLeft32 to i32.rotl on wasm and intrinsifies the math/bits call. Benchmarking on amd64 under node.js this is ~25% faster.

node v10.15.3/amd64

name          old time/op  new time/op  delta
RotateLeft    8.37ns ± 1%  8.28ns ± 0%  -1.05%   (p=0.029 n=4+4)
RotateLeft8   11.9ns ± 1%  11.8ns ± 0%  ~        (p=0.167 n=5+5)
RotateLeft16  11.8ns ± 0%  11.8ns ± 0%  ~        (all equal)
RotateLeft32  11.9ns ± 1%  8.7ns ± 0%   -26.32%  (p=0.008 n=5+5)
RotateLeft64  8.31ns ± 1%  8.43ns ± 2%  ~        (p=0.063 n=5+5)

Updates #31265

Change-Id: I5b8e155978faeea536c4f6427ac9564d2f096a46
Reviewed-on: https://go-review.googlesource.com/c/go/+/182359
Run-TryBot: Brian Kessler <brian.m.kessler@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Richard Musiol <neelance@gmail.com>
2019-08-30  cmd/compile: optimize 386's math.bits.TrailingZeros16  (Ben Shi)

This CL reverts CL 192097 and fixes the issue in CL 189277.

Change-Id: Icd271262e1f5019a8e01c91f91c12c1261eeb02b
Reviewed-on: https://go-review.googlesource.com/c/go/+/192519
Run-TryBot: Ben Shi <powerman1st@163.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2019-08-28  test/codegen: fix ARM32 RotateLeft32 test  (Cherry Zhang)

The syntax of a shifted operation does not have a "$" sign for the shift amount. Remove it.

Change-Id: I50782fe942b640076f48c2fafea4d3175be8ff99
Reviewed-on: https://go-review.googlesource.com/c/go/+/192100
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
2019-08-28  cmd/compile: optimize ARM's math.bits.RotateLeft32  (Ben Shi)

This CL optimizes math.bits.RotateLeft32 to inline "MOVW Rx@>Ry, Rd" on ARM. The benchmark results of math/bits show some improvements.

name            old time/op  new time/op  delta
RotateLeft-4    9.42ns ± 0%  6.91ns ± 0%  -26.66%  (p=0.000 n=40+33)
RotateLeft8-4   8.79ns ± 0%  8.79ns ± 0%  -0.04%   (p=0.000 n=40+31)
RotateLeft16-4  8.79ns ± 0%  8.79ns ± 0%  -0.04%   (p=0.000 n=40+32)
RotateLeft32-4  8.16ns ± 0%  7.54ns ± 0%  -7.68%   (p=0.000 n=40+40)
RotateLeft64-4  15.7ns ± 0%  15.7ns ± 0%  ~        (all equal)

Updates #31265

Change-Id: I77bc1c2c702d5323fc7cad5264a8e2d5666bf712
Reviewed-on: https://go-review.googlesource.com/c/go/+/188697
Run-TryBot: Ben Shi <powerman1st@163.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2019-08-28  Revert "cmd/compile: optimize 386's math.bits.TrailingZeros16"  (Bryan C. Mills)

This reverts CL 189277.

Reason for revert: broke 32-bit builders.

Updates #33902

Change-Id: Ie5f180d0371a90e5057ed578c334372e5fc3a286
Reviewed-on: https://go-review.googlesource.com/c/go/+/192097
Run-TryBot: Bryan C. Mills <bcmills@google.com>
Reviewed-by: Daniel Martí <mvdan@mvdan.cc>
2019-08-28  cmd/compile: optimize 386's math.bits.TrailingZeros16  (Ben Shi)

This CL optimizes math.bits.TrailingZeros16 on 386 with a pair of BSFL and ORL instructions.

The case TrailingZeros16-4 of the benchmark test in math/bits shows big improvement.

name               old time/op  new time/op  delta
TrailingZeros16-4  1.55ns ± 1%  0.87ns ± 1%  -43.87%  (p=0.000 n=50+49)

Change-Id: Ia899975b0e46f45dcd20223b713ed632bc32740b
Reviewed-on: https://go-review.googlesource.com/c/go/+/189277
Run-TryBot: Ben Shi <powerman1st@163.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
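The BSFL+ORL pairing corresponds to widening the 16-bit input and OR-ing in bit 16, so the value passed to the bit scan is never zero; in portable Go the same identity can be written as (illustrative only, not the compiler's code):

package main

import (
	"fmt"
	"math/bits"
)

// tz16 mirrors the lowering: setting bit 16 makes the operand nonzero, so a
// single trailing-zero count gives the right answer, including 16 for x == 0.
func tz16(x uint16) int {
	return bits.TrailingZeros32(uint32(x) | 0x10000)
}

func main() {
	fmt.Println(tz16(0), bits.TrailingZeros16(0)) // 16 16
	fmt.Println(tz16(8), bits.TrailingZeros16(8)) // 3 3
}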
2019-05-03  cmd/compile: add math/bits.{Add,Sub}64 intrinsics on s390x  (Michael Munday)

This CL adds intrinsics for the 64-bit addition and subtraction functions in math/bits. These intrinsics use the condition code to propagate the carry or borrow bit.

To make the carry chains more efficient I've removed the 'clobberFlags' property from most of the load and store operations. Originally these ops did clobber flags when using offsets that didn't fit in a signed 20-bit integer, however that is no longer true.

As with other platforms the intrinsics are faster when executed in a chain rather than a loop because currently we need to spill and restore the carry bit between each loop iteration. We may be able to reduce the need to do this on s390x (e.g. by using compare-and-branch instructions that do not clobber flags) in the future.

name           old time/op  new time/op  delta
Add64          1.21ns ± 2%  2.03ns ± 2%  +67.18%  (p=0.000 n=7+10)
Add64multiple  2.98ns ± 3%  1.03ns ± 0%  -65.39%  (p=0.000 n=10+9)
Sub64          1.23ns ± 4%  2.03ns ± 1%  +64.85%  (p=0.000 n=10+10)
Sub64multiple  3.73ns ± 4%  1.04ns ± 1%  -72.28%  (p=0.000 n=10+8)

Change-Id: I913bbd5e19e6b95bef52f5bc4f14d6fe40119083
Reviewed-on: https://go-review.googlesource.com/c/go/+/174303
Run-TryBot: Michael Munday <mike.munday@ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2019-04-28  cmd/compile: intrinsify math/bits.Add64 for ppc64x  (Carlos Eduardo Seo)

This change creates an intrinsic for Add64 for ppc64x and adds a testcase for it.

name               old time/op  new time/op  delta
Add64-160          1.90ns ±40%  2.29ns ± 0%  ~        (p=0.119 n=5+5)
Add64multiple-160  6.69ns ± 2%  2.45ns ± 4%  -63.47%  (p=0.016 n=4+5)

Change-Id: I9abe6fb023fdf62eea3c9b46a1820f60bb0a7f97
Reviewed-on: https://go-review.googlesource.com/c/go/+/173758
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
Run-TryBot: Carlos Eduardo Seo <cseo@linux.vnet.ibm.com>
2019-04-22  cmd/compile: intrinsify math/bits.Sub64 for arm64  (erifan01)

This CL intrinsifies Sub64 with the arm64 instruction sequence NEGS, SBCS, NGC and NEG, and optimizes the case of borrowing chains.

Benchmarks:

name              old time/op       new time/op       delta
Sub-64            2.500000ns +- 0%  2.048000ns +- 1%  -18.08%  (p=0.000 n=10+10)
Sub32-64          2.500000ns +- 0%  2.500000ns +- 0%  ~        (all equal)
Sub64-64          2.500000ns +- 0%  2.080000ns +- 0%  -16.80%  (p=0.000 n=10+7)
Sub64multiple-64  7.090000ns +- 0%  2.090000ns +- 0%  -70.52%  (p=0.000 n=10+10)

Change-Id: I3d2664e009a9635e13b55d2c4567c7b34c2c0655
Reviewed-on: https://go-review.googlesource.com/c/go/+/159018
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
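A borrowing chain of the kind this optimizes looks like the following multi-word subtraction, where each Sub64 threads its borrow into the next (a generic sketch, not arm64-specific):

package main

import (
	"fmt"
	"math/bits"
)

// sub256 subtracts two 256-bit numbers stored as little-endian [4]uint64
// words, propagating the borrow through the chain the way the lowering
// keeps it in the flags with SBCS.
func sub256(x, y [4]uint64) (z [4]uint64, borrow uint64) {
	var b uint64
	z[0], b = bits.Sub64(x[0], y[0], 0)
	z[1], b = bits.Sub64(x[1], y[1], b)
	z[2], b = bits.Sub64(x[2], y[2], b)
	z[3], b = bits.Sub64(x[3], y[3], b)
	return z, b
}

func main() {
	z, b := sub256([4]uint64{0, 0, 0, 1}, [4]uint64{1, 0, 0, 0})
	fmt.Println(z, b) // the low three words are all-ones, the top word 0, borrow 0
}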
2019-04-20  cmd/compile: reduce bits.Div64(0, lo, y) to 64 bit division  (Josh Bleecher Snyder)

With this change, these two functions generate identical code:

func f(x uint64) (uint64, uint64) {
	return bits.Div64(0, x, 5)
}

func g(x uint64) (uint64, uint64) {
	return x / 5, x % 5
}

Updates #31582

Change-Id: Ia96c2e67f8af5dd985823afee5f155608c04a4b6
Reviewed-on: https://go-review.googlesource.com/c/go/+/173197
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2019-03-22  cmd/compile: follow up intrinsifying math/bits.Add64 for arm64  (erifan01)

This CL deals with the additional comments of CL 159017.

Change-Id: I4ad3c60c834646d58dc0c544c741b92bfe83fb8b
Reviewed-on: https://go-review.googlesource.com/c/go/+/168857
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2019-03-20  cmd/compile/internal, cmd/internal/obj/ppc64: generate new count trailing zeros instructions on POWER9  (Carlos Eduardo Seo)

This change adds new POWER9 instructions for counting trailing zeros (CNTTZW/CNTTZD) to the assembler and generates them in SSA when GOPPC64=power9.

name                 old time/op  new time/op  delta
TrailingZeros-160    1.59ns ±20%  1.45ns ±10%  -8.81%  (p=0.000 n=14+13)
TrailingZeros8-160   1.55ns ±23%  1.62ns ±44%  ~       (p=0.593 n=13+15)
TrailingZeros16-160  1.78ns ±23%  1.62ns ±38%  -9.31%  (p=0.003 n=14+14)
TrailingZeros32-160  1.64ns ±10%  1.49ns ± 9%  -9.15%  (p=0.000 n=13+14)
TrailingZeros64-160  1.53ns ± 6%  1.45ns ± 5%  -5.38%  (p=0.000 n=15+13)

Change-Id: I365e6ff79f3ce4d8ebe089a6a86b1771853eb596
Reviewed-on: https://go-review.googlesource.com/c/go/+/167517
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
2019-03-20  cmd/compile: intrinsify math/bits.Add64 for arm64  (erifan01)

This CL intrinsifies Add64 with the arm64 instruction sequence ADDS, ADCS and ADC, and optimizes the case of carry chains. The CL also changes the test code so that the intrinsic implementation can be tested.

Benchmarks:

name               old time/op       new time/op       delta
Add-224            2.500000ns +- 0%  2.090000ns +- 4%  -16.40%  (p=0.000 n=9+10)
Add32-224          2.500000ns +- 0%  2.500000ns +- 0%  ~        (all equal)
Add64-224          2.500000ns +- 0%  1.577778ns +- 2%  -36.89%  (p=0.000 n=10+9)
Add64multiple-224  6.000000ns +- 0%  2.000000ns +- 0%  -66.67%  (p=0.000 n=10+10)

Change-Id: I6ee91c9a85c16cc72ade5fd94868c579f16c7615
Reviewed-on: https://go-review.googlesource.com/c/go/+/159017
Run-TryBot: Ben Shi <powerman1st@163.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2019-03-15  cmd/compile: eliminate unnecessary type conversions in TrailingZeros(16|8) for arm  (Tobias Klauser)

This follows CL 156999 which did the same for arm64.

name               old time/op  new time/op  delta
TrailingZeros-4    7.30ns ± 1%  7.30ns ± 0%  ~        (p=0.413 n=9+9)
TrailingZeros8-4   8.32ns ± 0%  7.17ns ± 0%  -13.77%  (p=0.000 n=10+9)
TrailingZeros16-4  8.30ns ± 0%  7.18ns ± 0%  -13.50%  (p=0.000 n=9+10)
TrailingZeros32-4  6.46ns ± 1%  6.47ns ± 1%  ~        (p=0.325 n=10+10)
TrailingZeros64-4  16.3ns ± 0%  16.2ns ± 0%  -0.61%   (p=0.000 n=7+10)

Change-Id: I7e9e1abf7e30d811aa474d272b2824ec7cbbaa98
Reviewed-on: https://go-review.googlesource.com/c/go/+/167797
Run-TryBot: Tobias Klauser <tobias.klauser@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2019-03-14  math, math/bits: add intrinsics for wasm  (Richard Musiol)
This commit adds compiler intrinsics for the packages math and math/bits on the wasm architecture for better performance. benchmark old ns/op new ns/op delta BenchmarkCeil 8.31 3.21 -61.37% BenchmarkCopysign 5.24 3.88 -25.95% BenchmarkAbs 5.42 3.34 -38.38% BenchmarkFloor 8.29 3.18 -61.64% BenchmarkRoundToEven 9.76 3.26 -66.60% BenchmarkSqrtLatency 8.13 4.88 -39.98% BenchmarkSqrtPrime 5246 3535 -32.62% BenchmarkTrunc 8.29 3.15 -62.00% BenchmarkLeadingZeros 13.0 4.23 -67.46% BenchmarkLeadingZeros8 4.65 4.42 -4.95% BenchmarkLeadingZeros16 7.60 4.38 -42.37% BenchmarkLeadingZeros32 10.7 4.48 -58.13% BenchmarkLeadingZeros64 12.9 4.31 -66.59% BenchmarkTrailingZeros 6.52 4.04 -38.04% BenchmarkTrailingZeros8 4.57 4.14 -9.41% BenchmarkTrailingZeros16 6.69 4.16 -37.82% BenchmarkTrailingZeros32 6.97 4.23 -39.31% BenchmarkTrailingZeros64 6.59 4.00 -39.30% BenchmarkOnesCount 7.93 3.30 -58.39% BenchmarkOnesCount8 3.56 3.19 -10.39% BenchmarkOnesCount16 4.85 3.19 -34.23% BenchmarkOnesCount32 7.27 3.19 -56.12% BenchmarkOnesCount64 8.08 3.28 -59.41% BenchmarkRotateLeft 4.88 3.80 -22.13% BenchmarkRotateLeft64 5.03 3.63 -27.83% Change-Id: Ic1e0c2984878be8defb6eb7eb6ee63765c793222 Reviewed-on: https://go-review.googlesource.com/c/go/+/165177 Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com>
2019-03-07  cmd/compile: eliminate unnecessary type conversions in TrailingZeros(16|8) for arm64  (erifan01)

This CL eliminates unnecessary type conversion operations: OpZeroExt16to64 and OpZeroExt8to64. If the input argument is a nonzero value, then the ORconst operation can also be eliminated.

Benchmarks:

name              old time/op  new time/op  delta
TrailingZeros-8   2.75ns ± 0%  2.75ns ± 0%  ~        (all equal)
TrailingZeros8-8  3.49ns ± 1%  2.93ns ± 0%  -16.00%  (p=0.000 n=10+10)
TrailingZeros16-8 3.49ns ± 1%  2.93ns ± 0%  -16.05%  (p=0.000 n=9+10)
TrailingZeros32-8 2.67ns ± 1%  2.68ns ± 1%  ~        (p=0.468 n=10+10)
TrailingZeros64-8 2.67ns ± 1%  2.65ns ± 0%  -0.62%   (p=0.022 n=10+9)

code:

func f16(x uint) {
	z = bits.TrailingZeros16(uint16(x))
}

Before:

"".f16 STEXT size=48 args=0x8 locals=0x0 leaf
	0x0000 00000 (test.go:7)	TEXT	"".f16(SB), LEAF|NOFRAME|ABIInternal, $0-8
	0x0000 00000 (test.go:7)	FUNCDATA	ZR, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
	0x0000 00000 (test.go:7)	FUNCDATA	$1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
	0x0000 00000 (test.go:7)	FUNCDATA	$3, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
	0x0000 00000 (test.go:7)	PCDATA	$2, ZR
	0x0000 00000 (test.go:7)	PCDATA	ZR, ZR
	0x0000 00000 (test.go:7)	MOVD	"".x(FP), R0
	0x0004 00004 (test.go:7)	MOVHU	R0, R0
	0x0008 00008 (test.go:7)	ORR	$65536, R0, R0
	0x000c 00012 (test.go:7)	RBIT	R0, R0
	0x0010 00016 (test.go:7)	CLZ	R0, R0
	0x0014 00020 (test.go:7)	MOVD	R0, "".z(SB)
	0x0020 00032 (test.go:7)	RET	(R30)

This line of code is unnecessary:

	0x0004 00004 (test.go:7)	MOVHU	R0, R0

After:

"".f16 STEXT size=32 args=0x8 locals=0x0 leaf
	0x0000 00000 (test.go:7)	TEXT	"".f16(SB), LEAF|NOFRAME|ABIInternal, $0-8
	0x0000 00000 (test.go:7)	FUNCDATA	ZR, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
	0x0000 00000 (test.go:7)	FUNCDATA	$1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
	0x0000 00000 (test.go:7)	FUNCDATA	$3, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
	0x0000 00000 (test.go:7)	PCDATA	$2, ZR
	0x0000 00000 (test.go:7)	PCDATA	ZR, ZR
	0x0000 00000 (test.go:7)	MOVD	"".x(FP), R0
	0x0004 00004 (test.go:7)	ORR	$65536, R0, R0
	0x0008 00008 (test.go:7)	RBITW	R0, R0
	0x000c 00012 (test.go:7)	CLZW	R0, R0
	0x0010 00016 (test.go:7)	MOVD	R0, "".z(SB)
	0x001c 00028 (test.go:7)	RET	(R30)

The situation of TrailingZeros8 is similar to TrailingZeros16.

Change-Id: I473bdca06be8460a0be87abbae6fe640017e4c9d
Reviewed-on: https://go-review.googlesource.com/c/go/+/156999
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
2019-03-07  cmd/compile: add an optimization rule for math/bits.ReverseBytes16 on arm  (erifan01)

This CL adds two rules to turn patterns like ((x<<8) | (x>>8)) (the type of x is uint16, "|" can also be "+" or "^") to a REV16 instruction on arm v6+. This optimization rule can be used for math/bits.ReverseBytes16.

Benchmarks on arm v6:

name               old time/op  new time/op  delta
ReverseBytes-32    2.86ns ± 0%  2.86ns ± 0%  ~  (all equal)
ReverseBytes16-32  2.86ns ± 0%  2.86ns ± 0%  ~  (all equal)
ReverseBytes32-32  1.29ns ± 0%  1.29ns ± 0%  ~  (all equal)
ReverseBytes64-32  1.43ns ± 0%  1.43ns ± 0%  ~  (all equal)

Change-Id: I819e633c9a9d308f8e476fb0c82d73fb73dd019f
Reviewed-on: https://go-review.googlesource.com/c/go/+/159019
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
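The shift/OR pattern the rule matches is equivalent to math/bits.ReverseBytes16; a small sketch showing both forms produce the same result:

package main

import (
	"fmt"
	"math/bits"
)

// swap16 writes the byte swap as the shift/OR pattern the rewrite rule
// recognizes; on arm v6+ it should lower to a single REV16 instruction.
func swap16(x uint16) uint16 {
	return x<<8 | x>>8
}

func main() {
	fmt.Printf("%#x %#x\n", swap16(0x1234), bits.ReverseBytes16(0x1234)) // 0x3412 0x3412
}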
2019-03-03  cmd/compile: optimize math/bits.Div32 for arm64  (erifan01)
Benchmark: name old time/op new time/op delta Div-8 22.0ns ± 0% 22.0ns ± 0% ~ (all equal) Div32-8 6.51ns ± 0% 3.00ns ± 0% -53.90% (p=0.000 n=10+8) Div64-8 22.5ns ± 0% 22.5ns ± 0% ~ (all equal) Code: func div32(hi, lo, y uint32) (q, r uint32) {return bits.Div32(hi, lo, y)} Before: 0x0020 00032 (test.go:24) MOVWU "".y+8(FP), R0 0x0024 00036 ($GOROOT/src/math/bits/bits.go:472) CBZW R0, 132 0x0028 00040 ($GOROOT/src/math/bits/bits.go:472) MOVWU "".hi(FP), R1 0x002c 00044 ($GOROOT/src/math/bits/bits.go:472) CMPW R1, R0 0x0030 00048 ($GOROOT/src/math/bits/bits.go:472) BLS 96 0x0034 00052 ($GOROOT/src/math/bits/bits.go:475) MOVWU "".lo+4(FP), R2 0x0038 00056 ($GOROOT/src/math/bits/bits.go:475) ORR R1<<32, R2, R1 0x003c 00060 ($GOROOT/src/math/bits/bits.go:476) CBZ R0, 140 0x0040 00064 ($GOROOT/src/math/bits/bits.go:476) UDIV R0, R1, R2 0x0044 00068 (test.go:24) MOVW R2, "".q+16(FP) 0x0048 00072 ($GOROOT/src/math/bits/bits.go:476) UREM R0, R1, R0 0x0050 00080 (test.go:24) MOVW R0, "".r+20(FP) 0x0054 00084 (test.go:24) MOVD -8(RSP), R29 0x0058 00088 (test.go:24) MOVD.P 32(RSP), R30 0x005c 00092 (test.go:24) RET (R30) After: 0x001c 00028 (test.go:24) MOVWU "".y+8(FP), R0 0x0020 00032 (test.go:24) CBZW R0, 92 0x0024 00036 (test.go:24) MOVWU "".hi(FP), R1 0x0028 00040 (test.go:24) CMPW R0, R1 0x002c 00044 (test.go:24) BHS 84 0x0030 00048 (test.go:24) MOVWU "".lo+4(FP), R2 0x0034 00052 (test.go:24) ORR R1<<32, R2, R4 0x0038 00056 (test.go:24) UDIV R0, R4, R3 0x003c 00060 (test.go:24) MSUB R3, R4, R0, R4 0x0040 00064 (test.go:24) MOVW R3, "".q+16(FP) 0x0044 00068 (test.go:24) MOVW R4, "".r+20(FP) 0x0048 00072 (test.go:24) MOVD -8(RSP), R29 0x004c 00076 (test.go:24) MOVD.P 16(RSP), R30 0x0050 00080 (test.go:24) RET (R30) UREM instruction in the previous assembly code will be converted to UDIV and MSUB instructions on arm64. However the UDIV instruction in UREM is unnecessary, because it's a duplicate of the previous UDIV. This CL adds a rule to have this extra UDIV instruction removed by CSE. Change-Id: Ie2508784320020b2de022806d09f75a7871bb3d7 Reviewed-on: https://go-review.googlesource.com/c/159577 Reviewed-by: Keith Randall <khr@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com> Run-TryBot: Bryan C. Mills <bcmills@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org>
2019-03-01  cmd/compile: add an optimization rule for math/bits.ReverseBytes16 on arm64  (erifan01)
On amd64 ReverseBytes16 is lowered to a rotate instruction. However arm64 doesn't have 16-bit rotate instruction, but has a REV16W instruction which can be used for ReverseBytes16. This CL adds a rule to turn the patterns like (x<<8) | (x>>8) (the type of x is uint16, and "|" can also be "^" or "+") to a REV16W instruction. Code: func reverseBytes16(i uint16) uint16 { return bits.ReverseBytes16(i) } Before: 0x0004 00004 (test.go:6) MOVHU "".i(FP), R0 0x0008 00008 ($GOROOT/src/math/bits/bits.go:262) UBFX $8, R0, $8, R1 0x000c 00012 ($GOROOT/src/math/bits/bits.go:262) ORR R0<<8, R1, R0 0x0010 00016 (test.go:6) MOVH R0, "".~r1+8(FP) 0x0014 00020 (test.go:6) RET (R30) After: 0x0000 00000 (test.go:6) MOVHU "".i(FP), R0 0x0004 00004 (test.go:6) REV16W R0, R0 0x0008 00008 (test.go:6) MOVH R0, "".~r1+8(FP) 0x000c 00012 (test.go:6) RET (R30) Benchmarks: name old time/op new time/op delta ReverseBytes-224 1.000000ns +- 0% 1.000000ns +- 0% ~ (all equal) ReverseBytes16-224 1.500000ns +- 0% 1.000000ns +- 0% -33.33% (p=0.000 n=9+10) ReverseBytes32-224 1.000000ns +- 0% 1.000000ns +- 0% ~ (all equal) ReverseBytes64-224 1.000000ns +- 0% 1.000000ns +- 0% ~ (all equal) Change-Id: I87cd41b2d8e549bf39c601f185d5775bd42d739c Reviewed-on: https://go-review.googlesource.com/c/157757 Reviewed-by: Cherry Zhang <cherryyz@google.com> Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org>
2019-02-27  cmd/compile: optimize math/bits Len32 intrinsic on arm64  (erifan01)
Arm64 has a 32-bit CLZ instruction CLZW, which can be used for intrinsic Len32. Function LeadingZeros32 calls Len32, with this change, the assembly code of LeadingZeros32 becomes more concise. Go code: func f32(x uint32) { z = bits.LeadingZeros32(x) } Before: "".f32 STEXT size=32 args=0x8 locals=0x0 leaf 0x0000 00000 (test.go:7) TEXT "".f32(SB), LEAF|NOFRAME|ABIInternal, $0-8 0x0004 00004 (test.go:7) MOVWU "".x(FP), R0 0x0008 00008 ($GOROOT/src/math/bits/bits.go:30) CLZ R0, R0 0x000c 00012 ($GOROOT/src/math/bits/bits.go:30) SUB $32, R0, R0 0x0010 00016 (test.go:7) MOVD R0, "".z(SB) 0x001c 00028 (test.go:7) RET (R30) After: "".f32 STEXT size=32 args=0x8 locals=0x0 leaf 0x0000 00000 (test.go:7) TEXT "".f32(SB), LEAF|NOFRAME|ABIInternal, $0-8 0x0004 00004 (test.go:7) MOVWU "".x(FP), R0 0x0008 00008 ($GOROOT/src/math/bits/bits.go:30) CLZW R0, R0 0x000c 00012 (test.go:7) MOVD R0, "".z(SB) 0x0018 00024 (test.go:7) RET (R30) Benchmarks: name old time/op new time/op delta LeadingZeros-8 2.53ns ± 0% 2.55ns ± 0% +0.67% (p=0.000 n=10+10) LeadingZeros8-8 3.56ns ± 0% 3.56ns ± 0% ~ (all equal) LeadingZeros16-8 3.55ns ± 0% 3.56ns ± 0% ~ (p=0.465 n=10+10) LeadingZeros32-8 3.55ns ± 0% 2.96ns ± 0% -16.71% (p=0.000 n=10+7) LeadingZeros64-8 2.53ns ± 0% 2.54ns ± 0% ~ (p=0.059 n=8+10) Change-Id: Ie5666bb82909e341060e02ffd4e86c0e5d67e90a Reviewed-on: https://go-review.googlesource.com/c/157000 Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com>
2018-11-27  cmd/compile: intrinsify math/bits.Div on amd64  (Brian Kessler)

Note that the intrinsic implementation panics separately for overflow and divide by zero, which matches the behavior of the pure go implementation. There is a modest performance improvement after intrinsic implementation.

name     old time/op  new time/op  delta
Div-4    53.0ns ± 1%  47.0ns ± 0%  -11.28%  (p=0.008 n=5+5)
Div32-4  18.4ns ± 0%  18.5ns ± 1%  ~        (p=0.444 n=5+5)
Div64-4  53.3ns ± 0%  47.5ns ± 4%  -10.77%  (p=0.008 n=5+5)

Updates #28273

Change-Id: Ic1688ecc0964acace2e91bf44ef16f5fb6b6bc82
Reviewed-on: https://go-review.googlesource.com/c/144378
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
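For reference, bits.Div64 divides a 128-bit numerator (hi, lo) by a 64-bit divisor and panics on divide-by-zero or when the quotient overflows (hi >= y); minimal usage (the wrapper name is just for illustration):

package main

import (
	"fmt"
	"math/bits"
)

// divmod128by64 wraps bits.Div64; callers must ensure y != 0 and hi < y,
// otherwise the call panics, matching both the intrinsic and the pure Go path.
func divmod128by64(hi, lo, y uint64) (q, r uint64) {
	return bits.Div64(hi, lo, y)
}

func main() {
	q, r := divmod128by64(0, 100, 7)
	fmt.Println(q, r) // 14 2
}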
2018-11-14  runtime: make processor capability variable naming platform specific  (Martin Möhrmann)

The current support_XXX variables are specific for the amd64 and 386 platforms.

Prefix processor capability variables by architecture to have a consistent naming scheme and avoid reuse of the existing variables for new platforms.

This also aligns naming of runtime variables closer with internal/cpu processor capability variable names.

Change-Id: I3eabb29a03874678851376185d3a62e73c1aff1d
Reviewed-on: https://go-review.googlesource.com/c/91435
Run-TryBot: Martin Möhrmann <martisch@uos.de>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
2018-10-25  cmd/compile: intrinsify math/bits.Sub on amd64  (Keith Randall)

name             old time/op  new time/op  delta
Sub-8            1.12ns ± 1%  1.17ns ± 1%  +5.20%   (p=0.008 n=5+5)
Sub32-8          1.11ns ± 0%  1.11ns ± 0%  ~        (all samples are equal)
Sub64-8          1.12ns ± 0%  1.18ns ± 1%  +5.00%   (p=0.016 n=4+5)
Sub64multiple-8  4.10ns ± 1%  0.86ns ± 1%  -78.93%  (p=0.008 n=5+5)

Fixes #28273

Change-Id: Ibcb6f2fd32d987c3bcbae4f4cd9d335a3de98548
Reviewed-on: https://go-review.googlesource.com/c/144258
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2018-10-25  cmd/compile: intrinsify math/bits.Add on amd64  (Keith Randall)

name             old time/op  new time/op  delta
Add-8            1.11ns ± 0%  1.18ns ± 0%  +6.31%   (p=0.029 n=4+4)
Add32-8          1.02ns ± 0%  1.02ns ± 1%  ~        (p=0.333 n=4+5)
Add64-8          1.11ns ± 1%  1.17ns ± 0%  +5.79%   (p=0.008 n=5+5)
Add64multiple-8  4.35ns ± 1%  0.86ns ± 0%  -80.22%  (p=0.000 n=5+4)

The individual ops are a bit slower (but still very fast). Using the ops in carry chains is very fast.

Update #28273

Change-Id: Id975f76df2b930abf0e412911d327b6c5b1befe5
Reviewed-on: https://go-review.googlesource.com/c/144257
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
2018-10-16  test/codegen: enable more tests for ppc64/ppc64le  (Lynn Boger)

Adding cases for ppc64,ppc64le to the codegen tests where appropriate.

Change-Id: Idf8cbe88a4ab4406a4ef1ea777bd15a58b68f3ed
Reviewed-on: https://go-review.googlesource.com/c/142557
Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
2018-10-15  test/codegen: test ppc64 TrailingZeros, OnesCount codegen  (Alberto Donizetti)

This change adds codegen tests for the intrinsification on ppc64 of the OnesCount{64,32,16,8} and TrailingZeros{64,32,16,8} math/bits functions.

Change-Id: Id3364921fbd18316850e15c8c71330c906187fdb
Reviewed-on: https://go-review.googlesource.com/c/141897
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
2018-10-02  cmd/compile: intrinsify math/bits.Mul on ppc64x  (Carlos Eduardo Seo)

Add SSA rules to intrinsify Mul/Mul64 on ppc64x.

benchmark          old ns/op  new ns/op  delta
BenchmarkMul-40    8.80       0.93       -89.43%
BenchmarkMul32-40  1.39       1.39       +0.00%
BenchmarkMul64-40  5.39       0.93       -82.75%

Updates #24813

Change-Id: I6e95bfbe976a2278bd17799df184a7fbc0e57829
Reviewed-on: https://go-review.googlesource.com/138917
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
2018-09-26  cmd/compile: intrinsify math/bits.Mul  (Brian Kessler)
Add SSA rules to intrinsify Mul/Mul64 (AMD64 and ARM64). SSA rules for other functions and architectures are left as a future optimization. Benchmark results on AMD64/ARM64 before and after SSA implementation are below. amd64 name old time/op new time/op delta Add-4 1.78ns ± 0% 1.85ns ±12% ~ (p=0.397 n=4+5) Add32-4 1.71ns ± 1% 1.70ns ± 0% ~ (p=0.683 n=5+5) Add64-4 1.80ns ± 2% 1.77ns ± 0% -1.22% (p=0.048 n=5+5) Sub-4 1.78ns ± 0% 1.78ns ± 0% ~ (all equal) Sub32-4 1.78ns ± 1% 1.78ns ± 0% ~ (p=1.000 n=5+5) Sub64-4 1.78ns ± 1% 1.78ns ± 0% ~ (p=0.968 n=5+4) Mul-4 11.5ns ± 1% 1.8ns ± 2% -84.39% (p=0.008 n=5+5) Mul32-4 1.39ns ± 0% 1.38ns ± 3% ~ (p=0.175 n=5+5) Mul64-4 6.85ns ± 1% 1.78ns ± 1% -73.97% (p=0.008 n=5+5) Div-4 57.1ns ± 1% 56.7ns ± 0% ~ (p=0.087 n=5+5) Div32-4 18.0ns ± 0% 18.0ns ± 0% ~ (all equal) Div64-4 56.4ns ±10% 53.6ns ± 1% ~ (p=0.071 n=5+5) arm64 name old time/op new time/op delta Add-96 5.51ns ± 0% 5.51ns ± 0% ~ (all equal) Add32-96 5.51ns ± 0% 5.51ns ± 0% ~ (all equal) Add64-96 5.52ns ± 0% 5.51ns ± 0% ~ (p=0.444 n=5+5) Sub-96 5.51ns ± 0% 5.51ns ± 0% ~ (all equal) Sub32-96 5.51ns ± 0% 5.51ns ± 0% ~ (all equal) Sub64-96 5.51ns ± 0% 5.51ns ± 0% ~ (all equal) Mul-96 34.6ns ± 0% 5.0ns ± 0% -85.52% (p=0.008 n=5+5) Mul32-96 4.51ns ± 0% 4.51ns ± 0% ~ (all equal) Mul64-96 21.1ns ± 0% 5.0ns ± 0% -76.26% (p=0.008 n=5+5) Div-96 64.7ns ± 0% 64.7ns ± 0% ~ (all equal) Div32-96 17.0ns ± 0% 17.0ns ± 0% ~ (all equal) Div64-96 53.1ns ± 0% 53.1ns ± 0% ~ (all equal) Updates #24813 Change-Id: I9bda6d2102f65cae3d436a2087b47ed8bafeb068 Reviewed-on: https://go-review.googlesource.com/129415 Run-TryBot: Keith Randall <khr@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>