path: root/src/cmd/compile/internal/ssa/_gen/LOONG64Ops.go
Age  Commit message  Author
2026-01-28  cmd/compile: (loong64) optimize float32(abs|sqrt64(float64(x)))  Xiaolin Zhao
Ref: #733621 Updates #75463 Change-Id: Idd8821d1713754097a2fe83a050c25d9ec5b17eb Reviewed-on: https://go-review.googlesource.com/c/go/+/735540 Reviewed-by: abner chenc <chenguoqi@loongson.cn> Reviewed-by: Carlos Amedee <carlos@golang.org> Reviewed-by: David Chase <drchase@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
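As an illustration only, the source-level shapes that the commit title appears to refer to can be sketched in Go as follows. The helper names `abs32` and `sqrt32` are hypothetical, and whether a given compiler version folds the two conversions into single-precision instructions on loong64 is an assumption drawn from the title, not something this log spells out.

```go
package main

import (
	"fmt"
	"math"
)

// abs32 is a hypothetical helper: a float32 is widened to float64,
// passed through math.Abs, and narrowed back to float32.
func abs32(x float32) float32 {
	return float32(math.Abs(float64(x)))
}

// sqrt32 is a hypothetical helper with the same widen/compute/narrow shape,
// using math.Sqrt.
func sqrt32(x float32) float32 {
	return float32(math.Sqrt(float64(x)))
}

func main() {
	fmt.Println(abs32(-1.5), sqrt32(9)) // prints "1.5 3"
}
```

The results are the same with or without the optimization; only the generated instructions would differ.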
2025-11-23  cmd/compile: use 32x32->64 multiplies on loong64  Xiaolin Zhao
Gets rid of some sign extensions, like arm64. Change-Id: I9fc37e15a82718bfcf53db8cab0c4e7baaa0a747 Reviewed-on: https://go-review.googlesource.com/c/go/+/721522 Reviewed-by: Keith Randall <khr@google.com> Reviewed-by: Keith Randall <khr@golang.org> Reviewed-by: Meidan Li <limeidan@loongson.cn> Reviewed-by: Mark Freeman <markfreeman@google.com> Reviewed-by: abner chenc <chenguoqi@loongson.cn> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
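A minimal sketch of the kind of Go code that a 32x32->64 multiply can serve: both operands are 32-bit values that are widened before a 64-bit multiply. The function names `mulWide` and `mulWideU` are hypothetical, and which machine instructions the compiler actually selects is an assumption here.

```go
package main

import "fmt"

// mulWide (hypothetical name) multiplies two int32 values into an int64.
// The widening conversions are sign extensions at the source level.
func mulWide(a, b int32) int64 {
	return int64(a) * int64(b)
}

// mulWideU (hypothetical name) is the unsigned counterpart, using zero extension.
func mulWideU(a, b uint32) uint64 {
	return uint64(a) * uint64(b)
}

func main() {
	fmt.Println(mulWide(-2, 3))          // prints -6
	fmt.Println(mulWideU(0xffffffff, 2)) // prints 8589934590
}
```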
2025-11-19  cmd/compile: Implement LoweredZeroLoop with LSX Instruction on loong64  Guoqi Chen
goos: linux
goarch: loong64
pkg: runtime
cpu: Loongson-3A6000 @ 2500.00MHz
                    | old.txt       | new.txt                               |
                    | sec/op        | sec/op         vs base                |
ClearFat256           6.406n ± 0%     3.329n ± 1%    -48.03% (p=0.000 n=10)
ClearFat512           12.810n ± 0%    7.607n ± 0%    -40.62% (p=0.000 n=10)
ClearFat1024          25.62n ± 0%     14.01n ± 0%    -45.32% (p=0.000 n=10)
ClearFat1032          26.02n ± 0%     14.28n ± 0%    -45.14% (p=0.000 n=10)
ClearFat1040          26.02n ± 0%     14.41n ± 0%    -44.62% (p=0.000 n=10)
MemclrKnownSize192    4.804n ± 0%     2.827n ± 0%    -41.15% (p=0.000 n=10)
MemclrKnownSize248    6.561n ± 0%     4.371n ± 0%    -33.38% (p=0.000 n=10)
MemclrKnownSize256    6.406n ± 0%     3.335n ± 0%    -47.94% (p=0.000 n=10)
geomean               11.41n          6.453n         -43.45%

goos: linux
goarch: loong64
pkg: runtime
cpu: Loongson-3C5000 @ 2200.00MHz
                    | old.txt       | new.txt                               |
                    | sec/op        | sec/op         vs base                |
ClearFat256           14.570n ± 0%    7.284n ± 0%    -50.01% (p=0.000 n=10)
ClearFat512           29.13n ± 0%     14.57n ± 0%    -49.98% (p=0.000 n=10)
ClearFat1024          58.26n ± 0%     29.15n ± 0%    -49.97% (p=0.000 n=10)
ClearFat1032          58.73n ± 0%     29.15n ± 0%    -50.36% (p=0.000 n=10)
ClearFat1040          59.18n ± 0%     29.26n ± 0%    -50.56% (p=0.000 n=10)
MemclrKnownSize192    10.930n ± 0%    5.466n ± 0%    -49.99% (p=0.000 n=10)
MemclrKnownSize248    14.110n ± 0%    6.772n ± 0%    -52.01% (p=0.000 n=10)
MemclrKnownSize256    14.570n ± 0%    7.285n ± 0%    -50.00% (p=0.000 n=10)
geomean               25.75n          12.78n         -50.36%

Change-Id: I88d7b6ae2f6fc3f095979f24fb83ff42a9d2d42e Reviewed-on: https://go-review.googlesource.com/c/go/+/720940 Reviewed-by: Meidan Li <limeidan@loongson.cn> Reviewed-by: Mark Freeman <markfreeman@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: sophie zhao <zhaoxiaolin@loongson.cn> Reviewed-by: Keith Randall <khr@golang.org> Reviewed-by: Keith Randall <khr@google.com>
2025-10-09  cmd/compile: declare no output register for loong64 LoweredAtomic{And,Or}32 ops  WANG Xuerui
The ICE seen on loong64 while compiling the `(*gcWork).tryStealSpan` function was due to a `LoweredAtomicAnd32` op (inlined from the `(pMask).clear` implementation) being incorrectly assigned an output register when it should not have been. Because the op is of mem type, it has needRegister() == false; hence in the shuffle phase of regalloc, its bogus output register has no associated `orig` value recorded.

The bug was introduced in CL 482756, but only recently exposed by CL 696035. Since the old-style atomic ops need no return value (this is even documented beside the loong64 ssa op definition), just fix the register info for both.

While at it, add a note in the ssa op definition file about the architectural necessity of resultNotInArgs for loong64 atomic ops, because the practice is not seen in several other arches I have checked.

Updates #75776 Change-Id: I087f51b8a2825d7b00fc3965b0afcc8b02cad277 Reviewed-on: https://go-review.googlesource.com/c/go/+/710475 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Michael Pratt <mpratt@google.com> Reviewed-by: abner chenc <chenguoqi@loongson.cn> Reviewed-by: Cherry Mui <cherryyz@google.com>
2025-09-27  cmd/compile: implement jump table on loong64  limeidan
Following CL 357330, use jump tables on Loong64.

goos: linux
goarch: loong64
pkg: cmd/compile/internal/test
cpu: Loongson-3A6000-HV @ 2500.00MHz
                                  │ old           │ new                                   │
                                  │ sec/op        │ sec/op         vs base                │
Switch8Predictable                  2.352n ± 0%     2.101n ± 0%    -10.65% (p=0.000 n=10)
Switch8Unpredictable                11.99n ± 0%     10.25n ± 0%    -14.51% (p=0.000 n=10)
Switch32Predictable                 3.153n ± 0%     1.887n ± 1%    -40.14% (p=0.000 n=10)
Switch32Unpredictable               12.47n ± 0%     10.22n ± 0%    -18.00% (p=0.000 n=10)
SwitchStringPredictable             3.162n ± 0%     3.352n ± 0%    +6.01% (p=0.000 n=10)
SwitchStringUnpredictable           14.70n ± 0%     13.31n ± 0%    -9.46% (p=0.000 n=10)
SwitchTypePredictable               3.702n ± 0%     2.201n ± 0%    -40.55% (p=0.000 n=10)
SwitchTypeUnpredictable             16.18n ± 0%     14.48n ± 0%    -10.51% (p=0.000 n=10)
SwitchInterfaceTypePredictable      7.654n ± 0%     9.680n ± 0%    +26.47% (p=0.000 n=10)
SwitchInterfaceTypeUnpredictable    22.04n ± 0%     22.44n ± 0%    +1.81% (p=0.000 n=10)
geomean                             7.441n          6.469n         -13.07%

Change-Id: Id6f30fa73349c60fac17670084daee56973a955f Reviewed-on: https://go-review.googlesource.com/c/go/+/705396 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Junyang Shao <shaojunyang@google.com> Reviewed-by: Michael Knyszek <mknyszek@google.com> Reviewed-by: abner chenc <chenguoqi@loongson.cn>
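For readers unfamiliar with the term, a dense integer switch such as the one below is the typical candidate for a jump table. This is only a sketch: the function name `dense` is hypothetical, and whether the compiler emits a jump table for any particular switch depends on its own heuristics, which this log does not describe.

```go
package main

import "fmt"

// dense (hypothetical name) switches over eight consecutive integer cases.
// A compiler may lower such a switch to one bounds check plus an indexed jump
// instead of a chain of comparisons.
func dense(x int) int {
	switch x {
	case 0:
		return 10
	case 1:
		return 21
	case 2:
		return 32
	case 3:
		return 43
	case 4:
		return 54
	case 5:
		return 65
	case 6:
		return 76
	case 7:
		return 87
	default:
		return -1
	}
}

func main() {
	fmt.Println(dense(3), dense(99)) // prints "43 -1"
}
```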
2025-09-17  cmd/compile: fix the issue of shift amount exceeding the valid range  Xiaolin Zhao
Fixes #75479 Change-Id: I362d3e49090e94f91a840dd5a475978b59222a00 Reviewed-on: https://go-review.googlesource.com/c/go/+/704135 Reviewed-by: Mark Freeman <markfreeman@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Michael Knyszek <mknyszek@google.com> Reviewed-by: Meidan Li <limeidan@loongson.cn> Reviewed-by: abner chenc <chenguoqi@loongson.cn>
2025-09-05  cmd/compile: simplify specific addition operations using the ADDV16 instruction  Xiaolin Zhao
On loong64, the addi.d instruction can only directly handle 12-bit immediate numbers. If a larger immediate number needs to be processed, it must first be placed in a register, and then the add.d instruction is used to complete the processing of the larger immediate number. If a larger immediate number c satisfies is32Bit(c) && c&0xffff == 0, then the ADDV16 instruction can be used to complete the addition operation. Removes 164 instructions from the go binary on loong64. Change-Id: I404de93cc4eaaa12fe424f5a0d61b03231215d1a Reviewed-on: https://go-review.googlesource.com/c/go/+/700536 Reviewed-by: Meidan Li <limeidan@loongson.cn> Reviewed-by: Keith Randall <khr@golang.org> Reviewed-by: Keith Randall <khr@google.com> Auto-Submit: Michael Pratt <mpratt@google.com> Reviewed-by: abner chenc <chenguoqi@loongson.cn> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Michael Pratt <mpratt@google.com>
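The condition quoted above can be restated as a small Go predicate. This is a sketch for illustration: `is32Bit` is re-implemented locally here rather than taken from the compiler, and `fitsImm12` and `fitsADDV16` are hypothetical names, not compiler functions.

```go
package main

import "fmt"

// is32Bit reports whether n fits in a signed 32-bit integer.
// Local re-implementation for this example.
func is32Bit(n int64) bool {
	return n == int64(int32(n))
}

// fitsImm12 (hypothetical) reports whether c fits a signed 12-bit immediate,
// the range the message says addi.d handles directly.
func fitsImm12(c int64) bool {
	return -2048 <= c && c <= 2047
}

// fitsADDV16 (hypothetical) encodes the condition from the commit message:
// c is a 32-bit value whose low 16 bits are all zero.
func fitsADDV16(c int64) bool {
	return is32Bit(c) && c&0xffff == 0
}

func main() {
	for _, c := range []int64{100, 0x10000, 0x12345, -0x10000, 1 << 32} {
		fmt.Printf("%#x imm12=%v addv16=%v\n", c, fitsImm12(c), fitsADDV16(c))
	}
}
```

Constants such as 0x10000 and -0x10000 satisfy the second predicate, while 0x12345 (non-zero low bits) and 1<<32 (not a 32-bit value) do not.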
2025-09-04  runtime, cmd/compile, cmd/internal/obj: remove duff support for loong64  limeidan
Change-Id: I44d6452933c8010f7dfbf821a32053f9d1cf151e Reviewed-on: https://go-review.googlesource.com/c/go/+/700096 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Keith Randall <khr@golang.org> Reviewed-by: sophie zhao <zhaoxiaolin@loongson.cn> Reviewed-by: abner chenc <chenguoqi@loongson.cn> Reviewed-by: Michael Pratt <mpratt@google.com> Reviewed-by: Keith Randall <khr@google.com> Auto-Submit: Michael Pratt <mpratt@google.com>
2025-09-03  cmd/compile: use generated loops instead of DUFFCOPY on loong64  limeidan
Change-Id: If9da2b5681e5d05d7c3d51f003f1fe662d3feaec Reviewed-on: https://go-review.googlesource.com/c/go/+/699855 Reviewed-by: Keith Randall <khr@golang.org> Reviewed-by: sophie zhao <zhaoxiaolin@loongson.cn> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: abner chenc <chenguoqi@loongson.cn> Reviewed-by: Keith Randall <khr@google.com> Auto-Submit: Michael Pratt <mpratt@google.com> Reviewed-by: Michael Pratt <mpratt@google.com>
2025-09-03  cmd/compile: simplify the support for 32bit high multiply on loong64  Xiaolin Zhao
Removes 152 instructions from the Go binary on loong64. Change-Id: Icf8ead4f4ca965f51add85ac5e45c3cca8916401 Reviewed-on: https://go-review.googlesource.com/c/go/+/700335 Reviewed-by: Keith Randall <khr@google.com> Auto-Submit: Michael Pratt <mpratt@google.com> Reviewed-by: Michael Pratt <mpratt@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Meidan Li <limeidan@loongson.cn> Reviewed-by: abner chenc <chenguoqi@loongson.cn>
2025-08-31  cmd/compile: use generated loops instead of DUFFZERO on loong64  limeidan
Change-Id: Id43ee4353d4bac96627f8b0f54545cdd3d2a1d1b Reviewed-on: https://go-review.googlesource.com/c/go/+/699695 Reviewed-by: Cherry Mui <cherryyz@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Carlos Amedee <carlos@golang.org> Reviewed-by: abner chenc <chenguoqi@loongson.cn>
2025-08-21  cmd/compile: use zero register instead of specialized *zero instructions on loong64  limeidan
Refer to CL 633075, loong64 has a zero(R0) register that can be used to do this. Change-Id: I846c6bdfcfd6dbfa18338afc13e34e350580ead4 Reviewed-on: https://go-review.googlesource.com/c/go/+/693876 Reviewed-by: Carlos Amedee <carlos@golang.org> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Keith Randall <khr@google.com> Reviewed-by: Keith Randall <khr@golang.org> Auto-Submit: Keith Randall <khr@golang.org>
2025-08-21  cmd/compile: optimize some patterns into revb2h/revb4h instruction on loong64  Xiaolin Zhao
Pattern1: (the type of c is uint16)
    c>>8 | c<<8
To:
    revb2h c

Pattern2: (the type of c is uint32)
    (c & 0xff00ff00)>>8 | (c & 0x00ff00ff)<<8
To:
    revb2h c

Pattern3: (the type of c is uint64)
    (c & 0xff00ff00ff00ff00)>>8 | (c & 0x00ff00ff00ff00ff)<<8
To:
    revb4h c

Change-Id: Ic6231a3f476cbacbea4bd00e31193d107cb86cda Reviewed-on: https://go-review.googlesource.com/c/go/+/696335 Reviewed-by: Meidan Li <limeidan@loongson.cn> Reviewed-by: Keith Randall <khr@google.com> Auto-Submit: Keith Randall <khr@golang.org> Reviewed-by: Carlos Amedee <carlos@golang.org> Reviewed-by: Keith Randall <khr@golang.org> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
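The three patterns listed above can be written as ordinary Go functions, which makes it easy to see that each one swaps the two bytes inside every 16-bit lane. The function names below are hypothetical; only the expressions themselves come from the commit message.

```go
package main

import "fmt"

// swapBytes16 is Pattern1: swap the two bytes of a uint16.
func swapBytes16(c uint16) uint16 {
	return c>>8 | c<<8
}

// swapBytes32 is Pattern2: swap the bytes within each of the two halfwords.
func swapBytes32(c uint32) uint32 {
	return (c&0xff00ff00)>>8 | (c&0x00ff00ff)<<8
}

// swapBytes64 is Pattern3: swap the bytes within each of the four halfwords.
func swapBytes64(c uint64) uint64 {
	return (c&0xff00ff00ff00ff00)>>8 | (c&0x00ff00ff00ff00ff)<<8
}

func main() {
	fmt.Printf("%#x\n", swapBytes16(0x1234))             // prints 0x3412
	fmt.Printf("%#x\n", swapBytes32(0x11223344))         // prints 0x22114433
	fmt.Printf("%#x\n", swapBytes64(0x1122334455667788)) // prints 0x2211443366558877
}
```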
2025-08-12  cmd/compile/internal: optimize multiplication use new operation 'ADDshiftLLV' on loong64  limeidan
goos: linux
goarch: loong64
pkg: cmd/compile/internal/test
cpu: Loongson-3A6000-HV @ 2500.00MHz
                   │ old           │ new                                   │
                   │ sec/op        │ sec/op         vs base                │
MulconstI32/3        0.8004n ± 0%    0.4247n ± 2%   -46.94% (p=0.000 n=10)
MulconstI32/5        0.8005n ± 0%    0.4256n ± 1%   -46.83% (p=0.000 n=10)
MulconstI32/12       1.2010n ± 0%    0.8005n ± 0%   -33.35% (p=0.000 n=10)
MulconstI32/120      0.8090n ± 0%    0.8067n ± 0%   -0.28% (p=0.007 n=10)
MulconstI32/-120     0.8109n ± 0%    0.8072n ± 0%   -0.47% (p=0.000 n=10)
MulconstI32/65537    0.8004n ± 0%    0.8004n ± 0%   ~ (p=1.000 n=10)
MulconstI32/65538    0.8005n ± 0%    0.8005n ± 0%   ~ (p=0.265 n=10)
MulconstI64/3        0.8005n ± 0%    0.4241n ± 1%   -47.02% (p=0.000 n=10)
MulconstI64/5        0.8004n ± 0%    0.4249n ± 1%   -46.91% (p=0.000 n=10)
MulconstI64/12       1.2010n ± 0%    0.8004n ± 0%   -33.36% (p=0.000 n=10)
MulconstI64/120      0.8005n ± 0%    0.8005n ± 0%   ~ (p=0.635 n=10)
MulconstI64/-120     0.8005n ± 0%    0.8005n ± 0%   ~ (p=0.837 n=10)
MulconstI64/65537    0.8005n ± 0%    0.8005n ± 0%   ~ (p=0.837 n=10)
MulconstI64/65538    0.8096n ± 0%    0.8004n ± 0%   -1.14% (p=0.000 n=10)
MulconstU32/3        0.8004n ± 0%    0.4263n ± 1%   -46.75% (p=0.000 n=10)
MulconstU32/5        0.8005n ± 0%    0.4262n ± 1%   -46.76% (p=0.000 n=10)
MulconstU32/12       1.2010n ± 0%    0.8005n ± 0%   -33.35% (p=0.000 n=10)
MulconstU32/120      0.8105n ± 0%    0.8096n ± 0%   ~ (p=0.183 n=10)
MulconstU32/65537    0.8004n ± 0%    0.8004n ± 0%   ~ (p=1.000 n=10)
MulconstU32/65538    0.8005n ± 0%    0.8005n ± 0%   ~ (p=1.000 n=10)
MulconstU64/3        0.8004n ± 0%    0.4265n ± 4%   -46.71% (p=0.000 n=10)
MulconstU64/5        0.8004n ± 0%    0.4256n ± 0%   -46.82% (p=0.000 n=10)
MulconstU64/12       1.2010n ± 0%    0.8004n ± 0%   -33.36% (p=0.000 n=10)
MulconstU64/120      0.8005n ± 0%    0.8005n ± 0%   ~ (p=0.387 n=10)
MulconstU64/65537    0.8005n ± 0%    0.8005n ± 0%   ~ (p=0.265 n=10)
MulconstU64/65538    0.8080n ± 0%    0.8004n ± 0%   -0.93% (p=0.000 n=10)
geomean              0.8539n         0.6597n        -22.74%

Change-Id: Ie33e88985d7639f481bbba540bc917b9f185c357 Reviewed-on: https://go-review.googlesource.com/c/go/+/693855 Reviewed-by: Dmitri Shuralyov <dmitshur@google.com> Reviewed-by: Keith Randall <khr@google.com> Reviewed-by: sophie zhao <zhaoxiaolin@loongson.cn> Reviewed-by: abner chenc <chenguoqi@loongson.cn> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
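The benchmark rows for multipliers 3, 5, and 12 are consistent with the usual shift-and-add strength reduction. The sketch below shows those identities in Go. It assumes an add-with-shifted-operand form (x + y<<k); the exact decomposition the compiler chooses is not stated in this log, and the function names are hypothetical.

```go
package main

import "fmt"

// mul3 computes x*3 as x + (x << 1): one add with a shifted operand.
func mul3(x int64) int64 { return x + x<<1 }

// mul5 computes x*5 as x + (x << 2).
func mul5(x int64) int64 { return x + x<<2 }

// mul12 computes x*12 as (x + (x << 1)) << 2, i.e. (x*3)*4.
func mul12(x int64) int64 { return (x + x<<1) << 2 }

func main() {
	for _, x := range []int64{-7, 0, 1, 12345} {
		fmt.Println(mul3(x) == x*3, mul5(x) == x*5, mul12(x) == x*12)
	}
}
```

Each line printed is "true true true", since the identities hold for all int64 values under Go's wrapping arithmetic.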
2025-08-10  cmd/compile/internal/ssa: optimise more branches with SGTconst/SGTUconst on loong64  Xiaolin Zhao
Add branches to convert EQZ/NEZ into more optimal branch conditions. This reduces 720 instructions from the go toolchain binary on loong64.

file          before     after      Δ       %
asm           555306     555082     -224    -0.0403%
cgo           481814     481742     -72     -0.0149%
compile       2475686    2475710    +24     +0.0010%
cover         516854     516770     -84     -0.0163%
link          702566     702530     -36     -0.0051%
preprofile    238612     238548     -64     -0.0268%
vet           793140     793060     -80     -0.0101%
go            1573466    1573346    -120    -0.0076%
gofmt         320560     320496     -64     -0.0200%
total         7658004    7657284    -720    -0.0094%

Additionally, rename EQ/NE to EQZ/NEZ to enhance readability.

Change-Id: Ibc876bc8b8d4e81d5c3aaf0b74b60419f3c771b1 Reviewed-on: https://go-review.googlesource.com/c/go/+/693455 Reviewed-by: abner chenc <chenguoqi@loongson.cn> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Dmitri Shuralyov <dmitshur@google.com> Reviewed-by: Keith Randall <khr@google.com>
2025-08-07  cmd/compile/internal/ssa: fix typo in LOONG64Ops.go comment  Xiaolin Zhao
Change-Id: I680bae7fc1a26c1f249ab833fa8d41e9387b2d50 Reviewed-on: https://go-review.googlesource.com/c/go/+/693456 Reviewed-by: Dmitri Shuralyov <dmitshur@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Mark Freeman <markfreeman@google.com> Reviewed-by: abner chenc <chenguoqi@loongson.cn>
2025-07-30  cmd/compile: move loong64 over to new bounds check strategy  Keith Randall
Change-Id: I5dec33d10d16a5d5c0dc7231cd1f764a6d1d7598 Reviewed-on: https://go-review.googlesource.com/c/go/+/682399 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: abner chenc <chenguoqi@loongson.cn> Reviewed-by: Keith Randall <khr@google.com> Reviewed-by: Michael Knyszek <mknyszek@google.com> Reviewed-by: sophie zhao <zhaoxiaolin@loongson.cn>
2025-05-21  cmd/compile: add rules about ORN and ANDN  Xiaolin Zhao
Reduce the number of go toolchain instructions on loong64 as follows.

file          before      after       Δ        %
addr2line     279880      279776      -104     -0.0372%
asm           556638      556410      -228     -0.0410%
buildid       272272      272072      -200     -0.0735%
cgo           481522      481318      -204     -0.0424%
compile       2457788     2457580     -208     -0.0085%
covdata       323384      323280      -104     -0.0322%
cover         518450      518234      -216     -0.0417%
dist          340790      340686      -104     -0.0305%
distpack      282456      282252      -204     -0.0722%
doc           789932      789688      -244     -0.0309%
fix           324332      324228      -104     -0.0321%
link          704622      704390      -232     -0.0329%
nm            277132      277028      -104     -0.0375%
objdump       507862      507758      -104     -0.0205%
pack          221774      221674      -100     -0.0451%
pprof         1469816     1469552     -264     -0.0180%
test2json     254836      254732      -104     -0.0408%
trace         1100002     1099738     -264     -0.0240%
vet           781078      780874      -204     -0.0261%
go            1529116     1528848     -268     -0.0175%
gofmt         318556      318448      -108     -0.0339%
total         13792238    13788566    -3672    -0.0266%

Change-Id: I23fb3ebd41309252c7075e57ea7094e79f8c4fef Reviewed-on: https://go-review.googlesource.com/c/go/+/674335 Reviewed-by: abner chenc <chenguoqi@loongson.cn> Reviewed-by: Michael Knyszek <mknyszek@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Auto-Submit: abner chenc <chenguoqi@loongson.cn> Reviewed-by: David Chase <drchase@google.com> Reviewed-by: Meidan Li <limeidan@loongson.cn>
2025-05-19  cmd/compile: add prefetch intrinsic support on loong64  Guoqi Chen
This CL enables intrinsic support to emit the following prefetch instructions for loong64 platform:
1. Prefetch - prefetches data from memory address to cache;
2. PrefetchStreamed - prefetches data from memory address, with a hint that this data is being streamed.

Benchmarks picked from go/test/bench/garbage
Parameters tested with:
GOMAXPROCS=8
tree2 -heapsize=1000000000 -cpus=8
tree -n=18
parser
peano

Benchmarks Loongson-3A6000-HV @ 2500.00MHz:
            | bench.old       | bench.new                             |
            | sec/op          | sec/op           vs base              |
Tree2-8       1238.2µ ± 24%     999.9µ ± 453%    ~ (p=0.089 n=10)
Tree-8        277.4m ± 1%       275.5m ± 1%      ~ (p=0.063 n=10)
Parser-8      3.564 ± 0%        3.509 ± 1%       -1.56% (p=0.000 n=10)
Peano-8       39.12m ± 2%       38.85m ± 2%      ~ (p=0.353 n=10)
geomean       83.19m            78.28m           -5.90%

Change-Id: I59e9aa4f609a106d4f70706e6d6d1fe6738ab72a Reviewed-on: https://go-review.googlesource.com/c/go/+/671876 Reviewed-by: Michael Knyszek <mknyszek@google.com> Reviewed-by: Meidan Li <limeidan@loongson.cn> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: sophie zhao <zhaoxiaolin@loongson.cn> Reviewed-by: Cherry Mui <cherryyz@google.com>
2025-03-10  cmd/compile: optimize shifts of int32 and uint32 on loong64  Xiaolin Zhao
goos: linux
goarch: loong64
pkg: math/bits
cpu: Loongson-3A6000-HV @ 2500.00MHz
                  | bench.old     | bench.new                             |
                  | sec/op        | sec/op         vs base                |
LeadingZeros        1.100n ± 1%     1.101n ± 0%    ~ (p=0.566 n=10)
LeadingZeros8       1.501n ± 0%     1.502n ± 0%    +0.07% (p=0.000 n=10)
LeadingZeros16      1.501n ± 0%     1.502n ± 0%    +0.07% (p=0.000 n=10)
LeadingZeros32      1.2010n ± 0%    0.9511n ± 0%   -20.81% (p=0.000 n=10)
LeadingZeros64      1.104n ± 1%     1.119n ± 0%    +1.40% (p=0.000 n=10)
TrailingZeros       0.8137n ± 0%    0.8086n ± 0%   -0.63% (p=0.001 n=10)
TrailingZeros8      1.031n ± 1%     1.031n ± 1%    ~ (p=0.956 n=10)
TrailingZeros16     0.8204n ± 1%    0.8114n ± 0%   -1.11% (p=0.000 n=10)
TrailingZeros32     0.8145n ± 0%    0.8090n ± 0%   -0.68% (p=0.000 n=10)
TrailingZeros64     0.8159n ± 0%    0.8089n ± 1%   -0.86% (p=0.000 n=10)
OnesCount           0.8672n ± 0%    0.8677n ± 0%   +0.06% (p=0.000 n=10)
OnesCount8          0.8005n ± 0%    0.8009n ± 0%   +0.06% (p=0.000 n=10)
OnesCount16         0.9339n ± 0%    0.9344n ± 0%   +0.05% (p=0.000 n=10)
OnesCount32         0.8672n ± 0%    0.8677n ± 0%   +0.06% (p=0.000 n=10)
OnesCount64         1.201n ± 0%     1.201n ± 0%    ~ (p=0.474 n=10)
RotateLeft          0.8005n ± 0%    0.8009n ± 0%   +0.05% (p=0.000 n=10)
RotateLeft8         1.202n ± 0%     1.202n ± 0%    ~ (p=0.210 n=10)
RotateLeft16        0.8050n ± 0%    0.8036n ± 0%   -0.17% (p=0.002 n=10)
RotateLeft32        0.6674n ± 0%    0.6674n ± 0%   ~ (p=1.000 n=10)
RotateLeft64        0.6673n ± 0%    0.6674n ± 0%   ~ (p=0.072 n=10)
Reverse             0.4123n ± 0%    0.4067n ± 1%   -1.37% (p=0.000 n=10)
Reverse8            0.8005n ± 0%    0.8009n ± 0%   +0.05% (p=0.000 n=10)
Reverse16           0.8004n ± 0%    0.8009n ± 0%   +0.06% (p=0.000 n=10)
Reverse32           0.8004n ± 0%    0.8009n ± 0%   +0.06% (p=0.000 n=10)
Reverse64           0.8004n ± 0%    0.8009n ± 0%   +0.06% (p=0.001 n=10)
ReverseBytes        0.4100n ± 1%    0.4057n ± 1%   -1.06% (p=0.002 n=10)
ReverseBytes16      0.8004n ± 0%    0.8009n ± 0%   +0.07% (p=0.000 n=10)
ReverseBytes32      0.8005n ± 0%    0.8009n ± 0%   +0.05% (p=0.000 n=10)
ReverseBytes64      0.8005n ± 0%    0.8009n ± 0%   +0.05% (p=0.000 n=10)
Add                 1.201n ± 0%     1.201n ± 0%    ~ (p=1.000 n=10)
Add32               1.201n ± 0%     1.201n ± 0%    ~ (p=0.474 n=10)
Add64               1.201n ± 0%     1.201n ± 0%    ~ (p=1.000 n=10)
Add64multiple       1.831n ± 0%     1.832n ± 0%    ~ (p=1.000 n=10)
Sub                 1.201n ± 0%     1.201n ± 0%    ~ (p=1.000 n=10)
Sub32               1.601n ± 0%     1.602n ± 0%    +0.06% (p=0.000 n=10)
Sub64               1.201n ± 0%     1.201n ± 0%    ~ (p=0.474 n=10)
Sub64multiple       2.400n ± 0%     2.402n ± 0%    +0.10% (p=0.000 n=10)
Mul                 0.8005n ± 0%    0.8009n ± 0%   +0.05% (p=0.000 n=10)
Mul32               0.8005n ± 0%    0.8009n ± 0%   +0.05% (p=0.000 n=10)
Mul64               0.8004n ± 0%    0.8008n ± 0%   +0.05% (p=0.000 n=10)
Div                 9.107n ± 0%     9.083n ± 0%    ~ (p=0.255 n=10)
Div32               4.009n ± 0%     4.011n ± 0%    +0.05% (p=0.000 n=10)
Div64               9.705n ± 0%     9.711n ± 0%    +0.06% (p=0.000 n=10)
geomean             1.089n          1.083n         -0.62%

goos: linux
goarch: loong64
pkg: math/bits
cpu: Loongson-3A5000 @ 2500.00MHz
                  | bench.old     | bench.new                             |
                  | sec/op        | sec/op         vs base                |
LeadingZeros        1.352n ± 0%     1.341n ± 4%    -0.81% (p=0.024 n=10)
LeadingZeros8       1.766n ± 0%     1.781n ± 0%    +0.88% (p=0.000 n=10)
LeadingZeros16      1.766n ± 0%     1.782n ± 0%    +0.88% (p=0.000 n=10)
LeadingZeros32      1.536n ± 0%     1.341n ± 1%    -12.73% (p=0.000 n=10)
LeadingZeros64      1.351n ± 1%     1.338n ± 0%    -0.96% (p=0.000 n=10)
TrailingZeros       0.9037n ± 0%    0.9025n ± 0%   -0.12% (p=0.020 n=10)
TrailingZeros8      1.087n ± 3%     1.056n ± 0%    ~ (p=0.060 n=10)
TrailingZeros16     1.101n ± 0%     1.101n ± 0%    ~ (p=0.211 n=10)
TrailingZeros32     0.9040n ± 0%    0.9024n ± 1%   -0.18% (p=0.017 n=10)
TrailingZeros64     0.9043n ± 0%    0.9028n ± 1%   ~ (p=0.118 n=10)
OnesCount           1.503n ± 2%     1.482n ± 1%    -1.43% (p=0.001 n=10)
OnesCount8          1.207n ± 0%     1.206n ± 0%    -0.12% (p=0.000 n=10)
OnesCount16         1.501n ± 0%     1.534n ± 0%    +2.13% (p=0.000 n=10)
OnesCount32         1.483n ± 1%     1.531n ± 1%    +3.27% (p=0.000 n=10)
OnesCount64         1.301n ± 0%     1.302n ± 0%    +0.08% (p=0.000 n=10)
RotateLeft          0.8136n ± 4%    0.8083n ± 0%   -0.66% (p=0.002 n=10)
RotateLeft8         1.311n ± 0%     1.310n ± 0%    ~ (p=0.786 n=10)
RotateLeft16        1.165n ± 0%     1.149n ± 0%    -1.33% (p=0.001 n=10)
RotateLeft32        0.8138n ± 1%    0.8093n ± 0%   -0.57% (p=0.017 n=10)
RotateLeft64        0.8149n ± 1%    0.8088n ± 0%   -0.74% (p=0.000 n=10)
Reverse             0.5195n ± 1%    0.5109n ± 0%   -1.67% (p=0.000 n=10)
Reverse8            0.8007n ± 0%    0.8010n ± 0%   +0.04% (p=0.000 n=10)
Reverse16           0.8007n ± 0%    0.8010n ± 0%   +0.04% (p=0.000 n=10)
Reverse32           0.8007n ± 0%    0.8010n ± 0%   +0.04% (p=0.012 n=10)
Reverse64           0.8007n ± 0%    0.8010n ± 0%   +0.04% (p=0.010 n=10)
ReverseBytes        0.5120n ± 1%    0.5122n ± 2%   ~ (p=0.306 n=10)
ReverseBytes16      0.8007n ± 0%    0.8010n ± 0%   +0.04% (p=0.000 n=10)
ReverseBytes32      0.8007n ± 0%    0.8010n ± 0%   +0.04% (p=0.000 n=10)
ReverseBytes64      0.8007n ± 0%    0.8010n ± 0%   +0.04% (p=0.000 n=10)
Add                 1.201n ± 0%     1.201n ± 4%    ~ (p=0.334 n=10)
Add32               1.201n ± 0%     1.201n ± 0%    ~ (p=0.563 n=10)
Add64               1.201n ± 0%     1.201n ± 1%    ~ (p=0.652 n=10)
Add64multiple       1.909n ± 0%     1.902n ± 0%    ~ (p=0.126 n=10)
Sub                 1.201n ± 0%     1.201n ± 0%    ~ (p=1.000 n=10)
Sub32               1.655n ± 0%     1.654n ± 0%    ~ (p=0.589 n=10)
Sub64               1.201n ± 0%     1.201n ± 0%    ~ (p=1.000 n=10)
Sub64multiple       2.150n ± 0%     2.180n ± 4%    +1.37% (p=0.000 n=10)
Mul                 0.9341n ± 0%    0.9345n ± 0%   +0.04% (p=0.011 n=10)
Mul32               1.053n ± 0%     1.030n ± 0%    -2.23% (p=0.000 n=10)
Mul64               0.9341n ± 0%    0.9345n ± 0%   +0.04% (p=0.018 n=10)
Div                 11.59n ± 0%     11.57n ± 1%    ~ (p=0.091 n=10)
Div32               4.337n ± 0%     4.337n ± 1%    ~ (p=0.783 n=10)
Div64               12.81n ± 0%     12.76n ± 0%    -0.39% (p=0.001 n=10)
geomean             1.257n          1.252n         -0.46%

Change-Id: I9e93ea49736760c19dc6b6463d2aa95878121b7b Reviewed-on: https://go-review.googlesource.com/c/go/+/627855 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: David Chase <drchase@google.com> Reviewed-by: abner chenc <chenguoqi@loongson.cn> Reviewed-by: Meidan Li <limeidan@loongson.cn> Reviewed-by: Junyang Shao <shaojunyang@google.com>
2024-11-20  cmd/compile, internal/runtime/atomic: add Xchg8 for loong64  Guoqi Chen
In Loongson's new microarchitecture LA664 (Loongson-3A6000) and later, the atomic instruction AMSWAP[DB]{B,H} [1] is supported. Therefore, the implementation of the atomic operation exchange can be selected according to the CPUCFG flag LAM_BH: the AMSWAPDBB (full barrier) instruction is used on new microarchitectures, and traditional LL-SC is used on LA464 (Loongson-3A5000) and older microarchitectures. This can significantly improve the performance of Go programs on new microarchitectures.

Because Xchg8 implemented using traditional LL-SC uses too many temporary registers, it is not suitable for intrinsics.

goos: linux
goarch: loong64
pkg: internal/runtime/atomic
cpu: Loongson-3A6000 @ 2500.00MHz
BenchmarkXchg8              100000000    10.41 ns/op
BenchmarkXchg8-2            100000000    10.41 ns/op
BenchmarkXchg8-4            100000000    10.41 ns/op
BenchmarkXchg8Parallel      96647592     12.41 ns/op
BenchmarkXchg8Parallel-2    58376136     20.60 ns/op
BenchmarkXchg8Parallel-4    78458899     17.97 ns/op

goos: linux
goarch: loong64
pkg: internal/runtime/atomic
cpu: Loongson-3A5000-HV @ 2500.00MHz
BenchmarkXchg8              38323825     31.23 ns/op
BenchmarkXchg8-2            38368219     31.23 ns/op
BenchmarkXchg8-4            37154156     31.26 ns/op
BenchmarkXchg8Parallel      37908301     31.63 ns/op
BenchmarkXchg8Parallel-2    30413440     39.42 ns/op
BenchmarkXchg8Parallel-4    30737626     39.03 ns/op

For #69735

[1]: https://loongson.github.io/LoongArch-Documentation/LoongArch-ELF-ABI-EN.html

Change-Id: I02ba68f66a2210b6902344fdc9975eb62de728ab Reviewed-on: https://go-review.googlesource.com/c/go/+/623058 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: sophie zhao <zhaoxiaolin@loongson.cn> Reviewed-by: Meidan Li <limeidan@loongson.cn> Reviewed-by: Cherry Mui <cherryyz@google.com> Reviewed-by: Mauri de Souza Meneguzzo <mauri870@gmail.com> Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
2024-11-19  cmd/compiler,internal/runtime/atomic: optimize Cas{64,32} on loong64  Guoqi Chen
In Loongson's new microarchitecture LA664 (Loongson-3A6000) and later, the atomic compare-and-exchange instruction AMCAS[DB]{B,W,H,V} [1] is supported. Therefore, the implementation of the atomic operation compare-and-swap can be selected according to the CPUCFG flag LAMCAS: the AMCASDB (full barrier) instruction is used on new microarchitectures, and traditional LL-SC is used on LA464 (Loongson-3A5000) and older microarchitectures. This can significantly improve the performance of Go programs on new microarchitectures.

goos: linux
goarch: loong64
pkg: internal/runtime/atomic
cpu: Loongson-3A6000 @ 2500.00MHz
           | bench.old      | bench.new                              |
           | sec/op         | sec/op          vs base                |
Cas          46.84n ± 0%      22.82n ± 0%     -51.28% (p=0.000 n=20)
Cas-2        47.58n ± 0%      29.57n ± 0%     -37.85% (p=0.000 n=20)
Cas-4        43.27n ± 20%     25.31n ± 13%    -41.50% (p=0.000 n=20)
Cas64        46.85n ± 0%      22.82n ± 0%     -51.29% (p=0.000 n=20)
Cas64-2      47.43n ± 0%      29.53n ± 0%     -37.74% (p=0.002 n=20)
Cas64-4      43.18n ± 0%      25.28n ± 2%     -41.46% (p=0.000 n=20)
geomean      45.82n           25.74n          -43.82%

goos: linux
goarch: loong64
pkg: internal/runtime/atomic
cpu: Loongson-3A5000 @ 2500.00MHz
           | bench.old      | bench.new                              |
           | sec/op         | sec/op          vs base                |
Cas          50.05n ± 0%      51.26n ± 0%     +2.42% (p=0.000 n=20)
Cas-2        52.80n ± 0%      53.11n ± 0%     +0.59% (p=0.000 n=20)
Cas-4        55.97n ± 0%      57.31n ± 0%     +2.39% (p=0.000 n=20)
Cas64        50.05n ± 0%      51.26n ± 0%     +2.42% (p=0.000 n=20)
Cas64-2      52.68n ± 0%      53.11n ± 0%     +0.82% (p=0.000 n=20)
Cas64-4      55.96n ± 0%      57.26n ± 0%     +2.33% (p=0.000 n=20)
geomean      52.86n           53.83n          +1.82%

[1]: https://loongson.github.io/LoongArch-Documentation/LoongArch-ELF-ABI-EN.html

Change-Id: I9b777c63c124fb492f61c903f77061fa2b4e5322 Reviewed-on: https://go-review.googlesource.com/c/go/+/613396 Reviewed-by: Meidan Li <limeidan@loongson.cn> Reviewed-by: David Chase <drchase@google.com> Reviewed-by: Dmitri Shuralyov <dmitshur@google.com> Reviewed-by: Qiqi Huang <huangqiqi@loongson.cn> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
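For context, user code reaches compare-and-swap through the public `sync/atomic` package rather than `internal/runtime/atomic`. The sketch below shows a typical CAS retry loop. The helper `addClamped` is hypothetical, and the assumption that such calls benefit from the instruction selection described above is an inference, not a claim made in this log.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// addClamped (hypothetical helper) atomically adds delta to *addr without
// letting the value exceed max. It retries until its CAS succeeds.
func addClamped(addr *int32, delta, max int32) int32 {
	for {
		old := atomic.LoadInt32(addr)
		next := old + delta
		if next > max {
			next = max
		}
		if atomic.CompareAndSwapInt32(addr, old, next) {
			return next
		}
	}
}

func main() {
	var v int32
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 100; j++ {
				addClamped(&v, 1, 500)
			}
		}()
	}
	wg.Wait()
	fmt.Println(atomic.LoadInt32(&v)) // prints 500
}
```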
2024-11-13  cmd/compile: wire up math/bits.TrailingZeros intrinsics for loong64  Xiaolin Zhao
Micro-benchmark results on Loongson 3A5000 and 3A6000:

goos: linux
goarch: loong64
pkg: math/bits
cpu: Loongson-3A6000 @ 2500.00MHz
                  | bench.old     | bench.new                             |
                  | sec/op        | sec/op         vs base                |
TrailingZeros       1.7240n ± 0%    0.8120n ± 0%   -52.90% (p=0.000 n=20)
TrailingZeros8      1.0530n ± 0%    0.8015n ± 0%   -23.88% (p=0.000 n=20)
TrailingZeros16     2.072n ± 0%     1.015n ± 0%    -51.01% (p=0.000 n=20)
TrailingZeros32     1.7160n ± 0%    0.8122n ± 0%   -52.67% (p=0.000 n=20)
TrailingZeros64     2.0060n ± 0%    0.8125n ± 0%   -59.50% (p=0.000 n=20)
geomean             1.669n          0.8470n        -49.25%

goos: linux
goarch: loong64
pkg: math/bits
cpu: Loongson-3A5000 @ 2500.00MHz
                  | bench.old     | bench.new                             |
                  | sec/op        | sec/op         vs base                |
TrailingZeros       2.6275n ± 0%    0.9120n ± 0%   -65.29% (p=0.000 n=20)
TrailingZeros8      1.451n ± 0%     1.163n ± 0%    -19.85% (p=0.000 n=20)
TrailingZeros16     3.069n ± 0%     1.201n ± 0%    -60.87% (p=0.000 n=20)
TrailingZeros32     2.9060n ± 0%    0.9115n ± 0%   -68.63% (p=0.000 n=20)
TrailingZeros64     2.6305n ± 0%    0.9115n ± 0%   -65.35% (p=0.000 n=20)
geomean             2.456n          1.011n         -58.83%

This patch is a copy of CL 479498.

Co-authored-by: WANG Xuerui <git@xen0n.name> Change-Id: I1a5b2114a844dc0d02c8e68f41ce2443ac3b5fda Reviewed-on: https://go-review.googlesource.com/c/go/+/624356 Reviewed-by: abner chenc <chenguoqi@loongson.cn> Reviewed-by: David Chase <drchase@google.com> Reviewed-by: Cherry Mui <cherryyz@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Keith Randall <khr@google.com>
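One common place where a fast `bits.TrailingZeros64` matters is iterating over the set bits of a bitmap. The following is a small sketch of that idiom. The function name `setBits` is hypothetical, and the example is not taken from the commit.

```go
package main

import (
	"fmt"
	"math/bits"
)

// setBits (hypothetical name) returns the positions of the 1 bits in x,
// lowest first. Each iteration finds the lowest set bit with TrailingZeros64
// and then clears it with x &= x - 1.
func setBits(x uint64) []int {
	var out []int
	for x != 0 {
		out = append(out, bits.TrailingZeros64(x))
		x &= x - 1
	}
	return out
}

func main() {
	fmt.Println(setBits(0b101001)) // prints [0 3 5]
}
```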
2024-11-12  cmd/compile: optimize math/bits.OnesCount{16,32,64} implementation on loong64  Guoqi Chen
Use Loong64's LSX instruction VPCNT to implement math/bits.OnesCount{16,32,64} and make it intrinsic.

Benchmark results on loongson 3A5000 and 3A6000 machines:

goos: linux
goarch: loong64
pkg: math/bits
cpu: Loongson-3A5000-HV @ 2500.00MHz
              | bench.old     | bench.new                             |
              | sec/op        | sec/op         vs base                |
OnesCount       4.413n ± 0%     1.401n ± 0%    -68.25% (p=0.000 n=10)
OnesCount8      1.364n ± 0%     1.363n ± 0%    ~ (p=0.130 n=10)
OnesCount16     2.112n ± 0%     1.534n ± 0%    -27.37% (p=0.000 n=10)
OnesCount32     4.533n ± 0%     1.529n ± 0%    -66.27% (p=0.000 n=10)
OnesCount64     4.565n ± 0%     1.531n ± 1%    -66.46% (p=0.000 n=10)
geomean         3.048n          1.470n         -51.78%

goos: linux
goarch: loong64
pkg: math/bits
cpu: Loongson-3A6000 @ 2500.00MHz
              | bench.old     | bench.new                             |
              | sec/op        | sec/op         vs base                |
OnesCount       3.553n ± 0%     1.201n ± 0%    -66.20% (p=0.000 n=10)
OnesCount8      0.8021n ± 0%    0.8004n ± 0%   -0.21% (p=0.000 n=10)
OnesCount16     1.216n ± 0%     1.000n ± 0%    -17.76% (p=0.000 n=10)
OnesCount32     3.006n ± 0%     1.035n ± 0%    -65.57% (p=0.000 n=10)
OnesCount64     3.503n ± 0%     1.035n ± 0%    -70.45% (p=0.000 n=10)
geomean         2.053n          1.006n         -51.01%

Change-Id: I07a5b8da2bb48711b896387ec7625145804affc8 Reviewed-on: https://go-review.googlesource.com/c/go/+/620978 Reviewed-by: David Chase <drchase@google.com> Reviewed-by: Cherry Mui <cherryyz@google.com> Reviewed-by: Meidan Li <limeidan@loongson.cn> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
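As a usage illustration, population count is the core of a Hamming-distance computation. The sketch below uses only the public `math/bits` API. The function name `hamming` is hypothetical, and nothing here depends on how the intrinsic is implemented.

```go
package main

import (
	"fmt"
	"math/bits"
)

// hamming (hypothetical name) returns the number of bit positions
// in which a and b differ.
func hamming(a, b uint64) int {
	return bits.OnesCount64(a ^ b)
}

func main() {
	fmt.Println(hamming(0b1011, 0b0110)) // prints 3
}
```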
2024-11-11  cmd/compile: wire up bits.Reverse intrinsics for loong64  Xiaolin Zhao
Micro-benchmark results on Loongson 3A5000 and 3A6000:

goos: linux
goarch: loong64
pkg: math/bits
cpu: Loongson-3A6000 @ 2500.00MHz
            | CL 624576     | this CL                               |
            | sec/op        | sec/op         vs base                |
Reverse       2.8130n ± 0%    0.8008n ± 0%   -71.53% (p=0.000 n=20)
Reverse8      0.7014n ± 0%    0.4040n ± 0%   -42.40% (p=0.000 n=20)
Reverse16     1.2975n ± 0%    0.6632n ± 1%   -48.89% (p=0.000 n=20)
Reverse32     2.7520n ± 0%    0.4042n ± 0%   -85.31% (p=0.000 n=20)
Reverse64     2.8970n ± 0%    0.4041n ± 0%   -86.05% (p=0.000 n=20)
geomean       1.828n          0.5116n        -72.01%

goos: linux
goarch: loong64
pkg: math/bits
cpu: Loongson-3A5000 @ 2500.00MHz
            | CL 624576     | this CL                               |
            | sec/op        | sec/op         vs base                |
Reverse       4.0050n ± 0%    0.8011n ± 0%   -80.00% (p=0.000 n=20)
Reverse8      0.8010n ± 0%    0.5210n ± 1%   -34.96% (p=0.000 n=20)
Reverse16     1.6160n ± 0%    0.6008n ± 0%   -62.82% (p=0.000 n=20)
Reverse32     3.8550n ± 0%    0.5179n ± 0%   -86.57% (p=0.000 n=20)
Reverse64     3.8050n ± 0%    0.5177n ± 0%   -86.40% (p=0.000 n=20)
geomean       2.378n          0.5828n        -75.49%

Updates #59120

This patch is a copy of CL 483656.

Co-authored-by: WANG Xuerui <git@xen0n.name> Change-Id: I98681091763279279c8404bd0295785f13ea1c8e Reviewed-on: https://go-review.googlesource.com/c/go/+/624276 Reviewed-by: abner chenc <chenguoqi@loongson.cn> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Cherry Mui <cherryyz@google.com> Reviewed-by: David Chase <drchase@google.com>
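A typical consumer of `bits.Reverse32` is the bit-reversal permutation used by FFT-style algorithms. The sketch below is illustrative only: the helper name `reverseLow` is hypothetical, and the example is not drawn from the commit.

```go
package main

import (
	"fmt"
	"math/bits"
)

// reverseLow (hypothetical name) reverses the low n bits of i (1 <= n <= 32)
// by reversing all 32 bits and shifting the result back down.
func reverseLow(i uint32, n int) uint32 {
	return bits.Reverse32(i) >> (32 - n)
}

func main() {
	// Bit-reversal permutation for 8 elements (n = 3 bits).
	for i := uint32(0); i < 8; i++ {
		fmt.Print(reverseLow(i, 3), " ")
	}
	fmt.Println() // the line printed is "0 4 2 6 1 5 3 7 "
}
```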
2024-11-11  cmd/compiler,internal/runtime/atomic: optimize And{64,32,8} and Or{64,32,8} on loong64  Guoqi Chen
Use loong64's atomic instructions AMANDDB{V,W,W} (full barrier) to implement And{64,32,8} and AMORDB{V,W,W} (full barrier) to implement Or{64,32,8}.

Intrinsify And{64,32,8} and Or{64,32,8}; this CL also aliases all of the And/Or operations into the sync/atomic package.

goos: linux
goarch: loong64
pkg: internal/runtime/atomic
cpu: Loongson-3A6000-HV @ 2500.00MHz
                | bench.old     | bench.new                             |
                | sec/op        | sec/op         vs base                |
And32             27.73n ± 0%     10.81n ± 0%    -61.02% (p=0.000 n=20)
And32Parallel     28.96n ± 0%     12.41n ± 0%    -57.15% (p=0.000 n=20)
And64             27.73n ± 0%     10.81n ± 0%    -61.02% (p=0.000 n=20)
And64Parallel     28.96n ± 0%     12.41n ± 0%    -57.15% (p=0.000 n=20)
Or32              27.62n ± 0%     10.81n ± 0%    -60.86% (p=0.000 n=20)
Or32Parallel      28.96n ± 0%     12.41n ± 0%    -57.15% (p=0.000 n=20)
Or64              27.62n ± 0%     10.81n ± 0%    -60.86% (p=0.000 n=20)
Or64Parallel      28.97n ± 0%     12.41n ± 0%    -57.16% (p=0.000 n=20)
And8              29.15n ± 0%     13.21n ± 0%    -54.68% (p=0.000 n=20)
And               27.71n ± 0%     12.82n ± 0%    -53.74% (p=0.000 n=20)
And8Parallel      28.99n ± 0%     14.46n ± 0%    -50.12% (p=0.000 n=20)
AndParallel       29.12n ± 0%     14.42n ± 0%    -50.48% (p=0.000 n=20)
Or8               28.31n ± 0%     12.81n ± 0%    -54.75% (p=0.000 n=20)
Or                27.72n ± 0%     12.81n ± 0%    -53.79% (p=0.000 n=20)
Or8Parallel       29.03n ± 0%     14.62n ± 0%    -49.64% (p=0.000 n=20)
OrParallel        29.12n ± 0%     14.42n ± 0%    -50.49% (p=0.000 n=20)
geomean           28.47n          12.58n         -55.80%

goos: linux
goarch: loong64
pkg: internal/runtime/atomic
cpu: Loongson-3A5000 @ 2500.00MHz
                | bench.old     | bench.new                             |
                | sec/op        | sec/op         vs base                |
And32             30.02n ± 0%     14.81n ± 0%    -50.67% (p=0.000 n=20)
And32Parallel     30.83n ± 0%     15.61n ± 0%    -49.37% (p=0.000 n=20)
And64             30.02n ± 0%     14.81n ± 0%    -50.67% (p=0.000 n=20)
And64Parallel     30.83n ± 0%     15.61n ± 0%    -49.37% (p=0.000 n=20)
And8              30.42n ± 0%     14.41n ± 0%    -52.63% (p=0.000 n=20)
And               30.02n ± 0%     13.61n ± 0%    -54.66% (p=0.000 n=20)
And8Parallel      31.23n ± 0%     15.21n ± 0%    -51.30% (p=0.000 n=20)
AndParallel       30.83n ± 0%     14.41n ± 0%    -53.26% (p=0.000 n=20)
Or32              30.02n ± 0%     14.81n ± 0%    -50.67% (p=0.000 n=20)
Or32Parallel      30.83n ± 0%     15.61n ± 0%    -49.37% (p=0.000 n=20)
Or64              30.02n ± 0%     14.82n ± 0%    -50.63% (p=0.000 n=20)
Or64Parallel      30.83n ± 0%     15.61n ± 0%    -49.37% (p=0.000 n=20)
Or8               30.02n ± 0%     14.01n ± 0%    -53.33% (p=0.000 n=20)
Or                30.02n ± 0%     13.61n ± 0%    -54.66% (p=0.000 n=20)
Or8Parallel       30.83n ± 0%     14.81n ± 0%    -51.96% (p=0.000 n=20)
OrParallel        30.83n ± 0%     14.41n ± 0%    -53.26% (p=0.000 n=20)
geomean           30.47n          14.75n         -51.61%

Change-Id: If008ff6a08b51905076f8ddb6e92f8e214d3f7b3 Reviewed-on: https://go-review.googlesource.com/c/go/+/482756 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Qiqi Huang <huangqiqi@loongson.cn> Reviewed-by: Meidan Li <limeidan@loongson.cn> Reviewed-by: David Chase <drchase@google.com> Reviewed-by: sophie zhao <zhaoxiaolin@loongson.cn> Reviewed-by: Cherry Mui <cherryyz@google.com>
2024-11-11cmd/compiler,internal/runtime/atomic: optimize xchg{32,64} on loong64Guoqi Chen
Use Loong64's atomic operation instruction AMSWAPDB{W,V} (full barrier) to implement atomic.Xchg{32,64} goos: linux goarch: loong64 pkg: internal/runtime/atomic cpu: Loongson-3A5000 @ 2500.00MHz | old.bench | new.bench | | sec/op | sec/op vs base | Xchg 26.44n ± 0% 12.01n ± 0% -54.58% (p=0.000 n=20) Xchg-2 30.10n ± 0% 25.58n ± 0% -15.02% (p=0.000 n=20) Xchg-4 30.06n ± 0% 24.82n ± 0% -17.43% (p=0.000 n=20) Xchg64 26.44n ± 0% 12.02n ± 0% -54.54% (p=0.000 n=20) Xchg64-2 30.10n ± 0% 25.57n ± 0% -15.05% (p=0.000 n=20) Xchg64-4 30.05n ± 0% 24.80n ± 0% -17.47% (p=0.000 n=20) geomean 28.81n 19.68n -31.69% goos: linux goarch: loong64 pkg: internal/runtime/atomic cpu: Loongson-3A6000 @ 2500.00MHz | old.bench | new.bench | | sec/op | sec/op vs base | Xchg 25.62n ± 0% 12.41n ± 0% -51.56% (p=0.000 n=20) Xchg-2 35.01n ± 0% 20.59n ± 0% -41.19% (p=0.000 n=20) Xchg-4 34.63n ± 0% 19.59n ± 0% -43.42% (p=0.000 n=20) Xchg64 25.62n ± 0% 12.41n ± 0% -51.56% (p=0.000 n=20) Xchg64-2 35.01n ± 0% 20.59n ± 0% -41.19% (p=0.000 n=20) Xchg64-4 34.67n ± 0% 19.59n ± 0% -43.50% (p=0.000 n=20) geomean 31.44n 17.11n -45.59% Updates #59120. Change-Id: Ied74fc20338b63799c6d6eeb122c31b42cff0f7e Reviewed-on: https://go-review.googlesource.com/c/go/+/481578 Reviewed-by: Meidan Li <limeidan@loongson.cn> Reviewed-by: Qiqi Huang <huangqiqi@loongson.cn> Reviewed-by: Cherry Mui <cherryyz@google.com> Reviewed-by: David Chase <drchase@google.com> Reviewed-by: WANG Xuerui <git@xen0n.name> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: sophie zhao <zhaoxiaolin@loongson.cn>
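As an illustration of what an atomic exchange is used for, the sketch below drains a counter with `atomic.SwapUint64`, the public sync/atomic counterpart of the runtime-internal Xchg. The helper `takeAll` is hypothetical; the assumption that the public function benefits from the same lowering is not stated by the commit itself.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// takeAll (hypothetical helper) atomically replaces the pending count with
// zero and returns the previous value, so the count is claimed exactly once.
func takeAll(pending *uint64) uint64 {
	return atomic.SwapUint64(pending, 0)
}

func main() {
	var pending uint64 = 7
	fmt.Println(takeAll(&pending), pending) // prints 7 0
}
```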
2024-11-08cmd/compile: implement FMA codegen for loong64Xiaolin Zhao
Benchmark results on Loongson 3A5000 and 3A6000:

goos: linux
goarch: loong64
pkg: math
cpu: Loongson-3A6000 @ 2500.00MHz
     |  bench.old   |  bench.new                          |
     |    sec/op    |   sec/op     vs base                |
FMA    25.930n ± 0%   2.002n ± 0%  -92.28% (p=0.000 n=10)

goos: linux
goarch: loong64
pkg: math
cpu: Loongson-3A5000 @ 2500.00MHz
     |  bench.old   |  bench.new                          |
     |    sec/op    |   sec/op     vs base                |
FMA    32.840n ± 0%   2.002n ± 0%  -93.90% (p=0.000 n=10)

Updates #59120

This patch is a copy of CL 483355.

Co-authored-by: WANG Xuerui <git@xen0n.name>
Change-Id: I88b89d23f00864f9173a182a47ee135afec7ed6e
Reviewed-on: https://go-review.googlesource.com/c/go/+/625335
Reviewed-by: abner chenc <chenguoqi@loongson.cn>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Carlos Amedee <carlos@golang.org>
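The benchmark above exercises `math.FMA`, which computes x*y + z with a single rounding. A small sketch of typical use follows; the `dot` helper is hypothetical and assumes both slices have the same length.

```go
package main

import (
	"fmt"
	"math"
)

// dot (hypothetical helper) accumulates a dot product with math.FMA so
// that each step xs[i]*ys[i] + acc is rounded only once.
// It assumes len(xs) == len(ys).
func dot(xs, ys []float64) float64 {
	acc := 0.0
	for i := range xs {
		acc = math.FMA(xs[i], ys[i], acc)
	}
	return acc
}

func main() {
	fmt.Println(dot([]float64{1, 2, 3}, []float64{4, 5, 6})) // prints 32
}
```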
2024-11-08cmd/compile/internal: intrinsify publicationBarrier on loong64Guoqi Chen
The publication barrier is a StoreStore barrier, which is implemented by "DBAR 0x1A" [1] on loong64. goos: linux goarch: loong64 pkg: runtime cpu: Loongson-3A6000 @ 2500.00MHz | bench.old | bench.new | | sec/op | sec/op vs base | Malloc8 31.76n ± 0% 22.79n ± 0% -28.24% (p=0.000 n=20) Malloc8-2 25.46n ± 0% 18.33n ± 0% -28.00% (p=0.000 n=20) Malloc8-4 25.75n ± 0% 18.43n ± 0% -28.41% (p=0.000 n=20) Malloc16 62.97n ± 0% 42.41n ± 0% -32.65% (p=0.000 n=20) Malloc16-2 49.11n ± 0% 31.68n ± 0% -35.50% (p=0.000 n=20) Malloc16-4 49.64n ± 1% 31.95n ± 0% -35.62% (p=0.000 n=20) MallocTypeInfo8 58.57n ± 0% 46.51n ± 0% -20.61% (p=0.000 n=20) MallocTypeInfo8-2 51.43n ± 0% 38.01n ± 0% -26.09% (p=0.000 n=20) MallocTypeInfo8-4 51.65n ± 0% 38.15n ± 0% -26.13% (p=0.000 n=20) MallocTypeInfo16 68.07n ± 0% 51.62n ± 0% -24.17% (p=0.000 n=20) MallocTypeInfo16-2 54.73n ± 0% 41.13n ± 0% -24.85% (p=0.000 n=20) MallocTypeInfo16-4 55.05n ± 0% 41.28n ± 0% -25.02% (p=0.000 n=20) MallocLargeStruct 491.5n ± 0% 454.8n ± 0% -7.47% (p=0.000 n=20) MallocLargeStruct-2 351.8n ± 1% 323.8n ± 0% -7.94% (p=0.000 n=20) MallocLargeStruct-4 333.6n ± 0% 316.7n ± 0% -5.10% (p=0.000 n=20) geomean 71.01n 53.78n -24.26% [1]: https://loongson.github.io/LoongArch-Documentation/LoongArch-Vol1-EN.html Change-Id: Ica0c89db6f2bebd55d9b3207a1c462a9454e9268 Reviewed-on: https://go-review.googlesource.com/c/go/+/577515 Reviewed-by: David Chase <drchase@google.com> Reviewed-by: sophie zhao <zhaoxiaolin@loongson.cn> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Qiqi Huang <huangqiqi@loongson.cn> Reviewed-by: Meidan Li <limeidan@loongson.cn> Reviewed-by: Carlos Amedee <carlos@golang.org>
2024-11-08cmd/compiler,internal/runtime/atomic: optimize xadd{32,64} on loong64Guoqi Chen
Use Loong64's atomic operation instruction AMADDDB{W,V} (full barrier) to implement atomic.Xadd{32,64} goos: linux goarch: loong64 pkg: internal/runtime/atomic cpu: Loongson-3A5000 @ 2500.00MHz | bench.old | bench.new | | sec/op | sec/op vs base | Xadd 27.24n ± 0% 12.01n ± 0% -55.91% (p=0.000 n=20) Xadd-2 31.93n ± 0% 25.55n ± 0% -19.98% (p=0.000 n=20) Xadd-4 31.90n ± 0% 24.80n ± 0% -22.26% (p=0.000 n=20) Xadd64 27.23n ± 0% 12.01n ± 0% -55.89% (p=0.000 n=20) Xadd64-2 31.93n ± 0% 25.57n ± 0% -19.90% (p=0.000 n=20) Xadd64-4 31.89n ± 0% 24.80n ± 0% -22.23% (p=0.000 n=20) geomean 30.27n 19.67n -35.01% goos: linux goarch: loong64 pkg: internal/runtime/atomic cpu: Loongson-3A6000 @ 2500.00MHz | bench.old | bench.new | | sec/op | sec/op vs base | Xadd 26.02n ± 0% 12.41n ± 0% -52.31% (p=0.000 n=20) Xadd-2 37.36n ± 0% 20.60n ± 0% -44.86% (p=0.000 n=20) Xadd-4 37.22n ± 0% 19.59n ± 0% -47.37% (p=0.000 n=20) Xadd64 26.42n ± 0% 12.41n ± 0% -53.03% (p=0.000 n=20) Xadd64-2 37.77n ± 0% 20.60n ± 0% -45.46% (p=0.000 n=20) Xadd64-4 37.78n ± 0% 19.59n ± 0% -48.15% (p=0.000 n=20) geomean 33.30n 17.11n -48.62% Change-Id: I982539c2aa04680e9dd11b099ba8d5f215bf9b32 Reviewed-on: https://go-review.googlesource.com/c/go/+/481937 Reviewed-by: David Chase <drchase@google.com> Reviewed-by: sophie zhao <zhaoxiaolin@loongson.cn> Reviewed-by: Meidan Li <limeidan@loongson.cn> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: WANG Xuerui <git@xen0n.name> Reviewed-by: Cherry Mui <cherryyz@google.com> Reviewed-by: Qiqi Huang <huangqiqi@loongson.cn>
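To show the kind of code that leans on atomic add, here is a minimal sketch using `atomic.AddUint64` from sync/atomic, the public counterpart of the runtime-internal Xadd. The function name and parameters are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// countConcurrently (hypothetical helper) starts workers goroutines that
// each increment a shared counter perWorker times with an atomic add.
func countConcurrently(workers, perWorker int) uint64 {
	var total uint64
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				atomic.AddUint64(&total, 1)
			}
		}()
	}
	wg.Wait()
	return atomic.LoadUint64(&total)
}

func main() {
	fmt.Println(countConcurrently(4, 1000)) // prints 4000
}
```

With a plain `total++` the result could be lower because concurrent increments may overwrite each other.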
2024-11-07cmd/compiler,internal/runtime/atomic: optimize Store{64,32,8} on loong64Guoqi Chen
On Loong64, AMSWAPDB{W,V} instructions are supported by default, and AMSWAPDB{B,H} [1] is a new instruction added by LA664(Loongson 3A6000) and later microarchitectures. Therefore, AMSWAPDB{W,V} (full barrier) is used to implement AtomicStore{32,64}, and the traditional MOVB or the new AMSWAPDBB is used to implement AtomicStore8 according to the CPU feature. The StoreRelease barrier on Loong64 is "dbar 0x12", but it is still necessary to ensure consistency in the order of Store/Load [2]. LoweredAtomicStorezero{32,64} was removed because on loong64 the constant "0" uses the R0 register, and there is no performance difference between the implementations of LoweredAtomicStorezero{32,64} and LoweredAtomicStore{32,64}. goos: linux goarch: loong64 pkg: internal/runtime/atomic cpu: Loongson-3A5000-HV @ 2500.00MHz | bench.old | bench.new | | sec/op | sec/op vs base | AtomicStore64 19.61n ± 0% 13.61n ± 0% -30.60% (p=0.000 n=20) AtomicStore64-2 19.61n ± 0% 13.61n ± 0% -30.57% (p=0.000 n=20) AtomicStore64-4 19.62n ± 0% 13.61n ± 0% -30.63% (p=0.000 n=20) AtomicStore 19.61n ± 0% 13.61n ± 0% -30.60% (p=0.000 n=20) AtomicStore-2 19.62n ± 0% 13.61n ± 0% -30.63% (p=0.000 n=20) AtomicStore-4 19.62n ± 0% 13.62n ± 0% -30.58% (p=0.000 n=20) AtomicStore8 19.61n ± 0% 20.01n ± 0% +2.04% (p=0.000 n=20) AtomicStore8-2 19.62n ± 0% 20.02n ± 0% +2.01% (p=0.000 n=20) AtomicStore8-4 19.61n ± 0% 20.02n ± 0% +2.09% (p=0.000 n=20) geomean 19.61n 15.48n -21.08% goos: linux goarch: loong64 pkg: internal/runtime/atomic cpu: Loongson-3A6000 @ 2500.00MHz | bench.old | bench.new | | sec/op | sec/op vs base | AtomicStore64 18.03n ± 0% 12.81n ± 0% -28.93% (p=0.000 n=20) AtomicStore64-2 18.02n ± 0% 12.81n ± 0% -28.91% (p=0.000 n=20) AtomicStore64-4 18.01n ± 0% 12.81n ± 0% -28.87% (p=0.000 n=20) AtomicStore 18.02n ± 0% 12.81n ± 0% -28.91% (p=0.000 n=20) AtomicStore-2 18.01n ± 0% 12.81n ± 0% -28.87% (p=0.000 n=20) AtomicStore-4 18.01n ± 0% 12.81n ± 0% -28.87% (p=0.000 n=20) AtomicStore8 18.01n ± 0% 12.81n ± 
0% -28.87% (p=0.000 n=20) AtomicStore8-2 18.01n ± 0% 12.81n ± 0% -28.87% (p=0.000 n=20) AtomicStore8-4 18.01n ± 0% 12.81n ± 0% -28.87% (p=0.000 n=20) geomean 18.01n 12.81n -28.89% [1]: https://loongson.github.io/LoongArch-Documentation/LoongArch-ELF-ABI-EN.html [2]: https://gcc.gnu.org/git/?p=gcc.git;a=blob_plain;f=gcc/config/loongarch/sync.md Change-Id: I4ae5e8dd0e6f026129b6e503990a763ed40c6097 Reviewed-on: https://go-review.googlesource.com/c/go/+/581356 Reviewed-by: sophie zhao <zhaoxiaolin@loongson.cn> Reviewed-by: Cherry Mui <cherryyz@google.com> Reviewed-by: Qiqi Huang <huangqiqi@loongson.cn> Reviewed-by: Meidan Li <limeidan@loongson.cn> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: David Chase <drchase@google.com>
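The ordering requirement mentioned above is what makes the common publish pattern work. The following is a sketch using sync/atomic rather than the runtime-internal functions; the `box` type and its methods are hypothetical.

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

// box (hypothetical type) holds a payload guarded by an atomic ready flag.
type box struct {
	payload int
	ready   atomic.Uint32
}

// publish writes the payload, then sets the flag with an atomic store.
func (b *box) publish(v int) {
	b.payload = v
	b.ready.Store(1)
}

// tryRead returns the payload only once the flag has been observed set.
func (b *box) tryRead() (int, bool) {
	if b.ready.Load() == 0 {
		return 0, false
	}
	return b.payload, true
}

func main() {
	b := &box{}
	go b.publish(42)
	for {
		if v, ok := b.tryRead(); ok {
			fmt.Println(v) // prints 42
			return
		}
		runtime.Gosched() // let the publisher run
	}
}
```

The plain write to `payload` is safe to read here only because the atomic store and load of `ready` order it, per the Go memory model.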
2024-11-06cmd/compile: wire up Bswap/ReverseBytes intrinsics for loong64Xiaolin Zhao
Micro-benchmark results on Loongson 3A5000 and 3A6000: goos: linux goarch: loong64 pkg: math/bits cpu: Loongson-3A6000 @ 2500.00MHz | bench.old | bench.new | | sec/op | sec/op vs base | ReverseBytes 2.0020n ± 0% 0.4040n ± 0% -79.82% (p=0.000 n=20) ReverseBytes16 0.8866n ± 1% 0.8007n ± 0% -9.69% (p=0.000 n=20) ReverseBytes32 1.2195n ± 0% 0.8007n ± 0% -34.34% (p=0.000 n=20) ReverseBytes64 2.0705n ± 0% 0.8008n ± 0% -61.32% (p=0.000 n=20) geomean 1.455n 0.6749n -53.62% goos: linux goarch: loong64 pkg: math/bits cpu: Loongson-3A5000 @ 2500.00MHz | bench.old | bench.new | | sec/op | sec/op vs base | ReverseBytes 2.8040n ± 0% 0.5205n ± 0% -81.44% (p=0.000 n=20) ReverseBytes16 0.7066n ± 0% 0.8011n ± 0% +13.37% (p=0.000 n=20) ReverseBytes32 1.5500n ± 0% 0.8010n ± 0% -48.32% (p=0.000 n=20) ReverseBytes64 2.7665n ± 0% 0.8010n ± 0% -71.05% (p=0.000 n=20) geomean 1.707n 0.7192n -57.87% Updates #59120 This patch is a copy of CL 483357. Co-authored-by: WANG Xuerui <git@xen0n.name> Change-Id: If355354cd031533df91991fcc3392e5a6c314295 Reviewed-on: https://go-review.googlesource.com/c/go/+/624576 Reviewed-by: David Chase <drchase@google.com> Reviewed-by: abner chenc <chenguoqi@loongson.cn> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Carlos Amedee <carlos@golang.org>
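For context, the functions benchmarked above reverse the byte order of an integer. A minimal sketch with `math/bits` follows; the wrapper name `swapEndian32` is hypothetical.

```go
package main

import (
	"fmt"
	"math/bits"
)

// swapEndian32 (hypothetical wrapper) converts a 32-bit word between
// little-endian and big-endian byte order.
func swapEndian32(x uint32) uint32 {
	return bits.ReverseBytes32(x)
}

func main() {
	fmt.Printf("%#x\n", swapEndian32(0x11223344)) // prints 0x44332211
}
```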
2024-11-06cmd/compile: wire up math/bits.Len intrinsics for loong64Xiaolin Zhao
For the SubFromLen64 codegen test case to work as intended, we need to fold c-(-(x-d)) into x+(c-d). Still, some instances of LeadingZeros are not optimized into single CLZ instructions right now (actually, the LeadingZeros micro-benchmarks are currently still compiled with redundant adds/subs of 64, due to interference of loop optimizations before lowering), but perf numbers indicate it's not that bad after all. Micro-benchmark results on Loongson 3A5000 and 3A6000: goos: linux goarch: loong64 pkg: math/bits cpu: Loongson-3A5000 @ 2500.00MHz | bench.old | bench.new | | sec/op | sec/op vs base | LeadingZeros 3.660n ± 0% 1.348n ± 0% -63.17% (p=0.000 n=20) LeadingZeros8 1.777n ± 0% 1.767n ± 0% -0.56% (p=0.000 n=20) LeadingZeros16 2.816n ± 0% 1.770n ± 0% -37.14% (p=0.000 n=20) LeadingZeros32 5.293n ± 1% 1.683n ± 0% -68.21% (p=0.000 n=20) LeadingZeros64 3.622n ± 0% 1.349n ± 0% -62.76% (p=0.000 n=20) geomean 3.229n 1.571n -51.35% goos: linux goarch: loong64 pkg: math/bits cpu: Loongson-3A6000 @ 2500.00MHz | bench.old | bench.new | | sec/op | sec/op vs base | LeadingZeros 2.410n ± 0% 1.103n ± 1% -54.23% (p=0.000 n=20) LeadingZeros8 1.236n ± 0% 1.501n ± 0% +21.44% (p=0.000 n=20) LeadingZeros16 2.106n ± 0% 1.501n ± 0% -28.73% (p=0.000 n=20) LeadingZeros32 2.860n ± 0% 1.324n ± 0% -53.72% (p=0.000 n=20) LeadingZeros64 2.6135n ± 0% 0.9509n ± 0% -63.62% (p=0.000 n=20) geomean 2.159n 1.256n -41.81% Updates #59120 This patch is a copy of CL 483356. Co-authored-by: WANG Xuerui <git@xen0n.name> Change-Id: Iee81a17f7da06d77a427e73dfcc016f2b15ae556 Reviewed-on: https://go-review.googlesource.com/c/go/+/624575 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: David Chase <drchase@google.com> Reviewed-by: Carlos Amedee <carlos@golang.org> Reviewed-by: abner chenc <chenguoqi@loongson.cn>
2024-10-24cmd/compile/internal: optimize condition branch implementationlimeidan
goos: linux
goarch: loong64
pkg: test/bench/go1
cpu: Loongson-3A6000 @ 2500.00MHz
                       │     old      │     new                             │
                       │    sec/op    │    sec/op     vs base               │
BinaryTree17              7.521 ± 1%     7.551 ± 2%        ~ (p=0.190 n=10)
Fannkuch11                2.736 ± 0%     2.667 ± 0%   -2.51% (p=0.000 n=10)
FmtFprintfEmpty          34.42n ± 0%    35.22n ± 0%   +2.32% (p=0.000 n=10)
FmtFprintfString         61.24n ± 0%    56.84n ± 0%   -7.18% (p=0.000 n=10)
FmtFprintfInt            68.04n ± 0%    65.65n ± 0%   -3.51% (p=0.000 n=10)
FmtFprintfIntInt         111.9n ± 0%    106.0n ± 0%   -5.32% (p=0.000 n=10)
FmtFprintfPrefixedInt    131.4n ± 0%    122.5n ± 0%   -6.77% (p=0.000 n=10)
FmtFprintfFloat          241.1n ± 0%    235.1n ± 0%   -2.51% (p=0.000 n=10)
FmtManyArgs              553.7n ± 0%    518.9n ± 0%   -6.28% (p=0.000 n=10)
GobDecode                7.223m ± 1%    7.291m ± 1%   +0.94% (p=0.004 n=10)
GobEncode                6.741m ± 1%    6.622m ± 2%   -1.77% (p=0.011 n=10)
Gzip                     288.9m ± 0%    280.3m ± 0%   -3.00% (p=0.000 n=10)
Gunzip                   34.07m ± 0%    33.33m ± 0%   -2.18% (p=0.000 n=10)
HTTPClientServer         60.15µ ± 0%    60.63µ ± 0%   +0.80% (p=0.000 n=10)
JSONEncode              10.052m ± 1%    9.840m ± 0%   -2.12% (p=0.000 n=10)
JSONDecode               50.96m ± 0%    51.32m ± 0%   +0.70% (p=0.002 n=10)
Mandelbrot200            4.525m ± 0%    4.602m ± 0%   +1.69% (p=0.000 n=10)
GoParse                  5.018m ± 0%    4.996m ± 0%   -0.44% (p=0.000 n=10)
RegexpMatchEasy0_32      58.74n ± 0%    59.95n ± 0%   +2.06% (p=0.000 n=10)
RegexpMatchEasy0_1K      464.9n ± 0%    466.1n ± 0%   +0.26% (p=0.000 n=10)
RegexpMatchEasy1_32      64.88n ± 0%    59.64n ± 0%   -8.08% (p=0.000 n=10)
RegexpMatchEasy1_1K      557.2n ± 0%    564.4n ± 0%   +1.29% (p=0.000 n=10)
RegexpMatchMedium_32     879.3n ± 0%    912.8n ± 1%   +3.81% (p=0.000 n=10)
RegexpMatchMedium_1K     28.08µ ± 0%    28.70µ ± 0%   +2.20% (p=0.000 n=10)
RegexpMatchHard_32       1.456µ ± 0%    1.414µ ± 0%   -2.88% (p=0.000 n=10)
RegexpMatchHard_1K       43.81µ ± 0%    42.23µ ± 0%   -3.61% (p=0.000 n=10)
Revcomp                  472.4m ± 0%    474.5m ± 1%   +0.45% (p=0.000 n=10)
Template                 83.45m ± 0%    83.39m ± 0%        ~ (p=0.481 n=10)
TimeParse                291.3n ± 0%    283.8n ± 0%   -2.57% (p=0.000 n=10)
TimeFormat               322.8n ± 0%    313.1n ± 0%   -3.02% (p=0.000 n=10)
geomean                  54.32µ         53.45µ        -1.61%

Change-Id: If68fdd952ec6137c77e25ce8932358cac28da324
Reviewed-on: https://go-review.googlesource.com/c/go/+/620977
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Reviewed-by: abner chenc <chenguoqi@loongson.cn>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: sophie zhao <zhaoxiaolin@loongson.cn>
2024-10-18cmd/compile: add patterns for bitfield opcodes on loong64Xiaolin Zhao
goos: linux goarch: loong64 pkg: math/bits cpu: Loongson-3A6000 @ 2500.00MHz | bench.old | bench.new | | sec/op | sec/op vs base | LeadingZeros 1.0095n ± 0% 0.8011n ± 0% -20.64% (p=0.000 n=10) LeadingZeros8 1.201n ± 0% 1.167n ± 0% -2.83% (p=0.000 n=10) LeadingZeros16 1.201n ± 0% 1.167n ± 0% -2.83% (p=0.000 n=10) LeadingZeros32 1.201n ± 0% 1.134n ± 0% -5.58% (p=0.000 n=10) LeadingZeros64 0.8007n ± 0% 1.0115n ± 0% +26.32% (p=0.000 n=10) TrailingZeros 0.8054n ± 0% 0.8106n ± 1% +0.65% (p=0.000 n=10) TrailingZeros8 1.067n ± 0% 1.002n ± 1% -6.09% (p=0.000 n=10) TrailingZeros16 1.0540n ± 0% 0.8389n ± 0% -20.40% (p=0.000 n=10) TrailingZeros32 0.8014n ± 0% 0.8117n ± 0% +1.29% (p=0.000 n=10) TrailingZeros64 0.8015n ± 0% 0.8124n ± 1% +1.36% (p=0.000 n=10) OnesCount 3.418n ± 0% 3.417n ± 0% ~ (p=0.911 n=10) OnesCount8 0.8004n ± 0% 0.8004n ± 0% ~ (p=1.000 n=10) OnesCount16 1.440n ± 0% 1.299n ± 0% -9.79% (p=0.000 n=10) OnesCount32 2.969n ± 0% 2.940n ± 0% -0.94% (p=0.000 n=10) OnesCount64 3.563n ± 0% 3.558n ± 0% -0.14% (p=0.000 n=10) RotateLeft 0.6677n ± 0% 0.6670n ± 0% ~ (p=0.055 n=10) RotateLeft8 1.318n ± 1% 1.321n ± 0% ~ (p=0.117 n=10) RotateLeft16 0.8457n ± 1% 0.8442n ± 0% ~ (p=0.325 n=10) RotateLeft32 0.8004n ± 0% 0.8004n ± 0% ~ (p=0.837 n=10) RotateLeft64 0.6678n ± 0% 0.6670n ± 0% -0.13% (p=0.000 n=10) Reverse 0.8004n ± 0% 0.8004n ± 0% ~ (p=1.000 n=10) Reverse8 0.6989n ± 0% 0.6969n ± 1% ~ (p=0.138 n=10) Reverse16 0.6998n ± 1% 0.7004n ± 1% ~ (p=0.985 n=10) Reverse32 0.4158n ± 1% 0.4159n ± 1% ~ (p=0.870 n=10) Reverse64 0.4165n ± 1% 0.4194n ± 2% ~ (p=0.093 n=10) ReverseBytes 0.8004n ± 0% 0.8004n ± 0% ~ (p=1.000 n=10) ReverseBytes16 0.4183n ± 2% 0.4148n ± 1% ~ (p=0.055 n=10) ReverseBytes32 0.4143n ± 2% 0.4153n ± 1% ~ (p=0.869 n=10) ReverseBytes64 0.4168n ± 1% 0.4177n ± 1% ~ (p=0.184 n=10) Add 1.201n ± 0% 1.201n ± 0% ~ (p=0.087 n=10) Add32 1.603n ± 0% 1.601n ± 0% -0.12% (p=0.000 n=10) Add64 1.201n ± 0% 1.201n ± 0% ~ (p=0.211 n=10) Add64multiple 1.839n ± 0% 1.835n ± 0% -0.24% 
(p=0.001 n=10) Sub 1.202n ± 0% 1.201n ± 0% -0.04% (p=0.033 n=10) Sub32 2.401n ± 0% 1.601n ± 0% -33.32% (p=0.000 n=10) Sub64 1.201n ± 0% 1.201n ± 0% ~ (p=1.000 n=10) Sub64multiple 2.105n ± 0% 2.096n ± 0% -0.40% (p=0.000 n=10) Mul 0.8008n ± 0% 0.8004n ± 0% -0.05% (p=0.000 n=10) Mul32 0.8041n ± 0% 0.8014n ± 0% -0.34% (p=0.000 n=10) Mul64 0.8008n ± 0% 0.8004n ± 0% -0.05% (p=0.000 n=10) Div 8.977n ± 0% 8.945n ± 0% -0.36% (p=0.000 n=10) Div32 4.084n ± 0% 4.086n ± 0% ~ (p=0.445 n=10) Div64 9.316n ± 0% 9.301n ± 0% -0.17% (p=0.000 n=10) geomean 1.141n 1.117n -2.09% Change-Id: I4dc1eaab6728f771bc722ed331fe5c6429bd1037 Reviewed-on: https://go-review.googlesource.com/c/go/+/618475 Reviewed-by: Dmitri Shuralyov <dmitshur@google.com> Reviewed-by: Michael Knyszek <mknyszek@google.com> Reviewed-by: Meidan Li <limeidan@loongson.cn> Reviewed-by: abner chenc <chenguoqi@loongson.cn> Reviewed-by: Qiqi Huang <huangqiqi@loongson.cn> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
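The commit message above gives only benchmark numbers, so as a rough illustration, the sketch below shows the shift-and-mask shape that bitfield-extract patterns usually target. That the loong64 backend matches exactly these forms, and which instruction it emits for them, is an assumption here; the function names are hypothetical.

```go
package main

import "fmt"

// midByte (hypothetical) extracts bits 8..15 of x using a constant shift
// and a constant mask, the usual candidate for a single bitfield extract.
func midByte(x uint64) uint64 {
	return (x >> 8) & 0xFF
}

// extractField (hypothetical) is the general form with a variable position
// and width; it is shown for comparison and is unlikely to map to one
// instruction.
func extractField(x uint64, lsb, width uint) uint64 {
	return (x >> lsb) & (1<<width - 1)
}

func main() {
	fmt.Printf("%#x %#x\n", midByte(0xABCD1234), extractField(0xABCD1234, 8, 8)) // prints 0x12 0x12
}
```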
2024-10-17cmd/compile: optimize loong64 with register indexed load/storeXiaolin Zhao
goos: linux goarch: loong64 pkg: test/bench/go1 cpu: Loongson-3A6000 @ 2500.00MHz | bench.old | bench.new | | sec/op | sec/op vs base | BinaryTree17 7.766 ± 1% 7.640 ± 2% -1.62% (p=0.000 n=20) Fannkuch11 2.649 ± 0% 2.358 ± 0% -10.96% (p=0.000 n=20) FmtFprintfEmpty 35.89n ± 0% 35.87n ± 0% -0.06% (p=0.000 n=20) FmtFprintfString 59.44n ± 0% 57.25n ± 2% -3.68% (p=0.000 n=20) FmtFprintfInt 62.07n ± 0% 60.04n ± 0% -3.27% (p=0.000 n=20) FmtFprintfIntInt 97.90n ± 0% 97.26n ± 0% -0.65% (p=0.000 n=20) FmtFprintfPrefixedInt 116.7n ± 0% 119.2n ± 0% +2.14% (p=0.000 n=20) FmtFprintfFloat 204.5n ± 0% 201.9n ± 0% -1.30% (p=0.000 n=20) FmtManyArgs 455.9n ± 0% 466.8n ± 0% +2.39% (p=0.000 n=20) GobDecode 7.458m ± 1% 7.138m ± 1% -4.28% (p=0.000 n=20) GobEncode 8.573m ± 1% 8.473m ± 1% ~ (p=0.091 n=20) Gzip 280.2m ± 0% 284.9m ± 0% +1.67% (p=0.000 n=20) Gunzip 32.68m ± 0% 32.67m ± 0% ~ (p=0.211 n=20) HTTPClientServer 54.22µ ± 0% 53.24µ ± 0% -1.80% (p=0.000 n=20) JSONEncode 9.427m ± 1% 9.152m ± 0% -2.92% (p=0.000 n=20) JSONDecode 47.08m ± 1% 46.85m ± 1% -0.49% (p=0.007 n=20) Mandelbrot200 4.601m ± 0% 4.605m ± 0% +0.08% (p=0.000 n=20) GoParse 4.776m ± 0% 4.655m ± 1% -2.52% (p=0.000 n=20) RegexpMatchEasy0_32 59.77n ± 0% 57.59n ± 0% -3.66% (p=0.000 n=20) RegexpMatchEasy0_1K 458.1n ± 0% 458.8n ± 0% +0.15% (p=0.000 n=20) RegexpMatchEasy1_32 59.36n ± 0% 59.24n ± 0% -0.20% (p=0.000 n=20) RegexpMatchEasy1_1K 557.7n ± 0% 560.2n ± 0% +0.46% (p=0.000 n=20) RegexpMatchMedium_32 803.1n ± 0% 772.8n ± 0% -3.77% (p=0.000 n=20) RegexpMatchMedium_1K 27.29µ ± 0% 25.88µ ± 0% -5.18% (p=0.000 n=20) RegexpMatchHard_32 1.385µ ± 0% 1.304µ ± 0% -5.85% (p=0.000 n=20) RegexpMatchHard_1K 40.92µ ± 0% 39.58µ ± 0% -3.27% (p=0.000 n=20) Revcomp 474.3m ± 0% 410.0m ± 0% -13.56% (p=0.000 n=20) Template 78.16m ± 0% 76.32m ± 1% -2.36% (p=0.000 n=20) TimeParse 271.8n ± 0% 272.1n ± 0% +0.11% (p=0.000 n=20) TimeFormat 292.3n ± 0% 294.8n ± 0% +0.86% (p=0.000 n=20) geomean 51.98µ 50.82µ -2.22% Change-Id: 
Ia78f1ddee8f1d9ec7192a4b8d2a4ec6058679956 Reviewed-on: https://go-review.googlesource.com/c/go/+/615918 Reviewed-by: Qiqi Huang <huangqiqi@loongson.cn> Reviewed-by: Dmitri Shuralyov <dmitshur@google.com> Reviewed-by: Michael Knyszek <mknyszek@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: abner chenc <chenguoqi@loongson.cn>
2024-09-17runtime: move getcallerpc to internal/runtime/sysMichael Pratt
Moving these intrinsics to a base package enables other internal/runtime packages to use them. For #54766. Change-Id: I0b3eded3bb45af53e3eb5bab93e3792e6a8beb46 Reviewed-on: https://go-review.googlesource.com/c/go/+/613260 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Cherry Mui <cherryyz@google.com>
2024-09-13cmd/compile: optimize math.Float64(32)bits and math.Float64(32)frombits on ↵Xiaolin Zhao
loong64 Use float <-> int register moves without conversion instead of stores and loads to move float <-> int values like arm64 and mips64. goos: linux goarch: loong64 pkg: math cpu: Loongson-3A6000 @ 2500.00MHz │ bench.old │ bench.new │ │ sec/op │ sec/op vs base │ Acos 15.98n ± 0% 15.94n ± 0% -0.25% (p=0.000 n=20) Acosh 27.75n ± 0% 25.56n ± 0% -7.89% (p=0.000 n=20) Asin 15.85n ± 0% 15.76n ± 0% -0.57% (p=0.000 n=20) Asinh 39.79n ± 0% 37.69n ± 0% -5.28% (p=0.000 n=20) Atan 7.261n ± 0% 7.242n ± 0% -0.27% (p=0.000 n=20) Atanh 28.30n ± 0% 27.62n ± 0% -2.40% (p=0.000 n=20) Atan2 15.85n ± 0% 15.75n ± 0% -0.63% (p=0.000 n=20) Cbrt 27.02n ± 0% 21.08n ± 0% -21.98% (p=0.000 n=20) Ceil 2.830n ± 1% 2.896n ± 1% +2.31% (p=0.000 n=20) Copysign 0.8022n ± 0% 0.8004n ± 0% -0.22% (p=0.000 n=20) Cos 11.64n ± 0% 11.61n ± 0% -0.26% (p=0.000 n=20) Cosh 35.98n ± 0% 33.44n ± 0% -7.05% (p=0.000 n=20) Erf 10.09n ± 0% 10.08n ± 0% -0.10% (p=0.000 n=20) Erfc 11.40n ± 0% 11.35n ± 0% -0.44% (p=0.000 n=20) Erfinv 12.31n ± 0% 12.29n ± 0% -0.16% (p=0.000 n=20) Erfcinv 12.16n ± 0% 12.17n ± 0% +0.08% (p=0.000 n=20) Exp 28.41n ± 0% 26.44n ± 0% -6.95% (p=0.000 n=20) ExpGo 28.68n ± 0% 27.07n ± 0% -5.60% (p=0.000 n=20) Expm1 17.21n ± 0% 16.75n ± 0% -2.67% (p=0.000 n=20) Exp2 24.71n ± 0% 23.01n ± 0% -6.88% (p=0.000 n=20) Exp2Go 25.17n ± 0% 23.91n ± 0% -4.99% (p=0.000 n=20) Abs 0.8004n ± 0% 0.8004n ± 0% ~ (p=0.224 n=20) Dim 1.201n ± 0% 1.201n ± 0% ~ (p=1.000 n=20) ¹ Floor 2.848n ± 0% 2.859n ± 0% +0.39% (p=0.000 n=20) Max 3.074n ± 0% 3.071n ± 0% ~ (p=0.481 n=20) Min 3.179n ± 0% 3.176n ± 0% -0.09% (p=0.003 n=20) Mod 49.62n ± 0% 44.82n ± 0% -9.67% (p=0.000 n=20) Frexp 7.604n ± 0% 6.803n ± 0% -10.53% (p=0.000 n=20) Gamma 18.01n ± 0% 17.61n ± 0% -2.22% (p=0.000 n=20) Hypot 7.204n ± 0% 7.604n ± 0% +5.55% (p=0.000 n=20) HypotGo 7.204n ± 0% 7.604n ± 0% +5.56% (p=0.000 n=20) Ilogb 6.003n ± 0% 6.003n ± 0% ~ (p=0.407 n=20) J0 76.43n ± 0% 76.24n ± 0% -0.25% (p=0.000 n=20) J1 76.44n ± 0% 76.44n ± 0% ~ (p=1.000 n=20) Jn 
168.2n ± 0% 168.5n ± 0% +0.18% (p=0.000 n=20) Ldexp 8.804n ± 0% 7.604n ± 0% -13.63% (p=0.000 n=20) Lgamma 19.01n ± 0% 19.01n ± 0% ~ (p=0.695 n=20) Log 19.38n ± 0% 19.12n ± 0% -1.34% (p=0.000 n=20) Logb 6.003n ± 0% 6.003n ± 0% ~ (p=1.000 n=20) Log1p 18.57n ± 0% 16.72n ± 0% -9.96% (p=0.000 n=20) Log10 20.67n ± 0% 20.45n ± 0% -1.06% (p=0.000 n=20) Log2 9.605n ± 0% 8.804n ± 0% -8.34% (p=0.000 n=20) Modf 4.402n ± 0% 4.402n ± 0% ~ (p=1.000 n=20) Nextafter32 7.204n ± 0% 5.603n ± 0% -22.22% (p=0.000 n=20) Nextafter64 6.803n ± 0% 6.003n ± 0% -11.76% (p=0.000 n=20) PowInt 39.62n ± 0% 37.22n ± 0% -6.06% (p=0.000 n=20) PowFrac 120.9n ± 0% 108.9n ± 0% -9.93% (p=0.000 n=20) Pow10Pos 1.601n ± 0% 1.601n ± 0% ~ (p=0.487 n=20) Pow10Neg 2.675n ± 0% 2.675n ± 0% ~ (p=1.000 n=20) Round 3.018n ± 0% 2.401n ± 0% -20.46% (p=0.000 n=20) RoundToEven 3.822n ± 0% 3.001n ± 0% -21.48% (p=0.000 n=20) Remainder 45.62n ± 0% 42.42n ± 0% -7.01% (p=0.000 n=20) Signbit 0.9075n ± 0% 0.8004n ± 0% -11.81% (p=0.000 n=20) Sin 12.65n ± 0% 12.65n ± 0% ~ (p=0.503 n=20) Sincos 14.81n ± 0% 14.60n ± 0% -1.42% (p=0.000 n=20) Sinh 36.75n ± 0% 35.11n ± 0% -4.46% (p=0.000 n=20) SqrtIndirect 1.201n ± 0% 1.201n ± 0% ~ (p=1.000 n=20) ¹ SqrtLatency 4.002n ± 0% 4.002n ± 0% ~ (p=1.000 n=20) SqrtIndirectLatency 4.002n ± 0% 4.002n ± 0% ~ (p=1.000 n=20) SqrtGoLatency 52.85n ± 0% 40.82n ± 0% -22.76% (p=0.000 n=20) SqrtPrime 887.4n ± 0% 887.4n ± 0% ~ (p=0.751 n=20) Tan 13.95n ± 0% 13.97n ± 0% +0.18% (p=0.000 n=20) Tanh 36.79n ± 0% 34.89n ± 0% -5.16% (p=0.000 n=20) Trunc 2.849n ± 0% 2.861n ± 0% +0.42% (p=0.000 n=20) Y0 77.44n ± 0% 77.64n ± 0% +0.26% (p=0.000 n=20) Y1 74.41n ± 0% 74.33n ± 0% -0.11% (p=0.000 n=20) Yn 158.7n ± 0% 159.0n ± 0% +0.19% (p=0.000 n=20) Float64bits 0.8774n ± 0% 0.4002n ± 0% -54.39% (p=0.000 n=20) Float64frombits 0.8042n ± 0% 0.4002n ± 0% -50.24% (p=0.000 n=20) Float32bits 1.1230n ± 0% 0.5336n ± 0% -52.48% (p=0.000 n=20) Float32frombits 1.0670n ± 0% 0.8004n ± 0% -24.99% (p=0.000 n=20) FMA 2.001n ± 0% 2.001n 
± 0% ~ (p=0.605 n=20) geomean 10.87n 10.10n -7.15% ¹ all samples are equal goos: linux goarch: loong64 pkg: math cpu: Loongson-3A5000 @ 2500.00MHz │ bench.old │ bench.new │ │ sec/op │ sec/op vs base │ Acos 33.10n ± 0% 31.95n ± 2% -3.46% (p=0.000 n=20) Acosh 58.38n ± 0% 50.44n ± 0% -13.60% (p=0.000 n=20) Asin 32.70n ± 0% 31.94n ± 0% -2.32% (p=0.000 n=20) Asinh 57.65n ± 0% 50.83n ± 0% -11.82% (p=0.000 n=20) Atan 14.21n ± 0% 14.21n ± 0% ~ (p=0.501 n=20) Atanh 60.86n ± 0% 54.44n ± 0% -10.56% (p=0.000 n=20) Atan2 32.02n ± 0% 34.02n ± 0% +6.25% (p=0.000 n=20) Cbrt 55.58n ± 0% 40.64n ± 0% -26.88% (p=0.000 n=20) Ceil 9.566n ± 0% 9.566n ± 0% ~ (p=0.463 n=20) Copysign 0.8005n ± 0% 0.8005n ± 0% ~ (p=0.806 n=20) Cos 18.02n ± 0% 18.02n ± 0% ~ (p=0.191 n=20) Cosh 64.44n ± 0% 65.64n ± 0% +1.86% (p=0.000 n=20) Erf 16.15n ± 0% 16.16n ± 0% ~ (p=0.770 n=20) Erfc 18.71n ± 0% 18.83n ± 0% +0.61% (p=0.000 n=20) Erfinv 19.33n ± 0% 19.34n ± 0% ~ (p=0.513 n=20) Erfcinv 18.90n ± 0% 19.78n ± 0% +4.63% (p=0.000 n=20) Exp 50.04n ± 0% 49.66n ± 0% -0.75% (p=0.000 n=20) ExpGo 50.03n ± 0% 50.03n ± 0% ~ (p=0.723 n=20) Expm1 28.41n ± 0% 28.27n ± 0% -0.49% (p=0.000 n=20) Exp2 50.08n ± 0% 51.23n ± 0% +2.31% (p=0.000 n=20) Exp2Go 49.77n ± 0% 49.89n ± 0% +0.24% (p=0.000 n=20) Abs 0.8009n ± 0% 0.8006n ± 0% ~ (p=0.317 n=20) Dim 1.987n ± 0% 1.993n ± 0% +0.28% (p=0.001 n=20) Floor 8.543n ± 0% 8.548n ± 0% ~ (p=0.509 n=20) Max 6.670n ± 0% 6.672n ± 0% ~ (p=0.335 n=20) Min 6.694n ± 0% 6.694n ± 0% ~ (p=0.459 n=20) Mod 56.44n ± 0% 53.23n ± 0% -5.70% (p=0.000 n=20) Frexp 8.409n ± 0% 7.606n ± 0% -9.55% (p=0.000 n=20) Gamma 35.64n ± 0% 35.23n ± 0% -1.15% (p=0.000 n=20) Hypot 11.21n ± 0% 10.61n ± 0% -5.31% (p=0.000 n=20) HypotGo 11.50n ± 0% 11.01n ± 0% -4.30% (p=0.000 n=20) Ilogb 7.606n ± 0% 6.804n ± 0% -10.54% (p=0.000 n=20) J0 125.3n ± 0% 126.5n ± 0% +0.96% (p=0.000 n=20) J1 124.9n ± 0% 125.3n ± 0% +0.32% (p=0.000 n=20) Jn 264.3n ± 0% 265.9n ± 0% +0.61% (p=0.000 n=20) Ldexp 9.606n ± 0% 9.204n ± 0% -4.19% (p=0.000 
n=20) Lgamma 38.82n ± 0% 38.85n ± 0% +0.06% (p=0.019 n=20) Log 38.44n ± 0% 28.04n ± 0% -27.06% (p=0.000 n=20) Logb 8.405n ± 0% 7.605n ± 0% -9.52% (p=0.000 n=20) Log1p 31.62n ± 0% 27.11n ± 0% -14.26% (p=0.000 n=20) Log10 38.83n ± 0% 28.42n ± 0% -26.81% (p=0.000 n=20) Log2 11.21n ± 0% 10.41n ± 0% -7.14% (p=0.000 n=20) Modf 5.204n ± 0% 5.205n ± 0% ~ (p=0.983 n=20) Nextafter32 8.809n ± 0% 7.208n ± 0% -18.18% (p=0.000 n=20) Nextafter64 8.405n ± 0% 8.406n ± 0% +0.01% (p=0.007 n=20) PowInt 48.83n ± 0% 44.78n ± 0% -8.28% (p=0.000 n=20) PowFrac 146.9n ± 0% 142.1n ± 0% -3.23% (p=0.000 n=20) Pow10Pos 2.334n ± 0% 2.333n ± 0% ~ (p=0.110 n=20) Pow10Neg 4.803n ± 0% 4.803n ± 0% ~ (p=0.130 n=20) Round 4.816n ± 0% 3.819n ± 0% -20.70% (p=0.000 n=20) RoundToEven 5.735n ± 0% 5.204n ± 0% -9.26% (p=0.000 n=20) Remainder 52.05n ± 0% 49.64n ± 0% -4.63% (p=0.000 n=20) Signbit 1.201n ± 0% 1.001n ± 0% -16.65% (p=0.000 n=20) Sin 20.63n ± 0% 20.64n ± 0% +0.05% (p=0.040 n=20) Sincos 23.82n ± 0% 24.62n ± 0% +3.36% (p=0.000 n=20) Sinh 71.25n ± 0% 68.44n ± 0% -3.94% (p=0.000 n=20) SqrtIndirect 2.001n ± 0% 2.001n ± 0% ~ (p=0.182 n=20) SqrtLatency 4.003n ± 0% 4.003n ± 0% ~ (p=0.754 n=20) SqrtIndirectLatency 4.003n ± 0% 4.003n ± 0% ~ (p=0.773 n=20) SqrtGoLatency 60.84n ± 0% 81.26n ± 0% +33.56% (p=0.000 n=20) SqrtPrime 1.791µ ± 0% 1.791µ ± 0% ~ (p=0.784 n=20) Tan 27.22n ± 0% 27.22n ± 0% ~ (p=0.819 n=20) Tanh 70.88n ± 0% 69.04n ± 0% -2.60% (p=0.000 n=20) Trunc 8.543n ± 0% 8.543n ± 0% ~ (p=0.784 n=20) Y0 122.9n ± 0% 122.9n ± 0% ~ (p=0.559 n=20) Y1 123.3n ± 0% 121.7n ± 0% -1.30% (p=0.000 n=20) Yn 263.0n ± 0% 262.6n ± 0% -0.15% (p=0.000 n=20) Float64bits 1.2010n ± 0% 0.6004n ± 0% -50.01% (p=0.000 n=20) Float64frombits 1.2010n ± 0% 0.6004n ± 0% -50.01% (p=0.000 n=20) Float32bits 1.7010n ± 0% 0.8005n ± 0% -52.94% (p=0.000 n=20) Float32frombits 1.5010n ± 0% 0.8005n ± 0% -46.67% (p=0.000 n=20) FMA 2.001n ± 0% 2.001n ± 0% ~ (p=0.238 n=20) geomean 17.41n 16.15n -7.19% Change-Id: 
I0a0c263af2f07203eab1782e69c706f20c689d8d Reviewed-on: https://go-review.googlesource.com/c/go/+/604737 Auto-Submit: Tim King <taking@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Dmitri Shuralyov <dmitshur@google.com> Reviewed-by: Meidan Li <limeidan@loongson.cn> Reviewed-by: Tim King <taking@google.com> Reviewed-by: abner chenc <chenguoqi@loongson.cn>
2024-08-07cmd/compile: fix loong64 MINF → FMINF name and friendsJorropo
CL 580283 left cmd/compile/internal/ssa/_gen/ in a state where `go run *.go` would always fail. Change-Id: I0b3aea9b3f6275cb17c552898c5034e15f0107d5 Reviewed-on: https://go-review.googlesource.com/c/go/+/603995 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Keith Randall <khr@google.com> Reviewed-by: David Chase <drchase@google.com>
2024-08-07cmd/compile, math: make math.{Abs,Copysign} intrinsics on loong64Xiaolin Zhao
goos: linux
goarch: loong64
pkg: math
cpu: Loongson-3A6000 @ 2500.00MHz
          │  old.bench   │  new.bench                          │
          │    sec/op    │   sec/op     vs base                │
Copysign    1.9710n ± 0%   0.8006n ± 0%  -59.38% (p=0.000 n=10)
Abs         1.8745n ± 0%   0.8006n ± 0%  -57.29% (p=0.000 n=10)
geomean     1.922n         0.8006n       -58.35%

goos: linux
goarch: loong64
pkg: math
cpu: Loongson-3A5000 @ 2500.00MHz
          │  old.bench   │  new.bench                          │
          │    sec/op    │   sec/op     vs base                │
Copysign    2.4020n ± 0%   0.9006n ± 0%  -62.51% (p=0.000 n=10)
Abs         2.4020n ± 0%   0.8005n ± 0%  -66.67% (p=0.000 n=10)
geomean     2.402n         0.8491n       -64.65%

Updates #59120.

Change-Id: Ic409e1f4d15ad15cb3568a5aaa100046e9302842
Reviewed-on: https://go-review.googlesource.com/c/go/+/580280
Reviewed-by: Qiqi Huang <huangqiqi@loongson.cn>
Reviewed-by: Meidan Li <limeidan@loongson.cn>
Reviewed-by: abner chenc <chenguoqi@loongson.cn>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: David Chase <drchase@google.com>
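As a reminder of what the two intrinsified functions do, the sketch below combines them; `withSignOf` is a hypothetical helper, not something from the CL.

```go
package main

import (
	"fmt"
	"math"
)

// withSignOf (hypothetical helper) returns the magnitude of mag carrying
// the sign of sign.
func withSignOf(mag, sign float64) float64 {
	return math.Copysign(math.Abs(mag), sign)
}

func main() {
	fmt.Println(withSignOf(-3, 1), withSignOf(3, -2)) // prints 3 -3
}
```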
2024-08-07cmd/compile, math: improve implementation of math.{Max,Min} on loong64Xiaolin Zhao
Make math.{Min,Max} intrinsics and implement math.{archMax,archMin} in hardware.

goos: linux
goarch: loong64
pkg: math
cpu: Loongson-3A6000 @ 2500.00MHz
          │  old.bench   │  new.bench                          │
          │    sec/op    │   sec/op     vs base                │
Max          7.606n ± 0%   3.087n ± 0%  -59.41% (p=0.000 n=20)
Min          7.205n ± 0%   2.904n ± 0%  -59.69% (p=0.000 n=20)
MinFloat    37.220n ± 0%   4.802n ± 0%  -87.10% (p=0.000 n=20)
MaxFloat    33.620n ± 0%   4.802n ± 0%  -85.72% (p=0.000 n=20)
geomean      16.18n        3.792n       -76.57%

goos: linux
goarch: loong64
pkg: runtime
cpu: Loongson-3A5000 @ 2500.00MHz
          │  old.bench   │  new.bench                          │
          │    sec/op    │   sec/op     vs base                │
Max         10.010n ± 0%   7.196n ± 0%  -28.11% (p=0.000 n=20)
Min          8.806n ± 0%   7.155n ± 0%  -18.75% (p=0.000 n=20)
MinFloat    60.010n ± 0%   7.976n ± 0%  -86.71% (p=0.000 n=20)
MaxFloat    56.410n ± 0%   7.980n ± 0%  -85.85% (p=0.000 n=20)
geomean      23.37n        7.566n       -67.63%

Updates #59120.

Change-Id: I6815d20bc304af3cbf5d6ca8fe0ca1c2ddebea2d
Reviewed-on: https://go-review.googlesource.com/c/go/+/580283
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Qiqi Huang <huangqiqi@loongson.cn>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: abner chenc <chenguoqi@loongson.cn>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: David Chase <drchase@google.com>
2023-11-21cmd/compile: update loong64 CALL* opsGuoqi Chen
Allow the loong64 CALL* ops to take a variable number of args.

Update #40724

Co-authored-by: Xiaolin Zhao <zhaoxiaolin@loongson.cn>
Change-Id: I4706d9651fcbf9a0f201af6820c97b1a924f14e3
Reviewed-on: https://go-review.googlesource.com/c/go/+/521781
Auto-Submit: David Chase <drchase@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Run-TryBot: David Chase <drchase@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Meidan Li <limeidan@loongson.cn>
Reviewed-by: David Chase <drchase@google.com>
2023-11-21cmd/compile/internal: add register info for loong64 regABIGuoqi Chen
Update #40724

Co-authored-by: Xiaolin Zhao <zhaoxiaolin@loongson.cn>
Change-Id: Ifd7d94147b01e4fc83978b53dca2bcc0ad1ac4e3
Reviewed-on: https://go-review.googlesource.com/c/go/+/521779
Reviewed-by: David Chase <drchase@google.com>
Run-TryBot: David Chase <drchase@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Auto-Submit: David Chase <drchase@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Meidan Li <limeidan@loongson.cn>
2023-11-21cmd/compile,cmd/internal,runtime: change registers on loong64 to avoid regABI argumentsGuoqi Chen
Update #40724

Co-authored-by: Xiaolin Zhao <zhaoxiaolin@loongson.cn>
Change-Id: Ic7e2e7fb4c1d3670e6abbfb817aa6e4e654e08d3
Reviewed-on: https://go-review.googlesource.com/c/go/+/521777
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Meidan Li <limeidan@loongson.cn>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Than McIntosh <thanm@google.com>
Auto-Submit: David Chase <drchase@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: David Chase <drchase@google.com>
2023-11-21cmd/compile, cmd/internal, runtime: change the registers used by the duff device for loong64Guoqi Chen
Add R21 to the allocatable registers, use R20 and R21 in duff device.

This CL is in preparation for subsequent regABI support.

Updates #40724

Co-authored-by: Xiaolin Zhao <zhaoxiaolin@loongson.cn>
Change-Id: If1661adc0f766925fbe74827a369797f95fa28a9
Reviewed-on: https://go-review.googlesource.com/c/go/+/521775
Reviewed-by: David Chase <drchase@google.com>
Run-TryBot: David Chase <drchase@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Meidan Li <limeidan@loongson.cn>
Reviewed-by: Than McIntosh <thanm@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
2023-10-20cmd/compile: improve the implementation of Lowered{Move,Zero} on linux/loong64Guoqi Chen
As in CL 487295: when implementing Lowered{Move,Zero}, 8 is first subtracted from Rarg0 (parameter Ptr), and an offset of 8 is then added in every subsequent operation on Rarg0. The two adjustments cancel out, so delete them.

Change LoweredMove's Rarg0 register to R20, consistent with duffcopy.

goos: linux
goarch: loong64
pkg: runtime
cpu: Loongson-3C5000 @ 2200.00MHz
                              │  old.bench  │             new.bench             │
                              │   sec/op    │   sec/op     vs base              │
Memmove/15                      19.10n ± 0%   19.10n ± 0%        ~ (p=0.483 n=15)
MemmoveUnalignedDst/15          25.02n ± 0%   25.02n ± 0%        ~ (p=0.741 n=15)
MemmoveUnalignedDst/32          48.22n ± 0%   48.22n ± 0%        ~ (p=1.000 n=15) ¹
MemmoveUnalignedDst/64          90.57n ± 0%   90.52n ± 0%        ~ (p=0.212 n=15)
MemmoveUnalignedDstOverlap/32   44.12n ± 0%   44.13n ± 0%   +0.02% (p=0.000 n=15)
MemmoveUnalignedDstOverlap/64   87.79n ± 0%   87.80n ± 0%   +0.01% (p=0.002 n=15)
MemmoveUnalignedSrc/0           3.639n ± 0%   3.639n ± 0%        ~ (p=1.000 n=15) ¹
MemmoveUnalignedSrc/1           7.733n ± 0%   7.733n ± 0%        ~ (p=1.000 n=15)
MemmoveUnalignedSrc/2           9.097n ± 0%   9.097n ± 0%        ~ (p=1.000 n=15)
MemmoveUnalignedSrc/3           10.46n ± 0%   10.46n ± 0%        ~ (p=1.000 n=15) ¹
MemmoveUnalignedSrc/4           11.83n ± 0%   11.83n ± 0%        ~ (p=1.000 n=15) ¹
MemmoveUnalignedSrc/64          93.71n ± 0%   93.70n ± 0%        ~ (p=0.128 n=15)
Memclr/4096                     699.1n ± 0%   699.1n ± 0%        ~ (p=0.682 n=15)
Memclr/65536                    11.18µ ± 0%   11.18µ ± 0%   -0.01% (p=0.000 n=15)
Memclr/1M                       175.2µ ± 0%   175.2µ ± 0%        ~ (p=0.191 n=15)
Memclr/4M                       661.8µ ± 0%   662.0µ ± 0%        ~ (p=0.486 n=15)
MemclrUnaligned/4_5             19.39n ± 0%   20.47n ± 0%   +5.57% (p=0.000 n=15)
MemclrUnaligned/4_16            22.29n ± 0%   21.38n ± 0%   -4.08% (p=0.000 n=15)
MemclrUnaligned/4_64            30.58n ± 0%   29.81n ± 0%   -2.52% (p=0.000 n=15)
MemclrUnaligned/4_65536         11.19µ ± 0%   11.20µ ± 0%   +0.02% (p=0.000 n=15)
GoMemclr/5                      12.73n ± 0%   12.73n ± 0%        ~ (p=0.261 n=15)
GoMemclr/16                     10.01n ± 0%   10.00n ± 0%        ~ (p=0.264 n=15)
GoMemclr/256                    50.94n ± 0%   50.94n ± 0%        ~ (p=0.372 n=15)
ClearFat15                      14.95n ± 0%   15.01n ± 4%        ~ (p=0.925 n=15)
ClearFat1032                    125.5n ± 0%   125.6n ± 0%   +0.08% (p=0.000 n=15)
CopyFat64                       10.58n ± 0%   10.01n ± 0%   -5.39% (p=0.000 n=15)
CopyFat1040                     244.3n ± 0%   155.6n ± 0%  -36.31% (p=0.000 n=15)
Issue18740/2byte                29.82µ ± 0%   29.82µ ± 0%        ~ (p=0.648 n=30)
Issue18740/4byte                18.18µ ± 0%   18.18µ ± 0%   -0.02% (p=0.001 n=30)
Issue18740/8byte                8.395µ ± 0%   8.395µ ± 0%        ~ (p=0.401 n=30)
geomean                         154.5n        151.8n        -1.70%
¹ all samples are equal

Change-Id: Ia3f3c8b25e1e93c97ab72328651de78ca9dec016
Reviewed-on: https://go-review.googlesource.com/c/go/+/488515
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Bryan Mills <bcmills@google.com>
Auto-Submit: Ian Lance Taylor <iant@golang.org>
Reviewed-by: WANG Xuerui <git@xen0n.name>
Reviewed-by: xiaodong liu <teaofmoli@gmail.com>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Meidan Li <limeidan@loongson.cn>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2023-09-01runtime: remove the meaningless offset of 8 for duffzero on loong64Guoqi Chen
Currently we subtract 8 from offset when calling duffzero because 8 is added to offset in the duffzero implementation. This operation is meaningless, so remove it.

Change-Id: I7e451d04d7e98ccafe711645d81d3aadf376766f
Reviewed-on: https://go-review.googlesource.com/c/go/+/487295
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Meidan Li <limeidan@loongson.cn>
Reviewed-by: WANG Xuerui <git@xen0n.name>
Run-TryBot: WANG Xuerui <git@xen0n.name>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: xiaodong liu <teaofmoli@gmail.com>
Reviewed-by: Carlos Amedee <carlos@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Auto-Submit: Ian Lance Taylor <iant@golang.org>
2023-04-11cmd/compile: split DIVV/DIVVU op on loong64Wayne Zuo
Previously, we needed to calculate both the quotient and the remainder together. However, in most cases only one result is needed. By separating these instructions, we can save one instruction in most cases.

Change-Id: I0a2d4167cda68ab606783ba1aa2720ede19d6b53
Reviewed-on: https://go-review.googlesource.com/c/go/+/475315
Reviewed-by: Than McIntosh <thanm@google.com>
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
Reviewed-by: abner chenc <chenguoqi@loongson.cn>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
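The three source-level shapes this change distinguishes can be sketched as follows. The function names are hypothetical, and which machine instructions the compiler actually selects for each is not verified here; the sketch only shows that a function commonly needs just one of the two results.

```go
package main

import "fmt"

// quotientOnly needs just the quotient of a signed 64-bit division.
func quotientOnly(a, b int64) int64 { return a / b }

// remainderOnly needs just the remainder.
func remainderOnly(a, b int64) int64 { return a % b }

// both needs the quotient and the remainder of the same operands.
func both(a, b int64) (q, r int64) { return a / b, a % b }

func main() {
	// Go integer division truncates toward zero, so the remainder
	// takes the sign of the dividend.
	fmt.Println(quotientOnly(-7, 2))  // -3
	fmt.Println(remainderOnly(-7, 2)) // -1
	fmt.Println(both(-7, 2))          // -3 -1
}
```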
2023-03-03cmd/compile: optimize multiplication on loong64Wayne Zuo
Previously, multiplication on the loong64 architecture was performed using the MULV and MULHVU instructions to calculate the low 64 bits and the high 64 bits of the product, respectively. However, in most cases only the low 64 bits are needed. This commit enables computing only the low 64-bit result with the MULV instruction.

This reduces the binary size slightly.

file          before      after       Δ         %
addr2line     2833777     2833849     +72       +0.003%
asm           5267499     5266963     -536      -0.010%
buildid       2579706     2579402     -304      -0.012%
cgo           4798260     4797444     -816      -0.017%
compile       25247419    25175030    -72389    -0.287%
cover         4973091     4972027     -1064     -0.021%
dist          3631013     3565653     -65360    -1.800%
doc           4076036     4074004     -2032     -0.050%
fix           3496378     3496066     -312      -0.009%
link          6984102     6983214     -888      -0.013%
nm            2743820     2743516     -304      -0.011%
objdump       4277171     4277035     -136      -0.003%
pack          2379248     2378872     -376      -0.016%
pprof         14419090    14419874    +784      +0.005%
test2json     2684386     2684018     -368      -0.014%
trace         13640018    13631034    -8984     -0.066%
vet           7748918     7752630     +3712     +0.048%
go            15643850    15638098    -5752     -0.037%
total         127423782   127268729   -155053   -0.122%

Change-Id: Ifce4a9a3ed1d03c170681e39cb6f3541db9882dc
Reviewed-on: https://go-review.googlesource.com/c/go/+/472775
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
Reviewed-by: David Chase <drchase@google.com>
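The difference between needing only the low half and needing both halves of a product can be illustrated at the source level. This is a sketch with hypothetical function names: an ordinary `a * b` on uint64 keeps only the low 64 bits, while `math/bits.Mul64` returns the full 128-bit product as two words. The mapping to MULV and MULHVU is the one described in the commit message and is not re-verified here.

```go
package main

import (
	"fmt"
	"math/bits"
)

// lowOnly keeps just the low 64 bits of the product, which is
// what an ordinary Go multiplication produces.
func lowOnly(a, b uint64) uint64 { return a * b }

// highAndLow returns both 64-bit halves of the 128-bit product.
func highAndLow(a, b uint64) (hi, lo uint64) { return bits.Mul64(a, b) }

func main() {
	a, b := uint64(1)<<63, uint64(6) // product is 3 * 2^64

	fmt.Println(lowOnly(a, b)) // 0: the low word of 3 * 2^64

	hi, lo := highAndLow(a, b)
	fmt.Println(hi, lo) // 3 0
}
```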
2023-02-24cmd/compile: batch write barrier callsKeith Randall
Have the write barrier call return a pointer to a buffer into which the generated code records pointers that need write barrier treatment.

Change-Id: I7871764298e0aa1513de417010c8d46b296b199e
Reviewed-on: https://go-review.googlesource.com/c/go/+/447781
Reviewed-by: Keith Randall <khr@google.com>
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Bypass: Keith Randall <khr@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
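The batching idea can be modeled in ordinary Go as a toy, assuming a fixed-capacity buffer that is flushed when full. The type `wbBuf` and its methods are hypothetical and are not the runtime's actual data structures; the sketch only shows one call reserving space for several pointers, which the caller then fills in itself.

```go
package main

import "fmt"

// wbBuf is a toy stand-in for a write barrier buffer (hypothetical,
// not the runtime's real type).
type wbBuf struct {
	slots   []uintptr // recorded pointers awaiting processing
	flushed int       // how many pointers have been flushed so far
}

// reserve returns n free slots for the caller to fill, flushing the
// buffer first if there is not enough room. It assumes n never
// exceeds the buffer's capacity.
func (b *wbBuf) reserve(n int) []uintptr {
	if len(b.slots)+n > cap(b.slots) {
		b.flush()
	}
	start := len(b.slots)
	b.slots = b.slots[:start+n]
	return b.slots[start:]
}

// flush hands the recorded pointers off (here it just counts them).
func (b *wbBuf) flush() {
	b.flushed += len(b.slots)
	b.slots = b.slots[:0]
}

func main() {
	buf := &wbBuf{slots: make([]uintptr, 0, 4)}

	// One call covers three pointer writes.
	s := buf.reserve(3)
	s[0], s[1], s[2] = 0x10, 0x20, 0x30

	// Two more do not fit in the remaining space, so the buffer
	// is flushed before the slots are handed out.
	s = buf.reserve(2)
	s[0], s[1] = 0x40, 0x50

	fmt.Println(buf.flushed, len(buf.slots)) // 3 2
}
```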