| Age | Commit message | Author |
|
Fixes #78541
Change-Id: I73ba10b6d34f9f189b5bdd356d6325d5a4a6985f
GitHub-Last-Rev: 0594d99f55c51f2f164d17a61c4eb1b2bbb8462e
GitHub-Pull-Request: golang/go#78542
Reviewed-on: https://go-review.googlesource.com/c/go/+/763000
Auto-Submit: Robert Griesemer <gri@google.com>
Reviewed-by: Robert Griesemer <gri@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Neal Patel <nealpatel@google.com>
|
|
This overhauls writeMultiple to reduce allocations:
- return early if called with a count of 0 rather than allocating a buffer
- if writing one byte, which happens when padding, try io.ByteWriter
- try io.StringWriter
- fall back to io.Writer as currently done
Unlike what is suggested in #71465, I did not use io.WriteString,
to avoid a regression where we would allocate the byte slice
once per iteration of count.
goos: linux
goarch: amd64
pkg: math/big
cpu: AMD Ryzen 5 3600 6-Core Processor
          │  /tmp/old   │              /tmp/new              │
          │   sec/op    │   sec/op     vs base               │
Format-12   70.45µ ± 1%   63.55µ ± 2%   -9.80% (p=0.000 n=10)

          │   /tmp/old   │              /tmp/new               │
          │     B/op     │     B/op      vs base               │
Format-12   14.73Ki ± 0%   10.96Ki ± 0%  -25.58% (p=0.000 n=10)

          │  /tmp/old   │             /tmp/new              │
          │  allocs/op  │ allocs/op   vs base               │
Format-12   1098.0 ± 0%   649.0 ± 0%  -40.89% (p=0.000 n=10)
Fixes #71465
Change-Id: I44565c540b2d73c8737ac9733687141b645d9856
Reviewed-on: https://go-review.googlesource.com/c/go/+/645215
Reviewed-by: Robert Griesemer <gri@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: David Chase <drchase@google.com>
Auto-Submit: Jorropo <jorropo.pgm@gmail.com>
|
|
Change-Id: I1aac166aea4f907a7fb93028a39ef9d1e3888c9c
Reviewed-on: https://go-review.googlesource.com/c/go/+/743800
Reviewed-by: Michael Pratt <mpratt@google.com>
Auto-Submit: Michael Pratt <mpratt@google.com>
Reviewed-by: Junyang Shao <shaojunyang@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: abner chenc <chenguoqi@loongson.cn>
|
|
This change mechanically replaces all occurrences of interface{}
by 'any' (where deemed safe by the 'any' modernizer) throughout
std and cmd, minus their vendor trees.
Since these fixes are relatively numerous, they get their own CL.
Also, 'go generate go/types'.
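As a small illustration of why the rewrite is mechanical in most places but still needs a safety check: 'any' is a predeclared alias for interface{}, yet it is an ordinary identifier that a local declaration can shadow. The sketch below uses a hypothetical function describe and a deliberately contrived shadowing declaration; it is an assumption about the kind of case the modernizer guards against, not a description of its implementation.

```go
package main

import "fmt"

// describe reports the dynamic type of v. The parameter type could
// equally be spelled interface{}; the two spellings are the same type.
func describe(v any) string {
	return fmt.Sprintf("%T", v)
}

func main() {
	var old interface{} = 42
	var cur any = old // no conversion needed: identical types
	fmt.Println(describe(old), describe(cur)) // int int

	{
		// Contrived: inside this block, any no longer means interface{}.
		type any = string
		var s any = "text"
		// Rewriting this interface{} to any would change its meaning here.
		var e interface{} = 42
		fmt.Println(describe(s), describe(e)) // string int
	}
}
```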
Change-Id: I14a6b52856c3291c1d27935409bca8d5fd4242a2
Reviewed-on: https://go-review.googlesource.com/c/go/+/719702
Commit-Queue: Alan Donovan <adonovan@google.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
Auto-Submit: Alan Donovan <adonovan@google.com>
|
|
According to the MIPS ABI, R26/R27 are reserved for the OS kernel, which may clobber them. User-mode code must not use them.
See Figure 3-18 of MIPS ELF ABI specification: https://refspecs.linuxfoundation.org/elf/mipsabi.pdf
Fixes #73472
Change-Id: Ifda692a803176bfaab2c70d6623636c5d135f42e
Reviewed-on: https://go-review.googlesource.com/c/go/+/667816
Reviewed-by: Alan Donovan <adonovan@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Reviewed-by: Keith Randall <khr@google.com>
|
|
Checking that the lengths are equal and panicking teaches the compiler
that it can assume “i in range for z” implies “i in range for x”, letting us
simplify the actual loops a bit.
It also turns up a few places in math/big that were playing maybe a little
too fast and loose with slice lengths. Update those to explicitly set all the
input slices to the same length.
These speedups are basically irrelevant, since they only happen
in real code if people are compiling with -tags math_big_pure_go.
But at least the code is clearer.
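A minimal sketch of the pattern, assuming a simplified addVV over []uint rather than the real math/big Word type: the up-front length check is what lets the loop index x and y by the same i that ranges over z.

```go
package main

import (
	"fmt"
	"math/bits"
)

// addVV sets z = x + y word by word and returns the final carry.
// Simplified illustration; not the math/big implementation.
func addVV(z, x, y []uint) (c uint) {
	// After this check, any i in range for z is also in range for x and y.
	if len(x) != len(z) || len(y) != len(z) {
		panic("addVV: slice lengths differ")
	}
	for i := range z {
		z[i], c = bits.Add(x[i], y[i], c)
	}
	return c
}

func main() {
	x := []uint{^uint(0), 0} // low word is all ones
	y := []uint{1, 0}
	z := make([]uint, 2)
	c := addVV(z, x, y)
	fmt.Println(z, c) // [0 1] 0
}
```

Callers that previously passed slices of different lengths have to reslice them explicitly, which is the cleanup in math/big that the message mentions.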
benchmark \ system c3h88 c2s16 s7 386 s7-386 c4as16 mac arm loong64 ppc64le riscv64 s390x
AddVV/words=1/impl=go ~ +11.20% +5.11% -7.67% -7.77% +1.90% +10.76% -33.22% ~ +10.98% ~ +6.60%
AddVV/words=10/impl=go -22.12% -13.48% -10.37% -17.95% -18.07% -24.58% -22.04% -29.95% -14.22% ~ -6.33% +3.66%
AddVV/words=16/impl=go -9.75% -13.73% ~ -21.90% -18.66% -30.03% -20.45% -28.09% -17.33% -7.15% -8.96% +12.55%
AddVV/words=100/impl=go -5.91% -1.02% ~ -29.23% -22.18% -25.62% -6.49% -23.59% -22.31% -1.88% -14.13% +9.23%
AddVV/words=1000/impl=go -0.52% -0.19% -3.58% -33.89% -23.46% -22.46% ~ -24.00% -24.73% +0.93% -15.79% +12.32%
AddVV/words=10000/impl=go ~ ~ ~ -33.79% -23.72% -23.79% -5.98% -23.92% ~ +0.78% -15.45% +8.59%
AddVV/words=100000/impl=go ~ ~ ~ -33.90% -24.25% -22.82% -4.09% -24.63% ~ +1.00% -13.56% ~
SubVV/words=1/impl=go ~ +11.64% +14.05% ~ -4.07% ~ +10.79% -33.69% ~ ~ +3.89% +12.33%
SubVV/words=10/impl=go -10.31% -14.09% -7.38% +13.76% -13.25% -18.05% -20.08% -24.97% -14.15% +10.13% -0.97% -2.51%
SubVV/words=16/impl=go -8.06% -13.73% -5.70% +17.00% -12.83% -23.76% -17.52% -25.25% -17.30% -2.80% -4.96% -18.25%
SubVV/words=100/impl=go -9.22% -1.30% -2.76% +20.88% -14.35% -15.29% -8.49% -19.64% -22.31% -0.68% -14.30% -9.04%
SubVV/words=1000/impl=go -0.60% ~ -3.43% +23.08% -16.14% -11.96% ~ -28.52% -24.73% ~ -15.95% -9.91%
SubVV/words=10000/impl=go ~ ~ ~ +26.01% -15.24% -11.92% ~ -28.26% +4.25% ~ -15.42% -5.95%
SubVV/words=100000/impl=go ~ ~ ~ +25.71% -15.83% -12.13% ~ -27.88% -1.27% ~ -13.57% -6.72%
LshVU/words=1/impl=go +0.56% +0.36% ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
LshVU/words=10/impl=go +13.37% +4.63% ~ ~ ~ ~ ~ -2.90% ~ ~ ~ ~
LshVU/words=16/impl=go +22.83% +6.47% ~ ~ ~ ~ ~ ~ +0.80% ~ ~ +5.88%
LshVU/words=100/impl=go +7.56% +13.95% ~ ~ ~ ~ ~ ~ +0.33% -2.50% ~ ~
LshVU/words=1000/impl=go +0.64% +17.92% ~ ~ ~ ~ ~ -6.52% ~ -2.58% ~ ~
LshVU/words=10000/impl=go ~ +17.60% ~ ~ ~ ~ ~ -6.64% -6.22% -1.40% ~ ~
LshVU/words=100000/impl=go ~ +14.57% ~ ~ ~ ~ ~ ~ -5.47% ~ ~ ~
RshVU/words=1/impl=go ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ +2.72%
RshVU/words=10/impl=go ~ ~ ~ ~ ~ ~ ~ +2.50% ~ ~ ~ ~
RshVU/words=16/impl=go ~ +0.53% ~ ~ ~ ~ ~ +3.82% ~ ~ ~ ~
RshVU/words=100/impl=go ~ ~ ~ ~ ~ ~ ~ +6.18% ~ ~ ~ ~
RshVU/words=1000/impl=go ~ ~ ~ ~ ~ ~ ~ +7.00% ~ ~ ~ ~
RshVU/words=10000/impl=go ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
RshVU/words=100000/impl=go ~ ~ ~ ~ ~ ~ ~ +7.05% ~ ~ ~ ~
MulAddVWW/words=1/impl=go -10.34% +4.43% +10.62% -1.62% -4.74% -2.86% +11.75% ~ -8.00% +8.89% +3.87% ~
MulAddVWW/words=10/impl=go -1.61% -5.87% ~ -8.30% -4.55% +0.87% ~ -5.28% -20.82% ~ ~ -2.32%
MulAddVWW/words=16/impl=go -2.96% -5.28% ~ -9.22% -5.28% ~ ~ -3.74% -19.52% -1.48% -2.53% -9.52%
MulAddVWW/words=100/impl=go -3.89% -7.53% +1.93% -10.49% -4.87% -8.27% ~ ~ -0.65% -0.61% -7.59% -20.61%
MulAddVWW/words=1000/impl=go -0.45% -3.91% +4.54% -11.46% -4.69% -8.53% ~ ~ -0.05% ~ -8.88% -19.77%
MulAddVWW/words=10000/impl=go ~ -3.30% +4.10% -11.34% -4.10% -9.43% ~ -0.61% ~ -0.55% -8.21% -18.48%
MulAddVWW/words=100000/impl=go -0.30% -3.03% +4.31% -11.55% -4.41% -9.74% ~ -0.75% +0.63% ~ -7.80% -19.82%
AddMulVVWW/words=1/impl=go ~ +13.09% +12.50% -7.05% -10.41% +2.53% +13.32% -3.49% ~ +15.56% +3.62% ~
AddMulVVWW/words=10/impl=go -15.96% -9.06% -5.06% -14.56% -11.83% -5.44% -26.30% -14.23% -11.44% -1.79% -5.93% -6.60%
AddMulVVWW/words=16/impl=go -19.05% -12.43% -6.19% -14.24% -12.67% -8.65% -18.64% -16.56% -10.64% -3.00% -7.61% -12.80%
AddMulVVWW/words=100/impl=go -22.13% -16.59% -13.04% -13.79% -11.46% -12.01% -6.46% -21.80% -5.08% -3.13% -13.60% -22.53%
AddMulVVWW/words=1000/impl=go -17.07% -17.05% -14.08% -13.59% -12.13% -11.21% ~ -22.81% -4.27% -1.27% -16.35% -23.47%
AddMulVVWW/words=10000/impl=go -15.03% -16.78% -14.23% -13.86% -11.84% -11.69% ~ -22.75% -13.39% -1.10% -14.37% -22.01%
AddMulVVWW/words=100000/impl=go -13.70% -14.90% -14.26% -13.55% -12.04% -11.63% ~ -22.61% ~ -2.53% -10.42% -23.16%
Change-Id: Ic6f64344484a762b818c7090d1396afceb638607
Reviewed-on: https://go-review.googlesource.com/c/go/+/665155
Auto-Submit: Russ Cox <rsc@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Alan Donovan <adonovan@google.com>
|
|
Step 4 of the mini-compiler: switch to the new generated assembly.
No systematic performance regressions, and many many improvements.
In the benchmarks, the systems are:
c3h88 GOARCH=amd64 c3h88 perf gomote (newer Intel, Google Cloud)
c2s16 GOARCH=amd64 c2s16 perf gomote (Intel, Google Cloud)
s7 GOARCH=amd64 rsc basement server (AMD Ryzen 9 7950X)
386 GOARCH=386 gotip-linux-386 gomote (Intel, Google Cloud)
s7-386 GOARCH=386 rsc basement server (AMD Ryzen 9 7950X)
c4as16 GOARCH=arm64 c4as16 perf gomote (Google Cloud)
mac GOARCH=arm64 Apple M3 Pro in MacBook Pro
arm GOARCH=arm gotip-linux-arm gomote
loong64 GOARCH=loong64 gotip-linux-loong64 gomote
ppc64le GOARCH=ppc64le gotip-linux-ppc64le gomote
riscv64 GOARCH=riscv64 gotip-linux-riscv64 gomote
s390x GOARCH=s390x linux-s390x-ibm old gomote
benchmark \ system c3h88 c2s16 s7 386 s7-386 c4as16 mac arm loong64 ppc64le riscv64 s390x
AddVV/words=1 -4.03% +5.21% -4.04% +4.94% ~ ~ ~ ~ -19.51% ~ ~ ~
AddVV/words=10 -10.20% +0.34% -3.46% -11.50% -7.46% +7.66% +5.97% ~ -17.90% ~ ~ ~
AddVV/words=16 -10.91% -6.45% -8.45% -21.86% -17.90% +2.73% -1.61% ~ -22.47% -3.54% ~ ~
AddVV/words=100 -3.77% -4.30% -3.17% -47.27% -45.34% -0.78% ~ -8.74% -27.19% ~ ~ ~
AddVV/words=1000 -0.08% -0.71% ~ -49.21% -48.07% ~ ~ -16.80% -24.74% ~ ~ ~
AddVV/words=10000 ~ ~ ~ -48.73% -48.56% -0.06% ~ -17.08% ~ ~ -4.81% ~
AddVV/words=100000 ~ ~ ~ -47.80% -48.38% ~ ~ -15.10% -25.06% ~ -5.34% ~
SubVV/words=1 -0.84% +3.43% -3.62% +1.34% ~ -0.76% ~ ~ -18.18% +5.58% ~ ~
SubVV/words=10 -9.99% +0.34% ~ -11.23% -8.24% +7.53% +6.15% ~ -17.55% +2.77% -2.08% ~
SubVV/words=16 -11.94% -6.45% -6.81% -21.82% -18.11% +1.58% -1.21% ~ -20.36% ~ ~ ~
SubVV/words=100 -3.38% -4.32% -1.80% -46.14% -46.43% +0.41% ~ -7.20% -26.17% ~ -0.42% ~
SubVV/words=1000 -0.38% -0.80% ~ -49.22% -48.90% ~ ~ -15.86% -24.73% ~ ~ ~
SubVV/words=10000 ~ ~ ~ -49.57% -49.64% -0.03% ~ -15.85% -26.52% ~ -5.05% ~
SubVV/words=100000 ~ ~ ~ -46.88% -49.66% ~ ~ -15.45% -16.11% ~ -4.99% ~
LshVU/words=1 ~ +5.78% ~ ~ -2.48% +1.61% +2.18% +2.70% -18.16% -34.16% -21.29% ~
LshVU/words=10 -18.34% -3.78% +2.21% ~ ~ -2.81% -12.54% ~ -25.02% -24.78% -38.11% -66.98%
LshVU/words=16 -23.15% +1.03% +7.74% +0.73% ~ +8.88% +1.56% ~ -25.37% -28.46% -41.27% ~
LshVU/words=100 -32.85% -8.86% -2.58% ~ +2.69% +1.24% ~ -20.63% -44.14% -42.68% -53.09% ~
LshVU/words=1000 -37.30% -0.20% +5.67% ~ ~ +1.44% ~ -27.83% -45.01% -37.07% -57.02% -46.57%
LshVU/words=10000 -36.84% -2.30% +3.82% ~ +1.86% +1.57% -66.81% -28.00% -13.15% -35.40% -41.97% ~
LshVU/words=100000 -40.30% ~ +3.96% ~ ~ ~ ~ -24.91% -19.06% -36.14% -40.99% -66.03%
RshVU/words=1 -3.17% +4.76% -4.06% +4.31% +4.55% ~ ~ ~ -20.61% ~ -26.20% -51.33%
RshVU/words=10 -22.08% -4.41% -17.99% +3.64% -11.87% ~ -16.30% ~ -30.01% ~ -40.37% -63.05%
RshVU/words=16 -26.03% -8.50% -18.09% ~ -17.52% +6.50% ~ -2.85% -30.24% ~ -42.93% -63.13%
RshVU/words=100 -20.87% -28.83% -29.45% ~ -26.25% +1.46% -1.14% -16.20% -45.65% -16.20% -53.66% -77.27%
RshVU/words=1000 -24.03% -21.37% -26.71% ~ -28.95% +0.98% ~ -18.82% -45.21% -23.55% -57.09% -71.18%
RshVU/words=10000 -24.56% -22.44% -27.01% ~ -28.88% +0.78% -5.35% -17.47% -16.87% -20.67% -41.97% ~
RshVU/words=100000 -23.36% -15.65% -27.54% ~ -29.26% +1.73% -6.67% -13.68% -21.40% -23.02% -40.37% -66.31%
MulAddVWW/words=1 +2.37% +8.14% ~ +4.10% +3.71% ~ ~ ~ -21.62% ~ +1.12% ~
MulAddVWW/words=10 ~ -2.72% -15.15% +8.04% ~ ~ ~ -2.52% -19.48% ~ -6.18% ~
MulAddVWW/words=16 ~ +1.49% ~ +4.49% +6.58% -8.70% -7.16% -12.08% -21.43% -6.59% -9.05% ~
MulAddVWW/words=100 +0.37% +1.11% -4.51% -13.59% ~ -11.10% -3.63% -21.40% -22.27% -2.92% -14.41% ~
MulAddVWW/words=1000 ~ +0.90% -7.13% -18.94% ~ -14.02% -9.97% -28.31% -18.72% -2.32% -15.80% ~
MulAddVWW/words=10000 ~ +1.08% -6.75% -19.10% ~ -14.61% -9.04% -28.48% -14.29% -2.25% -9.40% ~
MulAddVWW/words=100000 ~ ~ -6.93% -18.09% ~ -14.33% -9.66% -28.92% -16.63% -2.43% -8.23% ~
AddMulVVWW/words=1 +2.30% +4.83% -11.37% +4.58% ~ -3.14% ~ ~ -10.58% +30.35% ~ ~
AddMulVVWW/words=10 -3.27% ~ +8.96% +5.74% ~ +2.67% -1.44% -7.64% -13.41% ~ ~ ~
AddMulVVWW/words=16 -6.12% ~ ~ ~ +1.91% -7.90% -16.22% -14.07% -14.26% -4.15% -7.30% ~
AddMulVVWW/words=100 -5.48% -2.14% ~ -9.40% +9.98% -1.43% -12.35% -18.56% -21.94% ~ -9.84% ~
AddMulVVWW/words=1000 -11.35% -3.40% -3.64% -11.04% +12.82% -1.33% -15.63% -20.50% -20.95% ~ -11.06% -51.97%
AddMulVVWW/words=10000 -10.31% -1.61% -8.41% -12.15% +13.10% -1.03% -16.34% -22.46% -1.00% ~ -10.33% -49.80%
AddMulVVWW/words=100000 -13.71% ~ -8.31% -12.18% +12.98% -1.35% -15.20% -21.89% ~ ~ -9.38% -48.30%
Change-Id: I0a33c33602c0d053c84d9946e662500cfa048e2d
Reviewed-on: https://go-review.googlesource.com/c/go/+/664938
Reviewed-by: Alan Donovan <adonovan@google.com>
Auto-Submit: Russ Cox <rsc@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
Step 3 of the mini-compiler: add the generators for the shift and mul routines.
Change-Id: I981d5b7086262c740036f5db768d3e63083984e2
Reviewed-on: https://go-review.googlesource.com/c/go/+/664937
Auto-Submit: Russ Cox <rsc@golang.org>
Reviewed-by: Alan Donovan <adonovan@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
Step 2 of the mini-compiler: add all the remaining architectures.
Change-Id: I8c5283aa8baa497785a5c15f2248528fa9ae886e
Reviewed-on: https://go-review.googlesource.com/c/go/+/664936
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Alan Donovan <adonovan@google.com>
Auto-Submit: Russ Cox <rsc@golang.org>
|
|
The arith assembly is big enough, and the details that you have to keep
in mind are complex enough and varied enough, that it is worth using
a Go program to generate the assembly. That way, all the architectures
can use the same algorithms, and porting to new architectures will be
easier.
This is the first of a sequence of CLs to introduce a new mini-compiler
for generating the arith assembly, in math/big/internal/asmgen.
This CL has the basics of the compiler as well as a couple of simple
architectures and the generator for addVV/subVV. It does not check
in the generated assembly yet. That will happen in a followup CL after
the other architectures and generators have been added.
Change-Id: Ib704c60fd972fc5690ac04d8fae3712ee2c1a80a
Reviewed-on: https://go-review.googlesource.com/c/go/+/664935
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Alan Donovan <adonovan@google.com>
Auto-Submit: Russ Cox <rsc@golang.org>
|
|
The vast majority of the time, carry propagation is limited and
addVW/subVW only need to consider a single word for carry propagation.
As Josh Bleecher-Snyder pointed out in 2019 (CL 164968), once carrying
is done, the remaining words can be handled faster with copy (memmove).
In the benchmarks below, this is the data=random case.
Even more important, if the source and destination are the same,
the copy can be optimized away entirely, making a small in-place
addition to a big.Int O(1) instead of O(N). To date, only a few
systems (amd64, arm64, and pure Go, meaning wasm) make use of this
asymptotic improvement. This is the data=shortcut case.
This CL deletes the addVW/subVW assembly and replaces it with
an optimized pure Go version. Using Go makes it easy to call
the real copy builtin, which will use optimized memmove code,
instead of recreating a worse memmove in assembly (as arm64 does)
or omitting the copy optimization entirely (as most others do).
The worst case for the Go version versus assembly is the case
of incrementing 2^N-1 by 1, which has to propagate a carry
the entire length of the array. This is the data=carry case.
On balance, we believe this case is rare enough that it is worth
taking a hit there, in exchange for significant wins
in the other cases and the deletion of significant amounts of
assembly of varying quality. (Remember that half the assembly has
the copy optimization and shortcut, while half does not.)
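The idea can be sketched as follows. This is a simplified illustration over []uint, not the code from the CL: once the carry is zero, the remaining words are copied with the copy builtin, and the copy is skipped entirely when z and x are the same slice.

```go
package main

import (
	"fmt"
	"math/bits"
)

// addVW sets z = x + y, where y is a single word, and returns the carry out.
// Simplified illustration; not the math/big implementation.
func addVW(z, x []uint, y uint) (c uint) {
	if len(x) != len(z) {
		panic("addVW: slice lengths differ")
	}
	c = y
	for i := range z {
		if c == 0 {
			// Carry propagation is finished: the remaining words are unchanged.
			// When z and x share storage there is nothing left to do (shortcut case).
			if &z[0] != &x[0] {
				copy(z[i:], x[i:])
			}
			return 0
		}
		z[i], c = bits.Add(x[i], c, 0)
	}
	return c // nonzero only if the carry ran off the end (carry case)
}

func main() {
	x := []uint{^uint(0), 5, 7}
	z := make([]uint, len(x))
	c := addVW(z, x, 1)
	fmt.Println(z, c) // [0 6 7] 0

	// In-place addition: after the first word, no further work is needed.
	addVW(x, x, 1)
	fmt.Println(x) // [0 6 7]
}
```

In this sketch the data=carry worst case corresponds to every word being all ones, so the loop never reaches the early return.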
In the benchmarks, the systems are:
c2s16 GOARCH=amd64 c2s16 perf gomote (Intel, Google Cloud)
c3h88 GOARCH=amd64 c3h88 perf gomote (newer Intel, Google Cloud)
s7 GOARCH=amd64 rsc basement server (AMD Ryzen 9 7950X)
c4as16 GOARCH=arm64 c4as16 perf gomote (Google Cloud)
mac GOARCH=arm64 Apple M3 Pro in MacBook Pro
386 GOARCH=386 gotip-linux-386 gomote
arm GOARCH=arm gotip-linux-arm gomote
loong64 GOARCH=loong64 gotip-linux-loong64 gomote
ppc64le GOARCH=ppc64le gotip-linux-ppc64le gomote
riscv64 GOARCH=riscv64 gotip-linux-riscv64 gomote
benchmark \ system c2s16 c3h88 s7 c4as16 mac 386 arm loong64 ppc64le riscv64
AddVW/words=1/data=random -1.15% -1.74% -5.89% -9.80% -11.54% +23.71% -12.74% -14.25% +14.67% +10.27%
AddVW/words=2/data=random -2.59% ~ -4.38% -19.31% -15.41% +24.80% ~ -19.99% +13.73% +19.71%
AddVW/words=3/data=random -3.75% -19.10% -3.79% -23.15% -17.04% +20.04% -10.07% -23.20% ~ +15.39%
AddVW/words=4/data=random -2.84% +7.05% -8.77% -22.64% -15.77% +16.01% -7.36% -28.22% ~ +23.00%
AddVW/words=5/data=random -10.97% +2.16% -12.09% -20.89% -17.14% +9.42% -4.69% -32.60% ~ +10.07%
AddVW/words=6/data=random -9.87% ~ -7.54% -19.08% -6.46% ~ -3.44% -34.61% ~ +12.19%
AddVW/words=7/data=random -14.36% ~ -10.09% -19.10% -10.47% -6.20% -5.06% -38.14% -11.54% +6.79%
AddVW/words=8/data=random -17.50% ~ -11.06% -25.14% -12.88% -8.35% -5.11% -41.39% -14.04% +11.87%
AddVW/words=9/data=random -19.76% -4.05% -15.47% -24.08% -16.50% -12.34% -21.56% -44.25% -14.82% ~
AddVW/words=10/data=random -13.89% ~ -9.69% -23.06% -8.04% -12.58% -19.25% -32.80% -11.68% ~
AddVW/words=16/data=random -29.36% -15.35% -21.86% -25.04% -19.89% -32.26% -16.29% -42.66% -25.92% -3.01%
AddVW/words=32/data=random -39.02% -28.76% -39.87% -11.22% -2.85% -55.40% -31.17% -55.37% -37.92% -16.28%
AddVW/words=64/data=random -25.94% -19.09% -20.60% -6.90% +8.91% -51.00% -43.72% -62.27% -44.11% -28.74%
AddVW/words=100/data=random -22.79% -18.13% -18.25% ~ +33.89% -67.40% -51.77% -63.54% -53.75% -30.97%
AddVW/words=1000/data=random -8.98% -3.84% ~ -3.15% ~ -93.35% -63.92% -65.66% -68.67% -42.30%
AddVW/words=10000/data=random -1.38% -0.38% ~ ~ ~ -89.16% -65.18% -44.65% -70.35% -20.08%
AddVW/words=100000/data=random ~ ~ ~ ~ ~ -87.03% -64.51% -36.08% -61.40% -16.53%
SubVW/words=1/data=random -3.67% ~ -8.38% -10.26% -3.07% +45.78% -6.06% -11.17% ~ ~
SubVW/words=2/data=random -3.48% -10.07% -5.76% -20.14% -8.45% +44.28% ~ -19.09% ~ +16.98%
SubVW/words=3/data=random -7.11% -26.64% -4.48% -22.07% -9.21% +35.61% ~ -23.93% -18.20% ~
SubVW/words=4/data=random -4.23% +7.19% -8.95% -22.62% -13.89% +33.20% -8.96% -29.96% ~ +22.23%
SubVW/words=5/data=random -11.49% +1.92% -10.86% -22.27% -17.53% +24.48% -2.88% -35.19% -19.55% ~
SubVW/words=6/data=random -7.67% ~ -7.72% -18.44% -6.24% +12.03% -2.00% -39.68% -10.73% ~
SubVW/words=7/data=random -13.69% -18.32% -11.82% -18.92% -11.57% +6.63% ~ -43.54% -30.81% ~
SubVW/words=8/data=random -16.02% ~ -11.07% -24.50% -11.92% +4.32% -3.01% -46.95% -24.14% ~
SubVW/words=9/data=random -18.76% -3.34% -14.84% -23.79% -17.50% ~ -21.80% -49.98% -29.62% ~
SubVW/words=10/data=random -13.23% ~ -9.25% -21.26% -11.63% ~ -18.58% -39.19% -20.09% ~
SubVW/words=16/data=random -28.25% -13.24% -22.66% -27.18% -19.13% -23.38% -20.24% -51.01% -28.06% -3.05%
SubVW/words=32/data=random -38.41% -28.88% -40.12% -11.20% -2.80% -49.17% -34.67% -63.29% -39.25% -15.20%
SubVW/words=64/data=random -25.51% -19.24% -22.20% -6.57% +9.98% -48.52% -48.14% -69.50% -49.44% -27.92%
SubVW/words=100/data=random -21.69% -18.51% ~ +1.92% +34.42% -65.88% -54.67% -71.24% -58.88% -30.71%
SubVW/words=1000/data=random -9.81% -4.05% -2.14% -3.06% ~ -93.37% -67.33% -74.12% -68.36% -42.17%
SubVW/words=10000/data=random ~ -0.52% ~ ~ ~ -88.87% -68.54% -44.94% -70.63% -19.95%
SubVW/words=100000/data=random ~ ~ ~ ~ ~ -86.69% -68.09% -48.36% -62.42% -19.32%
AddVW/words=1/data=shortcut -29.38% -25.38% -27.37% -23.15% -25.41% +3.01% -33.60% -36.12% -15.76% ~
AddVW/words=2/data=shortcut -32.79% -34.72% -31.47% -24.47% -28.21% -3.75% -34.66% -43.89% -23.65% -21.56%
AddVW/words=3/data=shortcut -38.50% -46.83% -35.67% -26.38% -30.29% -10.41% -44.89% -47.68% -30.93% -26.85%
AddVW/words=4/data=shortcut -40.40% -28.85% -34.19% -29.83% -32.95% -16.09% -42.86% -51.02% -34.19% -26.69%
AddVW/words=5/data=shortcut -43.87% -35.42% -36.46% -32.59% -37.72% -20.82% -45.14% -54.01% -35.49% -30.48%
AddVW/words=6/data=shortcut -46.98% -39.34% -42.22% -35.43% -38.18% -27.46% -46.72% -56.61% -40.21% -34.07%
AddVW/words=7/data=shortcut -49.63% -47.97% -46.61% -35.28% -41.93% -31.14% -49.29% -58.89% -41.10% -37.01%
AddVW/words=8/data=shortcut -50.48% -42.33% -45.40% -40.24% -41.74% -32.92% -50.62% -60.98% -44.85% -38.10%
AddVW/words=9/data=shortcut -54.27% -43.52% -49.06% -42.16% -45.22% -37.57% -51.84% -62.91% -46.04% -40.82%
AddVW/words=10/data=shortcut -56.01% -45.40% -51.42% -43.29% -46.14% -38.65% -53.65% -64.62% -47.05% -43.21%
AddVW/words=16/data=shortcut -62.73% -55.66% -59.31% -56.38% -54.31% -53.16% -61.03% -72.29% -58.24% -52.57%
AddVW/words=32/data=shortcut -74.00% -69.42% -71.75% -33.65% -37.35% -71.73% -72.59% -82.44% -70.87% -67.69%
AddVW/words=64/data=shortcut -56.69% -52.72% -52.09% -35.48% -36.87% -84.24% -83.10% -90.37% -82.56% -80.81%
AddVW/words=100/data=shortcut -56.68% -53.18% -51.49% -33.49% -37.72% -89.95% -88.21% -93.37% -88.47% -86.52%
AddVW/words=1000/data=shortcut -56.68% -52.45% -51.66% -35.31% -36.65% -98.88% -98.62% -99.24% -98.78% -98.41%
AddVW/words=10000/data=shortcut -56.70% -52.40% -51.92% -33.49% -36.98% -99.89% -99.86% -99.92% -99.87% -99.91%
AddVW/words=100000/data=shortcut -56.67% -52.46% -52.38% -35.31% -37.20% -99.99% -99.99% -99.99% -99.99% -99.99%
SubVW/words=1/data=shortcut -29.80% -20.71% -26.94% -23.24% -25.33% +26.97% -32.02% -37.85% -40.20% -12.67%
SubVW/words=2/data=shortcut -35.47% -36.38% -31.93% -25.43% -30.18% +18.96% -33.48% -46.48% -39.38% -18.65%
SubVW/words=3/data=shortcut -39.22% -49.96% -36.90% -25.82% -30.96% +12.53% -40.67% -51.07% -43.71% -23.78%
SubVW/words=4/data=shortcut -40.46% -24.90% -34.66% -29.87% -33.97% +4.60% -42.32% -54.92% -42.83% -22.45%
SubVW/words=5/data=shortcut -43.84% -34.17% -38.00% -32.55% -37.27% -2.46% -43.09% -58.18% -45.70% -26.45%
SubVW/words=6/data=shortcut -47.69% -37.49% -42.73% -35.90% -37.73% -8.52% -46.55% -61.01% -44.00% -30.14%
SubVW/words=7/data=shortcut -49.45% -50.66% -46.88% -34.77% -41.64% -14.46% -48.92% -63.46% -50.47% -33.39%
SubVW/words=8/data=shortcut -50.45% -39.31% -47.14% -40.47% -41.70% -15.77% -50.21% -65.64% -47.71% -34.01%
SubVW/words=9/data=shortcut -54.28% -43.07% -49.42% -41.34% -44.99% -19.39% -51.55% -67.61% -56.92% -36.82%
SubVW/words=10/data=shortcut -56.85% -47.88% -50.92% -42.76% -45.67% -23.60% -53.04% -69.34% -60.18% -39.43%
SubVW/words=16/data=shortcut -62.36% -54.83% -58.80% -55.83% -53.74% -41.04% -60.16% -76.75% -60.56% -48.63%
SubVW/words=32/data=shortcut -73.68% -68.64% -71.57% -33.52% -37.34% -64.73% -72.67% -85.89% -71.87% -64.56%
SubVW/words=64/data=shortcut -56.68% -51.66% -52.56% -34.75% -37.54% -80.30% -83.58% -92.39% -83.41% -78.70%
SubVW/words=100/data=shortcut -56.68% -50.97% -51.57% -33.68% -36.78% -87.42% -88.53% -94.84% -88.87% -84.96%
SubVW/words=1000/data=shortcut -56.68% -50.89% -52.10% -34.94% -37.77% -98.59% -98.71% -99.43% -98.80% -98.20%
SubVW/words=10000/data=shortcut -56.68% -51.00% -52.44% -33.65% -37.27% -99.86% -99.87% -99.94% -99.88% -99.90%
SubVW/words=100000/data=shortcut -56.68% -50.80% -52.20% -34.79% -37.46% -99.99% -99.99% -99.99% -99.99% -99.99%
AddVW/words=1/data=carry -0.51% -5.29% -24.03% -26.48% ~ ~ -33.14% -30.23% ~ -20.74%
AddVW/words=2/data=carry -6.36% ~ -21.05% -39.40% ~ +10.72% -29.12% -31.34% ~ -17.29%
AddVW/words=3/data=carry ~ ~ -17.46% -19.53% +17.58% ~ -26.23% -23.61% +7.80% -14.34%
AddVW/words=4/data=carry +19.02% +16.80% ~ ~ +28.25% ~ -27.90% -20.31% +19.16% ~
AddVW/words=5/data=carry +3.97% +53.02% ~ ~ +11.31% ~ -19.05% -17.47% +16.81% ~
AddVW/words=6/data=carry +2.98% +19.83% ~ ~ +14.84% ~ -18.48% -14.92% +18.25% ~
AddVW/words=7/data=carry ~ ~ ~ ~ +27.17% ~ -15.50% -12.74% +13.00% ~
AddVW/words=8/data=carry +0.58% +22.32% ~ +6.10% +29.63% ~ -13.04% ~ +28.46% +2.95%
AddVW/words=9/data=carry ~ +31.53% ~ ~ +14.42% ~ -11.32% ~ +18.37% +3.28%
AddVW/words=10/data=carry +3.94% +22.36% ~ +6.29% +19.22% ~ -11.27% ~ +20.10% +3.91%
AddVW/words=16/data=carry +2.82% +14.23% ~ +10.06% +25.91% -16.12% ~ ~ +52.28% +10.40%
AddVW/words=32/data=carry ~ +25.35% +13.66% ~ +34.89% -34.39% +6.51% -18.71% +41.06% +19.42%
AddVW/words=64/data=carry -42.03% ~ -39.70% +6.65% +32.29% -39.94% +14.34% ~ +19.68% +20.86%
AddVW/words=100/data=carry -33.95% -34.28% -39.65% ~ +27.72% -26.80% +17.40% ~ +26.39% +23.32%
AddVW/words=1000/data=carry -42.49% -47.87% -47.44% +1.25% +4.25% -41.76% +23.40% ~ +25.48% +27.99%
AddVW/words=10000/data=carry -41.85% -48.49% -49.43% ~ ~ -42.09% +24.61% -10.32% +40.55% +18.35%
AddVW/words=100000/data=carry -28.18% -48.13% -48.24% +1.35% ~ -42.90% +24.73% -9.79% +22.55% +17.16%
SubVW/words=1/data=carry -10.32% -17.16% -24.14% -26.24% ~ +18.43% -34.10% -29.54% -9.57% ~
SubVW/words=2/data=carry -19.45% -23.31% -20.74% -39.73% ~ +15.74% -28.13% -30.21% ~ -18.74%
SubVW/words=3/data=carry ~ -16.18% -15.34% -19.54% +17.62% +12.39% -27.64% -27.09% ~ -14.97%
SubVW/words=4/data=carry +11.67% +24.42% ~ ~ +25.11% +14.07% -28.08% -26.18% ~ ~
SubVW/words=5/data=carry +8.08% +25.64% ~ ~ +10.35% +8.12% -21.75% -25.50% ~ -4.86%
SubVW/words=6/data=carry ~ +13.82% ~ ~ +12.92% +6.79% -20.25% -24.70% ~ -2.74%
SubVW/words=7/data=carry ~ ~ +8.29% +4.51% +26.59% +4.62% -18.01% -24.09% ~ -1.26%
SubVW/words=8/data=carry ~ +23.16% +16.19% +6.16% +25.46% +6.74% -15.57% -22.74% ~ +1.44%
SubVW/words=9/data=carry ~ +30.71% +20.81% ~ +12.36% ~ -12.99% ~ ~ +3.13%
SubVW/words=10/data=carry +5.03% +19.53% +14.84% +14.16% +16.12% ~ -11.64% -16.00% +15.45% +3.29%
SubVW/words=16/data=carry +14.42% +15.58% +33.07% +11.43% +24.65% ~ ~ -21.90% +25.59% +9.40%
SubVW/words=32/data=carry ~ +27.57% +46.58% ~ +35.35% -8.49% ~ -24.04% +11.86% +18.40%
SubVW/words=64/data=carry -24.34% -27.83% -20.90% +13.34% +37.17% -14.90% ~ -8.81% +12.88% +18.92%
SubVW/words=100/data=carry -25.19% -34.70% -27.45% +12.86% +28.42% -14.48% ~ ~ +25.71% +21.93%
SubVW/words=1000/data=carry -24.93% -47.86% -47.26% +2.66% ~ -23.88% ~ ~ +25.99% +27.81%
SubVW/words=10000/data=carry -24.17% -36.48% -49.41% +1.06% ~ -25.06% ~ -26.50% +27.94% +18.36%
SubVW/words=100000/data=carry -22.51% -35.86% -49.46% +3.96% ~ -25.18% ~ -22.15% +26.86% +15.44%
Change-Id: I8f252073040e674780ac6ec9912082fb205329dd
Reviewed-on: https://go-review.googlesource.com/c/go/+/664898
Reviewed-by: Alan Donovan <adonovan@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
Also fix a few real but currently harmless bugs from CL 664895.
There were a few places that were still wrong if z != x or if a != 0.
Change-Id: Id8971e2505523bc4708780c82bf998a546f4f081
Reviewed-on: https://go-review.googlesource.com/c/go/+/664897
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Alan Donovan <adonovan@google.com>
|
|
Vet is failing on this code because some arguments of mulAddVWW
were renamed in the Go declaration (CL 664895) but not in the assembly accessors.
It looks like the assembly was written before that CL but checked in
after it.
Change-Id: I270e8db5f8327aa2029c21a126fab1231a3506a1
Reviewed-on: https://go-review.googlesource.com/c/go/+/665717
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Reviewed-by: Keith Randall <khr@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
Auto-Submit: Dmitri Shuralyov <dmitshur@golang.org>
|
|
Benchmark results on Loongson 3C5000 (which is an LA464 implementation):
goos: linux
goarch: loong64
pkg: math/big
cpu: Loongson-3C5000 @ 2200.00MHz
│ test/old_3c5000_subvv.log │ test/new_3c5000_subvv.log │
│ sec/op │ sec/op vs base │
SubVV/1 10.920n ± 0% 7.657n ± 0% -29.88% (p=0.000 n=20)
SubVV/2 14.100n ± 0% 8.841n ± 0% -37.30% (p=0.000 n=20)
SubVV/3 16.38n ± 0% 11.06n ± 0% -32.48% (p=0.000 n=20)
SubVV/4 18.65n ± 0% 12.85n ± 0% -31.10% (p=0.000 n=20)
SubVV/5 20.93n ± 0% 14.79n ± 0% -29.34% (p=0.000 n=20)
SubVV/10 32.30n ± 0% 22.29n ± 0% -30.99% (p=0.000 n=20)
SubVV/100 244.3n ± 0% 149.2n ± 0% -38.93% (p=0.000 n=20)
SubVV/1000 2.292µ ± 0% 1.378µ ± 0% -39.88% (p=0.000 n=20)
SubVV/10000 26.26µ ± 0% 25.64µ ± 0% -2.33% (p=0.000 n=20)
SubVV/100000 341.3µ ± 0% 238.0µ ± 0% -30.26% (p=0.000 n=20)
geomean 209.1n 144.5n -30.86%
Change-Id: I3863c2c6728f1b0f8fecbf77de13254299c5b1cb
Reviewed-on: https://go-review.googlesource.com/c/go/+/659877
Reviewed-by: abner chenc <chenguoqi@loongson.cn>
Reviewed-by: Carlos Amedee <carlos@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
Benchmark results on Loongson 3A5000 (which is an LA464 implementation):
goos: linux
goarch: loong64
pkg: math/big
cpu: Loongson-3A5000-HV @ 2500.00MHz
│ test/old_3a5000_muladdvww.log │ test/new_3a5000_muladdvww.log │
│ sec/op │ sec/op vs base │
MulAddVWW/1 7.606n ± 0% 6.987n ± 0% -8.14% (p=0.000 n=20)
MulAddVWW/2 9.207n ± 0% 8.567n ± 0% -6.95% (p=0.000 n=20)
MulAddVWW/3 10.810n ± 0% 9.223n ± 0% -14.68% (p=0.000 n=20)
MulAddVWW/4 13.01n ± 0% 12.41n ± 0% -4.61% (p=0.000 n=20)
MulAddVWW/5 15.79n ± 0% 12.99n ± 0% -17.73% (p=0.000 n=20)
MulAddVWW/10 25.62n ± 0% 20.02n ± 0% -21.86% (p=0.000 n=20)
MulAddVWW/100 217.0n ± 0% 170.9n ± 0% -21.24% (p=0.000 n=20)
MulAddVWW/1000 2.064µ ± 0% 1.612µ ± 0% -21.90% (p=0.000 n=20)
MulAddVWW/10000 24.50µ ± 0% 16.74µ ± 0% -31.66% (p=0.000 n=20)
MulAddVWW/100000 239.1µ ± 0% 171.1µ ± 0% -28.45% (p=0.000 n=20)
geomean 159.2n 130.3n -18.18%
Change-Id: I063434bc382f4f1234f879172ab671a3d6f2eb80
Reviewed-on: https://go-review.googlesource.com/c/go/+/659881
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Reviewed-by: abner chenc <chenguoqi@loongson.cn>
Reviewed-by: Carlos Amedee <carlos@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
Benchmark results on Loongson 3C5000 (which is an LA464 implementation):
goos: linux
goarch: loong64
pkg: math/big
cpu: Loongson-3C5000 @ 2200.00MHz
│ test/old_3c5000_subvw.log │ test/new_3c5000_subvw.log │
│ sec/op │ sec/op vs base │
SubVW/1 8.564n ± 0% 5.915n ± 0% -30.93% (p=0.000 n=20)
SubVW/2 11.675n ± 0% 6.825n ± 0% -41.54% (p=0.000 n=20)
SubVW/3 13.410n ± 0% 7.969n ± 0% -40.57% (p=0.000 n=20)
SubVW/4 15.300n ± 0% 9.740n ± 0% -36.34% (p=0.000 n=20)
SubVW/5 17.34n ± 1% 10.66n ± 0% -38.55% (p=0.000 n=20)
SubVW/10 26.55n ± 0% 15.21n ± 0% -42.70% (p=0.000 n=20)
SubVW/100 199.2n ± 0% 102.5n ± 0% -48.52% (p=0.000 n=20)
SubVW/1000 1866.5n ± 1% 924.6n ± 0% -50.46% (p=0.000 n=20)
SubVW/10000 17.67µ ± 2% 12.04µ ± 2% -31.83% (p=0.000 n=20)
SubVW/100000 186.4µ ± 0% 132.0µ ± 0% -29.17% (p=0.000 n=20)
SubVWext/1 8.616n ± 0% 5.949n ± 0% -30.95% (p=0.000 n=20)
SubVWext/2 11.410n ± 0% 7.008n ± 1% -38.58% (p=0.000 n=20)
SubVWext/3 13.255n ± 1% 8.073n ± 0% -39.09% (p=0.000 n=20)
SubVWext/4 15.095n ± 0% 9.893n ± 0% -34.47% (p=0.000 n=20)
SubVWext/5 16.87n ± 0% 10.86n ± 0% -35.63% (p=0.000 n=20)
SubVWext/10 26.00n ± 0% 15.54n ± 0% -40.22% (p=0.000 n=20)
SubVWext/100 196.0n ± 0% 104.3n ± 1% -46.76% (p=0.000 n=20)
SubVWext/1000 1847.0n ± 0% 923.7n ± 0% -49.99% (p=0.000 n=20)
SubVWext/10000 17.30µ ± 1% 11.71µ ± 1% -32.31% (p=0.000 n=20)
SubVWext/100000 187.5µ ± 0% 131.6µ ± 0% -29.82% (p=0.000 n=20)
geomean 159.7n 97.79n -38.79%
Change-Id: I21a6903e79b02cb22282e80c9bfe2ae9f1a87589
Reviewed-on: https://go-review.googlesource.com/c/go/+/659878
Reviewed-by: Carlos Amedee <carlos@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Reviewed-by: abner chenc <chenguoqi@loongson.cn>
|
|
Benchmark results on Loongson 3C5000 (which is an LA464 implementation):
goos: linux
goarch: loong64
pkg: math/big
cpu: Loongson-3C5000 @ 2200.00MHz
│ test/old_3c5000_addvw.log │ test/new_3c5000_addvw.log │
│ sec/op │ sec/op vs base │
AddVW/1 9.555n ± 0% 5.915n ± 0% -38.09% (p=0.000 n=20)
AddVW/2 11.370n ± 0% 6.825n ± 0% -39.97% (p=0.000 n=20)
AddVW/3 12.485n ± 0% 7.970n ± 0% -36.16% (p=0.000 n=20)
AddVW/4 14.980n ± 0% 9.718n ± 0% -35.13% (p=0.000 n=20)
AddVW/5 16.73n ± 0% 10.63n ± 0% -36.46% (p=0.000 n=20)
AddVW/10 24.57n ± 0% 15.18n ± 0% -38.23% (p=0.000 n=20)
AddVW/100 184.9n ± 0% 102.4n ± 0% -44.62% (p=0.000 n=20)
AddVW/1000 1721.0n ± 0% 921.4n ± 0% -46.46% (p=0.000 n=20)
AddVW/10000 16.83µ ± 0% 11.68µ ± 0% -30.58% (p=0.000 n=20)
AddVW/100000 184.7µ ± 0% 131.3µ ± 0% -28.93% (p=0.000 n=20)
AddVWext/1 9.554n ± 0% 5.915n ± 0% -38.09% (p=0.000 n=20)
AddVWext/2 11.370n ± 0% 6.825n ± 0% -39.97% (p=0.000 n=20)
AddVWext/3 12.505n ± 0% 7.969n ± 0% -36.27% (p=0.000 n=20)
AddVWext/4 14.980n ± 0% 9.718n ± 0% -35.13% (p=0.000 n=20)
AddVWext/5 16.70n ± 0% 10.63n ± 0% -36.33% (p=0.000 n=20)
AddVWext/10 24.54n ± 0% 15.18n ± 0% -38.13% (p=0.000 n=20)
AddVWext/100 185.0n ± 0% 102.4n ± 0% -44.65% (p=0.000 n=20)
AddVWext/1000 1721.0n ± 0% 921.4n ± 0% -46.46% (p=0.000 n=20)
AddVWext/10000 16.83µ ± 0% 11.68µ ± 0% -30.60% (p=0.000 n=20)
AddVWext/100000 184.9µ ± 0% 130.4µ ± 0% -29.51% (p=0.000 n=20)
geomean 155.5n 96.87n -37.70%
Change-Id: I824a90cb365e09d7d0d4a2c53ff4b30cf057a75e
Reviewed-on: https://go-review.googlesource.com/c/go/+/659876
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: abner chenc <chenguoqi@loongson.cn>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Reviewed-by: Carlos Amedee <carlos@golang.org>
|
|
It is annoying that non-x86 implementations of shlVU and shrVU
have to go out of their way to handle the trivial case shift==0
with their own copy loops. Instead, arrange to never call them
with shift==0, so that the code can be removed.
Unfortunately, there are linknames of shlVU, so we cannot
change that function. But we can rename the functions and
then leave behind a shlVU wrapper, so do that.
Since the big.Int API calls the operations Lsh and Rsh, rename
shlVU/shrVU to lshVU/rshVU. Also rename various other shl/shr
methods and functions to lsh/rsh.
Change-Id: Ieaf54e0110a298730aa3e4566ce5be57ba7fc121
Reviewed-on: https://go-review.googlesource.com/c/go/+/664896
Reviewed-by: Alan Donovan <adonovan@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
addMulVVW is an unnecessarily special case.
All other assembly routines taking []Word (V as in vector) arguments
take separate source and destination. For example:
addVV: z = x+y
mulAddVWW: z = x*m+a
addMulVVW uses the z parameter as both destination and source:
addMulVVW: z = z+x*m
Even looking at the signatures is confusing: all the VV routines take
two input vectors x and y, but addMulVVW takes only x: where is y?
(The answer is that the two inputs are z and x.)
It would be nice to fix this, both for understandability and regularity,
and to simplify a future assembly generator.
We cannot remove or redefine addMulVVW, because it has been used
in linknames. Instead, the CL adds a new final addend argument ‘a’
like in mulAddVWW, making the natural name addMulVVWW
(two input vectors, two input words):
addMulVVWW: z = x+y*m+a
This CL updates all the assembly implementations to rename the
inputs z, x, y -> x, y, m, and then introduces a separate destination z.
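As a reference for the new semantics, here is a minimal pure-Go sketch
of z = x+y*m+a. It uses uint instead of Word and math/bits instead of
assembly, so it only illustrates the arithmetic and is not the
implementation in this CL.

```go
package main

import (
	"fmt"
	"math/bits"
)

// addMulVVWW computes z = x + y*m + a one word at a time and returns the
// final carry word. All three slices must have the same length.
func addMulVVWW(z, x, y []uint, m, a uint) (c uint) {
	c = a
	for i := range z {
		hi, lo := bits.Mul(y[i], m)     // two-word product y[i]*m
		lo, cc := bits.Add(lo, x[i], 0) // add the x[i] addend
		hi += cc
		lo, cc = bits.Add(lo, c, 0) // add the incoming carry word
		hi += cc                    // cannot overflow: the total fits in two words
		z[i] = lo
		c = hi
	}
	return c
}

func main() {
	x := []uint{1, 2}
	y := []uint{3, 4}
	z := make([]uint, 2)
	c := addMulVVWW(z, x, y, 10, 5)
	fmt.Println(z, c) // [36 42] 0
}
```

With this signature, the old in-place operation z = z+x*m is the call
addMulVVWW(z, z, x, m, 0).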
Change-Id: Ib76c80b53f6d1f4a901f663566e9c4764bb20488
Reviewed-on: https://go-review.googlesource.com/c/go/+/664895
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Alan Donovan <adonovan@google.com>
|
|
Refactor calibration tests to use the same logic for all.
Choosing thresholds that are broadly appropriate for all systems is part science
but also part guesswork and judgement. We could instead set per-GOOS/GOARCH
thresholds, but that seems like too much work, and even then there would be
variation between different chips within a GOOS/GOARCH.
(For example see the three linux/amd64 systems benchmarked below.)
The thresholds chosen in this CL are:
karatsubaThreshold = 40 // unchanged
basicSqrThreshold = 12 // was 20
karatsubaSqrThreshold = 80 // was 260
divRecursiveThreshold = 40 // was 100
The new file calibrate.md explains the calibration process and links to graphs
justifying those values. (The graphs are hosted on swtch.com to avoid adding
a megabyte of extra data to the Go repo and Go distributions.)
A rendered copy of calibrate.md is at https://swtch.com/math/big/calibrate.html.
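To show how such thresholds are used, here is a small sketch of
length-based dispatch for squaring. The two constants are the values
listed above. The helper sqrAlgorithm and its return strings are
hypothetical, and the sketch assumes the thresholds are compared
against the operand length in words.

```go
package main

import "fmt"

// Threshold values quoted in the commit message.
const (
	basicSqrThreshold     = 12
	karatsubaSqrThreshold = 80
)

// sqrAlgorithm is a hypothetical helper that names the squaring routine
// a length-based dispatch would pick for an n-word operand.
func sqrAlgorithm(n int) string {
	switch {
	case n < basicSqrThreshold:
		return "general multiply"
	case n < karatsubaSqrThreshold:
		return "basic squaring"
	default:
		return "Karatsuba squaring"
	}
}

func main() {
	for _, n := range []int{5, 50, 200} {
		fmt.Println(n, sqrAlgorithm(n))
	}
}
```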
goos: linux
goarch: amd64
pkg: math/big
cpu: Intel(R) Xeon(R) Platinum 8481C CPU @ 2.70GHz
│ old │ new │
│ sec/op │ sec/op vs base │
Div/20/10-88 13.13n ± 2% 13.14n ± 2% ~ (p=0.494 n=15)
Div/40/20-88 13.13n ± 2% 13.14n ± 2% ~ (p=0.137 n=15)
Div/100/50-88 25.50n ± 0% 25.51n ± 0% ~ (p=0.038 n=15)
Div/200/100-88 113.1n ± 1% 116.0n ± 3% +2.56% (p=0.000 n=15)
Div/400/200-88 135.3n ± 0% 137.1n ± 1% ~ (p=0.004 n=15)
Div/1000/500-88 259.9n ± 1% 259.0n ± 2% ~ (p=0.182 n=15)
Div/2000/1000-88 568.8n ± 1% 564.7n ± 3% ~ (p=0.927 n=15)
Div/20000/10000-88 25.79µ ± 1% 22.11µ ± 2% -14.26% (p=0.000 n=15)
Div/200000/100000-88 755.1µ ± 1% 737.6µ ± 1% -2.32% (p=0.000 n=15)
Div/2000000/1000000-88 31.30m ± 0% 31.20m ± 1% ~ (p=0.081 n=15)
Div/20000000/10000000-88 1.268 ± 0% 1.265 ± 0% ~ (p=0.011 n=15)
NatMul/10-88 142.6n ± 0% 142.9n ± 7% ~ (p=0.145 n=15)
NatMul/100-88 4.347µ ± 0% 4.350µ ± 3% ~ (p=0.430 n=15)
NatMul/1000-88 187.6µ ± 0% 188.4µ ± 2% ~ (p=0.004 n=15)
NatMul/10000-88 8.052m ± 0% 8.057m ± 1% ~ (p=0.148 n=15)
NatMul/100000-88 260.6m ± 0% 260.7m ± 0% ~ (p=0.512 n=15)
NatSqr/1-88 26.58n ± 5% 27.96n ± 8% ~ (p=0.574 n=15)
NatSqr/2-88 42.35n ± 7% 44.87n ± 6% ~ (p=0.690 n=15)
NatSqr/3-88 53.28n ± 4% 55.62n ± 5% ~ (p=0.151 n=15)
NatSqr/5-88 76.26n ± 6% 81.43n ± 6% +6.78% (p=0.000 n=15)
NatSqr/8-88 110.8n ± 5% 116.4n ± 6% ~ (p=0.040 n=15)
NatSqr/10-88 141.4n ± 4% 147.8n ± 4% ~ (p=0.011 n=15)
NatSqr/20-88 325.8n ± 3% 341.7n ± 4% +4.88% (p=0.000 n=15)
NatSqr/30-88 536.8n ± 3% 556.1n ± 4% ~ (p=0.027 n=15)
NatSqr/50-88 1.168µ ± 3% 1.197µ ± 3% ~ (p=0.442 n=15)
NatSqr/80-88 2.527µ ± 2% 2.480µ ± 2% -1.86% (p=0.000 n=15)
NatSqr/100-88 3.771µ ± 2% 3.535µ ± 2% -6.26% (p=0.000 n=15)
NatSqr/200-88 14.03µ ± 2% 10.57µ ± 3% -24.68% (p=0.000 n=15)
NatSqr/300-88 24.06µ ± 2% 20.57µ ± 2% -14.52% (p=0.000 n=15)
NatSqr/500-88 65.43µ ± 1% 45.45µ ± 1% -30.55% (p=0.000 n=15)
NatSqr/800-88 126.41µ ± 1% 94.13µ ± 2% -25.54% (p=0.000 n=15)
NatSqr/1000-88 196.4µ ± 1% 135.1µ ± 1% -31.18% (p=0.000 n=15)
NatSqr/10000-88 6.404m ± 0% 5.326m ± 1% -16.84% (p=0.000 n=15)
NatSqr/100000-88 267.2m ± 0% 198.7m ± 0% -25.64% (p=0.000 n=15)
geomean 7.318µ 6.948µ -5.06%
goos: linux
goarch: amd64
pkg: math/big
cpu: Intel(R) Xeon(R) CPU @ 3.10GHz
│ old │ new │
│ sec/op │ sec/op vs base │
Div/20/10-16 22.23n ± 0% 22.23n ± 0% ~ (p=0.973 n=15)
Div/40/20-16 22.23n ± 0% 22.23n ± 0% ~ (p=0.226 n=15)
Div/100/50-16 55.27n ± 1% 55.59n ± 0% ~ (p=0.004 n=15)
Div/200/100-16 174.7n ± 3% 175.9n ± 2% ~ (p=0.645 n=15)
Div/400/200-16 208.8n ± 1% 209.5n ± 2% ~ (p=0.169 n=15)
Div/1000/500-16 378.7n ± 2% 380.5n ± 2% ~ (p=0.091 n=15)
Div/2000/1000-16 778.4n ± 1% 781.1n ± 2% ~ (p=0.104 n=15)
Div/20000/10000-16 25.16µ ± 1% 24.93µ ± 1% -0.91% (p=0.000 n=15)
Div/200000/100000-16 926.4µ ± 0% 927.7µ ± 1% ~ (p=0.436 n=15)
Div/2000000/1000000-16 35.58m ± 0% 35.53m ± 0% ~ (p=0.267 n=15)
Div/20000000/10000000-16 1.333 ± 0% 1.330 ± 0% ~ (p=0.126 n=15)
NatMul/10-16 172.6n ± 0% 165.4n ± 0% -4.17% (p=0.000 n=15)
NatMul/100-16 5.706µ ± 0% 5.503µ ± 0% -3.56% (p=0.000 n=15)
NatMul/1000-16 220.8µ ± 0% 219.1µ ± 0% -0.76% (p=0.000 n=15)
NatMul/10000-16 8.688m ± 0% 8.621m ± 0% -0.77% (p=0.000 n=15)
NatMul/100000-16 333.3m ± 0% 333.5m ± 0% ~ (p=0.512 n=15)
NatSqr/1-16 28.66n ± 1% 28.42n ± 3% -0.84% (p=0.000 n=15)
NatSqr/2-16 48.29n ± 2% 48.19n ± 2% ~ (p=0.042 n=15)
NatSqr/3-16 59.93n ± 0% 59.64n ± 2% -0.48% (p=0.000 n=15)
NatSqr/5-16 88.05n ± 0% 87.89n ± 3% ~ (p=0.066 n=15)
NatSqr/8-16 127.7n ± 0% 126.9n ± 3% -0.63% (p=0.000 n=15)
NatSqr/10-16 170.4n ± 0% 169.7n ± 3% ~ (p=0.004 n=15)
NatSqr/20-16 388.8n ± 0% 392.9n ± 3% ~ (p=0.123 n=15)
NatSqr/30-16 635.2n ± 0% 641.7n ± 3% ~ (p=0.123 n=15)
NatSqr/50-16 1.304µ ± 1% 1.314µ ± 3% ~ (p=0.927 n=15)
NatSqr/80-16 2.709µ ± 1% 2.899µ ± 4% +7.01% (p=0.000 n=15)
NatSqr/100-16 3.885µ ± 0% 3.981µ ± 4% ~ (p=0.123 n=15)
NatSqr/200-16 13.29µ ± 2% 12.14µ ± 4% -8.67% (p=0.000 n=15)
NatSqr/300-16 23.39µ ± 0% 22.51µ ± 3% -3.78% (p=0.000 n=15)
NatSqr/500-16 58.13µ ± 1% 50.56µ ± 2% -13.02% (p=0.000 n=15)
NatSqr/800-16 118.4µ ± 1% 107.6µ ± 2% -9.11% (p=0.000 n=15)
NatSqr/1000-16 172.7µ ± 1% 151.8µ ± 2% -12.11% (p=0.000 n=15)
NatSqr/10000-16 6.065m ± 1% 5.757m ± 1% -5.08% (p=0.000 n=15)
NatSqr/100000-16 240.9m ± 0% 228.1m ± 0% -5.32% (p=0.000 n=15)
geomean 8.601µ 8.453µ -1.71%
goos: linux
goarch: amd64
pkg: math/big
cpu: AMD Ryzen 9 7950X 16-Core Processor
│ old │ new │
│ sec/op │ sec/op vs base │
Div/20/10-32 11.11n ± 0% 11.11n ± 1% ~ (p=0.532 n=15)
Div/40/20-32 11.08n ± 1% 11.11n ± 0% ~ (p=0.815 n=15)
Div/100/50-32 16.81n ± 0% 16.84n ± 29% ~ (p=0.020 n=15)
Div/200/100-32 73.91n ± 0% 76.85n ± 11% +3.98% (p=0.000 n=15)
Div/400/200-32 87.35n ± 0% 88.91n ± 34% +1.79% (p=0.000 n=15)
Div/1000/500-32 169.3n ± 1% 168.9n ± 1% ~ (p=0.049 n=15)
Div/2000/1000-32 369.3n ± 0% 369.0n ± 0% ~ (p=0.108 n=15)
Div/20000/10000-32 15.92µ ± 0% 13.55µ ± 2% -14.91% (p=0.000 n=15)
Div/200000/100000-32 491.4µ ± 0% 482.4µ ± 1% -1.84% (p=0.000 n=15)
Div/2000000/1000000-32 20.09m ± 0% 19.96m ± 0% -0.69% (p=0.000 n=15)
Div/20000000/10000000-32 756.5m ± 0% 755.5m ± 0% ~ (p=0.089 n=15)
NatMul/10-32 125.4n ± 5% 124.8n ± 1% ~ (p=0.588 n=15)
NatMul/100-32 2.952µ ± 3% 2.969µ ± 0% ~ (p=0.237 n=15)
NatMul/1000-32 120.7µ ± 0% 121.1µ ± 0% +0.30% (p=0.000 n=15)
NatMul/10000-32 4.845m ± 0% 4.839m ± 1% ~ (p=0.653 n=15)
NatMul/100000-32 173.3m ± 0% 173.3m ± 0% ~ (p=0.838 n=15)
NatSqr/1-32 31.18n ± 23% 32.08n ± 2% ~ (p=0.015 n=15)
NatSqr/2-32 57.22n ± 28% 58.88n ± 2% ~ (p=0.054 n=15)
NatSqr/3-32 61.34n ± 18% 64.33n ± 2% ~ (p=0.237 n=15)
NatSqr/5-32 72.47n ± 17% 79.81n ± 3% ~ (p=0.067 n=15)
NatSqr/8-32 83.26n ± 26% 100.10n ± 3% ~ (p=0.016 n=15)
NatSqr/10-32 87.31n ± 43% 125.50n ± 2% ~ (p=0.003 n=15)
NatSqr/20-32 193.5n ± 25% 244.4n ± 13% ~ (p=0.002 n=15)
NatSqr/30-32 323.9n ± 17% 380.9n ± 6% ~ (p=0.003 n=15)
NatSqr/50-32 713.4n ± 9% 761.7n ± 8% ~ (p=0.419 n=15)
NatSqr/80-32 1.486µ ± 7% 1.609µ ± 5% +8.28% (p=0.000 n=15)
NatSqr/100-32 2.115µ ± 9% 2.253µ ± 1% ~ (p=0.104 n=15)
NatSqr/200-32 7.201µ ± 4% 6.610µ ± 1% -8.21% (p=0.000 n=15)
NatSqr/300-32 13.08µ ± 2% 12.37µ ± 1% -5.41% (p=0.000 n=15)
NatSqr/500-32 32.56µ ± 2% 27.83µ ± 2% -14.52% (p=0.000 n=15)
NatSqr/800-32 66.83µ ± 3% 59.59µ ± 1% -10.83% (p=0.000 n=15)
NatSqr/1000-32 98.09µ ± 1% 83.59µ ± 1% -14.78% (p=0.000 n=15)
NatSqr/10000-32 3.445m ± 1% 3.245m ± 0% -5.81% (p=0.000 n=15)
NatSqr/100000-32 137.3m ± 0% 127.0m ± 0% -7.54% (p=0.000 n=15)
geomean 4.897µ 4.972µ +1.52%
goos: linux
goarch: arm64
pkg: math/big
│ old │ new │
│ sec/op │ sec/op vs base │
Div/20/10-16 15.26n ± 2% 15.14n ± 1% ~ (p=0.212 n=15)
Div/40/20-16 15.22n ± 1% 15.16n ± 0% ~ (p=0.190 n=15)
Div/100/50-16 26.53n ± 2% 26.42n ± 0% -0.41% (p=0.000 n=15)
Div/200/100-16 124.3n ± 0% 124.0n ± 0% ~ (p=0.704 n=15)
Div/400/200-16 142.4n ± 0% 141.8n ± 0% ~ (p=0.074 n=15)
Div/1000/500-16 262.0n ± 1% 261.3n ± 1% ~ (p=0.046 n=15)
Div/2000/1000-16 532.6n ± 0% 532.5n ± 1% ~ (p=0.798 n=15)
Div/20000/10000-16 22.27µ ± 0% 22.88µ ± 0% +2.73% (p=0.000 n=15)
Div/200000/100000-16 890.4µ ± 0% 902.8µ ± 0% +1.39% (p=0.000 n=15)
Div/2000000/1000000-16 35.03m ± 0% 35.10m ± 0% ~ (p=0.305 n=15)
Div/20000000/10000000-16 1.380 ± 0% 1.385 ± 0% ~ (p=0.019 n=15)
NatMul/10-16 177.6n ± 1% 175.6n ± 3% ~ (p=0.480 n=15)
NatMul/100-16 5.675µ ± 0% 5.669µ ± 1% ~ (p=0.705 n=15)
NatMul/1000-16 224.3µ ± 0% 224.6µ ± 0% ~ (p=0.653 n=15)
NatMul/10000-16 8.735m ± 0% 8.739m ± 0% ~ (p=0.567 n=15)
NatMul/100000-16 331.6m ± 0% 331.6m ± 1% ~ (p=0.412 n=15)
NatSqr/1-16 43.69n ± 2% 42.77n ± 6% ~ (p=0.383 n=15)
NatSqr/2-16 65.26n ± 2% 63.91n ± 5% ~ (p=0.285 n=15)
NatSqr/3-16 73.95n ± 1% 72.25n ± 6% ~ (p=0.198 n=15)
NatSqr/5-16 95.06n ± 1% 94.21n ± 3% ~ (p=0.721 n=15)
NatSqr/8-16 155.5n ± 1% 153.4n ± 4% ~ (p=0.170 n=15)
NatSqr/10-16 175.4n ± 1% 174.0n ± 2% ~ (p=0.271 n=15)
NatSqr/20-16 360.8n ± 0% 358.5n ± 2% ~ (p=0.170 n=15)
NatSqr/30-16 584.7n ± 0% 582.9n ± 1% ~ (p=0.170 n=15)
NatSqr/50-16 1.323µ ± 0% 1.322µ ± 0% ~ (p=0.627 n=15)
NatSqr/80-16 2.916µ ± 0% 2.674µ ± 0% -8.30% (p=0.000 n=15)
NatSqr/100-16 4.365µ ± 0% 3.802µ ± 0% -12.90% (p=0.000 n=15)
NatSqr/200-16 16.42µ ± 0% 11.29µ ± 0% -31.26% (p=0.000 n=15)
NatSqr/300-16 28.07µ ± 0% 22.83µ ± 0% -18.68% (p=0.000 n=15)
NatSqr/500-16 76.30µ ± 0% 50.06µ ± 0% -34.39% (p=0.000 n=15)
NatSqr/800-16 147.5µ ± 0% 101.2µ ± 1% -31.41% (p=0.000 n=15)
NatSqr/1000-16 228.6µ ± 0% 149.5µ ± 0% -34.61% (p=0.000 n=15)
NatSqr/10000-16 7.417m ± 0% 6.025m ± 0% -18.76% (p=0.000 n=15)
NatSqr/100000-16 309.2m ± 0% 214.9m ± 0% -30.50% (p=0.000 n=15)
geomean 8.559µ 7.906µ -7.63%
goos: darwin
goarch: arm64
pkg: math/big
cpu: Apple M3 Pro
│ old │ new │
│ sec/op │ sec/op vs base │
Div/20/10-12 9.577n ± 6% 9.473n ± 5% ~ (p=0.384 n=15)
Div/40/20-12 9.480n ± 1% 9.430n ± 1% ~ (p=0.019 n=15)
Div/100/50-12 14.82n ± 0% 14.82n ± 0% ~ (p=0.845 n=15)
Div/200/100-12 83.94n ± 1% 84.35n ± 4% ~ (p=0.512 n=15)
Div/400/200-12 102.7n ± 1% 102.9n ± 0% ~ (p=0.845 n=15)
Div/1000/500-12 185.3n ± 1% 181.9n ± 1% -1.83% (p=0.000 n=15)
Div/2000/1000-12 397.0n ± 1% 396.7n ± 0% ~ (p=0.959 n=15)
Div/20000/10000-12 14.05µ ± 0% 13.70µ ± 1% ~ (p=0.002 n=15)
Div/200000/100000-12 529.4µ ± 3% 526.7µ ± 2% ~ (p=0.967 n=15)
Div/2000000/1000000-12 20.05m ± 0% 20.05m ± 0% ~ (p=0.653 n=15)
Div/20000000/10000000-12 788.2m ± 1% 789.0m ± 1% ~ (p=0.412 n=15)
NatMul/10-12 79.95n ± 1% 80.87n ± 1% +1.15% (p=0.000 n=15)
NatMul/100-12 2.973µ ± 0% 2.986µ ± 2% ~ (p=0.051 n=15)
NatMul/1000-12 122.6µ ± 5% 123.0µ ± 1% ~ (p=0.783 n=15)
NatMul/10000-12 4.990m ± 1% 5.000m ± 1% ~ (p=0.653 n=15)
NatMul/100000-12 185.3m ± 3% 190.3m ± 1% ~ (p=0.089 n=15)
NatSqr/1-12 11.84n ± 1% 11.88n ± 1% ~ (p=0.735 n=15)
NatSqr/2-12 21.01n ± 1% 21.44n ± 6% ~ (p=0.039 n=15)
NatSqr/3-12 25.59n ± 0% 26.74n ± 9% +4.49% (p=0.000 n=15)
NatSqr/5-12 36.78n ± 0% 37.04n ± 1% +0.71% (p=0.000 n=15)
NatSqr/8-12 63.09n ± 3% 63.22n ± 1% ~ (p=0.846 n=15)
NatSqr/10-12 79.98n ± 0% 79.78n ± 0% ~ (p=0.100 n=15)
NatSqr/20-12 174.0n ± 0% 175.5n ± 1% ~ (p=0.361 n=15)
NatSqr/30-12 290.0n ± 0% 291.4n ± 0% ~ (p=0.002 n=15)
NatSqr/50-12 655.2n ± 4% 658.1n ± 0% ~ (p=0.060 n=15)
NatSqr/80-12 1.506µ ± 0% 1.397µ ± 5% -7.24% (p=0.000 n=15)
NatSqr/100-12 2.273µ ± 0% 2.005µ ± 5% -11.79% (p=0.000 n=15)
NatSqr/200-12 8.833µ ± 6% 6.109µ ± 0% -30.84% (p=0.000 n=15)
NatSqr/300-12 15.15µ ± 4% 12.37µ ± 0% -18.34% (p=0.000 n=15)
NatSqr/500-12 41.89µ ± 0% 27.70µ ± 1% -33.88% (p=0.000 n=15)
NatSqr/800-12 80.72µ ± 0% 56.40µ ± 0% -30.12% (p=0.000 n=15)
NatSqr/1000-12 127.06µ ± 1% 84.06µ ± 1% -33.84% (p=0.000 n=15)
NatSqr/10000-12 4.130m ± 0% 3.390m ± 0% -17.91% (p=0.000 n=15)
NatSqr/100000-12 173.2m ± 0% 131.2m ± 6% -24.25% (p=0.000 n=15)
geomean 4.489µ 4.189µ -6.68%
Change-Id: Iaf65fd85457b003ebf07a787c875cda321b40cc9
Reviewed-on: https://go-review.googlesource.com/c/go/+/652058
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Robert Griesemer <gri@google.com>
Reviewed-by: Alan Donovan <adonovan@google.com>
Auto-Submit: Russ Cox <rsc@golang.org>
|
|
The old Karatsuba implementation only operated on lengths that are
a power of two times a number smaller than karatsubaThreshold.
For example, when karatsubaThreshold = 40, multiplying a pair
of 99-word numbers runs karatsuba on the low 96 (= 24<<2) words
and then has to fix up the answer to include the high 3 words of each.
I suspect this requirement was needed to make the analysis of
how many temporary words to reserve easier, back when the
answer was 3*n and depended on exactly halving the size at
each Karatsuba step.
Now that we have the more flexible temporary allocation stack,
we can change Karatsuba to accept operands of odd length.
Doing so avoids most of the fixup that the old approach required.
For example, multiplying a pair of 99-word numbers runs
karatsuba on all 99 words now.
This is simpler and about the same speed or, for large cases, faster.
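The old length restriction can be illustrated with a short sketch. The
name karatsubaLen and the exact loop are assumptions made for this
illustration: halve the length until it is at most the threshold, then
scale back up by the same power of two. For 99 words and a threshold of
40, two halvings leave 24 words, and 24<<2 = 96.

```go
package main

import "fmt"

// karatsubaLen returns the largest length of the form k<<i (with k at
// most threshold) that the old scheme would run Karatsuba on for an
// n-word operand. The remaining high words needed a separate fixup.
func karatsubaLen(n, threshold int) int {
	i := uint(0)
	for n > threshold {
		n >>= 1
		i++
	}
	return n << i
}

func main() {
	fmt.Println(karatsubaLen(99, 40)) // 96
}
```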
goos: linux
goarch: amd64
pkg: math/big
cpu: Intel(R) Xeon(R) CPU @ 3.10GHz
│ old │ new │
│ sec/op │ sec/op vs base │
GCD10x10/WithoutXY-16 99.62n ± 3% 99.10n ± 3% ~ (p=0.009 n=15)
GCD10x10/WithXY-16 243.4n ± 1% 245.2n ± 1% ~ (p=0.009 n=15)
GCD100x100/WithoutXY-16 921.9n ± 1% 919.2n ± 1% ~ (p=0.076 n=15)
GCD100x100/WithXY-16 1.527µ ± 1% 1.526µ ± 0% ~ (p=0.813 n=15)
GCD1000x1000/WithoutXY-16 9.704µ ± 1% 9.696µ ± 0% ~ (p=0.532 n=15)
GCD1000x1000/WithXY-16 14.03µ ± 1% 13.96µ ± 0% ~ (p=0.014 n=15)
GCD10000x10000/WithoutXY-16 206.5µ ± 2% 206.5µ ± 0% ~ (p=0.967 n=15)
GCD10000x10000/WithXY-16 398.0µ ± 1% 397.4µ ± 0% ~ (p=0.683 n=15)
Div/20/10-16 22.22n ± 0% 22.23n ± 0% ~ (p=0.105 n=15)
Div/40/20-16 22.23n ± 0% 22.23n ± 0% ~ (p=0.307 n=15)
Div/100/50-16 55.47n ± 0% 55.47n ± 0% ~ (p=0.573 n=15)
Div/200/100-16 174.9n ± 1% 174.6n ± 1% ~ (p=0.814 n=15)
Div/400/200-16 209.5n ± 1% 210.5n ± 1% ~ (p=0.454 n=15)
Div/1000/500-16 379.9n ± 0% 383.5n ± 2% ~ (p=0.123 n=15)
Div/2000/1000-16 780.1n ± 0% 784.6n ± 1% +0.58% (p=0.000 n=15)
Div/20000/10000-16 25.22µ ± 1% 25.15µ ± 0% ~ (p=0.213 n=15)
Div/200000/100000-16 921.8µ ± 1% 926.1µ ± 0% ~ (p=0.009 n=15)
Div/2000000/1000000-16 37.91m ± 0% 35.63m ± 0% -6.02% (p=0.000 n=15)
Div/20000000/10000000-16 1.378 ± 0% 1.336 ± 0% -3.03% (p=0.000 n=15)
NatMul/10-16 166.8n ± 4% 168.9n ± 3% ~ (p=0.008 n=15)
NatMul/100-16 5.519µ ± 2% 5.548µ ± 4% ~ (p=0.032 n=15)
NatMul/1000-16 230.4µ ± 1% 220.2µ ± 1% -4.43% (p=0.000 n=15)
NatMul/10000-16 8.569m ± 1% 8.640m ± 1% ~ (p=0.005 n=15)
NatMul/100000-16 376.5m ± 1% 334.1m ± 0% -11.26% (p=0.000 n=15)
NatSqr/1-16 27.85n ± 5% 28.60n ± 2% ~ (p=0.123 n=15)
NatSqr/2-16 47.99n ± 2% 48.84n ± 1% ~ (p=0.008 n=15)
NatSqr/3-16 59.41n ± 2% 60.87n ± 2% +2.46% (p=0.001 n=15)
NatSqr/5-16 87.27n ± 2% 89.31n ± 3% ~ (p=0.087 n=15)
NatSqr/8-16 124.6n ± 3% 128.9n ± 3% ~ (p=0.006 n=15)
NatSqr/10-16 166.3n ± 3% 172.7n ± 3% ~ (p=0.002 n=15)
NatSqr/20-16 385.2n ± 2% 394.7n ± 3% ~ (p=0.036 n=15)
NatSqr/30-16 622.7n ± 3% 642.9n ± 3% ~ (p=0.032 n=15)
NatSqr/50-16 1.274µ ± 3% 1.323µ ± 4% ~ (p=0.003 n=15)
NatSqr/80-16 2.606µ ± 4% 2.714µ ± 4% ~ (p=0.044 n=15)
NatSqr/100-16 3.731µ ± 4% 3.871µ ± 4% ~ (p=0.038 n=15)
NatSqr/200-16 12.99µ ± 2% 13.09µ ± 3% ~ (p=0.838 n=15)
NatSqr/300-16 22.87µ ± 2% 23.25µ ± 2% ~ (p=0.285 n=15)
NatSqr/500-16 58.43µ ± 1% 58.25µ ± 2% ~ (p=0.345 n=15)
NatSqr/800-16 115.3µ ± 3% 116.2µ ± 3% ~ (p=0.126 n=15)
NatSqr/1000-16 173.9µ ± 1% 174.3µ ± 1% ~ (p=0.935 n=15)
NatSqr/10000-16 6.133m ± 2% 6.034m ± 1% -1.62% (p=0.000 n=15)
NatSqr/100000-16 253.8m ± 1% 241.5m ± 0% -4.87% (p=0.000 n=15)
geomean 7.745µ 7.760µ +0.19%
goos: linux
goarch: amd64
pkg: math/big
cpu: Intel(R) Xeon(R) Platinum 8481C CPU @ 2.70GHz
│ old │ new │
│ sec/op │ sec/op vs base │
GCD10x10/WithoutXY-88 62.17n ± 4% 61.44n ± 0% -1.17% (p=0.000 n=15)
GCD10x10/WithXY-88 173.4n ± 2% 172.4n ± 4% ~ (p=0.615 n=15)
GCD100x100/WithoutXY-88 584.0n ± 1% 582.9n ± 0% ~ (p=0.009 n=15)
GCD100x100/WithXY-88 1.098µ ± 1% 1.091µ ± 2% ~ (p=0.002 n=15)
GCD1000x1000/WithoutXY-88 6.055µ ± 0% 6.049µ ± 0% ~ (p=0.007 n=15)
GCD1000x1000/WithXY-88 9.430µ ± 0% 9.417µ ± 1% ~ (p=0.123 n=15)
GCD10000x10000/WithoutXY-88 153.4µ ± 2% 149.0µ ± 2% -2.85% (p=0.000 n=15)
GCD10000x10000/WithXY-88 350.6µ ± 3% 349.0µ ± 2% ~ (p=0.126 n=15)
Div/20/10-88 13.12n ± 0% 13.12n ± 1% 0.00% (p=0.042 n=15)
Div/40/20-88 13.12n ± 0% 13.13n ± 0% ~ (p=0.004 n=15)
Div/100/50-88 25.49n ± 0% 25.49n ± 0% ~ (p=0.452 n=15)
Div/200/100-88 115.7n ± 2% 113.8n ± 2% ~ (p=0.212 n=15)
Div/400/200-88 135.0n ± 1% 136.1n ± 1% ~ (p=0.005 n=15)
Div/1000/500-88 257.5n ± 1% 259.9n ± 1% ~ (p=0.004 n=15)
Div/2000/1000-88 567.5n ± 1% 572.4n ± 2% ~ (p=0.616 n=15)
Div/20000/10000-88 25.65µ ± 0% 25.77µ ± 1% ~ (p=0.032 n=15)
Div/200000/100000-88 777.4µ ± 1% 754.3µ ± 1% -2.97% (p=0.000 n=15)
Div/2000000/1000000-88 33.66m ± 0% 31.37m ± 0% -6.81% (p=0.000 n=15)
Div/20000000/10000000-88 1.320 ± 0% 1.266 ± 0% -4.04% (p=0.000 n=15)
NatMul/10-88 151.9n ± 7% 143.3n ± 7% ~ (p=0.878 n=15)
NatMul/100-88 4.418µ ± 2% 4.337µ ± 3% ~ (p=0.512 n=15)
NatMul/1000-88 206.8µ ± 1% 189.8µ ± 1% -8.25% (p=0.000 n=15)
NatMul/10000-88 8.531m ± 1% 8.095m ± 0% -5.12% (p=0.000 n=15)
NatMul/100000-88 298.9m ± 0% 260.5m ± 1% -12.85% (p=0.000 n=15)
NatSqr/1-88 27.55n ± 6% 28.25n ± 7% ~ (p=0.024 n=15)
NatSqr/2-88 44.71n ± 6% 46.21n ± 9% ~ (p=0.024 n=15)
NatSqr/3-88 55.44n ± 4% 58.41n ± 10% ~ (p=0.126 n=15)
NatSqr/5-88 80.71n ± 5% 81.41n ± 5% ~ (p=0.032 n=15)
NatSqr/8-88 115.7n ± 4% 115.4n ± 5% ~ (p=0.814 n=15)
NatSqr/10-88 147.4n ± 4% 147.3n ± 4% ~ (p=0.505 n=15)
NatSqr/20-88 337.8n ± 3% 337.3n ± 4% ~ (p=0.814 n=15)
NatSqr/30-88 556.9n ± 3% 557.6n ± 4% ~ (p=0.814 n=15)
NatSqr/50-88 1.208µ ± 4% 1.208µ ± 3% ~ (p=0.910 n=15)
NatSqr/80-88 2.591µ ± 3% 2.581µ ± 3% ~ (p=0.705 n=15)
NatSqr/100-88 3.870µ ± 3% 3.858µ ± 3% ~ (p=0.846 n=15)
NatSqr/200-88 14.43µ ± 3% 14.28µ ± 2% ~ (p=0.383 n=15)
NatSqr/300-88 24.68µ ± 2% 24.49µ ± 2% ~ (p=0.624 n=15)
NatSqr/500-88 66.27µ ± 1% 66.18µ ± 1% ~ (p=0.735 n=15)
NatSqr/800-88 128.7µ ± 1% 127.4µ ± 1% ~ (p=0.050 n=15)
NatSqr/1000-88 198.7µ ± 1% 197.7µ ± 1% ~ (p=0.229 n=15)
NatSqr/10000-88 6.582m ± 1% 6.426m ± 1% -2.37% (p=0.000 n=15)
NatSqr/100000-88 274.3m ± 0% 267.3m ± 0% -2.57% (p=0.000 n=15)
geomean 6.518µ 6.438µ -1.22%
goos: linux
goarch: arm64
pkg: math/big
│ old │ new │
│ sec/op │ sec/op vs base │
GCD10x10/WithoutXY-16 61.70n ± 1% 61.32n ± 1% ~ (p=0.361 n=15)
GCD10x10/WithXY-16 217.3n ± 1% 217.0n ± 1% ~ (p=0.395 n=15)
GCD100x100/WithoutXY-16 569.7n ± 0% 572.6n ± 2% ~ (p=0.213 n=15)
GCD100x100/WithXY-16 1.241µ ± 1% 1.236µ ± 1% ~ (p=0.157 n=15)
GCD1000x1000/WithoutXY-16 5.558µ ± 0% 5.566µ ± 0% ~ (p=0.228 n=15)
GCD1000x1000/WithXY-16 9.319µ ± 0% 9.326µ ± 0% ~ (p=0.233 n=15)
GCD10000x10000/WithoutXY-16 126.4µ ± 2% 128.7µ ± 3% ~ (p=0.081 n=15)
GCD10000x10000/WithXY-16 279.3µ ± 0% 278.3µ ± 5% ~ (p=0.187 n=15)
Div/20/10-16 15.12n ± 1% 15.21n ± 1% ~ (p=0.490 n=15)
Div/40/20-16 15.11n ± 0% 15.23n ± 1% ~ (p=0.107 n=15)
Div/100/50-16 26.53n ± 0% 26.50n ± 0% ~ (p=0.299 n=15)
Div/200/100-16 123.7n ± 0% 124.0n ± 0% ~ (p=0.086 n=15)
Div/400/200-16 142.5n ± 0% 142.4n ± 0% ~ (p=0.039 n=15)
Div/1000/500-16 259.9n ± 1% 261.2n ± 1% ~ (p=0.044 n=15)
Div/2000/1000-16 539.4n ± 1% 532.3n ± 1% -1.32% (p=0.001 n=15)
Div/20000/10000-16 22.43µ ± 0% 22.32µ ± 0% -0.49% (p=0.000 n=15)
Div/200000/100000-16 898.3µ ± 0% 889.6µ ± 0% -0.96% (p=0.000 n=15)
Div/2000000/1000000-16 38.37m ± 0% 35.11m ± 0% -8.49% (p=0.000 n=15)
Div/20000000/10000000-16 1.449 ± 0% 1.384 ± 0% -4.48% (p=0.000 n=15)
NatMul/10-16 182.0n ± 1% 177.8n ± 1% -2.31% (p=0.000 n=15)
NatMul/100-16 5.537µ ± 0% 5.693µ ± 0% +2.82% (p=0.000 n=15)
NatMul/1000-16 229.9µ ± 0% 224.8µ ± 0% -2.24% (p=0.000 n=15)
NatMul/10000-16 8.985m ± 0% 8.751m ± 0% -2.61% (p=0.000 n=15)
NatMul/100000-16 371.1m ± 0% 331.5m ± 0% -10.66% (p=0.000 n=15)
NatSqr/1-16 46.77n ± 6% 42.76n ± 1% -8.57% (p=0.000 n=15)
NatSqr/2-16 66.99n ± 4% 63.62n ± 1% -5.03% (p=0.000 n=15)
NatSqr/3-16 76.79n ± 4% 73.42n ± 1% ~ (p=0.007 n=15)
NatSqr/5-16 99.00n ± 3% 95.35n ± 1% -3.69% (p=0.000 n=15)
NatSqr/8-16 160.0n ± 3% 155.1n ± 1% -3.06% (p=0.001 n=15)
NatSqr/10-16 178.4n ± 2% 175.9n ± 0% -1.40% (p=0.001 n=15)
NatSqr/20-16 361.9n ± 2% 361.3n ± 0% ~ (p=0.083 n=15)
NatSqr/30-16 584.7n ± 0% 586.8n ± 0% +0.36% (p=0.000 n=15)
NatSqr/50-16 1.327µ ± 0% 1.329µ ± 0% ~ (p=0.349 n=15)
NatSqr/80-16 2.893µ ± 1% 2.925µ ± 0% +1.11% (p=0.000 n=15)
NatSqr/100-16 4.330µ ± 1% 4.381µ ± 0% +1.18% (p=0.000 n=15)
NatSqr/200-16 16.25µ ± 1% 16.43µ ± 0% +1.07% (p=0.000 n=15)
NatSqr/300-16 27.85µ ± 1% 28.06µ ± 0% +0.77% (p=0.000 n=15)
NatSqr/500-16 76.01µ ± 0% 76.34µ ± 0% ~ (p=0.002 n=15)
NatSqr/800-16 146.8µ ± 0% 148.1µ ± 0% +0.83% (p=0.000 n=15)
NatSqr/1000-16 228.2µ ± 0% 228.6µ ± 0% ~ (p=0.123 n=15)
NatSqr/10000-16 7.524m ± 0% 7.426m ± 0% -1.31% (p=0.000 n=15)
NatSqr/100000-16 316.7m ± 0% 309.2m ± 0% -2.36% (p=0.000 n=15)
geomean 7.264µ 7.172µ -1.27%
goos: darwin
goarch: arm64
pkg: math/big
cpu: Apple M3 Pro
│ old │ new │
│ sec/op │ sec/op vs base │
GCD10x10/WithoutXY-12 32.61n ± 1% 32.42n ± 1% ~ (p=0.021 n=15)
GCD10x10/WithXY-12 87.70n ± 1% 88.42n ± 1% ~ (p=0.010 n=15)
GCD100x100/WithoutXY-12 305.9n ± 0% 306.4n ± 0% ~ (p=0.003 n=15)
GCD100x100/WithXY-12 560.3n ± 2% 556.6n ± 1% ~ (p=0.018 n=15)
GCD1000x1000/WithoutXY-12 3.509µ ± 2% 3.464µ ± 1% ~ (p=0.145 n=15)
GCD1000x1000/WithXY-12 5.347µ ± 2% 5.372µ ± 1% ~ (p=0.046 n=15)
GCD10000x10000/WithoutXY-12 73.75µ ± 1% 73.99µ ± 1% ~ (p=0.004 n=15)
GCD10000x10000/WithXY-12 148.4µ ± 0% 147.8µ ± 1% ~ (p=0.076 n=15)
Div/20/10-12 9.481n ± 0% 9.462n ± 1% ~ (p=0.631 n=15)
Div/40/20-12 9.457n ± 0% 9.462n ± 1% ~ (p=0.798 n=15)
Div/100/50-12 14.91n ± 0% 14.79n ± 1% -0.80% (p=0.000 n=15)
Div/200/100-12 84.56n ± 1% 84.60n ± 1% ~ (p=0.271 n=15)
Div/400/200-12 103.8n ± 0% 102.8n ± 0% -0.96% (p=0.000 n=15)
Div/1000/500-12 181.3n ± 1% 184.2n ± 2% ~ (p=0.091 n=15)
Div/2000/1000-12 397.5n ± 0% 397.4n ± 0% ~ (p=0.299 n=15)
Div/20000/10000-12 14.04µ ± 1% 13.99µ ± 0% ~ (p=0.221 n=15)
Div/200000/100000-12 523.1µ ± 0% 514.0µ ± 3% ~ (p=0.775 n=15)
Div/2000000/1000000-12 21.58m ± 0% 20.01m ± 1% -7.29% (p=0.000 n=15)
Div/20000000/10000000-12 813.5m ± 0% 796.2m ± 1% -2.13% (p=0.000 n=15)
NatMul/10-12 80.46n ± 1% 80.02n ± 1% ~ (p=0.063 n=15)
NatMul/100-12 2.904µ ± 0% 2.979µ ± 1% +2.58% (p=0.000 n=15)
NatMul/1000-12 127.8µ ± 0% 122.3µ ± 0% -4.28% (p=0.000 n=15)
NatMul/10000-12 5.141m ± 0% 4.975m ± 1% -3.23% (p=0.000 n=15)
NatMul/100000-12 208.8m ± 0% 189.6m ± 3% -9.21% (p=0.000 n=15)
NatSqr/1-12 11.90n ± 1% 11.76n ± 1% ~ (p=0.059 n=15)
NatSqr/2-12 21.33n ± 1% 21.12n ± 0% ~ (p=0.063 n=15)
NatSqr/3-12 26.05n ± 1% 25.79n ± 0% ~ (p=0.002 n=15)
NatSqr/5-12 37.31n ± 0% 36.98n ± 1% ~ (p=0.008 n=15)
NatSqr/8-12 63.07n ± 0% 62.75n ± 1% ~ (p=0.061 n=15)
NatSqr/10-12 79.48n ± 0% 79.59n ± 0% ~ (p=0.455 n=15)
NatSqr/20-12 173.1n ± 0% 173.2n ± 1% ~ (p=0.518 n=15)
NatSqr/30-12 288.6n ± 1% 289.2n ± 0% ~ (p=0.030 n=15)
NatSqr/50-12 653.3n ± 0% 653.3n ± 0% ~ (p=0.361 n=15)
NatSqr/80-12 1.492µ ± 0% 1.496µ ± 0% ~ (p=0.018 n=15)
NatSqr/100-12 2.270µ ± 1% 2.270µ ± 0% ~ (p=0.326 n=15)
NatSqr/200-12 8.776µ ± 1% 8.784µ ± 1% ~ (p=0.083 n=15)
NatSqr/300-12 15.07µ ± 0% 15.09µ ± 0% ~ (p=0.455 n=15)
NatSqr/500-12 41.71µ ± 0% 41.77µ ± 1% ~ (p=0.305 n=15)
NatSqr/800-12 80.77µ ± 1% 80.59µ ± 0% ~ (p=0.113 n=15)
NatSqr/1000-12 126.4µ ± 1% 126.5µ ± 0% ~ (p=0.683 n=15)
NatSqr/10000-12 4.204m ± 0% 4.119m ± 0% -2.02% (p=0.000 n=15)
NatSqr/100000-12 177.0m ± 0% 172.9m ± 0% -2.31% (p=0.000 n=15)
geomean 3.790µ 3.757µ -0.87%
Change-Id: Ifc7a9b61f678df216690511ac8bb9143189a795e
Reviewed-on: https://go-review.googlesource.com/c/go/+/652057
Auto-Submit: Russ Cox <rsc@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Robert Griesemer <gri@google.com>
|
|
In a division, the answer to N digits / D digits normally has N-D digits,
but not when N-D is negative. Fix the calculation of the number of
digits for the temporary in nat.rem so that it is never negative.
Fixes #72043.
Change-Id: Ib9faa430aeb6c5f4c4a730f1ec631d2bf3f7472c
Reviewed-on: https://go-review.googlesource.com/c/go/+/655156
Reviewed-by: Robert Griesemer <gri@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
Test that big.Int.Mul, when reusing the same target, does not allocate
temporary garbage during its computation. That code is going
to be modified in an upcoming CL.
Change-Id: I3ed55c06da030282233c29cd7af2a04f395dc7a2
Reviewed-on: https://go-review.googlesource.com/c/go/+/652056
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Alan Donovan <adonovan@google.com>
Auto-Submit: Russ Cox <rsc@golang.org>
|
|
No code changes.
This CL moves the multiplication (and squaring) code into natmul.go,
in preparation for cleaning up Karatsuba and then adding Toom-Cook
and FFT-based multiplication.
Change-Id: I7f84328284cc4e1ca4da0ebb9f666a5535e8d7f2
Reviewed-on: https://go-review.googlesource.com/c/go/+/652055
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Russ Cox <rsc@golang.org>
Reviewed-by: Alan Donovan <adonovan@google.com>
|
|
Avoid multiplies when converting base 2, 4, 16 inputs,
reducing conversion time from O(N²) to O(N).
The Base8 and Base10 code paths should be unmodified,
but the base-2,4,16 changes tickle the compiler to generate
better (amd64) or worse (arm64) code when really it should not.
This is described in detail in #71868 and should be ignored
for the purposes of this CL.
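The idea of converting a power-of-two base without multiplies can be
sketched for base 16 as follows. This is not the code from this CL: the
function parseHex, its use of uint words, and its little-endian result
are assumptions made for illustration. Each digit is placed with one
shift and one OR, so the work is linear in the number of digits.

```go
package main

import (
	"fmt"
	"math/bits"
)

// parseHex converts hex digits (most significant first) into
// little-endian words using only shifts and ORs, never a multiply.
// Leading zero words are not trimmed in this sketch.
func parseHex(s string) ([]uint, error) {
	const digitsPerWord = bits.UintSize / 4
	z := make([]uint, (len(s)+digitsPerWord-1)/digitsPerWord)
	for i := 0; i < len(s); i++ {
		ch := s[len(s)-1-i] // i-th least significant digit
		var d uint
		switch {
		case '0' <= ch && ch <= '9':
			d = uint(ch - '0')
		case 'a' <= ch && ch <= 'f':
			d = uint(ch-'a') + 10
		case 'A' <= ch && ch <= 'F':
			d = uint(ch-'A') + 10
		default:
			return nil, fmt.Errorf("invalid hex digit %q", ch)
		}
		// Each digit occupies 4 bits at a fixed position in its word.
		z[i/digitsPerWord] |= d << (4 * uint(i%digitsPerWord))
	}
	return z, nil
}

func main() {
	z, err := parseHex("ff")
	fmt.Println(z, err) // [255] <nil>
}
```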
goos: linux
goarch: amd64
pkg: math/big
cpu: Intel(R) Xeon(R) CPU @ 3.10GHz
│ old │ new │
│ sec/op │ sec/op vs base │
Scan/10/Base2-16 324.4n ± 0% 258.7n ± 0% -20.25% (p=0.000 n=15)
Scan/100/Base2-16 2.376µ ± 0% 1.968µ ± 0% -17.17% (p=0.000 n=15)
Scan/1000/Base2-16 23.89µ ± 0% 19.16µ ± 0% -19.80% (p=0.000 n=15)
Scan/10000/Base2-16 311.5µ ± 0% 190.4µ ± 0% -38.86% (p=0.000 n=15)
Scan/100000/Base2-16 10.508m ± 0% 1.904m ± 0% -81.88% (p=0.000 n=15)
Scan/10/Base8-16 138.3n ± 0% 127.9n ± 0% -7.52% (p=0.000 n=15)
Scan/100/Base8-16 886.1n ± 0% 790.2n ± 0% -10.82% (p=0.000 n=15)
Scan/1000/Base8-16 9.227µ ± 0% 8.234µ ± 0% -10.76% (p=0.000 n=15)
Scan/10000/Base8-16 165.8µ ± 0% 155.6µ ± 0% -6.19% (p=0.000 n=15)
Scan/100000/Base8-16 9.044m ± 0% 8.935m ± 0% -1.20% (p=0.000 n=15)
Scan/10/Base10-16 129.9n ± 0% 120.0n ± 0% -7.62% (p=0.000 n=15)
Scan/100/Base10-16 816.3n ± 0% 730.0n ± 0% -10.57% (p=0.000 n=15)
Scan/1000/Base10-16 8.518µ ± 0% 7.628µ ± 0% -10.45% (p=0.000 n=15)
Scan/10000/Base10-16 158.6µ ± 0% 149.4µ ± 0% -5.80% (p=0.000 n=15)
Scan/100000/Base10-16 8.962m ± 0% 8.855m ± 0% -1.20% (p=0.000 n=15)
Scan/10/Base16-16 114.5n ± 0% 108.6n ± 0% -5.15% (p=0.000 n=15)
Scan/100/Base16-16 648.3n ± 0% 525.0n ± 0% -19.02% (p=0.000 n=15)
Scan/1000/Base16-16 7.375µ ± 0% 5.636µ ± 0% -23.58% (p=0.000 n=15)
Scan/10000/Base16-16 171.18µ ± 0% 66.99µ ± 0% -60.87% (p=0.000 n=15)
Scan/100000/Base16-16 9490.9µ ± 0% 682.8µ ± 0% -92.81% (p=0.000 n=15)
geomean 20.11µ 13.69µ -31.94%
goos: linux
goarch: amd64
pkg: math/big
cpu: Intel(R) Xeon(R) Platinum 8481C CPU @ 2.70GHz
│ old │ new │
│ sec/op │ sec/op vs base │
Scan/10/Base2-88 275.4n ± 0% 215.0n ± 0% -21.93% (p=0.000 n=15)
Scan/100/Base2-88 1.869µ ± 0% 1.629µ ± 0% -12.84% (p=0.000 n=15)
Scan/1000/Base2-88 18.56µ ± 0% 15.81µ ± 0% -14.82% (p=0.000 n=15)
Scan/10000/Base2-88 270.0µ ± 0% 157.2µ ± 0% -41.77% (p=0.000 n=15)
Scan/100000/Base2-88 11.518m ± 0% 1.571m ± 0% -86.36% (p=0.000 n=15)
Scan/10/Base8-88 108.9n ± 0% 106.0n ± 0% -2.66% (p=0.000 n=15)
Scan/100/Base8-88 655.2n ± 0% 594.9n ± 0% -9.20% (p=0.000 n=15)
Scan/1000/Base8-88 6.467µ ± 0% 5.966µ ± 0% -7.75% (p=0.000 n=15)
Scan/10000/Base8-88 151.2µ ± 0% 147.4µ ± 0% -2.53% (p=0.000 n=15)
Scan/100000/Base8-88 10.33m ± 0% 10.30m ± 0% -0.25% (p=0.000 n=15)
Scan/10/Base10-88 100.20n ± 0% 98.53n ± 0% -1.67% (p=0.000 n=15)
Scan/100/Base10-88 596.9n ± 0% 543.3n ± 0% -8.98% (p=0.000 n=15)
Scan/1000/Base10-88 5.904µ ± 0% 5.485µ ± 0% -7.10% (p=0.000 n=15)
Scan/10000/Base10-88 145.7µ ± 0% 142.0µ ± 0% -2.55% (p=0.000 n=15)
Scan/100000/Base10-88 10.26m ± 0% 10.24m ± 0% -0.18% (p=0.000 n=15)
Scan/10/Base16-88 90.33n ± 0% 87.60n ± 0% -3.02% (p=0.000 n=15)
Scan/100/Base16-88 506.4n ± 0% 437.7n ± 0% -13.57% (p=0.000 n=15)
Scan/1000/Base16-88 5.056µ ± 0% 4.007µ ± 0% -20.75% (p=0.000 n=15)
Scan/10000/Base16-88 163.35µ ± 0% 65.37µ ± 0% -59.98% (p=0.000 n=15)
Scan/100000/Base16-88 11027.2µ ± 0% 735.1µ ± 0% -93.33% (p=0.000 n=15)
geomean 17.13µ 11.74µ -31.46%
goos: linux
goarch: arm64
pkg: math/big
│ old │ new │
│ sec/op │ sec/op vs base │
Scan/10/Base2-16 324.7n ± 0% 348.4n ± 0% +7.30% (p=0.000 n=15)
Scan/100/Base2-16 2.604µ ± 0% 3.031µ ± 0% +16.40% (p=0.000 n=15)
Scan/1000/Base2-16 26.15µ ± 0% 29.94µ ± 0% +14.52% (p=0.000 n=15)
Scan/10000/Base2-16 334.3µ ± 0% 298.8µ ± 0% -10.64% (p=0.000 n=15)
Scan/100000/Base2-16 10.664m ± 0% 2.991m ± 0% -71.95% (p=0.000 n=15)
Scan/10/Base8-16 144.4n ± 1% 162.2n ± 1% +12.33% (p=0.000 n=15)
Scan/100/Base8-16 917.2n ± 0% 1084.0n ± 0% +18.19% (p=0.000 n=15)
Scan/1000/Base8-16 9.367µ ± 0% 10.901µ ± 0% +16.38% (p=0.000 n=15)
Scan/10000/Base8-16 164.2µ ± 0% 181.2µ ± 0% +10.34% (p=0.000 n=15)
Scan/100000/Base8-16 8.871m ± 1% 9.140m ± 0% +3.04% (p=0.000 n=15)
Scan/10/Base10-16 134.6n ± 1% 148.3n ± 1% +10.18% (p=0.000 n=15)
Scan/100/Base10-16 837.1n ± 0% 986.6n ± 0% +17.86% (p=0.000 n=15)
Scan/1000/Base10-16 8.563µ ± 0% 9.936µ ± 0% +16.03% (p=0.000 n=15)
Scan/10000/Base10-16 156.5µ ± 1% 171.3µ ± 0% +9.41% (p=0.000 n=15)
Scan/100000/Base10-16 8.863m ± 1% 9.011m ± 0% +1.66% (p=0.000 n=15)
Scan/10/Base16-16 115.7n ± 2% 129.1n ± 1% +11.58% (p=0.000 n=15)
Scan/100/Base16-16 708.6n ± 0% 796.8n ± 0% +12.45% (p=0.000 n=15)
Scan/1000/Base16-16 7.314µ ± 0% 7.554µ ± 0% +3.28% (p=0.000 n=15)
Scan/10000/Base16-16 149.05µ ± 0% 74.60µ ± 0% -49.95% (p=0.000 n=15)
Scan/100000/Base16-16 9091.6µ ± 0% 741.5µ ± 0% -91.84% (p=0.000 n=15)
geomean 20.39µ 17.65µ -13.44%
goos: darwin
goarch: arm64
pkg: math/big
cpu: Apple M3 Pro
│ old │ new │
│ sec/op │ sec/op vs base │
Scan/10/Base2-12 193.8n ± 2% 157.3n ± 1% -18.83% (p=0.000 n=15)
Scan/100/Base2-12 1.445µ ± 2% 1.362µ ± 1% -5.74% (p=0.000 n=15)
Scan/1000/Base2-12 14.28µ ± 0% 13.51µ ± 0% -5.42% (p=0.000 n=15)
Scan/10000/Base2-12 177.1µ ± 0% 134.6µ ± 0% -24.04% (p=0.000 n=15)
Scan/100000/Base2-12 5.429m ± 1% 1.333m ± 0% -75.45% (p=0.000 n=15)
Scan/10/Base8-12 75.52n ± 2% 76.09n ± 1% ~ (p=0.010 n=15)
Scan/100/Base8-12 528.4n ± 1% 532.1n ± 1% ~ (p=0.003 n=15)
Scan/1000/Base8-12 5.423µ ± 1% 5.427µ ± 0% ~ (p=0.183 n=15)
Scan/10000/Base8-12 89.26µ ± 1% 89.37µ ± 0% ~ (p=0.237 n=15)
Scan/100000/Base8-12 4.543m ± 2% 4.560m ± 1% ~ (p=0.595 n=15)
Scan/10/Base10-12 69.87n ± 1% 70.51n ± 0% ~ (p=0.002 n=15)
Scan/100/Base10-12 488.4n ± 1% 491.2n ± 0% ~ (p=0.060 n=15)
Scan/1000/Base10-12 5.014µ ± 1% 5.008µ ± 0% ~ (p=0.783 n=15)
Scan/10000/Base10-12 84.90µ ± 0% 85.10µ ± 0% ~ (p=0.109 n=15)
Scan/100000/Base10-12 4.516m ± 1% 4.521m ± 1% ~ (p=0.713 n=15)
Scan/10/Base16-12 59.21n ± 1% 57.70n ± 1% -2.55% (p=0.000 n=15)
Scan/100/Base16-12 380.0n ± 1% 360.7n ± 1% -5.08% (p=0.000 n=15)
Scan/1000/Base16-12 3.775µ ± 0% 3.421µ ± 0% -9.38% (p=0.000 n=15)
Scan/10000/Base16-12 80.62µ ± 0% 34.44µ ± 1% -57.28% (p=0.000 n=15)
Scan/100000/Base16-12 4826.4µ ± 2% 450.9µ ± 2% -90.66% (p=0.000 n=15)
geomean 11.05µ 8.448µ -23.52%
Change-Id: Ifdb2049545f34072aa75cdbb72bed4cf465f0ad7
Reviewed-on: https://go-review.googlesource.com/c/go/+/650640
Auto-Submit: Russ Cox <rsc@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Robert Griesemer <gri@google.com>
|
|
Add a few more test cases for scanning (integer conversion),
which were helpful in debugging some upcoming changes.
BenchmarkScan currently times converting the value 10**N
represented in base B back into []Word form.
When B = 10, the text is 1 followed by many zeros, which
could hit a "multiply by zero" special case when processing
many digit chunks, misrepresenting the actual time required
depending on whether that case is optimized.
Change the benchmark to use 9**N, which is about as big and
will not cause runs of zeros in any of the tested bases.
The benchmark comparison below does not show faster code,
since the code is not changing at all here. Instead,
it shows that the new benchmark does roughly the same
amount of work as the old one.
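The difference between the two inputs can be seen with a small sketch. It is not the benchmark code from this CL; the helper name longestZeroRun is made up for illustration. It measures the longest run of '0' digits in the text of base**n:

```go
package main

import (
	"fmt"
	"math/big"
)

// longestZeroRun returns the length of the longest run of '0' digits
// in the base-b text of base**n. (Hypothetical helper, not from math/big.)
func longestZeroRun(base, n int64, b int) int {
	x := new(big.Int).Exp(big.NewInt(base), big.NewInt(n), nil)
	longest, cur := 0, 0
	for _, c := range x.Text(b) {
		if c == '0' {
			cur++
			if cur > longest {
				longest = cur
			}
		} else {
			cur = 0
		}
	}
	return longest
}

func main() {
	for _, b := range []int{2, 8, 10, 16} {
		fmt.Printf("base %2d: 10**1000 zero run %4d, 9**1000 zero run %d\n",
			b, longestZeroRun(10, 1000, b), longestZeroRun(9, 1000, b))
	}
}
```

In base 10 the text of 10**1000 is a 1 followed by 1000 zeros, so the whole input is one zero run, while 9**1000 should show only short runs.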
goos: darwin
goarch: arm64
pkg: math/big
cpu: Apple M3 Pro
│ old │ new │
│ sec/op │ sec/op vs base │
ScanPi-12 43.35µ ± 1% 43.59µ ± 1% ~ (p=0.069 n=15)
Scan/10/Base2-12 202.3n ± 2% 193.7n ± 1% -4.25% (p=0.000 n=15)
Scan/100/Base2-12 1.512µ ± 3% 1.447µ ± 1% -4.30% (p=0.000 n=15)
Scan/1000/Base2-12 15.06µ ± 2% 14.33µ ± 0% -4.83% (p=0.000 n=15)
Scan/10000/Base2-12 188.0µ ± 5% 177.3µ ± 1% -5.65% (p=0.000 n=15)
Scan/100000/Base2-12 5.814m ± 3% 5.382m ± 1% -7.43% (p=0.000 n=15)
Scan/10/Base8-12 78.57n ± 2% 75.02n ± 1% -4.52% (p=0.000 n=15)
Scan/100/Base8-12 548.2n ± 2% 526.8n ± 1% -3.90% (p=0.000 n=15)
Scan/1000/Base8-12 5.674µ ± 2% 5.421µ ± 0% -4.46% (p=0.000 n=15)
Scan/10000/Base8-12 94.42µ ± 1% 88.61µ ± 1% -6.15% (p=0.000 n=15)
Scan/100000/Base8-12 4.906m ± 2% 4.498m ± 3% -8.31% (p=0.000 n=15)
Scan/10/Base10-12 73.42n ± 1% 69.56n ± 0% -5.26% (p=0.000 n=15)
Scan/100/Base10-12 511.9n ± 1% 488.2n ± 0% -4.63% (p=0.000 n=15)
Scan/1000/Base10-12 5.254µ ± 2% 5.009µ ± 0% -4.66% (p=0.000 n=15)
Scan/10000/Base10-12 90.22µ ± 2% 84.52µ ± 0% -6.32% (p=0.000 n=15)
Scan/100000/Base10-12 4.842m ± 3% 4.471m ± 3% -7.65% (p=0.000 n=15)
Scan/10/Base16-12 62.28n ± 1% 58.70n ± 1% -5.75% (p=0.000 n=15)
Scan/100/Base16-12 398.6n ± 0% 377.9n ± 1% -5.19% (p=0.000 n=15)
Scan/1000/Base16-12 4.108µ ± 1% 3.782µ ± 0% -7.94% (p=0.000 n=15)
Scan/10000/Base16-12 83.78µ ± 2% 80.51µ ± 1% -3.90% (p=0.000 n=15)
Scan/100000/Base16-12 5.080m ± 3% 4.698m ± 3% -7.53% (p=0.000 n=15)
geomean 12.41µ 11.74µ -5.36%
Change-Id: If3ce290ecc7f38672f11b42fd811afb53dee665d
Reviewed-on: https://go-review.googlesource.com/c/go/+/650639
Reviewed-by: Alan Donovan <adonovan@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Russ Cox <rsc@golang.org>
|
|
In the early days of math/big, algorithms that needed more space
grew the result larger than it needed to be and then used the
high words as extra space. This made results their own temporary
space caches, at the cost that saving a result in a data structure
might hold significantly more memory than necessary.
Specifically, new(big.Int).Mul(x, y) returned a big.Int with a
backing slice 3X as big as it strictly needed to be.
If you are storing many multiplication results, or even a single
large result, the 3X overhead can add up.
This approach to storage for temporaries also requires being able
to analyze the algorithms to predict the exact amount they need,
which can be difficult.
For both these reasons, the implementation of recursive long division,
which came later, introduced a “nat pool” where temporaries could be
stored and reused, or reclaimed by the GC when no longer used.
This avoids the storage and bookkeeping overheads but introduces a
per-temporary sync.Pool overhead. divRecursiveStep takes an array
of cached temporaries to remove some of that overhead.
The nat pool was better but is still not quite right.
This CL introduces something even better than the nat pool
(still probably not quite right, but the best I can see for now):
a sync.Pool holding stacks for allocating temporaries.
Now an operation can get one stack out of the pool and then
allocate as many temporaries as it needs during the operation,
eventually returning the stack back to the pool. The sync.Pool
operations are now per-exported-operation (like big.Int.Mul),
not per-temporary.
This CL converts both the pre-allocation in nat.mul and the
uses of the nat pool to use stack pools instead. This simplifies
some code and sets us up better for more complex algorithms
(such as Toom-Cook or FFT-based multiplication) that need
more temporaries. It is also a little bit faster.
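A minimal sketch of the idea, assuming a simple bump allocator. The names stack, getStack, alloc, free, and SquareSum are hypothetical and are not claimed to match the code in this CL; uint stands in for Word:

```go
package main

import (
	"fmt"
	"sync"
)

// stack is a hypothetical bump allocator for word-slice temporaries.
// The name and layout are assumptions made for illustration only.
type stack struct {
	buf  []uint // backing storage shared by all temporaries
	used int    // number of words of buf handed out so far
}

// stackPool holds idle stacks. An operation takes one stack for its
// whole duration instead of visiting the pool once per temporary.
var stackPool = sync.Pool{New: func() any { return new(stack) }}

// getStack fetches a stack from the pool (one sync.Pool operation).
func getStack() *stack { return stackPool.Get().(*stack) }

// free forgets all temporaries and returns the stack to the pool.
func (s *stack) free() {
	s.used = 0
	stackPool.Put(s)
}

// alloc returns a zeroed n-word temporary carved out of the stack.
func (s *stack) alloc(n int) []uint {
	if s.used+n > len(s.buf) {
		// Out of room: switch to a larger buffer. Temporaries handed
		// out earlier keep referencing the old buffer, which stays
		// valid until the garbage collector reclaims it.
		s.buf = make([]uint, 2*(s.used+n))
	}
	t := s.buf[s.used : s.used+n : s.used+n]
	for i := range t {
		t[i] = 0
	}
	s.used += n
	return t
}

// SquareSum is a toy "exported operation" that needs two temporaries
// but touches the pool only twice: once to get the stack, once to free it.
func SquareSum(x []uint) uint {
	s := getStack()
	defer s.free()
	squares := s.alloc(len(x))
	sums := s.alloc(len(x))
	var total uint
	for i, v := range x {
		squares[i] = v * v
		total += squares[i]
		sums[i] = total
	}
	return total
}

func main() {
	fmt.Println(SquareSum([]uint{1, 2, 3})) // prints 14
}
```

The point of the design is that the result slice is allocated separately at exactly the size it needs, while all scratch space lives in the pooled stack.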
goos: linux
goarch: amd64
pkg: math/big
cpu: Intel(R) Xeon(R) CPU @ 3.10GHz
│ old │ new │
│ sec/op │ sec/op vs base │
Div/20/10-16 23.68n ± 0% 22.21n ± 0% -6.21% (p=0.000 n=15)
Div/40/20-16 23.68n ± 0% 22.21n ± 0% -6.21% (p=0.000 n=15)
Div/100/50-16 56.65n ± 0% 55.53n ± 0% -1.98% (p=0.000 n=15)
Div/200/100-16 194.6n ± 1% 172.8n ± 0% -11.20% (p=0.000 n=15)
Div/400/200-16 232.1n ± 0% 206.7n ± 0% -10.94% (p=0.000 n=15)
Div/1000/500-16 405.3n ± 1% 383.8n ± 0% -5.30% (p=0.000 n=15)
Div/2000/1000-16 810.4n ± 1% 795.2n ± 0% -1.88% (p=0.000 n=15)
Div/20000/10000-16 25.88µ ± 0% 25.39µ ± 0% -1.89% (p=0.000 n=15)
Div/200000/100000-16 931.5µ ± 0% 924.3µ ± 0% -0.77% (p=0.000 n=15)
Div/2000000/1000000-16 37.77m ± 0% 37.75m ± 0% ~ (p=0.098 n=15)
Div/20000000/10000000-16 1.367 ± 0% 1.377 ± 0% +0.72% (p=0.003 n=15)
NatMul/10-16 168.5n ± 3% 164.0n ± 4% ~ (p=0.751 n=15)
NatMul/100-16 6.086µ ± 3% 5.380µ ± 3% -11.60% (p=0.000 n=15)
NatMul/1000-16 238.1µ ± 3% 228.3µ ± 1% -4.12% (p=0.000 n=15)
NatMul/10000-16 8.721m ± 2% 8.518m ± 1% -2.33% (p=0.000 n=15)
NatMul/100000-16 369.6m ± 0% 371.1m ± 0% +0.42% (p=0.000 n=15)
geomean 19.57µ 18.74µ -4.21%
│ old │ new │
│ B/op │ B/op vs base │
NatMul/10-16 192.0 ± 0% 192.0 ± 0% ~ (p=1.000 n=15) ¹
NatMul/100-16 4.750Ki ± 0% 1.751Ki ± 0% -63.14% (p=0.000 n=15)
NatMul/1000-16 48.16Ki ± 0% 16.02Ki ± 0% -66.73% (p=0.000 n=15)
NatMul/10000-16 482.9Ki ± 1% 165.4Ki ± 3% -65.75% (p=0.000 n=15)
NatMul/100000-16 5.747Mi ± 7% 4.197Mi ± 0% -26.97% (p=0.000 n=15)
geomean 41.42Ki 20.63Ki -50.18%
¹ all samples are equal
│ old │ new │
│ allocs/op │ allocs/op vs base │
NatMul/10-16 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹
NatMul/100-16 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹
NatMul/1000-16 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹
NatMul/10000-16 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹
NatMul/100000-16 7.000 ± 14% 7.000 ± 14% ~ (p=0.668 n=15)
geomean 1.476 1.476 +0.00%
¹ all samples are equal
goos: linux
goarch: amd64
pkg: math/big
cpu: Intel(R) Xeon(R) Platinum 8481C CPU @ 2.70GHz
│ old │ new │
│ sec/op │ sec/op vs base │
Div/20/10-88 15.84n ± 1% 13.12n ± 0% -17.17% (p=0.000 n=15)
Div/40/20-88 15.88n ± 1% 13.12n ± 0% -17.38% (p=0.000 n=15)
Div/100/50-88 26.42n ± 0% 25.47n ± 0% -3.60% (p=0.000 n=15)
Div/200/100-88 132.4n ± 0% 114.9n ± 0% -13.22% (p=0.000 n=15)
Div/400/200-88 150.1n ± 0% 135.6n ± 0% -9.66% (p=0.000 n=15)
Div/1000/500-88 275.5n ± 0% 264.1n ± 0% -4.14% (p=0.000 n=15)
Div/2000/1000-88 586.5n ± 0% 581.1n ± 0% -0.92% (p=0.000 n=15)
Div/20000/10000-88 25.87µ ± 0% 25.72µ ± 0% -0.59% (p=0.000 n=15)
Div/200000/100000-88 772.2µ ± 0% 779.0µ ± 0% +0.88% (p=0.000 n=15)
Div/2000000/1000000-88 33.36m ± 0% 33.63m ± 0% +0.80% (p=0.000 n=15)
Div/20000000/10000000-88 1.307 ± 0% 1.320 ± 0% +1.03% (p=0.000 n=15)
NatMul/10-88 140.4n ± 0% 148.8n ± 4% +5.98% (p=0.000 n=15)
NatMul/100-88 4.663µ ± 1% 4.388µ ± 1% -5.90% (p=0.000 n=15)
NatMul/1000-88 207.7µ ± 0% 205.8µ ± 0% -0.89% (p=0.000 n=15)
NatMul/10000-88 8.456m ± 0% 8.468m ± 0% +0.14% (p=0.021 n=15)
NatMul/100000-88 295.1m ± 0% 297.9m ± 0% +0.94% (p=0.000 n=15)
geomean 14.96µ 14.33µ -4.23%
│ old │ new │
│ B/op │ B/op vs base │
NatMul/10-88 192.0 ± 0% 192.0 ± 0% ~ (p=1.000 n=15) ¹
NatMul/100-88 4.750Ki ± 0% 1.758Ki ± 0% -62.99% (p=0.000 n=15)
NatMul/1000-88 48.44Ki ± 0% 16.08Ki ± 0% -66.80% (p=0.000 n=15)
NatMul/10000-88 489.7Ki ± 1% 166.1Ki ± 3% -66.08% (p=0.000 n=15)
NatMul/100000-88 5.546Mi ± 0% 3.819Mi ± 60% -31.15% (p=0.000 n=15)
geomean 41.29Ki 20.30Ki -50.85%
¹ all samples are equal
│ old │ new │
│ allocs/op │ allocs/op vs base │
NatMul/10-88 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹
NatMul/100-88 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹
NatMul/1000-88 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹
NatMul/10000-88 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹
NatMul/100000-88 5.000 ± 20% 6.000 ± 67% ~ (p=0.672 n=15)
geomean 1.380 1.431 +3.71%
¹ all samples are equal
goos: linux
goarch: arm64
pkg: math/big
│ old │ new │
│ sec/op │ sec/op vs base │
Div/20/10-16 15.85n ± 0% 15.23n ± 0% -3.91% (p=0.000 n=15)
Div/40/20-16 15.88n ± 0% 15.22n ± 0% -4.16% (p=0.000 n=15)
Div/100/50-16 29.69n ± 0% 26.39n ± 0% -11.11% (p=0.000 n=15)
Div/200/100-16 149.2n ± 0% 123.3n ± 0% -17.36% (p=0.000 n=15)
Div/400/200-16 160.3n ± 0% 139.2n ± 0% -13.16% (p=0.000 n=15)
Div/1000/500-16 271.0n ± 0% 256.1n ± 0% -5.50% (p=0.000 n=15)
Div/2000/1000-16 545.3n ± 0% 527.0n ± 0% -3.36% (p=0.000 n=15)
Div/20000/10000-16 22.60µ ± 0% 22.20µ ± 0% -1.77% (p=0.000 n=15)
Div/200000/100000-16 889.0µ ± 0% 892.2µ ± 0% +0.35% (p=0.000 n=15)
Div/2000000/1000000-16 38.01m ± 0% 38.12m ± 0% +0.30% (p=0.000 n=15)
Div/20000000/10000000-16 1.437 ± 0% 1.444 ± 0% +0.50% (p=0.000 n=15)
NatMul/10-16 166.4n ± 2% 169.5n ± 1% +1.86% (p=0.000 n=15)
NatMul/100-16 5.733µ ± 1% 5.570µ ± 1% -2.84% (p=0.000 n=15)
NatMul/1000-16 232.6µ ± 1% 229.8µ ± 0% -1.22% (p=0.000 n=15)
NatMul/10000-16 9.039m ± 1% 8.969m ± 0% -0.77% (p=0.000 n=15)
NatMul/100000-16 367.0m ± 0% 368.8m ± 0% +0.48% (p=0.000 n=15)
geomean 16.15µ 15.50µ -4.01%
│ old │ new │
│ B/op │ B/op vs base │
NatMul/10-16 192.0 ± 0% 192.0 ± 0% ~ (p=1.000 n=15) ¹
NatMul/100-16 4.750Ki ± 0% 1.751Ki ± 0% -63.14% (p=0.000 n=15)
NatMul/1000-16 48.33Ki ± 0% 16.02Ki ± 0% -66.85% (p=0.000 n=15)
NatMul/10000-16 536.5Ki ± 1% 165.7Ki ± 3% -69.12% (p=0.000 n=15)
NatMul/100000-16 6.078Mi ± 6% 4.197Mi ± 0% -30.94% (p=0.000 n=15)
geomean 42.81Ki 20.64Ki -51.78%
¹ all samples are equal
│ old │ new │
│ allocs/op │ allocs/op vs base │
NatMul/10-16 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹
NatMul/100-16 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹
NatMul/1000-16 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹
NatMul/10000-16 2.000 ± 50% 1.000 ± 0% -50.00% (p=0.001 n=15)
NatMul/100000-16 9.000 ± 11% 8.000 ± 12% -11.11% (p=0.001 n=15)
geomean 1.783 1.516 -14.97%
¹ all samples are equal
goos: darwin
goarch: arm64
pkg: math/big
cpu: Apple M3 Pro
│ old │ new │
│ sec/op │ sec/op vs base │
Div/20/10-12 9.850n ± 1% 9.405n ± 1% -4.52% (p=0.000 n=15)
Div/40/20-12 9.858n ± 0% 9.403n ± 1% -4.62% (p=0.000 n=15)
Div/100/50-12 16.40n ± 1% 14.81n ± 0% -9.70% (p=0.000 n=15)
Div/200/100-12 88.48n ± 2% 80.88n ± 0% -8.59% (p=0.000 n=15)
Div/400/200-12 107.90n ± 1% 99.28n ± 1% -7.99% (p=0.000 n=15)
Div/1000/500-12 188.8n ± 1% 178.6n ± 1% -5.40% (p=0.000 n=15)
Div/2000/1000-12 399.9n ± 0% 389.1n ± 0% -2.70% (p=0.000 n=15)
Div/20000/10000-12 13.94µ ± 2% 13.81µ ± 1% ~ (p=0.574 n=15)
Div/200000/100000-12 523.8µ ± 0% 521.7µ ± 0% -0.40% (p=0.000 n=15)
Div/2000000/1000000-12 21.46m ± 0% 21.48m ± 0% ~ (p=0.067 n=15)
Div/20000000/10000000-12 812.5m ± 0% 812.9m ± 0% ~ (p=0.061 n=15)
NatMul/10-12 77.14n ± 0% 78.35n ± 1% +1.57% (p=0.000 n=15)
NatMul/100-12 2.999µ ± 0% 2.871µ ± 1% -4.27% (p=0.000 n=15)
NatMul/1000-12 126.2µ ± 0% 126.8µ ± 0% +0.51% (p=0.011 n=15)
NatMul/10000-12 5.099m ± 0% 5.125m ± 0% +0.51% (p=0.000 n=15)
NatMul/100000-12 206.7m ± 0% 208.4m ± 0% +0.80% (p=0.000 n=15)
geomean 9.512µ 9.236µ -2.91%
│ old │ new │
│ B/op │ B/op vs base │
NatMul/10-12 192.0 ± 0% 192.0 ± 0% ~ (p=1.000 n=15) ¹
NatMul/100-12 4.750Ki ± 0% 1.750Ki ± 0% -63.16% (p=0.000 n=15)
NatMul/1000-12 48.13Ki ± 0% 16.01Ki ± 0% -66.73% (p=0.000 n=15)
NatMul/10000-12 483.5Ki ± 1% 163.2Ki ± 2% -66.24% (p=0.000 n=15)
NatMul/100000-12 5.480Mi ± 4% 1.532Mi ± 104% -72.05% (p=0.000 n=15)
geomean 41.03Ki 16.82Ki -59.01%
¹ all samples are equal
│ old │ new │
│ allocs/op │ allocs/op vs base │
NatMul/10-12 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹
NatMul/100-12 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹
NatMul/1000-12 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹
NatMul/10000-12 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹
NatMul/100000-12 5.000 ± 0% 1.000 ± 400% -80.00% (p=0.007 n=15)
geomean 1.380 1.000 -27.52%
¹ all samples are equal
Change-Id: I7efa6fe37971ed26ae120a32250fcb47ece0a011
Reviewed-on: https://go-review.googlesource.com/c/go/+/650638
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Russ Cox <rsc@golang.org>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Reviewed-by: Alan Donovan <adonovan@google.com>
|
|
Change-Id: I112f55c0e3ee3b75e615a06b27552de164565c04
Reviewed-on: https://go-review.googlesource.com/c/go/+/650637
Reviewed-by: Robert Griesemer <gri@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Auto-Submit: Russ Cox <rsc@golang.org>
|
|
The GCD code was setting one *Int to the value of another
by smashing one struct on top of the other, instead of using Set.
That was safe in this one case, but it's not idiomatic in math/big
nor safe in general, so rewrite the code not to do that.
(In one case, by swapping variables around; in another, by calling Set.)
The added Set call does slow down GCDs by a small amount,
since the answer has to be copied out. To compensate for that,
optimize a bit: remove the s, t temporaries entirely and handle
vector x word multiplication directly. The net result is that almost
all GCDs are faster, except for small ones, which are a few
nanoseconds slower.
goos: darwin
goarch: arm64
pkg: math/big
cpu: Apple M3 Pro
│ bench.before │ bench.after │
│ sec/op │ sec/op vs base │
GCD10x10/WithoutXY-12 23.80n ± 1% 31.71n ± 1% +33.24% (p=0.000 n=10)
GCD10x10/WithXY-12 100.40n ± 0% 92.14n ± 1% -8.22% (p=0.000 n=10)
GCD10x100/WithoutXY-12 63.70n ± 0% 70.73n ± 0% +11.05% (p=0.000 n=10)
GCD10x100/WithXY-12 278.6n ± 0% 233.1n ± 1% -16.35% (p=0.000 n=10)
GCD10x1000/WithoutXY-12 153.4n ± 0% 162.2n ± 1% +5.74% (p=0.000 n=10)
GCD10x1000/WithXY-12 456.0n ± 0% 411.8n ± 1% -9.69% (p=0.000 n=10)
GCD10x10000/WithoutXY-12 1.002µ ± 1% 1.036µ ± 0% +3.39% (p=0.000 n=10)
GCD10x10000/WithXY-12 2.330µ ± 1% 2.210µ ± 0% -5.13% (p=0.000 n=10)
GCD10x100000/WithoutXY-12 8.894µ ± 0% 8.889µ ± 1% ~ (p=0.754 n=10)
GCD10x100000/WithXY-12 20.84µ ± 0% 20.24µ ± 0% -2.84% (p=0.000 n=10)
GCD100x100/WithoutXY-12 373.3n ± 3% 314.4n ± 0% -15.76% (p=0.000 n=10)
GCD100x100/WithXY-12 662.5n ± 0% 572.4n ± 1% -13.59% (p=0.000 n=10)
GCD100x1000/WithoutXY-12 641.8n ± 0% 598.1n ± 1% -6.81% (p=0.000 n=10)
GCD100x1000/WithXY-12 1.123µ ± 0% 1.019µ ± 1% -9.26% (p=0.000 n=10)
GCD100x10000/WithoutXY-12 2.870µ ± 0% 2.831µ ± 0% -1.38% (p=0.000 n=10)
GCD100x10000/WithXY-12 4.930µ ± 1% 4.675µ ± 0% -5.16% (p=0.000 n=10)
GCD100x100000/WithoutXY-12 24.08µ ± 0% 23.97µ ± 0% -0.48% (p=0.007 n=10)
GCD100x100000/WithXY-12 43.66µ ± 0% 42.52µ ± 0% -2.61% (p=0.001 n=10)
GCD1000x1000/WithoutXY-12 3.999µ ± 0% 3.569µ ± 1% -10.75% (p=0.000 n=10)
GCD1000x1000/WithXY-12 6.397µ ± 0% 5.534µ ± 0% -13.49% (p=0.000 n=10)
GCD1000x10000/WithoutXY-12 6.875µ ± 0% 6.450µ ± 0% -6.18% (p=0.000 n=10)
GCD1000x10000/WithXY-12 20.75µ ± 1% 19.17µ ± 1% -7.64% (p=0.000 n=10)
GCD1000x100000/WithoutXY-12 36.38µ ± 0% 35.60µ ± 1% -2.13% (p=0.000 n=10)
GCD1000x100000/WithXY-12 172.1µ ± 0% 174.4µ ± 3% ~ (p=0.052 n=10)
GCD10000x10000/WithoutXY-12 79.89µ ± 1% 75.16µ ± 2% -5.92% (p=0.000 n=10)
GCD10000x10000/WithXY-12 160.1µ ± 0% 150.0µ ± 0% -6.33% (p=0.000 n=10)
GCD10000x100000/WithoutXY-12 213.2µ ± 1% 209.0µ ± 1% -1.98% (p=0.000 n=10)
GCD10000x100000/WithXY-12 1.399m ± 0% 1.342m ± 3% -4.08% (p=0.002 n=10)
GCD100000x100000/WithoutXY-12 5.463m ± 1% 5.504m ± 2% ~ (p=0.190 n=10)
GCD100000x100000/WithXY-12 11.36m ± 0% 11.46m ± 1% +0.86% (p=0.000 n=10)
geomean 6.953µ 6.695µ -3.71%
goos: linux
goarch: amd64
pkg: math/big
cpu: AMD Ryzen 9 7950X 16-Core Processor
│ bench.before │ bench.after │
│ sec/op │ sec/op vs base │
GCD10x10/WithoutXY-32 39.66n ± 4% 44.34n ± 4% +11.77% (p=0.000 n=10)
GCD10x10/WithXY-32 156.7n ± 12% 130.8n ± 2% -16.53% (p=0.000 n=10)
GCD10x100/WithoutXY-32 115.8n ± 5% 120.2n ± 2% +3.89% (p=0.000 n=10)
GCD10x100/WithXY-32 465.3n ± 3% 368.1n ± 2% -20.91% (p=0.000 n=10)
GCD10x1000/WithoutXY-32 201.1n ± 1% 210.8n ± 2% +4.82% (p=0.000 n=10)
GCD10x1000/WithXY-32 652.9n ± 4% 605.0n ± 1% -7.32% (p=0.002 n=10)
GCD10x10000/WithoutXY-32 1.046µ ± 2% 1.143µ ± 1% +9.33% (p=0.000 n=10)
GCD10x10000/WithXY-32 3.360µ ± 1% 3.258µ ± 1% -3.04% (p=0.000 n=10)
GCD10x100000/WithoutXY-32 9.391µ ± 3% 9.997µ ± 1% +6.46% (p=0.000 n=10)
GCD10x100000/WithXY-32 27.92µ ± 1% 28.21µ ± 0% +1.04% (p=0.043 n=10)
GCD100x100/WithoutXY-32 443.7n ± 5% 320.0n ± 2% -27.88% (p=0.000 n=10)
GCD100x100/WithXY-32 789.9n ± 2% 690.4n ± 1% -12.60% (p=0.000 n=10)
GCD100x1000/WithoutXY-32 718.4n ± 3% 600.0n ± 1% -16.48% (p=0.000 n=10)
GCD100x1000/WithXY-32 1.388µ ± 4% 1.175µ ± 1% -15.28% (p=0.000 n=10)
GCD100x10000/WithoutXY-32 2.750µ ± 1% 2.668µ ± 1% -2.96% (p=0.000 n=10)
GCD100x10000/WithXY-32 6.016µ ± 1% 5.590µ ± 1% -7.09% (p=0.000 n=10)
GCD100x100000/WithoutXY-32 21.40µ ± 1% 22.30µ ± 1% +4.21% (p=0.000 n=10)
GCD100x100000/WithXY-32 47.02µ ± 4% 48.80µ ± 0% +3.78% (p=0.015 n=10)
GCD1000x1000/WithoutXY-32 3.417µ ± 4% 3.020µ ± 1% -11.65% (p=0.000 n=10)
GCD1000x1000/WithXY-32 5.752µ ± 0% 5.418µ ± 2% -5.81% (p=0.000 n=10)
GCD1000x10000/WithoutXY-32 6.150µ ± 0% 6.246µ ± 1% +1.55% (p=0.000 n=10)
GCD1000x10000/WithXY-32 24.68µ ± 3% 25.07µ ± 1% ~ (p=0.051 n=10)
GCD1000x100000/WithoutXY-32 34.60µ ± 2% 36.85µ ± 1% +6.51% (p=0.000 n=10)
GCD1000x100000/WithXY-32 209.5µ ± 4% 227.4µ ± 0% +8.56% (p=0.000 n=10)
GCD10000x10000/WithoutXY-32 90.69µ ± 0% 88.48µ ± 0% -2.44% (p=0.000 n=10)
GCD10000x10000/WithXY-32 197.1µ ± 0% 200.5µ ± 0% +1.73% (p=0.000 n=10)
GCD10000x100000/WithoutXY-32 239.1µ ± 0% 242.5µ ± 0% +1.42% (p=0.000 n=10)
GCD10000x100000/WithXY-32 1.963m ± 3% 2.028m ± 0% +3.28% (p=0.000 n=10)
GCD100000x100000/WithoutXY-32 7.466m ± 0% 7.412m ± 0% -0.71% (p=0.000 n=10)
GCD100000x100000/WithXY-32 16.10m ± 2% 16.47m ± 0% +2.25% (p=0.000 n=10)
geomean 8.388µ 8.127µ -3.12%
Change-Id: I161dc409bad11bcc553bc8116449905ae5b06742
Reviewed-on: https://go-review.googlesource.com/c/go/+/650636
Reviewed-by: Robert Griesemer <gri@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Alan Donovan <adonovan@google.com>
Auto-Submit: Russ Cox <rsc@golang.org>
|
|
Change-Id: I65721039dab311762e55c6a60dd75b82f6b4622f
Reviewed-on: https://go-review.googlesource.com/c/go/+/642335
Reviewed-by: Ian Lance Taylor <iant@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Robert Griesemer <gri@google.com>
Auto-Submit: Ian Lance Taylor <iant@google.com>
|
|
If Be and Le stand for big-endian and little-endian,
then they should be BE and LE.
Change-Id: I723e3962b8918da84791783d3c547638f1c9e8a9
Reviewed-on: https://go-review.googlesource.com/c/go/+/627376
Reviewed-by: Robert Griesemer <gri@google.com>
Auto-Submit: Russ Cox <rsc@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
Change-Id: Ie7649060db25f1573eeaadd534a600bb24d30572
GitHub-Last-Rev: c617848a4ec9f5c21820982efc95e0ec4ca2510c
GitHub-Pull-Request: golang/go#70134
Reviewed-on: https://go-review.googlesource.com/c/go/+/623757
Reviewed-by: Carlos Amedee <carlos@golang.org>
Reviewed-by: Robert Griesemer <gri@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Robert Griesemer <gri@google.com>
|
|
Follow-up on CL 467555.
Change-Id: I1815b5def656ae4b86c31385ad0737f0465fa2d6
Reviewed-on: https://go-review.googlesource.com/c/go/+/613535
Auto-Submit: Robert Griesemer <gri@google.com>
TryBot-Bypass: Robert Griesemer <gri@google.com>
Reviewed-by: Robert Griesemer <gri@google.com>
Reviewed-by: Tim King <taking@google.com>
|
|
Rather than conditionally assigning ujn, initialise ujn above the
loop to invent the leading 0 for u, then unconditionally load ujn
at the bottom of the loop. This code operates on the basis that
n >= 2, hence j+n-1 is always greater than zero.
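The loop restructuring can be sketched in isolation. This is a simplified illustration, not the division code itself: the function names are hypothetical, u is assumed to have m+n words so that u[m+n] is the invented leading zero, and each function only records the value ujn takes in every iteration:

```go
package main

import "fmt"

// topWordsConditional mirrors the old shape: ujn is assigned inside
// the loop, guarded by a bounds check for the missing leading word.
func topWordsConditional(u []uint, n int) []uint {
	m := len(u) - n
	var out []uint
	for j := m; j >= 0; j-- {
		var ujn uint
		if j+n < len(u) {
			ujn = u[j+n]
		}
		out = append(out, ujn)
	}
	return out
}

// topWordsHoisted mirrors the new shape: ujn starts as the invented
// leading 0 and is loaded unconditionally at the bottom of the loop.
// With n >= 2, the index j+n-1 is always in range and greater than zero.
func topWordsHoisted(u []uint, n int) []uint {
	m := len(u) - n
	var out []uint
	ujn := uint(0)
	for j := m; j >= 0; j-- {
		out = append(out, ujn)
		ujn = u[j+n-1]
	}
	return out
}

func main() {
	u := []uint{1, 2, 3, 4, 5}
	fmt.Println(topWordsConditional(u, 2)) // prints [0 5 4 3]
	fmt.Println(topWordsHoisted(u, 2))     // prints [0 5 4 3]
}
```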
Change-Id: I1272ef30c787ed8707ae8421af2adcccc776d389
Reviewed-on: https://go-review.googlesource.com/c/go/+/467555
Auto-Submit: Robert Griesemer <gri@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Commit-Queue: Robert Griesemer <gri@google.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Reviewed-by: Robert Griesemer <gri@google.com>
|
|
Change-Id: I535a7aaaf3f9e8a9c0e0c04f8f745ad7445a32f7
Reviewed-on: https://go-review.googlesource.com/c/go/+/611678
Run-TryBot: shuang cui <imcusg@gmail.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Auto-Submit: Dmitri Shuralyov <dmitshur@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Robert Griesemer <gri@google.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
|
|
All changes are related to the code, except for the comments in src/regexp/syntax/parse.go and src/slices/slices.go.
Change-Id: I73c5d3c54099749b62210aa7f3182c5eb84bb6a6
GitHub-Last-Rev: 794aa9b0539811d00e1cd42be1e8d9fe9afe0281
GitHub-Pull-Request: golang/go#69170
Reviewed-on: https://go-review.googlesource.com/c/go/+/609678
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Auto-Submit: Ian Lance Taylor <iant@google.com>
|
|
This provides an assembly implementation of addMulVVW for riscv64,
processing up to four words per loop, resulting in a significant
performance gain.
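For readers unfamiliar with the routine, here is a pure Go sketch of what addMulVVW computes, namely z += x*y with the final carry returned. It is a reference-style illustration written with math/bits, not the riscv64 assembly from this CL, and it uses uint in place of Word:

```go
package main

import (
	"fmt"
	"math/bits"
)

// addMulVVW sets z[i] to the low word of z[i] + x[i]*y + carry for
// each i and returns the final carry. It assumes len(x) >= len(z).
func addMulVVW(z, x []uint, y uint) (c uint) {
	for i := range z {
		hi, lo := bits.Mul(x[i], y)
		var cc uint
		lo, cc = bits.Add(lo, z[i], 0) // add the existing word
		hi += cc
		lo, cc = bits.Add(lo, c, 0) // add the carry from the previous word
		hi += cc
		z[i] = lo
		c = hi // hi cannot overflow: x*y + z + c fits in two words
	}
	return c
}

func main() {
	z := []uint{1, 2}
	c := addMulVVW(z, []uint{3, 4}, 5)
	fmt.Println(z, c) // prints [16 22] 0
}
```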
On a StarFive VisionFive 2:
│ addmulvvw.1 │ addmulvvw.2 │
│ sec/op │ sec/op vs base │
AddMulVVW/1-4 65.49n ± 0% 50.79n ± 0% -22.44% (p=0.000 n=10)
AddMulVVW/2-4 82.81n ± 0% 66.83n ± 0% -19.29% (p=0.000 n=10)
AddMulVVW/3-4 100.20n ± 0% 82.87n ± 0% -17.30% (p=0.000 n=10)
AddMulVVW/4-4 117.50n ± 0% 84.20n ± 0% -28.34% (p=0.000 n=10)
AddMulVVW/5-4 134.9n ± 0% 100.3n ± 0% -25.69% (p=0.000 n=10)
AddMulVVW/10-4 221.7n ± 0% 164.4n ± 0% -25.85% (p=0.000 n=10)
AddMulVVW/100-4 1.794µ ± 0% 1.250µ ± 0% -30.32% (p=0.000 n=10)
AddMulVVW/1000-4 17.42µ ± 0% 12.08µ ± 0% -30.68% (p=0.000 n=10)
AddMulVVW/10000-4 254.9µ ± 0% 214.8µ ± 0% -15.75% (p=0.000 n=10)
AddMulVVW/100000-4 2.569m ± 0% 2.178m ± 0% -15.20% (p=0.000 n=10)
geomean 1.443µ 1.107µ -23.29%
│ addmulvvw.1 │ addmulvvw.2 │
│ B/s │ B/s vs base │
AddMulVVW/1-4 932.0Mi ± 0% 1201.6Mi ± 0% +28.93% (p=0.000 n=10)
AddMulVVW/2-4 1.440Gi ± 0% 1.784Gi ± 0% +23.90% (p=0.000 n=10)
AddMulVVW/3-4 1.785Gi ± 0% 2.158Gi ± 0% +20.87% (p=0.000 n=10)
AddMulVVW/4-4 2.029Gi ± 0% 2.832Gi ± 0% +39.59% (p=0.000 n=10)
AddMulVVW/5-4 2.209Gi ± 0% 2.973Gi ± 0% +34.55% (p=0.000 n=10)
AddMulVVW/10-4 2.689Gi ± 0% 3.626Gi ± 0% +34.86% (p=0.000 n=10)
AddMulVVW/100-4 3.323Gi ± 0% 4.770Gi ± 0% +43.54% (p=0.000 n=10)
AddMulVVW/1000-4 3.421Gi ± 0% 4.936Gi ± 0% +44.27% (p=0.000 n=10)
AddMulVVW/10000-4 2.338Gi ± 0% 2.776Gi ± 0% +18.69% (p=0.000 n=10)
AddMulVVW/100000-4 2.320Gi ± 0% 2.736Gi ± 0% +17.93% (p=0.000 n=10)
geomean 2.109Gi 2.749Gi +30.36%
Change-Id: I6c7ee48233c53ff9b6a5a9002675886cd9bff5af
Reviewed-on: https://go-review.googlesource.com/c/go/+/595400
Reviewed-by: Meng Zhuo <mengzhuo1203@gmail.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Reviewed-by: Mark Ryan <markdryan@rivosinc.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
This provides an assembly implementation of mulAddVWW for riscv64,
processing up to four words per loop, resulting in a significant
performance gain.
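As a point of reference, mulAddVWW computes z = x*y + r and returns the carry out of the top word. The following is a pure Go sketch using math/bits with uint standing in for Word. It illustrates the arithmetic only and is not the assembly added by this CL:

```go
package main

import (
	"fmt"
	"math/bits"
)

// mulAddVWW sets z = x*y + r word by word and returns the final carry.
// It assumes len(x) >= len(z).
func mulAddVWW(z, x []uint, y, r uint) (c uint) {
	c = r // the addend enters as the initial carry
	for i := range z {
		hi, lo := bits.Mul(x[i], y)
		var cc uint
		lo, cc = bits.Add(lo, c, 0)
		z[i] = lo
		c = hi + cc // cannot overflow: x*y + c fits in two words
	}
	return c
}

func main() {
	z := make([]uint, 2)
	c := mulAddVWW(z, []uint{3, 4}, 5, 7)
	fmt.Println(z, c) // prints [22 20] 0
}
```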
On a StarFive VisionFive 2:
│ muladdvww.1 │ muladdvww.2 │
│ sec/op │ sec/op vs base │
MulAddVWW/1-4 68.18n ± 0% 65.49n ± 0% -3.95% (p=0.000 n=10)
MulAddVWW/2-4 82.81n ± 0% 78.85n ± 0% -4.78% (p=0.000 n=10)
MulAddVWW/3-4 97.49n ± 0% 72.18n ± 0% -25.96% (p=0.000 n=10)
MulAddVWW/4-4 112.20n ± 0% 85.54n ± 0% -23.76% (p=0.000 n=10)
MulAddVWW/5-4 126.90n ± 0% 98.90n ± 0% -22.06% (p=0.000 n=10)
MulAddVWW/10-4 200.3n ± 0% 144.3n ± 0% -27.96% (p=0.000 n=10)
MulAddVWW/100-4 1532.0n ± 0% 860.0n ± 0% -43.86% (p=0.000 n=10)
MulAddVWW/1000-4 14.757µ ± 0% 8.076µ ± 0% -45.27% (p=0.000 n=10)
MulAddVWW/10000-4 204.0µ ± 0% 137.1µ ± 0% -32.77% (p=0.000 n=10)
MulAddVWW/100000-4 2.066m ± 0% 1.382m ± 0% -33.12% (p=0.000 n=10)
geomean 1.311µ 950.0n -27.51%
│ muladdvww.1 │ muladdvww.2 │
│ B/s │ B/s vs base │
MulAddVWW/1-4 895.1Mi ± 0% 932.0Mi ± 0% +4.11% (p=0.000 n=10)
MulAddVWW/2-4 1.440Gi ± 0% 1.512Gi ± 0% +5.02% (p=0.000 n=10)
MulAddVWW/3-4 1.834Gi ± 0% 2.477Gi ± 0% +35.07% (p=0.000 n=10)
MulAddVWW/4-4 2.125Gi ± 0% 2.787Gi ± 0% +31.15% (p=0.000 n=10)
MulAddVWW/5-4 2.349Gi ± 0% 3.013Gi ± 0% +28.28% (p=0.000 n=10)
MulAddVWW/10-4 2.975Gi ± 0% 4.130Gi ± 0% +38.79% (p=0.000 n=10)
MulAddVWW/100-4 3.891Gi ± 0% 6.930Gi ± 0% +78.11% (p=0.000 n=10)
MulAddVWW/1000-4 4.039Gi ± 0% 7.380Gi ± 0% +82.72% (p=0.000 n=10)
MulAddVWW/10000-4 2.922Gi ± 0% 4.346Gi ± 0% +48.74% (p=0.000 n=10)
MulAddVWW/100000-4 2.884Gi ± 0% 4.313Gi ± 0% +49.52% (p=0.000 n=10)
geomean 2.321Gi 3.202Gi +37.95%
Change-Id: If08191607913ce5c7641f34bae8fa5c9dfb44777
Reviewed-on: https://go-review.googlesource.com/c/go/+/595399
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Reviewed-by: Mark Ryan <markdryan@rivosinc.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Meng Zhuo <mengzhuo1203@gmail.com>
|
|
This provides an assembly implementation of subVW for riscv64,
processing up to four words per loop, resulting in a significant
performance gain.
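In plain Go, the operation behind subVW is a vector-minus-word subtraction with borrow propagation. The sketch below uses math/bits and uint in place of Word; it shows the semantics only and makes no claim about how the riscv64 assembly is structured:

```go
package main

import (
	"fmt"
	"math/bits"
)

// subVW sets z = x - y, where y is a single word, and returns the
// final borrow (1 if the subtraction underflowed, else 0).
// It assumes len(x) >= len(z).
func subVW(z, x []uint, y uint) (c uint) {
	c = y // the subtrahend enters as the initial borrow
	for i := range z {
		z[i], c = bits.Sub(x[i], c, 0)
	}
	return c
}

func main() {
	z := make([]uint, 3)
	c := subVW(z, []uint{0, 0, 5}, 1)
	fmt.Println(z[2], c) // prints 4 0
}
```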
On a StarFive VisionFive 2:
│ subvw.1 │ subvw.2 │
│ sec/op │ sec/op vs base │
SubVW/1-4 57.43n ± 0% 41.45n ± 0% -27.82% (p=0.000 n=10)
SubVW/2-4 69.31n ± 0% 48.15n ± 0% -30.53% (p=0.000 n=10)
SubVW/3-4 76.12n ± 0% 54.87n ± 0% -27.92% (p=0.000 n=10)
SubVW/4-4 85.47n ± 0% 56.14n ± 0% -34.32% (p=0.000 n=10)
SubVW/5-4 96.15n ± 0% 62.83n ± 0% -34.65% (p=0.000 n=10)
SubVW/10-4 149.60n ± 0% 89.55n ± 0% -40.14% (p=0.000 n=10)
SubVW/100-4 1115.0n ± 0% 549.3n ± 0% -50.74% (p=0.000 n=10)
SubVW/1000-4 10.732µ ± 0% 5.071µ ± 0% -52.75% (p=0.000 n=10)
SubVW/10000-4 153.0µ ± 0% 103.7µ ± 0% -32.21% (p=0.000 n=10)
SubVW/100000-4 1.542m ± 0% 1.046m ± 0% -32.13% (p=0.000 n=10)
SubVWext/1-4 57.42n ± 0% 41.45n ± 0% -27.81% (p=0.000 n=10)
SubVWext/2-4 69.33n ± 0% 48.15n ± 0% -30.55% (p=0.000 n=10)
SubVWext/3-4 76.12n ± 0% 54.93n ± 0% -27.84% (p=0.000 n=10)
SubVWext/4-4 85.47n ± 0% 56.14n ± 0% -34.32% (p=0.000 n=10)
SubVWext/5-4 96.15n ± 0% 62.83n ± 0% -34.65% (p=0.000 n=10)
SubVWext/10-4 149.60n ± 0% 89.56n ± 0% -40.14% (p=0.000 n=10)
SubVWext/100-4 1115.0n ± 0% 549.3n ± 0% -50.74% (p=0.000 n=10)
SubVWext/1000-4 10.732µ ± 0% 5.061µ ± 0% -52.84% (p=0.000 n=10)
SubVWext/10000-4 152.5µ ± 0% 103.7µ ± 0% -32.02% (p=0.000 n=10)
SubVWext/100000-4 1.533m ± 0% 1.046m ± 0% -31.75% (p=0.000 n=10)
geomean 1.005µ 633.7n -36.92%
│ subvw.1 │ subvw.2 │
│ B/s │ B/s vs base │
SubVW/1-4 132.9Mi ± 0% 184.1Mi ± 0% +38.54% (p=0.000 n=10)
SubVW/2-4 220.1Mi ± 0% 316.9Mi ± 0% +43.95% (p=0.000 n=10)
SubVW/3-4 300.7Mi ± 0% 417.1Mi ± 0% +38.72% (p=0.000 n=10)
SubVW/4-4 357.1Mi ± 0% 543.6Mi ± 0% +52.24% (p=0.000 n=10)
SubVW/5-4 396.7Mi ± 0% 607.2Mi ± 0% +53.03% (p=0.000 n=10)
SubVW/10-4 510.1Mi ± 0% 851.9Mi ± 0% +67.01% (p=0.000 n=10)
SubVW/100-4 684.2Mi ± 0% 1388.9Mi ± 0% +102.99% (p=0.000 n=10)
SubVW/1000-4 710.9Mi ± 0% 1504.5Mi ± 0% +111.63% (p=0.000 n=10)
SubVW/10000-4 498.7Mi ± 0% 735.7Mi ± 0% +47.52% (p=0.000 n=10)
SubVW/100000-4 494.8Mi ± 0% 729.1Mi ± 0% +47.34% (p=0.000 n=10)
SubVWext/1-4 132.9Mi ± 0% 184.1Mi ± 0% +38.53% (p=0.000 n=10)
SubVWext/2-4 220.1Mi ± 0% 316.9Mi ± 0% +44.00% (p=0.000 n=10)
SubVWext/3-4 300.7Mi ± 0% 416.7Mi ± 0% +38.57% (p=0.000 n=10)
SubVWext/4-4 357.1Mi ± 0% 543.6Mi ± 0% +52.24% (p=0.000 n=10)
SubVWext/5-4 396.7Mi ± 0% 607.2Mi ± 0% +53.04% (p=0.000 n=10)
SubVWext/10-4 510.1Mi ± 0% 851.9Mi ± 0% +67.01% (p=0.000 n=10)
SubVWext/100-4 684.2Mi ± 0% 1388.9Mi ± 0% +102.99% (p=0.000 n=10)
SubVWext/1000-4 710.9Mi ± 0% 1507.6Mi ± 0% +112.07% (p=0.000 n=10)
SubVWext/10000-4 500.1Mi ± 0% 735.7Mi ± 0% +47.10% (p=0.000 n=10)
SubVWext/100000-4 497.8Mi ± 0% 729.4Mi ± 0% +46.52% (p=0.000 n=10)
geomean 387.6Mi 614.5Mi +58.51%
Change-Id: I9d7fac719e977710ad9db9121fa298db6df605de
Reviewed-on: https://go-review.googlesource.com/c/go/+/595398
Reviewed-by: Mark Ryan <markdryan@rivosinc.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Reviewed-by: Meng Zhuo <mengzhuo1203@gmail.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
This provides an assembly implementation of addVW for riscv64,
processing up to four words per loop, resulting in a significant
performance gain.
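As a rough illustration of the "up to four words per loop" structure described above, here is a pure-Go sketch. It is not the riscv64 assembly from this change: the function body, the use of math/bits, and the []uint word type are assumptions made only to show the shape of an unrolled main loop followed by a tail loop.

```go
package main

import (
	"fmt"
	"math/bits"
)

// addVW sets z = x + y, where y is a single word, and returns the carry out.
// Hypothetical pure-Go illustration; z must be at least as long as x.
func addVW(z, x []uint, y uint) (c uint) {
	c = y // the single word enters the chain as the initial "carry"
	i := 0
	// Main loop: four words per iteration.
	for ; i+4 <= len(x); i += 4 {
		z[i], c = bits.Add(x[i], c, 0)
		z[i+1], c = bits.Add(x[i+1], c, 0)
		z[i+2], c = bits.Add(x[i+2], c, 0)
		z[i+3], c = bits.Add(x[i+3], c, 0)
	}
	// Tail loop: the remaining zero to three words.
	for ; i < len(x); i++ {
		z[i], c = bits.Add(x[i], c, 0)
	}
	return c
}

func main() {
	x := []uint{^uint(0), 0}
	z := make([]uint, len(x))
	c := addVW(z, x, 1)
	fmt.Println(z, c) // prints "[0 1] 0"
}
```

Unrolling amortizes the loop-control overhead over several words, which is one plausible reason the gains in the table below grow with vector length until the data no longer fits in cache.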
On a StarFive VisionFive 2:
│ addvw.1 │ addvw.2 │
│ sec/op │ sec/op vs base │
AddVW/1-4 57.43n ± 0% 41.45n ± 0% -27.83% (p=0.000 n=10)
AddVW/2-4 69.31n ± 0% 48.15n ± 0% -30.53% (p=0.000 n=10)
AddVW/3-4 76.12n ± 0% 54.97n ± 0% -27.79% (p=0.000 n=10)
AddVW/4-4 85.47n ± 0% 56.14n ± 0% -34.32% (p=0.000 n=10)
AddVW/5-4 96.16n ± 0% 62.82n ± 0% -34.67% (p=0.000 n=10)
AddVW/10-4 149.60n ± 0% 89.55n ± 0% -40.14% (p=0.000 n=10)
AddVW/100-4 1115.0n ± 0% 549.3n ± 0% -50.74% (p=0.000 n=10)
AddVW/1000-4 10.732µ ± 0% 5.060µ ± 0% -52.85% (p=0.000 n=10)
AddVW/10000-4 151.7µ ± 0% 103.7µ ± 0% -31.63% (p=0.000 n=10)
AddVW/100000-4 1.523m ± 0% 1.050m ± 0% -31.03% (p=0.000 n=10)
AddVWext/1-4 57.42n ± 0% 41.45n ± 0% -27.81% (p=0.000 n=10)
AddVWext/2-4 69.32n ± 0% 48.15n ± 0% -30.54% (p=0.000 n=10)
AddVWext/3-4 76.12n ± 0% 54.87n ± 0% -27.92% (p=0.000 n=10)
AddVWext/4-4 85.47n ± 0% 56.14n ± 0% -34.32% (p=0.000 n=10)
AddVWext/5-4 96.15n ± 0% 62.82n ± 0% -34.66% (p=0.000 n=10)
AddVWext/10-4 149.60n ± 0% 89.55n ± 0% -40.14% (p=0.000 n=10)
AddVWext/100-4 1115.0n ± 0% 549.3n ± 0% -50.74% (p=0.000 n=10)
AddVWext/1000-4 10.732µ ± 0% 5.060µ ± 0% -52.85% (p=0.000 n=10)
AddVWext/10000-4 150.5µ ± 0% 103.7µ ± 0% -31.10% (p=0.000 n=10)
AddVWext/100000-4 1.530m ± 0% 1.049m ± 0% -31.41% (p=0.000 n=10)
geomean 1.003µ 633.9n -36.79%
│ addvw.1 │ addvw.2 │
│ B/s │ B/s vs base │
AddVW/1-4 132.8Mi ± 0% 184.1Mi ± 0% +38.55% (p=0.000 n=10)
AddVW/2-4 220.1Mi ± 0% 316.9Mi ± 0% +43.96% (p=0.000 n=10)
AddVW/3-4 300.7Mi ± 0% 416.4Mi ± 0% +38.48% (p=0.000 n=10)
AddVW/4-4 357.1Mi ± 0% 543.6Mi ± 0% +52.25% (p=0.000 n=10)
AddVW/5-4 396.7Mi ± 0% 607.2Mi ± 0% +53.06% (p=0.000 n=10)
AddVW/10-4 510.1Mi ± 0% 852.0Mi ± 0% +67.02% (p=0.000 n=10)
AddVW/100-4 684.1Mi ± 0% 1389.0Mi ± 0% +103.03% (p=0.000 n=10)
AddVW/1000-4 710.9Mi ± 0% 1507.8Mi ± 0% +112.08% (p=0.000 n=10)
AddVW/10000-4 503.1Mi ± 0% 735.8Mi ± 0% +46.26% (p=0.000 n=10)
AddVW/100000-4 501.0Mi ± 0% 726.5Mi ± 0% +45.00% (p=0.000 n=10)
AddVWext/1-4 132.9Mi ± 0% 184.1Mi ± 0% +38.55% (p=0.000 n=10)
AddVWext/2-4 220.1Mi ± 0% 316.9Mi ± 0% +43.98% (p=0.000 n=10)
AddVWext/3-4 300.7Mi ± 0% 417.1Mi ± 0% +38.73% (p=0.000 n=10)
AddVWext/4-4 357.1Mi ± 0% 543.6Mi ± 0% +52.25% (p=0.000 n=10)
AddVWext/5-4 396.7Mi ± 0% 607.2Mi ± 0% +53.05% (p=0.000 n=10)
AddVWext/10-4 510.1Mi ± 0% 852.0Mi ± 0% +67.02% (p=0.000 n=10)
AddVWext/100-4 684.2Mi ± 0% 1389.0Mi ± 0% +103.02% (p=0.000 n=10)
AddVWext/1000-4 710.9Mi ± 0% 1507.7Mi ± 0% +112.08% (p=0.000 n=10)
AddVWext/10000-4 506.9Mi ± 0% 735.8Mi ± 0% +45.15% (p=0.000 n=10)
AddVWext/100000-4 498.6Mi ± 0% 727.0Mi ± 0% +45.79% (p=0.000 n=10)
geomean 388.3Mi 614.3Mi +58.19%
Change-Id: Ib14a4b8c1d81e710753bbf6dd5546bbca44fe3f1
Reviewed-on: https://go-review.googlesource.com/c/go/+/595397
Reviewed-by: Meng Zhuo <mengzhuo1203@gmail.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Mark Ryan <markdryan@rivosinc.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
|
|
For #62384
Change-Id: I1557704c6a0f9c6f3b9aad001374dd5cdbc99065
GitHub-Last-Rev: c258d18ccedab5feeb481a2431d5647bde7e5c58
GitHub-Pull-Request: golang/go#68893
Reviewed-on: https://go-review.googlesource.com/c/go/+/605758
Reviewed-by: Ian Lance Taylor <iant@google.com>
Commit-Queue: Robert Griesemer <gri@google.com>
Reviewed-by: Robert Griesemer <gri@google.com>
Auto-Submit: Ian Lance Taylor <iant@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Robert Griesemer <gri@google.com>
|
|
This provides an assembly implementation of subVV for riscv64,
processing up to four words per loop, resulting in a significant
performance gain.
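The vector-vector case differs from the vector-word case in that a borrow must be threaded through every word pair. The following pure-Go sketch shows that borrow chain with a four-way unrolled loop. It is an illustration under assumptions (math/bits, []uint words, equal-length slices), not the assembly added by this change.

```go
package main

import (
	"fmt"
	"math/bits"
)

// subVV sets z = x - y word by word and returns the final borrow.
// Hypothetical pure-Go illustration; x and y must be at least as long as z.
func subVV(z, x, y []uint) (b uint) {
	i := 0
	// Main loop: four words per iteration, borrow passed along the chain.
	for ; i+4 <= len(z); i += 4 {
		z[i], b = bits.Sub(x[i], y[i], b)
		z[i+1], b = bits.Sub(x[i+1], y[i+1], b)
		z[i+2], b = bits.Sub(x[i+2], y[i+2], b)
		z[i+3], b = bits.Sub(x[i+3], y[i+3], b)
	}
	// Tail loop: the remaining zero to three words.
	for ; i < len(z); i++ {
		z[i], b = bits.Sub(x[i], y[i], b)
	}
	return b
}

func main() {
	x := []uint{0, 1}
	y := []uint{1, 0}
	z := make([]uint, 2)
	b := subVV(z, x, y)
	fmt.Printf("%#x %#x borrow=%d\n", z[0], z[1], b)
}
```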
On a StarFive VisionFive 2:
│ subvv.1 │ subvv.2 │
│ sec/op │ sec/op vs base │
SubVV/1-4 73.46n ± 0% 48.08n ± 0% -34.55% (p=0.000 n=10)
SubVV/2-4 88.13n ± 0% 58.76n ± 0% -33.33% (p=0.000 n=10)
SubVV/3-4 102.80n ± 0% 69.45n ± 0% -32.44% (p=0.000 n=10)
SubVV/4-4 117.50n ± 0% 72.11n ± 0% -38.63% (p=0.000 n=10)
SubVV/5-4 132.20n ± 0% 82.80n ± 0% -37.37% (p=0.000 n=10)
SubVV/10-4 216.3n ± 0% 126.9n ± 0% -41.33% (p=0.000 n=10)
SubVV/100-4 1659.0n ± 0% 886.5n ± 0% -46.56% (p=0.000 n=10)
SubVV/1000-4 16.089µ ± 0% 8.401µ ± 0% -47.78% (p=0.000 n=10)
SubVV/10000-4 244.7µ ± 0% 176.8µ ± 0% -27.74% (p=0.000 n=10)
SubVV/100000-4 2.562m ± 0% 1.871m ± 0% -26.96% (p=0.000 n=10)
geomean 1.436µ 904.4n -37.04%
│ subvv.1 │ subvv.2 │
│ B/s │ B/s vs base │
SubVV/1-4 830.9Mi ± 0% 1269.5Mi ± 0% +52.79% (p=0.000 n=10)
SubVV/2-4 1.353Gi ± 0% 2.029Gi ± 0% +49.99% (p=0.000 n=10)
SubVV/3-4 1.739Gi ± 0% 2.575Gi ± 0% +48.06% (p=0.000 n=10)
SubVV/4-4 2.029Gi ± 0% 3.306Gi ± 0% +62.96% (p=0.000 n=10)
SubVV/5-4 2.254Gi ± 0% 3.600Gi ± 0% +59.67% (p=0.000 n=10)
SubVV/10-4 2.755Gi ± 0% 4.699Gi ± 0% +70.53% (p=0.000 n=10)
SubVV/100-4 3.594Gi ± 0% 6.723Gi ± 0% +87.08% (p=0.000 n=10)
SubVV/1000-4 3.705Gi ± 0% 7.095Gi ± 0% +91.52% (p=0.000 n=10)
SubVV/10000-4 2.436Gi ± 0% 3.372Gi ± 0% +38.39% (p=0.000 n=10)
SubVV/100000-4 2.327Gi ± 0% 3.185Gi ± 0% +36.91% (p=0.000 n=10)
geomean 2.118Gi 3.364Gi +58.84%
Change-Id: I361cb3f4195b27a9f1e9486c9e1fdbeaa94d32b4
Reviewed-on: https://go-review.googlesource.com/c/go/+/595396
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Meng Zhuo <mengzhuo1203@gmail.com>
Reviewed-by: Mark Ryan <markdryan@rivosinc.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Carlos Amedee <carlos@golang.org>
|
|
This provides an assembly implementation of addVV for riscv64,
processing up to four words per loop, resulting in a significant
performance gain.
On a StarFive VisionFive 2:
│ addvv.1 │ addvv.2 │
│ sec/op │ sec/op vs base │
AddVV/1-4 73.45n ± 0% 48.08n ± 0% -34.54% (p=0.000 n=10)
AddVV/2-4 88.14n ± 0% 58.76n ± 0% -33.33% (p=0.000 n=10)
AddVV/3-4 102.80n ± 0% 69.44n ± 0% -32.45% (p=0.000 n=10)
AddVV/4-4 117.50n ± 0% 72.18n ± 0% -38.57% (p=0.000 n=10)
AddVV/5-4 132.20n ± 0% 82.79n ± 0% -37.38% (p=0.000 n=10)
AddVV/10-4 216.3n ± 0% 126.8n ± 0% -41.35% (p=0.000 n=10)
AddVV/100-4 1659.0n ± 0% 885.2n ± 0% -46.64% (p=0.000 n=10)
AddVV/1000-4 16.089µ ± 0% 8.400µ ± 0% -47.79% (p=0.000 n=10)
AddVV/10000-4 245.3µ ± 0% 176.9µ ± 0% -27.88% (p=0.000 n=10)
AddVV/100000-4 2.537m ± 0% 1.873m ± 0% -26.17% (p=0.000 n=10)
geomean 1.435µ 904.5n -36.99%
│ addvv.1 │ addvv.2 │
│ B/s │ B/s vs base │
AddVV/1-4 830.9Mi ± 0% 1269.5Mi ± 0% +52.78% (p=0.000 n=10)
AddVV/2-4 1.353Gi ± 0% 2.029Gi ± 0% +50.00% (p=0.000 n=10)
AddVV/3-4 1.739Gi ± 0% 2.575Gi ± 0% +48.09% (p=0.000 n=10)
AddVV/4-4 2.029Gi ± 0% 3.303Gi ± 0% +62.82% (p=0.000 n=10)
AddVV/5-4 2.254Gi ± 0% 3.600Gi ± 0% +59.69% (p=0.000 n=10)
AddVV/10-4 2.755Gi ± 0% 4.699Gi ± 0% +70.54% (p=0.000 n=10)
AddVV/100-4 3.594Gi ± 0% 6.734Gi ± 0% +87.37% (p=0.000 n=10)
AddVV/1000-4 3.705Gi ± 0% 7.096Gi ± 0% +91.54% (p=0.000 n=10)
AddVV/10000-4 2.430Gi ± 0% 3.369Gi ± 0% +38.65% (p=0.000 n=10)
AddVV/100000-4 2.350Gi ± 0% 3.183Gi ± 0% +35.44% (p=0.000 n=10)
geomean 2.119Gi 3.364Gi +58.71%
Change-Id: I727b3d9f8ab01eada7270046480b1430d56d0a96
Reviewed-on: https://go-review.googlesource.com/c/go/+/595395
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: M Zhuo <mengzhuo1203@gmail.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Than McIntosh <thanm@google.com>
|
|
The comment on line 395 reads:
[x₀ < S, so S - x₀ < 0; drop it]
It should read:
[x₀ < S, so S - x₀ > 0; drop it]
The proof relies on S - x₀ > 0, so the "< 0" in the comment is a typo.
Fixes #68466
Change-Id: I68bb7cb909ba2bfe02a8873f74b57edc6679b72a
GitHub-Last-Rev: 40a2fc80cf22e97e0f535454a9b87b31b2e51421
GitHub-Pull-Request: golang/go#68487
Reviewed-on: https://go-review.googlesource.com/c/go/+/598855
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Keith Randall <khr@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Keith Randall <khr@golang.org>
Auto-Submit: Keith Randall <khr@golang.org>
|
|
This looks way better than the code formatting.
Similar to CL 597656.
Change-Id: I2c8809c1d6f8a8387941567213880662ff649a73
Reviewed-on: https://go-review.googlesource.com/c/go/+/597659
Reviewed-by: Ian Lance Taylor <iant@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Auto-Submit: Ian Lance Taylor <iant@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
Change-Id: I3541859bbf3ac4f9317b82a66d21be3d5c4c5a84
Reviewed-on: https://go-review.googlesource.com/c/go/+/597658
Reviewed-by: Ian Lance Taylor <iant@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Carlos Amedee <carlos@golang.org>
Auto-Submit: Ian Lance Taylor <iant@google.com>
|
|
Fixes #66358.
Change-Id: Ic9bde88eabfb2a446d32e1dc5ac404a51ef49f11
Reviewed-on: https://go-review.googlesource.com/c/go/+/590635
Auto-Submit: Robert Griesemer <gri@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Reviewed-by: Robert Griesemer <gri@google.com>
|
|
In all cases the intent was not to interpret s as a format string.
In one case (go/types), this was a latent bug in production.
(These were uncovered by a new check in vet's printf analyzer.)
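A minimal sketch of the class of bug described above, assuming a hypothetical helper named report. The code touched by this change is not shown here, so the names below are illustrative only.

```go
package main

import "fmt"

// report returns msg verbatim. msg is data, not a format string.
// (report is a hypothetical helper used only for this illustration.)
func report(msg string) string {
	// Risky form, left as a comment on purpose: fmt.Sprintf(msg)
	// would treat any '%' inside msg as the start of a formatting verb.
	// Passing msg as an argument to an explicit "%s" avoids that.
	return fmt.Sprintf("%s", msg)
}

func main() {
	fmt.Println(report("100% done")) // prints "100% done"
}
```

When no formatting is needed at all, fmt.Sprint(msg) or the string itself is an equally safe choice.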
Updates #60529
Change-Id: I3e17af7e589be9aec1580783a1b1011c52ec494b
Reviewed-on: https://go-review.googlesource.com/c/go/+/587855
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Russ Cox <rsc@golang.org>
|
|
For #67401.
Change-Id: Ifea84af92017b405466937f50fb8f28e6893c8cb
Reviewed-on: https://go-review.googlesource.com/c/go/+/587220
Reviewed-by: Cherry Mui <cherryyz@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Russ Cox <rsc@golang.org>
|
|
The slices functions are slightly faster and slightly easier to use
than the sort functions they replace. Switching also removes one
dependency layer.
This CL does not change packages that are used during bootstrap,
as the bootstrap compiler does not have the required slices functions.
It does not change the go/scanner package because the ErrorList
Len, Swap, and Less methods are part of the Go 1 API.
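For readers unfamiliar with the two APIs, here is a small before-and-after sketch of the kind of rewrite described above. The entry type and the function names are hypothetical; the example assumes Go 1.21 or later, where the slices and cmp packages are available.

```go
package main

import (
	"cmp"
	"fmt"
	"slices"
	"sort"
)

// entry is a hypothetical element type used only for this illustration.
type entry struct {
	name string
	size int
}

// sortOld orders es by size using the index-based sort.Slice API.
func sortOld(es []entry) {
	sort.Slice(es, func(i, j int) bool { return es[i].size < es[j].size })
}

// sortNew does the same with slices.SortFunc, whose comparison
// function receives the elements directly and returns an int.
func sortNew(es []entry) {
	slices.SortFunc(es, func(a, b entry) int { return cmp.Compare(a.size, b.size) })
}

func main() {
	es := []entry{{"c", 3}, {"a", 1}, {"b", 2}}
	sortNew(es)
	fmt.Println(es) // prints "[{a 1} {b 2} {c 3}]"
}
```

The generic version works on the element values rather than on indices into a captured slice, which is part of what makes it easier to use.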
Change-Id: If52899be791c829198e11d2408727720b91ebe8a
Reviewed-on: https://go-review.googlesource.com/c/go/+/587655
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Auto-Submit: Ian Lance Taylor <iant@google.com>
Commit-Queue: Ian Lance Taylor <iant@google.com>
Reviewed-by: Damien Neil <dneil@google.com>
|