| Age | Commit message | Author |
|
Step 4 of the mini-compiler: switch to the new generated assembly.
There are no systematic performance regressions and many improvements.
In the benchmarks, the systems are:
c3h88    GOARCH=amd64    c3h88 perf gomote (newer Intel, Google Cloud)
c2s16    GOARCH=amd64    c2s16 perf gomote (Intel, Google Cloud)
s7       GOARCH=amd64    rsc basement server (AMD Ryzen 9 7950X)
386      GOARCH=386      gotip-linux-386 gomote (Intel, Google Cloud)
s7-386   GOARCH=386      rsc basement server (AMD Ryzen 9 7950X)
c4as16   GOARCH=arm64    c4as16 perf gomote (Google Cloud)
mac      GOARCH=arm64    Apple M3 Pro in MacBook Pro
arm      GOARCH=arm      gotip-linux-arm gomote
loong64  GOARCH=loong64  gotip-linux-loong64 gomote
ppc64le  GOARCH=ppc64le  gotip-linux-ppc64le gomote
riscv64  GOARCH=riscv64  gotip-linux-riscv64 gomote
s390x    GOARCH=s390x    linux-s390x-ibm old gomote
benchmark \ system c3h88 c2s16 s7 386 s7-386 c4as16 mac arm loong64 ppc64le riscv64 s390x
AddVV/words=1 -4.03% +5.21% -4.04% +4.94% ~ ~ ~ ~ -19.51% ~ ~ ~
AddVV/words=10 -10.20% +0.34% -3.46% -11.50% -7.46% +7.66% +5.97% ~ -17.90% ~ ~ ~
AddVV/words=16 -10.91% -6.45% -8.45% -21.86% -17.90% +2.73% -1.61% ~ -22.47% -3.54% ~ ~
AddVV/words=100 -3.77% -4.30% -3.17% -47.27% -45.34% -0.78% ~ -8.74% -27.19% ~ ~ ~
AddVV/words=1000 -0.08% -0.71% ~ -49.21% -48.07% ~ ~ -16.80% -24.74% ~ ~ ~
AddVV/words=10000 ~ ~ ~ -48.73% -48.56% -0.06% ~ -17.08% ~ ~ -4.81% ~
AddVV/words=100000 ~ ~ ~ -47.80% -48.38% ~ ~ -15.10% -25.06% ~ -5.34% ~
SubVV/words=1 -0.84% +3.43% -3.62% +1.34% ~ -0.76% ~ ~ -18.18% +5.58% ~ ~
SubVV/words=10 -9.99% +0.34% ~ -11.23% -8.24% +7.53% +6.15% ~ -17.55% +2.77% -2.08% ~
SubVV/words=16 -11.94% -6.45% -6.81% -21.82% -18.11% +1.58% -1.21% ~ -20.36% ~ ~ ~
SubVV/words=100 -3.38% -4.32% -1.80% -46.14% -46.43% +0.41% ~ -7.20% -26.17% ~ -0.42% ~
SubVV/words=1000 -0.38% -0.80% ~ -49.22% -48.90% ~ ~ -15.86% -24.73% ~ ~ ~
SubVV/words=10000 ~ ~ ~ -49.57% -49.64% -0.03% ~ -15.85% -26.52% ~ -5.05% ~
SubVV/words=100000 ~ ~ ~ -46.88% -49.66% ~ ~ -15.45% -16.11% ~ -4.99% ~
LshVU/words=1 ~ +5.78% ~ ~ -2.48% +1.61% +2.18% +2.70% -18.16% -34.16% -21.29% ~
LshVU/words=10 -18.34% -3.78% +2.21% ~ ~ -2.81% -12.54% ~ -25.02% -24.78% -38.11% -66.98%
LshVU/words=16 -23.15% +1.03% +7.74% +0.73% ~ +8.88% +1.56% ~ -25.37% -28.46% -41.27% ~
LshVU/words=100 -32.85% -8.86% -2.58% ~ +2.69% +1.24% ~ -20.63% -44.14% -42.68% -53.09% ~
LshVU/words=1000 -37.30% -0.20% +5.67% ~ ~ +1.44% ~ -27.83% -45.01% -37.07% -57.02% -46.57%
LshVU/words=10000 -36.84% -2.30% +3.82% ~ +1.86% +1.57% -66.81% -28.00% -13.15% -35.40% -41.97% ~
LshVU/words=100000 -40.30% ~ +3.96% ~ ~ ~ ~ -24.91% -19.06% -36.14% -40.99% -66.03%
RshVU/words=1 -3.17% +4.76% -4.06% +4.31% +4.55% ~ ~ ~ -20.61% ~ -26.20% -51.33%
RshVU/words=10 -22.08% -4.41% -17.99% +3.64% -11.87% ~ -16.30% ~ -30.01% ~ -40.37% -63.05%
RshVU/words=16 -26.03% -8.50% -18.09% ~ -17.52% +6.50% ~ -2.85% -30.24% ~ -42.93% -63.13%
RshVU/words=100 -20.87% -28.83% -29.45% ~ -26.25% +1.46% -1.14% -16.20% -45.65% -16.20% -53.66% -77.27%
RshVU/words=1000 -24.03% -21.37% -26.71% ~ -28.95% +0.98% ~ -18.82% -45.21% -23.55% -57.09% -71.18%
RshVU/words=10000 -24.56% -22.44% -27.01% ~ -28.88% +0.78% -5.35% -17.47% -16.87% -20.67% -41.97% ~
RshVU/words=100000 -23.36% -15.65% -27.54% ~ -29.26% +1.73% -6.67% -13.68% -21.40% -23.02% -40.37% -66.31%
MulAddVWW/words=1 +2.37% +8.14% ~ +4.10% +3.71% ~ ~ ~ -21.62% ~ +1.12% ~
MulAddVWW/words=10 ~ -2.72% -15.15% +8.04% ~ ~ ~ -2.52% -19.48% ~ -6.18% ~
MulAddVWW/words=16 ~ +1.49% ~ +4.49% +6.58% -8.70% -7.16% -12.08% -21.43% -6.59% -9.05% ~
MulAddVWW/words=100 +0.37% +1.11% -4.51% -13.59% ~ -11.10% -3.63% -21.40% -22.27% -2.92% -14.41% ~
MulAddVWW/words=1000 ~ +0.90% -7.13% -18.94% ~ -14.02% -9.97% -28.31% -18.72% -2.32% -15.80% ~
MulAddVWW/words=10000 ~ +1.08% -6.75% -19.10% ~ -14.61% -9.04% -28.48% -14.29% -2.25% -9.40% ~
MulAddVWW/words=100000 ~ ~ -6.93% -18.09% ~ -14.33% -9.66% -28.92% -16.63% -2.43% -8.23% ~
AddMulVVWW/words=1 +2.30% +4.83% -11.37% +4.58% ~ -3.14% ~ ~ -10.58% +30.35% ~ ~
AddMulVVWW/words=10 -3.27% ~ +8.96% +5.74% ~ +2.67% -1.44% -7.64% -13.41% ~ ~ ~
AddMulVVWW/words=16 -6.12% ~ ~ ~ +1.91% -7.90% -16.22% -14.07% -14.26% -4.15% -7.30% ~
AddMulVVWW/words=100 -5.48% -2.14% ~ -9.40% +9.98% -1.43% -12.35% -18.56% -21.94% ~ -9.84% ~
AddMulVVWW/words=1000 -11.35% -3.40% -3.64% -11.04% +12.82% -1.33% -15.63% -20.50% -20.95% ~ -11.06% -51.97%
AddMulVVWW/words=10000 -10.31% -1.61% -8.41% -12.15% +13.10% -1.03% -16.34% -22.46% -1.00% ~ -10.33% -49.80%
AddMulVVWW/words=100000 -13.71% ~ -8.31% -12.18% +12.98% -1.35% -15.20% -21.89% ~ ~ -9.38% -48.30%
Change-Id: I0a33c33602c0d053c84d9946e662500cfa048e2d
Reviewed-on: https://go-review.googlesource.com/c/go/+/664938
Reviewed-by: Alan Donovan <adonovan@google.com>
Auto-Submit: Russ Cox <rsc@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
The vast majority of the time, carry propagation is limited and
addVW/subVW only need to consider a single word for carry propagation.
As Josh Bleecher-Snyder pointed out in 2019 (CL 164968), once carrying
is done, the remaining words can be handled faster with copy (memmove).
In the benchmarks below, this is the data=random case.
Even more important, if the source and destination are the same,
the copy can be optimized away entirely, making a small in-place
addition to a big.Int O(1) instead of O(N). To date, only a few
systems (amd64, arm64, and pure Go, meaning wasm) make use of this
asymptotic improvement. This is the data=shortcut case.
This CL deletes the addVW/subVW assembly and replaces it with
an optimized pure Go version. Using Go makes it easy to call
the real copy builtin, which will use optimized memmove code,
instead of recreating a worse memmove in assembly (as arm64 does)
or omitting the copy optimization entirely (as most others do).
The worst case for the Go version versus assembly is the case
of incrementing 2^N-1 by 1, which has to propagate a carry
the entire length of the array. This is the data=carry case.
On balance, we believe this case is rare enough that the slowdown
is worth accepting, in exchange for significant wins
in the other cases and the deletion of significant amounts of
assembly of varying quality. (Remember that half the assembly has
the copy optimization and shortcut, while half does not.)
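The copy optimization and the shortcut described above can be sketched
as follows. This is a minimal illustration, not the code from this CL:
it assumes uint words and equal-length slices, and it detects aliasing
by comparing the addresses of the first elements.

```go
package main

import (
	"fmt"
	"math/bits"
)

// addVW sets z = x + y for a multiword x and a single word y and
// returns the carry out. z and x are assumed to have the same length.
func addVW(z, x []uint, y uint) (c uint) {
	c = y
	for i := range x {
		if c == 0 {
			// Carrying is done. If z and x are the same slice, nothing
			// is left to do (data=shortcut); otherwise copy the
			// remaining words (data=random).
			if &z[0] != &x[0] {
				copy(z[i:], x[i:])
			}
			return 0
		}
		z[i], c = bits.Add(x[i], c, 0)
	}
	// The carry ran the full length of the vector (data=carry).
	return c
}

func main() {
	x := []uint{^uint(0), 5, 7}
	z := make([]uint, len(x))
	c := addVW(z, x, 1)
	fmt.Println(z, c) // [0 6 7] 0

	all := []uint{^uint(0), ^uint(0)}
	c = addVW(all, all, 1) // in place, carry propagates to the end
	fmt.Println(all, c)    // [0 0] 1
}
```

In the in-place case the loop exits as soon as the carry is zero, which
is what turns a small addition to a large value into O(1) work.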
In the benchmarks, the systems are:
c2s16    GOARCH=amd64    c2s16 perf gomote (Intel, Google Cloud)
c3h88    GOARCH=amd64    c3h88 perf gomote (newer Intel, Google Cloud)
s7       GOARCH=amd64    rsc basement server (AMD Ryzen 9 7950X)
c4as16   GOARCH=arm64    c4as16 perf gomote (Google Cloud)
mac      GOARCH=arm64    Apple M3 Pro in MacBook Pro
386      GOARCH=386      gotip-linux-386 gomote
arm      GOARCH=arm      gotip-linux-arm gomote
loong64  GOARCH=loong64  gotip-linux-loong64 gomote
ppc64le  GOARCH=ppc64le  gotip-linux-ppc64le gomote
riscv64  GOARCH=riscv64  gotip-linux-riscv64 gomote
benchmark \ system c2s16 c3h88 s7 c4as16 mac 386 arm loong64 ppc64le riscv64
AddVW/words=1/data=random -1.15% -1.74% -5.89% -9.80% -11.54% +23.71% -12.74% -14.25% +14.67% +10.27%
AddVW/words=2/data=random -2.59% ~ -4.38% -19.31% -15.41% +24.80% ~ -19.99% +13.73% +19.71%
AddVW/words=3/data=random -3.75% -19.10% -3.79% -23.15% -17.04% +20.04% -10.07% -23.20% ~ +15.39%
AddVW/words=4/data=random -2.84% +7.05% -8.77% -22.64% -15.77% +16.01% -7.36% -28.22% ~ +23.00%
AddVW/words=5/data=random -10.97% +2.16% -12.09% -20.89% -17.14% +9.42% -4.69% -32.60% ~ +10.07%
AddVW/words=6/data=random -9.87% ~ -7.54% -19.08% -6.46% ~ -3.44% -34.61% ~ +12.19%
AddVW/words=7/data=random -14.36% ~ -10.09% -19.10% -10.47% -6.20% -5.06% -38.14% -11.54% +6.79%
AddVW/words=8/data=random -17.50% ~ -11.06% -25.14% -12.88% -8.35% -5.11% -41.39% -14.04% +11.87%
AddVW/words=9/data=random -19.76% -4.05% -15.47% -24.08% -16.50% -12.34% -21.56% -44.25% -14.82% ~
AddVW/words=10/data=random -13.89% ~ -9.69% -23.06% -8.04% -12.58% -19.25% -32.80% -11.68% ~
AddVW/words=16/data=random -29.36% -15.35% -21.86% -25.04% -19.89% -32.26% -16.29% -42.66% -25.92% -3.01%
AddVW/words=32/data=random -39.02% -28.76% -39.87% -11.22% -2.85% -55.40% -31.17% -55.37% -37.92% -16.28%
AddVW/words=64/data=random -25.94% -19.09% -20.60% -6.90% +8.91% -51.00% -43.72% -62.27% -44.11% -28.74%
AddVW/words=100/data=random -22.79% -18.13% -18.25% ~ +33.89% -67.40% -51.77% -63.54% -53.75% -30.97%
AddVW/words=1000/data=random -8.98% -3.84% ~ -3.15% ~ -93.35% -63.92% -65.66% -68.67% -42.30%
AddVW/words=10000/data=random -1.38% -0.38% ~ ~ ~ -89.16% -65.18% -44.65% -70.35% -20.08%
AddVW/words=100000/data=random ~ ~ ~ ~ ~ -87.03% -64.51% -36.08% -61.40% -16.53%
SubVW/words=1/data=random -3.67% ~ -8.38% -10.26% -3.07% +45.78% -6.06% -11.17% ~ ~
SubVW/words=2/data=random -3.48% -10.07% -5.76% -20.14% -8.45% +44.28% ~ -19.09% ~ +16.98%
SubVW/words=3/data=random -7.11% -26.64% -4.48% -22.07% -9.21% +35.61% ~ -23.93% -18.20% ~
SubVW/words=4/data=random -4.23% +7.19% -8.95% -22.62% -13.89% +33.20% -8.96% -29.96% ~ +22.23%
SubVW/words=5/data=random -11.49% +1.92% -10.86% -22.27% -17.53% +24.48% -2.88% -35.19% -19.55% ~
SubVW/words=6/data=random -7.67% ~ -7.72% -18.44% -6.24% +12.03% -2.00% -39.68% -10.73% ~
SubVW/words=7/data=random -13.69% -18.32% -11.82% -18.92% -11.57% +6.63% ~ -43.54% -30.81% ~
SubVW/words=8/data=random -16.02% ~ -11.07% -24.50% -11.92% +4.32% -3.01% -46.95% -24.14% ~
SubVW/words=9/data=random -18.76% -3.34% -14.84% -23.79% -17.50% ~ -21.80% -49.98% -29.62% ~
SubVW/words=10/data=random -13.23% ~ -9.25% -21.26% -11.63% ~ -18.58% -39.19% -20.09% ~
SubVW/words=16/data=random -28.25% -13.24% -22.66% -27.18% -19.13% -23.38% -20.24% -51.01% -28.06% -3.05%
SubVW/words=32/data=random -38.41% -28.88% -40.12% -11.20% -2.80% -49.17% -34.67% -63.29% -39.25% -15.20%
SubVW/words=64/data=random -25.51% -19.24% -22.20% -6.57% +9.98% -48.52% -48.14% -69.50% -49.44% -27.92%
SubVW/words=100/data=random -21.69% -18.51% ~ +1.92% +34.42% -65.88% -54.67% -71.24% -58.88% -30.71%
SubVW/words=1000/data=random -9.81% -4.05% -2.14% -3.06% ~ -93.37% -67.33% -74.12% -68.36% -42.17%
SubVW/words=10000/data=random ~ -0.52% ~ ~ ~ -88.87% -68.54% -44.94% -70.63% -19.95%
SubVW/words=100000/data=random ~ ~ ~ ~ ~ -86.69% -68.09% -48.36% -62.42% -19.32%
AddVW/words=1/data=shortcut -29.38% -25.38% -27.37% -23.15% -25.41% +3.01% -33.60% -36.12% -15.76% ~
AddVW/words=2/data=shortcut -32.79% -34.72% -31.47% -24.47% -28.21% -3.75% -34.66% -43.89% -23.65% -21.56%
AddVW/words=3/data=shortcut -38.50% -46.83% -35.67% -26.38% -30.29% -10.41% -44.89% -47.68% -30.93% -26.85%
AddVW/words=4/data=shortcut -40.40% -28.85% -34.19% -29.83% -32.95% -16.09% -42.86% -51.02% -34.19% -26.69%
AddVW/words=5/data=shortcut -43.87% -35.42% -36.46% -32.59% -37.72% -20.82% -45.14% -54.01% -35.49% -30.48%
AddVW/words=6/data=shortcut -46.98% -39.34% -42.22% -35.43% -38.18% -27.46% -46.72% -56.61% -40.21% -34.07%
AddVW/words=7/data=shortcut -49.63% -47.97% -46.61% -35.28% -41.93% -31.14% -49.29% -58.89% -41.10% -37.01%
AddVW/words=8/data=shortcut -50.48% -42.33% -45.40% -40.24% -41.74% -32.92% -50.62% -60.98% -44.85% -38.10%
AddVW/words=9/data=shortcut -54.27% -43.52% -49.06% -42.16% -45.22% -37.57% -51.84% -62.91% -46.04% -40.82%
AddVW/words=10/data=shortcut -56.01% -45.40% -51.42% -43.29% -46.14% -38.65% -53.65% -64.62% -47.05% -43.21%
AddVW/words=16/data=shortcut -62.73% -55.66% -59.31% -56.38% -54.31% -53.16% -61.03% -72.29% -58.24% -52.57%
AddVW/words=32/data=shortcut -74.00% -69.42% -71.75% -33.65% -37.35% -71.73% -72.59% -82.44% -70.87% -67.69%
AddVW/words=64/data=shortcut -56.69% -52.72% -52.09% -35.48% -36.87% -84.24% -83.10% -90.37% -82.56% -80.81%
AddVW/words=100/data=shortcut -56.68% -53.18% -51.49% -33.49% -37.72% -89.95% -88.21% -93.37% -88.47% -86.52%
AddVW/words=1000/data=shortcut -56.68% -52.45% -51.66% -35.31% -36.65% -98.88% -98.62% -99.24% -98.78% -98.41%
AddVW/words=10000/data=shortcut -56.70% -52.40% -51.92% -33.49% -36.98% -99.89% -99.86% -99.92% -99.87% -99.91%
AddVW/words=100000/data=shortcut -56.67% -52.46% -52.38% -35.31% -37.20% -99.99% -99.99% -99.99% -99.99% -99.99%
SubVW/words=1/data=shortcut -29.80% -20.71% -26.94% -23.24% -25.33% +26.97% -32.02% -37.85% -40.20% -12.67%
SubVW/words=2/data=shortcut -35.47% -36.38% -31.93% -25.43% -30.18% +18.96% -33.48% -46.48% -39.38% -18.65%
SubVW/words=3/data=shortcut -39.22% -49.96% -36.90% -25.82% -30.96% +12.53% -40.67% -51.07% -43.71% -23.78%
SubVW/words=4/data=shortcut -40.46% -24.90% -34.66% -29.87% -33.97% +4.60% -42.32% -54.92% -42.83% -22.45%
SubVW/words=5/data=shortcut -43.84% -34.17% -38.00% -32.55% -37.27% -2.46% -43.09% -58.18% -45.70% -26.45%
SubVW/words=6/data=shortcut -47.69% -37.49% -42.73% -35.90% -37.73% -8.52% -46.55% -61.01% -44.00% -30.14%
SubVW/words=7/data=shortcut -49.45% -50.66% -46.88% -34.77% -41.64% -14.46% -48.92% -63.46% -50.47% -33.39%
SubVW/words=8/data=shortcut -50.45% -39.31% -47.14% -40.47% -41.70% -15.77% -50.21% -65.64% -47.71% -34.01%
SubVW/words=9/data=shortcut -54.28% -43.07% -49.42% -41.34% -44.99% -19.39% -51.55% -67.61% -56.92% -36.82%
SubVW/words=10/data=shortcut -56.85% -47.88% -50.92% -42.76% -45.67% -23.60% -53.04% -69.34% -60.18% -39.43%
SubVW/words=16/data=shortcut -62.36% -54.83% -58.80% -55.83% -53.74% -41.04% -60.16% -76.75% -60.56% -48.63%
SubVW/words=32/data=shortcut -73.68% -68.64% -71.57% -33.52% -37.34% -64.73% -72.67% -85.89% -71.87% -64.56%
SubVW/words=64/data=shortcut -56.68% -51.66% -52.56% -34.75% -37.54% -80.30% -83.58% -92.39% -83.41% -78.70%
SubVW/words=100/data=shortcut -56.68% -50.97% -51.57% -33.68% -36.78% -87.42% -88.53% -94.84% -88.87% -84.96%
SubVW/words=1000/data=shortcut -56.68% -50.89% -52.10% -34.94% -37.77% -98.59% -98.71% -99.43% -98.80% -98.20%
SubVW/words=10000/data=shortcut -56.68% -51.00% -52.44% -33.65% -37.27% -99.86% -99.87% -99.94% -99.88% -99.90%
SubVW/words=100000/data=shortcut -56.68% -50.80% -52.20% -34.79% -37.46% -99.99% -99.99% -99.99% -99.99% -99.99%
AddVW/words=1/data=carry -0.51% -5.29% -24.03% -26.48% ~ ~ -33.14% -30.23% ~ -20.74%
AddVW/words=2/data=carry -6.36% ~ -21.05% -39.40% ~ +10.72% -29.12% -31.34% ~ -17.29%
AddVW/words=3/data=carry ~ ~ -17.46% -19.53% +17.58% ~ -26.23% -23.61% +7.80% -14.34%
AddVW/words=4/data=carry +19.02% +16.80% ~ ~ +28.25% ~ -27.90% -20.31% +19.16% ~
AddVW/words=5/data=carry +3.97% +53.02% ~ ~ +11.31% ~ -19.05% -17.47% +16.81% ~
AddVW/words=6/data=carry +2.98% +19.83% ~ ~ +14.84% ~ -18.48% -14.92% +18.25% ~
AddVW/words=7/data=carry ~ ~ ~ ~ +27.17% ~ -15.50% -12.74% +13.00% ~
AddVW/words=8/data=carry +0.58% +22.32% ~ +6.10% +29.63% ~ -13.04% ~ +28.46% +2.95%
AddVW/words=9/data=carry ~ +31.53% ~ ~ +14.42% ~ -11.32% ~ +18.37% +3.28%
AddVW/words=10/data=carry +3.94% +22.36% ~ +6.29% +19.22% ~ -11.27% ~ +20.10% +3.91%
AddVW/words=16/data=carry +2.82% +14.23% ~ +10.06% +25.91% -16.12% ~ ~ +52.28% +10.40%
AddVW/words=32/data=carry ~ +25.35% +13.66% ~ +34.89% -34.39% +6.51% -18.71% +41.06% +19.42%
AddVW/words=64/data=carry -42.03% ~ -39.70% +6.65% +32.29% -39.94% +14.34% ~ +19.68% +20.86%
AddVW/words=100/data=carry -33.95% -34.28% -39.65% ~ +27.72% -26.80% +17.40% ~ +26.39% +23.32%
AddVW/words=1000/data=carry -42.49% -47.87% -47.44% +1.25% +4.25% -41.76% +23.40% ~ +25.48% +27.99%
AddVW/words=10000/data=carry -41.85% -48.49% -49.43% ~ ~ -42.09% +24.61% -10.32% +40.55% +18.35%
AddVW/words=100000/data=carry -28.18% -48.13% -48.24% +1.35% ~ -42.90% +24.73% -9.79% +22.55% +17.16%
SubVW/words=1/data=carry -10.32% -17.16% -24.14% -26.24% ~ +18.43% -34.10% -29.54% -9.57% ~
SubVW/words=2/data=carry -19.45% -23.31% -20.74% -39.73% ~ +15.74% -28.13% -30.21% ~ -18.74%
SubVW/words=3/data=carry ~ -16.18% -15.34% -19.54% +17.62% +12.39% -27.64% -27.09% ~ -14.97%
SubVW/words=4/data=carry +11.67% +24.42% ~ ~ +25.11% +14.07% -28.08% -26.18% ~ ~
SubVW/words=5/data=carry +8.08% +25.64% ~ ~ +10.35% +8.12% -21.75% -25.50% ~ -4.86%
SubVW/words=6/data=carry ~ +13.82% ~ ~ +12.92% +6.79% -20.25% -24.70% ~ -2.74%
SubVW/words=7/data=carry ~ ~ +8.29% +4.51% +26.59% +4.62% -18.01% -24.09% ~ -1.26%
SubVW/words=8/data=carry ~ +23.16% +16.19% +6.16% +25.46% +6.74% -15.57% -22.74% ~ +1.44%
SubVW/words=9/data=carry ~ +30.71% +20.81% ~ +12.36% ~ -12.99% ~ ~ +3.13%
SubVW/words=10/data=carry +5.03% +19.53% +14.84% +14.16% +16.12% ~ -11.64% -16.00% +15.45% +3.29%
SubVW/words=16/data=carry +14.42% +15.58% +33.07% +11.43% +24.65% ~ ~ -21.90% +25.59% +9.40%
SubVW/words=32/data=carry ~ +27.57% +46.58% ~ +35.35% -8.49% ~ -24.04% +11.86% +18.40%
SubVW/words=64/data=carry -24.34% -27.83% -20.90% +13.34% +37.17% -14.90% ~ -8.81% +12.88% +18.92%
SubVW/words=100/data=carry -25.19% -34.70% -27.45% +12.86% +28.42% -14.48% ~ ~ +25.71% +21.93%
SubVW/words=1000/data=carry -24.93% -47.86% -47.26% +2.66% ~ -23.88% ~ ~ +25.99% +27.81%
SubVW/words=10000/data=carry -24.17% -36.48% -49.41% +1.06% ~ -25.06% ~ -26.50% +27.94% +18.36%
SubVW/words=100000/data=carry -22.51% -35.86% -49.46% +3.96% ~ -25.18% ~ -22.15% +26.86% +15.44%
Change-Id: I8f252073040e674780ac6ec9912082fb205329dd
Reviewed-on: https://go-review.googlesource.com/c/go/+/664898
Reviewed-by: Alan Donovan <adonovan@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
It is annoying that non-x86 implementations of shlVU and shrVU
have to go out of their way to handle the trivial case shift==0
with their own copy loops. Instead, arrange to never call them
with shift==0, so that the code can be removed.
Unfortunately, there are linknames of shlVU, so we cannot
change that function. But we can rename the functions and
then leave behind a shlVU wrapper, so do that.
Since the big.Int API calls the operations Lsh and Rsh, rename
shlVU/shrVU to lshVU/rshVU. Also rename various other shl/shr
methods and functions to lsh/rsh.
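A rough sketch of the wrapper arrangement, assuming 64-bit words and
equal-length slices. The names lshVU and shlVU come from this message;
the function bodies are illustrative pure Go, not the assembly.

```go
package main

import "fmt"

// lshVU sets z = x << s and returns the bits shifted out of the top word.
// It assumes 0 < s < 64, so it needs no special case for s == 0.
func lshVU(z, x []uint64, s uint) (c uint64) {
	if len(z) == 0 {
		return 0
	}
	c = x[len(z)-1] >> (64 - s)
	for i := len(z) - 1; i > 0; i-- {
		z[i] = x[i]<<s | x[i-1]>>(64-s)
	}
	z[0] = x[0] << s
	return c
}

// shlVU keeps the old name and the old contract, including s == 0,
// for callers that still reach it under that name.
func shlVU(z, x []uint64, s uint) (c uint64) {
	if s == 0 {
		copy(z, x)
		return 0
	}
	return lshVU(z, x, s)
}

func main() {
	x := []uint64{1<<63 | 1, 2}
	z := make([]uint64, len(x))
	fmt.Println(shlVU(z, x, 1), z) // 0 [2 5]
	fmt.Println(shlVU(z, x, 0), z) // 0 [9223372036854775809 2]
}
```

With the trivial case handled once in the wrapper, each architecture's
shift routine can drop its own copy loop.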
Change-Id: Ieaf54e0110a298730aa3e4566ce5be57ba7fc121
Reviewed-on: https://go-review.googlesource.com/c/go/+/664896
Reviewed-by: Alan Donovan <adonovan@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
addMulVVW is an unnecessarily special case.
All other assembly routines taking []Word (V as in vector) arguments
take separate source and destination. For example:
addVV: z = x+y
mulAddVWW: z = x*m+a
addMulVVW uses the z parameter as both destination and source:
addMulVVW: z = z+x*m
Even looking at the signatures is confusing: all the VV routines take
two input vectors x and y, but addMulVVW takes only x: where is y?
(The answer is that the two inputs are z and x.)
It would be nice to fix this, both for understandability and regularity,
and to simplify a future assembly generator.
We cannot remove or redefine addMulVVW, because it has been used
in linknames. Instead, the CL adds a new final addend argument ‘a’
like in mulAddVWW, making the natural name addMulVVWW
(two input vectors, two input words):
addMulVVWW: z = x+y*m+a
This CL updates all the assembly implementations to rename the
inputs z, x, y -> x, y, m, and then introduces a separate destination z.
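For reference, the new operation can be written in pure Go roughly as
below. This is a sketch assuming 64-bit words and equal-length slices;
the addMulVVW wrapper shown here is an assumption about how the old
name could be kept, not a quote from the CL.

```go
package main

import (
	"fmt"
	"math/bits"
)

// addMulVVWW sets z = x + y*m + a and returns the final carry word.
func addMulVVWW(z, x, y []uint64, m, a uint64) (c uint64) {
	c = a
	for i := range z {
		hi, lo := bits.Mul64(y[i], m)
		var cc uint64
		lo, cc = bits.Add64(lo, c, 0)
		hi += cc
		lo, cc = bits.Add64(lo, x[i], 0)
		hi += cc
		// y*m + c + x is at most 2^128-1, so hi cannot overflow.
		z[i], c = lo, hi
	}
	return c
}

// addMulVVW is the old in-place form, z = z + x*m, expressed with the
// new routine (hypothetical compatibility wrapper).
func addMulVVW(z, x []uint64, m uint64) (c uint64) {
	return addMulVVWW(z, z, x, m, 0)
}

func main() {
	x := []uint64{1, 2}
	y := []uint64{3, 4}
	z := make([]uint64, 2)
	c := addMulVVWW(z, x, y, 10, 5)
	fmt.Println(z, c) // [36 42] 0

	c = addMulVVW(z, y, 2) // z += y*2
	fmt.Println(z, c)      // [42 50] 0
}
```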
Change-Id: Ib76c80b53f6d1f4a901f663566e9c4764bb20488
Reviewed-on: https://go-review.googlesource.com/c/go/+/664895
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Alan Donovan <adonovan@google.com>
|
|
Running 'go fix' on the cmd+std packages handled much of this change.
Also update code generators to use only the new go:build lines,
not the old +build ones.
For #41184.
For #60268.
Change-Id: If35532abe3012e7357b02c79d5992ff5ac37ca23
Cq-Include-Trybots: luci.golang.try:gotip-linux-386-longtest,gotip-linux-amd64-longtest,gotip-windows-amd64-longtest
Reviewed-on: https://go-review.googlesource.com/c/go/+/536237
Reviewed-by: Ian Lance Taylor <iant@google.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Auto-Submit: Dmitri Shuralyov <dmitshur@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
Now gc can generate the same assembly code.
Change-Id: Iac503003e14045d63e2def66408c13cee516aa37
Reviewed-on: https://go-review.googlesource.com/c/go/+/402575
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
Auto-Submit: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
|
|
Don't add them to files in vendor and cmd/vendor though. These will be
pulled in by updating the respective dependencies.
For #41184
Change-Id: Icc57458c9b3033c347124323f33084c85b224c70
Reviewed-on: https://go-review.googlesource.com/c/go/+/319389
Trust: Tobias Klauser <tobias.klauser@gmail.com>
Run-TryBot: Tobias Klauser <tobias.klauser@gmail.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Russ Cox <rsc@golang.org>
|
|
Division is much slower than multiplication. Replacing each division
with a multiplication by a precomputed reciprocal of the divisor
makes the divWVW algorithm up to three times faster and also
speeds up nat division.
Benchmark results on arm64:
name old time/op new time/op delta
DivWVW/1-4 13.1ns ± 4% 13.3ns ± 4% ~ (p=0.444 n=5+5)
DivWVW/2-4 48.6ns ± 1% 51.2ns ± 2% +5.39% (p=0.008 n=5+5)
DivWVW/3-4 82.0ns ± 1% 69.7ns ± 1% -15.03% (p=0.008 n=5+5)
DivWVW/4-4 116ns ± 1% 71ns ± 2% -38.88% (p=0.008 n=5+5)
DivWVW/5-4 152ns ± 1% 84ns ± 4% -44.70% (p=0.008 n=5+5)
DivWVW/10-4 319ns ± 1% 155ns ± 4% -51.50% (p=0.008 n=5+5)
DivWVW/100-4 3.44µs ± 3% 1.30µs ± 8% -62.30% (p=0.008 n=5+5)
DivWVW/1000-4 33.8µs ± 0% 10.9µs ± 1% -67.74% (p=0.008 n=5+5)
DivWVW/10000-4 343µs ± 4% 111µs ± 5% -67.63% (p=0.008 n=5+5)
DivWVW/100000-4 3.35ms ± 1% 1.25ms ± 3% -62.79% (p=0.008 n=5+5)
QuoRem-4 3.08µs ± 2% 2.21µs ± 4% -28.40% (p=0.008 n=5+5)
ModSqrt225_Tonelli-4 444µs ± 2% 457µs ± 3% ~ (p=0.095 n=5+5)
ModSqrt225_3Mod4-4 136µs ± 1% 138µs ± 3% ~ (p=0.151 n=5+5)
ModSqrt231_Tonelli-4 473µs ± 3% 483µs ± 4% ~ (p=0.548 n=5+5)
ModSqrt231_5Mod8-4 164µs ± 9% 169µs ±12% ~ (p=0.421 n=5+5)
Sqrt-4 36.8µs ± 1% 28.6µs ± 0% -22.17% (p=0.016 n=5+4)
Div/20/10-4 50.0ns ± 3% 51.3ns ± 6% ~ (p=0.238 n=5+5)
Div/40/20-4 49.8ns ± 2% 51.3ns ± 6% ~ (p=0.222 n=5+5)
Div/100/50-4 85.8ns ± 4% 86.5ns ± 5% ~ (p=0.246 n=5+5)
Div/200/100-4 335ns ± 3% 296ns ± 2% -11.60% (p=0.008 n=5+5)
Div/400/200-4 442ns ± 2% 359ns ± 5% -18.81% (p=0.008 n=5+5)
Div/1000/500-4 858ns ± 3% 643ns ± 6% -25.06% (p=0.008 n=5+5)
Div/2000/1000-4 1.70µs ± 3% 1.28µs ± 4% -24.80% (p=0.008 n=5+5)
Div/20000/10000-4 45.0µs ± 5% 41.8µs ± 4% -7.17% (p=0.016 n=5+5)
Div/200000/100000-4 1.51ms ± 7% 1.43ms ± 3% -5.42% (p=0.016 n=5+5)
Div/2000000/1000000-4 57.6ms ± 4% 57.5ms ± 3% ~ (p=1.000 n=5+5)
Div/20000000/10000000-4 2.08s ± 3% 2.04s ± 1% ~ (p=0.095 n=5+5)
name old speed new speed delta
DivWVW/1-4 4.87GB/s ± 4% 4.80GB/s ± 4% ~ (p=0.310 n=5+5)
DivWVW/2-4 2.63GB/s ± 1% 2.50GB/s ± 2% -5.07% (p=0.008 n=5+5)
DivWVW/3-4 2.34GB/s ± 1% 2.76GB/s ± 1% +17.70% (p=0.008 n=5+5)
DivWVW/4-4 2.21GB/s ± 1% 3.61GB/s ± 2% +63.42% (p=0.008 n=5+5)
DivWVW/5-4 2.10GB/s ± 2% 3.81GB/s ± 4% +80.89% (p=0.008 n=5+5)
DivWVW/10-4 2.01GB/s ± 0% 4.13GB/s ± 4% +105.91% (p=0.008 n=5+5)
DivWVW/100-4 1.86GB/s ± 2% 4.95GB/s ± 7% +165.63% (p=0.008 n=5+5)
DivWVW/1000-4 1.89GB/s ± 0% 5.86GB/s ± 1% +209.96% (p=0.008 n=5+5)
DivWVW/10000-4 1.87GB/s ± 4% 5.76GB/s ± 5% +208.96% (p=0.008 n=5+5)
DivWVW/100000-4 1.91GB/s ± 1% 5.14GB/s ± 3% +168.85% (p=0.008 n=5+5)
Change-Id: I049f1196562b20800e6ef8a6493fd147f93ad830
Reviewed-on: https://go-review.googlesource.com/c/go/+/250417
Trust: Giovanni Bajo <rasky@develer.com>
Trust: Keith Randall <khr@golang.org>
Run-TryBot: Giovanni Bajo <rasky@develer.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
|
|
Add an optimization for addVW and subVW over large vectors: once
carrying is done, switch from add/sub with carry to copying the rest of
the vector. Consistent performance improvements are observed on various
arm64 machines.
Add tests and benchmarks to increase test coverage.
TestFunVWExt:
Tests various kinds of input vectors, using the results of the pure Go
addVW/subVW as the golden reference.
BenchmarkAddVWext and BenchmarkSubVWext:
Benchmark input vectors of all 1s or all 0s to evaluate the worst-case
overhead.
1. Performance comparison over randomly generated input vectors:
Server 1:
name old time/op new time/op delta
AddVW/1 12.3ns ± 3% 12.0ns ± 0% -2.60% (p=0.001 n=10+8)
AddVW/2 12.5ns ± 2% 12.3ns ± 0% -1.84% (p=0.001 n=10+8)
AddVW/3 12.6ns ± 2% 12.3ns ± 0% -1.91% (p=0.009 n=10+10)
AddVW/4 13.1ns ± 3% 12.7ns ± 0% -2.98% (p=0.006 n=10+8)
AddVW/5 14.4ns ± 1% 13.9ns ± 0% -3.81% (p=0.000 n=10+10)
AddVW/10 11.7ns ± 0% 11.7ns ± 0% ~ (all equal)
AddVW/100 47.8ns ± 0% 29.9ns ± 2% -37.38% (p=0.000 n=10+9)
AddVW/1000 446ns ± 0% 207ns ± 0% -53.59% (p=0.000 n=10+10)
AddVW/10000 4.35µs ± 1% 2.92µs ± 0% -32.85% (p=0.000 n=10+10)
AddVW/100000 43.6µs ± 0% 29.7µs ± 0% -31.92% (p=0.000 n=8+10)
SubVW/1 12.6ns ± 0% 12.3ns ± 2% -2.22% (p=0.000 n=7+10)
SubVW/2 12.7ns ± 0% 12.6ns ± 1% -0.39% (p=0.046 n=8+10)
SubVW/3 12.7ns ± 1% 12.6ns ± 1% ~ (p=0.410 n=10+10)
SubVW/4 13.3ns ± 3% 13.1ns ± 3% ~ (p=0.072 n=10+10)
SubVW/5 14.2ns ± 0% 14.1ns ± 1% -0.63% (p=0.046 n=8+10)
SubVW/10 11.7ns ± 0% 11.7ns ± 0% ~ (all equal)
SubVW/100 47.8ns ± 0% 33.1ns ±19% -30.71% (p=0.000 n=10+10)
SubVW/1000 446ns ± 0% 207ns ± 0% -53.59% (p=0.000 n=10+10)
SubVW/10000 4.33µs ± 1% 2.92µs ± 0% -32.66% (p=0.000 n=10+6)
SubVW/100000 43.4µs ± 0% 29.6µs ± 0% -31.90% (p=0.000 n=10+9)
Server 2:
name old time/op new time/op delta
AddVW/1 5.49ns ± 0% 5.53ns ± 2% ~ (p=1.000 n=9+10)
AddVW/2 5.96ns ± 2% 5.92ns ± 1% -0.69% (p=0.039 n=10+10)
AddVW/3 6.72ns ± 0% 6.73ns ± 0% ~ (p=0.078 n=10+10)
AddVW/4 7.07ns ± 0% 6.75ns ± 2% -4.55% (p=0.000 n=10+10)
AddVW/5 8.14ns ± 0% 8.17ns ± 0% +0.46% (p=0.003 n=8+8)
AddVW/10 10.0ns ± 0% 10.1ns ± 1% +0.70% (p=0.003 n=10+10)
AddVW/100 43.0ns ± 0% 33.5ns ± 0% -22.09% (p=0.000 n=9+9)
AddVW/1000 394ns ± 0% 278ns ± 0% -29.44% (p=0.000 n=10+10)
AddVW/10000 4.18µs ± 0% 3.14µs ± 0% -24.81% (p=0.000 n=8+8)
AddVW/100000 68.3µs ± 3% 62.1µs ± 5% -9.13% (p=0.000 n=10+10)
SubVW/1 5.37ns ± 2% 5.42ns ± 1% ~ (p=0.990 n=10+10)
SubVW/2 5.89ns ± 0% 5.92ns ± 1% +0.58% (p=0.000 n=8+10)
SubVW/3 6.64ns ± 1% 6.82ns ± 3% +2.63% (p=0.000 n=9+10)
SubVW/4 7.17ns ± 0% 6.69ns ± 2% -6.74% (p=0.000 n=10+9)
SubVW/5 8.22ns ± 0% 8.18ns ± 0% -0.46% (p=0.001 n=8+9)
SubVW/10 10.0ns ± 1% 10.1ns ± 1% ~ (p=0.341 n=10+10)
SubVW/100 43.0ns ± 0% 33.5ns ± 0% -22.09% (p=0.000 n=7+10)
SubVW/1000 394ns ± 0% 278ns ± 0% -29.44% (p=0.000 n=10+10)
SubVW/10000 4.18µs ± 0% 3.15µs ± 0% -24.62% (p=0.000 n=9+9)
SubVW/100000 67.7µs ± 4% 62.4µs ± 2% -7.92% (p=0.000 n=10+10)
2. Performance comparison over input vectors of all 1s or all 0s:
Server 1:
name old time/op new time/op delta
AddVWext/1 12.6ns ± 0% 12.0ns ± 0% -4.76% (p=0.000 n=6+10)
AddVWext/2 12.7ns ± 0% 12.4ns ± 1% -2.52% (p=0.000 n=10+10)
AddVWext/3 12.7ns ± 0% 12.4ns ± 0% -2.36% (p=0.000 n=9+7)
AddVWext/4 13.2ns ± 4% 12.7ns ± 0% -3.71% (p=0.001 n=10+9)
AddVWext/5 14.6ns ± 0% 13.9ns ± 0% -4.79% (p=0.000 n=10+8)
AddVWext/10 11.7ns ± 0% 11.7ns ± 0% ~ (all equal)
AddVWext/100 47.8ns ± 0% 47.4ns ± 0% -0.84% (p=0.000 n=10+10)
AddVWext/1000 446ns ± 0% 399ns ± 0% -10.54% (p=0.000 n=10+10)
AddVWext/10000 4.34µs ± 1% 3.90µs ± 0% -10.12% (p=0.000 n=10+10)
AddVWext/100000 43.9µs ± 1% 39.4µs ± 0% -10.18% (p=0.000 n=10+10)
SubVWext/1 12.6ns ± 0% 12.3ns ± 2% -2.70% (p=0.000 n=7+10)
SubVWext/2 12.6ns ± 1% 12.6ns ± 2% ~ (p=0.234 n=10+10)
SubVWext/3 12.7ns ± 0% 12.6ns ± 2% -0.71% (p=0.033 n=10+10)
SubVWext/4 13.4ns ± 0% 13.1ns ± 3% -2.01% (p=0.006 n=8+10)
SubVWext/5 14.2ns ± 0% 14.1ns ± 1% -0.85% (p=0.003 n=10+10)
SubVWext/10 11.7ns ± 0% 11.7ns ± 0% ~ (all equal)
SubVWext/100 47.8ns ± 0% 47.4ns ± 0% -0.84% (p=0.000 n=10+10)
SubVWext/1000 446ns ± 0% 399ns ± 0% -10.54% (p=0.000 n=10+10)
SubVWext/10000 4.33µs ± 1% 3.90µs ± 0% -10.02% (p=0.000 n=10+10)
SubVWext/100000 43.5µs ± 0% 39.5µs ± 1% -9.16% (p=0.000 n=7+10)
Server 2:
name old time/op new time/op delta
AddVWext/1 5.48ns ± 0% 5.43ns ± 1% -0.97% (p=0.000 n=9+9)
AddVWext/2 5.99ns ± 2% 5.93ns ± 1% ~ (p=0.054 n=10+10)
AddVWext/3 6.74ns ± 0% 6.79ns ± 1% +0.80% (p=0.000 n=9+10)
AddVWext/4 7.18ns ± 0% 7.21ns ± 1% +0.36% (p=0.034 n=9+10)
AddVWext/5 7.93ns ± 3% 8.18ns ± 0% +3.18% (p=0.000 n=10+8)
AddVWext/10 10.0ns ± 0% 10.1ns ± 1% +0.60% (p=0.011 n=10+10)
AddVWext/100 43.0ns ± 0% 47.7ns ± 0% +10.93% (p=0.000 n=9+10)
AddVWext/1000 394ns ± 0% 399ns ± 0% +1.27% (p=0.000 n=10+10)
AddVWext/10000 4.18µs ± 0% 4.50µs ± 0% +7.73% (p=0.000 n=9+10)
AddVWext/100000 67.6µs ± 2% 68.4µs ± 3% ~ (p=0.139 n=9+8)
SubVWext/1 5.46ns ± 1% 5.43ns ± 0% -0.55% (p=0.002 n=9+9)
SubVWext/2 5.89ns ± 0% 5.93ns ± 1% +0.68% (p=0.000 n=8+10)
SubVWext/3 6.72ns ± 1% 6.79ns ± 1% +1.07% (p=0.000 n=10+10)
SubVWext/4 6.98ns ± 1% 7.21ns ± 0% +3.25% (p=0.000 n=10+10)
SubVWext/5 8.22ns ± 0% 7.99ns ± 3% -2.83% (p=0.000 n=8+10)
SubVWext/10 10.0ns ± 1% 10.1ns ± 1% ~ (p=0.239 n=10+10)
SubVWext/100 43.0ns ± 0% 47.7ns ± 0% +10.93% (p=0.000 n=8+10)
SubVWext/1000 394ns ± 0% 399ns ± 0% +1.27% (p=0.000 n=10+10)
SubVWext/10000 4.18µs ± 0% 4.51µs ± 0% +7.86% (p=0.000 n=8+8)
SubVWext/100000 68.3µs ± 2% 68.0µs ± 3% ~ (p=0.515 n=10+8)
Change-Id: I134a5194b8a2deaaebbaa2b771baf72846971d58
Reviewed-on: https://go-review.googlesource.com/c/go/+/229739
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Reviewed-by: Robert Griesemer <gri@golang.org>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
|
|
When the z and x parameters of shlVU overlap and z is at a higher
address than x, the input x can be overwritten during the calculation,
because the high words of x overlap the low words of the output z.
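One way this kind of corruption can arise is shown below in pure Go.
This is an illustration of the overlap hazard only, not the arm64
assembly; lshUp and lshDown are hypothetical names, and the sketch
assumes 64-bit words with 0 < s < 64.

```go
package main

import "fmt"

// lshUp walks from the low word to the high word. When z overlaps x at a
// higher address, a store to z[i] overwrites a word of x that has not
// been read yet, so later words are computed from corrupted input.
func lshUp(z, x []uint64, s uint) {
	var carry uint64
	for i := range x {
		w := x[i]
		z[i] = w<<s | carry
		carry = w >> (64 - s)
	}
}

// lshDown walks from the high word to the low word, so every word of x is
// read before a store to the overlapping part of z can reach it.
func lshDown(z, x []uint64, s uint) {
	for i := len(x) - 1; i > 0; i-- {
		z[i] = x[i]<<s | x[i-1]>>(64-s)
	}
	z[0] = x[0] << s
}

func main() {
	a := []uint64{1, 2, 3, 0}
	lshDown(a[1:], a[:3], 4) // z starts one word above x
	fmt.Println(a)           // [1 16 32 48]

	b := []uint64{1, 2, 3, 0}
	lshUp(b[1:], b[:3], 4) // same overlap, wrong direction
	fmt.Println(b)         // [1 16 256 4096]
}
```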
Fixes #31084
Change-Id: I9bb0266a1d7856b8faa9a9b1975d6f57dece0479
Reviewed-on: https://go-review.googlesource.com/c/go/+/169780
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
|
|
Unroll the loop 4 times to reduce load overhead.
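The shape of a 4x unrolled loop, written in Go for illustration. The
change itself is in arm64 assembly; this sketch assumes 64-bit words
and equal-length slices, and mulAddWWW is a helper defined only for the
example. Unrolling Go source this way does not necessarily produce the
same loads as the assembly.

```go
package main

import (
	"fmt"
	"math/bits"
)

// mulAddWWW returns the two-word value x*m + c.
func mulAddWWW(x, m, c uint64) (hi, lo uint64) {
	hi, lo = bits.Mul64(x, m)
	var cc uint64
	lo, cc = bits.Add64(lo, c, 0)
	return hi + cc, lo
}

// mulAddVWW sets z = x*m + a and returns the final carry word.
func mulAddVWW(z, x []uint64, m, a uint64) (c uint64) {
	c = a
	i := 0
	// Main loop: four words per iteration.
	for ; i+4 <= len(z); i += 4 {
		c, z[i] = mulAddWWW(x[i], m, c)
		c, z[i+1] = mulAddWWW(x[i+1], m, c)
		c, z[i+2] = mulAddWWW(x[i+2], m, c)
		c, z[i+3] = mulAddWWW(x[i+3], m, c)
	}
	// Tail: the remaining 0 to 3 words.
	for ; i < len(z); i++ {
		c, z[i] = mulAddWWW(x[i], m, c)
	}
	return c
}

func main() {
	x := []uint64{1, 2, 3, 4, 5}
	z := make([]uint64, len(x))
	c := mulAddVWW(z, x, 10, 7)
	fmt.Println(z, c) // [17 20 30 40 50] 0
}
```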
Benchmarks:
name                old time/op  new time/op  delta
MulAddVWW/1-8       15.9ns ± 0%  11.9ns ± 0%  -24.92%  (p=0.000 n=8+8)
MulAddVWW/2-8       16.1ns ± 0%  13.9ns ± 1%  -13.82%  (p=0.000 n=8+8)
MulAddVWW/3-8       18.9ns ± 0%  17.3ns ± 0%   -8.47%  (p=0.000 n=8+8)
MulAddVWW/4-8       21.7ns ± 0%  19.5ns ± 0%  -10.14%  (p=0.000 n=8+8)
MulAddVWW/5-8       25.1ns ± 0%  22.5ns ± 0%  -10.27%  (p=0.000 n=8+8)
MulAddVWW/10-8      41.6ns ± 0%  40.0ns ± 0%   -3.79%  (p=0.000 n=8+8)
MulAddVWW/100-8      368ns ± 0%   363ns ± 0%   -1.36%  (p=0.000 n=8+8)
MulAddVWW/1000-8    3.52µs ± 0%  3.52µs ± 0%   -0.14%  (p=0.000 n=8+8)
MulAddVWW/10000-8   35.1µs ± 0%  35.1µs ± 0%   -0.01%  (p=0.000 n=7+6)
MulAddVWW/100000-8   351µs ± 0%   351µs ± 0%   +0.15%  (p=0.038 n=8+8)
Change-Id: I052a4db286ac6e4f3293289c7e9a82027da0405e
Reviewed-on: https://go-review.googlesource.com/c/go/+/155780
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
|
|
This addresses the failures we have seen in #31084. The correct
fix is to find the actual bug in the assembly code.
Updates #31084.
Change-Id: I437780c53d0c4423d742e2e3b650b899ce845372
Reviewed-on: https://go-review.googlesource.com/c/go/+/169721
Run-TryBot: Robert Griesemer <gri@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
|
|
This CL implements shlVU and shrVU with the arm64 instructions "LDP" and "STP" to reduce load cost.
It also removes unnecessary checks on the shift count for better performance.
Benchmarks:
name old time/op new time/op delta
AddVV/1-8 21.6ns ± 1% 21.6ns ± 1% ~ (p=0.683 n=5+5)
AddVV/2-8 13.5ns ± 0% 13.5ns ± 0% ~ (all equal)
AddVV/3-8 15.5ns ± 0% 15.5ns ± 0% ~ (all equal)
AddVV/4-8 17.5ns ± 0% 17.5ns ± 0% ~ (all equal)
AddVV/5-8 19.5ns ± 0% 19.5ns ± 0% ~ (all equal)
AddVV/10-8 29.5ns ± 0% 29.5ns ± 0% ~ (all equal)
AddVV/100-8 217ns ± 0% 217ns ± 0% ~ (all equal)
AddVV/1000-8 2.02µs ± 0% 2.03µs ± 0% +0.73% (p=0.008 n=5+5)
AddVV/10000-8 20.5µs ± 0% 20.5µs ± 0% -0.01% (p=0.008 n=5+5)
AddVV/100000-8 246µs ± 5% 250µs ± 4% ~ (p=0.548 n=5+5)
AddVW/1-8 9.26ns ± 0% 9.32ns ± 0% +0.65% (p=0.016 n=4+5)
AddVW/2-8 19.8ns ± 3% 19.8ns ± 0% ~ (p=0.143 n=5+5)
AddVW/3-8 11.5ns ± 0% 11.5ns ± 0% ~ (all equal)
AddVW/4-8 13.0ns ± 0% 13.0ns ± 0% ~ (all equal)
AddVW/5-8 14.5ns ± 0% 14.5ns ± 0% ~ (all equal)
AddVW/10-8 22.0ns ± 0% 22.0ns ± 0% ~ (all equal)
AddVW/100-8 167ns ± 0% 166ns ± 0% -0.60% (p=0.000 n=5+4)
AddVW/1000-8 1.52µs ± 0% 1.52µs ± 0% ~ (all equal)
AddVW/10000-8 15.1µs ± 0% 15.1µs ± 0% +0.01% (p=0.008 n=5+5)
AddVW/100000-8 163µs ± 4% 153µs ± 3% -5.97% (p=0.016 n=5+5)
AddMulVVW/1-8 32.4ns ± 1% 33.0ns ± 1% +1.73% (p=0.040 n=5+5)
AddMulVVW/2-8 56.4ns ± 2% 55.9ns ± 1% ~ (p=0.135 n=5+5)
AddMulVVW/3-8 85.4ns ± 1% 85.1ns ± 0% ~ (p=0.079 n=5+5)
AddMulVVW/4-8 129ns ± 1% 129ns ± 0% ~ (p=0.397 n=5+5)
AddMulVVW/5-8 148ns ± 0% 148ns ± 0% ~ (all equal)
AddMulVVW/10-8 270ns ± 0% 268ns ± 0% -0.74% (p=0.029 n=4+4)
AddMulVVW/100-8 2.75µs ± 0% 2.75µs ± 0% -0.09% (p=0.008 n=5+5)
AddMulVVW/1000-8 26.0µs ± 0% 26.0µs ± 0% -0.06% (p=0.024 n=5+5)
AddMulVVW/10000-8 312µs ± 0% 312µs ± 0% -0.09% (p=0.008 n=5+5)
AddMulVVW/100000-8 2.89ms ± 0% 2.89ms ± 0% +0.14% (p=0.016 n=5+5)
DecimalConversion-8 315µs ± 1% 312µs ± 0% ~ (p=0.095 n=5+5)
FloatString/100-8 2.56µs ± 1% 2.52µs ± 1% -1.31% (p=0.016 n=5+5)
FloatString/1000-8 58.6µs ± 0% 58.2µs ± 0% -0.75% (p=0.008 n=5+5)
FloatString/10000-8 4.59ms ± 0% 4.59ms ± 0% ~ (p=0.056 n=5+5)
FloatString/100000-8 446ms ± 0% 446ms ± 0% -0.04% (p=0.008 n=5+5)
FloatAdd/10-8 184ns ± 0% 178ns ± 0% -3.48% (p=0.008 n=5+5)
FloatAdd/100-8 189ns ± 3% 178ns ± 2% -6.02% (p=0.008 n=5+5)
FloatAdd/1000-8 371ns ± 0% 267ns ± 0% -27.99% (p=0.000 n=5+4)
FloatAdd/10000-8 1.87µs ± 0% 1.03µs ± 0% -44.74% (p=0.008 n=5+5)
FloatAdd/100000-8 17.1µs ± 0% 8.8µs ± 0% -48.71% (p=0.016 n=5+4)
FloatSub/10-8 148ns ± 0% 146ns ± 0% -1.35% (p=0.000 n=5+4)
FloatSub/100-8 148ns ± 0% 140ns ± 0% -5.41% (p=0.008 n=5+5)
FloatSub/1000-8 242ns ± 0% 191ns ± 0% -21.24% (p=0.008 n=5+5)
FloatSub/10000-8 1.07µs ± 0% 0.64µs ± 1% -39.89% (p=0.016 n=4+5)
FloatSub/100000-8 9.48µs ± 0% 5.32µs ± 0% -43.87% (p=0.008 n=5+5)
ParseFloatSmallExp-8 29.3µs ± 3% 28.6µs ± 1% ~ (p=0.310 n=5+5)
ParseFloatLargeExp-8 125µs ± 1% 123µs ± 0% -1.99% (p=0.008 n=5+5)
GCD10x10/WithoutXY-8 278ns ± 4% 289ns ± 5% +3.96% (p=0.040 n=5+5)
GCD10x10/WithXY-8 2.12µs ± 2% 2.15µs ± 2% ~ (p=0.095 n=5+5)
GCD10x100/WithoutXY-8 615ns ± 1% 629ns ± 5% ~ (p=0.135 n=5+5)
GCD10x100/WithXY-8 3.42µs ± 1% 3.53µs ± 2% +3.38% (p=0.008 n=5+5)
GCD10x1000/WithoutXY-8 1.39µs ± 1% 1.38µs ± 1% ~ (p=0.460 n=5+5)
GCD10x1000/WithXY-8 7.47µs ± 2% 7.49µs ± 3% ~ (p=1.000 n=5+5)
GCD10x10000/WithoutXY-8 8.71µs ± 1% 8.71µs ± 0% ~ (p=0.841 n=5+5)
GCD10x10000/WithXY-8 28.4µs ± 2% 27.2µs ± 2% -4.24% (p=0.008 n=5+5)
GCD10x100000/WithoutXY-8 78.9µs ± 1% 79.1µs ± 0% ~ (p=0.222 n=5+5)
GCD10x100000/WithXY-8 240µs ± 1% 228µs ± 1% -4.98% (p=0.008 n=5+5)
GCD100x100/WithoutXY-8 1.87µs ± 2% 1.89µs ± 1% ~ (p=0.095 n=5+5)
GCD100x100/WithXY-8 26.6µs ± 1% 26.3µs ± 0% -1.14% (p=0.032 n=5+5)
GCD100x1000/WithoutXY-8 4.44µs ± 2% 4.47µs ± 2% ~ (p=0.444 n=5+5)
GCD100x1000/WithXY-8 36.7µs ± 1% 36.0µs ± 1% -1.96% (p=0.008 n=5+5)
GCD100x10000/WithoutXY-8 22.8µs ± 1% 22.3µs ± 1% -2.52% (p=0.008 n=5+5)
GCD100x10000/WithXY-8 145µs ± 1% 142µs ± 0% -1.88% (p=0.008 n=5+5)
GCD100x100000/WithoutXY-8 198µs ± 1% 190µs ± 0% -4.06% (p=0.008 n=5+5)
GCD100x100000/WithXY-8 1.11ms ± 0% 1.09ms ± 0% -1.87% (p=0.008 n=5+5)
GCD1000x1000/WithoutXY-8 25.4µs ± 1% 25.0µs ± 1% -1.34% (p=0.008 n=5+5)
GCD1000x1000/WithXY-8 515µs ± 1% 485µs ± 0% -5.85% (p=0.008 n=5+5)
GCD1000x10000/WithoutXY-8 57.3µs ± 1% 56.2µs ± 1% -1.95% (p=0.008 n=5+5)
GCD1000x10000/WithXY-8 1.21ms ± 0% 1.18ms ± 0% -2.65% (p=0.008 n=5+5)
GCD1000x100000/WithoutXY-8 358µs ± 0% 352µs ± 1% -1.71% (p=0.008 n=5+5)
GCD1000x100000/WithXY-8 8.72ms ± 0% 8.66ms ± 0% -0.71% (p=0.008 n=5+5)
GCD10000x10000/WithoutXY-8 690µs ± 0% 687µs ± 1% ~ (p=0.095 n=5+5)
GCD10000x10000/WithXY-8 16.0ms ± 0% 12.5ms ± 0% -22.01% (p=0.008 n=5+5)
GCD10000x100000/WithoutXY-8 2.09ms ± 0% 2.07ms ± 0% -0.58% (p=0.008 n=5+5)
GCD10000x100000/WithXY-8 86.8ms ± 0% 83.4ms ± 0% -3.95% (p=0.008 n=5+5)
GCD100000x100000/WithoutXY-8 51.2ms ± 0% 51.2ms ± 0% ~ (p=0.548 n=5+5)
GCD100000x100000/WithXY-8 1.25s ± 0% 0.89s ± 0% -28.98% (p=0.008 n=5+5)
Hilbert-8 2.46ms ± 2% 2.53ms ± 1% +2.89% (p=0.032 n=5+5)
Binomial-8 5.15µs ± 4% 4.92µs ± 1% -4.43% (p=0.032 n=5+5)
QuoRem-8 7.10µs ± 0% 7.05µs ± 0% -0.59% (p=0.008 n=5+5)
Exp-8 161ms ± 0% 161ms ± 0% -0.24% (p=0.008 n=5+5)
Exp2-8 161ms ± 0% 161ms ± 0% -0.30% (p=0.016 n=4+5)
Bitset-8 40.4ns ± 0% 40.3ns ± 0% ~ (p=0.159 n=5+5)
BitsetNeg-8 158ns ± 4% 155ns ± 2% ~ (p=0.183 n=5+5)
BitsetOrig-8 374ns ± 0% 383ns ± 1% +2.35% (p=0.008 n=5+5)
BitsetNegOrig-8 620ns ± 1% 663ns ± 2% +7.00% (p=0.008 n=5+5)
ModSqrt225_Tonelli-8 7.26ms ± 0% 7.27ms ± 0% ~ (p=0.841 n=5+5)
ModSqrt224_3Mod4-8 2.24ms ± 0% 2.24ms ± 0% ~ (p=0.690 n=5+5)
ModSqrt5430_Tonelli-8 62.3s ± 0% 62.4s ± 0% +0.15% (p=0.008 n=5+5)
ModSqrt5430_3Mod4-8 20.8s ± 0% 20.8s ± 0% ~ (p=0.151 n=5+5)
Sqrt-8 101µs ± 0% 97µs ± 0% -3.99% (p=0.008 n=5+5)
IntSqr/1-8 32.7ns ± 1% 32.5ns ± 1% ~ (p=0.325 n=5+5)
IntSqr/2-8 161ns ± 4% 160ns ± 4% ~ (p=0.659 n=5+5)
IntSqr/3-8 296ns ± 7% 297ns ± 6% ~ (p=0.841 n=5+5)
IntSqr/5-8 752ns ± 7% 755ns ± 6% ~ (p=0.889 n=5+5)
IntSqr/8-8 1.91µs ± 3% 1.90µs ± 3% ~ (p=0.746 n=5+5)
IntSqr/10-8 2.99µs ± 4% 3.00µs ± 4% ~ (p=0.516 n=5+5)
IntSqr/20-8 6.29µs ± 2% 6.19µs ± 2% ~ (p=0.151 n=5+5)
IntSqr/30-8 14.0µs ± 1% 13.8µs ± 2% ~ (p=0.056 n=5+5)
IntSqr/50-8 38.1µs ± 3% 37.9µs ± 3% ~ (p=0.548 n=5+5)
IntSqr/80-8 95.1µs ± 1% 94.7µs ± 1% ~ (p=0.310 n=5+5)
IntSqr/100-8 148µs ± 1% 148µs ± 1% ~ (p=0.548 n=5+5)
IntSqr/200-8 587µs ± 1% 587µs ± 1% ~ (p=1.000 n=5+5)
IntSqr/300-8 1.31ms ± 1% 1.32ms ± 1% ~ (p=0.151 n=5+5)
IntSqr/500-8 2.48ms ± 0% 2.49ms ± 0% ~ (p=0.310 n=5+5)
IntSqr/800-8 4.68ms ± 0% 4.67ms ± 0% ~ (p=0.548 n=5+5)
IntSqr/1000-8 7.57ms ± 0% 7.56ms ± 0% ~ (p=0.421 n=5+5)
Mul-8 311ms ± 0% 311ms ± 0% ~ (p=0.151 n=5+5)
Exp3Power/0x10-8 584ns ± 2% 573ns ± 1% ~ (p=0.190 n=5+5)
Exp3Power/0x40-8 646ns ± 2% 649ns ± 1% ~ (p=0.690 n=5+5)
Exp3Power/0x100-8 1.42µs ± 2% 1.45µs ± 1% +2.03% (p=0.032 n=5+5)
Exp3Power/0x400-8 8.28µs ± 1% 8.39µs ± 0% +1.33% (p=0.008 n=5+5)
Exp3Power/0x1000-8 60.1µs ± 0% 59.8µs ± 0% -0.44% (p=0.008 n=5+5)
Exp3Power/0x4000-8 818µs ± 0% 816µs ± 0% -0.23% (p=0.008 n=5+5)
Exp3Power/0x10000-8 7.79ms ± 0% 7.78ms ± 0% ~ (p=0.690 n=5+5)
Exp3Power/0x40000-8 73.4ms ± 0% 73.3ms ± 0% ~ (p=0.151 n=5+5)
Exp3Power/0x100000-8 665ms ± 0% 664ms ± 0% -0.16% (p=0.016 n=4+5)
Exp3Power/0x400000-8 5.99s ± 0% 5.97s ± 0% -0.24% (p=0.008 n=5+5)
Fibo-8 116ms ± 0% 117ms ± 0% +0.42% (p=0.008 n=5+5)
NatSqr/1-8 113ns ± 2% 112ns ± 1% ~ (p=0.190 n=5+5)
NatSqr/2-8 249ns ± 2% 250ns ± 2% ~ (p=0.365 n=5+5)
NatSqr/3-8 379ns ± 1% 381ns ± 2% ~ (p=0.127 n=5+5)
NatSqr/5-8 838ns ± 3% 841ns ± 5% ~ (p=0.754 n=5+5)
NatSqr/8-8 1.97µs ± 3% 1.97µs ± 4% ~ (p=1.000 n=5+5)
NatSqr/10-8 3.04µs ± 4% 3.04µs ± 4% ~ (p=1.000 n=5+5)
NatSqr/20-8 6.49µs ± 3% 6.50µs ± 2% ~ (p=0.841 n=5+5)
NatSqr/30-8 14.3µs ± 2% 14.2µs ± 2% ~ (p=0.548 n=5+5)
NatSqr/50-8 38.5µs ± 3% 38.3µs ± 3% ~ (p=0.421 n=5+5)
NatSqr/80-8 96.3µs ± 1% 96.1µs ± 1% ~ (p=0.421 n=5+5)
NatSqr/100-8 149µs ± 1% 148µs ± 1% ~ (p=0.310 n=5+5)
NatSqr/200-8 591µs ± 1% 592µs ± 1% ~ (p=0.690 n=5+5)
NatSqr/300-8 1.31ms ± 1% 1.32ms ± 0% ~ (p=0.190 n=5+4)
NatSqr/500-8 2.49ms ± 0% 2.49ms ± 0% ~ (p=0.095 n=5+5)
NatSqr/800-8 4.70ms ± 0% 4.69ms ± 0% ~ (p=0.222 n=5+5)
NatSqr/1000-8 7.60ms ± 0% 7.58ms ± 0% ~ (p=0.222 n=5+5)
ScanPi-8 326µs ± 0% 327µs ± 1% ~ (p=0.222 n=5+5)
StringPiParallel-8 71.4µs ± 5% 67.7µs ± 4% ~ (p=0.095 n=5+5)
Scan/10/Base2-8 1.09µs ± 0% 1.10µs ± 1% ~ (p=0.810 n=5+5)
Scan/100/Base2-8 7.79µs ± 0% 7.83µs ± 0% +0.53% (p=0.008 n=5+5)
Scan/1000/Base2-8 78.9µs ± 0% 79.0µs ± 0% ~ (p=0.151 n=5+5)
Scan/10000/Base2-8 1.22ms ± 0% 1.23ms ± 1% ~ (p=0.690 n=5+5)
Scan/100000/Base2-8 55.1ms ± 0% 55.1ms ± 0% +0.10% (p=0.008 n=5+5)
Scan/10/Base8-8 512ns ± 1% 534ns ± 1% +4.34% (p=0.008 n=5+5)
Scan/100/Base8-8 2.90µs ± 1% 2.92µs ± 0% +0.67% (p=0.024 n=5+5)
Scan/1000/Base8-8 31.0µs ± 0% 31.1µs ± 0% +0.27% (p=0.008 n=5+5)
Scan/10000/Base8-8 741µs ± 0% 744µs ± 1% ~ (p=0.310 n=5+5)
Scan/100000/Base8-8 50.5ms ± 0% 50.7ms ± 0% +0.23% (p=0.016 n=5+4)
Scan/10/Base10-8 485ns ± 0% 510ns ± 1% +5.15% (p=0.008 n=5+5)
Scan/100/Base10-8 2.68µs ± 0% 2.70µs ± 0% +0.84% (p=0.008 n=5+5)
Scan/1000/Base10-8 28.7µs ± 0% 28.8µs ± 0% +0.34% (p=0.008 n=5+5)
Scan/10000/Base10-8 717µs ± 0% 720µs ± 1% ~ (p=0.238 n=5+5)
Scan/100000/Base10-8 50.3ms ± 0% 50.3ms ± 0% +0.02% (p=0.016 n=4+5)
Scan/10/Base16-8 439ns ± 0% 461ns ± 1% +5.06% (p=0.008 n=5+5)
Scan/100/Base16-8 2.48µs ± 0% 2.49µs ± 0% +0.59% (p=0.024 n=5+5)
Scan/1000/Base16-8 27.2µs ± 0% 27.3µs ± 0% ~ (p=0.063 n=5+5)
Scan/10000/Base16-8 722µs ± 0% 725µs ± 1% ~ (p=0.421 n=5+5)
Scan/100000/Base16-8 52.7ms ± 0% 52.7ms ± 0% ~ (p=0.686 n=4+4)
String/10/Base2-8 248ns ± 1% 248ns ± 1% ~ (p=0.802 n=5+5)
String/100/Base2-8 1.51µs ± 0% 1.51µs ± 0% -0.54% (p=0.024 n=5+5)
String/1000/Base2-8 13.6µs ± 0% 13.6µs ± 0% ~ (p=0.548 n=5+5)
String/10000/Base2-8 135µs ± 1% 135µs ± 2% ~ (p=0.421 n=5+5)
String/100000/Base2-8 1.32ms ± 1% 1.33ms ± 1% ~ (p=0.310 n=5+5)
String/10/Base8-8 169ns ± 0% 170ns ± 0% ~ (p=0.079 n=5+5)
String/100/Base8-8 635ns ± 1% 633ns ± 1% ~ (p=0.595 n=5+5)
String/1000/Base8-8 5.33µs ± 0% 5.30µs ± 0% ~ (p=0.063 n=5+5)
String/10000/Base8-8 50.7µs ± 1% 50.7µs ± 1% ~ (p=1.000 n=5+5)
String/100000/Base8-8 499µs ± 1% 500µs ± 1% ~ (p=1.000 n=5+5)
String/10/Base10-8 517ns ± 1% 512ns ± 1% -1.01% (p=0.032 n=5+5)
String/100/Base10-8 1.97µs ± 0% 2.01µs ± 1% +2.13% (p=0.008 n=5+5)
String/1000/Base10-8 12.6µs ± 1% 12.1µs ± 1% -4.16% (p=0.008 n=5+5)
String/10000/Base10-8 57.9µs ± 1% 54.8µs ± 1% -5.46% (p=0.008 n=5+5)
String/100000/Base10-8 25.6ms ± 0% 25.6ms ± 0% -0.12% (p=0.008 n=5+5)
String/10/Base16-8 149ns ± 0% 149ns ± 1% ~ (p=1.000 n=5+5)
String/100/Base16-8 514ns ± 0% 514ns ± 1% ~ (p=0.825 n=5+5)
String/1000/Base16-8 4.01µs ± 0% 4.01µs ± 0% ~ (p=0.595 n=5+5)
String/10000/Base16-8 37.7µs ± 0% 37.8µs ± 1% ~ (p=0.222 n=5+5)
String/100000/Base16-8 373µs ± 1% 372µs ± 0% ~ (p=1.000 n=5+5)
LeafSize/0-8 6.64ms ± 0% 6.66ms ± 0% +0.32% (p=0.008 n=5+5)
LeafSize/1-8 74.0µs ± 1% 71.2µs ± 1% -3.75% (p=0.008 n=5+5)
LeafSize/2-8 74.1µs ± 0% 70.7µs ± 1% -4.53% (p=0.008 n=5+5)
LeafSize/3-8 379µs ± 0% 374µs ± 0% -1.25% (p=0.008 n=5+5)
LeafSize/4-8 72.7µs ± 0% 69.2µs ± 0% -4.79% (p=0.008 n=5+5)
LeafSize/5-8 471µs ± 0% 466µs ± 0% -1.05% (p=0.008 n=5+5)
LeafSize/6-8 377µs ± 0% 373µs ± 0% -1.16% (p=0.008 n=5+5)
LeafSize/7-8 245µs ± 0% 241µs ± 0% -1.65% (p=0.008 n=5+5)
LeafSize/8-8 73.1µs ± 0% 69.4µs ± 0% -5.10% (p=0.008 n=5+5)
LeafSize/9-8 538µs ± 0% 532µs ± 0% -1.01% (p=0.008 n=5+5)
LeafSize/10-8 472µs ± 0% 467µs ± 0% -1.07% (p=0.008 n=5+5)
LeafSize/11-8 460µs ± 0% 454µs ± 0% -1.22% (p=0.008 n=5+5)
LeafSize/12-8 378µs ± 0% 373µs ± 0% -1.34% (p=0.008 n=5+5)
LeafSize/13-8 344µs ± 0% 338µs ± 0% -1.61% (p=0.008 n=5+5)
LeafSize/14-8 247µs ± 0% 243µs ± 0% -1.62% (p=0.008 n=5+5)
LeafSize/15-8 169µs ± 0% 165µs ± 0% -2.71% (p=0.008 n=5+5)
LeafSize/16-8 73.3µs ± 1% 69.5µs ± 0% -5.11% (p=0.008 n=5+5)
LeafSize/32-8 82.7µs ± 0% 79.2µs ± 0% -4.24% (p=0.008 n=5+5)
LeafSize/64-8 135µs ± 0% 132µs ± 0% -2.20% (p=0.008 n=5+5)
ProbablyPrime/n=0-8 44.2ms ± 0% 43.9ms ± 0% -0.69% (p=0.008 n=5+5)
ProbablyPrime/n=1-8 64.8ms ± 0% 64.4ms ± 0% -0.60% (p=0.008 n=5+5)
ProbablyPrime/n=5-8 147ms ± 0% 147ms ± 0% -0.34% (p=0.008 n=5+5)
ProbablyPrime/n=10-8 250ms ± 0% 249ms ± 0% -0.29% (p=0.008 n=5+5)
ProbablyPrime/n=20-8 456ms ± 0% 455ms ± 0% -0.29% (p=0.008 n=5+5)
ProbablyPrime/Lucas-8 23.6ms ± 0% 23.2ms ± 0% -1.44% (p=0.008 n=5+5)
ProbablyPrime/MillerRabinBase2-8 20.6ms ± 0% 20.6ms ± 0% -0.31% (p=0.008 n=5+5)
FloatSqrt/64-8 2.27µs ± 1% 2.11µs ± 1% -7.02% (p=0.008 n=5+5)
FloatSqrt/128-8 4.93µs ± 1% 4.40µs ± 1% -10.73% (p=0.008 n=5+5)
FloatSqrt/256-8 13.6µs ± 0% 6.6µs ± 1% -51.40% (p=0.008 n=5+5)
FloatSqrt/1000-8 69.8µs ± 0% 31.2µs ± 0% -55.27% (p=0.008 n=5+5)
FloatSqrt/10000-8 1.91ms ± 0% 0.59ms ± 0% -69.17% (p=0.008 n=5+5)
FloatSqrt/100000-8 55.4ms ± 0% 17.8ms ± 0% -67.79% (p=0.008 n=5+5)
FloatSqrt/1000000-8 4.56s ± 0% 1.52s ± 0% -66.59% (p=0.008 n=5+5)
Change-Id: Icce52c69668f564490c69b908338b21a2288e116
Reviewed-on: https://go-review.googlesource.com/79355
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
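For readers unfamiliar with shlVU, the operation this CL optimizes can be sketched in portable Go. This is an assumption-laden illustration rather than the code from the CL: it uses `uint64` words instead of `big.Word`, requires `s < 64` and `len(x) >= len(z)`, and says nothing about how the assembly schedules its "LDP"/"STP" pairs.

```go
package main

// shlVU sets z = x << s for 0 <= s < 64 and returns the bits shifted
// out of the most significant word. It assumes len(x) >= len(z).
// Words are processed from the top down, so z and x may be the same slice.
func shlVU(z, x []uint64, s uint) (c uint64) {
	if len(z) == 0 {
		return 0
	}
	if s == 0 {
		copy(z, x)
		return 0
	}
	t := 64 - s // number of bits each word takes from its lower neighbour
	c = x[len(z)-1] >> t
	for i := len(z) - 1; i > 0; i-- {
		z[i] = x[i]<<s | x[i-1]>>t
	}
	z[0] = x[0] << s
	return c
}
```

Each output word combines two adjacent input words, which is presumably why loading words in pairs helps here. Higher-level benchmarks that shift multi-word mantissas, such as FloatAdd and FloatSqrt in the table above, are the ones showing the largest deltas.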
|
|
The biggest hot spot in the existing implementation is its "load" operations, which lead to poor performance.
By unrolling the loop 4 times and 2 times and using the "LDP" and "STP" instructions,
this CL reduces the "load" cost and improves performance.
Benchmarks:
name old time/op new time/op delta
AddVV/1-8 21.5ns ± 0% 21.5ns ± 0% ~ (all equal)
AddVV/2-8 13.5ns ± 0% 13.5ns ± 0% ~ (all equal)
AddVV/3-8 15.5ns ± 0% 15.5ns ± 0% ~ (all equal)
AddVV/4-8 17.5ns ± 0% 17.5ns ± 0% ~ (all equal)
AddVV/5-8 19.5ns ± 0% 19.5ns ± 0% ~ (all equal)
AddVV/10-8 29.5ns ± 0% 29.5ns ± 0% ~ (all equal)
AddVV/100-8 217ns ± 0% 217ns ± 0% ~ (all equal)
AddVV/1000-8 2.02µs ± 0% 2.02µs ± 0% ~ (all equal)
AddVV/10000-8 20.3µs ± 0% 20.3µs ± 0% ~ (p=0.603 n=5+5)
AddVV/100000-8 223µs ± 7% 228µs ± 8% ~ (p=0.548 n=5+5)
AddVW/1-8 9.32ns ± 0% 9.26ns ± 0% -0.64% (p=0.008 n=5+5)
AddVW/2-8 19.8ns ± 3% 10.5ns ± 0% -46.92% (p=0.008 n=5+5)
AddVW/3-8 11.5ns ± 0% 11.0ns ± 0% -4.35% (p=0.008 n=5+5)
AddVW/4-8 13.0ns ± 0% 12.0ns ± 0% -7.69% (p=0.008 n=5+5)
AddVW/5-8 14.5ns ± 0% 12.5ns ± 0% -13.79% (p=0.008 n=5+5)
AddVW/10-8 22.0ns ± 0% 15.5ns ± 0% -29.55% (p=0.008 n=5+5)
AddVW/100-8 167ns ± 0% 81ns ± 0% -51.44% (p=0.008 n=5+5)
AddVW/1000-8 1.52µs ± 0% 0.64µs ± 0% -57.58% (p=0.008 n=5+5)
AddVW/10000-8 15.1µs ± 0% 7.2µs ± 0% -52.55% (p=0.008 n=5+5)
AddVW/100000-8 150µs ± 0% 71µs ± 0% -52.95% (p=0.008 n=5+5)
SubVW/1-8 9.32ns ± 0% 9.26ns ± 0% -0.64% (p=0.008 n=5+5)
SubVW/2-8 19.7ns ± 2% 10.5ns ± 0% -46.70% (p=0.008 n=5+5)
SubVW/3-8 11.5ns ± 0% 11.0ns ± 0% -4.35% (p=0.008 n=5+5)
SubVW/4-8 13.0ns ± 0% 12.0ns ± 0% -7.69% (p=0.008 n=5+5)
SubVW/5-8 14.5ns ± 0% 12.5ns ± 0% -13.79% (p=0.008 n=5+5)
SubVW/10-8 22.0ns ± 0% 15.5ns ± 0% -29.55% (p=0.008 n=5+5)
SubVW/100-8 167ns ± 0% 81ns ± 0% -51.44% (p=0.008 n=5+5)
SubVW/1000-8 1.52µs ± 0% 0.64µs ± 0% -57.58% (p=0.008 n=5+5)
SubVW/10000-8 15.1µs ± 0% 7.2µs ± 0% -52.49% (p=0.008 n=5+5)
SubVW/100000-8 150µs ± 0% 71µs ± 0% -52.91% (p=0.008 n=5+5)
AddMulVVW/1-8 32.4ns ± 1% 32.6ns ± 1% ~ (p=0.119 n=5+5)
AddMulVVW/2-8 57.0ns ± 0% 57.0ns ± 0% ~ (p=0.643 n=5+5)
AddMulVVW/3-8 90.8ns ± 0% 90.7ns ± 0% ~ (p=0.524 n=5+5)
AddMulVVW/4-8 118ns ± 0% 118ns ± 1% ~ (p=1.000 n=4+5)
AddMulVVW/5-8 144ns ± 1% 144ns ± 0% ~ (p=0.794 n=5+4)
AddMulVVW/10-8 294ns ± 1% 296ns ± 0% +0.48% (p=0.040 n=5+5)
AddMulVVW/100-8 2.73µs ± 0% 2.73µs ± 0% ~ (p=0.278 n=5+5)
AddMulVVW/1000-8 26.0µs ± 0% 26.5µs ± 0% +2.14% (p=0.008 n=5+5)
AddMulVVW/10000-8 297µs ± 0% 297µs ± 0% +0.24% (p=0.008 n=5+5)
AddMulVVW/100000-8 3.15ms ± 1% 3.13ms ± 0% ~ (p=0.690 n=5+5)
DecimalConversion-8 311µs ± 2% 309µs ± 2% ~ (p=0.310 n=5+5)
FloatString/100-8 2.55µs ± 2% 2.54µs ± 2% ~ (p=1.000 n=5+5)
FloatString/1000-8 58.1µs ± 0% 58.1µs ± 0% ~ (p=0.151 n=5+5)
FloatString/10000-8 4.59ms ± 0% 4.59ms ± 0% ~ (p=0.151 n=5+5)
FloatString/100000-8 446ms ± 0% 446ms ± 0% +0.01% (p=0.016 n=5+5)
FloatAdd/10-8 183ns ± 0% 183ns ± 0% ~ (p=0.333 n=4+5)
FloatAdd/100-8 187ns ± 1% 192ns ± 2% ~ (p=0.056 n=5+5)
FloatAdd/1000-8 369ns ± 0% 371ns ± 0% +0.54% (p=0.016 n=4+5)
FloatAdd/10000-8 1.88µs ± 0% 1.88µs ± 0% -0.14% (p=0.000 n=4+5)
FloatAdd/100000-8 17.2µs ± 0% 17.1µs ± 0% -0.37% (p=0.008 n=5+5)
FloatSub/10-8 147ns ± 0% 147ns ± 0% ~ (all equal)
FloatSub/100-8 145ns ± 0% 146ns ± 0% ~ (p=0.238 n=5+4)
FloatSub/1000-8 241ns ± 0% 241ns ± 0% ~ (p=0.333 n=5+4)
FloatSub/10000-8 1.06µs ± 0% 1.06µs ± 0% ~ (p=0.444 n=5+5)
FloatSub/100000-8 9.50µs ± 0% 9.48µs ± 0% -0.14% (p=0.008 n=5+5)
ParseFloatSmallExp-8 28.4µs ± 2% 28.5µs ± 1% ~ (p=0.690 n=5+5)
ParseFloatLargeExp-8 125µs ± 1% 124µs ± 1% ~ (p=0.095 n=5+5)
GCD10x10/WithoutXY-8 277ns ± 2% 278ns ± 3% ~ (p=0.937 n=5+5)
GCD10x10/WithXY-8 2.08µs ± 3% 2.15µs ± 3% ~ (p=0.056 n=5+5)
GCD10x100/WithoutXY-8 592ns ± 3% 613ns ± 4% ~ (p=0.056 n=5+5)
GCD10x100/WithXY-8 3.40µs ± 2% 3.42µs ± 4% ~ (p=0.841 n=5+5)
GCD10x1000/WithoutXY-8 1.37µs ± 2% 1.35µs ± 3% ~ (p=0.460 n=5+5)
GCD10x1000/WithXY-8 7.34µs ± 2% 7.33µs ± 4% ~ (p=0.841 n=5+5)
GCD10x10000/WithoutXY-8 8.52µs ± 0% 8.51µs ± 1% ~ (p=0.421 n=5+5)
GCD10x10000/WithXY-8 27.5µs ± 2% 27.2µs ± 1% ~ (p=0.151 n=5+5)
GCD10x100000/WithoutXY-8 78.3µs ± 1% 78.5µs ± 1% ~ (p=0.690 n=5+5)
GCD10x100000/WithXY-8 231µs ± 0% 229µs ± 1% -1.11% (p=0.016 n=5+5)
GCD100x100/WithoutXY-8 1.86µs ± 2% 1.86µs ± 2% ~ (p=0.881 n=5+5)
GCD100x100/WithXY-8 27.1µs ± 2% 27.2µs ± 1% ~ (p=0.421 n=5+5)
GCD100x1000/WithoutXY-8 4.44µs ± 2% 4.41µs ± 1% ~ (p=0.310 n=5+5)
GCD100x1000/WithXY-8 36.3µs ± 1% 36.2µs ± 1% ~ (p=0.310 n=5+5)
GCD100x10000/WithoutXY-8 22.6µs ± 2% 22.5µs ± 1% ~ (p=0.690 n=5+5)
GCD100x10000/WithXY-8 145µs ± 1% 145µs ± 1% ~ (p=1.000 n=5+5)
GCD100x100000/WithoutXY-8 195µs ± 0% 196µs ± 1% ~ (p=0.548 n=5+5)
GCD100x100000/WithXY-8 1.10ms ± 0% 1.10ms ± 0% -0.30% (p=0.016 n=5+5)
GCD1000x1000/WithoutXY-8 25.0µs ± 1% 25.2µs ± 2% ~ (p=0.222 n=5+5)
GCD1000x1000/WithXY-8 520µs ± 0% 520µs ± 1% ~ (p=0.151 n=5+5)
GCD1000x10000/WithoutXY-8 57.0µs ± 1% 56.9µs ± 1% ~ (p=0.690 n=5+5)
GCD1000x10000/WithXY-8 1.21ms ± 0% 1.21ms ± 1% ~ (p=0.881 n=5+5)
GCD1000x100000/WithoutXY-8 358µs ± 0% 359µs ± 1% ~ (p=0.548 n=5+5)
GCD1000x100000/WithXY-8 8.73ms ± 0% 8.73ms ± 0% ~ (p=0.548 n=5+5)
GCD10000x10000/WithoutXY-8 686µs ± 0% 687µs ± 0% ~ (p=0.548 n=5+5)
GCD10000x10000/WithXY-8 15.9ms ± 0% 15.9ms ± 0% ~ (p=0.841 n=5+5)
GCD10000x100000/WithoutXY-8 2.08ms ± 0% 2.08ms ± 0% ~ (p=1.000 n=5+5)
GCD10000x100000/WithXY-8 86.7ms ± 0% 86.7ms ± 0% ~ (p=1.000 n=5+5)
GCD100000x100000/WithoutXY-8 51.1ms ± 0% 51.0ms ± 0% ~ (p=0.151 n=5+5)
GCD100000x100000/WithXY-8 1.23s ± 0% 1.23s ± 0% ~ (p=0.841 n=5+5)
Hilbert-8 2.41ms ± 1% 2.42ms ± 2% ~ (p=0.690 n=5+5)
Binomial-8 4.86µs ± 1% 4.86µs ± 1% ~ (p=0.889 n=5+5)
QuoRem-8 7.09µs ± 0% 7.08µs ± 0% -0.09% (p=0.024 n=5+5)
Exp-8 161ms ± 0% 161ms ± 0% -0.08% (p=0.032 n=5+5)
Exp2-8 161ms ± 0% 161ms ± 0% ~ (p=1.000 n=5+5)
Bitset-8 40.7ns ± 0% 40.6ns ± 0% ~ (p=0.095 n=4+5)
BitsetNeg-8 159ns ± 4% 148ns ± 0% -6.92% (p=0.016 n=5+4)
BitsetOrig-8 378ns ± 1% 378ns ± 1% ~ (p=0.937 n=5+5)
BitsetNegOrig-8 647ns ± 5% 647ns ± 4% ~ (p=1.000 n=5+5)
ModSqrt225_Tonelli-8 7.26ms ± 0% 7.27ms ± 0% ~ (p=1.000 n=5+5)
ModSqrt224_3Mod4-8 2.24ms ± 0% 2.24ms ± 0% ~ (p=0.690 n=5+5)
ModSqrt5430_Tonelli-8 62.8s ± 1% 62.5s ± 0% ~ (p=0.063 n=5+4)
ModSqrt5430_3Mod4-8 20.8s ± 0% 20.8s ± 0% ~ (p=0.310 n=5+5)
Sqrt-8 101µs ± 1% 101µs ± 0% -0.35% (p=0.032 n=5+5)
IntSqr/1-8 32.3ns ± 1% 32.5ns ± 1% ~ (p=0.421 n=5+5)
IntSqr/2-8 157ns ± 5% 156ns ± 5% ~ (p=0.651 n=5+5)
IntSqr/3-8 292ns ± 2% 291ns ± 3% ~ (p=0.881 n=5+5)
IntSqr/5-8 738ns ± 6% 740ns ± 5% ~ (p=0.841 n=5+5)
IntSqr/8-8 1.82µs ± 4% 1.83µs ± 4% ~ (p=0.730 n=5+5)
IntSqr/10-8 2.92µs ± 1% 2.93µs ± 1% ~ (p=0.643 n=5+5)
IntSqr/20-8 6.28µs ± 2% 6.28µs ± 2% ~ (p=1.000 n=5+5)
IntSqr/30-8 13.8µs ± 2% 13.9µs ± 3% ~ (p=1.000 n=5+5)
IntSqr/50-8 37.8µs ± 4% 37.9µs ± 4% ~ (p=0.690 n=5+5)
IntSqr/80-8 95.9µs ± 1% 95.8µs ± 1% ~ (p=0.841 n=5+5)
IntSqr/100-8 148µs ± 1% 148µs ± 1% ~ (p=0.310 n=5+5)
IntSqr/200-8 586µs ± 1% 586µs ± 1% ~ (p=0.841 n=5+5)
IntSqr/300-8 1.32ms ± 0% 1.31ms ± 0% ~ (p=0.222 n=5+5)
IntSqr/500-8 2.48ms ± 0% 2.48ms ± 0% ~ (p=0.556 n=5+4)
IntSqr/800-8 4.68ms ± 0% 4.68ms ± 0% ~ (p=0.548 n=5+5)
IntSqr/1000-8 7.57ms ± 0% 7.56ms ± 0% ~ (p=0.421 n=5+5)
Mul-8 311ms ± 0% 311ms ± 0% ~ (p=0.548 n=5+5)
Exp3Power/0x10-8 559ns ± 1% 560ns ± 1% ~ (p=0.984 n=5+5)
Exp3Power/0x40-8 641ns ± 1% 634ns ± 1% ~ (p=0.063 n=5+5)
Exp3Power/0x100-8 1.39µs ± 2% 1.40µs ± 2% ~ (p=0.381 n=5+5)
Exp3Power/0x400-8 8.27µs ± 1% 8.26µs ± 0% ~ (p=0.571 n=5+5)
Exp3Power/0x1000-8 59.9µs ± 0% 59.7µs ± 0% -0.23% (p=0.008 n=5+5)
Exp3Power/0x4000-8 816µs ± 0% 816µs ± 0% ~ (p=1.000 n=5+5)
Exp3Power/0x10000-8 7.77ms ± 0% 7.77ms ± 0% ~ (p=0.841 n=5+5)
Exp3Power/0x40000-8 73.4ms ± 0% 73.4ms ± 0% ~ (p=0.690 n=5+5)
Exp3Power/0x100000-8 665ms ± 0% 664ms ± 0% -0.14% (p=0.008 n=5+5)
Exp3Power/0x400000-8 5.98s ± 0% 5.98s ± 0% -0.09% (p=0.008 n=5+5)
Fibo-8 116ms ± 0% 116ms ± 0% -0.25% (p=0.008 n=5+5)
NatSqr/1-8 115ns ± 3% 116ns ± 2% ~ (p=0.238 n=5+5)
NatSqr/2-8 237ns ± 1% 237ns ± 1% ~ (p=0.683 n=5+5)
NatSqr/3-8 367ns ± 3% 368ns ± 3% ~ (p=0.817 n=5+5)
NatSqr/5-8 807ns ± 3% 812ns ± 3% ~ (p=0.913 n=5+5)
NatSqr/8-8 1.93µs ± 2% 1.93µs ± 3% ~ (p=0.651 n=5+5)
NatSqr/10-8 2.98µs ± 2% 2.99µs ± 2% ~ (p=0.690 n=5+5)
NatSqr/20-8 6.49µs ± 2% 6.46µs ± 2% ~ (p=0.548 n=5+5)
NatSqr/30-8 14.4µs ± 2% 14.3µs ± 2% ~ (p=0.690 n=5+5)
NatSqr/50-8 38.6µs ± 2% 38.7µs ± 2% ~ (p=0.841 n=5+5)
NatSqr/80-8 96.1µs ± 2% 95.8µs ± 2% ~ (p=0.548 n=5+5)
NatSqr/100-8 149µs ± 1% 149µs ± 1% ~ (p=0.841 n=5+5)
NatSqr/200-8 593µs ± 1% 590µs ± 1% ~ (p=0.421 n=5+5)
NatSqr/300-8 1.32ms ± 0% 1.32ms ± 1% ~ (p=0.222 n=5+5)
NatSqr/500-8 2.49ms ± 0% 2.49ms ± 0% ~ (p=0.690 n=5+5)
NatSqr/800-8 4.69ms ± 0% 4.69ms ± 0% ~ (p=1.000 n=5+5)
NatSqr/1000-8 7.59ms ± 0% 7.58ms ± 0% ~ (p=0.841 n=5+5)
ScanPi-8 322µs ± 0% 321µs ± 0% ~ (p=0.095 n=5+5)
StringPiParallel-8 71.4µs ± 5% 68.8µs ± 4% ~ (p=0.151 n=5+5)
Scan/10/Base2-8 1.10µs ± 0% 1.09µs ± 0% -0.36% (p=0.032 n=5+5)
Scan/100/Base2-8 7.78µs ± 0% 7.79µs ± 0% +0.14% (p=0.008 n=5+5)
Scan/1000/Base2-8 78.8µs ± 0% 79.0µs ± 0% +0.24% (p=0.008 n=5+5)
Scan/10000/Base2-8 1.22ms ± 0% 1.22ms ± 0% ~ (p=0.056 n=5+5)
Scan/100000/Base2-8 55.1ms ± 0% 55.0ms ± 0% -0.15% (p=0.008 n=5+5)
Scan/10/Base8-8 514ns ± 0% 515ns ± 0% ~ (p=0.079 n=5+5)
Scan/100/Base8-8 2.89µs ± 0% 2.89µs ± 0% +0.15% (p=0.008 n=5+5)
Scan/1000/Base8-8 31.0µs ± 0% 31.1µs ± 0% +0.12% (p=0.008 n=5+5)
Scan/10000/Base8-8 740µs ± 0% 740µs ± 0% ~ (p=0.222 n=5+5)
Scan/100000/Base8-8 50.6ms ± 0% 50.5ms ± 0% -0.06% (p=0.016 n=4+5)
Scan/10/Base10-8 492ns ± 1% 490ns ± 1% ~ (p=0.310 n=5+5)
Scan/100/Base10-8 2.67µs ± 0% 2.67µs ± 0% ~ (p=0.056 n=5+5)
Scan/1000/Base10-8 28.7µs ± 0% 28.7µs ± 0% ~ (p=1.000 n=5+5)
Scan/10000/Base10-8 717µs ± 0% 716µs ± 0% ~ (p=0.222 n=5+5)
Scan/100000/Base10-8 50.2ms ± 0% 50.3ms ± 0% +0.05% (p=0.008 n=5+5)
Scan/10/Base16-8 442ns ± 1% 442ns ± 0% ~ (p=0.468 n=5+5)
Scan/100/Base16-8 2.46µs ± 0% 2.45µs ± 0% ~ (p=0.159 n=5+5)
Scan/1000/Base16-8 27.2µs ± 0% 27.2µs ± 0% ~ (p=0.841 n=5+5)
Scan/10000/Base16-8 721µs ± 0% 722µs ± 0% ~ (p=0.548 n=5+5)
Scan/100000/Base16-8 52.6ms ± 0% 52.6ms ± 0% +0.07% (p=0.008 n=5+5)
String/10/Base2-8 244ns ± 1% 242ns ± 1% ~ (p=0.103 n=5+5)
String/100/Base2-8 1.48µs ± 0% 1.48µs ± 1% ~ (p=0.786 n=5+5)
String/1000/Base2-8 13.3µs ± 1% 13.3µs ± 0% ~ (p=0.222 n=5+5)
String/10000/Base2-8 132µs ± 1% 132µs ± 1% ~ (p=1.000 n=5+5)
String/100000/Base2-8 1.30ms ± 1% 1.30ms ± 1% ~ (p=1.000 n=5+5)
String/10/Base8-8 167ns ± 1% 168ns ± 1% ~ (p=0.135 n=5+5)
String/100/Base8-8 623ns ± 1% 626ns ± 1% ~ (p=0.151 n=5+5)
String/1000/Base8-8 5.24µs ± 1% 5.24µs ± 0% ~ (p=1.000 n=5+5)
String/10000/Base8-8 50.0µs ± 1% 50.0µs ± 1% ~ (p=1.000 n=5+5)
String/100000/Base8-8 492µs ± 1% 489µs ± 1% ~ (p=0.056 n=5+5)
String/10/Base10-8 503ns ± 1% 501ns ± 0% ~ (p=0.183 n=5+5)
String/100/Base10-8 1.96µs ± 0% 1.97µs ± 0% ~ (p=0.389 n=5+5)
String/1000/Base10-8 12.4µs ± 1% 12.4µs ± 1% ~ (p=0.841 n=5+5)
String/10000/Base10-8 56.7µs ± 1% 56.6µs ± 0% ~ (p=1.000 n=5+5)
String/100000/Base10-8 25.6ms ± 0% 25.6ms ± 0% ~ (p=0.222 n=5+5)
String/10/Base16-8 147ns ± 0% 148ns ± 2% ~ (p=1.000 n=4+5)
String/100/Base16-8 505ns ± 0% 505ns ± 1% ~ (p=0.778 n=5+5)
String/1000/Base16-8 3.94µs ± 0% 3.94µs ± 0% ~ (p=0.841 n=5+5)
String/10000/Base16-8 37.4µs ± 1% 37.2µs ± 1% ~ (p=0.095 n=5+5)
String/100000/Base16-8 367µs ± 1% 367µs ± 0% ~ (p=1.000 n=5+5)
LeafSize/0-8 6.64ms ± 0% 6.65ms ± 0% ~ (p=0.690 n=5+5)
LeafSize/1-8 72.5µs ± 1% 72.4µs ± 1% ~ (p=0.841 n=5+5)
LeafSize/2-8 72.6µs ± 1% 72.6µs ± 1% ~ (p=1.000 n=5+5)
LeafSize/3-8 377µs ± 0% 377µs ± 0% ~ (p=0.421 n=5+5)
LeafSize/4-8 71.2µs ± 1% 71.3µs ± 0% ~ (p=0.278 n=5+5)
LeafSize/5-8 469µs ± 0% 469µs ± 0% ~ (p=0.310 n=5+5)
LeafSize/6-8 376µs ± 0% 376µs ± 0% ~ (p=0.841 n=5+5)
LeafSize/7-8 244µs ± 0% 244µs ± 0% ~ (p=0.841 n=5+5)
LeafSize/8-8 71.9µs ± 1% 72.1µs ± 1% ~ (p=0.548 n=5+5)
LeafSize/9-8 536µs ± 0% 536µs ± 0% ~ (p=0.151 n=5+5)
LeafSize/10-8 470µs ± 0% 471µs ± 0% +0.10% (p=0.032 n=5+5)
LeafSize/11-8 458µs ± 0% 458µs ± 0% ~ (p=0.881 n=5+5)
LeafSize/12-8 376µs ± 0% 376µs ± 0% ~ (p=0.548 n=5+5)
LeafSize/13-8 341µs ± 0% 342µs ± 0% ~ (p=0.222 n=5+5)
LeafSize/14-8 246µs ± 0% 245µs ± 0% ~ (p=0.167 n=5+5)
LeafSize/15-8 168µs ± 0% 168µs ± 0% ~ (p=0.548 n=5+5)
LeafSize/16-8 72.1µs ± 1% 72.2µs ± 1% ~ (p=0.690 n=5+5)
LeafSize/32-8 81.5µs ± 1% 81.4µs ± 1% ~ (p=1.000 n=5+5)
LeafSize/64-8 133µs ± 1% 134µs ± 1% ~ (p=0.690 n=5+5)
ProbablyPrime/n=0-8 44.3ms ± 0% 44.2ms ± 0% -0.28% (p=0.008 n=5+5)
ProbablyPrime/n=1-8 64.8ms ± 0% 64.7ms ± 0% -0.15% (p=0.008 n=5+5)
ProbablyPrime/n=5-8 147ms ± 0% 147ms ± 0% -0.11% (p=0.008 n=5+5)
ProbablyPrime/n=10-8 250ms ± 0% 250ms ± 0% ~ (p=0.056 n=5+5)
ProbablyPrime/n=20-8 456ms ± 0% 455ms ± 0% -0.05% (p=0.008 n=5+5)
ProbablyPrime/Lucas-8 23.6ms ± 0% 23.5ms ± 0% -0.29% (p=0.008 n=5+5)
ProbablyPrime/MillerRabinBase2-8 20.6ms ± 0% 20.6ms ± 0% ~ (p=0.690 n=5+5)
FloatSqrt/64-8 2.01µs ± 1% 2.02µs ± 1% ~ (p=0.421 n=5+5)
FloatSqrt/128-8 4.43µs ± 2% 4.38µs ± 2% ~ (p=0.222 n=5+5)
FloatSqrt/256-8 6.64µs ± 1% 6.68µs ± 2% ~ (p=0.516 n=5+5)
FloatSqrt/1000-8 31.9µs ± 0% 31.8µs ± 0% ~ (p=0.095 n=5+5)
FloatSqrt/10000-8 595µs ± 0% 594µs ± 0% ~ (p=0.056 n=5+5)
FloatSqrt/100000-8 17.9ms ± 0% 17.9ms ± 0% ~ (p=0.151 n=5+5)
FloatSqrt/1000000-8 1.52s ± 0% 1.52s ± 0% ~ (p=0.841 n=5+5)
name old speed new speed delta
AddVV/1-8 2.97GB/s ± 0% 2.97GB/s ± 0% ~ (p=0.971 n=4+4)
AddVV/2-8 9.47GB/s ± 0% 9.47GB/s ± 0% +0.01% (p=0.016 n=5+5)
AddVV/3-8 12.4GB/s ± 0% 12.4GB/s ± 0% ~ (p=0.548 n=5+5)
AddVV/4-8 14.6GB/s ± 0% 14.6GB/s ± 0% ~ (p=1.000 n=5+5)
AddVV/5-8 16.4GB/s ± 0% 16.4GB/s ± 0% ~ (p=1.000 n=5+5)
AddVV/10-8 21.7GB/s ± 0% 21.7GB/s ± 0% ~ (p=0.548 n=5+5)
AddVV/100-8 29.4GB/s ± 0% 29.4GB/s ± 0% ~ (p=1.000 n=5+5)
AddVV/1000-8 31.7GB/s ± 0% 31.7GB/s ± 0% ~ (p=0.524 n=5+4)
AddVV/10000-8 31.5GB/s ± 0% 31.5GB/s ± 0% ~ (p=0.690 n=5+5)
AddVV/100000-8 28.8GB/s ± 7% 28.1GB/s ± 8% ~ (p=0.548 n=5+5)
AddVW/1-8 859MB/s ± 0% 864MB/s ± 0% +0.61% (p=0.008 n=5+5)
AddVW/2-8 809MB/s ± 2% 1520MB/s ± 0% +87.78% (p=0.008 n=5+5)
AddVW/3-8 2.08GB/s ± 0% 2.18GB/s ± 0% +4.54% (p=0.008 n=5+5)
AddVW/4-8 2.46GB/s ± 0% 2.66GB/s ± 0% +8.33% (p=0.016 n=4+5)
AddVW/5-8 2.76GB/s ± 0% 3.20GB/s ± 0% +16.03% (p=0.008 n=5+5)
AddVW/10-8 3.63GB/s ± 0% 5.15GB/s ± 0% +41.83% (p=0.008 n=5+5)
AddVW/100-8 4.79GB/s ± 0% 9.87GB/s ± 0% +106.12% (p=0.008 n=5+5)
AddVW/1000-8 5.27GB/s ± 0% 12.42GB/s ± 0% +135.74% (p=0.008 n=5+5)
AddVW/10000-8 5.31GB/s ± 0% 11.19GB/s ± 0% +110.71% (p=0.008 n=5+5)
AddVW/100000-8 5.32GB/s ± 0% 11.32GB/s ± 0% +112.56% (p=0.008 n=5+5)
SubVW/1-8 859MB/s ± 0% 864MB/s ± 0% +0.61% (p=0.008 n=5+5)
SubVW/2-8 812MB/s ± 2% 1520MB/s ± 0% +87.09% (p=0.008 n=5+5)
SubVW/3-8 2.08GB/s ± 0% 2.18GB/s ± 0% +4.55% (p=0.008 n=5+5)
SubVW/4-8 2.46GB/s ± 0% 2.66GB/s ± 0% +8.33% (p=0.008 n=5+5)
SubVW/5-8 2.75GB/s ± 0% 3.20GB/s ± 0% +16.03% (p=0.008 n=5+5)
SubVW/10-8 3.63GB/s ± 0% 5.15GB/s ± 0% +41.82% (p=0.008 n=5+5)
SubVW/100-8 4.79GB/s ± 0% 9.87GB/s ± 0% +106.13% (p=0.008 n=5+5)
SubVW/1000-8 5.27GB/s ± 0% 12.42GB/s ± 0% +135.74% (p=0.008 n=5+5)
SubVW/10000-8 5.31GB/s ± 0% 11.17GB/s ± 0% +110.44% (p=0.008 n=5+5)
SubVW/100000-8 5.32GB/s ± 0% 11.31GB/s ± 0% +112.35% (p=0.008 n=5+5)
AddMulVVW/1-8 1.97GB/s ± 1% 1.96GB/s ± 1% ~ (p=0.151 n=5+5)
AddMulVVW/2-8 2.24GB/s ± 0% 2.25GB/s ± 0% ~ (p=0.095 n=5+5)
AddMulVVW/3-8 2.11GB/s ± 0% 2.12GB/s ± 0% ~ (p=0.548 n=5+5)
AddMulVVW/4-8 2.17GB/s ± 1% 2.17GB/s ± 1% ~ (p=0.548 n=5+5)
AddMulVVW/5-8 2.22GB/s ± 1% 2.21GB/s ± 1% ~ (p=0.421 n=5+5)
AddMulVVW/10-8 2.17GB/s ± 1% 2.16GB/s ± 0% ~ (p=0.095 n=5+5)
AddMulVVW/100-8 2.35GB/s ± 0% 2.35GB/s ± 0% ~ (p=0.421 n=5+5)
AddMulVVW/1000-8 2.47GB/s ± 0% 2.41GB/s ± 0% -2.09% (p=0.008 n=5+5)
AddMulVVW/10000-8 2.16GB/s ± 0% 2.15GB/s ± 0% -0.23% (p=0.008 n=5+5)
AddMulVVW/100000-8 2.03GB/s ± 1% 2.04GB/s ± 0% ~ (p=0.690 n=5+5)
name old alloc/op new alloc/op delta
FloatString/100-8 400B ± 0% 400B ± 0% ~ (all equal)
FloatString/1000-8 3.22kB ± 0% 3.22kB ± 0% ~ (all equal)
FloatString/10000-8 55.6kB ± 0% 55.5kB ± 0% ~ (p=0.206 n=5+5)
FloatString/100000-8 627kB ± 0% 627kB ± 0% ~ (all equal)
FloatAdd/10-8 0.00B 0.00B ~ (all equal)
FloatAdd/100-8 0.00B 0.00B ~ (all equal)
FloatAdd/1000-8 0.00B 0.00B ~ (all equal)
FloatAdd/10000-8 0.00B 0.00B ~ (all equal)
FloatAdd/100000-8 0.00B 0.00B ~ (all equal)
FloatSub/10-8 0.00B 0.00B ~ (all equal)
FloatSub/100-8 0.00B 0.00B ~ (all equal)
FloatSub/1000-8 0.00B 0.00B ~ (all equal)
FloatSub/10000-8 0.00B 0.00B ~ (all equal)
FloatSub/100000-8 0.00B 0.00B ~ (all equal)
FloatSqrt/64-8 416B ± 0% 416B ± 0% ~ (all equal)
FloatSqrt/128-8 720B ± 0% 720B ± 0% ~ (all equal)
FloatSqrt/256-8 816B ± 0% 816B ± 0% ~ (all equal)
FloatSqrt/1000-8 2.50kB ± 0% 2.50kB ± 0% ~ (all equal)
FloatSqrt/10000-8 23.5kB ± 0% 23.5kB ± 0% ~ (all equal)
FloatSqrt/100000-8 251kB ± 0% 251kB ± 0% ~ (all equal)
FloatSqrt/1000000-8 4.61MB ± 0% 4.61MB ± 0% ~ (all equal)
name old allocs/op new allocs/op delta
FloatString/100-8 8.00 ± 0% 8.00 ± 0% ~ (all equal)
FloatString/1000-8 10.0 ± 0% 10.0 ± 0% ~ (all equal)
FloatString/10000-8 42.0 ± 0% 42.0 ± 0% ~ (all equal)
FloatString/100000-8 346 ± 0% 346 ± 0% ~ (all equal)
FloatAdd/10-8 0.00 0.00 ~ (all equal)
FloatAdd/100-8 0.00 0.00 ~ (all equal)
FloatAdd/1000-8 0.00 0.00 ~ (all equal)
FloatAdd/10000-8 0.00 0.00 ~ (all equal)
FloatAdd/100000-8 0.00 0.00 ~ (all equal)
FloatSub/10-8 0.00 0.00 ~ (all equal)
FloatSub/100-8 0.00 0.00 ~ (all equal)
FloatSub/1000-8 0.00 0.00 ~ (all equal)
FloatSub/10000-8 0.00 0.00 ~ (all equal)
FloatSub/100000-8 0.00 0.00 ~ (all equal)
FloatSqrt/64-8 9.00 ± 0% 9.00 ± 0% ~ (all equal)
FloatSqrt/128-8 13.0 ± 0% 13.0 ± 0% ~ (all equal)
FloatSqrt/256-8 12.0 ± 0% 12.0 ± 0% ~ (all equal)
FloatSqrt/1000-8 19.0 ± 0% 19.0 ± 0% ~ (all equal)
FloatSqrt/10000-8 35.0 ± 0% 35.0 ± 0% ~ (all equal)
FloatSqrt/100000-8 55.0 ± 0% 55.0 ± 0% ~ (all equal)
FloatSqrt/1000000-8 122 ± 0% 122 ± 0% ~ (all equal)
Change-Id: I6888d84c037d91f9e2199f3492ea3f6a0ed77b24
Reviewed-on: https://go-review.googlesource.com/77832
Reviewed-by: Vlad Krasnov <vlad@cloudflare.com>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
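The "4 times and 2 times" unrolling mentioned in this commit message can be pictured with a vector-plus-word addition, the operation behind the AddVW rows above. The following Go sketch is illustrative only: it assumes 64-bit words, `len(x) >= len(z)`, and a hypothetical function name `addVWUnrolled`. The real change is arm64 assembly.

```go
package main

import "math/bits"

// addVWUnrolled computes z = x + y, where y is a single word, and
// returns the carry out of the top word. It assumes len(x) >= len(z).
// The running value c starts as y and afterwards holds the 0/1 carry.
func addVWUnrolled(z, x []uint64, y uint64) uint64 {
	c := y
	i := 0
	// Main loop: four words per iteration.
	for ; i+4 <= len(z); i += 4 {
		z[i], c = bits.Add64(x[i], c, 0)
		z[i+1], c = bits.Add64(x[i+1], c, 0)
		z[i+2], c = bits.Add64(x[i+2], c, 0)
		z[i+3], c = bits.Add64(x[i+3], c, 0)
	}
	// At most one two-word step remains after the main loop.
	if i+2 <= len(z) {
		z[i], c = bits.Add64(x[i], c, 0)
		z[i+1], c = bits.Add64(x[i+1], c, 0)
		i += 2
	}
	// At most one single word remains after that.
	if i < len(z) {
		z[i], c = bits.Add64(x[i], c, 0)
	}
	return c
}
```

Splitting the tail into a two-word step and a one-word step means every length is handled without a generic per-word loop, and the two-word step lines up with a single paired load and store.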
|
|
The lack of a proper addMulVVW implementation for arm64 hurts RSA performance.
This assembly implementation is optimized for arm64 based servers.
name old time/op new time/op delta
pkg:math/big goos:linux goarch:arm64
AddMulVVW/1 55.2ns ± 0% 11.9ns ± 1% -78.37% (p=0.000 n=8+10)
AddMulVVW/2 67.0ns ± 0% 11.2ns ± 0% -83.28% (p=0.000 n=7+10)
AddMulVVW/3 93.2ns ± 0% 13.2ns ± 0% -85.84% (p=0.000 n=10+10)
AddMulVVW/4 126ns ± 0% 13ns ± 1% -89.82% (p=0.000 n=10+10)
AddMulVVW/5 151ns ± 0% 17ns ± 0% -88.87% (p=0.000 n=10+9)
AddMulVVW/10 323ns ± 0% 25ns ± 0% -92.20% (p=0.000 n=10+10)
AddMulVVW/100 3.28µs ± 0% 0.14µs ± 0% -95.82% (p=0.000 n=10+10)
AddMulVVW/1000 31.7µs ± 0% 1.3µs ± 0% -96.00% (p=0.000 n=10+8)
AddMulVVW/10000 313µs ± 0% 13µs ± 0% -95.98% (p=0.000 n=10+10)
AddMulVVW/100000 3.24ms ± 0% 0.13ms ± 1% -96.13% (p=0.000 n=9+9)
pkg:crypto/rsa goos:linux goarch:arm64
RSA2048Decrypt 44.7ms ± 0% 4.0ms ± 6% -91.08% (p=0.000 n=8+10)
RSA2048Sign 46.3ms ± 0% 5.0ms ± 0% -89.29% (p=0.000 n=9+10)
3PrimeRSA2048Decrypt 22.3ms ± 0% 2.4ms ± 0% -89.26% (p=0.000 n=10+10)
Change-Id: I295f0bd5c51a4442d02c44ece1f6026d30dff0bc
Reviewed-on: https://go-review.googlesource.com/76270
Reviewed-by: Vlad Krasnov <vlad@cloudflare.com>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Vlad Krasnov <vlad@cloudflare.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
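As background for the numbers above, addMulVVW is a multiply-accumulate over word vectors: it adds x*y into z and returns the carry word. A minimal portable sketch in Go follows, assuming 64-bit words and `len(z) >= len(x)`. It mirrors the semantics only and is not the optimized assembly from this CL.

```go
package main

import "math/bits"

// addMulVVW computes z += x*y, where y is a single word, and returns
// the carry out of the top word. It assumes len(z) >= len(x).
func addMulVVW(z, x []uint64, y uint64) (c uint64) {
	for i := range x {
		// hi:lo = x[i]*y + z[i] + c, which always fits in 128 bits.
		hi, lo := bits.Mul64(x[i], y)
		var carry uint64
		lo, carry = bits.Add64(lo, z[i], 0)
		hi += carry
		lo, carry = bits.Add64(lo, c, 0)
		hi += carry
		z[i], c = lo, hi
	}
	return c
}
```

Each step needs a full 64x64 to 128-bit product plus two carry-propagating additions. A multi-word multiplication calls this once per word of the multiplier, so its cost is a large part of big-integer multiplication.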
|
|
R18 seems to be reserved on Apple platforms.
This may fix the darwin/arm64 build.
Change-Id: Ia2c1de550a64827c85a64affa53b94c62aacce8e
Reviewed-on: https://go-review.googlesource.com/98896
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Elias Naur <elias.naur@gmail.com>
|
|
The biggest hot spot in the existing implementation is its "load" operations, which lead to poor performance.
By unrolling the loop 4x and 2x, and using the "LDP" and "STP" instructions, this CL reduces the "load" cost and improves performance.
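
The unrolling described above can be sketched in portable Go. This is a rough illustration of the loop shape only (a 4x main loop plus a tail), assuming uint64 words and a standalone function named addVV; the CL itself does this in arm64 assembly, where each pair of words is moved with a single LDP or STP.

```go
package main

import (
	"fmt"
	"math/bits"
)

// addVV computes z = x + y word by word and returns the final carry.
// All three slices must have the same length.
// Hypothetical sketch of a 4x-unrolled loop, not the assembly from the CL.
func addVV(z, x, y []uint64) (carry uint64) {
	i := 0
	// Main loop: four words per iteration, so the loop overhead
	// (compare, branch, index update) is paid once per four additions.
	for ; i+4 <= len(z); i += 4 {
		z[i], carry = bits.Add64(x[i], y[i], carry)
		z[i+1], carry = bits.Add64(x[i+1], y[i+1], carry)
		z[i+2], carry = bits.Add64(x[i+2], y[i+2], carry)
		z[i+3], carry = bits.Add64(x[i+3], y[i+3], carry)
	}
	// Tail: the remaining zero to three words.
	for ; i < len(z); i++ {
		z[i], carry = bits.Add64(x[i], y[i], carry)
	}
	return carry
}

func main() {
	m := ^uint64(0)
	x := []uint64{m, m, m, m, m}
	y := []uint64{1, 0, 0, 0, 0}
	z := make([]uint64, len(x))
	c := addVV(z, x, y)
	fmt.Println(z, c) // prints "[0 0 0 0 0] 1"
}
```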
Benchmarks:
name old time/op new time/op delta
AddVV/1-8 21.5ns ± 0% 11.5ns ± 0% -46.51% (p=0.008 n=5+5)
AddVV/2-8 13.5ns ± 0% 12.0ns ± 0% -11.11% (p=0.008 n=5+5)
AddVV/3-8 15.5ns ± 0% 13.0ns ± 0% -16.13% (p=0.008 n=5+5)
AddVV/4-8 17.5ns ± 0% 13.5ns ± 0% -22.86% (p=0.008 n=5+5)
AddVV/5-8 19.5ns ± 0% 14.5ns ± 0% -25.64% (p=0.008 n=5+5)
AddVV/10-8 29.5ns ± 0% 18.0ns ± 0% -38.98% (p=0.008 n=5+5)
AddVV/100-8 217ns ± 0% 94ns ± 0% -56.64% (p=0.008 n=5+5)
AddVV/1000-8 2.02µs ± 0% 1.03µs ± 0% -48.85% (p=0.008 n=5+5)
AddVV/10000-8 20.5µs ± 0% 11.3µs ± 0% -44.70% (p=0.008 n=5+5)
AddVV/100000-8 247µs ± 3% 154µs ± 0% -37.52% (p=0.008 n=5+5)
SubVV/1-8 21.5ns ± 0% 11.5ns ± 0% ~ (p=0.079 n=4+5)
SubVV/2-8 13.5ns ± 0% 12.0ns ± 0% -11.11% (p=0.008 n=5+5)
SubVV/3-8 15.5ns ± 0% 13.0ns ± 0% -16.13% (p=0.008 n=5+5)
SubVV/4-8 17.5ns ± 0% 13.5ns ± 0% -22.86% (p=0.008 n=5+5)
SubVV/5-8 19.5ns ± 0% 14.5ns ± 0% -25.64% (p=0.008 n=5+5)
SubVV/10-8 29.5ns ± 0% 18.0ns ± 0% -38.98% (p=0.008 n=5+5)
SubVV/100-8 217ns ± 0% 94ns ± 0% -56.64% (p=0.008 n=5+5)
SubVV/1000-8 2.02µs ± 0% 0.80µs ± 0% -60.50% (p=0.008 n=5+5)
SubVV/10000-8 20.5µs ± 0% 11.3µs ± 0% -44.99% (p=0.008 n=5+5)
SubVV/100000-8 221µs ±11% 223µs ±16% ~ (p=0.690 n=5+5)
AddVW/1-8 9.32ns ± 0% 9.32ns ± 0% ~ (all equal)
AddVW/2-8 19.7ns ± 1% 19.7ns ± 0% ~ (p=0.381 n=5+4)
AddVW/3-8 11.5ns ± 0% 11.5ns ± 0% ~ (all equal)
AddVW/4-8 13.0ns ± 0% 13.0ns ± 0% ~ (all equal)
AddVW/5-8 14.5ns ± 0% 14.5ns ± 0% ~ (all equal)
AddVW/10-8 22.0ns ± 0% 22.0ns ± 0% ~ (all equal)
AddVW/100-8 167ns ± 0% 167ns ± 0% ~ (all equal)
AddVW/1000-8 1.52µs ± 0% 1.52µs ± 0% +0.40% (p=0.008 n=5+5)
AddVW/10000-8 15.1µs ± 0% 15.1µs ± 0% ~ (p=0.556 n=5+4)
AddVW/100000-8 152µs ± 1% 152µs ± 1% ~ (p=0.690 n=5+5)
AddMulVVW/1-8 33.3ns ± 0% 32.7ns ± 1% -1.86% (p=0.008 n=5+5)
AddMulVVW/2-8 59.3ns ± 1% 56.9ns ± 1% -4.15% (p=0.008 n=5+5)
AddMulVVW/3-8 80.5ns ± 1% 85.4ns ± 3% +6.19% (p=0.008 n=5+5)
AddMulVVW/4-8 127ns ± 0% 111ns ± 1% -13.19% (p=0.008 n=5+5)
AddMulVVW/5-8 144ns ± 0% 149ns ± 0% +3.47% (p=0.016 n=4+5)
AddMulVVW/10-8 298ns ± 1% 283ns ± 0% -4.77% (p=0.008 n=5+5)
AddMulVVW/100-8 3.06µs ± 0% 2.99µs ± 0% -2.21% (p=0.008 n=5+5)
AddMulVVW/1000-8 31.3µs ± 0% 26.9µs ± 0% -14.17% (p=0.008 n=5+5)
AddMulVVW/10000-8 316µs ± 0% 305µs ± 0% -3.51% (p=0.008 n=5+5)
AddMulVVW/100000-8 3.17ms ± 0% 3.17ms ± 1% ~ (p=0.690 n=5+5)
DecimalConversion-8 316µs ± 1% 313µs ± 2% ~ (p=0.095 n=5+5)
FloatString/100-8 2.53µs ± 1% 2.56µs ± 2% ~ (p=0.222 n=5+5)
FloatString/1000-8 58.4µs ± 0% 58.5µs ± 0% ~ (p=0.206 n=5+5)
FloatString/10000-8 4.59ms ± 0% 4.58ms ± 0% -0.31% (p=0.008 n=5+5)
FloatString/100000-8 446ms ± 0% 444ms ± 0% -0.31% (p=0.008 n=5+5)
FloatAdd/10-8 184ns ± 0% 172ns ± 0% -6.30% (p=0.008 n=5+5)
FloatAdd/100-8 189ns ± 2% 191ns ± 4% ~ (p=0.381 n=5+5)
FloatAdd/1000-8 371ns ± 0% 347ns ± 1% -6.42% (p=0.008 n=5+5)
FloatAdd/10000-8 1.87µs ± 0% 1.68µs ± 0% -10.16% (p=0.008 n=5+5)
FloatAdd/100000-8 17.1µs ± 0% 15.6µs ± 0% -8.74% (p=0.016 n=5+4)
FloatSub/10-8 152ns ± 0% 138ns ± 0% -9.47% (p=0.000 n=4+5)
FloatSub/100-8 148ns ± 0% 142ns ± 0% -4.05% (p=0.000 n=5+4)
FloatSub/1000-8 245ns ± 1% 217ns ± 0% -11.28% (p=0.000 n=5+4)
FloatSub/10000-8 1.07µs ± 0% 0.88µs ± 1% -18.14% (p=0.008 n=5+5)
FloatSub/100000-8 9.58µs ± 0% 7.96µs ± 0% -16.84% (p=0.008 n=5+5)
ParseFloatSmallExp-8 28.8µs ± 1% 29.0µs ± 1% ~ (p=0.095 n=5+5)
ParseFloatLargeExp-8 126µs ± 1% 126µs ± 1% ~ (p=0.841 n=5+5)
GCD10x10/WithoutXY-8 277ns ± 2% 281ns ± 4% ~ (p=0.746 n=5+5)
GCD10x10/WithXY-8 2.10µs ± 1% 2.12µs ± 3% ~ (p=0.548 n=5+5)
GCD10x100/WithoutXY-8 615ns ± 3% 607ns ± 2% ~ (p=0.135 n=5+5)
GCD10x100/WithXY-8 3.50µs ± 2% 3.62µs ± 5% ~ (p=0.151 n=5+5)
GCD10x1000/WithoutXY-8 1.39µs ± 2% 1.39µs ± 3% ~ (p=0.690 n=5+5)
GCD10x1000/WithXY-8 7.39µs ± 1% 7.34µs ± 2% ~ (p=0.135 n=5+5)
GCD10x10000/WithoutXY-8 8.66µs ± 1% 8.68µs ± 1% ~ (p=0.421 n=5+5)
GCD10x10000/WithXY-8 28.1µs ± 2% 27.0µs ± 2% -3.81% (p=0.008 n=5+5)
GCD10x100000/WithoutXY-8 79.3µs ± 1% 79.3µs ± 1% ~ (p=0.841 n=5+5)
GCD10x100000/WithXY-8 238µs ± 0% 227µs ± 1% -4.74% (p=0.008 n=5+5)
GCD100x100/WithoutXY-8 1.89µs ± 1% 1.88µs ± 2% ~ (p=0.968 n=5+5)
GCD100x100/WithXY-8 26.7µs ± 1% 27.0µs ± 1% +1.44% (p=0.032 n=5+5)
GCD100x1000/WithoutXY-8 4.48µs ± 1% 4.45µs ± 2% ~ (p=0.341 n=5+5)
GCD100x1000/WithXY-8 36.3µs ± 1% 35.1µs ± 1% -3.27% (p=0.008 n=5+5)
GCD100x10000/WithoutXY-8 22.8µs ± 0% 22.7µs ± 1% ~ (p=0.056 n=5+5)
GCD100x10000/WithXY-8 145µs ± 1% 133µs ± 1% -8.33% (p=0.008 n=5+5)
GCD100x100000/WithoutXY-8 198µs ± 0% 195µs ± 0% -1.56% (p=0.008 n=5+5)
GCD100x100000/WithXY-8 1.11ms ± 0% 1.00ms ± 0% -10.04% (p=0.008 n=5+5)
GCD1000x1000/WithoutXY-8 25.2µs ± 1% 24.8µs ± 1% -1.63% (p=0.016 n=5+5)
GCD1000x1000/WithXY-8 513µs ± 0% 517µs ± 2% ~ (p=0.421 n=5+5)
GCD1000x10000/WithoutXY-8 57.0µs ± 0% 52.7µs ± 1% -7.56% (p=0.008 n=5+5)
GCD1000x10000/WithXY-8 1.20ms ± 0% 1.10ms ± 0% -8.70% (p=0.008 n=5+5)
GCD1000x100000/WithoutXY-8 358µs ± 0% 318µs ± 1% -11.03% (p=0.008 n=5+5)
GCD1000x100000/WithXY-8 8.71ms ± 0% 7.65ms ± 0% -12.19% (p=0.008 n=5+5)
GCD10000x10000/WithoutXY-8 690µs ± 0% 630µs ± 0% -8.71% (p=0.008 n=5+5)
GCD10000x10000/WithXY-8 16.0ms ± 1% 14.9ms ± 0% -6.85% (p=0.008 n=5+5)
GCD10000x100000/WithoutXY-8 2.09ms ± 0% 1.75ms ± 0% -16.09% (p=0.016 n=5+4)
GCD10000x100000/WithXY-8 86.8ms ± 0% 76.3ms ± 0% -12.09% (p=0.008 n=5+5)
GCD100000x100000/WithoutXY-8 51.1ms ± 0% 46.0ms ± 0% -9.97% (p=0.008 n=5+5)
GCD100000x100000/WithXY-8 1.25s ± 0% 1.15s ± 0% -7.92% (p=0.008 n=5+5)
Hilbert-8 2.45ms ± 1% 2.49ms ± 1% +1.99% (p=0.008 n=5+5)
Binomial-8 4.98µs ± 3% 4.90µs ± 2% ~ (p=0.421 n=5+5)
QuoRem-8 7.10µs ± 0% 6.21µs ± 0% -12.55% (p=0.016 n=5+4)
Exp-8 161ms ± 0% 161ms ± 0% ~ (p=0.421 n=5+5)
Exp2-8 161ms ± 0% 161ms ± 0% ~ (p=0.151 n=5+5)
Bitset-8 40.4ns ± 0% 40.3ns ± 0% ~ (p=0.190 n=5+5)
BitsetNeg-8 163ns ± 3% 137ns ± 2% -15.91% (p=0.008 n=5+5)
BitsetOrig-8 377ns ± 1% 372ns ± 1% -1.22% (p=0.024 n=5+5)
BitsetNegOrig-8 631ns ± 1% 605ns ± 1% -4.09% (p=0.008 n=5+5)
ModSqrt225_Tonelli-8 7.26ms ± 0% 7.26ms ± 0% ~ (p=0.548 n=5+5)
ModSqrt224_3Mod4-8 2.24ms ± 0% 2.24ms ± 0% ~ (p=1.000 n=5+5)
ModSqrt5430_Tonelli-8 62.4s ± 0% 62.4s ± 0% ~ (p=0.841 n=5+5)
ModSqrt5430_3Mod4-8 20.8s ± 0% 20.7s ± 0% ~ (p=0.056 n=5+5)
Sqrt-8 101µs ± 0% 89µs ± 0% -12.17% (p=0.008 n=5+5)
IntSqr/1-8 32.5ns ± 1% 32.7ns ± 1% ~ (p=0.056 n=5+5)
IntSqr/2-8 160ns ± 5% 158ns ± 0% ~ (p=0.397 n=5+4)
IntSqr/3-8 298ns ± 4% 296ns ± 4% ~ (p=0.667 n=5+5)
IntSqr/5-8 737ns ± 5% 761ns ± 3% +3.34% (p=0.016 n=5+5)
IntSqr/8-8 1.87µs ± 4% 1.90µs ± 3% ~ (p=0.222 n=5+5)
IntSqr/10-8 2.96µs ± 4% 2.92µs ± 6% ~ (p=0.310 n=5+5)
IntSqr/20-8 6.28µs ± 3% 6.21µs ± 2% ~ (p=0.310 n=5+5)
IntSqr/30-8 14.0µs ± 2% 13.9µs ± 2% ~ (p=0.548 n=5+5)
IntSqr/50-8 37.7µs ± 3% 38.3µs ± 2% ~ (p=0.095 n=5+5)
IntSqr/80-8 95.9µs ± 2% 95.1µs ± 1% ~ (p=0.310 n=5+5)
IntSqr/100-8 148µs ± 1% 148µs ± 1% ~ (p=0.841 n=5+5)
IntSqr/200-8 586µs ± 1% 587µs ± 1% ~ (p=1.000 n=5+5)
IntSqr/300-8 1.32ms ± 0% 1.31ms ± 1% -0.73% (p=0.032 n=5+5)
IntSqr/500-8 2.48ms ± 0% 2.45ms ± 0% -1.15% (p=0.008 n=5+5)
IntSqr/800-8 4.68ms ± 0% 4.62ms ± 0% -1.23% (p=0.008 n=5+5)
IntSqr/1000-8 7.57ms ± 0% 7.50ms ± 0% -0.84% (p=0.008 n=5+5)
Mul-8 311ms ± 0% 308ms ± 0% -0.81% (p=0.008 n=5+5)
Exp3Power/0x10-8 574ns ± 1% 578ns ± 2% ~ (p=0.500 n=5+5)
Exp3Power/0x40-8 640ns ± 1% 646ns ± 0% ~ (p=0.056 n=5+5)
Exp3Power/0x100-8 1.42µs ± 1% 1.42µs ± 1% ~ (p=0.246 n=5+5)
Exp3Power/0x400-8 8.30µs ± 1% 8.29µs ± 1% ~ (p=0.802 n=5+5)
Exp3Power/0x1000-8 60.0µs ± 0% 59.9µs ± 0% -0.24% (p=0.016 n=5+5)
Exp3Power/0x4000-8 817µs ± 0% 816µs ± 0% -0.17% (p=0.008 n=5+5)
Exp3Power/0x10000-8 7.80ms ± 1% 7.70ms ± 0% -1.23% (p=0.008 n=5+5)
Exp3Power/0x40000-8 73.4ms ± 0% 72.5ms ± 0% -1.28% (p=0.008 n=5+5)
Exp3Power/0x100000-8 665ms ± 0% 656ms ± 0% -1.34% (p=0.008 n=5+5)
Exp3Power/0x400000-8 5.99s ± 0% 5.90s ± 0% -1.40% (p=0.008 n=5+5)
Fibo-8 116ms ± 0% 50ms ± 0% -57.09% (p=0.008 n=5+5)
NatSqr/1-8 112ns ± 4% 112ns ± 2% ~ (p=0.968 n=5+5)
NatSqr/2-8 251ns ± 2% 250ns ± 1% ~ (p=0.571 n=5+5)
NatSqr/3-8 378ns ± 2% 379ns ± 2% ~ (p=0.794 n=5+5)
NatSqr/5-8 829ns ± 3% 827ns ± 2% ~ (p=1.000 n=5+5)
NatSqr/8-8 1.97µs ± 2% 1.95µs ± 2% ~ (p=0.310 n=5+5)
NatSqr/10-8 3.02µs ± 2% 2.99µs ± 2% ~ (p=0.421 n=5+5)
NatSqr/20-8 6.51µs ± 2% 6.49µs ± 1% ~ (p=0.841 n=5+5)
NatSqr/30-8 14.1µs ± 2% 14.0µs ± 2% ~ (p=0.841 n=5+5)
NatSqr/50-8 38.1µs ± 2% 38.3µs ± 3% ~ (p=0.690 n=5+5)
NatSqr/80-8 95.5µs ± 2% 96.0µs ± 1% ~ (p=0.421 n=5+5)
NatSqr/100-8 150µs ± 1% 148µs ± 2% ~ (p=0.095 n=5+5)
NatSqr/200-8 588µs ± 1% 590µs ± 1% ~ (p=0.421 n=5+5)
NatSqr/300-8 1.32ms ± 1% 1.31ms ± 1% ~ (p=0.841 n=5+5)
NatSqr/500-8 2.50ms ± 0% 2.47ms ± 0% -1.03% (p=0.008 n=5+5)
NatSqr/800-8 4.70ms ± 0% 4.64ms ± 0% -1.31% (p=0.008 n=5+5)
NatSqr/1000-8 7.60ms ± 0% 7.52ms ± 0% -1.01% (p=0.008 n=5+5)
ScanPi-8 326µs ± 0% 326µs ± 0% ~ (p=0.841 n=5+5)
StringPiParallel-8 70.3µs ± 5% 63.8µs ±10% ~ (p=0.056 n=5+5)
Scan/10/Base2-8 1.09µs ± 0% 1.09µs ± 0% ~ (p=0.317 n=5+5)
Scan/100/Base2-8 7.79µs ± 0% 7.78µs ± 0% ~ (p=0.063 n=5+5)
Scan/1000/Base2-8 79.0µs ± 0% 78.9µs ± 0% -0.18% (p=0.008 n=5+5)
Scan/10000/Base2-8 1.22ms ± 0% 1.22ms ± 0% -0.15% (p=0.008 n=5+5)
Scan/100000/Base2-8 55.1ms ± 0% 55.2ms ± 0% +0.20% (p=0.008 n=5+5)
Scan/10/Base8-8 512ns ± 0% 512ns ± 1% ~ (p=0.810 n=5+5)
Scan/100/Base8-8 2.89µs ± 0% 2.89µs ± 0% ~ (p=0.810 n=5+5)
Scan/1000/Base8-8 31.0µs ± 0% 31.0µs ± 0% ~ (p=0.151 n=5+5)
Scan/10000/Base8-8 740µs ± 0% 741µs ± 0% +0.10% (p=0.008 n=5+5)
Scan/100000/Base8-8 50.6ms ± 0% 50.6ms ± 0% +0.08% (p=0.008 n=5+5)
Scan/10/Base10-8 487ns ± 0% 487ns ± 0% ~ (p=0.571 n=5+5)
Scan/100/Base10-8 2.67µs ± 0% 2.67µs ± 0% ~ (p=0.810 n=5+5)
Scan/1000/Base10-8 28.7µs ± 0% 28.7µs ± 0% +0.06% (p=0.008 n=5+5)
Scan/10000/Base10-8 716µs ± 0% 717µs ± 0% ~ (p=0.222 n=5+5)
Scan/100000/Base10-8 50.3ms ± 0% 50.3ms ± 0% +0.10% (p=0.008 n=5+5)
Scan/10/Base16-8 438ns ± 0% 437ns ± 1% ~ (p=0.786 n=5+5)
Scan/100/Base16-8 2.47µs ± 0% 2.47µs ± 0% -0.19% (p=0.048 n=5+5)
Scan/1000/Base16-8 27.2µs ± 0% 27.3µs ± 0% ~ (p=0.087 n=5+5)
Scan/10000/Base16-8 722µs ± 0% 722µs ± 0% +0.11% (p=0.008 n=5+5)
Scan/100000/Base16-8 52.6ms ± 0% 52.7ms ± 0% +0.15% (p=0.008 n=5+5)
String/10/Base2-8 247ns ± 2% 248ns ± 1% ~ (p=0.437 n=5+5)
String/100/Base2-8 1.51µs ± 0% 1.51µs ± 0% -0.37% (p=0.024 n=5+5)
String/1000/Base2-8 13.6µs ± 1% 13.5µs ± 0% ~ (p=0.095 n=5+5)
String/10000/Base2-8 135µs ± 0% 135µs ± 1% ~ (p=0.841 n=5+5)
String/100000/Base2-8 1.32ms ± 1% 1.32ms ± 1% ~ (p=0.690 n=5+5)
String/10/Base8-8 169ns ± 1% 169ns ± 1% ~ (p=1.000 n=5+5)
String/100/Base8-8 636ns ± 0% 634ns ± 1% ~ (p=0.413 n=5+5)
String/1000/Base8-8 5.33µs ± 1% 5.32µs ± 0% ~ (p=0.222 n=5+5)
String/10000/Base8-8 50.9µs ± 1% 50.7µs ± 0% ~ (p=0.151 n=5+5)
String/100000/Base8-8 500µs ± 1% 497µs ± 0% ~ (p=0.421 n=5+5)
String/10/Base10-8 516ns ± 1% 513ns ± 0% -0.62% (p=0.016 n=5+4)
String/100/Base10-8 1.97µs ± 0% 1.96µs ± 0% ~ (p=0.667 n=4+5)
String/1000/Base10-8 12.5µs ± 0% 11.5µs ± 0% -7.92% (p=0.008 n=5+5)
String/10000/Base10-8 57.7µs ± 0% 52.5µs ± 0% -8.93% (p=0.008 n=5+5)
String/100000/Base10-8 25.6ms ± 0% 21.6ms ± 0% -15.94% (p=0.008 n=5+5)
String/10/Base16-8 150ns ± 1% 149ns ± 0% ~ (p=0.413 n=5+4)
String/100/Base16-8 514ns ± 1% 514ns ± 1% ~ (p=0.849 n=5+5)
String/1000/Base16-8 4.01µs ± 0% 4.01µs ± 0% ~ (p=0.421 n=5+5)
String/10000/Base16-8 37.8µs ± 1% 37.8µs ± 1% ~ (p=0.841 n=5+5)
String/100000/Base16-8 373µs ± 2% 373µs ± 0% ~ (p=0.421 n=5+5)
LeafSize/0-8 6.63ms ± 0% 6.63ms ± 0% ~ (p=0.730 n=4+5)
LeafSize/1-8 74.0µs ± 0% 67.7µs ± 1% -8.53% (p=0.008 n=5+5)
LeafSize/2-8 74.2µs ± 0% 68.3µs ± 1% -7.99% (p=0.008 n=5+5)
LeafSize/3-8 379µs ± 0% 309µs ± 0% -18.52% (p=0.008 n=5+5)
LeafSize/4-8 72.7µs ± 1% 66.7µs ± 0% -8.37% (p=0.008 n=5+5)
LeafSize/5-8 471µs ± 0% 384µs ± 0% -18.55% (p=0.008 n=5+5)
LeafSize/6-8 378µs ± 0% 308µs ± 0% -18.59% (p=0.008 n=5+5)
LeafSize/7-8 245µs ± 0% 204µs ± 1% -16.75% (p=0.008 n=5+5)
LeafSize/8-8 73.4µs ± 0% 66.9µs ± 1% -8.79% (p=0.008 n=5+5)
LeafSize/9-8 538µs ± 0% 437µs ± 0% -18.75% (p=0.008 n=5+5)
LeafSize/10-8 472µs ± 0% 396µs ± 1% -16.01% (p=0.008 n=5+5)
LeafSize/11-8 460µs ± 0% 374µs ± 0% -18.58% (p=0.008 n=5+5)
LeafSize/12-8 378µs ± 0% 308µs ± 0% -18.38% (p=0.008 n=5+5)
LeafSize/13-8 343µs ± 0% 284µs ± 0% -17.30% (p=0.008 n=5+5)
LeafSize/14-8 248µs ± 0% 206µs ± 0% -16.94% (p=0.008 n=5+5)
LeafSize/15-8 169µs ± 0% 144µs ± 0% -14.69% (p=0.008 n=5+5)
LeafSize/16-8 72.9µs ± 0% 66.8µs ± 1% -8.27% (p=0.008 n=5+5)
LeafSize/32-8 82.5µs ± 0% 76.7µs ± 0% -7.04% (p=0.008 n=5+5)
LeafSize/64-8 134µs ± 0% 129µs ± 0% -3.80% (p=0.008 n=5+5)
ProbablyPrime/n=0-8 44.2ms ± 0% 43.4ms ± 0% -1.95% (p=0.008 n=5+5)
ProbablyPrime/n=1-8 64.9ms ± 0% 64.0ms ± 0% -1.27% (p=0.008 n=5+5)
ProbablyPrime/n=5-8 147ms ± 0% 146ms ± 0% -0.58% (p=0.008 n=5+5)
ProbablyPrime/n=10-8 250ms ± 0% 249ms ± 0% -0.35% (p=0.008 n=5+5)
ProbablyPrime/n=20-8 456ms ± 0% 455ms ± 0% -0.18% (p=0.008 n=5+5)
ProbablyPrime/Lucas-8 23.6ms ± 0% 22.7ms ± 0% -3.74% (p=0.008 n=5+5)
ProbablyPrime/MillerRabinBase2-8 20.7ms ± 0% 20.6ms ± 0% ~ (p=0.421 n=5+5)
FloatSqrt/64-8 2.25µs ± 1% 2.29µs ± 0% +1.48% (p=0.008 n=5+5)
FloatSqrt/128-8 4.86µs ± 1% 4.92µs ± 1% +1.21% (p=0.032 n=5+5)
FloatSqrt/256-8 13.6µs ± 0% 13.7µs ± 1% +1.31% (p=0.032 n=5+5)
FloatSqrt/1000-8 70.0µs ± 1% 70.1µs ± 0% ~ (p=0.690 n=5+5)
FloatSqrt/10000-8 1.92ms ± 0% 1.90ms ± 0% -0.59% (p=0.008 n=5+5)
FloatSqrt/100000-8 55.3ms ± 0% 54.8ms ± 0% -1.01% (p=0.008 n=5+5)
FloatSqrt/1000000-8 4.56s ± 0% 4.50s ± 0% -1.28% (p=0.008 n=5+5)
name old speed new speed delta
AddVV/1-8 2.97GB/s ± 0% 5.56GB/s ± 0% +86.85% (p=0.008 n=5+5)
AddVV/2-8 9.47GB/s ± 0% 10.66GB/s ± 0% +12.50% (p=0.008 n=5+5)
AddVV/3-8 12.4GB/s ± 0% 14.7GB/s ± 0% +19.10% (p=0.008 n=5+5)
AddVV/4-8 14.6GB/s ± 0% 18.9GB/s ± 0% +29.63% (p=0.016 n=4+5)
AddVV/5-8 16.4GB/s ± 0% 22.0GB/s ± 0% +34.47% (p=0.016 n=5+4)
AddVV/10-8 21.7GB/s ± 0% 35.5GB/s ± 0% +63.89% (p=0.008 n=5+5)
AddVV/100-8 29.4GB/s ± 0% 68.0GB/s ± 0% +131.38% (p=0.008 n=5+5)
AddVV/1000-8 31.7GB/s ± 0% 61.9GB/s ± 0% +95.43% (p=0.008 n=5+5)
AddVV/10000-8 31.2GB/s ± 0% 56.4GB/s ± 0% +80.83% (p=0.008 n=5+5)
AddVV/100000-8 25.9GB/s ± 3% 41.4GB/s ± 0% +59.98% (p=0.008 n=5+5)
SubVV/1-8 2.97GB/s ± 0% 5.56GB/s ± 0% +86.97% (p=0.016 n=4+5)
SubVV/2-8 9.47GB/s ± 0% 10.66GB/s ± 0% +12.51% (p=0.008 n=5+5)
SubVV/3-8 12.4GB/s ± 0% 14.8GB/s ± 0% +19.23% (p=0.016 n=4+5)
SubVV/4-8 14.6GB/s ± 0% 18.9GB/s ± 0% +29.56% (p=0.008 n=5+5)
SubVV/5-8 16.4GB/s ± 0% 22.0GB/s ± 0% +34.47% (p=0.016 n=4+5)
SubVV/10-8 21.7GB/s ± 0% 35.5GB/s ± 0% +63.89% (p=0.008 n=5+5)
SubVV/100-8 29.4GB/s ± 0% 68.0GB/s ± 0% +131.38% (p=0.008 n=5+5)
SubVV/1000-8 31.6GB/s ± 0% 80.1GB/s ± 0% +153.08% (p=0.008 n=5+5)
SubVV/10000-8 31.2GB/s ± 0% 56.7GB/s ± 0% +81.79% (p=0.008 n=5+5)
SubVV/100000-8 29.1GB/s ±10% 29.0GB/s ±18% ~ (p=0.690 n=5+5)
AddVW/1-8 859MB/s ± 0% 859MB/s ± 0% -0.01% (p=0.008 n=5+5)
AddVW/2-8 811MB/s ± 1% 814MB/s ± 0% ~ (p=0.413 n=5+4)
AddVW/3-8 2.08GB/s ± 0% 2.08GB/s ± 0% ~ (p=0.206 n=5+5)
AddVW/4-8 2.46GB/s ± 0% 2.46GB/s ± 0% ~ (p=0.056 n=5+5)
AddVW/5-8 2.75GB/s ± 0% 2.75GB/s ± 0% ~ (p=0.508 n=5+5)
AddVW/10-8 3.63GB/s ± 0% 3.63GB/s ± 0% ~ (p=0.214 n=5+5)
AddVW/100-8 4.79GB/s ± 0% 4.79GB/s ± 0% ~ (p=0.500 n=5+5)
AddVW/1000-8 5.27GB/s ± 0% 5.25GB/s ± 0% -0.43% (p=0.008 n=5+5)
AddVW/10000-8 5.30GB/s ± 0% 5.30GB/s ± 0% ~ (p=0.397 n=5+5)
AddVW/100000-8 5.27GB/s ± 1% 5.25GB/s ± 1% ~ (p=0.690 n=5+5)
AddMulVVW/1-8 1.92GB/s ± 0% 1.96GB/s ± 1% +1.95% (p=0.008 n=5+5)
AddMulVVW/2-8 2.16GB/s ± 1% 2.25GB/s ± 1% +4.32% (p=0.008 n=5+5)
AddMulVVW/3-8 2.39GB/s ± 1% 2.25GB/s ± 3% -5.79% (p=0.008 n=5+5)
AddMulVVW/4-8 2.00GB/s ± 0% 2.31GB/s ± 1% +15.31% (p=0.008 n=5+5)
AddMulVVW/5-8 2.22GB/s ± 0% 2.14GB/s ± 0% -3.86% (p=0.008 n=5+5)
AddMulVVW/10-8 2.15GB/s ± 1% 2.25GB/s ± 0% +5.03% (p=0.008 n=5+5)
AddMulVVW/100-8 2.09GB/s ± 0% 2.14GB/s ± 0% +2.25% (p=0.008 n=5+5)
AddMulVVW/1000-8 2.04GB/s ± 0% 2.38GB/s ± 0% +16.52% (p=0.008 n=5+5)
AddMulVVW/10000-8 2.03GB/s ± 0% 2.10GB/s ± 0% +3.64% (p=0.008 n=5+5)
AddMulVVW/100000-8 2.02GB/s ± 0% 2.02GB/s ± 1% ~ (p=0.690 n=5+5)
Change-Id: Ie482d67a7dbb5af6f5d81af2b3d9d14bd66336db
Reviewed-on: https://go-review.googlesource.com/77831
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
|
|
Verified that BenchmarkBitLen time went down from 2.25 ns/op to 0.65 ns/op
on a 2.3 GHz Intel Core i7, before removing that benchmark (now covered by
math/bits benchmarks).
Change-Id: I3890bb7d1889e95b9a94bd68f0bdf06f1885adeb
Reviewed-on: https://go-review.googlesource.com/38464
Run-TryBot: Robert Griesemer <gri@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
|
|
benchmark old ns/op new ns/op delta
BenchmarkAddVV_1 18.7 14.8 -20.86%
BenchmarkAddVV_2 21.8 16.6 -23.85%
BenchmarkAddVV_3 26.1 17.1 -34.48%
BenchmarkAddVV_4 30.4 21.9 -27.96%
BenchmarkAddVV_5 35.5 19.8 -44.23%
BenchmarkAddVV_1e1 63.0 28.3 -55.08%
BenchmarkAddVV_1e2 593 178 -69.98%
BenchmarkAddVV_1e3 5691 1490 -73.82%
BenchmarkAddVV_1e4 56868 20761 -63.49%
BenchmarkAddVV_1e5 569062 207679 -63.51%
BenchmarkAddVW_1 15.8 12.6 -20.25%
BenchmarkAddVW_2 17.8 13.1 -26.40%
BenchmarkAddVW_3 21.2 13.9 -34.43%
BenchmarkAddVW_4 23.6 14.7 -37.71%
BenchmarkAddVW_5 26.0 15.8 -39.23%
BenchmarkAddVW_1e1 41.3 21.6 -47.70%
BenchmarkAddVW_1e2 383 145 -62.14%
BenchmarkAddVW_1e3 3703 1264 -65.87%
BenchmarkAddVW_1e4 36920 14359 -61.11%
BenchmarkAddVW_1e5 370345 143046 -61.37%
BenchmarkAddMulVVW_1 33.2 32.5 -2.11%
BenchmarkAddMulVVW_2 58.0 57.2 -1.38%
BenchmarkAddMulVVW_3 95.2 93.9 -1.37%
BenchmarkAddMulVVW_4 108 106 -1.85%
BenchmarkAddMulVVW_5 159 156 -1.89%
BenchmarkAddMulVVW_1e1 344 340 -1.16%
BenchmarkAddMulVVW_1e2 3644 3624 -0.55%
BenchmarkAddMulVVW_1e3 37344 37208 -0.36%
BenchmarkAddMulVVW_1e4 373295 372170 -0.30%
BenchmarkAddMulVVW_1e5 3438116 3425606 -0.36%
BenchmarkBitLen0 7.21 4.32 -40.08%
BenchmarkBitLen1 6.49 4.32 -33.44%
BenchmarkBitLen2 7.23 4.32 -40.25%
BenchmarkBitLen3 6.49 4.32 -33.44%
BenchmarkBitLen4 7.22 4.32 -40.17%
BenchmarkBitLen5 6.52 4.33 -33.59%
BenchmarkBitLen8 7.22 4.32 -40.17%
BenchmarkBitLen9 6.49 4.32 -33.44%
BenchmarkBitLen16 8.66 4.32 -50.12%
BenchmarkBitLen17 7.95 4.32 -45.66%
BenchmarkBitLen31 8.69 4.32 -50.29%
BenchmarkGCD10x10 5021 5033 +0.24%
BenchmarkGCD10x100 5571 5572 +0.02%
BenchmarkGCD10x1000 6707 6729 +0.33%
BenchmarkGCD10x10000 13526 13419 -0.79%
BenchmarkGCD10x100000 85668 83242 -2.83%
BenchmarkGCD100x100 24196 23936 -1.07%
BenchmarkGCD100x1000 28802 27309 -5.18%
BenchmarkGCD100x10000 64111 51704 -19.35%
BenchmarkGCD100x100000 385840 274385 -28.89%
BenchmarkGCD1000x1000 262892 236269 -10.13%
BenchmarkGCD1000x10000 371393 277883 -25.18%
BenchmarkGCD1000x100000 1311795 589055 -55.10%
BenchmarkGCD10000x10000 9596740 6123930 -36.19%
BenchmarkGCD10000x100000 16404000 7269610 -55.68%
BenchmarkGCD100000x100000 776660000 419270000 -46.02%
BenchmarkHilbert 13478980 13402270 -0.57%
BenchmarkBinomial 9802 9440 -3.69%
BenchmarkBitset 142 142 +0.00%
BenchmarkBitsetNeg 328 279 -14.94%
BenchmarkBitsetOrig 853 861 +0.94%
BenchmarkBitsetNegOrig 1489 1444 -3.02%
BenchmarkMul 420949000 410481000 -2.49%
BenchmarkExp3Power0x10 1148 1229 +7.06%
BenchmarkExp3Power0x40 1322 1376 +4.08%
BenchmarkExp3Power0x100 2437 2486 +2.01%
BenchmarkExp3Power0x400 9456 9346 -1.16%
BenchmarkExp3Power0x1000 113623 108701 -4.33%
BenchmarkExp3Power0x4000 1134933 1101481 -2.95%
BenchmarkExp3Power0x10000 10773570 10396160 -3.50%
BenchmarkExp3Power0x40000 101362100 97788300 -3.53%
BenchmarkExp3Power0x100000 921114000 885249000 -3.89%
BenchmarkExp3Power0x400000 8323094000 7969020000 -4.25%
BenchmarkFibo 322021600 92554450 -71.26%
BenchmarkScanPi 1264583 321065 -74.61%
BenchmarkStringPiParallel 1644661 554216 -66.30%
BenchmarkScan10Base2 1111 1080 -2.79%
BenchmarkScan100Base2 6645 6345 -4.51%
BenchmarkScan1000Base2 84084 62405 -25.78%
BenchmarkScan10000Base2 3105998 932551 -69.98%
BenchmarkScan100000Base2 257234800 40113333 -84.41%
BenchmarkScan10Base8 571 573 +0.35%
BenchmarkScan100Base8 2810 2543 -9.50%
BenchmarkScan1000Base8 47383 25834 -45.48%
BenchmarkScan10000Base8 2739518 567203 -79.30%
BenchmarkScan100000Base8 253952400 36495680 -85.63%
BenchmarkScan10Base10 553 556 +0.54%
BenchmarkScan100Base10 2640 2385 -9.66%
BenchmarkScan1000Base10 50865 24049 -52.72%
BenchmarkScan10000Base10 3279916 549313 -83.25%
BenchmarkScan100000Base10 309121000 36213140 -88.29%
BenchmarkScan10Base16 478 483 +1.05%
BenchmarkScan100Base16 2353 2144 -8.88%
BenchmarkScan1000Base16 48091 24246 -49.58%
BenchmarkScan10000Base16 2858886 586475 -79.49%
BenchmarkScan100000Base16 266320000 38190500 -85.66%
BenchmarkString10Base2 736 730 -0.82%
BenchmarkString100Base2 2695 2707 +0.45%
BenchmarkString1000Base2 20549 20388 -0.78%
BenchmarkString10000Base2 212638 210782 -0.87%
BenchmarkString100000Base2 1944963 1938033 -0.36%
BenchmarkString10Base8 524 517 -1.34%
BenchmarkString100Base8 1326 1320 -0.45%
BenchmarkString1000Base8 8213 8249 +0.44%
BenchmarkString10000Base8 72204 72092 -0.16%
BenchmarkString100000Base8 769068 765993 -0.40%
BenchmarkString10Base10 1018 982 -3.54%
BenchmarkString100Base10 3485 3206 -8.01%
BenchmarkString1000Base10 37102 18935 -48.97%
BenchmarkString10000Base10 188633 88637 -53.01%
BenchmarkString100000Base10 124490300 19700940 -84.17%
BenchmarkString10Base16 509 502 -1.38%
BenchmarkString100Base16 1084 1098 +1.29%
BenchmarkString1000Base16 5641 5650 +0.16%
BenchmarkString10000Base16 46900 46745 -0.33%
BenchmarkString100000Base16 508957 505840 -0.61%
BenchmarkLeafSize0 8934320 8149465 -8.78%
BenchmarkLeafSize1 237666 118381 -50.19%
BenchmarkLeafSize2 237807 117854 -50.44%
BenchmarkLeafSize3 1688640 353494 -79.07%
BenchmarkLeafSize4 235676 116196 -50.70%
BenchmarkLeafSize5 2121896 430325 -79.72%
BenchmarkLeafSize6 1682306 351775 -79.09%
BenchmarkLeafSize7 1051847 251436 -76.10%
BenchmarkLeafSize8 232697 115674 -50.29%
BenchmarkLeafSize9 2403616 488443 -79.68%
BenchmarkLeafSize10 2120975 429545 -79.75%
BenchmarkLeafSize11 2023789 426525 -78.92%
BenchmarkLeafSize12 1684830 351985 -79.11%
BenchmarkLeafSize13 1465529 337906 -76.94%
BenchmarkLeafSize14 1050498 253872 -75.83%
BenchmarkLeafSize15 683228 197384 -71.11%
BenchmarkLeafSize16 232496 116026 -50.10%
BenchmarkLeafSize32 245841 126671 -48.47%
BenchmarkLeafSize64 301728 190285 -36.93%
Change-Id: I63e63297896d96b89c9a275b893c2b405a7e105d
Reviewed-on: https://go-review.googlesource.com/9260
Reviewed-by: David Crawshaw <crawshaw@golang.org>
|
|
To use a pure Go implementation of the low-level arithmetic
functions (when no platform-specific assembly implementations
are available), set the build tag math_big_pure_go.
This will make it easy to vendor the math/big package where no
assembly is available (for instance for use with gc which relies
on 1.4 functionality for now).
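
A minimal sketch of what such a tag-selected file might look like, assuming uint64 words and a standalone addVW function (both are illustrative, not the package's actual source). The build constraint is shown only inside a comment so that the example runs on its own; in a real package it would be the first line of the file, written as `// +build math_big_pure_go` in the syntax of that era or `//go:build math_big_pure_go` in current Go.

```go
// In a real package this file would start with the constraint line
//
//	//go:build math_big_pure_go
//
// and the assembly-backed declarations would live in a sibling file
// guarded by the negated tag. The constraint is left out here so the
// sketch compiles and runs without any tags.
package main

import (
	"fmt"
	"math/bits"
)

// addVW computes z = x + y for a multi-word x and a single word y,
// and returns the final carry. z and x must have the same length.
// Hypothetical pure Go fallback; uint64 stands in for Word.
func addVW(z, x []uint64, y uint64) (carry uint64) {
	carry = y
	for i := range x {
		z[i], carry = bits.Add64(x[i], carry, 0)
	}
	return carry
}

func main() {
	x := []uint64{^uint64(0), 5}
	z := make([]uint64, len(x))
	c := addVW(z, x, 1)
	fmt.Println(z, c) // prints "[0 6] 0"
}
```

The tag is passed on the command line, for example `go build -tags math_big_pure_go`.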
Change-Id: I91e17c0fdc568a20ec1512d7c64621241dc60c17
Reviewed-on: https://go-review.googlesource.com/7856
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
|
|
Change-Id: Ief12e1435a40dd2eaddc3f97f63be44c4dd2e050
Reviewed-on: https://go-review.googlesource.com/7144
Reviewed-by: Minux Ma <minux@golang.org>
Reviewed-by: Dave Cheney <dave@cheney.net>
Reviewed-by: Rob Pike <r@golang.org>
Reviewed-by: Russ Cox <rsc@golang.org>
|