| Age | Commit message | Author |
|
Checking that the lengths are equal and panicking teaches the compiler
that it can assume “i in range for z” implies “i in range for x”, letting us
simplify the actual loops a bit.
It also turns up a few places in math/big that were playing maybe a little
too fast and loose with slice lengths. Update those to explicitly set all the
input slices to the same length.
These speedups are basically irrelevant, since they only happen
in real code if people are compiling with -tags math_big_pure_go.
But at least the code is clearer.
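
As a rough illustration of the length check described above, here is a minimal sketch. It is not the actual math/big code: the function name addVV is reused only for readability, and plain uint stands in for big.Word. Whether the compiler removes every bounds check in this exact form is an assumption based on the commit message, not something the sketch proves.

```go
package main

import (
	"fmt"
	"math/bits"
)

// addVV sets z = x + y word by word and returns the final carry.
// The up-front length check is the point of the sketch: once the
// lengths are known to be equal, an index that is in range for z
// is also in range for x and y, so the loop body needs no extra
// length conditions.
func addVV(z, x, y []uint) (c uint) {
	if len(x) != len(z) || len(y) != len(z) {
		panic("addVV: slice lengths differ")
	}
	for i := range z {
		z[i], c = bits.Add(x[i], y[i], c)
	}
	return c
}

func main() {
	x := []uint{^uint(0), 1}
	y := []uint{1, 2}
	z := make([]uint, 2)
	c := addVV(z, x, y)
	fmt.Println(z, c) // prints "[0 4] 0"
}
```

Callers that previously passed slices of different lengths would have to reslice them first, which is the kind of cleanup the message mentions for math/big.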
benchmark \ system c3h88 c2s16 s7 386 s7-386 c4as16 mac arm loong64 ppc64le riscv64 s390x
AddVV/words=1/impl=go ~ +11.20% +5.11% -7.67% -7.77% +1.90% +10.76% -33.22% ~ +10.98% ~ +6.60%
AddVV/words=10/impl=go -22.12% -13.48% -10.37% -17.95% -18.07% -24.58% -22.04% -29.95% -14.22% ~ -6.33% +3.66%
AddVV/words=16/impl=go -9.75% -13.73% ~ -21.90% -18.66% -30.03% -20.45% -28.09% -17.33% -7.15% -8.96% +12.55%
AddVV/words=100/impl=go -5.91% -1.02% ~ -29.23% -22.18% -25.62% -6.49% -23.59% -22.31% -1.88% -14.13% +9.23%
AddVV/words=1000/impl=go -0.52% -0.19% -3.58% -33.89% -23.46% -22.46% ~ -24.00% -24.73% +0.93% -15.79% +12.32%
AddVV/words=10000/impl=go ~ ~ ~ -33.79% -23.72% -23.79% -5.98% -23.92% ~ +0.78% -15.45% +8.59%
AddVV/words=100000/impl=go ~ ~ ~ -33.90% -24.25% -22.82% -4.09% -24.63% ~ +1.00% -13.56% ~
SubVV/words=1/impl=go ~ +11.64% +14.05% ~ -4.07% ~ +10.79% -33.69% ~ ~ +3.89% +12.33%
SubVV/words=10/impl=go -10.31% -14.09% -7.38% +13.76% -13.25% -18.05% -20.08% -24.97% -14.15% +10.13% -0.97% -2.51%
SubVV/words=16/impl=go -8.06% -13.73% -5.70% +17.00% -12.83% -23.76% -17.52% -25.25% -17.30% -2.80% -4.96% -18.25%
SubVV/words=100/impl=go -9.22% -1.30% -2.76% +20.88% -14.35% -15.29% -8.49% -19.64% -22.31% -0.68% -14.30% -9.04%
SubVV/words=1000/impl=go -0.60% ~ -3.43% +23.08% -16.14% -11.96% ~ -28.52% -24.73% ~ -15.95% -9.91%
SubVV/words=10000/impl=go ~ ~ ~ +26.01% -15.24% -11.92% ~ -28.26% +4.25% ~ -15.42% -5.95%
SubVV/words=100000/impl=go ~ ~ ~ +25.71% -15.83% -12.13% ~ -27.88% -1.27% ~ -13.57% -6.72%
LshVU/words=1/impl=go +0.56% +0.36% ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
LshVU/words=10/impl=go +13.37% +4.63% ~ ~ ~ ~ ~ -2.90% ~ ~ ~ ~
LshVU/words=16/impl=go +22.83% +6.47% ~ ~ ~ ~ ~ ~ +0.80% ~ ~ +5.88%
LshVU/words=100/impl=go +7.56% +13.95% ~ ~ ~ ~ ~ ~ +0.33% -2.50% ~ ~
LshVU/words=1000/impl=go +0.64% +17.92% ~ ~ ~ ~ ~ -6.52% ~ -2.58% ~ ~
LshVU/words=10000/impl=go ~ +17.60% ~ ~ ~ ~ ~ -6.64% -6.22% -1.40% ~ ~
LshVU/words=100000/impl=go ~ +14.57% ~ ~ ~ ~ ~ ~ -5.47% ~ ~ ~
RshVU/words=1/impl=go ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ +2.72%
RshVU/words=10/impl=go ~ ~ ~ ~ ~ ~ ~ +2.50% ~ ~ ~ ~
RshVU/words=16/impl=go ~ +0.53% ~ ~ ~ ~ ~ +3.82% ~ ~ ~ ~
RshVU/words=100/impl=go ~ ~ ~ ~ ~ ~ ~ +6.18% ~ ~ ~ ~
RshVU/words=1000/impl=go ~ ~ ~ ~ ~ ~ ~ +7.00% ~ ~ ~ ~
RshVU/words=10000/impl=go ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
RshVU/words=100000/impl=go ~ ~ ~ ~ ~ ~ ~ +7.05% ~ ~ ~ ~
MulAddVWW/words=1/impl=go -10.34% +4.43% +10.62% -1.62% -4.74% -2.86% +11.75% ~ -8.00% +8.89% +3.87% ~
MulAddVWW/words=10/impl=go -1.61% -5.87% ~ -8.30% -4.55% +0.87% ~ -5.28% -20.82% ~ ~ -2.32%
MulAddVWW/words=16/impl=go -2.96% -5.28% ~ -9.22% -5.28% ~ ~ -3.74% -19.52% -1.48% -2.53% -9.52%
MulAddVWW/words=100/impl=go -3.89% -7.53% +1.93% -10.49% -4.87% -8.27% ~ ~ -0.65% -0.61% -7.59% -20.61%
MulAddVWW/words=1000/impl=go -0.45% -3.91% +4.54% -11.46% -4.69% -8.53% ~ ~ -0.05% ~ -8.88% -19.77%
MulAddVWW/words=10000/impl=go ~ -3.30% +4.10% -11.34% -4.10% -9.43% ~ -0.61% ~ -0.55% -8.21% -18.48%
MulAddVWW/words=100000/impl=go -0.30% -3.03% +4.31% -11.55% -4.41% -9.74% ~ -0.75% +0.63% ~ -7.80% -19.82%
AddMulVVWW/words=1/impl=go ~ +13.09% +12.50% -7.05% -10.41% +2.53% +13.32% -3.49% ~ +15.56% +3.62% ~
AddMulVVWW/words=10/impl=go -15.96% -9.06% -5.06% -14.56% -11.83% -5.44% -26.30% -14.23% -11.44% -1.79% -5.93% -6.60%
AddMulVVWW/words=16/impl=go -19.05% -12.43% -6.19% -14.24% -12.67% -8.65% -18.64% -16.56% -10.64% -3.00% -7.61% -12.80%
AddMulVVWW/words=100/impl=go -22.13% -16.59% -13.04% -13.79% -11.46% -12.01% -6.46% -21.80% -5.08% -3.13% -13.60% -22.53%
AddMulVVWW/words=1000/impl=go -17.07% -17.05% -14.08% -13.59% -12.13% -11.21% ~ -22.81% -4.27% -1.27% -16.35% -23.47%
AddMulVVWW/words=10000/impl=go -15.03% -16.78% -14.23% -13.86% -11.84% -11.69% ~ -22.75% -13.39% -1.10% -14.37% -22.01%
AddMulVVWW/words=100000/impl=go -13.70% -14.90% -14.26% -13.55% -12.04% -11.63% ~ -22.61% ~ -2.53% -10.42% -23.16%
Change-Id: Ic6f64344484a762b818c7090d1396afceb638607
Reviewed-on: https://go-review.googlesource.com/c/go/+/665155
Auto-Submit: Russ Cox <rsc@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Alan Donovan <adonovan@google.com>
|
|
The vast majority of the time, carry propagation is limited and
addVW/subVW only need to consider a single word for carry propagation.
As Josh Bleecher Snyder pointed out in 2019 (CL 164968), once carrying
is done, the remaining words can be handled faster with copy (memmove).
In the benchmarks below, this is the data=random case.
Even more important, if the source and destination are the same,
the copy can be optimized away entirely, making a small in-place
addition to a big.Int O(1) instead of O(N). To date, only a few
systems (amd64, arm64, and pure Go, meaning wasm) make use of this
asymptotic improvement. This is the data=shortcut case.
This CL deletes the addVW/subVW assembly and replaces it with
an optimized pure Go version. Using Go makes it easy to call
the real copy builtin, which will use optimized memmove code,
instead of recreating a worse memmove in assembly (as arm64 does)
or omitting the copy optimization entirely (as most others do).
The worst case for the Go version versus assembly is the case
of incrementing 2^N-1 by 1, which has to propagate a carry
the entire length of the array. This is the data=carry case.
On balance, we believe this case is rare enough to be worth
taking a hit in that case, in exchange for significant wins
in the other cases and the deletion of significant amounts of
assembly of varying quality. (Remember that half the assembly has
the copy optimization and shortcut, while half does not.)
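
The idea can be sketched in a few lines of Go. This is an illustrative sketch under stated assumptions, not the code from the CL: uint stands in for big.Word, and the aliasing test compares only the addresses of the first elements.

```go
package main

import (
	"fmt"
	"math/bits"
)

// addVW sets z = x + y for a multiword x and a single word y and
// returns the carry out. It assumes len(z) == len(x).
func addVW(z, x []uint, y uint) (c uint) {
	if len(x) != len(z) {
		panic("addVW: slice lengths differ")
	}
	c = y
	i := 0
	// Propagate the carry; for random data this usually stops
	// after a single word.
	for ; i < len(z) && c != 0; i++ {
		z[i], c = bits.Add(x[i], c, 0)
	}
	// The carry is done: the remaining words are a plain copy.
	// When z and x are the same slice there is nothing to copy,
	// which is the O(1) "shortcut" case.
	if i < len(z) && &z[0] != &x[0] {
		copy(z[i:], x[i:])
	}
	return c
}

func main() {
	x := []uint{^uint(0), 1, 2}
	c := addVW(x, x, 1) // in place: only two words are touched
	fmt.Println(x, c)   // prints "[0 2 2] 0"
}
```

The data=carry case corresponds to an input such as all-ones words, where the first loop runs over the whole slice and the copy never happens.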
In the benchmarks, the systems are:
c2s16 GOARCH=amd64 c2s16 perf gomote (Intel, Google Cloud)
c3h88 GOARCH=amd64 c3h88 perf gomote (newer Intel, Google Cloud)
s7 GOARCH=amd64 rsc basement server (AMD Ryzen 9 7950X)
c4as16 GOARCH=arm64 c4as16 perf gomote (Google Cloud)
mac GOARCH=arm64 Apple M3 Pro in MacBook Pro
386 GOARCH=386 gotip-linux-386 gomote
arm GOARCH=arm gotip-linux-arm gomote
loong64 GOARCH=loong64 gotip-linux-loong64 gomote
ppc64le GOARCH=ppc64le gotip-linux-ppc64le gomote
riscv64 GOARCH=riscv64 gotip-linux-riscv64 gomote
benchmark \ system c2s16 c3h88 s7 c4as16 mac 386 arm loong64 ppc64le riscv64
AddVW/words=1/data=random -1.15% -1.74% -5.89% -9.80% -11.54% +23.71% -12.74% -14.25% +14.67% +10.27%
AddVW/words=2/data=random -2.59% ~ -4.38% -19.31% -15.41% +24.80% ~ -19.99% +13.73% +19.71%
AddVW/words=3/data=random -3.75% -19.10% -3.79% -23.15% -17.04% +20.04% -10.07% -23.20% ~ +15.39%
AddVW/words=4/data=random -2.84% +7.05% -8.77% -22.64% -15.77% +16.01% -7.36% -28.22% ~ +23.00%
AddVW/words=5/data=random -10.97% +2.16% -12.09% -20.89% -17.14% +9.42% -4.69% -32.60% ~ +10.07%
AddVW/words=6/data=random -9.87% ~ -7.54% -19.08% -6.46% ~ -3.44% -34.61% ~ +12.19%
AddVW/words=7/data=random -14.36% ~ -10.09% -19.10% -10.47% -6.20% -5.06% -38.14% -11.54% +6.79%
AddVW/words=8/data=random -17.50% ~ -11.06% -25.14% -12.88% -8.35% -5.11% -41.39% -14.04% +11.87%
AddVW/words=9/data=random -19.76% -4.05% -15.47% -24.08% -16.50% -12.34% -21.56% -44.25% -14.82% ~
AddVW/words=10/data=random -13.89% ~ -9.69% -23.06% -8.04% -12.58% -19.25% -32.80% -11.68% ~
AddVW/words=16/data=random -29.36% -15.35% -21.86% -25.04% -19.89% -32.26% -16.29% -42.66% -25.92% -3.01%
AddVW/words=32/data=random -39.02% -28.76% -39.87% -11.22% -2.85% -55.40% -31.17% -55.37% -37.92% -16.28%
AddVW/words=64/data=random -25.94% -19.09% -20.60% -6.90% +8.91% -51.00% -43.72% -62.27% -44.11% -28.74%
AddVW/words=100/data=random -22.79% -18.13% -18.25% ~ +33.89% -67.40% -51.77% -63.54% -53.75% -30.97%
AddVW/words=1000/data=random -8.98% -3.84% ~ -3.15% ~ -93.35% -63.92% -65.66% -68.67% -42.30%
AddVW/words=10000/data=random -1.38% -0.38% ~ ~ ~ -89.16% -65.18% -44.65% -70.35% -20.08%
AddVW/words=100000/data=random ~ ~ ~ ~ ~ -87.03% -64.51% -36.08% -61.40% -16.53%
SubVW/words=1/data=random -3.67% ~ -8.38% -10.26% -3.07% +45.78% -6.06% -11.17% ~ ~
SubVW/words=2/data=random -3.48% -10.07% -5.76% -20.14% -8.45% +44.28% ~ -19.09% ~ +16.98%
SubVW/words=3/data=random -7.11% -26.64% -4.48% -22.07% -9.21% +35.61% ~ -23.93% -18.20% ~
SubVW/words=4/data=random -4.23% +7.19% -8.95% -22.62% -13.89% +33.20% -8.96% -29.96% ~ +22.23%
SubVW/words=5/data=random -11.49% +1.92% -10.86% -22.27% -17.53% +24.48% -2.88% -35.19% -19.55% ~
SubVW/words=6/data=random -7.67% ~ -7.72% -18.44% -6.24% +12.03% -2.00% -39.68% -10.73% ~
SubVW/words=7/data=random -13.69% -18.32% -11.82% -18.92% -11.57% +6.63% ~ -43.54% -30.81% ~
SubVW/words=8/data=random -16.02% ~ -11.07% -24.50% -11.92% +4.32% -3.01% -46.95% -24.14% ~
SubVW/words=9/data=random -18.76% -3.34% -14.84% -23.79% -17.50% ~ -21.80% -49.98% -29.62% ~
SubVW/words=10/data=random -13.23% ~ -9.25% -21.26% -11.63% ~ -18.58% -39.19% -20.09% ~
SubVW/words=16/data=random -28.25% -13.24% -22.66% -27.18% -19.13% -23.38% -20.24% -51.01% -28.06% -3.05%
SubVW/words=32/data=random -38.41% -28.88% -40.12% -11.20% -2.80% -49.17% -34.67% -63.29% -39.25% -15.20%
SubVW/words=64/data=random -25.51% -19.24% -22.20% -6.57% +9.98% -48.52% -48.14% -69.50% -49.44% -27.92%
SubVW/words=100/data=random -21.69% -18.51% ~ +1.92% +34.42% -65.88% -54.67% -71.24% -58.88% -30.71%
SubVW/words=1000/data=random -9.81% -4.05% -2.14% -3.06% ~ -93.37% -67.33% -74.12% -68.36% -42.17%
SubVW/words=10000/data=random ~ -0.52% ~ ~ ~ -88.87% -68.54% -44.94% -70.63% -19.95%
SubVW/words=100000/data=random ~ ~ ~ ~ ~ -86.69% -68.09% -48.36% -62.42% -19.32%
AddVW/words=1/data=shortcut -29.38% -25.38% -27.37% -23.15% -25.41% +3.01% -33.60% -36.12% -15.76% ~
AddVW/words=2/data=shortcut -32.79% -34.72% -31.47% -24.47% -28.21% -3.75% -34.66% -43.89% -23.65% -21.56%
AddVW/words=3/data=shortcut -38.50% -46.83% -35.67% -26.38% -30.29% -10.41% -44.89% -47.68% -30.93% -26.85%
AddVW/words=4/data=shortcut -40.40% -28.85% -34.19% -29.83% -32.95% -16.09% -42.86% -51.02% -34.19% -26.69%
AddVW/words=5/data=shortcut -43.87% -35.42% -36.46% -32.59% -37.72% -20.82% -45.14% -54.01% -35.49% -30.48%
AddVW/words=6/data=shortcut -46.98% -39.34% -42.22% -35.43% -38.18% -27.46% -46.72% -56.61% -40.21% -34.07%
AddVW/words=7/data=shortcut -49.63% -47.97% -46.61% -35.28% -41.93% -31.14% -49.29% -58.89% -41.10% -37.01%
AddVW/words=8/data=shortcut -50.48% -42.33% -45.40% -40.24% -41.74% -32.92% -50.62% -60.98% -44.85% -38.10%
AddVW/words=9/data=shortcut -54.27% -43.52% -49.06% -42.16% -45.22% -37.57% -51.84% -62.91% -46.04% -40.82%
AddVW/words=10/data=shortcut -56.01% -45.40% -51.42% -43.29% -46.14% -38.65% -53.65% -64.62% -47.05% -43.21%
AddVW/words=16/data=shortcut -62.73% -55.66% -59.31% -56.38% -54.31% -53.16% -61.03% -72.29% -58.24% -52.57%
AddVW/words=32/data=shortcut -74.00% -69.42% -71.75% -33.65% -37.35% -71.73% -72.59% -82.44% -70.87% -67.69%
AddVW/words=64/data=shortcut -56.69% -52.72% -52.09% -35.48% -36.87% -84.24% -83.10% -90.37% -82.56% -80.81%
AddVW/words=100/data=shortcut -56.68% -53.18% -51.49% -33.49% -37.72% -89.95% -88.21% -93.37% -88.47% -86.52%
AddVW/words=1000/data=shortcut -56.68% -52.45% -51.66% -35.31% -36.65% -98.88% -98.62% -99.24% -98.78% -98.41%
AddVW/words=10000/data=shortcut -56.70% -52.40% -51.92% -33.49% -36.98% -99.89% -99.86% -99.92% -99.87% -99.91%
AddVW/words=100000/data=shortcut -56.67% -52.46% -52.38% -35.31% -37.20% -99.99% -99.99% -99.99% -99.99% -99.99%
SubVW/words=1/data=shortcut -29.80% -20.71% -26.94% -23.24% -25.33% +26.97% -32.02% -37.85% -40.20% -12.67%
SubVW/words=2/data=shortcut -35.47% -36.38% -31.93% -25.43% -30.18% +18.96% -33.48% -46.48% -39.38% -18.65%
SubVW/words=3/data=shortcut -39.22% -49.96% -36.90% -25.82% -30.96% +12.53% -40.67% -51.07% -43.71% -23.78%
SubVW/words=4/data=shortcut -40.46% -24.90% -34.66% -29.87% -33.97% +4.60% -42.32% -54.92% -42.83% -22.45%
SubVW/words=5/data=shortcut -43.84% -34.17% -38.00% -32.55% -37.27% -2.46% -43.09% -58.18% -45.70% -26.45%
SubVW/words=6/data=shortcut -47.69% -37.49% -42.73% -35.90% -37.73% -8.52% -46.55% -61.01% -44.00% -30.14%
SubVW/words=7/data=shortcut -49.45% -50.66% -46.88% -34.77% -41.64% -14.46% -48.92% -63.46% -50.47% -33.39%
SubVW/words=8/data=shortcut -50.45% -39.31% -47.14% -40.47% -41.70% -15.77% -50.21% -65.64% -47.71% -34.01%
SubVW/words=9/data=shortcut -54.28% -43.07% -49.42% -41.34% -44.99% -19.39% -51.55% -67.61% -56.92% -36.82%
SubVW/words=10/data=shortcut -56.85% -47.88% -50.92% -42.76% -45.67% -23.60% -53.04% -69.34% -60.18% -39.43%
SubVW/words=16/data=shortcut -62.36% -54.83% -58.80% -55.83% -53.74% -41.04% -60.16% -76.75% -60.56% -48.63%
SubVW/words=32/data=shortcut -73.68% -68.64% -71.57% -33.52% -37.34% -64.73% -72.67% -85.89% -71.87% -64.56%
SubVW/words=64/data=shortcut -56.68% -51.66% -52.56% -34.75% -37.54% -80.30% -83.58% -92.39% -83.41% -78.70%
SubVW/words=100/data=shortcut -56.68% -50.97% -51.57% -33.68% -36.78% -87.42% -88.53% -94.84% -88.87% -84.96%
SubVW/words=1000/data=shortcut -56.68% -50.89% -52.10% -34.94% -37.77% -98.59% -98.71% -99.43% -98.80% -98.20%
SubVW/words=10000/data=shortcut -56.68% -51.00% -52.44% -33.65% -37.27% -99.86% -99.87% -99.94% -99.88% -99.90%
SubVW/words=100000/data=shortcut -56.68% -50.80% -52.20% -34.79% -37.46% -99.99% -99.99% -99.99% -99.99% -99.99%
AddVW/words=1/data=carry -0.51% -5.29% -24.03% -26.48% ~ ~ -33.14% -30.23% ~ -20.74%
AddVW/words=2/data=carry -6.36% ~ -21.05% -39.40% ~ +10.72% -29.12% -31.34% ~ -17.29%
AddVW/words=3/data=carry ~ ~ -17.46% -19.53% +17.58% ~ -26.23% -23.61% +7.80% -14.34%
AddVW/words=4/data=carry +19.02% +16.80% ~ ~ +28.25% ~ -27.90% -20.31% +19.16% ~
AddVW/words=5/data=carry +3.97% +53.02% ~ ~ +11.31% ~ -19.05% -17.47% +16.81% ~
AddVW/words=6/data=carry +2.98% +19.83% ~ ~ +14.84% ~ -18.48% -14.92% +18.25% ~
AddVW/words=7/data=carry ~ ~ ~ ~ +27.17% ~ -15.50% -12.74% +13.00% ~
AddVW/words=8/data=carry +0.58% +22.32% ~ +6.10% +29.63% ~ -13.04% ~ +28.46% +2.95%
AddVW/words=9/data=carry ~ +31.53% ~ ~ +14.42% ~ -11.32% ~ +18.37% +3.28%
AddVW/words=10/data=carry +3.94% +22.36% ~ +6.29% +19.22% ~ -11.27% ~ +20.10% +3.91%
AddVW/words=16/data=carry +2.82% +14.23% ~ +10.06% +25.91% -16.12% ~ ~ +52.28% +10.40%
AddVW/words=32/data=carry ~ +25.35% +13.66% ~ +34.89% -34.39% +6.51% -18.71% +41.06% +19.42%
AddVW/words=64/data=carry -42.03% ~ -39.70% +6.65% +32.29% -39.94% +14.34% ~ +19.68% +20.86%
AddVW/words=100/data=carry -33.95% -34.28% -39.65% ~ +27.72% -26.80% +17.40% ~ +26.39% +23.32%
AddVW/words=1000/data=carry -42.49% -47.87% -47.44% +1.25% +4.25% -41.76% +23.40% ~ +25.48% +27.99%
AddVW/words=10000/data=carry -41.85% -48.49% -49.43% ~ ~ -42.09% +24.61% -10.32% +40.55% +18.35%
AddVW/words=100000/data=carry -28.18% -48.13% -48.24% +1.35% ~ -42.90% +24.73% -9.79% +22.55% +17.16%
SubVW/words=1/data=carry -10.32% -17.16% -24.14% -26.24% ~ +18.43% -34.10% -29.54% -9.57% ~
SubVW/words=2/data=carry -19.45% -23.31% -20.74% -39.73% ~ +15.74% -28.13% -30.21% ~ -18.74%
SubVW/words=3/data=carry ~ -16.18% -15.34% -19.54% +17.62% +12.39% -27.64% -27.09% ~ -14.97%
SubVW/words=4/data=carry +11.67% +24.42% ~ ~ +25.11% +14.07% -28.08% -26.18% ~ ~
SubVW/words=5/data=carry +8.08% +25.64% ~ ~ +10.35% +8.12% -21.75% -25.50% ~ -4.86%
SubVW/words=6/data=carry ~ +13.82% ~ ~ +12.92% +6.79% -20.25% -24.70% ~ -2.74%
SubVW/words=7/data=carry ~ ~ +8.29% +4.51% +26.59% +4.62% -18.01% -24.09% ~ -1.26%
SubVW/words=8/data=carry ~ +23.16% +16.19% +6.16% +25.46% +6.74% -15.57% -22.74% ~ +1.44%
SubVW/words=9/data=carry ~ +30.71% +20.81% ~ +12.36% ~ -12.99% ~ ~ +3.13%
SubVW/words=10/data=carry +5.03% +19.53% +14.84% +14.16% +16.12% ~ -11.64% -16.00% +15.45% +3.29%
SubVW/words=16/data=carry +14.42% +15.58% +33.07% +11.43% +24.65% ~ ~ -21.90% +25.59% +9.40%
SubVW/words=32/data=carry ~ +27.57% +46.58% ~ +35.35% -8.49% ~ -24.04% +11.86% +18.40%
SubVW/words=64/data=carry -24.34% -27.83% -20.90% +13.34% +37.17% -14.90% ~ -8.81% +12.88% +18.92%
SubVW/words=100/data=carry -25.19% -34.70% -27.45% +12.86% +28.42% -14.48% ~ ~ +25.71% +21.93%
SubVW/words=1000/data=carry -24.93% -47.86% -47.26% +2.66% ~ -23.88% ~ ~ +25.99% +27.81%
SubVW/words=10000/data=carry -24.17% -36.48% -49.41% +1.06% ~ -25.06% ~ -26.50% +27.94% +18.36%
SubVW/words=100000/data=carry -22.51% -35.86% -49.46% +3.96% ~ -25.18% ~ -22.15% +26.86% +15.44%
Change-Id: I8f252073040e674780ac6ec9912082fb205329dd
Reviewed-on: https://go-review.googlesource.com/c/go/+/664898
Reviewed-by: Alan Donovan <adonovan@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
It is annoying that non-x86 implementations of shlVU and shrVU
have to go out of their way to handle the trivial case shift==0
with their own copy loops. Instead, arrange to never call them
with shift==0, so that the code can be removed.
Unfortunately, there are linknames of shlVU, so we cannot
change that function. But we can rename the functions and
then leave behind a shlVU wrapper, so do that.
Since the big.Int API calls the operations Lsh and Rsh, rename
shlVU/shrVU to lshVU/rshVU. Also rename various other shl/shr
methods and functions to lsh/rsh.
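
A minimal sketch of the wrapper arrangement follows. It assumes 64-bit words (uint64 in place of big.Word), and the panic on an out-of-range shift is a choice made for the example; the real routines may simply document the precondition.

```go
package main

import "fmt"

const wordBits = 64 // assumption: 64-bit words

// lshVU sets z = x << s and returns the bits shifted out of the top.
// It requires 0 < s < wordBits and len(z) == len(x); the s == 0 case
// is left to the caller, so no copy loop is needed here.
func lshVU(z, x []uint64, s uint) (c uint64) {
	if len(x) != len(z) {
		panic("lshVU: slice lengths differ")
	}
	if s == 0 || s >= wordBits {
		panic("lshVU: shift out of range")
	}
	if len(z) == 0 {
		return 0
	}
	c = x[len(z)-1] >> (wordBits - s)
	// Work from the top word down so that z and x may alias.
	for i := len(z) - 1; i > 0; i-- {
		z[i] = x[i]<<s | x[i-1]>>(wordBits-s)
	}
	z[0] = x[0] << s
	return c
}

// shlVU keeps the old name and the old contract (s == 0 allowed)
// for any code that still refers to it.
func shlVU(z, x []uint64, s uint) (c uint64) {
	if s == 0 {
		copy(z, x)
		return 0
	}
	return lshVU(z, x, s)
}

func main() {
	x := []uint64{1<<63 | 1, 1}
	z := make([]uint64, 2)
	c := lshVU(z, x, 1)
	fmt.Println(z, c) // prints "[2 3] 0"
}
```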
Change-Id: Ieaf54e0110a298730aa3e4566ce5be57ba7fc121
Reviewed-on: https://go-review.googlesource.com/c/go/+/664896
Reviewed-by: Alan Donovan <adonovan@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
addMulVVW is an unnecessarily special case.
All other assembly routines taking []Word (V as in vector) arguments
take separate source and destination. For example:
addVV: z = x+y
mulAddVWW: z = x*m+a
addMulVVW uses the z parameter as both destination and source:
addMulVVW: z = z+x*m
Even looking at the signatures is confusing: all the VV routines take
two input vectors x and y, but addMulVVW takes only x: where is y?
(The answer is that the two inputs are z and x.)
It would be nice to fix this, both for understandability and regularity,
and to simplify a future assembly generator.
We cannot remove or redefine addMulVVW, because it has been used
in linknames. Instead, the CL adds a new final addend argument ‘a’
like in mulAddVWW, making the natural name addMulVVWW
(two input vectors, two input words):
addMulVVWW: z = x+y*m+a
This CL updates all the assembly implementations to rename the
inputs z, x, y -> x, y, m, and then introduces a separate destination z.
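
The new signature can be sketched in pure Go as follows. This is an illustration of the formula z = x+y*m+a rather than the code from the CL; uint stands in for big.Word, and the addMulVVW wrapper shows one plausible way to keep the old entry point.

```go
package main

import (
	"fmt"
	"math/bits"
)

// addMulVVWW sets z = x + y*m + a and returns the carry word.
// All three slices are assumed to have the same length.
func addMulVVWW(z, x, y []uint, m, a uint) (c uint) {
	if len(x) != len(z) || len(y) != len(z) {
		panic("addMulVVWW: slice lengths differ")
	}
	c = a
	for i := range z {
		// y[i]*m + x[i] + c always fits in two words, so the
		// additions into hi below cannot overflow.
		hi, lo := bits.Mul(y[i], m)
		lo, cc := bits.Add(lo, x[i], 0)
		hi += cc
		lo, cc = bits.Add(lo, c, 0)
		hi += cc
		z[i] = lo
		c = hi
	}
	return c
}

// addMulVVW keeps the old in-place form z = z + x*m as a wrapper.
func addMulVVW(z, x []uint, m uint) (c uint) {
	return addMulVVWW(z, z, x, m, 0)
}

func main() {
	z := make([]uint, 2)
	c := addMulVVWW(z, []uint{1, 2}, []uint{3, 4}, 5, 6)
	fmt.Println(z, c) // prints "[22 22] 0"
}
```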
Change-Id: Ib76c80b53f6d1f4a901f663566e9c4764bb20488
Reviewed-on: https://go-review.googlesource.com/c/go/+/664895
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Alan Donovan <adonovan@google.com>
|
|
Now gc can generate the same assembly code.
Change-Id: Iac503003e14045d63e2def66408c13cee516aa37
Reviewed-on: https://go-review.googlesource.com/c/go/+/402575
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org>
Auto-Submit: Keith Randall <khr@golang.org>
Reviewed-by: Keith Randall <khr@google.com>
|
|
Code moved and functions reordered to be in a consistent
top-down dependency order, but otherwise unchanged.
First step toward commenting division algorithms.
Change-Id: Ib5e604fb5b2867edff3a228ba4e57b5cb32c4137
Reviewed-on: https://go-review.googlesource.com/c/go/+/321077
Trust: Russ Cox <rsc@golang.org>
Trust: Katie Hockman <katie@golang.org>
Trust: Robert Griesemer <gri@golang.org>
Run-TryBot: Russ Cox <rsc@golang.org>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Katie Hockman <katie@golang.org>
Reviewed-by: Robert Griesemer <gri@golang.org>
|
|
Make explicit a shrVU_g precondition.
Replace i with i+1 throughout the loop.
The resulting loop is functionally identical,
but the compiler can do better BCE without the i-1 slice offset.
Benchmark results on amd64 with -tags=math_big_pure_go.
name old time/op new time/op delta
NonZeroShifts/1/shrVU-8 4.55ns ± 2% 4.45ns ± 3% -2.27% (p=0.000 n=28+30)
NonZeroShifts/1/shlVU-8 4.07ns ± 1% 4.13ns ± 4% +1.55% (p=0.000 n=26+29)
NonZeroShifts/2/shrVU-8 6.12ns ± 1% 5.55ns ± 1% -9.30% (p=0.000 n=28+28)
NonZeroShifts/2/shlVU-8 5.65ns ± 3% 5.70ns ± 2% +0.92% (p=0.008 n=30+29)
NonZeroShifts/3/shrVU-8 7.58ns ± 2% 6.79ns ± 2% -10.46% (p=0.000 n=28+28)
NonZeroShifts/3/shlVU-8 6.62ns ± 2% 6.69ns ± 1% +1.07% (p=0.000 n=29+28)
NonZeroShifts/4/shrVU-8 9.02ns ± 1% 7.79ns ± 2% -13.59% (p=0.000 n=27+30)
NonZeroShifts/4/shlVU-8 7.74ns ± 1% 7.82ns ± 1% +0.92% (p=0.000 n=26+28)
NonZeroShifts/5/shrVU-8 10.6ns ± 1% 8.9ns ± 3% -16.31% (p=0.000 n=25+29)
NonZeroShifts/5/shlVU-8 8.59ns ± 1% 8.68ns ± 1% +1.13% (p=0.000 n=27+29)
NonZeroShifts/10/shrVU-8 18.2ns ± 2% 14.4ns ± 1% -20.96% (p=0.000 n=27+28)
NonZeroShifts/10/shlVU-8 14.1ns ± 1% 14.1ns ± 1% +0.46% (p=0.001 n=26+28)
NonZeroShifts/100/shrVU-8 161ns ± 2% 118ns ± 1% -26.83% (p=0.000 n=29+30)
NonZeroShifts/100/shlVU-8 119ns ± 2% 120ns ± 2% +0.92% (p=0.000 n=29+29)
NonZeroShifts/1000/shrVU-8 1.54µs ± 1% 1.10µs ± 1% -28.63% (p=0.000 n=29+29)
NonZeroShifts/1000/shlVU-8 1.10µs ± 1% 1.10µs ± 2% ~ (p=0.701 n=28+29)
NonZeroShifts/10000/shrVU-8 15.3µs ± 2% 10.9µs ± 1% -28.68% (p=0.000 n=28+28)
NonZeroShifts/10000/shlVU-8 10.9µs ± 2% 10.9µs ± 2% -0.57% (p=0.003 n=26+29)
NonZeroShifts/100000/shrVU-8 154µs ± 1% 111µs ± 2% -28.04% (p=0.000 n=27+28)
NonZeroShifts/100000/shlVU-8 113µs ± 2% 113µs ± 2% ~ (p=0.790 n=30+30)
Change-Id: Ib6a621ee7c88b27f0f18121fb2cba3606c40c9b0
Reviewed-on: https://go-review.googlesource.com/c/go/+/297049
Trust: Josh Bleecher Snyder <josharian@gmail.com>
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Robert Griesemer <gri@golang.org>
|
|
Division is much slower than multiplication. Replacing each division
with a multiplication by a precomputed reciprocal makes divWVW up to
about three times faster, and also speeds up nat division.
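
The reciprocal technique can be sketched for 64-bit words as below. This is a simplified illustration, not the code from the CL: it assumes the divisor is already normalized (top bit set) and that the high word of each dividend is smaller than the divisor. The names reciprocal, divWW, and divWVW are chosen for the example.

```go
package main

import (
	"fmt"
	"math/bits"
)

// reciprocal returns floor((2^128-1)/d) - 2^64 for a normalized d
// (top bit set). This is the only hardware division in the sketch.
func reciprocal(d uint64) uint64 {
	q, _ := bits.Div64(^d, ^uint64(0), d)
	return q
}

// divWW divides the two-word value (x1, x0) by d using the
// precomputed reciprocal m. It requires d normalized and x1 < d.
func divWW(x1, x0, d, m uint64) (q, r uint64) {
	// Estimate the quotient as the high word of x1*m + x1*2^64 + x0.
	// The estimate is low by at most 2.
	t1, t0 := bits.Mul64(m, x1)
	_, c := bits.Add64(t0, x0, 0)
	q, _ = bits.Add64(t1, x1, c)
	// Compute the remainder for that estimate, then correct it.
	dq1, dq0 := bits.Mul64(d, q)
	r, b := bits.Sub64(x0, dq0, 0)
	r1, _ := bits.Sub64(x1, dq1, b)
	if r1 != 0 {
		q++
		r -= d
	}
	if r >= d {
		q++
		r -= d
	}
	return q, r
}

// divWVW divides the vector x (with extra high word xn < d) by d,
// reusing one reciprocal for every word, and returns the remainder.
func divWVW(z []uint64, xn uint64, x []uint64, d uint64) (r uint64) {
	m := reciprocal(d)
	r = xn
	for i := len(z) - 1; i >= 0; i-- {
		z[i], r = divWW(r, x[i], d, m)
	}
	return r
}

func main() {
	d := uint64(1)<<63 | 12345
	q, r := divWW(7, 42, d, reciprocal(d))
	wantQ, wantR := bits.Div64(7, 42, d)
	fmt.Println(q == wantQ && r == wantR)
}
```

The cost of the single bits.Div64 call is spread over all words of the vector, which is consistent with the benchmark showing little change for one or two words and larger gains as the vector grows.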
The benchmark test on arm64 is as follows:
name old time/op new time/op delta
DivWVW/1-4 13.1ns ± 4% 13.3ns ± 4% ~ (p=0.444 n=5+5)
DivWVW/2-4 48.6ns ± 1% 51.2ns ± 2% +5.39% (p=0.008 n=5+5)
DivWVW/3-4 82.0ns ± 1% 69.7ns ± 1% -15.03% (p=0.008 n=5+5)
DivWVW/4-4 116ns ± 1% 71ns ± 2% -38.88% (p=0.008 n=5+5)
DivWVW/5-4 152ns ± 1% 84ns ± 4% -44.70% (p=0.008 n=5+5)
DivWVW/10-4 319ns ± 1% 155ns ± 4% -51.50% (p=0.008 n=5+5)
DivWVW/100-4 3.44µs ± 3% 1.30µs ± 8% -62.30% (p=0.008 n=5+5)
DivWVW/1000-4 33.8µs ± 0% 10.9µs ± 1% -67.74% (p=0.008 n=5+5)
DivWVW/10000-4 343µs ± 4% 111µs ± 5% -67.63% (p=0.008 n=5+5)
DivWVW/100000-4 3.35ms ± 1% 1.25ms ± 3% -62.79% (p=0.008 n=5+5)
QuoRem-4 3.08µs ± 2% 2.21µs ± 4% -28.40% (p=0.008 n=5+5)
ModSqrt225_Tonelli-4 444µs ± 2% 457µs ± 3% ~ (p=0.095 n=5+5)
ModSqrt225_3Mod4-4 136µs ± 1% 138µs ± 3% ~ (p=0.151 n=5+5)
ModSqrt231_Tonelli-4 473µs ± 3% 483µs ± 4% ~ (p=0.548 n=5+5)
ModSqrt231_5Mod8-4 164µs ± 9% 169µs ±12% ~ (p=0.421 n=5+5)
Sqrt-4 36.8µs ± 1% 28.6µs ± 0% -22.17% (p=0.016 n=5+4)
Div/20/10-4 50.0ns ± 3% 51.3ns ± 6% ~ (p=0.238 n=5+5)
Div/40/20-4 49.8ns ± 2% 51.3ns ± 6% ~ (p=0.222 n=5+5)
Div/100/50-4 85.8ns ± 4% 86.5ns ± 5% ~ (p=0.246 n=5+5)
Div/200/100-4 335ns ± 3% 296ns ± 2% -11.60% (p=0.008 n=5+5)
Div/400/200-4 442ns ± 2% 359ns ± 5% -18.81% (p=0.008 n=5+5)
Div/1000/500-4 858ns ± 3% 643ns ± 6% -25.06% (p=0.008 n=5+5)
Div/2000/1000-4 1.70µs ± 3% 1.28µs ± 4% -24.80% (p=0.008 n=5+5)
Div/20000/10000-4 45.0µs ± 5% 41.8µs ± 4% -7.17% (p=0.016 n=5+5)
Div/200000/100000-4 1.51ms ± 7% 1.43ms ± 3% -5.42% (p=0.016 n=5+5)
Div/2000000/1000000-4 57.6ms ± 4% 57.5ms ± 3% ~ (p=1.000 n=5+5)
Div/20000000/10000000-4 2.08s ± 3% 2.04s ± 1% ~ (p=0.095 n=5+5)
name old speed new speed delta
DivWVW/1-4 4.87GB/s ± 4% 4.80GB/s ± 4% ~ (p=0.310 n=5+5)
DivWVW/2-4 2.63GB/s ± 1% 2.50GB/s ± 2% -5.07% (p=0.008 n=5+5)
DivWVW/3-4 2.34GB/s ± 1% 2.76GB/s ± 1% +17.70% (p=0.008 n=5+5)
DivWVW/4-4 2.21GB/s ± 1% 3.61GB/s ± 2% +63.42% (p=0.008 n=5+5)
DivWVW/5-4 2.10GB/s ± 2% 3.81GB/s ± 4% +80.89% (p=0.008 n=5+5)
DivWVW/10-4 2.01GB/s ± 0% 4.13GB/s ± 4% +105.91% (p=0.008 n=5+5)
DivWVW/100-4 1.86GB/s ± 2% 4.95GB/s ± 7% +165.63% (p=0.008 n=5+5)
DivWVW/1000-4 1.89GB/s ± 0% 5.86GB/s ± 1% +209.96% (p=0.008 n=5+5)
DivWVW/10000-4 1.87GB/s ± 4% 5.76GB/s ± 5% +208.96% (p=0.008 n=5+5)
DivWVW/100000-4 1.91GB/s ± 1% 5.14GB/s ± 3% +168.85% (p=0.008 n=5+5)
Change-Id: I049f1196562b20800e6ef8a6493fd147f93ad830
Reviewed-on: https://go-review.googlesource.com/c/go/+/250417
Trust: Giovanni Bajo <rasky@develer.com>
Trust: Keith Randall <khr@golang.org>
Run-TryBot: Giovanni Bajo <rasky@develer.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
|
|
Rewrote a few lines to be more idiomatic/less assembly-ish.
Benchmarked with `go test -bench Float -tags math_big_pure_go`:
name old time/op new time/op delta
FloatString/100-8 751ns ± 0% 746ns ± 1% -0.71% (p=0.000 n=10+10)
FloatString/1000-8 22.9µs ± 0% 22.9µs ± 0% ~ (p=0.271 n=10+10)
FloatString/10000-8 1.89ms ± 0% 1.89ms ± 0% ~ (p=0.481 n=10+10)
FloatString/100000-8 184ms ± 0% 184ms ± 0% ~ (p=0.094 n=9+9)
FloatAdd/10-8 56.4ns ± 1% 56.5ns ± 0% ~ (p=0.170 n=9+9)
FloatAdd/100-8 59.7ns ± 0% 59.3ns ± 0% -0.70% (p=0.000 n=8+9)
FloatAdd/1000-8 101ns ± 0% 99ns ± 0% -1.89% (p=0.000 n=8+8)
FloatAdd/10000-8 553ns ± 0% 536ns ± 0% -3.00% (p=0.000 n=9+10)
FloatAdd/100000-8 4.94µs ± 0% 4.74µs ± 0% -3.94% (p=0.000 n=9+10)
FloatSub/10-8 50.3ns ± 0% 50.5ns ± 0% +0.52% (p=0.000 n=8+8)
FloatSub/100-8 52.0ns ± 0% 52.2ns ± 1% +0.46% (p=0.012 n=8+10)
FloatSub/1000-8 77.9ns ± 0% 77.3ns ± 0% -0.80% (p=0.000 n=7+8)
FloatSub/10000-8 371ns ± 0% 362ns ± 0% -2.67% (p=0.000 n=10+10)
FloatSub/100000-8 3.20µs ± 0% 3.10µs ± 0% -3.16% (p=0.000 n=10+10)
ParseFloatSmallExp-8 7.84µs ± 0% 7.82µs ± 0% -0.17% (p=0.037 n=9+9)
ParseFloatLargeExp-8 29.3µs ± 1% 29.5µs ± 0% ~ (p=0.059 n=9+8)
FloatSqrt/64-8 516ns ± 0% 519ns ± 0% +0.54% (p=0.000 n=9+9)
FloatSqrt/128-8 1.07µs ± 0% 1.07µs ± 0% ~ (p=0.109 n=8+9)
FloatSqrt/256-8 1.23µs ± 0% 1.23µs ± 0% +0.50% (p=0.000 n=9+9)
FloatSqrt/1000-8 3.43µs ± 0% 3.44µs ± 0% +0.53% (p=0.000 n=9+8)
FloatSqrt/10000-8 40.9µs ± 0% 40.7µs ± 0% -0.39% (p=0.000 n=9+8)
FloatSqrt/100000-8 1.07ms ± 0% 1.07ms ± 0% -0.10% (p=0.017 n=10+9)
FloatSqrt/1000000-8 89.3ms ± 0% 89.2ms ± 0% -0.07% (p=0.015 n=9+8)
Change-Id: Ibf07c6142719d11bc7f329246957d87a9f3ba3d2
GitHub-Last-Rev: 870a041ab7bb9c24be083114f53653a5f4eed611
GitHub-Pull-Request: golang/go#31220
Reviewed-on: https://go-review.googlesource.com/c/go/+/170449
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Robert Griesemer <gri@golang.org>
|
|
In the normal case, only a few words have to be updated when adding a word to a vector.
When that happens, we can simply copy the rest of the words, which is much faster.
However, the overhead of that makes it prohibitive for small vectors,
so we check the size at the beginning.
The implementation is a bit weird to allow addVW to continue to be inlined; see #30548.
The AddVW benchmarks are surprising, but fully repeatable.
The SubVW benchmarks are more or less as expected.
I expect that removing the indirect function call will
help both and make them a bit more normal.
name old time/op new time/op delta
AddVW/1-8 4.27ns ± 2% 3.81ns ± 3% -10.83% (p=0.000 n=89+90)
AddVW/2-8 4.91ns ± 2% 4.34ns ± 1% -11.60% (p=0.000 n=83+90)
AddVW/3-8 5.77ns ± 4% 5.76ns ± 2% ~ (p=0.365 n=91+87)
AddVW/4-8 6.03ns ± 1% 6.03ns ± 1% ~ (p=0.392 n=80+76)
AddVW/5-8 6.48ns ± 2% 6.63ns ± 1% +2.27% (p=0.000 n=76+74)
AddVW/10-8 9.56ns ± 2% 9.56ns ± 1% -0.02% (p=0.002 n=69+76)
AddVW/100-8 90.6ns ± 0% 18.1ns ± 4% -79.99% (p=0.000 n=72+94)
AddVW/1000-8 865ns ± 0% 85ns ± 6% -90.14% (p=0.000 n=66+96)
AddVW/10000-8 8.57µs ± 2% 1.82µs ± 3% -78.73% (p=0.000 n=99+94)
AddVW/100000-8 84.4µs ± 2% 31.8µs ± 4% -62.29% (p=0.000 n=93+98)
name old time/op new time/op delta
SubVW/1-8 3.90ns ± 2% 4.13ns ± 4% +6.02% (p=0.000 n=92+95)
SubVW/2-8 4.15ns ± 1% 5.20ns ± 1% +25.22% (p=0.000 n=83+85)
SubVW/3-8 5.50ns ± 2% 6.22ns ± 6% +13.21% (p=0.000 n=91+97)
SubVW/4-8 5.99ns ± 1% 6.63ns ± 1% +10.63% (p=0.000 n=79+61)
SubVW/5-8 6.75ns ± 4% 6.88ns ± 2% +1.82% (p=0.000 n=98+73)
SubVW/10-8 9.57ns ± 1% 9.56ns ± 1% -0.13% (p=0.000 n=77+64)
SubVW/100-8 90.3ns ± 1% 18.1ns ± 2% -80.00% (p=0.000 n=75+94)
SubVW/1000-8 860ns ± 4% 85ns ± 7% -90.14% (p=0.000 n=97+99)
SubVW/10000-8 8.51µs ± 3% 1.77µs ± 6% -79.21% (p=0.000 n=100+97)
SubVW/100000-8 84.4µs ± 3% 31.5µs ± 3% -62.66% (p=0.000 n=92+92)
Change-Id: I721d7031d40f245b4a284f5bdd93e7bb85e7e937
Reviewed-on: https://go-review.googlesource.com/c/go/+/164968
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Robert Griesemer <gri@golang.org>
|
|
These routines are quite sensitive to BCE.
This change eliminates bounds checks from loops.
It does so at the cost of a bit of safety:
malformed input will now return incorrect answers
instead of panicking.
This isn't as bad as it sounds: math/big has very good
test coverage, and the alternative implementations are in
assembly, which could do much worse things with malformed input.
If the compiler's BCE improves, so could these routines.
Notable BCE improvements for these routines would be:
* Allowing and propagating more cross-slice length hints.
Then hints like _ = y[:len(z)] would eliminate bounds checks for y[i].
* Propagating enough information so that we could do
n := len(x)
if len(z) < n {
n = len(z)
}
and then have i < n eliminate the same bounds checks as
i < len(x) && i < len(z) currently does.
* Providing some way to do BCE for unrolled loops.
Now that we have math/bits implementations,
it is possible to write things like ADC chains in
pure Go, if you can reasonably unroll loops.
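
The loop shape referred to above, with every slice length in the loop condition, might look like the following sketch. The function name addVVTruncating and the use of uint in place of big.Word are assumptions for illustration. The example also shows the safety trade-off: mismatched lengths no longer panic, they silently stop at the shortest slice.

```go
package main

import (
	"fmt"
	"math/bits"
)

// addVVTruncating adds x and y into z for as many words as all three
// slices have. Putting each length in the loop condition is what lets
// the compiler drop the bounds checks on z[i], x[i], and y[i].
func addVVTruncating(z, x, y []uint) (c uint) {
	for i := 0; i < len(z) && i < len(x) && i < len(y); i++ {
		z[i], c = bits.Add(x[i], y[i], c)
	}
	return c
}

func main() {
	z := []uint{9, 9, 9}
	// x is one word short: the last word of z is left untouched
	// instead of triggering a panic.
	c := addVVTruncating(z, []uint{1, 2}, []uint{10, 20, 30})
	fmt.Println(z, c) // prints "[11 22 9] 0"
}
```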
Benchmarks below are for amd64, using -tags=math_big_pure_go.
name old time/op new time/op delta
AddVV/1-8 5.15ns ± 3% 4.65ns ± 4% -9.81% (p=0.000 n=93+86)
AddVV/2-8 6.40ns ± 2% 5.58ns ± 4% -12.78% (p=0.000 n=90+95)
AddVV/3-8 7.07ns ± 2% 6.66ns ± 2% -5.88% (p=0.000 n=87+83)
AddVV/4-8 7.94ns ± 5% 7.41ns ± 4% -6.65% (p=0.000 n=94+98)
AddVV/5-8 8.55ns ± 1% 8.80ns ± 0% +2.92% (p=0.000 n=87+92)
AddVV/10-8 12.7ns ± 1% 12.3ns ± 1% -3.12% (p=0.000 n=83+71)
AddVV/100-8 119ns ± 5% 117ns ± 4% -1.60% (p=0.000 n=93+90)
AddVV/1000-8 1.14µs ± 4% 1.14µs ± 5% ~ (p=0.812 n=95+91)
AddVV/10000-8 11.4µs ± 5% 11.3µs ± 5% ~ (p=0.503 n=97+96)
AddVV/100000-8 114µs ± 4% 113µs ± 5% -0.98% (p=0.002 n=97+90)
name old time/op new time/op delta
SubVV/1-8 5.23ns ± 5% 4.65ns ± 3% -11.18% (p=0.000 n=89+91)
SubVV/2-8 6.49ns ± 5% 5.58ns ± 3% -14.04% (p=0.000 n=92+94)
SubVV/3-8 7.10ns ± 3% 6.65ns ± 2% -6.28% (p=0.000 n=87+80)
SubVV/4-8 8.04ns ± 1% 7.44ns ± 5% -7.49% (p=0.000 n=83+98)
SubVV/5-8 8.55ns ± 2% 8.32ns ± 1% -2.75% (p=0.000 n=84+92)
SubVV/10-8 12.7ns ± 1% 12.3ns ± 1% -3.09% (p=0.000 n=80+75)
SubVV/100-8 119ns ± 0% 116ns ± 3% -1.83% (p=0.000 n=87+98)
SubVV/1000-8 1.13µs ± 5% 1.13µs ± 3% ~ (p=0.082 n=96+98)
SubVV/10000-8 11.2µs ± 1% 11.3µs ± 3% +0.76% (p=0.000 n=87+97)
SubVV/100000-8 112µs ± 2% 113µs ± 3% +0.55% (p=0.000 n=76+88)
name old time/op new time/op delta
AddVW/1-8 4.30ns ± 4% 3.96ns ± 6% -8.02% (p=0.000 n=89+97)
AddVW/2-8 5.15ns ± 2% 4.91ns ± 1% -4.56% (p=0.000 n=87+80)
AddVW/3-8 5.59ns ± 3% 5.75ns ± 2% +2.91% (p=0.000 n=91+88)
AddVW/4-8 6.20ns ± 1% 6.03ns ± 1% -2.71% (p=0.000 n=75+90)
AddVW/5-8 6.93ns ± 3% 6.49ns ± 2% -6.35% (p=0.000 n=100+82)
AddVW/10-8 10.0ns ± 7% 9.6ns ± 0% -4.02% (p=0.000 n=98+74)
AddVW/100-8 91.1ns ± 1% 90.6ns ± 1% -0.55% (p=0.000 n=84+80)
AddVW/1000-8 866ns ± 1% 856ns ± 4% -1.06% (p=0.000 n=69+96)
AddVW/10000-8 8.64µs ± 1% 8.53µs ± 4% -1.25% (p=0.000 n=67+99)
AddVW/100000-8 84.3µs ± 2% 85.4µs ± 4% +1.22% (p=0.000 n=89+99)
name old time/op new time/op delta
SubVW/1-8 4.28ns ± 2% 3.82ns ± 3% -10.63% (p=0.000 n=91+89)
SubVW/2-8 4.61ns ± 1% 4.48ns ± 3% -2.67% (p=0.000 n=94+96)
SubVW/3-8 5.54ns ± 1% 5.81ns ± 4% +4.87% (p=0.000 n=92+97)
SubVW/4-8 6.20ns ± 1% 6.08ns ± 2% -1.99% (p=0.000 n=71+88)
SubVW/5-8 6.91ns ± 3% 6.64ns ± 1% -3.90% (p=0.000 n=97+70)
SubVW/10-8 9.85ns ± 2% 9.62ns ± 0% -2.31% (p=0.000 n=82+62)
SubVW/100-8 91.1ns ± 1% 90.9ns ± 3% -0.14% (p=0.010 n=71+93)
SubVW/1000-8 859ns ± 3% 867ns ± 1% +0.98% (p=0.000 n=99+78)
SubVW/10000-8 8.54µs ± 5% 8.57µs ± 2% +0.38% (p=0.007 n=98+92)
SubVW/100000-8 84.5µs ± 3% 84.6µs ± 3% ~ (p=0.334 n=95+94)
name old time/op new time/op delta
AddMulVVW/1-8 5.43ns ± 3% 4.36ns ± 2% -19.67% (p=0.000 n=95+94)
AddMulVVW/2-8 6.56ns ± 4% 6.11ns ± 1% -6.90% (p=0.000 n=91+91)
AddMulVVW/3-8 8.00ns ± 1% 7.80ns ± 4% -2.52% (p=0.000 n=83+95)
AddMulVVW/4-8 9.81ns ± 2% 9.53ns ± 1% -2.86% (p=0.000 n=77+64)
AddMulVVW/5-8 11.4ns ± 3% 11.3ns ± 5% -0.89% (p=0.000 n=95+97)
AddMulVVW/10-8 18.9ns ± 5% 19.1ns ± 5% +0.89% (p=0.000 n=91+94)
AddMulVVW/100-8 165ns ± 5% 165ns ± 4% ~ (p=0.427 n=97+98)
AddMulVVW/1000-8 1.56µs ± 3% 1.56µs ± 4% ~ (p=0.167 n=98+96)
AddMulVVW/10000-8 15.7µs ± 5% 15.6µs ± 5% -0.31% (p=0.044 n=95+97)
AddMulVVW/100000-8 156µs ± 3% 157µs ± 8% ~ (p=0.373 n=72+99)
Change-Id: Ibc720785d5b95f6a797103b1363843205f4d56bf
Reviewed-on: https://go-review.googlesource.com/c/go/+/164966
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
Reviewed-by: Robert Griesemer <gri@golang.org>
|
|
While we're here, delete addWW_g and subWW_g, per the TODO.
They are now obsolete.
Benchmarks on amd64 with -tags=math_big_pure_go.
name old time/op new time/op delta
AddVV/1-8 5.24ns ± 2% 5.12ns ± 1% -2.11% (p=0.000 n=82+87)
AddVV/2-8 6.44ns ± 1% 6.33ns ± 2% -1.82% (p=0.000 n=77+82)
AddVV/3-8 7.89ns ± 8% 6.97ns ± 4% -11.71% (p=0.000 n=100+96)
AddVV/4-8 8.60ns ± 0% 7.72ns ± 4% -10.24% (p=0.000 n=90+96)
AddVV/5-8 10.3ns ± 4% 8.5ns ± 1% -17.02% (p=0.000 n=96+91)
AddVV/10-8 16.2ns ± 5% 12.8ns ± 1% -21.11% (p=0.000 n=97+86)
AddVV/100-8 148ns ± 1% 117ns ± 5% -21.07% (p=0.000 n=66+98)
AddVV/1000-8 1.41µs ± 4% 1.13µs ± 3% -19.90% (p=0.000 n=97+97)
AddVV/10000-8 14.2µs ± 5% 11.2µs ± 1% -20.82% (p=0.000 n=99+84)
AddVV/100000-8 142µs ± 4% 113µs ± 4% -20.40% (p=0.000 n=91+92)
SubVV/1-8 5.29ns ± 1% 5.11ns ± 0% -3.30% (p=0.000 n=87+88)
SubVV/2-8 6.36ns ± 4% 6.33ns ± 2% -0.56% (p=0.002 n=98+73)
SubVV/3-8 7.58ns ± 5% 6.98ns ± 4% -8.01% (p=0.000 n=97+91)
SubVV/4-8 8.61ns ± 3% 7.98ns ± 2% -7.31% (p=0.000 n=95+83)
SubVV/5-8 10.6ns ± 2% 8.5ns ± 1% -19.56% (p=0.000 n=79+89)
SubVV/10-8 16.3ns ± 4% 12.7ns ± 1% -21.97% (p=0.000 n=98+82)
SubVV/100-8 124ns ± 1% 118ns ± 1% -4.83% (p=0.000 n=85+81)
SubVV/1000-8 1.14µs ± 5% 1.12µs ± 2% -1.17% (p=0.000 n=97+81)
SubVV/10000-8 11.6µs ±10% 11.2µs ± 1% -3.39% (p=0.000 n=100+84)
SubVV/100000-8 114µs ± 6% 114µs ± 5% ~ (p=0.396 n=83+94)
AddVW/1-8 4.04ns ± 4% 4.34ns ± 4% +7.57% (p=0.000 n=96+98)
AddVW/2-8 4.34ns ± 5% 4.40ns ± 5% +1.40% (p=0.000 n=99+98)
AddVW/3-8 5.43ns ± 0% 5.54ns ± 2% +1.97% (p=0.000 n=85+94)
AddVW/4-8 6.23ns ± 1% 6.18ns ± 2% -0.66% (p=0.000 n=77+78)
AddVW/5-8 6.78ns ± 2% 6.90ns ± 4% +1.77% (p=0.000 n=80+99)
AddVW/10-8 10.5ns ± 4% 9.9ns ± 1% -5.77% (p=0.000 n=97+69)
AddVW/100-8 114ns ± 3% 91ns ± 0% -20.38% (p=0.000 n=98+77)
AddVW/1000-8 1.12µs ± 1% 0.87µs ± 1% -22.80% (p=0.000 n=82+68)
AddVW/10000-8 11.2µs ± 2% 8.5µs ± 5% -23.85% (p=0.000 n=85+100)
AddVW/100000-8 112µs ± 2% 85µs ± 5% -24.22% (p=0.000 n=71+96)
SubVW/1-8 4.09ns ± 2% 4.18ns ± 4% +2.32% (p=0.000 n=78+96)
SubVW/2-8 4.59ns ± 5% 4.52ns ± 7% -1.54% (p=0.000 n=98+94)
SubVW/3-8 5.41ns ±10% 5.55ns ± 1% +2.48% (p=0.000 n=100+89)
SubVW/4-8 6.51ns ± 2% 6.19ns ± 0% -4.85% (p=0.000 n=97+81)
SubVW/5-8 7.25ns ± 3% 6.90ns ± 4% -4.93% (p=0.000 n=97+96)
SubVW/10-8 10.6ns ± 4% 9.8ns ± 2% -7.32% (p=0.000 n=95+96)
SubVW/100-8 90.4ns ± 0% 90.8ns ± 0% +0.43% (p=0.000 n=83+78)
SubVW/1000-8 853ns ± 4% 857ns ± 2% +0.42% (p=0.000 n=100+98)
SubVW/10000-8 8.52µs ± 4% 8.53µs ± 2% ~ (p=0.061 n=99+97)
SubVW/100000-8 84.8µs ± 5% 84.2µs ± 2% -0.78% (p=0.000 n=99+93)
AddMulVVW/1-8 8.73ns ± 0% 5.33ns ± 3% -38.91% (p=0.000 n=91+96)
AddMulVVW/2-8 14.8ns ± 3% 6.5ns ± 2% -56.33% (p=0.000 n=100+79)
AddMulVVW/3-8 18.6ns ± 2% 7.8ns ± 5% -57.84% (p=0.000 n=89+96)
AddMulVVW/4-8 24.0ns ± 2% 9.8ns ± 0% -59.09% (p=0.000 n=95+67)
AddMulVVW/5-8 29.0ns ± 2% 11.5ns ± 5% -60.44% (p=0.000 n=90+97)
AddMulVVW/10-8 54.1ns ± 0% 18.8ns ± 1% -65.37% (p=0.000 n=82+84)
AddMulVVW/100-8 508ns ± 2% 165ns ± 4% -67.62% (p=0.000 n=72+98)
AddMulVVW/1000-8 4.96µs ± 3% 1.55µs ± 1% -68.86% (p=0.000 n=99+91)
AddMulVVW/10000-8 50.0µs ± 4% 15.5µs ± 4% -68.95% (p=0.000 n=97+97)
AddMulVVW/100000-8 491µs ± 1% 156µs ± 8% -68.22% (p=0.000 n=79+95)
Change-Id: I4c6ae0b4065f371aea8103f6a85d9e9274bf01d0
Reviewed-on: https://go-review.googlesource.com/c/go/+/164965
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Robert Griesemer <gri@golang.org>
|
|
Special case shifts by zero.
Provide hints to the compiler that shifts are bounded.
There are no existing benchmarks for shifts,
but the Float implementation uses shifts,
so we can use those.
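A minimal sketch of both ideas, assuming a left-shift kernel over []uint. The name shlVU and the constant w are chosen for illustration and this is not the exact math/big code. Masking the shift count with w-1 is a no-op for valid inputs, but it tells the compiler the shift is smaller than the word size.

```go
package main

import (
	"fmt"
	"math/bits"
)

const w = bits.UintSize // word size in bits (32 or 64)

// shlVU sets z = x << s for 0 <= s < w and returns the bits shifted out
// of the top word. Illustrative sketch only.
func shlVU(z, x []uint, s uint) (c uint) {
	if len(z) == 0 {
		return 0
	}
	x = x[:len(z)]
	if s == 0 {
		// Special case: a shift by zero is just a copy.
		copy(z, x)
		return 0
	}
	s &= w - 1 // hint to the compiler: s < w
	t := w - s
	t &= w - 1 // hint to the compiler: t < w
	c = x[len(z)-1] >> t
	// Walk from the top word down so that z may alias x.
	for i := len(z) - 1; i > 0; i-- {
		z[i] = x[i]<<s | x[i-1]>>t
	}
	z[0] = x[0] << s
	return c
}

func main() {
	top := ^(^uint(0) >> 1) // only the most significant bit set
	x := []uint{top | 1, 1}
	z := make([]uint, 2)
	c := shlVU(z, x, 1)
	fmt.Println(z, c) // [2 3] 0
}
```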
Benchmarks on amd64 with -tags=math_big_pure_go.
name old time/op new time/op delta
FloatString/100-8 869ns ± 3% 872ns ± 4% +0.40% (p=0.001 n=94+83)
FloatString/1000-8 26.5µs ± 1% 26.4µs ± 1% -0.46% (p=0.000 n=87+96)
FloatString/10000-8 2.18ms ± 2% 2.18ms ± 2% ~ (p=0.687 n=90+89)
FloatString/100000-8 200ms ± 7% 197ms ± 5% -1.47% (p=0.000 n=100+90)
FloatAdd/10-8 65.9ns ± 4% 64.0ns ± 4% -2.94% (p=0.000 n=92+93)
FloatAdd/100-8 71.3ns ± 4% 67.4ns ± 4% -5.51% (p=0.000 n=96+93)
FloatAdd/1000-8 128ns ± 1% 121ns ± 0% -5.69% (p=0.000 n=91+80)
FloatAdd/10000-8 718ns ± 4% 626ns ± 4% -12.83% (p=0.000 n=99+99)
FloatAdd/100000-8 6.43µs ± 3% 5.50µs ± 1% -14.50% (p=0.000 n=98+83)
FloatSub/10-8 57.7ns ± 2% 57.0ns ± 4% -1.20% (p=0.000 n=89+96)
FloatSub/100-8 59.9ns ± 3% 58.7ns ± 4% -2.10% (p=0.000 n=100+98)
FloatSub/1000-8 94.5ns ± 1% 88.6ns ± 0% -6.16% (p=0.000 n=74+70)
FloatSub/10000-8 456ns ± 1% 416ns ± 5% -8.83% (p=0.000 n=87+95)
FloatSub/100000-8 4.00µs ± 1% 3.57µs ± 1% -10.87% (p=0.000 n=68+85)
FloatSqrt/64-8 585ns ± 1% 579ns ± 1% -0.99% (p=0.000 n=92+90)
FloatSqrt/128-8 1.26µs ± 1% 1.23µs ± 2% -2.42% (p=0.000 n=91+81)
FloatSqrt/256-8 1.45µs ± 3% 1.40µs ± 1% -3.61% (p=0.000 n=96+90)
FloatSqrt/1000-8 4.03µs ± 1% 3.91µs ± 1% -3.05% (p=0.000 n=90+93)
FloatSqrt/10000-8 48.0µs ± 0% 47.3µs ± 1% -1.55% (p=0.000 n=90+90)
FloatSqrt/100000-8 1.23ms ± 3% 1.22ms ± 4% -1.00% (p=0.000 n=99+99)
FloatSqrt/1000000-8 96.7ms ± 4% 98.0ms ±10% ~ (p=0.322 n=89+99)
Change-Id: I0f941c05b7c324256d7f0674559b6ba906e92ba8
Reviewed-on: https://go-review.googlesource.com/c/go/+/164967
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Robert Griesemer <gri@golang.org>
|
|
The function documentation was wrong: it referred to a parameter that does
not exist. This change replaces it with the right parameter.
The wrong formula was: q = (u1<<_W + u0 - r)/y
The function has a parameter "v" (of type Word), not a parameter "y".
So, the right formula is: q = (u1<<_W + u0 - r)/v
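The corrected formula can be checked numerically. The sketch below uses bits.Div as a stand-in for the documented function, and checkDiv is a hypothetical helper, not part of math/big. It verifies the equivalent identity q*v + r == u1<<_W + u0 with double-word arithmetic.

```go
package main

import (
	"fmt"
	"math/bits"
)

// checkDiv reports whether q*v + r == u1<<_W + u0, where (q, r) is the
// double-word quotient and remainder of (u1, u0) divided by v.
// bits.Div requires v != 0 and u1 < v; it panics otherwise.
func checkDiv(u1, u0, v uint) bool {
	q, r := bits.Div(u1, u0, v)
	hi, lo := bits.Mul(q, v)        // q*v as a double word
	lo, carry := bits.Add(lo, r, 0) // add the remainder
	hi += carry
	return hi == u1 && lo == u0 && r < v
}

func main() {
	fmt.Println(checkDiv(3, 7, 10))          // true
	fmt.Println(checkDiv(0, ^uint(0), 1234)) // true
}
```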
Fixes #28444
Change-Id: I82e57ba014735a9fdb6262874ddf498754d30d33
Reviewed-on: https://go-review.googlesource.com/c/145280
Reviewed-by: Robert Griesemer <gri@golang.org>
|
|
Verified that BenchmarkBitLen time went down from 2.25 ns/op to 0.65 ns/op
on a 2.3 GHz Intel Core i7, before removing that benchmark (now covered by
math/bits benchmarks).
Change-Id: I3890bb7d1889e95b9a94bd68f0bdf06f1885adeb
Reviewed-on: https://go-review.googlesource.com/38464
Run-TryBot: Robert Griesemer <gri@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
|
|
This change adds math/bits as a new dependency of math/big.
- use bits.LeadingZeros instead of local implementation
(they are identical, so there's no performance loss here)
- leave other functionality local (ntz, bitLen) since there are
faster implementations in math/big at the moment
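To illustrate the first point, here is a minimal sketch of a leading-zero count delegating to math/bits, compared against a naive bit loop. The helper names nlz and nlzLoop are chosen for illustration; nlzLoop is a simple reference and not the implementation that was removed.

```go
package main

import (
	"fmt"
	"math/bits"
)

// nlz returns the number of leading zero bits in x using math/bits.
func nlz(x uint) uint {
	return uint(bits.LeadingZeros(x))
}

// nlzLoop is a naive reference: it scans from the top bit down until it
// finds a set bit. For x == 0 it returns the full word size.
func nlzLoop(x uint) uint {
	n := uint(0)
	for mask := uint(1) << (bits.UintSize - 1); mask != 0 && x&mask == 0; mask >>= 1 {
		n++
	}
	return n
}

func main() {
	for _, x := range []uint{0, 1, 0x50, ^uint(0)} {
		fmt.Println(nlz(x) == nlzLoop(x)) // true for each value
	}
}
```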
Change-Id: I1218aa8a1df0cc9783583b090a4bb5a8a145c4a2
Reviewed-on: https://go-review.googlesource.com/37141
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
|
|
For compatibility with math/bits uint operations.
When math/big was originally written, the Go compiler used 32-bit
int/uint values even on a 64-bit machine. uintptr was the type that
represented the machine register size. Now, the int/uint types are
sized to the native machine register size, so they are the natural
machine Word type.
On most machines, the size of int/uint corresponds to the size of
uintptr. On platforms where uint and uintptr have different sizes,
this change may lead to performance differences (e.g., amd64p32).
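A minimal sketch of what a uint-based word type looks like alongside math/bits. The declarations are written out here for illustration rather than copied from the package, and wordBits is a hypothetical helper. The last line prints whether uint and uintptr have the same size on the platform it runs on.

```go
package main

import (
	"fmt"
	"math/bits"
	"unsafe"
)

// Word is one machine word, declared on top of uint so that it converts
// directly to the uint arguments taken by math/bits functions.
type Word uint

// _W is the word size in bits, taken from math/bits.
const _W = bits.UintSize

// wordBits returns the minimum number of bits needed to represent w.
func wordBits(w Word) int {
	return bits.Len(uint(w))
}

func main() {
	top := Word(1) << (_W - 1)
	fmt.Println(wordBits(top) == _W) // true
	// Whether uint and uintptr match in size depends on the platform.
	fmt.Println(unsafe.Sizeof(uint(0)) == unsafe.Sizeof(uintptr(0)))
}
```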
Change-Id: Ief249c160b707b6441848f20041e32e9e9d8d8ca
Reviewed-on: https://go-review.googlesource.com/37372
Run-TryBot: Robert Griesemer <gri@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
|
|
Change-Id: Ibace718452b6dc029c5af5240117f5fc794c38cf
Reviewed-on: https://go-review.googlesource.com/10388
Reviewed-by: Alan Donovan <adonovan@google.com>
|
|
Fixes #10525.
Change-Id: I92dc87f5d6db396d8dde2220fc37b7093b772d81
Reviewed-on: https://go-review.googlesource.com/9210
Reviewed-by: Robert Griesemer <gri@golang.org>
|
|
(platforms w/o corresponding assembly kernels)
For short vector adds there's some erratic slow-down, but overall
these routines have become significantly faster. This only matters
for platforms w/o native (assembly) versions of these kernels, so
we are not concerned about the minor slow-down for short vectors.
This code was already reviewed under Mercurial (golang.org/cl/172810043)
but wasn't submitted before the switch to git.
Benchmarks run on 2.3GHz Intel Core i7, running OS X 10.9.5,
with the respective AddVV and AddVW assembly routines disabled.
benchmark old ns/op new ns/op delta
BenchmarkAddVV_1 6.59 7.09 +7.59%
BenchmarkAddVV_2 10.3 10.1 -1.94%
BenchmarkAddVV_3 10.9 12.6 +15.60%
BenchmarkAddVV_4 13.9 15.6 +12.23%
BenchmarkAddVV_5 16.8 17.3 +2.98%
BenchmarkAddVV_1e1 29.5 29.9 +1.36%
BenchmarkAddVV_1e2 246 232 -5.69%
BenchmarkAddVV_1e3 2374 2185 -7.96%
BenchmarkAddVV_1e4 58942 22292 -62.18%
BenchmarkAddVV_1e5 668622 225279 -66.31%
BenchmarkAddVW_1 6.81 5.58 -18.06%
BenchmarkAddVW_2 7.69 6.86 -10.79%
BenchmarkAddVW_3 9.56 8.32 -12.97%
BenchmarkAddVW_4 12.1 9.53 -21.24%
BenchmarkAddVW_5 13.2 10.9 -17.42%
BenchmarkAddVW_1e1 23.4 18.0 -23.08%
BenchmarkAddVW_1e2 175 141 -19.43%
BenchmarkAddVW_1e3 1568 1266 -19.26%
BenchmarkAddVW_1e4 15425 12596 -18.34%
BenchmarkAddVW_1e5 156737 133539 -14.80%
BenchmarkFibo 381678466 132958666 -65.16%
benchmark old MB/s new MB/s speedup
BenchmarkAddVV_1 9715.25 9028.30 0.93x
BenchmarkAddVV_2 12461.72 12622.60 1.01x
BenchmarkAddVV_3 17549.64 15243.82 0.87x
BenchmarkAddVV_4 18392.54 16398.29 0.89x
BenchmarkAddVV_5 18995.23 18496.57 0.97x
BenchmarkAddVV_1e1 21708.98 21438.28 0.99x
BenchmarkAddVV_1e2 25956.53 27506.88 1.06x
BenchmarkAddVV_1e3 26947.93 29286.66 1.09x
BenchmarkAddVV_1e4 10857.96 28709.46 2.64x
BenchmarkAddVV_1e5 9571.91 28409.21 2.97x
BenchmarkAddVW_1 1175.28 1433.98 1.22x
BenchmarkAddVW_2 2080.01 2332.54 1.12x
BenchmarkAddVW_3 2509.28 2883.97 1.15x
BenchmarkAddVW_4 2646.09 3356.83 1.27x
BenchmarkAddVW_5 3020.69 3671.07 1.22x
BenchmarkAddVW_1e1 3425.76 4441.40 1.30x
BenchmarkAddVW_1e2 4553.17 5642.96 1.24x
BenchmarkAddVW_1e3 5100.14 6318.72 1.24x
BenchmarkAddVW_1e4 5186.15 6350.96 1.22x
BenchmarkAddVW_1e5 5104.07 5990.74 1.17x
Change-Id: I7a62023b1105248a0e85e5b9819d3fd4266123d4
Reviewed-on: https://go-review.googlesource.com/2480
Reviewed-by: Russ Cox <rsc@golang.org>
Reviewed-by: Alan Donovan <adonovan@google.com>
|
|
Preparation was in CL 134570043.
This CL contains only the effect of 'hg mv src/pkg/* src'.
For more about the move, see golang.org/s/go14nopkg.
|