go - Fork of Go programming language with my patches.

Age	Commit message (Collapse)	Author
2025-04-18	math/big: replace addVW/subVW assembly with fast pure Go	Russ Cox
	The vast majority of the time, carry propagation is limited and addVW/subVW only need to consider a single word for carry propagation. As Josh Bleecher-Snyder pointed out in 2019 (CL 164968), once carrying is done, the remaining words can be handled faster with copy (memmove). In the benchmarks below, this is the data=random case. Even more important, if the source and destination are the same, the copy can be optimized away entirely, making a small in-place addition to a big.Int O(1) instead of O(N). To date, only a few systems (amd64, arm64, and pure Go, meaning wasm) make use of this asymptotic improvement. This is the data=shortcut case. This CL deletes the addVW/subVW assembly and replaces it with an optimized pure Go version. Using Go makes it easy to call the real copy builtin, which will use optimized memmove code, instead of recreating a worse memmove in assembly (as arm64 does) or omitting the copy optimization entirely (as most others do). The worst case for the Go version versus assembly is the case of incrementing 2^N-1 by 1, which has to propagate a carry the entire length of the array. This is the data=carry case. On balance, we believe this case is rare enough to be worth taking a hit in that case, in exchange for significant wins in the other cases and the deletion of significant amounts of assembly of varying quality. (Remember that half the assembly has the copy optimization and shortcut, while half does not.) In the benchmarks, the systems are: c2s16 GOARCH=amd64 c2s16 perf gomote (Intel, Google Cloud) c3h88 GOARCH=amd64 c3h88 perf gomote (newer Intel, Google Cloud) s7 GOARCH=amd64 rsc basement server (AMD Ryzen 9 7950X) c4as16 GOARCH=arm64 c4as16 perf gomote (Google Cloud) mac GOARCH=arm64 Apple M3 Pro in MacBook Pro 386 GOARCH=386 gotip-linux-386 gomote arm GOARCH=arm gotip-linux-arm gomote loong64 GOARCH=loong64 gotip-linux-loong64 gomote ppc64le GOARCH=ppc64le gotip-linux-ppc64le gomote riscv64 GOARCH=riscv64 gotip-linux-riscv64 gomote benchmark \ system c2s16 c3h88 s7 c4as16 mac 386 arm loong64 ppc64le riscv64 AddVW/words=1/data=random -1.15% -1.74% -5.89% -9.80% -11.54% +23.71% -12.74% -14.25% +14.67% +10.27% AddVW/words=2/data=random -2.59% ~ -4.38% -19.31% -15.41% +24.80% ~ -19.99% +13.73% +19.71% AddVW/words=3/data=random -3.75% -19.10% -3.79% -23.15% -17.04% +20.04% -10.07% -23.20% ~ +15.39% AddVW/words=4/data=random -2.84% +7.05% -8.77% -22.64% -15.77% +16.01% -7.36% -28.22% ~ +23.00% AddVW/words=5/data=random -10.97% +2.16% -12.09% -20.89% -17.14% +9.42% -4.69% -32.60% ~ +10.07% AddVW/words=6/data=random -9.87% ~ -7.54% -19.08% -6.46% ~ -3.44% -34.61% ~ +12.19% AddVW/words=7/data=random -14.36% ~ -10.09% -19.10% -10.47% -6.20% -5.06% -38.14% -11.54% +6.79% AddVW/words=8/data=random -17.50% ~ -11.06% -25.14% -12.88% -8.35% -5.11% -41.39% -14.04% +11.87% AddVW/words=9/data=random -19.76% -4.05% -15.47% -24.08% -16.50% -12.34% -21.56% -44.25% -14.82% ~ AddVW/words=10/data=random -13.89% ~ -9.69% -23.06% -8.04% -12.58% -19.25% -32.80% -11.68% ~ AddVW/words=16/data=random -29.36% -15.35% -21.86% -25.04% -19.89% -32.26% -16.29% -42.66% -25.92% -3.01% AddVW/words=32/data=random -39.02% -28.76% -39.87% -11.22% -2.85% -55.40% -31.17% -55.37% -37.92% -16.28% AddVW/words=64/data=random -25.94% -19.09% -20.60% -6.90% +8.91% -51.00% -43.72% -62.27% -44.11% -28.74% AddVW/words=100/data=random -22.79% -18.13% -18.25% ~ +33.89% -67.40% -51.77% -63.54% -53.75% -30.97% AddVW/words=1000/data=random -8.98% -3.84% ~ -3.15% ~ -93.35% -63.92% -65.66% -68.67% -42.30% AddVW/words=10000/data=random -1.38% -0.38% ~ ~ ~ -89.16% -65.18% -44.65% -70.35% -20.08% AddVW/words=100000/data=random ~ ~ ~ ~ ~ -87.03% -64.51% -36.08% -61.40% -16.53% SubVW/words=1/data=random -3.67% ~ -8.38% -10.26% -3.07% +45.78% -6.06% -11.17% ~ ~ SubVW/words=2/data=random -3.48% -10.07% -5.76% -20.14% -8.45% +44.28% ~ -19.09% ~ +16.98% SubVW/words=3/data=random -7.11% -26.64% -4.48% -22.07% -9.21% +35.61% ~ -23.93% -18.20% ~ SubVW/words=4/data=random -4.23% +7.19% -8.95% -22.62% -13.89% +33.20% -8.96% -29.96% ~ +22.23% SubVW/words=5/data=random -11.49% +1.92% -10.86% -22.27% -17.53% +24.48% -2.88% -35.19% -19.55% ~ SubVW/words=6/data=random -7.67% ~ -7.72% -18.44% -6.24% +12.03% -2.00% -39.68% -10.73% ~ SubVW/words=7/data=random -13.69% -18.32% -11.82% -18.92% -11.57% +6.63% ~ -43.54% -30.81% ~ SubVW/words=8/data=random -16.02% ~ -11.07% -24.50% -11.92% +4.32% -3.01% -46.95% -24.14% ~ SubVW/words=9/data=random -18.76% -3.34% -14.84% -23.79% -17.50% ~ -21.80% -49.98% -29.62% ~ SubVW/words=10/data=random -13.23% ~ -9.25% -21.26% -11.63% ~ -18.58% -39.19% -20.09% ~ SubVW/words=16/data=random -28.25% -13.24% -22.66% -27.18% -19.13% -23.38% -20.24% -51.01% -28.06% -3.05% SubVW/words=32/data=random -38.41% -28.88% -40.12% -11.20% -2.80% -49.17% -34.67% -63.29% -39.25% -15.20% SubVW/words=64/data=random -25.51% -19.24% -22.20% -6.57% +9.98% -48.52% -48.14% -69.50% -49.44% -27.92% SubVW/words=100/data=random -21.69% -18.51% ~ +1.92% +34.42% -65.88% -54.67% -71.24% -58.88% -30.71% SubVW/words=1000/data=random -9.81% -4.05% -2.14% -3.06% ~ -93.37% -67.33% -74.12% -68.36% -42.17% SubVW/words=10000/data=random ~ -0.52% ~ ~ ~ -88.87% -68.54% -44.94% -70.63% -19.95% SubVW/words=100000/data=random ~ ~ ~ ~ ~ -86.69% -68.09% -48.36% -62.42% -19.32% AddVW/words=1/data=shortcut -29.38% -25.38% -27.37% -23.15% -25.41% +3.01% -33.60% -36.12% -15.76% ~ AddVW/words=2/data=shortcut -32.79% -34.72% -31.47% -24.47% -28.21% -3.75% -34.66% -43.89% -23.65% -21.56% AddVW/words=3/data=shortcut -38.50% -46.83% -35.67% -26.38% -30.29% -10.41% -44.89% -47.68% -30.93% -26.85% AddVW/words=4/data=shortcut -40.40% -28.85% -34.19% -29.83% -32.95% -16.09% -42.86% -51.02% -34.19% -26.69% AddVW/words=5/data=shortcut -43.87% -35.42% -36.46% -32.59% -37.72% -20.82% -45.14% -54.01% -35.49% -30.48% AddVW/words=6/data=shortcut -46.98% -39.34% -42.22% -35.43% -38.18% -27.46% -46.72% -56.61% -40.21% -34.07% AddVW/words=7/data=shortcut -49.63% -47.97% -46.61% -35.28% -41.93% -31.14% -49.29% -58.89% -41.10% -37.01% AddVW/words=8/data=shortcut -50.48% -42.33% -45.40% -40.24% -41.74% -32.92% -50.62% -60.98% -44.85% -38.10% AddVW/words=9/data=shortcut -54.27% -43.52% -49.06% -42.16% -45.22% -37.57% -51.84% -62.91% -46.04% -40.82% AddVW/words=10/data=shortcut -56.01% -45.40% -51.42% -43.29% -46.14% -38.65% -53.65% -64.62% -47.05% -43.21% AddVW/words=16/data=shortcut -62.73% -55.66% -59.31% -56.38% -54.31% -53.16% -61.03% -72.29% -58.24% -52.57% AddVW/words=32/data=shortcut -74.00% -69.42% -71.75% -33.65% -37.35% -71.73% -72.59% -82.44% -70.87% -67.69% AddVW/words=64/data=shortcut -56.69% -52.72% -52.09% -35.48% -36.87% -84.24% -83.10% -90.37% -82.56% -80.81% AddVW/words=100/data=shortcut -56.68% -53.18% -51.49% -33.49% -37.72% -89.95% -88.21% -93.37% -88.47% -86.52% AddVW/words=1000/data=shortcut -56.68% -52.45% -51.66% -35.31% -36.65% -98.88% -98.62% -99.24% -98.78% -98.41% AddVW/words=10000/data=shortcut -56.70% -52.40% -51.92% -33.49% -36.98% -99.89% -99.86% -99.92% -99.87% -99.91% AddVW/words=100000/data=shortcut -56.67% -52.46% -52.38% -35.31% -37.20% -99.99% -99.99% -99.99% -99.99% -99.99% SubVW/words=1/data=shortcut -29.80% -20.71% -26.94% -23.24% -25.33% +26.97% -32.02% -37.85% -40.20% -12.67% SubVW/words=2/data=shortcut -35.47% -36.38% -31.93% -25.43% -30.18% +18.96% -33.48% -46.48% -39.38% -18.65% SubVW/words=3/data=shortcut -39.22% -49.96% -36.90% -25.82% -30.96% +12.53% -40.67% -51.07% -43.71% -23.78% SubVW/words=4/data=shortcut -40.46% -24.90% -34.66% -29.87% -33.97% +4.60% -42.32% -54.92% -42.83% -22.45% SubVW/words=5/data=shortcut -43.84% -34.17% -38.00% -32.55% -37.27% -2.46% -43.09% -58.18% -45.70% -26.45% SubVW/words=6/data=shortcut -47.69% -37.49% -42.73% -35.90% -37.73% -8.52% -46.55% -61.01% -44.00% -30.14% SubVW/words=7/data=shortcut -49.45% -50.66% -46.88% -34.77% -41.64% -14.46% -48.92% -63.46% -50.47% -33.39% SubVW/words=8/data=shortcut -50.45% -39.31% -47.14% -40.47% -41.70% -15.77% -50.21% -65.64% -47.71% -34.01% SubVW/words=9/data=shortcut -54.28% -43.07% -49.42% -41.34% -44.99% -19.39% -51.55% -67.61% -56.92% -36.82% SubVW/words=10/data=shortcut -56.85% -47.88% -50.92% -42.76% -45.67% -23.60% -53.04% -69.34% -60.18% -39.43% SubVW/words=16/data=shortcut -62.36% -54.83% -58.80% -55.83% -53.74% -41.04% -60.16% -76.75% -60.56% -48.63% SubVW/words=32/data=shortcut -73.68% -68.64% -71.57% -33.52% -37.34% -64.73% -72.67% -85.89% -71.87% -64.56% SubVW/words=64/data=shortcut -56.68% -51.66% -52.56% -34.75% -37.54% -80.30% -83.58% -92.39% -83.41% -78.70% SubVW/words=100/data=shortcut -56.68% -50.97% -51.57% -33.68% -36.78% -87.42% -88.53% -94.84% -88.87% -84.96% SubVW/words=1000/data=shortcut -56.68% -50.89% -52.10% -34.94% -37.77% -98.59% -98.71% -99.43% -98.80% -98.20% SubVW/words=10000/data=shortcut -56.68% -51.00% -52.44% -33.65% -37.27% -99.86% -99.87% -99.94% -99.88% -99.90% SubVW/words=100000/data=shortcut -56.68% -50.80% -52.20% -34.79% -37.46% -99.99% -99.99% -99.99% -99.99% -99.99% AddVW/words=1/data=carry -0.51% -5.29% -24.03% -26.48% ~ ~ -33.14% -30.23% ~ -20.74% AddVW/words=2/data=carry -6.36% ~ -21.05% -39.40% ~ +10.72% -29.12% -31.34% ~ -17.29% AddVW/words=3/data=carry ~ ~ -17.46% -19.53% +17.58% ~ -26.23% -23.61% +7.80% -14.34% AddVW/words=4/data=carry +19.02% +16.80% ~ ~ +28.25% ~ -27.90% -20.31% +19.16% ~ AddVW/words=5/data=carry +3.97% +53.02% ~ ~ +11.31% ~ -19.05% -17.47% +16.81% ~ AddVW/words=6/data=carry +2.98% +19.83% ~ ~ +14.84% ~ -18.48% -14.92% +18.25% ~ AddVW/words=7/data=carry ~ ~ ~ ~ +27.17% ~ -15.50% -12.74% +13.00% ~ AddVW/words=8/data=carry +0.58% +22.32% ~ +6.10% +29.63% ~ -13.04% ~ +28.46% +2.95% AddVW/words=9/data=carry ~ +31.53% ~ ~ +14.42% ~ -11.32% ~ +18.37% +3.28% AddVW/words=10/data=carry +3.94% +22.36% ~ +6.29% +19.22% ~ -11.27% ~ +20.10% +3.91% AddVW/words=16/data=carry +2.82% +14.23% ~ +10.06% +25.91% -16.12% ~ ~ +52.28% +10.40% AddVW/words=32/data=carry ~ +25.35% +13.66% ~ +34.89% -34.39% +6.51% -18.71% +41.06% +19.42% AddVW/words=64/data=carry -42.03% ~ -39.70% +6.65% +32.29% -39.94% +14.34% ~ +19.68% +20.86% AddVW/words=100/data=carry -33.95% -34.28% -39.65% ~ +27.72% -26.80% +17.40% ~ +26.39% +23.32% AddVW/words=1000/data=carry -42.49% -47.87% -47.44% +1.25% +4.25% -41.76% +23.40% ~ +25.48% +27.99% AddVW/words=10000/data=carry -41.85% -48.49% -49.43% ~ ~ -42.09% +24.61% -10.32% +40.55% +18.35% AddVW/words=100000/data=carry -28.18% -48.13% -48.24% +1.35% ~ -42.90% +24.73% -9.79% +22.55% +17.16% SubVW/words=1/data=carry -10.32% -17.16% -24.14% -26.24% ~ +18.43% -34.10% -29.54% -9.57% ~ SubVW/words=2/data=carry -19.45% -23.31% -20.74% -39.73% ~ +15.74% -28.13% -30.21% ~ -18.74% SubVW/words=3/data=carry ~ -16.18% -15.34% -19.54% +17.62% +12.39% -27.64% -27.09% ~ -14.97% SubVW/words=4/data=carry +11.67% +24.42% ~ ~ +25.11% +14.07% -28.08% -26.18% ~ ~ SubVW/words=5/data=carry +8.08% +25.64% ~ ~ +10.35% +8.12% -21.75% -25.50% ~ -4.86% SubVW/words=6/data=carry ~ +13.82% ~ ~ +12.92% +6.79% -20.25% -24.70% ~ -2.74% SubVW/words=7/data=carry ~ ~ +8.29% +4.51% +26.59% +4.62% -18.01% -24.09% ~ -1.26% SubVW/words=8/data=carry ~ +23.16% +16.19% +6.16% +25.46% +6.74% -15.57% -22.74% ~ +1.44% SubVW/words=9/data=carry ~ +30.71% +20.81% ~ +12.36% ~ -12.99% ~ ~ +3.13% SubVW/words=10/data=carry +5.03% +19.53% +14.84% +14.16% +16.12% ~ -11.64% -16.00% +15.45% +3.29% SubVW/words=16/data=carry +14.42% +15.58% +33.07% +11.43% +24.65% ~ ~ -21.90% +25.59% +9.40% SubVW/words=32/data=carry ~ +27.57% +46.58% ~ +35.35% -8.49% ~ -24.04% +11.86% +18.40% SubVW/words=64/data=carry -24.34% -27.83% -20.90% +13.34% +37.17% -14.90% ~ -8.81% +12.88% +18.92% SubVW/words=100/data=carry -25.19% -34.70% -27.45% +12.86% +28.42% -14.48% ~ ~ +25.71% +21.93% SubVW/words=1000/data=carry -24.93% -47.86% -47.26% +2.66% ~ -23.88% ~ ~ +25.99% +27.81% SubVW/words=10000/data=carry -24.17% -36.48% -49.41% +1.06% ~ -25.06% ~ -26.50% +27.94% +18.36% SubVW/words=100000/data=carry -22.51% -35.86% -49.46% +3.96% ~ -25.18% ~ -22.15% +26.86% +15.44% Change-Id: I8f252073040e674780ac6ec9912082fb205329dd Reviewed-on: https://go-review.googlesource.com/c/go/+/664898 Reviewed-by: Alan Donovan <adonovan@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2025-04-18	math/big: add more complete tests and benchmarks of assembly	Russ Cox
	Also fix a few real but currently harmless bugs from CL 664895. There were a few places that were still wrong if z != x or if a != 0. Change-Id: Id8971e2505523bc4708780c82bf998a546f4f081 Reviewed-on: https://go-review.googlesource.com/c/go/+/664897 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Alan Donovan <adonovan@google.com>
2025-04-11	math/big: remove copy responsibility from, rename shlVU, shrVU	Russ Cox
	It is annoying that non-x86 implementations of shlVU and shrVU have to go out of their way to handle the trivial case shift==0 with their own copy loops. Instead, arrange to never call them with shift==0, so that the code can be removed. Unfortunately, there are linknames of shlVU, so we cannot change that function. But we can rename the functions and then leave behind a shlVU wrapper, so do that. Since the big.Int API calls the operations Lsh and Rsh, rename shlVU/shrVU to lshVU/rshVU. Also rename various other shl/shr methods and functions to lsh/rsh. Change-Id: Ieaf54e0110a298730aa3e4566ce5be57ba7fc121 Reviewed-on: https://go-review.googlesource.com/c/go/+/664896 Reviewed-by: Alan Donovan <adonovan@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2025-04-11	math/big: replace addMulVVW with addMulVVWW	Russ Cox
	addMulVVW is an unnecessarily special case. All other assembly routines taking []Word (V as in vector) arguments take separate source and destination. For example: addVV: z = x+y mulAddVWW: z = xm+a addMulVVW uses the z parameter as both destination and source: addMulVVW: z = z+xm Even looking at the signatures is confusing: all the VV routines take two input vectors x and y, but addMulVVW takes only x: where is y? (The answer is that the two inputs are z and x.) It would be nice to fix this, both for understandability and regularity, and to simplify a future assembly generator. We cannot remove or redefine addMulVVW, because it has been used in linknames. Instead, the CL adds a new final addend argument ‘a’ like in mulAddVWW, making the natural name addMulVVWW (two input vectors, two input words): addMulVVWW: z = x+y*m+a This CL updates all the assembly implementations to rename the inputs z, x, y -> x, y, m, and then introduces a separate destination z. Change-Id: Ib76c80b53f6d1f4a901f663566e9c4764bb20488 Reviewed-on: https://go-review.googlesource.com/c/go/+/664895 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Alan Donovan <adonovan@google.com>
2025-02-27	math/big: replace nat pool with Word stack	Russ Cox
	In the early days of math/big, algorithms that needed more space grew the result larger than it needed to be and then used the high words as extra space. This made results their own temporary space caches, at the cost that saving a result in a data structure might hold significantly more memory than necessary. Specifically, new(big.Int).Mul(x, y) returned a big.Int with a backing slice 3X as big as it strictly needed to be. If you are storing many multiplication results, or even a single large result, the 3X overhead can add up. This approach to storage for temporaries also requires being able to analyze the algorithms to predict the exact amount they need, which can be difficult. For both these reasons, the implementation of recursive long division, which came later, introduced a “nat pool” where temporaries could be stored and reused, or reclaimed by the GC when no longer used. This avoids the storage and bookkeeping overheads but introduces a per-temporary sync.Pool overhead. divRecursiveStep takes an array of cached temporaries to remove some of that overhead. The nat pool was better but is still not quite right. This CL introduces something even better than the nat pool (still probably not quite right, but the best I can see for now): a sync.Pool holding stacks for allocating temporaries. Now an operation can get one stack out of the pool and then allocate as many temporaries as it needs during the operation, eventually returning the stack back to the pool. The sync.Pool operations are now per-exported-operation (like big.Int.Mul), not per-temporary. This CL converts both the pre-allocation in nat.mul and the uses of the nat pool to use stack pools instead. This simplifies some code and sets us up better for more complex algorithms (such as Toom-Cook or FFT-based multiplication) that need more temporaries. It is also a little bit faster. goos: linux goarch: amd64 pkg: math/big cpu: Intel(R) Xeon(R) CPU @ 3.10GHz │ old │ new │ │ sec/op │ sec/op vs base │ Div/20/10-16 23.68n ± 0% 22.21n ± 0% -6.21% (p=0.000 n=15) Div/40/20-16 23.68n ± 0% 22.21n ± 0% -6.21% (p=0.000 n=15) Div/100/50-16 56.65n ± 0% 55.53n ± 0% -1.98% (p=0.000 n=15) Div/200/100-16 194.6n ± 1% 172.8n ± 0% -11.20% (p=0.000 n=15) Div/400/200-16 232.1n ± 0% 206.7n ± 0% -10.94% (p=0.000 n=15) Div/1000/500-16 405.3n ± 1% 383.8n ± 0% -5.30% (p=0.000 n=15) Div/2000/1000-16 810.4n ± 1% 795.2n ± 0% -1.88% (p=0.000 n=15) Div/20000/10000-16 25.88µ ± 0% 25.39µ ± 0% -1.89% (p=0.000 n=15) Div/200000/100000-16 931.5µ ± 0% 924.3µ ± 0% -0.77% (p=0.000 n=15) Div/2000000/1000000-16 37.77m ± 0% 37.75m ± 0% ~ (p=0.098 n=15) Div/20000000/10000000-16 1.367 ± 0% 1.377 ± 0% +0.72% (p=0.003 n=15) NatMul/10-16 168.5n ± 3% 164.0n ± 4% ~ (p=0.751 n=15) NatMul/100-16 6.086µ ± 3% 5.380µ ± 3% -11.60% (p=0.000 n=15) NatMul/1000-16 238.1µ ± 3% 228.3µ ± 1% -4.12% (p=0.000 n=15) NatMul/10000-16 8.721m ± 2% 8.518m ± 1% -2.33% (p=0.000 n=15) NatMul/100000-16 369.6m ± 0% 371.1m ± 0% +0.42% (p=0.000 n=15) geomean 19.57µ 18.74µ -4.21% │ old │ new │ │ B/op │ B/op vs base │ NatMul/10-16 192.0 ± 0% 192.0 ± 0% ~ (p=1.000 n=15) ¹ NatMul/100-16 4.750Ki ± 0% 1.751Ki ± 0% -63.14% (p=0.000 n=15) NatMul/1000-16 48.16Ki ± 0% 16.02Ki ± 0% -66.73% (p=0.000 n=15) NatMul/10000-16 482.9Ki ± 1% 165.4Ki ± 3% -65.75% (p=0.000 n=15) NatMul/100000-16 5.747Mi ± 7% 4.197Mi ± 0% -26.97% (p=0.000 n=15) geomean 41.42Ki 20.63Ki -50.18% ¹ all samples are equal │ old │ new │ │ allocs/op │ allocs/op vs base │ NatMul/10-16 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹ NatMul/100-16 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹ NatMul/1000-16 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹ NatMul/10000-16 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹ NatMul/100000-16 7.000 ± 14% 7.000 ± 14% ~ (p=0.668 n=15) geomean 1.476 1.476 +0.00% ¹ all samples are equal goos: linux goarch: amd64 pkg: math/big cpu: Intel(R) Xeon(R) Platinum 8481C CPU @ 2.70GHz │ old │ new │ │ sec/op │ sec/op vs base │ Div/20/10-88 15.84n ± 1% 13.12n ± 0% -17.17% (p=0.000 n=15) Div/40/20-88 15.88n ± 1% 13.12n ± 0% -17.38% (p=0.000 n=15) Div/100/50-88 26.42n ± 0% 25.47n ± 0% -3.60% (p=0.000 n=15) Div/200/100-88 132.4n ± 0% 114.9n ± 0% -13.22% (p=0.000 n=15) Div/400/200-88 150.1n ± 0% 135.6n ± 0% -9.66% (p=0.000 n=15) Div/1000/500-88 275.5n ± 0% 264.1n ± 0% -4.14% (p=0.000 n=15) Div/2000/1000-88 586.5n ± 0% 581.1n ± 0% -0.92% (p=0.000 n=15) Div/20000/10000-88 25.87µ ± 0% 25.72µ ± 0% -0.59% (p=0.000 n=15) Div/200000/100000-88 772.2µ ± 0% 779.0µ ± 0% +0.88% (p=0.000 n=15) Div/2000000/1000000-88 33.36m ± 0% 33.63m ± 0% +0.80% (p=0.000 n=15) Div/20000000/10000000-88 1.307 ± 0% 1.320 ± 0% +1.03% (p=0.000 n=15) NatMul/10-88 140.4n ± 0% 148.8n ± 4% +5.98% (p=0.000 n=15) NatMul/100-88 4.663µ ± 1% 4.388µ ± 1% -5.90% (p=0.000 n=15) NatMul/1000-88 207.7µ ± 0% 205.8µ ± 0% -0.89% (p=0.000 n=15) NatMul/10000-88 8.456m ± 0% 8.468m ± 0% +0.14% (p=0.021 n=15) NatMul/100000-88 295.1m ± 0% 297.9m ± 0% +0.94% (p=0.000 n=15) geomean 14.96µ 14.33µ -4.23% │ old │ new │ │ B/op │ B/op vs base │ NatMul/10-88 192.0 ± 0% 192.0 ± 0% ~ (p=1.000 n=15) ¹ NatMul/100-88 4.750Ki ± 0% 1.758Ki ± 0% -62.99% (p=0.000 n=15) NatMul/1000-88 48.44Ki ± 0% 16.08Ki ± 0% -66.80% (p=0.000 n=15) NatMul/10000-88 489.7Ki ± 1% 166.1Ki ± 3% -66.08% (p=0.000 n=15) NatMul/100000-88 5.546Mi ± 0% 3.819Mi ± 60% -31.15% (p=0.000 n=15) geomean 41.29Ki 20.30Ki -50.85% ¹ all samples are equal │ old │ new │ │ allocs/op │ allocs/op vs base │ NatMul/10-88 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹ NatMul/100-88 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹ NatMul/1000-88 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹ NatMul/10000-88 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹ NatMul/100000-88 5.000 ± 20% 6.000 ± 67% ~ (p=0.672 n=15) geomean 1.380 1.431 +3.71% ¹ all samples are equal goos: linux goarch: arm64 pkg: math/big │ old │ new │ │ sec/op │ sec/op vs base │ Div/20/10-16 15.85n ± 0% 15.23n ± 0% -3.91% (p=0.000 n=15) Div/40/20-16 15.88n ± 0% 15.22n ± 0% -4.16% (p=0.000 n=15) Div/100/50-16 29.69n ± 0% 26.39n ± 0% -11.11% (p=0.000 n=15) Div/200/100-16 149.2n ± 0% 123.3n ± 0% -17.36% (p=0.000 n=15) Div/400/200-16 160.3n ± 0% 139.2n ± 0% -13.16% (p=0.000 n=15) Div/1000/500-16 271.0n ± 0% 256.1n ± 0% -5.50% (p=0.000 n=15) Div/2000/1000-16 545.3n ± 0% 527.0n ± 0% -3.36% (p=0.000 n=15) Div/20000/10000-16 22.60µ ± 0% 22.20µ ± 0% -1.77% (p=0.000 n=15) Div/200000/100000-16 889.0µ ± 0% 892.2µ ± 0% +0.35% (p=0.000 n=15) Div/2000000/1000000-16 38.01m ± 0% 38.12m ± 0% +0.30% (p=0.000 n=15) Div/20000000/10000000-16 1.437 ± 0% 1.444 ± 0% +0.50% (p=0.000 n=15) NatMul/10-16 166.4n ± 2% 169.5n ± 1% +1.86% (p=0.000 n=15) NatMul/100-16 5.733µ ± 1% 5.570µ ± 1% -2.84% (p=0.000 n=15) NatMul/1000-16 232.6µ ± 1% 229.8µ ± 0% -1.22% (p=0.000 n=15) NatMul/10000-16 9.039m ± 1% 8.969m ± 0% -0.77% (p=0.000 n=15) NatMul/100000-16 367.0m ± 0% 368.8m ± 0% +0.48% (p=0.000 n=15) geomean 16.15µ 15.50µ -4.01% │ old │ new │ │ B/op │ B/op vs base │ NatMul/10-16 192.0 ± 0% 192.0 ± 0% ~ (p=1.000 n=15) ¹ NatMul/100-16 4.750Ki ± 0% 1.751Ki ± 0% -63.14% (p=0.000 n=15) NatMul/1000-16 48.33Ki ± 0% 16.02Ki ± 0% -66.85% (p=0.000 n=15) NatMul/10000-16 536.5Ki ± 1% 165.7Ki ± 3% -69.12% (p=0.000 n=15) NatMul/100000-16 6.078Mi ± 6% 4.197Mi ± 0% -30.94% (p=0.000 n=15) geomean 42.81Ki 20.64Ki -51.78% ¹ all samples are equal │ old │ new │ │ allocs/op │ allocs/op vs base │ NatMul/10-16 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹ NatMul/100-16 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹ NatMul/1000-16 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹ NatMul/10000-16 2.000 ± 50% 1.000 ± 0% -50.00% (p=0.001 n=15) NatMul/100000-16 9.000 ± 11% 8.000 ± 12% -11.11% (p=0.001 n=15) geomean 1.783 1.516 -14.97% ¹ all samples are equal goos: darwin goarch: arm64 pkg: math/big cpu: Apple M3 Pro │ old │ new │ │ sec/op │ sec/op vs base │ Div/20/10-12 9.850n ± 1% 9.405n ± 1% -4.52% (p=0.000 n=15) Div/40/20-12 9.858n ± 0% 9.403n ± 1% -4.62% (p=0.000 n=15) Div/100/50-12 16.40n ± 1% 14.81n ± 0% -9.70% (p=0.000 n=15) Div/200/100-12 88.48n ± 2% 80.88n ± 0% -8.59% (p=0.000 n=15) Div/400/200-12 107.90n ± 1% 99.28n ± 1% -7.99% (p=0.000 n=15) Div/1000/500-12 188.8n ± 1% 178.6n ± 1% -5.40% (p=0.000 n=15) Div/2000/1000-12 399.9n ± 0% 389.1n ± 0% -2.70% (p=0.000 n=15) Div/20000/10000-12 13.94µ ± 2% 13.81µ ± 1% ~ (p=0.574 n=15) Div/200000/100000-12 523.8µ ± 0% 521.7µ ± 0% -0.40% (p=0.000 n=15) Div/2000000/1000000-12 21.46m ± 0% 21.48m ± 0% ~ (p=0.067 n=15) Div/20000000/10000000-12 812.5m ± 0% 812.9m ± 0% ~ (p=0.061 n=15) NatMul/10-12 77.14n ± 0% 78.35n ± 1% +1.57% (p=0.000 n=15) NatMul/100-12 2.999µ ± 0% 2.871µ ± 1% -4.27% (p=0.000 n=15) NatMul/1000-12 126.2µ ± 0% 126.8µ ± 0% +0.51% (p=0.011 n=15) NatMul/10000-12 5.099m ± 0% 5.125m ± 0% +0.51% (p=0.000 n=15) NatMul/100000-12 206.7m ± 0% 208.4m ± 0% +0.80% (p=0.000 n=15) geomean 9.512µ 9.236µ -2.91% │ old │ new │ │ B/op │ B/op vs base │ NatMul/10-12 192.0 ± 0% 192.0 ± 0% ~ (p=1.000 n=15) ¹ NatMul/100-12 4.750Ki ± 0% 1.750Ki ± 0% -63.16% (p=0.000 n=15) NatMul/1000-12 48.13Ki ± 0% 16.01Ki ± 0% -66.73% (p=0.000 n=15) NatMul/10000-12 483.5Ki ± 1% 163.2Ki ± 2% -66.24% (p=0.000 n=15) NatMul/100000-12 5.480Mi ± 4% 1.532Mi ± 104% -72.05% (p=0.000 n=15) geomean 41.03Ki 16.82Ki -59.01% ¹ all samples are equal │ old │ new │ │ allocs/op │ allocs/op vs base │ NatMul/10-12 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹ NatMul/100-12 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹ NatMul/1000-12 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹ NatMul/10000-12 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹ NatMul/100000-12 5.000 ± 0% 1.000 ± 400% -80.00% (p=0.007 n=15) geomean 1.380 1.000 -27.52% ¹ all samples are equal Change-Id: I7efa6fe37971ed26ae120a32250fcb47ece0a011 Reviewed-on: https://go-review.googlesource.com/c/go/+/650638 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Auto-Submit: Russ Cox <rsc@golang.org> Reviewed-by: Ian Lance Taylor <iant@google.com> Reviewed-by: Alan Donovan <adonovan@google.com>
2022-11-02	math/big: use Montgomery for z.Exp(x, y, m) even for even m	Russ Cox
	Montgomery multiplication can be used for Exp mod even m by splitting it into two steps - Exp mod an odd number and Exp mod a power of two - and then combining the two results using the Chinese Remainder Theorem. For more details, see Ç. K. Koç, “Montgomery Reduction with Even Modulus”, IEE Proceedings: Computers and Digital Techniques, 141(5) 314-316, September 1994. http://www.people.vcu.edu/~jwang3/CMSC691/j34monex.pdf Thanks to Guido Vranken for suggesting that we use a faster algorithm. name old time/op new time/op delta ExpMont/Odd-16 240µs ± 2% 239µs ± 2% ~ (p=0.853 n=10+10) ExpMont/Even1-16 757µs ± 3% 249µs ± 6% -67.17% (p=0.000 n=10+10) ExpMont/Even2-16 755µs ± 1% 244µs ± 4% -67.64% (p=0.000 n=8+10) ExpMont/Even3-16 771µs ± 3% 240µs ± 2% -68.90% (p=0.000 n=10+10) ExpMont/Even4-16 775µs ± 3% 241µs ± 2% -68.91% (p=0.000 n=10+10) ExpMont/Even8-16 779µs ± 2% 241µs ± 3% -69.06% (p=0.000 n=9+10) ExpMont/Even32-16 778µs ± 3% 240µs ± 4% -69.11% (p=0.000 n=9+9) ExpMont/Even64-16 774µs ± 6% 186µs ± 2% -76.00% (p=0.000 n=10+10) ExpMont/Even96-16 776µs ± 4% 186µs ± 6% -76.09% (p=0.000 n=9+10) ExpMont/Even128-16 764µs ± 2% 143µs ± 3% -81.24% (p=0.000 n=10+10) ExpMont/Even255-16 761µs ± 3% 109µs ± 2% -85.73% (p=0.000 n=10+10) ExpMont/SmallEven1-16 45.6µs ± 1% 36.3µs ± 3% -20.49% (p=0.000 n=9+10) ExpMont/SmallEven2-16 44.3µs ± 2% 37.5µs ± 2% -15.26% (p=0.000 n=9+10) ExpMont/SmallEven3-16 44.1µs ± 5% 37.3µs ± 3% -15.32% (p=0.000 n=9+10) ExpMont/SmallEven4-16 47.1µs ± 6% 38.0µs ± 5% -19.40% (p=0.000 n=10+9) name old alloc/op new alloc/op delta ExpMont/Odd-16 2.53kB ± 0% 2.53kB ± 0% ~ (p=0.137 n=8+10) ExpMont/Even1-16 2.57kB ± 0% 3.31kB ± 0% +28.90% (p=0.000 n=8+10) ExpMont/Even2-16 2.57kB ± 0% 3.37kB ± 0% +31.09% (p=0.000 n=9+10) ExpMont/Even3-16 2.57kB ± 0% 3.37kB ± 0% +31.08% (p=0.000 n=8+8) ExpMont/Even4-16 2.57kB ± 0% 3.37kB ± 0% +31.09% (p=0.000 n=9+10) ExpMont/Even8-16 2.57kB ± 0% 3.37kB ± 0% +31.09% (p=0.000 n=9+10) ExpMont/Even32-16 2.57kB ± 0% 3.37kB ± 0% +31.14% (p=0.000 n=10+10) ExpMont/Even64-16 2.57kB ± 0% 3.16kB ± 0% +22.99% (p=0.000 n=9+9) ExpMont/Even96-16 2.57kB ± 0% 3.44kB ± 0% +33.90% (p=0.000 n=10+8) ExpMont/Even128-16 2.57kB ± 0% 2.88kB ± 0% +12.10% (p=0.000 n=10+10) ExpMont/Even255-16 2.57kB ± 0% 2.54kB ± 0% -1.30% (p=0.000 n=9+10) ExpMont/SmallEven1-16 872B ± 0% 1232B ± 0% +41.28% (p=0.000 n=10+10) ExpMont/SmallEven2-16 872B ± 0% 1233B ± 0% +41.40% (p=0.000 n=10+9) ExpMont/SmallEven3-16 872B ± 0% 1289B ± 0% +47.82% (p=0.000 n=10+10) ExpMont/SmallEven4-16 872B ± 0% 1241B ± 0% +42.32% (p=0.000 n=10+10) name old allocs/op new allocs/op delta ExpMont/Odd-16 21.0 ± 0% 21.0 ± 0% ~ (all equal) ExpMont/Even1-16 24.0 ± 0% 38.0 ± 0% +58.33% (p=0.000 n=10+10) ExpMont/Even2-16 24.0 ± 0% 40.0 ± 0% +66.67% (p=0.000 n=10+10) ExpMont/Even3-16 24.0 ± 0% 40.0 ± 0% +66.67% (p=0.000 n=10+10) ExpMont/Even4-16 24.0 ± 0% 40.0 ± 0% +66.67% (p=0.000 n=10+10) ExpMont/Even8-16 24.0 ± 0% 40.0 ± 0% +66.67% (p=0.000 n=10+10) ExpMont/Even32-16 24.0 ± 0% 40.0 ± 0% +66.67% (p=0.000 n=10+10) ExpMont/Even64-16 24.0 ± 0% 39.0 ± 0% +62.50% (p=0.000 n=10+10) ExpMont/Even96-16 24.0 ± 0% 42.0 ± 0% +75.00% (p=0.000 n=10+10) ExpMont/Even128-16 24.0 ± 0% 40.0 ± 0% +66.67% (p=0.000 n=10+10) ExpMont/Even255-16 24.0 ± 0% 38.0 ± 0% +58.33% (p=0.000 n=10+10) ExpMont/SmallEven1-16 16.0 ± 0% 35.0 ± 0% +118.75% (p=0.000 n=10+10) ExpMont/SmallEven2-16 16.0 ± 0% 35.0 ± 0% +118.75% (p=0.000 n=10+10) ExpMont/SmallEven3-16 16.0 ± 0% 37.0 ± 0% +131.25% (p=0.000 n=10+10) ExpMont/SmallEven4-16 16.0 ± 0% 36.0 ± 0% +125.00% (p=0.000 n=10+10) Change-Id: Ib7f70290f8f087b78805ec3120baf17dd7737ac3 Reviewed-on: https://go-review.googlesource.com/c/go/+/420897 Run-TryBot: Russ Cox <rsc@golang.org> Reviewed-by: Roland Shoemaker <roland@golang.org> Auto-Submit: Russ Cox <rsc@golang.org> TryBot-Result: Gopher Robot <gobot@golang.org>
2022-04-28	math/big: delete all mulWW assembly code	Wayne Zuo
	Now gc can generate the same assembly code. Change-Id: Iac503003e14045d63e2def66408c13cee516aa37 Reviewed-on: https://go-review.googlesource.com/c/go/+/402575 TryBot-Result: Gopher Robot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org> Reviewed-by: Dmitri Shuralyov <dmitshur@google.com> Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org> Auto-Submit: Keith Randall <khr@golang.org> Reviewed-by: Keith Randall <khr@google.com>
2022-04-01	all: fix TODO comment hanging indents	Russ Cox
	For whatever reason (perhaps some tool does this), a handful of comments, including some doc comments, have TODOs formatted like: // TODO(name): Text here and // more text aligned // under first text. In doc comments the second line turns into a <pre> block, which is undesirable in this context. Rewrite those to unindent, like this instead: // TODO(name): Text here and // more text aligned // at left column. For #51082. Change-Id: Ibf5145659a61ebf9496f016752a709a7656d2d4b Reviewed-on: https://go-review.googlesource.com/c/go/+/384258 Trust: Russ Cox <rsc@golang.org> Run-TryBot: Russ Cox <rsc@golang.org> Reviewed-by: Ian Lance Taylor <iant@golang.org> TryBot-Result: Gopher Robot <gobot@golang.org>
2021-03-07	math/big: add shrVU and shlVU benchmarks	Josh Bleecher Snyder
	Change-Id: Id67d6ac856bd9271de99c3381bde910aa0c166e0 Reviewed-on: https://go-review.googlesource.com/c/go/+/296011 Trust: Josh Bleecher Snyder <josharian@gmail.com> Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com> TryBot-Result: Go Bot <gobot@golang.org> Reviewed-by: Robert Griesemer <gri@golang.org>
2020-12-03	math/big: remove the s390x assembly for shlVU and shrVU	Jonathan Albrecht
	The s390x assembly for shlVU does a forward copy when the shift amount s is 0. This causes corruption of the result z when z is aliased to the input x. This fix removes the s390x assembly for both shlVU and shrVU so the pure go implementations will be used. Test cases have been added to the existing TestShiftOverlap test to cover shift values of 0, 1 and (_W - 1). Fixes #42838 Change-Id: I75ca0e98f3acfaa6366a26355dcd9dd82499a48b Reviewed-on: https://go-review.googlesource.com/c/go/+/274442 Run-TryBot: Robert Griesemer <gri@golang.org> TryBot-Result: Go Bot <gobot@golang.org> Reviewed-by: Robert Griesemer <gri@golang.org> Reviewed-by: Matthew Dempsky <mdempsky@google.com> Trust: Robert Griesemer <gri@golang.org>
2020-09-23	math/big: replace division with multiplication by reciprocal word	SparrowLii
	Division is much slower than multiplication. And the method of using multiplication by multiplying reciprocal and replacing division with it can increase the speed of divWVW algorithm by three times,and at the same time increase the speed of nats division. The benchmark test on arm64 is as follows: name old time/op new time/op delta DivWVW/1-4 13.1ns ± 4% 13.3ns ± 4% ~ (p=0.444 n=5+5) DivWVW/2-4 48.6ns ± 1% 51.2ns ± 2% +5.39% (p=0.008 n=5+5) DivWVW/3-4 82.0ns ± 1% 69.7ns ± 1% -15.03% (p=0.008 n=5+5) DivWVW/4-4 116ns ± 1% 71ns ± 2% -38.88% (p=0.008 n=5+5) DivWVW/5-4 152ns ± 1% 84ns ± 4% -44.70% (p=0.008 n=5+5) DivWVW/10-4 319ns ± 1% 155ns ± 4% -51.50% (p=0.008 n=5+5) DivWVW/100-4 3.44µs ± 3% 1.30µs ± 8% -62.30% (p=0.008 n=5+5) DivWVW/1000-4 33.8µs ± 0% 10.9µs ± 1% -67.74% (p=0.008 n=5+5) DivWVW/10000-4 343µs ± 4% 111µs ± 5% -67.63% (p=0.008 n=5+5) DivWVW/100000-4 3.35ms ± 1% 1.25ms ± 3% -62.79% (p=0.008 n=5+5) QuoRem-4 3.08µs ± 2% 2.21µs ± 4% -28.40% (p=0.008 n=5+5) ModSqrt225_Tonelli-4 444µs ± 2% 457µs ± 3% ~ (p=0.095 n=5+5) ModSqrt225_3Mod4-4 136µs ± 1% 138µs ± 3% ~ (p=0.151 n=5+5) ModSqrt231_Tonelli-4 473µs ± 3% 483µs ± 4% ~ (p=0.548 n=5+5) ModSqrt231_5Mod8-4 164µs ± 9% 169µs ±12% ~ (p=0.421 n=5+5) Sqrt-4 36.8µs ± 1% 28.6µs ± 0% -22.17% (p=0.016 n=5+4) Div/20/10-4 50.0ns ± 3% 51.3ns ± 6% ~ (p=0.238 n=5+5) Div/40/20-4 49.8ns ± 2% 51.3ns ± 6% ~ (p=0.222 n=5+5) Div/100/50-4 85.8ns ± 4% 86.5ns ± 5% ~ (p=0.246 n=5+5) Div/200/100-4 335ns ± 3% 296ns ± 2% -11.60% (p=0.008 n=5+5) Div/400/200-4 442ns ± 2% 359ns ± 5% -18.81% (p=0.008 n=5+5) Div/1000/500-4 858ns ± 3% 643ns ± 6% -25.06% (p=0.008 n=5+5) Div/2000/1000-4 1.70µs ± 3% 1.28µs ± 4% -24.80% (p=0.008 n=5+5) Div/20000/10000-4 45.0µs ± 5% 41.8µs ± 4% -7.17% (p=0.016 n=5+5) Div/200000/100000-4 1.51ms ± 7% 1.43ms ± 3% -5.42% (p=0.016 n=5+5) Div/2000000/1000000-4 57.6ms ± 4% 57.5ms ± 3% ~ (p=1.000 n=5+5) Div/20000000/10000000-4 2.08s ± 3% 2.04s ± 1% ~ (p=0.095 n=5+5) name old speed new speed delta DivWVW/1-4 4.87GB/s ± 4% 4.80GB/s ± 4% ~ (p=0.310 n=5+5) DivWVW/2-4 2.63GB/s ± 1% 2.50GB/s ± 2% -5.07% (p=0.008 n=5+5) DivWVW/3-4 2.34GB/s ± 1% 2.76GB/s ± 1% +17.70% (p=0.008 n=5+5) DivWVW/4-4 2.21GB/s ± 1% 3.61GB/s ± 2% +63.42% (p=0.008 n=5+5) DivWVW/5-4 2.10GB/s ± 2% 3.81GB/s ± 4% +80.89% (p=0.008 n=5+5) DivWVW/10-4 2.01GB/s ± 0% 4.13GB/s ± 4% +105.91% (p=0.008 n=5+5) DivWVW/100-4 1.86GB/s ± 2% 4.95GB/s ± 7% +165.63% (p=0.008 n=5+5) DivWVW/1000-4 1.89GB/s ± 0% 5.86GB/s ± 1% +209.96% (p=0.008 n=5+5) DivWVW/10000-4 1.87GB/s ± 4% 5.76GB/s ± 5% +208.96% (p=0.008 n=5+5) DivWVW/100000-4 1.91GB/s ± 1% 5.14GB/s ± 3% +168.85% (p=0.008 n=5+5) Change-Id: I049f1196562b20800e6ef8a6493fd147f93ad830 Reviewed-on: https://go-review.googlesource.com/c/go/+/250417 Trust: Giovanni Bajo <rasky@develer.com> Trust: Keith Randall <khr@golang.org> Run-TryBot: Giovanni Bajo <rasky@develer.com> TryBot-Result: Go Bot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>
2020-08-28	math/big: tune addVW/subVW performance on arm64	Xiangdong Ji
	Add an optimization for addVW and subVW over large-sized vectors, it switches from add/sub with carry to copy the rest of the vector when we are done with carries. Consistent performance improvement are observed on various arm64 machines. Add additional tests and benchmarks to increase the test coverage. TestFunVWExt: Testing with various types of input vector, using the result from go-version addVW/subVW as golden reference. BenchmarkAddVWext and BenchmarkSubVWext: Benchmarking using input vector having all 1s or all 0s, for evaluating the overhead of worst case. 1. Perf. comparison over randomly generated input vectors: Server 1: name old time/op new time/op delta AddVW/1 12.3ns ± 3% 12.0ns ± 0% -2.60% (p=0.001 n=10+8) AddVW/2 12.5ns ± 2% 12.3ns ± 0% -1.84% (p=0.001 n=10+8) AddVW/3 12.6ns ± 2% 12.3ns ± 0% -1.91% (p=0.009 n=10+10) AddVW/4 13.1ns ± 3% 12.7ns ± 0% -2.98% (p=0.006 n=10+8) AddVW/5 14.4ns ± 1% 13.9ns ± 0% -3.81% (p=0.000 n=10+10) AddVW/10 11.7ns ± 0% 11.7ns ± 0% ~ (all equal) AddVW/100 47.8ns ± 0% 29.9ns ± 2% -37.38% (p=0.000 n=10+9) AddVW/1000 446ns ± 0% 207ns ± 0% -53.59% (p=0.000 n=10+10) AddVW/10000 4.35µs ± 1% 2.92µs ± 0% -32.85% (p=0.000 n=10+10) AddVW/100000 43.6µs ± 0% 29.7µs ± 0% -31.92% (p=0.000 n=8+10) SubVW/1 12.6ns ± 0% 12.3ns ± 2% -2.22% (p=0.000 n=7+10) SubVW/2 12.7ns ± 0% 12.6ns ± 1% -0.39% (p=0.046 n=8+10) SubVW/3 12.7ns ± 1% 12.6ns ± 1% ~ (p=0.410 n=10+10) SubVW/4 13.3ns ± 3% 13.1ns ± 3% ~ (p=0.072 n=10+10) SubVW/5 14.2ns ± 0% 14.1ns ± 1% -0.63% (p=0.046 n=8+10) SubVW/10 11.7ns ± 0% 11.7ns ± 0% ~ (all equal) SubVW/100 47.8ns ± 0% 33.1ns ±19% -30.71% (p=0.000 n=10+10) SubVW/1000 446ns ± 0% 207ns ± 0% -53.59% (p=0.000 n=10+10) SubVW/10000 4.33µs ± 1% 2.92µs ± 0% -32.66% (p=0.000 n=10+6) SubVW/100000 43.4µs ± 0% 29.6µs ± 0% -31.90% (p=0.000 n=10+9) Server 2: name old time/op new time/op delta AddVW/1 5.49ns ± 0% 5.53ns ± 2% ~ (p=1.000 n=9+10) AddVW/2 5.96ns ± 2% 5.92ns ± 1% -0.69% (p=0.039 n=10+10) AddVW/3 6.72ns ± 0% 6.73ns ± 0% ~ (p=0.078 n=10+10) AddVW/4 7.07ns ± 0% 6.75ns ± 2% -4.55% (p=0.000 n=10+10) AddVW/5 8.14ns ± 0% 8.17ns ± 0% +0.46% (p=0.003 n=8+8) AddVW/10 10.0ns ± 0% 10.1ns ± 1% +0.70% (p=0.003 n=10+10) AddVW/100 43.0ns ± 0% 33.5ns ± 0% -22.09% (p=0.000 n=9+9) AddVW/1000 394ns ± 0% 278ns ± 0% -29.44% (p=0.000 n=10+10) AddVW/10000 4.18µs ± 0% 3.14µs ± 0% -24.81% (p=0.000 n=8+8) AddVW/100000 68.3µs ± 3% 62.1µs ± 5% -9.13% (p=0.000 n=10+10) SubVW/1 5.37ns ± 2% 5.42ns ± 1% ~ (p=0.990 n=10+10) SubVW/2 5.89ns ± 0% 5.92ns ± 1% +0.58% (p=0.000 n=8+10) SubVW/3 6.64ns ± 1% 6.82ns ± 3% +2.63% (p=0.000 n=9+10) SubVW/4 7.17ns ± 0% 6.69ns ± 2% -6.74% (p=0.000 n=10+9) SubVW/5 8.22ns ± 0% 8.18ns ± 0% -0.46% (p=0.001 n=8+9) SubVW/10 10.0ns ± 1% 10.1ns ± 1% ~ (p=0.341 n=10+10) SubVW/100 43.0ns ± 0% 33.5ns ± 0% -22.09% (p=0.000 n=7+10) SubVW/1000 394ns ± 0% 278ns ± 0% -29.44% (p=0.000 n=10+10) SubVW/10000 4.18µs ± 0% 3.15µs ± 0% -24.62% (p=0.000 n=9+9) SubVW/100000 67.7µs ± 4% 62.4µs ± 2% -7.92% (p=0.000 n=10+10) 2. Perf. comparison over input vectors of all 1s or all 0s Server 1: name old time/op new time/op delta AddVWext/1 12.6ns ± 0% 12.0ns ± 0% -4.76% (p=0.000 n=6+10) AddVWext/2 12.7ns ± 0% 12.4ns ± 1% -2.52% (p=0.000 n=10+10) AddVWext/3 12.7ns ± 0% 12.4ns ± 0% -2.36% (p=0.000 n=9+7) AddVWext/4 13.2ns ± 4% 12.7ns ± 0% -3.71% (p=0.001 n=10+9) AddVWext/5 14.6ns ± 0% 13.9ns ± 0% -4.79% (p=0.000 n=10+8) AddVWext/10 11.7ns ± 0% 11.7ns ± 0% ~ (all equal) AddVWext/100 47.8ns ± 0% 47.4ns ± 0% -0.84% (p=0.000 n=10+10) AddVWext/1000 446ns ± 0% 399ns ± 0% -10.54% (p=0.000 n=10+10) AddVWext/10000 4.34µs ± 1% 3.90µs ± 0% -10.12% (p=0.000 n=10+10) AddVWext/100000 43.9µs ± 1% 39.4µs ± 0% -10.18% (p=0.000 n=10+10) SubVWext/1 12.6ns ± 0% 12.3ns ± 2% -2.70% (p=0.000 n=7+10) SubVWext/2 12.6ns ± 1% 12.6ns ± 2% ~ (p=0.234 n=10+10) SubVWext/3 12.7ns ± 0% 12.6ns ± 2% -0.71% (p=0.033 n=10+10) SubVWext/4 13.4ns ± 0% 13.1ns ± 3% -2.01% (p=0.006 n=8+10) SubVWext/5 14.2ns ± 0% 14.1ns ± 1% -0.85% (p=0.003 n=10+10) SubVWext/10 11.7ns ± 0% 11.7ns ± 0% ~ (all equal) SubVWext/100 47.8ns ± 0% 47.4ns ± 0% -0.84% (p=0.000 n=10+10) SubVWext/1000 446ns ± 0% 399ns ± 0% -10.54% (p=0.000 n=10+10) SubVWext/10000 4.33µs ± 1% 3.90µs ± 0% -10.02% (p=0.000 n=10+10) SubVWext/100000 43.5µs ± 0% 39.5µs ± 1% -9.16% (p=0.000 n=7+10) Server 2: name old time/op new time/op delta AddVWext/1 5.48ns ± 0% 5.43ns ± 1% -0.97% (p=0.000 n=9+9) AddVWext/2 5.99ns ± 2% 5.93ns ± 1% ~ (p=0.054 n=10+10) AddVWext/3 6.74ns ± 0% 6.79ns ± 1% +0.80% (p=0.000 n=9+10) AddVWext/4 7.18ns ± 0% 7.21ns ± 1% +0.36% (p=0.034 n=9+10) AddVWext/5 7.93ns ± 3% 8.18ns ± 0% +3.18% (p=0.000 n=10+8) AddVWext/10 10.0ns ± 0% 10.1ns ± 1% +0.60% (p=0.011 n=10+10) AddVWext/100 43.0ns ± 0% 47.7ns ± 0% +10.93% (p=0.000 n=9+10) AddVWext/1000 394ns ± 0% 399ns ± 0% +1.27% (p=0.000 n=10+10) AddVWext/10000 4.18µs ± 0% 4.50µs ± 0% +7.73% (p=0.000 n=9+10) AddVWext/100000 67.6µs ± 2% 68.4µs ± 3% ~ (p=0.139 n=9+8) SubVWext/1 5.46ns ± 1% 5.43ns ± 0% -0.55% (p=0.002 n=9+9) SubVWext/2 5.89ns ± 0% 5.93ns ± 1% +0.68% (p=0.000 n=8+10) SubVWext/3 6.72ns ± 1% 6.79ns ± 1% +1.07% (p=0.000 n=10+10) SubVWext/4 6.98ns ± 1% 7.21ns ± 0% +3.25% (p=0.000 n=10+10) SubVWext/5 8.22ns ± 0% 7.99ns ± 3% -2.83% (p=0.000 n=8+10) SubVWext/10 10.0ns ± 1% 10.1ns ± 1% ~ (p=0.239 n=10+10) SubVWext/100 43.0ns ± 0% 47.7ns ± 0% +10.93% (p=0.000 n=8+10) SubVWext/1000 394ns ± 0% 399ns ± 0% +1.27% (p=0.000 n=10+10) SubVWext/10000 4.18µs ± 0% 4.51µs ± 0% +7.86% (p=0.000 n=8+8) SubVWext/100000 68.3µs ± 2% 68.0µs ± 3% ~ (p=0.515 n=10+8) Change-Id: I134a5194b8a2deaaebbaa2b771baf72846971d58 Reviewed-on: https://go-review.googlesource.com/c/go/+/229739 Reviewed-by: Cherry Zhang <cherryyz@google.com> Reviewed-by: Robert Griesemer <gri@golang.org> Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org>
2020-08-25	math/big: fix TestShiftOverlap for test -count arguments > 1	SparrowLii
	Don't overwrite incoming test data. The change uses copy instead of assigning statement to avoid this. Change-Id: Ib907101822d811de5c45145cb9d7961907e212c3 Reviewed-on: https://go-review.googlesource.com/c/go/+/250137 Run-TryBot: Emmanuel Odeke <emm.odeke@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Robert Griesemer <gri@golang.org>
2019-05-08	math/big: fix the bug in assembly implementation of shlVU on arm64	erifan01
	For the case where the addresses of parameter z and x of the function shlVU overlap and the address of z is greater than x, x (input value) can be polluted during the calculation when the high words of x are overlapped with the low words of z (output value). Fixes #31084 Change-Id: I9bb0266a1d7856b8faa9a9b1975d6f57dece0479 Reviewed-on: https://go-review.googlesource.com/c/go/+/169780 Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com>
2019-04-22	math/big: optimize mulAddVWW on arm64 for better performance	erifan01
	Unroll the cycle 4 times to reduce load overhead. Benchmarks: name old time/op new time/op delta MulAddVWW/1-8 15.9ns ± 0% 11.9ns ± 0% -24.92% (p=0.000 n=8+8) MulAddVWW/2-8 16.1ns ± 0% 13.9ns ± 1% -13.82% (p=0.000 n=8+8) MulAddVWW/3-8 18.9ns ± 0% 17.3ns ± 0% -8.47% (p=0.000 n=8+8) MulAddVWW/4-8 21.7ns ± 0% 19.5ns ± 0% -10.14% (p=0.000 n=8+8) MulAddVWW/5-8 25.1ns ± 0% 22.5ns ± 0% -10.27% (p=0.000 n=8+8) MulAddVWW/10-8 41.6ns ± 0% 40.0ns ± 0% -3.79% (p=0.000 n=8+8) MulAddVWW/100-8 368ns ± 0% 363ns ± 0% -1.36% (p=0.000 n=8+8) MulAddVWW/1000-8 3.52µs ± 0% 3.52µs ± 0% -0.14% (p=0.000 n=8+8) MulAddVWW/10000-8 35.1µs ± 0% 35.1µs ± 0% -0.01% (p=0.000 n=7+6) MulAddVWW/100000-8 351µs ± 0% 351µs ± 0% +0.15% (p=0.038 n=8+8) Change-Id: I052a4db286ac6e4f3293289c7e9a82027da0405e Reviewed-on: https://go-review.googlesource.com/c/go/+/155780 Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com>
2019-03-04	math/big: rewrite pure Go implementations to use math/bits	Josh Bleecher Snyder
	While we're here, delete addWW_g and subWW_g, per the TODO. They are now obsolete. Benchmarks on amd64 with -tags=math_big_pure_go. name old time/op new time/op delta AddVV/1-8 5.24ns ± 2% 5.12ns ± 1% -2.11% (p=0.000 n=82+87) AddVV/2-8 6.44ns ± 1% 6.33ns ± 2% -1.82% (p=0.000 n=77+82) AddVV/3-8 7.89ns ± 8% 6.97ns ± 4% -11.71% (p=0.000 n=100+96) AddVV/4-8 8.60ns ± 0% 7.72ns ± 4% -10.24% (p=0.000 n=90+96) AddVV/5-8 10.3ns ± 4% 8.5ns ± 1% -17.02% (p=0.000 n=96+91) AddVV/10-8 16.2ns ± 5% 12.8ns ± 1% -21.11% (p=0.000 n=97+86) AddVV/100-8 148ns ± 1% 117ns ± 5% -21.07% (p=0.000 n=66+98) AddVV/1000-8 1.41µs ± 4% 1.13µs ± 3% -19.90% (p=0.000 n=97+97) AddVV/10000-8 14.2µs ± 5% 11.2µs ± 1% -20.82% (p=0.000 n=99+84) AddVV/100000-8 142µs ± 4% 113µs ± 4% -20.40% (p=0.000 n=91+92) SubVV/1-8 5.29ns ± 1% 5.11ns ± 0% -3.30% (p=0.000 n=87+88) SubVV/2-8 6.36ns ± 4% 6.33ns ± 2% -0.56% (p=0.002 n=98+73) SubVV/3-8 7.58ns ± 5% 6.98ns ± 4% -8.01% (p=0.000 n=97+91) SubVV/4-8 8.61ns ± 3% 7.98ns ± 2% -7.31% (p=0.000 n=95+83) SubVV/5-8 10.6ns ± 2% 8.5ns ± 1% -19.56% (p=0.000 n=79+89) SubVV/10-8 16.3ns ± 4% 12.7ns ± 1% -21.97% (p=0.000 n=98+82) SubVV/100-8 124ns ± 1% 118ns ± 1% -4.83% (p=0.000 n=85+81) SubVV/1000-8 1.14µs ± 5% 1.12µs ± 2% -1.17% (p=0.000 n=97+81) SubVV/10000-8 11.6µs ±10% 11.2µs ± 1% -3.39% (p=0.000 n=100+84) SubVV/100000-8 114µs ± 6% 114µs ± 5% ~ (p=0.396 n=83+94) AddVW/1-8 4.04ns ± 4% 4.34ns ± 4% +7.57% (p=0.000 n=96+98) AddVW/2-8 4.34ns ± 5% 4.40ns ± 5% +1.40% (p=0.000 n=99+98) AddVW/3-8 5.43ns ± 0% 5.54ns ± 2% +1.97% (p=0.000 n=85+94) AddVW/4-8 6.23ns ± 1% 6.18ns ± 2% -0.66% (p=0.000 n=77+78) AddVW/5-8 6.78ns ± 2% 6.90ns ± 4% +1.77% (p=0.000 n=80+99) AddVW/10-8 10.5ns ± 4% 9.9ns ± 1% -5.77% (p=0.000 n=97+69) AddVW/100-8 114ns ± 3% 91ns ± 0% -20.38% (p=0.000 n=98+77) AddVW/1000-8 1.12µs ± 1% 0.87µs ± 1% -22.80% (p=0.000 n=82+68) AddVW/10000-8 11.2µs ± 2% 8.5µs ± 5% -23.85% (p=0.000 n=85+100) AddVW/100000-8 112µs ± 2% 85µs ± 5% -24.22% (p=0.000 n=71+96) SubVW/1-8 4.09ns ± 2% 4.18ns ± 4% +2.32% (p=0.000 n=78+96) SubVW/2-8 4.59ns ± 5% 4.52ns ± 7% -1.54% (p=0.000 n=98+94) SubVW/3-8 5.41ns ±10% 5.55ns ± 1% +2.48% (p=0.000 n=100+89) SubVW/4-8 6.51ns ± 2% 6.19ns ± 0% -4.85% (p=0.000 n=97+81) SubVW/5-8 7.25ns ± 3% 6.90ns ± 4% -4.93% (p=0.000 n=97+96) SubVW/10-8 10.6ns ± 4% 9.8ns ± 2% -7.32% (p=0.000 n=95+96) SubVW/100-8 90.4ns ± 0% 90.8ns ± 0% +0.43% (p=0.000 n=83+78) SubVW/1000-8 853ns ± 4% 857ns ± 2% +0.42% (p=0.000 n=100+98) SubVW/10000-8 8.52µs ± 4% 8.53µs ± 2% ~ (p=0.061 n=99+97) SubVW/100000-8 84.8µs ± 5% 84.2µs ± 2% -0.78% (p=0.000 n=99+93) AddMulVVW/1-8 8.73ns ± 0% 5.33ns ± 3% -38.91% (p=0.000 n=91+96) AddMulVVW/2-8 14.8ns ± 3% 6.5ns ± 2% -56.33% (p=0.000 n=100+79) AddMulVVW/3-8 18.6ns ± 2% 7.8ns ± 5% -57.84% (p=0.000 n=89+96) AddMulVVW/4-8 24.0ns ± 2% 9.8ns ± 0% -59.09% (p=0.000 n=95+67) AddMulVVW/5-8 29.0ns ± 2% 11.5ns ± 5% -60.44% (p=0.000 n=90+97) AddMulVVW/10-8 54.1ns ± 0% 18.8ns ± 1% -65.37% (p=0.000 n=82+84) AddMulVVW/100-8 508ns ± 2% 165ns ± 4% -67.62% (p=0.000 n=72+98) AddMulVVW/1000-8 4.96µs ± 3% 1.55µs ± 1% -68.86% (p=0.000 n=99+91) AddMulVVW/10000-8 50.0µs ± 4% 15.5µs ± 4% -68.95% (p=0.000 n=97+97) AddMulVVW/100000-8 491µs ± 1% 156µs ± 8% -68.22% (p=0.000 n=79+95) Change-Id: I4c6ae0b4065f371aea8103f6a85d9e9274bf01d0 Reviewed-on: https://go-review.googlesource.com/c/go/+/164965 Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Robert Griesemer <gri@golang.org>
2018-03-08	math/big: optimize addVW and subVW on arm64	erifan01
	The biggest hot spot of the existing implementation is "load" operations, which lead to poor performance. By unrolling the cycle 4 times and 2 times, and using "LDP", "STP" instructions, this CL can reduce the "load" cost and improve performance. Benchmarks: name old time/op new time/op delta AddVV/1-8 21.5ns ± 0% 21.5ns ± 0% ~ (all equal) AddVV/2-8 13.5ns ± 0% 13.5ns ± 0% ~ (all equal) AddVV/3-8 15.5ns ± 0% 15.5ns ± 0% ~ (all equal) AddVV/4-8 17.5ns ± 0% 17.5ns ± 0% ~ (all equal) AddVV/5-8 19.5ns ± 0% 19.5ns ± 0% ~ (all equal) AddVV/10-8 29.5ns ± 0% 29.5ns ± 0% ~ (all equal) AddVV/100-8 217ns ± 0% 217ns ± 0% ~ (all equal) AddVV/1000-8 2.02µs ± 0% 2.02µs ± 0% ~ (all equal) AddVV/10000-8 20.3µs ± 0% 20.3µs ± 0% ~ (p=0.603 n=5+5) AddVV/100000-8 223µs ± 7% 228µs ± 8% ~ (p=0.548 n=5+5) AddVW/1-8 9.32ns ± 0% 9.26ns ± 0% -0.64% (p=0.008 n=5+5) AddVW/2-8 19.8ns ± 3% 10.5ns ± 0% -46.92% (p=0.008 n=5+5) AddVW/3-8 11.5ns ± 0% 11.0ns ± 0% -4.35% (p=0.008 n=5+5) AddVW/4-8 13.0ns ± 0% 12.0ns ± 0% -7.69% (p=0.008 n=5+5) AddVW/5-8 14.5ns ± 0% 12.5ns ± 0% -13.79% (p=0.008 n=5+5) AddVW/10-8 22.0ns ± 0% 15.5ns ± 0% -29.55% (p=0.008 n=5+5) AddVW/100-8 167ns ± 0% 81ns ± 0% -51.44% (p=0.008 n=5+5) AddVW/1000-8 1.52µs ± 0% 0.64µs ± 0% -57.58% (p=0.008 n=5+5) AddVW/10000-8 15.1µs ± 0% 7.2µs ± 0% -52.55% (p=0.008 n=5+5) AddVW/100000-8 150µs ± 0% 71µs ± 0% -52.95% (p=0.008 n=5+5) SubVW/1-8 9.32ns ± 0% 9.26ns ± 0% -0.64% (p=0.008 n=5+5) SubVW/2-8 19.7ns ± 2% 10.5ns ± 0% -46.70% (p=0.008 n=5+5) SubVW/3-8 11.5ns ± 0% 11.0ns ± 0% -4.35% (p=0.008 n=5+5) SubVW/4-8 13.0ns ± 0% 12.0ns ± 0% -7.69% (p=0.008 n=5+5) SubVW/5-8 14.5ns ± 0% 12.5ns ± 0% -13.79% (p=0.008 n=5+5) SubVW/10-8 22.0ns ± 0% 15.5ns ± 0% -29.55% (p=0.008 n=5+5) SubVW/100-8 167ns ± 0% 81ns ± 0% -51.44% (p=0.008 n=5+5) SubVW/1000-8 1.52µs ± 0% 0.64µs ± 0% -57.58% (p=0.008 n=5+5) SubVW/10000-8 15.1µs ± 0% 7.2µs ± 0% -52.49% (p=0.008 n=5+5) SubVW/100000-8 150µs ± 0% 71µs ± 0% -52.91% (p=0.008 n=5+5) AddMulVVW/1-8 32.4ns ± 1% 32.6ns ± 1% ~ (p=0.119 n=5+5) AddMulVVW/2-8 57.0ns ± 0% 57.0ns ± 0% ~ (p=0.643 n=5+5) AddMulVVW/3-8 90.8ns ± 0% 90.7ns ± 0% ~ (p=0.524 n=5+5) AddMulVVW/4-8 118ns ± 0% 118ns ± 1% ~ (p=1.000 n=4+5) AddMulVVW/5-8 144ns ± 1% 144ns ± 0% ~ (p=0.794 n=5+4) AddMulVVW/10-8 294ns ± 1% 296ns ± 0% +0.48% (p=0.040 n=5+5) AddMulVVW/100-8 2.73µs ± 0% 2.73µs ± 0% ~ (p=0.278 n=5+5) AddMulVVW/1000-8 26.0µs ± 0% 26.5µs ± 0% +2.14% (p=0.008 n=5+5) AddMulVVW/10000-8 297µs ± 0% 297µs ± 0% +0.24% (p=0.008 n=5+5) AddMulVVW/100000-8 3.15ms ± 1% 3.13ms ± 0% ~ (p=0.690 n=5+5) DecimalConversion-8 311µs ± 2% 309µs ± 2% ~ (p=0.310 n=5+5) FloatString/100-8 2.55µs ± 2% 2.54µs ± 2% ~ (p=1.000 n=5+5) FloatString/1000-8 58.1µs ± 0% 58.1µs ± 0% ~ (p=0.151 n=5+5) FloatString/10000-8 4.59ms ± 0% 4.59ms ± 0% ~ (p=0.151 n=5+5) FloatString/100000-8 446ms ± 0% 446ms ± 0% +0.01% (p=0.016 n=5+5) FloatAdd/10-8 183ns ± 0% 183ns ± 0% ~ (p=0.333 n=4+5) FloatAdd/100-8 187ns ± 1% 192ns ± 2% ~ (p=0.056 n=5+5) FloatAdd/1000-8 369ns ± 0% 371ns ± 0% +0.54% (p=0.016 n=4+5) FloatAdd/10000-8 1.88µs ± 0% 1.88µs ± 0% -0.14% (p=0.000 n=4+5) FloatAdd/100000-8 17.2µs ± 0% 17.1µs ± 0% -0.37% (p=0.008 n=5+5) FloatSub/10-8 147ns ± 0% 147ns ± 0% ~ (all equal) FloatSub/100-8 145ns ± 0% 146ns ± 0% ~ (p=0.238 n=5+4) FloatSub/1000-8 241ns ± 0% 241ns ± 0% ~ (p=0.333 n=5+4) FloatSub/10000-8 1.06µs ± 0% 1.06µs ± 0% ~ (p=0.444 n=5+5) FloatSub/100000-8 9.50µs ± 0% 9.48µs ± 0% -0.14% (p=0.008 n=5+5) ParseFloatSmallExp-8 28.4µs ± 2% 28.5µs ± 1% ~ (p=0.690 n=5+5) ParseFloatLargeExp-8 125µs ± 1% 124µs ± 1% ~ (p=0.095 n=5+5) GCD10x10/WithoutXY-8 277ns ± 2% 278ns ± 3% ~ (p=0.937 n=5+5) GCD10x10/WithXY-8 2.08µs ± 3% 2.15µs ± 3% ~ (p=0.056 n=5+5) GCD10x100/WithoutXY-8 592ns ± 3% 613ns ± 4% ~ (p=0.056 n=5+5) GCD10x100/WithXY-8 3.40µs ± 2% 3.42µs ± 4% ~ (p=0.841 n=5+5) GCD10x1000/WithoutXY-8 1.37µs ± 2% 1.35µs ± 3% ~ (p=0.460 n=5+5) GCD10x1000/WithXY-8 7.34µs ± 2% 7.33µs ± 4% ~ (p=0.841 n=5+5) GCD10x10000/WithoutXY-8 8.52µs ± 0% 8.51µs ± 1% ~ (p=0.421 n=5+5) GCD10x10000/WithXY-8 27.5µs ± 2% 27.2µs ± 1% ~ (p=0.151 n=5+5) GCD10x100000/WithoutXY-8 78.3µs ± 1% 78.5µs ± 1% ~ (p=0.690 n=5+5) GCD10x100000/WithXY-8 231µs ± 0% 229µs ± 1% -1.11% (p=0.016 n=5+5) GCD100x100/WithoutXY-8 1.86µs ± 2% 1.86µs ± 2% ~ (p=0.881 n=5+5) GCD100x100/WithXY-8 27.1µs ± 2% 27.2µs ± 1% ~ (p=0.421 n=5+5) GCD100x1000/WithoutXY-8 4.44µs ± 2% 4.41µs ± 1% ~ (p=0.310 n=5+5) GCD100x1000/WithXY-8 36.3µs ± 1% 36.2µs ± 1% ~ (p=0.310 n=5+5) GCD100x10000/WithoutXY-8 22.6µs ± 2% 22.5µs ± 1% ~ (p=0.690 n=5+5) GCD100x10000/WithXY-8 145µs ± 1% 145µs ± 1% ~ (p=1.000 n=5+5) GCD100x100000/WithoutXY-8 195µs ± 0% 196µs ± 1% ~ (p=0.548 n=5+5) GCD100x100000/WithXY-8 1.10ms ± 0% 1.10ms ± 0% -0.30% (p=0.016 n=5+5) GCD1000x1000/WithoutXY-8 25.0µs ± 1% 25.2µs ± 2% ~ (p=0.222 n=5+5) GCD1000x1000/WithXY-8 520µs ± 0% 520µs ± 1% ~ (p=0.151 n=5+5) GCD1000x10000/WithoutXY-8 57.0µs ± 1% 56.9µs ± 1% ~ (p=0.690 n=5+5) GCD1000x10000/WithXY-8 1.21ms ± 0% 1.21ms ± 1% ~ (p=0.881 n=5+5) GCD1000x100000/WithoutXY-8 358µs ± 0% 359µs ± 1% ~ (p=0.548 n=5+5) GCD1000x100000/WithXY-8 8.73ms ± 0% 8.73ms ± 0% ~ (p=0.548 n=5+5) GCD10000x10000/WithoutXY-8 686µs ± 0% 687µs ± 0% ~ (p=0.548 n=5+5) GCD10000x10000/WithXY-8 15.9ms ± 0% 15.9ms ± 0% ~ (p=0.841 n=5+5) GCD10000x100000/WithoutXY-8 2.08ms ± 0% 2.08ms ± 0% ~ (p=1.000 n=5+5) GCD10000x100000/WithXY-8 86.7ms ± 0% 86.7ms ± 0% ~ (p=1.000 n=5+5) GCD100000x100000/WithoutXY-8 51.1ms ± 0% 51.0ms ± 0% ~ (p=0.151 n=5+5) GCD100000x100000/WithXY-8 1.23s ± 0% 1.23s ± 0% ~ (p=0.841 n=5+5) Hilbert-8 2.41ms ± 1% 2.42ms ± 2% ~ (p=0.690 n=5+5) Binomial-8 4.86µs ± 1% 4.86µs ± 1% ~ (p=0.889 n=5+5) QuoRem-8 7.09µs ± 0% 7.08µs ± 0% -0.09% (p=0.024 n=5+5) Exp-8 161ms ± 0% 161ms ± 0% -0.08% (p=0.032 n=5+5) Exp2-8 161ms ± 0% 161ms ± 0% ~ (p=1.000 n=5+5) Bitset-8 40.7ns ± 0% 40.6ns ± 0% ~ (p=0.095 n=4+5) BitsetNeg-8 159ns ± 4% 148ns ± 0% -6.92% (p=0.016 n=5+4) BitsetOrig-8 378ns ± 1% 378ns ± 1% ~ (p=0.937 n=5+5) BitsetNegOrig-8 647ns ± 5% 647ns ± 4% ~ (p=1.000 n=5+5) ModSqrt225_Tonelli-8 7.26ms ± 0% 7.27ms ± 0% ~ (p=1.000 n=5+5) ModSqrt224_3Mod4-8 2.24ms ± 0% 2.24ms ± 0% ~ (p=0.690 n=5+5) ModSqrt5430_Tonelli-8 62.8s ± 1% 62.5s ± 0% ~ (p=0.063 n=5+4) ModSqrt5430_3Mod4-8 20.8s ± 0% 20.8s ± 0% ~ (p=0.310 n=5+5) Sqrt-8 101µs ± 1% 101µs ± 0% -0.35% (p=0.032 n=5+5) IntSqr/1-8 32.3ns ± 1% 32.5ns ± 1% ~ (p=0.421 n=5+5) IntSqr/2-8 157ns ± 5% 156ns ± 5% ~ (p=0.651 n=5+5) IntSqr/3-8 292ns ± 2% 291ns ± 3% ~ (p=0.881 n=5+5) IntSqr/5-8 738ns ± 6% 740ns ± 5% ~ (p=0.841 n=5+5) IntSqr/8-8 1.82µs ± 4% 1.83µs ± 4% ~ (p=0.730 n=5+5) IntSqr/10-8 2.92µs ± 1% 2.93µs ± 1% ~ (p=0.643 n=5+5) IntSqr/20-8 6.28µs ± 2% 6.28µs ± 2% ~ (p=1.000 n=5+5) IntSqr/30-8 13.8µs ± 2% 13.9µs ± 3% ~ (p=1.000 n=5+5) IntSqr/50-8 37.8µs ± 4% 37.9µs ± 4% ~ (p=0.690 n=5+5) IntSqr/80-8 95.9µs ± 1% 95.8µs ± 1% ~ (p=0.841 n=5+5) IntSqr/100-8 148µs ± 1% 148µs ± 1% ~ (p=0.310 n=5+5) IntSqr/200-8 586µs ± 1% 586µs ± 1% ~ (p=0.841 n=5+5) IntSqr/300-8 1.32ms ± 0% 1.31ms ± 0% ~ (p=0.222 n=5+5) IntSqr/500-8 2.48ms ± 0% 2.48ms ± 0% ~ (p=0.556 n=5+4) IntSqr/800-8 4.68ms ± 0% 4.68ms ± 0% ~ (p=0.548 n=5+5) IntSqr/1000-8 7.57ms ± 0% 7.56ms ± 0% ~ (p=0.421 n=5+5) Mul-8 311ms ± 0% 311ms ± 0% ~ (p=0.548 n=5+5) Exp3Power/0x10-8 559ns ± 1% 560ns ± 1% ~ (p=0.984 n=5+5) Exp3Power/0x40-8 641ns ± 1% 634ns ± 1% ~ (p=0.063 n=5+5) Exp3Power/0x100-8 1.39µs ± 2% 1.40µs ± 2% ~ (p=0.381 n=5+5) Exp3Power/0x400-8 8.27µs ± 1% 8.26µs ± 0% ~ (p=0.571 n=5+5) Exp3Power/0x1000-8 59.9µs ± 0% 59.7µs ± 0% -0.23% (p=0.008 n=5+5) Exp3Power/0x4000-8 816µs ± 0% 816µs ± 0% ~ (p=1.000 n=5+5) Exp3Power/0x10000-8 7.77ms ± 0% 7.77ms ± 0% ~ (p=0.841 n=5+5) Exp3Power/0x40000-8 73.4ms ± 0% 73.4ms ± 0% ~ (p=0.690 n=5+5) Exp3Power/0x100000-8 665ms ± 0% 664ms ± 0% -0.14% (p=0.008 n=5+5) Exp3Power/0x400000-8 5.98s ± 0% 5.98s ± 0% -0.09% (p=0.008 n=5+5) Fibo-8 116ms ± 0% 116ms ± 0% -0.25% (p=0.008 n=5+5) NatSqr/1-8 115ns ± 3% 116ns ± 2% ~ (p=0.238 n=5+5) NatSqr/2-8 237ns ± 1% 237ns ± 1% ~ (p=0.683 n=5+5) NatSqr/3-8 367ns ± 3% 368ns ± 3% ~ (p=0.817 n=5+5) NatSqr/5-8 807ns ± 3% 812ns ± 3% ~ (p=0.913 n=5+5) NatSqr/8-8 1.93µs ± 2% 1.93µs ± 3% ~ (p=0.651 n=5+5) NatSqr/10-8 2.98µs ± 2% 2.99µs ± 2% ~ (p=0.690 n=5+5) NatSqr/20-8 6.49µs ± 2% 6.46µs ± 2% ~ (p=0.548 n=5+5) NatSqr/30-8 14.4µs ± 2% 14.3µs ± 2% ~ (p=0.690 n=5+5) NatSqr/50-8 38.6µs ± 2% 38.7µs ± 2% ~ (p=0.841 n=5+5) NatSqr/80-8 96.1µs ± 2% 95.8µs ± 2% ~ (p=0.548 n=5+5) NatSqr/100-8 149µs ± 1% 149µs ± 1% ~ (p=0.841 n=5+5) NatSqr/200-8 593µs ± 1% 590µs ± 1% ~ (p=0.421 n=5+5) NatSqr/300-8 1.32ms ± 0% 1.32ms ± 1% ~ (p=0.222 n=5+5) NatSqr/500-8 2.49ms ± 0% 2.49ms ± 0% ~ (p=0.690 n=5+5) NatSqr/800-8 4.69ms ± 0% 4.69ms ± 0% ~ (p=1.000 n=5+5) NatSqr/1000-8 7.59ms ± 0% 7.58ms ± 0% ~ (p=0.841 n=5+5) ScanPi-8 322µs ± 0% 321µs ± 0% ~ (p=0.095 n=5+5) StringPiParallel-8 71.4µs ± 5% 68.8µs ± 4% ~ (p=0.151 n=5+5) Scan/10/Base2-8 1.10µs ± 0% 1.09µs ± 0% -0.36% (p=0.032 n=5+5) Scan/100/Base2-8 7.78µs ± 0% 7.79µs ± 0% +0.14% (p=0.008 n=5+5) Scan/1000/Base2-8 78.8µs ± 0% 79.0µs ± 0% +0.24% (p=0.008 n=5+5) Scan/10000/Base2-8 1.22ms ± 0% 1.22ms ± 0% ~ (p=0.056 n=5+5) Scan/100000/Base2-8 55.1ms ± 0% 55.0ms ± 0% -0.15% (p=0.008 n=5+5) Scan/10/Base8-8 514ns ± 0% 515ns ± 0% ~ (p=0.079 n=5+5) Scan/100/Base8-8 2.89µs ± 0% 2.89µs ± 0% +0.15% (p=0.008 n=5+5) Scan/1000/Base8-8 31.0µs ± 0% 31.1µs ± 0% +0.12% (p=0.008 n=5+5) Scan/10000/Base8-8 740µs ± 0% 740µs ± 0% ~ (p=0.222 n=5+5) Scan/100000/Base8-8 50.6ms ± 0% 50.5ms ± 0% -0.06% (p=0.016 n=4+5) Scan/10/Base10-8 492ns ± 1% 490ns ± 1% ~ (p=0.310 n=5+5) Scan/100/Base10-8 2.67µs ± 0% 2.67µs ± 0% ~ (p=0.056 n=5+5) Scan/1000/Base10-8 28.7µs ± 0% 28.7µs ± 0% ~ (p=1.000 n=5+5) Scan/10000/Base10-8 717µs ± 0% 716µs ± 0% ~ (p=0.222 n=5+5) Scan/100000/Base10-8 50.2ms ± 0% 50.3ms ± 0% +0.05% (p=0.008 n=5+5) Scan/10/Base16-8 442ns ± 1% 442ns ± 0% ~ (p=0.468 n=5+5) Scan/100/Base16-8 2.46µs ± 0% 2.45µs ± 0% ~ (p=0.159 n=5+5) Scan/1000/Base16-8 27.2µs ± 0% 27.2µs ± 0% ~ (p=0.841 n=5+5) Scan/10000/Base16-8 721µs ± 0% 722µs ± 0% ~ (p=0.548 n=5+5) Scan/100000/Base16-8 52.6ms ± 0% 52.6ms ± 0% +0.07% (p=0.008 n=5+5) String/10/Base2-8 244ns ± 1% 242ns ± 1% ~ (p=0.103 n=5+5) String/100/Base2-8 1.48µs ± 0% 1.48µs ± 1% ~ (p=0.786 n=5+5) String/1000/Base2-8 13.3µs ± 1% 13.3µs ± 0% ~ (p=0.222 n=5+5) String/10000/Base2-8 132µs ± 1% 132µs ± 1% ~ (p=1.000 n=5+5) String/100000/Base2-8 1.30ms ± 1% 1.30ms ± 1% ~ (p=1.000 n=5+5) String/10/Base8-8 167ns ± 1% 168ns ± 1% ~ (p=0.135 n=5+5) String/100/Base8-8 623ns ± 1% 626ns ± 1% ~ (p=0.151 n=5+5) String/1000/Base8-8 5.24µs ± 1% 5.24µs ± 0% ~ (p=1.000 n=5+5) String/10000/Base8-8 50.0µs ± 1% 50.0µs ± 1% ~ (p=1.000 n=5+5) String/100000/Base8-8 492µs ± 1% 489µs ± 1% ~ (p=0.056 n=5+5) String/10/Base10-8 503ns ± 1% 501ns ± 0% ~ (p=0.183 n=5+5) String/100/Base10-8 1.96µs ± 0% 1.97µs ± 0% ~ (p=0.389 n=5+5) String/1000/Base10-8 12.4µs ± 1% 12.4µs ± 1% ~ (p=0.841 n=5+5) String/10000/Base10-8 56.7µs ± 1% 56.6µs ± 0% ~ (p=1.000 n=5+5) String/100000/Base10-8 25.6ms ± 0% 25.6ms ± 0% ~ (p=0.222 n=5+5) String/10/Base16-8 147ns ± 0% 148ns ± 2% ~ (p=1.000 n=4+5) String/100/Base16-8 505ns ± 0% 505ns ± 1% ~ (p=0.778 n=5+5) String/1000/Base16-8 3.94µs ± 0% 3.94µs ± 0% ~ (p=0.841 n=5+5) String/10000/Base16-8 37.4µs ± 1% 37.2µs ± 1% ~ (p=0.095 n=5+5) String/100000/Base16-8 367µs ± 1% 367µs ± 0% ~ (p=1.000 n=5+5) LeafSize/0-8 6.64ms ± 0% 6.65ms ± 0% ~ (p=0.690 n=5+5) LeafSize/1-8 72.5µs ± 1% 72.4µs ± 1% ~ (p=0.841 n=5+5) LeafSize/2-8 72.6µs ± 1% 72.6µs ± 1% ~ (p=1.000 n=5+5) LeafSize/3-8 377µs ± 0% 377µs ± 0% ~ (p=0.421 n=5+5) LeafSize/4-8 71.2µs ± 1% 71.3µs ± 0% ~ (p=0.278 n=5+5) LeafSize/5-8 469µs ± 0% 469µs ± 0% ~ (p=0.310 n=5+5) LeafSize/6-8 376µs ± 0% 376µs ± 0% ~ (p=0.841 n=5+5) LeafSize/7-8 244µs ± 0% 244µs ± 0% ~ (p=0.841 n=5+5) LeafSize/8-8 71.9µs ± 1% 72.1µs ± 1% ~ (p=0.548 n=5+5) LeafSize/9-8 536µs ± 0% 536µs ± 0% ~ (p=0.151 n=5+5) LeafSize/10-8 470µs ± 0% 471µs ± 0% +0.10% (p=0.032 n=5+5) LeafSize/11-8 458µs ± 0% 458µs ± 0% ~ (p=0.881 n=5+5) LeafSize/12-8 376µs ± 0% 376µs ± 0% ~ (p=0.548 n=5+5) LeafSize/13-8 341µs ± 0% 342µs ± 0% ~ (p=0.222 n=5+5) LeafSize/14-8 246µs ± 0% 245µs ± 0% ~ (p=0.167 n=5+5) LeafSize/15-8 168µs ± 0% 168µs ± 0% ~ (p=0.548 n=5+5) LeafSize/16-8 72.1µs ± 1% 72.2µs ± 1% ~ (p=0.690 n=5+5) LeafSize/32-8 81.5µs ± 1% 81.4µs ± 1% ~ (p=1.000 n=5+5) LeafSize/64-8 133µs ± 1% 134µs ± 1% ~ (p=0.690 n=5+5) ProbablyPrime/n=0-8 44.3ms ± 0% 44.2ms ± 0% -0.28% (p=0.008 n=5+5) ProbablyPrime/n=1-8 64.8ms ± 0% 64.7ms ± 0% -0.15% (p=0.008 n=5+5) ProbablyPrime/n=5-8 147ms ± 0% 147ms ± 0% -0.11% (p=0.008 n=5+5) ProbablyPrime/n=10-8 250ms ± 0% 250ms ± 0% ~ (p=0.056 n=5+5) ProbablyPrime/n=20-8 456ms ± 0% 455ms ± 0% -0.05% (p=0.008 n=5+5) ProbablyPrime/Lucas-8 23.6ms ± 0% 23.5ms ± 0% -0.29% (p=0.008 n=5+5) ProbablyPrime/MillerRabinBase2-8 20.6ms ± 0% 20.6ms ± 0% ~ (p=0.690 n=5+5) FloatSqrt/64-8 2.01µs ± 1% 2.02µs ± 1% ~ (p=0.421 n=5+5) FloatSqrt/128-8 4.43µs ± 2% 4.38µs ± 2% ~ (p=0.222 n=5+5) FloatSqrt/256-8 6.64µs ± 1% 6.68µs ± 2% ~ (p=0.516 n=5+5) FloatSqrt/1000-8 31.9µs ± 0% 31.8µs ± 0% ~ (p=0.095 n=5+5) FloatSqrt/10000-8 595µs ± 0% 594µs ± 0% ~ (p=0.056 n=5+5) FloatSqrt/100000-8 17.9ms ± 0% 17.9ms ± 0% ~ (p=0.151 n=5+5) FloatSqrt/1000000-8 1.52s ± 0% 1.52s ± 0% ~ (p=0.841 n=5+5) name old speed new speed delta AddVV/1-8 2.97GB/s ± 0% 2.97GB/s ± 0% ~ (p=0.971 n=4+4) AddVV/2-8 9.47GB/s ± 0% 9.47GB/s ± 0% +0.01% (p=0.016 n=5+5) AddVV/3-8 12.4GB/s ± 0% 12.4GB/s ± 0% ~ (p=0.548 n=5+5) AddVV/4-8 14.6GB/s ± 0% 14.6GB/s ± 0% ~ (p=1.000 n=5+5) AddVV/5-8 16.4GB/s ± 0% 16.4GB/s ± 0% ~ (p=1.000 n=5+5) AddVV/10-8 21.7GB/s ± 0% 21.7GB/s ± 0% ~ (p=0.548 n=5+5) AddVV/100-8 29.4GB/s ± 0% 29.4GB/s ± 0% ~ (p=1.000 n=5+5) AddVV/1000-8 31.7GB/s ± 0% 31.7GB/s ± 0% ~ (p=0.524 n=5+4) AddVV/10000-8 31.5GB/s ± 0% 31.5GB/s ± 0% ~ (p=0.690 n=5+5) AddVV/100000-8 28.8GB/s ± 7% 28.1GB/s ± 8% ~ (p=0.548 n=5+5) AddVW/1-8 859MB/s ± 0% 864MB/s ± 0% +0.61% (p=0.008 n=5+5) AddVW/2-8 809MB/s ± 2% 1520MB/s ± 0% +87.78% (p=0.008 n=5+5) AddVW/3-8 2.08GB/s ± 0% 2.18GB/s ± 0% +4.54% (p=0.008 n=5+5) AddVW/4-8 2.46GB/s ± 0% 2.66GB/s ± 0% +8.33% (p=0.016 n=4+5) AddVW/5-8 2.76GB/s ± 0% 3.20GB/s ± 0% +16.03% (p=0.008 n=5+5) AddVW/10-8 3.63GB/s ± 0% 5.15GB/s ± 0% +41.83% (p=0.008 n=5+5) AddVW/100-8 4.79GB/s ± 0% 9.87GB/s ± 0% +106.12% (p=0.008 n=5+5) AddVW/1000-8 5.27GB/s ± 0% 12.42GB/s ± 0% +135.74% (p=0.008 n=5+5) AddVW/10000-8 5.31GB/s ± 0% 11.19GB/s ± 0% +110.71% (p=0.008 n=5+5) AddVW/100000-8 5.32GB/s ± 0% 11.32GB/s ± 0% +112.56% (p=0.008 n=5+5) SubVW/1-8 859MB/s ± 0% 864MB/s ± 0% +0.61% (p=0.008 n=5+5) SubVW/2-8 812MB/s ± 2% 1520MB/s ± 0% +87.09% (p=0.008 n=5+5) SubVW/3-8 2.08GB/s ± 0% 2.18GB/s ± 0% +4.55% (p=0.008 n=5+5) SubVW/4-8 2.46GB/s ± 0% 2.66GB/s ± 0% +8.33% (p=0.008 n=5+5) SubVW/5-8 2.75GB/s ± 0% 3.20GB/s ± 0% +16.03% (p=0.008 n=5+5) SubVW/10-8 3.63GB/s ± 0% 5.15GB/s ± 0% +41.82% (p=0.008 n=5+5) SubVW/100-8 4.79GB/s ± 0% 9.87GB/s ± 0% +106.13% (p=0.008 n=5+5) SubVW/1000-8 5.27GB/s ± 0% 12.42GB/s ± 0% +135.74% (p=0.008 n=5+5) SubVW/10000-8 5.31GB/s ± 0% 11.17GB/s ± 0% +110.44% (p=0.008 n=5+5) SubVW/100000-8 5.32GB/s ± 0% 11.31GB/s ± 0% +112.35% (p=0.008 n=5+5) AddMulVVW/1-8 1.97GB/s ± 1% 1.96GB/s ± 1% ~ (p=0.151 n=5+5) AddMulVVW/2-8 2.24GB/s ± 0% 2.25GB/s ± 0% ~ (p=0.095 n=5+5) AddMulVVW/3-8 2.11GB/s ± 0% 2.12GB/s ± 0% ~ (p=0.548 n=5+5) AddMulVVW/4-8 2.17GB/s ± 1% 2.17GB/s ± 1% ~ (p=0.548 n=5+5) AddMulVVW/5-8 2.22GB/s ± 1% 2.21GB/s ± 1% ~ (p=0.421 n=5+5) AddMulVVW/10-8 2.17GB/s ± 1% 2.16GB/s ± 0% ~ (p=0.095 n=5+5) AddMulVVW/100-8 2.35GB/s ± 0% 2.35GB/s ± 0% ~ (p=0.421 n=5+5) AddMulVVW/1000-8 2.47GB/s ± 0% 2.41GB/s ± 0% -2.09% (p=0.008 n=5+5) AddMulVVW/10000-8 2.16GB/s ± 0% 2.15GB/s ± 0% -0.23% (p=0.008 n=5+5) AddMulVVW/100000-8 2.03GB/s ± 1% 2.04GB/s ± 0% ~ (p=0.690 n=5+5) name old alloc/op new alloc/op delta FloatString/100-8 400B ± 0% 400B ± 0% ~ (all equal) FloatString/1000-8 3.22kB ± 0% 3.22kB ± 0% ~ (all equal) FloatString/10000-8 55.6kB ± 0% 55.5kB ± 0% ~ (p=0.206 n=5+5) FloatString/100000-8 627kB ± 0% 627kB ± 0% ~ (all equal) FloatAdd/10-8 0.00B 0.00B ~ (all equal) FloatAdd/100-8 0.00B 0.00B ~ (all equal) FloatAdd/1000-8 0.00B 0.00B ~ (all equal) FloatAdd/10000-8 0.00B 0.00B ~ (all equal) FloatAdd/100000-8 0.00B 0.00B ~ (all equal) FloatSub/10-8 0.00B 0.00B ~ (all equal) FloatSub/100-8 0.00B 0.00B ~ (all equal) FloatSub/1000-8 0.00B 0.00B ~ (all equal) FloatSub/10000-8 0.00B 0.00B ~ (all equal) FloatSub/100000-8 0.00B 0.00B ~ (all equal) FloatSqrt/64-8 416B ± 0% 416B ± 0% ~ (all equal) FloatSqrt/128-8 720B ± 0% 720B ± 0% ~ (all equal) FloatSqrt/256-8 816B ± 0% 816B ± 0% ~ (all equal) FloatSqrt/1000-8 2.50kB ± 0% 2.50kB ± 0% ~ (all equal) FloatSqrt/10000-8 23.5kB ± 0% 23.5kB ± 0% ~ (all equal) FloatSqrt/100000-8 251kB ± 0% 251kB ± 0% ~ (all equal) FloatSqrt/1000000-8 4.61MB ± 0% 4.61MB ± 0% ~ (all equal) name old allocs/op new allocs/op delta FloatString/100-8 8.00 ± 0% 8.00 ± 0% ~ (all equal) FloatString/1000-8 10.0 ± 0% 10.0 ± 0% ~ (all equal) FloatString/10000-8 42.0 ± 0% 42.0 ± 0% ~ (all equal) FloatString/100000-8 346 ± 0% 346 ± 0% ~ (all equal) FloatAdd/10-8 0.00 0.00 ~ (all equal) FloatAdd/100-8 0.00 0.00 ~ (all equal) FloatAdd/1000-8 0.00 0.00 ~ (all equal) FloatAdd/10000-8 0.00 0.00 ~ (all equal) FloatAdd/100000-8 0.00 0.00 ~ (all equal) FloatSub/10-8 0.00 0.00 ~ (all equal) FloatSub/100-8 0.00 0.00 ~ (all equal) FloatSub/1000-8 0.00 0.00 ~ (all equal) FloatSub/10000-8 0.00 0.00 ~ (all equal) FloatSub/100000-8 0.00 0.00 ~ (all equal) FloatSqrt/64-8 9.00 ± 0% 9.00 ± 0% ~ (all equal) FloatSqrt/128-8 13.0 ± 0% 13.0 ± 0% ~ (all equal) FloatSqrt/256-8 12.0 ± 0% 12.0 ± 0% ~ (all equal) FloatSqrt/1000-8 19.0 ± 0% 19.0 ± 0% ~ (all equal) FloatSqrt/10000-8 35.0 ± 0% 35.0 ± 0% ~ (all equal) FloatSqrt/100000-8 55.0 ± 0% 55.0 ± 0% ~ (all equal) FloatSqrt/1000000-8 122 ± 0% 122 ± 0% ~ (all equal) Change-Id: I6888d84c037d91f9e2199f3492ea3f6a0ed77b24 Reviewed-on: https://go-review.googlesource.com/77832 Reviewed-by: Vlad Krasnov <vlad@cloudflare.com> Reviewed-by: Cherry Zhang <cherryyz@google.com> Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org>
2018-03-06	math/big: optimize addVV and subVV on arm64	erifan01
	The biggest hot spot of the existing implementation is "load" operations, which lead to poor performance. By unrolling the cycle 4x and 2x, and using "LDP", "STP" instructions, this CL can reduce the "load" cost and improve performance. Benchmarks: name old time/op new time/op delta AddVV/1-8 21.5ns ± 0% 11.5ns ± 0% -46.51% (p=0.008 n=5+5) AddVV/2-8 13.5ns ± 0% 12.0ns ± 0% -11.11% (p=0.008 n=5+5) AddVV/3-8 15.5ns ± 0% 13.0ns ± 0% -16.13% (p=0.008 n=5+5) AddVV/4-8 17.5ns ± 0% 13.5ns ± 0% -22.86% (p=0.008 n=5+5) AddVV/5-8 19.5ns ± 0% 14.5ns ± 0% -25.64% (p=0.008 n=5+5) AddVV/10-8 29.5ns ± 0% 18.0ns ± 0% -38.98% (p=0.008 n=5+5) AddVV/100-8 217ns ± 0% 94ns ± 0% -56.64% (p=0.008 n=5+5) AddVV/1000-8 2.02µs ± 0% 1.03µs ± 0% -48.85% (p=0.008 n=5+5) AddVV/10000-8 20.5µs ± 0% 11.3µs ± 0% -44.70% (p=0.008 n=5+5) AddVV/100000-8 247µs ± 3% 154µs ± 0% -37.52% (p=0.008 n=5+5) SubVV/1-8 21.5ns ± 0% 11.5ns ± 0% ~ (p=0.079 n=4+5) SubVV/2-8 13.5ns ± 0% 12.0ns ± 0% -11.11% (p=0.008 n=5+5) SubVV/3-8 15.5ns ± 0% 13.0ns ± 0% -16.13% (p=0.008 n=5+5) SubVV/4-8 17.5ns ± 0% 13.5ns ± 0% -22.86% (p=0.008 n=5+5) SubVV/5-8 19.5ns ± 0% 14.5ns ± 0% -25.64% (p=0.008 n=5+5) SubVV/10-8 29.5ns ± 0% 18.0ns ± 0% -38.98% (p=0.008 n=5+5) SubVV/100-8 217ns ± 0% 94ns ± 0% -56.64% (p=0.008 n=5+5) SubVV/1000-8 2.02µs ± 0% 0.80µs ± 0% -60.50% (p=0.008 n=5+5) SubVV/10000-8 20.5µs ± 0% 11.3µs ± 0% -44.99% (p=0.008 n=5+5) SubVV/100000-8 221µs ±11% 223µs ±16% ~ (p=0.690 n=5+5) AddVW/1-8 9.32ns ± 0% 9.32ns ± 0% ~ (all equal) AddVW/2-8 19.7ns ± 1% 19.7ns ± 0% ~ (p=0.381 n=5+4) AddVW/3-8 11.5ns ± 0% 11.5ns ± 0% ~ (all equal) AddVW/4-8 13.0ns ± 0% 13.0ns ± 0% ~ (all equal) AddVW/5-8 14.5ns ± 0% 14.5ns ± 0% ~ (all equal) AddVW/10-8 22.0ns ± 0% 22.0ns ± 0% ~ (all equal) AddVW/100-8 167ns ± 0% 167ns ± 0% ~ (all equal) AddVW/1000-8 1.52µs ± 0% 1.52µs ± 0% +0.40% (p=0.008 n=5+5) AddVW/10000-8 15.1µs ± 0% 15.1µs ± 0% ~ (p=0.556 n=5+4) AddVW/100000-8 152µs ± 1% 152µs ± 1% ~ (p=0.690 n=5+5) AddMulVVW/1-8 33.3ns ± 0% 32.7ns ± 1% -1.86% (p=0.008 n=5+5) AddMulVVW/2-8 59.3ns ± 1% 56.9ns ± 1% -4.15% (p=0.008 n=5+5) AddMulVVW/3-8 80.5ns ± 1% 85.4ns ± 3% +6.19% (p=0.008 n=5+5) AddMulVVW/4-8 127ns ± 0% 111ns ± 1% -13.19% (p=0.008 n=5+5) AddMulVVW/5-8 144ns ± 0% 149ns ± 0% +3.47% (p=0.016 n=4+5) AddMulVVW/10-8 298ns ± 1% 283ns ± 0% -4.77% (p=0.008 n=5+5) AddMulVVW/100-8 3.06µs ± 0% 2.99µs ± 0% -2.21% (p=0.008 n=5+5) AddMulVVW/1000-8 31.3µs ± 0% 26.9µs ± 0% -14.17% (p=0.008 n=5+5) AddMulVVW/10000-8 316µs ± 0% 305µs ± 0% -3.51% (p=0.008 n=5+5) AddMulVVW/100000-8 3.17ms ± 0% 3.17ms ± 1% ~ (p=0.690 n=5+5) DecimalConversion-8 316µs ± 1% 313µs ± 2% ~ (p=0.095 n=5+5) FloatString/100-8 2.53µs ± 1% 2.56µs ± 2% ~ (p=0.222 n=5+5) FloatString/1000-8 58.4µs ± 0% 58.5µs ± 0% ~ (p=0.206 n=5+5) FloatString/10000-8 4.59ms ± 0% 4.58ms ± 0% -0.31% (p=0.008 n=5+5) FloatString/100000-8 446ms ± 0% 444ms ± 0% -0.31% (p=0.008 n=5+5) FloatAdd/10-8 184ns ± 0% 172ns ± 0% -6.30% (p=0.008 n=5+5) FloatAdd/100-8 189ns ± 2% 191ns ± 4% ~ (p=0.381 n=5+5) FloatAdd/1000-8 371ns ± 0% 347ns ± 1% -6.42% (p=0.008 n=5+5) FloatAdd/10000-8 1.87µs ± 0% 1.68µs ± 0% -10.16% (p=0.008 n=5+5) FloatAdd/100000-8 17.1µs ± 0% 15.6µs ± 0% -8.74% (p=0.016 n=5+4) FloatSub/10-8 152ns ± 0% 138ns ± 0% -9.47% (p=0.000 n=4+5) FloatSub/100-8 148ns ± 0% 142ns ± 0% -4.05% (p=0.000 n=5+4) FloatSub/1000-8 245ns ± 1% 217ns ± 0% -11.28% (p=0.000 n=5+4) FloatSub/10000-8 1.07µs ± 0% 0.88µs ± 1% -18.14% (p=0.008 n=5+5) FloatSub/100000-8 9.58µs ± 0% 7.96µs ± 0% -16.84% (p=0.008 n=5+5) ParseFloatSmallExp-8 28.8µs ± 1% 29.0µs ± 1% ~ (p=0.095 n=5+5) ParseFloatLargeExp-8 126µs ± 1% 126µs ± 1% ~ (p=0.841 n=5+5) GCD10x10/WithoutXY-8 277ns ± 2% 281ns ± 4% ~ (p=0.746 n=5+5) GCD10x10/WithXY-8 2.10µs ± 1% 2.12µs ± 3% ~ (p=0.548 n=5+5) GCD10x100/WithoutXY-8 615ns ± 3% 607ns ± 2% ~ (p=0.135 n=5+5) GCD10x100/WithXY-8 3.50µs ± 2% 3.62µs ± 5% ~ (p=0.151 n=5+5) GCD10x1000/WithoutXY-8 1.39µs ± 2% 1.39µs ± 3% ~ (p=0.690 n=5+5) GCD10x1000/WithXY-8 7.39µs ± 1% 7.34µs ± 2% ~ (p=0.135 n=5+5) GCD10x10000/WithoutXY-8 8.66µs ± 1% 8.68µs ± 1% ~ (p=0.421 n=5+5) GCD10x10000/WithXY-8 28.1µs ± 2% 27.0µs ± 2% -3.81% (p=0.008 n=5+5) GCD10x100000/WithoutXY-8 79.3µs ± 1% 79.3µs ± 1% ~ (p=0.841 n=5+5) GCD10x100000/WithXY-8 238µs ± 0% 227µs ± 1% -4.74% (p=0.008 n=5+5) GCD100x100/WithoutXY-8 1.89µs ± 1% 1.88µs ± 2% ~ (p=0.968 n=5+5) GCD100x100/WithXY-8 26.7µs ± 1% 27.0µs ± 1% +1.44% (p=0.032 n=5+5) GCD100x1000/WithoutXY-8 4.48µs ± 1% 4.45µs ± 2% ~ (p=0.341 n=5+5) GCD100x1000/WithXY-8 36.3µs ± 1% 35.1µs ± 1% -3.27% (p=0.008 n=5+5) GCD100x10000/WithoutXY-8 22.8µs ± 0% 22.7µs ± 1% ~ (p=0.056 n=5+5) GCD100x10000/WithXY-8 145µs ± 1% 133µs ± 1% -8.33% (p=0.008 n=5+5) GCD100x100000/WithoutXY-8 198µs ± 0% 195µs ± 0% -1.56% (p=0.008 n=5+5) GCD100x100000/WithXY-8 1.11ms ± 0% 1.00ms ± 0% -10.04% (p=0.008 n=5+5) GCD1000x1000/WithoutXY-8 25.2µs ± 1% 24.8µs ± 1% -1.63% (p=0.016 n=5+5) GCD1000x1000/WithXY-8 513µs ± 0% 517µs ± 2% ~ (p=0.421 n=5+5) GCD1000x10000/WithoutXY-8 57.0µs ± 0% 52.7µs ± 1% -7.56% (p=0.008 n=5+5) GCD1000x10000/WithXY-8 1.20ms ± 0% 1.10ms ± 0% -8.70% (p=0.008 n=5+5) GCD1000x100000/WithoutXY-8 358µs ± 0% 318µs ± 1% -11.03% (p=0.008 n=5+5) GCD1000x100000/WithXY-8 8.71ms ± 0% 7.65ms ± 0% -12.19% (p=0.008 n=5+5) GCD10000x10000/WithoutXY-8 690µs ± 0% 630µs ± 0% -8.71% (p=0.008 n=5+5) GCD10000x10000/WithXY-8 16.0ms ± 1% 14.9ms ± 0% -6.85% (p=0.008 n=5+5) GCD10000x100000/WithoutXY-8 2.09ms ± 0% 1.75ms ± 0% -16.09% (p=0.016 n=5+4) GCD10000x100000/WithXY-8 86.8ms ± 0% 76.3ms ± 0% -12.09% (p=0.008 n=5+5) GCD100000x100000/WithoutXY-8 51.1ms ± 0% 46.0ms ± 0% -9.97% (p=0.008 n=5+5) GCD100000x100000/WithXY-8 1.25s ± 0% 1.15s ± 0% -7.92% (p=0.008 n=5+5) Hilbert-8 2.45ms ± 1% 2.49ms ± 1% +1.99% (p=0.008 n=5+5) Binomial-8 4.98µs ± 3% 4.90µs ± 2% ~ (p=0.421 n=5+5) QuoRem-8 7.10µs ± 0% 6.21µs ± 0% -12.55% (p=0.016 n=5+4) Exp-8 161ms ± 0% 161ms ± 0% ~ (p=0.421 n=5+5) Exp2-8 161ms ± 0% 161ms ± 0% ~ (p=0.151 n=5+5) Bitset-8 40.4ns ± 0% 40.3ns ± 0% ~ (p=0.190 n=5+5) BitsetNeg-8 163ns ± 3% 137ns ± 2% -15.91% (p=0.008 n=5+5) BitsetOrig-8 377ns ± 1% 372ns ± 1% -1.22% (p=0.024 n=5+5) BitsetNegOrig-8 631ns ± 1% 605ns ± 1% -4.09% (p=0.008 n=5+5) ModSqrt225_Tonelli-8 7.26ms ± 0% 7.26ms ± 0% ~ (p=0.548 n=5+5) ModSqrt224_3Mod4-8 2.24ms ± 0% 2.24ms ± 0% ~ (p=1.000 n=5+5) ModSqrt5430_Tonelli-8 62.4s ± 0% 62.4s ± 0% ~ (p=0.841 n=5+5) ModSqrt5430_3Mod4-8 20.8s ± 0% 20.7s ± 0% ~ (p=0.056 n=5+5) Sqrt-8 101µs ± 0% 89µs ± 0% -12.17% (p=0.008 n=5+5) IntSqr/1-8 32.5ns ± 1% 32.7ns ± 1% ~ (p=0.056 n=5+5) IntSqr/2-8 160ns ± 5% 158ns ± 0% ~ (p=0.397 n=5+4) IntSqr/3-8 298ns ± 4% 296ns ± 4% ~ (p=0.667 n=5+5) IntSqr/5-8 737ns ± 5% 761ns ± 3% +3.34% (p=0.016 n=5+5) IntSqr/8-8 1.87µs ± 4% 1.90µs ± 3% ~ (p=0.222 n=5+5) IntSqr/10-8 2.96µs ± 4% 2.92µs ± 6% ~ (p=0.310 n=5+5) IntSqr/20-8 6.28µs ± 3% 6.21µs ± 2% ~ (p=0.310 n=5+5) IntSqr/30-8 14.0µs ± 2% 13.9µs ± 2% ~ (p=0.548 n=5+5) IntSqr/50-8 37.7µs ± 3% 38.3µs ± 2% ~ (p=0.095 n=5+5) IntSqr/80-8 95.9µs ± 2% 95.1µs ± 1% ~ (p=0.310 n=5+5) IntSqr/100-8 148µs ± 1% 148µs ± 1% ~ (p=0.841 n=5+5) IntSqr/200-8 586µs ± 1% 587µs ± 1% ~ (p=1.000 n=5+5) IntSqr/300-8 1.32ms ± 0% 1.31ms ± 1% -0.73% (p=0.032 n=5+5) IntSqr/500-8 2.48ms ± 0% 2.45ms ± 0% -1.15% (p=0.008 n=5+5) IntSqr/800-8 4.68ms ± 0% 4.62ms ± 0% -1.23% (p=0.008 n=5+5) IntSqr/1000-8 7.57ms ± 0% 7.50ms ± 0% -0.84% (p=0.008 n=5+5) Mul-8 311ms ± 0% 308ms ± 0% -0.81% (p=0.008 n=5+5) Exp3Power/0x10-8 574ns ± 1% 578ns ± 2% ~ (p=0.500 n=5+5) Exp3Power/0x40-8 640ns ± 1% 646ns ± 0% ~ (p=0.056 n=5+5) Exp3Power/0x100-8 1.42µs ± 1% 1.42µs ± 1% ~ (p=0.246 n=5+5) Exp3Power/0x400-8 8.30µs ± 1% 8.29µs ± 1% ~ (p=0.802 n=5+5) Exp3Power/0x1000-8 60.0µs ± 0% 59.9µs ± 0% -0.24% (p=0.016 n=5+5) Exp3Power/0x4000-8 817µs ± 0% 816µs ± 0% -0.17% (p=0.008 n=5+5) Exp3Power/0x10000-8 7.80ms ± 1% 7.70ms ± 0% -1.23% (p=0.008 n=5+5) Exp3Power/0x40000-8 73.4ms ± 0% 72.5ms ± 0% -1.28% (p=0.008 n=5+5) Exp3Power/0x100000-8 665ms ± 0% 656ms ± 0% -1.34% (p=0.008 n=5+5) Exp3Power/0x400000-8 5.99s ± 0% 5.90s ± 0% -1.40% (p=0.008 n=5+5) Fibo-8 116ms ± 0% 50ms ± 0% -57.09% (p=0.008 n=5+5) NatSqr/1-8 112ns ± 4% 112ns ± 2% ~ (p=0.968 n=5+5) NatSqr/2-8 251ns ± 2% 250ns ± 1% ~ (p=0.571 n=5+5) NatSqr/3-8 378ns ± 2% 379ns ± 2% ~ (p=0.794 n=5+5) NatSqr/5-8 829ns ± 3% 827ns ± 2% ~ (p=1.000 n=5+5) NatSqr/8-8 1.97µs ± 2% 1.95µs ± 2% ~ (p=0.310 n=5+5) NatSqr/10-8 3.02µs ± 2% 2.99µs ± 2% ~ (p=0.421 n=5+5) NatSqr/20-8 6.51µs ± 2% 6.49µs ± 1% ~ (p=0.841 n=5+5) NatSqr/30-8 14.1µs ± 2% 14.0µs ± 2% ~ (p=0.841 n=5+5) NatSqr/50-8 38.1µs ± 2% 38.3µs ± 3% ~ (p=0.690 n=5+5) NatSqr/80-8 95.5µs ± 2% 96.0µs ± 1% ~ (p=0.421 n=5+5) NatSqr/100-8 150µs ± 1% 148µs ± 2% ~ (p=0.095 n=5+5) NatSqr/200-8 588µs ± 1% 590µs ± 1% ~ (p=0.421 n=5+5) NatSqr/300-8 1.32ms ± 1% 1.31ms ± 1% ~ (p=0.841 n=5+5) NatSqr/500-8 2.50ms ± 0% 2.47ms ± 0% -1.03% (p=0.008 n=5+5) NatSqr/800-8 4.70ms ± 0% 4.64ms ± 0% -1.31% (p=0.008 n=5+5) NatSqr/1000-8 7.60ms ± 0% 7.52ms ± 0% -1.01% (p=0.008 n=5+5) ScanPi-8 326µs ± 0% 326µs ± 0% ~ (p=0.841 n=5+5) StringPiParallel-8 70.3µs ± 5% 63.8µs ±10% ~ (p=0.056 n=5+5) Scan/10/Base2-8 1.09µs ± 0% 1.09µs ± 0% ~ (p=0.317 n=5+5) Scan/100/Base2-8 7.79µs ± 0% 7.78µs ± 0% ~ (p=0.063 n=5+5) Scan/1000/Base2-8 79.0µs ± 0% 78.9µs ± 0% -0.18% (p=0.008 n=5+5) Scan/10000/Base2-8 1.22ms ± 0% 1.22ms ± 0% -0.15% (p=0.008 n=5+5) Scan/100000/Base2-8 55.1ms ± 0% 55.2ms ± 0% +0.20% (p=0.008 n=5+5) Scan/10/Base8-8 512ns ± 0% 512ns ± 1% ~ (p=0.810 n=5+5) Scan/100/Base8-8 2.89µs ± 0% 2.89µs ± 0% ~ (p=0.810 n=5+5) Scan/1000/Base8-8 31.0µs ± 0% 31.0µs ± 0% ~ (p=0.151 n=5+5) Scan/10000/Base8-8 740µs ± 0% 741µs ± 0% +0.10% (p=0.008 n=5+5) Scan/100000/Base8-8 50.6ms ± 0% 50.6ms ± 0% +0.08% (p=0.008 n=5+5) Scan/10/Base10-8 487ns ± 0% 487ns ± 0% ~ (p=0.571 n=5+5) Scan/100/Base10-8 2.67µs ± 0% 2.67µs ± 0% ~ (p=0.810 n=5+5) Scan/1000/Base10-8 28.7µs ± 0% 28.7µs ± 0% +0.06% (p=0.008 n=5+5) Scan/10000/Base10-8 716µs ± 0% 717µs ± 0% ~ (p=0.222 n=5+5) Scan/100000/Base10-8 50.3ms ± 0% 50.3ms ± 0% +0.10% (p=0.008 n=5+5) Scan/10/Base16-8 438ns ± 0% 437ns ± 1% ~ (p=0.786 n=5+5) Scan/100/Base16-8 2.47µs ± 0% 2.47µs ± 0% -0.19% (p=0.048 n=5+5) Scan/1000/Base16-8 27.2µs ± 0% 27.3µs ± 0% ~ (p=0.087 n=5+5) Scan/10000/Base16-8 722µs ± 0% 722µs ± 0% +0.11% (p=0.008 n=5+5) Scan/100000/Base16-8 52.6ms ± 0% 52.7ms ± 0% +0.15% (p=0.008 n=5+5) String/10/Base2-8 247ns ± 2% 248ns ± 1% ~ (p=0.437 n=5+5) String/100/Base2-8 1.51µs ± 0% 1.51µs ± 0% -0.37% (p=0.024 n=5+5) String/1000/Base2-8 13.6µs ± 1% 13.5µs ± 0% ~ (p=0.095 n=5+5) String/10000/Base2-8 135µs ± 0% 135µs ± 1% ~ (p=0.841 n=5+5) String/100000/Base2-8 1.32ms ± 1% 1.32ms ± 1% ~ (p=0.690 n=5+5) String/10/Base8-8 169ns ± 1% 169ns ± 1% ~ (p=1.000 n=5+5) String/100/Base8-8 636ns ± 0% 634ns ± 1% ~ (p=0.413 n=5+5) String/1000/Base8-8 5.33µs ± 1% 5.32µs ± 0% ~ (p=0.222 n=5+5) String/10000/Base8-8 50.9µs ± 1% 50.7µs ± 0% ~ (p=0.151 n=5+5) String/100000/Base8-8 500µs ± 1% 497µs ± 0% ~ (p=0.421 n=5+5) String/10/Base10-8 516ns ± 1% 513ns ± 0% -0.62% (p=0.016 n=5+4) String/100/Base10-8 1.97µs ± 0% 1.96µs ± 0% ~ (p=0.667 n=4+5) String/1000/Base10-8 12.5µs ± 0% 11.5µs ± 0% -7.92% (p=0.008 n=5+5) String/10000/Base10-8 57.7µs ± 0% 52.5µs ± 0% -8.93% (p=0.008 n=5+5) String/100000/Base10-8 25.6ms ± 0% 21.6ms ± 0% -15.94% (p=0.008 n=5+5) String/10/Base16-8 150ns ± 1% 149ns ± 0% ~ (p=0.413 n=5+4) String/100/Base16-8 514ns ± 1% 514ns ± 1% ~ (p=0.849 n=5+5) String/1000/Base16-8 4.01µs ± 0% 4.01µs ± 0% ~ (p=0.421 n=5+5) String/10000/Base16-8 37.8µs ± 1% 37.8µs ± 1% ~ (p=0.841 n=5+5) String/100000/Base16-8 373µs ± 2% 373µs ± 0% ~ (p=0.421 n=5+5) LeafSize/0-8 6.63ms ± 0% 6.63ms ± 0% ~ (p=0.730 n=4+5) LeafSize/1-8 74.0µs ± 0% 67.7µs ± 1% -8.53% (p=0.008 n=5+5) LeafSize/2-8 74.2µs ± 0% 68.3µs ± 1% -7.99% (p=0.008 n=5+5) LeafSize/3-8 379µs ± 0% 309µs ± 0% -18.52% (p=0.008 n=5+5) LeafSize/4-8 72.7µs ± 1% 66.7µs ± 0% -8.37% (p=0.008 n=5+5) LeafSize/5-8 471µs ± 0% 384µs ± 0% -18.55% (p=0.008 n=5+5) LeafSize/6-8 378µs ± 0% 308µs ± 0% -18.59% (p=0.008 n=5+5) LeafSize/7-8 245µs ± 0% 204µs ± 1% -16.75% (p=0.008 n=5+5) LeafSize/8-8 73.4µs ± 0% 66.9µs ± 1% -8.79% (p=0.008 n=5+5) LeafSize/9-8 538µs ± 0% 437µs ± 0% -18.75% (p=0.008 n=5+5) LeafSize/10-8 472µs ± 0% 396µs ± 1% -16.01% (p=0.008 n=5+5) LeafSize/11-8 460µs ± 0% 374µs ± 0% -18.58% (p=0.008 n=5+5) LeafSize/12-8 378µs ± 0% 308µs ± 0% -18.38% (p=0.008 n=5+5) LeafSize/13-8 343µs ± 0% 284µs ± 0% -17.30% (p=0.008 n=5+5) LeafSize/14-8 248µs ± 0% 206µs ± 0% -16.94% (p=0.008 n=5+5) LeafSize/15-8 169µs ± 0% 144µs ± 0% -14.69% (p=0.008 n=5+5) LeafSize/16-8 72.9µs ± 0% 66.8µs ± 1% -8.27% (p=0.008 n=5+5) LeafSize/32-8 82.5µs ± 0% 76.7µs ± 0% -7.04% (p=0.008 n=5+5) LeafSize/64-8 134µs ± 0% 129µs ± 0% -3.80% (p=0.008 n=5+5) ProbablyPrime/n=0-8 44.2ms ± 0% 43.4ms ± 0% -1.95% (p=0.008 n=5+5) ProbablyPrime/n=1-8 64.9ms ± 0% 64.0ms ± 0% -1.27% (p=0.008 n=5+5) ProbablyPrime/n=5-8 147ms ± 0% 146ms ± 0% -0.58% (p=0.008 n=5+5) ProbablyPrime/n=10-8 250ms ± 0% 249ms ± 0% -0.35% (p=0.008 n=5+5) ProbablyPrime/n=20-8 456ms ± 0% 455ms ± 0% -0.18% (p=0.008 n=5+5) ProbablyPrime/Lucas-8 23.6ms ± 0% 22.7ms ± 0% -3.74% (p=0.008 n=5+5) ProbablyPrime/MillerRabinBase2-8 20.7ms ± 0% 20.6ms ± 0% ~ (p=0.421 n=5+5) FloatSqrt/64-8 2.25µs ± 1% 2.29µs ± 0% +1.48% (p=0.008 n=5+5) FloatSqrt/128-8 4.86µs ± 1% 4.92µs ± 1% +1.21% (p=0.032 n=5+5) FloatSqrt/256-8 13.6µs ± 0% 13.7µs ± 1% +1.31% (p=0.032 n=5+5) FloatSqrt/1000-8 70.0µs ± 1% 70.1µs ± 0% ~ (p=0.690 n=5+5) FloatSqrt/10000-8 1.92ms ± 0% 1.90ms ± 0% -0.59% (p=0.008 n=5+5) FloatSqrt/100000-8 55.3ms ± 0% 54.8ms ± 0% -1.01% (p=0.008 n=5+5) FloatSqrt/1000000-8 4.56s ± 0% 4.50s ± 0% -1.28% (p=0.008 n=5+5) name old speed new speed delta AddVV/1-8 2.97GB/s ± 0% 5.56GB/s ± 0% +86.85% (p=0.008 n=5+5) AddVV/2-8 9.47GB/s ± 0% 10.66GB/s ± 0% +12.50% (p=0.008 n=5+5) AddVV/3-8 12.4GB/s ± 0% 14.7GB/s ± 0% +19.10% (p=0.008 n=5+5) AddVV/4-8 14.6GB/s ± 0% 18.9GB/s ± 0% +29.63% (p=0.016 n=4+5) AddVV/5-8 16.4GB/s ± 0% 22.0GB/s ± 0% +34.47% (p=0.016 n=5+4) AddVV/10-8 21.7GB/s ± 0% 35.5GB/s ± 0% +63.89% (p=0.008 n=5+5) AddVV/100-8 29.4GB/s ± 0% 68.0GB/s ± 0% +131.38% (p=0.008 n=5+5) AddVV/1000-8 31.7GB/s ± 0% 61.9GB/s ± 0% +95.43% (p=0.008 n=5+5) AddVV/10000-8 31.2GB/s ± 0% 56.4GB/s ± 0% +80.83% (p=0.008 n=5+5) AddVV/100000-8 25.9GB/s ± 3% 41.4GB/s ± 0% +59.98% (p=0.008 n=5+5) SubVV/1-8 2.97GB/s ± 0% 5.56GB/s ± 0% +86.97% (p=0.016 n=4+5) SubVV/2-8 9.47GB/s ± 0% 10.66GB/s ± 0% +12.51% (p=0.008 n=5+5) SubVV/3-8 12.4GB/s ± 0% 14.8GB/s ± 0% +19.23% (p=0.016 n=4+5) SubVV/4-8 14.6GB/s ± 0% 18.9GB/s ± 0% +29.56% (p=0.008 n=5+5) SubVV/5-8 16.4GB/s ± 0% 22.0GB/s ± 0% +34.47% (p=0.016 n=4+5) SubVV/10-8 21.7GB/s ± 0% 35.5GB/s ± 0% +63.89% (p=0.008 n=5+5) SubVV/100-8 29.4GB/s ± 0% 68.0GB/s ± 0% +131.38% (p=0.008 n=5+5) SubVV/1000-8 31.6GB/s ± 0% 80.1GB/s ± 0% +153.08% (p=0.008 n=5+5) SubVV/10000-8 31.2GB/s ± 0% 56.7GB/s ± 0% +81.79% (p=0.008 n=5+5) SubVV/100000-8 29.1GB/s ±10% 29.0GB/s ±18% ~ (p=0.690 n=5+5) AddVW/1-8 859MB/s ± 0% 859MB/s ± 0% -0.01% (p=0.008 n=5+5) AddVW/2-8 811MB/s ± 1% 814MB/s ± 0% ~ (p=0.413 n=5+4) AddVW/3-8 2.08GB/s ± 0% 2.08GB/s ± 0% ~ (p=0.206 n=5+5) AddVW/4-8 2.46GB/s ± 0% 2.46GB/s ± 0% ~ (p=0.056 n=5+5) AddVW/5-8 2.75GB/s ± 0% 2.75GB/s ± 0% ~ (p=0.508 n=5+5) AddVW/10-8 3.63GB/s ± 0% 3.63GB/s ± 0% ~ (p=0.214 n=5+5) AddVW/100-8 4.79GB/s ± 0% 4.79GB/s ± 0% ~ (p=0.500 n=5+5) AddVW/1000-8 5.27GB/s ± 0% 5.25GB/s ± 0% -0.43% (p=0.008 n=5+5) AddVW/10000-8 5.30GB/s ± 0% 5.30GB/s ± 0% ~ (p=0.397 n=5+5) AddVW/100000-8 5.27GB/s ± 1% 5.25GB/s ± 1% ~ (p=0.690 n=5+5) AddMulVVW/1-8 1.92GB/s ± 0% 1.96GB/s ± 1% +1.95% (p=0.008 n=5+5) AddMulVVW/2-8 2.16GB/s ± 1% 2.25GB/s ± 1% +4.32% (p=0.008 n=5+5) AddMulVVW/3-8 2.39GB/s ± 1% 2.25GB/s ± 3% -5.79% (p=0.008 n=5+5) AddMulVVW/4-8 2.00GB/s ± 0% 2.31GB/s ± 1% +15.31% (p=0.008 n=5+5) AddMulVVW/5-8 2.22GB/s ± 0% 2.14GB/s ± 0% -3.86% (p=0.008 n=5+5) AddMulVVW/10-8 2.15GB/s ± 1% 2.25GB/s ± 0% +5.03% (p=0.008 n=5+5) AddMulVVW/100-8 2.09GB/s ± 0% 2.14GB/s ± 0% +2.25% (p=0.008 n=5+5) AddMulVVW/1000-8 2.04GB/s ± 0% 2.38GB/s ± 0% +16.52% (p=0.008 n=5+5) AddMulVVW/10000-8 2.03GB/s ± 0% 2.10GB/s ± 0% +3.64% (p=0.008 n=5+5) AddMulVVW/100000-8 2.02GB/s ± 0% 2.02GB/s ± 1% ~ (p=0.690 n=5+5) Change-Id: Ie482d67a7dbb5af6f5d81af2b3d9d14bd66336db Reviewed-on: https://go-review.googlesource.com/77831 Reviewed-by: Cherry Zhang <cherryyz@google.com> Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org>
2017-03-23	math/big: replace local versions of bitLen, nlz with math/bits versions	Robert Griesemer
	Verified that BenchmarkBitLen time went down from 2.25 ns/op to 0.65 ns/op an a 2.3 GHz Intel Core i7, before removing that benchmark (now covered by math/bits benchmarks). Change-Id: I3890bb7d1889e95b9a94bd68f0bdf06f1885adeb Reviewed-on: https://go-review.googlesource.com/38464 Run-TryBot: Robert Griesemer <gri@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
2016-09-14	math/big: cut 2 minutes off race tests	Brad Fitzpatrick
	No need to test so many sizes in race mode, especially for a package which doesn't use goroutines. Reduces test time from 2.5 minutes to 25 seconds. Updates #17104 Change-Id: I7065b39273f82edece385c0d67b3f2d83d4934b8 Reviewed-on: https://go-review.googlesource.com/29163 Reviewed-by: David Crawshaw <crawshaw@golang.org>
2016-05-26	math/big: simplify benchmarking code some more	Robert Griesemer
	Follow-up cleanup to https://golang.org/cl/23424/ . Change-Id: Ifb05c1ff5327df6bc5f4cbc554e18363293f7960 Reviewed-on: https://go-review.googlesource.com/23446 Reviewed-by: Marcel van Lohuizen <mpvl@golang.org>
2016-05-25	math/big: use run for benchmarks	Marcel van Lohuizen
	shortens code and gives an example of the use of Run. Change-Id: I75ffaf762218a589274b4b62e19022e31e805d1b Reviewed-on: https://go-review.googlesource.com/23424 Reviewed-by: Russ Cox <rsc@golang.org> Run-TryBot: Marcel van Lohuizen <mpvl@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org>
2016-03-21	all: delete dead test code	Dominik Honnef
	This deletes unused code and helpers from tests. Change-Id: Ie31d46115f558ceb8da6efbf90c3c204e03b0d7e Reviewed-on: https://go-review.googlesource.com/20927 Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org> Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org>
2016-03-02	all: single space after period.	Brad Fitzpatrick
	The tree's pretty inconsistent about single space vs double space after a period in documentation. Make it consistently a single space, per earlier decisions. This means contributors won't be confused by misleading precedence. This CL doesn't use go/doc to parse. It only addresses // comments. It was generated with: $ perl -i -npe 's,^(\s// .+[a-z]\.) +([A-Z]),$1 $2,' $(git grep -l -E '^\s//(.+\.) +([A-Z])') $ go test go/doc -update Change-Id: Iccdb99c37c797ef1f804a94b22ba5ee4b500c4f7 Reviewed-on: https://go-review.googlesource.com/20022 Reviewed-by: Rob Pike <r@golang.org> Reviewed-by: Dave Day <djd@golang.org> Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org>
2015-04-21	math/big: test that subVW and addVW work with arbitrary y	Josh Bleecher Snyder
	Fixes #10525. Change-Id: I92dc87f5d6db396d8dde2220fc37b7093b772d81 Reviewed-on: https://go-review.googlesource.com/9210 Reviewed-by: Robert Griesemer <gri@golang.org>
2015-01-08	math/big: faster "pure Go" addition/subtraction for long vectors	Robert Griesemer
	(platforms w/o corresponding assembly kernels) For short vector adds there's some erradic slow-down, but overall these routines have become significantly faster. This only matters for platforms w/o native (assembly) versions of these kernels, so we are not concerned about the minor slow-down for short vectors. This code was already reviewed under Mercurial (golang.org/cl/172810043) but wasn't submitted before the switch to git. Benchmarks run on 2.3GHz Intel Core i7, running OS X 10.9.5, with the respective AddVV and AddVW assembly routines disabled. benchmark old ns/op new ns/op delta BenchmarkAddVV_1 6.59 7.09 +7.59% BenchmarkAddVV_2 10.3 10.1 -1.94% BenchmarkAddVV_3 10.9 12.6 +15.60% BenchmarkAddVV_4 13.9 15.6 +12.23% BenchmarkAddVV_5 16.8 17.3 +2.98% BenchmarkAddVV_1e1 29.5 29.9 +1.36% BenchmarkAddVV_1e2 246 232 -5.69% BenchmarkAddVV_1e3 2374 2185 -7.96% BenchmarkAddVV_1e4 58942 22292 -62.18% BenchmarkAddVV_1e5 668622 225279 -66.31% BenchmarkAddVW_1 6.81 5.58 -18.06% BenchmarkAddVW_2 7.69 6.86 -10.79% BenchmarkAddVW_3 9.56 8.32 -12.97% BenchmarkAddVW_4 12.1 9.53 -21.24% BenchmarkAddVW_5 13.2 10.9 -17.42% BenchmarkAddVW_1e1 23.4 18.0 -23.08% BenchmarkAddVW_1e2 175 141 -19.43% BenchmarkAddVW_1e3 1568 1266 -19.26% BenchmarkAddVW_1e4 15425 12596 -18.34% BenchmarkAddVW_1e5 156737 133539 -14.80% BenchmarkFibo 381678466 132958666 -65.16% benchmark old MB/s new MB/s speedup BenchmarkAddVV_1 9715.25 9028.30 0.93x BenchmarkAddVV_2 12461.72 12622.60 1.01x BenchmarkAddVV_3 17549.64 15243.82 0.87x BenchmarkAddVV_4 18392.54 16398.29 0.89x BenchmarkAddVV_5 18995.23 18496.57 0.97x BenchmarkAddVV_1e1 21708.98 21438.28 0.99x BenchmarkAddVV_1e2 25956.53 27506.88 1.06x BenchmarkAddVV_1e3 26947.93 29286.66 1.09x BenchmarkAddVV_1e4 10857.96 28709.46 2.64x BenchmarkAddVV_1e5 9571.91 28409.21 2.97x BenchmarkAddVW_1 1175.28 1433.98 1.22x BenchmarkAddVW_2 2080.01 2332.54 1.12x BenchmarkAddVW_3 2509.28 2883.97 1.15x BenchmarkAddVW_4 2646.09 3356.83 1.27x BenchmarkAddVW_5 3020.69 3671.07 1.22x BenchmarkAddVW_1e1 3425.76 4441.40 1.30x BenchmarkAddVW_1e2 4553.17 5642.96 1.24x BenchmarkAddVW_1e3 5100.14 6318.72 1.24x BenchmarkAddVW_1e4 5186.15 6350.96 1.22x BenchmarkAddVW_1e5 5104.07 5990.74 1.17x Change-Id: I7a62023b1105248a0e85e5b9819d3fd4266123d4 Reviewed-on: https://go-review.googlesource.com/2480 Reviewed-by: Russ Cox <rsc@golang.org> Reviewed-by: Alan Donovan <adonovan@google.com>
2014-09-08	build: move package sources from src/pkg to src	Russ Cox
	Preparation was in CL 134570043. This CL contains only the effect of 'hg mv src/pkg/* src'. For more about the move, see golang.org/s/go14nopkg.

The vast majority of the time, carry propagation is limited and addVW/subVW only need to consider a single word for carry propagation. As Josh Bleecher-Snyder pointed out in 2019 (CL 164968), once carrying is done, the remaining words can be handled faster with copy (memmove). In the benchmarks below, this is the data=random case. Even more important, if the source and destination are the same, the copy can be optimized away entirely, making a small in-place addition to a big.Int O(1) instead of O(N). To date, only a few systems (amd64, arm64, and pure Go, meaning wasm) make use of this asymptotic improvement. This is the data=shortcut case. This CL deletes the addVW/subVW assembly and replaces it with an optimized pure Go version. Using Go makes it easy to call the real copy builtin, which will use optimized memmove code, instead of recreating a worse memmove in assembly (as arm64 does) or omitting the copy optimization entirely (as most others do). The worst case for the Go version versus assembly is the case of incrementing 2^N-1 by 1, which has to propagate a carry the entire length of the array. This is the data=carry case. On balance, we believe this case is rare enough to be worth taking a hit in that case, in exchange for significant wins in the other cases and the deletion of significant amounts of assembly of varying quality. (Remember that half the assembly has the copy optimization and shortcut, while half does not.) In the benchmarks, the systems are: c2s16 GOARCH=amd64 c2s16 perf gomote (Intel, Google Cloud) c3h88 GOARCH=amd64 c3h88 perf gomote (newer Intel, Google Cloud) s7 GOARCH=amd64 rsc basement server (AMD Ryzen 9 7950X) c4as16 GOARCH=arm64 c4as16 perf gomote (Google Cloud) mac GOARCH=arm64 Apple M3 Pro in MacBook Pro 386 GOARCH=386 gotip-linux-386 gomote arm GOARCH=arm gotip-linux-arm gomote loong64 GOARCH=loong64 gotip-linux-loong64 gomote ppc64le GOARCH=ppc64le gotip-linux-ppc64le gomote riscv64 GOARCH=riscv64 gotip-linux-riscv64 gomote benchmark \ system c2s16 c3h88 s7 c4as16 mac 386 arm loong64 ppc64le riscv64 AddVW/words=1/data=random -1.15% -1.74% -5.89% -9.80% -11.54% +23.71% -12.74% -14.25% +14.67% +10.27% AddVW/words=2/data=random -2.59% ~ -4.38% -19.31% -15.41% +24.80% ~ -19.99% +13.73% +19.71% AddVW/words=3/data=random -3.75% -19.10% -3.79% -23.15% -17.04% +20.04% -10.07% -23.20% ~ +15.39% AddVW/words=4/data=random -2.84% +7.05% -8.77% -22.64% -15.77% +16.01% -7.36% -28.22% ~ +23.00% AddVW/words=5/data=random -10.97% +2.16% -12.09% -20.89% -17.14% +9.42% -4.69% -32.60% ~ +10.07% AddVW/words=6/data=random -9.87% ~ -7.54% -19.08% -6.46% ~ -3.44% -34.61% ~ +12.19% AddVW/words=7/data=random -14.36% ~ -10.09% -19.10% -10.47% -6.20% -5.06% -38.14% -11.54% +6.79% AddVW/words=8/data=random -17.50% ~ -11.06% -25.14% -12.88% -8.35% -5.11% -41.39% -14.04% +11.87% AddVW/words=9/data=random -19.76% -4.05% -15.47% -24.08% -16.50% -12.34% -21.56% -44.25% -14.82% ~ AddVW/words=10/data=random -13.89% ~ -9.69% -23.06% -8.04% -12.58% -19.25% -32.80% -11.68% ~ AddVW/words=16/data=random -29.36% -15.35% -21.86% -25.04% -19.89% -32.26% -16.29% -42.66% -25.92% -3.01% AddVW/words=32/data=random -39.02% -28.76% -39.87% -11.22% -2.85% -55.40% -31.17% -55.37% -37.92% -16.28% AddVW/words=64/data=random -25.94% -19.09% -20.60% -6.90% +8.91% -51.00% -43.72% -62.27% -44.11% -28.74% AddVW/words=100/data=random -22.79% -18.13% -18.25% ~ +33.89% -67.40% -51.77% -63.54% -53.75% -30.97% AddVW/words=1000/data=random -8.98% -3.84% ~ -3.15% ~ -93.35% -63.92% -65.66% -68.67% -42.30% AddVW/words=10000/data=random -1.38% -0.38% ~ ~ ~ -89.16% -65.18% -44.65% -70.35% -20.08% AddVW/words=100000/data=random ~ ~ ~ ~ ~ -87.03% -64.51% -36.08% -61.40% -16.53% SubVW/words=1/data=random -3.67% ~ -8.38% -10.26% -3.07% +45.78% -6.06% -11.17% ~ ~ SubVW/words=2/data=random -3.48% -10.07% -5.76% -20.14% -8.45% +44.28% ~ -19.09% ~ +16.98% SubVW/words=3/data=random -7.11% -26.64% -4.48% -22.07% -9.21% +35.61% ~ -23.93% -18.20% ~ SubVW/words=4/data=random -4.23% +7.19% -8.95% -22.62% -13.89% +33.20% -8.96% -29.96% ~ +22.23% SubVW/words=5/data=random -11.49% +1.92% -10.86% -22.27% -17.53% +24.48% -2.88% -35.19% -19.55% ~ SubVW/words=6/data=random -7.67% ~ -7.72% -18.44% -6.24% +12.03% -2.00% -39.68% -10.73% ~ SubVW/words=7/data=random -13.69% -18.32% -11.82% -18.92% -11.57% +6.63% ~ -43.54% -30.81% ~ SubVW/words=8/data=random -16.02% ~ -11.07% -24.50% -11.92% +4.32% -3.01% -46.95% -24.14% ~ SubVW/words=9/data=random -18.76% -3.34% -14.84% -23.79% -17.50% ~ -21.80% -49.98% -29.62% ~ SubVW/words=10/data=random -13.23% ~ -9.25% -21.26% -11.63% ~ -18.58% -39.19% -20.09% ~ SubVW/words=16/data=random -28.25% -13.24% -22.66% -27.18% -19.13% -23.38% -20.24% -51.01% -28.06% -3.05% SubVW/words=32/data=random -38.41% -28.88% -40.12% -11.20% -2.80% -49.17% -34.67% -63.29% -39.25% -15.20% SubVW/words=64/data=random -25.51% -19.24% -22.20% -6.57% +9.98% -48.52% -48.14% -69.50% -49.44% -27.92% SubVW/words=100/data=random -21.69% -18.51% ~ +1.92% +34.42% -65.88% -54.67% -71.24% -58.88% -30.71% SubVW/words=1000/data=random -9.81% -4.05% -2.14% -3.06% ~ -93.37% -67.33% -74.12% -68.36% -42.17% SubVW/words=10000/data=random ~ -0.52% ~ ~ ~ -88.87% -68.54% -44.94% -70.63% -19.95% SubVW/words=100000/data=random ~ ~ ~ ~ ~ -86.69% -68.09% -48.36% -62.42% -19.32% AddVW/words=1/data=shortcut -29.38% -25.38% -27.37% -23.15% -25.41% +3.01% -33.60% -36.12% -15.76% ~ AddVW/words=2/data=shortcut -32.79% -34.72% -31.47% -24.47% -28.21% -3.75% -34.66% -43.89% -23.65% -21.56% AddVW/words=3/data=shortcut -38.50% -46.83% -35.67% -26.38% -30.29% -10.41% -44.89% -47.68% -30.93% -26.85% AddVW/words=4/data=shortcut -40.40% -28.85% -34.19% -29.83% -32.95% -16.09% -42.86% -51.02% -34.19% -26.69% AddVW/words=5/data=shortcut -43.87% -35.42% -36.46% -32.59% -37.72% -20.82% -45.14% -54.01% -35.49% -30.48% AddVW/words=6/data=shortcut -46.98% -39.34% -42.22% -35.43% -38.18% -27.46% -46.72% -56.61% -40.21% -34.07% AddVW/words=7/data=shortcut -49.63% -47.97% -46.61% -35.28% -41.93% -31.14% -49.29% -58.89% -41.10% -37.01% AddVW/words=8/data=shortcut -50.48% -42.33% -45.40% -40.24% -41.74% -32.92% -50.62% -60.98% -44.85% -38.10% AddVW/words=9/data=shortcut -54.27% -43.52% -49.06% -42.16% -45.22% -37.57% -51.84% -62.91% -46.04% -40.82% AddVW/words=10/data=shortcut -56.01% -45.40% -51.42% -43.29% -46.14% -38.65% -53.65% -64.62% -47.05% -43.21% AddVW/words=16/data=shortcut -62.73% -55.66% -59.31% -56.38% -54.31% -53.16% -61.03% -72.29% -58.24% -52.57% AddVW/words=32/data=shortcut -74.00% -69.42% -71.75% -33.65% -37.35% -71.73% -72.59% -82.44% -70.87% -67.69% AddVW/words=64/data=shortcut -56.69% -52.72% -52.09% -35.48% -36.87% -84.24% -83.10% -90.37% -82.56% -80.81% AddVW/words=100/data=shortcut -56.68% -53.18% -51.49% -33.49% -37.72% -89.95% -88.21% -93.37% -88.47% -86.52% AddVW/words=1000/data=shortcut -56.68% -52.45% -51.66% -35.31% -36.65% -98.88% -98.62% -99.24% -98.78% -98.41% AddVW/words=10000/data=shortcut -56.70% -52.40% -51.92% -33.49% -36.98% -99.89% -99.86% -99.92% -99.87% -99.91% AddVW/words=100000/data=shortcut -56.67% -52.46% -52.38% -35.31% -37.20% -99.99% -99.99% -99.99% -99.99% -99.99% SubVW/words=1/data=shortcut -29.80% -20.71% -26.94% -23.24% -25.33% +26.97% -32.02% -37.85% -40.20% -12.67% SubVW/words=2/data=shortcut -35.47% -36.38% -31.93% -25.43% -30.18% +18.96% -33.48% -46.48% -39.38% -18.65% SubVW/words=3/data=shortcut -39.22% -49.96% -36.90% -25.82% -30.96% +12.53% -40.67% -51.07% -43.71% -23.78% SubVW/words=4/data=shortcut -40.46% -24.90% -34.66% -29.87% -33.97% +4.60% -42.32% -54.92% -42.83% -22.45% SubVW/words=5/data=shortcut -43.84% -34.17% -38.00% -32.55% -37.27% -2.46% -43.09% -58.18% -45.70% -26.45% SubVW/words=6/data=shortcut -47.69% -37.49% -42.73% -35.90% -37.73% -8.52% -46.55% -61.01% -44.00% -30.14% SubVW/words=7/data=shortcut -49.45% -50.66% -46.88% -34.77% -41.64% -14.46% -48.92% -63.46% -50.47% -33.39% SubVW/words=8/data=shortcut -50.45% -39.31% -47.14% -40.47% -41.70% -15.77% -50.21% -65.64% -47.71% -34.01% SubVW/words=9/data=shortcut -54.28% -43.07% -49.42% -41.34% -44.99% -19.39% -51.55% -67.61% -56.92% -36.82% SubVW/words=10/data=shortcut -56.85% -47.88% -50.92% -42.76% -45.67% -23.60% -53.04% -69.34% -60.18% -39.43% SubVW/words=16/data=shortcut -62.36% -54.83% -58.80% -55.83% -53.74% -41.04% -60.16% -76.75% -60.56% -48.63% SubVW/words=32/data=shortcut -73.68% -68.64% -71.57% -33.52% -37.34% -64.73% -72.67% -85.89% -71.87% -64.56% SubVW/words=64/data=shortcut -56.68% -51.66% -52.56% -34.75% -37.54% -80.30% -83.58% -92.39% -83.41% -78.70% SubVW/words=100/data=shortcut -56.68% -50.97% -51.57% -33.68% -36.78% -87.42% -88.53% -94.84% -88.87% -84.96% SubVW/words=1000/data=shortcut -56.68% -50.89% -52.10% -34.94% -37.77% -98.59% -98.71% -99.43% -98.80% -98.20% SubVW/words=10000/data=shortcut -56.68% -51.00% -52.44% -33.65% -37.27% -99.86% -99.87% -99.94% -99.88% -99.90% SubVW/words=100000/data=shortcut -56.68% -50.80% -52.20% -34.79% -37.46% -99.99% -99.99% -99.99% -99.99% -99.99% AddVW/words=1/data=carry -0.51% -5.29% -24.03% -26.48% ~ ~ -33.14% -30.23% ~ -20.74% AddVW/words=2/data=carry -6.36% ~ -21.05% -39.40% ~ +10.72% -29.12% -31.34% ~ -17.29% AddVW/words=3/data=carry ~ ~ -17.46% -19.53% +17.58% ~ -26.23% -23.61% +7.80% -14.34% AddVW/words=4/data=carry +19.02% +16.80% ~ ~ +28.25% ~ -27.90% -20.31% +19.16% ~ AddVW/words=5/data=carry +3.97% +53.02% ~ ~ +11.31% ~ -19.05% -17.47% +16.81% ~ AddVW/words=6/data=carry +2.98% +19.83% ~ ~ +14.84% ~ -18.48% -14.92% +18.25% ~ AddVW/words=7/data=carry ~ ~ ~ ~ +27.17% ~ -15.50% -12.74% +13.00% ~ AddVW/words=8/data=carry +0.58% +22.32% ~ +6.10% +29.63% ~ -13.04% ~ +28.46% +2.95% AddVW/words=9/data=carry ~ +31.53% ~ ~ +14.42% ~ -11.32% ~ +18.37% +3.28% AddVW/words=10/data=carry +3.94% +22.36% ~ +6.29% +19.22% ~ -11.27% ~ +20.10% +3.91% AddVW/words=16/data=carry +2.82% +14.23% ~ +10.06% +25.91% -16.12% ~ ~ +52.28% +10.40% AddVW/words=32/data=carry ~ +25.35% +13.66% ~ +34.89% -34.39% +6.51% -18.71% +41.06% +19.42% AddVW/words=64/data=carry -42.03% ~ -39.70% +6.65% +32.29% -39.94% +14.34% ~ +19.68% +20.86% AddVW/words=100/data=carry -33.95% -34.28% -39.65% ~ +27.72% -26.80% +17.40% ~ +26.39% +23.32% AddVW/words=1000/data=carry -42.49% -47.87% -47.44% +1.25% +4.25% -41.76% +23.40% ~ +25.48% +27.99% AddVW/words=10000/data=carry -41.85% -48.49% -49.43% ~ ~ -42.09% +24.61% -10.32% +40.55% +18.35% AddVW/words=100000/data=carry -28.18% -48.13% -48.24% +1.35% ~ -42.90% +24.73% -9.79% +22.55% +17.16% SubVW/words=1/data=carry -10.32% -17.16% -24.14% -26.24% ~ +18.43% -34.10% -29.54% -9.57% ~ SubVW/words=2/data=carry -19.45% -23.31% -20.74% -39.73% ~ +15.74% -28.13% -30.21% ~ -18.74% SubVW/words=3/data=carry ~ -16.18% -15.34% -19.54% +17.62% +12.39% -27.64% -27.09% ~ -14.97% SubVW/words=4/data=carry +11.67% +24.42% ~ ~ +25.11% +14.07% -28.08% -26.18% ~ ~ SubVW/words=5/data=carry +8.08% +25.64% ~ ~ +10.35% +8.12% -21.75% -25.50% ~ -4.86% SubVW/words=6/data=carry ~ +13.82% ~ ~ +12.92% +6.79% -20.25% -24.70% ~ -2.74% SubVW/words=7/data=carry ~ ~ +8.29% +4.51% +26.59% +4.62% -18.01% -24.09% ~ -1.26% SubVW/words=8/data=carry ~ +23.16% +16.19% +6.16% +25.46% +6.74% -15.57% -22.74% ~ +1.44% SubVW/words=9/data=carry ~ +30.71% +20.81% ~ +12.36% ~ -12.99% ~ ~ +3.13% SubVW/words=10/data=carry +5.03% +19.53% +14.84% +14.16% +16.12% ~ -11.64% -16.00% +15.45% +3.29% SubVW/words=16/data=carry +14.42% +15.58% +33.07% +11.43% +24.65% ~ ~ -21.90% +25.59% +9.40% SubVW/words=32/data=carry ~ +27.57% +46.58% ~ +35.35% -8.49% ~ -24.04% +11.86% +18.40% SubVW/words=64/data=carry -24.34% -27.83% -20.90% +13.34% +37.17% -14.90% ~ -8.81% +12.88% +18.92% SubVW/words=100/data=carry -25.19% -34.70% -27.45% +12.86% +28.42% -14.48% ~ ~ +25.71% +21.93% SubVW/words=1000/data=carry -24.93% -47.86% -47.26% +2.66% ~ -23.88% ~ ~ +25.99% +27.81% SubVW/words=10000/data=carry -24.17% -36.48% -49.41% +1.06% ~ -25.06% ~ -26.50% +27.94% +18.36% SubVW/words=100000/data=carry -22.51% -35.86% -49.46% +3.96% ~ -25.18% ~ -22.15% +26.86% +15.44% Change-Id: I8f252073040e674780ac6ec9912082fb205329dd Reviewed-on: https://go-review.googlesource.com/c/go/+/664898 Reviewed-by: Alan Donovan <adonovan@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>

Also fix a few real but currently harmless bugs from CL 664895. There were a few places that were still wrong if z != x or if a != 0. Change-Id: Id8971e2505523bc4708780c82bf998a546f4f081 Reviewed-on: https://go-review.googlesource.com/c/go/+/664897 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Alan Donovan <adonovan@google.com>

It is annoying that non-x86 implementations of shlVU and shrVU have to go out of their way to handle the trivial case shift==0 with their own copy loops. Instead, arrange to never call them with shift==0, so that the code can be removed. Unfortunately, there are linknames of shlVU, so we cannot change that function. But we can rename the functions and then leave behind a shlVU wrapper, so do that. Since the big.Int API calls the operations Lsh and Rsh, rename shlVU/shrVU to lshVU/rshVU. Also rename various other shl/shr methods and functions to lsh/rsh. Change-Id: Ieaf54e0110a298730aa3e4566ce5be57ba7fc121 Reviewed-on: https://go-review.googlesource.com/c/go/+/664896 Reviewed-by: Alan Donovan <adonovan@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>

addMulVVW is an unnecessarily special case. All other assembly routines taking []Word (V as in vector) arguments take separate source and destination. For example: addVV: z = x+y mulAddVWW: z = x*m+a addMulVVW uses the z parameter as both destination and source: addMulVVW: z = z+x*m Even looking at the signatures is confusing: all the VV routines take two input vectors x and y, but addMulVVW takes only x: where is y? (The answer is that the two inputs are z and x.) It would be nice to fix this, both for understandability and regularity, and to simplify a future assembly generator. We cannot remove or redefine addMulVVW, because it has been used in linknames. Instead, the CL adds a new final addend argument ‘a’ like in mulAddVWW, making the natural name addMulVVWW (two input vectors, two input words): addMulVVWW: z = x+y*m+a This CL updates all the assembly implementations to rename the inputs z, x, y -> x, y, m, and then introduces a separate destination z. Change-Id: Ib76c80b53f6d1f4a901f663566e9c4764bb20488 Reviewed-on: https://go-review.googlesource.com/c/go/+/664895 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Alan Donovan <adonovan@google.com>

In the early days of math/big, algorithms that needed more space grew the result larger than it needed to be and then used the high words as extra space. This made results their own temporary space caches, at the cost that saving a result in a data structure might hold significantly more memory than necessary. Specifically, new(big.Int).Mul(x, y) returned a big.Int with a backing slice 3X as big as it strictly needed to be. If you are storing many multiplication results, or even a single large result, the 3X overhead can add up. This approach to storage for temporaries also requires being able to analyze the algorithms to predict the exact amount they need, which can be difficult. For both these reasons, the implementation of recursive long division, which came later, introduced a “nat pool” where temporaries could be stored and reused, or reclaimed by the GC when no longer used. This avoids the storage and bookkeeping overheads but introduces a per-temporary sync.Pool overhead. divRecursiveStep takes an array of cached temporaries to remove some of that overhead. The nat pool was better but is still not quite right. This CL introduces something even better than the nat pool (still probably not quite right, but the best I can see for now): a sync.Pool holding stacks for allocating temporaries. Now an operation can get one stack out of the pool and then allocate as many temporaries as it needs during the operation, eventually returning the stack back to the pool. The sync.Pool operations are now per-exported-operation (like big.Int.Mul), not per-temporary. This CL converts both the pre-allocation in nat.mul and the uses of the nat pool to use stack pools instead. This simplifies some code and sets us up better for more complex algorithms (such as Toom-Cook or FFT-based multiplication) that need more temporaries. It is also a little bit faster. goos: linux goarch: amd64 pkg: math/big cpu: Intel(R) Xeon(R) CPU @ 3.10GHz │ old │ new │ │ sec/op │ sec/op vs base │ Div/20/10-16 23.68n ± 0% 22.21n ± 0% -6.21% (p=0.000 n=15) Div/40/20-16 23.68n ± 0% 22.21n ± 0% -6.21% (p=0.000 n=15) Div/100/50-16 56.65n ± 0% 55.53n ± 0% -1.98% (p=0.000 n=15) Div/200/100-16 194.6n ± 1% 172.8n ± 0% -11.20% (p=0.000 n=15) Div/400/200-16 232.1n ± 0% 206.7n ± 0% -10.94% (p=0.000 n=15) Div/1000/500-16 405.3n ± 1% 383.8n ± 0% -5.30% (p=0.000 n=15) Div/2000/1000-16 810.4n ± 1% 795.2n ± 0% -1.88% (p=0.000 n=15) Div/20000/10000-16 25.88µ ± 0% 25.39µ ± 0% -1.89% (p=0.000 n=15) Div/200000/100000-16 931.5µ ± 0% 924.3µ ± 0% -0.77% (p=0.000 n=15) Div/2000000/1000000-16 37.77m ± 0% 37.75m ± 0% ~ (p=0.098 n=15) Div/20000000/10000000-16 1.367 ± 0% 1.377 ± 0% +0.72% (p=0.003 n=15) NatMul/10-16 168.5n ± 3% 164.0n ± 4% ~ (p=0.751 n=15) NatMul/100-16 6.086µ ± 3% 5.380µ ± 3% -11.60% (p=0.000 n=15) NatMul/1000-16 238.1µ ± 3% 228.3µ ± 1% -4.12% (p=0.000 n=15) NatMul/10000-16 8.721m ± 2% 8.518m ± 1% -2.33% (p=0.000 n=15) NatMul/100000-16 369.6m ± 0% 371.1m ± 0% +0.42% (p=0.000 n=15) geomean 19.57µ 18.74µ -4.21% │ old │ new │ │ B/op │ B/op vs base │ NatMul/10-16 192.0 ± 0% 192.0 ± 0% ~ (p=1.000 n=15) ¹ NatMul/100-16 4.750Ki ± 0% 1.751Ki ± 0% -63.14% (p=0.000 n=15) NatMul/1000-16 48.16Ki ± 0% 16.02Ki ± 0% -66.73% (p=0.000 n=15) NatMul/10000-16 482.9Ki ± 1% 165.4Ki ± 3% -65.75% (p=0.000 n=15) NatMul/100000-16 5.747Mi ± 7% 4.197Mi ± 0% -26.97% (p=0.000 n=15) geomean 41.42Ki 20.63Ki -50.18% ¹ all samples are equal │ old │ new │ │ allocs/op │ allocs/op vs base │ NatMul/10-16 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹ NatMul/100-16 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹ NatMul/1000-16 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹ NatMul/10000-16 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹ NatMul/100000-16 7.000 ± 14% 7.000 ± 14% ~ (p=0.668 n=15) geomean 1.476 1.476 +0.00% ¹ all samples are equal goos: linux goarch: amd64 pkg: math/big cpu: Intel(R) Xeon(R) Platinum 8481C CPU @ 2.70GHz │ old │ new │ │ sec/op │ sec/op vs base │ Div/20/10-88 15.84n ± 1% 13.12n ± 0% -17.17% (p=0.000 n=15) Div/40/20-88 15.88n ± 1% 13.12n ± 0% -17.38% (p=0.000 n=15) Div/100/50-88 26.42n ± 0% 25.47n ± 0% -3.60% (p=0.000 n=15) Div/200/100-88 132.4n ± 0% 114.9n ± 0% -13.22% (p=0.000 n=15) Div/400/200-88 150.1n ± 0% 135.6n ± 0% -9.66% (p=0.000 n=15) Div/1000/500-88 275.5n ± 0% 264.1n ± 0% -4.14% (p=0.000 n=15) Div/2000/1000-88 586.5n ± 0% 581.1n ± 0% -0.92% (p=0.000 n=15) Div/20000/10000-88 25.87µ ± 0% 25.72µ ± 0% -0.59% (p=0.000 n=15) Div/200000/100000-88 772.2µ ± 0% 779.0µ ± 0% +0.88% (p=0.000 n=15) Div/2000000/1000000-88 33.36m ± 0% 33.63m ± 0% +0.80% (p=0.000 n=15) Div/20000000/10000000-88 1.307 ± 0% 1.320 ± 0% +1.03% (p=0.000 n=15) NatMul/10-88 140.4n ± 0% 148.8n ± 4% +5.98% (p=0.000 n=15) NatMul/100-88 4.663µ ± 1% 4.388µ ± 1% -5.90% (p=0.000 n=15) NatMul/1000-88 207.7µ ± 0% 205.8µ ± 0% -0.89% (p=0.000 n=15) NatMul/10000-88 8.456m ± 0% 8.468m ± 0% +0.14% (p=0.021 n=15) NatMul/100000-88 295.1m ± 0% 297.9m ± 0% +0.94% (p=0.000 n=15) geomean 14.96µ 14.33µ -4.23% │ old │ new │ │ B/op │ B/op vs base │ NatMul/10-88 192.0 ± 0% 192.0 ± 0% ~ (p=1.000 n=15) ¹ NatMul/100-88 4.750Ki ± 0% 1.758Ki ± 0% -62.99% (p=0.000 n=15) NatMul/1000-88 48.44Ki ± 0% 16.08Ki ± 0% -66.80% (p=0.000 n=15) NatMul/10000-88 489.7Ki ± 1% 166.1Ki ± 3% -66.08% (p=0.000 n=15) NatMul/100000-88 5.546Mi ± 0% 3.819Mi ± 60% -31.15% (p=0.000 n=15) geomean 41.29Ki 20.30Ki -50.85% ¹ all samples are equal │ old │ new │ │ allocs/op │ allocs/op vs base │ NatMul/10-88 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹ NatMul/100-88 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹ NatMul/1000-88 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹ NatMul/10000-88 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹ NatMul/100000-88 5.000 ± 20% 6.000 ± 67% ~ (p=0.672 n=15) geomean 1.380 1.431 +3.71% ¹ all samples are equal goos: linux goarch: arm64 pkg: math/big │ old │ new │ │ sec/op │ sec/op vs base │ Div/20/10-16 15.85n ± 0% 15.23n ± 0% -3.91% (p=0.000 n=15) Div/40/20-16 15.88n ± 0% 15.22n ± 0% -4.16% (p=0.000 n=15) Div/100/50-16 29.69n ± 0% 26.39n ± 0% -11.11% (p=0.000 n=15) Div/200/100-16 149.2n ± 0% 123.3n ± 0% -17.36% (p=0.000 n=15) Div/400/200-16 160.3n ± 0% 139.2n ± 0% -13.16% (p=0.000 n=15) Div/1000/500-16 271.0n ± 0% 256.1n ± 0% -5.50% (p=0.000 n=15) Div/2000/1000-16 545.3n ± 0% 527.0n ± 0% -3.36% (p=0.000 n=15) Div/20000/10000-16 22.60µ ± 0% 22.20µ ± 0% -1.77% (p=0.000 n=15) Div/200000/100000-16 889.0µ ± 0% 892.2µ ± 0% +0.35% (p=0.000 n=15) Div/2000000/1000000-16 38.01m ± 0% 38.12m ± 0% +0.30% (p=0.000 n=15) Div/20000000/10000000-16 1.437 ± 0% 1.444 ± 0% +0.50% (p=0.000 n=15) NatMul/10-16 166.4n ± 2% 169.5n ± 1% +1.86% (p=0.000 n=15) NatMul/100-16 5.733µ ± 1% 5.570µ ± 1% -2.84% (p=0.000 n=15) NatMul/1000-16 232.6µ ± 1% 229.8µ ± 0% -1.22% (p=0.000 n=15) NatMul/10000-16 9.039m ± 1% 8.969m ± 0% -0.77% (p=0.000 n=15) NatMul/100000-16 367.0m ± 0% 368.8m ± 0% +0.48% (p=0.000 n=15) geomean 16.15µ 15.50µ -4.01% │ old │ new │ │ B/op │ B/op vs base │ NatMul/10-16 192.0 ± 0% 192.0 ± 0% ~ (p=1.000 n=15) ¹ NatMul/100-16 4.750Ki ± 0% 1.751Ki ± 0% -63.14% (p=0.000 n=15) NatMul/1000-16 48.33Ki ± 0% 16.02Ki ± 0% -66.85% (p=0.000 n=15) NatMul/10000-16 536.5Ki ± 1% 165.7Ki ± 3% -69.12% (p=0.000 n=15) NatMul/100000-16 6.078Mi ± 6% 4.197Mi ± 0% -30.94% (p=0.000 n=15) geomean 42.81Ki 20.64Ki -51.78% ¹ all samples are equal │ old │ new │ │ allocs/op │ allocs/op vs base │ NatMul/10-16 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹ NatMul/100-16 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹ NatMul/1000-16 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹ NatMul/10000-16 2.000 ± 50% 1.000 ± 0% -50.00% (p=0.001 n=15) NatMul/100000-16 9.000 ± 11% 8.000 ± 12% -11.11% (p=0.001 n=15) geomean 1.783 1.516 -14.97% ¹ all samples are equal goos: darwin goarch: arm64 pkg: math/big cpu: Apple M3 Pro │ old │ new │ │ sec/op │ sec/op vs base │ Div/20/10-12 9.850n ± 1% 9.405n ± 1% -4.52% (p=0.000 n=15) Div/40/20-12 9.858n ± 0% 9.403n ± 1% -4.62% (p=0.000 n=15) Div/100/50-12 16.40n ± 1% 14.81n ± 0% -9.70% (p=0.000 n=15) Div/200/100-12 88.48n ± 2% 80.88n ± 0% -8.59% (p=0.000 n=15) Div/400/200-12 107.90n ± 1% 99.28n ± 1% -7.99% (p=0.000 n=15) Div/1000/500-12 188.8n ± 1% 178.6n ± 1% -5.40% (p=0.000 n=15) Div/2000/1000-12 399.9n ± 0% 389.1n ± 0% -2.70% (p=0.000 n=15) Div/20000/10000-12 13.94µ ± 2% 13.81µ ± 1% ~ (p=0.574 n=15) Div/200000/100000-12 523.8µ ± 0% 521.7µ ± 0% -0.40% (p=0.000 n=15) Div/2000000/1000000-12 21.46m ± 0% 21.48m ± 0% ~ (p=0.067 n=15) Div/20000000/10000000-12 812.5m ± 0% 812.9m ± 0% ~ (p=0.061 n=15) NatMul/10-12 77.14n ± 0% 78.35n ± 1% +1.57% (p=0.000 n=15) NatMul/100-12 2.999µ ± 0% 2.871µ ± 1% -4.27% (p=0.000 n=15) NatMul/1000-12 126.2µ ± 0% 126.8µ ± 0% +0.51% (p=0.011 n=15) NatMul/10000-12 5.099m ± 0% 5.125m ± 0% +0.51% (p=0.000 n=15) NatMul/100000-12 206.7m ± 0% 208.4m ± 0% +0.80% (p=0.000 n=15) geomean 9.512µ 9.236µ -2.91% │ old │ new │ │ B/op │ B/op vs base │ NatMul/10-12 192.0 ± 0% 192.0 ± 0% ~ (p=1.000 n=15) ¹ NatMul/100-12 4.750Ki ± 0% 1.750Ki ± 0% -63.16% (p=0.000 n=15) NatMul/1000-12 48.13Ki ± 0% 16.01Ki ± 0% -66.73% (p=0.000 n=15) NatMul/10000-12 483.5Ki ± 1% 163.2Ki ± 2% -66.24% (p=0.000 n=15) NatMul/100000-12 5.480Mi ± 4% 1.532Mi ± 104% -72.05% (p=0.000 n=15) geomean 41.03Ki 16.82Ki -59.01% ¹ all samples are equal │ old │ new │ │ allocs/op │ allocs/op vs base │ NatMul/10-12 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹ NatMul/100-12 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹ NatMul/1000-12 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹ NatMul/10000-12 1.000 ± 0% 1.000 ± 0% ~ (p=1.000 n=15) ¹ NatMul/100000-12 5.000 ± 0% 1.000 ± 400% -80.00% (p=0.007 n=15) geomean 1.380 1.000 -27.52% ¹ all samples are equal Change-Id: I7efa6fe37971ed26ae120a32250fcb47ece0a011 Reviewed-on: https://go-review.googlesource.com/c/go/+/650638 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Auto-Submit: Russ Cox <rsc@golang.org> Reviewed-by: Ian Lance Taylor <iant@google.com> Reviewed-by: Alan Donovan <adonovan@google.com>

Montgomery multiplication can be used for Exp mod even m by splitting it into two steps - Exp mod an odd number and Exp mod a power of two - and then combining the two results using the Chinese Remainder Theorem. For more details, see Ç. K. Koç, “Montgomery Reduction with Even Modulus”, IEE Proceedings: Computers and Digital Techniques, 141(5) 314-316, September 1994. http://www.people.vcu.edu/~jwang3/CMSC691/j34monex.pdf Thanks to Guido Vranken for suggesting that we use a faster algorithm. name old time/op new time/op delta ExpMont/Odd-16 240µs ± 2% 239µs ± 2% ~ (p=0.853 n=10+10) ExpMont/Even1-16 757µs ± 3% 249µs ± 6% -67.17% (p=0.000 n=10+10) ExpMont/Even2-16 755µs ± 1% 244µs ± 4% -67.64% (p=0.000 n=8+10) ExpMont/Even3-16 771µs ± 3% 240µs ± 2% -68.90% (p=0.000 n=10+10) ExpMont/Even4-16 775µs ± 3% 241µs ± 2% -68.91% (p=0.000 n=10+10) ExpMont/Even8-16 779µs ± 2% 241µs ± 3% -69.06% (p=0.000 n=9+10) ExpMont/Even32-16 778µs ± 3% 240µs ± 4% -69.11% (p=0.000 n=9+9) ExpMont/Even64-16 774µs ± 6% 186µs ± 2% -76.00% (p=0.000 n=10+10) ExpMont/Even96-16 776µs ± 4% 186µs ± 6% -76.09% (p=0.000 n=9+10) ExpMont/Even128-16 764µs ± 2% 143µs ± 3% -81.24% (p=0.000 n=10+10) ExpMont/Even255-16 761µs ± 3% 109µs ± 2% -85.73% (p=0.000 n=10+10) ExpMont/SmallEven1-16 45.6µs ± 1% 36.3µs ± 3% -20.49% (p=0.000 n=9+10) ExpMont/SmallEven2-16 44.3µs ± 2% 37.5µs ± 2% -15.26% (p=0.000 n=9+10) ExpMont/SmallEven3-16 44.1µs ± 5% 37.3µs ± 3% -15.32% (p=0.000 n=9+10) ExpMont/SmallEven4-16 47.1µs ± 6% 38.0µs ± 5% -19.40% (p=0.000 n=10+9) name old alloc/op new alloc/op delta ExpMont/Odd-16 2.53kB ± 0% 2.53kB ± 0% ~ (p=0.137 n=8+10) ExpMont/Even1-16 2.57kB ± 0% 3.31kB ± 0% +28.90% (p=0.000 n=8+10) ExpMont/Even2-16 2.57kB ± 0% 3.37kB ± 0% +31.09% (p=0.000 n=9+10) ExpMont/Even3-16 2.57kB ± 0% 3.37kB ± 0% +31.08% (p=0.000 n=8+8) ExpMont/Even4-16 2.57kB ± 0% 3.37kB ± 0% +31.09% (p=0.000 n=9+10) ExpMont/Even8-16 2.57kB ± 0% 3.37kB ± 0% +31.09% (p=0.000 n=9+10) ExpMont/Even32-16 2.57kB ± 0% 3.37kB ± 0% +31.14% (p=0.000 n=10+10) ExpMont/Even64-16 2.57kB ± 0% 3.16kB ± 0% +22.99% (p=0.000 n=9+9) ExpMont/Even96-16 2.57kB ± 0% 3.44kB ± 0% +33.90% (p=0.000 n=10+8) ExpMont/Even128-16 2.57kB ± 0% 2.88kB ± 0% +12.10% (p=0.000 n=10+10) ExpMont/Even255-16 2.57kB ± 0% 2.54kB ± 0% -1.30% (p=0.000 n=9+10) ExpMont/SmallEven1-16 872B ± 0% 1232B ± 0% +41.28% (p=0.000 n=10+10) ExpMont/SmallEven2-16 872B ± 0% 1233B ± 0% +41.40% (p=0.000 n=10+9) ExpMont/SmallEven3-16 872B ± 0% 1289B ± 0% +47.82% (p=0.000 n=10+10) ExpMont/SmallEven4-16 872B ± 0% 1241B ± 0% +42.32% (p=0.000 n=10+10) name old allocs/op new allocs/op delta ExpMont/Odd-16 21.0 ± 0% 21.0 ± 0% ~ (all equal) ExpMont/Even1-16 24.0 ± 0% 38.0 ± 0% +58.33% (p=0.000 n=10+10) ExpMont/Even2-16 24.0 ± 0% 40.0 ± 0% +66.67% (p=0.000 n=10+10) ExpMont/Even3-16 24.0 ± 0% 40.0 ± 0% +66.67% (p=0.000 n=10+10) ExpMont/Even4-16 24.0 ± 0% 40.0 ± 0% +66.67% (p=0.000 n=10+10) ExpMont/Even8-16 24.0 ± 0% 40.0 ± 0% +66.67% (p=0.000 n=10+10) ExpMont/Even32-16 24.0 ± 0% 40.0 ± 0% +66.67% (p=0.000 n=10+10) ExpMont/Even64-16 24.0 ± 0% 39.0 ± 0% +62.50% (p=0.000 n=10+10) ExpMont/Even96-16 24.0 ± 0% 42.0 ± 0% +75.00% (p=0.000 n=10+10) ExpMont/Even128-16 24.0 ± 0% 40.0 ± 0% +66.67% (p=0.000 n=10+10) ExpMont/Even255-16 24.0 ± 0% 38.0 ± 0% +58.33% (p=0.000 n=10+10) ExpMont/SmallEven1-16 16.0 ± 0% 35.0 ± 0% +118.75% (p=0.000 n=10+10) ExpMont/SmallEven2-16 16.0 ± 0% 35.0 ± 0% +118.75% (p=0.000 n=10+10) ExpMont/SmallEven3-16 16.0 ± 0% 37.0 ± 0% +131.25% (p=0.000 n=10+10) ExpMont/SmallEven4-16 16.0 ± 0% 36.0 ± 0% +125.00% (p=0.000 n=10+10) Change-Id: Ib7f70290f8f087b78805ec3120baf17dd7737ac3 Reviewed-on: https://go-review.googlesource.com/c/go/+/420897 Run-TryBot: Russ Cox <rsc@golang.org> Reviewed-by: Roland Shoemaker <roland@golang.org> Auto-Submit: Russ Cox <rsc@golang.org> TryBot-Result: Gopher Robot <gobot@golang.org>

Now gc can generate the same assembly code. Change-Id: Iac503003e14045d63e2def66408c13cee516aa37 Reviewed-on: https://go-review.googlesource.com/c/go/+/402575 TryBot-Result: Gopher Robot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org> Reviewed-by: Dmitri Shuralyov <dmitshur@google.com> Run-TryBot: Wayne Zuo <wdvxdr@golangcn.org> Auto-Submit: Keith Randall <khr@golang.org> Reviewed-by: Keith Randall <khr@google.com>

For whatever reason (perhaps some tool does this), a handful of comments, including some doc comments, have TODOs formatted like: // TODO(name): Text here and // more text aligned // under first text. In doc comments the second line turns into a <pre> block, which is undesirable in this context. Rewrite those to unindent, like this instead: // TODO(name): Text here and // more text aligned // at left column. For #51082. Change-Id: Ibf5145659a61ebf9496f016752a709a7656d2d4b Reviewed-on: https://go-review.googlesource.com/c/go/+/384258 Trust: Russ Cox <rsc@golang.org> Run-TryBot: Russ Cox <rsc@golang.org> Reviewed-by: Ian Lance Taylor <iant@golang.org> TryBot-Result: Gopher Robot <gobot@golang.org>

Change-Id: Id67d6ac856bd9271de99c3381bde910aa0c166e0 Reviewed-on: https://go-review.googlesource.com/c/go/+/296011 Trust: Josh Bleecher Snyder <josharian@gmail.com> Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com> TryBot-Result: Go Bot <gobot@golang.org> Reviewed-by: Robert Griesemer <gri@golang.org>

The s390x assembly for shlVU does a forward copy when the shift amount s is 0. This causes corruption of the result z when z is aliased to the input x. This fix removes the s390x assembly for both shlVU and shrVU so the pure go implementations will be used. Test cases have been added to the existing TestShiftOverlap test to cover shift values of 0, 1 and (_W - 1). Fixes #42838 Change-Id: I75ca0e98f3acfaa6366a26355dcd9dd82499a48b Reviewed-on: https://go-review.googlesource.com/c/go/+/274442 Run-TryBot: Robert Griesemer <gri@golang.org> TryBot-Result: Go Bot <gobot@golang.org> Reviewed-by: Robert Griesemer <gri@golang.org> Reviewed-by: Matthew Dempsky <mdempsky@google.com> Trust: Robert Griesemer <gri@golang.org>

Division is much slower than multiplication. And the method of using multiplication by multiplying reciprocal and replacing division with it can increase the speed of divWVW algorithm by three times,and at the same time increase the speed of nats division. The benchmark test on arm64 is as follows: name old time/op new time/op delta DivWVW/1-4 13.1ns ± 4% 13.3ns ± 4% ~ (p=0.444 n=5+5) DivWVW/2-4 48.6ns ± 1% 51.2ns ± 2% +5.39% (p=0.008 n=5+5) DivWVW/3-4 82.0ns ± 1% 69.7ns ± 1% -15.03% (p=0.008 n=5+5) DivWVW/4-4 116ns ± 1% 71ns ± 2% -38.88% (p=0.008 n=5+5) DivWVW/5-4 152ns ± 1% 84ns ± 4% -44.70% (p=0.008 n=5+5) DivWVW/10-4 319ns ± 1% 155ns ± 4% -51.50% (p=0.008 n=5+5) DivWVW/100-4 3.44µs ± 3% 1.30µs ± 8% -62.30% (p=0.008 n=5+5) DivWVW/1000-4 33.8µs ± 0% 10.9µs ± 1% -67.74% (p=0.008 n=5+5) DivWVW/10000-4 343µs ± 4% 111µs ± 5% -67.63% (p=0.008 n=5+5) DivWVW/100000-4 3.35ms ± 1% 1.25ms ± 3% -62.79% (p=0.008 n=5+5) QuoRem-4 3.08µs ± 2% 2.21µs ± 4% -28.40% (p=0.008 n=5+5) ModSqrt225_Tonelli-4 444µs ± 2% 457µs ± 3% ~ (p=0.095 n=5+5) ModSqrt225_3Mod4-4 136µs ± 1% 138µs ± 3% ~ (p=0.151 n=5+5) ModSqrt231_Tonelli-4 473µs ± 3% 483µs ± 4% ~ (p=0.548 n=5+5) ModSqrt231_5Mod8-4 164µs ± 9% 169µs ±12% ~ (p=0.421 n=5+5) Sqrt-4 36.8µs ± 1% 28.6µs ± 0% -22.17% (p=0.016 n=5+4) Div/20/10-4 50.0ns ± 3% 51.3ns ± 6% ~ (p=0.238 n=5+5) Div/40/20-4 49.8ns ± 2% 51.3ns ± 6% ~ (p=0.222 n=5+5) Div/100/50-4 85.8ns ± 4% 86.5ns ± 5% ~ (p=0.246 n=5+5) Div/200/100-4 335ns ± 3% 296ns ± 2% -11.60% (p=0.008 n=5+5) Div/400/200-4 442ns ± 2% 359ns ± 5% -18.81% (p=0.008 n=5+5) Div/1000/500-4 858ns ± 3% 643ns ± 6% -25.06% (p=0.008 n=5+5) Div/2000/1000-4 1.70µs ± 3% 1.28µs ± 4% -24.80% (p=0.008 n=5+5) Div/20000/10000-4 45.0µs ± 5% 41.8µs ± 4% -7.17% (p=0.016 n=5+5) Div/200000/100000-4 1.51ms ± 7% 1.43ms ± 3% -5.42% (p=0.016 n=5+5) Div/2000000/1000000-4 57.6ms ± 4% 57.5ms ± 3% ~ (p=1.000 n=5+5) Div/20000000/10000000-4 2.08s ± 3% 2.04s ± 1% ~ (p=0.095 n=5+5) name old speed new speed delta DivWVW/1-4 4.87GB/s ± 4% 4.80GB/s ± 4% ~ (p=0.310 n=5+5) DivWVW/2-4 2.63GB/s ± 1% 2.50GB/s ± 2% -5.07% (p=0.008 n=5+5) DivWVW/3-4 2.34GB/s ± 1% 2.76GB/s ± 1% +17.70% (p=0.008 n=5+5) DivWVW/4-4 2.21GB/s ± 1% 3.61GB/s ± 2% +63.42% (p=0.008 n=5+5) DivWVW/5-4 2.10GB/s ± 2% 3.81GB/s ± 4% +80.89% (p=0.008 n=5+5) DivWVW/10-4 2.01GB/s ± 0% 4.13GB/s ± 4% +105.91% (p=0.008 n=5+5) DivWVW/100-4 1.86GB/s ± 2% 4.95GB/s ± 7% +165.63% (p=0.008 n=5+5) DivWVW/1000-4 1.89GB/s ± 0% 5.86GB/s ± 1% +209.96% (p=0.008 n=5+5) DivWVW/10000-4 1.87GB/s ± 4% 5.76GB/s ± 5% +208.96% (p=0.008 n=5+5) DivWVW/100000-4 1.91GB/s ± 1% 5.14GB/s ± 3% +168.85% (p=0.008 n=5+5) Change-Id: I049f1196562b20800e6ef8a6493fd147f93ad830 Reviewed-on: https://go-review.googlesource.com/c/go/+/250417 Trust: Giovanni Bajo <rasky@develer.com> Trust: Keith Randall <khr@golang.org> Run-TryBot: Giovanni Bajo <rasky@develer.com> TryBot-Result: Go Bot <gobot@golang.org> Reviewed-by: Keith Randall <khr@golang.org>

Add an optimization for addVW and subVW over large-sized vectors, it switches from add/sub with carry to copy the rest of the vector when we are done with carries. Consistent performance improvement are observed on various arm64 machines. Add additional tests and benchmarks to increase the test coverage. TestFunVWExt: Testing with various types of input vector, using the result from go-version addVW/subVW as golden reference. BenchmarkAddVWext and BenchmarkSubVWext: Benchmarking using input vector having all 1s or all 0s, for evaluating the overhead of worst case. 1. Perf. comparison over randomly generated input vectors: Server 1: name old time/op new time/op delta AddVW/1 12.3ns ± 3% 12.0ns ± 0% -2.60% (p=0.001 n=10+8) AddVW/2 12.5ns ± 2% 12.3ns ± 0% -1.84% (p=0.001 n=10+8) AddVW/3 12.6ns ± 2% 12.3ns ± 0% -1.91% (p=0.009 n=10+10) AddVW/4 13.1ns ± 3% 12.7ns ± 0% -2.98% (p=0.006 n=10+8) AddVW/5 14.4ns ± 1% 13.9ns ± 0% -3.81% (p=0.000 n=10+10) AddVW/10 11.7ns ± 0% 11.7ns ± 0% ~ (all equal) AddVW/100 47.8ns ± 0% 29.9ns ± 2% -37.38% (p=0.000 n=10+9) AddVW/1000 446ns ± 0% 207ns ± 0% -53.59% (p=0.000 n=10+10) AddVW/10000 4.35µs ± 1% 2.92µs ± 0% -32.85% (p=0.000 n=10+10) AddVW/100000 43.6µs ± 0% 29.7µs ± 0% -31.92% (p=0.000 n=8+10) SubVW/1 12.6ns ± 0% 12.3ns ± 2% -2.22% (p=0.000 n=7+10) SubVW/2 12.7ns ± 0% 12.6ns ± 1% -0.39% (p=0.046 n=8+10) SubVW/3 12.7ns ± 1% 12.6ns ± 1% ~ (p=0.410 n=10+10) SubVW/4 13.3ns ± 3% 13.1ns ± 3% ~ (p=0.072 n=10+10) SubVW/5 14.2ns ± 0% 14.1ns ± 1% -0.63% (p=0.046 n=8+10) SubVW/10 11.7ns ± 0% 11.7ns ± 0% ~ (all equal) SubVW/100 47.8ns ± 0% 33.1ns ±19% -30.71% (p=0.000 n=10+10) SubVW/1000 446ns ± 0% 207ns ± 0% -53.59% (p=0.000 n=10+10) SubVW/10000 4.33µs ± 1% 2.92µs ± 0% -32.66% (p=0.000 n=10+6) SubVW/100000 43.4µs ± 0% 29.6µs ± 0% -31.90% (p=0.000 n=10+9) Server 2: name old time/op new time/op delta AddVW/1 5.49ns ± 0% 5.53ns ± 2% ~ (p=1.000 n=9+10) AddVW/2 5.96ns ± 2% 5.92ns ± 1% -0.69% (p=0.039 n=10+10) AddVW/3 6.72ns ± 0% 6.73ns ± 0% ~ (p=0.078 n=10+10) AddVW/4 7.07ns ± 0% 6.75ns ± 2% -4.55% (p=0.000 n=10+10) AddVW/5 8.14ns ± 0% 8.17ns ± 0% +0.46% (p=0.003 n=8+8) AddVW/10 10.0ns ± 0% 10.1ns ± 1% +0.70% (p=0.003 n=10+10) AddVW/100 43.0ns ± 0% 33.5ns ± 0% -22.09% (p=0.000 n=9+9) AddVW/1000 394ns ± 0% 278ns ± 0% -29.44% (p=0.000 n=10+10) AddVW/10000 4.18µs ± 0% 3.14µs ± 0% -24.81% (p=0.000 n=8+8) AddVW/100000 68.3µs ± 3% 62.1µs ± 5% -9.13% (p=0.000 n=10+10) SubVW/1 5.37ns ± 2% 5.42ns ± 1% ~ (p=0.990 n=10+10) SubVW/2 5.89ns ± 0% 5.92ns ± 1% +0.58% (p=0.000 n=8+10) SubVW/3 6.64ns ± 1% 6.82ns ± 3% +2.63% (p=0.000 n=9+10) SubVW/4 7.17ns ± 0% 6.69ns ± 2% -6.74% (p=0.000 n=10+9) SubVW/5 8.22ns ± 0% 8.18ns ± 0% -0.46% (p=0.001 n=8+9) SubVW/10 10.0ns ± 1% 10.1ns ± 1% ~ (p=0.341 n=10+10) SubVW/100 43.0ns ± 0% 33.5ns ± 0% -22.09% (p=0.000 n=7+10) SubVW/1000 394ns ± 0% 278ns ± 0% -29.44% (p=0.000 n=10+10) SubVW/10000 4.18µs ± 0% 3.15µs ± 0% -24.62% (p=0.000 n=9+9) SubVW/100000 67.7µs ± 4% 62.4µs ± 2% -7.92% (p=0.000 n=10+10) 2. Perf. comparison over input vectors of all 1s or all 0s Server 1: name old time/op new time/op delta AddVWext/1 12.6ns ± 0% 12.0ns ± 0% -4.76% (p=0.000 n=6+10) AddVWext/2 12.7ns ± 0% 12.4ns ± 1% -2.52% (p=0.000 n=10+10) AddVWext/3 12.7ns ± 0% 12.4ns ± 0% -2.36% (p=0.000 n=9+7) AddVWext/4 13.2ns ± 4% 12.7ns ± 0% -3.71% (p=0.001 n=10+9) AddVWext/5 14.6ns ± 0% 13.9ns ± 0% -4.79% (p=0.000 n=10+8) AddVWext/10 11.7ns ± 0% 11.7ns ± 0% ~ (all equal) AddVWext/100 47.8ns ± 0% 47.4ns ± 0% -0.84% (p=0.000 n=10+10) AddVWext/1000 446ns ± 0% 399ns ± 0% -10.54% (p=0.000 n=10+10) AddVWext/10000 4.34µs ± 1% 3.90µs ± 0% -10.12% (p=0.000 n=10+10) AddVWext/100000 43.9µs ± 1% 39.4µs ± 0% -10.18% (p=0.000 n=10+10) SubVWext/1 12.6ns ± 0% 12.3ns ± 2% -2.70% (p=0.000 n=7+10) SubVWext/2 12.6ns ± 1% 12.6ns ± 2% ~ (p=0.234 n=10+10) SubVWext/3 12.7ns ± 0% 12.6ns ± 2% -0.71% (p=0.033 n=10+10) SubVWext/4 13.4ns ± 0% 13.1ns ± 3% -2.01% (p=0.006 n=8+10) SubVWext/5 14.2ns ± 0% 14.1ns ± 1% -0.85% (p=0.003 n=10+10) SubVWext/10 11.7ns ± 0% 11.7ns ± 0% ~ (all equal) SubVWext/100 47.8ns ± 0% 47.4ns ± 0% -0.84% (p=0.000 n=10+10) SubVWext/1000 446ns ± 0% 399ns ± 0% -10.54% (p=0.000 n=10+10) SubVWext/10000 4.33µs ± 1% 3.90µs ± 0% -10.02% (p=0.000 n=10+10) SubVWext/100000 43.5µs ± 0% 39.5µs ± 1% -9.16% (p=0.000 n=7+10) Server 2: name old time/op new time/op delta AddVWext/1 5.48ns ± 0% 5.43ns ± 1% -0.97% (p=0.000 n=9+9) AddVWext/2 5.99ns ± 2% 5.93ns ± 1% ~ (p=0.054 n=10+10) AddVWext/3 6.74ns ± 0% 6.79ns ± 1% +0.80% (p=0.000 n=9+10) AddVWext/4 7.18ns ± 0% 7.21ns ± 1% +0.36% (p=0.034 n=9+10) AddVWext/5 7.93ns ± 3% 8.18ns ± 0% +3.18% (p=0.000 n=10+8) AddVWext/10 10.0ns ± 0% 10.1ns ± 1% +0.60% (p=0.011 n=10+10) AddVWext/100 43.0ns ± 0% 47.7ns ± 0% +10.93% (p=0.000 n=9+10) AddVWext/1000 394ns ± 0% 399ns ± 0% +1.27% (p=0.000 n=10+10) AddVWext/10000 4.18µs ± 0% 4.50µs ± 0% +7.73% (p=0.000 n=9+10) AddVWext/100000 67.6µs ± 2% 68.4µs ± 3% ~ (p=0.139 n=9+8) SubVWext/1 5.46ns ± 1% 5.43ns ± 0% -0.55% (p=0.002 n=9+9) SubVWext/2 5.89ns ± 0% 5.93ns ± 1% +0.68% (p=0.000 n=8+10) SubVWext/3 6.72ns ± 1% 6.79ns ± 1% +1.07% (p=0.000 n=10+10) SubVWext/4 6.98ns ± 1% 7.21ns ± 0% +3.25% (p=0.000 n=10+10) SubVWext/5 8.22ns ± 0% 7.99ns ± 3% -2.83% (p=0.000 n=8+10) SubVWext/10 10.0ns ± 1% 10.1ns ± 1% ~ (p=0.239 n=10+10) SubVWext/100 43.0ns ± 0% 47.7ns ± 0% +10.93% (p=0.000 n=8+10) SubVWext/1000 394ns ± 0% 399ns ± 0% +1.27% (p=0.000 n=10+10) SubVWext/10000 4.18µs ± 0% 4.51µs ± 0% +7.86% (p=0.000 n=8+8) SubVWext/100000 68.3µs ± 2% 68.0µs ± 3% ~ (p=0.515 n=10+8) Change-Id: I134a5194b8a2deaaebbaa2b771baf72846971d58 Reviewed-on: https://go-review.googlesource.com/c/go/+/229739 Reviewed-by: Cherry Zhang <cherryyz@google.com> Reviewed-by: Robert Griesemer <gri@golang.org> Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org>

Don't overwrite incoming test data. The change uses copy instead of assigning statement to avoid this. Change-Id: Ib907101822d811de5c45145cb9d7961907e212c3 Reviewed-on: https://go-review.googlesource.com/c/go/+/250137 Run-TryBot: Emmanuel Odeke <emm.odeke@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Robert Griesemer <gri@golang.org>

For the case where the addresses of parameter z and x of the function shlVU overlap and the address of z is greater than x, x (input value) can be polluted during the calculation when the high words of x are overlapped with the low words of z (output value). Fixes #31084 Change-Id: I9bb0266a1d7856b8faa9a9b1975d6f57dece0479 Reviewed-on: https://go-review.googlesource.com/c/go/+/169780 Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com>

Unroll the cycle 4 times to reduce load overhead. Benchmarks: name old time/op new time/op delta MulAddVWW/1-8 15.9ns ± 0% 11.9ns ± 0% -24.92% (p=0.000 n=8+8) MulAddVWW/2-8 16.1ns ± 0% 13.9ns ± 1% -13.82% (p=0.000 n=8+8) MulAddVWW/3-8 18.9ns ± 0% 17.3ns ± 0% -8.47% (p=0.000 n=8+8) MulAddVWW/4-8 21.7ns ± 0% 19.5ns ± 0% -10.14% (p=0.000 n=8+8) MulAddVWW/5-8 25.1ns ± 0% 22.5ns ± 0% -10.27% (p=0.000 n=8+8) MulAddVWW/10-8 41.6ns ± 0% 40.0ns ± 0% -3.79% (p=0.000 n=8+8) MulAddVWW/100-8 368ns ± 0% 363ns ± 0% -1.36% (p=0.000 n=8+8) MulAddVWW/1000-8 3.52µs ± 0% 3.52µs ± 0% -0.14% (p=0.000 n=8+8) MulAddVWW/10000-8 35.1µs ± 0% 35.1µs ± 0% -0.01% (p=0.000 n=7+6) MulAddVWW/100000-8 351µs ± 0% 351µs ± 0% +0.15% (p=0.038 n=8+8) Change-Id: I052a4db286ac6e4f3293289c7e9a82027da0405e Reviewed-on: https://go-review.googlesource.com/c/go/+/155780 Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Cherry Zhang <cherryyz@google.com>

While we're here, delete addWW_g and subWW_g, per the TODO. They are now obsolete. Benchmarks on amd64 with -tags=math_big_pure_go. name old time/op new time/op delta AddVV/1-8 5.24ns ± 2% 5.12ns ± 1% -2.11% (p=0.000 n=82+87) AddVV/2-8 6.44ns ± 1% 6.33ns ± 2% -1.82% (p=0.000 n=77+82) AddVV/3-8 7.89ns ± 8% 6.97ns ± 4% -11.71% (p=0.000 n=100+96) AddVV/4-8 8.60ns ± 0% 7.72ns ± 4% -10.24% (p=0.000 n=90+96) AddVV/5-8 10.3ns ± 4% 8.5ns ± 1% -17.02% (p=0.000 n=96+91) AddVV/10-8 16.2ns ± 5% 12.8ns ± 1% -21.11% (p=0.000 n=97+86) AddVV/100-8 148ns ± 1% 117ns ± 5% -21.07% (p=0.000 n=66+98) AddVV/1000-8 1.41µs ± 4% 1.13µs ± 3% -19.90% (p=0.000 n=97+97) AddVV/10000-8 14.2µs ± 5% 11.2µs ± 1% -20.82% (p=0.000 n=99+84) AddVV/100000-8 142µs ± 4% 113µs ± 4% -20.40% (p=0.000 n=91+92) SubVV/1-8 5.29ns ± 1% 5.11ns ± 0% -3.30% (p=0.000 n=87+88) SubVV/2-8 6.36ns ± 4% 6.33ns ± 2% -0.56% (p=0.002 n=98+73) SubVV/3-8 7.58ns ± 5% 6.98ns ± 4% -8.01% (p=0.000 n=97+91) SubVV/4-8 8.61ns ± 3% 7.98ns ± 2% -7.31% (p=0.000 n=95+83) SubVV/5-8 10.6ns ± 2% 8.5ns ± 1% -19.56% (p=0.000 n=79+89) SubVV/10-8 16.3ns ± 4% 12.7ns ± 1% -21.97% (p=0.000 n=98+82) SubVV/100-8 124ns ± 1% 118ns ± 1% -4.83% (p=0.000 n=85+81) SubVV/1000-8 1.14µs ± 5% 1.12µs ± 2% -1.17% (p=0.000 n=97+81) SubVV/10000-8 11.6µs ±10% 11.2µs ± 1% -3.39% (p=0.000 n=100+84) SubVV/100000-8 114µs ± 6% 114µs ± 5% ~ (p=0.396 n=83+94) AddVW/1-8 4.04ns ± 4% 4.34ns ± 4% +7.57% (p=0.000 n=96+98) AddVW/2-8 4.34ns ± 5% 4.40ns ± 5% +1.40% (p=0.000 n=99+98) AddVW/3-8 5.43ns ± 0% 5.54ns ± 2% +1.97% (p=0.000 n=85+94) AddVW/4-8 6.23ns ± 1% 6.18ns ± 2% -0.66% (p=0.000 n=77+78) AddVW/5-8 6.78ns ± 2% 6.90ns ± 4% +1.77% (p=0.000 n=80+99) AddVW/10-8 10.5ns ± 4% 9.9ns ± 1% -5.77% (p=0.000 n=97+69) AddVW/100-8 114ns ± 3% 91ns ± 0% -20.38% (p=0.000 n=98+77) AddVW/1000-8 1.12µs ± 1% 0.87µs ± 1% -22.80% (p=0.000 n=82+68) AddVW/10000-8 11.2µs ± 2% 8.5µs ± 5% -23.85% (p=0.000 n=85+100) AddVW/100000-8 112µs ± 2% 85µs ± 5% -24.22% (p=0.000 n=71+96) SubVW/1-8 4.09ns ± 2% 4.18ns ± 4% +2.32% (p=0.000 n=78+96) SubVW/2-8 4.59ns ± 5% 4.52ns ± 7% -1.54% (p=0.000 n=98+94) SubVW/3-8 5.41ns ±10% 5.55ns ± 1% +2.48% (p=0.000 n=100+89) SubVW/4-8 6.51ns ± 2% 6.19ns ± 0% -4.85% (p=0.000 n=97+81) SubVW/5-8 7.25ns ± 3% 6.90ns ± 4% -4.93% (p=0.000 n=97+96) SubVW/10-8 10.6ns ± 4% 9.8ns ± 2% -7.32% (p=0.000 n=95+96) SubVW/100-8 90.4ns ± 0% 90.8ns ± 0% +0.43% (p=0.000 n=83+78) SubVW/1000-8 853ns ± 4% 857ns ± 2% +0.42% (p=0.000 n=100+98) SubVW/10000-8 8.52µs ± 4% 8.53µs ± 2% ~ (p=0.061 n=99+97) SubVW/100000-8 84.8µs ± 5% 84.2µs ± 2% -0.78% (p=0.000 n=99+93) AddMulVVW/1-8 8.73ns ± 0% 5.33ns ± 3% -38.91% (p=0.000 n=91+96) AddMulVVW/2-8 14.8ns ± 3% 6.5ns ± 2% -56.33% (p=0.000 n=100+79) AddMulVVW/3-8 18.6ns ± 2% 7.8ns ± 5% -57.84% (p=0.000 n=89+96) AddMulVVW/4-8 24.0ns ± 2% 9.8ns ± 0% -59.09% (p=0.000 n=95+67) AddMulVVW/5-8 29.0ns ± 2% 11.5ns ± 5% -60.44% (p=0.000 n=90+97) AddMulVVW/10-8 54.1ns ± 0% 18.8ns ± 1% -65.37% (p=0.000 n=82+84) AddMulVVW/100-8 508ns ± 2% 165ns ± 4% -67.62% (p=0.000 n=72+98) AddMulVVW/1000-8 4.96µs ± 3% 1.55µs ± 1% -68.86% (p=0.000 n=99+91) AddMulVVW/10000-8 50.0µs ± 4% 15.5µs ± 4% -68.95% (p=0.000 n=97+97) AddMulVVW/100000-8 491µs ± 1% 156µs ± 8% -68.22% (p=0.000 n=79+95) Change-Id: I4c6ae0b4065f371aea8103f6a85d9e9274bf01d0 Reviewed-on: https://go-review.googlesource.com/c/go/+/164965 Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Robert Griesemer <gri@golang.org>

The biggest hot spot of the existing implementation is "load" operations, which lead to poor performance. By unrolling the cycle 4 times and 2 times, and using "LDP", "STP" instructions, this CL can reduce the "load" cost and improve performance. Benchmarks: name old time/op new time/op delta AddVV/1-8 21.5ns ± 0% 21.5ns ± 0% ~ (all equal) AddVV/2-8 13.5ns ± 0% 13.5ns ± 0% ~ (all equal) AddVV/3-8 15.5ns ± 0% 15.5ns ± 0% ~ (all equal) AddVV/4-8 17.5ns ± 0% 17.5ns ± 0% ~ (all equal) AddVV/5-8 19.5ns ± 0% 19.5ns ± 0% ~ (all equal) AddVV/10-8 29.5ns ± 0% 29.5ns ± 0% ~ (all equal) AddVV/100-8 217ns ± 0% 217ns ± 0% ~ (all equal) AddVV/1000-8 2.02µs ± 0% 2.02µs ± 0% ~ (all equal) AddVV/10000-8 20.3µs ± 0% 20.3µs ± 0% ~ (p=0.603 n=5+5) AddVV/100000-8 223µs ± 7% 228µs ± 8% ~ (p=0.548 n=5+5) AddVW/1-8 9.32ns ± 0% 9.26ns ± 0% -0.64% (p=0.008 n=5+5) AddVW/2-8 19.8ns ± 3% 10.5ns ± 0% -46.92% (p=0.008 n=5+5) AddVW/3-8 11.5ns ± 0% 11.0ns ± 0% -4.35% (p=0.008 n=5+5) AddVW/4-8 13.0ns ± 0% 12.0ns ± 0% -7.69% (p=0.008 n=5+5) AddVW/5-8 14.5ns ± 0% 12.5ns ± 0% -13.79% (p=0.008 n=5+5) AddVW/10-8 22.0ns ± 0% 15.5ns ± 0% -29.55% (p=0.008 n=5+5) AddVW/100-8 167ns ± 0% 81ns ± 0% -51.44% (p=0.008 n=5+5) AddVW/1000-8 1.52µs ± 0% 0.64µs ± 0% -57.58% (p=0.008 n=5+5) AddVW/10000-8 15.1µs ± 0% 7.2µs ± 0% -52.55% (p=0.008 n=5+5) AddVW/100000-8 150µs ± 0% 71µs ± 0% -52.95% (p=0.008 n=5+5) SubVW/1-8 9.32ns ± 0% 9.26ns ± 0% -0.64% (p=0.008 n=5+5) SubVW/2-8 19.7ns ± 2% 10.5ns ± 0% -46.70% (p=0.008 n=5+5) SubVW/3-8 11.5ns ± 0% 11.0ns ± 0% -4.35% (p=0.008 n=5+5) SubVW/4-8 13.0ns ± 0% 12.0ns ± 0% -7.69% (p=0.008 n=5+5) SubVW/5-8 14.5ns ± 0% 12.5ns ± 0% -13.79% (p=0.008 n=5+5) SubVW/10-8 22.0ns ± 0% 15.5ns ± 0% -29.55% (p=0.008 n=5+5) SubVW/100-8 167ns ± 0% 81ns ± 0% -51.44% (p=0.008 n=5+5) SubVW/1000-8 1.52µs ± 0% 0.64µs ± 0% -57.58% (p=0.008 n=5+5) SubVW/10000-8 15.1µs ± 0% 7.2µs ± 0% -52.49% (p=0.008 n=5+5) SubVW/100000-8 150µs ± 0% 71µs ± 0% -52.91% (p=0.008 n=5+5) AddMulVVW/1-8 32.4ns ± 1% 32.6ns ± 1% ~ (p=0.119 n=5+5) AddMulVVW/2-8 57.0ns ± 0% 57.0ns ± 0% ~ (p=0.643 n=5+5) AddMulVVW/3-8 90.8ns ± 0% 90.7ns ± 0% ~ (p=0.524 n=5+5) AddMulVVW/4-8 118ns ± 0% 118ns ± 1% ~ (p=1.000 n=4+5) AddMulVVW/5-8 144ns ± 1% 144ns ± 0% ~ (p=0.794 n=5+4) AddMulVVW/10-8 294ns ± 1% 296ns ± 0% +0.48% (p=0.040 n=5+5) AddMulVVW/100-8 2.73µs ± 0% 2.73µs ± 0% ~ (p=0.278 n=5+5) AddMulVVW/1000-8 26.0µs ± 0% 26.5µs ± 0% +2.14% (p=0.008 n=5+5) AddMulVVW/10000-8 297µs ± 0% 297µs ± 0% +0.24% (p=0.008 n=5+5) AddMulVVW/100000-8 3.15ms ± 1% 3.13ms ± 0% ~ (p=0.690 n=5+5) DecimalConversion-8 311µs ± 2% 309µs ± 2% ~ (p=0.310 n=5+5) FloatString/100-8 2.55µs ± 2% 2.54µs ± 2% ~ (p=1.000 n=5+5) FloatString/1000-8 58.1µs ± 0% 58.1µs ± 0% ~ (p=0.151 n=5+5) FloatString/10000-8 4.59ms ± 0% 4.59ms ± 0% ~ (p=0.151 n=5+5) FloatString/100000-8 446ms ± 0% 446ms ± 0% +0.01% (p=0.016 n=5+5) FloatAdd/10-8 183ns ± 0% 183ns ± 0% ~ (p=0.333 n=4+5) FloatAdd/100-8 187ns ± 1% 192ns ± 2% ~ (p=0.056 n=5+5) FloatAdd/1000-8 369ns ± 0% 371ns ± 0% +0.54% (p=0.016 n=4+5) FloatAdd/10000-8 1.88µs ± 0% 1.88µs ± 0% -0.14% (p=0.000 n=4+5) FloatAdd/100000-8 17.2µs ± 0% 17.1µs ± 0% -0.37% (p=0.008 n=5+5) FloatSub/10-8 147ns ± 0% 147ns ± 0% ~ (all equal) FloatSub/100-8 145ns ± 0% 146ns ± 0% ~ (p=0.238 n=5+4) FloatSub/1000-8 241ns ± 0% 241ns ± 0% ~ (p=0.333 n=5+4) FloatSub/10000-8 1.06µs ± 0% 1.06µs ± 0% ~ (p=0.444 n=5+5) FloatSub/100000-8 9.50µs ± 0% 9.48µs ± 0% -0.14% (p=0.008 n=5+5) ParseFloatSmallExp-8 28.4µs ± 2% 28.5µs ± 1% ~ (p=0.690 n=5+5) ParseFloatLargeExp-8 125µs ± 1% 124µs ± 1% ~ (p=0.095 n=5+5) GCD10x10/WithoutXY-8 277ns ± 2% 278ns ± 3% ~ (p=0.937 n=5+5) GCD10x10/WithXY-8 2.08µs ± 3% 2.15µs ± 3% ~ (p=0.056 n=5+5) GCD10x100/WithoutXY-8 592ns ± 3% 613ns ± 4% ~ (p=0.056 n=5+5) GCD10x100/WithXY-8 3.40µs ± 2% 3.42µs ± 4% ~ (p=0.841 n=5+5) GCD10x1000/WithoutXY-8 1.37µs ± 2% 1.35µs ± 3% ~ (p=0.460 n=5+5) GCD10x1000/WithXY-8 7.34µs ± 2% 7.33µs ± 4% ~ (p=0.841 n=5+5) GCD10x10000/WithoutXY-8 8.52µs ± 0% 8.51µs ± 1% ~ (p=0.421 n=5+5) GCD10x10000/WithXY-8 27.5µs ± 2% 27.2µs ± 1% ~ (p=0.151 n=5+5) GCD10x100000/WithoutXY-8 78.3µs ± 1% 78.5µs ± 1% ~ (p=0.690 n=5+5) GCD10x100000/WithXY-8 231µs ± 0% 229µs ± 1% -1.11% (p=0.016 n=5+5) GCD100x100/WithoutXY-8 1.86µs ± 2% 1.86µs ± 2% ~ (p=0.881 n=5+5) GCD100x100/WithXY-8 27.1µs ± 2% 27.2µs ± 1% ~ (p=0.421 n=5+5) GCD100x1000/WithoutXY-8 4.44µs ± 2% 4.41µs ± 1% ~ (p=0.310 n=5+5) GCD100x1000/WithXY-8 36.3µs ± 1% 36.2µs ± 1% ~ (p=0.310 n=5+5) GCD100x10000/WithoutXY-8 22.6µs ± 2% 22.5µs ± 1% ~ (p=0.690 n=5+5) GCD100x10000/WithXY-8 145µs ± 1% 145µs ± 1% ~ (p=1.000 n=5+5) GCD100x100000/WithoutXY-8 195µs ± 0% 196µs ± 1% ~ (p=0.548 n=5+5) GCD100x100000/WithXY-8 1.10ms ± 0% 1.10ms ± 0% -0.30% (p=0.016 n=5+5) GCD1000x1000/WithoutXY-8 25.0µs ± 1% 25.2µs ± 2% ~ (p=0.222 n=5+5) GCD1000x1000/WithXY-8 520µs ± 0% 520µs ± 1% ~ (p=0.151 n=5+5) GCD1000x10000/WithoutXY-8 57.0µs ± 1% 56.9µs ± 1% ~ (p=0.690 n=5+5) GCD1000x10000/WithXY-8 1.21ms ± 0% 1.21ms ± 1% ~ (p=0.881 n=5+5) GCD1000x100000/WithoutXY-8 358µs ± 0% 359µs ± 1% ~ (p=0.548 n=5+5) GCD1000x100000/WithXY-8 8.73ms ± 0% 8.73ms ± 0% ~ (p=0.548 n=5+5) GCD10000x10000/WithoutXY-8 686µs ± 0% 687µs ± 0% ~ (p=0.548 n=5+5) GCD10000x10000/WithXY-8 15.9ms ± 0% 15.9ms ± 0% ~ (p=0.841 n=5+5) GCD10000x100000/WithoutXY-8 2.08ms ± 0% 2.08ms ± 0% ~ (p=1.000 n=5+5) GCD10000x100000/WithXY-8 86.7ms ± 0% 86.7ms ± 0% ~ (p=1.000 n=5+5) GCD100000x100000/WithoutXY-8 51.1ms ± 0% 51.0ms ± 0% ~ (p=0.151 n=5+5) GCD100000x100000/WithXY-8 1.23s ± 0% 1.23s ± 0% ~ (p=0.841 n=5+5) Hilbert-8 2.41ms ± 1% 2.42ms ± 2% ~ (p=0.690 n=5+5) Binomial-8 4.86µs ± 1% 4.86µs ± 1% ~ (p=0.889 n=5+5) QuoRem-8 7.09µs ± 0% 7.08µs ± 0% -0.09% (p=0.024 n=5+5) Exp-8 161ms ± 0% 161ms ± 0% -0.08% (p=0.032 n=5+5) Exp2-8 161ms ± 0% 161ms ± 0% ~ (p=1.000 n=5+5) Bitset-8 40.7ns ± 0% 40.6ns ± 0% ~ (p=0.095 n=4+5) BitsetNeg-8 159ns ± 4% 148ns ± 0% -6.92% (p=0.016 n=5+4) BitsetOrig-8 378ns ± 1% 378ns ± 1% ~ (p=0.937 n=5+5) BitsetNegOrig-8 647ns ± 5% 647ns ± 4% ~ (p=1.000 n=5+5) ModSqrt225_Tonelli-8 7.26ms ± 0% 7.27ms ± 0% ~ (p=1.000 n=5+5) ModSqrt224_3Mod4-8 2.24ms ± 0% 2.24ms ± 0% ~ (p=0.690 n=5+5) ModSqrt5430_Tonelli-8 62.8s ± 1% 62.5s ± 0% ~ (p=0.063 n=5+4) ModSqrt5430_3Mod4-8 20.8s ± 0% 20.8s ± 0% ~ (p=0.310 n=5+5) Sqrt-8 101µs ± 1% 101µs ± 0% -0.35% (p=0.032 n=5+5) IntSqr/1-8 32.3ns ± 1% 32.5ns ± 1% ~ (p=0.421 n=5+5) IntSqr/2-8 157ns ± 5% 156ns ± 5% ~ (p=0.651 n=5+5) IntSqr/3-8 292ns ± 2% 291ns ± 3% ~ (p=0.881 n=5+5) IntSqr/5-8 738ns ± 6% 740ns ± 5% ~ (p=0.841 n=5+5) IntSqr/8-8 1.82µs ± 4% 1.83µs ± 4% ~ (p=0.730 n=5+5) IntSqr/10-8 2.92µs ± 1% 2.93µs ± 1% ~ (p=0.643 n=5+5) IntSqr/20-8 6.28µs ± 2% 6.28µs ± 2% ~ (p=1.000 n=5+5) IntSqr/30-8 13.8µs ± 2% 13.9µs ± 3% ~ (p=1.000 n=5+5) IntSqr/50-8 37.8µs ± 4% 37.9µs ± 4% ~ (p=0.690 n=5+5) IntSqr/80-8 95.9µs ± 1% 95.8µs ± 1% ~ (p=0.841 n=5+5) IntSqr/100-8 148µs ± 1% 148µs ± 1% ~ (p=0.310 n=5+5) IntSqr/200-8 586µs ± 1% 586µs ± 1% ~ (p=0.841 n=5+5) IntSqr/300-8 1.32ms ± 0% 1.31ms ± 0% ~ (p=0.222 n=5+5) IntSqr/500-8 2.48ms ± 0% 2.48ms ± 0% ~ (p=0.556 n=5+4) IntSqr/800-8 4.68ms ± 0% 4.68ms ± 0% ~ (p=0.548 n=5+5) IntSqr/1000-8 7.57ms ± 0% 7.56ms ± 0% ~ (p=0.421 n=5+5) Mul-8 311ms ± 0% 311ms ± 0% ~ (p=0.548 n=5+5) Exp3Power/0x10-8 559ns ± 1% 560ns ± 1% ~ (p=0.984 n=5+5) Exp3Power/0x40-8 641ns ± 1% 634ns ± 1% ~ (p=0.063 n=5+5) Exp3Power/0x100-8 1.39µs ± 2% 1.40µs ± 2% ~ (p=0.381 n=5+5) Exp3Power/0x400-8 8.27µs ± 1% 8.26µs ± 0% ~ (p=0.571 n=5+5) Exp3Power/0x1000-8 59.9µs ± 0% 59.7µs ± 0% -0.23% (p=0.008 n=5+5) Exp3Power/0x4000-8 816µs ± 0% 816µs ± 0% ~ (p=1.000 n=5+5) Exp3Power/0x10000-8 7.77ms ± 0% 7.77ms ± 0% ~ (p=0.841 n=5+5) Exp3Power/0x40000-8 73.4ms ± 0% 73.4ms ± 0% ~ (p=0.690 n=5+5) Exp3Power/0x100000-8 665ms ± 0% 664ms ± 0% -0.14% (p=0.008 n=5+5) Exp3Power/0x400000-8 5.98s ± 0% 5.98s ± 0% -0.09% (p=0.008 n=5+5) Fibo-8 116ms ± 0% 116ms ± 0% -0.25% (p=0.008 n=5+5) NatSqr/1-8 115ns ± 3% 116ns ± 2% ~ (p=0.238 n=5+5) NatSqr/2-8 237ns ± 1% 237ns ± 1% ~ (p=0.683 n=5+5) NatSqr/3-8 367ns ± 3% 368ns ± 3% ~ (p=0.817 n=5+5) NatSqr/5-8 807ns ± 3% 812ns ± 3% ~ (p=0.913 n=5+5) NatSqr/8-8 1.93µs ± 2% 1.93µs ± 3% ~ (p=0.651 n=5+5) NatSqr/10-8 2.98µs ± 2% 2.99µs ± 2% ~ (p=0.690 n=5+5) NatSqr/20-8 6.49µs ± 2% 6.46µs ± 2% ~ (p=0.548 n=5+5) NatSqr/30-8 14.4µs ± 2% 14.3µs ± 2% ~ (p=0.690 n=5+5) NatSqr/50-8 38.6µs ± 2% 38.7µs ± 2% ~ (p=0.841 n=5+5) NatSqr/80-8 96.1µs ± 2% 95.8µs ± 2% ~ (p=0.548 n=5+5) NatSqr/100-8 149µs ± 1% 149µs ± 1% ~ (p=0.841 n=5+5) NatSqr/200-8 593µs ± 1% 590µs ± 1% ~ (p=0.421 n=5+5) NatSqr/300-8 1.32ms ± 0% 1.32ms ± 1% ~ (p=0.222 n=5+5) NatSqr/500-8 2.49ms ± 0% 2.49ms ± 0% ~ (p=0.690 n=5+5) NatSqr/800-8 4.69ms ± 0% 4.69ms ± 0% ~ (p=1.000 n=5+5) NatSqr/1000-8 7.59ms ± 0% 7.58ms ± 0% ~ (p=0.841 n=5+5) ScanPi-8 322µs ± 0% 321µs ± 0% ~ (p=0.095 n=5+5) StringPiParallel-8 71.4µs ± 5% 68.8µs ± 4% ~ (p=0.151 n=5+5) Scan/10/Base2-8 1.10µs ± 0% 1.09µs ± 0% -0.36% (p=0.032 n=5+5) Scan/100/Base2-8 7.78µs ± 0% 7.79µs ± 0% +0.14% (p=0.008 n=5+5) Scan/1000/Base2-8 78.8µs ± 0% 79.0µs ± 0% +0.24% (p=0.008 n=5+5) Scan/10000/Base2-8 1.22ms ± 0% 1.22ms ± 0% ~ (p=0.056 n=5+5) Scan/100000/Base2-8 55.1ms ± 0% 55.0ms ± 0% -0.15% (p=0.008 n=5+5) Scan/10/Base8-8 514ns ± 0% 515ns ± 0% ~ (p=0.079 n=5+5) Scan/100/Base8-8 2.89µs ± 0% 2.89µs ± 0% +0.15% (p=0.008 n=5+5) Scan/1000/Base8-8 31.0µs ± 0% 31.1µs ± 0% +0.12% (p=0.008 n=5+5) Scan/10000/Base8-8 740µs ± 0% 740µs ± 0% ~ (p=0.222 n=5+5) Scan/100000/Base8-8 50.6ms ± 0% 50.5ms ± 0% -0.06% (p=0.016 n=4+5) Scan/10/Base10-8 492ns ± 1% 490ns ± 1% ~ (p=0.310 n=5+5) Scan/100/Base10-8 2.67µs ± 0% 2.67µs ± 0% ~ (p=0.056 n=5+5) Scan/1000/Base10-8 28.7µs ± 0% 28.7µs ± 0% ~ (p=1.000 n=5+5) Scan/10000/Base10-8 717µs ± 0% 716µs ± 0% ~ (p=0.222 n=5+5) Scan/100000/Base10-8 50.2ms ± 0% 50.3ms ± 0% +0.05% (p=0.008 n=5+5) Scan/10/Base16-8 442ns ± 1% 442ns ± 0% ~ (p=0.468 n=5+5) Scan/100/Base16-8 2.46µs ± 0% 2.45µs ± 0% ~ (p=0.159 n=5+5) Scan/1000/Base16-8 27.2µs ± 0% 27.2µs ± 0% ~ (p=0.841 n=5+5) Scan/10000/Base16-8 721µs ± 0% 722µs ± 0% ~ (p=0.548 n=5+5) Scan/100000/Base16-8 52.6ms ± 0% 52.6ms ± 0% +0.07% (p=0.008 n=5+5) String/10/Base2-8 244ns ± 1% 242ns ± 1% ~ (p=0.103 n=5+5) String/100/Base2-8 1.48µs ± 0% 1.48µs ± 1% ~ (p=0.786 n=5+5) String/1000/Base2-8 13.3µs ± 1% 13.3µs ± 0% ~ (p=0.222 n=5+5) String/10000/Base2-8 132µs ± 1% 132µs ± 1% ~ (p=1.000 n=5+5) String/100000/Base2-8 1.30ms ± 1% 1.30ms ± 1% ~ (p=1.000 n=5+5) String/10/Base8-8 167ns ± 1% 168ns ± 1% ~ (p=0.135 n=5+5) String/100/Base8-8 623ns ± 1% 626ns ± 1% ~ (p=0.151 n=5+5) String/1000/Base8-8 5.24µs ± 1% 5.24µs ± 0% ~ (p=1.000 n=5+5) String/10000/Base8-8 50.0µs ± 1% 50.0µs ± 1% ~ (p=1.000 n=5+5) String/100000/Base8-8 492µs ± 1% 489µs ± 1% ~ (p=0.056 n=5+5) String/10/Base10-8 503ns ± 1% 501ns ± 0% ~ (p=0.183 n=5+5) String/100/Base10-8 1.96µs ± 0% 1.97µs ± 0% ~ (p=0.389 n=5+5) String/1000/Base10-8 12.4µs ± 1% 12.4µs ± 1% ~ (p=0.841 n=5+5) String/10000/Base10-8 56.7µs ± 1% 56.6µs ± 0% ~ (p=1.000 n=5+5) String/100000/Base10-8 25.6ms ± 0% 25.6ms ± 0% ~ (p=0.222 n=5+5) String/10/Base16-8 147ns ± 0% 148ns ± 2% ~ (p=1.000 n=4+5) String/100/Base16-8 505ns ± 0% 505ns ± 1% ~ (p=0.778 n=5+5) String/1000/Base16-8 3.94µs ± 0% 3.94µs ± 0% ~ (p=0.841 n=5+5) String/10000/Base16-8 37.4µs ± 1% 37.2µs ± 1% ~ (p=0.095 n=5+5) String/100000/Base16-8 367µs ± 1% 367µs ± 0% ~ (p=1.000 n=5+5) LeafSize/0-8 6.64ms ± 0% 6.65ms ± 0% ~ (p=0.690 n=5+5) LeafSize/1-8 72.5µs ± 1% 72.4µs ± 1% ~ (p=0.841 n=5+5) LeafSize/2-8 72.6µs ± 1% 72.6µs ± 1% ~ (p=1.000 n=5+5) LeafSize/3-8 377µs ± 0% 377µs ± 0% ~ (p=0.421 n=5+5) LeafSize/4-8 71.2µs ± 1% 71.3µs ± 0% ~ (p=0.278 n=5+5) LeafSize/5-8 469µs ± 0% 469µs ± 0% ~ (p=0.310 n=5+5) LeafSize/6-8 376µs ± 0% 376µs ± 0% ~ (p=0.841 n=5+5) LeafSize/7-8 244µs ± 0% 244µs ± 0% ~ (p=0.841 n=5+5) LeafSize/8-8 71.9µs ± 1% 72.1µs ± 1% ~ (p=0.548 n=5+5) LeafSize/9-8 536µs ± 0% 536µs ± 0% ~ (p=0.151 n=5+5) LeafSize/10-8 470µs ± 0% 471µs ± 0% +0.10% (p=0.032 n=5+5) LeafSize/11-8 458µs ± 0% 458µs ± 0% ~ (p=0.881 n=5+5) LeafSize/12-8 376µs ± 0% 376µs ± 0% ~ (p=0.548 n=5+5) LeafSize/13-8 341µs ± 0% 342µs ± 0% ~ (p=0.222 n=5+5) LeafSize/14-8 246µs ± 0% 245µs ± 0% ~ (p=0.167 n=5+5) LeafSize/15-8 168µs ± 0% 168µs ± 0% ~ (p=0.548 n=5+5) LeafSize/16-8 72.1µs ± 1% 72.2µs ± 1% ~ (p=0.690 n=5+5) LeafSize/32-8 81.5µs ± 1% 81.4µs ± 1% ~ (p=1.000 n=5+5) LeafSize/64-8 133µs ± 1% 134µs ± 1% ~ (p=0.690 n=5+5) ProbablyPrime/n=0-8 44.3ms ± 0% 44.2ms ± 0% -0.28% (p=0.008 n=5+5) ProbablyPrime/n=1-8 64.8ms ± 0% 64.7ms ± 0% -0.15% (p=0.008 n=5+5) ProbablyPrime/n=5-8 147ms ± 0% 147ms ± 0% -0.11% (p=0.008 n=5+5) ProbablyPrime/n=10-8 250ms ± 0% 250ms ± 0% ~ (p=0.056 n=5+5) ProbablyPrime/n=20-8 456ms ± 0% 455ms ± 0% -0.05% (p=0.008 n=5+5) ProbablyPrime/Lucas-8 23.6ms ± 0% 23.5ms ± 0% -0.29% (p=0.008 n=5+5) ProbablyPrime/MillerRabinBase2-8 20.6ms ± 0% 20.6ms ± 0% ~ (p=0.690 n=5+5) FloatSqrt/64-8 2.01µs ± 1% 2.02µs ± 1% ~ (p=0.421 n=5+5) FloatSqrt/128-8 4.43µs ± 2% 4.38µs ± 2% ~ (p=0.222 n=5+5) FloatSqrt/256-8 6.64µs ± 1% 6.68µs ± 2% ~ (p=0.516 n=5+5) FloatSqrt/1000-8 31.9µs ± 0% 31.8µs ± 0% ~ (p=0.095 n=5+5) FloatSqrt/10000-8 595µs ± 0% 594µs ± 0% ~ (p=0.056 n=5+5) FloatSqrt/100000-8 17.9ms ± 0% 17.9ms ± 0% ~ (p=0.151 n=5+5) FloatSqrt/1000000-8 1.52s ± 0% 1.52s ± 0% ~ (p=0.841 n=5+5) name old speed new speed delta AddVV/1-8 2.97GB/s ± 0% 2.97GB/s ± 0% ~ (p=0.971 n=4+4) AddVV/2-8 9.47GB/s ± 0% 9.47GB/s ± 0% +0.01% (p=0.016 n=5+5) AddVV/3-8 12.4GB/s ± 0% 12.4GB/s ± 0% ~ (p=0.548 n=5+5) AddVV/4-8 14.6GB/s ± 0% 14.6GB/s ± 0% ~ (p=1.000 n=5+5) AddVV/5-8 16.4GB/s ± 0% 16.4GB/s ± 0% ~ (p=1.000 n=5+5) AddVV/10-8 21.7GB/s ± 0% 21.7GB/s ± 0% ~ (p=0.548 n=5+5) AddVV/100-8 29.4GB/s ± 0% 29.4GB/s ± 0% ~ (p=1.000 n=5+5) AddVV/1000-8 31.7GB/s ± 0% 31.7GB/s ± 0% ~ (p=0.524 n=5+4) AddVV/10000-8 31.5GB/s ± 0% 31.5GB/s ± 0% ~ (p=0.690 n=5+5) AddVV/100000-8 28.8GB/s ± 7% 28.1GB/s ± 8% ~ (p=0.548 n=5+5) AddVW/1-8 859MB/s ± 0% 864MB/s ± 0% +0.61% (p=0.008 n=5+5) AddVW/2-8 809MB/s ± 2% 1520MB/s ± 0% +87.78% (p=0.008 n=5+5) AddVW/3-8 2.08GB/s ± 0% 2.18GB/s ± 0% +4.54% (p=0.008 n=5+5) AddVW/4-8 2.46GB/s ± 0% 2.66GB/s ± 0% +8.33% (p=0.016 n=4+5) AddVW/5-8 2.76GB/s ± 0% 3.20GB/s ± 0% +16.03% (p=0.008 n=5+5) AddVW/10-8 3.63GB/s ± 0% 5.15GB/s ± 0% +41.83% (p=0.008 n=5+5) AddVW/100-8 4.79GB/s ± 0% 9.87GB/s ± 0% +106.12% (p=0.008 n=5+5) AddVW/1000-8 5.27GB/s ± 0% 12.42GB/s ± 0% +135.74% (p=0.008 n=5+5) AddVW/10000-8 5.31GB/s ± 0% 11.19GB/s ± 0% +110.71% (p=0.008 n=5+5) AddVW/100000-8 5.32GB/s ± 0% 11.32GB/s ± 0% +112.56% (p=0.008 n=5+5) SubVW/1-8 859MB/s ± 0% 864MB/s ± 0% +0.61% (p=0.008 n=5+5) SubVW/2-8 812MB/s ± 2% 1520MB/s ± 0% +87.09% (p=0.008 n=5+5) SubVW/3-8 2.08GB/s ± 0% 2.18GB/s ± 0% +4.55% (p=0.008 n=5+5) SubVW/4-8 2.46GB/s ± 0% 2.66GB/s ± 0% +8.33% (p=0.008 n=5+5) SubVW/5-8 2.75GB/s ± 0% 3.20GB/s ± 0% +16.03% (p=0.008 n=5+5) SubVW/10-8 3.63GB/s ± 0% 5.15GB/s ± 0% +41.82% (p=0.008 n=5+5) SubVW/100-8 4.79GB/s ± 0% 9.87GB/s ± 0% +106.13% (p=0.008 n=5+5) SubVW/1000-8 5.27GB/s ± 0% 12.42GB/s ± 0% +135.74% (p=0.008 n=5+5) SubVW/10000-8 5.31GB/s ± 0% 11.17GB/s ± 0% +110.44% (p=0.008 n=5+5) SubVW/100000-8 5.32GB/s ± 0% 11.31GB/s ± 0% +112.35% (p=0.008 n=5+5) AddMulVVW/1-8 1.97GB/s ± 1% 1.96GB/s ± 1% ~ (p=0.151 n=5+5) AddMulVVW/2-8 2.24GB/s ± 0% 2.25GB/s ± 0% ~ (p=0.095 n=5+5) AddMulVVW/3-8 2.11GB/s ± 0% 2.12GB/s ± 0% ~ (p=0.548 n=5+5) AddMulVVW/4-8 2.17GB/s ± 1% 2.17GB/s ± 1% ~ (p=0.548 n=5+5) AddMulVVW/5-8 2.22GB/s ± 1% 2.21GB/s ± 1% ~ (p=0.421 n=5+5) AddMulVVW/10-8 2.17GB/s ± 1% 2.16GB/s ± 0% ~ (p=0.095 n=5+5) AddMulVVW/100-8 2.35GB/s ± 0% 2.35GB/s ± 0% ~ (p=0.421 n=5+5) AddMulVVW/1000-8 2.47GB/s ± 0% 2.41GB/s ± 0% -2.09% (p=0.008 n=5+5) AddMulVVW/10000-8 2.16GB/s ± 0% 2.15GB/s ± 0% -0.23% (p=0.008 n=5+5) AddMulVVW/100000-8 2.03GB/s ± 1% 2.04GB/s ± 0% ~ (p=0.690 n=5+5) name old alloc/op new alloc/op delta FloatString/100-8 400B ± 0% 400B ± 0% ~ (all equal) FloatString/1000-8 3.22kB ± 0% 3.22kB ± 0% ~ (all equal) FloatString/10000-8 55.6kB ± 0% 55.5kB ± 0% ~ (p=0.206 n=5+5) FloatString/100000-8 627kB ± 0% 627kB ± 0% ~ (all equal) FloatAdd/10-8 0.00B 0.00B ~ (all equal) FloatAdd/100-8 0.00B 0.00B ~ (all equal) FloatAdd/1000-8 0.00B 0.00B ~ (all equal) FloatAdd/10000-8 0.00B 0.00B ~ (all equal) FloatAdd/100000-8 0.00B 0.00B ~ (all equal) FloatSub/10-8 0.00B 0.00B ~ (all equal) FloatSub/100-8 0.00B 0.00B ~ (all equal) FloatSub/1000-8 0.00B 0.00B ~ (all equal) FloatSub/10000-8 0.00B 0.00B ~ (all equal) FloatSub/100000-8 0.00B 0.00B ~ (all equal) FloatSqrt/64-8 416B ± 0% 416B ± 0% ~ (all equal) FloatSqrt/128-8 720B ± 0% 720B ± 0% ~ (all equal) FloatSqrt/256-8 816B ± 0% 816B ± 0% ~ (all equal) FloatSqrt/1000-8 2.50kB ± 0% 2.50kB ± 0% ~ (all equal) FloatSqrt/10000-8 23.5kB ± 0% 23.5kB ± 0% ~ (all equal) FloatSqrt/100000-8 251kB ± 0% 251kB ± 0% ~ (all equal) FloatSqrt/1000000-8 4.61MB ± 0% 4.61MB ± 0% ~ (all equal) name old allocs/op new allocs/op delta FloatString/100-8 8.00 ± 0% 8.00 ± 0% ~ (all equal) FloatString/1000-8 10.0 ± 0% 10.0 ± 0% ~ (all equal) FloatString/10000-8 42.0 ± 0% 42.0 ± 0% ~ (all equal) FloatString/100000-8 346 ± 0% 346 ± 0% ~ (all equal) FloatAdd/10-8 0.00 0.00 ~ (all equal) FloatAdd/100-8 0.00 0.00 ~ (all equal) FloatAdd/1000-8 0.00 0.00 ~ (all equal) FloatAdd/10000-8 0.00 0.00 ~ (all equal) FloatAdd/100000-8 0.00 0.00 ~ (all equal) FloatSub/10-8 0.00 0.00 ~ (all equal) FloatSub/100-8 0.00 0.00 ~ (all equal) FloatSub/1000-8 0.00 0.00 ~ (all equal) FloatSub/10000-8 0.00 0.00 ~ (all equal) FloatSub/100000-8 0.00 0.00 ~ (all equal) FloatSqrt/64-8 9.00 ± 0% 9.00 ± 0% ~ (all equal) FloatSqrt/128-8 13.0 ± 0% 13.0 ± 0% ~ (all equal) FloatSqrt/256-8 12.0 ± 0% 12.0 ± 0% ~ (all equal) FloatSqrt/1000-8 19.0 ± 0% 19.0 ± 0% ~ (all equal) FloatSqrt/10000-8 35.0 ± 0% 35.0 ± 0% ~ (all equal) FloatSqrt/100000-8 55.0 ± 0% 55.0 ± 0% ~ (all equal) FloatSqrt/1000000-8 122 ± 0% 122 ± 0% ~ (all equal) Change-Id: I6888d84c037d91f9e2199f3492ea3f6a0ed77b24 Reviewed-on: https://go-review.googlesource.com/77832 Reviewed-by: Vlad Krasnov <vlad@cloudflare.com> Reviewed-by: Cherry Zhang <cherryyz@google.com> Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org>

The biggest hot spot of the existing implementation is "load" operations, which lead to poor performance. By unrolling the cycle 4x and 2x, and using "LDP", "STP" instructions, this CL can reduce the "load" cost and improve performance. Benchmarks: name old time/op new time/op delta AddVV/1-8 21.5ns ± 0% 11.5ns ± 0% -46.51% (p=0.008 n=5+5) AddVV/2-8 13.5ns ± 0% 12.0ns ± 0% -11.11% (p=0.008 n=5+5) AddVV/3-8 15.5ns ± 0% 13.0ns ± 0% -16.13% (p=0.008 n=5+5) AddVV/4-8 17.5ns ± 0% 13.5ns ± 0% -22.86% (p=0.008 n=5+5) AddVV/5-8 19.5ns ± 0% 14.5ns ± 0% -25.64% (p=0.008 n=5+5) AddVV/10-8 29.5ns ± 0% 18.0ns ± 0% -38.98% (p=0.008 n=5+5) AddVV/100-8 217ns ± 0% 94ns ± 0% -56.64% (p=0.008 n=5+5) AddVV/1000-8 2.02µs ± 0% 1.03µs ± 0% -48.85% (p=0.008 n=5+5) AddVV/10000-8 20.5µs ± 0% 11.3µs ± 0% -44.70% (p=0.008 n=5+5) AddVV/100000-8 247µs ± 3% 154µs ± 0% -37.52% (p=0.008 n=5+5) SubVV/1-8 21.5ns ± 0% 11.5ns ± 0% ~ (p=0.079 n=4+5) SubVV/2-8 13.5ns ± 0% 12.0ns ± 0% -11.11% (p=0.008 n=5+5) SubVV/3-8 15.5ns ± 0% 13.0ns ± 0% -16.13% (p=0.008 n=5+5) SubVV/4-8 17.5ns ± 0% 13.5ns ± 0% -22.86% (p=0.008 n=5+5) SubVV/5-8 19.5ns ± 0% 14.5ns ± 0% -25.64% (p=0.008 n=5+5) SubVV/10-8 29.5ns ± 0% 18.0ns ± 0% -38.98% (p=0.008 n=5+5) SubVV/100-8 217ns ± 0% 94ns ± 0% -56.64% (p=0.008 n=5+5) SubVV/1000-8 2.02µs ± 0% 0.80µs ± 0% -60.50% (p=0.008 n=5+5) SubVV/10000-8 20.5µs ± 0% 11.3µs ± 0% -44.99% (p=0.008 n=5+5) SubVV/100000-8 221µs ±11% 223µs ±16% ~ (p=0.690 n=5+5) AddVW/1-8 9.32ns ± 0% 9.32ns ± 0% ~ (all equal) AddVW/2-8 19.7ns ± 1% 19.7ns ± 0% ~ (p=0.381 n=5+4) AddVW/3-8 11.5ns ± 0% 11.5ns ± 0% ~ (all equal) AddVW/4-8 13.0ns ± 0% 13.0ns ± 0% ~ (all equal) AddVW/5-8 14.5ns ± 0% 14.5ns ± 0% ~ (all equal) AddVW/10-8 22.0ns ± 0% 22.0ns ± 0% ~ (all equal) AddVW/100-8 167ns ± 0% 167ns ± 0% ~ (all equal) AddVW/1000-8 1.52µs ± 0% 1.52µs ± 0% +0.40% (p=0.008 n=5+5) AddVW/10000-8 15.1µs ± 0% 15.1µs ± 0% ~ (p=0.556 n=5+4) AddVW/100000-8 152µs ± 1% 152µs ± 1% ~ (p=0.690 n=5+5) AddMulVVW/1-8 33.3ns ± 0% 32.7ns ± 1% -1.86% (p=0.008 n=5+5) AddMulVVW/2-8 59.3ns ± 1% 56.9ns ± 1% -4.15% (p=0.008 n=5+5) AddMulVVW/3-8 80.5ns ± 1% 85.4ns ± 3% +6.19% (p=0.008 n=5+5) AddMulVVW/4-8 127ns ± 0% 111ns ± 1% -13.19% (p=0.008 n=5+5) AddMulVVW/5-8 144ns ± 0% 149ns ± 0% +3.47% (p=0.016 n=4+5) AddMulVVW/10-8 298ns ± 1% 283ns ± 0% -4.77% (p=0.008 n=5+5) AddMulVVW/100-8 3.06µs ± 0% 2.99µs ± 0% -2.21% (p=0.008 n=5+5) AddMulVVW/1000-8 31.3µs ± 0% 26.9µs ± 0% -14.17% (p=0.008 n=5+5) AddMulVVW/10000-8 316µs ± 0% 305µs ± 0% -3.51% (p=0.008 n=5+5) AddMulVVW/100000-8 3.17ms ± 0% 3.17ms ± 1% ~ (p=0.690 n=5+5) DecimalConversion-8 316µs ± 1% 313µs ± 2% ~ (p=0.095 n=5+5) FloatString/100-8 2.53µs ± 1% 2.56µs ± 2% ~ (p=0.222 n=5+5) FloatString/1000-8 58.4µs ± 0% 58.5µs ± 0% ~ (p=0.206 n=5+5) FloatString/10000-8 4.59ms ± 0% 4.58ms ± 0% -0.31% (p=0.008 n=5+5) FloatString/100000-8 446ms ± 0% 444ms ± 0% -0.31% (p=0.008 n=5+5) FloatAdd/10-8 184ns ± 0% 172ns ± 0% -6.30% (p=0.008 n=5+5) FloatAdd/100-8 189ns ± 2% 191ns ± 4% ~ (p=0.381 n=5+5) FloatAdd/1000-8 371ns ± 0% 347ns ± 1% -6.42% (p=0.008 n=5+5) FloatAdd/10000-8 1.87µs ± 0% 1.68µs ± 0% -10.16% (p=0.008 n=5+5) FloatAdd/100000-8 17.1µs ± 0% 15.6µs ± 0% -8.74% (p=0.016 n=5+4) FloatSub/10-8 152ns ± 0% 138ns ± 0% -9.47% (p=0.000 n=4+5) FloatSub/100-8 148ns ± 0% 142ns ± 0% -4.05% (p=0.000 n=5+4) FloatSub/1000-8 245ns ± 1% 217ns ± 0% -11.28% (p=0.000 n=5+4) FloatSub/10000-8 1.07µs ± 0% 0.88µs ± 1% -18.14% (p=0.008 n=5+5) FloatSub/100000-8 9.58µs ± 0% 7.96µs ± 0% -16.84% (p=0.008 n=5+5) ParseFloatSmallExp-8 28.8µs ± 1% 29.0µs ± 1% ~ (p=0.095 n=5+5) ParseFloatLargeExp-8 126µs ± 1% 126µs ± 1% ~ (p=0.841 n=5+5) GCD10x10/WithoutXY-8 277ns ± 2% 281ns ± 4% ~ (p=0.746 n=5+5) GCD10x10/WithXY-8 2.10µs ± 1% 2.12µs ± 3% ~ (p=0.548 n=5+5) GCD10x100/WithoutXY-8 615ns ± 3% 607ns ± 2% ~ (p=0.135 n=5+5) GCD10x100/WithXY-8 3.50µs ± 2% 3.62µs ± 5% ~ (p=0.151 n=5+5) GCD10x1000/WithoutXY-8 1.39µs ± 2% 1.39µs ± 3% ~ (p=0.690 n=5+5) GCD10x1000/WithXY-8 7.39µs ± 1% 7.34µs ± 2% ~ (p=0.135 n=5+5) GCD10x10000/WithoutXY-8 8.66µs ± 1% 8.68µs ± 1% ~ (p=0.421 n=5+5) GCD10x10000/WithXY-8 28.1µs ± 2% 27.0µs ± 2% -3.81% (p=0.008 n=5+5) GCD10x100000/WithoutXY-8 79.3µs ± 1% 79.3µs ± 1% ~ (p=0.841 n=5+5) GCD10x100000/WithXY-8 238µs ± 0% 227µs ± 1% -4.74% (p=0.008 n=5+5) GCD100x100/WithoutXY-8 1.89µs ± 1% 1.88µs ± 2% ~ (p=0.968 n=5+5) GCD100x100/WithXY-8 26.7µs ± 1% 27.0µs ± 1% +1.44% (p=0.032 n=5+5) GCD100x1000/WithoutXY-8 4.48µs ± 1% 4.45µs ± 2% ~ (p=0.341 n=5+5) GCD100x1000/WithXY-8 36.3µs ± 1% 35.1µs ± 1% -3.27% (p=0.008 n=5+5) GCD100x10000/WithoutXY-8 22.8µs ± 0% 22.7µs ± 1% ~ (p=0.056 n=5+5) GCD100x10000/WithXY-8 145µs ± 1% 133µs ± 1% -8.33% (p=0.008 n=5+5) GCD100x100000/WithoutXY-8 198µs ± 0% 195µs ± 0% -1.56% (p=0.008 n=5+5) GCD100x100000/WithXY-8 1.11ms ± 0% 1.00ms ± 0% -10.04% (p=0.008 n=5+5) GCD1000x1000/WithoutXY-8 25.2µs ± 1% 24.8µs ± 1% -1.63% (p=0.016 n=5+5) GCD1000x1000/WithXY-8 513µs ± 0% 517µs ± 2% ~ (p=0.421 n=5+5) GCD1000x10000/WithoutXY-8 57.0µs ± 0% 52.7µs ± 1% -7.56% (p=0.008 n=5+5) GCD1000x10000/WithXY-8 1.20ms ± 0% 1.10ms ± 0% -8.70% (p=0.008 n=5+5) GCD1000x100000/WithoutXY-8 358µs ± 0% 318µs ± 1% -11.03% (p=0.008 n=5+5) GCD1000x100000/WithXY-8 8.71ms ± 0% 7.65ms ± 0% -12.19% (p=0.008 n=5+5) GCD10000x10000/WithoutXY-8 690µs ± 0% 630µs ± 0% -8.71% (p=0.008 n=5+5) GCD10000x10000/WithXY-8 16.0ms ± 1% 14.9ms ± 0% -6.85% (p=0.008 n=5+5) GCD10000x100000/WithoutXY-8 2.09ms ± 0% 1.75ms ± 0% -16.09% (p=0.016 n=5+4) GCD10000x100000/WithXY-8 86.8ms ± 0% 76.3ms ± 0% -12.09% (p=0.008 n=5+5) GCD100000x100000/WithoutXY-8 51.1ms ± 0% 46.0ms ± 0% -9.97% (p=0.008 n=5+5) GCD100000x100000/WithXY-8 1.25s ± 0% 1.15s ± 0% -7.92% (p=0.008 n=5+5) Hilbert-8 2.45ms ± 1% 2.49ms ± 1% +1.99% (p=0.008 n=5+5) Binomial-8 4.98µs ± 3% 4.90µs ± 2% ~ (p=0.421 n=5+5) QuoRem-8 7.10µs ± 0% 6.21µs ± 0% -12.55% (p=0.016 n=5+4) Exp-8 161ms ± 0% 161ms ± 0% ~ (p=0.421 n=5+5) Exp2-8 161ms ± 0% 161ms ± 0% ~ (p=0.151 n=5+5) Bitset-8 40.4ns ± 0% 40.3ns ± 0% ~ (p=0.190 n=5+5) BitsetNeg-8 163ns ± 3% 137ns ± 2% -15.91% (p=0.008 n=5+5) BitsetOrig-8 377ns ± 1% 372ns ± 1% -1.22% (p=0.024 n=5+5) BitsetNegOrig-8 631ns ± 1% 605ns ± 1% -4.09% (p=0.008 n=5+5) ModSqrt225_Tonelli-8 7.26ms ± 0% 7.26ms ± 0% ~ (p=0.548 n=5+5) ModSqrt224_3Mod4-8 2.24ms ± 0% 2.24ms ± 0% ~ (p=1.000 n=5+5) ModSqrt5430_Tonelli-8 62.4s ± 0% 62.4s ± 0% ~ (p=0.841 n=5+5) ModSqrt5430_3Mod4-8 20.8s ± 0% 20.7s ± 0% ~ (p=0.056 n=5+5) Sqrt-8 101µs ± 0% 89µs ± 0% -12.17% (p=0.008 n=5+5) IntSqr/1-8 32.5ns ± 1% 32.7ns ± 1% ~ (p=0.056 n=5+5) IntSqr/2-8 160ns ± 5% 158ns ± 0% ~ (p=0.397 n=5+4) IntSqr/3-8 298ns ± 4% 296ns ± 4% ~ (p=0.667 n=5+5) IntSqr/5-8 737ns ± 5% 761ns ± 3% +3.34% (p=0.016 n=5+5) IntSqr/8-8 1.87µs ± 4% 1.90µs ± 3% ~ (p=0.222 n=5+5) IntSqr/10-8 2.96µs ± 4% 2.92µs ± 6% ~ (p=0.310 n=5+5) IntSqr/20-8 6.28µs ± 3% 6.21µs ± 2% ~ (p=0.310 n=5+5) IntSqr/30-8 14.0µs ± 2% 13.9µs ± 2% ~ (p=0.548 n=5+5) IntSqr/50-8 37.7µs ± 3% 38.3µs ± 2% ~ (p=0.095 n=5+5) IntSqr/80-8 95.9µs ± 2% 95.1µs ± 1% ~ (p=0.310 n=5+5) IntSqr/100-8 148µs ± 1% 148µs ± 1% ~ (p=0.841 n=5+5) IntSqr/200-8 586µs ± 1% 587µs ± 1% ~ (p=1.000 n=5+5) IntSqr/300-8 1.32ms ± 0% 1.31ms ± 1% -0.73% (p=0.032 n=5+5) IntSqr/500-8 2.48ms ± 0% 2.45ms ± 0% -1.15% (p=0.008 n=5+5) IntSqr/800-8 4.68ms ± 0% 4.62ms ± 0% -1.23% (p=0.008 n=5+5) IntSqr/1000-8 7.57ms ± 0% 7.50ms ± 0% -0.84% (p=0.008 n=5+5) Mul-8 311ms ± 0% 308ms ± 0% -0.81% (p=0.008 n=5+5) Exp3Power/0x10-8 574ns ± 1% 578ns ± 2% ~ (p=0.500 n=5+5) Exp3Power/0x40-8 640ns ± 1% 646ns ± 0% ~ (p=0.056 n=5+5) Exp3Power/0x100-8 1.42µs ± 1% 1.42µs ± 1% ~ (p=0.246 n=5+5) Exp3Power/0x400-8 8.30µs ± 1% 8.29µs ± 1% ~ (p=0.802 n=5+5) Exp3Power/0x1000-8 60.0µs ± 0% 59.9µs ± 0% -0.24% (p=0.016 n=5+5) Exp3Power/0x4000-8 817µs ± 0% 816µs ± 0% -0.17% (p=0.008 n=5+5) Exp3Power/0x10000-8 7.80ms ± 1% 7.70ms ± 0% -1.23% (p=0.008 n=5+5) Exp3Power/0x40000-8 73.4ms ± 0% 72.5ms ± 0% -1.28% (p=0.008 n=5+5) Exp3Power/0x100000-8 665ms ± 0% 656ms ± 0% -1.34% (p=0.008 n=5+5) Exp3Power/0x400000-8 5.99s ± 0% 5.90s ± 0% -1.40% (p=0.008 n=5+5) Fibo-8 116ms ± 0% 50ms ± 0% -57.09% (p=0.008 n=5+5) NatSqr/1-8 112ns ± 4% 112ns ± 2% ~ (p=0.968 n=5+5) NatSqr/2-8 251ns ± 2% 250ns ± 1% ~ (p=0.571 n=5+5) NatSqr/3-8 378ns ± 2% 379ns ± 2% ~ (p=0.794 n=5+5) NatSqr/5-8 829ns ± 3% 827ns ± 2% ~ (p=1.000 n=5+5) NatSqr/8-8 1.97µs ± 2% 1.95µs ± 2% ~ (p=0.310 n=5+5) NatSqr/10-8 3.02µs ± 2% 2.99µs ± 2% ~ (p=0.421 n=5+5) NatSqr/20-8 6.51µs ± 2% 6.49µs ± 1% ~ (p=0.841 n=5+5) NatSqr/30-8 14.1µs ± 2% 14.0µs ± 2% ~ (p=0.841 n=5+5) NatSqr/50-8 38.1µs ± 2% 38.3µs ± 3% ~ (p=0.690 n=5+5) NatSqr/80-8 95.5µs ± 2% 96.0µs ± 1% ~ (p=0.421 n=5+5) NatSqr/100-8 150µs ± 1% 148µs ± 2% ~ (p=0.095 n=5+5) NatSqr/200-8 588µs ± 1% 590µs ± 1% ~ (p=0.421 n=5+5) NatSqr/300-8 1.32ms ± 1% 1.31ms ± 1% ~ (p=0.841 n=5+5) NatSqr/500-8 2.50ms ± 0% 2.47ms ± 0% -1.03% (p=0.008 n=5+5) NatSqr/800-8 4.70ms ± 0% 4.64ms ± 0% -1.31% (p=0.008 n=5+5) NatSqr/1000-8 7.60ms ± 0% 7.52ms ± 0% -1.01% (p=0.008 n=5+5) ScanPi-8 326µs ± 0% 326µs ± 0% ~ (p=0.841 n=5+5) StringPiParallel-8 70.3µs ± 5% 63.8µs ±10% ~ (p=0.056 n=5+5) Scan/10/Base2-8 1.09µs ± 0% 1.09µs ± 0% ~ (p=0.317 n=5+5) Scan/100/Base2-8 7.79µs ± 0% 7.78µs ± 0% ~ (p=0.063 n=5+5) Scan/1000/Base2-8 79.0µs ± 0% 78.9µs ± 0% -0.18% (p=0.008 n=5+5) Scan/10000/Base2-8 1.22ms ± 0% 1.22ms ± 0% -0.15% (p=0.008 n=5+5) Scan/100000/Base2-8 55.1ms ± 0% 55.2ms ± 0% +0.20% (p=0.008 n=5+5) Scan/10/Base8-8 512ns ± 0% 512ns ± 1% ~ (p=0.810 n=5+5) Scan/100/Base8-8 2.89µs ± 0% 2.89µs ± 0% ~ (p=0.810 n=5+5) Scan/1000/Base8-8 31.0µs ± 0% 31.0µs ± 0% ~ (p=0.151 n=5+5) Scan/10000/Base8-8 740µs ± 0% 741µs ± 0% +0.10% (p=0.008 n=5+5) Scan/100000/Base8-8 50.6ms ± 0% 50.6ms ± 0% +0.08% (p=0.008 n=5+5) Scan/10/Base10-8 487ns ± 0% 487ns ± 0% ~ (p=0.571 n=5+5) Scan/100/Base10-8 2.67µs ± 0% 2.67µs ± 0% ~ (p=0.810 n=5+5) Scan/1000/Base10-8 28.7µs ± 0% 28.7µs ± 0% +0.06% (p=0.008 n=5+5) Scan/10000/Base10-8 716µs ± 0% 717µs ± 0% ~ (p=0.222 n=5+5) Scan/100000/Base10-8 50.3ms ± 0% 50.3ms ± 0% +0.10% (p=0.008 n=5+5) Scan/10/Base16-8 438ns ± 0% 437ns ± 1% ~ (p=0.786 n=5+5) Scan/100/Base16-8 2.47µs ± 0% 2.47µs ± 0% -0.19% (p=0.048 n=5+5) Scan/1000/Base16-8 27.2µs ± 0% 27.3µs ± 0% ~ (p=0.087 n=5+5) Scan/10000/Base16-8 722µs ± 0% 722µs ± 0% +0.11% (p=0.008 n=5+5) Scan/100000/Base16-8 52.6ms ± 0% 52.7ms ± 0% +0.15% (p=0.008 n=5+5) String/10/Base2-8 247ns ± 2% 248ns ± 1% ~ (p=0.437 n=5+5) String/100/Base2-8 1.51µs ± 0% 1.51µs ± 0% -0.37% (p=0.024 n=5+5) String/1000/Base2-8 13.6µs ± 1% 13.5µs ± 0% ~ (p=0.095 n=5+5) String/10000/Base2-8 135µs ± 0% 135µs ± 1% ~ (p=0.841 n=5+5) String/100000/Base2-8 1.32ms ± 1% 1.32ms ± 1% ~ (p=0.690 n=5+5) String/10/Base8-8 169ns ± 1% 169ns ± 1% ~ (p=1.000 n=5+5) String/100/Base8-8 636ns ± 0% 634ns ± 1% ~ (p=0.413 n=5+5) String/1000/Base8-8 5.33µs ± 1% 5.32µs ± 0% ~ (p=0.222 n=5+5) String/10000/Base8-8 50.9µs ± 1% 50.7µs ± 0% ~ (p=0.151 n=5+5) String/100000/Base8-8 500µs ± 1% 497µs ± 0% ~ (p=0.421 n=5+5) String/10/Base10-8 516ns ± 1% 513ns ± 0% -0.62% (p=0.016 n=5+4) String/100/Base10-8 1.97µs ± 0% 1.96µs ± 0% ~ (p=0.667 n=4+5) String/1000/Base10-8 12.5µs ± 0% 11.5µs ± 0% -7.92% (p=0.008 n=5+5) String/10000/Base10-8 57.7µs ± 0% 52.5µs ± 0% -8.93% (p=0.008 n=5+5) String/100000/Base10-8 25.6ms ± 0% 21.6ms ± 0% -15.94% (p=0.008 n=5+5) String/10/Base16-8 150ns ± 1% 149ns ± 0% ~ (p=0.413 n=5+4) String/100/Base16-8 514ns ± 1% 514ns ± 1% ~ (p=0.849 n=5+5) String/1000/Base16-8 4.01µs ± 0% 4.01µs ± 0% ~ (p=0.421 n=5+5) String/10000/Base16-8 37.8µs ± 1% 37.8µs ± 1% ~ (p=0.841 n=5+5) String/100000/Base16-8 373µs ± 2% 373µs ± 0% ~ (p=0.421 n=5+5) LeafSize/0-8 6.63ms ± 0% 6.63ms ± 0% ~ (p=0.730 n=4+5) LeafSize/1-8 74.0µs ± 0% 67.7µs ± 1% -8.53% (p=0.008 n=5+5) LeafSize/2-8 74.2µs ± 0% 68.3µs ± 1% -7.99% (p=0.008 n=5+5) LeafSize/3-8 379µs ± 0% 309µs ± 0% -18.52% (p=0.008 n=5+5) LeafSize/4-8 72.7µs ± 1% 66.7µs ± 0% -8.37% (p=0.008 n=5+5) LeafSize/5-8 471µs ± 0% 384µs ± 0% -18.55% (p=0.008 n=5+5) LeafSize/6-8 378µs ± 0% 308µs ± 0% -18.59% (p=0.008 n=5+5) LeafSize/7-8 245µs ± 0% 204µs ± 1% -16.75% (p=0.008 n=5+5) LeafSize/8-8 73.4µs ± 0% 66.9µs ± 1% -8.79% (p=0.008 n=5+5) LeafSize/9-8 538µs ± 0% 437µs ± 0% -18.75% (p=0.008 n=5+5) LeafSize/10-8 472µs ± 0% 396µs ± 1% -16.01% (p=0.008 n=5+5) LeafSize/11-8 460µs ± 0% 374µs ± 0% -18.58% (p=0.008 n=5+5) LeafSize/12-8 378µs ± 0% 308µs ± 0% -18.38% (p=0.008 n=5+5) LeafSize/13-8 343µs ± 0% 284µs ± 0% -17.30% (p=0.008 n=5+5) LeafSize/14-8 248µs ± 0% 206µs ± 0% -16.94% (p=0.008 n=5+5) LeafSize/15-8 169µs ± 0% 144µs ± 0% -14.69% (p=0.008 n=5+5) LeafSize/16-8 72.9µs ± 0% 66.8µs ± 1% -8.27% (p=0.008 n=5+5) LeafSize/32-8 82.5µs ± 0% 76.7µs ± 0% -7.04% (p=0.008 n=5+5) LeafSize/64-8 134µs ± 0% 129µs ± 0% -3.80% (p=0.008 n=5+5) ProbablyPrime/n=0-8 44.2ms ± 0% 43.4ms ± 0% -1.95% (p=0.008 n=5+5) ProbablyPrime/n=1-8 64.9ms ± 0% 64.0ms ± 0% -1.27% (p=0.008 n=5+5) ProbablyPrime/n=5-8 147ms ± 0% 146ms ± 0% -0.58% (p=0.008 n=5+5) ProbablyPrime/n=10-8 250ms ± 0% 249ms ± 0% -0.35% (p=0.008 n=5+5) ProbablyPrime/n=20-8 456ms ± 0% 455ms ± 0% -0.18% (p=0.008 n=5+5) ProbablyPrime/Lucas-8 23.6ms ± 0% 22.7ms ± 0% -3.74% (p=0.008 n=5+5) ProbablyPrime/MillerRabinBase2-8 20.7ms ± 0% 20.6ms ± 0% ~ (p=0.421 n=5+5) FloatSqrt/64-8 2.25µs ± 1% 2.29µs ± 0% +1.48% (p=0.008 n=5+5) FloatSqrt/128-8 4.86µs ± 1% 4.92µs ± 1% +1.21% (p=0.032 n=5+5) FloatSqrt/256-8 13.6µs ± 0% 13.7µs ± 1% +1.31% (p=0.032 n=5+5) FloatSqrt/1000-8 70.0µs ± 1% 70.1µs ± 0% ~ (p=0.690 n=5+5) FloatSqrt/10000-8 1.92ms ± 0% 1.90ms ± 0% -0.59% (p=0.008 n=5+5) FloatSqrt/100000-8 55.3ms ± 0% 54.8ms ± 0% -1.01% (p=0.008 n=5+5) FloatSqrt/1000000-8 4.56s ± 0% 4.50s ± 0% -1.28% (p=0.008 n=5+5) name old speed new speed delta AddVV/1-8 2.97GB/s ± 0% 5.56GB/s ± 0% +86.85% (p=0.008 n=5+5) AddVV/2-8 9.47GB/s ± 0% 10.66GB/s ± 0% +12.50% (p=0.008 n=5+5) AddVV/3-8 12.4GB/s ± 0% 14.7GB/s ± 0% +19.10% (p=0.008 n=5+5) AddVV/4-8 14.6GB/s ± 0% 18.9GB/s ± 0% +29.63% (p=0.016 n=4+5) AddVV/5-8 16.4GB/s ± 0% 22.0GB/s ± 0% +34.47% (p=0.016 n=5+4) AddVV/10-8 21.7GB/s ± 0% 35.5GB/s ± 0% +63.89% (p=0.008 n=5+5) AddVV/100-8 29.4GB/s ± 0% 68.0GB/s ± 0% +131.38% (p=0.008 n=5+5) AddVV/1000-8 31.7GB/s ± 0% 61.9GB/s ± 0% +95.43% (p=0.008 n=5+5) AddVV/10000-8 31.2GB/s ± 0% 56.4GB/s ± 0% +80.83% (p=0.008 n=5+5) AddVV/100000-8 25.9GB/s ± 3% 41.4GB/s ± 0% +59.98% (p=0.008 n=5+5) SubVV/1-8 2.97GB/s ± 0% 5.56GB/s ± 0% +86.97% (p=0.016 n=4+5) SubVV/2-8 9.47GB/s ± 0% 10.66GB/s ± 0% +12.51% (p=0.008 n=5+5) SubVV/3-8 12.4GB/s ± 0% 14.8GB/s ± 0% +19.23% (p=0.016 n=4+5) SubVV/4-8 14.6GB/s ± 0% 18.9GB/s ± 0% +29.56% (p=0.008 n=5+5) SubVV/5-8 16.4GB/s ± 0% 22.0GB/s ± 0% +34.47% (p=0.016 n=4+5) SubVV/10-8 21.7GB/s ± 0% 35.5GB/s ± 0% +63.89% (p=0.008 n=5+5) SubVV/100-8 29.4GB/s ± 0% 68.0GB/s ± 0% +131.38% (p=0.008 n=5+5) SubVV/1000-8 31.6GB/s ± 0% 80.1GB/s ± 0% +153.08% (p=0.008 n=5+5) SubVV/10000-8 31.2GB/s ± 0% 56.7GB/s ± 0% +81.79% (p=0.008 n=5+5) SubVV/100000-8 29.1GB/s ±10% 29.0GB/s ±18% ~ (p=0.690 n=5+5) AddVW/1-8 859MB/s ± 0% 859MB/s ± 0% -0.01% (p=0.008 n=5+5) AddVW/2-8 811MB/s ± 1% 814MB/s ± 0% ~ (p=0.413 n=5+4) AddVW/3-8 2.08GB/s ± 0% 2.08GB/s ± 0% ~ (p=0.206 n=5+5) AddVW/4-8 2.46GB/s ± 0% 2.46GB/s ± 0% ~ (p=0.056 n=5+5) AddVW/5-8 2.75GB/s ± 0% 2.75GB/s ± 0% ~ (p=0.508 n=5+5) AddVW/10-8 3.63GB/s ± 0% 3.63GB/s ± 0% ~ (p=0.214 n=5+5) AddVW/100-8 4.79GB/s ± 0% 4.79GB/s ± 0% ~ (p=0.500 n=5+5) AddVW/1000-8 5.27GB/s ± 0% 5.25GB/s ± 0% -0.43% (p=0.008 n=5+5) AddVW/10000-8 5.30GB/s ± 0% 5.30GB/s ± 0% ~ (p=0.397 n=5+5) AddVW/100000-8 5.27GB/s ± 1% 5.25GB/s ± 1% ~ (p=0.690 n=5+5) AddMulVVW/1-8 1.92GB/s ± 0% 1.96GB/s ± 1% +1.95% (p=0.008 n=5+5) AddMulVVW/2-8 2.16GB/s ± 1% 2.25GB/s ± 1% +4.32% (p=0.008 n=5+5) AddMulVVW/3-8 2.39GB/s ± 1% 2.25GB/s ± 3% -5.79% (p=0.008 n=5+5) AddMulVVW/4-8 2.00GB/s ± 0% 2.31GB/s ± 1% +15.31% (p=0.008 n=5+5) AddMulVVW/5-8 2.22GB/s ± 0% 2.14GB/s ± 0% -3.86% (p=0.008 n=5+5) AddMulVVW/10-8 2.15GB/s ± 1% 2.25GB/s ± 0% +5.03% (p=0.008 n=5+5) AddMulVVW/100-8 2.09GB/s ± 0% 2.14GB/s ± 0% +2.25% (p=0.008 n=5+5) AddMulVVW/1000-8 2.04GB/s ± 0% 2.38GB/s ± 0% +16.52% (p=0.008 n=5+5) AddMulVVW/10000-8 2.03GB/s ± 0% 2.10GB/s ± 0% +3.64% (p=0.008 n=5+5) AddMulVVW/100000-8 2.02GB/s ± 0% 2.02GB/s ± 1% ~ (p=0.690 n=5+5) Change-Id: Ie482d67a7dbb5af6f5d81af2b3d9d14bd66336db Reviewed-on: https://go-review.googlesource.com/77831 Reviewed-by: Cherry Zhang <cherryyz@google.com> Run-TryBot: Cherry Zhang <cherryyz@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org>

Verified that BenchmarkBitLen time went down from 2.25 ns/op to 0.65 ns/op an a 2.3 GHz Intel Core i7, before removing that benchmark (now covered by math/bits benchmarks). Change-Id: I3890bb7d1889e95b9a94bd68f0bdf06f1885adeb Reviewed-on: https://go-review.googlesource.com/38464 Run-TryBot: Robert Griesemer <gri@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>

No need to test so many sizes in race mode, especially for a package which doesn't use goroutines. Reduces test time from 2.5 minutes to 25 seconds. Updates #17104 Change-Id: I7065b39273f82edece385c0d67b3f2d83d4934b8 Reviewed-on: https://go-review.googlesource.com/29163 Reviewed-by: David Crawshaw <crawshaw@golang.org>

Follow-up cleanup to https://golang.org/cl/23424/ . Change-Id: Ifb05c1ff5327df6bc5f4cbc554e18363293f7960 Reviewed-on: https://go-review.googlesource.com/23446 Reviewed-by: Marcel van Lohuizen <mpvl@golang.org>

shortens code and gives an example of the use of Run. Change-Id: I75ffaf762218a589274b4b62e19022e31e805d1b Reviewed-on: https://go-review.googlesource.com/23424 Reviewed-by: Russ Cox <rsc@golang.org> Run-TryBot: Marcel van Lohuizen <mpvl@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org>

This deletes unused code and helpers from tests. Change-Id: Ie31d46115f558ceb8da6efbf90c3c204e03b0d7e Reviewed-on: https://go-review.googlesource.com/20927 Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org> Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org>

The tree's pretty inconsistent about single space vs double space after a period in documentation. Make it consistently a single space, per earlier decisions. This means contributors won't be confused by misleading precedence. This CL doesn't use go/doc to parse. It only addresses // comments. It was generated with: $ perl -i -npe 's,^(\s*// .+[a-z]\.) +([A-Z]),$1 $2,' $(git grep -l -E '^\s*//(.+\.) +([A-Z])') $ go test go/doc -update Change-Id: Iccdb99c37c797ef1f804a94b22ba5ee4b500c4f7 Reviewed-on: https://go-review.googlesource.com/20022 Reviewed-by: Rob Pike <r@golang.org> Reviewed-by: Dave Day <djd@golang.org> Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org> TryBot-Result: Gobot Gobot <gobot@golang.org>

Fixes #10525. Change-Id: I92dc87f5d6db396d8dde2220fc37b7093b772d81 Reviewed-on: https://go-review.googlesource.com/9210 Reviewed-by: Robert Griesemer <gri@golang.org>

(platforms w/o corresponding assembly kernels) For short vector adds there's some erradic slow-down, but overall these routines have become significantly faster. This only matters for platforms w/o native (assembly) versions of these kernels, so we are not concerned about the minor slow-down for short vectors. This code was already reviewed under Mercurial (golang.org/cl/172810043) but wasn't submitted before the switch to git. Benchmarks run on 2.3GHz Intel Core i7, running OS X 10.9.5, with the respective AddVV and AddVW assembly routines disabled. benchmark old ns/op new ns/op delta BenchmarkAddVV_1 6.59 7.09 +7.59% BenchmarkAddVV_2 10.3 10.1 -1.94% BenchmarkAddVV_3 10.9 12.6 +15.60% BenchmarkAddVV_4 13.9 15.6 +12.23% BenchmarkAddVV_5 16.8 17.3 +2.98% BenchmarkAddVV_1e1 29.5 29.9 +1.36% BenchmarkAddVV_1e2 246 232 -5.69% BenchmarkAddVV_1e3 2374 2185 -7.96% BenchmarkAddVV_1e4 58942 22292 -62.18% BenchmarkAddVV_1e5 668622 225279 -66.31% BenchmarkAddVW_1 6.81 5.58 -18.06% BenchmarkAddVW_2 7.69 6.86 -10.79% BenchmarkAddVW_3 9.56 8.32 -12.97% BenchmarkAddVW_4 12.1 9.53 -21.24% BenchmarkAddVW_5 13.2 10.9 -17.42% BenchmarkAddVW_1e1 23.4 18.0 -23.08% BenchmarkAddVW_1e2 175 141 -19.43% BenchmarkAddVW_1e3 1568 1266 -19.26% BenchmarkAddVW_1e4 15425 12596 -18.34% BenchmarkAddVW_1e5 156737 133539 -14.80% BenchmarkFibo 381678466 132958666 -65.16% benchmark old MB/s new MB/s speedup BenchmarkAddVV_1 9715.25 9028.30 0.93x BenchmarkAddVV_2 12461.72 12622.60 1.01x BenchmarkAddVV_3 17549.64 15243.82 0.87x BenchmarkAddVV_4 18392.54 16398.29 0.89x BenchmarkAddVV_5 18995.23 18496.57 0.97x BenchmarkAddVV_1e1 21708.98 21438.28 0.99x BenchmarkAddVV_1e2 25956.53 27506.88 1.06x BenchmarkAddVV_1e3 26947.93 29286.66 1.09x BenchmarkAddVV_1e4 10857.96 28709.46 2.64x BenchmarkAddVV_1e5 9571.91 28409.21 2.97x BenchmarkAddVW_1 1175.28 1433.98 1.22x BenchmarkAddVW_2 2080.01 2332.54 1.12x BenchmarkAddVW_3 2509.28 2883.97 1.15x BenchmarkAddVW_4 2646.09 3356.83 1.27x BenchmarkAddVW_5 3020.69 3671.07 1.22x BenchmarkAddVW_1e1 3425.76 4441.40 1.30x BenchmarkAddVW_1e2 4553.17 5642.96 1.24x BenchmarkAddVW_1e3 5100.14 6318.72 1.24x BenchmarkAddVW_1e4 5186.15 6350.96 1.22x BenchmarkAddVW_1e5 5104.07 5990.74 1.17x Change-Id: I7a62023b1105248a0e85e5b9819d3fd4266123d4 Reviewed-on: https://go-review.googlesource.com/2480 Reviewed-by: Russ Cox <rsc@golang.org> Reviewed-by: Alan Donovan <adonovan@google.com>

Preparation was in CL 134570043. This CL contains only the effect of 'hg mv src/pkg/* src'. For more about the move, see golang.org/s/go14nopkg.