go - Fork of Go programming language with my patches.

Age	Commit message (Collapse)	Author
2024-01-11	runtime: ensure we free unrolled GC bitmaps	Keith Randall
	CL 555355 has a bug in it - the GC program flag was also used to decide when to free the unrolled bitmap. After that CL, we just don't free any unrolled bitmaps, leading to a memory leak. Use a separate flag to track types that need to be freed when their corresponding object is freed. Change-Id: I841b65492561f5b5e1853875fbd8e8a872205a84 Reviewed-on: https://go-review.googlesource.com/c/go/+/555416 Auto-Submit: Keith Randall <khr@google.com> Reviewed-by: Michael Knyszek <mknyszek@google.com> Reviewed-by: Keith Randall <khr@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2024-01-11	runtime: don't mark unrolled bitmap type as having a gc program	Keith Randall
	It doesn't have a GC program - the whole point is that it is the unrolled version of a GC program. Fortunately, this isn't a bug as (*mspan).typePointersOfUnchecked ignores the GCProg flag and just uses GCData as a bitmap unconditionally. Change-Id: I2508af85af4a1806946e54c893120c5cc0cc3da3 Reviewed-on: https://go-review.googlesource.com/c/go/+/555355 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Michael Knyszek <mknyszek@google.com> Reviewed-by: Keith Randall <khr@google.com>
2023-12-05	math/rand, math/rand/v2: use ChaCha8 for global rand	Russ Cox
	Move ChaCha8 code into internal/chacha8rand and use it to implement runtime.rand, which is used for the unseeded global source for both math/rand and math/rand/v2. This also affects the calculation of the start point for iteration over very very large maps (when the 32-bit fastrand is not big enough). The benefit is that misuse of the global random number generators in math/rand and math/rand/v2 in contexts where non-predictable randomness is important for security reasons is no longer a security problem, removing a common mistake among programmers who are unaware of the different kinds of randomness. The cost is an extra 304 bytes per thread stored in the m struct plus 2-3ns more per random uint64 due to the more sophisticated algorithm. Using PCG looks like it would cost about the same, although I haven't benchmarked that. Before this, the math/rand and math/rand/v2 global generator was wyrand (https://github.com/wangyi-fudan/wyhash). For math/rand, using wyrand instead of the Mitchell/Reeds/Thompson ALFG was justifiable, since the latter was not any better. But for math/rand/v2, the global generator really should be at least as good as one of the well-studied, specific algorithms provided directly by the package, and it's not. (Wyrand is still reasonable for scheduling and cache decisions.) Good randomness does have a cost: about twice wyrand. Also rationalize the various runtime rand references. goos: linux goarch: amd64 pkg: math/rand/v2 cpu: AMD Ryzen 9 7950X 16-Core Processor │ bbb48afeb7.amd64 │ 5cf807d1ea.amd64 │ │ sec/op │ sec/op vs base │ ChaCha8-32 1.862n ± 2% 1.861n ± 2% ~ (p=0.825 n=20) PCG_DXSM-32 1.471n ± 1% 1.460n ± 2% ~ (p=0.153 n=20) SourceUint64-32 1.636n ± 2% 1.582n ± 1% -3.30% (p=0.000 n=20) GlobalInt64-32 2.087n ± 1% 3.663n ± 1% +75.54% (p=0.000 n=20) GlobalInt64Parallel-32 0.1042n ± 1% 0.2026n ± 1% +94.48% (p=0.000 n=20) GlobalUint64-32 2.263n ± 2% 3.724n ± 1% +64.57% (p=0.000 n=20) GlobalUint64Parallel-32 0.1019n ± 1% 0.1973n ± 1% +93.67% (p=0.000 n=20) Int64-32 1.771n ± 1% 1.774n ± 1% ~ (p=0.449 n=20) Uint64-32 1.863n ± 2% 1.866n ± 1% ~ (p=0.364 n=20) GlobalIntN1000-32 3.134n ± 3% 4.730n ± 2% +50.95% (p=0.000 n=20) IntN1000-32 2.489n ± 1% 2.489n ± 1% ~ (p=0.683 n=20) Int64N1000-32 2.521n ± 1% 2.516n ± 1% ~ (p=0.394 n=20) Int64N1e8-32 2.479n ± 1% 2.478n ± 2% ~ (p=0.743 n=20) Int64N1e9-32 2.530n ± 2% 2.514n ± 2% ~ (p=0.193 n=20) Int64N2e9-32 2.501n ± 1% 2.494n ± 1% ~ (p=0.616 n=20) Int64N1e18-32 3.227n ± 1% 3.205n ± 1% ~ (p=0.101 n=20) Int64N2e18-32 3.647n ± 1% 3.599n ± 1% ~ (p=0.019 n=20) Int64N4e18-32 5.135n ± 1% 5.069n ± 2% ~ (p=0.034 n=20) Int32N1000-32 2.657n ± 1% 2.637n ± 1% ~ (p=0.180 n=20) Int32N1e8-32 2.636n ± 1% 2.636n ± 1% ~ (p=0.763 n=20) Int32N1e9-32 2.660n ± 2% 2.638n ± 1% ~ (p=0.358 n=20) Int32N2e9-32 2.662n ± 2% 2.618n ± 2% ~ (p=0.064 n=20) Float32-32 2.272n ± 2% 2.239n ± 2% ~ (p=0.194 n=20) Float64-32 2.272n ± 1% 2.286n ± 2% ~ (p=0.763 n=20) ExpFloat64-32 3.762n ± 1% 3.744n ± 1% ~ (p=0.171 n=20) NormFloat64-32 3.706n ± 1% 3.655n ± 2% ~ (p=0.066 n=20) Perm3-32 32.93n ± 3% 34.62n ± 1% +5.13% (p=0.000 n=20) Perm30-32 202.9n ± 1% 204.0n ± 1% ~ (p=0.482 n=20) Perm30ViaShuffle-32 115.0n ± 1% 114.9n ± 1% ~ (p=0.358 n=20) ShuffleOverhead-32 112.8n ± 1% 112.7n ± 1% ~ (p=0.692 n=20) Concurrent-32 2.107n ± 0% 3.725n ± 1% +76.75% (p=0.000 n=20) goos: darwin goarch: arm64 pkg: math/rand/v2 │ bbb48afeb7.arm64 │ 5cf807d1ea.arm64 │ │ sec/op │ sec/op vs base │ ChaCha8-8 2.480n ± 0% 2.429n ± 0% -2.04% (p=0.000 n=20) PCG_DXSM-8 2.531n ± 0% 2.530n ± 0% ~ (p=0.877 n=20) SourceUint64-8 2.534n ± 0% 2.533n ± 0% ~ (p=0.732 n=20) GlobalInt64-8 2.172n ± 1% 4.794n ± 0% +120.67% (p=0.000 n=20) GlobalInt64Parallel-8 0.4320n ± 0% 0.9605n ± 0% +122.32% (p=0.000 n=20) GlobalUint64-8 2.182n ± 0% 4.770n ± 0% +118.58% (p=0.000 n=20) GlobalUint64Parallel-8 0.4307n ± 0% 0.9583n ± 0% +122.51% (p=0.000 n=20) Int64-8 4.107n ± 0% 4.104n ± 0% ~ (p=0.416 n=20) Uint64-8 4.080n ± 0% 4.080n ± 0% ~ (p=0.052 n=20) GlobalIntN1000-8 2.814n ± 2% 5.643n ± 0% +100.50% (p=0.000 n=20) IntN1000-8 4.141n ± 0% 4.139n ± 0% ~ (p=0.140 n=20) Int64N1000-8 4.140n ± 0% 4.140n ± 0% ~ (p=0.313 n=20) Int64N1e8-8 4.140n ± 0% 4.139n ± 0% ~ (p=0.103 n=20) Int64N1e9-8 4.139n ± 0% 4.140n ± 0% ~ (p=0.761 n=20) Int64N2e9-8 4.140n ± 0% 4.140n ± 0% ~ (p=0.636 n=20) Int64N1e18-8 5.266n ± 0% 5.326n ± 1% +1.14% (p=0.001 n=20) Int64N2e18-8 6.052n ± 0% 6.167n ± 0% +1.90% (p=0.000 n=20) Int64N4e18-8 8.826n ± 0% 9.051n ± 0% +2.55% (p=0.000 n=20) Int32N1000-8 4.127n ± 0% 4.132n ± 0% +0.12% (p=0.000 n=20) Int32N1e8-8 4.126n ± 0% 4.131n ± 0% +0.12% (p=0.000 n=20) Int32N1e9-8 4.127n ± 0% 4.132n ± 0% +0.12% (p=0.000 n=20) Int32N2e9-8 4.132n ± 0% 4.131n ± 0% ~ (p=0.017 n=20) Float32-8 4.109n ± 0% 4.105n ± 0% ~ (p=0.379 n=20) Float64-8 4.107n ± 0% 4.106n ± 0% ~ (p=0.867 n=20) ExpFloat64-8 5.339n ± 0% 5.383n ± 0% +0.82% (p=0.000 n=20) NormFloat64-8 5.735n ± 0% 5.737n ± 1% ~ (p=0.856 n=20) Perm3-8 26.65n ± 0% 26.80n ± 1% +0.58% (p=0.000 n=20) Perm30-8 194.8n ± 1% 197.0n ± 0% +1.18% (p=0.000 n=20) Perm30ViaShuffle-8 156.6n ± 0% 157.6n ± 1% +0.61% (p=0.000 n=20) ShuffleOverhead-8 124.9n ± 0% 125.5n ± 0% +0.52% (p=0.000 n=20) Concurrent-8 2.434n ± 3% 5.066n ± 0% +108.09% (p=0.000 n=20) goos: linux goarch: 386 pkg: math/rand/v2 cpu: AMD Ryzen 9 7950X 16-Core Processor │ bbb48afeb7.386 │ 5cf807d1ea.386 │ │ sec/op │ sec/op vs base │ ChaCha8-32 11.295n ± 1% 4.748n ± 2% -57.96% (p=0.000 n=20) PCG_DXSM-32 7.693n ± 1% 7.738n ± 2% ~ (p=0.542 n=20) SourceUint64-32 7.658n ± 2% 7.622n ± 2% ~ (p=0.344 n=20) GlobalInt64-32 3.473n ± 2% 7.526n ± 2% +116.73% (p=0.000 n=20) GlobalInt64Parallel-32 0.3198n ± 0% 0.5444n ± 0% +70.22% (p=0.000 n=20) GlobalUint64-32 3.612n ± 0% 7.575n ± 1% +109.69% (p=0.000 n=20) GlobalUint64Parallel-32 0.3168n ± 0% 0.5403n ± 0% +70.51% (p=0.000 n=20) Int64-32 7.673n ± 2% 7.789n ± 1% ~ (p=0.122 n=20) Uint64-32 7.773n ± 1% 7.827n ± 2% ~ (p=0.920 n=20) GlobalIntN1000-32 6.268n ± 1% 9.581n ± 1% +52.87% (p=0.000 n=20) IntN1000-32 10.33n ± 2% 10.45n ± 1% ~ (p=0.233 n=20) Int64N1000-32 10.98n ± 2% 11.01n ± 1% ~ (p=0.401 n=20) Int64N1e8-32 11.19n ± 2% 10.97n ± 1% ~ (p=0.033 n=20) Int64N1e9-32 11.06n ± 1% 11.08n ± 1% ~ (p=0.498 n=20) Int64N2e9-32 11.10n ± 1% 11.01n ± 2% ~ (p=0.995 n=20) Int64N1e18-32 15.23n ± 2% 15.04n ± 1% ~ (p=0.973 n=20) Int64N2e18-32 15.89n ± 1% 15.85n ± 1% ~ (p=0.409 n=20) Int64N4e18-32 18.96n ± 2% 19.34n ± 2% ~ (p=0.048 n=20) Int32N1000-32 10.46n ± 2% 10.44n ± 2% ~ (p=0.480 n=20) Int32N1e8-32 10.46n ± 2% 10.49n ± 2% ~ (p=0.951 n=20) Int32N1e9-32 10.28n ± 2% 10.26n ± 1% ~ (p=0.431 n=20) Int32N2e9-32 10.50n ± 2% 10.44n ± 2% ~ (p=0.249 n=20) Float32-32 13.80n ± 2% 13.80n ± 2% ~ (p=0.751 n=20) Float64-32 23.55n ± 2% 23.87n ± 0% ~ (p=0.408 n=20) ExpFloat64-32 15.36n ± 1% 15.29n ± 2% ~ (p=0.316 n=20) NormFloat64-32 13.57n ± 1% 13.79n ± 1% +1.66% (p=0.005 n=20) Perm3-32 45.70n ± 2% 46.99n ± 2% +2.81% (p=0.001 n=20) Perm30-32 399.0n ± 1% 403.8n ± 1% +1.19% (p=0.006 n=20) Perm30ViaShuffle-32 349.0n ± 1% 350.4n ± 1% ~ (p=0.909 n=20) ShuffleOverhead-32 322.3n ± 1% 323.8n ± 1% ~ (p=0.410 n=20) Concurrent-32 3.331n ± 1% 7.312n ± 1% +119.50% (p=0.000 n=20) For #61716. Change-Id: Ibdddeed85c34d9ae397289dc899e04d4845f9ed2 Reviewed-on: https://go-review.googlesource.com/c/go/+/516860 Reviewed-by: Michael Pratt <mpratt@google.com> Reviewed-by: Filippo Valsorda <filippo@golang.org> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2023-11-17	runtime: put allocation headers back at the start the object	Michael Anthony Knyszek
	A persistent performance regression was discovered on perf.golang.org/dashboard and this was narrowed down to the switch to footers. Using allocation headers instead resolves the issue. The benchmark results for allocation footers weren't realistic, because they were performed on a machine with enough L3 cache that it completely hid the additional cache miss introduced by allocation footers. This means that in some corner cases the Go runtime may no longer allocate 16-byte aligned memory. Note however that this property was mostly incidental and never guaranteed in any documentation. Allocation headers were tested widely within Google and no issues were found, so we're fairly confident that this will not affect very many users. Nonetheless, by Hyrum's Law some code might depend on it. A follow-up change will add a GODEBUG flag that ensures 16 byte alignment at the potential cost of some additional memory use. Users experiencing both a performance regression and an alignment issue can also disable the GOEXPERIMENT at build time. This reverts commit 1e250a219900651dad27f29eab0877eee4afd5b9. Change-Id: Ia7d62a9c60d1773c8b6d33322ee33a80ef814943 Reviewed-on: https://go-review.googlesource.com/c/go/+/543255 Auto-Submit: Michael Knyszek <mknyszek@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Cherry Mui <cherryyz@google.com>
2023-11-16	runtime: optimize bulkBarrierPreWrite with allocheaders	Michael Anthony Knyszek
	Currently bulkBarrierPreWrite follows a fairly slow path wherein it calls typePointersOf, which ends up calling into fastForward. This does some fairly heavy computation to move the iterator forward without any assumptions about where it lands at all. It needs to be completely general to support splitting at arbitrary boundaries, for example for scanning oblets. This means that copying objects during the GC mark phase is fairly expensive, and is a regression from before allocheaders. However, in almost all cases bulkBarrierPreWrite and bulkBarrierPreWriteSrcOnly have perfect type information. We can do a lot better in these cases because we're starting on a type-size boundary, which is exactly what the iterator is built around. This change adds the typePointersOfType method which produces a typePointers iterator from a pointer and a type. This change significantly improves the performance of these bulk write barriers, eliminating some performance regressions that were noticed on the perf dashboard. There are still just a couple cases where we have to use the more general typePointersOf calls, but they're fairly rare; most bulk barriers have perfect type information. This change is tested by the GCInfo tests in the runtime and the GCBits tests in the reflect package via an additional check in getgcmask. Results for tile38 before and after allocheaders. There was previous a regression in the p90, now it's gone. Also, the overall win has been boosted slightly. tile38 $ benchstat noallocheaders.results allocheaders.results name old time/op new time/op delta Tile38QueryLoad 481µs ± 1% 468µs ± 1% -2.71% (p=0.000 n=10+10) name old average-RSS-bytes new average-RSS-bytes delta Tile38QueryLoad 6.32GB ± 1% 6.23GB ± 0% -1.38% (p=0.000 n=9+8) name old peak-RSS-bytes new peak-RSS-bytes delta Tile38QueryLoad 6.49GB ± 1% 6.40GB ± 1% -1.38% (p=0.002 n=10+10) name old peak-VM-bytes new peak-VM-bytes delta Tile38QueryLoad 7.72GB ± 1% 7.64GB ± 1% -1.07% (p=0.007 n=10+10) name old p50-latency-ns new p50-latency-ns delta Tile38QueryLoad 212k ± 1% 205k ± 0% -3.02% (p=0.000 n=10+9) name old p90-latency-ns new p90-latency-ns delta Tile38QueryLoad 622k ± 1% 616k ± 1% -1.03% (p=0.005 n=10+10) name old p99-latency-ns new p99-latency-ns delta Tile38QueryLoad 4.55M ± 2% 4.39M ± 2% -3.51% (p=0.000 n=10+10) name old ops/s new ops/s delta Tile38QueryLoad 12.5k ± 1% 12.8k ± 1% +2.78% (p=0.000 n=10+10) Change-Id: I0a48f848eae8777d0fd6769c3a1fe449f8d9d0a6 Reviewed-on: https://go-review.googlesource.com/c/go/+/542219 Reviewed-by: Cherry Mui <cherryyz@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2023-11-16	runtime: fix liveness issue in test-only getgcmask	Michael Anthony Knyszek
	getgcmask stops referencing the object passed to it sometime between when the object is looked up and when the function returns. Notably, this can happen while the GC mask is actively being produced, and thus the GC might free the object. This is easily reproducible by adding a runtime.GC call at just the right place. Adding a KeepAlive on the heap-object path fixes it. Fixes #64188. Change-Id: I5ed4cae862fc780338b60d969fd7fbe896352ce4 Reviewed-on: https://go-review.googlesource.com/c/go/+/542716 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Cherry Mui <cherryyz@google.com>
2023-11-10	runtime: fix user arena heap bits writing on big endian platforms	Michael Anthony Knyszek
	Currently the user arena code writes heap bits to the (mspan).heapBits space with the platform-specific byte ordering (the heap bits are written and managed as uintptrs). However, the compiler always emits GC metadata for types in little endian. Because the scanning part of the code that loads through the type pointer in the allocation header expects little endian ordering, we end up with the wrong byte ordering in GC when trying to scan arena memory. Fix this by writing out the user arena heap bits in little endian on big endian platforms. This means that the space returned by (mspan).heapBits has a different meaning for user arenas and small object spans, which is a little odd, so I documented it. To reduce the chance of misuse of the writeHeapBits API, which now writes out heap bits in a different ordering than writeSmallHeapBits on big endian platforms, this change also renames writeHeapBits to writeUserArenaHeapBits. Much of this can be avoided in the future if the compiler were to write out the pointer/scalar bits as an array of uintptr values instead of plain bytes. That's too big of a change for right now though. This change is a no-op on little endian platforms. I confirmed it by checking for any assembly code differences in the runtime test binary. There were none. With this change, the arena tests pass on ppc64. Fixes #64048. Change-Id: If077d003872fcccf5a154ff5d8441a58582061bb Reviewed-on: https://go-review.googlesource.com/c/go/+/541315 Run-TryBot: Michael Knyszek <mknyszek@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Cherry Mui <cherryyz@google.com> TryBot-Result: Gopher Robot <gobot@golang.org>
2023-11-09	runtime: make alloc headers footers instead	Michael Anthony Knyszek
	The previous CL in this series (CL 437955) adds the allocation headers experiment. However, this experiment puts the headers at the beginning of each allocation, which decreases the default allocator alignment that users can rely upon. Historically, Go's memory allocator has implicitly provided 16-byte alignment (except for sizes where it doesn't make sense, like 8 or 24 bytes), so it's not unlikely that users are depending on it. It also complicates other changes that want higher alignment. For example, the sync/atomic.Uint64Pair proposal would (hypothetically; it's not yet accepted) introduce a type with 16-byte alignment. The malloc fast path will require extra code to consider alignment and will waste memory for any value containing such a type. This change moves the allocation header to the end of the span's allocation slot instead of the beginning. This means worse locality for the GC when scanning, but it's still an overall win. It also means that objects will still have the 16-byte alignment we've provided thus far. This is broken out in a separate change just becauase it ended up that way during development. But I've chosen to leave it this way in case we want to try and move allocation headers to the front of objects again. Below are the benchmark results of this CL series, comparing the performance of this CL with GOEXPERIMENT=allocheaders vs. without this CL series. name old time/op new time/op delta BiogoIgor 12.5s ± 0% 12.4s ± 2% ~ (p=0.079 n=9+10) BiogoKrishna 12.8s ±10% 12.4s ±10% ~ (p=0.182 n=9+10) BleveIndexBatch100 4.54s ± 3% 4.60s ± 3% ~ (p=0.050 n=9+9) EtcdPut 21.1ms ± 2% 21.3ms ± 4% ~ (p=0.669 n=7+10) EtcdSTM 107ms ± 3% 108ms ± 2% ~ (p=0.497 n=9+10) GoBuildKubelet 34.1s ± 3% 33.1s ± 2% -3.08% (p=0.000 n=10+10) GoBuildKubeletLink 7.94s ± 2% 7.95s ± 2% ~ (p=0.631 n=10+10) GoBuildIstioctl 33.2s ± 1% 31.7s ± 0% -4.37% (p=0.000 n=9+9) GoBuildIstioctlLink 8.07s ± 1% 8.05s ± 1% ~ (p=0.356 n=9+10) GoBuildFrontend 12.1s ± 0% 11.5s ± 1% -4.43% (p=0.000 n=8+10) GoBuildFrontendLink 1.20s ± 2% 1.20s ± 2% ~ (p=0.905 n=9+10) GopherLuaKNucleotide 19.9s ± 0% 19.5s ± 1% -1.95% (p=0.000 n=9+10) MarkdownRenderXHTML 194ms ± 5% 194ms ± 2% ~ (p=0.931 n=9+9) Tile38QueryLoad 518µs ± 1% 508µs ± 1% -1.93% (p=0.000 n=9+8) name old average-RSS-bytes new average-RSS-bytes delta BiogoIgor 66.2MB ± 3% 65.6MB ± 1% ~ (p=0.156 n=10+9) BiogoKrishna 4.34GB ± 2% 4.34GB ± 1% ~ (p=0.315 n=10+9) BleveIndexBatch100 189MB ± 3% 186MB ± 3% ~ (p=0.052 n=10+10) EtcdPut 105MB ± 5% 107MB ± 6% ~ (p=0.579 n=10+10) EtcdSTM 92.1MB ± 5% 93.2MB ± 4% ~ (p=0.353 n=10+10) GoBuildKubelet 2.07GB ± 1% 2.07GB ± 1% ~ (p=0.436 n=10+10) GoBuildIstioctl 1.44GB ± 1% 1.46GB ± 1% +0.96% (p=0.001 n=10+10) GoBuildFrontend 522MB ± 1% 512MB ± 2% -1.98% (p=0.000 n=10+10) GopherLuaKNucleotide 37.4MB ± 5% 36.4MB ± 4% -2.53% (p=0.035 n=10+10) MarkdownRenderXHTML 21.2MB ± 1% 20.9MB ± 3% -1.53% (p=0.003 n=8+10) Tile38QueryLoad 6.39GB ± 2% 6.24GB ± 2% -2.40% (p=0.000 n=10+10) name old peak-RSS-bytes new peak-RSS-bytes delta BiogoIgor 88.5MB ± 4% 88.4MB ± 3% ~ (p=0.971 n=10+10) BiogoKrishna 4.48GB ± 0% 4.42GB ± 0% -1.49% (p=0.000 n=10+10) BleveIndexBatch100 268MB ± 3% 265MB ± 4% ~ (p=0.315 n=9+10) EtcdPut 147MB ± 9% 146MB ± 5% ~ (p=0.853 n=10+10) EtcdSTM 119MB ± 6% 120MB ± 5% ~ (p=0.796 n=10+10) GopherLuaKNucleotide 43.1MB ±17% 40.7MB ±12% ~ (p=0.075 n=10+10) MarkdownRenderXHTML 21.2MB ± 1% 21.1MB ± 3% ~ (p=0.511 n=9+10) Tile38QueryLoad 6.65GB ± 4% 6.52GB ± 2% -1.93% (p=0.009 n=10+10) name old peak-VM-bytes new peak-VM-bytes delta BiogoIgor 1.33GB ± 0% 1.33GB ± 0% -0.16% (p=0.000 n=10+10) BiogoKrishna 5.77GB ± 0% 5.69GB ± 0% -1.23% (p=0.000 n=10+10) BleveIndexBatch100 2.62GB ± 0% 2.61GB ± 0% -0.13% (p=0.000 n=7+10) EtcdPut 12.1GB ± 0% 12.1GB ± 0% ~ (p=0.160 n=8+10) EtcdSTM 12.1GB ± 0% 12.1GB ± 0% -0.02% (p=0.000 n=10+10) GopherLuaKNucleotide 1.26GB ± 0% 1.26GB ± 0% -0.09% (p=0.000 n=10+10) MarkdownRenderXHTML 1.26GB ± 0% 1.26GB ± 0% -0.08% (p=0.000 n=10+10) Tile38QueryLoad 7.89GB ± 4% 7.76GB ± 1% -1.70% (p=0.008 n=10+8) name old p50-latency-ns new p50-latency-ns delta EtcdPut 20.1M ± 5% 20.2M ± 4% ~ (p=0.529 n=10+10) EtcdSTM 79.8M ± 4% 79.9M ± 4% ~ (p=0.971 n=10+10) Tile38QueryLoad 215k ± 1% 210k ± 3% -2.04% (p=0.021 n=8+10) name old p90-latency-ns new p90-latency-ns delta EtcdPut 31.9M ± 6% 32.0M ± 7% ~ (p=0.780 n=9+10) EtcdSTM 220M ± 6% 220M ± 2% ~ (p=1.000 n=10+10) Tile38QueryLoad 622k ± 2% 646k ± 2% +3.83% (p=0.000 n=10+10) name old p99-latency-ns new p99-latency-ns delta EtcdPut 47.6M ±32% 51.4M ±28% ~ (p=0.529 n=10+10) EtcdSTM 452M ± 2% 457M ± 2% ~ (p=0.182 n=9+10) Tile38QueryLoad 5.04M ± 2% 4.91M ± 3% -2.56% (p=0.001 n=9+9) name old ops/s new ops/s delta EtcdPut 46.1k ± 2% 45.7k ± 4% ~ (p=0.475 n=7+10) EtcdSTM 9.18k ± 5% 9.20k ± 3% ~ (p=0.971 n=10+10) Tile38QueryLoad 17.4k ± 1% 17.7k ± 1% +1.97% (p=0.000 n=9+8) Change-Id: I637f48fb9e8c181912db785ae9186d7f16769870 Reviewed-on: https://go-review.googlesource.com/c/go/+/537886 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Keith Randall <khr@golang.org> Reviewed-by: Cherry Mui <cherryyz@google.com>
2023-11-09	runtime: implement experiment to replace heap bitmap with alloc headers	Michael Anthony Knyszek
	This change replaces the 1-bit-per-word heap bitmap for most size classes with allocation headers for objects that contain pointers. The header consists of a single pointer to a type. All allocations with headers are treated as implicitly containing one or more instances of the type in the header. As the name implies, headers are usually stored as the first word of an object. There are two additional exceptions to where headers are stored and how they're used. Objects smaller than 512 bytes do not have headers. Instead, a heap bitmap is reserved at the end of spans for objects of this size. A full word of overhead is too much for these small objects. The bitmap is of the same format of the old bitmap, minus the noMorePtrs bits which are unnecessary. All the objects <512 bytes have a bitmap less than a pointer-word in size, and that was the granularity at which noMorePtrs could stop scanning early anyway. Objects that are larger than 32 KiB (which have their own span) have their headers stored directly in the span, to allow power-of-two-sized allocations to not spill over into an extra page. The full implementation is behind GOEXPERIMENT=allocheaders. The purpose of this change is performance. First and foremost, with headers we no longer have to unroll pointer/scalar data at allocation time for most size classes. Small size classes still need some unrolling, but their bitmaps are small so we can optimize that case fairly well. Larger objects effectively have their pointer/scalar data unrolled on-demand from type data, which is much more compactly represented and results in less TLB pressure. Furthermore, since the headers are usually right next to the object and where we're about to start scanning, we get an additional temporal locality benefit in the data cache when looking up type metadata. The pointer/scalar data is now effectively unrolled on-demand, but it's also simpler to unroll than before; that unrolled data is never written anywhere, and for arrays we get the benefit of retreading the same data per element, as opposed to looking it up from scratch for each pointer-word of bitmap. Lastly, because we no longer have a heap bitmap that spans the entire heap, there's a flat 1.5% memory use reduction. This is balanced slightly by some objects possibly being bumped up a size class, but most objects are not tightly optimized to size class sizes so there's some memory to spare, making the header basically free in those cases. See the follow-up CL which turns on this experiment by default for benchmark results. (CL 538217.) Change-Id: I4c9034ee200650d06d8bdecd579d5f7c1bbf1fc5 Reviewed-on: https://go-review.googlesource.com/c/go/+/437955 Reviewed-by: Cherry Mui <cherryyz@google.com> Reviewed-by: Keith Randall <khr@golang.org> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
2023-11-09	runtime: add the allocation headers GOEXPERIMENT and fork files	Michael Anthony Knyszek
	This change adds the allocation headers GOEXPERIMENT which is a no-op. It forks two runtime files temporarily to make the GOEXPERIMENT easier to maintain. The forked files are mbitmap.go and msize.go. Change-Id: I60202c00e614e4517de7dd000029cf80dd0121ef Reviewed-on: https://go-review.googlesource.com/c/go/+/537980 Reviewed-by: Cherry Mui <cherryyz@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Keith Randall <khr@golang.org>

CL 555355 has a bug in it - the GC program flag was also used to decide when to free the unrolled bitmap. After that CL, we just don't free any unrolled bitmaps, leading to a memory leak. Use a separate flag to track types that need to be freed when their corresponding object is freed. Change-Id: I841b65492561f5b5e1853875fbd8e8a872205a84 Reviewed-on: https://go-review.googlesource.com/c/go/+/555416 Auto-Submit: Keith Randall <khr@google.com> Reviewed-by: Michael Knyszek <mknyszek@google.com> Reviewed-by: Keith Randall <khr@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>

It doesn't have a GC program - the whole point is that it is the unrolled version of a GC program. Fortunately, this isn't a bug as (*mspan).typePointersOfUnchecked ignores the GCProg flag and just uses GCData as a bitmap unconditionally. Change-Id: I2508af85af4a1806946e54c893120c5cc0cc3da3 Reviewed-on: https://go-review.googlesource.com/c/go/+/555355 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Michael Knyszek <mknyszek@google.com> Reviewed-by: Keith Randall <khr@google.com>

Move ChaCha8 code into internal/chacha8rand and use it to implement runtime.rand, which is used for the unseeded global source for both math/rand and math/rand/v2. This also affects the calculation of the start point for iteration over very very large maps (when the 32-bit fastrand is not big enough). The benefit is that misuse of the global random number generators in math/rand and math/rand/v2 in contexts where non-predictable randomness is important for security reasons is no longer a security problem, removing a common mistake among programmers who are unaware of the different kinds of randomness. The cost is an extra 304 bytes per thread stored in the m struct plus 2-3ns more per random uint64 due to the more sophisticated algorithm. Using PCG looks like it would cost about the same, although I haven't benchmarked that. Before this, the math/rand and math/rand/v2 global generator was wyrand (https://github.com/wangyi-fudan/wyhash). For math/rand, using wyrand instead of the Mitchell/Reeds/Thompson ALFG was justifiable, since the latter was not any better. But for math/rand/v2, the global generator really should be at least as good as one of the well-studied, specific algorithms provided directly by the package, and it's not. (Wyrand is still reasonable for scheduling and cache decisions.) Good randomness does have a cost: about twice wyrand. Also rationalize the various runtime rand references. goos: linux goarch: amd64 pkg: math/rand/v2 cpu: AMD Ryzen 9 7950X 16-Core Processor │ bbb48afeb7.amd64 │ 5cf807d1ea.amd64 │ │ sec/op │ sec/op vs base │ ChaCha8-32 1.862n ± 2% 1.861n ± 2% ~ (p=0.825 n=20) PCG_DXSM-32 1.471n ± 1% 1.460n ± 2% ~ (p=0.153 n=20) SourceUint64-32 1.636n ± 2% 1.582n ± 1% -3.30% (p=0.000 n=20) GlobalInt64-32 2.087n ± 1% 3.663n ± 1% +75.54% (p=0.000 n=20) GlobalInt64Parallel-32 0.1042n ± 1% 0.2026n ± 1% +94.48% (p=0.000 n=20) GlobalUint64-32 2.263n ± 2% 3.724n ± 1% +64.57% (p=0.000 n=20) GlobalUint64Parallel-32 0.1019n ± 1% 0.1973n ± 1% +93.67% (p=0.000 n=20) Int64-32 1.771n ± 1% 1.774n ± 1% ~ (p=0.449 n=20) Uint64-32 1.863n ± 2% 1.866n ± 1% ~ (p=0.364 n=20) GlobalIntN1000-32 3.134n ± 3% 4.730n ± 2% +50.95% (p=0.000 n=20) IntN1000-32 2.489n ± 1% 2.489n ± 1% ~ (p=0.683 n=20) Int64N1000-32 2.521n ± 1% 2.516n ± 1% ~ (p=0.394 n=20) Int64N1e8-32 2.479n ± 1% 2.478n ± 2% ~ (p=0.743 n=20) Int64N1e9-32 2.530n ± 2% 2.514n ± 2% ~ (p=0.193 n=20) Int64N2e9-32 2.501n ± 1% 2.494n ± 1% ~ (p=0.616 n=20) Int64N1e18-32 3.227n ± 1% 3.205n ± 1% ~ (p=0.101 n=20) Int64N2e18-32 3.647n ± 1% 3.599n ± 1% ~ (p=0.019 n=20) Int64N4e18-32 5.135n ± 1% 5.069n ± 2% ~ (p=0.034 n=20) Int32N1000-32 2.657n ± 1% 2.637n ± 1% ~ (p=0.180 n=20) Int32N1e8-32 2.636n ± 1% 2.636n ± 1% ~ (p=0.763 n=20) Int32N1e9-32 2.660n ± 2% 2.638n ± 1% ~ (p=0.358 n=20) Int32N2e9-32 2.662n ± 2% 2.618n ± 2% ~ (p=0.064 n=20) Float32-32 2.272n ± 2% 2.239n ± 2% ~ (p=0.194 n=20) Float64-32 2.272n ± 1% 2.286n ± 2% ~ (p=0.763 n=20) ExpFloat64-32 3.762n ± 1% 3.744n ± 1% ~ (p=0.171 n=20) NormFloat64-32 3.706n ± 1% 3.655n ± 2% ~ (p=0.066 n=20) Perm3-32 32.93n ± 3% 34.62n ± 1% +5.13% (p=0.000 n=20) Perm30-32 202.9n ± 1% 204.0n ± 1% ~ (p=0.482 n=20) Perm30ViaShuffle-32 115.0n ± 1% 114.9n ± 1% ~ (p=0.358 n=20) ShuffleOverhead-32 112.8n ± 1% 112.7n ± 1% ~ (p=0.692 n=20) Concurrent-32 2.107n ± 0% 3.725n ± 1% +76.75% (p=0.000 n=20) goos: darwin goarch: arm64 pkg: math/rand/v2 │ bbb48afeb7.arm64 │ 5cf807d1ea.arm64 │ │ sec/op │ sec/op vs base │ ChaCha8-8 2.480n ± 0% 2.429n ± 0% -2.04% (p=0.000 n=20) PCG_DXSM-8 2.531n ± 0% 2.530n ± 0% ~ (p=0.877 n=20) SourceUint64-8 2.534n ± 0% 2.533n ± 0% ~ (p=0.732 n=20) GlobalInt64-8 2.172n ± 1% 4.794n ± 0% +120.67% (p=0.000 n=20) GlobalInt64Parallel-8 0.4320n ± 0% 0.9605n ± 0% +122.32% (p=0.000 n=20) GlobalUint64-8 2.182n ± 0% 4.770n ± 0% +118.58% (p=0.000 n=20) GlobalUint64Parallel-8 0.4307n ± 0% 0.9583n ± 0% +122.51% (p=0.000 n=20) Int64-8 4.107n ± 0% 4.104n ± 0% ~ (p=0.416 n=20) Uint64-8 4.080n ± 0% 4.080n ± 0% ~ (p=0.052 n=20) GlobalIntN1000-8 2.814n ± 2% 5.643n ± 0% +100.50% (p=0.000 n=20) IntN1000-8 4.141n ± 0% 4.139n ± 0% ~ (p=0.140 n=20) Int64N1000-8 4.140n ± 0% 4.140n ± 0% ~ (p=0.313 n=20) Int64N1e8-8 4.140n ± 0% 4.139n ± 0% ~ (p=0.103 n=20) Int64N1e9-8 4.139n ± 0% 4.140n ± 0% ~ (p=0.761 n=20) Int64N2e9-8 4.140n ± 0% 4.140n ± 0% ~ (p=0.636 n=20) Int64N1e18-8 5.266n ± 0% 5.326n ± 1% +1.14% (p=0.001 n=20) Int64N2e18-8 6.052n ± 0% 6.167n ± 0% +1.90% (p=0.000 n=20) Int64N4e18-8 8.826n ± 0% 9.051n ± 0% +2.55% (p=0.000 n=20) Int32N1000-8 4.127n ± 0% 4.132n ± 0% +0.12% (p=0.000 n=20) Int32N1e8-8 4.126n ± 0% 4.131n ± 0% +0.12% (p=0.000 n=20) Int32N1e9-8 4.127n ± 0% 4.132n ± 0% +0.12% (p=0.000 n=20) Int32N2e9-8 4.132n ± 0% 4.131n ± 0% ~ (p=0.017 n=20) Float32-8 4.109n ± 0% 4.105n ± 0% ~ (p=0.379 n=20) Float64-8 4.107n ± 0% 4.106n ± 0% ~ (p=0.867 n=20) ExpFloat64-8 5.339n ± 0% 5.383n ± 0% +0.82% (p=0.000 n=20) NormFloat64-8 5.735n ± 0% 5.737n ± 1% ~ (p=0.856 n=20) Perm3-8 26.65n ± 0% 26.80n ± 1% +0.58% (p=0.000 n=20) Perm30-8 194.8n ± 1% 197.0n ± 0% +1.18% (p=0.000 n=20) Perm30ViaShuffle-8 156.6n ± 0% 157.6n ± 1% +0.61% (p=0.000 n=20) ShuffleOverhead-8 124.9n ± 0% 125.5n ± 0% +0.52% (p=0.000 n=20) Concurrent-8 2.434n ± 3% 5.066n ± 0% +108.09% (p=0.000 n=20) goos: linux goarch: 386 pkg: math/rand/v2 cpu: AMD Ryzen 9 7950X 16-Core Processor │ bbb48afeb7.386 │ 5cf807d1ea.386 │ │ sec/op │ sec/op vs base │ ChaCha8-32 11.295n ± 1% 4.748n ± 2% -57.96% (p=0.000 n=20) PCG_DXSM-32 7.693n ± 1% 7.738n ± 2% ~ (p=0.542 n=20) SourceUint64-32 7.658n ± 2% 7.622n ± 2% ~ (p=0.344 n=20) GlobalInt64-32 3.473n ± 2% 7.526n ± 2% +116.73% (p=0.000 n=20) GlobalInt64Parallel-32 0.3198n ± 0% 0.5444n ± 0% +70.22% (p=0.000 n=20) GlobalUint64-32 3.612n ± 0% 7.575n ± 1% +109.69% (p=0.000 n=20) GlobalUint64Parallel-32 0.3168n ± 0% 0.5403n ± 0% +70.51% (p=0.000 n=20) Int64-32 7.673n ± 2% 7.789n ± 1% ~ (p=0.122 n=20) Uint64-32 7.773n ± 1% 7.827n ± 2% ~ (p=0.920 n=20) GlobalIntN1000-32 6.268n ± 1% 9.581n ± 1% +52.87% (p=0.000 n=20) IntN1000-32 10.33n ± 2% 10.45n ± 1% ~ (p=0.233 n=20) Int64N1000-32 10.98n ± 2% 11.01n ± 1% ~ (p=0.401 n=20) Int64N1e8-32 11.19n ± 2% 10.97n ± 1% ~ (p=0.033 n=20) Int64N1e9-32 11.06n ± 1% 11.08n ± 1% ~ (p=0.498 n=20) Int64N2e9-32 11.10n ± 1% 11.01n ± 2% ~ (p=0.995 n=20) Int64N1e18-32 15.23n ± 2% 15.04n ± 1% ~ (p=0.973 n=20) Int64N2e18-32 15.89n ± 1% 15.85n ± 1% ~ (p=0.409 n=20) Int64N4e18-32 18.96n ± 2% 19.34n ± 2% ~ (p=0.048 n=20) Int32N1000-32 10.46n ± 2% 10.44n ± 2% ~ (p=0.480 n=20) Int32N1e8-32 10.46n ± 2% 10.49n ± 2% ~ (p=0.951 n=20) Int32N1e9-32 10.28n ± 2% 10.26n ± 1% ~ (p=0.431 n=20) Int32N2e9-32 10.50n ± 2% 10.44n ± 2% ~ (p=0.249 n=20) Float32-32 13.80n ± 2% 13.80n ± 2% ~ (p=0.751 n=20) Float64-32 23.55n ± 2% 23.87n ± 0% ~ (p=0.408 n=20) ExpFloat64-32 15.36n ± 1% 15.29n ± 2% ~ (p=0.316 n=20) NormFloat64-32 13.57n ± 1% 13.79n ± 1% +1.66% (p=0.005 n=20) Perm3-32 45.70n ± 2% 46.99n ± 2% +2.81% (p=0.001 n=20) Perm30-32 399.0n ± 1% 403.8n ± 1% +1.19% (p=0.006 n=20) Perm30ViaShuffle-32 349.0n ± 1% 350.4n ± 1% ~ (p=0.909 n=20) ShuffleOverhead-32 322.3n ± 1% 323.8n ± 1% ~ (p=0.410 n=20) Concurrent-32 3.331n ± 1% 7.312n ± 1% +119.50% (p=0.000 n=20) For #61716. Change-Id: Ibdddeed85c34d9ae397289dc899e04d4845f9ed2 Reviewed-on: https://go-review.googlesource.com/c/go/+/516860 Reviewed-by: Michael Pratt <mpratt@google.com> Reviewed-by: Filippo Valsorda <filippo@golang.org> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>

A persistent performance regression was discovered on perf.golang.org/dashboard and this was narrowed down to the switch to footers. Using allocation headers instead resolves the issue. The benchmark results for allocation footers weren't realistic, because they were performed on a machine with enough L3 cache that it completely hid the additional cache miss introduced by allocation footers. This means that in some corner cases the Go runtime may no longer allocate 16-byte aligned memory. Note however that this property was *mostly* incidental and never guaranteed in any documentation. Allocation headers were tested widely within Google and no issues were found, so we're fairly confident that this will not affect very many users. Nonetheless, by Hyrum's Law some code might depend on it. A follow-up change will add a GODEBUG flag that ensures 16 byte alignment at the potential cost of some additional memory use. Users experiencing both a performance regression and an alignment issue can also disable the GOEXPERIMENT at build time. This reverts commit 1e250a219900651dad27f29eab0877eee4afd5b9. Change-Id: Ia7d62a9c60d1773c8b6d33322ee33a80ef814943 Reviewed-on: https://go-review.googlesource.com/c/go/+/543255 Auto-Submit: Michael Knyszek <mknyszek@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Cherry Mui <cherryyz@google.com>

Currently bulkBarrierPreWrite follows a fairly slow path wherein it calls typePointersOf, which ends up calling into fastForward. This does some fairly heavy computation to move the iterator forward without any assumptions about where it lands at all. It needs to be completely general to support splitting at arbitrary boundaries, for example for scanning oblets. This means that copying objects during the GC mark phase is fairly expensive, and is a regression from before allocheaders. However, in almost all cases bulkBarrierPreWrite and bulkBarrierPreWriteSrcOnly have perfect type information. We can do a lot better in these cases because we're starting on a type-size boundary, which is exactly what the iterator is built around. This change adds the typePointersOfType method which produces a typePointers iterator from a pointer and a type. This change significantly improves the performance of these bulk write barriers, eliminating some performance regressions that were noticed on the perf dashboard. There are still just a couple cases where we have to use the more general typePointersOf calls, but they're fairly rare; most bulk barriers have perfect type information. This change is tested by the GCInfo tests in the runtime and the GCBits tests in the reflect package via an additional check in getgcmask. Results for tile38 before and after allocheaders. There was previous a regression in the p90, now it's gone. Also, the overall win has been boosted slightly. tile38 $ benchstat noallocheaders.results allocheaders.results name old time/op new time/op delta Tile38QueryLoad 481µs ± 1% 468µs ± 1% -2.71% (p=0.000 n=10+10) name old average-RSS-bytes new average-RSS-bytes delta Tile38QueryLoad 6.32GB ± 1% 6.23GB ± 0% -1.38% (p=0.000 n=9+8) name old peak-RSS-bytes new peak-RSS-bytes delta Tile38QueryLoad 6.49GB ± 1% 6.40GB ± 1% -1.38% (p=0.002 n=10+10) name old peak-VM-bytes new peak-VM-bytes delta Tile38QueryLoad 7.72GB ± 1% 7.64GB ± 1% -1.07% (p=0.007 n=10+10) name old p50-latency-ns new p50-latency-ns delta Tile38QueryLoad 212k ± 1% 205k ± 0% -3.02% (p=0.000 n=10+9) name old p90-latency-ns new p90-latency-ns delta Tile38QueryLoad 622k ± 1% 616k ± 1% -1.03% (p=0.005 n=10+10) name old p99-latency-ns new p99-latency-ns delta Tile38QueryLoad 4.55M ± 2% 4.39M ± 2% -3.51% (p=0.000 n=10+10) name old ops/s new ops/s delta Tile38QueryLoad 12.5k ± 1% 12.8k ± 1% +2.78% (p=0.000 n=10+10) Change-Id: I0a48f848eae8777d0fd6769c3a1fe449f8d9d0a6 Reviewed-on: https://go-review.googlesource.com/c/go/+/542219 Reviewed-by: Cherry Mui <cherryyz@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>

getgcmask stops referencing the object passed to it sometime between when the object is looked up and when the function returns. Notably, this can happen while the GC mask is actively being produced, and thus the GC might free the object. This is easily reproducible by adding a runtime.GC call at just the right place. Adding a KeepAlive on the heap-object path fixes it. Fixes #64188. Change-Id: I5ed4cae862fc780338b60d969fd7fbe896352ce4 Reviewed-on: https://go-review.googlesource.com/c/go/+/542716 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Cherry Mui <cherryyz@google.com>

Currently the user arena code writes heap bits to the (*mspan).heapBits space with the platform-specific byte ordering (the heap bits are written and managed as uintptrs). However, the compiler always emits GC metadata for types in little endian. Because the scanning part of the code that loads through the type pointer in the allocation header expects little endian ordering, we end up with the wrong byte ordering in GC when trying to scan arena memory. Fix this by writing out the user arena heap bits in little endian on big endian platforms. This means that the space returned by (*mspan).heapBits has a different meaning for user arenas and small object spans, which is a little odd, so I documented it. To reduce the chance of misuse of the writeHeapBits API, which now writes out heap bits in a different ordering than writeSmallHeapBits on big endian platforms, this change also renames writeHeapBits to writeUserArenaHeapBits. Much of this can be avoided in the future if the compiler were to write out the pointer/scalar bits as an array of uintptr values instead of plain bytes. That's too big of a change for right now though. This change is a no-op on little endian platforms. I confirmed it by checking for any assembly code differences in the runtime test binary. There were none. With this change, the arena tests pass on ppc64. Fixes #64048. Change-Id: If077d003872fcccf5a154ff5d8441a58582061bb Reviewed-on: https://go-review.googlesource.com/c/go/+/541315 Run-TryBot: Michael Knyszek <mknyszek@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Cherry Mui <cherryyz@google.com> TryBot-Result: Gopher Robot <gobot@golang.org>

The previous CL in this series (CL 437955) adds the allocation headers experiment. However, this experiment puts the headers at the beginning of each allocation, which decreases the default allocator alignment that users can rely upon. Historically, Go's memory allocator has implicitly provided 16-byte alignment (except for sizes where it doesn't make sense, like 8 or 24 bytes), so it's not unlikely that users are depending on it. It also complicates other changes that want higher alignment. For example, the sync/atomic.Uint64Pair proposal would (hypothetically; it's not yet accepted) introduce a type with 16-byte alignment. The malloc fast path will require extra code to consider alignment and will waste memory for any value containing such a type. This change moves the allocation header to the end of the span's allocation slot instead of the beginning. This means worse locality for the GC when scanning, but it's still an overall win. It also means that objects will still have the 16-byte alignment we've provided thus far. This is broken out in a separate change just becauase it ended up that way during development. But I've chosen to leave it this way in case we want to try and move allocation headers to the front of objects again. Below are the benchmark results of this CL series, comparing the performance of this CL with GOEXPERIMENT=allocheaders vs. without this CL series. name old time/op new time/op delta BiogoIgor 12.5s ± 0% 12.4s ± 2% ~ (p=0.079 n=9+10) BiogoKrishna 12.8s ±10% 12.4s ±10% ~ (p=0.182 n=9+10) BleveIndexBatch100 4.54s ± 3% 4.60s ± 3% ~ (p=0.050 n=9+9) EtcdPut 21.1ms ± 2% 21.3ms ± 4% ~ (p=0.669 n=7+10) EtcdSTM 107ms ± 3% 108ms ± 2% ~ (p=0.497 n=9+10) GoBuildKubelet 34.1s ± 3% 33.1s ± 2% -3.08% (p=0.000 n=10+10) GoBuildKubeletLink 7.94s ± 2% 7.95s ± 2% ~ (p=0.631 n=10+10) GoBuildIstioctl 33.2s ± 1% 31.7s ± 0% -4.37% (p=0.000 n=9+9) GoBuildIstioctlLink 8.07s ± 1% 8.05s ± 1% ~ (p=0.356 n=9+10) GoBuildFrontend 12.1s ± 0% 11.5s ± 1% -4.43% (p=0.000 n=8+10) GoBuildFrontendLink 1.20s ± 2% 1.20s ± 2% ~ (p=0.905 n=9+10) GopherLuaKNucleotide 19.9s ± 0% 19.5s ± 1% -1.95% (p=0.000 n=9+10) MarkdownRenderXHTML 194ms ± 5% 194ms ± 2% ~ (p=0.931 n=9+9) Tile38QueryLoad 518µs ± 1% 508µs ± 1% -1.93% (p=0.000 n=9+8) name old average-RSS-bytes new average-RSS-bytes delta BiogoIgor 66.2MB ± 3% 65.6MB ± 1% ~ (p=0.156 n=10+9) BiogoKrishna 4.34GB ± 2% 4.34GB ± 1% ~ (p=0.315 n=10+9) BleveIndexBatch100 189MB ± 3% 186MB ± 3% ~ (p=0.052 n=10+10) EtcdPut 105MB ± 5% 107MB ± 6% ~ (p=0.579 n=10+10) EtcdSTM 92.1MB ± 5% 93.2MB ± 4% ~ (p=0.353 n=10+10) GoBuildKubelet 2.07GB ± 1% 2.07GB ± 1% ~ (p=0.436 n=10+10) GoBuildIstioctl 1.44GB ± 1% 1.46GB ± 1% +0.96% (p=0.001 n=10+10) GoBuildFrontend 522MB ± 1% 512MB ± 2% -1.98% (p=0.000 n=10+10) GopherLuaKNucleotide 37.4MB ± 5% 36.4MB ± 4% -2.53% (p=0.035 n=10+10) MarkdownRenderXHTML 21.2MB ± 1% 20.9MB ± 3% -1.53% (p=0.003 n=8+10) Tile38QueryLoad 6.39GB ± 2% 6.24GB ± 2% -2.40% (p=0.000 n=10+10) name old peak-RSS-bytes new peak-RSS-bytes delta BiogoIgor 88.5MB ± 4% 88.4MB ± 3% ~ (p=0.971 n=10+10) BiogoKrishna 4.48GB ± 0% 4.42GB ± 0% -1.49% (p=0.000 n=10+10) BleveIndexBatch100 268MB ± 3% 265MB ± 4% ~ (p=0.315 n=9+10) EtcdPut 147MB ± 9% 146MB ± 5% ~ (p=0.853 n=10+10) EtcdSTM 119MB ± 6% 120MB ± 5% ~ (p=0.796 n=10+10) GopherLuaKNucleotide 43.1MB ±17% 40.7MB ±12% ~ (p=0.075 n=10+10) MarkdownRenderXHTML 21.2MB ± 1% 21.1MB ± 3% ~ (p=0.511 n=9+10) Tile38QueryLoad 6.65GB ± 4% 6.52GB ± 2% -1.93% (p=0.009 n=10+10) name old peak-VM-bytes new peak-VM-bytes delta BiogoIgor 1.33GB ± 0% 1.33GB ± 0% -0.16% (p=0.000 n=10+10) BiogoKrishna 5.77GB ± 0% 5.69GB ± 0% -1.23% (p=0.000 n=10+10) BleveIndexBatch100 2.62GB ± 0% 2.61GB ± 0% -0.13% (p=0.000 n=7+10) EtcdPut 12.1GB ± 0% 12.1GB ± 0% ~ (p=0.160 n=8+10) EtcdSTM 12.1GB ± 0% 12.1GB ± 0% -0.02% (p=0.000 n=10+10) GopherLuaKNucleotide 1.26GB ± 0% 1.26GB ± 0% -0.09% (p=0.000 n=10+10) MarkdownRenderXHTML 1.26GB ± 0% 1.26GB ± 0% -0.08% (p=0.000 n=10+10) Tile38QueryLoad 7.89GB ± 4% 7.76GB ± 1% -1.70% (p=0.008 n=10+8) name old p50-latency-ns new p50-latency-ns delta EtcdPut 20.1M ± 5% 20.2M ± 4% ~ (p=0.529 n=10+10) EtcdSTM 79.8M ± 4% 79.9M ± 4% ~ (p=0.971 n=10+10) Tile38QueryLoad 215k ± 1% 210k ± 3% -2.04% (p=0.021 n=8+10) name old p90-latency-ns new p90-latency-ns delta EtcdPut 31.9M ± 6% 32.0M ± 7% ~ (p=0.780 n=9+10) EtcdSTM 220M ± 6% 220M ± 2% ~ (p=1.000 n=10+10) Tile38QueryLoad 622k ± 2% 646k ± 2% +3.83% (p=0.000 n=10+10) name old p99-latency-ns new p99-latency-ns delta EtcdPut 47.6M ±32% 51.4M ±28% ~ (p=0.529 n=10+10) EtcdSTM 452M ± 2% 457M ± 2% ~ (p=0.182 n=9+10) Tile38QueryLoad 5.04M ± 2% 4.91M ± 3% -2.56% (p=0.001 n=9+9) name old ops/s new ops/s delta EtcdPut 46.1k ± 2% 45.7k ± 4% ~ (p=0.475 n=7+10) EtcdSTM 9.18k ± 5% 9.20k ± 3% ~ (p=0.971 n=10+10) Tile38QueryLoad 17.4k ± 1% 17.7k ± 1% +1.97% (p=0.000 n=9+8) Change-Id: I637f48fb9e8c181912db785ae9186d7f16769870 Reviewed-on: https://go-review.googlesource.com/c/go/+/537886 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Keith Randall <khr@golang.org> Reviewed-by: Cherry Mui <cherryyz@google.com>

This change replaces the 1-bit-per-word heap bitmap for most size classes with allocation headers for objects that contain pointers. The header consists of a single pointer to a type. All allocations with headers are treated as implicitly containing one or more instances of the type in the header. As the name implies, headers are usually stored as the first word of an object. There are two additional exceptions to where headers are stored and how they're used. Objects smaller than 512 bytes do not have headers. Instead, a heap bitmap is reserved at the end of spans for objects of this size. A full word of overhead is too much for these small objects. The bitmap is of the same format of the old bitmap, minus the noMorePtrs bits which are unnecessary. All the objects <512 bytes have a bitmap less than a pointer-word in size, and that was the granularity at which noMorePtrs could stop scanning early anyway. Objects that are larger than 32 KiB (which have their own span) have their headers stored directly in the span, to allow power-of-two-sized allocations to not spill over into an extra page. The full implementation is behind GOEXPERIMENT=allocheaders. The purpose of this change is performance. First and foremost, with headers we no longer have to unroll pointer/scalar data at allocation time for most size classes. Small size classes still need some unrolling, but their bitmaps are small so we can optimize that case fairly well. Larger objects effectively have their pointer/scalar data unrolled on-demand from type data, which is much more compactly represented and results in less TLB pressure. Furthermore, since the headers are usually right next to the object and where we're about to start scanning, we get an additional temporal locality benefit in the data cache when looking up type metadata. The pointer/scalar data is now effectively unrolled on-demand, but it's also simpler to unroll than before; that unrolled data is never written anywhere, and for arrays we get the benefit of retreading the same data per element, as opposed to looking it up from scratch for each pointer-word of bitmap. Lastly, because we no longer have a heap bitmap that spans the entire heap, there's a flat 1.5% memory use reduction. This is balanced slightly by some objects possibly being bumped up a size class, but most objects are not tightly optimized to size class sizes so there's some memory to spare, making the header basically free in those cases. See the follow-up CL which turns on this experiment by default for benchmark results. (CL 538217.) Change-Id: I4c9034ee200650d06d8bdecd579d5f7c1bbf1fc5 Reviewed-on: https://go-review.googlesource.com/c/go/+/437955 Reviewed-by: Cherry Mui <cherryyz@google.com> Reviewed-by: Keith Randall <khr@golang.org> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>

This change adds the allocation headers GOEXPERIMENT which is a no-op. It forks two runtime files temporarily to make the GOEXPERIMENT easier to maintain. The forked files are mbitmap.go and msize.go. Change-Id: I60202c00e614e4517de7dd000029cf80dd0121ef Reviewed-on: https://go-review.googlesource.com/c/go/+/537980 Reviewed-by: Cherry Mui <cherryyz@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Keith Randall <khr@golang.org>