| Age | Commit message (Collapse) | Author |
|
This change removes the allocheaders, deleting all the old code and
merging mbitmap_allocheaders.go back into mbitmap.go.
This change also deletes the SetType benchmarks which were already
broken in the new GOEXPERIMENT (it's harder to set up than before). We
weren't really watching these benchmarks at all, and they don't provide
additional test coverage.
Change-Id: I135497201c3259087c5cd3722ed3fbe24791d25d
Reviewed-on: https://go-review.googlesource.com/c/go/+/567200
Reviewed-by: Keith Randall <khr@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
|
|
For #59670
Change-Id: Id66e102f13e529dd041b68ce869026a56f0a1b9b
GitHub-Last-Rev: 43aa9376f72bc02a9d86518cdc99494a6b2f8573
GitHub-Pull-Request: golang/go#65564
Reviewed-on: https://go-review.googlesource.com/c/go/+/562298
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Austin Clements <austin@google.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Auto-Submit: Austin Clements <austin@google.com>
|
|
Change-Id: I1df0685c75fc1044ba46003a69ecc7dfc53bbc2b
Reviewed-on: https://go-review.googlesource.com/c/go/+/574675
Reviewed-by: Ian Lance Taylor <iant@google.com>
Reviewed-by: Than McIntosh <thanm@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Ian Lance Taylor <iant@google.com>
|
|
CL 555355 has a bug in it - the GC program flag was also used to decide
when to free the unrolled bitmap. After that CL, we just don't free any
unrolled bitmaps, leading to a memory leak.
Use a separate flag to track types that need to be freed when their
corresponding object is freed.
Change-Id: I841b65492561f5b5e1853875fbd8e8a872205a84
Reviewed-on: https://go-review.googlesource.com/c/go/+/555416
Auto-Submit: Keith Randall <khr@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Keith Randall <khr@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
It doesn't have a GC program - the whole point is that it is
the unrolled version of a GC program.
Fortunately, this isn't a bug as (*mspan).typePointersOfUnchecked
ignores the GCProg flag and just uses GCData as a bitmap unconditionally.
Change-Id: I2508af85af4a1806946e54c893120c5cc0cc3da3
Reviewed-on: https://go-review.googlesource.com/c/go/+/555355
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Keith Randall <khr@google.com>
|
|
Move ChaCha8 code into internal/chacha8rand and use it to implement
runtime.rand, which is used for the unseeded global source for
both math/rand and math/rand/v2. This also affects the calculation of
the start point for iteration over very very large maps (when the
32-bit fastrand is not big enough).
The benefit is that misuse of the global random number generators
in math/rand and math/rand/v2 in contexts where non-predictable
randomness is important for security reasons is no longer a
security problem, removing a common mistake among programmers
who are unaware of the different kinds of randomness.
The cost is an extra 304 bytes per thread stored in the m struct
plus 2-3ns more per random uint64 due to the more sophisticated
algorithm. Using PCG looks like it would cost about the same,
although I haven't benchmarked that.
Before this, the math/rand and math/rand/v2 global generator
was wyrand (https://github.com/wangyi-fudan/wyhash).
For math/rand, using wyrand instead of the Mitchell/Reeds/Thompson
ALFG was justifiable, since the latter was not any better.
But for math/rand/v2, the global generator really should be
at least as good as one of the well-studied, specific algorithms
provided directly by the package, and it's not.
(Wyrand is still reasonable for scheduling and cache decisions.)
Good randomness does have a cost: about twice wyrand.
Also rationalize the various runtime rand references.
goos: linux
goarch: amd64
pkg: math/rand/v2
cpu: AMD Ryzen 9 7950X 16-Core Processor
│ bbb48afeb7.amd64 │ 5cf807d1ea.amd64 │
│ sec/op │ sec/op vs base │
ChaCha8-32 1.862n ± 2% 1.861n ± 2% ~ (p=0.825 n=20)
PCG_DXSM-32 1.471n ± 1% 1.460n ± 2% ~ (p=0.153 n=20)
SourceUint64-32 1.636n ± 2% 1.582n ± 1% -3.30% (p=0.000 n=20)
GlobalInt64-32 2.087n ± 1% 3.663n ± 1% +75.54% (p=0.000 n=20)
GlobalInt64Parallel-32 0.1042n ± 1% 0.2026n ± 1% +94.48% (p=0.000 n=20)
GlobalUint64-32 2.263n ± 2% 3.724n ± 1% +64.57% (p=0.000 n=20)
GlobalUint64Parallel-32 0.1019n ± 1% 0.1973n ± 1% +93.67% (p=0.000 n=20)
Int64-32 1.771n ± 1% 1.774n ± 1% ~ (p=0.449 n=20)
Uint64-32 1.863n ± 2% 1.866n ± 1% ~ (p=0.364 n=20)
GlobalIntN1000-32 3.134n ± 3% 4.730n ± 2% +50.95% (p=0.000 n=20)
IntN1000-32 2.489n ± 1% 2.489n ± 1% ~ (p=0.683 n=20)
Int64N1000-32 2.521n ± 1% 2.516n ± 1% ~ (p=0.394 n=20)
Int64N1e8-32 2.479n ± 1% 2.478n ± 2% ~ (p=0.743 n=20)
Int64N1e9-32 2.530n ± 2% 2.514n ± 2% ~ (p=0.193 n=20)
Int64N2e9-32 2.501n ± 1% 2.494n ± 1% ~ (p=0.616 n=20)
Int64N1e18-32 3.227n ± 1% 3.205n ± 1% ~ (p=0.101 n=20)
Int64N2e18-32 3.647n ± 1% 3.599n ± 1% ~ (p=0.019 n=20)
Int64N4e18-32 5.135n ± 1% 5.069n ± 2% ~ (p=0.034 n=20)
Int32N1000-32 2.657n ± 1% 2.637n ± 1% ~ (p=0.180 n=20)
Int32N1e8-32 2.636n ± 1% 2.636n ± 1% ~ (p=0.763 n=20)
Int32N1e9-32 2.660n ± 2% 2.638n ± 1% ~ (p=0.358 n=20)
Int32N2e9-32 2.662n ± 2% 2.618n ± 2% ~ (p=0.064 n=20)
Float32-32 2.272n ± 2% 2.239n ± 2% ~ (p=0.194 n=20)
Float64-32 2.272n ± 1% 2.286n ± 2% ~ (p=0.763 n=20)
ExpFloat64-32 3.762n ± 1% 3.744n ± 1% ~ (p=0.171 n=20)
NormFloat64-32 3.706n ± 1% 3.655n ± 2% ~ (p=0.066 n=20)
Perm3-32 32.93n ± 3% 34.62n ± 1% +5.13% (p=0.000 n=20)
Perm30-32 202.9n ± 1% 204.0n ± 1% ~ (p=0.482 n=20)
Perm30ViaShuffle-32 115.0n ± 1% 114.9n ± 1% ~ (p=0.358 n=20)
ShuffleOverhead-32 112.8n ± 1% 112.7n ± 1% ~ (p=0.692 n=20)
Concurrent-32 2.107n ± 0% 3.725n ± 1% +76.75% (p=0.000 n=20)
goos: darwin
goarch: arm64
pkg: math/rand/v2
│ bbb48afeb7.arm64 │ 5cf807d1ea.arm64 │
│ sec/op │ sec/op vs base │
ChaCha8-8 2.480n ± 0% 2.429n ± 0% -2.04% (p=0.000 n=20)
PCG_DXSM-8 2.531n ± 0% 2.530n ± 0% ~ (p=0.877 n=20)
SourceUint64-8 2.534n ± 0% 2.533n ± 0% ~ (p=0.732 n=20)
GlobalInt64-8 2.172n ± 1% 4.794n ± 0% +120.67% (p=0.000 n=20)
GlobalInt64Parallel-8 0.4320n ± 0% 0.9605n ± 0% +122.32% (p=0.000 n=20)
GlobalUint64-8 2.182n ± 0% 4.770n ± 0% +118.58% (p=0.000 n=20)
GlobalUint64Parallel-8 0.4307n ± 0% 0.9583n ± 0% +122.51% (p=0.000 n=20)
Int64-8 4.107n ± 0% 4.104n ± 0% ~ (p=0.416 n=20)
Uint64-8 4.080n ± 0% 4.080n ± 0% ~ (p=0.052 n=20)
GlobalIntN1000-8 2.814n ± 2% 5.643n ± 0% +100.50% (p=0.000 n=20)
IntN1000-8 4.141n ± 0% 4.139n ± 0% ~ (p=0.140 n=20)
Int64N1000-8 4.140n ± 0% 4.140n ± 0% ~ (p=0.313 n=20)
Int64N1e8-8 4.140n ± 0% 4.139n ± 0% ~ (p=0.103 n=20)
Int64N1e9-8 4.139n ± 0% 4.140n ± 0% ~ (p=0.761 n=20)
Int64N2e9-8 4.140n ± 0% 4.140n ± 0% ~ (p=0.636 n=20)
Int64N1e18-8 5.266n ± 0% 5.326n ± 1% +1.14% (p=0.001 n=20)
Int64N2e18-8 6.052n ± 0% 6.167n ± 0% +1.90% (p=0.000 n=20)
Int64N4e18-8 8.826n ± 0% 9.051n ± 0% +2.55% (p=0.000 n=20)
Int32N1000-8 4.127n ± 0% 4.132n ± 0% +0.12% (p=0.000 n=20)
Int32N1e8-8 4.126n ± 0% 4.131n ± 0% +0.12% (p=0.000 n=20)
Int32N1e9-8 4.127n ± 0% 4.132n ± 0% +0.12% (p=0.000 n=20)
Int32N2e9-8 4.132n ± 0% 4.131n ± 0% ~ (p=0.017 n=20)
Float32-8 4.109n ± 0% 4.105n ± 0% ~ (p=0.379 n=20)
Float64-8 4.107n ± 0% 4.106n ± 0% ~ (p=0.867 n=20)
ExpFloat64-8 5.339n ± 0% 5.383n ± 0% +0.82% (p=0.000 n=20)
NormFloat64-8 5.735n ± 0% 5.737n ± 1% ~ (p=0.856 n=20)
Perm3-8 26.65n ± 0% 26.80n ± 1% +0.58% (p=0.000 n=20)
Perm30-8 194.8n ± 1% 197.0n ± 0% +1.18% (p=0.000 n=20)
Perm30ViaShuffle-8 156.6n ± 0% 157.6n ± 1% +0.61% (p=0.000 n=20)
ShuffleOverhead-8 124.9n ± 0% 125.5n ± 0% +0.52% (p=0.000 n=20)
Concurrent-8 2.434n ± 3% 5.066n ± 0% +108.09% (p=0.000 n=20)
goos: linux
goarch: 386
pkg: math/rand/v2
cpu: AMD Ryzen 9 7950X 16-Core Processor
│ bbb48afeb7.386 │ 5cf807d1ea.386 │
│ sec/op │ sec/op vs base │
ChaCha8-32 11.295n ± 1% 4.748n ± 2% -57.96% (p=0.000 n=20)
PCG_DXSM-32 7.693n ± 1% 7.738n ± 2% ~ (p=0.542 n=20)
SourceUint64-32 7.658n ± 2% 7.622n ± 2% ~ (p=0.344 n=20)
GlobalInt64-32 3.473n ± 2% 7.526n ± 2% +116.73% (p=0.000 n=20)
GlobalInt64Parallel-32 0.3198n ± 0% 0.5444n ± 0% +70.22% (p=0.000 n=20)
GlobalUint64-32 3.612n ± 0% 7.575n ± 1% +109.69% (p=0.000 n=20)
GlobalUint64Parallel-32 0.3168n ± 0% 0.5403n ± 0% +70.51% (p=0.000 n=20)
Int64-32 7.673n ± 2% 7.789n ± 1% ~ (p=0.122 n=20)
Uint64-32 7.773n ± 1% 7.827n ± 2% ~ (p=0.920 n=20)
GlobalIntN1000-32 6.268n ± 1% 9.581n ± 1% +52.87% (p=0.000 n=20)
IntN1000-32 10.33n ± 2% 10.45n ± 1% ~ (p=0.233 n=20)
Int64N1000-32 10.98n ± 2% 11.01n ± 1% ~ (p=0.401 n=20)
Int64N1e8-32 11.19n ± 2% 10.97n ± 1% ~ (p=0.033 n=20)
Int64N1e9-32 11.06n ± 1% 11.08n ± 1% ~ (p=0.498 n=20)
Int64N2e9-32 11.10n ± 1% 11.01n ± 2% ~ (p=0.995 n=20)
Int64N1e18-32 15.23n ± 2% 15.04n ± 1% ~ (p=0.973 n=20)
Int64N2e18-32 15.89n ± 1% 15.85n ± 1% ~ (p=0.409 n=20)
Int64N4e18-32 18.96n ± 2% 19.34n ± 2% ~ (p=0.048 n=20)
Int32N1000-32 10.46n ± 2% 10.44n ± 2% ~ (p=0.480 n=20)
Int32N1e8-32 10.46n ± 2% 10.49n ± 2% ~ (p=0.951 n=20)
Int32N1e9-32 10.28n ± 2% 10.26n ± 1% ~ (p=0.431 n=20)
Int32N2e9-32 10.50n ± 2% 10.44n ± 2% ~ (p=0.249 n=20)
Float32-32 13.80n ± 2% 13.80n ± 2% ~ (p=0.751 n=20)
Float64-32 23.55n ± 2% 23.87n ± 0% ~ (p=0.408 n=20)
ExpFloat64-32 15.36n ± 1% 15.29n ± 2% ~ (p=0.316 n=20)
NormFloat64-32 13.57n ± 1% 13.79n ± 1% +1.66% (p=0.005 n=20)
Perm3-32 45.70n ± 2% 46.99n ± 2% +2.81% (p=0.001 n=20)
Perm30-32 399.0n ± 1% 403.8n ± 1% +1.19% (p=0.006 n=20)
Perm30ViaShuffle-32 349.0n ± 1% 350.4n ± 1% ~ (p=0.909 n=20)
ShuffleOverhead-32 322.3n ± 1% 323.8n ± 1% ~ (p=0.410 n=20)
Concurrent-32 3.331n ± 1% 7.312n ± 1% +119.50% (p=0.000 n=20)
For #61716.
Change-Id: Ibdddeed85c34d9ae397289dc899e04d4845f9ed2
Reviewed-on: https://go-review.googlesource.com/c/go/+/516860
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Filippo Valsorda <filippo@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
A persistent performance regression was discovered on
perf.golang.org/dashboard and this was narrowed down to the switch to
footers. Using allocation headers instead resolves the issue.
The benchmark results for allocation footers weren't realistic, because
they were performed on a machine with enough L3 cache that it completely
hid the additional cache miss introduced by allocation footers.
This means that in some corner cases the Go runtime may no longer
allocate 16-byte aligned memory. Note however that this property was
*mostly* incidental and never guaranteed in any documentation.
Allocation headers were tested widely within Google and no issues were
found, so we're fairly confident that this will not affect very many
users.
Nonetheless, by Hyrum's Law some code might depend on it. A follow-up
change will add a GODEBUG flag that ensures 16 byte alignment at the
potential cost of some additional memory use. Users experiencing both a
performance regression and an alignment issue can also disable the
GOEXPERIMENT at build time.
This reverts commit 1e250a219900651dad27f29eab0877eee4afd5b9.
Change-Id: Ia7d62a9c60d1773c8b6d33322ee33a80ef814943
Reviewed-on: https://go-review.googlesource.com/c/go/+/543255
Auto-Submit: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
|
|
Currently bulkBarrierPreWrite follows a fairly slow path wherein it
calls typePointersOf, which ends up calling into fastForward. This does
some fairly heavy computation to move the iterator forward without any
assumptions about where it lands at all. It needs to be completely
general to support splitting at arbitrary boundaries, for example for
scanning oblets.
This means that copying objects during the GC mark phase is fairly
expensive, and is a regression from before allocheaders.
However, in almost all cases bulkBarrierPreWrite and
bulkBarrierPreWriteSrcOnly have perfect type information. We can do a
lot better in these cases because we're starting on a type-size
boundary, which is exactly what the iterator is built around.
This change adds the typePointersOfType method which produces a
typePointers iterator from a pointer and a type. This change
significantly improves the performance of these bulk write barriers,
eliminating some performance regressions that were noticed on the perf
dashboard.
There are still just a couple cases where we have to use the more
general typePointersOf calls, but they're fairly rare; most bulk
barriers have perfect type information.
This change is tested by the GCInfo tests in the runtime and the GCBits
tests in the reflect package via an additional check in getgcmask.
Results for tile38 before and after allocheaders. There was previous a
regression in the p90, now it's gone. Also, the overall win has been
boosted slightly.
tile38 $ benchstat noallocheaders.results allocheaders.results
name old time/op new time/op delta
Tile38QueryLoad 481µs ± 1% 468µs ± 1% -2.71% (p=0.000 n=10+10)
name old average-RSS-bytes new average-RSS-bytes delta
Tile38QueryLoad 6.32GB ± 1% 6.23GB ± 0% -1.38% (p=0.000 n=9+8)
name old peak-RSS-bytes new peak-RSS-bytes delta
Tile38QueryLoad 6.49GB ± 1% 6.40GB ± 1% -1.38% (p=0.002 n=10+10)
name old peak-VM-bytes new peak-VM-bytes delta
Tile38QueryLoad 7.72GB ± 1% 7.64GB ± 1% -1.07% (p=0.007 n=10+10)
name old p50-latency-ns new p50-latency-ns delta
Tile38QueryLoad 212k ± 1% 205k ± 0% -3.02% (p=0.000 n=10+9)
name old p90-latency-ns new p90-latency-ns delta
Tile38QueryLoad 622k ± 1% 616k ± 1% -1.03% (p=0.005 n=10+10)
name old p99-latency-ns new p99-latency-ns delta
Tile38QueryLoad 4.55M ± 2% 4.39M ± 2% -3.51% (p=0.000 n=10+10)
name old ops/s new ops/s delta
Tile38QueryLoad 12.5k ± 1% 12.8k ± 1% +2.78% (p=0.000 n=10+10)
Change-Id: I0a48f848eae8777d0fd6769c3a1fe449f8d9d0a6
Reviewed-on: https://go-review.googlesource.com/c/go/+/542219
Reviewed-by: Cherry Mui <cherryyz@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
getgcmask stops referencing the object passed to it sometime between
when the object is looked up and when the function returns. Notably,
this can happen while the GC mask is actively being produced, and thus
the GC might free the object.
This is easily reproducible by adding a runtime.GC call at just the
right place. Adding a KeepAlive on the heap-object path fixes it.
Fixes #64188.
Change-Id: I5ed4cae862fc780338b60d969fd7fbe896352ce4
Reviewed-on: https://go-review.googlesource.com/c/go/+/542716
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
|
|
Currently the user arena code writes heap bits to the (*mspan).heapBits
space with the platform-specific byte ordering (the heap bits are
written and managed as uintptrs). However, the compiler always emits GC
metadata for types in little endian.
Because the scanning part of the code that loads through the type
pointer in the allocation header expects little endian ordering, we end
up with the wrong byte ordering in GC when trying to scan arena memory.
Fix this by writing out the user arena heap bits in little endian on big
endian platforms.
This means that the space returned by (*mspan).heapBits has a different
meaning for user arenas and small object spans, which is a little odd,
so I documented it. To reduce the chance of misuse of the writeHeapBits
API, which now writes out heap bits in a different ordering than
writeSmallHeapBits on big endian platforms, this change also renames
writeHeapBits to writeUserArenaHeapBits.
Much of this can be avoided in the future if the compiler were to write
out the pointer/scalar bits as an array of uintptr values instead of
plain bytes. That's too big of a change for right now though.
This change is a no-op on little endian platforms. I confirmed it by
checking for any assembly code differences in the runtime test binary.
There were none. With this change, the arena tests pass on ppc64.
Fixes #64048.
Change-Id: If077d003872fcccf5a154ff5d8441a58582061bb
Reviewed-on: https://go-review.googlesource.com/c/go/+/541315
Run-TryBot: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
|
|
The previous CL in this series (CL 437955) adds the allocation headers
experiment. However, this experiment puts the headers at the beginning
of each allocation, which decreases the default allocator alignment that
users can rely upon. Historically, Go's memory allocator has implicitly
provided 16-byte alignment (except for sizes where it doesn't make
sense, like 8 or 24 bytes), so it's not unlikely that users are
depending on it. It also complicates other changes that want higher
alignment. For example, the sync/atomic.Uint64Pair proposal would
(hypothetically; it's not yet accepted) introduce a type with 16-byte
alignment. The malloc fast path will require extra code to consider
alignment and will waste memory for any value containing such a type.
This change moves the allocation header to the end of the span's
allocation slot instead of the beginning. This means worse locality for
the GC when scanning, but it's still an overall win. It also means that
objects will still have the 16-byte alignment we've provided thus far.
This is broken out in a separate change just becauase it ended up that
way during development. But I've chosen to leave it this way in case we
want to try and move allocation headers to the front of objects again.
Below are the benchmark results of this CL series, comparing the
performance of this CL with GOEXPERIMENT=allocheaders vs. without this
CL series.
name old time/op new time/op delta
BiogoIgor 12.5s ± 0% 12.4s ± 2% ~ (p=0.079 n=9+10)
BiogoKrishna 12.8s ±10% 12.4s ±10% ~ (p=0.182 n=9+10)
BleveIndexBatch100 4.54s ± 3% 4.60s ± 3% ~ (p=0.050 n=9+9)
EtcdPut 21.1ms ± 2% 21.3ms ± 4% ~ (p=0.669 n=7+10)
EtcdSTM 107ms ± 3% 108ms ± 2% ~ (p=0.497 n=9+10)
GoBuildKubelet 34.1s ± 3% 33.1s ± 2% -3.08% (p=0.000 n=10+10)
GoBuildKubeletLink 7.94s ± 2% 7.95s ± 2% ~ (p=0.631 n=10+10)
GoBuildIstioctl 33.2s ± 1% 31.7s ± 0% -4.37% (p=0.000 n=9+9)
GoBuildIstioctlLink 8.07s ± 1% 8.05s ± 1% ~ (p=0.356 n=9+10)
GoBuildFrontend 12.1s ± 0% 11.5s ± 1% -4.43% (p=0.000 n=8+10)
GoBuildFrontendLink 1.20s ± 2% 1.20s ± 2% ~ (p=0.905 n=9+10)
GopherLuaKNucleotide 19.9s ± 0% 19.5s ± 1% -1.95% (p=0.000 n=9+10)
MarkdownRenderXHTML 194ms ± 5% 194ms ± 2% ~ (p=0.931 n=9+9)
Tile38QueryLoad 518µs ± 1% 508µs ± 1% -1.93% (p=0.000 n=9+8)
name old average-RSS-bytes new average-RSS-bytes delta
BiogoIgor 66.2MB ± 3% 65.6MB ± 1% ~ (p=0.156 n=10+9)
BiogoKrishna 4.34GB ± 2% 4.34GB ± 1% ~ (p=0.315 n=10+9)
BleveIndexBatch100 189MB ± 3% 186MB ± 3% ~ (p=0.052 n=10+10)
EtcdPut 105MB ± 5% 107MB ± 6% ~ (p=0.579 n=10+10)
EtcdSTM 92.1MB ± 5% 93.2MB ± 4% ~ (p=0.353 n=10+10)
GoBuildKubelet 2.07GB ± 1% 2.07GB ± 1% ~ (p=0.436 n=10+10)
GoBuildIstioctl 1.44GB ± 1% 1.46GB ± 1% +0.96% (p=0.001 n=10+10)
GoBuildFrontend 522MB ± 1% 512MB ± 2% -1.98% (p=0.000 n=10+10)
GopherLuaKNucleotide 37.4MB ± 5% 36.4MB ± 4% -2.53% (p=0.035 n=10+10)
MarkdownRenderXHTML 21.2MB ± 1% 20.9MB ± 3% -1.53% (p=0.003 n=8+10)
Tile38QueryLoad 6.39GB ± 2% 6.24GB ± 2% -2.40% (p=0.000 n=10+10)
name old peak-RSS-bytes new peak-RSS-bytes delta
BiogoIgor 88.5MB ± 4% 88.4MB ± 3% ~ (p=0.971 n=10+10)
BiogoKrishna 4.48GB ± 0% 4.42GB ± 0% -1.49% (p=0.000 n=10+10)
BleveIndexBatch100 268MB ± 3% 265MB ± 4% ~ (p=0.315 n=9+10)
EtcdPut 147MB ± 9% 146MB ± 5% ~ (p=0.853 n=10+10)
EtcdSTM 119MB ± 6% 120MB ± 5% ~ (p=0.796 n=10+10)
GopherLuaKNucleotide 43.1MB ±17% 40.7MB ±12% ~ (p=0.075 n=10+10)
MarkdownRenderXHTML 21.2MB ± 1% 21.1MB ± 3% ~ (p=0.511 n=9+10)
Tile38QueryLoad 6.65GB ± 4% 6.52GB ± 2% -1.93% (p=0.009 n=10+10)
name old peak-VM-bytes new peak-VM-bytes delta
BiogoIgor 1.33GB ± 0% 1.33GB ± 0% -0.16% (p=0.000 n=10+10)
BiogoKrishna 5.77GB ± 0% 5.69GB ± 0% -1.23% (p=0.000 n=10+10)
BleveIndexBatch100 2.62GB ± 0% 2.61GB ± 0% -0.13% (p=0.000 n=7+10)
EtcdPut 12.1GB ± 0% 12.1GB ± 0% ~ (p=0.160 n=8+10)
EtcdSTM 12.1GB ± 0% 12.1GB ± 0% -0.02% (p=0.000 n=10+10)
GopherLuaKNucleotide 1.26GB ± 0% 1.26GB ± 0% -0.09% (p=0.000 n=10+10)
MarkdownRenderXHTML 1.26GB ± 0% 1.26GB ± 0% -0.08% (p=0.000 n=10+10)
Tile38QueryLoad 7.89GB ± 4% 7.76GB ± 1% -1.70% (p=0.008 n=10+8)
name old p50-latency-ns new p50-latency-ns delta
EtcdPut 20.1M ± 5% 20.2M ± 4% ~ (p=0.529 n=10+10)
EtcdSTM 79.8M ± 4% 79.9M ± 4% ~ (p=0.971 n=10+10)
Tile38QueryLoad 215k ± 1% 210k ± 3% -2.04% (p=0.021 n=8+10)
name old p90-latency-ns new p90-latency-ns delta
EtcdPut 31.9M ± 6% 32.0M ± 7% ~ (p=0.780 n=9+10)
EtcdSTM 220M ± 6% 220M ± 2% ~ (p=1.000 n=10+10)
Tile38QueryLoad 622k ± 2% 646k ± 2% +3.83% (p=0.000 n=10+10)
name old p99-latency-ns new p99-latency-ns delta
EtcdPut 47.6M ±32% 51.4M ±28% ~ (p=0.529 n=10+10)
EtcdSTM 452M ± 2% 457M ± 2% ~ (p=0.182 n=9+10)
Tile38QueryLoad 5.04M ± 2% 4.91M ± 3% -2.56% (p=0.001 n=9+9)
name old ops/s new ops/s delta
EtcdPut 46.1k ± 2% 45.7k ± 4% ~ (p=0.475 n=7+10)
EtcdSTM 9.18k ± 5% 9.20k ± 3% ~ (p=0.971 n=10+10)
Tile38QueryLoad 17.4k ± 1% 17.7k ± 1% +1.97% (p=0.000 n=9+8)
Change-Id: I637f48fb9e8c181912db785ae9186d7f16769870
Reviewed-on: https://go-review.googlesource.com/c/go/+/537886
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
|
|
This change replaces the 1-bit-per-word heap bitmap for most size
classes with allocation headers for objects that contain pointers. The
header consists of a single pointer to a type. All allocations with
headers are treated as implicitly containing one or more instances of
the type in the header.
As the name implies, headers are usually stored as the first word of an
object. There are two additional exceptions to where headers are stored
and how they're used.
Objects smaller than 512 bytes do not have headers. Instead, a heap
bitmap is reserved at the end of spans for objects of this size. A full
word of overhead is too much for these small objects. The bitmap is of
the same format of the old bitmap, minus the noMorePtrs bits which are
unnecessary. All the objects <512 bytes have a bitmap less than a
pointer-word in size, and that was the granularity at which noMorePtrs
could stop scanning early anyway.
Objects that are larger than 32 KiB (which have their own span) have
their headers stored directly in the span, to allow power-of-two-sized
allocations to not spill over into an extra page.
The full implementation is behind GOEXPERIMENT=allocheaders.
The purpose of this change is performance. First and foremost, with
headers we no longer have to unroll pointer/scalar data at allocation
time for most size classes. Small size classes still need some
unrolling, but their bitmaps are small so we can optimize that case
fairly well. Larger objects effectively have their pointer/scalar data
unrolled on-demand from type data, which is much more compactly
represented and results in less TLB pressure. Furthermore, since the
headers are usually right next to the object and where we're about to
start scanning, we get an additional temporal locality benefit in the
data cache when looking up type metadata. The pointer/scalar data is
now effectively unrolled on-demand, but it's also simpler to unroll than
before; that unrolled data is never written anywhere, and for arrays we
get the benefit of retreading the same data per element, as opposed to
looking it up from scratch for each pointer-word of bitmap. Lastly,
because we no longer have a heap bitmap that spans the entire heap,
there's a flat 1.5% memory use reduction. This is balanced slightly by
some objects possibly being bumped up a size class, but most objects are
not tightly optimized to size class sizes so there's some memory to
spare, making the header basically free in those cases.
See the follow-up CL which turns on this experiment by default for
benchmark results. (CL 538217.)
Change-Id: I4c9034ee200650d06d8bdecd579d5f7c1bbf1fc5
Reviewed-on: https://go-review.googlesource.com/c/go/+/437955
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
This change adds the allocation headers GOEXPERIMENT which is a no-op.
It forks two runtime files temporarily to make the GOEXPERIMENT easier
to maintain. The forked files are mbitmap.go and msize.go.
Change-Id: I60202c00e614e4517de7dd000029cf80dd0121ef
Reviewed-on: https://go-review.googlesource.com/c/go/+/537980
Reviewed-by: Cherry Mui <cherryyz@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Keith Randall <khr@golang.org>
|