Change-Id: I8fe66326994894b17ce0eda991bba942844d26b0
Reviewed-on: https://go-review.googlesource.com/c/go/+/541475
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
CL 549536 intended to decouple the internal implementation of rwmutex
from the semantic meaning of an rwmutex read/write lock in the static
lock ranking.
Unfortunately, it was not thought through well enough. The internals
were represented with the rwmutexR and rwmutexW lock ranks. The idea was
that the internal lock ranks need not model the higher-level ordering,
since those have separate rankings. That is incorrect; rwmutexW is held
for the duration of a write lock, so it must be ranked before any lock
taken while any write lock is held, which is precisely what we were
trying to avoid.
This is visible in violations like:
0 : execW 11 0x0
1 : rwmutexW 51 0x111d9c8
2 : fin 30 0x111d3a0
fatal error: lock ordering problem
execW < fin is modeled, but rwmutexW < fin is missing.
Fix this by eliminating the rwmutexR/W lock ranks shared across
different types of rwmutex. Instead require users to define an
additional "internal" lock rank to represent the implementation details
of rwmutex.rLock. We can avoid an additional "internal" lock rank for
rwmutex.wLock because the existing writeRank has the same semantics for
semantic and internal locking; i.e., writeRank is held for the duration
of a write lock, which is exactly how rwmutex.wLock is used, so we can
use writeRank directly on wLock.
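To make the convention concrete, here is a sketch of what the resulting
initialization might look like, using the exec lock as an example. This
is runtime-internal code, and the exact rank names below are
illustrative rather than copied from the CL:

    func (rw *rwmutex) init(readRank, readRankInternal, writeRank lockRank) {
        rw.readRank = readRank
        // rLock gets a dedicated per-rwmutex "internal" rank...
        lockInit(&rw.rLock, readRankInternal)
        // ...while wLock reuses writeRank directly, since both are
        // held for the full duration of a write lock.
        lockInit(&rw.wLock, writeRank)
    }

    // A user such as execLock then supplies all three ranks:
    execLock.init(lockRankExecR, lockRankExecRInternal, lockRankExecW)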
For #64722.
Cq-Include-Trybots: luci.golang.try:gotip-linux-amd64-staticlockranking
Change-Id: Ia572de188a46ba8fe054ae28537648beaa16b12c
Reviewed-on: https://go-review.googlesource.com/c/go/+/555055
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
|
|
Currently, lock ranking doesn't really try to model rwmutex. It records
the internal locks rLock and wLock, but in a subpar fashion:
1. wLock is held from lock to unlock, so it works OK, but it conflates
write locks of all rwmutexes as rwmutexW, rather than allowing
different rwmutexes to have different rankings.
2. rLock is an internal implementation detail that is only taken when
there is contention in rLock. As a result, the reader lock path is
almost never checked.
Add proper modeling. rwmutexR and rwmutexW remain as the ranks of the
internal locks, which have their own ordering. The new init method is
passed the ranks of the higher level lock that this represents, just
like lockInit for mutex.
execW ordered before MALLOC captures the case from #64722; i.e., there
can be allocation between BeforeFork and AfterFork.
For #64722.
Cq-Include-Trybots: luci.golang.try:gotip-linux-amd64-staticlockranking
Change-Id: I23335b28faa42fb04f1bc9da02fdf54d1616cd28
Reviewed-on: https://go-review.googlesource.com/c/go/+/549536
Reviewed-by: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
The exported API is only available with GOEXPERIMENT=rangefunc.
This will let Go 1.22 users who want to experiment with rangefuncs
access an efficient implementation of iter.Pull and iter.Pull2.
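For illustration (not part of the CL), a minimal program using the
exported API; it builds only on Go 1.22 with GOEXPERIMENT=rangefunc set:

    package main

    import (
        "fmt"
        "iter"
    )

    // count yields 0, 1, ..., n-1 as a push-style sequence.
    func count(n int) iter.Seq[int] {
        return func(yield func(int) bool) {
            for i := 0; i < n; i++ {
                if !yield(i) {
                    return
                }
            }
        }
    }

    func main() {
        // iter.Pull converts the push-style sequence into a
        // pull-style next/stop pair.
        next, stop := iter.Pull(count(3))
        defer stop()
        for v, ok := next(); ok; v, ok = next() {
            fmt.Println(v)
        }
    }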
For #61897.
Change-Id: I6ef5fa8f117567efe4029b7b8b0f4d9b85697fb7
Reviewed-on: https://go-review.googlesource.com/c/go/+/543319
Reviewed-by: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
profileruntimelocks is new in CL 544195, but the name is deceptive. Even
with profileruntimelocks=0, runtime-internal locks are still profiled.
The actual difference is that call stacks are not collected. Instead all
contention is reported at runtime._LostContendedLock.
Rename this setting to runtimecontentionstacks to make its name more
aligned with its behavior.
In addition, for this release the default is profileruntimelocks=0,
meaning that users are fairly likely to encounter
runtime._LostContendedLock. Rename it to
runtime._LostContendedRuntimeLock in an attempt to make it more
intuitive that these are runtime locks, not locks in application code.
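As a sketch of how a user might surface these profiles: the GODEBUG knob
is the one renamed here, and the rest is the standard mutex-profile
machinery. Run with GODEBUG=runtimecontentionstacks=1 to get real
runtime-lock call stacks instead of the sentinel:

    package main

    import (
        "os"
        "runtime"
        "runtime/pprof"
        "sync"
    )

    func main() {
        // Sample every contention event (default 0: no mutex profiling).
        runtime.SetMutexProfileFraction(1)

        var mu sync.Mutex
        var wg sync.WaitGroup
        for i := 0; i < 4; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for j := 0; j < 1000; j++ {
                    mu.Lock()
                    mu.Unlock()
                }
            }()
        }
        wg.Wait()

        // Runtime-internal lock contention appears here too; without
        // runtimecontentionstacks=1 it is attributed to
        // runtime._LostContendedRuntimeLock.
        pprof.Lookup("mutex").WriteTo(os.Stdout, 1)
    }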
For #57071.
Change-Id: I38aac28b2c0852db643d53b1eab3f3bc42a43393
Reviewed-on: https://go-review.googlesource.com/c/go/+/547055
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Auto-Submit: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Rhys Hiltner <rhys@justin.tv>
|
|
Move ChaCha8 code into internal/chacha8rand and use it to implement
runtime.rand, which is used for the unseeded global source for
both math/rand and math/rand/v2. This also affects the calculation of
the start point for iteration over very very large maps (when the
32-bit fastrand is not big enough).
The benefit is that misusing the global random number generators
in math/rand and math/rand/v2, in contexts where non-predictable
randomness is important for security reasons, is no longer a
security problem. This removes a common mistake among programmers
who are unaware of the different kinds of randomness.
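Illustrative only: after this change, even careless global calls like
the ones below draw from the ChaCha8-based runtime.rand rather than a
predictable generator:

    package main

    import (
        "fmt"
        "math/rand/v2"
    )

    func main() {
        // Both use the unseeded global source, now backed by the
        // per-thread ChaCha8 state in runtime.rand.
        fmt.Println(rand.Uint64())
        fmt.Println(rand.IntN(100))
    }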
The cost is an extra 304 bytes per thread stored in the m struct
plus 2-3ns more per random uint64 due to the more sophisticated
algorithm. Using PCG looks like it would cost about the same,
although I haven't benchmarked that.
Before this, the math/rand and math/rand/v2 global generator
was wyrand (https://github.com/wangyi-fudan/wyhash).
For math/rand, using wyrand instead of the Mitchell/Reeds/Thompson
ALFG was justifiable, since the latter was not any better.
But for math/rand/v2, the global generator really should be
at least as good as one of the well-studied, specific algorithms
provided directly by the package, and it's not.
(Wyrand is still reasonable for scheduling and cache decisions.)
Good randomness does have a cost: about twice wyrand.
Also rationalize the various runtime rand references.
goos: linux
goarch: amd64
pkg: math/rand/v2
cpu: AMD Ryzen 9 7950X 16-Core Processor
│ bbb48afeb7.amd64 │ 5cf807d1ea.amd64 │
│ sec/op │ sec/op vs base │
ChaCha8-32 1.862n ± 2% 1.861n ± 2% ~ (p=0.825 n=20)
PCG_DXSM-32 1.471n ± 1% 1.460n ± 2% ~ (p=0.153 n=20)
SourceUint64-32 1.636n ± 2% 1.582n ± 1% -3.30% (p=0.000 n=20)
GlobalInt64-32 2.087n ± 1% 3.663n ± 1% +75.54% (p=0.000 n=20)
GlobalInt64Parallel-32 0.1042n ± 1% 0.2026n ± 1% +94.48% (p=0.000 n=20)
GlobalUint64-32 2.263n ± 2% 3.724n ± 1% +64.57% (p=0.000 n=20)
GlobalUint64Parallel-32 0.1019n ± 1% 0.1973n ± 1% +93.67% (p=0.000 n=20)
Int64-32 1.771n ± 1% 1.774n ± 1% ~ (p=0.449 n=20)
Uint64-32 1.863n ± 2% 1.866n ± 1% ~ (p=0.364 n=20)
GlobalIntN1000-32 3.134n ± 3% 4.730n ± 2% +50.95% (p=0.000 n=20)
IntN1000-32 2.489n ± 1% 2.489n ± 1% ~ (p=0.683 n=20)
Int64N1000-32 2.521n ± 1% 2.516n ± 1% ~ (p=0.394 n=20)
Int64N1e8-32 2.479n ± 1% 2.478n ± 2% ~ (p=0.743 n=20)
Int64N1e9-32 2.530n ± 2% 2.514n ± 2% ~ (p=0.193 n=20)
Int64N2e9-32 2.501n ± 1% 2.494n ± 1% ~ (p=0.616 n=20)
Int64N1e18-32 3.227n ± 1% 3.205n ± 1% ~ (p=0.101 n=20)
Int64N2e18-32 3.647n ± 1% 3.599n ± 1% ~ (p=0.019 n=20)
Int64N4e18-32 5.135n ± 1% 5.069n ± 2% ~ (p=0.034 n=20)
Int32N1000-32 2.657n ± 1% 2.637n ± 1% ~ (p=0.180 n=20)
Int32N1e8-32 2.636n ± 1% 2.636n ± 1% ~ (p=0.763 n=20)
Int32N1e9-32 2.660n ± 2% 2.638n ± 1% ~ (p=0.358 n=20)
Int32N2e9-32 2.662n ± 2% 2.618n ± 2% ~ (p=0.064 n=20)
Float32-32 2.272n ± 2% 2.239n ± 2% ~ (p=0.194 n=20)
Float64-32 2.272n ± 1% 2.286n ± 2% ~ (p=0.763 n=20)
ExpFloat64-32 3.762n ± 1% 3.744n ± 1% ~ (p=0.171 n=20)
NormFloat64-32 3.706n ± 1% 3.655n ± 2% ~ (p=0.066 n=20)
Perm3-32 32.93n ± 3% 34.62n ± 1% +5.13% (p=0.000 n=20)
Perm30-32 202.9n ± 1% 204.0n ± 1% ~ (p=0.482 n=20)
Perm30ViaShuffle-32 115.0n ± 1% 114.9n ± 1% ~ (p=0.358 n=20)
ShuffleOverhead-32 112.8n ± 1% 112.7n ± 1% ~ (p=0.692 n=20)
Concurrent-32 2.107n ± 0% 3.725n ± 1% +76.75% (p=0.000 n=20)
goos: darwin
goarch: arm64
pkg: math/rand/v2
│ bbb48afeb7.arm64 │ 5cf807d1ea.arm64 │
│ sec/op │ sec/op vs base │
ChaCha8-8 2.480n ± 0% 2.429n ± 0% -2.04% (p=0.000 n=20)
PCG_DXSM-8 2.531n ± 0% 2.530n ± 0% ~ (p=0.877 n=20)
SourceUint64-8 2.534n ± 0% 2.533n ± 0% ~ (p=0.732 n=20)
GlobalInt64-8 2.172n ± 1% 4.794n ± 0% +120.67% (p=0.000 n=20)
GlobalInt64Parallel-8 0.4320n ± 0% 0.9605n ± 0% +122.32% (p=0.000 n=20)
GlobalUint64-8 2.182n ± 0% 4.770n ± 0% +118.58% (p=0.000 n=20)
GlobalUint64Parallel-8 0.4307n ± 0% 0.9583n ± 0% +122.51% (p=0.000 n=20)
Int64-8 4.107n ± 0% 4.104n ± 0% ~ (p=0.416 n=20)
Uint64-8 4.080n ± 0% 4.080n ± 0% ~ (p=0.052 n=20)
GlobalIntN1000-8 2.814n ± 2% 5.643n ± 0% +100.50% (p=0.000 n=20)
IntN1000-8 4.141n ± 0% 4.139n ± 0% ~ (p=0.140 n=20)
Int64N1000-8 4.140n ± 0% 4.140n ± 0% ~ (p=0.313 n=20)
Int64N1e8-8 4.140n ± 0% 4.139n ± 0% ~ (p=0.103 n=20)
Int64N1e9-8 4.139n ± 0% 4.140n ± 0% ~ (p=0.761 n=20)
Int64N2e9-8 4.140n ± 0% 4.140n ± 0% ~ (p=0.636 n=20)
Int64N1e18-8 5.266n ± 0% 5.326n ± 1% +1.14% (p=0.001 n=20)
Int64N2e18-8 6.052n ± 0% 6.167n ± 0% +1.90% (p=0.000 n=20)
Int64N4e18-8 8.826n ± 0% 9.051n ± 0% +2.55% (p=0.000 n=20)
Int32N1000-8 4.127n ± 0% 4.132n ± 0% +0.12% (p=0.000 n=20)
Int32N1e8-8 4.126n ± 0% 4.131n ± 0% +0.12% (p=0.000 n=20)
Int32N1e9-8 4.127n ± 0% 4.132n ± 0% +0.12% (p=0.000 n=20)
Int32N2e9-8 4.132n ± 0% 4.131n ± 0% ~ (p=0.017 n=20)
Float32-8 4.109n ± 0% 4.105n ± 0% ~ (p=0.379 n=20)
Float64-8 4.107n ± 0% 4.106n ± 0% ~ (p=0.867 n=20)
ExpFloat64-8 5.339n ± 0% 5.383n ± 0% +0.82% (p=0.000 n=20)
NormFloat64-8 5.735n ± 0% 5.737n ± 1% ~ (p=0.856 n=20)
Perm3-8 26.65n ± 0% 26.80n ± 1% +0.58% (p=0.000 n=20)
Perm30-8 194.8n ± 1% 197.0n ± 0% +1.18% (p=0.000 n=20)
Perm30ViaShuffle-8 156.6n ± 0% 157.6n ± 1% +0.61% (p=0.000 n=20)
ShuffleOverhead-8 124.9n ± 0% 125.5n ± 0% +0.52% (p=0.000 n=20)
Concurrent-8 2.434n ± 3% 5.066n ± 0% +108.09% (p=0.000 n=20)
goos: linux
goarch: 386
pkg: math/rand/v2
cpu: AMD Ryzen 9 7950X 16-Core Processor
│ bbb48afeb7.386 │ 5cf807d1ea.386 │
│ sec/op │ sec/op vs base │
ChaCha8-32 11.295n ± 1% 4.748n ± 2% -57.96% (p=0.000 n=20)
PCG_DXSM-32 7.693n ± 1% 7.738n ± 2% ~ (p=0.542 n=20)
SourceUint64-32 7.658n ± 2% 7.622n ± 2% ~ (p=0.344 n=20)
GlobalInt64-32 3.473n ± 2% 7.526n ± 2% +116.73% (p=0.000 n=20)
GlobalInt64Parallel-32 0.3198n ± 0% 0.5444n ± 0% +70.22% (p=0.000 n=20)
GlobalUint64-32 3.612n ± 0% 7.575n ± 1% +109.69% (p=0.000 n=20)
GlobalUint64Parallel-32 0.3168n ± 0% 0.5403n ± 0% +70.51% (p=0.000 n=20)
Int64-32 7.673n ± 2% 7.789n ± 1% ~ (p=0.122 n=20)
Uint64-32 7.773n ± 1% 7.827n ± 2% ~ (p=0.920 n=20)
GlobalIntN1000-32 6.268n ± 1% 9.581n ± 1% +52.87% (p=0.000 n=20)
IntN1000-32 10.33n ± 2% 10.45n ± 1% ~ (p=0.233 n=20)
Int64N1000-32 10.98n ± 2% 11.01n ± 1% ~ (p=0.401 n=20)
Int64N1e8-32 11.19n ± 2% 10.97n ± 1% ~ (p=0.033 n=20)
Int64N1e9-32 11.06n ± 1% 11.08n ± 1% ~ (p=0.498 n=20)
Int64N2e9-32 11.10n ± 1% 11.01n ± 2% ~ (p=0.995 n=20)
Int64N1e18-32 15.23n ± 2% 15.04n ± 1% ~ (p=0.973 n=20)
Int64N2e18-32 15.89n ± 1% 15.85n ± 1% ~ (p=0.409 n=20)
Int64N4e18-32 18.96n ± 2% 19.34n ± 2% ~ (p=0.048 n=20)
Int32N1000-32 10.46n ± 2% 10.44n ± 2% ~ (p=0.480 n=20)
Int32N1e8-32 10.46n ± 2% 10.49n ± 2% ~ (p=0.951 n=20)
Int32N1e9-32 10.28n ± 2% 10.26n ± 1% ~ (p=0.431 n=20)
Int32N2e9-32 10.50n ± 2% 10.44n ± 2% ~ (p=0.249 n=20)
Float32-32 13.80n ± 2% 13.80n ± 2% ~ (p=0.751 n=20)
Float64-32 23.55n ± 2% 23.87n ± 0% ~ (p=0.408 n=20)
ExpFloat64-32 15.36n ± 1% 15.29n ± 2% ~ (p=0.316 n=20)
NormFloat64-32 13.57n ± 1% 13.79n ± 1% +1.66% (p=0.005 n=20)
Perm3-32 45.70n ± 2% 46.99n ± 2% +2.81% (p=0.001 n=20)
Perm30-32 399.0n ± 1% 403.8n ± 1% +1.19% (p=0.006 n=20)
Perm30ViaShuffle-32 349.0n ± 1% 350.4n ± 1% ~ (p=0.909 n=20)
ShuffleOverhead-32 322.3n ± 1% 323.8n ± 1% ~ (p=0.410 n=20)
Concurrent-32 3.331n ± 1% 7.312n ± 1% +119.50% (p=0.000 n=20)
For #61716.
Change-Id: Ibdddeed85c34d9ae397289dc899e04d4845f9ed2
Reviewed-on: https://go-review.googlesource.com/c/go/+/516860
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Filippo Valsorda <filippo@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
The exitsyscall path, since the introduction of the new execution
tracer, stores just a little bit more data in the exitsyscall stack
frame, causing a build failure from exceeding the nosplit limit with
'-N -l' set on all packages (like Delve does).
One of the paths through which this fails is "throw" from wirep, called
by a callee of exitsyscall. By switching to the systemstack on this
path, we can avoid hitting the nosplit limit, fixing the build. It's
also not totally unreasonable to switch to the systemstack for the
throws in this function, since the function has to be nosplit anyway. It
gives the throw path a bit more wiggle room to dump information than it
otherwise would have.
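A sketch of the pattern, as I understand it from the description above
(runtime-internal code; the throw message is illustrative):

    // Inside a nosplit function such as wirep, running the throw on
    // the system stack keeps the throw path's frames from counting
    // against the caller's nosplit stack budget:
    systemstack(func() {
        throw("p already wired")
    })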
Fixes #64113.
Change-Id: I56e94e40614a202b8ac2fdc8b8b731493b74e5d0
Reviewed-on: https://go-review.googlesource.com/c/go/+/544535
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
|
|
This is required by traceThreadDestroy, though it's not strictly
necessary in this case. The requirement to hold sched.lock comes from
the assumption that traceThreadDestroy is getting called when the thread
leaves the tracer's view, but in this case the extra m that dropm is
dropping never leaves the allm list. Nevertheless, traceThreadDestroy
requires it just as a safety measure, and that's reasonable. dropm is
generally rare on pthread platforms, so the extra lock acquire over this
short critical section (and only when tracing is enabled) is fine.
Change-Id: Ib631820963c74f2f087d14a0067d0441d75d6785
Reviewed-on: https://go-review.googlesource.com/c/go/+/544396
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
|
|
Currently entersyscall_gcwait always emits a ProcStop event. Most of the
time, this is correct, since the thread that just put the P into
_Psyscall is the same one that is putting it into _Pgcstop. However it's
possible for another thread to steal the P, start running a goroutine,
and then enter another syscall, putting the P back into _Psyscall. In
this case ProcStop is incorrect; the P is getting stolen. This leads to
broken traces.
Fix this by always emitting a ProcSteal event from entersyscall_gcwait.
This means that most of the time a thread will be 'stealing' the proc
from itself when it enters this function, but that's theoretically fine.
A ProcSteal is really just a fancy ProcStop.
Well, it would be if the parser correctly handled a self-steal. This is
a minor bug that just never came up before, but it's an update order
error (the mState is looked up and modified, but then it's modified
again at the end of the function to match newCtx). There's really no
reason a self-steal shouldn't be allowed, so fix that up and add a test.
Change-Id: Iec3d7639d331e3f2d127f92ce50c2c4a7818fcd3
Reviewed-on: https://go-review.googlesource.com/c/go/+/544215
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
|
|
Add runtime-internal locks to the mutex contention profile.
Store up to one call stack responsible for lock contention on the M,
until it's safe to contribute its value to the mprof table. Try to use
that limited local storage space for a relatively large source of
contention, and attribute any contention in stacks we're not able to
store to a sentinel _LostContendedLock function.
Avoid ballooning lock contention while manipulating the mprof table by
attributing to that sentinel function any lock contention experienced
while reporting lock contention.
Guard collecting real call stacks with GODEBUG=profileruntimelocks=1,
since the available data has mixed semantics; we can easily capture an
M's own wait time, but we'd prefer for the profile entry of each
critical section to describe how long it made the other Ms wait. It's
too late in the Go 1.22 cycle to make the required changes to
futex-based locks. When not enabled, attribute the time to the sentinel
function instead.
Fixes #57071
This is a roll-forward of https://go.dev/cl/528657, which was reverted
in https://go.dev/cl/543660
Changes from the reverted CL: de-flakes tests (reduces dependence on fine-grained
timers, correctly identifies contention on big-endian futex locks,
attempts to measure contention in the semaphore implementation but only
uses that secondary measurement to finish the test early, skips tests on
single-processor systems)
Change-Id: I31389f24283d85e46ad9ba8d4f514cb9add8dfb0
Reviewed-on: https://go-review.googlesource.com/c/go/+/544195
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Than McIntosh <thanm@google.com>
Auto-Submit: Rhys Hiltner <rhys@justin.tv>
Run-TryBot: Rhys Hiltner <rhys@justin.tv>
|
|
Apparently, on Windows, throwing an exception on a non-system-
allocated crash stack causes EXCEPTION_STACK_OVERFLOW and hangs
the process (see issue #63938). Disable crash stack for now, which
gets us back to the behavior of Go 1.21.
Fixes #63938.
Change-Id: I4c090315b93b484e756b242f0de7a9e02f199261
Reviewed-on: https://go-review.googlesource.com/c/go/+/543996
Reviewed-by: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Cherry Mui <cherryyz@google.com>
Reviewed-by: Quim Muntal <quimmuntal@gmail.com>
|
|
Currently if morestack on g0 happens the wasm runtime prints
"RuntimeError: memory access out of bounds", which is quite misleading.
By switching to a crash stack we can get better stacktraces
for the error.
There is no way to automate tests for this feature on wasm, since
TestG0StackOverflow relies on spawning a subprocess which is not
supported by the wasm port.
The way I got this tested manually is to comment everything in
TestG0StackOverflow, leaving just runtime.G0StackOverflow().
Then it is a matter of invoking the test:
GOOS=js GOARCH=wasm go test runtime -v -run=TestG0StackOverflow
Change-Id: If450f3ee5209bb32efc1abd0a34b1cc4a29d0c46
GitHub-Last-Rev: 0d7c396e4cfeadc1188cae2b55661af10c8189e7
GitHub-Pull-Request: golang/go#63956
Reviewed-on: https://go-review.googlesource.com/c/go/+/539995
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
Run-TryBot: Cherry Mui <cherryyz@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
|
|
This reverts commit go.dev/cl/528657.
Reason for revert: broke a lot of builders.
Change-Id: I70c33062020e997c4df67b3eaa2e886cf0da961e
Reviewed-on: https://go-review.googlesource.com/c/go/+/543660
Reviewed-by: Than McIntosh <thanm@google.com>
Auto-Submit: Matthew Dempsky <mdempsky@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
Change-Id: I8d1011509c4f0f529e97055280606603747a2e1a
Reviewed-on: https://go-review.googlesource.com/c/go/+/541775
Reviewed-by: Mauri de Souza Meneguzzo <mauri870@gmail.com>
Reviewed-by: Paul Murphy <murp@ibm.com>
Reviewed-by: David Chase <drchase@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Run-TryBot: Joel Sing <joel@sing.id.au>
|
|
Add runtime-internal locks to the mutex contention profile.
Store up to one call stack responsible for lock contention on the M,
until it's safe to contribute its value to the mprof table. Try to use
that limited local storage space for a relatively large source of
contention, and attribute any contention in stacks we're not able to
store to a sentinel _LostContendedLock function.
Avoid ballooning lock contention while manipulating the mprof table by
attributing to that sentinel function any lock contention experienced
while reporting lock contention.
Guard collecting real call stacks with GODEBUG=profileruntimelocks=1,
since the available data has mixed semantics; we can easily capture an
M's own wait time, but we'd prefer for the profile entry of each
critical section to describe how long it made the other Ms wait. It's
too late in the Go 1.22 cycle to make the required changes to
futex-based locks. When not enabled, attribute the time to the sentinel
function instead.
Fixes #57071
Change-Id: I3eee0ccbfc20f333b56f20d8725dfd7f3a526b41
Reviewed-on: https://go-review.googlesource.com/c/go/+/528657
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Auto-Submit: Rhys Hiltner <rhys@justin.tv>
Reviewed-by: Than McIntosh <thanm@google.com>
|
|
This CL adds four new time histogram metrics:
/sched/pauses/stopping/gc:seconds
/sched/pauses/stopping/other:seconds
/sched/pauses/total/gc:seconds
/sched/pauses/total/other:seconds
The "stopping" metrics measure the time taken to start a stop-the-world
pause; i.e., how long it takes stopTheWorldWithSema to stop all Ps.
This can be used to detect STW struggling to preempt Ps.
The "total" metrics measure the total duration of a stop-the-world
pause, from starting to stop-the-world until the world is started again.
This includes the time spent in the "start" phase.
The "gc" metrics are used for GC-related STW pauses. The "other" metrics
are used for all other STW pauses.
All of these metrics start timing in stopTheWorldWithSema only after
successfully acquiring sched.lock, thus excluding lock contention on
sched.lock. The reasoning behind this is that while waiting on
sched.lock the world is not stopped at all (all other Ps can run), so
the impact of this contention is primarily limited to the goroutine
attempting to stop-the-world. Additionally, we already have some
visibility into sched.lock contention via contention profiles (#57071).
/sched/pauses/total/gc:seconds is conceptually equivalent to
/gc/pauses:seconds, so the latter is marked as deprecated and returns
the same histogram as the former.
In the implementation, there are a few minor differences:
* For both mark and sweep termination stops, /gc/pauses:seconds started
timing prior to calling stopTheWorldWithSema, thus including lock
contention.
These details are minor enough that I do not believe the slight change
in reporting will matter. For mark termination stops, moving the end of
timing into startTheWorldWithSema does have the side effect of requiring moving
other GC metric calculations outside of the STW, as they depend on the
same end time.
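For reference, a small program reading the new histograms through the
runtime/metrics API (the metric names are exactly those listed above):

    package main

    import (
        "fmt"
        "runtime"
        "runtime/metrics"
    )

    func main() {
        names := []string{
            "/sched/pauses/stopping/gc:seconds",
            "/sched/pauses/stopping/other:seconds",
            "/sched/pauses/total/gc:seconds",
            "/sched/pauses/total/other:seconds",
        }
        samples := make([]metrics.Sample, len(names))
        for i := range names {
            samples[i].Name = names[i]
        }
        runtime.GC() // force at least one GC-related STW pause
        metrics.Read(samples)
        for _, s := range samples {
            h := s.Value.Float64Histogram()
            var n uint64
            for _, c := range h.Counts {
                n += c
            }
            fmt.Printf("%s: %d pauses recorded\n", s.Name, n)
        }
    }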
Fixes #63340
Change-Id: Iacd0bab11bedab85d3dcfb982361413a7d9c0d05
Reviewed-on: https://go-review.googlesource.com/c/go/+/534161
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Auto-Submit: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
This change mostly implements the design described in #60773 and
includes a new scalable parser for the new trace format, available in
internal/trace/v2. I'll leave this commit message short because this is
clearly an enormous CL with a lot of detail.
This change does not hook up the new tracer into cmd/trace yet. A
follow-up CL will handle that.
For #60773.
Cq-Include-Trybots: luci.golang.try:gotip-linux-amd64-longtest,gotip-linux-amd64-longtest-race
Change-Id: I5d2aca2cc07580ed3c76a9813ac48ec96b157de0
Reviewed-on: https://go-review.googlesource.com/c/go/+/494187
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
Currently tickspersecond forces a 100 millisecond sleep the first time
it's called. This isn't great for profiling short-lived programs, since
both CPU profiling and block profiling might call into it.
100 milliseconds is a long time, but it's chosen to try and capture a
decent estimate of the conversion on platforms with coarse-granularity
clocks. If the granularity is 15 ms, it'll only be 15% off at worst.
Let's try a different strategy. First, let's require 5 milliseconds of
time to have elapsed at a minimum. This should be plenty on platforms
with nanosecond time granularity from the system clock, provided the
caller of tickspersecond intends to use it for calculating durations,
not timestamps. Next, grab a timestamp as close to process start as
possible, so that we can cover some of those 5 milliseconds just during
runtime start.
Finally, this function is only ever called from normal goroutine
contexts. Let's do a regular goroutine sleep instead of a thread-level
sleep under a runtime lock, which has all sorts of nasty effects on
preemption.
While we're here, let's also rename tickspersecond to ticksPerSecond.
Also, let's write down some explicit rules of thumb on when to use this
function. Clocks are hard, and using this for timestamp conversion is
likely to make lining up those timestamps with other clocks on the
system difficult if not impossible.
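A toy illustration of the "durations, not timestamps" rule (not the
runtime's code; the 1 GHz tick rate is made up):

    package main

    import (
        "fmt"
        "time"
    )

    // ticksToDuration is the intended use of a ticks-per-second
    // estimate: converting a tick *delta* into a duration. Deriving
    // absolute timestamps from it invites misalignment with every
    // other clock on the system. (Beware overflow for huge deltas;
    // this is a sketch.)
    func ticksToDuration(deltaTicks, ticksPerSec int64) time.Duration {
        return time.Duration(deltaTicks * int64(time.Second) / ticksPerSec)
    }

    func main() {
        fmt.Println(ticksToDuration(3_000_000, 1_000_000_000)) // 3ms at 1 GHz
    }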
Note that while this improves ticksPerSecond on platforms with good
clocks, we still end up with a pretty coarse sleep on platforms with
coarse clocks, and a pretty coarse result. On these platforms, keep the
minimum required elapsed time at 100 ms. There's not much we can do
about these platforms except spin and try to catch the clock boundary,
but at 10+ ms of granularity, that might be a lot of spinning.
Fixes #63103.
Fixes #63078.
Change-Id: Ic32a4ba70a03bdf5c13cb80c2669c4064aa4cca2
Reviewed-on: https://go-review.googlesource.com/c/go/+/538898
Auto-Submit: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Mauri de Souza Meneguzzo <mauri870@gmail.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
|
|
Currently any thread that tries to get the attention of all Ps (e.g.
stopTheWorldWithSema and forEachP) ends up in a non-preemptible state
waiting to preempt another thread. Thing is, that other thread might
also be in a non-preemptible state, trying to preempt the first thread,
resulting in a deadlock.
This is a general problem, but in practice it only boils down to one
specific scenario: a thread in GC is blocked trying to preempt a
goroutine to scan its stack while that goroutine is blocked in a
non-preemptible state to get the attention of all Ps.
There's currently a hack in a few places in the runtime to move the
calling goroutine into _Gwaiting before it goes into a non-preemptible
state to preempt other threads. This lets the GC scan its stack because
the goroutine is trivially preemptible. The only restriction is that
forEachP and stopTheWorldWithSema absolutely cannot reference the
calling goroutine's stack. This is generally not necessary, so things
are good.
Anyway, to avoid exposing the details of this hack, this change creates
a safer wrapper around forEachP (and then renames it to forEachP and the
existing one to forEachPInternal) that performs the goroutine status
change, just like stopTheWorld does. We're going to need to use this
hack with forEachP in the new tracer, so this avoids propagating the
hack further and leaves it as an implementation detail.
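A simplified sketch of what the wrapper might look like
(runtime-internal; the wait reason name is a placeholder):

    func forEachP(fn func(*p)) {
        systemstack(func() {
            gp := getg().m.curg
            // Park ourselves in _Gwaiting so the GC can scan this
            // goroutine's stack, which forEachPInternal must
            // therefore never reference.
            casGToWaiting(gp, _Grunning, waitReasonForEachP) // placeholder
            forEachPInternal(fn)
            casgstatus(gp, _Gwaiting, _Grunning)
        })
    }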
Change-Id: I51f02e8d8e0a3172334d23787e31abefb8a129ab
Reviewed-on: https://go-review.googlesource.com/c/go/+/533455
Auto-Submit: Michael Knyszek <mknyszek@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
|
|
Currently the execution tracer synchronizes with itself using very
heavyweight operations. As a result, it's totally fine for most of the
tracer code to look like:
    if traceEnabled() {
        traceXXX(...)
    }
However, if we want to make that synchronization more lightweight (as
issue #60773 proposes), then this is insufficient. In particular, we
need to make sure the tracer can't observe an inconsistency between a g's
atomicstatus and the event that would be emitted for a particular
g transition. This means making the g status change appear to happen
atomically with the corresponding trace event being written out from the
perspective of the tracer.
This requires a change in API to something more like a lock. While we're
here, we might as well make sure that trace events can *only* be emitted
while this lock is held. This change introduces such an API:
traceAcquire, which returns a value that can emit events, and
traceRelease, which requires the value that was returned by
traceAcquire. In practice, this won't be a real lock, it'll be more like
a seqlock.
For the current tracer, this API is completely overkill and the value
returned by traceAcquire basically just checks trace.enabled. But it's
necessary for the tracer described in #60773 and we can implement that
more cleanly if we do this refactoring now instead of later.
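A sketch of the before/after shape of the API described here
(runtime-internal; the method names are illustrative):

    // Before: check-then-emit, racy against status observation.
    if traceEnabled() {
        traceGoUnpark(gp, skip)
    }

    // After (sketch): the value returned by traceAcquire is the only
    // way to emit events, so the status change and its event appear
    // atomic from the tracer's perspective.
    trace := traceAcquire()
    casgstatus(gp, _Gwaiting, _Grunnable)
    if trace.ok() {
        trace.GoUnpark(gp, skip)
        traceRelease(trace)
    }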
For #60773.
Change-Id: Ibb9ff5958376339fafc2b5180aef65cf2ba18646
Reviewed-on: https://go-review.googlesource.com/c/go/+/515635
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
|
|
Change-Id: I240ea7dd6430f4c89cfdadbfa790e4a70a4fd79d
GitHub-Last-Rev: 585742b5eebad7c03ec313b590e2171491b78b37
GitHub-Pull-Request: golang/go#63905
Reviewed-on: https://go-review.googlesource.com/c/go/+/539295
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Heschi Kreinick <heschi@google.com>
Run-TryBot: Mauri de Souza Meneguzzo <mauri870@gmail.com>
|
|
Change-Id: Ib89a71e20f9c6b86c97814c75cb427e9bd7075e5
Reviewed-on: https://go-review.googlesource.com/c/go/+/538735
Reviewed-by: David Chase <drchase@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Run-TryBot: Joel Sing <joel@sing.id.au>
|
|
On some platforms (notably OpenBSD), stacks must be specifically allocated
and marked as being stack memory. Allocate the crash stack using stackalloc,
which ensures these requirements are met, rather than using a global Go
variable.
Fixes #63794
Change-Id: I6513575797dd69ff0a36f3bfd4e5fc3bd95cbf50
Reviewed-on: https://go-review.googlesource.com/c/go/+/538457
Run-TryBot: Joel Sing <joel@sing.id.au>
Reviewed-by: Bryan Mills <bcmills@google.com>
Reviewed-by: Mauri de Souza Meneguzzo <mauri870@gmail.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
|
|
An error like "morestack on g0" is one of the hardest to debug,
because it often doesn't print a useful stack trace.
The runtime doesn't directly print a stack trace because the stack
is in too bad a state to safely call print. Sometimes the SIGABRT
may trigger a traceback, but sometimes not, especially in a cgo
binary. Even if
it triggers a traceback it often does not include the stack trace
of the bad stack.
This CL makes it explicitly print a stack trace and throw. The
idea is to have some space as an "emergency" crash stack. When the
stack is in a really bad state, we switch to the crash stack and
do a traceback.
Currently only implemented on AMD64 and ARM64.
TODO: also handle errors like "morestack on gsignal" and bad
systemstack. Also handle other architectures.
Change-Id: Ibfc397202f2bb0737c5cbe99f2763de83301c1c1
Reviewed-on: https://go-review.googlesource.com/c/go/+/419435
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
|
|
After CL 527715, needm uses callbackUpdateSystemStack to set the stack
bounds for g0 on an M from the extra M list. Since
callbackUpdateSystemStack is also used for recursive cgocallback, it
does nothing if the stack is already in bounds.
Currently, the stack bounds in an extra M may contain stale bounds from
a previous thread that used this M and then returned it to the extra
list in dropm.
Typically a new thread will not have an overlapping stack with an old
thread, but because the old thread has exited there is a small chance
that the C memory allocator will allocate the new thread's stack
partially or fully overlapping with the old thread's stack.
If this occurs, then callbackUpdateSystemStack will not update the stack
bounds. If in addition, the overlap is partial such that SP on
cgocallback is close to the recorded stack lower bound, then Go may
quickly "overflow" the stack and crash with "morestack on g0".
Fix this by clearing the stack bounds in dropm, which ensures that
callbackUpdateSystemStack will unconditionally update the bounds in
needm.
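A sketch of the fix as described (runtime-internal, simplified):

    // In dropm, poison the g0 stack bounds so the next needm on this
    // M cannot mistake a new C thread's stack for the old one's,
    // forcing callbackUpdateSystemStack to recompute them:
    g0 := mp.g0
    g0.stack.hi = 0
    g0.stack.lo = 0
    g0.stackguard0 = 0
    g0.stackguard1 = 0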
For #62440.
Change-Id: Ic9e2052c2090dd679ed716d1a23a86d66cbcada7
Reviewed-on: https://go-review.googlesource.com/c/go/+/537695
Reviewed-by: Cherry Mui <cherryyz@google.com>
Run-TryBot: Michael Pratt <mpratt@google.com>
Auto-Submit: Michael Pratt <mpratt@google.com>
TryBot-Bypass: Michael Pratt <mpratt@google.com>
|
|
Currently, for the shared build mode, we don't generate the module
inittasks. Instead, we rely on the main executable to do the
initialization, for both the executable and the shared library.
But, with the model as of CL 478916, the main executable only
has relocations to packages that are directly imported. It won't
see the dependency edges between packages within a shared library.
Therefore indirect dependencies are not included, and thus not
initialized. E.g. main imports a, which imports b, but main
doesn't directly import b. a and b are in a shared object. When
linking main, it sees main depends on a, so it generates main's
inittasks to run a's init before main's, but it doesn't know b,
so b's init doesn't run.
This CL makes it initialize all packages in a shared library when
the library is loaded, as any of them could potentially be
imported, directly or indirectly.
Also, in the runtime, when running the init functions, make sure
to go through the DSOs in dependency order. Otherwise packages
can be initialized in the wrong order.
Fixes #61973.
Change-Id: I2a090336fe9fa0d6c7e43912f3ab233c9c47e247
Reviewed-on: https://go-review.googlesource.com/c/go/+/520375
Reviewed-by: Than McIntosh <thanm@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
[This is a redo of CL 525455 with the test fixed on darwin by defining
_XOPEN_SOURCE, and disabled with android, musl, and openbsd, which do
not provide getcontext.]
Since CL 495855, Ms are cached for C threads calling into Go, including
the stack bounds of the system stack.
Some C libraries (e.g., coroutine libraries) do manual stack management
and may change stacks between calls to Go on the same thread.
Changing the stack if there is more Go up the stack would be
problematic. But if the calls are completely independent there is no
particular reason for Go to care about the changing stack boundary.
Thus, this CL allows the stack bounds to change in such cases. The
primary downside here (besides additional complexity) is that normal
systems that do not manipulate the stack may not notice unintentional
stack corruption as quickly as before.
Note that callbackUpdateSystemStack is written to be usable for the
initial setup in needm as well as updating the stack in cgocallbackg.
Fixes #62440.
For #62130.
Change-Id: I0fe0134f865932bbaff1fc0da377c35c013bd768
Reviewed-on: https://go-review.googlesource.com/c/go/+/527715
Run-TryBot: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Auto-Submit: Michael Pratt <mpratt@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
|
|
This reverts CL 525455. The test fails to build on darwin, alpine, and
android.
For #62440.
Change-Id: I39c6b1e16499bd61e0f166de6c6efe7a07961e62
Reviewed-on: https://go-review.googlesource.com/c/go/+/527317
Auto-Submit: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Bryan Mills <bcmills@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
|
|
Since CL 495855, Ms are cached for C threads calling into Go, including
the stack bounds of the system stack.
Some C libraries (e.g., coroutine libraries) do manual stack management
and may change stacks between calls to Go on the same thread.
Changing the stack if there is more Go up the stack would be
problematic. But if the calls are completely independent there is no
particular reason for Go to care about the changing stack boundary.
Thus, this CL allows the stack bounds to change in such cases. The
primary downside here (besides additional complexity) is that normal
systems that do not manipulate the stack may not notice unintentional
stack corruption as quickly as before.
Note that callbackUpdateSystemStack is written to be usable for the
initial setup in needm as well as updating the stack in cgocallbackg.
Fixes #62440.
For #62130.
Change-Id: I7841b056acea1111bdae3b718345a3bd3961b4a8
Reviewed-on: https://go-review.googlesource.com/c/go/+/525455
Run-TryBot: Michael Pratt <mpratt@google.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
|
|
Currently, for non-cgo programs, the g0 stack size is 8 KiB on
most platforms. With PGO, which can cause aggressive inlining in
the runtime, the runtime stack frames are larger and could
overflow the 8 KiB g0 stack. Increase it to 16 KiB. There is only
one g0 stack per OS thread, so this shouldn't increase memory use much.
Fixes #62120.
Fixes #62489.
Change-Id: I565b154517021f1fd849424dafc3f0f26a755cac
Reviewed-on: https://go-review.googlesource.com/c/go/+/526995
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
|
|
When a goroutine stack is reused on arm64, the spot on the stack where
the "caller's" frame pointer goes for the topmost frame should be
explicitly zeroed. Otherwise, the frame pointer check in adjustframe
with debugCheckBP enabled will fail on the topmost frame of a call stack
the first time a reused stack is grown.
Updates #39524, #58432
Change-Id: Ic1210dc005e3ecdbf9cd5d7b98846566e56df8f5
Reviewed-on: https://go-review.googlesource.com/c/go/+/481636
Reviewed-by: Felix Geisendörfer <felix.geisendoerfer@datadoghq.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Auto-Submit: Dmitri Shuralyov <dmitshur@golang.org>
|
|
Rather than testing for architectures that use libc-based system calls,
test that it is not the single architecture on which Go still uses
direct system calls. This reduces the number of changes needed for new openbsd
ports.
Updates #36435
Updates #61546
Change-Id: I79c4597c629b8b372e9efcda79e8f6ff778b9e8e
Reviewed-on: https://go-review.googlesource.com/c/go/+/516016
Reviewed-by: Ian Lance Taylor <iant@google.com>
Auto-Submit: Ian Lance Taylor <iant@google.com>
Run-TryBot: Ian Lance Taylor <iant@google.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Run-TryBot: Joel Sing <joel@sing.id.au>
TryBot-Result: Gopher Robot <gobot@golang.org>
|
|
On Unix-like platforms, enforce that the standard file descriptors (0,
1, 2) are always open during initialization. If any of the FDs are
closed, we open them pointing at /dev/null, or fail.
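A user-space sketch of the same technique using golang.org/x/sys/unix;
the runtime does this internally with raw syscalls, so this is only to
show the idea:

    package main

    import "golang.org/x/sys/unix"

    // ensureStdFDs opens /dev/null for any of fds 0-2 that are
    // closed, so later opens cannot accidentally land on a standard
    // descriptor.
    func ensureStdFDs() error {
        for fd := 0; fd <= 2; fd++ {
            if _, err := unix.FcntlInt(uintptr(fd), unix.F_GETFD, 0); err == unix.EBADF {
                // open(2) returns the lowest free descriptor, which
                // must be exactly the one we found closed.
                nfd, err := unix.Open("/dev/null", unix.O_RDWR, 0)
                if err != nil {
                    return err
                }
                if nfd != fd {
                    return unix.EBADF
                }
            }
        }
        return nil
    }

    func main() {
        if err := ensureStdFDs(); err != nil {
            panic(err)
        }
    }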
Fixes #60641
Change-Id: Iaab6b3f3e5ca44006ae3ba3544d47da9a613f58f
Reviewed-on: https://go-review.googlesource.com/c/go/+/509020
Reviewed-by: Michael Pratt <mpratt@google.com>
Run-TryBot: Roland Shoemaker <roland@golang.org>
Auto-Submit: Roland Shoemaker <roland@golang.org>
Reviewed-by: Ian Lance Taylor <iant@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
|
|
goschedImpl transitions the current goroutine from _Grunning to
_Grunnable and places it on the global run queue before calling into
schedule.
It does _not_ call wakep after adding to the global run queue. I
believe the intuition behind skipping wakep is that since we are
immediately calling into the scheduler, we don't need to wake anything
to run this work. Unfortunately, this intuition is not correct, as it
breaks coordination with spinning Ms [1].
Consider this example scenario:
Initial conditions:
M0: Running P0, G0
M1: Spinning, holding P1 and looking for work
Timeline:
M1: Fails to find work; drops P
M0: newproc adds G1 to P0 runq
M0: does not wakep because there is a spinning M
M1: clear mp.spinning, decrement sched.nmspinning (now in "delicate dance")
M1: check sched.runqsize -> no global runq work
M0: gosched preempts G0; adds G0 to global runq
M0: does not wakep because gosched doesn't wakep
M0: schedules G1 from P0 runq
M1: check P0 runq -> no work
M1: no work -> park
G0 is stranded on the global runq with no M/P looking to run it. This is
a loss of work conservation.
As a result, G0 will have unbounded* scheduling delay, only getting
scheduled when G1 yields. Even once G1 yields, we still won't start
another P, so both G0 and G1 will switch back and forth sharing one P
when they should start another.
*The caveat to this is that today sysmon will preempt G1 after 10ms,
effectively capping the scheduling delay to 10ms, but not solving the P
underutilization problem. Sysmon's behavior here is theoretically
unnecessary, as our work conservation guarantee should allow sysmon to
avoid preemption if there are any idle Ps. Issue #60693 tracks changing
this behavior and the challenges involved.
[1] It would be OK if we unconditionally entered the scheduler as a
spinning M ourselves, as that would require schedule to call wakep when
it finds work in case there is more work.
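A sketch of the fix (runtime-internal, simplified; the mainStarted
guard follows wakep's usual callers):

    // goschedImpl, after moving gp to the global run queue:
    lock(&sched.lock)
    globrunqput(gp)
    unlock(&sched.lock)
    if mainStarted {
        wakep() // preserve work conservation before entering schedule
    }
    schedule()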
Fixes #55160.
Change-Id: I2f44001239564b56ea30212553ab557051d22588
Reviewed-on: https://go-review.googlesource.com/c/go/+/501976
Reviewed-by: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Michael Pratt <mpratt@google.com>
Auto-Submit: Michael Pratt <mpratt@google.com>
|
|
When a thread transitions from spinning to non-spinning it must recheck
all sources of work because other threads may submit new work but skip
wakep because they see a spinning thread.
However, since the beginning of time (CL 7314062) we do not check the
global run queue, only the local per-P run queues.
The global run queue is checked just above the spinning checks while
dropping the P. I am unsure what the purpose of this check is. It
appears to simply be opportunistic since sched.lock is already held
there in order to drop the P. It is not sufficient to synchronize with
threads adding work because it occurs before decrementing
sched.nmspinning, which is what threads use to decide whether to wake a thread.
Resolve this by adding an explicit global run queue check alongside the
local per-P run queue checks.
Almost nothing happens between dropping sched.lock after releasing the P
and relocking sched.lock: just clearing mp.spinning and decrementing
sched.nmspinning. Thus it may be better to just hold sched.lock for this
entire period, but this is a larger change that I would prefer to avoid
in the freeze and backports.
For #55160.
Change-Id: Ifd88b5a4c561c063cedcfcfe1dd8ae04202d9666
Reviewed-on: https://go-review.googlesource.com/c/go/+/501975
Run-TryBot: Michael Pratt <mpratt@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Auto-Submit: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
|
|
The runtime was adjusting netpollWaiters before the waiting
goroutines were marked as ready. This could cause the scheduler
to report a deadlock because there were no goroutines ready to run.
Keeping netpollWaiters non-zero ensures that at least one goroutine
will call netpoll(-1) from findRunnable.
This does mean that if a program has network activity for a while
and then never has it again, and also has no timers, then we can leave
an M stranded in a call to netpoll from which it will never return.
At least this won't be a common case. And it's not new; this has been
a potential problem for some time.
Fixes #61454
Change-Id: I17c7f891c2bb1262fda12c6929664e64686463c8
Reviewed-on: https://go-review.googlesource.com/c/go/+/511455
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Ian Lance Taylor <iant@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Auto-Submit: Ian Lance Taylor <iant@golang.org>
Reviewed-by: Heschi Kreinick <heschi@google.com>
|
|
When reusing a g struct the runtime did not reset
g.raceignore. Initialize raceignore to zero when initially
setting racectx.
A goroutine can end with a non-zero raceignore if it exits
after calling runtime.RaceDisable without a matching
runtime.RaceEnable. If that goroutine's g is later reused
the race detector is in a weird state: the underlying
g.racectx is active, yet g.raceignore is non-zero, and
raceacquire/racerelease which check g.raceignore become
no-ops. This causes the race detector to report races when
there are none.
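For context, the API whose mismatched use triggers this (the program
must be built with -race; the comments mark the failure mode fixed
here):

    package main

    import "runtime"

    func main() {
        done := make(chan struct{})
        go func() {
            runtime.RaceDisable()
            // ...racy code we deliberately hide from the detector...
            // Exiting without the matching RaceEnable leaves
            // g.raceignore non-zero; before this fix, a reused g
            // could then misreport races.
            runtime.RaceEnable()
            close(done)
        }()
        <-done
    }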
Fixes #60934
Change-Id: Ib8e412f11badbaf69a480f03740da70891f4093f
Reviewed-on: https://go-review.googlesource.com/c/go/+/505055
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
|
|
GODEBUG=dontfreezetheworld=1 allows goroutines to continue execution
during fatal panic. This increases the chance that tracebackothers will
encounter running goroutines that it must skip, which is expected and
fine. However, it also introduces the risk that a goroutine transitions
from stopped to running in the middle of traceback, which is unsafe and
may cause traceback crashes.
Mitigate this by halting M execution if it naturally enters the
scheduler. This ensures that goroutines cannot transition from stopped
to running after freezetheworld. We simply deadlock rather than using
gcstopm, to keep disturbance to scheduler state to a minimum.
Change-Id: I9aa8d84abf038ae17142f34f4384e920b1490e81
Reviewed-on: https://go-review.googlesource.com/c/go/+/501255
Auto-Submit: Michael Pratt <mpratt@google.com>
Reviewed-by: Austin Clements <austin@google.com>
Run-TryBot: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
|
|
On Unix platforms, the runtime previously did nothing special when a
program was run with either the SUID or SGID bits set. This can be
dangerous in certain cases, such as when dumping memory state, or
assuming the status of standard i/o file descriptors.
Taking cues from glibc, this change implements a set of protections when
a binary is run with SUID or SGID bits set (or is SUID/SGID-like). On
Linux, whether to enable these protections is determined by whether the
AT_SECURE flag is passed in the auxiliary vector. On platforms which
have the issetugid syscall (the BSDs, darwin, and Solaris/Illumos), that
is used. On the remaining platforms (currently only AIX) we check
!(getuid() == geteuid() && getgid() == getegid()).
Currently when we determine a binary is "tainted" (using the glibc
terminology), we implement two specific protections:
1. we check if the file descriptors 0, 1, and 2 are open, and if they
are not, we open them, pointing at /dev/null (or fail).
2. we force GOTRACEBACK=none, and generally prevent dumping of
tracebacks and registers when a program panics/aborts.
In the future we may add additional protections.
This change requires implementing issetugid on the platforms which
support it, and implementing getuid, geteuid, getgid, and getegid on
AIX.
Thanks to Vincent Dehors from Synacktiv for reporting this issue.
Fixes #60272
Fixes CVE-2023-29403
Change-Id: I73fc93f2b7a8933c192ce3eabbf1db359db7d5fa
Reviewed-on: https://team-review.git.corp.google.com/c/golang/go-private/+/1878434
Reviewed-by: Damien Neil <dneil@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Run-TryBot: Roland Shoemaker <bracewell@google.com>
Reviewed-by: Russ Cox <rsc@google.com>
Reviewed-on: https://go-review.googlesource.com/c/go/+/501223
Run-TryBot: David Chase <drchase@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
|
|
Currently lockextra always increments extraMInUse, even if the M won't
be used (or doesn't even exist), such as in addExtraM. addExtraM fails
to decrement extraMInUse, so it stays elevated forever.
Fix this bug and simplify the model by moving extraMInUse out of
lockextra to getExtraM, where we know the M will actually be used.
While we're here, remove the nilokay argument from getExtraM, which is
always false.
Fixes #60540.
Change-Id: I7a5d97456b3bc6ea1baeb06b5b2975e3b8dd96a0
Reviewed-on: https://go-review.googlesource.com/c/go/+/499677
Reviewed-by: Cherry Mui <cherryyz@google.com>
Auto-Submit: Michael Pratt <mpratt@google.com>
Run-TryBot: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
|
|
This change caches the *pinner on the P to pool it and reduce the chance
that a new allocation is made. It also makes the *pinner no longer drop
its refs array on unpin, also to avoid reallocating.
The Pinner benchmark results before and after this CL are attached at
the bottom of the commit message.
Note that these results are biased toward the current change because of
the last two benchmark changes. Reusing the pinner in the benchmark
itself achieves similar performance before this change. The benchmark
results thus basically just confirm that this change does cache the
inner pinner in a useful way. Using the previous benchmarks there's
actually a slight regression from the extra check in the cache, however
the long pole is still setPinned itself.
name old time/op new time/op delta
PinnerPinUnpinBatch-8 42.2µs ± 2% 41.5µs ± 1% ~ (p=0.056 n=5+5)
PinnerPinUnpinBatchDouble-8 367µs ± 1% 350µs ± 1% -4.67% (p=0.008 n=5+5)
PinnerPinUnpinBatchTiny-8 108µs ± 0% 102µs ± 1% -6.22% (p=0.008 n=5+5)
PinnerPinUnpin-8 592ns ± 8% 40ns ± 1% -93.29% (p=0.008 n=5+5)
PinnerPinUnpinTiny-8 693ns ± 9% 39ns ± 1% -94.31% (p=0.008 n=5+5)
PinnerPinUnpinDouble-8 843ns ± 5% 124ns ± 3% -85.24% (p=0.008 n=5+5)
PinnerPinUnpinParallel-8 1.11µs ± 5% 0.00µs ± 0% -99.55% (p=0.008 n=5+5)
PinnerPinUnpinParallelTiny-8 1.12µs ± 8% 0.00µs ± 1% -99.55% (p=0.008 n=5+5)
PinnerPinUnpinParallelDouble-8 1.79µs ± 4% 0.58µs ± 6% -67.36% (p=0.008 n=5+5)
PinnerIsPinnedOnPinned-8 5.78ns ± 0% 5.80ns ± 1% ~ (p=0.548 n=5+5)
PinnerIsPinnedOnUnpinned-8 4.99ns ± 1% 4.98ns ± 0% ~ (p=0.841 n=5+5)
PinnerIsPinnedOnPinnedParallel-8 0.71ns ± 0% 0.71ns ± 0% ~ (p=0.175 n=5+5)
PinnerIsPinnedOnUnpinnedParallel-8 0.67ns ± 1% 0.66ns ± 0% ~ (p=0.167 n=5+5)
name old alloc/op new alloc/op delta
PinnerPinUnpinBatch-8 20.1kB ± 0% 20.0kB ± 0% -0.32% (p=0.008 n=5+5)
PinnerPinUnpinBatchDouble-8 52.7kB ± 0% 52.7kB ± 0% -0.12% (p=0.008 n=5+5)
PinnerPinUnpinBatchTiny-8 20.1kB ± 0% 20.0kB ± 0% -0.32% (p=0.008 n=5+5)
PinnerPinUnpin-8 64.0B ± 0% 0.0B -100.00% (p=0.008 n=5+5)
PinnerPinUnpinTiny-8 64.0B ± 0% 0.0B -100.00% (p=0.008 n=5+5)
PinnerPinUnpinDouble-8 64.0B ± 0% 0.0B -100.00% (p=0.008 n=5+5)
PinnerPinUnpinParallel-8 64.0B ± 0% 0.0B -100.00% (p=0.008 n=5+5)
PinnerPinUnpinParallelTiny-8 64.0B ± 0% 0.0B -100.00% (p=0.008 n=5+5)
PinnerPinUnpinParallelDouble-8 64.0B ± 0% 0.0B -100.00% (p=0.008 n=5+5)
PinnerIsPinnedOnPinned-8 0.00B 0.00B ~ (all equal)
PinnerIsPinnedOnUnpinned-8 0.00B 0.00B ~ (all equal)
PinnerIsPinnedOnPinnedParallel-8 0.00B 0.00B ~ (all equal)
PinnerIsPinnedOnUnpinnedParallel-8 0.00B 0.00B ~ (all equal)
name old allocs/op new allocs/op delta
PinnerPinUnpinBatch-8 9.00 ± 0% 8.00 ± 0% -11.11% (p=0.008 n=5+5)
PinnerPinUnpinBatchDouble-8 11.0 ± 0% 10.0 ± 0% -9.09% (p=0.008 n=5+5)
PinnerPinUnpinBatchTiny-8 9.00 ± 0% 8.00 ± 0% -11.11% (p=0.008 n=5+5)
PinnerPinUnpin-8 1.00 ± 0% 0.00 -100.00% (p=0.008 n=5+5)
PinnerPinUnpinTiny-8 1.00 ± 0% 0.00 -100.00% (p=0.008 n=5+5)
PinnerPinUnpinDouble-8 1.00 ± 0% 0.00 -100.00% (p=0.008 n=5+5)
PinnerPinUnpinParallel-8 1.00 ± 0% 0.00 -100.00% (p=0.008 n=5+5)
PinnerPinUnpinParallelTiny-8 1.00 ± 0% 0.00 -100.00% (p=0.008 n=5+5)
PinnerPinUnpinParallelDouble-8 1.00 ± 0% 0.00 -100.00% (p=0.008 n=5+5)
PinnerIsPinnedOnPinned-8 0.00 0.00 ~ (all equal)
PinnerIsPinnedOnUnpinned-8 0.00 0.00 ~ (all equal)
PinnerIsPinnedOnPinnedParallel-8 0.00 0.00 ~ (all equal)
PinnerIsPinnedOnUnpinnedParallel-8 0.00 0.00 ~ (all equal)
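For reference, the user-visible API being optimized; with this CL, the
Pin below may reuse a P-cached pinner, and Unpin keeps the refs array
for next time:

    package main

    import "runtime"

    func main() {
        var p runtime.Pinner
        buf := new([64]byte)
        p.Pin(buf) // may come from the P-local pinner cache
        // ...hand buf to C code via cgo...
        p.Unpin() // refs array is retained, avoiding reallocation
    }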
For #46787.
Change-Id: I0cdfad77b189c425868944a4faeff3d5b97417b9
Reviewed-on: https://go-review.googlesource.com/c/go/+/497615
Reviewed-by: Austin Clements <austin@google.com>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Ansiwen <ansiwen@gmail.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
|
|
Currently pidleget gets passed "now" from before the M goes into
netpoll, resulting in incorrect accounting of idle CPU time.
lastpoll is also stored with a stale "now"; that mistake was introduced
in the same CL that introduced it for pidleget.
Recompute "now" after returning from netpoll.
Also, start tracking idle time on js/wasm at all.
Credit to Rhys Hiltner for the test case.
Fixes #60276.
Change-Id: I5dd677471f74c915dfcf3d01621430876c3ff307
Reviewed-on: https://go-review.googlesource.com/c/go/+/496183
Reviewed-by: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
|
|
This change adds traceBlockReason which leaks fewer implementation
details of the tracer to the runtime. Currently, gopark is called with
an explicit trace event, but this leaks details about trace internals
throughout the runtime.
This change will make it easier to change out the trace implementation.
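A sketch of the signature change (runtime-internal; the constant names
follow the pattern described, e.g. a channel-receive block):

    // Before: callers leaked a raw trace event into gopark.
    gopark(unlockf, lock, waitReasonChanReceive, traceEvGoBlockRecv, 2)

    // After: callers pass an abstract block reason; the tracer maps
    // it to whatever its current event encoding happens to be.
    gopark(unlockf, lock, waitReasonChanReceive, traceBlockChanRecv, 2)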
Change-Id: Id633e1704d2c8838c6abd1214d9695537c4ac7db
Reviewed-on: https://go-review.googlesource.com/c/go/+/494185
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
|
|
Currently the trace clock is cputicks() with comments sprinkled in
different places as to which clock to use. Since the execution tracer
redesign will use a different clock, it seems like a good time to clean
that up.
Also, rename the start/end timestamps to be more readable (i.e.
startTime vs. timeStart).
Change-Id: If43533eddd0e5f68885bb75cdbadb38da42e7584
Reviewed-on: https://go-review.googlesource.com/c/go/+/494775
Reviewed-by: Michael Pratt <mpratt@google.com>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
|
|
Currently STW events are only emitted for GC STWs. There's little reason
why the trace can't contain events for every STW: they're rare so don't
take up much space in the trace, yet being able to see when the world
was stopped is often critical to debugging certain latency issues,
especially when they stem from user-level APIs.
This change adds new "kinds" to the EvGCSTWStart event, renames the
GCSTW events to just "STW," and lets the parser deal with unknown STW
kinds for future backwards compatibility.
But, this change must break trace compatibility, so it bumps the trace
version to Go 1.21.
This change also includes a small cleanup in the trace command, which
previously checked for STW events when deciding whether user tasks
overlapped with a GC. Looking at the source, I don't see a way for STW
events to ever enter the stream that that code looks at, so that
condition has been deleted.
Change-Id: I9a5dc144092c53e92eb6950e9a5504a790ac00cf
Reviewed-on: https://go-review.googlesource.com/c/go/+/494495
Reviewed-by: Michael Pratt <mpratt@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
|
|
This reapplies CL 485500, with a fix drafted in CL 492987 incorporated.
CL 485500 is reverted due to #60004 and #60007. #60004 is fixed in
CL 492743. #60007 is fixed in CL 492987 (incorporated in this CL).
[Original CL 485500 description]
This reapplies CL 481061, with the followup fixes in CL 482975, CL 485315, and
CL 485316 incorporated.
CL 481061, by doujiang24 <doujiang24@gmail.com>, speed up C to Go
calls by binding the M to the C thread. See below for its
description.
CL 482975 is a followup fix to a C declaration in testprogcgo.
CL 485315 is a followup fix for x_cgo_getstackbound on Illumos.
CL 485316 is a followup cleanup for ppc64 assembly.
CL 479915 passed the G to _cgo_getstackbound for direct updates to
gp.stack.lo. A G can be reused on a new thread after the previous thread
exited. This could trigger the C TSAN race detector because it couldn't
see the synchronization in Go (lockextra) preventing the same G from
being used on multiple threads at the same time.
We work around this by passing the address of a stack variable to
_cgo_getstackbound rather than the G. The stack is generally unique per
thread, so TSAN won't see the same address from multiple threads. Even
if stacks are reused across threads by pthread, C TSAN should see the
synchronization in the stack allocator.
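A hedged sketch of the workaround's shape; cgoGetStackBound is a
hypothetical Go stand-in for the C helper _cgo_getstackbound:

    package sketch

    import "unsafe"

    // cgoGetStackBound models the C side: it only needs some address
    // to write through, and now receives a stack slot rather than a
    // pointer into the shared G.
    func cgoGetStackBound(lo *uintptr) {
        // Pretend the C code computed the thread's stack floor.
        *lo = uintptr(unsafe.Pointer(lo)) - 64<<10
    }

    // updateStackLo passes the address of a stack variable, which is
    // unique per thread, so C TSAN never sees one address shared by
    // two threads; Go copies the result into gp.stack.lo itself.
    func updateStackLo(setLo func(uintptr)) {
        var bound uintptr
        cgoGetStackBound(&bound)
        setLo(bound)
    }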
A regression test is added to misc/cgo/testsanitizer.
[Original CL 481061 description]
This reapplies CL 392854, with the followup fixes in CL 479255,
CL 479915, and CL 481057 incorporated.
CL 392854, by doujiang24 <doujiang24@gmail.com>, speed up C to Go
calls by binding the M to the C thread. See below for its
description.
CL 479255 is a followup fix for a small bug in ARM assembly code.
CL 479915 is another followup fix to address C to Go calls after
the C code uses some stack, but that CL is also buggy.
CL 481057, by Michael Knyszek, is a followup fix for a memory leak
bug of CL 479915.
[Original CL 392854 description]
In a C thread, it's necessary to acquire an extra M by using needm
while invoking a Go function from C. But needm and dropm are costly
due to their signal-related syscalls.
So we change to not dropm when returning to C, which means binding the
extra M to the C thread until it exits, avoiding needm and dropm on
each C-to-Go call. Instead, we only dropm when the C thread exits, so
the extra M won't leak.
When invoking a Go function from C:
1. Allocate a pthread key using pthread_key_create, only once per
shared object, and register a thread-exit-time destructor.
2. Store the g0 of the current M into the thread-specific value of
that key, only once per C thread, so that the destructor can put the
extra M back onto the extra M list when the C thread exits.
When returning back to C:
Skip dropm in cgocallback when the pthread key has been created, so
that the extra M will be reused the next time a Go function is invoked
from C (see the sketch below).
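A hedged sketch of the Go-side decision, under the assumption of a
per-thread flag standing in for the pthread key (names hypothetical):

    package sketch

    // pthreadKeyCreated models "this C thread has stashed its extra
    // M under a pthread key with an exit destructor"; in reality it
    // is per-thread state, not a global.
    var pthreadKeyCreated bool

    // dropmOrKeep sketches the cgocallback tail: keep the M bound to
    // the C thread when the destructor will return it on thread exit.
    func dropmOrKeep(dropm func()) {
        if pthreadKeyCreated {
            // The thread-exit destructor puts the extra M back on
            // the extra M list; skip the expensive dropm here.
            return
        }
        dropm() // no pthread support (e.g. Windows): old path
    }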
This is purely a performance optimization. The old version, in which
needm and dropm happen on each cgo call, is still correct, and we have
to keep it on systems with cgo but without pthreads, like Windows.
The optimization is significant; the exact speedup depends on the OS
and CPU, but in general a simple Go function call from a C thread can
be considered roughly 10x faster.
For the newly added BenchmarkCGoInCThread, some benchmark results:
1. 28x faster, from 3395 ns/op to 121 ns/op, on darwin with an
Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
2. 6.5x faster, from 1495 ns/op to 230 ns/op, on Linux with an
Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz
[CL 479915 description]
Currently, when C calls into Go the first time, we grab an M
using needm, which sets m.g0's stack bounds using the SP. We don't
know how big the stack is, so we simply assume 32K. Previously,
when the Go function returns to C, we drop the M, and the next
time C calls into Go, we put a new stack bound on the g0 based on
the current SP. After CL 392854, we don't drop the M, and the next
time C calls into Go, we reuse the same g0, without recomputing
the stack bounds. If the C code uses quite a bit of stack space
before calling into Go, the SP may be well below the 32K stack
bound we assumed, so the runtime thinks the g0 stack overflows.
This CL makes needm get a more accurate stack bound from pthread. On
some platforms this may still be a guess, as we don't know exactly
where we are in the C stack, but it is probably better than simply
assuming 32K.
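For contrast, a hedged, self-contained illustration of the old
guessing scheme (a model of the idea only, not runtime code):

    package main

    import (
        "fmt"
        "unsafe"
    )

    // approxStackBounds guesses stack bounds from the current SP the
    // way the old code did: assume about 32 KiB of stack below us.
    // If C code already consumed more than that before calling into
    // Go, the real SP can sit below this guessed lower bound.
    func approxStackBounds() (lo, hi uintptr) {
        var x byte
        sp := uintptr(unsafe.Pointer(&x)) // roughly the current SP
        const guess = 32 << 10
        return sp - guess, sp
    }

    func main() {
        lo, hi := approxStackBounds()
        fmt.Printf("guessed stack: [%#x, %#x)\n", lo, hi)
    }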
[CL 492987 description]
On the first call into Go from a C thread, currently we set the g0
stack's high bound imprecisely based on the SP. With CL 485500, we
keep the M and don't recompute the stack bounds when it calls into
Go again. If the first call is made when the C thread uses some
deep stack, but a subsequent call is made with a shallower stack,
the SP may be above g0.stack.hi.
This is usually okay, as we don't usually check stack.hi. One place
where we do check stack.hi is in the signal handler, in
adjustSignalStack. In particular, C TSAN delivers signals on the
g0 stack (instead of the usual signal stack). If the SP is above
g0.stack.hi, we don't see that it is on the g0 stack, and we throw.
This CL makes it get an accurate stack upper bound with the
pthread API (on the platforms where it is available).
Also add some debug printing for the "handler not on signal stack"
throw.
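A hedged illustration of the failure mode, with a hypothetical
onStack check mirroring the kind of test adjustSignalStack performs:

    package main

    import "fmt"

    // stack models recorded g0 stack bounds (illustrative values).
    type stack struct{ lo, hi uintptr }

    // onStack: a signal SP counts as "on the g0 stack" only when
    // lo <= sp < hi, so a stale, too-low hi makes a legitimate g0 SP
    // look like it is somewhere else entirely.
    func onStack(s stack, sp uintptr) bool {
        return s.lo <= sp && sp < s.hi
    }

    func main() {
        g0 := stack{lo: 0x100000, hi: 0x108000} // hi recorded too low
        sp := uintptr(0x109000) // shallower C stack, SP above hi
        fmt.Println(onStack(g0, sp)) // false -> "not on signal stack"
    }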
Fixes #51676.
Fixes #59294.
Fixes #59678.
Fixes #60007.
Change-Id: Ie51c8e81ade34ec81d69fd7bce1fe0039a470776
Reviewed-on: https://go-review.googlesource.com/c/go/+/495855
Run-TryBot: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
|
|
sysBlockTraced is a subtle and confusing flag.
Currently, it's only used in one place: a condition around whether to
traceGoSysExit when a goroutine is about to start running. That condition
looks like "gp.syscallsp != 0 && gp.trace.sysBlockTraced".
In every case but one, "gp.syscallsp != 0" is equivalent to
"gp.trace.sysBlockTraced."
That one case is where a goroutine is running without a P and racing
with trace start (world is stopped). It switches itself back to
_Grunnable from _Gsyscall before the trace start goroutine notices,
such that the trace start goroutine fails to emit an EvGoInSyscall
event for it (EvGoInSyscall or EvGoSysBlock must precede any
EvGoSysExit event).
sysBlockTraced is set unconditionally on every syscall entry, and the
trace start goroutine clears it if no EvGoInSyscall event was emitted
(i.e. it did not observe _Gsyscall on the goroutine). That way, when
the goroutine-without-a-P wakes up and gets scheduled, it only emits
EvGoSysExit if the flag is set, i.e. trace start didn't _clear_ the
flag.
What makes this confusing is the fact that the flag is set
unconditionally and the code relies on it being *cleared*. Really, all
it's trying to communicate is whether the tracer is aware of a
goroutine's syscall at the point where a goroutine that lost its P
during a syscall is trying to run again.
Therefore, we can replace this flag with a less subtle one:
tracedSyscallEnter. It is set when GoSysCall is traced, indicating on
the goroutine that the tracer is aware of the syscall. Later, if
traceGoSysExit is called, the tracer knows it's safe to emit an event
because it is already aware of the syscall.
This flag is then also set at trace start, when it emits EvGoInSyscall,
which again, lets the goroutine know the tracer is aware of its syscall.
The flag is cleared by GoSysExit to indicate that the tracer is no
longer aware of any syscalls on the goroutine. It's also cleared by
trace start. This is necessary because a syscall may have been started
while a trace was stopping. If the GoSysExit isn't emitted (because it
races with the trace end STW) then the flag will be left set at the
start of the next trace period, which will result in an erroneous
GoSysExit. Instead, the flag is cleared in the same way sysBlockTraced
is today: if the tracer doesn't notice the goroutine is in a syscall, it
makes that explicit to the goroutine.
A more direct flag to use here would be one that explicitly indicates
whether EvGoInSyscall or EvGoSysBlock specifically were already emitted
for a goroutine. The reason we don't do this is that setting the flag
when EvGoSysBlock is emitted would be racy: EvGoSysBlock is emitted by
whatever thread is stealing the P out from under the syscalling
goroutine, so it would need to synchronize with the goroutine it's
emitting the event for.
The end result of all this is that the new flag can be managed entirely
within trace.go, hiding another implementation detail about the tracer.
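A hedged sketch of the flag's lifecycle, modeling only the
per-goroutine state (names mirror this description; not runtime code):

    package sketch

    // g carries just the trace state relevant here.
    type g struct {
        tracedSyscallEnter bool
    }

    // traceGoSysCall: the tracer saw this goroutine enter a syscall.
    func traceGoSysCall(gp *g) { gp.tracedSyscallEnter = true }

    // traceStartSync: at trace start, emitting EvGoInSyscall marks
    // the tracer as aware; otherwise the flag is cleared so a syscall
    // that straddled the previous trace's end can't leave a stale
    // flag that would produce an erroneous GoSysExit.
    func traceStartSync(gp *g, emittedGoInSyscall bool) {
        gp.tracedSyscallEnter = emittedGoInSyscall
    }

    // traceGoSysExit emits only when the tracer is aware of the
    // syscall, then clears the flag.
    func traceGoSysExit(gp *g, emit func()) {
        if !gp.tracedSyscallEnter {
            return
        }
        emit()
        gp.tracedSyscallEnter = false
    }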
Tested with `stress ./trace.test -test.run="TestTraceStressStartStop"`,
which was occasionally failing before the CL in which sysBlockTraced
was added (CL 9132). I also confirmed that this test is still sensitive
to `EvGoSysExit` by removing the one use of sysBlockTraced. The result
is about a 5% error rate. If there is something very subtly wrong about
how this CL emits `EvGoSysExit`, I would expect to see it as a test
failure. Instead:
53m55s: 200434 runs so far, 0 failures
Change-Id: If1d24ee6b6926eec7e90cdb66039a5abac819d9b
Reviewed-on: https://go-review.googlesource.com/c/go/+/494715
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
|
|
Just another step toward hiding the tracer's implementation details.
Change-Id: I71b7cc522d18c23f03a9bf32e428279e62b39a89
Reviewed-on: https://go-review.googlesource.com/c/go/+/494192
Run-TryBot: Michael Knyszek <mknyszek@google.com>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
|
|
More tightening up of the tracer's interface.
This increases the size of each G very slightly, which isn't great,
but we stay within the same size class, so memory use is actually
unchanged.
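A hedged, self-contained illustration of why staying within a size
class costs nothing; the class list is abridged from the runtime's
actual size classes:

    package main

    import "fmt"

    // roundToSizeClass models how the allocator rounds a request up
    // to a fixed size class; growing a struct costs no extra memory
    // as long as it stays under the same class boundary.
    func roundToSizeClass(n uintptr) uintptr {
        classes := []uintptr{16, 32, 48, 64, 80, 96, 112, 128}
        for _, c := range classes {
            if n <= c {
                return c
            }
        }
        return n
    }

    func main() {
        // Both a 100-byte and a 104-byte object land in the 112-byte
        // class: the allocation is the same size before and after.
        fmt.Println(roundToSizeClass(100), roundToSizeClass(104))
    }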
Change-Id: I7d1f5798edcf437c212beb1e1a2619eab833aafb
Reviewed-on: https://go-review.googlesource.com/c/go/+/494188
TryBot-Result: Gopher Robot <gobot@golang.org>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
|
|
In the interest of further cleaning up the trace.go API, move the trace
logic in oneNewExtraM into its own function.
Change-Id: I5cf478cb8cd0d301ee3b068347ed48ce768b8882
Reviewed-on: https://go-review.googlesource.com/c/go/+/494186
Reviewed-by: Michael Pratt <mpratt@google.com>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
|