From 2e782e0005d36fe20e9f2fb16fc264cc08c97819 Mon Sep 17 00:00:00 2001 From: Michael Anthony Knyszek Date: Thu, 4 Aug 2022 20:08:53 +0000 Subject: design: update the soft memory limit design document This change updates the design document with some additional prior art and corrections that reflect the final implementation. For #48409. Change-Id: I8276a30b546c698ab1c9f946ccc0ae2a327399d0 Reviewed-on: https://go-review.googlesource.com/c/proposal/+/421418 Reviewed-by: Michael Knyszek --- design/48409-soft-memory-limit.md | 152 ++++++++++++---------------------- design/48409/README.md | 5 +- design/48409/eqn1.png | Bin 1012 -> 922 bytes design/48409/eqn2.png | Bin 1656 -> 1656 bytes design/48409/inl1.png | Bin 399 -> 399 bytes design/48409/inl10.png | Bin 455 -> 449 bytes design/48409/inl11.png | Bin 368 -> 455 bytes design/48409/inl12.png | Bin 755 -> 368 bytes design/48409/inl13.png | Bin 0 -> 755 bytes design/48409/inl2.png | Bin 370 -> 370 bytes design/48409/inl3.png | Bin 486 -> 389 bytes design/48409/inl4.png | Bin 434 -> 400 bytes design/48409/inl5.png | Bin 400 -> 397 bytes design/48409/inl6.png | Bin 374 -> 695 bytes design/48409/inl7.png | Bin 462 -> 374 bytes design/48409/inl8.png | Bin 482 -> 462 bytes design/48409/inl9.png | Bin 449 -> 482 bytes design/48409/soft-memory-limit.src.md | 139 ++++++++++--------------------- 18 files changed, 98 insertions(+), 198 deletions(-) create mode 100644 design/48409/inl13.png diff --git a/design/48409-soft-memory-limit.md b/design/48409-soft-memory-limit.md index 59c0ab1..3edc9ad 100644 --- a/design/48409-soft-memory-limit.md +++ b/design/48409-soft-memory-limit.md @@ -208,70 +208,51 @@ limit ![`L`](48409/inl2.png), I propose the following calculation: ![Equation 1](48409/eqn1.png) -Where ![`O_M`](48409/inl3.png) is (per `runtime/metrics` memory names) - -``` -/memory/classes/metadata/mcache/free:bytes + -/memory/classes/metadata/mcache/inuse:bytes + -/memory/classes/metadata/mspan/free:bytes + -/memory/classes/metadata/mspan/inuse:bytes + -/memory/classes/metadata/other:bytes + -/memory/classes/os-stacks:bytes + -/memory/classes/other:bytes + -/memory/classes/profiling/buckets:bytes -``` - -and ![`O_I`](48409/inl4.png) is the maximum of -`/memory/classes/heap/unused:bytes + /memory/classes/heap/free:bytes` over the -last GC cycle. - -These terms (called ![`O`](48409/inl5.png), for "overheads") account for all -memory that is not accounted for by the GC pacer (from the [new pacer -proposal](https://github.com/golang/proposal/blob/329650d4723a558c2b76b81b4995fc5c267e6bc1/design/44167-gc-pacer-redesign.md#heap-goal)). +![`T`](48409/inl3.png) is the total amount of memory mapped by the Go runtime. +![`F`](48409/inl4.png) is the amount of free and unscavenged memory the Go +runtime is holding. +![`A`](48409/inl5.png) is the number of bytes in allocated heap objects at the +time ![`\hat{L}`](48409/inl1.png) is computed. + +The second term, ![`(T - F - A)`](48409/inl6.png), represents the sum of +non-heap overheads. +Free and unscavenged memory is specifically excluded because this is memory that +the runtime might use in the near future, and the scavenger is specifically +instructed to leave the memory up to the heap goal unscavenged. +Failing to exclude free and unscavenged memory could lead to a very poor +accounting of non-heap overheads. With ![`\hat{L}`](48409/inl1.png) fully defined, our heap goal for cycle -![`n`](48409/inl6.png) (![`N_n`](48409/inl7.png)) is a straightforward extension +![`n`](48409/inl7.png) (![`N_n`](48409/inl8.png)) is a straightforward extension of the existing one. Where -* ![`M_n`](48409/inl8.png) is equal to bytes marked at the end of GC n's mark +* ![`M_n`](48409/inl9.png) is equal to bytes marked at the end of GC n's mark phase -* ![`S_n`](48409/inl9.png) is equal to stack bytes at the beginning of GC n's +* ![`S_n`](48409/inl10.png) is equal to stack bytes at the beginning of GC n's mark phase -* ![`G_n`](48409/inl10.png) is equal to bytes of globals at the beginning of GC +* ![`G_n`](48409/inl11.png) is equal to bytes of globals at the beginning of GC n's mark phase -* ![`\gamma`](48409/inl11.png) is equal to - ![`1+\frac{GOGC}{100}`](48409/inl12.png) +* ![`\gamma`](48409/inl12.png) is equal to + ![`1+\frac{GOGC}{100}`](48409/inl13.png) then ![Equation 2](48409/eqn2.png) -Over the course of a GC cycle ![`O_M`](48409/inl3.png) remains stable because it -increases monotonically. -There's only one situation where ![`O_M`](48409/inl3.png) can grow tremendously -(relative to active heap objects) in a short period of time (< 1 GC cycle), and -that's when `GOMAXPROCS` increases. -So, I also propose recomputing this value at that time. - -Meanwhile ![`O_I`](48409/inl4.png) stays relatively stable (and doesn't have a -sawtooth pattern, as one might expect from a sum of idle heap memory) because -object sweeping occurs incrementally, specifically proportionally to how fast -the application is allocating. -Furthermore, this value is guaranteed to stay relatively stable across a single -GC cycle, because the total size of the heap for one GC cycle is bounded by the -heap goal. -Taking the highwater mark of this value places a conservative upper bound on the -total impact of this memory, so the heap goal stays safe from major changes. - -One concern with the above definition of ![`\hat{L}`](48409/inl1.png) is that it -is fragile to changes to the Go GC. -In the past, seemingly unrelated changes to the Go runtime have impacted the -GC's pacer, usually due to an unforeseen influence on the accounting that the -pacer relies on. -To minimize the impact of these accidents on the conversion function, I propose -centralizing and categorizing all the variables used in accounting, and writing -tests to ensure that expected properties of the account remain in-tact. +Over the course of a GC cycle, non-heap overheads remain stable because the +mostly increase monotonically. +However, the GC needs to be responsive to any change in non-heap overheads. +Therefore, I propose a more heavy-weight recomputation of the heap goal every +time its needed, as opposed to computing it only once per cycle. +This also means the GC trigger point needs to be dynamically recomputable. +This check will create additional overheads, but they're likely to be low, as +the GC's internal statistics are updated only on slow paths. + +The nice thing about this definition of ![`\hat{L}`](48409/inl1.png) is that +it's fairly robust to changes to the Go GC, since total mapped memory, free and +unscavenged memory, and bytes allocated in objects, are fairly fundamental +properties (especially to any tracing GC design). #### Death spirals @@ -322,7 +303,7 @@ large enough to accommodate worst-case pause times but not too large such that a more than about a second. 1 CPU-second per `GOMAXPROCS` seems like a reasonable place to start. -Unfortunately, 50% is not a reasonable choice for small values of `GOGC`. +Unfortunately, 50% is not always a reasonable choice for small values of `GOGC`. Consider an application running with `GOGC=10`: an overall 50% GC CPU utilization limit for `GOGC=10` is likely going to be always active, leading to significant overshoot. @@ -359,22 +340,13 @@ use approaches the limit. I propose it does so using a proportional-integral controller whose input is the difference between the memory limit and the memory used by Go, and whose output is the CPU utilization target of the background scavenger. -The output will be clamped at a minimum of 1% and a maximum of 10% overall CPU -utilization. -Note that the 10% is chosen arbitrarily; in general, returning memory to the -platform is nowhere near as costly as the GC, but the number must be chosen such -that the mutator still has plenty of room to make progress (thus, I assert that -40% of CPU time is enough). -In order to make the scavenger scale to overall CPU utilization effectively, it -requires some improvements to avoid the aforementioned locking issues it deals -with today. - -Any CPU time spent in the scavenger should also be accounted for in the leaky -bucket algorithm described in the [Death spirals](#death-spirals) section as GC -time, however I don't think it should be throttled in the same way. -The intuition behind that is that returning memory to the platform is generally -going to be more immediately fruitful than spending more time in garbage -collection. +This will make the background scavenger more reliable. + +However, the background scavenger likely won't return memory to the OS promptly +enough for the memory limit, so in addition, I propose having span allocations +eagerly return memory to the OS to stay under the limit. +The time a goroutine spends in this will also count toward the 50% GC CPU limit +described in the [Death spirals](#death-spirals) section. #### Alternative approaches considered @@ -418,38 +390,13 @@ go beyond the spans already in-use. ##### Returning memory to the platform -A potential issue with the proposed design is that because the scavenger is -running in the background, it may not react readily to spikes in memory use that -exceed the limit. - -In contrast, [TCMalloc](#tcmalloc) searches for memory to return eagerly, if an -allocation were to exceed the limit. -In the Go 1.13 cycle, I attempted a similar policy when first implementing the -scavenger, and found that it could cause unacceptable tail latency increases in -some applications. -While that policy certainly tried to return memory back to the platform -significantly more often than it would be in this case, it still has a couple of -downsides: -1. It introduces latency. - The background scavenger can be more efficiently time-sliced in between other - work, so it generally should only impact throughput. -1. It's much more complicated to bound the total amount of time spent searching - for and returning memory to the platform during an allocation. - -The key insight as to why this policy works just fine for TCMalloc and won't -work for Go comes from a fundamental difference in design. -Manual memory allocators are typically designed to have a LIFO-style memory -reuse pattern. -Once an allocation is freed, it is immediately available for reallocation. -In contrast, most efficient tracing garbage collection algorithms require a -FIFO-style memory reuse pattern, since allocations are freed in bulk. -The result is that the page allocator in a garbage-collected memory allocator is -accessed far more frequently than in manual memory allocator, so this path will -be hit a lot harder. - -For the purposes of this design, I don't believe the benefits of eager return -outweigh the costs, and I do believe that the proposed design is good enough for -most cases. +If returning memory to the OS eagerly becomes a significant performance issue, a +reasonable alternative could be to crank up the background scavenger's CPU usage +in response to growing memory pressure. +This needs more thought, but given that it would now be controlled by a +controller, its CPU usage will be more reliable, and this is an option we can +keep in mind. +One benefit of this option is that it may impact latency less prominently. ### Documentation @@ -513,6 +460,11 @@ decides to shrink the heap space used; more recent implementations (e.g. G1) do so more rarely, except when [the application is idle](https://openjdk.java.net/jeps/346). +Some JVMs are "container aware" and read the memory limits of their containers +to stay under the limit. +This behavior is closer to what is proposed in this document, but I do not +believe the memory limit is directly configurable, like the one proposed here. + ### SetMaxHeap For nearly 4 years, the Go project has been trialing an experimental API in the diff --git a/design/48409/README.md b/design/48409/README.md index 8994615..375a0f2 100644 --- a/design/48409/README.md +++ b/design/48409/README.md @@ -1,6 +1,6 @@ # Design document -The GC pacer design document is generated from the `.src.md` file in this +The memory limit design document is generated from the `.src.md` file in this directory. It contains LaTeX formulas which Markdown cannot render, so they're rendered by an external open-source tool, @@ -10,9 +10,6 @@ Then, because Gitiles' markdown viewer can't render SVGs, run ``` ./svg2png.bash -cd pacer-plots -./svg2png.bash -cd .. ``` And go back and replace all instances of SVG with PNG in the final document. diff --git a/design/48409/eqn1.png b/design/48409/eqn1.png index f638a9a..96ce7b9 100644 Binary files a/design/48409/eqn1.png and b/design/48409/eqn1.png differ diff --git a/design/48409/eqn2.png b/design/48409/eqn2.png index b25a7dd..8174e00 100644 Binary files a/design/48409/eqn2.png and b/design/48409/eqn2.png differ diff --git a/design/48409/inl1.png b/design/48409/inl1.png index d5799ad..8636d42 100644 Binary files a/design/48409/inl1.png and b/design/48409/inl1.png differ diff --git a/design/48409/inl10.png b/design/48409/inl10.png index 0cc2d21..2d1b53e 100644 Binary files a/design/48409/inl10.png and b/design/48409/inl10.png differ diff --git a/design/48409/inl11.png b/design/48409/inl11.png index 038e874..b1966ed 100644 Binary files a/design/48409/inl11.png and b/design/48409/inl11.png differ diff --git a/design/48409/inl12.png b/design/48409/inl12.png index 48cd28f..1158f2d 100644 Binary files a/design/48409/inl12.png and b/design/48409/inl12.png differ diff --git a/design/48409/inl13.png b/design/48409/inl13.png new file mode 100644 index 0000000..1beadc3 Binary files /dev/null and b/design/48409/inl13.png differ diff --git a/design/48409/inl2.png b/design/48409/inl2.png index bb27598..7fc5478 100644 Binary files a/design/48409/inl2.png and b/design/48409/inl2.png differ diff --git a/design/48409/inl3.png b/design/48409/inl3.png index c152bf2..4afda7a 100644 Binary files a/design/48409/inl3.png and b/design/48409/inl3.png differ diff --git a/design/48409/inl4.png b/design/48409/inl4.png index cf4f20b..23b71e5 100644 Binary files a/design/48409/inl4.png and b/design/48409/inl4.png differ diff --git a/design/48409/inl5.png b/design/48409/inl5.png index 6386671..9fc70f6 100644 Binary files a/design/48409/inl5.png and b/design/48409/inl5.png differ diff --git a/design/48409/inl6.png b/design/48409/inl6.png index de20b8e..d72def7 100644 Binary files a/design/48409/inl6.png and b/design/48409/inl6.png differ diff --git a/design/48409/inl7.png b/design/48409/inl7.png index 4b7a746..4149a13 100644 Binary files a/design/48409/inl7.png and b/design/48409/inl7.png differ diff --git a/design/48409/inl8.png b/design/48409/inl8.png index e3dbc80..9e0c57c 100644 Binary files a/design/48409/inl8.png and b/design/48409/inl8.png differ diff --git a/design/48409/inl9.png b/design/48409/inl9.png index 01dff67..820bb71 100644 Binary files a/design/48409/inl9.png and b/design/48409/inl9.png differ diff --git a/design/48409/soft-memory-limit.src.md b/design/48409/soft-memory-limit.src.md index c8efb3b..a5d6e81 100644 --- a/design/48409/soft-memory-limit.src.md +++ b/design/48409/soft-memory-limit.src.md @@ -107,7 +107,7 @@ It's outside of the scope of the Go runtime to solve this problem for everyone, but I believe the Go runtime has an important role to play in supporting these worthwhile efforts. -2. Eliminate out-of-memory errors in 100% of cases. +1. Eliminate out-of-memory errors in 100% of cases. Whatever policy this API adheres to is going to fail for some use-case, and that's OK. @@ -207,28 +207,20 @@ To compute the heap limit `$\hat{L}$` from the soft memory limit `$L$`, I propose the following calculation: ```render-latex -\hat{L} = L - (O_M + O_I) +\hat{L} = L - (T - F - A) ``` -Where `$O_M$` is (per `runtime/metrics` memory names) +`$T$` is the total amount of memory mapped by the Go runtime. +`$F$` is the amount of free and unscavenged memory the Go runtime is holding. +`$A$` is the number of bytes in allocated heap objects at the time `$\hat{L}$` +is computed. -``` -/memory/classes/metadata/mcache/free:bytes + -/memory/classes/metadata/mcache/inuse:bytes + -/memory/classes/metadata/mspan/free:bytes + -/memory/classes/metadata/mspan/inuse:bytes + -/memory/classes/metadata/other:bytes + -/memory/classes/os-stacks:bytes + -/memory/classes/other:bytes + -/memory/classes/profiling/buckets:bytes -``` - -and `$O_I$` is the maximum of `/memory/classes/heap/unused:bytes + -/memory/classes/heap/free:bytes` over the last GC cycle. - -These terms (called `$O$`, for "overheads") account for all memory that is not -accounted for by the GC pacer (from the [new pacer -proposal](https://github.com/golang/proposal/blob/329650d4723a558c2b76b81b4995fc5c267e6bc1/design/44167-gc-pacer-redesign.md#heap-goal)). +The second term, `$(T - F - A)$`, represents the sum of non-heap overheads. +Free and unscavenged memory is specifically excluded because this is memory that +the runtime might use in the near future, and the scavenger is specifically +instructed to leave the memory up to the heap goal unscavenged. +Failing to exclude free and unscavenged memory could lead to a very poor +accounting of non-heap overheads. With `$\hat{L}$` fully defined, our heap goal for cycle `$n$` (`$N_n$`) is a straightforward extension of the existing one. @@ -245,31 +237,19 @@ then N_n = min(\hat{L}, \gamma(M_{n-1})+S_n+G_n) ``` -Over the course of a GC cycle `$O_M$` remains stable because it increases -monotonically. -There's only one situation where `$O_M$` can grow tremendously (relative to -active heap objects) in a short period of time (< 1 GC cycle), and that's when -`GOMAXPROCS` increases. -So, I also propose recomputing this value at that time. - -Meanwhile `$O_I$` stays relatively stable (and doesn't have a sawtooth pattern, -as one might expect from a sum of idle heap memory) because object sweeping -occurs incrementally, specifically proportionally to how fast the application is -allocating. -Furthermore, this value is guaranteed to stay relatively stable across a single -GC cycle, because the total size of the heap for one GC cycle is bounded by the -heap goal. -Taking the highwater mark of this value places a conservative upper bound on the -total impact of this memory, so the heap goal stays safe from major changes. - -One concern with the above definition of `$\hat{L}$` is that it is fragile to -changes to the Go GC. -In the past, seemingly unrelated changes to the Go runtime have impacted the -GC's pacer, usually due to an unforeseen influence on the accounting that the -pacer relies on. -To minimize the impact of these accidents on the conversion function, I propose -centralizing and categorizing all the variables used in accounting, and writing -tests to ensure that expected properties of the account remain in-tact. +Over the course of a GC cycle, non-heap overheads remain stable because the +mostly increase monotonically. +However, the GC needs to be responsive to any change in non-heap overheads. +Therefore, I propose a more heavy-weight recomputation of the heap goal every +time its needed, as opposed to computing it only once per cycle. +This also means the GC trigger point needs to be dynamically recomputable. +This check will create additional overheads, but they're likely to be low, as +the GC's internal statistics are updated only on slow paths. + +The nice thing about this definition of `$\hat{L}$` is that it's fairly robust +to changes to the Go GC, since total mapped memory, free and unscavenged memory, +and bytes allocated in objects, are fairly fundamental properties (especially to +any tracing GC design). #### Death spirals @@ -320,7 +300,7 @@ large enough to accommodate worst-case pause times but not too large such that a more than about a second. 1 CPU-second per `GOMAXPROCS` seems like a reasonable place to start. -Unfortunately, 50% is not a reasonable choice for small values of `GOGC`. +Unfortunately, 50% is not always a reasonable choice for small values of `GOGC`. Consider an application running with `GOGC=10`: an overall 50% GC CPU utilization limit for `GOGC=10` is likely going to be always active, leading to significant overshoot. @@ -357,22 +337,13 @@ use approaches the limit. I propose it does so using a proportional-integral controller whose input is the difference between the memory limit and the memory used by Go, and whose output is the CPU utilization target of the background scavenger. -The output will be clamped at a minimum of 1% and a maximum of 10% overall CPU -utilization. -Note that the 10% is chosen arbitrarily; in general, returning memory to the -platform is nowhere near as costly as the GC, but the number must be chosen such -that the mutator still has plenty of room to make progress (thus, I assert that -40% of CPU time is enough). -In order to make the scavenger scale to overall CPU utilization effectively, it -requires some improvements to avoid the aforementioned locking issues it deals -with today. - -Any CPU time spent in the scavenger should also be accounted for in the leaky -bucket algorithm described in the [Death spirals](#death-spirals) section as GC -time, however I don't think it should be throttled in the same way. -The intuition behind that is that returning memory to the platform is generally -going to be more immediately fruitful than spending more time in garbage -collection. +This will make the background scavenger more reliable. + +However, the background scavenger likely won't return memory to the OS promptly +enough for the memory limit, so in addition, I propose having span allocations +eagerly return memory to the OS to stay under the limit. +The time a goroutine spends in this will also count toward the 50% GC CPU limit +described in the [Death spirals](#death-spirals) section. #### Alternative approaches considered @@ -416,38 +387,13 @@ go beyond the spans already in-use. ##### Returning memory to the platform -A potential issue with the proposed design is that because the scavenger is -running in the background, it may not react readily to spikes in memory use that -exceed the limit. - -In contrast, [TCMalloc](#tcmalloc) searches for memory to return eagerly, if an -allocation were to exceed the limit. -In the Go 1.13 cycle, I attempted a similar policy when first implementing the -scavenger, and found that it could cause unacceptable tail latency increases in -some applications. -While that policy certainly tried to return memory back to the platform -significantly more often than it would be in this case, it still has a couple of -downsides: -1. It introduces latency. - The background scavenger can be more efficiently time-sliced in between other - work, so it generally should only impact throughput. -1. It's much more complicated to bound the total amount of time spent searching - for and returning memory to the platform during an allocation. - -The key insight as to why this policy works just fine for TCMalloc and won't -work for Go comes from a fundamental difference in design. -Manual memory allocators are typically designed to have a LIFO-style memory -reuse pattern. -Once an allocation is freed, it is immediately available for reallocation. -In contrast, most efficient tracing garbage collection algorithms require a -FIFO-style memory reuse pattern, since allocations are freed in bulk. -The result is that the page allocator in a garbage-collected memory allocator is -accessed far more frequently than in manual memory allocator, so this path will -be hit a lot harder. - -For the purposes of this design, I don't believe the benefits of eager return -outweigh the costs, and I do believe that the proposed design is good enough for -most cases. +If returning memory to the OS eagerly becomes a significant performance issue, a +reasonable alternative could be to crank up the background scavenger's CPU usage +in response to growing memory pressure. +This needs more thought, but given that it would now be controlled by a +controller, its CPU usage will be more reliable, and this is an option we can +keep in mind. +One benefit of this option is that it may impact latency less prominently. ### Documentation @@ -511,6 +457,11 @@ decides to shrink the heap space used; more recent implementations (e.g. G1) do so more rarely, except when [the application is idle](https://openjdk.java.net/jeps/346). +Some JVMs are "container aware" and read the memory limits of their containers +to stay under the limit. +This behavior is closer to what is proposed in this document, but I do not +believe the memory limit is directly configurable, like the one proposed here. + ### SetMaxHeap For nearly 4 years, the Go project has been trialing an experimental API in the -- cgit v1.3