# Proposal: Design and Implementation of Profile-Guided Optimization (PGO) for Go

Author(s): Raj Barik, Jin Lin

Last updated: 2022-09-12

Discussion at https://golang.org/issue/55025. \
Linked high-level issue for PGO at https://golang.org/issue/55022.

## Abstract

Inefficiencies in Go programs can be isolated via profiling tools such as [pprof](https://pkg.go.dev/runtime/pprof) and the Linux profiler [perf](https://www.brendangregg.com/perf.html).
Such tools can pinpoint source code regions where most of the execution time is spent.
Unlike other optimizing compilers such as [LLVM](https://llvm.org/docs/HowToBuildWithPGO.html), the Go compiler does not yet perform Profile-Guided Optimization (PGO).
PGO uses information about the code’s runtime behavior to guide compiler optimizations such as inlining and code layout.
PGO can improve application performance in the range of 15-30% [[LLVM](https://llvm.org/docs/HowToBuildWithPGO.html), [AutoFDO](https://dl.acm.org/doi/pdf/10.1145/2854038.2854044)].
In this proposal, we extend the Go compiler with PGO.
Specifically, we incorporate the profiles into the frontend of the compiler to build a call graph with node and edge weights (called _WeightedCallGraph_).
The Inliner subsequently uses the WeightedCallGraph to perform _profile-guided inlining_, which aggressively inlines hot functions.
We introduce a _profile-guided code specialization_ pass that is tightly integrated with the Inliner and eliminates indirect method call overheads in hot code paths.
Furthermore, we annotate IR instructions with their associated profile weights and propagate these to the SSA level in order to facilitate a _profile-guided basic-block layout_ optimization, which benefits from better instruction-cache and TLB performance.
Finally, we extend Go's linker to also consume the profiles directly and perform a _function reordering_ optimization across package boundaries -- which also helps instruction-cache and TLB performance.
The format of the profile file consumed by our PGO is identical to the protobuf format produced by the [pprof](https://pkg.go.dev/runtime/pprof) tool.
This format is rich enough to carry additional hardware performance counter information such as cache misses, LBR, etc.
The existing [perf\_data\_converter](https://github.com/google/perf_data_converter) tool from Google can convert a _perf.data_ file produced by the Linux [perf](https://www.brendangregg.com/perf.html) profiler into a _profile.proto_ file in protobuf format.
The first version of the code that performs _profile-guided inlining_ is available [here](http://github.com/rajbarik/go).

## Background

Many compiler optimizations, such as inlining, register allocation, and instruction scheduling, often use hard-coded information related to caller-callee frequencies, basic-block frequencies, and branch probabilities to guide optimization.
The static estimation of these metrics often leads to poor-quality code generated by the compiler.
These optimizations can easily benefit from dynamic information collected by profiling an application.

Traditionally, a PGO-based compilation begins with an instrumentation phase to generate an instrumented version of the application.
Next, the instrumented program runs with training data to collect the profile (e.g., [edge-profiles](https://dl.acm.org/doi/10.1145/183432.183527)).
These profiles are subsequently fed to the compiler, and the application is recompiled to produce an optimized binary.
During this process, the compiler updates and propagates profile information, feeding it to optimization passes in order to optimize hot code paths.
Modern compilers such as LLVM have embedded PGO and have reported [speedups](https://llvm.org/docs/HowToBuildWithPGO.html) of ~20%.
Since the instrumented build of an application incurs significant overhead, [recent work](https://dl.acm.org/doi/10.1145/1772954.1772963) has demonstrated little or no performance loss by collecting execution profiles via sampling (e.g., [hardware performance counters](https://dl.acm.org/doi/10.1145/1772954.1772963), [pprof](https://pkg.go.dev/runtime/pprof)).

Go binaries are often large, as they are statically linked and include all dependent packages and runtimes.
For such large binaries, misses in the instruction cache and TLB can cause stalled cycles in the front-end, leading to performance degradation.
Profile-guided _code layout_ optimization is known to alleviate this problem.
Recent works, including [FB-BOLT](https://research.fb.com/publications/bolt-a-practical-binary-optimizer-for-data-centers-and-beyond/) and [Google-Propeller](https://github.com/google/llvm-propeller), have shown more than 10% performance improvement by optimizing code locality in large data-center workloads.
_Code layout_ optimization improves code locality and comprises basic-block layout, function splitting, and function reordering optimizations.
In order to extract maximum performance, these optimizations are typically performed during or after link time using profiling information.

### Standard Compilation flow in Go

**Figure 1.** Go compiler.

At a very high level, the Go compiler produces an object file per package.
It links all object files together to produce an executable.
It avoids unnecessary recompilation of packages which are not modified.
For each package, the frontend of the Go compiler performs passes such as type checking and AST lowering before generating the frontend IR.
At this IR level, the Go compiler performs optimizations such as inlining, devirtualization, and escape analysis before lowering to the SSA IR.
Several optimizations (dead-code elimination, CSE, dead-store elimination, basic-block layout, etc.) are performed at the SSA level before producing an object file for the package.

Inlining in the Go compiler proceeds as a bottom-up traversal of the call graph.
For each function, it first invokes _CanInline_ to determine the eligibility for inlining; e.g., functions marked with _go:noinline_ cannot be inlined.
CanInline traverses the IR (via a HairyVisitor) to calculate the _Cost_ of the function.
If this cost is greater than _maxInlineBudget_ (80), CanInline marks the function as not inlinable in its upstream callers.
Subsequently, the Inliner performs inlining in the same function for call sites that have direct callees with cost less than maxInlineBudget (80).
As an additional code-size optimization, the Inliner lowers the budget from maxInlineBudget (80) to inlineBigFunctionMaxCost (20) for a big function whose number of IR instructions is more than 5K.
These parameters have been tuned heavily to get a good balance between code size and performance.
On the other hand, there have been several discussions around maxInlineBudget (80) being too small, which prevents frequently executed methods from being inlined, e.g., NextSet in [bitset](https://lemire.me/blog/2017/09/05/go-does-not-inline-functions-when-it-should/).
It is definitely compelling to be able to tune these parameters based on the runtime behavior of an application.

The devirtualization pass in the Go compiler replaces interface method calls with direct concrete-type method calls where possible.
Since inlining happens before devirtualization, there is no guarantee that the concrete method calls generated by the devirtualizer get inlined.

The Go compiler performs basic-block layout at the SSA level today, but it is not profile-guided.
This optimization is more effective when runtime information is provided to the compiler.

The Go linker does not yet perform any function reordering optimization.
To the best of our knowledge, the linker today creates two sections, one for data and another for text/code.
Ideally, one would like to do basic-block layout, hot-cold splitting, and function reordering optimizations in the linker ([FB-BOLT](https://research.fb.com/publications/bolt-a-practical-binary-optimizer-for-data-centers-and-beyond/)); however, without support for sections at the function and basic-block level, it is hard to do them in the linker.
This limitation has led us to enable the basic-block layout optimization at the SSA level and the function reordering optimization in the linker.

## Proposal

### New compilation flow proposed in Go for PGO

**Figure 2.** New PGO-enabled Go compiler.

Our proposed Go compilation flow consists of the following high-level components.

* __Pprof-graph__: First, we process the profile proto file that is generated either via the [pprof](https://github.com/google/pprof/tree/main/profile) tool or the perf+[perf\_data\_converter](https://github.com/google/perf_data_converter) tool to produce an inter-package _pprof-graph_. We use the existing _internal/profile_ package to get this graph straight out of the box.
This graph contains most of the information needed to perform PGO, e.g., function names, flat weights, cumulative weights, call sites with callee information for both direct and indirect callees, etc.
Ideally, we would like to generate this graph once and cache it for individual package-level compilations in order to keep compile time under control; however, since packages can be compiled in parallel in multiple processes, we could not find an easy way to share this graph across packages.
We fall back to building this graph for every package.

* __CallGraphBuilder and WeightedCallGraph__: While compiling each package, we visit the IRs of all the functions in the package to produce an intra-package call graph with edge and node weights.
The edge and node weights are obtained from the _pprof-graph_.
We term this the _WeightedCallGraph_ (shown as _CallGraphBuilder_ in Figure 2).
In other words, the _WeightedCallGraph_ is a succinct, package-level version of the pprof-graph with additional information such as associated IR function bodies, call edges linked with IR call sites, multiple callee edges for indirect calls, etc.

* __Embed profiles in the frontend IR & SSA IR__: While building the WeightedCallGraph, we also create a map table that stores each IR instruction and its corresponding profiled weight in order to facilitate the basic-block layout optimization at the SSA level.
The primary reason for storing every IR instruction with weights is that there is no control-flow graph available at the IR level, which could otherwise have been leveraged to annotate only branch instructions and branch edges.
During the lowering from the frontend IR to the SSA IR, the profiling information is annotated onto the basic blocks and edges.
The compiler propagates the profiling information through the control-flow graph.
Similarly, every optimization based on the SSA IR maintains the profiling information.
Annotating every single instruction with weights can increase memory pressure; in the future, we plan to explore other options.

* __Profile-guided Inlining__: During inlining, we determine hot functions and hot call sites based on the _WeightedCallGraph_.
We increase the inlining budget from 80 to a larger value (default 160, tunable via the command line) based on the hotness of the function.
This enables a function to be inlined in its upstream callers even if its cost is more than 80.
Moreover, if the function has a cost of more than 80 and has more than one upstream caller, then only hot call edges will be inlined, and others will not, thereby controlling code-size growth.
Similarly, hot call sites are also inlined aggressively in the "mkinlcall" function using the increased budget.
The hotness criterion is based on a threshold, which can be tuned via the command line (default set to 2% of the total execution time).
One subtle note is that the total number of inlined functions via PGO may be either greater or smaller than with the original inliner without PGO.
For example, in a pathological case, if a hot function gets bulkier (i.e., its cost exceeds 160) due to inlining performed in transitive callees, then it may not get inlined in upstream callers (unlike standard inlining), and if there are a large number of upstream callers, this may lead to fewer inlining outcomes.
This can be addressed by increasing the budget via the command line, but that may also lead to increased code size, something to keep an eye on.

* __Code Specialization__: In order to eliminate indirect function call overheads, we introduce a _code specialization_ step inside the Inliner to transform an indirect call site into an "if-else" code block based on the hot receiver.
The WeightedCallGraph is used to determine a hot receiver.
The "if" condition introduces a check on the runtime type of the hot receiver.
The body of the "if" block converts the indirect function call into a direct function call.
This call subsequently gets inlined, since inlining follows this optimization.
The "else" block retains the original slow indirect call.
As mentioned earlier, our goal is to be able to inline calls to hot receivers.
We purposefully avoid making multiple checks on the receiver type for multiple hot receivers in order to avoid unnecessary code growth and performance degradation in corner cases.
In certain cases, it is possible to hoist the "if" condition check outside of the surrounding loop nests; this is a subject for future work.
During inlining and specialization, we update and propagate the weights of the IR instructions.
* __Basic-block Reordering__: Since the Go compiler produces a large monolithic binary, we implement a state-of-the-art profile-guided basic-block layout optimization, the [ExtTSP heuristic](https://ieeexplore.ieee.org/document/9050435), at the SSA level.
The algorithm is a greedy heuristic that works with chains (ordered lists) of basic blocks.
Initially, each chain is a single basic block.
On every iteration, the pair of chains whose merging yields the biggest benefit in terms of instruction-cache reuse is merged.
The algorithm stops when one chain is left or merging does not improve the overall benefit.
If more than one chain is left, these chains are sorted using a density function that prioritizes placing frequently executed, smaller chains into each page.
Unlike earlier algorithms (e.g., [Pettis-Hansen](https://dl.acm.org/doi/10.1145/93548.93550)), when merging two chains X and Y, the ExtTSP algorithm first splits X into two sub-chains, X1 and X2, yielding three chains: X1, X2, and Y.
It then considers all possible ways of combining these three chains (e.g., X1YX2, X1X2Y, X2X1Y, X2YX1, YX1X2, YX2X1) and chooses the one producing the largest benefit.
We also have an implementation of [Pettis-Hansen](https://dl.acm.org/doi/10.1145/93548.93550) available for performance comparisons.

* __Function Reordering__: Finally, we introduce a "function reordering" optimization in the linker.
The linker code is extended to rebuild the cross-package "pprof-graph" via a command-line argument.
It first produces a summary table consisting of a set of (caller, callee, edge-weight) entries (similar to the WeightedCallGraph without node weights).
We then use the [C3 heuristic](https://dl.acm.org/doi/10.5555/3049832.3049858) to re-layout the functions in the text section.
The structure of the C3 algorithm is similar to ExtTSP, so we avoid describing it again.
One thing to note is that the C3 heuristic places a function as close as possible to its most common caller, following a priority from the hottest to the coldest functions in the program.
One issue we ran into while implementing this algorithm is that the C3 heuristic requires the relative distance of a call site from the start of a function.
We found this tricky to implement in the linker, so our current implementation approximates this distance as half of the code size of the function instead of computing it precisely.
We are able to compute the code size directly from the object file in order to compute density.
In the future, we should explore computing this information in the compiler and storing it in a fixed location in the object file.
We also have an implementation of [Pettis-Hansen](https://dl.acm.org/doi/10.1145/93548.93550) available as an alternative.
Our function reordering optimization is automatically cross-package.
One subtle issue we have run into is that GC-related functions are often intermingled with application code in profiles, making the function reordering optimization sub-optimal.
While GC functions emanating from the "assistGC" function called from the "malloc" routine should be included in the function reordering optimization, other GC functions should probably be sequenced as a separate chain/cluster of their own.
This makes sense given that GC gets invoked at arbitrary times and runs in a separate thread of its own (and perhaps on a separate core as well).
This is a subject of future work for us.

* __Stale Profiles__: As source code evolves, it is highly likely that profiles become stale over time.
In order to deal with this, we propose to use _\<pkg\_name, function\_name, line\_offset\>_ instead of _\<pkg\_name, function\_name, line\_number\>_.
With line\_offset information, we can identify whether a call site has moved up or down relative to the start of the function.
In this case, we do not inline this call site; however, other call sites that have not moved will continue to use the profile information and get optimized.
This design also allows functions that are unchanged to continue to be optimized via PGO.
This design is similar to [AutoFDO](https://dl.acm.org/doi/pdf/10.1145/2854038.2854044).

## Rationale

There are several design choices we have made in this proposal.

__Why not inline every hot callee? Why introduce a budget for hot callees?__

The primary motivation is to control code growth while retaining good performance.
Auto-tuning frameworks such as [OpenTuner](https://github.com/minshuzhan/opentuner) can tune binaries using the command-line options for budget and threshold.

__What is the benefit of the WeightedCallGraph data structure?__

The core purpose of the WeightedCallGraph is to associate IR functions and call sites with node and edge weights from the pprof-graph.
This could perhaps have been achieved via a simpler data structure, and thus the WeightedCallGraph may be a somewhat heavyweight data structure for inlining and specialization alone.
However, it is a forward-looking design that other (interprocedural) optimizations can leverage in the future.

__Why is specialization performed during inlining and not before or after?__

Ideally, we want code specialization to be performed before inlining so that the direct callees can be inlined.
On the other hand, when specialization is performed as a separate pass before the Inliner, one needs to update the WeightedCallGraph during specialization.
When we perform specialization together with inlining, we avoid updating the WeightedCallGraph twice.
Moreover, specialization also requires parts of the CanInline logic to perform legality checks.
Keeping them together also avoids this code duplication.

__Why use the protobuf format?
Why not use other formats similar to LLVM?__

From our experience, the protobuf format is rich enough to carry additional profiling information, particularly hardware performance counter information.
There exist tools such as [perf\_data\_converter](https://github.com/google/perf_data_converter) that can convert a _perf.data_ file produced by Linux [perf](https://www.brendangregg.com/perf.html) into a _profile.proto_ file in protobuf format.
So we do not lose functionality.

__Why do we not perform basic-block layout in the linker?__

Ideally, we should.
Currently, the Go compiler produces one section for code/text.
If it can be extended to produce sections at the basic-block level, we can port our basic-block layout implementation to the linker.

## Compatibility

Currently, our optimizations have been tested on Linux only.
To the best of our knowledge, our implementation follows the [compatibility guidelines](https://golang.org/doc/go1compat).

## Implementation

We have implemented all the optimizations described in this proposal, i.e., inlining, code specialization, basic-block layout, and function reordering.
However, pushing all these changes upstream at once can take a while.
Our strategy is to release only "CallGraphBuilder" and "Profile-guided Inlining" (see Figure 2) in the first phase, targeting the v1.20 release.
Below we provide the implementation details for CallGraphBuilder and profile-guided inlining.

### CallGraphBuilder implementation

* __BuildPProfGraph__: This function builds the pprof-graph based on the "internal/profile" package.
The pprof-graph is built using the _cpuprofile_ data from the profile.
The protobuf profile file is fed to the compiler via a command-line flag (_-profileuse_).
It preprocesses the pprof-graph to compute additional information such as _TotalNodeWeights_ and _TotalEdgeWeights_.
It also creates a global node-to-weight map, which is cheaper to look up during IR traversal in later phases.

* __BuildWeightedCallGraph__: This function builds the WeightedCallGraph data structure by visiting all the _ir.Func_ nodes in the decl-list of a package.
During the traversal of the IR, it is necessary to determine the callee function for a direct call site.
Our logic for this is very similar to "inlCallee" in "inline/inl.go".
Since reusing that code would create a cyclic dependency between the inline and pgo packages, we have duplicated the "inlCallee" logic in irgraph.go (with the same name).
We should refactor this code in the future so that it can live outside of these two packages.

* __RedirectEdges__: This function updates the WeightedCallGraph nodes and edges during and after inlining based on a list of inlined functions.

### Profile-guided Inlining implementation

* __Prologue__: Before performing inlining, we walk over the _WeightedCallGraph_ to identify hot functions and hot call sites based on a command-line-tunable threshold (_-inlinehotthreshold_).
This threshold is expressed as a percentage of time exclusively spent in a function or call site over the total execution time.
It is set to 2% by default and can be updated via the command-line argument.
When a function or call site exclusively spends more time than this threshold percentage, we classify it as a hot function or hot call site.

* __Extensions to CanInline__: Currently, the HairyVisitor bails out as soon as its budget exceeds maxInlineBudget (80).
We introduce a new budget for profile-guided inlining, _inlineHotMaxBudget_, with a default value of 160, which can also be updated via a command-line flag (_-inlinehotbudget_).
We use the inlineHotMaxBudget for hot functions during the CanInline step.
```go
if v.budget < 0 {
	if pgo.WeightedCG != nil {
		if n, ok := pgo.WeightedCG.IRNodes[ir.PkgFuncName(fn)]; ok {
			// Allow hot functions to use the larger PGO budget.
			if inlineMaxBudget-v.budget < inlineHotMaxBudget && n.HotNode {
				return false
			}
		}
	}
	...
}
```

During the HairyVisitor traversal, we also track all the hot call sites of a function.

```go
// Determine whether the callee edge is a hot callee.
if pgo.WeightedCG != nil && ir.CurFunc != nil {
	if fn := inlCallee(n.X); fn != nil && typecheck.HaveInlineBody(fn) {
		// Extract the line number from the "file:line:col" position string.
		lineno := fmt.Sprintf("%v", ir.Line(n))
		splits := strings.Split(lineno, ":")
		l, _ := strconv.ParseInt(splits[len(splits)-2], 0, 64)
		linenum := fmt.Sprintf("%v", l)
		canonicalName := ir.PkgFuncName(ir.CurFunc) + "-" + linenum + "-" + ir.PkgFuncName(fn)
		if _, ok := candHotEdgeMap[canonicalName]; ok {
			listOfHotCallSites[pgo.CallSiteInfo{ir.Line(n), ir.CurFunc}] = struct{}{}
		}
	}
}
```

* __Extension to mkinlcall__: When a hot callee's cost is more than maxCost, we check whether it is below inlineHotMaxBudget in order to permit inlining of hot callees.

```go
if fn.Inl.Cost > maxCost {
	...

	// If the call site is hot and the callee is within the inlineHotMaxBudget
	// budget, inline it; otherwise bail.
	if _, ok := listOfHotCallSites[pgo.CallSiteInfo{ir.Line(n), ir.CurFunc}]; ok {
		if fn.Inl.Cost > inlineHotMaxBudget {
			return n
		}
	} else {
		return n
	}
}
```

* __Epilogue__: During this phase, we update the WeightedCallGraph based on the inlining decisions made earlier.

```go
ir.VisitFuncsBottomUp(typecheck.Target.Decls, func(list []*ir.Func, recursive bool) {
	for _, f := range list {
		name := ir.PkgFuncName(f)
		if n, ok := pgo.WeightedCG.IRNodes[name]; ok {
			pgo.RedirectEdges(n, inlinedCallSites)
		}
	}
})
```

* __Command-line flags__: We introduce the following flag in base/flag.go.

```go
PGOProfile string "help:\"read profile from `file`\""
```

We also introduce the following debug flags in base/debug.go.
```go
InlineHotThreshold string "help:\"threshold percentage for determining hot methods and callsites for inlining\""
InlineHotBudget    int    "help:\"inline budget for hot methods\""
```

## Open issues (if applicable)

* Build the pprof-graph once across all packages instead of building it over and over again for each package.
Currently, the Go compiler builds packages in parallel using multiple processes, making it hard to do this once.

* Enable separate sections for basic blocks and functions in the linker.
This would enable the basic-block reordering optimization at the linker level.
Our implementation currently performs this optimization at the SSA level.
