| author | Austin Clements <austin@google.com> | 2020-08-12 09:57:23 -0400 |
|---|---|---|
| committer | Austin Clements <austin@google.com> | 2020-08-14 01:18:55 +0000 |
| commit | 18f7ad3be99ec697670caa0362fd58aee5d1d181 (patch) | |
| tree | f7917535d70cd599f2158bf5b94346a3f6e3cba5 | |
| parent | a2c03b640cb6c87aaaff70a178f5492937d30667 (diff) | |
design/40724-register-calling: add design doc
Change-Id: Ib491db5e2523acf6f21b94924339a22d236717bc
Reviewed-on: https://go-review.googlesource.com/c/proposal/+/248178
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Reviewed-by: Martin Möhrmann <moehrmann@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
| -rw-r--r-- | design/40724-register-calling.md | 548 |
1 files changed, 548 insertions, 0 deletions
diff --git a/design/40724-register-calling.md b/design/40724-register-calling.md new file mode 100644 index 0000000..3e7cabf --- /dev/null +++ b/design/40724-register-calling.md @@ -0,0 +1,548 @@ +# Proposal: Register-based Go calling convention + +Author: Austin Clements, with input from Cherry Zhang, Michael +Knyszek, Martin Möhrmann, Michael Pratt, David Chase, Keith Randall, +Dan Scales, and Ian Lance Taylor. + +Last updated: 2020-08-10 + +Discussion at https://golang.org/issue/40724. + +## Abstract + +We propose switching the Go ABI from its current stack-based calling +convention to a register-based calling convention. +[Preliminary experiments +indicate](https://github.com/golang/go/issues/18597#issue-199914923) +this will achieve at least a 5–10% throughput improvement across a +range of applications. +This will remain backwards compatible with existing assembly code that +assumes Go’s current stack-based calling convention through Go’s +[multiple ABI +mechanism](https://golang.org/design/27539-internal-abi). + +## Background + +Since its initial release, Go has used a *stack-based calling +convention* based on the Plan 9 ABI, in which arguments and result +values are passed via memory on the stack. +This has significant simplicity benefits: the rules of the calling +convention are simple and build on existing struct layout rules; all +platforms can use essentially the same conventions, leading to shared, +portable compiler and runtime code; and call frames have an obvious +first-class representation, which simplifies the implementation of the +`go` and `defer` statements and reflection calls. +Furthermore, the current Go ABI has no *callee-save registers*, +meaning that no register contents live across a function call (any +live state in a function must be flushed to the stack before a call). +This simplifies stack tracing for garbage collection and stack growth +and stack unwinding during panic recovery. 
+ +Unfortunately, Go’s stack-based calling convention leaves a lot of +performance on the table. +While modern high-performance CPUs heavily optimize stack access, +accessing arguments in registers is still roughly [40% +faster](https://gist.github.com/aclements/ded22bb8451eead8249d22d3cd873566) +than accessing arguments on the stack. +Furthermore, a stack-based calling convention, especially one with no +callee-save registers, induces additional memory traffic, which has +secondary effects on overall performance. + +Most language implementations on most platforms use a register-based +calling convention that passes function arguments and results via +registers rather than memory and designates some registers as +callee-save, allowing functions to keep state in registers across +calls. + +## Proposal + +We propose switching the Go ABI to a register-based calling +convention, starting with a minimum viable product (MVP) on amd64, and +then expanding to other architectures and improving on the MVP. + +We further propose that this calling convention should be designed +specifically for Go, rather than using platform ABIs. +There are several reasons for this. + +It’s incredibly tempting to use the platform calling convention, as it +seems that would allow for more efficient language interoperability. +Unfortunately, there are two major reasons it would do little good, +both related to the scalability of goroutines, a central feature of +the Go language. +One reason goroutines scale so well is that the Go runtime dynamically +resizes their stacks, but this imposes requirements on the ABI that +aren’t satisfied by non-Go functions, thus requiring the runtime to +transition out of the dynamic stack regime on a foreign call. +Another reason is that goroutines are scheduled by the Go runtime +rather than the OS kernel, but this means that transitions to and from +non-Go code must be communicated to the Go scheduler. 
+These two things mean that sharing a calling convention wouldn’t +significantly lower the cost of calling non-Go code. + +The other tempting reason to use the platform calling convention would +be tooling interoperability, particularly with debuggers and profiling +tools. +However, these almost universally support DWARF or, for profilers, +frame pointer unwinding. +Go will continue to work with DWARF-based tools and we can make the Go +ABI compatible with platform frame pointer unwinding without otherwise +taking on the platform ABI. + +Hence, there’s little upside to using the platform ABI. +And there are several reasons to favor using our own ABI: + +- Most existing ABIs were based on the C language, which differs in + important ways from Go. + For example, most ELF ABIs (at least x86-64, ARM64, and RISC-V) + would force Go slices to be passed on the stack rather than in + registers because the slice header is three words. + Similarly, because C functions rarely return more than one word, + most platform ABIs reserve at most two registers for results. + Since Go functions commonly return at least three words (a result + and a two-word error interface value), the platform ABI would force + such functions to return values on the stack. + Other things that influence the platform ABI include that array + arguments in C are passed by reference rather than by value and + small integer types in C are promoted to `int` rather than retaining + their type. + Hence, platform ABIs simply aren’t a good fit for the Go language. + +- Platform ABIs typically define callee-save registers, which place + substantial additional requirements on a garbage collector. + There are alternatives to callee-save registers that share many of + their benefits, while being much better suited to Go. + +- While platform ABIs are generally similar at a high level, their + details differ in myriad ways.
+ By defining our own ABI, we can follow a common structure across all + platforms and maintain much of the cross-platform simplicity and + reliability of Go’s stack-based calling convention. + +The new calling convention will remain backwards-compatible with +existing assembly code that’s based on the stack-based calling +convention via Go’s [multiple ABI +mechanism](https://golang.org/design/27539-internal-abi). + +This same multiple ABI mechanism allows us to continue to evolve the +Go calling convention in future versions. +This lets us start with a simple, minimal calling convention and +continue to optimize it in the future. + +The rest of this proposal outlines the work necessary to switch Go to +a register-based calling convention. +While it lays out the requirements for the ABI, it does not describe a +specific ABI. +Defining a specific ABI will be one of the first implementation steps, +and its definition should reside in a living document rather than a +proposal. + +## Go’s current stack-based ABI + +We give an overview of Go’s current ABI to give a sense of the +requirements of any Go ABI and because the register-based calling +convention builds on the same concepts. + +In the stack-based Go ABI, when a function F calls a function or +method G, F reserves space in its own stack frame for G’s receiver (if +it’s a method), arguments, and results. +These are laid out in memory as if G’s receiver, arguments, and +results were simply fields in a struct. + +There is one exception to all call state being passed on the stack: if +G is a closure, F passes a pointer to its function object in a +*context register*, via which G can quickly access any closed-over +values. + +Other than a few fixed-function registers, all registers are +caller-save, meaning F must spill any live state in registers to its +stack frame before calling G and reload the registers after the call. 
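The "fields in a struct" layout of the stack-based argument frame described in this section can be illustrated with a small sketch. The method signature, the `argFrame` type, and the `frameOffsets` helper are hypothetical and exist only for illustration; the real stack-based ABI may apply its own alignment between arguments and results, so the offsets printed here are not authoritative.

```go
package main

import (
	"fmt"
	"unsafe"
)

// argFrame mirrors the argument frame of a hypothetical method
//
//	func (r *T) G(a int32, s []byte) (n int, err error)
//
// with the receiver, arguments, and results laid out as consecutive
// struct fields. This is an illustration of the idea, not the exact
// layout the compiler produces.
type argFrame struct {
	recv unsafe.Pointer // receiver
	a    int32          // first argument
	s    []byte         // second argument (three-word slice header)
	n    int            // first result
	err  error          // second result (two-word interface value)
}

// frameOffsets returns the byte offset of each field in declaration order.
func frameOffsets() []uintptr {
	var f argFrame
	return []uintptr{
		unsafe.Offsetof(f.recv),
		unsafe.Offsetof(f.a),
		unsafe.Offsetof(f.s),
		unsafe.Offsetof(f.n),
		unsafe.Offsetof(f.err),
	}
}

func main() {
	fmt.Println("offsets:", frameOffsets())
	fmt.Println("frame size:", unsafe.Sizeof(argFrame{}))
}
```

Under this model the caller F would reserve `unsafe.Sizeof(argFrame{})` bytes in its own frame, store the receiver and arguments, and read the results back from the same memory after the call.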
+ +The Go ABI also keeps a pointer to the runtime structure representing +the current goroutine (“G”) available for quick access. +On 386 and amd64, it is stored in thread-local storage; on all other +platforms, it is stored in a dedicated register.<sup>1</sup> + +Every function must ensure sufficient stack space is available before +reserving its stack frame. +The current stack bound is stored in the runtime goroutine structure, +which is why the ABI keeps this readily accessible. +The standard prologue checks the stack pointer against this bound and +calls into the runtime to grow the stack if necessary. +In assembly code, this prologue is automatically generated by the +assembler itself. +Cooperative preemption is implemented by poisoning a goroutine’s stack +bound, and thus also makes use of this standard prologue. + +Finally, both stack growth and the Go garbage collector must be able +to find all live pointers. +Logically, function entry and every call instruction has an associated +bitmap indicating which slots in the local frame and the function’s +argument frame contain live pointers. +Sometimes liveness information is path-sensitive, in which case a +function will have additional [*stack +object*](https://golang.org/cl/134155) metadata. +In all cases, all pointers are in known locations on the stack. + +<sup>1</sup> This is largely a historical accident. +The G pointer was originally stored in a register on 386/amd64. +This is ideal, since it’s accessed in nearly every function prologue. +It was moved to TLS in order to support cgo, since transitions from C +back to Go (including the runtime signal handler) needed a way to +access the current G. +However, when we added ARM support, it turned out accessing TLS in +every function prologue was far too expensive on ARM, so all later +ports used a hybrid approach where the G is stored in both a register +and TLS and transitions from C restore it from TLS. 
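As a rough illustration of the prologue check and of preemption by poisoning the stack bound, the following sketch models both in ordinary Go. The `g` type, the `stackGuard` field, the `preemptGuard` value, and both functions are assumptions made for this example; they are not the runtime's actual definitions, and the sketch ignores the frame size for simplicity.

```go
package main

import "fmt"

// g is a hypothetical stand-in for the runtime goroutine structure.
type g struct {
	stackGuard uintptr // lowest address the stack pointer may reach
}

// preemptGuard is an assumed poison value: it is larger than any real
// stack pointer, so the prologue check below always fails once it is set.
const preemptGuard = ^uintptr(0)

// prologueMustCallRuntime models the check done at function entry.
// Stacks grow down, so a stack pointer at or below the bound means the
// function must call into the runtime before reserving its frame.
func prologueMustCallRuntime(gp *g, sp uintptr) bool {
	return sp <= gp.stackGuard
}

// requestPreempt poisons the stack bound so that the next prologue
// executed by this goroutine calls into the runtime.
func requestPreempt(gp *g) {
	gp.stackGuard = preemptGuard
}

func main() {
	gp := &g{stackGuard: 0x1000}
	fmt.Println(prologueMustCallRuntime(gp, 0x2000)) // false: enough room
	fmt.Println(prologueMustCallRuntime(gp, 0x0800)) // true: below the bound
	requestPreempt(gp)
	fmt.Println(prologueMustCallRuntime(gp, 0x2000)) // true: bound is poisoned
}
```

The point of the design is that one comparison serves two purposes: the same branch that detects a full stack also gives the runtime a place to take control of a running goroutine.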
+ +## ABI design recommendations + +Here we lay out various recommendations for the design of a +register-based Go ABI. +The rest of this document assumes we’ll be following these +recommendations. + +1. Common structure across platforms. + This dramatically simplifies porting work in the compiler and + runtime. + We propose that each architecture should define a sequence of + integer and floating point registers (and in the future perhaps + vector registers), plus size and alignment constraints, and that + beyond this, the calling convention should be derived using a + shared set of rules as much as possible. + +1. Efficient access to the current goroutine pointer and the context + register for closure calls. + Ideally these will be in registers; however, we may use TLS on + architectures with extremely limited registers (namely, 386). + +1. Support for many-word return values. + Go functions frequently return three or more words, so this must be + supported efficiently. + +1. Support for scanning and adjusting pointers in register arguments + on stack growth. + Since the function prologue checks the stack bound before reserving + a stack frame, the runtime must be able to spill argument registers + and identify those containing pointers. + +1. First-class generic call frame representation. + The `go` and `defer` statements as well as reflection calls need to + manipulate call frames as first-class, in-memory objects. + Reflect calls in particular are simplified by a common, generic + representation with fairly generic bridge code (the compiler could + generate bridge code for `go` and `defer`). + +1. No callee-save registers. + Callee-save registers complicate stack unwinding (and garbage + collection if pointers are allowed in callee-save registers). + Inter-function clobber sets have many of the benefits of + callee-save registers, but are much simpler to implement in a + garbage collected language and are well-suited to Go’s compilation + model. 
+ For an MVP, we’re unlikely to implement any form of live registers + across calls, but we’ll want to revisit this later. + +1. Where possible, be compatible with platform frame-pointer unwinding + rules. + This helps Go interoperate with system-level profilers, and can + potentially be used to optimize stack unwinding in Go itself. + +There are also some notable non-requirements: + +1. No compatibility with the platform ABI (other than frame pointers). + This has more downsides than upsides, as described above. + +1. No binary compatibility between Go versions. + This is important for shared libraries in C, but Go already + requires all shared libraries in a process to use the same Go + toolchain version. + This means we can continue to evolve and improve the ABI. + +## Toolchain changes overview + +This section outlines the changes that will be necessary to the Go +build toolchain and runtime. +The "Detailed design" section will go into greater depth on some of +these. + +### Compiler + +*Abstract argument registers*: The compiler’s register allocator will +need to allocate function arguments and results to the appropriate +registers. +However, it needs to represent argument and result registers in a +platform-independent way prior to architecture lowering and register +allocation. +We propose introducing generic SSA values to represent the argument +and result registers, as done in [David Chase’s +prototype](https://golang.org/cl/28832). +These would simply represent the *i*th argument/result register and +register allocation would assign them to the appropriate architecture +registers. +Having a common ABI structure across platforms means the +architecture-independent parts of the compiler would only need to know +how many argument/result registers the target architecture has. + +*Late call lowering*: Call lowering and argument frame construction +currently happen during AST to SSA lowering, which happens well before +register allocation.
+Hence, we propose moving call lowering much later in the compilation +process. +Late call lowering will have knock-on effects, as the current approach +hides a lot of the structure of calls from most optimization passes. + +*ABI bridges*: For compatibility with existing assembly code, the +compiler must generate ABI bridges when calling between Go +(ABIInternal) and assembly (ABI0) code, as described in the [internal +ABI proposal](https://golang.org/design/27539-internal-abi). +These are small functions that translate between ABIs according to a +function’s type. +While the compiler currently differentiates between the two ABIs +internally, since they’re actually identical right now, it currently +only generates *ABI aliases* and has no mechanism for generating ABI +bridges. +As a post-MVP optimization, the compiler should inline these ABI +bridges where possible. + +*Argument GC map*: The garbage collector needs to know which arguments +contain live pointers at function entry and at any calls (since these +are preemption points). +Currently this is represented as a bitmap over words in the function’s +argument frame. +With the register-based ABI, the compiler will need to emit a liveness +map for argument registers for the function entry point. +Since initially we won't have any live registers across calls, live +arguments will be spilled to the stack at a call, so the compiler does +*not* need to emit register maps at calls. +For functions that still require a stack argument frame (because their +arguments don’t all fit in registers), the compiler will also need to +emit argument frame liveness maps at the same points it does today. + +*Traceback argument maps*: Go tracebacks currently display a simple +word-based hex dump of a function’s argument frame. +This is not particularly user-friendly nor high-fidelity, but it can +be incredibly valuable for debugging. 
+With a register-based ABI, there’s a wide range of possible designs +for retaining this functionality. +For an MVP, we propose trying to maintain a similar level of fidelity. +In the future, we may want more detailed maps, or may want to simply +switch to using DWARF location descriptions. + +To that end, we propose that the compiler should emit two logical +maps: a *location map* from (PC, argument word index) to +register/`stack`/`dead` and a *home map* from argument word index to +stack home (if any). +Since a named variable’s stack spill home is fixed if it ever spills, +the location map can use a single distinguished value for `stack` that +tells the runtime to refer to the home map. +This approach works well for an ABI that passes argument values in +separate registers without packing small values. +The `dead` value is not necessarily the same as the garbage +collector’s notion of a dead slot: for the garbage collector, you want +slots to become dead as soon as possible, while for debug printing, +you want them to stay live as long as possible (until clobbered by +something else). + +The exact encoding of these tables is to be determined. +Most likely, we’ll want to introduce pseudo-ops for representing +changes in the location map that the `cmd/internal/obj` package can +then encode into `FUNCDATA`. +The home map could be produced directly by the compiler as `FUNCDATA`. + +*DWARF locations*: The compiler will need to generate DWARF location +lists for arguments and results. +It already has this ability for local variables, and we should reuse +that as much as possible. +We will need to ensure Delve and GDB are compatible with this. +Both already support location lists in general, so this is unlikely to +require much (if any) work in these debuggers. + +Clobber sets will require further changes, which we discuss later. +We propose not implementing clobber sets (or any form of callee-save) +for the MVP. 
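The two logical traceback maps described in the compiler section (a location map and a home map) could be modeled roughly as follows. The proposal leaves the encoding to be determined, so every type, field, and function name here is hypothetical; this is one possible in-memory shape, not the `FUNCDATA` format.

```go
package main

import "fmt"

// locKind says where an argument word lives at a given PC.
type locKind uint8

const (
	locDead     locKind = iota // value is no longer available
	locRegister                // value is in an argument register
	locStack                   // value is in its stack home (see homes)
)

// location is one entry of the location map.
type location struct {
	kind locKind
	reg  int // abstract register index; meaningful only for locRegister
}

// pcRange gives the location of every argument word for PCs in [start, end).
type pcRange struct {
	start, end uintptr
	words      []location
}

// argMaps bundles the location map and the home map for one function.
type argMaps struct {
	locations []pcRange
	homes     []int // per argument word: frame offset of its stack home, or -1
}

// describe reports where argument word `word` can be found at `pc`.
func (m *argMaps) describe(pc uintptr, word int) string {
	for _, r := range m.locations {
		if pc < r.start || pc >= r.end || word < 0 || word >= len(r.words) {
			continue
		}
		switch loc := r.words[word]; loc.kind {
		case locRegister:
			return fmt.Sprintf("register %d", loc.reg)
		case locStack:
			// The single distinguished "stack" value defers to the home map.
			if word < len(m.homes) && m.homes[word] >= 0 {
				return fmt.Sprintf("stack offset %d", m.homes[word])
			}
		}
		return "dead"
	}
	return "dead"
}

func main() {
	m := &argMaps{
		locations: []pcRange{
			{start: 0x100, end: 0x120, words: []location{{locRegister, 0}, {locRegister, 1}}},
			{start: 0x120, end: 0x180, words: []location{{locStack, 0}, {locDead, 0}}},
		},
		homes: []int{16, -1},
	}
	fmt.Println(m.describe(0x104, 0)) // register 0
	fmt.Println(m.describe(0x130, 0)) // stack offset 16
	fmt.Println(m.describe(0x130, 1)) // dead
}
```

Keeping the stack home in a separate per-word table is what lets the location map use a single `stack` value: a spilled variable's home never changes, so only the transitions between register, stack, and dead need to be recorded per PC range.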
+ +### Linker + +The linker requires relatively minor changes, all related to ABI +bridges. + +*Eliminate ABI aliases*: Currently, the linker resolves ABI aliases +generated by the compiler by treating all references to a symbol +aliased under one ABI as references to the symbol under the other +ABI. +Once the compiler generates ABI bridges rather than aliases, we can +remove this mechanism, which is likely to simplify and speed up the +linker somewhat. + +*ABI name mangling*: Since Go ABIs work by having multiple symbol +definitions under the same name, the linker will also need to +implement a name mangling scheme for non-Go symbol tables. + +### Runtime + +*First-class call frame representation*: The `go` and `defer` +statements and reflection calls must manipulate call frames as +first-class objects. +While the requirements of these three cases differ, we propose having +a common first-class call frame representation that can capture a +function’s register and stack arguments and record its register and +stack results, along with a small set of generic call bridges that +invoke a call using the generic call frame. + +*Stack growth*: Almost every Go function checks for sufficient stack +space before opening its local stack frame. +If there is insufficient space, it calls into the `runtime.morestack` +function to grow the stack. +Currently, `morestack` saves only the calling PC, the stack pointer, +and the context register (if any) because these are the only registers +that can be live at function entry. +With register-based arguments, `morestack` will also have to save all +argument registers. +We propose that it simply spill all *possible* argument registers +rather than trying to be specific to the function; `morestack` is +relatively rare, so the cost of this is unlikely to be noticeable.
+It’s likely possible to spill all argument registers to the stack +itself: every function that can grow the stack ensures that there’s +room not only for its local frame, but also for a reasonably large +“guard” space. +`morestack` can spill into this guard space. +The garbage collector can recognize `morestack`’s spill space and use +the argument map of its caller as the stack map of `morestack`. + +*Runtime assembly*: While Go’s multiple ABI mechanism makes it +generally possible to transparently call between Go and assembly code +even if they’re using different ABIs, there are runtime assembly +functions that have deep knowledge of the Go ABI and will have to be +modified. +This includes any function that takes a closure (`mcall`, +`systemstack`), is called in a special context (`morestack`), or is +involved in reflection-like calls (`reflectcall`, `debugCallV1`). + +*Cgo wrappers*: Generated cgo wrappers marked with +`//go:cgo_unsafe_args` currently access their argument structure by +casting a pointer to their first argument. +This violates the `unsafe.Pointer` rules and will no longer work with +this change. +We can either special case `//go:cgo_unsafe_args` functions to use +ABI0 or change the way these wrappers are generated. + +*Stack unwinding for panic recovery*: When a panic is recovered, the +Go runtime must unwind the panicking stack and resume execution after +the deferred call of the recovering function. +For the MVP, we propose not retaining any live registers across calls, +in which case stack unwinding will not have to change. +This is not the case with callee-save registers or clobber sets. + +*Traceback argument printing*: As mentioned in the compiler section, +the runtime currently prints a hex dump of function arguments in panic +tracebacks. +This will have to consume the new traceback argument metadata produced +by the compiler. + +## Detailed design + +This section dives deeper into some of the toolchain changes described +above. 
+We’ll expand this section over time. + +### `go`, `defer` and reflection calls + +Above we proposed using a first-class call frame representation for +`go` and `defer` statements and reflection calls with a small set of +call bridges. +These three cases have somewhat different requirements: + +- The types of `go` and `defer` calls are known statically, while + reflect calls are not. + This means the compiler could statically generate bridges to + unmarshall arguments for `go` and `defer` calls, but this isn’t an + option for reflection calls. + +- The return values of `go` and `defer` calls are always ignored, + while reflection calls must capture results. + This means a call bridge for a `go` or `defer` call can be a tail + call, while reflection calls can require marshalling return values. + +- Call frames for `go` and `defer` calls are long-lived, while + reflection call frames are transient. + This means the garbage collector must be able to scan `go` and + `defer` call frames, while we could use non-preemptible regions for + reflection calls. + +- Finally, `go` call frames are stored directly on the stack, while + `defer` and reflection call frames may be constructed in the heap. + This means the garbage collector must be able to construct the + appropriate stack map for `go` call frames, but `defer` and + reflection call frames can use the heap bitmap. + It also means `defer` and reflection calls that require stack + arguments must copy that part of the call frame from the heap to the + stack, though we don’t expect this to be the common case. 
+ +To satisfy these requirements, we propose the following generic +call-frame representation: + +``` +struct { + pc uintptr // PC of target function + nInt, nFloat uintptr // # of int and float registers + ints [nInt]uintptr // Int registers + floats [nFloat]uint64 // Float registers + ctxt uintptr // Context register + stack [...]uintptr // Stack arguments/result space +} +``` + +`go` calls can build this structure on the new goroutine stack and the +call bridge can pop the register part of this structure from the +stack, leaving just the `stack` part on the stack, and tail-call `pc`. +The garbage collector can recognize this call bridge and construct the +stack map by inspecting the `pc` in the call frame. + +`defer` and reflection calls can build frames in the heap with the +appropriate heap bitmap. +The call bridge in these cases must open a new stack frame, copy +`stack` to the stack, load the register arguments, call `pc`, and then +copy the register results and the stack results back to the in-heap +frame (using write barriers where necessary). +It may be valuable to have optimized versions of this bridge for +tail-calls (always the case for `defer`) and register-only calls +(likely a common case). +In the register-only reflection call case, the bridge could take the +register arguments as arguments itself and return register results as +results; this would avoid any copying or write barriers. + +## Compatibility + +This proposal is Go 1-compatible. + +While Go assembly is not technically covered by Go 1 compatibility, +this will maintain compatibility with the vast majority of assembly +code using Go’s [multiple ABI +mechanism](https://golang.org/design/27539-internal-abi). +This translates between Go’s existing stack-based calling convention +used by all existing assembly code and Go’s internal calling +convention. + +There are a few known forms of unsafe code that this change will +break: + +- Assembly code that invokes Go closures. 
+ The closure calling convention was never publicly documented, but + there may be code that does this anyway. + +- Code that performs `unsafe.Pointer` arithmetic on pointers to + arguments in order to observe the contents of the stack. + This is a violation of the [`unsafe.Pointer` + rules](https://pkg.go.dev/unsafe#Pointer) today. + +## Implementation + +We aim to implement a minimum viable register-based Go ABI for amd64 +in the 1.16 time frame. +As of this writing (nearing the opening of the 1.16 tree), Dan Scales +has made substantial progress on ABI bridges for a simple ABI change +and David Chase has made substantial progress on late call lowering. +Austin Clements will lead the work with David Chase and Than McIntosh +focusing on the compiler side, Cherry Zhang focusing on aspects that +bridge the compiler and runtime, and Michael Knyszek focusing on the +runtime. |
