diff options
| author | Jorropo <jorropo.pgm@gmail.com> | 2025-07-03 01:35:51 +0200 |
|---|---|---|
| committer | Jorropo <jorropo.pgm@gmail.com> | 2025-07-24 14:42:10 -0700 |
| commit | fcd28070fe4fe86b04c760dd7ce5fff2aa63bad5 (patch) | |
| tree | b31d02d99517c995438222a9f243fec1e5b5deb3 /src/cmd/compile/internal/ssa/_gen/generic.rules | |
| parent | f32cf8e4b025eee84aa3ec690966fa4e737a7522 (diff) | |
| download | go-fcd28070fe4fe86b04c760dd7ce5fff2aa63bad5.tar.xz | |
cmd/compile: add opt branchelim to rewrite some CondSelect into math
This allows something like:
if y { x++ }
To be compiled to:
MOVBLZX BX, CX
ADDQ CX, AX
Instead of:
LEAQ 1(AX), CX
MOVBLZX BL, DX
TESTQ DX, DX
CMOVQNE CX, AX
While ./make.bash uniqued per LOC, there is 100 additions and 75 substractions.
See benchmark here: https://go.dev/play/p/DJf5COjwhd_s
Either it's a performance no-op or it is faster:
goos: linux
goarch: amd64
cpu: AMD Ryzen 5 3600 6-Core Processor
│ /tmp/old.logs │ /tmp/new.logs │
│ sec/op │ sec/op vs base │
CmovInlineConditionAddLatency-12 0.5443n ± 5% 0.5339n ± 3% -1.90% (p=0.004 n=10)
CmovInlineConditionAddThroughputBy6-12 1.492n ± 1% 1.494n ± 1% ~ (p=0.955 n=10)
CmovInlineConditionSubLatency-12 0.5419n ± 3% 0.5282n ± 3% -2.52% (p=0.019 n=10)
CmovInlineConditionSubThroughputBy6-12 1.587n ± 1% 1.584n ± 2% ~ (p=0.492 n=10)
CmovOutlineConditionAddLatency-12 0.5223n ± 1% 0.2639n ± 4% -49.47% (p=0.000 n=10)
CmovOutlineConditionAddThroughputBy6-12 1.159n ± 1% 1.097n ± 2% -5.35% (p=0.000 n=10)
CmovOutlineConditionSubLatency-12 0.5271n ± 3% 0.2654n ± 2% -49.66% (p=0.000 n=10)
CmovOutlineConditionSubThroughputBy6-12 1.053n ± 1% 1.050n ± 1% ~ (p=1.000 n=10)
geomean
There are other benefits not tested by this benchmark:
- the math form is usually a couple bytes shorter (ICACHE)
- the math form is usually 0~2 uops shorter (UCACHE)
- the math form has usually less register pressure*
- the math form can sometimes be optimized further
*regalloc rarely find how it can use less registers
As far as pass ordering goes there are many possible options,
I've decided to reorder branchelim before late opt since:
- unlike running exclusively the CondSelect rules after branchelim,
some extra optimizations might trigger on the adds or subs.
- I don't want to maintain a second generic.rules file of only the stuff,
that can trigger after branchelim.
- rerunning all of opt a third time increase compilation time for little gains.
By elimination moving branchelim seems fine.
Change-Id: I869adf57e4d109948ee157cfc47144445146bafd
Reviewed-on: https://go-review.googlesource.com/c/go/+/685676
Reviewed-by: Keith Randall <khr@golang.org>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Keith Randall <khr@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Diffstat (limited to 'src/cmd/compile/internal/ssa/_gen/generic.rules')
| -rw-r--r-- | src/cmd/compile/internal/ssa/_gen/generic.rules | 12 |
1 files changed, 12 insertions, 0 deletions
diff --git a/src/cmd/compile/internal/ssa/_gen/generic.rules b/src/cmd/compile/internal/ssa/_gen/generic.rules index b178a1add6..89657bdabb 100644 --- a/src/cmd/compile/internal/ssa/_gen/generic.rules +++ b/src/cmd/compile/internal/ssa/_gen/generic.rules @@ -295,6 +295,9 @@ (Neq16 (Const16 <t> [c]) (Add16 (Const16 <t> [d]) x)) => (Neq16 (Const16 <t> [c-d]) x) (Neq8 (Const8 <t> [c]) (Add8 (Const8 <t> [d]) x)) => (Neq8 (Const8 <t> [c-d]) x) +(CondSelect x _ (ConstBool [true ])) => x +(CondSelect _ y (ConstBool [false])) => y + // signed integer range: ( c <= x && x (<|<=) d ) -> ( unsigned(x-c) (<|<=) unsigned(d-c) ) (AndB (Leq64 (Const64 [c]) x) ((Less|Leq)64 x (Const64 [d]))) && d >= c => ((Less|Leq)64U (Sub64 <x.Type> x (Const64 <x.Type> [c])) (Const64 <x.Type> [d-c])) (AndB (Leq32 (Const32 [c]) x) ((Less|Leq)32 x (Const32 [d]))) && d >= c => ((Less|Leq)32U (Sub32 <x.Type> x (Const32 <x.Type> [c])) (Const32 <x.Type> [d-c])) @@ -2843,3 +2846,12 @@ && clobber(sbts) && clobber(key) => (StaticLECall {f} [argsize] dict_ (StringMake <typ.String> ptr len) mem) + +// Transform some CondSelect into math operations. +// if b { x++ } => x += b // but not on arm64 because it has CSINC +(CondSelect (Add8 <t> x (Const8 [1])) x bool) && config.arch != "arm64" => (Add8 x (CvtBoolToUint8 <t> bool)) +(CondSelect (Add(64|32|16) <t> x (Const(64|32|16) [1])) x bool) && config.arch != "arm64" => (Add(64|32|16) x (ZeroExt8to(64|32|16) <t> (CvtBoolToUint8 <types.Types[types.TUINT8]> bool))) + +// if b { x-- } => x -= b +(CondSelect (Add8 <t> x (Const8 [-1])) x bool) => (Sub8 x (CvtBoolToUint8 <t> bool)) +(CondSelect (Add(64|32|16) <t> x (Const(64|32|16) [-1])) x bool) => (Sub(64|32|16) x (ZeroExt8to(64|32|16) <t> (CvtBoolToUint8 <types.Types[types.TUINT8]> bool))) |
