diff options
| author | Guoqi Chen <chenguoqi@loongson.cn> | 2024-09-13 18:47:56 +0800 |
|---|---|---|
| committer | abner chenc <chenguoqi@loongson.cn> | 2024-11-07 02:19:55 +0000 |
| commit | ac345fb7e704ede49c0c506bfd9f8d0f4b61cd7c (patch) | |
| tree | 2c39d32a0f7309d4fdcc917cd6601c04f33f12ac /src/cmd/internal/obj | |
| parent | 9088883cf4a5181cf796c968bbce5a5bc3edc7ab (diff) | |
| download | go-ac345fb7e704ede49c0c506bfd9f8d0f4b61cd7c.tar.xz | |
cmd/compiler,internal/runtime/atomic: optimize Store{64,32,8} on loong64
On Loong64, AMSWAPDB{W,V} instructions are supported by default, and AMSWAPDB{B,H} [1]
is a new instruction added by LA664(Loongson 3A6000) and later microarchitectures.
Therefore, AMSWAPDB{W,V} (full barrier) is used to implement AtomicStore{32,64}, and
the traditional MOVB or the new AMSWAPDBB is used to implement AtomicStore8 according
to the CPU feature.
The StoreRelease barrier on Loong64 is "dbar 0x12", but it is still necessary to
ensure consistency in the order of Store/Load [2].
LoweredAtomicStorezero{32,64} was removed because on loong64 the constant "0" uses
the R0 register, and there is no performance difference between the implementations
of LoweredAtomicStorezero{32,64} and LoweredAtomicStore{32,64}.
goos: linux
goarch: loong64
pkg: internal/runtime/atomic
cpu: Loongson-3A5000-HV @ 2500.00MHz
| bench.old | bench.new |
| sec/op | sec/op vs base |
AtomicStore64 19.61n ± 0% 13.61n ± 0% -30.60% (p=0.000 n=20)
AtomicStore64-2 19.61n ± 0% 13.61n ± 0% -30.57% (p=0.000 n=20)
AtomicStore64-4 19.62n ± 0% 13.61n ± 0% -30.63% (p=0.000 n=20)
AtomicStore 19.61n ± 0% 13.61n ± 0% -30.60% (p=0.000 n=20)
AtomicStore-2 19.62n ± 0% 13.61n ± 0% -30.63% (p=0.000 n=20)
AtomicStore-4 19.62n ± 0% 13.62n ± 0% -30.58% (p=0.000 n=20)
AtomicStore8 19.61n ± 0% 20.01n ± 0% +2.04% (p=0.000 n=20)
AtomicStore8-2 19.62n ± 0% 20.02n ± 0% +2.01% (p=0.000 n=20)
AtomicStore8-4 19.61n ± 0% 20.02n ± 0% +2.09% (p=0.000 n=20)
geomean 19.61n 15.48n -21.08%
goos: linux
goarch: loong64
pkg: internal/runtime/atomic
cpu: Loongson-3A6000 @ 2500.00MHz
| bench.old | bench.new |
| sec/op | sec/op vs base |
AtomicStore64 18.03n ± 0% 12.81n ± 0% -28.93% (p=0.000 n=20)
AtomicStore64-2 18.02n ± 0% 12.81n ± 0% -28.91% (p=0.000 n=20)
AtomicStore64-4 18.01n ± 0% 12.81n ± 0% -28.87% (p=0.000 n=20)
AtomicStore 18.02n ± 0% 12.81n ± 0% -28.91% (p=0.000 n=20)
AtomicStore-2 18.01n ± 0% 12.81n ± 0% -28.87% (p=0.000 n=20)
AtomicStore-4 18.01n ± 0% 12.81n ± 0% -28.87% (p=0.000 n=20)
AtomicStore8 18.01n ± 0% 12.81n ± 0% -28.87% (p=0.000 n=20)
AtomicStore8-2 18.01n ± 0% 12.81n ± 0% -28.87% (p=0.000 n=20)
AtomicStore8-4 18.01n ± 0% 12.81n ± 0% -28.87% (p=0.000 n=20)
geomean 18.01n 12.81n -28.89%
[1]: https://loongson.github.io/LoongArch-Documentation/LoongArch-ELF-ABI-EN.html
[2]: https://gcc.gnu.org/git/?p=gcc.git;a=blob_plain;f=gcc/config/loongarch/sync.md
Change-Id: I4ae5e8dd0e6f026129b6e503990a763ed40c6097
Reviewed-on: https://go-review.googlesource.com/c/go/+/581356
Reviewed-by: sophie zhao <zhaoxiaolin@loongson.cn>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Qiqi Huang <huangqiqi@loongson.cn>
Reviewed-by: Meidan Li <limeidan@loongson.cn>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: David Chase <drchase@google.com>
Diffstat (limited to 'src/cmd/internal/obj')
| -rw-r--r-- | src/cmd/internal/obj/loong64/doc.go | 24 |
1 files changed, 24 insertions, 0 deletions
diff --git a/src/cmd/internal/obj/loong64/doc.go b/src/cmd/internal/obj/loong64/doc.go index 6ec53e7a17..e4c33f6525 100644 --- a/src/cmd/internal/obj/loong64/doc.go +++ b/src/cmd/internal/obj/loong64/doc.go @@ -88,5 +88,29 @@ Examples: MOVB R6, (R4)(R5) <=> stx.b R6, R5, R5 MOVV R6, (R4)(R5) <=> stx.d R6, R5, R5 MOVV F6, (R4)(R5) <=> fstx.d F6, R5, R5 + +# Special instruction encoding definition and description on LoongArch + + 1. DBAR hint encoding for LA664(Loongson 3A6000) and later micro-architectures, paraphrased + from the Linux kernel implementation: https://git.kernel.org/torvalds/c/e031a5f3f1ed + + - Bit4: ordering or completion (0: completion, 1: ordering) + - Bit3: barrier for previous read (0: true, 1: false) + - Bit2: barrier for previous write (0: true, 1: false) + - Bit1: barrier for succeeding read (0: true, 1: false) + - Bit0: barrier for succeeding write (0: true, 1: false) + - Hint 0x700: barrier for "read after read" from the same address + + Traditionally, on microstructures that do not support dbar grading such as LA464 + (Loongson 3A5000, 3C5000) all variants are treated as “dbar 0” (full barrier). + +2. Notes on using atomic operation instructions + + - AM*_DB.W[U]/V[U] instructions such as AMSWAPDBW not only complete the corresponding + atomic operation sequence, but also implement the complete full data barrier function. + + - When using the AM*_.W[U]/D[U] instruction, registers rd and rj cannot be the same, + otherwise an exception is triggered, and rd and rk cannot be the same, otherwise + the execution result is uncertain. */ package loong64 |
