path: root/src/unicode/utf8
Age  Commit message  (Author)
2022-09-07 unicode/utf8: use strings.Builder (cuiweixie)
    Change-Id: I88b55f61eccb5764cac2a9397fd99a62f8735a9a
    Reviewed-on: https://go-review.googlesource.com/c/go/+/428281
    Reviewed-by: Ian Lance Taylor <iant@google.com>
    Auto-Submit: Ian Lance Taylor <iant@google.com>
    TryBot-Result: Gopher Robot <gobot@golang.org>
    Run-TryBot: Ian Lance Taylor <iant@google.com>
    Reviewed-by: Benny Siegert <bsiegert@gmail.com>
2022-03-02 unicode/utf8: optimize Valid to parity with ValidString (Alan Donovan)
    The benchmarks added in this change revealed that ValidString runs ~17% faster than Valid([]byte) on the ASCII prefix of the input. Inspection of the assembly revealed that the code generated for p[8:] required recomputing the slice capacity to handle the cap=0 special case, which added an ADD -8 instruction. By making len=cap, the capacity becomes a common subexpression with the length, saving the ADD instruction. (Thanks to khr for the tip.)

    Incidentally, I tried a number of other optimizations but was unable to make consistent gains across all benchmarks. The most promising was to retain the bitmask of non-ASCII bytes from the fast loop; the slow loop would shift it, and when it becomes zero, return to the fast loop. This made the MostlyASCII benchmark 4x faster, but made the other cases slower by up to 10%.

    cpu: Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
    benchmark                                old ns/op  new ns/op  delta
    BenchmarkValidTenASCIIChars-16           4.09       4.06       -0.85%
    BenchmarkValid100KASCIIChars-16          9325       7747       -16.92%
    BenchmarkValidTenJapaneseChars-16        27.0       27.2       +0.85%
    BenchmarkValidLongMostlyASCII-16         57277      58361      +1.89%
    BenchmarkValidLongJapanese-16            94002      93131      -0.93%
    BenchmarkValidStringTenASCIIChars-16     4.15       4.07       -1.74%
    BenchmarkValidString100KASCIIChars-16    7980       8019       +0.49%
    BenchmarkValidStringTenJapaneseChars-16  26.0       25.9       -0.38%
    BenchmarkValidStringLongMostlyASCII-16   58550      58006      -0.93%
    BenchmarkValidStringLongJapanese-16      97964      100038     +2.12%

    Change-Id: Ic9d585dedd9af83c27dd791ecd805150ac949f15
    Reviewed-on: https://go-review.googlesource.com/c/go/+/375594
    Reviewed-by: Keith Randall <khr@golang.org>
    Run-TryBot: Keith Randall <khr@golang.org>
    TryBot-Result: Gopher Robot <gobot@golang.org>
    Trust: Alex Rakoczy <alex@golang.org>
2021-11-05 unicode/utf8: add AppendRune Example (jiahua wang)
    Also, correct TestAppendRune error message.

    Change-Id: I3ca3ac7051af1ae6d449381b78efa86c2f6be8ac
    Reviewed-on: https://go-review.googlesource.com/c/go/+/354529
    Reviewed-by: Ian Lance Taylor <iant@golang.org>
    Run-TryBot: Ian Lance Taylor <iant@golang.org>
    Trust: Robert Findley <rfindley@google.com>
    Trust: Cherry Mui <cherryyz@google.com>
    TryBot-Result: Go Bot <gobot@golang.org>
2021-08-28 unicode/utf8: add AppendRune (Joe Tsai)
    AppendRune appends the UTF-8 encoding of a rune to a []byte. It is generally more user-friendly than EncodeRune.

    EncodeASCIIRune-4     2.35ns ± 2%
    EncodeJapaneseRune-4  4.60ns ± 2%
    AppendASCIIRune-4     0.30ns ± 3%
    AppendJapaneseRune-4  4.70ns ± 2%

    The ASCII case is written to be inlineable.

    Fixes #47609

    Change-Id: If4f71eedffd2bd4ef0d7f960cb55b41c637eec54
    Reviewed-on: https://go-review.googlesource.com/c/go/+/345571
    Trust: Joe Tsai <joetsai@digital-static.net>
    Reviewed-by: Rob Pike <r@golang.org>
    Run-TryBot: Rob Pike <r@golang.org>
    TryBot-Result: Go Bot <gobot@golang.org>
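As a rough illustration of the API this commit describes, the sketch below grows a byte slice with utf8.AppendRune (available since Go 1.18). The helper name encodeAll is hypothetical and not part of the package.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodeAll is a hypothetical helper: it appends the UTF-8 encoding of
// each rune to dst and returns the extended slice.
func encodeAll(dst []byte, runes []rune) []byte {
	for _, r := range runes {
		// AppendRune grows dst as needed, so no pre-sized buffer is
		// required (unlike EncodeRune).
		dst = utf8.AppendRune(dst, r)
	}
	return dst
}

func main() {
	buf := encodeAll([]byte("go: "), []rune{'日', '本', '語'})
	fmt.Printf("%s (%d bytes)\n", buf, len(buf))
}
```

With EncodeRune the caller would have to size the destination buffer up front (for example with utf8.UTFMax bytes per rune) and track the write offset by hand.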
2020-09-19 unicode/utf8: document the handling of runes out of range in EncodeRune (Ainar Garipov)
    Document the way EncodeRune currently handles runes which are out of range. Also add an example showing that behaviour.

    Change-Id: I0f8e7645ae053474ec319085a2bb6d7f73bc137c
    Reviewed-on: https://go-review.googlesource.com/c/go/+/255998
    Reviewed-by: Rob Pike <r@golang.org>
    Reviewed-by: Giovanni Bajo <rasky@develer.com>
    Trust: Giovanni Bajo <rasky@develer.com>
    Run-TryBot: Giovanni Bajo <rasky@develer.com>
    TryBot-Result: Go Bot <gobot@golang.org>
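The documented behaviour can be observed with a short sketch like the following, which is not the example added by the commit. The helper encodeOne is hypothetical; the package documentation states that EncodeRune writes the encoding of RuneError (U+FFFD) when the rune is out of range.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodeOne is a hypothetical helper that encodes r with EncodeRune and
// returns the encoded bytes as a string plus the number of bytes written.
func encodeOne(r rune) (string, int) {
	buf := make([]byte, utf8.UTFMax)
	n := utf8.EncodeRune(buf, r)
	return string(buf[:n]), n
}

func main() {
	// A negative value, a value above MaxRune, and a surrogate half are
	// all out of range for UTF-8.
	for _, r := range []rune{-1, 0x110000, 0xD800} {
		s, n := encodeOne(r)
		fmt.Printf("%#x -> %q (%d bytes)\n", r, s, n)
	}
}
```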
2020-09-10 unicode/utf8: refactor benchmarks for FullRune function (eric fang)
    BenchmarkFullASCIIRune tests the performance of function utf8.FullRune, which will be inlined in BenchmarkFullASCIIRune. Since the return value of FullRune is not referenced, it will be removed as dead code. This CL assigns the FullRune function's return value to a global variable to prevent that.

    In addition, this CL adds one more benchmark to cover more code paths, and puts them together as sub benchmarks of BenchmarkFullRune.

    Change-Id: I6e79f4c087adf70e351498a4b58d7482dcd1ec4a
    Reviewed-on: https://go-review.googlesource.com/c/go/+/233979
    Run-TryBot: eric fang <eric.fang@arm.com>
    TryBot-Result: Gobot Gobot <gobot@golang.org>
    Reviewed-by: Ian Lance Taylor <iant@golang.org>
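The sink-variable technique described here can be sketched as below. This is a minimal illustration rather than the benchmark code from the CL: the names boolSink and fullRuneLoop are hypothetical, and testing.Benchmark is used only so the sketch runs as a standalone program.

```go
package main

import (
	"fmt"
	"testing"
	"unicode/utf8"
)

// boolSink is a package-level variable (hypothetical name). Storing the
// result here keeps the compiler from discarding the inlined call as
// dead code.
var boolSink bool

// fullRuneLoop calls utf8.FullRune n times and keeps the last result alive.
func fullRuneLoop(n int, p []byte) {
	for i := 0; i < n; i++ {
		boolSink = utf8.FullRune(p)
	}
}

func main() {
	cases := []struct {
		name string
		data []byte
	}{
		{"ASCII", []byte("a")},
		{"Incomplete", []byte("\xe6\x97")}, // first two bytes of a 3-byte rune
	}
	for _, c := range cases {
		res := testing.Benchmark(func(b *testing.B) {
			fullRuneLoop(b.N, c.data)
		})
		fmt.Println(c.name, res)
	}
}
```

In a real _test.go file the same loop would normally live in a BenchmarkXxx function, with b.Run creating one sub-benchmark per case.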
2020-04-22 unicode/utf8: optimize Valid and ValidString for ASCII checks (Martin Möhrmann)
    Add a fastpath that uses 32bit loads and compares to check 8 ASCII characters per loop iteration. This avoids the overhead of comparing and branching for every byte individually. Combining two 32bit loads into an uint32 allows the same code to be used for 32bit and 64bit platforms.

    amd64 (Intel i7-3520M):
    name                         old time/op  new time/op  delta
    ValidTenASCIIChars           15.6ns ± 4%  8.5ns ±14%   -45.27%  (p=0.000 n=10+10)
    ValidTenJapaneseChars        50.0ns ± 2%  52.7ns ±15%  ~        (p=0.469 n=10+10)
    ValidStringTenASCIIChars     13.5ns ± 1%  7.9ns ± 5%   -41.56%  (p=0.000 n=10+10)
    ValidStringTenJapaneseChars  46.3ns ± 2%  45.8ns ± 2%  ~        (p=0.085 n=10+10)

    arm (Raspberry Pi 3):
    name                         old time/op  new time/op  delta
    ValidTenASCIIChars           87.5ns ± 0%  58.5ns ± 0%  -33.11%  (p=0.000 n=9+10)
    ValidTenJapaneseChars        359ns ± 0%   384ns ± 0%   +6.96%   (p=0.000 n=10+9)
    ValidStringTenASCIIChars     87.5ns ± 0%  57.5ns ± 0%  -34.31%  (p=0.000 n=10+10)
    ValidStringTenJapaneseChars  356ns ± 0%   377ns ± 0%   +5.90%   (p=0.000 n=10+10)

    Change-Id: I9da942bddb250ee1f0ef7aabb4a8cb48edd9053e
    Reviewed-on: https://go-review.googlesource.com/c/go/+/228823
    Run-TryBot: Martin Möhrmann <moehrmann@google.com>
    TryBot-Result: Gobot Gobot <gobot@golang.org>
    Reviewed-by: Keith Randall <khr@golang.org>
    Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
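The fast path can be approximated with the sketch below. It is an illustration under assumptions, not the code from the CL: the function isASCII is hypothetical and only answers whether the input is pure ASCII, whereas the real Valid falls back to full UTF-8 validation when a non-ASCII byte is seen.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isASCII is a hypothetical helper that checks 8 bytes per iteration by
// combining them into two 32-bit words and testing the high bit of every
// byte with a single mask.
func isASCII(p []byte) bool {
	for len(p) >= 8 {
		// Two 32-bit words keep the same code usable on 32-bit and
		// 64-bit platforms.
		first32 := uint32(p[0]) | uint32(p[1])<<8 | uint32(p[2])<<16 | uint32(p[3])<<24
		second32 := uint32(p[4]) | uint32(p[5])<<8 | uint32(p[6])<<16 | uint32(p[7])<<24
		if (first32|second32)&0x80808080 != 0 {
			return false // some byte has its high bit set
		}
		p = p[8:]
	}
	// Fewer than 8 bytes remain: check them one at a time.
	for _, b := range p {
		if b >= utf8.RuneSelf {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(isASCII([]byte("hello, world!!")))
	fmt.Println(isASCII([]byte("hello, wörld")))
}
```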
2020-01-06 all: fix typo in RuneSelf, runeSelf comments (Tim Cooper)
    Fixes #36396

    Change-Id: I52190f450fa9ac52fbf4ecdc814e954dc29029cd
    Reviewed-on: https://go-review.googlesource.com/c/go/+/213377
    Reviewed-by: Daniel Martí <mvdan@mvdan.cc>
    Run-TryBot: Daniel Martí <mvdan@mvdan.cc>
    TryBot-Result: Gobot Gobot <gobot@golang.org>
2019-10-06 unicode/utf8: add link to formal UTF-8 description. (Serhat Giydiren)
    Fixes #31590

    Change-Id: I7fd6dcc5c34496776439ff0295f18b5fb5cb538a
    Reviewed-on: https://go-review.googlesource.com/c/go/+/199141
    Reviewed-by: Emmanuel Odeke <emm.odeke@gmail.com>
2019-04-25 unicode/utf8: remove some bounds checks from DecodeRune (Josh Bleecher Snyder)
    The compiler couldn't quite see that reading p[2] and p[3] was safe. This change provides a few hints to help it.

    First, make sz an int throughout, rather than just when checking the input length. Second, use <= instead of == in later comparisons.

    name                  old time/op  new time/op  delta
    DecodeASCIIRune-8     2.62ns ± 3%  2.60ns ± 5%  ~        (p=0.126 n=18+19)
    DecodeJapaneseRune-8  4.46ns ±10%  4.01ns ± 5%  -10.00%  (p=0.000 n=19+20)

    Change-Id: I2f78a17e38156fbf8b0f5dd6c07c20d6a47e9209
    Reviewed-on: https://go-review.googlesource.com/c/go/+/173662
    Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
    TryBot-Result: Gobot Gobot <gobot@golang.org>
    Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
2019-04-24 unicode/utf8: use binary literals (Josh Bleecher Snyder)
    We were using hex literals and had the binary literal in a comment. When I was working with this code, I always referred to the comment. That's an indicator that we should just use the binary literal directly.

    Updates #19308

    Change-Id: I2279cb8efb4ae5f2e1558c15979058ab09eb4f6f
    Reviewed-on: https://go-review.googlesource.com/c/go/+/173663
    Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
    TryBot-Result: Gobot Gobot <gobot@golang.org>
    Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
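To show why binary literals read well for this kind of bit manipulation, here is a small sketch of a two-byte UTF-8 encoder. The constant names and the encode2 helper are illustrative choices for this example, not a copy of the package source; binary literals require Go 1.13 or later.

```go
package main

import "fmt"

// Illustrative constants written as binary literals, so the bit layout
// of each byte is visible without a separate comment.
const (
	tx    = 0b10000000 // marker bits of a continuation byte
	t2    = 0b11000000 // marker bits of the lead byte of a 2-byte sequence
	maskx = 0b00111111 // payload bits of a continuation byte
)

// encode2 is a hypothetical helper that encodes a rune in the range
// U+0080..U+07FF as two UTF-8 bytes. It does not validate its input.
func encode2(r rune) [2]byte {
	return [2]byte{
		t2 | byte(r>>6),     // top 5 payload bits
		tx | byte(r)&maskx,  // low 6 payload bits
	}
}

func main() {
	b := encode2('é') // U+00E9
	fmt.Printf("% x %q\n", b[:], string(b[:]))
}
```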
2019-04-24 unicode/utf8: make acceptRanges bigger (Josh Bleecher Snyder)
    This avoids bounds checks in the calling code. The nominal increased size of the array in the binary is compensated for by the decreased size of the functions that call it.

    The benchmark changes are a bit scattered, but overall positive.

    name                                 old time/op  new time/op  delta
    RuneCountTenASCIIChars-8             8.86ns ± 4%  7.93ns ± 5%  -10.45%  (p=0.000 n=45+49)
    RuneCountTenJapaneseChars-8          38.2ns ± 2%  37.2ns ± 1%  -2.63%   (p=0.000 n=44+41)
    RuneCountInStringTenASCIIChars-8     7.82ns ± 2%  8.70ns ± 2%  +11.19%  (p=0.000 n=43+43)
    RuneCountInStringTenJapaneseChars-8  39.3ns ± 9%  40.0ns ± 5%  +1.59%   (p=0.043 n=50+50)
    ValidTenASCIIChars-8                 8.68ns ± 5%  8.74ns ± 5%  ~        (p=0.070 n=50+48)
    ValidTenJapaneseChars-8              34.1ns ± 5%  36.8ns ± 4%  +8.09%   (p=0.000 n=45+50)
    ValidStringTenASCIIChars-8           9.76ns ± 7%  8.33ns ± 3%  -14.59%  (p=0.000 n=48+47)
    ValidStringTenJapaneseChars-8        37.7ns ± 8%  36.5ns ± 5%  -3.12%   (p=0.011 n=50+47)
    EncodeASCIIRune-8                    2.60ns ± 1%  2.59ns ± 2%  -0.24%   (p=0.018 n=43+36)
    EncodeJapaneseRune-8                 3.75ns ± 2%  4.56ns ± 6%  +21.71%  (p=0.000 n=41+50)
    DecodeASCIIRune-8                    2.59ns ± 2%  2.59ns ± 2%  ~        (p=0.350 n=44+41)
    DecodeJapaneseRune-8                 4.29ns ± 2%  4.31ns ± 2%  +0.61%   (p=0.001 n=48+39)
    FullASCIIRune-8                      0.87ns ± 6%  0.29ns ± 5%  -67.31%  (p=0.000 n=49+43)
    FullJapaneseRune-8                   0.65ns ± 6%  0.65ns ± 4%  ~        (p=0.375 n=50+49)
    [Geo mean]                           7.02ns       6.51ns       -7.19%

    Change-Id: I8d5d69c8d33ce2bff94785fba39a2203f9315cb0
    Reviewed-on: https://go-review.googlesource.com/c/go/+/173537
    Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
    TryBot-Result: Gobot Gobot <gobot@golang.org>
    Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
2018-05-06 cmd/compile: optimize len([]rune(string)) (Martin Möhrmann)
    Adds a new runtime function to count runes in a string. Modifies the compiler to detect the pattern len([]rune(string)) and replaces it with the new rune counting runtime function.

    RuneCount/lenruneslice/ASCII        27.8ns ± 2%  14.5ns ± 3%  -47.70%  (p=0.000 n=10+10)
    RuneCount/lenruneslice/Japanese     126ns ± 2%   60ns ± 2%    -52.03%  (p=0.000 n=10+10)
    RuneCount/lenruneslice/MixedLength  104ns ± 2%   50ns ± 1%    -51.71%  (p=0.000 n=10+9)

    Fixes #24923

    Change-Id: Ie9c7e7391a4e2cca675c5cdcc1e5ce7d523948b9
    Reviewed-on: https://go-review.googlesource.com/108985
    Run-TryBot: Martin Möhrmann <moehrmann@google.com>
    TryBot-Result: Gobot Gobot <gobot@golang.org>
    Reviewed-by: Josh Bleecher Snyder <josharian@gmail.com>
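The pattern targeted by this compiler change is shown below next to utf8.RuneCountInString, which gives the same count. The helper names are hypothetical; whether the conversion is actually optimized depends on the compiler version in use.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// countViaConversion uses the len([]rune(s)) pattern that the commit
// says the compiler recognizes.
func countViaConversion(s string) int {
	return len([]rune(s))
}

// countViaPackage uses the explicit library call.
func countViaPackage(s string) int {
	return utf8.RuneCountInString(s)
}

func main() {
	for _, s := range []string{"hello", "日本語", "a\xffb"} {
		fmt.Printf("%q: %d %d\n", s, countViaConversion(s), countViaPackage(s))
	}
}
```

Both forms count each invalid byte as one rune, so they agree on malformed input as well.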
2017-09-12 unicode/utf8: make FullRune inlinable (Ilya Tocar)
    This keeps the same readability and allows FullRune to be inlined, for a massive performance gain:

    FullASCIIRune-6     4.36ns ± 0%  1.25ns ± 0%  -71.33%  (p=0.000 n=8+10)
    FullJapaneseRune-6  4.70ns ± 0%  1.42ns ± 1%  -69.68%  (p=0.000 n=9+10)

    Change-Id: I95edd6292417a28aac244e40afb713596a087d93
    Reviewed-on: https://go-review.googlesource.com/63332
    Run-TryBot: Ilya Tocar <ilya.tocar@intel.com>
    TryBot-Result: Gobot Gobot <gobot@golang.org>
    Reviewed-by: Daniel Martí <mvdan@mvdan.cc>
2016-10-26 unicode/utf8: optimize ValidRune (Joe Tsai)
    Re-writing the switch statement as a single boolean expression reduces the number of branches that the compiler generates. It is also arguably easier to read as a pair of numeric ranges that valid runes can exist in.

    No test changes since the existing test does a good job of testing all of the boundaries.

    This change was to gain back some performance after a correctness fix done in http://golang.org/cl/32123.

    The correctness fix (CL/32123) slowed down the benchmarks slightly:
    benchmark                 old ns/op  new ns/op  delta
    BenchmarkIndexRune/10-4   19.3       21.6       +11.92%
    BenchmarkIndexRune/32-4   33.6       35.2       +4.76%

    Since the fix relies on utf8.ValidRune, this CL improves benchmarks:
    benchmark                 old ns/op  new ns/op  delta
    BenchmarkIndexRune/10-4   21.6       20.0       -7.41%
    BenchmarkIndexRune/32-4   35.2       33.5       -4.83%

    Change-Id: Ib1ca10a2e29c90e879a8ef9b7221c33e85d015d8
    Reviewed-on: https://go-review.googlesource.com/32122
    Run-TryBot: Joe Tsai <thebrokentoaster@gmail.com>
    TryBot-Result: Gobot Gobot <gobot@golang.org>
    Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
2016-10-17 runtime: speed up non-ASCII rune decoding (Martin Möhrmann)
    Copies utf8 constants and EncodeRune implementation from unicode/utf8.

    Adds a new decoderune implementation that is used by the compiler in code generated for ranging over strings. It does not handle ASCII runes since these are handled directly before calls to decoderune.

    The DecodeRuneInString implementation from unicode/utf8 is not used since it uses a lookup table that would increase the use of cpu caches.

    Adds more tests that check decoding of valid and invalid utf8 sequences.

    name                              old time/op  new time/op  delta
    RuneIterate/range2/ASCII-4        7.45ns ± 2%  7.45ns ± 1%  ~        (p=0.634 n=16+16)
    RuneIterate/range2/Japanese-4     53.5ns ± 1%  49.2ns ± 2%  -8.03%   (p=0.000 n=20+20)
    RuneIterate/range2/MixedLength-4  46.3ns ± 1%  41.0ns ± 2%  -11.57%  (p=0.000 n=20+20)

    new: "".decoderune t=1 size=423 args=0x28 locals=0x0
    old: "".charntorune t=1 size=666 args=0x28 locals=0x0

    Change-Id: I1df1fdb385bb9ea5e5e71b8818ea2bf5ce62de52
    Reviewed-on: https://go-review.googlesource.com/28490
    Run-TryBot: Martin Möhrmann <martisch@uos.de>
    TryBot-Result: Gobot Gobot <gobot@golang.org>
    Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
2016-09-03 unicode/utf8: reduce bounds checks in EncodeRune (Martin Möhrmann)
    Provide bounds elim hints in EncodeRune.

    name                  old time/op  new time/op  delta
    EncodeASCIIRune-4     2.69ns ± 2%  2.69ns ± 2%  ~       (p=0.193 n=47+46)
    EncodeJapaneseRune-4  5.97ns ± 2%  5.38ns ± 2%  -9.93%  (p=0.000 n=49+50)

    Change-Id: I1a6dcffff3bdd64ab93c2130021e3b00981de4c8
    Reviewed-on: https://go-review.googlesource.com/28492
    Reviewed-by: Josh Bleecher Snyder <josharian@gmail.com>
    Reviewed-by: Joe Tsai <thebrokentoaster@gmail.com>
    Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
    TryBot-Result: Gobot Gobot <gobot@golang.org>
2016-03-02 all: single space after period. (Brad Fitzpatrick)
    The tree's pretty inconsistent about single space vs double space after a period in documentation. Make it consistently a single space, per earlier decisions. This means contributors won't be confused by misleading precedent.

    This CL doesn't use go/doc to parse. It only addresses // comments. It was generated with:

    $ perl -i -npe 's,^(\s*// .+[a-z]\.) +([A-Z]),$1 $2,' $(git grep -l -E '^\s*//(.+\.) +([A-Z])')
    $ go test go/doc -update

    Change-Id: Iccdb99c37c797ef1f804a94b22ba5ee4b500c4f7
    Reviewed-on: https://go-review.googlesource.com/20022
    Reviewed-by: Rob Pike <r@golang.org>
    Reviewed-by: Dave Day <djd@golang.org>
    Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
    TryBot-Result: Gobot Gobot <gobot@golang.org>
2015-12-01 unicode/utf8: add test for FullRune (Marcel van Lohuizen)
    Check that it now properly handles \xC0 and \xC1.

    Fixes #11733.

    Change-Id: I66cfe0d43f9d123d4c4509a3fa18b9b6380dfc39
    Reviewed-on: https://go-review.googlesource.com/17225
    Reviewed-by: Russ Cox <rsc@golang.org>
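The behaviour under test can be checked with a few lines like these, a sketch rather than the test added by the commit. The bytes \xC0 and \xC1 can never start a valid UTF-8 sequence, and the package documentation says an invalid encoding counts as a full rune because it decodes as a width-1 error rune. The helper isFull is hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isFull is a hypothetical wrapper around utf8.FullRune for string input.
func isFull(s string) bool {
	return utf8.FullRune([]byte(s))
}

func main() {
	inputs := []string{
		"\xc0",     // invalid lead byte
		"\xc1",     // invalid lead byte
		"\xe6\x97", // truncated 3-byte sequence
		"日",        // complete 3-byte sequence
	}
	for _, s := range inputs {
		fmt.Printf("%q full=%v\n", s, isFull(s))
	}
}
```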
2015-11-24 unicode/utf8: don't imply that the empty string is incorrect UTF-8 (Aaron Jacobs)
    Change-Id: Idd9523949ee4f2f304b12be39f8940ba34a420be
    Reviewed-on: https://go-review.googlesource.com/16361
    Reviewed-by: Russ Cox <rsc@golang.org>
2015-11-16 unicode/utf8: table-based algorithm for decoding (Marcel van Lohuizen)
    This simplifies covering all cases, reducing the number of branches and making unrolling for simpler functions manageable. This significantly improves performance of non-ASCII input.

    This change will also allow addressing Issue #11733 in an efficient manner.

    RuneCountTenASCIIChars-8             13.7ns ± 4%  13.5ns ± 2%  ~        (p=0.116 n=7+8)
    RuneCountTenJapaneseChars-8          153ns ± 3%   74ns ± 2%    -51.42%  (p=0.000 n=8+8)
    RuneCountInStringTenASCIIChars-8     13.5ns ± 2%  12.5ns ± 3%  -7.13%   (p=0.000 n=8+7)
    RuneCountInStringTenJapaneseChars-8  145ns ± 2%   68ns ± 2%    -53.21%  (p=0.000 n=8+8)
    ValidTenASCIIChars-8                 14.1ns ± 3%  12.5ns ± 5%  -11.38%  (p=0.000 n=8+8)
    ValidTenJapaneseChars-8              147ns ± 3%   71ns ± 4%    -51.72%  (p=0.000 n=8+8)
    ValidStringTenASCIIChars-8           12.5ns ± 3%  12.3ns ± 3%  ~        (p=0.095 n=8+8)
    ValidStringTenJapaneseChars-8        146ns ± 4%   70ns ± 2%    -51.62%  (p=0.000 n=8+7)
    DecodeASCIIRune-8                    5.91ns ± 2%  4.83ns ± 3%  -18.28%  (p=0.001 n=7+7)
    DecodeJapaneseRune-8                 12.2ns ± 7%  8.5ns ± 3%   -29.79%  (p=0.000 n=8+7)
    FullASCIIRune-8                      5.95ns ± 3%  4.27ns ± 1%  -28.23%  (p=0.000 n=8+7)
    FullJapaneseRune-8                   12.0ns ± 6%  4.3ns ± 3%   -64.39%  (p=0.000 n=8+8)

    Change-Id: Iea1d6b0180cbbee1739659a0a38038126beecaca
    Reviewed-on: https://go-review.googlesource.com/16940
    Reviewed-by: Russ Cox <rsc@golang.org>
2015-11-16 unicode/utf8: removed uses of ranging over string (Marcel van Lohuizen)
    Ranging over string is much slower than using DecodeRuneInString. See golang.org/issue/13162.

    Replacing ranging over a string with the implementation of the Bytes counterpart results in the following performance improvements:

    RuneCountInStringTenASCIIChars-8     43.0ns ± 1%  16.4ns ± 2%  -61.80%  (p=0.000 n=7+8)
    RuneCountInStringTenJapaneseChars-8  161ns ± 2%   154ns ± 2%   -4.58%   (p=0.000 n=8+8)
    ValidStringTenASCIIChars-8           52.2ns ± 1%  13.2ns ± 1%  -74.62%  (p=0.001 n=7+7)
    ValidStringTenJapaneseChars-8        173ns ± 2%   153ns ± 2%   -11.78%  (p=0.000 n=7+8)

    Update golang/go#13162

    Change-Id: Ifc40a6a94bb3317f1f2d929d310bd2694645e9f6
    Reviewed-on: https://go-review.googlesource.com/16695
    Reviewed-by: Russ Cox <rsc@golang.org>
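The two iteration styles being compared can be sketched as follows. Both helper names are hypothetical, and the speed difference reported above reflects the toolchain of that time (the 2016-10-17 runtime entry in this log later sped up ranging over strings), so it should be measured rather than assumed.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// countByRange counts runes with the built-in range loop over a string.
func countByRange(s string) int {
	n := 0
	for range s {
		n++
	}
	return n
}

// countByDecode counts runes with explicit DecodeRuneInString calls,
// advancing by the decoded width each time.
func countByDecode(s string) int {
	n := 0
	for len(s) > 0 {
		_, size := utf8.DecodeRuneInString(s)
		s = s[size:]
		n++
	}
	return n
}

func main() {
	s := "日本語のテキスト"
	fmt.Println(countByRange(s), countByDecode(s))
}
```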
2015-10-26 unicode/utf8: added benchmarks (Marcel van Lohuizen)
    Cover some functions that weren't benched before and add InString variants if the underlying implementation is different.

    Note: compare (Valid|RuneCount)InString* to their (Valid|RuneCount)* counterparts. It shows, somewhat unexpectedly, that ranging over a string is *much* slower than using calls to DecodeRune.

    Results:
    In order to avoid a discrepancy in measuring the performance of core we could leave the names of the string-based measurements unchanged and suffix the added alternatives with Bytes.

    Compared to old:
    BenchmarkRuneCountTenASCIIChars-8     44.3  12.4  -72.01%
    BenchmarkRuneCountTenJapaneseChars-8  167   67.1  -59.82%
    BenchmarkEncodeASCIIRune-8            3.37  3.44  +2.08%
    BenchmarkEncodeJapaneseRune-8         7.19  7.24  +0.70%
    BenchmarkDecodeASCIIRune-8            5.41  5.53  +2.22%
    BenchmarkDecodeJapaneseRune-8         8.17  8.41  +2.94%

    All benchmarks:
    BenchmarkRuneCountTenASCIIChars-8             100000000  12.4 ns/op
    BenchmarkRuneCountTenJapaneseChars-8          20000000   67.1 ns/op
    BenchmarkRuneCountInStringTenASCIIChars-8     30000000   44.5 ns/op
    BenchmarkRuneCountInStringTenJapaneseChars-8  10000000   165 ns/op
    BenchmarkValidTenASCIIChars-8                 100000000  12.5 ns/op
    BenchmarkValidTenJapaneseChars-8              20000000   71.1 ns/op
    BenchmarkValidStringTenASCIIChars-8           30000000   50.0 ns/op
    BenchmarkValidStringTenJapaneseChars-8        10000000   161 ns/op
    BenchmarkEncodeASCIIRune-8                    500000000  3.44 ns/op
    BenchmarkEncodeJapaneseRune-8                 200000000  7.24 ns/op
    BenchmarkDecodeASCIIRune-8                    300000000  5.53 ns/op
    BenchmarkDecodeJapaneseRune-8                 200000000  8.41 ns/op
    BenchmarkFullASCIIRune-8                      500000000  3.91 ns/op
    BenchmarkFullJapaneseRune-8                   300000000  4.22 ns/op

    Change-Id: I674d2ee4917b975a37717bbfa1082cc84dcd275e
    Reviewed-on: https://go-review.googlesource.com/14431
    Reviewed-by: Russ Cox <rsc@golang.org>
2014-10-16 unicode/utf8: fix docs for DecodeRune(empty) and friends. (Nigel Tao)
    LGTM=r
    R=r
    CC=golang-codereviews
    https://golang.org/cl/157080043
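For context on what those docs describe, the package documentation states that DecodeRune returns (RuneError, 0) for empty input and (RuneError, 1) for an invalid encoding. The sketch below distinguishes the cases; the classify helper is hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// classify is a hypothetical helper that interprets the (rune, size)
// pair returned by utf8.DecodeRune.
func classify(p []byte) string {
	r, size := utf8.DecodeRune(p)
	switch {
	case r == utf8.RuneError && size == 0:
		return "empty"
	case r == utf8.RuneError && size == 1:
		return "invalid"
	default:
		// Includes a correctly encoded U+FFFD, which decodes with size 3.
		return "valid"
	}
}

func main() {
	for _, p := range [][]byte{nil, {0xff}, []byte("a"), []byte("\uFFFD")} {
		fmt.Printf("%q -> %s\n", p, classify(p))
	}
}
```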
2014-09-08 build: move package sources from src/pkg to src (Russ Cox)
    Preparation was in CL 134570043.

    This CL contains only the effect of 'hg mv src/pkg/* src'.

    For more about the move, see golang.org/s/go14nopkg.