git/utf8.c, branch main

utf8.c: prepare workaround for iconv under macOS 14/15

2026-01-12T20:00:04Z

MacOS14 (Sonoma) has started to ship an iconv library with bugs. The same bugs exists even in MacOS 15 (Sequoia) A bug report running the Git test suite says: three tests of t3900 fail on macOS 26.1 for me: not ok 17 - ISO-2022-JP should be shown in UTF-8 now not ok 25 - ISO-2022-JP should be shown in UTF-8 now not ok 38 - commit --fixup into ISO-2022-JP from UTF-8 Here's the verbose output of the first one: ================= expecting success of 3900.17 'ISO-2022-JP should be shown in UTF-8 now': compare_with ISO-2022-JP "$TEST_DIRECTORY"/t3900/2-UTF-8.txt --- /Users/x/src/git/t/t3900/2-UTF-8.txt 2024-10-01 19:43:24.605230684 +0000 +++ current 2025-12-08 21:52:45.786161909 +0000 @@ -1,5 +1,5 @@ はれひほふしているのが、いるので。 -濱浜ほれぷりぽれまびぐりろへ。 +濱浜ほれぷりぽれまび$0$j$m$X!# not ok 17 - ISO-2022-JP should be shown in UTF-8 now 1..17 ================= compare_with runs git show to display a commit message, which in this case here was encoded using ISO-2022-JP and is supposed to be reencoded to UTF-8, but git show only does that half-way -- the "$0$j$m$X!#" part is from the original ISO-2022-JP representation. That botched conversion is done by utf8.c::reencode_string_iconv(). It calls iconv(3) to do the actual work, initially with an output buffer of the same size as the input. If the output needs more space the function enlarges the buffer and calls iconv(3) again. iconv(3) won't tell us how much space it needs, but it will report what part it already managed to convert, so we can increase the buffer and continue from there. ISO-2022-JP has escape codes for switching between character sets, so it's a stateful encoding. I guess the iconv(3) on my machine forgets the state at the end of part one and then messes up part two. [end of citation] Working around the buggy iconv shipped with the OS can be done in two ways: a) Link Git against a different version of iconv b) Improve the handling when iconv needs a larger output buffer a) is already done by default when either Fink [1] or MacPorts [2] or Homebrew [3] is installed. b) is implemented here, in case that no fixed iconv is available: When the output buffer is too short, increase it (as before) and start from scratch (this is new). This workound needs to be enabled with '#define ICONV_RESTART_RESET' and a makefile knob will be added in the next commit Suggested-by: René Scharfe Signed-off-by: Torsten Bögershausen [1] https://www.finkproject.org/ [2] https://www.macports.org/ [3] https://brew.sh/ Signed-off-by: Torsten Bögershausen Signed-off-by: Junio C Hamano

global: mark code units that generate warnings with `-Wsign-compare`

2024-12-06T11:20:02Z

Mark code units that generate warnings with `-Wsign-compare`. This allows for a structured approach to get rid of all such warnings over time in a way that can be easily measured. Signed-off-by: Patrick Steinhardt Signed-off-by: Junio C Hamano

doc: switch links to https

2023-11-26T01:07:05Z

These sites offer https versions of their content. Using the https versions provides some protection for users. Signed-off-by: Josh Soref Signed-off-by: Junio C Hamano

Sync with Git 2.31.6

2022-12-13T12:09:40Z

utf8: refactor `strbuf_utf8_replace` to not rely on preallocated buffer

2022-12-09T05:26:21Z

In `strbuf_utf8_replace`, we preallocate the destination buffer and then use `memcpy` to copy bytes into it at computed offsets. This feels rather fragile and is hard to understand at times. Refactor the code to instead use `strbuf_add` and `strbuf_addstr` so that we can be sure that there is no possibility to perform an out-of-bounds write. Signed-off-by: Patrick Steinhardt Signed-off-by: Junio C Hamano

utf8: fix checking for glyph width in `strbuf_utf8_replace()`

2022-12-09T05:26:21Z

In `strbuf_utf8_replace()`, we call `utf8_width()` to compute the width of the current glyph. If the glyph is a control character though it can be that `utf8_width()` returns `-1`, but because we assign this value to a `size_t` the conversion will cause us to underflow. This bug can easily be triggered with the following command: $ git log --pretty='format:xxx%<|(1,trunc)%x10' >From all I can see though this seems to be a benign underflow that has no security-related consequences. Fix the bug by using an `int` instead. When we see a control character, we now copy it into the target buffer but don't advance the current width of the string. Signed-off-by: Patrick Steinhardt Signed-off-by: Junio C Hamano

utf8: fix overflow when returning string width

2022-12-09T05:26:21Z

The return type of both `utf8_strwidth()` and `utf8_strnwidth()` is `int`, but we operate on string lengths which are typically of type `size_t`. This means that when the string is longer than `INT_MAX`, we will overflow and thus return a negative result. This can lead to an out-of-bounds write with `--pretty=format:%<1)%B` and a commit message that is 2^31+1 bytes long: ================================================================= ==26009==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x603000001168 at pc 0x7f95c4e5f427 bp 0x7ffd8541c900 sp 0x7ffd8541c0a8 WRITE of size 2147483649 at 0x603000001168 thread T0 #0 0x7f95c4e5f426 in __interceptor_memcpy /usr/src/debug/gcc/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:827 #1 0x5612bbb1068c in format_and_pad_commit pretty.c:1763 #2 0x5612bbb1087a in format_commit_item pretty.c:1801 #3 0x5612bbc33bab in strbuf_expand strbuf.c:429 #4 0x5612bbb110e7 in repo_format_commit_message pretty.c:1869 #5 0x5612bbb12d96 in pretty_print_commit pretty.c:2161 #6 0x5612bba0a4d5 in show_log log-tree.c:781 #7 0x5612bba0d6c7 in log_tree_commit log-tree.c:1117 #8 0x5612bb691ed5 in cmd_log_walk_no_free builtin/log.c:508 #9 0x5612bb69235b in cmd_log_walk builtin/log.c:549 #10 0x5612bb6951a2 in cmd_log builtin/log.c:883 #11 0x5612bb56c993 in run_builtin git.c:466 #12 0x5612bb56d397 in handle_builtin git.c:721 #13 0x5612bb56db07 in run_argv git.c:788 #14 0x5612bb56e8a7 in cmd_main git.c:923 #15 0x5612bb803682 in main common-main.c:57 #16 0x7f95c4c3c28f (/usr/lib/libc.so.6+0x2328f) #17 0x7f95c4c3c349 in __libc_start_main (/usr/lib/libc.so.6+0x23349) #18 0x5612bb5680e4 in _start ../sysdeps/x86_64/start.S:115 0x603000001168 is located 0 bytes to the right of 24-byte region [0x603000001150,0x603000001168) allocated by thread T0 here: #0 0x7f95c4ebe7ea in __interceptor_realloc /usr/src/debug/gcc/libsanitizer/asan/asan_malloc_linux.cpp:85 #1 0x5612bbcdd556 in xrealloc wrapper.c:136 #2 0x5612bbc310a3 in strbuf_grow strbuf.c:99 #3 0x5612bbc32acd in strbuf_add strbuf.c:298 #4 0x5612bbc33aec in strbuf_expand strbuf.c:418 #5 0x5612bbb110e7 in repo_format_commit_message pretty.c:1869 #6 0x5612bbb12d96 in pretty_print_commit pretty.c:2161 #7 0x5612bba0a4d5 in show_log log-tree.c:781 #8 0x5612bba0d6c7 in log_tree_commit log-tree.c:1117 #9 0x5612bb691ed5 in cmd_log_walk_no_free builtin/log.c:508 #10 0x5612bb69235b in cmd_log_walk builtin/log.c:549 #11 0x5612bb6951a2 in cmd_log builtin/log.c:883 #12 0x5612bb56c993 in run_builtin git.c:466 #13 0x5612bb56d397 in handle_builtin git.c:721 #14 0x5612bb56db07 in run_argv git.c:788 #15 0x5612bb56e8a7 in cmd_main git.c:923 #16 0x5612bb803682 in main common-main.c:57 #17 0x7f95c4c3c28f (/usr/lib/libc.so.6+0x2328f) SUMMARY: AddressSanitizer: heap-buffer-overflow /usr/src/debug/gcc/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:827 in __interceptor_memcpy Shadow bytes around the buggy address: 0x0c067fff81d0: fd fd fd fa fa fa fd fd fd fa fa fa fd fd fd fa 0x0c067fff81e0: fa fa fd fd fd fd fa fa fd fd fd fd fa fa fd fd 0x0c067fff81f0: fd fa fa fa fd fd fd fa fa fa fd fd fd fa fa fa 0x0c067fff8200: fd fd fd fa fa fa fd fd fd fd fa fa 00 00 00 fa 0x0c067fff8210: fa fa fd fd fd fa fa fa fd fd fd fa fa fa fd fd =>0x0c067fff8220: fd fa fa fa fd fd fd fa fa fa 00 00 00[fa]fa fa 0x0c067fff8230: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x0c067fff8240: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x0c067fff8250: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x0c067fff8260: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa 0x0c067fff8270: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa Shadow byte legend (one shadow byte represents 8 application bytes): Addressable: 00 Partially addressable: 01 02 03 04 05 06 07 Heap left redzone: fa Freed heap region: fd Stack left redzone: f1 Stack mid redzone: f2 Stack right redzone: f3 Stack after return: f5 Stack use after scope: f8 Global redzone: f9 Global init order: f6 Poisoned by user: f7 Container overflow: fc Array cookie: ac Intra object redzone: bb ASan internal: fe Left alloca redzone: ca Right alloca redzone: cb ==26009==ABORTING Now the proper fix for this would be to convert both functions to return an `size_t` instead of an `int`. But given that this commit may be part of a security release, let's instead do the minimal viable fix and die in case we see an overflow. Add a test that would have previously caused us to crash. Signed-off-by: Patrick Steinhardt Signed-off-by: Junio C Hamano

utf8: fix returning negative string width

2022-12-09T05:26:21Z

The `utf8_strnwidth()` function calls `utf8_width()` in a loop and adds its returned width to the end result. `utf8_width()` can return `-1` though in case it reads a control character, which means that the computed string width is going to be wrong. In the worst case where there are more control characters than non-control characters, we may even return a negative string width. Fix this bug by treating control characters as having zero width. Signed-off-by: Patrick Steinhardt Signed-off-by: Junio C Hamano

utf8: fix truncated string lengths in `utf8_strnwidth()`

2022-12-09T05:26:21Z

The `utf8_strnwidth()` function accepts an optional string length as input parameter. This parameter can either be set to `-1`, in which case we call `strlen()` on the input. Or it can be set to a positive integer that indicates a precomputed length, which callers typically compute by calling `strlen()` at some point themselves. The input parameter is an `int` though, whereas `strlen()` returns a `size_t`. This can lead to implementation-defined behaviour though when the `size_t` cannot be represented by the `int`. In the general case though this leads to wrap-around and thus to negative string sizes, which is sure enough to not lead to well-defined behaviour. Fix this by accepting a `size_t` instead of an `int` as string length. While this takes away the ability of callers to simply pass in `-1` as string length, it really is trivial enough to convert them to instead pass in `strlen()` instead. Signed-off-by: Patrick Steinhardt Signed-off-by: Junio C Hamano

t0060: test ntfs/hfs-obscured dotfiles

2021-05-04T02:52:02Z

We have tests that cover various filesystem-specific spellings of ".gitmodules", because we need to reliably identify that path for some security checks. These are from dc2d9ba318 (is_{hfs,ntfs}_dotgitmodules: add tests, 2018-05-12), with the actual code coming from e7cb0b4455 (is_ntfs_dotgit: match other .git files, 2018-05-11) and 0fc333ba20 (is_hfs_dotgit: match other .git files, 2018-05-02). Those latter two commits also added similar matching functions for .gitattributes and .gitignore. These ended up not being used in the final series, and are currently dead code. But in preparation for them being used in some fsck checks, let's make sure they actually work by throwing a few basic tests at them. Likewise, let's cover .mailmap (which does need matching code added). I didn't bother with the whole battery of tests that we cover for .gitmodules. These functions are all based on the same generic matcher, so it's sufficient to test most of the corner cases just once. Note that the ntfs magic prefix names in the tests come from the algorithm described in e7cb0b4455 (and are different for each file). Signed-off-by: Jeff King Signed-off-by: Junio C Hamano