git/tree-diff.c, branch v2.53.0

environment: move access to core.maxTreeDepth into repo settings

2026-01-10T02:36:16Z

The config setting core.maxTreeDepth is stored in a global variable and populated by the function git_default_core_config. This won't work if we need to access multiple repositories with different values of that setting in the same process. Store the setting in struct repo_settings instead and track it separately for each repository. Signed-off-by: René Scharfe Signed-off-by: Junio C Hamano

diff: teach tree-diff a max-depth parameter

2025-08-07T22:29:35Z

When you are doing a tree-diff, there are basically two options: do not recurse into subtrees at all, or recurse indefinitely. While most callers would want to always recurse and see full pathnames, some may want the efficiency of looking only at a particular level of the tree. This is currently easy to do for the top-level (just turn off recursion), but you cannot say "show me what changed in subdir/, but do not recurse". This patch adds a max-depth parameter which is measured from the closest pathspec match, so that you can do: git log --raw --max-depth=1 -- a/b/c and see the raw output for a/b/c/, but not those of a/b/c/d/ (instead of the raw output you would see for a/b/c/d). Co-authored-by: Toon Claes Signed-off-by: Toon Claes Signed-off-by: Junio C Hamano

hash: stop depending on `the_repository` in `null_oid()`

2025-03-10T20:16:20Z

The `null_oid()` function returns the object ID that only consists of zeroes. Naturally, this ID also depends on the hash algorithm used, as the number of zeroes is different between SHA1 and SHA256. Consequently, the function returns the hash-algorithm-specific null object ID. This is currently done by depending on `the_hash_algo`, which implicitly makes us depend on `the_repository`. Refactor the function to instead pass in the hash algorithm for which we want to retrieve the null object ID. Adapt callsites accordingly by passing in `the_repository`, thus bubbling up the dependency on that global variable by one layer. There are a couple of trivial exceptions for subsystems that already got rid of `the_repository`. These subsystems instead use the repository that is available via the calling context: - "builtin/grep.c" - "grep.c" - "refs/debug.c" There are also two non-trivial exceptions: - "diff-no-index.c": Here we know that we may not have a repository initialized at all, so we cannot rely on `the_repository`. Instead, we adapt `diff_no_index()` to get a `struct git_hash_algo` as parameter. The only caller is located in "builtin/diff.c", where we know to call `repo_set_hash_algo()` in case we're running outside of a Git repository. Consequently, it is fine to continue passing `the_repository->hash_algo` even in this case. - "builtin/ls-files.c": There is an in-flight patch series that drops `USE_THE_REPOSITORY_VARIABLE` in this file, which causes a semantic conflict because we use `null_oid()` in `show_submodule()`. The value is passed to `repo_submodule_init()`, which may use the object ID to resolve a tree-ish in the superproject from which we want to read the submodule config. As such, the object ID should refer to an object in the superproject, and consequently we need to use its hash algorithm. This means that we could in theory just not bother about this edge case at all and just use `the_repository` in "diff-no-index.c". But doing so would feel misdesigned. Remove the `USE_THE_REPOSITORY_VARIABLE` preprocessor define in "hash.c". Signed-off-by: Patrick Steinhardt Signed-off-by: Junio C Hamano

tree-diff: make list tail-passing more explicit

2025-01-09T20:24:27Z

The ll_diff_tree_paths() function and its helpers all take a pointer to a list tail, possibly add to it, and then return the new tail. This works but has two downsides: - The top-level caller (diff_tree_paths() in this case) has to make a fake combine_diff_path struct to act as the list head. This is especially weird here, as it's a flexible-sized struct which will have an empty FLEX_ARRAY field. That used to be a portability problem, though these days it is legal because our FLEX_ARRAY macro over-allocates if necessary. It's still kind of ugly, though. - Besides the name "tail", it's not immediately obvious that the entry we pass around will not be examined by each function. Using a pointer-to-pointer or similar makes it more obvious we only care about the pointer itself, not its contents. We can solve both by passing around a pointer to the tail instead. That gets rid of the return value entirely, though note that because of the recursion we actually need a three-star pointer for this to work. The result is fairly readable, as we only need to dereference the tail in one spot. If we wanted to make it simpler we could wrap the tail in a struct, which we pass around. Another option is to convert combine_diff to use our generic list_head API. I tried that and found the result became much harder to read overall. It means that _all_ code that looks at combine_diff_path structs needs to be modified, since the "next" pointer is now inside a list_head which has to be dereferenced with list_entry(). And we lose some type safety, since we're just passing around a list_head struct everywhere, and everybody who looks at it has to specify the type to list_entry themselves. Signed-off-by: Jeff King Signed-off-by: Junio C Hamano

tree-diff: simplify emit_path() list management

2025-01-09T20:24:26Z

In emit_path() we may append a new combine_diff_path entry to our list, decide that we don't want it (because opt->pathchange() told us so) and then roll it back. Between the addition and the rollback, it doesn't matter if it's in the list or not (no functions can even tell, since it's a singly-linked list and we pass around just the tail entry). So it's much simpler to just wait until opt->pathchange() tells us whether to keep it, and either attach it (or free it) then. We do still have to allocate it up front since it's that struct itself which is passed to the pathchange callback. Signed-off-by: Jeff King Signed-off-by: Junio C Hamano

tree-diff: use the name "tail" to refer to list tail

2025-01-09T20:24:26Z

The ll_diff_tree_paths() function and its helpers all append to a running list by taking in a pointer to the old tail and returning the new tail. But they just call this argument "p", which is not very descriptive. It gets particularly confusing in emit_path(), where we actually add to the list, because "p" does double-duty: it is the tail of the list, but it is also the entry which we add. Except that in some cases we _don't_ add a new entry (or we might even add it and roll it back) if the path isn't interesting. At first glance, this makes it look like a bug that we pass "p" on to ll_diff_tree_paths() to recurse; sometimes it is getting the new entry we made and sometimes not! But it's not a bug, because ll_diff_tree_paths() does not care about the entry itself at all. It is only using its "next" pointer as the tail of the list. Let's swap out "p" for "tail" to make this obvious. And then in emit_path() we'll continue to use "p" for our newly allocated entry. Signed-off-by: Jeff King Signed-off-by: Junio C Hamano

tree-diff: drop list-tail argument to diff_tree_paths()

2025-01-09T20:24:26Z

The internals of the path diffing code, including ll_diff_tree_paths(), all take an extra combine_diff_path parameter which they use as the tail of a list of results, appending any new entries to it. The public-facing diff_tree_paths() takes the same argument, but it just makes the callers more awkward. They always start with a clean list, and have to set up a fake head struct to pass in. Let's keep the public API clean by always returning a new list. That keeps the fake struct as an implementation detail of tree-diff.c. Signed-off-by: Jeff King Signed-off-by: Junio C Hamano

tree-diff: inline path_appendnew()

2025-01-09T20:24:25Z

Our path_appendnew() has been simplified to the point that it is mostly just implementing combine_diff_path_new(), plus setting the "next" pointer. Since there's only one caller, let's replace it completely with a call to that helper function. Signed-off-by: Jeff King Signed-off-by: Junio C Hamano

tree-diff: pass whole path string to path_appendnew()

2025-01-09T20:24:25Z

When diffing trees, we'll have a strbuf "base" containing the slash-separted names of our parent trees, and a "path" string representing an entry name from the current tree. We pass these separately to path_appendnew(), which combines them to form a single path string in the combine_diff_path struct. Instead, let's append the path string to our base strbuf ourselves, pass in the result, and then roll it back with strbuf_setlen(). This lets us simplify path_appendnew() a bit, enabling further refactoring. And while it might seem like this causes extra wasted allocations, it does not in practice. We reuse the same strbuf for each tree entry, so we only have to allocate it to match the largest name. Plus, in a recursive diff we'll end up doing this same operation to extend the base for the next level of recursion. So we're really just incurring a small memcpy(). Signed-off-by: Jeff King Signed-off-by: Junio C Hamano

tree-diff: drop path_appendnew() alloc optimization

2025-01-09T20:24:25Z

When we're diffing trees, we create a list of combine_diff_path structs that represent changed paths. We allocate each struct and add it to the list with path_appendnew(), which we then feed to opt->pathchange(). That function tells us whether the path is of interest or not; if not, then we can throw away the struct we allocated. So there's an optimization to avoid extra allocations: instead of throwing away the new entry, we try to reuse it. If it was large enough to store the next path we care about, we can do so. And if not, we fall back to freeing and re-allocating a new struct. This comes from 72441af7c4 (tree-diff: rework diff_tree() to generate diffs for multiparent cases as well, 2014-04-07), where the goal was to have even the 2-parent diff code use the combine-diff infrastructure, but without taking a performance hit. The implementation causes some complexities in the interface (as we store the allocation length inside the "next" pointer), and prevents us from using the regular combine_diff_path_new() constructor. The complexity is mostly contained inside two functions, but it's worth re-evaluating how much it's helping. That commit claims it helps ~1% on generating two-parent diffs in linux.git. Here are the timings I get on the same command today ("old" is the current tip of master, and "new" has this patch applied): Benchmark 1: ./git.old log --raw --no-abbrev --no-renames v3.10..v3.11 Time (mean ± σ): 532.9 ms ± 5.8 ms [User: 472.7 ms, System: 59.6 ms] Range (min … max): 525.9 ms … 543.3 ms 10 runs Benchmark 2: ./git.new log --raw --no-abbrev --no-renames v3.10..v3.11 Time (mean ± σ): 538.3 ms ± 5.7 ms [User: 478.0 ms, System: 59.7 ms] Range (min … max): 528.5 ms … 545.3 ms 10 runs Summary ./git.old log --raw --no-abbrev --no-renames v3.10..v3.11 ran 1.01 ± 0.02 times faster than ./git.new log --raw --no-abbrev --no-renames v3.10..v3.11 So we do end up on average 1% faster, but with 2% of noise. I tried to focus more on diff performance by running the commit traversal separately, like: git rev-list v3.10..v3.11 >in and then timing just the diffs: Benchmark 1: ./git.old diff-tree --stdin -r