<feed xmlns='http://www.w3.org/2005/Atom'>
<title>git/tree-diff.c, branch v2.53.0</title>
<subtitle>Fork of git SCM with my patches.</subtitle>
<id>http://git.kilabit.info/git/atom?h=v2.53.0</id>
<link rel='self' href='http://git.kilabit.info/git/atom?h=v2.53.0'/>
<link rel='alternate' type='text/html' href='http://git.kilabit.info/git/'/>
<updated>2026-01-10T02:36:16Z</updated>
<entry>
<title>environment: move access to core.maxTreeDepth into repo settings</title>
<updated>2026-01-10T02:36:16Z</updated>
<author>
<name>René Scharfe</name>
<email>l.s.r@web.de</email>
</author>
<published>2026-01-09T21:30:12Z</published>
<link rel='alternate' type='text/html' href='http://git.kilabit.info/git/commit/?id=e691395365b871608551bfbe20982b53140a50f0'/>
<id>urn:sha1:e691395365b871608551bfbe20982b53140a50f0</id>
<content type='text'>
The config setting core.maxTreeDepth is stored in a global variable and
populated by the function git_default_core_config.  This won't work if
we need to access multiple repositories with different values of that
setting in the same process.  Store the setting in struct repo_settings
instead and track it separately for each repository.

Signed-off-by: René Scharfe &lt;l.s.r@web.de&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>diff: teach tree-diff a max-depth parameter</title>
<updated>2025-08-07T22:29:35Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2025-08-07T20:52:58Z</published>
<link rel='alternate' type='text/html' href='http://git.kilabit.info/git/commit/?id=a1dfa5448d583bbfd1ec45642a4495ad499970c9'/>
<id>urn:sha1:a1dfa5448d583bbfd1ec45642a4495ad499970c9</id>
<content type='text'>
When you are doing a tree-diff, there are basically two options: do not
recurse into subtrees at all, or recurse indefinitely. While most
callers would want to always recurse and see full pathnames, some may
want the efficiency of looking only at a particular level of the tree.
This is currently easy to do for the top-level (just turn off
recursion), but you cannot say "show me what changed in subdir/, but do
not recurse".

This patch adds a max-depth parameter which is measured from the closest
pathspec match, so that you can do:

  git log --raw --max-depth=1 -- a/b/c

and see the raw output for a/b/c/, but not those of a/b/c/d/
(instead of the raw output you would see for a/b/c/d).

Co-authored-by: Toon Claes &lt;toon@iotcl.com&gt;
Signed-off-by: Toon Claes &lt;toon@iotcl.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>hash: stop depending on `the_repository` in `null_oid()`</title>
<updated>2025-03-10T20:16:20Z</updated>
<author>
<name>Patrick Steinhardt</name>
<email>ps@pks.im</email>
</author>
<published>2025-03-10T07:13:31Z</published>
<link rel='alternate' type='text/html' href='http://git.kilabit.info/git/commit/?id=7d70b29c4f0b2fd3c6698956d9fb4026632d9c6e'/>
<id>urn:sha1:7d70b29c4f0b2fd3c6698956d9fb4026632d9c6e</id>
<content type='text'>
The `null_oid()` function returns the object ID that only consists of
zeroes. Naturally, this ID also depends on the hash algorithm used, as
the number of zeroes is different between SHA1 and SHA256. Consequently,
the function returns the hash-algorithm-specific null object ID.

This is currently done by depending on `the_hash_algo`, which implicitly
makes us depend on `the_repository`. Refactor the function to instead
pass in the hash algorithm for which we want to retrieve the null object
ID. Adapt callsites accordingly by passing in `the_repository`, thus
bubbling up the dependency on that global variable by one layer.

There are a couple of trivial exceptions for subsystems that already got
rid of `the_repository`. These subsystems instead use the repository
that is available via the calling context:

  - "builtin/grep.c"
  - "grep.c"
  - "refs/debug.c"

There are also two non-trivial exceptions:

  - "diff-no-index.c": Here we know that we may not have a repository
    initialized at all, so we cannot rely on `the_repository`. Instead,
    we adapt `diff_no_index()` to get a `struct git_hash_algo` as
    parameter. The only caller is located in "builtin/diff.c", where we
    know to call `repo_set_hash_algo()` in case we're running outside of
    a Git repository. Consequently, it is fine to continue passing
    `the_repository-&gt;hash_algo` even in this case.

  - "builtin/ls-files.c": There is an in-flight patch series that drops
    `USE_THE_REPOSITORY_VARIABLE` in this file, which causes a semantic
    conflict because we use `null_oid()` in `show_submodule()`. The
    value is passed to `repo_submodule_init()`, which may use the object
    ID to resolve a tree-ish in the superproject from which we want to
    read the submodule config. As such, the object ID should refer to an
    object in the superproject, and consequently we need to use its hash
    algorithm.

    This means that we could in theory just not bother about this edge
    case at all and just use `the_repository` in "diff-no-index.c". But
    doing so would feel misdesigned.

Remove the `USE_THE_REPOSITORY_VARIABLE` preprocessor define in
"hash.c".

Signed-off-by: Patrick Steinhardt &lt;ps@pks.im&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>tree-diff: make list tail-passing more explicit</title>
<updated>2025-01-09T20:24:27Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2025-01-09T08:57:00Z</published>
<link rel='alternate' type='text/html' href='http://git.kilabit.info/git/commit/?id=6979bf6f8f5e831ef38214edb158c7fb493540f7'/>
<id>urn:sha1:6979bf6f8f5e831ef38214edb158c7fb493540f7</id>
<content type='text'>
The ll_diff_tree_paths() function and its helpers all take a pointer to
a list tail, possibly add to it, and then return the new tail. This
works but has two downsides:

  - The top-level caller (diff_tree_paths() in this case) has to make a
    fake combine_diff_path struct to act as the list head. This is
    especially weird here, as it's a flexible-sized struct which will
    have an empty FLEX_ARRAY field. That used to be a portability
    problem, though these days it is legal because our FLEX_ARRAY macro
    over-allocates if necessary. It's still kind of ugly, though.

  - Besides the name "tail", it's not immediately obvious that the entry
    we pass around will not be examined by each function. Using a
    pointer-to-pointer or similar makes it more obvious we only care
    about the pointer itself, not its contents.

We can solve both by passing around a pointer to the tail instead. That
gets rid of the return value entirely, though note that because of the
recursion we actually need a three-star pointer for this to work.

The result is fairly readable, as we only need to dereference the tail
in one spot. If we wanted to make it simpler we could wrap the tail in a
struct, which we pass around.

Another option is to convert combine_diff to use our generic list_head
API. I tried that and found the result became much harder to read
overall. It means that _all_ code that looks at combine_diff_path
structs needs to be modified, since the "next" pointer is now inside a
list_head which has to be dereferenced with list_entry(). And we lose
some type safety, since we're just passing around a list_head struct
everywhere, and everybody who looks at it has to specify the type to
list_entry themselves.

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>tree-diff: simplify emit_path() list management</title>
<updated>2025-01-09T20:24:26Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2025-01-09T08:54:05Z</published>
<link rel='alternate' type='text/html' href='http://git.kilabit.info/git/commit/?id=6632bcba514f4a195447eb77264f75fed3a768e6'/>
<id>urn:sha1:6632bcba514f4a195447eb77264f75fed3a768e6</id>
<content type='text'>
In emit_path() we may append a new combine_diff_path entry to our list,
decide that we don't want it (because opt-&gt;pathchange() told us so) and
then roll it back.

Between the addition and the rollback, it doesn't matter if it's in the
list or not (no functions can even tell, since it's a singly-linked list
and we pass around just the tail entry).

So it's much simpler to just wait until opt-&gt;pathchange() tells us
whether to keep it, and either attach it (or free it) then. We do still
have to allocate it up front since it's that struct itself which is
passed to the pathchange callback.

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>tree-diff: use the name "tail" to refer to list tail</title>
<updated>2025-01-09T20:24:26Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2025-01-09T08:53:09Z</published>
<link rel='alternate' type='text/html' href='http://git.kilabit.info/git/commit/?id=d8baf083c5a1a3e8dd27fe73e6000c5f6dddf1ca'/>
<id>urn:sha1:d8baf083c5a1a3e8dd27fe73e6000c5f6dddf1ca</id>
<content type='text'>
The ll_diff_tree_paths() function and its helpers all append to a
running list by taking in a pointer to the old tail and returning the
new tail. But they just call this argument "p", which is not very
descriptive.

It gets particularly confusing in emit_path(), where we actually add to
the list, because "p" does double-duty: it is the tail of the list, but
it is also the entry which we add. Except that in some cases we _don't_
add a new entry (or we might even add it and roll it back) if the path
isn't interesting. At first glance, this makes it look like a bug that
we pass "p" on to ll_diff_tree_paths() to recurse; sometimes it is
getting the new entry we made and sometimes not!

But it's not a bug, because ll_diff_tree_paths() does not care about the
entry itself at all. It is only using its "next" pointer as the tail of
the list.

Let's swap out "p" for "tail" to make this obvious. And then in
emit_path() we'll continue to use "p" for our newly allocated entry.

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>tree-diff: drop list-tail argument to diff_tree_paths()</title>
<updated>2025-01-09T20:24:26Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2025-01-09T08:51:56Z</published>
<link rel='alternate' type='text/html' href='http://git.kilabit.info/git/commit/?id=a5c4e31af9b8b8fb362472ce3a1ec404df0da032'/>
<id>urn:sha1:a5c4e31af9b8b8fb362472ce3a1ec404df0da032</id>
<content type='text'>
The internals of the path diffing code, including ll_diff_tree_paths(),
all take an extra combine_diff_path parameter which they use as the tail
of a list of results, appending any new entries to it.

The public-facing diff_tree_paths() takes the same argument, but it just
makes the callers more awkward. They always start with a clean list, and
have to set up a fake head struct to pass in.

Let's keep the public API clean by always returning a new list. That
keeps the fake struct as an implementation detail of tree-diff.c.

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>tree-diff: inline path_appendnew()</title>
<updated>2025-01-09T20:24:25Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2025-01-09T08:49:44Z</published>
<link rel='alternate' type='text/html' href='http://git.kilabit.info/git/commit/?id=b20d7d348c4d32777cd577c221de529452baca03'/>
<id>urn:sha1:b20d7d348c4d32777cd577c221de529452baca03</id>
<content type='text'>
Our path_appendnew() has been simplified to the point that it is mostly
just implementing combine_diff_path_new(), plus setting the "next"
pointer. Since there's only one caller, let's replace it completely with
a call to that helper function.

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>tree-diff: pass whole path string to path_appendnew()</title>
<updated>2025-01-09T20:24:25Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2025-01-09T08:49:07Z</published>
<link rel='alternate' type='text/html' href='http://git.kilabit.info/git/commit/?id=8c53354658462aa6783022def86750ab0b79eb6f'/>
<id>urn:sha1:8c53354658462aa6783022def86750ab0b79eb6f</id>
<content type='text'>
When diffing trees, we'll have a strbuf "base" containing the
slash-separted names of our parent trees, and a "path" string
representing an entry name from the current tree. We pass these
separately to path_appendnew(), which combines them to form a single
path string in the combine_diff_path struct.

Instead, let's append the path string to our base strbuf ourselves, pass
in the result, and then roll it back with strbuf_setlen(). This lets us
simplify path_appendnew() a bit, enabling further refactoring.

And while it might seem like this causes extra wasted allocations, it
does not in practice. We reuse the same strbuf for each tree entry, so
we only have to allocate it to match the largest name. Plus, in a
recursive diff we'll end up doing this same operation to extend the base
for the next level of recursion. So we're really just incurring a small
memcpy().

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>tree-diff: drop path_appendnew() alloc optimization</title>
<updated>2025-01-09T20:24:25Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2025-01-09T08:46:49Z</published>
<link rel='alternate' type='text/html' href='http://git.kilabit.info/git/commit/?id=a8dda1af6ab400d45b7524bc46b64e04d14fc912'/>
<id>urn:sha1:a8dda1af6ab400d45b7524bc46b64e04d14fc912</id>
<content type='text'>
When we're diffing trees, we create a list of combine_diff_path structs
that represent changed paths. We allocate each struct and add it to the
list with path_appendnew(), which we then feed to opt-&gt;pathchange().
That function tells us whether the path is of interest or not; if not,
then we can throw away the struct we allocated.

So there's an optimization to avoid extra allocations: instead of
throwing away the new entry, we try to reuse it. If it was large enough
to store the next path we care about, we can do so. And if not, we fall
back to freeing and re-allocating a new struct.

This comes from 72441af7c4 (tree-diff: rework diff_tree() to generate
diffs for multiparent cases as well, 2014-04-07), where the goal was to
have even the 2-parent diff code use the combine-diff infrastructure,
but without taking a performance hit.

The implementation causes some complexities in the interface (as we
store the allocation length inside the "next" pointer), and prevents us
from using the regular combine_diff_path_new() constructor. The
complexity is mostly contained inside two functions, but it's worth
re-evaluating how much it's helping.

That commit claims it helps ~1% on generating two-parent diffs in
linux.git. Here are the timings I get on the same command today ("old"
is the current tip of master, and "new" has this patch applied):

  Benchmark 1: ./git.old log --raw --no-abbrev --no-renames v3.10..v3.11
    Time (mean ± σ):     532.9 ms ±   5.8 ms    [User: 472.7 ms, System: 59.6 ms]
    Range (min … max):   525.9 ms … 543.3 ms    10 runs

  Benchmark 2: ./git.new log --raw --no-abbrev --no-renames v3.10..v3.11
    Time (mean ± σ):     538.3 ms ±   5.7 ms    [User: 478.0 ms, System: 59.7 ms]
    Range (min … max):   528.5 ms … 545.3 ms    10 runs

  Summary
    ./git.old log --raw --no-abbrev --no-renames v3.10..v3.11 ran
    1.01 ± 0.02 times faster than ./git.new log --raw --no-abbrev --no-renames v3.10..v3.11

So we do end up on average 1% faster, but with 2% of noise. I tried to
focus more on diff performance by running the commit traversal
separately, like:

  git rev-list v3.10..v3.11 &gt;in

and then timing just the diffs:

  Benchmark 1: ./git.old diff-tree --stdin -r &lt;in
    Time (mean ± σ):     415.7 ms ±   5.8 ms    [User: 357.7 ms, System: 58.0 ms]
    Range (min … max):   410.9 ms … 430.3 ms    10 runs

  Benchmark 2: ./git.new diff-tree --stdin -r &lt;in
    Time (mean ± σ):     418.5 ms ±   2.1 ms    [User: 361.7 ms, System: 56.6 ms]
    Range (min … max):   414.9 ms … 421.3 ms    10 runs

  Summary
    ./git.old diff-tree --stdin -r &lt;in ran
      1.01 ± 0.02 times faster than ./git.new diff-tree --stdin -r &lt;in

That gets roughly the same result.

Adding in "-c" to do multi-parent diffs doesn't change much:

  Benchmark 1: ./git.old diff-tree --stdin -r -c &lt;in
    Time (mean ± σ):     525.3 ms ±   6.6 ms    [User: 470.0 ms, System: 55.1 ms]
    Range (min … max):   508.4 ms … 531.0 ms    10 runs

  Benchmark 2: ./git.new diff-tree --stdin -r -c &lt;in
    Time (mean ± σ):     532.3 ms ±   6.2 ms    [User: 469.0 ms, System: 63.1 ms]
    Range (min … max):   520.3 ms … 539.4 ms    10 runs

  Summary
    ./git.old diff-tree --stdin -r -c &lt;in ran
      1.01 ± 0.02 times faster than ./git.new diff-tree --stdin -r -c &lt;in

And of course if you add in a lot more work by doing actual
content-level diffs, any difference is lost entirely (here the newer
version is actually faster, but that's really just noise):

  Benchmark 1: ./git.old diff-tree --stdin -r --cc &lt;in
    Time (mean ± σ):     11.571 s ±  0.064 s    [User: 11.287 s, System: 0.283 s]
    Range (min … max):   11.497 s … 11.615 s    3 runs

  Benchmark 2: ./git.new diff-tree --stdin -r --cc &lt;in
    Time (mean ± σ):     11.466 s ±  0.109 s    [User: 11.108 s, System: 0.357 s]
    Range (min … max):   11.346 s … 11.560 s    3 runs

  Summary
    ./git.new diff-tree --stdin -r --cc &lt;in ran
      1.01 ± 0.01 times faster than ./git.old diff-tree --stdin -r --cc &lt;in

So my conclusion is that it probably does help a little, but it's mostly
lost in the noise. I could see an argument for keeping it, as the
complexity is hidden away in functions that do not often need to be
touched. But it does make them more confusing than necessary (despite
some detailed explanations from the author of that commit; it just took
me a while to wrap my head around what was going on) and prevents
further refactoring of the combine_diff_path struct. So let's drop it.

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
</feed>
