<feed xmlns='http://www.w3.org/2005/Atom'>
<title>git/commit-graph.c, branch v2.30.3</title>
<subtitle>Fork of git SCM with my patches.</subtitle>
<id>http://git.kilabit.info/git/atom?h=v2.30.3</id>
<link rel='self' href='http://git.kilabit.info/git/atom?h=v2.30.3'/>
<link rel='alternate' type='text/html' href='http://git.kilabit.info/git/'/>
<updated>2021-01-06T21:53:32Z</updated>
<entry>
<title>commit-graph: don't peek into `struct lock_file`</title>
<updated>2021-01-06T21:53:32Z</updated>
<author>
<name>Martin Ågren</name>
<email>martin.agren@gmail.com</email>
</author>
<published>2021-01-05T19:23:47Z</published>
<link rel='alternate' type='text/html' href='http://git.kilabit.info/git/commit/?id=a52cdce936fe473429bd4354079f03c3fb0234e9'/>
<id>urn:sha1:a52cdce936fe473429bd4354079f03c3fb0234e9</id>
<content type='text'>
Similar to the previous commit, avoid peeking into the `struct
lock_file`. Use the lock file API instead.

Signed-off-by: Martin Ågren &lt;martin.agren@gmail.com&gt;
Reviewed-by: Derrick Stolee &lt;dstolee@microsoft.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>commit-graph: use size_t for array allocation and indexing</title>
<updated>2020-12-07T20:32:04Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2020-12-07T19:11:08Z</published>
<link rel='alternate' type='text/html' href='http://git.kilabit.info/git/commit/?id=3361390cbedc962adcddcfd98676695c125fa180'/>
<id>urn:sha1:3361390cbedc962adcddcfd98676695c125fa180</id>
<content type='text'>
Our packed_commit_list is an array of pointers to commit structs. We use
"int" for the allocation, which is 32-bit even on 64-bit platforms. This
isn't likely to overflow in practice (we're writing commit graphs, so
you'd need to actually have billions of unique commits in the
repository). But it's good practice to use size_t for allocations.

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>commit-graph: replace packed_oid_list with oid_array</title>
<updated>2020-12-07T20:32:04Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2020-12-07T19:11:05Z</published>
<link rel='alternate' type='text/html' href='http://git.kilabit.info/git/commit/?id=a5f1c448998fba5f5a1b00e1b9a779176ec665a6'/>
<id>urn:sha1:a5f1c448998fba5f5a1b00e1b9a779176ec665a6</id>
<content type='text'>
Our custom packed_oid_list data structure is really just an oid_array in
disguise. Let's switch to using the generic structure, which shortens
and simplifies the code slightly.

There's one slightly awkward part: in the old code we copied a hash
straight from the mmap'd on-disk data into the final object_id. And now
we'll copy to a temporary oid, which we'll then pass to
oid_array_append(). But this is an operation we have to do all over the
commit-graph code already, since it mostly uses object_id structs
internally. I also measured "git commit-graph --append", which triggers
this code path, and it showed no difference.

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>commit-graph: drop count_distinct_commits() function</title>
<updated>2020-12-07T20:32:04Z</updated>
<author>
<name>Jeff King</name>
<email>peff@peff.net</email>
</author>
<published>2020-12-07T19:11:02Z</published>
<link rel='alternate' type='text/html' href='http://git.kilabit.info/git/commit/?id=1cbdbf3bef7e9167794cb2c12f9af1a450584ebe'/>
<id>urn:sha1:1cbdbf3bef7e9167794cb2c12f9af1a450584ebe</id>
<content type='text'>
When writing a commit graph, we collect a list of object ids in an
array, which we'll eventually copy into an array of "struct commit"
pointers. Before we do that, though, we count the number of distinct
commit entries. There's a subtle bug in this step, though.

We eliminate not only duplicate oids, but also in split mode, any oids
which are not commits or which are already in a graph file. However, the
loop starts at index 1, always counting index 0 as distinct. And indeed
it can't be a duplicate, since we check for those by comparing against
the previous entry, and there isn't one for index 0. But it could be a
commit that's already in a graph file, and we'd overcount the number of
commits by 1 in that case.

That turns out not to be a problem, though. The only things we do with
the count are:

  - check if our count will overflow our data structures. But the limit
    there is 2^31 commits, so while this is a useful check, the
    off-by-one is not likely to matter.

  - pre-allocate the array of commit pointers. But over-allocating by
    one isn't a problem; we'll just waste a few extra bytes.

The bug would be easy enough to fix, but we can observe that neither of
those steps is necessary.

After building the actual commit array, we'll likewise check its count
for overflow. So the extra check of the distinct commit count here is
redundant.

And likewise we use ALLOC_GROW() when building the commit array, so
there's no need to preallocate it (it's possible that doing so is
slightly more efficient, but if we care we can just optimistically
allocate one slot for each oid; I didn't bother here).

So count_distinct_commits() isn't doing anything useful. Let's just get
rid of that step.

Note that a side effect of the function was that we sorted the list of
oids, which we do rely on in copy_oids_to_commits(), since it must also
skip the duplicates. So we'll move the qsort there. I didn't copy the
"TODO" about adding more progress meters. It's actually quite hard to
make a repository large enough for this qsort would take an appreciable
amount of time, so this doesn't seem like a useful note.

Signed-off-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>Merge branch 'ds/commit-graph-merging-fix'</title>
<updated>2020-11-02T21:17:39Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2020-11-02T21:17:39Z</published>
<link rel='alternate' type='text/html' href='http://git.kilabit.info/git/commit/?id=307a53dd9914f65c9b6399221574e24234e4b49f'/>
<id>urn:sha1:307a53dd9914f65c9b6399221574e24234e4b49f</id>
<content type='text'>
When "git commit-graph" detects the same commit recorded more than
once while it is merging the layers, it used to die.  The code now
ignores all but one of them and continues.

* ds/commit-graph-merging-fix:
  commit-graph: don't write commit-graph when disabled
  commit-graph: ignore duplicates when merging layers
</content>
</entry>
<entry>
<title>commit-graph: don't write commit-graph when disabled</title>
<updated>2020-10-09T21:16:32Z</updated>
<author>
<name>Derrick Stolee</name>
<email>dstolee@microsoft.com</email>
</author>
<published>2020-10-09T20:53:52Z</published>
<link rel='alternate' type='text/html' href='http://git.kilabit.info/git/commit/?id=85102ac71b98466eaa2b9b5a568c3a1de736202d'/>
<id>urn:sha1:85102ac71b98466eaa2b9b5a568c3a1de736202d</id>
<content type='text'>
The core.commitGraph config setting can be set to 'false' to prevent
parsing commits from the commit-graph file(s). This causes an issue when
trying to write with "--split" which needs to distinguish between
commits that are in the existing commit-graph layers and commits that
are not. The existing mechanism uses parse_commit() and follows by
checking if there is a 'graph_pos' that shows the commit was parsed from
the commit-graph file.

When core.commitGraph=false, we do not parse the commits from the
commit-graph and 'graph_pos' indicates that no commits are in the
existing file. The --split logic moves forward creating a new layer on
top that holds all reachable commits, then possibly merges down into
those layers, resulting in duplicate commits. The previous change makes
that merging process more robust to such a situation in case it happens
in the written commit-graph data.

The easy answer here is to avoid writing a commit-graph if reading the
commit-graph is disabled. Since the resulting commit-graph will would not
be read by subsequent Git processes. This is more natural than forcing
core.commitGraph to be true for the 'write' process.

Reported-by: Thomas Braun &lt;thomas.braun@virtuell-zuhause.de&gt;
Helped-by: Jeff King &lt;peff@peff.net&gt;
Helped-by: Taylor Blau &lt;me@ttaylorr.com&gt;
Signed-off-by: Derrick Stolee &lt;dstolee@microsoft.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>commit-graph: ignore duplicates when merging layers</title>
<updated>2020-10-09T21:16:23Z</updated>
<author>
<name>Derrick Stolee</name>
<email>dstolee@microsoft.com</email>
</author>
<published>2020-10-09T20:53:51Z</published>
<link rel='alternate' type='text/html' href='http://git.kilabit.info/git/commit/?id=150f11574bcb53a6880a03593ff46d6b25fa03a1'/>
<id>urn:sha1:150f11574bcb53a6880a03593ff46d6b25fa03a1</id>
<content type='text'>
Thomas reported [1] that a "git fetch" command was failing with an error
saying "unexpected duplicate commit id". The root cause is that they had
fetch.writeCommitGraph enabled which generates commit-graph chains, and
this instance was merging two layers that both contained the same commit
ID.

[1] https://lore.kernel.org/git/55f8f00c-a61c-67d4-889e-a9501c596c39@virtuell-zuhause.de/

The initial assumption is that Git would not write a commit ID into a
commit-graph layer if it already exists in a lower commit-graph layer.
Somehow, this specific case did get into that situation, leading to this
error.

While unexpected, this isn't actually invalid (as long as the two layers
agree on the metadata for the commit). When we parse a commit that does
not have a graph_pos in the commit_graph_data_slab, we use binary search
in the commit-graph layers to find the commit and set graph_pos. That
position is never used again in this case. However, when we parse a
commit from the commit-graph file, we load its parents from the
commit-graph and assign graph_pos at that point. If those parents were
already parsed from the commit-graph, then nothing needs to be done.
Otherwise, this graph_pos is a valid position in the commit-graph so we
can parse the parents, when necessary.

Thus, this die() is too aggressive. The easiest thing to do would be to
ignore the duplicates.

If we only ignore the duplicates, then we will produce a commit-graph
that has identical commit IDs listed in adjacent positions. This excess
data will never be removed from the commit-graph, which could cascade
into significantly bloated file sizes.

Thankfully, we can collapse the list to erase the duplicate commit
pointers. This allows us to get the end result we want without extra
memory costs and minimal CPU time.

The root cause is due to disabling core.commitGraph, which prevents
parsing commits from the lower layers during a 'git commit-graph write
--split' command. Since we use the 'graph_pos' value to determine
whether a commit is in a lower layer, we never discover that those
commits are already in the commit-graph chain and add them to the top
layer. This layer is then merged down, creating duplicates.

The test added in t5324-split-commit-graph.sh fails without this change.
However, we still have not completely removed the need for this
duplicate check. That will come in a follow-up change.

Reported-by: Thomas Braun &lt;thomas.braun@virtuell-zuhause.de&gt;
Helped-by: Taylor Blau &lt;me@ttaylorr.com&gt;
Co-authored-by: Jeff King &lt;peff@peff.net&gt;
Signed-off-by: Derrick Stolee &lt;dstolee@microsoft.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>Merge branch 'tb/bloom-improvements'</title>
<updated>2020-09-29T21:01:20Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2020-09-29T21:01:20Z</published>
<link rel='alternate' type='text/html' href='http://git.kilabit.info/git/commit/?id=288ed98bf768f4df9b569d51a52c233a1402c0f5'/>
<id>urn:sha1:288ed98bf768f4df9b569d51a52c233a1402c0f5</id>
<content type='text'>
"git commit-graph write" learned to limit the number of bloom
filters that are computed from scratch with the --max-new-filters
option.

* tb/bloom-improvements:
  commit-graph: introduce 'commitGraph.maxNewFilters'
  builtin/commit-graph.c: introduce '--max-new-filters=&lt;n&gt;'
  commit-graph: rename 'split_commit_graph_opts'
  bloom: encode out-of-bounds filters as non-empty
  bloom/diff: properly short-circuit on max_changes
  bloom: use provided 'struct bloom_filter_settings'
  bloom: split 'get_bloom_filter()' in two
  commit-graph.c: store maximum changed paths
  commit-graph: respect 'commitGraph.readChangedPaths'
  t/helper/test-read-graph.c: prepare repo settings
  commit-graph: pass a 'struct repository *' in more places
  t4216: use an '&amp;&amp;'-chain
  commit-graph: introduce 'get_bloom_filter_settings()'
</content>
</entry>
<entry>
<title>Merge branch 'ds/maintenance-part-1'</title>
<updated>2020-09-25T22:25:38Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2020-09-25T22:25:38Z</published>
<link rel='alternate' type='text/html' href='http://git.kilabit.info/git/commit/?id=48794acc50f14394ca6c4f5092a4a498f409f350'/>
<id>urn:sha1:48794acc50f14394ca6c4f5092a4a498f409f350</id>
<content type='text'>
A "git gc"'s big brother has been introduced to take care of more
repository maintenance tasks, not limited to the object database
cleaning.

* ds/maintenance-part-1:
  maintenance: add trace2 regions for task execution
  maintenance: add auto condition for commit-graph task
  maintenance: use pointers to check --auto
  maintenance: create maintenance.&lt;task&gt;.enabled config
  maintenance: take a lock on the objects directory
  maintenance: add --task option
  maintenance: add commit-graph task
  maintenance: initialize task array
  maintenance: replace run_auto_gc()
  maintenance: add --quiet option
  maintenance: create basic maintenance runner
</content>
</entry>
<entry>
<title>builtin/commit-graph.c: introduce '--max-new-filters=&lt;n&gt;'</title>
<updated>2020-09-18T17:35:39Z</updated>
<author>
<name>Taylor Blau</name>
<email>me@ttaylorr.com</email>
</author>
<published>2020-09-18T13:27:27Z</published>
<link rel='alternate' type='text/html' href='http://git.kilabit.info/git/commit/?id=809e0327f579267ea78a1b2f727d3b63c1f5d044'/>
<id>urn:sha1:809e0327f579267ea78a1b2f727d3b63c1f5d044</id>
<content type='text'>
Introduce a command-line flag to specify the maximum number of new Bloom
filters that a 'git commit-graph write' is willing to compute from
scratch.

Prior to this patch, a commit-graph write with '--changed-paths' would
compute Bloom filters for all selected commits which haven't already
been computed (i.e., by a previous commit-graph write with '--split'
such that a roll-up or replacement is performed).

This behavior can cause prohibitively-long commit-graph writes for a
variety of reasons:

  * There may be lots of filters whose diffs take a long time to
    generate (for example, they have close to the maximum number of
    changes, diffing itself takes a long time, etc).

  * Old-style commit-graphs (which encode filters with too many entries
    as not having been computed at all) cause us to waste time
    recomputing filters that appear to have not been computed only to
    discover that they are too-large.

This can make the upper-bound of the time it takes for 'git commit-graph
write --changed-paths' to be rather unpredictable.

To make this command behave more predictably, introduce
'--max-new-filters=&lt;n&gt;' to allow computing at most '&lt;n&gt;' Bloom filters
from scratch. This lets "computing" already-known filters proceed
quickly, while bounding the number of slow tasks that Git is willing to
do.

Helped-by: Junio C Hamano &lt;gitster@pobox.com&gt;
Signed-off-by: Taylor Blau &lt;me@ttaylorr.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
</feed>
