git/commit-graph.c, branch v2.30.3

commit-graph: don't peek into `struct lock_file`

2021-01-06T21:53:32Z

Similar to the previous commit, avoid peeking into the `struct lock_file`. Use the lock file API instead. Signed-off-by: Martin Ågren Reviewed-by: Derrick Stolee Signed-off-by: Junio C Hamano

commit-graph: use size_t for array allocation and indexing

2020-12-07T20:32:04Z

Our packed_commit_list is an array of pointers to commit structs. We use "int" for the allocation, which is 32-bit even on 64-bit platforms. This isn't likely to overflow in practice (we're writing commit graphs, so you'd need to actually have billions of unique commits in the repository). But it's good practice to use size_t for allocations. Signed-off-by: Jeff King Signed-off-by: Junio C Hamano

commit-graph: replace packed_oid_list with oid_array

2020-12-07T20:32:04Z

Our custom packed_oid_list data structure is really just an oid_array in disguise. Let's switch to using the generic structure, which shortens and simplifies the code slightly. There's one slightly awkward part: in the old code we copied a hash straight from the mmap'd on-disk data into the final object_id. And now we'll copy to a temporary oid, which we'll then pass to oid_array_append(). But this is an operation we have to do all over the commit-graph code already, since it mostly uses object_id structs internally. I also measured "git commit-graph --append", which triggers this code path, and it showed no difference. Signed-off-by: Jeff King Signed-off-by: Junio C Hamano

commit-graph: drop count_distinct_commits() function

2020-12-07T20:32:04Z

When writing a commit graph, we collect a list of object ids in an array, which we'll eventually copy into an array of "struct commit" pointers. Before we do that, though, we count the number of distinct commit entries. There's a subtle bug in this step, though. We eliminate not only duplicate oids, but also in split mode, any oids which are not commits or which are already in a graph file. However, the loop starts at index 1, always counting index 0 as distinct. And indeed it can't be a duplicate, since we check for those by comparing against the previous entry, and there isn't one for index 0. But it could be a commit that's already in a graph file, and we'd overcount the number of commits by 1 in that case. That turns out not to be a problem, though. The only things we do with the count are: - check if our count will overflow our data structures. But the limit there is 2^31 commits, so while this is a useful check, the off-by-one is not likely to matter. - pre-allocate the array of commit pointers. But over-allocating by one isn't a problem; we'll just waste a few extra bytes. The bug would be easy enough to fix, but we can observe that neither of those steps is necessary. After building the actual commit array, we'll likewise check its count for overflow. So the extra check of the distinct commit count here is redundant. And likewise we use ALLOC_GROW() when building the commit array, so there's no need to preallocate it (it's possible that doing so is slightly more efficient, but if we care we can just optimistically allocate one slot for each oid; I didn't bother here). So count_distinct_commits() isn't doing anything useful. Let's just get rid of that step. Note that a side effect of the function was that we sorted the list of oids, which we do rely on in copy_oids_to_commits(), since it must also skip the duplicates. So we'll move the qsort there. I didn't copy the "TODO" about adding more progress meters. It's actually quite hard to make a repository large enough for this qsort would take an appreciable amount of time, so this doesn't seem like a useful note. Signed-off-by: Jeff King Signed-off-by: Junio C Hamano

Merge branch 'ds/commit-graph-merging-fix'

2020-11-02T21:17:39Z

When "git commit-graph" detects the same commit recorded more than once while it is merging the layers, it used to die. The code now ignores all but one of them and continues. * ds/commit-graph-merging-fix: commit-graph: don't write commit-graph when disabled commit-graph: ignore duplicates when merging layers

commit-graph: don't write commit-graph when disabled

2020-10-09T21:16:32Z

The core.commitGraph config setting can be set to 'false' to prevent parsing commits from the commit-graph file(s). This causes an issue when trying to write with "--split" which needs to distinguish between commits that are in the existing commit-graph layers and commits that are not. The existing mechanism uses parse_commit() and follows by checking if there is a 'graph_pos' that shows the commit was parsed from the commit-graph file. When core.commitGraph=false, we do not parse the commits from the commit-graph and 'graph_pos' indicates that no commits are in the existing file. The --split logic moves forward creating a new layer on top that holds all reachable commits, then possibly merges down into those layers, resulting in duplicate commits. The previous change makes that merging process more robust to such a situation in case it happens in the written commit-graph data. The easy answer here is to avoid writing a commit-graph if reading the commit-graph is disabled. Since the resulting commit-graph will would not be read by subsequent Git processes. This is more natural than forcing core.commitGraph to be true for the 'write' process. Reported-by: Thomas Braun Helped-by: Jeff King Helped-by: Taylor Blau Signed-off-by: Derrick Stolee Signed-off-by: Junio C Hamano

commit-graph: ignore duplicates when merging layers

2020-10-09T21:16:23Z

Thomas reported [1] that a "git fetch" command was failing with an error saying "unexpected duplicate commit id". The root cause is that they had fetch.writeCommitGraph enabled which generates commit-graph chains, and this instance was merging two layers that both contained the same commit ID. [1] https://lore.kernel.org/git/55f8f00c-a61c-67d4-889e-a9501c596c39@virtuell-zuhause.de/ The initial assumption is that Git would not write a commit ID into a commit-graph layer if it already exists in a lower commit-graph layer. Somehow, this specific case did get into that situation, leading to this error. While unexpected, this isn't actually invalid (as long as the two layers agree on the metadata for the commit). When we parse a commit that does not have a graph_pos in the commit_graph_data_slab, we use binary search in the commit-graph layers to find the commit and set graph_pos. That position is never used again in this case. However, when we parse a commit from the commit-graph file, we load its parents from the commit-graph and assign graph_pos at that point. If those parents were already parsed from the commit-graph, then nothing needs to be done. Otherwise, this graph_pos is a valid position in the commit-graph so we can parse the parents, when necessary. Thus, this die() is too aggressive. The easiest thing to do would be to ignore the duplicates. If we only ignore the duplicates, then we will produce a commit-graph that has identical commit IDs listed in adjacent positions. This excess data will never be removed from the commit-graph, which could cascade into significantly bloated file sizes. Thankfully, we can collapse the list to erase the duplicate commit pointers. This allows us to get the end result we want without extra memory costs and minimal CPU time. The root cause is due to disabling core.commitGraph, which prevents parsing commits from the lower layers during a 'git commit-graph write --split' command. Since we use the 'graph_pos' value to determine whether a commit is in a lower layer, we never discover that those commits are already in the commit-graph chain and add them to the top layer. This layer is then merged down, creating duplicates. The test added in t5324-split-commit-graph.sh fails without this change. However, we still have not completely removed the need for this duplicate check. That will come in a follow-up change. Reported-by: Thomas Braun Helped-by: Taylor Blau Co-authored-by: Jeff King Signed-off-by: Derrick Stolee Signed-off-by: Junio C Hamano

Merge branch 'tb/bloom-improvements'

2020-09-29T21:01:20Z

"git commit-graph write" learned to limit the number of bloom filters that are computed from scratch with the --max-new-filters option. * tb/bloom-improvements: commit-graph: introduce 'commitGraph.maxNewFilters' builtin/commit-graph.c: introduce '--max-new-filters=' commit-graph: rename 'split_commit_graph_opts' bloom: encode out-of-bounds filters as non-empty bloom/diff: properly short-circuit on max_changes bloom: use provided 'struct bloom_filter_settings' bloom: split 'get_bloom_filter()' in two commit-graph.c: store maximum changed paths commit-graph: respect 'commitGraph.readChangedPaths' t/helper/test-read-graph.c: prepare repo settings commit-graph: pass a 'struct repository *' in more places t4216: use an '&&'-chain commit-graph: introduce 'get_bloom_filter_settings()'

Merge branch 'ds/maintenance-part-1'

2020-09-25T22:25:38Z

A "git gc"'s big brother has been introduced to take care of more repository maintenance tasks, not limited to the object database cleaning. * ds/maintenance-part-1: maintenance: add trace2 regions for task execution maintenance: add auto condition for commit-graph task maintenance: use pointers to check --auto maintenance: create maintenance..enabled config maintenance: take a lock on the objects directory maintenance: add --task option maintenance: add commit-graph task maintenance: initialize task array maintenance: replace run_auto_gc() maintenance: add --quiet option maintenance: create basic maintenance runner

builtin/commit-graph.c: introduce '--max-new-filters='

2020-09-18T17:35:39Z

Introduce a command-line flag to specify the maximum number of new Bloom filters that a 'git commit-graph write' is willing to compute from scratch. Prior to this patch, a commit-graph write with '--changed-paths' would compute Bloom filters for all selected commits which haven't already been computed (i.e., by a previous commit-graph write with '--split' such that a roll-up or replacement is performed). This behavior can cause prohibitively-long commit-graph writes for a variety of reasons: * There may be lots of filters whose diffs take a long time to generate (for example, they have close to the maximum number of changes, diffing itself takes a long time, etc). * Old-style commit-graphs (which encode filters with too many entries as not having been computed at all) cause us to waste time recomputing filters that appear to have not been computed only to discover that they are too-large. This can make the upper-bound of the time it takes for 'git commit-graph write --changed-paths' to be rather unpredictable. To make this command behave more predictably, introduce '--max-new-filters=' to allow computing at most '' Bloom filters from scratch. This lets "computing" already-known filters proceed quickly, while bounding the number of slow tasks that Git is willing to do. Helped-by: Junio C Hamano Signed-off-by: Taylor Blau Signed-off-by: Junio C Hamano