path: root/builtin/backfill.c
Age | Commit message | Author
8 days | Merge branch 'th/backfill-auto-detect-sparseness-fix' | Junio C Hamano

"git backfill" is meant to auto-detect a sparsely checked out working tree, but that detection was broken; it has been fixed.

* th/backfill-auto-detect-sparseness-fix:
  backfill: auto-detect sparse-checkout from config
9 days | backfill: auto-detect sparse-checkout from config | Trieu Huynh

Commit 85127bcdea ("backfill: assume --sparse when sparse-checkout is enabled") intended for 'git backfill' to consult the repository configuration when the user does not pass '--sparse' or '--no-sparse' on the command line. It added the sentinel check:

    if (ctx->sparse < 0)
        ctx->sparse = cfg->apply_sparse_checkout;

However, the ctx->sparse field is initialized to 0 instead of -1, so this guard never triggers. Consequently, the repository config (core.sparseCheckout) is never checked, and the command always performs a full backfill even when sparse-checkout is enabled.

Fix this by initializing ctx->sparse to -1, ensuring the existing fallback logic correctly reads the repository configuration when no explicit flags are provided.

Add a test to verify that 'git backfill' automatically respects sparse-checkout settings when no flags are passed.

Signed-off-by: Trieu Huynh <vikingtc4@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
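The tri-state sentinel described above can be sketched in isolation. This is a minimal illustration, not the code in builtin/backfill.c: `struct backfill_ctx`, `struct repo_cfg` and `resolve_sparse()` are hypothetical stand-ins, and only the guard itself is taken from the commit message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the command's context structure. */
struct backfill_ctx {
    /* -1: no flag given, 0: --no-sparse, 1: --sparse */
    int sparse;
};

/* Hypothetical stand-in for the per-repository config values. */
struct repo_cfg {
    int apply_sparse_checkout; /* mirrors core.sparseCheckout */
};

/* Fall back to the repository config only when no flag was given. */
static int resolve_sparse(struct backfill_ctx *ctx, const struct repo_cfg *cfg)
{
    assert(ctx != NULL && cfg != NULL);
    if (ctx->sparse < 0)
        ctx->sparse = cfg->apply_sparse_checkout;
    return ctx->sparse;
}
```

With the field initialized to 0, the guard is skipped and the config is ignored; with -1, the config value is picked up. An explicit 0 or 1 from the command line still wins in both cases.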
2026-03-26 | t5620: test backfill's unknown argument handling | Derrick Stolee

Before the recent changes to parse rev-list arguments inside of 'git backfill', the builtin would take arbitrary arguments without complaint (and ignore them). This was noticed and a patch was sent [1] which motivates this change.

[1] https://lore.kernel.org/git/20260321031643.5185-1-r.siddharth.shrimali@gmail.com/

Note that the revision machinery can output an "ambiguous argument" warning if a value not starting with '--' is found and doesn't make sense as a reference or a pathspec. For unrecognized arguments starting with '--' we need to add logic into builtin/backfill.c to catch leftover arguments.

Reported-by: Siddharth Shrimali <r.siddharth.shrimali@gmail.com>
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
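One way to picture the "catch leftover arguments" step is a scan over whatever the option parsers did not consume. The helper below is a hypothetical sketch, not the logic added to builtin/backfill.c; it assumes argv[0] holds the command name and the remaining entries are leftovers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: return the first leftover argument that starts
 * with "--" so the caller can report it, or NULL if there is none.
 * argv[0] is assumed to be the command name and is skipped.
 */
static const char *first_unknown_option(int argc, const char **argv)
{
    assert(argc >= 0 && argv != NULL);
    for (int i = 1; i < argc; i++)
        if (!strncmp(argv[i], "--", 2))
            return argv[i];
    return NULL;
}
```

A caller could treat a non-NULL result as a usage error instead of silently ignoring the argument.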
2026-03-26 | backfill: accept revision arguments | Derrick Stolee

The existing implementation of 'git backfill' only includes downloading missing blobs reachable from HEAD. Advanced uses may desire more general commit limiting options, such as '--all' for all references, specifying a commit range via negative references, or specifying a recency of use such as with '--since=<date>'.

All of these options are available if we use setup_revisions() to parse the unknown arguments with the revision machinery. This opens up a large number of possibilities, only a small set of which are tested here.

For documentation, we avoid duplicating the option documentation and instead link to the documentation of 'git rev-list'.

Note that these arguments currently allow specifying a pathspec, which modifies the commit history checks but does not limit the paths used in the backfill logic. This will be updated in a future change.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-03-05 | Merge branch 'ob/core-attributesfile-in-repository' | Junio C Hamano

The core.attributesfile configuration is intended to be set per repository, but was tracked by a single global variable in-core; this has been corrected by moving it to a per-repository data structure.

* ob/core-attributesfile-in-repository:
  environment: move "branch.autoSetupMerge" into `struct repo_config_values`
  environment: stop using core.sparseCheckout globally
  environment: stop storing `core.attributesFile` globally
2026-02-26 | environment: stop using core.sparseCheckout globally | Olamide Caleb Bello

The config value `core.sparseCheckout` is parsed in `git_default_core_config()` and stored globally in `core_apply_sparse_checkout`. This could cause it to be overwritten by another repository when different Git repositories run in the same process.

Move the parsed value into `struct repo_config_values` in the_repository to retain current behaviours and move towards libifying Git.

Suggested-by: Phillip Wood <phillip.wood123@gmail.com>
Mentored-by: Christian Couder <christian.couder@gmail.com>
Mentored-by: Usman Akinyemi <usmanakinyemi202@gmail.com>
Signed-off-by: Olamide Caleb Bello <belkid98@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-02-12 | builtin/backfill: fix flags passed to `odb_has_object()` | Patrick Steinhardt

The function `fill_missing_blobs()` receives an array of object IDs and verifies for each of them whether the corresponding object exists. If it doesn't exist, we add it to a set of objects and then batch-fetch all of the objects at once.

The check for whether or not we already have the object is broken though: we pass `OBJECT_INFO_FOR_PREFETCH`, but `odb_has_object()` expects us to pass `HAS_OBJECT_*` flags. The flag expands to:

  - `OBJECT_INFO_QUICK`, which asks the object database to not reprepare in case the object wasn't found. This makes sense, as we'd otherwise reprepare the object database as many times as we have missing objects.

  - `OBJECT_INFO_SKIP_FETCH_OBJECT`, which asks the object database to not fetch the object in case it's missing. Again, this makes sense, as we want to batch-fetch the objects.

This shows that we indeed want the equivalent of this flag, but of course represented as `HAS_OBJECT_*` flags.

Luckily, the code is already working correctly. The `OBJECT_INFO` flag expands to `(1 << 3) | (1 << 4)`, none of which are valid `HAS_OBJECT` flags. And if no flags are passed, `odb_has_object()` ends up calling `odb_read_object_info_extended()` with exactly the above two flags that we wanted to set in the first place.

Of course, this is pure luck, and this can break any moment. So let's fix this and correct the code to not pass any flags at all.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
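The "working by luck" argument can be reproduced with a small model of two unrelated flag namespaces. Everything here is a sketch under assumptions: the commit message only gives the combined value `(1 << 3) | (1 << 4)`, so which name maps to which bit, the two `HAS_OBJECT_*` values, and the translation function `lookup_flags_for()` are all illustrative, not the object database API.

```c
#include <assert.h>

/* Combined value is from the commit message; the per-name bits are assumed. */
#define OBJECT_INFO_QUICK             (1u << 3)
#define OBJECT_INFO_SKIP_FETCH_OBJECT (1u << 4)
#define OBJECT_INFO_FOR_PREFETCH \
    (OBJECT_INFO_QUICK | OBJECT_INFO_SKIP_FETCH_OBJECT)

/* Assumed values for illustration; the commit message does not state them. */
#define HAS_OBJECT_RECHECK_PACKED (1u << 0)
#define HAS_OBJECT_FETCH_PROMISOR (1u << 1)

/*
 * Hypothetical model of a has-object check translating its own flags
 * into lookup flags: start from "quick, no fetch" and lift each
 * restriction only when the matching HAS_OBJECT_* bit is set.
 */
static unsigned lookup_flags_for(unsigned has_object_flags)
{
    unsigned flags = OBJECT_INFO_FOR_PREFETCH;

    if (has_object_flags & HAS_OBJECT_RECHECK_PACKED)
        flags &= ~OBJECT_INFO_QUICK;
    if (has_object_flags & HAS_OBJECT_FETCH_PROMISOR)
        flags &= ~OBJECT_INFO_SKIP_FETCH_OBJECT;
    return flags;
}
```

In this model, passing the wrong-namespace value sets no recognized bit, so it behaves exactly like passing 0. The result is correct only for as long as the two namespaces happen not to overlap.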
2025-09-24 | packfile: split up responsibilities of `reprepare_packed_git()` | Patrick Steinhardt

In `reprepare_packed_git()` we perform a couple of operations:

  - We reload alternate object directories.

  - We clear the loose object cache.

  - We reprepare packfiles.

While the logic is hosted in "packfile.c", it clearly reaches into other subsystems that aren't related to packfiles.

Split up the responsibility and introduce `odb_reprepare()` which now becomes responsible for repreparing the whole object database. The existing `reprepare_packed_git()` function is refactored accordingly and only cares about reloading the packfile store now.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-07-01 | odb: rename `has_object()` | Patrick Steinhardt

Rename `has_object()` to `odb_has_object()` to match other functions related to the object database and our modern coding guidelines.

Introduce a compatibility wrapper so that any in-flight topics will continue to compile.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-07-01 | object-store: rename files to "odb.{c,h}" | Patrick Steinhardt

In the preceding commits we have renamed the structures contained in "object-store.h" to `struct object_database` and `struct odb_backend`. As such, the code files "object-store.{c,h}" are confusingly named now. Rename them to "odb.{c,h}" accordingly.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-04-24 | Merge branch 'ps/parse-options-integers' | Junio C Hamano

Update the parse-options API to catch the mistake of passing the address of an integral variable of the wrong type or size.

* ps/parse-options-integers:
  parse-options: detect mismatches in integer signedness
  parse-options: introduce precision handling for `OPTION_UNSIGNED`
  parse-options: introduce precision handling for `OPTION_INTEGER`
  parse-options: rename `OPT_MAGNITUDE()` to `OPT_UNSIGNED()`
  parse-options: support unit factors in `OPT_INTEGER()`
  global: use designated initializers for options
  parse: fix off-by-one for minimum signed values
2025-04-17 | parse-options: detect mismatches in integer signedness | Patrick Steinhardt

It was reported that "t5620-backfill.sh" fails on s390x and sparc64 in a test that exercises the "--min-batch-size" command line option. The symptom was that the option didn't seem to have an effect: we didn't fetch objects with a batch size of 20, but instead fetched all objects at once.

As it turns out, the root cause is that `--min-batch-size` uses `OPT_INTEGER()` to parse the command line option. While this macro expects the caller to pass a pointer to an integer, we instead pass a pointer to a `size_t`. This coincidentally works on most platforms, but it breaks apart on the mentioned platforms because they are big endian.

This issue isn't specific to git-backfill(1): there are a couple of other places where we have the same type confusion going on. This indicates that the issue really is the interface that the parse-options subsystem provides -- it is simply too easy to get this wrong as there isn't any kind of compiler warning, and things just work on the most common systems.

Address the systemic issue by introducing two new build asserts `BARF_UNLESS_SIGNED()` and `BARF_UNLESS_UNSIGNED()`. As the names already hint at, those macros will cause a compiler error when passed a value that is not signed or unsigned, respectively. Adapt `OPT_INTEGER()`, `OPT_UNSIGNED()` as well as `OPT_MAGNITUDE()` to use those asserts. This uncovers a small set of sites where we indeed have the same bug as in git-backfill(1). Adapt all of them to use the correct option.

Reported-by: Todd Zullinger <tmz@pobox.com>
Reported-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Helped-by: SZEDER Gábor <szeder.dev@gmail.com>
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
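The endianness dependence can be shown without undefined behaviour by simulating the byte layout by hand. The function below is an illustrative sketch, not parse-options code: it assumes a 4-byte `int` written into the start of an 8-byte `size_t`, and `store_int_into_u64()` and its parameters are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Simulate a parser that believes it was handed a 32-bit integer but was
 * really given the address of a 64-bit variable holding `initial`.
 * Byte order is modelled explicitly, so the result is host-independent.
 */
static uint64_t store_int_into_u64(uint64_t initial, uint32_t parsed,
                                   int big_endian)
{
    unsigned char storage[8];
    uint64_t result = 0;

    assert(big_endian == 0 || big_endian == 1);

    /* Lay out the 64-bit variable in memory. */
    for (int i = 0; i < 8; i++) {
        int shift = big_endian ? 8 * (7 - i) : 8 * i;
        storage[i] = (unsigned char)(initial >> shift);
    }

    /* The parser overwrites only the first four bytes. */
    for (int i = 0; i < 4; i++) {
        int shift = big_endian ? 8 * (3 - i) : 8 * i;
        storage[i] = (unsigned char)(parsed >> shift);
    }

    /* The caller later reads all eight bytes as one value. */
    for (int i = 0; i < 8; i++) {
        int shift = big_endian ? 8 * (7 - i) : 8 * i;
        result |= (uint64_t)storage[i] << shift;
    }
    return result;
}
```

In the little-endian layout the four written bytes are the low half, so a small value reads back intact. In the big-endian layout they are the high half, so a requested 20 becomes an enormous number, which is consistent with the reported symptom of everything being fetched in a single batch.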
2025-04-15 | object-store: merge "object-store-ll.h" and "object-store.h" | Patrick Steinhardt

The "object-store-ll.h" header has been introduced to keep transitive header dependencies and compile times at bay. Now that we have created a new "object-store.c" file though we can easily move the last remaining additional bit of "object-store.h", the `odb_path_map`, out of the header.

Do so. As the "object-store.h" header is now equivalent to its low-level alternative we drop the latter and inline it into the former.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-02-18 | Merge branch 'ds/backfill' | Junio C Hamano

Lazy-loading missing files in a blobless clone on demand is costly as it tends to be one-blob-at-a-time. "git backfill" is introduced to help bulk-download necessary files beforehand.

* ds/backfill:
  backfill: assume --sparse when sparse-checkout is enabled
  backfill: add --sparse option
  backfill: add --min-batch-size=<n> option
  backfill: basic functionality and tests
  backfill: add builtin boilerplate
2025-02-03 | backfill: assume --sparse when sparse-checkout is enabled | Derrick Stolee

The previous change introduced the '--[no-]sparse' option for the 'git backfill' command, but did not assume it as enabled by default. However, this is likely the behavior that users will most often want to happen. Without this default, users with a small sparse-checkout may be confused when 'git backfill' downloads every version of every object in the full history.

However, this is left as a separate change so this decision can be reviewed independently of the value of the '--[no-]sparse' option.

Add a test of adding the '--sparse' option to a repo without sparse-checkout to make it clear that supplying it without a sparse-checkout is an error.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-02-03 | backfill: add --sparse option | Derrick Stolee

One way to significantly reduce the cost of a Git clone and later fetches is to use a blobless partial clone and combine that with a sparse-checkout that reduces the paths that need to be populated in the working directory. Not only does this reduce the cost of clones and fetches, the sparse-checkout reduces the number of objects needed to download from a promisor remote.

However, history investigations can be expensive as computing blob diffs will trigger promisor remote requests for one object at a time. This can be avoided by downloading the blobs needed for the given sparse-checkout using 'git backfill' and its new '--sparse' mode, at a time that the user is willing to pay that extra cost.

Note that this is distinctly different from the '--filter=sparse:<oid>' option, as this assumes that the partial clone has all reachable trees and we are using client-side logic to avoid downloading blobs outside of the sparse-checkout cone. This avoids the server-side cost of walking trees while also achieving a similar goal. It also downloads in batches based on similar path names, presenting a resumable download if things are interrupted.

This augments the path-walk API to have a possibly-NULL 'pl' member that may point to a 'struct pattern_list'. This could be more general than the sparse-checkout definition at HEAD, but 'git backfill --sparse' is currently the only consumer.

Be sure to test this in both cone mode and not cone mode. Cone mode has the benefit that the path-walk can skip certain paths once they would expand beyond the sparse-checkout. Non-cone mode can describe the included files using both positive and negative patterns, which changes the possible return values of path_matches_pattern_list(). Test both kinds of matches for increased coverage.

To test this, we can create a blobless sparse clone, expand the sparse-checkout slightly, and then run 'git backfill --sparse' to see how much data is downloaded. The general steps are

 1. git clone --filter=blob:none --sparse <url>
 2. git sparse-checkout set <dir1> ... <dirN>
 3. git backfill --sparse

For the Git repository with the 'builtin' directory in the sparse-checkout, we get these results for various batch sizes:

| Batch Size      | Pack Count | Pack Size | Time  |
|-----------------|------------|-----------|-------|
| (Initial clone) | 3          | 110 MB    |       |
| 10K             | 12         | 192 MB    | 17.2s |
| 15K             | 9          | 192 MB    | 15.5s |
| 20K             | 8          | 192 MB    | 15.5s |
| 25K             | 7          | 192 MB    | 14.7s |

This case matters less because a full clone of the Git repository from GitHub is currently at 277 MB.

Using a copy of the Linux repository with the 'kernel/' directory in the sparse-checkout, we get these results:

| Batch Size      | Pack Count | Pack Size | Time |
|-----------------|------------|-----------|------|
| (Initial clone) | 2          | 1,876 MB  |      |
| 10K             | 11         | 2,187 MB  | 46s  |
| 25K             | 7          | 2,188 MB  | 43s  |
| 50K             | 5          | 2,194 MB  | 44s  |
| 100K            | 4          | 2,194 MB  | 48s  |

This case is more meaningful because a full clone of the Linux repository is currently over 6 GB, so this is a valuable way to download a fraction of the repository and no longer need network access for all reachable objects within the sparse-checkout.

Choosing a batch size will depend on a lot of factors, including the user's network speed or reliability, the repository's file structure, and how many versions there are of the file within the sparse-checkout scope. There will not be a one-size-fits-all solution.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-02-03 | backfill: add --min-batch-size=<n> option | Derrick Stolee

Users may want to specify a minimum batch size for their needs. This is only a minimum: the path-walk API provides a list of OIDs that correspond to the same path, and thus it is optimal to allow delta compression across those objects in a single server request.

We could consider limiting the request to have a maximum batch size in the future. For now, we let the path-walk API batches determine the boundaries.

To get a feeling for the value of specifying the --min-batch-size parameter, I tested a number of open source repositories available on GitHub. The procedure was generally:

 1. git clone --filter=blob:none <url>
 2. git backfill

Checking the number of packfiles and the size of the .git/objects/pack directory helps to identify the effects of different batch sizes.

For the Git repository, we get these results:

| Batch Size      | Pack Count | Pack Size | Time  |
|-----------------|------------|-----------|-------|
| (Initial clone) | 2          | 119 MB    |       |
| 25K             | 8          | 290 MB    | 24s   |
| 50K             | 5          | 290 MB    | 24s   |
| 100K            | 4          | 290 MB    | 29s   |

Other than the packfile counts decreasing as we need fewer batches, the size and time required is not changing much for this small example.

For the nodejs/node repository, we see these results:

| Batch Size      | Pack Count | Pack Size | Time   |
|-----------------|------------|-----------|--------|
| (Initial clone) | 2          | 330 MB    |        |
| 25K             | 19         | 1,222 MB  | 1m 22s |
| 50K             | 11         | 1,221 MB  | 1m 24s |
| 100K            | 7          | 1,223 MB  | 1m 40s |
| 250K            | 4          | 1,224 MB  | 2m 23s |
| 500K            | 3          | 1,216 MB  | 4m 38s |

Here, we don't have much difference in the size of the repo, though the 500K batch size results in a few MB gained. That comes at a cost of a much longer time. This extra time is due to server-side delta compression happening as the on-disk deltas don't appear to be reusable all the time. But for smaller batch sizes, the server is able to find reasonable deltas partly because we are asking for objects that appear in the same region of the directory tree and include all versions of a file at a specific path.

To contrast this example, I tested the microsoft/fluentui repo, which has been known to have inefficient packing due to name hash collisions. These results are found before GitHub had the opportunity to repack the server with more advanced name hash versions:

| Batch Size      | Pack Count | Pack Size | Time   |
|-----------------|------------|-----------|--------|
| (Initial clone) | 2          | 105 MB    |        |
| 5K              | 53         | 348 MB    | 2m 26s |
| 10K             | 28         | 365 MB    | 2m 22s |
| 15K             | 19         | 407 MB    | 2m 21s |
| 20K             | 15         | 393 MB    | 2m 28s |
| 25K             | 13         | 417 MB    | 2m 06s |
| 50K             | 8          | 509 MB    | 1m 34s |
| 100K            | 5          | 535 MB    | 1m 56s |
| 250K            | 4          | 698 MB    | 1m 33s |
| 500K            | 3          | 696 MB    | 1m 42s |

Here, a larger variety of batch sizes were chosen because of the great variation in results. By asking the server to download small batches corresponding to fewer paths at a time, the server is able to provide better compression for these batches than it would for a regular clone. A typical full clone for this repository would require 738 MB.

This example justifies the choice to batch requests by path name, leading to improved communication with a server that is not optimally packed.

Finally, the same experiment for the Linux repository had these results:

| Batch Size      | Pack Count | Pack Size | Time    |
|-----------------|------------|-----------|---------|
| (Initial clone) | 2          | 2,153 MB  |         |
| 25K             | 63         | 6,380 MB  | 14m 08s |
| 50K             | 58         | 6,126 MB  | 15m 11s |
| 100K            | 30         | 6,135 MB  | 18m 11s |
| 250K            | 14         | 6,146 MB  | 18m 22s |
| 500K            | 8          | 6,143 MB  | 33m 29s |

Even in this example, where the default name hash algorithm leads to decent compression of the Linux kernel repository, there is value for selecting a smaller batch size, to a limit. The 25K batch size has the fastest time, but uses 250 MB more than the 50K batch size. The 500K batch size took much more time due to server compression time and thus we should avoid large batch sizes like this.

Based on these experiments, a batch size of 50,000 was chosen as the default value.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-02-03 | backfill: basic functionality and tests | Derrick Stolee

The default behavior of 'git backfill' is to fetch all missing blobs that are reachable from HEAD. Document and test this behavior.

The implementation is a very simple use of the path-walk API, initializing the revision walk at HEAD to start the path-walk from all commits reachable from HEAD. Ignore the object arrays that correspond to tree entries, assuming that they are all present already.

The path-walk API provides lists of objects in batches according to a common path, but that list could be very small. We want to balance the number of requests to the server with the ability to have the process interrupted with minimal repeated work to catch up in the next run. Based on some experiments (detailed in the next change) a minimum batch size of 50,000 is selected for the default.

This batch size is a _minimum_. As the path-walk API emits lists of blob IDs, they are collected into a list of objects for a request to the server. When that list is at least the minimum batch size, then the request is sent to the server for the new objects. However, the list of blob IDs from the path-walk API could be much longer than the batch size. At this moment, it is unclear if there is a benefit to split the list when there are too many objects at the same path.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
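The "minimum, not maximum" batching rule described above can be sketched with a counter-only accumulator. This is an illustrative model rather than the builtin's code: `struct batch_state`, `add_path_blobs()` and `flush()` are hypothetical names, and the real command collects object IDs and issues a fetch instead of counting.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical accumulator; tracks counts only, not object IDs. */
struct batch_state {
    size_t min_batch_size; /* send a request once this many are pending */
    size_t pending;        /* blobs collected but not yet requested */
    size_t requests;       /* number of requests issued so far */
    size_t largest;        /* size of the largest request issued */
};

/* Issue one request for everything pending (also used for the final tail). */
static void flush(struct batch_state *st)
{
    assert(st != NULL);
    if (!st->pending)
        return;
    st->requests++;
    if (st->pending > st->largest)
        st->largest = st->pending;
    st->pending = 0;
}

/* Called once per path with the number of missing blobs at that path. */
static void add_path_blobs(struct batch_state *st, size_t nr_missing)
{
    assert(st != NULL);
    st->pending += nr_missing; /* a path's list is never split */
    if (st->pending >= st->min_batch_size)
        flush(st);
}
```

Because a path's list is added whole before the threshold check, a request can exceed the minimum by an arbitrary amount, which matches the note that no maximum is enforced. A final `flush()` after the walk sends whatever is left below the threshold.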
2025-02-03 | backfill: add builtin boilerplate | Derrick Stolee

In anticipation of implementing 'git backfill', populate the necessary files with the boilerplate of a new builtin. Mark the builtin as experimental at this time, allowing breaking changes in the near future, if necessary.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>