path: root/midx.c
Age | Commit message | Author
2024-08-06 | midx: teach `fill_midx_entry()` about incremental MIDXs | Taylor Blau
In a similar fashion as previous commits, teach the `fill_midx_entry()` function to work in an incremental MIDX-aware fashion. This function, unlike others which accept an index into either the lexical order of objects or packs, takes in an object_id, and attempts to fill a caller-provided 'struct pack_entry' with the remaining pieces of information about that object from the MIDX. The function uses `bsearch_midx()` which fills out the frame-local 'pos' variable, recording the given object_id's lexical position within the MIDX chain, if found (if no matching object ID was found, we'll return immediately without filling out the `pack_entry` structure). Once given that position, we jump back through the `->base_midx` pointer to ensure that our `m` points at the MIDX layer which contains the given object_id (and not an ancestor or descendant of it in the chain). Note that we can drop the bounds check "if (pos >= m->num_objects)" because `midx_for_object()` performs this check for us. After that point, we only need to make two special considerations within this function:
- First, the pack_int_id returned to us by `nth_midxed_pack_int_id()` is a position in the concatenated lexical order of packs, so we must ensure that we subtract `m->num_packs_in_base` before accessing the MIDX-local `packs` array.
- Second, we must avoid translating the `pos` back to a MIDX-local index, since we use it as an argument to `nth_midxed_offset()` which expects a position relative to the concatenated lexical order of objects.
Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-08-06 | midx: teach `nth_midxed_offset()` about incremental MIDXs | Taylor Blau
In a similar fashion as in previous commits, teach the function `nth_midxed_offset()` about incremental MIDXs. The given object `pos` is used to find the containing MIDX, and translated back into a MIDX-local position by assigning the return value of `midx_for_object()` to it. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-08-06 | midx: teach `bsearch_midx()` about incremental MIDXs | Taylor Blau
Now that the special-case callers of `bsearch_midx()` have been dealt with, teach `bsearch_midx()` to handle incremental MIDX chains. The incremental MIDX-aware version of `bsearch_midx()` works by repeatedly searching for a given OID in each layer along the `->base_midx` pointer, stopping either when an exact match is found, or the end of the chain is reached. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
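The layer-by-layer search described above can be sketched as follows. This is a minimal illustration, not the code in midx.c: `struct toy_midx`, `toy_bsearch_one()` and `toy_bsearch_chain()` are hypothetical names, sorted integers stand in for object IDs, and the `num_objects_in_base` field is assumed by analogy with the `num_packs_in_base` field named elsewhere in this log.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a MIDX layer (hypothetical; not Git's struct). */
struct toy_midx {
	const struct toy_midx *base_midx; /* previous layer, or NULL */
	const uint32_t *keys;             /* sorted stand-ins for object IDs */
	uint32_t num_objects;             /* objects in this layer only */
	uint32_t num_objects_in_base;     /* objects in all earlier layers (assumed name) */
};

/* Binary search within a single layer; returns 1 and sets *pos on a hit. */
static int toy_bsearch_one(const struct toy_midx *m, uint32_t key,
			   uint32_t *pos)
{
	uint32_t lo = 0, hi = m->num_objects;

	while (lo < hi) {
		uint32_t mid = lo + (hi - lo) / 2;

		if (m->keys[mid] == key) {
			*pos = mid;
			return 1;
		}
		if (m->keys[mid] < key)
			lo = mid + 1;
		else
			hi = mid;
	}
	return 0;
}

/* Walk from the tip towards the base, stopping at the first exact match. */
static int toy_bsearch_chain(const struct toy_midx *m, uint32_t key,
			     uint32_t *pos)
{
	for (; m; m = m->base_midx) {
		uint32_t local;

		if (toy_bsearch_one(m, key, &local)) {
			/* Report the position relative to the whole chain. */
			*pos = local + m->num_objects_in_base;
			return 1;
		}
	}
	return 0;
}
```

With a base layer holding keys {10, 20, 30, 40} and a tip layer holding {15, 25, 35}, searching the tip for 25 reports chain-wide position 5, while searching for 30 falls through to the base layer and reports position 2.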
2024-08-06 | midx: introduce `bsearch_one_midx()` | Taylor Blau
The `bsearch_midx()` function will be extended in a following commit to search for the location of a given object ID across all MIDXs in a chain (or the single non-chain MIDX if no chain is available). While most callers will naturally want to use the updated `bsearch_midx()` function, there are a handful of special cases that will want finer control and will only want to search through a single MIDX. For instance, the object abbreviation code cares about object IDs near to where we'd expect to find a match in a MIDX. In that case, we want to look at the nearby matches in each layer of the MIDX chain, not just a single one. Split the more fine-grained control out into a separate function called `bsearch_one_midx()` which searches only a single MIDX. At present both `bsearch_midx()` and `bsearch_one_midx()` have identical behavior, but the following commit will rewrite the former to be aware of incremental MIDXs for the remaining non-special case callers. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-08-06 | midx: teach `nth_bitmapped_pack()` about incremental MIDXs | Taylor Blau
In a similar fashion as in previous commits, teach the function `nth_bitmapped_pack()` about incremental MIDXs by translating the given `pack_int_id` from the concatenated lexical order to a MIDX-local lexical position. When accessing the containing MIDX's array of packs, use the local pack ID. Likewise, when reading the 'BTMP' chunk, use the MIDX-local offset when accessing the data within that chunk. (Note that both the call to prepare_midx_pack() and the assignment of bp->pack_int_id care about the global pack_int_id, so avoid shadowing the given 'pack_int_id' parameter). Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-08-06 | midx: teach `nth_midxed_object_oid()` about incremental MIDXs | Taylor Blau
The function `nth_midxed_object_oid()` returns the object ID for a given object position in the MIDX lexicographic order. Teach this function to instead operate over the concatenated lexicographic order defined in an earlier step so that it is able to be used with incremental MIDXs. To do this, we need to both (a) adjust the bounds check for the given 'n', and (b) record the MIDX-local position after chasing the `->base_midx` pointer to find the MIDX which contains that object. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-08-06 | midx: teach `prepare_midx_pack()` about incremental MIDXs | Taylor Blau
The function `prepare_midx_pack()` is part of the midx.h API and loads the pack identified by the MIDX-local 'pack_int_id'. This patch prepares that function to be aware of an incremental MIDX world. To do this, introduce the second of the two general purpose helpers mentioned in the previous commit. This commit introduces `midx_for_pack()`, which is the pack-specific analog of `midx_for_object()`, and works in the same fashion. Like `midx_for_object()`, this function chases down the '->base_midx' field until it finds the MIDX layer within the chain that contains the given pack. Use this function within `prepare_midx_pack()` so that the `pack_int_id` it expects is now relative to the entire MIDX chain, and that it prepares the given pack in the appropriate MIDX. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
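As a rough sketch of the pack-side translation described above (an illustration under assumptions, not the actual `midx_for_pack()`): `struct toy_midx` and `toy_midx_for_pack()` are hypothetical names, pack IDs are zero-based, and out-of-range input is reported through a return code rather than by aborting.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a MIDX layer (hypothetical; not Git's struct). */
struct toy_midx {
	struct toy_midx *base_midx;  /* previous layer, or NULL */
	uint32_t num_packs;          /* packs in this layer only */
	uint32_t num_packs_in_base;  /* packs in all earlier layers */
};

/*
 * Translate a chain-wide pack ID into a layer-local index.
 * On success, *m points at the containing layer, *local_id holds the
 * index into that layer's own pack array, and 0 is returned.
 * Returns -1 if the ID lies outside the chain.
 */
static int toy_midx_for_pack(struct toy_midx **m, uint32_t pack_int_id,
			     uint32_t *local_id)
{
	struct toy_midx *cur = *m;

	/* Chase base_midx until this layer's range covers the ID. */
	while (cur && pack_int_id < cur->num_packs_in_base)
		cur = cur->base_midx;
	if (!cur || pack_int_id - cur->num_packs_in_base >= cur->num_packs)
		return -1;

	*m = cur;
	*local_id = pack_int_id - cur->num_packs_in_base;
	return 0;
}
```

For a chain whose base layer has 2 packs and whose tip has 3, chain-wide pack ID 3 resolves to local index 1 in the tip, and pack ID 1 resolves to local index 1 in the base.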
2024-08-06 | midx: teach `nth_midxed_pack_int_id()` about incremental MIDXs | Taylor Blau
The function `nth_midxed_pack_int_id()` takes in an object position in MIDX lexicographic order and returns an identifier of the pack from which that object was selected in the MIDX. Currently, the given object position is an index into the lexicographic order of objects in a single MIDX. Change this position to instead refer into the concatenated lexicographic order of all MIDXs in a MIDX chain. This has two visible effects within the implementation of `prepare_midx_pack()`:
- First, the given position is now an index into the concatenated lexicographic order of all MIDXs in the order in which they appear in the MIDX chain.
- Second, the pack ID returned from this function is now also in the concatenated order of packs among all layers of the MIDX chain in the same order that they appear in the MIDX chain.
To do this, introduce the first of two general purpose helpers, this one being `midx_for_object()`. `midx_for_object()` takes a double pointer to a `struct multi_pack_index` as well as an object `pos` in terms of the entire MIDX chain[^1]. The function chases down the '->base_midx' field until it finds the MIDX layer within the chain that contains the given object. It then:
- modifies the double pointer to point to the containing MIDX, instead of the tip of the chain, and
- returns the MIDX-local position[^2] at which the given object can be found.
Use this function within `nth_midxed_pack_int_id()` so that the `pos` it expects is now relative to the entire MIDX chain, and that it returns the appropriate pack position for that object.
[^1]: As a reminder, this means that the object is identified among the objects contained in all layers of the incremental MIDX chain, not any particular layer. For example, consider a MIDX chain with two individual MIDXs, one with 4 objects and another with 3 objects. If the MIDX with 4 objects appears earlier in the chain, then asking for object 6 would return the second object in the MIDX with 3 objects.
[^2]: Building on the previous example, asking for object 6 in a MIDX chain with (4, 3) objects, respectively, would set the double pointer to point at the MIDX containing three objects, and would return an index to the second object within that MIDX.
Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
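The helper's two effects (re-pointing the double pointer and returning a layer-local position) might look roughly like the sketch below. It is not the real `midx_for_object()`: `struct toy_midx` and `toy_midx_for_object()` are hypothetical, the `num_objects_in_base` field name is assumed by analogy with `num_packs_in_base`, positions are zero-based, and a bad position yields -1 instead of aborting.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for 'struct multi_pack_index'; only the fields used here. */
struct toy_midx {
	struct toy_midx *base_midx;    /* previous layer, or NULL */
	uint32_t num_objects;          /* objects in this layer only */
	uint32_t num_objects_in_base;  /* objects in all earlier layers (assumed name) */
};

/*
 * Translate a zero-based, chain-wide object position into a layer-local
 * one. On success, *m is updated to the containing layer, *local_pos is
 * set, and 0 is returned. Returns -1 for an out-of-range position.
 */
static int toy_midx_for_object(struct toy_midx **m, uint32_t pos,
			       uint32_t *local_pos)
{
	struct toy_midx *cur = *m;

	/* Positions below num_objects_in_base belong to an earlier layer. */
	while (cur && pos < cur->num_objects_in_base)
		cur = cur->base_midx;
	if (!cur || pos - cur->num_objects_in_base >= cur->num_objects)
		return -1;

	*m = cur;
	*local_pos = pos - cur->num_objects_in_base;
	return 0;
}
```

Reading the footnotes' "object 6" as a one-based count, the same object is zero-based position 5 in this sketch: with layers of (4, 3) objects it resolves to the second layer with local position 1, that is, its second object.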
2024-06-14 | global: introduce `USE_THE_REPOSITORY_VARIABLE` macro | Patrick Steinhardt
Use of the `the_repository` variable is deprecated nowadays, and we slowly but steadily convert the codebase to not use it anymore. Instead, callers should be passing down the repository to work on via parameters. It is hard though to prove that a given code unit does not use this variable anymore. The most trivial case, merely demonstrating that there is no direct use of `the_repository`, is already a bit of a pain during code reviews as the reviewer needs to manually verify claims made by the patch author. The bigger problem though is that we have many interfaces that implicitly rely on `the_repository`. Introduce a new `USE_THE_REPOSITORY_VARIABLE` macro that allows code units to opt into usage of `the_repository`. The intent of this macro is to demonstrate that a certain code unit does not use this variable anymore, and to keep it from new dependencies on it in future changes, be it explicit or implicit. For now, the macro only guards `the_repository` itself as well as `the_hash_algo`. There are many more known interfaces where we have an implicit dependency on `the_repository`, but those are not guarded at the current point in time. Over time though, we should start to add guards as required (or even better, just remove them). Define the macro as required in our code units. As expected, most of our code still relies on the global variable. Nearly all of our builtins rely on the variable as there is no way yet to pass `the_repository` to their entry point. For now, declare the macro in "builtin.h" to keep the required changes at least a little bit more contained. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-06-14 | hash: require hash algorithm in `oidread()` and `oidclr()` | Patrick Steinhardt
Both `oidread()` and `oidclr()` use `the_repository` to derive the hash function that shall be used. Require callers to pass in the hash algorithm to get rid of this implicit dependency. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-05-30 | midx: replace `get_midx_rev_filename()` with a generic helper | Taylor Blau
Commit f894081deae (pack-revindex: read multi-pack reverse indexes, 2021-03-30) introduced the `get_midx_rev_filename()` helper (later modified by commit 60980aed786 (midx.c: write MIDX filenames to strbuf, 2021-10-26)). This function returns the location of the classic ".rev" files we used to write for MIDXs (prior to 95e8383bac1 (midx.c: make changing the preferred pack safe, 2022-01-25)), which is always of the form:
    $GIT_DIR/objects/pack/multi-pack-index-$HASH.rev
Replace this function with a generic helper that populates a strbuf with the above form, replacing the ".rev" extension with a caller-provided argument. This will allow us to remove a similarly-defined function in the pack-bitmap code (used to determine the location of a MIDX .bitmap file) by reimplementing it in terms of `get_midx_filename_ext()`. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
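To illustrate the filename form above, here is a small sketch that builds such a path with a caller-provided extension. It is not `get_midx_filename_ext()` itself, which populates a strbuf; `toy_midx_filename_ext()`, its parameters, and the fixed-size output buffer are assumptions made for this example.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: write
 *   <object_dir>/pack/multi-pack-index-<hash_hex>.<ext>
 * into 'out'. Returns 0 on success, -1 if the result did not fit.
 */
static int toy_midx_filename_ext(char *out, size_t outlen,
				 const char *object_dir,
				 const char *hash_hex, const char *ext)
{
	int n = snprintf(out, outlen, "%s/pack/multi-pack-index-%s.%s",
			 object_dir, hash_hex, ext);

	/* snprintf() reports the untruncated length; detect truncation. */
	if (n < 0 || (size_t)n >= outlen)
		return -1;
	return 0;
}
```

Passing "rev" or "bitmap" as the extension yields the two file names mentioned in the message, which is why a single helper can serve both callers.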
2024-04-23 | Merge branch 'ps/missing-btmp-fix' | Junio C Hamano
Git 2.44 introduced a regression that makes the updated code barf in repositories with a multi-pack index written by older versions of Git, which has been corrected.
* ps/missing-btmp-fix:
  pack-bitmap: gracefully handle missing BTMP chunks
2024-04-15 | pack-bitmap: gracefully handle missing BTMP chunks | Patrick Steinhardt
In 0fea6b73f1 (Merge branch 'tb/multi-pack-verbatim-reuse', 2024-01-12) we have introduced multi-pack verbatim reuse of objects. This series has introduced a new BTMP chunk, which encodes information about bitmapped objects in the multi-pack index. Starting with dab60934e3 (pack-bitmap: pass `bitmapped_pack` struct to pack-reuse functions, 2023-12-14) we use this information to figure out objects which we can reuse from each of the packfiles. One thing that we glossed over though is backwards compatibility with repositories that do not yet have BTMP chunks in their multi-pack index. In that case, `nth_bitmapped_pack()` would return an error, which causes us to emit a warning followed by another error message. These warnings are visible to users that fetch from a repository:
```
$ git fetch
...
remote: error: MIDX does not contain the BTMP chunk
remote: warning: unable to load pack: 'pack-f6bb7bd71d345ea9fe604b60cab9ba9ece54ffbe.idx', disabling pack-reuse
remote: Enumerating objects: 40, done.
remote: Counting objects: 100% (40/40), done.
remote: Compressing objects: 100% (39/39), done.
remote: Total 40 (delta 5), reused 0 (delta 0), pack-reused 0 (from 0)
...
```
While the fetch succeeds, the user is left wondering what they did wrong. Furthermore, as visible both from the warning and from the reuse stats, pack-reuse is completely disabled in such repositories. What is quite interesting is that this issue can even be triggered in case `pack.allowPackReuse=single` is set, which is the default value. One could have expected that in this case we fall back to the old logic, which is to use the preferred packfile without consulting BTMP chunks at all. But either we fail with the above error in case they are missing, or we use the first pack in the multi-pack-index. The former case disables pack-reuse altogether, whereas the latter case may result in reusing objects from a suboptimal packfile.
Fix this issue by partially reverting the logic back to what we had before this patch series landed. Namely, in the case where we have no BTMP chunks or when `pack.allowPackReuse=single` is set, we use the preferred pack instead of consulting the BTMP chunks. Helped-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-04-01 | midx-write: move writing-related functions from midx.c | Taylor Blau
Introduce a new midx-write.c source file, which holds all of the functionality from the MIDX sub-system related to writing new MIDX files. Similar to the relationship between "pack-bitmap.c" and "pack-bitmap-write.c", this source file will hold code that is specific to writing MIDX files as opposed to reading them (the latter will remain in midx.c). Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-03-25 | midx: use strvec_pushf() for pack-objects base name | René Scharfe
Build the pack base name argument directly using strvec_pushf() instead of with an intermediate strbuf. This is shorter, simpler and avoids the need for explicit cleanup. Signed-off-by: René Scharfe <l.s.r@web.de> Reviewed-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-12-14 | midx: implement `midx_preferred_pack()` | Taylor Blau
When performing a binary search over the objects in a MIDX's bitmap (i.e. in pseudo-pack order), the reader reconstructs the pseudo-pack ordering using a combination of (a) the preferred pack, (b) the pack's lexical position in the MIDX based on pack names, and (c) the object offset within the pack. In order to perform this binary search, the reader must know the identity of the preferred pack. This could be stored in the MIDX, but isn't for historical reasons, mostly because it can easily be inferred at read-time by looking at the object in the first bit position and finding out which pack it was selected from in the MIDX, like so: nth_midxed_pack_int_id(m, pack_pos_to_midx(m, 0)); In midx_to_pack_pos() which performs this binary search, we look up the identity of the preferred pack before each search. This is relatively quick, since it involves two table-driven lookups (one in the MIDX's revindex for `pack_pos_to_midx()`, and another in the MIDX's object table for `nth_midxed_pack_int_id()`). But since the preferred pack does not change after the MIDX is written, it is safe to cache this value on the MIDX itself. Write a helper to do just that, and rewrite all of the existing call-sites that care about the identity of the preferred pack in terms of this new helper. This will prepare us for a subsequent patch where we will need to binary search through the MIDX's pseudo-pack order multiple times. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-12-14 | midx: implement `midx_locate_pack()` | Taylor Blau
The multi-pack index API exposes a `midx_contains_pack()` function that takes in a string ending in either ".idx" or ".pack" and returns whether or not the MIDX contains a given pack corresponding to that string. There is no corresponding function to locate the position of a pack within the MIDX's pack order (sorted lexically by pack filename). We could add an optional out parameter to `midx_contains_pack()` that is filled out with the pack's position when the parameter is non-NULL. To minimize the amount of fallout from this change, instead introduce a new function by renaming `midx_contains_pack()` to `midx_locate_pack()`, adding that output parameter, and then reimplementing `midx_contains_pack()` in terms of it. Future patches will make use of this new function. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
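The "locate with an optional out parameter, then reimplement contains in terms of it" pattern can be sketched like this. The names `struct toy_midx`, `toy_locate_pack()` and `toy_contains_pack()` are hypothetical, and this simplified version compares names exactly; it does not handle the ".idx" versus ".pack" suffix matching that the message mentions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy MIDX holding only its pack names (hypothetical; not Git's struct). */
struct toy_midx {
	const char **pack_names;  /* sorted lexically by file name */
	uint32_t num_packs;
};

/*
 * Binary search for 'name'. Returns 1 if found, 0 otherwise. When 'pos'
 * is non-NULL and the pack is found, *pos receives its position in the
 * lexical pack order.
 */
static int toy_locate_pack(const struct toy_midx *m, const char *name,
			   uint32_t *pos)
{
	uint32_t lo = 0, hi = m->num_packs;

	while (lo < hi) {
		uint32_t mid = lo + (hi - lo) / 2;
		int cmp = strcmp(name, m->pack_names[mid]);

		if (!cmp) {
			if (pos)
				*pos = mid;
			return 1;
		}
		if (cmp > 0)
			lo = mid + 1;
		else
			hi = mid;
	}
	return 0;
}

/* Existing callers keep their simple interface. */
static int toy_contains_pack(const struct toy_midx *m, const char *name)
{
	return toy_locate_pack(m, name, NULL);
}
```

Keeping the old function as a thin wrapper means none of its existing callers need to change, which is the "minimize fallout" argument made in the message.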
2023-12-14 | midx: implement `BTMP` chunk | Taylor Blau
When a multi-pack bitmap is used to implement verbatim pack reuse (that is, when verbatim chunks from an on-disk packfile are copied directly[^1]), it does so by using its "preferred pack" as the source for pack-reuse. This allows repositories to pack the majority of their objects into a single (often large) pack, and then use it as the single source for verbatim pack reuse. This increases the number of objects that are reused verbatim (and consequently, decreases the amount of time it takes to generate many packs). But this performance comes at a cost, which is that the preferred packfile must pace its growth with that of the entire repository in order to maintain the utility of verbatim pack reuse. As repositories grow beyond what we can reasonably store in a single packfile, the utility of verbatim pack reuse diminishes. Or, at the very least, it becomes increasingly more expensive to maintain as the pack grows larger and larger. It would be beneficial to be able to perform this same optimization over multiple packs, provided some modest constraints (most importantly, that the set of packs eligible for verbatim reuse are disjoint with respect to the subset of their objects being sent). If we assume that the packs which we treat as candidates for verbatim reuse are disjoint with respect to any of their objects we may output, we need to make only modest modifications to the verbatim pack-reuse code itself. Most notably, we need to remove the assumption that the bits in the reachability bitmap corresponding to objects from the single reuse pack begin at the first bit position. Future patches will unwind these assumptions and reimplement their existing functionality as special cases of the more general assumptions (e.g. that reuse bits can start anywhere within the bitset, but happen to start at 0 for all existing cases). This patch does not yet relax any of those assumptions.
Instead, it implements a foundational data-structure, the "Bitmapped Packs" (`BTMP`) chunk of the multi-pack index. The `BTMP` chunk's contents are described in detail here. Importantly, the `BTMP` chunk contains information to map regions of a multi-pack index's reachability bitmap to the packs whose objects they represent. For now, this chunk is only written, not read (outside of the test-tool used in this patch to test the new chunk's behavior). Future patches will begin to make use of this new chunk.
[^1]: Modulo patching any `OFS_DELTA`'s that cross over a region of the pack that wasn't used verbatim.
Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-12-14 | midx: factor out `fill_pack_info()` | Taylor Blau
When selecting which packfiles will be written while generating a MIDX, the MIDX internals fill out a 'struct pack_info' with various pieces of book-keeping. Instead of filling out each field of the `pack_info` structure individually in each of the two spots that modify the array of such structures (`ctx->info`), extract a common routine that does this for us. This reduces the code duplication by a modest amount. But more importantly, it zero-initializes the structure before assigning values into it. This hardens us for a future change which will add additional fields to this structure which (until this patch) was not zero-initialized. As a result, any new fields added to the `pack_info` structure need only be updated in a single location, instead of at each spot within midx.c. There are no functional changes in this patch. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-12-14 | pack-objects: free packing_data in more places | Taylor Blau
The pack-objects internals use a packing_data struct to track what objects are part of the pack(s) being formed. Since these structures contain allocated fields, failing to appropriately free() them results in a leak. Plug that leak by introducing a clear_packing_data() function, and call it in the appropriate spots. This is a fairly straightforward leak to plug, since none of the callers expect to read any values or have any references to parts of the address space being freed. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-11-09 | midx: check consistency of fanout table | Jeff King
The commit-graph, midx, and pack idx on-disk formats all have oid fanout tables which are fed to bsearch_hash(). If these tables do not increase monotonically, then the binary search may not only produce bogus values, it may cause out of bounds reads. We fixed this for commit graphs in 4169d89645 (commit-graph: check consistency of fanout table, 2023-10-09). That commit argued that we did not need to do the same for midx and pack idx files, because they already did this check. However, that is wrong. We _do_ check the fanout table for pack idx files when we load them, but we only do so for midx files when running "git multi-pack-index verify". So it is possible to get an out-of-bounds read by running a normal command with a specially crafted midx file. Let's fix this using the same solution (and roughly the same test) we did for the commit-graph in 4169d89645. This replaces the same check from "multi-pack-index verify": because verify uses the same read routines, we'd bail on reading the midx much sooner now. So let's make sure to copy its verbose error message. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
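A monotonicity check of this kind can be sketched as below. It is an illustration rather than the code added to midx.c: `toy_get_be32()` and `toy_check_fanout()` are hypothetical names, and the sketch assumes the usual 256-entry table of big-endian 32-bit cumulative counts.

```c
#include <stddef.h>
#include <stdint.h>

/* Read a big-endian 32-bit value one byte at a time. */
static uint32_t toy_get_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Each fanout entry counts the objects whose first byte is <= its
 * index, so the values must never decrease.
 * Returns 0 if the table is consistent, -1 otherwise.
 */
static int toy_check_fanout(const unsigned char *fanout)
{
	for (int i = 0; i < 255; i++) {
		uint32_t cur = toy_get_be32(fanout + 4 * i);
		uint32_t next = toy_get_be32(fanout + 4 * (i + 1));

		if (cur > next)
			return -1;
	}
	return 0;
}
```

Running the check once at load time is cheap (255 comparisons) and means the later binary searches can trust that `fanout[i]` and `fanout[i + 1]` describe a valid, non-negative range.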
2023-10-09 | midx: check size of revindex chunk | Jeff King
When we load a revindex from disk, we check the size of the file compared to the number of objects we expect it to have. But when we use a RIDX chunk stored directly in the midx, we just access the memory directly. This can lead to out-of-bounds memory access for a corrupted or malicious multi-pack-index file. We can catch this by recording the RIDX chunk size, and then checking it against the expected size when we "load" the revindex. Note that this check is much simpler than the one that load_revindex_from_disk() does, because we just have the data array with no header (so we do not need to account for the header size, and nor do we need to bother validating the header values). The test confirms both that we catch this case, and that we continue the process (the revindex is required to use the midx bitmaps, but we fall back to a non-bitmap traversal). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-09 | midx: bounds-check large offset chunk | Jeff King
When we see a large offset bit in the regular midx offset table, we use the entry as an index into a separate large offset table (just like a pack idx does). But we don't bounds-check the access to that large offset table (nor even record its size when we parse the chunk!). The equivalent code for a regular pack idx is in check_pack_index_ptr(). But things are a bit simpler here because of the chunked format: we can just check our array index directly. As a bonus, we can get rid of the st_mult() here. If our array bounds-check is successful, then we know that the result will fit in a size_t (and the bounds check uses a division to avoid overflow entirely). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
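The division-based bounds check can be sketched as follows. `toy_large_offset()` is a hypothetical function, and the sketch assumes each large-offset entry is an 8-byte big-endian value; it is not the code from midx.c.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Look up entry 'idx' in a table of 8-byte big-endian offsets.
 * Returns 0 and stores the value in *out, or -1 if 'idx' is out of
 * bounds for a chunk of 'chunk_len' bytes.
 */
static int toy_large_offset(const unsigned char *chunk, size_t chunk_len,
			    uint32_t idx, uint64_t *out)
{
	const unsigned char *p;
	uint64_t v = 0;

	/*
	 * Dividing the length (rather than multiplying the index) cannot
	 * overflow, and once it passes, idx * 8 is known to fit in size_t.
	 */
	if (idx >= chunk_len / 8)
		return -1;

	p = chunk + (size_t)idx * 8;
	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	*out = v;
	return 0;
}
```

Note that a chunk whose length is not a multiple of 8 simply has its trailing partial entry treated as out of bounds, since integer division rounds down.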
2023-10-09 | midx: check size of object offset chunk | Jeff King
The object offset chunk has one fixed-size entry for each object in the midx. But since we don't check its size, we may access out-of-bounds memory if we see a corrupt or malicious midx file. Since the entries are fixed-size, the total length can be known up-front, and we can just check it while parsing the chunk (this is similar to what we do when opening pack idx files, which contain a similar offset table). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-09 | midx: enforce chunk alignment on reading | Jeff King
The midx reader assumes chunks are aligned to a 4-byte boundary: we treat the fanout chunk as an array of uint32_t, indexing it to feed the results to ntohl(). Without aligning the chunks, we may violate the CPU's alignment constraints. Though many platforms allow this, some do not. And certainly UBSan will complain, since it is undefined behavior. Even though most chunks are naturally 4-byte-aligned (because they are storing uint32_t or larger types), PNAM is not. It stores NUL-terminated pack names, so you can have a valid chunk with any length. The writing side handles this by 4-byte-aligning the chunk, introducing a few extra NULs as necessary. But since we don't check this on the reading side, we may end up with a misaligned fanout and trigger the undefined behavior. We have two options here:
1. Swap out ntohl(fanout[i]) for get_be32(fanout+i) everywhere. The latter handles alignment itself. It's possible that it's slightly slower (though in practice I'm not sure how true that is, especially for these code paths which then go on to do a binary search).
2. Enforce the alignment when reading the chunks. This is easy to do, since the table-of-contents reader can check it in one spot.
I went with the second option here, just because it places less burden on maintenance going forward (it is OK to continue using ntohl), and we know it can't have any performance impact on the actual reads. The commit-graph code uses the same chunk API. It's usually also 4-byte aligned, but some chunks are not (like Bloom filter BDAT chunks). So we'll pass "1" here to allow any alignment. It doesn't suffer from the same problem as midx with its fanout because the fanout chunk is always the first (and the rest of the format dictates that the first chunk will start aligned). The new test shows the effect on a midx with a misaligned PNAM chunk.
Note that the midx-reading code treats chunk-toc errors as soft, falling back to the non-midx path rather than calling die(), as we do for other parsing errors. Arguably we should make all of these behave the same, but that's out of scope for this patch. For now the test just expects the fallback behavior. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
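The two options weighed above can be contrasted in a short sketch. Both functions are hypothetical stand-ins: `toy_get_be32()` mimics the byte-wise read of option 1, and `toy_check_alignment()` mimics the table-of-contents check of option 2, with `alignment` playing the role of the value ("4" for midx, "1" for commit-graph) that the message says callers pass.

```c
#include <stdint.h>

/*
 * Option 1: assemble the value byte by byte, so the pointer's
 * alignment never matters.
 */
static uint32_t toy_get_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Option 2: reject any chunk whose file offset is not a multiple of
 * 'alignment' while reading the table of contents.
 * Returns 0 if the offset is acceptable, -1 otherwise.
 */
static int toy_check_alignment(uint64_t chunk_offset, unsigned alignment)
{
	if (!alignment || chunk_offset % alignment)
		return -1;
	return 0;
}
```

Option 2 is a single check per chunk at load time, after which the existing aligned reads remain valid; option 1 would instead touch every read site.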
2023-10-09 | midx: check size of pack names chunk | Jeff King
We parse the pack-name chunk as a series of NUL-terminated strings. But since we don't look at the chunk size, there's nothing to guarantee that we don't parse off the end of the chunk (or even off the end of the mapped file). We can record the length, and then as we parse make sure that we never walk past it. The new test exercises the case, though note that it does not actually segfault before this patch. It hits a NUL byte somewhere in one of the other chunks, and comes up with a garbage pack name. You could construct one that reads out-of-bounds (e.g., a PNAM chunk at the end of file), but this case is simple and sufficient to check that we detect the problem. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
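A bounded walk over NUL-terminated names might look like the sketch below. `toy_parse_pack_names()` is a hypothetical function that only validates the chunk; the real parser also records a pointer to each name.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Verify that 'chunk' (of 'chunk_len' bytes) holds at least 'num_packs'
 * NUL-terminated strings, never looking past the end of the chunk.
 * Returns 0 on success, -1 if a name runs off the end.
 */
static int toy_parse_pack_names(const unsigned char *chunk, size_t chunk_len,
				uint32_t num_packs)
{
	const unsigned char *cur = chunk;
	const unsigned char *end = chunk + chunk_len;

	for (uint32_t i = 0; i < num_packs; i++) {
		/* memchr() is limited to the bytes that remain in the chunk. */
		const unsigned char *nul = memchr(cur, '\0', (size_t)(end - cur));

		if (!nul)
			return -1;
		cur = nul + 1;
	}
	return 0;
}
```

The key difference from an unbounded `strlen()`-style walk is that a missing terminator is reported as an error instead of being "found" in whatever bytes happen to follow the chunk.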
2023-10-09 | midx: check size of oid lookup chunk | Jeff King
When reading an on-disk multi-pack-index, we take the number of objects in the midx from the final value of the fanout table. But we just blindly assume that the chunk containing the actual oid entries is the correct size. This can lead to us reading out-of-bounds memory if the lookup chunk is too small (or if the fanout is corrupted; when they don't agree we cannot tell which one is wrong). Note that we bump the assignment of m->num_objects into the fanout parser callback, so that it's set when we parse the lookup table (otherwise we'd have to manually record the lookup table size and check it later). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-09 | midx: stop ignoring malformed oid fanout chunk | Jeff King
When we load the oid-fanout chunk, our callback checks that its size is reasonable and returns an error if not. However, the caller only checks our return value against CHUNK_NOT_FOUND, so we end up ignoring the error completely! Using a too-small fanout table means we end up accessing random memory for the fanout and segfault. We can fix this by checking for any non-zero return value, rather than just CHUNK_NOT_FOUND, and adjusting our error message to cover both cases. We could handle each error code individually, but there's not much point for such a rare case. The extra message produced in the callback makes it clear what is going on. The same pattern is used in the adjacent code. Those cases are actually OK for now because they do not use a custom callback, so the only error they can get is CHUNK_NOT_FOUND. But let's convert them, as this is an accident waiting to happen (especially as we convert some of them away from pair_chunk). The error messages are more verbose, but it should be rare for a user to see these anyway. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-09 | chunk-format: note that pair_chunk() is unsafe | Jeff King
The pair_chunk() function is provided as an easy helper for parsing chunks that just want a pointer to a set of bytes. But every caller has a hidden bug: because we return only the pointer without the matching chunk size, the callers have no clue how many bytes they are allowed to look at. And as a result, they may read off the end of the mmap'd data when the on-disk file does not match their expectations. Since chunk files are typically used for local-repository data like commit-graph files and midx's, the security implications here are pretty mild. The worst that can happen is that you hand somebody a corrupted repository tarball, and running Git on it does an out-of-bounds read and crashes. So it's worth being more defensive, but we don't need to drop everything and fix every caller immediately. I noticed the problem because the pair_chunk_fn() callback does not look at its chunk_size argument, and wanted to annotate it to silence -Wunused-parameter. We could do that now, but we'd lose the hint that this code should be audited and fixed. So instead, let's set ourselves up for going down that path:
1. Provide a pair_chunk() function that does return the size, which prepares us for fixing these cases.
2. Rename the existing function to pair_chunk_unsafe(). That gives us an easy way to grep for cases which still need to be fixed, and the name should cause anybody adding new calls to think twice before using it.
There are no callers of the "safe" version yet, but we'll add some in subsequent patches. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
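To show why returning the size matters, here is a sketch of a lookup that hands back both the pointer and the length. Everything in it is hypothetical (`struct toy_chunk`, `struct toy_chunkfile`, `toy_pair_chunk()`, and the `TOY_CHUNK_NOT_FOUND` value); it does not reproduce the chunk-format API's actual types or signatures.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_CHUNK_NOT_FOUND (-2)  /* arbitrary value for this sketch */

/* One table-of-contents entry (hypothetical layout). */
struct toy_chunk {
	uint32_t id;
	const unsigned char *start;
	size_t size;
};

struct toy_chunkfile {
	const struct toy_chunk *chunks;
	size_t nr;
};

/*
 * "Safe" lookup: hand back the chunk's size together with its pointer,
 * so the caller can validate the length before reading anything.
 */
static int toy_pair_chunk(const struct toy_chunkfile *cf, uint32_t id,
			  const unsigned char **p, size_t *size)
{
	for (size_t i = 0; i < cf->nr; i++) {
		if (cf->chunks[i].id == id) {
			*p = cf->chunks[i].start;
			*size = cf->chunks[i].size;
			return 0;
		}
	}
	return TOY_CHUNK_NOT_FOUND;
}
```

A caller that expects, say, one 4-byte entry per object can then compare `size` against its expected total before indexing into the data.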
2023-07-25  Merge branch 'tb/object-access-overflow-protection'  Junio C Hamano
Various offset computations in the code that accesses the packfiles and other data in the object layer have been hardened against arithmetic overflow, especially on 32-bit systems.
* tb/object-access-overflow-protection:
  commit-graph.c: prevent overflow in `verify_commit_graph()`
  commit-graph.c: prevent overflow in `write_commit_graph()`
  commit-graph.c: prevent overflow in `merge_commit_graph()`
  commit-graph.c: prevent overflow in `split_graph_merge_strategy()`
  commit-graph.c: prevent overflow in `load_tree_for_commit()`
  commit-graph.c: prevent overflow in `fill_commit_in_graph()`
  commit-graph.c: prevent overflow in `fill_commit_graph_info()`
  commit-graph.c: prevent overflow in `load_oid_from_graph()`
  commit-graph.c: prevent overflow in add_graph_to_chain()
  commit-graph.c: prevent overflow in `write_commit_graph_file()`
  pack-bitmap.c: ensure that eindex lookups don't overflow
  midx.c: prevent overflow in `fill_included_packs_batch()`
  midx.c: prevent overflow in `write_midx_internal()`
  midx.c: store `nr`, `alloc` variables as `size_t`'s
  midx.c: prevent overflow in `nth_midxed_offset()`
  midx.c: prevent overflow in `nth_midxed_object_oid()`
  midx.c: use `size_t`'s for fanout nr and alloc
  packfile.c: use checked arithmetic in `nth_packed_object_offset()`
  packfile.c: prevent overflow in `load_idx()`
  packfile.c: prevent overflow in `nth_packed_object_id()`
2023-07-14  midx.c: prevent overflow in `fill_included_packs_batch()`  Taylor Blau
In a similar spirit as in previous commits, avoid an integer overflow when computing the expected size of a MIDX. (Note that this is also OK as-is, since `p->pack_size` is an `off_t`, so this computation should already be done as 64-bit integers. But again, let's use `st_mult()` to make this fact clear). Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-14  midx.c: prevent overflow in `write_midx_internal()`  Taylor Blau
When writing a MIDX, we use the chunk-format API to write out each individual chunk of the MIDX. Each chunk of the MIDX is tracked via a call to `add_chunk()`, along with the expected size of that chunk. Guard against overflow when dealing with a MIDX with a large number of entries (and consequently, large chunks within the MIDX file itself) to avoid corrupting the contents of the MIDX itself. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-14  midx.c: store `nr`, `alloc` variables as `size_t`'s  Taylor Blau
In the `write_midx_context` structure, we use two `uint32_t`'s to track the length and allocated size of the packs, and one `uint32_t` to track the number of objects in the MIDX. In practice, having these be 32-bit unsigned values shouldn't cause any problems since we are unlikely to have that many objects or packs in any real-world repository. But these values should be `size_t`'s, so change their type to reflect that. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-14  midx.c: prevent overflow in `nth_midxed_offset()`  Taylor Blau
In a similar spirit as previous patches, avoid an overflow when looking up object offsets in the MIDX's large offset table by guarding the computation via `st_mult()`. This instance is also OK as-is, since the left operand is the result of `sizeof(...)`, which is already a `size_t`. But use `st_mult()` instead here to make it explicit that this computation is to be performed using 64-bit unsigned integers. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-14  midx.c: prevent overflow in `nth_midxed_object_oid()`  Taylor Blau
In a similar spirit as previous commits, avoid overflow when looking up an object's OID in a MIDX when its position is greater than `(2^32-1)/m->hash_len`. As usual, it is perfectly OK for a MIDX to have as many as 2^32-1 objects (since we use 32-bit fields to count the number of objects at each fanout layer). But if we have more than `(2^32-1)/m->hash_len` objects, we will incorrectly perform the computation using 32-bit integers, overflowing the result. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
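The arithmetic behind this overflow is easy to reproduce. With a 20-byte hash, position 300,000,000 corresponds to a byte offset of 6,000,000,000, which does not fit in 32 bits. The sketch below is an illustration only: `oid_offset_32bit()`, `checked_mult()`, and `oid_offset_checked()` are hypothetical names. Git's `st_mult()` dies on overflow, while this stand-in reports it through its return value, and the example assumes a platform where `unsigned int` is 32 bits wide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Buggy flavor: both operands are 32-bit, so the product wraps modulo 2^32. */
static uint32_t oid_offset_32bit(uint32_t pos, uint32_t hash_len)
{
	return pos * hash_len;
}

/*
 * Hypothetical checked multiply in the spirit of st_mult(). Returns -1
 * instead of computing a product that would not fit in a size_t.
 */
static int checked_mult(size_t a, size_t b, size_t *out)
{
	assert(out != NULL);
	if (a && b > SIZE_MAX / a)
		return -1;
	*out = a * b;
	return 0;
}

/* Fixed flavor: widen both operands to size_t and check before multiplying. */
static int oid_offset_checked(uint32_t pos, uint32_t hash_len, size_t *out)
{
	return checked_mult(pos, hash_len, out);
}
```

On a platform with a 64-bit `size_t` the checked flavor yields the full offset; on a 32-bit platform it reports the overflow rather than returning a wrapped value.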
2023-07-14  midx.c: use `size_t`'s for fanout nr and alloc  Taylor Blau
The `midx_fanout` struct is used to keep track of a set of OIDs corresponding to each layer of the MIDX's fanout table. It stores an array of entries, along with the number of entries in the table, and the allocated size of the array. Both `nr` and `alloc` are stored as 32-bit unsigned integers. In practice, this should never cause any problems, since most packs have far fewer than 2^32-1 objects. But storing these as `size_t`'s is more appropriate, and prevents us from accidentally overflowing some result when multiplying or adding to either of these values. Update these struct members to be `size_t`'s as appropriate. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-05  git-compat-util: move alloc macros to git-compat-util.h  Calvin Wan
alloc_nr, ALLOC_GROW, and ALLOC_GROW_BY are commonly used macros for dynamic array allocation. Moving these macros to git-compat-util.h with the other alloc macros focuses alloc.[ch] on allocation for Git objects and additionally allows us to remove inclusions of alloc.h from files that used only the above macros. Signed-off-by: Calvin Wan <calvinwan@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-06-21  object-store-ll.h: split this header out of object-store.h  Elijah Newren
The vast majority of files including object-store.h did not need dir.h nor khash.h. Split the header into two files, and let most just depend upon object-store-ll.h, while letting the two callers that need it depend on the full object-store.h. After this patch:
    $ git grep -h include..object-store | sort | uniq -c
          2 #include "object-store.h"
        129 #include "object-store-ll.h"
Diff best viewed with `--color-moved`. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-05-09  Merge branch 'en/header-split-cache-h-part-2'  Junio C Hamano
More header clean-up.
* en/header-split-cache-h-part-2: (22 commits)
  reftable: ensure git-compat-util.h is the first (indirect) include
  diff.h: reduce unnecessary includes
  object-store.h: reduce unnecessary includes
  commit.h: reduce unnecessary includes
  fsmonitor: reduce includes of cache.h
  cache.h: remove unnecessary headers
  treewide: remove cache.h inclusion due to previous changes
  cache,tree: move basic name compare functions from read-cache to tree
  cache,tree: move cmp_cache_name_compare from tree.[ch] to read-cache.c
  hash-ll.h: split out of hash.h to remove dependency on repository.h
  tree-diff.c: move S_DIFFTREE_IFXMIN_NEQ define from cache.h
  dir.h: move DTYPE defines from cache.h
  versioncmp.h: move declarations for versioncmp.c functions from cache.h
  ws.h: move declarations for ws.c functions from cache.h
  match-trees.h: move declarations for match-trees.c functions from cache.h
  pkt-line.h: move declarations for pkt-line.c functions from cache.h
  base85.h: move declarations for base85.c functions from cache.h
  copy.h: move declarations for copy.c functions from cache.h
  server-info.h: move declarations for server-info.c functions from cache.h
  packfile.h: move pack_window and pack_entry from cache.h
  ...
2023-04-25  Merge branch 'ps/fix-geom-repack-with-alternates'  Junio C Hamano
Geometric repacking ("git repack --geometric=<n>") in a repository that borrows from an alternate object database had various corner case bugs, which have been corrected.
* ps/fix-geom-repack-with-alternates:
  repack: disable writing bitmaps when doing a local repack
  repack: honor `-l` when calculating pack geometry
  t/helper: allow chmtime to print verbosely without modifying mtime
  pack-objects: extend test coverage of `--stdin-packs` with alternates
  pack-objects: fix error when same packfile is included and excluded
  pack-objects: fix error when packing same pack twice
  pack-objects: split out `--stdin-packs` tests into separate file
  repack: fix generating multi-pack-index with only non-local packs
  repack: fix trying to use preferred pack in alternates
  midx: fix segfault with no packs and invalid preferred pack
2023-04-24  treewide: remove cache.h inclusion due to previous changes  Elijah Newren
Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-04-14  midx: fix segfault with no packs and invalid preferred pack  Patrick Steinhardt
When asked to write a multi-pack-index, the user can specify a preferred pack that is used as a tie breaker when multiple packs contain the same objects. When this packfile cannot be found, we just pick the first pack that is getting tracked by the newly written multi-pack-index as a fallback. Picking the fallback can fail in the case where we're asked to write a multi-pack-index with no packfiles at all: picking the fallback value will cause a segfault as we blindly index into the array of packfiles, which would be empty. Fix this bug by resetting the preferred packfile index to `-1` before searching for the preferred pack. This fixes the segfault as we already check for whether the index is `> -1`. If it is not, we simply don't pick a preferred packfile at all. Helped-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
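The idea of starting from "nothing chosen" and only falling back when a pack actually exists can be sketched like this. It is a simplified illustration, not the code from midx.c: `toy_pack` and `pick_preferred_pack()` are hypothetical, and the real function tracks considerably more state.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical pack record; Git's real structures carry much more state. */
struct toy_pack {
	const char *name;
};

/*
 * Returns the index of the preferred pack, or -1 if there is none to pick.
 * Falls back to the first pack only when at least one pack exists.
 */
static int pick_preferred_pack(const struct toy_pack *packs, size_t nr,
			       const char *wanted)
{
	int preferred = -1; /* start from "nothing chosen" */
	size_t i;

	assert(wanted != NULL);
	for (i = 0; i < nr; i++) {
		if (!strcmp(packs[i].name, wanted)) {
			preferred = (int)i;
			break;
		}
	}
	if (preferred < 0 && nr > 0)
		preferred = 0; /* fallback: first tracked pack */
	return preferred;
}
```

Callers then treat a negative result as "no preferred pack" instead of indexing into an empty array.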
2023-04-11  object-file.h: move declarations for object-file.c functions from cache.h  Elijah Newren
Signed-off-by: Elijah Newren <newren@gmail.com> Acked-by: Calvin Wan <calvinwan@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-21  csum-file.h: remove unnecessary inclusion of cache.h  Elijah Newren
With the change in the last commit to move several functions to write-or-die.h, csum-file.h no longer needs to include cache.h. However, removing that include forces several other C files, which directly or indirectly depended upon csum-file.h's inclusion of cache.h, to now be more explicit about their dependencies. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-21  abspath.h: move absolute path functions from cache.h  Elijah Newren
This is another step towards letting us remove the include of cache.h in strbuf.c. It does mean that we also need to add includes of abspath.h in a number of C files. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-21  treewide: be explicit about dependence on gettext.h  Elijah Newren
Dozens of files made use of gettext functions, without explicitly including gettext.h. This made it more difficult to find which files could remove a dependence on cache.h. Make C files explicitly include gettext.h if they are using it. However, while compat/fsmonitor/fsm-ipc-darwin.c should also gain an include of gettext.h, it was left out to avoid conflicting with an in-flight topic. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-03-17  Merge branch 'jk/unused-post-2.39-part2'  Junio C Hamano
More work towards -Wunused.
* jk/unused-post-2.39-part2: (21 commits)
  help: mark unused parameter in git_unknown_cmd_config()
  run_processes_parallel: mark unused callback parameters
  userformat_want_item(): mark unused parameter
  for_each_commit_graft(): mark unused callback parameter
  rewrite_parents(): mark unused callback parameter
  fetch-pack: mark unused parameter in callback function
  notes: mark unused callback parameters
  prio-queue: mark unused parameters in comparison functions
  for_each_object: mark unused callback parameters
  list-objects: mark unused callback parameters
  mark unused parameters in signal handlers
  run-command: mark error routine parameters as unused
  mark "pointless" data pointers in callbacks
  ref-filter: mark unused callback parameters
  http-backend: mark unused parameters in virtual functions
  http-backend: mark argc/argv unused
  object-name: mark unused parameters in disambiguate callbacks
  serve: mark unused parameters in virtual functions
  serve: use repository pointer to get config
  ls-refs: drop config caching
  ...
2023-02-24  for_each_object: mark unused callback parameters  Jeff King
The for_each_{loose,packed}_object interface uses callback functions, but not every callback needs all of the parameters. Mark the unused ones to satisfy -Wunused-parameter. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-02-23  cache.h: remove dependence on hex.h; make other files include it explicitly  Elijah Newren
Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-02-23  alloc.h: move ALLOC_GROW() functions from cache.h  Elijah Newren
This allows us to replace includes of cache.h with includes of the much smaller alloc.h in many places. It does mean that we also need to add includes of alloc.h in a number of C files. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>