From 63aca3f7f14e01cabccf2165b941205a8a9a01dc Mon Sep 17 00:00:00 2001 From: Jeff King Date: Fri, 25 Oct 2024 02:58:06 -0400 Subject: dumb-http: store downloaded pack idx as tempfile This patch fixes a regression in b1b8dfde69 (finalize_object_file(): implement collision check, 2024-09-26) where fetching a v1 pack idx file over the dumb-http protocol would cause the fetch to fail. The core of the issue is that dumb-http stores the idx we fetch from the remote at the same path that will eventually hold the idx we generate from "index-pack --stdin". The sequence is something like this: 0. We realize we need some object X, which we don't have locally, and nor does the other side have it as a loose object. 1. We download the list of remote packs from objects/info/packs. 2. For each entry in that file, we download each pack index and store it locally in .git/objects/pack/pack-$hash.idx (the $hash is not something we can verify yet and is given to us by the remote). 3. We check each pack index we got to see if it has object X. When we find a match, we download the matching .pack file from the remote to a tempfile. We feed that to "index-pack --stdin", which reindexes the pack, rather than trusting that it has what the other side claims it does. In most cases, this will end up generating the exact same (byte-for-byte) pack index which we'll store at the same pack-$hash.idx path, because the index generation and $hash id are computed based on what's in the packfile. But: a. The other side might have used other options to generate the index. For instance we use index v2 by default, but long ago it was v1 (and you can still ask for v1 explicitly). b. The other side might even use a different mechanism to determine $hash. E.g., long ago it was based on the sorted list of objects in the packfile, but we switched to using the pack checksum in 1190a1acf8 (pack-objects: name pack files after trailer hash, 2013-12-05). The regression we saw in the real world was (3a). A recent client fetching from a server with a v1 index downloaded that index, then complained about trying to overwrite it with its own v2 index. This collision is otherwise harmless; we know we want to replace the remote version with our local one, but the collision check doesn't realize that. There are a few options to fix it: - we could teach index-pack a command-line option to ignore only pack idx collisions, and use it when the dumb-http code invokes index-pack. This would be an awkward thing to expose users to and would involve a lot of boilerplate to get the option down to the collision code. - we could delete the remote .idx file right before running index-pack. It should be redundant at that point (since we've just downloaded the matching pack). But it feels risky to delete something from our own .git/objects based on what the other side has said. I'm not entirely positive that a malicious server couldn't lie about which pack-$hash.idx it has and get us to delete something precious. - we can stop co-mingling the downloaded idx files in our local objects directory. This is a slightly bigger change but I think fixes the root of the problem more directly. This patch implements the third option. The big design questions are: where do we store the downloaded files, and how do we manage their lifetimes? There are some additional quirks to the dumb-http system we should consider. Remember that in step 2 we downloaded every pack index, but in step 3 we may only download some of the matching packs. What happens to those other idx files now? They sit in the .git/objects/pack directory, possibly waiting to be used at a later date. That may save bandwidth for a subsequent fetch, but it also creates a lot of weird corner cases: - our local object directory now has semi-untrusted .idx files sitting around, without their matching .pack - in case 3b, we noted that we might not generate the same hash as the other side. In that case even if we download the matching pack, our index-pack invocation will store it in a different pack-$hash.idx file. And the unmatched .idx will sit there forever. - if the server repacks, it may delete the old packs. Now we have these orphaned .idx files sitting around locally that will never be used (nor deleted). - if we repack locally we may delete our local version of the server's pack index and not realize we have it. So we'll download it again, even though we have all of the objects it mentions. I think the right solution here is probably some more complex cache management system: download the remote .idx files to their own storage directory, mark them as "seen" when we get their matching pack (to avoid re-downloading even if we repack), and then delete them when the server's objects/info/refs no longer mentions them. But since the dumb http protocol is so ancient and so inferior to the smart http protocol, I don't think it's worth spending a lot of time creating such a system. For this patch I'm just downloading the idx files to .git/objects/tmp_pack_*, and marking them as tempfiles to be deleted when we exit (and due to the name, any we miss due to a crash, etc, should eventually be removed by "git gc" runs based on timestamps). That is slightly worse for one case: if we download an idx but not the matching pack, we won't retain that idx for subsequent runs. But the flip side is that we're making other cases better (we never hold on to useless idx files forever). I suspect that worse case does not even come up often, since it implies that the packs are generated to match distinct parts of history (i.e., in practice even in a repo with many packs you're going to end up grabbing all of those packs to do a clone). If somebody really cares about that, I think the right path forward is a managed cache directory as above, and this patch is providing the first step in that direction anyway (by moving things out of the objects/pack/ directory). There are two test changes. One demonstrates the broken v1 index case (it double-checks the resulting clone with fsck to be careful, but prior to this patch it actually fails at the clone step). The other tweaks the expectation for a test that covers the "slightly worse" case to accommodate the extra index download. The code changes are fairly simple. We stop using finalize_object_file() to copy the remote's index file into place, and leave it as a tempfile. We give the tempfile a real ".idx" name, since the packfile code expects that, and thus we make sure it is out of the usual packs/ directory (so we'd never mistake it for a real local .idx). We also have to change parse_pack_index(), which creates a temporary packed_git to access our index (we need this because all of the pack idx code assumes we have that struct). It reads the index data from the tempfile, but prior to this patch would speculatively write the finalized name into the packed_git struct using the pack-$hash we expect to use. I was mildly surprised that this worked at all, since we call verify_pack_index() on the packed_git which mentions the final name before moving the file into place! But it works because parse_pack_index() leaves the mmap-ed data in the struct, so the lazy-open in verify_pack_index() never triggers, and we read from the tempfile, ignoring the filename in the struct completely. Hacky, but it works. After this patch, parse_pack_index() now uses the index filename we pass in to derive a matching .pack name. This is OK to change because there are only two callers, both in the dumb http code (and the other passes in an existing pack-$hash.idx name, so the derived name is going to be pack-$hash.pack, which is what we were using anyway). I'll follow up with some more cleanups in that area, but this patch is sufficient to fix the regression. Reported-by: fox Signed-off-by: Jeff King Signed-off-by: Taylor Blau --- http.c | 25 ++++++++++++++++++++----- 1 file changed, 20 insertions(+), 5 deletions(-) (limited to 'http.c') diff --git a/http.c b/http.c index d59e59f66b..9642cad2e3 100644 --- a/http.c +++ b/http.c @@ -19,6 +19,7 @@ #include "string-list.h" #include "object-file.h" #include "object-store-ll.h" +#include "tempfile.h" static struct trace_key trace_curl = TRACE_KEY_INIT(CURL); static int trace_curl_data = 1; @@ -2388,8 +2389,24 @@ static char *fetch_pack_index(unsigned char *hash, const char *base_url) strbuf_addf(&buf, "objects/pack/pack-%s.idx", hash_to_hex(hash)); url = strbuf_detach(&buf, NULL); - strbuf_addf(&buf, "%s.temp", sha1_pack_index_name(hash)); - tmp = strbuf_detach(&buf, NULL); + /* + * Don't put this into packs/, since it's just temporary and we don't + * want to confuse it with our local .idx files. We'll generate our + * own index if we choose to download the matching packfile. + * + * It's tempting to use xmks_tempfile() here, but it's important that + * the file not exist, otherwise http_get_file() complains. So we + * create a filename that should be unique, and then just register it + * as a tempfile so that it will get cleaned up on exit. + * + * In theory we could hold on to the tempfile and delete these as soon + * as we download the matching pack, but it would take a bit of + * refactoring. Leaving them until the process ends is probably OK. + */ + tmp = xstrfmt("%s/tmp_pack_%s.idx", + repo_get_object_directory(the_repository), + hash_to_hex(hash)); + register_tempfile(tmp); if (http_get_file(url, tmp, NULL) != HTTP_OK) { error("Unable to get pack index %s", url); @@ -2427,10 +2444,8 @@ static int fetch_and_setup_pack_index(struct packed_git **packs_head, } ret = verify_pack_index(new_pack); - if (!ret) { + if (!ret) close_pack_index(new_pack); - ret = finalize_object_file(tmp_idx, sha1_pack_index_name(sha1)); - } free(tmp_idx); if (ret) return -1; -- cgit v1.3-5-g9baa From 03b8eed7f50a56cd7581e18775c6f3a7caf639b4 Mon Sep 17 00:00:00 2001 From: Jeff King Date: Fri, 25 Oct 2024 03:00:09 -0400 Subject: packfile: drop has_pack_index() The has_pack_index() function has several oddities that may make it surprising if you are trying to find out if we have a pack with some $hash: - it is not looking for a valid pack that we found while searching object directories. It just looks for any pack-$hash.idx file in the pack directory. - it only looks in the local directory, not any alternates - it takes a bare "unsigned char" hash, which we try to avoid these days The only caller it has is in the dumb http code; it wants to know if we already have the pack idx in question. This can happen if we downloaded the pack (and generated its index) during a previous fetch. Before the previous patch ("dumb-http: store downloaded pack idx as tempfile"), it could also happen if we downloaded the .idx from the remote but didn't get the matching .pack. But since that patch, we don't hold on to those .idx files. So there's no need to look for the .idx file in the filesystem; we can just scan through the packed_git list to see if we have it. That lets us simplify the dumb http code a bit, as we know that if we have the .idx we have the matching .pack already. And it lets us get rid of this odd function that is unlikely to be needed again. Signed-off-by: Jeff King Signed-off-by: Taylor Blau --- http.c | 15 ++++++++------- packfile.c | 8 -------- packfile.h | 2 -- 3 files changed, 8 insertions(+), 17 deletions(-) (limited to 'http.c') diff --git a/http.c b/http.c index 9642cad2e3..03802b9049 100644 --- a/http.c +++ b/http.c @@ -2420,15 +2420,17 @@ static char *fetch_pack_index(unsigned char *hash, const char *base_url) static int fetch_and_setup_pack_index(struct packed_git **packs_head, unsigned char *sha1, const char *base_url) { - struct packed_git *new_pack; + struct packed_git *new_pack, *p; char *tmp_idx = NULL; int ret; - if (has_pack_index(sha1)) { - new_pack = parse_pack_index(sha1, sha1_pack_index_name(sha1)); - if (!new_pack) - return -1; /* parse_pack_index() already issued error message */ - goto add_pack; + /* + * If we already have the pack locally, no need to fetch its index or + * even add it to list; we already have all of its objects. + */ + for (p = get_all_packs(the_repository); p; p = p->next) { + if (hasheq(p->hash, sha1, the_repository->hash_algo)) + return 0; } tmp_idx = fetch_pack_index(sha1, base_url); @@ -2450,7 +2452,6 @@ static int fetch_and_setup_pack_index(struct packed_git **packs_head, if (ret) return -1; -add_pack: new_pack->next = *packs_head; *packs_head = new_pack; return 0; diff --git a/packfile.c b/packfile.c index 16d3bcf7f7..0ead2290d4 100644 --- a/packfile.c +++ b/packfile.c @@ -2160,14 +2160,6 @@ int has_object_kept_pack(const struct object_id *oid, unsigned flags) return find_kept_pack_entry(the_repository, oid, flags, &e); } -int has_pack_index(const unsigned char *sha1) -{ - struct stat st; - if (stat(sha1_pack_index_name(sha1), &st)) - return 0; - return 1; -} - int for_each_object_in_pack(struct packed_git *p, each_packed_object_fn cb, void *data, enum for_each_object_flags flags) diff --git a/packfile.h b/packfile.h index 0f78658229..b4df3546a3 100644 --- a/packfile.h +++ b/packfile.h @@ -193,8 +193,6 @@ int find_kept_pack_entry(struct repository *r, const struct object_id *oid, unsi int has_object_pack(const struct object_id *oid); int has_object_kept_pack(const struct object_id *oid, unsigned flags); -int has_pack_index(const unsigned char *sha1); - /* * Return 1 if an object in a promisor packfile is or refers to the given * object, 0 otherwise. -- cgit v1.3-5-g9baa From c2dc4c9fbb1f119be6ab55ff8676bf18b4b9446a Mon Sep 17 00:00:00 2001 From: Jeff King Date: Fri, 25 Oct 2024 03:00:55 -0400 Subject: packfile: drop sha1_pack_name() The sha1_pack_name() function has a few ugly bits: - it writes into a static strbuf (and not even a ring buffer of them), which can lead to subtle invalidation problems - it uses the term "sha1", but it's really using the_hash_algo, which could be sha256 There's only one caller of it left. And in fact that caller is better off using the underlying odb_pack_name() function itself, since it's just copying the result into its own strbuf anyway. Converting that caller lets us get rid of this now-obselete function. Signed-off-by: Jeff King Signed-off-by: Taylor Blau --- http.c | 3 ++- packfile.c | 6 ------ packfile.h | 7 ------- 3 files changed, 2 insertions(+), 14 deletions(-) (limited to 'http.c') diff --git a/http.c b/http.c index 03802b9049..510332ab04 100644 --- a/http.c +++ b/http.c @@ -2579,7 +2579,8 @@ struct http_pack_request *new_direct_http_pack_request( preq->url = url; - strbuf_addf(&preq->tmpfile, "%s.temp", sha1_pack_name(packed_git_hash)); + odb_pack_name(&preq->tmpfile, packed_git_hash, "pack"); + strbuf_addstr(&preq->tmpfile, ".temp"); preq->packfile = fopen(preq->tmpfile.buf, "a"); if (!preq->packfile) { error("Unable to open local file %s for pack", diff --git a/packfile.c b/packfile.c index 0ead2290d4..48d650161f 100644 --- a/packfile.c +++ b/packfile.c @@ -35,12 +35,6 @@ char *odb_pack_name(struct strbuf *buf, return buf->buf; } -char *sha1_pack_name(const unsigned char *sha1) -{ - static struct strbuf buf = STRBUF_INIT; - return odb_pack_name(&buf, sha1, "pack"); -} - char *sha1_pack_index_name(const unsigned char *sha1) { static struct strbuf buf = STRBUF_INIT; diff --git a/packfile.h b/packfile.h index b4df3546a3..2bbcc58571 100644 --- a/packfile.h +++ b/packfile.h @@ -31,13 +31,6 @@ struct pack_entry { */ char *odb_pack_name(struct strbuf *buf, const unsigned char *sha1, const char *ext); -/* - * Return the name of the (local) packfile with the specified sha1 in - * its name. The return value is a pointer to memory that is - * overwritten each time this function is called. - */ -char *sha1_pack_name(const unsigned char *sha1); - /* * Return the name of the (local) pack index file with the specified * sha1 in its name. The return value is a pointer to memory that is -- cgit v1.3-5-g9baa