<feed xmlns='http://www.w3.org/2005/Atom'>
<title>git/Documentation/technical/commit-graph-format.txt, branch gitk-resize-error</title>
<subtitle>Fork of git SCM with my patches.</subtitle>
<id>http://git.kilabit.info/git/atom?h=gitk-resize-error</id>
<link rel='self' href='http://git.kilabit.info/git/atom?h=gitk-resize-error'/>
<link rel='alternate' type='text/html' href='http://git.kilabit.info/git/'/>
<updated>2021-02-18T21:38:16Z</updated>
<entry>
<title>chunk-format: add technical docs</title>
<updated>2021-02-18T21:38:16Z</updated>
<author>
<name>Derrick Stolee</name>
<email>dstolee@microsoft.com</email>
</author>
<published>2021-02-18T14:07:39Z</published>
<link rel='alternate' type='text/html' href='http://git.kilabit.info/git/commit/?id=a43a2e6c2af63e5b93444c8ed8ac252927c2f0f3'/>
<id>urn:sha1:a43a2e6c2af63e5b93444c8ed8ac252927c2f0f3</id>
<content type='text'>
The chunk-based file format is now an API in the code, but we should
also take time to document it as a file format. Specifically, it matches
the CHUNK LOOKUP sections of the commit-graph and multi-pack-index
files, but there are some commonalities that should be grouped in this
document.

Signed-off-by: Derrick Stolee &lt;dstolee@microsoft.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>doc: add corrected commit date info</title>
<updated>2021-01-19T00:21:18Z</updated>
<author>
<name>Abhishek Kumar</name>
<email>abhishekkumar8222@gmail.com</email>
</author>
<published>2021-01-16T18:11:18Z</published>
<link rel='alternate' type='text/html' href='http://git.kilabit.info/git/commit/?id=5a3b130cad0d5c770f766e3af6d32b41766374c0'/>
<id>urn:sha1:5a3b130cad0d5c770f766e3af6d32b41766374c0</id>
<content type='text'>
With generation data chunk and corrected commit dates implemented, let's
update the technical documentation for commit-graph.

Signed-off-by: Abhishek Kumar &lt;abhishekkumar8222@gmail.com&gt;
Reviewed-by: Taylor Blau &lt;me@ttaylorr.com&gt;
Reviewed-by: Derrick Stolee &lt;dstolee@microsoft.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>Merge branch 'tb/bloom-improvements'</title>
<updated>2020-09-29T21:01:20Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2020-09-29T21:01:20Z</published>
<link rel='alternate' type='text/html' href='http://git.kilabit.info/git/commit/?id=288ed98bf768f4df9b569d51a52c233a1402c0f5'/>
<id>urn:sha1:288ed98bf768f4df9b569d51a52c233a1402c0f5</id>
<content type='text'>
"git commit-graph write" learned to limit the number of bloom
filters that are computed from scratch with the --max-new-filters
option.

* tb/bloom-improvements:
  commit-graph: introduce 'commitGraph.maxNewFilters'
  builtin/commit-graph.c: introduce '--max-new-filters=&lt;n&gt;'
  commit-graph: rename 'split_commit_graph_opts'
  bloom: encode out-of-bounds filters as non-empty
  bloom/diff: properly short-circuit on max_changes
  bloom: use provided 'struct bloom_filter_settings'
  bloom: split 'get_bloom_filter()' in two
  commit-graph.c: store maximum changed paths
  commit-graph: respect 'commitGraph.readChangedPaths'
  t/helper/test-read-graph.c: prepare repo settings
  commit-graph: pass a 'struct repository *' in more places
  t4216: use an '&amp;&amp;'-chain
  commit-graph: introduce 'get_bloom_filter_settings()'
</content>
</entry>
<entry>
<title>Merge branch 'cd/commit-graph-doc'</title>
<updated>2020-09-22T19:36:30Z</updated>
<author>
<name>Junio C Hamano</name>
<email>gitster@pobox.com</email>
</author>
<published>2020-09-22T19:36:30Z</published>
<link rel='alternate' type='text/html' href='http://git.kilabit.info/git/commit/?id=4d515253afcef985e94400adbfed7044959f9121'/>
<id>urn:sha1:4d515253afcef985e94400adbfed7044959f9121</id>
<content type='text'>
Doc update.

* cd/commit-graph-doc:
  commit-graph-format.txt: fix no-parent value
</content>
</entry>
<entry>
<title>bloom: encode out-of-bounds filters as non-empty</title>
<updated>2020-09-18T04:55:50Z</updated>
<author>
<name>Taylor Blau</name>
<email>me@ttaylorr.com</email>
</author>
<published>2020-09-18T02:59:44Z</published>
<link rel='alternate' type='text/html' href='http://git.kilabit.info/git/commit/?id=59f0d5073fd9f0497b9ff9a48fb3a7a8d82d1f9d'/>
<id>urn:sha1:59f0d5073fd9f0497b9ff9a48fb3a7a8d82d1f9d</id>
<content type='text'>
When a changed-path Bloom filter has either zero, or more than a
certain number (commonly 512) of entries, the commit-graph machinery
encodes it as "missing". More specifically, it sets the indices adjacent
in the BIDX chunk as equal to each other to indicate a "length 0"
filter; that is, that the filter occupies zero bytes on disk.

This has heretofore been fine, since the commit-graph machinery has no
need to care about these filters with too few or too many changed paths.
Both cases act like no filter has been generated at all, and so there is
no need to store them.

In a subsequent commit, however, the commit-graph machinery will learn
to only compute Bloom filters for some commits in the current
commit-graph layer. This is a change from the current implementation
which computes Bloom filters for all commits that are in the layer being
written. Critically for this patch, only computing some of the Bloom
filters means adding a third state for length 0 Bloom filters: zero
entries, too many entries, or "hasn't been computed".

It will be important for that future patch to distinguish between "not
representable" (i.e., zero or too-many changed paths), and "hasn't been
computed". In particular, we don't want to waste time recomputing
filters that have already been computed.

To that end, change how we store Bloom filters in the "computed but not
representable" category:

  - Bloom filters with no entries are stored as a single byte with all
    bits low (i.e., all queries to that Bloom filter will return
    "definitely not")

  - Bloom filters with too many entries are stored as a single byte with
    all bits set high (i.e., all queries to that Bloom filter will
    return "maybe").

These rules are sufficient to not incur a behavior change by changing
the on-disk representation of these two classes. Likewise, no
specification changes are necessary for the commit-graph format, either:

  - Filters that were previously empty will be recomputed and stored
    according to the new rules, and

  - old clients reading filters generated by new clients will interpret
    the filters correctly and be none the wiser to how they were
    generated.

Clients will invoke the Bloom machinery in more cases than before, but
this can be addressed by returning a NULL filter when all bits are set
high. This can be addressed in a future patch.

Note that this does increase the size of on-disk commit-graphs, but far
less than other proposals. In particular, this is generally more
efficient than storing a bitmap for which commits haven't computed their
Bloom filters. Storing a bitmap incurs a penalty of one bit per commit,
whereas storing explicit filters as above incurs a penalty of one byte
per too-large or empty commit.

In practice, these boundary commits likely occupy a small proportion of
the overall number of commits, and so the size penalty is likely smaller
than storing a bitmap for all commits.

See, for example, these relative proportions of such boundary commits
(collected by SZEDER Gábor):

                  |     Percentage of     |    commit-graph   |           |
                  |   commits modifying   |     file size     |           |
                  ├────────┬──────────────┼───────────────────┤    pct.   |
                  | 0 path | &gt;= 512 paths | before  |  after  |   change  |
 ┌────────────────┼────────┼──────────────┼─────────┼─────────┼───────────┤
 | android-base   | 13.20% |        0.13% | 37.468M | 37.534M | +0.1741 % |
 | cmssw          |  0.15% |        0.23% | 17.118M | 17.119M | +0.0091 % |
 | cpython        |  3.07% |        0.01% |  7.967M |  7.971M | +0.0423 % |
 | elasticsearch  |  0.70% |        1.00% |  8.833M |  8.835M | +0.0128 % |
 | gcc            |  0.00% |        0.08% | 16.073M | 16.074M | +0.0030 % |
 | gecko-dev      |  0.14% |        0.64% | 59.868M | 59.874M | +0.0105 % |
 | git            |  0.11% |        0.02% |  3.895M |  3.895M | +0.0020 % |
 | glibc          |  0.02% |        0.10% |  3.555M |  3.555M | +0.0021 % |
 | go             |  0.00% |        0.07% |  3.186M |  3.186M | +0.0018 % |
 | homebrew-cask  |  0.40% |        0.02% |  7.035M |  7.035M | +0.0065 % |
 | homebrew-core  |  0.01% |        0.01% | 11.611M | 11.611M | +0.0002 % |
 | jdk            |  0.26% |        5.64% |  5.537M |  5.540M | +0.0590 % |
 | linux          |  0.01% |        0.51% | 63.735M | 63.740M | +0.0073 % |
 | llvm-project   |  0.12% |        0.03% | 25.515M | 25.516M | +0.0050 % |
 | rails          |  0.10% |        0.10% |  6.252M |  6.252M | +0.0027 % |
 | rust           |  0.07% |        0.17% |  9.364M |  9.364M | +0.0033 % |
 | tensorflow     |  0.09% |        1.02% |  7.009M |  7.010M | +0.0158 % |
 | webkit         |  0.05% |        0.31% | 17.405M | 17.406M | +0.0047 % |

(where the above increase is determined by computing a non-split
commit-graph before and after this patch).

Given that these projects are all "large" by commit count, the storage
cost by writing these filters explicitly is negligible. In the most
extreme example, android-base (which has 494,848 commits at the time of
writing) would have its commit-graph increase by a modest 68.4 KB.

Finally, a test to exercise filters which contain too many changed path
entries will be introduced in a subsequent patch.

Suggested-by: SZEDER Gábor &lt;szeder.dev@gmail.com&gt;
Suggested-by: Jakub Narębski &lt;jnareb@gmail.com&gt;
Helped-by: Derrick Stolee &lt;dstolee@microsoft.com&gt;
Helped-by: SZEDER Gábor &lt;szeder.dev@gmail.com&gt;
Helped-by: Junio C Hamano &lt;gitster@pobox.com&gt;
Signed-off-by: Taylor Blau &lt;me@ttaylorr.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>commit-graph-format.txt: fix no-parent value</title>
<updated>2020-09-15T21:34:34Z</updated>
<author>
<name>Conor Davis</name>
<email>git@conor.fastmail.fm</email>
</author>
<published>2020-09-15T04:03:53Z</published>
<link rel='alternate' type='text/html' href='http://git.kilabit.info/git/commit/?id=e40e9365510fc3d2decc91bd0939c2f520d2a7d6'/>
<id>urn:sha1:e40e9365510fc3d2decc91bd0939c2f520d2a7d6</id>
<content type='text'>
The correct value from commit-graph.c:

    #define GRAPH_PARENT_NONE 0x70000000

Signed-off-by: Conor Davis &lt;git@conor.fastmail.fm&gt;
Reviewed-by: Taylor Blau &lt;me@ttaylorr.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>commit-graph: use the "hash version" byte</title>
<updated>2020-08-17T23:45:14Z</updated>
<author>
<name>Derrick Stolee</name>
<email>dstolee@microsoft.com</email>
</author>
<published>2020-08-17T14:04:47Z</published>
<link rel='alternate' type='text/html' href='http://git.kilabit.info/git/commit/?id=665d70ad033b115d8c14cdc2bcecf6c8b1947260'/>
<id>urn:sha1:665d70ad033b115d8c14cdc2bcecf6c8b1947260</id>
<content type='text'>
The commit-graph format reserved a byte among the header of the file to
store a "hash version". During the SHA-256 work, this was not modified
because file formats are not necessarily intended to work across hash
versions. If a repository has SHA-256 as its hash algorithm, it
automatically up-shifts the lengths of object names in all necessary
formats.

However, since we have this byte available for adjusting the version, we
can make the file formats more obviously incompatible instead of relying
on other context from the repository.

Update the oid_version() method in commit-graph.c to add a new value, 2,
for sha-256. This automatically writes the new value in a SHA-256
repository _and_ verifies the value is correct. This is a breaking
change relative to the current 'master' branch since 092b677 (Merge
branch 'bc/sha-256-cvs-svn-updates', 2020-08-13) but it is not breaking
relative to any released version of Git.

The test impact is relatively minor: the output of 'test-tool
read-graph' lists the header information, so those instances of '1' need
to be replaced with a variable determined by GIT_TEST_DEFAULT_HASH. A
more careful test is added that specifically creates a repository of
each type then swaps the commit-graph files. The important value here is
that the "git log" command succeeds while writing a message to stderr.

Helped-by: brian m. carlson &lt;sandals@crustytoothpaste.net&gt;
Signed-off-by: Derrick Stolee &lt;dstolee@microsoft.com&gt;
Reviewed-by: brian m. carlson &lt;sandals@crustytoothpaste.net&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>commit-graph-format.txt: all multi-byte numbers are in network byte order</title>
<updated>2020-06-08T19:28:49Z</updated>
<author>
<name>SZEDER Gábor</name>
<email>szeder.dev@gmail.com</email>
</author>
<published>2020-06-05T13:00:25Z</published>
<link rel='alternate' type='text/html' href='http://git.kilabit.info/git/commit/?id=6141cdfdcbe9e3edf25467582c5f32658ae79f40'/>
<id>urn:sha1:6141cdfdcbe9e3edf25467582c5f32658ae79f40</id>
<content type='text'>
The commit-graph format specifies that "All 4-byte numbers are in
network order", but the commit-graph contains 8-byte integers as well
(file offsets in the Chunk Lookup table), and their byte order is
unspecified.

Clarify that all multi-byte integers are in network byte order.

Signed-off-by: SZEDER Gábor &lt;szeder.dev@gmail.com&gt;
Signed-off-by: Derrick Stolee &lt;dstolee@microsoft.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>Documentation: changed-path Bloom filters use byte words</title>
<updated>2020-05-11T16:33:56Z</updated>
<author>
<name>Derrick Stolee</name>
<email>dstolee@microsoft.com</email>
</author>
<published>2020-05-11T11:56:11Z</published>
<link rel='alternate' type='text/html' href='http://git.kilabit.info/git/commit/?id=88093289cdcfe99c5a3d7ef6f36ee45aa3018824'/>
<id>urn:sha1:88093289cdcfe99c5a3d7ef6f36ee45aa3018824</id>
<content type='text'>
In Documentation/technical/commit-graph-format.txt, the definition
of the BIDX chunk specifies the length is a number of 8-byte words.
During development we discovered that using 8-byte words in the
Murmur3 hash algorithm causes issues with big-endian versus little-
endian machines. Thus, the hash algorithm was adapted to work on a
byte-by-byte basis. However, this caused a change in the definition
of a "word" in bloom.h. Now, a "word" is a single byte, which allows
filters to be as small as two bytes. These length-two filters are
demonstrated in t0095-bloom.sh, and a larger filter of length 25 is
demonstrated as well.

The original point of using 8-byte words was for alignment reasons.
It also presented opportunities for extremely sparse Bloom filters
when there were a small number of changes at a commit, creating a
very low false-positive rate. However, modifying the format at this
point is unlikely to be a valuable exercise. Also, this use of
single-byte granularity does present opportunities to save space.
It is unclear if 8-byte alignment of the filters would present any
meaningful performance benefits.

Modify the format document to reflect reality.

Signed-off-by: Derrick Stolee &lt;dstolee@microsoft.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
<entry>
<title>commit-graph: write Bloom filters to commit graph file</title>
<updated>2020-04-06T18:08:37Z</updated>
<author>
<name>Garima Singh</name>
<email>garima.singh@microsoft.com</email>
</author>
<published>2020-04-06T16:59:49Z</published>
<link rel='alternate' type='text/html' href='http://git.kilabit.info/git/commit/?id=76ffbca71a9c89d1e530f734e16a70b3924f4bea'/>
<id>urn:sha1:76ffbca71a9c89d1e530f734e16a70b3924f4bea</id>
<content type='text'>
Update the technical documentation for commit-graph-format with
the formats for the Bloom filter index (BIDX) and Bloom filter
data (BDAT) chunks. Write the computed Bloom filters information
to the commit graph file using this format.

Helped-by: Derrick Stolee &lt;dstolee@microsoft.com&gt;
Signed-off-by: Garima Singh &lt;garima.singh@microsoft.com&gt;
Signed-off-by: Junio C Hamano &lt;gitster@pobox.com&gt;
</content>
</entry>
</feed>
