GSoC - Weeks 1, 2

Jun 14, 2020 by Abhishek Kumar

Over the first two weeks of GSoC, I have been working on moving struct members generation and graph_pos out of the struct commit. You can find the patch and subsequent discussion versions 1 and 2, versions 3 and 4.

Contents

Overview

Let’s take a moment to talk about commit-slab. A commit slab dynamically associates commits with some kind of information. For example, indegree commit slab stores the indegree of a commit when running the topological sort, generation commit slab stores the generation number of a commit, and so on.

Commit slabs are neat because they reduce the memory footprint. The generation commit slab allocates memory only when generation numbers are assigned. For the large repositories, such savings can be pretty nifty.

But this, of course, comes at a cost. A struct member access is much cheaper than accessing the same member through commit-slab.

As discovered by SZEDER 1, specific commands like “git merge-base –is-ancestor HEAD~50000 HEAD” were slower by nearly 40%!

Dr. Stolee believes the slowdown has to do with the underlying algorithm, rather than the slower commit slab access 2.

Performance Statistics

Test Time taken by Master Time taken by Version 4 Change in Time (%)
Test Suite 23m39s 23m14s -1.76%
git merge-base –is-ancestor HEAD~50000 HEAD 0.787s 0.927s +17.79%
git gc 3m55s 3m38s -7.23%

Memory Statistics

Test Max RSS, Master Max RSS, Version 4 Change in Max RSS (%)
Test Suite 34940kb 34848kb -0.26%
git merge-base –is-ancestor HEAD~50000 HEAD 177694kb 177707kb +0.01%
git gc 4889868kb 4911960kb +0.45%
Test Suite on Master
Test Suite on Master
Test Suite on v4
Test Suite on v4

Talking more about my experience, over the weeks, I have grown more comfortable with the build process and hacking around things. While the implementation is straightforward, I had to spend a lot of time debugging tests.

For example, consider the idea “commits within the graph have definite generation numbers”. So, if the commit is not from the graph, it must have an infinite generation number.

This fails when we think about writing commit-graph. Since commits have not been written (yet), they have a generation number but are also not from the graph.

Likewise, consider the following idea: Maintaining a count of “parsed commits so far” for a repository should provide a unique value, which can then be used as an index for commit-slab seems reasonable. But it’s true only for single repository setups that have no submodules at all! The build failed tests for diff-submodules.

The commit-slab has been used for the following:

contains_cache, commit_seen, indegree_slab, author_date_slab, commit_base, commit_pos, bloom_filter_slab, buffer_slab, commit_rev_name, commit_names, commit_name_slab, saved_parents, blame_suspects, commit_todo_item

None of which had failed yet.

Next Steps

As v3 was queued into next, I will polish the current series a bit more. Solving the global counter issue with alloc_commit_index() more cleanly by dropping commit_count from parsed_object_pool and fixing the segfault when allocating new slabs 6.

Once that’s done, I will focus entirely on “handling commit-graph format change”.

Thanks to Dr. Stolee, Dr. Jakub, Junio, and Szeder Gábor for their interest in patch series.