GSoC - Week 8

Jul 26, 2020 by Abhishek Kumar

Over the week, I wrapped up the failing tests for generation number v2 pull request and published it to the mailing list. You can find the pull request and the discussion on the mailing list.

git rev-list –topo-order –first-parent

As I discussed in the previous post, one of the failing tests was test ‘rev-list: first-parent range topo-order’ for the half commit-graph. Contrary to expectations, commit-3-1 and commit-2-1 were also shown as reachable from commit-6-6 but not from commit-3-8.

[For the graph created in t6600-test-reachable, commit-x-y can reachable all other commits commit-i-j as long as 1 <= i <= x and 1 <= j <= y]

While I understood traversal as an algorithm, I knew little about the implementation details. After Tuesday, I decided making random changes, and hacking around the debugger was not going to fix this.

I had three clues to start investigating with:

  1. The tests pass for --topo-order, --first-parent options individually for half commit-graph using corrected commit date.
  2. The test passes --topo-order --first-parent for full commit-graph using corrected commit date.
  3. The test passes --topo-order --first-parent for half commit-graph using old Git (that is, new Git calculated generation numbers correctly).

To start, I decided to understand how does Git compute the set difference of reachable commits. I was getting the correct reachable sets from commit-3-8 and commit-6-6 but incorrect set difference.

Git doesn’t calculate the set difference, but rather avoids expanding branches that it knows are uninteresting. Ah! Both commit-3-1 and commit-2-1 should have been marked as uninteresting, but were not. We start with an uninteresting commit (typically passed as an argument with ^rev-name) and recursively mark all of its parent uninteresting.

The chain of marking parents uninteresting stops at commit-3-5 explains why commit-3-1 (and its parent commit-2-1) were not uninteresting.

Git stops at commit-3-5 as it belongs to the commit-graph (and has a finite corrected commit date) compared with commit-3-8 and commit-6-6 which do not, and have GENERATION_NUMBER_V2_INFINITY as the cutoff.

At the end of initializing topological walk, we have two commits commit-5-1 and commit-3-5 in the indegree queue. The topological levels for commits were 5 and 8 respectively, but because commits are created row-wise, commit-3-5 has a lower commit date (and corrected commit date) than commit-5-1.

Ah! So, the order of commits in indegree queue is reversed between topological levels and corrected commit date.

Does that mean Git should have been ordering commits by uninteresting, then generation? That certainly would fix the discrepancy of indegree walk orders.

However, this would also mean that all uninteresting commits would be expanded before any interesting commit. Such expansion is certainly not ideal and defeats the purpose of generations as a cutoff. Ordering commits by uninteresting could not be the answer.

I began to wonder why does the order of commits indegree queue even matter? It shouldn’t, as long as generations are above the cutoff for compute_indegrees_to_depth(). Both commit-3-5 and commit-5-1 were ancestors of commit-4-1 and thus had higher generations.

I added some debug statements to track the topo changes, indegree and explore queues, and saw something like this:

| get commit-4-1 into topo queue
| | ....
| | get commit-4-1 from indegree queue
| | | insert commit-3-1 into explore queue
| | ....
| | get commit-3-1 from indegree queue
| | ....
| insert commit-3-1 into topo queue

While expanding commit-4-1 in the topo walk, we were also expanding it in the indegree walk. I wondered, can the same commit be in both queues at the same time?

Clearly, no. While in expand_topo_walk(), if Git reaches a commit with a smaller generation, it computes the indegrees using the lower generation as the cutoff. Therefore, it should have exhausted commit-4-1 from the indegree queue before inserting it into the topo queue.

I needed to understand the order of commits in the indegree queue. Adding more debugging statements:

| | insert commit-4-1 into indegree queue: 0

commit-4-1 had corrected commit date of zero before inserting it into indegree queue. What! It should have a non-zero corrected commit date. Turns out, Git doesn’t always parse the commit before inserting them into indegree queue. It parses parents before inserting into explore queue (process_parents()) and before the topo queue but not indegree queue. Sometimes the parents would be parsed (and have a valid generation number), sometimes not.

Fixing this was easy - all we had to parse parent before inserting into indegree queue. I spent all week, finally, to figure out a two-liner fix. Such can be life at times.

Future Plans

With the first version of the patch series sent to the mailing list, I plan to implement the remaining features and respond to reviews.

In my spare time, I am going to look into the possibility of eliminating graph position. Storing graph positions helps Git avoid the work of searching for the commit oid in commit-graph files for subsequent lookups.

But how often are the subsequent look ups? I can think of loading bloom filters as the only example. As such, removing graph positions might not affect performance (provided we load bloom filters along with the rest of CDAT, decided to whether to load or not with a flag) and reduce our memory footprint further.