GSoC - Week 6

Jul 12, 2020 by Abhishek Kumar

Over the last week, I worked on implementing corrected commit date with generation data chunk and boy, was it painful!

While working on the prototype, I used the space allocated for graph position to store topological level before writing it into CDAT chunk. However, while computing bloom filters, we call get_bloom_filter(). This function checks if the commit belongs to the commit-graph. If it does, load the bloom filters (avoiding a lot of computation). Otherwise, compute the filters and return it. Since I am storing topological levels in graph position, Git mistakenly thinks that we are reading a graph and tries to load bloom filter, segfaulting right away.

Faced with the problem, I could think of three ideas:

  1. Extend generation to be 64-bits (which we were planning on anyway) and store topological levels within higher 32 bits and corrected date offsets within lower 32 bits. Code for calculating corrected date offsets, topological levels would be littered with bit shifts and ANDs, but code for parsing and using generation numbers would be clean.
  2. Use two 32-bits variables, level and odate, and use the contiguous 64 bits to store generation (while reading commit-graph): Code for writing commit-graph is clean, but reading generation is much trickier as we try to coerce value stored in (graph_data + 4) into a timestamp.

  3. Split get_bloom_filters() into two functions - First checks if the commit is from the graph and tries to load whereas the other computes bloom filter. Then we could directly call the second function when writing a commit-graph. I was not sure if this would have worked, but I wanted not to change things unless required.

I felt the first two approaches were too unreadable.

In the end, I compromised by using a 64-bit generation and a 32-bit level in the initial commit and will restrict the ugly conversion to reuse 64-bits in a focused patch later in the series.

Regression when computing bloom filters

While wrestling with the problem above, I noticed that commit_gen_cmp() lies in “commit-graph write” path but uses the helpers. But the helper would always return GENERATION_NUMBER_MAX as we found while moving generation number into a slab.

Digging through git logs, I found the patch, which introduced commit_gen_cmp(). Sorting commits make the computing bloom filters much, much faster. The helper returns the same value every time, which makes sorting pointless and nullifies the performance boost. Fixing this was easy, bypass the helper and access generation number.

Future Plans

I will be working on squashing down the bugs and making sure all tests pass and hope to mail in the first version by Thursday. You can find the working draft here.