I’m building a set of open-source cloud data-stores in December, blogging all the way.
In our last episode, we added more functionality to the BTree: node splitting & support for big values. We also started recording page tracking information into the transaction records, in preparation for reclaiming old space.
Reclaiming old space is important, because we use a copy-on-write approach rather than changing pages in-place. This is relatively unusual for a database, but allows us to support readers without using locks. Reclaiming the old versions (that are no longer referenced by active transactions) is required to stop the database growing without bounds, so most of today was spent on this.
Getting the basic design of the garbage collector right was tricky. We maintain a free-space map in-memory, so that we can allocate new space quickly (without going to disk every allocation). With each transaction record, we include the pages that we freed and the pages that we allocated. Thus, a complete list of transactions can be used to rebuild the free-space map. This is elegant, because if a transaction commits, that is exactly when we want to persist the page allocations. Conversely, if we rollback or otherwise fail a transaction, then we want to roll back the page allocations. So it is nice that they are actually part of the same structure - we don’t have to worry about keeping them synchronized because they are one and the same.
Now, once again, we have a log-structured system for tracking free-space, with an in-memory representation. This is exactly the same approach we’re using for all of our projects. When we startup, we replay the allocation records to rebuild the state. We have the same problem: we need to periodically take snapshots, so that we can keep the amount of time taken to replay the logs under control. We can take a snapshot, which is just another special page type, and we can reference this from the transaction. One trick is that we don’t have to write the snapshot on every transaction; instead we can write it periodicially (when it’s “worth it”). We do have to be careful not to delete the older transaction records until they are no longer needed for rebuilding the current free space map.
Finally, we can actually reclaim the pages. We keep track of every active read transaction; we can clean up after a write transaction only when no read transaction references that transaction (or an earlier version). This is the same trick that is used by LMDB and ZFS (for snapshots).
So now we free up old page versions and can reuse the space. One issue is the strategy for how we allocate memory from the free space. This is the classic memory allocation problem; there is no “right answer”. The best allocators, like tcmalloc, typically use buckets to ensure speed while avoiding fragmentation. Instead, I went with a simpler allocator inspired by ZFS: first fit. In “first fit”, we simply find the first bit of free space that can “fit” an allocation request. This is well-known to cause fragmentation, but I wanted to start simple! ZFS uses a neat trick, which actually rotates through available space (changing the zero-point to redefine “first”), which means that writes rotate sequentially around the disk. This has the advantage that older versions survive for longer (one whole ‘trip’ around the disk), which is great for recovery of corrupted data. However, we expect our database to be largely in-memory, and marching through the whole disk allocation is likely to hurt the cacheability. So I went with a simple ‘first fit’, not ZFS’s clever spin on it. Fragmentation is likely to be a problem; we’ll have to see how it behaves!
We now have a working (basic) BTree! There are still lots of issues, but we can hopefully fix those ‘as we go’ and start building more real features.
There wasn’t much time left in the day, so I proposed a patch to Barge, that lets us return a value from the StateMachine. This means that we can do write-operations that return a value. I implemented the increment operation, which just adds one to a counter value - this is the simplest useful operation I could think of! I also tweaked the code a little more so that we have a doAction method which operates on a log operation even at the top level; this should make it much easier to add lots more operations. Hopefully we’ll add a couple more operations tomorrow. Redis made the news today for having a broken-by-design cluster implementation, so it might be fun to work towards something that looks a little like Redis!