I’m building a set of open-source cloud data-stores in December, blogging all the way.
Today, I attacked one of the things that makes a BTree a tree: splitting nodes. In a traditional BTree, a page is limited to a fixed size, typically 4KB or 8KB. I’m creating what I call a ‘relaxed’ BTree, where we don’t have to be so strict about page sizes. However, we still want to limit the page size. Our leaf page format currently uses 16 bit offsets, so it is limited to 64KB pages. More importantly, if we never split pages, we just end up with a single page, and we never get a tree structure. So I implemented node splitting, splitting whenever the page is bigger than 32KB. This means we start creating branch pages, and this (of course) uncovered a few bugs.
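To make the splitting step concrete, here is a minimal sketch in Python of size-based splitting. The names, the 2-byte per-entry overhead, and the "split near the size midpoint" policy are all illustrative assumptions, not the project's actual code.

```python
# Illustrative sketch of splitting a leaf page once it grows past 32KB.
# Entries are (key, value) byte-string pairs, assumed sorted by key.

PAGE_SPLIT_THRESHOLD = 32 * 1024  # split when a page grows past 32KB
ENTRY_OVERHEAD = 2                # e.g. one 16-bit offset per entry (assumed)

def page_size(entries):
    """Approximate serialized size: per-entry overhead plus key/value bytes."""
    return sum(ENTRY_OVERHEAD + len(k) + len(v) for k, v in entries)

def maybe_split(entries):
    """Return one page if it fits, or two pages split near the size midpoint.

    Because entries are sorted, each half stays sorted; the first key of
    the right page would be copied up into the parent branch page.
    """
    if page_size(entries) <= PAGE_SPLIT_THRESHOLD:
        return [entries]
    half = page_size(entries) // 2
    running = 0
    for i, (k, v) in enumerate(entries):
        running += ENTRY_OVERHEAD + len(k) + len(v)
        if running >= half and i + 1 < len(entries):
            return [entries[:i + 1], entries[i + 1:]]
    return [entries]  # a single huge entry can't be split this way
```

Splitting by accumulated byte size rather than by entry count is what keeps both halves under the threshold even when entry sizes vary widely.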
Next, I wanted to work around that 64KB limit on the leaf page size, because we might very well want bigger values. So I introduced a special format for leaf nodes with only one entry, that allows 4GB values. Because of the way we’re splitting our leaves, big values will always end up on a leaf node with just one value. I’m not sure whether this is a good idea. The “one value” thing is a bit magical. It’s also not ideal to have two formats, although they are very similar to each other. Also, storing 4GB values at all is definitely not a good idea. I’m not sure where the cut-off point comes (1GB? 1MB? 1KB?), so we’ll probably revisit this. Postgres stores big values in page-sized chunks, which has the advantage that it’s quick to seek to random parts of BLOBs. Our design would make it fairly easy to do something similar, or just to store big values in a separate page, or even to store them federated onto object storage. Whatever we eventually decide here will probably feed back into the page-splitting code as well, but for now we can store huge values.
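The two-format idea can be sketched like this: pick the single-entry format (with a 32-bit value length, hence the 4GB limit) whenever one value wouldn't fit the 16-bit lengths of the normal format. The field layout here is invented for illustration; the real page format will differ.

```python
import struct

# Two hypothetical leaf encodings: the normal format uses 16-bit lengths,
# so values must fit under 64KB; the single-entry format uses a 32-bit
# value length, allowing values up to 4GB. Layout is illustrative only.

FORMAT_NORMAL = 0
FORMAT_SINGLE_BIG = 1

def encode_leaf(entries):
    if len(entries) == 1 and len(entries[0][1]) >= 0xFFFF:
        # Single-entry format: format byte, 16-bit key length, 32-bit value length.
        key, value = entries[0]
        return struct.pack("<BHI", FORMAT_SINGLE_BIG, len(key), len(value)) + key + value
    # Normal format: format byte, 16-bit entry count, then 16-bit lengths per entry.
    out = [struct.pack("<BH", FORMAT_NORMAL, len(entries))]
    for key, value in entries:
        out.append(struct.pack("<HH", len(key), len(value)))
        out.append(key)
        out.append(value)
    return b"".join(out)

def decode_leaf(buf):
    if buf[0] == FORMAT_SINGLE_BIG:
        key_len, value_len = struct.unpack_from("<HI", buf, 1)
        key = buf[7:7 + key_len]
        return [(key, buf[7 + key_len:7 + key_len + value_len])]
    (count,) = struct.unpack_from("<H", buf, 1)
    entries, pos = [], 3
    for _ in range(count):
        key_len, value_len = struct.unpack_from("<HH", buf, pos)
        pos += 4
        key = buf[pos:pos + key_len]; pos += key_len
        value = buf[pos:pos + value_len]; pos += value_len
        entries.append((key, value))
    return entries
```

The "magic" the post mentions is visible here: the reader has to branch on the format byte, which is the cost of supporting two near-identical encodings.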
We don’t support huge keys. We could, but it’s probably not a good idea: keys are copied into the branch pages, so big keys are less straightforward to implement than big values (which occur only in leaves), and there’s a bigger overhead to having big keys. For now, we won’t support keys bigger than 32KB, and we’ll probably artificially limit them to a much smaller size to encourage good usage (1KB?).
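The key-size policy would amount to a simple check at the write path. The two limits here mirror the numbers floated above; both names are hypothetical.

```python
MAX_KEY_SIZE = 32 * 1024   # hard limit imposed by the page format
SOFT_KEY_LIMIT = 1024      # possible artificial limit to encourage small keys

def check_key(key: bytes) -> None:
    """Reject keys that the format can't hold, or that policy discourages.

    Illustrative sketch only; the real store may pick different limits.
    """
    if len(key) > MAX_KEY_SIZE:
        raise ValueError(f"key of {len(key)} bytes exceeds format limit of {MAX_KEY_SIZE}")
    if len(key) > SOFT_KEY_LIMIT:
        raise ValueError(f"key of {len(key)} bytes exceeds policy limit of {SOFT_KEY_LIMIT}")
```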
Right now, we don’t yet reclaim any pages, so our database will grow indefinitely. This is far from ideal! The first step was to extend the transaction to record free pages. The plan is to go through old transactions and reclaim the pages they freed once those transactions are no longer needed (once there are no read transactions referencing them), adding those pages to a free list. We can then allocate future pages from that free list, and so our database will no longer grow indefinitely. The challenge is to do this in a consistent and persistent way; more on that next time!
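The planned mechanism can be sketched in memory (the hard part, making it consistent and persistent, is exactly what this sketch leaves out). All names here are illustrative.

```python
# Sketch of the planned free-list: write transactions record the pages
# they made unreachable; once no reader can still see those pages, they
# move to the free list and the allocator reuses them instead of growing
# the file. In-memory only; persistence is the real challenge.

class Pager:
    def __init__(self):
        self.next_page = 0
        self.free_list = []            # page numbers available for reuse
        self.pending = {}              # txn id -> pages that txn freed
        self.oldest_reader = None      # oldest txn id a reader still sees

    def allocate(self):
        """Reuse a free page if we have one; otherwise grow the file."""
        if self.free_list:
            return self.free_list.pop()
        page = self.next_page
        self.next_page += 1
        return page

    def record_free(self, txn_id, pages):
        """A write transaction records the pages it made unreachable."""
        self.pending.setdefault(txn_id, []).extend(pages)

    def reclaim(self):
        """Move freed pages to the free list once no reader can see them."""
        for txn_id in sorted(self.pending):
            if self.oldest_reader is not None and txn_id >= self.oldest_reader:
                break  # a reader may still reference this version
            self.free_list.extend(self.pending.pop(txn_id))
```

The `oldest_reader` watermark is what makes this safe under copy-on-write: a page freed by transaction N can only be reused once every reader has moved past N.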