Justin Santa Barbara’s blog

Databases, Clouds and More

Day 2: A BTree for a Key-value Store

I’m building a set of open-source cloud data-stores in December, blogging all the way.

In our last episode, we started off with the basic design: a Raft distributed log of changes (using the Barge implementation), persisting into a data structure that we implement ourselves to store the state. We used that to build a simple append-log as-a-service, like Amazon Kinesis.

Today, I started working towards a simple key-value store, like Redis or memcache. Key-value stores let you associate a value with a key and retrieve the value given the key, and that’s about it. Unlike richer datastores, they don’t typically allow operations across multiple keys.

Those limitations were chosen deliberately: by not supporting multi-key operations, sharding for scale-out becomes easy, and a hashtable can be used for simple, fast lookups. The disadvantage is that the model is quite limited, so these datastores typically end up supporting a long list of complex operations on individual entries, or users must design complex multi-key data structures (the complexity is pushed onto the caller). It was these observations that spawned the growth of the relational model back in the 70s, when datastores like these were common. But they still have their place: then it was because they were fast and simple to implement; now it is because they’re fast and allow easy scale-out. Fast is always nice, but the “easy to implement” bit makes them a good next step for our little project!

We will again rely on the Raft consensus log for our key-value operations, but we’ll need to store the state in a different data structure: one that supports assigning a value to a key and retrieving the value for a given key.
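
As a rough sketch, the state we apply committed Raft log entries to needs little more than this (the names here are illustrative, not the project’s actual API):

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of the state machine that committed Raft log entries
// are applied to; the method names are illustrative, not the actual API.
public interface KeyValueStateMachine {
    // Applied when a 'set' operation is committed through the Raft log
    void put(ByteBuffer key, ByteBuffer value);

    // Served from the local state; returns null if the key is absent
    ByteBuffer get(ByteBuffer key);
}
```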

Redis and memcache both choose the hashtable for this data structure. It’s a good choice because it is fast: with a good hash function, both read and write operations run in expected O(1) time, although growing the hashtable is a little tricky.

LevelDB uses the log-structured merge-tree to store its data. This allows LevelDB to support another operation: in-order key iteration, which is often enough to avoid having to support those complex operations or requiring complex datastructures. The LevelDB implementation offers very good performance for writes, and good performance for reads. The implementation suffers from occasional slowdowns for writes (during compactions), and higher memory & CPU overhead for reads. There is work going on to address these issues with LevelDB (Basho’s LevelDB and Facebook’s RocksDB).

There are many data-store implementations that use the BTree: LMDB, BerkeleyDB and every relational database. The BTree is like a binary tree, but puts more values into each node to amortize the data-structure overhead. It maps very well to page-based systems like modern CPUs and block-devices. Like the log-structured merge-tree, BTrees support in-order key iteration.

LMDB is particularly interesting because it uses a clever design which is almost lock-free for reads, but supports only a single writer. Because all our writes are serialized through a log anyway, we’ll only have a single writer, so this feels like a great fit.

Although a hashtable implementation might be faster for a strict key-value store, the hope is that the BTree will in practice be similar in performance (with the LMDB design), and will support additional operations (like key-iteration) that will prove useful for the future plans. For example, we can use it when we want to implement secondary indexes.

From the start, we’ll allow duplicate keys in our BTree, because secondary indexes require them and they’re painful to add afterwards. This also dictates the interface that our BTree will expose; because keys can be repeated we just allow a caller to walk through all the entries, starting from a specified key. It’s trivial to add uniqueness constraints, or to support a simple get request on top of this.
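
A hypothetical shape for that interface (the names are mine, not necessarily what ends up in the code): the caller supplies a starting key and a callback, and the tree walks entries in key order until the callback says stop.

```java
import java.nio.ByteBuffer;

// Illustrative sketch of a duplicate-friendly BTree interface;
// the real class and method names may well differ.
public interface RelaxedBtree {
    // Callback invoked for each entry in key order; return false to stop walking
    interface EntryListener {
        boolean found(ByteBuffer key, ByteBuffer value);
    }

    // Insert an entry; duplicate keys are allowed
    void insert(ByteBuffer key, ByteBuffer value);

    // Visit every entry with key >= fromKey, in key order
    void walk(ByteBuffer fromKey, EntryListener listener);
}
```

A plain get is then just a walk that stops after the first entry whose key matches exactly, and a uniqueness constraint is a walk performed before the insert.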

I did debate just reusing the (excellent) LMDB implementation via JNI (or writing everything in C++), but I’ve decided to roll my own in Java. Hopefully this will pay off: there will be opportunities to make different decisions for our particular use cases.

To produce a fast BTree implementation, we’ll continue to use ByteBuffers. Object allocation on the JVM is fast, but garbage collection can be painful, so we want to keep object creation under control unless the objects are very short-lived (i.e. they would live on the stack in C). For a page we’re reading, the idea is to keep the data in the ByteBuffer and extract entries directly from it. This is pretty much C-style pointer code, and it is just as tedious to get right as it would be in C, but once it’s right it works well.
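
To give a flavour of that pointer-style code, here is a sketch of reading an entry straight out of a leaf page. The layout (an entry count, then a table of 16-bit entry offsets, then length-prefixed keys and values) is my illustration of the idea, not the actual on-disk format:

```java
import java.nio.ByteBuffer;

// Sketch of reading entries directly from a leaf page buffer, without building
// intermediate objects. Assumed layout: entry count (2 bytes), a table of
// 16-bit entry offsets, then length-prefixed key/value pairs.
class LeafPageReader {
    private final ByteBuffer buffer; // read-only view of the page

    LeafPageReader(ByteBuffer buffer) {
        this.buffer = buffer;
    }

    int getEntryCount() {
        return buffer.getShort(0) & 0xffff;
    }

    ByteBuffer getKey(int index) {
        int entryOffset = buffer.getShort(2 + (index * 2)) & 0xffff;
        int keyLength = buffer.getShort(entryOffset) & 0xffff;
        ByteBuffer key = buffer.duplicate();
        key.position(entryOffset + 2);
        key.limit(entryOffset + 2 + keyLength);
        return key.slice(); // zero-copy view into the page
    }
}
```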

Implementing writes C-style (straight into the ByteBuffer) is trickier, particularly if we have a fixed buffer size. Instead, we’ll ‘extract’ the page before we write to it, converting it into a more natural Java data structure (e.g. a List of objects); applying changes is then simple. Then, at the end, we serialize the final version back into a ByteBuffer. This makes our code much simpler; it may actually be faster than always working against the ByteBuffer when we perform multiple operations; and it allows for “future optimizations” (“tune in next time…”). The big downside is that much of the code is duplicated, because we now have two memory representations: one for a clean (read-only) page and one for a dirty (read-write) page.
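
Very roughly (again, names and layout are invented for illustration), the dirty version of a leaf is just a list of entries that we mutate freely, and only at commit do we flatten it back into a buffer:

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Sketch of the 'extracted' read-write form of a leaf page: a plain Java list
// of entries, serialized back into a ByteBuffer at commit time. The format
// mirrors the (illustrative) read-only layout above.
class DirtyLeafPage {
    static class Entry {
        final ByteBuffer key;
        final ByteBuffer value;
        Entry(ByteBuffer key, ByteBuffer value) { this.key = key; this.value = value; }
    }

    private final List<Entry> entries = new ArrayList<>();

    // Mutation is trivial against a List; position comes from a binary search
    void insert(int position, ByteBuffer key, ByteBuffer value) {
        entries.add(position, new Entry(key, value));
    }

    ByteBuffer serialize() {
        ByteBuffer out = ByteBuffer.allocate(computeSerializedSize());
        out.putShort((short) entries.size());
        int dataOffset = 2 + (entries.size() * 2); // entry data starts after the offset table
        for (Entry e : entries) {
            out.putShort((short) dataOffset); // 16-bit offsets limit the page size
            dataOffset += 4 + e.key.remaining() + e.value.remaining();
        }
        for (Entry e : entries) {
            out.putShort((short) e.key.remaining());
            out.put(e.key.duplicate());
            out.putShort((short) e.value.remaining());
            out.put(e.value.duplicate());
        }
        out.flip();
        return out;
    }

    private int computeSerializedSize() {
        int size = 2 + (entries.size() * 2); // count + offset table
        for (Entry e : entries) {
            size += 4 + e.key.remaining() + e.value.remaining(); // lengths + data
        }
        return size;
    }
}
```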

A traditional BTree sets a maximum size for a page (e.g. 4KB or 8KB); we have to figure out what we’ll do when we exceed that limit. The traditional approach is to rebalance the BTree, moving entries around and splitting pages so that everything fits. Instead, we’ll implement what I call a ‘relaxed’ BTree, where we allow pages to be arbitrarily sized. This does mean that our ‘page id’ will be an offset into the file. With a 4 byte page id, we’d be limited to 4GB of data, so we’ll force pages to align to 16 byte multiples; that way we actually get 64GB out of a 32 bit page id. It costs us a bit of padding space, but buys us a more sensible maximum database size, and better-aligned data may be faster (although this may be a myth on modern CPUs).
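
The page id to file offset mapping is then trivial; a sketch (the names are mine, but the 16-byte alignment matches what’s described above):

```java
// Sketch of the page-id <-> file-offset mapping for the 'relaxed' BTree.
// With 16-byte alignment, a 32-bit page id addresses 2^32 * 16 bytes = 64GB.
class PageOffsets {
    static final int ALIGNMENT = 16;

    // Round a write position up to the next aligned offset
    static long align(long offset) {
        return (offset + ALIGNMENT - 1) & ~(long) (ALIGNMENT - 1);
    }

    static int toPageId(long fileOffset) {
        assert (fileOffset % ALIGNMENT) == 0;
        return (int) (fileOffset / ALIGNMENT);
    }

    static long toFileOffset(int pageId) {
        // Treat the page id as unsigned so we keep the full 64GB range
        return (pageId & 0xffffffffL) * ALIGNMENT;
    }
}
```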

For each transaction, we gather all the dirty pages, and then at commit we write them each to new positions in the file, which gives them a new page id; we update the parent pages, and write those, recursing all the way up to the root page. We then write a new ‘master’ page which points to the root.
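
In sketch form (all of these types and names are invented for illustration; the real transaction code is fiddlier), the commit path writes children before parents so that each parent can record its children’s new page ids, and finishes by pointing the master page at the new root:

```java
import java.nio.ByteBuffer;
import java.util.List;

// Sketch of the copy-on-write commit path; types and names are illustrative.
class CommitSketch {

    interface DirtyPage {
        List<DirtyPage> dirtyChildren();                      // empty for leaf pages
        void setChildPageId(DirtyPage child, int newPageId);  // patch a child pointer
        ByteBuffer serialize();
    }

    interface PageFile {
        int appendPage(ByteBuffer data);       // write at a fresh aligned offset, return the new page id
        void writeMasterPage(int rootPageId);  // point the master page at the new root
    }

    // Write dirty children before their parent, so the parent can record their
    // new page ids; returns the new page id of this subtree's root.
    static int writeDirty(DirtyPage page, PageFile file) {
        for (DirtyPage child : page.dirtyChildren()) {
            page.setChildPageId(child, writeDirty(child, file));
        }
        return file.appendPage(page.serialize());
    }

    static void commit(DirtyPage dirtyRoot, PageFile file) {
        file.writeMasterPage(writeDirty(dirtyRoot, file));
    }
}
```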

We avoid overwriting data so that readers can proceed lock-free in parallel (against the previous version), and so that we don’t risk corrupting our data. This copy-on-write approach is famously used by the ZFS filesystem to avoid corruption, and LMDB uses it to allow concurrent reads.

For now, the implementation doesn’t do any rebalancing, so we just end up with one big leaf page. It’ll fail pretty soon, because the current leaf format limits data to 16-bit offsets (so roughly 64KB per page).

Again though, we add a simple RESTful interface, add some tests, and we have a working (though very limited) key-value store.