I’m building a set of open-source cloud data-stores in December, blogging all the way.
In our last episode, we built a cloud-backed Git server. We were able to combine cloud object storage with the key-value data-store we’ve been building to store Git repositories reliably on the cloud, taking advantage of the code Google released to do this in JGit.
Today, I started thinking more deeply about how to enhance the key-value data-store, to get to a data-store that exposes multi-key operations, particularly for reads. We need this for a document store (like MongoDB/DynamoDB), and it’s also the foundation of a relational store.
Multi-key reads are much more useful than multi-key writes; I think the most important difference between a document store and a relational store is that document stores don’t implement multi-key writes. This lets document stores scale out easily through sharding (just like key-value stores), while also giving them some of the power that relational stores enjoy. Now, this isn’t the difference most often cited: that document stores allow flexible schemas, whereas relational stores have a fixed schema. That is, in my mind, a false distinction. Relational stores have chosen to enforce schemas, but they don’t really have to. Google’s F1 and Postgres’ JSON type are hints of this.
I also believe that it’s possible to implement a relational data-store that can scale out (so we don’t need to accept the compromises that document stores make), but that’s another story!
So I started off today with simple support for the most basic query imaginable: a select-all-rows scan. From there, it’s possible to add filtering, range queries and so on; the hard bit, of course, is running those queries without having to examine every row! But select-all is a good start, so I implemented a very simple binary format that allows entries to be streamed over HTTP, added the call to the client I’m building that the tests use, and now we can select all entries in the store.
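To make that concrete, here’s a minimal sketch of the kind of length-prefixed entry stream I have in mind; the class and method names are illustrative rather than the project’s actual code:

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Map;

// Hypothetical illustration: stream every key/value pair as
// [key length][key bytes][value length][value bytes], with a negative
// length marking the end of the stream.
public class EntryStreamWriter {
    public static void writeAll(OutputStream out, Iterable<Map.Entry<byte[], byte[]>> entries)
            throws IOException {
        DataOutputStream data = new DataOutputStream(out);
        for (Map.Entry<byte[], byte[]> entry : entries) {
            data.writeInt(entry.getKey().length);
            data.write(entry.getKey());
            data.writeInt(entry.getValue().length);
            data.write(entry.getValue());
        }
        data.writeInt(-1); // end-of-stream marker
        data.flush();
    }
}
```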
At this point we could implement basic filtering, but first I wanted to think about how this should really work. We’re going to want to put JSON objects into the store, and be able to query them. It’s simple to do that via a full table scan, but if we want performance to be acceptable we have to implement secondary indexes. I’m going to try storing the secondary indexes in the same BTree structure (the normal approach has a BTree for each index, but I’d like to avoid changing the BTree logic if I can). So we need a way to keep the ‘data keys’ separate from the ‘index keys’. I had the idea of “keyspaces”, assigning a numeric value to each set of keys, and prefixing every key with the keyspace id. We can use the variable-length encoding from Protocol Buffers to save some bytes: it should only need 1 byte for the first 128 keyspaces. This also means that entries will be grouped by keyspace, which is why I think one BTree will probably be fine.
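Roughly, the idea looks like this; this is a hand-rolled sketch of a Protocol Buffers-style varint prefix, and the class and method names are mine, not the store’s actual code:

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch: prefix each key with its keyspace id, encoded as a
// Protocol Buffers-style varint, so ids below 128 cost a single byte and
// all keys in a keyspace share a common prefix (grouping them in the BTree).
public class KeyspacePrefix {
    public static byte[] qualify(int keyspaceId, byte[] key) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int v = keyspaceId;
        while ((v & ~0x7F) != 0) {        // more than 7 bits remaining
            out.write((v & 0x7F) | 0x80); // write 7 bits with the continuation bit set
            v >>>= 7;
        }
        out.write(v);                      // final byte has its high bit clear
        out.write(key, 0, key.length);
        return out.toByteArray();
    }
}
```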
Redis has functionality for multiple databases with numeric IDs. Even though I don’t really have a use for this right now, it maps pretty naturally to keyspaces, and it gives us a way to test the keyspace functionality. I implemented a test for keyspaces and then the functionality itself - TDD in action (although strictly I shouldn’t have pushed the commit with just the test, as that breaks the CI build)!
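For flavour, here’s a JUnit-style sketch of the sort of test I mean, written against the hypothetical KeyspacePrefix helper above rather than the real store’s API:

```java
import static org.junit.Assert.assertArrayEquals;
import static org.junit.Assert.assertFalse;

import java.util.Arrays;
import org.junit.Test;

// Tests the hypothetical KeyspacePrefix sketch above, not the real store.
public class KeyspacePrefixTest {
    @Test
    public void smallKeyspaceIdsCostOneByte() {
        byte[] key = new byte[] { 1, 2, 3 };
        // Keyspace id 5 fits in one varint byte, followed by the original key.
        assertArrayEquals(new byte[] { 5, 1, 2, 3 }, KeyspacePrefix.qualify(5, key));
    }

    @Test
    public void sameKeyInDifferentKeyspacesProducesDifferentEntries() {
        byte[] key = "user:1".getBytes();
        assertFalse(Arrays.equals(
                KeyspacePrefix.qualify(1, key),
                KeyspacePrefix.qualify(2, key)));
    }
}
```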
It’s probably premature optimization, but I next implemented alternative value formats, as I was thinking about yesterday. I’m not really using them yet, but the idea is that we’ll probably want to be able to tell the difference between a JSON object and a string that happens to look like a JSON object. I also thought we might want to store JSON values in a more efficient representation. MongoDB has BSON, which seems like a reasonable idea, although it has some pretty glaring flaws in the implementation. In particular, BSON misses the most important optimization of all, which is to avoid storing the keys repeatedly; this pushes clients to worry about the length of their keys. The lesson to learn is to avoid MongoDB’s mistake of exposing the binary format as the wire protocol, at least until it’s proven. We could try using Google’s Protocol Buffers as the data representation instead; Protocol Buffers is definitely proven, and much more efficient than anything I want to build. Keys are not stored at all, because they’re in the schema instead. The downside (for some) is that Protocol Buffers uses a static schema, so we can’t simply map arbitrary JSON to it. My vague plan here is to treat everything as JSON, although not necessarily store it as a simple JSON string. If we have a schema, we can store it as Protocol Buffers; for arbitrary JSON we can use something like BSON, or maybe just compression; but we’ll probably always expose it to the client as JSON.
A value format (for now) is simply a byte prefixed to the value, indicating how the remaining data should be interpreted. The only two formats I’ve implemented so far are a raw byte format (the ‘obvious’ format for the key-value store) and a simple format for storing integers. They’re what we’re using at the moment, so that’s what we can test!
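Concretely, the layout looks something like this; the format codes and names are illustrative, not the actual constants in the code:

```java
import java.nio.ByteBuffer;

// Sketch of the value layout I have in mind: one format byte, then the payload.
public class ValueFormats {
    public static final byte FORMAT_RAW = 0;   // raw bytes, uninterpreted
    public static final byte FORMAT_INT64 = 1; // 8-byte big-endian integer

    public static ByteBuffer encodeRaw(byte[] data) {
        ByteBuffer buffer = ByteBuffer.allocate(1 + data.length);
        buffer.put(FORMAT_RAW).put(data);
        buffer.flip();
        return buffer;
    }

    public static ByteBuffer encodeInt64(long value) {
        ByteBuffer buffer = ByteBuffer.allocate(1 + 8);
        buffer.put(FORMAT_INT64).putLong(value);
        buffer.flip();
        return buffer;
    }
}
```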
I first coded this in a “minimally invasive” way - I didn’t want to change too much - but that meant the code was fragile. After I got it working, I went back and refactored it to be object-oriented. Rather than return a ByteBuffer for the value and expect the caller to know whether that ByteBuffer has already been decoded or not, I changed the Values helper class (which consisted simply of static methods) into a Value class we instantiate for every decoded value. This has some overhead, but we expect these Value objects to be short-lived (so the Java Garbage Collector should make swift work of them). It makes the code much cleaner and easier to read, stops me making stupid programming errors, and also lets us use polymorphism instead of tedious switch statements (we subclass Value for each format). A nice win, as long as the performance doesn’t hurt. There are some tricks (like Mutable objects) which we can introduce if it ever comes up in a performance profile, but we’ll wait until we see the problem before trying to fix it!
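The rough shape of the refactoring, as an illustrative sketch (names are mine, not the actual classes): dispatch on the format byte is confined to a single decode factory, and each subclass owns its own behaviour.

```java
import java.nio.ByteBuffer;

// Sketch of the Value hierarchy: one subclass per format, one decode factory.
public abstract class Value {
    public abstract byte[] asBytes();

    // Decode a stored buffer: the first byte selects the subclass.
    public static Value decode(ByteBuffer buffer) {
        byte format = buffer.get();
        switch (format) {
            case 0: { // raw bytes
                byte[] raw = new byte[buffer.remaining()];
                buffer.get(raw);
                return new RawValue(raw);
            }
            case 1: // 64-bit integer
                return new IntValue(buffer.getLong());
            default:
                throw new IllegalArgumentException("Unknown format: " + format);
        }
    }

    static class RawValue extends Value {
        private final byte[] data;
        RawValue(byte[] data) { this.data = data; }
        @Override public byte[] asBytes() { return data; }
    }

    static class IntValue extends Value {
        private final long value;
        IntValue(long value) { this.value = value; }
        // One possible external representation: the decimal string as bytes.
        @Override public byte[] asBytes() { return Long.toString(value).getBytes(); }
    }
}
```

Callers then just work with Value objects and never see the format byte again.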