I’m building a set of open-source cloud data-stores in December, blogging all the way.
In our last episode, we laid the foundations for querying by implementing a table-scan, and added support for keyspace partitions and for values of different types (byte strings and integers for now).
Today was one of those days with a lot of thought but not a lot of code to show for it. Apart from a quick refactor, all I coded was support for JSON values. More important is the approach I think I’ve settled on for how to actually deal with JSON.
We would ideally like to be able to access one data store from different endpoints: a key-value endpoint (e.g. Redis), a document store endpoint (e.g. MongoDB) and a relational endpoint (e.g. Postgres).
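Roughly, I mean something like this (a sketch only; none of these names are real code): the endpoints are thin adapters that all translate onto one shared store.

```java
// Hypothetical sketch: one logical store, several protocol endpoints.
interface LogicalStore {
    byte[] get(byte[] keyspace, byte[] key);
    void put(byte[] keyspace, byte[] key, byte[] value);
}

// Each endpoint speaks its own protocol (Redis-like, Mongo-like,
// Postgres-like) but maps every operation onto the same store.
class KeyValueEndpoint {
    private final LogicalStore store;

    KeyValueEndpoint(LogicalStore store) {
        this.store = store;
    }
    // e.g. a Redis-style GET/SET would call store.get / store.put here
}
```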
I described yesterday how I think that document stores and relational databases actually can support the same data model, even if relational data stores usually choose not to support nested document structures. (Google’s F1 and Postgres’ JSON type are the two ‘exceptions that prove the rule’.) So we can map between document and relational models “easily”. We can think of our data in terms of JSON, even if we choose to store it in a different representation (this is the same as the distinction between the logical data model and the physical data model).
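For example, the document { “name”: “Alice”, “address”: { “city”: “Paris” } } can be viewed as the columns name and address.city. A toy sketch of that flattening (hypothetical code, not from the project):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy sketch: nested document keys become dotted relational-style
// column paths, e.g. address.city.
class JsonFlattener {
    static Map<String, Object> flatten(Map<String, Object> doc) {
        Map<String, Object> columns = new LinkedHashMap<>();
        flattenInto("", doc, columns);
        return columns;
    }

    @SuppressWarnings("unchecked")
    private static void flattenInto(String prefix, Map<String, Object> doc,
                                    Map<String, Object> columns) {
        for (Map.Entry<String, Object> e : doc.entrySet()) {
            String path = prefix.isEmpty() ? e.getKey() : prefix + "." + e.getKey();
            if (e.getValue() instanceof Map) {
                flattenInto(path, (Map<String, Object>) e.getValue(), columns);
            } else {
                columns.put(path, e.getValue());
            }
        }
    }
}
// { "name": "Alice", "address": { "city": "Paris" } }
// -> columns: name = "Alice", address.city = "Paris"
```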
What is trickier is mapping between the key-value and document-store/relational models. It’s fairly easy to map data from the richer model to the simpler key-value store, by exposing it as JSON strings. However, it’s not clear how we should map back. We could wrap the values in JSON, i.e. “hello” <-> { “value”: “hello” }. But if a key-value entry is inserted with a value that happens to be valid JSON, should we treat that as a string or as an object? Presumably a caller setting JSON wants it treated as such, but we really don’t want to choose a different code path based on whether the input value is parseable as JSON: that kind of content-sniffing is how you end up with problems like HTML injection attacks. We would want to rely on metadata instead, but the existing key-value protocols don’t support any, because metadata doesn’t mean anything in the key-value model.
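Here’s the asymmetry in miniature (again just a hypothetical sketch): wrapping is mechanical, but unwrapping needs a decision we can’t safely make from the bytes alone.

```java
// Hypothetical sketch of the wrap/unwrap asymmetry.
class KeyValueMapping {
    // Wrapping is always safe and unambiguous:
    // "hello" -> { "value": "hello" }
    static String wrap(String rawValue) {
        return "{ \"value\": " + quote(rawValue) + " }";
    }

    // Unwrapping is not: given the stored bytes {"a":1}, did the caller
    // mean the string "{\"a\":1}" or the object {"a":1}? Deciding by
    // sniffing the bytes is the same mistake that enables HTML
    // injection; the answer has to come from metadata, which the
    // key-value protocol can't carry.

    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }
}
```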
I think that, while it’s theoretically nice, it’s not necessarily that useful to e.g. set your values using a key-value protocol but retrieve them using SQL. There’s also the question of what the point of the key-value endpoint is: once we know that the endpoints are just different views onto the same data-store, why wouldn’t clients use the more powerful API, even if they choose not to use its full functionality?
Instead, it would be easy to use the existing code to build a key-value proxy that provides whatever mapping rules the user wants (supplying the metadata at setup time, rather than per-call). So what I’ll likely do tomorrow is split the code into two separate services: a key-value service and a JSON-store service. That will also make me happier about the code duplication I introduced when adding JSON support!
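The shape of such a proxy might look like this (hypothetical names; the point is that the mapping rule is fixed once, at configuration time, so no per-call metadata is needed):

```java
// Hypothetical sketch: a key-value proxy in front of the JSON store.
// The mapping rule is chosen at setup time, not per call.
enum ValueMapping {
    OPAQUE_STRING, // always wrap values as { "value": ... }
    JSON_DOCUMENT  // values are declared up-front to be JSON
}

class KeyValueProxy {
    private final ValueMapping mapping;

    KeyValueProxy(ValueMapping mapping) {
        this.mapping = mapping;
    }

    // Translate an incoming key-value SET into a document for the store.
    String toDocument(String rawValue) {
        if (mapping == ValueMapping.JSON_DOCUMENT) {
            return rawValue; // the user told us this is JSON
        }
        // Otherwise treat it as an opaque string; no sniffing.
        String escaped = rawValue.replace("\\", "\\\\").replace("\"", "\\\"");
        return "{ \"value\": \"" + escaped + "\" }";
    }
}
```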
Continuing with more thinking than typing, I then looked into binary representations for JSON. BSON (as used by MongoDB) is basically a dead-end: it isn’t efficient in terms of CPU or storage space, and it stores repeated keys repeatedly (which is problematic because it encourages callers to use short, obscure keys rather than self-describing ones). I looked at some of the alternatives out there: Smile is fairly nice; it’s compact and reasonably efficient to parse, and it uses a dictionary approach, reminiscent of a simple LZW, that avoids double-storing repeated keys. UBJSON is fast to parse and reasonably space-efficient, though it doesn’t avoid double-storing repeated keys.
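The dictionary trick, reduced to a toy (this shows only the idea, nothing like Smile’s actual encoding): the first occurrence of a key is written in full and assigned an id; later occurrences write only the id.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration of key back-references: each distinct key string is
// stored once and referenced by a small integer id thereafter.
class KeyDictionary {
    private final Map<String, Integer> ids = new HashMap<>();
    private final List<String> keys = new ArrayList<>();

    // Returns the id for a key, registering it on first sight.
    int idFor(String key) {
        Integer id = ids.get(key);
        if (id == null) {
            id = keys.size();
            keys.add(key);
            ids.put(key, id);
        }
        return id;
    }

    // Resolves an id back to its key string when decoding.
    String keyFor(int id) {
        return keys.get(id);
    }
}
```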
However, we have an additional consideration: we probably want to keep track of metadata if we want to use SQL as our query language. Some SQL query parsers benefit from having the “column names” available, as does a lot of SQL tooling (e.g. ActiveRecord). Obviously, recording every key will be problematic if users start putting in data with dynamic column names, but we’ll cross that bridge when we come to it.
So we can imagine storing every JSON key in a lookup table in the database, and then using that table to replace key strings with references into the shared dictionary. This is similar to interned strings, and - as with interned strings in memory - it would let us optimize string equality comparison, because it suffices to compare the values of the “pointers”. So this may prove a pretty big win.
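The payoff, sketched (hypothetical types): once the dictionary assigns each key string exactly one id, key equality becomes a single integer comparison.

```java
// Sketch of the equality win from a database-wide key dictionary.
class InternedKey {
    final int id; // index into the shared key dictionary

    InternedKey(int id) {
        this.id = id;
    }

    @Override
    public boolean equals(Object other) {
        // No string comparison needed: equal id implies equal key,
        // because the dictionary assigns each string exactly one id.
        return other instanceof InternedKey && ((InternedKey) other).id == id;
    }

    @Override
    public int hashCode() {
        return id;
    }
}
```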
We don’t necessarily have to implement this all immediately, but we now have a long-term plan for JSON. When talking to the client, we can use normal (text) JSON or a binary format like Smile, or in theory any JSON representation that becomes popular (even - yuk - BSON!) Internally, however, we can store the data in a different format. We’ll likely use a binary format similar to Smile, but with a database-wide key-dictionary, making use of our key-metadata collection to get efficient string comparison. We’ll store the key-metadata in a separate keyspace; we originally built keyspaces for indexes, but we can use them for this as well.
As a short-term plan, it should suffice simply to start collecting the key-metadata and verify that it meets our requirements for SQL. We can continue to store our JSON as text (unoptimized) for now, until we’re sure the approach is useful.
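That collection might look something like this (I’m assuming Jackson for the parsing here; in reality the keys would go into their own keyspace rather than an in-memory set):

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Sketch of the short-term plan: keep storing documents as plain JSON
// text, but record every key we see so the metadata is ready for the
// SQL layer.
class KeyMetadataCollector {
    private final ObjectMapper mapper = new ObjectMapper();
    private final Set<String> seenKeys = new TreeSet<>();

    void record(String jsonText) throws IOException {
        collect(mapper.readTree(jsonText));
    }

    private void collect(JsonNode node) {
        if (node.isObject()) {
            Iterator<Map.Entry<String, JsonNode>> fields = node.fields();
            while (fields.hasNext()) {
                Map.Entry<String, JsonNode> field = fields.next();
                seenKeys.add(field.getKey());
                collect(field.getValue());
            }
        } else if (node.isArray()) {
            for (JsonNode child : node) {
                collect(child);
            }
        }
    }

    Set<String> keys() {
        return seenKeys;
    }
}
```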
So tomorrow: splitting out the two services, collecting key metadata, and hopefully adding a SQL parser.