I’m building a set of open-source cloud data-stores in December, blogging all the way.
In our last episode, we figured out how to approach JSON and realized that having a key-value interface to structured storage is probably not generally useful.
The plan for today was to split out the code into a key-value store and a structured store, start collecting JSON key metadata, and then work on a SQL parser. We
got most of the way there, though the SQL support will have to wait for tomorrow!
The refactor into separate projects went well; I also created a new project for shared server-side code (e.g. the BTree implementation). The code feels much better organized now, especially after I moved some of the BTree tests to test the BTree directly, rather than going through an API. (Now that the BTree is shared, it is a ‘service’, so I think I’m justified in still considering this an integration test, and continuing my prejudice against unit tests.) So the tests run faster now; this is good. There’s a nasty bug at the moment with shutting down services correctly, which means that the tests fail if you run more than one suite in the same process; this is bad. I tried to attack this today, but it looks like the problem is in one of the libraries I’m using.
Next up, we wanted to build an index of keys: this should allow efficient JSON string encoding, and it may be useful to have the list of keys available to callers or for SQL parsing. There were several steps involved; most importantly, this was our first compound operation (a single insert of a JSON object will now perform multiple actions against the data-store). Once we have secondary indexes, most inserts will be compound operations, so this is important functionality. We now have two interfaces for two types of operation: RowOperation and ComplexOperation. Both are executed as a single transaction, but a RowOperation operates on a single row, whereas a ComplexOperation can run multiple RowOperations, one for each row it wants to change. It’s a fairly simple change that is actually very powerful; we did all the hard work when we implemented transactions, we just haven’t really been using them yet!
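To make that concrete, here’s a minimal sketch of how I think about the two interfaces; the method names and helper types are illustrative, not the actual code.

```java
import java.util.List;

// Stand-in for the transactional view of the store; illustrative only.
interface TransactionView {
    byte[] get(byte[] key);
    void put(byte[] key, byte[] value);
}

// Operates on exactly one row, inside the enclosing transaction.
interface RowOperation {
    void apply(TransactionView txn);
}

// Produces several RowOperations, one per row it wants to change; the
// engine runs them all within a single transaction, so they commit (or
// fail) atomically.
interface ComplexOperation {
    List<RowOperation> buildRowOperations(TransactionView txn);
}
```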
So now, whenever we insert a JSON object, we do that through a new StructuredSetOperation (which is a ComplexOperation because it inserts multiple rows). It inserts the value, but then loops through each of the keys, checks whether each is already in the system index, and if not, inserts two records within the system namespace (one mapping from id to name, and one back the other way). This obviously has problems: for example, when people start dynamically generating keys, we’ll end up with a huge dictionary. Before we worry about that, though, I thought it was more important to figure out whether we want this key-index at all. It might be that we don’t need it for SQL parsing, and other encodings may be better or good enough!
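Roughly, the key-dictionary part of that looks like the sketch below. In the real StructuredSetOperation the lookups and inserts are rows in the system namespace, executed inside one transaction; the in-memory maps here just stand in for those rows.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of interning JSON keys into a key dictionary.
// The real version runs inside a single transaction against the system
// namespace, so there is no separate concurrency handling here.
class KeyDictionarySketch {
    private final Map<String, Long> nameToId = new HashMap<>();
    private final Map<Long, String> idToName = new HashMap<>();
    private long nextId = 1;

    // Called once per key of the JSON object being inserted.
    long internKey(String keyName) {
        Long existing = nameToId.get(keyName);
        if (existing != null) {
            return existing; // key already in the system index
        }
        long id = nextId++;
        // Two records: one mapping id -> name, one mapping name -> id.
        idToName.put(id, keyName);
        nameToId.put(keyName, id);
        return id;
    }
}
```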
SQL querying against non-relational data-stores is “hot” at the moment, mostly focused on querying big data stored in Hadoop. The Hive project has been around for a long time, offering something similar to SQL but not quite SQL. More recently, Cloudera has released Impala and Facebook has just released PrestoDB. Both of these new projects are aiming for full SQL compatibility, and both aim to be much faster than Hive by bypassing Map/Reduce and running directly against the underlying data. Map/Reduce is not exactly slow, but it is focused on long-running jobs and on scale-out, rather than on running a short-lived job quickly. There are also efforts to make Map/Reduce itself run short jobs faster, but - as is always the case with open source - a lot of people are trying a lot of different, overlapping things: may the best solution win!
Both Impala and PrestoDB are faster than Hive, but neither of them is really optimized for ultra-fast queries (e.g. a single primary key lookup). Nonetheless, if we could use one of these two to execute our SQL commands, then we would have a system that also works for big data, and we’d avoid having to write our own SQL engine. SQL is quite tricky to parse (it’s grown surprisingly big over the years), and is very tricky to run efficiently (because it is declarative, not imperative). I think this complexity is what has caused the NoSQL data-stores not to implement SQL, even as they start to implement their own query languages. Nonetheless, the wealth of SQL-compatible tooling and the productivity of a good SQL execution engine mean that if we could have SQL, I think we should.
Another interesting gotcha with Impala and PrestoDB is that they’re designed for partitioned data, and don’t really support secondary indexes. You might split up your weblogs into one file per day, and then the optimization is simply to ignore files that are excluded by any date criteria you specify in your query. But if you specify a different filter, the system has to look at every record. With a secondary index, the data-store would normally first find the matching rows in the secondary index, and then retrieve only those rows. (It’s actually not quite that simple, because if we’re going to be retrieving more than a small fraction of the records, it is often faster just to stream through all of them. I think that’s why none of Map/Reduce, Hive, Impala or PrestoDB really supports secondary indexes.)
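As a toy illustration of that trade-off: a planner only wants the secondary index when the filter is selective enough. The threshold below is an invented rule of thumb, not a number measured against any real system.

```java
// Toy cost rule for choosing between a full scan and a secondary-index
// lookup. The 10% threshold is an invented rule of thumb for illustration.
class AccessPathChooser {
    enum AccessPath { FULL_SCAN, SECONDARY_INDEX }

    static AccessPath choose(long totalRows, long estimatedMatchingRows) {
        double selectivity = (double) estimatedMatchingRows / totalRows;
        // Random index lookups only beat streaming through every record
        // when a small fraction of the rows qualify.
        return selectivity < 0.10 ? AccessPath.SECONDARY_INDEX
                                  : AccessPath.FULL_SCAN;
    }
}
```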
Secondary indexes are also what makes query planning hard: to translate declarative SQL into an imperative execution plan, the system has to choose which indexes to use. When you only have one data-access method, the choice is easy; but secondary indexes produce a large (exponential) number of plans for the system to choose from. The query planner is the piece that makes that choice, and a good query planner is the hardest thing about writing a good SQL database. Impala and PrestoDB don’t (currently) have full query planners.
The H2 database is a nice, traditional open-source SQL database (again in Java). It does have a good query planner and support for secondary indexes, so it’s tempting to start with H2. However, if we did that we wouldn’t have all the nice big-data stuff that PrestoDB gives us, like parallel query execution on multiple nodes. H2 is also much more rigid about schemas than PrestoDB is, and we would ideally want schema flexibility to cope with JSON’s lack of a fixed schema.
I spent the latter part of the day experimenting and trying to decide between these options. I discounted Impala early, not because it’s a bad system, but because it uses C++ for speed, which makes it harder to hack apart for our own purposes. I think it’s also questionable whether multi-tenant services should use “unsafe” languages like C++, although sadly we’re probably still years away from memory-safety exploits being the low-hanging fruit.
PrestoDB gets its speed through dynamic JVM class generation, which is also not exactly easy to work with, but the system as a whole has been designed to make it easy to plug in extra data-sources.
H2 gets us a lot, but it is really tied to the traditional model, and going from there to a distributed query engine like PrestoDB is probably harder than adding H2’s functionality to PrestoDB. Using both PrestoDB and H2 would seem like a good option, but then we’d have to maintain two SQL engines that won’t be entirely compatible for the end user.
So PrestoDB seems like the winner. Sadly, though, PrestoDB’s query execution path is very complicated (to be fair, it is doing something pretty complicated!). It has a great SQL parser and logical execution planner, but then it’s really designed for distributing that query
across multiple execution nodes, and gathering the results back. That’s great functionality that we’ll likely want one day, but it has a high overhead in terms of code complexity and thus in terms of execution time. I’m
pretty sure that the overhead is too high to use it for a primary key lookup, for example, where modern databases can sustain 100k queries per second or more.
My current plan is to adopt a hybrid model: use PrestoDB’s basic infrastructure (parser, basic planning) to get from SQL into a nicer structure for execution. From there, I hope to check whether the query is simple enough for me to execute directly. If it is, I’ll execute it using specialized code; otherwise, I’ll send it to PrestoDB for full processing. For now, I will probably just have complex queries return a “not yet implemented” error!
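In code, the dispatch I have in mind looks something like this; the “simple query” check and the fast-path executor are placeholders, since I haven’t worked out the real PrestoDB integration points yet.

```java
// Sketch of the planned hybrid dispatch. Everything here is illustrative:
// the "simple query" check and the fast-path executor stand in for
// whatever the real integration with PrestoDB's parser/planner becomes.
class HybridQueryEngine {
    interface FastPathExecutor {
        // e.g. a single-row primary-key lookup, run directly against the
        // structured store without the distributed execution machinery.
        Object execute(String sql);
    }

    private final FastPathExecutor fastPath;

    HybridQueryEngine(FastPathExecutor fastPath) {
        this.fastPath = fastPath;
    }

    Object execute(String sql) {
        if (isSimpleQuery(sql)) {
            return fastPath.execute(sql);
        }
        // Eventually: hand off to PrestoDB's full planner and distributed
        // execution. For now, complex queries are simply refused.
        throw new UnsupportedOperationException("not yet implemented");
    }

    private boolean isSimpleQuery(String sql) {
        // Placeholder heuristic; the real check would inspect the parsed
        // plan (single table, primary-key equality predicate, no joins).
        String s = sql.toLowerCase();
        return s.startsWith("select") && !s.contains("join") && !s.contains("group by");
    }
}
```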
This isn’t ideal, though, because I’d have to reimplement a lot of what PrestoDB already does. So I’m basically spelunking around PrestoDB’s codebase, trying to understand how it all fits together and where I can take it apart. More thought is needed here, but I also need to just try things out and learn by making mistakes. The plan for tomorrow is to get some form of SQL querying wired up, even if it has to be special-cased and duplicates code!