Justin Santa Barbara’s blog

Databases, Clouds and More

Building an ACI Directly From Maven

Last time, I posted about building an ACI (Application Container Image) using OS packages, and in particular how I was using this to build a Java base image.

The reason for wanting a Java base image was - of course - to run Java applications. In particular, I wanted to be able to do this direct from Maven, which is (still) the dominant build tool for Java.

Spotify has a pretty good maven plugin for Docker, so I started with that. It was pretty simple to add support for ACI.

With my ACI plugin for maven, building an ACI image requires registering the plugin in the pom.xml, along with configuring the command line, like this. It’s still more work than I would like, but it is very copy-and-pastable.

This was pretty easy to implement. The biggest challenge was creating a library for writing ACIs from Java: appc-java. ACI are not that different from Docker images. One huge difference was that it is possible to build an ACI securely (i.e. without requiring root or running arbitrary code) because ACIs are designed to be buildable without running code. Dockerfiles make for a great demo, but they are very difficult to secure.

It’s also much faster to just write a tarfile than it is to spin up a container!

I’m not sure yet whether I should try to contribute this work back into Spotify’s plugin, or whether I should make this a permanent fork. On the one hand, contributing back is good manners, but on the other ACI feels like something that is genuinely new, with a lot of capabilities that would be much harder to maintain in a Docker plugin. I’m going to keep tinkering and see how it goes!

Building an ACI Image From Packages

Containers are the new frontier of application packaging, as witnessed by the rise of Docker. Docker popularized the idea of using containers, combining simple command line tooling with a central image repository for easy sharing of images. Now that containers are moving past MVP, some of the limitations of the Docker model are becoming more apparent, and there are a number of projects that are trying to fix some of Docker’s design mistakes.

CoreOS announced rkt (pronounced rocket) last year. Rkt is creating a workable and open specification for a container image format, called the App Container Image or ACI. rkt is an implementation of a whole family of these open specifications, but I’m starting my investigations at the beginning!

ACI is deliberately minimal: it is a tarball, with a manifest JSON file at the root (/manifest) and a filesystem tree for the container (/rootfs). There are a lot more features, but today I wanted to start by figuring out an easy way to build these images.

The “absolute right way” to build a container image is to compile exactly what you need, and copy just that into a tarball. But this process is fairly hard, and essentially amounts to maintaining your own (special-purpose) distro. (Which does suggest that it would be interesting to try this with gentoo, but I digress…)

An alternative is to leverage the work done by the distros, in my case Debian Jessie packages. packages2aci is a simple “tool” I created (just a batchfile really) which can create an ACI based on a list of packages and a manifest.

packages2aci goes through a list of packages, and individually downloads them (using apt-get download), expands them using dpkg-deb, and then calls actool to build the manifest. Incredibly simple, apart from the fact that you have to specify each package individually (it doesn’t do dependency analysis). This is both annoying (busywork) and handy (sometimes you know that you don’t actually need some library).

With packages2aci, it is very easy to build simple ACI images from your OS packages, and start running rkt images. There’s an example for Java 7, which just runs java -version, which isn’t very useful but is a good first step. Next time, I’ll post about why I want a Java base image!

On Panamax

Things I learned from the panamax.io competition

Panamax.io is a new piece of open source software, from CenturyLink. Really, it’s from Lucas Carlson and the team behind PHPFog/AppFog. Centurylink is a fairly traditional telecoms company that has been buying into cloud by snapping up some interesting players: notably Savvis, AppFog and Tier 3. Rumor has it that Rackspace might be next (1). It’s a great strategy; most of the communications incumbents have good cash flow but declining businesses as we move towards dumb pipes; it makes a lot of sense to invest that cash in businesses that have more future growth potential.

With Panamax, we’ve got our first look at what those new businesses might look like. Panamax combines CoreOS, fleet and Docker, but puts a nice graphical front-end on it. It’s described as “Docker for humans”. I’ll be honest: it’s really early and I still found a number of bugs which rather mean it’s not quite ready yet for production usage by average humans, but you can definitely see where this is heading: taking Docker and making it mass-market, instead of a geek plaything. It solves some real pain-points with Docker, and it’s definitely one to watch; they are fixing the issues and it’s getting better all the time.

One of the biggest shortcomings I encountered was that I had all these great templates that I could install in seconds, but I would have to wait a long time for it to download (mostly because of the Docker registry’s questionable design). So I created a proxy-server template; with just a few clicks you can cache all those big downloads.

I also tried creating a more ambitious ELK stack (ElasticSearch, LogStash, Kibana), but even with the faster downloading that I was able to get by using my cache, there are still a few problems (my own, not Panamax’s) that I couldn’t get ironed out in the time I had. So the ELK stack template maybe isn’t quite ready for humans either. But, like Panamax, it will improve rapidly!

(1): I have no inside knowledge on Rackspace/Centurylink, and the asking price for Rackspace would seem to make it a difficult purchase for Centurylink. Strategically it makes a lot of sense: Rackspace has a great customer base, and the people are top-notch on both the business and technical sides. Despite OpenStack’s problems, I don’t see how a non-OpenStack strategy makes sense for anyone other than AWS. And to the extent Rackspace is struggling, it is mostly because everyone in the business is struggling to compete with AWS. I have some ideas on how to compete here also, and it probably involves things that look a lot like Panamx, but I digress… Underpin Rackspace’s efforts with the solid cashflow from another business and you have a real contender. Combine it with technologies like Panamax, CoreOS & Docker and things get really interesting.

Day 16: Cache Is King

I’m building a set of open-source cloud data-stores, blogging all the way.

In our last episode, we created a block store, exposed over iSCSI. We also added a native protocol buffers client to our key-value store.

One thing I really want to add is encryption. I’ve always been wary of storing data on the cloud; it’s not really a “post-Snowden” thing, as a determined enough attacker can always get access to data. Rather it’s about limiting exposure and defense-in-depth; so that a single mistake doesn’t expose everything. Making sure that all data is encrypted at rest, and that the keys aren’t stored alongside the data (or ideally nowhere at all) is a good foundation. A backup or disk access doesn’t expose the data; you have to get inside the running process to get the keys or plaintext, and even then you can only unlock whatever data is in use at the time. It is much easier just to capture the credentials!

I’ve implemented this a few times across various projects, and I’ve basically settled on the idea of deriving a key from the user’s credentials, and then using that key to unlock the actual encryption keys (possibly repeating the key-unlocking-key chain). This means that without the password, the data is completely inaccessible, even to administrators. The big downside is that we can’t do any data processing unless an authorized user is requesting it, but this is probably a good thing.

To mmap or not to mmap

The big problem we face is that we’re currently mmap-ing the data. This means that we simply can’t have a different view than the kernel exposes. I thought about using ecryptfs, but I worried that it would be relatively complicated and would really tie us to this data format; it would also probably mean that all in-use data was trivially exposed through the filesystem. I thought about using tricks with mprotect and hooking SIGSEGV, but this seems like it would involve a lot of tricky syscalls and would still have problems e.g. with making sure we only wrote encrypted data out. I found uvmem, which is a very cool patch that QEMU uses for remote paging, and debated creating a custom kernel module, but likely this would just be re-implementing ecryptfs.

In short, using kernel encryption was going to be complicated. More importantly, no “real” databases use mmap: the database knows more than the kernel about its own memory usage patterns, and so it can manage its memory better than the kernel. In theory, at least, though I’d love to see some work on exposing these hints efficiently to the kernel. But it does seem to be true in reality as well: database that naively rely on mmap (like MongoDB and Redis) perform catastrophically badly once data exceeds memory.

Instead, all databases manage their own cache through a pool of buffers, which are populated with pages in-use and recently in-use. If we had a buffer pool, we could easily implement encryption and probably compression as well. Another big win is that we could also easily page data in from cloud object storage on-demand. So, if we had a 64GB data store that we needed to recover onto a new machine, we wouldn’t have to wait to download it all before using it. There’s nothing here that the kernel couldn’t do; but the hooks just aren’t there.

Storm in a btree-cup?

One very difficult problem to solve with mmap is that kernel paging interacts poorly with event-driven I/O or with (Go-style) userspace threads. A page-fault blocks the whole thread; in the thread-per-request model that is acceptable, but in the event-driven model that thread is likely serving a large number of requests all of which are unfairly blocked. So you typically end up using a thread pool to process requests, which looks a lot like the thread-per-request model again!

On the other hand, the paper which set the stage for VoltDB shows that buffer management is a huge component of the overhead of a traditional database. Those arguments haven’t yet been fully resolved. If I had to summarize, I would say that if the database fits in memory then eliminating buffer management is a good thing (like VoltDB, MongoDB, MemSQL and Redis are trying). If the database doesn’t fit in memory, then all databases perform badly, though databases that manage their own buffers seem to perform (much) less badly.

Tilting at windmills

So today I looked at building our own caching layer. I’m not sure we’ll end up using it, so I chose to make it pluggable. I started off by better abstracting out the PageStore (which was a big design improvement, even if we don’t end up using a new PageStore). Hopefully later we can revisit this and actually compare buffer management to no-buffer management!

It also gives me a chance to play with the new asynchronous I/O in Java 7. We saw just yesterday the importance of async I/O for hard disks; when we’re reading from the cloud it’ll likely be just as important!

It proved to be fairly complicated, but I eventually ended up with a reasonable preliminary buffer cache implementation. It relies on a caching map from Guava, and a memory pool from netty, both of which are awesome but aren’t really designed for our particular requirements, so we’ll probably have to replace them soon.

The implementation is fairly straightforward - too simple, in fact. We already have transaction objects, so the transaction objects act as a nice holder for any in-use buffers, which means we don’t have to change too much code yet. We release the buffers only at the end of the transaction, which again simplifies things. We allow the cache to grow bigger than the target size, so we don’t yet worry about forcing page evictions.

All of these simplifications will have to get fixed, but they set the stage for what we really want to work on. Tomorrow: encryption. If that’s a bust, then we’ll probably end up throwing away the manual buffer management we did today!

Day 15: Block Storage

I’m building a set of open-source cloud data-stores, blogging all the way.

In our last episode, we built a file server backed by our cloud data-stores, and exposed it over WebDAV.

That was before the Christmas break, and it’s time to get going on my little project again. Paul Graham is definitely right when he talks about the Maker’s Schedule! I did manage to get some time over the Christmas break, but not the uninterrupted hours that actually produce real progress. I knew that was going to be the case, so I decided to do something a bit different, and worked on a block-storage server (like Amazon EBS or Rackspace Block Storage) that exposes a virtual hard disk over iSCSI. iSCSI is a very complicated protocol, so it actually worked fairly well to be able to work on it in small chunks, rather than going completely crazy trying to do it in one sitting.

Block Storage Architecture

Block Storage provides a virtual hard disk; it is similar to the file server that we exposed over WebDAV, but works with disk blocks instead of files. It’s a much better match for storing changing files, like in a traditional relational database. I happen to think that this is not a great fit for the cloud, but until our structured store is as good as relational databases then there will still be a need!

The architecture I went with was very similar to the file server: we store the actual blocks of data in cloud object storage (as immutable hashed blobs), and we store the (smaller) mapping from each address of the disk to the relevant blob in our key-value store. Each of these systems is distributed & fault tolerant, so our block storage is automatically distributed & fault tolerant. We should be able to support de-duplication and compression fairly easily, though I haven’t yet implemented anything particularly advanced.

Better the iSCSI devil you know…

I chose to expose the block device over iSCSI. There are a few other alternatives: NBD is a much simpler protocol, which would have been easier to implement, but Windows support was lacking; I also worried - probably unfairly - that it was so simple that it wouldn’t allow for some interesting optimizations. Another alternative was AoE, but this operates at Layer 2 so would have been painful to implement in Java, and doesn’t have as good support as iSCSI. The Layer 2 thing seems like a poor match for the cloud also, where we invariably have complicated networking and likely want to be in multiple datacenters.

iSCSI - though it is incredibly complicated - is very well supported, including directly by QEMU/KVM and probably other hypervisors also. I figured that if I could get it to work, it would be worth the pain.

iSCSI resources

There is a Java iSCSI server library, called jSCSI, which was very helpful in understanding everything, but I wanted to build it from scratch to see what performance optimizations I could find. The best resources on iSCSI I found were the iSCSI RFC; iSCSI basically takes SCSI and adds an Internet transport to it, so it was necessary also to refer to Seagate’s SCSI Reference to get details on the SCSI commands themselves.


Implementation was fairly tedious, but I eventually got iSCSI working sufficiently to support QEMU running directly from the iSCSI volume. Other clients will undoubtedly need additional support, but the core functionality is present and working. (One of the oddities of SCSI is that every command comes in multiple versions, with different sizes for block addresses - I guess that’s what happens when hard disks grow by a factor of almost 1 million!)

The big optimization that I’ve implemented so far is that SCSI allows asynchronous operation, and in particular allows for writes to be buffered and then flushed. This allows us to combine and defer write operations; which is important because when we do flush we have to replicate the data to multiple servers. Of course this was all implemented in SCSI because hard disks (even SSDs) are big sources of latency.

The implementation makes very heavy use of ListenableFuture, a key contribution from the Guava project. ListenableFuture are a much more useful implementation of Promises than Java’s built-in Future, and they make asynchronous multithreaded programming (almost) easy.

The other (related) optimization was to divide the disk into segments (currently 1MB). Instead of mapping each disk block to a key-value entry, we map a segment to a key-value entry. Blocks are small (512 bytes or 4KB), so we want to amortize some of the overhead here. If we’re reading or writing contiguous blocks, we can combine this into a single key-value operation (which is nice, because hard disks also favor sequential operations, so most software tries to avoid seeking around all over the disk). Also, we can create blobs that are bigger than a single disk block (as long as it’s within the same segment), and so we can normally combine several blocks into a single read or write from object storage. In theory, we can heuristically clean up complicated segments by re-writing blocks, but that isn’t implemented at the moment!

Improving the key-value store

One of the goals of building these other systems using the key-value store is to figure out where the key-value store needs to be improved. I realized we needed the idea of multi-tenancy in the key-value server: it is easy to ask for another isolated key-value store, and then within each key-value store it is also possible to have a modest number of keyspaces (keyspaces are currently all mapped into the same BTree). The idea is that you’ll use a separate store for each unrelated unit, and keyspaces to keep data organized. For example: in the file server each volume can now get its own key-value store, and we use keyspaces for the different types of data (inodes vs direntries); a keyspace is a bit like a table in a relational database. By making it easy to allocate key-value stores on demand, we can later store the key-value stores on different machines in the cluster, to scale-out. (We’ll still have a bottleneck if one store is very heavily used, but this simple sharding will push the problem a long way down the road).

We were still using the key-value store through the redis interface; it was still possible to map this enhanced functionality to Redis, but it was getting a little messy. So I implemented a native interface using Protocol buffers. Using protocol buffers is much more efficient than Redis’s protocol. As well as being able to multiplex requests and easily support asynchronous operations, we can also use the Protocol Buffer objects as our internal request representation. This means we don’t need to marshal and demarshal the requests between a wire-format and our internal format, and generally means a lot less code. It does mean we have to be very careful to treat data as untrusted though!

Taking this to the natural conclusion, we now use the wire-format Protocol Buffers for our Raft log entries. Sticking to one format throughout the whole system means even less code and (in theory) less overhead. It also demonstrates that our architecture really is just establishing a reliable consistent sequence of commands, like the old MySQL statement based replication. On that note, Jay Kreps (a technical lead in LinkedIn’s awesome SNA group) published an interesting piece on using logs to build distributed systems, which is well worth reading.

Cloudata: Christmas Break

We interrupt this scheduled broadcast….

So that I can give this little project the attention it deserves, I’m going to put it on hold and resume it in the New Year. There’s too much I want to do, to try to squeeze it in between the holiday commitments!

Happy Holidays to everyone!

Day 14: More WebDAV

I’m building a set of open-source cloud data-stores in December, blogging all the way.

In our last episode, we built a simple file server and exposed it over WebDAV. As we did with the git server, we store the actual data in cloud object storage, and the consistent metadata in our key-value store.

Today I really just extended the WebDAV support, adding support for more WebDAV/HTTP verbs: MOVE (rename), DELETE, LOCK/UNLOCK and MKCOL (mkdir). All fairly straightforward stuff; I’m trying only to talk about the interesting details, so just a few observations…

Web(DAV) for the win

I think WebDAV was the right choice of access protocols. We talked before about how SPDY & HTTP 2.0 should mean that HTTP will be “good enough” to replace most custom binary protocols, and so we can then take advantage of of HTTP’s benefits. A filesystem is a use case that will definitely push the boundaries: there’s a lot of traffic and it’s performance sensitive.

WebDAV today already gives us many of HTTP’s benefits (WAN access, encryption support, etc). File transfer and metadata querying with WebDAV over SPDY will be comparably efficient to a custom filesytem protocol, as this is HTTP’s bread and butter. Missing is change notification (inotify) and efficient filesystem writes, which again makes sense because of HTTP’s read-mostly philosophy.

We can imagine adding change notification using WebSockets. The inefficiency of writes comes because WebDAV normally requires a LOCK / UNLOCK pair around each write; we could imagine combining operations to make this more efficient (e.g. PUT-and-UNLOCK). Because of HTTP’s extensibility, this could be as simple as recognizing an X-Unlock header on the PUT, and responding with X-Unlocked if we released the lock. To be fair, this may actually be more efficient that many filesystems (where lock and unlock maps to file open and close). I think this shows that “just use HTTP” may be the right advice: we’re talking about features, not about overcoming the limitations of HTTP.

No Lock-ing

Lock and unlock support is only stub-implemented for now with an in-memory implementation. This is not “cloud-first” - it is a single point of failure. It would easy to store the locks into the key-value store. I will probably play around tomorrow to implement this in the key-value store, and then directly at the Raft level to see if something more efficient is possible.


Delete support is also interesting, because we don’t actually delete anything just yet. Because inodes can be shared, to delete data in an inode based filesystem, we need to keep a reference count on each inode. Because we’re also sharing data blobs, we have to implement either reference counting or garbage collection for blobs. Whatever approach we take for the data, I think I should probably take the same approach for inodes. I’m leaning towards garbage collection, not least because I think it will be more flexible - I’m pondering features like snapshots.

So, I’ve implemented delete just by removing the inode from the name to inode mapping; we don’t actually delete the inode. To make garbage collection or undelete a bit easier, I actually store a record of the removed data in a different area of the key-value store. We’ll see if this is useful, but it does mean that delete is incredibly fast (because we defer all the real work for later garbage collection).


Finally, I think one of the reasons this went so smoothly is that Netty imposed a nice architecture on us. There’s definitely a learning curve, but it is a great library. For performance, Netty reuses memory buffers, which requires manual reference counting instead of just relying normal Java garbage collection; this can be tricky - I had to fix one buffer management bug in my code, and I’m sure others are still lurking.


I think the holiday season is likely to interrupt my daily routine, but I’ll do my best. Hopefully tomorrow I’ll find time to look at locking and even a little bit more!

Day 13: A Cloud File-Server

I’m building a set of open-source cloud data-stores in December, blogging all the way.

The plan

In our last episode, we added very basic SQL querying to our structured data store.

Although our SQL querying is little more than a proof of concept at this stage, today I decided to do something different - trying to use our servers to build another server again, to figure out what’s missing. The goal of this project is not to build one data store, but instead to build a whole suite of data stores: key-value, document store, append-log, git-store, file server etc. Today it was the file server’s turn.

We want to support “real” filesystem semantics, not just be an object server. The idea is that you should be able to run a normal, unmodified program with it. Cloud object storage took a different approach: they don’t offer the full guarantees that a traditional filesystem offers, so they deliberately don’t expose themselves in the normal way. That’s a good “fail-safe” design principle.

However, as anyone that has used an object store will attest, what it offers isn’t that different to what a traditional filesystem offers. The main things that are different are strong consistency (vs. eventual consistency) and locking support. They also have different file permissions and metadata, but that’s really a design choice, not a true limitation.

Just as we did with Git, we can take our consistent key-value store, and use it to add the missing functionality to a cloud object store. We’ll store the actual file data in object storage, but all the filesystem metadata will go to our key-value store. We could put it into our structured store, but - for now at least - we don’t need it. Providing rich filesystem metadata indexing - trying to unify the filesystem with structured data storage - has been a dream for a long time, but there are too many failed projects along the way for us to try it: WinFS, the Be File System. If you’ve been following along, you’ll see where this idea comes from: we have a key-value store; we’re going to put metadata into it; we know key-value stores aren’t that different from strutured stores; if we used our structured store instead we could support metadata indexing. It does sound simple, but let’s get a basic filesystem running first!

I know Inodes

UNIX filesystems stores files in a slightly unobvious way. Every file has a data structure which contains its metadata (permissions, owner, size, pointers to the actual data etc). But rather than store the file’s name, instead we refer to this by a number. Each directory stores the information needed to map from file names to inode numbers. Each directory has an inode, but its data is actually a list of its children: names mapping to their inode numbers. To get from a filesystem path to a file, we step through the filesystem name-by-name, reading each directory to find the child inode, and then reading that child (which may in fact be a directory).

This may be an unobvious way to do things, but is actually a great design. Because we reference files by inode number, not name, it means we can do things like rename, delete or move a file while it is open. We can have hard-links, where multiple names refer to the same file. Every UNIX filesystem (I think) is built around this design; Windows has its roots in the FAT filesystem, which didn’t do this, and so hard-links and in-use files are to this day much weaker on Windows.

The big downside is that listing all the files in a directory can be fairly slow, because we must fetch the inodes for every file in the directory if we want the metadata. This is why the default function for listing the files in a directory (readdir) doesn’t return the data in the inode.

If we’re going to build a filesystem, it might be possible to build something to a different model, but it will be tricky to expose it well to a modern operating system because you’ll have to translate between the two metaphors. In short…

You're going to have a bad time

Mapping to the cloud

I stored the inodes in the obvious way: each key maps from the inode to a value containing the metadata. I actually used Protocol Buffers for the value store, as it’s easy, extensible and has reasonably high performance. We will never get the raw performance of a fixed C data structure using it, but we’re not going to win any benchmarks in Java anyway. (Actually, that might not be true: the way to win benchmarks in a higher-level language is by making use of better algorithms or approaches. But not today!)

I stored the directory data by mapping each directory entry to a key-value entry. The value contains the inode of the child. We want the key to support two operations: list all children in a directory, and find a particular child of a directory by name. For the former, we require the directory inode to be a prefix of the key (so our query becomes a prefix/range query). For the latter, we want to include the name in the key. Therefore, our key structure is the directory inode number followed by the name of the file. Simple, and works!

For storing data, we do the same thing we did when we implemented the git server. We store the file content itself on cloud object storage - it is, after all, designed for storing lots of large objects. Small files may be more problematic, because of the overhead: this problem occurs in “real” filesystems as well; they normally end up storing small files in the file inode itself. We could store the file content using the inode identifier; instead we hash the file content and store it using its (SHA-256) hash for the name. Again, this is just like Git. It has the advantage that we get de-dup for free; it has the disadvantage that cleaning up files on delete is harder, because file content may be shared. Git gets round this by never deleting content in normal operation (which makes sense for version control); for now, we’ll also ignore the “garbage collection” problem.

A downside is that the files in object storage aren’t named meaningfully. It would be great if the file called “directory1/file1” was stored under that name in object storage. That just isn’t possible in our design. This may actually be a good thing, in that we really don’t want to encourage people to “go behind our back” and work through the object storage interface.

The other big downside is that we don’t have good support for partial file writes (yet). You want to use this as a simple filesystem, not to store your database files.

WebDAV support

The hardest thing was actually figuring out how to expose this to the operating system as a real filesystem. FUSE is excellent, but would mean everyone would need to install a ‘driver’. The Windows shared filesystem protocol (CIFS) has universal support, but has a reputation as being slow and complicated. I thought about NFS, but I thought it would be tricky to get permissions and WAN support right. WebDAV seems to be a winner: it can be mounted natively on every major OS (it was the basis for iDisk on the Mac, and for ‘Web folders’ on Windows). Because it’s based on HTTP it can also easily be used at the application level, as well as by mounting it in the kernel. Best, it works on a “whole file” model, and doesn’t work well with partial writes, so it maps well to our capabilities. Annoyingly, every OS seems to have weird edge cases/bugs, but it seems like a great place to start. We might add NFS later!

I looked at the libraries that are out there, in particular Milton seems great. It seemed a bit orientated towards exposing your own data as a filesystem, rather than a raw WebDAV implementation. So, based on the Milton code, I coded my own. You can see the (mega) commit where we now support a filesystem using WebDAV. It only supports the basics of the WebDAV protocol (e.g. you can’t delete files), but it does enough that we can mount it in CyberDuck and in the MacOS Finder. That’s a great start… tomorrow I’ll work on filling out the support a little bit to make it useful.

So, we have a cloud filesystem - what does that mean? This is not a DropBox replacement: this only works online. It does provide a directory that is both shared and reliable. So for Wordpress, you might point your image upload folder here. You could store your Jenkins build artifacts here. You could use it for a simple filesystem based work-queue, although we can probably build a better solution here also!

Day 12: We Have SQL

I’m building a set of open-source cloud data-stores in December, blogging all the way.

In our last episode, we added multi-row transactions to our data-store, and examined the options for SQL parsing. PrestoDB seemed like the best for our requirements, though it didn’t offer fast enough execution of simple queries.

Today I got a basic SQL query to run. Just a single-table scan, and there are still some bits hard-coded, but it works. We can parse a query using the PrestoDB parser, check if it is sufficient for us to execute it directly, and (if so) run it. This is a great step forward!

PrestoDB is a fairly complicated system, but the SQL parser is actually fairly well self-contained. We have to set up a fair bit of infrastructure (like table metadata), but within a few hours I was able to get a unit test using the PrestoDB SQL parser to parse a simple SQL statement. PrestoDB also generates what it calls a “logical plan”, which is an execution plan before considering that some tables may be distributed. That’s exactly what we want, as for the moment our queries won’t be distributed.

So now the tricky bit: executing the SQL statement, using the PrestoDB plan. The idea here is that PrestoDB’s overhead won’t matter on a query that takes a second or more anyway, so we only need to concern ourselves with queries that should be fast. In particular, we really care about queries that only read from a single table and ideally use an index to narrow down the rows.

Even that is incredibly complicated; not least because PrestoDB doesn’t really support indexes, so it doesn’t give us a plan with index support (we’ll have to do that ourselves!) So, I worked towards executing a single-table query-scan (i.e. no indexes). I also put in the infrastructure for more advanced queries, even if we don’t use it yet.

Sure enough, around 5PM I finally got the first SQL execution. I take the logical plan (which is structured as a tree), and build another tree for the execution plan, using the Visitor design pattern. PrestoDB (like most SQL query engines and also code compilers) is essentially structured around a series of transformations, going from a tree that is a direct mapping of the SQL query, through lots of intermediate stages, to a final tree representing the execution flow. The final execution tree is then executed directly, often using a “pull model” where the next row is retrieved whenever the caller requests it; each node in the execution tree typically retrieves the next node by pulling rows from its children, etc. PrestoDB instead uses a pull-model, I think mostly for performance reasons: it allocates (fairly large) buffers into which is writes the rows and the passes those buffers to the dependent nodes. This results in fewer method calls and also has better memory performance. I suspect it also works much better with their dynamic code compilation approach.

I implemented a mixture of these approaches, where the rows are pushed to the reader, but function evaluation is done using the pull-model. I also implemented an experimental ValueHolder, which is a mutable buffer for a value; it’s supposed to reduce GC pressure a little (by allowing the value objects to be re-used, instead of building a new object every row). This is probably all over-engineered and I’ll rework it over time as I see what’s good and what’s not, but it seemed more important just to get something working.

The rows are then streamed into the HTTP output stream, and we can parse them in the client, just as we’ve done for key/value scans in the past.

Seeing the first SQL query succeed was a huge thing. This means we don’t have to invent our own query language, which would be painful to code, and even more painful for the callers who would have to learn another querying approach. There are always people that prefer the new thing because it’s new, but SQL seems like a much better option for everyone else!

The biggest remaining issue with SQL is that the performance isn’t great. I did some basic performance tuning, reducing logging output and re-using the SQL parser, which made things go twice as fast (in a quick micro-benchmark). Some simple profiling using VisualVM showed that PrestoDB’s SQL parsing and analysis is now the slowest phase. I could work on tuning it, but the bigger win would come from caching the parses and re-using them across identical queries. I may shelve that idea for a later time; we’re getting 300 queries per second (in a single sequential client thread) which is actually probably good enough for now. It’s also good to see that my “quick” query execution is actually quick enough that the SQL parsing is the bottleneck - it suggests that the approach works!

Day 11: I Dream of SQL

I’m building a set of open-source cloud data-stores in December, blogging all the way.

In our last episode, we figured out how to approach JSON and realized that having a key-value interface to structured storage is probably not generally useful.

The plan for today was to split out the code into a key-value store and a structured store, start collecting JSON key metadata, and then work on a SQL parser. We got most of the way there, though the SQL support will have to wait for tomorrow!

The refactor into separate projects went well; I also created a new shared project for shared server-side code (e.g. the BTree implementation). The code feels much better organized now, especially after I moved some of the Btree tests to test the Btree directly, rather than going through an API. (Now that the BTree is shared, it is a ‘service’, so I think I’m justified in still considering this an integration test, and continuing my prejudice against unit tests). So the tests run faster now; this is good. There’s a nasty bug at the moment with shutting down services correctly, which means that the tests fail if you run more than one suite in the same process; this is bad. I tried to attack this today, but it looks like the problem is in one of the libraries I’m using.

Next up, we wanted to build an index of keys: this should allow efficient JSON string encoding, and it may be useful to have the list of keys available to callers or for SQL parsing. There were several steps involved: most importantly this was our first compound operation (a single insert of a JSON object will now perform multiple actions against the data-store). Once we have secondary indexes, most inserts will be compound operations, so this is important functionality. We now have two interfaces for two types of operation: RowOperation and ComplexOperation. Both are executed as a single transaction, but a RowOperation operates on a single row, whereas the ComplexOperation can run multiple RowOperations, one for each row it wants to change. It’s a fairly simple change that is actually very powerful; we did all the hard work when we implemented transactions, we just haven’t been really using them yet!

So now, whenever we insert a JSON object, we do that through a new StructuredSetOperation (which is a ComplexOperation because it inserts multiple rows). It inserts the value, but then loops through each of the keys, checks if they are already in the system index, and if not inserts two records within the system namespace (one mapping from id to name, and one back the other way).

This obviously has problems, for example when people start dynamically generating keys we’ll end up with a huge dictionary. Before we worry about that though, I thought it was more important to figure out whether we do want this key-index at all. It might be that we don’t need it for SQL parsing, and other encodings may be better or good enough!

SQL querying against non-relational datastores is “hot” at the moment, mostly focused on querying of Hadoop-stored big-data. The Hive project has been around for a long time, offering something that was similar to SQL but not-really-SQL. More recently Cloudera has released Impala and Facebook has just released PrestoDB. Both of these new projects are aiming for full SQL compatibility, and both aim to be much faster than Hive by bypassing Map/Reduce and running directly against the underlying data. Map/Reduce is not exactly slow, but it is focused on long-running jobs and on scale-out, rather than on running a short-lived job quickly. There are also efforts to make Map/Reduce itself run short jobs faster, but - as is always the case with open source - a lot of people are trying a lot of different things that are overlapping: may the best solution win!

Both Impala and PrestoDB are faster than Hive, but neither of them are really optimized for ultra-fast queries (e.g. a single primary key lookup). Nonetheless, if we could use one of these two to execute our SQL commands, then we would have a system that will work for big data also, and we’d avoid having to write our own SQL engine. SQL is quite tricky to parse (it’s grown surprisingly big over the years), and is very tricky to run efficiently (because it is declarative, not imperative). I think this complexity is what has caused the NoSQL data-stores not to implement SQL, even when they’re starting to implement their own query languages. Nonetheless, the wealth of SQL-compatible tooling, and the productivity of a good SQL execution engine means that if we could have SQL, I think we should.

Another interesting gotcha with Impala and PrestoDB is that they’re designed for partitioned data, and don’t really support secondary indexes. You might split up your weblogs into one file per day, and then the optimization would be simply to ignore files that are excluded by any date criteria you specify in your query. But if you specify a different filter, the system would have to look at every record. With a secondary index, the data-store would normally first find the matching rows in the secondary index, and then retrieve only those rows. (It’s actually not quite that simple, because if we’re going to be retrieving more than a fraction of the records, it is often faster just to stream through all the records. I think that’s why none of Map/Reduce, Hive, Impala and PrestoDB really support secondary indexes).

Secondary indexes are also what makes query planning hard: to translate declarative SQL into an imperative execution plan, the system has to choose which indexes to use. When you only have one data-access method, the choice is easy; but secondary indexes produces a large (exponential) number of choices for the system to choose from. The query planner is the piece that does that, and a good query planner is the hardest thing about writing a good SQL database. Impala and PrestoDB don’t (currently) have full query planners.

The H2 database is a nice open-source traditional SQL database (again in Java). It does have a good query planner, and support for secondary indexes, so it’s tempting to start with H2. However, if we did that we wouldn’t have all the nice big-data stuff that PrestoDB gives us, like parallel query execution on multiple nodes. H2 is also much more rigid in its schema than is PrestoDB, and we would ideally want schema flexibility to cope with JSON’s lack of fixed schemas.

I spent the later part of the day experimenting and trying to decide between these options.

I discounted Impala early, not because it’s a bad system, but because it uses C++ for speed, which makes it harder to hack apart for our own purposes. I think it’s also questionable whether multi-tenant services should use “unsafe” languages like C++, although we’re sadly probably still years away from these attacks being the low-hanging fruit.

PrestoDB gets its speed through dynamic JVM class generation, which is also not exactly easy to work with, but the system as a whole has been designed to make it easy to plug in extra data-sources.

H2 gets us a lot, but it is really tied in to the traditional model, and going from there to a distributed query engine like PrestoDB is probably harder than adding H2’s functionality to PrestoDB. Using both PrestoDB and H2 would seem like a good option, but then we have to maintain two SQL engines which won’t be entirely compatible for the end-user.

So PrestoDB seems like the winner. Sadly though, PrestoDB’s query execution path is very complicated (to be fair though, it is doing something pretty complicated!) It has a great SQL parser and logical execution planner, but then it’s really designed for distributing that query across multiple execution nodes, and gathering the results back. That’s great functionality that we’ll likely want one day, but it has a high overhead in terms of code complexity and thus in terms of execution time. I’m pretty sure that the overhead is too high to use it for a primary key lookup, for example, where modern databases can sustain 100k queries per second or more.

My current plan is to try to adopt a hybrid model, to use PrestoDB’s basic infrastructure (parser, basic planning) to get from SQL into a nicer structure for execution. From there, I hope to check whether the query is simple enough for me to execute directly. If it is, I’ll execute it using specialized code, otherwise I’ll send it to PrestoDB for full processing. For now I will probably just cause complex queries to return a “not yet implemented” error!

This isn’t ideal though, because I’d have to reimplement a lot of what PrestoDB already does. So I’m basically splunking around PrestoDB’s codebase, trying to understand how it all fits together and where I can take it apart. More thought is needed here, but also I need to just try things out and learn by making mistakes. The plan for tomorrow is to get some form of SQL querying wired-up, even if it has to be special-cased and duplicates code!