I’m building a set of open-source cloud data-stores, blogging all the way.
In our last episode, we created a block store, exposed over iSCSI. We also added a native protocol buffers client to our key-value store.
One thing I really want to add is encryption. I’ve always been wary of storing data on the cloud; it’s not really a “post-Snowden” thing, as a determined enough attacker can always get access to data. Rather, it’s about limiting exposure and defense in depth, so that a single mistake doesn’t expose everything. Making sure that all data is encrypted at rest, and that the keys aren’t stored alongside the data (or, ideally, aren’t stored at all), is a good foundation. A backup or disk access doesn’t expose the data; you have to get inside the running process to get the keys or plaintext, and even then you can only unlock whatever data is in use at the time. At that point it’s much easier just to capture the credentials!
I’ve implemented this a few times across various projects, and I’ve basically settled on the idea of deriving a key from the user’s credentials, and then using that key to unlock the actual encryption keys (possibly repeating the key-unlocking-key chain). This means that without the password, the data is completely inaccessible, even to administrators. The big downside is that we can’t do any data processing unless an authorized user is requesting it, but this is probably a good thing.
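In code, the idea looks something like this minimal sketch using the standard JCE classes; the PBKDF2 parameters and the AESWrap key-wrapping scheme are illustrative choices here, not a final design:

```java
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;
import javax.crypto.spec.SecretKeySpec;

public class KeyUnlocker {
    // Sketch only: derive a key-encrypting key (KEK) from the user's credentials,
    // then use it to unwrap the real data key, which is only ever stored in wrapped form.
    // PBKDF2-with-SHA1, 100k iterations and AESWrap are illustrative parameters.
    public static SecretKey unlockDataKey(char[] password, byte[] salt, byte[] wrappedDataKey)
            throws Exception {
        SecretKeyFactory factory = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA1");
        PBEKeySpec spec = new PBEKeySpec(password, salt, 100000, 256);
        SecretKey kek = new SecretKeySpec(factory.generateSecret(spec).getEncoded(), "AES");

        Cipher cipher = Cipher.getInstance("AESWrap");
        cipher.init(Cipher.UNWRAP_MODE, kek);
        return (SecretKey) cipher.unwrap(wrappedDataKey, "AES", Cipher.SECRET_KEY);
    }
}
```

Lose the password and the KEK can’t be derived, so the wrapped data key (and everything encrypted under it) stays locked.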
To mmap or not to mmap
The big problem we face is that we’re currently mmap-ing the data. This means that we simply can’t have a different view than the kernel exposes. I thought about using ecryptfs, but I worried that it would be relatively complicated and would really tie us to this data format; it would also probably mean that all in-use data was trivially exposed through the filesystem. I thought about using tricks with mprotect and hooking SIGSEGV, but this seems like it would involve a lot of tricky syscalls and would still have problems e.g. with making sure we only wrote encrypted data out. I found uvmem, which is a very cool patch that QEMU uses for remote paging, and debated creating a custom kernel module, but likely this would just be re-implementing ecryptfs.
In short, using kernel encryption was going to be complicated. More importantly, no “real” databases use mmap: the database knows more than the kernel about its own memory usage patterns, and so it can manage its memory better than the kernel. In theory, at least, though I’d love to see some work on exposing these hints efficiently to the kernel. But it does seem to be true in reality as well: databases that naively rely on mmap (like MongoDB and Redis) perform catastrophically badly once data exceeds memory.
Instead, all databases manage their own cache through a pool of buffers, which are populated with pages in-use and recently in-use. If we had a buffer pool, we could easily implement encryption and probably compression as well. Another big win is that we could also easily page data in from cloud object storage on-demand. So, if we had a 64GB data store that we needed to recover onto a new machine, we wouldn’t have to wait to download it all before using it. There’s nothing here that the kernel couldn’t do; but the hooks just aren’t there.
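To make that concrete, here’s a toy sketch of the shape I have in mind; the interface and names are hypothetical, not the real code:

```java
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: a page miss falls through to the backing source (local disk,
// or cloud object storage on a fresh machine), and decryption/decompression happens
// on the way into the pool instead of being left to the kernel.
public class SimpleBufferPool {
    public interface PageSource {
        ByteBuffer fetch(long pageId);
    }

    private final Map<Long, ByteBuffer> pool = new ConcurrentHashMap<Long, ByteBuffer>();
    private final PageSource source;

    public SimpleBufferPool(PageSource source) {
        this.source = source;
    }

    public ByteBuffer getPage(long pageId) {
        ByteBuffer page = pool.get(pageId);
        if (page == null) {
            page = decode(source.fetch(pageId));  // decrypt and/or decompress on page-in
            pool.put(pageId, page);
        }
        return page;
    }

    private ByteBuffer decode(ByteBuffer raw) {
        return raw;  // placeholder for decryption/decompression with the unwrapped data key
    }
}
```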
Storm in a btree-cup?
One very difficult problem to solve with mmap is that kernel paging interacts poorly with event-driven I/O or with (Go-style) userspace threads. A page-fault blocks the whole thread; in the thread-per-request model that is acceptable, but in the event-driven model that thread is likely serving a large number of requests, all of which are unfairly blocked. So you typically end up using a thread pool to process requests, which looks a lot like the thread-per-request model again!
On the other hand, the paper which set the stage for VoltDB shows that buffer management is a huge component of the overhead of a traditional database. Those arguments haven’t yet been fully resolved. If I had to summarize, I would say that if the database fits in memory then eliminating buffer management is a good thing (like VoltDB, MongoDB, MemSQL and Redis are trying). If the database doesn’t fit in memory, then all databases perform badly, though databases that manage their own buffers seem to perform (much) less badly.
Tilting at windmills
So today I looked at building our own caching layer. I’m not sure we’ll end up using it, so I chose to make it pluggable. I started off by better abstracting out the PageStore (which was a big design improvement, even if we don’t end up using a new PageStore). Hopefully later we can revisit this and actually compare buffer management to no-buffer management!
It also gives me a chance to play with the new asynchronous I/O in Java 7. We saw just yesterday the importance of async I/O for hard disks; when we’re reading from the cloud it’ll likely be just as important!
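For the curious, the NIO.2 API looks roughly like this; the page size and the way results get handed back are placeholders, not the real wiring:

```java
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.channels.CompletionHandler;

public class AsyncPageRead {
    private static final int PAGE_SIZE = 8192;  // placeholder page size

    // Kick off a non-blocking read of one page; the callback runs when the data arrives,
    // so no request-serving thread sits blocked on the disk (or, later, on the network).
    public static void readPage(AsynchronousFileChannel channel, long pageId) {
        final ByteBuffer buffer = ByteBuffer.allocate(PAGE_SIZE);
        channel.read(buffer, pageId * PAGE_SIZE, buffer,
                new CompletionHandler<Integer, ByteBuffer>() {
                    @Override
                    public void completed(Integer bytesRead, ByteBuffer page) {
                        page.flip();
                        // hand the page to whichever transaction asked for it
                    }

                    @Override
                    public void failed(Throwable exc, ByteBuffer page) {
                        // surface the error to the waiting transaction
                    }
                });
    }
}
```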
It proved to be fairly complicated, but I eventually ended up with a reasonable preliminary buffer cache implementation. It relies on a caching map from Guava and a memory pool from Netty, both of which are awesome but aren’t really designed for our particular requirements, so we’ll probably have to replace them soon.
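The core of it looks roughly like the following; this is a trimmed-down sketch rather than the actual class, but it shows how the Guava cache and the Netty pooled allocator fit together (the eviction hook is where things will get hairy once we enforce a real size limit):

```java
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import com.google.common.cache.RemovalListener;
import com.google.common.cache.RemovalNotification;
import com.google.common.cache.Weigher;
import io.netty.buffer.ByteBuf;
import io.netty.buffer.ByteBufAllocator;
import io.netty.buffer.PooledByteBufAllocator;

public class PageCache {
    private static final int PAGE_SIZE = 8192;  // placeholder page size

    private final ByteBufAllocator allocator = PooledByteBufAllocator.DEFAULT;

    private final LoadingCache<Long, ByteBuf> pages = CacheBuilder.newBuilder()
            .maximumWeight(64L * 1024 * 1024)                // target cache size in bytes
            .weigher(new Weigher<Long, ByteBuf>() {
                @Override
                public int weigh(Long pageId, ByteBuf page) {
                    return page.capacity();
                }
            })
            .removalListener(new RemovalListener<Long, ByteBuf>() {
                @Override
                public void onRemoval(RemovalNotification<Long, ByteBuf> evicted) {
                    // drop the cache's reference; transactions that still hold
                    // the page keep it alive via their own retain()
                    evicted.getValue().release();
                }
            })
            .build(new CacheLoader<Long, ByteBuf>() {
                @Override
                public ByteBuf load(Long pageId) throws Exception {
                    ByteBuf page = allocator.directBuffer(PAGE_SIZE);
                    // read (and eventually decrypt) the page from the underlying store into 'page'
                    return page;
                }
            });

    public ByteBuf getPage(long pageId) throws Exception {
        return pages.get(pageId);
    }
}
```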
The implementation is fairly straightforward - too simple, in fact. We already have transaction objects, and they make a nice holder for any in-use buffers, which means we don’t have to change too much code yet. We release the buffers only at the end of the transaction, which again simplifies things. We allow the cache to grow bigger than the target size, so we don’t yet have to worry about forcing page evictions.
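Concretely, the transaction ends up looking something like this (a sketch, not the real class); it leans on Netty’s reference counting so a page evicted from the cache isn’t reclaimed while a transaction still holds it:

```java
import io.netty.buffer.ByteBuf;

import java.util.ArrayList;
import java.util.List;

// Sketch of the idea: the transaction pins every buffer it touches and releases
// them all only when it finishes.
public class Transaction {
    private final List<ByteBuf> pinnedPages = new ArrayList<ByteBuf>();

    // Every page the transaction touches gets pinned here...
    public ByteBuf use(ByteBuf page) {
        pinnedPages.add(page.retain());   // bump the reference count while we hold it
        return page;
    }

    // ...and everything is released in one go at the end of the transaction.
    public void close() {
        for (ByteBuf page : pinnedPages) {
            page.release();               // the pool reclaims it once nobody else holds it
        }
        pinnedPages.clear();
    }
}
```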
All of these simplifications will have to get fixed, but they set the stage for what we really want to work on. Tomorrow: encryption. If that’s a bust, then we’ll probably end up throwing away the manual buffer management we did today!