Day 15: Block Storage - Justin Santa Barbara’s blog

I’m building a set of open-source cloud data-stores, blogging all the way.

In our last episode, we built a file server backed by our cloud data-stores, and exposed it over WebDAV.

That was before the Christmas break, and it’s time to get going on my little project again. Paul Graham is definitely right when he talks about the Maker’s Schedule! I did manage to get some time over the Christmas break, but not the uninterrupted hours that actually produce real progress. I knew that was going to be the case, so I decided to do something a bit different, and worked on a block-storage server (like Amazon EBS or Rackspace Block Storage) that exposes a virtual hard disk over iSCSI. iSCSI is a very complicated protocol, so it actually worked fairly well to be able to work on it in small chunks, rather than going completely crazy trying to do it in one sitting.

Block Storage Architecture

Block Storage provides a virtual hard disk; it is similar to the file server that we exposed over WebDAV, but works with disk blocks instead of files. It’s a much better match for storing changing files, like in a traditional relational database. I happen to think that this is not a great fit for the cloud, but until our structured store is as good as relational databases then there will still be a need!

The architecture I went with was very similar to the file server: we store the actual blocks of data in cloud object storage (as immutable hashed blobs), and we store the (smaller) mapping from each address of the disk to the relevant blob in our key-value store. Each of these systems is distributed & fault tolerant, so our block storage is automatically distributed & fault tolerant. We should be able to support de-duplication and compression fairly easily, though I haven’t yet implemented anything particularly advanced.

Better the iSCSI devil you know…

I chose to expose the block device over iSCSI. There are a few other alternatives: NBD is a much simpler protocol, which would have been easier to implement, but Windows support was lacking; I also worried - probably unfairly - that it was so simple that it wouldn’t allow for some interesting optimizations. Another alternative was AoE, but this operates at Layer 2 so would have been painful to implement in Java, and doesn’t have as good support as iSCSI. The Layer 2 thing seems like a poor match for the cloud also, where we invariably have complicated networking and likely want to be in multiple datacenters.

iSCSI - though it is incredibly complicated - is very well supported, including directly by QEMU/KVM and probably other hypervisors also. I figured that if I could get it to work, it would be worth the pain.

iSCSI resources

There is a Java iSCSI server library, called jSCSI, which was very helpful in understanding everything, but I wanted to build it from scratch to see what performance optimizations I could find. The best resources on iSCSI I found were the iSCSI RFC; iSCSI basically takes SCSI and adds an Internet transport to it, so it was necessary also to refer to Seagate’s SCSI Reference to get details on the SCSI commands themselves.

Implementation

Implementation was fairly tedious, but I eventually got iSCSI working sufficiently to support QEMU running directly from the iSCSI volume. Other clients will undoubtedly need additional support, but the core functionality is present and working. (One of the oddities of SCSI is that every command comes in multiple versions, with different sizes for block addresses - I guess that’s what happens when hard disks grow by a factor of almost 1 million!)

The big optimization that I’ve implemented so far is that SCSI allows asynchronous operation, and in particular allows for writes to be buffered and then flushed. This allows us to combine and defer write operations; which is important because when we do flush we have to replicate the data to multiple servers. Of course this was all implemented in SCSI because hard disks (even SSDs) are big sources of latency.

The implementation makes very heavy use of ListenableFuture, a key contribution from the Guava project. ListenableFuture are a much more useful implementation of Promises than Java’s built-in Future, and they make asynchronous multithreaded programming (almost) easy.

The other (related) optimization was to divide the disk into segments (currently 1MB). Instead of mapping each disk block to a key-value entry, we map a segment to a key-value entry. Blocks are small (512 bytes or 4KB), so we want to amortize some of the overhead here. If we’re reading or writing contiguous blocks, we can combine this into a single key-value operation (which is nice, because hard disks also favor sequential operations, so most software tries to avoid seeking around all over the disk). Also, we can create blobs that are bigger than a single disk block (as long as it’s within the same segment), and so we can normally combine several blocks into a single read or write from object storage. In theory, we can heuristically clean up complicated segments by re-writing blocks, but that isn’t implemented at the moment!

Improving the key-value store

One of the goals of building these other systems using the key-value store is to figure out where the key-value store needs to be improved. I realized we needed the idea of multi-tenancy in the key-value server: it is easy to ask for another isolated key-value store, and then within each key-value store it is also possible to have a modest number of keyspaces (keyspaces are currently all mapped into the same BTree). The idea is that you’ll use a separate store for each unrelated unit, and keyspaces to keep data organized. For example: in the file server each volume can now get its own key-value store, and we use keyspaces for the different types of data (inodes vs direntries); a keyspace is a bit like a table in a relational database. By making it easy to allocate key-value stores on demand, we can later store the key-value stores on different machines in the cluster, to scale-out. (We’ll still have a bottleneck if one store is very heavily used, but this simple sharding will push the problem a long way down the road).

We were still using the key-value store through the redis interface; it was still possible to map this enhanced functionality to Redis, but it was getting a little messy. So I implemented a native interface using Protocol buffers. Using protocol buffers is much more efficient than Redis’s protocol. As well as being able to multiplex requests and easily support asynchronous operations, we can also use the Protocol Buffer objects as our internal request representation. This means we don’t need to marshal and demarshal the requests between a wire-format and our internal format, and generally means a lot less code. It does mean we have to be very careful to treat data as untrusted though!

Taking this to the natural conclusion, we now use the wire-format Protocol Buffers for our Raft log entries. Sticking to one format throughout the whole system means even less code and (in theory) less overhead. It also demonstrates that our architecture really is just establishing a reliable consistent sequence of commands, like the old MySQL statement based replication. On that note, Jay Kreps (a technical lead in LinkedIn’s awesome SNA group) published an interesting piece on using logs to build distributed systems, which is well worth reading.