The reason for wanting a Java base image was - of course - to run Java applications. In particular, I wanted to be able to do this directly from Maven, which is (still) the dominant build tool for Java.
Spotify has a pretty good Maven plugin for Docker, so I started with that, and it was fairly simple to add support for ACI.
With my ACI plugin for Maven, building an ACI image requires registering the plugin in the pom.xml and configuring the command line. It’s still more work than I would like, but it is very copy-and-pastable.
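A minimal sketch of what that pom.xml registration might look like (the plugin coordinates and configuration element names here are illustrative, loosely modeled on Spotify’s docker-maven-plugin, not the actual fork’s):

```xml
<plugin>
  <!-- Hypothetical coordinates for the ACI fork; upstream is com.spotify:docker-maven-plugin -->
  <groupId>com.example</groupId>
  <artifactId>aci-maven-plugin</artifactId>
  <version>0.1-SNAPSHOT</version>
  <configuration>
    <!-- The command line the container runs -->
    <cmd>java -jar /app/${project.build.finalName}.jar</cmd>
    <resources>
      <resource>
        <directory>${project.build.directory}</directory>
        <include>${project.build.finalName}.jar</include>
      </resource>
    </resources>
  </configuration>
</plugin>
```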
This was pretty easy to implement. The biggest challenge was creating a library for writing ACIs from Java: appc-java. ACIs are not that different from Docker images. One huge difference is that an ACI can be built securely (i.e. without requiring root or running arbitrary code), because the format is designed to be buildable without executing anything. Dockerfiles make for a great demo, but they are very difficult to secure.
It’s also much faster to just write a tarfile than it is to spin up a container!
I’m not sure yet whether I should try to contribute this work back into Spotify’s plugin, or whether I should make this a permanent fork. On the one hand, contributing back is good manners; on the other hand, ACI feels like something that is genuinely new, with a lot of capabilities that would be much harder to maintain in a Docker plugin. I’m going to keep tinkering and see how it goes!
CoreOS announced rkt (pronounced “rocket”) last year. With rkt, CoreOS is creating a workable and open specification for a container image format, called the App Container Image or ACI. rkt is actually an implementation of a whole family of these open specifications, but I’m starting my investigations at the beginning!
ACI is deliberately minimal: it is a tarball, with a manifest JSON file at the root (/manifest) and a filesystem tree for the container (/rootfs). There are a lot more features, but today I wanted to start by figuring out an easy way to build these images.
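For reference, a minimal image manifest might look something like this (the values are illustrative; see the appc spec for the full schema):

```json
{
  "acKind": "ImageManifest",
  "acVersion": "0.5.1",
  "name": "example.com/hello",
  "app": {
    "exec": ["/usr/bin/hello"],
    "user": "0",
    "group": "0"
  }
}
```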
The “absolute right way” to build a container image is to compile exactly what you need, and copy just that into a tarball. But this process is fairly hard, and essentially amounts to maintaining your own (special-purpose) distro. (Which does suggest that it would be interesting to try this with gentoo, but I digress…)
An alternative is to leverage the work done by the distros, in my case Debian Jessie packages. packages2aci is a simple “tool” I created (just a batchfile really) which can create an ACI based on a list of packages and a manifest.
packages2aci goes through a list of packages, individually downloads them (using apt-get download), expands them using dpkg-deb, and then calls actool to build the manifest. Incredibly simple, apart from the fact that you have to specify each package individually (it doesn’t do dependency analysis). This is both annoying (busywork) and handy (sometimes you know that you don’t actually need some library).
With packages2aci, it is very easy to build simple ACI images from your OS packages and start running them with rkt. There’s an example for Java 7, which just runs java -version; that isn’t very useful, but it is a good first step. Next time, I’ll post about why I want a Java base image!
Panamax.io is a new piece of open-source software from CenturyLink. Really, it’s from Lucas Carlson and the team behind PHPFog/AppFog. CenturyLink is a fairly traditional telecoms company that has been buying into cloud by snapping up some interesting players: notably Savvis, AppFog and Tier 3. Rumor has it that Rackspace might be next (1). It’s a great strategy; most of the communications incumbents have good cash flow but declining businesses as we move towards dumb pipes, so it makes a lot of sense to invest that cash in businesses with more future growth potential.
With Panamax, we’ve got our first look at what those new businesses might look like. Panamax combines CoreOS, fleet and Docker, but puts a nice graphical front-end on it. It’s described as “Docker for humans”. I’ll be honest: it’s really early, and I found a number of bugs which mean it’s not quite ready yet for production usage by average humans. But you can definitely see where this is heading: taking Docker and making it mass-market, instead of a geek plaything. It solves some real pain-points with Docker, and it’s definitely one to watch; they are fixing the issues and it’s getting better all the time.
One of the biggest shortcomings I encountered was that I had all these great templates that I could install in seconds, but I would then have to wait a long time for them to download (mostly because of the Docker registry’s questionable design). So I created a proxy-server template; with just a few clicks you can cache all those big downloads.
I also tried creating a more ambitious ELK stack (ElasticSearch, LogStash, Kibana), but even with the faster downloading that I was able to get by using my cache, there are still a few problems (my own, not Panamax’s) that I couldn’t get ironed out in the time I had. So the ELK stack template maybe isn’t quite ready for humans either. But, like Panamax, it will improve rapidly!
(1): I have no inside knowledge on Rackspace/CenturyLink, and the asking price for Rackspace would seem to make it a difficult purchase for CenturyLink. Strategically it makes a lot of sense: Rackspace has a great customer base, and the people are top-notch on both the business and technical sides. Despite OpenStack’s problems, I don’t see how a non-OpenStack strategy makes sense for anyone other than AWS. And to the extent Rackspace is struggling, it is mostly because everyone in the business is struggling to compete with AWS. I have some ideas on how to compete here also, and it probably involves things that look a lot like Panamax, but I digress… Underpin Rackspace’s efforts with the solid cashflow from another business and you have a real contender. Combine it with technologies like Panamax, CoreOS & Docker and things get really interesting.
In our last episode, we created a block store, exposed over iSCSI. We also added a native protocol buffers client to our key-value store.
One thing I really want to add is encryption. I’ve always been wary of storing data on the cloud; it’s not really a “post-Snowden” thing, as a determined enough attacker can always get access to data. Rather it’s about limiting exposure and defense-in-depth; so that a single mistake doesn’t expose everything. Making sure that all data is encrypted at rest, and that the keys aren’t stored alongside the data (or ideally nowhere at all) is a good foundation. A backup or disk access doesn’t expose the data; you have to get inside the running process to get the keys or plaintext, and even then you can only unlock whatever data is in use at the time. It is much easier just to capture the credentials!
I’ve implemented this a few times across various projects, and I’ve basically settled on the idea of deriving a key from the user’s credentials, and then using that key to unlock the actual encryption keys (possibly repeating the key-unlocking-key chain). This means that without the password, the data is completely inaccessible, even to administrators. The big downside is that we can’t do any data processing unless an authorized user is requesting it, but this is probably a good thing.
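The key-unlocking-key idea is easy to sketch. In this illustrative Java snippet (parameter choices like the iteration count are mine, not the project’s), a key-encryption key derived from the password via PBKDF2 wraps the real data-encryption key; only the salt and the wrapped blob are ever stored, so without the password the data key is unrecoverable:

```java
import javax.crypto.Cipher;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;
import javax.crypto.spec.SecretKeySpec;
import java.security.SecureRandom;
import java.util.Arrays;

public class KeyChain {
    // Derive a 256-bit key-encryption key (KEK) from the user's password.
    public static byte[] deriveKek(char[] password, byte[] salt) throws Exception {
        PBEKeySpec spec = new PBEKeySpec(password, salt, 100_000, 256);
        return SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256")
                .generateSecret(spec).getEncoded();
    }

    // Wrap the data-encryption key (DEK) under the KEK; store only the result.
    public static byte[] wrap(byte[] kek, byte[] dek) throws Exception {
        Cipher c = Cipher.getInstance("AESWrap");
        c.init(Cipher.WRAP_MODE, new SecretKeySpec(kek, "AES"));
        return c.wrap(new SecretKeySpec(dek, "AES"));
    }

    // Unwrapping requires re-deriving the KEK, i.e. it requires the password.
    public static byte[] unwrap(byte[] kek, byte[] wrapped) throws Exception {
        Cipher c = Cipher.getInstance("AESWrap");
        c.init(Cipher.UNWRAP_MODE, new SecretKeySpec(kek, "AES"));
        return c.unwrap(wrapped, "AES", Cipher.SECRET_KEY).getEncoded();
    }

    public static void main(String[] args) throws Exception {
        byte[] salt = new byte[16];
        byte[] dek = new byte[32];
        new SecureRandom().nextBytes(salt);
        new SecureRandom().nextBytes(dek);
        byte[] kek = deriveKek("correct horse".toCharArray(), salt);
        byte[] wrapped = wrap(kek, dek);          // persist wrapped DEK + salt only
        byte[] recovered = unwrap(kek, wrapped);  // needs the password again
        System.out.println(Arrays.equals(dek, recovered)); // true
    }
}
```

Chaining further key-unlocking keys is just repeating the wrap step with the recovered key as the next KEK.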
The big problem we face is that we’re currently mmap-ing the data. This means that we simply can’t have a different view than the kernel exposes. I thought about using ecryptfs, but I worried that it would be relatively complicated and would really tie us to this data format; it would also probably mean that all in-use data was trivially exposed through the filesystem. I thought about using tricks with mprotect and hooking SIGSEGV, but this seems like it would involve a lot of tricky syscalls and would still have problems e.g. with making sure we only wrote encrypted data out. I found uvmem, which is a very cool patch that QEMU uses for remote paging, and debated creating a custom kernel module, but likely this would just be re-implementing ecryptfs.
In short, using kernel encryption was going to be complicated. More importantly, no “real” databases use mmap: the database knows more than the kernel about its own memory usage patterns, and so it can manage its memory better than the kernel can. In theory, at least, though I’d love to see some work on exposing these hints efficiently to the kernel. But it does seem to be true in practice as well: databases that naively rely on mmap (like MongoDB and Redis) perform catastrophically badly once data exceeds memory.
Instead, all databases manage their own cache through a pool of buffers, which are populated with pages in-use and recently in-use. If we had a buffer pool, we could easily implement encryption and probably compression as well. Another big win is that we could also easily page data in from cloud object storage on-demand. So, if we had a 64GB data store that we needed to recover onto a new machine, we wouldn’t have to wait to download it all before using it. There’s nothing here that the kernel couldn’t do; but the hooks just aren’t there.
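The shape of a buffer pool is simple even if a production one isn’t. Here’s a toy sketch (my own illustration, not the project’s code): pages are fetched through the cache, which keeps the most recently used pages up to a fixed count and evicts the eldest. A real pool would also track pinned and dirty pages:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

public class BufferPool {
    private final Function<Long, byte[]> pageStore; // loads a page on a miss
    private final LinkedHashMap<Long, byte[]> pages;
    private int misses;

    public BufferPool(final int capacity, Function<Long, byte[]> pageStore) {
        this.pageStore = pageStore;
        // accessOrder=true turns LinkedHashMap into a simple LRU cache.
        this.pages = new LinkedHashMap<Long, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
                return size() > capacity; // evict least-recently-used page
            }
        };
    }

    public byte[] get(long pageId) {
        byte[] page = pages.get(pageId);
        if (page == null) {
            misses++;
            page = pageStore.apply(pageId); // decrypt/decompress/download here
            pages.put(pageId, page);
        }
        return page;
    }

    public int misses() { return misses; }
}
```

The interesting part is the `pageStore` hook: that’s exactly where decryption, decompression, or an on-demand fetch from cloud object storage would slot in.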
One very difficult problem to solve with mmap is that kernel paging interacts poorly with event-driven I/O or with (Go-style) userspace threads. A page-fault blocks the whole thread; in the thread-per-request model that is acceptable, but in the event-driven model that thread is likely serving a large number of requests all of which are unfairly blocked. So you typically end up using a thread pool to process requests, which looks a lot like the thread-per-request model again!
On the other hand, the paper which set the stage for VoltDB shows that buffer management is a huge component of the overhead of a traditional database. Those arguments haven’t yet been fully resolved. If I had to summarize, I would say that if the database fits in memory then eliminating buffer management is a good thing (like VoltDB, MongoDB, MemSQL and Redis are trying). If the database doesn’t fit in memory, then all databases perform badly, though databases that manage their own buffers seem to perform (much) less badly.
So today I looked at building our own caching layer. I’m not sure we’ll end up using it, so I chose to make it pluggable. I started off by better abstracting out the PageStore (which was a big design improvement, even if we don’t end up using a new PageStore). Hopefully later we can revisit this and actually compare buffer management to no-buffer management!
It also gives me a chance to play with the new asynchronous I/O in Java 7. We saw just yesterday the importance of async I/O for hard disks; when we’re reading from the cloud it’ll likely be just as important!
It proved to be fairly complicated, but I eventually ended up with a reasonable preliminary buffer cache implementation. It relies on a caching map from Guava, and a memory pool from Netty, both of which are awesome but aren’t really designed for our particular requirements, so we’ll probably have to replace them soon.
The implementation is fairly straightforward - too simple, in fact. We already have transaction objects, so the transaction objects act as a nice holder for any in-use buffers, which means we don’t have to change too much code yet. We release the buffers only at the end of the transaction, which again simplifies things. We allow the cache to grow bigger than the target size, so we don’t yet worry about forcing page evictions.
All of these simplifications will have to get fixed, but they set the stage for what we really want to work on. Tomorrow: encryption. If that’s a bust, then we’ll probably end up throwing away the manual buffer management we did today!
In our last episode, we built a file server backed by our cloud data-stores, and exposed it over WebDAV.
That was before the Christmas break, and it’s time to get going on my little project again. Paul Graham is definitely right when he talks about the Maker’s Schedule! I did manage to get some time over the Christmas break, but not the uninterrupted hours that actually produce real progress. I knew that was going to be the case, so I decided to do something a bit different, and worked on a block-storage server (like Amazon EBS or Rackspace Block Storage) that exposes a virtual hard disk over iSCSI. iSCSI is a very complicated protocol, so it actually worked fairly well to be able to work on it in small chunks, rather than going completely crazy trying to do it in one sitting.
Block Storage provides a virtual hard disk; it is similar to the file server that we exposed over WebDAV, but works with disk blocks instead of files. It’s a much better match for storing changing files, like in a traditional relational database. I happen to think that this is not a great fit for the cloud, but until our structured store is as good as relational databases then there will still be a need!
The architecture I went with was very similar to the file server: we store the actual blocks of data in cloud object storage (as immutable hashed blobs), and we store the (smaller) mapping from each address of the disk to the relevant blob in our key-value store. Each of these systems is distributed & fault tolerant, so our block storage is automatically distributed & fault tolerant. We should be able to support de-duplication and compression fairly easily, though I haven’t yet implemented anything particularly advanced.
I chose to expose the block device over iSCSI. There are a few other alternatives: NBD is a much simpler protocol, which would have been easier to implement, but Windows support was lacking; I also worried - probably unfairly - that it was so simple that it wouldn’t allow for some interesting optimizations. Another alternative was AoE, but this operates at Layer 2 so would have been painful to implement in Java, and doesn’t have as good support as iSCSI. The Layer 2 thing seems like a poor match for the cloud also, where we invariably have complicated networking and likely want to be in multiple datacenters.
iSCSI - though it is incredibly complicated - is very well supported, including directly by QEMU/KVM and probably other hypervisors also. I figured that if I could get it to work, it would be worth the pain.
There is a Java iSCSI server library, called jSCSI, which was very helpful in understanding everything, but I wanted to build mine from scratch to see what performance optimizations I could find. The best resource on iSCSI I found was the iSCSI RFC itself; iSCSI basically takes SCSI and adds an Internet transport, so it was also necessary to refer to Seagate’s SCSI Reference for details on the SCSI commands themselves.
Implementation was fairly tedious, but I eventually got iSCSI working sufficiently to support QEMU running directly from the iSCSI volume. Other clients will undoubtedly need additional support, but the core functionality is present and working. (One of the oddities of SCSI is that every command comes in multiple versions, with different sizes for block addresses - I guess that’s what happens when hard disks grow by a factor of almost 1 million!)
The big optimization that I’ve implemented so far is that SCSI allows asynchronous operation, and in particular allows for writes to be buffered and then flushed. This allows us to combine and defer write operations; which is important because when we do flush we have to replicate the data to multiple servers. Of course this was all implemented in SCSI because hard disks (even SSDs) are big sources of latency.
The implementation makes very heavy use of ListenableFuture, a key contribution from the Guava project. ListenableFutures are a much more useful implementation of promises than Java’s built-in Future, and they make asynchronous multithreaded programming (almost) easy.
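The promise style is the point: kick off an async operation, then attach the “what happens next” step instead of blocking a thread. This sketch shows the same chaining idea using the JDK’s CompletableFuture (a later, dependency-free stand-in for Guava’s ListenableFuture; the disk-read stub is invented for illustration):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncChain {
    // Pretend async disk read: returns a future that completes with lba * 2.
    public static CompletableFuture<Integer> readBlock(ExecutorService pool, int lba) {
        return CompletableFuture.supplyAsync(() -> lba * 2, pool);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        // Chain: read a block, transform the result, combine with a second read.
        CompletableFuture<Integer> result =
            readBlock(pool, 10)
                .thenApply(x -> x + 1)                          // 20 -> 21
                .thenCombine(readBlock(pool, 20), Integer::sum); // 21 + 40
        System.out.println(result.get()); // 61
        pool.shutdown();
    }
}
```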
The other (related) optimization was to divide the disk into segments (currently 1MB). Instead of mapping each disk block to a key-value entry, we map a segment to a key-value entry. Blocks are small (512 bytes or 4KB), so we want to amortize some of the overhead here. If we’re reading or writing contiguous blocks, we can combine this into a single key-value operation (which is nice, because hard disks also favor sequential operations, so most software tries to avoid seeking around all over the disk). Also, we can create blobs that are bigger than a single disk block (as long as it’s within the same segment), and so we can normally combine several blocks into a single read or write from object storage. In theory, we can heuristically clean up complicated segments by re-writing blocks, but that isn’t implemented at the moment!
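The segment arithmetic above is just integer division and remainder. A small sketch with the constants from the post (512-byte blocks, 1MB segments; the helper names are mine):

```java
public class Segments {
    static final int BLOCK_SIZE = 512;
    static final int SEGMENT_SIZE = 1024 * 1024;
    static final int BLOCKS_PER_SEGMENT = SEGMENT_SIZE / BLOCK_SIZE; // 2048

    // Which key-value entry holds this logical block address (LBA)?
    static long segmentOf(long lba) { return lba / BLOCKS_PER_SEGMENT; }

    // Where within that segment's mapping does the block fall?
    static int offsetInSegment(long lba) { return (int) (lba % BLOCKS_PER_SEGMENT); }

    // Contiguous blocks in the same segment can share one key-value operation.
    static boolean sameSegment(long a, long b) { return segmentOf(a) == segmentOf(b); }

    public static void main(String[] args) {
        System.out.println(segmentOf(2048));         // 1
        System.out.println(offsetInSegment(2049));   // 1
        System.out.println(sameSegment(2047, 2048)); // false: crosses a boundary
    }
}
```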
One of the goals of building these other systems using the key-value store is to figure out where the key-value store needs to be improved. I realized we needed the idea of multi-tenancy in the key-value server: it is easy to ask for another isolated key-value store, and then within each key-value store it is also possible to have a modest number of keyspaces (keyspaces are currently all mapped into the same BTree). The idea is that you’ll use a separate store for each unrelated unit, and keyspaces to keep data organized. For example: in the file server each volume can now get its own key-value store, and we use keyspaces for the different types of data (inodes vs direntries); a keyspace is a bit like a table in a relational database. By making it easy to allocate key-value stores on demand, we can later store the key-value stores on different machines in the cluster, to scale-out. (We’ll still have a bottleneck if one store is very heavily used, but this simple sharding will push the problem a long way down the road).
We were still using the key-value store through the redis interface; it was still possible to map this enhanced functionality to Redis, but it was getting a little messy. So I implemented a native interface using Protocol buffers. Using protocol buffers is much more efficient than Redis’s protocol. As well as being able to multiplex requests and easily support asynchronous operations, we can also use the Protocol Buffer objects as our internal request representation. This means we don’t need to marshal and demarshal the requests between a wire-format and our internal format, and generally means a lot less code. It does mean we have to be very careful to treat data as untrusted though!
Taking this to the natural conclusion, we now use the wire-format Protocol Buffers for our Raft log entries. Sticking to one format throughout the whole system means even less code and (in theory) less overhead. It also demonstrates that our architecture really is just establishing a reliable consistent sequence of commands, like the old MySQL statement based replication. On that note, Jay Kreps (a technical lead in LinkedIn’s awesome SNA group) published an interesting piece on using logs to build distributed systems, which is well worth reading.
So that I can give this little project the attention it deserves, I’m going to put it on hold and resume it in the New Year. There’s too much I want to do to squeeze it in between the holiday commitments!
Happy Holidays to everyone!
In our last episode, we built a simple file server and exposed it over WebDAV. As we did with the git server, we store the actual data in cloud object storage, and the consistent metadata in our key-value store.
Today I really just extended the WebDAV support, adding support for more WebDAV/HTTP verbs: MOVE (rename), DELETE, LOCK/UNLOCK and MKCOL (mkdir). All fairly straightforward stuff; I’m trying only to talk about the interesting details, so just a few observations…
I think WebDAV was the right choice of access protocol. We talked before about how SPDY & HTTP 2.0 should mean that HTTP will be “good enough” to replace most custom binary protocols, so we can then take advantage of HTTP’s benefits. A filesystem is a use case that will definitely push the boundaries: there’s a lot of traffic and it’s performance-sensitive.
WebDAV today already gives us many of HTTP’s benefits (WAN access, encryption support, etc.). File transfer and metadata querying with WebDAV over SPDY will be comparably efficient to a custom filesystem protocol, as this is HTTP’s bread and butter. Missing are change notification (inotify) and efficient filesystem writes, which again makes sense given HTTP’s read-mostly philosophy.
We can imagine adding change notification using WebSockets. The inefficiency of writes comes because WebDAV normally requires a LOCK / UNLOCK pair around each write; we could imagine combining operations to make this more efficient (e.g. PUT-and-UNLOCK). Because of HTTP’s extensibility, this could be as simple as recognizing an X-Unlock header on the PUT, and responding with X-Unlocked if we released the lock. To be fair, this may actually be more efficient than many filesystems (where lock and unlock map to file open and close). I think this shows that “just use HTTP” may be the right advice: we’re talking about adding features, not about overcoming the limitations of HTTP.
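To make the idea concrete, the exchange might look like this (X-Unlock and X-Unlocked are invented for this post, not standard WebDAV; the lock token is just an example value):

```
PUT /docs/report.txt HTTP/1.1
Host: example.com
If: (<opaquelocktoken:e71d4fae-5dec-22d6-fea5-00a0c91e6be4>)
X-Unlock: opaquelocktoken:e71d4fae-5dec-22d6-fea5-00a0c91e6be4

HTTP/1.1 204 No Content
X-Unlocked: opaquelocktoken:e71d4fae-5dec-22d6-fea5-00a0c91e6be4
```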
Lock and unlock support is only stub-implemented for now, with an in-memory implementation. This is not “cloud-first”: it is a single point of failure. It would be easy to store the locks in the key-value store. I will probably play around tomorrow to implement this in the key-value store, and then directly at the Raft level to see if something more efficient is possible.
Delete support is also interesting, because we don’t actually delete anything just yet. Because inodes can be shared, to delete data in an inode based filesystem, we need to keep a reference count on each inode. Because we’re also sharing data blobs, we have to implement either reference counting or garbage collection for blobs. Whatever approach we take for the data, I think I should probably take the same approach for inodes. I’m leaning towards garbage collection, not least because I think it will be more flexible - I’m pondering features like snapshots.
So, I’ve implemented delete just by removing the inode from the name to inode mapping; we don’t actually delete the inode. To make garbage collection or undelete a bit easier, I actually store a record of the removed data in a different area of the key-value store. We’ll see if this is useful, but it does mean that delete is incredibly fast (because we defer all the real work for later garbage collection).
Finally, I think one of the reasons this went so smoothly is that Netty imposed a nice architecture on us. There’s definitely a learning curve, but it is a great library. For performance, Netty reuses memory buffers, which requires manual reference counting instead of just relying on normal Java garbage collection; this can be tricky - I had to fix one buffer-management bug in my code, and I’m sure others are still lurking.
I think the holiday season is likely to interrupt my daily routine, but I’ll do my best. Hopefully tomorrow I’ll find time to look at locking and even a little bit more!
In our last episode, we added very basic SQL querying to our structured data store.
Although our SQL querying is little more than a proof of concept at this stage, today I decided to do something different - trying to use our servers to build another server again, to figure out what’s missing. The goal of this project is not to build one data store, but instead to build a whole suite of data stores: key-value, document store, append-log, git-store, file server etc. Today it was the file server’s turn.
We want to support “real” filesystem semantics, not just be an object server. The idea is that you should be able to run a normal, unmodified program against it. Cloud object stores took a different approach: they don’t offer the full guarantees that a traditional filesystem offers, so they deliberately don’t expose themselves in the normal way. That’s a good “fail-safe” design principle.
However, as anyone that has used an object store will attest, what it offers isn’t that different to what a traditional filesystem offers. The main things that are different are strong consistency (vs. eventual consistency) and locking support. They also have different file permissions and metadata, but that’s really a design choice, not a true limitation.
Just as we did with Git, we can take our consistent key-value store, and use it to add the missing functionality to a cloud object store. We’ll store the actual file data in object storage, but all the filesystem metadata will go to our key-value store. We could put it into our structured store, but - for now at least - we don’t need to. Providing rich filesystem metadata indexing - trying to unify the filesystem with structured data storage - has been a dream for a long time, but there are too many failed projects along the way for us to try it: WinFS, the Be File System. If you’ve been following along, you’ll see where this idea comes from: we have a key-value store; we’re going to put metadata into it; we know key-value stores aren’t that different from structured stores; if we used our structured store instead, we could support metadata indexing. It does sound simple, but let’s get a basic filesystem running first!
UNIX filesystems store files in a slightly unobvious way. Every file has a data structure, the inode, which contains its metadata (permissions, owner, size, pointers to the actual data, etc.). But rather than store the file’s name there, we refer to the inode by a number. Each directory stores the information needed to map from file names to inode numbers: a directory has an inode of its own, but its data is actually a list of its children, mapping names to their inode numbers. To get from a filesystem path to a file, we step through the path name by name, reading each directory to find the child inode, and then reading that child (which may itself be a directory).
This may be an unobvious way to do things, but is actually a great design. Because we reference files by inode number, not name, it means we can do things like rename, delete or move a file while it is open. We can have hard-links, where multiple names refer to the same file. Every UNIX filesystem (I think) is built around this design; Windows has its roots in the FAT filesystem, which didn’t do this, and so hard-links and in-use files are to this day much weaker on Windows.
The big downside is that listing all the files in a directory can be fairly slow, because we must fetch the inodes for every file in the directory if we want the metadata. This is why the default function for listing the files in a directory (readdir) doesn’t return the data in the inode.
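The name-by-name walk is easy to model in a few lines. This toy in-memory sketch (my illustration, not the server’s code) keeps only the directory-entry mapping and resolves a path by stepping through inode numbers:

```java
import java.util.HashMap;
import java.util.Map;

public class InodeWalk {
    static final long ROOT = 1;
    // Per-directory map: child name -> child inode number.
    static final Map<Long, Map<String, Long>> dirents = new HashMap<>();

    static Long resolve(String path) {
        long inode = ROOT;
        for (String name : path.split("/")) {
            if (name.isEmpty()) continue;             // skip leading "/"
            Map<String, Long> dir = dirents.get(inode);
            if (dir == null || !dir.containsKey(name)) return null;
            inode = dir.get(name);                    // step to the child inode
        }
        return inode;
    }

    public static void main(String[] args) {
        dirents.put(ROOT, new HashMap<>());
        dirents.get(ROOT).put("home", 2L);
        dirents.put(2L, new HashMap<>());
        dirents.get(2L).put("notes.txt", 3L);
        System.out.println(resolve("/home/notes.txt")); // 3
    }
}
```

Note that the file’s name exists only in its parent’s entry map, which is exactly why renames and hard links are cheap: they never touch the inode itself.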
If we’re going to build a filesystem, it might be possible to build something on a different model, but it would be tricky to expose it well to a modern operating system, because you’d have to translate between the two metaphors. In short…
I stored the inodes in the obvious way: each key maps from the inode number to a value containing the metadata. I used Protocol Buffers to encode the values, as it’s easy, extensible and has reasonably high performance. We will never get the raw performance of a fixed C data structure this way, but we’re not going to win any benchmarks in Java anyway. (Actually, that might not be true: the way to win benchmarks in a higher-level language is by making use of better algorithms or approaches. But not today!)
I stored the directory data by mapping each directory entry to a key-value entry. The value contains the inode of the child. We want the key to support two operations: list all children in a directory, and find a particular child of a directory by name. For the former, we require the directory inode to be a prefix of the key (so our query becomes a prefix/range query). For the latter, we want to include the name in the key. Therefore, our key structure is the directory inode number followed by the name of the file. Simple, and works!
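A sketch of that key encoding (helper names are mine): an 8-byte big-endian inode number followed by the child’s name. Big-endian keeps numeric order under lexicographic byte comparison, so “all children of inode N” becomes a simple prefix/range scan:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class DirentKey {
    // Composite key: [8-byte big-endian directory inode][UTF-8 child name].
    static byte[] key(long dirInode, String childName) {
        byte[] name = childName.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(8 + name.length)
                .putLong(dirInode)   // ByteBuffer is big-endian by default
                .put(name)
                .array();
    }

    // A range scan over a directory just checks this 8-byte prefix.
    static boolean hasPrefix(byte[] key, long dirInode) {
        return ByteBuffer.wrap(key, 0, 8).getLong() == dirInode;
    }

    public static void main(String[] args) {
        byte[] k = key(42L, "notes.txt");
        System.out.println(k.length);          // 8 + 9 = 17
        System.out.println(hasPrefix(k, 42L)); // true
    }
}
```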
For storing data, we do the same thing we did when we implemented the git server. We store the file content itself on cloud object storage - it is, after all, designed for storing lots of large objects. Small files may be more problematic, because of the overhead: this problem occurs in “real” filesystems as well; they normally end up storing small files in the file inode itself. We could store the file content using the inode identifier; instead we hash the file content and store it using its (SHA-256) hash for the name. Again, this is just like Git. It has the advantage that we get de-dup for free; it has the disadvantage that cleaning up files on delete is harder, because file content may be shared. Git gets round this by never deleting content in normal operation (which makes sense for version control); for now, we’ll also ignore the “garbage collection” problem.
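Content-addressed naming is a one-liner plus a hash. This small sketch (an illustration of the scheme, not the server’s actual store) shows how hashing the content gives de-duplication for free:

```java
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

public class BlobStore {
    private final Map<String, byte[]> store = new HashMap<>();

    // A blob's name is simply the hex SHA-256 of its content.
    static String hashName(byte[] content) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(content);
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    String put(byte[] content) throws Exception {
        String name = hashName(content);
        store.putIfAbsent(name, content); // same content -> same name -> one copy
        return name;
    }

    int size() { return store.size(); }

    public static void main(String[] args) throws Exception {
        BlobStore s = new BlobStore();
        String a = s.put("hello".getBytes());
        String b = s.put("hello".getBytes()); // de-duplicated
        System.out.println(a.equals(b));      // true
        System.out.println(s.size());         // 1
    }
}
```

The flip side, as noted above, is that nothing tells us when a blob’s last referrer goes away, which is why delete becomes a garbage-collection problem.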
A downside is that the files in object storage aren’t named meaningfully. It would be great if the file called “directory1/file1” was stored under that name in object storage. That just isn’t possible in our design. This may actually be a good thing, in that we really don’t want to encourage people to “go behind our back” and work through the object storage interface.
The other big downside is that we don’t have good support for partial file writes (yet). You want to use this as a simple filesystem, not to store your database files.
The hardest thing was actually figuring out how to expose this to the operating system as a real filesystem. FUSE is excellent, but would mean everyone would need to install a ‘driver’. The Windows shared filesystem protocol (CIFS) has universal support, but has a reputation as being slow and complicated. I thought about NFS, but I thought it would be tricky to get permissions and WAN support right. WebDAV seems to be a winner: it can be mounted natively on every major OS (it was the basis for iDisk on the Mac, and for ‘Web folders’ on Windows). Because it’s based on HTTP it can also easily be used at the application level, as well as by mounting it in the kernel. Best, it works on a “whole file” model, and doesn’t work well with partial writes, so it maps well to our capabilities. Annoyingly, every OS seems to have weird edge cases/bugs, but it seems like a great place to start. We might add NFS later!
I looked at the libraries that are out there, in particular Milton seems great. It seemed a bit orientated towards exposing your own data as a filesystem, rather than a raw WebDAV implementation. So, based on the Milton code, I coded my own. You can see the (mega) commit where we now support a filesystem using WebDAV. It only supports the basics of the WebDAV protocol (e.g. you can’t delete files), but it does enough that we can mount it in CyberDuck and in the MacOS Finder. That’s a great start… tomorrow I’ll work on filling out the support a little bit to make it useful.
So, we have a cloud filesystem - what does that mean? This is not a DropBox replacement: this only works online. It does provide a directory that is both shared and reliable. So for Wordpress, you might point your image upload folder here. You could store your Jenkins build artifacts here. You could use it for a simple filesystem based work-queue, although we can probably build a better solution here also!
In our last episode, we added multi-row transactions to our data-store, and examined the options for SQL parsing. PrestoDB seemed like the best fit for our requirements, though it didn’t offer fast enough execution of simple queries.
Today I got a basic SQL query to run. Just a single-table scan, and there are still some bits hard-coded, but it works. We can parse a query using the PrestoDB parser, check if it is sufficient for us to execute it directly, and (if so) run it. This is a great step forward!
PrestoDB is a fairly complicated system, but the SQL parser is actually fairly well self-contained. We have to set up a fair bit of infrastructure (like table metadata), but within a few hours I was able to get a unit test using the PrestoDB SQL parser to parse a simple SQL statement. PrestoDB also generates what it calls a “logical plan”, which is an execution plan before considering that some tables may be distributed. That’s exactly what we want, as for the moment our queries won’t be distributed.
So now the tricky bit: executing the SQL statement, using the PrestoDB plan. The idea here is that PrestoDB’s overhead won’t matter on a query that takes a second or more anyway, so we only need to concern ourselves with queries that should be fast. In particular, we really care about queries that only read from a single table and ideally use an index to narrow down the rows.
Even that is incredibly complicated; not least because PrestoDB doesn’t really support indexes, so it doesn’t give us a plan with index support (we’ll have to do that ourselves!) So, I worked towards executing a single-table query-scan (i.e. no indexes). I also put in the infrastructure for more advanced queries, even if we don’t use it yet.
Sure enough, around 5PM I finally got the first SQL execution. I take the logical plan (which is structured as a tree), and build another tree for the execution plan, using the Visitor design pattern. PrestoDB (like most SQL query engines and also code compilers) is essentially structured around a series of transformations, going from a tree that is a direct mapping of the SQL query, through lots of intermediate stages, to a final tree representing the execution flow. The final execution tree is then executed directly, often using a “pull model” where the next row is retrieved whenever the caller requests it; each node in the execution tree typically produces its next row by pulling rows from its children, etc. PrestoDB instead uses a push model, I think mostly for performance reasons: it allocates (fairly large) buffers into which it writes the rows and then passes those buffers to the dependent nodes. This results in fewer method calls and also has better memory performance. I suspect it also works much better with their dynamic code compilation approach.
I implemented a mixture of these approaches, where the rows are pushed to the reader, but function evaluation is done using the pull-model. I also implemented an experimental ValueHolder, which is a mutable buffer for a value; it’s supposed to reduce GC pressure a little (by allowing the value objects to be re-used, instead of building a new object every row). This is probably all over-engineered and I’ll rework it over time as I see what’s good and what’s not, but it seemed more important just to get something working.
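The buffer-passing approach can be sketched roughly like this (the names, the page size, and the operator shapes are invented for illustration; PrestoDB's real operators are far more sophisticated):

```java
import java.util.ArrayList;
import java.util.List;

// A minimal sketch of the push model: each operator fills a buffer ("page")
// of rows and pushes it downstream, rather than the caller pulling one row
// at a time. Names here are illustrative, not PrestoDB's.
public class PushModelSketch {
    static final int BUFFER_SIZE = 4;

    interface RowSink {
        void accept(List<int[]> page); // receives a whole buffer of rows
    }

    // Scan operator: reads rows from a source and pushes full pages downstream.
    static void scan(List<int[]> table, RowSink downstream) {
        List<int[]> page = new ArrayList<>(BUFFER_SIZE);
        for (int[] row : table) {
            page.add(row);
            if (page.size() == BUFFER_SIZE) {
                downstream.accept(page);
                page = new ArrayList<>(BUFFER_SIZE); // fresh buffer per page
            }
        }
        if (!page.isEmpty()) downstream.accept(page); // flush the tail
    }

    // Filter operator: wraps a downstream sink, pushing only matching rows.
    static RowSink filter(java.util.function.IntPredicate p, RowSink downstream) {
        return page -> {
            List<int[]> out = new ArrayList<>();
            for (int[] row : page) if (p.test(row[0])) out.add(row);
            if (!out.isEmpty()) downstream.accept(out);
        };
    }
}
```

The point of the page-at-a-time shape is that one method call moves many rows, which is where the fewer-calls and better-memory-behavior benefits come from.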
The rows are then streamed into the HTTP output stream, and we can parse them in the client, just as we’ve done for key/value scans in the past.
Seeing the first SQL query succeed was a huge thing. This means we don’t have to invent our own query language, which would be painful to code, and even more painful for the callers who would have to learn another querying approach. There are always people that prefer the new thing because it’s new, but SQL seems like a much better option for everyone else!
The biggest remaining issue with SQL is that the performance isn’t great. I did some basic performance tuning, reducing logging output and re-using the SQL parser, which made things go twice as fast (in a quick micro-benchmark). Some simple profiling using VisualVM showed that PrestoDB’s SQL parsing and analysis is now the slowest phase. I could work on tuning it, but the bigger win would come from caching the parses and re-using them across identical queries. I may shelve that idea for a later time; we’re getting 300 queries per second (in a single sequential client thread) which is actually probably good enough for now. It’s also good to see that my “quick” query execution is actually quick enough that the SQL parsing is the bottleneck - it suggests that the approach works!
]]>In our last episode, we figured out how to approach JSON and realized that having a key-value interface to structured storage is probably not generally useful.
The plan for today was to split out the code into a key-value store and a structured store, start collecting JSON key metadata, and then work on a SQL parser. We got most of the way there, though the SQL support will have to wait for tomorrow!
The refactor into separate projects went well; I also created a new shared project for shared server-side code (e.g. the BTree implementation). The code feels much better organized now, especially after I moved some of the BTree tests to test the BTree directly, rather than going through an API. (Now that the BTree is shared, it is a ‘service’, so I think I’m justified in still considering this an integration test, and continuing my prejudice against unit tests). So the tests run faster now; this is good. There’s a nasty bug at the moment with shutting down services correctly, which means that the tests fail if you run more than one suite in the same process; this is bad. I tried to attack this today, but it looks like the problem is in one of the libraries I’m using.
Next up, we wanted to build an index of keys: this should allow efficient JSON string encoding, and it may be useful to have the list of keys available to callers or for SQL parsing. There were several steps involved: most importantly this was our first compound operation (a single insert of a JSON object will now perform multiple actions against the data-store). Once we have secondary indexes, most inserts will be compound operations, so this is important functionality. We now have two interfaces for two types of operation: RowOperation and ComplexOperation. Both are executed as a single transaction, but a RowOperation operates on a single row, whereas the ComplexOperation can run multiple RowOperations, one for each row it wants to change. It’s a fairly simple change that is actually very powerful; we did all the hard work when we implemented transactions, we just haven’t been really using them yet!
So now, whenever we insert a JSON object, we do that through a new StructuredSetOperation (which is a ComplexOperation because it inserts multiple rows). It inserts the value, but then loops through each of the keys, checks if they are already in the system index, and if not inserts two records within the system namespace (one mapping from id to name, and one back the other way).
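As a rough sketch of how the two operation types might fit together (the method shapes and the toy commit model here are my assumptions, not the project's actual interfaces):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of RowOperation vs ComplexOperation: both run inside one
// transaction, but a ComplexOperation expands into one RowOperation
// per row it wants to change.
public class Operations {
    // Operates on exactly one row within the transaction.
    interface RowOperation {
        void apply(Map<String, String> txn);
    }

    // A compound operation: decomposes into several row operations.
    interface ComplexOperation {
        List<RowOperation> decompose();
    }

    // Execute atomically: all row operations succeed or none do.
    static void execute(Map<String, String> store, ComplexOperation op) {
        Map<String, String> txn = new HashMap<>(store); // work on a snapshot
        for (RowOperation row : op.decompose()) row.apply(txn);
        store.clear();
        store.putAll(txn); // "commit": swap in the result in this toy model
    }
}
```

A StructuredSetOperation-style insert would then decompose into the value row plus the two system-index rows (id-to-name and name-to-id), all committed together.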
This obviously has problems, for example when people start dynamically generating keys we’ll end up with a huge dictionary. Before we worry about that though, I thought it was more important to figure out whether we do want this key-index at all. It might be that we don’t need it for SQL parsing, and other encodings may be better or good enough!
SQL querying against non-relational datastores is “hot” at the moment, mostly focused on querying of Hadoop-stored big-data. The Hive project has been around for a long time, offering something that was similar to SQL but not-really-SQL. More recently Cloudera has released Impala and Facebook has just released PrestoDB. Both of these new projects are aiming for full SQL compatibility, and both aim to be much faster than Hive by bypassing Map/Reduce and running directly against the underlying data. Map/Reduce is not exactly slow, but it is focused on long-running jobs and on scale-out, rather than on running a short-lived job quickly. There are also efforts to make Map/Reduce itself run short jobs faster, but - as is always the case with open source - a lot of people are trying a lot of different things that are overlapping: may the best solution win!
Both Impala and PrestoDB are faster than Hive, but neither of them are really optimized for ultra-fast queries (e.g. a single primary key lookup). Nonetheless, if we could use one of these two to execute our SQL commands, then we would have a system that will work for big data also, and we’d avoid having to write our own SQL engine. SQL is quite tricky to parse (it’s grown surprisingly big over the years), and is very tricky to run efficiently (because it is declarative, not imperative). I think this complexity is what has caused the NoSQL data-stores not to implement SQL, even when they’re starting to implement their own query languages. Nonetheless, the wealth of SQL-compatible tooling, and the productivity of a good SQL execution engine means that if we could have SQL, I think we should.
Another interesting gotcha with Impala and PrestoDB is that they’re designed for partitioned data, and don’t really support secondary indexes. You might split up your weblogs into one file per day, and then the optimization would be simply to ignore files that are excluded by any date criteria you specify in your query. But if you specify a different filter, the system would have to look at every record. With a secondary index, the data-store would normally first find the matching rows in the secondary index, and then retrieve only those rows. (It’s actually not quite that simple, because if we’re going to be retrieving more than a fraction of the records, it is often faster just to stream through all the records. I think that’s why none of Map/Reduce, Hive, Impala and PrestoDB really support secondary indexes).
Secondary indexes are also what makes query planning hard: to translate declarative SQL into an imperative execution plan, the system has to choose which indexes to use. When you only have one data-access method, the choice is easy; but secondary indexes produce a large (exponential) number of choices for the system to choose from. The query planner is the piece that does that, and a good query planner is the hardest thing about writing a good SQL database. Impala and PrestoDB don’t (currently) have full query planners.
The H2 database is a nice open-source traditional SQL database (again in Java). It does have a good query planner, and support for secondary indexes, so it’s tempting to start with H2. However, if we did that we wouldn’t have all the nice big-data stuff that PrestoDB gives us, like parallel query execution on multiple nodes. H2 is also much more rigid in its schema than is PrestoDB, and we would ideally want schema flexibility to cope with JSON’s lack of fixed schemas.
I spent the later part of the day experimenting and trying to decide between these options.
I discounted Impala early, not because it’s a bad system, but because it uses C++ for speed, which makes it harder to hack apart for our own purposes. I think it’s also questionable whether multi-tenant services should use “unsafe” languages like C++, although we’re sadly probably still years away from these attacks being the low-hanging fruit.
PrestoDB gets its speed through dynamic JVM class generation, which is also not exactly easy to work with, but the system as a whole has been designed to make it easy to plug in extra data-sources.
H2 gets us a lot, but it is really tied in to the traditional model, and going from there to a distributed query engine like PrestoDB is probably harder than adding H2’s functionality to PrestoDB. Using both PrestoDB and H2 would seem like a good option, but then we have to maintain two SQL engines which won’t be entirely compatible for the end-user.
So PrestoDB seems like the winner. Sadly though, PrestoDB’s query execution path is very complicated (to be fair though, it is doing something pretty complicated!) It has a great SQL parser and logical execution planner, but then it’s really designed for distributing that query across multiple execution nodes, and gathering the results back. That’s great functionality that we’ll likely want one day, but it has a high overhead in terms of code complexity and thus in terms of execution time. I’m pretty sure that the overhead is too high to use it for a primary key lookup, for example, where modern databases can sustain 100k queries per second or more.
My current plan is to try to adopt a hybrid model, to use PrestoDB’s basic infrastructure (parser, basic planning) to get from SQL into a nicer structure for execution. From there, I hope to check whether the query is simple enough for me to execute directly. If it is, I’ll execute it using specialized code, otherwise I’ll send it to PrestoDB for full processing. For now I will probably just cause complex queries to return a “not yet implemented” error!
This isn’t ideal though, because I’d have to reimplement a lot of what PrestoDB already does. So I’m basically spelunking around PrestoDB’s codebase, trying to understand how it all fits together and where I can take it apart. More thought is needed here, but also I need to just try things out and learn by making mistakes. The plan for tomorrow is to get some form of SQL querying wired-up, even if it has to be special-cased and duplicates code!
]]>In our last episode, we laid the foundations for querying by implementing a table-scan, and added support for keyspace partitions and for values of different types (byte strings and integers for now).
Today, I had one of those days with lots of thought, but not a lot of code to show for it. Apart from a quick refactor, all I coded was support for JSON values. More important is the approach I think I’ve settled on for how to actually deal with JSON.
We would ideally like to be able to access one data store from different endpoints: a key-value endpoint (e.g. Redis), a document store endpoint (e.g. MongoDB) and a relational endpoint (e.g. Postgres).
I described yesterday how I think that document stores and relational databases actually can support the same data model, even if relational data stores usually choose not to support nested document structures. (Google’s F1 and Postgres’ JSON type are the two ‘exceptions that prove the rule’.) So we can map between document and relational models “easily”. We can think of our data in terms of JSON, even if we choose to store it in a different representation (this is the same as the distinction between the logical data model and the physical data model).
What is trickier is mapping between key-value and document-store/relational models. It’s fairly easy to map data from the richer model to the simpler key-value store, by exposing it as JSON strings. However, it’s not clear how we should map back. We could wrap the values in JSON i.e. “hello” <-> { “value”: “hello” }. But if a key-value entry is inserted with a value that is valid JSON, should we treat that as a string or as an object? Presumably if the caller is setting JSON, it is because they want it treated as such, but we really don’t want to choose a different path based on whether the input value is parseable as JSON, as otherwise we end up with problems like e.g. HTML injection attacks. We would want to rely instead on metadata, but the existing key-value protocols don’t support this because it doesn’t mean anything in the key-value model.
I think that, while it’s nice theoretically, it’s not necessarily that useful to e.g. set your values using a key-value protocol but retrieve them using SQL. There’s also the question of what the point is of the key-value endpoint: once we know that we can simply view them as different views onto the same data-store, why wouldn’t clients use the more powerful API, even if they choose not to use the full functionality?
Instead, it would be easy to use the existing code to build a key-value proxy that can provide whatever mapping rules the user wants to use (providing the metadata at setup time, rather than per-call). So what I’ll likely do is to split up the code tomorrow into two separate services: a key-value service and a JSON-store service. That will also make me happier about the code duplication I introduced when adding JSON support!
Continuing with more thinking than typing, I then looked into binary representations for JSON. BSON (as used by MongoDB) is basically a dead-end: it isn’t efficient in terms of CPU or storage space, and it stores repeated keys repeatedly (which is problematic because it encourages callers to use short, obscure keys rather than self-describing ones). I looked at some of the alternatives out there: Smile is fairly nice; it’s compact and reasonably efficient to parse, and it uses a dictionary approach that reminds me of a simple LZW, which avoids double-storing repeated keys. UBJson is fast to parse and reasonably space efficient, though it doesn’t avoid double-storing repeated keys.
However, we have an additional consideration: we probably want to keep track of metadata if we want to use SQL as our query language. Some SQL query parsers benefit from having the “column names” available. So does a lot of SQL tooling (e.g. ActiveRecord.) Obviously recording every key will be problematic if users start putting in data with dynamic column names, but we’ll cross that bridge when we come to it.
So we can imagine storing every JSON key in a lookup table in the database, and then we could use that to replace key strings with references to that shared dictionary. This is similar to interned strings, and - as with interned strings in memory - would allow us to optimize string equality comparison, because it suffices to compare the values of the “pointers”. So this may prove a pretty big win.
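A minimal sketch of such a key dictionary might look like this (class and method names are illustrative, and the in-database version would of course persist both mappings rather than hold them in memory):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a shared key dictionary: each JSON key name maps to a small
// integer id, stored both ways (name -> id and id -> name), so documents
// can reference keys by id and compare them by id.
public class KeyDictionary {
    private final Map<String, Integer> nameToId = new HashMap<>();
    private final Map<Integer, String> idToName = new HashMap<>();
    private int nextId = 1;

    // Intern a key name: return its existing id, or assign the next one.
    public synchronized int intern(String name) {
        Integer id = nameToId.get(name);
        if (id != null) return id;
        id = nextId++;
        nameToId.put(name, id);
        idToName.put(id, name);
        return id;
    }

    public String nameOf(int id) {
        return idToName.get(id);
    }

    // With interned keys, equality is an integer compare, not a string compare.
    public boolean sameKey(int a, int b) {
        return a == b;
    }
}
```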
We don’t necessarily have to implement this all immediately, but we now have a long-term plan for JSON. When talking to the client, we can use normal (text) JSON or a binary format like Smile, or in theory any JSON representation that becomes popular (even - yuk - BSON!) However, we can store the data internally in a different format. We’ll likely use a binary format similar to Smile, but with a database-wide key-dictionary, making use of our key-metadata collection to get efficient string comparison. We’ll store the key-metadata in a separate keyspace; we originally built keyspaces for indexes, but we can use it for this as well.
As a short-term plan, it should suffice simply to start collecting the key-metadata, and verifying that this meets our requirements for SQL. We can continue to store our JSON as text (unoptimized) for now, until we’re sure it’s useful.
So tomorrow: splitting out the two services, collecting key metadata, and hopefully adding a SQL parser.
]]>In our last episode, we built a cloud-backed Git server. We were able to combine cloud object storage with the key-value data-store we’ve been building to store Git repositories reliably on the cloud, taking advantage of the code Google released to do this in JGit.
Today, I started thinking more deeply about how to enhance the key-value data-store, to get to a data-store that exposes multiple key operations, particularly for reads. We need this for a document store (like MongoDB/DynamoDB), and it’s also the foundation of a relational store.
Multi-key reads are much more useful than multi-key writes; I think the most important difference between a document store and a relational store is that document stores don’t implement multi-key writes. This lets document-stores scale-out easily through sharding (just like key-value stores), while also giving them some of the power that relational stores enjoy. Now, this isn’t the difference most-often cited: that document stores allow flexible schemas, whereas relational stores have a fixed schema. This is, in my mind, a false distinction. Relational stores have chosen to enforce schemas, but don’t really have to. Google’s F1 and Postgres’ JSON type are hints of this.
I also believe that it’s possible to implement a relational datastore that can scale-out (so we don’t need to accept the compromises that document stores make), but that’s another story!
So I started off today with simple support for the most basic query imaginable: a select-all-rows scan. From there, it’s possible to add filtering, range queries etc; the hard bit is of course to run those queries without having to examine all rows! But select-all is a good start, so I implemented a very simple binary format that allowed entries to be streamed over HTTP, added the call to the client I’m building that the tests use, and we can select all entries in the store.
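A length-prefixed format along these lines might look like the following sketch (an illustration of the idea, not the actual wire format used):

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;

// Sketch of a simple binary format for streaming scan results over HTTP:
// each entry is [keyLength][keyBytes][valueLength][valueBytes], written
// one after another until the stream ends.
public class EntryStream {
    public static void writeEntry(DataOutputStream out, byte[] key, byte[] value)
            throws IOException {
        out.writeInt(key.length);
        out.write(key);
        out.writeInt(value.length);
        out.write(value);
    }

    // Returns {key, value}, or null at a clean end-of-stream.
    public static byte[][] readEntry(DataInputStream in) throws IOException {
        int keyLen;
        try {
            keyLen = in.readInt();
        } catch (EOFException e) {
            return null; // no more entries
        }
        byte[] key = new byte[keyLen];
        in.readFully(key);
        byte[] value = new byte[in.readInt()];
        in.readFully(value);
        return new byte[][] { key, value };
    }
}
```

The nice property is that both sides can process entries as they arrive, without buffering the whole result set.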
At this point we could implement basic filtering, but first I wanted to think about how this should really work. We’re going to want to put JSON objects into the store, and be able to query them. It’s simple to do that via a full table scan, but if we want performance to be acceptable we have to implement secondary indexes. I’m going to try storing the secondary indexes in the same BTree structure (the normal approach has a BTree for each index, but I’d like to avoid changing the BTree logic if I can). So we need a way to keep the ‘data keys’ separate from the ‘index keys’. I had the idea of “keyspaces”, assigning a numeric value to each set of keys, and prefixing every key with the keyspace id. We can use the variable-length encoding from Protocol Buffers to save some bytes: it should only need 1 byte for the first 128 keyspaces. This also means that entries will be grouped by keyspace, which is why I think one BTree will probably be fine.
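The keyspace-prefix idea can be sketched with a Protocol Buffers style varint like this (illustrative code, not the project's actual encoding):

```java
import java.io.ByteArrayOutputStream;

// Sketch of keyspace-prefixed keys: the keyspace id is encoded as a
// protobuf-style varint (7 bits per byte, high bit = "more bytes follow"),
// so ids below 128 cost a single byte, and prefixing groups all keys of a
// keyspace together in the BTree.
public class KeyspaceKeys {
    // Encode a non-negative int as a varint.
    static byte[] varint(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // low 7 bits, continuation bit set
            value >>>= 7;
        }
        out.write(value); // final byte, continuation bit clear
        return out.toByteArray();
    }

    // Qualified key = varint(keyspaceId) + raw key bytes.
    static byte[] qualify(int keyspaceId, byte[] key) {
        byte[] prefix = varint(keyspaceId);
        byte[] result = new byte[prefix.length + key.length];
        System.arraycopy(prefix, 0, result, 0, prefix.length);
        System.arraycopy(key, 0, result, prefix.length, key.length);
        return result;
    }
}
```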
Redis has functionality for multiple databases with numeric IDs. Even though I don’t really have a use for this right now, this maps pretty naturally to keyspaces, and we can then test the keyspace functionality. I implemented a test for the keyspace functionality and then the keyspace functionality itself - TDD in action (although I strictly shouldn’t have pushed the commit with the test, as that breaks the CI build)!
It’s probably premature optimization, but I next implemented alternative value formats, as I was thinking about yesterday. I’m not really using it yet, but the idea is that we’ll probably want to be able to tell the difference between a JSON object and a string that happens to look like a JSON object. I also thought we might want to store JSON values in a more efficient representation. MongoDB has BSON, which seems like a reasonable idea although it has some pretty glaring flaws in the implementation. In particular, BSON misses the most important optimization of all, which is to avoid storing the keys repeatedly; this pushes clients to worry about the length of their keys. The lesson to learn is to avoid MongoDB’s mistake of exposing the binary format as the wire protocol, at least until it’s proven. We could try using Google’s Protocol Buffers as the data representation instead; Protocol Buffers is definitely proven, and much more efficient than anything I want to build. Keys are not stored at all, because they’re in the schema instead. The downside (for some) is that Protocol Buffers uses a static schema, so we can’t simply map arbitrary JSON to it. My vague plan here is to treat everything as JSON, although not necessarily store it as a simple JSON string. If we have a schema we can store it as Protocol Buffers; we can use something like BSON or maybe just compression for arbitrary JSON; but we’ll probably always expose it to the client as JSON.
Multiple value formats (for now) are simply a byte prefixed to the value, indicating how the remaining data should be interpreted. Right now, the only two formats I implemented are a raw byte format (the ‘obvious’ format for the key-value store) and a simple format for storing integers. They’re what we’re using at the moment, so that’s what we can test!
I first coded this in a “minimally invasive” way - I didn’t want to change too much - but that meant the code was fragile. After I got it working, I then went back and refactored it to be object-orientated. Rather than return a ByteBuffer for the value and expect the caller to know whether that ByteBuffer has already been decoded or not, I changed the Values helper class (which consisted solely of static methods) into a Value class we instantiate for every decoded value. This has some overhead, but we expect these Value objects to be short-lived (so the Java Garbage Collector should make swift work of them.) It makes the code much cleaner and easier to read, stops me making stupid programming errors, and also lets us use polymorphism instead of tedious switch statements (we subclass Value for each format). A nice win, as long as the performance doesn’t hurt. There are some tricks (like mutable objects) which we can introduce if it ever comes up in a performance profile, but we’ll wait till we see the problem before trying to fix it!
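A sketch of the format-byte approach (the format codes, encodings, and class shapes here are invented for illustration, not the project's actual ones):

```java
import java.nio.charset.StandardCharsets;

// Sketch of prefixed value formats: the first byte of the stored value
// selects the decoding, and each format gets its own Value subclass, so
// callers work with Value polymorphically after the single dispatch point.
public abstract class Value {
    static final byte FORMAT_RAW = 0; // raw bytes
    static final byte FORMAT_INT = 1; // big-endian integer

    // The one place that inspects the format byte.
    public static Value decode(byte[] encoded) {
        switch (encoded[0]) {
            case FORMAT_RAW: return new RawValue(encoded);
            case FORMAT_INT: return new IntValue(encoded);
            default: throw new IllegalArgumentException("unknown format: " + encoded[0]);
        }
    }

    public abstract String asString();

    static class RawValue extends Value {
        final byte[] encoded;
        RawValue(byte[] encoded) { this.encoded = encoded; }
        public String asString() {
            return new String(encoded, 1, encoded.length - 1, StandardCharsets.UTF_8);
        }
    }

    static class IntValue extends Value {
        final long value;
        IntValue(byte[] encoded) {
            long v = 0;
            for (int i = 1; i < encoded.length; i++) v = (v << 8) | (encoded[i] & 0xFF);
            this.value = v;
        }
        public String asString() { return Long.toString(value); }
    }
}
```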
]]>In our last episode, we added lots of Redis commands to our key-value store, cleaned up the architecture a little bit (introducing the command design pattern to cope with the ever growing number of commands), and struggled with Redis’ lack of compare-and-swap.
Well, I found it: Redis does support compare-and-swap, although it’s a little bizarre and seemingly deprecated in favor of the even-more-bizarre Lua scripting. Redis implements something more akin to load-link/store-conditional than the simpler compare-and-swap. It requires issuing 5 commands (WATCH, GET, MULTI, SET, EXEC) and needs an extra network round-trip, but it’s there. So, we can use that to build cool things that need compare-and-swap, which is what I’m actually trying to do!
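For illustration, the five-command sequence might look like this with the Jedis client (a sketch, assuming a reachable Redis server; error handling and the retry loop a real caller would want are elided):

```java
import java.util.List;

import redis.clients.jedis.Jedis;
import redis.clients.jedis.Transaction;

// Sketch of Redis' load-link/store-conditional style compare-and-swap
// using WATCH / GET / MULTI / SET / EXEC.
public class RedisCompareAndSwap {
    // Set key to newValue only if it currently holds expectedValue.
    public static boolean compareAndSwap(Jedis jedis, String key,
                                         String expectedValue, String newValue) {
        jedis.watch(key);                 // 1. WATCH: EXEC aborts if key changes
        String current = jedis.get(key);  // 2. GET the current value
        if (!expectedValue.equals(current)) {
            jedis.unwatch();              // value didn't match; give up
            return false;
        }
        Transaction txn = jedis.multi();  // 3. MULTI opens the transaction
        txn.set(key, newValue);           // 4. SET is queued, not yet executed
        List<Object> result = txn.exec(); // 5. EXEC runs it, or aborts if WATCH fired
        return result != null && !result.isEmpty();
    }
}
```

A caller that loses the race (EXEC aborted) would typically re-read the value and retry, which is exactly the load-link/store-conditional shape.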
So we have a sub-optimal solution, but we don’t really want to fix it right now. What I try to do in this situation is to hide the bad solution behind an interface, so it’s easy to replace it in future. So we have RedisKeyValueStore implementing KeyValueStore. We can use the Redis implementation for now, and replace it with something better designed later. We can make progress, and it may turn out that it’s never worth replacing the “terrible” approach (YAGNI). Because of that I haven’t yet implemented the extra Redis functionality (WATCH / MULTI / EXEC), and I’m just running against traditional non-cloud Redis. Let’s actually get something done!
So today, I built Git storage on the cloud, backed by OpenStack cloud storage. Cloud storage is typically eventually consistent, which is a much weaker guarantee than a traditional filesystem gives you. However, it turns out that git is actually very lenient in what it requires of its storage (the git design is excellent; it is essentially a well-implemented Merkle tree; the genius was realizing that this was sufficient). Git stores blob data (containing the actual file data), and it stores references (which are just names pointing at the latest commits). The blob data is as big as the code commit (KBs or MBs); a reference is less than a hundred bytes (a name and a hash). Further, the blob data is immutable, and thus needs almost no consistency guarantees from its storage; so we can easily store it on cloud storage. The reference data, though, must be stored and updated consistently, so it can’t easily be put onto object storage. It’s thus been non-trivial to host Git on the cloud. But we’ve built a consistent key-value store, which is perfectly suited to solve exactly this problem. Best of all, Google have implemented their Git storage exactly the same way, and open-sourced their code as part of JGit and Gerrit, so I didn’t even have to implement all the details of git.
Just as I did with Redis, JGit has interfaces that mask the ‘terrible’ blob-store and reference-store implementations that use the filesystem. I believe Google maps both of these to BigTable. But we can map the blob-store to OpenStack Storage and the reference-store to the Redis protocol, now that we know we can implement the Redis protocol in a cloud-suitable way. JGit does some great caching, so this works wonderfully, even when I ran against Rackspace’s Cloud Files product (which runs Keystone and Swift), storing data half-way across the US.
This is truly cloud Git: all the data is now stored redundantly on multiple machines / locations, and it uses cloud services via APIs. Swift is obviously great for cloud operations; our key-value store isn’t quite so far along but it can get there architecturally. I think this also demonstrates what I mean by a cloud-first data-store: we’re using Keystone for authentication, we’re using Swift for data-storage. Our key-value store (or something like it) will be a cloud service as well. It doesn’t have any authentication yet, but we’ll do the same thing as Swift does and integrate with Keystone, instead of building a second store of users.
Compare this to how GitHub has done this: they use a traditional filesystem to store their git data, so to ensure that it is available they have to use a complicated DRBD architecture. Although I like DRBD, it is a little bit fragile and things go wrong. I think that the block-storage metaphor is not the right approach for the cloud: it fundamentally imposes a single-server mindset, and it’s difficult to get both high-performance and high-availability. (Amazon’s Elastic Block Store product is probably the most problematic piece of AWS, I think mostly because they favored high performance.)
The real issue is that GitHub have ended up with a complex and not-very-cloudy architecture; for example, presumably they shard their repositories across DRBD volumes, and they presumably had to figure out how to live-migrate ‘hot’ repos, as well as implementing all the disaster recovery themselves. GitHub is solving a lot of ‘infrastructure’ problems. I think those problems should be solved by the cloud, so that GitHub can have a very much simpler, almost stateless, architecture of web-servers consuming well-tested cloud services. GitHub are running on the cloud, but they’re not really using cloud architectures. (That’s not really their fault though - I don’t think this approach for storing Git data is very well known!)
The other big piece is making sure this is all open-source so companies like GitHub can use it confidently. We already have great open-source object storage, and hopefully by the end of the month we’ll be well on the way to great open-source structured data storage :-)
]]>In our last episode, we added a Redis front-end to the datastore, supporting get and set. The vision is that we can use different protocols to talk to our key-value store.
Redis supports a lot more commands than just get & set! I spent much of today implementing other commands: append, delete, exists, increment and increment-by, decrement and decrement-by.
To be able to do that and have some confidence in it, I created some unit tests for Redis, using the Jedis driver. I found and fixed some more bugs as well! A big refactor is that I’m now using the command design pattern for each mutating operation: given that Redis requires so many operations, it doesn’t scale to put all the logic into one big switch statement any more.
I had hoped to implement distributed locks with compare-and-swap via the Redis protocol, but it turns out that Redis doesn’t implement compare-and-swap. The mailing list thread is just confused / wrong, so I’m hoping that someone will revisit this at some stage and it can go into the official protocol. If I want compare-and-swap, I’ll have to use another protocol, I guess. Memcache is an option, although the way it implements compare-and-swap is a little unusual as well (it only supports swapping based on a version id, not based on the value itself). Maybe our own RESTful protocol is the way to go!
I haven’t yet implemented any of the features that make Redis unique: in particular Redis supports values that are themselves data-structures (lists, sets and sorted-sets). I’m thinking through how best to support this in a generic way. One option would be to extend the key; for a list for example we could store key=(a,b,c) as three entries in our BTree: key.1=a key.2=b and key.3=c (metaphorically speaking). Another option would be to encode the list into the value, so that any value could be “typed”; we’d probably end up with something like the COM Variant type. We could also store data structures in a separate page in our system, in a data-structure specifically designed for lists/sets/sorted sets (i.e. not a BTree).
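The first option (extending the key) could be sketched like this, using a TreeMap to stand in for the ordered BTree (the names and the zero-padded index encoding are illustrative, not a settled design):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of storing a Redis-style list by extending the key: the list
// "colors" = [red, green, blue] becomes colors.00000001=red,
// colors.00000002=green, colors.00000003=blue in the ordered store.
public class ListAsKeys {
    private final TreeMap<String, String> btree = new TreeMap<>();

    public void rpush(String key, String value) {
        int next = lrange(key).size() + 1;
        // Zero-pad the index so lexicographic order matches list order.
        btree.put(key + "." + String.format("%08d", next), value);
    }

    public List<String> lrange(String key) {
        List<String> values = new ArrayList<>();
        // All entries of one list share the "key." prefix, so they sort
        // together; '/' is the character after '.', bounding the range.
        for (Map.Entry<String, String> e :
                btree.subMap(key + ".", key + "/").entrySet()) {
            values.add(e.getValue());
        }
        return values;
    }
}
```

One attraction of this option is that a range scan the BTree already supports gives you the whole list in order, with no changes to the BTree logic.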
I need to think this one over. I find the best approach with these difficult ‘philosophical’ problems is (1) to work on something completely different, like implementing a distributed lock system, and (2) to work on the related problems, like a MongoDB / DynamoDB inspired store. Sounds like a plan for the weekend!
]]>In our last episode, we added garbage collection to the BTree, so it’s now at the point where we can start using it to implement useful functionality.
Today, I added a basic Redis front-end to the RESTful API we already had. I had a huge head-start because I was able to work from the redis-protocol project, which actually implements a Redis server in Java. That project (like Redis itself) is designed around a single-server, in-memory implementation. The redis-protocol project is well designed, and it would have been fairly easy to plug in our distributed backend implementation. Nonetheless, I rewrote the code to make sure I had a good understanding, but I followed almost exactly the same design. So implementing basic support for the redis protocol was fairly quick, but I owe all that to the redis-protocol project.
Like redis-protocol, I used the excellent Netty library, which makes dealing with asynchronous I/O in Java easy. Unfortunately, Netty recently went through a major version change (from 3 to 4), and the official documentation and the informal documentation (blog posts, sample code, StackOverflow questions) haven’t quite caught up. It is noticeably better than it was 6 months ago and continuously improving, so it’s definitely worth going with the latest version (for new projects, at least).
The redis protocol itself is a text-encoded protocol pretending to be a binary protocol; e.g. each variable-length component is preceded by its length, encoded as text. It’s a lot more work to parse than a true binary protocol would be (both in terms of code and in terms of CPU overhead). Seems like an odd design decision…
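To make that concrete, here’s a rough sketch of what goes over the wire (the class and method names are mine, not from redis-protocol): a command is an array of bulk strings, and every length travels as ASCII digits.

```java
import java.nio.charset.StandardCharsets;

// Minimal sketch of the Redis (RESP) wire encoding: a command is an array of
// bulk strings, and every length is transmitted as ASCII text.
public class RespEncoder {
    public static byte[] encodeCommand(String... args) {
        StringBuilder sb = new StringBuilder();
        sb.append('*').append(args.length).append("\r\n");    // array header
        for (String arg : args) {
            byte[] data = arg.getBytes(StandardCharsets.UTF_8);
            sb.append('$').append(data.length).append("\r\n") // bulk length, as text
              .append(arg).append("\r\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "SET key value" -> *3\r\n$3\r\nSET\r\n$3\r\nkey\r\n$5\r\nvalue\r\n
        System.out.println(new String(encodeCommand("SET", "key", "value"),
                StandardCharsets.UTF_8).replace("\r\n", "\\r\\n"));
    }
}
```

A parser has to read digits, convert them from text, then read that many bytes — which is exactly the extra work (and CPU) a length-prefixed binary framing would avoid.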
On that note, there was a great talk on HTTP 2.0 given at Heavybit by Ilya Grigorik (video coming soon, I hope!) In it, he pointed out that HTTP has great advantages: great client support, firewall and proxy friendly, great developer tools, and in general just having an incredibly complete infrastructure around it. Where it falls down is that performance can be inferior to a raw TCP connection (particularly if you want concurrent requests), so high-speed protocols typically end up rolling their own binary-over-TCP protocol. These don’t get any of the benefits that you get with HTTP, but are fast. However, with HTTP 2.0 (the protocol formerly known as SPDY), HTTP is now a multiplexed binary protocol, and should be comparable in performance to a hand-rolled binary protocol. In other words, there should hopefully be no new binary protocols. Even today, before HTTP 2, you should probably base your protocol on HTTP if you can get away with it, and know that a big performance boost is coming with HTTP 2. You can still beat HTTP 2 with a custom binary protocol in theory, but HTTP 2 will beat the protocol you build in practice.
The next step for our Redis experiment was to implement the basic redis commands, notably get and set. With that, it’s now possible to access our key-value store using the redis protocols, so all the redis libraries and tooling should work (for the small subset I’ve implemented).
It is interesting to compare this to official Redis. Our big advantage is that we are distributed. Redis is persistent, which is great, but then the big question is “what happens when a machine fails?” Memcache has a much more coherent answer here, because it is a cache: values can go away for any reason. Redis doesn’t have quite the same self-consistent answer.
As well as HA, we also have a multi-threaded implementation: we support concurrent reads, though our writes are serialized by Raft. This came about naturally; I haven’t been carefully coding all the time to support concurrent operation. We’ve inherited the concurrent reader design from LMDB, and some thought was required when it came to garbage collection, but everything else is just sensible New-Java: minimizing mutability, basic locking around shared data structures etc. Of course, I know there are plenty of threading bugs still left, but the design is naturally multi-threaded in a way that it wouldn’t be in other languages (like C).
Because we operate on a cluster of machines, writes will be much slower than official Redis. I suspect we’ll be at least as fast as reliable replicated writes will ever be in Redis. Benchmarking reads (after some performance profiling) against Redis would be very interesting: we should be slower on a per-request basis, because we’re in Java and because of our copy-on-write database, but making up for that is the fact that we can run requests concurrently, even making use of multiple cores. I suspect for mostly-read benchmarks we may be able to get much higher throughput.
But what’s more interesting to me than a pure speed contest is the idea that Redis is just a protocol on a universal key-value store, and one that we’ve built to work well on “the cloud”. We can implement the memcache protocol, the MongoDB protocol, a SQL database protocol; all backed by the same data store. Things are about to get interesting…
]]>In our last episode, we added more functionality to the BTree: node splitting & support for big values. We also started recording page tracking information into the transaction records, in preparation for reclaiming old space.
Reclaiming old space is important, because we use a copy-on-write approach rather than changing pages in-place. This is relatively unusual for a database, but allows us to support readers without using locks. Reclaiming the old versions (that are no longer referenced by active transactions) is required to stop the database growing without bounds, so most of today was spent on this.
Getting the basic design of the garbage collector right was tricky. We maintain a free-space map in-memory, so that we can allocate new space quickly (without going to disk every allocation). With each transaction record, we include the pages that we freed and the pages that we allocated. Thus, a complete list of transactions can be used to rebuild the free-space map. This is elegant, because if a transaction commits, that is exactly when we want to persist the page allocations. Conversely, if we rollback or otherwise fail a transaction, then we want to roll back the page allocations. So it is nice that they are actually part of the same structure - we don’t have to worry about keeping them synchronized because they are one and the same.
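The replay logic is conceptually tiny. Here’s a hedged sketch (class and field names are mine, not the actual project code): each transaction record carries the pages it freed and allocated, and replaying committed records in order rebuilds the free set.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch: rebuilding the in-memory free-space map by replaying transaction
// records. A page is free if some transaction freed it and no later
// transaction re-allocated it.
public class FreeSpaceReplay {
    public record TxnRecord(List<Long> freedPages, List<Long> allocatedPages) {}

    public static Set<Long> rebuild(List<TxnRecord> committed) {
        Set<Long> free = new HashSet<>();
        for (TxnRecord txn : committed) {
            free.addAll(txn.freedPages());       // pages released by this commit
            free.removeAll(txn.allocatedPages()); // pages consumed by this commit
        }
        return free;
    }

    public static void main(String[] args) {
        List<TxnRecord> log = List.of(
                new TxnRecord(List.of(3L, 4L), List.of(7L)), // txn 1: frees 3,4; allocates 7
                new TxnRecord(List.of(7L), List.of(4L)));    // txn 2: frees 7; reuses 4
        System.out.println(rebuild(log)); // pages 3 and 7 end up free
    }
}
```

The elegance mentioned above falls out of the data: a rolled-back transaction simply never enters this list, so its page allocations vanish along with it.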
Now, once again, we have a log-structured system for tracking free-space, with an in-memory representation. This is exactly the same approach we’re using for all of our projects. When we startup, we replay the allocation records to rebuild the state. We have the same problem: we need to periodically take snapshots, so that we can keep the amount of time taken to replay the logs under control. We can take a snapshot, which is just another special page type, and we can reference this from the transaction. One trick is that we don’t have to write the snapshot on every transaction; instead we can write it periodically (when it’s “worth it”). We do have to be careful not to delete the older transaction records until they are no longer needed for rebuilding the current free space map.
Finally, we can actually reclaim the pages. We keep track of every active read transaction; we can clean up after a write transaction only when no read transaction references that transaction (or an earlier version). This is the same trick that is used by LMDB and ZFS (for snapshots).
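The reclaim condition itself can be sketched in a few lines (illustrative names; the “versions” here are transaction sequence numbers):

```java
import java.util.Collection;
import java.util.List;

// Sketch of the reclaim rule: pages freed by a write transaction can only be
// recycled once no active read transaction is pinned to that version or an
// earlier one — otherwise a reader might still reach the old pages.
public class Reclaimer {
    public static boolean canReclaim(long writeTxnVersion, Collection<Long> activeReaderVersions) {
        for (long reader : activeReaderVersions) {
            if (reader <= writeTxnVersion) return false; // a reader still needs this version
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(canReclaim(5, List.of(7L, 9L))); // true: all readers are newer
        System.out.println(canReclaim(5, List.of(4L, 9L))); // false: a reader is pinned at v4
    }
}
```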
So now we free up old page versions and can reuse the space. One issue is the strategy for how we allocate memory from the free space. This is the classic memory allocation problem; there is no “right answer”. The best allocators, like tcmalloc, typically use buckets to ensure speed while avoiding fragmentation. Instead, I went with a simpler allocator inspired by ZFS: first fit. In “first fit”, we simply find the first bit of free space that can “fit” an allocation request. This is well-known to cause fragmentation, but I wanted to start simple! ZFS uses a neat trick, which actually rotates through available space (changing the zero-point to redefine “first”), which means that writes rotate sequentially around the disk. This has the advantage that older versions survive for longer (one whole ‘trip’ around the disk), which is great for recovery of corrupted data. However, we expect our database to be largely in-memory, and marching allocations through the whole disk is likely to hurt cacheability. So I went with a simple ‘first fit’, not ZFS’s clever spin on it. Fragmentation is likely to be a problem; we’ll have to see how it behaves!
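A first-fit allocator is genuinely simple — here’s roughly the whole thing, sketched with a sorted map of free extents (my names, not the project’s):

```java
import java.util.Map;
import java.util.TreeMap;

// A minimal 'first fit' allocator: free extents kept sorted by offset, and an
// allocation takes the first extent large enough, shrinking it in place.
public class FirstFit {
    // offset -> length of each free extent, ordered by offset
    private final TreeMap<Long, Long> free = new TreeMap<>();

    public void addFree(long offset, long length) { free.put(offset, length); }

    // Returns the allocated offset, or -1 if nothing fits.
    public long allocate(long size) {
        for (Map.Entry<Long, Long> e : free.entrySet()) {
            if (e.getValue() >= size) {
                long offset = e.getKey();
                long remaining = e.getValue() - size;
                free.remove(offset);
                if (remaining > 0) free.put(offset + size, remaining); // keep the tail
                return offset;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        FirstFit alloc = new FirstFit();
        alloc.addFree(0, 64);
        alloc.addFree(128, 256);
        System.out.println(alloc.allocate(100)); // 128: first extent that fits
        System.out.println(alloc.allocate(32));  // 0: the 64-byte extent fits this one
    }
}
```

(Coalescing adjacent free extents when pages are released is the other half of the job, omitted here.)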
We now have a working (basic) BTree! There are still lots of issues, but we can hopefully fix those ‘as we go’ and start building more real features.
There wasn’t much time left in the day, so I proposed a patch to Barge, that lets us return a value from the StateMachine. This means that we can do write-operations that return a value. I implemented the increment operation, which just adds one to a counter value - this is the simplest useful operation I could think of! I also tweaked the code a little more so that we have a doAction method which operates on a log operation even at the top level; this should make it much easier to add lots more operations. Hopefully we’ll add a couple more operations tomorrow. Redis made the news today for having a broken-by-design cluster implementation, so it might be fun to work towards something that looks a little like Redis!
]]>In our last episode, we cleaned up the project a little, set up CI using CircleCI, and set up real transactional behaviour.
Today, I attacked one of the things that makes a BTree an actual tree: splitting nodes. In a traditional BTree, a page is limited to a fixed size, typically 4KB or 8KB. I’m creating what I call a ‘relaxed’ BTree, where we don’t have to be so strict about page sizes. However, we still want to limit the page size. Our leaf page format currently uses 16 bit offsets, so is limited to 64KB pages. More importantly, if we never split pages, we just end up with a single page, and we never get a tree structure. So I implemented node splitting, splitting whenever the page is bigger than 32KB. This means we start creating branch pages, and this (of course) uncovered a few bugs.
Next, I wanted to work around that 64KB limit on the leaf page size, because we might very well want bigger values. So I introduced a special format for leaf nodes with only one entry, that allows 4GB values. Because of the way we’re splitting our leaves, big values will always end up on a leaf node with just one value. I’m not sure whether this is a good idea. The “one value” thing is a bit magical. It’s also not ideal to have two formats, although they are very similar to each other. Also, storing 4GB values at all is definitely not a good idea. I’m not sure where the cut-off point comes (1GB? 1MB? 1KB?), so we’ll probably revisit this. Postgres stores big values in page-sized chunks, which has the advantage that it’s quick to seek to random parts of BLOBs. Our design would make it fairly easy to do something similar, or just to store big values in a separate page, or even to store them federated onto object storage. Whatever we eventually decide here will probably feed back into the page-splitting code as well, but for now we can store huge values.
We don’t support huge keys - it would be straightforward to do so, but it’s probably not a good idea. Keys are copied into the branch pages, so it is less straightforward to implement than big values (which occur only in leaves), and there’s a bigger overhead to having big keys. For now, we won’t support keys bigger than 32KB, and we’ll probably artificially limit them to a much smaller size to encourage good usage (1KB?)
Right now, we don’t yet reclaim any pages, so our database will grow indefinitely. This is far from ideal! The first step was to extend the transaction to record free pages. The plan is to go through and reclaim transactions once they’re no longer needed (once there are no read transactions that are referencing them), and add the free pages to a free list. We can then allocate future pages from that free list, and so our database will no longer grow indefinitely. The challenge is to do this in a consistent and persistent way; more on that next time!
]]>In our last episode, we took the basic design that we’d used on day 1 to build an AppendLog, and built a basic key-value store that could store values. I had to take lots of shortcuts to get so far in the first two days, and much of today was spent catching up on the technical debt, with a few new features.
First off, I created a shared project in maven, which means we don’t have to keep repeating the version of the libraries. You can do this in a parent module or in a shared project; either way, it makes everything a lot more DRY. I’d also copied-and-pasted some code, so I then moved that code into the shared project. A quick Kata to start the day!
We had tests, but we weren’t running them automatically. I set up CircleCI, which is a hosted continuous integration service that I love because it is so fast (slow CI means you don’t get the rapid feedback loop which makes CI so useful). In order to do that, I set up Barge as a git submodule so it effectively becomes part of the project. Git submodules are great, but suffer from a truly terrible CLI. Technically we could have got away without doing this right away, because Barge is awesome and is deployed into the maven repositories, but we know that we’re going to want to make changes to the Barge code to add some features, so the git submodule is the way to go. A bit of messing around with some maven details, and the CircleCI build was up and running.
The tests are more integration tests than unit tests: they launch a cluster of servers (embedded in-process) and then test using the public HTTP interface. I find that integration tests are a lot more stable, so there’s less need to constantly fix the tests; I think they’re testing the right thing - our public contract. I also find that unit tests are much less important in a strongly typed language like Java than they are in a weakly-typed language like Ruby or Python. If you find yourself needing a lot of unit tests, you may not be using the compiler to maximum advantage: consider introducing some strongly typed classes to enforce what you’re testing. In short, you may be writing Old-Java, not New-Java. There are exceptions to the rule of course; unit-testing implementations of complicated algorithms is generally a good idea, for example!
Next up was enforcing uniqueness of keys, because although our generic BTree can support duplicate keys, we don’t really want duplicate keys in a key-value store. So now we can replace values, and we added a test to verify that.
Then, support for deletion by key. Previously the only change supported was insertion, so it was important to figure out a good approach here. Rather than have a set of methods, one for each action, instead we have a doAction method, which is parameterized with an Action enum. We have to do this for the Raft log anyway: every action must be serialized as a message, the idea is simply to use that message, rather than fight it and marshall/demarshall back and forth and switch before dispatching to a set of similar methods.
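The shape of the idea, sketched with hypothetical names (the real operations are serialized log messages; this toy skips the serialization):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the single-entry-point design: the Raft log entry already names
// the action, so we dispatch on it directly rather than exposing a method per
// operation and marshalling back and forth.
public class KeyValueStateMachine {
    public enum Action { PUT, DELETE }

    private final Map<String, String> store = new HashMap<>();

    // One entry point for every mutating operation, driven by the log entry.
    public String doAction(Action action, String key, String value) {
        switch (action) {
            case PUT:    return store.put(key, value);  // returns any replaced value
            case DELETE: return store.remove(key);
            default:     throw new IllegalArgumentException("unknown action: " + action);
        }
    }

    public String get(String key) { return store.get(key); }

    public static void main(String[] args) {
        KeyValueStateMachine sm = new KeyValueStateMachine();
        sm.doAction(Action.PUT, "a", "1");
        System.out.println(sm.get("a"));  // 1
        sm.doAction(Action.DELETE, "a", null);
        System.out.println(sm.get("a"));  // null
    }
}
```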
Finally, I cleaned up the transaction handling, following the basic design of LMDB. For each transaction commit, we write the new root page id into a section of the file header, rotating through a fixed-array of slots. When we start up, we scan the section, looking at these slots to find the newest root. This is how read transactions can run without locking (at the expense of write transactions needing to do copy-on-write). I extended the LMDB approach a little bit, by writing a special “transaction record” page for each write transaction, which includes the root page id, a transaction sequence number, and a pointer to the previous transaction. The slot in the header includes a pointer to that transaction page, as well as the root page id (which is redundant, but avoids having to fetch the transaction page to find the root page id). I’m thinking this will help when it comes to implementing page-reclamation (garbage collection), and that it may be more useful generally: we’ll see!
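The startup scan over the header slots is simple to picture — a sketch with illustrative field names (the real on-disk layout obviously differs):

```java
// Sketch of the LMDB-style commit slots: the file header holds a small fixed
// array of slots, each commit rotates to the next slot, and on startup we
// scan them all and take the one with the highest transaction sequence.
public class HeaderSlots {
    public record Slot(long txnSequence, long rootPageId, long txnRecordPageId) {}

    public static Slot findNewestRoot(Slot[] slots) {
        Slot newest = null;
        for (Slot s : slots) {
            if (s != null && (newest == null || s.txnSequence() > newest.txnSequence())) {
                newest = s;
            }
        }
        return newest;
    }

    public static void main(String[] args) {
        Slot[] slots = {
                new Slot(41, 100, 900),
                new Slot(42, 112, 912), // most recent commit
                new Slot(40, 96, 896),
        };
        System.out.println(findNewestRoot(slots).rootPageId()); // 112
    }
}
```

Carrying the root page id redundantly in the slot (alongside the transaction-record pointer) is exactly the trade described above: one less page fetch on startup.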
]]>In our last episode, we started off with the basic design: a Raft distributed-log of changes using the Barge implementation, persisting to a data structure we implement to store the state. Last time we built a simple append-log as-a-service, like Amazon Kinesis.
Today, I started working towards a simple key-value store, like Redis or memcache. Key-value stores let you associate values with keys, and let you retrieve the value given the key, and that’s about it. Unlike richer datastores, they don’t typically allow operations across multiple keys.
Those limitations were chosen because, without multi-key operations, sharding for scale-out is easy. It also allows them to use a hashtable for easy and fast lookups. The disadvantage is that the model is quite limited, so the datastores typically end up supporting a long list of complex operations on entries, or users must design complex multi-key datastructures (it pushes the complexity onto the caller). It was these observations that actually spawned the growth of the relational model back in the 70s when these datastores were commonly used. But they still have their place. Then it was because they were fast and simple to implement, now it is because they’re fast and allow easy scale-out. Fast is always nice, but the “easy to implement” bit makes them a good next-step for our little project!
We will again rely on the Raft consensus log for our key-value operations, but we’ll need to store the state in a different data structure: one that supports assigning a value to a key and retrieving the value for a given key.
Redis and memcache both choose the hashtable for this data structure. It’s a good choice because it is fast; with a good hash function, both read and write operations take place in O(1) time, although growth of the hashtable is a little tricky.
LevelDB uses the log-structured merge-tree to store its data. This allows LevelDB to support another operation: in-order key iteration, which is often enough to avoid having to support those complex operations or requiring complex datastructures. The LevelDB implementation offers very good performance for writes, and good performance for reads. The implementation suffers from occasional slowdowns for writes (during compactions), and higher memory & CPU overhead for reads. There is work going on to address these issues with LevelDB (Basho’s LevelDB and Facebook’s RocksDB).
There are many data-store implementations that use the BTree: LMDB, BerkeleyDB and every relational database. The BTree is like a binary tree, but puts more values into each node to amortize the data-structure overhead. It maps very well to page-based systems like modern CPUs and block-devices. Like the log-structured merge-tree, BTrees support in-order key iteration.
LMDB is particularly interesting because it uses a clever design which is almost lock-free on reads, but only supports a single-writer. Because all our writes are serialized through a log anyway, we’ll only have a single-writer, so this feels like a great fit.
Although a hashtable implementation might be faster for a strict key-value store, the hope is that the BTree will in practice be similar in performance (with the LMDB design), and will support additional operations (like key-iteration) that will prove useful for the future plans. For example, we can use it when we want to implement secondary indexes.
From the start, we’ll allow duplicate keys in our BTree, because secondary indexes require them and they’re painful to add afterwards. This also dictates the interface that our BTree will expose; because keys can be repeated we just allow a caller to walk through all the entries, starting from a specified key. It’s trivial to add uniqueness constraints, or to support a simple get request on top of this.
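That interface can be sketched like so — the backing store here is just a sorted list for illustration (the real structure is the on-disk BTree, and all names are mine):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of the cursor-style interface implied above: the tree permits
// duplicate keys, so the primitive is "walk entries starting at a key", and a
// unique get() is layered on top.
public class DuplicateKeyTree {
    public record Entry(String key, String value) {}

    private final List<Entry> entries = new ArrayList<>(); // kept sorted by key

    public void insert(String key, String value) {
        entries.add(new Entry(key, value));
        entries.sort(Comparator.comparing(Entry::key)); // stable: duplicates keep order
    }

    // Walk all entries with key >= startKey, in key order (duplicates included).
    public List<Entry> walkFrom(String startKey) {
        List<Entry> result = new ArrayList<>();
        for (Entry e : entries) {
            if (e.key().compareTo(startKey) >= 0) result.add(e);
        }
        return result;
    }

    // Unique 'get' built on the walk primitive: first match wins.
    public String get(String key) {
        for (Entry e : walkFrom(key)) {
            return e.key().equals(key) ? e.value() : null;
        }
        return null;
    }

    public static void main(String[] args) {
        DuplicateKeyTree tree = new DuplicateKeyTree();
        tree.insert("b", "2");
        tree.insert("a", "1");
        tree.insert("b", "2b"); // duplicate key, as a secondary index would need
        System.out.println(tree.walkFrom("b").size()); // 2
        System.out.println(tree.get("a"));             // 1
    }
}
```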
I did debate just reusing the (excellent) LMDB implementation via JNI (or writing everything in C++), but I’ve decided to roll-my-own in Java. Hopefully this will pay off: there will be opportunities to make different decisions for our particular use cases.
To produce a fast BTree implementation, we’ll continue to use ByteBuffers. Object allocation on the JVM is fast, but garbage collection can be painful, so we want to try to keep object creation under control unless the objects are very short-lived (i.e. they would be on the stack in C). For a page we’re reading, the idea is to keep the data in the ByteBuffer and extract entries directly. This is pretty much C-style pointer code and is just as tedious to get right as it would be in C, but then works well.
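Here’s the flavour of that pointer-style code, on a made-up leaf layout (a count, a table of 16-bit offsets, then the packed values — illustrative only, not the project’s actual page format):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// C-style page reading: extract entries directly from the ByteBuffer using
// absolute offsets, with no intermediate objects.
// Layout: [count:int16][offset_0..offset_n:int16][packed value bytes],
// where offset_i..offset_{i+1} delimit value i (so n+1 offsets are stored).
public class LeafPageReader {
    public static ByteBuffer buildPage(String... values) {
        ByteBuffer page = ByteBuffer.allocate(4096);
        int n = values.length;
        page.putShort((short) n);
        int pos = 2 + 2 * (n + 1);               // data starts after count + offset table
        for (String v : values) {
            page.putShort((short) pos);
            pos += v.getBytes(StandardCharsets.UTF_8).length;
        }
        page.putShort((short) pos);              // end offset closes the last value
        for (String v : values) page.put(v.getBytes(StandardCharsets.UTF_8));
        return page;
    }

    // Read value i straight out of the buffer, pointer-arithmetic style.
    public static String readValue(ByteBuffer page, int i) {
        int start = page.getShort(2 + 2 * i) & 0xFFFF;       // unsigned 16-bit offsets
        int end = page.getShort(2 + 2 * (i + 1)) & 0xFFFF;
        byte[] out = new byte[end - start];
        for (int j = 0; j < out.length; j++) out[j] = page.get(start + j);
        return new String(out, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ByteBuffer page = buildPage("alpha", "beta");
        System.out.println(readValue(page, 0)); // alpha
        System.out.println(readValue(page, 1)); // beta
    }
}
```

The `& 0xFFFF` unsigned-widening and the hand-computed offsets are exactly the tedium mentioned above — and exactly what a 16-bit offset table costs us when pages approach 64KB.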
Implementing writes C-style (straight into the ByteBuffer) is trickier, particularly if we have a fixed buffer size. Instead, we’ll ‘extract’ the page before we write to it, converting it into a more natural Java data structure (e.g. a List of objects); applying changes is then simple. Then we’ll serialize the final version at the end back into a ByteBuffer. It makes our code much simpler; it may actually be faster than always working against the ByteBuffer when we do multiple operations; and it allows for “future optimizations” (“tune in next time…”) The big downside is that much of the code is duplicated because we now have two memory representations: one for a clean (read-only) page and one for a dirty (read-write) page.
A traditional BTree sets a maximum size for a page (e.g. 4KB or 8KB); we have to figure out what we’ll do when we exceed the limit. The traditional approach is to rebalance the BTree, moving entries around and splitting pages so that everything fits. Instead, we’ll implement what I call a ‘relaxed’ BTree, where we allow pages to be arbitrarily sized. This does mean that our ‘page id’ will be an offset into the file. With a 4 byte page id, we’d be limited to 4GB of data. We’ll force pages to align to 16 byte multiples, so we actually get 64 GB out of a 32 bit page id; it costs us a bit of padding space, but buys us a more sensible database size and better aligned data may be faster (although this may be a myth on modern CPUs).
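The page-id arithmetic is two one-liners — the only subtlety is treating the 32-bit id as unsigned so the full range is usable:

```java
// Sketch of the page-id arithmetic: a 32-bit page id addresses 16-byte
// aligned file offsets, stretching the addressable size from 4GB to 64GB.
public class PageId {
    static final int ALIGNMENT = 16;

    public static long toFileOffset(int pageId) {
        return (pageId & 0xFFFFFFFFL) * ALIGNMENT; // unsigned-widen, then scale
    }

    public static int toPageId(long fileOffset) {
        if (fileOffset % ALIGNMENT != 0) {
            throw new IllegalArgumentException("offset not 16-byte aligned: " + fileOffset);
        }
        return (int) (fileOffset / ALIGNMENT);
    }

    public static void main(String[] args) {
        System.out.println(toFileOffset(1));    // 16
        System.out.println(toPageId(4096));     // 256
        System.out.println(toFileOffset(-1));   // 68719476720: just shy of 64GB
    }
}
```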
For each transaction, we gather all the dirty pages, and then at commit we write them each to new positions in the file, which gives them a new page id; we update the parent pages, and write those, recursing all the way up to the root page. We then write a new ‘master’ page which points to the root.
We avoid overwriting data so that readers can proceed lock-free in parallel (on the previous version), and so that we don’t risk corrupting our data. This approach was first used by the ZFS filesystem to avoid corruption, and LMDB uses it to allow concurrent reads.
For now, the implementation doesn’t do any rebalancing, so we just end up with one big leaf page. It’ll fail pretty soon, because the current leaf format limits data to 16 bit offsets.
Again though, we add a simple RESTful interface, add some tests, and we have a working (though very limited) key-value store.
]]>These have all been built before; what is going to be really new here is that all the services are going to be built with the same design approach, and they’re all going to be “cloud-native”. If that sounds like marketing-speak, what I mean is that it’ll be designed for “the cloud”: within the limitations (unreliable machines, not terribly powerful individually), and making use of the capabilities (e.g. API driven infrastructure, reliable object storage like S3 or Swift). And it’ll be built this way from the start, not as a bolted-on after-thought.
The general architecture I’m going to explore is to use an append-only log to coordinate and store changes to the state. Every operation that changes the state must be recorded in order into the log. To determine (or recover) the state, you just replay every operation from the log. So that recovery doesn’t take unbounded time, we’ll periodically snapshot the state and store it (likely in object-storage); then we need only replay the subsequent portion of the log from the last snapshot.
This basic idea, of logging transactions and applying them in batches, is common to most datastores. It is similar to the ARIES transaction log model that relational databases use. It’s also similar to the model that LevelDB and many other NoSQL databases use (the Log-Structured Merge-Tree).
To be “cloud-native”, we want the log to be distributed & fault-tolerant. For that, I’m planning on running it on a group of machines using the Raft consensus protocol. Raft is easy to understand and implement (at least compared to Paxos). Raft guarantees that as long as a quorum of machines is available, the log will be available; data can be appended to the log, and once acknowledged data is durable (won’t disappear). There’s a great implementation in my preferred language (New-Java) available in the form of Barge.
(New-Java is what I call Java without the legacy cruft that gave Old-Java a bad name. It uses annotations instead of XML; it relies on dependency injection; apps are self-contained rather than relying on some monstrous application container; the coding style works with Java’s limitations and uses them to its advantage - if you’re going to pay the price, you may as well reap the rewards. Typical indications that you’re using New-Java are lots of annotations and reliance on Google Guice and Guava, directly using Jetty or Tomcat, your code follows the ideas of the Effective Java book. If you see lots of XML and Spring, you’re probably using Old-Java.)
I wanted to start with a simple service, so I’ve started with an Append-Log as-a-Service; similar to Apache Kafka, or Amazon Kinesis. The append-log service allows clients to append chunks of data to a virtual file/stream; it will then allow those chunks to be retrieved in order. Periodically, old data is deleted and can no longer be retrieved. It’s admittedly a bit of a confusing place to start, given we’re internally using a Raft append-log with similar functionality, but it is the simplest service I can imagine.
In my first attempt, I tried to use the Raft log as the append log itself. But, based on some excellent feedback from the main Barge developer, I moved to a model where we copy the data from the Raft log into a second set of files (the append files).
The obvious disadvantage here is that everything gets written to disk twice. (This problem, called write amplification, actually crops up in most log-based datastores.) But it keeps the design much simpler, and it was the approach the main Barge developer recommended.
The code (which was originally heavily based on the example code for Barge), is available on Github.
Barge is doing most of the heavy lifting; we really just have to implement the StateMachine interface. Whenever an append operation comes in, we serialize it, and try to commit the data to the Raft log. Barge takes care of reaching consensus across our cluster of machines, and then calls applyOperation on every machine when the operation is successfully committed. We then have to apply the operation to our state, which here means appending the data to our append files.
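The shape of that code looks roughly like this. Note this is a sketch against a simplified stand-in interface — Barge’s real API differs in its details — with the append files reduced to an in-memory list:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Sketch of the state machine: Barge reaches consensus on each serialized
// operation, then calls applyOperation on every replica, which applies it to
// local state — here, appending the data to the append "files".
public class AppendLogStateMachine {
    interface StateMachine { // simplified stand-in, not Barge's actual interface
        void applyOperation(ByteBuffer entry);
    }

    public static class AppendLog implements StateMachine {
        private final List<String> appendFile = new ArrayList<>(); // stand-in for the real files

        @Override
        public void applyOperation(ByteBuffer entry) {
            // Only called once the cluster has committed the entry, so simply
            // applying it in order keeps every replica's state identical.
            byte[] data = new byte[entry.remaining()];
            entry.get(data);
            appendFile.add(new String(data, StandardCharsets.UTF_8));
        }

        public List<String> contents() { return appendFile; }
    }

    public static void main(String[] args) {
        AppendLog log = new AppendLog();
        log.applyOperation(ByteBuffer.wrap("chunk-1".getBytes(StandardCharsets.UTF_8)));
        log.applyOperation(ByteBuffer.wrap("chunk-2".getBytes(StandardCharsets.UTF_8)));
        System.out.println(log.contents()); // [chunk-1, chunk-2]
    }
}
```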
We add a RESTful front-end onto it, and add support for reading from the logs. And that’s it - for now. There are a lot more features that need to be implemented, some in Barge (most notably log compaction and dynamic reconfiguration) and some in “our” code (archiving the files to object-storage, scaling & recovery from failures, etc). But we can’t do everything on day 1! The hope is that we’ll be able to build out a bunch more services, and then add missing features across all of them with shared code.
After the overall design, the most interesting detail here is the format of our append files. We use an approach similar to Kafka: we mmap a data file and rely on the OS to page it in as needed; this means we’re not limited to the amount of data we can fit in memory, but can be just as fast when the working set fits into memory (we can theoretically achieve the elusive zero-copy data-reads). It also offloads most of the hard work to the OS. There’s no such thing as a free lunch, and sometimes the OS heuristics are not a good match for the data-access patterns, but in this case I think heuristics like read-ahead and LRU will work well; we write sequentially, and I expect most reads will be sequential and mostly focused on the newest data.
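In Java, mmap comes via FileChannel.map. A minimal round-trip sketch (a throwaway temp file; the real code maps the actual append files):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch of the mmap approach: map a file with FileChannel.map and let the OS
// page it in and out; reads then go straight through the page cache.
public class MmapDemo {
    public static String roundTrip() {
        try {
            Path path = Files.createTempFile("append-log", ".dat");
            try (FileChannel ch = FileChannel.open(path,
                    StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
                byte[] message = "hello, mmap".getBytes(StandardCharsets.UTF_8);
                map.put(message);   // write goes straight into the mapped pages
                map.force();        // flush dirty pages to disk

                byte[] back = new byte[message.length];
                map.position(0);
                map.get(back);      // read directly from the mapping
                return new String(back, StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(path);
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip()); // hello, mmap
    }
}
```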
Our file format is fairly simple. We have a fixed-size file header which identifies the file format version (it’s always good to have an escape-plan that allows the next version to change). We have a fixed size per-record header which includes the length and a checksum. We can walk the records sequentially, starting from any known-position, adding the length to find the next record. We can’t randomly seek to an arbitrary position and find the next record, though, so we choose to use the offset into the file as the record identifier. The checksum verifies that this is a valid record start, and can also check file corruption. (If we detect corruption, we can hopefully find a non-corrupted copy: the data is initially stored in the Raft log, and also in the append-files, both on the servers and in object-storage). Theoretically there’s a small chance that the checksum randomly matches at a non-record position, so that dictates our security model: a caller has all-or-nothing access to a log, because we can’t guarantee record level security if we accept client-specified record offsets.
It’s thus easy to implement read: we seek to the provided file position, verify the checksum, and return the data. We also return the next position as an HTTP header, for the next call. More efficient APIs are obviously possible (e.g. batching, streaming), but this is a good start.
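That read path can be sketched as follows, against a ByteBuffer standing in for the mapped file. (Illustrative layout and names; I’ve used plain CRC32 here rather than the CRC32-C the real files use, just to keep the sketch within the older JDK stdlib.)

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Sketch of the append-file record framing: a fixed per-record header of
// [length:int32][checksum:int32] followed by the data, with the record's file
// offset doubling as its identifier.
public class RecordFraming {
    public record ReadResult(byte[] data, int nextOffset) {}

    public static int append(ByteBuffer file, byte[] data) {
        int recordOffset = file.position(); // the offset is the record's id
        CRC32 crc = new CRC32();
        crc.update(data);
        file.putInt(data.length);
        file.putInt((int) crc.getValue());
        file.put(data);
        return recordOffset;
    }

    // Seek to the offset, verify the checksum, return the data plus the next
    // record's position (which the HTTP API returns as a header).
    public static ReadResult read(ByteBuffer file, int offset) {
        int length = file.getInt(offset);
        int storedCrc = file.getInt(offset + 4);
        byte[] data = new byte[length];
        for (int i = 0; i < length; i++) data[i] = file.get(offset + 8 + i);
        CRC32 crc = new CRC32();
        crc.update(data);
        if ((int) crc.getValue() != storedCrc) {
            throw new IllegalStateException("bad checksum at offset " + offset);
        }
        return new ReadResult(data, offset + 8 + length);
    }

    public static void main(String[] args) {
        ByteBuffer file = ByteBuffer.allocate(4096);
        int first = append(file, "record-one".getBytes(StandardCharsets.UTF_8));
        append(file, "record-two".getBytes(StandardCharsets.UTF_8));
        ReadResult r = read(file, first);
        System.out.println(new String(r.data(), StandardCharsets.UTF_8)); // record-one
        r = read(file, r.nextOffset());
        System.out.println(new String(r.data(), StandardCharsets.UTF_8)); // record-two
    }
}
```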
The roadmap to this being a real implementation: we need snapshot support in Barge, then we can really build read-only log-segments, then we can upload them to object-storage, then we can delete. And in parallel: we need Barge to support dynamic reconfiguration, then we can implement auto-repair / auto-scaling. We can either detect when the cluster is degraded and launch a replacement server, or (probably better) we can rely on an auto-scaling pool of servers and reassign items between them to ensure that all have quorum and we spread the load around.
One last interesting detail: For our checksum, we use CRC32-C. It’s a variant of the “normal” CRC algorithm, which has hardware acceleration on the latest Intel chips with SSE 4.2. If you’re picking a (non-cryptographic) checksum and don’t have to be compatible with legacy software, it’s the natural choice. (And CRCs often pop up in performance profiles, so it definitely can be worth optimizing!)
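For what it’s worth, CRC32-C is now in the JDK itself (java.util.zip.CRC32C, since Java 9), and the runtime can use the hardware instruction where available. A quick check against the well-known CRC-32C test value — the checksum of the ASCII string "123456789":

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32C;

// CRC32-C via the JDK's built-in implementation (Java 9+).
public class ChecksumDemo {
    public static long crc32c(byte[] data) {
        CRC32C crc = new CRC32C();
        crc.update(data);
        return crc.getValue();
    }

    public static void main(String[] args) {
        long check = crc32c("123456789".getBytes(StandardCharsets.US_ASCII));
        System.out.println(Long.toHexString(check)); // e3069283, the CRC-32C check value
    }
}
```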
]]>