I’m building a set of open-source cloud data-stores in December, blogging all the way.
In our last episode, we added very basic SQL querying to our structured data store.
Although our SQL querying is little more than a proof of concept at this stage, today I decided to do something different - trying to use our servers to build another server again, to figure out what’s missing. The goal of this project is not to build one data store, but instead to build a whole suite of data stores: key-value, document store, append-log, git-store, file server etc. Today it was the file server’s turn.
We want to support “real” filesystem semantics, not just be an object server. The idea is that you should be able to run a normal, unmodified program with it. Cloud object storage took a different approach: they don’t offer the full guarantees that a traditional filesystem offers, so they deliberately don’t expose themselves in the normal way. That’s a good “fail-safe” design principle.
However, as anyone who has used an object store will attest, what it offers isn’t that different to what a traditional filesystem offers. The main things that are different are strong consistency (vs. eventual consistency) and locking support. They also have different file permissions and metadata, but that’s really a design choice, not a true limitation.
Just as we did with Git, we can take our consistent key-value store, and use it to add the missing functionality to a cloud object store. We’ll store the actual file data in object storage, but all the filesystem metadata will go to our key-value store. We could put it into our structured store, but - for now at least - we don’t need it. Providing rich filesystem metadata indexing - trying to unify the filesystem with structured data storage - has been a dream for a long time, but there are too many failed projects along the way for us to try it: WinFS, the Be File System. If you’ve been following along, you’ll see where this idea comes from: we have a key-value store; we’re going to put metadata into it; we know key-value stores aren’t that different from structured stores; if we used our structured store instead we could support metadata indexing. It does sound simple, but let’s get a basic filesystem running first!
I know Inodes
UNIX filesystems store files in a slightly unobvious way. Every file has a data structure, the inode, which contains its metadata (permissions, owner, size, pointers to the actual data etc). The inode does not store the file’s name; instead we refer to each inode by a number. Each directory stores the information needed to map from file names to inode numbers. Each directory has an inode of its own, but its data is actually a list of its children: names mapping to their inode numbers. To get from a filesystem path to a file, we step through the filesystem name-by-name, reading each directory to find the child inode, and then reading that child (which may in fact be a directory).
This may be an unobvious way to do things, but is actually a great design. Because we reference files by inode number, not name, it means we can do things like rename, delete or move a file while it is open. We can have hard-links, where multiple names refer to the same file. Every UNIX filesystem (I think) is built around this design; Windows has its roots in the FAT filesystem, which didn’t do this, and so hard-links and in-use files are to this day much weaker on Windows.
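To make the lookup concrete, here is a minimal in-memory sketch in Java. It is not the code from the project: the names InodeTable, Inode, create, resolve and rename are all made up for illustration, and a real inode carries far more metadata than a single flag. It only shows the two ideas above: walking a path name-by-name, and renaming by touching nothing but the directory entry.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory model of inode-based lookup (illustrative names only).
class InodeTable {
    static final long ROOT = 1L;

    static final class Inode {
        final boolean directory;
        // Only meaningful for directories: child name -> child inode number.
        final Map<String, Long> children = new HashMap<>();

        Inode(boolean directory) {
            this.directory = directory;
        }
    }

    private final Map<Long, Inode> inodes = new HashMap<>();
    private long nextNumber = ROOT;

    InodeTable() {
        inodes.put(nextNumber++, new Inode(true)); // the root directory
    }

    // Create a child under the given directory and return its inode number.
    long create(long parent, String name, boolean directory) {
        Inode dir = inodes.get(parent);
        if (dir == null || !dir.directory) {
            throw new IllegalArgumentException("not a directory: " + parent);
        }
        long number = nextNumber++;
        inodes.put(number, new Inode(directory));
        dir.children.put(name, number);
        return number;
    }

    // Walk the path one name at a time; returns -1 if any component is missing.
    long resolve(String path) {
        long current = ROOT;
        for (String name : path.split("/")) {
            if (name.isEmpty()) {
                continue; // skip the leading slash and any doubled slashes
            }
            Inode dir = inodes.get(current);
            if (dir == null || !dir.directory) {
                return -1;
            }
            Long child = dir.children.get(name);
            if (child == null) {
                return -1;
            }
            current = child;
        }
        return current;
    }

    // Rename within one directory: the inode itself is never touched.
    void rename(long parent, String oldName, String newName) {
        Map<String, Long> entries = inodes.get(parent).children;
        Long number = entries.remove(oldName);
        if (number != null) {
            entries.put(newName, number);
        }
    }
}
```

Anything holding the inode number (an open file handle, say) is unaffected by the rename, because only the name-to-number mapping in the parent directory changed.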
The big downside is that listing all the files in a directory can be fairly slow, because we must fetch the inodes for every file in the directory if we want the metadata. This is why the default function for listing the files in a directory (readdir) doesn’t return the data in the inode.
If we’re going to build a filesystem, it might be possible to build something to a different model, but it will be tricky to expose it well to a modern operating system because you’ll have to translate between the two metaphors. In short…
Mapping to the cloud
I stored the inodes in the obvious way: each key is an inode number, and maps to a value containing the metadata. I actually used Protocol Buffers to encode the value, as it’s easy, extensible and has reasonably high performance. We will never get the raw performance of a fixed C data structure using it, but we’re not going to win any benchmarks in Java anyway. (Actually, that might not be true: the way to win benchmarks in a higher-level language is by making use of better algorithms or approaches. But not today!)
I stored the directory data by mapping each directory entry to a key-value entry. The value contains the inode of the child. We want the key to support two operations: list all children in a directory, and find a particular child of a directory by name. For the former, we require the directory inode to be a prefix of the key (so our query becomes a prefix/range query). For the latter, we want to include the name in the key. Therefore, our key structure is the directory inode number followed by the name of the file. Simple, and works!
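Here is a rough sketch of that key layout in Java. The exact byte encoding is my assumption for illustration (an 8-byte big-endian inode number followed by the UTF-8 name), and a sorted in-memory map stands in for the key-value store; DirectoryIndex, link, lookup and list are hypothetical names, not the project’s API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical sketch: directory entries as keys in an ordered key-value store.
class DirectoryIndex {
    // Keys are ordered as unsigned bytes, like a typical ordered key-value store.
    private static final Comparator<byte[]> BYTE_ORDER = Arrays::compareUnsigned;

    // Stand-in for the real store: entry key -> child inode number.
    private final NavigableMap<byte[], Long> store = new TreeMap<>(BYTE_ORDER);

    // Assumed layout: 8-byte big-endian directory inode number, then the UTF-8 name.
    static byte[] key(long dirInode, String name) {
        byte[] nameBytes = name.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(Long.BYTES + nameBytes.length)
                .putLong(dirInode)
                .put(nameBytes)
                .array();
    }

    void link(long dirInode, String name, long childInode) {
        store.put(key(dirInode, name), childInode);
    }

    // Find one child by name: a single point lookup. Returns null if absent.
    Long lookup(long dirInode, String name) {
        return store.get(key(dirInode, name));
    }

    // List all children: a range scan over every key starting with the inode prefix.
    // Assumes non-negative inode numbers, so dirInode + 1 is a valid upper bound.
    List<String> list(long dirInode) {
        byte[] from = key(dirInode, "");
        byte[] to = key(dirInode + 1, "");
        List<String> names = new ArrayList<>();
        for (byte[] k : store.subMap(from, true, to, false).keySet()) {
            names.add(new String(k, Long.BYTES, k.length - Long.BYTES, StandardCharsets.UTF_8));
        }
        return names;
    }
}
```

The fixed-width, big-endian inode number is what makes the prefix trick work: all entries for one directory sort next to each other, so listing is one contiguous scan.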
For storing data, we do the same thing we did when we implemented the git server. We store the file content itself on cloud object storage - it is, after all, designed for storing lots of large objects. Small files may be more problematic, because of the overhead: this problem occurs in “real” filesystems as well; they normally end up storing small files in the file inode itself. We could store the file content using the inode identifier; instead we hash the file content and store it using its (SHA-256) hash for the name. Again, this is just like Git. It has the advantage that we get de-dup for free; it has the disadvantage that cleaning up files on delete is harder, because file content may be shared. Git gets round this by never deleting content in normal operation (which makes sense for version control); for now, we’ll also ignore the “garbage collection” problem.
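As a sketch of the content-addressed approach (again with made-up names, and a plain map standing in for the cloud object store), storing a file under its SHA-256 hash might look like this:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of content-addressed storage; not the project's actual code.
class ContentStore {
    // Stand-in for the cloud object store: object name -> bytes.
    private final Map<String, byte[]> objects = new HashMap<>();

    // Lowercase hex SHA-256 of the content, used as the object name.
    static String sha256Hex(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(content);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    // Store content under its hash. Identical content is only stored once.
    String put(byte[] content) {
        String name = sha256Hex(content);
        objects.putIfAbsent(name, content.clone());
        return name;
    }

    // Fetch content by hash; returns null if nothing is stored under that name.
    byte[] get(String name) {
        byte[] content = objects.get(name);
        return content == null ? null : content.clone();
    }

    int objectCount() {
        return objects.size();
    }
}
```

The inode would then record the returned hash as its pointer to the data. Two files with the same bytes end up pointing at the same object, which is both the de-dup benefit and the reason deletion needs garbage collection.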
A downside is that the files in object storage aren’t named meaningfully. It would be great if the file called “directory1/file1” was stored under that name in object storage. That just isn’t possible in our design. This may actually be a good thing, in that we really don’t want to encourage people to “go behind our back” and work through the object storage interface.
The other big downside is that we don’t have good support for partial file writes (yet). You want to use this as a simple filesystem, not to store your database files.
The hardest thing was actually figuring out how to expose this to the operating system as a real filesystem. FUSE is excellent, but would mean everyone would need to install a ‘driver’. The Windows shared filesystem protocol (CIFS) has universal support, but has a reputation for being slow and complicated. I thought about NFS, but I thought it would be tricky to get permissions and WAN support right. WebDAV seems to be a winner: it can be mounted natively on every major OS (it was the basis for iDisk on the Mac, and for ‘Web folders’ on Windows). Because it’s based on HTTP it can also easily be used at the application level, as well as by mounting it in the kernel. Best of all, it works on a “whole file” model, and doesn’t work well with partial writes, so it maps well to our capabilities. Annoyingly, every OS seems to have weird edge cases/bugs, but it seems like a great place to start. We might add NFS later!
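To give a feel for the “application level” use, this is roughly what listing a directory over WebDAV looks like with the JDK’s built-in HTTP client (Java 11+). The class and method names are hypothetical, and the URL would be wherever the server happens to be running; the sketch only builds the request and does not send it.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper: WebDAV directory listing is just an HTTP request.
class WebDavListing {
    // Build (but do not send) a PROPFIND request for a collection.
    static HttpRequest propfind(String collectionUrl) {
        return HttpRequest.newBuilder(URI.create(collectionUrl))
                // An empty PROPFIND body asks the server for all properties.
                .method("PROPFIND", HttpRequest.BodyPublishers.noBody())
                // Depth 1 = the collection itself plus its direct children.
                .header("Depth", "1")
                .build();
    }
}
```

Sending it with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())` should come back as a 207 Multi-Status response whose XML body describes each child, assuming the server implements PROPFIND.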
I looked at the libraries that are out there; in particular, Milton seems great. It seemed a bit orientated towards exposing your own data as a filesystem, rather than being a raw WebDAV implementation. So, based on the Milton code, I coded my own. You can see the (mega) commit where we now support a filesystem using WebDAV. It only supports the basics of the WebDAV protocol (e.g. you can’t delete files), but it does enough that we can mount it in Cyberduck and in the macOS Finder. That’s a great start… tomorrow I’ll work on filling out the support a little bit to make it useful.
So, we have a cloud filesystem - what does that mean? This is not a Dropbox replacement: this only works online. It does provide a directory that is both shared and reliable. So for WordPress, you might point your image upload folder here. You could store your Jenkins build artifacts here. You could use it for a simple filesystem-based work-queue, although we can probably build a better solution here also!