I’m building a set of open-source cloud data-stores in December, blogging all the way.
In our last episode, we built a simple file server and exposed it over WebDAV. As we did with the git server, we store the actual data in cloud object storage, and the consistent metadata in our key-value store.
Today I really just extended the WebDAV support, adding more WebDAV/HTTP verbs: MOVE (rename), DELETE, LOCK/UNLOCK and MKCOL (mkdir). All fairly straightforward stuff; I’m trying only to talk about the interesting details, so just a few observations…
Web(DAV) for the win
I think WebDAV was the right choice of access protocol. We talked before about how SPDY & HTTP 2.0 should mean that HTTP will be “good enough” to replace most custom binary protocols, and so we can then take advantage of HTTP’s benefits. A filesystem is a use case that will definitely push the boundaries: there’s a lot of traffic and it’s performance-sensitive.
WebDAV today already gives us many of HTTP’s benefits (WAN access, encryption support, etc). File transfer and metadata querying with WebDAV over SPDY will be comparably efficient to a custom filesystem protocol, as this is HTTP’s bread and butter. Missing are change notification (inotify) and efficient filesystem writes, which again makes sense given HTTP’s read-mostly philosophy.
We can imagine adding change notification using WebSockets. The inefficiency of writes comes because WebDAV normally requires a LOCK / UNLOCK pair around each write; we could imagine combining operations to make this more efficient (e.g. PUT-and-UNLOCK). Because of HTTP’s extensibility, this could be as simple as recognizing an X-Unlock header on the PUT, and responding with X-Unlocked if we released the lock. To be fair, this may actually be more efficient than many filesystems (where lock and unlock map to file open and close). I think this shows that “just use HTTP” may be the right advice: we’re talking about features, not about overcoming the limitations of HTTP.
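To make the idea concrete, here’s a minimal sketch of the PUT-and-UNLOCK combination, with the Netty plumbing stripped out and HTTP headers modelled as a plain map; the X-Unlock / X-Unlocked header names are just the hypothetical extension described above:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the hypothetical PUT-and-UNLOCK extension: if the client sends
// an X-Unlock header carrying its lock token, we perform the write and
// release the lock in the same round trip, answering with X-Unlocked.
public class PutAndUnlock {
    private final Map<String, String> locks = new HashMap<>(); // path -> lock token

    public void lock(String path, String token) {
        locks.put(path, token);
    }

    // Returns the response headers for a PUT on `path`.
    public Map<String, String> handlePut(String path, Map<String, String> reqHeaders) {
        Map<String, String> respHeaders = new HashMap<>();
        // ... the write to object storage itself would happen here ...
        String token = reqHeaders.get("X-Unlock");
        if (token != null && token.equals(locks.get(path))) {
            locks.remove(path);                   // release the lock in the same request
            respHeaders.put("X-Unlocked", "true");
        }
        return respHeaders;
    }
}
```

A client that previously needed LOCK, PUT, UNLOCK (three round trips) would now need only LOCK and PUT.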
Lock and unlock support is only stub-implemented for now with an in-memory implementation. This is not “cloud-first” - it is a single point of failure. It would be easy to store the locks in the key-value store. I will probably play around tomorrow, implementing this first in the key-value store, and then directly at the Raft level to see if something more efficient is possible.
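A sketch of what moving the locks into the key-value store might look like; the store interface here is a stand-in (a ConcurrentMap), assuming the real consistent store can offer a conditional put and a conditional delete:

```java
import java.util.concurrent.ConcurrentMap;

// Sketch of WebDAV locks held in the key-value store instead of process
// memory, so any server node can see them and there's no single point of
// failure. The ConcurrentMap stands in for the Raft-backed store.
public class KvLockManager {
    private final ConcurrentMap<String, String> store; // "lock/" + path -> token

    public KvLockManager(ConcurrentMap<String, String> store) {
        this.store = store;
    }

    // Acquire the lock only if nobody else holds it: an atomic
    // put-if-absent, which a consistent key-value store can provide.
    public boolean tryLock(String path, String token) {
        return store.putIfAbsent("lock/" + path, token) == null;
    }

    // Release only if we still hold it, so a stale client can't
    // unlock somebody else's lock.
    public boolean unlock(String path, String token) {
        return store.remove("lock/" + path, token);
    }
}
```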
Delete support is also interesting, because we don’t actually delete anything just yet. Because inodes can be shared, deleting data in an inode-based filesystem requires keeping a reference count on each inode. Because we’re also sharing data blobs, we have to implement either reference counting or garbage collection for blobs. Whatever approach we take for the data, I think I should probably take the same approach for inodes. I’m leaning towards garbage collection, not least because I think it will be more flexible - I’m pondering features like snapshots.
So, I’ve implemented delete just by removing the inode from the name-to-inode mapping; we don’t actually delete the inode. To make garbage collection or undelete a bit easier, I actually store a record of the removed data in a different area of the key-value store. We’ll see if this is useful, but it does mean that delete is incredibly fast (because we defer all the real work to later garbage collection).
Finally, I think one of the reasons this went so smoothly is that Netty imposed a nice architecture on us. There’s definitely a learning curve, but it is a great library. For performance, Netty reuses memory buffers, which requires manual reference counting instead of just relying on normal Java garbage collection; this can be tricky - I had to fix one buffer management bug in my code, and I’m sure others are still lurking.
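The pattern is the same as Netty’s ByteBuf retain()/release() contract; this toy version (not Netty’s actual code) shows the failure modes: forget a release() and the buffer never returns to the pool, release() or retain() after the count hits zero and you get an error:

```java
// Toy version of the manual reference counting Netty's pooled buffers use
// (the real API is ByteBuf.retain()/release()). Each handler that keeps the
// buffer calls retain(); each call to release() drops one reference, and the
// buffer can only go back to the pool when the count reaches zero.
public class RefCountedBuffer {
    private int refCount = 1; // a freshly allocated buffer starts with one reference

    public RefCountedBuffer retain() {
        if (refCount == 0) {
            throw new IllegalStateException("buffer already released");
        }
        refCount++;
        return this;
    }

    // Returns true when the last reference is dropped, i.e. the memory
    // can now be returned to the pool.
    public boolean release() {
        if (refCount == 0) {
            throw new IllegalStateException("buffer already released");
        }
        return --refCount == 0;
    }

    public int refCount() {
        return refCount;
    }
}
```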
I think the holiday season is likely to interrupt my daily routine, but I’ll do my best. Hopefully tomorrow I’ll find time to look at locking and even a little bit more!