Justin Santa Barbara’s blog

Databases, Clouds and More

Open Competition

I recently entered Netflix’s latest crowdsourcing competition, the Netflix OSS prize. Similar to their million-dollar prize for a 10% improvement in their recommendation algorithm, this one offered (smaller) prizes for enhancements to their open-source software. The first competition started a whole industry of competitions for crowdsourcing answers to tough problems: Kaggle, 99designs, even GE got in on the fun. I think this competition will likely encourage a similar movement: we’ll see lots of competitions that aim to give a boost to open source projects.

First, a bit of background. I’m a big believer in open source software and clouds. I submitted an entry based around making a more Machiavellian Chaos Monkey (I added 14 new ways instances could fail). I’m honoured to have been selected as one of the finalists. But don’t worry, this isn’t a request for votes (other than starring my repo, I don’t think there’s any way to show your support; and even then it’s a tough field!)

I think the competition is most interesting because of the way it was conducted. To enter, you had to fork the repo on GitHub, adding in your submission and links to your code, also on GitHub. That meant that every entry showed up in the network graph, even during the competition, and you could easily click around to see what everyone else was working on. It’s still up there today, and every entry should remain public forever. I’m certain none of this was accidental. It’s a new way to run a competition, and I would love to hear the story of what the lawyers said when this was proposed, but kudos to Netflix for getting it done!

There were a few very minor things that I think should be polished up should this be repeated. There was obviously an incentive to enter late in the competition. Open source is normally better when everyone is releasing early and often; this encourages greater collaboration. I think the majority of the entries were finalized in the last week or two. Some of them obviously had a lot more than a week’s work go into them - IBM ported the entire Netflix OSS suite to their own preferred stack. It would be great to encourage earlier submissions. It could be as easy as saying that e.g. 20% of judging would be based on when the entry was received. The original Netflix prize did an excellent job of encouraging early submissions (though not with collaboration in mind): there was a hidden dataset which forced people to submit their work so they could get feedback; the first to hit 10% would win it all; and they awarded interim prizes along the way. So I think there are solutions here!

More problematically, collaboration is much trickier when there’s a “winner”. Along with $10k in cash, there’s the chance to attend the AWS conference in November, and there’s only one ticket available per category. Teams could enter, but they are responsible for sharing the prize. This is a much harder problem to solve.

More important than these minor issues were the things that Netflix/GitHub/AWS got absolutely right. Primarily that was being 100% open. It was great that everything had to be open source (Apache licensed) and completely open for everyone to see (on GitHub). It removes any concerns about IP for everyone - if you’re not comfortable making everything public, then don’t enter. (It’s my variant of the New York Times rule: if you wouldn’t want to see it on GitHub, you shouldn’t enter it into a competition.) This extreme transparency gives everyone a sense of fairness and - more importantly - fun.

In terms of what Netflix got out of it, I’m sure they would have liked to have more submissions (I’d guess they had about 40-50 entries). But I think that’s missing half the point: it’s like a slogan competition (“I want to go to Hawaii because…”). The goal is not to get a slogan for the next commercial “on the cheap”, but to get you to think about all the great things about Hawaii.

Netflix OSS gained a whole lot of awareness, probably 100 developers that are now familiar with the codebase to the point of having made code changes, and probably 1000 that have read (bits of) it. Given the T&Cs said that entering was not an offer of employment (a rather odd inclusion, at face value), they probably are hoping to hire a few people as well, and now have some great leads. And running a competition like this certainly gives their “developer cred” a nice little boost. For $100k (I’m guessing AWS are covering the AWS related prizes), I think Netflix did very well.

I think this could be the first of many similar “open competitions”. It may even spawn a few companies (like Kaggle) to help companies run these prizes. Time will tell!

Java, Interrupted

It can’t have escaped anyone’s notice that I have a contrarian opinion of Java: it might not be the coolest language, but it lets me get things done. But I know it’s not perfect; in particular InterruptedException is a serious pain-point.

tldr: I’m experimenting with wrapping InterruptedException in an unchecked exception: my code is cleaner and it hasn’t yet died horribly. I think I like it.

Any time you call a long-running method, like waiting for a lock or sleeping, that method will likely throw InterruptedException.

InterruptedException is a checked exception, which means you either need to throw it from your method or catch it. If you catch it, you should call Thread.currentThread().interrupt(), so that the thread-interrupted state is not lost. (The next call to a long-running method will then throw InterruptedException again; the exception works its way up the stack and cleanly out of the thread). Any time you catch a generic Exception, you need to handle the InterruptedException specifically (you shouldn’t really be catching Exception, but more on that in a minute).

It’s painful to do this catch/interrupt/rethrow dance, so it’s often better just to let your method throw InterruptedException. That still makes for ugly code, because every method ends up with an InterruptedException added to the list of things it throws.
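As a minimal sketch of that approach (the method names here are invented for illustration), note how the exception spreads into the signature of every caller:

```java
// Letting InterruptedException propagate: nothing is swallowed, but the
// exception now appears in every caller's throws clause, all the way up.
public class Propagate {
    static String waitForWork() throws InterruptedException {
        Thread.sleep(10);                    // may throw InterruptedException
        return "work";
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(waitForWork());   // main must declare it too
    }
}
```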

Even if you’re well disciplined and avoid catching Exception, the compounding problem is that some common interfaces are declared to throw no checked exceptions, so you do have to catch your checked exceptions when implementing them. (That in itself seems to be an anti-pattern; the always-excellent Guava seems to prefer to have functional interface methods declare throws Exception.) But a particularly nasty example is Runnable, where you always want a safety-catch, not least because some of the thread pools really don’t cope well if you throw from a Runnable.

So you have this:

public void run() {
  try {
    ...
  } catch (Exception e) {
    log.error("Serious problem", e);
  }
}

But… you shouldn’t do that, because then you lose the Thread.interrupted status.

So you do this:

public void run() {
  try {
    ...
  } catch (Exception e) {
    if (e instanceof InterruptedException) Thread.currentThread().interrupt();
    log.error("Serious problem", e);
  }
}

(You could also check for InterruptedException separately, but then you need to repeat your logging / error handling).

And now we have some seriously smelly code. It’s not a factory-factory-factory yet, but we’re starting down industrialization road.

How did this happen? Why is InterruptedException checked in the first place?

Well, before InterruptedException we had ThreadDeath. When you interrupted a thread using Thread.stop, that would “immediately” throw ThreadDeath into whatever code that thread was running at the time. The problem is that it becomes almost impossible to reason about your code if anything can throw an exception. For example:

class Shared {
  int x = 0;
  int y = 0;

  synchronized void process() {
    x++;
    y--;
  }
}

Now, in theory, x + y will always be 0. But, imagine that we throw ThreadDeath after incrementing x but before decrementing y: the x + y == 0 invariant is broken. This matters a great deal if it is shared state which will live on past the current thread’s death. It’s bad that this is difficult to spot as problematic code, but the killer blow is that it is basically impossible to fix.

So, Thread.stop and ThreadDeath are deprecated, and really there should be no Java code left that calls Thread.stop. But Thread.interrupt and InterruptedException are not deprecated (though not exactly encouraged either). So we’re stuck with dealing with InterruptedException on every call to Thread.sleep or Object.wait, because it’s a checked exception.
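To make the contrast with Thread.stop concrete, here is a small demo (class and method names invented): interruption is cooperative, only taking effect at blocking calls like sleep, where the code has a chance to keep its invariants intact.

```java
// interrupt() is cooperative: the sleeping thread only notices the interrupt
// because Thread.sleep checks for it and throws InterruptedException.
public class InterruptDemo {
    public static String run() throws InterruptedException {
        StringBuilder result = new StringBuilder();
        Thread t = new Thread(() -> {
            try {
                Thread.sleep(10_000);          // would sleep for 10 seconds...
            } catch (InterruptedException e) {
                result.append("interrupted");  // ...but wakes here instead
            }
        });
        t.start();
        t.interrupt();   // takes effect at the blocking sleep() call
        t.join();
        return result.toString();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run());
    }
}
```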

I think Java’s language designers, having been burned by ThreadDeath, swung too far the other way: exception handling in Java is inelegant enough already; with InterruptedException it is just painful. I see the argument that throws InterruptedException is a marker that tells you to watch out for your invariants, and that the call may be slow, but I think you should assume any method call at all might throw or be slow.

ThreadDeath was evil: it was unchecked and could happen anywhere. I know we have to restrict where exceptions can be raised, but I don’t want to deal with InterruptedException being checked.

Here’s one trick I’ve been trying out: wrapping every method that throws InterruptedException with a wrapper that rethrows it as an unchecked exception:

static void saneSleep(long t) {
  try {
    Thread.sleep(t);
  } catch (InterruptedException e) {
    Thread.currentThread().interrupt();
    throw new InterruptedError(e);
  }
}

InterruptedError is a simple exception deriving from Error, which is unchecked and does not need to be declared (or caught). We rely on finally blocks for cleanup, like all good code should. What’s more, we can catch Exception if we have to, and we don’t lose the thread interrupted state. We can even safely throw Exception, and not rely on our caller to remember to check whether it’s actually an InterruptedException and handle it correctly.

Now, why is it an Error and not a RuntimeException? Well, otherwise every catch Exception would probably still have to handle it, and we’d be back almost where we started. Code that actually does something other than rethrow InterruptedException is so rare that I think it should be the exceptional case. Error is supposed to be reserved for serious problems, and I think thread interruption falls into that camp.
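A minimal sketch of what InterruptedError might look like (the class body is my assumption; the only thing the design requires is extending Error), plus a demonstration that a blanket catch (Exception) no longer swallows it:

```java
// An unchecked wrapper for InterruptedException. Because it extends Error
// rather than RuntimeException, catch (Exception e) blocks let it through.
public class InterruptedError extends Error {
    public InterruptedError(InterruptedException cause) {
        super(cause);
    }

    public static String demo() {
        try {
            try {
                throw new InterruptedError(new InterruptedException());
            } catch (Exception e) {
                return "swallowed";   // never reached: Error is not an Exception
            }
        } catch (InterruptedError e) {
            return "propagated";      // the interruption escapes cleanly
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```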

How do you deal with InterruptedException? Discussion at Hacker News

Also, if anyone has code that actually does any real processing of an InterruptedException, I’d love to hear about it.

Jetcd: Java Binding for Etcd

etcd is a great distributed state store that is part of CoreOS (definitely a project-to-watch). It offers similar functionality to Zookeeper, but is written in Go instead of Java and uses a different non-Paxos coordination algorithm, called Raft.

I mention Go not because it is this year’s cool new language (sorry, Ruby, Erlang, Node!), but because it means that etcd has a smaller memory overhead than Zookeeper. So, if you need to store a small number of items in a distributed state store, say for leader election or cluster discovery, then it’s much lighter-weight than Zookeeper. It’s light-weight enough that you could think about running etcd on every instance in a cluster, where it’s hard to justify doing that with Zookeeper unless you’re storing more data.

But, despite the fact that Go is great, some people - myself included - will still want to use Java for more complicated tasks. So, I’m open-sourcing my Java binding to etcd under the Apache license: jetcd. It’s still very early code, so test thoroughly before rolling out into production!

The trickiest thing is that etcd uses HTTP long-polling for change notification; that is then exposed as a Java Future for use in async code (or just call get() on it if you want to be synchronous). It’s actually exposed as a ListenableFuture from Google’s Guava, because that’s usually much more useful.
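The shape of that API, sketched with the JDK’s plain Future as a stand-in (jetcd’s actual class and method names aren’t shown in this post, so none of the names below are from the real library):

```java
import java.util.concurrent.*;

// A long-poll watch modelled as a Future: the task blocks "server-side" until
// the key changes, and the caller chooses sync (get()) or async handling.
public class WatchSketch {
    public static String watchOnce() throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        // Stand-in for etcd's HTTP long-poll: completes when the "key" changes.
        Future<String> watch = pool.submit(() -> {
            Thread.sleep(50);          // simulated long-poll latency
            return "new-value";
        });
        String value = watch.get();    // synchronous style: just block
        pool.shutdown();
        return value;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(watchOnce());
    }
}
```

A ListenableFuture additionally lets you register a callback instead of blocking, which is why jetcd returns Guava’s type rather than the plain JDK one.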

We don’t really want to tie up a Java thread for each watch, so that means using one of the async HTTP libraries. I chose the Apache HttpAsyncClient, even though it’s still in beta. If you hate that and love library X, then please submit a pull-request. In general, pull requests are very welcome!

Check out jetcd on GitHub

Creating an OpenStack Image

Here’s how I create an OpenStack disk image. Everyone has their own tweaks here; the most unusual thing I do is that I don’t use cloud-init, because I don’t want to rely on DHCP; I use OpenStack’s ‘configuration drive’ instead. I think that cloud-init has support for config-drive in its very latest versions, so I may revisit this in future.

A few comments:

  • I think that Grub is crazily complicated, requiring too many UUIDs to match up with each other; I use extlinux instead.
  • Some techniques inspired by vmdebootstrap
  • vmbuilder can only create Ubuntu images, but we really want Debian server images to be an option (and RedHat etc)
  • Installing a (local) apt-cacher makes our life much more pleasant
  • I’ll probably bundle this into a script sometime soon!

The Host Machine

This documentation is for a Debian machine. Either use a real Debian machine, or use a Debian guest image, or install Debian in a guest!

In particular, if you’re using Ubuntu Oneiric, it won’t work until the fix is in for this extlinux bug.

Just use Debian for this!

Pre-reqs

mkdir -p images
cd images

sudo apt-get update
sudo apt-get install --yes debootstrap curl qemu-kvm mbr extlinux parted kpartx
# An apt cache makes subsequent image-building much faster
sudo apt-get install --yes apt-cacher-ng
sudo apt-get upgrade

Image Creation

# Sadly, most of this needs to be done as root..
sudo bash

# Choose your image OS
OS=squeeze
KERNEL=linux-image-2.6-amd64


# Create a (sparse) 8 gig image
dd if=/dev/null bs=1M seek=8192 of=disk.raw

# Create partitions
parted -s disk.raw mklabel msdos
parted -s disk.raw mkpart primary 0% 100%
parted -s disk.raw set 1 boot on

# Install Master Boot Record
install-mbr disk.raw

# Mount the partitions
modprobe dm_mod
kpartx -av disk.raw
# Hopefully it’s loop0p1..
LOOPBACK_PARTITION=/dev/mapper/loop0p1

# Format filesystem
yes | mkfs.ext4 ${LOOPBACK_PARTITION}

# Don’t force a check based on dates
tune2fs -i 0 /dev/mapper/loop0p1

# Mount the disk image
mkdir mnt
mount ${LOOPBACK_PARTITION} mnt/

# Create root partition using debootstrap
http_proxy="http://127.0.0.1:3142" debootstrap \
      --include=openssh-server,${KERNEL} ${OS} mnt/

# Prepare for chroot
mount -t proc proc mnt/proc/

# Use chroot to fix up a few things (locales, mostly)
chroot mnt/ apt-get update
chroot mnt/ apt-get install locales
chroot mnt/ locale-gen en_US.utf8

chroot mnt/ /bin/bash -c "DEBIAN_FRONTEND=noninteractive dpkg-reconfigure locales"

# Finishing the image
chroot mnt/ apt-get upgrade
chroot mnt/ apt-get clean

# Remove persistent device names so that eth0 comes up as eth0
rm mnt/etc/udev/rules.d/70-persistent-net.rules

Set up /etc/fstab

PARTITION_UUID=`blkid -o value -s UUID ${LOOPBACK_PARTITION}`

echo -e "# /etc/fstab: static file system information." > mnt/etc/fstab
echo -e "proc\\t/proc\\tproc\\tnodev,noexec,nosuid\\t0\\t0" >> mnt/etc/fstab
echo -e "/dev/sda1\\t/\\text4\\terrors=remount-ro\\t0\\t1" >> mnt/etc/fstab
echo -e "" >> mnt/etc/fstab
# TODO: Swap

# Check it looks good
more mnt/etc/fstab

Use config drive

I highly recommend using config drive. There’s a little init script I contributed which can apply the network configuration from the config drive, located in contrib/openstack-config in the source tree of nova.

# Install the init script for config drive
curl https://raw.github.com/openstack/nova/master/contrib/openstack-config > mnt/etc/init.d/openstack-config
chown root:root mnt/etc/init.d/openstack-config
chmod 755 mnt/etc/init.d/openstack-config
chroot mnt/ /usr/sbin/update-rc.d openstack-config defaults

Set up extlinux

VMLINUZ=`chroot mnt find boot/ -name "vmlinuz-*"`
INITRD=`chroot mnt find boot/ -name "initrd*"`
echo -e "default linux\ntimeout 1\n\nlabel linux\nkernel ${VMLINUZ}\nappend initrd=${INITRD} root=UUID=${PARTITION_UUID} ro quiet" > mnt/extlinux.conf
cat mnt/extlinux.conf

extlinux --install mnt/

Unmount & convert to a qcow image

# Sync and unmount (with paranoid levels of sync-ing)
sync; sleep 2; sync
umount mnt/proc
umount mnt/
sync
kpartx -d disk.raw
sync

# Convert to qcow2 (which means that the raw image size isn’t too important)
rm disk.qcow2
qemu-img convert -f raw -O qcow2 disk.raw disk.qcow2

Upload the image into glance

# We don't need to be root any more...
exit

RAW_SIZE=`cat disk.qcow2 | wc -c`
echo "RAW_SIZE=${RAW_SIZE}"

# Change these parameters to the correct values
export OS_AUTH_USER=`whoami`
export OS_AUTH_KEY="secret"
export OS_TENANT_NAME=privatecloud
export OS_AUTH_URL=http://192.168.191.1:5000/v2.0/

glance add name=DebianSqueeze is_public=True disk_format=qcow2 container_format=bare image_size="${RAW_SIZE}" < disk.qcow2

Search-as-a-Service With Solr and PlatformLayer

AWS announced search-as-a-service on Thursday; so I thought it would be interesting to get Solr (a great open-source search server) running as a service with PlatformLayer.

This is what Everything-as-a-Service means! You shouldn’t spend months building each service one at a time; you just re-use common code and define what’s unique.

I hit a few stumbling blocks (which is why it took two evenings); some of them were Solr installation issues, and some of them were problems in PlatformLayer itself. The great thing is that by fixing those problems, every service gets the benefit. For example, there’s now much saner retry logic when a task goes wrong.

I’m putting instructions on the wiki now instead of this blog; so here’s how to try it out: SolrAAS

How does this compare to AWS CloudSearch?

Obviously, this isn’t the same as AWS CloudSearch:

  1. The AWS product is much more polished
  2. I think CloudSearch offers auto-scaling, but that isn’t yet implemented in PlatformLayer
  3. In general, the AWS offering is much more battle-tested. I’m sure they do something smart if a machine fails; PlatformLayer doesn’t (yet)

But, here’s why I think PlatformLayer wins:

  1. You’re not locked in to Amazon - run it on any cloud, on your own hardware, whatever
  2. If you don’t like the way PlatformLayer does something, it’s open source, so you can change it / get it changed
  3. As PlatformLayer improves, every service gets better
  4. It’s the Solr API, so all the language & library support already exists and is mature
  5. Solr is a well-known Apache project, so its strengths and weaknesses are well known
  6. Because you’re running your own search instance, you can tweak the indexing however you want (custom scorers, stemmers etc)
  7. It takes a few hours to create a complex service (including learning about Solr!), rather than 6-24 months

7 bullets vs 3 bullets. PlatformLayer is more than twice as good!

So what service is next?

Memcached With PlatformLayer

Let’s use PlatformLayer to run memcached as a service! I’m starting with memcache because it’s the simplest service I can imagine (no persistence, no inter-machine coordination, etc).

So I added memcached support to PlatformLayer the other day. The patch is pretty simple; I think it took about 20 minutes. Writing PlatformLayer’s first integration test took a lot longer!

I’ve moved the instructions that were here to MemcacheAAS on the wiki.

Afterthoughts

That’s it: memcache as a service. Obviously there’s a lot more to a full memcache-as-a-service system, but here’s the magic thing: it took 20 minutes to code, and as we extend PlatformLayer, the hope is that you will get the missing functionality without any memcached-specific code.

What service should we do next?

PlatformLayer: First Steps

Now that HP has opened up their OpenStack beta, it’s easy to try out PlatformLayer.

Suggestion: Before you start setting up PlatformLayer, sign up first. I don’t know how long the approval delay can be.

I’ll walk through downloading & getting it running; in the next post I’ll describe using PlatformLayer to run memcached.

I’ve moved the installation instructions to installation on the wiki.

Next time: running memcache using PlatformLayer

That Crazy Instagram Valuation

Everyone was surprised to see Instagram sell to Facebook for $1 billion. It was a crazy valuation: it should have been $3 billion.

Instagram has ~30 million users, making a per-user valuation of ~ $30.

Facebook has about ~1 billion users, and wants to IPO in a few months.

So Facebook just set their valuation at $30 billion.

Ooops: Facebook wants to IPO closer to $100 billion.

Groupon was the master of this strategy, single-handedly creating the daily-deal space by buying up companies at an ever-increasing price-per-user, thereby driving up the value of their own users, which let them raise more money, buy more companies, and drive the value of their users higher still. As much as I hate that game, you have to admire the execution. And it’s a good thing that Facebook isn’t playing that game!

There are lots of ways for Facebook to justify buying Instagram, not least that if Google had bought it, that would have been an instant boost for Google Plus. Think it was a coincidence that Instagram launched on Google’s Android just before their acquisition?

But we should be asking how Facebook thinks that a Facebook user is worth $100, when an Instagram user is worth $30. I think Facebook have a pretty good answer, but if we’re asking why the valuation was so high, we’re asking the wrong question.

Kudos to Facebook for picking up a $3 billion company for $1 billion.

Introducing PlatformLayer

PlatformLayer: Everything-as-a-service

PlatformLayer is a project we’re open-sourcing from FathomDB. FathomDB is now working on a next-generation database; the original FathomDB was all about automating management of a MySQL database. FathomDB was the first company to run a stock piece of software (MySQL) in the service model on the cloud. Much of that code was not specific to MySQL, and I think this is still a missing piece of the infrastructure puzzle. So we’ve taken the code and know-how from doing that, and we’re open-sourcing it as PlatformLayer.

PlatformLayer lets you run MySQL - or anything - as a service, whether you want to offer that service publicly, or just consume it yourself privately.

OK, enough marketing speak! Use it to:

  • Run anything as a service (MySQL-as-a-service, Nginx-aaS, Jenkins-aaS, YourWebApp-aaS…) by automatically configuring virtual machines
  • Provide an API and CLI tools, so anyone can start/shutdown/reconfigure services
  • Run the machines on a public or private cloud
  • Use PlatformLayer to start your own XaaS company

It’s similar in some ways to Chef or Puppet or Juju. You describe how the software gets installed. PlatformLayer launches the virtual machines you need, and installs the software.

But it goes much further than Chef or Puppet or Juju. It’s designed from the start to be multi-tenant, and to give you a sensible REST interface to your service, which then gives you a command-line interface automatically as well. It’s architected so that you can offer that service publicly if you want to (and you can even compete with the original FathomDB!) It’s designed for ongoing management, and not just initial configuration.

You can now build your:

  • Memcached-as-a-service (like AWS ElastiCache)
  • MySQL-as-a-service (like AWS RDS)
  • Nginx-as-a-service (like AWS Elastic Load Balancing)
  • DNS-as-a-service (like AWS Route 53)
  • etc

If the service you want isn’t implemented yet, you can probably code up a basic version in about 15 minutes. This is the motivation behind this project: in a real platform-as-a-service, everyone will want their own combination of services, so it has to be easy to create a new service, and they should all have a consistent interface.

It’s also important that all those services work reliably, so there’s common functionality to implement the tricky things. Today, this is fairly primitive, but examples of these common services are:

  • Configuration repair (mostly implemented)
  • Backup & restore (partially implemented)
  • Automatic failover and scaling (not implemented)

OK, so there’s a long way to go there, but we’re drawing from the FathomDB operations system (copying the things that worked well, reworking the ones that didn’t). I’m pretty confident in the design though, having learned a lot of lessons from doing this at the first X-as-a-service company.

There are definitely some limitations at the moment:

  • It requires an OpenStack cloud, and there aren’t that many publicly available yet (though you can, of course, install your own)
  • Documentation on how to get started is lacking (I’m trying to release early, release often)
  • There’s a bunch of different services, many of which are little more than experiments
  • There’s a lot of extra functionality in a production X-as-a-service system (but the plan is to bring that code over from FathomDB)

Please try it out, or just drop me a note if you’re interested!

Coming soon: setting up and running your first service…

The Apache Elephant Graveyard

Citrix donated CloudStack to the Apache Foundation, and the journalists that were pre-briefed dutifully reported the end of OpenStack and victory for AWS, seemingly because CloudStack has a proxy that can make it look like EC2.

The usual AWS allies chimed in to support this view, including @adrianco, no doubt trying to make up for accidentally spilling the beans on Netflix’s special AWS discount (the carefully worded non-denial denial wasn’t fooling anyone.)

CloudStack is joining a long list of game-changing commercial products donated to the Apache Foundation.

The Apache Foundation does great work when they start a project or get it very early (like Tomcat). When a company throws a bunch of source code at them, often as much for tax and accounting reasons as for anything else, the outcome is less impressive.

Saying that this is #gameover for anything not using the EC2 API is therefore a little suspect; it’s a strange definition of #winning.

The EC2 API is pretty terrible - it’s much worse than the Win32 API - and we shouldn’t be wasting our energy trying to support it. The Wine project spent a lot of time & effort trying to reimplement Win32, but it was a moving target, and in the meantime the world moved on - including Microsoft themselves. Let’s not repeat that mistake.

The OpenStack API isn’t perfect, but where I find a big enough fault, it’s very easy for me - or anyone else - to fix it. And if you really want better EC2 compatibility, you can improve it (or fund someone to improve it) and it’ll probably get incorporated into the next release: grab a mop.

Now, if AWS donated all their APIs to the Apache Foundation, that would be interesting…