Over the last few months I had the chance to play with Hadoop for a bit. I did some prototyping and unit testing to learn the framework and to see if it is something we can use in production at my current company. My team is redesigning an existing application which must be able to store five years of poker-related data. Five years probably does not sound too bad, but given that our system deals about 500 poker hands at peak time, this can be challenging.
I did a presentation on Hadoop a couple of weeks ago, but we decided that we will not use it as our data storage. There are a couple of reasons for that. First of all, the framework is going through a lot of changes: the API changed considerably between versions 0.18.3 and 0.20.2, and developers often have to deal with code examples, documentation, and libraries that have not been updated to the latest Hadoop API. Furthermore, with Hadoop you have to tame the underlying HDFS file system and tweak your Map Reduce jobs to make them run optimally. The Namenode is also the Achilles' heel of Hadoop: if the Namenode has a problem, you need deep knowledge of how to fix and restore it, otherwise you are in trouble.
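To make the Map Reduce part concrete: the programming model behind those jobs can be illustrated without Hadoop at all. Here is a toy sketch in plain Python (the poker hand records are invented for the example); a real Hadoop job would express the same two phases as Mapper and Reducer classes, with HDFS and the shuffle in between.

```python
# Toy illustration of the map/reduce model that Hadoop jobs follow.
# Pure Python, no Hadoop dependency; the hand data below is made up.
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every record and collect (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs

def reduce_phase(pairs, reducer):
    """Group values by key (the 'shuffle') and reduce each group."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

# Example job: count how many hands each player played.
hands = [
    {"hand_id": 1, "players": ["alice", "bob"]},
    {"hand_id": 2, "players": ["alice", "carol"]},
]

def hands_per_player(hand):
    # Emit one (player, 1) pair per player seen in the hand.
    return [(player, 1) for player in hand["players"]]

counts = reduce_phase(map_phase(hands, hands_per_player),
                      lambda key, values: sum(values))
print(counts)  # {'alice': 2, 'bob': 1, 'carol': 1}
```

The tweaking mentioned above happens around exactly these two phases in a real cluster: how records are split across mappers, how the shuffle partitions keys, and how many reducers run.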
We ended up talking about Cassandra, which is also an Apache top-level project. At a very high level it reminds me of a schema-less database, whereas Hadoop is more of a distributed file store. I spent a day familiarizing myself with Cassandra. Here are my first thoughts.
Cassandra does not support indexes. The data must be stored in an intelligent way, so that it can be retrieved with good performance via its primary key. What does that imply? It means you need to know roughly how you will access the data in the future and design your model for those queries. Otherwise you might end up with use cases that you cannot serve efficiently, e.g. finding all players who held pocket Jacks pre-flop. The good thing about Cassandra is that the data is truly distributed over all nodes: you can read from and write to any node at any given time, with no single point of failure like in Hadoop. On the other hand, Cassandra does not process anything in parallel for you. If you want to access your data in a distributed, parallel fashion, you need to write a concurrent application that talks to several nodes. Alternatively, you could use the Hadoop-Cassandra integration, which was added in the latest release and allows you to write Hadoop Map Reduce jobs that use Cassandra as their InputFormat. Fancy.
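Here is a small sketch of what designing for the query means, using plain Python dicts as stand-ins for column families (all names and records below are made up for illustration, not real Cassandra API calls):

```python
# Sketch of query-driven modelling in Cassandra terms, with dicts playing
# the role of column families. Column family names and data are invented.

# Canonical data, keyed by hand id -- fine for "show me hand X", but
# useless for "find everyone who held pocket Jacks pre-flop" without a
# full scan, because there is no index on hole_cards.
hands_by_id = {
    "hand-1": {"player": "alice", "hole_cards": "JJ"},
    "hand-2": {"player": "bob",   "hole_cards": "AK"},
    "hand-3": {"player": "carol", "hole_cards": "JJ"},
}

# Since there are no indexes, we denormalize: write the same data a second
# time into a "column family" whose row key IS the query -- hole cards.
hands_by_hole_cards = {}
for hand_id, hand in hands_by_id.items():
    row = hands_by_hole_cards.setdefault(hand["hole_cards"], {})
    row[hand_id] = hand["player"]

# The awkward query is now a single key lookup instead of a scan.
print(hands_by_hole_cards["JJ"])  # {'hand-1': 'alice', 'hand-3': 'carol'}
```

The price of this design is paid at write time (every hand is stored twice) and in having to anticipate the query up front, which is exactly the trade-off described above.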
Another impression I had is that Cassandra hides many of the low-level details you come in contact with when using Hadoop. I also looked at HBase, but it was rather complicated to get running compared to Cassandra. Another feature I like is that Cassandra uses JSON a lot: it is easy to visualize your data model, or even to back up and restore your entire database, using JSON. I have not written any test code that actually uses Cassandra yet. We will see about that. I am looking forward to the Berlin Buzzwords conference in June, which has a talk about Cassandra and a lot of sessions about Hadoop.
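The backup/restore idea is simple enough to sketch with nothing but the standard library. The snippet below round-trips an invented data model through JSON; it is only meant to show the principle, not Cassandra's actual on-disk tooling:

```python
# Minimal sketch of JSON backup/restore for a schema-less data model.
# The column family and rows below are invented for illustration.
import json

data_model = {
    "PokerHands": {                # a "column family"
        "hand-1": {"player": "alice", "hole_cards": "JJ"},
        "hand-2": {"player": "bob",   "hole_cards": "AK"},
    }
}

# "Back up": serialize the whole model to a JSON string (or a file).
backup = json.dumps(data_model, indent=2)

# "Restore": parse it back and verify nothing was lost in the round trip.
restored = json.loads(backup)
assert restored == data_model
```

Because the model is schema-less nested key/value data, it maps onto JSON with no impedance mismatch, which is what makes this kind of dump both human-readable and trivially restorable.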
If you want to know more about Cassandra, I can recommend these great links: