Evaluating Cassandra

Over the last few months I had the chance to play with Hadoop for a bit. I did some prototyping and unit testing to learn the framework and to see if it is something we can use in production at my current company. My team is re-designing an existing application which must be able to store five years of poker-related data. Five years probably does not sound too bad, but given that our system deals about 500 poker hands at peak times, this can be challenging.

I gave a presentation on Hadoop a couple of weeks ago, but we decided that we will not use it as our data storage. There are a couple of reasons for that. First of all, the framework is going through a lot of changes: the API changed considerably between versions 0.18.3 and 0.20.2, so developers often have to deal with code examples, documentation and libraries that have not been updated to the latest Hadoop API. Furthermore, with Hadoop you have to tame the underlying HDFS file system, and you have to tweak your MapReduce jobs to make them run optimally. The Namenode is also Hadoop's Achilles' heel: if the Namenode has a problem, you need deep knowledge of how to fix and restore it, otherwise you are in trouble.

We ended up talking about Cassandra, which is also an Apache top-level project. At a very high level it reminds me of a schema-less database, whereas Hadoop is more of a distributed file storage. I spent a day familiarizing myself with Cassandra. Here are my first thoughts.

Cassandra does not know any indexes. The data must be stored in an intelligent way, so that it can be retrieved with good performance using the primary key. What does that imply? It means you need to know roughly how you will access the data in the future and design for it. Otherwise you might end up with use cases that you cannot serve efficiently, e.g. finding all players who held Pocket Jacks pre-flop. The good thing about Cassandra is that the data is truly distributed over all nodes: you can read and write on each node at any given time, with no single point of failure like in Hadoop. On the other hand, Cassandra does not process anything in parallel. If you want to access your data in a distributed, parallel fashion, you need to write a concurrent application that talks to several nodes. Alternatively you could use the Hadoop-Cassandra integration, which was added in the latest release. It allows you to write Hadoop MapReduce jobs that use Cassandra as their InputFormat. Fancy.
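Without indexes, "design for your queries" boils down to denormalizing: you write the same hand under every key you will later want to look it up by. Here is a minimal sketch of that idea in plain Python, with dicts standing in for column families (the names hands and hands_by_hole_cards are my own invention, not a Cassandra API):

```python
# Plain-Python sketch of query-driven denormalization. The dicts stand
# in for Cassandra column families: without secondary indexes, a query
# like "all players who held Pocket Jacks pre-flop" must be served by a
# second column family that we populate at write time.

hands = {}                 # row key: hand id -> hand data
hands_by_hole_cards = {}   # row key: hole cards -> {hand id: player}

def store_hand(hand_id, player, hole_cards, result):
    """Write the hand once per access path, the way you would write
    to two column families in Cassandra."""
    hands[hand_id] = {"player": player,
                      "hole_cards": hole_cards,
                      "result": result}
    hands_by_hole_cards.setdefault(hole_cards, {})[hand_id] = player

def players_with_hole_cards(hole_cards):
    """A cheap primary-key lookup -- we paid for it at write time."""
    return sorted(set(hands_by_hole_cards.get(hole_cards, {}).values()))

store_hand("h1", "alice", "JJ", "won")
store_hand("h2", "bob", "AK", "lost")
store_hand("h3", "carol", "JJ", "lost")

print(players_with_hole_cards("JJ"))  # -> ['alice', 'carol']
```

The price is extra writes and the risk of the two "column families" drifting apart; the payoff is that the Pocket Jacks query never has to scan all hands.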

Another impression I had is that Cassandra hides a lot of the low-level details that you come in contact with when using Hadoop. I also looked at HBase, but it was rather complicated to get running compared to Cassandra. Another feature I like is that it uses JSON a lot: it is easy to visualize your data model, or even to back up and restore your entire database, using JSON. I have not written any test code that actually uses Cassandra yet. We will see about that. I am looking forward to the Berlin Buzzwords conference in June, which has a talk about Cassandra and a lot of sessions about Hadoop.

If you want to know more about Cassandra, I can recommend these great links:

Tomcat and java.net.SocketException: Too many open files

God bless monitoring. An hour ago I received an SMS from one of my servers telling me that one of my sites was no longer available. It is a dedicated server that runs four Tomcats in parallel. To track down the problem I started looking at the Tomcat log files. Whoa, what's that? The fourth Tomcat was spamming log files like crazy: over the past four days it had created four log files with a combined size of 600 GB. I was hitting a java.net.SocketException for too many open files.

At first I thought I had a leak somewhere that prevented files and sockets from being closed properly. Actually, that was not the main problem. Since all Tomcats were running as the same user, and I had never touched the open-file limit for this user, the default maximum of 1024 on Ubuntu 9.10 Server was way too low. I checked how many files this user had open.

ps aux | grep tomcat

Then, for every PID, I ran

lsof -p PID | wc -l

I had already cleaned the logs and rebooted. The combined result was that I was already scratching the 1000 mark across all Tomcats right after rebooting. Very thin ice. To make a long story short, here is how to change the maximum open-file limit on Ubuntu 9.10 Server.
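The per-PID counting above can also be scripted. A small Python sketch of the same check, reading /proc instead of calling lsof (Linux-only; it counts its own process here for demonstration, for the Tomcats you would pass their PIDs):

```python
import os

def open_fd_count(pid):
    """Count the open file descriptors of a process by listing
    /proc/<pid>/fd (Linux only). This is roughly what
    `lsof -p PID | wc -l` reports, minus lsof's extra rows for
    memory-mapped files and the like."""
    return len(os.listdir("/proc/%d/fd" % pid))

# Demonstrated on the current process; sum this over all Tomcat PIDs
# to see how close the user is to its RLIMIT_NOFILE ceiling.
print(open_fd_count(os.getpid()))
```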

First, edit /etc/security/limits.conf and add your new limits for the user running Tomcat. In my case the user was called virtual:

virtual hard nofile 5120
virtual soft nofile 4096

In addition to that, edit /etc/pam.d/common-session and add

session required pam_limits.so

Done! Reboot the machine, then verify the changes by running

su virtual
ulimit -n
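If Python is handy on the box, the same soft/hard pair can be read programmatically via the standard resource module (the values reflect whatever session you run it from, not necessarily the 4096/5120 configured above):

```python
import resource

# RLIMIT_NOFILE is the per-process open-file limit whose exhaustion
# shows up as java.net.SocketException: Too many open files.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft limit:", soft)  # what `ulimit -n` prints
print("hard limit:", hard)  # the ceiling the soft limit can be raised to
```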