Java Example with Cassandra 0.7

Last Friday, release candidate 2 of the upcoming Cassandra 0.7 version saw the light of day. It took a while to get some old Java test code working again using Thrift directly. Here is the relevant repository section of my Maven project file:

<repositories>
    <repository>
        <name>Riptano Repository</name>
        <url></url>
    </repository>
</repositories>


You want to use the Riptano repository as these guys have all the latest artifacts very fast. Here is a snippet of my new yaml based Cassandra configuration - yammi:

keyspaces:
    - name: "javasplitter"
      replica_placement_strategy: org.apache.cassandra.locator.SimpleStrategy
      replication_factor: 1
      column_families:
        - name: author
          compare_with: BytesType

Straightforward, no advanced use cases. Here is some Java code involving Thrift that works:

TTransport framedTransport = new TFramedTransport(new TSocket("localhost", 9160));
TProtocol framedProtocol = new TBinaryProtocol(framedTransport);
Cassandra.Client client = new Cassandra.Client(framedProtocol);

framedTransport.open();
client.set_keyspace("javasplitter");

String columnFamily = "author";

ColumnParent cp = new ColumnParent(columnFamily);
ByteBuffer userIDKey = ByteBuffer.wrap("nadine".getBytes());

// column name and value are just illustrative
Column column = new Column(ByteBuffer.wrap("first_name".getBytes()),
        ByteBuffer.wrap("Nadine".getBytes()), System.currentTimeMillis());
client.insert(userIDKey, cp, column, ConsistencyLevel.ONE);
framedTransport.close();

One new thing is the use of framed transports. Make sure your cassandra.yaml has this line (which it does by default):

# Frame size for thrift (maximum field length).
# 0 disables TFramedTransport in favor of TSocket. This option
# is deprecated; we strongly recommend using Framed mode.
thrift_framed_transport_size_in_mb: 15

Make really sure you are running the above code against a 0.7 Cassandra installation, otherwise you will get a lot of weird errors which are not really self-explanatory. For instance, running the Java code above against Cassandra 0.6.8 would just freeze on one of the lines above. It is still possible to use the plain TSocket transport or even to mix: Cassandra.Client has a constructor that takes two TProtocol instances (in and out).

Another thing that has changed is the use of java.nio.ByteBuffer in the Thrift based Cassandra.Client. Everything expects a ByteBuffer now (I think it took byte[] before) so the code becomes a bit more verbose. Also make sure you are running with libthrift 0.5 (see pom.xml), otherwise you will get errors like:

java.lang.NoSuchMethodError: org.apache.thrift.protocol.TProtocol.writeBinary(Ljava/nio/ByteBuffer;)

I think 0.7 beta 2 used another version of libthrift (959516) so be careful. You should probably use Hector in production code anyway; it should work well with Cassandra 0.7. I was however interested in Thrift, since apparently Cassandra is moving away from Thrift to Avro. No idea yet why Avro is better. It can't be the performance.
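Since every key, column name and value now has to be wrapped in a ByteBuffer, a tiny helper keeps the verbosity down. This is just a sketch of my own, not part of the Thrift API:

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

public class Bytes {
    // Wrap a String as the ByteBuffer the 0.7 Thrift API expects
    public static ByteBuffer of(String s) {
        return ByteBuffer.wrap(s.getBytes(Charset.forName("UTF-8")));
    }
}
```

With that, building a Column becomes a one-liner per field instead of three nested calls.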

Having Maven create a nice zip file and separate configuration

A co-worker and I are preparing a presentation about Amazon EC2 for other developers in my company. To show some stuff in action, we decided to write two JMS powered applications. One is sending messages, the other is receiving messages and persisting them into a database. During the presentation we will roll this out onto 3 EC2 nodes. Each application is built using Maven. To make it as convenient as possible, I changed the Maven package build phase to produce a single zip-file. The zip-file can be copied over to an EC2 node, where it is extracted. The zip-file contains one big "uber-jar" (with all the third party dependencies included) and a single properties-file to be able to set host and port for the JMS communication. We use ActiveMQ as the JMS vendor in our projects.

Once the big zip-file has been extracted on the EC2 nodes and the properties have been set, you can start the producer and the consumer from the Main class. (Note the little dot in front of .:consumer... - this is needed so that the properties-file is found.)

java -cp .:consumer-1.0-SNAPSHOT-final.jar package.MessageReceiver

For the packaging of the big zip-files, I use the Maven Assembly plugin during the package phase. The configuration looks like this:

<assembly xmlns=""

This explodes all third party dependency jar files and merges them into one big uber-jar file. The properties-file is excluded from the uber-jar (this name sounds hilarious if you are from Germany by the way).

<assembly xmlns=""

This creates a zip-archive containing the "uber-jar" and the properties-file. Notice that the "uber-jar" has the suffix "-final", matching the id attribute in the jar.xml file.


This code snippet runs the maven-assembly-plugin during the package phase.

Everything went well when we finished the producer application last week. Today I ran into some weird errors when I worked on the consumer end. Trying to start the MessageReceiver main class gave me the following error:

Caused by: org.xml.sax.SAXParseException: cvc-complex-type.2.4.c: The matching wildcard is strict, but no declaration can be found for element 'amq:broker'.

I was able to resolve this error with the help of the ActiveMQ XML reference page, only to stumble into the next problem:

Caused by: org.springframework.beans.factory.parsing.BeanDefinitionParsingException: Configuration problem: Unable to locate Spring NamespaceHandler for XML schema namespace

It did not really jump out at me at first why this was happening. There are unit tests which load the Spring context, and they run fine. Maven runs the tests during the package phase, and they passed fine too. So this was odd. After doing some research, I found out that the maven-assembly-plugin is responsible for this. Apparently Spring needs the spring.handlers and spring.schemas files to be present in the META-INF directory of the "uber-jar". A lot of other people had hit the same problem before me. Some of them recommend the use of the maven-shade-plugin with the following setup:


<transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
  <resource>META-INF/spring.handlers</resource>
</transformer>
<transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
  <resource>META-INF/spring.schemas</resource>
</transformer>
<transformer implementation="org.apache.maven.plugins.shade.resource.DontIncludeResourceTransformer">
  <resource>.properties</resource>
</transformer>

With the help of Transformers, the spring.handlers and spring.schemas files are added. If you are using Spring 3.x, there are many small spring-xyz.jar files and each of these comes with a spring.handlers and spring.schemas file. The maven-shade-plugin will append the content of each of these files using the AppendingTransformer. As you can see in the example above, I am again excluding my properties-file using the DontIncludeResourceTransformer. I decided to keep the zip-archiving part from the maven-assembly-plugin. Maybe this is something the maven-shade-plugin could do for me as well, not sure.

Interviewed at Google Pt.1

Today I want to blog about an on-site interview which I had at Google a while ago. Since I did not sign a non-disclosure agreement and only agreed on not taking photos in their office, I hope it is okay to write about this interview. Going to an on-site interview is like reaching the next level, if Google is satisfied with your previous telephone interview(s). Due to summer-time and a lot of vacations, I had my on-site interview more than 1.5 months after doing the initial phone interview. This was very good, because it gave me extra time to brush up on some stuff. I read some books about data structures and algorithms and practiced white-board coding a bit. I coded through a lot of sample questions from software interviews which I was able to find on the net. However, all the questions during the interview were new to me, so I had to be spontaneous.

The job I was interviewed for was Software Engineer in Test, and the interview was held in Zürich. Usually the candidates fly to the interview location one day in advance. That did not work for me, because I was already on another flight a day earlier. Instead Google booked two flights on the same day for me, which was OK. I arrived in time and signed in at the reception. As stated, you have to agree on not taking photos and you have to put some label identifying yourself on your shirt. Some minutes later I was picked up by my HR contact. She gave me a short office tour and set me up in one of the interview rooms. You could tell that Google is doing a lot for their employees. The Zürich office had a big fitness room with a personal coach, free soft drinks everywhere, video gaming rooms, pool tables, all that kind of stuff. My interview room was very small, maybe 3 by 3 meters, and had a white-board. She handed me my interview schedule and I was surprised that it was six interviews, each 45 minutes long, with a lunch break in between.

After some minutes the first interviewer came and we got into action right away. This first interview went really well. We talked a bit about continuous integration, my contribution to the Hudson project and about testing in general. Then the interviewer switched to a question about bash in Linux. We talked about how piping works under the hood. We spoke about the case where the first command is a non-stopping one like "yes" or "tail -f". I learned that Linux does not execute the commands sequentially, like I thought it would. Rather it sends the output of a command to a buffer, and the following command uses blocking reads and writes to work on the buffer. Time passed and the next two interviewers came.

For some of the interviews there are actually two people in the room, one of them just an observer who needs to learn how to interview candidates. For the second interview, the interviewer directly put up a matrix on the white-board.

-4 -1  4  5
-3  0  6 10
 1  8 11 15
17 19 22 30

I did not realize at first that the matrix was set up so that each row and each column was ascending - from negative into positive numbers. My task was to come up with an algorithm that, given any number, would return true or false depending on whether the number was contained in the matrix. The first idea I had was looking at the upper left number, reading the right and lower neighbor and selecting the neighbor which would bring me closer to the target number. Surprisingly this worked out at first, but the interviewer was able to find a counterexample. For the next attempts I tried starting from the lower right corner, starting from the middle element looking at all 4 neighbors, and even going row by row running binary search. The binary search algorithm would have worked, but the interviewer indicated that this was not the best solution and that I should try a different starting location for my first approach. After some discussion we agreed that it would be possible to start at the lower left, so that the row would be ascending and the column descending. If the number we were looking for was bigger than the current item in the matrix (lower left at the start), I would go to the right neighbor. If it was lower, I would go to the upper neighbor. This neighbor becomes the next item and the algorithm again checks the right and upper neighbors until either the element is found or there is no way to go further. Finally I wrote some Java code for this and the second interview finished.
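The lower-left walk we finally agreed on can be sketched in Java like this (my own reconstruction, not the exact code from the whiteboard):

```java
// Search a matrix whose rows ascend left-to-right and whose
// columns ascend top-to-bottom. Start at the lower-left corner:
// moving right increases the value, moving up decreases it.
public class MatrixSearch {
    public static boolean contains(int[][] m, int target) {
        int row = m.length - 1;
        int col = 0;
        while (row >= 0 && col < m[0].length) {
            int current = m[row][col];
            if (current == target) {
                return true;
            } else if (current < target) {
                col++;      // target is bigger: go right
            } else {
                row--;      // target is smaller: go up
            }
        }
        return false;       // ran off the matrix: not present
    }
}
```

Each step eliminates one full row or column, so the search takes at most rows + columns steps.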

It was lunchtime and someone from Brazil picked me up for a lunch-date. We went to the Google cafeteria and I got a free lunch. We talked about different things and it was nice to relax a little bit from coding exercises.

Subversion Changes to Archive File

Just a simple one-liner that I want to share with you. Do you know this problem: you have made changes to one of your projects, some files have changed, some images were added. Now you need to move this to your live server. Most often I moved the files one by one based on their change dates. Today, just before committing the changes into Subversion, I had the idea to use the Subversion changes for creating a tarball which I could copy over to my server and extract there.

svn st | grep -v '^?' | awk '{ print $2 }' | xargs tar rvf changes.tar

svn st - show Subversion changes (assuming that you have done all your svn add stuff before)
grep -v '^?' - filter out files which are not versioned, e.g. IntelliJ project files
awk '{ print $2 }' - print only the file path and name
xargs tar rvf changes.tar - append these files to changes.tar (tar cannot append to a gzipped archive, hence no z flag)

All that is left is to transfer the file via SCP or FTP and extract it in the right location.

Google Phone Interview

A couple of months ago, I was contacted by a recruiter working at Google. It was about a Software Engineer in Test position in Stockholm. I was a bit surprised because this position had been on the net for a long time and it was more than a year since I had sent them my CV. Even though I am not pro-actively looking for a new job, you cannot say no if Google is asking. I replied that I was interested and she called me back two days later. Google's recruiting process is very different compared to other companies I have applied at. First there will be one or more interviews over the telephone, and if you do well in those, you will be invited to an on-site interview. The on-site interview is a whole day where you will meet 5 to 7 interviewers, each for a 45 minute interview. Each interviewer then writes an evaluation of the interview. All of the evaluations are reviewed by the hiring committee, and if they decide to hire you, as one of the last steps, your documents including your CV are sent over to the US headquarters for the final go or no-go.

So when the recruiter called me, this was actually the first step in a long recruiting process. The only thing she was interested in during the first call was the length of my notice period and whether it was negotiable. After this she asked me two technical questions. The first question asked for the worst case time complexity of Quicksort. The second question was something about Radixsort complexity, but she could not read the question, so she asked what the difference between a HashMap and a HashSet was instead. The recruiter is not really a technical person. She was probably given a list of questions and answers to filter out the very bad candidates. My answers were okay and she told me that she would set up a telephone interview.

A week or two passed. Since I did not hear from her, I wrote a mail asking for the status of the telephone interview. Due to vacations it took a bit longer to set up the interview. My recruiter told me that a confirmation would be sent soon. To prepare myself, I should be looking at the Google Testing blog. She also wrote that the interviewer would be interested in my testing background and that he would be asking questions relevant to my coding and problem solving skills, algorithms, core computer science concepts, OOP and data structures. It could also be that he would ask me to write a test plan, or how to test an arbitrary Google product like Maps, or how I would design a cache. We would also be talking about my projects, problems I had found, tested and fixed, and what I would do with my Google 20% time. It would also be good to know some stuff about the company, like the founders, the products, the business model etc.

The interview however went completely differently. Someone else sent me a confirmation mail including a link to a Google Docs document, which I could share with the interviewer. It was also suggested that I "warm up" a bit in the "Software Competitions - Algorithms" arena, which I did. On the day of my phone interview, I was called exactly at 11am. The interviewer was very friendly and explained everything very carefully.

For the first task I was given two Lists of Integers, one sorted ascending, the other sorted descending. I should write an algorithm (Java) in the Google Docs document which would take both Lists and return a combined List that is sorted. I remembered that in the merge phase of Mergesort you do a similar thing. Have a pointer on the first element of the first list and a pointer on the last element in the second list. Then copy back the smaller of the two elements and increase or decrease the appropriate pointer.
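That merge idea can be sketched in Java like this (my own reconstruction, not the code from the interview document):

```java
import java.util.ArrayList;
import java.util.List;

public class MergeSorted {
    // asc is sorted ascending, desc is sorted descending.
    // Walk asc from the front and desc from the back (its smallest
    // element), always copying the smaller of the two - exactly
    // like the merge phase of Mergesort.
    public static List<Integer> merge(List<Integer> asc, List<Integer> desc) {
        List<Integer> result = new ArrayList<Integer>();
        int i = 0;                  // front of the ascending list
        int j = desc.size() - 1;    // back of the descending list
        while (i < asc.size() && j >= 0) {
            if (asc.get(i) <= desc.get(j)) {
                result.add(asc.get(i++));
            } else {
                result.add(desc.get(j--));
            }
        }
        // drain whatever is left in either list
        while (i < asc.size()) result.add(asc.get(i++));
        while (j >= 0) result.add(desc.get(j--));
        return result;
    }
}
```

This runs in linear time, since each element is looked at exactly once.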

The second task was about anagrams. Given a list of words, I was asked for a data structure in which you could store anagrams. I suggested using a Hashtable. Each word is effectively a character array that can be sorted. So sort the characters and use the result as the key for the hash function. That way, elements having the same key are anagrams of each other.
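The sorted-characters trick can be sketched in Java like this (my own reconstruction; a HashMap stands in for the Hashtable):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Anagrams {
    // All words that share the same sorted-character key are anagrams.
    public static Map<String, List<String>> group(List<String> words) {
        Map<String, List<String>> groups = new HashMap<String, List<String>>();
        for (String word : words) {
            char[] chars = word.toCharArray();
            Arrays.sort(chars);                  // "listen" -> "eilnst"
            String key = new String(chars);
            List<String> bucket = groups.get(key);
            if (bucket == null) {
                bucket = new ArrayList<String>();
                groups.put(key, bucket);
            }
            bucket.add(word);                    // anagrams share a bucket
        }
        return groups;
    }
}
```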

In the last task I was introduced to a Skip List, which I had never heard of before. My interviewer explained the Skip List a couple of times and I was asked to write an algorithm for finding an element in the list. I did okay, I think, and then he asked me if I had any questions. I asked a bit about the Stockholm office, what he was working with at Google, if they used Git for version controlling (they use Perforce) and some other questions. I asked about feedback but he said he never gives feedback directly. Instead he would write an evaluation of the interview in the next couple of days. He also said that if I was later asked to do another phone interview, it would not mean I was good or bad in the first one.
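A Skip List search like the one I was asked to write might look roughly like this (a sketch with a minimal node type; a real skip list also needs randomized insertion, which is omitted here):

```java
public class SkipListSearch {
    // Minimal node: a value plus an array of forward pointers,
    // one per level. Level 0 is the full linked list; higher
    // levels skip over more and more nodes.
    static class Node {
        final int value;
        final Node[] next;
        Node(int value, int levels) {
            this.value = value;
            this.next = new Node[levels];
        }
    }

    // Start at the highest level of the head node; move right while
    // the next value is still smaller than the target, otherwise
    // drop down a level. At level 0 the next node is the candidate.
    public static boolean contains(Node head, int target) {
        Node current = head;
        for (int level = head.next.length - 1; level >= 0; level--) {
            while (current.next[level] != null && current.next[level].value < target) {
                current = current.next[level];
            }
        }
        Node candidate = current.next[0];
        return candidate != null && candidate.value == target;
    }
}
```

With the levels built randomly, this search takes expected logarithmic time.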

One week or so passed. My recruiter called me and said that they liked my interview and wanted to meet me for an on-site interview. Because Stockholm was not big enough and not so many test managers worked there, they would like to have the interview in Zürich. I would be given further details and date suggestions later on. Google pays the flight upfront. The hotel must be paid by the candidate but the money can be reimbursed by sending an expense form to Google in Poland. For those candidates staying overnight, Google also pays for the food (30 Euro, US-Dollar or Pound depending on where the interview is held). Rental car fees can also be reimbursed. I decided to pay for everything myself, since I was not staying overnight. Sending some reimbursement form to Poland seemed complicated. I was given 4 days to choose from for the interview. I picked one but they set up the time on another day anyways - weird. So I was flying on-site to Google in Switzerland, roughly two months after the initial contact.

Jetty7, Spring, Testing and Classloading

I am currently working in a project where Jetty is configured in a Spring context that is started for unit tests. I was constantly hitting: "Context attribute is not of type WebApplicationContext". However, in the debugger I could see it was given an XmlWebApplicationContext - WTF? I figured it must be some classloading issue. Maybe something that was different between Jetty 6 and 7? In previous Spring projects I have always used Jetty 6. This was the first time I tried Jetty 7 - which is still not final, I think.

Anyway, what you want to do is set the parentLoaderPriority property on the Jetty WebAppContext bean in your Spring applicationContext.xml file.

Previously I never had to use the parentLoaderPriority property, but this made my tests work.

One (unrelated) question mark remains though. This is a Maven based project and the deploy artifact is a WAR file. There was one unit test that manually started a Jetty server in-process to do some testing. My initial idea was to use the maven-jetty-plugin to start Jetty before the test suite runs from the Maven command-line. However, in that case it would not have been possible to run this test isolated in the IDE. I decided to make the test a Spring powered unit test and have the Spring context start up Jetty. Unfortunately this required an existing WAR file, correctly set in the war property of the WebAppContext (see above). Maven builds the WAR file after running the tests. To work around this, I told Maven to construct the WAR file before the tests are run:




Has anyone else solved this scenario differently?

Spring Insight and Google Speed Tracer

I am sitting in a session here at Disruptive Code called "From Zero to Cloud", presented by Adam Skogman from SpringSource. The session was actually split in the middle, with lunch in between. While the talk itself did not give me much of a big WOW effect, simply because I have used the Amazon cloud services before, Adam mentioned a very cool tool called Spring Insight. He spoke about Spring Insight for just under a minute. It sounded like an extension to the Spring tc Server which makes it possible to do performance analysis for web applications. It can measure and display information about the execution time of an HTTP request, a query execution in the database or even a single Spring bean invocation. Spring Insight can furthermore be integrated with Google Speed Tracer, which is apparently a Chrome extension that I did not know of until a couple of minutes ago.

Spring Insight is a very interesting project for me. A while ago a former colleague and I had a similar idea for an open-source project. We were planning to implement a language independent framework to measure execution times of web applications. In Java, the idea was to implement this using Annotations and Servlet Filters. In PHP, we were planning to achieve the same outcome using explicit method calls, which the PHP developer would have to add to the code manually. While the application was executed, it would then write a report file which users could upload online to visualize the collected data. It is this part that Google Speed Tracer is doing now in the setup together with Spring Insight. Unfortunately we never had the time to work on our project, so it ended up just being an idea.

Anyway, I googled around a bit. The biggest point of criticism of Spring Insight is apparently that it is tightly coupled to SpringSource's tc Server. Even though there are millions of Spring Framework powered applications, only a fraction of them run in Spring tc Server. Doing some quick research, it seems like the main reason Spring Insight needs Spring tc Server is the use of custom container deployers to enable the functionality. The good news is that someone else has already started to work on a third party library to enable Spring Insight functionality without actually forcing the user to go with Spring tc Server. The library is called spring4speedtracer. Even though it is not as powerful as the original Spring Insight - for instance JDBC query execution times cannot be measured - it is a promising project. When I get home, I will have a closer look at this library; maybe I can contribute to the project (and put my own project idea to rest for good).

Auto Secondary Indexes in Cassandra

Twenty minutes ago, Eric Evans' talk about Cassandra ended at Disruptive Code. For the first 25 minutes or so, I was quite disappointed because it seemed to be exactly the same presentation which I saw in June at the Berlin Buzzwords conference. Even the funny Bigtable-Dynamo lovechild slide was still there, though I believe the laughter was greater in Berlin than it was in Stockholm. Well, I guess it's not so easy to get a Swede laughing.

Anyway, what I realised during Eric's presentation was that he had already added some stuff from the next Cassandra release, 0.7. First of all, every time he was showing configuration, he had an excerpt from a cassandra.yaml file. For instance this snippet from his timeseries example:

- name: Sites
- name: Stats
  compare_with: LongType

(new yaml configuration in Cassandra)

Apparently, as of version 0.7, the cassandra.yaml file is replacing the cassandra.xml file. I have not really come in contact with yaml before; I believe it is common in the Ruby world. Another very cool feature is the addition of secondary indexes to Cassandra. In previous versions, Cassandra did not have indexes out of the box. To mimic the behavior of a secondary index, what you could have done is create another Column Family. This new Column Family would then be sorted differently and contain a key to the "original" entry. As an example, imagine having a Column Family to store addresses. To be able to search by city, you could create another Column Family called "byCity" with two properties, "city" and "address key". Every time you insert or update an address, your code has to alter the byCity Column Family as well.
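To illustrate the bookkeeping, here is a toy in-memory sketch of that manual byCity index in Java (plain maps standing in for the two Column Families; all names are made up):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy illustration of a manual secondary index: the addresses map
// plays the main Column Family, the byCity map plays the inverted
// index that application code has to keep in sync on every write.
public class ManualIndex {
    private final Map<String, String> addresses = new HashMap<String, String>();
    private final Map<String, Set<String>> byCity = new HashMap<String, Set<String>>();

    public void insert(String addressKey, String city) {
        addresses.put(addressKey, city);
        Set<String> keys = byCity.get(city);
        if (keys == null) {
            keys = new HashSet<String>();
            byCity.put(city, keys);
        }
        keys.add(addressKey);   // the extra write the index costs you
    }

    // secondary access path: city -> address keys
    public Set<String> findByCity(String city) {
        Set<String> keys = byCity.get(city);
        return keys == null ? Collections.<String>emptySet() : keys;
    }
}
```

The point of the new 0.7 settings is that Cassandra does this second write for you.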

It looks like Cassandra will do this for you from version 0.7 on. There are two new per-column settings called index_name and index_type. If I understood Eric correctly, adding these to your configuration will create an inverted index for you, which can be used as a secondary access path. I think this is a very nice, yet very undocumented, feature. No clue when version 0.7 is going to be released, but I hope it will be very soon, because we are only weeks away from starting a very big Cassandra project in my company.

Disruptive Code Party

Seriously, who needs JavaOne when you can go partying with Disruptive Code people at Gröna Lund? Okay, the weather might be a bit nicer in California :)

Day 1 was great, I think. Great sessions, especially the ones about HTML5. Two things I did not like: Adam Skogman's "Designing For NoSQL" talk started earlier than it was printed on the badges = missed it :( Also the WiFi quality in the Big Hall was really bad. Nevertheless, great stuff! Looking forward to day two.

Choosing the wrong track

A problem that keeps following me at conferences is picking the wrong talks. Often a session sounds nicer on the agenda than it is in reality. For the first slot I selected PayPal over the HTML5 session. I was hoping to get some insights into the PayPal API. How to use it? How to integrate it, with some practical examples? The session however turned out to be not that detailed. It felt more like a sales talk. By the way, PayPal is one of the main sponsors of Disruptive Code. The fact that the organisers put up all conference tweets during the talk, and that it seemed to be really exciting over at the HTML5 track, made things worse.

The only good thing I could take away from this talk are some ideas for possibly integrating payment into my projects. The first of the two PayPal speakers gave some interesting project examples, e.g. a person-to-person send-money application, a game where you can buy ammo, or crowd-sourcing (paying users for uploading pictures, entering recipes etc.). There is also a portal called PayPal X, where developers can develop applications and tools on top of the PayPal API. Similar to the iPhone, these applications have to be approved by PayPal before they are available to everyone.

So it is likely I will embed PayPal in my applications in the future. This session just did not really show me how. The only slide with code on it wasn't much help there either, and it certainly was not PHP code like the speakers said.

HTML5 Web Workers and Geolocation

After having missed the first session about upcoming HTML5 features, I decided to go to Peter Lubbers' talk "HTML5 Web Sockets, Web Workers and Geolocation Unleashed". Peter is the author of the recently published Apress book "Pro HTML5 Programming". He works for a company called Kaazing, which I think is based in the Netherlands. Some time ago, my boss forwarded me a mail which was from Kaazing. It was about a Web Sockets presentation which they wanted to hold at our office, since game clients are one of the primary use cases where Web Sockets can come in handy. I have to admit that back then I did not want to meet them, primarily because I do not work with client products. In addition, I thought the technology was immature, but it was rather that I did not know so much about it back then and did not care.

Anyway, Peter's talk covered three of the most interesting APIs around HTML5 - Web Workers, Geolocation and Web Sockets. These APIs were initially part of the HTML5 spec but have now been removed and put into their own specifications. The idea behind all three is to make life simpler for developers. With the current generation of browsers and HTML4, developers have to come up with complicated hacks or they have to use plugins in order to mimic bi-directional communication. What HTML5 is aiming for is to support this natively in the browser instead of building something similar on a bad foundation.

Web Workers are a new feature which brings back UI responsiveness while long running, heavy Javascript is executed. It enables background processing of such a script while the user can still use the browser. Peter had an excellent example of this and hopefully I can put it up here on my blog later on. In his example he had a web form with two buttons. Each of the buttons would fire a busy-looping Javascript for 10 seconds. One button would do this without a Web Worker, the other one with. Clicking the first button, it was impossible to use the dropdown element or even open a new tab in Firefox. With a Web Worker, all this worked. Currently this feature is available in Firefox, Chrome, Safari and Opera but not in Internet Explorer. Actually, it would be fun to write a web application using Web Workers where you print a message to the IE users like "if you cannot navigate right now, consider switching to another browser". Web Workers are an incentive that comes for free if you use anything other than Internet Explorer.

The second API Peter talked about was Geolocation. Again, Geolocation is only supported in some browsers right now; you want to check first whether your browser supports it, and there are great sites for checking HTML5 compatibility. Geolocation stands for native support of the user's location inside a browser. There are really only two methods that the API supports, getCurrentPosition() and watchPosition(). The first one does a one-time call, the latter constantly receives the location of the user. This of course makes much more sense on mobile devices than from within a fixed network. What the two calls return is simply longitude, latitude and accuracy. It is up to the developer how he uses this information within the web application, e.g. by displaying it using the Google Maps API. Along with the API calls, you can also request additional metadata, but if your browser cannot give it to you (like altitude, heading, speed) you might end up getting null instead. Looking under the hood, Geolocation is implemented by the browser vendors by using an external location service. The browser asks this service for the location and returns it to the user. I was thinking that the watchPosition() method from the API is probably a candidate where you want to use a Web Worker, unless it is already implemented on top of one. Have to find this out.

Would like to blog more about Web Sockets but the next session about CSS3 has already started...

Conference kickoff

For me, it's rather hard to get annoyed at public transport in Stockholm. However, this morning was different. It looked like all the kindergartens and school classes in Sweden were out for a field trip. The Pendeltåg was packed, and so were the buses. I took bus 69 from the central station to the technical museum, where the conference is taking place. I have never been in this area of Stockholm before, even though we have lived here for more than 2 years now. Wondering why they claim the Tekniska Museet is on Djurgården? This is totally the wrong island. Anyway, the museum is quite an amazing place. It's like Tom Tits Experiment, with the limitation that you are not allowed to touch and try the stuff.

A bunch of people exited the bus with me and headed for the conference. Someone handed me my badge at registration without checking my ID. Last name misspelled - did I screw this up in the blogger pass registration form? Actually it was good that my last name was spelled wrong. My company had booked a Disruptive Code ticket for me before I got elected as a blogger. I promised the ticket to someone else in the office since I didn't know that the badges came with a name. To make a long story short, you will meet two Reik Schatz at the conference now, one with the last name spelled right.

The conference is about to start now, let's see how it goes. I could already grab Eric Evans over a coffee.

How to select a data-store

I have probably mentioned it a couple of times: my company is currently in the process of selecting a new data-store which is supposed to replace MySQL for some applications that produce a lot of data. Cassandra is right now the hottest candidate, favored by my development team and also by the system administrators. Since replacing the database is obviously something bigger, we had to present this to some of the managers. They seemed to be interested as well and decided to make this a venture. This means we are getting hardware, people, resources etc., but it also means we have to follow a certain venture process that starts with a pre-study. To make it even more complicated (or bureaucratic) there is even a pre-study for the pre-study. Actually it is not as bad as it sounds. One of the architects called for a meeting today. The purpose was to find all the open questions that we have to answer in the pre-study document. It is this document which will be presented to the people who decide about the venture.

The meeting was actually very productive. Here is some stuff that is worth thinking about if you are in a similar situation.

Alternatives: this is a question you will almost certainly get. What are the alternatives? Why have you selected product X? I was thinking that maybe we could come up with some sort of comparison matrix from which it will be obvious why we favor Cassandra (otherwise we probably have to ask ourselves why).

Product: the product you are evaluating. What tools does it have to offer? Which changes are required for the development and the test environments? For instance, many of our testers look up and compare results directly in the database using MySQL Query Browser. What can we give them instead? Another interesting aspect is competence. How do you build it up? Are there any trainings you can participate in, or is there a company from which you can buy professional support? In the case of open-source software, how active is the community and how certain is it that the product is not going to die within two years?

Operations: what hardware setup do we need? How do we back up and restore? How do we monitor the system, and what do we monitor? What are the requirements on high availability and scalability? Talking about scalability: can we add and remove nodes on the fly, and how long does it take to replicate / re-balance our data onto these nodes? Do we have a disaster recovery strategy?

Impact: the most interesting part for me. How do we have to adapt existing applications in order to integrate with Cassandra? Which client library is preferable? How flexible is the new data-store when it comes to changes? For instance, we always have big-time trouble when we need to alter our MySQL schemas without downtime - a process for which Facebook even has a special name: Online Schema Change. Another interesting question: how can you effectively unit test an application using Cassandra? I guess running an in-process Cassandra, that starts up inside a unit test on a single node, is not catching all errors. Is it realistic to believe that we can come up with a comprehensive list of use-cases describing how the data is used in our system? Such a list would greatly influence our data-model. Since there really is only one physical way to store the data, how do we handle alternative access paths? Do we retain a bunch of MySQL indexes to support Cassandra, or do we also store these secondary access paths in the main data-store?

I do not remember all the open questions we came up with during the meeting. This is a good start at least. I hope we can create a great pre-study document to get everyone in the company on board.

Blogging at Disruptive Code

Good news everyone - I have been selected as a blogger for the Disruptive Code conference in Stockholm in September. Two weeks ago, a co-worker of mine sent me a link to the dcode website. I had not heard of this conference before, but the topics as well as the location seemed very cool. My company is currently evaluating a long-term storage prototype based on Cassandra as a replacement for our not-so-long-term MySQL solution. My main interest is therefore the NoSQL sessions and the talks about Apache Cassandra. Nice to see that Eric Evans from Rackspace decided to come visit Scandinavia and talk about Cassandra insights. I saw him earlier this year at the Berlin Buzzwords conference. Looking forward to two exciting days on the Djurgården island.

Big Buzz in Berlin

Flying home from Berlin to Stockholm, I decided to blog about my visit to the BerlinBuzzwords conference. Buzzwords is a conference dedicated to data storage at high scale. It was the first time the conference was held, and I will definitely come back next year! BerlinBuzzwords was organized as a 2-day event (Monday and Tuesday) plus a Barcamp on Sunday night. The first day centered around Apache Lucene and Solr as well as NoSQL databases that use a different data model than relational databases, like MongoDB or CouchDB. The second day was almost entirely dedicated to Hadoop as well as the large-scale NoSQL databases Cassandra and Hypertable.

The conference started off a bit unorganized: handing out the badges seemed to be problematic and I ended up waiting in a queue for some minutes. Apparently the badges were not sorted alphabetically, which made it difficult to find the right badge and impossible to have multiple lines. Single-threaded and non-indexed - a pretty bad start for a data storage conference, but this was not really a big deal. BerlinBuzzwords was organized so that you could choose from 2 concurrent tracks. Lunch was served, and two coffee-breaks were included in the agenda; drinks between the sessions had to be paid extra. Surprisingly, the organizers managed to come up with a really good speaker list. Eric Evans from Rackspace introduced Apache Cassandra, and Michael Busch from Twitter talked about upcoming improvements in Lucene regarding real-time search.

Surprisingly, my favorite talk was the one about Hive. It was the last talk of the conference, given by Sarah Sproehnle from Cloudera, who sat next to me during the initial keynote without me knowing who she was. Hive is a client-side extension to Hadoop which makes it possible to run SQL-like queries against HDFS storage. Hive comes with a compiler that takes your SQL and turns it into Map/Reduce jobs. To get started with Hive, install Hadoop as usual. In the next step you need to create the schema for your data. Hive stores its schema definition - table and column names, data types - in a relational database alongside Hadoop, the Metastore. Out of the box this is Derby, but you can use a different vendor instead. Tables are created from the Hive command-line utility using the mentioned SQL-like syntax. Once the schema is ready, you load data into Hive. This is done using the LOAD DATA command in the shell, specifying data files that can exist either inside or outside of HDFS. Contrary to normal databases, Hive will not validate the data against the schema before inserting. The validation is done when reading the data, which can surprise you with errors late.

A nice shortcut for schema creation and data loading is the Cloudera tool Sqoop, which comes with the Cloudera distribution of Hadoop. BerlinBuzzwords also had a session about Sqoop. The tool is run from the command line. Using Sqoop, you can dump tables from a relational database directly into HDFS. With the --hive-import option, the data is imported in a way usable for Hive. Sqoop can dump not only entire tables but also just a subset of rows.

Once your data is in Hive/HDFS, you can query it with HiveQL. The syntax is very similar to SQL, with a few Hive-specific extensions. There is direct support for partitioning and bucketing, and user-defined functions can be invoked from a HiveQL query. As Hive will take a query and turn it into a Map/Reduce job, you have to remember that there will always be an overhead due to the startup phase. It will take at least 30 seconds before the jobs are executed. Surprisingly, the first example Sarah had in the presentation returned the result immediately. I believe Hive is able to detect that it can go into HDFS directly for very simple queries, without running the whole Map/Reduce thingie.

One problem, now that Hive has brought the schema back to a schema-less NoSQL system, is of course schema evolution. What happens if you add or drop columns? Hive supports the ALTER TABLE statement, but it will only change the information in the Metastore, if I understood Sarah right. In the easiest case, adding a column at the end of a table, this might not be a big issue, as Hive will return NULL values for the new column. However, if you add or remove a column in the middle (not sure Hive can do that today), your data will be corrupted and you have to fix it yourself, e.g. by writing a Map/Reduce job. I am not sure how ALTER TABLE is implemented in Hive, but hypothetically the framework should be able to create this "clean-up" job for you.
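The schema-on-read behavior can be illustrated with a toy sketch in plain Java (this is not Hive code; the comma-separated format and the column names are made up for illustration). The row is only matched against the schema when it is read, never when it is loaded:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SchemaOnRead {
    // Match a delimited row against the schema at read time only,
    // the way Hive does - no validation happened when the row was loaded.
    static List<String> read(String row, List<String> schema) {
        String[] fields = row.split(",", -1);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < schema.size(); i++) {
            // a column added at the end of the schema simply reads as null
            out.add(i < fields.length ? fields[i] : null);
        }
        return out;
    }

    public static void main(String[] args) {
        // row loaded before the "age" column was added to the schema
        System.out.println(read("reik,SE", Arrays.asList("name", "country", "age")));
        // prints [reik, SE, null]

        // a column added in the middle silently shifts the old data:
        // "SE" is now read as "city", not as "country"
        System.out.println(read("reik,SE", Arrays.asList("name", "city", "country", "age")));
        // prints [reik, SE, null, null]
    }
}
```

The second call shows why mid-schema changes corrupt old rows: the data itself never moves, only the labels over it do.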

Anyway, very interesting stuff. Here is a great blog post if you want to read more about BerlinBuzzwords 2010.

Evaluating Cassandra

In the last few months I had the chance to play with Hadoop for a bit. I did some prototyping and unit-testing to learn the framework and see if it is something we can use in production in my current company. My team is re-designing an existing application which must be able to store 5 years of poker-related data. Five years probably does not sound too bad, but given that our system deals about 500 poker hands at peak time, this can be challenging.

I did a presentation on Hadoop a couple of weeks ago, but we decided that we will not use it as our data storage. There are a couple of reasons for it. First of all, the framework is going through a lot of changes. The API has changed a lot between versions 0.18.3 and 0.20.2, so the developer often has to deal with code examples, documentation and libraries that are not updated to the latest Hadoop API. Furthermore, with Hadoop you have to tame the underlying HDFS file system, and you have to tweak your Map Reduce jobs to make them run optimally. Also, the Namenode is the Achilles' heel of Hadoop. If the Namenode has a problem, you need great knowledge about how to fix and restore it, otherwise you are in trouble.

We ended up talking about Cassandra, which is also an Apache top-level project. On a very high level, it reminds me of a schema-less database, whereas Hadoop is more of a distributed file storage. I spent a day familiarizing myself with Cassandra. Here are my first thoughts.

Cassandra does not have any indexes. The data must be stored in an intelligent way, so that it can be retrieved with good performance using the primary key. What does that imply? It means you roughly need to know how you will access the data in the future, and design for it. Otherwise you might end up with use cases that you cannot serve efficiently, e.g. find all players having had Pocket Jacks pre-flop. The good thing about Cassandra is that the data is truly distributed over all nodes. You can read and write from each node at any given time - no single point of failure as in Hadoop. On the other hand, Cassandra is not processing anything in parallel. If you want to access your data in a distributed, parallel fashion, you need to write a concurrent application talking to several nodes. Alternatively you could use the Hadoop-Cassandra integration, which was added in the latest release. It allows you to write Hadoop Map Reduce jobs that use Cassandra as their InputFormat. Fancy.
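To make the "design for your access paths" point concrete, here is a toy sketch in plain Java maps (this is not the Cassandra API; the structures and all names are invented for illustration). With key-only access, the Pocket Jacks question is only cheap if we write a row for it up front:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HandsByKey {
    // The row key is the only efficient access path, so every hand is
    // written twice: once under the player, once under the hole cards.
    static Map<String, List<String>> handsByPlayer = new HashMap<>();
    static Map<String, List<String>> playersByHoleCards = new HashMap<>();

    static void record(String player, String holeCards, String handId) {
        handsByPlayer.computeIfAbsent(player, k -> new ArrayList<>()).add(handId);
        playersByHoleCards.computeIfAbsent(holeCards, k -> new ArrayList<>()).add(player);
    }

    public static void main(String[] args) {
        record("nadine", "JJ", "hand-1");
        record("reik", "AK", "hand-2");
        record("bob", "JJ", "hand-3");
        // "find all players who had Pocket Jacks pre-flop" becomes a cheap
        // key lookup - but only because we denormalized on write
        System.out.println(playersByHoleCards.get("JJ")); // prints [nadine, bob]
    }
}
```

Without the second map, answering the question would mean scanning every row, which is exactly the inefficient case described above.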

Another impression I had is that Cassandra hides a lot of the low-level details that you come in contact with when using Hadoop. I also looked at HBase, but it was rather complicated to get running compared to Cassandra. Another feature I like is that it uses JSON a lot: it is easy to visualize your data model, or even to back up and restore your entire database, using JSON. I have not written any test code that actually uses Cassandra yet. We will see about that. I am looking forward to the Berlin Buzzwords conference in June, where they have a talk about Cassandra and a lot of sessions about Hadoop.

If you want to know more about Cassandra, I can recommend these great links:

Tomcat and Too many open files

God bless monitoring. An hour ago I received an SMS from one of my servers telling me that one of my sites was not available anymore. It is a dedicated server that runs 4 Tomcats in parallel. To track down the problem, I started looking at the Tomcat logfiles. Whoo, what's that? The fourth Tomcat was spamming logfiles like crazy. Over the past four days it had created 4 logfiles with a combined size of 600 GB. I was hitting a "too many open files" error.

First I thought I had a leak somewhere which prevented files and sockets from getting closed properly. Actually, this was not the main problem. Since all Tomcats were running as the same user, and I had not touched the open file limit for this user, the default maximum of 1024 on Ubuntu 9.10 server was way too little. I checked how many files were open for this user:

ps aux | grep tomcat

Then for every PID I ran

lsof -p PID | wc -l
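As a cross-check from inside the JVM, here is a small Linux-only Java sketch (it assumes /proc exists, so it will not work on other systems). It counts the file descriptors the current process has open; the number is in the same ballpark as lsof's, though lsof also lists things like memory-mapped files, so its count is typically a bit higher:

```java
import java.io.File;

public class OpenFiles {
    // /proc/self/fd lists the file descriptors of the current process
    // (Linux only). Returns -1 if the directory does not exist.
    static int count() {
        String[] fds = new File("/proc/self/fd").list();
        return fds == null ? -1 : fds.length;
    }

    public static void main(String[] args) {
        System.out.println("open file descriptors: " + count());
    }
}
```

Something like this could be exposed via JMX to trigger an alert before the limit is hit, instead of after the site goes down.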

I had already cleaned the logs and rebooted. The combined result was that I was already scratching the 1000 mark for all Tomcats right after rebooting. Very thin ice. To make a long story short, here is how to change the maximum open file limit on Ubuntu 9.10 server.

First you edit /etc/security/limits.conf and add your new limit for the user running Tomcat. In my case the user was called virtual:

virtual hard nofile 5120
virtual soft nofile 4096

In addition to that, edit the file /etc/pam.d/common-session and add

session required pam_limits.so

Done! Reboot the machine, then verify the changes by running

su virtual
ulimit -n

glftpd trouble after upgrading to Ubuntu 10.04

Today I upgraded three computers running Ubuntu 8.04 LTS to the latest version 10.04. On one of the computers I had glftpd running and it did not work anymore after the upgrade. When I checked for the open ports

netstat -anp --tcp --udp | grep LISTEN

the port glftpd was previously running on was not in the list anymore, and ps aux | grep glftpd did not show the process either. The error "service/protocol combination not in /etc/services: glftpd/tcp" could be seen when restarting xinetd

sudo /etc/init.d/xinetd restart

while tailing syslog

tail -100f /var/log/syslog

All that was missing was an entry at the end of /etc/services.

# Local services
glftpd 10087/tcp # glftpd

First I thought the 10.04 upgrade process was flawed, but then I remembered that Ubuntu had actually asked me whether to keep or overwrite some files which I had modified in 8.04. I almost always decided to overwrite the files, except for some local Apache modifications. Apparently /etc/services was one of the files that had changed.

Dependencies clashing with Maven Overlays

Last year I wrote about a Maven feature I had discovered back then that makes it possible to "merge" two web-applications using overlays. Unfortunately, I discovered a very annoying problem with overlays today. In my current project I am using the decode method of the Base64 class in commons-codec. The method was added in commons-codec 1.4. One of the overlay WAR files comes with an older version of commons-codec, and Maven simply throws the dependencies from the overlay project together with those of your referring project. Make sure you have a look in the lib folder after running mvn package. When I looked into the lib directory in my WEB-INF folder, I realized that I had commons-codec-1.4.jar as well as commons-codec-1.3.jar (the one coming with the overlay) in there. When the web-application loads, version 1.3 is picked up, causing a runtime exception because the decode method is missing.

This is pretty serious, I think. First of all, I did not realize that the 1.3 dependency slipped in with the overlays. I ran mvn dependency:tree -Dverbose -Dincludes=commons-codec to see what happened, but it did not show me any library using commons-codec 1.3. It took a while until I realized what the problem was and that the Maven dependency plugin would not help me here. I went into my local repository and ran find . -name '*.pom' -exec grep -nH 'commons-codec' {} \; to spot the projects using commons-codec. I would like to see a Maven plugin that can find the libraries using a certain version of a dependency. Anyway, I was lucky: the overlay projects could be upgraded to use codec 1.4 instead, and the dependency clash went away.

Unfortunately, one of the WAR overlays uses dom4j-1.6.1.jar, which has a dependency on xml-apis-1.0.b2.jar. The project where I refer to the overlay depends on xml-apis-1.3.04.jar - so here I am, having two jars of the same type in the classpath and no easy way to fix this.
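One possible workaround, which I have not tried for this case, would be to keep the overlay's stale jar out of the final WAR using the maven-war-plugin's dependentWarExcludes parameter (a comma-separated list of path patterns applied to overlay content). A sketch, with the file name taken from the versions above:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-war-plugin</artifactId>
  <configuration>
    <!-- keep the overlay's old xml-apis out of WEB-INF/lib -->
    <dependentWarExcludes>WEB-INF/lib/xml-apis-1.0.b2.jar</dependentWarExcludes>
  </configuration>
</plugin>
```

The referring project's own xml-apis-1.3.04.jar would then be the only copy left in the classpath.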

First Walk in the Clouds

During the week I tried the Hadoop framework for the first time. I wrote a proof-of-concept prototype for an application that we are likely going to develop. I managed to test my code using unit tests (MRUnit), a local integration test starting an embedded Hadoop, and by running it pseudo-distributed on my local Hadoop cluster. The final step was to test it in a real cluster on Amazon EC2.

I had never started an AMI in EC2 before, so everything was brand new to me. Access to your AWS account as well as to your instances is well protected. Setting up proper access to my EC2 instances was very bumpy, especially since I made a mistake with one of the private key files. Unfortunately, the error message I got was not very helpful, and I spent quite some time finding the problem.

If you want to use EC2, you need the following security credentials. Sign-in Credentials: this is basically an email address and a password protecting your AWS account. You need to keep this really safe.

Access Credentials: they consist of three different sub-groups. First there are the Access Keys. Each EC2 user can have up to two Access Keys, and each Access Key has an Access Key Id and a Secret Access Key. You add them to your system environment variables as AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. The next subgroup are the X.509 Certificates. Again, you can generate two X.509 Certificates at a time for each EC2 account. Create a new certificate in the AWS management console and download the public and the private key. The public key will be in a file that starts with cert-xxxx, the private key in a file that starts with pk-xxxx. Copy these two files somewhere and add the full locations to your system environment variables as EC2_PRIVATE_KEY and EC2_CERT. The last subgroup are the Key Pairs. A Key Pair is used when you launch an AMI executing the ec2-run-instances command. This is an additional protection to restrict access to your running instance. A Key Pair also has a private and a public part. Amazon will keep the public key of the Key Pair and store it with your instance. To connect to the instance, you need the private key part of the Key Pair.

Run the command "ec2-add-keypair foo" to create a Key Pair named foo. This will return the private key part, which you have to copy into a file. This is where I made a stupid mistake: I copied only the part between BEGIN and END into the file, but the file needs to contain the whole output. So it is much better to run this instead: "ec2-add-keypair foo > ~/.ssh/foo.keypair.ssh". This automatically sends the output to a new file in the .ssh directory. Finally, give your new file the right permissions using "chmod 0700 ~/.ssh/foo.keypair.ssh". For further reading I recommend this page, which helped me fix my problem. So if you try to ssh -i into your instance and it asks you for a passphrase, something is not correct with the private key part of your Key Pair. The same problem manifests itself differently if you do not use ssh to connect to your instance but the Cloudera scripts instead. If you are following the Cloudera guide for running the Cloudera Distribution AMI for Hadoop, and you are on chapter 2.3 Running Jobs and execute "hadoop fs -ls /", you get this:

WARN conf.Configuration: DEPRECATED: hadoop-site.xml found in the classpath. Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, mapred-site.xml and hdfs-site.xml to override properties of core-default.xml, mapred-default.xml and hdfs-default.xml respectively
10/03/12 15:09:31 INFO ipc.Client: Retrying connect to server: Already tried 0 time(s).

It was the same problem for me: having the correct private key part of the Key Pair fixed this.

The last bit of EC2 protection are the Account Identifiers. I think they are only relevant if you plan to share AWS resources between different accounts - not sure.