Big Buzz in Berlin


Flying home from Berlin to Stockholm, I decided to blog about my visit to the BerlinBuzzwords conference. Buzzwords is a conference dedicated to data storage at high scale. It was the first time the conference was held and I will definitely come back next year! BerlinBuzzwords was organized as a two-day event (Monday and Tuesday) plus a Barcamp on Sunday night. The first day centered around Apache Lucene and Solr as well as NoSQL databases like MongoDB or CouchDB, which use a different data model than relational databases. The second day was almost entirely dedicated to Hadoop as well as the large-scale NoSQL databases Cassandra and Hypertable.

The conference started off a bit disorganized: handing out the badges seemed to be problematic and I ended up waiting in a queue for several minutes. Apparently the badges were not sorted alphabetically, which made it difficult to find the right badge and impossible to have multiple lines. Single-threaded and non-indexed, a pretty bad start for a data storage conference, but this was not really a big deal. BerlinBuzzwords was organized so that you could choose between two concurrent tracks. Lunch was served and two coffee breaks were included in the agenda; drinks between the sessions not marked as coffee break or lunch had to be paid for extra, otherwise they were included. Surprisingly, the organizers managed to come up with a really good speaker list. Eric Evans, the author of Domain-Driven Design, introduced Apache Cassandra. Michael Busch from Twitter talked about upcoming improvements in Lucene regarding real-time search.

Surprisingly, my favorite talk was the one about Hive. It was the last talk of the conference, given by Sarah Sproehnle from Cloudera, who sat next to me during the initial keynote without me knowing who she was. Hive is a client-side extension to Hadoop which makes it possible to run SQL-like queries against data stored in HDFS. Hive comes with a compiler that takes your SQL and turns it into Map/Reduce jobs. To get started with Hive, install Hadoop as usual. In the next step you need to create the schema for your data. Hive stores its schema definition, such as table and column names and data types, in a relational database (the Metastore) separate from Hadoop. Out of the box this is Derby, but you can use a different vendor instead. Tables can be created from the Hive command-line utility using the mentioned SQL-like syntax. Once the schema is ready, you load data into Hive. This is done using the LOAD DATA command in the shell, specifying data files that can exist either inside or outside of HDFS. Contrary to normal databases, Hive will not validate the data against the schema before inserting. The validation is only done when reading the data, which can surprise you with errors long after the load.
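
To make this a bit more concrete, here is a minimal sketch of a table definition and data load from the Hive shell. The table name, columns and file path are made up for illustration:

    CREATE TABLE page_views (
      user_id BIGINT,
      url STRING,
      view_date STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

    -- Load a tab-separated file from the local file system into the table.
    -- For files that already live in HDFS, drop the LOCAL keyword.
    LOAD DATA LOCAL INPATH '/tmp/page_views.tsv' INTO TABLE page_views;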

A nice shortcut for schema creation and data loading is the Cloudera tool Sqoop, which comes with the Cloudera distribution of Hadoop. BerlinBuzzwords also had a session about Sqoop. The tool is run from the command line. Using Sqoop, you can dump tables from a relational database directly into HDFS. With the --hive-import option, the data is imported in a way that is usable for Hive. Sqoop can dump not only entire tables but also just a subset of the rows.
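
As a rough sketch, a Sqoop invocation might look like this; the JDBC URL, username and table name are placeholders, and the exact flags may differ between Sqoop versions:

    # Dump the page_views table from MySQL into HDFS and register it in Hive
    sqoop import \
      --connect jdbc:mysql://dbhost/reporting \
      --username hive \
      --table page_views \
      --hive-import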

Once your data is in Hive/HDFS you can query it with HiveQL. The syntax is very similar to SQL, with a few Hive-specific extensions. There is direct support for partitioning and bucketing. User-defined functions can be invoked from a HiveQL query. As Hive will take a query and turn it into a Map/Reduce job, you have to remember that there will always be an overhead due to the startup phase. It will take at least 30 seconds before the jobs are executed. Surprisingly, the first example Sarah had in the presentation returned the result immediately. I believe Hive is able to detect that it can go to HDFS directly for very simple queries, without running the whole Map/Reduce machinery.
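
A typical HiveQL query against the hypothetical page_views table from above could look like the following; note how it reads like plain SQL even though it is executed as Map/Reduce jobs:

    -- Top ten URLs by number of views on a single day
    SELECT url, COUNT(*) AS views
    FROM page_views
    WHERE view_date = '2010-06-06'
    GROUP BY url
    ORDER BY views DESC
    LIMIT 10;

A trivial query like SELECT * FROM page_views LIMIT 10, on the other hand, can be answered by simply reading the files in HDFS, which would explain why Sarah's first example returned immediately.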

One problem, now that Hive has brought the schema back to a schema-less NoSQL system, is of course schema evolution. What happens if you add or drop columns? Hive supports the ALTER TABLE statement, but it will only change the information in the Metastore, if I understood Sarah right. In the easiest case, adding a column at the end of a table, this might not be a big issue as Hive will return NULL values for the new column. However, if you add or remove a column in the middle (not sure Hive can do that today), your data will be corrupted and you have to fix it yourself, e.g. by writing a Map/Reduce job. I am not sure how ALTER TABLE is implemented in Hive but hypothetically, the framework should be able to create this "clean-up" job for you.
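
The easy case looks something like this, still against the hypothetical page_views table; only the Metastore definition changes, the files in HDFS stay untouched:

    -- Add a column at the end of the table definition.
    -- Rows loaded before this statement will simply return NULL for the new column.
    ALTER TABLE page_views ADD COLUMNS (referrer STRING);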

Anyway, very interesting stuff. Here is a great blog post if you want to read more about BerlinBuzzwords 2010.