Auto Secondary Indexes in Cassandra

Twenty minutes ago, Eric Evans' talk about Cassandra ended at Disruptive Code. For the first 25 minutes or so I was quite disappointed, because it seemed to be exactly the same presentation I saw in June at the Berlin Buzzwords conference. Even the funny Bigtable-Dynamo lovechild slide was still there, though I believe the laughter was greater in Berlin than it was in Stockholm. Well, I guess it's not so easy to get a Swede laughing.

Anyway, what I realised during Eric's presentation was that he had already added some material from the next Cassandra release, 0.7. First of all, every time he showed configuration, it was an excerpt from a cassandra.yaml file. For instance, this snippet from his time-series example:


    # conf/cassandra.yaml
    keyspaces:
      - name: Sites
        column_families:
          - name: Stats
            compare_with: LongType

Apparently, as of version 0.7, the cassandra.yaml file replaces the old XML configuration file (storage-conf.xml). I have not really come into contact with YAML before; I believe it is common in the Ruby world.

Another very cool feature is the addition of secondary indexes to Cassandra. In previous versions, Cassandra did not have indexes out of the box. To mimic the behavior of a secondary index, you could create another Column Family (I believe that is what it is called). This new Column Family would then be sorted differently and contain a key to the "original" entry. As an example, imagine having a Column Family to store addresses. To be able to search by city, you could create another Column Family called "byCity" with two properties, "city" and "address key". Every time you insert or update an address, your code has to update the byCity Column Family as well.
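To make the bookkeeping concrete, here is a minimal sketch of that hand-maintained index. It does not talk to Cassandra at all: two plain Python dictionaries stand in for the address Column Family and the byCity Column Family, and the names save_address and find_by_city are made up for this illustration.

```python
# In-memory stand-ins for two column families (hypothetical, not a Cassandra client).
addresses = {}  # address key -> {"street": ..., "city": ...}
by_city = {}    # city -> set of address keys (the hand-made "index")


def save_address(key, street, city):
    """Insert or update an address and keep by_city in step with it."""
    old = addresses.get(key)
    if old is not None and old["city"] != city:
        # The city changed: drop the stale index entry first.
        by_city[old["city"]].discard(key)
        if not by_city[old["city"]]:
            del by_city[old["city"]]
    addresses[key] = {"street": street, "city": city}
    by_city.setdefault(city, set()).add(key)


def find_by_city(city):
    """Look up addresses through the index instead of scanning everything."""
    return [addresses[k] for k in sorted(by_city.get(city, ()))]


save_address("a1", "Drottninggatan 1", "Stockholm")
save_address("a2", "Unter den Linden 5", "Berlin")
save_address("a2", "Kungsgatan 10", "Stockholm")  # a2 moves to Stockholm
print(find_by_city("Stockholm"))
print(find_by_city("Berlin"))
```

The annoying part is the update case: when an address changes city, the application itself has to remember to remove the old entry from byCity, otherwise the index silently goes stale.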

It looks like Cassandra will do this for you from version 0.7 on. There are two new per-column settings called index_name and index_type. If I understood Eric correctly, adding these to your configuration will create an inverted index for you, which can be used as a secondary access path. I think this is a very nice, yet very undocumented, feature. I have no clue when version 0.7 is going to be released, but I hope it will be very soon, because we are only weeks away from starting a very big Cassandra project at my company.
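As I understand the idea, you only declare which column should be indexed, and the store keeps the inverted index up to date on every write. The toy class below sketches that behaviour in plain Python. It is an assumption-laden illustration and not how Cassandra implements it: IndexedColumnFamily, insert and get_by are invented names, and the index lives in an in-memory dictionary.

```python
class IndexedColumnFamily:
    """Toy row store that maintains an inverted index per declared column."""

    def __init__(self, indexed_columns):
        self.rows = {}
        # column name -> {column value -> set of row keys}
        self.indexes = {name: {} for name in indexed_columns}

    def insert(self, key, columns):
        """Insert or update a row; the indexes are updated as a side effect."""
        old = self.rows.get(key, {})
        new = {**old, **columns}
        for name, index in self.indexes.items():
            if name in old and old[name] != new[name]:
                # The indexed value changed: remove the stale entry.
                keys = index[old[name]]
                keys.discard(key)
                if not keys:
                    del index[old[name]]
            if name in new:
                index.setdefault(new[name], set()).add(key)
        self.rows[key] = new

    def get_by(self, column, value):
        """Return the row keys whose indexed column equals value."""
        if column not in self.indexes:
            raise KeyError(f"no index declared on column {column!r}")
        return sorted(self.indexes[column].get(value, ()))


store = IndexedColumnFamily(indexed_columns=["city"])
store.insert("a1", {"street": "Drottninggatan 1", "city": "Stockholm"})
store.insert("a2", {"street": "Unter den Linden 5", "city": "Berlin"})
store.insert("a2", {"city": "Stockholm"})  # only the city changes
print(store.get_by("city", "Stockholm"))
```

The difference from doing it by hand is that the calling code never touches the index; it only writes rows and asks for them by column value.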