Investigating Cassandra Heap

We are working on a new application which will use Apache Cassandra. Yesterday a co-worker sent me the following warning, which we kept seeing in the logs every now and then on several nodes. I was asked if this was something to worry about.

WARN [ScheduledTasks:1] 2013-01-07 12:14:10,865 (line 145) Heap is 0.8336618755935529 full. You may need to reduce memtable and/or cache sizes. Cassandra will now flush up to the two largest memtables to free up memory. Adjust flush_largest_memtables_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically.

The warning is a bit misleading, as you will see in a bit, but using 83% of your JVM heap should always ring at least some alarm bells. Since I haven’t used Cassandra that much, I needed to investigate how it uses its heap memory. We are using DataStax Community Edition 1.1.x, so the first place to look for more information was OpsCenter, but it didn’t tell me much about the heap. Next I logged into one cluster node via SSH to see if I could pull some stats via JMX, as I suspected a big cache to be the problem. For the first time I used jmxterm instead of commandline-jmxclient, and with it I read the numbers for Cassandra's key cache and row cache via JMX.

The numbers showed that we were running the defaults for both caches. The key cache was very small and the row cache was not even enabled. By default Cassandra 1.1 assigns 5% of the JVM heap to the key cache, though never more than 100 MB. As a next step I wanted to find out how the heap memory was actually used, so I ran jmap -heap `pgrep java` as explained here. Make sure you have only one java process running; otherwise pass the pid to jmap manually. Note: dumping the heap to a file wasn't such a great idea. The dump stopped after about 20 minutes, at which point the dump file was 2.7 GB in size and the node had already left the cluster.
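To put a number on the key cache default mentioned above, here is a quick sketch in Python. It assumes the rule is exactly "5% of the heap, capped at 100 MB" as I described it; the function name is my own and not a Cassandra API.

```python
def default_key_cache_mb(heap_mb):
    """Key cache default as described in the text: 5% of the heap, at most 100 MB."""
    return min(heap_mb * 5 / 100, 100)


# Illustrative heap sizes in MB (not measurements from our cluster).
for heap_mb in (1024, 2048, 4096, 8192):
    print(f"{heap_mb} MB heap -> {default_key_cache_mb(heap_mb):.1f} MB key cache")
```

Under that rule the cap already kicks in at a 2 GB heap, so on any reasonably sized node the default key cache can take at most 100 MB and is unlikely to be what fills the heap.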

Apparently 2.8 GB of our 4 GB heap was used by the old generation (also called the concurrent mark-sweep generation when the CMS collector is in use). The old generation contains objects that have survived a couple of collections in the various stages of the young generation. After reading this blog post about Cassandra GC tuning and this description from Oracle, I was thinking that the old generation might be filling up because the JVM never did a major collection. Apparently, if -XX:CMSInitiatingOccupancyFraction is not changed via JAVA_OPTS, a major collection is only triggered at approximately 92% usage. So if Cassandra was flushing the largest memtables every time heap usage reached 75% (0.75 is the default value for flush_largest_memtables_at in cassandra.yaml), it would free heap memory and thereby prevent a concurrent major collection.

Then, however, I realized that we were still running with the default value for memtable_total_space_in_mb, which is the only setting for memtables since Cassandra 1.0. The default is to use a maximum of 1/3 of the JVM heap. So something other than memtables was eating up the heap, and Cassandra flushing the largest memtables at 75% seems rather desperate in our scenario.

With caches and memtables ruled out as culprits, what else was left? It turned out that the bloom filters were getting very big for the amount of data and the number of nodes we have. Our test cluster has 6 nodes and the total data size is around 400 GB. Cassandra uses a bloom filter in front of its SSTables to check whether a row exists before it does any disk IO. This is an extra layer that, if tuned properly, makes access to column families more efficient, because disk IO is slow. A bloom filter is a probabilistic data structure. It can give you false positives, meaning it tells you a record exists in an SSTable when it does not. It will, however, never tell you a record does not exist when in reality it does (a false negative).
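To make the false positive / false negative behaviour concrete, here is a minimal bloom filter sketch in Python. This is not Cassandra's implementation: the class, the SHA-256 based double hashing and the sizes (1024 bits, 3 hash functions, 200 keys) are all assumptions I picked for illustration.

```python
import hashlib


class BloomFilter:
    """Toy bloom filter: a bit array plus k hash positions per key."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive k bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # True means "maybe present", False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


bf = BloomFilter(num_bits=1024, num_hashes=3)
stored = [f"row-{i}" for i in range(200)]
for key in stored:
    bf.add(key)

# Every stored key is reported as possibly present: no false negatives.
found = sum(bf.might_contain(key) for key in stored)
# Keys that were never added may still pass the check: false positives.
false_positives = sum(bf.might_contain(f"missing-{i}") for i in range(1000))
print(f"stored keys found: {found} of {len(stored)}")
print(f"false positives among 1000 absent keys: {false_positives}")
```

A "no" from might_contain is always safe to trust, which is what lets a read skip an SSTable without touching the disk. A "yes" only means the SSTable has to be checked.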

The false positive ratio can be tuned using the bloom_filter_fp_chance parameter, which is set per column family. We were running with the default of 0.1 for this parameter, which I think corresponds to a 10% chance of a false positive. The value can be anything between 0 and 1. Nothing is free, though: the lower the false positive chance, the larger the data structure becomes.
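To get a feeling for this trade-off, the sketch below uses the textbook sizing formula for an optimally configured bloom filter, m = -n * ln(p) / (ln 2)^2 bits for n keys and false positive chance p. I am assuming this approximates what Cassandra does; its real implementation may size filters differently. The key count is hypothetical and not taken from our cluster.

```python
import math


def bloom_filter_bits(num_keys, fp_chance):
    """Bits needed by an optimally sized bloom filter (textbook formula)."""
    if not 0 < fp_chance < 1:
        raise ValueError("fp_chance must be between 0 and 1 (exclusive)")
    return math.ceil(-num_keys * math.log(fp_chance) / (math.log(2) ** 2))


NUM_KEYS = 100_000_000  # hypothetical number of row keys on one node

for fp_chance in (0.1, 0.01, 0.001):
    bits = bloom_filter_bits(NUM_KEYS, fp_chance)
    mib = bits / 8 / 1024 ** 2
    print(f"fp_chance={fp_chance}: {bits / NUM_KEYS:.1f} bits per key, "
          f"about {mib:.0f} MiB")
```

With this formula every tenfold reduction of the false positive chance costs roughly 4.8 additional bits per key, and the total grows linearly with the number of keys a node holds.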

The bloom filter is defined per column family. So one way to bring down the size of the bloom filters in Cassandra is to evaluate your column families. Column families that do not get a lot of read requests should be fine without an effective bloom filter. Another possibility is to add more nodes to the cluster, so that each node holds less data, which also brings down the size of its bloom filters.

Finally, here is some good news for Cassandra 1.2 (still waiting for the DataStax release of 1.2): the bloom filter can run off-heap. For this to work you need to enable Java Native Access (JNA), which isn’t done by default when installing Cassandra (even when installing from the Debian packages, from what I heard). Running the bloom filter off-heap will solve your immediate heap problems. As far as I know, it is not recommended to run Cassandra with more than 8 GB of heap memory. However, you still need to tune your bloom filters with regard to data size, number of nodes and false positive ratio. Otherwise you might run out of system memory. Tuning the CMS garbage collection is also useful. I think we will set it up to be incremental.