Pimp ma JDBC ResultSet

Last week I had to work on an interesting problem. My team was working on a reporting application which creates CSV-like reports based on JDBC ResultSets returned from querying a database. Earlier this year, I refactored the code sections which create the report files, to make them testable by unit tests. Since the application was already using Spring 2.5, I decided to refactor the plain old JDBC code and use the Spring JdbcTemplate instead.

My unit tests passed and I was happy. The code never went live though, as other stuff got a higher priority. After a while the application was put on the release schedule, but (of course while I was away from work) system verification found a big problem: when creating a report, the application crashed with an OutOfMemoryError. Another developer started looking at my refactored code. First of all, I had not been aware that the reports could contain millions of rows - the one report that caused the OutOfMemoryError had 16 million rows. It was pretty naive of me to use the query method of the SimpleJdbcTemplate, passing a ParameterizedRowMapper as an argument. Obviously the returned list would contain 16 million entries and never fit into memory.
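To illustrate, here is a sketch of the kind of code I had written (Round and its mapping are hypothetical stand-ins for our report row class):

// Sketch of the naive approach: SimpleJdbcTemplate.query materializes the
// entire ResultSet into a List, so all 16 million rows must fit into memory at once.
final List<Round> rounds = simpleJdbcTemplate.query(
        "SELECT * FROM rounds",
        new ParameterizedRowMapper<Round>()
        {
            public Round mapRow(final ResultSet rs, final int rowNum) throws SQLException
            {
                return new Round(rs.getInt("round_id")); // hypothetical row class
            }
        });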

Since I was not in the office, the developer who looked at my code wrote me a mail. I don't remember the exact words, but he asked me if I had a particular reason to use SimpleJdbcTemplate instead of the old code. I felt challenged. Of course it would be madness not to use JdbcTemplate or SimpleJdbcTemplate in a Spring powered application. However, he had one good argument - the old code worked! I started investigating how to achieve the same performance using only Spring classes. I suggested using the query method of the JdbcTemplate instead. This method lets you supply a RowCallbackHandler as an argument, whose processRow method is then invoked for every row in the ResultSet, so we could directly write each line of our report.
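In outline, the callback version looks like this (a minimal sketch; writeReportLine is a hypothetical helper standing in for our CSV writing code):

// Sketch: stream each row straight into the report instead of collecting a List first.
jdbcTemplate.query("SELECT * FROM rounds", new RowCallbackHandler()
{
    public void processRow(final ResultSet resultSet) throws SQLException
    {
        writeReportLine(resultSet); // hypothetical helper writing one CSV line
    }
});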

We changed the code once more. The unit test still passed. However, we quite soon discovered that we had not really fixed the main issue. Even though it could handle more records now, it would still fail with an OutOfMemoryError: instead of building up a huge List as before, it now built up a huge ResultSet in memory. Another big problem became apparent as well - processing the rows was now very slow. A report with 16 million rows which took 7 minutes to create before now took 90 minutes. Now I felt really challenged! I did not want to go back to the old code and use good ol' plain JDBC again.

I downloaded the Spring source code and compared our previous implementation with the way we run now. Soon I found the problem. The old code created something which I call a streaming ResultSet. This was done by specifying the flags java.sql.ResultSet.TYPE_FORWARD_ONLY and java.sql.ResultSet.CONCUR_READ_ONLY in the createStatement method of the Connection, and also specifying a fetch size of Integer.MIN_VALUE. I compared this with what the JdbcTemplate was doing. Spring also used the createStatement method of the Connection class, but without specifying extra flags. This was fine, since TYPE_FORWARD_ONLY and CONCUR_READ_ONLY are the defaults. The JdbcTemplate also had a setFetchSize method - cool. However, looking at the source, I saw that it completely ignores negative fetch sizes. This is pretty bad. I think it would be much nicer to throw an exception here, since a client calling setFetchSize with a negative value will never know that the fetch size is being ignored. On the other hand, it was easy enough to create a subclass which allows negative fetch sizes. I called it StreamingResultSetEnabledJdbcTemplate.
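For comparison, here is roughly what the old plain JDBC code did (a minimal sketch; connection is an open java.sql.Connection, and the MySQL Connector/J driver treats a fetch size of Integer.MIN_VALUE as a request to stream rows one at a time instead of buffering them all):

final Statement stmt = connection.createStatement(
        java.sql.ResultSet.TYPE_FORWARD_ONLY,
        java.sql.ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE); // Connector/J switches to row-by-row streaming
final ResultSet rs = stmt.executeQuery("SELECT * FROM rounds");
while (rs.next())
{
    // process one row at a time, without holding the whole result in memory
}

And here is the subclass: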


import java.sql.SQLException;
import java.sql.Statement;

import javax.sql.DataSource;

import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.jdbc.datasource.DataSourceUtils;

/**
 * A {@link JdbcTemplate} which makes it possible to mimic streaming ResultSets by allowing negative fetch sizes
 * to be set on the {@link Statement}.
 *
 * @author reik.schatz
 */
public class StreamingResultSetEnabledJdbcTemplate extends JdbcTemplate
{
    public StreamingResultSetEnabledJdbcTemplate(final DataSource dataSource)
    {
        super(dataSource);
    }

    public StreamingResultSetEnabledJdbcTemplate(final DataSource dataSource, final boolean lazyInit)
    {
        super(dataSource, lazyInit);
    }

    /**
     * Prepare the given JDBC Statement (or PreparedStatement or CallableStatement),
     * applying statement settings such as fetch size, max rows, and query timeout.
     * Unlike in {@link JdbcTemplate}, you can also specify a negative fetch size.
     *
     * @param stmt the JDBC Statement to prepare
     * @throws java.sql.SQLException if thrown by JDBC API
     * @see #setFetchSize
     * @see #setMaxRows
     * @see #setQueryTimeout
     * @see org.springframework.jdbc.datasource.DataSourceUtils#applyTransactionTimeout
     */
    @Override
    protected void applyStatementSettings(final Statement stmt) throws SQLException
    {
        // unlike the superclass, apply the fetch size even when it is negative
        int fetchSize = getFetchSize();
        stmt.setFetchSize(fetchSize);

        int maxRows = getMaxRows();
        if (maxRows > 0) {
            stmt.setMaxRows(maxRows);
        }
        DataSourceUtils.applyTimeout(stmt, getDataSource(), getQueryTimeout());
    }
}


Using my new class killed all the issues we had. Memory was not a problem anymore and the speed was back. One drawback of my solution is that some methods like isLast or isFirst are no longer supported on the ResultSet. If your code invokes them on a ResultSet created using the described approach, an exception is thrown.

To come up with some numbers, I created a simple test project using Maven. Feel free to download it and test for yourself. You need a local MySQL database, a schema and a database user who can write to this schema. Download the zip file and extract it to any directory. Go into src/main/resources and apply the correct database settings in applicationContext.xml. After that, open a command prompt, go into the directory where you extracted the zip file and run mvn test. Maven 2 must be installed, of course.

This will run two TestNG unit tests. The first test is called JdbcTemplateTest. It creates 1.5 million rows in the MySQL database and executes the same retrieval code, first using a StreamingResultSetEnabledJdbcTemplate, then using a JdbcTemplate. I could not write the unit test with more records, as you would otherwise hit an OutOfMemoryError for the JdbcTemplate. Here is the JdbcTemplateTest.


/**
 * Tests and measures the {@link JdbcTemplate} and {@link StreamingResultSetEnabledJdbcTemplate}.
 *
 * @author reik.schatz
 */
public class JdbcTemplateTest extends AbstractJdbcTemplateTest
{
    @Test(groups = "unit")
    public void testRun()
    {
        runTestUsingTemplate(getStreamingResultSetEnabledJdbcTemplate());
        runTestUsingTemplate(getJdbcTemplate());
    }

    private void runTestUsingTemplate(final JdbcTemplate jdbcTemplate)
    {
        final String selectStatement = getQuery();

        final AtomicLong count = new AtomicLong();

        final Date before = new Date();

        final String className = jdbcTemplate.getClass().getSimpleName();
        System.out.println("Testing " + className);

        jdbcTemplate.query(selectStatement, new RowCallbackHandler()
        {
            public void processRow(ResultSet resultSet) throws SQLException
            {
                final long i = count.incrementAndGet();
                if (i % 500000 == 0) System.out.println("Iterated " + i + " rows");
            }
        });

        final Date after = new Date();
        final long duration = after.getTime() - before.getTime();

        System.out.println(className + ".query method took " + duration + " ms.");

        assertEquals(count.get(), getNumberOfRecords());

        renderSeperator();
    }

    protected JdbcTemplate getJdbcTemplate()
    {
        final JdbcTemplate jdbcTemplate = new JdbcTemplate(getDataSource());
        jdbcTemplate.setFetchSize(Integer.MIN_VALUE);
        return jdbcTemplate;
    }

    protected JdbcTemplate getStreamingResultSetEnabledJdbcTemplate()
    {
        final JdbcTemplate jdbcTemplate = new StreamingResultSetEnabledJdbcTemplate(getDataSource());
        jdbcTemplate.setFetchSize(Integer.MIN_VALUE);
        return jdbcTemplate;
    }
}


The test data is created in the abstract base class AbstractJdbcTemplateTest.


/**
 * Inserts the test data.
 *
 * @author reik.schatz
 */
@ContextConfiguration(locations = "/applicationContext.xml")
public abstract class AbstractJdbcTemplateTest extends AbstractTestNGSpringContextTests
{
    @Autowired
    private DataSource m_dataSource;

    protected DataSource getDataSource()
    {
        return m_dataSource;
    }

    protected String getQuery()
    {
        return "SELECT * FROM rounds";
    }

    @BeforeClass
    protected void setUp()
    {
        System.out.println("\n\n " + getClass().getSimpleName() + ": \n");

        final JdbcTemplate jdbcTemplate = new JdbcTemplate(m_dataSource);

        renderSeperator();
        System.out.println("Dropping table");
        jdbcTemplate.update("DROP TABLE IF EXISTS rounds;");

        System.out.println("Creating table");
        jdbcTemplate.update("CREATE TABLE rounds (round_id INT, player_id INT DEFAULT 0, gaming_center INT DEFAULT 1, last_updated TIMESTAMP DEFAULT CURRENT_TIMESTAMP);");
        jdbcTemplate.update("ALTER TABLE rounds DISABLE KEYS;");
        jdbcTemplate.update("LOCK TABLES rounds WRITE;");

        final Date now = new Date();

        final StringBuilder sb = new StringBuilder();
        final long records = getNumberOfRecords();

        for (int i = 0; i < records; i++)
        {
            // start a new multiple value INSERT every 100,000 rows ...
            if (i % 100000 == 0)
            {
                sb.append("INSERT INTO rounds(round_id) VALUES(" + i + ")");
            }
            else
            {
                sb.append(",(" + i + ")");
            }

            // ... and fire it once it holds 100,000 values (or we hit the last row)
            if (i % 100000 == 99999 || i == (records - 1))
            {
                jdbcTemplate.update(sb.toString());
                sb.setLength(0);

                System.out.println("Inserted " + i + " rows");
            }
        }

        jdbcTemplate.update("UNLOCK TABLES;");
        jdbcTemplate.update("ALTER TABLE rounds ENABLE KEYS;");

        System.out.println("Insertion took " + (new Date().getTime() - now.getTime()) + " ms");

        renderSeperator();
    }

    protected long getNumberOfRecords()
    {
        return 1500000L;
    }

    protected void renderSeperator()
    {
        System.out.println("============================================================");
    }
}


Even though it iterates over only 1,500,000 records, you can already see a difference between JdbcTemplate and StreamingResultSetEnabledJdbcTemplate. Using the StreamingResultSetEnabledJdbcTemplate, the iteration runs for 1526 ms. Using the JdbcTemplate, the iteration runs for 5192 ms. And don't forget, you cannot even use the JdbcTemplate if you have 2, 3 or 4 million records in the ResultSet.


JdbcTemplateTest:

Testing StreamingResultSetEnabledJdbcTemplate
Iterated 500000 rows
Iterated 1000000 rows
Iterated 1500000 rows
StreamingResultSetEnabledJdbcTemplate.query method took 1526 ms.

Testing JdbcTemplate
Iterated 500000 rows
Iterated 1000000 rows
Iterated 1500000 rows
JdbcTemplate.query method took 5192 ms.


Finally, I wrote another unit test. The FastInsertionTest iterates over 16 million rows using the StreamingResultSetEnabledJdbcTemplate. The iteration runs for only 9507 ms. Not bad.

Summary: I made a very, very small change with a gigantic effect. I recommend two things to the Spring development team. First, the setFetchSize method should throw an exception when someone sends in a negative fetch size. Second, future versions of the JdbcTemplate should support negative fetch sizes, so that the StreamingResultSetEnabledJdbcTemplate becomes obsolete. Maybe something is coming with Spring 3.0.
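To make the first suggestion concrete, here is a sketch of the guard I have in mind (hypothetical code, not taken from the Spring sources):

// Hypothetical validation for setFetchSize: fail loudly instead of silently
// ignoring a negative value later in applyStatementSettings.
public void setFetchSize(final int fetchSize)
{
    if (fetchSize < 0)
    {
        throw new IllegalArgumentException("Negative fetch size not supported: " + fetchSize);
    }
    this.fetchSize = fetchSize;
}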

On a side note, each of the two unit tests creates a lot of test data. In the first version, I wrote a for-loop that fired an INSERT statement on every iteration. This was incredibly slow - about 20 seconds for just 100,000 inserts. I checked a few good resources on the web, like the MySQL documentation or this blog post, and refactored my code.

The JdbcTemplateTest now inserts the test data using a MySQL feature called multiple value INSERT, where a single statement carries many rows, e.g. INSERT INTO rounds(round_id) VALUES(0),(1),(2) and so on. Instead of firing an INSERT statement on every iteration of the for-loop, I append a new value to the multiple value INSERT statement and fire it every 100,000 iterations. So for 1.5 million rows, I fire only 15 INSERT statements.


final StringBuilder sb = new StringBuilder();
final long records = getNumberOfRecords();

for (int i = 0; i < records; i++)
{
    // start a new multiple value INSERT every 100,000 rows ...
    if (i % 100000 == 0)
    {
        sb.append("INSERT INTO rounds(round_id) VALUES(" + i + ")");
    }
    else
    {
        sb.append(",(" + i + ")");
    }

    // ... and fire it once it holds 100,000 values (or we hit the last row)
    if (i % 100000 == 99999 || i == (records - 1))
    {
        jdbcTemplate.update(sb.toString());
        sb.setLength(0);

        System.out.println("Inserted " + i + " rows");
    }
}


This runs very fast, as you can see in the test output. The 1,500,000 records are inserted in only 2601 ms.


Dropping table
Creating table
Inserted 99999 rows
Inserted 199999 rows
Inserted 299999 rows
Inserted 399999 rows
Inserted 499999 rows
Inserted 599999 rows
Inserted 699999 rows
Inserted 799999 rows
Inserted 899999 rows
Inserted 999999 rows
Inserted 1099999 rows
Inserted 1199999 rows
Inserted 1299999 rows
Inserted 1399999 rows
Inserted 1499999 rows
Insertion took 2601 ms


The FastInsertionTest uses another MySQL feature: LOAD DATA INFILE. This time, the for-loop in my code builds up a gigantic text file, which I then import into my MySQL database using the LOAD DATA INFILE syntax. One drawback of this approach is that the number of columns in the file must match the number of columns in the table you are inserting into. In other words, you cannot use the DEFAULT feature of a column. In my table rounds, I have 4 columns, but columns 2, 3 and 4 have a DEFAULT value. It would be nice if my file only contained the values for the first column, as the file would be much smaller. This however is not possible, so I have to add values for columns 2, 3 and 4 as well.


final File javaIoTmpDir = SystemUtils.getJavaIoTmpDir();
assertNotNull(javaIoTmpDir);
assertTrue(javaIoTmpDir.exists());

final File dumpFile = new File(javaIoTmpDir, "dump.txt");
if (dumpFile.exists())
{
    assertTrue(dumpFile.delete());
}

Writer out = null;
try
{
    out = new BufferedWriter(new FileWriter(dumpFile));
}
catch (IOException e)
{
    fail();
}
assertNotNull(out);

final long records = getNumberOfRecords();
try
{
    // one tab-separated line per row; all 4 columns must be present in the file
    for (int i = 0; i < records; i++)
    {
        out.write("1");
        out.write('\t');
        out.write("1");
        out.write('\t');
        out.write("1");
        out.write('\t');
        out.write("0000-00-00 00:00:00");
        out.write('\n');
    }
}
catch (IOException e)
{
    fail();
}
finally
{
    try
    {
        out.close();
    }
    catch (IOException e)
    {
        fail();
    }
}

jdbcTemplate.update("LOAD DATA INFILE '" + dumpFile.getPath() + "' INTO TABLE rounds");


As you can see in the test output, the 16,000,000 rows are inserted into the MySQL database in only 24852 ms. Awesome.


FastInsertionTest:
============================================================
Dropping table
Creating table
Inserting 16000000 rows took 24852 ms
============================================================