Product Import with Spring Batch - Part 1

I have two websites which regularly import products belonging to affiliate programs. The websites were developed in 2005 and 2006. In the process of moving the applications from a Windows to a Linux server, I decided to rewrite and modularize a lot of code that was duplicated or performing poorly. For the product import, I selected Spring Batch as the new framework. Spring Batch forces you to write small code chunks instead of huge jobs, which then make up your whole batch. Also, there is a ready-to-use tool for monitoring, and I am familiar with the standard Spring Framework.

Since the full batch might become complex later, I decided to publish my experiences here while I implement it. In this first part, I will use Spring Batch to download a product data file (CSV) that belongs to an affiliate program. To build the application, I use Maven 2. In the version for this part, only five dependencies are needed: Spring Batch of course, commons-io and commons-lang to help with some utility work, JUnit and log4j.


<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">

  <modelVersion>4.0.0</modelVersion>
  <groupId>com.kanzelbahn.utils</groupId>
  <artifactId>product-import</artifactId>
  <packaging>jar</packaging>
  <version>1.0-SNAPSHOT</version>
  <name>Product Import Library</name>

  <properties>
    <spring.batch.version>2.0.3.RELEASE</spring.batch.version>
    <spring.version>2.5.6</spring.version>
    <skipJunitTests>false</skipJunitTests>
  </properties>

  <dependencies>
    <dependency>
      <groupId>commons-io</groupId>
      <artifactId>commons-io</artifactId>
      <version>1.4</version>
    </dependency>
    <dependency>
      <groupId>commons-lang</groupId>
      <artifactId>commons-lang</artifactId>
      <version>2.4</version>
    </dependency>
    <dependency>
      <groupId>org.springframework.batch</groupId>
      <artifactId>spring-batch-core</artifactId>
      <version>${spring.batch.version}</version>
    </dependency>
    <dependency>
      <groupId>org.springframework.batch</groupId>
      <artifactId>spring-batch-test</artifactId>
      <version>${spring.batch.version}</version>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.4</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>log4j</groupId>
      <artifactId>log4j</artifactId>
      <version>1.2.9</version>
    </dependency>
  </dependencies>

  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-surefire-plugin</artifactId>
        <configuration>
          <skipTests>${skipJunitTests}</skipTests>
          <argLine>-Xms128m -Xmx256m -XX:PermSize=128m -XX:MaxPermSize=256m</argLine>
          <testFailureIgnore>false</testFailureIgnore>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <configuration>
          <source>1.5</source>
          <target>1.5</target>
          <debug>false</debug>
          <optimize>false</optimize>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>

pom.xml

One big disadvantage of Spring Batch 2.0 is that you cannot unit test it together with TestNG. The problem lies in the Spring Framework rather than in Spring Batch. When you write a Spring-powered TestNG unit test, you need to extend AbstractTestNGSpringContextTests. There is no Runner to use in the @RunWith annotation, like SpringJUnit4ClassRunner for JUnit. This alone is not a problem, but since your Spring Batch test also needs to extend AbstractJobTests, and you cannot inherit twice, TestNG is out of the loop. This will be fixed in Spring Batch 2.1, because then you will no longer have to extend AbstractJobTests. I did not use version 2.1 because it was not released at the time of writing this post.

Let's have a look at the job configuration.


<beans:beans xmlns="http://www.springframework.org/schema/batch"
             xmlns:beans="http://www.springframework.org/schema/beans"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="
                 http://www.springframework.org/schema/beans
                 http://www.springframework.org/schema/beans/spring-beans-2.0.xsd
                 http://www.springframework.org/schema/batch
                 http://www.springframework.org/schema/batch/spring-batch-2.0.xsd">

  <job id="productImport">
    <step id="step1" next="csv_exists">
      <tasklet ref="initializingTasklet" />
    </step>
    <decision id="csv_exists" decider="doesCsvExistDecision">
      <next on="COMPLETED" to="step3" />
      <next on="FAILED" to="step2" />
    </decision>
    <step id="step2" next="step3">
      <tasklet ref="downloaderTasklet" />
    </step>
    <step id="step3">
      <tasklet ref="finishingTasklet" />
    </step>
  </job>

</beans:beans>

jobs.xinc

As you can see, it has only three (well, three and a half) steps. Step 1 is implemented in the InitializingTasklet, and all it really does is log the start time. The next step, csv_exists, is a decision step. If the CSV file for the current day exists, I move on to Step 3; otherwise Step 2 is invoked. Step 2 is implemented in the DownloaderTasklet, which downloads the CSV file of the current day. Step 3 is implemented in the FinishingTasklet and also performs basic logging. Let's look at the three different Tasklets and the JobExecutionDecider.

The InitializingTasklet for Step 1 is pretty much self-explanatory. On a side note, see how I use FastDateFormat instead of SimpleDateFormat, because the latter is not thread-safe.


import java.util.Date;

import org.apache.commons.lang.time.FastDateFormat;
import org.apache.log4j.Logger;
import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;

/**
 * Performs initialization tasks.
 *
 * @author reik.schatz Dec 11, 2009
 */
public class InitializingTasklet implements Tasklet {

    private static final Logger LOGGER = Logger.getLogger(InitializingTasklet.class);

    private static final FastDateFormat DATE_FORMAT = FastDateFormat.getInstance("yyyy-MM-dd HH:mm:ss");

    public RepeatStatus execute(final StepContribution stepContribution, final ChunkContext chunkContext) throws Exception {
        LOGGER.debug("Initializing at " + DATE_FORMAT.format(new Date()) + ".");
        return RepeatStatus.FINISHED;
    }
}
InitializingTasklet.java

The JobExecutionDecider for the csv_exists decision is implemented in a class called DoesCsvExistDecision. When constructing the bean, you need to specify a CsvFileFacade to handle access to the CSV file. CsvFileFacade is an interface, and my only implementation is the class CsvFileFacadeImpl. The implementation expects an ImportSettings instance upon creation. Everything is wired together by Spring. Using the ImportSettings instance, the CsvFileFacade knows the root directory to which the CSV files are downloaded, the affiliate program id and the location of the CSV file on the Internet. I used the German affiliate program Bakker as a sample implementation.
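To make the wiring concrete, here is a minimal sketch of what the bean definitions could look like in the application context. The bean names, the package, the program id, the directory and the URL are my assumptions, not the project's real configuration:

<bean id="bakker" class="com.kanzelbahn.utils.Bakker">
  <constructor-arg value="4711" />                            <!-- made-up program id -->
  <constructor-arg value="http://example.com/products.csv" /> <!-- made-up data file URL -->
</bean>

<bean id="importSettings" class="com.kanzelbahn.utils.StandardSettings">
  <constructor-arg value="/tmp/import" /> <!-- made-up root directory; Spring converts the String to a File -->
  <constructor-arg ref="bakker" />
</bean>

<bean id="csvFileFacade" class="com.kanzelbahn.utils.CsvFileFacadeImpl">
  <constructor-arg ref="importSettings" />
</bean>

<bean id="doesCsvExistDecision" class="com.kanzelbahn.utils.DoesCsvExistDecision">
  <constructor-arg ref="csvFileFacade" />
</bean>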


import java.io.File;
import java.net.URL;

/**
 * A {@link CsvFileFacade} wraps the handling of csv files.
 *
 * @author reik.schatz Dec 11, 2009
 */
public interface CsvFileFacade {

    File getCsvFile();

    URL getDataFileURL();
}
CsvFileFacade.java


import java.io.File;
import java.net.URL;

import org.apache.log4j.Logger;

/**
 * A {@link CsvFileFacade} which retrieves information about the csv file
 * location from an {@link ImportSettings} instance.
 *
 * @author reik.schatz Dec 11, 2009
 */
public class CsvFileFacadeImpl implements CsvFileFacade {

    private static final Logger LOGGER = Logger.getLogger(CsvFileFacadeImpl.class);

    private final ImportSettings _settings;

    public CsvFileFacadeImpl(final ImportSettings settings) {
        _settings = settings;
    }

    public File getCsvFile() {
        final String fileName = _settings.getImportable().getProgramId() + ".csv";
        final File dailyDirectory = _settings.getDirectory();
        return new File(dailyDirectory, fileName);
    }

    public URL getDataFileURL() {
        return _settings.getImportable().getDataFile();
    }
}
CsvFileFacadeImpl.java


import java.io.File;

/**
 * Wraps all settings for the current import run.
 *
 * @author reik.schatz Dec 13, 2009
 */
public interface ImportSettings {

    /**
     * Gets the directory to which the data file shall be imported.
     *
     * @return File
     */
    File getDirectory();

    /**
     * Returns the {@link Importable} which shall be used.
     *
     * @return Importable
     */
    Importable getImportable();
}
ImportSettings.java


import java.io.File;
import java.io.IOException;
import java.util.Date;

import org.apache.commons.io.FileUtils;
import org.apache.commons.lang.time.FastDateFormat;

/**
 * Encapsulates settings for a single import run.
 *
 * @author reik.schatz Dec 13, 2009
 */
public class StandardSettings implements ImportSettings {

    private final File _rootDirectory;
    private final Importable _importable;

    public StandardSettings(final File importDirectory, final Importable importable) {
        _importable = importable;

        if (importDirectory == null || !importDirectory.exists()) {
            final String path = importDirectory == null ? "" : importDirectory.getPath();
            throw new IllegalArgumentException("Given importDirectory (" + path + ") does not exist.");
        }

        _rootDirectory = importDirectory;
    }

    /** {@inheritDoc} */
    public File getDirectory() {
        final Date now = new Date();
        final FastDateFormat df = FastDateFormat.getInstance("yyyy-MM-dd");
        final String day = df.format(now);

        final File importableDataFileDirectory = new File(_rootDirectory, day);
        if (!importableDataFileDirectory.exists()) {
            try {
                FileUtils.forceMkdir(importableDataFileDirectory);
            } catch (IOException e) {
                throw new IllegalStateException("Unable to create daily import directory (" + day + ")", e);
            }
        }
        return importableDataFileDirectory;
    }

    /** {@inheritDoc} */
    public Importable getImportable() {
        return _importable;
    }
}
StandardSettings.java


import java.net.URL;

/**
 * Represents an importable program.
 *
 * @author reik.schatz Dec 13, 2009
 */
public interface Importable {

    int getProgramId();

    URL getDataFile();
}
Importable.java


import java.io.Serializable;
import java.net.MalformedURLException;
import java.net.URL;

/**
 * An {@link Importable} which wraps all parameters for the German affiliate program Bakker.
 *
 * @author reik.schatz Dec 13, 2009
 */
public class Bakker implements Importable, Serializable {

    private static final long serialVersionUID = 7526472295622776147L;

    private final int _programId;
    private final URL _dataFile;

    public Bakker(final int programId, final String dataFileLocation) {
        _programId = programId;
        try {
            _dataFile = new URL(dataFileLocation);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Given dataFileLocation " + dataFileLocation + " is not a valid URL");
        }
    }

    public int getProgramId() {
        return _programId;
    }

    public URL getDataFile() {
        return _dataFile;
    }

    @Override
    public boolean equals(final Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;

        final Bakker bakker = (Bakker) o;

        return _programId == bakker._programId;
    }

    @Override
    public int hashCode() {
        return _programId;
    }

    @Override
    public String toString() {
        return "Bakker{" +
                "_programId=" + _programId +
                ", _dataFile=" + _dataFile +
                '}';
    }
}
Bakker.java

Every JobExecutionDecider must implement the decide method. I get the CSV file from the CsvFileFacade; it will be a different file depending on the day you run the job and the affiliate program. If the CSV file exists, I return FlowExecutionStatus.COMPLETED, which will invoke Step 3. Otherwise I return FlowExecutionStatus.FAILED, which will invoke Step 2 to download the file.


import java.io.File;

import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.job.flow.FlowExecutionStatus;
import org.springframework.batch.core.job.flow.JobExecutionDecider;

/**
 * Tests for the existence of the csv file in the specified
 * {@link CsvFileFacade}.
 *
 * @author reik.schatz Dec 11, 2009
 */
public class DoesCsvExistDecision implements JobExecutionDecider {

    private final CsvFileFacade _csvFileFacade;

    public DoesCsvExistDecision(final CsvFileFacade csvFileFacade) {
        _csvFileFacade = csvFileFacade;
    }

    public FlowExecutionStatus decide(final JobExecution jobExecution, final StepExecution stepExecution) {
        final File csvFile = _csvFileFacade.getCsvFile();
        if (csvFile.isFile()) {
            return FlowExecutionStatus.COMPLETED;
        } else {
            return FlowExecutionStatus.FAILED;
        }
    }
}
DoesCsvExistDecision.java

The file download is wrapped in the DownloaderTasklet. The Tasklet is again injected with a reference to the CsvFileFacade. Using the facade and FileUtils from commons-io, I download the CSV file and store it physically on disk.


import java.io.File;
import java.io.IOException;
import java.net.URL;

import org.apache.commons.io.FileUtils;
import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;

/**
 * Downloads the csv file.
 *
 * @author reik.schatz Dec 11, 2009
 */
public class DownloaderTasklet implements Tasklet {

    private final CsvFileFacade _csvFileFacade;

    public DownloaderTasklet(final CsvFileFacade csvFileFacade) {
        _csvFileFacade = csvFileFacade;
    }

    public RepeatStatus execute(final StepContribution stepContribution, final ChunkContext chunkContext) throws Exception {
        final File csvFile = _csvFileFacade.getCsvFile();
        final URL location = _csvFileFacade.getDataFileURL();
        try {
            FileUtils.copyURLToFile(location, csvFile);
        } catch (IOException e) {
            throw new IllegalStateException("Unable to download csv file.", e);
        }

        return RepeatStatus.FINISHED;
    }
}
DownloaderTasklet.java

The job ends for now in Step 3, which simply invokes log4j one more time.


import java.util.Date;

import org.apache.commons.lang.time.FastDateFormat;
import org.apache.log4j.Logger;
import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;

/**
 * Contains actions to be done when a Job is finishing.
 *
 * @author reik.schatz Dec 11, 2009
 */
public class FinishingTasklet implements Tasklet {

    private static final Logger LOGGER = Logger.getLogger(FinishingTasklet.class);

    private static final FastDateFormat DATE_FORMAT = FastDateFormat.getInstance("yyyy-MM-dd HH:mm:ss");

    public RepeatStatus execute(final StepContribution stepContribution, final ChunkContext chunkContext) throws Exception {
        LOGGER.debug("Finished at " + DATE_FORMAT.format(new Date()) + ".");

        return RepeatStatus.FINISHED;
    }
}
FinishingTasklet.java

Unit testing could not be easier.


import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;

import org.junit.Test;
import org.junit.runner.RunWith;
import org.springframework.batch.core.ExitStatus;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.test.AbstractJobTests;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.test.context.ContextConfiguration;
import org.springframework.test.context.junit4.SpringJUnit4ClassRunner;
import org.springframework.transaction.annotation.Transactional;

/**
 * Tests the import job.
 *
 * @author reik.schatz Dec 11, 2009
 */
@RunWith(SpringJUnit4ClassRunner.class)
@ContextConfiguration(locations = { "/applicationContext.xml" })
public class ImportJobTest extends AbstractJobTests {

    @Autowired
    private CsvFileFacade _csvFileFacade;

    @Transactional
    @Test
    public void testChain() throws Exception {
        final JobExecution jobExecution = this.launchJob();
        assertEquals(jobExecution.getExitStatus(), ExitStatus.COMPLETED);
        assertTrue(_csvFileFacade.getCsvFile().exists());
    }
}
ImportJobTest.java

You can download the full source code from Google Groups. The zip-archive will also contain the remaining parts of the Spring configuration, which is needed to wire all beans together.

What's new in Maven 3

Yesterday evening I went to a Java event in Stockholm called Java Forum. This was actually the first time I was there, even though the Java Forum meeting is held every month and at different locations in Stockholm. Yesterday's meeting was very interesting for me because it was all about Maven.

The first presentation was held by Jason van Zyl and was all about the new features in Maven 3. Jason is sort of the brain behind Maven. He is CTO at Sonatype, the Maven company.

The second talk was given by Dennis Lundberg, a guy who helps develop a lot of Maven plug-ins. His talk was about the Maven Site plug-in. Dennis spoke in Swedish, which was fine by me. However, since Jason was present, I think it would have been more polite if he had spoken English too - especially since he had an (unanswered) Maven 3 question which Jason maybe could have answered. Finally Jason took over again and talked about his idea of a next-generation infrastructure built up from Maven, M2Eclipse, Nexus and Hudson. Unfortunately there was not enough time left to talk about Hudson, so he could only cover M2Eclipse and Nexus.

M2Eclipse looked promising, as it will be the first working Maven integration plug-in for the Eclipse IDE. It won't be out until January 2010, however. Not that I care much, since I am using IntelliJ, but they showed some nice features in M2Eclipse. One of them is some extra XML metadata in the Maven POM which is only picked up by M2Eclipse and ignored by command-line Maven. This metadata will speed up the builds in Eclipse when using M2Eclipse; a build which took minutes before can run in seconds that way. M2Eclipse will download all sources automatically - this isn't new - but it can also, with a single click, create a new project for any of your dependencies. This can be helpful if your project has a dependency and you need to patch something in that dependency very quickly.

Anyway, for me the most interesting presentation was the one about Maven 3 - which will not be released until next year. So what are the big improvements?

Probably the biggest visible change is the polyglot POM support. You can now write your POM files in different languages; Jason had examples for Groovy, Yaml and Raven. I found some sneak previews here and here. It was a little bit funny when Jason had presented the polyglot feature of Maven 3 and then asked the audience if anyone thought the original XML format was annoying. Not one of the 80 people thought it was annoying. So maybe this feature will not be used very often.

Those of you who have worked in multi-module or multi-pom projects in Maven 2 might have asked yourselves why you have to specify the parent version in every sub-module. Maven 3 will remove this redundancy and add version-less parent elements.
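For illustration, here is the redundancy in question, sketched with made-up module names; the version inside the parent element is exactly what Maven 3 will let you drop:

<!-- pom.xml of a sub-module in Maven 2 -->
<parent>
  <groupId>com.example</groupId>
  <artifactId>example-parent</artifactId>
  <version>1.0-SNAPSHOT</version> <!-- must be repeated in every sub-module today -->
</parent>
<artifactId>example-core</artifactId>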

Another big problem in Maven 2 is finding out, for an effective POM, which dependency or POM supplied which artifact to the final outcome. Maven 3 will address this, and it will be easier to see who contributed which artifact. In connection with M2Eclipse it will then be possible for the developer to deselect a certain contribution and select another one instead. All this is only possible because Maven 3 decouples the execution plan from the execution: your POM defines an execution plan which is then brought to execution, and users can make changes to the execution plan before execution. In general, Maven 3 will come with a lot of extension points, which can be used to statically and dynamically alter the POM. This can, for example, be leveraged by companies who keep their dependencies and versions in different formats but want to use parts of Maven 3.

Extension points seem to be the next big thing in Maven 3. This is actually something Jason confirmed they stole from Eclipse. Instead of sub-classing a plug-in, like you would do in Maven 2, developers can hook into different extension points to alter the plug-in behavior. For instance, you might have an extension point to alter the way the web.xml is processed by the WAR plug-in. You no longer have to inherit to get customized behavior.

Error messages will not be as cryptic anymore. Most of the 40+ error messages will come with a link to the Maven 3 wiki where the error is explained in detail. A state-of-the-art Maven 3 client is currently being developed by the Jetty people; they have created their own asynchronous HTTP client library which will be used in the M3 client. Internally, the new Maven uses a micro OSGi container, but only for classloading and bundle management. The Maven 3 source code uses Google Guice for dependency injection and a library called Peaberry which extends Guice with OSGi capabilities.

Finally, the whole dependency resolution is being refactored by Sonatype into a standalone product. The software will be called Mercury, and Maven 3 will be a client that uses Mercury. Software companies and developers might as well use Mercury to integrate dependency resolution into their own solutions.

I got some good impressions of Maven 3. It is not so different on the surface, except for the polyglot stuff and the extension points; a lot is happening under the hood. Maven 3 is fully backwards compatible with Maven 2. It runs much faster though, and the code base is a third smaller.

I liked Jason's two presentations in Stockholm. Too bad that time was running out. I would have liked to see more Maven 3 in action from the command line or some extension point code examples in Eclipse.

Pimp ma JDBC ResultSet

Last week I had to work on an interesting problem. My team was working on some sort of reporting application which creates csv-like reports based on JDBC ResultSets returned from querying a database. Earlier this year, I refactored some code sections which created the report files to make them testable by unit tests. Since the application was already using Spring 2.5, I decided to refactor the plain old JDBC code and use the Spring JdbcTemplate instead.

My unit tests passed and I was happy. The code never went live though, as other stuff got a higher priority. After a while the application was put on the release schedule, but (of course while I was away from work) system verification found a big problem. When creating the report, the application crashed with an OutOfMemoryError. Another developer started looking at my refactored code. First of all, I was not aware that the reports could contain millions of rows; the one report that caused the OutOfMemoryError had 16 million rows. It was pretty naive of me to use the query method of SimpleJdbcTemplate, passing a ParameterizedRowMapper as argument. Obviously the returned list would contain 16 million entries and never fit into memory.
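For illustration, here is a minimal sketch of that memory-hungry pattern; the surrounding class is made up and the query is borrowed from the test further down. SimpleJdbcTemplate.query materializes the complete result as a List before returning:

import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.List;

import org.springframework.jdbc.core.simple.ParameterizedRowMapper;
import org.springframework.jdbc.core.simple.SimpleJdbcTemplate;

public class NaiveReportQuery
{
    // Sketch: the whole result set comes back as one List -
    // with 16 million rows this alone blows the heap.
    public List<Long> loadAllRows(final SimpleJdbcTemplate simpleJdbcTemplate)
    {
        return simpleJdbcTemplate.query("SELECT round_id FROM rounds",
                new ParameterizedRowMapper<Long>()
                {
                    public Long mapRow(final ResultSet rs, final int rowNum) throws SQLException
                    {
                        return rs.getLong("round_id");
                    }
                });
    }
}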

Since I was not in the office, the developer who looked at my code wrote me a mail. I don't remember the exact words, but he asked me if I had a particular reason to use SimpleJdbcTemplate instead of the old code. I felt challenged. Of course it would be madness not to use JdbcTemplate or SimpleJdbcTemplate in a Spring-powered application. However, he had one good argument - the old code worked! I started investigating how to achieve the same performance using only Spring classes. I suggested using the query method of the JdbcTemplate instead. When using this method, you have the opportunity to supply a RowCallbackHandler as argument. The processRow method of the RowCallbackHandler is then invoked for every row in the ResultSet, and we could directly write a line to our report.
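A minimal sketch of that callback approach, again with made-up names: each row is written to the report as soon as it is read, so no list of row objects ever builds up.

import java.io.PrintWriter;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.jdbc.core.RowCallbackHandler;

public class StreamingReportWriter
{
    public void writeReport(final JdbcTemplate jdbcTemplate, final PrintWriter report)
    {
        jdbcTemplate.query("SELECT round_id FROM rounds", new RowCallbackHandler()
        {
            // invoked once per row; the row goes straight to the report file
            public void processRow(final ResultSet rs) throws SQLException
            {
                report.println(rs.getLong("round_id"));
            }
        });
    }
}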

We changed the code once more. The unit test still ran. However, we quite soon discovered that we had not really fixed the main issue. Even though it could handle more records now, it would still fail with an OutOfMemoryError. Instead of building up a huge List as before, it created a huge ResultSet in memory. Another big problem became apparent: processing the rows was now very slow. A report with 16 million rows which took 7 minutes to create before would now be created in 90 minutes. Now I felt really challenged! I did not want to go back to the old code and use good 'ol plain JDBC again.

I downloaded the Spring source code and compared our previous implementation with the way we run now. Soon I found the problem. The old code created something which I call a streaming ResultSet. This was done by specifying the flags java.sql.ResultSet.TYPE_FORWARD_ONLY and java.sql.ResultSet.CONCUR_READ_ONLY in the createStatement method of the Connection and also specifying a fetch size of Integer.MIN_VALUE. I compared this with what the JdbcTemplate was doing. Spring also used the createStatement method of the Connection class, but without specifying extra flags. This was fine, since TYPE_FORWARD_ONLY and CONCUR_READ_ONLY are used by default. The JdbcTemplate also had a setFetchSize method, cool. However, by looking at the source, I saw that it would completely ignore negative fetch sizes. This is pretty bad. I think it would be much nicer to throw an exception here, since a client calling setFetchSize with a negative value will not know that the fetch size is ignored. On the other hand, it was easy enough to create a new subclass which allows negative fetch sizes. I called it StreamingResultSetEnabledJdbcTemplate.
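For reference, here is roughly what the old plain-JDBC code did (a sketch, not the original code). With the MySQL driver, this combination of flags and fetch size is what switches the driver to row-by-row streaming:

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public final class StreamingStatements
{
    // TYPE_FORWARD_ONLY + CONCUR_READ_ONLY plus a fetch size of Integer.MIN_VALUE
    // tells the MySQL driver to stream rows instead of buffering the whole ResultSet.
    public static Statement createStreamingStatement(final Connection connection) throws SQLException
    {
        final Statement stmt = connection.createStatement(
                ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
        stmt.setFetchSize(Integer.MIN_VALUE);
        return stmt;
    }
}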


import java.sql.SQLException;
import java.sql.Statement;

import javax.sql.DataSource;

import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.jdbc.datasource.DataSourceUtils;

/**
 * A {@link JdbcTemplate} which makes it possible to mimic streaming ResultSets by allowing negative fetch sizes
 * to be set on the {@link Statement}.
 *
 * @author reik.schatz
 */
public class StreamingResultSetEnabledJdbcTemplate extends JdbcTemplate
{
    public StreamingResultSetEnabledJdbcTemplate(final DataSource dataSource)
    {
        super(dataSource);
    }

    public StreamingResultSetEnabledJdbcTemplate(final DataSource dataSource, final boolean lazyInit)
    {
        super(dataSource, lazyInit);
    }

    /**
     * Prepare the given JDBC Statement (or PreparedStatement or CallableStatement),
     * applying statement settings such as fetch size, max rows, and query timeout.
     * Unlike in {@link JdbcTemplate} you can also specify a negative fetch size.
     *
     * @param stmt the JDBC Statement to prepare
     * @throws java.sql.SQLException if thrown by JDBC API
     * @see #setFetchSize
     * @see #setMaxRows
     * @see #setQueryTimeout
     * @see org.springframework.jdbc.datasource.DataSourceUtils#applyTransactionTimeout
     */
    @Override
    protected void applyStatementSettings(final Statement stmt) throws SQLException
    {
        // unlike the base class, apply the fetch size even when it is negative
        int fetchSize = getFetchSize();
        stmt.setFetchSize(fetchSize);

        int maxRows = getMaxRows();
        if (maxRows > 0) {
            stmt.setMaxRows(maxRows);
        }
        DataSourceUtils.applyTimeout(stmt, getDataSource(), getQueryTimeout());
    }
}


Using my new class killed all the issues we had. Memory was not a problem anymore and the speed was back. One drawback of my solution is that some methods like isLast or isFirst are no longer supported on the ResultSet. If your code invokes them on a ResultSet created using the described approach, an exception is thrown.

To come up with some numbers, I created a simple test project using Maven. Feel free to download and test for yourself. You need a local MySQL database, a schema and a database user who can write to this schema. Download the zip file and extract to any directory. Go in src/main/resources and apply the correct database settings in applicationContext.xml. After that open a command prompt, go into the directory where you extracted the zip file to and run mvn test. Maven 2 must be installed of course.

This will run two TestNG unit tests. The first test is called JdbcTemplateTest. The test creates 1.5 million rows in the MySQL database and executes the same retrieval code, first using a StreamingResultSetEnabledJdbcTemplate, then using a JdbcTemplate. I could not write the unit test with more records, as you would hit an OutOfMemoryError for JdbcTemplate otherwise. Here is the JdbcTemplateTest.


import static org.testng.Assert.assertEquals;

import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Date;
import java.util.concurrent.atomic.AtomicLong;

import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.jdbc.core.RowCallbackHandler;
import org.testng.annotations.Test;

/**
 * Tests and measures the {@link JdbcTemplate} and {@link StreamingResultSetEnabledJdbcTemplate}.
 *
 * @author reik.schatz
 */
public class JdbcTemplateTest extends AbstractJdbcTemplateTest
{
    @Test(groups = "unit")
    public void testRun()
    {
        runTestUsingTemplate(getStreamingResultSetEnabledJdbcTemplate());
        runTestUsingTemplate(getJdbcTemplate());
    }

    private void runTestUsingTemplate(final JdbcTemplate jdbcTemplate)
    {
        final String selectStatement = getQuery();

        final AtomicLong count = new AtomicLong();

        final Date before = new Date();

        final String className = jdbcTemplate.getClass().getSimpleName();
        System.out.println("Testing " + className);

        jdbcTemplate.query(selectStatement, new RowCallbackHandler()
        {
            public void processRow(ResultSet resultSet) throws SQLException
            {
                final long i = count.incrementAndGet();
                if (i % 500000 == 0) System.out.println("Iterated " + i + " rows");
            }
        });

        final Date after = new Date();
        final long duration = after.getTime() - before.getTime();

        System.out.println(className + ".query method took " + duration + " ms.");

        assertEquals(count.get(), getNumberOfRecords());

        renderSeperator();
    }

    protected JdbcTemplate getJdbcTemplate()
    {
        final JdbcTemplate jdbcTemplate = new JdbcTemplate(getDataSource());
        jdbcTemplate.setFetchSize(Integer.MIN_VALUE);
        return jdbcTemplate;
    }

    protected JdbcTemplate getStreamingResultSetEnabledJdbcTemplate()
    {
        final JdbcTemplate jdbcTemplate = new StreamingResultSetEnabledJdbcTemplate(getDataSource());
        jdbcTemplate.setFetchSize(Integer.MIN_VALUE);
        return jdbcTemplate;
    }
}


The test data is created in the abstract base class AbstractJdbcTemplateTest.


import java.util.Date;

import javax.sql.DataSource;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.test.context.ContextConfiguration;
import org.springframework.test.context.testng.AbstractTestNGSpringContextTests;
import org.testng.annotations.BeforeClass;

/**
 * Inserts the test data.
 *
 * @author reik.schatz
 */
@ContextConfiguration(locations = "/applicationContext.xml")
public abstract class AbstractJdbcTemplateTest extends AbstractTestNGSpringContextTests
{
    @Autowired
    private DataSource m_dataSource;

    protected DataSource getDataSource()
    {
        return m_dataSource;
    }

    protected String getQuery()
    {
        return "SELECT * FROM rounds";
    }

    @BeforeClass
    protected void setUp()
    {
        System.out.println("\n\n " + getClass().getSimpleName() + ": \n");

        final JdbcTemplate jdbcTemplate = new JdbcTemplate(m_dataSource);

        renderSeperator();
        System.out.println("Dropping table");
        jdbcTemplate.update("DROP TABLE IF EXISTS rounds;");

        System.out.println("Creating table");
        jdbcTemplate.update("CREATE TABLE rounds (round_id INT, player_id INT DEFAULT 0, gaming_center INT DEFAULT 1, last_updated TIMESTAMP DEFAULT CURRENT_TIMESTAMP);");
        jdbcTemplate.update("ALTER TABLE rounds DISABLE KEYS;");
        jdbcTemplate.update("LOCK TABLES rounds WRITE;");

        final Date now = new Date();

        final StringBuilder sb = new StringBuilder();
        final long records = getNumberOfRecords();

        for (int i = 0; i < records; i++)
        {
            if (i % 100000 == 0)
            {
                sb.append("INSERT INTO rounds(round_id) VALUES(" + i + ")");
            }
            else
            {
                sb.append(",(" + i + ")");
            }

            if (i % 100000 == 99999 || i == (records - 1))
            {
                jdbcTemplate.update(sb.toString());
                sb.setLength(0);

                System.out.println("Inserted " + i + " rows");
            }
        }

        jdbcTemplate.update("UNLOCK TABLES;");
        jdbcTemplate.update("ALTER TABLE rounds ENABLE KEYS;");

        System.out.println("Insertion took " + (new Date().getTime() - now.getTime()) + " ms");

        renderSeperator();
    }

    protected long getNumberOfRecords()
    {
        return 1500000L;
    }

    protected void renderSeperator()
    {
        System.out.println("============================================================");
    }
}


Even though it is only iterating 1.5 million records, you can already see a difference between JdbcTemplate and StreamingResultSetEnabledJdbcTemplate. Using the StreamingResultSetEnabledJdbcTemplate the iteration runs for 1526 ms; using the JdbcTemplate it runs for 5192 ms. Don't forget, you cannot even use the JdbcTemplate if you have 2, 3 or 4 million records in the ResultSet.


JdbcTemplateTest:

Testing StreamingResultSetEnabledJdbcTemplate
Iterated 500000 rows
Iterated 1000000 rows
Iterated 1500000 rows
StreamingResultSetEnabledJdbcTemplate.query method took 1526 ms.

Testing JdbcTemplate
Iterated 500000 rows
Iterated 1000000 rows
Iterated 1500000 rows
JdbcTemplate.query method took 5192 ms.


Finally I wrote another unit test. The FastInsertionTest iterates 16 million rows using StreamingResultSetEnabledJdbcTemplate. The iteration runs for only 9507 ms. Not bad.

Summary: I made a very small change with a gigantic effect. I recommend two things to the Spring development team. First, the setFetchSize method should throw an exception when someone sends in a negative fetch size. Second, future Spring versions of the JdbcTemplate should enable the use of negative fetch sizes, so that the StreamingResultSetEnabledJdbcTemplate becomes obsolete. Maybe something is coming with Spring 3.0.

On a side note, each of the two unit tests creates a lot of test data. In the first version, I wrote a for-loop that fired an Insert statement on every iteration. This was incredibly slow, about 20 seconds for just 100,000 inserts. I checked a few good resources on the web, like the MySQL documentation or this blog post, and refactored my code.

The JdbcTemplateTest now inserts the test data using a MySQL feature called multiple-value Insert. Instead of firing an Insert statement on every iteration of the for-loop, I append a new value to the multiple-value Insert statement. Then, every 100,000 iterations, I fire the statement. So for 1.5 million rows, I fire only 15 Insert statements.


final StringBuilder sb = new StringBuilder();
final long records = getNumberOfRecords();

for (int i = 0; i < records; i++)
{
    if (i % 100000 == 0)
    {
        sb.append("INSERT INTO rounds(round_id) VALUES(" + i + ")");
    }
    else
    {
        sb.append(",(" + i + ")");
    }

    if (i % 100000 == 99999 || i == (records - 1))
    {
        jdbcTemplate.update(sb.toString());
        sb.setLength(0);

        System.out.println("Inserted " + i + " rows");
    }
}


This runs very fast, as you can see in the test output. The 1.5 million records are inserted in only 2601 ms.


Dropping table
Creating table
Inserted 99999 rows
Inserted 199999 rows
Inserted 299999 rows
Inserted 399999 rows
Inserted 499999 rows
Inserted 599999 rows
Inserted 699999 rows
Inserted 799999 rows
Inserted 899999 rows
Inserted 999999 rows
Inserted 1099999 rows
Inserted 1199999 rows
Inserted 1299999 rows
Inserted 1399999 rows
Inserted 1499999 rows
Insertion took 2601 ms


The FastInsertionTest uses another feature of MySQL called INFILE insertion. This time, the for-loop in my code builds up a gigantic text file which I then import into my MySQL database using the LOAD DATA INFILE syntax. One drawback of this approach is that the number of columns in the file must match the columns in the table you are inserting into. In other words, you cannot use the DEFAULT feature of a column. In my table rounds, I have 4 columns, but columns 2, 3 and 4 have a DEFAULT value. It would be nice if my file contained only the first column's values, as the file would be much smaller in that case. This however is not possible; I have to add values for columns 2, 3 and 4 as well.


final File javaIoTmpDir = SystemUtils.getJavaIoTmpDir();
assertNotNull(javaIoTmpDir);
assertTrue(javaIoTmpDir.exists());

final File dumpFile = new File(javaIoTmpDir, "dump.txt");
if (dumpFile.exists())
{
    assertTrue(dumpFile.delete());
}

Writer out = null;
try
{
    out = new BufferedWriter(new FileWriter(dumpFile));
}
catch (IOException e)
{
    fail();
}
assertNotNull(out);

final long records = getNumberOfRecords();
try
{
    for (int i = 0; i < records; i++)
    {
        out.write("1");
        out.write('\t');
        out.write("1");
        out.write('\t');
        out.write("1");
        out.write('\t');
        out.write("0000-00-00 00:00:00");
        out.write('\n');
    }
}
catch (IOException e)
{
    fail();
}
finally
{
    out.close();
}

jdbcTemplate.update("LOAD DATA INFILE '" + dumpFile.getPath() + "' INTO TABLE rounds");


As you can see in the test output, the 16 million rows are inserted into the MySQL database in only 24852 ms. Awesome.


FastInsertionTest:
============================================================
Dropping table
Creating table
Inserting 16000000 rows took 24852 ms
============================================================

Parsing Tomcat Access Log for 404 Errors

Yesterday I set up a new dedicated server for a couple of domains. I have Apache with mod_proxy running in front of a Tomcat. It was pretty easy to set up. Since these are quite old domains, I have not really worked with them in a while. I was curious whether these websites get a lot of 404 errors. I came up with a nice-looking Linux command, something I remembered from my current job.

Given that you have logging enabled in your Tomcat server.xml configuration, probably like this:



<Valve className="org.apache.catalina.valves.FastCommonAccessLogValve" directory="/etc/tomcats/logs" prefix="access." suffix=".log" pattern="common" resolveHosts="false" />



Prefix and suffix could be different of course, but that does not matter. This will create a daily log file like access.2009-09-24.log. Now, to get a nice overview and detect 404 errors quickly, run this command:



cat access.2009-09-24.log | cut -f 7,8,9 -d \ | sort | uniq -c | sort -gr



Here are the details. First you display the file contents using cat. This is piped through the cut command; -f 7,8,9 -d \ specifies that you are interested in fields 7, 8 and 9 and that the delimiter is a whitespace. The syntax for the whitespace delimiter only works that way because another pipe follows. The sort applies alphabetical sorting. The next pipe is uniq -c, which eliminates duplicates but also adds a count for each unique row. Finally, sort -gr applies numerical sorting based on the result of uniq -c, in reverse order, putting the highest number first. Here is some sample output:



6 /includes/css/schufafreie.css HTTP/1.1" 200
6 /images/spacer.gif HTTP/1.1" 200
6 /images/linksline.gif HTTP/1.1" 200
6 /images/banner_oben.jpg HTTP/1.1" 200
6 /favicon.ico HTTP/1.1" 404
5 /includes/js/schufafreie.js HTTP/1.1" 200
5 /images/pfeil_r_grau.gif HTTP/1.1" 200
5 / HTTP/1.1" 200
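If you only care about the 404s, you can filter the overview once more. Since the status code is the last field of each line, appending a grep does it:

cat access.2009-09-24.log | cut -f 7,8,9 -d \ | sort | uniq -c | sort -gr | grep '404$'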

Loving Maven Webapp Overlays and Jetty Plugin

Alright, maybe this is so basic for most of you that I should not blog about it, but yesterday I found out about a Maven feature which I really like. I guess most of you are familiar with the so-called "multi-pom" or "multi-module" projects in Maven. This is a Maven project structure where you have one root pom.xml file defined as parent and then a couple of sub-projects, each with its own pom.xml. Each of the sub-projects can then be used to create its own build artifact. This is a nice way to separate the logic of an application into small and reusable deployment artifacts.

Let's say you are writing a standard web application. To maintain the website content, you have also added some JSP files and classes which function as a "semi-CMS". Additionally you have written a nice database access layer and some useful helper classes for Spring. The straightforward approach would be to create one single Maven project of archetype maven-archetype-webapp. This will produce a WAR file as deployment artifact and you are fine. However, it would be much nicer to have separate deployment artifacts for better maintainability and reusability. This could be: one JAR file (A) containing all the Spring helper and database classes, one WAR file (B) containing the CMS part, and one WAR file (C) containing the real web application but also referencing the other two deploy artifacts. This setup would have been very easy and common if A and B both produced JAR deploy artifacts. Fortunately, it is also possible to reference one Maven project that produces a WAR file from another Maven project which also produces a WAR file. This is then called a WAR overlay. For those interested in the source code, here is a very basic prototype.

Given the above scenario, all you really have to do is add a dependency in your project C on project B like this:



<dependency>
  <groupId>javasplitter</groupId>
  <artifactId>webappB</artifactId>
  <version>1.0-SNAPSHOT</version>
  <type>war</type>
  <scope>runtime</scope>
</dependency>



This will merge the folder structure of your webappB into the webappC folder structure. The final WAR file will then be a merge of all static resources (images, JSPs, CSS, JavaScript) of the two combined WAR files from webappB and webappC. It will also contain all classes of the two Maven projects in the WEB-INF/classes directory of the final WAR. However, in a real-life project I had the problem that if a class within webappC used a class from webappB, these classes were no longer found. I fixed this by adding an additional dependency in webappC like this:



<dependency>
  <groupId>javasplitter</groupId>
  <artifactId>webappB</artifactId>
  <version>1.0-SNAPSHOT</version>
  <type>jar</type>
  <scope>provided</scope>
</dependency>



This might not be the preferred way to fix this, but it worked for me. Now for the part why I like the above setup so much and blog about it. The Maven Jetty Plugin gives you the opportunity to immediately test your web application within a container. When you run mvn jetty:run, it starts a Jetty that loads your web application. All changes to static resources (JSP files, images, CSS, JavaScript etc.) are immediately visible when you reload the page in your browser. If your IDE project is set up to compile classes to target/classes (which will be the case if you use IntelliJ and the Maven project type), the web application context reloads automatically when you recompile a single class. You can define how often the Maven Jetty Plugin scans the classpath for changes before reloading. Given all this, you can do some real rapid development without long build-deploy cycles between code changes. In my previous setup I had used the Maven Cargo Plugin instead and had it deploy the WAR file into a running Tomcat somewhere else on my computer. This was a big problem, as every time I changed a single character in one of my JSP files, I had to rebuild and redeploy the WAR file. I lost a lot of time.

Here is how I have configured the Maven Jetty Plugin in webappC:



<build>
  <plugins>
    <plugin>
      <groupId>org.mortbay.jetty</groupId>
      <artifactId>maven-jetty-plugin</artifactId>
      <version>6.1.12</version>
      <configuration>
        <scanIntervalSeconds>2</scanIntervalSeconds>
        <webAppConfig>
          <contextPath>/</contextPath>
          <baseResource implementation="org.mortbay.resource.ResourceCollection">
            <resourcesAsCSV>src/main/webappC,../cms/src/main/webappB</resourcesAsCSV>
          </baseResource>
        </webAppConfig>
      </configuration>
    </plugin>
  </plugins>
</build>



Some things are worth mentioning. I set the classpath scan interval to 2 seconds. I use the root context to reach my web application in the browser. You have to pass the two webapp directories containing the static resources as a Resource in the webAppConfig element. The order in which you do this can be important. I have a Servlet in webappC which is loaded on startup and reads a file path out of the ServletConfig (ServletContext). If I had webappB before webappC in the above example, it would load my Servlet using the webappB ServletConfig, which was a big problem because all the file paths were wrong that way. Finally, note the resourcesAsCSV element. The documentation of the Maven Jetty Plugin tells you to use just a resources element, but this will not work properly. You will end up with an error similar to Cannot assign configuration entry 'resources' to 'class [Lorg.mortbay.resource.Resource;' - so use resourcesAsCSV instead.

I would also like to add that developing a Grails web application using Maven gives you an even faster rapid development experience. I used the Maven Grails Plugin for one project, which also uses a Jetty (mvn grails:run-app) to test the deployment artifact. This Jetty, however, was able to detect class changes automatically and much faster. I no longer had to compile manually in IntelliJ IDEA; just saving the modified source file in IntelliJ would immediately update my web application context and the changes were visible. I have not checked how this behavior is implemented, but obviously some very smart Grails people came up with a great idea.

Where to host your Grails application

As I wrote recently, I have joined the Grails and Groovy crowd. I finished my first simple web application, which was actually more than just a local prototype, but I needed to host it somewhere. As with all Java-based web applications, it is much harder to find a good and cheap hosting provider than it is with PHP. So I started to look for a Grails hosting provider and ended up where every Grails developer will be looking sooner or later - this page.

I scrolled down the list and nothing really suited me. I have to admit that I stopped reading carefully after a couple of hosting providers. Since this was just my own web application to play with, I did not want to spend much on hosting every month. I did another search on Google and came across this German blog entry. The author basically recommends 3 different hosting providers for Grails applications: eatj.com, javaprovider.net and mor.ph. Out of these he liked mor.ph the most, so I decided to try them first.

Unfortunately I could not find anything related to hosting on their website. I have no idea what they are offering, but web hosting did not jump off the screen at me. Next I tried javaprovider.net. This provider has a 30-day trial offer, where you can basically test your web application for free. Exactly what I wanted. You choose between a shared Tomcat and a private Tomcat. In the private Tomcat you have 32MB heap memory; in the shared version the heap memory is shared with other applications. I signed up for a shared Tomcat trial account. First of all, the trial account at javaprovider.net is not entirely free anymore; I had to pay $0.50 when signing up. This is to scare idiots off. The account was created immediately. However, they set up a private Tomcat account even though I wanted a shared one. The private account plan would cost me $9.99 every month - way too much for my little toy website. I selected some stupid subdomain, like tv3.javaprovider.net - it will probably not work anymore when you read this. I entered the MySQL details into the production environment block of my DataSource.groovy file.


dataSource {
    pooled = false                          
    driverClassName = "org.hsqldb.jdbcDriver" 
    username = "sa"
    password = ""    
}
environments {
    development {
        dataSource {
            dbCreate = "create-drop" // one of 'create', 'create-drop', 'update'
            url = "jdbc:hsqldb:mem:devDB"
        }
    }   
    test {
        dataSource {
            dbCreate = "update"
            url = "jdbc:hsqldb:mem:testDb"
        }
    }   
    production {
        dataSource {
            pooled = true                          
            driverClassName = "com.mysql.jdbc.Driver"
            dbCreate = "update"
            url = "...." <--- Enter here
            username = "...." <--- Enter here
            password = "...." <--- Enter here
        }
    }
}


Then I packaged everything again using the Maven 2 Grails plugin.


mvn package -Dgrails.env=production


For some reason the WAR file was not built when I ran mvn grails:package -Dgrails.env=production

Finally I deployed the WAR file into my javaprovider.net Tomcat. It worked out of the box! However, as I started to click around for a bit, it stopped working. I checked my Tomcat logfiles and saw a nasty OutOfMemoryError. Obviously 32MB heap is way too little, and this was even the better account plan at javaprovider.net. I opened a ticket and asked them about Grails hosting. They got back to me after a day, saying that I needed at least 256MB heap for running a Grails application and that it would cost me around $30/month. This is simply not true! Read on.

While waiting for the ticket reply, I signed up with the last Grails hosting provider mentioned in the original blog post - eatj.com. They have a free trial account too, great for testing. The trial account even has 64MB heap. However, their cheapest commercial offer is $9.85/month, so I knew from the start that I would not use them. To make a long story short, I used their MySQL details, repackaged, deployed, and it worked fine without any memory issues. Now I knew at least that 64MB heap would be enough for my Grails app.

In search of a cheaper alternative to host a Grails web application, I saw an offer from a German hosting provider called Netcup. They are offering a vserver with 100MB of guaranteed memory. The price is very low: you pay a one-time fee of 10€, which is about $14, and then 1.69€ per month ($2.40). Since it is a vserver, you have to install everything yourself: Java, Tomcat, MySQL etc. When you order the account, they set up a Debian Etch for you, including pre-installed Apache and MySQL. I used this image for a bit, but their /etc/apt/sources.list file was pretty limited and I struggled to install Java. I switched to Ubuntu Hardy, which I knew. You can switch between images with a few clicks in a web console. Their Ubuntu image also had a weird sources.list file, but I changed it the way I had it locally and it worked. I installed Java, MySQL and Tomcat 6 and deployed my Grails web application. It worked like a charm.

You get a static IP address, which is enough for now. No domain needed for testing. You can check it out here. Log in with (reik/schatz). I am watching tons of TV series like Lost, Heroes, True Blood, 24, Prison Break, Friday Night Lights, Supernatural, Dexter etc. I kept forgetting which episode I saw last, so I wrote this little Grails application where you can save the last episode you have seen.

For me the netcup.de vserver is the best option if you want to host a Java-based web application using Grails and Groovy. Here is a comparison of the annual prices including one-time fees:


  • eatj.com Basic: $98.50

  • javaprovider.net Private: $119.88

  • mor.ph ?

  • netcup.de vserver Aluminium: 30.28€ ($42.71)

JSON in Grails with nested collections

I have totally fallen in love with Grails and Groovy. I am probably the last Java developer in the world to try out Groovy and Grails, but it has never been my top priority. So I started to write a little web application for myself, which I will later host on Google App Engine.

One common scenario is that you have nested domain classes. In my case, I have a Show class. A Show is basically a TV series like Lost or Desperate Housewives. Each Show has a list of Seasons. These seasons are strictly ordered; Lost has Seasons 1, 2, 3, 4 and 5 so far. Each Season has a number of Episodes, which are also strictly ordered. Then finally each Episode can be aired in different formats like HDTV, 720p or Standard. This makes up a nice class hierarchy of 1:n relations.




class Show {

    String name

    static hasMany = [ seasons : Season ]

    static mapping = {
        table 'ct_show'
    }

    static constraints = {
        name(size:1..100, blank:false)
    }
}

class Season {

    Integer seasonNumber

    static hasMany = [ episodes : Episode ]

    static mapping = {
        table 'ct_season'
    }

    static constraints = {
        seasonNumber(size:1..100, blank:false)
    }
}

class Episode {

    Integer episodeNumber

    static hasMany = [ formats : Format ]

    static mapping = {
        table 'ct_episode'
    }

    static constraints = {
        episodeNumber(size:1..100, blank:false)
    }
}

class Format {

    String name

    static mapping = {
        table 'ct_format'
    }

    static constraints = {
        name(size:1..100, blank:false)
    }
}




Now I had the typical use case that when you use Ajax within your web application, you want the Controller classes to return your domain objects as JSON. In Grails this is easy to accomplish; just read this example. However, I had the big problem that my domain class hierarchy heavily uses nested collections. It will not work when you just do this:



render Season.get(1) as JSON



In this blog post, the last example offers a solution to my problem, but it is a rather static example. The idea is to use my root object (Season in my particular case) to produce a JSON string in the format I want and then return it via render as JSON. Building up this String would be quite a big chunk of code in Java, probably involving a StringBuilder, but Groovy has this great collect method for Collections. Here is the Groovy code. Try this in Java.



def selectedSeason = Season.get(1)
def seasonJSON = [
    id: selectedSeason.id,
    seasonNumber: selectedSeason.seasonNumber,
    episodes: selectedSeason.episodes.collect {
        [
            id: it.id,
            episodeNumber: it.episodeNumber,
            formats: it.formats.collect {
                [
                    id: it.id,
                    name: it.name
                ]
            }
        ]
    }
]

render seasonJSON as JSON

Struggling with Servlet 3.0

A couple of days ago, I wrote a blog post about a session I attended during JavaOne. It was a session about the upcoming Servlet 3.0 specification, also known as JSR-315. Back at work, I thought it might be fun to try some of the new features myself. We have one application running on our production site which stores events that occur during Poker games. The application uses a service based on Hessian, which clients call to store the events. The Hessian service is exposed using the HessianServlet from Caucho, so that it may be invoked over HTTP. The Servlet runs in a Jetty 6 container. The application is heavily used, with around 20 to 30 calls per second.

My idea was to rewrite some parts of the application so that everything is based on Servlet 3.0. I have written a test case where I measure the execution time of 30 parallel Threads persisting 30 Poker events. I am hoping that by using Servlet 3.0 asynchronous requests I can see a performance improvement. I know that this is not the perfect use case where asynchronous requests can shine, as there is nothing that can be optimized using parallel processing within a single Thread. Anyway, we will see how it goes; I am curious about the results.

The first step towards Servlet 3.0 is to use a JSR-315 compliant Servlet container. At JavaOne everyone was talking about the upcoming Jetty 7 release and that it supports Servlet 3.0. This was my first disappointment: Jetty 7 is not built on top of Servlet 3.0 but is still only Servlet 2.5 compatible. The semi-official reason is that Jetty has moved from Codehaus to Eclipse and JSR-315 is delayed anyway. My next pick would have been Tomcat 7, as I had seen a comparison matrix indicating that the next Tomcat would support Servlet 3.0 as well. Unfortunately I don't think development is that far along. I found some source code in SVN, but I don't know if it was official, and it was also only Servlet 2.5 based. So my last resort was Glassfish v3, which is the reference implementation for Java EE 6. The preview release of Glassfish v3 comes with a Servlet container that implements JSR-315. Perfect.

It was the first time I installed Glassfish. It was very easy. The application server ships with a web administration interface and is easy to maintain. Currently there are not many tutorials and examples for Servlet 3.0 on the web, so I looked forward to checking out some samples which ship with Glassfish. "After installation, samples are located in install-dir/glassfish/samples/javaee6" - well, this directory just does not exist, neither in the standard preview nor in the web profile. Too bad; they were supposed to have a sample for asynchronous requests as well as for adding Servlets dynamically.

Anyway, I changed my application from a standalone jar distribution that starts a Jetty 6 container to a war distribution that is deployed in Glassfish v3. To deploy something in Glassfish, just copy it into the autodeploy directory of your domain in the glassfish directory. Since the old version uses the HessianServlet directly, I had to download the source from Caucho and modify it so that it uses asynchronous requests from Servlet 3.0. Unfortunately the HessianServlet is not really built for extensibility, so I just made a copy of the whole file to play with. To use another Servlet 3.0 feature, I decided to add the Servlet at runtime using the new ServletContext.addServlet method. I looked up a sample on how to write a ServletContextListener. Some old documentation about JSR-315 indicated that you had to annotate the listener with the @WebServletContextListener annotation. This annotation does not exist anymore in the final draft of JSR-315. Instead you do it the old-school way: write a class that implements ServletContextListener and add it to the web.xml as a context listener. Then, in the contextInitialized method, I added my AsynchronousHessianServlet.
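Here is a minimal sketch of what that listener can look like; the listener class name and the URL mapping are made up, and AsynchronousHessianServlet is my modified copy of the Caucho servlet:

import javax.servlet.ServletContext;
import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;
import javax.servlet.ServletRegistration;

public class HessianContextListener implements ServletContextListener {

    public void contextInitialized(final ServletContextEvent event) {
        final ServletContext context = event.getServletContext();
        // Servlet 3.0: register a servlet at runtime instead of in web.xml
        final ServletRegistration.Dynamic registration =
                context.addServlet("hessianService", AsynchronousHessianServlet.class);
        registration.addMapping("/hessian/*"); // made-up mapping
        registration.setAsyncSupported(true);  // required before asynchronous requests can be used
    }

    public void contextDestroyed(final ServletContextEvent event) {
        // nothing to clean up in this sketch
    }
}

The listener class itself is still registered the old-school way, as a listener entry in web.xml.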

In the next posting I will write about asynchronous requests and if this really makes an existing application faster.

Update: the javaee6 samples are downloaded and installed using the Glassfish updater tool. In my first install attempt, the updater would not work with my company's firewall, so I never got the samples folder. It works fine if your updater tool works. This would have been nice to mention on the Glassfish or Sun website.

Session of the day: Java NIO2 in JDK7

Today has been a good day at JavaOne. I have seen quite a few great and useful talks. As session of the day I have picked a talk by Alan Bateman and Carl Quinn from Netflix about the new IO API (JSR-203) that will be available in JDK7.

In my own private projects I still use the old Java IO API, and I guess that's perfectly fine if your application is not IO-critical. At work however, we have multiple projects that make use of java.nio and depend very much on good performance when it comes to files and directories. So what can JSR-203 do for us?

First of all there will be a class Path, which is an abstraction of a physical file or directory resource. Path is basically what File used to be in plain Java IO. To create a Path instance you have a bunch of options: you can call FileSystems.getDefault().getPath("/foo/bar") or just Paths.get("/foo/bar"). One nice thing is that Path will implement java.lang.Iterable, so you can iterate over a physical path from the root to the current directory. If you want to know at which depth you currently are from the root, just call Path.getNameCount(). Another nice thing: when you iterate over a Path using the legacy Iterator idiom and invoke iterator.remove, the physical file gets deleted.
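A short sketch of those Path basics, using the API as it ended up in JDK7 (the path is made up):

import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.Paths;

public class PathBasics {
    public static void main(final String[] args) {
        // two equivalent ways to obtain a Path
        final Path viaFileSystem = FileSystems.getDefault().getPath("/foo/bar");
        final Path viaPaths = Paths.get("/foo/bar");

        // Path is Iterable over its name elements, from the root down
        for (final Path element : viaPaths) {
            System.out.println(element);
        }

        // depth from the root
        System.out.println("Name count: " + viaFileSystem.getNameCount());
    }
}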

In the example code above you already saw something called FileSystem. This is the factory for all Paths in NIO2. In JDK7 there will also be something called a Provider, which you can leverage to create your own FileSystem. It will be possible to create a memory-based FileSystem, a Desktop FS, a Hadoop FileSystem or anything else you can think of. You can even make your FileSystem the default FileSystem, so that whenever your application calls FileSystems.getDefault(), your custom FileSystem is returned.

Another cool thing is the possibility to traverse a directory tree. NIO2 contains an Interface called FileVisitor. The Interface has a bunch of methods like preVisitDirectory, postVisitDirectory, visitFile, visitFileFailed etc. Each of the methods will be invoked at certain stages when traversing a file tree. For convenience, JSR-203 ships with implementations of FileVisitor like SimpleFileVisitor. You can use one of these FileVisitors and only override the methods that are interesting for you. To kick off traversal of the file tree you call Files.walkFileTree(Path path, FileVisitor fileVisitor).

This becomes really handy when you use it in conjunction with another new class in JDK7 called PathMatcher. This Interface is similar to FileFilter in the old Java IO. This is how to create a PathMatcher: FileSystems.getDefault().getPathMatcher("glob:*.log"). In this example it will select all files matching *.log. You can also use regular expressions instead of glob syntax.

If you look at the method signature of visitFile in the FileVisitor Interface, you will notice that the second parameter is of type BasicFileAttributes, which are the attributes of the current file that visitFile is invoked with. So let's say you create your FileVisitor with a PathMatcher that selects *.log files. What you could do in the visitFile method is invoke PathMatcher.matches and, if it is a log file, check the file size attribute using the given BasicFileAttributes. If the file is bigger than a certain size, delete it. Pretty handy, right? A very short piece of Java code that traverses a file tree and deletes log files over a certain size; a sketch follows after the matcher example below.




PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:*.{java,class}");

Path filename = ...;
if (matcher.matches(filename)) {
    System.out.println(filename);
}
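And here is the promised traversal as a sketch, written against the API as released in JDK7; the directory and the size limit are made up:

import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

public class LargeLogFileCleaner {
    public static void main(final String[] args) throws IOException {
        final PathMatcher logFiles = FileSystems.getDefault().getPathMatcher("glob:*.log");
        final long maxSize = 10L * 1024 * 1024; // 10 MB, made up for this sketch

        Files.walkFileTree(Paths.get("/var/logs"), new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(final Path file, final BasicFileAttributes attrs)
                    throws IOException {
                // match on the file name only, then check the size attribute
                if (logFiles.matches(file.getFileName()) && attrs.size() > maxSize) {
                    Files.delete(file);
                }
                return FileVisitResult.CONTINUE;
            }
        });
    }
}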




An entirely different use case can be covered with WatchService, Watchable, WatchEvent and WatchKey. These make it possible to sit, listen and react to changes that occur to Path objects. First you get yourself a WatchService using the default FileSystem: WatchService watcher = FileSystems.getDefault().newWatchService(). The next step is to get the Path as before: Paths.get("/foo/bar/old.log"). Then you register the Path with the WatchService to get a WatchKey: WatchKey key = path.register(watcher, ENTRY_CREATE, ENTRY_DELETE, ENTRY_MODIFY). The last parameters of the register method are varargs. In this example you will be watching create, delete and modification events on the specified Path. Finally you need to create an infinite loop that constantly polls the events out of the WatchKey. Your code can then react to these events.




import static java.nio.file.StandardWatchEventKinds.*;

// watcher created and path registered as described above
for (;;) {

    // wait for a key to be signaled
    WatchKey key;
    try {
        key = watcher.take();
    } catch (InterruptedException x) {
        return;
    }

    for (WatchEvent<?> event : key.pollEvents()) {
        WatchEvent.Kind<?> kind = event.kind();

        // an OVERFLOW event can occur regardless of the kinds the
        // key was registered for, whenever events are lost or discarded
        if (kind == OVERFLOW) {
            continue;
        }

        // do your stuff
    }

    // reset the key; if it is no longer valid, the watched Path is
    // gone or inaccessible, so stop watching
    boolean valid = key.reset();
    if (!valid) {
        break;
    }
}





One new feature that is particularly interesting for our applications is the new DirectoryStream interface. You use it to access the contents of a directory. Well, you could do this before, but DirectoryStream scales much better and uses fewer resources. This will not be an issue if you have a couple of hundred files in your directory, but it makes a huge difference if there are hundreds of thousands of files in the directory. Here is how to use DirectoryStream.




Path dir = new File("/foo/bar").toPath(); // new method on File in JDK7
DirectoryStream<Path> stream = null;
try {
    stream = dir.newDirectoryStream();
    for (Path file : stream) {
        System.out.println(file.getName());
    }
} catch (IOException x) {
    // handle or log the error
} finally {
    if (stream != null) {
        try {
            stream.close();
        } catch (IOException ignored) {
        }
    }
}





Okay, this post was maybe a bit too theoretical. You can check out what NIO2 feels like with the OpenJDK or this great tutorial.

Session of the day: JVM debugging


The second day at JavaOne was surprisingly just average. Kohsuke had a great talk about distributed Hudson builds and Hudson EC2 integration, but the rest of the sessions were pretty normal. Then I go into this session “Debugging your production JVM” from Ken Sipe and the guy blasts the roof off. He is showing all these nifty, cool tools that can help to get an insight into what is going on in your JVM. Not just one tool but many. I am really having a hard time writing everything down. A lot of command-line tools already come with Java, like jstat, jps or jmap. They are ready to be used; you just need to know how to invoke them with the right parameters.

Then Ken starts to talk about something really fancy - BTrace. So how can BTrace help to debug your production VM? Essentially it is a little tool that you can use at runtime to add debugging “aspects” to your running Java bytecode. I use the word Aspect here because when I first saw it, BTrace felt quite similar to AspectJ. What BTrace does is take a little script that you write in Java and dynamically inject it as tracing code into the running JVM.

What I called a script here is pure Java code. What you write looks almost exactly like a plain Java class with a lot of Annotations. The code you can write as a BTrace script is really limited, however. Ken had a slide in the presentation about all the Java stuff that is not doable. I only remember that you could not use the new keyword to create objects. Luckily I found the other restrictions on the BTrace website:


  • can not create new objects.
  • can not create new arrays.
  • can not throw exceptions.
  • can not catch exceptions.
  • can not make arbitrary instance or static method calls - only the public static methods of com.sun.btrace.BTraceUtils class may be called from a BTrace program.
  • can not assign to static or instance fields of target program's classes and objects. But, BTrace class can assign to it's own static fields ("trace state" can be mutated).
  • can not have instance fields and methods. Only static public void returning methods are allowed for a BTrace class. And all fields have to be static.
  • can not have outer, inner, nested or local classes.
  • can not have synchronized blocks or synchronized methods.
  • can not have loops (for, while, do..while)
  • can not extend arbitrary class (super class has to be java.lang.Object)
  • can not implement interfaces.
  • can not contain assert statements.
  • can not use class literals.


Your hands are tied. Well almost. So let's have a look at a sample from the BTrace website.





import com.sun.btrace.annotations.*;
import static com.sun.btrace.BTraceUtils.*;

// @BTrace annotation tells that this is a BTrace program
@BTrace
public class HelloWorld {

    // @OnMethod annotation tells where to probe.
    // In this example, we are interested in entry
    // into the Thread.start() method.
    @OnMethod(
        clazz="java.lang.Thread",
        method="start"
    )
    public static void func() {
        // println is defined in BTraceUtils
        // you can only call the static methods of BTraceUtils
        println("about to start a thread!");
    }
}




You can see, it is a standard Java class annotated with @BTrace. Then it says @OnMethod with two parameters, which translates to: every time the start method is invoked on java.lang.Thread ... What BTrace will do in that case is invoke the static method it finds within the @BTrace annotated class. It has to be a static method; I forgot if it has to follow a naming convention too. So the static method will be invoked every time a Thread is started. In the sample, it will just print out something fixed on the console. You could also count the number of Threads or other things.
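If I remember correctly, you attach the script to a running VM with the btrace command-line client, roughly like this (the PID is whatever jps reports for your target process):


# attach the HelloWorld script to the JVM with process id 1234
btrace 1234 HelloWorld.java
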

Here is another example.





@BTrace public class Memory {

    @OnTimer(4000)
    public static void printMem() {
        println("Heap:");
        println(heapUsage());
        println("Non-Heap:");
        println(nonHeapUsage());
    }
}




This will print out memory usage every 4 seconds. Awesome. Now something really huge.





@BTrace public class HistogramBean {

    // @Property exposes this field as MBean attribute
    @Property
    private static Map<String, AtomicInteger> histo = newHashMap();

    // method="<init>" probes the constructor of JComponent
    @OnMethod(
        clazz="javax.swing.JComponent",
        method="<init>"
    )
    public static void onnewObject(@Self Object obj) {
        // ... (update the histogram here)
    }

    @OnTimer(4000)
    public static void print() {
        if (size(histo) != 0) {
            printNumberMap("Component Histogram", histo);
        }
    }
}



Don't worry about the details of what it does right now. The important part is that you can annotate fields with @Property and have BTrace expose these fields as MBeans. All you need to do is inject the BTrace script a little differently from the command line.

Some words on the @OnMethod and @OnTimer stuff. These Annotations are called probe points, and there are more, like @OnMemoryLow, @OnExit, @OnError etc. Another example is to use BTrace to monitor entering and leaving of synchronized blocks.

Unfortunately BTrace requires Java 6+; it will not work with 5. Now you have a good reason to step up your Java version.

Session of the day: Servlet 3.0


SVG and Canvas is really cool stuff and “Cross Browser Vector Graphics with SVG and Canvas” came really close to being my session of the day here at JavaOne, but the Servlet 3.0 stuff topped it all. There are so many great features in JSR-315. The final draft is now out and the guys said it will go live with Java EE6.

web.xml = history
First of all, the biggest difference is that you do not need a web.xml anymore. Servlet 3.0 fully relies on Java Annotations. If you want to declare a Servlet, use the @WebServlet Annotation on your Servlet class. The class still has to inherit from HttpServlet; this remains unchanged. That means it is not possible to use methods other than doPost, doGet etc. and annotate them to mark them as request handler methods, which is something you can do in the Jersey library (JSR-311). In theory, I guess it would have been possible to not inherit from HttpServlet but instead create a rule that every class annotated with @WebServlet has to have methods annotated with @PostMethod or @GetMethod. I guess there are good reasons not to do so in Servlet 3.0.

At the very minimum, you have to annotate the URL pattern under which your Servlet is invoked. If the Servlet name is omitted, the full class name will be used as the Servlet name. Other parameters you can specify in the top-level @WebServlet Annotation include, for instance, whether the Servlet may take part in asynchronous requests. More on asynchronous requests later.




// attribute names as in the final specification: urlPatterns and asyncSupported
@WebServlet(urlPatterns = "/foo", asyncSupported = true)
public class SomeServlet extends HttpServlet {
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        // ...
    }
}





So the web.xml is gone. Filters are added to the ServletContext using the @WebFilter Annotation, Listeners are added using the @WebListener Annotation. The deployment descriptor file web.xml is still useful, though. It can be used to override whatever you have specified using class Annotations. So if you create a web.xml file, whatever you have in there has the final word when the Container starts up the ServletContext.

Servlets, Listeners and Filters can now also be added programmatically. The ServletContext class has new methods like addServlet, addServletMapping, addFilter or addFilterMapping. On Container start-up, you can hook in and add Servlets or whatever you want at runtime; a sketch follows below.
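The exact method names were still in flux at the time of the session; in the final API shape, registering the SomeServlet class from above programmatically looks roughly like this (my sketch):


import javax.servlet.ServletContext;
import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;
import javax.servlet.ServletRegistration;
import javax.servlet.annotation.WebListener;

// hooks into Container start-up and registers a Servlet at runtime
@WebListener
public class DynamicRegistration implements ServletContextListener {

    public void contextInitialized(ServletContextEvent sce) {
        ServletContext ctx = sce.getServletContext();
        ServletRegistration.Dynamic registration =
                ctx.addServlet("someServlet", SomeServlet.class);
        registration.addMapping("/dynamic");
    }

    public void contextDestroyed(ServletContextEvent sce) {
        // nothing to clean up in this sketch
    }
}
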

Web Frameworks can plug in
Something that I think is really cool is the possibility for Web Frameworks like Apache Wicket, Tapestry or Spring MVC to plug into the ServletContext creation. Remember that in the past, whenever you learned about a new web framework, there was this one section in the documentation where you had to add some Servlet, some Filter or some Listener to the web.xml?





<web-app>
    <display-name>Wicket Examples</display-name>
    <filter>
        <filter-name>HelloWorldApplication</filter-name>
        <filter-class>org.apache.wicket.protocol.http.WicketFilter</filter-class>
        <init-param>
            <param-name>applicationClassName</param-name>
            <param-value>org.apache.wicket.examples.helloworld.HelloWorldApplication</param-value>
        </init-param>
    </filter>
    <filter-mapping>
        <filter-name>HelloWorldApplication</filter-name>
        <url-pattern>/*</url-pattern>
    </filter-mapping>
</web-app>




This is now history in Servlet 3.0. Web Frameworks can supply something in their JAR-file deployables that is called web-fragment.xml. It is basically a light version of the web.xml and has almost the exact same XML structure. When the Container loads up, it will go into all the JAR files in WEB-INF/lib and scan for a web-fragment.xml file in the META-INF directory. If it finds one, the file will be used when creating the final ServletContext.

Sometimes you want more control over how fragments are pulled into the ServletContext creation. There are ways to control the ordering in which web-fragment.xml files are put together. The library itself can specify a relative ordering in its web-fragment.xml file; for instance it can say: I am not interested in a particular order, load me last. It is also possible to define an absolute ordering in your own web.xml file, which again has the final word. One important thing to note: only the WEB-INF/lib directory is scanned for web fragments, not the classes directory. A sketch of such a fragment follows below.
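Here is a rough sketch of what Wicket's fragment could look like; the fragment name and the "load me last" ordering are my assumptions for the example:


<web-fragment xmlns="http://java.sun.com/xml/ns/javaee" version="3.0">
    <name>WicketFragment</name>
    <filter>
        <filter-name>HelloWorldApplication</filter-name>
        <filter-class>org.apache.wicket.protocol.http.WicketFilter</filter-class>
    </filter>
    <filter-mapping>
        <filter-name>HelloWorldApplication</filter-name>
        <url-pattern>/*</url-pattern>
    </filter-mapping>
    <!-- relative ordering: load this fragment after all others -->
    <ordering>
        <after>
            <others/>
        </after>
    </ordering>
</web-fragment>
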

Third party libraries can also add resources of their own. Whatever the web framework has in META-INF/resources becomes available from the context in which the library is loaded. For instance, if the JAR contains META-INF/resources/foo.jsp, then the resource is available from http://localhost:8080/foo.jsp. Kind of useful too. Such a JAR might be laid out as shown below.
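A hypothetical framework JAR contributing both a fragment and a resource:


my-framework.jar
    META-INF/
        web-fragment.xml
        resources/
            foo.jsp
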

New security, New Error Page

Security constraints can now also be expressed using Annotations. I forgot the correct Annotation names; I think it was something like @DenyRoles that you could put on, for instance, doPost methods to secure them. I also vaguely remember that the standard security mechanism to display a Web Form (Form Based Authentication) was removed. What you do now instead is properly annotate a method which will authenticate the User for you. You are therefore basically free to choose however you want to authenticate your Users. Unfortunately, this is a feature of the Servlet Specification that I have almost never used in the last 10 years, so I did not pay too much attention when they talked about this, and the security part was also kept short in the session.

Which brings me to the next improvement: default error pages. Remember back in the old Servlet 2.3+ days that you had to add an error page for every single error code? What a copy-and-paste mess. JSR-315 gives you the possibility to define default error pages. You can have, for instance, something like “show this page for all errors except 404”. Very handy; a sketch follows below.
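In web.xml this could look roughly like this (my sketch; in Servlet 3.0 the error-code and exception-type elements became optional, which turns an entry into a catch-all):


<!-- catch-all default error page: no error-code or exception-type needed -->
<error-page>
    <location>/errors/default.jsp</location>
</error-page>

<!-- a specific page still wins for 404 -->
<error-page>
    <error-code>404</error-code>
    <location>/errors/not-found.jsp</location>
</error-page>
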

New methods in Response
Remember that it was sometimes a pain in the ass to work with the HTTP response? It was not apparent in which phase the response was or what the status was. Servlet 3.0 will add new methods to HttpServletResponse that will make our lives easier. You can now get the header names, header values and the response status using API calls.
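A minimal sketch, assuming the final method names (getStatus, getHeaderNames, getHeader), somewhere inside a Servlet or Filter that has already written to the response:


// new in Servlet 3.0: query the response state directly
int status = response.getStatus();
java.util.Collection<String> headerNames = response.getHeaderNames();
String contentType = response.getHeader("Content-Type");
System.out.println(status + " " + headerNames + " " + contentType);
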

Asynchronous Requests

Finally to the most impressive new feature - asynchronous requests. This is huge! Imagine that you have an HTTP POST method in a Servlet and all it does is call out to a WebService. While the service is doing its work, the HttpRequest is being held by the Servlet Container. Standard stuff. This becomes problematic, however, when your thread pool limit is reached. Let's say you have defined that 50 Threads should be in the pool and you receive 30 Requests per second. If the web service call takes 2 seconds, roughly 60 Requests are in flight at any moment, but only 50 Threads are in the pool, so you have a problem.

Here is what Servlet 3.0 does. Well, it is kind of hard to explain and I did not understand it fully, but here is what I think it does. You can specify on Servlets and Filters that they may be used for asynchronous requests; this is declared as an attribute of the @WebServlet or @WebFilter Annotation. When a request comes in, the Container looks at the Filter chain and the Servlet and figures out whether it can run asynchronously or not. The original request calls out and is then suspended, freeing the Thread and returning it to the pool. A callback is given along; it is invoked when the external resource becomes available, and the Container uses a new Thread to generate the Response. The Request then resumes its work. New methods for suspending and resuming, as well as for querying the current status of the request, have been added. There was also something called an asynchronous Handle in the presentation slides, but I forgot how it was used. The sketch below shows how this shapes up in the final API.
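A minimal sketch based on the final Servlet 3.0 API (startAsync and AsyncContext; the slides may have used different names for the asynchronous handle):


import java.io.IOException;
import javax.servlet.AsyncContext;
import javax.servlet.ServletException;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

@WebServlet(urlPatterns = "/async", asyncSupported = true)
public class AsyncServlet extends HttpServlet {

    protected void doPost(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        // suspend the request; the container thread returns to the pool
        final AsyncContext asyncContext = req.startAsync();

        asyncContext.start(new Runnable() {
            public void run() {
                // call the slow web service here, then write the response
                try {
                    asyncContext.getResponse().getWriter().println("done");
                } catch (IOException e) {
                    // log and ignore in this sketch
                }
                asyncContext.complete(); // resume and finish the request
            }
        });
    }
}
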

Anyway, the guy at JavaOne had an example ready where he had written a Servlet that queried the Ebay REST Interface for 3 keywords. He then wrote the same thing using asynchronous Requests from Servlet 3.0 and it was like 3 or 4 times faster. This is because not only are the Threads returned to the pool while they would otherwise sit idle (avoiding Thread starvation), but multiple Requests can also run in parallel. This is a major improvement, I think. Our current web applications can be made faster just by using features of a new Servlet specification.

I hope this was useful. I will experiment with Servlet 3.0 before the final release comes out. I think I might be able to do this today already using some experimental version of Jetty or Glassfish.