Tuesday, August 6, 2013

Setting-up Tomcat SSL with StartSSL Certificates

Part of an effort to improve the security of CheckTheCrowd.com is to enable SSL on my web server. Enabling SSL allows it to support HTTPS connections.

The CheckTheCrowd web application is hosted on Apache Tomcat, which provides pretty good, albeit generic, documentation on how to achieve this setup.

In summary, enabling SSL on Tomcat requires three things:
  1. Creating a Java keystore which contains the private key that Tomcat would use to start SSL handshakes
  2. Ensuring that you or your website owns the private key by having it signed by a trusted authority which, in turn, issues a digital certificate verifying your ownership of the key
  3. Configuring a Tomcat connector to listen on HTTPS from a specified port
Creating a keystore and configuring a Tomcat connector are simple enough. However, acquiring an SSL certificate from a trusted provider can be expensive.

Thankfully, I learned about StartSSL, which provides free SSL certificates with one-year validity (a new one can be generated upon expiry).

Below are the steps I took to set up Tomcat SSL using StartSSL certificates.

Disclaimer: I learned most of these steps from this blog post.

1. Creating the Java Keystore File (.jks)

As per the Tomcat documentation, the first thing I needed to do was generate a Java keystore to hold my private key. This was done using the keytool command that comes with the JDK.
keytool -genkey -keysize 2048 -keyalg RSA -sigalg SHA1withRSA \
 -alias [name of server] -keystore [name of keystore].jks \
 -keypass [password] -storepass [password] -dname "CN=[domain name], \
 OU=Unknown, O=[website], L=[city], ST=[state], C=[country]"
Note that due to a Tomcat limitation, the keypass and storepass must be the same. The dname entry is optional; if not provided, keytool will prompt for these details during the process.

Example:
keytool -genkey -keysize 2048 -keyalg RSA -sigalg SHA1withRSA \
 -alias webserver -keystore checkthecrowd.jks \
 -keypass ****** -storepass ****** -dname "CN=checkthecrowd.com, \
 OU=Unknown, O=CheckTheCrowd, L=Singapore, ST=Unknown, C=SG"
At this point, my keystore already contains the private key required by Tomcat to start an SSL connection.

I can already start using this keystore to enable SSL in Tomcat, but a rogue entity could hijack the connection and pretend that its private key was issued by CheckTheCrowd. This rogue entity could then trick my users into thinking they are securely connected to CheckTheCrowd when in fact they are connected to something else.

To solve this, I need to acquire a signed certificate proving that my private key is associated with my domain (checkthecrowd.com).

2. Creating a Certificate Request File (.csr)

A certificate request is submitted to a certificate provider, and an SSL certificate is generated based on this file.
keytool -certreq -alias [name of server] -file [name of request].csr \
 -keystore [name of keystore].jks
Note that this command will prompt for the password previously set on the keystore.

Example:
keytool -certreq -alias webserver -file checkthecrowd.csr \
 -keystore checkthecrowd.jks

3. Submitting the Certificate Request to StartSSL

I needed to sign up for an account in order to use StartSSL. Signing up involves generating a signed private key which proves my identity. From here onwards, this key is used by StartSSL to authenticate my access to their website.

Note that it is important to keep a backup copy of this private key for future use. This file needs to be imported on every computer used to access StartSSL.

Figure 1: StartSSL


Once I had an account, I used the Control Panel to generate my certificate. The first step was to validate that I own the domain checkthecrowd.com. The aptly named Validation Wizard took care of this.

Once my domain was validated, I used the Certificates Wizard to submit my certificate request (.csr file):
  1. I selected Web Server SSL/TLS Certificate.
  2. Because I already had a private key and a certificate request, I skipped the next screen.
  3. I pasted the contents of my certificate request (.csr file) into the text area provided.
  4. When finished, the generated certificate was displayed in another text area -- I copied this and saved it to a file called ssl.crt.
4. Importing the Generated Certificate and the StartSSL Certificate Chain

The next step was to import the generated certificate into my keystore. The StartSSL certificate chain also needed to be imported.

The StartSSL certificate chain can be downloaded from the StartSSL website. Note that the free SSL certificate from StartSSL is only a Class 1 level certificate. With an upgraded package (Class 2 and higher), all applicable class certificates must be downloaded.

I again used keytool to import these certificates:
keytool -import -alias [ca alias] -file [ca file].cer \
 -keystore [keystore name].jks -trustcacerts
keytool -import -alias [class1 alias] -file [class1 file].pem \
 -keystore [keystore name].jks -trustcacerts
keytool -import -alias [name of server] -file ssl.crt \
 -keystore [keystore name].jks
The first two commands imported the certificate chain as trusted certificates; the last command imported the signed certificate.

Example:
keytool -import -alias startsslca -file ca.cer \
 -keystore checkthecrowd.jks -trustcacerts
keytool -import -alias startsslca1 -file sub.class1.server.ca.pem \
 -keystore checkthecrowd.jks -trustcacerts
keytool -import -alias webserver -file ssl.crt \
 -keystore checkthecrowd.jks
Listing the contents of my keystore verified that it holds three entries:
# keytool -list -keystore checkthecrowd.jks
webserver, Aug 5, 2013, PrivateKeyEntry,
Certificate fingerprint (SHA1): [...]
startsslca, Aug 5, 2013, trustedCertEntry,
Certificate fingerprint (SHA1): [...]
startsslca1, Aug 5, 2013, trustedCertEntry,
Certificate fingerprint (SHA1): [...]

5. Configuring Tomcat with SSL

Enabling SSL with Tomcat involves creating a new connector which listens for HTTPS connections. This connector needs to know the location of the keystore file as well as the password to access it.

For convenience, I placed my keystore under $TOMCAT_HOME.
<!-- 
Define a SSL HTTP/1.1 Connector on port 8443
This connector uses the JSSE configuration, when using APR, the
connector should be using the OpenSSL style configuration
described in the APR documentation 
-->
<Connector
    protocol="HTTP/1.1"
    port="8443" maxThreads="200"
    scheme="https" secure="true" SSLEnabled="true"
    keystoreFile="checkthecrowd.jks" keystorePass="******"
    clientAuth="false" sslProtocol="TLS"/>
Note that by default, the Tomcat HTTPS port is 8443.
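As an aside, 8443 is also the port that the default HTTP connector's redirectPort attribute points to; this is what sends traffic to the HTTPS connector when a web application demands a secure connection. A typical sketch from a default server.xml (attribute values may vary per installation):
<Connector port="8080" protocol="HTTP/1.1"
    connectionTimeout="20000"
    redirectPort="8443" />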

That's all there is to it! After bouncing Tomcat, I am now able to access CheckTheCrowd via HTTPS on port 8443: https://checkthecrowd.com:8443/.

The next step is to configure Apache httpd to forward HTTPS requests to port 8443. I haven't figured out how to do this yet, so if you have an idea, let me know!

Wednesday, June 19, 2013

Analyzing FizzBuzz Performance on Java and Scala

Recently, I attended a technical interview for a Java developer role. The first problem that was given to me was to solve the FizzBuzz test.
Write a program that prints the numbers from 1 to 100. But for multiples of three print “Fizz” instead of the number and for the multiples of five print “Buzz”. For numbers which are multiples of both three and five print “FizzBuzz”.
While the FizzBuzz test is apparently a popular interview problem, I had never encountered it before. And to be honest, I did not find the problem too challenging, which worried me a bit (more often than not, easy interview problems are trick questions).

My policy when writing code is that the simplest solution is usually the best solution. So, I gave them the most straightforward solution I could think of.
for (int i=1; i <= 100; i++) {
    if (i % 3 == 0 && i % 5 == 0)
        println("FizzBuzz")
    else if (i % 3 == 0)
        println("Fizz")
    else if (i % 5 == 0)
        println("Buzz")
    else
        println(i)
}
As expected, the interviewers asked me to improve my code. The mod expressions are obviously duplicated, so I assigned their results to variables.
for (int i=1; i <= 100; i++) {
    boolean fizz = i % 3 == 0
    boolean buzz = i % 5 == 0
    if (fizz && buzz)
        println("FizzBuzz")
    else if (fizz)
        println("Fizz")
    else if (buzz)
        println("Buzz")
    else
        println(i)
}
I was pretty happy with this solution, especially because it made the intention of the code easier to read. However, the interviewers commented that in order to reach the most common case (divisible by neither 3 nor 5), the code has to evaluate 3 branch statements.

To solve this, I added an initial check for numbers divisible by neither 3 nor 5.
for (int i=1; i <= 100; i++) {
    boolean fizz = i % 3 == 0
    boolean buzz = i % 5 == 0
    if (!fizz && !buzz) {
        println(i)
    } else {
        if (fizz && buzz)
            println("FizzBuzz")
        else if (fizz)
            println("Fizz")
        else
            println("Buzz")
    }
}
I personally did not like this solution; I felt that the added check was redundant and made the intention of the code less obvious. In fact, I told the interviewers that I was choosing my 1st solution over this one.

I could tell that the interviewers were not completely convinced, but we had already spent too much time on this problem so we moved on. Quite frankly, I didn't think I could improve my solution any further at that point.

As soon as I reached home, I tried thinking of a better solution but couldn't come up with one. So, I did the natural thing and searched the internet! I was pleased to see that most people came up with the same solution as I did.

However, I found a blog with an interesting solution.
for (int i=1; i <= 100; i++) {
    boolean fizz = i % 3 == 0
    boolean buzz = i % 5 == 0
    if (!fizz && !buzz) {
        print(i)
    } else {
        if (fizz)
            print("Fizz")
        if (buzz)
            print("Buzz")
    }
}
Notice that the above code (a rough translation to Java) is mostly equivalent to my 2nd solution except for a major twist -- it eliminates the branch for "FizzBuzz" by letting the "Fizz" and "Buzz" branches concatenate their output.

In my opinion, the intention of the above code is even less obvious than my 2nd solution (at a quick glance, I wouldn't think that it prints "FizzBuzz" at all); but since we are aiming for an optimized solution, this one seemed better.

I was curious how these solutions differ in terms of performance, so I decided to write an actual test.

I wrote my tests in both Java and Scala (mostly for practice). To make an "apples to apples" comparison, all solutions call print rather than println (otherwise the 3rd solution would require an extra step to print a new line on every iteration). Also, to eliminate I/O overhead, I used a PrintWriter backed by a StringWriter as a substitute for System.out.

Below is my Java implementation.
private static void fizzBuzz1(int n) {
    PrintWriter writer = new PrintWriter(new StringWriter());
    for (int i = 1; i <= n; i++) {
        boolean fizz = i % 3 == 0;
        boolean buzz = i % 5 == 0;
        if (fizz && buzz)
            writer.print("FizzBuzz");
        else if (fizz)
            writer.print("Fizz");
        else if (buzz)
            writer.print("Buzz");
        else
            writer.print(i);
    }
}

private static void fizzBuzz2(int n) {
    PrintWriter writer = new PrintWriter(new StringWriter());
    for (int i = 1; i <= n; i++) {
        boolean fizz = i % 3 == 0;
        boolean buzz = i % 5 == 0;
        if (!fizz && !buzz) {
            writer.print(i);
        } else {
            if (fizz && buzz)
                writer.print("FizzBuzz");
            else if (fizz)
                writer.print("Fizz");
            else
                writer.print("Buzz");
        }
    }
}

private static void fizzBuzz3(int n) {
    PrintWriter writer = new PrintWriter(new StringWriter());
    for (int i = 1; i <= n; i++) {
        boolean fizz = i % 3 == 0;
        boolean buzz = i % 5 == 0;
        if (!fizz && !buzz) {
            writer.print(i);
        } else {
            if (fizz)
                writer.print("Fizz");
            if (buzz)
                writer.print("Buzz");
        }
    }
}
Below is my Scala implementation.
  def fizzBuzz1(n: Int) = {
    val writer = new PrintWriter(new StringWriter)
    for (i <- 1 to n) {
      val fizz = i % 3 == 0
      val buzz = i % 5 == 0
      if (fizz && buzz) writer.print("FizzBuzz")
      else if (fizz) writer.print("Fizz")
      else if (buzz) writer.print("Buzz")
      else writer.print(i)
    }
  }

  def fizzBuzz2(n: Int) = {
    val writer = new PrintWriter(new StringWriter)
    for (i <- 1 to n) {
      val fizz = i % 3 == 0
      val buzz = i % 5 == 0
      if (!fizz && !buzz) {
        writer.print(i)
      } else {
        if (fizz && buzz) writer.print("FizzBuzz")
        else if (fizz) writer.print("Fizz")
        else writer.print("Buzz")
      }
    }
  }

  def fizzBuzz3(n: Int) = {
    val writer = new PrintWriter(new StringWriter)
    for (i <- 1 to n) {
      val fizz = i % 3 == 0
      val buzz = i % 5 == 0
      if (!fizz && !buzz) {
        writer.print(i)
      } else {
        if (fizz) writer.print("Fizz")
        if (buzz) writer.print("Buzz")
      }
    }
  }
My test involves executing the 3 solutions 100,000 times each to ensure that the HotSpot JVM is properly "warmed up". I measured performance by recording the average elapsed time of each iteration, then used these averages to calculate the percentage difference in performance.
val range = 100
val iterations = 100000

val totalFizzbuzz1 = run(fizzBuzz1, range, iterations)
val totalFizzbuzz2 = run(fizzBuzz2, range, iterations)
val totalFizzbuzz3 = run(fizzBuzz3, range, iterations)

val aveFizzbuzz1 = totalFizzbuzz1.toDouble / iterations
val aveFizzbuzz2 = totalFizzbuzz2.toDouble / iterations
val aveFizzbuzz3 = totalFizzbuzz3.toDouble / iterations

val maxAve = List(aveFizzbuzz1, aveFizzbuzz2, aveFizzbuzz3).
    reduceLeft((l, r) => if (r > l) r else l)

def run(f: (Int) => Unit, range: Int, iterations: Int) = {
  var startTime: Long = 0
  var totalTime: Long = 0
  for (i <- 1 to iterations) {
    startTime = System.nanoTime()
    f(range)
    totalTime += (System.nanoTime() - startTime);
  }

  totalTime
}
Note that the above test is repeated several times to check for consistency.
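The percentage figures in the results below are computed against the slowest (highest) average of each round, which is what maxAve above is for. As a quick sketch of that arithmetic (a hypothetical helper, not part of the harness above):
// Percentage by which a solution is faster than the slowest one;
// the slowest solution itself scores 0%.
static double percentFaster(double average, double slowestAverage) {
    return (slowestAverage - average) / slowestAverage * 100;
}
For example, in Round 1: (6862.5038 - 6071.6055) / 6862.5038 * 100 gives the reported 11.5249%.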

On my machine (i7-2600 @ 3.40 GHz x 8 Cores + 8 GB RAM), I got the following result:
=== Round 1 ===

Iterations: 100000
Range: 100

- Averages -
Solution 1 = 6862.5038 ns.
Solution 2 = 6071.6055 ns.
Solution 3 = 6112.7298 ns.

- Statistics -
Solution 1 = 0% faster
Solution 2 = 11.5249% faster
Solution 3 = 10.9257% faster

=== Round 2 ===

Iterations: 100000
Range: 100

- Averages -
Solution 1 = 5732.6164 ns.
Solution 2 = 5735.8125 ns.
Solution 3 = 6069.4501 ns.

- Statistics -
Solution 1 = 5.5497% faster
Solution 2 = 5.497% faster
Solution 3 = 0% faster

=== Round 3 ===

Iterations: 100000
Range: 100

- Averages -
Solution 1 = 4690.641 ns.
Solution 2 = 4126.3942 ns.
Solution 3 = 4331.1555 ns.

- Statistics -
Solution 1 = 0% faster
Solution 2 = 12.0292% faster
Solution 3 = 7.6639% faster

=== Round 4 ===

Iterations: 100000
Range: 100

- Averages -
Solution 1 = 3985.0794 ns.
Solution 2 = 4201.1156 ns.
Solution 3 = 4299.2292 ns.

- Statistics -
Solution 1 = 7.3071% faster
Solution 2 = 2.2821% faster
Solution 3 = 0% faster

=== Round 5 ===

Iterations: 100000
Range: 100

- Averages -
Solution 1 = 3994.3317 ns.
Solution 2 = 4103.4171 ns.
Solution 3 = 4293.0744 ns.

- Statistics -
Solution 1 = 6.9587% faster
Solution 2 = 4.4177% faster
Solution 3 = 0% faster
What is immediately noticeable is that my test was unable to arrive at a consistent result! More surprisingly, the 3rd solution performed worst in 3 out of the 5 runs (2, 4, and 5), while the 1st solution performed best in those same 3 runs.

Note that the above results do not imply that the 1st solution is the fastest or that the 3rd solution is the slowest. What can be concluded from the above results, however, is that the performance of these solutions does not matter -- in fact, my test written in Java yielded similar results. The performance differences between these solutions are so negligible that we are better off with the most straightforward solution (the 1st, in my opinion).

This is exactly what is meant by avoiding premature optimization. By optimizing code without bothering to measure actual performance, readability suffers and more harm is done than good. Rather than focusing on code-level optimization, effort is better spent designing an architecture that is built for performance.

Friday, May 24, 2013

Flowee: Sample Application


My previous post introduced Flowee as a framework for building Java services backed by one or more workflows. Through a sample application, this post will demonstrate how easy it is to build workflow-based services using Flowee.

The sample application is a service which authenticates two types of accounts: an admin and a user. The service will display a greeting, then authenticate each type of account using a different authentication method.
Figure 1: Sample Workflow Service
Note that this example is by no means a realistic use-case for production; it is used here purely for illustration.

Implementing the Service

(1) We start by defining the request:
public class LoginRequest {
    private String username;
    private String password;
    private String type;
    // Accessors omitted
}

(2) We also create an application-specific WorkflowContext (this is an optional step; the default workflow context is probably sufficient for some applications):
public class LoginContext extends WorkflowContext {
    private static final String KEY_IS_AUTHENTICATED = "isAuthenticated";

    public void setIsAuthenticated(Boolean isAuthenticated) {
        put(KEY_IS_AUTHENTICATED, isAuthenticated);
    }

    public Boolean getIsAuthenticated() {
        return (Boolean) get(KEY_IS_AUTHENTICATED);
    }
}
Here, we extend the default context with convenience methods to access the value mapped to the "isAuthenticated" key (return values are normally stored in the context).

(3) Then we define an application-specific Task interface to remove generic types:
public interface LoginTask 
    extends Task<LoginRequest, LoginContext> {
}

(4) Next, we define an application-specific abstract Task. We make it BeanNameAware so that the task assumes its bean name when declared in Spring:
public abstract class AbstractLoginTask extends
        AbstractTask<LoginRequest, LoginContext> implements LoginTask,
        BeanNameAware {    
    @Override
    public void setBeanName(String name) {
        setName(name);
    }
}

(5) We can now create an application-specific Workflow:
public class LoginWorkflow extends
        AbstractWorkflow<LoginTask, LoginRequest, LoginContext> {
    public LoginWorkflow(String name) {
        super(name);
    }
}
The abstract implementation should provide all the functionality we need; this step simply declares the generic parameters.

(6) Next, we define a WorkflowFactory. For this example, we will make our workflow factory configurable from properties files. To achieve this, we need to inherit from AbstractPropertiesWorkflowFactory:
public class LoginWorkflowFactory
        extends
        AbstractPropertiesWorkflowFactory<LoginWorkflow, LoginTask, LoginRequest, LoginContext> {
    @Override
    protected LoginWorkflow createWorkflow(String name) {
        return new LoginWorkflow(name);
    }
}
This requires us to override createWorkflow(), an Abstract Factory method.

(7) We then define the Filter that will be used by our configurable workflow factory. Recall from my previous post that configurable factories use filters to evaluate the conditions that determine which workflows get created.

Flowee comes with an abstract implementation that evaluates conditions as JEXL expressions. Using JEXL allows us to define JavaScript-like conditions for our workflow configuration:
public class LoginFilter 
        extends AbstractJexlFilter<LoginRequest, LoginContext> {
    @Override
    protected ReadonlyContext populateJexlContext(LoginRequest request,
            LoginContext context) {
        JexlContext jexlContext = new MapContext();
        jexlContext.set("request", request);
        return new ReadonlyContext(jexlContext);
    }
}
The populateJexlContext() method populates a JEXL context with the LoginRequest. This allows us to access fields and methods of the request using JEXL expressions (e.g., request.type == 'admin').

(8) We now have everything we need to define the WorkflowService:
public class LoginService
        extends
        AbstractWorkflowService<LoginWorkflow, LoginTask, LoginRequest, LoginContext> {
    @Override
    protected LoginContext createContext() {
        return new LoginContext();
    }
}
Here, we override an Abstract Factory method for creating an instance of the LoginContext.

Implementing the Tasks

Now that we have the infrastructure for our workflow service, the next stage is to define the actual tasks that comprise the workflows.

(1) We create a simple task that greets the user being authenticated:
public class GreetUserTask extends AbstractLoginTask {
    @Override
    protected TaskStatus attemptExecute(LoginRequest request,
            LoginContext context) throws WorkflowException {
        System.out
                .println(String.format("Welcome '%s'!", 
                        request.getUsername()));
        return TaskStatus.CONTINUE;
    }
}

(2) We then define the task which authenticates admin accounts:
public class AuthenticateAdmin extends AbstractLoginTask {
    @Override
    protected TaskStatus attemptExecute(LoginRequest request,
            LoginContext context) throws WorkflowException {
        if ("admin".equals(request.getUsername())
                && "p@ssw0rd".equals(request.getPassword())) {
            System.out.println(String.format(
                    "User '%s' has been authenticated as Administrator",
                    request.getUsername()));
            context.setIsAuthenticated(Boolean.TRUE);
            return TaskStatus.CONTINUE;
        } else {
            System.err.println(String.format("Cannot authenticate user '%s'!",
                    request.getUsername()));
            context.setIsAuthenticated(Boolean.FALSE);
            return TaskStatus.BREAK;
        }
    }
}
Normally, this task would perform authentication against a data source. For this example, we are only simulating a scenario where admin and user accounts are authenticated differently.

(3) We then define the task which authenticates user accounts:
public class AuthenticateUser extends AbstractLoginTask {
    @Override
    protected TaskStatus attemptExecute(LoginRequest request,
            LoginContext context) throws WorkflowException {
        if ("user".equals(request.getUsername())
                && "p@ssw0rd".equals(request.getPassword())) {
            System.out.println(String.format(
                    "User '%s' has been authenticated", request.getUsername()));
            context.setIsAuthenticated(Boolean.TRUE);
            return TaskStatus.CONTINUE;
        } else {
            System.err.println(String.format("Cannot authenticate user '%s'!",
                    request.getUsername()));
            context.setIsAuthenticated(Boolean.FALSE);
            return TaskStatus.BREAK;
        }
    }
}

Spring Integration

We now have all the components we need to build the application. It's time to wire them all in Spring.

(1) We start with the tasks. It is good practice to provide a separate configuration file for tasks; this keeps our configuration manageable in case the number of tasks grows.
<beans>
    <bean id="greet" 
        class="com.jramoyo.flowee.sample.login.task.GreetUserTask" />
    <bean id="authenticate_user" 
        class="com.jramoyo.flowee.sample.login.task.AuthenticateUser" />
    <bean id="authenticate_admin" 
        class="com.jramoyo.flowee.sample.login.task.AuthenticateAdmin" />
</beans>
Note that we are not wiring these tasks to any of our components. The Flowee Spring Module (flowee-spring) comes with an implementation of TaskRegistry that looks up task instances from the Spring context.

(2) We then define our main Spring configuration file.
<beans>
    <import resource="classpath:spring-tasks.xml" />
    <bean id="workflowFactory" 
        class="com.jramoyo.flowee.sample.login.LoginWorkflowFactory">
        <property name="filter">
            <bean 
                class="com.jramoyo.flowee.sample.login.LoginFilter" />
        </property>
        <property name="taskRegistry">
            <bean 
                class="com.jramoyo.flowee.spring.ContextAwareTaskRegistry" />
        </property>
    </bean>
    <bean id="workflowService" 
        class="com.jramoyo.flowee.sample.login.LoginService">
        <property name="factory" ref="workflowFactory" />
    </bean>
</beans>
Here, we declare our LoginWorkflowFactory as a dependency of LoginService. LoginWorkflowFactory is then wired with LoginFilter and ContextAwareTaskRegistry. As mentioned in the previous step, ContextAwareTaskRegistry allows our factory to look up task instances from the Spring context.

Workflow Configuration

The last stage is to configure the WorkflowFactory to assemble the workflows required for an account type.

Recall from the previous steps that we are using an instance of AbstractPropertiesWorkflowFactory. This loads its configuration from two properties files: one for the workflow conditions (workflow.properties) and another for the workflow tasks (task.properties).

(1) We create the workflow.properties file with the following content:
admin=request.type == 'admin'
user=request.type == 'user'
This configuration means that if LoginRequest.getType() returns 'admin', the workflow named 'admin' is executed; and if it returns 'user', the workflow named 'user' is executed.

(2) Then, we create the task.properties file:
admin=greet,authenticate_admin
user=greet,authenticate_user
This configures the sequence of tasks that make up the workflow.

By default, both workflow.properties and task.properties are loaded from the classpath. This can be overridden via setWorkflowConfigFile() and setTaskConfigFile() respectively.

That's all there is to it! From Spring, we can load LoginService and inject it anywhere within the application.
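As a minimal sketch (assuming spring.xml is on the classpath, as in the test below), bootstrapping and invoking the service looks something like this:
// Hypothetical bootstrap: load the Spring context, fetch the service
// bean, and process a login request.
ApplicationContext context =
        new ClassPathXmlApplicationContext("spring.xml");
LoginService service = (LoginService) context.getBean("workflowService");

LoginRequest request = new LoginRequest();
request.setUsername("admin");
request.setPassword("p@ssw0rd");
request.setType("admin");

LoginContext result = service.process(request);
System.out.println("Authenticated: " + result.getIsAuthenticated());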

Notice that while we created several components to build our service, most of them were simple subclasses requiring very few lines of code. Also, we only need to build these components once per service.

You'll find that the value of Flowee becomes more evident as the number of workflows and tasks increases.

Testing

We create a simple JUnit test case:
@RunWith(SpringJUnit4ClassRunner.class)
@ContextConfiguration("classpath:/spring.xml")
public class LoginServiceTest {

    @Resource(name = "workflowService")
    private LoginService service;

    @Test
    public void testAdminLogin() throws WorkflowException {
        LoginRequest request = new LoginRequest();
        request.setUsername("admin");
        request.setPassword("p@ssw0rd");
        request.setType("admin");

        LoginContext context = service.process(request);
        Assert.assertTrue("Incorrect result", context.getIsAuthenticated());

        request.setPassword("wrong");
        context = service.process(request);
        Assert.assertFalse("Incorrect result", context.getIsAuthenticated());
    }

    @Test
    public void testUserLogin() throws WorkflowException {
        LoginRequest request = new LoginRequest();
        request.setUsername("user");
        request.setPassword("p@ssw0rd");
        request.setType("user");

        LoginContext context = service.process(request);
        Assert.assertTrue("Incorrect result", context.getIsAuthenticated());

        request.setPassword("wrong");
        context = service.process(request);
        Assert.assertFalse("Incorrect result", context.getIsAuthenticated());
    }
}
Notice how we injected LoginService to our unit test.

Our test yields the following output:
Welcome 'admin'!
User 'admin' has been authenticated as Administrator

Welcome 'admin'!
Cannot authenticate user 'admin'!

Welcome 'user'!
User 'user' has been authenticated

Welcome 'user'!
Cannot authenticate user 'user'!

The code used in this example is available as a Maven module from Flowee's Git repository.

So far, I've only demonstrated simple workflows with linear task sequences. My next post will introduce special tasks which allow for more complex task sequences.

Introducing Flowee, a Framework for Building Workflow-based Services in Java

Overview

My past roles required me to write three different applications having surprisingly similar requirements. These requirements were:
  1. The application must run as a service which receives some form of request
  2. Once received, several stages of processing need to be performed on the request (i.e. validation, persistence, state management, etc.)
  3. These stages of processing may change over time as new requirements come in
The third requirement is arguably the most important. While the first two can be met without special techniques or design, the third calls for a bit of planning.

In order to achieve this required flexibility, adding new processing logic or modifying existing logic should not affect the code of other tasks. For example: if I add new validation logic, my existing code for state management should not be affected.

I needed to encapsulate each piece of logic as an individual unit of work, a "task". This way, changes to one task are independent of the others.

The behavior of my service is now defined by the sequence of tasks to be performed. This behavior can easily be changed by adding, removing, or modifying tasks in the sequence.

Essentially, I needed to build a workflow.

Design

My design had several iterations throughout the years. My initial design was pretty straightforward: I implemented a workflow as a container of one or more tasks arranged in a sequence. When executed, the workflow iterates through each task and executes it. The workflow has no knowledge of what each task does; all it knows is that it is executing some task. This was made possible because each task shares a common interface. This interface exposes an execute() method which accepts the request as a parameter.
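As a rough illustration (hypothetical names, not the actual code), the initial design boiled down to something like this:
import java.util.List;

// Illustrative sketch of the initial, application-specific design:
// a workflow is an ordered list of tasks behind a common interface.
class MyRequest { /* application-specific request fields */ }

interface Task {
    void execute(MyRequest request);
}

class Workflow {
    private final List<Task> tasks;

    Workflow(List<Task> tasks) {
        this.tasks = tasks;
    }

    void execute(MyRequest request) {
        // The workflow has no knowledge of what each task does;
        // it simply runs them in sequence.
        for (Task task : tasks) {
            task.execute(request);
        }
    }
}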

Through dependency injection, the behavior of the workflow becomes very easy to change. I can add, remove and rearrange tasks via configuration.

While it proved effective, my initial design was application specific -- the workflow could only accept a specific type of request. This made it ineffective as a framework because it only worked for that particular application. There was also the problem of tasks not being able to share information; as a result, temporary values had to be stored within the request itself.

I had the chance to improve on this design on a later project. By applying Generics, I was able to redesign my workflow so that it can accept any type of request.

In order for information to be shared across tasks, a data structure serving as the workflow "context" is also passed as a parameter to each task.
Figure 1: Workflow and Tasks
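A simplified sketch of the generified design (again hypothetical, not Flowee's exact signatures) shows both changes -- the request becomes a type parameter, and a shared context is threaded through every task:
import java.util.HashMap;
import java.util.List;

// Any request type R can now flow through the workflow, and tasks
// share information through a common map-like context.
class WorkflowContext extends HashMap<String, Object> { }

interface Task<R> {
    void execute(R request, WorkflowContext context);
}

class Workflow<R> {
    private final List<Task<R>> tasks;

    Workflow(List<Task<R>> tasks) {
        this.tasks = tasks;
    }

    void execute(R request, WorkflowContext context) {
        for (Task<R> task : tasks) {
            task.execute(request, context);
        }
    }
}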

More often than not, different workflows need to be executed for different types of request. For example, one workflow needs to be executed to open an account, another workflow to close an account, etc.

I came up with a "workflow factory" as a solution to this requirement. Depending on the type of request, the factory will assemble the workflows required to be executed against the request. The factory is exposed as an interface so that I can have different implementations depending on the requirement.
Figure 2: Workflow Factory

My service now becomes very simple: as soon as a request comes in, I will call the factory to assemble the required workflows then one-by-one execute my request against each of them.
Figure 3: Workflow Service
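Continuing the sketch, the factory and the service are thin layers on top (hypothetical signatures once more):
import java.util.List;

// The factory selects and assembles the workflows that apply to a request.
interface WorkflowFactory<R> {
    List<Workflow<R>> getWorkflows(R request);
}

// The service is a facade: ask the factory for the applicable workflows,
// then execute the request against each of them in turn.
class WorkflowService<R> {
    private final WorkflowFactory<R> factory;

    WorkflowService(WorkflowFactory<R> factory) {
        this.factory = factory;
    }

    WorkflowContext process(R request) {
        WorkflowContext context = new WorkflowContext();
        for (Workflow<R> workflow : factory.getWorkflows(request)) {
            workflow.execute(request, context);
        }
        return context;
    }
}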

Flowee Framework

Flowee is an open source implementation of the above design. Having gone through several iterations, I feel that this design has matured enough to warrant a common framework that can be useful to others.

The project is hosted via GitHub at https://github.com/jramoyo/flowee.

Workflow and Tasks

The core framework revolves around the Workflow interface and its relationship with the Task interface.
Figure 4: Flowee Workflow

The execute() method of Workflow accepts both the request and an instance of WorkflowContext. Errors encountered while processing the request are thrown as WorkflowExceptions.

AbstractWorkflow provides a generic implementation of Workflow. It iterates through a list of associated Tasks to process a request. Application specific workflows will inherit from this class.

Most workflow Tasks will inherit from AbstractTask. It provides useful features for building application-specific tasks, namely:
  • Retry task execution when an exception is encountered
  • Silently skip task execution instead of throwing an exception
  • Skip task execution depending on the type of request

WorkflowFactory

The WorkflowFactory is another important piece of the framework. The way it abstracts the selection and creation of workflows simplifies the rest of the core components.
Figure 5: Flowee WorkflowFactory

AbstractConfigurableWorkflowFactory is used to build configuration-driven workflow factories. It defines an abstract fetchConfiguration() method and an abstract fetchTaskNames() method that sub-classes need to implement. These methods are used to fetch configuration from various sources, such as the file system or a remote server.

The configuration is represented as a Map whose key is the name of a workflow and whose value is the condition which activates that workflow.

AbstractConfigurableWorkflowFactory uses a Filter instance to evaluate the conditions configured to activate the workflows.

AbstractPropertiesWorkflowFactory is a sub-class of AbstractConfigurableWorkflowFactory that fetches configuration from a properties file.

WorkflowService

WorkflowService and AbstractWorkflowService act as a facade linking the core components together.
Figure 6: Flowee WorkflowService
With all the complexities taken care of by both the Workflow and the WorkflowFactory, our WorkflowService implementation becomes very simple.

While most applications will interact with workflows through the WorkflowService, those requiring a different behavior can interact with the underlying components directly.

Conclusion

The primary purpose of Flowee is to provide groundwork for rule-driven workflow selection and execution. Developers can focus the majority of their efforts on building the tasks which hold the actual business requirements.

Workflows built on Flowee will run without the need for "containers" or "engines". The framework is lightweight and integrates seamlessly with any application.

This post discussed the design considerations which led to the implementation of Flowee. It also described the structure of the core framework. My next post will demonstrate how easy it is to build workflow-based services with Flowee by going through a sample application.

Saturday, March 9, 2013

Static Methods and Unit Testing

Overview

We all know that Interfaces allow us to write loosely-coupled components, and that this loose coupling comes in handy during unit testing. Because we can separate the implementation from the method signatures, Interfaces allow us to mock implementations of our dependencies.

Mocking dependencies is important during unit testing because it allows us to isolate the components we are testing from their dependencies -- this means that incorrect behavior from any dependency will not affect the results of our tests.

Consider the following class:
public class Component {
    private Dependency dependency;

    public Component(Dependency dependency) {
        this.dependency = dependency;
    }

    public void componentMethod() {
        int importantValue = dependency.getImportantValue();
        // use 'importantValue' to calculate stuff
    }
}
And the following Interface:
public interface Dependency {
    int getImportantValue();
}
Whose default implementation is defined as:
public class DefaultDependency implements Dependency {
    public int getImportantValue() {
        int value = // fetch important value from external service
        return value;
    }
}
In order for us to test componentMethod(), we'd have to set up our unit test environment to allow connections to the external service; and if this external service fails, our unit test would fail as well.

Mocking the dependency allows us to execute our unit test without the need for an external service:
public class MockedDependency implements Dependency {
    public int getImportantValue() {
        // mocked value
        return 1;
    }
}
Because we are providing a simple and consistent implementation, we are assured that our dependency always returns the correct value and therefore would not compromise the results of our tests.

Mockito

Mockito is a mocking framework which simplifies mocking. It allows us to mock dependencies with very little code and to customize our mocked behavior depending on our testing scenario:
// without mocking
Component component = new Component(new DefaultDependency());

// with mocking
Component component = new Component(new MockedDependency());

// with Mockito
Dependency dependency = Mockito.mock(Dependency.class);
Mockito.when(dependency.getImportantValue()).thenReturn(1);
Component component = new Component(dependency);
A great thing about Mockito is that it also supports mocking concrete classes.

Note that in the above examples, we are assuming that unit tests were also written for DefaultDependency. Without this, we cannot guarantee the overall correctness of the application.

Static Methods

The point that I'm trying to make is that abstract methods provide us with the flexibility to mock our implementations. Using mocking frameworks such as Mockito, we can even extend this flexibility to concrete methods. The same cannot be said for static methods.

Because they are associated with the class rather than with an instance, static methods cannot be overridden. In fact, many static utility classes in Java are marked as final. And because they cannot be overridden, it is impossible to mock static method implementations.

Suppose that in our previous example, Dependency was implemented as a static utility class:
public class Component {
    public void componentMethod() {
        int importantValue = Dependency.getImportantValue();
        // use 'importantValue' to calculate stuff
    }
}

public final class Dependency {
    public static int getImportantValue() {
        int value = // fetch important value from external service
        return value;
    }
}
Because we cannot override getImportantValue() with a mocked implementation, there is simply no way for us to test componentMethod() without requiring a connection to the external service.

Singletons

You might have come across statements saying that "Singletons are evil". That is partly true, depending on how you create and use Singletons.

Suppose that in our previous example, Dependency was implemented and used as a classic Singleton:
public class Component {
    public void componentMethod() {
        // classic use of Singleton
        int importantValue = Dependency.getInstance().getImportantValue();
        // use 'importantValue' to calculate stuff
    }
}

public final class Dependency {
    private static final Dependency instance = new Dependency();
    
    private Dependency(){}
    
    public static Dependency getInstance() {
        return instance;
    }

    public int getImportantValue() {
        int value = // fetch important value from external service
        return value;
    }
}
Because getInstance() is a static method, all the evils associated with static methods apply to the Singleton as well (also applicable to Singletons implemented as enums).
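For reference, a minimal sketch of the enum variant, following the same Dependency example:
public enum Dependency {
    INSTANCE;

    public int getImportantValue() {
        int value = 0; // fetch important value from external service
        return value;
    }
}
INSTANCE is still accessed statically, so it is just as impossible to mock without resorting to bytecode manipulation.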

Obviously not all Singletons are evil. With slight modifications, we can fix what's evil about our previous implementation:
public interface Dependency {
    int getImportantValue();
}

public final class DefaultDependency implements Dependency {
    private static final Dependency instance = new DefaultDependency();
    
    private DefaultDependency(){}
    
    public static Dependency getInstance() {
        return instance;
    }

    public int getImportantValue() {
        int value = // fetch important value from external service
        return value;
    }
}
By making Dependency an Interface and only applying the Singleton pattern to its default implementation, our Component class can be implemented exactly as the original version:
public class Component {
    private Dependency dependency;

    public Component(Dependency dependency) {
        this.dependency = dependency;
    }

    public void componentMethod() {
        int importantValue = dependency.getImportantValue();
        // use 'importantValue' to calculate stuff
    }
}
Once again making it unit-testable:
// without mocking, acquire Singleton instance
Component component = new Component(DefaultDependency.getInstance());

// with mocking
Component component = new Component(new MockedDependency());

// with Mockito
Dependency dependency = Mockito.mock(Dependency.class);
Mockito.when(dependency.getImportantValue()).thenReturn(1);
Component component = new Component(dependency);
The above is an example of Inversion of Control (IoC). Instead of acquiring the Dependency instance from within the Component class, we let the container decide which instance of Dependency to assign to Component.

Can you think of another popular pattern with a tendency to have the same issues described above? Hint: it's called Service Locator.

Conclusion

Static methods have their use. But because of their impact on unit testing, caution must be applied before using them. Personally, I limit my use of static methods to utility classes with small and unchanging logic (example: org.apache.commons.io.IOUtils).

If required to use static factory methods, applying Inversion of Control should help enforce unit-testability.

Sunday, February 24, 2013

RestTemplate with Google Places API

My website, CheckTheCrowd.com, was initially using the Google Maps JavaScript API (Places Library) to fetch details on the various places submitted to the website.

Place details such as names, addresses, and photos are normally displayed as content. Because this content was dynamically loaded via JavaScript, it wasn't visible to web crawlers and hence couldn't be read as keywords.

In order for web crawlers to access the place details, they needed to be included as part of the HTML generated by the Servlet. This meant that rather than fetch the place details from the browser via JavaScript, I needed to fetch them from the web server.

Place Details - Google Places API

Under the hood, the JavaScript Places Library calls a REST service to fetch the details of a particular place. I needed to call the same service from the web server in order to deliver the place details as part of the Servlet content.

The Place Details REST service is a GET call to the following resource:
https://maps.googleapis.com/maps/api/place/details/output?parameters
Here, output can either be JSON (json) or XML (xml). The resource requires three parameters: the API key (key), the place identifier (reference), and the sensor flag (sensor).

For example:
https://maps.googleapis.com/maps/api/place/details/json?reference=12345
   &sensor=false&key=54321

RestTemplate - Spring Web

Starting with version 3.0, the Spring Web module comes with a class called RestTemplate. Similar to other Spring templates, RestTemplate reduces the boilerplate code normally involved in calling REST services.

RestTemplate supports common HTTP methods such as GET, POST, DELETE, PUT, etc. Objects passed to and returned from these methods are converted by HttpMessageConverters. Default converters are registered against the MIME type and custom converters are also supported.

RestTemplate and Place Details

RestTemplate exposes a method called getForObject to support GET method calls. It accepts a String representing the URL template, a Class for the return type, and a variable String array to populate the template.

I started my implementation by creating a class called GooglePlaces. I then declared the URL template as a constant and RestTemplate as an instance member injected by the Spring container. My Google Places API key was also declared as an instance member, this time populated by Spring from a properties file:
private static final String PLACE_DETAILS_URL = 
    "https://maps.googleapis.com/maps/api/place/details/json?reference" 
        + "={searchId}&sensor=false&key={key}";

@Value("${api.key}")
private String apiKey;

@Inject
private RestTemplate restTemplate;
The above code should be enough to call the Place Details service and get the response as a JSON string:
String json = restTemplate.getForObject(PLACE_DETAILS_URL, 
   String.class, "12345", apiKey);
However, the JSON response needs to be converted to a Java object to be of practical use.

By default, RestTemplate supports JSON to Java conversion via MappingJacksonHttpMessageConverter. All I needed to do was create Java objects which map to the Place Details JSON response.
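For completeness, a minimal declaration of the injected RestTemplate might look like this (a sketch; it assumes Jackson is on the classpath so that the default converters include JSON support):
@Bean
public RestTemplate restTemplate() {
    // The default constructor registers the standard message converters,
    // including MappingJacksonHttpMessageConverter when Jackson is present.
    return new RestTemplate();
}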

Java Mapping

I referred to the Place Details reference guide for a sample of the JSON response that I needed to map to Java. Because the Place Details response includes other information that I didn't need for CheckTheCrowd, I added annotations to my classes telling the converter to ignore unmapped properties:
@JsonIgnoreProperties(ignoreUnknown = true)
public static class PlaceDetailsResponse {
    @JsonProperty("result")
    private PlaceDetails result;

    public PlaceDetails getResult() {
        return result;
    }

    public void setResult(PlaceDetails result) {
        this.result = result;
    }
}
The above class represents the top-level response object. It is simply a container for the result property.

The below class represents the result:
@JsonIgnoreProperties(ignoreUnknown = true)
public static class PlaceDetails {
    @JsonProperty("name")
    private String name;

    @JsonProperty("icon")
    private String icon;

    @JsonProperty("url")
    private String url;

    @JsonProperty("formatted_address")
    private String address;

    @JsonProperty("geometry")
    private PlaceGeometry geometry;

    @JsonProperty("photos")
    private List<PlacePhoto> photos = Collections.emptyList();

    // Getters and setters...
}
I also needed the longitude and latitude information, as well as the photos. Below are the classes for the geometry and photo properties which contain this information:
@JsonIgnoreProperties(ignoreUnknown = true)
public static class PlaceGeometry {
    @JsonProperty("location")
    private PlaceLocation location;

    public PlaceLocation getLocation() {
        return location;
    }

    public void setLocation(PlaceLocation location) {
        this.location = location;
    }
}

@JsonIgnoreProperties(ignoreUnknown = true)
public static class PlaceLocation {
    @JsonProperty("lat")
    private String lat;

    @JsonProperty("lng")
    private String lng;
    
    // Getters and setters
}

@JsonIgnoreProperties(ignoreUnknown = true)
public static class PlacePhoto {
    @JsonProperty("photo_reference")
    private String reference;

    public String getReference() {
        return reference;
    }

    public void setReference(String reference) {
        this.reference = reference;
    }
}
With the above Java mappings, I can now expose a method which returns an instance of PlaceDetails given a place reference:
public PlaceDetails getPlaceDetails(String searchId) {
    PlaceDetailsResponse response 
        = restTemplate.getForObject(PLACE_DETAILS_URL, 
            PlaceDetailsResponse.class, searchId, apiKey);
    if (response.getResult() != null) {
        return response.getResult();
    } else {
        return null;
    }
}

Caching

The moment I deployed my changes to Tomcat, I noticed significant latency between server requests. This was expected because the server now had to make several calls to the Place Details service before returning a response.

This is exactly the kind of scenario where a good caching strategy helps. It is worth noting, however, that the Google Maps API Terms of Service (10.1.3.b) has strict rules regarding caching: caching should only be done to improve performance, and data can only be cached for up to 30 calendar days.

CheckTheCrowd uses Guava, which includes a pretty good API for in-memory caching. Using a CacheLoader, I can seamlessly integrate a Guava cache into my code:
private LoadingCache<String, PlaceDetails> placeDetails 
        = CacheBuilder.newBuilder().maximumSize(1000).expireAfterAccess(24, TimeUnit.HOURS)
            .build(new CacheLoader<String, PlaceDetails>() {   
                public PlaceDetails load(String searchId) throws Exception {
                    PlaceDetailsResponse response = restTemplate.getForObject(PLACE_DETAILS_URL, 
                        PlaceDetailsResponse.class, searchId, apiKey);
                    if (response.getResult() != null) {
                        return response.getResult();
                    } else {
                        throw new PlacesException("Unable to find details for reference: " + searchId);
                    }
                }
            });
I set a cache size of 1000 and an expiry of 24 hours, and moved the call to the Place Details service into the CacheLoader's load() method. I then updated my method to refer to the cache instead:
public PlaceDetails getPlaceDetails(String searchId) {
    try {
        return placeDetails.get(searchId);
    } catch (ExecutionException e) {
        logger.warn("An exception occurred while "
            + "fetching place details!", e.getCause());
        return null;
    }
}
Unfortunately, I wasn't able to measure the exact latency prior to applying the cache. I was, however, very pleased with the noticeable improvement I got after applying it.

The complete source is available from Google Code under Apache License 2.0.

Monday, February 11, 2013

Generating Sitemaps Using Spring Batch and SitemapGen4j

I recently launched a website called CheckTheCrowd.com. In order for search engines to effectively crawl my content, I needed a sitemap.

Since my content is mostly generated from the database, I needed to find a way to dynamically generate my sitemap.

Most answers I got from online forums suggested exposing a URL which, when accessed, generates the sitemap. With Spring MVC, it goes something like this:
@RequestMapping("/sitemap.xml")
public @ResponseBody String generateSitemap() {
    String sitemap = // generate expensive XML String
    return sitemap;
}
The problem with this approach is that it doesn't scale: the more content you have, the longer it takes to generate the sitemap. And because the sitemap is regenerated every time the URL is accessed, precious server resources are wasted.

Another suggestion was to append an entry to the sitemap every time new content is added to the database. I did not like this approach because it would be difficult to keep the sitemap under source control. Also, accidentally deleting the sitemap would mean the data is gone forever.

Batch Job Approach

Eventually, I ended up doing something similar to the first suggestion. However, instead of generating the sitemap every time the URL is accessed, I generate it from a batch job.

With this approach, I get to schedule how often the sitemap is generated. And because generation happens outside of an HTTP request, I can afford to let it take longer to complete.

Having previous experience with the framework, I found Spring Batch the obvious choice. It provides a framework for building batch jobs in Java. Spring Batch works on the idea of "chunk processing", wherein huge sets of data are divided and processed as chunks.

I then searched for a Java library for writing sitemaps and came up with SitemapGen4j. It provides an easy-to-use API and is released under Apache License 2.0.

Requirements

My requirements are simple: I have a couple of static web pages which can be hard-coded into the sitemap. I also have pages for each place submitted to the website; each place is stored as a single row in the database and is identified by a unique ID. There are also pages for each registered user; similar to the places, each user is stored as a single row and is identified by a unique ID.

A job in Spring Batch is composed of one or more "steps". A step encapsulates the processing that needs to be executed against a set of data.

I identified 4 steps for my job:
  1. Add static pages to the sitemap
  2. Add place pages to the sitemap
  3. Add profile pages to the sitemap
  4. Write the sitemap XML to a file
Step 1

Because it does not involve processing a set of data, my first step can be implemented directly as a simple Tasklet:
public class StaticPagesInitializerTasklet implements Tasklet {
    private static final Logger logger 
            = LoggerFactory.getLogger(StaticPagesInitializerTasklet.class);

    private final String rootUrl;

    @Inject
    private WebSitemapGenerator sitemapGenerator;

    public StaticPagesInitializerTasklet(String rootUrl) {
        this.rootUrl = rootUrl;
    }

    @Override
    public RepeatStatus execute(StepContribution contribution, 
            ChunkContext chunkContext) throws Exception {
        logger.info("Adding URL for static pages...");
        sitemapGenerator.addUrl(rootUrl);
        sitemapGenerator.addUrl(rootUrl + "/terms");
        sitemapGenerator.addUrl(rootUrl + "/privacy");
        sitemapGenerator.addUrl(rootUrl + "/attribution");

        logger.info("Done.");
        return RepeatStatus.FINISHED;
    }

    public void setSitemapGenerator(WebSitemapGenerator sitemapGenerator) {
        this.sitemapGenerator = sitemapGenerator;
    }
}
The starting point of a Tasklet is its execute() method. Here, I add the URLs of the known static pages of CheckTheCrowd.com.

Step 2

The second step requires places data to be read from the database and subsequently written to the sitemap.

This is a common requirement and Spring Batch provides built-in Interfaces to help perform these types of processing:
  • ItemReader - Reads a chunk of data from a source; each data is considered an item. In my case, an item represents a place.
  • ItemProcessor - Transforms the data before writing. This is optional and is not used in this example.
  • ItemWriter - Writes a chunk of data to a destination. In my case, I add each place to the sitemap.
The Spring Batch API includes a class called JdbcCursorItemReader, an implementation of ItemReader which continuously reads rows from a JDBC ResultSet. It requires a RowMapper, which is responsible for mapping database rows to batch items.

For this step, I declare a JdbcCursorItemReader in my Spring configuration and set my implementation of RowMapper:
@Bean
public JdbcCursorItemReader<PlaceItem> placeItemReader() {
    JdbcCursorItemReader<PlaceItem> itemReader 
            = new JdbcCursorItemReader<>();
    itemReader.setSql(environment
            .getRequiredProperty(PROP_NAME_SQL_PLACES));
    itemReader.setDataSource(dataSource);
    itemReader.setRowMapper(new PlaceItemRowMapper());
    return itemReader;
}
The call to setSql() provides the SQL statement used to query the ResultSet; in my case, the statement is fetched from a properties file. setDataSource() supplies the JDBC DataSource, and setRowMapper() registers my implementation of RowMapper.
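For reference, a RowMapper implementation is a single method. A sketch of PlaceItemRowMapper (assuming PlaceItem exposes setters for the API ID and search ID used by the writer below, and with hypothetical column names) might look like:
public class PlaceItemRowMapper implements RowMapper<PlaceItem> {
    @Override
    public PlaceItem mapRow(ResultSet rs, int rowNum) throws SQLException {
        // 'api_id' and 'search_id' are assumed column names;
        // they depend on the actual schema.
        PlaceItem item = new PlaceItem();
        item.setApiId(rs.getString("api_id"));
        item.setSearchId(rs.getString("search_id"));
        return item;
    }
}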

Next, I write my implementation of ItemWriter:
public class PlaceItemWriter implements ItemWriter<PlaceItem> {
    private static final Logger logger 
        = LoggerFactory.getLogger(PlaceItemWriter.class);

    private final String rootUrl;

    @Inject
    private WebSitemapGenerator sitemapGenerator;

    public PlaceItemWriter(String rootUrl) {
        this.rootUrl = rootUrl;
    }

    @Override
    public void write(List<? extends PlaceItem> items) throws Exception {
        String url;
        for (PlaceItem place : items) {
            url = rootUrl + "/place/" + place.getApiId() + "?searchId=" + place.getSearchId();

            logger.info("Adding URL: " + url);
            sitemapGenerator.addUrl(url);
        }
    }

    public void setSitemapGenerator(WebSitemapGenerator sitemapGenerator) {
        this.sitemapGenerator = sitemapGenerator;
    }
}
Places in CheckTheCrowd.com are accessible from URLs having this pattern: checkthecrowd.com/place/{placeId}?searchId={searchId}. My ItemWriter simply iterates through the chunk of PlaceItems, builds the URL, then adds the URL to the sitemap.

Step 3

The third step is exactly the same as the previous one, but this time processing is done on user profiles.

Below is my ItemReader declaration:
@Bean
public JdbcCursorItemReader<ProfileItem> profileItemReader() {
    JdbcCursorItemReader<ProfileItem> itemReader 
            = new JdbcCursorItemReader<>();
    itemReader.setSql(environment
            .getRequiredProperty(PROP_NAME_SQL_PROFILES));
    itemReader.setDataSource(dataSource);
    itemReader.setRowMapper(new ProfileItemRowMapper());
    return itemReader;
}
Below is my ItemWriter implementation:
public class ProfileItemWriter implements ItemWriter<ProfileItem> {
    private static final Logger logger 
            = LoggerFactory.getLogger(ProfileItemWriter.class);
    
    private final String rootUrl;

    @Inject
    private WebSitemapGenerator sitemapGenerator;

    public ProfileItemWriter(String rootUrl) {
        this.rootUrl = rootUrl;
    }

    @Override
    public void write(List<? extends ProfileItem> items) 
            throws Exception {
        String url;
        for (ProfileItem profile : items) {
            url = rootUrl + "/profile/" + profile.getUsername();

            logger.info("Adding URL: " + url);
            sitemapGenerator.addUrl(url);
        }
    }

    public void setSitemapGenerator(WebSitemapGenerator sitemapGenerator) {
        this.sitemapGenerator = sitemapGenerator;
    }
}
Profiles in CheckTheCrowd.com are accessed from URLs having this pattern: checkthecrowd.com/profile/{username}.

Step 4

The last step is fairly straightforward and is also implemented as a simple Tasklet:
public class XmlWriterTasklet implements Tasklet {
    private static final Logger logger = 
            LoggerFactory.getLogger(XmlWriterTasklet.class);

    @Inject
    private WebSitemapGenerator sitemapGenerator;

    @Override
    public RepeatStatus execute(StepContribution contribution, 
            ChunkContext chunkContext) throws Exception {
        logger.info("Writing sitemap.xml...");
        sitemapGenerator.write();

        logger.info("Done.");
        return RepeatStatus.FINISHED;
    }
}
Notice that I am using the same instance of WebSitemapGenerator across all the steps. It is declared in my Spring configuration as:
@Bean
public WebSitemapGenerator sitemapGenerator() throws Exception {
    String rootUrl = environment
            .getRequiredProperty(PROP_NAME_ROOT_URL);
    String deployDirectory = environment
            .getRequiredProperty(PROP_NAME_DEPLOY_PATH);
    return WebSitemapGenerator.builder(rootUrl, 
            new File(deployDirectory)).allowMultipleSitemaps(true)
                    .maxUrls(1000).build();
}
Because they change between environments (dev vs. prod), rootUrl and deployDirectory are both configured from a properties file.

Wiring them all together...
 
<beans>
    <context:component-scan 
        base-package="com.checkthecrowd.batch.sitemapgen.config" />
    
    <bean 
        class="com.checkthecrowd.batch.sitemapgen.config.SitemapGenConfig" />
    <bean 
        class="org.springframework.config.java.process.ConfigurationPostProcessor" />
    
    <batch:job id="generateSitemap" job-repository="jobRepository">
        <batch:step id="insertStaticPages" next="insertPlacePages">
            <batch:tasklet ref="staticPagesInitializerTasklet" />
        </batch:step>
        <batch:step id="insertPlacePages" parent="abstractParentStep" 
            next="insertProfilePages">
            <batch:tasklet>
                <batch:chunk reader="placeItemReader" 
                    writer="placeItemWriter" />
            </batch:tasklet>
        </batch:step>
        <batch:step id="insertProfilePages" parent="abstractParentStep" 
            next="writeXml">
            <batch:tasklet>
                <batch:chunk reader="profileItemReader" 
                    writer="profileItemWriter" />
            </batch:tasklet>
        </batch:step>
        <batch:step id="writeXml">
            <batch:tasklet ref="xmlWriterTasklet" />
        </batch:step>
    </batch:job>
    
    <batch:step id="abstractParentStep" abstract="true">
        <batch:tasklet>
            <batch:chunk commit-interval="100" />
        </batch:tasklet>
    </batch:step>
</beans>
The abstract step declared at the bottom (abstractParentStep) serves as the common parent for steps 2 and 3. It sets a property called commit-interval, which defines how many items comprise a chunk; in this case, a chunk is composed of 100 items.

There is a lot more to Spring Batch; kindly refer to the official reference guide.