on software development (and stuff)

Generating Sitemaps Using Spring Batch and SitemapGen4j

7 comments
I recently launched a website called CheckTheCrowd.com. And in order for search engines to effectively crawl my content, I needed a sitemap.

Since my content is mostly generated from the database, I needed to find a way to dynamically generate my sitemap.

Most answers I got from online forums suggested exposing a URL which when accessed, generates the sitemap. With Spring MVC, it goes something like:
@RequestMapping("/sitemap.xml")
public @ResponseBody String generateSitemap() {
  String sitemap = generateExpensiveXml();
  return sitemap
}
The problem with this approach is that it doesn't scale. The more content you have, the longer it takes to generate the sitemap. And because the sitemap is generated every time the URL is accessed, precious server resources are wasted.

Another suggestion was to append an entry to the sitemap every time new content is added to the database. I did not like this approach because it would be difficult to do source control on the sitemap. Also, accidentally deleting the sitemap would mean that data is gone forever.

Batch Job Approach

Eventually, I ended-up doing something similar to the first suggestion. However, instead of generating the sitemap every time the URL is accessed, I ended-up generating the sitemap from a batch job.

With this approach, I get to schedule how often the sitemap is generated. And because generation happens outside of an HTTP request, I can afford a longer time for it to complete.

Having previous experience with the framework, Spring Batch was my obvious choice. It provides a framework for building batch jobs in Java. Spring Batch works with the idea of "chunk processing" wherein huge sets of data are divided and processed as chunks.

I then searched for a Java library for writing sitemaps and came-up with SitemapGen4j. It provides an easy to use API and is released under Apache License 2.0.

Requirements

My requirements are simple: I have a couple of static web pages which can be hard-coded to the sitemap. I also have pages for each place submitted to the web site; each place is stored as a single row in the database and is identified by a unique ID. There are also pages for each registered user; similar to the places, each user is stored as a single row and is identified by a unique ID.

A job in Spring Batch is composed of 1 or more "steps". A step encapsulates the processing needed to be executed against a set of data.

I identified 4 steps for my job:
  • Add static pages to the sitemap
  • Add place pages to the sitemap
  • Add profile pages to the sitemap
  • Write the sitemap XML to a file

Step 1

Because it does not involve processing a set of data, my first step can be implemented directly as a simple Tasklet:
public class StaticPagesInitializerTasklet implements Tasklet {
  private static final Logger logger = LoggerFactory.getLogger(StaticPagesInitializerTasklet.class);

  private final String rootUrl;

  @Inject
  private WebSitemapGenerator sitemapGenerator;

  public StaticPagesInitializerTasklet(String rootUrl) {
    this.rootUrl = rootUrl;
  }

  @Override
  public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
    logger.info("Adding URL for static pages...");
    sitemapGenerator.addUrl(rootUrl);
    sitemapGenerator.addUrl(rootUrl + "/terms");
    sitemapGenerator.addUrl(rootUrl + "/privacy");
    sitemapGenerator.addUrl(rootUrl + "/attribution");

    logger.info("Done.");
    return RepeatStatus.FINISHED;
  }

  public void setSitemapGenerator(WebSitemapGenerator sitemapGenerator) {
    this.sitemapGenerator = sitemapGenerator;
  }
}
The starting point of a Tasklet is the execute() method. Here, I add the URLs of the known static pages of CheckTheCrowd.com.

Step 2

The second step requires places data to be read from the database then subsequently written to the sitemap.

This is a common requirement, and Spring Batch provides built-in Interfaces to help perform these types of processing:
  • ItemReader - Reads a chunk of data from a source; each data is considered an item. In my case, an item represents a place.
  • ItemProcessor - Transforms the data before writing. This is optional and is not used in this example.
  • ItemWriter - Writes a chunk of data to a destination. In my case, I add each place to the sitemap.
The Spring Batch API includes a class called JdbcCursorItemReader, an implementation of ItemReader which continously reads rows from a JDBC ResultSet. It requires a RowMapper which is responsible for mapping database rows to batch items.

For this step, I declare a JdbcCursorItemReader in my Spring configuration and set my implementation of RowMapper:
@Bean
public JdbcCursorItemReader<PlaceItem> placeItemReader() {
  JdbcCursorItemReader<PlaceItem> itemReader = new JdbcCursorItemReader<>();
  itemReader.setSql(environment.getRequiredProperty(PROP_NAME_SQL_PLACES));
  itemReader.setDataSource(dataSource);
  itemReader.setRowMapper(new PlaceItemRowMapper());
  return itemReader;
}
Line 4 sets the SQL statement to query the ResultSet. In my case, the SQL statement is fetched from a properties file.

Line 5 sets the JDBC DataSource.

Line 6 sets my implementation of RowMapper.

Next, I write my implementation of ItemWriter:
public class PlaceItemWriter implements ItemWriter<PlaceItem> {
  private static final Logger logger = LoggerFactory.getLogger(PlaceItemWriter.class);

  private final String rootUrl;

  @Inject
  private WebSitemapGenerator sitemapGenerator;

  public PlaceItemWriter(String rootUrl) {
    this.rootUrl = rootUrl;
  }

  @Override
  public void write(List<? extends PlaceItem> items) throws Exception {
    String url;
    for (PlaceItem place : items) {
      url = rootUrl + "/place/" + place.getApiId() + "?searchId=" + place.getSearchId();

      logger.info("Adding URL: " + url);
      sitemapGenerator.addUrl(url);
    }
  }

  public void setSitemapGenerator(WebSitemapGenerator sitemapGenerator) {
    this.sitemapGenerator = sitemapGenerator;
  }
}
Places in CheckTheCrowd.com are accessible from URLs having this pattern: checkthecrowd.com/place/{placeId}?searchId={searchId}.  My ItemWriter simply iterates through the chunk of PlaceItems, builds the URL, then adds the URL to the sitemap.

Step 3

The third step is exactly the same as the previous, but this time processing is done on user profiles.

Below is my ItemReader declaration:
@Bean
public JdbcCursorItemReader<PlaceItem> profileItemReader() {
  JdbcCursorItemReader<PlaceItem> itemReader = new JdbcCursorItemReader<>();
  itemReader.setSql(environment.getRequiredProperty(PROP_NAME_SQL_PROFILES));
  itemReader.setDataSource(dataSource);
  itemReader.setRowMapper(new ProfileItemRowMapper());
  return itemReader;
}
Below is my ItemWriter implementation:
public class ProfileItemWriter implements ItemWriter<ProfileItem> {
  private static final Logger logger = LoggerFactory.getLogger(ProfileItemWriter.class);

  private final String rootUrl;

  @Inject
  private WebSitemapGenerator sitemapGenerator;

  public ProfileItemWriter(String rootUrl) {
    this.rootUrl = rootUrl;
  }

  @Override
  public void write(List<? extends ProfileItem> items) throws Exception {
    String url;
    for (ProfileItem profile : items) {
      url = rootUrl + "/profile/" + profile.getUsername();

      logger.info("Adding URL: " + url);
      sitemapGenerator.addUrl(url);
    }
  }

  public void setSitemapGenerator(WebSitemapGenerator sitemapGenerator) {
    this.sitemapGenerator = sitemapGenerator;
  }
}
Profiles in CheckTheCrowd.com are accessed from URLs having this pattern: checkthecrowd.com/profile/{username}.

Step 4

The last step is fairly straightforward and is also implemented as a simple Tasklet:
public class XmlWriterTasklet implements Tasklet {
  private static final Logger logger = LoggerFactory.getLogger(XmlWriterTasklet.class);

  @Inject
  private WebSitemapGenerator sitemapGenerator;

  @Override
  public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
    logger.info("Writing sitemap.xml...");
    sitemapGenerator.write();

    logger.info("Done.");
    return RepeatStatus.FINISHED;
  }
}
Notice that I am using the same instance of WebSitemapGenerator across all the steps. It is declared in my Spring configuration as:
@Bean
public WebSitemapGenerator sitemapGenerator() throws Exception {
  String rootUrl = environment.getRequiredProperty(PROP_NAME_ROOT_URL);
  String deployDirectory = environment.getRequiredProperty(PROP_NAME_DEPLOY_PATH);
  return WebSitemapGenerator.builder(rootUrl, new File(deployDirectory))
    .allowMultipleSitemaps(true).maxUrls(1000).build();
}
Because they change between environments (dev vs prod), rootUrl and deployDirectory are both configured from a properties file.

Wiring them all together...
<beans>
    <context:component-scan base-package="com.checkthecrowd.batch.sitemapgen.config" />

    <bean class="...config.SitemapGenConfig" />
    <bean class="...config.java.process.ConfigurationPostProcessor" />

    <batch:job id="generateSitemap" job-repository="jobRepository">
        <batch:step id="insertStaticPages" next="insertPlacePages">
            <batch:tasklet ref="staticPagesInitializerTasklet" />
        </batch:step>
        <batch:step id="insertPlacePages" parent="abstractParentStep" next="insertProfilePages">
            <batch:tasklet>
                <batch:chunk reader="placeItemReader" writer="placeItemWriter" />
            </batch:tasklet>
        </batch:step>
        <batch:step id="insertProfilePages" parent="abstractParentStep" next="writeXml">
            <batch:tasklet>
                <batch:chunk reader="profileItemReader" writer="profileItemWriter" />
            </batch:tasklet>
        </batch:step>
        <batch:step id="writeXml">
            <batch:tasklet ref="xmlWriterTasklet" />
        </batch:step>
    </batch:job>

    <batch:step id="abstractParentStep" abstract="true">
        <batch:tasklet>
            <batch:chunk commit-interval="100" />
        </batch:tasklet>
    </batch:step>
</beans>
Lines 26-30 declare an abstract step which serves as the common parent for steps 2 and 3. It sets a property called commit-interval which defines how many items comprises a chunk. In this case, a chunk is comprised of 100 items.

There is a lot more to Spring Batch, kindly refer to the official reference guide.

7 comments :

  1. Thank; this is awesome stuff!

    ReplyDelete
  2. Hi there :) Great post, however, as soon as I run the job again, for the second time, I get an exception from sitemapgen4j: java.lang.RuntimeException: Sitemap already printed; you must create a new generator to make more sitemaps. This is because sitemapgen4j has a property internally called finished to track if the file has been written. How did you solve this issue?

    ReplyDelete
  3. Thanks! I am running the job via CommandLineJobRunner, hence a new instance of WebSitemapGenerator is created every time.

    How are you running your job?

    ReplyDelete
  4. Hi,

    I'm using the SimpleJobLauncher like this:

    jobLauncher.run((Job) applicationContext.getBean("sitemapJob"), new JobParametersBuilder().addLong("timestamp", System.currentTimeMillis()).toJobParameters());

    ReplyDelete
  5. Try declaring your WebSitemapGenerator with scope = "job" (only available starting Spring Batch 3.0)

    More details here:

    http://docs.spring.io/spring-batch/trunk/reference/html/configureStep.html#job-scope

    ReplyDelete