System Design

Spring Batch for High-Volume Data Processing

Partitioning, Performance Tuning, and Fault Tolerance at Scale

How to design and tune Spring Batch jobs that process millions of records reliably — covering partitioned steps, chunk sizing, restart semantics, and the monitoring setup that keeps batch jobs observable in production.

10 min read · September 22, 2024

The Case for Spring Batch

For processing millions of financial transactions on a schedule, Spring Batch provides the infrastructure you don't want to build: chunk-oriented processing, restart/retry semantics, job repository for state, and parallel processing via partitioning. At Mastercard, we processed millions of PAN registrations in nightly batch windows. At Sopra Banking, daily lending transaction processing had strict SLA windows that required careful tuning.

Chunk-Oriented Processing

Spring Batch's fundamental unit is the chunk: read N items, process N items, write N items, commit. Chunk size is the most impactful tuning knob. Too small: too many DB commits, high overhead. Too large: memory pressure, long transaction windows. We empirically found chunk sizes of 500-1000 items optimal for financial record processing on PostgreSQL, measured by throughput and GC pressure.

RegistrationBatchConfig.java
@Bean
public Step registrationStep() {
    return stepBuilderFactory.get("registrationStep")
        .<RegistrationRequest, RegistrationEvent>chunk(500) // Tuned chunk size
        .reader(pagingItemReader())
        .processor(registrationProcessor())
        .writer(kafkaItemWriter())
        .faultTolerant()
        .skipLimit(100)
        .skip(DataIntegrityViolationException.class)
        .retryLimit(3)
        .retry(TransientDataAccessException.class)
        .listener(batchMetricsListener())
        .build();
}
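The step above wires in a `kafkaItemWriter()` bean that isn't shown. As a sketch of what such a writer could look like using Spring Batch's built-in `KafkaItemWriter` — the topic name, the `KafkaTemplate` bean, and the `getId()` accessor on `RegistrationEvent` are illustrative assumptions, not taken from the original config:

```java
import org.springframework.batch.item.kafka.KafkaItemWriter;
import org.springframework.batch.item.kafka.builder.KafkaItemWriterBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.kafka.core.KafkaTemplate;

@Bean
public KafkaItemWriter<String, RegistrationEvent> kafkaItemWriter(
        KafkaTemplate<String, RegistrationEvent> kafkaTemplate) {
    // KafkaItemWriter publishes via sendDefault(), so the template
    // needs a default topic configured.
    kafkaTemplate.setDefaultTopic("registration-events");
    return new KafkaItemWriterBuilder<String, RegistrationEvent>()
        .kafkaTemplate(kafkaTemplate)
        // Message key per item; keying by entity ID keeps events for the
        // same registration on the same partition, preserving order.
        .itemKeyMapper(event -> String.valueOf(event.getId()))
        .delete(false)  // send real records, not tombstones
        .build();
}
```

One chunk maps to one batch of sends, so the tuned chunk size of 500 also bounds how many in-flight Kafka sends a single transaction awaits.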

Partitioned Steps: True Parallelism

For large datasets, partitioned steps are the key to meeting SLA windows. A master step divides the dataset into N partitions (by ID range, hash, or date), and worker steps process them in parallel. We reduced our 4-hour batch window to under 1.5 hours by partitioning a 20M-record dataset into 20 workers on PCF. The partition strategy matters: aim for roughly equal partition sizes to avoid stragglers.

PartitionedBatchConfig.java
@Bean
public Step masterStep(Step workerStep) {
    return stepBuilderFactory.get("masterStep")
        .partitioner("workerStep", rangePartitioner())
        .step(workerStep)
        .gridSize(20)  // 20 parallel workers
        .taskExecutor(batchTaskExecutor())
        .build();
}

@Bean
public Partitioner rangePartitioner() {
    return gridSize -> {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        long totalRecords = registrationRepo.count();
        long batchSize = totalRecords / gridSize;

        // Assumes contiguous IDs starting at 1; for sparse IDs, partition
        // on the actual min/max ID range rather than the row count.
        for (int i = 0; i < gridSize; i++) {
            ExecutionContext ctx = new ExecutionContext();
            ctx.putLong("minId", i * batchSize + 1);
            // Inclusive, non-overlapping bounds; the last partition absorbs
            // the division remainder so no trailing records are dropped.
            ctx.putLong("maxId", (i == gridSize - 1) ? totalRecords : (i + 1) * batchSize);
            partitions.put("partition" + i, ctx);
        }
        return partitions;
    };
}
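On the worker side, each partition's `minId`/`maxId` has to reach the reader. A sketch of how that worker reader could be wired with a step-scoped `JdbcPagingItemReader` — the table, columns, and bean names here are hypothetical, and the original post doesn't show this piece:

```java
import java.util.Map;
import javax.sql.DataSource;
import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.item.database.JdbcPagingItemReader;
import org.springframework.batch.item.database.Order;
import org.springframework.batch.item.database.builder.JdbcPagingItemReaderBuilder;
import org.springframework.batch.item.database.support.PostgresPagingQueryProvider;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.jdbc.core.BeanPropertyRowMapper;

@Bean
@StepScope  // defer creation until the partition's ExecutionContext exists
public JdbcPagingItemReader<RegistrationRequest> workerReader(
        @Value("#{stepExecutionContext['minId']}") Long minId,
        @Value("#{stepExecutionContext['maxId']}") Long maxId,
        DataSource dataSource) {

    PostgresPagingQueryProvider queryProvider = new PostgresPagingQueryProvider();
    queryProvider.setSelectClause("SELECT id, pan, status");
    queryProvider.setFromClause("FROM registration_request");
    queryProvider.setWhereClause("WHERE id BETWEEN :minId AND :maxId");
    queryProvider.setSortKeys(Map.of("id", Order.ASCENDING));  // stable paging key

    return new JdbcPagingItemReaderBuilder<RegistrationRequest>()
        .name("workerReader")
        .dataSource(dataSource)
        .queryProvider(queryProvider)
        .parameterValues(Map.<String, Object>of("minId", minId, "maxId", maxId))
        .pageSize(500)  // align with chunk size: one page read per commit
        .rowMapper(new BeanPropertyRowMapper<>(RegistrationRequest.class))
        .build();
}
```

`@StepScope` is what makes this work: each of the 20 worker executions gets its own reader instance bound to its own ID range from the step execution context.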

Fault Tolerance and Restart Semantics

Production batch jobs will fail. Spring Batch's job repository persists execution state, enabling restarts from the last successful checkpoint rather than from the beginning. Configure skip policies for business exceptions (bad data you can ignore) and retry policies for transient failures (DB timeouts, network blips). We also implemented StepExecutionListeners to emit Micrometer metrics on chunk completion, giving real-time throughput visibility in Grafana.
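A minimal sketch of such a metrics listener, assuming a Micrometer `MeterRegistry` bean is available; the meter names and tags are illustrative, not the ones used in production:

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Tags;
import org.springframework.batch.core.ChunkListener;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.context.annotation.Bean;

@Bean
public ChunkListener batchMetricsListener(MeterRegistry registry) {
    return new ChunkListener() {
        @Override
        public void beforeChunk(ChunkContext context) { }

        @Override
        public void afterChunk(ChunkContext context) {
            StepExecution step = context.getStepContext().getStepExecution();
            // Chunk throughput: rate of this counter = chunks/sec in Grafana
            registry.counter("batch.chunks.completed",
                    "step", step.getStepName()).increment();
            // Cumulative items written; re-registering the same gauge id
            // is a no-op in Micrometer, so calling this per chunk is safe.
            registry.gauge("batch.items.written",
                    Tags.of("step", step.getStepName()),
                    step, s -> s.getWriteCount());
        }

        @Override
        public void afterChunkError(ChunkContext context) {
            registry.counter("batch.chunks.failed").increment();
        }
    };
}
```

Rate over `batch.chunks.completed`, multiplied by the chunk size, gives the real-time items/sec view used to verify SLA tracking mid-run.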

Tags

#spring-batch #java #performance #banking