Event-Driven Microservices with Apache Kafka
Patterns, Pitfalls, and Production Lessons
A deep-dive into building resilient event-driven architectures with Kafka — covering topic design, consumer group strategies, exactly-once semantics, and hard-won lessons from fintech production systems.
Why Event-Driven?
Traditional synchronous REST-based microservices create temporal coupling — if Service B is down, Service A fails too. Event-driven architectures decouple producers and consumers in time, enabling independent scaling, fault isolation, and the ability to replay history. In fintech systems where audit trails are non-negotiable, Kafka's durable log becomes your source of truth.
Topic Design: The Foundation
The most consequential decision in a Kafka-based system is topic design. A common mistake is creating one topic per microservice. Instead, design topics around domain events: `account.registered`, `payment.processed`, `fraud.flagged`. Each topic represents a fact about your domain. Partition count determines parallelism — for payment processing, we partitioned by account prefix to ensure ordered delivery per account while achieving high throughput.
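Key-based routing is what makes per-account ordering work: Kafka hashes the record key and maps it to a partition, so every event carrying the same key lands on the same partition. A minimal sketch of that mapping, using `String.hashCode` as a stand-in for Kafka's murmur2 hash (`partitionFor` is a hypothetical helper, not a Kafka API):

```java
public class PartitionSketch {
    // Conceptually what the default partitioner does: hash(key) mod partitionCount.
    // Kafka itself uses murmur2 over the serialized key bytes;
    // String.hashCode is a simplified stand-in here.
    static int partitionFor(String key, int partitionCount) {
        return Math.floorMod(key.hashCode(), partitionCount);
    }

    public static void main(String[] args) {
        // All events keyed by the same account prefix hit the same
        // partition, preserving per-account ordering.
        System.out.println(partitionFor("ACC-1024", 24));
        System.out.println(partitionFor("ACC-1024", 24)); // same partition as above
    }
}
```

The trade-off: a hot key (one very active account prefix) concentrates load on one partition, so key cardinality matters as much as partition count.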
```yaml
# Topic design for PAN registration system
topics:
  account.registration.requests:
    partitions: 24            # Parallelism — sized to consumer count
    replication-factor: 3     # High availability
    retention.ms: 604800000   # 7-day retention for replay
  account.registration.events:
    partitions: 24
    replication-factor: 3
    cleanup.policy: compact   # Keep latest state per key
```

Exactly-Once Semantics: The Hard Part
Kafka guarantees at-least-once delivery by default — your consumer may process the same message twice. For financial systems, idempotency is non-negotiable. We implemented a two-layered approach: (1) Kafka transactions for atomic producer-consumer commit, and (2) a Redis-backed idempotency cache keyed by event UUID. Before processing, check the cache. On success, write to the cache with a TTL matching your maximum retry window.
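Layer (1) lives mostly in client configuration: the producer gets a transactional id with idempotence enabled, and consumers in the flow read only committed records. A sketch of the relevant client properties (the `transactional.id` value and class name are illustrative):

```java
import java.util.Properties;

public class ExactlyOnceConfig {
    static Properties producerProps() {
        Properties p = new Properties();
        p.put("enable.idempotence", "true");                  // no duplicates on retry
        p.put("acks", "all");                                 // wait for in-sync replicas
        p.put("transactional.id", "registration-producer-1"); // illustrative id
        return p;
    }

    static Properties consumerProps() {
        Properties c = new Properties();
        c.put("isolation.level", "read_committed"); // skip records from aborted txns
        c.put("enable.auto.commit", "false");       // commit offsets with the txn
        return c;
    }
}
```

Note that Kafka transactions only cover Kafka-to-Kafka atomicity; the moment you touch an external system (a database, Redis), you are back to needing layer (2).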
```java
@KafkaListener(topics = "account.registration.requests")
@Transactional
public void process(RegistrationEvent event) {
    String idempotencyKey = "reg:" + event.getEventId();
    // Check idempotency cache first
    if (Boolean.TRUE.equals(redis.hasKey(idempotencyKey))) {
        log.info("Duplicate event {}, skipping", event.getEventId());
        return;
    }
    // Business logic
    registrationService.process(event);
    // Mark as processed with TTL
    redis.opsForValue().set(idempotencyKey, "1", Duration.ofHours(24));
}
```

Consumer Group Strategies
Consumer groups are Kafka's unit of scalability. A critical lesson: never share a consumer group between services with different processing semantics. In our PAN registration system, we had separate consumer groups for the synchronous validation flow and the async downstream propagation flow. This allowed us to scale them independently and prevent a downstream slowdown from backing up the validation queue.
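In client terms, the split is nothing more than two distinct `group.id` values reading the same topic, so each group tracks its own offsets and scales on its own. A config fragment in the same style as the topic definitions above (service and group names are illustrative):

```yaml
# Two services, one topic, independent consumer groups
validation-service:
  group.id: registration-validation    # sized for the low-latency sync flow
downstream-propagator:
  group.id: registration-propagation   # may lag without blocking validation
```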
Observability: What Gets Measured Gets Fixed
In production, the most valuable Kafka metrics are consumer lag (per topic, per partition), end-to-end event latency (producer timestamp to consumer commit), and dead-letter queue depth. We surfaced these in Dynatrace dashboards and set up Splunk alerts on consumer lag thresholds. When lag spiked, we could correlate with a deployment, upstream traffic spike, or downstream service degradation within minutes.
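Consumer lag per partition is simply the broker's log-end offset minus the group's committed offset; summing across partitions gives the headline number to alert on. A self-contained sketch of that arithmetic (the offset maps are hypothetical sample data, not a Kafka API):

```java
import java.util.Map;

public class LagSketch {
    // lag(partition) = logEndOffset - committedOffset, clamped at zero
    static long totalLag(Map<Integer, Long> logEndOffsets,
                         Map<Integer, Long> committedOffsets) {
        long total = 0;
        for (Map.Entry<Integer, Long> e : logEndOffsets.entrySet()) {
            long committed = committedOffsets.getOrDefault(e.getKey(), 0L);
            total += Math.max(0, e.getValue() - committed);
        }
        return total;
    }
}
```

Alerting on the trend (lag growing over several minutes) is usually more useful than alerting on any absolute value, since a healthy consumer can carry steady-state lag.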
Dead Letter Queues: Your Safety Net
Every consumer should publish unprocessable messages to a DLQ topic. But the DLQ is only useful if you have a process to replay or handle DLQ messages. We built a Spring Batch job that periodically replayed DLQ messages during low-traffic windows, with alerting on DLQ depth. This combination of resilience and visibility prevented silent data loss in production.
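The routing logic itself is small: attempt processing, and on failure hand the message to a DLQ publisher rather than dropping it or retrying forever. A minimal sketch with plain functional interfaces standing in for the Kafka consumer and DLQ producer:

```java
import java.util.function.Consumer;

public class DlqSketch {
    static <T> void processWithDlq(T message,
                                   Consumer<T> handler,
                                   Consumer<T> dlqPublisher) {
        try {
            handler.accept(message);      // normal processing path
        } catch (RuntimeException e) {
            dlqPublisher.accept(message); // park it for later replay
        }
    }
}
```

In a real consumer you would also attach the failure cause (exception class, offset, timestamp) as record headers so the replay job can triage without guesswork.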