Event-Driven Microservices with Apache Kafka
Patterns, Pitfalls, and Production Lessons
A deep-dive into building resilient event-driven architectures with Kafka — covering topic design, consumer group strategies, exactly-once semantics, and hard-won lessons from fintech production systems.
Why Event-Driven?
Traditional synchronous REST-based microservices create temporal coupling — if Service B is down, Service A fails too. Event-driven architectures decouple producers and consumers in time, enabling independent scaling, fault isolation, and the ability to replay history. In fintech systems where audit trails are non-negotiable, Kafka's durable log becomes your source of truth.
Topic Design: The Foundation
The most consequential decision in a Kafka-based system is topic design. A common mistake is creating one topic per microservice. Instead, design topics around domain events: `account.registered`, `payment.processed`, `fraud.flagged`. Each topic represents a fact about your domain. Partition count determines parallelism — for payment processing, we partitioned by account prefix to ensure ordered delivery per account while achieving high throughput.
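Key-based routing is what makes per-account ordering work: Kafka hashes the record key and maps it to a partition, so every event carrying the same key lands on the same partition. A minimal sketch of that mapping, using `String.hashCode` as a stand-in for Kafka's murmur2 hash (`partitionFor` is a hypothetical helper, not a Kafka API):

```java
public class PartitionSketch {
    // Conceptually what the default partitioner does: hash(key) mod partitionCount.
    // Kafka itself uses murmur2 over the serialized key bytes;
    // String.hashCode is a simplified stand-in here.
    static int partitionFor(String key, int partitionCount) {
        return Math.floorMod(key.hashCode(), partitionCount);
    }

    public static void main(String[] args) {
        // All events keyed by the same account prefix hit the same
        // partition, preserving per-account ordering.
        System.out.println(partitionFor("ACC-1024", 24));
        System.out.println(partitionFor("ACC-1024", 24)); // same partition as above
    }
}
```

The trade-off: a hot key (one very active account prefix) concentrates load on one partition, so key cardinality matters as much as partition count.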
```yaml
# Topic design for PAN registration system
topics:
  account.registration.requests:
    partitions: 24            # Parallelism — sized to consumer count
    replication-factor: 3     # High availability
    retention.ms: 604800000   # 7-day retention for replay
  account.registration.events:
    partitions: 24
    replication-factor: 3
    cleanup.policy: compact   # Keep latest state per key
```

Exactly-Once Semantics: The Hard Part
Kafka guarantees at-least-once delivery by default — your consumer may process the same message twice. For financial systems, idempotency is non-negotiable. We implemented a two-layered approach: (1) Kafka transactions for atomic producer-consumer commit, and (2) a Redis-backed idempotency cache keyed by event UUID. Before processing, check the cache. On success, write to the cache with a TTL matching your maximum retry window.
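Layer (1) lives mostly in client configuration: the producer gets a transactional id with idempotence enabled, and consumers in the flow read only committed records. A sketch of the relevant client properties (the `transactional.id` value and class name are illustrative):

```java
import java.util.Properties;

public class ExactlyOnceConfig {
    static Properties producerProps() {
        Properties p = new Properties();
        p.put("enable.idempotence", "true");                  // no duplicates on retry
        p.put("acks", "all");                                 // wait for in-sync replicas
        p.put("transactional.id", "registration-producer-1"); // illustrative id
        return p;
    }

    static Properties consumerProps() {
        Properties c = new Properties();
        c.put("isolation.level", "read_committed"); // skip records from aborted txns
        c.put("enable.auto.commit", "false");       // commit offsets with the txn
        return c;
    }
}
```

Note that Kafka transactions only cover Kafka-to-Kafka atomicity; the moment you touch an external system (a database, Redis), you are back to needing layer (2).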
```java
@KafkaListener(topics = "account.registration.requests")
@Transactional
public void process(RegistrationEvent event) {
    String idempotencyKey = "reg:" + event.getEventId();
    // Check idempotency cache first
    if (Boolean.TRUE.equals(redis.hasKey(idempotencyKey))) {
        log.info("Duplicate event {}, skipping", event.getEventId());
        return;
    }
    // Business logic
    registrationService.process(event);
    // Mark as processed with TTL
    redis.opsForValue().set(idempotencyKey, "1", Duration.ofHours(24));
}
```

Consumer Group Strategies
Consumer groups are Kafka's unit of scalability. A critical lesson: never share a consumer group between services with different processing semantics. In our PAN registration system, we had separate consumer groups for the synchronous validation flow and the async downstream propagation flow. This allowed us to scale them independently and prevent a downstream slowdown from backing up the validation queue.
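In client terms, the split is nothing more than two distinct `group.id` values reading the same topic, so each group tracks its own offsets and scales on its own. A config fragment in the same style as the topic definitions above (service and group names are illustrative):

```yaml
# Two services, one topic, independent consumer groups
validation-service:
  group.id: registration-validation    # sized for the low-latency sync flow
downstream-propagator:
  group.id: registration-propagation   # may lag without blocking validation
```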
Observability: What Gets Measured Gets Fixed
In production, the most valuable Kafka metrics are consumer lag (per topic, per partition), end-to-end event latency (producer timestamp to consumer commit), and dead-letter queue depth. We surfaced these in Dynatrace dashboards and set up Splunk alerts on consumer lag thresholds. When lag spiked, we could correlate with a deployment, upstream traffic spike, or downstream service degradation within minutes.
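Consumer lag per partition is simply the broker's log-end offset minus the group's committed offset; summing across partitions gives the headline number to alert on. A self-contained sketch of that arithmetic (the offset maps are hypothetical sample data, not a Kafka API):

```java
import java.util.Map;

public class LagSketch {
    // lag(partition) = logEndOffset - committedOffset, clamped at zero
    static long totalLag(Map<Integer, Long> logEndOffsets,
                         Map<Integer, Long> committedOffsets) {
        long total = 0;
        for (Map.Entry<Integer, Long> e : logEndOffsets.entrySet()) {
            long committed = committedOffsets.getOrDefault(e.getKey(), 0L);
            total += Math.max(0, e.getValue() - committed);
        }
        return total;
    }
}
```

Alerting on the trend (lag growing over several minutes) is usually more useful than alerting on any absolute value, since a healthy consumer can carry steady-state lag.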
Dead Letter Queues: Your Safety Net
Every consumer should publish unprocessable messages to a DLQ topic. But the DLQ is only useful if you have a process to replay or handle DLQ messages. We built a Spring Batch job that periodically replayed DLQ messages during low-traffic windows, with alerting on DLQ depth. This combination of resilience and visibility prevented silent data loss in production.
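The routing logic itself is small: attempt processing, and on failure hand the message to a DLQ publisher rather than dropping it or retrying forever. A minimal sketch with plain functional interfaces standing in for the Kafka consumer and DLQ producer:

```java
import java.util.function.Consumer;

public class DlqSketch {
    static <T> void processWithDlq(T message,
                                   Consumer<T> handler,
                                   Consumer<T> dlqPublisher) {
        try {
            handler.accept(message);      // normal processing path
        } catch (RuntimeException e) {
            dlqPublisher.accept(message); // park it for later replay
        }
    }
}
```

In a real consumer you would also attach the failure cause (exception class, offset, timestamp) as record headers so the replay job can triage without guesswork.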