Kafka Development
This skill provides best practices for Apache Kafka event streaming and distributed messaging systems. Apply these guidelines when building Kafka-based applications.
Core Principles
- Kafka is a distributed event streaming platform for high-throughput, fault-tolerant messaging
- Unlike traditional pub/sub, Kafka uses a pull model - consumers pull messages from partitions
- Design for scalability, durability, and exactly-once semantics where needed
- Leave NO todos, placeholders, or missing pieces in the implementation
Workflow: Setting Up a Kafka Producer-Consumer Pipeline
- Define the topic — choose a descriptive name, set partition count based on expected consumer parallelism, and configure retention and replication factor.
- Design the message schema — register an Avro or JSON schema in Schema Registry; ensure backward compatibility from the start.
- Implement the producer — configure
acks=all, enable idempotence, select a partition key that distributes evenly, and add error handling with retry logic.
- Implement the consumer — set
enable.auto.commit=false, pick an appropriate auto.offset.reset policy, process messages idempotently, and commit offsets only after successful processing.
- Add observability — instrument producer send-rate, consumer lag, and broker under-replicated-partitions; propagate trace context in message headers.
- Test end-to-end — use Testcontainers or an embedded Kafka broker to verify the full produce-consume-commit cycle, including failure and rebalance scenarios.
- Deploy and monitor — roll out with lag alerts, dead-letter-topic routing for persistent failures, and dashboards for key broker and client metrics.
Architecture Overview
Core Components
- Topics: Categories/feeds for organizing messages
- Partitions: Ordered, immutable sequences within topics enabling parallelism
- Producers: Clients that publish messages to topics
- Consumers: Clients that read messages from topics
- Consumer Groups: Coordinate consumption across multiple consumers
- Brokers: Kafka servers that store data and serve clients
Key Concepts
- Messages are appended to partitions in order
- Each message has an offset - a unique sequential ID within the partition
- Consumers maintain their own cursor (offset) and can read streams repeatedly
- Partitions are distributed across brokers for scalability
Topic Design
Partitioning Strategy
- Use partition keys to place related events in the same partition
- Messages with the same key always go to the same partition
- This ensures ordering for related events
- Choose keys carefully - uneven distribution causes hot partitions
Partition Count
- More partitions = more parallelism but more overhead
- Consider: expected throughput, consumer count, broker resources
- Start with number of consumers you expect to run concurrently
- Partitions can be increased but not decreased
Topic Configuration
retention.ms: How long to keep messages (default 7 days)
retention.bytes: Maximum size per partition
cleanup.policy: delete (remove old) or compact (keep latest per key)
min.insync.replicas: Minimum replicas that must acknowledge
Producer Best Practices
Reliability Settings
acks=all # Wait for all replicas to acknowledge
retries=MAX_INT # Retry on transient failures
enable.idempotence=true # Prevent duplicate messages on retry
Performance Tuning
batch.size: Accumulate messages before sending (default 16KB)
linger.ms: Wait time for batching (0 = send immediately)
buffer.memory: Total memory for buffering unsent messages
compression.type: gzip, snappy, lz4, or zstd for bandwidth savings
Error Handling
- Implement retry logic with exponential backoff
- Handle retriable vs non-retriable exceptions differently
- Log and alert on send failures
- Consider dead letter topics for messages that fail repeatedly
Example: Java Producer with Idempotence and Error Handling
Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.ACKS_CONFIG, "all");
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy");
try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
ProducerRecord<String, String> record =
new ProducerRecord<>("orders", "order-123", "{\"item\":\"widget\",\"qty\":5}");
producer.send(record, (metadata, exception) -> {
if (exception != null) {
log.error("Send failed for key=order-123", exception);
// Route to dead-letter topic or alert
} else {
log.info("Delivered to {}-{} offset {}",
metadata.topic(), metadata.partition(), metadata.offset());
}
});
}
Partitioner
- Default: hash of key determines partition (null key = round-robin)
- Custom partitioners for specific routing needs
- Ensure even distribution to avoid hot partitions
Consumer Best Practices
Offset Management
- Consumers track which messages they've processed via offsets
auto.offset.reset: earliest (start from beginning) or latest (only new messages)
- Commit offsets after successful processing, not before
- Use
enable.auto.commit=false for exactly-once semantics
Consumer Groups
- Consumers in a group share partitions (each partition to one consumer)
- More consumers than partitions = some consumers idle
- Group rebalancing occurs when consumers join/leave
- Use
group.instance.id for static membership to reduce rebalances
Processing Patterns
- Process messages in order within a partition
- Handle out-of-order messages across partitions if needed
- Implement idempotent processing for at-least-once delivery
- Consider transactional processing for exactly-once
Example: Java Consumer with Manual Offset Commit
Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processing-group");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
consumer.subscribe(Collections.singletonList("orders"));
while (running) {
ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
for (ConsumerRecord<String, String> record : records) {
try {
processOrder(record.key(), record.value());
} catch (Exception e) {
log.error("Failed to process offset={} key={}", record.offset(), record.key(), e);
publishToDeadLetterTopic(record, e);
}
}
consumer.commitSync(); // Commit only after successful processing
}
}
Timeouts and Failures
- Implement processing timeout to isolate slow events
- When timeout occurs, set event aside and continue to next message
- Maintain overall system performance over processing every single event
- Use dead letter queues for messages failing all retries
Error Handling and Retry
Retry Strategy
- Allow multiple runtime retries per processing attempt
- Example: 3 runtime retries per redrive, maximum 5 redrives = 15 total retries
- Runtime retries typically cover 99% of failures
- After exhausting retries, route to dead letter queue
Dead Letter Topics
- Create dedicated DLT for messages that can't be processed
- Include original topic, partition, offset, and error details
- Monitor DLT for patterns indicating systemic issues
- Implement manual or automated retry from DLT
Schema Management
Schema Registry
- Use Confluent Schema Registry for schema management
- Producers validate data against registered schemas during serialization
- Schema mismatches throw exceptions, preventing malformed data
- Provides common reference for producers and consumers
Schema Evolution
- Design schemas for forward and backward compatibility
- Add optional fields with defaults for backward compatibility
- Avoid removing or renaming fields
- Use schema versioning and migration strategies
Kafka Streams
State Management
- Implement log compaction to maintain latest version of each key
- Periodically purge old data from state stores
- Monitor state store size and access patterns
- Use appropriate storage backends for your scale
Windowing Operations
- Handle out-of-order events and skewed timestamps
- Use appropriate time extraction and watermarking techniques
- Configure grace periods for late-arriving data
- Choose window types based on use case (tumbling, hopping, sliding, session)
Security
Authentication
- Use SASL/SSL for client authentication
- Support SASL mechanisms: PLAIN, SCRAM, OAUTHBEARER, GSSAPI
- Enable SSL for encryption in transit
- Rotate credentials regularly
Authorization
- Use Kafka ACLs for fine-grained access control
- Grant minimum necessary permissions per principal
- Separate read/write permissions by topic
- Audit access patterns regularly
Monitoring and Observability
Key Metrics
- Producer: record-send-rate, record-error-rate, batch-size-avg
- Consumer: records-consumed-rate, records-lag, commit-latency
- Broker: under-replicated-partitions, request-latency, disk-usage
Lag Monitoring
- Consumer lag = last produced offset - last committed offset
- High lag indicates consumers can't keep up
- Alert on increasing lag trends
- Scale consumers or optimize processing
Distributed Tracing
- Propagate trace context in message headers
- Use OpenTelemetry for end-to-end tracing
- Correlate producer and consumer spans
- Track message journey through the pipeline
Testing
Unit Testing
- Mock Kafka clients for isolated testing
- Test serialization/deserialization logic
- Verify partitioning logic
- Test error handling paths
Integration Testing
- Use embedded Kafka or Testcontainers
- Test full producer-consumer flows
- Verify exactly-once semantics if used
- Test rebalancing scenarios
Performance Testing
- Load test with production-like message rates
- Test consumer throughput and lag behavior
- Verify broker resource usage under load
- Test failure and recovery scenarios
Common Patterns
Event Sourcing
- Store all state changes as immutable events
- Rebuild state by replaying events
- Use log compaction for snapshots
- Enable time-travel debugging
CQRS (Command Query Responsibility Segregation)
- Separate write (command) and read (query) models
- Use Kafka as the event store
- Build read-optimized projections from events
- Handle eventual consistency appropriately
Saga Pattern
- Coordinate distributed transactions across services
- Each service publishes events for next step
- Implement compensating transactions for rollback
- Use correlation IDs to track saga instances
Change Data Capture (CDC)
- Capture database changes as Kafka events
- Use Debezium or similar CDC tools
- Enable real-time data synchronization
- Build event-driven integrations