Apache Kafka — Interview Questions

Stack context: This system uses Apache Kafka (Confluent 7.6.1, KRaft mode) as the event-streaming backbone for asynchronous inter-service communication. Topics include order.created, payment.processed, and order.status.updated. Avro + Confluent Schema Registry is used for serialization, and Spring's @RetryableTopic handles retry/DLT flows.


Q1 — What is Apache Kafka and how does it differ from a traditional message queue? junior

Answer: Apache Kafka is a distributed, append-only event log. Unlike a traditional message queue (RabbitMQ, ActiveMQ) where a message is removed once consumed, Kafka retains messages on disk for a configurable retention period. Multiple independent consumers can read the same messages at different offsets.

Key differences:

Real use case in this system: The order.created topic is consumed independently by payment-processors and notification-processors. Both groups receive every event without interfering with each other.


Q2 — What are topics, partitions, and offsets? junior

Answer:

order.created (3 partitions):
  Partition 0: [offset-0, offset-1, offset-2, ...]
  Partition 1: [offset-0, offset-1, ...]
  Partition 2: [offset-0, ...]

Real use case: In this system order.created has 3 partitions. Orders are keyed by orderId, so all events for the same order always land in the same partition, guaranteeing per-order processing order.


Q3 — What is a consumer group and how does partition assignment work? junior

Answer: A consumer group is a logical grouping of consumers that share the work of consuming a topic. Kafka assigns each partition to exactly one consumer within a group, ensuring each message is processed once per group.

Rules:

When a consumer joins or leaves, Kafka triggers a rebalance to redistribute partitions.

Real use case: payment-processors group has consumers equal to the 3 partitions of order.created. Each consumer handles one partition. If a payment-service pod crashes, Kafka rebalances and the remaining consumers absorb its partition.


Q4 — What is auto.offset.reset and when would you use earliest vs latest? junior

Answer: auto.offset.reset determines where a new consumer group (no committed offset yet) starts reading:

When to use each:


Q5 — Explain the @KafkaListener annotation and how it works in Spring. junior

Answer: @KafkaListener marks a method as a Kafka message handler. Spring's KafkaListenerContainerFactory creates a MessageListenerContainer that manages the consumer loop, offset commit, and thread management.

@KafkaListener(
    topics = "order.created",
    groupId = "payment-processors",
    containerFactory = "kafkaListenerContainerFactory"
)
public void handleOrder(OrderCreatedEvent event) { ... }

Key behaviors:


Q6 — What is Avro and why use it instead of JSON for Kafka messages? junior

Answer: Apache Avro is a binary serialization format with a schema defined in JSON. Compared to JSON:

Feature Avro JSON
Size Compact binary (~5-10× smaller) Verbose text
Schema enforcement Required at produce/consume Optional
Evolution Supports backward/forward compatibility No built-in rules
Performance Fast binary encode/decode Slower text parse

Schema Registry integration: Avro schemas are registered with Confluent Schema Registry. Only the schema ID (4 bytes) is sent in the message payload. Consumers fetch the schema by ID to deserialize.

Real use case: This system uses Avro for OrderCreatedEvent, PaymentProcessedEvent, and OrderStatusUpdatedEvent. Schema Registry prevents a producer from publishing a breaking schema change.


Q7 — What is the Confluent Schema Registry and how does it prevent breaking changes? junior

Answer: Schema Registry is a centralized service that stores and versions Avro (or Protobuf/JSON Schema) schemas. Producers register a schema before publishing; consumers retrieve schemas by ID.

Compatibility levels (configured per subject):

If a proposed schema violates the compatibility rule (e.g., removing a required field), the Registry rejects it and the producer fails to start — preventing runtime deserialization errors.


Q8 — What is @RetryableTopic and how does Spring implement retry logic with it? junior

Answer: @RetryableTopic is a Spring Kafka annotation that configures non-blocking retry using separate retry topics instead of blocking the main consumer thread.

@RetryableTopic(
    attempts = "4",
    backoff = @Backoff(delay = 1000, multiplier = 2.0),
    dltTopicSuffix = ".DLT"
)
@KafkaListener(topics = "order.created", groupId = "payment-processors")
public void processPayment(OrderCreatedEvent event) { ... }

Spring auto-creates order.created-retry-0, order.created-retry-1, order.created-retry-2, and order.created.DLT. On exception, the message is forwarded to the next retry topic with a header indicating the delay. After all attempts are exhausted, the message lands in the DLT.

Advantage over blocking retry: The main consumer continues processing other messages while retries are scheduled, preventing a single bad message from stalling the entire partition.


Q9 — What is a Dead Letter Topic (DLT) and how should it be handled in production? senior

Answer: A DLT receives messages that failed all retry attempts. It serves as a "poison pill parking lot" that preserves failed events for investigation and reprocessing.

Production handling strategy:

  1. Monitor: Alert on DLT consumer-group lag > 0 (PagerDuty/Datadog).
  2. @DltHandler: Annotate a method to process DLT messages — log structured context (orderId, error type, attempt count, headers).
  3. Manual replay: After fixing the root cause, re-publish DLT messages back to the main topic.
  4. Automatic replay (optional): A scheduled job re-publishes DLT messages after a cooldown period.
@DltHandler
public void handleDlt(OrderCreatedEvent event, @Header KafkaHeaders.RECEIVED_TOPIC String topic) {
    log.error("DLT received: orderId={}, topic={}", event.getOrderId(), topic);
    // persist to dead_letters table for ops dashboard
}

This system: order.created.DLT has 3 partitions. The @DltHandler logs the failure; ops teams can replay via the admin endpoint.


Q10 — What is the Transactional Outbox Pattern and why is it needed? senior

Answer: The Transactional Outbox solves the dual-write problem: you cannot atomically write to both a database and Kafka in a single transaction (they are different systems).

Problem: order-service creates an order in PostgreSQL and publishes to Kafka. If the app crashes after the DB commit but before the Kafka publish, the order exists but no event is sent — downstream services never know.

Solution: Write the event to an outbox table in the same DB transaction as the order. A separate OutboxPublisher polls the outbox, publishes to Kafka, then marks the row as sent.

BEGIN TRANSACTION
  INSERT INTO orders (...)
  INSERT INTO outbox (event_type, payload, published=false)
COMMIT

-- Separate thread:
SELECT * FROM outbox WHERE published = false
  → KafkaTemplate.send(...)
  → UPDATE outbox SET published = true WHERE id = ?

Guarantees: At-least-once delivery (idempotent consumers needed). The order and event are always consistent.

This system: order-service implements this with OutboxPublisher polling every 100 ms.


Q11 — Explain Kafka's delivery semantics: at-most-once, at-least-once, exactly-once. senior

Answer:

Semantic Description Config Risk
At-most-once Commit offset before processing enable.auto.commit=true, early commit Message loss on consumer crash
At-least-once Commit after processing Manual commit after handler returns Duplicate processing on crash
Exactly-once Idempotent producer + transactions enable.idempotence=true, isolation.level=read_committed Higher latency, complexity

At-least-once + idempotent consumers is the pragmatic choice for most systems. This system uses it: @RetryableTopic may redeliver the same event, so payment-service checks if an order payment already exists before processing.

Exactly-once is appropriate for financial ledgers or inventory deductions where duplicates are unacceptable and the overhead is justified.


Q12 — How does Kafka KRaft mode differ from ZooKeeper-based Kafka? senior

Answer: KRaft (Kafka Raft) replaces ZooKeeper as the metadata store in Kafka 3.x+. The cluster elects a Controller Quorum using the Raft consensus algorithm, and metadata is stored in an internal __cluster_metadata topic.

Benefits of KRaft:

This system: Uses KRaft mode (docker-compose.yml runs Kafka with KAFKA_KRAFT_MODE=true, no ZooKeeper container).


Q13 — What is consumer lag and how do you monitor and alert on it? senior

Answer: Consumer lag is the difference between the latest produced offset and the last committed consumer offset for a partition:

lag = latest_offset - committed_offset

High lag means consumers are falling behind producers — events are queuing up.

Monitoring:

Alerting thresholds (example):

Remediation:


Q14 — How does Kafka guarantee ordering and what are its limitations? senior

Answer: Kafka guarantees ordering within a partition. Messages with the same key always go to the same partition (via key hash % partition count). A single consumer per partition in a consumer group processes them in order.

Limitations:


Q15 — What are idempotent producers and when are they important? senior

Answer: An idempotent producer ensures that retrying a failed send() does not result in duplicate messages in Kafka. Each producer is assigned a Producer ID (PID), and each message gets a sequence number. Kafka brokers track PID+sequence and deduplicate retried messages.

Enable: enable.idempotence=true (default in Kafka 3+).

When important:

Transactional producers: Extend idempotency across partitions and topics — required for exactly-once semantics.


Q16 — Explain acks configuration and its impact on durability vs. throughput. senior

Answer:

acks Meaning Durability Throughput
0 Fire and forget — no ACK waited Lowest Highest
1 Leader ACK only Medium Medium
all (-1) All in-sync replicas (ISR) ACK Highest Lowest

acks=all combined with min.insync.replicas=2 ensures that at least 2 replicas have the message before the producer gets success. If the leader crashes before replication, the message isn't lost.

This system: Order events use acks=all (configured via Spring spring.kafka.producer.acks=all) because losing an order event is unacceptable. Payment notifications can tolerate acks=1.


Q17 — What is a Kafka compacted topic and when would you use one? senior

Answer: A compacted topic retains only the latest value for each key, removing older versions during log compaction. It behaves like a changelog or snapshot store rather than an event log.

Use cases:

How compaction works: A background thread (log cleaner) periodically scans log segments, keeps the latest record per key, and discards older ones. Tombstones (null value) signal deletion.

This system: Does not use compacted topics currently, but product.catalog.updated would be a candidate — consumers need the latest product state, not full history.


Q18 — How do you handle schema evolution in Avro without breaking consumers? senior

Answer: Schema Registry enforces compatibility rules. For backward compatibility (consumers can read new schema with the old one):

Evolution example:

// V1
{"name": "OrderCreatedEvent", "fields": [{"name": "orderId", "type": "string"}]}

// V2 — backward compatible (new optional field with default)
{"name": "OrderCreatedEvent", "fields": [
  {"name": "orderId", "type": "string"},
  {"name": "priority", "type": ["null", "string"], "default": null}
]}

Rolling upgrade strategy:

  1. Register V2 schema.
  2. Deploy consumers first (they handle V1 messages with default for missing field).
  3. Deploy producers (publish V2).

This system: Deploy order-service consumers before order-service producers when adding new event fields.


Q19 — What is the difference between poll() timeout and max.poll.interval.ms? senior

Answer:

Common mistake: Setting max.poll.records high and having slow message processing. The consumer processes 500 records but takes 6 minutes, exceeding max.poll.interval.ms=300000 (5 min default) → rebalance → duplicate processing.

Fix: Reduce max.poll.records, increase max.poll.interval.ms accordingly, or process records asynchronously (with care around offset commit ordering).


Q20 — How does partition rebalance work with the cooperative sticky assignor? senior

Answer: The default Eager Assignor (Range/RoundRobin) triggers a "stop-the-world" rebalance: all consumers revoke ALL their partitions, then re-assign. This causes processing gaps.

The Cooperative Sticky Assignor (available since Kafka 2.4) implements incremental rebalancing:

  1. Consumers propose their current assignments.
  2. The Group Coordinator identifies partitions that need to move.
  3. Only those partitions are revoked and reassigned; other partitions continue processing.

Benefits: No processing pause for partitions that don't move. Reduces consumer lag spike on rebalance.

Configure: partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor


Q21 — How would you implement exactly-once processing end-to-end? senior

Answer: Full exactly-once requires coordination at three layers:

  1. Idempotent producer (enable.idempotence=true): Deduplicates retried produce calls within a session.
  2. Transactional producer: Groups multiple sends into an atomic unit.
  3. Consumer isolation (isolation.level=read_committed): Consumers only see committed transactional messages, not in-flight or aborted ones.
producer.initTransactions();
producer.beginTransaction();
try {
    producer.send(record1);
    producer.send(record2);
    producer.sendOffsetsToTransaction(offsets, consumerGroupMetadata);
    producer.commitTransaction();
} catch (Exception e) {
    producer.abortTransaction();
}

Trade-offs: ~10-20% throughput reduction. Only appropriate for financial or inventory systems where duplicates are catastrophic. Most systems accept at-least-once + idempotent consumers.


Q22 — What is the role of min.insync.replicas and how does it interact with acks=all? senior

Answer: min.insync.replicas (broker/topic config) defines the minimum number of replicas that must acknowledge a write for it to succeed when acks=all.

Formula: replication.factor >= min.insync.replicas + 1 (leave room for one broker to be offline).

Example (replication.factor=3, min.insync.replicas=2):

Meaning: With this config, you can lose 1 broker and still write. You cannot lose 2 brokers simultaneously.

This system: Using default single-node Kafka in Docker for dev/test. Production would use replication.factor=3, min.insync.replicas=2.


Q23 — How would you debug a consumer that is not receiving messages? junior

Answer: Systematic debugging steps:

  1. Check consumer group status: kafka-consumer-groups.sh --describe --group payment-processors — verify members and assigned partitions.
  2. Check lag: Is lag 0 (caught up) or growing?
  3. Check topic existence: kafka-topics.sh --list — verify topic name (typo in config?).
  4. Check auto.offset.reset: Is it latest and the consumer started before messages were produced?
  5. Check deserializer: Deserialization errors cause the consumer to log and skip (or throw, depending on error handler).
  6. Check Spring logs: Enable logging.level.org.springframework.kafka=DEBUG.
  7. Produce a test message manually: kafka-console-producer.sh → see if consumer receives it.
  8. Check ACL: Is the consumer authorized to read the topic (in secured clusters)?

Q24 — What is the difference between commitSync and commitAsync? junior

Answer:

Best practice: Use commitAsync() during normal processing for throughput, and commitSync() in the finally block on shutdown to ensure the last batch is committed.

Spring default: AckMode.BATCH — commits offsets after each poll batch returns. This is a form of async commit optimized for throughput.


Q25 — How do you test Kafka consumers and producers in unit/integration tests? junior

Answer: Unit tests (no broker needed):

Integration tests (with Testcontainers):

@Testcontainers
class OrderEventConsumerTest {
    @Container
    static KafkaContainer kafka = new KafkaContainer(DockerImageName.parse("confluentinc/cp-kafka:7.6.1"));

    @Test
    void shouldProcessOrderCreated() {
        // produce test message → await consumer invocation → assert side effects
    }
}

This system: Uses @EmbeddedKafka (Spring) for per-service tests and DockerComposeContainer for full E2E tests. KafkaTestUtils provides getSingleRecord() helper.


Q26 — What is a Kafka Streams application and how does it differ from a consumer group? senior

Answer: Kafka Streams is a library for stateful, exactly-once stream processing within a JVM application:

vs. Consumer group:

Consumer Group Kafka Streams
Processing Stateless per-message Stateful (joins, windows)
State External (DB) Local state store + changelog
Exactly-once Complex to achieve Built-in with processing.guarantee=exactly_once_v2
Complexity Low Higher

This system: Does not use Kafka Streams. Payment and notification services use simple consumer groups because they don't require stateful aggregation. Streams would be appropriate for "calculate total order value per customer per hour" analytics.


Q27 — How does Kafka handle backpressure? senior

Answer: Kafka does not have built-in backpressure signaling from consumer to producer. Instead, backpressure is managed through:

  1. Consumer lag accumulation: Messages queue up in the topic. Producers write at their rate; consumers read at their rate. The gap (lag) absorbs the burst.
  2. Producer blocking: If the producer's internal buffer (buffer.memory) fills up, send() blocks for max.block.ms before throwing an exception.
  3. Application-level throttling: Rate limit producers based on consumer lag metrics (adaptive rate limiting).
  4. Spring reactive WebFlux: In this system's API Gateway, reactive backpressure via Project Reactor Flux.limitRate() can throttle upstream requests.

This system: The order.created topic acts as a buffer. If payment-service is slow (DB contention during sales peak), lag grows in the topic. No messages are lost; they process when capacity is available.


Q28 — Explain Kafka's log segment structure and how retention works. senior

Answer: Each partition is stored as a series of log segments on disk:

order.created-0/
  00000000000000000000.log     ← message data
  00000000000000000000.index   ← offset → byte position index
  00000000000000000000.timeindex ← timestamp → offset index
  00000000000001000000.log     ← next segment (after roll)

A new segment is created when the active segment reaches log.segment.bytes (default 1 GB) or log.roll.ms (default 7 days).

Retention modes:

This system: Default 7-day retention. Sufficient for retry/DLT scenarios. A payment DLT message is never older than days before ops investigates.


Q29 — How would you increase Kafka throughput for a high-volume producer? senior

Answer: Key producer throughput tunings:

Config Default High-throughput Value
batch.size 16 KB 65536 (64 KB) or higher
linger.ms 0 10-50 ms
compression.type none lz4 or snappy
buffer.memory 32 MB 128 MB+
acks all 1 (if durability trade-off acceptable)

linger.ms > 0 causes the producer to wait briefly before sending, allowing more messages to batch together. Combined with larger batch.size, this dramatically increases throughput at a small latency cost.

Consumer throughput:


Q30 — How would you design a multi-tenant Kafka setup where tenant data must be isolated? senior

Answer: Three main strategies:

1. Topic-per-tenant:

2. Partition-per-tenant:

3. Kafka cluster-per-tenant (dedicated cluster):

Recommended (SaaS):

This system: Single-tenant. No isolation needed. Would use topic-per-tenant if extended to multi-tenant SaaS.


Q31 — What is Kafka Connect and when would you use it? junior

Answer: Kafka Connect is a framework for streaming data between Kafka and external systems (databases, object stores, search engines, etc.) using reusable connectors — without writing custom producer/consumer code.

Two connector types:

// Example: Debezium PostgreSQL source connector config
{
  "name": "orders-cdc-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.dbname": "orders_db",
    "database.server.name": "orders",
    "table.include.list": "public.orders",
    "plugin.name": "pgoutput"
  }
}

When to use: CDC pipelines, ETL into data warehouses, syncing Kafka topics to Elasticsearch for search. Avoids writing boilerplate producer/consumer code for standard integrations.

This system: Not used directly, but Debezium Connect would be the natural choice to replace the manual OutboxPublisher polling with CDC-based outbox relay.


Q32 — How do you choose the right number of partitions for a topic? senior

Answer: Partitions determine parallelism and throughput. Over-partitioning wastes resources; under-partitioning limits throughput.

Formula starting point:

partitions = max(target_throughput / producer_throughput_per_partition,
                 target_throughput / consumer_throughput_per_partition)

Practical guidelines:

Factors to consider:

Factor Impact
Target consumer parallelism 1 partition per consumer max
Message ordering requirements More partitions → harder to maintain global order
Replication overhead partitions × replication.factor log files on each broker
Retention size Each partition is stored separately

Warning: You can increase partitions later, but it changes the key-to-partition mapping. This breaks ordering guarantees for key-based producers. Plan conservatively upfront for order-sensitive topics.

This system: order.created uses 3 partitions — matching the 3 payment-service pods. If we scale to 6 pods, we'd increase to 6 partitions during a maintenance window with consumer group re-assignment.


Q33 — What is the difference between key-based partitioning and round-robin? junior

Answer:

// Key-based — orderId ensures all events for the same order go to the same partition
ProducerRecord<String, OrderCreatedEvent> record =
    new ProducerRecord<>("order.created", order.getId().toString(), event);

// Round-robin — no key
ProducerRecord<String, LogEvent> logRecord =
    new ProducerRecord<>("app.logs", null, logEvent);

When to use round-robin: Logs, metrics, or any data where ordering doesn't matter and you want maximum throughput distribution.

When to use key-based: Domain events where processing order matters (e.g., all events for orderId=123 must process in sequence: created → paid → shipped).

This system: order.created, payment.processed, and order.status.updated all use orderId as the key.


Q34 — How does Kafka handle message compression and what are the tradeoffs? junior

Answer: Kafka supports producer-side compression. The producer compresses a batch of messages; the broker stores the compressed batch; consumers decompress.

Supported codecs:

Codec Compression Ratio CPU Cost Best For
none None Low-volume or pre-compressed data
gzip 4–8× High High ratio needed, CPU available
snappy 2–4× Low Balanced (Google's default)
lz4 2–4× Very low Lowest latency
zstd 4–8× Medium Best ratio/speed tradeoff (Kafka 2.1+)
# Spring Boot producer config
spring:
  kafka:
    producer:
      compression-type: lz4
      batch-size: 65536      # 64 KB batch
      linger-ms: 10          # wait up to 10ms to fill batch

Key insight: Compression works best with larger batches (linger.ms > 0). Compressing single tiny messages has overhead without benefit.

This system: Uses lz4 for Avro-serialized order events. Avro is already compact binary, so compression gain is modest (~20%), but latency savings from smaller network payloads justify it.


Q35 — What are Kafka message headers and how are they used? junior

Answer: Kafka headers are key-value metadata attached to a message, separate from the message key and value. They don't affect partitioning and are stored as byte arrays.

Common uses:

// Producer: add correlation ID header
@Autowired KafkaTemplate<String, OrderCreatedEvent> kafkaTemplate;

public void publish(OrderCreatedEvent event, String correlationId) {
    var record = new ProducerRecord<>("order.created", event.getOrderId(), event);
    record.headers().add("X-Correlation-ID", correlationId.getBytes(StandardCharsets.UTF_8));
    kafkaTemplate.send(record);
}

// Consumer: read header
@KafkaListener(topics = "order.created", groupId = "payment-processors")
public void handle(
    OrderCreatedEvent event,
    @Header(value = "X-Correlation-ID", required = false) byte[] correlationIdBytes
) {
    String correlationId = correlationIdBytes != null
        ? new String(correlationIdBytes, StandardCharsets.UTF_8)
        : UUID.randomUUID().toString();
    MDC.put("correlationId", correlationId);
    // process...
}

This system: Propagates X-Correlation-ID from the HTTP request through Kafka events to enable distributed tracing across order → payment → notification flow.


Q36 — How do you produce messages with Spring Boot's KafkaTemplate? junior

Answer: KafkaTemplate is Spring Kafka's high-level producer abstraction that wraps the native Kafka Producer. It handles serialization, threading, and async callbacks.

@Service
public class OrderEventPublisher {

    private final KafkaTemplate<String, OrderCreatedEvent> kafkaTemplate;

    public OrderEventPublisher(KafkaTemplate<String, OrderCreatedEvent> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    public void publish(OrderCreatedEvent event) {
        CompletableFuture<SendResult<String, OrderCreatedEvent>> future =
            kafkaTemplate.send("order.created", event.getOrderId(), event);

        future.whenComplete((result, ex) -> {
            if (ex != null) {
                log.error("Failed to publish order event: orderId={}", event.getOrderId(), ex);
            } else {
                RecordMetadata meta = result.getRecordMetadata();
                log.info("Published to partition={} offset={}",
                    meta.partition(), meta.offset());
            }
        });
    }
}
# application.yml
spring:
  kafka:
    producer:
      bootstrap-servers: localhost:9092
      key-serializer: org.apache.kafka.common.serialization.StringSerializer
      value-serializer: io.confluent.kafka.serializers.KafkaAvroSerializer
      acks: all
      retries: 3
      enable-idempotence: true
    properties:
      schema.registry.url: http://localhost:8081

Key methods:


Q37 — What is the purpose of fetch.min.bytes and fetch.max.wait.ms? senior

Answer: These two configs control how the broker responds to consumer fetch requests, trading latency vs. throughput:

Consumer sends FETCH request
  → Broker checks: do I have ≥ fetch.min.bytes?
      YES → respond immediately
      NO  → wait up to fetch.max.wait.ms, then respond with what's available

Tuning for throughput (high-volume topics):

spring:
  kafka:
    consumer:
      fetch-min-size: 102400     # 100 KB — wait for 100 KB before responding
      fetch-max-wait: 500        # but no more than 500ms
      max-poll-records: 500

Tuning for low latency (real-time alerting):

spring:
  kafka:
    consumer:
      fetch-min-size: 1          # respond immediately with any data
      fetch-max-wait: 50         # max 50ms wait

This system: Notification service uses low fetch-min-size=1 for near-real-time delivery of order.status.updated events.


Q38 — What is the __consumer_offsets internal topic? senior

Answer: __consumer_offsets is a Kafka-internal compacted topic that stores committed consumer group offsets. Before Kafka 0.9, offsets were stored in ZooKeeper; since 0.9, they're stored in this topic for better scalability and availability.

Structure:

50 partitions by default (controlled by offsets.topic.num.partitions). Each consumer group is assigned to a partition of __consumer_offsets based on hash(group_id) % 50.

How to inspect (debugging):

# List all consumer groups
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list

# Describe a group (shows committed offsets, lag, and assigned members)
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group payment-processors

# Output:
# GROUP               TOPIC          PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
# payment-processors  order.created  0          1523            1523            0
# payment-processors  order.created  1          1201            1205            4  ← lag!

This system: Used for monitoring. A lag > 0 alert on payment-processors group triggers investigation into payment-service pod health.


Q39 — How do you implement consumer pause and resume for rate limiting? senior

Answer: Kafka consumers support pause(partitions) and resume(partitions) to temporarily stop fetching from specific partitions without leaving the consumer group or triggering a rebalance.

Use cases:

@Service
public class RateLimitedPaymentConsumer {

    private final RateLimiter rateLimiter = RateLimiter.create(100); // 100 events/sec

    @KafkaListener(topics = "order.created", groupId = "payment-processors",
                   containerFactory = "kafkaListenerContainerFactory")
    public void handle(OrderCreatedEvent event,
                       Acknowledgment ack,
                       Consumer<?, ?> consumer) {
        // Pause if rate limit exceeded
        if (!rateLimiter.tryAcquire()) {
            Collection<TopicPartition> assigned = consumer.assignment();
            consumer.pause(assigned);

            // Resume after a short delay (Spring will call poll() again)
            // In practice, use a scheduled task or AOP-based circuit breaker
        }

        processPayment(event);
        ack.acknowledge();

        // Resume paused partitions after processing
        if (!consumer.paused().isEmpty()) {
            consumer.resume(consumer.paused());
        }
    }
}

Spring approach: Use KafkaListenerEndpointRegistry to pause/resume the entire listener container:

@Autowired KafkaListenerEndpointRegistry registry;

// Pause
registry.getListenerContainer("paymentListener").pause();

// Resume
registry.getListenerContainer("paymentListener").resume();

Q40 — What is static group membership (group.instance.id) and why does it matter? senior

Answer: By default, Kafka consumers have dynamic membership: each poll() session gets a new member ID, and any disconnect (pod restart, GC pause) triggers a rebalance.

Static membership (group.instance.id) assigns a stable identity to a consumer. When a static member disconnects and reconnects within session.timeout.ms, Kafka recognizes it by instance ID and skips the rebalance, reassigning the same partitions.

# application.yml — give each pod a stable instance ID
spring:
  kafka:
    consumer:
      group-instance-id: payment-processor-pod-${HOSTNAME}
      session-timeout-ms: 60000   # give pod 60s to restart before rebalance

Benefits:

Tradeoff: If a pod dies permanently (not just restarting), its partitions won't be reassigned until session.timeout.ms expires. Set it shorter than your deployment window.

This system: payment-service pods in Kubernetes use HOSTNAME-based group.instance.id. A rolling deploy of 3 pods takes ~90 seconds but causes zero rebalances.


Q41 — How do you handle large messages in Kafka? senior

Answer: Kafka's default max.message.bytes is 1 MB. For larger payloads, you have three strategies:

1. Increase message size limits (simplest, not recommended past ~10 MB):

# Broker topic config
log.message.max.bytes=10485760  # 10 MB

# Producer
spring.kafka.producer.properties.max.request.size=10485760

# Consumer
spring.kafka.consumer.properties.fetch.message.max.bytes=10485760

2. External reference pattern (recommended for large blobs): Store the large payload in object storage (S3, MinIO) and send only a reference in the Kafka message:

// Producer
String s3Key = "orders/large-payload/" + orderId + ".json";
s3Client.putObject(bucket, s3Key, largePayload);

LargeOrderEvent event = LargeOrderEvent.newBuilder()
    .setOrderId(orderId)
    .setPayloadRef(s3Key)   // reference, not data
    .build();
kafkaTemplate.send("order.large", orderId, event);

// Consumer
LargeOrderEvent event = ...;
String payload = s3Client.getObject(bucket, event.getPayloadRef());

3. Kafka chunking (complex, avoid if possible): Split the message into multiple Kafka records with sequence headers, reassemble on the consumer side.

This system: Order events are small Avro records (~200 bytes). If image uploads were included, the external reference pattern would be used with MinIO.


Q42 — What is the difference between subscribe() and assign() in a Kafka consumer? junior

Answer:

subscribe() assign()
Usage consumer.subscribe(List.of("order.created")) consumer.assign(List.of(new TopicPartition("order.created", 0)))
Partition assignment Kafka Group Coordinator manages it Manual — you specify exact partitions
Rebalance Triggered automatically on membership changes No rebalancing — you own the assignment
Consumer group Participates in a group No group coordination (no offset commit to broker group)
Offset management Group offsets via commitSync/commitAsync Manual or commitSync with assigned partitions

When to use assign():

// Direct partition assignment for reprocessing partition 0 from offset 500
TopicPartition tp = new TopicPartition("order.created", 0);
consumer.assign(List.of(tp));
consumer.seek(tp, 500L);   // start from specific offset
while (true) {
    ConsumerRecords<String, OrderCreatedEvent> records = consumer.poll(Duration.ofMillis(100));
    // process...
}

Spring context: @KafkaListener always uses subscribe() internally. For assign() semantics, use KafkaConsumer directly.


Q43 — How does Spring Kafka's ConcurrentMessageListenerContainer achieve parallelism? senior

Answer: ConcurrentMessageListenerContainer spins up N internal KafkaMessageListenerContainer instances, each running its own poll loop in a dedicated thread. Each thread is an independent consumer in the same consumer group.

@Bean
public ConcurrentKafkaListenerContainerFactory<String, OrderCreatedEvent>
    kafkaListenerContainerFactory(ConsumerFactory<String, OrderCreatedEvent> consumerFactory) {

    ConcurrentKafkaListenerContainerFactory<String, OrderCreatedEvent> factory =
        new ConcurrentKafkaListenerContainerFactory<>();
    factory.setConsumerFactory(consumerFactory);
    factory.setConcurrency(3);  // 3 threads = 3 consumers in the group
    factory.getContainerProperties().setAckMode(ContainerProperties.AckMode.MANUAL);
    return factory;
}

Or via annotation:

@KafkaListener(
    topics = "order.created",
    groupId = "payment-processors",
    concurrency = "3"   // overrides factory default for this listener
)
public void handle(OrderCreatedEvent event, Acknowledgment ack) { ... }

Thread safety: Each thread has its own KafkaConsumer instance (not thread-safe). Do NOT share a single KafkaConsumer across threads.

Partition assignment: With concurrency=3 and 3 partitions, each thread gets 1 partition. With concurrency=6 and 3 partitions, 3 threads are idle.

This system: payment-service uses concurrency=3 matching the 3 partitions of order.created. All 3 pods × 3 threads = 9 consumers, but only 3 partitions means 6 consumers are idle. Correct configuration: 1 thread per pod = 3 pods = 3 consumers = 1 per partition.


Q44 — What is Kafka MirrorMaker 2 and how does it enable cross-datacenter replication? senior

Answer: MirrorMaker 2 (MM2) is a Kafka Connect-based tool for replicating topics between Kafka clusters. It's the production replacement for the original MirrorMaker.

Architecture:

Primary DC (us-east)          Secondary DC (eu-west)
┌─────────────────┐           ┌─────────────────────────┐
│ Kafka Cluster A │──MM2────▶│ Kafka Cluster B          │
│ order.created   │           │ us-east.order.created    │
│ payment.processed│          │ us-east.payment.processed│
└─────────────────┘           └─────────────────────────┘

MM2 prefixes replicated topics with the source cluster alias (us-east.) to avoid naming conflicts. It replicates:

Use cases:

Consumer offset sync: MM2 translates offsets using a checkpointing mechanism, enabling consumer groups to resume on the secondary cluster with minimal lag after failover.

This system: Single-region Docker setup. MM2 would be added if deployed to multi-region production (primary: us-east-1, DR: us-west-2).


Q45 — How do you implement message filtering in a Kafka consumer without creating separate topics? junior

Answer: Two approaches for filtering without extra topics:

1. RecordFilterStrategy (Spring Kafka): Configures a filter at the container level. Filtered records are acknowledged but not passed to the listener method.

@Bean
public ConcurrentKafkaListenerContainerFactory<String, OrderCreatedEvent>
    filteredListenerContainerFactory(ConsumerFactory<String, OrderCreatedEvent> cf) {

    var factory = new ConcurrentKafkaListenerContainerFactory<String, OrderCreatedEvent>();
    factory.setConsumerFactory(cf);
    // Filter: only process HIGH_VALUE orders (> $500)
    factory.setRecordFilterStrategy(record ->
        record.value().getTotalAmount().compareTo(new BigDecimal("500")) < 0
    );
    factory.setAckDiscarded(true); // auto-ack filtered messages
    return factory;
}

2. Header-based routing in the listener method:

@KafkaListener(topics = "order.created", groupId = "fraud-detectors")
public void handle(
    OrderCreatedEvent event,
    @Header("order-region") String region
) {
    if (!"EU".equals(region)) {
        return; // skip non-EU orders silently
    }
    runFraudCheck(event);
}

Tradeoff: Consumer still fetches all messages from Kafka — filtering is application-level. For high-volume topics where only a small fraction is relevant, consider a dedicated filtered topic (produced by a streaming processor) to reduce consumer load.


Q46 — What are the tradeoffs between Avro, Protobuf, and JSON Schema with Schema Registry? senior

Answer:

Avro Protobuf JSON Schema
Wire format Binary Binary JSON text
Schema language JSON .proto IDL JSON Schema
Size Smallest Smallest Largest (5–10×)
Speed Fast Fastest Slowest
Language support Wide Widest Universal
Schema evolution Backward/forward with defaults Field number-based, robust Limited enforcement
Tooling Strong with Confluent Strong everywhere Basic
Human readable No No Yes

When to choose:

Schema Registry supports all three with the same compatibility enforcement and schema ID embedding.

This system: Uses Avro throughout internal services. If an external partner API needed to consume events, a separate JSON Schema topic or a transformation layer would be added.


Q47 — How does sendOffsetsToTransaction enable consume-transform-produce exactly-once? senior

Answer: In a read-process-write pipeline (consume from topic A, transform, produce to topic B), exactly-once requires that the consumer offset commit and the producer send are atomic — either both commit or neither does.

sendOffsetsToTransaction() atomically includes the consumer offset commit inside the producer transaction. If the transaction aborts, the offset commit is rolled back too.

@Bean
public KafkaTransactionManager<String, TransformedEvent> txManager(
    ProducerFactory<String, TransformedEvent> pf) {
    return new KafkaTransactionManager<>(pf);
}

// Transactional consume-transform-produce
@Transactional("kafkaTransactionManager")
@KafkaListener(topics = "order.created", groupId = "transformer-group",
               containerFactory = "exactlyOnceFactory")
public void transformAndForward(
    ConsumerRecord<String, OrderCreatedEvent> record,
    @Autowired KafkaTemplate<String, TransformedEvent> producerTemplate
) {
    TransformedEvent transformed = transform(record.value());

    // Within the same transaction: send + offset commit
    producerTemplate.send("order.transformed", record.key(), transformed);

    // sendOffsetsToTransaction is called automatically by Spring
    // when using ChainedKafkaTransactionManager or TransactionalKafkaMessageListenerContainer
}

Key configs:

spring:
  kafka:
    producer:
      transaction-id-prefix: "transformer-txn-"
    consumer:
      isolation-level: read_committed  # don't read uncommitted transactional messages
    listener:
      ack-mode: MANUAL  # Spring manages offset commit inside transaction

This system: Not implemented (payment-service is not a consume-transform-produce pipeline). Would be applicable if building a stream processor that enriches order.created before forwarding to order.enriched.


Q48 — What is consumer group heartbeat and how do session.timeout.ms and heartbeat.interval.ms interact? junior

Answer: A Kafka consumer sends periodic heartbeats to the Group Coordinator broker to signal it is alive. If the coordinator doesn't receive a heartbeat within session.timeout.ms, it removes the consumer from the group and triggers a rebalance.

Consumer ──heartbeat──▶ Coordinator  (every heartbeat.interval.ms)
Coordinator tracks last-seen timestamp per consumer
If now - last_seen > session.timeout.ms → consumer presumed dead → rebalance

Heartbeats vs. max.poll.interval.ms: Heartbeats are sent by a background thread — a consumer can be heartbeating but still trigger a rebalance if it doesn't call poll() within max.poll.interval.ms. These are two independent liveness checks:

Recommended configuration:

spring:
  kafka:
    consumer:
      heartbeat-interval: 3000      # 3s heartbeat
      session-timeout-ms: 15000     # 15s session timeout (detect fast failures)
      properties:
        max.poll.interval.ms: 300000  # 5 min for slow processing

Q49 — How would you perform a zero-downtime Kafka topic partition increase? senior

Answer: Increasing partitions is a one-way operation — you cannot reduce them without recreating the topic. It also breaks key-to-partition mapping for existing keys, which can violate ordering guarantees.

Procedure for zero-downtime:

Step 1: Increase partitions during a low-traffic window:

kafka-topics.sh --bootstrap-server localhost:9092 \
  --alter --topic order.created \
  --partitions 6

Step 2: Wait for all existing messages in old partitions (0–2) to be consumed. Check lag:

kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group payment-processors

Step 3: Scale consumers to match new partition count (e.g., from 3 to 6 pods).

Step 4: Verify new partitions (3–5) are being assigned and consumed.

Ordering caveat: After the change, murmur2("orderId-123") % 6 maps to a different partition than murmur2("orderId-123") % 3. In-flight messages for a key may be split across the old and new partition. Strategy:

This system: order.created partition increase from 3 → 6 would require a maintenance window to drain existing messages and update the payment-service deployment count.


Q50 — How does Kafka's isolation.level protect consumers in transactional topics? senior

Answer: isolation.level controls whether consumers read messages from transactional producers before those transactions are committed.

read_uncommitted (default):

read_committed:

# Consumer config for exactly-once processing
spring:
  kafka:
    consumer:
      properties:
        isolation.level: read_committed

Last Stable Offset (LSO) vs. High Watermark (HW):

Partition log:
  offset 0: msg (non-transactional) ✓ visible to all
  offset 1: msg (txn A - committed) ✓ visible to read_committed
  offset 2: msg (txn B - in-flight) ✗ read_committed pauses here (LSO=2)
  offset 3: msg (txn A - committed) ✓ becomes visible after txn B resolves

This system: Payment and notification consumers use read_uncommitted (default) since order events are not published transactionally. If the Outbox Publisher were upgraded to use Kafka transactions, consumers would switch to read_committed to avoid processing rolled-back order records.