Microservices Architecture — Interview Questions

Stack context: This system is a six-service ecommerce platform (api-gateway, user-service, product-service, order-service, payment-service, notification-service) using event-driven architecture with Apache Kafka. Key patterns implemented: transactional outbox, saga (choreography), DLT (dead-letter topic), API gateway, session-based auth, circuit breaker, rate limiting, and distributed tracing.

Q1 — What is microservices architecture and what problems does it solve? `junior`

Answer: Microservices is an architectural style where an application is decomposed into small, independently deployable services, each focused on a specific business capability.

Problems it solves:

Scaling: Scale individual services under load (scale payment-service independently of notification-service).
Independent deployment: Deploy order-service without redeploying payment-service.
Technology flexibility: Use the right tool per service (Redis for sessions, PostgreSQL for orders).
Team autonomy: Different teams own different services with clear API contracts.
Fault isolation: A failure in notification-service doesn't crash order-service.

Trade-offs introduced:

Distributed systems complexity (network failures, partial failures).
Data consistency challenges (no distributed ACID transactions).
Operational overhead (6 services + 4 DBs + 3 infra components).
Observability complexity (distributed tracing, log aggregation).

Q2 — What is the Transactional Outbox Pattern and why is it needed? `junior`

Answer: The Transactional Outbox Pattern solves the "dual write" problem: how to atomically update a database AND publish an event to Kafka.

The problem: A network failure after DB commit but before Kafka publish leaves the system inconsistent — order saved but no event sent, payment never processed.

The solution:

Write the business entity (order) AND an outbox event record to the same database in one transaction.
A separate poller reads unprocessed outbox events and publishes them to Kafka.
On successful Kafka publish, mark the outbox event as processed.

DB Transaction:
  INSERT INTO orders (id, ...) VALUES (...)
  INSERT INTO outbox_events (id, event_type, payload, status) VALUES (...)
  
  -- Committed atomically. If Kafka is down, the outbox has the event.

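OutboxPoller (scheduled):
  SELECT * FROM outbox_events WHERE status = 'PENDING'
  FOR EACH event:
    kafkaTemplate.send(event.topic, event.payload).get()   -- send is async: block until the broker acknowledges
    UPDATE outbox_events SET status = 'PUBLISHED' WHERE id = event.id
    -- A crash between send and UPDATE republishes the event on the next poll,
    -- so delivery is at-least-once and consumers must be idempotent.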

This system: order-service implements the outbox pattern. The OrderCreatedEvent is written to the outbox table within the same transaction that creates the Order entity, ensuring at-least-once delivery to Kafka.
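
The poller loop above can be modelled without a database or Kafka. The following is a minimal in-memory sketch, not the order-service implementation: `OutboxSketch`, `createOrder`, `pollOnce`, and the `publisher` predicate are hypothetical names, and a `synchronized` method stands in for a single DB transaction. It shows that an event stays PENDING until the publisher reports a broker acknowledgement.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** In-memory stand-in for the outbox table and its poller (illustration only). */
class OutboxSketch {
    enum Status { PENDING, PUBLISHED }

    static final class OutboxEvent {
        final String id;
        final String payload;
        Status status = Status.PENDING;

        OutboxEvent(String id, String payload) {
            this.id = id;
            this.payload = payload;
        }
    }

    private final List<String> orders = new ArrayList<>();
    private final List<OutboxEvent> outbox = new ArrayList<>();

    /** Stands in for one DB transaction: the order row and the event row are written together. */
    synchronized void createOrder(String orderId) {
        orders.add(orderId);
        outbox.add(new OutboxEvent("evt-" + orderId, "ORDER_CREATED:" + orderId));
    }

    /**
     * One poller pass. The publisher returns true only once the broker has
     * acknowledged the message; otherwise the event stays PENDING and is retried.
     */
    synchronized int pollOnce(Predicate<String> publisher) {
        int published = 0;
        for (OutboxEvent event : outbox) {
            if (event.status == Status.PENDING && publisher.test(event.payload)) {
                event.status = Status.PUBLISHED;
                published++;
            }
        }
        return published;
    }

    synchronized long pendingCount() {
        return outbox.stream().filter(e -> e.status == Status.PENDING).count();
    }
}
```

Calling `pollOnce(payload -> false)` simulates a Kafka outage: nothing is marked PUBLISHED, and the next pass picks the same events up again.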

Q3 — What is the Saga Pattern and when is it used? `junior`

Answer: The Saga Pattern manages long-running business transactions that span multiple services without a distributed transaction (2PC). Each service executes a local transaction and publishes an event. If a step fails, compensating transactions undo previous steps.

Two implementations:

Choreography (event-driven, this system):

Each service listens for events and triggers the next step.
No central coordinator.
Simple but harder to track the saga flow.

order-service: ORDER_CREATED event →
payment-service: processes payment → PAYMENT_PROCESSED event →
order-service: updates order to CONFIRMED →
notification-service: sends confirmation email

Orchestration (centralized):

A Saga Orchestrator calls each service step by step.
Clear visibility of saga state.
Central point of failure.

Compensation: If payment fails:

payment-service: PAYMENT_FAILED event →
order-service: updates order to FAILED (compensation) →
product-service: releases reserved stock, if any was reserved earlier in the saga (compensation) →
notification-service: sends failure notification
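
The two flows above (happy path and compensation) can be sketched with a tiny in-memory event bus. This is an illustration under assumptions, not the system's Kafka wiring: `SagaChoreographySketch`, `wire`, and the `paymentSucceeds` switch are hypothetical, and handlers run synchronously in the publishing thread.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** In-memory choreography sketch: services react to events, there is no coordinator. */
class SagaChoreographySketch {
    private final Map<String, List<Consumer<String>>> subscribers = new HashMap<>();
    final Map<String, String> orderStatus = new HashMap<>();
    final List<String> log = new ArrayList<>();

    void subscribe(String eventType, Consumer<String> handler) {
        subscribers.computeIfAbsent(eventType, k -> new ArrayList<>()).add(handler);
    }

    void publish(String eventType, String orderId) {
        log.add(eventType + ":" + orderId);
        for (Consumer<String> handler : subscribers.getOrDefault(eventType, List.of())) {
            handler.accept(orderId);
        }
    }

    /** Wires the handlers; paymentSucceeds stands in for the payment provider's answer. */
    static SagaChoreographySketch wire(boolean paymentSucceeds) {
        SagaChoreographySketch bus = new SagaChoreographySketch();
        // payment-service reacts to ORDER_CREATED
        bus.subscribe("ORDER_CREATED", id ->
            bus.publish(paymentSucceeds ? "PAYMENT_PROCESSED" : "PAYMENT_FAILED", id));
        // order-service reacts to the payment outcome
        bus.subscribe("PAYMENT_PROCESSED", id -> bus.orderStatus.put(id, "CONFIRMED"));
        // compensation: undo the local step taken when the order was created
        bus.subscribe("PAYMENT_FAILED", id -> bus.orderStatus.put(id, "FAILED"));
        return bus;
    }

    /** order-service local transaction: save the order, then announce it. */
    void placeOrder(String orderId) {
        orderStatus.put(orderId, "PENDING");
        publish("ORDER_CREATED", orderId);
    }
}
```

The `log` list makes the "harder to track" trade-off visible: the only record of the saga's progress is the sequence of events, since no component holds the overall state.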

Q4 — What is a Dead Letter Topic (DLT) and how is it used? `junior`

Answer: A Dead Letter Topic (DLT) is a Kafka topic where messages that cannot be processed after all retry attempts are sent. It prevents poison messages from blocking normal message processing.

Flow:

order.created → consumer processes → fails
                → retry 1 (100ms wait) → fails
                → retry 2 (200ms wait) → fails
                → retry 3 (400ms wait) → fails (4th attempt total)
                → send to order.created.DLT

Spring Kafka @RetryableTopic:

@RetryableTopic(
    attempts = "4",
    backoff = @Backoff(delay = 100, multiplier = 2.0),
    dltTopicSuffix = ".DLT"
)
@KafkaListener(topics = "order.created")
public void processOrder(OrderCreatedEvent event) {
    // On 4th failure, message goes to order.created.DLT
}

DLT monitoring: Operators monitor the DLT. A non-empty DLT triggers an alert. Messages are analyzed, root cause fixed, and messages replayed manually or via an automated dead letter processor.

This system: order.created.DLT collects failed order events for manual review and replay.
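
To make the retry-then-dead-letter flow concrete, here is a small sketch with no Kafka involved. `DeadLetterSketch` and its members are hypothetical names; waits are recorded in a list rather than slept, and the 100ms initial delay and 2.0 multiplier are taken from the annotation shown above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Sketch of "retry with exponential backoff, then dead-letter". */
class DeadLetterSketch {
    final List<String> deadLetters = new ArrayList<>();
    final List<Long> waitsMs = new ArrayList<>();   // recorded instead of sleeping

    /** Delay before retry number `retry` (1-based): initial, initial*multiplier, ... */
    static long backoffMs(int retry, long initialDelayMs, double multiplier) {
        return Math.round(initialDelayMs * Math.pow(multiplier, retry - 1));
    }

    /** Returns true if the handler eventually succeeded, false if the message was dead-lettered. */
    boolean consume(String message, Consumer<String> handler, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                handler.accept(message);
                return true;
            } catch (RuntimeException e) {
                if (attempt < maxAttempts) {
                    waitsMs.add(backoffMs(attempt, 100, 2.0));
                }
            }
        }
        deadLetters.add(message);   // all attempts exhausted
        return false;
    }
}
```

With `maxAttempts = 4` a message that always fails produces three waits and then lands in `deadLetters`, while later messages are still processed normally.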

Q5 — What is the API Gateway Pattern and what responsibilities does it have? `junior`

Answer: An API Gateway is a single entry point for all client requests. It handles cross-cutting concerns that would otherwise be duplicated across all services.

Responsibilities:

Concern	Without Gateway	With Gateway
Authentication	Each service validates JWT	Gateway validates once, injects headers
Rate limiting	Each service implements it	Gateway enforces per-user limits
Routing	Client knows all service URLs	Client talks only to gateway
SSL termination	Each service needs certificate	Gateway handles SSL
CORS	Each service configures CORS	Gateway handles CORS
Request tracing	Each service adds trace ID	Gateway injects correlation ID

This system (api-gateway on port 8080):

Validates JWT session from Redis (or Caffeine fallback).
Injects X-User-Id, X-User-Email, X-User-Role headers.
Applies token-bucket rate limiting per user.
Routes to user-service (8081), product-service (8082), order-service (8083).

Q6 — What is eventual consistency and how do you handle it in microservices? `senior`

Answer: Eventual consistency means that after a series of updates, all replicas/services will eventually reach the same state — but there's no guarantee of immediate consistency.

In this system: After order placement:

order-service creates order with status PENDING.
payment-service processes payment (may take 100ms-10s).
order-service updates order to CONFIRMED.

Between steps 1 and 3, a user reading their order status sees PENDING. This is eventual consistency.

Handling strategies:

Read-your-writes: User reads from the same service instance that wrote (sticky sessions, Redis caching the latest state).
Versioning/ETags: Return order version; client can detect stale data.
Polling: Client polls until status changes.
WebSocket/SSE: Push status updates to client in real time.
Idempotency: Allow safe retries — clients can retry without side effects.

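CAP theorem: When a network partition occurs, a distributed system must choose between Consistency and Availability; partition tolerance itself is not optional. (The popular "pick 2 of 3" phrasing is a simplification.) Microservices typically favor availability (AP) for flows like order status, accepting eventual consistency.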

Q7 — What is idempotency and how do you implement it? `senior`

Answer: An operation is idempotent if calling it multiple times with the same input produces the same result as calling it once. Essential for safe retries in event-driven systems.

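Why needed: With the usual consumer setup (offsets committed after processing), Kafka delivers at-least-once: after a crash, a rebalance, or a producer retry, the same OrderCreatedEvent may be delivered multiple times. Payment should only be processed once.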

Implementation strategies:

1. Idempotency key in DB (most reliable):

@Entity
public class ProcessedEvent {
    @Id
    private String eventId;     // Kafka message key or UUID in event
    private Instant processedAt;
}

@Transactional
public void processPayment(PaymentRequest request) {
    if (processedEventRepo.existsById(request.getEventId())) {
        return;  // duplicate — skip
    }
    processedEventRepo.save(new ProcessedEvent(request.getEventId()));
    // ... process payment
}

2. Unique constraint in DB:

CREATE UNIQUE INDEX idx_payment_order ON payments(order_id);
-- Second attempt to insert payment for same order_id throws exception → rollback → safe

3. Redis idempotency (shorter TTL, distributed):

Boolean firstTime = redisTemplate.opsForValue()
    .setIfAbsent("processed:" + eventId, "1", Duration.ofHours(24));
if (Boolean.FALSE.equals(firstTime)) return; // already processed
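
All three strategies rest on the same idea: an atomic "insert if absent" decides who gets to process the event. A minimal in-memory sketch of that idea follows; `IdempotentConsumerSketch` is a hypothetical class, a concurrent set stands in for the `ProcessedEvent` table or Redis key, and a counter stands in for charging the payment.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Idempotent consumer sketch: the atomic "insert if absent" is the duplicate check. */
class IdempotentConsumerSketch {
    private final Set<String> processedEventIds = ConcurrentHashMap.newKeySet();
    private final AtomicInteger paymentsCharged = new AtomicInteger();

    /** Returns true if the event was processed, false if it was a duplicate. */
    boolean onOrderCreated(String eventId) {
        // add() is atomic on this set: of several concurrent duplicates, only one sees `true`.
        if (!processedEventIds.add(eventId)) {
            return false;
        }
        paymentsCharged.incrementAndGet();   // stand-in for the real side effect
        return true;
    }

    int paymentsCharged() {
        return paymentsCharged.get();
    }
}
```

Unlike this sketch, a real consumer should record the event ID and apply the side effect in the same transaction, so that a crash between the two cannot leave an event marked as processed without its payment.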

Q8 — What is the Circuit Breaker Pattern and how does it prevent cascading failures? `senior`

Answer: The Circuit Breaker prevents a system from repeatedly calling a failing service, allowing it time to recover.

States:

CLOSED → OPEN (after N failures or X% failure rate)
        → HALF_OPEN (after wait duration) → probes with limited requests
        → CLOSED (if probe succeeds) or OPEN (if probe fails)

Resilience4j in Spring Boot:

@CircuitBreaker(name = "payment-service", fallbackMethod = "paymentFallback")
public PaymentResult processPayment(PaymentRequest request) {
    return paymentClient.process(request);
}

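public PaymentResult paymentFallback(PaymentRequest request, Exception e) {
    // Invoked for any failure of processPayment, and for CallNotPermittedException while the circuit is open
    log.warn("payment-service call failed or circuit open: {}", e.getMessage());
    return PaymentResult.pending("PAYMENT_SERVICE_UNAVAILABLE");
}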

resilience4j.circuitbreaker.instances.payment-service:
  slidingWindowSize: 10
  failureRateThreshold: 50        # open when 50% of 10 requests fail
  waitDurationInOpenState: 10s
  permittedNumberOfCallsInHalfOpenState: 3

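Cascading failure scenario: Without a circuit breaker: payment-service is slow → order-service holds threads waiting → order-service becomes slow → gateway threads exhausted → entire system down. With a circuit breaker: once the failure rate crosses the threshold the circuit opens, further calls fail fast with the fallback instead of waiting, and the system remains responsive. Slow calls only count toward opening the circuit if a timeout (or Resilience4j's slowCallRateThreshold) is configured.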
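
The state transitions can be sketched as a small state machine. This is a simplified illustration, not Resilience4j's implementation: `CircuitBreakerSketch` is hypothetical, it counts consecutive failures rather than a sliding-window failure rate, it allows a single probe in HALF_OPEN, and it holds a lock during the call, which a production breaker would avoid.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Minimal circuit breaker sketch based on consecutive failures. */
class CircuitBreakerSketch {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;   // consecutive failures before opening
    private final long openMillis;        // time to stay OPEN before probing
    private final LongSupplier clock;     // injectable so tests can control time
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    CircuitBreakerSketch(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    /** Current state; OPEN turns into HALF_OPEN once the wait duration has elapsed. */
    synchronized State state() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= openMillis) {
            state = State.HALF_OPEN;
        }
        return state;
    }

    synchronized <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (state() == State.OPEN) {
            return fallback.get();        // fail fast: the downstream is not called
        }
        try {
            T result = action.get();
            consecutiveFailures = 0;
            state = State.CLOSED;         // a successful probe closes the circuit
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN;       // a failed probe re-opens immediately
                openedAt = clock.getAsLong();
            }
            return fallback.get();
        }
    }
}
```

Passing the clock in as a `LongSupplier` is a design choice for testability: the OPEN to HALF_OPEN transition can be exercised without sleeping.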

Q9 — How does service-to-service authentication work in a microservices system? `senior`

Answer: When service A calls service B, service B needs to verify the call is legitimate and identify the calling service.

Strategy 1: Trust gateway-injected headers (this system):

Gateway validates user JWT and injects X-User-Id, X-User-Email, X-User-Role.
Downstream services trust these headers and do NOT re-validate JWT.
Services are not exposed publicly — only through the gateway.
Risk: If a service is accidentally exposed directly, it has no auth.

Strategy 2: Service-to-service JWT (authenticates the caller only):

Each service has its own service account JWT (signed with internal key).
Service A includes its JWT in Authorization: Bearer <service-jwt> when calling Service B.
Service B validates the service JWT independently.

Strategy 3: mTLS (mutual TLS):

Each service has a client certificate.
Services verify each other's certificates during TLS handshake.
Used in service meshes (Istio, Linkerd).

Strategy 4: API keys:

Each service has a pre-shared API key for inter-service calls.
Simpler but less secure (static secret rotation is manual).

Best practice: Trust gateway-injected headers for user context (as this system does), plus mTLS or service JWTs for service identity verification.

Q10 — What is service discovery and how does it work? `junior`

Answer: Service discovery allows services to find each other dynamically without hardcoded IP addresses, which change frequently in containerized environments.

Two types:

Client-side discovery (Spring Cloud + Eureka):

Services register with Eureka server on startup.
Service A queries Eureka to get instances of Service B, then calls one directly.
lb://order-service in gateway config triggers client-side load balancing.

Server-side discovery (Kubernetes / NGINX / AWS ELB):

Service A calls a fixed hostname/URL.
Infrastructure (K8s DNS, load balancer) resolves to a healthy instance.
No service discovery library needed in the application.

# Spring Cloud Eureka registration
eureka:
  client:
    service-url.defaultZone: http://eureka-server:8761/eureka
  instance:
    instance-id: ${spring.application.name}:${server.port}

This system: Uses static URIs (http://order-service:8083) because Docker Compose DNS resolves service names. In Kubernetes, http://order-service.default.svc.cluster.local:8083 or just http://order-service:8083 works via K8s DNS.

Q11 — What is Bounded Context and how does it apply to microservice design? `senior`

Answer: Bounded Context (from Domain-Driven Design) defines the explicit boundary within which a specific domain model applies. Each microservice should ideally correspond to one bounded context.

Product in multiple contexts:

Service	What "Product" means	Fields
product-service	Catalog item	name, description, price, images, category
order-service	Ordered item (snapshot)	productId, name, price (at time of order), quantity
inventory-service (hypothetical; not one of this system's six services)	Stock unit	productId, warehouseId, quantity, location

Each service has its own Product model. The order-service's OrderItem snapshots the product price at order time; it does NOT depend on the live product price in product-service.

Why separate: If order-service called product-service for every order lookup, it would create tight coupling, a runtime network dependency, and a path for cascading failures.

Anti-pattern: A single shared Product class imported by all services. This creates a monolith disguised as microservices.

Q12 — What is CQRS and when is it used? `senior`

Answer: CQRS (Command Query Responsibility Segregation) separates write (command) and read (query) operations into different models and potentially different data stores.

Structure:

Command side: Accepts writes (create order, update payment). Optimized for consistency, business rules, transactional writes. Uses the main relational DB.
Query side: Optimized for complex reads. Uses a denormalized read model (Redis, Elasticsearch, read replica, materialized view).

// Command (write)
public class CreateOrderCommand {
    UUID productId;
    int quantity;
    UUID customerId;
}

// Query model (read, denormalized)
public record OrderSummaryView(
    UUID orderId, String customerEmail, String productName,
    BigDecimal total, OrderStatus status, Instant createdAt) {}

Event Sourcing + CQRS: Commands create events stored in an event log. Read models are projections built by replaying events.

When to use CQRS: High read/write ratio disparity, complex query requirements, multiple read models needed, event sourcing. Overkill for simple CRUD services.

This system: order-service uses a simple JPA model. CQRS would be valuable if complex order dashboards or reporting are added.
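
As a rough illustration of the query side, the sketch below keeps a denormalized read model up to date from events emitted by the command side. All names (`OrderProjectionSketch`, `OrderCreated`, `OrderStatusChanged`) are hypothetical, the view is a trimmed variant of the `OrderSummaryView` shown earlier, and a map stands in for Redis or Elasticsearch.

```java
import java.math.BigDecimal;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** CQRS sketch: the write side emits events, a projection maintains the read model. */
class OrderProjectionSketch {
    record OrderCreated(String orderId, String customerEmail, BigDecimal total) {}
    record OrderStatusChanged(String orderId, String status) {}
    record OrderSummaryView(String orderId, String customerEmail, BigDecimal total, String status) {}

    // Stand-in for the denormalized read store
    private final Map<String, OrderSummaryView> readModel = new ConcurrentHashMap<>();

    void on(OrderCreated e) {
        readModel.put(e.orderId(),
            new OrderSummaryView(e.orderId(), e.customerEmail(), e.total(), "PENDING"));
    }

    void on(OrderStatusChanged e) {
        // Views are immutable records, so an update replaces the whole view
        readModel.computeIfPresent(e.orderId(), (id, view) ->
            new OrderSummaryView(id, view.customerEmail(), view.total(), e.status()));
    }

    /** Query side: a plain lookup, no joins and no business rules. */
    Optional<OrderSummaryView> findById(String orderId) {
        return Optional.ofNullable(readModel.get(orderId));
    }
}
```

Because the projection is fed by events, the read model lags the write model slightly; queries against it are eventually consistent.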

Q13 — What is the Strangler Fig Pattern for migrating a monolith to microservices? `senior`

Answer: The Strangler Fig Pattern gradually extracts functionality from a monolith into microservices, without a risky big-bang rewrite.

Strategy:

Place a gateway/proxy in front of the monolith.
Extract one bounded context (e.g., product catalog) into a new microservice.
Route /products/** requests to the new microservice; all other traffic still goes to the monolith.
Migrate data from the monolith DB to the new service's DB.
Repeat for each bounded context until the monolith is replaced.

Phase 1: All traffic → Monolith
Phase 2: /products → ProductService; rest → Monolith
Phase 3: /products, /orders → Services; rest → Monolith (shrinking)
Phase N: Monolith retired

Challenges:

Data sharing: Extract data without breaking the monolith's DB queries.
Transactions: The monolith and new service share data during transition.
Rollback: Keep the old code path available as a fallback.
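
The routing step at the heart of the migration can be sketched as a prefix table that grows phase by phase. `StranglerRouterSketch` and the `service:` target naming are hypothetical; in practice this logic would live in the gateway's route configuration.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Strangler Fig routing sketch: migrated path prefixes go to new services. */
class StranglerRouterSketch {
    private final Set<String> migratedPrefixes = new LinkedHashSet<>();

    /** Phase change: start sending this prefix to its new service. */
    void migrate(String prefix) {
        migratedPrefixes.add(prefix);
    }

    /** Fallback: send the prefix back to the monolith's old code path. */
    void rollback(String prefix) {
        migratedPrefixes.remove(prefix);
    }

    String route(String path) {
        for (String prefix : migratedPrefixes) {
            // Match whole path segments so that /products does not capture /productsearch
            if (path.equals(prefix) || path.startsWith(prefix + "/")) {
                return "service:" + prefix.substring(1);
            }
        }
        return "monolith";
    }
}
```

Keeping `rollback` as a one-line routing change is what makes each extraction step low-risk: the old code path stays deployed until the new service has proven itself.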

Q14 — How do you handle distributed tracing across microservices? `junior`

Answer: Distributed tracing tracks a single request as it flows through multiple services, allowing you to see the full execution path and latency breakdown.

Core concepts:

Trace: The complete journey of a request across services.
Span: A single operation within a service (DB query, HTTP call).
TraceId: Unique ID shared across all spans in a trace.
SpanId: Unique ID for each individual span.

This system (Micrometer Tracing + Zipkin):

management:
  tracing:
    sampling:
      probability: 1.0    # trace 100% of requests (use 0.1 in production)
  zipkin:
    tracing:
      endpoint: http://zipkin:9411/api/v2/spans

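Spring Boot 3.x (Micrometer Tracing) auto-configures trace context propagation. The default format is W3C Trace Context, so when the gateway calls order-service it adds a traceparent header automatically. To propagate B3 headers (X-B3-TraceId, X-B3-SpanId, X-B3-ParentSpanId) instead, set management.tracing.propagation.type=b3.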

Zipkin UI: Shows waterfall view of spans — identify which service added latency. http://localhost:9411
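
Independent of the header format used on the wire, the relationship between trace and span IDs can be sketched as follows. `TraceContextSketch` is a hypothetical illustration of what a tracing library does internally; the ID lengths (16-byte trace ID, 8-byte span ID, hex-encoded) are an assumption chosen to match common tracer defaults.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Trace context sketch: a child span keeps the traceId and points at its parent. */
class TraceContextSketch {
    record SpanContext(String traceId, String spanId, String parentSpanId) {}

    private static String randomHex(int bytes) {
        StringBuilder sb = new StringBuilder(bytes * 2);
        for (int i = 0; i < bytes; i++) {
            sb.append(String.format("%02x", ThreadLocalRandom.current().nextInt(256)));
        }
        return sb.toString();
    }

    /** Called at the edge (for example the gateway) when a request arrives without trace context. */
    static SpanContext newTrace() {
        return new SpanContext(randomHex(16), randomHex(8), null);
    }

    /** Called for each downstream operation: same trace, new span, link to the parent. */
    static SpanContext childOf(SpanContext parent) {
        return new SpanContext(parent.traceId(), randomHex(8), parent.spanId());
    }
}
```

The parent links are what let a UI such as Zipkin rebuild the waterfall: spans sharing a `traceId` are grouped, then nested by `parentSpanId`.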

Q15 — What is health check and readiness/liveness probes in microservices? `junior`

Answer: Liveness probe: Is the service running? If it fails, the container is restarted. Readiness probe: Is the service ready to accept traffic? If it fails, the service is removed from the load balancer but not restarted.

# Kubernetes probes
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5

Spring Boot Actuator:

management:
  health:
    livenessstate.enabled: true
    readinessstate.enabled: true
  endpoint.health.probes.enabled: true

Custom health indicator (Redis connectivity):

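@Component
public class RedisHealthIndicator implements HealthIndicator {
    private final StringRedisTemplate redisTemplate;

    public RedisHealthIndicator(StringRedisTemplate redisTemplate) {
        this.redisTemplate = redisTemplate;
    }

    @Override
    public Health health() {
        try {
            // Any successful round trip proves connectivity; the key does not need to exist
            redisTemplate.opsForValue().get("health-check");
            return Health.up().withDetail("redis", "connected").build();
        } catch (Exception e) {
            return Health.down().withDetail("redis", e.getMessage()).build();
        }
    }
}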

Q16 — What is the Sidecar Pattern in microservices? `senior`

Answer: The Sidecar Pattern deploys a helper process alongside the main service container, sharing the same lifecycle, network, and resources. The sidecar handles cross-cutting concerns without modifying the main application.

Common sidecars:

Sidecar	Function
Envoy / Istio proxy	mTLS, load balancing, circuit breaking, retries
Log shipper (Fluentd/Filebeat)	Collect and forward logs to centralized logging
Metrics agent (Prometheus exporter)	Expose application metrics
Config reloader	Hot-reload config without restart
Secret manager	Inject secrets from Vault/AWS SM

In Kubernetes (Service Mesh):

containers:
  - name: order-service
    image: order-service:latest
  - name: envoy-proxy         # sidecar
    image: envoy:latest
    # Intercepts all network traffic, handles mTLS, retries, etc.

Benefit: order-service needs little or no code for transport security, retries, or traffic metrics; the sidecar handles them. Mesh-level policy (for example enforcing mTLS) can be changed without rebuilding the application image.

Q17 — What is an anti-corruption layer (ACL) and when do you need one? `senior`

Answer: An Anti-Corruption Layer (ACL) is a translation layer between two bounded contexts with different domain models. It prevents the concepts of one context from polluting another.

Scenario: order-service needs product information from product-service, but their models differ.

// product-service model
record CatalogProduct(UUID productId, String productName, Money listPrice,
                       String categoryCode, List<String> imageUrls) {}

// order-service needs
record OrderLineProduct(UUID id, String name, BigDecimal price) {}

// Anti-corruption layer: translates between contexts
@Service
public class ProductAcl {
    private final ProductServiceClient client;

    public OrderLineProduct getForOrder(UUID productId) {
        CatalogProduct catalog = client.getProduct(productId);
        return new OrderLineProduct(
            catalog.productId(),
            catalog.productName(),
            catalog.listPrice().amount()  // extract scalar from Money value object
        );
    }
}

When needed:

Integrating with legacy systems with a different model.
Calling an external API whose model differs from yours.
Preventing domain model pollution from a poorly designed upstream service.

Q18 — How do you handle database migrations in microservices? `junior`

Answer: Each microservice owns its own database. Schema changes are managed with a migration tool (Flyway or Liquibase) that runs on service startup.

Flyway (used in this system):

src/main/resources/db/migration/
  V1__create_orders_table.sql
  V2__add_order_status_index.sql
  V3__add_outbox_events_table.sql

spring:
  flyway:
    enabled: true
    locations: classpath:db/migration
    baseline-on-migrate: false

Zero-downtime migration patterns (blue-green deployments):

Expand: Add new column as nullable (old and new service version can run together).
Migrate: Backfill the column in a background job.
Contract: Remove the old column after all instances use the new version.

Never: Rename a column in one migration — it breaks existing running instances. Always expand-migrate-contract.

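This system: Flyway runs on application startup. Keep spring.flyway.out-of-order=false (the default) so migrations apply strictly in version order. Flyway serializes concurrent runs by locking its schema history table, but with many replicas it is cleaner to run migrations once, for example from an init container or a pre-deploy job in K8s.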

Q19 — What is the Bulkhead Pattern and how does it prevent resource exhaustion? `senior`

Answer: The Bulkhead Pattern isolates resources (thread pools, connection pools) per integration point, preventing one slow downstream from exhausting all resources.

Without bulkhead: payment-service is slow → order-service spawns threads waiting for payment → thread pool exhausted → order-service cannot process any requests (including unrelated ones).

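With bulkhead: Cap the resources each downstream can consume, either with a limit on concurrent calls (semaphore bulkhead, shown first) or with a dedicated thread pool per downstream service.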

@Bulkhead(name = "payment-service", fallbackMethod = "paymentFallback")
@CircuitBreaker(name = "payment-service")
public PaymentResult processPayment(PaymentRequest req) {
    return paymentClient.process(req);
}

resilience4j.bulkhead.instances.payment-service:
  maxConcurrentCalls: 20        # max 20 concurrent payment calls
  maxWaitDuration: 500ms        # wait 500ms for a slot before rejecting

Thread pool bulkhead (Resilience4j @ThreadPoolBulkhead):

resilience4j.thread-pool-bulkhead.instances.payment-service:
  maxThreadPoolSize: 10
  coreThreadPoolSize: 5
  queueCapacity: 5

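Note with virtual threads: Virtual threads largely remove platform-thread exhaustion as a failure mode, since each blocking call can run on its own cheap virtual thread. Bulkheads are still needed to avoid overloading the downstream service and to protect finite resources such as connection pools.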
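
The semaphore variant is small enough to sketch directly. This is an illustration of the idea, not Resilience4j's code: `BulkheadSketch` is a hypothetical class whose two constructor arguments mirror `maxConcurrentCalls` and `maxWaitDuration` from the configuration above.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Semaphore bulkhead sketch: caps concurrent calls to one downstream. */
class BulkheadSketch {
    private final Semaphore permits;
    private final long maxWaitMillis;

    BulkheadSketch(int maxConcurrentCalls, long maxWaitMillis) {
        this.permits = new Semaphore(maxConcurrentCalls);
        this.maxWaitMillis = maxWaitMillis;
    }

    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        boolean acquired;
        try {
            acquired = permits.tryAcquire(maxWaitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback.get();
        }
        if (!acquired) {
            return fallback.get();   // bulkhead full: reject instead of waiting indefinitely
        }
        try {
            return action.get();
        } finally {
            permits.release();       // always return the slot, even if the call throws
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

Rejecting quickly is the point of the pattern: callers of a saturated downstream get an immediate fallback, and threads remain free for requests that do not touch that downstream.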

Q20 — How do you implement distributed locking in microservices? `senior`

Answer: Distributed locks coordinate access to shared resources across multiple service instances.

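Redis-based lock (single-instance SET NX with expiry; Redlock extends the same idea to a majority of independent Redis nodes):

String lockKey = "lock:order:" + orderId;
String lockValue = UUID.randomUUID().toString();  // unique token identifying this holder

// Acquire lock (SET NX EX)
Boolean acquired = redisTemplate.opsForValue()
    .setIfAbsent(lockKey, lockValue, Duration.ofSeconds(10));

if (!Boolean.TRUE.equals(acquired)) {
    throw new ConcurrencyException("Order " + orderId + " is already being processed");
}

try {
    // Critical section: must finish within the 10s TTL, or another instance can acquire the lock
    processOrderPayment(orderId);
} finally {
    // Release only if we still hold the lock (compare-and-delete via the Lua script shown next).
    // A plain DEL could remove a lock that expired and was re-acquired by another instance.
    redisTemplate.execute(new DefaultRedisScript<>(luaScript, Long.class),
        List.of(lockKey), lockValue);
}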

Atomic release with Lua (prevents releasing another instance's lock):

String luaScript = """
    if redis.call("get", KEYS[1]) == ARGV[1] then
        return redis.call("del", KEYS[1])
    else
        return 0
    end
    """;
redisTemplate.execute(new DefaultRedisScript<>(luaScript, Long.class),
    List.of("lock:order:" + orderId), lockValue);

Database-based lock:

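// PostgreSQL advisory lock. pg_advisory_lock is session-scoped, so with a pooled
// JdbcTemplate the unlock could run on a different connection than the lock.
// The transaction-scoped variant is released automatically at commit or rollback.
@Transactional
public void processWithDbLock(UUID orderId) {   // illustrative method name
    long lockKey = orderId.hashCode();  // hash collisions make unrelated orders share a lock
    jdbcTemplate.execute("SELECT pg_advisory_xact_lock(" + lockKey + ")");
    // ... critical section
}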
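
Why the owner token matters is easiest to see with expiry in the picture. The sketch below models SET NX with a TTL plus compare-and-delete in memory; `OwnedLockSketch` is hypothetical, a map stands in for Redis, and the clock is injected so expiry can be simulated.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;
import java.util.function.LongSupplier;

/** In-memory model of "SET NX with TTL" plus compare-and-delete release. */
class OwnedLockSketch {
    private record Entry(String token, long expiresAtMillis) {}

    private final Map<String, Entry> store = new HashMap<>();
    private final LongSupplier clockMillis;

    OwnedLockSketch(LongSupplier clockMillis) {
        this.clockMillis = clockMillis;
    }

    /** Returns an owner token on success, or null if an unexpired lock exists. */
    synchronized String tryLock(String key, long ttlMillis) {
        long now = clockMillis.getAsLong();
        Entry current = store.get(key);
        if (current != null && current.expiresAtMillis() > now) {
            return null;
        }
        String token = UUID.randomUUID().toString();
        store.put(key, new Entry(token, now + ttlMillis));
        return token;
    }

    /** Deletes the key only if it still holds the caller's token. */
    synchronized boolean unlock(String key, String token) {
        Entry current = store.get(key);
        if (current == null || !current.token().equals(token)) {
            return false;   // lock expired and now belongs to someone else (or nobody)
        }
        store.remove(key);
        return true;
    }
}
```

If instance A's lock expires while it is still working and instance B acquires it, A's late `unlock` returns false and leaves B's lock in place. An unconditional delete would have released B's lock.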

Q21 — What is consumer-driven contract testing? `senior`

Answer: Consumer-Driven Contract (CDC) testing verifies that a service's API matches what its consumers expect, without requiring both services to run simultaneously.

How it works:

Consumer writes a contract (pact file) defining what it expects from the provider API.
Provider runs the contract in its tests to verify it satisfies all consumer expectations.
Contracts are stored in a shared repository (Pact Broker or Maven repo).

Spring Cloud Contract (used in this system via common module):

// contract defined by order-service (consumer)
Contract.make {
    request {
        method 'GET'
        url '/products/123e4567-e89b-12d3-a456-426614174000'
    }
    response {
        status 200
        body([id: $(anyUuid()), name: $(anyNonBlankString()), price: $(anyPositiveInt())])
        headers { contentType applicationJson() }
    }
}

Benefits:

Tests service integration without running all services.
Contract violations are caught in CI before deployment.
Documentation as code — contracts serve as API specs.

This system: common module contains stubs generated from contracts, used in integration tests with WireMock.

Q22 — What is graceful shutdown in microservices and how is it implemented? `junior`

Answer: Graceful shutdown ensures a service completes in-flight requests before stopping, preventing data loss or client errors.

Spring Boot graceful shutdown:

server:
  shutdown: graceful
spring:
  lifecycle:
    timeout-per-shutdown-phase: 30s  # wait up to 30s for in-flight requests

How it works:

Kubernetes (or Docker) sends SIGTERM to the process.
Spring stops accepting new requests (readiness probe fails → removed from LB).
Spring waits for in-flight requests to complete (up to 30s).
Spring closes DB connections, Kafka producers (flushes in-flight messages).
Process exits with code 0.

Kafka producer flush (critical for outbox poller):

@PreDestroy
public void shutdown() {
    log.info("Shutting down outbox poller...");
    running = false;
    kafkaTemplate.flush();  // ensure all buffered messages are sent
}

This system: All services configure graceful shutdown. The outbox poller has @PreDestroy to stop polling and flush Kafka before JVM exits.
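
The same stop-accepting, drain, then force-stop sequence can be sketched with a plain `ExecutorService`. `GracefulShutdownSketch` is a hypothetical stand-in for a service's worker pool; Spring's own shutdown handling is more involved.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Graceful shutdown sketch: stop intake, drain in-flight work, then force-stop. */
class GracefulShutdownSketch {
    private final ExecutorService workers = Executors.newFixedThreadPool(2);
    final AtomicInteger completed = new AtomicInteger();

    /** Simulates an in-flight request that takes workMillis to finish. */
    void submit(long workMillis) {
        workers.submit(() -> {
            try {
                Thread.sleep(workMillis);
                completed.incrementAndGet();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
    }

    /** Returns true if all in-flight work finished within the timeout. */
    boolean shutdown(long timeoutMillis) {
        workers.shutdown();                  // stop accepting new tasks
        try {
            if (workers.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS)) {
                return true;                 // everything drained in time
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        workers.shutdownNow();               // timeout: interrupt the stragglers
        return false;
    }
}
```

The timeout plays the role of `timeout-per-shutdown-phase`: it bounds how long the process waits before giving up on in-flight work.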

Q23 — How does the Rate Limiting pattern work in microservices? `junior`

Answer: Rate limiting controls the number of requests a client can make in a time window, protecting services from overload and abuse.

Algorithms:

Algorithm	Description	Burst handling
Token bucket	Bucket with N tokens; each request consumes one; bucket refills at rate R	Allows burst up to bucket size
Fixed window	Count requests per fixed time window (e.g., 100/min)	Burst at window boundary
Sliding window log	Track timestamps of recent requests	Smooth, but memory-heavy
Leaky bucket	Requests queue and are processed at fixed rate	No burst — smooths traffic

This system (Token Bucket):

redis-rate-limiter.replenishRate=10 → 10 tokens added per second.
redis-rate-limiter.burstCapacity=20 → bucket size: a user can make at most 20 requests in a single second before being throttled.
Rate limit key = X-User-Id → per-user limit.
Stored in Redis → shared across all gateway instances.

HTTP 429 Too Many Requests with Retry-After header tells clients how long to wait.
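
A token bucket fits in a few lines. The sketch below is a single-process illustration, not the gateway's Redis-backed limiter: `TokenBucketSketch` is hypothetical, its two parameters correspond to `burstCapacity` and `replenishRate`, and the clock is injected so refill can be tested without sleeping.

```java
import java.util.function.LongSupplier;

/** Token bucket sketch: each request spends a token, tokens refill at a steady rate. */
class TokenBucketSketch {
    private final double capacity;          // corresponds to burstCapacity
    private final double refillPerSecond;   // corresponds to replenishRate
    private final LongSupplier clockMillis;
    private double tokens;
    private long lastRefillMillis;

    TokenBucketSketch(double capacity, double refillPerSecond, LongSupplier clockMillis) {
        this.capacity = capacity;
        this.refillPerSecond = refillPerSecond;
        this.clockMillis = clockMillis;
        this.tokens = capacity;             // start full, so an initial burst is allowed
        this.lastRefillMillis = clockMillis.getAsLong();
    }

    /** Returns true if the request is allowed; false means the caller should answer HTTP 429. */
    synchronized boolean tryConsume() {
        long now = clockMillis.getAsLong();
        double refill = (now - lastRefillMillis) / 1000.0 * refillPerSecond;
        tokens = Math.min(capacity, tokens + refill);   // never exceed the bucket size
        lastRefillMillis = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

With capacity 20 and a refill rate of 10 per second, a client can burst 20 requests at once and is then held to the steady rate of 10 per second.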

Q24 — What is the Inbox/Outbox Pattern and how does it differ from just using Kafka? `senior`

Answer: The Outbox Pattern (write side) ensures events are reliably published from the producer side. The Inbox Pattern (read side) ensures events are idempotently processed on the consumer side.

Combined (full reliability):

Producer (order-service):
  DB Transaction:
    INSERT INTO orders(...)         -- business record
    INSERT INTO outbox(event_id, topic, payload, processed=false)  -- event record
  
  Poller: publish outbox events → mark processed

Consumer (payment-service):
  Receive Kafka message
  DB Transaction:
    IF NOT EXISTS (SELECT 1 FROM inbox WHERE event_id = ?)
      INSERT INTO inbox(event_id, processed_at)  -- idempotency record
      INSERT INTO payments(...)                   -- business record

Without Outbox: Producer can fail between DB commit and Kafka publish → event lost.
Without Inbox: Consumer can process the same event twice → duplicate payment.
With both: Effectively-once processing of the business operation, built on at-least-once Kafka delivery.

This system: Outbox pattern is implemented in order-service. Inbox pattern (explicit idempotency table) would be implemented in payment-service for production robustness.

Q25 — How do you handle versioning of Kafka event schemas? `senior`

Answer: Event schema versioning is critical for evolving events without breaking consumers.

Avro + Schema Registry (this system):

Avro schemas are registered in Confluent Schema Registry.
Producers and consumers exchange schema IDs, not the full schema on every message.
Schema Registry enforces compatibility rules.

Compatibility types:

Type	Meaning	Allows
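`BACKWARD`	Consumers on the new schema can read data written with the previous schema	Delete fields; add optional fields (with defaults)
`FORWARD`	Consumers on the previous schema can read data written with the new schema	Add fields; delete optional fields
`FULL`	Both BACKWARD and FORWARD	Add or delete optional fields only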
`NONE`	No compatibility check	Any change

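Schema evolution rules (BACKWARD compat):

✅ Add new field with default: {"name": "discount", "type": "double", "default": 0.0}
✅ Remove a field: readers on the new schema ignore it in old data (this breaks FORWARD and FULL unless the field had a default)
❌ Add a new field without a default
❌ Rename a field (without an Avro alias)
❌ Change field type incompatibly (int → string)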

Alternative (versioned topics): Create new topic order.created.v2 for breaking changes. Run consumers for both v1 and v2 during migration.

Q26 — What is a service mesh and how does it differ from API gateway? `senior`

Answer:

	API Gateway	Service Mesh
Location	Edge (north-south traffic)	Internal (east-west traffic)
Manages	External client → services	Service → service
Features	Routing, auth, rate limiting, SSL termination	mTLS, observability, retries, circuit breaking
Implementation	Application layer (Spring Cloud Gateway, NGINX)	Infrastructure layer (Istio, Linkerd sidecar proxies)
Requires code changes	Some	None (transparent proxy)

Service mesh sidecars intercept all network traffic between services:

order-service → [Envoy sidecar] → [network] → [Envoy sidecar] → payment-service

Envoy handles retries, circuit breakers, mTLS, and tracing — without any code changes in order-service or payment-service.

This system: No service mesh currently. All cross-cutting concerns are in application code (Resilience4j, Micrometer). A service mesh (Istio) would replace much of this code in a Kubernetes deployment.

Q27 — How do you implement feature flags in microservices? `senior`

Answer: Feature flags (feature toggles) allow enabling/disabling features without deploying new code, enabling canary releases, A/B testing, and safe rollbacks.

Implementation options:

1. Configuration-based (simple):

features:
  new-payment-flow: true
  loyalty-points: false

@Value("${features.new-payment-flow}")
private boolean newPaymentFlowEnabled;

if (newPaymentFlowEnabled) {
    return newPaymentService.process(request);
}
return legacyPaymentService.process(request);

2. Redis-based (dynamic, no restart):

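// Assumes a StringRedisTemplate: values are stored as "true"/"false"; a missing key parses to false
boolean enabled = Boolean.parseBoolean(
    redisTemplate.opsForValue().get("feature:new-payment-flow"));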

3. External flag service (LaunchDarkly, Unleash):

Supports targeting (enable for 10% of users, specific user IDs).
Audit trail of flag changes.
Real-time updates without service restart.

Canary with feature flags: Enable new checkout flow for 1% of users. Monitor error rates. Increase to 10%, 50%, 100%. Instant rollback if issues arise.
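
Percentage rollouts are usually implemented by hashing the user into a stable bucket. The sketch below shows one way to do that; `PercentageRolloutSketch` is hypothetical and CRC32 is just a convenient stable hash from the JDK, not necessarily what LaunchDarkly or Unleash use.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Percentage rollout sketch: a stable hash assigns each user to a bucket in [0, 100). */
class PercentageRolloutSketch {
    /** The same (flag, user) pair always maps to the same bucket. */
    static int bucket(String flagName, String userId) {
        CRC32 crc = new CRC32();
        crc.update((flagName + ":" + userId).getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % 100);
    }

    /** Enabled when the user's bucket falls below the rollout percentage. */
    static boolean isEnabled(String flagName, String userId, int rolloutPercent) {
        return bucket(flagName, userId) < rolloutPercent;
    }
}
```

Because buckets are stable, raising the percentage from 1 to 10 to 50 only ever adds users: nobody who already has the feature loses it mid-rollout. Including the flag name in the hash keeps different flags from always picking the same users.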

Q28 — What is the Retry Pattern and when should you not retry? `junior`

Answer: The Retry Pattern automatically re-executes a failed operation with the assumption that transient failures (network blip, momentary service overload) will resolve.

Spring Retry:

@Retryable(
    retryFor = {HttpServerErrorException.class, ResourceAccessException.class},
    maxAttempts = 3,
    backoff = @Backoff(delay = 500, multiplier = 2, maxDelay = 5000)
)
public ProductDto getProduct(UUID id) {
    return productClient.get(id);
}

Exponential backoff with jitter (avoid retry storms):

Before retry 1: wait 500ms
Before retry 2: wait 1000ms ± random(0-200ms)
Before retry 3: wait 2000ms ± random(0-400ms) (only reached when maxAttempts is 4 or more)

Do NOT retry:

Non-idempotent operations: POST /payments without an idempotency key; a retry can create a duplicate payment.
Business errors (400 Bad Request): The request is invalid; retrying will always fail.
Authentication errors (401/403): Retrying without fixing credentials is useless.
Client errors: Retry only server errors (500, 502, 503, 504) and timeouts.
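The rules above can be collapsed into one decision function. This is a minimal sketch under stated assumptions: the class `RetryDecision` and the `hasIdempotencyKey` parameter are hypothetical, and treating a POST as retryable when it carries an idempotency key is an extension of the list, not something the list states.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper: decides whether a failed HTTP call is worth retrying. */
final class RetryDecision {
    // Methods defined as idempotent by HTTP semantics; POST and PATCH are excluded.
    private static final Set<String> IDEMPOTENT_METHODS =
            Set.of("GET", "HEAD", "PUT", "DELETE", "OPTIONS");
    // Server-side statuses that are commonly transient.
    private static final Set<Integer> TRANSIENT_STATUSES = Set.of(500, 502, 503, 504);

    private RetryDecision() {}

    /** Retry only when repeating the call is safe AND the failure looks transient. */
    static boolean shouldRetry(String httpMethod, int status, boolean hasIdempotencyKey) {
        return isSafeToRepeat(httpMethod, hasIdempotencyKey)
                && TRANSIENT_STATUSES.contains(status);
    }

    /** Timeouts carry no status code, so only the safety check applies. */
    static boolean shouldRetryOnTimeout(String httpMethod, boolean hasIdempotencyKey) {
        return isSafeToRepeat(httpMethod, hasIdempotencyKey);
    }

    private static boolean isSafeToRepeat(String httpMethod, boolean hasIdempotencyKey) {
        return hasIdempotencyKey
                || IDEMPOTENT_METHODS.contains(httpMethod.toUpperCase(Locale.ROOT));
    }
}
```

Under these rules `shouldRetry("GET", 503, false)` is true, while `shouldRetry("POST", 503, false)` and `shouldRetry("GET", 400, false)` are both false: the first because repeating the call is unsafe, the second because the failure is not transient.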

This system: @RetryableTopic in Kafka consumers retries failed message processing. HTTP calls between services go through a circuit breaker, and only idempotent operations are retried.

Q29 — How do you design for observability in microservices? `senior`

Answer: Observability is the ability to understand a system's internal state from its external outputs. It rests on three pillars:

1. Metrics (Micrometer + Prometheus + Grafana):

// Custom business metric
Counter orderCounter = Counter.builder("orders.created")
    .tag("status", "success")
    .register(meterRegistry);
orderCounter.increment();

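// Latency timer; publishPercentileHistogram() exports buckets for P99 queries
Timer orderTimer = Timer.builder("order.processing.time")
    .publishPercentileHistogram()
    .register(meterRegistry);
Timer.Sample sample = Timer.start(meterRegistry);
try {
    processOrder(request);
} finally {
    sample.stop(orderTimer); // record the duration even if processing throws
}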

2. Logs (structured JSON, correlated by traceId):

{"timestamp":"2026-05-31T14:30:00Z","level":"INFO","service":"order-service",
 "traceId":"abc123","spanId":"def456","orderId":"uuid","message":"Order created"}

3. Traces (Zipkin/Jaeger — this system uses Zipkin):

Waterfall view of spans across services.
Identify P99 latency bottlenecks.
Trace IDs link logs and traces.

Alerting:

Error rate > 1% for 5 minutes.
P99 latency > 2 seconds.
DLT has messages (Kafka consumer failures).
Redis connection failures.
JVM heap > 80%.
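A rule such as "error rate > 1% for 5 minutes" is usually read as: the condition must hold continuously for the whole window before the alert fires. That reading is an assumption here, borrowed from how hold durations work in common alerting tools. A minimal sketch of the logic follows; the class name `SustainedThresholdAlert` is hypothetical, and in practice this evaluation would live in the alerting system (for example Prometheus or Grafana), not in service code.

```java
/** Hypothetical helper: fires once a value has stayed above a threshold long enough. */
final class SustainedThresholdAlert {
    private final double threshold;
    private final long holdSeconds;
    private long breachStartedAt = -1; // -1 means the value is currently not in breach

    SustainedThresholdAlert(double threshold, long holdSeconds) {
        this.threshold = threshold;
        this.holdSeconds = holdSeconds;
    }

    /**
     * Feeds one sample taken at the given time (in seconds).
     * Returns true when the value has been above the threshold
     * continuously for at least holdSeconds.
     */
    boolean observe(long timestampSeconds, double value) {
        if (value <= threshold) {
            // Any healthy sample resets the window.
            breachStartedAt = -1;
            return false;
        }
        if (breachStartedAt < 0) {
            breachStartedAt = timestampSeconds;
        }
        return timestampSeconds - breachStartedAt >= holdSeconds;
    }
}
```

With `new SustainedThresholdAlert(0.01, 300)`, a two-minute spike in error rate never fires, while five straight minutes above 1% does. This keeps short blips from paging anyone.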

Q30 — What are anti-patterns in microservices to avoid? `senior`

Answer: Common microservices anti-patterns and their solutions:

1. Distributed Monolith: Services are deployed separately but are tightly coupled (shared DB, synchronous chains). Every change requires coordinating multiple services. Fix: Define proper bounded contexts, use events for loose coupling.

2. Chatty services: Service A makes 10 synchronous calls to Service B per request. Fix: Batch APIs, events, or API composition at the gateway.

3. Shared database: Multiple services read/write the same DB table. Fix: Each service owns its data; others request via API or events.

4. Too many services: Thin CRUD microservices (for example, a UserMicroservice with 3 endpoints) whose operational overhead exceeds their benefit. Fix: Keep services at bounded-context granularity, not function granularity.

5. Missing idempotency: Retries cause duplicate orders/payments. Fix: Idempotency keys in every state-changing operation.

6. Synchronous event notification: Using blocking REST calls for workflows that are naturally asynchronous. Fix: Use Kafka events for the order→payment→notification flow.

7. Ignoring eventual consistency: The UI shows stale data and gives the user no indication that an update is still in progress. Fix: Optimistic UI updates, polling, or SSE (Server-Sent Events).

8. Logging without correlation: Impossible to trace a request across services. Fix: Correlation IDs, structured logging, Zipkin.
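The fix for item 5 (idempotency keys) can be sketched as a result cache keyed by the idempotency key: the first call runs the operation, and any retry with the same key gets the stored result back. This is a minimal in-memory sketch; the class name `IdempotentExecutor` is hypothetical, and a real multi-instance service would need a shared store (for example a database table with a unique constraint on the key) in place of the in-process map.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Hypothetical helper: runs an operation at most once per idempotency key. */
final class IdempotentExecutor<R> {
    // In-memory only: results are lost on restart and not shared across instances.
    private final Map<String, R> resultsByKey = new ConcurrentHashMap<>();

    /**
     * Executes the operation the first time a key is seen and returns the
     * stored result for every later call with the same key.
     */
    R execute(String idempotencyKey, Supplier<R> operation) {
        // computeIfAbsent is atomic per key, so concurrent retries with the
        // same key trigger the operation only once. If the operation throws,
        // nothing is stored and a later retry can run it again.
        return resultsByKey.computeIfAbsent(idempotencyKey, key -> operation.get());
    }
}
```

A payment handler could wrap its charge call as `executor.execute(request.idempotencyKey(), () -> charge(request))`, where `request` and `charge` are placeholders. A client retry after a timeout then receives the original payment result instead of creating a second charge.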
