Microservices Architecture — Interview Questions
Stack context: This system is a six-service ecommerce platform (api-gateway, user-service, product-service, order-service, payment-service, notification-service) using event-driven architecture with Apache Kafka. Key patterns implemented: transactional outbox, saga (choreography), DLT (dead-letter topic), API gateway, session-based auth, circuit breaker, rate limiting, and distributed tracing.
Q1 — What is microservices architecture and what problems does it solve? junior
Answer: Microservices is an architectural style where an application is decomposed into small, independently deployable services, each focused on a specific business capability.
Problems it solves:
- Scaling: Scale individual services under load (scale payment-service independently of notification-service).
- Independent deployment: Deploy order-service without redeploying payment-service.
- Technology flexibility: Use the right tool per service (Redis for sessions, PostgreSQL for orders).
- Team autonomy: Different teams own different services with clear API contracts.
- Fault isolation: A failure in notification-service doesn't crash order-service.
Trade-offs introduced:
- Distributed systems complexity (network failures, partial failures).
- Data consistency challenges (no distributed ACID transactions).
- Operational overhead (6 services + 4 DBs + 3 infra components).
- Observability complexity (distributed tracing, log aggregation).
Q2 — What is the Transactional Outbox Pattern and why is it needed? junior
Answer: The Transactional Outbox Pattern solves the "dual write" problem: how to atomically update a database AND publish an event to Kafka.
The problem: A network failure after DB commit but before Kafka publish leaves the system inconsistent — order saved but no event sent, payment never processed.
The solution:
- Write the business entity (order) AND an outbox event record to the same database in one transaction.
- A separate poller reads unprocessed outbox events and publishes them to Kafka.
- On successful Kafka publish, mark the outbox event as processed.
DB Transaction:
  INSERT INTO orders (id, ...) VALUES (...)
  INSERT INTO outbox_events (id, event_type, topic, payload, status) VALUES (...)
  -- Committed atomically. If Kafka is down, the outbox has the event.
OutboxPoller (scheduled):
  SELECT * FROM outbox_events WHERE status = 'PENDING'
  FOR EACH event:
    kafkaTemplate.send(event.topic, event.payload)  -- wait for the broker ack before marking
    UPDATE outbox_events SET status = 'PUBLISHED' WHERE id = event.id
This system: order-service implements the outbox pattern. The OrderCreatedEvent is written to the outbox table within the same transaction that creates the Order entity, ensuring at-least-once delivery to Kafka.
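The poller loop above can be sketched in plain Java. This is a minimal in-memory illustration, not the system's actual code: the `Outbox` and `OutboxEvent` classes are hypothetical stand-ins for the outbox table, and the `Predicate` stands in for a Kafka send that reports whether the broker acknowledged the message.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class OutboxDemo {
    public static void main(String[] args) {
        Outbox outbox = new Outbox();
        outbox.add("evt-1", "order.created", "{\"orderId\":1}");

        List<String> delivered = new ArrayList<>();
        // First poll: the broker is "down", so the event stays PENDING.
        System.out.println("sent while broker down: " + outbox.poll(e -> false));
        // Second poll: the broker is back, so the event is published and marked.
        System.out.println("sent after recovery: " + outbox.poll(e -> delivered.add(e.payload)));
        System.out.println("still pending: " + outbox.pendingCount());
    }
}

// Hypothetical row of the outbox table.
class OutboxEvent {
    final String id;
    final String topic;
    final String payload;
    String status = "PENDING";

    OutboxEvent(String id, String topic, String payload) {
        this.id = id;
        this.topic = topic;
        this.payload = payload;
    }
}

// Hypothetical in-memory stand-in for the outbox table plus its poller.
class Outbox {
    private final List<OutboxEvent> rows = new ArrayList<>();

    void add(String id, String topic, String payload) {
        rows.add(new OutboxEvent(id, topic, payload));
    }

    /** Publishes every PENDING row; a row is marked PUBLISHED only after the publisher reports success. */
    int poll(Predicate<OutboxEvent> publisher) {
        int published = 0;
        for (OutboxEvent event : rows) {
            if (!event.status.equals("PENDING")) {
                continue;
            }
            if (publisher.test(event)) {
                event.status = "PUBLISHED";
                published++;
            }
        }
        return published;
    }

    long pendingCount() {
        return rows.stream().filter(e -> e.status.equals("PENDING")).count();
    }
}
```

Because a row is marked only after a successful publish, a crash between the publish and the update causes the event to be sent again on the next poll. That is where the at-least-once guarantee comes from, and why consumers need to be idempotent.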
Q3 — What is the Saga Pattern and when is it used? junior
Answer: The Saga Pattern manages long-running business transactions that span multiple services without a distributed transaction (2PC). Each service executes a local transaction and publishes an event. If a step fails, compensating transactions undo previous steps.
Two implementations:
Choreography (event-driven, this system):
- Each service listens for events and triggers the next step.
- No central coordinator.
- Simple but harder to track the saga flow.
order-service: ORDER_CREATED event →
payment-service: processes payment → PAYMENT_PROCESSED event →
order-service: updates order to CONFIRMED →
notification-service: sends confirmation email
Orchestration (centralized):
- A Saga Orchestrator calls each service step by step.
- Clear visibility of saga state.
- Central point of failure.
Compensation: If payment fails:
payment-service: PAYMENT_FAILED event →
order-service: updates order to FAILED (compensation) →
product-service: releases reserved stock (compensation) →
notification-service: sends failure notification
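The happy path and the compensation path can be traced with a small in-memory sketch. It assumes a synchronous `EventBus` class (hypothetical) in place of Kafka, and it simplifies the flow so that order-service and notification-service both react directly to the payment event; the order ID `o-1` is made up.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

class SagaChoreographyDemo {
    public static void main(String[] args) {
        System.out.println(SagaChoreography.run(true));
        System.out.println(SagaChoreography.run(false));
    }
}

// Hypothetical synchronous stand-in for Kafka topics.
class EventBus {
    private final Map<String, List<Consumer<String>>> handlers = new HashMap<>();

    void subscribe(String eventType, Consumer<String> handler) {
        handlers.computeIfAbsent(eventType, k -> new ArrayList<>()).add(handler);
    }

    void publish(String eventType, String orderId) {
        for (Consumer<String> handler : handlers.getOrDefault(eventType, List.of())) {
            handler.accept(orderId);
        }
    }
}

class SagaChoreography {
    /** Runs one order through the saga and returns the ordered log of steps. */
    static List<String> run(boolean paymentSucceeds) {
        EventBus bus = new EventBus();
        List<String> log = new ArrayList<>();

        // payment-service: reacts to ORDER_CREATED and emits the outcome.
        bus.subscribe("ORDER_CREATED",
                id -> bus.publish(paymentSucceeds ? "PAYMENT_PROCESSED" : "PAYMENT_FAILED", id));
        // order-service: confirms, or compensates by marking the order FAILED.
        bus.subscribe("PAYMENT_PROCESSED", id -> log.add("order " + id + " CONFIRMED"));
        bus.subscribe("PAYMENT_FAILED", id -> log.add("order " + id + " FAILED"));
        // product-service: compensates by releasing reserved stock.
        bus.subscribe("PAYMENT_FAILED", id -> log.add("stock released for " + id));
        // notification-service: informs the customer either way.
        bus.subscribe("PAYMENT_PROCESSED", id -> log.add("confirmation email for " + id));
        bus.subscribe("PAYMENT_FAILED", id -> log.add("failure notification for " + id));

        log.add("order o-1 PENDING");
        bus.publish("ORDER_CREATED", "o-1");
        return log;
    }
}
```

No component holds the whole flow: each handler knows only which event it reacts to. This is the trade-off noted above, since the saga's overall state has to be reconstructed from the event log.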
Q4 — What is a Dead Letter Topic (DLT) and how is it used? junior
Answer: A Dead Letter Topic (DLT) is a Kafka topic where messages that cannot be processed after all retry attempts are sent. It prevents poison messages from blocking normal message processing.
Flow:
order.created → consumer processes → fails
→ retry 1 (100ms wait) → fails
→ retry 2 (200ms wait) → fails
→ retry 3 (400ms wait) → fails (4th attempt total)
→ send to order.created.DLT
Spring Kafka @RetryableTopic:
@RetryableTopic(
attempts = "4",
backoff = @Backoff(delay = 100, multiplier = 2.0),
dltTopicSuffix = ".DLT"
)
@KafkaListener(topics = "order.created")
public void processOrder(OrderCreatedEvent event) {
// On 4th failure, message goes to order.created.DLT
}
DLT monitoring: Operators monitor the DLT. A non-empty DLT triggers an alert. Messages are analyzed, root cause fixed, and messages replayed manually or via an automated dead letter processor.
This system: order.created.DLT collects failed order events for manual review and replay.
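The retry-then-dead-letter flow can be reproduced without Kafka to check the arithmetic. The `RetryingConsumer` class below is a hypothetical sketch: it records the waits instead of sleeping, and a plain list stands in for the DLT.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class RetryToDltDemo {
    public static void main(String[] args) {
        List<String> dlt = new ArrayList<>();
        // Same numbers as the flow above: 4 attempts, 100 ms initial delay, multiplier 2.
        RetryingConsumer consumer = new RetryingConsumer(4, 100, 2.0, dlt);
        consumer.handle("bad-order", msg -> {
            throw new IllegalStateException("cannot process " + msg);
        });
        System.out.println("delays=" + consumer.delaysMs + " dlt=" + dlt);
        // prints: delays=[100, 200, 400] dlt=[bad-order]
    }
}

// Hypothetical helper; a real consumer would rely on Spring Kafka retry topics instead.
class RetryingConsumer {
    private final int maxAttempts;
    private final long initialDelayMs;
    private final double multiplier;
    private final List<String> deadLetters;
    /** Waits that would be applied between attempts (recorded, not slept). */
    final List<Long> delaysMs = new ArrayList<>();

    RetryingConsumer(int maxAttempts, long initialDelayMs, double multiplier, List<String> deadLetters) {
        this.maxAttempts = maxAttempts;
        this.initialDelayMs = initialDelayMs;
        this.multiplier = multiplier;
        this.deadLetters = deadLetters;
    }

    /** Returns true if the handler eventually succeeded, false if the message went to the DLT. */
    boolean handle(String message, Consumer<String> handler) {
        long delay = initialDelayMs;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                handler.accept(message);
                return true;
            } catch (RuntimeException e) {
                if (attempt == maxAttempts) {
                    break; // retries exhausted
                }
                delaysMs.add(delay);
                delay = (long) (delay * multiplier);
            }
        }
        deadLetters.add(message);
        return false;
    }
}
```

With attempts set to 4 there are three waits, not four: the attempt count includes the first delivery.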
Q5 — What is the API Gateway Pattern and what responsibilities does it have? junior
Answer: An API Gateway is a single entry point for all client requests. It handles cross-cutting concerns that would otherwise be duplicated across all services.
Responsibilities:
| Concern | Without Gateway | With Gateway |
|---|---|---|
| Authentication | Each service validates JWT | Gateway validates once, injects headers |
| Rate limiting | Each service implements it | Gateway enforces per-user limits |
| Routing | Client knows all service URLs | Client talks only to gateway |
| SSL termination | Each service needs certificate | Gateway handles SSL |
| CORS | Each service configures CORS | Gateway handles CORS |
| Request tracing | Each service adds trace ID | Gateway injects correlation ID |
This system (api-gateway on port 8080):
- Validates JWT session from Redis (or Caffeine fallback).
- Injects X-User-Id, X-User-Email, X-User-Role headers.
- Applies token-bucket rate limiting per user.
- Routes to user-service (8081), product-service (8082), order-service (8083).
Q6 — What is eventual consistency and how do you handle it in microservices? senior
Answer: Eventual consistency means that after a series of updates, all replicas/services will eventually reach the same state — but there's no guarantee of immediate consistency.
In this system: After order placement:
1. order-service creates order with status PENDING.
2. payment-service processes payment (may take 100ms-10s).
3. order-service updates order to CONFIRMED.
Between steps 1 and 3, a user reading their order status sees PENDING. This is eventual consistency.
Handling strategies:
- Read-your-writes: User reads from the same service instance that wrote (sticky sessions, Redis caching the latest state).
- Versioning/ETags: Return order version; client can detect stale data.
- Polling: Client polls until status changes.
- WebSocket/SSE: Push status updates to client in real time.
- Idempotency: Allow safe retries — clients can retry without side effects.
CAP theorem: When a network partition occurs, a distributed system must choose between Consistency and Availability; Partition Tolerance is not optional once services talk over a network. Microservices typically favor Availability (AP), accepting eventual consistency.
Q7 — What is idempotency and how do you implement it? senior
Answer: An operation is idempotent if calling it multiple times with the same input produces the same result as calling it once. Essential for safe retries in event-driven systems.
Why needed: Kafka consumers typically get at-least-once delivery (offsets are committed after processing, so a crash or rebalance causes redelivery). The same OrderCreatedEvent may be delivered multiple times. Payment should only be processed once.
Implementation strategies:
1. Idempotency key in DB (most reliable):
@Entity
public class ProcessedEvent {
@Id
private String eventId; // Kafka message key or UUID in event
private Instant processedAt;
}
@Transactional
public void processPayment(PaymentRequest request) {
if (processedEventRepo.existsById(request.getEventId())) {
return; // duplicate — skip
}
processedEventRepo.save(new ProcessedEvent(request.getEventId()));
// ... process payment
}
2. Unique constraint in DB:
CREATE UNIQUE INDEX idx_payment_order ON payments(order_id);
-- Second attempt to insert payment for same order_id throws exception → rollback → safe
3. Redis idempotency (shorter TTL, distributed):
Boolean firstTime = redisTemplate.opsForValue()
.setIfAbsent("processed:" + eventId, "1", Duration.ofHours(24));
if (Boolean.FALSE.equals(firstTime)) return; // already processed
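All three strategies share one idea: an atomic "insert if absent" on the event ID decides who gets to process. A minimal in-memory sketch follows, assuming a hypothetical `IdempotentPaymentHandler` class where a concurrent set plays the role of the processed-events table or Redis key.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class IdempotentConsumerDemo {
    public static void main(String[] args) {
        IdempotentPaymentHandler handler = new IdempotentPaymentHandler();
        handler.onOrderCreated("evt-42");
        handler.onOrderCreated("evt-42"); // redelivery of the same event
        System.out.println("payments processed: " + handler.processedCount());
        // prints: payments processed: 1
    }
}

// Hypothetical handler; the set stands in for a durable idempotency store.
class IdempotentPaymentHandler {
    private final Set<String> seenEventIds = ConcurrentHashMap.newKeySet();
    private final AtomicInteger payments = new AtomicInteger();

    /** Returns true if the event was processed, false if it was a duplicate. */
    boolean onOrderCreated(String eventId) {
        // add() is atomic: exactly one caller per eventId gets true.
        if (!seenEventIds.add(eventId)) {
            return false;
        }
        payments.incrementAndGet(); // stands in for the real payment logic
        return true;
    }

    int processedCount() {
        return payments.get();
    }
}
```

An in-memory set is lost on restart and is not shared between instances. That is why the strategies above keep the marker in the database or Redis, ideally in the same transaction as the business write.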
Q8 — What is the Circuit Breaker Pattern and how does it prevent cascading failures? senior
Answer: The Circuit Breaker prevents a system from repeatedly calling a failing service, allowing it time to recover.
States:
CLOSED → OPEN (after N failures or X% failure rate)
→ HALF_OPEN (after wait duration) → probes with limited requests
→ CLOSED (if probe succeeds) or OPEN (if probe fails)
Resilience4j in Spring Boot:
@CircuitBreaker(name = "payment-service", fallbackMethod = "paymentFallback")
public PaymentResult processPayment(PaymentRequest request) {
return paymentClient.process(request);
}
public PaymentResult paymentFallback(PaymentRequest request, Exception e) {
log.warn("Circuit open for payment-service: {}", e.getMessage());
return PaymentResult.pending("PAYMENT_SERVICE_UNAVAILABLE");
}
resilience4j.circuitbreaker.instances.payment-service:
  slidingWindowSize: 10
  failureRateThreshold: 50 # open when 50% of 10 requests fail
  waitDurationInOpenState: 10s
  permittedNumberOfCallsInHalfOpenState: 3
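Cascading failure scenario: Without a circuit breaker, payment-service is slow → order-service holds threads waiting → order-service becomes slow → gateway threads exhausted → entire system down. With a circuit breaker, once the failure threshold is reached (slow calls count only if a timeout or slow-call threshold is configured) the circuit opens, calls return the fallback immediately, and the system remains responsive.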
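The state transitions can be made concrete with a small hand-rolled breaker. This is a simplified sketch, not how Resilience4j is implemented: the hypothetical `SimpleCircuitBreaker` counts consecutive failures instead of using a sliding-window failure rate, allows a single probe in HALF_OPEN, and takes an injectable clock so the wait duration can be simulated.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

class CircuitBreakerDemo {
    public static void main(String[] args) {
        long[] nowMs = {0};
        SimpleCircuitBreaker cb = new SimpleCircuitBreaker(2, 10_000, () -> nowMs[0]);
        Supplier<String> failing = () -> {
            throw new RuntimeException("timeout");
        };
        System.out.println(cb.call(failing, "fallback") + " " + cb.state()); // fallback CLOSED
        System.out.println(cb.call(failing, "fallback") + " " + cb.state()); // fallback OPEN
        nowMs[0] = 10_000; // wait duration has elapsed
        System.out.println(cb.call(() -> "ok", "fallback") + " " + cb.state()); // ok CLOSED
    }
}

class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long waitMs;
    private final LongSupplier clockMs;
    private State state = State.CLOSED;
    private int consecutiveFailures;
    private long openedAtMs;

    SimpleCircuitBreaker(int failureThreshold, long waitMs, LongSupplier clockMs) {
        this.failureThreshold = failureThreshold;
        this.waitMs = waitMs;
        this.clockMs = clockMs;
    }

    synchronized State state() {
        return state;
    }

    synchronized <T> T call(Supplier<T> action, T fallback) {
        if (state == State.OPEN) {
            if (clockMs.getAsLong() - openedAtMs < waitMs) {
                return fallback; // fail fast without touching the downstream
            }
            state = State.HALF_OPEN; // wait elapsed: let one probe through
        }
        try {
            T result = action.get();
            consecutiveFailures = 0;
            state = State.CLOSED;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN;
                openedAtMs = clockMs.getAsLong();
            }
            return fallback;
        }
    }
}
```

While the breaker is OPEN the downstream is never called. This is what frees the caller's threads and gives the failing service time to recover.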
Q9 — How does service-to-service authentication work in a microservices system? senior
Answer: When service A calls service B, service B needs to verify the call is legitimate and identify the calling service.
Strategy 1: Trust gateway-injected headers (this system):
- Gateway validates user JWT and injects X-User-Id, X-User-Email, X-User-Role.
- Downstream services trust these headers and do NOT re-validate JWT.
- Services are not exposed publicly — only through the gateway.
- Risk: If a service is accidentally exposed directly, it has no auth.
Strategy 2: Service-to-service JWT (mutual auth):
- Each service has its own service account JWT (signed with internal key).
- Service A includes its JWT in Authorization: Bearer <service-jwt> when calling Service B.
- Service B validates the service JWT independently.
Strategy 3: mTLS (mutual TLS):
- Each service has a client certificate.
- Services verify each other's certificates during TLS handshake.
- Used in service meshes (Istio, Linkerd).
Strategy 4: API keys:
- Each service has a pre-shared API key for inter-service calls.
- Simpler but less secure (static secret rotation is manual).
Best practice: Trust gateway-injected headers for user context (as this system does), plus mTLS or service JWTs for service identity verification.
Q10 — What is service discovery and how does it work? junior
Answer: Service discovery allows services to find each other dynamically without hardcoded IP addresses, which change frequently in containerized environments.
Two types:
Client-side discovery (Spring Cloud + Eureka):
- Services register with Eureka server on startup.
- Service A queries Eureka to get instances of Service B, then calls one directly.
- lb://order-service in gateway config triggers client-side load balancing.
Server-side discovery (Kubernetes / NGINX / AWS ELB):
- Service A calls a fixed hostname/URL.
- Infrastructure (K8s DNS, load balancer) resolves to a healthy instance.
- No service discovery library needed in the application.
# Spring Cloud Eureka registration
eureka:
  client:
    service-url.defaultZone: http://eureka-server:8761/eureka
  instance:
    instance-id: ${spring.application.name}:${server.port}
This system: Uses static URIs (http://order-service:8083) because Docker Compose DNS resolves service names. In Kubernetes, http://order-service.default.svc.cluster.local:8083 or just http://order-service:8083 works via K8s DNS.
Q11 — What is Bounded Context and how does it apply to microservice design? senior
Answer: Bounded Context (from Domain-Driven Design) defines the explicit boundary within which a specific domain model applies. Each microservice should ideally correspond to one bounded context.
Product in multiple contexts:
| Service | What "Product" means | Fields |
|---|---|---|
| product-service | Catalog item | name, description, price, images, category |
| order-service | Ordered item (snapshot) | productId, name, price (at time of order), quantity |
| inventory-service | Stock unit | productId, warehouseId, quantity, location |
Each service has its own Product model. The order-service's OrderItem snapshots the product price at order time, so it does NOT depend on the live product price in product-service.
Why separate: If order-service called product-service for every order lookup, it would create tight coupling, network dependency, and cascade failures.
Anti-pattern: A single shared Product class imported by all services. This creates a monolith disguised as microservices.
Q12 — What is CQRS and when is it used? senior
Answer: CQRS (Command Query Responsibility Segregation) separates write (command) and read (query) operations into different models and potentially different data stores.
Structure:
- Command side: Accepts writes (create order, update payment). Optimized for consistency, business rules, transactional writes. Uses the main relational DB.
- Query side: Optimized for complex reads. Uses a denormalized read model (Redis, Elasticsearch, read replica, materialized view).
// Command (write)
public class CreateOrderCommand {
UUID productId;
int quantity;
UUID customerId;
}
// Query model (read, denormalized)
public record OrderSummaryView(
UUID orderId, String customerEmail, String productName,
BigDecimal total, OrderStatus status, Instant createdAt) {}
Event Sourcing + CQRS: Commands create events stored in an event log. Read models are projections built by replaying events.
When to use CQRS: High read/write ratio disparity, complex query requirements, multiple read models needed, event sourcing. Overkill for simple CRUD services.
This system: order-service uses a simple JPA model. CQRS would be valuable if complex order dashboards or reporting are added.
Q13 — What is the Strangler Fig Pattern for migrating a monolith to microservices? senior
Answer: The Strangler Fig Pattern gradually extracts functionality from a monolith into microservices, without a risky big-bang rewrite.
Strategy:
- Place a gateway/proxy in front of the monolith.
- Extract one bounded context (e.g., product catalog) into a new microservice.
- Route /products/** requests to the new microservice; all other traffic still goes to the monolith.
- Migrate data from the monolith DB to the new service's DB.
- Repeat for each bounded context until the monolith is replaced.
Phase 1: All traffic → Monolith
Phase 2: /products → ProductService; rest → Monolith
Phase 3: /products, /orders → Services; rest → Monolith (shrinking)
Phase N: Monolith retired
Challenges:
- Data sharing: Extract data without breaking the monolith's DB queries.
- Transactions: The monolith and new service share data during transition.
- Rollback: Keep the old code path available as a fallback.
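The routing step of the strategy can be sketched as a prefix table in front of the monolith. The `StranglerRouter` class and the monolith URL below are hypothetical; in practice a gateway's route configuration plays this role.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class StranglerRouterDemo {
    public static void main(String[] args) {
        StranglerRouter router = new StranglerRouter("http://monolith:8080");
        router.extract("/products", "http://product-service:8082");
        System.out.println(router.route("/products/42")); // http://product-service:8082
        System.out.println(router.route("/orders/7"));    // http://monolith:8080
    }
}

// Hypothetical router: extracted path prefixes go to new services, everything else to the monolith.
class StranglerRouter {
    private final String monolithUrl;
    private final Map<String, String> extracted = new LinkedHashMap<>();

    StranglerRouter(String monolithUrl) {
        this.monolithUrl = monolithUrl;
    }

    /** Registers a path prefix that has been migrated to its own service. */
    void extract(String pathPrefix, String serviceUrl) {
        extracted.put(pathPrefix, serviceUrl);
    }

    /** Removing an entry sends traffic back to the monolith, which is the rollback path. */
    void rollback(String pathPrefix) {
        extracted.remove(pathPrefix);
    }

    String route(String path) {
        for (Map.Entry<String, String> entry : extracted.entrySet()) {
            String prefix = entry.getKey();
            // Match whole path segments so "/productsearch" does not hit "/products".
            if (path.equals(prefix) || path.startsWith(prefix + "/")) {
                return entry.getValue();
            }
        }
        return monolithUrl;
    }
}
```

Each migration phase adds one entry to the table, and the monolith is retired once no request falls through to the default.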
Q14 — How do you handle distributed tracing across microservices? junior
Answer: Distributed tracing tracks a single request as it flows through multiple services, allowing you to see the full execution path and latency breakdown.
Core concepts:
- Trace: The complete journey of a request across services.
- Span: A single operation within a service (DB query, HTTP call).
- TraceId: Unique ID shared across all spans in a trace.
- SpanId: Unique ID for each individual span.
This system (Micrometer Tracing + Zipkin):
management:
  tracing:
    sampling:
      probability: 1.0 # trace 100% of requests (use 0.1 in production)
  zipkin:
    tracing:
      endpoint: http://zipkin:9411/api/v2/spans
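Spring Boot 3.x auto-configures trace context propagation. The default format is W3C Trace Context, so when the gateway calls order-service it adds a traceparent header automatically. B3 propagation (X-B3-TraceId, X-B3-SpanId, X-B3-ParentSpanId in its multi-header form) must be selected explicitly via management.tracing.propagation.type.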
Zipkin UI: Shows waterfall view of spans — identify which service added latency. http://localhost:9411
Q15 — What is health check and readiness/liveness probes in microservices? junior
Answer: Liveness probe: Is the service running? If it fails, the container is restarted. Readiness probe: Is the service ready to accept traffic? If it fails, the service is removed from the load balancer but not restarted.
# Kubernetes probes
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
Spring Boot Actuator:
management:
  health:
    livenessstate.enabled: true
    readinessstate.enabled: true
  endpoint.health.probes.enabled: true
Custom health indicator (Redis connectivity):
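@Component
public class RedisHealthIndicator implements HealthIndicator {
    // Injected by Spring; the original snippet used redisTemplate without declaring it
    private final StringRedisTemplate redisTemplate;

    public RedisHealthIndicator(StringRedisTemplate redisTemplate) {
        this.redisTemplate = redisTemplate;
    }

    @Override
    public Health health() {
        try {
            // Any round trip proves connectivity; the key does not need to exist
            redisTemplate.opsForValue().get("health-check");
            return Health.up().withDetail("redis", "connected").build();
        } catch (Exception e) {
            return Health.down().withDetail("redis", e.getMessage()).build();
        }
    }
}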
Q16 — What is the Sidecar Pattern in microservices? senior
Answer: The Sidecar Pattern deploys a helper process alongside the main service container, sharing the same lifecycle, network, and resources. The sidecar handles cross-cutting concerns without modifying the main application.
Common sidecars:
| Sidecar | Function |
|---|---|
| Envoy / Istio proxy | mTLS, load balancing, circuit breaking, retries |
| Log shipper (Fluentd/Filebeat) | Collect and forward logs to centralized logging |
| Metrics agent (Prometheus exporter) | Expose application metrics |
| Config reloader | Hot-reload config without restart |
| Secret manager | Inject secrets from Vault/AWS SM |
In Kubernetes (Service Mesh):
containers:
  - name: order-service
    image: order-service:latest
  - name: envoy-proxy # sidecar
    image: envoy:latest
    # Intercepts all network traffic, handles mTLS, retries, etc.
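Benefit: Order-service needs little or no transport-security or traffic-observability code of its own because the sidecar handles it (business metrics and application logs still come from the service). The sidecar can be upgraded to improve security posture without changing service code.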
Q17 — What is an anti-corruption layer (ACL) and when do you need one? senior
Answer: An Anti-Corruption Layer (ACL) is a translation layer between two bounded contexts with different domain models. It prevents the concepts of one context from polluting another.
Scenario: order-service needs product information from product-service, but their models differ.
// product-service model
record CatalogProduct(UUID productId, String productName, Money listPrice,
String categoryCode, List<String> imageUrls) {}
// order-service needs
record OrderLineProduct(UUID id, String name, BigDecimal price) {}
// Anti-corruption layer: translates between contexts
@Service
public class ProductAcl {
private final ProductServiceClient client;
public OrderLineProduct getForOrder(UUID productId) {
CatalogProduct catalog = client.getProduct(productId);
return new OrderLineProduct(
catalog.productId(),
catalog.productName(),
catalog.listPrice().amount() // extract scalar from Money value object
);
}
}
When needed:
- Integrating with legacy systems with a different model.
- Calling an external API whose model differs from yours.
- Preventing domain model pollution from a poorly designed upstream service.
Q18 — How do you handle database migrations in microservices? junior
Answer: Each microservice owns its own database. Schema changes are managed with a migration tool (Flyway or Liquibase) that runs on service startup.
Flyway (used in this system):
src/main/resources/db/migration/
V1__create_orders_table.sql
V2__add_order_status_index.sql
V3__add_outbox_events_table.sql
spring:
  flyway:
    enabled: true
    locations: classpath:db/migration
    baseline-on-migrate: false
Zero-downtime migration patterns (blue-green deployments):
- Expand: Add new column as nullable (old and new service version can run together).
- Migrate: Backfill the column in a background job.
- Contract: Remove the old column after all instances use the new version.
Never: Rename a column in one migration — it breaks existing running instances. Always expand-migrate-contract.
This system: Flyway runs on application startup. For distributed systems, keep spring.flyway.out-of-order=false (the default) and ensure only one instance runs migrations (use locking or init containers in K8s).
Q19 — What is the Bulkhead Pattern and how does it prevent resource exhaustion? senior
Answer: The Bulkhead Pattern isolates resources (thread pools, connection pools) per integration point, preventing one slow downstream from exhausting all resources.
Without bulkhead: payment-service is slow → order-service spawns threads waiting for payment → thread pool exhausted → order-service cannot process any requests (including unrelated ones).
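With bulkhead: Cap the number of concurrent calls per downstream service (semaphore bulkhead), or give each downstream its own thread pool (thread pool bulkhead).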
@Bulkhead(name = "payment-service", fallbackMethod = "paymentFallback")
@CircuitBreaker(name = "payment-service")
public PaymentResult processPayment(PaymentRequest req) {
return paymentClient.process(req);
}
resilience4j.bulkhead.instances.payment-service:
  maxConcurrentCalls: 20 # max 20 concurrent payment calls
  maxWaitDuration: 500ms # wait 500ms for a slot before rejecting
Thread pool bulkhead (Resilience4j @ThreadPoolBulkhead):
resilience4j.thread-pool-bulkhead.instances.payment-service:
  maxThreadPoolSize: 10
  coreThreadPoolSize: 5
  queueCapacity: 5
Note with virtual threads: Virtual threads eliminate the thread exhaustion problem — each blocking call gets its own virtual thread. Bulkheads still prevent overloading the downstream service.
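The semaphore variant of the bulkhead is easy to see in plain Java. The `SemaphoreBulkhead` class below is a hypothetical sketch that rejects immediately when no slot is free (a maxWaitDuration of zero), rather than waiting 500ms as in the configuration above.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

class BulkheadDemo {
    public static void main(String[] args) {
        SemaphoreBulkhead bulkhead = new SemaphoreBulkhead(1);
        // The outer call holds the only permit, so the nested call is rejected.
        String result = bulkhead.call(
                () -> "outer ran, inner=" + bulkhead.call(() -> "ran", "rejected"),
                "rejected");
        System.out.println(result); // outer ran, inner=rejected
    }
}

// Hypothetical semaphore bulkhead: caps concurrent calls to one downstream.
class SemaphoreBulkhead {
    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    <T> T call(Supplier<T> action, T fallback) {
        if (!permits.tryAcquire()) {
            return fallback; // no free slot: fail fast instead of queueing
        }
        try {
            return action.get();
        } finally {
            permits.release(); // always free the slot, even if the call throws
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

A semaphore limits concurrency on the caller's own threads, so it also works with virtual threads. The thread pool variant additionally moves the call onto a dedicated pool.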
Q20 — How do you implement distributed locking in microservices? senior
Answer: Distributed locks coordinate access to shared resources across multiple service instances.
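Redis-based lock (single-instance SET NX EX; Redlock is the multi-node extension of this idea):
// Unique token so that only the owner can release the lock
String lockKey = "lock:order:" + orderId;
String lockValue = UUID.randomUUID().toString();
// Acquire lock (SET NX EX)
Boolean acquired = redisTemplate.opsForValue()
    .setIfAbsent(lockKey, lockValue, Duration.ofSeconds(10));
if (!Boolean.TRUE.equals(acquired)) {
    throw new ConcurrencyException("Order " + orderId + " is already being processed");
}
try {
    // Critical section: only one instance executes this while the lock is held.
    // It must finish within the 10s TTL, otherwise the lock expires and another instance can enter.
    processOrderPayment(orderId);
} finally {
    // A plain delete is unsafe if the TTL expired and another instance now holds the lock;
    // prefer a compare-and-delete keyed on lockValue (e.g. a Lua script).
    redisTemplate.delete(lockKey);
}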
Atomic release with Lua (prevents releasing another instance's lock):
String luaScript = """
if redis.call("get", KEYS[1]) == ARGV[1] then
return redis.call("del", KEYS[1])
else
return 0
end
""";
redisTemplate.execute(new DefaultRedisScript<>(luaScript, Long.class),
List.of("lock:order:" + orderId), lockValue);
Database-based lock:
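// PostgreSQL advisory lock. Session-level locks belong to one connection, so lock and unlock
// must run on the same connection (with a connection pool, run both inside one transaction).
// orderId.hashCode() is only 32 bits, so unrelated orders can collide on the same lock key.
jdbcTemplate.execute("SELECT pg_advisory_lock(" + orderId.hashCode() + ")");
try { ... }
finally { jdbcTemplate.execute("SELECT pg_advisory_unlock(" + orderId.hashCode() + ")"); }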
Q21 — What is consumer-driven contract testing? senior
Answer: Consumer-Driven Contract (CDC) testing verifies that a service's API matches what its consumers expect, without requiring both services to run simultaneously.
How it works:
- Consumer writes a contract (pact file) defining what it expects from the provider API.
- Provider runs the contract in its tests to verify it satisfies all consumer expectations.
- Contracts are stored in a shared repository (Pact Broker or Maven repo).
Spring Cloud Contract (used in this system via common module):
// contract defined by order-service (consumer)
Contract.make {
request {
method 'GET'
url '/products/123e4567-e89b-12d3-a456-426614174000'
}
response {
status 200
body([id: $(anyUuid()), name: $(anyNonBlankString()), price: $(anyPositiveInt())])
headers { contentType applicationJson() }
}
}
Benefits:
- Tests service integration without running all services.
- Contract violations are caught in CI before deployment.
- Documentation as code — contracts serve as API specs.
This system: common module contains stubs generated from contracts, used in integration tests with WireMock.
Q22 — What is graceful shutdown in microservices and how is it implemented? junior
Answer: Graceful shutdown ensures a service completes in-flight requests before stopping, preventing data loss or client errors.
Spring Boot graceful shutdown:
server:
  shutdown: graceful
spring:
  lifecycle:
    timeout-per-shutdown-phase: 30s # wait up to 30s for in-flight requests
How it works:
- Kubernetes (or Docker) sends SIGTERM to the process.
- Spring stops accepting new requests (readiness probe fails → removed from LB).
- Spring waits for in-flight requests to complete (up to 30s).
- Spring closes DB connections, Kafka producers (flushes in-flight messages).
- Process exits with code 0.
Kafka producer flush (critical for outbox poller):
@PreDestroy
public void shutdown() {
log.info("Shutting down outbox poller...");
running = false;
kafkaTemplate.flush(); // ensure all buffered messages are sent
}
This system: All services configure graceful shutdown. The outbox poller has @PreDestroy to stop polling and flush Kafka before JVM exits.
Q23 — How does the Rate Limiting pattern work in microservices? junior
Answer: Rate limiting controls the number of requests a client can make in a time window, protecting services from overload and abuse.
Algorithms:
| Algorithm | Description | Burst handling |
|---|---|---|
| Token bucket | Bucket with N tokens; each request consumes one; bucket refills at rate R | Allows burst up to bucket size |
| Fixed window | Count requests per fixed time window (e.g., 100/min) | Burst at window boundary |
| Sliding window log | Track timestamps of recent requests | Smooth, but memory-heavy |
| Leaky bucket | Requests queue and are processed at fixed rate | No burst — smooths traffic |
This system (Token Bucket):
- redis-rate-limiter.replenishRate=10 → 10 tokens added per second.
- redis-rate-limiter.burstCapacity=20 → burst of up to 20 simultaneous requests.
- Rate limit key = X-User-Id → per-user limit.
- Stored in Redis → shared across all gateway instances.
HTTP 429 Too Many Requests with Retry-After header tells clients how long to wait.
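The token bucket arithmetic can be checked with a small single-process sketch using the same numbers (capacity 20, refill 10 per second). The `TokenBucket` class is hypothetical and keeps its state in memory, whereas the gateway keeps it in Redis; the clock is injectable so time can be simulated.

```java
import java.util.function.LongSupplier;

class TokenBucketDemo {
    public static void main(String[] args) {
        long[] nowMs = {0};
        TokenBucket bucket = new TokenBucket(20, 10, () -> nowMs[0]);
        int allowed = 0;
        for (int i = 0; i < 25; i++) {
            if (bucket.tryConsume()) {
                allowed++;
            }
        }
        System.out.println("burst allowed: " + allowed); // burst allowed: 20
        nowMs[0] = 500; // half a second later, 5 tokens have been added
        System.out.println("after 500 ms: " + bucket.tryConsume()); // after 500 ms: true
    }
}

// Hypothetical in-memory token bucket for one rate-limit key (e.g. one user).
class TokenBucket {
    private final double capacity;
    private final double refillPerSecond;
    private final LongSupplier clockMs;
    private double tokens;
    private long lastRefillMs;

    TokenBucket(int capacity, int refillPerSecond, LongSupplier clockMs) {
        this.capacity = capacity;
        this.refillPerSecond = refillPerSecond;
        this.clockMs = clockMs;
        this.tokens = capacity; // start full so an initial burst is allowed
        this.lastRefillMs = clockMs.getAsLong();
    }

    /** Returns true if the request may proceed; false corresponds to HTTP 429. */
    synchronized boolean tryConsume() {
        long now = clockMs.getAsLong();
        // Refill lazily based on elapsed time, never above capacity.
        tokens = Math.min(capacity, tokens + (now - lastRefillMs) / 1000.0 * refillPerSecond);
        lastRefillMs = now;
        if (tokens < 1.0) {
            return false;
        }
        tokens -= 1.0;
        return true;
    }
}
```

The bucket size bounds the burst and the refill rate bounds the sustained throughput, which is why the two settings are configured separately.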
Q24 — What is the Inbox/Outbox Pattern and how does it differ from just using Kafka? senior
Answer: The Outbox Pattern (write side) ensures events are reliably published from the producer side. The Inbox Pattern (read side) ensures events are idempotently processed on the consumer side.
Combined (full reliability):
Producer (order-service):
DB Transaction:
INSERT INTO orders(...) -- business record
INSERT INTO outbox(event_id, topic, payload, processed=false) -- event record
Poller: publish outbox events → mark processed
Consumer (payment-service):
Receive Kafka message
DB Transaction:
IF NOT EXISTS (SELECT 1 FROM inbox WHERE event_id = ?)
INSERT INTO inbox(event_id, processed_at) -- idempotency record
INSERT INTO payments(...) -- business record
- Without Outbox: Producer can fail between DB commit and Kafka publish → event lost.
- Without Inbox: Consumer can process the same event twice → duplicate payment.
- With both: Effectively exactly-once business semantics on top of at-least-once Kafka delivery.
This system: Outbox pattern is implemented in order-service. Inbox pattern (explicit idempotency table) would be implemented in payment-service for production robustness.
Q25 — How do you handle versioning of Kafka event schemas? senior
Answer: Event schema versioning is critical for evolving events without breaking consumers.
Avro + Schema Registry (this system):
- Avro schemas are registered in Confluent Schema Registry.
- Producers and consumers exchange schema IDs, not the full schema on every message.
- Schema Registry enforces compatibility rules.
Compatibility types:
| Type | Meaning | Allows |
|---|---|---|
| BACKWARD | New consumer can read old messages | Delete fields; add optional fields |
| FORWARD | Old consumer can read new messages | Add fields; delete optional fields |
| FULL | Both BACKWARD and FORWARD | Only add/remove optional fields |
| NONE | No compatibility check | Any change |
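Schema evolution rules (BACKWARD compat):
- ✅ Add new field with default: {"name": "discount", "type": "double", "default": 0.0}
- ✅ Remove a field (new consumers ignore it in old messages; removing a field without a default breaks FORWARD and FULL)
- ❌ Add a required field (no default)
- ❌ Rename a field
- ❌ Change field type (int → string)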
Alternative (versioned topics): Create new topic order.created.v2 for breaking changes. Run consumers for both v1 and v2 during migration.
Q26 — What is a service mesh and how does it differ from API gateway? senior
Answer:
| | API Gateway | Service Mesh |
|---|---|---|
| Location | Edge (north-south traffic) | Internal (east-west traffic) |
| Manages | External client → services | Service → service |
| Features | Routing, auth, rate limiting, SSL termination | mTLS, observability, retries, circuit breaking |
| Implementation | Application layer (Spring Cloud Gateway, NGINX) | Infrastructure layer (Istio, Linkerd sidecar proxies) |
| Requires code changes | Some | None (transparent proxy) |
Service mesh sidecars intercept all network traffic between services:
order-service → [Envoy sidecar] → [network] → [Envoy sidecar] → payment-service
Envoy handles retries, circuit breakers, mTLS, and tracing — without any code changes in order-service or payment-service.
This system: No service mesh currently. All cross-cutting concerns are in application code (Resilience4j, Micrometer). A service mesh (Istio) would replace much of this code in a Kubernetes deployment.
Q27 — How do you implement feature flags in microservices? senior
Answer: Feature flags (feature toggles) allow enabling/disabling features without deploying new code, enabling canary releases, A/B testing, and safe rollbacks.
Implementation options:
1. Configuration-based (simple):
features:
  new-payment-flow: true
  loyalty-points: false
@Value("${features.new-payment-flow}")
private boolean newPaymentFlowEnabled;
if (newPaymentFlowEnabled) {
return newPaymentService.process(request);
}
return legacyPaymentService.process(request);
2. Redis-based (dynamic, no restart):
// Assumes a RedisTemplate<String, String>; a missing key parses as false
boolean enabled = Boolean.parseBoolean(redisTemplate.opsForValue().get("feature:new-payment-flow"));
3. External flag service (LaunchDarkly, Unleash):
- Supports targeting (enable for 10% of users, specific user IDs).
- Audit trail of flag changes.
- Real-time updates without service restart.
Canary with feature flags: Enable new checkout flow for 1% of users. Monitor error rates. Increase to 10%, 50%, 100%. Instant rollback if issues arise.
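Percentage targeting usually relies on a stable hash of the user ID, so each user stays in or out of the rollout as the percentage grows. The sketch below assumes a hypothetical `PercentageRollout` class with CRC32 as the hash; it does not reproduce how LaunchDarkly or Unleash assign buckets.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

class FeatureRolloutDemo {
    public static void main(String[] args) {
        PercentageRollout rollout = new PercentageRollout("new-checkout", 10);
        int enabled = 0;
        for (int i = 0; i < 10_000; i++) {
            if (rollout.isEnabledFor("user-" + i)) {
                enabled++;
            }
        }
        // Expect roughly 10% of users; the exact count depends on the hash.
        System.out.println("enabled for " + enabled + " of 10000 users");
    }
}

// Hypothetical percentage-based flag: deterministic per (flag, user) pair.
class PercentageRollout {
    private final String flagName;
    private final int percent; // 0..100

    PercentageRollout(String flagName, int percent) {
        this.flagName = flagName;
        this.percent = percent;
    }

    boolean isEnabledFor(String userId) {
        return bucket(userId) < percent;
    }

    /** Maps a user to a stable bucket in 0..99; salting with the flag name decorrelates flags. */
    int bucket(String userId) {
        CRC32 crc = new CRC32();
        crc.update((flagName + ":" + userId).getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % 100);
    }
}
```

Because a user's bucket never changes, raising the percentage from 1 to 10 to 50 only adds users. Nobody who already has the feature loses it mid-rollout.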
Q28 — What is the Retry Pattern and when should you not retry? junior
Answer: The Retry Pattern automatically re-executes a failed operation with the assumption that transient failures (network blip, momentary service overload) will resolve.
Spring Retry:
@Retryable(
retryFor = {HttpServerErrorException.class, ResourceAccessException.class},
maxAttempts = 3,
backoff = @Backoff(delay = 500, multiplier = 2, maxDelay = 5000)
)
public ProductDto getProduct(UUID id) {
return productClient.get(id);
}
Exponential backoff with jitter (avoid retry storms):
Attempt 1: wait 500ms
Attempt 2: wait 1000ms ± random(0-200ms)
Attempt 3: wait 2000ms ± random(0-400ms)
Do NOT retry:
- Non-idempotent operations: POST /payments — retrying creates duplicate payments.
- Business errors (400 Bad Request): The request is invalid; retrying will always fail.
- Authentication errors (401/403): Retrying without fixing credentials is useless.
- Client errors: Retry only server errors (500, 502, 503, 504) and timeouts.
This system: @RetryableTopic in Kafka consumers retries failed message processing. HTTP calls between services use circuit breaker + retry only on idempotent operations.
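The backoff-with-jitter schedule can be computed directly. The sketch below is a hypothetical helper, not Spring Retry's implementation: it applies the 500ms base, multiplier 2, and 5000ms cap from the annotation above, and the ±20% jitter fraction is an assumed value chosen for illustration.

```java
import java.util.Random;

class BackoffDemo {
    public static void main(String[] args) {
        JitteredBackoff backoff = new JitteredBackoff(500, 2.0, 5000, 0.2, new Random());
        for (int retry = 1; retry <= 3; retry++) {
            // Values vary per run because of the random jitter.
            System.out.println("retry " + retry + ": wait " + backoff.delayForRetry(retry) + " ms");
        }
    }
}

// Hypothetical helper computing exponential backoff with symmetric jitter.
class JitteredBackoff {
    private final long baseMs;
    private final double multiplier;
    private final long maxMs;
    private final double jitterFraction; // 0.2 means up to ±20% of the capped delay
    private final Random random;

    JitteredBackoff(long baseMs, double multiplier, long maxMs, double jitterFraction, Random random) {
        this.baseMs = baseMs;
        this.multiplier = multiplier;
        this.maxMs = maxMs;
        this.jitterFraction = jitterFraction;
        this.random = random;
    }

    /** Delay before the given retry (1-based): exponential, capped, then shifted by random jitter. */
    long delayForRetry(int retryNumber) {
        double exponential = baseMs * Math.pow(multiplier, retryNumber - 1);
        long capped = (long) Math.min(exponential, (double) maxMs);
        double spread = capped * jitterFraction;
        long jitter = Math.round((random.nextDouble() * 2 - 1) * spread);
        return Math.max(0, capped + jitter);
    }
}
```

Without jitter, every client that failed at the same moment retries at the same moment, so the recovering service is hit by synchronized waves. The random offset spreads those retries out.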
Q29 — How do you design for observability in microservices? senior
Answer: Observability = ability to understand internal state from external outputs. Three pillars:
1. Metrics (Micrometer + Prometheus + Grafana):
// Custom business metric
Counter orderCounter = Counter.builder("orders.created")
.tag("status", "success")
.register(meterRegistry);
orderCounter.increment();
// Histogram for latency
Timer.Sample timer = Timer.start(meterRegistry);
processOrder(request);
timer.stop(Timer.builder("order.processing.time").register(meterRegistry));
2. Logs (structured JSON, correlated by traceId):
{"timestamp":"2026-05-31T14:30:00Z","level":"INFO","service":"order-service",
"traceId":"abc123","spanId":"def456","orderId":"uuid","message":"Order created"}
3. Traces (Zipkin/Jaeger — this system uses Zipkin):
- Waterfall view of spans across services.
- Identify P99 latency bottlenecks.
- Trace IDs link logs and traces.
Alerting:
- Error rate > 1% for 5 minutes.
- P99 latency > 2 seconds.
- DLT topic has messages (Kafka consumer failures).
- Redis connection failures.
- JVM heap > 80%.
Q30 — What are anti-patterns in microservices to avoid? senior
Answer: Common microservices anti-patterns and their solutions:
1. Distributed Monolith: Services are deployed separately but are tightly coupled (shared DB, synchronous chains). Every change requires coordinating multiple services. Fix: Define proper bounded contexts, use events for loose coupling.
2. Chatty services: Service A makes 10 synchronous calls to Service B per request. Fix: Batch APIs, events, or API composition at the gateway.
3. Shared database: Multiple services read/write the same DB table. Fix: Each service owns its data; others request via API or events.
4. Too many services: CRUD microservices (UserMicroservice with 3 endpoints). Overhead exceeds benefit. Fix: Keep services at bounded-context granularity, not function granularity.
5. Missing idempotency: Retries cause duplicate orders/payments. Fix: Idempotency keys in every state-changing operation.
6. Synchronous event notification: Using REST calls instead of events for async workflows. Fix: Use Kafka events for order→payment→notification flow.
7. Ignoring eventual consistency: UI shows stale data without feedback. Fix: Optimistic UI updates, polling, SSE.
8. Logging without correlation: Impossible to trace a request across services. Fix: Correlation IDs, structured logging, Zipkin.