Introduction
Modern Java microservices need to be always-on, self-aware, and resilient. But what happens when a service hangs, a DB goes down, or a thread pool is exhausted?
Enter the Health Check and Self-Healing Patterns—two foundational resilience strategies that help detect failures and recover from them automatically, often without human intervention.
In this tutorial, you'll learn how to implement both patterns in Java applications using Spring Boot, Kubernetes, and custom logic.
🧠 What Are Health Check and Self-Healing Patterns?
Health Check Pattern
Provides an endpoint or mechanism to assess the status of critical components (DB, cache, memory, disk, etc.).
Self-Healing Pattern
Triggers automated actions (like restart, scale, reroute) when the system detects unhealthy behavior.
UML Diagram (Conceptual)
[Client or Load Balancer]
   |
   |---> /actuator/health --> [Service A] --> OK
   |                              |
   |<--- Auto-restart (K8s probe fails)
👥 Core Participants
- Health Indicator: Checks a component's health.
- Monitoring System: Polls health endpoints (e.g., Prometheus, K8s).
- Recovery Agent: Triggers actions like restart or fallback.
- Self-Healing Logic: Built-in recovery routines within services.
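To make the roles above concrete, here is a minimal plain-Java sketch of how they might fit together. The class name `HealthMonitor` and its methods are hypothetical and are not part of Spring Boot or Kubernetes. Each indicator is modelled as a `BooleanSupplier`, and the recovery agent is a `Runnable` registered per component.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

/** Hypothetical monitor: polls named indicators and runs the matching recovery action. */
final class HealthMonitor {
    private final Map<String, BooleanSupplier> indicators = new LinkedHashMap<>();
    private final Map<String, Runnable> recoveryActions = new LinkedHashMap<>();

    /** Registers a health indicator together with the recovery action for that component. */
    void register(String component, BooleanSupplier indicator, Runnable recoveryAction) {
        indicators.put(component, indicator);
        recoveryActions.put(component, recoveryAction);
    }

    /** Polls every indicator once, triggers recovery for each failure, returns overall health. */
    boolean pollOnce() {
        boolean allUp = true;
        for (Map.Entry<String, BooleanSupplier> entry : indicators.entrySet()) {
            boolean up;
            try {
                up = entry.getValue().getAsBoolean();
            } catch (RuntimeException e) {
                up = false; // an indicator that throws is treated as DOWN
            }
            if (!up) {
                allUp = false;
                recoveryActions.get(entry.getKey()).run();
            }
        }
        return allUp;
    }
}

class HealthMonitorDemo {
    public static void main(String[] args) {
        HealthMonitor monitor = new HealthMonitor();
        monitor.register("db", () -> true, () -> System.out.println("restarting db client"));
        monitor.register("cache", () -> false, () -> System.out.println("reconnecting cache"));
        System.out.println("healthy = " + monitor.pollOnce());
    }
}
```

Running the demo prints "reconnecting cache" followed by "healthy = false". In a real deployment the polling and the recovery are usually split between an external system (such as Kubernetes) and in-process logic, as the sections below show.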
🌍 Real-World Use Cases
- Restarting crashed services in Kubernetes.
- Re-initializing broken Kafka consumers.
- Reconnecting to DB if a pool is exhausted.
- Scaling pods when latency spikes.
🧰 Implementation Strategies in Java
1. Spring Boot Health Checks (Actuator)
Maven Dependency
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
application.yml
management:
  endpoints:
    web:
      exposure:
        include: health, info
Custom Health Indicator
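// Package names as used in Spring Boot 2.x and 3.x.
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

@Component
public class RedisHealthIndicator implements HealthIndicator {

    @Override
    public Health health() {
        boolean redisUp = checkRedis();
        return redisUp
                ? Health.up().build()
                : Health.down().withDetail("error", "Redis unreachable").build();
    }

    private boolean checkRedis() {
        // Placeholder: ping Redis or check the connection here.
        return false; // simulates Redis being down
    }
}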
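The `checkRedis()` placeholder above could be filled in several ways. One possible sketch, using only the JDK, opens a TCP connection, sends Redis's inline `PING` command, and expects `+PONG` back. The class name `RedisPing` and the timeout values are assumptions made for illustration. The sketch also assumes a Redis instance without authentication or TLS; a password-protected server answers with an error line, which this check reports as down. In a real application you would more likely reuse your existing Redis client (Spring Boot also provides its own Redis health indicator when Spring Data Redis is on the classpath).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: checks Redis reachability with a raw PING over TCP. */
final class RedisPing {
    private RedisPing() {}

    /** True only for the exact reply Redis sends to an unauthenticated PING. */
    static boolean isPong(String reply) {
        return "+PONG".equals(reply);
    }

    /** Returns false on connection failure, timeout, or any unexpected reply. */
    static boolean ping(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            socket.setSoTimeout(timeoutMillis); // bound the wait for the reply as well
            OutputStream out = socket.getOutputStream();
            out.write("PING\r\n".getBytes(StandardCharsets.US_ASCII));
            out.flush();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(socket.getInputStream(), StandardCharsets.US_ASCII));
            return isPong(in.readLine());
        } catch (IOException e) {
            return false; // refused, unreachable, or timed out
        }
    }
}
```

With this helper, `checkRedis()` could return something like `RedisPing.ping("localhost", 6379, 500)`, where the host, port, and timeout are placeholders for your own configuration. Keeping the timeout short matters because the health endpoint itself is polled by probes with their own timeouts.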
2. Kubernetes Liveness and Readiness Probes
K8s YAML
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 15
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
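Kubernetes does not act on a single failed probe. It counts consecutive failures against `failureThreshold`, which defaults to 3 when it is not set, as in the YAML above. The following plain-Java sketch mimics that counting so you can reason about how quickly a given configuration reacts. `ProbeTracker` is a hypothetical teaching aid, not a Kubernetes API.

```java
/** Hypothetical model of how consecutive probe failures are counted against a threshold. */
final class ProbeTracker {
    private final int failureThreshold;
    private int consecutiveFailures;

    ProbeTracker(int failureThreshold) {
        if (failureThreshold < 1) {
            throw new IllegalArgumentException("failureThreshold must be >= 1");
        }
        this.failureThreshold = failureThreshold;
    }

    /** Records one probe result; returns true once the failure threshold is reached. */
    boolean record(boolean probeSucceeded) {
        if (probeSucceeded) {
            consecutiveFailures = 0; // any success resets the streak
            return false;
        }
        consecutiveFailures++;
        return consecutiveFailures >= failureThreshold;
    }

    /** Rough upper bound on detection delay in seconds, ignoring probe timeouts. */
    static int maxDetectionDelaySeconds(int periodSeconds, int failureThreshold) {
        return periodSeconds * failureThreshold;
    }
}

class ProbeTrackerDemo {
    public static void main(String[] args) {
        ProbeTracker liveness = new ProbeTracker(3);
        boolean[] probeResults = {false, false, true, false, false, false};
        for (boolean result : probeResults) {
            System.out.println("probe ok=" + result + " -> act=" + liveness.record(result));
        }
        System.out.println(ProbeTracker.maxDetectionDelaySeconds(15, 3) + " s");
    }
}
```

With the liveness settings above (`periodSeconds: 15`, default threshold of 3), a hung container is restarted after roughly 45 seconds at most. Lowering either value makes recovery faster but also makes restarts during brief load spikes more likely.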
3. Self-Healing via Recovery Code
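// Runs every 30 seconds; requires @EnableScheduling on a configuration class.
@Scheduled(fixedRate = 30000)
public void autoReconnectIfDbDown() {
    // Connection.isValid is standard JDBC; the argument is a timeout in seconds.
    try (Connection connection = dataSource.getConnection()) {
        if (connection.isValid(2)) {
            return;
        }
    } catch (SQLException e) {
        logger.warn("DB health probe failed: " + e.getMessage());
    }
    logger.warn("DB unhealthy. Re-initializing...");
    reinitializePool(); // hypothetical hook: javax.sql.DataSource has no reconnect() method
}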
✅ Pros and Cons
Pros | Cons |
---|---|
Detects failure before users are impacted | May restart during transient spikes |
Enables automatic recovery and uptime | Requires careful configuration |
Improves observability and SLA enforcement | May mask underlying root causes |
❌ Anti-Patterns and Misuse
- Not customizing health checks (only default checks)
- Using the same endpoint for liveness and readiness
- Over-restarting due to aggressive probe settings
- Ignoring service logs during healing
🔁 Comparison with Related Patterns
Pattern | Purpose |
---|---|
Health Check | Detect component status |
Self-Healing | Take automated recovery actions |
Circuit Breaker | Stop calling failing services |
Retry Pattern | Retry operations on failure |
Failover | Switch to backup system |
💻 Java Code – Health Check + Recovery
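import java.util.concurrent.TimeUnit;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Component;

@Component
public class KafkaHealthIndicator implements HealthIndicator {

    @Autowired
    private KafkaTemplate<String, String> kafkaTemplate;

    @Override
    public Health health() {
        try {
            // send() is asynchronous; wait briefly so broker failures surface here.
            kafkaTemplate.send("health-check", "ping").get(2, TimeUnit.SECONDS);
            return Health.up().build();
        } catch (Exception e) {
            return Health.down(e).build();
        }
    }
}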
Recovery Code
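// kafkaHealthy() and restartKafkaClient() are application-specific helpers not shown here.
@EventListener(ApplicationReadyEvent.class)
public void startHealthWatcher() {
    // Check every 30 seconds, starting as soon as the application is ready.
    Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
        if (!kafkaHealthy()) {
            restartKafkaClient();
        }
    }, 0, 30, TimeUnit.SECONDS);
}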
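One detail worth knowing about the watcher above: `scheduleAtFixedRate` suppresses all later executions if a run throws, so a single failed restart would silently stop the watcher. The executor is also never shut down. A possible variant that addresses both points is sketched below in plain Java. `HealthWatcher` and its method names are hypothetical; the health check and recovery action are passed in, standing in for helpers like `kafkaHealthy()` and `restartKafkaClient()`.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

/** Hypothetical watcher that survives failing cycles and can be shut down cleanly. */
final class HealthWatcher implements AutoCloseable {
    private final BooleanSupplier healthCheck;
    private final Runnable recovery;
    private final AtomicInteger recoveryCount = new AtomicInteger();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(task -> {
                Thread thread = new Thread(task, "health-watcher");
                thread.setDaemon(true); // do not keep the JVM alive just for the watcher
                return thread;
            });

    HealthWatcher(BooleanSupplier healthCheck, Runnable recovery) {
        this.healthCheck = healthCheck;
        this.recovery = recovery;
    }

    /** Starts periodic checks; the first one runs immediately. */
    void start(long periodSeconds) {
        scheduler.scheduleAtFixedRate(this::runOnce, 0, periodSeconds, TimeUnit.SECONDS);
    }

    /** One check-and-recover cycle; never throws, so the schedule keeps running. */
    void runOnce() {
        try {
            if (!healthCheck.getAsBoolean()) {
                recoveryCount.incrementAndGet();
                recovery.run();
            }
        } catch (RuntimeException e) {
            System.err.println("health watcher cycle failed: " + e);
        }
    }

    /** Number of recovery attempts so far; useful for alerting on repeated healing. */
    int recoveryCount() {
        return recoveryCount.get();
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

In a Spring application, an object like this could be created in the `ApplicationReadyEvent` listener and closed from a `@PreDestroy` method; that wiring is left out here to keep the sketch free of framework dependencies.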
🔧 Refactoring Legacy Code
Before
- No health endpoint
- No recovery logic
After
- Add /actuator/health
- Add liveness/readiness probe support
- Add periodic checks to restart flaky services
🌟 Best Practices
- Separate liveness and readiness checks.
- Customize health indicators for all critical components.
- Use exponential backoff in retries.
- Add alerting on repeated self-healing.
- Document what constitutes an “unhealthy” state.
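Two of the practices above, exponential backoff and alerting on repeated self-healing, can be combined in a small helper. The sketch below is one possible shape; `HealingBackoff`, its parameters, and the doubling strategy without jitter are assumptions chosen for illustration.

```java
/** Hypothetical helper: exponential backoff between healing attempts plus an alert threshold. */
final class HealingBackoff {
    private final long baseDelayMillis;
    private final long maxDelayMillis;
    private final int alertAfterAttempts;
    private int consecutiveAttempts;

    HealingBackoff(long baseDelayMillis, long maxDelayMillis, int alertAfterAttempts) {
        if (baseDelayMillis <= 0 || maxDelayMillis < baseDelayMillis) {
            throw new IllegalArgumentException("need 0 < baseDelayMillis <= maxDelayMillis");
        }
        this.baseDelayMillis = baseDelayMillis;
        this.maxDelayMillis = maxDelayMillis;
        this.alertAfterAttempts = alertAfterAttempts;
    }

    /** Call after each failed healing attempt; returns the wait before the next one. */
    long nextDelayMillis() {
        long delay = baseDelayMillis;
        // Double once per previous attempt, stopping as soon as the cap is reached.
        for (int i = 0; i < consecutiveAttempts && delay < maxDelayMillis; i++) {
            delay *= 2;
        }
        consecutiveAttempts++;
        return Math.min(delay, maxDelayMillis);
    }

    /** True once healing has been attempted often enough that a human should look. */
    boolean shouldAlert() {
        return consecutiveAttempts >= alertAfterAttempts;
    }

    /** Call when the component is healthy again. */
    void reset() {
        consecutiveAttempts = 0;
    }
}
```

With a base of 1000 ms and a cap of 30000 ms, successive calls to `nextDelayMillis()` return 1000, 2000, 4000, 8000, 16000, and then 30000. Adding random jitter to each delay is a common refinement when many instances heal at the same time.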
🧠 Real-World Analogy
Think of your car's dashboard lights as health checks: when something is wrong, they tell you. But modern cars also auto-heal, for example by switching to eco-mode or rerouting power when tire pressure drops. That's self-healing.
☕ Java Feature Relevance
- Spring Boot Actuator: Easily expose health metrics.
- @Scheduled: Run periodic recovery checks.
- Records/Sealed Types: Model recovery response objects.
- CompletableFuture: Retry or parallel healing logic.
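As an illustration of the last two items, the sketch below models check results as records inside a sealed interface and runs several checks in parallel with `CompletableFuture`. It assumes Java 17 or later. The names `CheckResult` and `ParallelHealthChecks` are hypothetical, and each check is represented by a plain `Supplier<Boolean>`.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;
import java.util.stream.Collectors;

/** Hypothetical result type: a check is either Up or Down with a reason. */
sealed interface CheckResult {
    String component();

    record Up(String component) implements CheckResult {}

    record Down(String component, String reason) implements CheckResult {}
}

final class ParallelHealthChecks {
    private ParallelHealthChecks() {}

    /** Runs all checks concurrently; slow or throwing checks are reported as Down. */
    static List<CheckResult> runAll(Map<String, Supplier<Boolean>> checks, long timeoutMillis) {
        List<CompletableFuture<CheckResult>> futures = checks.entrySet().stream()
                .map(e -> CompletableFuture
                        .supplyAsync(() -> toResult(e.getKey(), e.getValue().get()))
                        .completeOnTimeout(new CheckResult.Down(e.getKey(), "timed out"),
                                timeoutMillis, TimeUnit.MILLISECONDS)
                        .exceptionally(ex -> new CheckResult.Down(e.getKey(), describe(ex))))
                .collect(Collectors.toList());
        // All futures are already running; join only collects the results in order.
        return futures.stream().map(CompletableFuture::join).collect(Collectors.toList());
    }

    private static CheckResult toResult(String component, boolean up) {
        return up
                ? new CheckResult.Up(component)
                : new CheckResult.Down(component, "check returned false");
    }

    private static String describe(Throwable ex) {
        Throwable cause = ex.getCause() != null ? ex.getCause() : ex;
        return cause.toString();
    }
}

class ParallelHealthChecksDemo {
    public static void main(String[] args) {
        Map<String, Supplier<Boolean>> checks = new LinkedHashMap<>();
        checks.put("db", () -> true);
        checks.put("cache", () -> false);
        for (CheckResult result : ParallelHealthChecks.runAll(checks, 2000)) {
            if (result instanceof CheckResult.Down down) {
                System.out.println(down.component() + " is down: " + down.reason());
            } else {
                System.out.println(result.component() + " is up");
            }
        }
    }
}
```

The sealed interface keeps the set of outcomes closed, so code that handles results can be checked for completeness, and the timeout ensures that one hanging dependency cannot stall the whole health report.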
🔚 Conclusion & Key Takeaways
Health checks detect problems. Self-healing fixes them.
Together, they form the backbone of microservice resilience, ensuring systems recover fast, scale smartly, and stay available even in partial failure scenarios.
✅ Summary
- Use /actuator/health with custom indicators.
- Integrate with Kubernetes probes.
- Implement self-healing with scheduled logic.
- Monitor, test, and refine frequently.
❓ FAQ – Health Check & Self-Healing in Java
1. What’s the difference between liveness and readiness?
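Liveness: Is the app still running, or is it stuck and in need of a restart? Kubernetes restarts the container when this probe keeps failing.
Readiness: Is the app ready to receive traffic? Kubernetes stops routing traffic to the pod while this probe fails.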
2. Can I customize Spring Boot health checks?
Yes. Implement HealthIndicator.
3. Should I restart on all failures?
No. Use smart recovery logic.
4. What’s a good retry interval?
Start with 30 seconds and monitor.
5. Can self-healing mask real issues?
Yes. Add alerting to monitor recovery attempts.
6. What tools integrate with health checks?
Kubernetes, Prometheus, Grafana, ELK.
7. Do all services need readiness probes?
Any service that receives traffic can benefit from one. They matter most for services with long startup processes or external dependencies.
8. How can I simulate failure locally?
Kill the DB connection or simulate high CPU load, then watch how /actuator/health and the probes respond.
9. What about multi-region failover?
Use readiness checks to direct traffic only to healthy regions.
10. Can I use retries and self-healing together?
Yes. Retry short failures, self-heal persistent issues.
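The answer to question 10 can be sketched as a small escalation helper: retry a few times for transient failures, and only if every attempt fails run a healing action before one final try. `RetryThenHeal` is a hypothetical name, and the sketch retries immediately for brevity; a real implementation would normally wait between attempts.

```java
import java.util.function.Supplier;

/** Hypothetical helper: retry first, escalate to self-healing only for persistent failures. */
final class RetryThenHeal {
    private RetryThenHeal() {}

    /**
     * Tries the operation up to maxAttempts times. If every attempt fails,
     * runs the healing action once and makes one final attempt.
     */
    static <T> T call(Supplier<T> operation, int maxAttempts, Runnable heal) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.get();
            } catch (RuntimeException e) {
                // Treated as transient: fall through and try again.
            }
        }
        heal.run();             // persistent failure: escalate to self-healing
        return operation.get(); // final attempt; the exception propagates if it still fails
    }
}
```

A call might look like `RetryThenHeal.call(() -> client.fetch(), 3, () -> client.reconnect())`, where `client`, `fetch()`, and `reconnect()` are placeholders for your own code.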