Meaningful web service /health checks
About 10 years ago, I deployed my first web service. It was a nice, silly PHP application to store game cheat sheets. Oddly enough, what made me really proud of it was that I could release new versions with a single command, thanks to a weird mix of git hooks and rsync-powered bash scripts.
When I think about the massive transformation that our field has undergone in the last few years in terms of continuous delivery and service orchestration, I always come back to that memory, and I can’t help laughing a bit.
Cloud and orchestration platforms such as AWS, Heroku, Azure, or Kubernetes have enabled us to use deployment strategies such as canary releases, staged rollouts, or blue-green deployments, regardless of whether we’re deploying a side project or a critical enterprise service.
All of these strategies have but one goal: to minimise client-facing downtime. Which brings me to an important (yet somehow easily forgotten) topic: Health checks.
A health check is a way for a service to report whether it’s running properly or not. Web services usually expose this via a /health HTTP endpoint. Orchestration components, such as load balancers or service discovery systems, then poll that endpoint to monitor the health of a fleet of services and make key decisions, such as:
- Is a new version of the service ready to receive requests?
- Should we roll back a deployment?
- Should we restart an instance?
Anemic health checks
In my short experience (I’ve been in the industry for about 6 years), I’ve seen a bunch of health checks for different services and I’ve realised that most of them are pretty basic. They attempt to establish a connection to their downstream dependencies (select something from the database, ping Redis…) and report that they’re okay as soon as they:
- Can process a request to the /health endpoint (that must mean our application is fully loaded);
- Can connect to their dependencies.
After that, we’re basically good to go.
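To make it concrete, here’s roughly what one of those anemic checks looks like. This is a minimal Go sketch; the Postgres driver and connection string are placeholder choices, not a recommendation:

```go
package main

import (
	"database/sql"
	"log"
	"net/http"

	_ "github.com/lib/pq" // illustrative driver choice
)

var db *sql.DB

// The anemic pattern: if this handler runs at all, the app must be
// "fully loaded"; if the database answers a ping, we must be healthy.
func healthHandler(w http.ResponseWriter, r *http.Request) {
	if err := db.Ping(); err != nil {
		http.Error(w, "unhealthy", http.StatusServiceUnavailable)
		return
	}
	w.Write([]byte("OK"))
}

func main() {
	var err error
	// The connection string is a placeholder for your environment.
	db, err = sql.Open("postgres", "postgres://localhost/app?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	http.HandleFunc("/health", healthHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```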
So when I needed to write my own health check for a new service, I copied that pattern.
And then we had an outage. We had deployed a new version with a bug and my engineering manager asked me a very basic question: Is our health check giving us enough information to prevent this?
That made me reflect and realise that our health checks were anemic. We speak of anemic domain models when a domain model contains no business logic of its own; it’s just there, being dull and irrelevant. And so were our health checks.
10 rules for meaningful health checks
I immediately turned to Google, Stack Overflow & co. looking for best practices for health checks. I couldn’t find any! No philosophical discussions on health checks, no epic rants about how we’re doing everything wrong…
Maybe I had lost my googling mojo or maybe nobody cared about health checks, but I do! And if you do too, here’s a good list of things to keep in mind when you’re implementing a health check for your service.
Rule #1. Take your business use cases into account
Being healthy can mean different things in different contexts. For typical web services exposing an HTTP API, it might be enough to consider the ratio of internal server errors. For other services, such as tasks that need to run periodically or subscribers that need to consume events, a healthy state might mean something completely different.
When asked the question: when am I operating healthily?, think twice about the use cases your service is supposed to fulfil.
Rule #2. Check your downstream dependencies
A health check shouldn’t rely solely on your downstream dependencies, but they’re definitely an important part of the equation.
Typically, you’ll want to answer these questions:
- Can I grab a connection from my connection pool?
- Can I request something simple from the database?
- Does the request finish within an acceptable time?
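As a sketch in Go, assuming a *sql.DB handle opened elsewhere at startup, such a check could look like this:

```go
package health

import (
	"context"
	"database/sql"
	"time"
)

// checkDatabase answers the three questions above: grab a connection
// from the pool, run a trivial query, and fail if the whole round trip
// doesn't finish within the deadline.
func checkDatabase(db *sql.DB) error {
	// The 2-second budget is an arbitrary example; pick a bound that
	// stays well under your orchestrator's probe timeout.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	conn, err := db.Conn(ctx) // can I grab a connection from the pool?
	if err != nil {
		return err
	}
	defer conn.Close()

	var one int
	// Can I request something simple, within an acceptable time?
	return conn.QueryRowContext(ctx, "SELECT 1").Scan(&one)
}
```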
Rule #3. Return machine-readable data
Health checks will be mainly used by machines for a wide range of scenarios (visualisation, decision-making, load balancing, alerting…).
Your health checks should return data in a machine-readable format (there’s a nice IETF draft proposing a standard) that looks the same for all the services of your company.
For bonus points, try to be transparent about the checks you perform, which ones failed and why. As your deployment and routing strategies get more sophisticated, this information will become invaluable.
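For illustration, a response shape along those lines might be modelled like this in Go; the field names are my own choice, loosely inspired by that draft:

```go
package health

import "time"

// Report loosely follows the shape proposed in that draft: an overall
// status plus one entry per individual check, including why it failed.
type Report struct {
	Status      string           `json:"status"` // "pass", "warn" or "fail"
	Version     string           `json:"version,omitempty"`
	Description string           `json:"description,omitempty"`
	Checks      map[string]Check `json:"checks,omitempty"`
}

// Check records the outcome of a single downstream or internal check.
type Check struct {
	Status     string        `json:"status"`
	Latency    time.Duration `json:"latency_ns,omitempty"` // serialised as nanoseconds
	Output     string        `json:"output,omitempty"`     // human-readable failure reason
	ObservedAt time.Time     `json:"observed_at"`
}
```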
Rule #4. Report health as a spectrum
People’s health is not binary: they are not either completely healthy or dead. Neither are servers.
There are several reasons why we might want to react to an unhealthy status: we might want to roll back a release, restart the service, reduce traffic, or page our on-call engineer. Choosing the right reaction depends heavily on our ability to distinguish between different shades of grey.
Can you imagine getting back a medical report with no more information than OK or KO?
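One simple way to model that spectrum, as a sketch (the levels and their suggested reactions are just an example policy):

```go
package health

// Status deliberately has more than two values: an instance can be
// degraded (say, a warm cache is gone) without being dead.
type Status int

const (
	Healthy   Status = iota // serve traffic normally
	Degraded                // keep running, consider reducing traffic
	Unhealthy               // stop routing traffic; restart or roll back
)

// Worst folds per-check results into an overall status, so a single
// degraded dependency doesn't get reported the same way as a dead one.
func Worst(statuses ...Status) Status {
	worst := Healthy
	for _, s := range statuses {
		if s > worst {
			worst = s
		}
	}
	return worst
}
```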
Rule #5. Consider different checks (readiness, liveness…)
Orchestration platforms like Kubernetes make a distinction between a liveness check and a readiness check (although, in Kubernetes jargon, they call them probes).
A readiness check answers the question: Can I start processing work? To answer it, it might check things like:
- Can I establish a connection to the database?
- Are all important caches warmed up?
A liveness check answers the question: Should I keep running? It might depend on things like:
- Is my error rate acceptable?
- Am I running out of memory? Might I have a slow memory leak?
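Here’s a minimal Go sketch of the two endpoints; the paths and the startup goroutine are illustrative:

```go
package main

import (
	"net/http"
	"sync/atomic"
)

// ready flips to true once startup work has finished. In Kubernetes,
// you'd point the readiness and liveness probes at these two handlers.
var ready atomic.Bool

func main() {
	go func() {
		// ... connect to the database, warm up important caches ...
		ready.Store(true)
	}()

	// Readiness: can I start processing work?
	http.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
		if !ready.Load() {
			http.Error(w, "warming up", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})

	// Liveness: should I keep running? Kept deliberately cheap here;
	// error rates or memory pressure could feed into this decision.
	http.HandleFunc("/live", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	http.ListenAndServe(":8080", nil)
}
```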
Rule #6. Don’t confuse overall health with individual health
A health check is concerned with the health of one particular instance of your service, so we don’t want to report health based on aggregated metrics like:
- The overall error rate of your cluster;
- The number of customer sign-ups.
In fact, the whole point is to tell the bad apples from the good ones, and replace them without affecting your end customers at all.
Rule #7. Don’t expose the health endpoint publicly
Health endpoints are supposed to contain debug-level information. They leak very important details about your service’s internal implementation: what it uses and what it doesn’t.
You can be transparent with the community and keep a nice status page, but keep the endpoint itself safe behind closed doors.
Treat the privacy of your services as you would treat the privacy of your customers. They go hand in hand.
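One common way to do this, sketched below, is to serve the health endpoint from a separate listener on a port that your load balancer or ingress never exposes (the ports and handlers here are placeholders):

```go
package main

import "net/http"

func handleAPI(w http.ResponseWriter, r *http.Request)    { w.Write([]byte("hello")) }
func handleHealth(w http.ResponseWriter, r *http.Request) { w.Write([]byte(`{"status":"pass"}`)) }

func main() {
	// Public API on the externally exposed port.
	api := http.NewServeMux()
	api.HandleFunc("/", handleAPI)
	go http.ListenAndServe(":8080", api)

	// Health endpoint on a separate port that is reachable from inside
	// your network (for the orchestrator) but never exposed through the
	// load balancer or ingress.
	ops := http.NewServeMux()
	ops.HandleFunc("/health", handleHealth)
	http.ListenAndServe(":9090", ops)
}
```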
Rule #8. Delegate to smaller subcomponent checks
Some services are fairly small. Some are big monoliths. In the first case, you might not be worried if your health endpoint is tightly coupled with some other parts of your system. If you’re checking the health of a big monolith, however, your concerns might be a bit different.
If it makes sense for you, consider providing a way for other components to implement their own health checks, so that the main health endpoint can invoke them without being coupled to their internal representation.
Be careful, however, and think about the ripple effect you want each component to have over your monolith. Should you report an unhealthy instance just because one component is in a failing state? Should you keep your report waiting if a component takes too long to produce its own?
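A small interface can get you that decoupling. This is a sketch: the Checker contract is an invented name, and deciding how a single failing component affects the overall report is deliberately left to the caller:

```go
package health

import "context"

// Checker is a tiny contract that each subcomponent can implement, so
// the main /health endpoint can invoke checks without being coupled to
// any component's internal representation.
type Checker interface {
	Name() string
	Check(ctx context.Context) error
}

// Run invokes every registered checker and collects results by name.
// Whether one failure marks the whole instance unhealthy (and how long
// to wait for slow checkers, via ctx) is policy that belongs elsewhere.
func Run(ctx context.Context, checkers []Checker) map[string]error {
	results := make(map[string]error, len(checkers))
	for _, c := range checkers {
		results[c.Name()] = c.Check(ctx)
	}
	return results
}
```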
Rule #9. Health checks should be efficient
If you have a high-volume service, an endpoint that gets queried a dozen times per minute might look like a drop in the ocean. However, keep in mind that the orchestration services calling your health endpoint have timeouts of their own, and they might decide you’re unhealthy if you take too long to answer. For that reason, it pays to follow some simple patterns (see the sketch after this list):
- Perform every individual check in parallel and join the results.
- Use timeouts to ensure the latency is within acceptable bounds.
- Even better, perform the checks periodically in the background and keep a centralized, up-to-date status. That way, the health endpoint can return immediately and you will not be limited by its performance.
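Putting those three patterns together, a sketch might look like this (the checks themselves are placeholders, and the 10-second refresh interval is arbitrary):

```go
package main

import (
	"encoding/json"
	"net/http"
	"sync"
	"time"
)

// cachedStatus is refreshed in the background so the /health handler
// never blocks on a slow dependency.
var (
	mu           sync.RWMutex
	cachedStatus = map[string]string{}
)

func runChecks() {
	checks := map[string]func() error{
		"database": func() error { return nil }, // placeholder checks
		"cache":    func() error { return nil },
	}
	var wg sync.WaitGroup
	for name, check := range checks {
		wg.Add(1)
		go func(name string, check func() error) { // each check in parallel
			defer wg.Done()
			status := "pass"
			if err := check(); err != nil {
				status = "fail"
			}
			mu.Lock()
			cachedStatus[name] = status
			mu.Unlock()
		}(name, check)
	}
	wg.Wait()
}

func main() {
	go func() { // refresh the cached report periodically
		for {
			runChecks()
			time.Sleep(10 * time.Second)
		}
	}()
	http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		mu.RLock()
		defer mu.RUnlock()
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(cachedStatus) // returns immediately
	})
	http.ListenAndServe(":8080", nil)
}
```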
Rule #10. Monitor the history of your health checks
Health checks make for very good time-series data.
Whenever you generate a health report, send metrics to your observability system. This will enable you to answer questions such as:
- How long does it take for my instances to become ready?
- For how long has my system remained healthy?
- How often do I have partially unhealthy states? What are the causes?
- How many requests is my health check receiving?
These answers, in turn, will allow you to prevent issues, identify the root cause of an outage, or optimise some key areas of your application.
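As an example, if you happen to use Prometheus, exporting each check’s result as a labelled gauge could look like this (the metric and label names are my own invention):

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// healthStatus exports each check's result as a time series
// (1 = pass, 0 = fail), so you can graph readiness times and
// spot partially unhealthy states over time.
var healthStatus = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "service_health_check_status",
		Help: "Result of each health check (1 = pass, 0 = fail).",
	},
	[]string{"check"},
)

func init() {
	prometheus.MustRegister(healthStatus)
}

// recordReport would be called every time a health report is generated.
func recordReport(results map[string]error) {
	for name, err := range results {
		value := 1.0
		if err != nil {
			value = 0.0
		}
		healthStatus.WithLabelValues(name).Set(value)
	}
}

func main() {
	recordReport(map[string]error{"database": nil, "cache": nil})
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":9090", nil)
}
```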
In summary
Orchestration platforms allow us to control the availability and reliability of our services, and reduce the risk of deploying a bad version of our code.
These deployment strategies rely heavily on our ability to report the health of our instances in an accurate and timely way.
Our services have rich business logic, and the health checks we write should reflect that wealth of use cases and follow some best practices to ensure we make the most of them.