Systems can fail in unexpected ways: degraded hardware suddenly showing higher latency, a partial network disruption affecting specific instances, a severed network cable, and so on. Health checking, in principle, handles such unexpected localised failures that hit particular containers or VMs in a cluster by removing them from the workload, reducing the total error rate in the cluster.
In production, there are several such localised failures:
The network layer of some specific server racks degrades, increasing the TCP retransmission rate and latencies
Software faults cause a few nodes in the cluster to get stuck
Threads get stuck in a deadlock
A critical dependency is failing, for example, unable to connect to a database or unable to connect to an upstream service.
The HTTP server process crashed, but the container or VM is still running
A critical background thread crashed; for example, an async kafka producer pushing data to kafka in a background thread might crash for some reason, or a local cache invalidation background thread might die
Working on stale data; for example, in a control plane & data plane architecture, a disruption in communication could leave the data plane working on very stale data
A critical process or a sidecar crashed
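Several of the failure modes above (a crashed background thread, a stuck worker) can be surfaced through the health check endpoint itself. Here is a minimal sketch in Python of the idea: register critical background threads in a registry and report unhealthy when any of them has died. The `register`/`health_status` names and the registry structure are illustrative, not from any specific framework.

```python
import threading

# Hypothetical registry of critical background threads (illustrative,
# not a real framework API).
critical_threads = {}

def register(name, thread):
    critical_threads[name] = thread

def health_status():
    """Return (healthy, details): unhealthy if any registered critical
    background thread has died, e.g. an async producer loop."""
    dead = [n for n, t in critical_threads.items() if not t.is_alive()]
    return (len(dead) == 0, {"dead_threads": dead})

# Example: a background worker that exits (simulating a crash).
def worker():
    pass  # e.g. push batches to a message broker, then die on an error

t = threading.Thread(target=worker, name="producer", daemon=True)
t.start()
register("producer", t)
t.join()                       # the worker has finished/crashed
print(health_status())         # unhealthy: the producer thread is dead
```

A real `/health` handler would simply serve this status with an HTTP 200/503.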
For all such failures, we could health check every pod in a cluster and remove unhealthy ones from the workload. While such a health checking mechanism solves some failure modes, it inevitably introduces many other failure modes into the system.
Cascading Outages due to Dependencies
While this solves localised failures, some failures are not completely local. For example, suppose a critical dependency starts failing and we start removing pods from the cluster. The load on the remaining pods keeps increasing, further aggravating the situation and leading to a cascading outage on downstream services as well.
Fail Open on All Pods being Unhealthy
To avoid such cascading outages, AWS ALB implements a fail-open technique: if all pods of a cluster start reporting unhealthy at the same time, the ALB starts ignoring the health checks and drives traffic to all the hosts just as in steady state.
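The fail-open behaviour described above can be sketched in a few lines (this is a toy model of the routing decision, not AWS code):

```python
def eligible_targets(targets, healthy):
    """Fail-open target selection: route only to healthy targets, but
    if *every* target is failing its health check, ignore health checks
    entirely and route to all targets, as in steady state."""
    healthy_targets = [t for t in targets if healthy.get(t, False)]
    if not healthy_targets:        # all unhealthy -> fail open
        return list(targets)
    return healthy_targets

# Normal case: only healthy targets receive traffic.
print(eligible_targets(["a", "b", "c"], {"a": True, "b": False, "c": True}))
# All unhealthy: fail open, everyone receives traffic.
print(eligible_targets(["a", "b", "c"], {"a": False, "b": False, "c": False}))
```

The rationale: if literally everything is "unhealthy", the health check is more likely wrong (or the failure is cluster-wide) than all capacity being truly dead, so removing everything only guarantees an outage.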
Load-based Cascading Outages
But an upstream dependency failure is not the only reason a health check can fail. In languages like python, ruby and PHP, a web server runs a fixed number of threads, pre-tuned to the resources (vCPU & memory) allocated to a single container. But in production, conditions change very fast: an upstream latency suddenly spikes, a notification campaign goes live and a lot of customers try to buy that fancy hat you are selling, or your competitor faces a blackout and their customers come to your site all at once.
In all such scenarios, every thread in your server is busy, excess requests start getting dropped, and some of those could be health checks. If the ALB then removes even some of the hosts, the load on the remaining hosts increases further, to more than the cluster can handle. Here the cascading effect comes from an issue within the cluster rather than an upstream service. Since this is a load-based issue, nodes removed by the load balancer immediately become healthy again as their load drops, but the loop continues: as soon as they are put back into the pool, they start failing health checks again. So failing open only when all pods are unhealthy is not a good solution for such load-based health check failures.
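The flap loop above is easy to see in a toy model: each host fails its health check when its share of traffic exceeds its capacity, and every removal only pushes the survivors further over capacity. The numbers (10 hosts, 100 rps capacity each, 1100 rps offered) are illustrative.

```python
def per_host_load(total_rps, hosts_in_pool):
    return total_rps / hosts_in_pool

def is_healthy(load, capacity):
    # A host fails health checks once its thread pool is saturated.
    return load <= capacity

total_rps, capacity = 1100.0, 100.0   # slightly over total capacity
in_pool = 10

for step in range(3):
    load = per_host_load(total_rps, in_pool)
    if not is_healthy(load, capacity):
        in_pool -= 1   # LB removes an "unhealthy" host ...
        # ... which raises the load on the rest, so the next health
        # check round fails again: the loop never converges.
    print(step, in_pool, round(per_host_load(total_rps, in_pool), 1))
```

Each iteration, per-host load climbs (110 → 122.2 → 137.5 rps) while the removed hosts, now idle, look perfectly healthy.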
Healthy Panic Threshold
Envoy solves this problem with a panic threshold: if more than 50% of the hosts are unhealthy (by default), it simply sends traffic to healthy and unhealthy hosts alike. The threshold might need some tuning according to your auto-scaling configuration.
This is an excellent solution for such load-based cascading outages. Say the panic threshold is at 30%: the cluster can then remove unhealthy members up to 30%, in which case the load on the rest of them only increases by about 42.9% ((100/70) × 30). This would solve both the localised failure scenario and cascading outages.
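The arithmetic generalises: removing x% of hosts multiplies the remaining hosts' load by 100/(100 − x). A quick check of the figures above:

```python
def load_increase_pct(removed_pct):
    """Percentage increase in load on the remaining hosts when
    `removed_pct` percent of hosts are taken out of the pool."""
    remaining = 100.0 - removed_pct
    return removed_pct / remaining * 100.0

print(round(load_increase_pct(30), 1))  # the 30% panic threshold case
print(round(load_increase_pct(50), 1))  # at 50%, load doubles
```

This is why the threshold needs tuning against autoscaling headroom: at a 50% threshold the surviving hosts must absorb a 100% load increase before panic mode kicks in.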
This does require some fine-tuning and maintenance, though: choosing a threshold according to the min count set in autoscaling, and ensuring the orchestrator (or a human) removes unhealthy nodes from the cluster so that future health check failures can still be handled gracefully.
Prioritising Health Check Requests
Where a healthy panic threshold is not implemented, as with ALB, another solution to such load-based outages is to prioritise health check requests above all other requests. Since health check requests are usually lightweight, this won't take resources away from the actual requests that contribute to the availability of your service.
Haproxy offers this via the set-priority-class(https://cbonte.github.io/haproxy-dconv/1.9/configuration.html#4.2-http-request set-priority-class) option for HTTP requests. Using this, we can give higher priority to health check requests and thus avoid health check failures during high-load scenarios.
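A sketch of what this could look like (based on the HAProxy docs linked above; the `/healthz` path and the `-10` value are illustrative, and the exact behaviour should be verified against your HAProxy version — priority classes affect how queued requests are dequeued, with lower class values served first):

```
frontend fe_main
    bind :80
    acl is_healthcheck path /healthz
    # Lower priority-class values are dequeued first; -10 is illustrative.
    http-request set-priority-class int(-10) if is_healthcheck
    default_backend be_app
```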
Such request-based prioritisation is not available in other proxies like nginx & envoy(https://github.com/envoyproxy/envoy/issues/25783), though.
Coming back to fixed-thread-pool languages like PHP, python and ruby: in some cases a thread gets stuck on a very slow upstream with no timeout set, or gets into a deadlock. Because the pool is fixed in size, such issues can quickly choke the entire pool and start failing actual requests and health check requests equally. In such cases prioritising health check requests won't help much, because all threads are stuck. Here we need a thread killer that kills off threads stuck for a long time (5-10 seconds), though even that may not help in all scenarios, since it depends on the inflow rate of requests that end up in the stuck state.
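The detection half of such a thread killer can be sketched as an in-flight request registry plus a watchdog that flags threads stuck past a limit. All names here are illustrative; note that Python threads cannot be forcibly killed, so in Python this remains detection (runtimes like Ruby expose Thread#kill for the actual kill step):

```python
import threading
import time

in_flight = {}                 # thread name -> request start timestamp
lock = threading.Lock()

def request_started():
    with lock:
        in_flight[threading.current_thread().name] = time.monotonic()

def request_finished():
    with lock:
        in_flight.pop(threading.current_thread().name, None)

def stuck_threads(limit_seconds=5.0, now=None):
    """Names of threads whose current request has been running longer
    than limit_seconds; a watchdog would act on (kill/restart) these."""
    now = time.monotonic() if now is None else now
    with lock:
        return [name for name, started in in_flight.items()
                if now - started > limit_seconds]

request_started()
# Simulate a request stuck for 12s by evaluating 12s into the future.
print(stuck_threads(limit_seconds=5.0, now=time.monotonic() + 12))
```

A watchdog thread would call `stuck_threads()` periodically and kill or recycle the offenders.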
Dedicated Threadpool for Healthcheck Requests
One solution is to separate the thread pools for health check and main requests. Most stacks like apache+php, nginx+php-fpm, nginx+passenger+ruby, and nginx+gunicorn+python offer a way to have dedicated thread pools and route requests to the corresponding pool based on the URL.
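The routing idea, stripped of any particular framework, is just URL-based dispatch onto two pools. A minimal Python sketch (pool sizes and the `/healthz` path are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

# A tiny pool reserved for health checks, a larger one for app traffic.
health_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="health")
app_pool = ThreadPoolExecutor(max_workers=32, thread_name_prefix="app")

def dispatch(path, handler):
    """Route health check requests to the dedicated pool so they are
    served even when the application pool is saturated."""
    pool = health_pool if path == "/healthz" else app_pool
    return pool.submit(handler)

future = dispatch("/healthz", lambda: "ok")
print(future.result())
```

In a real stack this dispatch happens in the front server (nginx `location` blocks, apache `ProxyPass`, etc.) rather than in application code.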
The disadvantage is that threads are costly in languages like python and ruby, where each thread consumes a lot of memory, so a separate thread pool just for health checks is expensive. And while a small pool is enough for health checks during load-based issues where the CPU is high, it may not be enough to guarantee that health check requests always pass.
Each of these solutions can help avoid load-based cascading outages, but each comes with a caveat, or the feature may simply be unavailable in your production setup.
Cost of Point to Point Health checking
Most envoy-based service meshes implement a point-to-point health-checking system, which means for a small service, a big downstream could generate more health-checking workload than the actual workload. This, in turn, could increase the total infrastructure cost unnecessarily.
To solve this, envoy provides a way to cache these health checks within envoy, but currently this is not supported for grpc. In any case, allocating enough vCPU to such a small service ensures that autoscaling doesn't keep scaling up containers just because of health-checking traffic.
Deep Health Checking
As mentioned at the top of the article, there are many failure modes for pods, and adding each of them to the health check is a very tempting solution. But while deep health checking solves localised failures, on cluster-wide failures it leads to painful cascading outages. Deep health checking must therefore be balanced with central safeguards that ensure that all, or a significant fraction of, a cluster's pods are never removed at once.
The ECS scheduler doesn't have fail-open or panic threshold logic, so if an essential container in all tasks becomes unhealthy, it will stop all the ECS tasks and, even worse, stop new tasks from becoming part of the cluster / ALB
Kubernetes removes all pods that fail readiness probes and doesn't have fail-open or panic threshold functionality
An ASG with a custom health check could mark all hosts as unhealthy at once and go into a cascading outage
Since most orchestrators and load balancers lack features like fail open and a healthy panic threshold, we should be careful about the depth of our health-checking logic and refrain from deep checks when the corresponding safeguard feature is unavailable.
Startup vs Runtime Health checks
Setting up a deep health check, like verifying connectivity to a database, is risky given the lack of panic threshold features in load balancing. Still, there are a decent number of failure modes that, if they occur during startup, have no chance of recovering at runtime.
For example, if a container fails to fetch its configuration or cannot connect to a critical dependency, it will almost certainly return errors for all requests. One could wait until all dependencies are satisfied, but for how long? Should there be a timeout on such waiting logic, and if there is, should the ECS task stop, or start accepting requests and failing them until the dependencies recover in the background?
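The waiting logic in question can be sketched as a startup gate that polls dependency checks until they all pass or a deadline expires; what to do on timeout (crash the task, or start degraded) is exactly the open question above. All names here are illustrative:

```python
import time

def wait_for_dependencies(checks, timeout_seconds=60.0, interval=2.0):
    """Poll each dependency check until all pass or the timeout expires.
    `checks` maps dependency names to zero-arg callables that return
    True once the dependency is reachable.  Returns the names of
    dependencies that never became ready."""
    deadline = time.monotonic() + timeout_seconds
    pending = dict(checks)
    while pending and time.monotonic() < deadline:
        for name, check in list(pending.items()):
            try:
                if check():
                    del pending[name]
            except Exception:
                pass          # treat connection errors as "not ready yet"
        if pending:
            time.sleep(interval)
    return list(pending)

# Tiny example: the database check passes, the redis check never does.
failed = wait_for_dependencies(
    {"database": lambda: True, "redis": lambda: False},
    timeout_seconds=0.2, interval=0.05)
print(failed)
```

On a non-empty return, the process can either exit (letting the orchestrator retry, which surfaces misconfiguration fast) or start anyway and keep retrying in the background.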
Such startup health checks are really useful for catching incorrect configuration at deploy time: an incorrect port number for the database, an incorrect host for a redis, a new MySQL replica whose security group is not whitelisted, etc.
While deep runtime health checks are very risky, the risk is somewhat lower for startup health checks, though not completely non-existent. A very deep startup check, like checking an upstream microservice dependency, could cause a cascading outage where the scale-up of a microservice stalls because of an upstream service outage. A slightly deep startup check, like being connected to a database, is still a dependency contained within that microservice, while helping avoid some issues.
Observability Costs of Point-to-Point Health checking
Because the volume of point-to-point health checking is very high, building observability around it also costs a lot. It is therefore tempting to drop observability for health checking, but that is dangerous: it makes health check failures much harder to debug.
For example, we can't properly debug health check timeouts if we can't see the latency of health checks during steady state. And if health checking is removed from APM, it is hard to see the latency distribution across each of the components involved in health checking.
In such cases, sampling is a very good way to record steady-state behaviour that will be useful during health check failure scenarios.
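A simple form of this is probabilistic sampling in the health check handler itself, so roughly a fixed fraction of checks still produce traces/metrics. The 1% rate is illustrative:

```python
import random

def should_record(sample_rate=0.01, rng=random):
    """Record observability data (trace/metric) for roughly
    `sample_rate` of health check requests, keeping a picture of
    steady-state latency without paying for every data point."""
    return rng.random() < sample_rate

rng = random.Random(42)         # seeded for a reproducible example
recorded = sum(should_record(0.01, rng) for _ in range(100_000))
print(recorded)                 # close to 1,000 (1% of 100k)
```

If the tracing system supports it, head-based sampling at the proxy (rather than in the app) avoids even the cost of generating the discarded data.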