If you’ve been reading anything I’ve written lately, you’ll know that we at Cratejoy recently migrated our infrastructure onto a new Kubernetes deployment. Throughout this process we hit a couple of rather embarrassing snafus, and this was just one of them.
Long story short: we accidentally DoS’d some of our own services, and attempting to solve the problem by scaling horizontally only made things worse.
At the time, we were running 4 HTTP services within our cluster, each with one Application Load Balancer (ALB) and two AWS Target Groups (one for HTTP, and one for HTTPS internally). Each of these load balancers has ingress in 4 Availability Zones, with healthchecks at 5-second intervals. We use NodePorts for our services, instead of Kubernetes Ingress.
Our services are Flask applications running behind uWSGI, with a limited number of processes per pod (at the time, 5).
Since our traffic and workload are rather variable, we run the Kubernetes Cluster Autoscaler in our cluster to dynamically allocate nodes when necessary and shut them down when possible. The Autoscaler works by modifying the size of an underlying AWS Auto Scaling Group (ASG), which we had also configured to automatically add new nodes to each of the ALB Target Groups.
These individual configuration choices began to compound and cause issues when we started migrating our 5th (and largest) service into Kubernetes. Since this service was so much larger than the others, adding it caused the cluster to (approximately) triple in size. This meant 4 times as many nodes, with 4 times as many healthchecks, even though the underlying services had not scaled.
At this point I started doing some basic math and saw:
4 AZs × 2 Target Groups × 30 nodes ÷ 5-second interval = 48 healthchecks per second, per service.
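The arithmetic above can be written out as a quick sanity check. The numbers are the ones from this post; nothing here is measured.

```python
def healthchecks_per_second(azs, target_groups, nodes, interval_s):
    """Each AZ's load balancer checks every Target Group member once per
    interval, so the total check rate scales with cluster node count."""
    return azs * target_groups * nodes / interval_s

# 4 AZs, 2 Target Groups, 30 nodes, 5-second healthcheck interval:
rate = healthchecks_per_second(azs=4, target_groups=2, nodes=30, interval_s=5)
print(rate)  # 48.0 healthchecks per second, per service
```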
Note that this only counts the ALB healthchecks; Kubernetes’ internal healthchecks also increased.
Because these healthchecks scaled with the overall cluster size, and not with the service size, the smaller services saw the same absolute increase in healthcheck volume as our larger services, without anywhere near the resources to handle it.
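To make the asymmetry concrete, here is an illustrative comparison of the healthcheck burden per uWSGI worker for a small service versus a large one on the same cluster. The pod counts are hypothetical, chosen only to show the shape of the problem.

```python
def checks_per_worker(azs, tgs, nodes, interval_s, pods, procs_per_pod):
    """Healthchecks per second divided across a service's worker processes.
    The numerator depends only on cluster-wide node count, so a small
    service pays the same absolute cost as a large one."""
    total_rate = azs * tgs * nodes / interval_s
    return total_rate / (pods * procs_per_pod)

# Same 30-node cluster; a hypothetical 2-pod service vs a 20-pod one:
print(checks_per_worker(4, 2, 30, 5, pods=2, procs_per_pod=5))   # 4.8 checks/s per worker
print(checks_per_worker(4, 2, 30, 5, pods=20, procs_per_pod=5))  # 0.48 checks/s per worker
```

The small service’s workers spend ten times more of their capacity answering healthchecks, despite serving a fraction of the real traffic.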
Worse, many of these healthcheck requests would arrive at the same time, and this small service could only serve 10 concurrent requests. We’d see a ~80ms delay on legitimate requests, which could be long-running, still occupying worker processes when the next round of healthchecks came in, leading to a growing queue of requests to serve. And increasing the capacity of any one service would increase the healthcheck load on all of the others.
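The runaway queue can be captured with a toy utilization model: if the combined load from healthchecks and legitimate requests exceeds the worker pool’s capacity, the queue grows without bound. All the costs and rates below are illustrative assumptions, not measurements from our systems.

```python
def utilization(workers, check_rate, check_cost_s, legit_rate, legit_cost_s):
    """Offered load as a fraction of capacity (an M/M/c-style rho).
    A value above 1.0 means requests arrive faster than the worker
    pool can drain them, so the queue grows indefinitely."""
    offered = check_rate * check_cost_s + legit_rate * legit_cost_s
    return offered / workers

# 10 workers, 48 healthchecks/s at an assumed 80ms each, plus an assumed
# 7 legitimate requests/s averaging 1s each:
rho = utilization(workers=10, check_rate=48, check_cost_s=0.08,
                  legit_rate=7, legit_cost_s=1.0)
print(rho)  # > 1.0: the request queue only ever gets longer
```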
To make matters worse, we first started seeing the effects of this on a Friday afternoon. Luckily, the scale of the issue was high enough that we could clearly see the increased healthcheck volume coming from the load balancer IPs.
It was now clear that we would have to limit the number of nodes in our cluster that sat directly behind the load balancers. It would be fine for a subset of our nodes to proxy traffic into the cluster, with the Kubernetes Service forwarding each request to a node running the correct service.
As a stop gap, we:
- Removed all but 4 cluster nodes (one in each AZ) from the Target Groups
- Set up a Datadog monitor and alert to inform us if any of those nodes were removed from service (at which point someone would manually add another node from the removed AZ)
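The manual step in that second bullet (re-adding a node to the Target Groups) can be sketched with boto3. This is a hedged illustration, not our actual tooling: the Target Group ARNs, instance ID, and NodePort below are placeholders, and the ELBv2 client is passed in so the function can be exercised without AWS credentials.

```python
def reregister_node(elbv2, instance_id, target_group_arns, node_port):
    """Register an instance into each Target Group on the service's
    NodePort. `elbv2` is a boto3 ELBv2 client, e.g. boto3.client("elbv2")."""
    for arn in target_group_arns:
        elbv2.register_targets(
            TargetGroupArn=arn,
            Targets=[{"Id": instance_id, "Port": node_port}],
        )

# Hypothetical usage, with placeholder ARNs and instance ID:
# import boto3
# reregister_node(boto3.client("elbv2"), "i-0abc123",
#                 ["arn:...:targetgroup/http", "arn:...:targetgroup/https"],
#                 node_port=30080)
```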
Now that we had contained the problem and our services were starting to operate normally again, we decided to split our cluster into two ASGs: one limited to exactly 4 nodes, which would automatically be added to the Target Groups, and one controlled by the Cluster Autoscaler, which would not be. This gave us the scaling offered by the Cluster Autoscaler, as well as the resiliency offered by the AWS Auto Scaling Group, without overloading our smaller services with unnecessary healthchecks.
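The key property of the ‘ingress’ ASG is that it holds a constant size so the Cluster Autoscaler never resizes it, which in an ASG means min = max = desired. A minimal sketch of pinning a group this way, again with the client injected and the group name as a placeholder:

```python
def pin_ingress_group(asg_client, group_name, size=4):
    """Fix an Auto Scaling Group at a constant size by setting
    min, max, and desired capacity to the same value."""
    asg_client.update_auto_scaling_group(
        AutoScalingGroupName=group_name,
        MinSize=size,
        MaxSize=size,
        DesiredCapacity=size,
    )

# Hypothetical usage:
# import boto3
# pin_ingress_group(boto3.client("autoscaling"), "k8s-ingress-nodes")
```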
While there are some tradeoffs to this approach (sometimes one of the ‘ingress’ nodes is very under-utilized, yet the Cluster Autoscaler cannot remove it), this setup has served our purposes reliably so far.