Yes, Kubernetes "Limits" Are Important

September 2, 2020
kubernetes devops sre

Earlier today an engineer at Buffer put out a post about removing resource limits from a subset of their Kubernetes deployments in order to “make their services faster”. In summary, they were hitting a kernel bug in which CPU throttling was applied to containers that had not yet reached their CPU limits. They determined that the bug had no impact when no limits were set. The bug has since been fixed and backported to many current Linux kernel versions; however, they are still relying on the absence of limits to provide greater performance for some of their user-facing features.

Note that the post in question has been updated, and now contains some more reasonable recommendations.

Why are Limits important?

Kubernetes resource limits specify the maximum amount of a given resource that a pod is permitted to use. When a pod hits the limit for a given resource, it is either throttled (for compressible resources like CPU) or killed (for incompressible resources, such as memory). Limits define the maximum bounds our application will operate within, which lets us set firm expectations about its performance. For the rest of this post, we will be discussing CPU limits.
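To make that concrete, here is a minimal, hypothetical container spec showing how requests and limits are declared; the names and values are placeholders, not a recommendation:

    # Hypothetical pod spec -- names and values are illustrative only.
    apiVersion: v1
    kind: Pod
    metadata:
      name: example-app
    spec:
      containers:
        - name: app
          image: example-app:1.0.0
          resources:
            requests:
              cpu: "250m"       # what the scheduler reserves for the container
              memory: "256Mi"
            limits:
              cpu: "500m"       # CPU beyond this is throttled (compressible)
              memory: "512Mi"   # memory beyond this gets the container OOM-killed (incompressible)

Requests drive scheduling decisions; limits cap what the container can actually consume at runtime.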

One of the big selling points of containerizing applications is that we control the environment we deploy into. Without setting limits, we are explicitly leaving a hole in that methodology. By “allowing” our apps to use as much CPU as the system will give them, we let our performance vary unpredictably with the environment. Yes, this can yield better performance in some cases, but it increases the variance we will see at the long tail. By setting a maximum resource utilization for our services, we know explicitly what performance to expect under those conditions.

Should we ever hit our limits?

No. If your application is hitting its limits, that needs to be looked at. If you have applications regularly operating above their requests, the requests should be increased. You should be using a Horizontal Pod Autoscaler to increase your pod count when resource usage starts getting too high.
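As a sketch of that approach, the following HorizontalPodAutoscaler scales a hypothetical Deployment on average CPU utilization relative to the pods' requests; the deployment name, replica bounds, and threshold are assumptions:

    # Hypothetical HPA -- target name and thresholds are placeholders.
    apiVersion: autoscaling/v2beta2
    kind: HorizontalPodAutoscaler
    metadata:
      name: example-app
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: example-app
      minReplicas: 2
      maxReplicas: 10
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 70   # add pods well before usage approaches the requests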

Now that the unnecessary-throttling bug has been solved, removing limits is a quick-fix approach that avoids solving the underlying resource utilization issues. Without understanding how your cluster and services will behave in the worst case, removing limits only sets your cluster and services up for failure in the event of an unexpected load spike or node overscheduling.

Now, if you really know what you are doing…

This might sound like I am going against the above points, but if you have done your homework and understand the explicit bounds of your application's performance (worst-case behaviour, request == limit), it is typically safe to remove limits as long as (see the sketch after this list):

  • Requests are set for the containers
  • Typical usage is below those requests
  • You are not depending on exceeding requests to meet your performance goals
  • You have aggressive monitoring for containers exceeding their requests
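For illustration, a requests-only CPU configuration might look like the sketch below; names and values are placeholders. Since this post is only concerned with CPU limits, the memory limit is left in place:

    # Hypothetical container spec with the CPU limit removed -- names/values are placeholders.
    containers:
      - name: app
        image: example-app:1.0.0
        resources:
          requests:
            cpu: "500m"        # scheduling and HPA utilization are still based on this
            memory: "512Mi"
          limits:
            memory: "512Mi"    # memory limit kept; only the CPU limit is omitted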

But remember, you are removing a safeguard which helps you understand and bound the performance characteristics of your service. Be careful, and use this power wisely.

As a final note, there are some fantastic thoughts on the topic in this Reddit comment.