Debugging Mystery 502 Errors in Kubernetes

July 4, 2022
devops Kubernetes

For a while, under any real load, one of my services was returning 502 errors at an ~0.5% rate.

This is a service that runs in Kubernetes behind an Nginx Ingress Controller, with a Flask (Python) application served by uWSGI.

I was expecting these errors to be fairly easy to identify because I generate Request IDs in nginx (added as an X-Request-Id HTTP header to both the downstream request and upstream response), and add this ID to the request-context logger in the application. It should have been trivial to find the request ID of the error requests, and find the corresponding logs from the application.
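The wiring described above looks roughly like the following in plain nginx terms (a sketch, not my exact configuration; with ingress-nginx the equivalent lives in the controller's generated config and configuration snippets, and the upstream name here is made up):

```nginx
location / {
    # $request_id is a built-in nginx variable: a random 32-character hex string.
    # Forward it to the application so it can tag its request-context logs.
    proxy_set_header X-Request-Id $request_id;
    # Echo it back to the client on the response.
    add_header X-Request-Id $request_id;
    proxy_pass http://upstream_app;
}
```

With this in place, a request ID seen by a client should be findable in both the nginx request log and the application's logs.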

Unfortunately, when I started digging into this, I noticed that:

  1. These errors were generic nginx error pages, rather than a downstream error response
  2. These requests were returning significantly faster than successful requests
  3. These requests did not have my X-Request-Id header populated
  4. uWSGI was not logging any error responses

All of these issues suggested that the requests were not even making it to my nginx ingress, but from the nginx request logs it was easy to confirm that nginx was at least receiving the requests, and the error rate was also represented in my nginx monitoring.

This raised the question of why the X-Request-Id header was missing, given that it is added by an nginx directive. After some research (way too much time), I learned that the nginx add_header directive does not apply to 5xx status codes unless the always parameter is explicitly specified. After adding this to my ingress configuration, the 502 responses included the request ID header.
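In plain nginx terms, the fix is one word (in my case the equivalent change went into the ingress configuration):

```nginx
# Without "always", add_header only applies to 200, 201, 204, 206, 301, 302,
# 303, 304, 307, and 308 responses, so a 502 generated by nginx itself
# silently drops the header.
add_header X-Request-Id $request_id always;
```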

With this new information I was able to confirm that there were no application logs for the failed requests. This was useful confirmation, as it had previously been possible that the application was returning a successful response and nginx was failing to proxy it. At this point I had confirmed that the failure was somewhere between the nginx ingress and the application running in the pod, leaving "the network" and uWSGI as the potential failure points.

Having X-Request-Id headers was also helpful at this point, as I could take the request ID from the client requests to find the specific request log (and surrounding context) from nginx. This gave me the following:

2022/06/25 23:21:37 [error] 8432#8432: *1519542 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: <client-ip>, server: <host>, request: "POST /events HTTP/1.1", upstream: "", host: "<host>"
<host> - [<client-ip>] - - [25/Jun/2022:23:21:37 +0000] "POST /events HTTP/1.1" 502 150 "-" "python-requests/2.25.1" 533 0.003 [<svc>-api-http] 0 0.000 502 [6dee349605a2f6436cecf5737c604a5c] - https

From this it’s pretty clear that the connection nginx attempted to use was no longer alive. In some situations uWSGI was closing the connection, and nginx was unaware of this until it tried to reuse it. This could be solved in two ways:

  1. Retry these requests against the backend from nginx. These were POST requests, and even though this particular endpoint happens to be idempotent, idempotency is not guaranteed across the board.
  2. Configure uWSGI to not close connections. This is obviously not a 100% guaranteed solution, as there are still situations outside of uWSGI’s control where a connection may be terminated. In those situations appropriate retries (or accepting failures within our SLO) would be required.
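For reference, option 1 would look something like this in plain nginx (a sketch; note that nginx deliberately does not retry non-idempotent methods such as POST once the request has been sent, which is exactly the concern above):

```nginx
# Retry against the next upstream on connection errors and timeouts.
proxy_next_upstream error timeout;
# Retrying POSTs that were already sent requires opting in with the
# non_idempotent flag, which is only safe if the endpoints really are
# idempotent:
# proxy_next_upstream error timeout non_idempotent;
```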

In my situation I was able to determine that enabling HTTP keepalive in uWSGI (with the --http-keepalive flag) would be satisfactory for my current purposes.
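A sketch of the resulting invocation (the other flags here are illustrative, not mine; --http-keepalive applies when uWSGI’s HTTP router is terminating the connections, as it was here):

```shell
# Keep HTTP connections from the ingress alive between requests, so the
# connections nginx holds open aren't closed after every response.
uwsgi --http :8000 \
      --http-keepalive \
      --wsgi-file app.py \
      --callable app
```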

This immediately eliminated the issue.

Lessons Learned

  1. The nginx add_header directive does not apply to failed responses by default. This seems like odd behaviour, as failed requests are the ones most likely to be investigated.
  2. uWSGI (and any other application server) should typically keep connections with an upstream load balancer alive.
  3. Monitoring nginx metrics is important, but it will not catch everything. Monitoring for new lines in the nginx error log would have been very valuable. If you’re already writing nginx request logs, you may need to filter the error logs separately into your logging system.
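As a hypothetical example of the kind of filtering I mean, here is a one-liner that pulls only 5xx responses out of an access log and counts them per status code (the sample lines and paths are made up; it assumes the default combined log format, where the status code is the ninth whitespace-separated field):

```shell
# Two sample access-log lines piped through an awk filter that keeps only
# 5xx responses and counts them per status code.
printf '%s\n' \
  '1.2.3.4 - - [25/Jun/2022:23:21:37 +0000] "POST /events HTTP/1.1" 502 150 "-" "python-requests/2.25.1"' \
  '1.2.3.4 - - [25/Jun/2022:23:21:38 +0000] "GET /health HTTP/1.1" 200 2 "-" "kube-probe/1.24"' |
  awk '$9 ~ /^5/ { count[$9]++ } END { for (s in count) print s, count[s] }'
```

The same pattern works as the filter step in front of a log shipper, so only error traffic is forwarded.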