Recently at Cratejoy, we found ourselves in a situation where we had an “impossible” bug, and a limited time to identify and correct the issue. For a bit of background, one of our merchants was about to begin a relatively large marketing event, and we had developed a new caching solution designed to address the traffic profile we were expecting.
Unfortunately, a few hours before the event, we started seeing reports of pages returning blank responses. Certain requests would sometimes be served a blank 200 response from the cache, and the problem appeared to be limited to this particular store. The problem seemed simple enough: if we were getting a blank body from the cache, then we must have been putting a blank response into the cache at some point. We were, however, sure that there was no way this endpoint should be returning a blank 200.
We needed to confirm this though, so we manually traced through the request code path. After a bit, we were certain that there was no way we were returning a blank 200 from this code path.
Based on our assumptions, the behaviour we were seeing was impossible. This means that either the universe was screwing with us, or one of our assumptions was wrong.
Testing our assumptions
This is an important point to always remember:
When you end up in a situation where something is shown to be impossible, remind yourself that it is clearly happening, and thus not impossible. One of your assumptions must be wrong.
These assumptions can be with either problem (ie, is the bad thing you are seeing actually the bad thing you think it is?) or with the process (ie, is the thing you think is causing the problem actually what is causing the problem).
At this point, we realized we had to take a step back, and verify all of the assumptions that we had made. I’ve found that the best way to start is to make a list of all of your assumptions, as well as how you can verify that that assumption is correct.
We came up with the following:
Problem (the bad thing)
- We are getting a blank 200 response when we shouldn’t be: verify by hitting the URL from a fresh session/curl
Process (what discrete things should be happening)
- Request hits nginx: verify this by checking the nginx logs
- Request hits uWSGI: verify this by checking the uWSGI logs
- uWSGI checks our memcached: verify this by checking the uWSGI logs and watching the memcached stats. We expect a miss.
- Request hits backend system: verify this by checking the Flask logs
- Flask returns a blank 200: verify this by checking the Flask logs
- uWSGI checks our cache settings, and caches the 200 response: verify this by checking the uWSGI logs
- uWSGI returns the response to nginx: verify this by checking the nginx logs
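The process above can be sketched as a few lines of Python. To be clear, the names here (handle_request, render_page, and so on) are illustrative stand-ins, not our actual uWSGI or Flask code, and a plain dict stands in for memcached:

```python
# Illustrative sketch of the expected caching flow (hypothetical names,
# not our actual uWSGI/Flask code).

cache = {}  # stands in for memcached


def render_page(path):
    # The Flask backend; it should never return a blank 200 for a valid page.
    return "<html>page for %s</html>" % path


def handle_request(path):
    key = "page:" + path       # cache key derived from the request
    if key in cache:           # uWSGI checks memcached
        return cache[key]      # cache hit: serve the cached body
    body = render_page(path)   # cache miss: request hits the backend
    cache[key] = body          # cache the 200 response
    return body                # return the response upstream


# First request misses the cache and hits the backend; the second is a hit.
first = handle_request("/products")
second = handle_request("/products")
```

Each step in that flow maps to one of the assumptions in the list, which is what made it possible to verify them one at a time.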
Since our problem appears to be impossible, one of the above assumptions must be wrong.
In this case, the visible problem was fairly easy to confirm: are we actually getting a blank 200 response? Just in case it was a browser problem, we hit the URL using curl. Yup, definitely a blank 200, so we quickly moved on to verifying the process we were debugging.
We started at the top, and were able to quickly verify that nginx is hit, that uWSGI is hit, and that the uWSGI logs say we are checking memcached. We finally got a breakthrough when we saw that although uWSGI thinks it checked memcached, the memcached stats showed 0 get_hits and 0 get_misses. We were expecting a miss.
We had invalidated one of our core assumptions. A quick look at the next step (checking our backend logs) confirmed that the request never reached our backend system. By validating each of our core assumptions, we were able to identify that the problem lay in our uWSGI cache usage (or in the uWSGI cache module itself).
Now, this is the part that I feel bad writing about, since it's not totally reproducible: we had an epiphany that this must be caused by an error with the memcached request. Some quick googling showed that a common memcached error is a key that is too long. Memcached keys are limited to 250 bytes, and the particular failing keys were ~270 bytes. Using telnet, we quickly verified that a 270-byte key will cause a memcached error.
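The limit is easy to check for yourself without a server. This sketch mirrors the check we did over telnet, but purely in Python; the "page:" key prefix is made up for illustration:

```python
MEMCACHED_MAX_KEY = 250  # memcached's hard limit on key length, in bytes


def is_valid_memcached_key(key):
    """Return True if memcached will accept this key.

    Keys are limited to 250 bytes and may not contain whitespace or
    control characters; anything else gets a protocol error.
    """
    raw = key.encode("utf-8")
    if len(raw) > MEMCACHED_MAX_KEY:
        return False
    # Reject spaces and control characters as well.
    return not any(b <= 32 or b == 127 for b in raw)


# A ~270-byte key, like our failing ones, fails the check.
long_key = "page:" + "x" * 265   # 270 bytes total
short_key = "page:" + "x" * 200
print(is_valid_memcached_key(long_key))   # False
print(is_valid_memcached_key(short_key))  # True
```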
This let us revise our assumptions and add a new one: if we use a key that causes a memcached error, uWSGI will serve a blank 200 response.
In order to validate this, we crafted a long URL that doesn’t exist, and should always be a 404. If we get a 404, our hypothesis is wrong. If we get a blank 200, that is a very good sign we’ve identified the problem.
We hit the URL, and received a blank 200 response, confirming that we could reproduce the problem and giving us a way to verify when we had an appropriate solution. Using this as a test, we were able to bypass memcached when the key would be invalid. It wasn't the best solution, but it was viable in the limited time we had, and we were able to revisit the core problem at a later time.
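Our stopgap fix amounted to a guard like the following. The function names are illustrative (the real change lived in our uWSGI cache configuration, not application code), but the idea is the same: if the key would trigger a memcached error, skip the cache entirely rather than risk a blank response:

```python
MEMCACHED_MAX_KEY = 250  # memcached's key length limit, in bytes


def cached_fetch(cache, key, compute):
    """Fetch through the cache, but bypass it entirely when the key
    would cause a memcached error (e.g. it is longer than 250 bytes)."""
    if len(key.encode("utf-8")) > MEMCACHED_MAX_KEY:
        return compute()       # bypass: never touch memcached
    if key in cache:
        return cache[key]      # normal cache hit
    value = compute()
    cache[key] = value         # normal cache fill
    return value


cache = {}  # stands in for memcached
long_key = "page:" + "x" * 265
result = cached_fetch(cache, long_key, lambda: "rendered page")
# The long key was bypassed, so nothing was stored in the cache,
# and the caller still gets a real rendered page instead of a blank 200.
```

Long-keyed requests lose caching, but they are served correctly, which was the right trade-off under the time pressure we had.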
All in all, we were able to find the problem, develop and confirm a solution, and deploy it live in approximately an hour. Without taking the time to take a breath, list our assumptions, and individually verify each one of them, I highly doubt we would have stumbled across the core problem in the time we had. Only by reminding ourselves that our “impossible” bug clearly existed, and approaching the problem in a calm and logical manner, were we able to solve the issue in the time we had.