The Magic Number That Murdered My Queue

Uriel Wasyng

“@Monday-Dev-On-Call” popped up on Slack. I looked at the notification and thought, “That can’t be good.” Then I opened it and saw this: “Messages haven’t been arriving at our SQS for a few days, and we didn’t get any alerts or bad logs about it!”

That had to be a mistake! I quickly opened our AWS prod env, expecting to see something obvious, but – NO. The messages were arriving at the SNS, the subscription still existed, but the SQS wasn’t getting any messages. No alerts, no logs, everything looked just fine, except it wasn’t. How was that possible?!

If you are feeling nervous while reading these lines, this blog is for you. I’ll walk you through the story behind our bug, the tricky investigation, and what we learned from it.

So what the hell happened?

To understand what happened, let me provide some background. I’m a software developer on monday dev, a product built on top of the building blocks of the monday platform. One of our product’s features is sprint management, and to respond to events happening on the platform, we built the following flow.

When a user creates a new sprint, for example, the monday service publishes a message to an SNS topic. We have a subscription on that topic, and it passes messages on to our queue. We use message attributes (an AWS feature that lets you attach structured metadata to a message) together with a filter policy, so only the messages relevant to our SQS get through. Additionally, our subscription is configured with raw message delivery enabled, which means we receive the message body directly, without having to extract it from the SNS envelope that wraps it in extra metadata.
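To make this concrete, here is a minimal AWS CDK sketch of that wiring. It’s an illustration under assumptions, not our actual infrastructure code – the construct, topic, queue, and attribute names (eventType, sprint_created) are all made up:

```typescript
import { Stack, StackProps } from 'aws-cdk-lib';
import * as sns from 'aws-cdk-lib/aws-sns';
import * as snsSubscriptions from 'aws-cdk-lib/aws-sns-subscriptions';
import * as sqs from 'aws-cdk-lib/aws-sqs';
import { Construct } from 'constructs';

export class SprintEventsStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // Illustrative resources -- in reality the topic is owned by the platform service.
    const platformEventsTopic = new sns.Topic(this, 'PlatformEventsTopic');
    const sprintEventsQueue = new sqs.Queue(this, 'SprintEventsQueue');

    platformEventsTopic.addSubscription(
      new snsSubscriptions.SqsSubscription(sprintEventsQueue, {
        // Deliver the bare message body instead of the SNS JSON envelope.
        rawMessageDelivery: true,
        // Match on message attributes so only the events we care about reach the queue.
        filterPolicy: {
          eventType: sns.SubscriptionFilter.stringFilter({
            allowlist: ['sprint_created', 'sprint_updated'],
          }),
        },
      }),
    );
  }
}
```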

So far, so good, right? Not exactly, but we’ll get to it.

The Investigation

Let’s go back to the Slack notification. I joined my teammates, who were already investigating the problem. Trust me, they had checked exactly what you’re thinking of – the logs showed that the reporter had sent the messages to the SNS successfully, and our subscription for the SQS was still active. But as I mentioned earlier, the SQS was just empty, standing there, laughing at us.

So what can we do next?

We checked when the SQS had stopped receiving messages. The timing didn’t line up with any deployment of our team’s microservices that could have been related. So we also checked the deployment timing of the platform’s largest service, since other teams share this flow (sending messages to this SNS) with us.

Bingo! The SQS had stopped receiving messages right after that service was deployed. We reviewed the number of pull requests included in the deployment – 45 PRs. Just great (did I already mention that this is our biggest service?).

We went over all of them until we found something suspicious: a slight change to one of the tests.

Someone had changed the allowed number of message attributes asserted in that test from ‘10’ to ‘11’. That was weird – what was this test for? And why ‘10’ of all numbers?

We checked the AWS documentation, and there it was, right in front of our eyes, in both the message attributes and the raw message delivery docs: a message can carry at most 10 message attributes, and with raw message delivery enabled, anything over that limit never reaches the queue.

In the docs’ words, such messages are “discarded as client-side errors.” No bad logs, no alerts, as if nothing ever happened.

After we saw this, we went back to the PR with the test change and found that it also added a new message attribute to be sent with the message.

Want to guess how many message attributes we had with this new one? That’s right, 11. That was the reason we didn’t get any of our messages. We found you, you tricky one.

Once we understood the root cause, the solution was straightforward. We quickly adapted our microservice to work with raw message delivery disabled instead of enabled, updated the subscription, and the messages started arriving at the SQS again, just like before.
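With raw message delivery disabled, the SQS message body is no longer the bare payload but the standard SNS JSON envelope, so the consumer has to unwrap it. Here is a minimal sketch of what that looks like, assuming a Lambda consumer and a JSON payload – the handler and everything beyond the standard SNS envelope fields is illustrative:

```typescript
import type { SQSEvent } from 'aws-lambda';

export const handler = async (event: SQSEvent): Promise<void> => {
  for (const record of event.Records) {
    // With raw message delivery disabled, the SQS body is the SNS envelope,
    // which carries the original payload in its "Message" field.
    const envelope = JSON.parse(record.body);
    const payload = JSON.parse(envelope.Message);

    // The message attributes travel inside the envelope as well,
    // so the 10-attribute limit of raw delivery no longer blocks delivery.
    const eventType = envelope.MessageAttributes?.eventType?.Value;

    // ...handle the sprint event (illustrative)...
    console.log(`Received ${eventType ?? 'unknown'} event`, payload);
  }
};
```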

Lessons learned

Now that we’ve covered the “action,” let me share my thoughts about this incident. I hope they help you avoid these hard-to-detect scenarios and prevent similar issues in the future.

Avoid magic numbers

The magic number here was “10”. I believe that if this magic number had been replaced with a decent constant name (for example, RAW_DELIVERY_MESSAGE_ATTRIBUTE_LIMIT), this whole incident wouldn’t have occurred. So my take on this one: always think of the next person who will read your lines of code a few years from now. Generally, be good to them, and specifically, use constants instead of magic numbers.
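For illustration, something as small as this would have made both the limit and the reason for it impossible to miss (the constant name follows the suggestion above; the file path is made up):

```typescript
// src/limits.ts (illustrative location)
// With raw message delivery enabled, an SNS -> SQS subscription accepts at most
// 10 message attributes per message; anything above that is silently discarded.
export const RAW_DELIVERY_MESSAGE_ATTRIBUTE_LIMIT = 10;
```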

Tests for cloud feature limits

Every great cloud feature might come with hidden limitations. In this case, there were two cloud features involved (raw message delivery and message attributes). Each of them worked great alone, and even together at first; the limitation only surfaced after we had been working with them for a while. I believe the right way to deal with this kind of limitation is to write tests. Read the fine print, and if you find any limits, even ones that seem far away today (yes, even the generous 256KB message size limit of SQS, for example), cover them with tests. We did an almost great job here, but just almost.
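Here is a minimal jest-style sketch of such a guard test, reusing the constant from above – buildEventMessageAttributes and the import paths are made-up names, standing in for whatever builds your published attributes:

```typescript
import { RAW_DELIVERY_MESSAGE_ATTRIBUTE_LIMIT } from '../src/limits';
import { buildEventMessageAttributes } from '../src/eventMessages';

describe('SNS to SQS raw delivery limits', () => {
  it('keeps message attributes within the raw delivery limit', () => {
    // Build the full set of attributes we attach to a published event.
    const attributes = buildEventMessageAttributes();

    // If someone adds an attribute that pushes us over the limit,
    // this fails loudly in CI instead of messages silently disappearing in prod.
    expect(Object.keys(attributes).length).toBeLessThanOrEqual(
      RAW_DELIVERY_MESSAGE_ATTRIBUTE_LIMIT,
    );
  });
});
```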

Monitor for missing activity

Create alerts for minimal volume on the queue, exactly for cases like this bug. With this kind of monitoring, we could have known about the problem much earlier. Yes, it might create noise with false-positive alerts, but with the right fine-tuning, it will do the job without causing panic in the middle of the night.
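As a rough sketch of what that could look like with AWS CDK and CloudWatch – the threshold, period, and function name are made-up values you would tune to your real traffic:

```typescript
import { Duration } from 'aws-cdk-lib';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';
import * as sqs from 'aws-cdk-lib/aws-sqs';
import { Construct } from 'constructs';

// Alarm when the queue goes quiet: NumberOfMessagesSent counts messages added to the queue.
export function addMissingActivityAlarm(scope: Construct, queue: sqs.Queue): cloudwatch.Alarm {
  return new cloudwatch.Alarm(scope, 'QueueWentQuietAlarm', {
    metric: queue.metricNumberOfMessagesSent({
      period: Duration.hours(1),
      statistic: 'sum',
    }),
    // Illustrative tuning: fewer than 1 message per hour, three hours in a row.
    threshold: 1,
    evaluationPeriods: 3,
    comparisonOperator: cloudwatch.ComparisonOperator.LESS_THAN_THRESHOLD,
    // No data points at all is exactly the bad case, so treat missing data as breaching.
    treatMissingData: cloudwatch.TreatMissingData.BREACHING,
  });
}
```

Wire the alarm to your paging tool, and tune the numbers until false positives stop waking people up.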

Track shared ownership of resources

We must be aware of which resources belong exclusively to our team and aren’t accessible to others (or at least shouldn’t be, but that’s a story for another time), and which resources are shared with other teams or groups. It might sound obvious, but when you can’t figure out where a bug is coming from because it wasn’t your team (you swear!), you might be right. You just need to think about who else might be involved (my hint: follow the biggest shared resources first).

Bugs like these are quiet, sneaky, and totally avoidable – in retrospect. But if we learn from the retro, they also make us better engineers. So do the smart thing here: learn from others’ mistakes, and don’t forget – if something feels too invisible to fail, it probably deserves a test, an alert, and a constant name.