Detecting traffic anomalies at scale
We analyse HTTP requests and SQL queries to find and block excessive, potentially abusive traffic. We look for unreasonable usage that generates so much traffic it could impact the stability of the whole multi-tenant platform. Only individual users or scripts are blocked.
The problem of noisy customers
In a multi-tenant environment, millions of users run their daily workflows on the same shared infrastructure. We run on multiple clusters, yet a single rogue user can still generate a significant load. This is especially visible with APIs and automations: user-generated scripts and our low-code automated features can easily amplify into tens of thousands of API and database calls. Of course, concurrent scraping and attempted DDoS attacks are also possible. Last but not least, bugs and inefficient code (the N+1 problem, infinite loops between systems, etc.) may result in an unexpected rise in traffic from a single user action.
At this scale we occasionally observed individual users generating 20-30% of the overall system load! We needed a way to discover such individuals quickly (within minutes) and take precautions. This phenomenon is often called the noisy neighbor problem, and it doesn’t only apply to cloud hardware being overutilized by other tenants. For the time being, we simply ban problematic user accounts for a few minutes. This obviously has a negative impact on usability, but it prevents whole-system downtime and allows us to investigate with less time pressure.
Collecting fine-grained traffic telemetry
We need a way to constantly collect and process the traffic of each individual user in real time. With millions of users and several services in a distributed environment, this proved quite challenging. The first step is to identify which pieces of information we need and how to obtain them. At the moment we monitor:
- Incoming HTTP requests (web and RESTful APIs)
- SQL queries
- Public GraphQL API calls
Apart from the obvious metadata (timestamp, user ID, the endpoint/query itself), we also identified one important factor: event weight. It describes how heavy a particular operation is. After all, an empty HTTP 301 is much cheaper for our infrastructure than, let’s say, 100 KiB of JSON. The same applies to SQL, where we take into account the number of affected rows and the query time, and to GraphQL, where we can utilize the preexisting notion of query cost.
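To make the idea concrete, here is a minimal sketch of such a weighting heuristic. The type shape, the function name `eventWeight`, and all coefficients are illustrative assumptions, not our production formula:

```typescript
// Hypothetical event-weight heuristic; the coefficients below are
// illustrative, not the ones used in production.
type TelemetryEvent =
  | { kind: "http"; responseBytes: number }
  | { kind: "sql"; affectedRows: number; durationMs: number }
  | { kind: "graphql"; queryCost: number };

function eventWeight(e: TelemetryEvent): number {
  switch (e.kind) {
    case "http":
      // Weight grows with the response size.
      return 1 + e.responseBytes / 1024;
    case "sql":
      // Heavier queries touch more rows and run longer.
      return 1 + e.affectedRows * 0.01 + e.durationMs * 0.1;
    case "graphql":
      // GraphQL already has a notion of query cost we can reuse directly.
      return e.queryCost;
  }
}

// An empty HTTP 301 is far cheaper than a 100 KiB JSON response:
eventWeight({ kind: "http", responseBytes: 0 });          // → 1
eventWeight({ kind: "http", responseBytes: 100 * 1024 }); // → 101
```

The exact formula matters less than the principle: comparable units across HTTP, SQL, and GraphQL so that per-user load can be summed in one place.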
To support comprehensive telemetry collection we built thin client libraries that intercept HTTP server calls and SQL queries. Our stack comprises Node.js for the HTTP layer and Sequelize for data access. Intercepting HTTP requests in Express.js is relatively simple with a set of handlers: every HTTP request passes through a special filter that collects telemetry. In the case of Sequelize we use its hook mechanism; hooks are executed for each and every database query. Of course, our extra instrumentation layer must be fast and resilient to minimise impact on production traffic.
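The interception point can be sketched as an Express-style middleware. Everything here is a simplified assumption: the request shape is reduced to the two fields we care about, and `publishTelemetry` is a placeholder for the real publishing path:

```typescript
// Simplified sketch of the HTTP interception filter; `publishTelemetry`
// and the reduced request shape are placeholders for illustration.
type Telemetry = { userId: string; endpoint: string; timestamp: number };

const collected: Telemetry[] = [];
function publishTelemetry(t: Telemetry): void {
  collected.push(t); // in production this would enqueue the event for Kafka
}

// Express-style middleware: every request passes through this filter.
function telemetryMiddleware(
  req: { userId: string; path: string },
  _res: unknown,
  next: () => void,
): void {
  publishTelemetry({ userId: req.userId, endpoint: req.path, timestamp: Date.now() });
  next(); // never block the request on telemetry
}

// Demo call with a fake request object:
telemetryMiddleware({ userId: "u1", path: "/v1/boards" }, null, () => {});
// `collected` now holds one event for user "u1"
```

With a real Express app this would be registered via `app.use(telemetryMiddleware)`; Sequelize hooks such as `beforeFind`/`afterFind` have a similar shape on the query side.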
After plugging in said library, we can transparently extract all relevant information without modifying any business logic. We could technically use a service mesh for that, but some fine-grained attributes are hard to obtain on the network level only. We made sure that this agent is extremely easy to install in all of our services.
Collecting telemetry from various sources locally is surprisingly easy. Publishing it to a single, accessible place proved much harder. Every piece of our agent must be resilient and add as little overhead as possible. After all, we don’t want to bring down production with software whose sole purpose is to avoid bringing down production.
To publish and aggregate telemetry insights we considered several options. Polling individual services or pushing through high-overhead protocols like HTTP was ruled out. Writing to a central SQL/NoSQL database also seemed too brittle. In the end, we settled on publishing telemetry events through Kafka. Sending messages to topics is extremely fast, durable, and compact. Each event carries a set of metadata that we can later use to gain insights. Once all the data from all our services is centralized in one Kafka topic, we can move on to running real-time analytics on top of the raw data.
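A telemetry event on the topic might look like the sketch below. The field names, the topic name, and the kafkajs client in the comment are assumptions for illustration, not our actual schema:

```typescript
// Hypothetical shape of a telemetry event as published to Kafka.
interface KafkaTelemetryEvent {
  ts: number;       // epoch millis
  userId: string;
  source: "http" | "sql" | "graphql";
  endpoint: string; // path or query fingerprint
  weight: number;   // precomputed event weight
}

function encodeEvent(e: KafkaTelemetryEvent): { key: string; value: string } {
  // Keying by userId keeps one user's events in a single partition,
  // which simplifies per-user windowed aggregation downstream.
  return { key: e.userId, value: JSON.stringify(e) };
}

// With a client like kafkajs (an assumption, not necessarily what we use):
//   await producer.send({ topic: "traffic-telemetry", messages: [encodeEvent(evt)] });
```

Partitioning by user ID is a deliberate choice here: the consumer that aggregates per-user load never has to shuffle events across partitions.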
Discovering anomalies with Spark streaming
We are not building a general-purpose performance monitoring tool. All we care about is finding individual user accounts that are producing a disproportionate amount of load. But with almost 10 billion messages produced daily on Kafka brokers, it’s not an easy task. We decided to run analytics in near real-time using managed Spark. A relatively simple Spark streaming job examines events through a moving window. This implementation discovers anomalies within 10-15 seconds.
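The production job runs on Spark streaming; the core of its moving-window aggregation can be expressed in plain TypeScript for illustration. Window size, the share threshold, and both function names are illustrative assumptions:

```typescript
// Core of the moving-window aggregation, extracted from its Spark context
// purely for illustration; thresholds here are made up.
interface LoadEvent { userId: string; ts: number; weight: number }

// Sum per-user weight over the last `windowMs` milliseconds.
function windowedLoad(events: LoadEvent[], now: number, windowMs: number): Map<string, number> {
  const load = new Map<string, number>();
  for (const e of events) {
    if (now - e.ts <= windowMs) {
      load.set(e.userId, (load.get(e.userId) ?? 0) + e.weight);
    }
  }
  return load;
}

// Flag users whose share of the total window load exceeds `share`.
function anomalousUsers(load: Map<string, number>, share: number): string[] {
  const total = Array.from(load.values()).reduce((a, b) => a + b, 0);
  return Array.from(load.entries())
    .filter(([, w]) => total > 0 && w / total > share)
    .map(([u]) => u);
}
```

In Spark the same idea is a windowed groupBy over the Kafka stream; the point is that a single user claiming, say, 20-30% of the window's total weight stands out immediately.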
At this point, we examine a set of rules that can be configured at runtime through our Developer Platform. A typical rule discovers unusual and potentially harmful spikes of traffic coming from a single user account. For example, one user scanning hundreds of millions of database records in total, over a short period. There can be many reasons for such an anomaly:
- a malicious actor performing a DDoS attack, scraping, or API abuse; such an actor makes hundreds of thousands of relatively simple queries
- a bug in the platform (missing database index, cache miss or lack of caching, uncontrolled loop or recursion)
- a batch process running on behalf of an individual user without any throttling
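A runtime-configurable rule for the example above (scanning hundreds of millions of rows in a short period) might be shaped like this. The field names and the `violates` helper are hypothetical; the real rules in our Developer Platform are more elaborate:

```typescript
// Hypothetical shape of a runtime-configurable detection rule.
interface Rule {
  name: string;
  source: "http" | "sql" | "graphql";
  windowSec: number;      // look-back window the rule applies to
  maxRowsScanned: number; // deliberately conservative to avoid false positives
  dryRun: boolean;        // notify only, don't actually ban
}

interface UserWindowStats {
  userId: string;
  source: string;
  rowsScanned: number; // aggregated over the rule's window
}

function violates(rule: Rule, stats: UserWindowStats): boolean {
  return stats.source === rule.source && stats.rowsScanned > rule.maxRowsScanned;
}
```

Keeping rules as plain data is what makes it possible to tweak thresholds, or flip a rule between dry-run and enforcing, without redeploying the Spark job.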
Keep in mind that all of the above situations (and more) are protected with more specific mechanisms, like application firewalls and circuit breakers. Our solution is meant to be the last line of defence, if all else fails. Also, the rules for detecting abnormal usage are quite conservative to avoid false positives.
Feedback loop – blocking individual users
When the aforementioned Spark job discovers that some rules were violated, we take several immediate actions.
- First of all, we ban individual users for a configurable amount of time. The most critical of our services poll for newly created bans and become aware of them within seconds. They can either ignore a ban or block requests from the affected user, returning HTTP 429 early. Ignoring happens when we want to dry-run a rule without triggering false positives. More on that later.
- Secondly, we try to notify teams that might be affected. We don’t want to wake up on-duty engineers with elevated error rates.
- Finally, we often contact SREs or performance engineers. Even a single ban might be a symptom of a larger attack that we can proactively prevent. The sooner we block unwanted traffic, the less ripple effect it will have on core services.
The length of the ban may be propagated through the Retry-After header. But if the (most likely automated) traffic doesn’t stop despite the 429 responses, we extend the ban with an exponentially growing timeout.
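The escalation can be sketched as follows, assuming a base duration that doubles on each repeated offense up to a cap. The base of 5 minutes, the 1-hour cap, and the function names are all illustrative, not our actual parameters:

```typescript
// Illustrative ban-length escalation: base duration doubles per repeat
// offense, capped at a maximum. All numbers here are made up.
function banDurationSec(baseSec: number, repeatOffenses: number, capSec: number): number {
  return Math.min(baseSec * 2 ** repeatOffenses, capSec);
}

// Well-behaved clients learn the ban length from the 429 response:
function retryAfterHeader(baseSec: number, repeatOffenses: number, capSec: number) {
  return { "Retry-After": String(banDurationSec(baseSec, repeatOffenses, capSec)) };
}

banDurationSec(300, 0, 3600); // → 300  (5 min on first offense)
banDurationSec(300, 3, 3600); // → 2400 (40 min after three repeats)
```

Clients that respect Retry-After recover quickly; automated traffic that ignores it simply earns itself a longer timeout.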
Outcomes and challenges
We deployed this project in several phases. During the initial step, we merely observed the traffic, looking for anomalies more or less manually. Later we codified these anomalies in the form of rules. We are really careful to avoid false positives and blocking legitimate users and traffic. So our thresholds are quite high (even for a high-traffic website like monday.com), but more importantly, we ran our rules in dry-run mode, only notifying us, without actually blocking anyone. As an interesting side effect, the presence of a ban often correlated with system-wide outages observed by other teams. This meant that our system was actually helping to identify symptoms of ongoing incidents.
Then we proceeded to gradually promote rules to actually ban traffic. Once again, harming legitimate traffic is our worst possible outcome, so we were really careful to raise bans conservatively. Over time we expect our bans to be a leading, rather than lagging, outage indicator. In simple words, we want to prevent outages by blocking traffic quickly. Currently, due to the conservative nature of our rules, we ban users only after a lot of damage has already been done. But that’s just a matter of fine-tuning our blocking rules.
The system we deployed is meant to be the last line of defence. It’s fairly broad, collecting a few terabytes of data daily and trying to find unusual HTTP/SQL traffic patterns. We reached a goal of proactively banning individual customers for a brief moment to allow hundreds of thousands of other customers to work continuously.