
From Reactive Blocks to Proactive Insights: monday.com’s Journey in Traffic Anomaly Detection
Over two years ago, we shared how monday.com protected its platform by quickly blocking anomalous traffic, preventing the system-wide instability caused by excessive HTTP requests and SQL queries. (You can read more about that initial solution in ‘Detecting Traffic Anomalies at Scale’.)
That reactive approach was a critical first step: it helped us prevent crashes and maintain stability. But we were still flying blind. Imagine the frustration: a sudden surge hits, and all we can do is block. Was it a coordinated attack? A runaway automation? A critical bug? Without knowing why an anomaly occurred, our teams were constantly reacting, wasting valuable time on investigation, and unable to proactively prevent similar incidents.
This critical need for deeper understanding drove us to evolve monday.com’s traffic anomaly detection from a reactive blocking tool into a holistic analysis solution. Our new system goes beyond simply stopping problems; it provides a comprehensive, real-time picture of what’s happening across the platform.
In this post, we’ll dive into the improved architecture, the expanded data we’re now collecting, and the powerful analysis tools we’ve implemented to ensure monday.com runs smoothly for all of its users.
Expanded Coverage
Initially, we focused on HTTP requests and SQL queries because they provided excellent indicators of direct user interaction and the volume of requests our database was handling.
But monday.com is complex, operating as a multi-tenant platform with diverse user actions, automations, and integrations. This complexity means a single HTTP request can trigger a cascade of internal processes and data interactions. To truly understand the full scope of user activity and its impact, we needed to look beyond just front-end requests and database queries.
We’ve since broadened our scope. We now also monitor SQS messages, which are crucial for our asynchronous workflows and inter-service communication, and Cassandra queries, since Cassandra is a vital part of our backend for storing large volumes of data. We also track Sidekiq jobs, which handle much of our background processing, letting us observe the full execution lifecycle of complex workflows.
Monitoring these allows us to trace user behavior more deeply into our system, capturing anomalies that were previously invisible at the HTTP or SQL level. This expanded coverage provides a more comprehensive understanding of the full impact of user activity, enabling us to block unusual traffic at various critical points.
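The real instrumentation is framework-specific, but the sketch below, written in Python with hypothetical names, illustrates the shape of the idea: every handler, whatever its traffic source, gets wrapped so that each unit of work emits a user-action event with its source and timing.

```python
# A minimal, illustrative sketch (not monday.com's actual code) of how a handler
# for any traffic source (HTTP, SQL, SQS, Cassandra, Sidekiq-style jobs) could be
# wrapped so every unit of work emits a user-action event. All names are hypothetical.
import functools
import time

def publish_user_action_event(event: dict) -> None:
    """Placeholder for publishing the event to the pipeline (e.g. Kafka)."""
    print("event:", event)

def instrument(source: str):
    """Decorator that records source, duration, and basic context per action."""
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(*args, account_id=None, user_id=None, **kwargs):
            start = time.monotonic()
            try:
                return handler(*args, **kwargs)
            finally:
                publish_user_action_event({
                    "source": source,          # e.g. "sqs", "cassandra", "sidekiq"
                    "account_id": account_id,
                    "user_id": user_id,
                    "duration_ms": round((time.monotonic() - start) * 1000, 2),
                })
        return wrapper
    return decorator

@instrument(source="sqs")
def handle_sqs_message(message_body: str):
    ...  # actual message processing goes here

handle_sqs_message('{"action": "update_item"}', account_id=42, user_id=7)
```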
Data Collection
Our approach to data collection has undergone significant changes. Previously, we mainly recorded “blocks”: instances where we prevented a user’s action. This was helpful for immediate fixes, but it didn’t explain why the block occurred. For example, a “block” told us only that a particular user was blocked in one service due to an abnormal number of HTTP requests.
To truly understand the reasons behind unusual traffic, we needed more information. So, we shifted our focus from just collecting “blocks” to gathering “events”.
Now, every user action in the system is recorded as an event, even if it doesn’t result in a block, along with crucial context such as the host, the HTTP path and method, the number of affected rows, and the duration of the action.
This detailed event data across HTTP, SQL, SQS, Cassandra and Sidekiq gives us a comprehensive, real-time dataset. It enables us to analyze user behavior, pinpoint the causes of anomalies, and proactively fine-tune detection rules, forming the backbone of our new analysis solution.
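To make this concrete, here is a rough sketch of what such an event record might look like and how it could be published to the pipeline. The field names, the topic name, and the use of the kafka-python client are illustrative assumptions, not our exact schema.

```python
# Illustrative only: an approximate shape for a user-action event and how it
# might be published to Kafka. Field names and the topic are hypothetical.
import json
import time
from dataclasses import dataclass, asdict
from kafka import KafkaProducer  # pip install kafka-python

@dataclass
class UserActionEvent:
    account_id: int
    user_id: int
    source: str            # "http", "sql", "sqs", "cassandra", "sidekiq"
    host: str
    http_path: str | None
    http_method: str | None
    affected_rows: int | None
    duration_ms: float
    timestamp: float

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda e: json.dumps(e).encode("utf-8"),
)

event = UserActionEvent(
    account_id=42, user_id=7, source="http", host="app-server-12",
    http_path="/api/v2/boards", http_method="GET",
    affected_rows=None, duration_ms=18.4, timestamp=time.time(),
)
producer.send("user-action-events", value=asdict(event))
producer.flush()
```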
Additional Setup & Visualization Tooling
To effectively leverage the new data, we have invested in robust tooling. We store all raw event data in AWS S3, a scalable and cost-effective storage solution. Then, AWS Athena, a serverless query engine with an SQL interface, lets us query the dataset directly from S3, providing a lot of flexibility for ad-hoc analysis.
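As a hedged illustration of that workflow, assuming the events are exposed to Athena as a table called traffic_events (the table, columns, and staging bucket below are made up for the example), an ad-hoc query from Python might look like this:

```python
# Sketch of an ad-hoc Athena query over the raw events stored in S3.
# Table/column names and the staging bucket are hypothetical.
from pyathena import connect  # pip install pyathena

conn = connect(
    s3_staging_dir="s3://example-athena-results/",  # Athena needs a results location
    region_name="us-east-1",
)
cursor = conn.cursor()
cursor.execute("""
    SELECT source,
           COUNT(*) AS events,
           APPROX_PERCENTILE(duration_ms, 0.95) AS p95_ms
    FROM traffic_events
    WHERE event_date = DATE '2024-01-01'
    GROUP BY source
    ORDER BY events DESC
""")
for row in cursor.fetchall():
    print(row)
```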
To visualize our findings, we use Apache Superset, an open-source data visualization and data exploration platform, to build interactive dashboards. These dashboards turn complex data into clear, actionable insights.
Beyond dashboards, we’ve developed a custom Slack bot. This bot provides real-time, actionable context during investigations, such as ‘Top Block Activity (Last X Hours),’ detailing specific rules, affected accounts, and active locks. This immediate access to consolidated information drastically reduces the time spent sifting through logs, allowing developers to jump straight to diagnosis and resolution.
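To give a feel for it, here is a minimal sketch of such a command using Slack’s Bolt framework for Python; the command name, the fetch_top_blocks helper, and the output format are hypothetical, not our bot’s actual implementation:

```python
# Rough sketch (not our actual bot) of a Slack command that surfaces recent
# block activity. The command name and fetch_top_blocks() are hypothetical.
import os
from slack_bolt import App  # pip install slack-bolt

app = App(
    token=os.environ["SLACK_BOT_TOKEN"],
    signing_secret=os.environ["SLACK_SIGNING_SECRET"],
)

def fetch_top_blocks(hours: int) -> list[dict]:
    """Placeholder: query the blocks store (e.g. via Athena or the blocks DB)."""
    return [{"rule": "http_rate_limit", "account_id": 42, "blocks": 17}]

@app.command("/top-blocks")
def top_blocks(ack, respond, command):
    ack()
    hours = int(command.get("text") or 3)
    rows = fetch_top_blocks(hours)
    lines = [f"• rule={r['rule']} account={r['account_id']} blocks={r['blocks']}" for r in rows]
    respond(f"Top block activity (last {hours}h):\n" + "\n".join(lines))

if __name__ == "__main__":
    app.start(port=3000)
```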
Collectively, these tools empower us to not only gather data but to respond effectively with a profound understanding of the situation.
Current System Structure
When a user initiates a request (HTTP or database), a user action event is generated by a proxy or middleware and published to a Kafka topic. Two parallel Spark jobs consume these events: one for processing and block creation, the other for raw event archiving.
The processing job, based on rule information, creates blocks that are stored in the database. Depending on the currently active blocks, users either gain access to the requested resource or receive a 429 “Too Many Requests” error. Raw events are stored in S3 and can be queried using AWS Athena, with visualization capabilities provided by Apache Superset.
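The sketch below illustrates the archiving half of that picture: a Spark Structured Streaming job that reads the user-action events from Kafka and lands them in S3 as Parquet for Athena to query. The topic name, schema, and paths are simplified assumptions, not the production job.

```python
# Illustrative raw-event archiving job: Kafka -> Spark Structured Streaming -> S3 (Parquet).
# Requires the spark-sql-kafka connector on the classpath; names and paths are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, LongType, DoubleType

spark = SparkSession.builder.appName("raw-event-archiver").getOrCreate()

event_schema = StructType([
    StructField("account_id", LongType()),
    StructField("user_id", LongType()),
    StructField("source", StringType()),
    StructField("host", StringType()),
    StructField("http_path", StringType()),
    StructField("http_method", StringType()),
    StructField("affected_rows", LongType()),
    StructField("duration_ms", DoubleType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "user-action-events")
    .load()
    .select(from_json(col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

query = (
    events.writeStream.format("parquet")
    .option("path", "s3a://example-traffic-events/raw/")
    .option("checkpointLocation", "s3a://example-traffic-events/checkpoints/raw/")
    .partitionBy("source")
    .trigger(processingTime="1 minute")
    .start()
)
query.awaitTermination()
```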
How the Newly Available Data Resolves Old Challenges
Our new data collection methods have entirely changed how we approach problem-solving. Previously, a “block” was just a block, offering little insight without a deep dive into countless logs.
Now, with a comprehensive flow of events, we can quickly get to the bottom of things. This allows us to pinpoint complex “lock patterns”, which help us tell the difference between actual malicious traffic and, for instance, a feature’s automation unexpectedly hitting a limit.
Imagine a user who keeps getting blocked due to an inefficient integration; our new tooling can help highlight that.
We also use it to distinguish between normal, high-volume usage from a happy customer and a sudden, unusual surge that might signal a bug or an attack. It provides our development teams with precise, in-depth traffic information, leading to faster diagnosis and more effective solutions.
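One simplified way to think about that distinction: compare an account’s current request volume against a rolling baseline of its own recent behavior, and only flag windows that fall far outside it. The thresholds and window sizes below are illustrative, not our production rules.

```python
# Simplified sketch: separate steady high-volume usage from a sudden surge by
# comparing the current window against a rolling per-account baseline.
from statistics import mean, stdev

def is_surge(history: list[int], current: int,
             min_sigma: float = 4.0, min_count: int = 1000) -> bool:
    """history: request counts for recent windows; current: this window's count."""
    if len(history) < 5 or current < min_count:
        return False  # not enough baseline, or volume too low to matter
    baseline, spread = mean(history), stdev(history)
    return current > baseline + min_sigma * max(spread, 1.0)

# A consistently busy account: high but stable volume, not a surge.
print(is_surge([9000, 9500, 8800, 9200, 9100], 9600))   # False
# The same account suddenly tripling its rate: flagged as a surge.
print(is_surge([9000, 9500, 8800, 9200, 9100], 30000))  # True
```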
This means we can move from just reacting to problems to making informed decisions.
Consider a recent scenario where an account experienced a surge in HTTP traffic, leading to an account service block.
While initial investigation showed increased activity across all users on that account, our detailed, high-cardinality event data quickly revealed the true culprit: an abnormal number of requests to a single ‘service’ endpoint from the frontend. This pointed directly to a newly deployed feature.
Our current setup enabled us to swiftly identify the responsible feature flag and resolve the issue for the affected account, thereby avoiding a broader impact and allowing for a targeted fix rather than a blanket block.
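As an illustration of that kind of drill-down, a query along the following lines, run over the event data via Athena as in the earlier sketch (table, columns, and values are hypothetical), surfaces the endpoint responsible for a spike within a single account:

```python
# Hypothetical drill-down query: which endpoint drove the spike for one account
# during the incident window? Run via Athena as in the earlier sketch.
INCIDENT_DRILL_DOWN = """
    SELECT http_path, http_method, COUNT(*) AS requests
    FROM traffic_events
    WHERE account_id = 42
      AND source = 'http'
      AND event_time BETWEEN TIMESTAMP '2024-01-01 10:00:00'
                         AND TIMESTAMP '2024-01-01 11:00:00'
    GROUP BY http_path, http_method
    ORDER BY requests DESC
    LIMIT 10
"""
```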
Next Steps
We’re now exploring the use of large language models (LLMs) for more predictive anomaly detection, as well as refining our ability to differentiate between legitimate high-volume usage and truly anomalous behavior in our multi-tenant environment.
This continuous evolution is key to maintaining a seamless experience for our users and ensuring uninterrupted work for hundreds of thousands of monday.com customers.