Managing Trace Volume at monday.com
At monday.com, our distributed systems handle tens of millions of requests daily, generating a massive volume of diagnostic data that we rely on to monitor the health of our service, diagnose issues, and make informed decisions about which parts need further optimization.
Observability does not come for free. Handling this amount of data is a challenge at every stage: generation (how do we enforce standards?), ingestion (how do we make sure we collect everything that is needed?), and finally presentation (how do we surface the data to any engineer who needs it?).
In this article, we will look at how our handling of tracing data has evolved, and how our investment in open standards allowed us to build our own tool and give our developers a better experience.
When Trace Volume Floods the System
Our team focuses on collecting telemetry data from production efficiently – and recently, the volume of tracing data we collect rose to excessive levels. Growing observability data volume is a common problem with a typical solution: we introduced a low sampling rate for all spans generated by our production systems. Combined with a few simple heuristics (e.g., increased sampling of errors), this approach preserved much of the prior value at a significantly lower cost.
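To make the error heuristic concrete, here is a minimal sketch of one way it can be expressed in the Collector, assuming the tail_sampling processor from opentelemetry-collector-contrib; the policy names and the 5% baseline are illustrative placeholders rather than our production settings, and our actual pipeline (shown later) uses the probabilistic sampler.
processors:
  tail_sampling:
    decision_wait: 10s          # buffer spans until the whole trace can be evaluated
    policies:
      - name: keep-errors       # always keep traces that contain an error
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: baseline          # keep a small probabilistic baseline of everything else
        type: probabilistic
        probabilistic:
          sampling_percentage: 5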
However, “a lot” is not “the same.” After we decreased the sampling rate, a recurring theme emerged: rare, harder-to-track problems could no longer be investigated using traces.
Fortunately, we have been investing in the OpenTelemetry ecosystem for quite some time. Since the early days of our microservices architecture, we have instrumented all Node.js-based microservices using OpenTelemetry libraries. As the infrastructure Observability team, we also deploy the OpenTelemetry Collector as the shipper for all observability signals. These two decisions enabled us to build our own tool for those uncommon cases.
Ocean
In cooperation with our developers, we devised an idea: take back ownership of our monitoring data from the vendor, store it in our own storage solution, and make it possible to replay short (time-wise) but complete snapshots of the system’s state.
With this tool, whenever you hit a rare situation that you need to understand fully, all you need to know is the exact time it happened. Then you get a better view of what happened in the same APM tool you use daily! Our backend takes your query, finds all the related traces in our data layer, and sends them to our APM vendor, so you get a clearer picture than before, in the familiar tool.
We picked Clickhouse as the database. As a column-oriented database, it offers excellent query performance (so our filtering is fast) and built-in compression (perfect for highly repetitive data, like spans), which makes it a popular choice for storing tracing data. There is also a ready-made exporter component for the OpenTelemetry Collector, so integration is seamless.
Thanks to using an open-source data shipper, the data is already under our control. We just needed to split our trace handling pipeline, this time before the data gets sampled. In more direct terms – we went from this snippet in the Collector configuration:
...
service:
  pipelines:
    ...
    traces:
      receivers: [otlp]
      processors: []
      exporters: [spanmetrics/connector]
    traces/sampled:
      receivers: [spanmetrics/connector]
      processors: [probabilisticsampler, batch]
      exporters: [vendor/exporter]
...
To a setup with an additional trace handling pipeline that uses the Clickhouse exporter:
...
service:
  pipelines:
    ...
    traces:
      receivers: [otlp]
      processors: []
      exporters: [spanmetrics/connector]
    traces/sampled:
      receivers: [spanmetrics/connector]
      processors: [probabilisticsampler, batch]
      exporters: [vendor/exporter]
    traces/unsampled:
      receivers: [spanmetrics/connector]
      processors: [batch/clickhouse]
      exporters: [clickhouse]
...
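The component definitions are elided above. As a rough sketch, assuming the clickhouse exporter from opentelemetry-collector-contrib and the standard batch processor, the new pieces could look roughly like this (the endpoint, database, table name, and 72h retention are placeholder values, not our production configuration):
exporters:
  clickhouse:
    endpoint: tcp://clickhouse.internal:9000   # placeholder address
    database: otel
    traces_table_name: otel_traces
    ttl: 72h                                   # full-fidelity spans are kept only for a few days
    timeout: 5s
processors:
  batch/clickhouse:
    send_batch_size: 10000
    timeout: 10s
A TTL along these lines is one way to enforce the short retention window we describe below.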
We implemented the developer-facing UI as a micro-frontend in our internal developer platform, Sphera. Development was a breeze – using our Vibe frontend component library, we quickly put together a simple UI:
On the main screen, you can see an audit log of all trace replay requests: who made the request, for which service, the estimated cost, and the replay progress.
Another part of the UI is a form for requesting traces, where you can narrow your request to a specific timeframe (from 1 minute up to 1 hour) and filter it by a few trace attributes (which user? which account? which endpoint?). During the confirmation step, the user sees the number of traces that will be sent and the estimated cost.
Why not jump ship?
The obvious question is: if we already have the data on our side, why not switch entirely to an open-source visualization layer and query Clickhouse directly, instead of paying extra to ship more data to the usual vendor? There are several reasons.
First, we all see the value in having fewer places to look, and if it’s a tool you already know how to use, even better! During an incident call, any extra work needed to correlate data adds to the time it takes to resolve the problem.
Another critical point is that we can’t make too many changes at once, and we need confidence that our changes will have the impact we hope for. It’s much easier and faster to introduce a small tool and then build on that foundation to cover more and more use cases. Narrowing the scope also gives us, the maintainers, room to learn how to operate everything we have introduced – it’s a prime learning opportunity for the whole team, so we can provide better tools in the future.
The last point is that, in the current state, there is a limit to how much data we retain. Since our initial goal was to help debug rare problems, we decided to trade retention time for information completeness. In the “usual” case, we send heavily sampled data to the vendor, but you can query it for up to 30 days. Our Clickhouse instance keeps 100% of spans, but only for a few days.
From waves to tides
At every step of our work, we need to plan for increased scale: our tool has to grow with our system and still deliver the value we expect from it.
One potential pain point is trace storage. With increased volume, even our Clickhouse deployment might become too expensive or unable to handle the load. For example, a Quickwit instance backed by cloud object storage such as AWS S3 might be a great way to save more, at the cost of additional development.
Another scaling concern is incident management, which becomes more complex with every new feature shipped. We can easily imagine adding a few simple incident-related rules to our Ocean application so that, without any manual action, a crisp, complete picture of what happened is already replayed to our regular observability tools, saving precious seconds in high-stress situations.
Steering through Complexity
As our systems grow and evolve, so do the complexities of monitoring and understanding them. The challenges are limitless, from managing vast amounts of telemetry data to ensuring interoperability across diverse tools and platforms.
Our investment in OpenTelemetry turned out to be a real treasure. This open standard gives us the flexibility to adapt to whatever our daily observability challenges throw our way, and prepares us for brand-new ideas in the field.