How We Achieved Almost Limitless Observability

and Broke the Limits of Traditional Observability

Tom Shlomi

With the upcoming mondayDB v3, we had to analyze the performance of events in relation to accounts, boards, and items, without compromising on speed or cost. After multiple tests across different observability platforms and technologies, the solution became clear: ClickHouse.

As you may know from a previous blog post, we had a perfect candidate for a high-volume, high-cardinality data store. We aimed to provide an efficient, cost-effective, and scalable solution for handling high-cardinality events and metrics using ClickHouse. By leveraging ClickHouse’s architecture, we enabled lightning-fast queries for analytics that were previously impossible.

This shift removes constraints on the volumes and cardinality of the data and opens new opportunities for observability and performance insights. The technical side includes a massive change to how we approach using telemetry data: Instead of using a dedicated observability tool/platform, we store telemetry in a DBMS and use SQL to visualize and read the data from custom-made schemas.

Technical Overview

Reaching the stars

Row-oriented databases store consecutive table rows sequentially. This layout allows rows to be retrieved quickly, as the column values of each row are stored together.

Unlike transactional (OLTP, Online Transaction Processing) queries, which read and write just a few rows per query and therefore complete in milliseconds, OLAP queries routinely process billions or trillions of rows. OLAP (Online Analytical Processing) refers to SQL queries with complex calculations (e.g., aggregations, string processing, arithmetic) over massive datasets.

ClickHouse is a column-oriented SQL database management system (DBMS) for OLAP.

In column-oriented systems, tables are stored as a collection of columns, i.e., the values of each column are stored sequentially one after the other. This layout makes it harder to restore single rows (as they are now spread across different files), but column operations such as filters or aggregations become much faster than in a row-oriented database. That’s what makes it so good for observability: attribute (column)-based filtering.

ClickHouse for Observability

In a row-oriented database, even when a query references only a few columns, the system still needs to load the data from the other columns from disk to memory. The reason is that data is stored on disk in chunks called blocks (usually of fixed size, e.g., 4 KB or 8 KB). Blocks are the smallest units of data read from disk into memory.

When an application or database requests data, the operating system’s disk I/O subsystem reads the required blocks from the disk. Even if only part of a block is needed, the entire block is read into memory.

In a column-oriented database, because the values of each column are stored sequentially one after the other on disk, no unnecessary data is loaded when a query runs. The engine reads only the referenced columns, avoiding unnecessary I/O for unused data.

This is much faster than row-based storage, where entire rows (including irrelevant columns) are read. Column-based storage is a good fit for observability because most of the time you are querying specific attributes, and each telemetry attribute can be stored as a column in ClickHouse.
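For example, a typical observability query touches only a handful of columns. A minimal sketch, assuming a trace table like the otel.otel_traces table referenced later (the exact columns are illustrative):

-- Only the ServiceName, Duration, and Timestamp columns are read from disk,
-- no matter how many other attribute columns the table has.
SELECT
    ServiceName,
    quantile(0.95)(Duration) AS p95_duration
FROM otel.otel_traces
WHERE Timestamp >= now() - INTERVAL 1 HOUR
GROUP BY ServiceName
ORDER BY p95_duration DESC;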

Inserting & Merging

ClickHouse operates similarly to LSM trees:

Inserts (into tables from the MergeTree engine family) create sorted, immutable data parts. All data processing is offloaded to background part merges.

Each INSERT is written into its own folder on disk (creating a part). ClickHouse performs best with as few parts as possible and shines on BIG batches when INSERTing. Partitioning is configured in the DDL, for example daily partitions (by Timestamp) – PARTITION BY toDate(Timestamp)
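A minimal sketch of such a table definition, with illustrative columns (not our production schema):

CREATE TABLE otel.logs_example
(
    Timestamp DateTime64(9),
    ServiceName LowCardinality(String),
    Body String
)
ENGINE = MergeTree()
PARTITION BY toDate(Timestamp)      -- one partition per day; parts are merged within a partition
ORDER BY (ServiceName, Timestamp);  -- sorting (primary) key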

To control the number of parts per table, ClickHouse continuously merges (per partition) smaller parts into larger ones in the background until they reach a compressed size of approximately 150 GB by default.

Merges and parts in ClickHouse

MVs MVs MVs

The secret you ask? Not so much a secret…

We, as owners of an OTEL-native data ingestion pipeline, wanted to build a solution that would support it fully and give us the best performance possible.

At first, we quickly realized the standard OTEL schemas weren’t enough. The next step was to start using ClickHouse itself as an ETL pipeline: a Null table acts as the entry point, receiving the data without storing it, and MATERIALIZED VIEWS act as triggers on each INSERT that transform the data. ClickHouse then loads the data into the final table, where it is maintained, compressed, and optimized.
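A simplified sketch of this pattern, with illustrative table names and columns (not our actual schemas):

-- Entry point: receives raw rows from the OTEL pipeline but stores nothing
CREATE TABLE otel.logs_in
(
    Timestamp     DateTime64(9),
    TraceId       String,
    Body          String,
    LogAttributes Map(String, String)
)
ENGINE = Null;

-- Final, query-optimized table
CREATE TABLE otel.logs_out
(
    Timestamp DateTime64(9),
    TraceId   String,
    Severity  LowCardinality(String),
    Body      String
)
ENGINE = MergeTree()
PARTITION BY toDate(Timestamp)
ORDER BY (Severity, Timestamp);

-- Fires on every INSERT into the Null table, transforms the rows, and writes them out
CREATE MATERIALIZED VIEW otel.logs_etl_mv TO otel.logs_out AS
SELECT
    Timestamp,
    TraceId,
    LogAttributes['level'] AS Severity,
    Body
FROM otel.logs_in;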

With MVs, the possibilities seem almost endless. Together with the new JSON column type, they make ClickHouse highly flexible while preserving its columnar storage, with amazing compression ratios and super-fast queries. Another plus, of course, is that migrations can be performed with ease: just create another MV–table pair and effortlessly process the data into two different tables.

This makes ClickHouse good for observability; telemetry data can be transformed and optimized to any schema for specific and unique use-cases.

Materialized Views in ClickHouse

Sounds Like A Library

We mainly use two types of data-skipping indexes:

minmax is a lightweight index type that requires no parameters. It stores the minimum and maximum values of the index expression for each block.

This type is ideal for columns that tend to be loosely sorted by value. This index type is usually the least expensive to apply during query processing. Observability data like metric values, trace duration and so on, can benefit from this type of index.

This type of index only works correctly with a scalar or tuple expression — the index will never be applied to expressions that return an array or map data type.

INDEX idx_user_id UserId TYPE minmax GRANULARITY 1

Bloom filter indexes are data structures that allow space-efficient testing of set membership at the cost of a slight chance of false positives; they work similarly to a hashing mechanism.

A false positive is not a significant concern in the case of indexes because the only disadvantage is reading a few unnecessary blocks. Bloom filters make sure valid data is not skipped.

Because Bloom filters can efficiently handle testing for a large number of discrete values, they are a good fit for conditional expressions that produce many values to test, such as trace IDs: INDEX idx_trace_id TraceId TYPE bloom_filter(0.01) GRANULARITY 1.
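Putting both together, a hedged sketch of how these skip indexes might be declared on an events table (the table and columns are illustrative):

CREATE TABLE otel.events_example
(
    Timestamp DateTime64(9),
    UserId    UInt64,
    TraceId   String,
    Name      LowCardinality(String),
    Value     Float64,
    INDEX idx_user_id  UserId  TYPE minmax             GRANULARITY 1,
    INDEX idx_trace_id TraceId TYPE bloom_filter(0.01) GRANULARITY 1
)
ENGINE = MergeTree()
PARTITION BY toDate(Timestamp)
ORDER BY (Name, Timestamp);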

JSON Magic

We initially set up ClickHouse with default settings and the standard OTEL schemas, but the results did not meet our expectations. After experimenting with various schema combinations, we finally reached the ‘golden schema’ that allowed for both complex and generic queries, dynamic and common schemas across services, and rapid processing of vast amounts of data.

ClickHouse can take a JSON object and split it into a multi-column structure, where each key becomes a sub-column itself. ClickHouse leverages those sub-columns, by treating them as columns of their own, and so, gains significant performance (it is a column-oriented DB after all) and (almost) full column abilities on them.

JSON column definition example – `Attributes` JSON(max_dynamic_paths=5000) CODEC(ZSTD(1)). Each telemetry attribute can be represented in a different column.
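For instance, a sketch of querying one sub-column directly, assuming a hypothetical otel.events_json table that has a Timestamp column plus the `Attributes` JSON column defined above (the path names are illustrative):

-- Only the account_id sub-column (and Timestamp) is read from storage,
-- even if Attributes holds thousands of dynamic paths.
SELECT
    toString(Attributes.account_id) AS account_id,
    count() AS events
FROM otel.events_json
WHERE Timestamp >= now() - INTERVAL 1 HOUR
GROUP BY account_id
ORDER BY events DESC
LIMIT 20;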

JSON Columns Split in Storage

Examples

The following are examples of ETLs we built with ClickHouse. It’s a lot, but it sums up what we’ve covered so far:

Conditional values in MV

multiIf(
    LogAttributes['severity'] != '', LogAttributes['severity'],
    LogAttributes['level'] != '', LogAttributes['level'],
    '') AS Severity

Fallback value check

if(LogAttributes['trace_id'] != '', LogAttributes['trace_id'], TraceId) AS TraceId

Complex JSON parsing

jsonMergePatch(
    -- plain attributes: drop well-known resource keys and any *_stringified keys
    toJSONString(mapFilter((k, v) -> (k NOT IN ('account_id', 'system', 'user_id', 'message', 'trace_id', 'span_id', '', 'eks_cluster', 'env', 'container_name', 'pod', 'namespace', 'node', 'region', 'tag', 'request_id', 'version') AND k NOT LIKE '%_stringified'), LogAttributes)),
    -- *_stringified values holding JSON objects: strip the suffix from the key, parse the value as a map
    toJSONString(mapApply((k, v) -> (extractGroups(k, '(.*?)(?:_{1,2})?stringified$')[1], JSONExtract(v, 'Map(String, String)')), mapFilter((k, v) -> ((k LIKE '%_stringified') AND (JSONType(v) = 'Object')), LogAttributes))),
    -- *_stringified values holding JSON arrays: same key handling, parse the value as an array
    toJSONString(mapApply((k, v) -> (extractGroups(k, '(.*?)(?:_{1,2})?stringified$')[1], JSONExtract(v, 'Array(String)')), mapFilter((k, v) -> ((k LIKE '%_stringified') AND (JSONType(v) = 'Array')), LogAttributes))),
    -- *_stringified scalar values: same key handling, keep the value as a string
    toJSONString(mapApply((k, v) -> (extractGroups(k, '(.*?)(?:_{1,2})?stringified$')[1], JSONExtract(v, 'String')), mapFilter((k, v) -> ((k LIKE '%_stringified') AND (JSONType(v) NOT IN ('Array', 'Object'))), LogAttributes)))
) AS Attributes

(very) Extra table definitions – a simplified example of a service map implementation

CREATE TABLE otel.service_map_local
(
    parent_service LowCardinality(String),
    child_service LowCardinality(String),
    ts DateTime, -- time bucket
    endpoint LowCardinality(String),
    span_kind LowCardinality(String),
    requests UInt64,
    errors UInt64,
    total_latency Float64,
    avg_latency Float64 ALIAS total_latency / requests,
    error_rate Float64 ALIAS errors / requests,
    INDEX ...
)
ENGINE = SummingMergeTree() -- summing by the order key
ORDER BY (parent_service, child_service, endpoint, span_kind, ts) -- order (primary) key

Process table (ETL, trigger-like functionality). ON CLUSTER makes sure all the nodes create the view; it performs the ETL on data FROM one table and outputs the result TO another:

CREATE MATERIALIZED VIEW otel.service_map_mv ON CLUSTER traces
TO otel.service_map_local AS
SELECT
    parent.ServiceName AS parent_service, 
    child.ServiceName AS child_service,
    toStartOfMinute(child.Timestamp) AS ts,
    child.SpanName AS endpoint,
    child.SpanKind AS span_kind,
    count() AS requests,
    sum(if(child.StatusCode = 'ERROR', 1, 0)) AS errors,
    sum(child.Duration) AS total_latency
FROM otel.otel_traces AS child
INNER JOIN otel.otel_traces AS parent
    ON child.TraceId = parent.TraceId
    AND child.ParentSpanId = parent.SpanId
GROUP BY parent_service, child_service, endpoint, span_kind, ts;
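Because SummingMergeTree only collapses rows when parts are merged in the background, queries against the service map should still aggregate explicitly. A hedged sketch of reading it, using the columns defined above:

SELECT
    parent_service,
    child_service,
    sum(requests) AS requests,
    sum(errors) / sum(requests) AS error_rate,
    sum(total_latency) / sum(requests) AS avg_latency
FROM otel.service_map_local
WHERE ts >= now() - INTERVAL 1 HOUR
GROUP BY parent_service, child_service
ORDER BY requests DESC;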

Implementation

We implemented this observability data processing pipeline through a two-stage approach. First, we manually defined tables for each telemetry type (Logs, Traces, and Metrics such as gauge, sum, histogram, etc.) using the ENGINE = Null() configuration in ClickHouse. The data is not physically stored in these tables; they serve only as entry schema definitions in OTEL semantics.

The core of our implementation is materialized views that function as data pipelines. These views extract the necessary data elements from the raw telemetry streams coming from our OpenTelemetry ingestion pipeline, then transform and process this data before persisting it in the appropriate storage structures. This approach gives us the flexibility to adjust our data processing logic while maintaining a clean separation between raw data ingestion and the storage of processed events and metrics.

We deployed ClickHouse in our shared Kubernetes clusters, allowing all environments (and there are a lot of them) to insert data. Moreover, we deployed ClickHouse on AWS Graviton instances, which gave us a platform with very good performance and cost-effectiveness for running heavy workloads.

We utilized Karpenter, which gave us superb node and pod management, and the Altinity ClickHouse operator, which enabled us to deploy ClickHouse clusters very efficiently.

All SELECT queries run through a Distributed table engine that gathers results from each node and merges them into the final result over all of the data. This way, we can use the Kubernetes service’s native load balancing (behind an NLB) to distribute the INSERTs without overloading the ClickHouse nodes (which would otherwise need to coordinate the writes), ensuring that both ingestion and storage stay balanced while remaining highly available.
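A sketch of what such a Distributed table might look like for the service map above, assuming the same ‘traces’ cluster name used earlier (the sharding key is illustrative):

-- Reads fan out to every node's service_map_local table and the results are merged.
CREATE TABLE otel.service_map ON CLUSTER traces
AS otel.service_map_local
ENGINE = Distributed('traces', 'otel', 'service_map_local', rand());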

What We Sacrificed

No doubt, an all-in-one platform that manages everything for you is paradise. But it is sometimes (*cough* *cough*) too expensive, especially when it comes to indexing and high-cardinality observability.

So, with the in-house solution, we manage terabytes, and eventually petabytes, of data, which adds significant strain on the team. The process also involves finding the best performance for the cost: rightsizing nodes and node count, optimizing disk layout for best performance, and managing alerts and monitoring for the observability tool itself.

We are also in a migration period in which data is spread across multiple observability tools, so certain correlations require more than one click. This will be mitigated over time, as migrating years of work on dashboards, integrations, and methodology is a very complex process, especially in a large company like monday.com.

What We Earned

We get events on every request to whichever service we pick, summarized into span-like keys (and sub-keys), all handled flawlessly by ClickHouse. Not to mention direct correlation to traces.

We visualize this data with millisecond SELECTs, powered by polished, flexible schemas that we fully control.

All this comes at a fraction of the cost of other SaaS platforms, allowing us to focus on the raw data instead of over-engineering and fine-tuning operational moves.

High Cardinality Events & Metrics

High Cardinality Monitoring Dashboard
High Cardinality Engine Monitoring Dashboard
High Cardinality Monitoring Overview Dashboard

100% – No Compromises Tracing Solution

Ingesting 100% of traces from our infrastructure into Datadog is very costly; a Grafana + ClickHouse cluster solves the problem easily. Since we had already become experts at working with both tools, the same stack also serves as our tracing datastore.

We are the ones who control the cost, by choosing node size and count and by optimizing the data using Materialized Views. In fact, we plan to create a hybrid OTEL schema for traces, to save on fields we don’t need and extract (LowCardinality) the ones we do.

We can look up a specific trace in under 500 ms, get meaningful analytics, and extract specific fields from all traces.

Traces Services Monitoring Dashboard

APM

APM Dashboard – alpha

Results & Impact

After we showcased our observability solution to another company, they acknowledged its superiority and scheduled a follow-up to understand how it works, for potential implementation in their stack.

  • Developers and infrastructure teams now gain insights and analytics on all types of entities, at never-before-achievable dimensions.
  • We cut costs by shifting away from tools that charge by cardinality.
  • Metric/Event/Trace granularity – we store and analyze everything. No compromises. Lightning-fast queries.

What’s Next?

  • Fine-tuning schemas, ingestion, and indexing strategies for even greater performance.
  • Evaluate existing products that provide a superior UX, are easy to implement and use, and deliver the performance and flexibility we want to achieve.
  • Use object storage (S3) for archival and infrequently accessed data, and experiment with a fast local NVMe filesystem cache in front of S3 as the main storage.

More reading material

** DISCLAIMER: Not a single line was written by AI, besides typo checking.

Tom Shlomi
DevOps Engineer @ monday.com