Cassandra at scale: Partition limitations and solutions
Cassandra is a highly available, distributed, partitioned database, well suited for high write throughput at scale. As such, it can be a great choice when data can be easily partitioned. That said, Cassandra’s partitioning implementation has built-in drawbacks that force you to plan carefully: queries depend on the partition key, and querying without it at hand can be impossible. On top of that, oversized partitions can hurt performance drastically. Naturally, these limitations can become intertwined and hurt your performance even further. At monday we tackled and solved them. Let’s see how.
Internally, Cassandra’s partitioning implementation uses a variation of consistent hashing to distribute data between the nodes: for every record stored, its partition key is run through a hashing function, and the resulting token determines which node stores it.
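As a rough illustration of the idea (not Cassandra’s actual code), here is a minimal sketch: hash the partition key into a token, and map that token onto a ring of nodes. The hash function, the four fixed nodes and the ring boundaries below are simplified assumptions; the real partitioner uses Murmur3 and virtual nodes.

```python
import hashlib
from bisect import bisect_left

# Simplified stand-in for Cassandra's partitioner (assumption: MD5 and four
# fixed nodes here are only for illustration).
def token_for(partition_key: str) -> int:
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

# Each node owns the range of tokens up to and including its ring position.
RING = [
    (-(2 ** 62),  "node-1"),
    (0,           "node-2"),
    (2 ** 62,     "node-3"),
    (2 ** 63 - 1, "node-4"),
]

def node_for(partition_key: str) -> str:
    token = token_for(partition_key)
    positions = [position for position, _ in RING]
    index = bisect_left(positions, token)
    return RING[index][1]

print(node_for("board:12345"))  # always maps the same key to the same node
```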
Cassandra’s data mapping to nodes
While the official recommendation is to keep partitions up to 10MB, in practice partitions larger than 1MB can already be slow to query and hurt performance. On top of that, if records are deleted often, tombstone records may become a serious burden and hurt performance even further, until they are purged by the database engine during a compaction process.
If you are not familiar with the concept, tombstones are records that mark deleted values in Cassandra. The database adds them to ensure eventual consistency and to distinguish missing records from deleted ones. When queries are executed, Cassandra also reads the tombstones to verify the data is correct before returning it to the caller. A vast number of tombstones can take down some nodes in your cluster, or all of it. The compaction process eventually deletes them physically, reducing the number of records read by the relevant queries.
Data lifecycle in Cassandra
There are different approaches that can be used to minimize partition size. In our case, since we could not find a trivial partition key with higher cardinality, we decided to divide our partitions further into smaller entities we call buckets. In this structure, the bucket id is part of the partition key, together with all the other fields we used for the primary key. The bucket id values are simply hashes calculated over all the other primary key fields, modulo the number of buckets, meaning that for each row in the table we can always calculate its bucket id.
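To make this concrete, here is a minimal sketch of how such a bucket id could be derived. The field names and the choice of hash function are illustrative assumptions, not our exact schema:

```python
import hashlib

def bucket_id(primary_key_fields: tuple, num_buckets: int) -> int:
    """Derive a deterministic bucket id from the non-bucket primary key fields.

    Hypothetical example: for a primary key of (board_id, item_id, bucket_id),
    the bucket id is a hash of (board_id, item_id) modulo the bucket count.
    """
    raw = "|".join(str(field) for field in primary_key_fields).encode()
    digest = int.from_bytes(hashlib.md5(raw).digest()[:8], "big")
    return digest % num_buckets

# With a single bucket (the default), every row lands in bucket 0.
assert bucket_id((1234, 5678), num_buckets=1) == 0

# After the partition grows, the same row maps to one of 10 buckets.
print(bucket_id((1234, 5678), num_buckets=10))
```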
If we need to traverse an entire partition, this multiplies the number of queries by the number of buckets. To avoid running many more queries all the time, we decided to enable bucketing via an opt-in feature flag, relevant only for specific partitions, and on top of that, to change the number of buckets based on the number of records in a partition.
This way, by default we assume our partition has only one bucket, and that bucket is always queried. Once the partition grows beyond the defined limit, we automatically allocate 9 additional buckets, meaning the bucket id is now the hash calculated over the other primary key fields, modulo 10.
Querying a logical partition with bucketing in place
To know how many buckets there are, we started managing that number in an additional metadata table. Each of our queries is now translated into anywhere from 2 to n database queries. This approach allows us to divide the partition into buckets without automatically multiplying our total number of queries by the number of buckets. A possible issue with this approach is that a row’s bucket id can change when more buckets are added, as the modulo calculation changes. To solve that, we decided to query all relevant buckets and return only the latest version of the record.
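A sketch of what the read path could look like with the metadata table in place. The keyspace, table and column names are hypothetical, and the DataStax Python driver is used only for illustration:

```python
from cassandra.cluster import Cluster

# Hypothetical schema:
#   partition_metadata(board_id, num_buckets)       -- bucket count per logical partition
#   items(board_id, bucket_id, item_id, value, ...) -- partition key is (board_id, bucket_id)
session = Cluster(["127.0.0.1"]).connect("example_keyspace")

def read_item(board_id: int, item_id: int):
    # Query 1: how many buckets does this logical partition currently have?
    row = session.execute(
        "SELECT num_buckets FROM partition_metadata WHERE board_id = %s",
        (board_id,),
    ).one()
    num_buckets = row.num_buckets if row else 1

    # Queries 2..n+1: the row may sit in a bucket computed under an older
    # bucket count, so query all buckets and keep the most recent write.
    latest = None
    for bucket in range(num_buckets):
        candidate = session.execute(
            "SELECT value, WRITETIME(value) AS ts FROM items "
            "WHERE board_id = %s AND bucket_id = %s AND item_id = %s",
            (board_id, bucket, item_id),
        ).one()
        if candidate and (latest is None or candidate.ts > latest.ts):
            latest = candidate
    return latest
```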
The tombstones problem is a harder one to tackle. Since our table stores data generated by users, who can also delete the data they create, tombstones can be a heavy burden, with extreme cases where specific partitions accumulate tens of thousands of tombstones before they are purged. We are implementing two solutions for that.
The first one is relatively simple. We measure how long a query takes to run. If it exceeds a defined limit, we execute the query again with tracing on (using Cassandra’s built-in tracing mechanism). We then query Cassandra for the received trace id and analyze it to see how many tombstones were read during the query. If the number is above the threshold we defined, we block queries to that partition until it is compacted by the database. For each query we run, we also check whether its primary key is in the blocked list, and skip it if it is.
The block time is configurable and corresponds to the scheduled compaction jobs we run on the database itself, to ensure tombstones are purged often enough and don’t overload the database engine for long. Of course, this solution is used only for extreme cases where the number of tombstones causes the database to reach constant resource saturation and eventually time out on queries.
Blocking partitions with too many tombstones, until compacted
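A simplified sketch of this flow, using the Python driver’s built-in tracing support. The threshold values, the in-memory block list, and the exact wording of the trace events being parsed are assumptions for illustration:

```python
import re
import time

SLOW_QUERY_SECONDS = 1.0        # assumed latency threshold
TOMBSTONE_THRESHOLD = 10_000    # assumed tombstone threshold
BLOCK_SECONDS = 6 * 60 * 60     # assumed block window, aligned with compaction schedule

blocked_partitions = {}         # partition key -> unblock timestamp (in-memory for the sketch)

def is_blocked(partition_key) -> bool:
    unblock_at = blocked_partitions.get(partition_key)
    return unblock_at is not None and time.time() < unblock_at

def execute_guarded(session, query, params, partition_key):
    if is_blocked(partition_key):
        raise RuntimeError(f"partition {partition_key} is blocked until compaction")

    started = time.time()
    result = session.execute(query, params)
    if time.time() - started < SLOW_QUERY_SECONDS:
        return result

    # Slow query: re-run it with tracing enabled and inspect the trace events.
    traced = session.execute(query, params, trace=True)
    trace = traced.get_query_trace()
    tombstones = 0
    for event in trace.events:
        # Trace event wording varies by Cassandra version; this regex is an assumption.
        match = re.search(r"(\d+) tombstone", event.description)
        if match:
            tombstones += int(match.group(1))

    if tombstones > TOMBSTONE_THRESHOLD:
        blocked_partitions[partition_key] = time.time() + BLOCK_SECONDS
    return traced
```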
The second solution, now under development, is more complicated. It introduces a rebalancing mechanism for our partitions. It uses the same table we use to track the number of buckets, extending it to also store the number of ongoing deletes per bucket. When a count we consider dangerous is reached, we run an async process that moves the bucket’s live rows to a new bucket. The new one contains only active rows, with no tombstones, and the original one can then be deleted. The entire process requires locking the data while it is updated, to ensure consistency, making it potentially more time consuming and risky.
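A rough sketch of what that rebalancing process could look like. Since the mechanism is still under development, the locking primitive, table names and the `bucket_remap` metadata column below are all hypothetical:

```python
def rebalance_bucket(session, board_id: int, old_bucket: int, new_bucket: int, lock):
    """Move the live rows of a tombstone-heavy bucket into a fresh bucket.

    `lock` stands for a hypothetical distributed lock guarding writes to this
    logical partition while rows are copied, and `bucket_remap` is a
    hypothetical metadata column telling readers which physical bucket now
    backs the old bucket id.
    """
    with lock:
        # Copy every live row into the new bucket; tombstones are not read,
        # so they stay behind in the old bucket.
        rows = session.execute(
            "SELECT item_id, value FROM items WHERE board_id = %s AND bucket_id = %s",
            (board_id, old_bucket),
        )
        for row in rows:
            session.execute(
                "INSERT INTO items (board_id, bucket_id, item_id, value) "
                "VALUES (%s, %s, %s, %s)",
                (board_id, new_bucket, row.item_id, row.value),
            )

        # Point readers at the new bucket and reset the delete counter,
        # then the old bucket can be dropped entirely.
        session.execute(
            "UPDATE partition_metadata SET bucket_remap[%s] = %s, delete_count = 0 "
            "WHERE board_id = %s",
            (old_bucket, new_bucket, board_id),
        )
        session.execute(
            "DELETE FROM items WHERE board_id = %s AND bucket_id = %s",
            (board_id, old_bucket),
        )
```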
In conclusion, these optimizations mitigate our known issues with Cassandra, while we use it to tackle our ever-growing scale and customer demands.