What are Tombstones in Cassandra and Why Are There Too Many?

I am often asked about tombstones in Cassandra, usually by users who are suddenly experiencing a slow cluster, nasty timeouts, a large number of tombstones, and a complex problem to untangle. So I decided to put together some of my lengthy answers and write a blog post about it.
Apache Cassandra is a powerful NoSQL database designed for high availability and scalability. However, its architecture and data management come with certain complexities, one of which is tombstones. In this blog post, we will delve into what tombstones are, why having too many is problematic, what creates them, how they are cleared, and the role of the gc_grace_seconds parameter.
What are tombstones?
In Cassandra, a tombstone is a marker indicating that a piece of data (row, column, or cell) has been deleted. Rather than immediately removing data upon deletion, Cassandra writes a tombstone to mark the data as deleted. This approach supports Cassandra’s eventual consistency model and ensures that deletions are propagated across all replicas while improving performance.
Tombstones are not permanent and are eventually cleared through a process called compaction. Compaction in Cassandra has several types, but the key point is that during this process, Cassandra merges SSTables and discards data that is no longer needed, including tombstones older than a certain threshold.
Why are too many tombstones bad?
Having an excessive number of tombstones can severely impact the performance and stability of a Cassandra cluster. Here are the key reasons why:
- Read Performance Degradation: During read operations, Cassandra must check tombstones to determine if a piece of data has been deleted. If there are many tombstones, read latency increases as Cassandra scans through them.
- Compaction Overhead: Compaction is a process in Cassandra that merges SSTables and removes unnecessary tombstones. If there are too many tombstones, compaction becomes more resource-intensive and frequent, leading to higher CPU and I/O load.
- Increased Disk Usage: Tombstones occupy disk space. Although they are smaller than the actual data, a large number of tombstones can lead to significant disk usage.
- Write Amplification: The presence of tombstones means that more data needs to be managed and compacted, increasing the overall write amplification in the system.
- Increased Memory Usage: During scans, Cassandra needs to keep tombstones in memory to ensure all replicas are aware of deletions. Workloads generating a lot of tombstones can lead to high memory consumption, potentially exhausting the server heap and causing performance issues.
What Creates Tombstones?
Tombstones can be generated through various operations in Cassandra:
- Deletes: Any delete operation, whether it is a row delete, a range delete, or a column delete, will create a tombstone.
- TTL (Time-to-Live): Data written with a TTL is automatically treated as a tombstone once the TTL expires.
- Updates: Updating a column can create a tombstone for the old value, depending on what the update writes. For example, if an update sets a column's value to null, a tombstone is created. This marks the column for deletion, similar to an explicit delete operation.
- Range Deletes: When a range of rows is deleted using DELETE ... WHERE with a range condition, tombstones are created for all the affected rows.
- Inserts: Inserting a null value can also create tombstones. This is an easy mistake to make; see the good and bad null insert example after this list.
- Collection columns: While Cassandra collections (list, set, map) are convenient, they do generate tombstones, especially during deletions and updates, because of Cassandra's distributed, immutable data storage model.
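Here is a minimal sketch of the null insert pitfall. The keyspace, table, and column names are hypothetical and only serve to illustrate the point:

-- Hypothetical table used for illustration only.
CREATE TABLE IF NOT EXISTS shop.users (
    id    uuid PRIMARY KEY,
    name  text,
    email text
);

-- Bad: explicitly writing null to email creates a cell tombstone for that column.
INSERT INTO shop.users (id, name, email)
VALUES (uuid(), 'Alice', null);

-- Good: simply omit the column you have no value for; no tombstone is written.
INSERT INTO shop.users (id, name)
VALUES (uuid(), 'Alice');

The same applies to prepared statements: leaving a bind variable unset (supported since protocol v4) writes nothing, while binding it to null writes a tombstone.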


The Role of gc_grace_seconds
The gc_grace_seconds parameter in Cassandra plays a crucial role in managing tombstones. It specifies the time period during which tombstones are kept before they become eligible for garbage collection (i.e., permanent removal during compaction). The default value is 10 days (864,000 seconds).
Something to remember regarding gc_grace_seconds is that it is a balancing act: setting it too low can lead to data resurrection (AKA zombie data) if replicas haven't fully synced, while setting it too high can result in an accumulation of tombstones.
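gc_grace_seconds is a per-table property, so it can be tuned where it matters most. A minimal sketch, assuming a hypothetical table and that repairs run comfortably within the chosen window:

-- Lower the grace period to 3 days for a hypothetical table.
ALTER TABLE shop.users WITH gc_grace_seconds = 259200;

-- From cqlsh, DESCRIBE shows the table definition, including gc_grace_seconds.
DESCRIBE TABLE shop.users;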
Best Practices to Manage Tombstones
To manage tombstones effectively, there are some best practices to keep in mind.
- Be careful when you create your data model: as partly shown above, how you choose to design your schema can put you on the right path or leave your cluster in an unmanageable state.
- Tune gc_grace_seconds: Adjust gc_grace_seconds according to your replication strategy and workload to balance consistency and performance if the default 10 days does not suit you.
- Good Compaction: Ensure that compaction processes are running smoothly and frequently to effectively manage tombstones. Use strategies like SizeTieredCompactionStrategy (STCS) or LeveledCompactionStrategy (LCS) based on your data model and workload; see the sketch after this list.
- Monitor Metrics: Use Cassandra’s monitoring tools like AxonOps to keep an eye on tombstone creation and compaction performance.
- Avoid Large Deletes: Large deletions can create many tombstones. Where possible, avoid bulk deletes or spread them out over time.
- Repair: Make sure your data is always repaired within gc_grace_seconds.
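As a sketch of the compaction point above, switching a table's compaction strategy is a per-table CQL change. The table name is hypothetical, and whether LCS or STCS fits depends on your workload:

-- Hypothetical read-heavy table moved to LeveledCompactionStrategy.
ALTER TABLE shop.users
WITH compaction = {
    'class': 'LeveledCompactionStrategy',
    'sstable_size_in_mb': 160
};

For the repair point, the usual approach is to complete a full round of nodetool repair on every node within each gc_grace_seconds window, so that deletions have reached all replicas before their tombstones become eligible for removal.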
Tombstone thresholds
As described above, a high number of tombstones can cause issues in your cluster. To mitigate this issue, Apache Cassandra has a tombstone counter: During read operations, as part of merging data from different SSTables to fulfill a query, Cassandra counts all the tombstones encountered. If the number of tombstones exceeds warning or failure thresholds, Cassandra will warn you and, for the failure limit, will also stop the query.
Here’s how each threshold operates:
tombstone_warn_threshold
Purpose: This threshold acts as a warning indicator. It defines the number of tombstones that, if encountered during a single query, will log a warning in the Cassandra logs.
Default Value: Set to 1,000 tombstones per read query.
Function: When a query hits this threshold, Cassandra logs a warning (often seen in system.log) to alert administrators that the query is encountering a high volume of tombstones. This warning is useful for identifying potential issues before they escalate into severe performance problems.
Action: This is a non-fatal threshold. Cassandra simply logs the occurrence without interrupting the query execution. Administrators are advised to review these warnings and assess the queries or data models generating excessive tombstones to prevent further issues.
Example of log:
'SELECT * FROM data WHERE dtype=7 AND dposition=15 AND dnumber=-8349
AND measuredtime>='2024-06-06T00:00:00.000Z'
AND measuredtime<='2024-06-26T23:59:00.000Z'
ORDER BY measuredtime DESC' generated server side warning(s):
Read 1703 live rows and 19085 tombstone cells for query
SELECT * FROM db.data WHERE dtype, dposition, dnumber = 7, 15, -8349
AND measuredtime <= 2024-06-26T23:59:00.000Z
AND measuredtime >= 2024-06-06T00:00:00.000Z
LIMIT 5000; token -2455708300064992048 (see tombstone_warn_threshold)
tombstone_failure_threshold
Purpose: This threshold is more critical than the warning threshold and is designed to prevent severe performance issues from excessive tombstone reads.
Default Value: Set to 100,000 tombstones per read query.
Function: When a query encounters a number of tombstones equal to or greater than this threshold, Cassandra will abort the query and return an error instead of completing it. The error message usually indicates a tombstone overflow, which highlights that the query would be too resource-intensive to complete without risking node stability.
Action: When this threshold is hit, Cassandra will throw an error to prevent overloading nodes with a potentially excessive read workload. This behavior helps maintain cluster stability, as high tombstone counts can lead to long read latencies, excessive memory usage, and even OutOfMemory (OOM) errors.
Example of log:
Scanned over 100000 tombstones; query aborted
How to Adjust Tombstone Thresholds
Both tombstone_warn_threshold and tombstone_failure_threshold can be configured in the configuration file cassandra.yaml:
tombstone_warn_threshold: 1000
tombstone_failure_threshold: 100000
Adjusting these values may be necessary for workloads where certain read paths naturally generate many tombstones, though it is usually better to review the data model and query patterns first. Lowering these thresholds can increase sensitivity to tombstone issues (helping catch problems sooner) while raising them can accommodate specific use cases with high deletion rates. This should be done cautiously.
Conclusion
Tombstones are essential to Cassandra’s architecture, enabling its robust deletion mechanism while maintaining eventual consistency. However, managing tombstones is crucial for maintaining cluster performance and stability. By understanding what creates tombstones, how they are cleared, and how to fine-tune related settings, you can optimise your Cassandra clusters to handle tombstones effectively and maintain optimal performance.
By following these best practices, you can ensure that your Cassandra clusters remain healthy and performant, even as they scale to meet the demands of your applications.
If you are struggling with this or other issues, contact our team for a free consultation to discuss how we can tailor our approach to your specific needs and challenges.