Recently, at one of the clients I consult for, I came across a strange situation: tombstone records that “refused” to disappear.
The scenario was quite simple:
- Kafka Streams application that materializes some state (in RocksDB).
- From time to time, a punctuation kicks in, pulls all the accumulated records and sends them somewhere.
- Upon success, it deletes all the records and calls it a day.
However, when consuming the changelog topic, I noticed that there were lots of tombstone records. Having some of them made sense; that is how a “delete” is represented in a changelog topic. However, having so many that hadn’t been cleaned up was unexpected.
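You can see this for yourself by reading the changelog topic with the console consumer and printing keys; tombstones show up as records whose value prints as null. (The topic name and broker address below are placeholders for your own setup.)

```shell
# Consume the changelog topic from the beginning, printing keys and values.
# Tombstone records appear with a value of "null".
kafka-console-consumer.sh \
  --bootstrap-server localhost:9092 \
  --topic my-app-my-store-changelog \
  --from-beginning \
  --property print.key=true \
  --property print.value=true
```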
I applied a few strategies/changes until I finally made them “gone”.
Step 1 – Roll your segments more often
Compaction only happens on segments that have been rolled (closed); the active segment is never compacted. Therefore, if you want to influence the compaction process, it is important to adjust when rolling happens:
- segment.ms: the segment can stay open for up to this long. By default, that is 7 days.
- segment.bytes: the segment can stay open up to this number of bytes. The default here is 1 GB, which is too big for low-traffic topics.
The defaults for these two settings have “big data” stamped on them. If you don’t have a “big data” topic, chances are the process won’t be responsive enough for you.
I tried setting them to 60,000 ms (1 minute) and 1,048,576 bytes (1 MB) respectively… with no luck. Nothing changed; the tombstones were still there.
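For reference, this is roughly how such a change is applied, via a kafka-configs invocation against the topic (topic name and broker address are placeholders):

```shell
# Override segment rolling settings on the changelog topic so that
# segments close quickly and become candidates for compaction.
kafka-configs.sh \
  --bootstrap-server localhost:9092 \
  --alter \
  --entity-type topics \
  --entity-name my-app-my-store-changelog \
  --add-config segment.ms=60000,segment.bytes=1048576
```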
Step 2 – Tolerate less dirtiness
It is also possible that, even if your segments are rolling regularly, the log compaction thread doesn’t pick up your topic/partition file because it is not dirty enough, meaning the ratio between entries that are candidates for compaction and those that aren’t is not meeting the configured threshold.
min.cleanable.dirty.ratio controls this threshold and it is 0.5 by default, meaning at least 50% of your topic/partition file must be “dirty” entries for compaction to run. Anything below that, and the cleaner thread doesn’t consider it worth compacting.
My next step was to set this value to 0.01. This is quite aggressive, and I wouldn’t recommend it for most topics, unless you have low volume and you really, really want to keep your topic/partition spotless.
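Same mechanism as before, lowering the threshold on the topic and then checking that the override took effect (names are placeholders):

```shell
# Make the partition eligible for compaction at just 1% dirtiness.
kafka-configs.sh \
  --bootstrap-server localhost:9092 \
  --alter \
  --entity-type topics \
  --entity-name my-app-my-store-changelog \
  --add-config min.cleanable.dirty.ratio=0.01

# Verify the topic-level overrides currently in place:
kafka-configs.sh \
  --bootstrap-server localhost:9092 \
  --describe \
  --entity-type topics \
  --entity-name my-app-my-store-changelog
```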
However, this didn’t do the trick either…
Step 3 – Be less nice with your replaying consumers
When a consumer is replaying a topic from the beginning, it might encounter this problem:
- Offset X contains a record with Key K and Value V.
- A few records “later” (maybe millions…), a record with Key K appears again, but with a null Value, AKA a tombstone.
- If the consumer reads the first record, but compaction runs and removes the second one (the tombstone) before the consumer reaches it, the consumer will never learn that the record with Key K has been deleted.
To compensate for this scenario, Kafka has a config setting called delete.retention.ms that controls how long tombstones should be kept around for the benefit of these consumers. Its default: 1 day.
This is very useful, but it will also keep tombstones around unnecessarily if you don’t expect any replaying consumer to read a given topic, or at least not one that takes as long as 1 day to do so.
My next attempt was to configure this down to 60,000 ms (1 minute)… but still no luck.
Step 4 – It’s not a feature… it’s a bug
I was running out of options at this point, so I thought that maybe this was one of those rare and unfortunate occasions when you hit an actual Kafka bug. Fired up a quick search on Google and… voila!
Tombstones can survive forever: https://issues.apache.org/jira/browse/KAFKA-8522
Long story short, under certain circumstances, tombstones get their “timeouts” renewed regularly, meaning they will not honor delete.retention.ms and will stick around.
The only workaround that seems to work is to set delete.retention.ms to zero, forcing the tombstones to be deleted immediately, instead of sticking around for the benefit of consumers replaying the topic.
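As a sketch of the workaround, applied the same way as the earlier overrides (topic name and broker address are placeholders):

```shell
# Evict tombstones as soon as compaction runs, with no grace period
# for replaying consumers. Use with care (see below).
kafka-configs.sh \
  --bootstrap-server localhost:9092 \
  --alter \
  --entity-type topics \
  --entity-name my-app-my-store-changelog \
  --add-config delete.retention.ms=0
```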
However, this solution must be used with great care. For the scenario described at the beginning, a Kafka Streams app and a changelog topic, using this option can have unexpected side effects during the Restore phase, when the app reads its changelog topics to rebuild its state. If compaction kicks in while the restore is in progress, the app might miss tombstones for entries it has already consumed, keeping entries in its key/value store that should have been removed.
Unfortunately, until the bug is fixed, if your app needs all these tombstones evicted from the changelog, this seems to be the only option.