Kafka defaults that you should re-consider (I)

Image taken from http://www.htbdpodcast.com

There is a vast number of configuration options for Apache Kafka, mostly because the product can be fine-tuned to perform in various scenarios (e.g., low latency, high throughput, durability). These options span brokers, producers and consumers (plus other sidecar products like Connect or Streams).

The guys at Kafka do their best to provide a comprehensive set of defaults that will just work, but some of them can be relatively dangerous if used blindly, as they might have unexpected side effects, or be optimized for a use case different to yours.

In this post, I’d like to review the most obvious ones on the broker side, explain what they do and why their defaults can be problematic, and propose alternative values.

Change these defaults

auto.create.topics.enable

Defaults to ‘true’. You definitely want to change this one to false. Applications should be responsible for creating their own topics, with the correct configuration settings for their use cases.
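As a rough sketch of what "creating their own topics" can look like, the helper below only assembles the arguments for an explicit kafka-topics.sh --create call. The topic name, the partition count, the retention value and the localhost:9092 address are all hypothetical; adapt them to your own cluster and tooling.

```python
from typing import Dict, List


def kafka_topics_create_command(
    topic: str,
    partitions: int,
    replication_factor: int,
    topic_configs: Dict[str, str],
    bootstrap_server: str = "localhost:9092",  # hypothetical broker address
) -> List[str]:
    """Build the argument list for an explicit `kafka-topics.sh --create` call."""
    command = [
        "kafka-topics.sh",
        "--bootstrap-server", bootstrap_server,
        "--create",
        "--topic", topic,
        "--partitions", str(partitions),
        "--replication-factor", str(replication_factor),
    ]
    # Each per-topic override is passed as its own --config key=value pair.
    for key, value in sorted(topic_configs.items()):
        command += ["--config", f"{key}={value}"]
    return command


# Hypothetical topic with settings chosen on purpose rather than inherited.
orders_command = kafka_topics_create_command(
    topic="orders",
    partitions=6,
    replication_factor=3,
    topic_configs={"min.insync.replicas": "2", "retention.ms": "604800000"},
)
print(" ".join(orders_command))
```

The point is not the helper itself but that every value is stated explicitly by the application that owns the topic, instead of being inherited from broker-wide defaults.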

If you keep it true, some other configuration values kick in to fulfill the default topic configuration:

log.retention.hours: Defaults to 168 (7 days). Think carefully about whether this default is good enough. Any data older than that is not available when replaying the topic.
min.insync.replicas: Defaults to 1. As the documentation mentions, a typical configuration is the replication factor minus 1, meaning that with a replication factor of 3, min.insync.replicas should be 2. The problem with 1 is that it puts you in a dangerous position, where the cluster accepts messages for which you only have 1 copy. On the other hand, a value equal to the replication factor means that losing one node temporarily stops your cluster from accepting messages until the missing replica has been rebalanced to a healthy node.
default.replication.factor: Defaults to 1. This is a bad value since it effectively creates only one copy of an auto-created topic. If the disk that stores a partition of this topic dies, the data is lost. Even if there are backups, the consumers don’t benefit from automatic rebalancing to other brokers that have copies of the partition, resulting in consumption interruptions. I would suggest a value like 3 and then fine-tuning the topics that require more or less, independently.
num.partitions: Defaults to 1. Another bad value. If a topic only has one partition, it can be consumed by only one instance of an application at a time, hindering any parallelization that we might hope to achieve using Kafka. While partitions are not free and Kafka clusters have a limit on how many they can handle, a minimum value of 3 partitions per topic seems like a safer and more sensible default.
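The trade-off between the replication factor and min.insync.replicas is easier to see with a bit of arithmetic. The sketch below is a deliberately simplified model, not Kafka code: it assumes one partition whose replicas live on distinct brokers, that every surviving replica is in sync, and that producers use acks=all (the case where min.insync.replicas is enforced). The function name and the returned keys are my own.

```python
def replica_status(replication_factor: int, min_insync_replicas: int, brokers_down: int) -> dict:
    """Simplified model of a single partition after some brokers have failed."""
    # Assumption: every replica that is still alive is also in sync.
    in_sync = max(replication_factor - brokers_down, 0)
    accepts_writes = in_sync > 0 and in_sync >= min_insync_replicas
    return {
        "in_sync_replicas": in_sync,
        "accepts_writes": accepts_writes,
        # Accepting writes with a single copy means one more failure loses data.
        "single_copy_risk": accepts_writes and in_sync == 1,
    }


# Replication factor 3, comparing min.insync.replicas of 1, 2 and 3.
for min_isr in (1, 2, 3):
    for down in (0, 1, 2):
        print(f"min.isr={min_isr} down={down} -> {replica_status(3, min_isr, down)}")
```

Under these assumptions, a value of 2 keeps accepting writes with one broker down and never acknowledges a write that has a single copy, which is why replication factor minus 1 is the usual suggestion.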

offsets.retention.minutes

Defaults to 1440 minutes (24 hours) in Kafka versions before 2.0; from 2.0 onwards the default is 10080 minutes (7 days). The 24-hour default is dangerous. Some applications might be idle over the weekend, meaning they don’t publish to Kafka during that period.

On Monday morning, if they restart before they consume from Kafka, the new instances don’t find any committed offsets for their consumer group, since they have expired.

At that point, the auto.offset.reset configuration in the consumer kicks in, sending the application to the earliest message, latest, or failing. In any case, this is not desirable.

The recommendation is to increase this value to something like 7 days for extra safety.
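To put numbers on the weekend scenario, here is a small sketch. It assumes a simplified expiry rule (offsets expire once the time since the last commit exceeds the retention), whereas the broker's real logic is more involved and depends on the Kafka version. The dates and the function name are hypothetical.

```python
from datetime import datetime, timedelta


def offsets_expired(last_commit: datetime, restart: datetime, retention_minutes: int) -> bool:
    """Simplified rule: offsets are gone once the idle time exceeds the retention."""
    return restart - last_commit > timedelta(minutes=retention_minutes)


last_commit = datetime(2020, 6, 5, 18, 0)  # Friday evening (hypothetical)
restart = datetime(2020, 6, 8, 8, 0)       # Monday morning (hypothetical)

idle_minutes = int((restart - last_commit).total_seconds() // 60)
print(f"idle for {idle_minutes} minutes")

# 24 hours of retention versus 7 days of retention.
print("expired with 1440:", offsets_expired(last_commit, restart, 1440))
print("expired with 10080:", offsets_expired(last_commit, restart, 7 * 24 * 60))
```

A quiet weekend is 3720 minutes in this example, comfortably past a 24-hour retention but well within 7 days.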

Keep these defaults

auto.leader.rebalance.enable

Defaults to true. Unless you know what you’re doing, you don’t want to rebalance partition leadership manually. Let Kafka do it for you.

delete.topic.enable

Defaults to true. If you find yourself in a highly regulated environment, you might not be allowed to delete anything, ever. Otherwise, allowing topic deletion guarantees that you can get rid of data quickly and easily.

That is especially useful in development clusters. Don’t set this to false there; you will shoot yourself in the foot.

log.flush.scheduler.interval.ms

Defaults to ‘never’ (represented as a ridiculously long number of ms). Kafka is so performant because it enables zero-copy data transfers from producers to consumers.

While that is a fantastic mechanism for moving tons of data quickly, the durability aspect can be a concern. To account for that, Kafka proposes using replication across nodes to guarantee the information is not lost, instead of explicitly flushing messages to disk as they come. The result of that is a lack of certainty about when the messages are actually written to the disk.

You could effectively force Kafka to flush to disk using this and other configuration properties. However, you would most likely kill Kafka performance in the process. Hence, the recommendation is to keep the default value.

offsets.commit.required.acks

Defaults to ‘-1’, which means an offset commit is not acknowledged by the leader until all in-sync replicas of the __consumer_offsets topic have received it.

That is a safe default, falling on the side of durability, versus lower latency. You should consider particular configurations at the topic level, dependent on the nature of the stored information (e.g., ‘logs’ taking a lower value than ‘orders’).

offsets.topic.num.partitions

Defaults to ‘50’. Kafka automatically creates the topic __consumer_offsets with this number of partitions. Since this is likely to be the busiest topic in your cluster, it’s a good idea to keep the number of partitions high so that the load is spread across as many nodes as possible.

The number of partitions of __consumer_offsets cannot be changed for the lifetime of the cluster, so even if you are not planning to have 50 brokers in your cluster, it is safer to leave this number as it is.
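One way to see why this number is effectively frozen: to my understanding, the group coordinator picks the __consumer_offsets partition for a consumer group by hashing the group id and taking the result modulo the partition count. The sketch below reproduces that idea in Python under that assumption; it is not taken from the Kafka code base, and the function names and group ids are hypothetical.

```python
def java_string_hash(text: str) -> int:
    """Replicate Java's String.hashCode() as a signed 32-bit integer."""
    h = 0
    # Java hashes UTF-16 code units, so walk the encoded text two bytes at a time.
    data = text.encode("utf-16-be")
    for i in range(0, len(data), 2):
        unit = (data[i] << 8) | data[i + 1]
        h = (31 * h + unit) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h


def offsets_partition_for(group_id: str, num_partitions: int = 50) -> int:
    """Assumed mapping: non-negative hash of the group id modulo the partition count."""
    h = java_string_hash(group_id)
    # Assumption: the minimum 32-bit integer maps to 0, other values to their absolute value.
    non_negative = 0 if h == -0x80000000 else abs(h)
    return non_negative % num_partitions


for group in ("billing-service", "orders-service", "hello"):  # hypothetical group ids
    print(group, offsets_partition_for(group, 50), offsets_partition_for(group, 51))
```

If the mapping works this way, changing the partition count would send existing groups to a different partition, where their previously committed offsets would not be found.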

offsets.topic.replication.factor

Defaults to ‘3’. Similar to the previous value, but it configures how many copies of __consumer_offsets you want. Three copies is a safe default, and it should probably only be changed to raise it to a higher number.

More copies of the topic would make your cluster more resilient in the event of broker failure, since there would be more followers ready to take over the role of the fallen leader.

unclean.leader.election.enable

Defaults to ‘false’. It used to be ‘true’ by default because it was optimized for availability. If a leader dies without any follower being up to date, the cluster continues operating if this value is set to ‘true’. Unfortunately, data loss would result.

However, after Aphyr roasted Kafka for this data loss scenario, Kafka introduced this configuration value and eventually changed it to ‘false’ to prevent data loss. With this default, the affected partitions stay unavailable until a replica that was up to date with the fallen leader comes back (potentially, the recovered leader itself), preventing any loss.
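The choice between availability and durability can be sketched as a tiny decision function. This is a toy model of a single partition, not Kafka's election code: broker ids, the function name and the returned tuple are my own, and it assumes the replica list order expresses leader preference.

```python
from typing import List, Optional, Set, Tuple


def elect_leader(
    replicas: List[int],
    in_sync: Set[int],
    alive: Set[int],
    unclean_enabled: bool,
) -> Tuple[Optional[int], bool]:
    """Return (new_leader, data_loss_possible) for one partition."""
    # Clean election: any live replica that was in sync with the old leader.
    for broker in replicas:
        if broker in alive and broker in in_sync:
            return broker, False
    # Unclean election: settle for any live replica, accepting possible data loss.
    if unclean_enabled:
        for broker in replicas:
            if broker in alive:
                return broker, True
    # No eligible replica: the partition stays offline until one comes back.
    return None, False


# Broker 1 led the partition and was the only in-sync replica, then died.
print(elect_leader([1, 2, 3], in_sync={1}, alive={2, 3}, unclean_enabled=False))
print(elect_leader([1, 2, 3], in_sync={1}, alive={2, 3}, unclean_enabled=True))
# Broker 1 recovers, so a clean election is possible again.
print(elect_leader([1, 2, 3], in_sync={1}, alive={1, 2, 3}, unclean_enabled=False))
```

With the flag off, the first call returns no leader at all, which is the unavailability you accept in exchange for not losing acknowledged messages.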

Summary

There are many more configuration values that play essential roles on the broker side, and we haven’t even mentioned any of the values on the client side (e.g., consumers, producers). In following posts, I’ll jump into those and describe which defaults are sensible and which ones you should think twice about before blindly embracing.

3 thoughts on “Kafka defaults that you should re-consider (I)”

Mayank Madhav says:

June 12, 2020 at 11:10 pm

Thanks for the compilation.

Stanislav Kozlovski says:

July 25, 2024 at 11:40 am

offsets.commit.required.acks

Defaults to ‘-1’, which means messages are not acknowledged by a leader until they the min.in.sync.replicas value for the topic is honored.

That is a safe default, falling on the side of durability, versus lower latency. You should consider particular configurations at the topic level, dependent on the nature of the stored information (e.g., ‘logs’ been a lower value than ‘orders’).

Few things wrong here:

1. This only applies to the OffsetCommit request made by consumers toward the __consumer_offsets topic.

2. Acknowledged by a leader until they the min.in.sync.replicas value for the topic is honored.

Many people get this wrong. The acks=all setting (or acks=-1, as in this particular case) requires ALL alive replicas to acknowledge the write. min.insync.replicas only denotes when an error is thrown.

See https://www.linkedin.com/pulse/kafka-acks-explained-stanislav-kozlovski/

This config is a legacy config (10+ years now) and isn’t even supported properly. That’s why it’s being removed with Kafka 4.0 – https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=303794933

  Javier Holguera says:
  
  July 25, 2024 at 4:51 pm
  I agree my entry didn’t make it clear that the setting is exclusively related to offset commits and not general “produce” ACKs.
  
  Regarding how ACKs work in coordination with `min.insync.replicas`, you are right. I have always assumed that replicas over `min.insync.replicas` would be asynchronous, i.e., the broker would not wait for them before returning to the producer (in an ACKs=-1 scenario). Based on your comment (and link), you are saying that the copy is synchronous as long as the extra replicas stay “in-sync”.
  
  I suppose only three things can happen here:
  1. The extra replicas (i.e., over `min.insync.replicas`) stay in-sync, in which case they shouldn’t add much more latency to the overall “produce” process as they would happen in parallel with the other replicas.
  2. The extra replicas are already out-of-sync so they get ignored and latency stays the same.
  3. The extra replicas “fall out” of in-sync during this produce request, in which case we are taking extra latency “unnecessarily” since we have configured our minimum durability with `min.insync.replicas`
  For me, it would make sense that those extra replicas are asynchronously copied to avoid the unnecessary latency for scenario 3 (and potentially ‘some’ in scenario 1) but I’ll take your word for it.
  
  Thanks for your input.

Published by Javier Holguera
