Kafka Pitfall Notes

Some notes on how I understand Kafka, and the pitfalls I have run into.

Kafka Overview

(To be added)

Problems Encountered

Leader election during rolling update.

Observation

When the Kafka cluster receives a request during a rolling update, the producer gets back an exception:

org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.. Going to request metadata update now

Producer client log:

WARN  [org.apache.kafka.clients.producer.internals.Sender] [Producer clientId=<client-id>] Got error produce response with correlation id <id> on topic-partition <partition-id>, retrying (9 attempts left). Error: NOT_LEADER_FOR_PARTITION
WARN [org.apache.kafka.clients.producer.internals.Sender] [Producer clientId=<client-id>] Received invalid metadata error in produce request on partition <partition-id> due to org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.. Going to request metadata update now

Question: how can this be avoided, and how can it be fixed?

Analysis

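Changing the underlying configuration triggers a rolling update of the Kafka cluster. During the update, when the roll reaches a broker that is currently a partition leader, the partitions it leads are left without a leader and a new one has to be elected.

During that election window the affected partitions really do have no leader (leadership in Kafka is per partition, not per cluster), and the producer's cached metadata still points at the old broker, so produce requests fail with the error above until the metadata is refreshed.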

Proposal

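Kafka has a built-in retry mechanism in the producer client, enabled by configuring ProducerConfig.RETRIES_CONFIG and ProducerConfig.RETRY_BACKOFF_MS_CONFIG. Here is the official documentation string for retries from ProducerConfig: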

private static final String RETRIES_DOC = "Setting a value greater than zero will cause the client to resend any record whose send fails with a potentially transient error."
+ " Note that this retry is no different than if the client resent the record upon receiving the error."
+ " Allowing retries without setting <code>" + MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION + "</code> to 1 will potentially change the"
+ " ordering of records because if two batches are sent to a single partition, and the first fails and is retried but the second"
+ " succeeds, then the records in the second batch may appear first. Note additionall that produce requests will be"
+ " failed before the number of retries has been exhausted if the timeout configured by"
+ " <code>" + DELIVERY_TIMEOUT_MS_CONFIG + "</code> expires first before successful acknowledgement. Users should generally"
+ " prefer to leave this config unset and instead use <code>" + DELIVERY_TIMEOUT_MS_CONFIG + "</code> to control"
+ " retry behavior.";
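
As a rough illustration of the settings discussed here, the sketch below builds producer properties using the plain string keys behind the ProducerConfig constants. The class name ProducerRetryConfig and every value are assumptions chosen for illustration, not tuned recommendations; the idea is only that the retry window should be long enough to outlast a leader election.

```java
import java.util.Properties;

public class ProducerRetryConfig {

    /**
     * Builds producer properties intended to ride out a short leader election.
     * All values are illustrative assumptions, not recommendations.
     */
    public static Properties build(String bootstrapServers) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        // ProducerConfig.RETRIES_CONFIG: how many times a failed send is retried.
        props.setProperty("retries", "10");
        // ProducerConfig.RETRY_BACKOFF_MS_CONFIG: wait between retries, so the
        // retries are not all used up before a new leader has been elected.
        props.setProperty("retry.backoff.ms", "500");
        // ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG: upper bound on the total
        // time for a send, including retries; it can expire before the retry
        // count is exhausted.
        props.setProperty("delivery.timeout.ms", "120000");
        // ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION: 1 keeps record
        // order when a batch is retried, at some cost in throughput.
        props.setProperty("max.in.flight.requests.per.connection", "1");
        return props;
    }

    public static void main(String[] args) {
        // Serializer settings are still required before these properties can
        // be handed to a KafkaProducer; they are omitted here.
        Properties props = build("localhost:9092");
        props.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

The log above shows "9 attempts left" after the first failure, which suggests retries was 10 in my setup. With a short backoff, ten retries can be used up in about a second, so lengthening the backoff may help more than raising the count.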

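Besides the producer's built-in retry, as the documentation string above notes, the application can also resend the record itself on the client side.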
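
A client-side resend can be sketched as a small wrapper that repeats an action a bounded number of times with a growing pause in between. This is a minimal sketch, not how my code actually looks: ResendHelper, callWithResend and their parameters are hypothetical names invented for the example.

```java
import java.util.concurrent.Callable;

public class ResendHelper {

    /**
     * Runs the action, resending on failure up to maxAttempts times in total.
     * The pause doubles after every failed attempt. Hypothetical helper.
     */
    public static <T> T callWithResend(Callable<T> action, int maxAttempts, long initialBackoffMs)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long backoffMs = initialBackoffMs;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (InterruptedException e) {
                // Do not retry an interrupted call; restore the flag and give up.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                last = e;
                if (attempt == maxAttempts) {
                    break; // out of attempts, rethrow below
                }
                Thread.sleep(backoffMs);
                backoffMs *= 2;
            }
        }
        throw last;
    }
}
```

With a real producer, the action would be something like `() -> producer.send(record).get()`, where producer and record stand for your own KafkaProducer and ProducerRecord. Keep in mind that a resend can write a duplicate if the original send actually reached the broker but its acknowledgement was lost.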

Other issue

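To be honest, this is still not solved. Even with internal retry plus client resend, errors can still occur and the request fails to be written. The current workaround is... to keep increasing the number of client-side resends!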

It is Likely That The Consumer Was Kicked Out Of The Group

Observation

Offset commit cannot be completed since the consumer is not part of an active group for auto partition assignment; it is likely that the consumer was kicked out of the group

For now, here are a few articles by other people:

KAFKA Says: It is Likely That The Consumer Was Kicked Out Of The Group | Hacker Noon

kafka 0.10.1一些使用经验 (Some experience with using Kafka 0.10.1) - 简书 (jianshu.com)

INVALID_FETCH_SESSION_EPOCH

Observation

A freshly set up machine stops working after sitting for a day or two, with the following error:

[org.apache.kafka.clients.FetchSessionHandler] [Consumer clientId=xxxxxxxxxxxxxxxxxx-a2c15833-af73-4c69-a515-652d42fa6da1-StreamThread-1-consumer, groupId=xxxxxxxxxxxxxxxxxxxx] Node 1 was unable to process the fetch request with (sessionId=866458856, epoch=3657): INVALID_FETCH_SESSION_EPOCH.

Analysis

The web is full of blog posts copy-pasted from one another that just say to upgrade the version. Useless noise.
