Kafka Pitfalls

Some notes on how I understand Kafka, and the pitfalls I have run into.

Kafka Overview

(To be added.)

Problems Encountered

Leader election during rolling update.

Observation

When a Kafka cluster receives a request during a rolling update, it returns an exception:

org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.. Going to request metadata update now

Producer client log:

WARN  [org.apache.kafka.clients.producer.internals.Sender] [Producer clientId=<client-id>] Got error produce response with correlation id <id> on topic-partition <partition-id>, retrying (9 attempts left). Error: NOT_LEADER_FOR_PARTITION
WARN  [org.apache.kafka.clients.producer.internals.Sender] [Producer clientId=<client-id>] Received invalid metadata error in produce request on partition <partition-id> due to org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.. Going to request metadata update now

Question: how can this be avoided, and how can it be fixed?

Analysis

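Changing the underlying configuration triggers a rolling update of the Kafka cluster. When the roll reaches a broker that is currently the leader for some partitions, those partitions are left without a leader and a new one has to be elected.

During that election window the affected partitions really have no leader, and a producer whose cached metadata still points at the old broker gets this error until it refreshes its metadata.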

Proposal

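The Kafka producer client has a built-in retry mechanism, enabled by setting ProducerConfig.RETRIES_CONFIG and ProducerConfig.RETRY_BACKOFF_MS_CONFIG. Here is the official documentation string from ProducerConfig: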

private static final String RETRIES_DOC = "Setting a value greater than zero will cause the client to resend any record whose send fails with a potentially transient error."
        + " Note that this retry is no different than if the client resent the record upon receiving the error."
        + " Allowing retries without setting <code>" + MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION + "</code> to 1 will potentially change the"
        + " ordering of records because if two batches are sent to a single partition, and the first fails and is retried but the second"
        + " succeeds, then the records in the second batch may appear first. Note additionally that produce requests will be"
        + " failed before the number of retries has been exhausted if the timeout configured by"
        + " <code>" + DELIVERY_TIMEOUT_MS_CONFIG + "</code> expires first before successful acknowledgement. Users should generally"
        + " prefer to leave this config unset and instead use <code>" + DELIVERY_TIMEOUT_MS_CONFIG + "</code> to control"
        + " retry behavior.";
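The settings described in the doc string above could be put together roughly as follows. This is only a sketch: the class name ProducerRetryConfig and every numeric value are assumptions for illustration, not settings taken from the cluster discussed here. It uses plain string keys so it compiles without the Kafka client on the classpath; the keys are the string forms of the ProducerConfig constants, and the resulting Properties would be passed to new KafkaProducer<>(props).

```java
import java.util.Properties;

// Hypothetical helper: builds producer settings that ride out a short leader election.
public class ProducerRetryConfig {

    public static Properties build(String bootstrapServers) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.setProperty("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        // ProducerConfig.RETRIES_CONFIG: how many times a failed send is retried.
        // 10 is an assumed value (the log above shows "9 attempts left").
        props.setProperty("retries", "10");
        // ProducerConfig.RETRY_BACKOFF_MS_CONFIG: wait between retries, so the
        // election has time to finish. 500 ms is an assumed value.
        props.setProperty("retry.backoff.ms", "500");

        // ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG: upper bound on the total time
        // for one record, retries included. It must be at least
        // linger.ms + request.timeout.ms.
        props.setProperty("delivery.timeout.ms", "120000");
        props.setProperty("request.timeout.ms", "30000");
        props.setProperty("linger.ms", "5");

        // Retries can reorder batches unless in-flight requests are limited to 1.
        // With idempotence enabled, ordering is kept for up to 5 in-flight requests.
        props.setProperty("enable.idempotence", "true");
        props.setProperty("acks", "all");
        props.setProperty("max.in.flight.requests.per.connection", "5");
        return props;
    }

    public static void main(String[] args) {
        // Print the settings; in real code: new KafkaProducer<>(build(...)).
        build("localhost:9092").list(System.out);
    }
}
```

If the election takes longer than retries × retry.backoff.ms, or longer than delivery.timeout.ms, the send still fails, so these values need to be sized against how long a broker restart actually takes in your cluster.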

Besides the producer's internal retry, as the JavaDoc above says, the calling application can also resend the record itself.
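An application-level resend can be sketched as a small retry wrapper with backoff. The names ResendWithBackoff and callWithResend, and the 5-second backoff cap, are hypothetical and not part of the Kafka API. The wrapper takes any Callable, so with a real producer the call would look something like callWithResend(() -> producer.send(record).get(), 5, 200).

```java
import java.util.concurrent.Callable;

// Hypothetical helper: resends an operation a bounded number of times.
public class ResendWithBackoff {

    private static final long MAX_BACKOFF_MS = 5_000L; // assumed cap

    public static <T> T callWithResend(Callable<T> send, int maxAttempts,
                                       long initialBackoffMs) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long backoffMs = initialBackoffMs;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return send.call();
            } catch (Exception e) {
                // A real caller would probably resend only on transient errors
                // (for Kafka, subclasses of RetriableException) and rethrow the rest.
                last = e;
                if (attempt == maxAttempts) {
                    break;
                }
                Thread.sleep(backoffMs);
                // Double the wait each time, up to the cap.
                backoffMs = Math.min(backoffMs * 2, MAX_BACKOFF_MS);
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated send: fails twice, then succeeds.
        String result = callWithResend(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("simulated NOT_LEADER_FOR_PARTITION");
            }
            return "sent";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Note that a resend from the application is a brand-new send as far as the producer is concerned, so it can produce duplicates or reorder records if the original attempt actually reached the broker.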

Other issue

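This is actually still not solved. Even with internal retry plus client resend, an error can still come back and the request fails to be written. The current workaround is... to keep increasing the number of client-side resends!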

It is Likely That The Consumer Was Kicked Out Of The Group

Observation

Offset commit cannot be completed since the consumer is not part of an active group for auto partition assignment; it is likely that the consumer was kicked out of the group

For now, a few articles by other people:

KAFKA Says: It is Likely That The Consumer Was Kicked Out Of The Group | Hacker Noon

Some lessons from using Kafka 0.10.1 - Jianshu (jianshu.com)

INVALID_FETCH_SESSION_EPOCH

Observation

A freshly installed machine stops working after a day or two, with the following error:

[org.apache.kafka.clients.FetchSessionHandler] [Consumer clientId=xxxxxxxxxxxxxxxxxx-a2c15833-af73-4c69-a515-652d42fa6da1-StreamThread-1-consumer, groupId=xxxxxxxxxxxxxxxxxxxx] Node 1 was unable to process the fetch request with (sessionId=866458856, epoch=3657): INVALID_FETCH_SESSION_EPOCH.

Analysis

The web is full of blog posts copy-pasted from one another that just say to upgrade the version. Useless noise.