Consecutive HCX Manager reboots may result in the app-engine and Kafka messaging services not coming online

Products

VMware HCX

Issue/Introduction

Logging in to the HCX UI fails with the error message "Invalid username or password, or too many active sessions" even though the credentials are valid.
When accessing HCX on port 9443, the page does not load or no login page is presented.
The HCX Dashboard never loads in the vCenter plugin.
The remote HCX Cloud Manager reports that site pairing is down.
After consecutive HCX Manager reboots or shut down / power on events, the app-engine service and the Kafka messaging service may not come online. This issue is observed intermittently.
From an SSH session to the HCX Manager, the app-engine service is stuck in the activating state because Kafka is not running:

admin@hcx-manager-hostname [ ~ ]$ systemctl status app-engine
● app-engine.service - App-Engine
     Loaded: loaded (/etc/systemd/system/app-engine.service; enabled; vendor preset: enabled)
     Active: activating (start-pre) since Thu <YYYY-MM-DD hh:mm:ss> UTC; 9min ago
Cntrl PID: 25616 (service-depende)
      Tasks: 2
     Memory: 420.0K
     CGroup: /system.slice/app-engine.service
             ├─ 9572 sleep 30
             └─25616 /bin/bash /etc/systemd/service-dependency-check.sh postgresdb database-upgrade zookeeper kafka

<MMM DD hh:mm:ss> hcx-manager-hostname service-dependency-check.sh[25616]: kafka is not running.
<MMM DD hh:mm:ss> hcx-manager-hostname service-dependency-check.sh[25616]: kafka is not running.
<MMM DD hh:mm:ss> hcx-manager-hostname service-dependency-check.sh[25616]: kafka is not running.
<MMM DD hh:mm:ss> hcx-manager-hostname service-dependency-check.sh[25616]: kafka is not running.
<MMM DD hh:mm:ss> hcx-manager-hostname service-dependency-check.sh[25616]: kafka is not running.
<MMM DD hh:mm:ss> hcx-manager-hostname service-dependency-check.sh[25616]: kafka is not running.

The Kafka server.log under /common/logs/kafka contains entries like the following:

[<YYYY-MM-DD hh:mm:ss.sss>] INFO Error while loading logs in /common/kafka-db/__transaction_state-8 in 3ms (105/175 completed in /common/kafka-db) (kafka.log.LogManager)

[<YYYY-MM-DD hh:mm:ss.sss>] ERROR There was an error in one of the threads during logs loading: org.apache.kafka.common.errors.CorruptRecordException: Found record size 0 smaller than minimum record overhead (14) in file /common/kafka-db/__transaction_state-8/00000000000000000000.log. (kafka.log.LogManager)

[<YYYY-MM-DD hh:mm:ss.sss>] ERROR [KafkaServer id=0] Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
org.apache.kafka.common.errors.CorruptRecordException: Found record size 0 smaller than minimum record overhead (14) in file /common/kafka-db/__transaction_state-8/00000000000000000000.log.

or

[<YYYY-MM-DD hh:mm:ss.sss>] ERROR Exiting Kafka due to fatal exception during startup. (kafka.Kafka$)
org.apache.kafka.common.errors.CorruptRecordException: Found record size 0 smaller than minimum record overhead (14) in file /common/kafka-db/__transaction_state-3/00000000000003559658.log.
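To see at a glance which segment files Kafka is complaining about, the log can be searched for the exception name. The following is a minimal read-only sketch, not a Broadcom-provided tool: the function name list_corrupt_segments is hypothetical, and the default path assumes the server.log location given above. It only reads the log and does not modify or delete anything.

```shell
# Print each distinct Kafka segment file named in a CorruptRecordException line.
# The default path is the server.log location given in this article.
list_corrupt_segments() {
    local log_file="${1:-/common/logs/kafka/server.log}"
    grep -F 'CorruptRecordException' "$log_file" \
        | grep -oE 'in file [^ ]+\.log' \
        | sed 's/^in file //' \
        | sort -u
}

# Read-only usage on the HCX Manager:
# list_corrupt_segments
```

Any paths it prints are useful to include when opening the support case. The files themselves should be left untouched.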

Run the following command at the admin prompt of the HCX Manager:
$ journalctl
Examine the output for entries similar to those in the example below.

<MMM DD hh:mm:ss> <HOST NAME> pre-kafka-start[6192]: WATCHER::
<MMM DD hh:mm:ss> <HOST NAME> pre-kafka-start[6192]: WatchedEvent state:SyncConnected type:None path:null
<MMM DD hh:mm:ss> <HOST NAME> pre-kafka-start[6192]: 2<hh:mm:ss.sss> [main-SendThread(localhost:2181)] DEBUG org.apache.zookeeper.ClientCnxn - Reading reply session id: 0x################, packet:: clientPath:null serverPath:null finished:false header:: 1,8  replyHeader:: 1,526433,-101  request:: '/controller,F  response:: v{}
<MMM DD hh:mm:ss> <HOST NAME> pre-kafka-start[6192]: Node does not exist: /controller
<MMM DD hh:mm:ss> <HOST NAME> pre-kafka-start[6192]: <hh:mm:ss.sss> [main] ERROR org.apache.zookeeper.util.ServiceUtils - Exiting JVM with code 1
<MMM DD hh:mm:ss> <HOST NAME> pre-kafka-start[6214]: Connecting to localhost:2181
<MMM DD hh:mm:ss> <HOST NAME> pre-kafka-start[6214]: <hh:mm:ss.sss> [main] INFO org.apache.zookeeper.ZooKeeper - Client environment:zookeeper.version=#.#.#-#######, built on <YYYY-MM-DD hh:mm> UTC
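Because the full journal is long, it can help to narrow the output to the relevant lines. The sketch below is an illustration under assumptions: the function name filter_pre_kafka_errors is hypothetical, and the match strings are taken from the sample entries above. It reads journal text on standard input and keeps only pre-kafka-start lines that report a missing node or a JVM exit.

```shell
# Keep only pre-kafka-start journal lines that report a missing znode or a JVM exit.
# Reads journal text on stdin, so it works with any journalctl invocation.
filter_pre_kafka_errors() {
    grep -F 'pre-kafka-start[' | grep -E 'Node does not exist|Exiting JVM with code'
}

# Usage on the HCX Manager (current boot only, no pager):
# journalctl -b --no-pager | filter_pre_kafka_errors
```

The -b and --no-pager options are standard journalctl flags that limit output to the current boot and disable paging.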

Cause

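Consecutive HCX Manager reboots or shut down / power on events.

An outage in the environment where the appliance resides.

In either case, Kafka log segment files under /common/kafka-db can be left corrupted, as the CorruptRecordException entries in server.log indicate. Kafka then fails to start, and app-engine remains in the activating state while it waits for Kafka.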

Resolution

This issue will be fixed in a future HCX software release.

The workaround involves file deletions and needs to be carried out by a Broadcom engineer. If you believe you have encountered this issue, please open a support case with Broadcom Support and refer to this KB article.

For more information, see Creating and managing Broadcom support cases.

Article ID: 367123
