Consecutive HCX Manager reboots may result in the app-engine and Kafka messaging services not coming online
Article ID: 367123

Products

VMware HCX

Issue/Introduction

  • Logging in to the HCX UI fails with the error message "Invalid username or password, or too many active sessions" even though the credentials are valid.
  • When accessing HCX on port 9443, the page does not load and no login page is displayed.
  • The HCX Dashboard never loads via the vCenter plugin.
  • The remote HCX Cloud Manager reports that site pairing is down.
  • After consecutive HCX Manager reboots or shutdown/power-on events, the app-engine service and the Kafka messaging service may not come online. This issue is observed intermittently.
  • From an SSH session to the HCX Manager, the app-engine service is stuck in the activating state because Kafka is not running:
admin@hcx-manager-hostname [ ~ ]$ systemctl status app-engine
● app-engine.service - App-Engine
     Loaded: loaded (/etc/systemd/system/app-engine.service; enabled; vendor preset: enabled)
     Active: activating (start-pre) since Thu <YYYY-MM-DD hh:mm:ss> UTC; 9min ago
Cntrl PID: 25616 (service-depende)
      Tasks: 2
     Memory: 420.0K
     CGroup: /system.slice/app-engine.service
             ├─ 9572 sleep 30
             └─25616 /bin/bash /etc/systemd/service-dependency-check.sh postgresdb database-upgrade zookeeper kafka

<MMM DD hh:mm:ss> hcx-manager-hostname service-dependency-check.sh[25616]: kafka is not running.
<MMM DD hh:mm:ss> hcx-manager-hostname service-dependency-check.sh[25616]: kafka is not running.
<MMM DD hh:mm:ss> hcx-manager-hostname service-dependency-check.sh[25616]: kafka is not running.
<MMM DD hh:mm:ss> hcx-manager-hostname service-dependency-check.sh[25616]: kafka is not running.
<MMM DD hh:mm:ss> hcx-manager-hostname service-dependency-check.sh[25616]: kafka is not running.
<MMM DD hh:mm:ss> hcx-manager-hostname service-dependency-check.sh[25616]: kafka is not running.
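
The check shown above can also be scripted. The following is a minimal sketch and not a Broadcom-provided tool: app_engine_blocked_on_kafka is a hypothetical helper name, and the two strings it searches for are taken from the sample output above. It reads the status text on standard input and changes nothing on the appliance.

```shell
#!/bin/sh
# Hypothetical helper (not part of HCX): reads `systemctl status app-engine`
# output on stdin and returns 0 only if the unit is still activating AND the
# dependency check has logged "kafka is not running".
app_engine_blocked_on_kafka() {
    status_text=$(cat)
    printf '%s\n' "$status_text" | grep -q 'Active: activating' || return 1
    printf '%s\n' "$status_text" | grep -q 'kafka is not running' || return 1
    return 0
}

# Example usage on the HCX Manager (read-only):
#   systemctl status app-engine --no-pager | app_engine_blocked_on_kafka \
#       && echo "app-engine is waiting on kafka"
```

A zero exit status only indicates that the output matches this symptom; it does not confirm the root cause on its own.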

 

  • The Kafka server.log file under /common/logs/kafka contains errors such as the following:
[<YYYY-MM-DD hh:mm:ss.sss>] INFO Error while loading logs in /common/kafka-db/__transaction_state-8 in 3ms (105/175 completed in /common/kafka-db) (kafka.log.LogManager)

[<YYYY-MM-DD hh:mm:ss.sss>] ERROR There was an error in one of the threads during logs loading: org.apache.kafka.common.errors.CorruptRecordException: Found record size 0 smaller than minimum record overhead (14) in file /common/kafka-db/__transaction_state-8/00000000000000000000.log. (kafka.log.LogManager)

[<YYYY-MM-DD hh:mm:ss.sss>] ERROR [KafkaServer id=0] Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
org.apache.kafka.common.errors.CorruptRecordException: Found record size 0 smaller than minimum record overhead (14) in file /common/kafka-db/__transaction_state-8/00000000000000000000.log.

 

or

 

[<YYYY-MM-DD hh:mm:ss.sss>] ERROR Exiting Kafka due to fatal exception during startup. (kafka.Kafka$)
org.apache.kafka.common.errors.CorruptRecordException: Found record size 0 smaller than minimum record overhead (14) in file /common/kafka-db/__transaction_state-3/00000000000003559658.log.
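
To summarize which segment files the log names, a sketch along the following lines may help when collecting details for a support case. list_corrupt_kafka_segments is a hypothetical helper name, and the pattern assumes the message format shown in the samples above. It only reads the log and prints paths. It does not delete or repair anything, since the workaround must be carried out by a Broadcom engineer.

```shell
#!/bin/sh
# Hypothetical helper (not part of HCX): prints the unique Kafka segment files
# named in CorruptRecordException messages read from stdin.
list_corrupt_kafka_segments() {
    grep 'CorruptRecordException' \
        | grep -o 'in file [^ ]*\.log' \
        | sed 's/^in file //' \
        | sort -u
}

# Example usage on the HCX Manager (read-only):
#   list_corrupt_kafka_segments < /common/logs/kafka/server.log
```

No output means that no matching message was found in the input. It does not rule out other Kafka startup failures.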

 

  • Run the following command at the admin prompt of the HCX Manager:
    $ journalctl
  • Examine the output for entries similar to those in the example below.
<MMM DD hh:mm:ss> <HOST NAME> pre-kafka-start[6192]: WATCHER::
<MMM DD hh:mm:ss> <HOST NAME> pre-kafka-start[6192]: WatchedEvent state:SyncConnected type:None path:null
<MMM DD hh:mm:ss> <HOST NAME> pre-kafka-start[6192]: 2<hh:mm:ss.sss> [main-SendThread(localhost:2181)] DEBUG org.apache.zookeeper.ClientCnxn - Reading reply session id: 0x################, packet:: clientPath:null serverPath:null finished:false header:: 1,8  replyHeader:: 1,526433,-101  request:: '/controller,F  response:: v{}
<MMM DD hh:mm:ss> <HOST NAME> pre-kafka-start[6192]: Node does not exist: /controller
<MMM DD hh:mm:ss> <HOST NAME> pre-kafka-start[6192]: <hh:mm:ss.sss> [main] ERROR org.apache.zookeeper.util.ServiceUtils - Exiting JVM with code 1
<MMM DD hh:mm:ss> <HOST NAME> pre-kafka-start[6214]: Connecting to localhost:2181
<MMM DD hh:mm:ss> <HOST NAME> pre-kafka-start[6214]: <hh:mm:ss.sss> [main] INFO org.apache.zookeeper.ZooKeeper - Client environment:zookeeper.version=#.#.#-#######, built on <YYYY-MM-DD hh:mm> UTC
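
Because the full journal can be long, the entries of interest can be narrowed down with a filter. This is a sketch under stated assumptions: count_missing_controller is a hypothetical helper name, and it assumes the pre-kafka-start tag and the "Node does not exist: /controller" message appear exactly as in the example above.

```shell
#!/bin/sh
# Hypothetical helper (not part of HCX): reads journal text on stdin and prints
# how many pre-kafka-start lines report the missing /controller node.
count_missing_controller() {
    grep 'pre-kafka-start\[' | grep -c 'Node does not exist: /controller'
}

# Example usage on the HCX Manager (read-only):
#   journalctl --no-pager | count_missing_controller
```

A non-zero count indicates that the journal contains entries like those in the example; the surrounding lines should still be reviewed in context.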

Cause

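This issue can occur after consecutive HCX Manager reboots or shutdown/power-on events, or after an outage in the environment where the appliance resides. As the log entries above show, Kafka then fails to start with a CorruptRecordException on a log file under /common/kafka-db, and app-engine remains in the activating state while it waits for Kafka.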

Resolution

This issue will be fixed in a future HCX software release.

The workaround involves deleting files and must be carried out by a Broadcom engineer. If you believe you have encountered this issue, open a support case with Broadcom Support and reference this KB article.

For more information, see Creating and managing Broadcom support cases.