HCX - Consecutive HCX Manager reboots may result in the app-engine and Kafka messaging services not coming online


Article ID: 367123


Products

VMware HCX

Issue/Introduction

  • After consecutive HCX Manager reboots or shutdown / power-on events, the app-engine service and the Kafka messaging service may not come online. This issue is observed intermittently.
  • In an SSH session to the HCX Manager, the app-engine service remains stuck in the activating state because Kafka is not running:
admin@hcx-manager-hostname [ ~ ]$ systemctl status app-engine
● app-engine.service - App-Engine
     Loaded: loaded (/etc/systemd/system/app-engine.service; enabled; vendor preset: enabled)
     Active: activating (start-pre) since Thu 2023-12-07 16:47:33 UTC; 9min ago
Cntrl PID: 25616 (service-depende)
      Tasks: 2
     Memory: 420.0K
     CGroup: /system.slice/app-engine.service
             ├─ 9572 sleep 30
             └─25616 /bin/bash /etc/systemd/service-dependency-check.sh postgresdb database-upgrade zookeeper kafka

Dec 07 16:52:40 hcx-manager-hostname service-dependency-check.sh[25616]: kafka is not running.
Dec 07 16:53:10 hcx-manager-hostname service-dependency-check.sh[25616]: kafka is not running.
Dec 07 16:53:41 hcx-manager-hostname service-dependency-check.sh[25616]: kafka is not running.
Dec 07 16:54:12 hcx-manager-hostname service-dependency-check.sh[25616]: kafka is not running.
Dec 07 16:54:42 hcx-manager-hostname service-dependency-check.sh[25616]: kafka is not running.
Dec 07 16:55:13 hcx-manager-hostname service-dependency-check.sh[25616]: kafka is not running.

 

  • The Kafka server.log under /common/logs/kafka shows errors such as:
[2023-12-07 17:01:09,991] INFO Error while loading logs in /common/kafka-db/__transaction_state-8 in 3ms (105/175 completed in /common/kafka-db) (kafka.log.LogManager)

[2023-12-07 17:01:09,991] ERROR There was an error in one of the threads during logs loading: org.apache.kafka.common.errors.CorruptRecordException: Found record size 0 smaller than minimum record overhead (14) in file /common/kafka-db/__transaction_state-8/00000000000000000000.log. (kafka.log.LogManager)

[2023-12-07 17:01:09,997] ERROR [KafkaServer id=0] Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
org.apache.kafka.common.errors.CorruptRecordException: Found record size 0 smaller than minimum record overhead (14) in file /common/kafka-db/__transaction_state-8/00000000000000000000.log.

 

or

 

[2024-07-18 12:14:24,649] ERROR Exiting Kafka due to fatal exception during startup. (kafka.Kafka$)
org.apache.kafka.common.errors.CorruptRecordException: Found record size 0 smaller than minimum record overhead (14) in file /common/kafka-db/__transaction_state-3/00000000000003559658.log.
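To confirm which log segments Kafka is rejecting, the CorruptRecordException entries can be pulled out of server.log. The following is a minimal, read-only sketch in Python, assuming the error message has the same wording as in the excerpts above. The function name corrupt_segments and the regular expression are illustrative and not part of HCX or Kafka.

```python
import re

# Matches the segment path that Kafka reports in a CorruptRecordException
# message, e.g. "... in file /common/kafka-db/<partition>/<segment>.log."
# The pattern is an assumption based on the log excerpts in this article.
CORRUPT_RE = re.compile(
    r"CorruptRecordException: .*? in file (?P<path>\S+\.log)\."
)


def corrupt_segments(log_text):
    """Return the unique segment files named in CorruptRecordException lines.

    The order of first appearance is preserved. Nothing is modified on disk;
    the function only inspects the text it is given.
    """
    paths = []
    for line in log_text.splitlines():
        match = CORRUPT_RE.search(line)
        if match and match.group("path") not in paths:
            paths.append(match.group("path"))
    return paths
```

On the HCX Manager, the function could be fed the contents of /common/logs/kafka/server.log. The resulting list of segment files is useful information to include when opening the support case.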

Cause

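Consecutive HCX Manager reboots or shutdown / power-on events, or an outage in the environment where the appliance resides, can trigger this issue. As the server.log errors above show, Kafka then finds a corrupt record in a log segment under /common/kafka-db and aborts startup, which keeps app-engine waiting on its dependency check.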
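The "Found record size 0 smaller than minimum record overhead (14)" message can be illustrated with a short sketch. It assumes the Kafka on-disk segment layout in which every entry starts with an 8-byte offset followed by a 4-byte big-endian size field. A size below 14 bytes is treated as corrupt, so a zero-filled region at an entry boundary yields "record size 0". The function and constant names below are hypothetical, and the code only reads bytes it is given. It is an illustration of the check, not the workaround.

```python
import struct

LOG_OVERHEAD = 12         # 8-byte offset + 4-byte size in front of each entry
MIN_RECORD_OVERHEAD = 14  # the "(14)" quoted in the error message


def first_undersized_record(data):
    """Walk size-prefixed entries in a segment's bytes.

    Returns (byte_position, size) for the first entry whose size field is
    below the minimum, or None if no such entry is found.
    """
    position = 0
    while position + LOG_OVERHEAD <= len(data):
        # ">qi" = big-endian int64 offset followed by int32 size.
        _offset, size = struct.unpack_from(">qi", data, position)
        if size < MIN_RECORD_OVERHEAD:
            return position, size
        if position + LOG_OVERHEAD + size > len(data):
            break  # incomplete trailing entry; nothing more to walk
        position += LOG_OVERHEAD + size
    return None
```

For example, a buffer holding one well-formed 14-byte entry followed by zero bytes is reported as undersized at the position where the zeros begin, which matches the "record size 0" wording in the errors.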

Resolution

This issue will be fixed in a future HCX software release.

The workaround involves deleting files and must be carried out by a Broadcom engineer. Contact Broadcom Support for assistance.