HCX: Kafka service fails to start after increasing HCX Manager disk space on version 4.10 or higher

Article ID: 430264

Products

VMware HCX

Issue/Introduction

  • VMware HCX Manager services, specifically the Kafka and app-engine services, fail to initialize after a manual disk expansion of the HCX Manager appliance.
    The Kafka service remains in a 'failed' or 'stopped' state.
  • In an SSH session to the HCX Manager, the app-engine service is stuck in the activating state because Kafka is not running:
    admin@hcx-manager-hostname [ ~ ]$ systemctl status app-engine
     app-engine.service - App-Engine
         Loaded: loaded (/etc/systemd/system/app-engine.service; enabled; vendor preset: enabled)
         Active: activating (start-pre) since Thu <YYYY-MM-DD hh:mm:ss> UTC; 9min ago
    Cntrl PID: 25616 (service-depende)
          Tasks: 2
         Memory: 420.0K
         CGroup: /system.slice/app-engine.service
                 ├─ 9572 sleep 30
                 └─25616 /bin/bash /etc/systemd/service-dependency-check.sh postgresdb database-upgrade zookeeper kafka

<MMM DD hh:mm:ss> hcx-manager-hostname service-dependency-check.sh[25616]: kafka is not running.

  • /common/logs/kafka/server.log reports the following entries:

ERROR Error while writing meta.properties to /common/kafka-db (org.apache.kafka.storage.internals.log.LogDirFailureChannel)
java.nio.file.FileAlreadyExistsException: /common/kafka-db

  • /common/logs/admin/app.log reports the following entries:

[kafka-producer-network-thread | producer-TID_####-####-####-####, , , TxId: ] WARN  o.apache.kafka.clients.NetworkClient- [Producer clientId=producer-TID_####-####-####-####, transactionalId=TID_####-####-####-####]
Connection to node 0 (localhost/127.xxx.0.1:9092) could not be established. Broker may not be available.

  • Filesystem inspection reveals the /common/kafka-db path is inconsistent with the expected post-resize structure.
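
The symptoms listed above can be confirmed in one pass from an SSH session. The following is a minimal sketch, not a supported tool: the helper name count_matches is hypothetical, and the unit names postgresdb, zookeeper and kafka are assumed to match the arguments passed to service-dependency-check.sh in the output above. The log paths and search strings are taken from the excerpts in this article.

```shell
#!/bin/sh
# Count lines containing a fixed string in a log file.
# Prints 0 when the file is unreadable or nothing matches.
count_matches() {
    # $1 = fixed string to search for, $2 = log file
    if [ -r "$2" ]; then
        grep -cF -- "$1" "$2" || true
    else
        echo 0
    fi
}

# Error signatures quoted in this article.
echo "Kafka FileAlreadyExistsException lines: $(count_matches 'java.nio.file.FileAlreadyExistsException: /common/kafka-db' /common/logs/kafka/server.log)"
echo "Broker-unavailable warnings: $(count_matches 'Broker may not be available' /common/logs/admin/app.log)"

# Unit names are an assumption based on the dependency-check arguments.
for unit in postgresdb zookeeper kafka app-engine; do
    printf '%s: %s\n' "$unit" "$(systemctl is-active "$unit" 2>/dev/null || true)"
done
```

Non-zero counts for both signatures, together with a kafka unit that is not active, are consistent with the issue described here.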

 

Environment

VMware HCX

 

Cause

The disk expansion workflow for HCX 4.11.3 and above requires manual redirection of database directories from the /common partition to the newly expanded /common_ext partition. The failure occurs because the kafka-db and postgres-db directories were not correctly migrated or symlinked, preventing the services from accessing their data stores.

In addition, three more services that depend on Postgres and Kafka need to be stopped before the configuration changes are applied and started again afterwards.
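
To see whether the data directories were redirected, the state of each path can be inspected read-only. This is a sketch under stated assumptions: the function name describe_path is hypothetical, /common/postgres-db is an assumed path (only /common/kafka-db appears in the logs above), and the expectation that a correctly migrated path resolves into /common_ext is inferred from the cause described here rather than from the resize procedure itself.

```shell
#!/bin/sh
# Report whether a path is a symlink, a real directory, or absent.
describe_path() {
    if [ -L "$1" ]; then
        printf '%s: symlink -> %s\n' "$1" "$(readlink "$1")"
    elif [ -d "$1" ]; then
        printf '%s: directory\n' "$1"
    elif [ -e "$1" ]; then
        printf '%s: exists but is not a directory\n' "$1"
    else
        printf '%s: missing\n' "$1"
    fi
}

# /common/kafka-db is taken from the Kafka log; /common/postgres-db is assumed.
for p in /common/kafka-db /common/postgres-db; do
    describe_path "$p"
done

# Show the size and usage of both partitions, if they are mounted.
df -h /common /common_ext 2>/dev/null || true
```

The script changes nothing on the appliance. Any correction of the directories should follow the referenced disk-space KB article rather than ad hoc commands.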

Resolution

For HCX Manager 4.11.3 and above, follow the steps in the KB article "Increasing HCX Manager Disk Space for HCX Software Version 4.10 or Higher".

Additional Information

Increasing HCX Manager Disk Space for HCX Software Version 4.10 or Higher