Zookeeper is a critical component in the VMware NSX Application Platform (NAPP) that coordinates distributed services such as Kafka by providing centralized configuration management, synchronization, and leader election. It ensures that dependent services like Kafka operate reliably across multiple nodes using quorum-based fault tolerance.
This Knowledge Base (KB) article explains how to identify and recover from a situation where one of the Zookeeper pods becomes stuck in a CrashLoopBackOff state due to snapshot corruption, resulting in the NSX Manager UI showing the warning:
NSX Application Platform Health - Service K8S_MESSAGING_SERVICE is degraded
Although the Zookeeper and Kafka clusters may remain operational due to quorum, this failure can still impact overall system stability, degrade dependent services, and rapidly fill logs with repeated error messages.
How to Identify a Zookeeper Crash Loop
Use the NAPP CLI as a root user to check Zookeeper pod statuses:
root@example-nsx01:~# napp-k get pods -A | grep zoo
nsxi-platform zookeeper-0 1/1 Running 1 (99d ago) 286d
nsxi-platform zookeeper-1 0/1 CrashLoopBackOff 681 (2m29s ago) 286d
nsxi-platform zookeeper-2 1/1 Running 1 (149d ago) 286d
You can confirm the degraded state by inspecting the StatefulSet:
napp-k get sts zookeeper -n nsxi-platform
Despite this failure, Kafka and the Zookeeper cluster remain functional due to quorum-based fault tolerance. However, if left unaddressed, the issue can lead to cascading failures, impact dependent services, and cause excessive log growth.
Verifying Kafka Functionality:
You can verify Kafka functionality using the cluster-api pod:
napp-k get pods -n nsxi-platform | grep cluster-api
napp-k exec -it cluster-api-xxxxx -c cluster-api -- bash
Inside the container, run the following commands:
# Validate Kafka group consumption and topics:
kafka-consumer-groups.sh --bootstrap-server kafka:9092 --command-config /root/adminclient.props --all-groups --describe
kafka-topics.sh --list --bootstrap-server kafka:9092 --command-config /root/adminclient.props
kafka-console-consumer.sh --bootstrap-server kafka:9092 --consumer.config /root/adminclient.props --topic nsx2pace-config --from-beginning
Expected output: Messages from the specified Kafka topic should be displayed successfully.
Cause
This issue is more likely to occur in environments with:
Persistent Volume Claims (PVCs)
Unstable underlying storage
Infrastructure-level inconsistencies
The crash loop is triggered by corruption in the Zookeeper data directory, specifically during snapshot loading. This may stem from:
Disk-level corruption
PVC/PV layer issues
Snapshot file corruption
Zookeeper replication inconsistencies
Rare Zookeeper software defects
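If PVC/PV or underlying storage problems are suspected, a first-pass check can be run from the NAPP CLI. The commands below are illustrative; the PVC name (data-zookeeper-1) assumes the standard StatefulSet naming convention and should be confirmed in your environment:
napp-k get pvc -n nsxi-platform | grep zookeeper
napp-k describe pvc data-zookeeper-1 -n nsxi-platform
napp-k get events -n nsxi-platform | grep -i zookeeper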
Sample error message:
These log entries are written to the container's standard output and can be viewed from the NSX CLI as the root user:
napp-k logs zookeeper-1 -n nsxi-platform
<SKIP>
Invalid snapshot snapshot.4004xxxxx. len = -1622828290, byte = 156
Reading snapshot /data/zookeeper/version-2/snapshot.4004xxxxxx
Unable to load database on disk
Exiting JVM with code 1
Note: This is different from the "Unreasonable length" exception, which is covered in a separate KB article.
Resolution
To recover the affected Zookeeper pod (zookeeper-1 in this example) from a crash loop caused by snapshot corruption, follow the steps below.
Pre-check: Log collection is recommended before proceeding to ensure no broader system issues are present.
Log in to the NSX Manager as the root user:
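For example, assuming root SSH access is enabled on the appliance (replace the address with your NSX Manager FQDN or IP):
ssh root@<nsx-manager-ip>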
Verify PVCs:
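For example, using the same namespace as the earlier commands:
napp-k get pvc -n nsxi-platform | grep zookeeper
Note the exact name of the PVC bound to zookeeper-1 for the next step.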
Delete the PVC associated with zookeeper-1, then delete the crashing zookeeper-1 pod so the StatefulSet redeploys it. This forces a fresh volume to be created for zookeeper-1 when it is redeployed.
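A minimal example, assuming the PVC follows the default StatefulSet naming convention (data-zookeeper-1); substitute the actual PVC name noted in the previous step:
napp-k delete pvc data-zookeeper-1 -n nsxi-platform
napp-k delete pod zookeeper-1 -n nsxi-platform
The PVC deletion may remain pending until the pod is removed; once both are gone, the StatefulSet recreates the pod and a new PVC.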
Confirm that a new PVC has been created:
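For example:
napp-k get pvc -n nsxi-platform | grep zookeeper
The PVC for zookeeper-1 should show a Bound status and a recent age.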
Verify the pod status:
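napp-k get pods -n nsxi-platform | grep zoo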
Expected output:
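Illustrative output (restart counts and ages will differ):
nsxi-platform   zookeeper-0   1/1   Running
nsxi-platform   zookeeper-1   1/1   Running
nsxi-platform   zookeeper-2   1/1   Running
All three Zookeeper pods should report 1/1 Running, with zookeeper-1 no longer in CrashLoopBackOff.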
Once the above steps are complete and Zookeeper is confirmed to be running, the warning in the NSX Manager UI should be resolved.
Notes
There is no need to manually delete the PV. Kubernetes handles PVC/PV re-creation automatically if the reclaim policy (persistentVolumeReclaimPolicy) is set to Delete.
This recovery process utilizes Zookeeper's quorum and replication mechanisms to repopulate data safely.
zookeeper-0 must be healthy for this workaround to succeed, as it's required to rebuild the cluster state.
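To confirm the health of the remaining members before deleting the PVC, one option (assuming the standard Zookeeper tooling is available in the container image; the script path may vary) is:
napp-k exec -it zookeeper-0 -n nsxi-platform -- zkServer.sh status
The output should report the member's mode (leader or follower).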