Zookeeper is a critical component in the VMware NSX Application Platform (NAPP) that coordinates distributed services such as Kafka by providing centralized configuration management, synchronization, and leader election. It ensures that dependent services like Kafka operate reliably across multiple nodes using quorum-based fault tolerance.
This Knowledge Base (KB) article explains how to identify and recover from a situation where one of the Zookeeper pods becomes stuck in a CrashLoopBackOff state due to snapshot corruption, resulting in the NSX Manager UI showing the warning:
NSX Application Platform Health - Service K8S_MESSAGING_SERVICE is degraded
Although the Zookeeper and Kafka clusters may remain operational due to quorum, this failure can still impact overall system stability, degrade dependent services, and rapidly fill logs with repeated error messages.
How to Identify a Zookeeper Crash Loop
Use the NAPP CLI as a root user to check Zookeeper pod statuses:
root@example-nsx01:~# napp-k get pods -A | grep zoo
nsxi-platform zookeeper-0 1/1 Running 1 (99d ago) 286d
nsxi-platform zookeeper-1 0/1 CrashLoopBackOff 681 (2m29s ago) 286d
nsxi-platform zookeeper-2 1/1 Running 1 (149d ago) 286d
You can confirm the degraded state by inspecting the StatefulSet:
napp-k get sts zookeeper -n nsxi-platform
Despite this failure, Kafka and the Zookeeper cluster remain functional due to quorum-based fault tolerance. However, if left unaddressed, the issue can lead to cascading failures, impact dependent services, and cause excessive log growth.
Verifying Kafka Functionality:
You can verify Kafka functionality using the cluster-api pod:
napp-k get pods -n nsxi-platform | grep cluster-api
napp-k exec -it cluster-api-xxxxx -c cluster-api -- bash
Inside the container, run the following commands:
# Validate Kafka group consumption and topics:
kafka-consumer-groups.sh --bootstrap-server kafka:9092 --command-config /root/adminclient.props --all-groups --describe
kafka-topics.sh --list --bootstrap-server kafka:9092 --command-config /root/adminclient.props
kafka-console-consumer.sh --bootstrap-server kafka:9092 --consumer.config /root/adminclient.props --topic nsx2pace-config --from-beginning
Expected output: Messages from the specified Kafka topic should be displayed successfully.
Cause
This issue is more likely to occur in environments with:
Persistent Volume Claims (PVCs)
Unstable underlying storage
Infrastructure-level inconsistencies
The crash loop is triggered by corruption in the Zookeeper data directory, specifically during snapshot loading. This may stem from:
Disk-level corruption
PVC/PV layer issues
Snapshot file corruption
Zookeeper replication inconsistencies
Rare Zookeeper software defects
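If PVC/PV or underlying storage problems are suspected, a first-pass check can be run from the NAPP CLI. The commands below are illustrative; the PVC name (data-zookeeper-1) assumes the standard StatefulSet naming convention and should be confirmed in your environment:
napp-k get pvc -n nsxi-platform | grep zookeeper
napp-k describe pvc data-zookeeper-1 -n nsxi-platform
napp-k get events -n nsxi-platform | grep -i zookeeper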
Sample error message:
These log entries are written to the container's standard output and can be viewed from the NSX CLI as the root user:
napp-k logs zookeeper-1 -n nsxi-platform
<SKIP>
Invalid snapshot snapshot.4004xxxxx. len = -1622828290, byte = 156
Reading snapshot /data/zookeeper/version-2/snapshot.4004xxxxxx
Unable to load database on disk
Exiting JVM with code 1
Note: This is different from the "Unreasonable length" exception, which is covered in a separate KB article.
Resolution
To recover the affected Zookeeper pod (zookeeper-1 in this example) from a crash loop caused by snapshot corruption, follow the steps below.
Pre-check: Log collection is recommended before proceeding to ensure no broader system issues are present.
Log in to the NSX Manager as the root user:
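For example, assuming root SSH access is enabled on the appliance (replace the address with your NSX Manager FQDN or IP):
ssh root@<nsx-manager-ip>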
Verify PVCs:
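For example, using the same namespace as the earlier commands:
napp-k get pvc -n nsxi-platform | grep zookeeper
Note the exact name of the PVC bound to zookeeper-1 for the next step.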
Delete the PVC associated with zookeeper-1, then delete the crashing zookeeper-1 pod so the StatefulSet redeploys it. This forces a fresh volume to be created for zookeeper-1 when it is redeployed.
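A minimal example, assuming the PVC follows the default StatefulSet naming convention (data-zookeeper-1); substitute the actual PVC name noted in the previous step:
napp-k delete pvc data-zookeeper-1 -n nsxi-platform
napp-k delete pod zookeeper-1 -n nsxi-platform
The PVC deletion may remain pending until the pod is removed; once both are gone, the StatefulSet recreates the pod and a new PVC.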
Confirm that a new PVC has been created:
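For example:
napp-k get pvc -n nsxi-platform | grep zookeeper
The PVC for zookeeper-1 should show a Bound status and a recent age.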
Verify the pod status:
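napp-k get pods -n nsxi-platform | grep zoo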
Expected output:
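Illustrative output (restart counts and ages will differ):
nsxi-platform   zookeeper-0   1/1   Running
nsxi-platform   zookeeper-1   1/1   Running
nsxi-platform   zookeeper-2   1/1   Running
All three Zookeeper pods should report 1/1 Running, with zookeeper-1 no longer in CrashLoopBackOff.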
Once the above steps are complete and Zookeeper is confirmed to be running, the warning in the NSX Manager UI should be resolved.
Notes
There is no need to manually delete the PV. Kubernetes handles PVC/PV re-creation automatically if the reclaim policy (persistentVolumeReclaimPolicy) is set to Delete.
This recovery process utilizes Zookeeper's quorum and replication mechanisms to repopulate data safely.
zookeeper-0 must be healthy for this workaround to succeed, as it's required to rebuild the cluster state.
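To confirm the health of the remaining members before deleting the PVC, one option (assuming the standard Zookeeper tooling is available in the container image; the script path may vary) is:
napp-k exec -it zookeeper-0 -n nsxi-platform -- zkServer.sh status
The output should report the member's mode (leader or follower).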