Pods in nsxi-platform Namespace Enter CrashLoopBackOff After SSP 5.1 Upgrade

Article ID: 418164


Updated On:

Products

VMware vDefend Firewall
VMware vDefend Firewall with Advanced Threat Prevention

Issue/Introduction

Symptoms

Multiple pods in the nsxi-platform namespace entered a CrashLoopBackOff state after upgrading to SSP 5.1.

Affected pods included:

  • kafka

  • pcap

Several dependent services (e.g., common-agent) were also impacted.
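
To confirm which pods are affected in a given environment, a quick check along the following lines (assuming kubectl access to the cluster) lists the pods that are not healthy:

# Show pods in the nsxi-platform namespace that are crash-looping or in an error state
kubectl get pods -n nsxi-platform | grep -E 'CrashLoopBackOff|Error'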

 

Kafka logs showed repeated graceful shutdown failures and TimeoutException errors:

ERROR kafka-0-metadata-loader-event-handler MetadataLoader - [MetadataLoader id=0] initializeNewPublishers: 
the loader is still catching up because we still don't know the high water mark yet.

WARN  kafka-0-raft-io-thread KafkaRaftClient - [RaftManager id=0] Graceful shutdown of RaftClient timed out after 5000ms
ERROR kafka-0-metadata-loader-event-handler KafkaEventQueue - [ControllerRegistrationManager id=0] 
Graceful shutdown of RaftClient failed

ERROR kafka-0-metadata-loader-event-handler KafkaEventQueue - [StandardAuthorizer 0] 
Failed to complete initial ACL load process.

java.util.concurrent.TimeoutException
WARN  kafka-shutdown-hook NetworkClient – Attempting to close NetworkClient that has already been closed.
INFO  kafka-shutdown-hook NodeToControllerChannelManagerImpl – Node to controller channel manager shutdown completed.
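
The log excerpt above can be collected directly from the Kafka pod. As a minimal sketch (the pod name is a placeholder; substitute the actual name reported by kubectl get pods):

# Logs of the current Kafka container
kubectl logs <kafka pod name> -n nsxi-platform
# Logs of the previous container instance, usually the relevant one for CrashLoopBackOff
kubectl logs <kafka pod name> -n nsxi-platform --previous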



Environment

SSP 5.1

Cause

The repeated TimeoutException messages and graceful shutdown failures indicate that Kafka could not complete metadata and ACL initialization within the expected timeframe.

The underlying reason was infrastructure slowness during the upgrade process.

As a result, dependent services (such as common-agent) could not establish a connection to Kafka:

failed to dial: failed to open connection to kafka:9092: dial tcp <ip>:9092: i/o timeout
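
One quick way to check whether the Kafka endpoint that common-agent dials is reachable at all (the Service name kafka and port 9092 are taken from the error above) is to verify that the Service exists and has endpoints behind it:

# Confirm the kafka Service and its backing endpoints in the namespace
kubectl get svc kafka -n nsxi-platform
kubectl get endpoints kafka -n nsxi-platform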

Resolution

1. Identify the affected pods using the following commands:

kubectl get pod -n nsxi-platform | grep kafka

kubectl get pod -n nsxi-platform | grep pcap

 

2. Delete the affected Kafka and pcap pods identified in the output above:

kubectl delete pod <kafka pod name> -n nsxi-platform

kubectl delete pod <pcap pod name> -n nsxi-platform
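
If several pods match, the two steps above can be combined into a single pass. This is only a convenience sketch; review the names it matches before running it. The deleted pods are recreated automatically by their controllers:

# Delete every kafka and pcap pod in the namespace in one command
kubectl get pods -n nsxi-platform --no-headers | grep -E 'kafka|pcap' | awk '{print $1}' | xargs kubectl delete pod -n nsxi-platform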

 

3. After the pods restart, verify that all components return to a healthy state using:

kubectl get pods -n nsxi-platform
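
To follow the restart progress rather than re-running the command, one option (the pod name is a placeholder) is to watch the namespace or wait for a recreated pod to become Ready:

# Watch pod status changes in the namespace
kubectl get pods -n nsxi-platform -w

# Wait up to five minutes for a recreated pod to report Ready
kubectl wait --for=condition=Ready pod <kafka pod name> -n nsxi-platform --timeout=300s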

 

If the issue persists, contact Broadcom Support for further assistance.