NSX Application Platform Upgrade Fails due to nsx-config pod going into CrashLoopBackOff State

Article ID: 367688

Products

VMware vDefend Firewall with Advanced Threat Prevention
VMware vDefend Firewall

Issue/Introduction

1. During the NSX Application Platform (NAPP) upgrade, the nsx-config pod cannot be upgraded successfully.

2. When logged in to the NSX Manager and running "napp-k get pods", the pod status for nsx-config shows "Init:CrashLoopBackOff".
Example:
nsx-config-86cffc69c-txlw7 0/1 Init:CrashLoopBackOff 19 (2m55s ago) 166m

3. When checking the pod logs with "napp-k logs nsx-config-XXX -c wait-for-druid-supervisor-ready", the logs show "Cannot find any supervisor with id: [pace2druid_policy_intent_config]".

Example:
root@systest-runner:~[621]# napp-k logs nsx-config-86cffc69c-txlw7 -n nsxi-platform -c wait-for-druid-supervisor-ready


INFO:root:==============Checking the pace2druid_policy_intent_config status=============
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): 10.xx.xx.xx:8290
/usr/lib/python3/dist-packages/urllib3/connectionpool.py:1004: InsecureRequestWarning: Unverified HTTPS request is being made to host '10.xx.xx.xx'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
warnings.warn(
DEBUG:urllib3.connectionpool:https://10.xx.xx.xx:8290 "GET /druid/indexer/v1/supervisor/pace2druid_policy_intent_config/status HTTP/1.1" 404 None
INFO:root:Cannot find any supervisor with id: [pace2druid_policy_intent_config]

Environment

This issue affects upgrades from NAPP 4.1.x to NAPP 4.1.2.1.

Cause

During the NAPP upgrade, there is a chance that two druid-overlord pods run at the same time, both believing themselves to be the leader. The API call that configures Druid may go to the wrong leader, so the correct leader is never updated. After the upgrade, the wrong leader is terminated, leaving the correct leader with partial data.
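
To confirm this condition, you can check how many druid-overlord pods exist and whether the expected supervisor is registered with the surviving overlord. This is an illustrative check only; the address and port below are taken from the log output above, and the endpoint must be reachable from wherever you run curl:

napp-k get pods | grep druid-overlord
curl -k https://10.xx.xx.xx:8290/druid/indexer/v1/supervisor

The second command calls Druid's supervisor list API. If pace2druid_policy_intent_config is missing from the returned list, the remaining overlord was left with partial data.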

Resolution

For all of the methods below, log in to the NSX Manager as root.

(1) Method 1
Retrying the upgrade has a high chance of resolving the issue. Because the issue is caused by a leader conflict, it only occurs when the configure-druid job runs at the same time that the druid-overlord and zookeeper pods are restarting, so it is unlikely to recur on a retry.
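
After the retried upgrade completes, a quick check that the pod recovered (the pod name below is the one from the example above and will differ in your environment):

napp-k get pods | grep nsx-config

The pod should report a Running status instead of Init:CrashLoopBackOff.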

(2) Method 2
Use "napp-k get pods | grep druid-overlord" to find the druid-overlord pod.
Use "napp-k delete pod <druid-overlord-name>" to delete the pod so that it is recreated and the overlord restarts. See the illustrative run below.

(3) Method 3
Delete and re-apply the configure-druid job so that it runs again:

1. napp-k get job configure-druid -o json > configure-druid.json

2. napp-k get job configure-druid -o json > configure-druid.json.bak (backup)

3. Edit the configure-druid.json file:
vim configure-druid.json
Remove the "matchLabels" field in ".spec.selector"
Remove the "controller-uid" field in ".spec.template.metadata.labels"
Save and exit (an illustrative snippet of the fields to remove is shown after step 5)

4. napp-k delete job configure-druid

5. napp-k apply -f configure-druid.json
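
For reference, an illustrative fragment of configure-druid.json showing the two fields removed in step 3. The UID values are placeholders, and the exact set of labels can differ between Kubernetes versions:

"spec": {
  "selector": {
    "matchLabels": {                <-- remove this "matchLabels" field
      "controller-uid": "..."
    }
  },
  "template": {
    "metadata": {
      "labels": {
        "controller-uid": "...",    <-- remove this "controller-uid" field
        "job-name": "configure-druid"
      }
    },
    ...

After re-applying the job, you can confirm that it completes and that the nsx-config pod recovers (assumed checks; pod names will differ in your environment):

napp-k get job configure-druid
napp-k get pods | grep nsx-config

The configure-druid job should reach COMPLETIONS 1/1, and the nsx-config pod should leave the Init:CrashLoopBackOff state once the pace2druid_policy_intent_config supervisor exists.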