1. During an NSX Application Platform upgrade, the nsx-config pod cannot be upgraded successfully.
2. When logged into the NSX manager and running "napp-k get pods", the pod status for nsx-config shows "Init:CrashLoopBackOff".
Example:
nsx-config-86cffc69c-txlw7 0/1 Init:CrashLoopBackOff 19 (2m55s ago) 166m
3. When checking the pod logs with "napp-k logs nsx-config-XXX -c wait-for-druid-supervisor-ready", the logs show "Cannot find any supervisor with id: [pace2druid_policy_intent_config]".
Example:
root@systest-runner:~[621]# napp-k logs nsx-config-86cffc69c-txlw7 -n nsxi-platform -c wait-for-druid-supervisor-ready
INFO:root:==============Checking the pace2druid_policy_intent_config status=============
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): 10.xx.xx.xx:8290
/usr/lib/python3/dist-packages/urllib3/connectionpool.py:1004: InsecureRequestWarning: Unverified HTTPS request is being made to host '10.xx.xx.xx'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
warnings.warn(
DEBUG:urllib3.connectionpool:https://10.xx.xx.xx:8290 "GET /druid/indexer/v1/supervisor/pace2druid_policy_intent_config/status HTTP/1.1" 404 None
INFO:root:Cannot find any supervisor with id: [pace2druid_policy_intent_config]
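The missing supervisor can also be confirmed directly against the Druid Overlord API. The command below is a sketch: the host and port are placeholders for the endpoint shown in the init-container log above, it must be run from a machine or pod that can reach that endpoint, and -k skips certificate verification just as the init container does.
# List all supervisor IDs currently registered with the Druid Overlord
curl -k https://<druid-host>:8290/druid/indexer/v1/supervisor
# The issue is confirmed if "pace2druid_policy_intent_config" is missing from the returned list.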
This issue impacts upgrades from NAPP 4.1.x to NAPP 4.1.2.1.
During the NAPP upgrade, there is a chance that two druid-overlord pods are running at the same time, both believing themselves to be the leader. The API call that configures Druid may go to the wrong leader, so the correct leader never receives the update. After the upgrade, the wrong leader is terminated, leaving the correct leader with partial data.
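If the upgrade is still in progress and two druid-overlord pods are present, you can check which of them claims leadership using the standard Druid Overlord API. This is only a sketch: it assumes curl is available inside the overlord container and that the Overlord listens on plaintext port 8090 (the Druid default); both may differ in this deployment.
# List the druid-overlord pods
napp-k get pods | grep druid-overlord
# Ask each overlord pod whether it believes it is the leader
# (<druid-overlord-pod> is a placeholder for a name returned above)
napp-k exec <druid-overlord-pod> -- curl -s http://localhost:8090/druid/indexer/v1/isLeader
# More than one pod returning {"leader":true} matches the conflict described above.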
Log into the NSX manager as root.
(1) Method 1
Retrying the upgrade has a high chance of resolving the issue. Because the issue is caused by a leader conflict, it occurs only when the configure-druid job runs at the same time that the druid-overlord and ZooKeeper pods are restarting.
(2) Method 2
Use "napp-k get pods | grep druid-overlord" to find the druid overlord pod
Use "napp-k delete pod <druid-overlord-name>" to restart the pod
(3) Method 3
1. napp-k get job configure-druid -o json > configure-druid.json
2. napp-k get job configure-druid -o json > configure-druid.json.bak (backup)
3. Edit the configure-druid.json file
vim configure-druid.json
Remove "matchLabels" field in ".spec.selector"
Remove "controller-uid" field in ".spec.template.metadata.labels"
Save and exit
4. napp-k delete job configure-druid
5. napp-k apply -f configure-druid.json
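For reference, the fields removed in step 3 sit roughly as in the trimmed, illustrative fragment below (the UID values are placeholders and other fields of the Job spec are omitted):
"spec": {
    "selector": {
        "matchLabels": {
            "controller-uid": "<uid>"        <-- remove the whole "matchLabels" block
        }
    },
    "template": {
        "metadata": {
            "labels": {
                "controller-uid": "<uid>",   <-- remove only this label
                "job-name": "configure-druid"
            }
        },
        ...
    }
}
These fields are auto-generated by Kubernetes for each Job instance; removing them lets the re-created job receive a fresh controller UID instead of conflicting with the selector of the deleted job.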