Removing a stale WCP cluster that is causing a false positive report on an NCP down alarm


Article ID: 375626


Updated On:

Products

VMware NSX Networking
VMware NSX

Issue/Introduction

 

  • An NCP plugin down alarm is present in the NSX UI and cannot be resolved, as the alarm reappears.

  • One or more clusters show as UNKNOWN in the output of the NSX API GET /api/v1/systemhealth/container-cluster/ncp/status (an example curl call is shown after this list):

GET /api/v1/systemhealth/container-cluster/ncp/status

{
  "results" : [ {
    "cluster_id" : "0cba5273-4fd8-559f-8fbf-11e53114c58f",
    "cluster_name" : "domain-##:#########-####-####-####-############",
    "type" : "Kubernetes",
    "status" : "UNKNOWN", <<<<<<<<<<<<<
    "detail" : "",
    "_protection" : "NOT_PROTECTED"
  } ],
  "result_count" : 2
}

 

  • You have verified that Pod health and manager connectivity are good by following the steps in Broadcom KB 345833.
  • In the VC support bundle, wcp-db-dump.py.txt lists only the healthy clusters defined. Confirm there is no entry for the unknown cluster.
  • The NSX support bundle reveals many resources still belonging to the unhealthy cluster "domain-##:#########-####-####-####-############", however this cluster is not present in vCenter.
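
For reference, the status API above can be queried directly against the NSX Manager, for example with curl and basic authentication (the address and credentials below are placeholders and should be adapted to the environment):

curl -k -u '<nsx admin user>:<nsx mgr admin pass>' "https://<nsx mgr ip>/api/v1/systemhealth/container-cluster/ncp/status"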

Environment

VMware NSX-T Data Center
VMware NSX

Cause

  • The NSX Manager reports the cluster as unhealthy due to stale inventory entries in NSX for a deleted WCP cluster, which causes a false positive NCP down alarm.
  • For some reason, WCP did not perform the NSX cleanup when the cluster was destroyed.

Resolution

The customer needs to clean up the resources for this WCP cluster using the scripted approach below, passing the unknown cluster name returned by the API GET /api/v1/systemhealth/container-cluster/ncp/status. This removes all NSX resources that refer to this cluster, including the inventory resources that are causing the alarm.

The script can be executed as follows.

python3 /usr/lib/vmware-wcp/nsx_policy_cleanup.py --cluster <cluster name> -u <nsx admin user> -p '<nsx mgr admin pass>' --mgr-ip=<nsx mgr ip> --no-warning --top-tier-router-id=<cluster name> --all-res -r

Example:
python3 /usr/lib/vmware-wcp/nsx_policy_cleanup.py --cluster domain-##:#########-####-####-####-############ -u <nsx admin user> -p '<nsx mgr admin pass>' --mgr-ip=<nsx mgr ip> --no-warning --top-tier-router-id=domain-##:#########-####-####-####-############ --all-res -r

IMPORTANT: The -r option performs the actual removal of NSX resources. It is advised to first run the script without this option. This performs a dry run, and users can evaluate the output to assess which resources will be removed before doing the actual deletion.
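
For example, a dry run against the same cluster uses the command from the example above with only the -r flag omitted:

python3 /usr/lib/vmware-wcp/nsx_policy_cleanup.py --cluster domain-##:#########-####-####-####-############ -u <nsx admin user> -p '<nsx mgr admin pass>' --mgr-ip=<nsx mgr ip> --no-warning --top-tier-router-id=domain-##:#########-####-####-####-############ --all-res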

Additional Information

If the customer is running vCenter 8.0.3, the script must be triggered in an alternate way, due to the presence of an Envoy proxy in the VCSA.

 

In this case, as the cleanup operation also needs to go through the Envoy proxy, users should set the following options:

--envoy-endpoint: IP address or hostname of the VC Envoy sidecar; ignored if the environment variable ENVOY_ENDPOINT is set.
--envoy-port: HTTP port of the VC Envoy sidecar; ignored if the environment variable ENVOY_PORT is set. Defaults to 1080.
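
As an illustration, a dry run routed through the Envoy proxy could look like the following; the endpoint and port values are placeholders and should match the VCSA configuration:

python3 /usr/lib/vmware-wcp/nsx_policy_cleanup.py --cluster <cluster name> -u <nsx admin user> -p '<nsx mgr admin pass>' --mgr-ip=<nsx mgr ip> --envoy-endpoint=<vcsa ip or hostname> --envoy-port=1080 --no-warning --top-tier-router-id=<cluster name> --all-res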

Alternatively, users can copy the cleanup script to any machine with connectivity to the NSX Manager and perform the cleanup from there; the script has no dependency on WCPSVC. Users should just be aware that the script requires the Python requests package, which must be installed on the machine (or Python venv) from which the script is executed.
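
As a sketch of that alternative, assuming the script has been copied to /tmp on a Linux host with network access to the NSX Manager (the path and venv name are illustrative), the environment can be prepared and the dry run executed as follows:

python3 -m venv /tmp/nsx-cleanup-venv
source /tmp/nsx-cleanup-venv/bin/activate
pip install requests
python3 /tmp/nsx_policy_cleanup.py --cluster <cluster name> -u <nsx admin user> -p '<nsx mgr admin pass>' --mgr-ip=<nsx mgr ip> --no-warning --top-tier-router-id=<cluster name> --all-res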