Persistent "NCP plugin down" Alarm in NSX UI Due to Stale TKGi/WCP Cluster Entries

Article ID: 437588

Products

VMware NSX

VMware Tanzu Kubernetes Grid

Issue/Introduction

  • A persistent "NCP plugin down" alarm is displayed in the NSX UI.
  • Acknowledging or resolving the alarm does not clear it; the alarm keeps reappearing.
  • One or more clusters show a status of UNKNOWN in the output of the NSX API call GET /api/v1/systemhealth/container-cluster/ncp/status:
{
    "results": [
        {
            "cluster_id": "########-####-####-####-############",
            "cluster_name": "pks-#######-####-####-####-############",
            "type": "Kubernetes",
            "status": "UNKNOWN",
            "detail": "",
            "_protection": "NOT_PROTECTED"
        },
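The affected entries can be picked out of that response programmatically. The following is a minimal sketch, assuming the response has already been parsed into a Python dictionary shaped like the excerpt above. The function name, the sample IDs and names, and the "HEALTHY" status value are illustrative assumptions, not taken from this article.

```python
import json


def unknown_clusters(payload):
    """Return (cluster_id, cluster_name) pairs whose status is UNKNOWN."""
    return [
        (entry.get("cluster_id"), entry.get("cluster_name"))
        for entry in payload.get("results", [])
        if entry.get("status") == "UNKNOWN"
    ]


# Hypothetical sample shaped like the excerpt above; IDs and names are placeholders,
# and "HEALTHY" is an assumed value for a cluster that is reporting normally.
sample = json.loads("""
{
    "results": [
        {"cluster_id": "id-1", "cluster_name": "pks-stale", "type": "Kubernetes",
         "status": "UNKNOWN", "detail": "", "_protection": "NOT_PROTECTED"},
        {"cluster_id": "id-2", "cluster_name": "pks-live", "type": "Kubernetes",
         "status": "HEALTHY", "detail": "", "_protection": "NOT_PROTECTED"}
    ]
}
""")

for cluster_id, name in unknown_clusters(sample):
    print(f"{name} ({cluster_id}) is reported as UNKNOWN")
```

Any cluster returned here that no longer exists in the environment is a candidate stale entry.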

Environment

VMware NSX 4.x

Tanzu Kubernetes Grid Integrated Edition (TKGi)

Cause

This issue is a false positive caused by stale inventory entries remaining in the NSX database.

This condition occurs when a TKGi or WCP cluster is deleted or removed from the environment without the corresponding cleanup in NSX. Because the cluster objects still exist in the NSX database, the NSX Manager continues to poll the removed cluster. When the polling fails, the NSX Manager determines that the cluster is unhealthy and triggers the "NCP plugin down" alarm.

Resolution

To resolve this issue, you must clear the stale cluster objects from the NSX inventory using the policy cleanup script.

  1. SSH into the vCenter Server and log in as the root user.
  2. Navigate to the directory: /usr/lib/vmware-wcp
  3. Run the nsx_policy_cleanup.py script against your vCenter.
    a. Dry Run Mode - This lists the objects that would be cleaned up:

       IMPORTANT: The -r option performs the actual removal of NSX resources. Run the script without this option first. This performs a dry run, so you can review the output and assess which resources will be removed before doing the actual deletion.

python3 /usr/lib/vmware-wcp/nsx_policy_cleanup.py --cluster <cluster name> -u <nsx admin user> -p '<nsx mgr admin pass>' --mgr-ip=<nsx mgr ip> --no-warning --top-tier-router-id=<cluster name> --all-res

    b. Actual Cleanup:

python3 /usr/lib/vmware-wcp/nsx_policy_cleanup.py --cluster <cluster name> -u <nsx admin user> -p '<nsx mgr admin pass>' --mgr-ip=<nsx mgr ip> --no-warning --top-tier-router-id=<cluster name> --all-res -r

   

  4. Allow the script to finish. Note: The script removes the stale inventory resource early in its execution. Even if the top-tier-router-id logic fails later in the script, the necessary stale inventory deletion will have already occurred.
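The dry-run and cleanup commands in step 3 differ only by the trailing -r option. As a minimal sketch of keeping the dry run as the default, the helper below assembles the argument list from the same options shown in step 3 and appends -r only when removal is requested explicitly. The function name, parameter names, and sample values are hypothetical; the script path and flags are the ones listed above.

```python
import shlex

SCRIPT = "/usr/lib/vmware-wcp/nsx_policy_cleanup.py"


def build_cleanup_command(cluster, nsx_user, nsx_password, mgr_ip, remove=False):
    """Build the argument list for nsx_policy_cleanup.py.

    remove=False yields the dry-run command; remove=True appends -r,
    which performs the actual deletion.
    """
    args = [
        "python3", SCRIPT,
        "--cluster", cluster,
        "-u", nsx_user,
        "-p", nsx_password,
        f"--mgr-ip={mgr_ip}",
        "--no-warning",
        # The article passes the cluster name as the top-tier router ID.
        f"--top-tier-router-id={cluster}",
        "--all-res",
    ]
    if remove:
        args.append("-r")
    return args


# Placeholder values for illustration only.
dry_run = build_cleanup_command("pks-example", "admin", "example-pass", "192.0.2.10")
print(shlex.join(dry_run))
```

Because the list is built as separate arguments, it can be passed directly to subprocess.run on the vCenter Server without shell quoting concerns. Note that the printed command includes the password, so avoid logging it.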

 

To confirm the solution was successful and the stale entries have been removed, query the NSX Manager API:

  1. Execute the following API GET request: GET /api/v1/systemhealth/container-cluster/ncp/status

  2. Review the JSON response and verify that it no longer displays the removed cluster with an "UNKNOWN" status.
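The verification above can also be scripted. The following is a rough sketch, assuming the NSX Manager accepts HTTP basic authentication for this endpoint and is reachable over HTTPS from where the script runs. The function names, the verify_tls switch, and the sample address and credentials are illustrative assumptions, not part of this article.

```python
import base64
import json
import ssl
import urllib.request

STATUS_PATH = "/api/v1/systemhealth/container-cluster/ncp/status"


def fetch_ncp_status(mgr_ip, user, password, verify_tls=True):
    """GET the NCP status from NSX Manager using HTTP basic authentication."""
    request = urllib.request.Request(f"https://{mgr_ip}{STATUS_PATH}")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    context = ssl.create_default_context()
    if not verify_tls:
        # Assumption: only for managers with self-signed certificates.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(request, context=context) as response:
        return json.load(response)


def cluster_is_gone(payload, cluster_name):
    """Return True when no entry in the response carries the removed cluster's name."""
    return all(
        entry.get("cluster_name") != cluster_name
        for entry in payload.get("results", [])
    )


# Example usage (placeholder values; requires network access to the manager):
# payload = fetch_ncp_status("192.0.2.10", "admin", "example-pass")
# print(cluster_is_gone(payload, "pks-example"))
```

Matching on cluster_name is a design choice for readability; comparing cluster_id instead would work the same way if the ID of the removed cluster is known.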

Additional Information

Cleaning Up the NSX Environment

Removing stale WCP cluster that are causing a false positive report on NCP down alarm