vSphere Supervisor stuck in configuring status with error NSX Edge Cluster <cluster name> is invalid.

Article ID: 432398

Products

VMware vSphere Kubernetes Service, VCF Operations

Issue/Introduction

  • The Supervisor is stuck in "Configuration Status: Configuring" with an error under Configured Control Plane VMs indicating that the NSX Edge Cluster is invalid. In the Supervisor management menu, the Supervisor status shows the following:

    Initialized vSphere resources
    Deployed Control Plane VMs
    Configured Control Plane VMs
     • NSX Edge Cluster <cluster name> is invalid.
    Configured Core Supervisor Services
  • In VCF Operations, when upgrading the workload domain, the pre-check reports "Error: Cluster domain-c<id> is not ready yet."

Environment

  • VCF 9.0
  • vSphere Kubernetes Service

Cause

If the NSX edge cluster was created through the Policy API, the edge cluster's Policy ID and its NSX API ID are different. In that case, /var/log/vmware/wcp/wcpsvc.log on vCenter contains entries similar to the following:

debug wcp [nsxt/validator.go:77] [opID=69ac9240-85726ca4-14e1-4f15-85f9-ca7a06168874] Processing policy edge cluster edge-wrk-cl-01 with nsx_id <policy id>
...
error wcp [kubelifecycle/kubenodeconfig.go:277] [opID=69ac9240-85726ca4-14e1-4f15-85f9-ca7a06168874] Edge Cluster '<edge cluster name>' does not have corresponding Policy ID. Err: Unknown edge cluster <edge cluster name>
error wcp [kubelifecycle/controller.go:2084] [opID=69ac9240-85726ca4-14e1-4f15-85f9-ca7a06168874] Error creating master node config. Err NSX Edge Cluster <edge cluster name> is invalid.
error wcp [kubelifecycle/controller.go:2381] [opID=69ac9240-85726ca4-14e1-4f15-85f9-ca7a06168874] Error configuring API server on cluster <id> NSX Edge Cluster <edge cluster name> is invalid.
 warning wcp [kubelifecycle/controller.go:1103] [opID=69ac9240-85726ca4-14e1-4f15-85f9-ca7a06168874] Unable to configure agent in cluster domain-c10. Err NSX Edge Cluster <edge cluster name> is invalid.
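
To check whether a copy of the log shows this signature, the messages above can be matched with a short script. The following is a minimal Python sketch and not a product tool; the function name find_edge_cluster_errors and both patterns are assumptions derived only from the excerpt above, so the exact wording may differ in other builds.

```python
import re

# Patterns taken from the wcpsvc.log excerpt in this article (assumed stable wording).
PATTERNS = [
    re.compile(r"Edge Cluster '(?P<name>[^']*)' does not have corresponding Policy ID"),
    re.compile(r"NSX Edge Cluster (?P<name>.+?) is invalid"),
]


def find_edge_cluster_errors(lines):
    """Return (edge_cluster_name, log_line) pairs for lines matching the signature."""
    hits = []
    for line in lines:
        for pattern in PATTERNS:
            match = pattern.search(line)
            if match:
                hits.append((match.group("name"), line.rstrip("\n")))
                break  # one hit per log line is enough
    return hits
```

A possible use is to open /var/log/vmware/wcp/wcpsvc.log, pass the file object to find_edge_cluster_errors, and print the distinct edge cluster names that were reported.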

Querying network_provider_settings in the wcp.cluster_db_configs table shows an incorrect EdgeClusterID value:

psql -d VCDB -U postgres

select network_provider_settings from wcp.cluster_db_configs;
 network_provider_settings
 
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
{"PodCidrs": [{"IP": "10.96.208.0", "Mask": "///+AA=="}], "RoutedMode": false, "EgressCidrs": [{"IP": "###.###.##.##", "Mask": "////AA=="}], "IngressCidrs": [{"IP": "###.###.##.##", "Mask": "////AA=="}], "Tier0Gateway": "t0-wrk-01", "TransitCidrs": [{"IP": "###.###.##.##", "Mask": "//8AAA=="}, {"IP": "###.###.##.##", "Mask": "////8A=="}], "EdgeClusterID": "<edge cluster name>", "CrdMacAddressSpec": {"VMMacAddressMappings": {}}, "NamespaceSubnetPrefix": 28, "ClusterDistributedSwitchID": "## ## ## ## ## ## ##-## ## ## ## ##"}
(1 row)

The EdgeClusterID value, shown here as <edge cluster name>, should be the edge cluster UUID.
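
To confirm that a row is affected, the stored value can be tested for UUID shape. This is a minimal Python sketch under the assumption that the Management Plane ID is a standard UUID string; the function name edge_cluster_id_is_uuid is hypothetical, and the input is the JSON text returned by the query above.

```python
import json
import uuid


def edge_cluster_id_is_uuid(settings_json):
    """Return True if EdgeClusterID in the settings JSON parses as a UUID."""
    settings = json.loads(settings_json)
    value = settings.get("EdgeClusterID")
    try:
        uuid.UUID(value)
    except (ValueError, AttributeError, TypeError):
        # Not a UUID string (for example a name), or the key is missing.
        return False
    return True
```

A result of False for a value such as edge-wrk-cl-01 is consistent with the condition described here.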

Resolution

This issue is resolved in VCF 9.0.2.

To work around this issue, update the table to use the Management Plane ID (Realized UUID) of the edge cluster.

  1. Find the Management Plane ID of the edge cluster in the NSX UI under System > Configuration > Fabric > Nodes > Edge Clusters.
  2. SSH to the workload vCenter.
  3. Launch the psql utility and connect to VCDB:

    psql -d VCDB -U postgres

  4. Run the following command to get the current configuration:

    select network_provider_settings from wcp.cluster_db_configs;

    Note: Output similar to the following will be returned:

     network_provider_settings
     
    ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    ----------------------------------------------------------------------------------------------------
    {"PodCidrs": [{"IP": "10.96.208.0", "Mask": "///+AA=="}], "RoutedMode": false, "EgressCidrs": [{"IP": "###.###.##.##", "Mask": "////AA=="}], "IngressCidrs": [{"IP": "###.###.##.##", "Mask": "////AA=="}], "Tier0Gateway": "t0-wrk-01", "TransitCidrs": [{"IP": "###.###.##.##", "Mask": "//8AAA=="}, {"IP": "###.###.##.##", "Mask": "////8A=="}], "EdgeClusterID": "<edge cluster name>", "CrdMacAddressSpec": {"VMMacAddressMappings": {}}, "NamespaceSubnetPrefix": 28, "ClusterDistributedSwitchID": "## ## ## ## ## ## ##-## ## ## ## ##"}
    (1 row)

  5. Run a command similar to the following to set EdgeClusterID to the Management Plane ID of the edge cluster instead of the Policy ID. Use the output from Step 4 and replace <edge cluster name> with the Management Plane ID obtained in Step 1.

    UPDATE wcp.cluster_db_configs SET network_provider_settings = '{"PodCidrs": [{"IP": "10.96.208.0", "Mask": "///+AA=="}], "RoutedMode": false, "EgressCidrs": [{"IP": "###.###.##.##", "Mask": "////AA=="}], "IngressCidrs": [{"IP": "###.###.##.##", "Mask": "////AA=="}], "Tier0Gateway": "t0-wrk-01", "TransitCidrs": [{"IP": "###.###.##.##", "Mask": "//8AAA=="}, {"IP": "###.###.##.##", "Mask": "////8A=="}], "EdgeClusterID": "<edge cluster ID>", "CrdMacAddressSpec": {"VMMacAddressMappings": {}}, "NamespaceSubnetPrefix": 28, "ClusterDistributedSwitchID": "## ## ## ## ## ## ##-## ## ## ## ##"}';

    Note: This command should be one continuous line with no line breaks or carriage returns.

  6. Type exit to exit the psql utility.
  7. Restart the wcp service:

    service-control --restart wcp

Note: The Supervisor should then return to a running state.
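
Editing the long JSON string in Step 5 by hand is error-prone, so the statement can also be generated from the Step 4 output. The following is a minimal Python sketch under stated assumptions: the function name build_update_statement is hypothetical, the input is the exact JSON text returned in Step 4, and the generated statement mirrors the Step 5 example, including the absence of a WHERE clause.

```python
import json


def build_update_statement(current_settings, mp_id):
    """Build the Step 5 UPDATE statement with EdgeClusterID set to mp_id."""
    settings = json.loads(current_settings)
    if "EdgeClusterID" not in settings:
        raise KeyError("EdgeClusterID not found in network_provider_settings")
    settings["EdgeClusterID"] = mp_id
    # json.dumps without indent emits a single line and keeps the key order.
    new_json = json.dumps(settings)
    # Double any single quotes so the JSON is a valid SQL string literal.
    escaped = new_json.replace("'", "''")
    return (
        "UPDATE wcp.cluster_db_configs SET network_provider_settings = '"
        + escaped
        + "';"
    )
```

The returned string contains no line breaks, which matches the note in Step 5. Review it against the Step 4 output before pasting it into psql.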