Cluster expansion in SDDC Manager fails at "Validate NSX-T Transport Node Cluster does not use Static IP Pools"

Products

VMware SDDC Manager VMware Cloud Foundation Subscription

Issue/Introduction

Error in the SDDC Manager UI

Description	Validate NSX-T Transport Node Cluster does not use Static IP Pools
Progress Messages	Expanding L3 based Cluster is not supported since the cluster is using NSX-T overlay static IP pool.
Error

Message: Expanding L3 based Cluster is not supported since the cluster is using NSX-T overlay static IP pool.
Remediation Message:
Reference Token: #####
Cause: Host esxi01.example.com in cluster has static IP pool [6d86####-####-####-####-########b801] defined. Cannot continue with workflow.

Error in /var/log/vmware/vcf/domainmanager/domainmanager.log

ERROR [vcf_dm,##########,408e] [c.v.e.s.o.model.error.ErrorFactory,dm-exec-18]  [#####] NSXT_VALIDATE_L3_CLUSTER_WITH_STATIC_IP_POOL_FAILED Expanding L3 based Cluster is not supported since the cluster is using NSX-T overlay static IP pool.
com.vmware.evo.sddc.orchestrator.exceptions.OrchTaskException: Expanding L3 based Cluster is not supported since the cluster is using NSX-T overlay static IP pool.
        at com.vmware.vcf.common.fsm.plugins.nsxt.action.ValidateNsxtOverlayIpAssignmentBaseAction.execute(ValidateNsxtOverlayIpAssignmentBaseAction.java:145)
        at com.vmware.vcf.common.fsm.plugins.nsxt.action.ValidateNsxtOverlayIpAssignmentAction.execute(ValidateNsxtOverlayIpAssignmentAction.java:29)
        at com.vmware.vcf.common.fsm.plugins.nsxt.action.ValidateNsxtOverlayIpAssignmentAction.execute(ValidateNsxtOverlayIpAssignmentAction.java:12)
        at com.vmware.evo.sddc.orchestrator.platform.action.FsmActionState.invoke(FsmActionState.java:62)
        at com.vmware.evo.sddc.orchestrator.platform.action.FsmActionPlugin.invoke(FsmActionPlugin.java:159)
        at com.vmware.evo.sddc.orchestrator.platform.action.FsmActionPlugin.invoke(FsmActionPlugin.java:144)
        at com.vmware.evo.sddc.orchestrator.core.ProcessingTaskSubscriber.invokeMethod(ProcessingTaskSubscriber.java:400)
        at com.vmware.evo.sddc.orchestrator.core.ProcessingTaskSubscriber.processTask(ProcessingTaskSubscriber.java:520)
        at com.vmware.evo.sddc.orchestrator.core.ProcessingTaskSubscriber.accept(ProcessingTaskSubscriber.java:124)
        at jdk.internal.reflect.GeneratedMethodAccessor469.invoke(Unknown Source)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at com.google.common.eventbus.Subscriber.invokeSubscriberMethod(Subscriber.java:88)
        at com.google.common.eventbus.Subscriber$1.run(Subscriber.java:73)
        at org.springframework.cloud.sleuth.instrument.async.TraceRunnable.run(TraceRunnable.java:64)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.RuntimeException: Host esxi01.example.com in cluster has static IP pool [6d86####-####-####-####-########b801] defined. Cannot continue with workflow.

Additional logging in /var/log/vmware/vcf/domainmanager/domainmanager.log

WARN  [vcf_dm,##########,6206] [c.v.v.h.HostManagerEventHandler,dm-exec-3]  Could not collect persisted hosts in cluster.
java.lang.NullPointerException: null
        at com.vmware.vcf.hostmanager.service.model.AddHostInternalModel.getClusterId(AddHostInternalModel.java:120)

INFO  [vcf_dm,##########,265b] [c.v.e.s.c.s.a.w.o.WorkflowOptionsAdapterUtil,http-nio-127.0.0.1-7200-exec-10]  Checking if network pool is same for cluster 201c####-####-####-####-########5807 after adding hosts [a245####-####-####-####-########b99b, e754####-####-####-####-########c45f].

INFO  [vcf_dm,##########,265b] [c.v.e.s.c.s.a.w.o.WorkflowOptionsAdapterUtil,http-nio-127.0.0.1-7200-exec-10]  Is network pool match: false

INFO  [vcf_dm,##########,5433] [c.v.v.c.f.p.n.a.ValidateNsxtOverlayIpAssignmentBaseAction,dm-exec-9]  Found Transport Node

{
   .........
               "ip_assignment_spec": {
                  "fields": {
                    "ip_pool_id": {
                      "value": "6d86####-####-####-####-########b801"
                    },
                    "resource_type": {
                      "value": "StaticIpPoolSpec" <=== static ip pool
                    }
                  },
                  "name": "struct"
                },
   ..........
}
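The log signatures above can be pulled out of the log in one pass. The following is a minimal sketch, assuming the default domainmanager log path quoted in this article; find_static_pool_failure is a hypothetical helper name, not a VCF-provided tool.

```shell
#!/bin/sh
# Sketch: print the log lines that identify this failure, with line numbers.
# find_static_pool_failure is a hypothetical helper, not part of VCF.
find_static_pool_failure() {
    # Default to the domainmanager log path named in this article;
    # pass another path as the first argument to search a copied log.
    log_file="${1:-/var/log/vmware/vcf/domainmanager/domainmanager.log}"

    # Match the error code, the static IP pool cause line,
    # and the network pool comparison result.
    grep -n \
        -e 'NSXT_VALIDATE_L3_CLUSTER_WITH_STATIC_IP_POOL_FAILED' \
        -e 'has static IP pool' \
        -e 'Is network pool match' \
        "$log_file"
}
```

After defining the function in a root shell on SDDC Manager, running find_static_pool_failure with no arguments searches the default log. A result that contains both the error code and "Is network pool match: false" is consistent with the issue described here.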

Environment

VMware Cloud Foundation

Cause

In VCF 4.5.2, all hosts in a cluster are expected to belong to a single L2 domain.
The existing hosts in the cluster and the new hosts being added to it are in different network pools in SDDC Manager.
Because the new hosts are in a different network pool, SDDC Manager treats them as belonging to a different L2 domain, and the validation fails.

Resolution

Confirm that the existing and new ESXi hosts are in different network pools in SDDC Manager by following the steps below.

SSH to SDDC Manager as the vcf user, then su to root.

Get the host IDs from the SDDC Manager platform database.

For the existing host in the cluster:

psql -h localhost -U postgres -d platform -c "select id,hostname from host where hostname='esxi01.example.com'"

Sample output

                  id                  |       hostname
--------------------------------------+----------------------
 6d86####-####-####-####-########b801 | esxi01.example.com
(1 row)

For the new host to be added to the cluster:

psql -h localhost -U postgres -d platform -c "select id,hostname from host where hostname='esxi05.example.com'"

Sample output

                  id                  |       hostname
--------------------------------------+----------------------
 a245####-####-####-####-########b99b | esxi05.example.com
(1 row)

Get the associated network pool id for the hosts

For the existing host in the cluster:

psql -h localhost -U postgres -d platform -c "select * from host_and_network_pool where host_id='6d86####-####-####-####-########b801'"

Sample output

 id |               host_id                |           network_pool_id
----+--------------------------------------+--------------------------------------
  2 | 6d86####-####-####-####-########b801 | b146###-####-####-####-########6d4c
(1 row)

For the new host to be added to the cluster:

psql -h localhost -U postgres -d platform -c "select * from host_and_network_pool where host_id='a245####-####-####-####-########b99b'"

Sample output

 id |               host_id                |           network_pool_id
----+--------------------------------------+--------------------------------------
 13 | a245####-####-####-####-########b99b | 17b7###-####-####-####-########e659
(1 row)

Note: Do not modify the host_and_network_pool table to make the network pool IDs match. Editing the SDDC Manager inventory this way can cause the workflow to fail at a later stage, when the VMkernel adapters for vMotion, vSAN, and other traffic are created, and can lead to network connectivity issues.
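The host and network pool lookups can also be combined into a single read-only query and compared automatically. The following is a minimal sketch, not a supported tool: build_pool_query and check_same_pool are hypothetical helper names, and the join assumes that the host table is named host and that its id column matches host_and_network_pool.host_id, as the sample outputs suggest.

```shell
#!/bin/sh
# Sketch: compare the network pools of two hosts with one read-only query.
# Assumption: the host table in the platform database is named "host";
# host_and_network_pool and its columns come from the sample output above.

# Print the SQL that maps each of two hostnames to its network_pool_id.
build_pool_query() {
    printf "select h.hostname, p.network_pool_id from host h join host_and_network_pool p on p.host_id = h.id where h.hostname in ('%s', '%s')" "$1" "$2"
}

# Read "hostname|network_pool_id" rows on stdin (the shape produced by
# psql -At) and report whether every row shares one network pool.
# Exit status: 0 same pool, 1 different pools, 2 fewer than two rows.
check_same_pool() {
    awk -F'|' '
        NF == 2 { print $1 " -> " $2; pools[$2] = 1; rows++ }
        END {
            n = 0
            for (p in pools) n++
            if (rows < 2) {
                print "RESULT: fewer than two hosts found"; exit 2
            } else if (n == 1) {
                print "RESULT: same network pool"
            } else {
                print "RESULT: different network pools"; exit 1
            }
        }'
}
```

On SDDC Manager, the two helpers can be chained as follows, where -A and -t make psql print unaligned rows without headers:

psql -h localhost -U postgres -d platform -At -c "$(build_pool_query esxi01.example.com esxi05.example.com)" | check_same_pool

A result of "different network pools" matches the condition described in the Cause section. The query only reads from the database and does not change the inventory.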

To resolve the issue, upgrade VCF to 5.1 or later.

Workaround:

Decommission the new hosts, then recommission them into the same network pool as the existing hosts in the cluster.

Refer:

Decommission Hosts

Commission Hosts