Edge cluster expansion fails with error "VSPHERE_VM_ANTIAFFINITY_RULE_CREATION_FAILED Failed to create VM-VM anti-affinity rule"

Article ID: 318237

Products

VMware Cloud Foundation

Issue/Introduction

Expanding an edge cluster from the SDDC Manager UI fails.

Symptoms:

  • The issue occurs when the edge cluster being expanded spans two or more vSphere clusters, each hosting exactly one edge node VM of that edge cluster.
  • In SDDC Manager, the task fails at the subtask "Create Anti-Affinity Rule for NSX-T Data Center Edge nodes".

Error Message: Failed to create VM-VM anti-affinity rule VCF-edge

  • domainmanager.log shows errors similar to the following:


2023-01-05T11:20:33.154+0000 ERROR [vcf_dm,f5e5b2630a454c84,03f6] [c.v.e.s.o.model.error.ErrorFactory,dm-exec-11] [TPC4QK] VSPHERE_VM_ANTIAFFINITY_RULE_CREATION_FAILED Failed to create VM-VM anti-affinity rule VCF-edge_w161-t0-ec-01_antiAffinity_########-####-####-####-##########75 in vCenter domain-c
com.vmware.evo.sddc.orchestrator.exceptions.OrchTaskException: Failed to create VM-VM anti-affinity rule VCF-edge_w161-t0-ec-01_antiAffinity_########-####-####-####-##########75 in vCenter domain-c
        at com.vmware.vcf.common.fsm.plugins.action.impl.CreateAntiAffinityRuleAction.postValidate(CreateAntiAffinityRuleAction.java:349)
        at com.vmware.vcf.common.fsm.plugins.action.impl.CreateAntiAffinityRuleAction.postValidate(CreateAntiAffinityRuleAction.java:41)
        at com.vmware.evo.sddc.orchestrator.platform.action.FsmActionState.lambda$static$1(FsmActionState.java:23)
        at com.vmware.evo.sddc.orchestrator.platform.action.FsmActionState.invoke(FsmActionState.java:62)
        at com.vmware.evo.sddc.orchestrator.platform.action.FsmActionPlugin.invoke(FsmActionPlugin.java:159)
        at com.vmware.evo.sddc.orchestrator.platform.action.FsmActionPlugin.invoke(FsmActionPlugin.java:144)
        at com.vmware.evo.sddc.orchestrator.core.ProcessingTaskSubscriber.invokeMethod(ProcessingTaskSubscriber.java:400)
        at com.vmware.evo.sddc.orchestrator.core.ProcessingTaskSubscriber.processTask(ProcessingTaskSubscriber.java:561)
2023-01-05.0.log:2023-01-05T10:46:16.894+0000 DEBUG [vcf_dm,0f19dbf65ee041c9,bc65] [c.v.v.c.f.p.a.i.UpdateAntiAffinityRuleAction,dm-exec-19] Anti-affinity rule ########-####-####-####-##########a5 already exists with VMs [vm-94]

2023-01-05T10:46:16.998+0000 DEBUG [vcf_dm,0f19dbf65ee041c9,bc65] [c.v.v.c.f.p.a.i.UpdateAntiAffinityRuleAction,dm-exec-19] Anti-affinity rule ########-####-####-####-##########9f already exists with VMs [vm-92]

2023-01-05.0.log:2023-01-05T11:20:33.154+0000 ERROR [vcf_dm,f5e5b2630a454c84,03f6] [c.v.e.s.o.model.error.ErrorFactory,dm-exec-11] [TPC4QK] VSPHERE_VM_ANTIAFFINITY_RULE_CREATION_FAILED Failed to create VM-VM anti-affinity rule VCF-edge_w161-t0-ec-01_antiAffinity_########-####-####-####-##########75 in vCenter domain-c

  • Confirm that the edge cluster expansion payload in the domainmanager log looks like the following:

2023-01-03T15:14:59.256+0000 DEBUG [vcf_dm,d9f11d6b6e624743,e706] [c.v.e.s.o.c.c.ContractParamBuilder,dm-exec-10] Contract task Update NSX-T Data Center Anti-Affinity Rule input: {"clusterMoIdToRemoteEndpoint":{"domain-c":
.
.
antiAffinityRuleParamList":[{"clusterMobId":"domain-c","ruleIdToVms":{"########-####-####-####-##########70":["VM-A","VM-C"]},"antiAffinity":true},{"clusterMobId":"domain-c","ruleIdToVms":{"########-####-####-####-##########9f":["VM-b","VM-D"]},"antiAffinity":true}]}

  • Check whether the log reports that the anti-affinity rule already exists:

 Existing AA rules in cluster domain-c: [{"vm":[{"_type":"VirtualMachine","_value":"vm-94","_serverGuid":"########-####-####-####-##########a5"}],"key":1,"enabled":true,"name":"VCF-edge_w161-t0-ec-01_antiAffinity_########-####-####-####-##########75","ruleUuid":"########-####-####-####-##########70"}]

2023-01-03T15:14:59.412+0000 DEBUG [vcf_dm,d9f11d6b6e624743,e706] [c.v.v.c.f.p.a.i.UpdateAntiAffinityRuleAction,dm-exec-10] Anti-affinity rule ########-####-####-####-##########70 already exists with VMs [vm-94]

  • Check that the rule ID of the existing anti-affinity rule is the same as the ID in the payload passed to NSX, and not a stale one. In this example, for one cluster, the ID is ########-####-####-####-##########70.
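
The rule ID comparison in the last check can be scripted. The following is a minimal sketch, not a supported tool: it assumes the "already exists" DEBUG lines and the `antiAffinityRuleParamList` / `ruleIdToVms` payload fields have the shape shown in the excerpts above. The function names and the short placeholder IDs (`rule-70`, `rule-9f`, `rule-a5`) are hypothetical and stand in for the masked UUIDs.

```python
import json
import re

# Pattern for the DEBUG line format shown in the log excerpts above.
EXISTS_RE = re.compile(
    r"Anti-affinity rule (\S+) already exists with VMs \[([^\]]*)\]"
)


def existing_rules_from_log(lines):
    """Map rule ID -> list of VM IDs taken from 'already exists' log lines."""
    rules = {}
    for line in lines:
        match = EXISTS_RE.search(line)
        if match:
            vms = [v.strip() for v in match.group(2).split(",") if v.strip()]
            rules[match.group(1)] = vms
    return rules


def payload_rule_ids(payload):
    """Collect every rule ID referenced by the expansion payload."""
    ids = set()
    for param in payload.get("antiAffinityRuleParamList", []):
        ids.update(param.get("ruleIdToVms", {}).keys())
    return ids


def stale_rule_ids(log_lines, payload):
    """Rule IDs the log reports as existing but the payload does not reference."""
    existing = existing_rules_from_log(log_lines)
    return sorted(set(existing) - payload_rule_ids(payload))


# Placeholder data modelled on the excerpts; IDs are shortened stand-ins.
sample_payload = json.loads(
    '{"antiAffinityRuleParamList": ['
    '{"clusterMobId": "domain-c1", "ruleIdToVms": {"rule-70": ["VM-A", "VM-C"]}, "antiAffinity": true},'
    '{"clusterMobId": "domain-c2", "ruleIdToVms": {"rule-9f": ["VM-B", "VM-D"]}, "antiAffinity": true}]}'
)
sample_log = [
    "DEBUG [...] Anti-affinity rule rule-a5 already exists with VMs [vm-94]",
    "DEBUG [...] Anti-affinity rule rule-9f already exists with VMs [vm-92]",
]

print(stale_rule_ids(sample_log, sample_payload))
```

A non-empty result means that at least one rule found in the log is not referenced by the payload and may be stale.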



Environment

VMware Cloud Foundation 4.x

Cause

During Edge cluster expansion, when the starting state has two or more VC clusters each hosting one node in the expanding Edge cluster, input preparation erroneously maps each VC cluster's Anti-Affinity rule spec into all the other single-node clusters as well. This in turn causes the Anti-Affinity rule creation action to fail since it cannot create a rule for node VMs in a cluster that doesn't host those VMs.
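
To illustrate the faulty mapping, the sketch below flags rule specs whose VMs are not hosted in the cluster the spec targets. It is an illustration only, not the SDDC Manager implementation. The field names follow the payload excerpt in the Symptoms section; the function name, the cluster and rule IDs, and the `vms_by_cluster` inventory are hypothetical and would have to be gathered separately, for example from vCenter.

```python
def misplaced_specs(rule_params, vms_by_cluster):
    """Return (cluster, rule_id, missing_vms) for specs that name VMs
    the target cluster does not host."""
    problems = []
    for param in rule_params:
        cluster = param["clusterMobId"]
        hosted = set(vms_by_cluster.get(cluster, []))
        for rule_id, vms in param["ruleIdToVms"].items():
            missing = sorted(set(vms) - hosted)
            if missing:
                problems.append((cluster, rule_id, missing))
    return problems


# Hypothetical inventory: which edge node VMs each vSphere cluster hosts.
vms_by_cluster = {
    "cluster-1": ["VM-A", "VM-C"],
    "cluster-2": ["VM-B", "VM-D"],
}

# Input resembling the erroneous state: the spec for cluster-1's rule
# has also been mapped into cluster-2.
rule_params = [
    {"clusterMobId": "cluster-1", "ruleIdToVms": {"rule-1": ["VM-A", "VM-C"]}, "antiAffinity": True},
    {"clusterMobId": "cluster-2", "ruleIdToVms": {"rule-2": ["VM-B", "VM-D"]}, "antiAffinity": True},
    {"clusterMobId": "cluster-2", "ruleIdToVms": {"rule-1": ["VM-A", "VM-C"]}, "antiAffinity": True},
]

for cluster, rule_id, missing in misplaced_specs(rule_params, vms_by_cluster):
    print(f"{rule_id} targets {cluster}, which does not host {missing}")
```

Each reported entry corresponds to a rule that vCenter cannot create, because the named VMs do not reside in the target cluster.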

Resolution

This is a known issue in VCF versions up to 4.4.1. The engineering team is working on a fix and plans to port it to VCF 5.0.

Workaround:
Workaround 1

This failure may be avoided by changing the input so that Edge cluster creation adds two Edge nodes to one VC host cluster, and expansion then adds two more Edge nodes to the other VC host cluster.
Workaround 2 

  1. Check on both clusters whether the Anti-Affinity rule has already been created and contains the failed edge node VMs.
  2. If step 1 holds, the failed domain manager subtask can be skipped in the database. Contact VMware Support to perform this action.
  3. After the failed subtask has been skipped, retry the task from the SDDC Manager UI. The edge cluster expansion should then succeed.

Workaround 3
If the edge cluster is relatively new and downtime is acceptable, clean up the entire edge cluster using https://kb.vmware.com/s/article/78635 and recreate it, supplying the information for all edge nodes so that they are deployed in a single operation.

Additional Information

Impact/Risks:

  • Edge cluster creation fails
  • Edge node status unavailable in NSX-Manager UI


Note: Before performing a workaround, ensure that the BGP connectivity status of all edge node VMs in the NSX Manager UI (including the new nodes deployed by the cluster expansion workflow) is healthy.