vCenter HA Basic Mode fails during Passive/Witness cloning when using NFSv3 Datastore

Article ID: 343081


Updated On:

Products

VMware vCenter Server
VMware vCenter Server 8.0

Issue/Introduction

  • When configuring vCenter High Availability (VCHA) in Basic mode, the process may get stuck or fail at 96% or 33% if the vCenter Server appliance is stored on an NFSv3 datastore.

  • During repeated attempts to configure VCHA using the automatic cloning method, the process may fail at different stages with errors such as:
    "An error occurred while communicating with the remote host"
    
    "The session is not authenticated. You do not hold privileges "PropertyCollector.te#t session[52###bb4-#####-50b7-223d-f61#####1c88]52###06e-5874-dcda-#####-8b#####5"
  • From /var/log/vmware/vpxd/vpxd.log
    YYYY-MM-DDThh:mm:ss.Z warning vpxd[7FE71E63C700] [Originator@6876 sub=pbm opID=ProvisioningWizard-addMulti-24848-ngc:70001948-be-01] PBMCallback: ShouldSkipPostCloneCommonCallback: post clone callback is skipped - VM clone failed
    YYYY-MM-DDThh:mm:ss.Z error vpxd[7F725BE7C700] [Originator@6876 sub=VmProv opID=FlowBasedWizard-apply-638-ngc-35-f-01] Get exception while executing action vpx.vmprov.RemoveSnapshot: N5Vmomi5Fault17HostCommunication9ExceptionE(vmodl.fault.HostCommunication)
  • From the ESXi host's /var/run/log/vmkernel.log
    YYYY-MM-DDThh:mm:ss.Z Vmxnet3: 17293: Disable Rx queuing; queue size 256 is larger than Vmxnet3RxQueueLimit limit of 64.
    YYYY-MM-DDThh:mm:ss.Z cpu20:442393)Vmxnet3: 17651: Using default queue delivery for vmxnet3 for port 0x600002c
    YYYY-MM-DDThh:mm:ss.Z cpu14:68341 opID=840210c9)VMKAPIMOD: 86: Failed to  check if port is Uplink : Failure


  • Error in /var/log/vmware/vpxd/vpxd.log

    warning vpxd[7FE6210F7700] [Originator@6876 sub=VpxProfiler opID=HB-host-#####@96321-26a52267] DoHostSync:host-##### [GetChangesTime] took 1629260 ms
    warning vpxd[7FE6210F7700] [Originator@6876 sub=VpxProfiler opID=HB-host-#####@96321-26a52267] DoHostSync:host-##### [DoHostSyncTime] took 1629260 ms
    
    warning vpxd[7FE71E63C700] [Originator@6876 sub=pbm opID=ProvisioningWizard-addMulti-24848-ngc:70001948-be-01] PBMCallback: ShouldSkipPostCloneCommonCallback: post clone callback is skipped - VM clone failed
    
    error vpxd[7FE71E63C700] [Originator@6876 sub=VmProv opID=ProvisioningWizard-addMulti-24848-ngc:70001948-be-01] Unable to do delete snapshot clone-temp-######## because the host host.example.com is disconnected
    info vpxd[7FE71E63C700] [Originator@6876 sub=VmProv opID=ProvisioningWizard-addMulti-24848-ngc:70001948-be-01] Done undo action vpx.vmprov.CreateSnapshot with output:
    
    warning vpxd[7FE71E63C700] [Originator@6876 sub=VpxProfiler opID=ProvisioningWizard-addMulti-24848-ngc:70001948-be-01] VpxLro::LroMain [TotalTime] took 9647077 ms
    error vpxd[7FE71E63C700] [Originator@6876 sub=vpxLro opID=ProvisioningWizard-addMulti-24848-ngc:70001948-be-01] [VpxLRO] Unexpected Exception: N5Vmomi5Fault17HostCommunication9ExceptionE(vmodl.fault.HostCommunication)
    
    info vpxd[7FE71E63C700] [Originator@6876 sub=Default opID=ProvisioningWizard-addMulti-24848-ngc:70001948-be] [VpxLRO] -- ERROR task-1416563 -- vm-##### -- vim.VirtualMachine.clone: vmodl.fault.HostCommunication:
    --> Result:
    --> (vmodl.fault.HostCommunication) {
    -->    faultCause = (vmodl.MethodFault) null,
    -->    faultMessage = <unset>
    -->    msg = ""
    --> }
    --> Args:
    -->
    --> Arg folder:
    --> 'vim.Folder:6CA8####-###-###-####-C01D####:group-####'
    --> Arg name:
    --> "TestVC001"
    --> Arg spec:
    --> (vim.vm.CloneSpec) {
    -->    location = (vim.vm.RelocateSpec) {
    -->       service = (vim.ServiceLocator) null,
    -->       folder = 'vim.Folder:6CA8####-###-###-####-C01D####:group-####',
    -->       datastore = 'vim.Datastore:6CA8####-###-###-####-C01D####:datastore-#####',
    -->       diskMoveType = <unset>,


  • Error in /var/run/log/vmkernel.log

    cpu20:442393)VSCSI: 273: handle 46587(vscsi0:12):Input values: res=0 limit=-2 bw=-1 Shares=1000
    cpu20:442393)Vmxnet3: 17293: Disable Rx queuing; queue size 256 is larger than Vmxnet3RxQueueLimit limit of 64.
    cpu20:442393)Vmxnet3: 17651: Using default queue delivery for vmxnet3 for port 0x600002c
    cpu20:442393)NetPort: 3208: resuming traffic on DV port 251
    cpu20:442393)Team.etherswitch: TeamESPolicySet:5942: Port 0x600002c frp numUplinks 2 active 2(max 2) standby 0
    cpu20:442393)Team.etherswitch: TeamESPolicySet:5950: Update: Port 0x600002c frp numUplinks 2 active 2(max 2) standby 0
    cpu20:442393)NetPort: 1662: enabled port 0x600002c with mac ##:##:##:##:##
    cpu20:442393)Vmxnet3: 17293: Disable Rx queuing; queue size 256 is larger than Vmxnet3RxQueueLimit limit of 64.
    cpu20:442393)Vmxnet3: 17651: Using default queue delivery for vmxnet3 for port 0x2000007
    cpu20:442393)NetPort: 1662: enabled port 0x2000007 with mac ##:##:##:##:##
    cpu14:68341 opID=#####)World: 12235: VC opID HB-host-#####@96420-74cd6b42-DvsHandleHostReconnect-1fc1dfe0-3-1dd4 maps to vmkernel opID #####
    cpu14:68341 opID=#####)VMKAPIMOD: 86: Failed to  check if port is Uplink : Failure
    cpu14:68341 opID=#####)VMKAPIMOD: 86: Failed to  check if port is Uplink : Failure
    cpu14:68341 opID=#####)VMKAPIMOD: 86: Failed to  check if port is Uplink : Failure
    cpu14:68341 opID=#####)VMKAPIMOD: 86: Failed to  check if port is Uplink : Failure
    cpu14:68341 opID=#####)VMKAPIMOD: 86: Failed to  check if port is Uplink : Failure
    cpu14:68341 opID=#####)VMKAPIMOD: 86: Failed to  check if port is Uplink : Failure
    cpu14:68341 opID=#####)VMKAPIMOD: 86: Failed to  check if port is Uplink : Failure
    cpu14:68341 opID=#####)VMKAPIMOD: 86: Failed to  check if port is Uplink : Failure
    cpu14:68341 opID=#####)Team.etherswitch: TeamESLACPLAGEventCB:6277: Received a LAG DESTROY event version :0, lagId :0, lagLinkStatus :NOT USED,lagName :, uplinkName :, portLinkStatus :NOT USED, portID :0x0
    cpu14:68341 opID=#####)netioc: NetIOCSetRespoolVersion:245: Set netioc version for portset: DvsPortset-0 to 2,old threshold: 2
    cpu14:68341 opID=#####)netioc: NetIOCPortsetNetSchedStatusSet:1207: Set sched status for portset: DvsPortset-0 to Inactive, old:Inactive

Cause

This issue occurs due to a limitation in the NFSv3 locking mechanism, which enforces a 40-second lock timeout. The automatic clone operations performed during VCHA configuration can exceed this timeout, causing the clone to fail.

Resolution

Option 1: Use the Same ESXi Host for Clone Operations

To avoid clone failures:

  • During VCHA deployment, ensure that the Passive and Witness nodes are placed on the same ESXi host as the Active node, so that the clone source and target hosts are identical.

  • To avoid DRS anti-affinity rule errors when deploying all VCHA nodes on the same host:

    • Go to vCenter Advanced Settings

    • Set the following key:

      config.vpxd.vcha.drsAntiAffinity = False

Note: After the VCHA configuration completes, the nodes can be moved to different ESXi hosts and the Advanced Setting above can be reverted to True. A scripted way to toggle this setting is sketched below.
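For environments where this change is scripted, the following is a minimal sketch using Python with pyVmomi. The hostname and credentials are placeholders, and the string values ("false"/"true") mirror how the setting appears in vCenter Advanced Settings; verify the value type expected in your environment.

    # Minimal sketch: toggle config.vpxd.vcha.drsAntiAffinity via pyVmomi.
    # Hostname/credentials are placeholders; SSL verification is disabled
    # here for brevity only.
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    ctx = ssl._create_unverified_context()
    si = SmartConnect(host="vcenter.example.com",
                      user="administrator@vsphere.local",
                      pwd="********", sslContext=ctx)
    try:
        option_mgr = si.content.setting   # vCenter-level OptionManager
        key = "config.vpxd.vcha.drsAntiAffinity"
        # Disable the DRS anti-affinity check before configuring VCHA:
        option_mgr.UpdateOptions(
            changedValue=[vim.option.OptionValue(key=key, value="false")])
        # After the deployment completes, revert it:
        # option_mgr.UpdateOptions(
        #     changedValue=[vim.option.OptionValue(key=key, value="true")])
    finally:
        Disconnect(si)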

Option 2: Use Manual Clone Method

  • Instead of using the Basic (automatic) mode, configure VCHA using the Manual clone method, which gives more control over node placement and avoids the NFSv3 datastore-locking issue; see the sketch below.
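The manual workflow still relies on standard VM clone operations. The sketch below (Python with pyVmomi) pins the clone target to a specific host and datastore; all object lookups are assumed to happen elsewhere, and the function and variable names are illustrative. The full manual VCHA procedure in the vSphere documentation includes further steps (guest customization and VCHA network settings) that are not shown here.

    # Hypothetical helper: clone the Active node's VM onto a chosen
    # host/datastore for the manual VCHA workflow. Keeping the target
    # host identical to the source host sidesteps the NFSv3 lock timeout.
    from pyVmomi import vim

    def clone_vcha_node(source_vm, dest_folder, target_host,
                        target_datastore, node_name):
        relocate = vim.vm.RelocateSpec(
            host=target_host,
            pool=target_host.parent.resourcePool,
            datastore=target_datastore,
        )
        spec = vim.vm.CloneSpec(location=relocate, powerOn=False,
                                template=False)
        # Returns a vim.Task; wait for completion before the next clone.
        return source_vm.CloneVM_Task(folder=dest_folder, name=node_name,
                                      spec=spec)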

Additional Information

Where possible, use NFS 4.1 datastores, which provide improved session and file-locking semantics compared to NFSv3.
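If the appliance is being moved off NFSv3, a new NFS 4.1 datastore can be mounted on each ESXi host. The sketch below (Python with pyVmomi) is a minimal example; the server address, export path, and AUTH_SYS security type are assumptions to adapt to your environment.

    # Minimal sketch: mount an NFS 4.1 datastore on an ESXi host.
    from pyVmomi import vim

    def mount_nfs41(host, server, remote_path, ds_name):
        spec = vim.host.NasVolume.Specification(
            remoteHost=server,
            remoteHostNames=[server],   # NFS 4.1 allows multiple addresses
            remotePath=remote_path,     # e.g. "/export/vcsa"
            localPath=ds_name,          # datastore name as seen by ESXi
            accessMode="readWrite",
            type="NFS41",
            securityType="AUTH_SYS",
        )
        return host.configManager.datastoreSystem.CreateNasDatastore(spec)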