vCenter HA Basic Mode fails during Passive/Witness cloning when using NFSv3 Datastore

Article ID: 343081


Updated On:

Products

VMware vCenter Server
VMware vCenter Server 8.0

Issue/Introduction

  • When configuring vCenter High Availability (VCHA) in Basic mode, the process may get stuck or fail at 96% or 33% if the vCenter Server appliance is stored on an NFSv3 datastore.

  • During repeated attempts to configure VCHA using the automatic cloning method, the process may fail at different stages with errors such as:
    "An error occurred while communicating with the remote host"
    
    "The session is not authenticated. You do not hold privileges "PropertyCollector.te#t session[52###bb4-#####-50b7-223d-f61#####1c88]52###06e-5874-dcda-#####-8b#####5"
  • From /var/log/vmware/vpxd/vpxd.log
    YYYY-MM-DDThh:mm:ss.Z warning vpxd[7FE71E63C700] [Originator@6876 sub=pbm opID=ProvisioningWizard-addMulti-24848-ngc:70001948-be-01] PBMCallback: ShouldSkipPostCloneCommonCallback: post clone callback is skipped - VM clone failed
    YYYY-MM-DDThh:mm:ss.Z error vpxd[7F725BE7C700] [Originator@6876 sub=VmProv opID=FlowBasedWizard-apply-638-ngc-35-f-01] Get exception while executing action vpx.vmprov.RemoveSnapshot: N5Vmomi5Fault17HostCommunication9ExceptionE(vmodl.fault.HostCommunication)
  • From the ESXi host's /var/run/log/vmkernel.log
    YYYY-MM-DDThh:mm:ss.Z Vmxnet3: 17293: Disable Rx queuing; queue size 256 is larger than Vmxnet3RxQueueLimit limit of 64.
    YYYY-MM-DDThh:mm:ss.Z cpu20:442393)Vmxnet3: 17651: Using default queue delivery for vmxnet3 for port 0x600002c
    YYYY-MM-DDThh:mm:ss.Z cpu14:68341 opID=840210c9)VMKAPIMOD: 86: Failed to  check if port is Uplink : Failure


  • Error in /var/log/vmware/vpxd/vpxd.log

    warning vpxd[7FE6210F7700] [Originator@6876 sub=VpxProfiler opID=HB-host-#####@96321-26a52267] DoHostSync:host-##### [GetChangesTime] took 1629260 ms
    warning vpxd[7FE6210F7700] [Originator@6876 sub=VpxProfiler opID=HB-host-#####@96321-26a52267] DoHostSync:host-##### [DoHostSyncTime] took 1629260 ms
    
    warning vpxd[7FE71E63C700] [Originator@6876 sub=pbm opID=ProvisioningWizard-addMulti-24848-ngc:70001948-be-01] PBMCallback: ShouldSkipPostCloneCommonCallback: post clone callback is skipped - VM clone failed
    
    error vpxd[7FE71E63C700] [Originator@6876 sub=VmProv opID=ProvisioningWizard-addMulti-24848-ngc:70001948-be-01] Unable to do delete snapshot clone-temp-######## because the host host.example.com is disconnected
    info vpxd[7FE71E63C700] [Originator@6876 sub=VmProv opID=ProvisioningWizard-addMulti-24848-ngc:70001948-be-01] Done undo action vpx.vmprov.CreateSnapshot with output:
    
    warning vpxd[7FE71E63C700] [Originator@6876 sub=VpxProfiler opID=ProvisioningWizard-addMulti-24848-ngc:70001948-be-01] VpxLro::LroMain [TotalTime] took 9647077 ms
    error vpxd[7FE71E63C700] [Originator@6876 sub=vpxLro opID=ProvisioningWizard-addMulti-24848-ngc:70001948-be-01] [VpxLRO] Unexpected Exception: N5Vmomi5Fault17HostCommunication9ExceptionE(vmodl.fault.HostCommunication)
    
    info vpxd[7FE71E63C700] [Originator@6876 sub=Default opID=ProvisioningWizard-addMulti-24848-ngc:70001948-be] [VpxLRO] -- ERROR task-1416563 -- vm-##### -- vim.VirtualMachine.clone: vmodl.fault.HostCommunication:
    --> Result:
    --> (vmodl.fault.HostCommunication) {
    -->    faultCause = (vmodl.MethodFault) null,
    -->    faultMessage = <unset>
    -->    msg = ""
    --> }
    --> Args:
    -->
    --> Arg folder:
    --> 'vim.Folder:6CA8####-###-###-####-C01D####:group-####'
    --> Arg name:
    --> "TestVC001"
    --> Arg spec:
    --> (vim.vm.CloneSpec) {
    -->    location = (vim.vm.RelocateSpec) {
    -->       service = (vim.ServiceLocator) null,
    -->       folder = 'vim.Folder:6CA8####-###-###-####-C01D####:group-####',
    -->       datastore = 'vim.Datastore:6CA8####-###-###-####-C01D####:datastore-#####',
    -->       diskMoveType = <unset>,


  • Error in /var/run/log/vmkernel.log

    cpu20:442393)VSCSI: 273: handle 46587(vscsi0:12):Input values: res=0 limit=-2 bw=-1 Shares=1000
    cpu20:442393)Vmxnet3: 17293: Disable Rx queuing; queue size 256 is larger than Vmxnet3RxQueueLimit limit of 64.
    cpu20:442393)Vmxnet3: 17651: Using default queue delivery for vmxnet3 for port 0x600002c
    cpu20:442393)NetPort: 3208: resuming traffic on DV port 251
    cpu20:442393)Team.etherswitch: TeamESPolicySet:5942: Port 0x600002c frp numUplinks 2 active 2(max 2) standby 0
    cpu20:442393)Team.etherswitch: TeamESPolicySet:5950: Update: Port 0x600002c frp numUplinks 2 active 2(max 2) standby 0
    cpu20:442393)NetPort: 1662: enabled port 0x600002c with mac ##:##:##:##:##
    cpu20:442393)Vmxnet3: 17293: Disable Rx queuing; queue size 256 is larger than Vmxnet3RxQueueLimit limit of 64.
    cpu20:442393)Vmxnet3: 17651: Using default queue delivery for vmxnet3 for port 0x2000007
    cpu20:442393)NetPort: 1662: enabled port 0x2000007 with mac ##:##:##:##:##
    cpu14:68341 opID=#####)World: 12235: VC opID HB-host-#####@96420-74cd6b42-DvsHandleHostReconnect-1fc1dfe0-3-1dd4 maps to vmkernel opID #####
    cpu14:68341 opID=#####)VMKAPIMOD: 86: Failed to  check if port is Uplink : Failure
    cpu14:68341 opID=#####)VMKAPIMOD: 86: Failed to  check if port is Uplink : Failure
    cpu14:68341 opID=#####)VMKAPIMOD: 86: Failed to  check if port is Uplink : Failure
    cpu14:68341 opID=#####)VMKAPIMOD: 86: Failed to  check if port is Uplink : Failure
    cpu14:68341 opID=#####)VMKAPIMOD: 86: Failed to  check if port is Uplink : Failure
    cpu14:68341 opID=#####)VMKAPIMOD: 86: Failed to  check if port is Uplink : Failure
    cpu14:68341 opID=#####)VMKAPIMOD: 86: Failed to  check if port is Uplink : Failure
    cpu14:68341 opID=#####)VMKAPIMOD: 86: Failed to  check if port is Uplink : Failure
    cpu14:68341 opID=#####)Team.etherswitch: TeamESLACPLAGEventCB:6277: Received a LAG DESTROY event version :0, lagId :0, lagLinkStatus :NOT USED,lagName :, uplinkName :, portLinkStatus :NOT USED, portID :0x0
    cpu14:68341 opID=#####)netioc: NetIOCSetRespoolVersion:245: Set netioc version for portset: DvsPortset-0 to 2,old threshold: 2
    cpu14:68341 opID=#####)netioc: NetIOCPortsetNetSchedStatusSet:1207: Set sched status for portset: DvsPortset-0 to Inactive, old:Inactive

Cause

This issue occurs due to a limitation in the NFSv3 locking mechanism, which enforces a 40-second lock timeout. The automatic clone operations performed during VCHA configuration can exceed this timeout, causing the clone to fail.

Resolution

Option 1: Use the Same ESXi Host for Clone Operations

To avoid clone failures:

  • During VCHA deployment, ensure that the Passive and Witness nodes are placed on the same ESXi host as the Active node, so that the clone source and target hosts are identical.

  • To avoid DRS anti-affinity rule errors when deploying all VCHA nodes on the same host:

    • Go to vCenter Advanced Settings

    • Set the following key:

      config.vpxd.vcha.drsAntiAffinity = False

Note: After the VCHA configuration completes, the nodes can be moved to different ESXi hosts and the Advanced Setting above can be reverted to True. A scripted way to toggle this setting is sketched below.
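For environments where this change is scripted, the following is a minimal sketch using Python with pyVmomi. The hostname and credentials are placeholders, and the string values ("false"/"true") mirror how the setting appears in vCenter Advanced Settings; verify the value type expected in your environment.

    # Minimal sketch: toggle config.vpxd.vcha.drsAntiAffinity via pyVmomi.
    # Hostname/credentials are placeholders; SSL verification is disabled
    # here for brevity only.
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    ctx = ssl._create_unverified_context()
    si = SmartConnect(host="vcenter.example.com",
                      user="administrator@vsphere.local",
                      pwd="********", sslContext=ctx)
    try:
        option_mgr = si.content.setting   # vCenter-level OptionManager
        key = "config.vpxd.vcha.drsAntiAffinity"
        # Disable the DRS anti-affinity check before configuring VCHA:
        option_mgr.UpdateOptions(
            changedValue=[vim.option.OptionValue(key=key, value="false")])
        # After the deployment completes, revert it:
        # option_mgr.UpdateOptions(
        #     changedValue=[vim.option.OptionValue(key=key, value="true")])
    finally:
        Disconnect(si)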

Option 2: Use Manual Clone Method

  • Instead of using the Basic (automatic) mode, configure VCHA using the Manual clone method, which gives more control over node placement and avoids the NFSv3 datastore-locking issue; see the sketch below.
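The manual workflow still relies on standard VM clone operations. The sketch below (Python with pyVmomi) pins the clone target to a specific host and datastore; all object lookups are assumed to happen elsewhere, and the function and variable names are illustrative. The full manual VCHA procedure in the vSphere documentation includes further steps (guest customization and VCHA network settings) that are not shown here.

    # Hypothetical helper: clone the Active node's VM onto a chosen
    # host/datastore for the manual VCHA workflow. Keeping the target
    # host identical to the source host sidesteps the NFSv3 lock timeout.
    from pyVmomi import vim

    def clone_vcha_node(source_vm, dest_folder, target_host,
                        target_datastore, node_name):
        relocate = vim.vm.RelocateSpec(
            host=target_host,
            pool=target_host.parent.resourcePool,
            datastore=target_datastore,
        )
        spec = vim.vm.CloneSpec(location=relocate, powerOn=False,
                                template=False)
        # Returns a vim.Task; wait for completion before the next clone.
        return source_vm.CloneVM_Task(folder=dest_folder, name=node_name,
                                      spec=spec)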

Additional Information

Where possible, use NFS 4.1 datastores, which provide improved session and file-locking semantics compared to NFSv3.
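If the appliance is being moved off NFSv3, a new NFS 4.1 datastore can be mounted on each ESXi host. The sketch below (Python with pyVmomi) is a minimal example; the server address, export path, and AUTH_SYS security type are assumptions to adapt to your environment.

    # Minimal sketch: mount an NFS 4.1 datastore on an ESXi host.
    from pyVmomi import vim

    def mount_nfs41(host, server, remote_path, ds_name):
        spec = vim.host.NasVolume.Specification(
            remoteHost=server,
            remoteHostNames=[server],   # NFS 4.1 allows multiple addresses
            remotePath=remote_path,     # e.g. "/export/vcsa"
            localPath=ds_name,          # datastore name as seen by ESXi
            accessMode="readWrite",
            type="NFS41",
            securityType="AUTH_SYS",
        )
        return host.configManager.datastoreSystem.CreateNasDatastore(spec)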