vCenter HA Basic Mode fails during Passive/Witness cloning when using NFSv3 Datastore

Products

VMware vCenter Server
VMware vCenter Server 8.0

Issue/Introduction

Symptoms:
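Disclaimer: This content is based on the Japanese translation of the English article "vCenter HA Basic Mode fails during Passive/Witness cloning when using NFSv3 Datastore".
The translation is produced on a best-effort basis, so the localized content may not reflect the latest information. Refer to the English version of the article for the latest information.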

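When vCenter High Availability (VCHA) is configured in Basic mode and the vCenter Server Appliance is stored on an NFSv3 datastore, the process may stall or fail at 33% or 96%.

Repeated attempts to configure VCHA with the automatic clone method fail at various stages, and errors such as the following may be reported: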

"An error occurred while communicating with the remote host"

"The session is not authenticated. You do not hold privileges "PropertyCollector.te#t session[52###bb4-#####-50b7-223d-f61#####1c88]52###06e-5874-dcda-#####-8b#####5"

/var/log/vmware/vpxd/vpxd.log contains entries similar to the following:

YYYY-MM-DDThh:mm:ss.Z warning vpxd[7FE71E63C700] [Originator@6876 sub=pbm opID=ProvisioningWizard-addMulti-24848-ngc:70001948-be-01] PBMCallback: ShouldSkipPostCloneCommonCallback: post clone callback is skipped - VM clone failed
YYYY-MM-DDThh:mm:ss.Z error vpxd[7F725BE7C700] [Originator@6876 sub=VmProv opID=FlowBasedWizard-apply-638-ngc-35-f-01] Get exception while executing action vpx.vmprov.RemoveSnapshot: N5Vmomi5Fault17HostCommunication9ExceptionE(vmodl.fault.HostCommunication)

/var/run/log/vmkernel.log on the ESXi host contains entries similar to the following:

YYYY-MM-DDThh:mm:ss.Z Vmxnet3: 17293: Disable Rx queuing; queue size 256 is larger than Vmxnet3RxQueueLimit limit of 64.
YYYY-MM-DDThh:mm:ss.Z cpu20:442393)Vmxnet3: 17651: Using default queue delivery for vmxnet3 for port 0x600002c
YYYY-MM-DDThh:mm:ss.Z cpu14:68341 opID=840210c9)VMKAPIMOD: 86: Failed to  check if port is Uplink : Failure

/var/log/vmware/vpxd/vpxd.log contains entries similar to the following:

warning vpxd[7FE6210F7700] [Originator@6876 sub=VpxProfiler opID=HB-host-#####@96321-26a52267] DoHostSync:host-##### [GetChangesTime] took 1629260 ms
warning vpxd[7FE6210F7700] [Originator@6876 sub=VpxProfiler opID=HB-host-#####@96321-26a52267] DoHostSync:host-##### [DoHostSyncTime] took 1629260 ms

warning vpxd[7FE71E63C700] [Originator@6876 sub=pbm opID=ProvisioningWizard-addMulti-24848-ngc:70001948-be-01] PBMCallback: ShouldSkipPostCloneCommonCallback: post clone callback is skipped - VM clone failed

error vpxd[7FE71E63C700] [Originator@6876 sub=VmProv opID=ProvisioningWizard-addMulti-24848-ngc:70001948-be-01] Unable to do delete snapshot clone-temp-######## because the host host.example.com is disconnected
info vpxd[7FE71E63C700] [Originator@6876 sub=VmProv opID=ProvisioningWizard-addMulti-24848-ngc:70001948-be-01] Done undo action vpx.vmprov.CreateSnapshot with output:

warning vpxd[7FE71E63C700] [Originator@6876 sub=VpxProfiler opID=ProvisioningWizard-addMulti-24848-ngc:70001948-be-01] VpxLro::LroMain [TotalTime] took 9647077 ms
error vpxd[7FE71E63C700] [Originator@6876 sub=vpxLro opID=ProvisioningWizard-addMulti-24848-ngc:70001948-be-01] [VpxLRO] Unexpected Exception: N5Vmomi5Fault17HostCommunication9ExceptionE(vmodl.fault.HostCommunication)

info vpxd[7FE71E63C700] [Originator@6876 sub=Default opID=ProvisioningWizard-addMulti-24848-ngc:70001948-be] [VpxLRO] -- ERROR task-1416563 -- vm-##### -- vim.VirtualMachine.clone: vmodl.fault.HostCommunication:
--> Result:
--> (vmodl.fault.HostCommunication) {
-->    faultCause = (vmodl.MethodFault) null,
-->    faultMessage = <unset>
-->    msg = ""
--> }
--> Args:
-->
--> Arg folder:
--> 'vim.Folder:6CA8####-###-###-####-C01D####:group-####'
--> Arg name:
--> "TestVC001"
--> Arg spec:
--> (vim.vm.CloneSpec) {
-->    location = (vim.vm.RelocateSpec) {
-->       service = (vim.ServiceLocator) null,
-->       folder = 'vim.Folder:6CA8####-###-###-####-C01D####:group-####',
-->       datastore = 'vim.Datastore:6CA8####-###-###-####-C01D####:group-####':datastore-#####',
-->       diskMoveType = <unset>,
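
The vpxd.log excerpts above share a few recognizable strings, so a small script can help locate them in a large log. The following is a minimal sketch, not an official tool: the function name `scan_vpxd_lines`, the marker list, and the 60-second `slow_ms` threshold are assumptions chosen for illustration, and the markers are simply substrings copied from the excerpts above.

```python
import re
from typing import Iterable, List, Tuple

# Substrings copied from the vpxd.log excerpts above (illustrative, not exhaustive).
FAILURE_MARKERS = (
    "post clone callback is skipped - VM clone failed",
    "vmodl.fault.HostCommunication",
    "Unable to do delete snapshot",
)

# Matches VpxProfiler entries such as "[DoHostSyncTime] took 1629260 ms".
TOOK_RE = re.compile(r"\[(?P<name>\w+)\] took (?P<ms>\d+) ms")


def scan_vpxd_lines(
    lines: Iterable[str], slow_ms: int = 60_000
) -> Tuple[List[str], List[Tuple[str, int]]]:
    """Return (failure_lines, slow_operations) found in vpxd.log lines.

    slow_ms is an arbitrary reporting threshold, not a VMware-defined value.
    """
    failures: List[str] = []
    slow: List[Tuple[str, int]] = []
    for line in lines:
        # Collect lines that contain any of the known failure strings.
        if any(marker in line for marker in FAILURE_MARKERS):
            failures.append(line.rstrip("\n"))
        # Collect profiler timings at or above the threshold.
        match = TOOK_RE.search(line)
        if match and int(match.group("ms")) >= slow_ms:
            slow.append((match.group("name"), int(match.group("ms"))))
    return failures, slow
```

On the appliance, it could be used as `scan_vpxd_lines(open("/var/log/vmware/vpxd/vpxd.log", errors="replace"))`. Applied to the excerpts above, the slow list would include `DoHostSyncTime` at 1629260 ms, which is roughly 27 minutes.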

/var/run/log/vmkernel.log on the ESXi host contains entries similar to the following:

cpu20:442393)VSCSI: 273: handle 46587(vscsi0:12):Input values: res=0 limit=-2 bw=-1 Shares=1000
cpu20:442393)Vmxnet3: 17293: Disable Rx queuing; queue size 256 is larger than Vmxnet3RxQueueLimit limit of 64.
cpu20:442393)Vmxnet3: 17651: Using default queue delivery for vmxnet3 for port 0x600002c
cpu20:442393)NetPort: 3208: resuming traffic on DV port 251
cpu20:442393)Team.etherswitch: TeamESPolicySet:5942: Port 0x600002c frp numUplinks 2 active 2(max 2) standby 0
cpu20:442393)Team.etherswitch: TeamESPolicySet:5950: Update: Port 0x600002c frp numUplinks 2 active 2(max 2) standby 0
cpu20:442393)NetPort: 1662: enabled port 0x600002c with mac ##:##:##:##:##
cpu20:442393)Vmxnet3: 17293: Disable Rx queuing; queue size 256 is larger than Vmxnet3RxQueueLimit limit of 64.
cpu20:442393)Vmxnet3: 17651: Using default queue delivery for vmxnet3 for port 0x2000007
cpu20:442393)NetPort: 1662: enabled port 0x2000007 with mac ##:##:##:##:##
cpu14:68341 opID=#####)World: 12235: VC opID HB-host-#####@96420-74cd6b42-DvsHandleHostReconnect-1fc1dfe0-3-1dd4 maps to vmkernel opID #####
cpu14:68341 opID=#####)VMKAPIMOD: 86: Failed to  check if port is Uplink : Failure
cpu14:68341 opID=#####)VMKAPIMOD: 86: Failed to  check if port is Uplink : Failure
cpu14:68341 opID=#####)VMKAPIMOD: 86: Failed to  check if port is Uplink : Failure
cpu14:68341 opID=#####)VMKAPIMOD: 86: Failed to  check if port is Uplink : Failure
cpu14:68341 opID=#####)VMKAPIMOD: 86: Failed to  check if port is Uplink : Failure
cpu14:68341 opID=#####)VMKAPIMOD: 86: Failed to  check if port is Uplink : Failure
cpu14:68341 opID=#####)VMKAPIMOD: 86: Failed to  check if port is Uplink : Failure
cpu14:68341 opID=#####)VMKAPIMOD: 86: Failed to  check if port is Uplink : Failure
cpu14:68341 opID=#####)Team.etherswitch: TeamESLACPLAGEventCB:6277: Received a LAG DESTROY event version :0, lagId :0, lagLinkStatus :NOT USED,lagName :, uplinkName :, portLinkStatus :NOT USED, portID :0x0
cpu14:68341 opID=#####)netioc: NetIOCSetRespoolVersion:245: Set netioc version for portset: DvsPortset-0 to 2,old threshold: 2
cpu14:68341 opID=#####)netioc: NetIOCPortsetNetSchedStatusSet:1207: Set sched status for portset: DvsPortset-0 to Inactive, old:Inactive
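
Because the vmkernel.log excerpt above repeats the same message many times, grouping identical messages can make the pattern easier to see. The sketch below is only an illustration under stated assumptions: `count_vmkernel_messages` is a hypothetical helper, and the prefix pattern is inferred from the `cpuNN:NNNNN opID=...)` form seen in the excerpt rather than from any documented log format.

```python
import re
from collections import Counter
from typing import Iterable

# Strips everything up to and including the "cpuNN:WORLD opID=...)" prefix
# (format inferred from the excerpt above) so identical messages group together.
PREFIX_RE = re.compile(r"^.*?cpu\d+:\d+(?: opID=\S+)?\)")


def count_vmkernel_messages(lines: Iterable[str]) -> Counter:
    """Count vmkernel.log messages after removing the per-line prefix."""
    counts: Counter = Counter()
    for line in lines:
        message = PREFIX_RE.sub("", line.strip(), count=1)
        # Collapse repeated whitespace, e.g. "Failed to  check".
        message = " ".join(message.split())
        if message:
            counts[message] += 1
    return counts
```

Calling `count_vmkernel_messages(lines).most_common(5)` on the host log would list the most frequent messages, such as the repeated `VMKAPIMOD` entries shown above.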

Environment

vCenter Server
vCenter Server 8.0

Cause

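This issue is caused by limitations of the NFSv3 locking mechanism, including its 40-second lock timeout. It can occur when the automatic clone operation used during VCHA configuration exceeds this timeout.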

Resolution

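Option 1: Specify the same ESXi host for the clone operations

To avoid the clone failure:

1. During VCHA deployment, make sure that the source (Active) ESXi host and the target ESXi host for the Passive and Witness nodes are the same host.
2. To avoid a DRS anti-affinity rule error when deploying all VCHA nodes to the same host, perform the following steps:
a. Select the vCenter Server at the top of the left-hand inventory.
b. Select [Configure] > [Settings] > [Advanced Settings].
c. Select [Edit Settings].
d. Change the following option to false and save.
Name: config.vpxd.vcha.drsAntiAffinity
Value: false

Note: After the configured nodes have been moved to different ESXi hosts, the advanced setting above can be changed back to true.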
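
The Option 1 steps above can be expressed as a small pre-check run before starting the wizard. This is a minimal sketch only: `plan_vcha_placement` and its input format are hypothetical, it does not talk to vCenter, and the only value taken from the article is the advanced setting name `config.vpxd.vcha.drsAntiAffinity`.

```python
from typing import Dict

# Advanced setting name taken from the steps above.
ANTI_AFFINITY_KEY = "config.vpxd.vcha.drsAntiAffinity"


def plan_vcha_placement(hosts_by_node: Dict[str, str]) -> Dict[str, object]:
    """Summarize whether a planned placement matches Option 1.

    hosts_by_node maps a node role ("active", "passive", "witness") to the
    ESXi host planned for it. This input format is an assumption.
    """
    unique_hosts = set(hosts_by_node.values())
    same_host = len(unique_hosts) == 1
    return {
        "same_host": same_host,
        # With all nodes on one host, the steps above set this option to false.
        "required_setting": {ANTI_AFFINITY_KEY: False} if same_host else {},
    }
```

For example, passing the same host name for all three roles returns `same_host` as `True` together with the setting that has to be changed first. Any other placement returns `same_host` as `False`, which means the plan does not follow Option 1.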

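Option 2: Use the manual clone method

Instead of using Basic (automatic) mode, configure VCHA with the manual clone method. This gives finer control over node placement and avoids the issues related to NFSv3 datastore locking.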

Additional Information

Use an NFS 4.1 configuration, which provides improved session and file-locking mechanisms.