Symptoms:
# esxcli vsan cluster get
If the command output contains the following line, the node is considered isolated:
Sub-Cluster Member Count: 1
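The check above can be scripted. The following is a minimal sketch, assuming the output contains a line of the form "Sub-Cluster Member Count: N" as shown above. The function names vsan_member_count and vsan_is_isolated are hypothetical helpers, not part of esxcli. Note that a count of 1 is expected on a genuinely single-node cluster, so the result only indicates isolation when more members are expected.

```shell
# Hypothetical helper: reads `esxcli vsan cluster get` output on stdin
# and prints the value of the "Sub-Cluster Member Count" field.
vsan_member_count() {
    awk -F': *' '/Sub-Cluster Member Count/ {
        gsub(/[ \t\r]/, "", $2)   # drop stray whitespace around the number
        print $2
        exit
    }'
}

# Hypothetical helper: succeeds (exit status 0) when the host reports
# only itself as a cluster member.
vsan_is_isolated() {
    [ "$(vsan_member_count)" = "1" ]
}

# Usage on a host (shown as a comment, not executed here):
#   esxcli vsan cluster get | vsan_is_isolated && echo "node appears isolated"
```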
From /var/run/log/vsansystem.log, it is observed that the node is placed into a cluster-partitioned and network-isolated state. The nodeCount value is reported as 1, indicating that connectivity with the rest of the cluster is not maintained.
YYYY-MM-DDTHH:MM:SSZ info vsansystem[2099080] [vSAN@6876 sub=VsanSystemProvider opId=CMMDSMembershipUpdate-2dd7] Complete, nodeCount: 1, runtime info: (vim.vsan.host.VsanRuntimeInfo) {
-->    membershipList = (vim.vsan.host.MembershipInfo) [
-->       (vim.vsan.host.MembershipInfo) {
-->          nodeUuid = "5275a207-####-####-####-###########",
-->          hostname = "XXXXXX"
-->       }
-->    ],
-->    diskIssues = <unset>,
-->    accessGenNo = <unset>
From /var/run/log/vmkernel.log, entries for the affected node indicate that the host is removed from the cluster due to repeated heartbeat timeouts.
YYYY-MM-DDTHH:MM:SSZ cpu0:2098114)CMMDS: LeaderRxFilter:7449: 5275a207-####-####-####-###########: Received LeaderHeartbeat during failover
YYYY-MM-DDTHH:MM:SSZ cpu0:2098114)CMMDS: CMMDSLeaderPrintFilterOutMsg:7364: 5275a207-####-####-####-###########: Filtering out LeaderHeartbeat from node: 61c4ee00-####-####-####-######### SeqNum = 7445826
YYYY-MM-DDTHH:MM:SSZ cpu0:2098114)CMMDS: CMMDSLeaderPrintFilterOutMsg:7367: 5275a207-####-####-####-###########: Current membershipID: 179f4168-####-####-####-########### Received membershipID: 4dfdd267-####-####-####-###########
YYYY-MM-DDTHH:MM:SSZ cpu0:2098114)CMMDS: CMMDSStateMachineReceiveLoop:1731: 5275a207-####-####-####-###########: Error receiving from 5ac4eab4-####-####-####-###########: Failure
YYYY-MM-DDTHH:MM:SSZ cpu0:2098114)CMMDS: CMMDSStateDestroyNode:706: 5275a207-####-####-####-###########: Destroying node 5ac4eab4-####-####-####-###########: Failed to receive from node
YYYY-MM-DDTHH:MM:SSZ cpu0:2098114)CMMDS: LeaderRemoveNodeFromMembership:7965: 5275a207-####-####-####-###########: Removing node 5ac4eab4-####-####-####-########### (vsanNodeType: data) from the cluster membership
YYYY-MM-DDTHH:MM:SSZ cpu0:2098114)CMMDS: CMMDSClusterDestroyNodeImpl:264: Destroying node 5ac4ee7d-####-####-####-########### from the cluster db. Last HB received from node - 7429###########
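To find which nodes were dropped from membership, the vmkernel.log entries above can be filtered. The following is a minimal sketch, assuming the LeaderRemoveNodeFromMembership message keeps the wording shown in the sample ("Removing node <UUID> ... from the cluster membership"). The function name cmmds_removed_nodes is a hypothetical helper.

```shell
# Hypothetical helper: reads vmkernel.log text on stdin and prints the UUID
# of every node that CMMDS reports removing from the cluster membership,
# one UUID per line. Lines that do not match the pattern are ignored.
cmmds_removed_nodes() {
    sed -n 's/.*LeaderRemoveNodeFromMembership.*Removing node \([^ ]*\) .*from the cluster membership.*/\1/p'
}

# Usage on a host (shown as a comment, not executed here):
#   cmmds_removed_nodes < /var/run/log/vmkernel.log | sort | uniq -c
# The count per UUID gives a rough view of how often each node was removed.
```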
vsish -e ls /net/pNics | while read nics; do echo -n $nics; vsish -e cat /net/pNics/${nics}stats; done | less
pktsInDropped = 0
pktsOutDropped = 77584 >>>>>>>>>> High number of packets dropped
Environment:
VMware vSAN 8.X
Abnormal transmit-queue behavior is observed on the Mellanox ConnectX-3 Pro NICs: packet drops on the VMkernel interface, transmit queue stalls and resets at the NIC level, and eventual connection shutdown.
Using the VMware Interactive Shell (vsish) as root on the ESXi CLI, run the following command to view pNIC statistics on a live host:
vsish -e ls /net/pNics | while read nics; do echo -n $nics; vsish -e cat /net/pNics/${nics}stats; done | less
txCheckSumDone : 35943902198
txTsoDone : 3004727829
txQueueStopped : 3787
txQueueWaken : 3787
rxCheckSumOk : 76283507558
rxCheckSumNone : 0
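Individual counters can be pulled out of this output for comparison over time. The following is a minimal sketch, assuming each counter appears on its own line as either "name : value" or "name = value", since both layouts appear in the samples above. The function names pnic_counter and pnic_has_tx_drops are hypothetical helpers, and vmnic0 in the usage comment is only an example NIC name. Feed the functions the statistics of one NIC at a time, because only the first matching counter is reported.

```shell
# Hypothetical helper: reads pNIC statistics text on stdin and prints the
# value of the counter named in $1. Accepts "name : value" and "name = value".
pnic_counter() {
    awk -v name="$1" '
        {
            sep = match($0, /[=:]/)          # position of the first separator
            if (sep == 0) next               # not a counter line
            key = substr($0, 1, sep - 1)
            val = substr($0, sep + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", key) # trim surrounding whitespace
            gsub(/^[ \t]+|[ \t]+$/, "", val)
            if (key == name) { print val; exit }
        }'
}

# Hypothetical helper: succeeds when pktsOutDropped is present and above zero.
pnic_has_tx_drops() {
    drops=$(pnic_counter pktsOutDropped)
    [ -n "$drops" ] && [ "$drops" -gt 0 ]
}

# Usage on a host (shown as a comment, not executed here; vmnic0 is an example):
#   vsish -e cat /net/pNics/vmnic0/stats | pnic_counter txQueueStopped
```

Sampling txQueueStopped and pktsOutDropped twice, a few minutes apart, shows whether the counters are still increasing or only reflect a past event.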
Engagement with the NIC vendor (Mellanox) is recommended for appropriate corrective action.