Network Isolation on ESXi Host Triggered by Mellanox Adapter Connection Reset Events

Article ID: 400536

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:

  • The ESXi node is network-isolated. To confirm, run the following command:

#esxcli vsan cluster get

If the output contains the following line, the node is considered isolated:
Sub-Cluster Member Count: 1
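
The check above can be scripted. The following is a minimal sketch, assuming the command output contains a line of the form "Sub-Cluster Member Count: N" as shown above. The function names vsan_member_count and vsan_is_isolated are hypothetical and are not part of any VMware tool.

```shell
# Hypothetical helper: read `esxcli vsan cluster get` output on stdin and
# print the value of the "Sub-Cluster Member Count" field.
vsan_member_count() {
    awk -F': *' '/Sub-Cluster Member Count/ {
        gsub(/[ \t\r]/, "", $2)   # drop stray whitespace around the value
        print $2
        exit
    }'
}

# Hypothetical helper: succeed (exit status 0) when the member count is
# exactly 1, meaning the host only sees itself in the cluster.
vsan_is_isolated() {
    count=$(vsan_member_count)
    [ "$count" = "1" ]
}

# Example use on a host (not executed here):
#   esxcli vsan cluster get | vsan_is_isolated && echo "host appears isolated"
```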

In /var/run/log/vsansystem.log, the node is reported as cluster-partitioned and network-isolated. The nodeCount value is 1, which indicates that the host has lost connectivity with the rest of the cluster.

YYYY-MM-DDTHH:MM:SSZ info vsansystem[2099080] [vSAN@6876 sub=VsanSystemProvider opId=CMMDSMembershipUpdate-2dd7] Complete, nodeCount: 1, runtime info: (vim.vsan.host.VsanRuntimeInfo) {
-->    membershipList = (vim.vsan.host.MembershipInfo) [
-->       (vim.vsan.host.MembershipInfo) {
-->          nodeUuid = "5275a207-####-####-####-###########",
-->          hostname = "XXXXXX"
-->       }
-->    ],
-->    diskIssues = <unset>,
-->    accessGenNo = <unset>

From /var/run/log/vmkernel.log, entries for the affected node indicate that the host is removed from the cluster due to repeated heartbeat timeouts.

YYYY-MM-DDTHH:MM:SSZ cpu0:2098114)CMMDS: LeaderRxFilter:7449: 5275a207-####-####-####-###########: Received LeaderHeartbeat during failover
YYYY-MM-DDTHH:MM:SSZ cpu0:2098114)CMMDS: CMMDSLeaderPrintFilterOutMsg:7364: 5275a207-####-####-####-###########: Filtering out LeaderHeartbeat from node: 61c4ee00-####-####-####-######### SeqNum = 7445826
YYYY-MM-DDTHH:MM:SSZ cpu0:2098114)CMMDS: CMMDSLeaderPrintFilterOutMsg:7367: 5275a207-####-####-####-###########: Current membershipID: 179f4168-####-####-####-########### Received membershipID: 4dfdd267-####-####-####-###########
YYYY-MM-DDTHH:MM:SSZ cpu0:2098114)CMMDS: CMMDSStateMachineReceiveLoop:1731: 5275a207-####-####-####-###########: Error receiving from 5ac4eab4-####-####-####-###########: Failure
YYYY-MM-DDTHH:MM:SSZ cpu0:2098114)CMMDS: CMMDSStateDestroyNode:706: 5275a207-####-####-####-###########: Destroying node 5ac4eab4-####-####-####-###########: Failed to receive from node
YYYY-MM-DDTHH:MM:SSZ cpu0:2098114)CMMDS: LeaderRemoveNodeFromMembership:7965: 5275a207-####-####-####-###########: Removing node 5ac4eab4-####-####-####-########### (vsanNodeType: data) from the cluster membership
YYYY-MM-DDTHH:MM:SSZ cpu0:2098114)CMMDS: CMMDSClusterDestroyNodeImpl:264: Destroying node 5ac4ee7d-####-####-####-########### from the cluster db. Last HB received from node - 7429###########
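
To see which nodes were dropped and how often, the removal messages can be pulled out of the log. This is a minimal sketch, assuming the LeaderRemoveNodeFromMembership lines have the layout shown above. The function name cmmds_removed_nodes is hypothetical.

```shell
# Hypothetical helper: read vmkernel.log lines on stdin and print the UUID of
# each node that CMMDS reported removing from the cluster membership.
# Lines that do not match the expected layout are ignored.
cmmds_removed_nodes() {
    sed -n 's/.*LeaderRemoveNodeFromMembership.*Removing node \([^ ]*\) .*/\1/p'
}

# Example use on a host (not executed here), counting removals per node:
#   cmmds_removed_nodes < /var/run/log/vmkernel.log | sort | uniq -c
```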

  • Significant packet drops are seen on transmitted traffic for the vSAN vmnic (vmnic#). To view the pNIC statistics on a live host, use the VMware Interactive Shell (vsish) as root on the ESXi CLI:

    vsish -e ls /net/pNics | while read nics; do echo -n $nics; vsish -e cat /net/pNics/${nics}stats; done | less
                       
            pktsInDropped = 0
            pktsOutDropped = 77584            >>>>>>>>>> High number of packets dropped
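
The drop counter can also be checked from a script. The following is a minimal sketch, assuming the statistics use the "name = value" layout shown above. The function names tx_dropped and tx_drops_exceed are hypothetical, and the limit passed to tx_drops_exceed is an arbitrary illustration, not a VMware-defined threshold.

```shell
# Hypothetical helper: read the stats of one pNIC on stdin and print the
# pktsOutDropped value. Anything after the value on the same line is ignored.
tx_dropped() {
    awk '$1 == "pktsOutDropped" && $2 == "=" { print $3; exit }'
}

# Hypothetical helper: succeed when the drop counter is greater than the
# limit given as the first argument (default 0).
tx_drops_exceed() {
    limit=${1:-0}
    drops=$(tx_dropped)
    [ -n "$drops" ] && [ "$drops" -gt "$limit" ]
}

# Example use on a host (not executed here):
#   vsish -e cat /net/pNics/vmnic#/stats | tx_drops_exceed 1000 \
#       && echo "transmit drops above limit"
```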

Environment

VMware vSAN 8.X

Cause

The transmit queue of the Mellanox ConnectX-3 Pro NICs behaves abnormally. Packet drops on the VMkernel interface, transmit queue stalls and resets at the NIC level, and an eventual connection shutdown are reported.

To view the pNIC statistics on a live host, use the VMware Interactive Shell (vsish) as root on the ESXi CLI:

vsish -e ls /net/pNics | while read nics; do echo -n $nics; vsish -e cat /net/pNics/${nics}stats; done | less

       
         txCheckSumDone : 35943902198
         txTsoDone : 3004727829
         txQueueStopped : 3787
         txQueueWaken : 3787
         rxCheckSumOk : 76283507558
         rxCheckSumNone : 0
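
A single reading does not show whether the queue is still stalling. Assuming these counters are cumulative, comparing two saved snapshots shows how much txQueueStopped grew in between. This is a minimal sketch that assumes the "name : value" layout shown above. The function name counter_delta and the snapshot file paths are hypothetical.

```shell
# Hypothetical helper: print how much a counter grew between two saved
# snapshots of the pNIC stats.
# Usage: counter_delta NAME BEFORE_FILE AFTER_FILE
counter_delta() {
    awk -v name="$1" '
        # Remember the counter value seen in each input file.
        $1 == name && $2 == ":" { v[FILENAME] = $3 }
        # ARGV[1] is the "before" file and ARGV[2] is the "after" file.
        END { printf "%.0f\n", v[ARGV[2]] - v[ARGV[1]] }
    ' "$2" "$3"
}

# Example use on a host (not executed here):
#   vsish -e cat /net/pNics/vmnic#/stats > /tmp/before
#   sleep 60
#   vsish -e cat /net/pNics/vmnic#/stats > /tmp/after
#   counter_delta txQueueStopped /tmp/before /tmp/after
```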
       

Resolution

Engage the Mellanox (NIC) vendor for appropriate corrective action.