Network Partition Of vSAN Node

Article ID: 326579

Products

VMware vSAN

Issue/Introduction

This article describes how to isolate a faulty physical NIC that causes a vSAN network partition.

Symptoms:
  • The issue was seen in a 2-node stretched vSAN cluster running ESXi 6.7 U2.
  • The cluster is partitioned, and the witness can ping only one data node.
  • The 2 data nodes are able to communicate with each other and both appear in the output of the "esxcli vsan cluster get" command.
  • The issue is related to a physical NIC.


Environment

VMware vSAN 6.x

Cause

The NIC was in a "hung state", even though it was running the latest driver.

Resolution

Packets are dropped when pinging the vSAN VMkernel interface of the other data node:

NODE2# vmkping -I vmk2 192.168.1.111 -c 1000
PING 192.168.1.111 (192.168.1.111): 56 data bytes
64 bytes from 192.168.1.111: icmp_seq=2 ttl=64 time=0.133 ms
64 bytes from 192.168.1.111: icmp_seq=3 ttl=64 time=0.111 ms
64 bytes from 192.168.1.111: icmp_seq=4 ttl=64 time=0.129 ms
64 bytes from 192.168.1.111: icmp_seq=5 ttl=64 time=0.133 ms
64 bytes from 192.168.1.111: icmp_seq=6 ttl=64 time=0.137 ms
64 bytes from 192.168.1.111: icmp_seq=7 ttl=64 time=0.140 ms
64 bytes from 192.168.1.111: icmp_seq=8 ttl=64 time=0.141 ms
64 bytes from 192.168.1.111: icmp_seq=9 ttl=64 time=0.127 ms
64 bytes from 192.168.1.111: icmp_seq=10 ttl=64 time=0.139 ms
64 bytes from 192.168.1.111: icmp_seq=11 ttl=64 time=0.087 ms
<======= Sequences 12 to 36 missed
64 bytes from 192.168.1.111: icmp_seq=37 ttl=64 time=0.137 ms
64 bytes from 192.168.1.111: icmp_seq=38 ttl=64 time=0.151 ms
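
Spotting a skipped sequence by eye is easy in a short run but tedious across 1000 pings. The following is a minimal sketch, not part of the original troubleshooting, that scans saved vmkping output for jumps in icmp_seq. The function name find_seq_gaps and the sample text are illustrative assumptions; the sample lines are copied from the output above.

```python
import re

def find_seq_gaps(ping_output):
    """Return (last_seen, next_seen) pairs wherever icmp_seq skips a value."""
    seqs = [int(m.group(1)) for m in re.finditer(r"icmp_seq=(\d+)", ping_output)]
    gaps = []
    for prev, cur in zip(seqs, seqs[1:]):
        if cur - prev > 1:
            gaps.append((prev, cur))
    return gaps

# Sample lines taken from the vmkping output shown in this article.
sample = """\
64 bytes from 192.168.1.111: icmp_seq=10 ttl=64 time=0.139 ms
64 bytes from 192.168.1.111: icmp_seq=11 ttl=64 time=0.087 ms
64 bytes from 192.168.1.111: icmp_seq=37 ttl=64 time=0.137 ms
64 bytes from 192.168.1.111: icmp_seq=38 ttl=64 time=0.151 ms
"""

for prev, cur in find_seq_gaps(sample):
    print(f"no replies between seq {prev} and seq {cur}: {cur - prev - 1} lost")
```

Only replies that arrive are inspected, so losses at the very start or end of a run still have to be read from the ping statistics summary.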


A packet capture shows that UDP traffic is flowing, even though in the ping output above sequence 11 is followed by sequence 37.


Run this command on one of the data nodes where vmnic4 is the uplink used by the vSAN VMkernel port:

# pktcap-uw --uplink vmnic4 --dir 0 --stage 1 --proto 0x11 -o - | tcpdump-uw -r - -nne

Output of the above command:

The Stage is Post.
The session filter IP protocol is 0x11.
pktcap: The output file is -.
pktcap: No server port specifed, select 21248 as the port.
pktcap: Local CID 2.
pktcap: Listen on port 21248.
reading from file -, link-type EN10MB (Ethernet)
pktcap: Accept...
pktcap: Vsock connection from port 1029 cid 2.
07:39:15.068063 00:50:56:6a:c1:90 > 00:50:56:61:cb:93, ethertype IPv4 (0x0800), length 178: 192.168.1.222.12321 > 192.168.1.111.12321: UDP, length 136
07:39:16.068090 00:50:56:6a:c1:90 > 00:50:56:61:cb:93, ethertype IPv4 (0x0800), length 178: 192.168.1.222.12321 > 192.168.1.111.12321: UDP, length 136
07:39:17.068136 00:50:56:6a:c1:90 > 00:50:56:61:cb:93, ethertype IPv4 (0x0800), length 258: 192.168.1.222.12321 > 192.168.1.111.12321: UDP, length 216
07:39:17.068162 00:50:56:6a:c1:90 > 00:50:56:61:cb:93, ethertype IPv4 (0x0800), length 186: 192.168.1.222.12321 > 192.168.1.111.12321: UDP, length 144
07:39:18.068157 00:50:56:6a:c1:90 > 00:50:56:61:cb:93, ethertype IPv4 (0x0800), length 258: 192.168.1.222.12321 > 192.168.1.111.12321: UDP, length 216
07:39:18.068186 00:50:56:6a:c1:90 > 00:50:56:61:cb:93, ethertype IPv4 (0x0800), length 186: 192.168.1.222.12321 > 192.168.1.111.12321: UDP, length 144
07:39:19.068208 00:50:56:6a:c1:90 > 00:50:56:61:cb:93, ethertype IPv4 (0x0800), length 242: 192.168.1.222.12321 > 192.168.1.111.12321: UDP, length 200
07:39:20.068203 00:50:56:6a:c1:90 > 00:50:56:61:cb:93, ethertype IPv4 (0x0800), length 242: 192.168.1.222.12321 > 192.168.1.111.12321: UDP, length 200
07:39:21.068238 00:50:56:6a:c1:90 > 00:50:56:61:cb:93, ethertype IPv4 (0x0800), length 242: 192.168.1.222.12321 > 192.168.1.111.12321: UDP, length 200
07:39:22.068288 00:50:56:6a:c1:90 > 00:50:56:61:cb:93, ethertype IPv4 (0x0800), length 242: 192.168.1.222.12321 > 192.168.1.111.12321: UDP, length 200
07:39:23.068326 00:50:56:6a:c1:90 > 00:50:56:61:cb:93, ethertype IPv4 (0x0800), length 242: 192.168.1.222.12321 > 192.168.1.111.12321: UDP, length 200
07:39:24.068347 00:50:56:6a:c1:90 > 00:50:56:61:cb:93, ethertype IPv4 (0x0800), length 242: 192.168.1.222.12321 > 192.168.1.111.12321: UDP, length 200
07:39:25.068365 00:50:56:6a:c1:90 > 00:50:56:61:cb:93, ethertype IPv4 (0x0800), length 242: 192.168.1.222.12321 > 192.168.1.111.12321: UDP, length 200
07:39:26.068417 00:50:56:6a:c1:90 > 00:50:56:61:cb:93, ethertype IPv4 (0x0800), length 242: 192.168.1.222.12321 > 192.168.1.111.12321: UDP, length 200
07:39:27.068432 00:50:56:6a:c1:90 > 00:50:56:61:cb:93, ethertype IPv4 (0x0800), length 466: 192.168.1.222.12321 > 192.168.1.111.12321: UDP, length 424
07:39:28.068511 00:50:56:6a:c1:90 > 00:50:56:61:cb:93, ethertype IPv4 (0x0800), length 466: 192.168.1.222.12321 > 192.168.1.111.12321: UDP, length 424


The same kind of packet capture with an ICMP filter, taken here on vmnic5, shows more drops:

# pktcap-uw --uplink vmnic5 --dir 0 --stage 0 --proto 0x01 -o - | tcpdump-uw -r - -nne
The name of the uplink is vmnic5.
The Stage is Pre.
The session filter IP protocol is 0x01.
pktcap: The output file is -.
pktcap: No server port specifed, select 42606 as the port.
pktcap: Local CID 2.
pktcap: Listen on port 42606.
reading from file -, link-type EN10MB (Ethernet)
pktcap: Accept...
pktcap: Vsock connection from port 1026 cid 2.

07:45:06.559790 00:50:56:61:cb:93 > 00:50:56:6a:c1:90, ethertype IPv4 (0x0800), length 98: 192.168.1.111 > 192.168.1.222: ICMP echo request, id 36438, seq 98, length 64
07:45:07.561992 00:50:56:61:cb:93 > 00:50:56:6a:c1:90, ethertype IPv4 (0x0800), length 98: 192.168.1.111 > 192.168.1.222: ICMP echo request, id 36438, seq 99, length 64
07:45:08.562521 00:50:56:61:cb:93 > 00:50:56:6a:c1:90, ethertype IPv4 (0x0800), length 98: 192.168.1.111 > 192.168.1.222: ICMP echo request, id 36438, seq 100, length 64
07:45:09.564725 00:50:56:61:cb:93 > 00:50:56:6a:c1:90, ethertype IPv4 (0x0800), length 98: 192.168.1.111 > 192.168.1.222: ICMP echo request, id 36438, seq 101, length 64
07:45:10.566928 00:50:56:61:cb:93 > 00:50:56:6a:c1:90, ethertype IPv4 (0x0800), length 98: 192.168.1.111 > 192.168.1.222: ICMP echo request, id 36438, seq 102, length 64
07:45:11.569107 00:50:56:61:cb:93 > 00:50:56:6a:c1:90, ethertype IPv4 (0x0800), length 98: 192.168.1.111 > 192.168.1.222: ICMP echo request, id 36438, seq 103, length 64
07:45:27.598571 00:50:56:61:cb:93 > 00:50:56:6a:c1:90, ethertype IPv4 (0x0800), length 98: 192.168.1.111 > 192.168.1.222: ICMP echo request, id 36438, seq 119, length 64 <======== sequences missed again (103 is followed by 119)
07:45:28.600526 00:50:56:61:cb:93 > 00:50:56:6a:c1:90, ethertype IPv4 (0x0800), length 98: 192.168.1.111 > 192.168.1.222: ICMP echo request, id 36438, seq 120, length 64
07:45:29.602738 00:50:56:61:cb:93 > 00:50:56:6a:c1:90, ethertype IPv4 (0x0800), length 98: 192.168.1.111 > 192.168.1.222: ICMP echo request, id 36438, seq 121, length 64
07:45:30.604959 00:50:56:61:cb:93 > 00:50:56:6a:c1:90, ethertype IPv4 (0x0800), length 98: 192.168.1.111 > 192.168.1.222: ICMP echo request, id 36438, seq 122, length 64
07:45:31.607195 00:50:56:61:cb:93 > 00:50:56:6a:c1:90, ethertype IPv4 (0x0800), length 98: 192.168.1.111 > 192.168.1.222: ICMP echo request, id 36438, seq 123, length 64
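
Because each captured line carries a timestamp, the capture also shows how long the NIC stopped delivering packets. The sketch below is an illustration under stated assumptions: it expects tcpdump-uw text in the "-nne" format shown above, the names LINE_RE and capture_gaps are hypothetical, and timestamps that wrap past midnight are not handled.

```python
import re
from datetime import datetime

# Matches the leading timestamp and the ICMP sequence number of a tcpdump -nne line.
LINE_RE = re.compile(
    r"^(\d{1,2}:\d{2}:\d{2}\.\d+) .*ICMP echo request, id \d+, seq (\d+)"
)

def capture_gaps(lines):
    """Return (prev_seq, next_seq, seconds) for every jump in the sequence numbers."""
    gaps = []
    prev = None  # (timestamp, seq) of the previous echo request
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip pktcap-uw banner lines and other traffic
        ts = datetime.strptime(m.group(1), "%H:%M:%S.%f")
        seq = int(m.group(2))
        if prev is not None and seq - prev[1] > 1:
            gaps.append((prev[1], seq, (ts - prev[0]).total_seconds()))
        prev = (ts, seq)
    return gaps

# Two lines copied from the ICMP capture in this article.
capture = [
    "07:45:11.569107 00:50:56:61:cb:93 > 00:50:56:6a:c1:90, ethertype IPv4 (0x0800), "
    "length 98: 192.168.1.111 > 192.168.1.222: ICMP echo request, id 36438, seq 103, length 64",
    "07:45:27.598571 00:50:56:61:cb:93 > 00:50:56:6a:c1:90, ethertype IPv4 (0x0800), "
    "length 98: 192.168.1.111 > 192.168.1.222: ICMP echo request, id 36438, seq 119, length 64",
]

for prev_seq, next_seq, seconds in capture_gaps(capture):
    print(f"seq {prev_seq} -> {next_seq}: nothing received for {seconds:.1f} s")
```

For the two lines above, the gap between seq 103 and seq 119 spans roughly 16 seconds, which matches a one-second ping interval with 15 requests never reaching the uplink.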


Cluster membership and VMkernel IP configuration on the data nodes:

NODE2# esxcli vsan cluster get
Cluster Information
   Enabled: true
   Current Local Time: 2019-09-03T07:02:40Z
   Local Node UUID: 5cc8c87f-1ad4-5768-e1fe-20040ff07c0e
   Local Node Type: NORMAL
   Local Node State: MASTER
   Local Node Health State: HEALTHY
   Sub-Cluster Master UUID: 5cc8c87f-1ad4-5768-e1fe-20040ff07c0e
   Sub-Cluster Backup UUID:
   Sub-Cluster UUID: 52694080-22bb-7ced-95fb-01b8e5a92f67
   Sub-Cluster Membership Entry Revision: 0
   Sub-Cluster Member Count: 1
   Sub-Cluster Member UUIDs: 5cc8c87f-1ad4-5768-e1fe-20040ff07c0e
   Sub-Cluster Member HostNames: NODE2
   Sub-Cluster Membership UUID: 01106e5d-52e2-c75d-907e-20040ff07c0e
   Unicast Mode Enabled: true
   Maintenance Mode State: OFF
   Config Generation: de0bb183-3c7e-418b-b652-258770e89e01 12 2019-08-19T09:12:12.1
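
The member count in this output is the quickest indicator of a partition. As a rough sketch, assuming the output of "esxcli vsan cluster get" has been saved as text, the following parses the "key: value" lines and compares the member count with the number of hosts expected in the cluster. The helper names and the expected value of 3 (two data nodes plus the witness) are assumptions for illustration.

```python
def parse_cluster_get(text):
    """Turn 'Key: value' lines of `esxcli vsan cluster get` output into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep:
            info[key.strip()] = value.strip()
    return info

def is_partitioned(text, expected_members):
    """True when the node sees fewer members than the cluster should have."""
    info = parse_cluster_get(text)
    return int(info["Sub-Cluster Member Count"]) < expected_members

# Excerpt of the NODE2 output shown in this article.
node2_output = """\
Cluster Information
   Enabled: true
   Current Local Time: 2019-09-03T07:02:40Z
   Local Node State: MASTER
   Sub-Cluster Member Count: 1
   Sub-Cluster Member HostNames: NODE2
"""

# Assumption: a 2-node cluster with a witness should report 3 members.
print("partitioned:", is_partitioned(node2_output, expected_members=3))
```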

NODE2# esxcli network ip interface ipv4 get
Name  IPv4 Address   IPv4 Netmask     IPv4 Broadcast  Address Type  Gateway        DHCP DNS
----  -------------  ---------------  --------------  ------------  -------------  --------
vmk0  10.12.132.247  255.255.255.224  10.12.132.255   STATIC        10.12.132.225     false
vmk2  192.168.1.222  255.255.255.0    192.168.1.255   STATIC        0.0.0.0           false
vmk3  192.168.2.20   255.255.255.0    192.168.2.255   STATIC        0.0.0.0           false



NODE1# esxcli network ip interface ipv4 get
Name  IPv4 Address   IPv4 Netmask     IPv4 Broadcast  Address Type  Gateway        DHCP DNS
----  -------------  ---------------  --------------  ------------  -------------  --------
vmk0  10.12.132.246  255.255.255.224  10.12.132.255   STATIC        10.12.132.225     false
vmk2  192.168.1.111  255.255.255.0    192.168.1.255   STATIC        0.0.0.0           false
vmk3  192.168.2.10   255.255.255.0    192.168.2.255   STATIC        0.0.0.0           false


With vmnic4 taken down so that only vmnic5 carries the vSAN traffic, the ping shows 100% packet loss:

NODE2# esxcli network nic list
Name    PCI Device    Driver   Admin Status  Link Status  Speed  Duplex  MAC Address         MTU  Description
------  ------------  -------  ------------  -----------  -----  ------  -----------------  ----  -----------------------------------------------------------------
vmnic0  0000:18:00.0  ntg3     Up            Up            1000  Full    20:04:0f:f0:7c:0c  1500  Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet
vmnic1  0000:18:00.1  ntg3     Up            Down             0  Half    20:04:0f:f0:7c:0d  1500  Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet
vmnic2  0000:19:00.0  ntg3     Up            Up            1000  Full    20:04:0f:f0:7c:0e  1500  Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet
vmnic3  0000:19:00.1  ntg3     Up            Down             0  Half    20:04:0f:f0:7c:0f  1500  Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet
vmnic4  0000:87:00.0  qedentv  Down          Down             0  Half    34:80:0d:0f:0a:2c  1500  QLogic Corp. QLogic FastLinQ QL41xxx 1/10/25 GbE Ethernet Adapter
vmnic5  0000:87:00.1  qedentv  Up            Up           10000  Full    34:80:0d:0f:0a:2d  1500  QLogic Corp. QLogic FastLinQ QL41xxx 1/10/25 GbE Ethernet Adapter

NODE2# vmkping -I vmk2 192.168.1.111 -c 100 -i 0.005
PING 192.168.1.111 (192.168.1.111): 56 data bytes

--- 192.168.1.111 ping statistics ---
100 packets transmitted, 0 packets received, 100% packet loss


After bringing the other NIC (vmnic4) up and taking the faulty NIC (vmnic5) down, no packets are lost. To confirm which uplink the VMkernel port is using, run esxtop and press "n" to open the network view, which shows the association between each NIC and VMkernel port.

NODE2# esxcli network nic up -n vmnic4

NODE2# esxcli network nic list
Name    PCI Device    Driver   Admin Status  Link Status  Speed  Duplex  MAC Address         MTU  Description
------  ------------  -------  ------------  -----------  -----  ------  -----------------  ----  -----------------------------------------------------------------
vmnic0  0000:18:00.0  ntg3     Up            Up            1000  Full    20:04:0f:f0:7c:0c  1500  Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet
vmnic1  0000:18:00.1  ntg3     Up            Down             0  Half    20:04:0f:f0:7c:0d  1500  Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet
vmnic2  0000:19:00.0  ntg3     Up            Up            1000  Full    20:04:0f:f0:7c:0e  1500  Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet
vmnic3  0000:19:00.1  ntg3     Up            Down             0  Half    20:04:0f:f0:7c:0f  1500  Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet
vmnic4  0000:87:00.0  qedentv  Up            Up           10000  Full    34:80:0d:0f:0a:2c  1500  QLogic Corp. QLogic FastLinQ QL41xxx 1/10/25 GbE Ethernet Adapter
vmnic5  0000:87:00.1  qedentv  Up            Up           10000  Full    34:80:0d:0f:0a:2d  1500  QLogic Corp. QLogic FastLinQ QL41xxx 1/10/25 GbE Ethernet Adapter

NODE2# esxcli network nic down -n vmnic5

NODE2# vmkping -I vmk2 192.168.1.111 -c 100 -i 0.005
PING 192.168.1.111 (192.168.1.111): 56 data bytes
64 bytes from 192.168.1.111: icmp_seq=0 ttl=64 time=0.148 ms
64 bytes from 192.168.1.111: icmp_seq=1 ttl=64 time=0.069 ms
64 bytes from 192.168.1.111: icmp_seq=2 ttl=64 time=0.066 ms
64 bytes from 192.168.1.111: icmp_seq=3 ttl=64 time=0.072 ms
64 bytes from 192.168.1.111: icmp_seq=4 ttl=64 time=0.068 ms
64 bytes from 192.168.1.111: icmp_seq=5 ttl=64 time=0.061 ms


The NIC was already using the latest driver:

NODE2# vmkload_mod -s qedentv
vmkload_mod module information
 input file: /usr/lib/vmware/vmkmod/qedentv
 Version: 3.9.31.2-1OEM.670.0.0.8169922
 Build Type: release
 License: QLogic_Proprietary
 Required name-spaces:
  com.vmware.vmkapi#v2_5_0_0
 Parameters:


Workaround:
  • Isolate the faulty NIC on the standard switch that it shares with the working NIC, and set the "Load balancing" policy to "Route based on originating port ID".
  • Move the faulty NIC to standby.
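
The workaround can also be applied from the ESXi shell rather than the vSphere Client. The sketch below only builds and prints the esxcli command line and does not execute it. The vSwitch name "vSwitch1" is a hypothetical placeholder, and the choice of vmnic4 as active and vmnic5 as standby is an assumption based on the tests above, where traffic failed on vmnic5 alone and recovered on vmnic4.

```python
def failover_policy_cmd(vswitch, active, standby):
    """Build the esxcli arguments for a standard vSwitch teaming policy.

    'portid' is the esxcli value for "Route based on originating port ID".
    """
    return [
        "esxcli", "network", "vswitch", "standard", "policy", "failover", "set",
        f"--vswitch-name={vswitch}",
        "--load-balancing=portid",
        f"--active-uplinks={','.join(active)}",
        f"--standby-uplinks={','.join(standby)}",
    ]

# Hypothetical vSwitch name; replace with the switch that hosts the vSAN VMkernel port.
cmd = failover_policy_cmd("vSwitch1", active=["vmnic4"], standby=["vmnic5"])
print(" ".join(cmd))
```

If the vSAN port group overrides the switch-level teaming policy, the same change would need to be made at the port group level. Re-run the vmkping test afterwards to confirm that no sequences are skipped.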