This document is an outline of general Network troubleshooting tasks, with specific focus on a vSAN environment.
This does not focus on specific network issues but can provide a starting point when troubleshooting.
If you are experiencing network-related issues, you may see any of the following symptoms:
Host(s) are showing as network partitioned (vSAN Skyline Health is reporting a Cluster Partition)
Skyline Health is showing alerts in the section "Network"
Host(s) will not join the vSAN cluster.
Experiencing packet drops, timeouts or other related Network errors.
ESXi Log messages (vmkernel, vobd) may point to communication issues - Examples: Timeouts, excessive messaging coming from the Network Card(s), Heartbeat Timeouts etc.
VMware vSAN (All Versions)
The following steps are not listed in any specific order, but each set of commands helps to build your data set for troubleshooting.
If you are unable to identify a specific issue, please reach out to the VMware Network team for additional assistance.
Having the data collected from many of the following commands can be very useful in decreasing the time to resolution.
To help understand the Network layout of the Cluster either take a look at the Web Client or refer to the following commands run on an ESX Host console or SSH session.
ESXCLI Command Reference - esxcli network
Cluster Membership Status:
esxcli vsan cluster get (Verify that all Hosts have joined the Cluster)
Configured VMKernel Ports on Host:
esxcli vsan network list (VMKernel Port for vSAN)
esxcfg-vmknic -l (Overview)
esxcli network ip interface list (Shows Portgroups, Switch, Port IDs etc.)
esxcli network ip interface ipv4 get (All ipv4 IPs associated incl. Gateway, DNS)
esxcli network ip interface ipv6 get (All IPv6 IPs associated incl. Gateway, DNS)
esxcli network ip route ipv4 list (List configured IPv4 routes)
esxcli network ip route ipv6 list (List configured IPv6 routes)
Network Cards (vmnics):
esxcli network nic list (Cards installed on Host)
esxcli network ip neighbor list (ARP Table)
VMware Switches:
esxcfg-vswitch -l (All Switches configured/connected on Host)
esxcli network vswitch dvs vmware list (Lists all Distributed Switches the Host is connected to)
esxcli network vswitch standard list (Lists all Standard Switches the Host has configured)
esxcli network vswitch dvs vmware lacp config get (Shows LACP Config on the connected Distributed Switch)
Unicast:
esxcli vsan cluster unicastagent list (Should list the Agents of all other Hosts & Witness in the Cluster)
----------------------------------------------------------------------------------
Network Cards (vmnics) Stats
The following commands will help in gathering stats about the Network interfaces installed on the Host.
Some of these commands show similar information but all will help in pinning down a potential network issue.
esxcli network nic stats get -n <vmnic interface>
[root@vsan01:~] esxcli network nic stats get -n vmnic0
NIC statistics for vmnic0
Packets received: 17173053
Packets sent: 0
Bytes received: 12492014211
Bytes sent: 0
Receive packets dropped: 0
Transmit packets dropped: 0
Multicast packets received: 0
Broadcast packets received: 0
Multicast packets sent: 0
Broadcast packets sent: 0
Total receive errors: 0
Receive length errors: 0
Receive over errors: 0
Receive CRC errors: 0
Receive frame errors: 0
Receive FIFO errors: 0
Receive missed errors: 0
Total transmit errors: 0
Transmit aborted errors: 0
Transmit carrier errors: 0
Transmit FIFO errors: 0
Transmit heartbeat errors: 0
Transmit window errors: 0
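When this output is saved from several Hosts, scanning it by eye for a non-zero error or drop counter gets tedious. The following is a minimal Python sketch, not a VMware tool, that assumes the "Label: value" layout shown above; the function names are hypothetical and the sample is a subset of the output above.

```python
def parse_nic_stats(text):
    """Parse 'Label: number' lines from `esxcli network nic stats get` output."""
    stats = {}
    for line in text.splitlines():
        label, sep, value = line.strip().rpartition(":")
        value = value.strip()
        # Skip lines without a colon or without a numeric value (e.g. the title line).
        if sep and value.isdigit():
            stats[label.strip()] = int(value)
    return stats


def nonzero_problem_counters(stats):
    """Return only the error/drop counters that are greater than zero."""
    return {
        name: count
        for name, count in stats.items()
        if ("error" in name.lower() or "dropped" in name.lower()) and count > 0
    }


# Subset of the vmnic0 output shown above.
sample = """NIC statistics for vmnic0
Packets received: 17173053
Packets sent: 0
Receive packets dropped: 0
Transmit packets dropped: 0
Total receive errors: 0
Receive CRC errors: 0
Total transmit errors: 0
"""

problems = nonzero_problem_counters(parse_nic_stats(sample))
print(problems or "no error or drop counters above zero")
```

An empty result only means the counters are clean at the time of collection. Taking the output twice and comparing the values shows whether a counter is still increasing.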
vsish -e get /net/pNics/vmnic#/stats
[root@vsan01:~] vsish -e get /net/pNics/vmnic0/stats
device {
-- General Statistics:
Rx Packets:17174928
Tx Packets:0
Rx Bytes:12493488653
Tx Bytes:0
Rx Errors:0
Tx Errors:0
Rx Dropped:0
Tx Dropped:0
Rx Multicast:0
Rx Broadcast:0
Tx Multicast:0
Tx Broadcast:0
Collisions:0
Rx Length Errors:0
Rx Over Errors:0
Rx CRC Errors:0
Rx Frame Errors:0
Rx Fifo Errors:0
Rx Missed Errors:0
Tx Aborted Errors:0
Tx Carrier Errors:0
Tx Fifo Errors:0
Tx Heartbeat Errors:0
Tx Window Errors:0
Module Interface Rx packets:17174928
Module Interface Tx packets:0
Module Interface Rx dropped:0
Module Interface Tx dropped:0
-- Driver Specific Statistics:
rx_packets : 17174928
tx_packets : 0
rx_bytes : 12562055220
tx_bytes : 0
rx_broadcast : 0
tx_broadcast : 0
rx_multicast : 0
tx_multicast : 0
rx_errors : 0
tx_errors : 0
tx_dropped : 0
multicast : 0
collisions : 0
rx_length_errors : 0
rx_over_errors : 0
rx_crc_errors : 0
rx_frame_errors : 0
rx_no_buffer_count : 0
rx_missed_errors : 0
tx_aborted_errors : 0
tx_carrier_errors : 0
tx_fifo_errors : 0
tx_heartbeat_errors : 0
tx_window_errors : 0
tx_abort_late_coll : 0
tx_deferred_ok : 0
tx_single_coll_ok : 0
tx_multi_coll_ok : 0
tx_timeout_count : 0
tx_restart_queue : 0
rx_long_length_errors : 0
rx_short_length_errors : 0
rx_align_errors : 0
tx_tcp_seg_good : 0
tx_tcp_seg_failed : 0
rx_flow_control_xon : 0
rx_flow_control_xoff : 0
tx_flow_control_xon : 0
tx_flow_control_xoff : 0
rx_long_byte_count : 12562055220
rx_csum_offload_good : 17129362
rx_csum_offload_errors : 0
alloc_rx_buff_failed : 0
tx_smbus : 0
rx_smbus : 0
dropped_smbus : 0
----------------------------------------------------------------------------------
ESXTOP
Once ESXTOP is running, press “n” to switch to the Network view.
This will show live stats of all the interfaces along with which vmnic the vSAN VMKernel port currently uses.
----------------------------------------------------------------------------------
MTU Check via vmkping
vmkping is a simple tool for basic testing of a connection and for verifying that packets of various sizes can be passed. This can help to identify whether MTU / Jumbo frames are working properly in the environment.
MTU sizes can vary, but the most common ones to test with are 1500 and 9000.
The size passed with -s is the ICMP payload only; the 20-byte IPv4 header and the 8-byte ICMP header are added on top of it. To test a 1500 or 9000 byte MTU without fragmentation, ping with a payload of 1472 or 8972.
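The 1472 and 8972 values come from subtracting the 20-byte IPv4 header (without options) and the 8-byte ICMP header from the MTU. The short Python sketch below works through that arithmetic and prints matching vmkping commands. The helper name is hypothetical, vmk1 is taken from the example in this document, and the -d option (set the don't-fragment bit) is added here as an assumption about how you want to test; it is not part of the original command.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def max_ping_payload(mtu):
    """Largest ICMP payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


for mtu in (1500, 9000):
    size = max_ping_payload(mtu)
    print(f"MTU {mtu}: vmkping -I vmk1 -d -s {size} <target vsan interface ip>")
```

With the don't-fragment bit set, a ping that is larger than the smallest MTU on the path fails instead of being silently fragmented, which makes an MTU mismatch visible.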
Testing VMkernel network connectivity with the vmkping command
vSAN Healthcheck -- vMotion: MTU check (ping with large packet size)
vmkping -I <vsan vmkernel interface> <target vsan interface ip> -s <packet size>
[root@vsan01:~] vmkping -I vmk1 192.x.x.x -s 1500
PING 192.x.x.x (192.x.x.x): 1500 data bytes
1508 bytes from 192.x.x.x: icmp_seq=1 ttl=64 time=0.844 ms
1508 bytes from 192.x.x.x: icmp_seq=2 ttl=64 time=0.877 ms
--- 192.x.x.x ping statistics ---
3 packets transmitted, 2 packets received, 33% packet loss
round-trip min/avg/max = 0.844/0.860/0.877 ms
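When vmkping results are collected from many Hosts, the summary line is the quickest thing to check. The Python sketch below extracts the transmitted, received and loss figures from that line. It assumes the summary format shown in the output above; the function name and the message text are hypothetical.

```python
import re

SUMMARY_RE = re.compile(
    r"(\d+) packets transmitted, (\d+) packets received, (\d+(?:\.\d+)?)% packet loss"
)


def parse_ping_summary(text):
    """Return (transmitted, received, loss_percent), or None if no summary line is found."""
    match = SUMMARY_RE.search(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), float(match.group(3))


# Summary lines from the vmkping example above.
sample = """--- 192.x.x.x ping statistics ---
3 packets transmitted, 2 packets received, 33% packet loss
round-trip min/avg/max = 0.844/0.860/0.877 ms
"""

sent, received, loss = parse_ping_summary(sample)
if loss > 0:
    print(f"{sent - received} of {sent} pings lost ({loss:g}%)")
```

Any loss on a vSAN VMkernel path is worth following up, for example by repeating the test with a smaller packet size to see whether the loss depends on packet size.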
----------------------------------------------------------------------------------
Verifying the Drivers and Firmware for a Host is critical to making sure the configuration is supported and limiting the number of potential issues we encounter. The following links will be helpful in checking these:
Determining Network/Storage firmware and driver version
Check the Broadcom Compatibility Guide
----------------------------------------------------------------------------------
The unicast agent list should contain the agents of all other Hosts & Witness in the Cluster.
Configuring vSAN Unicast networking from the command line
[root@vsan01:~] esxcli vsan cluster unicastagent list
NodeUuid                              IsWitness  Supports Unicast  IP Address  Port   Iface Name  Cert Thumbprint                                              SubClusterUuid
------------------------------------  ---------  ----------------  ----------  -----  ----------  -----------------------------------------------------------  --------------
630e1560-fa89-afc4-fdb2-########      0          true              10.x.x.4    12321              35:C5:72:93:19:68:BB:5C:FF:03:CF:80:61:A7:06:EC:AE:12:4B:EF  a21d567f-e835-4177-bd77-########
630e1559-b1c4-61e4-31d3-########      0          true              10.x.x.6    12321              3B:1C:C4:47:0B:88:E4:58:B1:1A:2B:BE:85:F7:79:71:19:92:A9:15  a21d567f-e835-4177-bd77-########
630e1562-c676-99b6-891b-########      0          true              10.x.x.5    12321              32:AB:8C:C4:0C:A8:E4:08:F9:CC:A3:60:32:16:65:9D:B8:93:D6:A0  a21d567f-e835-4177-bd77-########
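Because every Host should list all other Hosts and the Witness, a missing peer is the thing to look for in this output. The Python sketch below reads the first five columns of each row and reports expected peer IPs that have no entry. It assumes the column order shown above; the function names are hypothetical, and 10.x.x.7 is a made-up peer used only to show a missing entry.

```python
def parse_unicast_agents(text):
    """Extract (node_uuid, is_witness, ip, port) from unicastagent list rows."""
    agents = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with: NodeUuid, IsWitness (0/1), Supports Unicast, IP Address, Port.
        # The header and the dashed separator line do not match and are skipped.
        if len(fields) >= 5 and fields[1] in ("0", "1") and fields[4].isdigit():
            agents.append((fields[0], fields[1] == "1", fields[3], int(fields[4])))
    return agents


def missing_peers(agents, expected_peer_ips):
    """Return the expected peer IPs that have no entry in the agent list."""
    listed = {ip for _, _, ip, _ in agents}
    return sorted(set(expected_peer_ips) - listed)


# Rows from the output above, trimmed to the first five columns.
sample = """630e1560-fa89-afc4-fdb2-######## 0 true 10.x.x.4 12321
630e1559-b1c4-61e4-31d3-######## 0 true 10.x.x.6 12321
630e1562-c676-99b6-891b-######## 0 true 10.x.x.5 12321
"""

agents = parse_unicast_agents(sample)
# Hypothetical list of peers this Host is expected to know about.
expected = ["10.x.x.4", "10.x.x.5", "10.x.x.6", "10.x.x.7"]
print("Missing peers:", missing_peers(agents, expected))
```

The expected list has to come from your own record of the Cluster's vSAN VMkernel IPs; the Host's own IP should not be in it, since a Host does not list itself.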
With 6.5d or later, it was observed that vCenter automatically removes some of the Unicast entries.
This often caused outages due to the resulting Cluster partitions. (This mostly happened during Cluster upgrades and/or when the vCenter Build was lower than the vSAN Build.)
To make the Host ignore these member list updates from vCenter, set the following advanced option:
esxcfg-advcfg -s 1 /VSAN/IgnoreClusterMemberListUpdates
To verify the setting:
esxcfg-advcfg -g /VSAN/IgnoreClusterMemberListUpdates
Using nc command to test Network connectivity
Testing the vmkernel network performance using the nc command
tcpdump & pktcap
Tcpdump and pktcap are packet trace tools that can help in gathering more details for further analysis.
Usually this is reserved for the Network Team; however, gathering a capture before engaging them for additional assistance can provide useful information and help to speed up resolution.
Using the pktcap-uw tool in ESXi 5.5 and later
Testing VSAN Network Performance with iPerf
vSAN Network Ports and Protocols
Bandwidth and Latency Requirements