This document is an outline of general Network troubleshooting tasks, with specific focus on a vSAN environment.
This does not focus on specific network issues but can provide a starting point when troubleshooting.
If you are experiencing network-related issues, you may see any of the following symptoms:
Host(s) are showing as network partitioned (vSAN Skyline Health is reporting a Cluster Partition)
Skyline Health is showing alerts in the section "Network"
Host(s) will not join the vSAN cluster.
Experiencing packet drops, timeouts, or other network-related errors.
ESXi log messages (vmkernel, vobd) may point to communication issues. Examples: timeouts, excessive messaging from the network card(s), heartbeat timeouts, etc.
VMware vSAN (All Versions)
The following steps are not listed in any specific order, but each set of options will help build your data set for troubleshooting.
If you are unable to identify a specific issue, please reach out to the VMware Network team for additional assistance.
Having the data collected from many of the following commands can be very useful in decreasing the time to resolution.
To understand the network layout of the cluster, either take a look at the Web Client or run the following commands from an ESXi host console or SSH session.
ESXCLI Command Reference - esxcli network
Cluster Membership Status:
esxcli vsan cluster get (Verify that all Hosts have joined the Cluster)
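For reference, a healthy cluster returns output similar to the following sketch (fields vary by ESXi release; the UUIDs and timestamp below are placeholders):
[root@vsan01:~] esxcli vsan cluster get
Cluster Information
   Enabled: true
   Current Local Time: 2023-01-01T00:00:00Z
   Local Node UUID: 630e1560-fa89-afc4-fdb2-########
   Local Node Type: NORMAL
   Local Node State: MASTER
   Local Node Health State: HEALTHY
   Sub-Cluster Master UUID: 630e1560-fa89-afc4-fdb2-########
   Sub-Cluster Backup UUID: 630e1559-b1c4-61e4-31d3-########
   Sub-Cluster UUID: a21d567f-e835-4177-bd77-########
   Sub-Cluster Member Count: 3
   Sub-Cluster Member UUIDs: 630e1560-fa89-afc4-fdb2-########, 630e1559-b1c4-61e4-31d3-########, 630e1562-c676-99b6-891b-########
All Hosts should report the same Sub-Cluster UUID and the full Member Count; a partitioned Host typically reports itself as MASTER of a smaller sub-cluster.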
Configured VMkernel Ports on Host:
esxcli vsan network list (VMkernel Port for vSAN; see the example output after this list)
esxcfg-vmknic -l (Overview)
esxcli network ip interface list (Shows Portgroups, Switch, Port IDs etc.)
esxcli network ip interface ipv4 get (All IPv4 IPs associated incl. Gateway, DNS)
esxcli network ip interface ipv6 get (All IPv6 IPs associated incl. Gateway, DNS)
esxcli network ip route ipv4 list (List configured IPv4 routes)
esxcli network ip route ipv6 list (List configured IPv6 routes)
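For the esxcli vsan network list command above, output resembles the following sketch (fields vary by release; the interface UUID is elided):
[root@vsan01:~] esxcli vsan network list
Interface
   VmkNic Name: vmk1
   IP Protocol: IP
   Interface UUID: ...
   Agent Group Multicast Address: 224.2.3.4
   Agent Group Multicast Port: 23451
   Master Group Multicast Address: 224.1.2.3
   Master Group Multicast Port: 12345
   Multicast TTL: 5
   Traffic Type: vsan
Confirm the VmkNic Name is the VMkernel port expected to carry vSAN traffic.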
Network Cards (vmnics):
esxcli network nic list (Cards installed on Host)
esxcli network ip neighbor list (ARP Table)
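A sketch of the neighbor listing (addresses are placeholders; columns vary by release):
[root@vsan01:~] esxcli network ip neighbor list
Neighbor  Mac Address        Vmknic  Expiry    State  Type
--------  -----------------  ------  --------  -----  -------
10.x.x.1  00:50:56:xx:xx:xx  vmk0    1175 sec         Unknown
10.x.x.4  00:50:56:xx:xx:xx  vmk1    1189 sec         Unknown
Missing or constantly changing entries for vSAN peers can indicate a layer-2 reachability problem.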
VMware Switches:
esxcfg-vswitch -l (All Switches configured/connected on Host)
esxcli network vswitch dvs vmware list (Lists all Distributed Switches the Host is connected to)
esxcli network vswitch standard list (Lists all Standard Switches the Host has configured)
esxcli network vswitch dvs vmware lacp config get (Shows LACP Config on the connected Distributed Switch)
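Output from the Distributed Switch listing is particularly useful for confirming MTU and uplink assignment; a trimmed sketch (names and values are placeholders):
[root@vsan01:~] esxcli network vswitch dvs vmware list
DvsPortset-0
   Name: dvSwitch
   ...
   MTU: 9000
   CDP Status: listen
   Uplinks: vmnic1, vmnic0
   ...
Verify the MTU here matches what is configured on the physical switch ports and on the vSAN VMkernel port.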
Unicast:
esxcli vsan cluster unicastagent list (Should list the agents of all other Hosts & Witness in the Cluster)
----------------------------------------------------------------------------------
Network Cards (vmnics) Stats
The following commands will help in gathering stats about the Network interfaces installed on the Host.
Some of these commands show similar information but all will help in pinning down a potential network issue.
esxcli network nic stats get -n <vmnic interface>
[root@vsan01:~] esxcli network nic stats get -n vmnic0
NIC statistics for vmnic0
   Packets received: 17173053
   Packets sent: 0
   Bytes received: 12492014211
   Bytes sent: 0
   Receive packets dropped: 0
   Transmit packets dropped: 0
   Multicast packets received: 0
   Broadcast packets received: 0
   Multicast packets sent: 0
   Broadcast packets sent: 0
   Total receive errors: 0
   Receive length errors: 0
   Receive over errors: 0
   Receive CRC errors: 0
   Receive frame errors: 0
   Receive FIFO errors: 0
   Receive missed errors: 0
   Total transmit errors: 0
   Transmit aborted errors: 0
   Transmit carrier errors: 0
   Transmit FIFO errors: 0
   Transmit heartbeat errors: 0
   Transmit window errors: 0
vsish -e get /net/pNics/vmnic#/stats
[root@vsan01:~] vsish -e get /net/pNics/vmnic0/stats
device {
   -- General Statistics:
   Rx Packets: 17174928
   Tx Packets: 0
   Rx Bytes: 12493488653
   Tx Bytes: 0
   Rx Errors: 0
   Tx Errors: 0
   Rx Dropped: 0
   Tx Dropped: 0
   Rx Multicast: 0
   Rx Broadcast: 0
   Tx Multicast: 0
   Tx Broadcast: 0
   Collisions: 0
   Rx Length Errors: 0
   Rx Over Errors: 0
   Rx CRC Errors: 0
   Rx Frame Errors: 0
   Rx Fifo Errors: 0
   Rx Missed Errors: 0
   Tx Aborted Errors: 0
   Tx Carrier Errors: 0
   Tx Fifo Errors: 0
   Tx Heartbeat Errors: 0
   Tx Window Errors: 0
   Module Interface Rx packets: 17174928
   Module Interface Tx packets: 0
   Module Interface Rx dropped: 0
   Module Interface Tx dropped: 0
   -- Driver Specific Statistics:
   rx_packets : 17174928
   tx_packets : 0
   rx_bytes : 12562055220
   tx_bytes : 0
   rx_broadcast : 0
   tx_broadcast : 0
   rx_multicast : 0
   tx_multicast : 0
   rx_errors : 0
   tx_errors : 0
   tx_dropped : 0
   multicast : 0
   collisions : 0
   rx_length_errors : 0
   rx_over_errors : 0
   rx_crc_errors : 0
   rx_frame_errors : 0
   rx_no_buffer_count : 0
   rx_missed_errors : 0
   tx_aborted_errors : 0
   tx_carrier_errors : 0
   tx_fifo_errors : 0
   tx_heartbeat_errors : 0
   tx_window_errors : 0
   tx_abort_late_coll : 0
   tx_deferred_ok : 0
   tx_single_coll_ok : 0
   tx_multi_coll_ok : 0
   tx_timeout_count : 0
   tx_restart_queue : 0
   rx_long_length_errors : 0
   rx_short_length_errors : 0
   rx_align_errors : 0
   tx_tcp_seg_good : 0
   tx_tcp_seg_failed : 0
   rx_flow_control_xon : 0
   rx_flow_control_xoff : 0
   tx_flow_control_xon : 0
   tx_flow_control_xoff : 0
   rx_long_byte_count : 12562055220
   rx_csum_offload_good : 17129362
   rx_csum_offload_errors : 0
   alloc_rx_buff_failed : 0
   tx_smbus : 0
   rx_smbus : 0
   dropped_smbus : 0
}
----------------------------------------------------------------------------------
ESXTOP
Once ESXTOP is running, press "n" to switch to the network view.
This shows live stats for all interfaces, along with which vmnic the vSAN VMkernel port currently uses (see the column sketch below).
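Columns worth watching in this view include the following (a representative header; exact columns vary by release):
PORT-ID  USED-BY  TEAM-PNIC  DNAME  PKTTX/s  MbTX/s  PKTRX/s  MbRX/s  %DRPTX  %DRPRX
Non-zero %DRPTX or %DRPRX against the vSAN VMkernel port or its uplink vmnic points to drops worth investigating.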
----------------------------------------------------------------------------------
MTU Check via vmkping
vmkping is a simple tool for basic connection testing and for verifying that packets of various sizes can be passed. This can help identify whether MTU / jumbo frames are working properly in the environment.
MTU sizes can vary, but the most common ones to test with are 1500 and 9000.
IP and ICMP headers add 28 bytes of overhead, so pinging with a slightly smaller packet size (1472 for a 1500 MTU, 8972 for a 9000 MTU) helps avoid false ping failures.
Testing VMkernel network connectivity with the vmkping command
vSAN Healthcheck -- vMotion: MTU check (ping with large packet size)
vmkping -I <vsan vmkernel interface> <target vsan interface IP> -s <packet size>
[root@vsan01:~] vmkping -I vmk1 192.x.x.x -s 1500
PING 192.x.x.x (192.x.x.x): 1500 data bytes
1508 bytes from 192.x.x.x: icmp_seq=1 ttl=64 time=0.844 ms
1508 bytes from 192.x.x.x: icmp_seq=2 ttl=64 time=0.877 ms

--- 192.x.x.x ping statistics ---
3 packets transmitted, 2 packets received, 33% packet loss
round-trip min/avg/max = 0.844/0.860/0.877 ms
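When validating jumbo frames, also set the don't-fragment flag so oversized packets fail outright instead of being silently fragmented; a sketch (vmk1 and the target IP are placeholders):
[root@vsan01:~] vmkping -I vmk1 -d -s 8972 192.x.x.x
If this fails while a standard-size ping succeeds, a device in the path is likely not configured for the larger MTU.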
----------------------------------------------------------------------------------
Verifying a Host's drivers and firmware is critical to making sure the configuration is supported and to limiting the number of potential issues encountered. The following links will be helpful in checking these:
Determining Network/Storage firmware and driver version
Check the Broadcom Compatibility Guide
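As a quick check from the Host itself, the NIC driver and firmware versions can be read per vmnic; a trimmed sketch (the driver name and versions below are placeholders):
[root@vsan01:~] esxcli network nic get -n vmnic0
   ...
   Driver Info:
         Driver: ixgbe
         Firmware Version: 0x80000835
         Version: 4.5.3
   ...
Compare these values against the card's entry in the Broadcom Compatibility Guide.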
----------------------------------------------------------------------------------
Unicast
esxcli vsan cluster unicastagent list (Should list the agents of all other Hosts & Witness in the Cluster)
Configuring vSAN Unicast networking from the command line
[root@vsan01:~] esxcli vsan cluster unicastagent list
NodeUuid                          IsWitness  Supports Unicast  IP Address  Port   Iface Name  Cert Thumbprint                                              SubClusterUuid
--------------------------------  ---------  ----------------  ----------  -----  ----------  -----------------------------------------------------------  --------------------------------
630e1560-fa89-afc4-fdb2-########          0  true              10.x.x.4    12321              35:C5:72:93:19:68:BB:5C:FF:03:CF:80:61:A7:06:EC:AE:12:4B:EF  a21d567f-e835-4177-bd77-########
630e1559-b1c4-61e4-31d3-########          0  true              10.x.x.6    12321              3B:1C:C4:47:0B:88:E4:58:B1:1A:2B:BE:85:F7:79:71:19:92:A9:15  a21d567f-e835-4177-bd77-########
630e1562-c676-99b6-891b-########          0  true              10.x.x.5    12321              32:AB:8C:C4:0C:A8:E4:08:F9:CC:A3:60:32:16:65:9D:B8:93:D6:A0  a21d567f-e835-4177-bd77-########
With 6.5d or later, it has been observed that vCenter automatically removes some unicast agent entries.
This often caused outages due to the resulting cluster partitions (mostly during cluster upgrades and/or when the vCenter build is lower than the vSAN build). To prevent a Host from accepting these cluster member list updates from vCenter, set the following advanced option on each Host:
esxcfg-advcfg -s 1 /VSAN/IgnoreClusterMemberListUpdates
To verify the setting:
esxcfg-advcfg -g /VSAN/IgnoreClusterMemberListUpdates
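The command echoes the current value back, for example:
[root@vsan01:~] esxcfg-advcfg -g /VSAN/IgnoreClusterMemberListUpdates
Value of IgnoreClusterMemberListUpdates is 1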
Using nc command to test Network connectivity
Testing the vmkernel network performance using the nc command
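For the connectivity test referenced above, a minimal sketch probing the vSAN RDT port, TCP 2233 (the target IP is a placeholder; see the ports reference at the end of this article for the full list):
[root@vsan01:~] nc -z 10.x.x.4 2233
A failure here while vmkping succeeds can point to a firewall or ACL blocking vSAN ports.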
tcpdump & pktcap
On ESXi these are available as tcpdump-uw and pktcap-uw, packet capture tools that can help in gathering more detail for further analysis.
This is usually reserved for the Network team; however, gathering a capture before engaging them for additional assistance can provide useful information and help speed up resolution.
Using the pktcap-uw tool in ESXi 5.5 and later
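A minimal capture sketch for the vSAN VMkernel port (vmk1 and the output path are placeholders; stop the capture with Ctrl+C):
[root@vsan01:~] pktcap-uw --vmk vmk1 -o /tmp/vmk1_capture.pcap
The resulting .pcap file can be copied off the Host and opened in Wireshark or handed to the Network team.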
Testing VSAN Network Performance with iPerf
vSAN Network Ports and Protocols
Bandwidth and Latency Requirements
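For the iPerf test referenced above, a minimal sketch, assuming the iperf3 binary shipped with recent ESXi builds under /usr/lib/vmware/vsan/bin/ (IPs are placeholders; the ESXi firewall may need to be opened temporarily as described in the linked article):
Receiving Host: /usr/lib/vmware/vsan/bin/iperf3 -s -B 10.x.x.5
Sending Host:   /usr/lib/vmware/vsan/bin/iperf3 -c 10.x.x.5
Compare the measured throughput against the bandwidth and latency requirements referenced above.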