Multiple objects are in an inaccessible state due to network issues on vSAN cluster

Article ID: 390937


Updated On:

Products

VMware vSAN

Issue/Introduction

Symptoms:

- vSAN objects intermittently go into an inaccessible state.

- Example output from an ESXi host that is part of the vSAN cluster shows 5 objects as inaccessible.

[root@V2:~] esxcli vsan debug object health summary get
Health Status                                              Number Of Objects
---------------------------------------------------------  -----------------
remoteAccessible                                                           0
inaccessible                                                               5
reduced-availability-with-no-rebuild                                       0

- When checked again after some time on the same host, none of the objects report as inaccessible.

[root@v2:~] esxcli vsan debug object health summary get
Health Status                                              Number Of Objects
---------------------------------------------------------  -----------------
remoteAccessible                                                           0
inaccessible                                                               0
reduced-availability-with-no-rebuild                                       0
reduced-availability-with-no-rebuild-delay-timer                         303
reducedavailabilitywithpolicypending                                       0
reducedavailabilitywithpolicypendingfailed                                 0
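
If the inaccessible state is caught while it is occurring, the affected objects can be listed for further review. One possible way to narrow the output (the grep context values are only a convenience and may need adjusting):

[root@V2:~] esxcli vsan debug object list | grep -B 3 -A 6 "inaccessible"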

When reviewing /var/run/log/vsansystem.log, the node count fluctuates due to ongoing network issues.

2025-03-13T00:26:18.716Z info vsansystem[2099681] [vSAN@6876 sub=VsanSystemProvider opId=CMMDSMembershipUpdate-b825] Complete, nodeCount: 4, runtime info: (vim.vsan.host.VsanRuntimeInfo) {
2025-03-13T00:26:22.644Z info vsansystem[2099772] [vSAN@6876 sub=VsanSystemProvider opId=CMMDSMembershipUpdate-b859] Complete, nodeCount: 3, runtime info: (vim.vsan.host.VsanRuntimeInfo) {
2025-03-13T00:26:26.293Z info vsansystem[2099778] [vSAN@6876 sub=VsanSystemProvider opId=CMMDSMembershipUpdate-b8a2] Complete, nodeCount: 4, runtime info: (vim.vsan.host.VsanRuntimeInfo) {
2025-03-13T00:26:26.700Z info vsansystem[2099791] [vSAN@6876 sub=VsanSystemProvider opId=CMMDSNodeUpdate-b8a3] Complete, nodeCount: 4, runtime info: (vim.vsan.host.VsanRuntimeInfo) {
2025-03-13T00:26:37.854Z info vsansystem[2099779] [vSAN@6876 sub=VsanSystemProvider opId=CMMDSMembershipUpdate-bc4a] Complete, nodeCount: 5, runtime info: (vim.vsan.host.VsanRuntimeInfo) {
2025-03-13T00:26:38.102Z info vsansystem[2099781] [vSAN@6876 sub=VsanSystemProvider opId=CMMDSMembershipUpdate-bc69] Complete, nodeCount: 6, runtime info: (vim.vsan.host.VsanRuntimeInfo) {
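
The current CMMDS membership can be cross-checked from any host in the cluster; in a stable cluster the "Sub-Cluster Member Count" should stay constant and match the number of hosts:

[root@V2:~] esxcli vsan cluster get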

In /var/log/clomd.log, objects are reported as inaccessible because the vSAN hosts lost access to multiple components on different hosts due to unstable network connectivity.

clomd[2099719]: [Originator@6876] CLOM_ProcessObject: Object xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx is inaccessible, Skipping compliance verification @CSN 1881, SCSN 1883. ConfigState 13
clomd[2099719]: [Originator@6876] CLOM_ProcessObject: Object xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx is inaccessible, Skipping compliance verification @CSN 1938, SCSN 1940. ConfigState 13
clomd[2099719]: [Originator@6876] CLOM_ProcessObject: Object xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx is inaccessible, Skipping compliance verification @CSN 2674, SCSN 2792. ConfigState 13

In /var/log/vmkernel.log, ESXi hosts lose connectivity to the objects around the time the inaccessible objects are reported.

vmkernel: cpu59:2099510)DOM: DOMOwner_SetLivenessState:11608: Object xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx lost liveness [0x45bbdbbe9380]
vmkernel: cpu8:2099488)DOM: DOMOwner_SetLivenessState:11608: Object xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx lost liveness [0x45bbdb882800]
vmkernel: cpu64:2099528)DOM: DOMOwner_SetLivenessState:11608: Object xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx lost liveness [0x45db7ed55e80]
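
The clomd and vmkernel entries above can be correlated by timestamp. One way to pull the relevant lines (log locations and rotation may differ per environment):

[root@V2:~] grep "is inaccessible" /var/log/clomd.log
[root@V2:~] grep "lost liveness" /var/log/vmkernel.log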

Environment

VMware vSAN 9.x

VMware vSAN 8.x

Cause

The issue is seen when a vmnic on an ESXi host experiences errors such as CRC errors, receive missed (RxMissed) errors, and other receive errors.

When a ping test is run between hosts on the vSAN network, packet loss shows up.

Example: pinging 192.168.x.x

1480 bytes from 192.168.x.x: icmp_seq=993 ttl=64 time=0.131 ms
1480 bytes from 192.168.x.x: icmp_seq=996 ttl=64 time=0.109 ms
1480 bytes from 192.168.x.x: icmp_seq=997 ttl=64 time=0.121 ms

--- 192.168.x.x ping statistics ---
1000 packets transmitted, 646 packets received, 35.4% packet loss
round-trip min/avg/max = 0.089/0.202/3.215 ms
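
A test like the one above can be run with vmkping over the vSAN vmkernel interface. A sample invocation (the vmkernel interface name, packet count, and target IP are placeholders; -s 1472 suits a 1500 MTU, while -s 8972 would be used for a 9000 MTU):

[root@v1:~] vmkping -I vmk1 -d -s 1472 -c 1000 192.168.x.x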

Network stats:

[root@v1:/vmfs/volumes/e6e9c139-########] esxcli network nic stats get -n vmnic0
NIC statistics for vmnic0
   Packets received: 11615094673
   Packets sent: 2705982369
   Bytes received: 15157097757492
   Bytes sent: 4784107499235
   Receive packets dropped: 0
   Transmit packets dropped: 0
   Multicast packets received: 10247164
   Broadcast packets received: 29481196
   Multicast packets sent: 105222
   Broadcast packets sent: 12874
   Total receive errors: 765
   Receive length errors: 2
   Receive over errors: 0
   Receive CRC errors: 763
   Receive frame errors: 0
   Receive FIFO errors: 0
   Receive missed errors: 22506
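
Error counters that keep increasing over time indicate an ongoing problem rather than a historical one. A simple way to sample the counters repeatedly from the ESXi shell (the interval, iteration count, and NIC name are only examples):

[root@v1:~] for i in 1 2 3 4 5; do
    date
    esxcli network nic stats get -n vmnic0 | grep -E "errors|missed"
    sleep 60
done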

Link down events:

Link down events show up in /var/run/log/vmkernel.log on the ESXi host.
 
2025-03-13T01:47:24.137Z: [netCorrelator] 3148104485784us: [vob.net.vmnic.linkstate.down] vmnic vmnic1 linkstate down
2025-03-13T01:47:25.002Z: [netCorrelator] 3148103027015us: [esx.problem.net.vmnic.linkstate.down] Physical NIC vmnic1 linkstate is down
2025-03-13T01:54:44.194Z: [netCorrelator] 3148544542160us: [vob.net.vmnic.linkstate.down] vmnic vmnic1 linkstate down
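
These events can be searched for directly (vobd.log may also record the same link state events):

[root@v1:~] grep -i "linkstate" /var/run/log/vmkernel.log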

Resolution

If network issues are observed on a vSAN cluster, the physical side of the infrastructure needs to be checked, especially when CRC errors occur. CRC errors indicate network data corruption, typically caused by a physical layer (Layer 1) issue such as a faulty NIC, cable, port, or the physical switch itself. This needs to be investigated by the network team, as CRC errors are not caused at the software layer (vSphere).
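
As part of that investigation, the NIC driver, firmware, and link details can be collected on the affected host and checked against the hardware vendor and Broadcom compatibility guidance (the vmnic name below is an example):

[root@v1:~] esxcli network nic get -n vmnic1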