A vSAN network outage occurred, which may have caused an outage on the affected vSAN cluster (a vSAN cluster partition).
You may have seen the related vSAN cluster partition alert on the cluster's Summary page in the vSphere Client.
vSAN has since recovered (the vSAN health check shows no relevant issues), but several Linux VMs report I/O errors on their root disk.
Basic commands (uptime, date, df -h) fail due to root filesystem unavailability.
vmware.log: No heartbeat timeouts or storage device errors, only a minor log discard message:
YYYY-MM-DDTHH:MM:SS.SSSZ No (00) svga - >>> Error writing log, 110 bytes discarded. Disk full?
vmkernel.log: Events align with a vSAN network outage, including:
High latency warnings
Node removals from cluster membership
Leader election and failover
YYYY-MM-DDTHH:MM:SS.SSSZ In(182) vmkernel: cpu13:2100033) CMMDS: LeaderUpdateMeanRTLatency: 12423: Throttled: #-#-#-#-#: High RT latency. Node #-#-#-#-#, RT latency 958 (ms). Mean RT latency 122 (ms)
YYYY-MM-DDTHH:MM:SS.SSSZ In(182) vmkernel: cpu3:2100033) CMMDS: CMMDSCompleteLocalUpdate: 3865:#-#-#-#-#: Number of slow updates in last interval is 1 maxLatency 313 millisecs slowest #-#-#-#-#
YYYY-MM-DDTHH:MM:SS.SSSZ In(182) vmkernel: cpu51:2100033) CMMDS: CMMDSCompleteLocalUpdate: 3865: 521609c9-####-####-####-0b840dldf835: Number of slow updates in last interval is 1 maxLatency 689 millisecs slowest UUID #-#-#-#-#
YYYY-MM-DDTHH:MM:SS.SSSZ In(182) vmkernel: cpu51:2100033) CMMDS: LeaderSendHeartbeat : 2635: #-#-#-#-#: Backup unresponsive
YYYY-MM-DDTHH:MM:SS.SSSZ In(182) vmkernel: cpu51:2100033) CMMDS: CMMDSStateDestroyNode : 708: #-#-#-#-#: Destroying node #-#-#-#-#: Backup is too far behind
YYYY-MM-DDTHH:MM:SS.SSSZ In(182) vmkernel: cpu51:2100033) CMMDS: LeaderLostBackup: 545: #-#-#-#-#: Leader Failover: MUUID #-#-#-#-# old #-#-#-#-#
YYYY-MM-DDTHH:MM:SS.SSSZ In(182) vmkernel: cpu51:2100033) CMMDS: LeaderRemoveNodeFromMembership: 8592: #-#-#-#-#: Removing node #-#-#-#-# (vsanNodeType: data) from the cluster membership
YYYY-MM-DDTHH:MM:SS.SSSZ In(182) vmkernel: cpu51:2100033) CMMDS: CMMDSClusterDestroyNodeImpl: 262: Destroying node #-#-#-#-# from the cluster db. Last HB received from node - #-#-#-#-#
YYYY-MM-DDTHH:MM:SS.SSSZ Wa(180) vmkwarning: cpu51:2100033) WARNING: RDT: RDTEndQueuedMessages: 1410: assoc 0x4322c6ac1680 message 9901371 failure
#-#-#-#-#
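The CMMDS events listed above can be tallied from a copy of vmkernel.log to confirm that the host saw the outage. The following is a minimal Python sketch, not a VMware tool: the function name `summarize_cmmds_events` and the label names are hypothetical, and the patterns are based only on the message text in the excerpts above, so other ESXi builds may word these messages differently.

```python
import re

# Patterns derived from the vmkernel.log excerpts in this article (assumption:
# the message wording is the same on the host being checked).
CMMDS_PATTERNS = {
    "high_latency": re.compile(r"LeaderUpdateMeanRTLatency.*High RT latency"),
    "slow_update": re.compile(r"CMMDSCompleteLocalUpdate.*slow updates"),
    "leader_failover": re.compile(r"LeaderLostBackup.*Leader Failover"),
    "node_removed": re.compile(r"LeaderRemoveNodeFromMembership.*Removing node"),
}


def summarize_cmmds_events(lines):
    """Count how many log lines match each CMMDS event pattern."""
    counts = {label: 0 for label in CMMDS_PATTERNS}
    for line in lines:
        for label, pattern in CMMDS_PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
    return counts
```

For example, `summarize_cmmds_events(open("vmkernel.log", errors="replace"))` returns one count per label. Non-zero counts for node removals and leader failover in the outage window are consistent with a cluster partition.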
VMware ESXi 8.x
VMware ESXi 7.x
VMware vSAN 8.x
VMware vSAN 7.x
During the vSAN network outage, backend storage became temporarily inaccessible. The Linux VMs attempted read/write operations to their root disk (/dev/sda), but commands did not complete within the configured SCSI timeout (1080s / 1880s).
XFS journal writes failed with log I/O error -5.
XFS forced a filesystem shutdown to protect data integrity.
Since /dev/sda contained the root filesystem, all VM operations became unresponsive.
[465620.448887] I/O error, dev sdb, sector 9473560 op 0x0: (READ) flags 0x0 phys_seg 1 prio class 0
[465621.023491] sd 0:0:0:0: [sda] tag#1016 timing out command, waited 1880s
[465621.023969] I/O error, dev sda, sector 229747980 op 0x1: (WRITE) flags 0x9800 phys_seg 1 prio class 0
[465621.024388] XFS (dm-7): log I/O error -5
[465621.024020] XFS (dm-7): Filesystem has been shut down due to log error (0x2).
[465621.025288] XFS (dm-7): Please unmount the filesystem and rectify the problem(s).
[466700.443897] sd 0:0:1:0: [sdb] tag#1021 timing out command, waited 1080s
[466700.443578] I/O error, dev sdb, sector 9469840 op 0x0: (READ) flags 0x phys_seg 1 prio class 0
[522497.539017] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kworker/#:##:####]
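To check whether a guest shows the same signature, kernel messages captured from the VM console can be scanned for SCSI command timeouts and XFS shutdowns. This is a minimal Python sketch under stated assumptions: `scan_guest_kernel_log` is a hypothetical helper name, and the two patterns match only the message wording shown in the guest log above.

```python
import re

# Matches e.g. "[sda] tag#1016 timing out command, waited 1880s"
SCSI_TIMEOUT = re.compile(r"\[(sd[a-z]+)\] .*timing out command, waited (\d+)s")
# Matches e.g. "XFS (dm-7): Filesystem has been shut down"
XFS_SHUTDOWN = re.compile(r"XFS \(([^)]+)\): Filesystem has been shut down")


def scan_guest_kernel_log(text):
    """Return SCSI timeouts as (device, seconds) and XFS devices that shut down."""
    timeouts = [(m.group(1), int(m.group(2))) for m in SCSI_TIMEOUT.finditer(text)]
    shutdowns = [m.group(1) for m in XFS_SHUTDOWN.finditer(text)]
    return {"scsi_timeouts": timeouts, "xfs_shutdowns": shutdowns}
```

A result that contains both a timeout on the root disk and an XFS shutdown matches the failure sequence described in this article.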
Confirm vSAN network and storage connectivity are stable.
Reboot affected VMs.
On restart, the XFS filesystem remounts cleanly.
VM functionality is restored.
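After the reboot, one way to confirm that the root filesystem came back is to read its mount options from /proc/mounts inside the guest. The sketch below is illustrative only: the helper names are hypothetical, and it is meant as a post-reboot check, since a filesystem that XFS has shut down can still be listed as `rw`.

```python
def root_mount_options(mounts_text):
    """Return the mount options of '/' from /proc/mounts-style text, or None."""
    options = None
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 4 and fields[1] == "/":
            options = fields[3].split(",")  # a later entry for "/" wins
    return options


def root_is_read_write(mounts_text):
    """True if '/' is present and mounted with the 'rw' option."""
    options = root_mount_options(mounts_text)
    return options is not None and "rw" in options
```

Inside the guest this could be called as `root_is_read_write(open("/proc/mounts").read())`.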