Linux VMs Report I/O Errors and Filesystem Shutdown After vSAN Network Outage

Products

VMware vSAN

Issue/Introduction

Following a vSAN network outage, several Linux VMs reported I/O errors on their root disk.

vSAN health: Green post-recovery.

Guest OS behavior

Basic commands (uptime, date, df -h) fail due to root filesystem unavailability.

VMware observations

vmware.log: No heartbeat timeout or storage device errors. Only minor log discard:

YYYY-MM-DDTHH:MM:SS.SSSZ No(00) svga - >>> Error writing log, 110 bytes discarded. Disk full?

vmkernel.log: Events align with a vSAN network outage, including:
- High latency warnings
- Node removals from cluster membership
- Leader election and failover

YYYY-MM-DDTHH:MM:SS.SSSZ In(182) vmkernel: cpu13:2100033) CMMDS: LeaderUpdateMeanRTLatency: 12423: Throttled: #-#-#-#-#: High RT latency. Node #-#-#-#-#, RT latency 958 (ms). Mean RT latency 122 (ms)

YYYY-MM-DDTHH:MM:SS.SSSZ In(182) vmkernel: cpu3:2100033) CMMDS: CMMDSCompleteLocalUpdate: 3865: #-#-#-#-#: Number of slow updates in last interval is 1 maxLatency 313 millisecs slowest #-#-#-#-#
YYYY-MM-DDTHH:MM:SS.SSSZ In(182) vmkernel: cpu51:2100033) CMMDS: CMMDSCompleteLocalUpdate: 3865: 521609c9-####-####-####-0b840d1df835: Number of slow updates in last interval is 1 maxLatency 689 millisecs slowest UUID #-#-#-#-#

YYYY-MM-DDTHH:MM:SS.SSSZ In(182) vmkernel: cpu51:2100033) CMMDS: LeaderSendHeartbeat: 2635: #-#-#-#-#: Backup unresponsive
YYYY-MM-DDTHH:MM:SS.SSSZ In(182) vmkernel: cpu51:2100033) CMMDS: CMMDSStateDestroyNode: 708: #-#-#-#-#: Destroying node #-#-#-#-#: Backup is too far behind
YYYY-MM-DDTHH:MM:SS.SSSZ In(182) vmkernel: cpu51:2100033) CMMDS: LeaderLostBackup: 545: #-#-#-#-#: Leader Failover: MUUID #-#-#-#-# old #-#-#-#-#
YYYY-MM-DDTHH:MM:SS.SSSZ In(182) vmkernel: cpu51:2100033) CMMDS: LeaderRemoveNodeFromMembership: 8592: #-#-#-#-#: Removing node #-#-#-#-# (vsanNodeType: data) from the cluster membership
YYYY-MM-DDTHH:MM:SS.SSSZ In(182) vmkernel: cpu51:2100033) CMMDS: CMMDSClusterDestroyNodeImpl: 262: Destroying node #-#-#-#-# from the cluster db. Last HB received from node - #-#-#-#-#
YYYY-MM-DDTHH:MM:SS.SSSZ Wa(180) vmkwarning: cpu51:2100033) WARNING: RDT: RDTEndQueuedMessages: 1410: assoc 0x4322c6ac1680 message 9901371 failure
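
To gauge how widespread these events were on a host, the vmkernel.log lines can be tallied by the CMMDS/RDT function name that emitted them. The following is a minimal sketch, not a VMware tool: the function names are taken from the excerpts above, while `count_outage_events` is a hypothetical helper and the log is assumed to have been copied off the host as plain text.

```python
import re
from collections import Counter

# Function names copied from the vmkernel.log excerpts in this article.
SIGNATURES = (
    "LeaderUpdateMeanRTLatency",
    "CMMDSCompleteLocalUpdate",
    "LeaderSendHeartbeat",
    "CMMDSStateDestroyNode",
    "LeaderLostBackup",
    "LeaderRemoveNodeFromMembership",
    "CMMDSClusterDestroyNodeImpl",
    "RDTEndQueuedMessages",
)
_PATTERN = re.compile(r"\b(" + "|".join(SIGNATURES) + r")\b")


def count_outage_events(lines):
    """Count log lines per signature (hypothetical helper, for illustration)."""
    counts = Counter()
    for line in lines:
        match = _PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

A possible use is `count_outage_events(open("vmkernel.log", errors="replace"))`, then comparing the counts across hosts to see which nodes were removed from membership and when.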


Environment

VMware ESXi 8.x

VMware ESXi 7.x

VMware vSAN 8.x

VMware vSAN 7.x

Cause

During the vSAN network outage, backend storage became temporarily inaccessible. The Linux VMs attempted read/write operations to their root disk (/dev/sda), but commands did not complete within the configured SCSI timeout (1080s / 1880s).

XFS journal writes then failed with log I/O error -5 (EIO).
XFS forced a filesystem shutdown to protect data integrity.

Since /dev/sda contained the root filesystem, all VM operations became unresponsive.
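
To see which SCSI timeout a guest is actually using, the per-device value can be read from sysfs. This is a minimal sketch, assuming a Linux guest where SCSI disks expose `/sys/block/<device>/device/timeout` (value in seconds); `read_scsi_timeouts` is a hypothetical helper name.

```python
from pathlib import Path


def read_scsi_timeouts(sys_block="/sys/block"):
    """Return {device: timeout_seconds} for disks that expose a SCSI timeout."""
    timeouts = {}
    for dev in sorted(Path(sys_block).iterdir()):
        timeout_file = dev / "device" / "timeout"
        # Non-SCSI devices (loop, dm-*, and so on) have no such attribute.
        if timeout_file.is_file():
            timeouts[dev.name] = int(timeout_file.read_text().strip())
    return timeouts
```

Calling `read_scsi_timeouts()` inside the guest returns a mapping such as device name to seconds, which can be compared with the "waited ...s" values in the kernel log.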

Validation

Example kernel logs:

[465620.448887] I/O error, dev sdb, sector 9473560 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[465621.023491] sd 0:0:0:0: [sda] tag#1016 timing out command, waited 1880s
[465621.023969] I/O error, dev sda, sector 229747980 op 0x1:(WRITE) flags 0x9800 phys_seg 1 prio class 0
[465621.024388] XFS (dm-7): log I/O error -5
[465621.024020] XFS (dm-7): Filesystem has been shut down due to log error (0x2).
[465621.025288] XFS (dm-7): Please unmount the filesystem and rectify the problem(s).
[466700.443897] sd 0:0:1:0: [sdb] tag#1021 timing out command, waited 1080s
[466700.443578] I/O error, dev sdb, sector 9469840 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[522497.539017] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kworker/#:##:####]
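
When several VMs need to be checked, the three signatures (command timeout, block-layer I/O error, XFS shutdown) can be pulled out of saved kernel log text. This is a minimal sketch: the patterns assume the standard kernel message wording, and `summarize_kernel_log` is a hypothetical helper, not part of any VMware or Linux tooling.

```python
import re

TIMEOUT_RE = re.compile(r"\[(\w+)\] tag#\d+ timing out command, waited (\d+)s")
IO_ERROR_RE = re.compile(r"I/O error, dev (\w+), sector (\d+)")
XFS_SHUTDOWN_RE = re.compile(r"XFS \(([^)]+)\): Filesystem has been shut down")


def summarize_kernel_log(lines):
    """Collect SCSI timeouts, block I/O errors, and XFS shutdowns."""
    summary = {"timeouts": [], "io_errors": [], "xfs_shutdowns": []}
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            summary["timeouts"].append((m.group(1), int(m.group(2))))
            continue
        m = IO_ERROR_RE.search(line)
        if m:
            summary["io_errors"].append((m.group(1), int(m.group(2))))
            continue
        m = XFS_SHUTDOWN_RE.search(line)
        if m:
            summary["xfs_shutdowns"].append(m.group(1))
    return summary
```

A timeout on the root disk followed by an entry in `xfs_shutdowns` matches the sequence described in the Cause section.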

Resolution

Confirm vSAN network and storage connectivity are stable.

Reboot affected VMs.

On restart, the XFS filesystem remounts cleanly.
VM functionality is restored.
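
After the reboot, one quick sanity check is to confirm that the root filesystem is mounted read-write. This is a minimal sketch, assuming the standard `/proc/mounts` field order (device, mount point, type, options); `root_mount_state` is a hypothetical helper.

```python
def root_mount_state(mounts_text):
    """Return (fstype, 'rw' or 'ro') for the last entry mounted at '/', else None."""
    state = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == "/":
            options = fields[3].split(",")
            # Later entries override earlier ones, as with stacked mounts.
            state = (fields[2], "ro" if "ro" in options else "rw")
    return state
```

For example, `root_mount_state(open("/proc/mounts").read())` run inside the guest reports the filesystem type and whether the root mount is read-write. It does not replace reviewing the kernel log for new I/O errors.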

Additional Information

Linux VMs flags their file-system in read-only after datastore inaccessibility

Linux based file systems become read-only