Linux VMs Report I/O Errors and Filesystem Shutdown After vSAN Network Outage

Article ID: 410664

Products

VMware vSAN

Issue/Introduction

Following a vSAN network outage, several Linux VMs reported I/O errors on their root disk.

  • vSAN health: Reports green after the network recovers.

Guest OS behavior

  • Basic commands (uptime, date, df -h) fail because the root filesystem is unavailable.

VMware observations

  • vmware.log: No heartbeat timeouts or storage device errors are logged. The only anomaly is a discarded log write:

YYYY-MM-DDTHH:MM:SS.SSSZ No(00) svga - >>> Error writing log, 110 bytes discarded. Disk full?

  • vmkernel.log: Events align with a vSAN network outage, including:

    • High latency warnings

    • Node removals from cluster membership

    • Leader election and failover

YYYY-MM-DDTHH:MM:SS.SSSZ In(182) vmkernel: cpu13:2100033) CMMDS: LeaderUpdateMeanRTLatency: 12423: Throttled: #-#-#-#-#: High RT latency. Node #-#-#-#-#, RT latency 958 (ms). Mean RT latency 122 (ms)

YYYY-MM-DDTHH:MM:SS.SSSZ In(182) vmkernel: cpu3:2100033) CMMDS: CMMDSCompleteLocalUpdate: 3865: #-#-#-#-#: Number of slow updates in last interval is 1 maxLatency 313 millisecs slowest #-#-#-#-#
YYYY-MM-DDTHH:MM:SS.SSSZ In(182) vmkernel: cpu51:2100033) CMMDS: CMMDSCompleteLocalUpdate: 3865: 521609c9-####-####-####-0b840d1df835: Number of slow updates in last interval is 1 maxLatency 689 millisecs slowest UUID #-#-#-#-#

YYYY-MM-DDTHH:MM:SS.SSSZ In(182) vmkernel: cpu51:2100033) CMMDS: LeaderSendHeartbeat: 2635: #-#-#-#-#: Backup unresponsive
YYYY-MM-DDTHH:MM:SS.SSSZ In(182) vmkernel: cpu51:2100033) CMMDS: CMMDSStateDestroyNode: 708: #-#-#-#-#: Destroying node #-#-#-#-#: Backup is too far behind
YYYY-MM-DDTHH:MM:SS.SSSZ In(182) vmkernel: cpu51:2100033) CMMDS: LeaderLostBackup: 545: #-#-#-#-#: Leader Failover: MUUID #-#-#-#-# old #-#-#-#-#
YYYY-MM-DDTHH:MM:SS.SSSZ In(182) vmkernel: cpu51:2100033) CMMDS: LeaderRemoveNodeFromMembership: 8592: #-#-#-#-#: Removing node #-#-#-#-# (vsanNodeType: data) from the cluster membership
YYYY-MM-DDTHH:MM:SS.SSSZ In(182) vmkernel: cpu51:2100033) CMMDS: CMMDSClusterDestroyNodeImpl: 262: Destroying node #-#-#-#-# from the cluster db. Last HB received from node - #-#-#-#-#
YYYY-MM-DDTHH:MM:SS.SSSZ Wa(180) vmkwarning: cpu51:2100033) WARNING: RDT: RDTEndQueuedMessages: 1410: assoc 0x4322c6ac1680 message 9901371 failure
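
To check whether a host's vmkernel.log shows the same pattern, the messages above can be counted with a short script. The following is a minimal sketch, not a VMware tool: the function name `summarize_cmmds_events` and the label names are assumptions made for illustration, and the match strings are taken only from the excerpts in this article.

```python
import re
from collections import Counter

# Substrings copied from the vmkernel.log excerpts in this article.
# The dictionary keys are illustrative labels, not VMware terminology.
CMMDS_SIGNATURES = {
    "high_rt_latency": "High RT latency",
    "slow_updates": "Number of slow updates",
    "backup_unresponsive": "Backup unresponsive",
    "leader_failover": "Leader Failover",
    "node_removed": "Removing node",
    "rdt_failure": "RDTEndQueuedMessages",
}

# Per-node round-trip latency; the lookbehind skips the "Mean RT latency" figure.
RT_LATENCY_RE = re.compile(r"(?<!Mean )RT latency (\d+) \(ms\)")


def summarize_cmmds_events(lines):
    """Count outage-related messages and track the highest per-node RT latency."""
    counts = Counter()
    peak_rt_ms = None
    for line in lines:
        for label, needle in CMMDS_SIGNATURES.items():
            if needle in line:
                counts[label] += 1
        match = RT_LATENCY_RE.search(line)
        if match:
            value = int(match.group(1))
            peak_rt_ms = value if peak_rt_ms is None else max(peak_rt_ms, value)
    return counts, peak_rt_ms
```

The function accepts any iterable of lines, so an open file handle for a copy of vmkernel.log can be passed directly. A nonzero count for node removals or leader failover around the time of the guest errors is consistent with the outage described here, but it does not by itself prove the cause.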

Environment

VMware ESXi 8.x 

VMware ESXi 7.x

VMware vSAN 8.x

VMware vSAN 7.x 

Cause

During the vSAN network outage, the backend storage was temporarily inaccessible. The Linux VMs continued to issue reads and writes to their root disk (/dev/sda), but the commands did not complete within the configured SCSI timeout (1080s / 1880s), so the guest kernel failed them with I/O errors.

  • XFS journal writes failed with log I/O error -5.

  • XFS forced a filesystem shutdown to protect data integrity.

Because /dev/sda holds the root filesystem, the guest OS became unresponsive once that filesystem shut down.
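
The SCSI timeout that applies to each disk can be read inside the guest from sysfs, where Linux exposes it as `/sys/block/<device>/device/timeout` (in seconds). The sketch below is an illustration only: the function name `read_scsi_timeouts` is an assumption, and the `sysfs_block` parameter exists just so the function can be pointed at a different directory.

```python
from pathlib import Path


def read_scsi_timeouts(sysfs_block="/sys/block"):
    """Return {device_name: timeout_seconds} for disks that expose a SCSI timeout."""
    root = Path(sysfs_block)
    if not root.is_dir():
        return {}
    timeouts = {}
    for dev in sorted(root.iterdir()):
        timeout_file = dev / "device" / "timeout"
        # Devices such as loop or device-mapper nodes have no SCSI timeout file.
        if timeout_file.is_file():
            timeouts[dev.name] = int(timeout_file.read_text().strip())
    return timeouts


if __name__ == "__main__":
    for name, seconds in read_scsi_timeouts().items():
        print(f"{name}: {seconds}s")
```

Comparing the reported values with the "waited ...s" figures in the guest kernel log shows which timeout the failed commands ran into.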

Validation

  • Example kernel logs:

[465620.448887] I/O error, dev sdb, sector 9473560 op 0x0: (READ) flags 0x0 phys_seg 1 prio class 0
[465621.023491] sd 0:0:0:0: [sda] tag#1016 timing out command, waited 1880s
[465621.023969] I/O error, dev sda, sector 229747980 op 0x1: (WRITE) flags 0x9800 phys_seg 1 prio class 0
[465621.024388] XFS (dm-7): log I/O error -5
[465621.024020] XFS (dm-7): Filesystem has been shut down due to log error (0x2).
[465621.025288] XFS (dm-7): Please unmount the filesystem and rectify the problem(s).
[466700.443897] sd 0:0:1:0: [sdb] tag#1021 timing out command, waited 1080s
[466700.443578] I/O error, dev sdb, sector 9469840 op 0x0: (READ) flags 0x0 phys_seg 1 prio class 0
[522497.539017] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kworker/#:##:####]
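
When reviewing a saved guest kernel log (for example, output captured from dmesg or the VM console), the sequence above can be picked out automatically. The following is a minimal sketch under stated assumptions: the function name `classify_kernel_lines` and the labels are illustrative, and the patterns cover only the message shapes shown in this article.

```python
import re

# Patterns modeled on the example kernel messages; labels are illustrative.
GUEST_SIGNATURES = [
    ("scsi_timeout",
     re.compile(r"\[(sd[a-z]+)\] tag#\d+ timing out command, waited (\d+)s")),
    ("io_error",
     re.compile(r"I/O error, dev (\S+), sector (\d+)")),
    ("xfs_log_io_error",
     re.compile(r"XFS \((\S+)\): log I/O error (-?\d+)")),
    ("xfs_shutdown",
     re.compile(r"XFS \((\S+)\): Filesystem has been shut down")),
    ("soft_lockup",
     re.compile(r"watchdog: BUG: soft lockup")),
]


def classify_kernel_lines(lines):
    """Return (label, captured_groups) for each line matching a known signature."""
    findings = []
    for line in lines:
        for label, pattern in GUEST_SIGNATURES:
            match = pattern.search(line)
            if match:
                findings.append((label, match.groups()))
                break  # one label per line is enough for triage
    return findings
```

A SCSI timeout followed by an XFS log I/O error and a shutdown message on the same device-mapper volume matches the failure described in this article. Other combinations may point to a different problem.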

Resolution

Confirm that the vSAN network and storage connectivity are stable.

Reboot affected VMs.

  • On restart, the XFS filesystem remounts cleanly.

  • VM functionality is restored.
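
After the reboot, one quick sanity check is to list the XFS filesystems the guest has mounted and whether each is mounted read-write. The sketch below parses text in the `/proc/mounts` format (device, mount point, filesystem type, options). The function name `xfs_mounts` is an assumption for illustration, and a read-write entry alone does not guarantee a healthy filesystem, so the kernel log should still be reviewed for new XFS errors.

```python
from pathlib import Path


def xfs_mounts(mounts_text):
    """Return one dict per XFS mount found in /proc/mounts-style text."""
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Expected fields: device, mount point, fs type, options, dump, pass.
        if len(fields) >= 4 and fields[2] == "xfs":
            options = fields[3].split(",")
            result.append({
                "device": fields[0],
                "mountpoint": fields[1],
                "read_write": "rw" in options,
            })
    return result


if __name__ == "__main__":
    for mount in xfs_mounts(Path("/proc/mounts").read_text()):
        state = "rw" if mount["read_write"] else "ro"
        print(f'{mount["mountpoint"]} ({mount["device"]}): {state}')
```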
