Linux VMs Report I/O Errors and Filesystem Shutdown After vSAN Network Outage / vSAN Cluster Partition


Article ID: 410664


Products

VMware vSAN

Issue/Introduction

A vSAN network outage occurred, which may have resulted in a vSAN cluster partition on the affected vSAN cluster.

The related vSAN cluster partition alert may have been shown in the vSphere Client on the Summary page of the cluster.

vSAN has since recovered (the vSAN health check shows no relevant issues), but several Linux VMs report I/O errors on their root disk.

 

Guest OS behavior

  • Basic commands (uptime, date, df -h) fail because the root filesystem is unavailable.

  • The guest OS console or kernel log may show I/O errors (see the example kernel logs under Validation).

 

VMware observations

  • vmware.log: No heartbeat timeouts or storage device errors are logged. Only a minor log discard message appears:

YYYY-MM-DDTHH:MM:SS.SSSZ No(00) svga - >>> Error writing log, 110 bytes discarded. Disk full?

  • vmkernel.log: Events align with a vSAN network outage, including:

    • High latency warnings

    • Node removals from cluster membership

    • Leader election and failover

YYYY-MM-DDTHH:MM:SS.SSSZ In(182) vmkernel: cpu13:2100033) CMMDS: LeaderUpdateMeanRTLatency: 12423: Throttled: #-#-#-#-#: High RT latency. Node #-#-#-#-#, RT latency 958 (ms). Mean RT latency 122 (ms)

YYYY-MM-DDTHH:MM:SS.SSSZ In(182) vmkernel: cpu3:2100033) CMMDS: CMMDSCompleteLocalUpdate: 3865: #-#-#-#-#: Number of slow updates in last interval is 1 maxLatency 313 millisecs slowest #-#-#-#-#
YYYY-MM-DDTHH:MM:SS.SSSZ In(182) vmkernel: cpu51:2100033) CMMDS: CMMDSCompleteLocalUpdate: 3865: 521609c9-####-####-####-0b840d1df835: Number of slow updates in last interval is 1 maxLatency 689 millisecs slowest UUID #-#-#-#-#

YYYY-MM-DDTHH:MM:SS.SSSZ In(182) vmkernel: cpu51:2100033) CMMDS: LeaderSendHeartbeat: 2635: #-#-#-#-#: Backup unresponsive
YYYY-MM-DDTHH:MM:SS.SSSZ In(182) vmkernel: cpu51:2100033) CMMDS: CMMDSStateDestroyNode: 708: #-#-#-#-#: Destroying node #-#-#-#-#: Backup is too far behind
YYYY-MM-DDTHH:MM:SS.SSSZ In(182) vmkernel: cpu51:2100033) CMMDS: LeaderLostBackup: 545: #-#-#-#-#: Leader Failover: MUUID #-#-#-#-# old #-#-#-#-#
YYYY-MM-DDTHH:MM:SS.SSSZ In(182) vmkernel: cpu51:2100033) CMMDS: LeaderRemoveNodeFromMembership: 8592: #-#-#-#-#: Removing node #-#-#-#-# (vsanNodeType: data) from the cluster membership
YYYY-MM-DDTHH:MM:SS.SSSZ In(182) vmkernel: cpu51:2100033) CMMDS: CMMDSClusterDestroyNodeImpl: 262: Destroying node #-#-#-#-# from the cluster db. Last HB received from node - #-#-#-#-#
YYYY-MM-DDTHH:MM:SS.SSSZ Wa(180) vmkwarning: cpu51:2100033) WARNING: RDT: RDTEndQueuedMessages: 1410: assoc 0x4322c6ac1680 message 9901371 failure

#-#-#-#-#
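
When reviewing a long vmkernel.log, it can help to tally the CMMDS events by type. The following is a minimal Python sketch, not a VMware tool: the matched substrings are taken from the sample lines above, while the category labels (high_latency, backup_lost, leader_failover, node_removed) and the function names are hypothetical and chosen for illustration only.

```python
import re

# Substrings come from the sample vmkernel.log lines in this article.
# The labels on the left are illustrative names, not VMware terminology.
CMMDS_PATTERNS = [
    ("high_latency", re.compile(r"High RT latency|Number of slow updates")),
    ("backup_lost", re.compile(r"Backup unresponsive|Backup is too far behind")),
    ("leader_failover", re.compile(r"Leader Failover")),
    ("node_removed", re.compile(
        r"Removing node .* from the cluster membership|Destroying node")),
]


def classify_vmkernel_line(line):
    """Return the first matching label for a CMMDS log line, or None."""
    if "CMMDS" not in line:
        return None
    for label, pattern in CMMDS_PATTERNS:
        if pattern.search(line):
            return label
    return None


def summarize_cmmds_events(lines):
    """Count classified CMMDS events across an iterable of log lines."""
    counts = {}
    for line in lines:
        label = classify_vmkernel_line(line)
        if label is not None:
            counts[label] = counts.get(label, 0) + 1
    return counts
```

The functions accept any iterable of strings, so an open file handle for a copy of vmkernel.log can be passed directly to summarize_cmmds_events. A cluster of backup_lost, leader_failover, and node_removed events in the same time window would be consistent with the network outage described above.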

Environment

VMware ESXi 8.x 

VMware ESXi 7.x

VMware vSAN 8.x

VMware vSAN 7.x 

Cause

During the vSAN network outage, backend storage became temporarily inaccessible. The Linux VMs attempted read/write operations to their root disk (/dev/sda), but commands did not complete within the configured SCSI timeout (1080s / 1880s).

  • XFS journal writes failed with log I/O error -5.

  • XFS forced a filesystem shutdown to protect data integrity.

Because /dev/sda contained the root filesystem, the guest OS became unresponsive once the filesystem was shut down.
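
To see which SCSI command timeout a Linux guest is actually using, the per-device value can be read from sysfs. The sketch below assumes the standard Linux sysfs layout, where each SCSI disk exposes its timeout in seconds at /sys/block/<device>/device/timeout. The function name and the sys_block parameter are hypothetical and exist only to make the example easy to run and test.

```python
from pathlib import Path


def read_scsi_timeouts(sys_block="/sys/block"):
    """Return {device_name: timeout_in_seconds} for block devices that
    expose a SCSI command timeout under <sys_block>/<dev>/device/timeout.

    Devices without that file (for example loop or dm devices) are skipped.
    """
    timeouts = {}
    root = Path(sys_block)
    if not root.is_dir():
        return timeouts
    for dev in sorted(root.iterdir()):
        timeout_file = dev / "device" / "timeout"
        try:
            timeouts[dev.name] = int(timeout_file.read_text().strip())
        except (OSError, ValueError):
            # File missing, unreadable, or not an integer: skip the device.
            continue
    return timeouts
```

Calling read_scsi_timeouts() inside the guest returns a dictionary keyed by device name (for example sda), which can be compared with the "waited ...s" values reported in the kernel log.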

Validation

  • Example kernel logs:

[465620.448887] I/O error, dev sdb, sector 9473560 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[465621.023491] sd 0:0:0:0: [sda] tag#1016 timing out command, waited 1880s
[465621.023969] I/O error, dev sda, sector 229747980 op 0x1:(WRITE) flags 0x9800 phys_seg 1 prio class 0
[465621.024388] XFS (dm-7): log I/O error -5
[465621.024020] XFS (dm-7): Filesystem has been shut down due to log error (0x2).
[465621.025288] XFS (dm-7): Please unmount the filesystem and rectify the problem(s).
[466700.443897] sd 0:0:1:0: [sdb] tag#1021 timing out command, waited 1080s
[466700.443578] I/O error, dev sdb, sector 9469840 op 0x0:(READ) flags 0x phys_seg 1 prio class 0
[522497.539017] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kworker/#:##:####]
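
To check a guest kernel log for the same signature, the three relevant message types (SCSI command timeout, block I/O error, XFS shutdown) can be extracted with a few regular expressions. This is a minimal Python sketch under the assumption that the log uses the standard kernel message wording shown in this section. The function name and the shape of the returned report are hypothetical.

```python
import re

# Patterns follow the kernel message wording shown in the example logs.
TIMEOUT_RE = re.compile(r"\[(\w+)\] tag#\d+ timing out command, waited (\d+)s")
IO_ERROR_RE = re.compile(r"I/O error, dev (\w+), sector (\d+)")
XFS_SHUTDOWN_RE = re.compile(r"XFS \(([^)]+)\): Filesystem has been shut down")


def scan_kernel_log(lines):
    """Collect SCSI timeouts, block I/O errors and XFS shutdowns.

    Returns a dict with:
      timeouts      - list of (device, seconds_waited) tuples, in log order
      io_error_devs - set of devices that reported a block I/O error
      xfs_shutdown  - set of XFS devices whose filesystem was shut down
    """
    report = {"timeouts": [], "io_error_devs": set(), "xfs_shutdown": set()}
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            report["timeouts"].append((match.group(1), int(match.group(2))))
            continue
        match = IO_ERROR_RE.search(line)
        if match:
            report["io_error_devs"].add(match.group(1))
            continue
        match = XFS_SHUTDOWN_RE.search(line)
        if match:
            report["xfs_shutdown"].add(match.group(1))
    return report
```

The function takes any iterable of lines, so saved dmesg output or a file handle can be passed in directly. A non-empty xfs_shutdown set together with timeouts on the disk backing the root filesystem matches the failure pattern described in this article.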

Resolution

Confirm that the vSAN network and storage connectivity are stable.

Reboot the affected VMs.

  • On restart, the XFS filesystem remounts cleanly.

  • VM functionality is restored.

 

 
