Linux VMs Report I/O Errors and Filesystem Shutdown After vSAN Network Outage
Article ID: 410664

Updated On:

Products

VMware vSAN

Issue/Introduction

Following a vSAN network outage, several Linux VMs reported I/O errors on their root disk.

  • vSAN health: Green post-recovery.

Guest OS behavior

  • Basic commands (uptime, date, df -h) fail due to root filesystem unavailability.

VMware observations

  • vmware.log: No heartbeat timeouts or storage device errors; only a minor log-write discard:

2025-09-16T10:49:41.604Z No(00) svga - >>> Error writing log, 110 bytes discarded. Disk full?

  • vmkernel.log: Events align with a vSAN network outage, including:

    • High latency warnings

    • Node removals from cluster membership

    • Leader election and failover

2025-09-15T15:45:07.263Z In(182) vmkernel: cpu13:2100033) CMMDS: LeaderUpdateMeanRTLatency: 12423: Throttled: 521609c9-####-####-####-0b840d1df835: High RT latency. Node 65189716-####-####-####-1423f2320bb0, RT latency 958 (ms). Mean RT latency 122 (ms)

2025-09-15T15:45:18.065Z In(182) vmkernel: cpu3:2100033) CMMDS: CMMDSCompleteLocalUpdate: 3865: 521609c9-####-####-####-0b840d1df835: Number of slow updates in last interval is 1 maxLatency 313 millisecs slowest UUID 529f964e-####-####-####-b57c11e7b3c5
2025-09-15T15:45:18.755Z In(182) vmkernel: cpu51:2100033) CMMDS: CMMDSCompleteLocalUpdate: 3865: 521609c9-####-####-####-0b840d1df835: Number of slow updates in last interval is 1 maxLatency 689 millisecs slowest UUID 529f964e-####-####-####-b57c11e7b3c5

2025-09-15T15:45:43.982Z In(182) vmkernel: cpu51:2100033) CMMDS: LeaderSendHeartbeat: 2635: 521609c9-####-####-####-0b840d1df835: Backup unresponsive
2025-09-15T15:45:43.982Z In(182) vmkernel: cpu51:2100033) CMMDS: CMMDSStateDestroyNode: 708: 521609c9-####-####-####-0b840d1df835: Destroying node 65189716-####-####-####-1423f2320bb0: Backup is too far behind
2025-09-15T15:45:43.982Z In(182) vmkernel: cpu51:2100033) CMMDS: LeaderLostBackup: 545: 521609c9-####-####-####-0b840d1df835: Leader Failover: MUUID a734c868-####-####-####-1423f231d820 old 326cb968-####-####-####-1423f231d820
2025-09-15T15:45:43.982Z In(182) vmkernel: cpu51:2100033) CMMDS: LeaderRemoveNodeFromMembership: 8592: 521609c9-####-####-####-0b840d1df835: Removing node 65189716-####-####-####-1423f2320bb0 (vsanNodeType: data) from the cluster membership
2025-09-15T15:45:43.982Z In(182) vmkernel: cpu51:2100033) CMMDS: CMMDSClusterDestroyNodeImpl: 262: Destroying node 65189716-####-####-####-1423f2320bb0 from the cluster db. Last HB received from node - 973246628589212
2025-09-15T15:45:43.982Z Wa(180) vmkwarning: cpu51:2100033) WARNING: RDT: RDTEndQueuedMessages: 1410: assoc 0x4322c6ac1680 message 9901371 failure
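When reviewing a support bundle, the CMMDS leader-election and membership events above can be isolated with a simple filter. A minimal sketch, assuming a vmkernel.log excerpt has been copied to a local file (the sample lines below are abbreviated from the output above; on a live ESXi host the source is /var/log/vmkernel.log):

```shell
# Abbreviated vmkernel.log excerpt (sample data for illustration only)
cat > vmkernel-excerpt.log <<'EOF'
2025-09-15T15:45:43.982Z In(182) vmkernel: cpu51:2100033) CMMDS: LeaderSendHeartbeat: 2635: Backup unresponsive
2025-09-15T15:45:43.982Z In(182) vmkernel: cpu51:2100033) CMMDS: LeaderRemoveNodeFromMembership: 8592: Removing node from the cluster membership
2025-09-15T15:45:44.120Z In(182) vmkernel: cpu51:2100033) NetPort: unrelated entry
EOF

# Keep only the CMMDS leader/membership events relevant to this symptom
grep -E 'CMMDS: (LeaderSendHeartbeat|LeaderLostBackup|LeaderRemoveNodeFromMembership|CMMDSStateDestroyNode)' vmkernel-excerpt.log
```

The same pattern also matches the LeaderLostBackup and CMMDSStateDestroyNode entries shown in the full excerpt above.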

Environment

VMware ESXi 8.x 

VMware ESXi 7.x

VMware vSAN 8.x

VMware vSAN 7.x 

Cause

During the vSAN network outage, backend storage became temporarily inaccessible. The Linux VMs continued issuing read/write operations to their root disk (/dev/sda), but the requests did not complete within the guest's SCSI command timeout (1080s / 1880s, as seen in the kernel log).

  • XFS journal writes failed with log I/O error -5.

  • XFS forced a filesystem shutdown to protect data integrity.

Since /dev/sda contained the root filesystem, all VM operations became unresponsive.
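The guest-side timeout that governs how long such I/O may stay outstanding is exposed per disk in sysfs. A quick check on any Linux guest (device names vary; the 180-second note below is a general observation about the udev rule VMware Tools typically installs, not specific to this case):

```shell
# Print the SCSI command timeout (seconds) for each sd* disk, if present.
# VMware Tools typically sets this to 180s via a udev rule; the larger
# waits reported in the kernel log accumulate across retries.
for t in /sys/block/sd*/device/timeout; do
    if [ -e "$t" ]; then
        printf '%s: %s seconds\n' "$t" "$(cat "$t")"
    fi
done
```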

Validation

  • Example kernel logs:

[465620.448887] I/O error, dev sdb, sector 9473560 op 0x0: (READ) flags 0x0 phys_seg 1 prio class 0
[465621.023491] sd 0:0:0:0: [sda] tag#1016 timing out command, waited 1880s
[465621.023969] I/O error, dev sda, sector 229747980 op 0x1: (WRITE) flags 0x9800 phys_seg 1 prio class 0
[465621.024388] XFS (dm-7): log I/O error -5
[465621.024020] XFS (dm-7): Filesystem has been shut down due to log error (0x2).
[465621.025288] XFS (dm-7): Please unmount the filesystem and rectify the problem(s).
[466700.443897] sd 0:0:1:0: [sdb] tag#1021 timing out command, waited 1080s
[466700.443578] I/O error, dev sdb, sector 9469840 op 0x0: (READ) flags 0x0 phys_seg 1 prio class 0
[522497.539017] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kworker/#:##:####]

Resolution

Confirm vSAN network and storage connectivity are stable.

Reboot affected VMs.

  • On restart, XFS replays its journal and the filesystem remounts cleanly.

  • VM functionality is restored.
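After the reboot, recovery can be confirmed from inside the guest. A minimal check, assuming a standard Linux userland (device and mount names vary per VM):

```shell
# Confirm the root filesystem is mounted read-write after the reboot.
awk '$2 == "/" {print "root mounted with options:", $4}' /proc/mounts

# Look for XFS errors logged since boot; silence indicates a clean remount.
dmesg 2>/dev/null | grep -iE 'XFS.*(error|shut.?down)' || echo "no XFS errors since boot"
```

If errors persist after a reboot, the filesystem should be left unmounted and inspected before further use.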

Additional Information