Following storage outage, Linux VM CPU lockup and VSCSI resets when fstrim runs

Article ID: 435667

Products

VMware vSphere ESXi
VMware vSphere ESX 8.x

Issue/Introduction

  • Following a storage outage, VMs experience filesystem corruption that requires a manual fsck or a VM restart.

  • During the next guest OS fstrim (UNMAP) operation, the VM hangs and the guest kernel logs CPU soft lockup errors:
    watchdog: BUG: soft lockup - CPU## stuck for xxs!
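
To confirm the guest-side symptom, the kernel log of an affected VM can be searched for the soft lockup message quoted above. The following is a minimal sketch, assuming the standard Linux watchdog wording `CPU#<n> stuck for <n>s!`; the function name `find_soft_lockups` and the sample line in the usage note are hypothetical and only illustrate the pattern.

```python
import re

# Matches the Linux soft lockup watchdog message, for example:
#   watchdog: BUG: soft lockup - CPU#3 stuck for 22s!
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s!"
)


def find_soft_lockups(log_text):
    """Return a list of (cpu, seconds) tuples, one per soft lockup message."""
    return [
        (int(m.group("cpu")), int(m.group("secs")))
        for m in SOFT_LOCKUP.finditer(log_text)
    ]
```

For example, passing the text of a saved `dmesg` or journal export to `find_soft_lockups` returns one tuple per lockup report, which makes it easier to see whether the lockups coincide with the scheduled fstrim run.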

Environment

VMware vSphere ESX 8.0

Cause

During the storage outage, an UNMAP command failed to complete on the backing LUN and left stale metadata locks behind.

UNMAP requires exclusive locks to reclaim blocks safely. The orphaned locks therefore block subsequent space reclamation tasks and cause transaction errors.

/var/run/log/vmkernel.log reports entries similar to the following:
vmkernel: cpu48:2097827)VSCSI: 3772: handle 30326781256934566(GID:xxxx)(vscsi0:0):processing reset for handle ... state 1381192706
vmkernel: cpu14:8406539)Res6: 2944: '<datastorename>': RC Lock not free for type 1, return TXN FULL
vmkernel: cpu14:8406539)Fil6: 3816: <datastorename>: <FD c10 r89> - Failed to unmap file blocks 0/1:Transaction ran out of lock space or log space
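
The three log signatures above can be checked for in a copy of vmkernel.log with a short script. This is a minimal sketch, not a supported tool: the function name `scan_vmkernel` and the signature labels are hypothetical, and the patterns are built only from the message text shown in this article. The numeric source-line fields (such as `3772` or `2944`) are matched loosely because they may differ between builds.

```python
import re

# Patterns derived from the vmkernel.log excerpts in this article.
# The labels on the left are illustrative names, not VMware terminology.
SIGNATURES = {
    "vscsi_reset": re.compile(
        r"VSCSI: \d+: handle .*processing reset for handle"
    ),
    "rc_lock_not_free": re.compile(
        r"RC Lock not free for type \d+, return TXN FULL"
    ),
    "unmap_failed": re.compile(
        r"Failed to unmap file blocks .*"
        r"Transaction ran out of lock space or log space"
    ),
}


def scan_vmkernel(lines):
    """Count how many lines match each signature."""
    counts = {name: 0 for name in SIGNATURES}
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

A typical use would be `scan_vmkernel(open("vmkernel.log", errors="replace"))` against a log file copied off the host. Non-zero counts for all three signatures on the same datastore are consistent with the condition described here, but they do not prove it on their own.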

Resolution

  • If the guest OS hangs, hard reset the VM to clear the vSCSI reset loop.
  • Migrate the affected VMs to a different datastore. Moving the VM disk files forces the storage layer to release the stale metadata locks associated with the affected blocks on the original datastore.
  • Perform a staged, rolling reboot of all ESXi hosts in the cluster.