Physical Network Interfaces Down and PCPU Lockups Due to VMFS6 Datastore Contention

Article ID: 437967

Products

VMware NSX

Issue/Introduction

  • Physical network interfaces (pNICs) randomly transition to a down state on ESXi hosts within a cluster. At the same time, the affected ESXi hosts experience PCPU lockups and missed heartbeats.
  • vobd logs verify the uplink failures:
    • [vob.net.dvport.uplink.transition.down] Uplink: vmnic2 is down
    • [vob.net.dvport.uplink.transition.down] Uplink: vmnic0 is down
  • vmkernel logs show PCPU lockup warnings and NMI alerts:
    • WARNING: Heartbeat: 961: PCPU 70 didn't have a heartbeat for 5 seconds, timeout is 10, 1 IPIs sent; *may* be locked up.
  • Stack traces from the locked PCPUs consistently point to VMFS6 resource allocation and file I/O operations, specifically:
    • MCSLockSpin
    • Res6AffMgr_AllocResources
    • Fil6_AllocateBlocks
    • Res6AffMgrComputeSortIndices
    • Res3StatVMFS6
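
The symptoms above can be checked against collected log text with a short script. The following Python sketch is illustrative only: the patterns are derived solely from the sample lines quoted in this article, and the function name scan_logs and its return format are assumptions, not part of any VMware tooling. Real log lines carry timestamps and other prefixes, so the patterns search within each line rather than matching it whole.

```python
import re

# Patterns based only on the sample log lines quoted in this article.
UPLINK_DOWN = re.compile(
    r"\[vob\.net\.dvport\.uplink\.transition\.down\] Uplink: (\S+) is down"
)
HEARTBEAT = re.compile(r"PCPU (\d+) didn't have a heartbeat for (\d+) seconds")

# Function names listed in this article's stack traces.
VMFS6_SYMBOLS = (
    "MCSLockSpin",
    "Res6AffMgr_AllocResources",
    "Fil6_AllocateBlocks",
    "Res6AffMgrComputeSortIndices",
    "Res3StatVMFS6",
)


def scan_logs(lines):
    """Collect downed uplinks, stalled PCPUs and VMFS6 stack symbols.

    Hypothetical helper: `lines` is any iterable of log lines, for example
    the combined contents of the vobd and vmkernel logs.
    """
    uplinks, pcpus, symbols = [], [], set()
    for line in lines:
        # A single line may report more than one uplink transition.
        uplinks.extend(UPLINK_DOWN.findall(line))
        for pcpu, seconds in HEARTBEAT.findall(line):
            pcpus.append((int(pcpu), int(seconds)))
        symbols.update(name for name in VMFS6_SYMBOLS if name in line)
    return {
        "uplinks_down": uplinks,
        "stalled_pcpus": pcpus,
        "vmfs6_symbols": sorted(symbols),
    }
```

If all three result lists are non-empty for the same host and time window, the host likely matches the signature described in this article; the script does not itself correlate timestamps.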

Environment

VMware vSphere ESXi

Cause

Contention on a VMFS6 datastore causes a software locking condition that leads to PCPU lockups. Because the CPUs are locked while processing VMFS6 operations, the host fails to send heartbeats, which cascades into the network subsystem incorrectly marking the pNICs as down.

Resolution

  • Make sure the VMFS datastores do not exceed 64 TB in size.
  • Identify and unmount the specific VMFS6 datastore causing the lockups. This stops the VMFS6 lock contention immediately and allows the PCPUs to process heartbeats normally.
  • Keep the affected datastore unmounted to preserve host and network stability.
  • Investigate the underlying storage array and SAN fabric backing this specific LUN for severe latency, hardware faults, or misconfigurations.
  • Verify the exact ESXi version deployed and cross-reference it against known internal PRs matching the Res6AffMgrComputeSortIndices stack trace to determine whether a software patch applies.
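
The 64 TB size check in the first step can be scripted once datastore capacities have been collected (for example, from the Size column that esxcli storage filesystem list reports). The Python sketch below is illustrative only: it assumes that "64 TB" means 64 binary terabytes (64 * 2**40 bytes), and the function name, the input format and the datastore names are hypothetical.

```python
# Assumption: the 64 TB limit is read as 64 binary terabytes.
MAX_VMFS_BYTES = 64 * 2**40


def oversized_datastores(sizes_by_name):
    """Return the names of datastores larger than the 64 TB limit, sorted.

    `sizes_by_name` maps a datastore name to its capacity in bytes.
    """
    return sorted(
        name for name, size in sizes_by_name.items() if size > MAX_VMFS_BYTES
    )


# Hypothetical inventory: datastore name -> capacity in bytes.
inventory = {"ds-prod-01": 32 * 2**40, "ds-prod-02": 80 * 2**40}
print(oversized_datastores(inventory))
```

A datastore of exactly 64 TB is treated as within the limit, since the step only excludes datastores above that size.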