vSAN VMs may become inaccessible after primary vSAN vmk loses connectivity because of a bad SFP on switch

Article ID: 410417

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

  • A network isolation event occurs on a host that is a member of both a vSAN cluster and a vSphere HA cluster. The VMs on the isolated host are powered off, and the attempt to restart them on another host fails.
  • The VMs become inaccessible.

Environment

ESXi 8.x
vCenter 8.x

Cause

A potential cause is that the host uses Link Status Only as its network failure detection policy and an upstream problem, such as a bad SFP on the switch, does not bring the vmnic link down and therefore goes undetected. The host becomes vSAN isolated, but the vSAN traffic is not failed over to the standby vmnic. As a result, the VM shutdown cannot properly update the locks and objects on vSAN, so they are not made available for another host to power on the VMs.

  • Some log messages on the host where the VMs were running:

    fdm.0:2025-08-24T08:25:44.522Z In(166) Fdm[########]: [Originator@#### sub=Monitor opID=pingableAddressMonitor.cpp:###-########] No ping reply from ###.###.###.###
    fdm.0:2025-08-24T08:55:00.392Z In(166) Fdm[########]: [Originator@#### sub=Policy opID=clusterManager.cpp:###-########] Host isolated is true

  • A log message that might show up on the host that is chosen to restart the VM:

    swapobjd.log:2025-08-24T08:56:47.696Z Er(11) swapobjd[2103459] 44537806:SwapObjCreateFileInt:298: Failed to create object /vmfs/volumes/vsan:################-################/########-####-####-####-############/<vmname>.vswp (The file already exists)

  • A link-down message like the following, for the vmnic that is the active uplink for the vSAN vmk, is not logged before the host is reported as isolated:

    2025-08-24T14:25:19.314Z In(14) vobd[########]:  [netCorrelator] ################: [vob.net.vmnic.linkstate.down] vmnic vmnic# linkstate down
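The log check described above can be sketched in Python. This is a minimal illustration and not a VMware tool: the function names, the way the log lines are supplied, and the vmnic name passed in are assumptions, and the patterns are taken only from the log excerpts shown in this article. It reports whether a link-down message for the given vmnic was logged at or before the first "Host isolated is true" message.

```python
import re
from datetime import datetime

# Patterns based only on the log excerpts shown in this article.
ISOLATED = re.compile(r"Host isolated is true")
LINK_DOWN = re.compile(
    r"\[vob\.net\.vmnic\.linkstate\.down\] vmnic (\S+) linkstate down"
)
TIMESTAMP = re.compile(r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})Z")


def parse_time(line):
    """Return the first UTC timestamp found in a log line, or None."""
    match = TIMESTAMP.search(line)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f")


def link_down_before_isolation(fdm_lines, vobd_lines, vmnic):
    """Check whether `vmnic` logged a link-down event before isolation.

    Returns True if a link-down message for `vmnic` is timestamped at or
    before the first isolation message, False if none is, and None if
    no isolation message is present in `fdm_lines`.
    """
    isolation_times = [
        parse_time(line) for line in fdm_lines if ISOLATED.search(line)
    ]
    isolation_times = [t for t in isolation_times if t is not None]
    if not isolation_times:
        return None
    first_isolation = min(isolation_times)

    for line in vobd_lines:
        match = LINK_DOWN.search(line)
        when = parse_time(line)
        if match and when and match.group(1) == vmnic and when <= first_isolation:
            return True
    return False
```

A result of False for the active vSAN uplink is consistent with the scenario in this article: the host was declared isolated even though the uplink never reported a link-down state. The lines would typically be read from the FDM and vobd logs of the affected host, and the vmnic name (for example a hypothetical "vmnic1") must match the active uplink of the vSAN vmk.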

 

Resolution

This article describes a scenario that can cause VMs to become inaccessible after a network isolation event, so that the cause can be recognized and remedied if a similar event occurs.
Consider using a network failover detection policy other than Link Status Only (for example, Beacon Probing) in the environment. This is a design decision that must be made for your individual environment.

Additional Information