After a power outage the VMs on vSAN datastore show as Invalid

Products

VMware vSAN

Issue/Introduction

Symptoms:

Virtual machines on vSAN datastores are reported as invalid after the cluster recovers from a power outage.
vCenter Server also resides on the vSAN datastore and is marked as invalid.
vSAN health reports physical disk issues:

esxcli vsan health cluster list

esxcli vsan health cluster list

Overall health findings                 red (Physical disk issue)
Physical disk                           red
Physical disk health retrieval issues   red
Operation health                        yellow
Congestion                              green
Physical disk component utilization     green
Component metadata health               green
Memory pools (heaps)                    green
Memory pools (slabs)                    green
Disk capacity                           green
Data                                    red
vSAN object health                      red
vSAN object format health               green
Performance service                     red
Stats DB object                         red
Stats primary election                  green
Performance data collection             green
All hosts contributing stats            green
Stats DB object conflicts               green
Capacity utilization                    yellow
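
To pick out the failing checks from a long health listing, the output can be filtered for non-green statuses. The following is a minimal sketch, not an official tool: the helper name non_green_checks is hypothetical, and it assumes each line carries the status word (red, yellow, or green) after the check name, as in the sample output above.

```shell
# Hypothetical helper: keep only health checks whose status is red or yellow.
# Reads `esxcli vsan health cluster list` output on stdin.
non_green_checks() {
    # Match the status as a whole word so check names are not matched by accident.
    grep -E '[[:space:]](red|yellow)([[:space:]]|$)'
}

# Usage on an ESXi host (assumes esxcli is available):
#   esxcli vsan health cluster list | non_green_checks
```
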
The affected objects are also in an inaccessible state, which is why the virtual machines are marked as invalid. Use the following command to verify the status of the objects:

esxcli vsan debug object health summary get
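
When checking several hosts or re-checking after a fix, it can help to reduce the summary to a single number. The sketch below is an assumption-laden example: the helper name inaccessible_count is hypothetical, and it assumes the summary prints one row per health status with the status name in the first column and the object count in the last column. Confirm the layout on your build before relying on it.

```shell
# Hypothetical helper: print the number of inaccessible objects.
# Reads `esxcli vsan debug object health summary get` output on stdin.
# Assumes rows of the form "<status name> ... <count>".
inaccessible_count() {
    awk '$1 == "inaccessible" { n += $NF } END { print n + 0 }'
}

# Usage on an ESXi host (assumes esxcli is available):
#   esxcli vsan debug object health summary get | inaccessible_count
```
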
Checking the capacity utilization of the disks shows that a few hosts do not have any disk groups. Note that there are no compute-only nodes in the cluster.

To check the capacity utilization, use the following command:

cmmds-tool find -t HOSTNAME -f json | egrep "uuid|hostname" | sed -e 's/\"content\"://g' | awk '{print $2}' | sed -e 's/[\",\},\,]//g' | xargs -n 2 | while read hostuuid hostname; do echo -e "\n\nHost Name: $hostname::: Host UUID: $hostuuid\n Disk Name\t\t| Disk UUID\t\t| Disk Usage | Disk Capacity | Usage Percentage" ; cmmds-tool find -f python -t DISK -o $hostuuid | grep uuid | cut -c 13-48 | while read diskuuid;do cmmds-tool find -f json -t DISK -o $hostuuid -u $diskuuid| egrep "uuid|content" | sed -e 's/\"content\":|\\"uuid\"://g' | sed -e 's/[\",\},\]//g' | awk '{printf $0}' | sed -e 's/},/\n/g'| awk '{print $37 " " $5 " " $45}'| while read disknaa diskcap maxcomp; do diskcapused=$(cmmds-tool find -f json -t DISK_STATUS -u $diskuuid | grep content |sed -e 's/[\",\},\]//g' | awk '{print $3}'); diskperc=$(echo "$diskcapused $diskcap" | awk '{print $1/$2*100}'); if [ "$maxcomp" != 0 ]; then echo -en " $disknaa\t| $diskuuid\t| $diskcapused\t | $diskcap\t | $diskperc%\n"; fi;done;done;done;
Further, running the following command on every ESXi host in the cluster shows that on a few hosts the disks are not recognized by CMMDS:

esxcli vsan storage list | grep -i cmmds

Sample output:

esxcli vsan storage list | grep -i cmmds

In CMMDS: false

In CMMDS: false

In CMMDS: false

In CMMDS: false

In CMMDS: false

In CMMDS: false

In CMMDS: false

On a healthy ESXi host with no physical disk issues, the output is as follows:
esxcli vsan storage list | grep -i cmmds
In CMMDS: true
In CMMDS: true
In CMMDS: true
In CMMDS: true
In CMMDS: true
In CMMDS: true
In CMMDS: true
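
Because the check has to be repeated on every host, a per-host count is easier to compare than a column of true/false lines. The following is a minimal sketch; the helper name summarize_cmmds is hypothetical, and it relies only on the "In CMMDS: true/false" lines shown above.

```shell
# Hypothetical helper: count disks by CMMDS membership.
# Reads `esxcli vsan storage list` output on stdin.
summarize_cmmds() {
    awk -F': *' '
        /In CMMDS:/ {
            if (tolower($2) ~ /true/) t++
            else f++
        }
        END { printf "in_cmmds=%d not_in_cmmds=%d\n", t, f }
    '
}

# Usage on an ESXi host (assumes esxcli is available):
#   esxcli vsan storage list | summarize_cmmds
```

A non-zero not_in_cmmds value on a host points to the condition described in this article.
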

Environment

VMware vSAN 8.x

Cause

This issue occurs because Cluster Monitoring, Membership, and Directory Services (CMMDS) is unable to validate the state of the disks on the affected hosts. As a result, these disks report a "Stale" or "Unknown" status in the cluster directory. Because the disks are not recognized as active members, vSAN marks the data components residing on them as Absent, causing the associated objects to lose quorum and become inaccessible.

Cause Validation:

The /var/run/log/vmkernel.log file on the ESXi host contains events such as the following, indicating that the disks are detected as stale:

2025-11-20T05:10:47.515Z In(182) vmkernel: cpu5:153229876)PLOG: PLOGMapDataPartition:3026: Mapping SSD cache data partition for 52a9e078-xxxx-xxxx-xxxx-xxxxxxxxxxxx not found SSD device mapped:0x0 fromRescan 0x1
2025-11-20T05:10:47.516Z In(182) vmkernel: cpu7:153229876)PLOG: PLOGProbeDevice:7022: Probed plog device <naa.6000xxxxxxxxxxxxxxxxxxxxx:1> 52a9e078-xxxx-xxxx-xxxx-xxxxxxxxxxxx 0x45xxxxxxxxxx exists.. continue with old entry
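
To look for these events without reading the whole log, the two message patterns can be searched for directly. This is a sketch under stated assumptions: the helper name find_stale_plog_events is hypothetical, and the patterns match on the PLOG function names and message text from the sample lines above rather than on the source line numbers (3026, 7022), which may differ between builds.

```shell
# Hypothetical helper: print PLOG events that indicate stale disk entries.
# Optional argument: path to a vmkernel log (defaults to the live log).
find_stale_plog_events() {
    grep -E 'PLOGMapDataPartition:.*not found|PLOGProbeDevice:.*continue with old entry' \
        "${1:-/var/run/log/vmkernel.log}"
}

# Usage on an ESXi host:
#   find_stale_plog_events
#   find_stale_plog_events /path/to/extracted/vmkernel.log
```
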

Resolution

Place the affected host into maintenance mode with the "No action" option and reboot the host.

If the issue persists after the reboot, engage the hardware vendor to validate the health of the drives.
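
When vCenter Server is itself unavailable, the maintenance mode step may need to be done from the ESXi shell. The sketch below shows one possible way to do this and is not a verified procedure for every build. The function names run and enter_maintenance_and_reboot are hypothetical, and the esxcli options used (--enable, --vsanmode noAction) should be confirmed with "esxcli system maintenanceMode set --help" on the host. The script only prints the commands unless APPLY=1 is set.

```shell
# Dry-run by default: prints the commands instead of running them.
# Set APPLY=1 to actually execute them on the ESXi host.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

# Hypothetical wrapper for the resolution step above.
enter_maintenance_and_reboot() {
    # "No action": enter maintenance mode without evacuating vSAN data.
    run esxcli system maintenanceMode set --enable true --vsanmode noAction || return 1
    run reboot
}

# Usage:
#   enter_maintenance_and_reboot            # preview only
#   APPLY=1; enter_maintenance_and_reboot   # execute on the host
```
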