Improper vSAN Shutdown or Accidentally Turning Off vSAN Leads to Cluster Configuration Loss and Service Disruption

Article ID: 394578

Products

VMware vSAN

Issue/Introduction

Symptoms:

  • Causes of vSAN configuration loss and vSAN service disruption include (but are not limited to):
    1. Improper vSAN shutdown
    2. Accidentally turning off vSAN on a cluster
    3. Sudden power outages
  • vSAN services on the hosts show as not enabled.

[root@esxi:~] esxcli vsan cluster get
vSAN Clustering is not enabled on this host

Environment

  • VMware vSAN 7.x
  • VMware vSAN 8.x

Cause

  • When vSAN is turned off abruptly, the vSAN configuration is removed from the hosts.

  • The disk groups on the hosts will be unmounted.

[root@esxi:~] esxcli vsan storage list | grep -i cmmds
   In CMMDS: false
   In CMMDS: false

  • vSAN Health shows no object placement when the vSAN cluster is manually enabled.

[root@esxi:~] esxcli vsan debug object health summary get
Health Status                                              Number Of Objects
---------------------------------------------------------  -----------------
remoteAccessible                                                           0
inaccessible                                                               0
reduced-availability-with-no-rebuild                                       0
reduced-availability-with-no-rebuild-delay-timer                           0
reducedavailabilitywithpolicypending                                       0
reducedavailabilitywithpolicypendingfailed                                 0
reduced-availability-with-active-rebuild                                   0
reducedavailabilitywithpausedrebuild                                       0
data-move                                                                  0
nonavailability-related-reconfig                                           0
nonavailabilityrelatedincompliancewithpolicypending                        0
nonavailabilityrelatedincompliancewithpolicypendingfailed                  0
nonavailability-related-incompliance                                       0
nonavailabilityrelatedincompliancewithpausedrebuild                        0
healthy                                                                    0

  • Advanced settings such as /VSAN/DOMPauseAllCCPs and /VSAN/IgnoreClusterMemberListUpdates are set to 1, which pauses normal vSAN object operations and causes the hosts to ignore cluster membership updates from vCenter.

[root@esxi:~] esxcfg-advcfg -g /VSAN/DOMPauseAllCCPs
Value of DOMPauseAllCCPs is 1

[root@esxi:~] esxcfg-advcfg -g /VSAN/IgnoreClusterMemberListUpdates
Value of IgnoreClusterMemberListUpdates is 1

Resolution

Note: Do not make any changes to the disks or disk groups, and do not delete or recreate them.

Steps to Resolve the issue:

  1. Identify the Sub-Cluster UUID and join the ESXi host to the vSAN cluster:

    • Use the command grep nodeCount /var/run/log/vsansystem.log to locate the original Sub-Cluster UUID.

    • Use the command below to join the ESXi host to the existing vSAN cluster (a worked example follows this step).
      # esxcli vsan cluster join -u <Sub-Cluster UUID>
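
    Worked example for this step (the UUID below is a placeholder; substitute the Sub-Cluster UUID found in vsansystem.log, which appears on the same log lines as the nodeCount field):

      # grep nodeCount /var/run/log/vsansystem.log
      # esxcli vsan cluster join -u 52xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
      # esxcli vsan cluster get

    After the join, esxcli vsan cluster get should report the cluster information instead of "vSAN Clustering is not enabled on this host".
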
  2. Rebuild Unicast Agent List:
    A. Collect Required Info per Host:

    • # esxcli vsan cluster unicastagent list

    • # esxcli vsan network list → Note VMkernel interface (e.g., vmkX)

    • # esxcli network ip interface ipv4 get | grep vmkX → Get IP of vSAN vmk

    • # cmmds-tool whoami → Get host UUID

    • # openssl x509 -in /etc/vmware/ssl/rui.crt -fingerprint -sha256 -noout → Get certificate thumbprint

    B. Add Entries (a worked example follows this step):

    • For Data Node:
      # esxcli vsan cluster unicastagent add -t node -u <Host_UUID> -U true -a <VSAN_IP> -p 12321 -T <Thumbprint>

    • For Witness Node:
      # esxcli vsan cluster unicastagent add -t witness -u <Witness_UUID> -U true -a <Witness_IP> -p 12321

    C. Verify:

    • # esxcli vsan cluster unicastagent list → Ensure all nodes are present.
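
    Worked example for this step (a sketch only: the VMkernel interface vmk1, the host UUID, the IP address, and the thumbprint below are placeholders; use the values collected in step 2A). Note that a host's unicast agent list should contain entries for the other cluster members, not for the host itself:

      # esxcli vsan network list
      # esxcli network ip interface ipv4 get | grep vmk1
      # cmmds-tool whoami
      # openssl x509 -in /etc/vmware/ssl/rui.crt -fingerprint -sha256 -noout
      # esxcli vsan cluster unicastagent add -t node -u 5f8d1234-abcd-ef01-2345-6789abcdef01 -U true -a 192.168.10.11 -p 12321 -T <SHA256_Thumbprint>
      # esxcli vsan cluster unicastagent list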

  3. Remount Disk Groups:
    A. Check status:

    • # esxcli vsan storage list → Ensure disks show "In CMMDS: true"

    B. If the disks show "In CMMDS: false", mount the disk group:

    • Run vdq -iH on the ESXi host to get the Cache_Disk_UUID.
    • Run the command below to mount the disk group (a short verification sequence follows this step).
      # esxcli vsan storage diskgroup mount -u <Cache_Disk_UUID>
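
    Example for this step (the UUID below is a placeholder; use the vSAN UUID of the cache-tier disk reported by vdq -iH):

      # vdq -iH
      # esxcli vsan storage diskgroup mount -u 52abcdef-1234-5678-9abc-def012345678
      # esxcli vsan storage list | grep -i cmmds

    After the mount, the disks in the disk group should report "In CMMDS: true".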

  4. Create a new vSAN Cluster:

    • Create a new cluster in vCenter with the same configuration (Dedup, Compression, Encryption).

    • Validate Dedup and Compression settings:

      • # esxcli vsan storage list | grep -E "Deduplication:|Compression:"
    • Validate encryption settings:

      • # esxcli vsan encryption kms list

      • # esxcli vsan encryption info get

    • Match storage policies and ensure the rulesets are the same (a quick host-level check is shown below).
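
    As an additional sanity check (this shows only the host-level default vSAN policy, not the full set of vCenter storage policies), the default policy can be dumped on a host from each configuration and compared:

      • # esxcli vsan policy getdefault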

  5. Preserve Unicast During Host Migration:

    • Run on all hosts:
      # esxcfg-advcfg -s 1 /VSAN/IgnoreClusterMemberListUpdates

    • Disconnect and move the hosts one by one to the new cluster (a sketch for applying the setting across hosts follows this step).
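
    A minimal sketch for applying the setting on every host over SSH (assuming SSH access is enabled; the host names esxi01 to esxi03 are placeholders):

      for h in esxi01 esxi02 esxi03; do
        ssh root@"$h" 'esxcfg-advcfg -s 1 /VSAN/IgnoreClusterMemberListUpdates'
      done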

  6. Reset Advanced Settings:

    • Check values:

      • # esxcfg-advcfg -g /VSAN/DOMPauseAllCCPs

      • # esxcfg-advcfg -g /VSAN/IgnoreClusterMemberListUpdates

    • Revert both to 0:

      • # esxcfg-advcfg -s 0 /VSAN/DOMPauseAllCCPs

      • # esxcfg-advcfg -s 0 /VSAN/IgnoreClusterMemberListUpdates

  7. Configure the stretched cluster under Fault Domains (if the original cluster was a stretched cluster).
  8. Validate and Recover:

    • After settings are reset, vSAN object accessibility should be restored.

    • Resync operations should begin, and VMs should power on successfully (the commands below can help confirm this).
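
    To confirm recovery from a host (the resync summary command is available in recent vSAN releases; exact output varies by version):

      # esxcli vsan debug object health summary get
      # esxcli vsan debug resync summary get

    Objects should now be counted under "healthy", and the resync figures should decrease to zero as recovery completes.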