vSAN File Service shares become inaccessible as vSAN File Service VMs enter a 10-minute reboot loop

Article ID: 428454

Products

VMware vSAN

Issue/Introduction

vSAN File Service shares may become intermittently or permanently inaccessible. This behavior is often observed in environments with high file activity and high-concurrency workloads, such as large-scale document management systems or high-transaction SMB file shares.

Symptoms:

  • vSAN File shares cycle between Available and Unavailable states.

  • vSAN File Service nodes (FSVMs) enter a 10-minute reboot loop or fail to remain powered on.

  • The task console shows repeated Power On virtual machine tasks for FSVMs, sometimes failing with:
    The attempted operation cannot be performed in the current state (Powered On).

  • In the vSphere Client, the following alerts or errors are observed:
    The system allows for a maximum of 100 file shares.
    Failed to query vSAN file service shares. VDFS datastore is not present.



  • vSAN Skyline Health shows the following alerts:
    Infrastructure Health | Category: File Service | Impact area: Availability
    File Server Health | Category: File Service | Impact area: Availability
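
The reboot-loop symptom can be checked against a list of FSVM power-on timestamps. The following is a minimal sketch only: the function name, the 2-minute tolerance, and the 3-cycle minimum are illustrative assumptions, and the timestamps are assumed to have been copied by hand from the task console.

```python
from datetime import datetime, timedelta


def looks_like_reboot_loop(power_on_times,
                           period=timedelta(minutes=10),
                           tolerance=timedelta(minutes=2),
                           min_cycles=3):
    """Return True if at least `min_cycles` consecutive gaps between
    power-on events fall within `tolerance` of `period`.

    `tolerance` and `min_cycles` are illustrative defaults, not values
    taken from the article.
    """
    times = sorted(power_on_times)
    streak = 0
    for earlier, later in zip(times, times[1:]):
        if abs((later - earlier) - period) <= tolerance:
            streak += 1
            if streak >= min_cycles:
                return True
        else:
            # A gap that does not match the period breaks the streak.
            streak = 0
    return False


# Usage: four power-on tasks spaced 10 minutes apart (hypothetical data).
start = datetime(2024, 1, 1, 12, 0)
events = [start + timedelta(minutes=10 * i) for i in range(4)]
print(looks_like_reboot_loop(events))
```

A match only suggests the pattern described here; the core dumps described under Cause remain the confirming evidence.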



Environment

VMware vSAN
vSAN File Services
File Service VMs
Protocol: SMB

Cause

This issue is caused by a memory allocation failure (panic) in the vdfsd-proxy service on the ESXi hosts. On affected hosts, vdfsd-proxy-zdumps files are present in the /var/core directory.
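
The check for core dumps can be sketched as a small directory scan. This is an illustration under stated assumptions: the helper name is hypothetical, and the file-name prefix `vdfsd-proxy-zdump` is inferred from the wording above rather than from a documented naming scheme, so adjust it to match the files actually seen on the host.

```python
from pathlib import Path


def find_proxy_dumps(core_dir="/var/core", prefix="vdfsd-proxy-zdump"):
    """Return the sorted names of files in `core_dir` that start with `prefix`.

    The prefix is an assumption based on the article text; an empty list
    is returned if the directory does not exist.
    """
    directory = Path(core_dir)
    if not directory.is_dir():
        return []
    return sorted(
        entry.name
        for entry in directory.iterdir()
        if entry.is_file() and entry.name.startswith(prefix)
    )


if __name__ == "__main__":
    dumps = find_proxy_dumps()
    if dumps:
        print("vdfsd-proxy core dumps found:")
        for name in dumps:
            print("  " + name)
    else:
        print("No vdfsd-proxy core dumps found in /var/core")
```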

The issue stems from a known limitation in the 9p driver used for filesystem caching. Under specific high-load conditions, the driver fails to automatically free File Identifiers (FIDs), so the amount of cached FID information in the vdfs proxy grows continuously.
An existing management mechanism triggers a proxy cache cleanup at an 80 MB threshold. However, rapid spikes in memory demand can exceed the proxy memory limit before the cleanup completes. This results in a service crash and subsequent FSVM reboots.
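
The race between cache growth and cleanup can be illustrated with a toy model. Everything except the 80 MB threshold is a hypothetical assumption: the article does not state the proxy memory limit, the growth rate, or the cleanup duration, and real growth is unlikely to be linear. The sketch only shows why a fast enough spike defeats a threshold-triggered cleanup.

```python
def peak_cache_mb(growth_mb_per_s, cleanup_duration_s, threshold_mb=80.0):
    """Cache size reached by the time a cleanup that started at
    `threshold_mb` finishes, assuming constant growth in the meantime."""
    return threshold_mb + growth_mb_per_s * cleanup_duration_s


def cleanup_finishes_in_time(growth_mb_per_s, cleanup_duration_s,
                             limit_mb, threshold_mb=80.0):
    """True if the cache stays below `limit_mb` until cleanup completes.

    `limit_mb` is a hypothetical proxy memory limit; the real value is
    not given in the article.
    """
    peak = peak_cache_mb(growth_mb_per_s, cleanup_duration_s, threshold_mb)
    return peak < limit_mb


# Hypothetical numbers: 100 MB limit, cleanup takes 5 seconds.
print(cleanup_finishes_in_time(1.0, 5.0, limit_mb=100.0))   # slow growth
print(cleanup_finishes_in_time(10.0, 5.0, limit_mb=100.0))  # rapid spike
```

In this model the headroom between the threshold and the limit, divided by the cleanup duration, gives the largest growth rate the cleanup can absorb.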

Resolution

There is no permanent fix available at this time. This issue is under investigation by Broadcom Engineering.

Workaround:

Restore access to the file shares by performing a rolling reboot of all the ESXi hosts in the affected vSAN cluster, placing each host in maintenance mode with the Ensure Accessibility option before rebooting it.
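
The ordering of a rolling reboot can be sketched as below. This is not a vSphere API client: the three callables are hypothetical placeholders for whatever tooling performs each step (the vSphere Client, a script, or manual action), and the mode string is simply the label used in this article. The point is that one host is fully processed before the next one is touched.

```python
from typing import Callable, Iterable, List


def rolling_reboot(hosts: Iterable[str],
                   enter_maintenance: Callable[[str, str], None],
                   reboot_and_wait: Callable[[str], None],
                   exit_maintenance: Callable[[str], None],
                   mode: str = "Ensure Accessibility") -> List[str]:
    """Reboot hosts strictly one at a time and return those completed.

    All callables are placeholders supplied by the caller:
    `reboot_and_wait` is expected to block until the host is back online.
    """
    completed = []
    for host in hosts:
        enter_maintenance(host, mode)  # evacuate using the given vSAN mode
        reboot_and_wait(host)          # must not return until host is up
        exit_maintenance(host)         # rejoin the cluster before moving on
        completed.append(host)
    return completed


# Usage with stand-in functions that only log each step.
log = []
rolling_reboot(
    ["esx-01", "esx-02"],  # hypothetical host names
    lambda h, m: log.append(f"{h}: enter maintenance ({m})"),
    lambda h: log.append(f"{h}: reboot and wait"),
    lambda h: log.append(f"{h}: exit maintenance"),
)
print("\n".join(log))
```

If any step raises an exception, the loop stops and the remaining hosts are left untouched, which keeps the number of hosts out of service at one.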

After all hosts have rebooted, collect a log bundle from the ESXi hosts in the affected vSAN cluster and from vCenter, then open a case with Broadcom Support.

Note for encrypted vSAN environments: If encryption is enabled in vSAN, the log bundle must be collected with a password so that the core dumps can be decrypted for analysis.
Per Article 319493, Step 3: Select the Password for encrypted core dumps option and specify a password. Share this password with the Broadcom Support Engineer assigned to the case.