vSAN File Service Shares are offline and "File Server Health" reports "File server not found" or "File server is (re)starting"

Article ID: 318126

Products

VMware vSAN

Issue/Introduction

This article describes a known issue with the fsvm-sockrelay service on ESXi hosts and its possible side effects.

Symptoms:
  • The ESXi host is running 7.0 U3k or earlier, or an ESXi 8.0 release earlier than 8.0b.
  • The vSAN File Service Shares are offline and not reachable, or they are falsely reported as offline.
  • On the ESXi hosts, the following pattern appears repeatedly in /var/run/log/vmkernel.log:
2023-04-10T00:56:00.665Z cpu25:2103738)Admission failure in path: host/vim/vmvisor/fsvm-sockrelay:sockrelay.2103738:uw.2103738
2023-04-10T00:56:00.665Z cpu25:2103738)UserWorld 'sockrelay' with cmdline '/usr/lib/vmware/vdfs/bin/sockrelay -m unix:/var/run/fsvm_docker.sock,vsock:2375:-1711806116;unix:/var/run/fsvm_cmd.sock,vsock:2376:-1711806116'
  • The service-specific log /var/run/log/sockrelay.log contains the following message:
Failed to create thread: No space left on device.
 
  • Additionally, "Install agent" tasks from EAM may appear in vCenter, and the "vSAN File Service Node" VMs may restart repeatedly.
  • Skyline Health reports the following alerts:
    • The alarm "File Server Health" reports "File server not found" or "File server is (re)starting".
    • The alarm "Infrastructure Health" reports either "File service infrastructure is in a good state" or "File service is comparatively overloaded on this host."
  • If the issue affects a large number of the hosts in the cluster, share accessibility can be impacted. In such rare cases, the Skyline Health alarm "File Server Health" may report the errors "NFS daemon is not running." or "File server IP address is not present".
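
The two log symptoms above can be checked from the ESXi shell. The following is a minimal sketch, not an official diagnostic: the function names are hypothetical, and the search strings are taken from the log excerpts in this article. Each function prints the number of matching lines in the given log file, defaulting to the paths named above.

```shell
# Hypothetical helpers for illustration; the names are not part of ESXi.

# Count "Admission failure" entries for fsvm-sockrelay in a vmkernel log.
# $1: log file (optional, defaults to the path named in this article)
count_admission_failures() {
    grep -c 'Admission failure in path: host/vim/vmvisor/fsvm-sockrelay' \
        "${1:-/var/run/log/vmkernel.log}"
}

# Count thread-creation failures in the sockrelay service log.
# $1: log file (optional, defaults to the path named in this article)
count_thread_failures() {
    grep -c 'Failed to create thread' "${1:-/var/run/log/sockrelay.log}"
}

# Example on an ESXi host:
#   count_admission_failures
#   count_thread_failures
```

A non-zero count from either function is consistent with the symptoms described here. Rotated or compressed logs are not searched by this sketch.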


Environment

VMware vSAN 7.0.x
VMware vSAN 8.0.x

Cause

In some cases, the fsvm-sockrelay service may try to consume more memory than its assigned hard limit. When the service requests more memory than allowed, the vmkernel rejects the allocation and logs "Admission failure". This can cause the service to misbehave.

When the service misbehaves and communication between ESXi and the vSAN File Service Node VM is disrupted, the file share failover mechanism can, in rare cases, fail to find any remaining healthy host, which leaves the file shares inaccessible.

Resolution

Improvements have been made in newer releases. Please update to ESXi 7.0 U3l (build 21424296) or ESXi 8.0b (build 21203435), or to a newer release.
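
To compare a host against these builds, a small helper can be used. This is a sketch under stated assumptions: the function name is hypothetical, and it assumes that build numbers only increase within the 7.0 and 8.0 release lines, so any build at or above the listed one contains the improvements. It takes the version string and the numeric build as arguments.

```shell
# Hypothetical helper for illustration; not part of ESXi.
# $1: ESXi version string, e.g. "7.0.3" or "8.0.0"
# $2: numeric build number, e.g. 21424296
# Returns 0 if the build is at or above the fixed build for its release line,
# 1 if it is below, and 2 if the release line is not covered by this article.
# Assumption: build numbers increase monotonically within a release line.
has_sockrelay_fix() {
    case "$1" in
        7.0*) [ "$2" -ge 21424296 ] ;;   # ESXi 7.0 U3l
        8.0*) [ "$2" -ge 21203435 ] ;;   # ESXi 8.0b
        *)    return 2 ;;
    esac
}

# Example:
#   has_sockrelay_fix 7.0.3 21424296 && echo "fix included"
```

The version and build of a host can be read with `vmware -vl` or `esxcli system version get` and passed to the function by hand.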

Workaround:
If you cannot update at this time, you can restart the fsvm-sockrelay service for temporary relief. However, the issue can recur at any time, so updating to at least ESXi 7.0 U3l or ESXi 8.0b is recommended.

The service can be restarted on all affected hosts via:
/etc/init.d/fsvmsockrelay restart

After the restart, no new "Admission failure" events for "fsvm-sockrelay" should appear in vmkernel.log, and Skyline Health should report the vSAN File Service Node as healthy again shortly afterwards.
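
One way to confirm that no new events are logged is to note the number of matching lines before the restart and compare it again afterwards. The sketch below assumes a hypothetical function name and that the log file is not rotated between the two measurements; a negative result suggests that rotation occurred.

```shell
# Hypothetical helper for illustration; not part of ESXi.
# $1: number of "Admission failure" lines counted before the restart
# $2: log file (optional, defaults to /var/run/log/vmkernel.log)
# Prints how many matching lines were added since the baseline was taken.
new_admission_failures() {
    sr_current=$(grep -c 'Admission failure in path: host/vim/vmvisor/fsvm-sockrelay' \
        "${2:-/var/run/log/vmkernel.log}" || true)
    # An empty count means the log could not be read.
    [ -n "$sr_current" ] || return 1
    echo $((sr_current - $1))
}

# Example on an ESXi host:
#   before=$(new_admission_failures 0)    # baseline count before the restart
#   /etc/init.d/fsvmsockrelay restart
#   sleep 300                             # arbitrary wait, adjust as needed
#   new_admission_failures "$before"      # 0 means no new events
```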