Restarting management agents in an All-Paths-Down condition in ESXi fails with the error: Not all VMFS volumes were updated; the error encountered was 'No connection'

Article ID: 342615

Products

VMware vCenter Server
VMware vSphere ESXi

Issue/Introduction

Symptoms:

  • During an All-Paths-Down (APD) condition, restarting the ESXi management agents as described in "Restarting the Management agents on an ESXi Server" fails
  • Restarting the management services manually from the ESXi Shell using the services.sh restart command fails
  • You see the error:

    Errors: Not all VMFS volumes were updated; the error encountered was 'No connection'.
    Errors: Rescan complete, however some dead paths were not removed because they were in use by the system. Please use the 'storage core device world list' command to see the VMkernel worlds still using these paths.
    Error while scanning interfaces, unable to continue. Error was Not all VMFS volumes were updated; the error encountered was 'No connection'.


  • In the vSphere Client, the ESXi host appears as disconnected or not-responding
  • You cannot connect directly to the ESXi host, and the hostd process(es) fail to start
  • Running esxcli commands fails with the error:

    Connect to localhost failed: Connection failure



Environment

VMware vSphere ESXi 7.X
VMware vSphere ESXi 8.X

Resolution

To resolve this issue, first determine which worlds (virtual machines, user worlds, or system processes) are accessing the VMFS volume(s) in an APD state. Use localcli rather than esxcli, because esxcli depends on hostd, which is not running in this condition:
 
# localcli storage core device world list

You see an output similar to:

Device                                World ID  Open Count  World Name
----------------------------------------------------------------------
naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  2060      1           idle0
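
When many devices are listed, the output can be filtered for the device in the APD state. The following is a minimal sketch, not part of the original procedure: worlds_for_device is a hypothetical helper name, and it assumes the four whitespace-separated columns shown above (Device, World ID, Open Count, World Name). It reads the listing from standard input, so it can be tried against saved output before being used on a host.

```shell
# Hypothetical helper: print "<World ID> <World Name>" for every world
# that holds the given device open. Reads the world list from stdin.
# Assumes the column layout shown in the sample output above.
worlds_for_device() {
    awk -v dev="$1" '
        $1 == dev {
            # Fields: 1=Device, 2=World ID, 3=Open Count, 4..=World Name
            name = $4
            for (i = 5; i <= NF; i++) name = name " " $i
            print $2, name
        }
    '
}

# Possible use on the host (device name is a placeholder):
#   localcli storage core device world list | worlds_for_device naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
```

The header and separator lines are skipped automatically because their first field never matches a device name.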

You can then forcefully stop all worlds that access the device in the APD state and remove the dead paths:
  1. Run this command to list the virtual machine worlds currently running on the host:

    # localcli vm process list

    You see an output similar to:

    <VM-NAME>:
    World ID: ######
    Process ID: 0
    VMX Cartel ID: ######
    UUID: ## ## ## ## ## ## ## ##-## ## ## ## ## ## ## ##
    Display Name: <VM-NAME>
    Config File: /vmfs/volumes/########-########-####-############/<VM-NAME>/<VM-NAME>.vmx

  2. Run this command to kill each virtual machine process, using its World ID from the previous output:

    # localcli vm process kill --type=force --world-id=<World ID>

    For example:

    # localcli vm process kill --type=force --world-id=12346

  3. Rescan to remove dead paths using this command:

    # localcli storage core adapter rescan -A vmhbaX -t delete

    The ESXi host itself may have open handles to the affected VMFS volume that is in the APD state. World ID 2060 (idle0) is a core system process that is used when a VMFS volume is opened or mounted on an ESXi host. If you attempt to remove dead paths while the ESXi host has open handles, you see this error:

    Errors:
    Rescan complete, however some dead paths were not removed because they were in use by the system. Please use the 'storage core device world list' command to see the VMkernel worlds still using these paths.
    Error while scanning interfaces, unable to continue. Error was Not all VMFS volumes were updated; the error encountered was 'IO was aborted by VMFS via a virt-reset on the device'.


    Note: This process cannot be forcefully shut down or terminated because it is a critical system process. The only option available is to reboot all affected ESXi host(s).
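
Steps 1 and 2 above can be combined when several virtual machines sit on the affected volume. The sketch below is an illustration under stated assumptions, not a VMware-provided script: vm_world_ids_on_volume and print_kill_commands are hypothetical helper names, and the parsing assumes the "World ID:" and "Config File:" line formats shown in step 1, with the volume identified by the directory name that follows /vmfs/volumes/. It only prints the kill commands so that they can be reviewed before anything is run.

```shell
# Hypothetical helper: read "vm process list" output on stdin and print the
# World ID of each VM whose Config File lives under /vmfs/volumes/<volume>/.
# Assumes "World ID:" appears before "Config File:" within each VM block.
vm_world_ids_on_volume() {
    awk -v prefix="/vmfs/volumes/$1/" '
        $1 == "World"  && $2 == "ID:"   { wid = $3 }
        $1 == "Config" && $2 == "File:" && index($3, prefix) == 1 { print wid }
    '
}

# Hypothetical helper: print (do not run) one force-kill command per match.
print_kill_commands() {
    vm_world_ids_on_volume "$1" | while read -r wid; do
        echo "localcli vm process kill --type=force --world-id=$wid"
    done
}

# Possible use on the host (volume name is a placeholder):
#   localcli vm process list | print_kill_commands <volume-directory-name>
```

A virtual machine whose configuration file is on another datastore may still have disks on the affected volume, so the printed list is a starting point and should be checked against the world list for the device.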


Additional Information

For more information on APD in vSphere ESXi, see Permanent Device Loss (PDL) and All-Paths-Down (APD) in vSphere ESXi.

Impact/Risks:
Because the management agents are affected by the APD condition, the unaffected virtual machines cannot be migrated with vMotion. As a result, rebooting the affected ESXi host(s) forces an outage for all unaffected virtual machines on that host.