When an ESXi host fails with a purple diagnostic screen, High Availability (HA) does not fail over the virtual machines to another host

Article ID: 328275

Products

VMware

Issue/Introduction

Symptoms:
  • When an ESXi host fails with a purple diagnostic screen, High Availability (HA) does not fail over the virtual machines to another host.
  • When you restart a host, virtual machines on another host also restart.
  • In the /var/log/fdm.log file, you see entries similar to:
fdm.log:<YYYY-MM-DD>T<time>[FFF00B70 verbose 'Invt' opID=6F8EF6B2-00000FC7-a6-78] [InventoryManagerImpl::Handle(ClusterConfigNotification)] Processing cluster config
fdm.log:<YYYY-MM-DD>T<time>[FFF00B70 verbose 'Invt' opID=6F8EF6B2-00000FC7-a6-78] [InventoryManagerImpl::UpdateAgentVms] Number of required agents vms changed to 1.
fdm.log:<YYYY-MM-DD>T<time>[FFE3DB70 info 'Invt' opID=SWI-1c512ce0] [InventoryManagerImpl::ProcessHostChanges] Slave state of host-3485 changed to Dead

Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.
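
To check a host for these entries quickly, the log can be filtered for the two distinctive messages shown above. The following is a minimal sketch, not an official VMware tool: the function name find_agent_vm_symptoms and the FDM_LOG override variable are assumptions made for illustration, and the exact message wording may differ between builds.

```shell
# Sketch: scan an FDM log for the two messages quoted in the Symptoms section.
# FDM_LOG is a hypothetical override variable; it defaults to the path named above.
FDM_LOG="${FDM_LOG:-/var/log/fdm.log}"

# Print matching lines, prefixed with their line numbers.
find_agent_vm_symptoms() {
  grep -n -E 'Number of required agents vms changed to|Slave state of .* changed to Dead' "$1"
}

# Only scan when the log is readable on this host.
if [ -r "$FDM_LOG" ]; then
  find_agent_vm_symptoms "$FDM_LOG"
fi
```

Matches alone do not confirm the issue. Treat them as a prompt to check whether agent virtual machines are deployed in the cluster.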


Cause

This issue occurs when agent virtual machines are running in the environment. In such a setup, failover fails for all protected virtual machines with an InsufficientAgentVmsDeployed fault.

Note: Agent virtual machines are vShield or any other appliances on which other virtual machines depend for a service, for example in a vCloud environment.

Resolution

This is a known issue affecting vCenter Server 5.5.

This issue is resolved in vCenter Server 5.5 Update 3, available at Customer Connect. For more information, see VMware vCenter Server 5.5 Update 3 Release Notes.

If you cannot upgrade, work around the issue by setting the esx_agent_vm_state column to 0 for the agent virtual machines in the vpx_vm table of the vCenter Server database.

Note: Take a backup of the vCenter Server before performing this operation.

For vCenter Appliance:
  1. Log in to the vCenter Server appliance as root.
  2. Run this command to stop the vCenter service:

    service vmware-vpxd stop

  3. Run this command to access the database:

    /opt/vmware/vpostgres/9.0/bin/psql -d VCDB vc

  4. Run this command at the VCDB prompt to list the virtual machine records:

    select * from vpx_vm;

  5. Run this command to update the esx_agent_vm_state column, replacing agent machine name with the name of the agent virtual machine:

    update vpx_vm set esx_agent_vm_state = 0 where file_name like '%agent machine name%';

  6. Run this command to verify the change:

    select file_name, esx_agent_vm_state from vpx_vm;

    Note: The esx_agent_vm_state value for the agent virtual machines should now be 0.

  7. Quit the database session by running the \q command.
  8. Run this command to start the vCenter service:

    service vmware-vpxd start

For vCenter Server on a Windows machine:
  1. Access the vCenter database using SQL Server Management Studio, Oracle SQL*Plus, or a GUI database manager, and run this command:

    select * from vpx_vm;

  2. Run this command to update the esx_agent_vm_state column, replacing agent machine name with the name of the agent virtual machine:

    update vpx_vm set esx_agent_vm_state = 0 where file_name like '%agent machine name%';

  3. Run this command to verify the change:

    select file_name, esx_agent_vm_state from vpx_vm;

    Note: The esx_agent_vm_state value for the agent virtual machines should now be 0.

  4. Start the VMware VirtualCenter Server service from the Windows Services manager.
  5. The virtual machines should now fail over when a purple diagnostic screen (PSOD) occurs.
Note: This workaround is valid until the next restart of the vCenter service.
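
Because the update statement matches rows with a like pattern, it can help to preview exactly which rows the pattern selects before changing them. The following is a minimal sketch under stated assumptions: agent_vm_sql is a hypothetical helper, not part of any VMware tooling, and it only prints SQL text for you to review and paste at the database prompt. It builds a filtered preview query and the matching update from the same name so the two cannot drift apart.

```shell
# Hypothetical helper: print a preview query and the matching update for one agent VM.
# It only builds SQL text; nothing is sent to the database.
agent_vm_sql() {
  # Double any single quotes so the name is a valid SQL string literal.
  name=$(printf '%s' "$1" | sed "s/'/''/g")
  printf "select file_name, esx_agent_vm_state from vpx_vm where file_name like '%%%s%%';\n" "$name"
  printf "update vpx_vm set esx_agent_vm_state = 0 where file_name like '%%%s%%';\n" "$name"
}
```

For example, agent_vm_sql "vShield-01" (an illustrative name) prints a select and an update that both filter on '%vShield-01%'. Run the select first and confirm that it returns only the agent virtual machines before running the update.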