After a vCenter upgrade, vSphere HA fails with the error "Device or resource busy"
Article ID: 322298
Products
VMware vCenter Server
Issue/Introduction
Symptoms: After a cluster upgrade, when vCenter restarts, most of the VMs might be migrated within a short time to a small group of ESXi hosts in the cluster, which degrades performance. In the FDM install error logs on the ESXi hosts, you see a warning such as:
rm: can't remove '/tardisks/vmware_f.v00': Device or resource busy
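To check collected log output for this warning, a small filter along the following lines may help. This is only a sketch: the function name `find_busy_tardisks` is hypothetical, the log lines are assumed to have been read into memory already (the article does not give a log file path), and the pattern is built solely from the sample warning quoted above.

```python
import re
from typing import Iterable, List

# Pattern derived from the sample warning in this article; the path that
# could not be removed is captured as "path".
BUSY_PATTERN = re.compile(
    r"rm: can't remove '(?P<path>[^']+)': Device or resource busy"
)


def find_busy_tardisks(lines: Iterable[str]) -> List[str]:
    """Return the paths named in 'Device or resource busy' warnings."""
    found = []
    for line in lines:
        match = BUSY_PATTERN.search(line)
        if match:
            found.append(match.group("path"))
    return found


# Usage with the warning quoted in this article.
sample = ["rm: can't remove '/tardisks/vmware_f.v00': Device or resource busy"]
print(find_busy_tardisks(sample))
```

A non-empty result on a host indicates that the host logged the warning described in the Symptoms section.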
Cause
After a cluster upgrade, while the ESXi hosts reconnect to the vpxd service, the Fault Domain Manager (FDM) agent needs some time to become fully operational. If, during this time, the vSphere DRS component that checks VM-to-host compatibility finds that the FDM agent is not working on a given ESXi host, DRS forces the migration of the VMs to another host for high availability purposes.
Resolution
This issue is resolved in vCenter Server 7.0 U3l, which adds a user-configurable option that makes DRS wait for a grace period of n seconds before initiating migrations. Configure it with the following advanced option:
CompatCheckTransientFailureTimeSeconds
This advanced option accepts a value, in seconds, for how long DRS waits before migrating VMs after a transient compatibility check failure. The recommended value is 600 (10 minutes).
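The effect of the grace period can be illustrated with a short model. This is a conceptual sketch only, not VMware's implementation: the function `should_migrate` and its parameters are hypothetical, and the only values taken from this article are the option's meaning and the recommended setting of 600 seconds.

```python
from typing import Optional

# Value recommended in this article (10 minutes).
RECOMMENDED_TIMEOUT_SECONDS = 600


def should_migrate(
    first_failure_at: Optional[float],
    now: float,
    timeout_seconds: int = RECOMMENDED_TIMEOUT_SECONDS,
) -> bool:
    """Model of the grace period: migrate only if the compatibility check
    has been failing for at least timeout_seconds.

    first_failure_at is the time (in seconds) of the first failed check,
    or None if the check is currently passing.
    """
    if first_failure_at is None:
        # No failure in progress, so there is nothing to react to.
        return False
    return (now - first_failure_at) >= timeout_seconds


# An FDM agent that recovers within the grace period triggers no migration.
print(should_migrate(first_failure_at=0.0, now=120.0))
# A failure that outlasts the grace period does.
print(should_migrate(first_failure_at=0.0, now=600.0))
```

In this model, a host whose FDM agent becomes operational before the timeout elapses never triggers a migration, which is the behavior the option is meant to provide during an upgrade.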
Workaround: For versions prior to 7.0 U3l, place DRS in Manual mode during the upgrade process, then set it back to Fully Automated once the upgrade activities have completed successfully.