During an SRM failover, VMs may fail to power on with their NICs disconnected when VM storage is provided by Nutanix AOS

Article ID: 313975

Products

VMware Live Recovery

Issue/Introduction

Symptoms:
  • Recovery of VMs intermittently fails after 20 minutes, during reconfiguration and the subsequent power-on
  • The VMs in the target site are in a shutdown state
  • The virtual NICs of the affected VMs are in a disconnected state, and IP customization fails after 20 retries
  • To overcome the failure, the correct port group must be assigned manually to the virtual NIC and the Recovery Plan must be run again
  • As part of the SRM recovery workflow, the VM's network ID gets re-configured and updated correctly with the dvSwitch ID of the DR site
  • However, during power on the dvSwitch ID of the VM gets reverted back to the dvSwitch ID of the production site
  • This issue does not occur if the VM is powered on on the same ESXi host where it was re-configured
  • The issue does not occur if DRS in the target site is disabled or set to manual
  • The issue occurs only when DRS decides to power on the VM on a different host from the one where it originally resides and where the re-configure happens
  • This issue is seen only when ESXi is used with Nutanix AOS Stargate for NFS
 
  • "Hostd" logs for the re-configure operation on the ESXi host in the target site:
2023-07-13T07:07:20.763Z verbose hostd[2100051] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/c9bb7c2f-35be9a24/Test/Test.vmx opID=4e7b7aa4-9162-4907-800e-4d36124733ba-failover:e94a:dde7:2670:3cb8:05dc-9f-01-75-7127 user=vpxuser:user\SRM-39c5cb64-a157-45e0-8651-8e798808a7d6] Reconfigure: (vim.vm.ConfigSpec) {
--> createDate = "2022-10-27T00:50:27.35248Z",
--> files = (vim.vm.FileInfo) {
--> vmPathName = "[]/vmfs/volumes/c9bb7c2f-35be9a24/Dummy2/Dummy2.vmx",
--> },
--> deviceChange = (vim.vm.device.VirtualDeviceSpec) [
--> (vim.vm.device.VirtualDeviceSpec) {
--> operation = "edit",
--> device = (vim.vm.device.VirtualVmxnet3) {
--> key = 4000,
--> deviceInfo = (vim.Description) {
--> label = "Network adapter 1",
--> summary = "DVSwitch: 50 00 f4 cf b3 8f 98 b4-3c e9 61 af 7c d5 25 b0"
--> },
--> backing = (vim.vm.device.VirtualEthernetCard.DistributedVirtualPortBackingInfo) {
--> port = (vim.dvs.PortConnection) {
--> switchUuid = "50 18 b9 f9 48 87 b4 36-dc 50 c0 8d ba 42 ce 6c",
--> portgroupKey = "dvportgroup-36",
--> portKey = "80",
--> connectionCookie = 2075781137
  • "Hostd" logs during registration of the VM on the target ESXi host where DRS decides to power on the VM. The dvSwitch ID reverts here and, as expected, hostd reports that the DVS cannot be found:
2023-07-13T07:07:23.529Z warning hostd[2099929] [Originator@6876 sub=Hostsvc.NetworkProvider opID=4e7b7aa4-9162-4907-800e-4d36124733ba-failover:e94a:dde7:2670:594c:ad7a-62-01-01-06-01-38-a024] GetDvsById: dvs 50 00 f4 cf b3 8f 98 b4-3c e9 61 af 7c d5 25 b0 not found
2023-07-13T07:07:23.529Z warning hostd[2099929] [Originator@6876 sub=Hostsvc.NetworkProvider opID=4e7b7aa4-9162-4907-800e-4d36124733ba-failover:e94a:dde7:2670:594c:ad7a-62-01-01-06-01-38-a024] Error getting dvs 50 00 f4 cf b3 8f 98 b4-3c e9 61 af 7c d5 25 b0 : Fault cause: vim.fault.NotFound
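
The two log excerpts above can be checked against each other mechanically. The following is a minimal sketch, not a supported tool: it assumes the hostd log text has already been collected from the hosts, and the function names and regular expressions are hypothetical, inferred only from the sample lines in this article. It lists DVS UUIDs that hostd reported as not found and that differ from the switchUuid written by the reconfigure operation.

```python
import re

# Patterns inferred from the sample log lines in this article (an assumption):
# a DVS UUID is 16 hex byte pairs separated by spaces, with one dash.
DVS_NOT_FOUND = re.compile(
    r"GetDvsById: dvs ((?:[0-9a-f]{2}[ -]){15}[0-9a-f]{2}) not found"
)
SWITCH_UUID = re.compile(r'switchUuid = "([^"]+)"')


def normalize(log_text):
    """Collapse line wraps and repeated whitespace so wrapped entries still match."""
    return re.sub(r"\s+", " ", log_text)


def missing_dvs_uuids(log_text):
    """Return the distinct DVS UUIDs that hostd reported as not found."""
    return sorted(set(DVS_NOT_FOUND.findall(normalize(log_text))))


def reconfigured_switch_uuids(log_text):
    """Return the distinct switchUuid values written by Reconfigure entries."""
    return sorted(set(SWITCH_UUID.findall(normalize(log_text))))


def unexpected_dvs_lookups(reconfigure_log, registration_log):
    """DVS UUIDs looked up at registration that the reconfigure did not set."""
    expected = set(reconfigured_switch_uuids(reconfigure_log))
    return [u for u in missing_dvs_uuids(registration_log) if u not in expected]


# Shortened copies of the excerpts shown in this article.
SAMPLE_RECONFIGURE_LOG = """
--> port = (vim.dvs.PortConnection) {
--> switchUuid = "50 18 b9 f9 48 87 b4 36-dc 50 c0 8d ba 42 ce 6c",
--> portgroupKey = "dvportgroup-36",
"""
SAMPLE_REGISTRATION_LOG = """
2023-07-13T07:07:23.529Z warning hostd[2099929] GetDvsById: dvs 50 00 f4 cf b3 8f 98
b4-3c e9 61 af 7c d5 25 b0 not found
"""

print(unexpected_dvs_lookups(SAMPLE_RECONFIGURE_LOG, SAMPLE_REGISTRATION_LOG))
# prints ['50 00 f4 cf b3 8f 98 b4-3c e9 61 af 7c d5 25 b0']
```

A non-empty result is consistent with the symptom described here: the host that powers on the VM looks up the production-site dvSwitch rather than the one set during reconfiguration.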



Environment

VMware Site Recovery Manager 8.x

Cause

  • The issue is caused when the target ESXi host issues a READ for a stale file handle (FH) of the VM's VMX file. The target ESXi host knows only the stale FH because it is not aware of the re-configure operation performed on the VMX by the source ESXi host (the host on which the re-configure ran).
  • Nutanix AOS Stargate does not respond with a stale FH error. Instead, it retrieves the stale file from its "Recycle Bin". From a Nutanix perspective, it serves the stale FH because the "Recycle Bin" is accessible to NFS clients (the ESXi host in this case).
  • The "Recycle Bin" functionality was introduced in Nutanix AOS 5.18 STS
  • The issue can occur when ESXi is combined with Nutanix AOS versions later than 5.18
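
The mechanism described above can be illustrated with a toy model. This is only a sketch under stated assumptions and is not Nutanix or ESXi code: the class `ToyNfsStore` and the function `power_on_read` are hypothetical, the model assumes that a re-configure replaces the VMX with a new file that gets a new file handle, and it assumes that a client told its handle is stale performs a fresh lookup.

```python
import errno


class ToyNfsStore:
    """Toy model of an NFS datastore (illustration only)."""

    def __init__(self, recycle_bin_enabled):
        self.recycle_bin_enabled = recycle_bin_enabled
        self._next_handle = 1
        self._live = {}      # file handle -> content of current files
        self._recycled = {}  # file handle -> content of replaced files
        self._names = {}     # path -> current file handle

    def write_new(self, path, content):
        """Replace the file at path with a new file that gets a new handle."""
        old = self._names.get(path)
        if old is not None:
            old_content = self._live.pop(old)
            if self.recycle_bin_enabled:
                # The replaced file stays readable through its old handle.
                self._recycled[old] = old_content
        handle = self._next_handle
        self._next_handle += 1
        self._live[handle] = content
        self._names[path] = handle
        return handle

    def lookup(self, path):
        """Return the current file handle for path."""
        return self._names[path]

    def read(self, handle):
        """Read by handle; raise ESTALE only if the handle is unknown."""
        if handle in self._live:
            return self._live[handle]
        if handle in self._recycled:
            return self._recycled[handle]
        raise OSError(errno.ESTALE, "Stale file handle")


def power_on_read(store, cached_handle, path):
    """Client that refreshes its handle only when the server reports ESTALE."""
    try:
        return store.read(cached_handle)
    except OSError as exc:
        if exc.errno != errno.ESTALE:
            raise
        return store.read(store.lookup(path))


def simulate(recycle_bin_enabled):
    store = ToyNfsStore(recycle_bin_enabled)
    cached = store.write_new("vm.vmx", "dvs=production")  # handle the target host knows
    store.write_new("vm.vmx", "dvs=dr")                   # re-configure on the other host
    return power_on_read(store, cached, "vm.vmx")


print(simulate(recycle_bin_enabled=True))   # prints dvs=production
print(simulate(recycle_bin_enabled=False))  # prints dvs=dr
```

In the model, the stale configuration is read only when the replaced file remains reachable through its old handle, which matches the reasoning behind the workaround of disabling the "Recycle Bin".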

 

Resolution

  • There is currently no permanent resolution for this issue. Nutanix Engineering is working on a fix for upcoming releases


Workaround:
  • The only workaround is to disable the "Recycle Bin" in AOS


Additional Information

Impact/Risks:
  • During an SRM failover, VMs may fail to power on and may have their NICs disconnected