EMC RecoverPoint causes VMs to disconnect or become unresponsive

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:

VMs that are protected by RecoverPoint randomly disconnect or freeze.
Production VMs may hang, freeze, or become unresponsive after the splitter encounters IO errors from the VSCSI layer.
This may result in data unavailability until the ESXi host is rebooted.

1. An IO error occurs while the splitter submits the IO to the lower layer. Because of the fix for the PSOD after IO submission failure, even valid attempts to clean up the failed IO are marked as an incorrect callback and are not handled further. ESXi host VMkernel logs will show output similar to:

YYYY-MM-DDTHH:MM:SS - #2 - 2103327/2103284 - KS: krnl:[22:41:20.917] 0/0 #0 - IoEsx_ToStorage_s_forwardToLower: VSCSIFilter_IssueCommandToBackend Failed (io: 0x433682f8b850), with status Busy
krnl:[22:41:20.917] 0/0 #2 - (skipped 0 prints) - IoEsx_ToStorage_v_handleSendToStorageFailed_i: Called with status (Busy)
krnl:[22:41:20.917] 0/0 #0 - IoEsx_ToStorage_s_sendToStorageDone: Incorrect callback for a failed IO Submit for io 0x433682f8b850, skipping CommandIoBase_v_storageEndIo
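To check a collected VMkernel log for this signature, a short script can help. The following is a minimal sketch, not a supported tool: the pattern is derived only from the sample line above, the log format may differ between splitter builds, and the function name find_skipped_ios is hypothetical.

```python
import re

# Pattern derived from the sample log line in this article (assumption:
# other splitter builds may word the message differently).
INCORRECT_CALLBACK = re.compile(
    r"IoEsx_ToStorage_s_sendToStorageDone: Incorrect callback for a failed "
    r"IO Submit for io (0x[0-9a-fA-F]+)"
)


def find_skipped_ios(log_text):
    """Return the IO addresses whose cleanup was skipped, in log order."""
    return INCORRECT_CALLBACK.findall(log_text)


# Sample text copied from the log excerpt in this article.
sample = (
    "krnl:[22:41:20.917] 0/0 #0 - IoEsx_ToStorage_s_forwardToLower: "
    "VSCSIFilter_IssueCommandToBackend Failed (io: 0x433682f8b850), "
    "with status Busy\n"
    "krnl:[22:41:20.917] 0/0 #0 - IoEsx_ToStorage_s_sendToStorageDone: "
    "Incorrect callback for a failed IO Submit for io 0x433682f8b850, "
    "skipping CommandIoBase_v_storageEndIo\n"
)
print(find_skipped_ios(sample))
```

An empty result means only that this exact message was not found in the text supplied; it does not rule out the issue.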

2. Even though the underlying storage issue is not caused by the splitter, these IOs are not handled properly, so a continuous loop of VSCSI resets occurs and the VM remains hung even after the storage issue is resolved.

ESXi host VMkernel logs will also show outputs similar to:

YYYY-MM-DDTHH:MM:SS cpu2:2097832)VSCSI: 2903: handle 20963(vscsi0:10):Reset [Retries: 18/0] from (vmm0:ProdVM01)
YYYY-MM-DDTHH:MM:SS cpu3:2097832)VSCSI: 2903: handle 20963(vscsi0:10):Reset [Retries: 19/0] from (vmm0:ProdVM01)
YYYY-MM-DDTHH:MM:SS cpu6:2097832)VSCSI: 2903: handle 20963(vscsi0:10):Reset [Retries: 20/0] from (vmm0:ProdVM01)
YYYY-MM-DDTHH:MM:SS cpu5:2097832)VSCSI: 2903: handle 20963(vscsi0:10):Reset [Retries: 21/0] from (vmm0:ProdVM01)
YYYY-MM-DDTHH:MM:SS cpu0:2097832)VSCSI: 2903: handle 20963(vscsi0:10):Reset [Retries: 28/0] from (vmm0:ProdVM01)
YYYY-MM-DDTHH:MM:SS cpu3:2097832)VSCSI: 2903: handle 20963(vscsi0:10):Reset [Retries: 29/0] from (vmm0:ProdVM01)
YYYY-MM-DDTHH:MM:SS cpu20:2097832)VSCSI: 2903: handle 20963(vscsi0:10):Reset [Retries: 30/0] from (vmm0:ProdVM01)
YYYY-MM-DDTHH:MM:SS cpu5:2097832)VSCSI: 2903: handle 20963(vscsi0:10):Reset [Retries: 31/0] from (vmm0:ProdVM01)
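The climbing retry counter on a single handle is the pattern to look for. As an illustration, the sketch below groups reset messages by handle, virtual SCSI device, and VM, and reports the highest retry count seen. The regular expression is based only on the sample lines above, and the min_retries threshold of 10 is an arbitrary assumption, not a documented limit.

```python
import re
from collections import defaultdict

# Pattern derived from the sample VSCSI reset lines in this article.
RESET_LINE = re.compile(
    r"VSCSI: \d+: handle (\d+)\((vscsi\d+:\d+)\):Reset "
    r"\[Retries: (\d+)/\d+\] from \(([^)]+)\)"
)


def find_reset_loops(log_text, min_retries=10):
    """Map (handle, device, source) to the highest retry count seen.

    Only entries that reached min_retries are returned. The default of 10
    is a hypothetical threshold chosen for illustration.
    """
    highest = defaultdict(int)
    for handle, device, retries, source in RESET_LINE.findall(log_text):
        key = (handle, device, source)
        highest[key] = max(highest[key], int(retries))
    return {key: count for key, count in highest.items() if count >= min_retries}


# Two of the sample lines from this article.
sample = (
    "YYYY-MM-DDTHH:MM:SS cpu2:2097832)VSCSI: 2903: handle 20963(vscsi0:10):"
    "Reset [Retries: 18/0] from (vmm0:ProdVM01)\n"
    "YYYY-MM-DDTHH:MM:SS cpu5:2097832)VSCSI: 2903: handle 20963(vscsi0:10):"
    "Reset [Retries: 31/0] from (vmm0:ProdVM01)\n"
)
print(find_reset_loops(sample))
```

A handle that keeps appearing with an increasing count after the storage issue has been resolved is consistent with the hang described here.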

Cause

This is a known issue that has been identified by RecoverPoint. It applies only if the ESXi host has RecoverPoint for VMs (RP) version 5.2.2.2 or 5.2.2.3 installed.
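When auditing several hosts, the version condition can be expressed as a simple check. This sketch assumes the installed RP version has already been obtained as a plain dotted string; how it is collected, and whether it carries a build suffix, is outside this article. The name is_affected is hypothetical.

```python
# Versions named in this article as affected.
AFFECTED_RP_VERSIONS = {"5.2.2.2", "5.2.2.3"}


def is_affected(rp_version):
    """Return True if the version string exactly matches an affected release.

    Assumes a plain dotted version such as "5.2.2.2"; strings with a build
    suffix would need to be normalized first.
    """
    return rp_version.strip() in AFFECTED_RP_VERSIONS


for version in ("5.2.2.1", "5.2.2.2", "5.2.2.3"):
    print(version, is_affected(version))
```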

Resolution

There is no resolution at this time.

Dell EMC engineering is currently investigating this issue. A permanent fix is still in progress. Contact the Dell EMC Customer Support Center or your service representative for assistance and reference this solution ID.

Workaround:

Rebooting the ESXi host is currently the only way to release the VM.

1. vMotion all other VMs from the ESXi host.
2. Reboot the ESXi host.

NOTE: If the issue does not affect all hosts in the cluster, it needs to be analyzed and fixed at the storage level.

- Do not put the host into maintenance mode (MM) to automatically vMotion all the VMs.

- Disable HA and DRS so that the VMs do not suffer downtime from an HA/DRS-initiated migration, and manually vMotion each VM from each affected host.

Additional Information

RecoverPoint for VMs 5.2.2.2 and 5.2.2.3: Protected VMs may freeze after a storage access issue and IO aborts

Impact/Risks:
A splitter code fix to correct a PSOD (reference: PSOD after IO Submission failure) may, in some scenarios, cause the splitter's handling of failed IOs to become stuck in a loop, rendering the VM on the host unresponsive.