DellEMC RecoverPoint times out and replication fails constantly

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:

DellEMC RecoverPoint may report constant replication failures and become unstable.
vSocket debug logging on the RecoverPoint Appliance (RPA) may report:

Deadlock and send timeouts.
The iofilter daemon is stuck waiting for data from another VM.

2019/08/07 11:20:40.600 - #1 - 4429/4339 - SocketInfoJIRAF::isFDReady: poll timeout errno = 92 a_expireTimeUsecs = 52487696116( m_lr=(0x74656b636f5376,0x2000013ba,e_JIRAF) m_handle=0 m_openCount=3 m_status=e_OK m_cidPort = 2:5050 m_afVMCI = 40 m_sockFD = 160)
2019/08/07 11:21:40.640 - #1 - 4429/4339 - SocketInfoJIRAF::isFDReady: poll timeout errno = 92 a_expireTimeUsecs = 52547753234( m_lr=(0x74656b636f5376,0x2000013ba,e_JIRAF) m_handle=0 m_openCount=3 m_status=e_OK m_cidPort = 2:5050 m_afVMCI = 40 m_sockFD = 160)
2019/08/07 11:22:40.675 - #1 - 4429/4339 - SocketInfoJIRAF::isFDReady: poll timeout errno = 92 a_expireTimeUsecs = 52607792673( m_lr=(0x74656b636f5376,0x2000013ba,e_JIRAF) m_handle=0 m_openCount=3 m_status=e_OK m_cidPort = 2:5050 m_afVMCI = 40 m_sockFD = 160)
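The poll timeouts above recur roughly once a minute. A minimal Python sketch for pulling these events out of an RPA log and measuring the gap between them follows. The regular expression is an assumption based only on the fields visible in the sample lines, and the function names are hypothetical; adjust both to the actual log format.

```python
import re
from datetime import datetime

# Matches only the fields visible in the sample lines above (assumed format).
POLL_TIMEOUT = re.compile(
    r"^(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) .*"
    r"isFDReady: poll timeout errno = (?P<errno>\d+).*"
    r"m_cidPort = (?P<cid>\d+):(?P<port>\d+)"
)

def parse_poll_timeouts(lines):
    """Return (timestamp, errno, cid, port) for each poll-timeout line."""
    events = []
    for line in lines:
        m = POLL_TIMEOUT.search(line)
        if m:
            ts = datetime.strptime(m["ts"], "%Y/%m/%d %H:%M:%S.%f")
            events.append((ts, int(m["errno"]), int(m["cid"]), int(m["port"])))
    return events

def gaps_seconds(events):
    """Seconds between consecutive timeouts; a steady gap suggests a retry loop."""
    return [(b[0] - a[0]).total_seconds() for a, b in zip(events, events[1:])]
```

Feeding the three sample lines through `parse_poll_timeouts` and then `gaps_seconds` yields gaps of about 60 seconds, all on the same `m_cidPort` value.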

The ESXi host log (iofilterd-emcjiraf.log) contains error messages similar to the following:

2019-08-07T11:20:45Z iofilterd-emcjiraf[68109]: jiraf_receive_msg: unknown cmd type
2019-08-07T11:21:04Z iofilterd-emcjiraf[68109]: jiraf_receive_msg: unknown cmd type
2019-08-07T11:21:04Z iofilterd-emcjiraf[68109]: jiraf_receive_msg: unknown cmd type
2019-08-07T11:21:04Z iofilterd-emcjiraf[68109]: jiraf_receive_msg: unknown cmd type

NOTE: The preceding log excerpts are only examples. Dates, times, and environment-specific values vary depending on your environment.
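To gauge how often the host-side symptom occurs, the "unknown cmd type" messages can be counted per minute. The sketch below is a hedged example in Python. It assumes the line layout shown in the iofilterd-emcjiraf.log excerpt, and the function name is hypothetical.

```python
import re
from collections import Counter

# Assumed layout: "<ISO timestamp>Z iofilterd-emcjiraf[<pid>]: jiraf_receive_msg: unknown cmd type"
UNKNOWN_CMD = re.compile(
    r"^(?P<minute>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}):\d{2}Z "
    r"iofilterd-emcjiraf\[\d+\]: jiraf_receive_msg: unknown cmd type"
)

def unknown_cmd_per_minute(lines):
    """Count 'unknown cmd type' messages, keyed by timestamp truncated to the minute."""
    counts = Counter()
    for line in lines:
        m = UNKNOWN_CMD.match(line.strip())
        if m:
            counts[m["minute"]] += 1
    return counts
```

Comparing the minutes reported here with the poll-timeout timestamps on the RPA can help confirm that both logs describe the same incident.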

Environment

VMware vSphere 7.x

VMware vSphere 8.x

Cause

The RecoverPoint IO filter daemon is single-threaded, which leads to two problems.

First, a deadlock can occur when the IO filter daemon is stuck in a vSocket receive because it received an unknown command type from one of the RecoverPoint Appliance (RPA) VMs. No communication takes place until the deadlock is resolved.
Second, the IO filter daemon scans datastores for new files in that same single thread. The scan runs periodically (every 30 seconds appears to be the default setting), and the daemon does nothing else while a scan is running. If access to the datastores is slow, a scan can take several minutes. Because the daemon does not process vSocket communication during a scan, the RPA VMs experience vSocket timeouts.
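The effect of a long datastore scan on a single-threaded daemon can be illustrated with a toy model in Python. This is not the daemon's code. It assumes that a scan starts a fixed interval after the previous one ends and that a request arriving during a scan is served only once the scan finishes. All names and numbers other than the 30-second interval and 10-second timeout mentioned in this article are hypothetical.

```python
def timed_out_requests(scan_interval_s, scan_duration_s, client_timeout_s, request_times_s):
    """Return the request arrival times that would time out in the toy model.

    Each cycle is: idle (serving requests) for scan_interval_s seconds,
    then scanning (serving nothing) for scan_duration_s seconds.
    """
    cycle = scan_interval_s + scan_duration_s
    timed_out = []
    for t in request_times_s:
        offset = t % cycle
        if offset >= scan_interval_s:        # request arrived mid-scan
            wait = cycle - offset            # time until the scan finishes
            if wait > client_timeout_s:
                timed_out.append(t)
    return timed_out

# Slow datastores: a 3-minute scan blocks the request that arrives at t=40.
print(timed_out_requests(30, 180, 10, [5, 40, 205]))   # [40]
# Fast datastores: a 2-second scan never exceeds the 10-second timeout.
print(timed_out_requests(30, 2, 10, [5, 31, 40, 205]))  # []
```

In this model a request times out only when the remaining scan time exceeds the client timeout, which matches the observation that slow datastore access triggers the failures.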

Resolution

This is not a VMware issue.

Workaround:

Contact DellEMC support to implement the following workaround.

Change the following vSocket timeout values in the EMC RecoverPoint iofilter daemon:

t_JIRAFvSocketTimeoutSecs - change from 10 to 30 [how long RecoverPoint waits for vSocket communication]

t_SocketIOProviderDeadlockConfig - change the value to 35 [how long before a deadlock is declared]
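The recommended values place the deadlock threshold (35) above the vSocket timeout (30). The Python sketch below checks a proposed pair of values against that ordering. The ordering rule is an inference from the recommended values and is not a documented requirement. The unit of the deadlock value is assumed to be seconds, and the function name is hypothetical. The script does not modify any RecoverPoint setting; the actual change is made by DellEMC support.

```python
def check_timeouts(vsocket_timeout_s, deadlock_threshold_s):
    """Return a list of problems found with a proposed pair of values.

    Assumption (not from vendor documentation): the deadlock threshold should
    exceed the vSocket timeout, so that a slow but live socket is allowed to
    time out normally before a deadlock is declared.
    """
    problems = []
    if vsocket_timeout_s <= 0:
        problems.append("t_JIRAFvSocketTimeoutSecs must be positive")
    if deadlock_threshold_s <= 0:
        problems.append("t_SocketIOProviderDeadlockConfig must be positive")
    if deadlock_threshold_s <= vsocket_timeout_s:
        problems.append(
            "t_SocketIOProviderDeadlockConfig should exceed t_JIRAFvSocketTimeoutSecs"
        )
    return problems

# The values recommended in this article pass the check.
print(check_timeouts(30, 35))  # []
```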

For more information, refer to the Dell EMC KB article: RecoverPoint for Virtual Machines: Intermittent Loss of Access to Journal & Repository Volumes

Disclaimer: VMware is not responsible for the reliability of any data, opinions, advice or statements made on third-party websites. Inclusion of such links does not imply that VMware endorses, recommends or accepts any responsibility for the content of such sites.