When virtual machine (VM) backups are performed with the third-party backup software Rubrik using the NBD (Network Block Device) transport mode, the backup process hangs and eventually fails or times out when the VMs are running on a single host.
The backup software logs report errors similar to:
Could not fetch snapshot disk data. Error message: ReadVMDKResponse: err=14009, async=1, start_sector=####, num_sectors_to_read=512, payload_len=####, err_str=The server refused connection(err=14009), checksums_count_=####
In the affected ESXi host's /var/log/hostd.log, the Network File Copy (NFC) session successfully starts but drops mid-transfer with a broken pipe error:
[NFC ERROR]NfcAioRecvData: Failed to receive data: NFC_NETWORK_ERROR
[NFC ERROR]NfcAioGetMessage: Recv msg failed: NFC_NETWORK_ERROR
[NFC ERROR]NfcAioGetAndProcessMsg: Failed to receive an AIO message: NFC_NETWORK_ERROR
[NFC ERROR]NfcSendMessage: NfcNet_Send failed: NFC_NETWORK_ERROR
[NFC INFO]NfcSessionStats: session=####, type=server, clientName='vddk', streamMode=0, fssrvrMode=0, aioMode=1, version=11, remoteVersion=11, currState=NFC_IDLE, prevState=NFC_AIO_SESSION, returnCode=NFC_NETWORK_ERROR (3), detail="The operation experienced a network error -- Failed to send complete message: Broken pipe"
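To confirm that a host shows the same NFC failure pattern, the hostd.log entries above can be tallied with a small script. The following is a minimal sketch, not an official tool: the function name summarize_nfc_errors and both regular expressions are illustrative assumptions derived only from the log lines shown above, and real log lines may carry extra prefixes (timestamps, thread IDs) that the patterns simply skip over.

```python
import re
from collections import Counter

# Captures the function name that follows the "[NFC ERROR]" tag,
# e.g. "NfcAioRecvData". Pattern is an assumption based on the excerpt above.
NFC_ERROR_RE = re.compile(r"\[NFC ERROR\]\s*(\w+):")

# Captures session id, return code and detail text from the
# "[NFC INFO]NfcSessionStats" summary line.
NFC_STATS_RE = re.compile(
    r"\[NFC INFO\]\s*NfcSessionStats: session=(?P<session>[^,]+),.*?"
    r"returnCode=(?P<code>\w+).*?detail=\"(?P<detail>[^\"]*)\""
)


def summarize_nfc_errors(lines):
    """Count NFC_NETWORK_ERROR entries per reporting function and
    collect the session summaries that ended with that return code."""
    errors = Counter()
    sessions = []
    for line in lines:
        if "NFC_NETWORK_ERROR" not in line:
            continue
        error_match = NFC_ERROR_RE.search(line)
        if error_match:
            errors[error_match.group(1)] += 1
        stats_match = NFC_STATS_RE.search(line)
        if stats_match:
            sessions.append(stats_match.groupdict())
    return errors, sessions
```

Any iterable of log lines can be passed in, for example an open file handle on a copy of /var/log/hostd.log. A session summary whose detail text ends in "Broken pipe" matches the symptom described here.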
Running a continuous vmkping from the ESXi host to the backup appliance IP address (which resides on a different VLAN) during the active backup reveals significant, intermittent packet loss (for example, a drop rate of roughly 50%):
vmkping -I vmk<number> -c 999 -d -s 1472 <Backup_Appliance_IP> (use -s 8972 if it is an MTU 9000 environment).
However, a vmkping from the ESXi host to its local default gateway completes with 0% packet loss, confirming that local Layer 2 connectivity is healthy.
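The -s values in the vmkping test are the largest ICMP payloads that fit in a single unfragmented packet: the MTU minus a 20-byte IPv4 header (assuming no IP options) and an 8-byte ICMP header. The sketch below shows that arithmetic and extracts the loss percentage from a ping summary line. The helper names are hypothetical, and the summary format ("N packets transmitted, M packets received, X% packet loss") is an assumption about how the output is printed, so adjust the pattern if the output on your build differs.

```python
import re

IPV4_HEADER_BYTES = 20  # IPv4 header without options (assumption)
ICMP_HEADER_BYTES = 8   # ICMP echo header


def max_icmp_payload(mtu):
    """Largest ICMP payload that fits in one packet of the given MTU."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


# Assumed summary line format; not taken from the article itself.
LOSS_RE = re.compile(
    r"(\d+) packets transmitted, (\d+) packets received, ([\d.]+)% packet loss"
)


def parse_packet_loss(output):
    """Return (sent, received, loss_percent) from ping output,
    or None if no summary line is found."""
    match = LOSS_RE.search(output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), float(match.group(3))
```

Comparing the parsed loss for the backup appliance against the loss for the local default gateway gives the same contrast described above: loss only on the routed path points to the upstream network rather than the host.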
Reviewing the vmnic adapter counters and the esxtop network view shows no local network drops or errors on the ESXi host. For more information on validating these counters, see ESXTOP overview for Performance Troubleshooting and Troubleshooting NIC errors and other network performance issues.
There are no subnet mask or MTU configuration mismatches between the affected host and the working hosts within the cluster.
esxtop output for the host on which the VM is running does not show any storage-related latency. For more information, see Using esxtop to identify storage performance issues for ESXi.
VMware vSphere ESXi 8.x
Physical network degradation (such as failing switch hardware or SFPs) on the upstream network path between the ESXi host and the backup appliance causes this issue. The packet loss on the physical network during heavy data transfer causes TCP retransmissions to time out, forcing the ESXi host to drop the connection and report an NFC_NETWORK_ERROR.