Virtual machine backups to Rubrik timeout on single ESXi host

Article ID: 441078

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

  • When virtual machine (VM) backups are run with the third-party backup software Rubrik using the NBD (Network Block Device) transport mode, the backup process hangs and eventually fails or times out for VMs running on one particular ESXi host.

  • The backup software logs report errors similar to:

    Could not fetch snapshot disk data. Error message: ReadVMDKResponse: err=14009, async=1, start_sector=####, num_sectors_to_read=512, payload_len=####, err_str=The server refused connection(err=14009), checksums_count_=####

     

  • In the affected ESXi host's /var/log/hostd.log, the Network File Copy (NFC) session successfully starts but drops mid-transfer with a broken pipe error:

    [NFC ERROR]NfcAioRecvData: Failed to receive data: NFC_NETWORK_ERROR
    [NFC ERROR]NfcAioGetMessage: Recv msg failed: NFC_NETWORK_ERROR
    [NFC ERROR]NfcAioGetAndProcessMsg: Failed to receive an AIO message: NFC_NETWORK_ERROR
    [NFC ERROR]NfcSendMessage: NfcNet_Send failed: NFC_NETWORK_ERROR
    [NFC INFO]NfcSessionStats: session=####, type=server, clientName='vddk', streamMode=0, fssrvrMode=0, aioMode=1, version=11, remoteVersion=11, currState=NFC_IDLE, prevState=NFC_AIO_SESSION, returnCode=NFC_NETWORK_ERROR (3), detail="The operation experienced a network error -- Failed to send complete message: Broken pipe"

     

  • Running a continuous vmkping from the ESXi host to the backup appliance IP address (which resides on a different VLAN) while a backup is active reveals significant, intermittent packet loss (for example, a drop rate of about 50%):

    vmkping -I vmk<number> -c 999 -d -s 1472 <Backup_Appliance_IP>
     
    (Note: Replace vmk<number> with the VMkernel adapter used for Management traffic and <Backup_Appliance_IP> with the IP address of your backup server. In an MTU 9000 environment, use -s 8972 instead of -s 1472; in both cases the payload size is the MTU minus 28 bytes of IP and ICMP headers.)

  • However, a vmkping from the ESXi host to its local default gateway completes with 0% packet loss, confirming that local Layer 2 connectivity is healthy.

  • Reviewing the vmnic adapter counters and the esxtop Network view shows no local network drops or errors on the ESXi host. For more information on validating these counters, see "ESXTOP overview for Performance Troubleshooting" and "Troubleshooting NIC errors and other network performance issues".

  • There are no subnet mask or MTU configuration mismatches between the affected host and the working hosts within the cluster.

  • The esxtop output for the host on which the VM is running does not show any storage-related latency. For more information, see "Using esxtop to identify storage performance issues for ESXi".
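
The two vmkping checks above (to the backup appliance and to the local default gateway) can be compared programmatically. The following is a minimal Python sketch, not a supported tool. It assumes the ping summary line has the form "N packets transmitted, M packets received, P% packet loss"; the function names and the sample summary lines are hypothetical and are used only for illustration.

```python
import re

# Assumption: the summary line ends with "<P>% packet loss".
_LOSS_RE = re.compile(r"(\d+(?:\.\d+)?)% packet loss")


def packet_loss_percent(ping_output: str) -> float:
    """Extract the packet-loss percentage from captured ping output."""
    match = _LOSS_RE.search(ping_output)
    if match is None:
        raise ValueError("no packet-loss summary found in ping output")
    return float(match.group(1))


def locate_loss(gateway_loss: float, appliance_loss: float) -> str:
    """Rough heuristic mirroring the reasoning in this article.

    Loss to the local gateway points at the local segment; loss only
    to the appliance points at the upstream path beyond the gateway.
    """
    if gateway_loss > 0:
        return "local"
    if appliance_loss > 0:
        return "upstream"
    return "none"


# Hypothetical summary lines, not taken from a real host.
gateway_run = "999 packets transmitted, 999 packets received, 0% packet loss"
appliance_run = "999 packets transmitted, 500 packets received, 49.9% packet loss"

result = locate_loss(
    packet_loss_percent(gateway_run),
    packet_loss_percent(appliance_run),
)
print(result)
```

With the sample values shown, the gateway run is clean and the appliance run is lossy, which matches the pattern described in this article and suggests a fault beyond the local gateway.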

Environment

VMware vSphere ESXi 8.x

Cause

Physical network degradation (such as failing switch hardware or a faulty SFP transceiver) on the upstream network path between the ESXi host and the backup appliance causes this issue. Packet loss on the physical path during the heavy data transfer causes TCP retransmissions to time out, which forces the ESXi host to drop the connection and report an NFC_NETWORK_ERROR.
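
To confirm that a host is reporting this failure, the NfcSessionStats lines in hostd.log can be filtered for the NFC_NETWORK_ERROR return code. The sketch below is an illustration only: the regular expression is based on the log excerpt shown in this article (abridged in the sample), the function name is hypothetical, and the field layout may differ between ESXi builds.

```python
import re

# Pattern derived from the NfcSessionStats excerpt in this article.
# Assumption: clientName, returnCode and detail appear in this order.
_STATS_RE = re.compile(
    r"NfcSessionStats:.*?clientName='(?P<client>[^']*)'"
    r".*?returnCode=(?P<code>\w+)"
    r'.*?detail="(?P<detail>[^"]*)"'
)


def nfc_network_errors(log_lines):
    """Return client and detail for each session that ended in NFC_NETWORK_ERROR."""
    failures = []
    for line in log_lines:
        match = _STATS_RE.search(line)
        if match and match.group("code") == "NFC_NETWORK_ERROR":
            failures.append(
                {"client": match.group("client"), "detail": match.group("detail")}
            )
    return failures


# Abridged copy of the log line quoted in this article.
sample_line = (
    "[NFC INFO]NfcSessionStats: session=####, type=server, clientName='vddk', "
    "currState=NFC_IDLE, prevState=NFC_AIO_SESSION, "
    "returnCode=NFC_NETWORK_ERROR (3), "
    'detail="The operation experienced a network error -- '
    'Failed to send complete message: Broken pipe"'
)

for failure in nfc_network_errors([sample_line]):
    print(failure["client"], "->", failure["detail"])
```

On a real host the input would be the lines of /var/log/hostd.log, for example read with `open("/var/log/hostd.log", errors="replace")`.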

Resolution

  1. Engage the internal network team to trace the upstream routing path.
  2. Check for physical hardware degradation, specifically faulty SFP transceivers or switch hardware.
  3. Once connectivity stabilizes with 0% packet loss, retry the backup.
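
Before retrying the backup, the vmkping test from the Issue/Introduction section can be repeated to confirm 0% packet loss. The following Python sketch only assembles the command string with the payload size that matches the MTU (MTU minus the 20-byte IPv4 header and the 8-byte ICMP header). The function names are hypothetical, and vmk0 and 192.0.2.10 are placeholder values, not values from this article.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def icmp_payload_size(mtu: int) -> int:
    """Largest ICMP payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


def vmkping_command(vmk: str, target_ip: str, mtu: int = 1500, count: int = 999) -> str:
    """Build the vmkping command used in this article (-d sets don't-fragment)."""
    return f"vmkping -I {vmk} -c {count} -d -s {icmp_payload_size(mtu)} {target_ip}"


# Placeholder adapter and address, for illustration only.
print(vmkping_command("vmk0", "192.0.2.10"))
print(vmkping_command("vmk0", "192.0.2.10", mtu=9000))
```

The computed sizes correspond to the -s 1472 and -s 8972 values given earlier for MTU 1500 and MTU 9000 environments.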