VDS reporting as "down" after NSX upgrade

Article ID: 417106

Products

VMware NSX

Issue/Introduction

  • After upgrading NSX from 4.2.1 to 4.2.3, some hosts failed NSX host preparation.
  • Each DPU interface shows as "NIC disabled" in the vmkernel log.
  • When viewing </var/run/log/vmkernel.log> on the ESX host, log messages similar to the following are observed:
    <timestamps> UTC  In(182) vmkernel: cpu56:2098438)<NMLX_ERR> nmlx5_core: 0000:2a:00.1: Health: NIC disabled state detected
    <timestamps> UTC  In(182) vmkernel: cpu56:2098438)<NMLX_INF> assertVar[0] 0x00000000
    <timestamps> UTC  In(182) vmkernel: cpu56:2098438)<NMLX_INF> assertVar[1] 0x00000000
    <timestamps> UTC  In(182) vmkernel: cpu56:2098438)<NMLX_INF> assertVar[2] 0x00000000
    <timestamps> UTC  In(182) vmkernel: cpu56:2098438)<NMLX_INF> assertVar[3] 0x00000000
    <timestamps> UTC  In(182) vmkernel: cpu56:2098438)<NMLX_INF> assertVar[4] 0x00000000
    <timestamps> UTC  In(182) vmkernel: cpu56:2098438)<NMLX_INF> assertExitPtr 0x00000000
    <timestamps> UTC  In(182) vmkernel: cpu56:2098438)<NMLX_INF> assertCallra 0x00000000
    <timestamps> UTC  In(182) vmkernel: cpu56:2098438)<NMLX_INF> firmwareVersion 0x00000000
    <timestamps> UTC  In(182) vmkernel: cpu56:2098438)<NMLX_INF> hwId 0x00000000
    <timestamps> UTC  In(182) vmkernel: cpu56:2098438)<NMLX_INF> iriscIndex 0
    <timestamps> UTC  In(182) vmkernel: cpu56:2098438)<NMLX_INF> synd 0x0: unrecognized error
    <timestamps> UTC  In(182) vmkernel: cpu56:2098438)<NMLX_INF> extSynd 0x0000
    <timestamps> UTC  In(182) vmkernel: cpu56:2098438)<NMLX_INF> driver 4.23.6.5
    <timestamps> UTC  In(182) vmkernel: cpu56:2098438)<NMLX_WRN> nmlx5_core: 0000:2a:00.1: Health: Bad device state recovery is started
  • Reviewing the output of the <net-dvs -l> command shows the following:
    "com.vmware.common.host.dpu.failover.status" = "red fail"
    
  • The following is seen in </var/run/log/hostd.log>, showing that nsxa was not responding at that time (example commands to confirm these symptoms are shown after this list):
    hostd.8:<timestamps> UTC Er(163) Hostd[2103060]: [Originator@6876 sub=Hostsvc opID=DpuFailover-1004ef14] MessageSendHelper: Failed to send opaque network msg: opId:[DpuFailover-1004ef14-80] opCode:12
    hostd.8:<timestamps> UTC Er(163) Hostd[2103060]: [Originator@6876 sub=Hostsvc opID=DpuFailover-1004ef14] MessageSendHelper: Task [DpuFailover-1004ef14-80] failed or has no response
    hostd.8:<timestamps> UTC In(166) Hostd[2103060]: [Originator@6876 sub=Hostsvc opID=DpuFailover-1004ef14] Fail to launch DPU failover on NSXA. Err: No response from NSXA
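
  To confirm these symptoms on an affected ESX host, the following commands are a minimal sketch assembled from the excerpts above (the grep strings, log paths, and property name are taken from the messages shown; the hostd messages may also appear in a rotated file such as hostd.8):

    # Look for the NMLX health error reported for the DPU interfaces
    grep "NIC disabled state detected" /var/run/log/vmkernel.log

    # Check whether the DPU failover status property is set on the switch
    net-dvs -l | grep "com.vmware.common.host.dpu.failover.status"

    # Look for the failed DPU failover operation reported by hostd
    grep "DpuFailover" /var/run/log/hostd.log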

Cause

  • During the NSX upgrade from 4.2.1 to 4.2.3, a DPU failover is triggered and the DPU failover workflow fails.
    • Hostd triggers the failover because all of the active pNICs are down and one standby pNIC is up.
  • There is a 1-second delay to re-check the NIC state before starting the failover workflow, in order to avoid reacting to a NIC flap.
    • However, the active pNIC (vmnic3) only comes back up at around 3 seconds, after the re-check window, so hostd starts the failover.
    • Hostd fails to send the failover message to opsagent because nsxa is not connected to the IPC socket, most likely due to the upgrade process (a sketch for checking the NSX agents on the host is shown after this list).
  • The VDS status is marked down after the failed DPU failover operation.
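
  As a quick check after the upgrade, the following is a minimal sketch for verifying that the NSX agents on the host are running again; the nsx-opsagent script name is an assumption and may vary by NSX version, so list the scripts present on the host first:

    # List the NSX-related service scripts present on the host
    ls /etc/init.d/ | grep -i nsx

    # Query the status of the opsagent service (script name assumed; use the name listed above)
    /etc/init.d/nsx-opsagent status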

Resolution

  • This behavior will be resolved in a future release of ESXi (8.0 U3H P07).

    • The fix will not be present in 8.0 U3I (P08).
    • It will be present in 8.0 U3J (P09).

  • The following workaround recovers the VDS state on an ESX host impacted by this issue:

    1. Run the following commands on the affected ESX host (a worked example is shown after these steps):
      esxcfg-vswitch -l

      Find the impacted switch name. This is used in the following command:

      net-dvs -u "com.vmware.common.host.dpu.failover.status" -p hostPropList <switch_name>
      
    2. In the vCenter (VC) UI, disconnect the host from the inventory and then reconnect it.
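
  The following is a worked example of step 1, assuming the impacted VDS is named "vds-overlay" (a hypothetical name; substitute the switch name reported by esxcfg-vswitch -l on your host):

    # List the switches on the host and note the name of the impacted VDS
    esxcfg-vswitch -l

    # Clear the stale DPU failover status property from the host property list of that switch
    net-dvs -u "com.vmware.common.host.dpu.failover.status" -p hostPropList vds-overlay

    # Confirm the property is no longer reported
    net-dvs -l | grep "com.vmware.common.host.dpu.failover.status"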