VDS reporting as "down" after NSX upgrade

Article ID: 417106

Products

VMware NSX

Issue/Introduction

  • After upgrading NSX from 4.2.1 to 4.2.3, some hosts failed NSX host preparation.
  • Each DPU interface shows as "NIC disabled" in the vmkernel log.
  • When viewing </var/run/log/vmkernel.log> on the ESX host, log messages similar to the following are observed:
    <timestamps> UTC  In(182) vmkernel: cpu56:2098438)<NMLX_ERR> nmlx5_core: 0000:2a:00.1: Health: NIC disabled state detected
    <timestamps> UTC  In(182) vmkernel: cpu56:2098438)<NMLX_INF> assertVar[0] 0x00000000
    <timestamps> UTC  In(182) vmkernel: cpu56:2098438)<NMLX_INF> assertVar[1] 0x00000000
    <timestamps> UTC  In(182) vmkernel: cpu56:2098438)<NMLX_INF> assertVar[2] 0x00000000
    <timestamps> UTC  In(182) vmkernel: cpu56:2098438)<NMLX_INF> assertVar[3] 0x00000000
    <timestamps> UTC  In(182) vmkernel: cpu56:2098438)<NMLX_INF> assertVar[4] 0x00000000
    <timestamps> UTC  In(182) vmkernel: cpu56:2098438)<NMLX_INF> assertExitPtr 0x00000000
    <timestamps> UTC  In(182) vmkernel: cpu56:2098438)<NMLX_INF> assertCallra 0x00000000
    <timestamps> UTC  In(182) vmkernel: cpu56:2098438)<NMLX_INF> firmwareVersion 0x00000000
    <timestamps> UTC  In(182) vmkernel: cpu56:2098438)<NMLX_INF> hwId 0x00000000
    <timestamps> UTC  In(182) vmkernel: cpu56:2098438)<NMLX_INF> iriscIndex 0
    <timestamps> UTC  In(182) vmkernel: cpu56:2098438)<NMLX_INF> synd 0x0: unrecognized error
    <timestamps> UTC  In(182) vmkernel: cpu56:2098438)<NMLX_INF> extSynd 0x0000
    <timestamps> UTC  In(182) vmkernel: cpu56:2098438)<NMLX_INF> driver 4.23.6.5
    <timestamps> UTC  In(182) vmkernel: cpu56:2098438)<NMLX_WRN> nmlx5_core: 0000:2a:00.1: Health: Bad device state recovery is started
  • Reviewing the output of the <net-dvs -l> command shows the following:
    "com.vmware.common.host.dpu.failover.status" = "red fail"
    
  • The following is seen in </var/run/log/hostd.log>, showing that nsxa was not responding at that time (example commands to confirm these symptoms are shown after this list):
    hostd.8:<timestamps> UTC Er(163) Hostd[2103060]: [Originator@6876 sub=Hostsvc opID=DpuFailover-1004ef14] MessageSendHelper: Failed to send opaque network msg: opId:[DpuFailover-1004ef14-80] opCode:12
    hostd.8:<timestamps> UTC Er(163) Hostd[2103060]: [Originator@6876 sub=Hostsvc opID=DpuFailover-1004ef14] MessageSendHelper: Task [DpuFailover-1004ef14-80] failed or has no response
    hostd.8:<timestamps> UTC In(166) Hostd[2103060]: [Originator@6876 sub=Hostsvc opID=DpuFailover-1004ef14] Fail to launch DPU failover on NSXA. Err: No response from NSXA
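
  To confirm these symptoms on an affected ESX host, the following commands are a minimal sketch assembled from the excerpts above (the grep strings, log paths, and property name are taken from the messages shown; the hostd messages may also appear in a rotated file such as hostd.8):

    # Look for the NMLX health error reported for the DPU interfaces
    grep "NIC disabled state detected" /var/run/log/vmkernel.log

    # Check whether the DPU failover status property is set on the switch
    net-dvs -l | grep "com.vmware.common.host.dpu.failover.status"

    # Look for the failed DPU failover operation reported by hostd
    grep "DpuFailover" /var/run/log/hostd.log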

Cause

  • During the NSX upgrade from 4.2.1 to 4.2.3, a DPU failover is triggered and the DPU failover workflow fails.
    • Hostd triggers the failover because all of the active pNICs are down and one standby pNIC is up.
  • There is a 1-second delay to re-check the NIC state before starting the failover workflow, in order to avoid reacting to a NIC flap.
    • However, the active pNIC (vmnic3) only comes back up at around 3 seconds, after the re-check window, so hostd starts the failover.
    • Hostd fails to send the failover message to opsagent because nsxa is not connected to the IPC socket, most likely due to the upgrade process (a sketch for checking the NSX agents on the host is shown after this list).
  • The VDS status is marked down after the failed DPU failover operation.
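
  As a quick check after the upgrade, the following is a minimal sketch for verifying that the NSX agents on the host are running again; the nsx-opsagent script name is an assumption and may vary by NSX version, so list the scripts present on the host first:

    # List the NSX-related service scripts present on the host
    ls /etc/init.d/ | grep -i nsx

    # Query the status of the opsagent service (script name assumed; use the name listed above)
    /etc/init.d/nsx-opsagent status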

Resolution

  • This behavior will be resolved in a future release of ESXi (8.0 U3H P07).

    • The fix will not be present in 8.0 U3I (P08).
    • It will be present in 8.0 U3J (P09).

  • The following workaround recovers the VDS state on an ESX host impacted by this issue:

    1. Run the following commands on the affected ESX host (a worked example is shown after these steps):
      esxcfg-vswitch -l

      Find the impacted switch name. This is used in the following command:

      net-dvs -u "com.vmware.common.host.dpu.failover.status" -p hostPropList <switch_name>
      
    2. In the vCenter (VC) UI, disconnect the host from the inventory and then reconnect it.
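
  The following is a worked example of step 1, assuming the impacted VDS is named "vds-overlay" (a hypothetical name; substitute the switch name reported by esxcfg-vswitch -l on your host):

    # List the switches on the host and note the name of the impacted VDS
    esxcfg-vswitch -l

    # Clear the stale DPU failover status property from the host property list of that switch
    net-dvs -u "com.vmware.common.host.dpu.failover.status" -p hostPropList vds-overlay

    # Confirm the property is no longer reported
    net-dvs -l | grep "com.vmware.common.host.dpu.failover.status"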