nsx-node-agent pods go into CrashLoopBackOff during an OpenShift cluster upgrade

Products

VMware NSX

Issue/Introduction

The nsx-node-agent pods are in the "CrashLoopBackOff" state:
- # oc get pods
  nsx-node-agent-g2### 2/3 CrashLoopBackOff 27 (3m11s ago) 26m <<<<<<<
  nsx-node-agent-2p### 0/3 CrashLoopBackOff 21 (4m52s ago) 17m <<<<<<<<<
This occurs during an OpenShift cluster upgrade from version 4.15 to a later version.
The following error is seen in the logs of these pods:
- <nsx-node-agent-2p###_nsx-node-agent.log>
  2025-10-01T08:38:57.060Z <nsx-manager> NSX 17 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="WARNING"] nsx_ujo.common.utils nsx_ujo.agent.agent exiting for NSXNodeAgentException: Unexpected error from nsx_node_agent: Can not find host MAC address for interface ens192.
  2025-10-01T08:38:57.068Z <nsx-manager> NSX 13 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="WARNING"] oslo_privsep.comm Unexpected error: <class 'OSError'>
  Traceback (most recent call last):
  File "/usr/local/bin/nsx_node_agent", line 10, in <module>
  sys.exit(main())
  File "/usr/local/lib/python3.9/site-packages/nsx_ujo/cmd/nsx_node_agent.py", line 17, in main
The hyperbus connection on the ESXi host shows the status "MISS_VERSION_HANDSHAKE":
- nsxcli -c get hyperbus connection info
  Wed Oct 01 2025 UTC 08:11:16.115
  VIFID Connection Status HostSwitchID
  f72c17bd-####-####-####-eded900##### 169.254.1.##:2345 MISS_VERSION_HANDSHAKE ## ## ## ## 7c de 45 ##-42 ca ## ## ## ## ## ##
The following messages are present in the logs of the nsx-ovs container in the nsx-node-agent pod:
- 2025-10-01T08:40:44Z <nsx-manager> NSX 1 - [nsx@6876 comp="nsx-container-node" subcomp="NSX-OVS" level="ERROR"] .usr.local.bin.start_ovs Failed to execute: 'nmcli con up ens192-ovs-intf'. Retrying after 1 seconds
  Error: Connection activation failed: Open vSwitch database connection failed
The journalctl output on the node shows messages such as:
- NetworkManager[1348]: <warn> device (ens192-ovs-port)[Open vSwitch Port]: device ens192 could not be added to a ovs port: Error running the transaction: timed out: "where" clause test failed
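
As a quick way to spot the first symptom above across many nodes, the `oc get pods` output can be filtered for nsx-node-agent pods in CrashLoopBackOff. This is a minimal sketch: the function name `crashlooping_agents` is hypothetical, and it assumes the default single-namespace column layout (NAME, READY, STATUS, ...) rather than the layout produced by `-A`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `oc get pods` output on stdin and prints the
# names of nsx-node-agent pods whose STATUS column is CrashLoopBackOff.
# Assumes the default column order: NAME READY STATUS RESTARTS AGE.
crashlooping_agents() {
  awk '$1 ~ /^nsx-node-agent-/ && $3 == "CrashLoopBackOff" { print $1 }'
}

# Example usage (the namespace is an assumption; use the one where NCP runs):
#   oc get pods -n nsx-system | crashlooping_agents
```

An empty result means no nsx-node-agent pod in that namespace is currently reported as CrashLoopBackOff.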

Environment

VMware NSX-T Data Center

VMware NSX

OpenShift 4.15 and above

Cause

A change in RHCOS (Red Hat Enterprise Linux CoreOS) causes a conflict between the OVS instance running in the nsx-ovs container and the OVS instance running in the host OS. The failure occurs because, while the host OS is trying to restart OVS, the nsx-ovs container might connect to the OVS DB instance running on the host rather than the one running in the container.

Resolution

This issue is resolved in VMware Cloud Foundation 9.0 and above along with NCP 4.2.3, available at Broadcom downloads.

If you are having difficulty finding and downloading software, please review the Download Broadcom products and software KB.

Workaround:
Perform the following steps on each node:

1. systemctl stop openvswitch
2. systemctl disable openvswitch
3. systemctl mask openvswitch (this ensures the change persists across reboots)
4. systemctl restart NetworkManager
5. Delete the nsx-node-agent pod.

Alternatively, instead of Step 4, reboot the node.
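
The node-level commands above can be wrapped in a small script so that they run in the same order on every node. This is a sketch only: the function names `run` and `apply_ovs_workaround` and the `DRY_RUN` variable are hypothetical, and the script is assumed to run as root on the affected node. By default it only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/usr/bin/env bash
# Sketch of the per-node workaround. DRY_RUN defaults to 1, so the commands
# are printed rather than executed unless DRY_RUN=0 is set explicitly.

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

apply_ovs_workaround() {
  # Stop the host OVS service and keep it from starting again.
  run systemctl stop openvswitch || return 1
  run systemctl disable openvswitch || return 1
  # Masking keeps the service disabled across reboots.
  run systemctl mask openvswitch || return 1
  # Restart NetworkManager (a node reboot is the documented alternative).
  run systemctl restart NetworkManager || return 1
}

# Example usage on the node:
#   DRY_RUN=0 apply_ovs_workaround
# Then, from a host with cluster access (pod name and namespace are placeholders):
#   oc delete pod <nsx-node-agent-pod-name> -n <namespace>
```

After a real run, `systemctl is-enabled openvswitch` should report `masked`, which is a simple way to confirm the change will survive a reboot.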

Additional Information

Refer to the KB article "OpenShift nodes become 'NotReady' and unreachable over OVS uplink interface" for the same behavior with additional symptoms and checks.