nsx-node-agent went to CrashLoopBackOff during OpenShift cluster upgrade

Article ID: 412855

Updated On:

Products

VMware NSX

Issue/Introduction

  • Associated pods are seen in "CrashLoopBackOff" state  
    • # oc get pods
      nsx-node-agent-g2### 2/3 CrashLoopBackOff 27 (3m11s ago) 26m <<<<<<<
      nsx-node-agent-2p### 0/3 CrashLoopBackOff 21 (4m52s ago) 17m <<<<<<<<<
  • The OpenShift cluster is being upgraded from 4.15 to a later version.
  • For these pods, the following error is seen:
    • <nsx-node-agent-2p###_nsx-node-agent.log>
      2025-10-01T08:38:57.060Z <nsx-manager> NSX 17 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="WARNING"] nsx_ujo.common.utils nsx_ujo.agent.agent exiting for NSXNodeAgentException: Unexpected error from nsx_node_agent: Can not find host MAC address for interface ens192.
      2025-10-01T08:38:57.068Z <nsx-manager> NSX 13 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="WARNING"] oslo_privsep.comm Unexpected error: <class 'OSError'>
      Traceback (most recent call last):
        File "/usr/local/bin/nsx_node_agent", line 10, in <module>
          sys.exit(main())
        File "/usr/local/lib/python3.9/site-packages/nsx_ujo/cmd/nsx_node_agent.py", line 17, in main
  • "MISS_VERSION_HANDSHAKE" is seen for the hyperbus connection on the ESXi host:
    • nsxcli -c get hyperbus connection info
      Wed Oct 01 2025 UTC 08:11:16.115
                 VIFID                           Connection                 Status                            HostSwitchID
      f72c17bd-####-####-####-eded900#####      169.254.1.##:2345       MISS_VERSION_HANDSHAKE         ## ## ## ## 7c de 45 ##-42 ca ## ## ## ## ## ##  
  • The following entries appear in the logs of the nsx-ovs container in the nsx-node-agent pod:
    • 2025-10-01T08:40:44Z <nsx-manager> NSX 1 - [nsx@6876 comp="nsx-container-node" subcomp="NSX-OVS" level="ERROR"] .usr.local.bin.start_ovs Failed to execute: 'nmcli con up ens192-ovs-intf'. Retrying after 1 seconds
      Error: Connection activation failed: Open vSwitch database connection failed
  • The journalctl output on the node contains messages such as:
    • NetworkManager[1348]: <warn> device (ens192-ovs-port)[Open vSwitch Port]: device ens192 could not be added to a ovs port: Error running the transaction: timed out: "where" clause test failed
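
The first and third symptoms above can be checked quickly with two small filters. This is a minimal sketch, not part of the official procedure: the function names are hypothetical, and the pod filter assumes the default `oc get pods` column layout (NAME, READY, STATUS, ...), so it will not work with `-A` or custom output formats.

```shell
#!/bin/sh
# Reads `oc get pods` output on stdin and prints the names of
# nsx-node-agent pods whose STATUS column is CrashLoopBackOff.
# Assumes the default column order: NAME READY STATUS ...
crashlooping_node_agents() {
    awk '$1 ~ /^nsx-node-agent/ && $3 == "CrashLoopBackOff" { print $1 }'
}

# Reads `nsxcli -c get hyperbus connection info` output on stdin and
# prints only the rows reporting MISS_VERSION_HANDSHAKE.
# `|| true` keeps the exit status zero when no row matches.
hyperbus_handshake_failures() {
    grep 'MISS_VERSION_HANDSHAKE' || true
}
```

For example, pipe `oc get pods` (run in the namespace where NCP is deployed) into `crashlooping_node_agents`, and pipe `nsxcli -c get hyperbus connection info` on the ESXi host into `hyperbus_handshake_failures`. Empty output from both means neither symptom is currently visible.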

Environment

VMware NSX-T Data Center 

VMware NSX 

OpenShift 4.15 and above

Cause

A change in RHCOS (Red Hat Enterprise Linux CoreOS) causes a conflict between the OVS instance running in the nsx-ovs container and the OVS instance running in the host OS. While the host OS is trying to restart OVS, the nsx-ovs container can end up connecting to the OVS DB instance running on the host rather than the one running in the container.

 

Resolution

This issue is resolved in VMware Cloud Foundation 9.0 and above along with NCP 4.2.3, available at Broadcom downloads.

If you are having difficulty finding and downloading software, please review the Download Broadcom products and software KB.

Workaround:
Perform the following sequence of operations on each node:

  1. systemctl stop openvswitch
  2. systemctl disable openvswitch
  3. systemctl mask openvswitch (this ensures the change persists across reboots)
  4. systemctl restart NetworkManager
  5. Delete the nsx-node-agent pod

Alternatively, reboot the node instead of performing step 4.
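
The workaround steps can be wrapped in a small script for repeated use across nodes. The sketch below is illustrative only: the function names and the `DRY_RUN` switch are hypothetical, and the pod name and namespace passed to `delete_node_agent_pod` are placeholders that must match your deployment. By default the script only prints the commands; set `DRY_RUN=0` to execute them, which requires root on the node for the `systemctl` calls.

```shell
#!/bin/sh
# DRY_RUN defaults to 1: commands are printed, not executed.
# Set DRY_RUN=0 to run them for real (systemctl calls need root on the node).
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Steps 1-4: stop, disable and mask the host OVS service, then restart
# NetworkManager. Stops at the first command that fails.
disable_host_ovs() {
    run systemctl stop openvswitch &&
    run systemctl disable openvswitch &&
    run systemctl mask openvswitch &&
    run systemctl restart NetworkManager
}

# Step 5: delete the nsx-node-agent pod so that it is recreated.
# $1 = pod name, $2 = namespace (both are placeholders supplied by the caller).
delete_node_agent_pod() {
    run oc delete pod "$1" -n "$2"
}
```

Calling `disable_host_ovs` on the node and then `delete_node_agent_pod <pod-name> <namespace>` from a host with cluster access reproduces the sequence above. Reviewing the dry-run output first is a cheap way to confirm what will run on each node.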

Additional Information

For the same behavior with additional symptoms and checks, refer to the KB article OpenShift nodes become 'NotReady' and unreachable over OVS uplink interface.