Bare Metal Edge node hangs or loses management communication when exiting NSX Maintenance Mode

Article ID: 439652

Products

VMware NSX

Issue/Introduction

  • This issue affects Bare Metal Edge nodes that use an in-band management interface over an i40e-based pNIC.
  • Some or all of the following symptoms may be seen while such a node is exiting Maintenance Mode:
    • Dataplane packet loss.
    • Console operations become unresponsive.
    • The kernel log (/var/log/kern.log) may contain entries similar to the following:
      YYYY-MM-DDTHH:MM:SS.NNNZ ##### kernel - - - [#####.#####] INFO: task <Service_Name>:<PID> blocked for more than 120 seconds.
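
To check for the hung-task entries described above, the kernel log can be searched for the message pattern. The following is a minimal sketch, assuming a root shell on the Edge; find_hung_tasks is a hypothetical helper name, not an NSX command, and the pattern is derived only from the sample entry shown above.

```shell
# Hypothetical helper: print hung-task warnings from a kernel log.
# Defaults to /var/log/kern.log, the path named in this article.
find_hung_tasks() {
    log_file="${1:-/var/log/kern.log}"
    # Match entries of the form
    # "INFO: task <Service_Name>:<PID> blocked for more than N seconds"
    grep -E 'INFO: task .+:[0-9]+ blocked for more than [0-9]+ seconds' "$log_file"
}
```

Running find_hung_tasks with no argument searches /var/log/kern.log. No output means no matching entries were found in that file; rotated logs would need to be passed explicitly.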

 

Environment

VMware NSX 4.2.3.3

Cause

The problem stems from a race condition between the nsx-edge-nsd and nsx-edge-datapath systemd services on the Edge. 
These two services start in parallel when exiting Maintenance Mode. 
Both services attempt to configure kernel network interfaces simultaneously, which causes a deadlock on the kernel's Routing Netlink (RTNL) mutex and blocks multiple threads/cores.

Resolution

Workaround:
The workaround makes the nsx-edge-nsd systemd service start after the nsx-edge-datapath systemd service, which avoids the race condition between the two services.

  1. Log in to the Edge CLI as the admin user, then run the following command to switch to the root user.
    start engineer
  2. Update the nsx-edge-nsd.service file to start after nsx-edge-datapath.service and reload the systemd daemon.
    sed -i '/PartOf=docker.service/a After=nsx-edge-datapath.service' /lib/systemd/system/nsx-edge-nsd.service
    systemctl daemon-reload
  3. Verify that the Edge can now exit Maintenance Mode without issues.
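
The sed command in step 2 appends the line each time it runs, and it appends nothing if the PartOf=docker.service line is missing from the unit file. The following is a minimal sketch of a guarded version of that step, assuming GNU sed and a root shell; add_datapath_ordering is a hypothetical helper name, not an NSX command.

```shell
# Hypothetical helper: apply the ordering change once and confirm it landed.
add_datapath_ordering() {
    unit_file="${1:-/lib/systemd/system/nsx-edge-nsd.service}"
    ordering='After=nsx-edge-datapath.service'

    # Skip the edit if the exact line is already present (avoids duplicates on re-run).
    if grep -qxF "$ordering" "$unit_file"; then
        echo "already present"
        return 0
    fi

    # Same sed expression as in step 2 of the workaround.
    sed -i '/PartOf=docker.service/a After=nsx-edge-datapath.service' "$unit_file"

    # sed appends nothing if the PartOf= anchor line is missing, so check the result.
    if grep -qxF "$ordering" "$unit_file"; then
        echo "added"
    else
        echo "not added: anchor line PartOf=docker.service not found" >&2
        return 1
    fi
}
```

After the helper reports "added", systemctl daemon-reload is still required, as in step 2, for systemd to pick up the change.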


Additional Information

Note:

To revert the configuration, remove the added line from /lib/systemd/system/nsx-edge-nsd.service and run systemctl daemon-reload again.

  1. Log in to the Edge CLI as the admin user, then run the following command to switch to the root user.
    start engineer
  2. Execute the following command to roll back the service startup order.
    sed -i '/After=nsx-edge-datapath.service/d' /lib/systemd/system/nsx-edge-nsd.service
    systemctl daemon-reload
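
To confirm whether the ordering line is currently in the unit file, for example after the rollback above, the file can be checked directly. This is a minimal sketch assuming a root shell; ordering_state is a hypothetical helper name, not an NSX command.

```shell
# Hypothetical helper: report whether the workaround line is in the unit file.
ordering_state() {
    unit_file="${1:-/lib/systemd/system/nsx-edge-nsd.service}"
    # -x matches the whole line, -F treats the pattern as a fixed string.
    if grep -qxF 'After=nsx-edge-datapath.service' "$unit_file"; then
        echo "present"
    else
        echo "absent"
    fi
}
```

After a successful rollback, ordering_state should print "absent". To see the ordering that systemd has actually loaded, as opposed to what is on disk, systemctl show -p After nsx-edge-nsd.service can be run after the daemon-reload.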