YYYY-MM-DDTHH:MM:SS.NNNZ ##### kernel - - - [#####.#####] INFO: task <Service_Name>:<PID> blocked for more than 120 seconds.
VMware NSX 4.2.3.3
The problem stems from a race condition between the nsx-edge-nsd and nsx-edge-datapath systemd services on the Edge.
These two services start in parallel when exiting Maintenance Mode.
Both services attempt to configure kernel network interfaces at the same time, which causes a deadlock on the Routing Netlink (RTNL) mutex in the kernel and blocks multiple threads and cores.
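To check whether a node shows this symptom, the kernel hung-task messages can be filtered for the names of the blocked tasks. The following is a minimal sketch, not a supported tool: the helper name hung_tasks is hypothetical, and reading the kernel log with dmesg from the root shell is an assumption about how logs are accessed on the Edge.

```shell
#!/bin/sh
# hung_tasks (hypothetical helper): read kernel log lines on stdin and print
# each distinct "<task>:<pid>" that the hung-task detector reported as blocked.
hung_tasks() {
    sed -n 's/.*INFO: task \(.*\) blocked for more than [0-9]* seconds.*/\1/p' | sort -u
}

# Example (assumes dmesg is readable from the root shell):
#   dmesg | hung_tasks
```

Lines that do not match the "blocked for more than" message are ignored, so the helper can be fed a full log without pre-filtering.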
Workaround:
The workaround makes the nsx-edge-nsd systemd service start after the nsx-edge-datapath systemd service, so the two services no longer race to configure the interfaces.
start engineer
sed -i '/PartOf=docker.service/a After=nsx-edge-datapath.service' /lib/systemd/system/nsx-edge-nsd.service
systemctl daemon-reload
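Running the sed command above a second time would append a duplicate After= line. The sketch below shows one way to guard against that, exercised on a scratch file rather than the real unit. The helper name add_ordering and the scratch file contents are hypothetical; only the PartOf= line and the sed expression come from the workaround. It assumes GNU sed.

```shell
#!/bin/sh
# add_ordering (hypothetical helper): append the After= line below PartOf=
# only if it is not already present, so repeated runs do not duplicate it.
add_ordering() {
    grep -q '^After=nsx-edge-datapath.service$' "$1" ||
        sed -i '/PartOf=docker.service/a After=nsx-edge-datapath.service' "$1"
}

# Scratch unit file: a hypothetical stand-in, NOT the real nsx-edge-nsd.service.
unit=$(mktemp)
printf '%s\n' '[Unit]' 'PartOf=docker.service' '' '[Service]' 'ExecStart=/bin/true' > "$unit"

add_ordering "$unit"
add_ordering "$unit"   # second call is a no-op

# Print the PartOf= line and the line that follows it.
grep -n -A1 'PartOf=docker.service' "$unit"
```

After the real unit file has been edited and systemctl daemon-reload has run, systemctl show -p After nsx-edge-nsd.service should list nsx-edge-datapath.service among the units it is ordered after.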
Note:
To revert the configuration, remove the added line from /lib/systemd/system/nsx-edge-nsd.service and run systemctl daemon-reload again.
start engineer
sed -i '/After=nsx-edge-datapath.service/d' /lib/systemd/system/nsx-edge-nsd.service
systemctl daemon-reload
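As a sanity check, the apply and revert commands can be run back to back on a scratch copy to confirm that the revert restores the file exactly. This is a sketch under the same assumptions as before: GNU sed, and a hypothetical stand-in file instead of the real unit.

```shell
#!/bin/sh
# Scratch files: a hypothetical stand-in for the unit, plus a working copy.
orig=$(mktemp)
work=$(mktemp)
printf '%s\n' '[Unit]' 'PartOf=docker.service' '' '[Service]' 'ExecStart=/bin/true' > "$orig"
cp "$orig" "$work"

# Apply the workaround, then revert it, using the same sed expressions.
sed -i '/PartOf=docker.service/a After=nsx-edge-datapath.service' "$work"
sed -i '/After=nsx-edge-datapath.service/d' "$work"

# cmp -s exits 0 only when the two files are byte-for-byte identical.
if cmp -s "$orig" "$work"; then result=restored; else result=differs; fi
echo "$result"
```

Note that the delete expression removes every line containing After=nsx-edge-datapath.service, so it also cleans up duplicates left by running the apply command more than once.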