NSX upgrade fails during edge node upgrade step.

Products

VMware NSX

Issue/Introduction

Edge node upgrade fails with the following errors:
"Upgrade Agent on Edge node xxxx is unreachable. Restart the Upgrade Agent service and check network connectivity."
"Management plane connection status of the edge transport node xxxx is DOWN"
The NSX upgrade fails as soon as the edge upgrade fails.
The edge node is inaccessible and in a hung state when accessed from vCenter.
The IP address of the edge node is not visible in vCenter or in the host web client.
Any operation performed on the edge node (for example, migration or power off) fails with the error:
"Another task is in progress"
Enabling SSH on the host where the edge node resides, or putting that host into maintenance mode, fails with the error:
"An error occurred during the host configuration"
Enabling Shell and SSH from iDRAC fails.

Commands fail to start on the host, and the following error is logged in /var/log/vmkernel.log:

YYYY-MM-DDThh:mm:ss.397Z In(18#) vmkernel: cpu##:95538## opID=c1aae###)World: ##: VC opID m1hkp###-21###-auto-gag-h5:70002###-2#-60-#### maps to vmkernel opID c1aae###
YYYY-MM-DDThh:mm:ss.397Z Wa(18#) vmkwarning: cpu##:95538## opID=c1aae###)WARNING: Sched: vm 160089##: 63##: could not create container group, status: Limit exceeded
YYYY-MM-DDThh:mm:ss.397Z Wa(18#) vmkwarning: cpu##:95538## opID=c1aae###)WARNING: Sched: vm 160089##: 63##: could not create container group, status: Limit exceeded

Errors similar to the following appear in /var/run/log/hostd.log:

YYYY-MM-DDThh:mm:ss.208Z Er(163) Hostd[20996##]: [Originator@6876 sub=SysCommandPosix opID=CSMM-domain-c370915-39762-d### sid=52738### user=vpxuser] Failed to ForkExec /usr/lib/vmware/clusterAgent/bin/clusterAdmin: File too large
YYYY-MM-DDThh:mm:ss.443Z Er(163) Hostd[20996##]: [Originator@6876 sub=SysCommandPosix opID=CSMM-domain-c370915-39763-d### sid=52738### user=vpxuser] Failed to ForkExec /usr/lib/vmware/clusterAgent/bin/clusterAdmin: File too large
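
The two log signatures above can be checked for mechanically. The following is a minimal sketch, not a supported tool. It assumes the log files have been copied off the host (for example, from a support bundle or after a reboot) and are passed as command-line arguments. The function names and signature labels are illustrative, and the patterns are taken only from the sample lines in this article.

```python
import re
import sys
from pathlib import Path

# Patterns taken from the sample log lines in this article.
# The labels are illustrative, not official error names.
SIGNATURES = {
    "vmkernel: container group limit": re.compile(
        r"could not create container group, status: Limit exceeded"
    ),
    "hostd: ForkExec failure": re.compile(
        r"Failed to ForkExec \S+: File too large"
    ),
}


def count_signatures(lines):
    """Return a dict mapping each signature label to its number of matching lines."""
    counts = {name: 0 for name in SIGNATURES}
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def scan_files(paths):
    """Sum signature counts across the given files, skipping paths that are not files."""
    total = {name: 0 for name in SIGNATURES}
    for path in paths:
        p = Path(path)
        if not p.is_file():
            continue
        # errors="replace" keeps the scan going if a log contains odd bytes.
        with p.open(errors="replace") as fh:
            for name, n in count_signatures(fh).items():
                total[name] += n
    return total


if __name__ == "__main__":
    for name, n in scan_files(sys.argv[1:]).items():
        print(f"{name}: {n}")
```

A non-zero count for both signatures is consistent with the symptoms described in this article, but it does not confirm the cause on its own.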

Environment

VMware NSX-T Data Center

Cause

If processes are repeatedly started in numbers that exceed the limit allowed by the system, or repeatedly fail to start because of insufficient memory resources, the host can reach a state in which no new processes can be started. This can cause the ESXi host to become unresponsive.

When the ESXi host and its virtual machines stop responding to changes, the required updates to the host and edge nodes cannot be applied, which ultimately causes the NSX upgrade to fail.

Resolution

This issue is fixed in ESXi 8.0u3e, which is available for download from the Broadcom Support Portal.

Workaround:

Reboot the ESXi host to restore responsiveness and SSH access.

Additional Information

For further information on troubleshooting SSH connectivity issues in ESXi 8.0u3, refer to:

SSH attempt to unresponsive ESXi 8.0u3 host fails and DCUI errors out with "/bin/dcuiweasel: line #: can't fork: File too large"