"Waiting for MPA" error or "Heartbeating between NSX management node and host [host-uuid] is down" error after ESX 7.x Host upgrades from a version greater than or equal to NSX 4.2.1.0 to a version less than 4.2.3

Products

VMware NSX

Issue/Introduction

ESXi host transport nodes with version 7.x are being upgraded.
NSX is being upgraded from a version greater than or equal to NSX 4.2.1.0 to any version less than 4.2.3.
Connection between NSX manager and host is down.

Possible errors seen:

- Upgrade of ESXi hosts via NSX Manager stalls and fails with error "Waiting for MPA".
- The host shows errors "Heartbeating between NSX management node and host [host-uuid] is down" and "Unexpected error while upgrading upgrade unit. Command IsHostInMaintenanceMode failed on host(######)."
- Upgrade of ESXi hosts via vLCM fails with the following error after remediation and host reboot:
  
  Upgrade failed: Failed to execute ESXi post upgrade dataplane check. Error occurred while transferring the upgrade scripts to host, SFHC connectivity may be down
- NSX UI shows the following alert/warning "Connection between host [host-uuid] and NSX Controller is DOWN. Response : Client is responding to heartbeats"
- NSX UI shows status Failed: "NSX service on the host are not at target version 4.#.#.#.###"
- NSX controller failure reason is CONTROLLER_REJECTED_HOST_CERT when command 'nsxcli -c get controllers' is run from host CLI:

  Controller IP    Port  SSL      Status        Is Physical Master  Session State  Controller FQDN  Failure Reason
  <Controller-IP>  1235  enabled  disconnected  true                down           NA               CONTROLLER_REJECTED_HOST_CERT
  <Controller-IP>  1235  enabled  not used      false               null           NA               NA
  <Controller-IP>  1235  enabled  not used      false               null           NA               NA

- Error message similar to the below may be seen in the ESXi host logs:
  - In log file: /var/run/log/esxupdate.log
    
    esxupdate: 12251955: LiveImageInstaller: DEBUG: Output: nsx-proxy being upgraded /etc/init.d/nsx-proxy: line 1: can't open /tmp/host-cert.bak: no such file /etc/init.d/nsx-proxy: line 1: can't open /tmp/host-privkey.bak: no such file sh: 2: unknown operand backup proxy certificate not found, creating Copying CCP config from backup Copying host config file from backup Copying appliance info file from backup /etc/init.d/nsx-proxy: line 1: can't open /tmp/host-cert.bak: no such file /etc/init.d/nsx-proxy: line 1: can't open /tmp/host-privkey.bak: no such file sh: 2: unknown operand tnuuid = ########-####-####-####-############. Generating host certificate with TN uuid = ########-####-####-####-############. Generating certificate using make_cert.py Generating a RSA private key **************************************************************************************************************************************************************************************************+++++ ************************************************************************************************************************************************************************************************************************************************************************************************************************+++++ writing new private key to '/tmp/host-privkey.pem' ----- Entering make_cert.py Running ['openssl', 'req', '-days', '3650', '-new', '-nodes', '-x509', '-keyout', '/tmp/host-privkey.pem', '-out', '/tmp/host-cert.pem', '-config', '/tmp/tmp.######', '-extensions', 'req_ext'] Execution of openssl req returned 0 in 0.363 seconds. nsx-proxy starts
  - In log file: /var/run/log/nsx-syslog
    
    nsx-proxy[12370596]: NSX 12370596 - [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="nsx-rpc" tid="12370681" level="WARNING"] RpcConnection[10 Connecting to ssl://<ESXI-IP/FQDN>:1234 0] Couldn't connect to ssl://<ESXI-IP/FQDN>:1234 (error: 336151576-tlsv1 alert unknown ca (SSL routines, ssl3_read_bytes))
    nsx-proxy[12370596]: NSX 12370596 - [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="nsx-net" tid="12370681" level="WARNING"] StreamConnection[5 Connecting to ssl://<ESXI-IP/FQDN>:1235 sid:5] Couldn't connect to 'ssl://<ESXI-IP/FQDN>:1235' (error: 336151574-sslv3 alert certificate unknown (SSL routines, ssl3_read_bytes)
- Error message similar to the below may be seen in the NSX Manager logs: /var/log/syslog
  
  NSX-MGR NSX 120080 - [nsx@6876 comp="nsx-manager" subcomp="appl-proxy" s2comp="nsx-net" tid="######" level="ERROR" errorCode="NET1111"] Certificate validation failed: 18-self-signed certificate#012Certificate: #012 Version: 3 (0x2) #012
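The log symptoms listed above can be checked with a short script. The following is a minimal sketch, not an official tool: the function name scan_log is hypothetical, and the signature strings are simply substrings copied from the log excerpts in this article. It assumes a POSIX shell with grep, such as the ESXi shell.

```shell
#!/bin/sh
# Hypothetical helper: report which known signatures appear in a log file.
# Usage: scan_log <log-file> <signature> [<signature> ...]
# Returns 0 if at least one signature matched, 1 if none matched,
# and 2 if the log file cannot be read.
scan_log() {
    log_file="$1"
    shift
    if [ ! -r "$log_file" ]; then
        echo "skip: $log_file not readable"
        return 2
    fi
    found=1
    for sig in "$@"; do
        # -F treats the signature as a fixed string, not a regular expression.
        if grep -qF -- "$sig" "$log_file"; then
            echo "match: $sig"
            found=0
        fi
    done
    return "$found"
}

# Example invocations on an affected ESXi host (paths taken from this article):
#   scan_log /var/run/log/esxupdate.log "backup proxy certificate not found"
#   scan_log /var/run/log/nsx-syslog "alert unknown ca" "alert certificate unknown"
```

A match only indicates that the log contains the same text as the excerpts above; confirm the diagnosis with the controller status check before applying the recovery steps.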

Environment

VMware NSX (upgrading on VMware ESXi 7.x only)
VMware NSX (upgrading from version >= 4.2.1.0 and < 4.2.3)
VMware NSX (upgrading to version > 4.2.1.0 and < 4.2.3)

Cause

The following points detail the cause of communication failure between the host transport node and the NSX controller after a VIB upgrade:

Communication Breakage: Post-VIB upgrade, the NSX controller fails to recognize the host transport node's new certificate. This new certificate was generated by nsx-proxy during its startup initialization script.
New Certificate Generation: The nsx-proxy generates new host transport node certificates because the VIB installation failed to restore the original certificate files.
Failed Restoration Cause: The VIB installation did not restore the certificate files because the "sticky bit" attribute was missing from the certificate files at the time of the VIB installation.
Sticky Bit Loss: The ESX host loses the sticky bit on these files whenever the ESX system restarts. These restarts may have occurred due to prior ESX or NSX upgrades.

Resolution

This issue is resolved in VMware NSX 4.2.3, available at Broadcom downloads.
Greenfield deployments with VMware NSX 4.2.3 and later versions do not have this issue.
Upgrades from VMware NSX versions greater than or equal to 4.2.1.0 to VMware NSX 4.2.3 or later also do not have this issue.
If you are having difficulty finding and downloading software, please review the Download Broadcom products and software KB.

Workaround and Preventive steps:

Proactive Prevention Steps

The 'NSX transport node disconnected' problem can be prevented before the upgrade activity starts.
If an ESXi host with version 7.x currently has the NSX VIBs installed, the host-cert.pem and host-privkey.pem files are expected to have the permissions shown below:

File path: /etc/vmware/nsx

Expected permissions for the files in question:

-rw-rw-rwT 1 root root 1610 Jan 22 10:01 host-cert.pem

-rw-rw-rwT 1 root root 1704 Jan 22 10:01 host-privkey.pem

If the permissions for host-cert.pem and host-privkey.pem differ from the above, the host is expected to hit the 'NSX transport node disconnected' problem during upgrade to 4.2.1.x or 4.2.2.x.
Validate the permissions of these files on each host before the upgrade and correct them where needed. To correct the permissions:

1. SSH to the host as the root user.
2. Run the following command:

   chmod 1666 /etc/vmware/nsx/host-cert.pem /etc/vmware/nsx/host-privkey.pem
3. Verify with the 'll' command that these files now have the correct permissions.
4. Initiate the upgrade to NSX 4.2.1.x or 4.2.2.x.
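The permission check and correction described above can be scripted when many hosts need to be validated. The following is a minimal sketch under stated assumptions: the function names ensure_sticky and check_nsx_certs are hypothetical, the shell is assumed to support the POSIX test operator -k (sticky bit set), and only the sticky bit is tested, not the full 1666 mode.

```shell
#!/bin/sh
# Hypothetical helper: make sure one file carries the sticky bit.
# If the bit is missing, apply the mode given in this article (1666).
ensure_sticky() {
    f="$1"
    if [ ! -f "$f" ]; then
        echo "missing: $f"
        return 1
    fi
    if [ -k "$f" ]; then
        # Sticky bit already present; nothing to change.
        echo "ok: $f"
    else
        chmod 1666 "$f" && echo "fixed: $f"
    fi
}

# Hypothetical helper: check both NSX host certificate files in a directory.
# The directory defaults to the path named in this article.
check_nsx_certs() {
    dir="${1:-/etc/vmware/nsx}"
    rc=0
    for name in host-cert.pem host-privkey.pem; do
        ensure_sticky "$dir/$name" || rc=1
    done
    return "$rc"
}
```

Running check_nsx_certs with no argument on the host would act on /etc/vmware/nsx; passing a directory makes it possible to try the logic on copies first. Listing the files afterwards should show the -rw-rw-rwT mode string given above.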

Recovery Steps

1. Open an SSH session to the ESXi host experiencing the issue and confirm that none of the three NSX controllers is in a connected state by running the command nsxcli -c get controllers.
   Example response:

   Controller IP    Port  SSL      Status        Is Physical Master  Session State  Controller FQDN  Failure Reason
   <Controller-IP>  1235  enabled  disconnected  true                down           NA               CONTROLLER_REJECTED_HOST_CERT
   <Controller-IP>  1235  enabled  not used      false               null           NA               NA
   <Controller-IP>  1235  enabled  not used      false               null           NA               NA

   Note: In a working configuration, one controller has the connected status and the other two display the not used status. If a controller shows connected, refresh the UI and confirm that the status is green. If no controller shows connected, continue to the next step.
2. Open an SSH session to one of the NSX Manager nodes as admin and run the command get certificate api thumbprint.

   Note: The command output is an alphanumeric string that is unique to this NSX Manager.
3. On the ESXi host, push the host certificate to the Management Plane:

   ESXi> nsxcli -c push host-certificate <NSX Manager IP or FQDN> username admin thumbprint <thumbprint obtained in step 2>
4. Confirm the controller status is connected:

   ESXi> nsxcli -c get controllers

   Note: Confirm the controller connection state is green on the UI for this host transport node.

Note: If the ESXi host displays Failure Reason MAINTAINANCE_MODE as shown below, take the following steps:

nsxcli -c get controllers

Controller IP    Port  SSL      Status        Is Physical Master  Session State  Controller FQDN  Failure Reason
<Controller-IP>  1235  enabled  disconnected  true                down           NA               MAINTAINANCE_MODE
<Controller-IP>  1235  enabled  not used      false               null           NA               MAINTAINANCE_MODE
<Controller-IP>  1235  enabled  not used      false               null           NA               MAINTAINANCE_MODE

In the NSX UI, go to System > Fabric > Hosts, find the host, and select its check box.
Click on Actions.
Click on Exit NSX Maintenance Mode.
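
The recovery steps branch on the Failure Reason column of the get controllers output. The following is a minimal sketch of that decision, assuming the output is piped in as text: the function name classify_controllers and its verdict words are hypothetical, and the match for a healthy host assumes that the word connected appears as its own space-separated field.

```shell
#!/bin/sh
# Hypothetical helper: read 'get controllers' output on stdin and print a
# one-word verdict based on the strings shown in this article.
classify_controllers() {
    out=$(cat)
    case "$out" in
        # Host certificate rejected: follow the certificate push steps.
        *CONTROLLER_REJECTED_HOST_CERT*) echo "rejected_cert" ;;
        # Spelling matches the failure reason as printed in this article.
        *MAINTAINANCE_MODE*)             echo "maintenance_mode" ;;
        # Surrounding spaces keep 'disconnected' from matching here.
        *" connected "*)                 echo "connected" ;;
        *)                               echo "unknown" ;;
    esac
}

# Example invocation on the ESXi host:
#   nsxcli -c get controllers | classify_controllers
```

A verdict of rejected_cert corresponds to the certificate push procedure, and maintenance_mode corresponds to exiting NSX Maintenance Mode from the UI. Any other result should be checked against the full command output by hand.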

Additional Information

If this KB did not help resolve your issue, you can review the following KBs for further troubleshooting steps:
Loss of Controller Connectivity after Host Upgrade
Troubleshooting NSX Host Upgrade Failures