"Waiting for MPA" error after ESX Host upgrade from NSX 4.2.1.0 gets stuck at 45%

Article ID: 393792

Products

VMware NSX

Issue/Introduction

  • NSX is being upgraded from version 4.2.1.0 to any higher version.
  • ESXi host transport nodes are being upgraded.
  • ESXi host version is 7.0.x.
  • Upgrade of ESXi hosts via NSX Manager stalls at 45% before failing with error "Waiting for MPA".
    • Alternatively, upgrade of ESXi hosts via vLCM fails with the following error after remediation and host reboot:
      "Upgrade failed: Failed to execute ESXi post upgrade dataplane check. Error occurred while transferring the upgrade scripts to host, SFHC connectivity may be down".
    • Another possible alternative is the host showing the errors "Heartbeating between NSX management node and host ##### is down" and "Unexpected error while upgrading upgrade unit. Command IsHostInMaintenanceMode failed on host(######)".
  • NSX controller failure reason is CONTROLLER_REJECTED_HOST_CERT when command 'nsxcli -c get controllers' is run from host CLI:
    Controller IP     Port    SSL         Status          Is Physical  Master   Session State  Controller FQDN       Failure Reason
    <Controller-IP>   1235   enabled     disconnected       true                  down           NA                    CONTROLLER_REJECTED_HOST_CERT
    <Controller-IP>   1235   enabled     not used           false                 null           NA                    NA
    <Controller-IP>   1235   enabled     not used           false                 null           NA                    NA
  • Error messages similar to the following may be seen in the ESXi host logs:

    /var/run/log/esxupdate.log

    esxupdate: 12251955: LiveImageInstaller: DEBUG: Output: nsx-proxy being upgraded /etc/init.d/nsx-proxy: line 1: can't open /tmp/host-cert.bak: no such file /etc/init.d/nsx-proxy: line 1: can't open /tmp/host-privkey.bak: no such file sh: 2: unknown operand backup proxy certificate not found, creating Copying CCP config from backup Copying host config file from backup Copying appliance info file from backup /etc/init.d/nsx-proxy: line 1: can't open /tmp/host-cert.bak: no such file /etc/init.d/nsx-proxy: line 1: can't open /tmp/host-privkey.bak: no such file sh: 2: unknown operand tnuuid = ########-####-####-####-############. Generating host certificate with TN uuid = ########-####-####-####-############. Generating certificate using make_cert.py Generating a RSA private key **************************************************************************************************************************************************************************************************+++++ ************************************************************************************************************************************************************************************************************************************************************************************************************************+++++ writing new private key to '/tmp/host-privkey.pem' ----- Entering make_cert.py Running ['openssl', 'req', '-days', '3650', '-new', '-nodes', '-x509', '-keyout', '/tmp/host-privkey.pem', '-out', '/tmp/host-cert.pem', '-config', '/tmp/tmp.######', '-extensions', 'req_ext'] Execution of openssl req returned 0 in 0.363 seconds. nsx-proxy starts

    /var/run/log/nsx-syslog

    INFO task-executor-9-1-workitem-HOST-### InspectionTask 1371044 - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="upgrade-coordinator"] [HUT] For host <ESXI-IP/FQDN>, error is Issue: Heartbeating between NSX management node and host <ESXI-IP/FQDN> is down.
    nsx-proxy[12370596]: NSX 12370596 - [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="nsx-rpc" tid="12370681" level="WARNING"] RpcConnection[10 Connecting to ssl://<ESXI-IP/FQDN>:1234 0] Couldn't connect to ssl://<ESXI-IP/FQDN>:1234 (error: 336151576-tlsv1 alert unknown ca (SSL routines, ssl3_read_bytes))
    nsx-proxy[12370596]: NSX 12370596 - [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="nsx-net" tid="12370681" level="WARNING"] StreamConnection[5 Connecting to ssl://<ESXI-IP/FQDN>:1235 sid:5] Couldn't connect to 'ssl://<ESXI-IP/FQDN>:1235' (error: 336151574-sslv3 alert certificate unknown (SSL routines, ssl3_read_bytes)
    nsx-proxy[7014696]: NSX 7014696 - [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" tid="7014696" level="INFO"] Write ccp session message to nestdb ccp_id {   7caaxxxx-1cxx-46xx-a6xx-77c06exxxxxx } ip {   ipv4: 214xxx601 } server_port: 1235 fqdn: "" state: DISCONNECTED master: false
    nsx-proxy[7014696]: NSX 7014696 - [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" tid="7014696" level="INFO"] Write ccp session message to nestdb ccp_id {   a0fdxxxx-c0xx-43xx-a7xx-8d946bxxxxxx } ip {   ipv4: 214xxx602 } server_port: 1235 fqdn: "" state: DISCONNECTED master: false
    nsx-proxy[7014696]: NSX 7014696 - [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" tid="7014696" level="INFO"] Write ccp session message to nestdb ccp_id {   0058xxxx-93xx-4axx-90b3-98d041xxxxxx } ip {   ipv4: 214xxx600 } server_port: 1235 fqdn: "" state: DISCONNECTED master: true failure_reason: CONTROLLER_REJECTED_HOST_CERT
    nsx-proxy[7014696]: NSX 7014696 - [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" tid="7014696" level="INFO"] CcpConnection: Connecting to new CCP a0fdxxxx-c0xx-43xx-a7a8-8d946bxxxxxx.
    nsx-proxy[7014696]: NSX 7014696 - [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" tid="7014696" level="INFO"] CcpConnection: Disconnecting from ssl://128.x.x.16:1235
  • Error messages similar to the following may be seen in the NSX Manager logs:

    /var/log/syslog

    NSX-MGR NSX 1391201 - [nsx@6876 audit="true" comp="nsx-manager" level="INFO" subcomp="upgrade-coordinator"] UserName="<Username>", Src="<IP-address>", ModuleName="Upgrade", Operation="GetUpgradestatusSunmary", Operation status="success", New value=[{"selection_status": "ALL" }] 
    NSX-MGR NSX 120080 - [nsx@6876 comp="nsx-manager" subcomp="appl-proxy" s2comp="nsx-net" tid="######" level="ERROR" errorCode="NET1111"] Certificate validation failed: 18-self-signed certificate#012Certificate: #012 Version: 3 (0x2) #012
  • The following alert or warning may also appear in the NSX UI:

        "Connection between host [host-uuid] and NSX Controller is DOWN. Response : Client is responding to heartbeats"

  • After the host enters the Disconnected state, clicking 'View Error' and then 'Resolve' may clear the error and allow the upgrade to complete on some hosts.
  • NSX UI will show status Failed:
    "NSX service on the host are not at target version 4.#.#.#.###"

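The certificate-related log signatures listed above can be searched for from a shell on the host. The following is a minimal sketch, not part of the documented procedure: the function name count_cert_rejections is hypothetical, and the search patterns are copied from the sample log lines in this article, so they may need adjusting for other builds. It assumes grep with -E and -c support is available.

```shell
# Hypothetical helper: count log lines that carry one of the
# certificate-rejection signatures shown in this article.
# Usage: count_cert_rejections /var/run/log/nsx-syslog
count_cert_rejections() {
    log_file="$1"
    if [ ! -r "$log_file" ]; then
        echo "cannot read $log_file" >&2
        return 2
    fi
    # -c prints the number of matching lines; -E enables alternation.
    # grep exits non-zero when nothing matches, so "|| true" keeps the
    # function's exit status at 0 while still printing the count (0).
    grep -c -E 'CONTROLLER_REJECTED_HOST_CERT|alert unknown ca|alert certificate unknown' "$log_file" || true
}
```

A non-zero count only suggests that the host matches the symptoms described here; the controller status check in the workaround remains the confirming step.
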
Environment

VMware NSX 4.2.1.x
VMware NSX 4.2.2.x
VMware ESXi 7.0.x

Cause

This behavior is the result of a known issue that prevents upgraded ESXi hosts from reconnecting to NSX Manager after the VIB upgrade. During the upgrade, nsx-proxy generates a new host transport node certificate as part of its startup init script. The NSX controller is not aware of this new certificate, so communication between the host transport node and the NSX controller breaks.

Resolution

This issue is resolved in VMware NSX 4.2.3, available at Broadcom downloads.
If you are having difficulty finding and downloading software, please review the Download Broadcom products and software KB.

Workaround:

Option A

When the host shows as stalled at 45% with "Waiting for MPA" on the upgrade page:

  1. Go to System > Fabric > Nodes in NSX Manager.
  2. Click the Error icon on the host which is stalled at 45%.
  3. Select the error on the host and click Resolve.
  4. If the issue is resolved, the upgrade will automatically resume from 45% and progress to completion.

If Option A did not resolve the issue, proceed with Option B:

Option B

  1. Open an SSH session to the ESXi host experiencing the issue and confirm that none of the three NSX controllers are in a connected state by running command nsxcli -c get controllers.
    Example response:
    Controller IP     Port   SSL       Status         Is Physical   Master   Session State   Controller FQDN   Failure Reason
    <Controller-IP>   1235   enabled   disconnected   true                   down            NA                CONTROLLER_REJECTED_HOST_CERT
    <Controller-IP>   1235   enabled   not used       false                  null            NA                NA
    <Controller-IP>   1235   enabled   not used       false                  null            NA                NA

    Note: In a working configuration, two controllers display the not used status and one controller has the connected status. If the NSX Controller shows connected, refresh the UI and confirm that the status is green. If the controller shows not connected, continue to the next step.

  2. Open an SSH session to one of the NSX Manager nodes as admin and run the command get certificate api thumbprint.

    Note: The command output is an alphanumeric string that is unique to this NSX Manager.

  3. On the ESXi host, push the host certificate to the Management Plane:
    ESXi> nsxcli -c push host-certificate <NSX Manager IP or FQDN> username admin thumbprint <thumbprint obtained in step #2>
  4. Confirm the controller status is connected.

    ESXi> nsxcli -c get controllers

    Note:
    Confirm the controller connection state is green on the UI for this host transport node.
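
The controller check in steps 1 and 4 can be reduced to a one-line verdict. The following is a minimal sketch under stated assumptions: the function name controller_state is hypothetical, awk is assumed to be available in the shell where it runs, and the column layout is assumed to match the example responses in this article (Status holds the single word connected or disconnected, and Failure Reason is the last field on the row).

```shell
# Hypothetical helper: summarize 'get controllers' output read from stdin.
# Prints "connected" if any row reports Status "connected"; otherwise
# prints "disconnected:<Failure Reason>" taken from the disconnected row.
controller_state() {
    awk '
        $1 == "Controller" { next }            # skip the header row
        NF == 0            { next }            # skip blank lines
        {
            for (i = 1; i <= NF; i++) {
                if ($i == "connected")    { found = 1 }
                if ($i == "disconnected") { reason = $NF }
            }
        }
        END {
            if (found)       print "connected"
            else if (reason) print "disconnected:" reason
            else             print "unknown"
        }
    '
}
```

On a host, the intended usage would be nsxcli -c get controllers | controller_state, run once before step 3 and again after it. Matching whole fields rather than substrings keeps "disconnected" from being counted as "connected".
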

 

Note: If the ESXi host displays Failure Reason MAINTAINANCE_MODE as shown below, take the following steps:

nsxcli -c get controllers
Controller IP     Port   SSL       Status         Is Physical   Master   Session State   Controller FQDN   Failure Reason
<Controller-IP>   1235   enabled   disconnected   true                   down            NA                MAINTAINANCE_MODE
<Controller-IP>   1235   enabled   not used       false                  null            NA                MAINTAINANCE_MODE
<Controller-IP>   1235   enabled   not used       false                  null            NA                MAINTAINANCE_MODE
  1. In the NSX UI, go to System > Fabric > Hosts, find the host, and select its check box.
  2. Click on Actions
  3. Click on Exit NSX Maintenance Mode
Note: If this issue continues, restart the following NSX services on the ESXi host:
ESXi> /etc/init.d/nsx-opsagent restart
ESXi> /etc/init.d/nsx-proxy restart

 

Reference:
Loss of Controller Connectivity after Host Upgrade

 

Proactive prevention:

The 'NSX transport node disconnected' problem can be prevented before the upgrade starts.
If an ESXi 7.0.x host currently has the NSX 4.2.1.0 VIBs installed, host-cert.pem and host-privkey.pem are expected to have the following permissions:

File path: /etc/vmware/nsx

Expected permissions for both files:

-rw-rw-rwT    1 root     root          1610 Jan 22 10:01 host-cert.pem

-rw-rw-rwT    1 root     root          1704 Jan 22 10:01 host-privkey.pem


If the permissions on host-cert.pem and host-privkey.pem differ from the above, they are wrong, and the host is expected to hit the 'NSX transport node disconnected' problem during the upgrade to 4.2.1.x or 4.2.2.x.
Check the permissions of these files on each host before the upgrade and correct them where needed:

  • SSH to the host as the root user.
  • Run the command:

chmod 1666 /etc/vmware/nsx/host-cert.pem /etc/vmware/nsx/host-privkey.pem

  • Verify with the 'll' command (or 'ls -l') that both files now have the correct permissions.
  • Initiate the upgrade to NSX 4.2.1.x or 4.2.2.x.
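
The permission check described above can be scripted so that it reports only the files that need correcting. The following is a minimal sketch, not a supported tool: the function name check_nsx_cert_perms is hypothetical, and it assumes a stat that supports the -c '%a' format option (GNU or BusyBox style), which prints the octal mode including the leading sticky-bit digit.

```shell
# Hypothetical helper: print the octal mode of each file given as an
# argument and flag any file whose mode is not 1666 (-rw-rw-rwT).
# Returns non-zero if at least one file is missing or needs correcting.
check_nsx_cert_perms() {
    status=0
    for f in "$@"; do
        # stat failure (for example, a missing file) is treated as "needs attention".
        mode=$(stat -c '%a' "$f") || { status=1; continue; }
        if [ "$mode" = "1666" ]; then
            echo "OK    $mode $f"
        else
            echo "FIX   $mode $f"
            status=1
        fi
    done
    return "$status"
}
```

For example, check_nsx_cert_perms /etc/vmware/nsx/host-cert.pem /etc/vmware/nsx/host-privkey.pem would print one line per file. The sketch only reports; applying chmod 1666 is left as the explicit manual step described above.
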

Additional Information

If this KB did not help resolve your issue, you can review the following KB for further troubleshooting steps: Troubleshooting NSX Host Upgrade Failures