ESXi Host Transport Nodes are in "Host Disconnected" state after replacing an expired NSX Manager certificate.

Article ID: 399995

Products

VMware NSX

Issue/Introduction

  • The NSX Manager certificates reached their expiration date and were subsequently replaced.

  • All ESXi Transport Nodes display an NSX Configuration status of 'Host Disconnected' in the NSX Manager UI.

  • The ESXi Transport Nodes show multiple TCP connections in the TIME_WAIT state on ports 1234 and 1235, as shown below.

    The Appliance Proxy Hub (APH) runs as a service on NSX Manager and acts as the communication channel between NSX Manager and the transport node.
    It uses port 1234 for communication between the management plane and the transport node.
    It uses port 1235 for communication between the central control plane (CCP) and the transport node.

    [root@ESXIHOST] esxcli network ip connection list | grep -i <ipaddress_of_Host>
    
    tcp 0 0 ##.##.##.51:50268 ##.##.##.17:1234 TIME_WAIT 0 
    tcp 0 0 ##.##.##.51:56995 ##.##.##.15:1234 TIME_WAIT 0 
    tcp 0 0 ##.##.##.51:54114 ##.##.##.17:1235 TIME_WAIT 0 
    tcp 0 0 ##.##.##.51:33423 ##.##.##.15:1235 TIME_WAIT 0

    Alternatively, run the following commands to check connectivity between NSX Manager and the Transport Node on ports 1234 and 1235:

    [root@ESXIHOST] localcli network ip connection list | grep 1234
    [root@ESXIHOST] localcli network ip connection list | grep 1235
  • Controller connectivity is shown as down on the Transport Nodes (TNs) for the respective master nodes, with the failure reason HOST_REJECTED_CONTROLLER_CERT:

    [root@esx:~] nsxcli -c get controllers
    
     Controller IP   Port   SSL       Status         Is Physical Master   Session State   Controller FQDN   Failure Reason
     ##.##.##.16     1235   enabled   not used       false                null            NA                NA
     ##.##.##.17     1235   enabled   disconnected   true                 down            NA                HOST_REJECTED_CONTROLLER_CERT
     ##.##.##.15     1235   enabled   not used       false                null            NA                NA
    
    # Controller and manager connectivity is down because of certificate rejection between the host and the controller.
    [root@esx:~] nsxcli -c verify controllers certificate
     Controller IP   Port   CRL Status            Certificate Status
     ##.##.##.15     1235   CERTIFICATE_REVOKED   HOST_REJECTED_CONTROLLER_CERT and CONTROLLER_REJECTED_HOST_CERT
     ##.##.##.17     1235   CERTIFICATE_REVOKED   HOST_REJECTED_CONTROLLER_CERT and CONTROLLER_REJECTED_HOST_CERT
     ##.##.##.16     1235   NA                    CONNECTION_TIMED_OUT

     

  • SSL errors appear in the ESXi host log /var/run/log/nsx-syslog.log:

    YYYY-MM-DDTHH:MM:SS.###Z Wa(180) nsx-proxy[6315252]: NSX 6315252 - [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="nsx-net" tid="6315274" level="WARNING"] StreamConnection[2886 Connecting to ssl://##.##.##.17:1234 sid:2886] Couldn't connect to 'ssl://##.##.##.17:1234' (error: 336134278-certificate verify failed)
    
    YYYY-MM-DDTHH:MM:SS.###Z Wa(180) nsx-proxy[6315252]: NSX 6315252 - [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="nsx-net" tid="6315274" level="WARNING"] StreamConnection[2886 Error to ssl://##.##.##.17:1234 sid:-1] Error 336134278-certificate verify failed
    
    YYYY-MM-DDTHH:MM:SS.###Z Wa(180) nsx-proxy[6315252]: NSX 6315252 - [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="nsx-rpc" tid="6315274" level="WARNING"] RpcConnection[2886 Connecting to ssl://##.##.##.17:1234 0] Couldn't connect to ssl://##.##.##.17:1234 (error: 336134278-certificate verify failed)

     

  • The NSX proxy on the impacted ESXi hosts was unable to load the replacement certificate, as shown below. Note the extra characters in the certificate content: they appear immediately before the -----BEGIN CERTIFICATE----- marker.

    YYYY-MM-DDTHH:MM:SS.###Z Er(179) nsx-proxy[6453519]: NSX 6453519 - [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="nsx-net" tid="6453519" level="ERROR" errorCode="NET1109"] X509Certificate: PEM - failed to read X509: x906d06c-PEM routines:PEM_read_bio:no start line
    
    YYYY-MM-DDTHH:MM:SS.###Z In(182) nsx-proxy[6453519]: NSX 6453519 - [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="mpa-proxy-lib" tid="6453519" level="INFO"] AphInfo: invalid certificate #CN=##,OU=##,O=##,L=##,ST=##,C=##-----BEGIN CERTIFICATE-----
    
    YYYY-MM-DDTHH:MM:SS.###Z In(182)[+] nsx-proxy[6453519]: MIIGIzCCBAugAwIBAgIUDioC/bYg2v90vtPbM1jGKzwM6cgwDQYJKoZIhvcNAQE


  • On the ESXi Transport Nodes (TNs), you can check the certificate content with the following command:
    [root@esx:~] cat /etc/vmware/nsx/appliance-info.xml


  • As the root user on the NSX Manager, you can check the certificate content with the following command:
    root@nsx-mngr:~# cat /etc/vmware/nsx/appliance-info.xml
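
The log signatures quoted in the symptoms above can be counted in one pass over a log file. The following is a minimal sketch, not a VMware-provided tool: the function name scan_nsx_cert_errors is hypothetical, and the search strings are taken only from the sample log lines in this article, so other builds may log different text.

```shell
#!/bin/sh
# Hypothetical helper: counts the certificate-related messages quoted in
# this article in a given log file (for example
# /var/run/log/nsx-syslog.log on an ESXi host).
scan_nsx_cert_errors() {
    log_file="$1"
    if [ ! -r "$log_file" ]; then
        echo "cannot read $log_file" >&2
        return 2
    fi
    # Search strings copied from the sample log lines in this article.
    for pattern in \
        'certificate verify failed' \
        'PEM_read_bio:no start line' \
        'AphInfo: invalid certificate'
    do
        # -F treats the pattern as a literal string; -c prints the number
        # of matching lines. grep exits non-zero on zero matches, hence
        # the "|| true".
        count=$(grep -c -F -- "$pattern" "$log_file" || true)
        echo "$pattern: $count"
    done
}
```

For example, calling scan_nsx_cert_errors /var/run/log/nsx-syslog.log prints one line per search string with its match count. Non-zero counts for the PEM or AphInfo messages would be consistent with the certificate-loading failure described here, but they are not proof of it.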

 

Environment

VMware NSX

Cause

The certificates used to replace the expired ones contained extra characters that OpenSSL could not process, so the certificates failed to load.

As a result, communication between the NSX Managers and the TNs failed, and the TNs were shown as disconnected in the UI.

Validate the certificate content before applying it to the NSX Managers; it must not contain any extra characters.
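
A pre-check along the following lines can catch stray characters before a certificate is applied. This is a sketch under stated assumptions, not a VMware tool: the function name check_pem_clean is hypothetical, and it assumes the file is expected to contain only PEM certificate blocks (marker lines and base64 lines) and nothing else.

```shell
#!/bin/sh
# Hypothetical pre-check: reports any line that is neither an exact
# BEGIN/END CERTIFICATE marker nor a base64 line.
# Assumption: the file should hold PEM certificate blocks only. Blank
# lines and CRLF line endings are also reported by this strict check.
# It is a heuristic: a stray line made only of base64 characters passes.
check_pem_clean() {
    pem_file="$1"
    if [ ! -r "$pem_file" ]; then
        echo "cannot read $pem_file" >&2
        return 2
    fi
    # -x matches whole lines only; -v keeps lines that match neither
    # pattern; -n prefixes each reported line with its line number.
    offending=$(grep -n -v -x -E \
        -e '-----(BEGIN|END) CERTIFICATE-----' \
        -e '[A-Za-z0-9+/]+=*' "$pem_file" || true)
    if [ -n "$offending" ]; then
        echo "unexpected content (line:text):"
        echo "$offending"
        return 1
    fi
    # The file must contain at least one marker line with nothing else on it.
    if ! grep -q -x -e '-----BEGIN CERTIFICATE-----' "$pem_file"; then
        echo "no clean BEGIN CERTIFICATE line found"
        return 1
    fi
    echo "OK: only PEM certificate content found"
}
```

A marker line with a prefix, such as the "...C=##-----BEGIN CERTIFICATE-----" line in the log sample above, is reported by this check because the marker does not occupy the whole line.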

Resolution

Replace the certificates again, this time using certificate content WITHOUT the extra characters that caused the problem.
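
One possible way to prepare such a file is to copy only the PEM blocks out of the original certificate file. The sketch below is illustrative and is not VMware's procedure: the function name extract_pem_blocks is hypothetical, and it assumes that the BEGIN and END markers each sit on a line that carries no base64 data, so anything else on a marker line can be dropped. Review the output before applying it; where possible, re-exporting the certificate from the issuing CA is the safer option.

```shell
#!/bin/sh
# Hypothetical helper: prints only the PEM certificate blocks found in a
# file, dropping text outside the blocks and any text that shares a line
# with a BEGIN or END marker.
# Assumption: base64 data never shares a line with a marker.
extract_pem_blocks() {
    awk '
        /-----BEGIN CERTIFICATE-----/ {
            inside = 1
            print "-----BEGIN CERTIFICATE-----"   # emit a clean marker line
            next
        }
        /-----END CERTIFICATE-----/ {
            if (inside) print "-----END CERTIFICATE-----"
            inside = 0
            next
        }
        inside { print }                          # base64 body lines
    ' "$1"
}
```

For example, extract_pem_blocks original.pem > cleaned.pem writes the cleaned content to a new file (both file names are placeholders), which can then be inspected before the certificate is replaced again.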

Additional Information

Fix: Starting with VCF 9.0, logic has been added that discards the extra characters while loading the certificate.