ESXI Host Transport nodes are in disconnected state post NSX manager expired cert replacement

Article ID: 399995

Products

VMware NSX

Issue/Introduction

  • The NSX Manager certificates expired and were replaced after expiration.
  • All transport nodes (TNs) show as disconnected in the UI.
  • On the TNs, connections to ports 1234 and 1235 are in the TIME_WAIT state, as shown below:
    [root@ESXIHOST] esxcli network ip connection list | grep -i <ipaddress_of_Host>

    tcp 0 0 ##.##.##.51:50268 ##.##.##.17:1234 TIME_WAIT 0 
    tcp 0 0 ##.##.##.51:56995 ##.##.##.15:1234 TIME_WAIT 0 
    tcp 0 0 ##.##.##.51:54114 ##.##.##.17:1235 TIME_WAIT 0 
    tcp 0 0 ##.##.##.51:33423 ##.##.##.15:1235 TIME_WAIT 0
  • Controller connectivity shows as down on the TNs for the respective master nodes, with the failure reason HOST_REJECTED_CONTROLLER_CERT, as shown below:
    [root@esx:~] nsxcli -c get controllers

     Controller IP  Port  SSL      Status        Is Physical Master  Session State  Controller FQDN  Failure Reason
     ##.##.##.16    1235  enabled  not used      false               null           NA               NA
     ##.##.##.17    1235  enabled  disconnected  true                down           NA               HOST_REJECTED_CONTROLLER_CERT
     ##.##.##.15    1235  enabled  not used      false               null           NA               NA

    # Controller and manager connectivity is down due to certificate rejection between the host and the controller.
    [root@esx:~] nsxcli -c verify controllers certificate
     Controller IP  Port  CRL Status           Certificate Status
     ##.##.##.15    1235  CERTIFICATE_REVOKED  HOST_REJECTED_CONTROLLER_CERT and CONTROLLER_REJECTED_HOST_CERT
     ##.##.##.17    1235  CERTIFICATE_REVOKED  HOST_REJECTED_CONTROLLER_CERT and CONTROLLER_REJECTED_HOST_CERT
     ##.##.##.16    1235  NA                   CONNECTION_TIMED_OUT

 

  • SSL errors appear in the ESXi host log /var/run/log/nsx-syslog.log:
    YYYY-MM-DDTHH:MM:SS.###Z Wa(180) nsx-proxy[6315252]: NSX 6315252 - [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="nsx-net" tid="6315274" level="WARNING"] StreamConnection[2886 Connecting to ssl://##.##.##.17:1234 sid:2886] Couldn't connect to 'ssl://##.##.##.17:1234' (error: 336134278-certificate verify failed)

    YYYY-MM-DDTHH:MM:SS.###Z Wa(180) nsx-proxy[6315252]: NSX 6315252 - [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="nsx-net" tid="6315274" level="WARNING"] StreamConnection[2886 Error to ssl://##.##.##.17:1234 sid:-1] Error 336134278-certificate verify failed

    YYYY-MM-DDTHH:MM:SS.###Z Wa(180) nsx-proxy[6315252]: NSX 6315252 - [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="nsx-rpc" tid="6315274" level="WARNING"] RpcConnection[2886 Connecting to ssl://##.##.##.17:1234 0] Couldn't connect to ssl://##.##.##.17:1234 (error: 336134278-certificate verify failed)

 

  • The NSX proxy on the impacted ESXi hosts was unable to load the replacement certificate, as shown below. Note the extra characters in the certificate file: in the "invalid certificate" log entry, additional text (#CN=...) appears immediately before -----BEGIN CERTIFICATE-----.

    YYYY-MM-DDTHH:MM:SS.###Z Er(179) nsx-proxy[6453519]: NSX 6453519 - [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="nsx-net" tid="6453519" level="ERROR" errorCode="NET1109"] X509Certificate: PEM - failed to read X509: x906d06c-PEM routines:PEM_read_bio:no start line

    YYYY-MM-DDTHH:MM:SS.###Z In(182) nsx-proxy[6453519]: NSX 6453519 - [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="mpa-proxy-lib" tid="6453519" level="INFO"] AphInfo: invalid certificate #CN=##,OU=##,O=##,L=##,ST=##,C=##-----BEGIN CERTIFICATE-----

    YYYY-MM-DDTHH:MM:SS.###Z In(182)[+] nsx-proxy[6453519]: MIIGIzCCBAugAwIBAgIUDioC/bYg2v90vtPbM1jGKzwM6cgwDQYJKoZIhvcNAQEL
  • On ESXi TNs (transport nodes), you can inspect the certificate content with the command:
    [root@esx:~] cat /etc/vmware/nsx/appliance-info.xml

  • As the root user on the NSX Manager, you can inspect the certificate content with the command:
    root@nsx-mngr:~# cat /etc/vmware/nsx/appliance-info.xml
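
When many hosts are affected, the TIME_WAIT symptom from the connection listing earlier in this section can be checked against saved command output rather than by eye. The following is a minimal sketch, not a supported tool. It assumes the column order shown in this article (protocol, receive queue, send queue, local address, foreign address, state); the function name and the sample addresses are hypothetical, since the article masks the real ones.

```python
# Ports the article shows the host using to reach the NSX Managers.
NSX_PORTS = {"1234", "1235"}


def stuck_nsx_connections(esxcli_output, state="TIME_WAIT"):
    """Return (foreign_ip, port) pairs that are in `state` on the NSX ports.

    Assumes the column order shown in the article:
    proto, recv-q, send-q, local address, foreign address, state, ...
    """
    hits = []
    for line in esxcli_output.splitlines():
        fields = line.split()
        # Skip headers, blank lines, and anything that is not a TCP row.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        foreign, conn_state = fields[4], fields[5]
        ip, _, port = foreign.rpartition(":")
        if port in NSX_PORTS and conn_state == state:
            hits.append((ip, port))
    return hits


# Made-up sample in the same shape as the article's listing.
sample = """\
tcp 0 0 10.0.0.51:50268 10.0.0.17:1234 TIME_WAIT 0
tcp 0 0 10.0.0.51:56995 10.0.0.15:1234 TIME_WAIT 0
tcp 0 0 10.0.0.51:54114 10.0.0.17:1235 TIME_WAIT 0
tcp 0 0 10.0.0.51:22 10.0.0.99:51000 ESTABLISHED 0
"""
print(stuck_nsx_connections(sample))
```

TIME_WAIT on these ports is only one of the symptoms listed here; confirm with the controller and certificate checks before concluding that the certificate is at fault.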

 

Environment

VMware NSX

Cause

The certificates used to replace the expired ones contained extra characters that OpenSSL could not process, so the certificates were not loaded properly.

This broke communication between the NSX Managers and the TNs, leaving the TNs in a disconnected state in the UI.

Validate the content of a certificate before applying it on the NSX Managers; it must not contain any extra characters.
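
One way to perform this validation is to look for any text that sits outside the -----BEGIN CERTIFICATE----- / -----END CERTIFICATE----- markers. The following is a minimal sketch under that assumption; the function name is hypothetical and the sample body (MIIB) is a placeholder, not a real certificate. It checks the PEM framing only and does not verify that the certificate itself is valid or unexpired.

```python
import re

# One PEM certificate block: BEGIN marker, base64 body, END marker.
PEM_BLOCK = re.compile(
    r"-----BEGIN CERTIFICATE-----\r?\n"
    r"[A-Za-z0-9+/=\r\n]+?"
    r"-----END CERTIFICATE-----"
)


def stray_text(pem_text):
    """Return non-whitespace fragments found outside PEM certificate blocks."""
    # Blank out every well-formed block; whatever remains is stray text.
    leftovers = PEM_BLOCK.sub("\n", pem_text)
    return [line.strip() for line in leftovers.splitlines() if line.strip()]


clean = "-----BEGIN CERTIFICATE-----\nMIIB\n-----END CERTIFICATE-----\n"
# Same shape as the log excerpt: subject text glued in front of the BEGIN marker.
dirty = "#CN=example,OU=lab" + clean

print(stray_text(clean))  # empty list: nothing outside the markers
print(stray_text(dirty))  # the extra prefix is reported
```

An empty result means only the framing looks clean. A non-empty result lists the fragments that need to be removed before the certificate is applied.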

Resolution

Replace the certificates again, this time using certificates WITHOUT the extra characters that caused the problem.
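
If the certificate file is known to be otherwise correct, the extra characters can be stripped before the certificate is applied again. The following is a minimal sketch, not an official procedure. It assumes the only defect is text outside the PEM markers; the function name and sample data are hypothetical. Review the result before using it.

```python
import re

# Everything from a BEGIN marker to the next END marker, across lines.
PEM_BLOCK = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----",
    re.DOTALL,
)


def keep_only_pem_blocks(text):
    """Return the certificate block(s) from `text` with all other text dropped."""
    blocks = PEM_BLOCK.findall(text)
    if not blocks:
        raise ValueError("no PEM certificate block found")
    cleaned = []
    for block in blocks:
        # Drop blank lines and stray whitespace inside each block.
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        cleaned.append("\n".join(lines))
    return "\n".join(cleaned) + "\n"


# Placeholder body (MIIB), not a real certificate.
dirty = (
    "#CN=example,OU=lab-----BEGIN CERTIFICATE-----\n"
    "MIIB\n"
    "-----END CERTIFICATE-----\n"
    "trailing text\n"
)
print(keep_only_pem_blocks(dirty), end="")
```

The output starts directly at the BEGIN marker and ends after the END marker, with certificates in a chain kept in their original order.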

Additional Information

FIX: From VCF 9.0 onwards, logic has been added that discards the extra characters while loading the certificate.