NSX Transport Nodes disconnect after replacing certificates with replace_certs.py in NSX 4.1.x

Article ID: 369349

Products

VMware NSX

Issue/Introduction

  • After upgrading to NSX 4.1.x, you may see that the self-signed certificates on the NSX Managers have expired or are expiring. This is related to the known issue "NSX alarms indicating certificates have expired or are expiring".
  • You use the Python script from the above KB, replace_certs.py, to replace the expired certificates.
  • In some cases, after the replace_certs.py script is run, the NSX Transport Nodes disconnect from the NSX Managers.
  • You may see entries similar to the following in the logs:

    /var/log/syslog*

    YYYY-MM-DDTHH:MM:SS.101Z nsx-proxy[4388757]: NSX 4388757 - [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="nsx-net" tid="4388787" level="WARNING"] Certificate validation: couldn't find SHA256 digest 'redacted' in local trust store
    YYYY-MM-DDTHH:MM:SS.115Z nsx-proxy[4388757]: NSX 4388757 - [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="nsx-net" tid="4388787" level="WARNING"] Certificate validation: couldn't find SHA256 digest 'redacted' in local trust store
    YYYY-MM-DDTHH:MM:SS.135Z nsx-proxy[4388757]: NSX 4388757 - [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="nsx-net" tid="4388787" level="WARNING"] Certificate validation: couldn't find SHA256 digest 'redacted' in local trust store
    YYYY-MM-DDTHH:MM:SS.150Z nsx-proxy[4388757]: NSX 4388757 - [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="nsx-net" tid="4388787" level="WARNING"] Certificate validation: couldn't find SHA256 digest 'redacted' in local trust store
    YYYY-MM-DDTHH:MM:SS.166Z nsx-proxy[4388757]: NSX 4388757 - [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="nsx-net" tid="4388787" level="WARNING"] Certificate validation: couldn't find SHA256 digest 'redacted' in local trust store
    YYYY-MM-DDTHH:MM:SS.183Z nsx-proxy[4388757]: NSX 4388757 - [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="nsx-net" tid="4388787" level="WARNING"] Certificate validation: couldn't find SHA256 digest 'redacted' in local trust store
    YYYY-MM-DDTHH:MM:SS.517Z nsx-proxy[4395471]: NSX 4395471 - [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="nsx-net" tid="4395504" level="INFO"] StreamConnection[9 Connected to ssl://NSX-Manager:1234 sid:9] Connected from ssl-tcp://NSX-TN:13663 to server with certificate with sha256 fingerprint 'redacted'
    
    
    YYYY-MM-DDT0HH:MM:SS.765Z NSX 3987044 FABRIC [nsx@6876 comp="nsx-manager" level="INFO" reqId="reqId" subcomp="manager" username="admin"] Heartbeating for host host-uuid is down.
    YYYY-MM-DDT0HH:MM:SS.945Z NSX 3987044 FABRIC [nsx@6876 comp="nsx-manager" level="INFO" reqId="reqId" subcomp="manager" username="admin"] Heartbeating for host host-uuid is down.
    YYYY-MM-DDT0HH:MM:SS.033Z NSX 3987044 FABRIC [nsx@6876 comp="nsx-manager" level="INFO" reqId="reqId" subcomp="manager" username="admin"] Heartbeating for host host-uuid is down.
    YYYY-MM-DDT0HH:MM:SS.141Z NSX 3987044 FABRIC [nsx@6876 comp="nsx-manager" level="INFO" reqId="reqId" subcomp="manager" username="admin"] Heartbeating for host host-uuid is down.
    YYYY-MM-DDT0HH:MM:SS.218Z NSX 3987044 FABRIC [nsx@6876 comp="nsx-manager" level="INFO" reqId="reqId" subcomp="manager" username="admin"] Heartbeating for host host-uuid is down.
    YYYY-MM-DDT0HH:MM:SS.323Z NSX 3987044 FABRIC [nsx@6876 comp="nsx-manager" level="INFO" reqId="reqId" subcomp="manager" username="admin"] Heartbeating for host host-uuid is down.
    YYYY-MM-DDT0HH:MM:SS.443Z NSX 3987044 FABRIC [nsx@6876 comp="nsx-manager" level="INFO" reqId="reqId" subcomp="manager" username="admin"] Heartbeating for host host-uuid is down.
    
    
    YYYY-MM-DDT0HH:MM:SS.357Z nsx-proxy[2101556]: NSX 2101556 - [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="nsx-net" tid="2101593" level="INFO"] StreamSocket[4321 Open f:47 i:0 ? -> ssl://NSX-TN:1235] on_connect 336134278-certificate verify failed
    YYYY-MM-DDT0HH:MM:SS.357Z nsx-proxy[2101556]: NSX 2101556 - [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="nsx-net" tid="2101593" level="WARNING"] StreamConnection[4321 Connecting to ssl://NSX-TN:1235 sid:4321] Couldn't connect to 'ssl://NSX-TN:1235' (error: 336134278-certificate verify failed)
    YYYY-MM-DDT0HH:MM:SS.357Z nsx-proxy[2101556]: NSX 2101556 - [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="nsx-net" tid="2101593" level="WARNING"] StreamConnection[4321 Error to ssl://NSX-TN:1235 sid:-1] Error 336134278-certificate verify failed
    YYYY-MM-DDT0HH:MM:SS.357Z nsx-proxy[2101556]: NSX 2101556 - [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="nsx-rpc" tid="2101593" level="WARNING"] RpcConnection[4321 Connecting to ssl://NSX-TN:1235 0] Couldn't connect to ssl://NSX-TN:1235 (error: 336134278-certificate verify failed)
    YYYY-MM-DDT0HH:MM:SS.357Z nsx-proxy[2101556]: NSX 2101556 - [nsx@6876 comp="nsx-esx" subcomp="nsx-proxy" s2comp="nsx-rpc" tid="2101593" level="WARNING"] RpcTransport[0] Unable to connect to ssl://NSX-TN:1235: 336134278-certificate verify failed


    /var/log/proton/nsxapi.log

    YYYY-MM-DDTHH:MM:SS.619Z ERROR WrapperStartStopAppMain TrustStoreServiceImpl 4101771 SYSTEM [nsx@6876 comp="nsx-manager" errorCode="MP100" level="ERROR" subcomp="manager"] Failed to sync certificate between DB and disk for profile: profileName: APH-TN, serviceType: APH_TN, preProcessor: null, postProcessor: null, uniqueUse: false, clusterCertificate: false, requiresPrivateKey: true, nodeTypes: [global-manager, nsx-manager, nsx-shared], certificatePath: /etc/vmware/nsx-appl-proxy/appl-proxy-cert.pem, keyPath: /etc/vmware/nsx-appl-proxy/appl-proxy-privkey.pem
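
As a quick check, the log signatures above can be counted with grep. This is a minimal sketch rather than a supported tool: the function name scan_cert_errors is hypothetical, and the three search strings are copied from the log excerpts above.

```shell
#!/bin/sh
# Hypothetical helper: count lines in a log file that match any of the
# certificate-related messages quoted in this article.
scan_cert_errors() {
    log_file="$1"
    # grep -c prints the number of matching lines (0 if none);
    # each -e adds one fixed search string from the excerpts above.
    grep -c \
        -e "couldn't find SHA256 digest" \
        -e "certificate verify failed" \
        -e "Failed to sync certificate between DB and disk" \
        "$log_file"
}
```

For example, `scan_cert_errors /var/log/syslog` could be run on a transport node and `scan_cert_errors /var/log/proton/nsxapi.log` on an NSX Manager. A non-zero count only indicates that matching messages are present and does not by itself confirm this issue.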

Environment

NSX 4.1.x

This can occur in both federated and non-federated environments.

Cause

This is a known issue that occurs when an environment is upgraded to NSX 4.1.x and the APH_TN certificate is then replaced.

Proton cannot update the certificate on disk because the uproton user lacks the required permissions on the certificate and key files.

Running "ls -lart" in "/etc/vmware/nsx-appl-proxy" shows these files owned by appl-proxy instead of uproton:

-rw-r--r--  1 appl-proxy appl-proxy 1.7K MM DD  YYYY appl-proxy-cert.pem
-rw-r--r--  1 appl-proxy appl-proxy 1.7K MM DD  YYYY appl-proxy-privkey.pem
-rw-r--r--  1 appl-proxy appl-proxy  766 MM DD  YYYY openssl-appl-proxy.cnf
-rw-r--r--  1 appl-proxy appl-proxy   52 MM DD  YYYY appl-proxy-public-cfg.json
-rw-r--r--  1 appl-proxy appl-proxy   90 MM DD  YYYY appl-proxy-public-cfg.xml
-rw-r--r--  1 appl-proxy appl-proxy 2.2K MM DD  YYYY appl-proxy.xml
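
The ownership problem shown in the listing above can be spotted by filtering the "ls -l" output. The sketch below is illustrative only: the function name list_wrong_owner is hypothetical, and it assumes the standard "ls -l" column order (owner in the third field, file name last).

```shell
#!/bin/sh
# Hypothetical helper: read `ls -l` output on stdin and print the name of
# every appl-proxy*.pem file whose owner (field 3) is not uproton.
list_wrong_owner() {
    # $NF is the last field (the file name). The pattern ends at ".pem",
    # so temporary files such as appl-proxy-privkey.pem.* are ignored.
    awk '$NF ~ /^appl-proxy.*\.pem$/ && $3 != "uproton" { print $NF }'
}
```

A possible invocation on an NSX Manager is `ls -l /etc/vmware/nsx-appl-proxy | list_wrong_owner`. Empty output means every certificate and key file in the listing is already owned by uproton.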

Resolution

This issue is resolved in VMware NSX 4.2.0.

Workaround:

Use version 1.1 or later of the replace_certs.py script to prevent this issue from occurring. If the Transport Nodes are already disconnected, perform the following steps:

    1. On an NSX Manager, change to the nsx-appl-proxy directory by running the following command:

      cd /etc/vmware/nsx-appl-proxy

    2. Run the following command to remove temporary files. The ".*" suffix after "pem" matches only the temporary key files, not the private key itself.

      rm appl-proxy-privkey.pem.*

    3. Run the following commands to change the ownership and permissions of the appl-proxy certificates and keys. From NSX 4.1.0 onward, these files must be owned by uproton.

      chown uproton:appl-proxy appl-proxy-cert.pem
      chmod 660 appl-proxy-cert.pem

      chown uproton:appl-proxy appl-proxy-privkey.pem
      chmod 660 appl-proxy-privkey.pem

      chown uproton:appl-proxy appl-proxy-ar-cert.pem
      chmod 660 appl-proxy-ar-cert.pem

      chown uproton:appl-proxy appl-proxy-ar-privkey.pem
      chmod 660 appl-proxy-ar-privkey.pem

    4. Verify the permissions of the files in this directory by running:

      ls -lart

      Example of how permissions should appear:

      total 40
      -rw-r--r-- 1 appl-proxy appl-proxy 3136 MM DD YYYY appl-proxy.xml
      -rw-r--r-- 1 appl-proxy appl-proxy 90 MM DD 00:34 appl-proxy-public-cfg.xml
      -rw-r--r-- 1 appl-proxy appl-proxy 52 MM DD 00:34 appl-proxy-public-cfg.json
      -rw-r--r-- 1 appl-proxy appl-proxy 766 MM DD 00:34 openssl-appl-proxy.cnf
      -rw-rw---- 1 uproton appl-proxy 1704 MM DD 00:34 appl-proxy-privkey.pem
      -rw-rw---- 1 uproton appl-proxy 1639 MM DD 00:34 appl-proxy-cert.pem
      -rw-rw---- 1 uproton appl-proxy 1704 MM DD 00:34 appl-proxy-ar-privkey.pem
      -rw-rw---- 1 uproton appl-proxy 1639 MM DD 00:34 appl-proxy-ar-cert.pem

    5. SSH to the NSX Transport Node and restart the nsx-proxy and nsx-opsagent services:

      /etc/init.d/nsx-proxy restart
      /etc/init.d/nsx-opsagent restart

    6. If the host still appears as disconnected, run the following:

      On one of the NSX Managers:

           get certificate api thumbprint

      On the hosts:

           nsxcli -c sync-aph-certificates NSX-Manager-IP username admin thumbprint <thumbprint> password <password>

           /etc/init.d/nsx-proxy restart
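
Steps 3 and 4 above can be scripted. The sketch below is only an illustration under stated assumptions: the function names fix_aph_perms and verify_aph_perms and the DRY_RUN variable are hypothetical, and skipping files that are not present is an assumption rather than something this article specifies. The chown and chmod commands are the ones listed in step 3.

```shell
#!/bin/sh
# Hypothetical helper for step 3: apply the ownership and mode changes to
# whichever of the four files exist in the given directory.
# With DRY_RUN=1 the commands are printed instead of executed.
fix_aph_perms() {
    dir="$1"
    for f in appl-proxy-cert.pem appl-proxy-privkey.pem \
             appl-proxy-ar-cert.pem appl-proxy-ar-privkey.pem; do
        path="$dir/$f"
        # Assumption: a file that does not exist is silently skipped.
        [ -f "$path" ] || continue
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "chown uproton:appl-proxy $path"
            echo "chmod 660 $path"
        else
            chown uproton:appl-proxy "$path" && chmod 660 "$path"
        fi
    done
}

# Hypothetical helper for step 4: read `ls -l` output on stdin and report
# appl-proxy*.pem files that are not -rw-rw---- uproton:appl-proxy.
# Returns non-zero if any mismatch is found.
verify_aph_perms() {
    awk '
        $NF ~ /^appl-proxy.*\.pem$/ {
            # Compare only the first 10 characters of the mode string so a
            # trailing ACL or SELinux marker does not cause a false mismatch.
            if (substr($1, 1, 10) != "-rw-rw----" ||
                $3 != "uproton" || $4 != "appl-proxy") {
                print "MISMATCH " $NF
                bad = 1
            }
        }
        END { exit bad }
    '
}
```

A cautious way to use this is to preview first with `DRY_RUN=1 fix_aph_perms /etc/vmware/nsx-appl-proxy`, run it without DRY_RUN only after reviewing the printed commands, and then check the result with `ls -l /etc/vmware/nsx-appl-proxy | verify_aph_perms`. Changing ownership requires sufficient privileges on the NSX Manager.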