Using Certificate Analyzer, Results and Recovery (CARR) Script to fix certificate related issues in NSX

Article ID: 369034

Products

VMware NSX

Issue/Introduction

This script is intended to be used to resolve certificate management issues on NSX 3.2.x and 4.x. It performs integrity checks and recovery operations for NSX self-signed certificates, and can replace certificates that have expired or will be expiring soon.

Environment

  • VMware NSX 4.x
  • VMware NSX-T Data Center 3.2.x

Cause

 

Resolution

The script will make an assessment of certificate remediation needed, present the proposed changes and ask for approval to proceed.

There should be no impact associated with running the CARR script, but Broadcom recommends running the script during a maintenance window. 

Client Requirements:

  • For NSX 3.2.3.x, 4.1.x and 4.2.x, the script can be run directly on a Local or Global NSX Manager from the /root directory.
  • For NSX versions earlier than 3.2.3.x and for NSX 4.0.x, an external client machine that meets the following requirements is needed.

    Python version requirements:

    • You can check the Python version with the following command: python --version
    • Python 3.13 and later requires the client machine to have internet connectivity; the script cannot be run offline.
    • Python 3.8 to 3.12 can run with or without client internet connectivity.

    OS: macOS and Linux

    Architecture (if the appliance has an internet connection, there is no restriction because the dependencies are downloaded):

    • macOS: "macosx_10_9_x86_64" "macosx_11_0_arm64"
    • Linux: "musllinux_1_1_x86_64" "musllinux_1_1_aarch64" "manylinux_2_17_x86_64" "manylinux_2_17_s390x" "manylinux_2_17_aarch64"

  • In all cases, the script requires the following ports to be open between the client machine and the 3 NSX Managers:
    • ssh port 22
    • https port 443
    • corfu port 9000
      Note: If running the carr script on the NSX Manager directly, ports 443 and 9000 will already be open between the 3 Managers.

  • ssh access for the admin and root users must be enabled on all NSX Managers. If needed, see Enable ssh root access for NSX appliances. In Federation environments, this requirement applies to all LMs and GMs.
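
The client requirements above can be checked before the script is launched. The following is a minimal sketch, not part of the CARR bundle: the function names classify_python and check_ports and the label not-listed are hypothetical, and the port probe assumes that an nc binary supporting the -z and -w options is installed on the client.

```shell
#!/bin/sh
# Hypothetical preflight helpers; these are not shipped with CARR.

# Classify a Python version string (e.g. "3.11.4") against the ranges in this article.
# Prints one of: offline-ok | online-only | not-listed
classify_python() {
    major=$(printf '%s' "$1" | cut -d. -f1)
    minor=$(printf '%s' "$1" | cut -d. -f2)
    # Reject empty or non-numeric minor versions instead of guessing.
    case "$minor" in
        ''|*[!0-9]*) echo not-listed; return 0 ;;
    esac
    if [ "$major" != "3" ]; then
        echo not-listed
    elif [ "$minor" -ge 13 ]; then
        echo online-only      # 3.13+ needs internet connectivity
    elif [ "$minor" -ge 8 ]; then
        echo offline-ok       # 3.8 to 3.12 runs with or without internet
    else
        echo not-listed       # versions below 3.8 are not mentioned in this article
    fi
}

# Probe the three required TCP ports on one NSX Manager address.
# Assumes nc is available; reports each port as open or unreachable.
check_ports() {
    for port in 22 443 9000; do
        if nc -z -w 3 "$1" "$port" 2>/dev/null; then
            echo "$1:$port open"
        else
            echo "$1:$port unreachable"
        fi
    done
}
```

For example, classify_python "$(python --version 2>&1 | awk '{print $2}')" reports how the local interpreter is classified, and check_ports can be called once per NSX Manager address.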

Execution Notes:

  • This script can be run on any node; it will reach out to the respective NSX nodes in the correct order.
  • Ensure that you have a recent, valid backup of your NSX Managers and that you know the passphrase for your backups.
  • On NSX, the script should be run from the /root directory. It will not work from the /tmp directory.
  • The admin and root passwords of the NSX Manager are required as inputs.
  • The script can be run from vCenter Server 8.x. If there is an issue copying the script to vCenter, it may be necessary to change the shell to bash on vCenter. To do so, follow the guidance in KB - Toggling the vCenter Server Appliance default shell, and revert the shell once the file has been copied.
  • A log named carr.log is created in the folder where the start.sh script is located. For any issues requiring support, collect this log separately; it is not collected as part of the support bundle.
  • Expired or expiring certificates that are not in use are not processed by the script. These can be deleted manually in the NSX UI.
  • The script processes expired certificates. For expiring certificates, only those expiring within the next 31 days are processed by default, unless you specify a lead time with the -t option, which can be between 31 and 825 days.
    • e.g. ./start.sh -t 100 (to check for certificates expiring in the next 100 days).
  • The script processes self-signed certificates only. CA-signed certificates are out of scope and must be managed by the organization owners.
  • By default, the script runs in offline mode. If the appliance has an internet connection, you can use the -o option to force the script to check online for dependencies.
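
Because carr.log is not part of the support bundle, it can help to copy it to a timestamped file right after a run. The following is a minimal sketch; the function name collect_carr_log, its arguments and the naming of the copy are assumptions for illustration and are not part of CARR.

```shell
#!/bin/sh
# Hypothetical helper: copy carr.log out of the CARR folder for a support case.
# Usage: collect_carr_log <carr_dir> <dest_dir>
# Prints the path of the timestamped copy on success.
collect_carr_log() {
    src="$1/carr.log"
    if [ ! -f "$src" ]; then
        echo "carr.log not found in $1" >&2
        return 1
    fi
    dest="$2/carr-$(date +%Y%m%d-%H%M%S).log"
    # -p keeps the original modification time, which is useful when correlating events.
    cp -p "$src" "$dest" && echo "$dest"
}
```

For example, collect_carr_log /root/carr-1.18 /root saves a copy next to the extracted bundle.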

Note: For NSX versions 4.1.2 or below it may be necessary to run the script twice to correct CBM based certificates, please refer here for additional information: Alarms Indicating CBM Certificates Have Expired or Are Expiring Prevent NSX Manager Upgrades

Note: If the default admin account username was changed to something other than admin, the CARR script will not work. The user will need to be renamed to admin using the procedure in "Unable to determine the NSX version. Please ensure the IP address and password is correct" error when running CARR script.

Note: If using SCP/WinSCP to transfer the carr script tar file to an NSX Manager, root ssh must be enabled. See Enable ssh root access for NSX appliances.
         

Execution Steps:

  1. Copy carr-1.18.tar.gz to the client server where it will be run. On the NSX Manager, use the /root folder.
  2. Extract the bundle
    tar -xvf carr-1.18.tar.gz
  3. Change to the extracted folder
    cd carr-1.18
  4. Launch the carr script
    ./start.sh

Script options:

  • -o = this flag is used to force online mode
  • -t = specify lead time for expiring certificates, between 31 and 825 days.
  • -d = Dry run mode, also checks for transport node certificates expiring.
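
The -t option accepts a lead time between 31 and 825 days. The following is a minimal sketch of a range check that could be run before the value is passed to start.sh; the function name valid_lead_time is hypothetical, and the check only mirrors the range stated in this article, not the script's own validation.

```shell
#!/bin/sh
# Hypothetical helper: succeed only for a whole number of days between 31 and 825.
valid_lead_time() {
    # Reject empty input and anything containing a non-digit character.
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$1" -ge 31 ] && [ "$1" -le 825 ]
}
```

For example, valid_lead_time 100 && ./start.sh -d -t 100 only starts the dry run when the lead time is in range.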

Dry Run:

Dry run is a read-only execution that also identifies the number of Edges and Hosts with TN certificates of validity 825 days or less. It supports the -t option and generates a file called validation_config_recovery_mode.yaml, populated with the issues discovered that require fixing.

> ./start.sh -d
Or
> ./start.sh -d -t <number of days>

<snip>
══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
║ HOST                ║ ERROR  : vcsa.example.com::ESX_Cluster1 :: Certificate on                     ║ Host certificate on #1 hosts will be replaced.       ║
║                     ║ #1 hosts are expiring or have expired                                         ║                                                      ║
║                     ║                                                                               ║                                                      ║
║                     ║                                                                               ║                                                      ║
║                     ║ For detailed information, see: dry_run_transport_nodes_validation_report.yaml ║                                                      ║
╚═════════════════════╩═══════════════════════════════════════════════════════════════════════════════╩══════════════════════════════════════════════════════╝
║ EDGE                ║ ERROR  : EdgeCluster1:: Certificate on #1 hosts are                           ║ Edge node certificate on 1 nodes will be replaced.   ║
║                     ║ expiring or have expired                                                      ║                                                      ║
║                     ║                                                                               ║                                                      ║
║                     ║ For detailed information, see: dry_run_transport_nodes_validation_report.yaml ║                                                      ║
╚═════════════════════╩═══════════════════════════════════════════════════════════════════════════════╩══════════════════════════════════════════════════════╝

Transport Node Certificates

On versions NSX 4.1.x and 4.2.0, Edge and Host Transport Nodes are instantiated using a certificate with validity period of 825 days instead of 10 years. 
These are permanent certificates that are not replaced by upgrades. 
Starting from version 1.15, CARR script replaces these certificates with new certs of 10 year validity period.

Note: If TN certificates have already expired and the 24 hour grace period has elapsed, the TNs will be disconnected. At this point CARR can no longer be used to replace the TN certs.
          See Transport Node Certificate Has Expired.

If a VM is vMotioned to the ESX host at the moment the certificate is being replaced, there is a possibility that it may fail to get a network connection.
To prevent vMotion during this time, it is recommended to disable DRS on the vSphere cluster for the duration of the activity.

To trigger TN certificate replacement, first run the script in dry run mode. This checks all TNs and other environmental certificates:

./start.sh -d -t 825

Once complete, it will populate a file called validation_config_recovery_mode.yaml and display the results in the console.

Then to apply the fixes identified by the script (recovery mode):

./start.sh -t 825 -r validation_config_recovery_mode.yaml

Note: The dry run (discovery mode) checks certificates expiring within up to 825 days by default. If you then run recovery mode without a lead time (-t option), recovery mode only checks up to 31 days, so any issues that discovery mode detected beyond 31 days will not be fixed.

Note: If validation_config_recovery_mode.yaml is not supplied, the script uses the details in the validation_config.yaml file. This file must be populated manually; see the details below.
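
Because recovery mode falls back to 31 days when -t is omitted, one way to avoid a mismatch is to derive both commands from a single lead-time value. The following is a minimal sketch; the function name carr_commands is hypothetical, and it only prints the two command lines shown above rather than running them.

```shell
#!/bin/sh
# Hypothetical helper: print the dry-run and recovery commands for one lead time,
# so that both phases use the same -t value.
carr_commands() {
    lead="$1"
    echo "./start.sh -d -t $lead"
    echo "./start.sh -t $lead -r validation_config_recovery_mode.yaml"
}
```

For example, carr_commands 825 prints the pair of commands used in this section, which can then be run one after the other once the dry-run results have been reviewed.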

Relevant files

README - How to use script details

start.sh - carr script

carr.log - audit log generated during carr operation

validation_config.yaml - file for transport node validation. It is used when the auto-generated file validation_config_recovery_mode.yaml is not supplied, and it must be populated manually.

validation_config_recovery_mode.yaml - Auto generated, populates which transport nodes need resolving and other certificates which need resolving.

before_recovery_transport_nodes_validation_report.yaml - Pre recovery file, which lists details about transport nodes certificates.

after_recovery_transport_nodes_validation_report.yaml - Post recovery file, which lists details about transport nodes certificates.

dry_run_transport_nodes_validation_report.yaml - Detailed list of transport nodes with certificate or connection issues.

Edit validation_config.yaml file (Optional)

On the Manager the file can be edited using vi editor, alternatively SCP the file out and edit it with Notepad++ and copy it back to the Manager.

This yaml file is located in the same directory as start.sh

To replace certificates on Hosts, specify the Compute Manager name and the names of the vSphere clusters that should be processed.
To replace certs on Edges, specify the Edge cluster name.
During certificate replacement, vMotion to the Host may not be possible.
It's recommended to start with one cluster and validate functionality.
Existing datapath flows through the Edge and Host are not expected to experience disruption.

e.g.

HOST:
  validate: True
  clusters:
    - vcenter_name: vcsa.example.com
      vcenter_cluster_name: ESX_Cluster1
    - vcenter_name: vcsa.example.com
      vcenter_cluster_name: ESX_Cluster2
EDGE:
  validate: True
  clusters:
    - name: EdgeCluster-1
    - name: EdgeCluster-2

Note: Currently only Edges in clusters are processed, standalone Edges are ignored. The vcenter_name must match the Compute manager Name (not FQDN/IP) in NSX-T: System, Fabric, Compute managers.

After saving this file run CARR to replace TN certs:

> ./start.sh -t 825   (The lead time is tuneable, in this example all Certs that expire in 825 days or less will be replaced with 10 year certs)

Notes:

  • When executing the script, you may get ERROR  : string indices must be integers. This is caused by a syntax issue in the yaml file. To resolve it, make sure there is a space between keys and values when you edit the validation_config.yaml file.
    For example: - vcenter_name: vcsa-01.example.com
  • When executing the script, you may get ERROR: Edge-cluster-01:: There are 1 edge_nodes. Certificates on these Edge Nodes will not be replaced. To resolve the issue, check whether any Edge nodes in the cluster are powered off or disconnected, and power them on.
  • It is recommended that APH and Transport node certificates be replaced in separate runs of the script.
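
The missing-space problem described in the first note can be spotted before CARR is run. The following is a minimal heuristic sketch using grep; the function name find_missing_space is hypothetical, and the pattern only covers simple key: value lines like those in the example file, not YAML in general.

```shell
#!/bin/sh
# Hypothetical helper: list lines where a key's colon is followed directly by a value.
# Prints the offending lines with their line numbers; exits non-zero if none are found.
find_missing_space() {
    grep -nE '^[[:space:]]*(-[[:space:]]+)?[A-Za-z_][A-Za-z0-9_]*:[^[:space:]]' "$1"
}
```

For example, find_missing_space validation_config.yaml prints each suspicious line, such as one containing vcenter_name:vcsa-01.example.com, and prints nothing when every key is followed by a space or a line break.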

Compute Manager Certificate

Starting from version 1.15, the CARR script retrieves the list of Compute Managers registered in NSX Manager, retrieves the vCenter certificates and checks their thumbprints and chain order.
If the CRL Distribution Point field is present in the vCenter certificates, the script disables the Certificate Revocation List (CRL) checking in NSX.
If there is a mismatch with the vCenter thumbprints, it updates the new thumbprints in NSX.

Federation Certificates

  • The carr script will rotate expired or expiring certificates in Federation. However, as with any other certificate, it will not remove unused expired or expiring certificates.
  • On the Standby Global Manager, the unused certificate can be removed in the UI from NSX 4.2 onwards.
  • For NSX versions lower than 4.2, refer to KB Generating and applying NSX-T Federation certificates for Standby Global Manager, which has steps to remove the unused certificate.
  • In Federation environments, the script needs to be run on each site. It will not rotate certificates on another site; the credentials are used to check and validate sync between sites.

Uninstall:

CARR script gets installed in the directory ~/.virtualenvs/carr_script.
For example, when running CARR script on an NSX Manager, the install can be reversed as follows

 rm -rf /root/.virtualenvs/carr_script


Note: This rm command deletes files recursively without checks. If executed incorrectly it can remove system files irreversibly requiring the NSX appliance to be replaced.
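
Given the risk described in the note above, the removal can be wrapped in a guard that builds the full path itself and refuses to run when the directory is absent. The following is a minimal sketch; the function name remove_carr_env and its optional argument are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical helper: remove only <home>/.virtualenvs/carr_script.
# Usage: remove_carr_env [home_dir]   (home_dir defaults to $HOME)
remove_carr_env() {
    home="${1:-$HOME}"
    target="$home/.virtualenvs/carr_script"
    # An empty home would turn the target into a path under /, so stop early.
    if [ -z "$home" ] || [ ! -d "$target" ]; then
        echo "nothing to remove at $target" >&2
        return 1
    fi
    rm -rf -- "$target"
}
```

For example, running remove_carr_env as root on an NSX Manager targets /root/.virtualenvs/carr_script and leaves the rest of the filesystem untouched.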

Additional Information

See Create a virtual machine for running the Certificate Analyzer, Results and Recovery (CARR) Script for detailed instructions on creating a Photon OS VM as a location to run the CARR script if no suitable location exists in your environment.

If the suggested resolution steps do not resolve the issue, please consider submitting a support case to Broadcom. Kindly include the error screenshot or details, along with NSX manager log files and script log file (A log named carr.log is created in the folder where the start.sh script is located.) 

Attachments

carr-1.18.tar.gz