The Certificate Analyzer, Results and Recovery (CARR) script is intended to resolve certificate management issues on NSX 3.2.x and 4.x. It performs integrity checks and recovery operations for NSX self-signed certificates, and can replace certificates that have expired or will expire soon.
The script assesses what certificate remediation is needed, presents the proposed changes, and asks for approval before proceeding.
There should be no impact associated with running the CARR script, but Broadcom recommends running the script during a maintenance window.
Python version requirements:
python --version
OS: macOS and Linux
Architecture - (if the appliance has an internet connection, then there is no restriction, dependencies are downloaded)
The script must be run from the /root directory; it will not work from the /tmp directory.
A log named carr.log is created in the folder where the start.sh script is located. For any issues requiring support, please collect this log separately; it will not be collected as part of the support bundle.
Example: ./start.sh -t 100 (to check for certificates expiring in the next 100 days).
Note: For NSX versions 4.1.2 or below, it may be necessary to run the script twice to correct CBM-based certificates. For additional information, see Alarms Indicating CBM Certificates Have Expired or Are Expiring Prevent NSX Manager Upgrades.
Note: If the default admin account username was changed to something other than admin, the CARR script will not work. The user must be renamed to admin using the procedure in "Unable to determine the NSX version. Please ensure the IP address and password is correct" error when running CARR script.
Note: If using SCP/WinSCP to transfer the CARR script tar file to an NSX Manager, root SSH access must be enabled. See Enable ssh root access for NSX appliances.
Execution Steps:
Place the CARR script tar file in the /root folder
tar -xvf carr-1.18.tar.gz
cd carr-1.18
./start.sh
Script options:
-o = this flag is used to force online mode
-t = specify lead time for expiring certificates, between 31 and 825 days.
-d = Dry run mode, also checks for transport node certificates expiring.
Dry run is a read-only execution that also identifies the number of Edges and Hosts with TN certificates of validity 825 days or less. It also supports the -t option, and it generates a file called validation_config_recovery_mode.yaml populated with the issues discovered that require fixing.
> ./start.sh -d
Or
> ./start.sh -d -t <number of days>
<snip>
║ HOST ║ ERROR : vcsa.example.com::ESX_Cluster1 :: Certificate on                      ║ Host certificate on #1 hosts will be replaced.     ║
║      ║ #1 hosts are expiring or have expired                                         ║                                                    ║
║      ║                                                                               ║                                                    ║
║      ║ For detailed information, see: dry_run_transport_nodes_validation_report.yaml ║                                                    ║
╚══════╩═══════════════════════════════════════════════════════════════════════════════╩════════════════════════════════════════════════════╝
║ EDGE ║ ERROR : EdgeCluster1:: Certificate on #1 hosts are                            ║ Edge node certificate on 1 nodes will be replaced. ║
║      ║ expiring or have expired                                                      ║                                                    ║
║      ║                                                                               ║                                                    ║
║      ║ For detailed information, see: dry_run_transport_nodes_validation_report.yaml ║                                                    ║
╚══════╩═══════════════════════════════════════════════════════════════════════════════╩════════════════════════════════════════════════════╝
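Because the -t option only accepts a lead time between 31 and 825 days, a small wrapper can reject an out-of-range value before the script is started. The sketch below is illustrative only: valid_lead_time is a hypothetical helper that is not part of CARR, and a POSIX shell is assumed.

```shell
#!/bin/sh
# Hypothetical helper (not part of CARR): succeed only if $1 is a whole
# number of days inside the documented -t range of 31 to 825.
valid_lead_time() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # empty or non-numeric input
    esac
    [ "$1" -ge 31 ] && [ "$1" -le 825 ]
}

# Example use (commented out so that sourcing this file has no side effects):
# days=100
# if valid_lead_time "$days"; then
#     ./start.sh -d -t "$days"
# else
#     echo "lead time must be between 31 and 825 days" >&2
# fi
```

Checking the value up front avoids starting a dry run with a lead time the script would not accept.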
On NSX versions 4.1.x and 4.2.0, Edge and Host Transport Nodes (TNs) are instantiated using a certificate with a validity period of 825 days instead of 10 years.
These are permanent certificates that are not replaced by upgrades.
Starting from version 1.15, the CARR script replaces these certificates with new certificates that have a 10-year validity period.
Note: If TN certificates have already expired and the 24-hour grace period has elapsed, the TNs will be disconnected. At this point, CARR can no longer be used to replace the TN certificates.
See Transport Node Certificate Has Expired.
If a VM is vMotioned to the ESX host at the moment the certificate is being replaced, there is a possibility that it may fail to get a network connection.
To prevent vMotion during this time, it is recommended to disable DRS on the vSphere cluster for the duration of the activity.
To trigger TN certificate replacement, first run the script in dry run mode. This checks all TNs and other environmental certificates:
./start.sh -d -t 825
Once complete, it will populate a file called validation_config_recovery_mode.yaml and display the results in the console.
Then to apply the fixes identified by the script (recovery mode):
./start.sh -t 825 -r validation_config_recovery_mode.yaml
Note: By default, the dry run (discovery mode) checks certificates expiring within up to 825 days. If recovery mode is then run without a lead time (-t option), it only checks up to 31 days, so any issues that discovery mode detected beyond 31 days will not be fixed in recovery mode.
Note: If validation_config_recovery_mode.yaml is not supplied, the script uses the details contained in the validation_config.yaml file. This file needs to be populated manually; see the details below on how to do this.
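One way to avoid the lead-time mismatch described in the note above is to keep a single value and pass it to both runs. The following sketch only prints the two commands from this article with a shared lead time; it does not execute them. carr_commands is a hypothetical helper name, and 825 is assumed as the default value.

```shell
#!/bin/sh
# Hypothetical helper (not part of CARR): print the dry run and recovery
# commands with the SAME lead time, so recovery mode covers everything
# the dry run reported.
carr_commands() {
    lead="${1:-825}"   # assumed default; any value from 31 to 825 is accepted by -t
    printf './start.sh -d -t %s\n' "$lead"
    printf './start.sh -t %s -r validation_config_recovery_mode.yaml\n' "$lead"
}

# Example: carr_commands 400
```

Printing the commands first makes it easy to review them before running each one manually from the directory that contains start.sh.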
Relevant files
README - details on how to use the script
start.sh - carr script
carr.log - audit log generated during carr operation
validation_config.yaml - file for transport node validation. It is referenced if the auto-generated file validation_config_recovery_mode.yaml is not used, and it needs to be populated manually.
validation_config_recovery_mode.yaml - Auto-generated; lists the transport nodes and other certificates that need resolving.
before_recovery_transport_nodes_validation_report.yaml - Pre recovery file, which lists details about transport nodes certificates.
after_recovery_transport_nodes_validation_report.yaml - Post recovery file, which lists details about transport nodes certificates.
dry_run_transport_nodes_validation_report.yaml - Detailed list of transport nodes with certificate or connection issues.
On the Manager, the validation_config.yaml file can be edited using the vi editor. Alternatively, SCP the file out, edit it with Notepad++, and copy it back to the Manager.
This yaml file is located in the same directory as start.sh.
To replace certificates on Hosts, the Compute Manager name must be specified and the vSphere cluster names that should be processed.
To replace certs on Edges, the Edge cluster name must be specified.
During certificate replacement, vMotion to the Host may not be possible.
It's recommended to start with one cluster and validate functionality.
Existing datapath flows through the Edge and Host are not expected to experience disruption.
e.g.
HOST:
  validate: True
  clusters:
    - vcenter_name: vcsa.example.com
      vcenter_cluster_name: ESX_Cluster1
    - vcenter_name: vcsa.example.com
      vcenter_cluster_name: ESX_Cluster2
EDGE:
  validate: True
  clusters:
    - name: EdgeCluster-1
    - name: EdgeCluster-2
Note: Currently only Edges in clusters are processed; standalone Edges are ignored. The vcenter_name must match the Compute Manager name (not FQDN/IP) shown in NSX under System > Fabric > Compute Managers.
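YAML requires a space after the colon that separates a key from its value, and this is easy to miss when editing the file by hand. The sketch below is a rough pre-check for that one mistake only; it is not a full YAML validator. check_yaml_spacing is a hypothetical helper that is not part of CARR, and the pattern assumes simple keys made of letters, digits and underscores, as in the example above.

```shell
#!/bin/sh
# Hypothetical helper (not part of CARR): report lines where a key's colon
# is immediately followed by a non-space character, e.g. "name:EdgeCluster-1".
check_yaml_spacing() {
    [ -r "$1" ] || { echo "cannot read $1" >&2; return 2; }
    # Print offending lines with their line numbers; succeed only if none exist.
    if grep -nE '^[[:space:]-]*[A-Za-z_][A-Za-z0-9_]*:[^[:space:]]' "$1"; then
        return 1
    fi
    return 0
}

# Example: check_yaml_spacing validation_config.yaml && echo "spacing looks OK"
```

A clean result only means this particular spacing mistake was not found; indentation and cluster names still need to be reviewed manually.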
After saving this file, run CARR to replace the TN certificates:
> ./start.sh -t 825 (The lead time is tunable; in this example, all certificates that expire in 825 days or less will be replaced with 10-year certificates.)
Notes:
ERROR : string indices must be integers
This error is caused by a syntax issue in the yaml file. To resolve it, when editing the validation_config.yaml file, make sure there is a space between keys and values, e.g. - vcenter_name: vcsa-01.example.com
ERROR: Edge-cluster-01:: There are 1 edge_nodes. Certificates on these Edge Nodes will not be replaced
To resolve this issue, check whether any Edge nodes in the cluster are powered off or disconnected, and power on the affected Edge node.
Starting from version 1.15, the CARR script retrieves the list of Compute Managers registered in NSX Manager, retrieves the vCenter certificates and checks their thumbprints and chain order.
If the CRL Distribution Point field is present in the vCenter certificates, the script disables the Certificate Revocation List (CRL) checking in NSX.
If there is a mismatch with the vCenter thumbprints, it updates the new thumbprints in NSX.
The CARR script is installed in the directory ~/.virtualenvs/carr_script.
For example, when running the CARR script on an NSX Manager, the install can be reversed as follows:
rm -rf /root/.virtualenvs/carr_script
Note: This rm command deletes files recursively without checks. If executed incorrectly, it can irreversibly remove system files, requiring the NSX appliance to be replaced.
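Given the risk described in the note above, the removal can be wrapped in a guard that refuses any path not ending in .virtualenvs/carr_script. This is a minimal sketch under that assumption; remove_carr_venv is a hypothetical helper name and is not shipped with CARR.

```shell
#!/bin/sh
# Hypothetical helper (not part of CARR): remove the CARR virtualenv only if
# the target path ends in ".virtualenvs/carr_script".
remove_carr_venv() {
    target="${1:-$HOME/.virtualenvs/carr_script}"
    case "$target" in
        */.virtualenvs/carr_script) ;;   # expected location, continue
        *) echo "refusing to remove unexpected path: $target" >&2
           return 1 ;;
    esac
    if [ ! -d "$target" ]; then
        echo "nothing to remove: $target" >&2
        return 0
    fi
    rm -rf -- "$target"
}

# Example: remove_carr_venv /root/.virtualenvs/carr_script
```

The guard protects against a mistyped or empty path; it does not replace checking the command before running it as root.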
See Create a virtual machine for running the Certificate Analyzer, Results and Recovery (CARR) Script for detailed instructions on creating a Photon OS VM as a location to run the CARR script if no suitable location exists in your environment.
If the suggested resolution steps do not resolve the issue, please consider submitting a support case to Broadcom. Include the error screenshot or details, along with the NSX Manager log files and the script log file. (A log named carr.log is created in the folder where the start.sh script is located.)
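Since carr.log is not included in the support bundle, it can help to save a timestamped copy before attaching it to a case. The sketch below assumes only that carr.log sits next to start.sh, as stated above; collect_carr_log and the carr-<timestamp>.log naming are hypothetical and not part of CARR.

```shell
#!/bin/sh
# Hypothetical helper (not part of CARR): copy carr.log from the script
# folder to a destination folder under a timestamped name, then print
# the path of the copy.
collect_carr_log() {
    src_dir="${1:-.}"    # folder containing start.sh and carr.log
    dest_dir="${2:-.}"   # folder to place the copy in
    if [ ! -f "$src_dir/carr.log" ]; then
        echo "carr.log not found in $src_dir" >&2
        return 1
    fi
    out="$dest_dir/carr-$(date +%Y%m%d-%H%M%S).log"
    cp -- "$src_dir/carr.log" "$out" && printf '%s\n' "$out"
}

# Example: collect_carr_log /root/carr-1.18 /root
```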