Replacing a faulty NSX-T manager node in a VCF environment

Article ID: 314670

Updated On:

Products

VMware NSX VMware Cloud Foundation

Issue/Introduction

This article provides procedures for replacing a single node from a backup in a 3-node NSX-T Manager cluster within a VCF environment.

A faulty NSX Manager node degrades the health of the NSX cluster, which can block several VCF operations.

Environment

VMware NSX-T Data Center

VMware Cloud Foundation 4.x

VMware Cloud Foundation 5.x

Resolution

1. Procedure to replace one NSX-T manager node in a 3-node cluster.
In this procedure, the 3-node NSX-T manager cluster has a single node down. In this example, the MGMT cluster contains three NSX-T manager nodes plus the cluster VIP:
- 172.17.110.22 ######.com ----> vip
- 172.17.110.23 ######1.com
- 172.17.110.24 ######2.com
- 172.17.110.25 ######3.com
Assume that ######2 has gone down and must be replaced. Before you begin, collect the information listed below for this node.

2.1 Prerequisites

You must obtain the following information as prerequisites for the unrepairable NSX-T manager:
- VM Name
- FQDN hostname
- IP address, netmask & gateway
- DNS, NTP servers
- admin, audit, root user passwords
If you do not know the admin and root passwords, follow the instructions in the password chapter of the VMware Cloud Foundation Operations and Administration guide to retrieve them from the SDDC Manager inventory. If you want to change these two passwords, do so after restoring the NSX-T manager VMs, using the SDDC Manager password update function.

For the NSX-T manager cluster
- NSX-T manager VM size. When you deploy an NSX-T Manager OVA, you must specify the node size. To determine the size, log in to the management domain vCenter, navigate to the summary page for one of the NSX-T manager VMs for this NSX-T instance, and expand the VM Hardware pane. Compare the memory size and number of CPUs of that VM with the NSX Manager VM Resource Requirements section in the following product documents, and use the table to determine the size to select in step 2.2.4:
- NSX-T 3.0: Please refer to NSX Manager VM and Host Transport Node System Requirements.
- NSX-T 2.5: Please refer to NSX Manager VM and Host Transport Node System Requirements
- NSX 4.0: Please refer to NSX Manager VM and Host Transport Node System Requirements
For one operational NSX-T Manager:
- FQDN hostname and IP address
Note: Download the OVA image before you start, and verify the md5sum of the OVA file. Section 2.2.4 describes how to determine the specific OVA to download.
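
The checksum verification mentioned in the note can be sketched as follows. This is a minimal example, assuming a Linux host with the GNU md5sum utility; the function name verify_ova_md5 is hypothetical, and the expected digest must come from the download page for your exact build.

```shell
# Compare the MD5 digest of a downloaded OVA against the published value.
# Usage: verify_ova_md5 <path-to-ova> <expected-md5>
# (verify_ova_md5 is a hypothetical helper name, not part of any VMware tool.)
verify_ova_md5() {
  local file="$1" expected="$2" actual
  [ -f "$file" ] || { echo "file not found: $file" >&2; return 2; }
  # md5sum prints "<digest>  <filename>"; keep only the digest.
  actual="$(md5sum "$file" | awk '{print $1}')"
  # Normalize both digests to lower case before comparing.
  actual="$(printf '%s' "$actual" | tr 'A-F' 'a-f')"
  expected="$(printf '%s' "$expected" | tr 'A-F' 'a-f')"
  if [ "$actual" = "$expected" ]; then
    echo "OK: $file"
  else
    echo "MISMATCH: expected $expected, got $actual" >&2
    return 1
  fi
}
```

For example, verify_ova_md5 nsx-unified-appliance-2.4.2.1.0.14374085.ova <published md5> returns a non-zero status if the file is corrupt or is the wrong build.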

2.2 Procedure

2.2.1. Delete faulty NSX-T manager node from NSX-T manager cluster
The deletion steps depend on whether the node was deployed automatically via the NSX UI or manually to vCenter via OVA:
- If the node was deployed via the NSX UI, delete it from the UI as described in 2.2.1.1.
- If the node was manually deployed via OVA, the wheel icon is not available or the Delete option is grayed out. Delete the node using the CLI as described in 2.2.1.2.

2.2.1.1. Delete an NSX-T Manager node that was automatically deployed via NSX UI
- From NSX-T manager UI > System > Overview, select the failed manager node that needs deleting, click the wheel icon or the 'Action' button, depending on your NSX version, and click Delete.
2.2.1.2. Delete an NSX-T Manager node that was manually deployed to vCenter via OVA
- First, obtain the UUID of the faulty NSX-T manager. SSH into one operational NSX-T Manager:
ssh admin@######3
- Issue the get cluster status command and record the UUID of the faulty NSX-T manager:
get cluster status
- Issue the detach command from the operational NSX-T Manager:
detach node <uuid>

For example: detach node 77a01dab-####-####-####-0bfc3e0c40d9
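
Because detaching the wrong node is disruptive, it can help to check the recorded UUID on your workstation before pasting the command into the NSX CLI. The sketch below assumes a POSIX shell with grep; the helper names is_uuid and detach_command are hypothetical. It only prints the command text and does not contact NSX.

```shell
# Return success only for the canonical 8-4-4-4-12 hexadecimal UUID layout.
is_uuid() {
  printf '%s\n' "$1" |
    grep -Eq '^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$'
}

# Print the NSX CLI command for a validated UUID (hypothetical helper).
detach_command() {
  if is_uuid "$1"; then
    printf 'detach node %s\n' "$1"
  else
    echo "not a valid UUID: $1" >&2
    return 1
  fi
}
```

A partially copied or masked value such as 77a01dab-####-####-####-0bfc3e0c40d9 is rejected, which catches truncated copy and paste before the command reaches the cluster.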

2.2.1.3. Wait for deletion to complete
- From NSX-T manager UI > Home > Dashboard > System, wait until the 2-node NSX-T manager cluster has a green, stable status
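
If you prefer to script a wait like this rather than watch the dashboard, a generic polling loop is one option. The sketch below assumes a POSIX shell; wait_until is a hypothetical helper, and the check command you pass to it (for example, a function of your own that queries the cluster state) is not defined here and is left to the reader.

```shell
# Poll a command until it succeeds or a timeout expires.
# Usage: wait_until <timeout-seconds> <interval-seconds> <command> [args...]
# (wait_until is a hypothetical helper; the check command is supplied by you.)
wait_until() {
  local timeout="$1" interval="$2" waited=0
  shift 2
  # Re-run the check until it returns success.
  while ! "$@"; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out after ${waited}s" >&2
      return 1
    fi
    sleep "$interval"
    waited=$(( waited + interval ))
  done
}
```

The same helper applies to the later waiting steps in this procedure, with a longer timeout where a node reboot is involved.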
2.2.2. Power off faulty NSX-T manager VM
- From the vSphere Web Client > Hosts and Clusters, select ######2 VM
- Select Power Off
2.2.3. Delete faulty NSX-T manager VM
- From vSphere Web Client > Hosts and Clusters, select ######2 VM
- Select Delete from disk
- If you want to be safe, you can rename the VM instead, but remember to delete it once the procedure is completed
2.2.3.4. Verification

After removing one node from the cluster and before adding a new one, run the get cluster status command and verify that the services are UP on the remaining nodes.

2.2.4. Add new NSX-T manager node

The new node cannot be added from the NSX-T manager UI. The NSX-T manager VMs are in the MGMT cluster, but the NSX-T manager cluster only knows about the VI WLD vCenter, not the MGMT domain vCenter, so the add nodes wizard will not allow adding a node to the MGMT cluster.
- From NSX-T manager UI > System > Overview, record the NSX version and build number of the NSX-T manager nodes
For example: 2.4.2.1.0.14374085
- Download the exact build from customerconnect.vmware.com, for example nsx-unified-appliance-2.4.2.1.0.14374085.ova
- From vSphere Web Client > Hosts and Clusters > SDDC-Cluster1 > Mgmt-ResourcePool, select Deploy OVF template...
- Select local file, select the OVA file
nsx-unified-appliance-2.4.2.1.0.14374085.ova, click Next
- Input the VM Name, ######2
- Select the SDDC-Datacenter, click Next
- Select the SDDC-Cluster1, select the Mgmt-ResourcePool, click Next
- Choose the correct size as determined in Section 2.1 above, click Next
- Select the vSAN datastore, for example sfo01-m01-vsan, click Next
- Select the MGMT portgroup, for example SDDC-DPortGroup-Mgmt, click Next
- Enter the settings recorded for the faulty NSX-T manager node, so that the replacement node reuses them
For example:
fqdn=######2.com
role=nsx-manager-nsx-controller
gateway=172.##.110.#
ipv4=###.17.110.##
netmask=255.255.255.0
dns=172.##.110.###
domain=######.com
ntp=172.##.###.251
ssh=enabled (checked)
allowroot=disabled (unchecked)
- Review and click Finish
- Wait for task completion
- Power On the VM
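
A mistyped gateway or netmask in the OVF properties leaves the new node unreachable after power-on, so a quick consistency check before deployment can save a redeploy. The following is a minimal sketch, assuming a POSIX shell; the helper names ip_to_int and same_subnet are hypothetical, and the addresses in the usage note are illustrative because the real values are masked in this article.

```shell
# Convert a dotted-quad IPv4 address to a single integer.
# Octets must be written without leading zeros (010 would be read as octal).
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<EOF
$1
EOF
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Succeed if <ip> and <gateway> fall in the same network under <netmask>.
# Usage: same_subnet <ip> <gateway> <netmask>
same_subnet() {
  local ip gw mask
  ip="$(ip_to_int "$1")"
  gw="$(ip_to_int "$2")"
  mask="$(ip_to_int "$3")"
  # Two addresses share a subnet when their masked network parts are equal.
  [ $(( ip & mask )) -eq $(( gw & mask )) ]
}
```

For example, same_subnet 172.17.110.24 172.17.110.1 255.255.255.0 succeeds, while an address in 172.17.111.0/24 with the same gateway fails.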
2.2.5. Join the new NSX-T manager node to the cluster
- Look up the IP address of an operational NSX-T manager node
nslookup ######1
- Alternatively, ping ######1, which also displays the IP address
- SSH into an operational NSX-T manager node
ssh admin@######1
- Get the cluster ID
get cluster config | find Id:
- Get the API thumbprint
get certificate api thumbprint
- SSH into the new NSX-T manager node
ssh admin@######2
- Join the new node to the cluster
join <nsx mgr ip> cluster-id <uuid> thumbprint <thumbprint> username admin

For example:
join ###.17.110.## cluster-id 3ca96913-####-####-####-365a7c52b545 thumbprint dd35[...]ca1e username admin
- Wait 5 minutes for the join to complete
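
The join command takes three values collected in the earlier steps, and a missing one is easy to overlook when pasting. The sketch below assembles the command text on your workstation so you can review it before running it on the new node. It assumes a POSIX shell; build_join_command is a hypothetical helper and does not connect to NSX.

```shell
# Assemble the NSX CLI join command from the recorded values.
# Usage: build_join_command <operational-mgr-ip> <cluster-id> <api-thumbprint>
# (build_join_command is a hypothetical helper; it only prints text.)
build_join_command() {
  local mgr_ip="$1" cluster_id="$2" thumbprint="$3"
  # Refuse to build a command if any of the three values is missing.
  if [ -z "$mgr_ip" ] || [ -z "$cluster_id" ] || [ -z "$thumbprint" ]; then
    echo "usage: build_join_command <mgr-ip> <cluster-id> <thumbprint>" >&2
    return 1
  fi
  printf 'join %s cluster-id %s thumbprint %s username admin\n' \
    "$mgr_ip" "$cluster_id" "$thumbprint"
}
```

Note that the first argument is the IP address of an operational manager node, not of the new node on which the command is run.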
2.2.6. Wait for the addition to complete
- From NSX-T manager UI > Home > Dashboard > System, wait until 3-node NSX-T manager cluster has a green stable status.
2.2.7. Reassign the old VMCA or external CA-signed certificate to the new NSX-T manager node.

2.2.7.1 If the certificate exists, follow these steps:
- From NSX-T manager UI > System > Certificates > Certificates, check the VMCA/external CA-signed certificate for the faulty NSX-T manager node.
For example:
Issued By=CA
Issued To=######2.com
- Select only the checkbox for the certificate; do not open the detail screen
- Hover the mouse over the ID field (the second column); a pop-up appears with the ID of the certificate
- Record the certificate ID
- For the next step, you need to send a POST request to the new NSX-T manager node. You can use either Postman on Windows or curl on Linux; the example here uses curl on Linux.
- Issue the POST request to the new NSX-T manager node to assign the old certificate to the new node
curl -H 'Accept: application/json' -H 'Content-Type: application/json' --insecure -u 'admin:<password>' -X POST \
'https://<new nsx-t mgr fqdn or ip>/api/v1/node/services/http?action=apply_certificate&certificate_id=<certificate id>'

For example:
curl -H 'Accept: application/json' -H 'Content-Type: application/json' --insecure -u 'admin:<password>' -X POST \
'https://######2.com/api/v1/node/services/http?action=apply_certificate&certificate_id=24781ed5-####-####-####-cc8a4415d60e'
- The curl command does not return a response, since the http server is restarted as part of the command.
- From the vSphere Web Client > Hosts and Clusters, select the new NSX-T manager VM, click Restart Guest OS.
- Wait 5 minutes for the node to reboot.
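
The request URL in the certificate step is long, and a stray character in it makes the call fail. One way to reduce that risk is to build the URL from its two variable parts, as in the sketch below. It assumes a POSIX shell; build_apply_cert_url is a hypothetical helper, and the endpoint path is the one shown in the step above.

```shell
# Build the apply_certificate URL for the new NSX-T manager node.
# Usage: build_apply_cert_url <new-mgr-fqdn-or-ip> <certificate-id>
# (build_apply_cert_url is a hypothetical helper; it only prints the URL.)
build_apply_cert_url() {
  local host="$1" cert_id="$2"
  if [ -z "$host" ] || [ -z "$cert_id" ]; then
    echo "usage: build_apply_cert_url <host> <certificate-id>" >&2
    return 1
  fi
  printf 'https://%s/api/v1/node/services/http?action=apply_certificate&certificate_id=%s\n' \
    "$host" "$cert_id"
}

# Example use with the curl options from the step above (not executed here):
#   curl -H 'Accept: application/json' -H 'Content-Type: application/json' \
#        --insecure -u 'admin:<password>' -X POST \
#        "$(build_apply_cert_url '<new nsx-t mgr fqdn or ip>' '<certificate id>')"
```

Quoting the command substitution keeps the & in the query string from being interpreted by the shell.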
Specific to VCF 4.0: If assigning the certificate fails because the certificate revocation list (CRL) could not be verified, please follow the steps in Failure to apply NSX-T certificate: Couldn't get LDAP context from URI to address the problem. If you decide to disable CRL checking in order to assign the certificate, re-enable CRL checking once the certificate has been assigned.

2.2.7.2 If the certificate does not exist, follow Replace Expired or Self-signed NSX-T Manager Certificates with VMCA-Signed Certificates for more information.

2.2.8. Wait for reboot to complete
- From NSX-T manager UI > Home > Dashboard > System, wait until 3-node NSX-T manager cluster has a green stable status
2.2.9. Verify VMCA cert of the new NSX-T manager node
- This step requires openssl
- Issue the openssl command to retrieve the certificate of the new NSX-T manager node
For example:
echo | openssl s_client -no_ign_eof -showcerts -connect \
######2.com:443 > nsx2.pem
- Issue the openssl command to display the certificate
For example:
openssl x509 -in nsx2.pem -noout -text | more
- Verify that the Issuer is the expected CA (psc-1 in this example)
- Verify that the Subject CN is the FQDN of the new node (vi1nsxmanager2.dellrack1.vmware.corp in this example)
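
To check only the two fields of interest instead of paging through the full text dump, the openssl x509 command can print the issuer and subject directly. The sketch below assumes a POSIX shell with openssl installed; show_issuer_subject is a hypothetical wrapper name, while -issuer and -subject are standard openssl x509 options.

```shell
# Print only the Issuer and Subject of the first certificate in a PEM file.
# Usage: show_issuer_subject <pem-file>
# (show_issuer_subject is a hypothetical wrapper around openssl x509.)
show_issuer_subject() {
  [ -f "$1" ] || { echo "file not found: $1" >&2; return 2; }
  # openssl x509 reads the first certificate block and ignores the
  # surrounding s_client output saved in the same file.
  openssl x509 -in "$1" -noout -issuer -subject
}
```

For example, show_issuer_subject nsx2.pem prints one issuer line and one subject line, which you can compare against the expected CA and node FQDN.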
2.2.10. Refresh SDDC Manager SSH Key Store for VCF 4.x and 5.x

This step is specific to VCF 4.x and VCF 5.x. As a final step, update the SSH keys that SDDC Manager saves for the NSX-T managers. VMware offers a script that automates this process. Follow the Refresh SDDC Manager SSH Keys procedure documented in How to update the SSH host keys on the SDDC Manager.

2.2.11 Update VM Anti-Affinity Rule

When Cloud Foundation deploys NSX-T Manager, it creates a VM anti-affinity rule to prevent the VMs of the same NSX-T Manager cluster from running on the same host. In this step, you need to add the newly deployed replacement VM to the rule for this NSX-T Manager cluster.
- Log in to the management domain vCenter Server, and select Menu > Hosts and Clusters.
- In the Navigator pane, select the management cluster.
- Select Configure > VM/Host Rules.
- Add the VM to the correct "separate virtual machine" rule. The rule for the management-domain NSX-T Manager cluster is named anti-affinity-rule-nsxt, while the rule for workload domains has the form "<NSXT Mgr VIP FQDN> - NSX-T Managers Anti Affinity Rule". Once you locate the rule, click Edit, and add the newly deployed VM (e.g., vi1nsxmanager2) to it.
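
The two rule-name patterns described above can be expressed as a small lookup, which is handy when searching a long list of VM/Host rules. This is a sketch assuming a POSIX shell; anti_affinity_rule_name is a hypothetical helper, and the keyword mgmt used to select the management domain is an arbitrary choice for this example.

```shell
# Print the expected anti-affinity rule name for an NSX-T Manager cluster.
# Usage: anti_affinity_rule_name mgmt
#        anti_affinity_rule_name <nsxt-mgr-vip-fqdn>
# (hypothetical helper; "mgmt" is a keyword chosen for this example only)
anti_affinity_rule_name() {
  if [ -z "$1" ]; then
    echo "usage: anti_affinity_rule_name mgmt|<vip-fqdn>" >&2
    return 1
  elif [ "$1" = "mgmt" ]; then
    # Management-domain NSX-T Manager cluster
    echo "anti-affinity-rule-nsxt"
  else
    # Workload-domain NSX-T Manager cluster, keyed by its VIP FQDN
    printf '%s - NSX-T Managers Anti Affinity Rule\n' "$1"
  fi
}
```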
