Resolving NSX Manager Corruption Post-Storage Outage via Redeployment and Restore

Article ID: 406776


Updated On:

Products

VMware NSX

Issue/Introduction

An NSX Manager appliance (or the entire NSX Manager cluster) became corrupted and unrecoverable following an underlying storage outage. This rendered the NSX management plane inaccessible and inoperable, impacting the ability to manage the NSX environment. 
Symptoms typically include the NSX Manager VM being unresponsive, unable to boot, or reporting disk/filesystem errors.

Environment

VMware NSX 4.x

Cause

The direct cause of the NSX Manager corruption was an unplanned storage outage or severe storage degradation. This led to data inconsistencies or damage to the NSX Manager's virtual disks and/or its internal distributed database (CorfuDB), rendering the appliance unbootable or inoperable.

Resolution

The issue was resolved by completely deleting the corrupted NSX Manager appliance(s) from vCenter and deploying new NSX Manager appliance(s) from OVA, followed by restoring the NSX configuration from a valid external backup.

Steps:

  1. Verify External Backup Availability (CRITICAL PRE-STEP):

    • Confirm that a recent and valid NSX Manager backup is available on your configured external SFTP server (or other backup destination).
    • Ensure you have the exact filename, location (path on SFTP), and credentials for the backup.
    • Without a valid external backup, this recovery method is not possible.
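    • Optional sanity check: from any workstation with an OpenSSH client, you can list the backup directory to confirm the archive is present (the host, port, user, and directory below are the same placeholder values used in the example in step 5):
      sftp -P 22 nsxuser@192.168.1.100
      sftp> ls -l nsx_backups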
  2. Download Corresponding NSX Manager OVA:

    • Download the NSX Manager OVA file from Broadcom/VMware Customer Connect that matches the exact NSX version of your backup. For example, if your backup is from NSX 4.2.1.0, download the 4.2.1.0 OVA.
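    • After downloading, it is good practice to compare the OVA checksum against the value published on the download page before deploying (the filename below is a placeholder; substitute the exact file you downloaded):
      sha256sum nsx-unified-appliance-<version>.<build>.ova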
  3. Clean Up Corrupted NSX Manager(s) from vCenter:

    • If the corrupted NSX Manager VM(s) are still present in the vCenter inventory, power them off (if possible), then right-click and select Delete from Disk to remove both the VM and its associated virtual disk files.
    • If you may still need the disks for forensic analysis, use Remove from Inventory instead and delete the files only once the analysis is complete.
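    • If you prefer a command-line alternative to the vSphere Client, the same cleanup can be done with the open-source govc CLI (the VM name below is a placeholder; GOVC_URL and credentials must point at your vCenter):
      govc vm.power -off -force nsx-manager-01
      govc vm.destroy nsx-manager-01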
  4. Deploy the First New NSX Manager Appliance (Primary Node):

    • In vCenter Server, navigate to the desired cluster/host.
    • Right-click and select Deploy OVF Template.
    • Select your downloaded NSX Manager OVA.
    • Crucially, deploy the appliance with the following original settings:
      • Same FQDN/Hostname.
      • Same IP Address, Netmask, Gateway, DNS, NTP servers.
      • Same Form Factor (Small, Medium, Large, X-Large).
    • Power on the newly deployed VM.
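    • The deployment can also be scripted with VMware OVF Tool instead of the vSphere Client wizard. The sketch below is illustrative only: it assumes the standard NSX unified appliance OVF properties (nsx_hostname, nsx_ip_0, and so on) and uses placeholder names and addresses throughout. Run ovftool against the downloaded OVA first to list the exact property names, network mappings, and deployment options for your version:
      ovftool --acceptAllEulas --name=nsx-manager-01 --deploymentOption=medium \
        --datastore=<datastore> --net:"<ova_network_name>"=<management_portgroup> \
        --prop:nsx_hostname=nsxmgr01.example.com --prop:nsx_ip_0=192.168.1.51 \
        --prop:nsx_netmask_0=255.255.255.0 --prop:nsx_gateway_0=192.168.1.1 \
        nsx-unified-appliance-<version>.<build>.ova \
        vi://<vcenter_user>@<vcenter_fqdn>/<datacenter>/host/<cluster>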
  5. Restore NSX Configuration from Backup:

    • Wait for the new NSX Manager VM to fully boot and reach the admin CLI prompt.
    • Do NOT proceed with the initial cluster setup prompts in the console.
    • Execute the restore command, providing the SFTP details and backup filename:
      restore backup file <backup_file_name.tar.gz> url sftp://<sftp_server_ip_or_fqdn>:<port>/<backup_directory> username <sftp_username> password <sftp_password>
      Example:
      restore backup file nsx-backup-20240620100000.tar.gz url sftp://192.168.1.100:22/nsx_backups username nsxuser password MySecureP@ssword
    • Monitor the console. The appliance will download the backup, restore the configuration, and automatically reboot multiple times.
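    • If the UI is not yet reachable, restore progress can also be polled through the NSX REST API once the API service comes back up (the FQDN and credentials below are placeholders):
      curl -k -u admin https://nsxmgr01.example.com/api/v1/cluster/restore/status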
  6. Verify Restored NSX Manager (Primary Node):

    • Once the appliance reboots and fully comes online, access the NSX Manager web UI using its FQDN/IP.
    • Log in with the original NSX Manager admin credentials (from the time the backup was taken).
    • Verify the NSX Manager UI is accessible and appears as expected.
    • Check System > Fabric > Nodes > NSX Managers. The primary node should show "Up" and "Stable".
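    • From the admin CLI on the restored node, you can additionally confirm that the cluster and its service groups are healthy:
      get cluster status
      get services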
  7. Deploy and Join Remaining Cluster Nodes (if applicable):

    • If your original deployment was a 3-node cluster, repeat step 4 to deploy the other two NSX Manager VMs using their original IP addresses and form factors.
    • After the new secondary nodes boot, use the join command at their console prompts to join the existing restored cluster:
      join <Primary_NSX_Manager_IP_or_FQDN> username admin password <admin_password> role member
    • Monitor in the NSX UI until all cluster nodes are "Up" and "Stable".
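    • Note that, depending on the NSX version, the join command may also require the cluster ID and the API certificate thumbprint of the restored primary node. Both can be read from the primary node's admin CLI:
      get cluster config
      get certificate api thumbprint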
  8. Post-Restoration Verification and Reconciliation:

    • Compute Manager (vCenter) Status: In NSX UI, System > Fabric > Compute Managers. Ensure vCenter connection is "Up". If "Disconnected," click Edit and re-enter credentials.
    • Host Transport Nodes Status: System > Fabric > Nodes > Host Transport Nodes. All ESXi hosts should report "Up" and "Success". If any are "Not Configured" or "Degraded", select them and click "Resolve" to re-prepare.
    • Edge Transport Nodes Status: System > Fabric > Nodes > Edge Transport Nodes. All NSX Edges should report "Up" and "Success". If not, attempt "Resolve" or redeploy/reconfigure if necessary.
    • Verify Functionality: Test core NSX functionalities: VM network connectivity on segments, North-South traffic, DFW rules, NAT rules, Load Balancers, and routing protocols.
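    • For an at-a-glance summary, the overall cluster health can also be queried via the NSX REST API (the FQDN and credentials below are placeholders):
      curl -k -u admin https://nsxmgr01.example.com/api/v1/cluster/status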

Justification:
This resolution successfully restored the NSX management plane from a state of corruption following a storage outage. The process leverages VMware's built-in backup and restore functionality, which is the supported method for disaster recovery of the NSX Manager cluster. By deploying a fresh appliance and restoring the configuration, the integrity of the NSX environment is re-established, allowing the virtual network and security services to resume operation.

Additional Information