An NSX Manager appliance (or the entire NSX Manager cluster) became corrupted and unrecoverable following an underlying storage outage. This rendered the NSX management plane inaccessible and inoperable, impacting the ability to manage the NSX environment.
Symptoms typically include the NSX Manager VM being unresponsive, failing to boot, or reporting disk/filesystem errors.
VMware NSX 4.x
The direct cause of the NSX Manager corruption was an unplanned storage outage or severe storage degradation. This led to data inconsistencies or damage to the NSX Manager's virtual disks and/or the internal distributed database (CorfuDB), rendering the appliance unbootable or inoperable.
Resolution:
The issue was resolved by completely deleting the corrupted NSX Manager appliance(s) from vCenter and deploying new NSX Manager appliance(s) from OVA, followed by restoring the NSX configuration from a valid external backup.
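Because the whole procedure depends on a valid external backup, it can help to confirm which backup file is the newest before deleting anything. The following is a minimal sketch and not part of the original procedure: `latest_backup` is a hypothetical helper, and it assumes file names follow the `nsx-backup-YYYYMMDDHHMMSS.tar.gz` pattern seen in this article's example, so that sorting by name also sorts by time. How the file names are listed from the SFTP server is left to the operator.

```bash
# Hypothetical helper (not an NSX command): print the newest backup file name
# among the names passed as arguments.
# Assumption: names look like nsx-backup-YYYYMMDDHHMMSS.tar.gz, so a plain
# byte-wise sort orders them chronologically.
latest_backup() {
  printf '%s\n' "$@" \
    | grep -E '^nsx-backup-[0-9]{14}\.tar\.gz$' \
    | LC_ALL=C sort \
    | tail -n 1
}

# Usage: pass the file names found in the backup directory.
latest_backup \
  nsx-backup-20240619100000.tar.gz \
  nsx-backup-20240620100000.tar.gz \
  notes.txt
```

Names that do not match the pattern are ignored. If nothing matches, the function prints nothing, which should be treated as "no usable backup found".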
Steps:
1. Verify External Backup Availability (CRITICAL PRE-STEP):
2. Download Corresponding NSX Manager OVA:
3. Clean Up Corrupted NSX Manager(s) from vCenter:
   - Remove from Inventory.
   - Delete from Disk to remove the associated virtual disk files. Ensure you have no further need for forensic analysis before deleting disks.
4. Deploy the First New NSX Manager Appliance (Primary Node):
   - Deploy OVF Template.
5. Restore NSX Configuration from Backup:
   - At the admin CLI prompt, run the restore command:
       restore backup file <backup_file_name.tar.gz> url sftp://<sftp_server_ip_or_fqdn>:<port>/<backup_directory> username <sftp_username> password <sftp_password>
     For example:
       restore backup file nsx-backup-20240620100000.tar.gz url sftp://192.168.1.100:22/nsx_backups username nsxuser password MySecureP@ssword
6. Verify Restored NSX Manager (Primary Node):
   - Navigate to System > Fabric > Nodes > NSX Managers. The primary node should show "Up" and "Stable".
7. Deploy and Join Remaining Cluster Nodes (if applicable):
   - On the additional nodes, use the join command at their console prompts to join the existing restored cluster:
       join <Primary_NSX_Manager_IP_or_FQDN> username admin password <admin_password> role member
8. Post-Restoration Verification and Reconciliation:
   - Navigate to System > Fabric > Compute Managers. Ensure the vCenter connection is "Up". If it shows "Disconnected", click Edit and re-enter the credentials.
   - Navigate to System > Fabric > Nodes > Host Transport Nodes. All ESXi hosts should report "Up" and "Success". If any are "Not Configured" or "Degraded", select them and click "Resolve" to re-prepare them.
   - Navigate to System > Fabric > Nodes > Edge Transport Nodes. All NSX Edges should report "Up" and "Success". If not, attempt "Resolve", or redeploy/reconfigure if necessary.
Justification:
This resolution successfully restored the NSX management plane from a state of corruption following a storage outage. The process leverages VMware's built-in backup and restore functionality, which is the supported method for disaster recovery of the NSX Manager cluster. By deploying a fresh appliance and restoring the configuration, the integrity of the NSX environment is re-established, allowing the virtual network and security services to resume operation.
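The post-restoration checks in the steps come down to one rule: any node whose status is not "Up" needs attention. As a rough illustration only, the sketch below filters a hand-made, tab-separated list of node names and statuses. The input format and the `nodes_needing_attention` name are assumptions made for this example; they are not NSX output or an NSX command.

```bash
# Hypothetical helper: read "name<TAB>status" lines on stdin and print the
# names whose status is anything other than "Up".
# Assumption: the list was typed or exported by hand; NSX does not emit this format.
nodes_needing_attention() {
  awk -F '\t' '$2 != "Up" { print $1 }'
}

# Usage with sample data (node names are made up):
printf 'esx-01\tUp\nesx-02\tDegraded\nedge-01\tNot Configured\n' \
  | nodes_needing_attention
```

With the sample data, only `esx-02` and `edge-01` are printed; these are the nodes to review in the NSX UI.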