NSX Manager is not working properly after experiencing storage issues affecting datastores related to the NSX appliance VMs

Article ID: 393945

Updated On:

Products

VMware NSX

Issue/Introduction

After a storage outage, such as a vSAN failure or other issues affecting the datastores backing the NSX Manager VMs, any of the following may occur:

  • NSX Manager nodes may fail to boot up completely. When checking via the console, there may be messages stating that partitions are in read-only (RO) status, or services may fail to start with an error such as "Failed to start service X". (A quick way to confirm read-only mounts from the console is sketched after this list.)

  • The UI is unreachable for the impacted node. If multiple nodes were impacted, the UI may be unreachable for all cluster members and/or the VIP.
  • If UI access is available, the cluster may show a degraded status, and one or more of the Manager nodes may have alerts related to storage errors.
  • Home > Open Alarms may show Storage Error events with Critical severity.
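
If root shell access to the appliance console is available, read-only mounts can usually be confirmed with standard Linux commands. This is a minimal sketch; device and partition names differ per appliance:

  grep -w ro /proc/mounts            (lists any filesystems currently mounted read-only)
  dmesg -T | grep -i "read-only"     (kernel messages about filesystems being remounted read-only)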

Environment

VMware NSX 4.x
VMware NSX-T Data Center 3.x

Cause

A storage failure on the datastores backing NSX appliance VMs causes filesystem corruption.

Resolution

Refer to the "Manager node disk/partition is mounted as read-only" alarm in NSX Manager Disk Corruption correction using FSCK for information and steps to resolve issues related to filesystems mounted in read-only mode.

In many cases, rebooting an NSX Manager node can restore operability once the underlying storage issue has been resolved. If multiple nodes are impacted, reboot them one at a time and monitor the console as each one comes up; NSX will attempt to repair filesystem issues automatically. If the repair succeeds, check the status of services on the Manager node, either in the UI under Fabric > Appliances > View Details (per node) or from the admin CLI with the command: get services
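
For reference, a typical post-reboot check from the admin CLI could look like the following. This is a minimal sketch: the service name "manager" is used only as an example, and output varies by NSX version, so verify commands against the CLI reference for your release.

  get services             (lists each service and its current runtime state)
  get service manager      (detailed status of a single service; "manager" is an example)
  get cluster status       (overall cluster state and per-group status)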

If rebooting does not correct the issue then, as described in the article linked above, a node replacement may be needed if not all nodes are impacted. Otherwise, a restore from backup will likely be required.

  • To replace a single node in an NSX Manager cluster, deploy an additional Manager appliance node and delete the faulted one. To keep the same name and IP, first remove the faulted appliance from NSX, power off the VM in vSphere and delete it (or rename it first if preferred), then add a node to the cluster using the FQDN and IP that the old node previously used. Refer to Deploy NSX Manager Nodes to Form a Cluster from the UI. (A CLI alternative for removing the faulted node is sketched after this list.)
  • Replace an NSX node in a VCF/SDDC Manager Environment has steps that can be referenced whether or not NSX is part of an SDDC Manager deployment.
  • Use Restore a Backup to recover a single NSX Manager appliance VM, then use Deploy NSX Manager Nodes to Form a Cluster from the UI to add new nodes that replace the prior cluster members. Note that the backup restoration is performed on a single node only; additional nodes sync data from the restored node when they are added to the cluster.
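
If the UI is not reachable, the faulted node can also be removed from the cluster using the admin CLI of a healthy Manager node. This is a minimal sketch; confirm the node UUID carefully before detaching, and verify command availability for your NSX version:

  get cluster config         (note the UUID of the faulted node in the cluster member list)
  detach node <node-uuid>    (run on a healthy node to remove the faulted node from the cluster)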

 

Additional Information

If you are contacting Broadcom support about this issue, please provide the following:

  • NSX Manager support bundles, if possible (one way to collect them is sketched after this list)
  • The text of any error messages seen in the NSX GUI or CLI that are pertinent to the investigation.
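
In addition to the UI (System > Support Bundle), a per-node support bundle can typically be generated from the admin CLI. This is a sketch; the file name is only an example, and the syntax should be verified against the CLI reference for your version:

  get support-bundle file manager01-bundle.tgz    (generates a support bundle on the local node)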

Handling Log Bundles for offline review with Broadcom support