Product: NSX Manager Cluster
Related components: VCLOUD, ESXi hosts
The root cause of the intermittent cluster instability and operational failures was transient instability in the storage/datastore backing one of the NSX Manager cluster nodes. Storage errors caused intermittent input/output (I/O) failures on that node, so it repeatedly lost communication with the rest of the cluster. This produced the intermittent cluster alarms and the UI/API failures.
To resolve the issue, the affected NSX Manager node was isolated, removed from the cluster, and fully redeployed onto a known-stable host/datastore configuration.
Follow these steps to replace the affected node:
Removing and redeploying the affected node moves the NSX Manager services off the unstable storage volume/host, eliminating the intermittent I/O errors and communication failures. This action restores the high availability (HA) and quorum of the NSX cluster.
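After the node is redeployed, it can help to confirm that the cluster actually reports a stable state with all members online. The following is a minimal sketch, not an official procedure. It assumes a status payload shaped like the response of the NSX cluster status API (GET /api/v1/cluster/status), with a "mgmt_cluster_status" object containing "status", "online_nodes", and "offline_nodes". Those field names, the three-node cluster size, and the sample payloads are assumptions for illustration; verify them against the API reference for your NSX version.

```python
from typing import Any, Dict


def has_quorum(status: Dict[str, Any], expected_nodes: int = 3) -> bool:
    """Return True if a strict majority of the expected nodes is online.

    `status` is assumed to be the parsed JSON of a cluster status response.
    """
    mgmt = status.get("mgmt_cluster_status", {})
    online = len(mgmt.get("online_nodes", []))
    return online > expected_nodes // 2


def is_fully_healthy(status: Dict[str, Any], expected_nodes: int = 3) -> bool:
    """Return True only if the cluster is STABLE with every node online."""
    mgmt = status.get("mgmt_cluster_status", {})
    online = len(mgmt.get("online_nodes", []))
    offline = len(mgmt.get("offline_nodes", []))
    return (
        mgmt.get("status") == "STABLE"
        and online == expected_nodes
        and offline == 0
    )


# Hypothetical payloads: node identifiers are placeholders, not real UUIDs.
after_redeploy = {
    "mgmt_cluster_status": {
        "status": "STABLE",
        "online_nodes": [{"uuid": "node-a"}, {"uuid": "node-b"}, {"uuid": "node-c"}],
        "offline_nodes": [],
    }
}
during_incident = {
    "mgmt_cluster_status": {
        "status": "DEGRADED",
        "online_nodes": [{"uuid": "node-a"}, {"uuid": "node-b"}],
        "offline_nodes": [{"uuid": "node-c"}],
    }
}

if __name__ == "__main__":
    print("after redeploy healthy:", is_fully_healthy(after_redeploy))
    print("during incident quorum:", has_quorum(during_incident))
    print("during incident healthy:", is_fully_healthy(during_incident))
```

The two checks are kept separate on purpose: a three-node cluster with one node offline still has quorum, but it has lost its redundancy, which matches the intermittent degraded state described above.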