NSX Manager cluster instability and intermittent UI/API failures caused by datastore I/O errors

Article ID: 419462

Updated On:

Products

VMware NSX

Issue/Introduction

  • The NSX Manager cluster is chronically and intermittently unstable, which causes critical operational failures: intermittent loss of connectivity to the NSX Manager administrative UI, and intermittent failures when VCLOUD tenants modify their firewall (FW) rules.
  • The instability shows up as recurring cluster alarms that self-resolve.
  • A temporary workaround, such as rebooting all NSX Managers, restores stability for only about one week.

Environment

Product: NSX Manager Cluster

Related components: VCLOUD, ESXi hosts

Cause

The root cause is a transient failure or instability in the storage/datastore backing one of the NSX Manager cluster nodes. Storage errors cause intermittent Input/Output (I/O) failures on that node. The node then loses communication with the rest of the cluster, which produces the intermittent cluster alarms and the UI/API failures.
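To help identify which node is affected, the guest OS logs of each NSX Manager can be scanned for storage error messages. The following is a minimal sketch, not an official diagnostic tool. The patterns are assumptions based on common Linux kernel storage error messages, and the sample lines and the `scan_for_io_errors` helper are hypothetical; adjust both to the log format actually present on your appliance.

```python
import re
from collections import Counter

# Assumption: these are typical Linux kernel storage-error messages.
# They are illustrative and not an exhaustive or NSX-specific list.
IO_ERROR_PATTERNS = {
    "block_io_error": re.compile(r"I/O error, dev \w+"),
    "buffer_io_error": re.compile(r"Buffer I/O error on dev"),
    "filesystem_error": re.compile(r"EXT4-fs error"),
    "remount_read_only": re.compile(r"Remounting filesystem read-only"),
}


def scan_for_io_errors(lines):
    """Count log lines matching each storage-error pattern."""
    counts = Counter()
    for line in lines:
        for name, pattern in IO_ERROR_PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# Hypothetical sample lines, for illustration only.
sample_log = [
    "kernel: blk_update_request: I/O error, dev sda, sector 123456",
    "kernel: Buffer I/O error on dev sda1, logical block 0, async page read",
    "kernel: EXT4-fs error (device sda1): ext4_journal_check_start: Detected aborted journal",
    "kernel: EXT4-fs (sda1): Remounting filesystem read-only",
    "systemd[1]: Started Session 42 of user admin.",
]

for name, count in sorted(scan_for_io_errors(sample_log).items()):
    print(f"{name}: {count}")
```

A node whose logs show these messages while the other nodes are clean is a likely candidate for the replacement procedure below. Confirm the finding against the datastore and host logs before acting on it.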

Resolution

To resolve the issue, isolate the faulty NSX Manager node, remove it from the cluster, and fully redeploy it onto a known stable host/datastore configuration.

Follow these steps to replace the affected node:

  1. Isolate and shut down the affected node.
  2. Gracefully remove the node from the NSX cluster using the procedure documented in KB 405669.
  3. Deploy a new NSX Manager appliance with the same IP and FQDN.
  4. Ensure the new appliance is placed on a datastore that has no reported underlying issues.
  5. Join the new appliance to the existing NSX cluster.

Removing and redeploying the affected node moves the NSX Manager services off the unstable storage volume/host, eliminating the intermittent I/O errors and communication failures. This action restores the high availability (HA) and quorum of the NSX cluster.
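One way to confirm that the cluster has recovered is to query the cluster status through the NSX REST API, both before removing the node and after the new appliance has joined. The sketch below assumes the `GET /api/v1/cluster/status` endpoint and the response fields `mgmt_cluster_status.status` and `control_cluster_status.status` with the value `STABLE`. Verify these names against the NSX API reference for your version. The helper names, the host name, and the credentials are hypothetical.

```python
import base64
import json
import ssl
import urllib.request


def fetch_cluster_status(manager, username, password):
    """Fetch the cluster status document from an NSX Manager.

    Assumption: the endpoint path below matches your NSX version.
    """
    url = f"https://{manager}/api/v1/cluster/status"
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    request = urllib.request.Request(
        url,
        headers={"Authorization": f"Basic {token}", "Accept": "application/json"},
    )
    # Uses the default trust store; the appliance certificate must be trusted.
    context = ssl.create_default_context()
    with urllib.request.urlopen(request, context=context, timeout=30) as response:
        return json.load(response)


def is_cluster_stable(status):
    """Return True only if both assumed status fields report STABLE."""
    mgmt = status.get("mgmt_cluster_status", {}).get("status")
    control = status.get("control_cluster_status", {}).get("status")
    return mgmt == "STABLE" and control == "STABLE"


# Offline illustration with a hypothetical response body.
example_response = {
    "mgmt_cluster_status": {"status": "STABLE"},
    "control_cluster_status": {"status": "STABLE"},
}
print(is_cluster_stable(example_response))
```

Against a live system, the call would look like `is_cluster_stable(fetch_cluster_status("nsx-mgr.example.com", "admin", "<password>"))`. Because the original problem was intermittent, checking repeatedly over several days gives more confidence than a single check.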

Additional Information

  • See KB 405669 for detailed node replacement steps.