NSX-T cluster not available due to underlying storage issue

Article ID: 322407

Products

VMware NSX

Issue/Introduction

  • You are running NSX-T in a 3-node manager cluster.
  • The underlying storage of the NSX-T managers can be shared or different for each node.
  • The NSX-T UI may be unavailable.
  • If the UI is available, one manager may be shown as down in the list of appliances.
  • Some virtual machines may experience dataplane impact.
  • Appliance CLI commands may fail, for example:
    nsx-mgr> get cluster status
    % The get cluster status operation cannot be processed currently, please try again later
  • ESXi hosts still show connections to the controllers on port 1235:
    esxcli network ip connection list | grep 1235
  • The logs may show unexpected gaps between entries:

    <182>1 2022-04-09T05:46:47.738Z nsx-mgr-2 ...
    <182>1 2022-04-09T05:46:47.740Z nsx-mgr-2 ...
    <30>1 2022-04-09T07:35:13.931257+00:00 nsx-mgr-2 ...
    <30>1 2022-04-09T07:35:13.931524+00:00 nsx-mgr-2 ...
    <30>1 2022-04-09T07:35:13.931534+00:00 nsx-mgr-2 ...
    <30>1 2022-04-09T07:35:13.931540+00:00 nsx-mgr-2 ...
    

Note: The content of the messages is not relevant here. What matters is the gap in the log entries above, between 05:46 and 07:35.
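As an illustration of how such a gap could be spotted in a long log file, the following is a minimal Python sketch, not a VMware tool. It assumes each entry starts with a syslog priority and version followed by an ISO 8601 timestamp, as in the sample above, and that all timestamps are in UTC. The function name find_gaps and the 30-minute threshold are hypothetical choices for this example.

```python
import re
from datetime import datetime, timedelta

# Matches "<PRI>VERSION " followed by a timestamp, keeping second precision only.
TS_RE = re.compile(r"^<\d+>\d+ (\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})")


def find_gaps(lines, threshold=timedelta(minutes=30)):
    """Return (previous, current) timestamp pairs that are more than threshold apart."""
    gaps = []
    prev = None
    for line in lines:
        match = TS_RE.match(line.strip())
        if not match:
            continue  # skip lines without a recognizable timestamp
        ts = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
        if prev is not None and ts - prev > threshold:
            gaps.append((prev, ts))
        prev = ts
    return gaps


# Two entries taken from the log sample above (messages elided as in the article).
sample = [
    "<182>1 2022-04-09T05:46:47.740Z nsx-mgr-2 ...",
    "<30>1 2022-04-09T07:35:13.931257+00:00 nsx-mgr-2 ...",
]

for start, end in find_gaps(sample):
    print(f"gap of {end - start} between {start} and {end}")
```

Fractional seconds and the timezone suffix are ignored for simplicity, so the sketch would need adjusting for logs that mix time zones.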

Environment

VMware NSX-T

Cause

When an underlying storage issue impacts the NSX-T managers in the cluster, the operating system sets all mounts to read-only.
Due to an issue in the way Corfu detects whether a node is active, the impacted node (which can be one or more of the nodes in the 3-node cluster) is NOT marked as down. Corfu continues trying to write to its logs; these writes fail, and the node effectively goes down.
This is where the gap in the logs comes from: no service is able to write to the logs while the mounts are read-only.
Transport nodes using the impacted manager remain connected to it because it is not marked as down, so they do not fail over to a different manager.
This can lead to dataplane impact for VMs that reside on these transport nodes.
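To illustrate the read-only condition described above, the following minimal Python sketch lists mount points whose options include "ro" in text that follows the Linux /proc/mounts format. It is not a VMware-provided procedure. It assumes a Linux-based system, and the sample device names and paths are hypothetical.

```python
def read_only_mounts(mounts_text):
    """Return mount points whose option list contains 'ro' (read-only)."""
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # not a valid mount table line
        mount_point = fields[1]
        options = fields[3].split(",")
        if "ro" in options:
            result.append(mount_point)
    return result


# Hypothetical sample in /proc/mounts format; devices and paths are illustrative only.
sample = """\
/dev/sda2 / ext4 ro,relatime 0 0
/dev/sda3 /data ext4 rw,relatime 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
"""

print(read_only_mounts(sample))
```

On a live Linux system the same function could be given the contents of /proc/mounts. Some mounts are read-only by design, so a result is only a hint that needs interpreting.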

Resolution

VMware NSX 4.0.0 added a mechanism that detects whether Corfu is in a read-only state by performing a write operation to disk, and marks the node as down for Corfu if a read-only state is detected.
To work around the issue, reboot the impacted appliance.
If there are disk errors after the reboot, follow the steps in this KB: 
Manager node disk/partition is mounted as read-only alarm in NSX Manager Disk Corruption correction using FSCK
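
The general idea of detecting a read-only state with a write operation can be sketched as follows. This is a minimal Python illustration of a write probe, not the mechanism implemented in NSX 4.0.0, whose internals are not described in this article. The function name is_writable is hypothetical.

```python
import errno
import os
import tempfile


def is_writable(directory):
    """Probe a directory by writing a small temporary file and syncing it to disk."""
    try:
        with tempfile.NamedTemporaryFile(dir=directory) as probe:
            probe.write(b"probe")
            probe.flush()
            os.fsync(probe.fileno())  # force the write to reach the disk
        return True
    except OSError as exc:
        # EROFS indicates a read-only file system; any other OSError
        # (for example EACCES, ENOSPC, ENOENT) also means the write failed.
        name = errno.errorcode.get(exc.errno, exc.errno)
        print(f"write probe failed in {directory}: {name}")
        return False


print(is_writable(tempfile.gettempdir()))
```

A health check built on such a probe could mark a node as down after the probe fails, instead of relying only on whether the process is still running.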