NSX Manager displays I/O error from SSH and Console

Article ID: 414708

Updated On:

Products

VMware NSX

Issue/Introduction

When you SSH into NSX Manager, or open the console of the Manager VM through vCenter, multiple I/O error log lines are displayed.

The cluster VIP may keep getting reassigned to different Manager nodes.

The get cluster status command returns STABLE for the cluster.

NSX Manager uptime may be high (over 100 days).

Environment

VMware NSX 4.1.2.3

Cause

The underlying storage may have unrecoverable read/write errors, which the Linux kernel reports as I/O errors.
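As a rough illustration of how such kernel messages could be picked out of a captured log, the following Python sketch filters lines by a few substrings. The patterns are assumptions based on common Linux kernel storage messages, not the exact lines from this article, and `find_io_errors` is a hypothetical helper name.

```python
import re

# Assumed patterns for illustration only; the exact messages on an
# affected NSX Manager may differ.
IO_ERROR_PATTERNS = [
    re.compile(r"I/O error", re.IGNORECASE),
    re.compile(r"Remounting filesystem read-only", re.IGNORECASE),
    re.compile(r"EXT4-fs error", re.IGNORECASE),
]


def find_io_errors(log_text):
    """Return the log lines that match any of the assumed patterns."""
    return [
        line
        for line in log_text.splitlines()
        if any(pattern.search(line) for pattern in IO_ERROR_PATTERNS)
    ]
```

The input could be, for example, the saved output of dmesg collected from the appliance's root shell.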

The issue can result from filesystem corruption that occurred while a storage problem was present in the environment.
Even after the storage problem is resolved, the Manager appliance OS may mount the filesystem in read-only mode because the filesystem may have been corrupted.
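One way to confirm a read-only mount from the root shell is to inspect /proc/mounts. The Python sketch below parses that file's standard format (device, mount point, type, options) and reports entries whose options include `ro`. The helper name `read_only_mounts` and the default list of filesystem types are assumptions for illustration; the types actually used on the appliance should be checked first.

```python
def read_only_mounts(mounts_text, fs_types=("ext2", "ext3", "ext4", "xfs")):
    """Return (mount_point, fs_type) pairs that are mounted read-only.

    fs_types limits the check to disk-backed filesystems, because pseudo
    filesystems can legitimately be read-only. The default tuple is an
    assumption, not taken from NSX documentation.
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, fs_type, options = fields[1], fields[2], fields[3]
        if fs_type in fs_types and "ro" in options.split(","):
            result.append((mount_point, fs_type))
    return result
```

For example, `read_only_mounts(open("/proc/mounts").read())` returns an empty list when every matching partition is mounted read/write.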

Resolution

Workaround:

 

Option 1: Attempt a Controlled Reboot

 

  1. Reboot the NSX Manager VM: If the underlying storage issue was temporary (e.g., a brief network interruption), a simple reboot of the NSX Manager VM may allow the filesystem to recover and the partition to be remounted as read/write. If more than one Manager node in the cluster is affected, reboot the nodes one at a time and wait for each node to come back before rebooting the next.
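The one-at-a-time sequence can be sketched as a small loop. This is only an outline of the control flow: `reboot` and `is_healthy` are hypothetical callables that the operator would supply (for instance, wrappers around whatever tooling is used to restart the VM and to check the node), and the timeout and polling values are arbitrary placeholders.

```python
import time


def rolling_reboot(nodes, reboot, is_healthy,
                   timeout_s=1800, poll_s=30, sleep=time.sleep):
    """Reboot nodes sequentially, stopping at the first that fails to recover.

    Returns (completed_nodes, failed_node); failed_node is None on success.
    """
    completed = []
    for node in nodes:
        reboot(node)  # hypothetical: trigger the reboot of one node
        waited = 0
        while not is_healthy(node):  # hypothetical health check
            if waited >= timeout_s:
                # Do not touch the remaining nodes if this one is still down.
                return completed, node
            sleep(poll_s)
            waited += poll_s
        completed.append(node)
    return completed, None
```

Stopping at the first node that does not recover keeps the remaining nodes running, so the cluster is not left with several members down at once.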

 

Option 2: Filesystem Check (fsck) for Read-Only Partitions

 

If the reboot does not resolve the issue and the console shows that a partition (such as /) is mounted read-only, a filesystem check is required.

  1. Reboot the NSX Manager.

  2. Access GRUB Menu: During the boot process, press the Shift or Esc key repeatedly to enter the GRUB menu.

  3. Edit Boot Parameters: Select the Ubuntu/NSX Manager boot option and press 'e' to edit the command line.

  4. Append fsck Parameters: Navigate to the line starting with linux and append the following kernel parameters to the end of that line:

    fsck.mode=force fsck.repair=yes
    
  5. Boot the Appliance: Press F10 or Ctrl+X to boot with the modified parameters. This forces the OS to run an automated filesystem check and attempt to repair any corruption.

  6. Verify Status: After booting, log in and verify that the filesystem is no longer mounted read-only, then run get services to confirm that all NSX services are running.
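Because the GRUB edit in the steps above applies only to that one boot, it can be useful to confirm that the parameters were actually passed to the kernel. On Linux, /proc/cmdline holds the command line of the running kernel. The Python sketch below checks it for the two parameters from step 4; `missing_fsck_params` is a hypothetical helper name.

```python
# Kernel parameters from step 4 of this procedure.
REQUIRED_PARAMS = ("fsck.mode=force", "fsck.repair=yes")


def missing_fsck_params(cmdline):
    """Return the required parameters that are absent from a kernel command line."""
    tokens = cmdline.split()
    return [param for param in REQUIRED_PARAMS if param not in tokens]
```

For example, `missing_fsck_params(open("/proc/cmdline").read())` returns an empty list when both parameters were present for the current boot. An empty list shows only that the check was requested, not that the repair succeeded.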

 

Option 3: Node Replacement (If Corruption Persists)

 

If the fsck process is unsuccessful or the filesystem corruption is too severe, the affected NSX Manager node needs to be replaced.

  1. Delete the Faulted Node: Remove the faulted appliance from the NSX cluster.

  2. Deploy a New Node: Deploy a new NSX Manager appliance with the same FQDN and IP address to replace the failed member. The new node will synchronize its data from the remaining healthy nodes.