NSX Manager is down after changing the MTU on a Virtual Distributed Switch and losing access to network backed storage

Article ID: 407358


Updated On:

Products

VMware NSX

Issue/Introduction

 

  • Recent MTU changes were made on the Virtual Distributed Switch (VDS) just prior to the outage.

  • NSX Manager is offline or unreachable via its management interface. Attempts to ping or SSH into NSX Manager fail.

  • ESXi hosts show disconnected or inaccessible datastores, including those where the NSX Manager VMs' files are located (a quick CLI check is shown below).

  • If the vCenter Server's storage is also disconnected, the vSphere UI and vCenter shell become inaccessible, making it impossible to revert the MTU change on the VDS from vCenter. Because vCenter is required to make configuration changes to the VDS, the vCenter Server must be recovered first.
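To confirm the datastore symptoms on an affected ESXi host, the mounted filesystems and NFS datastores can be listed from the host command line (a quick check; names and output will vary by environment):

    # List mounted filesystems and their state
    esxcli storage filesystem list
    # List NFS datastores and whether they are currently accessible
    esxcli storage nfs list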

 

 

Environment

  • VMware NSX-T Data Center
  • VMware NSX
  • VMware vCenter Server

Cause

The MTU on the Virtual Distributed Switch (VDS) was decreased (from 9000 to 1500, for example), resulting in a mismatch with the VMkernel interfaces and NFS storage configured for jumbo frames. Traffic between ESXi hosts and the storage backend began traversing a VDS path that could not accommodate the larger frame size, leading to packet fragmentation or drops. This disrupted access to datastores hosting NSX Manager and vCenter Server VMs, rendering them unreachable.
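The mismatch can be confirmed from an affected ESXi host by comparing the switch MTU with the MTU of the VMkernel interfaces (a quick check; output varies by environment):

    # MTU currently applied to the VDS on this host
    esxcli network vswitch dvs vmware list
    # MTU of each VMkernel interface (vmk)
    esxcli network ip interface list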

Resolution

Option 1. If vCenter Server is still available:

  1. Revert the MTU settings on the VDS to reconnect the network backed storage (a PowerCLI sketch is included after this list).
    *Refer to Enabling Jumbo Frames on virtual switches

  2. Check the health of the NSX Manager nodes. Loss of storage access disrupts management plane services, and the NSX Manager appliance VMs will likely need to be rebooted so that they can run filesystem checks and remediate minor issues on startup.
    *For NSX Manager issues that can be caused by storage failures, refer to NSX Manager is not working properly after experiencing storage issues affecting datastores related to the NSX appliance VMs
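    *The MTU revert in step 1 can also be done with PowerCLI. A minimal sketch, assuming the VDS is named <VDS_NAME> and the original MTU was 9000:

      # Connect to vCenter and set the VDS MTU back to its original value
      Connect-VIServer -Server <vCenter_FQDN>
      Get-VDSwitch -Name "<VDS_NAME>" | Set-VDSwitch -Mtu 9000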

 

Option 2. vCenter Recovery Steps if the vCenter Server's storage was also impacted and vSphere is unavailable for VDS management:

Temporary Network Reconfiguration

*Many of the steps below can be done through UI access to an ESXi host or from the ESXi command line.
*For similar CLI steps, refer to Configuring Standard vSwitch (vSS) or virtual Distributed Switch (vDS) from the command line in ESXi
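*For reference, a minimal CLI sketch of steps 1-3 and the VLAN setting in step 5 below, assuming the example names vSwitchTemp, TempPG, and vmnic1 (substitute values for your environment):

    # 1. Create a temporary standard vSwitch and set its MTU to 9000
    esxcli network vswitch standard add --vswitch-name=vSwitchTemp
    esxcli network vswitch standard set --vswitch-name=vSwitchTemp --mtu=9000

    # 2. Create a temporary port group on the new vSwitch
    esxcli network vswitch standard portgroup add --portgroup-name=TempPG --vswitch-name=vSwitchTemp

    # 3. Attach the vmnic to the new vSwitch (it must first be released from the VDS, for example via the host UI)
    esxcli network vswitch standard uplink add --uplink-name=vmnic1 --vswitch-name=vSwitchTemp

    # 5. Set the VLAN on the temporary port group to match the NFS network, if applicable
    esxcli network vswitch standard portgroup set --portgroup-name=TempPG --vlan-id=<VLAN_ID>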

  1. Create a Temporary Virtual Standard Switch (VSS) if one does not already exist

    • Configure with MTU 9000

  2. Create a temporary portgroup (PG) on the VSS. 

  3. Move a vmnic (physical uplink) from the VDS to the new VSS. 

  4. Reconfigure the VMkernel interface (vmk) that was being used to connect ESXi to the network backed storage. It must be deleted from the VDS before a new one is added on the temporary VSS PG. 

    • To remove a vmk from the VDS by command line: 

      esxcli network ip interface remove --interface-name=vmk<VMK_NUMBER>
    • To recreate the vmk on the VSS PG by command line:

      • # 1. Create a VMkernel interface on the temporary port group with MTU 9000 for jumbo frames

        esxcli network ip interface add --interface-name=vmk<VMK_NUMBER> --portgroup-name=<TEMP_PG_NAME> --mtu=9000

        # 2. Assign a static IPv4 address and netmask to the interface

        esxcli network ip interface ipv4 set --interface-name=vmk<VMK_NUMBER> --ipv4=<IP_ADDRESS> --netmask=<NETMASK> --type=static

        # 3. (Optional) Add a static route if the storage network is not on the vmk's local subnet

        esxcli network ip route ipv4 add --gateway <GATEWAY_IP> --network <NETWORK_CIDR>
  5. Set the VLAN on the Temporary Portgroup, if applicable, ensuring it matches the NFS network configuration.

  6. Verify the backend storage is configured properly (for example, NFS export permissions allow the ESXi host's vmk IP) and can communicate with the ESXi host

  7. Test Connectivity from ESXi to the network backed storage server

    # Basic reachability test
    vmkping -I vmk<VMK_NUMBER> <StorageServer_IP_OR_FQDN>
    # Jumbo frame test: 8972-byte payload plus headers = 9000, with do-not-fragment set
    vmkping -I vmk<VMK_NUMBER> -d -s 8972 <StorageServer_IP_OR_FQDN>
  8. Mount the NFS Datastore, if needed (a CLI sketch is included after this list)

  9. Verify vCenter VM Visibility and Network Access

    • Confirm vCenter files are visible

    • Ensure the VM is registered and can power on

    • If network access fails, edit the vCenter's VM settings to attach a NIC to the temporary VSS portgroup

  10. Restore VDS MTU to 9000 via vCenter UI 
    *This should re-establish storage access for other ESXi hosts
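*For steps 8 and 9, a minimal CLI sketch, assuming an NFSv3 datastore and placeholder names (substitute the actual NFS server, export path, datastore name, and VMX path):

    # 8. Mount the NFS datastore on the host
    esxcli storage nfs add --host=<StorageServer_IP_OR_FQDN> --share=<EXPORT_PATH> --volume-name=<DATASTORE_NAME>

    # 9. Register the vCenter Server VM (if it is not already registered), find its VM ID, and power it on
    vim-cmd solo/registervm /vmfs/volumes/<DATASTORE_NAME>/<VCENTER_VM_FOLDER>/<VCENTER_VM_NAME>.vmx
    vim-cmd vmsvc/getallvms
    vim-cmd vmsvc/power.on <VM_ID>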

 
Final Cleanup

  1. Migrate the VMkernel interface and the vmnic back to the VDS

  2. Validate the network configuration, then remove the temporary VSS through the vCenter or host UI and exit Maintenance Mode if the host was placed in it

  3. Check the health of NSX Manager nodes (see step 2 from Option 1 above)
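Once storage access is restored and the NSX Manager appliances have been rebooted, their health can be spot-checked from the NSX Manager admin CLI (a quick check; a healthy cluster reports all cluster groups as stable and services as running):

    get cluster status
    get services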

Additional Information