NSX Host Reinstallation Blocked by vSAN Health Issues Leading to Storage Inconsistencies

Article ID: 408586

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

NSX host reinstallation cannot proceed due to underlying vSAN health problems that result in storage inconsistencies across the cluster. Depending on vSAN configuration, if a host is lost and affects quorum, critical infrastructure VMs including vCenter Server, NSX Manager, and Edge Nodes may experience storage inconsistencies requiring reboot or fsck repair. The vSAN issues prevent normal cluster operations and block NSX reinstallation attempts.

The following symptoms occur:

  • NSX host reinstallation fails with host showing "partial success" state
  • vSAN reports storage partition sync failures between hosts - check resync status as described in Using Esxcli Commands with vSAN (example commands are shown after this list)
  • Depending on vSAN configuration, if a host loss affects quorum, multiple VMs may show storage inconsistencies requiring reboot or fsck repair:
    • vCenter Server becomes unresponsive or services fail
    • NSX Manager enters read-only mode
    • Edge Nodes experience storage failures
  • ESXi hosts cannot enter maintenance mode with error: "Failed to enter namespaces maintenance mode due to Error: system_error Messages: vapi.send.failed" (KB 406801)
  • vCenter WCP service failure prevents normal cluster operations
  • NSX configuration shows VTEP vmk10 interface conflict: "Host configuration: VTEP [vmk10,<IP address>] failed to be applied: The vnic vmk10 exists" (KB 322412)
  • Management connectivity lost when attempting NSX remediation (management vmkernel on NSX switch) - requires CLI recovery (KB 326175)
  • Stale VDS entries block network reconfiguration: "Create DVSwitch failed with the following error message: Unable to Create Proxy DVS ; Status(bad0005)= Already exists" (KB 307917)
  • Manager node disk/partition mounted as read-only requiring fsck repair (KB 330478)
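
To confirm the vSAN-side symptoms from the command line, the cluster health summary and resync state can be checked from any ESXi host in the cluster. This is a minimal sketch, assuming SSH or console access to a host; output fields vary by ESXi/vSAN version:

    esxcli vsan health cluster list        # summary of vSAN health check results
    esxcli vsan debug resync summary get   # objects and bytes still waiting to resync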

Environment

  • VMware vSphere ESXi
  • VMware vCenter Server
  • VMware NSX
  • VMware vSAN
  • Management, vSAN, and vMotion vmkernel interfaces configured on NSX switch

Cause

The root cause is a set of vSAN health issues that create storage inconsistencies across the cluster. Depending on the vSAN configuration, when a lost host affects quorum, the following issues may occur:

  1. Storage inconsistencies affecting critical infrastructure VMs (vCenter, NSX Manager, Edge Nodes) that may require reboot or fsck repair
  2. NSX Manager potentially entering read-only mode due to storage issues
  3. vCenter WCP service failure that can prevent normal maintenance operations
  4. Inability to properly remediate NSX configuration due to corrupted state
  5. Critical dependency: Management, vSAN, and vMotion vmkernel interfaces are configured on NSX switch - when NSX fails or is removed, all cluster connectivity is lost
  6. NSX cannot be reinstalled until underlying vSAN issues are resolved, creating a complex recovery scenario

Resolution

Step 1: Assess Current State and Identify Active Issues

  1. Document vSAN health status and identify affected hosts
  2. List all VMs showing storage inconsistencies (if any)
  3. Verify console access (iLO/DRAC) is available for all affected hosts
  4. Critical: Confirm that the management, vSAN, and vMotion vmkernel interfaces are configured on the NSX switch (see the check after this list) - if they are, all three will be lost when NSX is removed
  5. Identify which of the following issues are present in your environment:
    • VMs with storage inconsistencies or read-only filesystems
    • vCenter WCP service failures
    • NSX Manager in read-only mode
    • Hosts unable to enter maintenance mode
    • Stale VDS configurations
    • Lost network connectivity
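
A minimal command-line sketch for this assessment, run from each ESXi host (assuming SSH or console access; output fields vary by version):

    esxcli vsan cluster get           # cluster membership, Sub-Cluster Member Count, and local node state
    esxcli network ip interface list  # shows which portset/switch each vmkernel interface (vmk0, vmk10, ...) is attached to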

Step 2: Address VM Storage Inconsistencies (If Present)

If vSAN quorum issues have caused VM storage inconsistencies:

For NSX Manager in read-only mode:

  • Refer to Manager node disk/partition mounted as read-only requiring fsck repair (KB 330478)
  • Reboot the NSX Manager node and apply fsck repair if the filesystem remains read-only

For vCenter Server storage issues:

  • Reboot the vCenter Server Appliance
  • Apply fsck repair if storage inconsistency persists after reboot
  • Verify all vCenter services start correctly after the reboot

For Edge Node VMs:

  • Reboot affected Edge Node VMs
  • Apply fsck repair if storage inconsistency persists after reboot
  • Verify storage consistency after restart
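
Where console access to an affected appliance (NSX Manager, vCenter Server Appliance, or Edge Node) is available, a read-only filesystem can be confirmed before rebooting. The following is only a sketch using generic Linux checks, not an appliance-specific procedure:

    grep " ro," /proc/mounts                         # lists any filesystems currently mounted read-only
    touch /var/log/rw-test && rm /var/log/rw-test    # fails with "Read-only file system" if the log partition is affected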

Step 3: Work Around vCenter Maintenance Mode Issues (If Present)

If WCP service failure prevents normal maintenance mode operations:

  1. Use ESXi CLI to place hosts in maintenance mode:
    esxcli system maintenanceMode set --enable true
  2. Document this workaround usage for each affected host
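
On a vSAN host, the data-migration mode can also be specified when entering maintenance mode from the CLI, and the result verified afterwards. A sketch, assuming no full data evacuation is required:

    esxcli system maintenanceMode set --enable true --vsanmode ensureObjectAccessibility
    esxcli system maintenanceMode get    # should report "Enabled"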

Step 4: Recover Management Connectivity (Required before NSX reinstallation)

Important: Because management, vSAN, and vMotion vmkernel interfaces are configured on the NSX switch, removing NSX causes complete loss of cluster connectivity. Management must be restored using a standard switch before NSX can be reinstalled.

Refer to Configuring Standard vSwitch (vSS) or virtual Distributed Switch (vDS) from the command line in ESXi for detailed command line network configuration.

  1. Access host via console (iLO/DRAC) - network access will be unavailable
  2. Create temporary standard vSwitch configuration:
    esxcli network vswitch standard add --vswitch-name=vSwitch0
    esxcli network vswitch standard portgroup add --portgroup-name=Management --vswitch-name=vSwitch0
    esxcli network vswitch standard portgroup set --portgroup-name=Management --vlan-id=<management_vlan>
  3. Recreate management vmkernel interface:
    esxcli network ip interface add --interface-name=vmk1 --portgroup-name=Management
    esxcli network ip interface ipv4 set --interface-name=vmk1 --ipv4=<IP> --netmask=<mask> --type=static
  4. Recreate vMotion and vSAN vmkernel interfaces on the standard switch (traffic tagging and verification are covered in the sketch after this list):
    esxcli network vswitch standard portgroup add --portgroup-name=vMotion --vswitch-name=vSwitch0
    esxcli network vswitch standard portgroup add --portgroup-name=vSAN --vswitch-name=vSwitch0
    esxcli network ip interface add --interface-name=vmk2 --portgroup-name=vMotion
    esxcli network ip interface add --interface-name=vmk3 --portgroup-name=vSAN
  5. Add physical uplinks:
    esxcli network vswitch standard uplink add --uplink-name=vmnic0 --vswitch-name=vSwitch0
    esxcli network vswitch standard uplink add --uplink-name=vmnic1 --vswitch-name=vSwitch0
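
After the interfaces are recreated, they still need to be tagged for their traffic types and connectivity should be verified. A sketch, with placeholder peer addresses:

    vim-cmd hostsvc/vmotion/vnic_set vmk2    # enable vMotion on vmk2
    esxcli vsan network ip add -i vmk3       # tag vmk3 for vSAN traffic
    # Assign IPv4 addresses to vmk2 and vmk3 with "esxcli network ip interface ipv4 set" as in step 3, then test reachability:
    vmkping -I vmk1 <peer_management_ip>
    vmkping -I vmk3 <peer_vsan_ip>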

Step 5: Clean Stale VDS Configuration

Refer to Adding a host to vDS Distributed Switch fails with error: Create DVSwitch failed for stale VDS removal.

  1. Identify stale VDS entries preventing reconfiguration
  2. Contact Broadcom Support for force removal procedure if needed
  3. After removal, reconfigure VDS properly
  4. Migrate interfaces from temporary standard switch back to VDS
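
Stale host-side proxy switch entries can usually be identified from the ESXi host itself before engaging support. A sketch, for identification only (do not force-remove entries without guidance):

    esxcli network vswitch dvs vmware list   # DVS/N-VDS proxy switches known to this host
    net-dvs -l                               # low-level dump of dvsdata, including stale switch and port entries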

Step 6: Properly Remove and Reinstall NSX

Critical: NSX cannot be successfully reinstalled until underlying vSAN issues are resolved. Attempting NSX reinstallation with active vSAN problems will result in continued failures.

Refer to Deleting NSX VIBs from an ESXi host using "del nsx" fails for NSX VIB removal issues.

  1. Prerequisite: Verify vSAN health issues have been addressed
  2. With host in maintenance mode (via CLI if necessary), remove NSX:
    • Attempt removal via NSX Manager UI first
    • If VTEP conflict persists, manually remove vmkernel interfaces:
      esxcli network ip interface remove --interface-name=vmk10
    • Remove NSX ports from DVS:
      net-dvs -D -p <port_uuid> <switch_name>
    • Use the del nsx command if necessary
  3. Reinstall NSX through NSX Manager (only after vSAN issues are resolved)
  4. Migrate network configuration back to NSX switch from temporary standard switch:
    • Management (vmk1)
    • vMotion (vmk2)
    • vSAN (vmk3)
    • Physical adapters
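
Before triggering the reinstall, it can help to confirm on each host that the NSX components are actually gone. A sketch of possible checks:

    esxcli software vib list | grep -i nsx   # should return nothing once the NSX VIBs are removed
    esxcli network ip interface list         # vmk10 and any other NSX-created vmkernel interfaces should no longer appear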

Step 7: Validate vSAN Health and Prepare for NSX Reinstallation

  1. Check vSAN health and resync status as described in Using Esxcli Commands with vSAN:
    esxcli vsan debug resync list
    esxcli vsan health cluster list
  2. Verify no objects are pending resync before proceeding
  3. Verify NSX VTEP connectivity (the TEP interface vmk10 is created on the vxlan TCP/IP stack, so specify the netstack):
    vmkping ++netstack=vxlan -I vmk10 <other_host_vtep_ip>
  4. Confirm all VMs are accessible and storage is consistent
  5. If issues persist: Engage vSAN team for root cause resolution:
    • Provide details of storage partition sync failures
    • Document which hosts show duplicate entries
    • Include all remediation steps taken
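
Object-level health can also be summarized from any host before declaring vSAN ready. A sketch; the summary should show no inaccessible or reduced-availability objects:

    esxcli vsan debug object health summary get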

Step 8: Complete NSX Host Reinstallation

Prerequisite: vSAN health must be fully restored and all VMs must be functional before proceeding with NSX host reinstallation.

  1. Verify vSAN cluster health shows no errors and quorum is restored
  2. Critical: Confirm vSAN is in full quorum and can tolerate removing one ESXi host without causing another failure:
    • Check current fault tolerance level
    • Verify cluster can maintain quorum if one host enters maintenance mode
    • Ensure no objects would become inaccessible if a host is removed
  3. Confirm all infrastructure VMs (vCenter, NSX Manager, Edge Nodes) are operational
  4. Once vSAN resilience is confirmed, complete NSX host reinstallation process
  5. Monitor for any recurring storage issues during reinstallation
  6. Validate NSX installation completion across all affected hosts
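
As a final gate before the reinstallation, the earlier checks can be re-run from any remaining cluster member to confirm membership and that nothing is still resyncing (a sketch; exact counts depend on the environment):

    esxcli vsan cluster get                # Sub-Cluster Member Count should equal the number of hosts in the cluster
    esxcli vsan debug resync summary get   # bytes and objects left to resync should be zero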

If the error persists after following these steps, contact Broadcom Support for further assistance.

Please provide the following information when opening a support request with Broadcom for this issue:

  • vSAN health reports and storage partition status
  • List of affected VMs with corruption symptoms
  • NSX Manager and vCenter Server logs showing storage errors
  • ESXi host logs from affected nodes
  • Complete timeline of issues and remediation attempts
  • Screenshots of vSAN health alerts