NSX Host Reinstallation Blocked by vSAN Health Issues Leading to Storage Inconsistencies

Article ID: 408586

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

NSX host reinstallation cannot proceed due to underlying vSAN health problems that result in storage inconsistencies across the cluster. Depending on vSAN configuration, if a host is lost and affects quorum, critical infrastructure VMs including vCenter Server, NSX Manager, and Edge Nodes may experience storage inconsistencies requiring reboot or fsck repair. The vSAN issues prevent normal cluster operations and block NSX reinstallation attempts.

The following symptoms occur:

  • NSX host reinstallation fails with host showing "partial success" state
  • vSAN reports storage partition sync failures between hosts - check resync status as described in Using Esxcli Commands with vSAN (example commands are shown after this list)
  • Depending on vSAN configuration, if a host loss affects quorum, multiple VMs may show storage inconsistencies requiring reboot or fsck repair:
    • vCenter Server becomes unresponsive or services fail
    • NSX Manager enters read-only mode
    • Edge Nodes experience storage failures
  • ESXi hosts cannot enter maintenance mode with error: "Failed to enter namespaces maintenance mode due to Error: system_error Messages: vapi.send.failed" (KB 406801)
  • vCenter WCP service failure prevents normal cluster operations
  • NSX configuration shows VTEP vmk10 interface conflict: "Host configuration: VTEP [vmk10,<IP address>] failed to be applied: The vnic vmk10 exists" (KB 322412)
  • Management connectivity lost when attempting NSX remediation (management vmkernel on NSX switch) - requires CLI recovery (KB 326175)
  • Stale VDS entries block network reconfiguration: "Create DVSwitch failed with the following error message: Unable to Create Proxy DVS ; Status(bad0005)= Already exists" (KB 307917)
  • Manager node disk/partition mounted as read-only requiring fsck repair (KB 330478)
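
To confirm the vSAN-side symptoms from the command line, the cluster health summary and resync state can be checked from any ESXi host in the cluster. This is a minimal sketch, assuming SSH or console access to a host; output fields vary by ESXi/vSAN version:

    esxcli vsan health cluster list        # summary of vSAN health check results
    esxcli vsan debug resync summary get   # objects and bytes still waiting to resync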

Environment

  • VMware vSphere ESXi
  • VMware vCenter Server
  • VMware NSX
  • VMware vSAN
  • Management, vSAN, and vMotion vmkernel interfaces configured on NSX switch

Cause

The root cause is a set of vSAN health issues that create storage inconsistencies across the cluster. Depending on the vSAN configuration, when a lost host affects quorum, the following issues may occur:

  1. Storage inconsistencies affecting critical infrastructure VMs (vCenter, NSX Manager, Edge Nodes) that may require reboot or fsck repair
  2. NSX Manager potentially entering read-only mode due to storage issues
  3. vCenter WCP service failure that can prevent normal maintenance operations
  4. Inability to properly remediate NSX configuration due to corrupted state
  5. Critical dependency: Management, vSAN, and vMotion vmkernel interfaces are configured on NSX switch - when NSX fails or is removed, all cluster connectivity is lost
  6. NSX cannot be reinstalled until underlying vSAN issues are resolved, creating a complex recovery scenario

Resolution

Step 1: Assess Current State and Identify Active Issues

  1. Document vSAN health status and identify affected hosts
  2. List all VMs showing storage inconsistencies (if any)
  3. Verify console access (iLO/DRAC) is available for all affected hosts
  4. Critical: Confirm that the management, vSAN, and vMotion vmkernel interfaces are configured on the NSX switch (see the check after this list) - if they are, all three will be lost when NSX is removed
  5. Identify which of the following issues are present in your environment:
    • VMs with storage inconsistencies or read-only filesystems
    • vCenter WCP service failures
    • NSX Manager in read-only mode
    • Hosts unable to enter maintenance mode
    • Stale VDS configurations
    • Lost network connectivity
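
A minimal command-line sketch for this assessment, run from each ESXi host (assuming SSH or console access; output fields vary by version):

    esxcli vsan cluster get           # cluster membership, Sub-Cluster Member Count, and local node state
    esxcli network ip interface list  # shows which portset/switch each vmkernel interface (vmk0, vmk10, ...) is attached to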

Step 2: Address VM Storage Inconsistencies (If Present)

If vSAN quorum issues have caused VM storage inconsistencies:

For NSX Manager in read-only mode:

  • Refer to Manager node disk/partition mounted as read-only requiring fsck repair (KB 330478)
  • Reboot the NSX Manager node and apply fsck repair if the filesystem remains read-only

For vCenter Server storage issues:

  • Reboot the vCenter Server Appliance
  • Apply fsck repair if storage inconsistency persists after reboot
  • Verify all vCenter services start correctly after the reboot

For Edge Node VMs:

  • Reboot affected Edge Node VMs
  • Apply fsck repair if storage inconsistency persists after reboot
  • Verify storage consistency after restart
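
Where console access to an affected appliance (NSX Manager, vCenter Server Appliance, or Edge Node) is available, a read-only filesystem can be confirmed before rebooting. The following is only a sketch using generic Linux checks, not an appliance-specific procedure:

    grep " ro," /proc/mounts                         # lists any filesystems currently mounted read-only
    touch /var/log/rw-test && rm /var/log/rw-test    # fails with "Read-only file system" if the log partition is affected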

Step 3: Work Around vCenter Maintenance Mode Issues (If Present)

If WCP service failure prevents normal maintenance mode operations:

  1. Use ESXi CLI to place hosts in maintenance mode:
    esxcli system maintenanceMode set --enable true
  2. Document this workaround usage for each affected host
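
On a vSAN host, the data-migration mode can also be specified when entering maintenance mode from the CLI, and the result verified afterwards. A sketch, assuming no full data evacuation is required:

    esxcli system maintenanceMode set --enable true --vsanmode ensureObjectAccessibility
    esxcli system maintenanceMode get    # should report "Enabled"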

Step 4: Recover Management Connectivity (Required before NSX reinstallation)

Important: Because management, vSAN, and vMotion vmkernel interfaces are configured on the NSX switch, removing NSX causes complete loss of cluster connectivity. Management must be restored using a standard switch before NSX can be reinstalled.

Refer to Configuring Standard vSwitch (vSS) or virtual Distributed Switch (vDS) from the command line in ESXi for detailed command line network configuration.

  1. Access host via console (iLO/DRAC) - network access will be unavailable
  2. Create temporary standard vSwitch configuration:
    esxcli network vswitch standard add --vswitch-name=vSwitch0
    esxcli network vswitch standard portgroup add --portgroup-name=Management --vswitch-name=vSwitch0
    esxcli network vswitch standard portgroup set --portgroup-name=Management --vlan-id=<management_vlan>
  3. Recreate management vmkernel interface:
    esxcli network ip interface add --interface-name=vmk1 --portgroup-name=Management
    esxcli network ip interface ipv4 set --interface-name=vmk1 --ipv4=<IP> --netmask=<mask> --type=static
  4. Recreate vMotion and vSAN vmkernel interfaces on the standard switch (traffic tagging and verification are covered in the sketch after this list):
    esxcli network vswitch standard portgroup add --portgroup-name=vMotion --vswitch-name=vSwitch0
    esxcli network vswitch standard portgroup add --portgroup-name=vSAN --vswitch-name=vSwitch0
    esxcli network ip interface add --interface-name=vmk2 --portgroup-name=vMotion
    esxcli network ip interface add --interface-name=vmk3 --portgroup-name=vSAN
  5. Add physical uplinks:
    esxcli network vswitch standard uplink add --uplink-name=vmnic0 --vswitch-name=vSwitch0
    esxcli network vswitch standard uplink add --uplink-name=vmnic1 --vswitch-name=vSwitch0
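
After the interfaces are recreated, they still need to be tagged for their traffic types and connectivity should be verified. A sketch, with placeholder peer addresses:

    vim-cmd hostsvc/vmotion/vnic_set vmk2    # enable vMotion on vmk2
    esxcli vsan network ip add -i vmk3       # tag vmk3 for vSAN traffic
    # Assign IPv4 addresses to vmk2 and vmk3 with "esxcli network ip interface ipv4 set" as in step 3, then test reachability:
    vmkping -I vmk1 <peer_management_ip>
    vmkping -I vmk3 <peer_vsan_ip>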

Step 5: Clean Stale VDS Configuration

Refer to Adding a host to vDS Distributed Switch fails with error: Create DVSwitch failed for stale VDS removal.

  1. Identify stale VDS entries preventing reconfiguration
  2. Contact Broadcom Support for force removal procedure if needed
  3. After removal, reconfigure VDS properly
  4. Migrate interfaces from temporary standard switch back to VDS
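
Stale host-side proxy switch entries can usually be identified from the ESXi host itself before engaging support. A sketch, for identification only (do not force-remove entries without guidance):

    esxcli network vswitch dvs vmware list   # DVS/N-VDS proxy switches known to this host
    net-dvs -l                               # low-level dump of dvsdata, including stale switch and port entries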

Step 6: Properly Remove and Reinstall NSX

Critical: NSX cannot be successfully reinstalled until underlying vSAN issues are resolved. Attempting NSX reinstallation with active vSAN problems will result in continued failures.

Refer to Deleting NSX VIBs from an ESXi host using "del nsx" fails for NSX VIB removal issues.

  1. Prerequisite: Verify vSAN health issues have been addressed
  2. With host in maintenance mode (via CLI if necessary), remove NSX:
    • Attempt removal via NSX Manager UI first
    • If VTEP conflict persists, manually remove vmkernel interfaces:
      esxcli network ip interface remove --interface-name=vmk10
    • Remove NSX ports from DVS:
      net-dvs -D -p <port_uuid> <switch_name>
    • Use the del nsx command if necessary
  3. Reinstall NSX through NSX Manager (only after vSAN issues are resolved)
  4. Migrate network configuration back to NSX switch from temporary standard switch:
    • Management (vmk1)
    • vMotion (vmk2)
    • vSAN (vmk3)
    • Physical adapters
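
Before triggering the reinstall, it can help to confirm on each host that the NSX components are actually gone. A sketch of possible checks:

    esxcli software vib list | grep -i nsx   # should return nothing once the NSX VIBs are removed
    esxcli network ip interface list         # vmk10 and any other NSX-created vmkernel interfaces should no longer appear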

Step 7: Validate vSAN Health and Prepare for NSX Reinstallation

  1. Check vSAN health and resync status as described in Using Esxcli Commands with vSAN:
    esxcli vsan debug resync list
    esxcli vsan health cluster list
  2. Verify no objects are pending resync before proceeding
  3. Verify NSX VTEP connectivity (the TEP interface vmk10 is created on the vxlan TCP/IP stack, so specify the netstack):
    vmkping ++netstack=vxlan -I vmk10 <other_host_vtep_ip>
  4. Confirm all VMs are accessible and storage is consistent
  5. If issues persist: Engage vSAN team for root cause resolution:
    • Provide details of storage partition sync failures
    • Document which hosts show duplicate entries
    • Include all remediation steps taken
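
Object-level health can also be summarized from any host before declaring vSAN ready. A sketch; the summary should show no inaccessible or reduced-availability objects:

    esxcli vsan debug object health summary get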

Step 8: Complete NSX Host Reinstallation

Prerequisite: vSAN health must be fully restored and all VMs must be functional before proceeding with NSX host reinstallation.

  1. Verify vSAN cluster health shows no errors and quorum is restored
  2. Critical: Confirm vSAN is in full quorum and can tolerate removing one ESXi host without causing another failure:
    • Check current fault tolerance level
    • Verify cluster can maintain quorum if one host enters maintenance mode
    • Ensure no objects would become inaccessible if a host is removed
  3. Confirm all infrastructure VMs (vCenter, NSX Manager, Edge Nodes) are operational
  4. Once vSAN resilience is confirmed, complete NSX host reinstallation process
  5. Monitor for any recurring storage issues during reinstallation
  6. Validate NSX installation completion across all affected hosts
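
As a final gate before the reinstallation, the earlier checks can be re-run from any remaining cluster member to confirm membership and that nothing is still resyncing (a sketch; exact counts depend on the environment):

    esxcli vsan cluster get                # Sub-Cluster Member Count should equal the number of hosts in the cluster
    esxcli vsan debug resync summary get   # bytes and objects left to resync should be zero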

If the error persists after following these steps, contact Broadcom Support for further assistance.

Please provide the following information when opening a support request with Broadcom for this issue:

  • vSAN health reports and storage partition status
  • List of affected VMs with corruption symptoms
  • NSX Manager and vCenter Server logs showing storage errors
  • ESXi host logs from affected nodes
  • Complete timeline of issues and remediation attempts
  • Screenshots of vSAN health alerts