Host in a vSAN environment is not responding

Article ID: 392012

Products

VMware vSAN

Issue/Introduction

When an ESXi host in a vSAN cluster enters the Not Responding state in vCenter, it can disrupt cluster operations and affect VM availability.

Environment

VMware vSAN 7.0.x

VMware vSAN 8.0.x

Cause

Possible causes of a host entering the Not Responding state:

  • Network Issues
- vSAN or management network connectivity loss (e.g., misconfigured switches, VLANs, or firewalls).
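- Incorrect VMkernel adapter settings for vSAN traffic (for example, wrong IP address, VLAN, or MTU), or vSAN ports blocked between hosts (UDP 12321 by default for cluster membership).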
  • Host Hardware Failures
- Power supply, NIC, or storage controller failures.
- Host crash due to hardware issues (check via iLO, iDRAC, XClarity, or another hardware management console).
  • Storage Problems
- Disk failures or degraded disk groups.
- A full ESXi system partition (check with `df -h`).
- The 'root' ramdisk is full (check with `vdf -h`).
  • Service Failures
- Management agent crashes (e.g., `hostd`, `vpxa`).
- vSAN service hangs or crashes.
  • Cluster Membership Issues
- Host accidentally removed from the cluster.
- Split-brain scenarios or network partitions.
  • Time Synchronization
- NTP misconfiguration causing clock skew between hosts and vCenter.

Resolution

 

Please note: If a host is found in the Not Responding state, open a case with Broadcom Technical Support for further investigation. The checks below can help with the initial investigation.

 

  • Verify Host Connectivity
 
- Ping the Host: Check if the host responds to ping requests.
 
- Check Management Network: Ensure the management interface is reachable via vCenter or SSH.
 
- Test vSAN Network: Confirm vSAN VMkernel adapters are online and reachable between hosts.
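The vSAN network test can be scripted from the ESXi Shell. This is a minimal sketch, not a Broadcom tool: `check_vsan_peers` is a hypothetical helper, and the VMkernel adapter name (vmk1 in the usage line) and the peer addresses are assumptions you replace with your own. It uses `vmkping` with the don't-fragment bit so that an MTU mismatch shows up as a failure: a 1472-byte payload fits a 1500-byte MTU, and 8972 bytes fits a 9000-byte MTU.

```shell
# Hypothetical helper: ping each vSAN peer from one VMkernel adapter with
# the don't-fragment bit set (-d) and a fixed payload size (-s).
# Payload = MTU minus 28 bytes of IP/ICMP headers (1472 for 1500, 8972 for 9000).
check_vsan_peers() {
    vmk="$1"; size="$2"; shift 2
    failed=0
    for peer in "$@"; do
        if vmkping -I "$vmk" -d -s "$size" -c 3 "$peer" >/dev/null 2>&1; then
            echo "OK $peer"
        else
            echo "FAIL $peer"
            failed=1
        fi
    done
    # Non-zero if any peer was unreachable at this payload size
    return "$failed"
}
```

Usage (placeholder addresses): `check_vsan_peers vmk1 1472 192.168.50.12 192.168.50.13`. If the small payload succeeds but the jumbo size fails, suspect an MTU mismatch on the path.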
 
  • Check vSAN Health Status
 
- In vCenter, select the cluster and navigate to Monitor > vSAN > Skyline Health to review cluster-wide alerts.
 
- Check for host-disconnected or network-partition warnings.
 
  • Review Physical Hardware
 
- Use out-of-band management tools (iLO/iDRAC) to verify power, fans, and NIC status.
 
- Ensure storage devices (disks, HBAs) are detected and healthy.
 
  • Restart Management Services
 
- Access the host through the ESXi Shell (console) or SSH and restart the management agents (vpxa, hostd) and the vSAN management daemon (vsanmgmtd):
 
 
/etc/init.d/vpxa restart && /etc/init.d/hostd restart && /etc/init.d/vsanmgmtd restart  
 
 
- If SSH is unavailable, use the Direct Console User Interface (DCUI) and select Troubleshooting Options > Restart Management Agents.
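The `&&` chain above stops at the first service that fails to restart. If you would rather attempt every restart and see each result, a loop is one alternative. This is a minimal sketch: `restart_mgmt_agents` is a hypothetical name, and the `INIT_DIR` variable (defaulting to /etc/init.d) is a hypothetical override added only so the function can be exercised away from ESXi.

```shell
# Hypothetical helper: restart each agent in the same order as the
# command above, continuing past failures and reporting each result.
restart_mgmt_agents() {
    dir="${INIT_DIR:-/etc/init.d}"   # INIT_DIR is a hypothetical override
    rc=0
    for svc in vpxa hostd vsanmgmtd; do
        if "$dir/$svc" restart >/dev/null 2>&1; then
            echo "restarted $svc"
        else
            echo "failed to restart $svc"
            rc=1
        fi
    done
    # Non-zero if any restart script reported failure
    return "$rc"
}
```

Continuing after a failure is a design choice: it gives a complete picture in one run, but if hostd fails to restart, the later restarts are unlikely to recover the host on their own.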
 
  • Check Storage Capacity
 
- Use `df -h` via SSH to ensure the ESXi system partitions are not full, and `vdf -h` to check ramdisk usage (for example, the 'root' ramdisk).
 
- Resolve space issues by removing logs or unnecessary files.
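The capacity checks can be filtered so that only problem entries are shown. This minimal sketch reads `df -h` or `vdf -h` output on stdin; the function names are hypothetical, and the column positions (Use% in column 5, mount point in column 6) and the ramdisk name 'root' follow typical output and are assumptions to verify on your build.

```shell
# Hypothetical helper: print "mount-point use%" for every filesystem in
# `df -h` output (stdin) whose Use% is at or above the given threshold.
# Assumes Use% is column 5 and the mount point is column 6.
check_df_usage() {
    awk -v limit="$1" 'NR > 1 && $5 ~ /%$/ {
        pct = $5; sub(/%$/, "", pct)
        if (pct + 0 >= limit + 0) print $6, $5
    }'
}

# Hypothetical helper: print the Use% of the ramdisk named "root" from
# `vdf -h` output (stdin). Assumes name in column 1 and Use% in column 5.
ramdisk_root_usage() {
    awk '$1 == "root" && $5 ~ /%$/ { print $5 }'
}
```

Usage: `df -h | check_df_usage 90` and `vdf -h | ramdisk_root_usage`. The 90% threshold is only an example.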
 
  • Validate Network Configuration
 
- Ensure vSAN VMkernel adapters are on the correct subnet and VLAN.
 
- Verify physical switch port statistics for errors/discards.
 
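- Confirm firewall rules allow vSAN traffic (UDP 12321 for cluster membership and TCP 2233 for vSAN data transport; UDP 12345 and 23451 are used only in legacy multicast mode).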
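To confirm that two vSAN VMkernel addresses are on the same subnet, compare them under the prefix mask. This is a pure POSIX shell sketch with hypothetical helper names; it does not query ESXi, so you supply the addresses and prefix length from your own configuration (for example, from `esxcli network ip interface ipv4 get` on each host).

```shell
# Hypothetical helper: convert a dotted IPv4 address to a 32-bit integer.
ip_to_int() {
    old_ifs=$IFS; IFS=.
    set -- $1            # split "a.b.c.d" into four positional fields
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Hypothetical helper: succeed if both addresses share the network
# defined by the prefix length (e.g. 24 for 255.255.255.0).
same_subnet() {
    a=$(ip_to_int "$1"); b=$(ip_to_int "$2"); prefix=$3
    mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
    [ $(( a & mask )) -eq $(( b & mask )) ]
}
```

Usage (placeholder addresses): `same_subnet 192.168.50.11 192.168.50.12 24 && echo "same subnet"`.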
 
  • Review Logs
 
- Host Logs (via SSH):
 
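/var/log/hostd.log # Host management service (hostd)
/var/log/vpxa.log # vCenter agent (vpxa)
/var/log/vmkernel.log # VMkernel, including storage and network events
/var/log/vsanmgmt.log # vSAN management daemon (vsanmgmtd)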
 
 
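- vCenter Logs: Check for cluster reconfiguration events (for example, in /var/log/vmware/vpxd/vpxd.log and /var/log/vmware/vsan-health/vmware-vsan-health-service.log on the vCenter Server Appliance).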
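To pull the most recent matching entries from several logs in one pass, a small `grep` wrapper is enough. This is a minimal sketch: `scan_logs` is a hypothetical helper, not an ESXi tool, and the pattern in the usage line is only an example. It uses `grep -iE`, so the pattern is a case-insensitive extended regular expression.

```shell
# Hypothetical helper: for each log file, print a header and the last
# N lines matching a case-insensitive extended regular expression.
scan_logs() {
    pattern="$1"; lines="$2"; shift 2
    for log in "$@"; do
        [ -r "$log" ] || { echo "== $log: not readable"; continue; }
        echo "== $log"
        grep -iE -- "$pattern" "$log" | tail -n "$lines"
    done
}
```

Usage: `scan_logs 'error|warning|lost' 20 /var/log/hostd.log /var/log/vmkernel.log`.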
 
  • Check Cluster Membership
 
- Ensure the host is still part of the vSAN cluster in vCenter.
 
- Look for recent cluster reconfiguration or partition events.
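On the host itself, `esxcli vsan cluster get` reports vSAN cluster membership as seen by that node. The sketch below compares its member count with the number of hosts you expect; a lower count suggests the host is partitioned or has dropped out of the cluster. `check_member_count` is a hypothetical name, and the field label "Sub-Cluster Member Count" is taken from typical output and should be verified on your build.

```shell
# Hypothetical helper: read `esxcli vsan cluster get` output on stdin and
# compare "Sub-Cluster Member Count" with the expected number of hosts.
# Returns 0 on match, 1 on mismatch, 2 if the field was not found.
check_member_count() {
    expected="$1"
    count=$(awk -F': *' '/Sub-Cluster Member Count/ {
        gsub(/[^0-9]/, "", $2); print $2; exit
    }')
    if [ -z "$count" ]; then
        echo "member count not found"; return 2
    fi
    if [ "$count" -eq "$expected" ]; then
        echo "OK: $count of $expected members"
    else
        echo "MISMATCH: $count of $expected members"; return 1
    fi
}
```

Usage: `esxcli vsan cluster get | check_member_count 4`, run on each host in a four-host cluster.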
 
  • Advanced Checks
 
- vSAN performance charts: Use the vSAN performance charts (Monitor > vSAN > Performance) to analyze network latency and throughput.
 
 
 

Additional Information

Please note: Contact Broadcom Technical Support for assistance before applying the tests below.
 
 
 
Validate Time Synchronization
 
- Confirm NTP is running:
 
/etc/init.d/ntpd status
 
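- Check NTP peer status and, if needed, force a one-time sync:
 
ntpq -p # List NTP peers; '*' marks the peer currently in use
 
/etc/init.d/ntpd stop # ntpd must be stopped before a one-shot sync
ntpd -gq # Set the clock once (large offsets allowed), then exit
/etc/init.d/ntpd start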
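When reading `ntpq -p` output, a leading `*` in the peer list marks the peer that ntpd is currently synchronized to; if no line starts with `*`, the host is not synchronized. This minimal sketch reads `ntpq -p` output on stdin and reports that peer with its offset (ntpq shows offset in milliseconds), or fails if none is selected. `ntp_sync_peer` is a hypothetical name.

```shell
# Hypothetical helper: find the system peer ('*' tally code) in `ntpq -p`
# output and print it with its offset (column 9); exit 1 if none found.
ntp_sync_peer() {
    awk '/^\*/ { sub(/^\*/, "", $1); print "synced to", $1, "offset", $9, "ms"; found = 1 }
         END { if (!found) { print "no system peer selected"; exit 1 } }'
}
```

Usage: `ntpq -p | ntp_sync_peer`.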
 
 
 
Please see: https://knowledge.broadcom.com/external/article/344682/troubleshooting-an-esxi-host-in-a-not-re.html