vSAN Skyline Health Degraded or Slow Due to "Too Many Open Files" in vsanmgmtd

Article ID: 438502

Products

VMware vSAN

Issue/Introduction

Symptoms

  • vSAN Skyline Health reports "Hosts with connectivity issues" or "Stats primary election" errors.
  • The vSAN Health UI page is slow to load or fails to render data.
  • The issue temporarily resolves after restarting the vsanmgmtd service on the affected ESXi host.
  • Checking the open file handles held by vsanmgmtd reveals counts significantly higher than 100 (e.g., 180+).
  • ESXi host vpxa.log contains TimeoutException errors for local port 8089 (the vsanmgmtd loopback port): HTTP Connection has timed out while waiting for further requests; <TCP '[IP_ADDRESS] : 8089'>, N7Vmacore16TimeoutExceptionE
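
The vpxa.log symptom above can be checked with a short filter. This is a minimal sketch rather than an official diagnostic: the function name count_vsanmgmtd_timeouts is hypothetical, and the log path is passed in by the caller because the location of vpxa.log on a given host or in a support bundle is an assumption.

```shell
# Count vpxa.log lines that show a Vmacore TimeoutException involving port 8089
# (the vsanmgmtd loopback port named in this article).
# $1: path to a vpxa.log file supplied by the caller.
count_vsanmgmtd_timeouts() {
    # First keep only the timeout exceptions, then count those mentioning 8089.
    grep 'N7Vmacore16TimeoutExceptionE' "$1" | grep -c '8089'
}
```

For example, count_vsanmgmtd_timeouts /var/run/log/vpxa.log (path assumed) prints the number of matching lines. A count that keeps growing between runs suggests vpxa is still failing to reach vsanmgmtd.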

Environment

VMware vSAN 8.x 

Cause

This issue is caused by a burst of API requests, such as queryBatchPerformanceStatistics and queryAvailableMetric, hitting the ESXi host in a short period (dozens of requests per second). These requests are typically generated by external monitoring software, most commonly VMware Aria Operations, using an aggressive collection interval.

Because each request consumes a file descriptor, the high volume causes the vsanmgmtd daemon to hit its hard limit for open files. Once this limit is reached, the daemon can no longer accept local socket connections from vpxa (the vCenter agent), leading to service stalls and health check failures.
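
To gauge whether the request rate described above is present, the matching API calls in vpxa.log can be grouped by minute. The sketch below is illustrative only: the function name api_calls_per_minute is hypothetical, and it assumes every log line begins with an ISO 8601 timestamp (YYYY-MM-DDTHH:MM:SS...), so the first 16 characters identify the minute.

```shell
# Print "<count> <minute>" for each minute containing matching API calls.
# $1: path to a vpxa.log file supplied by the caller.
# Assumption: lines start with an ISO 8601 timestamp, e.g. 2024-01-01T12:34:56.789Z
api_calls_per_minute() {
    grep -Ei 'queryBatchPerformanceStatistics|queryAvailableMetric' "$1" \
        | cut -c1-16 \
        | sort \
        | uniq -c \
        | awk '{ print $1, $2 }'   # normalize the padding added by uniq -c
}
```

Minutes with counts in the hundreds or thousands would be consistent with dozens of requests per second. Low, steady counts point away from this cause.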

Resolution

To resolve this issue, identify the source of the aggressive API calls and adjust the monitoring collection frequency.

  1. Identify the Source Machine:

    • Search for API calls in the host vpxa.log: grep -Ei 'queryBatchPerformanceStatistics|queryAvailableMetric' vpxa.log
    • Match these calls to session IDs and IP addresses in the vpxd-profiler.log on vCenter to identify the monitoring server.
  2. Adjust Aria Operations Collection Interval:

    • Log in to the Aria Operations UI.
    • Navigate to Data Sources > Integrations > Accounts.
    • Select the vCenter Server account associated with the affected cluster.
    • Click Edit and expand Advanced Settings.
    • Check the Collection Interval. If it is set to 1 minute, increase it to 5 minutes (standard) or 10 minutes (conservative).
    • Verify that "Performance Metrics" collection is not duplicated across multiple monitoring policies within the vSAN Management Pack.
  3. Monitor File Handles:

    • Run the following command on the affected ESXi host to verify the handle count has stabilized: /bin/lsof | grep -v "MMAP" | grep vsanmgmtd | wc -l
    • The count should remain consistently below 100.
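
The lsof check in step 3 can be wrapped in two small helpers for repeated sampling. This is a sketch under stated assumptions: the function names count_vsanmgmtd_handles and check_handle_count are hypothetical, the filter is the same one used in the command above, and the default threshold of 100 is taken from this article rather than from any documented daemon limit.

```shell
# Count vsanmgmtd rows in lsof output read from stdin, skipping MMAP rows.
count_vsanmgmtd_handles() {
    grep -v "MMAP" | grep vsanmgmtd | wc -l | tr -d ' '
}

# Compare a count against a threshold.
# $1: observed count; $2: threshold (defaults to 100, the figure used in this article).
check_handle_count() {
    count="$1"
    threshold="${2:-100}"
    if [ "$count" -lt "$threshold" ]; then
        echo "OK: $count open handles (below $threshold)"
    else
        echo "WARNING: $count open handles (at or above $threshold)"
        return 1
    fi
}
```

On the host, check_handle_count "$(/bin/lsof | count_vsanmgmtd_handles)" takes one sample. Running it at intervals after changing the collection interval shows whether the count stays below the threshold over time.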