vCenter Server 8.0 U2 shows instability in SPS and vsan-health services with a large number of vSAN-enabled clusters

Article ID: 344897

Products

VMware vCenter Server

Issue/Introduction

Symptoms:

  • vSAN Skyline Health does not load, or loads only intermittently, within the vSphere Client.
  • VM provisioning operations may fail.
  • Messages within /var/log/vmware/vmware-sps/sps.log indicate thread pool exhaustion and very high wait times in the queue.
2023-11-11T11:11:11.721Z [pool-3-thread-12] INFO  opId= com.vmware.vim.storage.common.task.CustomThreadPoolExecutor - [VLSI-client] Request took 1199610 millis to execute. | Slow run() method execution Alert
2023-11-11T11:11:11.722Z [pool-3-thread-12] INFO  opId= com.vmware.vim.storage.common.task.CustomThreadPoolExecutor - [VLSI-client] Active thread count is: 20, Core Pool size is: 20, Queue size: 163, Time spent waiting in queue: 2311468 millis | ThreadPool Starvation AND Queue wait time Alert
  • There may also be messages in sps.log indicating that connections to vsanHealth time out during listVStorageObjectsForSpec calls.
2023-11-03T11:11:11.797Z [pool-17-thread-7] ERROR opId=WorkQueue-75b6bd30-e32 com.vmware.vim.vmomi.server.impl.SoapBindingImpl - Method 'listVStorageObjectsForSpec' completed with undeclared fault of type 'com.vmware.vim.vmomi.client.exception.ConnectionException'
com.vmware.vim.vmomi.client.exception.ConnectionException: http://localhost:1080/vsanHealth invocation failed with "java.net.SocketTimeoutException: Read timed out"
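A quick way to check a log for these symptoms is sketched below. It assumes the message format matches the samples quoted above; the function names sps_starvation_summary and sps_vsanhealth_timeouts are hypothetical helpers, not part of any VMware tooling.

```shell
#!/bin/bash
# Sketch: summarize the thread pool starvation alerts found in an sps.log file.
# Assumes each alert line contains "Queue size: N," and
# "Time spent waiting in queue: N millis", as in the samples above.
sps_starvation_summary() {
    local log_file="$1"
    grep 'ThreadPool Starvation' "$log_file" | awk '
        {
            queue = ""; wait_ms = ""
            for (i = 1; i <= NF; i++) {
                # "Queue size: 163," -> the value follows the "size:" token
                if ($i == "size:" && $(i-1) == "Queue") { queue = $(i+1) }
                # "waiting in queue: 2311468" -> the value follows "queue:"
                if ($i == "queue:") { wait_ms = $(i+1) }
            }
            gsub(",", "", queue)          # drop the trailing comma
            count++
            if (queue + 0 > max_queue)  max_queue = queue + 0
            if (wait_ms + 0 > max_wait) max_wait = wait_ms + 0
        }
        END {
            printf "alerts=%d max_queue=%d max_wait_ms=%d\n", count, max_queue, max_wait
        }'
}

# Sketch: count the vsanHealth connection failures recorded in the same log.
sps_vsanhealth_timeouts() {
    local log_file="$1"
    grep -c 'vsanHealth invocation failed' "$log_file"
}

# Example invocation on the appliance:
#   sps_starvation_summary /var/log/vmware/vmware-sps/sps.log
#   sps_vsanhealth_timeouts /var/log/vmware/vmware-sps/sps.log
```

For a log containing only the single starvation line quoted above, the first function prints alerts=1 max_queue=163 max_wait_ms=2311468. A steadily growing alert count or maximum wait time suggests the thread pool is not recovering on its own.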



Environment

VMware vCenter Server 8.0.2

Cause

When vsanvcmgmtd receives a large number of API requests, a circular dependency between vsanvcmgmtd and sps multiplies them unnecessarily. Each vSAN cluster in a vSphere environment adds to the number of these calls, and past a certain threshold the calls overwhelm the services' thread pools, degrading any related components.

Resolution

VMware is aware of this issue and working towards a fix in a future release.

Workaround:

To work around this issue, increase the maxThreads and throttleFixed values for vsanvcmgmtd.

  1. Back up the VsanVcMgmtConfig.xml file:
cp /usr/lib/vmware-vsan/VsanVcMgmtConfig.xml ~/VsanVcMgmtConfig.bak
  2. Open the file for editing:
vi /usr/lib/vmware-vsan/VsanVcMgmtConfig.xml
  3. Within the <threadPool> section under <vmacore>, add a <maxThreads> option and set it to 500:
<config>
   <vmacore>
      <threadPool>
         <maxThreads>500</maxThreads>   <!-- add this line -->
      </threadPool>
   </vmacore>
</config>
  4. Within the <adapterServer> section, add a <throttleFixed> option and set it to 300:
<config>
   <adapterServer>
      <throttleFixed>300</throttleFixed>  <!-- add this line -->
   </adapterServer>
</config>
  5. To save, press ESC, type :wq! and press ENTER.
  6. Restart the vsan-health service:
vmon-cli -r vsan-health
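After editing, it can be worth confirming that both values actually landed in the file. The sketch below uses a plain text match rather than an XML parse, so it does not verify that each element sits under the correct parent section. The function name check_vsan_workaround is a hypothetical helper.

```shell
#!/bin/bash
# Sketch: report whether a VsanVcMgmtConfig.xml file contains the two
# workaround settings. Returns 0 if both are present, 1 otherwise.
check_vsan_workaround() {
    local cfg="$1"
    local status=0

    # Allow optional whitespace around the value inside the element.
    if grep -Eq '<maxThreads>[[:space:]]*500[[:space:]]*</maxThreads>' "$cfg"; then
        echo "maxThreads: ok"
    else
        echo "maxThreads: missing"
        status=1
    fi

    if grep -Eq '<throttleFixed>[[:space:]]*300[[:space:]]*</throttleFixed>' "$cfg"; then
        echo "throttleFixed: ok"
    else
        echo "throttleFixed: missing"
        status=1
    fi

    return "$status"
}

# Example invocation on the appliance:
#   check_vsan_workaround /usr/lib/vmware-vsan/VsanVcMgmtConfig.xml
#   diff ~/VsanVcMgmtConfig.bak /usr/lib/vmware-vsan/VsanVcMgmtConfig.xml
```

Comparing the edited file against the backup with diff shows exactly which lines changed, which helps catch an accidental edit elsewhere in the file.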



Additional Information

Increasing the number of vCPUs assigned to the vCenter Server Appliance has also been observed to help with this issue, because the associated thread pools are sized dynamically based on the number of CPUs available to vCenter Server. This only helps up to a point, however, beyond which the workaround above must still be applied. Anecdotally, that point lies somewhere between 200 and 300 vSAN-enabled clusters.