vCenter Server 8.0 U2 shows instability in SPS and vsan-health services with a large number of vSAN-enabled clusters

Article ID: 344897

Updated On: 11-27-2024

Products

VMware vCenter Server

Issue/Introduction

Symptoms:

vSAN Skyline Health does not load, or loads only intermittently, in the vSphere Client.
VM provisioning operations may fail.
Messages within /var/log/vmware/vmware-sps/sps.log indicate thread pool exhaustion and very high wait times in the queue.

2023-11-11T11:11:11.721Z [pool-3-thread-12] INFO opId= com.vmware.vim.storage.common.task.CustomThreadPoolExecutor - [VLSI-client] Request took 1199610 millis to execute. | Slow run() method execution Alert
2023-11-11T11:11:11.722Z [pool-3-thread-12] INFO opId= com.vmware.vim.storage.common.task.CustomThreadPoolExecutor - [VLSI-client] Active thread count is: 20, Core Pool size is: 20, Queue size: 163, Time spent waiting in queue: 2311468 millis | ThreadPool Starvation AND Queue wait time Alert
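To gauge how severe the starvation is, the alert lines above can be summarized from the shell. The following is a minimal sketch, not an official diagnostic: the function name max_queue_wait_ms is hypothetical, and it assumes the alert text matches the sample lines shown above.

```shell
# Hypothetical helper: print the largest "Time spent waiting in queue"
# value (in milliseconds) among ThreadPool Starvation alerts.
# Usage: max_queue_wait_ms [path-to-sps.log]
max_queue_wait_ms() {
    log="${1:-/var/log/vmware/vmware-sps/sps.log}"
    grep 'ThreadPool Starvation' "$log" |
        sed -n 's/.*Time spent waiting in queue: \([0-9][0-9]*\) millis.*/\1/p' |
        sort -n |
        tail -n 1
}
```

Calling max_queue_wait_ms with no argument reads the default sps.log path. For the sample line above it would report 2311468, which is roughly 38.5 minutes of queue wait.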

sps.log may also contain messages indicating that connections to vsanHealth time out during listVStorageObjectsForSpec calls:

2023-11-03T11:11:11.797Z [pool-17-thread-7] ERROR opId=WorkQueue-75b6bd30-e32 com.vmware.vim.vmomi.server.impl.SoapBindingImpl - Method 'listVStorageObjectsForSpec' completed with undeclared fault of type 'com.vmware.vim.vmomi.client.exception.ConnectionException'
com.vmware.vim.vmomi.client.exception.ConnectionException: http://localhost:1080/vsanHealth invocation failed with "java.net.SocketTimeoutException: Read timed out"
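A quick way to see how often these timeouts occur is to count the matching exception lines. This is a sketch under the assumption that the message text matches the sample above; the function name count_vsanhealth_timeouts is hypothetical.

```shell
# Hypothetical helper: count read timeouts against the vsanHealth endpoint.
# Usage: count_vsanhealth_timeouts [path-to-sps.log]
count_vsanhealth_timeouts() {
    log="${1:-/var/log/vmware/vmware-sps/sps.log}"
    # -F matches the text literally; "|| true" keeps a zero count from
    # being treated as a failure, since grep exits 1 when nothing matches.
    grep -cF 'vsanHealth invocation failed with "java.net.SocketTimeoutException' "$log" || true
}
```

A count that keeps rising between runs suggests that vsan-health is still not answering sps within the read timeout.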

Environment

VMware vCenter Server 8.0.2

Cause

When vsanvcmgmtd receives a large number of API requests, a circular dependency between vsanvcmgmtd and sps multiplies them unnecessarily. Each vSAN cluster in a vSphere environment adds to the number of these calls, and past a certain threshold the calls overwhelm the services' thread pools, degrading any related components.

Resolution

VMware is aware of this issue and working towards a fix in a future release.

Workaround:

To work around this issue, increase the maxThreads and throttleFixed values for vsanvcmgmtd.

Back up the VsanVcMgmtConfig.xml file

cp /usr/lib/vmware-vsan/VsanVcMgmtConfig.xml ~/VsanVcMgmtConfig.bak

Open the file for editing

vi /usr/lib/vmware-vsan/VsanVcMgmtConfig.xml

Within the <threadPool> section under <vmacore>, add a new <maxThreads> option and set it to 500

<config>
   <vmacore>
      <threadPool>
         <maxThreads>500</maxThreads>   <!-- add this line -->
      </threadPool>
   </vmacore>
</config>

Add another option within <adapterServer> for <throttleFixed> and set it to 300
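By analogy with the maxThreads sample, the throttle setting might look like the fragment below. This is only a sketch: the article does not show where <adapterServer> sits in the file, so locate the existing <adapterServer> element and add the line inside it rather than pasting a second element.

```xml
<adapterServer>
   <throttleFixed>300</throttleFixed>   <!-- add this line -->
</adapterServer>
```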

To save, press ESC, type :wq! and press ENTER
Restart vsan-health

vmon-cli -r vsan-health
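After the restart, it can be worth confirming that both values are actually present in the file. The following is a minimal sketch rather than an official check: the function name show_vsan_tuning is hypothetical, and it assumes each setting is written on a single line as in the sample above.

```shell
# Hypothetical helper: print the configured maxThreads and throttleFixed
# values, or "unset" when a tag is absent from the file.
# Usage: show_vsan_tuning [path-to-VsanVcMgmtConfig.xml]
show_vsan_tuning() {
    cfg="${1:-/usr/lib/vmware-vsan/VsanVcMgmtConfig.xml}"
    for tag in maxThreads throttleFixed; do
        # Extract the first numeric value enclosed in <tag>...</tag>.
        value=$(sed -n "s:.*<$tag>\([0-9][0-9]*\)</$tag>.*:\1:p" "$cfg" | head -n 1)
        echo "$tag=${value:-unset}"
    done
}
```

With the workaround applied, show_vsan_tuning should report maxThreads=500 and throttleFixed=300. If the change needs to be reverted, the backup taken earlier (~/VsanVcMgmtConfig.bak) can be copied back over the file before restarting vsan-health again.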

Additional Information

Increasing the number of vCPUs assigned to the vCenter Server Appliance can also help with this issue, because the associated thread pools are sized dynamically based on the number of CPUs the appliance has. This only helps up to a point, however, beyond which the workaround must still be applied. Anecdotally, that point is somewhere between 200 and 300 vSAN-enabled clusters.
