vCenter Server vpxd service crashes and a service restart is required to restore production

Products

VMware vCenter Server

Issue/Introduction

vCenter Server services may crash at random and become unavailable. A reboot or full service restart restores production for some time before the issue recurs.
When the services crash, /var/log/vmware/vpxd/vpxd.log may contain entries similar to the following:

####-##-##T##:##:##.###+##:## error vpxd[#####] [Originator@6876 sub=Memory checker] Current value 11726448 exceeds hard limit 11682816. Shutting down process.
####-##-##T##:##:##.###+##:## panic vpxd[#####] [Originator@6876 sub=Default]
-->
--> Panic: Memory exceeds hard limit. Panic
--> Backtrace:
--> [backtrace begin] product: VMware VirtualCenter, version: 8.0.3, build: build-24305161, tag: vpxd, cpu: x86_64, os: linux, buildType: release
--> backtrace[00] libvmacore.so[0x00531DC5]
--> backtrace[01] libvmacore.so[0x0042182A]: Vmacore::System::Stacktrace::CaptureFullWork(unsigned int)
--> backtrace[02] libvmacore.so[0x00434009]: Vmacore::System::SystemFactory::CreateBacktrace(Vmacore::Ref<Vmacore::System::Backtrace>&)
--> backtrace[03] libvmacore.so[0x0050A989]
--> backtrace[04] libvmacore.so[0x0050AAA1]: Vmacore::PanicExit(char const*)
--> backtrace[05] libvmacore.so[0x0042154C]: Vmacore::System::ResourceChecker::DoCheck()
--> backtrace[06] libvmacore.so[0x00385107]
--> backtrace[07] libvmacore.so[0x0037EC04]
--> backtrace[08] libvmacore.so[0x00384517]
--> backtrace[09] libvmacore.so[0x00510FBB]
--> backtrace[10] libpthread.so.0[0x00008EB0]
--> backtrace[11] libc.so.6[0x000FFADF]
--> backtrace[12] (no module)
--> [backtrace end]
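
To check how often this has happened, the vpxd logs can be searched for the Memory checker line shown above. The following is a minimal sketch, not an official tool: the function name find_vpxd_memory_panics is hypothetical, and it assumes the logs are uncompressed files named vpxd*.log in the given directory.

```shell
# Hypothetical helper: print every vpxd memory hard-limit event found in a log directory.
# Usage: find_vpxd_memory_panics [log_dir]
find_vpxd_memory_panics() {
    # Default to the log location named in this article.
    log_dir="${1:-/var/log/vmware/vpxd}"
    # -h drops file names so only the log lines are printed.
    # Assumption: logs are uncompressed; rotated .gz files are not searched.
    grep -h "sub=Memory checker" "$log_dir"/vpxd*.log 2>/dev/null \
        | grep "exceeds hard limit"
}
```

Running find_vpxd_memory_panics with no argument searches /var/log/vmware/vpxd. Each printed line corresponds to one occasion on which vpxd shut down after exceeding its memory hard limit.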

/var/log/vmware/vmware-sps/sps.log may contain entries similar to the following:

####-##-##T##:##:##.###+##:## [pool-3-thread-19] INFO opId= com.vmware.vim.storage.common.task.CustomThreadPoolExecutor - [VLSI-client] Active thread count is: 20, Core Pool size is: 20, Queue size: 11, Time spent waiting in queue: 4 millis | ThreadPool Starvation Alert
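
To see how frequently the starvation alert is logged, the entries can be counted per hour. This is a rough sketch under two assumptions: each line starts with a timestamp in the format shown above, so that the first 13 characters are the date and hour, and the function name count_sps_starvation_alerts is hypothetical.

```shell
# Hypothetical helper: count "ThreadPool Starvation Alert" entries per hour.
# Usage: count_sps_starvation_alerts [path_to_sps_log]
count_sps_starvation_alerts() {
    # Default to the log location named in this article.
    sps_log="${1:-/var/log/vmware/vmware-sps/sps.log}"
    # Assumption: lines begin with YYYY-MM-DDTHH..., so columns 1-13 give date and hour.
    grep "ThreadPool Starvation Alert" "$sps_log" \
        | cut -c1-13 \
        | sort \
        | uniq -c
}
```

Each output line is a count followed by the date and hour. Hours with consistently high counts suggest sustained SPS load rather than a one-off spike.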

The vpxd logs may also show a large number of createContainerView calls for SPS tasks. These views are all closed, but the volume indicates a high level of SPS activity:

grep vim.view.ViewManager.createContainerView vpxd-*.log | grep BEGIN | awk '{print$16}' | sort | uniq -c | sort -nr | head

8649 55555####-####-####-####-############(55555####-####-####-####-############)
2902 #########-####-####-####-############(#########-####-####-####-############)

To confirm which client is responsible for these calls, run the following against the ID seen in the output above:

find -iname "vpxd-profiler*" -type f -exec grep -H "55555####-####-####-####-############" {} \; | grep "ClientIP" | head -n 5

Environment

vCenter Server 8.0

Cause

This occurs because the SMS thread pool core size (sms.threadpool.corePoolSize) is left at its default of 20. High SPS activity in the environment exhausts the pool, and the resulting thread pool starvation causes vpxd to exceed its memory hard limit and crash.

Resolution

Note: Take a snapshot or backup of vCenter Server (an offline snapshot for Enhanced Linked Mode environments) before making any changes.

1. Log in to vCenter Server through SSH as root and enter the shell

2. Run the following command to edit the sms.properties file

vi /usr/lib/vmware-vpx/sps/conf/sms.properties

3. Increase the sms.threadpool.corePoolSize value from 20 to 30

sms.threadpool.corePoolSize=30
sms.threadpool.maxPoolSize=500
sms.threadpool.keepAlive=120
sms.threadpool.queueSize=2

4. Save file

Press "Esc"

Type ":wq!"

Press "Enter"

5. Restart the vCenter Server services to ensure the change takes effect

service-control --stop --all && service-control --start --all

6. Monitor to ensure the issue does not recur
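
Steps 2 through 4 can also be done without an interactive editor. The following is a sketch only, not a supported procedure: the function name set_sms_core_pool_size and the .bak backup suffix are assumptions made for illustration, and the in-place edit relies on GNU sed (sed -i). The snapshot or backup and the service restart in step 5 are still required.

```shell
# Hypothetical helper: set sms.threadpool.corePoolSize in sms.properties.
# Usage: set_sms_core_pool_size <new_size> [path_to_sms_properties]
set_sms_core_pool_size() {
    new_size="$1"
    # Default to the file location named in step 2.
    props="${2:-/usr/lib/vmware-vpx/sps/conf/sms.properties}"

    # Accept only a non-empty string of digits.
    case "$new_size" in
        ''|*[!0-9]*) echo "size must be a positive integer" >&2; return 1 ;;
    esac
    [ -f "$props" ] || { echo "not found: $props" >&2; return 1; }

    # Keep a copy of the original file (the .bak name is an assumption).
    cp -p "$props" "$props.bak" || return 1

    # Rewrite only the corePoolSize line; the other settings are left untouched.
    sed -i "s/^sms\.threadpool\.corePoolSize=.*/sms.threadpool.corePoolSize=$new_size/" "$props"

    # Print the resulting line so the change can be verified.
    grep "^sms\.threadpool\.corePoolSize=" "$props"
}
```

For example, set_sms_core_pool_size 30 should print sms.threadpool.corePoolSize=30 if the property was present in the file. If nothing is printed, the property was not found and the file should be checked by hand.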