HCX - High CPU observed in registered vCenter causing migrations to be very slow

Products

VMware HCX
VMware Cloud on AWS

Issue/Introduction

High CPU usage is observed on the registered vCenter because the Storage Management Service (SPS) receives a large number of requests from HCX.
To check CPU utilization on the vCenter Server Appliance (vCSA), review the vCenter VM performance charts.

SSH to the vCSA and run 'top' to confirm the high CPU usage from the guest OS side:

top - 08:40:38 up 7 days, 17:19,  0 users,  load average: 11.48, 10.28, 9.12
Threads: 5719 total,  12 running, 5707 sleeping,   0 stopped,   0 zombie
%Cpu0  :  47.7/15.1   63[|||||||]   ]     %Cpu1  :  39.3/28.1   67[|||||||]   ]
%Cpu2  :  57.1/13.2   70[|||||||]   ]     %Cpu3  :  89.2/10.1   99[|||||||||||| ]
%Cpu4  :  45.5/15.9   61[|||||||]   ]     %Cpu5  :  72.5/25.3   98[||||||||||]
%Cpu6  :  52.2/18.5   71[|||||||]   ]     %Cpu7  :  36.3/30.8   67[|||||||]   ]

Press "P" to sort by the %CPU column. The sps service shows very high CPU usage and has spawned a large number of threads.

PID USER      PR  NI    VIRT    RES  %CPU  %MEM     TIME+ S COMMAND
157849 sps       20   0 6333.1m 878.5m   70.1  2.9   0:00.00 S      `- /usr/j+
157858 sps       20   0 6333.1m 878.5m   89.5  2.9  63:24.53 S          `- /u+
157859 sps       20   0 6333.1m 878.5m   0.3   2.9  58:18.73 S          `- /u+
157860 sps       20   0 6333.1m 878.5m   0.0   2.9   0:38.42 S          `- /u+
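To put a number on the sps load, the per-thread rows from a 'top' snapshot can be summed. The following is a minimal sketch, assuming the snapshot was saved as text (for example with `top -H -b -n 1`) and uses the column layout shown above. The function name `cpu_by_user` and the embedded sample are illustrative only.

```python
# Sketch: sum the %CPU column for all threads owned by one user in a
# saved 'top' snapshot. Column positions are read from the header row,
# so the layout is assumed to include USER and %CPU columns.
SAMPLE = """\
PID USER      PR  NI    VIRT    RES  %CPU  %MEM     TIME+ S COMMAND
157849 sps       20   0 6333.1m 878.5m   70.1  2.9   0:00.00 S      `- /usr/j+
157858 sps       20   0 6333.1m 878.5m   89.5  2.9  63:24.53 S          `- /u+
157859 sps       20   0 6333.1m 878.5m   0.3   2.9  58:18.73 S          `- /u+
157860 sps       20   0 6333.1m 878.5m   0.0   2.9   0:38.42 S          `- /u+
""".splitlines()


def cpu_by_user(top_lines, user="sps"):
    """Return (thread_count, total_cpu_percent) for rows owned by `user`."""
    cpu_col = user_col = None
    threads = 0
    total = 0.0
    for line in top_lines:
        fields = line.split()
        # Locate the header row to learn where USER and %CPU sit.
        if "%CPU" in fields and "USER" in fields:
            cpu_col, user_col = fields.index("%CPU"), fields.index("USER")
            continue
        if cpu_col is None or len(fields) <= max(cpu_col, user_col):
            continue  # summary lines, blank lines, or truncated rows
        if fields[user_col] != user:
            continue
        try:
            total += float(fields[cpu_col])
        except ValueError:
            continue  # non-numeric cell, skip the row
        threads += 1
    return threads, total


threads, total = cpu_by_user(SAMPLE)
print(f"sps threads: {threads}, combined %CPU: {total:.1f}")
```

A combined figure well above 100 on a multi-core appliance, together with a large thread count, is consistent with the behaviour described in this article.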

While this issue is occurring, HCX migrations are slow and can take hours to complete.

HCX Manager hybridity/Plugin UI --> Migration --> Expand the Migration Task --> Events shows that the "Reserving storage for disks" task takes an unusually long time.

The HCX Manager log /common/logs/admin/app.log shows StorageProfilePollerJob entries like the following, logged continuously over a long period:

2024-##-##.594 UTC [VsphereStorageService_SvcThread-42905, Ent: HybridityAdmin, , TxId: ########-####-####-####-########] INFO c.v.v.h.s.s.v.j.StorageProfilePollerJob- For datastore:datastore-####, chosen host to get profileCost is:host-####
2024-##-##.612 UTC [VsphereStorageService_SvcThread-42896, Ent: HybridityAdmin, , TxId: ########-####-####-####-########] INFO c.v.v.h.s.s.v.j.StorageProfilePollerJob- For datastore:datastore-####, chosen host to get profileCost is:host-####
2024-##-##.619 UTC [VsphereStorageService_SvcThread-42897, Ent: HybridityAdmin, , TxId: ########-####-####-####-########] INFO c.v.v.h.s.s.v.j.StorageProfilePollerJob- For datastore:datastore-####, chosen host to get profileCost is:host-####
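One way to gauge how busy the poller is would be to count these entries per service thread. The sketch below assumes the line format shown in the excerpt above. The helper name `poller_activity` is hypothetical, and the sample lines are the masked ones from this article.

```python
# Sketch: count StorageProfilePollerJob entries per HCX service thread.
# The regex assumes the "[VsphereStorageService_SvcThread-<n>, ..." prefix
# seen in the app.log excerpt; other log formats would need adjusting.
import re
from collections import Counter

THREAD_RE = re.compile(r"\[(VsphereStorageService_SvcThread-\d+),")

SAMPLE = [
    "2024-##-##.594 UTC [VsphereStorageService_SvcThread-42905, Ent: HybridityAdmin, , TxId: ########-####-####-####-########] INFO c.v.v.h.s.s.v.j.StorageProfilePollerJob- For datastore:datastore-####, chosen host to get profileCost is:host-####",
    "2024-##-##.612 UTC [VsphereStorageService_SvcThread-42896, Ent: HybridityAdmin, , TxId: ########-####-####-####-########] INFO c.v.v.h.s.s.v.j.StorageProfilePollerJob- For datastore:datastore-####, chosen host to get profileCost is:host-####",
    "2024-##-##.619 UTC [VsphereStorageService_SvcThread-42897, Ent: HybridityAdmin, , TxId: ########-####-####-####-########] INFO c.v.v.h.s.s.v.j.StorageProfilePollerJob- For datastore:datastore-####, chosen host to get profileCost is:host-####",
]


def poller_activity(lines):
    """Return a Counter mapping thread name -> number of poller log entries."""
    per_thread = Counter()
    for line in lines:
        if "StorageProfilePollerJob" not in line:
            continue
        match = THREAD_RE.search(line)
        per_thread[match.group(1) if match else "unknown"] += 1
    return per_thread


activity = poller_activity(SAMPLE)
print(f"{sum(activity.values())} poller entries across {len(activity)} threads")
```

To run it against a real log, pass an open file handle for app.log instead of `SAMPLE`. A steadily growing count over a short window suggests the poller is running continuously.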

The vCenter Storage Profile Service log (/var/log/vmware/vmware-sps/sps.log) shows the following:

2024-##-##+01:00 [pool-3-thread-4] INFO  opId= com.vmware.vim.storage.common.task.CustomThreadPoolExecutor - [VLSI-client] Active thread count is: 20, Core Pool size is: 20, Queue size: 9, Time spent waiting in queue: 9 millis | ThreadPool Starvation Alert
2024-##-##+01:00 [pool-3-thread-5] INFO  opId= com.vmware.vim.storage.common.task.CustomThreadPoolExecutor - [VLSI-client] Active thread count is: 20, Core Pool size is: 20, Queue size: 8, Time spent waiting in queue: 9 millis | ThreadPool Starvation Alert
2024-##-##+01:00 [pool-3-thread-8] INFO  opId= com.vmware.vim.storage.common.task.CustomThreadPoolExecutor - [VLSI-client] Active thread count is: 20, Core Pool size is: 20, Queue size: 14, Time spent waiting in queue: 7 millis | ThreadPool Starvation Alert
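The starvation alerts can be summarized rather than read line by line. The following is a minimal sketch that assumes the message wording shown in the excerpt above. The function name `summarize_starvation` is illustrative, and other builds may word the message differently.

```python
# Sketch: summarize "ThreadPool Starvation Alert" lines from sps.log.
# The pattern mirrors the message text in the excerpt above.
import re

ALERT_RE = re.compile(
    r"Active thread count is: (?P<active>\d+), "
    r"Core Pool size is: (?P<core>\d+), "
    r"Queue size: (?P<queue>\d+), "
    r"Time spent waiting in queue: (?P<wait_ms>\d+) millis "
    r"\| ThreadPool Starvation Alert"
)

SAMPLE = [
    "2024-##-##+01:00 [pool-3-thread-4] INFO  opId= com.vmware.vim.storage.common.task.CustomThreadPoolExecutor - [VLSI-client] Active thread count is: 20, Core Pool size is: 20, Queue size: 9, Time spent waiting in queue: 9 millis | ThreadPool Starvation Alert",
    "2024-##-##+01:00 [pool-3-thread-5] INFO  opId= com.vmware.vim.storage.common.task.CustomThreadPoolExecutor - [VLSI-client] Active thread count is: 20, Core Pool size is: 20, Queue size: 8, Time spent waiting in queue: 9 millis | ThreadPool Starvation Alert",
    "2024-##-##+01:00 [pool-3-thread-8] INFO  opId= com.vmware.vim.storage.common.task.CustomThreadPoolExecutor - [VLSI-client] Active thread count is: 20, Core Pool size is: 20, Queue size: 14, Time spent waiting in queue: 7 millis | ThreadPool Starvation Alert",
]


def summarize_starvation(lines):
    """Count alerts, alerts where every core thread was busy, and the peak queue size."""
    alerts = saturated = max_queue = 0
    for line in lines:
        match = ALERT_RE.search(line)
        if not match:
            continue
        alerts += 1
        # The pool is saturated when active threads reach the core pool size.
        if int(match["active"]) >= int(match["core"]):
            saturated += 1
        max_queue = max(max_queue, int(match["queue"]))
    return {"alerts": alerts, "saturated": saturated, "max_queue": max_queue}


print(summarize_starvation(SAMPLE))
```

In the excerpt every alert shows all 20 core threads active with requests queued behind them, which matches SPS being flooded with calls.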

Environment

HCX installations upgraded to version 4.10.0 or 4.10.1 are impacted.
vCenter Server

Cause

HCX syncs storage policy information from its registered vCenter using a poller that starts when HCX boots or during app-engine start.

Upgrading from HCX 4.9 to 4.10 starts new poller instances in addition to those carried over from the previous version.

Errors in retrieving storage information cause the number of pollers to double, leading to multiple calls to the vCenter storage profile API.

Resolution

This issue is resolved in VMware HCX 4.10.2 available at Broadcom Downloads.
If you are having difficulty finding and downloading software, please review the Download Broadcom products and software KB.

Additional Information

This issue causes high CPU usage on the registered vCenter.
All HCX migration types experience significant slowness.
Network Extension workflows and the datapath are not impacted.
The issue affects systems upgraded to HCX 4.10.0 or 4.10.1.
Newly deployed HCX 4.10.0/4.10.1 systems are not affected.
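The exposure rules stated in this article can be expressed as a small check. This is only a sketch of the conditions listed here (upgraded systems on 4.10.0 or 4.10.1 are affected, fresh deployments are not, and 4.10.2 contains the fix). The function `is_impacted` and its parameters are hypothetical and are not part of any HCX tooling.

```python
# Sketch: encode the exposure conditions described in this article.
# `version` is the running HCX version string; `upgraded` is True when the
# system reached that version by upgrade rather than by a fresh deployment.
def is_impacted(version: str, upgraded: bool) -> bool:
    """Return True when the article's conditions for this issue are met."""
    # Compare only major.minor.patch; pad short strings such as "4.10".
    parts = tuple(int(p) for p in version.split(".")[:3])
    parts += (0,) * (3 - len(parts))
    return upgraded and (4, 10, 0) <= parts < (4, 10, 2)


for version, upgraded in [("4.10.1", True), ("4.10.1", False), ("4.10.2", True)]:
    print(version, "upgraded" if upgraded else "fresh", is_impacted(version, upgraded))
```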