TKG Workload Cluster Nodes in Soft Lockup CPU Stuck error
Article ID: 377826
Products
VMware Telco Cloud Automation
Issue/Introduction
TKG worker nodes/VMs are in status "Not Ready".
TKG nodes/VMs are completely unresponsive: they do not reply to ping, SSH access fails, and the VM console shows a Soft Lockup CPU Stuck error.
A common error seen on the ESXi hosts is shown below.
Error: 202x-0x-xxT23:28:37.752Z Wa(180) vmkwarning: cpu73:81104094)WARNING: LinuxThread: 421: sockrelay: Error cloning thread: -12 (bad0014)
Reason: sockrelay is a vSAN thread. This message does not mean that sockrelay caused the problem, only that it could not spawn a new thread because the ESXi host had exhausted its resources.
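To check how often this warning appears in a collected log, the line can be matched with a regular expression. The following is a minimal sketch in Python. The pattern is derived only from the single sample line above, so other vmkwarning lines may need a different pattern. The function name find_clone_errors and the field name world_id (the number after the CPU is assumed to be the VMkernel world ID) are illustrative and do not come from any VMware tool.

```python
import re

# Pattern built from the one sample line in this article (an assumption:
# other vmkwarning lines may be laid out differently).
CLONE_ERROR = re.compile(
    r"cpu(?P<cpu>\d+):(?P<world>\d+)\)WARNING: LinuxThread: \d+: "
    r"(?P<thread>\w+): Error cloning thread: (?P<errno>-?\d+) \((?P<status>\w+)\)"
)


def find_clone_errors(lines):
    """Return one dict per log line that reports a failed thread clone."""
    hits = []
    for line in lines:
        m = CLONE_ERROR.search(line)
        if m:
            hits.append({
                "cpu": int(m["cpu"]),
                "world_id": int(m["world"]),  # assumed meaning of this field
                "thread": m["thread"],
                "errno": int(m["errno"]),
                "status": m["status"],
            })
    return hits


if __name__ == "__main__":
    sample = (
        "202x-0x-xxT23:28:37.752Z Wa(180) vmkwarning: cpu73:81104094)WARNING: "
        "LinuxThread: 421: sockrelay: Error cloning thread: -12 (bad0014)"
    )
    for hit in find_clone_errors([sample]):
        print(hit)
```

A large number of matches across different thread names would be consistent with general resource exhaustion on the host rather than a fault in one component.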
Environment
TCA 2.3, TKG 2.x
Cause
The 'Soft Lockup CPU Stuck' error is a well-known Linux kernel message. It indicates that a CPU has been running a single task in kernel mode for too long without yielding to the scheduler, which prevents other tasks from running on that CPU. The kernel detects this with a per-CPU watchdog: a soft lockup is reported when a CPU has not run its watchdog task for twice the value of kernel.watchdog_thresh. That setting defaults to 10 seconds, so the default detection time is about 20 seconds.
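On Linux, the watchdog setting is exposed as kernel.watchdog_thresh (default 10 seconds), and the kernel reports a soft lockup after twice that value. The sketch below, assuming it is run inside a Linux guest such as a TKG node, reads the setting and derives the detection time. The function names are illustrative, and the fallback to 10 is an assumption for systems where the file cannot be read.

```python
from pathlib import Path

# Standard procfs location of the sysctl kernel.watchdog_thresh on Linux.
WATCHDOG_THRESH = Path("/proc/sys/kernel/watchdog_thresh")


def read_watchdog_thresh(path=WATCHDOG_THRESH, default=10):
    """Read watchdog_thresh in seconds; fall back to the kernel default."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return default


def soft_lockup_threshold_seconds(watchdog_thresh):
    """The kernel reports a soft lockup after 2 * watchdog_thresh seconds."""
    return 2 * watchdog_thresh


if __name__ == "__main__":
    thresh = read_watchdog_thresh()
    print(f"watchdog_thresh = {thresh}s")
    print(f"soft lockup reported after about {soft_lockup_threshold_seconds(thresh)}s")
```

Raising the threshold only delays the message. It does not address the CPU contention that keeps the guest from being scheduled.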
In virtualized environments, the same message can appear when the guest OS is not scheduled on a physical CPU for an extended period because other workloads on the hypervisor are consuming the available CPU time. This is usually the result of over-committed resources, and the guest kernel then reports a soft lockup for that specific VM.
Analysis of the affected environments showed that the root cause is generally over-committed CPU resource allocation.
Resolution
Review the current CPU and memory allocation on the affected ESXi hosts, and adjust the resources assigned to the workload clusters so that the hosts are no longer over-committed and the issue does not recur.
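One way to start the review is to compare the vCPUs allocated to powered-on VMs on each host with the physical cores of that host. The sketch below is only an illustration under stated assumptions: the host names and numbers are made up, the inventory is entered by hand rather than pulled from vCenter, and the acceptable ratio (max_ratio) is a value you choose for your own workloads, not a limit taken from this article.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Host:
    name: str
    physical_cores: int   # physical cores, not hyper-threads (an assumption)
    vm_vcpus: List[int]   # vCPU count of each powered-on VM on the host


def vcpu_to_pcpu_ratio(host):
    """Total allocated vCPUs divided by the host's physical cores."""
    if host.physical_cores <= 0:
        raise ValueError(f"{host.name}: physical_cores must be positive")
    return sum(host.vm_vcpus) / host.physical_cores


def overcommitted_hosts(hosts, max_ratio):
    """Return (name, ratio) for every host whose ratio exceeds max_ratio."""
    result = []
    for host in hosts:
        ratio = vcpu_to_pcpu_ratio(host)
        if ratio > max_ratio:
            result.append((host.name, ratio))
    return result


if __name__ == "__main__":
    # Hypothetical inventory for illustration only.
    inventory = [
        Host("esx-a", physical_cores=32, vm_vcpus=[8, 8, 8, 8]),
        Host("esx-b", physical_cores=32, vm_vcpus=[16, 16, 16, 16]),
    ]
    for name, ratio in overcommitted_hosts(inventory, max_ratio=1.0):
        print(f"{name}: {ratio:.2f} vCPUs per physical core")
```

Hosts flagged this way are candidates for moving VMs elsewhere or reducing the vCPU counts assigned to the cluster nodes.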