TKG Workload Cluster Nodes in Soft Lockup CPU Stuck error

Article ID: 377826

Products

VMware Telco Cloud Automation

Issue/Introduction

  • TKG worker nodes/VMs are in "Not Ready" status.
  • TKG nodes/VMs are completely unresponsive: no reply to ping, no SSH access, and a "Soft Lockup CPU Stuck" error shown on the console.
  • A common error seen on the ESXi hosts is shown below.
    Error: 202x-0x-xxT23:28:37.752Z Wa(180) vmkwarning: cpu73:81104094)WARNING: LinuxThread: 421: sockrelay: Error cloning thread: -12 (bad0014)
    Reason: sockrelay is a vSAN thread. The message does not mean that sockrelay caused the problem, only that it could not spawn a new thread because the ESXi host had exhausted its resources.
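
To see how often this warning occurs on a host, the vmkwarning log can be scanned for it. The following is a minimal sketch, assuming log lines follow the layout of the single sample above; the function name, the field names, and the regular expression are illustrative and not part of any VMware tooling.

```python
import re

# Pattern derived from the one sample line in this article. It is an
# assumption, not an official grammar for vmkwarning entries.
CLONE_ERROR = re.compile(
    r"cpu(?P<cpu>\d+):(?P<world_id>\d+)\)WARNING: LinuxThread: \d+: "
    r"(?P<thread>\w+): Error cloning thread: (?P<errno>-?\d+) \((?P<status>\w+)\)"
)

def find_clone_errors(lines):
    """Return one dict per log line that reports a failed thread clone."""
    hits = []
    for line in lines:
        match = CLONE_ERROR.search(line)
        if match:
            hits.append(match.groupdict())
    return hits

# The sample line from this article, with its redacted timestamp kept as-is.
sample = [
    "202x-0x-xxT23:28:37.752Z Wa(180) vmkwarning: cpu73:81104094)WARNING: "
    "LinuxThread: 421: sockrelay: Error cloning thread: -12 (bad0014)",
]
for hit in find_clone_errors(sample):
    print(hit["thread"], hit["errno"], hit["status"])
```

Many hits across unrelated thread names would be consistent with host-wide resource exhaustion rather than a fault in any single component.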


Environment

TCA 2.3
TKG 2.x

Cause

  • The 'Soft Lockup CPU Stuck' error is a well-known Linux kernel message. It signals that a CPU has been busy in kernel mode for too long without giving other tasks a chance to run. The kernel uses a watchdog to check that every CPU core makes progress within a set timeframe: the watchdog interval (kernel.watchdog_thresh) defaults to 10 seconds, and a soft lockup is reported once a core has been stuck for twice that interval.
  • In virtualized environments, the same message can appear when the guest OS kernel is not scheduled on a physical CPU because other workloads on the hypervisor are consuming too much CPU time, often due to over-committed resources. From inside the guest this looks like a stalled CPU, so a soft lockup is reported for that specific VM.
  • Analysis determined that the root cause is generally over-committed CPU resource allocation.
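
On a node that is still reachable, the watchdog settings behind this message can be inspected. The sketch below assumes a Linux guest that exposes /proc/sys/kernel/watchdog_thresh (the watchdog interval, 10 seconds by default upstream) and applies the upstream kernel rule that the soft lockup threshold is twice that value. The helper names are illustrative.

```python
from pathlib import Path

# Standard sysctl location on Linux; absent on other systems.
WATCHDOG_THRESH = Path("/proc/sys/kernel/watchdog_thresh")

def soft_lockup_threshold(watchdog_thresh_seconds):
    """Soft lockup threshold in seconds: twice kernel.watchdog_thresh."""
    return 2 * watchdog_thresh_seconds

def read_watchdog_thresh(path=WATCHDOG_THRESH, default=10):
    """Read kernel.watchdog_thresh, falling back to the upstream default."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return default

thresh = read_watchdog_thresh()
print(f"watchdog_thresh={thresh}s, soft lockup threshold={soft_lockup_threshold(thresh)}s")
```

Raising the threshold only hides the symptom; it does not address CPU contention on the host.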

Resolution

Review the current CPU resource allocation on the ESXi hosts and adjust the resources assigned to the clusters so that the hosts are no longer over-committed. This prevents the issue from recurring.
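
As a starting point for that review, the ratio of allocated vCPUs to physical cores can be computed per host. The following is a minimal sketch with hypothetical host names, inventory figures, and threshold; real values would come from the vCenter inventory, and the acceptable ratio depends on the workload.

```python
from dataclasses import dataclass

@dataclass
class Host:
    name: str             # hypothetical host name
    physical_cores: int   # physical CPU cores on the ESXi host
    vcpus_allocated: int  # total vCPUs of all powered-on VMs on the host

def overcommit_ratio(host):
    """vCPUs allocated per physical core; above 1.0 means over-commitment."""
    return host.vcpus_allocated / host.physical_cores

def overcommitted(hosts, max_ratio=1.0):
    """Return (name, ratio) for every host whose ratio exceeds max_ratio."""
    return [(h.name, overcommit_ratio(h)) for h in hosts
            if overcommit_ratio(h) > max_ratio]

# Illustrative figures only, not taken from a real environment.
hosts = [
    Host("esxi-01", physical_cores=64, vcpus_allocated=192),
    Host("esxi-02", physical_cores=64, vcpus_allocated=48),
]
for name, ratio in overcommitted(hosts):
    print(f"{name}: {ratio:.2f} vCPUs per physical core")
```

Hosts flagged this way are candidates for moving workload cluster nodes elsewhere or for reducing the vCPU count of the VMs placed on them.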