Supervisor Control Plane node fails with "acpi PNP0C80:04: add_memory failed" and OOM errors

Article ID: 429842

Products

VMware vSphere Kubernetes Service

Issue/Introduction

A Supervisor Control Plane VM receives a CPU and memory upgrade, but the memory hot-add fails on the node.

The dmesg output on the affected node shows log entries similar to the following:

[#####.######] acpi PNP0C80:04: add_memory failed
[#####.######] acpi PNP0C80:04: acpi_memory_enable_device() error
[#####.######] kthreadd invoked oom-killer: gfp_mask=0x40cc0(GFP_KERNEL|__GFP_COMP), order=2, oom_score_adj=0
...
[#####.######] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pode0e08c902636fc9ada07bc271fe1ffe2.slice/cri-containerd-################################################################.scope,task=python3,pid=1693,uid=1002
[#####.######] Out of memory: Killed process 1693 (python3) total-vm:524628kB, anon-rss:141296kB, file-rss:0kB, shmem-rss:0kB, UID:1002 pgtables:404kB oom_score_adj:1000
[#####.######] Tasks in /kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pode0e08c902636fc9ada07bc271fe1ffe2.slice/cri-containerd-################################################################.scope are going to be killed due to memory.oom.group set
[#####.######] Out of memory: Killed process 1562 (bash) total-vm:4528kB, anon-rss:292kB, file-rss:4kB, shmem-rss:0kB, UID:1002 pgtables:44kB oom_score_adj:1000
[#####.######] Out of memory: Killed process 1675 (apiserver-proxy) total-vm:1340908kB, anon-rss:6516kB, file-rss:0kB, shmem-rss:0kB, UID:1002 pgtables:128kB oom_score_adj:1000
[#####.######] Out of memory: Killed process 1693 (python3) total-vm:524628kB, anon-rss:141296kB, file-rss:0kB, shmem-rss:0kB, UID:1002 pgtables:404kB oom_score_adj:1000
[#####.######] oom_reaper: reaped process 1693 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
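
These messages can be located on the affected node by filtering the kernel log; the example below is a minimal check whose pattern simply matches the messages shown above:

dmesg | grep -Ei "add_memory|acpi_memory|oom"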


On the affected Control Plane node, the vsphere-syncer container of the vsphere-csi-controller pod appeared to consume all available memory every # minutes. A likely reason for this is a high count of persistent volumes in the environment.
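
One way to observe this behavior, assuming the metrics API is available on the Supervisor cluster, is to watch per-container memory usage in the vmware-system-csi namespace:

kubectl top pod -n vmware-system-csi --containers

In the affected environment, the memory usage of the vsphere-syncer container would be expected to climb until the node exhausts memory and the OOM killer is invoked.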

The kube-apiserver shows log entries similar to the following, consistent with the memory pressure caused by vsphere-syncer:

E0217 ##:##:##.######       1 status.go:71] "Unhandled Error" err="apiserver received an error that is not an metav1.Status: &url.Error{Op:\"Get\", URL:\"https://###.##.#.#:10250/containerLogs/vmware-system-csi/vsphere-csi-controller-##########-#####/vsphere-syncer?follow=true&tailLines=100&timestamps=true\", Err:(*errors.errorString)(0x5c0e840)}: Get \"https://###.##.#.#:10250/containerLogs/vmware-system-csi/vsphere-csi-controller-##########-#####/vsphere-syncer?follow=true&tailLines=100&timestamps=true\": context canceled" logger="UnhandledError"
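
These entries can be searched for in the kube-apiserver pod log; as an example (the pod name placeholder below must be replaced with the name returned by the first command):

kubectl -n kube-system get pods -o wide | grep kube-apiserver

kubectl -n kube-system logs <kube-apiserver-pod-name> | grep -i "unhandled"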

Environment

vCenter version and build number: 9.0.1.0 (24957454)
ESXi version and build number: VMware ESXi 9.0.1.0.24957456
TKGs or VKS version: 3.5.0
Supervisor cluster k8s version: v1.31.6+vmware.3-fips-vsc9.0.1.0-24953340
Guest cluster k8s version: 1.32 / 1.33

Cause

An administrator increased the CPU and RAM of the Supervisor Control Plane VMs to handle the high memory consumption described above.

The increase succeeded on two Supervisor Control Plane nodes; however, the third Supervisor Control Plane node encountered a problem when attempting to increase its memory.

The environment is under heavy memory pressure from the vsphere-syncer container of the vsphere-csi-controller pod, with multiple entries similar to the following observed in the kube-apiserver log:

E0217 ##:##:##.######       1 status.go:71] "Unhandled Error" err="apiserver received an error that is not an metav1.Status: &url.Error{Op:\"Get\", URL:\"https://###.##.#.#:10250/containerLogs/vmware-system-csi/vsphere-csi-controller-##########-#####/vsphere-syncer?follow=true&tailLines=100&timestamps=true\", Err:(*errors.errorString)(0x5c0e840)}: Get \"https://###.##.#.#:10250/containerLogs/vmware-system-csi/vsphere-csi-controller-##########-#####/vsphere-syncer?follow=true&tailLines=100&timestamps=true\": context canceled" logger="UnhandledError"


The high memory consumption is understood to be the result of a sizeable number of persistent volumes in the environment (approximately 1300 or more).
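
The number of persistent volumes can be confirmed from the Supervisor cluster context, for example:

kubectl get pv --no-headers | wc -l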

The vsphere-syncer continued to consume memory every X minutes, leading to memory exhaustion and OOM kills.

[47356.844585] kthreadd invoked oom-killer: gfp_mask=0x40cc0(GFP_KERNEL|__GFP_COMP), order=2, oom_score_adj=0


These OOM kills are understood to have prevented the memory hot-add from allocating the structures needed to bring the additional memory online.

Resolution

To resolve this state, the newly added memory must be initialized through a reboot of the affected node.

Prerequisites:

Ensure the other two Supervisor Control Plane nodes are in a `Ready` state and that etcd quorum is maintained. The following checks can be used from a healthy Supervisor Control Plane node:

Confirm all Supervisor Control Plane nodes are in a Ready state:

kubectl get nodes

List any pods that are not in a Running or Completed state:

kubectl get pods -A -o wide | grep -Eiv "complete|running"

List any package installs that have not reconciled successfully:

kubectl get pkgi -A | grep -Eiv "succeed"

Confirm etcd membership, endpoint health, and quorum:

alias etcdctl='/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/*/fs/usr/local/bin/etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key --cacert /etc/kubernetes/pki/etcd/ca.crt'

etcdctl member list -w table; etcdctl --cluster=true endpoint health -w table; etcdctl --cluster=true endpoint status -w table


Steps:

1.  Log in to the vSphere Client.

2.  Locate the affected Supervisor Control Plane VM.

3.  Right-click the VM and select Power - Restart Guest OS.
    Note - If the Guest OS is hung, use Reset.

4.  Wait for the node to come back online (approx. 5-10 minutes).

5.  SSH into the node and verify that the full memory is now visible using the command 'free -m' (see the example commands after these steps).

6.  Verify that the vsphere-csi-controller and other critical pods stabilize (see the example commands after these steps).
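
Example commands to verify steps 5 and 6, run over SSH on the restarted node (a minimal sketch; the expected memory total depends on the size configured for the VM, and pod names vary by environment):

free -m

kubectl get nodes

kubectl get pods -n vmware-system-csi -o wide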