[vSphere with Tanzu - TKGs] Supervisor cluster's high memory usage due to TKC cluster upgrade skipping K8s versions



Article ID: 345901


Products

VMware vSphere ESXi
VMware vSphere Kubernetes Service

Issue/Introduction

This article provides a likely root cause for the symptoms below and a way forward.


Symptoms:

  • The Supervisor cluster is in an unstable state with constantly high memory usage. Frequent Supervisor VM restarts may also occur.
  • Listing TKC objects and their associated Machines ($ kubectl get tkc,machine -A) shows inconsistent "TKR NAME" and "VERSION" fields: the TKC object shows a Kubernetes version at least two minor versions (N+2) higher than the one listed on its Machine objects.

For example:

$ kubectl get tkc,machine -A
NAMESPACE   NAME                                                    CONTROL PLANE   WORKER   TKR NAME                   AGE   READY   TKR COMPATIBLE   UPDATES AVAILABLE
test        tanzukubernetescluster.run.tanzu.vmware.com/tkc-test    1               1        v1.23.8---vmware.3-tkg.1   11m   False   True             [v1.23.15+vmware.1-tkg.4 v1.24.9+vmware.1-tkg.4 v1.24.11+vmware.1-fips.1-tkg.1]

NAMESPACE   NAME                                                                         CLUSTER    NODENAME                                           PROVIDERID                                       PHASE     AGE   VERSION
test        machine.cluster.x-k8s.io/tkc-test-65r5k-mjtg2                                tkc-test   tkc-test-65r5k-mjtg2                               vsphere://42124fc0-4c9b-41df-0458-9f41e371a223   Running   11m   v1.21.6+vmware.1
test        machine.cluster.x-k8s.io/tkc-test-servicesnodepool-hvngk-864748ff77-ngncq    tkc-test   tkc-test-servicesnodepool-hvngk-864748ff77-ngncq   vsphere://421258ae-19e5-0e9a-0a6e-e45051bb843c   Running   11m   v1.21.6+vmware.1

  • Hundreds or thousands of "wcpmachinetemplate" objects are present in the system and are constantly being created: $ kubectl get wcpmachinetemplate -A

Note: This issue was observed on vCenter Server 7. The same issue may occur on vCenter Server 8, in which case "vspheremachinetemplate" objects would accumulate instead of "wcpmachinetemplate" objects.
 
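To gauge how severe the template leak is, the objects can be counted per namespace. The pipeline below is a sketch; the namespace and object names in the sample are hypothetical, and on a live Supervisor cluster the input would come from the `kubectl get` command shown above.

```shell
# Count objects per namespace from `kubectl get <kind> -A --no-headers` output
# (first column is the namespace).
count_per_namespace() {
  awk '{print $1}' | sort | uniq -c | sort -rn
}

# Live usage against the Supervisor cluster:
#   kubectl get wcpmachinetemplate -A --no-headers | count_per_namespace
#
# Demonstrated here on illustrative sample rows (names are hypothetical):
printf '%s\n' \
  'test  tkc-test-control-plane-abc12  5m' \
  'test  tkc-test-control-plane-def34  4m' \
  'test  tkc-test-control-plane-ghi56  3m' \
  'prod  other-tkc-template-jkl78      2m' |
count_per_namespace
```

A count that keeps growing over repeated runs in one namespace points at the affected TKC cluster.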

  • capi-controller-manager logs show validating webhook errors indicating that an upgrade skipping more than one minor Kubernetes version cannot be performed. For example, 1.21.x -> 1.22.x is allowed, but 1.21.x -> 1.23.x is not.
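One way to confirm this symptom is to filter the controller logs for the denial. This is a sketch: the namespace and deployment name in the commented command are assumptions that may differ between releases, and the sample log line is only illustrative of the kind of message this article describes.

```shell
# Filter log lines for the version-skew webhook denial.
filter_skew_errors() {
  grep -i 'minor version'
}

# Live usage (namespace/deployment name are assumptions; adjust to your
# environment):
#   kubectl -n vmware-system-capw logs deploy/capi-controller-manager --tail=1000 | filter_skew_errors

# Demonstrated on an illustrative sample line:
echo 'admission webhook denied the request: upgrades skipping a minor version are not allowed: v1.21.6 -> v1.23.8' |
filter_skew_errors
```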



Environment

VMware vSphere 7.0 with Tanzu

Cause

  • Generally, a TKC cluster upgrade cannot jump more than one minor Kubernetes version in a single upgrade hop. Admission webhooks block this kind of scenario as follows:

Example:

If a TKC cluster is on version 1.23.8, the admission webhook will block an upgrade to any version 1.25.x or higher. It will allow upgrades to all compatible versions listed under the "Updates Available" field:

$ kubectl get tkc -A
NAMESPACE   NAME                 CONTROL PLANE   WORKER   TKR NAME                   AGE   READY   TKR COMPATIBLE   UPDATES AVAILABLE
test        tkc   1               1        v1.23.8---vmware.3-tkg.1   20h   True    True             [v1.23.15+vmware.1-tkg.4 v1.24.9+vmware.1-tkg.4 v1.24.11+vmware.1-fips.1-tkg.1]

$ kubectl edit -n test tkc tkc
error: tanzukubernetesclusters.run.tanzu.vmware.com "tkc" could not be patched: admission webhook "default.validating.tanzukubernetescluster.run.tanzu.vmware.com" denied the request: version upgrade not compatible with rules

 

  • If a compatible TKC upgrade is triggered, but a subsequent upgrade is triggered before the first one completes (i.e., before all Machines, VMs, nodes, etc. are provisioned), the system is left in the inconsistent state described in the Symptoms section.

Example:

  1. Trigger 1.23.x -> 1.24.x upgrade.
  2. Before 1.24.x upgrade completes, trigger a subsequent 1.24.x -> 1.25.x upgrade.
  3. Because the TKC object has already been updated to 1.24.x (although its Machines, nodes, etc. may still show 1.23.x), the admission webhook does not block the second upgrade to 1.25.x, even though that is two minor versions ahead of the actual node versions.
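The resulting skew can be checked directly by comparing minor versions. The sketch below hardcodes the example values from the Symptoms section; on a live cluster they would be extracted from the TKC and Machine objects (the commented `jsonpath` query is an assumption about the TKC spec layout).

```shell
# Extract the minor component of a "vX.Y.Z" Kubernetes version string.
minor_of() { echo "$1" | cut -d. -f2; }

# Sample values from the Symptoms section of this article. Live extraction
# might look like (jsonpath is an assumption, verify against your release):
#   kubectl get tkc tkc-test -n test -o jsonpath='{.spec.distribution.version}'
tkc_version="v1.23.8"        # version on the TKC object
machine_version="v1.21.6"    # version on its Machine objects

# A skew of 2 or more minor versions indicates the inconsistent state.
skew=$(( $(minor_of "$tkc_version") - $(minor_of "$machine_version") ))
if [ "$skew" -ge 2 ]; then
  echo "INCONSISTENT: TKC is $skew minor versions ahead of its Machines"
fi
```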

Resolution

It is suggested to manually delete all TKC clusters left in an inconsistent state by the incomplete initial upgrade. After the deletion, all associated wcpmachinetemplate/vspheremachinetemplate objects are automatically cleaned up, and memory pressure on the Supervisor VMs decreases.
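One cautious way to carry this out is to generate the delete commands from the TKC listing so they can be reviewed before running. This is a sketch built on the example objects from this article; substitute your own namespaces and cluster names, and note that deleting a TKC removes its workload cluster.

```shell
# Build delete commands from `kubectl get tkc -A --no-headers` output
# (columns start with: NAMESPACE NAME ...). Review before executing.
emit_delete_cmds() {
  awk '{print "kubectl delete tanzukubernetescluster " $2 " -n " $1}'
}

# Live usage (review the printed commands, then run the ones you confirm):
#   kubectl get tkc -A --no-headers | emit_delete_cmds
#
# Demonstrated on the sample row from this article:
echo 'test  tkc-test  1  1  v1.23.8---vmware.3-tkg.1  11m  False  True' |
emit_delete_cmds
# prints: kubectl delete tanzukubernetescluster tkc-test -n test
```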