[vSphere with Tanzu - TKGs] Supervisor cluster's high memory usage due to TKC cluster upgrade skipping K8s versions



Article ID: 345901


Products

VMware vSphere ESXi
VMware vSphere Kubernetes Service

Issue/Introduction

This article provides a likely root cause for the symptoms below and a way forward.


Symptoms:

  • The Supervisor cluster is in an unstable state with constantly high memory usage. Frequent Supervisor VM restarts may also occur.
  • Listing TKC objects and their associated Machines ($ kubectl get tkc,machine -A) shows inconsistent "TKR NAME" and "VERSION" fields: the TKC object shows a Kubernetes version at least two minor versions (N+2) higher than the one listed on its Machine objects.

For example:

$ kubectl get tkc,machine -A
NAMESPACE   NAME                                                    CONTROL PLANE   WORKER   TKR NAME                   AGE   READY   TKR COMPATIBLE   UPDATES AVAILABLE
test        tanzukubernetescluster.run.tanzu.vmware.com/tkc-test    1               1        v1.23.8---vmware.3-tkg.1   11m   False   True             [v1.23.15+vmware.1-tkg.4 v1.24.9+vmware.1-tkg.4 v1.24.11+vmware.1-fips.1-tkg.1]

NAMESPACE   NAME                                                                         CLUSTER    NODENAME                                           PROVIDERID                                       PHASE     AGE   VERSION
test        machine.cluster.x-k8s.io/tkc-test-65r5k-mjtg2                                tkc-test   tkc-test-65r5k-mjtg2                               vsphere://42124fc0-4c9b-41df-0458-9f41e371a223   Running   11m   v1.21.6+vmware.1
test        machine.cluster.x-k8s.io/tkc-test-servicesnodepool-hvngk-864748ff77-ngncq    tkc-test   tkc-test-servicesnodepool-hvngk-864748ff77-ngncq   vsphere://421258ae-19e5-0e9a-0a6e-e45051bb843c   Running   11m   v1.21.6+vmware.1

  • Hundreds or thousands of "wcpmachinetemplate" objects are present in the system and are constantly being created: $ kubectl get wcpmachinetemplate -A

Note: This issue was observed on vCenter Server 7. The same issue may occur on vCenter Server 8, in which case "vspheremachinetemplate" objects would accumulate instead of "wcpmachinetemplate" objects.
 
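To gauge how severe the template leak is, the objects can be counted per namespace. The pipeline below is a sketch; the namespace and object names in the sample are hypothetical, and on a live Supervisor cluster the input would come from the `kubectl get` command shown above.

```shell
# Count objects per namespace from `kubectl get <kind> -A --no-headers` output
# (first column is the namespace).
count_per_namespace() {
  awk '{print $1}' | sort | uniq -c | sort -rn
}

# Live usage against the Supervisor cluster:
#   kubectl get wcpmachinetemplate -A --no-headers | count_per_namespace
#
# Demonstrated here on illustrative sample rows (names are hypothetical):
printf '%s\n' \
  'test  tkc-test-control-plane-abc12  5m' \
  'test  tkc-test-control-plane-def34  4m' \
  'test  tkc-test-control-plane-ghi56  3m' \
  'prod  other-tkc-template-jkl78      2m' |
count_per_namespace
```

A count that keeps growing over repeated runs in one namespace points at the affected TKC cluster.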

  • capi-controller-manager logs show validating webhook errors indicating that an upgrade skipping more than one minor Kubernetes version cannot be performed. For example, 1.21.x -> 1.22.x is allowed, but 1.21.x -> 1.23.x is not.
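One way to confirm this symptom is to filter the controller logs for the denial. This is a sketch: the namespace and deployment name in the commented command are assumptions that may differ between releases, and the sample log line is only illustrative of the kind of message this article describes.

```shell
# Filter log lines for the version-skew webhook denial.
filter_skew_errors() {
  grep -i 'minor version'
}

# Live usage (namespace/deployment name are assumptions; adjust to your
# environment):
#   kubectl -n vmware-system-capw logs deploy/capi-controller-manager --tail=1000 | filter_skew_errors

# Demonstrated on an illustrative sample line:
echo 'admission webhook denied the request: upgrades skipping a minor version are not allowed: v1.21.6 -> v1.23.8' |
filter_skew_errors
```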



Environment

VMware vSphere 7.0 with Tanzu

Cause

  • Generally, a TKC cluster upgrade cannot jump more than one minor Kubernetes version in a single upgrade hop. Admission webhooks block this kind of scenario as follows:

Example:

If a TKC cluster is on version 1.23.8, the admission webhook will block an upgrade to any version 1.25.x or higher. It will allow upgrades to all compatible versions listed under the "Updates Available" field:

$ kubectl get tkc -A
NAMESPACE   NAME                 CONTROL PLANE   WORKER   TKR NAME                   AGE   READY   TKR COMPATIBLE   UPDATES AVAILABLE
test        tkc   1               1        v1.23.8---vmware.3-tkg.1   20h   True    True             [v1.23.15+vmware.1-tkg.4 v1.24.9+vmware.1-tkg.4 v1.24.11+vmware.1-fips.1-tkg.1]

$ kubectl edit -n test tkc tkc
error: tanzukubernetesclusters.run.tanzu.vmware.com "tkc" could not be patched: admission webhook "default.validating.tanzukubernetescluster.run.tanzu.vmware.com" denied the request: version upgrade not compatible with rules

 

  • If a compatible TKC upgrade is triggered, but a subsequent upgrade is triggered before the first one completes (i.e., before all Machines, VMs, nodes, etc. are provisioned), the system is left in the inconsistent state described in the Symptoms section.

Example:

  1. Trigger 1.23.x -> 1.24.x upgrade.
  2. Before 1.24.x upgrade completes, trigger a subsequent 1.24.x -> 1.25.x upgrade.
  3. Because the TKC object has already been updated to 1.24.x (although its Machines, nodes, etc. may still show 1.23.x), the admission webhook does not block the second upgrade to 1.25.x, even though that is two minor versions ahead of the actual node versions.
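The resulting skew can be checked directly by comparing minor versions. The sketch below hardcodes the example values from the Symptoms section; on a live cluster they would be extracted from the TKC and Machine objects (the commented `jsonpath` query is an assumption about the TKC spec layout).

```shell
# Extract the minor component of a "vX.Y.Z" Kubernetes version string.
minor_of() { echo "$1" | cut -d. -f2; }

# Sample values from the Symptoms section of this article. Live extraction
# might look like (jsonpath is an assumption, verify against your release):
#   kubectl get tkc tkc-test -n test -o jsonpath='{.spec.distribution.version}'
tkc_version="v1.23.8"        # version on the TKC object
machine_version="v1.21.6"    # version on its Machine objects

# A skew of 2 or more minor versions indicates the inconsistent state.
skew=$(( $(minor_of "$tkc_version") - $(minor_of "$machine_version") ))
if [ "$skew" -ge 2 ]; then
  echo "INCONSISTENT: TKC is $skew minor versions ahead of its Machines"
fi
```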

Resolution

It is suggested to manually delete all TKC clusters left in an inconsistent state by the incomplete initial upgrade. After the deletion, all associated wcpmachinetemplate/vspheremachinetemplate objects are automatically cleaned up, and memory pressure on the Supervisor VMs decreases.
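One cautious way to carry this out is to generate the delete commands from the TKC listing so they can be reviewed before running. This is a sketch built on the example objects from this article; substitute your own namespaces and cluster names, and note that deleting a TKC removes its workload cluster.

```shell
# Build delete commands from `kubectl get tkc -A --no-headers` output
# (columns start with: NAMESPACE NAME ...). Review before executing.
emit_delete_cmds() {
  awk '{print "kubectl delete tanzukubernetescluster " $2 " -n " $1}'
}

# Live usage (review the printed commands, then run the ones you confirm):
#   kubectl get tkc -A --no-headers | emit_delete_cmds
#
# Demonstrated on the sample row from this article:
echo 'test  tkc-test  1  1  v1.23.8---vmware.3-tkg.1  11m  False  True' |
emit_delete_cmds
# prints: kubectl delete tanzukubernetescluster tkc-test -n test
```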