Continuous reconciliation of the daemonset "update-node-password-runner" due to missing VM Class in the cluster namespace

Article ID: 431566

Products

VMware vSphere Kubernetes Service

Issue/Introduction

  • Continuous reconciliation of the daemonset "update-node-password-runner", resulting in continuous pod churn.
  • The cluster YAML may or may not set osConfiguration.password.renewalDaysBeforeExpiry, which is available with ClusterClass version builtin-generic-v3.4.0 and later.
  • The Supervisor vmware-system-tkg-controller-manager pod logs under /var/log/pods/svc-tkg-domain-c#_vmware-system-tkg-controller-manager-##########/manager show entries similar to the following:
    YYYY-MM-DDTHH:MM:SS stderr F I0209 HH:MM:SS 1 password_controller.go:###] "Cluster needs reconciliation" logger="svc-tkg-domain-c#-tkg-controller.password-controller" name="<tkg-cluster-name>" namespace="<tkg-ns-name>" lastUpdateTimestamp="YYYY-MM-DD HH:MM:SS +0000 UTC" passwordAge="<#h#m#s>" renewalDaysBeforeExpiry="<#h#m#s>"
    YYYY-MM-DDTHH:MM:SS stderr F E0209 HH:MM:SS 1 controller.go:347] "Reconciler error" err="admission webhook \"capi.validating.tanzukubernetescluster.run.tanzu.vmware.com\" denied the request: vm class(es): "<tkg-vmclass-name>" not found" controller="password-controller" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="<tkg-ns-name>/<tkg-cluster-name>" namespace="<tkg-ns-name>" name="<tkg-cluster-name>" reconcileID="########-####-####-####-############"
    YYYY-MM-DDTHH:MM:SS stderr F I0209 HH:MM:SS 1 password_controller.go:###] "Cluster needs reconciliation" logger="svc-tkg-domain-c#-tkg-controller.password-controller" name="<tkg-cluster-name>" namespace="<tkg-ns-name>" lastUpdateTimestamp="YYYY-MM-DD HH:MM:SS +0000 UTC" passwordAge="<#h#m#s>" renewalDaysBeforeExpiry="<#h#m#s>"
    YYYY-MM-DDTHH:MM:SS stderr F I0209 HH:MM:SS 1 password_controller.go:###] "Creating update runner in guest cluster" logger="svc-tkg-domain-c#-tkg-controller.password-controller" name="<tkg-cluster-name>" namespace="<tkg-ns-name>"
    YYYY-MM-DDTHH:MM:SS stderr F I0209 HH:MM:SS 1 password_controller.go:###] "Guest cluster update runner created" logger="svc-tkg-domain-c#-tkg-controller.password-controller" name="<tkg-cluster-name>" namespace="<tkg-ns-name>"
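
To confirm which clusters and VM classes are affected, the "Reconciler error" entries above can be filtered out of a saved copy of the manager log. The following is a minimal sketch, assuming the log lines follow the format shown above; the function name find_missing_vm_classes and both regular expressions are illustrative and not part of any VMware tooling.

```python
import re

# Matches the webhook denial message. Tolerates either escaped (\") or
# plain (") quotes around the VM class name, since both may appear in logs.
MISSING_CLASS = re.compile(r'vm class\(es\): \\?"(?P<vmclass>.+?)\\?" not found')

# Matches the "<namespace>/<name>" pair logged for the Cluster object.
CLUSTER_REF = re.compile(r'Cluster="(?P<namespace>[^/"]+)/(?P<name>[^"]+)"')


def find_missing_vm_classes(lines):
    """Return a sorted list of (namespace, cluster, vm_class) tuples.

    Only lines containing "Reconciler error" together with the
    "vm class(es) ... not found" denial are considered.
    """
    hits = set()
    for line in lines:
        if "Reconciler error" not in line:
            continue
        class_match = MISSING_CLASS.search(line)
        cluster_match = CLUSTER_REF.search(line)
        if class_match and cluster_match:
            hits.add(
                (
                    cluster_match["namespace"],
                    cluster_match["name"],
                    class_match["vmclass"],
                )
            )
    return sorted(hits)
```

Passing the lines of the manager log file to find_missing_vm_classes yields one tuple per affected cluster and VM class, which can then be compared against the VM classes associated with that namespace.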

Environment

vSphere with Tanzu

vSphere Kubernetes Service (VKS) 3.4

Cause

Since VKS 3.4, the daemonset "update-node-password-runner" is triggered only when the cluster password is due for rotation. It is expected to rotate the cluster SSH password, after which VKS sets the cluster annotation lastPasswordtimeStamp. Once that annotation is set, the password rotation is considered successful and the password update is not retried.

When the validating admission webhook rejects the request (often because of an environmental issue such as a missing VM class), any further update to the Cluster object can be blocked. The controller therefore never gets the chance to add the annotation to the Cluster object, so the password rotation is treated as a failed update and the password rotation daemonset is triggered again.
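
The retry loop described here can be illustrated with a toy model. This is a simplified sketch under stated assumptions, not the actual VKS controller code: the ToyCluster type, the function names, and the order of steps inside reconcile are all hypothetical, and the model assumes the password is already due for rotation on every pass.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCluster:
    """Hypothetical stand-in for a Cluster object; not the real API type."""

    vm_class: str
    annotations: dict = field(default_factory=dict)


def webhook_allows_update(cluster, namespace_vm_classes):
    """Toy validating webhook: deny updates if the VM class is not in the namespace."""
    return cluster.vm_class in namespace_vm_classes


def reconcile(cluster, namespace_vm_classes):
    """Run one toy reconcile pass. Return True if the runner was (re)created."""
    if "lastPasswordtimeStamp" in cluster.annotations:
        return False  # rotation already recorded, so nothing is retried
    # Assumption: the runner daemonset is created on every pass that still
    # sees the rotation as outstanding.
    runner_created = True
    # Recording success requires an update to the Cluster object, which the
    # webhook must admit. If it is denied, the annotation is never written.
    if webhook_allows_update(cluster, namespace_vm_classes):
        cluster.annotations["lastPasswordtimeStamp"] = "recorded"
    return runner_created


def count_runner_creations(cluster, namespace_vm_classes, passes):
    """Count how many of the given reconcile passes recreate the runner."""
    return sum(reconcile(cluster, namespace_vm_classes) for _ in range(passes))
```

In this model, a cluster whose VM class is missing from the namespace recreates the runner on every pass, while a cluster whose VM class is present creates it once and then stops, which mirrors the pod churn seen in the affected environment.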

Resolution

Add the missing VM class to the namespace (see "Manage VM Classes on a Namespace in vSphere with Tanzu") and ensure that the VM class defined in the cluster YAML is associated with the cluster namespace.

Note: 

If the VM class is reattached to the namespace without making any changes to the Cluster object, no rollout will be triggered.

However, if a new VM class is required, it must first be associated with the namespace and then updated in the Cluster object. Updating the Cluster object with the new VM class will trigger a rollout.

Additional Information

Automatic password rotation before expiration: To avoid lockouts caused by password expiration for vmware-system-user, VKS 3.4 rotates passwords automatically before they expire. This makes it seamless yet secure for platform engineers to connect to VKS clusters using SSH while adhering to established security hardening protocols. (Source: VMware vSphere Kubernetes Service Release Notes)

By default, VKS starts the update at a password age of 83 days, that is, expiry (90 days) minus renewalDaysBeforeExpiry (7 days), to prevent the user from being locked out of the cluster nodes.

This feature is enabled by default, and renewalDaysBeforeExpiry defaults to 7. Set osConfiguration.password.renewalDaysBeforeExpiry to 0 to disable the feature entirely.

For example, if the user sets renewalDaysBeforeExpiry to 14 using osConfiguration.password.renewalDaysBeforeExpiry, the password rotation occurs after 76 days.
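
The arithmetic above can be written as a small helper. This is an illustrative sketch only: the function name rotation_day and the range check are assumptions made for the example, while the 90-day expiry, the default of 7, and the meaning of 0 come from the text above.

```python
def rotation_day(renewal_days_before_expiry=7, expiry_days=90):
    """Return the password age in days at which rotation starts.

    Returns None when renewal_days_before_expiry is 0, which disables
    automatic rotation.
    """
    if renewal_days_before_expiry == 0:
        return None
    # Guard added for this sketch; it is not a documented VKS validation rule.
    if not 0 < renewal_days_before_expiry < expiry_days:
        raise ValueError("renewal_days_before_expiry must be between 0 and expiry_days")
    return expiry_days - renewal_days_before_expiry


print(rotation_day())    # default setting: 83
print(rotation_day(14))  # example from the text: 76
print(rotation_day(0))   # feature disabled: None
```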