VKS Cluster Upgrade Fails with update cannot be initiated SystemChecksSucceeded condition is not True
Article ID: 433183

Products

Tanzu Kubernetes Runtime

VMware vSphere Kubernetes Service

Issue/Introduction

Unable to start a VKS cluster upgrade to a higher VKR version because of an error similar to the following:

update cannot be initiated as <affected VKS cluster>'s SystemChecksSucceeded condition is not True.

 

The error contains a Message with more details on the specific component blocking the VKS cluster upgrade.
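
As a rough sketch, the Message can also be read directly from the condition. This assumes the condition is reported under status.conditions of the affected cluster's Cluster object in its Supervisor namespace, which this article does not state explicitly. The function name system_checks_message and its two arguments are placeholders for illustration.

```shell
# Hypothetical helper: print the Message of the SystemChecksSucceeded condition.
# Assumption: run from the Supervisor context, and the condition is stored in
# .status.conditions of the Cluster object.
# $1 = affected VKS cluster name, $2 = its Supervisor namespace
system_checks_message() {
  kubectl get cluster "$1" -n "$2" \
    -o jsonpath='{.status.conditions[?(@.type=="SystemChecksSucceeded")].message}'
}

# Usage:
#   system_checks_message <affected VKS cluster> <supervisor namespace>
```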

Environment

vSphere Supervisor

VKS Cluster

VKS 3.5 and higher

Cause

VKS 3.5 and 3.6 introduce system pre-checks to detect misconfigured Kubernetes components that are known to cause cluster upgrades to become stuck.

Previously, these misconfigurations could cause VKS cluster upgrades to stall or fail without a clear indication of the root cause.

When the system pre-checks detect one of these issues, the SystemChecksSucceeded condition is reported as not True and includes a Message with further details.

These system pre-checks include the following known issues:

  • PodDisruptionBudget (PDB) - introduced in VKS 3.5
    • If a PodDisruptionBudget (PDB) that covers a pod has zero "Allowed Disruptions", the node hosting that pod cannot be drained. The PDB is configured so that a certain number of replicas of the pod must never be down at the same time, so the node remains in Ready,SchedulingDisabled status while stuck Deleting.


  • Third Party Webhooks - introduced in VKS 3.6
    • When a webhook installed in the affected cluster requires pods to be checked against the third party application's webhook service before they can be created, it can prevent the container network interface pod (antrea or calico) from starting on the affected node and cause a rolling redeployment to become stuck.

      The following third party webhooks are known to cause issues in VKS clusters:
      • Rancher
      • Gatekeeper
      • k8tz
      • Kyverno
      • Dynatrace
      • Linkerd
      • opa-gatekeeper

Resolution

Follow the steps below that correspond to the Message in the detailed error.

If there are any concerns regarding these steps, reach out to VMware by Broadcom Technical Support.

 

PodDisruptionBudget (PDB)

Message: PodDisruptionBudgets blocking rollouts

There are one or more PodDisruptionBudgets (PDBs) in the VKS cluster with an Allowed Disruption value of 0.

These objects monitor the count of pods for an application and can be configured to ensure that a specific number of pods are Running at all times. However, this can cause VKS cluster upgrades and rolling redeployments to become stuck, with a node left Deleting in Ready,SchedulingDisabled state, because the PDB prevents the pod on that node from being evicted and the node cannot drain.
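
To illustrate how a PDB ends up with zero allowed disruptions, the sketch below reproduces the arithmetic for a PDB that sets minAvailable as an absolute number: the allowed disruptions are the healthy pod count minus minAvailable, never below zero. The function name allowed_disruptions and the sample numbers are illustrative only. Kubernetes computes the real value itself, and percentage values and maxUnavailable are not covered here.

```shell
# Illustrative only: allowed disruptions for a PDB that sets minAvailable.
# $1 = currently healthy pods, $2 = minAvailable (absolute number)
allowed_disruptions() {
  allowed=$(( $1 - $2 ))
  if [ "$allowed" -lt 0 ]; then allowed=0; fi
  echo "$allowed"
}

allowed_disruptions 2 2   # 2 healthy pods, minAvailable 2: prints 0, no pod may be evicted
allowed_disruptions 3 2   # 3 healthy pods, minAvailable 2: prints 1, one pod may be evicted
```

In the first case the node hosting either pod cannot be drained, which is the situation this pre-check flags.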

  1. Connect to the affected VKS cluster's context.

  2. Run the following command to check the status of all PDBs in the VKS cluster:
    kubectl get pdb -A

     

  3. For any PDBs with an Allowed Disruption value of 0, work with the application owner to adjust the PDB so that it tolerates disruptions.
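
On clusters with many PDBs, the output of step 2 can be filtered down to the blocking entries. The following is a minimal sketch that assumes the default table layout of kubectl get pdb -A, in which ALLOWED DISRUPTIONS is the fifth field of each data row. The function name blocking_pdbs is hypothetical.

```shell
# Hypothetical helper: reads `kubectl get pdb -A` output on stdin and prints
# namespace/name for every PDB whose ALLOWED DISRUPTIONS column is 0.
# Assumes the default columns:
#   NAMESPACE  NAME  MIN AVAILABLE  MAX UNAVAILABLE  ALLOWED DISRUPTIONS  AGE
blocking_pdbs() {
  # NR > 1 skips the header row; $5 is the ALLOWED DISRUPTIONS value.
  awk 'NR > 1 && $5 == "0" { print $1 "/" $2 }'
}

# Usage (current context is the affected VKS cluster):
#   kubectl get pdb -A | blocking_pdbs
```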

 

Third Party Webhook

  • Message: MisconfiguredSoftwareChecks failed: [<third party webhook>]

    Where the value in brackets is one of the following third party webhooks:

    • validate.kyverno.svc-fail
    • rancher.cattle.io
    • admission-controller.k8tz.io
    • v1beta1.dynakube.webhook.dynatrace.com
    • v1beta2.dynakube.webhook.dynatrace.com
    • v1beta3.dynakube.webhook.dynatrace.com
    • linkerd-sp-validator.linkerd.io
    • vopentelemetrycollectorcreateupdatebeta.kb.io
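
The names above are webhook names, which are listed inside a webhook configuration object and may differ from the name of that object. As a sketch, the helper below finds which configuration contains a given webhook name. The function name find_webhook_config and the tab-separated input format are assumptions made for this example.

```shell
# Hypothetical helper: reads lines of "<configuration name><TAB><webhook names>"
# on stdin and prints each configuration that contains the webhook named in $1.
find_webhook_config() {
  awk -F '\t' -v want="$1" '{
    n = split($2, hooks, " ")
    for (i = 1; i <= n; i++)
      if (hooks[i] == want) { print $1; break }
  }'
}

# Usage (repeat with mutatingwebhookconfiguration):
#   kubectl get validatingwebhookconfiguration \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.webhooks[*].name}{"\n"}{end}' \
#     | find_webhook_config validate.kyverno.svc-fail
```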

NOTE: VMware by Broadcom is not responsible for and does not provide support for third party applications.
Any issues with webhooks installed by a third party application should be escalated to the third party application owner.

The following steps detail how to back up and temporarily delete the third party webhooks in the affected VKS cluster.

  1. Connect to the affected VKS cluster's context.

  2. Run the following command to view all webhook configurations in the VKS cluster:
    kubectl get validatingwebhookconfiguration,mutatingwebhookconfiguration -A

     

  3. If you are using vOpen Telemetry Collector webhook configuration, reach out to VMware by Broadcom Technical Support referencing this KB article.

  4. If the flagged third party webhook configuration is already set up so that it cannot block or fail the VKS upgrade, reach out to VMware by Broadcom Technical Support referencing this KB article.

  5. Any other third party webhook configurations must be backed up and temporarily deleted.
    • NOTE: VMware system webhooks for antrea/calico and standard packages (PKGI) such as cert-manager should not be touched.


  6. Example commands to take a backup of a third party webhook configuration, where values in angle brackets <> should be replaced as per your environment:
    kubectl get validatingwebhookconfiguration <third party validating webhook configuration> -o yaml > <vwc-backup>.yaml
    
    kubectl get mutatingwebhookconfiguration <third party mutating webhook configuration> -o yaml > <mwc-backup>.yaml

     

  7. IMPORTANT: Ensure that the backups are saved outside of the VKS cluster's nodes.


  8. Run the following commands to temporarily delete only the third party webhooks:
    kubectl delete validatingwebhookconfiguration <third party validating webhook configuration>
    
    kubectl delete mutatingwebhookconfiguration <third party mutating webhook configuration>

     

  9. After the VKS cluster upgrade or rolling redeployment completes on all nodes, restore the third party webhooks from the backups with the following commands:
    kubectl apply -f <vwc-backup>.yaml
    
    kubectl apply -f <mwc-backup>.yaml
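
The backup and delete steps above can be combined so that a webhook configuration is only deleted once its backup has been written. The following is a sketch, not a VMware-provided tool. The function name backup_and_delete_webhook, the backup file naming, and the guard list are assumptions for this example. The guard covers only the antrea webhooks named under Additional Information and is not an exhaustive list of system webhooks.

```shell
# Hypothetical helper, not a VMware-provided tool.
# $1 = validatingwebhookconfiguration or mutatingwebhookconfiguration
# $2 = name of the third party webhook configuration
# $3 = existing directory for backups, stored outside the VKS cluster's nodes
backup_and_delete_webhook() {
  kind="$1"; name="$2"; backup_file="$3/$1-$2.yaml"

  # Refuse the antrea system webhooks named in this article (not exhaustive).
  case "$name" in
    crdvalidator.antrea.io|crdmutator.antrea.io)
      echo "refusing to touch system webhook: $name" >&2
      return 1 ;;
  esac

  # Back up first; stop if the export fails or produces an empty file.
  kubectl get "$kind" "$name" -o yaml > "$backup_file" || return 1
  [ -s "$backup_file" ] || return 1

  # Delete only after the backup exists.
  kubectl delete "$kind" "$name"
}

# Usage:
#   backup_and_delete_webhook validatingwebhookconfiguration <name> /path/to/backups
# Restore after the upgrade completes:
#   kubectl apply -f /path/to/backups/validatingwebhookconfiguration-<name>.yaml
```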

 

 

Additional Information

The system webhooks expected in the environment are those belonging to the CNI or to any installed packages (PKGI) in the workload cluster.

For example, the expected system antrea webhooks are:

  • crdvalidator.antrea.io
  • crdmutator.antrea.io

----

Release Notes: vSphere Kubernetes Service 3.6.0+v1.35