Upgrade of a Tanzu Kubernetes Grid Integrated Edition Cluster Fails on Apply-Addons

Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

Upgrade of a Tanzu Kubernetes Grid Integrated Edition Cluster Fails on Apply-Addons.
Running the tkgi cluster <ClusterName> command returns output similar to the following:
tkgi cluster cluster-name

PKS Version: 1.19.1-build.14
Name: cluster-name
K8s Version: 1.28.9
Plan Name: Plan-11
UUID: c8c42b09-###-40a6-###-0fad86#####
Last Action: UPGRADE
Last Action State: failed
Last Action Description: Instance update failed: There was a problem completing your request. Please contact your operations team providing the following information: service: p.pks, service-instance-guid: c8c42b09-###-40a6-###-0fad86#####, broker-request-id: a7f9####################5cf, task-id: 765##28, operation: update, error-message: 0 succeeded, 1 errored, 0 canceled
Kubernetes Master Host: cluster-name.domain.com
Kubernetes Master Port: 8443
Worker Nodes: 13
Kubernetes Master IP(s): 10.xx.xx.xx
Network Profile Name: np-cluster-name
Kubernetes Profile Name: k8s-profile-cluster-name
Compute Profile Name: comp-profile-cluster-name
NSX Policy: false
Tags:

Checking the BOSH upgrade task in debug mode, using the task-id returned by the previous command, shows an error message similar to the following:
# bosh task <task-id> --debug

Example:

bosh task 765##28 --debug

', "result_output" = '{"instance":{"group":"apply-addons","id":"00xxxcd6-xxx-xxx-8c7f-80ddxxx3830"},"errand_name":"apply-addons","exit_code":1,"stdout":"No need to change the CoreDNS replica because there are 3 linux worker nodes\nDeploying /var/vcap/jobs/apply-specs/specs/coredns.yml\nserviceaccount/coredns unchanged\nclusterrole.rbac.authorization.k8s.io/system:coredns unchanged\nclusterrolebinding.rbac.authorization.k8s.io/system:coredns unchanged\nconfigmap/coredns unchanged\ndeployment.apps/coredns configured\nservice/kube-dns unchanged\nfailed to start all system specs after 1200 with exit code 1\n","stderr":"Warning: spec.template.metadata.annotations[seccomp.security.alpha.kubernetes.io/pod]: non-functional in v1.27+; use the \"seccompProfile\" field instead\nerror: deployment \"coredns\" exceeded its progress deadline\n","logs":{"blobstore_id":"e4a09ce7-xxx-xxxx-xxxx-ce8753ed33af","sha1":"61cxxxxxxa9ef9e8

Checking the cluster shows that a CoreDNS pod is in Pending state:

kubectl get po -n kube-system
NAME                      READY   STATUS    RESTARTS   AGE
coredns-65ccfbf9b-d4gc6   1/1     Running   0          3d
coredns-65ccfbf9b-mgl4c   1/1     Running   0          3d6h
coredns-65ccfbf9b-rtx64   0/1     Pending   0          44s
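
To find the affected pod quickly in a busy kube-system namespace, the listing can be filtered. The following is a minimal sketch, not part of the product tooling: the function name not_running_coredns is hypothetical, and it assumes the default column layout of kubectl get po (NAME, READY, STATUS, ...).

```shell
# Print the names of coredns pods whose STATUS column is not "Running".
# Reads the plain-text output of `kubectl get po -n kube-system` on stdin.
# Assumption: default column order, with STATUS as the third column.
not_running_coredns() {
  awk 'NR > 1 && $1 ~ /^coredns-/ && $3 != "Running" { print $1 }'
}

# Live usage (requires kubectl access to the cluster):
#   kubectl get po -n kube-system | not_running_coredns
#   kubectl describe po <pod-name> -n kube-system
```

The Events section at the end of the kubectl describe output normally states why the scheduler could not place the pod.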

Environment

TKGi 1.19

TKGi 1.2x

Cause

A CoreDNS pod can be in Pending state for several reasons. The apply-addons errand verifies that all instances are available and running, and fails the task if they are not. Possible causes:

Several of the nodes are Windows nodes and there are not enough Linux nodes to host the required number of replicas.

An imbalance in the number of worker nodes (or a worker node deleted during the process because of an unresponsive agent).

A manual drain of a worker node. In a cluster with a small number of workers, a pod can remain in Pending state if there are not enough available nodes.
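
To judge whether the first cause applies, it can help to count the Linux nodes that are Ready and compare that number with the replicas the coredns deployment expects. The sketch below is illustrative only: the function name ready_linux_nodes is hypothetical, and it assumes the nodes carry the standard kubernetes.io/os label so that kubectl get nodes -L kubernetes.io/os prints the operating system as the last column.

```shell
# Count nodes that are Ready (and not cordoned) and report "linux" as their OS.
# Reads the output of `kubectl get nodes -L kubernetes.io/os` on stdin.
# Assumption: STATUS is the second column and the OS label is the last column.
ready_linux_nodes() {
  awk 'NR > 1 && $2 == "Ready" && $NF == "linux" { n++ } END { print n + 0 }'
}

# Live usage (requires kubectl access to the cluster):
#   kubectl get nodes -L kubernetes.io/os | ready_linux_nodes
#   kubectl get deployment coredns -n kube-system
```

If the count is lower than the desired replica number shown for the coredns deployment, the scheduler may not be able to place every replica.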

Resolution

Make sure the number of worker nodes reported by tkgi cluster <NAME> matches the number of worker VMs shown by bosh -d <SI> vms.
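
The comparison can be scripted roughly as follows. This is a sketch under stated assumptions, not an official procedure: both function names are hypothetical, the tkgi output is assumed to contain a "Worker Nodes:" line as in the example above, and worker instance names in the bosh output are assumed to start with "worker/" (adjust the pattern if the deployment names them differently).

```shell
# Read `tkgi cluster <NAME>` output on stdin and print the "Worker Nodes" value.
tkgi_worker_count() {
  awk '/^Worker Nodes:/ { print $NF }'
}

# Read `bosh -d <SI> vms` output on stdin and count worker instances whose
# process state is "running".
# Assumption: worker instance names start with "worker/".
bosh_running_worker_count() {
  awk '$1 ~ /^worker\// && $2 == "running" { n++ } END { print n + 0 }'
}

# Live usage (requires the tkgi and bosh CLIs to be logged in):
#   expected=$(tkgi cluster <NAME> | tkgi_worker_count)
#   actual=$(bosh -d <SI> vms | bosh_running_worker_count)
#   [ "$expected" -eq "$actual" ] && echo "worker counts match" \
#     || echo "mismatch: tkgi=$expected bosh=$actual"
```

A worker in a state other than running, for example one with an unresponsive agent, is deliberately excluded from the bosh count, so it shows up as a mismatch.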

Verify that no worker nodes are in SchedulingDisabled state.
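
Cordoned nodes show SchedulingDisabled in the STATUS column of kubectl get nodes. A small filter such as the one below lists them; the function name cordoned_nodes is hypothetical and the default column order of kubectl get nodes is assumed.

```shell
# Print the names of nodes whose STATUS includes "SchedulingDisabled".
# Reads the plain-text output of `kubectl get nodes` on stdin.
# Assumption: default column order, with STATUS as the second column.
cordoned_nodes() {
  awk 'NR > 1 && $2 ~ /SchedulingDisabled/ { print $1 }'
}

# Live usage (requires kubectl access to the cluster):
#   kubectl get nodes | cordoned_nodes
#   kubectl uncordon <node-name>
```

Uncordon a node only after confirming that it was not drained on purpose, for example for maintenance that is still in progress.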

Confirm that all CoreDNS pods are in Running state before retrying the upgrade.