In a vSphere Supervisor environment, a workload cluster upgrade becomes stuck while draining a node running the previous version because the pinniped-concierge pods fail to drain.
This issue occurs on a vSphere Supervisor cluster with VKS service 3.3 or higher.
While connected to the Supervisor cluster context, the following symptoms are observed:
kubectl get machines -n <workload cluster namespace>
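The output may resemble the following, with the previous VKR version's control plane machine stuck in a Deleting phase (placeholder values shown for illustration):
NAME CLUSTER NODENAME PROVIDERID PHASE AGE VERSION
<old control plane machine> <workload cluster> <old control plane node> vsphere://<id> Deleting ##h <previous VKR version>
<new control plane machine> <workload cluster> <new control plane node> vsphere://<id> Running ##h <desired VKR version>
<worker machine A> <workload cluster> <worker node A> vsphere://<id> Running ##h <previous VKR version>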
While connected to the affected workload cluster's context, the following symptoms are observed:
kubectl get nodes
NAME STATUS ROLES AGE VERSION
<worker node A> Ready <none> ##h <previous VKR version>
<old control plane node> Ready,SchedulingDisabled control-plane ##h <previous VKR version>
<new control plane node> Ready control-plane ##h <desired VKR version>
kubectl get pods -A -o wide | egrep -v "Run|Completed"
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE
pinniped-concierge pinniped-concierge-<id> 0/1 Pending # ##h <IP> <old node name>
pinniped-concierge pinniped-concierge-kube-cert-agent-<id> 0/1 ImagePullBackOff # ##h <IP> <old node name>
vSphere Supervisor
VKS 3.3 and higher
Worker nodes do not begin upgrading until all control plane nodes are on the desired VKR version and the previous VKR version control plane nodes have been cleaned up by the system.
Manually deleting nodes will not help the upgrade proceed.
Starting in VKS 3.3 and higher, the behavior of draining nodes in a workload cluster has changed.
If a node does not drain within the workload cluster's configured node drain time-out, the upgrade will not continue.
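The configured node drain timeout can be checked from the Supervisor cluster context; for example, assuming the control plane is managed by a Cluster API KubeadmControlPlane object (field path shown for the v1beta1 API and may differ in other releases):
kubectl get kubeadmcontrolplane -n <workload cluster namespace> -o jsonpath='{.items[*].spec.machineTemplate.nodeDrainTimeout}'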
On a workload cluster with an identity provider (IDP) enabled, the pinniped-concierge component can prevent a node from draining properly.
In this scenario, the pinniped-concierge-kube-cert-agent pod cannot be drained successfully and continues to restart on the draining node, which causes the upgrade to become stuck.
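To confirm that the pod blocking the drain is the kube-cert-agent, its labels can be inspected from the workload cluster context; the agent pod carries the kube-cert-agent.pinniped.dev: v3 label that the MachineDrainRule in the resolution below matches on:
kubectl get pods -n pinniped-concierge --show-labels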
Create a MachineDrainRule in the affected workload cluster's namespace. This must be performed in the Supervisor Cluster context.
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDrainRule
metadata:
  name: vks-pod-drain-skip-pinniped
spec:
  drain:
    behavior: Skip
  pods:
  - namespaceSelector:
      matchLabels:
        kubernetes.io/metadata.name: pinniped-concierge
    selector:
      matchLabels:
        kube-cert-agent.pinniped.dev: v3
kubectl apply -f <MachineDrainRule.yaml> -n <workload cluster namespace>
kubectl get machinedrainrule -n <workload cluster namespace>
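If the MachineDrainRule was created successfully, it will be listed in the output. Once the rule is in place, the pinniped-concierge kube-cert-agent pod is skipped during the drain, the previous VKR version's control plane node should finish draining, and the upgrade should proceed. Progress can be monitored from the Supervisor cluster context:
kubectl get machines -n <workload cluster namespace>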