VKS guest cluster upgrade fails with "cni plugin not initialized" when Windows gMSA Webhook is deployed

Article ID: 442165

Products

VMware vSphere Kubernetes Service

Issue/Introduction

  • The guest cluster contains Windows worker nodes and the Windows gMSA Webhook. See Configuring a Windows Node Pool to Use Group Managed Service Accounts
  • The upgrade fails on the new control plane node, which is stuck in NotReady status and is deleted by the MHC (MachineHealthCheck) after 12 minutes. This repeats in a loop.
  • Describing the machine resource of the new control plane node in the Supervisor cluster shows messages similar to the following:

    Container runtime network not ready: NetworkReady=false reason: NetworkPluginNotReady message: Network plugin returns error: cni plugin not initialized

  • After connecting to the new control plane node over SSH, the command 'crictl ps' shows only the 4 static pods running, with no CNI pod such as antrea-agent:

    etcd
    kube-scheduler
    kube-apiserver
    kube-controller-manager

  • The antrea-controller deployment in the guest cluster shows messages similar to the following in the output of the command:

    kubectl describe deployments -n kube-system antrea-controller

    "message": "Internal error occurred: failed calling webhook "admission-webhook.windows-gmsa.sigs.k8s.io": failed to call webhook: Post "https://windows-gmsa-webhook.windows-gmsa-webhook.svc:443/mutate?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "windows-gmsa-webhook-ca")"

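The checks above can be reproduced with a couple of small shell helpers. This is only a sketch: the function names are hypothetical, the machine name and namespace are placeholders, and it assumes the machine resource is the Cluster API Machine (machines.cluster.x-k8s.io), which the article does not state explicitly.

```shell
# Hypothetical helper: describe the Machine of the new control plane node.
# Run against the Supervisor cluster.
# $1 = machine name (placeholder), $2 = namespace (placeholder).
# Assumption: the resource is the Cluster API Machine kind.
describe_new_cp_machine() {
  kubectl describe machines.cluster.x-k8s.io "$1" -n "$2"
}

# Hypothetical helper: run on the new control plane node over SSH.
# Prints "present" if a running antrea-agent container is listed
# by 'crictl ps', otherwise prints "missing".
check_antrea_agent() {
  if crictl ps 2>/dev/null | grep -q antrea-agent; then
    echo "present"
  else
    echo "missing"
  fi
}
```

For example, run describe_new_cp_machine <machine-name> <namespace> against the Supervisor cluster and check_antrea_agent on the new control plane node. crictl may need root privileges on the node.
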
Environment

VMware vSphere Kubernetes Service

Cause

The kube-apiserver does not trust the gMSA webhook certificate, so its calls to the webhook fail. As a result, the antrea-agent pod cannot be scheduled to the new control plane node, and the CNI plugin on that node is never initialized.
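
Whether a failed webhook call blocks a request depends on the failurePolicy of each webhook entry: with Fail the request is rejected, with Ignore it is admitted. A minimal sketch for listing the current policies in the guest cluster is shown here. The function name is hypothetical, and the configuration name windows-gmsa-webhook is taken from the patch commands in the Resolution section.

```shell
# Hypothetical helper: print one line per webhook entry in the form
# "<webhook name><TAB><failurePolicy>" for both the mutating and the
# validating configuration named windows-gmsa-webhook.
show_failure_policies() {
  for kind in mutatingwebhookconfigurations validatingwebhookconfigurations; do
    kubectl get "$kind" windows-gmsa-webhook \
      -o jsonpath='{range .webhooks[*]}{.name}{"\t"}{.failurePolicy}{"\n"}{end}'
  done
}
```

Calling show_failure_policies before and after applying the workaround shows whether the policy of each entry has changed.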

Resolution

To work around the issue:

  1. Log in to the guest cluster. See Connect to a TKG Service Cluster as a vCenter Single Sign-On User with Kubectl
  2. Pause the gMSA app:

    kubectl get app -A | grep gmsa
    kubectl patch app <gmsa-app-name> -n <namespace> --type=merge -p '{"spec":{"paused":true}}'

  3. Patch the webhook failure policies to Ignore, so that the API server bypasses the TLS verification failure and admits critical system pods:

    kubectl patch mutatingwebhookconfigurations.admissionregistration.k8s.io windows-gmsa-webhook --type='json' -p='[{"op": "replace", "path": "/webhooks/0/failurePolicy", "value": "Ignore"}]'
    kubectl patch validatingwebhookconfigurations.admissionregistration.k8s.io windows-gmsa-webhook --type='json' -p='[{"op": "replace", "path": "/webhooks/0/failurePolicy", "value": "Ignore"}]'

  4. The antrea-agent pod will then be created on the new control plane node. To verify:

    kubectl get pods -A -o wide | grep <new-control-plane-node-name>

  5. After the guest cluster upgrade has completed, unpause the gMSA app:

    kubectl patch app <gmsa-app-name> -n <namespace> --type=merge -p '{"spec":{"paused":false}}'
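
The steps above can be supported by two small shell helpers, sketched here under stated assumptions. Both function names are hypothetical. The first builds a JSON patch covering several webhook entries, which is only relevant if a configuration contains more than one entry under webhooks (an assumption: the commands in step 3 patch index 0 only). The second reads the spec.paused field that steps 2 and 5 set.

```shell
# Hypothetical helper: print a JSON patch that sets failurePolicy to
# Ignore for webhook entries 0..N-1. $1 = number of entries (N).
# Returns non-zero if $1 is not a non-negative integer.
build_ignore_patch() {
  count="$1"
  case "$count" in
    ''|*[!0-9]*) return 1 ;;
  esac
  patch=""
  i=0
  while [ "$i" -lt "$count" ]; do
    if [ -n "$patch" ]; then
      patch="$patch,"
    fi
    patch="$patch{\"op\":\"replace\",\"path\":\"/webhooks/$i/failurePolicy\",\"value\":\"Ignore\"}"
    i=$((i + 1))
  done
  printf '[%s]\n' "$patch"
}

# Hypothetical helper: succeed (exit status 0) only if the app is paused.
# $1 = gMSA app name, $2 = namespace (both placeholders).
app_is_paused() {
  paused="$(kubectl get app "$1" -n "$2" -o jsonpath='{.spec.paused}')"
  [ "$paused" = "true" ]
}
```

For example, the output of build_ignore_patch 2 can be passed to the -p option of the kubectl patch commands in step 3 in place of the single-entry patch, and app_is_paused <gmsa-app-name> <namespace> can confirm the state of the app after step 2 and again after step 5.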