Multiple pods go into ImagePullBackOff state after SSPI certificate update

Article ID: 430162

Updated On:

Products

VMware vDefend Firewall
VMware vDefend Firewall with Advanced Threat Prevention

Issue/Introduction

If the SSPI Ingress certificate is updated more than once in quick succession, pods running on the control nodes go into the ImagePullBackOff state and cannot come up.

Environment

SSP v5.1.0 and v5.1.1

Cause

If the SSPI Ingress certificate is updated in quick succession, the docker CA property in the Kubernetes control plane custom resource can remain stuck at its earlier value because of a race condition with the CAPI controller. As a result, the SSP control nodes can no longer pull images from SSPI, because SSPI presents a certificate that the control nodes do not trust. The symptom can be observed in the pod status:

kube-system         antrea-agent-pn5v6                                                0/2     Init:ImagePullBackOff   0             5m3s
kube-system         antrea-agent-vw8ql                                                0/2     Init:ImagePullBackOff   0             2m41s
kube-system         kube-vip-9o5e5dsm-controller-7bb5k                                0/1     ImagePullBackOff        0             2m39s
kube-system         kube-vip-9o5e5dsm-controller-f6pgf                                0/1     ImagePullBackOff        0             5m3s
kube-system         vsphere-cpi-8868l                                                 0/1     ImagePullBackOff        0             5m3s
kube-system         vsphere-cpi-hj246                                                 0/1     ImagePullBackOff        0             2m41s
vmware-system-csi   vsphere-csi-node-g9dtj                                            0/3     ImagePullBackOff        0             2m41s
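When many pods are listed, it can help to filter the output down to the ones that are failing on image pulls. The following is a minimal Python sketch, not part of the product, that assumes the listing has been saved as text in the NAMESPACE / NAME / READY / STATUS column layout shown above. The function name `find_image_pull_failures` and the `coredns-example` row are hypothetical and exist only for illustration.

```python
# Sample text in the same column layout as the listing above.
# The "coredns-example" row is a made-up healthy pod used for contrast.
SAMPLE = """\
kube-system         antrea-agent-pn5v6                    0/2     Init:ImagePullBackOff   0             5m3s
kube-system         kube-vip-9o5e5dsm-controller-7bb5k    0/1     ImagePullBackOff        0             2m39s
kube-system         coredns-example                       1/1     Running                 0             10m
vmware-system-csi   vsphere-csi-node-g9dtj                0/3     ImagePullBackOff        0             2m41s
"""


def find_image_pull_failures(listing):
    """Return (namespace, pod, status) for pods whose status shows an image pull failure."""
    failures = []
    for line in listing.splitlines():
        fields = line.split()
        # Skip blank lines, short lines, and an optional header row.
        if len(fields) < 4 or fields[0] == "NAMESPACE":
            continue
        namespace, pod, _ready, status = fields[:4]
        # Covers both "ImagePullBackOff" and "Init:ImagePullBackOff".
        if "ImagePullBackOff" in status or "ErrImagePull" in status:
            failures.append((namespace, pod, status))
    return failures


if __name__ == "__main__":
    for namespace, pod, status in find_image_pull_failures(SAMPLE):
        print(f"{namespace}/{pod}: {status}")
```

If most of the failing pods sit in kube-system and vmware-system-csi on the control nodes, as in the listing above, that matches the symptom described in this article.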

Pod events show that the node cannot trust the certificate presented by the docker registry:

vmware-system-csi   5m24s       Warning   Failed                             pod/vsphere-csi-node-xgbr5                                                                         Failed to pull image "sspifqdn.com.internal/registry/install/sig-storage/csi-node-driver-registrar:v2.10.1": unable to pull image or OCI artifact: pull image err: initializing source docker://sspifqdn.com.internal/registry/install/sig-storage/csi-node-driver-registrar:v2.10.1: pinging container registry sspifqdn.com.internal: Get "https://sspifqdn.com.internal/v2/": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "sspifqdn.com.internal"); artifact err: get manifest: build image source: pinging container registry sspifqdn.com.internal: Get "https://sspifqdn.com.internal/v2/": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "sspifqdn.com.internal")
vmware-system-csi   5m24s       Warning   Failed                             pod/vsphere-csi-node-xgbr5                                                                         Error: ErrImagePull
vmware-system-csi   5m24s       Normal    Pulling                            pod/vsphere-csi-node-xgbr5                                                                         Pulling image "sspifqdn.com.internal/registry/install/csi-vsphere/driver:v3.3.1"

On SSPI, /var/log/secop/secop.log shows that updating the kubeadmcontrolplane object failed with an optimistic concurrency error:

2026-02-12T03:05:14.594Z        INFO    secopapi/secopapi.go:2040       Request received to UpdateCertificate
2026-02-12T03:05:16.370Z        INFO    certificateservice/service.go:252       Updating K8S cluster's docker CA, /config/clusterctl/1/9o5e5dsm.kubeconfig
2026-02-12T03:05:16.533Z        ERROR   certificateservice/service.go:291       Failed to update kubeadmcontrolplanes   {"name": "9o5e5dsm-controller", "namespace": "9o5e5dsm", "error": "Operation cannot be fulfilled on kubeadmcontrolplanes.controlplane.cluster.x-k8s.io \"9o5e5dsm-controller\": the object has been modified; please apply your changes to the latest version and try again"}
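The "object has been modified" message is the standard Kubernetes response when an update carries a stale resource version. The following Python sketch illustrates that mechanism in general terms, using an in-memory toy store rather than a real cluster. It is not SSPI's implementation, and the names `Store`, `ConflictError`, `update_with_retry`, and the `dockerCA` field are all hypothetical. It shows a write being rejected because another writer changed the object between the read and the write, and how re-reading the latest version lets the change go through.

```python
class ConflictError(Exception):
    """Raised when an update carries a stale version (toy stand-in for HTTP 409)."""


class Store:
    """Toy single-object store that mimics resourceVersion checking."""

    def __init__(self, spec):
        self.spec = dict(spec)
        self.version = 1

    def get(self):
        # Return a copy of the object together with the version it was read at.
        return dict(self.spec), self.version

    def update(self, spec, version):
        # Reject the write if the object changed since it was read.
        if version != self.version:
            raise ConflictError("the object has been modified")
        self.spec = dict(spec)
        self.version += 1


def update_with_retry(store, mutate, attempts=5, before_write=None):
    """Re-read the latest object and re-apply `mutate` whenever a conflict occurs."""
    for attempt in range(attempts):
        spec, version = store.get()
        mutate(spec)
        if before_write is not None:
            before_write(attempt)  # hook used below to simulate a concurrent writer
        try:
            store.update(spec, version)
            return attempt + 1  # number of attempts used
        except ConflictError:
            continue
    raise ConflictError(f"gave up after {attempts} attempts")


if __name__ == "__main__":
    store = Store({"dockerCA": "old-ca", "replicas": 3})

    def other_controller(attempt):
        # Another writer touches the object between our read and write, once.
        if attempt == 0:
            spec, version = store.get()
            store.update(spec, version)

    used = update_with_retry(
        store, lambda spec: spec.update(dockerCA="new-ca"), before_write=other_controller
    )
    print(used, store.spec["dockerCA"])  # prints "2 new-ca"
```

In the toy run, the first write is rejected and the second succeeds. In the failure described here, the rejected update leaves the docker CA at its earlier value, which is consistent with the stale-CA symptom above.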

Resolution

When the system falls into this state, perform the SSPI certificate replacement workflow again: generate a new CSR, have it signed by a trusted CA, and upload the new certificate. This forces the docker CA to be refreshed with the latest CA on all control nodes.

Additional Information

Please refer to: manage-certificates