Policy Recommendations, Backup Jobs, and Intelligence Flows Fail When CSI Controller Is in CrashLoopBackOff

Products

VMware vDefend Firewall with Advanced Threat Prevention
VMware vDefend Firewall

Issue/Introduction

Symptom 1: Policy Recommendations are stuck in Queued for Discovery

Symptom 2: Backup and Restore job fails with the error: "failed to start operation, please contact administrator"

Symptom 3: Intelligence Security Flows are not visible in the SSP UI

When you run a new recommendation from the SSP UI, it remains stuck in the Queued for Discovery state. When you run a Backup and Restore job, it fails with the error quoted above.
The infraclassifier or other job pods are stuck in the Pending state, and the vsphere-csi-controller pod is in CrashLoopBackOff.
Log in to the SSPI CLI as root (SSP 5.0) or sysadmin (SSP 5.1) and run the following command:

k get pods -A | egrep -vi "runn|compl"
NAMESPACE NAME READY STATUS RESTARTS AGE
nsxi-platform anomalydetectionstreamingjob-3bf6e796f3c9518d-exec-1 0/1 Pending 0 6d3h
nsxi-platform infraclassifier-8646eb9646a0565e-exec-1 0/1 Pending 0 39d
nsxi-platform infraclassifier-8646eb9646a0565e-exec-2 0/1 Pending 0 39d
nsxi-platform infraclassifier-8646eb9646a0565e-exec-3 0/1 Pending 0 39d
nsxi-platform infraclassifier-8646eb9646a0565e-exec-4 0/1 Pending 0 39d
vmware-system-csi vsphere-csi-controller-75f8894c79-sdhck 6/7 CrashLoopBackOff 14105 (4m33s ago) 54d
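The listing above can be reduced to just the namespace, pod name, and status of each unhealthy pod. The following is a minimal sketch, not part of the product: `unhealthy_pods` is a hypothetical helper name, and it assumes that `k` is an alias for `kubectl` on the SSPI CLI and that the columns appear in the order shown above.

```shell
# Hypothetical helper: reads `kubectl get pods -A` output on stdin and prints
# namespace, pod name, and status for every pod whose STATUS column is
# neither Running nor Completed. The header row is skipped.
unhealthy_pods() {
  awk '$1 != "NAMESPACE" && $4 != "Running" && $4 != "Completed" { print $1, $2, $4 }'
}

# Usage on the SSPI CLI (assumes `k` is an alias for kubectl):
#   k get pods -A | unhealthy_pods
```

Matching on the STATUS column instead of grepping the whole line avoids false matches when a pod name happens to contain "runn" or "compl".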
Describing the infraclassifier pod shows the following error:

k describe pod infraclassifier-8646eb9646a0565e-exec-1 -n nsxi-platform

Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 105s (x11490 over 39d) default-scheduler 0/8 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/8 nodes are available: 8 Preemption is not helpful for scheduling.
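This happens because the vsphere-csi-controller pod is crash looping, so the PersistentVolumeClaims that these pods depend on cannot be provisioned and bound.
Describing the vsphere-csi-controller pod shows that its vsphere-csi-controller container is in an error state: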

k describe pod vsphere-csi-controller-75f8894c79-sdhck -n vmware-system-csi

--- Output truncated ---

vsphere-csi-controller:
Container ID: cri-o://89a90cba83dd8aa271edda13c9367fcf4ffe8402889c7377db127d90f31caf21
Image: sspi.example.org/registry/1.6.3/cloud-provider-vsphere/csi/release/driver:v3.1.0
Image ID: sspi.example.org/registry/1.6.3/cloud-provider-vsphere/csi/release/driver@sha256:af8887fde54bb0b8c44e821597cbb3b8087e4451b09bb2861d7ac67c66808775
Ports: 9808/TCP, 2112/TCP
Host Ports: 0/TCP, 0/TCP
Args:
--fss-name=internal-feature-states.csi.vsphere.vmware.com
--fss-namespace=$(CSI_NAMESPACE)
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Tue, 27 May 2025 20:15:21 +0000
Finished: Tue, 27 May 2025 20:15:21 +0000
Ready: False
Restart Count: 14106
Liveness: http-get http://:healthz/healthz delay=30s timeout=10s period=180s #success=1 #failure=3
...

Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulled 55m (x14097 over 54d) kubelet Container image "sspi.example.org/registry/1.6.3/cloud-provider-vsphere/csi/release/driver:v3.1.0" already present on machine
Warning BackOff 37s (x339265 over 50d) kubelet Back-off restarting failed container vsphere-csi-controller in pod vsphere-csi-controller-75f8894c79-sdhck_vmware-system-csi(e369c968-57e9-4651-83fb-f4dbc8fc0e43)
The vsphere-csi-controller container logs show the following error:

k logs vsphere-csi-controller-75f8894c79-sdhck -n vmware-system-csi -c vsphere-csi-controller

{"level":"error","time":"2025-05-27T20:10:18.974743694Z","caller":"vsphere/virtualcenter.go:672","msg":"failed to connect to VirtualCenter host: \"vc01.example.org\". Err: Post \"https://vc01.example.org:443/sdk\": host \"vc01.example.org:443\" thumbprint does not match \"02:0E:FF:10:2F:0F:9A:A8:AA:77:D9:D0:27:2F:FA:EE:0A:67:24:6D\"","TraceId":"f2027a3e-d083-406e-ae50-49faaccbb6dc","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/common/cns-lib/vsphere.GetVirtualCenterInstanceForVCenterConfig\n\t/build/pkg/common/cns-lib/vsphere/virtualcenter.go:672\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).Init\n\t/build/pkg/csi/service/vanilla/controller.go:236\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service.(*vsphereCSIDriver).BeforeServe\n\t/build/pkg/csi/service/driver.go:188\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service.(*vsphereCSIDriver).Run\n\t/build/pkg/csi/service/driver.go:202\nmain.main\n\t/build/cmd/vsphere-csi/main.go:96\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"}
In the SSPI UI, under Instance Management > vCenter Parameters, the following error is displayed:

Failed to establish a connection to vCenter vc01.example.org. Unable to connect to the vCenter. Please check the network or click EDIT CONNECTION to update the vCenter settings.
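
To confirm that the vCenter certificate has changed, one option is to read the SHA-1 fingerprint that vCenter currently presents and compare it with the thumbprint quoted in the controller log. The following is a sketch under stated assumptions: `extract_fingerprint` and `compare_thumbprints` are hypothetical helper names, `openssl` is assumed to be available on a host that can reach vCenter on port 443, and the thumbprint in the log is assumed to be the SHA-1 value the driver has on record (its 20-byte length is consistent with SHA-1).

```shell
# Hypothetical helper: extracts the colon-separated hex fingerprint from the
# output of `openssl x509 -noout -fingerprint -sha1` and upper-cases the hex digits.
extract_fingerprint() {
  sed -n 's/^.*[Ff]ingerprint=//p' | tr 'a-f' 'A-F'
}

# Hypothetical helper: prints "match" or "mismatch" for two thumbprint strings.
compare_thumbprints() {
  if [ "$1" = "$2" ]; then echo "match"; else echo "mismatch"; fi
}

# Usage (vc01.example.org and the thumbprint are the example values from the log):
#   current="$(openssl s_client -connect vc01.example.org:443 </dev/null 2>/dev/null \
#     | openssl x509 -noout -fingerprint -sha1 | extract_fingerprint)"
#   compare_thumbprints "$current" "02:0E:FF:10:2F:0F:9A:A8:AA:77:D9:D0:27:2F:FA:EE:0A:67:24:6D"
```

A "mismatch" result is consistent with the cause described in this article; a "match" result suggests the connection failure has a different origin.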

Environment

SSP 5.0, SSP 5.1

Cause

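This issue occurs when the certificate on vCenter has been changed or replaced. As the controller log shows, the thumbprint that the CSI controller has on record no longer matches the certificate presented by vCenter, so the connection fails.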

Resolution

Reconnect vCenter in the SSPI UI.
In the SSPI UI, go to Instance Management > vCenter Parameters > Edit Connection, enter the credentials and the certificate, and connect.
Once SSPI is connected to vCenter, the failing pods begin to recover.
When all pods are up, recommendation jobs and backup operations succeed.

Additional Information

Reconnecting vCenter with the updated certificate triggers recreation of the workload cluster nodes so that they pick up the new vCenter certificate.
Wait until all nodes have been recreated and are in the Ready state.
Pods restart during node recreation, so the SSP UI is temporarily unavailable during this time.
You can monitor node and pod status with the following commands from the SSP Installer CLI:

k get nodes -o wide

k get pods -A
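
Instead of re-running the node command by hand, the check can be polled. The following is a minimal sketch: `not_ready_nodes` is a hypothetical helper name, `k` is assumed to be an alias for `kubectl`, and the 30-second interval is an arbitrary choice.

```shell
# Hypothetical helper: reads `kubectl get nodes` output on stdin and prints
# the number of nodes whose STATUS column is not exactly "Ready"
# (so NotReady and Ready,SchedulingDisabled are both counted).
not_ready_nodes() {
  awk '$1 != "NAME" && $2 != "Ready" { n++ } END { print n + 0 }'
}

# Usage on the SSP Installer CLI (assumes `k` is an alias for kubectl):
#   until [ "$(k get nodes | not_ready_nodes)" -eq 0 ]; do sleep 30; done
```

A count of zero at a single moment does not by itself prove that every node has been recreated, so also check the AGE column of `k get nodes -o wide` before treating the recovery as complete.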