Policy Recommendations, Backup Jobs, and Intelligence Flows Fail When CSI Controller Is in CrashLoopBackOff

Article ID: 399170

Updated On:

Products

VMware vDefend Firewall with Advanced Threat Prevention
VMware vDefend Firewall

Issue/Introduction

Symptom - 1: Policy Recommendations remain stuck in the Queued for Discovery state

Symptom - 2: Backup and Restore jobs fail with the error: "failed to start operation, please contact administrator"

Symptom - 3: Intelligence Security Flows are not visible in the SSP UI
  • A new recommendation started from the SSP UI stays in the Queued for Discovery state, or a Backup and Restore job fails with the error quoted above.
  • The infraclassifier job or other jobs are stuck in the Pending state, and the vsphere-csi-controller pod is in CrashLoopBackOff.
  • Log in to the SSPI CLI as root (SSP 5.0) or sysadmin (SSP 5.1) and run the following command:

    k get pods -A | egrep -vi "runn|compl"
    NAMESPACE           NAME                                                              READY   STATUS             RESTARTS            AGE
    nsxi-platform       anomalydetectionstreamingjob-3bf6e796f3c9518d-exec-1              0/1     Pending            0                   6d3h
    nsxi-platform       infraclassifier-8646eb9646a0565e-exec-1                           0/1     Pending            0                   39d
    nsxi-platform       infraclassifier-8646eb9646a0565e-exec-2                           0/1     Pending            0                   39d
    nsxi-platform       infraclassifier-8646eb9646a0565e-exec-3                           0/1     Pending            0                   39d
    nsxi-platform       infraclassifier-8646eb9646a0565e-exec-4                           0/1     Pending            0                   39d
    vmware-system-csi   vsphere-csi-controller-75f8894c79-sdhck                           6/7     CrashLoopBackOff   14105 (4m33s ago)   54d

  • Describing an infraclassifier pod shows the following error:

    k describe pod infraclassifier-8646eb9646a0565e-exec-1 -n nsxi-platform

    Events:
      Type     Reason            Age                     From               Message
      ----     ------            ----                    ----               -------
      Warning  FailedScheduling  105s (x11490 over 39d)  default-scheduler  0/8 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/8 nodes are available: 8 Preemption is not helpful for scheduling.

  • The pods cannot be scheduled because the vsphere-csi-controller pod is crash-looping, so the PersistentVolumeClaims they depend on are never bound.
  • In the vsphere-csi-controller pod, the vsphere-csi-controller container is in an error state:

    k describe pod vsphere-csi-controller-75f8894c79-sdhck -n vmware-system-csi

    --- Output truncated ---

     vsphere-csi-controller:
        Container ID:  cri-o://89a90cba83dd8aa271edda13c9367fcf4ffe8402889c7377db127d90f31caf21
        Image:         sspi.example.org/registry/1.6.3/cloud-provider-vsphere/csi/release/driver:v3.1.0
        Image ID:      sspi.example.org/registry/1.6.3/cloud-provider-vsphere/csi/release/driver@sha256:af8887fde54bb0b8c44e821597cbb3b8087e4451b09bb2861d7ac67c66808775
        Ports:         9808/TCP, 2112/TCP
        Host Ports:    0/TCP, 0/TCP
        Args:
          --fss-name=internal-feature-states.csi.vsphere.vmware.com
          --fss-namespace=$(CSI_NAMESPACE)
        State:          Waiting
          Reason:       CrashLoopBackOff
        Last State:     Terminated
          Reason:       Error
          Exit Code:    1
          Started:      Tue, 27 May 2025 20:15:21 +0000
          Finished:     Tue, 27 May 2025 20:15:21 +0000
        Ready:          False
        Restart Count:  14106
        Liveness:       http-get http://:healthz/healthz delay=30s timeout=10s period=180s #success=1 #failure=3
    ...

    Events:
      Type     Reason   Age                     From     Message
      ----     ------   ----                    ----     -------
      Normal   Pulled   55m (x14097 over 54d)   kubelet  Container image "sspi-prod.example.org/registry/1.6.3/cloud-provider-vsphere/csi/release/driver:v3.1.0" already present on machine
      Warning  BackOff  37s (x339265 over 50d)  kubelet  Back-off restarting failed container vsphere-csi-controller in pod vsphere-csi-controller-75f8894c79-sdhck_vmware-system-csi(e369c968-57e9-4651-83fb-f4dbc8fc0e43)

  • The vsphere-csi-controller container logs show the following error:

    k logs vsphere-csi-controller-75f8894c79-sdhck -n vmware-system-csi -c vsphere-csi-controller

    {"level":"error","time":"2025-05-27T20:10:18.974743694Z","caller":"vsphere/virtualcenter.go:672","msg":"failed to connect to VirtualCenter host: \"vc01.example.org\". Err: Post \"https://vc01.example.org:443/sdk\": host \"vc01.example.org:443\" thumbprint does not match \"02:0E:FF:10:2F:0F:9A:A8:AA:77:D9:D0:27:2F:FA:EE:0A:67:24:6D\"","TraceId":"f2027a3e-d083-406e-ae50-49faaccbb6dc","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/common/cns-lib/vsphere.GetVirtualCenterInstanceForVCenterConfig\n\t/build/pkg/common/cns-lib/vsphere/virtualcenter.go:672\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).Init\n\t/build/pkg/csi/service/vanilla/controller.go:236\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service.(*vsphereCSIDriver).BeforeServe\n\t/build/pkg/csi/service/driver.go:188\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service.(*vsphereCSIDriver).Run\n\t/build/pkg/csi/service/driver.go:202\nmain.main\n\t/build/cmd/vsphere-csi/main.go:96\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"}

  • In the SSPI UI, under Instance Management > vCenter Parameters, the following error is displayed:

    Failed to establish a connection to vCenter vc01.example.org. Unable to connect to the vCenter. Please check the network or click EDIT CONNECTION to update the vCenter settings.
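
The thumbprint in the controller log is 20 bytes long, which is consistent with a SHA-1 certificate fingerprint. To confirm the mismatch before reconnecting, the fingerprint that vCenter currently serves can be compared with the value from the log. The following is a minimal sketch, assuming openssl is available on the host where it is run and that vCenter is reachable on port 443. The function names are hypothetical and are not part of SSP.

```shell
# Print the SHA-1 fingerprint of the certificate a host currently serves.
# Assumption: openssl is installed and the host is reachable on the given port.
live_thumbprint() {
  host="$1"
  port="${2:-443}"
  echo | openssl s_client -connect "${host}:${port}" -servername "${host}" 2>/dev/null \
    | openssl x509 -noout -fingerprint -sha1 \
    | cut -d= -f2
}

# Strip colons and spaces and upper-case the hex digits so that two
# thumbprints can be compared regardless of how they were formatted.
normalize_thumbprint() {
  printf '%s' "$1" | tr -d ': ' | tr '[:lower:]' '[:upper:]'
}

# Return success (0) only when both thumbprints are identical after normalizing.
thumbprints_match() {
  [ "$(normalize_thumbprint "$1")" = "$(normalize_thumbprint "$2")" ]
}

# Usage (vc01.example.org and the thumbprint are the values from the log above):
#   stored='02:0E:FF:10:2F:0F:9A:A8:AA:77:D9:D0:27:2F:FA:EE:0A:67:24:6D'
#   if thumbprints_match "$stored" "$(live_thumbprint vc01.example.org)"; then
#     echo "thumbprint unchanged"
#   else
#     echo "thumbprint changed; reconnect vCenter"
#   fi
```

If the two values differ, the vCenter certificate has changed since SSP last connected, which matches the cause described in this article.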

Environment

SSP 5.0, SSP 5.1

Cause

  • The certificate on vCenter was changed or replaced, so the thumbprint stored by SSP no longer matches the one that vCenter presents.

Resolution

  • Reconnect vCenter from the SSPI UI.
  • In the SSPI UI, go to Instance Management > vCenter Parameters > Edit Connection, enter the credentials and certificate, and connect.
  • Once SSPI is connected to vCenter, the failing pods begin to recover.
  • When all pods are up, recommendation jobs and backup operations succeed.
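
Because the Pending pods were blocked on unbound PersistentVolumeClaims, one way to check recovery is to list any claims that are still not Bound. The following is a minimal sketch, assuming that `k` in this article is an alias for kubectl. The helper name is hypothetical.

```shell
# Read `kubectl get pvc -A` output on stdin and keep the header line plus
# every claim whose STATUS column (third column with -A) is not "Bound".
unbound_pvcs() {
  awk 'NR == 1 || $3 != "Bound"'
}

# Usage on the SSPI CLI:
#   kubectl get pvc -A | unbound_pvcs
```

When only the header line is printed, every claim is bound and the previously Pending pods should be schedulable.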

Additional Information

  • Reconnecting vCenter with the updated certificate triggers recreation of the workload cluster nodes so that they pick up the new vCenter certificate.
  • Wait until all nodes have been recreated and report Ready.
  • Pods restart during node recreation, so the SSP UI is temporarily unavailable during this time.
  • Monitor node and pod status with the following commands from the SSP Installer CLI:

    k get nodes -o wide

    k get pods -A
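
Rather than re-running these commands by hand, the node check can be polled. The following is a minimal sketch, assuming that `k` is an alias for kubectl. The function names and the 30-second default interval are illustrative choices, not SSP defaults.

```shell
# Read `kubectl get nodes --no-headers` output on stdin and print the name of
# every node whose STATUS is not exactly "Ready". A cordoned node
# ("Ready,SchedulingDisabled") is therefore reported as not ready.
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 }'
}

# Poll until every listed node reports Ready.
# $1 = seconds between polls (illustrative default: 30).
wait_for_nodes() {
  interval="${1:-30}"
  while :; do
    if nodes="$(kubectl get nodes --no-headers 2>/dev/null)" && [ -n "$nodes" ]; then
      pending="$(printf '%s\n' "$nodes" | not_ready_nodes)"
      if [ -z "$pending" ]; then
        echo "All nodes report Ready"
        return 0
      fi
      echo "Not Ready yet:" $pending
    else
      # The API server itself can be briefly unreachable while nodes are recreated.
      echo "Could not list nodes; retrying"
    fi
    sleep "$interval"
  done
}

# Usage on the SSP Installer CLI:
#   wait_for_nodes 30
```

A single all-Ready result does not prove that recreation has finished, because replacement nodes may not have been created yet. Also check the AGE column of `k get nodes -o wide` to confirm that the nodes are new.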