Policy Recommendations, Backup Jobs, and Intelligence Flows Fail When CSI Controller Is in CrashLoopBackOff

Article ID: 399170

Products

VMware vDefend Firewall with Advanced Threat Prevention
VMware vDefend Firewall

Issue/Introduction

Symptom 1: Policy Recommendations remain stuck in the "Queued for Discovery" state.

Symptom 2: Backup and Restore jobs fail with the error: "failed to start operation, please contact administrator"

Symptom 3: Intelligence Security Flows are not visible in the SSP UI.
  • When you run a new recommendation from the SSP UI, it stays in the "Queued for Discovery" state. When you run a Backup and Restore job, it fails with the error quoted above.
  • The infraclassifier pods or other job pods are stuck in the Pending state, and the vsphere-csi-controller pod is in CrashLoopBackOff.
  • To confirm, log in to the SSPI CLI as root (SSP 5.0) or sysadmin (SSP 5.1) and run the following command:

    k get pods -A | egrep -vi "runn|compl"
    NAMESPACE           NAME                                                              READY   STATUS             RESTARTS            AGE
    nsxi-platform       anomalydetectionstreamingjob-3bf6e796f3c9518d-exec-1              0/1     Pending            0                   6d3h
    nsxi-platform       infraclassifier-8646eb9646a0565e-exec-1                           0/1     Pending            0                   39d
    nsxi-platform       infraclassifier-8646eb9646a0565e-exec-2                           0/1     Pending            0                   39d
    nsxi-platform       infraclassifier-8646eb9646a0565e-exec-3                           0/1     Pending            0                   39d
    nsxi-platform       infraclassifier-8646eb9646a0565e-exec-4                           0/1     Pending            0                   39d
    vmware-system-csi   vsphere-csi-controller-75f8894c79-sdhck                           6/7     CrashLoopBackOff   14105 (4m33s ago)   54d

  • Describing an infraclassifier pod shows the following scheduling error:

    k describe pod infraclassifier-8646eb9646a0565e-exec-1 -n nsxi-platform

    Events:
      Type     Reason            Age                     From               Message
      ----     ------            ----                    ----               -------
      Warning  FailedScheduling  105s (x11490 over 39d)  default-scheduler  0/8 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/8 nodes are available: 8 Preemption is not helpful for scheduling.

  • The claims stay unbound because vsphere-csi-controller, which provisions the volumes for these claims, is crash-looping.
  • Describing the vsphere-csi-controller pod shows that its vsphere-csi-controller container is in an error state:

    k describe pod vsphere-csi-controller-75f8894c79-sdhck -n vmware-system-csi

    --- Output truncated ---

     vsphere-csi-controller:
        Container ID:  cri-o://89a90cba83dd8aa271edda13c9367fcf4ffe8402889c7377db127d90f31caf21
        Image:         sspi.example.org/registry/1.6.3/cloud-provider-vsphere/csi/release/driver:v3.1.0
        Image ID:      sspi.example.org/registry/1.6.3/cloud-provider-vsphere/csi/release/driver@sha256:af8887fde54bb0b8c44e821597cbb3b8087e4451b09bb2861d7ac67c66808775
        Ports:         9808/TCP, 2112/TCP
        Host Ports:    0/TCP, 0/TCP
        Args:
          --fss-name=internal-feature-states.csi.vsphere.vmware.com
          --fss-namespace=$(CSI_NAMESPACE)
        State:          Waiting
          Reason:       CrashLoopBackOff
        Last State:     Terminated
          Reason:       Error
          Exit Code:    1
          Started:      Tue, 27 May 2025 20:15:21 +0000
          Finished:     Tue, 27 May 2025 20:15:21 +0000
        Ready:          False
        Restart Count:  14106
        Liveness:       http-get http://:healthz/healthz delay=30s timeout=10s period=180s #success=1 #failure=3
    ...

    Events:
      Type     Reason   Age                     From     Message
      ----     ------   ----                    ----     -------
      Normal   Pulled   55m (x14097 over 54d)   kubelet  Container image "sspi-prod.example.org/registry/1.6.3/cloud-provider-vsphere/csi/release/driver:v3.1.0" already present on machine
      Warning  BackOff  37s (x339265 over 50d)  kubelet  Back-off restarting failed container vsphere-csi-controller in pod vsphere-csi-controller-75f8894c79-sdhck_vmware-system-csi(e369c968-57e9-4651-83fb-f4dbc8fc0e43)

  • The vsphere-csi-controller container logs show a thumbprint mismatch when connecting to vCenter:

    k logs vsphere-csi-controller-75f8894c79-sdhck -n vmware-system-csi -c vsphere-csi-controller

    {"level":"error","time":"2025-05-27T20:10:18.974743694Z","caller":"vsphere/virtualcenter.go:672","msg":"failed to connect to VirtualCenter host: \"vc01.example.org\". Err: Post \"https://vc01.example.org:443/sdk\": host \"vc01.example.org:443\" thumbprint does not match \"02:0E:FF:10:2F:0F:9A:A8:AA:77:D9:D0:27:2F:FA:EE:0A:67:24:6D\"","TraceId":"f2027a3e-d083-406e-ae50-49faaccbb6dc","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/common/cns-lib/vsphere.GetVirtualCenterInstanceForVCenterConfig\n\t/build/pkg/common/cns-lib/vsphere/virtualcenter.go:672\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).Init\n\t/build/pkg/csi/service/vanilla/controller.go:236\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service.(*vsphereCSIDriver).BeforeServe\n\t/build/pkg/csi/service/driver.go:188\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service.(*vsphereCSIDriver).Run\n\t/build/pkg/csi/service/driver.go:202\nmain.main\n\t/build/cmd/vsphere-csi/main.go:96\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"}

  • In the SSPI UI, under Instance Management > vCenter Parameters, the following error is displayed:

    Failed to establish a connection to vCenter vc01.example.org. Unable to connect to the vCenter. Please check the network or click EDIT CONNECTION to update the vCenter settings.
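
The FailedScheduling event shown earlier points at unbound PersistentVolumeClaims. As a rough sketch, the claims that are still waiting for a volume could be listed as follows. The helper name `unbound_pvcs` and the script name are hypothetical, and the script calls `kubectl` directly on the assumption that the `k` shorthand used in this article is an alias for it.

```shell
#!/usr/bin/env bash
# Sketch only: list PersistentVolumeClaims that are not yet Bound.

# Reads `kubectl get pvc --no-headers` output on stdin (columns: NAME STATUS ...)
# and prints the name and status of every claim whose STATUS is not "Bound".
unbound_pvcs() {
  awk 'NF >= 2 && $2 != "Bound" { print $1, $2 }'
}

# Query the cluster only when a namespace is passed as the first argument,
# for example: ./unbound-pvcs.sh nsxi-platform   (script name is hypothetical)
if [ "$#" -ge 1 ]; then
  kubectl get pvc -n "$1" --no-headers | unbound_pvcs
fi
```

An empty result would suggest that the claims are bound and that the Pending pods have a different cause.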

Environment

SSP 5.0, SSP 5.1

Cause

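  • This issue occurs when the vCenter certificate has been changed or replaced. The thumbprint that SSP holds no longer matches the certificate that vCenter presents, so vsphere-csi-controller cannot connect to vCenter and fails on startup.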
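
One way to confirm the cause is to compare the thumbprint quoted in the CSI controller log with the one vCenter presents now. The sketch below assumes that `openssl` is available on a machine that can reach vCenter on port 443, and that the logged thumbprint is a SHA-1 fingerprint (it is 20 bytes long, which is consistent with SHA-1). The function names and script name are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: compare the thumbprint vCenter presents now with the one in the CSI log.

# Strip an optional "sha1 Fingerprint=" style prefix and upper-case the hex digits.
normalize_fp() {
  printf '%s\n' "${1##*=}" | tr 'abcdef' 'ABCDEF'
}

# Print the SHA-1 fingerprint of the certificate served on <host>:443.
fetch_fp() {
  openssl s_client -connect "$1:443" -servername "$1" </dev/null 2>/dev/null \
    | openssl x509 -noout -fingerprint -sha1
}

# Usage: ./check-thumbprint.sh <vcenter-host> <thumbprint-from-log>
if [ "$#" -eq 2 ]; then
  current="$(normalize_fp "$(fetch_fp "$1")")"
  expected="$(normalize_fp "$2")"
  if [ "$current" = "$expected" ]; then
    echo "thumbprint matches"
  else
    echo "thumbprint differs: vCenter presents $current, log expects $expected"
  fi
fi
```

A differing thumbprint is consistent with a replaced vCenter certificate and with the resolution described below.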

Resolution

  • Reconnect vCenter from the SSPI UI.
  • In the SSPI UI, go to Instance Management > vCenter Parameters > Edit Connection, enter the vCenter credentials and the updated certificate, and connect. Detailed steps for the certificate are provided below.
  • Once SSPI is connected to vCenter, the failing pods begin to recover.
  • When all pods are up, recommendation jobs and backup operations succeed.

 

The certificate to upload depends on the certificate authority (CA) structure used to sign the vCenter certificate. Identify which case applies to your environment and follow the corresponding steps.

Case 1 — Root CA Certificate Only

Use this case when the vCenter certificate is signed directly by a root CA (no intermediate certificates in the chain).

  1. Download the vCenter root CA certificate by following the VMware documentation: Download and Install the vCenter Server Certificate
  2. In the SSP UI, navigate to the vCenter connection wizard.
  3. Upload the downloaded root CA certificate when prompted.
  4. Complete the reconnection wizard and verify the connection succeeds.

Case 2 — Root CA and Intermediate Certificate(s)

Use this case when the vCenter certificate is signed by an intermediate CA that chains up to a root CA. You must upload the full certificate chain in the correct order.

Step 1 — Export the Full Certificate Chain from the Browser

  1. Navigate to the vCenter URL in your browser (Chrome or Edge recommended).
  2. Click the lock icon in the address bar and open the certificate details.
  3. Navigate to the Certificate Path or Certification Path tab to view the full chain.
  4. Export the certificate chain. Select the option to export as Base64-encoded (.pem) certificate chain (not DER/binary format).

Step 2 — Verify Certificate Chain Order

Before uploading, confirm the exported file contains the certificates in the correct order. Open the file in a text editor and verify the sequence is:

  • Server certificate (first)
  • Intermediate CA certificate(s) (middle)
  • Root CA certificate (last)

Note: If the order is incorrect, rearrange the PEM blocks manually so that the chain reads from server to root (top to bottom). SSP requires this order to correctly validate the full trust chain.
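
As an alternative to reading the file in a text editor, the order check could be sketched with `openssl`. This assumes `openssl` is installed on the workstation holding the exported file; the function names and script name are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: inspect a PEM bundle before uploading it.

# Count the certificates in a PEM file by counting BEGIN markers.
count_certs() {
  grep -c -- '-----BEGIN CERTIFICATE-----' "$1"
}

# Print the subject and issuer of every certificate, in file order.
show_chain() {
  openssl crl2pkcs7 -nocrl -certfile "$1" | openssl pkcs7 -print_certs -noout
}

# Usage: ./inspect-chain.sh <chain.pem>
if [ "$#" -eq 1 ]; then
  echo "certificates in bundle: $(count_certs "$1")"
  show_chain "$1"
fi
```

In a correctly ordered bundle, the issuer of each certificate equals the subject of the certificate that follows it, and the last certificate (the root CA) has identical subject and issuer.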

Step 3 — Upload to SSP

  1. In the SSP UI, navigate to the vCenter connection wizard.
  2. Upload the full certificate chain file when prompted.
  3. Complete the reconnection wizard and verify the connection succeeds.

Additional Information

  • After you reconnect vCenter with the updated certificate, SSP recreates the workload cluster nodes so that they pick up the new vCenter certificate.
  • Wait until all nodes have been recreated and report Ready.
  • Pods restart during node recreation, so the SSP UI is temporarily unavailable during this time.
  • You can monitor node and pod status with the following commands from the SSP Installer CLI:

    k get nodes -o wide

    k get pods -A
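
Instead of re-running the commands by hand, the wait could be sketched as a small polling loop. The helper name `count_not_ready`, the script name, and the 30-second interval are assumptions, and the script calls `kubectl` directly on the assumption that `k` is an alias for it.

```shell
#!/usr/bin/env bash
# Sketch only: report how many nodes are not yet Ready.

# Reads `kubectl get nodes --no-headers` output on stdin (columns: NAME STATUS ...)
# and prints the number of nodes whose STATUS is anything other than exactly "Ready".
count_not_ready() {
  awk 'NF >= 2 && $2 != "Ready" { n++ } END { print n + 0 }'
}

# Usage: ./wait-nodes.sh watch   (polls until every listed node is Ready)
if [ "${1:-}" = "watch" ]; then
  while true; do
    if nodes="$(kubectl get nodes --no-headers 2>/dev/null)" && [ -n "$nodes" ]; then
      pending="$(printf '%s\n' "$nodes" | count_not_ready)"
      [ "$pending" -eq 0 ] && break
      echo "$pending node(s) not Ready yet"
    else
      echo "cluster API not reachable yet"
    fi
    sleep 30
  done
  echo "all listed nodes are Ready"
fi
```

Nodes that have not yet been recreated also report Ready, so a single "all Ready" result early in the process does not prove that recreation has finished; checking the AGE column of `k get nodes -o wide` helps to tell old nodes from new ones.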