Policy Recommendations stuck in Queued for discovery on SSP UI and vsphere-csi-controller pod in CrashLoopBackOff.

Article ID: 399170

Updated On:

Products

VMware vDefend Firewall with Advanced Threat Prevention

Issue/Introduction

  • When you run a new recommendation from the SSP UI, it remains stuck in the Queued for Discovery state.
  • The infraclassifier pods (or other job pods) are stuck in the Pending state, and the vsphere-csi-controller pod is in CrashLoopBackOff.
  • Log in to the SSPI CLI as root and run the following command:

    k get pods -A | egrep -vi "runn|compl"
    NAMESPACE           NAME                                                              READY   STATUS             RESTARTS            AGE
    nsxi-platform       anomalydetectionstreamingjob-3bf6e796f3c9518d-exec-1              0/1     Pending            0                   6d3h
    nsxi-platform       infraclassifier-8646eb9646a0565e-exec-1                           0/1     Pending            0                   39d
    nsxi-platform       infraclassifier-8646eb9646a0565e-exec-2                           0/1     Pending            0                   39d
    nsxi-platform       infraclassifier-8646eb9646a0565e-exec-3                           0/1     Pending            0                   39d
    nsxi-platform       infraclassifier-8646eb9646a0565e-exec-4                           0/1     Pending            0                   39d
    vmware-system-csi   vsphere-csi-controller-75f8894c79-sdhck                           6/7     CrashLoopBackOff   14105 (4m33s ago)   54d

  • Describing an infraclassifier pod shows the following error:

    k describe pod infraclassifier-8646eb9646a0565e-exec-1 -n nsxi-platform

    Events:
      Type     Reason            Age                     From               Message
      ----     ------            ----                    ----               -------
      Warning  FailedScheduling  105s (x11490 over 39d)  default-scheduler  0/8 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/8 nodes are available: 8 Preemption is not helpful for scheduling.

  • This happens because vsphere-csi-controller is in a crash loop, so the PersistentVolumeClaims cannot be bound.
  • The vsphere-csi-controller container in the vsphere-csi-controller pod is in an error state:

    k describe pod vsphere-csi-controller-75f8894c79-sdhck -n vmware-system-csi

    --- Output truncated ---

     vsphere-csi-controller:
        Container ID:  cri-o://89a90cba83dd8aa271edda13c9367fcf4ffe8402889c7377db127d90f31caf21
        Image:         sspi.example.org/registry/1.6.3/cloud-provider-vsphere/csi/release/driver:v3.1.0
        Image ID:      sspi.example.org/registry/1.6.3/cloud-provider-vsphere/csi/release/driver@sha256:af8887fde54bb0b8c44e821597cbb3b8087e4451b09bb2861d7ac67c66808775
        Ports:         9808/TCP, 2112/TCP
        Host Ports:    0/TCP, 0/TCP
        Args:
          --fss-name=internal-feature-states.csi.vsphere.vmware.com
          --fss-namespace=$(CSI_NAMESPACE)
        State:          Waiting
          Reason:       CrashLoopBackOff
        Last State:     Terminated
          Reason:       Error
          Exit Code:    1
          Started:      Tue, 27 May 2025 20:15:21 +0000
          Finished:     Tue, 27 May 2025 20:15:21 +0000
        Ready:          False
        Restart Count:  14106
        Liveness:       http-get http://:healthz/healthz delay=30s timeout=10s period=180s #success=1 #failure=3
    ...

    Events:
      Type     Reason   Age                     From     Message
      ----     ------   ----                    ----     -------
      Normal   Pulled   55m (x14097 over 54d)   kubelet  Container image "sspi-prod.example.org/registry/1.6.3/cloud-provider-vsphere/csi/release/driver:v3.1.0" already present on machine
      Warning  BackOff  37s (x339265 over 50d)  kubelet  Back-off restarting failed container vsphere-csi-controller in pod vsphere-csi-controller-75f8894c79-sdhck_vmware-system-csi(e369c968-57e9-4651-83fb-f4dbc8fc0e43)

  • The vsphere-csi-controller container logs show a thumbprint mismatch error:

    k logs vsphere-csi-controller-75f8894c79-sdhck -n vmware-system-csi -c vsphere-csi-controller

    {"level":"error","time":"2025-05-27T20:10:18.974743694Z","caller":"vsphere/virtualcenter.go:672","msg":"failed to connect to VirtualCenter host: \"vc01.example.org\". Err: Post \"https://vc01.example.org:443/sdk\": host \"vc01.example.org:443\" thumbprint does not match \"02:0E:FF:10:2F:0F:9A:A8:AA:77:D9:D0:27:2F:FA:EE:0A:67:24:6D\"","TraceId":"f2027a3e-d083-406e-ae50-49faaccbb6dc","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/common/cns-lib/vsphere.GetVirtualCenterInstanceForVCenterConfig\n\t/build/pkg/common/cns-lib/vsphere/virtualcenter.go:672\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/vanilla.(*controller).Init\n\t/build/pkg/csi/service/vanilla/controller.go:236\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service.(*vsphereCSIDriver).BeforeServe\n\t/build/pkg/csi/service/driver.go:188\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service.(*vsphereCSIDriver).Run\n\t/build/pkg/csi/service/driver.go:202\nmain.main\n\t/build/cmd/vsphere-csi/main.go:96\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"}

  • In the SSPI UI, under Instance Management > vCenter Parameters, the following error is displayed:

    Failed to establish a connection to vCenter vc01.example.org. Unable to connect to the vCenter. Please check the network or click EDIT CONNECTION to update the vCenter settings.
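
The thumbprint quoted in the log line above can be pulled out of the JSON output with standard text tools. The following is a minimal sketch, not a supported tool: the function name mismatched_thumbprints is hypothetical, and the pattern assumes the log message keeps the same wording as the sample shown in this article.

```shell
# Read vsphere-csi-controller log lines on stdin and print each thumbprint
# quoted in a "thumbprint does not match" error, de-duplicated.
# Assumption: the message format matches the sample log line in this article.
mismatched_thumbprints() {
  sed -n 's/.*thumbprint does not match \\"\([0-9A-Fa-f:]*\)\\".*/\1/p' | sort -u
}
```

For example, in an SSPI root session: k logs vsphere-csi-controller-75f8894c79-sdhck -n vmware-system-csi -c vsphere-csi-controller | mismatched_thumbprints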

Environment

SSP 5.0.0

Cause

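  • This issue occurs when the certificate on vCenter has been changed or replaced. The thumbprint stored for the vCenter connection no longer matches the certificate that vCenter presents, so vsphere-csi-controller cannot connect to vCenter.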
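
To confirm that the certificate changed, the thumbprint quoted in the vsphere-csi-controller log can be compared with the fingerprint that vCenter currently presents. The sketch below makes several assumptions: the openssl CLI is available on the machine where it runs, the thumbprint is a SHA-1 fingerprint (inferred from its 20-byte length), and vCenter listens on port 443 as shown in the log. The function names fetch_fp, normalize_fp and fp_matches are hypothetical helpers.

```shell
# Print the SHA-1 fingerprint of the certificate a host currently presents.
# Requires network access to the host and the openssl CLI.
fetch_fp() {
  openssl s_client -connect "$1:443" -servername "$1" </dev/null 2>/dev/null \
    | openssl x509 -noout -fingerprint -sha1
}

# Strip any "label=" prefix and uppercase the hex digits so that
# fingerprints from different sources compare equal.
normalize_fp() {
  printf '%s\n' "$1" | sed 's/^.*=//' | tr 'a-f' 'A-F'
}

# Return 0 when the two fingerprints are the same after normalization.
fp_matches() {
  [ "$(normalize_fp "$1")" = "$(normalize_fp "$2")" ]
}
```

A possible usage is fp_matches "02:0E:FF:10:2F:0F:9A:A8:AA:77:D9:D0:27:2F:FA:EE:0A:67:24:6D" "$(fetch_fp vc01.example.org)" followed by a check of the exit status. A non-zero status is consistent with the vCenter certificate having been replaced.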

Resolution

  • Reconnect vCenter from the SSPI UI.
  • Go to SSPI UI > Instance Management > vCenter Parameters > Edit Connection, enter the vCenter credentials and certificate, and connect.
  • After SSPI reconnects to vCenter, the failing pods begin to recover.
  • Once all pods are up, the recommendation job succeeds.
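
Recovery can be checked by listing the pods that are still not Running or Completed, in the same spirit as the egrep filter used earlier in this article. This is a minimal sketch: the function name unhealthy_pods is hypothetical, and it assumes the input is `k get pods -A` output with its header row, where STATUS is the fourth column.

```shell
# Read `k get pods -A` output on stdin (header row included) and print
# "namespace/name status" for every pod that is neither Running nor Completed.
unhealthy_pods() {
  awk 'NR > 1 && $4 != "Running" && $4 != "Completed" { print $1 "/" $2, $4 }'
}
```

In an SSPI root session, run k get pods -A | unhealthy_pods. Empty output suggests that all pods have recovered.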