SSP: druid-broker and druid-router pods fail to come up because the druid-s3-provisioning job is not found during upgrade from SSP 5.0 to SSP 5.1.0

Products

VMware vDefend Firewall with Advanced Threat Prevention
VMware vDefend Firewall

Issue/Introduction

During the SSP 5.0 to 5.1 upgrade, the following components failed to start:

druid-broker pods
druid-router pods

These pods remained in either CrashLoopBackOff or Init state.

Verification Steps

Log in to the SSPI node via SSH as the root user and run the following checks:

1. Verify the druid-broker and druid-router pods. Confirm that the pods are in CrashLoopBackOff or Init state.

k -n nsxi-platform get pods | grep -E 'druid-(broker|router)'

2. Describe an affected pod, using a pod name from the step 1 output.

k -n nsxi-platform describe pod <pod-name>

If the druid-s3-provisioning job is missing, the output shows the pod waiting for the job:

Start waiting for job druid-s3-provisioning
Error from server (NotFound): jobs.batch "druid-s3-provisioning" not found
Failed waiting for job druid-s3-provisioning

3. Verify job presence. In this scenario the druid-s3-provisioning job is not found in the nsxi-platform namespace, so the command returns no output.

k -n nsxi-platform get jobs | grep druid-s3-provisioning
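
The pod check in step 1 can be scripted if the namespace has many pods. The following is a minimal sketch, not part of the official procedure: unhealthy_druid_pods is a hypothetical helper name, and it assumes the default kubectl column order (NAME READY STATUS RESTARTS AGE).

```shell
# Hypothetical helper: reads `kubectl get pods` output on stdin and prints
# the name and status of druid-broker / druid-router pods whose STATUS
# column is anything other than "Running".
# Assumes the default column order: NAME READY STATUS RESTARTS AGE.
unhealthy_druid_pods() {
  awk '$1 ~ /^druid-(broker|router)/ && $3 != "Running" { print $1, $3 }'
}

# Example usage on the SSPI node (k is assumed to be the kubectl alias
# used elsewhere in this article):
#   k -n nsxi-platform get pods | unhealthy_druid_pods
```

An empty result means no druid-broker or druid-router pod is outside the Running state.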

Environment

SSP 5.0.0 to SSP 5.1.0 upgrade

Cause

The issue occurred because the nsxi-platform Helm release was stuck in a pending-upgrade state.
Due to this incomplete Helm upgrade, the required druid-s3-provisioning job was never created, causing dependent Druid components to fail during startup.

The Helm release status confirms the issue:

All other Helm releases show DEPLOYED
The nsxi-platform Helm release shows PENDING-UPGRADE

Verify the Helm release status by running the following command on the SSPI CLI as the root user:

root@sspi:~# helm list -A --all --kubeconfig /config/clusterctl/1/workload.kubeconfig
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /config/clusterctl/1/workload.kubeconfig
WARNING: Kubernetes configuration file is world-readable. This is insecure. Location: /config/clusterctl/1/workload.kubeconfig
NAME                            NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                                       APP VERSION
analytics-common                nsxi-platform   1               2025-08-12 20:37:16.804326309 +0000 UTC deployed        analytics-common-chart-5.0.0-0.0-24631124   5.0.0-0.0-24631124
cert-manager                    cert-manager    2               2025-12-04 19:31:18.121806507 +0000 UTC deployed        cert-manager-5.1.0-0.0-25009250             5.1.0-0.0-25009250
intelligence                    nsxi-platform   1               2025-08-12 20:40:27.487832032 +0000 UTC deployed        nsxi-chart-5.0.0-0.0-24631124               5.0.0-0.0-24631124
metrics                         nsxi-platform   1               2025-08-07 20:05:38.121123509 +0000 UTC deployed        metrics-5.0.0-0.0-24631132                  5.0.0-0.0-24631132
network-traffic-analysis        nsxi-platform   1               2025-08-12 20:38:27.051730083 +0000 UTC deployed        nta-chart-5.0.0-0.0-24631124                5.0.0-0.0-24631124
nsx-metadata                    nsxi-platform   1               2025-08-12 20:44:14.129110863 +0000 UTC deployed        nsx-metadata-service-5.0.0-0.0-24631138     5.0.0-0.0-24631138
nsx-ndr                         nsxi-platform   1               2025-08-12 20:47:14.647185859 +0000 UTC deployed        nsx-ndr-5.0.0-0.0-24631126                  5.0.0-0.0-24631126
nsxi-platform                   nsxi-platform   2               2025-12-04 19:50:50.210307664 +0000 UTC pending-upgrade napp-platform-advanced-5.1.0-0.0-25009250   5.1.0-0.0-25009250   >>> nsxi-platform Helm release stuck in pending-upgrade status
projectcontour                  projectcontour  2               2025-12-04 19:31:47.373871463 +0000 UTC deployed        contour-5.1.0-0.0-25009250                  5.1.0-0.0-25009250
ssp-antrea                      kube-system     1               2025-08-07 19:51:05.936213896 +0000 UTC deployed        antrea-2.1.0                                2.1.0
ssp-metallb                     metallb-system  1               2025-08-07 19:51:29.300222271 +0000 UTC deployed        metallb-5.0.3                               0.14.3
ssp-upgrade                     nsxi-platform   1               2025-12-04 19:21:41.8382117 +0000 UTC   deployed        napp-upgrade-5.1.0-0.0-25009250             5.1.0-0.0-25009250
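
To avoid scanning the table by eye, the helm list output can be filtered for releases that are not in the deployed state. This is a sketch only: non_deployed_releases is a hypothetical helper, and it assumes the UPDATED column always spans four whitespace-separated tokens (date, time, offset, zone) as in the output above, which makes STATUS the eighth field.

```shell
# Hypothetical helper: reads `helm list` table output on stdin and prints
# "<release> <status>" for every release whose STATUS is not "deployed".
# Assumption: UPDATED is four whitespace-separated tokens, so STATUS is $8.
# The header row and any kubeconfig WARNING lines are skipped.
non_deployed_releases() {
  awk '$1 == "NAME" || $1 == "WARNING:" { next }
       NF >= 8 && $8 != "deployed" { print $1, $8 }'
}

# Example usage on the SSPI node:
#   helm list -A --all --kubeconfig /config/clusterctl/1/workload.kubeconfig | non_deployed_releases
```

Run against the output shown above, the helper prints a single line: nsxi-platform pending-upgrade.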

Resolution

Follow the steps below to roll back the Helm release and reattempt the upgrade.

Step 1: Roll back the Helm release

Helm rollback will recreate the required jobs. Execute the following command from the SSPI CLI to roll back the nsxi-platform Helm release:

helm rollback nsxi-platform 2 -n nsxi-platform --kubeconfig /config/clusterctl/1/workload.kubeconfig

Step 2: Verify Helm status after rollback

After the rollback completes, verify the Helm status using the command below:

helm list -A --all --kubeconfig /config/clusterctl/1/workload.kubeconfig

Ensure that the nsxi-platform Helm release status is shown as DEPLOYED.
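
The status check can also be turned into a pass/fail gate before retriggering the upgrade. The sketch below is illustrative only: release_is_deployed is a hypothetical helper, and it assumes STATUS is the eighth whitespace-separated field of the helm list table, as in the output shown earlier in this article.

```shell
# Hypothetical helper: reads `helm list` table output on stdin and returns
# success (exit 0) only if the release named in $1 is listed with STATUS
# "deployed". Returns failure if the release is missing or in any other state.
# Assumption: STATUS is the eighth whitespace-separated field. If the same
# release name appears in several namespaces, the last match wins.
release_is_deployed() {
  awk -v name="$1" '$1 == name { found = 1; status = $8 }
       END { exit !(found && status == "deployed") }'
}

# Example usage on the SSPI node:
#   helm list -A --all --kubeconfig /config/clusterctl/1/workload.kubeconfig \
#     | release_is_deployed nsxi-platform && echo "nsxi-platform is deployed"
```

If the helper reports failure, the release is still not in the deployed state and the upgrade should not be retriggered yet.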

Step 3: Retrigger the upgrade

Once the Helm status is verified:

Navigate to the SSP UI
Retrigger the upgrade process

The upgrade should now proceed and complete successfully.

Additional Information

If the issue persists after performing the rollback and retriggering the upgrade, please contact Broadcom Support for further assistance.