If the relevant storage becomes unavailable, SSP downtime is expected. However, once storage and NSX services are restored, SSP is designed to fully recover without requiring operator intervention.
In certain cases, this automatic recovery does not occur as expected, resulting in the SSP UI becoming unavailable. The example screenshot below illustrates this UI failure, which prevents users from logging in or performing any actions within the SSP environment.
SSP CLI access remains available, but several pods are observed in a CrashLoopBackOff state, including the cluster-api pod.
Example commands for validation:
k get pods -A | grep -v "Com\|Run"
k -n nsxi-platform logs deploy/monitor
k -n nsxi-platform logs deploy/cluster-api -c cluster-api
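If further confirmation is needed, the following commands are a minimal sketch for narrowing down the failure (kubectl is assumed to be on the PATH; adjust names to your environment):

# Show only pods that are not Running or Completed, including their restart counts
kubectl get pods -A | grep -Ev "Running|Completed"

# Review recent events in the nsxi-platform namespace for back-off and probe failures
kubectl -n nsxi-platform get events --sort-by=.lastTimestamp | tail -n 30

# View logs from the previously crashed monitor container instance
kubectl -n nsxi-platform logs deploy/monitor --previous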
Example log snippets from the monitor and cluster-api pods:
Monitor Startup Failures
Monitor tried to call cluster-api during startup and received 500 errors:
ERROR: Got HTTP/1.1 500 Internal Server Error
Response from ClusterapiApi#getServiceFeatureHealth(), response body could not be read
This caused monitor's bean initialization to fail:
WARN: Exception encountered during context initialization - cancelling refresh attempt
Error creating bean with name 'monitorServiceImpl': Invocation of init method failed
nested exception is feign.RetryableException: Server Error
Monitor attempted retry cycles:
INFO: Get cluster-api client with base url https://cluster-api:443
ERROR: Got HTTP/1.1 500 Internal Server Error
Response from ClusterapiApi#getServiceFeatureHealth(), response body could not be read
Cluster-API Failures
On the cluster-api side, requests to the feature status endpoints timed out:
ERROR: failed to make request to get feature deployment status
{"error": "Get \"https://cluster-api/cluster-api/features/intelligence/status\": dial tcp x.x.x.x:443: i/o timeout"}
ERROR: failed to make request to get feature deployment status
{"error": "Get \"https://cluster-api/cluster-api/features/metrics/status\": dial tcp x.x.x.x:443: i/o timeout"}
ERROR: failed to make request to get feature deployment status
{"error": "Get \"https://cluster-api/cluster-api/features/ndr/status\": dial tcp x.x.x.x:443: i/o timeout"}
When storage and NSX services recover, SSP’s monitoring services expect the cluster-api pods to be fully available. However, due to timing differences during storage recovery, the cluster-api pods may take significantly longer to come up. Since the monitoring components do not validate cluster-api readiness before starting, the monitoring pods launch prematurely.
As a result, the monitoring pods repeatedly attempt to call cluster-api during initialization, receive errors, and enter a CrashLoopBackOff state. This creates a race condition in which neither component is able to recover cleanly, preventing the SSP environment from returning to a healthy state.
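Before applying the workaround, it can help to confirm that cluster-api has not yet become ready while monitor keeps restarting. A minimal sketch, assuming the deployment names monitor and cluster-api shown in the logs above:

# Check whether cluster-api reports any ready replicas
kubectl -n nsxi-platform get deployment cluster-api

# Optionally wait (up to 10 minutes) for cluster-api to become Available before restarting monitor
kubectl -n nsxi-platform wait --for=condition=Available deployment/cluster-api --timeout=600s

# Compare the restart counts of the monitor and cluster-api pods
kubectl -n nsxi-platform get pods | grep -E "monitor|cluster-api"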
Once it is identified that monitoring and cluster-api are stuck in a race condition, the issue can be resolved by performing a rolling restart of both the monitoring and cluster-api pods.
Restarting these components breaks the race condition and allows the SSP cluster UI to recover successfully.
Workaround
Log in to the SSPI node as the root user and run the following commands:
kubectl rollout restart deployment <monitoring-deployment> -n nsxi-platform
kubectl rollout status deployment <monitoring-deployment> -n nsxi-platform
kubectl rollout restart deployment <cluster-api-deployment> -n nsxi-platform
kubectl rollout status deployment <cluster-api-deployment> -n nsxi-platform
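For reference, the same workaround is sketched below using the deployment names that appear in the logs above (monitor and cluster-api); substitute the monitoring deployment names used in your environment if they differ:

kubectl -n nsxi-platform rollout restart deployment monitor
kubectl -n nsxi-platform rollout status deployment monitor --timeout=600s
kubectl -n nsxi-platform rollout restart deployment cluster-api
kubectl -n nsxi-platform rollout status deployment cluster-api --timeout=600s

# Confirm that no pods remain in a CrashLoopBackOff state
kubectl get pods -A | grep CrashLoopBackOff || echo "No pods in CrashLoopBackOff"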
A software improvement was added in SSP 5.1.x to check the startup, readiness, and liveness probes for the cluster-api service.
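After upgrading to SSP 5.1.x, the presence of these probes on the cluster-api deployment can be verified with kubectl; a minimal sketch (the jsonpath output is empty if a probe is not configured):

# Print the startup, readiness, and liveness probe definitions, if present
kubectl -n nsxi-platform get deployment cluster-api -o jsonpath='{.spec.template.spec.containers[*].startupProbe}'
kubectl -n nsxi-platform get deployment cluster-api -o jsonpath='{.spec.template.spec.containers[*].readinessProbe}'
kubectl -n nsxi-platform get deployment cluster-api -o jsonpath='{.spec.template.spec.containers[*].livenessProbe}'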