If the relevant storage becomes unavailable, SSP downtime is expected. However, once storage and NSX services are restored, SSP is designed to fully recover without requiring operator intervention.
In certain cases, this automatic recovery does not occur as expected, resulting in the SSP UI becoming unavailable. The example screenshot below illustrates this UI failure, which prevents users from logging in or performing any actions within the SSP environment.
SSP CLI access remains available, but several pods are observed in a CrashLoopBackOff state, including the cluster-api pod.
Example commands for validation:
k get pods -A | grep -v "Com\|Run"
k -n nsxi-platform logs deploy/monitor
k -n nsxi-platform logs deploy/cluster-api -c cluster-api
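If further confirmation is needed, the following commands are a minimal sketch for narrowing down the failure (kubectl is assumed to be on the PATH; adjust names to your environment):

# Show only pods that are not Running or Completed, including their restart counts
kubectl get pods -A | grep -Ev "Running|Completed"

# Review recent events in the nsxi-platform namespace for back-off and probe failures
kubectl -n nsxi-platform get events --sort-by=.lastTimestamp | tail -n 30

# View logs from the previously crashed monitor container instance
kubectl -n nsxi-platform logs deploy/monitor --previous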
Example log snippets from the monitor and cluster-api pods:
Monitor Startup Failures
Monitor tried to call cluster-api during startup and received 500 errors:
ERROR: Got HTTP/1.1 500 Internal Server Error
Response from ClusterapiApi#getServiceFeatureHealth(), response body could not be read
This caused monitor's bean initialization to fail:
WARN: Exception encountered during context initialization - cancelling refresh attempt
Error creating bean with name 'monitorServiceImpl': Invocation of init method failed
nested exception is feign.RetryableException: Server Error
Monitor attempted retry cycles:
INFO: Get cluster-api client with base url https://cluster-api:443
ERROR: Got HTTP/1.1 500 Internal Server Error
Response from ClusterapiApi#getServiceFeatureHealth(), response body could not be read
Cluster-API Failures
On the cluster-api side, requests to the feature status endpoints timed out:
ERROR: failed to make request to get feature deployment status
{"error": "Get \"https://cluster-api/cluster-api/features/intelligence/status\": dial tcp x.x.x.x:443: i/o timeout"}
ERROR: failed to make request to get feature deployment status
{"error": "Get \"https://cluster-api/cluster-api/features/metrics/status\": dial tcp x.x.x.x:443: i/o timeout"}
ERROR: failed to make request to get feature deployment status
{"error": "Get \"https://cluster-api/cluster-api/features/ndr/status\": dial tcp x.x.x.x:443: i/o timeout"}
When storage and NSX services recover, SSP’s monitoring services expect the cluster-api pods to be fully available. However, due to timing differences during storage recovery, the cluster-api pods may take significantly longer to come up. Since the monitoring components do not validate cluster-api readiness before starting, the monitoring pods launch prematurely.
As a result, the monitoring pods repeatedly attempt to call cluster-api during initialization, receive errors, and enter a CrashLoopBackOff state. This creates a race condition in which neither component is able to recover cleanly, preventing the SSP environment from returning to a healthy state.
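Before applying the workaround, it can help to confirm that cluster-api has not yet become ready while monitor keeps restarting. A minimal sketch, assuming the deployment names monitor and cluster-api shown in the logs above:

# Check whether cluster-api reports any ready replicas
kubectl -n nsxi-platform get deployment cluster-api

# Optionally wait (up to 10 minutes) for cluster-api to become Available before restarting monitor
kubectl -n nsxi-platform wait --for=condition=Available deployment/cluster-api --timeout=600s

# Compare the restart counts of the monitor and cluster-api pods
kubectl -n nsxi-platform get pods | grep -E "monitor|cluster-api"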
Once it is identified that monitoring and cluster-api are stuck in a race condition, the issue can be resolved by performing a rolling restart of both the monitoring and cluster-api pods.
Restarting these components breaks the race condition and allows the SSP cluster UI to recover successfully.
Workaround
Log in to the SSPI node as the root user and run the following commands:
kubectl rollout restart deployment <monitoring-deployment> -n nsxi-platform
kubectl rollout status deployment <monitoring-deployment> -n nsxi-platform
kubectl rollout restart deployment <cluster-api-deployment> -n nsxi-platform
kubectl rollout status deployment <cluster-api-deployment> -n nsxi-platform
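For reference, the same workaround is sketched below using the deployment names that appear in the logs above (monitor and cluster-api); substitute the monitoring deployment names used in your environment if they differ:

kubectl -n nsxi-platform rollout restart deployment monitor
kubectl -n nsxi-platform rollout status deployment monitor --timeout=600s
kubectl -n nsxi-platform rollout restart deployment cluster-api
kubectl -n nsxi-platform rollout status deployment cluster-api --timeout=600s

# Confirm that no pods remain in a CrashLoopBackOff state
kubectl get pods -A | grep CrashLoopBackOff || echo "No pods in CrashLoopBackOff"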
A software improvement was added in SSP 5.1.x to check the startup, readiness, and liveness probes for the cluster-api service.
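After upgrading to SSP 5.1.x, the presence of these probes on the cluster-api deployment can be verified with kubectl; a minimal sketch (the jsonpath output is empty if a probe is not configured):

# Print the startup, readiness, and liveness probe definitions, if present
kubectl -n nsxi-platform get deployment cluster-api -o jsonpath='{.spec.template.spec.containers[*].startupProbe}'
kubectl -n nsxi-platform get deployment cluster-api -o jsonpath='{.spec.template.spec.containers[*].readinessProbe}'
kubectl -n nsxi-platform get deployment cluster-api -o jsonpath='{.spec.template.spec.containers[*].livenessProbe}'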