VCFA UI Inaccessible and Service Instability Due to High Datastore Latency

Article ID: 421284

Products

VCF Automation

Issue/Introduction

The VCF Automation (VCFA) User Interface (UI) is inaccessible and displays "Unhealthy upstream" errors.

Impact

  • The VCFA UI is inaccessible due to upstream health failures.

  • Multiple VMSP and Prelude Kubernetes pods are in a crashed or unhealthy state.

  • Management operations (for example, power-off actions) fail from Fleet Management.

  • Connections to the Kubernetes API server on port 6443 fail.

Storage latency directly affects the stability of the etcd service, which is critical for Kubernetes cluster operations. The etcd logs show warnings similar to the following:

2025-12-09T12:04:35.97800310Z {"level":"warn","ts":"2025-12-09T12:04:35.972024Z","caller":"ecodeservev1/v0_server.go:430","msg":"Waiting for RealIndex response took too long, retrying","ment-request-id":"127e7615e656e6d1001&, "retry-timeout":"500ms"}
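To gauge how often etcd reports this warning, the matching lines can be counted in the etcd container log. The following is a minimal sketch, not a supported tool: the function name count_slow_etcd_warnings is hypothetical, and the match string "took too long" is taken from the sample entry above.

```shell
# Hypothetical helper: count log lines that report a slow etcd response.
# Reads log lines from stdin and prints the number of matching lines.
count_slow_etcd_warnings() {
  # grep -c prints 0 and exits non-zero when nothing matches;
  # "|| true" keeps the function's exit status at 0 in that case.
  grep -c 'took too long' || true
}

# Usage (assumption: the log is piped in from crictl, stderr included):
#   crictl logs <CONTAINER_ID> 2>&1 | count_slow_etcd_warnings
```

A count that keeps growing while the datastore is under load is consistent with the storage latency described in this article.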

Because etcd is unstable, connections to the Kubernetes API server (port 6443) fail, and management actions (for example, power off) cannot be performed from the Fleet Management UI. These actions fail with "No route to host" errors.

The Fleet Management logs show the following failure for the power-off task:

ERROR vrlcm[1298] [pool-3-thread-68] [c.v.v.l.u.SessionHolder]  -- SessionHolder.newSession Exception encountered
com.jcraft.jsch.JSchException: java.net.NoRouteToHostException: No route to host (Host unreachable)
ERROR vrlcm[1298] [pool-3-thread-68] [c.v.v.l.v.p.t.FetchInfraDetailsFromVMSPClusterTask]  -- Failed to fetch VMSP config details from VMSP cluster
INFO vrlcm[1298] [pool-3-thread-68] [c.v.v.l.p.a.s.Task]  -- Injecting task failure event. Error Code : 'LCMVMSP10019', Retry : 'true', Causing Properties : '{ CAUSE :: primaryVip === vmwareSystemUserPassword YXYXYXYX  }'
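The port 6443 connectivity failure listed under Impact can be confirmed with a basic TCP check from a machine that normally reaches the VCFA nodes. This is a sketch under stated assumptions: the function name check_api_port is hypothetical, and it assumes bash (for the /dev/tcp redirection) and the coreutils timeout command are available on the machine where it runs.

```shell
# Hypothetical helper: exit 0 only if a TCP connection to HOST:PORT opens
# within 3 seconds. Assumes bash (/dev/tcp) and coreutils "timeout".
check_api_port() {
  host="$1"
  port="${2:-6443}"   # 6443 is the Kubernetes API server port named in this article
  # The inner bash receives host as $0 and port as $1.
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Usage (replace <VCFA_NODE_IP> with the node or VIP address):
#   check_api_port <VCFA_NODE_IP> && echo "reachable" || echo "unreachable"
```

The check only shows whether the port accepts a connection. It does not show whether the API server behind it is healthy.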

This issue occurs because extreme storage latency compromises the stability of critical internal services, particularly etcd. The result is timeouts, crashes in Kubernetes pods, and ultimately an inaccessible VCFA UI.

To confirm the storage latency on the ESXi host, see the article "Using esxtop to identify storage performance issues for ESXi".

Environment

VCF Automation 9.0

Cause

The issue was caused by extreme storage latency on the datastore hosting the VCFA nodes.

Resolution

To resolve the issue, address the underlying storage performance problem.

  • Migrate the affected VCFA nodes from the problematic datastore to stable datastores within the same cluster.

Additional Information

Because the Kubernetes API server is unstable, kubectl does not function while this issue is present. Use the container runtime CLI, crictl, directly on the node to review container logs instead.

To list containers and review their logs using crictl, use the following commands:

1. List all containers

crictl ps

By default, this command lists only running containers. Use crictl ps -a to also include exited containers, which helps when pods have crashed. From the output, note the CONTAINER ID of the container you want to inspect.

2. Retrieve logs for a specific container

crictl logs <CONTAINER_ID>

Replace <CONTAINER_ID> with the ID obtained from the previous command to view the container’s logs.
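The two steps above can be combined into one small shell function. This is a minimal sketch rather than a supported tool: the function name tail_container_logs is hypothetical, and etcd in the usage line is only an example of a container name that may or may not match on a given node. It uses the crictl ps options -a, --name, and -q, and the crictl logs option --tail.

```shell
# Hypothetical helper: print the last N log lines of every container whose
# name matches a pattern, including exited containers (-a).
tail_container_logs() {
  name="$1"
  lines="${2:-50}"   # default to the last 50 lines
  # -q prints only container IDs, one per line.
  for id in $(crictl ps -a --name "$name" -q); do
    echo "=== container $id ==="
    # Merge stderr into stdout so both log streams are shown together.
    crictl logs --tail "$lines" "$id" 2>&1
  done
}

# Usage (example name only): tail_container_logs etcd 100
```

Limiting the output with --tail keeps the review manageable when a crashing container has produced a large log.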

For more information, see the Kubernetes documentation: https://kubernetes.io/docs/tasks/debug/debug-cluster/crictl/