ESXi is not able to connect with VASA Provider due to TCP Failure

Article ID: 412434

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

  • When a VASA Provider experiences an unexpected restart due to infrastructure events such as power failures, system upgrades, or operational maintenance, all connected ESXi hosts must simultaneously re-establish their connections with the VASA Provider.
  • Another scenario that can trigger a large number of concurrent `setContext` calls is a certificate rotation event or a network infrastructure reset; here too the VASA Provider may become overwhelmed by the sudden influx of parallel `setContext` requests from multiple ESXi hosts.
  • In such cases, you would notice the below or similar setContext failures in the vvold log (typically /var/run/log/vvold.log); a sample command for locating these entries follows the log excerpts.

2025-08-08T12:17:14.610Z Er(163) VVold[2255518]: [Originator@6876 sub=Default OpId=Session] VasaSession::DoSetContext: setContext for VP XXX (url: https://XXX:8443/vasa/services/vasaService?version=4) failed [connectionState: TransportError]: TRANSPORT_FAULT (Timeout / SSL_connect() failed in tcp_connect())
2025-08-08T17:37:07.849Z Er(163) VVold[2317367]: [Originator@6876 sub=Default OpId=Session] VasaSession::DoSetContext: setContext for VP XXX (url: https://XXXX/axis2/services/vasa2) failed [connectionState: Connected]: STORAGE_FAULT (ADB conversion error. Details: Invalid Fault Id / )
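
To check how often a host is hitting this condition, the vvold log can be searched for failed setContext calls. This is a minimal sketch, assuming the default log location of /var/run/log/vvold.log:

   # Count setContext failures reported by vvold (default log path assumed)
   grep -c "DoSetContext.*failed" /var/run/log/vvold.log

   # Show the most recent failures with their fault details
   grep "DoSetContext" /var/run/log/vvold.log | grep "failed" | tail -5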

Environment

  • VMware vSphere ESXi 7.x
  • VMware vSphere ESXi 8.x

Cause

  • The root cause stems from inadequate admission control mechanisms within the VASA Provider architecture.
  • When the provider lacks proper request throttling and queuing mechanisms, it accepts connection requests beyond its processing capacity.
  • This results in resource exhaustion and system unresponsiveness, where the provider cannot effectively manage the concurrent load generated by multiple ESXi hosts attempting simultaneous reconnection.

Resolution

There is no code fix for this issue.

The following workaround can be applied:

  • In environments where the VASA Provider lacks built-in admission control capabilities, manual intervention is required to implement controlled connection management.
  • The recommended approach involves temporarily stopping the VVOL daemon on all ESXi hosts and implementing a phased restart strategy to prevent connection storms; a scripted sketch of this procedure follows the implementation steps below.
  • Implementation Steps:

1. Stop VVOL daemon on all ESXi hosts:
   /etc/init.d/vvold stop

2. Implement phased restart strategy:

    • Restart VVOL daemons in small batches (recommended: 5-10 hosts per batch)
    • Allow 2-3 minutes between each batch to ensure stable connections
    • Monitor VASA Provider response times before proceeding to next batch

3. Start VVOL daemon on each batch:
   /etc/init.d/vvold start
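
The steps above can also be scripted from a management workstation. The following is a minimal sketch only, assuming passwordless SSH as root to each ESXi host and a plain-text host list named hosts.txt (a hypothetical file name); the batch size and pause interval reflect the guidance in step 2:

   #!/bin/sh
   # Sketch: phased vvold restart driven from a management host.
   # Assumes SSH key access as root and hosts.txt (hypothetical name, one ESXi host per line).
   BATCH_SIZE=5        # hosts started per batch (5-10 recommended)
   PAUSE_SECONDS=180   # 2-3 minute pause between batches

   # Step 1: stop vvold on every host first
   while read -r host; do
     ssh -n "root@${host}" "/etc/init.d/vvold stop"
   done < hosts.txt

   # Steps 2-3: start vvold in small batches, pausing between batches
   # (monitor VASA Provider response times before letting the next batch proceed)
   count=0
   while read -r host; do
     ssh -n "root@${host}" "/etc/init.d/vvold start"
     count=$((count + 1))
     if [ $((count % BATCH_SIZE)) -eq 0 ]; then
       sleep "${PAUSE_SECONDS}"
     fi
   done < hosts.txt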

Verification:

  • Confirm that setContext operations complete successfully.
  • Monitor for transport faults or timeout errors (see the example commands below).
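
For example, the provider connection status can be checked on each host with esxcli, and the vvold log can be re-checked for fresh faults; the log path below assumes the default location:

   # List registered VASA Providers and their connection status on this host
   esxcli storage vvol vasaprovider list

   # Look for new transport faults or timeouts after the restart (default log path assumed)
   grep -E "TRANSPORT_FAULT|Timeout" /var/run/log/vvold.log | tail -20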

Additional Information

This connection storm phenomenon has been observed with specific VASA Provider implementations that lack sophisticated load balancing and admission control features.