vCenter vSphere 8.x Backup Failures (vim.fault.NotAuthenticated) due to Envoy Gateway Token Expiration
search cancel

vCenter vSphere 8.x Backup Failures (vim.fault.NotAuthenticated) due to Envoy Gateway Token Expiration

book

Article ID: 440879

calendar_today

Updated On:

Products

VMware vCenter Server

Issue/Introduction

Users may observe widespread virtual machine backup failures across multiple ESXi clusters managed by a single vCenter Server Appliance (VCSA). Common indicators include:

  • Backup software (e.g., Rubrik) successfully initiates snapshots, but data transfer fails during the Network File Copy (NFC) phase.
  • Backup logs report errors such as Failed to refresh vCenter.
  • Concurrent API connection timeouts for management tools (e.g., BladeLogic) and data transfer utilities (WinSCP/SSH).
  • Review the vCenter Server /var/log/vmware/vpxd/vpxd.log for HTTP 500 errors generated by the internal Envoy Host Gateway pipeline. Look for the following signature:

    info vpxd[4085937] [Originator@6876 sub=vmomi.soapStub[774949] opID=7ac8f6ad] SOAP request returned HTTP failure; <<io_obj p:0x00007f740d5e8b48, h:250, <UNIX ''>, <UNIX '/var/run/envoy-hgw/hgw-pipe'>>, /hgw/host-150117/sdk>, method: fileManagement; code: 500(Internal Server Error); fault: (vim.fault.NotAuthenticated) {
    -->    faultCause = (vmodl.MethodFault) null,
    -->    faultMessage = <unset>,
    -->    object = 'vim.Datastore:62aef28d-fe0e61a8-3da4-9efa524000e2',
    -->    privilegeId = "Datastore.FileManagement"
    -->    msg = "Received SOAP response fault from [...]: fileManagement. The session is not authenticated."
    --> }

Environment

vCenter Server 8.x

Cause

The issue is caused by management plane resource exhaustion and session token cache desynchronization within the vpxd and envoy-hgw services.

In vSphere 8.x, third-party backup proxies using VADP authenticate via vCenter, which proxies data requests through the Envoy Host Gateway. A massive "API storm" (caused by concurrent heavy automation scripts and high-volume backup streams) can exceed session limits. When snapshots are bulk-requested, session tokens may sit idle in the proxy queue beyond the default 5-minute (300-second) timeout. Once these tokens expire, the management plane rejects the connections, potentially leading to a self-inflicted denial-of-service that hangs SSH/WinSCP sessions.

Resolution

To resolve this issue, perform a full restart of the vCenter server services by running the following command:

service-control --stop --all && service-control --start --all

Additional Information

To resolve this without a full vCenter reboot, implement the following optimizations:

1. Stagger Operational Windows

Ensure that high-volume automation tasks (e.g., BladeLogic) and backup schedules do not run concurrently. Separate these windows to reduce the peak API transaction load.

2. Configure Client-Side Throttling

  • Automation Tools: Modify scripts to use batching (processing 20 to 50 VMs at a time) and enforce single-session token reuse instead of spawning unique logins per VM.
  • Backup Platforms: Enable features like Rubrik's Adaptive Backup and enforce datastore latency thresholds (e.g., 20ms to 25ms) to pace snapshot allocations.

3. Targeted Service Remediation

If the NotAuthenticated state reoccurs, recover the proxy pipeline by restarting the specific services via SSH:

service-control --restart vmware-vpxd vmware-envoy-hgw

4. Proactive Alerting

Configure an alert in Aria Operations for Logs (formerly Log Insight) to monitor for proxy failure thresholds:

  • Query: Search for text matching The session is not authenticated and vim.NfcService.fileManagement within the vpxd application logs.
  • Trigger: Set a threshold of $>10$ occurrences within a 5-minute interval.