Requests Through Envoy Gateway Fail with SDS Errors: "upstream connect error or disconnect/reset before headers"

Products

VMware Cloud Foundation

Issue/Introduction

This article describes how to remediate an Envoy Gateway proxy that has entered a state where its SDS (Secret Discovery Service) configuration is invalid, causing all requests routed through the affected gateway to fail with TLS errors. The remediation restarts the Envoy Gateway control plane via the node agent API (port 5480) on a VCF Services Runtime control plane node. No kubectl access is required.

After an upgrade or during normal operation of VCF, a management UI (e.g. VCFA, VCF Operations, or another component) may become unresponsive or fail to load. When this occurs, API requests routed through the affected gateway return HTTP 503 errors with the following message:

upstream connect error or disconnect/reset before headers. reset reason: remote connection failure, transport failure reason: TLS error: Secret is not supplied by SDS

This error indicates that an internal gateway proxy component has entered a state where it can no longer establish secure connections to backend services, causing all requests routed through the affected gateway to fail.

When to Use This Procedure

Use this procedure when you observe one or more of the following:

Management UI is unresponsive or fails to load — A VCF management component UI (e.g. VCF Automation, VCF Operations) shows a blank page, a loading spinner that never completes, or displays a generic error. The browser developer console (F12 > Network tab) may show multiple failed requests returning HTTP 503.
API calls return HTTP 503 — API calls to the platform (typically the backend work triggered by a user action in the UI) return a 503 status code with the SDS error message shown above.
Gateway proxy logs contain SDS errors — In Log Management, run the following query to check for SDS errors in the gateway proxy logs:
```
TLS_error:_Secret_is_not_supplied_by_SDS AND k8s_container:envoy AND k8s_namespace:vmsp-platform
```
If results are returned, the gateway proxy is experiencing SDS certificate failures and this article applies.

Warning: Before running the remediation, confirm that this article applies to your situation by following the Confirming the Issue steps below. Restarting the gateway when the root cause is different will not resolve the problem.

Environment

VCF Services Runtime 9.1

Cause

An internal gateway proxy component failed to load updated TLS certificate data. While the proxy’s configuration indicates it has the latest certificates, the actual certificate data was not applied. This prevents the proxy from establishing secure connections to backend services. This condition does not self-resolve and requires a restart of the affected component.

Resolution

Prerequisites

Tools Required

`curl`	Always	Communicates with the node agent API on the control plane node
`jq`	Always	Parses JSON API responses

Installing the tools

curl — pre-installed on most Linux distributions. Verify with curl --version.

jq

# Debian / Ubuntu
sudo apt-get install -y jq
# RHEL / Photon OS / yum-based
sudo yum install -y jq
# Direct binary download (any Linux x86-64)
curl -fsSL https://github.com/jqlang/jq/releases/latest/download/jq-linux-amd64 \
  -o /usr/local/bin/jq && chmod +x /usr/local/bin/jq

Verify all tools are available before running the script:

curl --version && jq --version

Access Requirements

Network access to port 5480 on at least one cluster control plane node.
The breakglass password for the vmware-system-user account on the VCF Services Runtime cluster. This is the password set for VCF services runtime during initial deployment.

Confirming the Issue

Before proceeding with remediation, confirm that this article applies to your situation.

1. Check if the UI is accessible

Attempt to load the affected management UI in a browser. If the page fails to load, shows a blank screen, or returns errors, proceed to step 2.

2. Test an API endpoint directly

From a machine that can reach the platform, run:

curl -k https://<VCF services runtime FQDN>/<component-path>

For example: curl -k https://<platform-hostname>/km/

If the response contains TLS error: Secret is not supplied by SDS, this article applies.

3. Verify backend services are healthy

Use the script (see Running the Script) or manually query the envoy-gateway service inventory to confirm backend pods are healthy. If backend pods are healthy but API requests still fail with the SDS error, this confirms the issue is at the gateway proxy layer.
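
The check in step 2 can be scripted. The following is a minimal sketch, assuming the error text appears verbatim in the response body; the function name `is_sds_error` is hypothetical and is not part of the attached script.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) when the text passed as $1
# contains the SDS error string quoted in this article.
is_sds_error() {
  case "$1" in
    *"TLS error: Secret is not supplied by SDS"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage (replace the placeholders before running):
#   body=$(curl -sk "https://<platform-hostname>/<component-path>")
#   if is_sds_error "$body"; then echo "SDS error found: this article applies"; fi
```

A plain substring match is used so the check does not depend on the exact wording around the error in the response.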

Remediation

The envoy_gateway_sds_fix.sh script automates the full remediation sequence via the API (port 5480) on the control plane node. No kubectl access is required. The script performs the following steps. You can also run these steps manually using the provided curl commands.

Step 1: Authentication

The script authenticates with the VCF services runtime cluster control plane node at https://<node-ip>:5480 using the vmware-system-user breakglass credentials. A JWT token is obtained and used for all subsequent API calls. The token is automatically refreshed if it expires.

To find the node IP, open the LCM UI and navigate to Build → Lifecycle → VCF Services Runtime. The page should provide a list of Control Plane Nodes with their IP addresses.

Manual Execution:

export TOKEN=$(curl -sk -X POST "https://<node-ip>:5480/api/v1/auth/login" \
  -H "Content-Type: application/json" \
  -d '{"username": "vmware-system-user", "password": "<your-password>"}' | jq -r .access_token)
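
Before continuing with the manual steps, it can help to confirm that the login actually returned a token. The sketch below relies on the fact that `jq -r` prints the literal string `null` when a field is missing; the helper name `token_is_valid` is hypothetical.

```shell
#!/bin/sh
# Hypothetical guard: jq -r prints the literal string "null" when the
# access_token field is absent, so treat both empty and "null" as failure.
token_is_valid() {
  [ -n "$1" ] && [ "$1" != "null" ]
}

# Usage after the login command above:
#   token_is_valid "$TOKEN" || { echo "Login failed: no token returned" >&2; exit 1; }
```

The same guard could be applied to `COMP_ID` and `TASK_ID` in the later steps, since they are extracted with `jq -r` in the same way.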

Step 2: Discover vsp Component

The script queries GET /api/v1/components?type=vsp to find the vsp component ID. This is the component that owns the envoy gateway service.

Manual Execution:

export COMP_ID=$(curl -sk -H "Authorization: Bearer $TOKEN" \
  "https://<node-ip>:5480/api/v1/components?type=vsp" | jq -r '.elements[0].id')
echo "VSP Component ID: $COMP_ID"

Step 3: Verify Envoy Gateway Service Health

The script queries GET /api/v1/components/{id}/inventory/services/envoy-gateway to confirm the service exists and report the number of running instances before making changes.

Manual Execution:

curl -sk -H "Authorization: Bearer $TOKEN" \
  "https://<node-ip>:5480/api/v1/components/$COMP_ID/inventory/services/envoy-gateway" | jq .

Step 4: Restart envoy-gateway Service

The Envoy Gateway control plane is managed by a Kubernetes deployment named envoy-gateway in the vmsp-platform namespace. The script triggers a restart via POST /api/v1/components/{id}/inventory/services/envoy-gateway?action=restart and polls GET /api/v1/tasks/{taskId} until the task reaches Succeeded or Failed.

Manual Execution:

# Trigger the restart
export TASK_ID=$(curl -sk -X POST -H "Authorization: Bearer $TOKEN" \
  "https://<node-ip>:5480/api/v1/components/$COMP_ID/inventory/services/envoy-gateway?action=restart" | jq -r .id)
echo "Restart Task ID: $TASK_ID"
# Poll the task status (run this until status is Succeeded or Failed)
curl -sk -H "Authorization: Bearer $TOKEN" \
  "https://<node-ip>:5480/api/v1/tasks/$TASK_ID" | jq .status
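
Rather than re-running the poll command by hand, the wait can be wrapped in a loop. This is a sketch of one possible loop, not the attached script's implementation. The names `poll_task` and `fetch_status` are hypothetical; the interval and timeout reuse the `TASK_POLL_INTERVAL` and `TASK_TIMEOUT_SECONDS` defaults listed in this article, and `fetch_status` assumes `NODE_IP`, `TOKEN` and `TASK_ID` are already set.

```shell
#!/bin/sh
# Hypothetical polling loop. $1 is the name of a command that prints the
# current task status. Returns 0 on Succeeded, 1 on Failed, 2 on timeout.
poll_task() {
  get_status=$1
  interval=${TASK_POLL_INTERVAL:-15}
  timeout=${TASK_TIMEOUT_SECONDS:-600}
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    status=$($get_status)
    case "$status" in
      Succeeded) echo "$status"; return 0 ;;
      Failed)    echo "$status"; return 1 ;;
    esac
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "TimedOut"
  return 2
}

# Example status command. jq -r strips the JSON quotes so the value can be
# compared directly against Succeeded / Failed.
fetch_status() {
  curl -sk -H "Authorization: Bearer $TOKEN" \
    "https://$NODE_IP:5480/api/v1/tasks/$TASK_ID" | jq -r .status
}

# Usage:
#   poll_task fetch_status
```

Passing the status command as an argument keeps the loop independent of the API call, so it can be exercised with a stub before pointing it at a real node.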

Step 5: Post-restart Verification

The script re-checks that the envoy-gateway service exists and has running instances after the restart completes, and provides guidance for manual verification.

Manual Execution:

curl -sk -H "Authorization: Bearer $TOKEN" \
  "https://<node-ip>:5480/api/v1/components/$COMP_ID/inventory/services/envoy-gateway" | jq .

Running the Script

The script can be run from any machine that has network access to port 5480 on at least one VCF Services Runtime cluster control plane node — for example, a laptop, a jump host, or any VM on the same network segment as the cluster. It does not need to run on the cluster itself and does not require kubectl access.

Download the script attached to this article (see Attachments), save it to a local file, and make it executable:

chmod +x envoy_gateway_sds_fix.sh

Basic Usage

./envoy_gateway_sds_fix.sh --node-ip <NODE_IP>

You will be prompted for the breakglass password interactively.

Non-interactive (Password via Flag or Environment)

# Via command-line flag
./envoy_gateway_sds_fix.sh --node-ip <NODE_IP> --password <PASSWORD>
# Via environment variable
export VMSP_PASSWORD=<PASSWORD>
./envoy_gateway_sds_fix.sh --node-ip <NODE_IP>

Dry Run (Recommended First Step)

Logs all planned actions without executing any restart operations:

./envoy_gateway_sds_fix.sh --node-ip <NODE_IP> --dry-run

Script Options Reference

Required

`--node-ip <IP>`	IP address of any reachable VCF Services Runtime cluster control plane node (port 5480)
`--password <PASSWORD>`	Breakglass password for `vmware-system-user`. Omit to be prompted interactively

Optional

`--dry-run`	Validate and log planned actions without executing the restart

Environment Variable Overrides

`NODE_IP`	—	IP address of any reachable cluster control plane node
`VMSP_PASSWORD`	—	Breakglass password (avoids interactive prompt)
`TASK_POLL_INTERVAL`	`15`	Seconds between task status polls
`TASK_TIMEOUT_SECONDS`	`600`	Seconds to wait for the restart task before timing out
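
Overrides of this kind are commonly implemented with shell default expansion: a value exported by the caller is kept, otherwise the listed default applies. The sketch below illustrates that pattern only; it is an assumption about how such overrides typically work, not an excerpt from the attached script, and the function name `apply_defaults` is hypothetical.

```shell
#!/bin/sh
# Assumed pattern: keep a value already set by the caller, otherwise fall
# back to the default listed in the table above.
apply_defaults() {
  : "${TASK_POLL_INTERVAL:=15}"
  : "${TASK_TIMEOUT_SECONDS:=600}"
}

# Usage:
#   export TASK_TIMEOUT_SECONDS=1200   # optional override
#   apply_defaults
#   echo "poll every ${TASK_POLL_INTERVAL}s, give up after ${TASK_TIMEOUT_SECONDS}s"
```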

Monitoring Progress

The script writes timestamped log output to stdout. Each phase is marked with a clear separator:

==========================================================
[2026-03-23T10:00:00Z] [STEP]  Authenticating with node agent
==========================================================
[2026-03-23T10:00:01Z] [INFO]  Authentication successful.
==========================================================
[2026-03-23T10:00:01Z] [STEP]  Discovering vsp component
[2026-03-23T10:00:02Z] [INFO]  Found vsp component: vsp (id: 449f907a-..., status: Running)
==========================================================
[2026-03-23T10:00:02Z] [STEP]  Verifying envoy-gateway service health
[2026-03-23T10:00:03Z] [INFO]  Service 'envoy-gateway' found with 1 instance(s).
[2026-03-23T10:00:03Z] [INFO]  Instances:
[2026-03-23T10:00:03Z] [INFO]    - envoy-gateway-c5676f5d9-5dhwb
==========================================================
[2026-03-23T10:00:03Z] [STEP]  Restarting envoy-gateway service
[2026-03-23T10:00:04Z] [INFO]  Restart task created: xnuqdronuzgfldv442o426c6u4
[2026-03-23T10:00:19Z] [INFO]  Task xnuqdronuzgfldv442o426c6u4 status: Succeeded
==========================================================
[2026-03-23T10:00:19Z] [STEP]  Verifying SDS issue is resolved
[2026-03-23T10:00:20Z] [INFO]  Service 'envoy-gateway' found with 1 instance(s) after restart.
[2026-03-23T10:00:20Z] [INFO]  Instances after restart:
[2026-03-23T10:00:20Z] [INFO]    - envoy-gateway-c5676f5d9-7xk2m

To monitor in real time and save a log file:

./envoy_gateway_sds_fix.sh --node-ip <NODE_IP> 2>&1 | tee /tmp/sds_fix_$(date +%Y%m%d_%H%M%S).log

Verifying the Fix

After the script completes successfully:

Load the affected management UI in a browser. The page should load without errors.
Test an API endpoint directly:
```
curl -k https://<platform-hostname>/<component-path>
```
The request should return a successful response (not a 503 SDS error).
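
For a quick scripted check, the HTTP status code can be inspected instead of the full response. The sketch below uses curl's `-o /dev/null -w '%{http_code}'` options; the helper name `classify_http_status` and its verdict strings are hypothetical. A 503 on its own does not prove an SDS failure, so the response body should still be checked for the error text.

```shell
#!/bin/sh
# Hypothetical helper: map an HTTP status code to a short verdict.
# curl reports 000 when no HTTP response was received at all.
classify_http_status() {
  case "$1" in
    2??|3??) echo "ok" ;;
    503)     echo "unavailable" ;;
    000)     echo "no-response" ;;
    *)       echo "other" ;;
  esac
}

# Usage (replace the placeholders before running):
#   code=$(curl -sk -o /dev/null -w '%{http_code}' "https://<platform-hostname>/<component-path>")
#   classify_http_status "$code"
```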

Troubleshooting

Authentication fails (HTTP 401 or 403)

Confirm the breakglass password for vmware-system-user is correct.
Ensure port 5480 is reachable from the machine running the script:
```
curl -sk https://<NODE_IP>:5480/api/v1/components
```

Cannot reach cluster control plane node (connection refused or timeout)

Verify the node IP is correct and the node is reachable.

Confirm the API server is running on the node:

# From the node itself
systemctl status vmsp-agent

Try a different node IP with --node-ip.
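
When separating a connectivity problem from an authentication problem, curl's exit code is often more informative than its silent output under `-s`. The sketch below maps a few documented curl exit codes to plain descriptions; the helper name `explain_curl_exit` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: describe common curl exit codes.
# 6, 7 and 28 are documented curl exit codes for DNS failure,
# failed connection and timeout respectively.
explain_curl_exit() {
  case "$1" in
    0)  echo "connected" ;;
    6)  echo "could not resolve host" ;;
    7)  echo "connection refused or host unreachable" ;;
    28) echo "timed out" ;;
    *)  echo "curl exit code $1" ;;
  esac
}

# Usage (replace <NODE_IP> before running):
#   curl -sk --connect-timeout 5 -o /dev/null "https://<NODE_IP>:5480/api/v1/components"
#   explain_curl_exit $?
```

An exit code of 0 means the node answered, so a remaining failure is more likely to be a credentials issue than a network one.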

Restart task fails (status: Failed)

The workflow for the restart encountered an error.
Check the task details logged by the script for the failure message.
Re-run the script — it is safe to run multiple times.

Restart task times out

The default timeout is 600 seconds. If the restart is taking longer:

Increase the timeout: export TASK_TIMEOUT_SECONDS=1200
Check the workflow status in the cluster UI or logs.
Re-run the script after resolving the underlying issue.

Issue persists after restart

Verify that the envoy-gateway service has running instances after the restart (re-run the script to check).
If the SDS error recurs shortly after a restart, there may be an underlying certificate synchronization issue. Collect the gateway proxy pod logs and open a support case.

Script exits partway through — safe to re-run

The script is safe to re-run after a failure. It will re-authenticate and re-discover the component before attempting the restart.

Additional Information

Example Log Entry (SDS Error)

{
  "response_code": 503,
  "response_code_details": "upstream_reset_before_response_started{remote_connection_failure|TLS_error:_Secret_is_not_supplied_by_SDS}",
  "upstream_transport_failure_reason": "TLS_error:_Secret_is_not_supplied_by_SDS",
  "route_name": "httproute/prelude/vksm/rule/3/match/0/<hostname>"
}

Attachments

envoy_gateway_sds_fix.sh