This article describes how to remediate an envoy gateway proxy whose SDS (Secret Discovery Service) configuration has become invalid, causing all requests routed through the affected gateway to fail with TLS errors. The remediation restarts the envoy gateway control plane via the node agent API (port 5480) on a VCF Services Runtime control plane node. No kubectl access is required.
After upgrading or during normal operations of VCF, a management UI (e.g. VCFA, VCF Operations, or another component) may become unresponsive or fail to load. When this occurs, API requests routed through the affected gateway return HTTP 503 errors with the following message:
upstream connect error or disconnect/reset before headers. reset reason: remote connection failure, transport failure reason: TLS error: Secret is not supplied by SDS
This error indicates that an internal gateway proxy component has entered a state where it can no longer establish secure connections to backend services, causing all requests routed through the affected gateway to fail.
Use this procedure when you observe one or more of the following:
TLS_error:_Secret_is_not_supplied_by_SDS AND k8s_container:envoy AND k8s_namespace:vmsp-platform
If results are returned, the gateway proxy is experiencing SDS certificate failures and this article applies.
Warning: Before running the remediation, confirm that this article applies to your situation by following the Confirming the Issue steps below. Restarting the gateway when the root cause is different will not resolve the problem.
VCF Services Runtime 9.1
An internal gateway proxy component failed to load updated TLS certificate data. While the proxy’s configuration indicates it has the latest certificates, the actual certificate data was not applied. This prevents the proxy from establishing secure connections to backend services. This condition does not self-resolve and requires a restart of the affected component.
curl | Always | Communicates with the node agent API on the control plane node |
jq | Always | Parses JSON API responses |
curl — pre-installed on most Linux distributions. Verify with curl --version.
jq
# Debian / Ubuntu
sudo apt-get install -y jq
# RHEL / Photon OS / yum-based
sudo yum install -y jq
# Direct binary download (any Linux x86-64)
curl -fsSL https://github.com/jqlang/jq/releases/latest/download/jq-linux-amd64 \
-o /usr/local/bin/jq && chmod +x /usr/local/bin/jq
Verify all tools are available before running the script:
curl --version && jq --version
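The check above can also be wrapped in a small helper that names each missing tool instead of stopping at the first failure. This is a minimal sketch; the function name check_tools is hypothetical and is not part of the attached script.

```shell
# Report every required command that is missing from PATH.
# Returns 0 when all listed tools are found, 1 otherwise.
check_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "Missing required tool: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example: check_tools curl jq || exit 1
```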
The password for the vmware-system-user account on the VCF Services Runtime cluster. This is the password set for VCF Services Runtime during initial deployment.
Before proceeding with remediation, confirm that this article applies to your situation.
Attempt to load the affected management UI in a browser. If the page fails to load, shows a blank screen, or returns errors, proceed to step 2.
From a machine that can reach the platform, run:
curl -k https://<VCF services runtime FQDN>/<component-path>
For example: curl -k https://<platform-hostname>/km/
If the response contains TLS error: Secret is not supplied by SDS, this article applies.
Use the script (see Running the Script) or manually query the envoy-gateway service inventory to confirm backend pods are healthy. If backend pods are healthy but API requests still fail with the SDS error, this confirms the issue is at the gateway proxy layer.
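The response check in step 2 can be done mechanically rather than by reading the output. The sketch below assumes only the error text quoted in this article; the function name detect_sds_error is hypothetical. It matches both the wording returned in the HTTP response body and the underscore form that appears in envoy access logs.

```shell
# Read text on stdin and print "sds-error" when it carries the SDS failure
# signature, otherwise print "ok".
detect_sds_error() {
  if grep -Eq 'Secret[ _]is[ _]not[ _]supplied[ _]by[ _]SDS'; then
    echo "sds-error"
  else
    echo "ok"
  fi
}

# Example (replace the placeholders with your own values first):
#   curl -sk "https://<platform-hostname>/<component-path>" | detect_sds_error
```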
The envoy_gateway_sds_fix.sh script automates the full remediation sequence via the API (port 5480) on the control plane node. No kubectl access is required. The script performs the following steps. You can also run these steps manually using the provided curl commands.
The script authenticates with the VCF Services Runtime cluster control plane node at https://<node-ip>:5480 using the vmware-system-user breakglass credentials. A JWT token is obtained and used for all subsequent API calls. The token is automatically refreshed if it expires.
To find the node IP, open the LCM UI and navigate to Build → Lifecycle → VCF Services Runtime. The page should provide a list of Control Plane Nodes with their IP addresses.
Manual Execution:
export TOKEN=$(curl -sk -X POST "https://<node-ip>:5480/api/v1/auth/login" \
-H "Content-Type: application/json" \
-d '{"username": "vmware-system-user", "password": "<your-password>"}' | jq -r .access_token)
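When running the steps manually, it helps to confirm that each extracted value is usable before continuing. jq -r prints the literal string null when the requested key is absent, so a failed login can leave TOKEN set to "null" without any visible error. The helper below is a minimal sketch; the name require_value is hypothetical.

```shell
# Fail fast when a value extracted with jq -r is empty or the literal "null".
# $1: variable name used in the error message, $2: the value to check.
require_value() {
  local name="$1" value="$2"
  if [ -z "$value" ] || [ "$value" = "null" ]; then
    echo "ERROR: $name is empty; check the node IP and credentials." >&2
    return 1
  fi
}

# Example: require_value TOKEN "$TOKEN" || exit 1
```

The same check applies to the component ID and task ID obtained in the later manual steps.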
The script queries GET /api/v1/components?type=vsp to find the vsp component ID. This is the component that owns the envoy gateway service.
Manual Execution:
export COMP_ID=$(curl -sk -H "Authorization: Bearer $TOKEN" \
"https://<node-ip>:5480/api/v1/components?type=vsp" | jq -r '.elements[0].id')
echo "VSP Component ID: $COMP_ID"
The script queries GET /api/v1/components/{id}/inventory/services/envoy-gateway to confirm the service exists and report the number of running instances before making changes.
Manual Execution:
curl -sk -H "Authorization: Bearer $TOKEN" \
"https://<node-ip>:5480/api/v1/components/$COMP_ID/inventory/services/envoy-gateway" | jq .
The envoy gateway control plane is managed by a Kubernetes deployment named envoy-gateway in the vmsp-platform namespace. The script triggers a restart via POST /api/v1/components/{id}/inventory/services/envoy-gateway?action=restart and polls GET /api/v1/tasks/{taskId} until the task reaches Succeeded or Failed.
Manual Execution:
# Trigger the restart
export TASK_ID=$(curl -sk -X POST -H "Authorization: Bearer $TOKEN" \
"https://<node-ip>:5480/api/v1/components/$COMP_ID/inventory/services/envoy-gateway?action=restart" | jq -r .id)
echo "Restart Task ID: $TASK_ID"
# Poll the task status (run this until status is Succeeded or Failed)
curl -sk -H "Authorization: Bearer $TOKEN" \
"https://<node-ip>:5480/api/v1/tasks/$TASK_ID" | jq .status
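Repeating the poll command by hand can be replaced with a loop that stops on Succeeded, Failed, or a timeout. The sketch below is an illustration, not the attached script's implementation. It reads the TASK_POLL_INTERVAL and TASK_TIMEOUT_SECONDS variables with the defaults documented in this article; the function name poll_task and the status callback are hypothetical.

```shell
# Poll a status command until it prints Succeeded or Failed, or until the
# timeout elapses. $1 is the name of a command or function that prints the
# current task status on stdout.
poll_task() {
  local get_status="$1"
  local interval="${TASK_POLL_INTERVAL:-15}"
  local timeout="${TASK_TIMEOUT_SECONDS:-600}"
  local elapsed=0 status
  while [ "$elapsed" -lt "$timeout" ]; do
    status="$("$get_status")"
    case "$status" in
      Succeeded) echo "Succeeded"; return 0 ;;
      Failed)    echo "Failed";    return 1 ;;
    esac
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "TimedOut"
  return 2
}

# Example callback (assumes TOKEN, NODE_IP and TASK_ID are already set):
#   task_status() {
#     curl -sk -H "Authorization: Bearer $TOKEN" \
#       "https://$NODE_IP:5480/api/v1/tasks/$TASK_ID" | jq -r .status
#   }
#   poll_task task_status
```

Passing the status lookup as a callback keeps the loop independent of the API call, so it can be exercised with a stub before pointing it at a live node.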
The script re-checks that the envoy-gateway service exists and has running instances after the restart completes, and provides guidance for manual verification.
Manual Execution:
curl -sk -H "Authorization: Bearer $TOKEN" \
"https://<node-ip>:5480/api/v1/components/$COMP_ID/inventory/services/envoy-gateway" | jq .
The script can be run from any machine that has network access to port 5480 on at least one VCF Services Runtime cluster control plane node — for example, a laptop, a jump host, or any VM on the same network segment as the cluster. It does not need to run on the cluster itself and does not require kubectl access.
Download the script attached to this article (see Script Reference), save it to a local file, and make it executable:
chmod +x envoy_gateway_sds_fix.sh
./envoy_gateway_sds_fix.sh --node-ip <NODE_IP>
You will be prompted for the breakglass password interactively.
# Via command-line flag
./envoy_gateway_sds_fix.sh --node-ip <NODE_IP> --password <PASSWORD>
# Via environment variable
export VMSP_PASSWORD=<PASSWORD>
./envoy_gateway_sds_fix.sh --node-ip <NODE_IP>
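The three ways of supplying the password can be pictured as a simple fallback chain. The sketch below is an assumption about how such a chain might look, not the attached script's code: the precedence order (flag, then VMSP_PASSWORD, then interactive prompt) and the function name resolve_password are hypothetical, and the prompt branch relies on bash's read options.

```shell
# Pick the password source: explicit flag value, then the VMSP_PASSWORD
# environment variable, then an interactive prompt (bash only).
# $1: value passed via --password, or empty when the flag was omitted.
resolve_password() {
  local flag_value="${1:-}"
  if [ -n "$flag_value" ]; then
    printf '%s' "$flag_value"
  elif [ -n "${VMSP_PASSWORD:-}" ]; then
    printf '%s' "$VMSP_PASSWORD"
  else
    local pw
    # -s hides typed characters; bash writes the -p prompt to stderr,
    # so stdout carries only the password.
    read -r -s -p "Password for vmware-system-user: " pw
    echo >&2
    printf '%s' "$pw"
  fi
}

# Example: PASSWORD="$(resolve_password "$CLI_PASSWORD")"
```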
Logs all planned actions without executing any restart operations:
./envoy_gateway_sds_fix.sh --node-ip <NODE_IP> --dry-run
--node-ip <IP> | IP address of any reachable VCF Services Runtime cluster control plane node (port 5480) |
--password <PASSWORD> | Breakglass password for vmware-system-user. Omit to be prompted interactively |
--dry-run | Validate and log planned actions without executing the restart |
NODE_IP | — | IP address of any reachable cluster control plane node |
VMSP_PASSWORD | — | Breakglass password (avoids interactive prompt) |
TASK_POLL_INTERVAL | 15 | Seconds between task status polls |
TASK_TIMEOUT_SECONDS | 600 | Seconds to wait for the restart task before timing out |
The script writes timestamped log output to stdout. Each phase is marked with a clear separator:
==========================================================
[2026-03-23T10:00:00Z] [STEP] Authenticating with node agent
==========================================================
[2026-03-23T10:00:01Z] [INFO] Authentication successful.
==========================================================
[2026-03-23T10:00:01Z] [STEP] Discovering vsp component
[2026-03-23T10:00:02Z] [INFO] Found vsp component: vsp (id: 449f907a-..., status: Running)
==========================================================
[2026-03-23T10:00:02Z] [STEP] Verifying envoy-gateway service health
[2026-03-23T10:00:03Z] [INFO] Service 'envoy-gateway' found with 1 instance(s).
[2026-03-23T10:00:03Z] [INFO] Instances:
[2026-03-23T10:00:03Z] [INFO] - envoy-gateway-c5676f5d9-5dhwb
==========================================================
[2026-03-23T10:00:03Z] [STEP] Restarting envoy-gateway service
[2026-03-23T10:00:04Z] [INFO] Restart task created: xnuqdronuzgfldv442o426c6u4
[2026-03-23T10:00:19Z] [INFO] Task xnuqdronuzgfldv442o426c6u4 status: Succeeded
==========================================================
[2026-03-23T10:00:19Z] [STEP] Verifying SDS issue is resolved
[2026-03-23T10:00:20Z] [INFO] Service 'envoy-gateway' found with 1 instance(s) after restart.
[2026-03-23T10:00:20Z] [INFO] Instances after restart:
[2026-03-23T10:00:20Z] [INFO] - envoy-gateway-c5676f5d9-7xk2m
To monitor in real time and save a log file:
./envoy_gateway_sds_fix.sh --node-ip <NODE_IP> 2>&1 | tee /tmp/sds_fix_$(date +%Y%m%d_%H%M%S).log
After the script completes successfully:
curl -k https://<platform-hostname>/<component-path>
The request should return a successful response (not a 503 SDS error).
Confirm that the password for vmware-system-user is correct.
curl -sk https://<NODE_IP>:5480/api/v1/components
# From the node itself
systemctl status vmsp-agent
--node-ip.
The default timeout is 600 seconds. If the restart is taking longer:
export TASK_TIMEOUT_SECONDS=1200
The script is safe to re-run after a failure. It will re-authenticate and re-discover the component before attempting the restart.
{
  "response_code": 503,
  "response_code_details": "upstream_reset_before_response_started{remote_connection_failure|TLS_error:_Secret_is_not_supplied_by_SDS}",
  "upstream_transport_failure_reason": "TLS_error:_Secret_is_not_supplied_by_SDS",
  "route_name": "httproute/prelude/vksm/rule/3/match/0/<hostname>"
}