Envoy Pod 1/2 Status in TKGm With Error ‘upstream connect error or disconnect/reset before headers. reset reason: connection termination’

Article ID: 410688

Updated On:

Products

VMware Tanzu Kubernetes Grid

Issue/Introduction

You will notice that the Envoy pods show a status of 1/2, indicating that one of the two containers is failing. Upon further inspection of the Envoy container logs, you will observe gRPC-related errors:

kubectl logs -n tanzu-system-ingress envoy-xxxxx -c envoy
...
[./source/extensions/config_subscription/grpc/grpc_stream.h:193] StreamRuntime gRPC config stream to contour closed since 32s ago: 14, upstream connect error or disconnect/reset before headers. reset reason: connection termination

In the Contour logs, you can see that the initial gRPC config fetch is timing out:

kubectl logs -n tanzu-system-ingress contour-xxxxx
...
extensions/config_subscription/grpc/grpc_subscription_impl.cc:130] gRPC config: initial fetch timed out for type.googleapis.com/envoy.service.runtime.v3.Runtime
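
A quick way to confirm the symptom is to list the pods in the ingress namespace; the affected Envoy pods report 1/2 READY (the pod names, restart counts, and ages below are illustrative):

#List the pods in the ingress namespace
kubectl get pods -n tanzu-system-ingress

NAME                        READY   STATUS    RESTARTS   AGE
contour-xxxxxxxxxx-xxxxx    1/1     Running   0          45d
envoy-xxxxx                 1/2     Running   0          45d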

Environment

TKGm 2.5.x

Cause

This issue occurs when the certificates used between Envoy and Contour expire. Specifically, the problem lies in either the contourcert or envoycert secrets in the tanzu-system-ingress namespace.

Envoy and Contour use mTLS to establish the gRPC configuration stream. If the CA or TLS certificate is expired, Envoy cannot authenticate Contour’s xDS server, causing Envoy pods to remain in a 1/2 (not ready) state. 

To verify whether the certificates are expired:

#List the secrets in the namespace
kubectl get secret -n tanzu-system-ingress

#Inspect a secret in YAML format (replace <secret-name> with contourcert or envoycert from the previous output)
kubectl get secret -oyaml -n tanzu-system-ingress <secret-name>

#Copy the ca.crt or tls.crt value from the previous output (it will be base64-encoded)

#Decode the base64 string and check the certificate's validity
echo "<base64 string gathered previously>" | base64 -d | openssl x509 -text -noout

#Review the Validity section of the output for both ca.crt and tls.crt, and compare the results from the contourcert and envoycert secrets
#You may see a discrepancy like the one below

contourcert - ca.crt (EXPIRED)
Validity
    Not Before: Jan 01 10:00:00 2024 GMT
    Not After : Jan 01 10:00:00 2025 GMT

envoycert - ca.crt (VALID)
Validity
    Not Before: Jan 01 10:00:00 2025 GMT
    Not After : Jan 01 10:00:00 2026 GMT
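
As a convenience, the same check can be scripted so the expiry dates of both secrets can be compared side by side without copying base64 strings by hand. This is a minimal sketch that assumes the standard ca.crt/tls.crt keys shown above:

#Print the expiry (notAfter) date of ca.crt and tls.crt in both secrets
for secret in contourcert envoycert; do
  echo "== $secret =="
  echo -n "ca.crt  : "; kubectl get secret "$secret" -n tanzu-system-ingress -o jsonpath='{.data.ca\.crt}'  | base64 -d | openssl x509 -noout -enddate
  echo -n "tls.crt : "; kubectl get secret "$secret" -n tanzu-system-ingress -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -enddate
done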

Resolution

Since these certificates are managed by cert-manager, the simplest fix is to delete the expired secrets. cert-manager will automatically regenerate fresh ones.

1. Backup the secrets

Always back up the existing secrets in case you need to review or restore them later:

kubectl get secret contourcert -n tanzu-system-ingress -o yaml > contourcert-backup.yaml
kubectl get secret envoycert   -n tanzu-system-ingress -o yaml > envoycert-backup.yaml

2. Delete the expired secrets

kubectl delete secret contourcert -n tanzu-system-ingress
kubectl delete secret envoycert   -n tanzu-system-ingress

3. Wait for new secrets to be created

cert-manager will automatically reconcile and issue fresh certificates. You can verify with: 

kubectl get secrets -n tanzu-system-ingress
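
Optionally, re-run the expiry check from the Cause section against the regenerated secrets to confirm they now show a future Not After date, for example:

#Confirm the regenerated ca.crt is no longer expired
kubectl get secret contourcert -n tanzu-system-ingress -o jsonpath='{.data.ca\.crt}' | base64 -d | openssl x509 -noout -enddate
kubectl get secret envoycert   -n tanzu-system-ingress -o jsonpath='{.data.ca\.crt}' | base64 -d | openssl x509 -noout -enddate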

Once the new secrets are created, Envoy will pick them up and recover on its own.
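
To confirm recovery, watch the pods in the namespace until the Envoy pods report 2/2 READY again (this may take a few minutes):

#Watch the Envoy pods return to 2/2 Ready
kubectl get pods -n tanzu-system-ingress -w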