Envoy Service External IP Stuck in Pending State After Contour Tanzu Package or Guest Cluster Upgrade



Article ID: 424045


Updated On:

Products

VMware vSphere Kubernetes Service

Issue/Introduction

 

  • The Envoy service in the Tanzu Kubernetes Cluster (TKC) is unable to obtain an external IP address.

  • The TKC upgrade was performed while the Envoy service was in a Pending state. During the node rollout phase, an external IP address was temporarily assigned to the Envoy service; however, after the upgrade completed, the Envoy service reverted to the Pending state and the Contour package remained stuck in the Reconciling phase.

  • The affected TKC is configured to use a static IP address in the cluster YAML.

  • All service pods are in the Running state.
root@xxxx-xxxx-xxxx-control-plane[ / ]# k get svc -A

NAMESPACE              NAME    TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
tanzu-system-ingress   envoy   LoadBalancer   10.xxx.xxx.xx   <pending>     80:30029/TCP,443:31188/TCP   85d
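On a cluster with many services, the pending entry can be filtered out of the listing. The following is a minimal sketch, not part of the original article: the helper name list_pending_lb is hypothetical, and it assumes the default column order of `kubectl get svc -A` shown above.

```shell
#!/usr/bin/env bash
# list_pending_lb (hypothetical helper): reads `kubectl get svc -A` output
# on stdin and prints "namespace/name" for every LoadBalancer service whose
# EXTERNAL-IP column is still <pending>.
# Assumes the default columns: NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP ...
list_pending_lb() {
  awk '$3 == "LoadBalancer" && $5 == "<pending>" { print $1 "/" $2 }'
}
```

Usage: kubectl get svc -A | list_pending_lb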

The endpoints are assigned:

root@ib2-intranet-mgmt-prod-cluster-1-control-plane-gplwz [ /var/log ]# k get endpoints -A
NAMESPACE              NAME    ENDPOINTS                                                          AGE
tanzu-system-ingress   envoy   192.xxx.x.x:8443,192.xxx.x.x:8443,192.xxx.x.x:8443 + 3 more...   85d

 

Describing the service shows the following error:

root@xxxx-xxxx-xxxx-control-plane [ /var/log ]# k describe svc envoy -n tanzu-system-ingress
Name:                     envoy
Namespace:                tanzu-system-ingress
:
:
:
Events:
  Type     Reason                        Age                   From                       Message
  ----     ------                        ----                  ----                       -------
  Normal   UpdatedLoadBalancer           49m (x12 over 4h39m)  service-controller         Updated load balancer with new hosts
  Normal   EnsuringLoadBalancer          46m                   service-controller         Ensuring load balancer
  Normal   EnsuredLoadBalancer           46m                   service-controller         Ensured load balancer
  Normal   UpdatedLoadBalancer           42m (x2 over 42m)     service-controller         Updated load balancer with new hosts
  Normal   Removed                       33m                   avi-kubernetes-operator    Removed virtualservice for envoy
  Warning  FailedToUpdateEndpointSlices  28m (x6 over 28m)     endpoint-slice-controller  Error updating Endpoint Slices for Service tanzu-system-ingress/envoy: skipping Pod envoy-rkcgd for Service tanzu-system-ingress/envoy: Node xxx-xxxxx-xxxxx-xxxxx-xxxxx Not Found

The Contour container logs on the control plane show the following:

YYYY-MM-DDT07:50:39.742864015Z stderr F time="YYYY-MM-DDT07:50:39Z" level=info msg="received a new address for status.loadBalancer" context=loadBalancerStatusWriter loadbalancer-address=10.XX.XX.XXX
YYYY-MM-DDT07:50:39.743093939Z stderr F time="YYYY-MM-DDT07:50:39Z" level=info msg="received a new address for status.loadBalancer" context=loadBalancerStatusWriter loadbalancer-address=10.XX.XX.XXX
YYYY-MM-DDT07:50:39.743106316Z stderr F time="YYYY-MM-DDT07:50:39Z" level=info msg="received a new address for status.loadBalancer" context=loadBalancerStatusWriter loadbalancer-address=10.XX.XX.XXX
YYYY-MM-DDT07:50:39.743109371Z stderr F time="YYYY-MM-DDT07:50:39Z" level=info msg="received a new address for status.loadBalancer" context=loadBalancerStatusWriter loadbalancer-address= <========== no IP
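The status updates that carry no address can be isolated from a longer log. This is a sketch under assumptions: the helper name empty_lb_updates is hypothetical, and it relies on the log format shown above, where loadbalancer-address= is followed by nothing (or only whitespace) when no IP is present.

```shell
#!/usr/bin/env bash
# empty_lb_updates (hypothetical helper): reads Contour log lines on stdin
# and prints only those where the loadbalancer-address field has no value.
# Assumes the key=value log format shown in the excerpt above.
empty_lb_updates() {
  grep -E 'loadbalancer-address=([[:space:]]|$)'
}
```

Usage, assuming the Contour deployment is named contour in the tanzu-system-ingress namespace: kubectl logs -n tanzu-system-ingress deploy/contour | empty_lb_updates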

According to the CPI (guest cluster cloud provider) logs, the last service update occurred at 07:55:31 because of a node change:
I1216 07:55:31.463812       1 event.go:307] "Event occurred" object="tanzu-system-ingress/envoy" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="UpdatedLoadBalancer" message="Updated load balancer with new hosts"

 

Environment

VMware vCenter Server: 8.X

Tanzu Kubernetes Runtime

Cause

The kapp-controller logs show that it updated the Contour app resources (including the Envoy service) at YYYY-MM-DDT07:58:54Z:
{"level":"info","ts":"YYYY-MM-DDT07:58:54Z","logger":"kc.controller.app","msg":"Updating status","request":{"name":"contour","namespace":"default"},"desc":"flushing: flush all"}

This reconciliation likely reset or cleared the Service.Status.LoadBalancer.Ingress field of the Envoy service. As a result, the service transitioned into a Pending state, awaiting reassignment of a load balancer IP.

The Kubernetes Service Controller detected the update to the Service object, but it did not trigger a synchronization with the Supervisor Service because no attribute within the Service specification (spec) had changed. The Envoy service therefore stays in the Pending state until a resynchronization is forced.

Reference: controllers/service/controller.go at commit 65aef96cfa4925fceb426b9d3faa6c0d3bb15484 in the kubernetes/cloud-provider repository on GitHub.
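To confirm that the status field described above is empty, it can be read directly from the Service object. This is a minimal sketch: the helper names lb_ingress and is_lb_pending are hypothetical, and it assumes the load balancer reports an IP address rather than a hostname.

```shell
#!/usr/bin/env bash
# lb_ingress (hypothetical helper): print the IPs recorded in
# status.loadBalancer.ingress for a service.
#   $1 = service name, $2 = namespace
lb_ingress() {
  kubectl get svc "$1" -n "$2" -o jsonpath='{.status.loadBalancer.ingress[*].ip}'
}

# is_lb_pending (hypothetical helper): succeeds (exit 0) when no IP is
# recorded, which is the state kubectl displays as <pending>.
is_lb_pending() {
  [ -z "$(lb_ingress "$1" "$2")" ]
}
```

Usage: is_lb_pending envoy tanzu-system-ingress && echo "envoy has no load balancer IP"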

Resolution

Restart the Cloud Provider Interface (CPI) pod in the guest cluster to trigger a resynchronization between the Supervisor vmservice and the guest cluster service.

Steps:

kubectl get deploy -A | grep guest-cluster-cloud-provider
kubectl rollout restart deploy guest-cluster-cloud-provider -n <cloud provider namespace>
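The two steps above can be combined so that the namespace is looked up automatically. This is a sketch, not an official procedure: the function name restart_cpi and the 180-second timeout are assumptions, and the deployment name is taken from the commands above.

```shell
#!/usr/bin/env bash
# restart_cpi (hypothetical helper): locate the guest-cluster-cloud-provider
# deployment, restart it, and wait for the rollout to finish.
restart_cpi() {
  local name="guest-cluster-cloud-provider"
  local ns
  # With -A and --no-headers, column 1 is the namespace and column 2 the name.
  ns=$(kubectl get deploy -A --no-headers | awk -v n="$name" '$2 == n { print $1; exit }')
  if [ -z "$ns" ]; then
    echo "deployment $name not found in any namespace" >&2
    return 1
  fi
  kubectl rollout restart deploy "$name" -n "$ns" &&
    kubectl rollout status deploy "$name" -n "$ns" --timeout=180s  # timeout is an arbitrary choice
}
```

After the rollout completes, kubectl get svc envoy -n tanzu-system-ingress should show whether an external IP has been reassigned.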

 

Additional Information

  • Assigning a static IP address is a supported and recommended configuration for Ingress Controllers (such as Contour) to ensure DNS record stability. To maintain security, organizations should restrict developer-level roles from having the patch verb for services/status as this permission could potentially be misused to alter load balancer endpoints and enable traffic hijacking.
  • When a Service of type LoadBalancer is configured without a static IP address, the load balancer may allocate a new IP address after a namespace is restored from backup, because the request is treated as a new service provisioning event.
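The RBAC restriction mentioned in the first point above can be spot-checked with kubectl auth can-i. This is a sketch under assumptions: the helper name can_patch_svc_status and the user dev-user are hypothetical, and the caller needs permission to impersonate users.

```shell
#!/usr/bin/env bash
# can_patch_svc_status (hypothetical helper): ask the API server whether a
# given user may patch the status subresource of services in a namespace.
#   $1 = user to impersonate, $2 = namespace
# kubectl prints "yes" or "no" and exits non-zero when the answer is no.
can_patch_svc_status() {
  kubectl auth can-i patch services --subresource=status --as="$1" -n "$2"
}
```

Usage: can_patch_svc_status dev-user tanzu-system-ingress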