Envoy Service External IP Stuck in Pending State After Contour Tanzu Package or Guest Cluster Upgrade

Products

VMware vSphere Kubernetes Service

Issue/Introduction

The Envoy service in the Tanzu Kubernetes Cluster (TKC) is unable to obtain an external IP address.
The TKC upgrade was performed while the Envoy service was in a Pending state. During the node rollout phase, an external IP address was temporarily assigned to the Envoy service; however, after the upgrade completed, the Envoy service reverted to the Pending state and the Contour package remained stuck in the Reconciling phase.
The affected TKC is configured to use a static IP address in the cluster YAML.
All service pods are in the Running state.

root@xxxx-xxxx-xxxx-control-plane [ / ]# k get svc -A
NAMESPACE              NAME    TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
tanzu-system-ingress   envoy   LoadBalancer   10.xxx.xxx.xx   <pending>     80:30029/TCP,443:31188/TCP   85d
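
The Pending state can also be confirmed by reading the status field directly instead of scanning the full service list. The following is a minimal sketch, assuming the service name and namespace shown in the output above; the helper names envoy_lb_ip and lb_state are hypothetical and only used for illustration.

```shell
# Print the load balancer IP(s) recorded in the Envoy service status.
# Prints nothing while the service is Pending.
envoy_lb_ip() {
  kubectl get svc envoy -n tanzu-system-ingress \
    -o jsonpath='{.status.loadBalancer.ingress[*].ip}'
}

# Classify a status value: an empty string means no address has been written.
lb_state() {
  if [ -n "$1" ]; then
    echo "assigned"
  else
    echo "pending"
  fi
}
```

Running lb_state "$(envoy_lb_ip)" against the affected cluster would be expected to report pending for as long as the EXTERNAL-IP column shows <pending>.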

We can see the endpoints are assigned:

root@ib2-intranet-mgmt-prod-cluster-1-control-plane-gplwz [ /var/log ]# k get endpoints -A
NAMESPACE              NAME    ENDPOINTS                                                           AGE
tanzu-system-ingress   envoy   192.xxx.x.x:8443,192.xxx.x.x:8443,192.xxx.x.x:8443 + 3 more...   85d

Describing the service shows the following error:

root@xxxx-xxxx-xxxx-control-plane [ /var/log ]# k describe svc envoy -n tanzu-system-ingress
Name:                     envoy
Namespace:                tanzu-system-ingress
:
:
:
Events:
  Type     Reason                        Age                   From                       Message
  ----     ------                        ----                  ----                       -------
  Normal   UpdatedLoadBalancer           49m (x12 over 4h39m)  service-controller         Updated load balancer with new hosts
  Normal   EnsuringLoadBalancer          46m                   service-controller         Ensuring load balancer
  Normal   EnsuredLoadBalancer           46m                   service-controller         Ensured load balancer
  Normal   UpdatedLoadBalancer           42m (x2 over 42m)     service-controller         Updated load balancer with new hosts
  Normal   Removed                       33m                   avi-kubernetes-operator    Removed virtualservice for envoy
  Warning  FailedToUpdateEndpointSlices  28m (x6 over 28m)     endpoint-slice-controller  Error updating Endpoint Slices for Service tanzu-system-ingress/envoy: skipping Pod envoy-rkcgd for Service tanzu-system-ingress/envoy: Node xxx-xxxxx-xxxxx-xxxxx-xxxxx Not Found

The Contour container logs collected from the control plane show the following:

YYYY-MM-DDT07:50:39.742864015Z stderr F time="YYYY-MM-DDT07:50:39Z" level=info msg="received a new address for status.loadBalancer" context=loadBalancerStatusWriter loadbalancer-address=10.XX.XX.XXX
YYYY-MM-DDT07:50:39.743093939Z stderr F time="YYYY-MM-DDT07:50:39Z" level=info msg="received a new address for status.loadBalancer" context=loadBalancerStatusWriter loadbalancer-address=10.XX.XX.XXX
YYYY-MM-DDT07:50:39.743106316Z stderr F time="YYYY-MM-DDT07:50:39Z" level=info msg="received a new address for status.loadBalancer" context=loadBalancerStatusWriter loadbalancer-address=10.XX.XX.XXX
YYYY-MM-DDT07:50:39.743109371Z stderr F time="YYYY-MM-DDT07:50:39Z" level=info msg="received a new address for status.loadBalancer" context=loadBalancerStatusWriter loadbalancer-address= <========== no IP

Based on the CPI (guest cluster cloud provider) logs, the last service update occurred at 07:55:31 because of a node change:
I1216 07:55:31.463812       1 event.go:307] "Event occurred" object="tanzu-system-ingress/envoy" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="UpdatedLoadBalancer" message="Updated load balancer with new hosts"
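
To locate entries like the one above, the CPI logs can be narrowed to lines that mention the Envoy service object. This is a sketch only: it assumes the CPI runs as the guest-cluster-cloud-provider deployment named in the Resolution section, the namespace is a placeholder, and filter_envoy_lines is a hypothetical helper name.

```shell
# Keep only log lines that reference the Envoy service object.
# Reads log text on stdin and writes the matching lines to stdout.
filter_envoy_lines() {
  grep -F 'tanzu-system-ingress/envoy'
}
```

A possible invocation is kubectl logs deploy/guest-cluster-cloud-provider -n <cloud provider namespace> | filter_envoy_lines, which leaves only the events recorded for the Envoy service.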

Environment

VMware vCenter Server: 8.X

Tanzu Kubernetes Runtime

Cause

The kapp-controller logs show that it updated the contour app resources (including the envoy service) at YYYY-MM-DDT07:58:54Z:
{"level":"info","ts":"YYYY-MM-DDT07:58:54Z","logger":"kc.controller.app","msg":"Updating status","request":{"name":"contour","namespace":"default"},"desc":"flushing: flush all"}

This reconciliation likely reset or cleared the Service.Status.LoadBalancer.Ingress field for the Envoy service. As a result, the service transitioned into a Pending state, awaiting reassignment of a load balancer IP.

The Kubernetes Service Controller detected an update to the Service object; however, no synchronization with the Supervisor Service was triggered because there were no changes to attributes within the Service specification (spec). As a result, the Envoy Service status remains in a Pending state.

Reference: controllers/service/controller.go in the kubernetes/cloud-provider repository on GitHub (commit 65aef96cfa4925fceb426b9d3faa6c0d3bb15484).
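
The behavior described above can be illustrated with a simplified model. It is not the actual cloud-provider logic, which compares individual fields; it only shows why clearing the status alone does not lead to a new synchronization. The helper names needs_sync and envoy_spec are hypothetical.

```shell
# Simplified model: a sync is requested only when the spec snapshot changed.
# Differences in status are deliberately ignored, mirroring the behavior above.
needs_sync() {
  old_spec=$1
  new_spec=$2
  [ "$old_spec" != "$new_spec" ]
}

# Capture the current spec of the Envoy service as a single string.
envoy_spec() {
  kubectl get svc envoy -n tanzu-system-ingress -o jsonpath='{.spec}'
}
```

Capturing envoy_spec before and after the kapp-controller reconciliation and passing both values to needs_sync would, under this model, return a non-zero exit code when only the status was cleared, which corresponds to the service remaining in Pending.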

Resolution

Restart the Cloud Provider Interface (CPI) pod in the guest cluster to trigger a resynchronization between the Supervisor vmservice and the guest cluster service.

Steps:

kubectl get deploy -A | grep guest-cluster-cloud-provider
kubectl rollout restart deploy guest-cluster-cloud-provider -n <cloud provider namespace>
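
After the restart, it may help to confirm that the external IP has been written back to the Envoy service. The following is a hedged sketch: the helper names get_envoy_ip and wait_for_ip, the attempt count, and the polling interval are illustrative assumptions, not part of the product.

```shell
# Print the Envoy service's load balancer IP(s); prints nothing while Pending.
get_envoy_ip() {
  kubectl get svc envoy -n tanzu-system-ingress \
    -o jsonpath='{.status.loadBalancer.ingress[*].ip}'
}

# Poll a getter command until it prints a non-empty value or attempts run out.
# $1 = getter command, $2 = maximum attempts, $3 = seconds between attempts
wait_for_ip() {
  getter=$1
  attempts=$2
  delay=$3
  i=0
  while [ "$i" -lt "$attempts" ]; do
    ip=$($getter 2>/dev/null)
    if [ -n "$ip" ]; then
      echo "$ip"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

For example, once kubectl rollout status deploy guest-cluster-cloud-provider -n <cloud provider namespace> reports completion, wait_for_ip get_envoy_ip 30 10 polls for up to about five minutes and prints the address once it is assigned.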

Additional Information

Assigning a static IP address is a supported and recommended configuration for Ingress Controllers (such as Contour) to ensure DNS record stability. To maintain security, organizations should not grant developer-level roles the patch verb on services/status, because this permission could be misused to alter load balancer endpoints and hijack traffic.
When a Service of type LoadBalancer is configured without a static IP address, the load balancer may allocate a new IP address after the namespace is restored from backup, because the request is treated as a new service provisioning event.
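
The services/status restriction mentioned above can be spot-checked with kubectl auth can-i. This is a minimal sketch, assuming the caller is allowed to impersonate the user being checked; the helper name can_patch_service_status and the example user dev-user are hypothetical.

```shell
# Return 0 if the given user may patch the status subresource of Services.
# $1 = user name to impersonate, $2 = namespace to check
can_patch_service_status() {
  kubectl auth can-i patch services --subresource=status \
    --as "$1" -n "$2"
}
```

Calling can_patch_service_status dev-user tanzu-system-ingress prints yes or no; a yes for a developer-level account indicates that the role grants more than the guidance above recommends.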