LoadBalancer services fail to provision in a Tanzu Kubernetes Grid cluster running on Azure


Article ID: 316953


Updated On:

Products

VMware Tanzu Kubernetes Grid

Issue/Introduction

Symptoms:
  • When you enable OIDC on a Tanzu Kubernetes Grid (TKG) management cluster, the pinniped app enters a "Reconcile failed" state and the dexsvc service (type LoadBalancer) remains Pending because load balancer provisioning fails. When you describe the service, you will see output similar to the following:

kubectl -n tanzu-system-auth describe svc dexsvc

Name:                     dexsvc
Namespace:                tanzu-system-auth
Labels:                   app=dex
                          kapp.k14s.io/app=1618977512920384658
                          kapp.k14s.io/association=v1.07cd6e14046aafeb4d8a3195cb353a1d
Annotations:              kapp.k14s.io/identity: v1;tanzu-system-auth//Service/dexsvc;v1
                          kapp.k14s.io/original:
                            {"apiVersion":"v1","kind":"Service","metadata":{"annotations":{},"labels":{"app":"dex","kapp.k14s.io/app":"1618977512920384658","kapp.k14s...
                          kapp.k14s.io/original-diff-md5: 3ba1829de15e3013270fed06cad5b893
Selector:                 app=dex,kapp.k14s.io/app=1618977512920384658
Type:                     LoadBalancer
IP Families:              <none>
IP:                       100.##.#.##
IPs:                      100.##.#.##
Port:                     dex  443/TCP
TargetPort:               https/TCP
NodePort:                 dex  30113/TCP
Endpoints:                100.1##.##.##:5556
Session Affinity:         None
External Traffic Policy:  Cluster
Events:
  Type     Reason                  Age                 From                Message
  ----     ------                  ----                ----                -------
  Normal   EnsuringLoadBalancer    2m1s (x8 over 12m)  service-controller  Ensuring load balancer
  Warning  SyncLoadBalancerFailed  2m (x8 over 12m)    service-controller  Error syncing load balancer: failed to ensure load balancer: not a vmss instance

  • When you create a service of type LoadBalancer on a TKG cluster deployed to an existing VNET, the service is stuck in a Pending state and never receives an ExternalIP. When you describe the service, you will see output similar to the following:

kubectl describe svc nginx-svc                                

Name:                     nginx-svc
Namespace:                default
Labels:                   run=nginx
Annotations:              <none>
Selector:                 run=nginx
Type:                     LoadBalancer
IP Families:              <none>
IP:                       100.##.##.##
IPs:                      100.##.##.##
Port:                     <unset>  80/TCP
TargetPort:               80/TCP
NodePort:                 <unset>  32075/TCP
Endpoints:                100.##.#.#:##
Session Affinity:         None
External Traffic Policy:  Cluster
Events:
  Type     Reason                  Age                   From                Message
  ----     ------                  ----                  ----                -------
  Normal   EnsuringLoadBalancer    2m34s (x10 over 23m)  service-controller  Ensuring load balancer
  Warning  SyncLoadBalancerFailed  2m34s (x10 over 22m)  service-controller  Error syncing load balancer: failed to ensure load balancer: nsg "azure-wlkd-prod-node-nsg" not found

  • This issue may also occur when using an existing network with pre-provisioned NSGs.
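To confirm whether a cluster node is affected by the first error, you can inspect the vmType value in the Azure cloud-provider configuration on the node. The sketch below uses a temporary sample file for illustration; on a real node you would run the grep against /etc/kubernetes/azure.json directly (e.g. over SSH):

```shell
# Sketch: check which vmType the Azure cloud provider is configured with.
# On a real TKG node the file to inspect is /etc/kubernetes/azure.json;
# a sample copy is created here so the check can be demonstrated standalone.
AZURE_JSON=$(mktemp)
cat > "$AZURE_JSON" <<'EOF'
{
  "cloud": "AzurePublicCloud",
  "vmType": "vmss"
}
EOF

# A value of "vmss" on standalone (non-scale-set) VMs is what triggers
# the "not a vmss instance" error during load balancer provisioning.
grep -o '"vmType": "[a-z]*"' "$AZURE_JSON"
```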


Environment

VMware Tanzu Kubernetes Grid 1.x

Resolution

This issue is resolved in Tanzu Kubernetes Grid 1.3.1.

Workaround:
To work around the first issue, where the LoadBalancer service fails with the error "Error syncing load balancer: failed to ensure load balancer: not a vmss instance", use a ytt overlay on TKG 1.3.0 that forces "vmType": "standard" in /etc/kubernetes/azure.json. Create the overlay file at ~/.tanzu/tkg/providers/infrastructure-azure/ytt/azure-overlay.yaml.

Sample overlay file:
#@ load("@ytt:overlay", "overlay")

#@overlay/match by=overlay.subset({"kind":"KubeadmConfigTemplate"}),expects="1+"
---
spec:
  #@overlay/match missing_ok=True
  template:
    #@overlay/match missing_ok=True
    spec:
      #@overlay/match missing_ok=True
      preKubeadmCommands:
      #@overlay/append
      - "if [ -f /etc/kubernetes/azure.json ]; then sed -i 's/\"vmType\": \"vmss\"/\"vmType\": \"standard\"/' /etc/kubernetes/azure.json; fi"

#@overlay/match by=overlay.subset({"kind":"KubeadmControlPlane"})
---
spec:
  #@overlay/match missing_ok=True
  kubeadmConfigSpec:
    #@overlay/match missing_ok=True
    preKubeadmCommands:
    #@overlay/append
    - "if [ -f /etc/kubernetes/azure.json ]; then sed -i 's/\"vmType\": \"vmss\"/\"vmType\": \"standard\"/' /etc/kubernetes/azure.json; fi"
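The preKubeadmCommands entry in the overlay rewrites the vmType in azure.json before kubeadm runs on each node. A minimal sketch of what that sed substitution does, run against a temporary sample file rather than a real node's config:

```shell
# Demonstrate the substitution performed by the overlay's preKubeadmCommands.
# On a real node the target is /etc/kubernetes/azure.json; a temporary
# sample copy is used here for illustration.
AZURE_JSON=$(mktemp)
cat > "$AZURE_JSON" <<'EOF'
{
  "cloud": "AzurePublicCloud",
  "vmType": "vmss"
}
EOF

# Same command as in the overlay, pointed at the sample file.
# (Note: GNU sed syntax, as used on Linux cluster nodes.)
if [ -f "$AZURE_JSON" ]; then
  sed -i 's/"vmType": "vmss"/"vmType": "standard"/' "$AZURE_JSON"
fi

grep '"vmType"' "$AZURE_JSON"
```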

To work around the second issue, a Network Security Group (NSG) must be created on the existing VNET. When a cluster is created on an existing VNET, the Azure cloud provider looks for an NSG for the nodes; creating it allows LoadBalancer services to receive an external IP. This requirement will be added to the official documentation.

The name of the NSG has to be in the form <clustername>-node-nsg. For example, if your cluster name is azure-wlkd-dev, you need to create a Network Security Group on your Resource Group in the same region with the name azure-wlkd-dev-node-nsg. There is no need to modify the Inbound and Outbound rules when you create the NSG.
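The naming convention above can be scripted. A sketch, using the example cluster name from this article; the az command is shown commented out, and RESOURCE_GROUP and LOCATION are placeholders you would set for your own environment:

```shell
# Derive the expected NSG name from the cluster name: <clustername>-node-nsg
CLUSTER_NAME="azure-wlkd-dev"        # example cluster name from the article
NSG_NAME="${CLUSTER_NAME}-node-nsg"
echo "$NSG_NAME"

# Create the NSG with default rules (no inbound/outbound changes needed).
# RESOURCE_GROUP and LOCATION are placeholders for your environment.
# az network nsg create \
#   --resource-group "$RESOURCE_GROUP" \
#   --name "$NSG_NAME" \
#   --location "$LOCATION"
```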

Once the NSG is created, the LoadBalancer service will receive an external IP, and any new service of type LoadBalancer you create will automatically be added to the NSG.