NSX-ALB VIP health-check failure with "Overlapping routes found" when using ClusterIP mode in TKG



Article ID: 372626


Updated On: 07-30-2024

Products

VMware Tanzu Kubernetes Grid, VMware NSX Advanced Load Balancer

Issue/Introduction

Symptoms

  • VIP access fails because the AVI health check has already failed
  • Access works when the service is switched to NodePort
  • Static route entries appear to exist in the ALB controller UI
  • The AKO pod log reports "error: Overlapping routes found":
# kubectl -n avi-system logs ako-0
2024-07-08T08:23:03.451Z WARNm	rest/rest_operation.go:304	key: admin/global, msg: RestOp method PATCH path /api/vrfcontext/vrfcontext-d19e04cc-a0bf-45e6-9557-f579ff6494f5 tenant admin Obj {"static_routes":[{"next_hop":{"addr":"192.168.13.11","type":"V4"}...},....  returned err {"code":0,"message":"map[error:Overlapping routes found]","Verb":"PATCH","Url":"https://####//api/vrfcontext/vrfcontext-d19e04cc-a0bf-45e6-9557-f579ff6494f5","HttpStatusCode":400} with response null
  • ALB controller UI shows "Failure: Overlapping routes found"
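The same symptom can be confirmed from the command line by counting the error in the AKO log. This is a minimal sketch, assuming the AKO pod is ako-0 in the avi-system namespace as shown above; the helper name count_overlap_errors is hypothetical.

```shell
# Hypothetical helper: count AKO log lines that contain the
# "Overlapping routes found" error returned by the ALB controller.
count_overlap_errors() {
    # grep -c prints the number of matching lines (0 when none match);
    # "|| true" keeps a zero count from being reported as a failure.
    grep -c 'Overlapping routes found' || true
}

# Usage (needs access to the workload cluster):
#   kubectl -n avi-system logs ako-0 | count_overlap_errors
```

A non-zero count means AKO is still hitting the rejected PATCH against the VRF context.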

Environment

  • All TKG versions
  • NSX-ALB used in ClusterIP mode (a dedicated AVI-SE per k8s cluster)

Cause

AKO cannot register new static routes on the ALB controller because old routing entries with the wrong label already exist. As a result, the AVI-SE cannot install the correct routing information from the ALB controller.

  • The ALB controller UI does not show label information, so you cannot confirm there whether a routing entry is valid
    • The UI only shows the static route entries (ALB controller UI --> Infrastructure --> Cloud Resources --> VRF Context --> global --> Edit --> Static Route)
    • The "Additional Information" section of this KB explains how to check routing entries and their labels with the CLI instead

Resolution

1. Clean up the old routing entries whose next hop is a k8s node IP address

  • AVI-UI --> Infrastructure --> Cloud Resources --> VRF Context --> Select target VRF (default:global) --> Edit --> Static Route
  • Find the entries whose "Next Hop" matches a k8s node IP address
  • Click the trash icon next to each target routing entry
  • Click "SAVE"
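Matching next hops against node IPs by eye is error-prone on larger clusters. The sketch below only lists candidate routes whose next hop is a node IP; deletion is still done in the UI as described above. It assumes the routes were copied into a hypothetical file routes.txt with one "prefix next_hop" pair per line (for example from the CLI output in "Additional Information"); the helper name routes_for_nodes is hypothetical.

```shell
# Hypothetical helper: print the static routes whose next hop is one of
# the given k8s node IPs, i.e. the candidates for deletion in step 1.
# stdin:     one "prefix next_hop" pair per line
# arguments: node IP addresses
routes_for_nodes() {
    awk -v ips="$*" '
        # Build a lookup set from the space-separated node IP list.
        BEGIN { n = split(ips, list, " "); for (i = 1; i <= n; i++) node[list[i]] = 1 }
        # Keep only routes whose next hop is in that set.
        ($2 in node) { print $1, $2 }
    '
}

# Usage (needs access to the workload cluster):
#   routes_for_nodes $(kubectl get nodes \
#     -o jsonpath='{.items[*].status.addresses[?(@.type=="InternalIP")].address}') < routes.txt
```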

2. Restart the AKO pod. AKO then recreates fresh routing entries

kubectl -n avi-system delete pod ako-0
kubectl -n avi-system get pods        # Wait until ako-0 shows STATUS Running
kubectl -n avi-system logs ako-0      # Confirm that no errors are logged
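Step 2 can also be scripted. This is a sketch, assuming that ako-0 is recreated automatically by its StatefulSet after deletion (as the -0 suffix suggests) and that kubectl points at the affected workload cluster; the helper name restart_ako, the retry count, and the timeouts are arbitrary choices, not values from this KB.

```shell
# Hypothetical helper: delete ako-0, wait for the recreated pod to become
# Ready, then print any AKO log lines that mention errors.
restart_ako() {
    ns=avi-system
    # "kubectl delete" waits for the old pod to be gone by default.
    kubectl -n "$ns" delete pod ako-0 >/dev/null || return 1
    # "kubectl wait" fails at once while the new pod object does not exist
    # yet, so retry it a few times before giving up.
    tries=0
    until kubectl -n "$ns" wait --for=condition=Ready pod/ako-0 --timeout=30s >/dev/null 2>&1; do
        tries=$((tries + 1))
        [ "$tries" -ge 10 ] && { echo "ako-0 did not become Ready" >&2; return 1; }
        sleep 5
    done
    # Empty output means no error or overlapping-route lines were logged.
    kubectl -n "$ns" logs ako-0 | grep -iE 'error|overlapping' || true
}
```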

3. Check VIP access

  • All virtual service health checks turn green
  • The expected routing entries are installed on the AVI-SE
  • VIP access works
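VIP access can be checked from a client by requesting the VIP and looking only at the HTTP status code. This is a minimal sketch, assuming the virtual service serves plain HTTP on port 80; vip_status and its VIP argument are hypothetical placeholders.

```shell
# Hypothetical helper: print the HTTP status code returned by a VIP.
# $1 is the VIP address or FQDN (a placeholder, not taken from this KB).
vip_status() {
    # -s: no progress output; -o /dev/null: discard the body;
    # -w '%{http_code}': print only the status code ("000" if nothing answered);
    # --max-time 5: give up after 5 seconds.
    curl -s -o /dev/null -w '%{http_code}' --max-time 5 "http://$1/" || true
}

# Usage:
#   vip_status "$VIP"
```

A "000" result points back to the routing or health-check problem; any real HTTP code shows that the VIP answers.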

Additional Information

Troubleshooting Steps

Check the target node IP addresses

kubectl get nodes -owide
#> NAME                                            STATUS    INTERNAL-IP    
#> workload-cluster-A-75hv8-4qpvk                  Ready     192.168.13.11  
#> workload-cluster-A-md-0-92b72-7c67d684c-57rhn   Ready     192.168.13.47  
#> workload-cluster-A-md-0-92b72-7c67d684c-jtrms   Ready     192.168.13.51  
#> workload-cluster-A-md-0-92b72-7c67d684c-nmzmp   Ready     192.168.13.56  
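The node names and internal IPs can be extracted from this output for later comparison with the next_hop values. This is a sketch; the helper node_ips is hypothetical and locates the INTERNAL-IP column by its header, so it also works with the full -owide column set (ROLES, AGE, VERSION and so on).

```shell
# Hypothetical helper: print "NAME INTERNAL-IP" for every node from
# "kubectl get nodes -owide" output read on stdin.
node_ips() {
    awk 'NR == 1 { for (i = 1; i <= NF; i++) if ($i == "INTERNAL-IP") col = i; next }
         col     { print $1, $col }'
}

# Usage (needs access to the workload cluster):
#   kubectl get nodes -owide | node_ips
```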

 

Check the current AKO configuration

kubectl -n avi-system get cm avi-k8s-config -oyaml | grep -E 'cloudName:|clusterName:|cniPlugin:|controllerVersion:|serviceEngineGroupName:|serviceType:'
#> cloudName: Default-Cloud
#> clusterName: workload-cluster-A
#> cniPlugin: antrea
#> controllerVersion: 22.1.2
#> serviceEngineGroupName: test-seg
#> serviceType: ClusterIP
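The clusterName value is what the clustername label of each static route on the controller should match. Below is a small sketch for pulling single values out of the configmap output; the helper ako_cfg is hypothetical and assumes simple unquoted "key: value" lines like those above.

```shell
# Hypothetical helper: print the value of one key from "key: value" lines
# on stdin, e.g. the avi-k8s-config output above.
ako_cfg() {
    # Leading indentation is ignored because awk splits on whitespace.
    awk -v key="$1" '$1 == (key ":") { print $2 }'
}

# Usage (needs access to the workload cluster):
#   kubectl -n avi-system get cm avi-k8s-config -oyaml | ako_cfg clusterName
#   kubectl -n avi-system get cm avi-k8s-config -oyaml | ako_cfg serviceType
```

kubectl can also return a single key directly with -o jsonpath='{.data.clusterName}', which avoids text parsing altogether.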

 

Check the routing table in the AVI-SE

ssh admin@$AVI_SE_IPADDR

sudo ip netns list
#> avi_ns1 (id: 0)
#> avi_poll_ns1
#> avi_poll_ns2

sudo ip netns exec avi_ns1 ip route
#> 100.96.0.0/24 via 192.168.13.11 dev avi_eth8 metric 30000 (<--- Required, but missing while the issue occurs)
#> 100.96.1.0/24 via 192.168.13.51 dev avi_eth8 metric 30000 (<--- Required, but missing while the issue occurs)
#> 100.96.2.0/24 via 192.168.13.47 dev avi_eth8 metric 30000 (<--- Required, but missing while the issue occurs)
#> 100.96.3.0/24 via 192.168.13.56 dev avi_eth8 metric 30000 (<--- Required, but missing while the issue occurs)
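To see which routes are missing on the SE, the expected routes can be compared with the ip route output. This is a sketch, assuming the expected "prefix next_hop" pairs are kept in a hypothetical file expected.txt (for example taken from the controller's static_routes) and the SE output was saved to another hypothetical file se_routes.txt; the helper missing_routes is hypothetical.

```shell
# Hypothetical helper: print expected routes that are missing on the SE.
# $1:    non-empty file with the expected "prefix next_hop" pairs, one per line
# stdin: output of "ip route" from the SE namespace (avi_ns1 above)
missing_routes() {
    awk 'NR == FNR   { want[$1 " " $2] = 1; next }  # first input: expected list
         $2 == "via" { have[$1 " " $3] = 1 }        # "prefix via next_hop dev ..."
         END { for (r in want) if (!(r in have)) print r }' "$1" - | sort
}

# Usage:
#   missing_routes expected.txt < se_routes.txt
```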

 

Check the routing information on the ALB controller

ssh admin@$ALB_CONTROLLER_IPADDR

admin@avi-controller:~$ shell
Login: admin
Password:

[admin:avi-controller]: > show vrfcontext global
Multiple objects found for this query.
        [0]: vrfcontext-f790b561-e7f2-49ee-b08c-425aac54e286#global in tenant admin, Cloud Default-Cloud
        [1]: vrfcontext-69cacb9b-c92f-4432-b038-57a4ce1196d2#global in tenant admin, Cloud TEST-CLOUD
Select one: 0 # <------ Choose the number of the target cloud
+------------------+-------------------------------------------------+
| Field            | Value                                           |
+------------------+-------------------------------------------------+
| uuid             | vrfcontext-f790b561-e7f2-49ee-b08c-425aac54e286 |
| name             | global                                          |
| static_routes[1] |                                                 |
|   prefix         | 100.96.0.0/24                                   |
|   next_hop       | 192.168.13.11                                   |
|   route_id       | workload-cluster-A-1                            |
|   labels[1]      |                                                 |
|     key          | clustername                                     |
|     value        | workload-cluster-A                              | # CHECK: Does the value match "clusterName" in the AKO config?
| static_routes[2] |                                                 |
|   prefix         | 100.96.2.0/24                                   |
|   next_hop       | 192.168.13.47                                   | # CHECK: Is any next_hop duplicated?
|   route_id       | workload-cluster-A-2                            |
|   labels[1]      |                                                 |
|     key          | clustername                                     |
|     value        | workload-cluster-A                              |
| static_routes[3] |                                                 |
|   prefix         | 100.96.1.0/24                                   |
|   next_hop       | 192.168.13.51                                   |
|   route_id       | workload-cluster-A-3                            |
|   labels[1]      |                                                 |
|     key          | clustername                                     |
|     value        | workload-cluster-A                              |
| static_routes[4] |                                                 |
|   prefix         | 100.96.3.0/24                                   |
|   next_hop       | 192.168.13.56                                   |
|   route_id       | workload-cluster-A-4                            |
|   labels[1]      |                                                 |
|     key          | clustername                                     |
|     value        | workload-cluster-A                              |
| system_default   | True                                            |
| lldp_enable      | True                                            |
| tenant_ref       | admin                                           |
| cloud_ref        | Default-Cloud                                   |
+------------------+-------------------------------------------------+
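The two checks marked in the table (the label matches clusterName, and no next_hop is duplicated) can be automated by parsing the saved CLI output. This is a sketch, assuming the table layout shown above with one clustername label per route; the helper check_vrf_routes and the file name vrf.txt are hypothetical. Routes it flags are candidates for the cleanup in Resolution step 1, not automatic deletions.

```shell
# Hypothetical helper: scan the "show vrfcontext" table on stdin and flag
# static routes that (a) carry a clustername label different from the AKO
# clusterName given as $1, or (b) share a next_hop with another route.
check_vrf_routes() {
    awk -F'|' -v cluster="$1" '
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        # Record the route collected so far, then reset for the next one.
        function emit_route() {
            if (prefix == "") return
            if (label != cluster)
                print "LABEL-MISMATCH", id, prefix, hop, "label=" label
            seen[hop] = (seen[hop] ? seen[hop] " " id : id)
            count[hop]++
            id = prefix = hop = label = key = ""
        }
        NF >= 3 {                       # table rows only, not +---+ borders
            f = trim($2); v = trim($3)
            if (f ~ /^static_routes\[/) { emit_route(); inroute = 1; next }
            if (!inroute) next
            if (f == "prefix") prefix = v
            else if (f == "next_hop") hop = v
            else if (f == "route_id") id = v
            else if (f == "key") key = v
            else if (f == "value" && key == "clustername") label = v
            else if (f !~ /^labels\[/) { emit_route(); inroute = 0 }
        }
        END {
            emit_route()
            for (h in count) if (count[h] > 1) print "DUPLICATE-NEXT-HOP", h, seen[h]
        }'
}

# Usage: save the "show vrfcontext global" output, then
#   check_vrf_routes workload-cluster-A < vrf.txt
```

No output means every route carries the expected label and no next hop appears twice.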