In AKO environment, pools are down due to health check failing with "Server Unreachable" error.

Article ID: 380731

Updated On:

Products

VMware Avi Load Balancer

Issue/Introduction

In an environment where AKO (Avi Kubernetes Operator) is deployed, pools may be marked down with the reason "Server unreachable".
The reason can be seen in the UI by navigating to the pool used by the virtual service, clicking on the "Servers" tab, and hovering your mouse over the red circle indicating the server state.

This error indicates that the Service Engine does not have a route to reach the pool server network. It can occur when the pool servers are not on the same L2 network as the Service Engine and no route to the pool server network has been applied.

Cause

This can be caused by labels being added to a Service Engine Group while the routes needed to reach the pool servers are in a VRF context that does not have a matching label.

Routes in a VRF will only be applied to Service Engines in a Service Engine Group that have a matching label.
When AKO is deployed and the serviceType is set to ClusterIP, labels will be applied to the Service Engine Group.
If the Default Gateway (or any needed route) does not have the same label as the Service Engine Group, the route will not be added to the service engine.
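
For example, for the default route to be pushed, the same key/value pair would need to appear on both the Service Engine Group and the VRF context (an illustrative sketch, not output from this environment):

| labels [1]                              |
|   key                                   | clustername
|   value                                 | test-cluster

If the Service Engine Group carries a label that the VRF context does not, the routes in that VRF context are not programmed on the Service Engines in that group.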

Resolution

First, identify the connectivity issue by connecting to the SE CLI, entering the namespace used for the data interfaces, and checking the routes.
You can use the admin account to SSH directly to the SE's management interface, or use a web console (in a vCenter environment) to access the SE CLI.
Once in the SE's CLI, verify the namespaces in use by entering the command:

ip netns

This will display the default namespace (avi_ns1) and any additional namespaces in use. (The avi_poll namespaces are not applicable to this troubleshooting.)
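
For example, the output may look similar to the following (the hostname and namespace list are illustrative and will vary by deployment):

admin@Avi-se-uqsvg:~$ ip netns
avi_ns1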

For example, to enter the avi_ns1 namespace, use the command:

sudo ip netns exec avi_ns1 bash

After authenticating, use the ifconfig command to verify you are in the namespace. You should see all the data interfaces (avi_eth1 - avi_eth9).
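
If the full ifconfig output is too long to read easily, a quick way to list only the data interface names is shown below (a minimal example; adjust the pattern if your interfaces are named differently):

root@Avi-se-uqsvg:/home/admin# ifconfig -a | grep -o '^avi_eth[0-9]*'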

Now verify there is no connectivity to the pool servers by trying to ping a pool server's IP address from within the namespace:

root@Avi-se-uqsvg:/home/admin# ping 10.206.21.10
ping: connect: Network is unreachable

If you get a "Network is unreachable" error, use the route command to check whether there is a route or default gateway to the pool server network.

root@Avi-se-uqsvg:/home/admin# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
10.206.252.0    0.0.0.0         255.255.252.0   U     0      0        0 avi_eth3

As shown above, there is no default route present and the pool server is in a different subnet, which is causing the "Server unreachable" error.
However, a default gateway is present in the configuration (which can be verified in the UI under Infrastructure > Cloud Resources > VRF Context).

This may indicate a labeling issue.

The next step is to verify which labels are applied to the Service Engine Group and to the VRF context.
To do this, SSH to the Controller CLI.
SSH to the Controller leader node and type shell to enter the Avi shell.
Enter the command below to return the Service Engine Group settings:

show serviceenginegroup [service engine group name]

If a label is present, it will appear in the list of settings returned:

| labels [1]                              |
|   key                                   | clustername
|   value                                 | test-cluster

Next, use the show vrfcontext command to check whether any labels are applied to the VRF context containing the route.

[admin:controller]: > show vrfcontext global

+------------------+-------------------------------------------------+
| Field            | Value                                           |
+------------------+-------------------------------------------------+
| uuid             | vrfcontext-6e5693f2-f28b-40b3-8dc0-7e1057e0e555 |
| name             | global                                          |
| static_routes[1] |                                                 |
|   prefix         | 0.0.0.0/0                                       |
|   next_hop       | 10.206.252.1                                    |
|   route_id       | 1                                               |
| system_default   | True                                            |
| lldp_enable      | True                                            |
| tenant_ref       | admin                                           |
| cloud_ref        | vCenter                                         |
+------------------+-------------------------------------------------+

Here you can see there is no "labels" setting for the VRF context. Therefore, this route will not be applied to the Service Engines in the group above, because the VRF context does not have a matching label.

A temporary workaround is to remove the label from the Service Engine Group.
This can be done in the same shell with the commands below (use the index number of the label as shown in the Service Engine Group settings):

[admin:controller]: > configure serviceenginegroup Default-Group
[admin:controller]: serviceenginegroup> no labels index 1
[admin:controller]: serviceenginegroup> save
[admin:controller]: >

After the label has been removed, there are two ways to have the route change picked up by the SEs.
1) The SE can be rebooted (when it comes back online, the route will be pushed to the SE).
2) The current route can be edited (removed and re-added, or changed to a different IP and then changed back). This will cause the route configuration to be re-pushed to the SE.

Once the route change has been pushed to the SE, you can access the SE namespace as outlined earlier and verify the route now exists.

root@Avi-se-uqsvg:/home/admin# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         10.206.252.1    0.0.0.0         UG    30000  0        0 avi_eth3
10.206.252.0    0.0.0.0         255.255.252.0   U     0      0        0 avi_eth3
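
Once the default route is present, you can also re-run the ping from within the namespace to confirm that the pool server (the example address used earlier) is now reachable:

root@Avi-se-uqsvg:/home/admin# ping -c 3 10.206.21.10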

Additional Information

When AKO is deployed, there is a values.yaml file which contains a setting for "serviceType".
This can be configured as NodePort, ClusterIP, or NodePortLocal. (These settings are case sensitive.)
When the serviceType is set to ClusterIP and disableStaticRouteSync is set to False, labels are required and will be added to the SE group.
When the serviceType is set to ClusterIP and disableStaticRouteSync is set to True, labels will not be added to the SE group.
Labels will not be added when using serviceType NodePort or NodePortLocal.
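
For reference, the relevant keys in a typical AKO values.yaml look similar to the following (a minimal excerpt; key names and layout can vary between AKO versions, so check the values.yaml shipped with your AKO release):

AKOSettings:
  clusterName: test-cluster          # used as the label value applied to the SE group
  disableStaticRouteSync: false      # false = static routes are synced and labels are required
L7Settings:
  serviceType: ClusterIP             # NodePort | ClusterIP | NodePortLocal (case sensitive)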

When deploying a TKG management cluster, there is an optional section where you can specify labels.
https://docs.vmware.com/en/VMware-Tanzu-for-Kubernetes-Operations/2.3/tko-reference-architecture/GUID-deployment-guides-tko-on-vsphere.html#deploy-tanzu-kubernetes-grid-tkg-management-cluster-19

"Cluster Labels: Optional. Leave the cluster labels section empty to apply the above workload cluster network settings by default. If you specify any label here, you must specify the same values in the configuration YAML file of the workload cluster. Else, the system places the endpoint VIP of your workload cluster in Management Cluster Data Plane VIP Network by default."

To determine whether labels will be added on AKO deployment, check the values.yaml file of the AKO deployment to see which serviceType is set, or, when deploying a TKG Management cluster, check the labels specified in the UI configuration prompts.
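
If you have kubectl access to the cluster, you can also check the running AKO configuration directly (this assumes the default AKO namespace avi-system and ConfigMap name avi-k8s-config; adjust the names for your deployment):

kubectl get configmap avi-k8s-config -n avi-system -o yaml | grep -iE 'servicetype|disablestaticroutesync'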