Unable to access prometheus port via cluster VIP

Article ID: 384734


Updated On:

Products

VMware Tanzu Kubernetes Grid Management
VMware Telco Cloud Automation

Issue/Introduction

Prometheus monitoring is configured with a NodePort service type, exposing port 30145 on all Kubernetes nodes.

Testing the port with curl against the cluster VIP times out:

curl -kv http://KUBEVIP:30145 --connect-timeout 3
*   Trying KUBEVIP:30145...

* curl: (28) Failed to connect to KUBEVIP port 30145: Operation timed out

Testing each node IP individually shows that the KUBEVIP and all control plane IPs time out, while worker node IPs respond normally.
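The per-node check can be scripted as a small curl loop (a sketch; KUBEVIP, CONTROLPLANE_IP and WORKER_IP below are placeholders for the cluster's actual addresses):

```shell
# Sweep the NodePort across a list of addresses; the names passed at the
# bottom are placeholders, substitute the real VIP and node IPs.
check_nodeport() {
  port=$1; shift
  for ip in "$@"; do
    if curl -s -o /dev/null --connect-timeout 3 --max-time 5 "http://$ip:$port"; then
      echo "$ip:$port OK"
    else
      echo "$ip:$port FAIL"
    fi
  done
}
check_nodeport 30145 KUBEVIP CONTROLPLANE_IP WORKER_IP
```

In the failure scenario described here, the VIP and control plane addresses print FAIL while the worker addresses print OK.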

Other clusters with the same configuration work as expected.

 

Environment

TKGm 2.x

TCA 3.x

Calico CNI

Cause

During the initial configuration of the cluster, a second network interface is allocated to the worker nodes only. Calico auto-detects an address on this interface for the workers, so worker and control plane nodes advertise peer addresses from different networks.

This isolates the control plane nodes from the BGP node-to-node mesh: they cannot complete routing when traffic (such as NodePort traffic landing on a control plane node) has to be forwarded to a pod on another node.

There are two methods to confirm the issue:

Using calicoctl:

Download the calicoctl tool on the control plane and worker nodes, following the external article https://docs.tigera.io/calico/latest/operations/calicoctl/install

Execute on the control plane and worker nodes:

sudo calicoctl node status
Calico process is running.

IPv4 BGP status
+---------------+-------------------+-------+------------+-------------+
| PEER ADDRESS  |     PEER TYPE     | STATE |   SINCE    |    INFO     |
+---------------+-------------------+-------+------------+-------------+
| 10.xxx.xx.228 | node-to-node mesh | up    | 2024-12-18 | Established |
| 10.xxx.xx.219 | node-to-node mesh | up    | 2024-12-18 | Established |
| 172.xx.xx.31  | node-to-node mesh | start | 2024-12-18 | Passive     |
| 172.xx.xx.94  | node-to-node mesh | start | 10:43:10   | Passive     |
| 172.xx.xx.89  | node-to-node mesh | start | 10:44:28   | Passive     |
+---------------+-------------------+-------+------------+-------------+

The output identifies peers from two different networks, and the State/Info columns (start/Passive) show that the mesh is partitioned: sessions to the peers on the other network were never established.
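The Info column can also be filtered mechanically. A minimal sketch, fed here from the captured sample output; on a node, pipe the live `sudo calicoctl node status` output into the function instead:

```shell
# Print the BGP peers whose session is not Established.
bad_peers() {
  awk -F'|' '/node-to-node mesh/ {
    gsub(/ /, "", $2); gsub(/ /, "", $6)
    if ($6 != "Established") print $2
  }'
}
bad_peers <<'EOF'
| 10.xxx.xx.228 | node-to-node mesh | up    | 2024-12-18 | Established |
| 10.xxx.xx.219 | node-to-node mesh | up    | 2024-12-18 | Established |
| 172.xx.xx.31  | node-to-node mesh | start | 2024-12-18 | Passive     |
| 172.xx.xx.94  | node-to-node mesh | start | 10:43:10   | Passive     |
| 172.xx.xx.89  | node-to-node mesh | start | 10:44:28   | Passive     |
EOF
```

For the sample above this prints only the three 172.xx peers, i.e. the peers on the other network.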

By checking the Calico node annotation on all nodes:

kubectl get nodes -oyaml | grep  projectcalico.org/IPv4Address

      projectcalico.org/IPv4Address: 10.xxx.xx.228/25
      projectcalico.org/IPv4Address: 10.xxx.xx.219/25
      projectcalico.org/IPv4Address: 10.xxx.xx.253/25
      projectcalico.org/IPv4Address: 172.xx.xx.94/22
      projectcalico.org/IPv4Address: 172.xx.xx.31/22
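A rough way to spot the split from these annotations is to count distinct networks. A sketch using the first octet as a crude network key (fed here from the captured sample; for live data, pipe in `kubectl get nodes -oyaml | grep projectcalico.org/IPv4Address`):

```shell
# Count distinct networks (keyed by first octet, a heuristic) among the
# advertised Calico addresses; more than one suggests a split mesh.
count_calico_nets() {
  awk '/projectcalico.org\/IPv4Address/ {
    split($2, o, "."); nets[o[1]] = 1
  }
  END { n = 0; for (k in nets) n++; print n }'
}
count_calico_nets <<'EOF'
      projectcalico.org/IPv4Address: 10.xxx.xx.228/25
      projectcalico.org/IPv4Address: 10.xxx.xx.219/25
      projectcalico.org/IPv4Address: 172.xx.xx.94/22
      projectcalico.org/IPv4Address: 172.xx.xx.31/22
EOF
```

A healthy cluster with a single peering network prints 1; the sample above prints 2.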

Resolution

Set IP_AUTODETECTION_METHOD in the kubernetes-services-endpoint ConfigMap, specifying the desired interface (eth0 in this example):

kubectl get cm -n kube-system kubernetes-services-endpoint -oyaml

apiVersion: v1
data:
  IP_AUTODETECTION_METHOD: interface=eth0
kind: ConfigMap
metadata:
  name: kubernetes-services-endpoint
  namespace: kube-system

Apply the configuration, then restart the Calico controller and node agents.
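A sketch of this apply-and-restart step, assuming the standard Calico workload names (the calico-node DaemonSet and calico-kube-controllers Deployment in kube-system; verify the names in your cluster first):

```shell
# Set the autodetection method in the ConfigMap, then restart Calico so the
# node agents re-detect their peering address on eth0.
kubectl patch cm kubernetes-services-endpoint -n kube-system \
  --type merge -p '{"data":{"IP_AUTODETECTION_METHOD":"interface=eth0"}}'
kubectl rollout restart daemonset/calico-node -n kube-system
kubectl rollout restart deployment/calico-kube-controllers -n kube-system
kubectl rollout status daemonset/calico-node -n kube-system --timeout=5m
```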

Once applied, verify that the node annotations are updated and that calicoctl shows a fully established node-to-node mesh:

sudo calicoctl node status
Calico process is running.

IPv4 BGP status
+---------------+-------------------+-------+----------+-------------+
| PEER ADDRESS  |     PEER TYPE     | STATE |  SINCE   |    INFO     |
+---------------+-------------------+-------+----------+-------------+
| 10.xxx.xx.228 | node-to-node mesh | up    | 10:50:58 | Established |
| 10.xxx.xx.219 | node-to-node mesh | up    | 10:50:58 | Established |
| 10.xxx.xx.148 | node-to-node mesh | up    | 10:50:58 | Established |
| 10.xxx.xx.231 | node-to-node mesh | up    | 10:50:58 | Established |
| 10.xxx.xx.156 | node-to-node mesh | up    | 10:50:58 | Established |
+---------------+-------------------+-------+----------+-------------+