
Telegraf exporter is not collecting metrics after upgrading to VMware Tanzu Kubernetes Grid Integrated Edition 1.19.1


Article ID: 376785


Updated On:

Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

  • Telegraf exporter is not collecting metrics from the Kubernetes cluster after upgrading to VMware Tanzu Kubernetes Grid Integrated Edition 1.19.1.
  • One will see "HTTP status 403 Forbidden" messages similar to the following in the Telegraf pod logs:

kubectl -n pks-system logs telegraf-9rlsv

2024-08-20T10:53:45Z I! Loading config: /etc/telegraf/cluster-metric-sinks.conf
2024-08-20T10:53:45Z I! Loading config: /etc/telegraf/telegraf.conf
2024-08-20T10:53:45Z I! Starting Telegraf 1.29.5 brought to you by InfluxData the makers of InfluxDB
2024-08-20T10:53:45Z I! Available plugins: 241 inputs, 9 aggregators, 30 processors, 24 parsers, 60 outputs, 6 secret-stores
2024-08-20T10:53:45Z I! Loaded inputs: cpu disk diskio kubernetes mem net
2024-08-20T10:53:45Z I! Loaded aggregators:
2024-08-20T10:53:45Z I! Loaded processors:
2024-08-20T10:53:45Z I! Loaded secretstores:
2024-08-20T10:53:45Z I! Loaded outputs: prometheus_client
2024-08-20T10:53:45Z I! Tags enabled: cluster_name=pks-cluster-name host=7b7fc874-###-###-####-85129530c869
2024-08-20T10:53:45Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"7b7fc874-###-###-####-85129530c86", Flush Interval:20s
2024-08-20T10:53:45Z W! DeprecationWarning: Value "false" for option "ignore_protocol_stats" of plugin "inputs.net" deprecated since version 1.27.3 and will be removed in 1.36.0: use the 'inputs.nstat' plugin instead
2024-08-20T10:53:45Z I! [outputs.prometheus_client] Listening on http://0.0.0.0:29273/metrics
2024-08-20T10:53:50Z E! [inputs.kubernetes] Error in plugin: https://127.0.0.1:10250/pods returned HTTP status 403 Forbidden
2024-08-20T10:54:00Z E! [inputs.kubernetes] Error in plugin: https://127.0.0.1:10250/pods returned HTTP status 403 Forbidden
2024-08-20T10:54:50Z E! [inputs.kubernetes] Error in plugin: https://127.0.0.1:10250/pods returned HTTP status 403 Forbidden
2024-08-20T10:55:00Z E! [inputs.kubernetes] Error in plugin: https://127.0.0.1:10250/pods returned HTTP status 403 Forbidden
2024-08-20T10:55:10Z E! [inputs.kubernetes] Error in plugin: https://127.0.0.1:10250/pods returned HTTP status 403 Forbidden


  • One will see messages similar to the following in the cluster audit logs:

    master.780c2061-########-##########.2024-08-20-19-16-42.tgz/kube-apiserver/audit/log/audit.log

 

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"RequestResponse","auditID":"628b1cbc-e1dd-416e-882b-d4d290106ff0","stage":"ResponseComplete","requestURI":"/apis/authorization.k8s.io/v1/subjectaccessreviews","verb":"create","user":{"username":"kubelet","uid":"kubelet","groups":["system:authenticated"]},"sourceIPs":["172.##.##.7"],"userAgent":"kubelet/v1.28.9+vmware.1 (linux/amd64) kubernetes/dda3aee","objectRef":{"resource":"subjectaccessreviews","apiGroup":"authorization.k8s.io","apiVersion":"v1"},"responseStatus":{"metadata":{},"code":201},"requestObject":

{"kind":"SubjectAccessReview","apiVersion":"authorization.k8s.io/v1","metadata":{"creationTimestamp":null},"spec":{"resourceAttributes":{"verb":"get","version":"v1","resource":"nodes","subresource":"proxy","name":"7b7fc874-####-####-###-85129530c869"},"user":"system:serviceaccount:pks-system:telegraf","groups":["system:serviceaccounts","system:serviceaccounts:pks-system","system:authenticated"],"extra":{"authentication.kubernetes.io/pod-name":["telegraf-9rlsv"],"authentication.kubernetes.io/pod-uid":["e33a6a73-1690-420f-a77d-6e95ad1be069"]},"uid":"07564e4e-95fe-40ad-a7f8-85971b08aeb5"},"status":{"allowed":false}},"responseObject":{"kind":"SubjectAccessReview","apiVersion":"authorization.k8s.io/v1","metadata":{"creationTimestamp":null,"managedFields":[{"manager":"kubelet","operation":"Update","apiVersion":"authorization.k8s.io/v1",

"time":"2024-08-20T15:06:20Z","fieldsType":"FieldsV1","fieldsV1":{"f:spec":{"f:extra":{".":{},"f:authentication.kubernetes.io/pod-name":{},"f:authentication.kubernetes.io/pod-uid":{}},"f:groups":{},"f:resourceAttributes":{".":{},"f:name":{},"f:resource":{},"f:subresource":{},"f:verb":{},"f:version":{}},"f:uid":{},"f:user":{}}}}]},"spec":{"resourceAttributes":{"verb":"get","version":"v1","resource":"nodes","subresource":"proxy","name":"7b7fc874-####-####-###-85129530c869"},"user":"system:serviceaccount:pks-system:telegraf","groups":["system:serviceaccounts","system:serviceaccounts:pks-system","system:authenticated"],"extra":{"authentication.kubernetes.io/pod-name":["telegraf-9rlsv"],"authentication.kubernetes.io/pod-uid":["e33a6a73-1690-420f-a77d-6e95ad1be069"]},"uid":"07564e4e-95fe-40ad-a7f8-85971b08aeb5"},"status":{"allowed":false}},"requestReceivedTimestamp":"2024-08-20T15:06:20.032859Z","stageTimestamp":"2024-08-20T15:06:20.049360Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":"RBAC: allowed by ClusterRoleBinding \"kubo:internal:kubelet\" of ClusterRole \
"system:node\" to User \"kubelet\""}}


  • The only event with a 403 "Forbidden" response code in the audit logs is related to the healthwatch service account:

{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"c4bfbcf4-9588-49c2-b293-72f78170b3c7","stage":"ResponseComplete","requestURI":"/metrics","verb":"get","user":{"username":"healthwatch","groups":["healthwatch","system:authenticated"]},"sourceIPs":["100.64.0.11"],"userAgent":"Prometheus/2.46.0","responseStatus":{"metadata":{},"status":"Failure","message":"forbidden: User \"healthwatch\" cannot get path \"/metrics\"","reason":"Forbidden","details":{},"code":403},"requestReceivedTimestamp":"2024-08-20T15:15:02.226149Z","stageTimestamp":"2024-08-20T15:15:02.226837Z","annotations":{"authorization.k8s.io/decision":"forbid","authorization.k8s.io/reason":""}}

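To quickly locate the denied SubjectAccessReview entries for the telegraf service account in the extracted audit log, a simple filter such as the following can be used (the audit.log path is illustrative; use the path from your log bundle):

grep '"allowed":false' audit.log | grep 'system:serviceaccount:pks-system:telegraf'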
 

 

Environment

VMware Tanzu Kubernetes Grid Integrated Edition (TKGI).

Cause

  • This is a known issue affecting TKGI 1.19.1 and will be fixed in TKGI 1.20.1.
  • The "telegraf" ClusterRole is missing permissions on the "nodes/proxy" resource; it only grants "get" and "list" on "nodes/stats" and "watch" on "pods", so the kubelet rejects the Telegraf requests with 403 Forbidden. This can be confirmed with the permission check shown after the ClusterRole output below.

    # kubectl get clusterrole telegraf -o yaml

    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      annotations:
        kubectl.kubernetes.io/last-applied-configuration: |
          {"apiVersion":"rbac.authorization.k8s.io/v1","kind":"ClusterRole","metadata":{"annotations":{},"labels":{"metrics":"true","safeToDelete":"true"},"name":"telegraf"},"rules":[{"apiGroups":[""],"resources":["nodes/stats"],"verbs":["get","list"]},{"apiGroups":[""],"resources":["pods"],"verbs":["watch"]}]}
      creationTimestamp: "2024-12-20T05:42:49Z"
      labels:
        metrics: "true"
        safeToDelete: "true"
      name: telegraf
      resourceVersion: "2588"
      uid: 75366dff-0fa9-4692-b06c-7425af5d2c9f
    rules:
    - apiGroups:
      - ""
      resources:
      - nodes/stats
      verbs:
      - get
      - list
    - apiGroups:
      - ""
      resources:
      - pods
      verbs:
      - watch
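
    The missing permission can be confirmed by impersonating the telegraf service account (this requires cluster-admin or equivalent credentials). On an affected cluster, the first command below is expected to return "no" while the second returns "yes":

    # kubectl auth can-i get nodes/proxy --as=system:serviceaccount:pks-system:telegraf
    # kubectl auth can-i get nodes/stats --as=system:serviceaccount:pks-system:telegraf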

Resolution

  • Run the following command to edit the "telegraf" ClusterRole.
  • Type i to enter insert mode.
  • Add the following lines under the "rules" section to add the correct permission.

# kubectl edit clusterrole telegraf

- apiGroups:
  - ""
  resources:
  - nodes/proxy
  verbs:
  - get
  - watch
  - list

  • Press esc to exit insert mode.
    Type :wq to save the change and quit the editor.
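
  • Alternatively, the same rule can be appended non-interactively with a JSON patch. This is a sketch that adds a new entry to the existing "rules" list shown above:

# kubectl patch clusterrole telegraf --type='json' -p='[{"op":"add","path":"/rules/-","value":{"apiGroups":[""],"resources":["nodes/proxy"],"verbs":["get","watch","list"]}}]'

  • After the ClusterRole is updated, the permission check should return "yes" and the Telegraf pod logs should stop reporting "403 Forbidden" errors (substitute the name of one of your telegraf pods):

# kubectl auth can-i get nodes/proxy --as=system:serviceaccount:pks-system:telegraf
# kubectl -n pks-system logs telegraf-9rlsv --tail=20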