After Upgrade to TKGI 1.19.x, ClusterMetricSink Creation Fails: 'Failed Calling Webhook metric.validator.pksapi.io'



Article ID: 382961



Products

VMware Tanzu Kubernetes Grid Integrated (TKGi)

Issue/Introduction

Creating a ClusterMetricSink resource on TKGI 1.19.x with a configuration that worked on 1.18.x may fail with the following error when the configuration YAML is applied:

Error from server (InternalError): error when creating "clustermetricsink-config.yaml": Internal error occurred: failed calling webhook "metric.validator.pksapi.io": failed to call webhook: Post "https://validator.pks-system.svc:443/metricsink?timeout=10s": EOF

In the API server logs, you may notice entries similar to the following:

169898:W1120 17:02:03.040542       6 dispatcher.go:217] Failed calling webhook, failing closed metric.validator.pksapi.io: failed calling webhook "metric.validator.pksapi.io": failed to call webhook: Post "https://validator.pks-system.svc:443/metricsink?timeout=10s": context deadline exceeded
169280:Trace[1110904728]: ["Call validating webhook" configuration:validator.pksapi.io,webhook:metric.validator.pksapi.io,resource:pksapi.io/v1beta1, Resource=clustermetricsinks,subresource:,operation:CREATE,UID:78bdbe84-c2a2-42cb-a3a3-7f417d2a0956 10000ms (12:02:51.010)]
169925:W1120 12:034:33.433821       6 dispatcher.go:217] Failed calling webhook, failing closed metric.validator.pksapi.io: failed calling webhook "metric.validator.pksapi.io": failed to call webhook: Post "https://validator.pks-system.svc:443/metricsink?timeout=10s": EOF

In the Telegraf logs, you may see errors similar to the following:

2024-11-28T09:50:50Z E! [inputs.kubernetes] Error in plugin:https://127.0.0.1:10250/stats/summary returned HTTP status 403 Forbidden

You may also notice that metrics are no longer arriving from previously configured ClusterMetricSink resources.
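The symptoms above can be gathered in one pass with a few read-only kubectl commands. This is a minimal sketch rather than an official procedure: the webhook configuration name (validator.pksapi.io), the Service name (validator), and the namespace (pks-system) come from the error text, while the workload names deployment/validator and daemonset/telegraf and the helper name collect_sink_diagnostics are assumptions that may need adjusting after listing the pods.

```shell
# Sketch: collect evidence for the failing webhook and the Telegraf 403 errors.
# Assumption: the validator runs as deployment/validator and Telegraf as
# daemonset/telegraf in pks-system; list the pods first and adjust if not.
# collect_sink_diagnostics is a hypothetical helper name.
collect_sink_diagnostics() {
  # Webhook registration and the Service it points at (names from the error text)
  kubectl get validatingwebhookconfiguration validator.pksapi.io -o yaml
  kubectl get service validator -n pks-system
  kubectl get endpoints validator -n pks-system

  # Pod status, then recent logs from both components
  kubectl get pods -n pks-system -o wide
  kubectl logs -n pks-system deployment/validator --tail=100
  kubectl logs -n pks-system daemonset/telegraf --tail=100 | grep 'E!'
}

# Usage (run against the affected cluster):
#   collect_sink_diagnostics
```

An empty endpoints list or a restarting validator pod would be consistent with the EOF and timeout errors shown above.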

 

Cause

The issue is still under investigation. The suspected cause is the Telegraf version change from 1.13.2 in TKGI 1.18 to 1.29.5 in TKGI 1.19.

Resolution

To resolve this issue, update both the validator and telegraf ClusterRoles to include permissions to list and watch pods and namespaces, and to get, list, and watch nodes/proxy.

First, edit the validator ClusterRole using a command like: kubectl edit clusterrole validator -n pks-system

Add the following rules (do not delete the existing rules from the configuration):

- apiGroups:
  - ""
  resources:
  - pods
  - namespaces
  verbs:
  - watch
  - list 
- apiGroups:
  - ""
  resources:
  - nodes/proxy
  verbs:
  - get
  - watch
  - list

Second, edit the telegraf ClusterRole using a command like: kubectl edit clusterrole telegraf

Add the following rules (do not delete the existing rules from the configuration):

- apiGroups:
  - ""
  resources:
  - pods
  - namespaces
  verbs:
  - watch
  - list 
- apiGroups:
  - ""
  resources:
  - nodes/proxy
  verbs:
  - get
  - watch
  - list
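Where an interactive edit is inconvenient, the same two rules could be appended to each ClusterRole with a JSON patch. This is a sketch under stated assumptions, not part of the original procedure: add_sink_rules is a hypothetical helper name, and the patch assumes each ClusterRole already has a rules list so that the append path /rules/- is valid.

```shell
# Sketch: append the two rules shown above without opening an editor.
# Assumption: the target ClusterRole already has a "rules" list, so the
# JSON Patch path "/rules/-" (append to the end of the array) resolves.
RULES_PATCH='[
  {"op": "add", "path": "/rules/-", "value": {
    "apiGroups": [""],
    "resources": ["pods", "namespaces"],
    "verbs": ["watch", "list"]}},
  {"op": "add", "path": "/rules/-", "value": {
    "apiGroups": [""],
    "resources": ["nodes/proxy"],
    "verbs": ["get", "watch", "list"]}}
]'

# add_sink_rules is a hypothetical helper name, not a TKGI or kubectl command.
# $1 is the ClusterRole to patch.
add_sink_rules() {
  kubectl patch clusterrole "$1" --type=json -p "$RULES_PATCH"
}

# Usage (run against the affected cluster):
#   add_sink_rules validator
#   add_sink_rules telegraf
```

Running the helper twice appends the rules twice. Duplicate rules do not change the permissions granted, but they clutter the role, so check the current rules before patching.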

After applying these updates, creating a new ClusterMetricSink should succeed, and metrics should begin arriving at your configured destination as expected.
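Before re-creating the sink, the new permissions can be checked directly. The sketch below assumes Telegraf runs as the ServiceAccount telegraf in the pks-system namespace. That identity is an assumption, not something stated in this article, so confirm it against the subjects of the relevant ClusterRoleBinding. check_sink_rbac is a hypothetical helper name.

```shell
# Sketch: confirm the rules are present and usable by the Telegraf identity.
# Assumption: Telegraf runs as ServiceAccount "telegraf" in "pks-system".
SA="system:serviceaccount:pks-system:telegraf"

# check_sink_rbac is a hypothetical helper name.
check_sink_rbac() {
  # Show the current rules of both ClusterRoles
  kubectl get clusterrole validator telegraf -o yaml

  # Each command prints "yes" or "no" for the impersonated identity
  kubectl auth can-i list pods --all-namespaces --as="$SA"
  kubectl auth can-i watch namespaces --as="$SA"
  kubectl auth can-i get nodes --subresource=proxy --as="$SA"
}

# Usage (run against the affected cluster):
#   check_sink_rbac
```

The nodes/proxy check uses --subresource=proxy because kubectl auth can-i interprets a slash in the resource argument as a resource name, not a subresource.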