When attempting to create a ClusterMetricSink resource on TKGI 1.19.x using a configuration that was previously used on 1.18.x, applying the configuration YAML may fail with the following error:
Error from server (InternalError): error when creating "clustermetricsink-config.yaml": Internal error occurred: failed calling webhook "metric.validator.pksapi.io": failed to call webhook: Post "https://validator.pks-system.svc:443/metricsink?timeout=10s": EOF
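For context, a ClusterMetricSink definition of the kind this article describes generally takes the shape below. The metadata name and plugin choices are placeholders only, and the spec layout follows the inputs/outputs pattern documented for TKGI metric sinks:

apiVersion: pksapi.io/v1beta1
kind: ClusterMetricSink
metadata:
  name: example-cluster-metric-sink    # placeholder name
spec:
  inputs:
  - type: cpu                          # Telegraf input plugin (illustrative)
  outputs:
  - type: file                         # Telegraf output plugin (illustrative)
    files:
    - stdout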
In the API server logs, you may notice entries similar to the following:
169898:W1120 17:02:03.040542 6 dispatcher.go:217] Failed calling webhook, failing closed metric.validator.pksapi.io: failed calling webhook "metric.validator.pksapi.io": failed to call webhook: Post "https://validator.pks-system.svc:443/metricsink?timeout=10s": context deadline exceeded
169280:Trace[1110904728]: ["Call validating webhook" configuration:validator.pksapi.io,webhook:metric.validator.pksapi.io,resource:pksapi.io/v1beta1, Resource=clustermetricsinks,subresource:,operation:CREATE,UID:78bdbe84-c2a2-42cb-a3a3-7f417d2a0956 10000ms (12:02:51.010)]
169925:W1120 12:034:33.433821 6 dispatcher.go:217] Failed calling webhook, failing closed metric.validator.pksapi.io: failed calling webhook "metric.validator.pksapi.io": failed to call webhook: Post "https://validator.pks-system.svc:443/metricsink?timeout=10s": EOF
In the Telegraf logs, you may see errors similar to the following:
2024-11-28T09:50:50Z E! [inputs.kubernetes] Error in plugin:https://127.0.0.1:10250/stats/summary returned HTTP status 403 Forbidden
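The 403 from the kubelet's /stats/summary endpoint indicates that the service account running Telegraf lacks the nodes/proxy permission added in the resolution below. One way to confirm this is an impersonated authorization check; the service account name and namespace in this sketch are assumptions based on a typical pks-system deployment, so adjust them to match your cluster:

# Expected to print "no" while the permission is missing (service account name/namespace assumed)
kubectl auth can-i get nodes/proxy --as=system:serviceaccount:pks-system:telegraf
kubectl auth can-i list pods --as=system:serviceaccount:pks-system:telegraf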
You may also notice that metrics are no longer arriving from preconfigured ClusterMetricSink resources.
The issue is still under investigation, but it is believed to be caused by the Telegraf version change from 1.13.2 in TKGI 1.18 to 1.29.5 in TKGI 1.19.
To resolve the issue, the ClusterRoles involved need to include the necessary permissions for listing and watching pod and namespace resources.
First, add the following rules to the cluster roles in your configuration (do not delete the existing cluster roles from the configuration):
- apiGroups:
  - ""
  resources:
  - pods
  - namespaces
  verbs:
  - watch
  - list
- apiGroups:
  - ""
  resources:
  - nodes/proxy
  verbs:
  - get
  - watch
  - list
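For orientation, these entries belong under the rules: key of a ClusterRole. A minimal sketch of the resulting shape is shown below; the metadata name is a placeholder, not a name taken from this article:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: example-metric-sink-reader    # placeholder name
rules:
- apiGroups:
  - ""
  resources:
  - pods
  - namespaces
  verbs:
  - watch
  - list
- apiGroups:
  - ""
  resources:
  - nodes/proxy
  verbs:
  - get
  - watch
  - list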
Second, edit the telegraf cluster role using a command such as: kubectl edit clusterrole telegraf
Then add the following rules (do not delete the existing rules from the cluster role):
- apiGroups:
  - ""
  resources:
  - pods
  - namespaces
  verbs:
  - watch
  - list
- apiGroups:
  - ""
  resources:
  - nodes/proxy
  verbs:
  - get
  - watch
  - list
After making these changes, creating the ClusterMetricSink resource should succeed, and metrics should begin arriving at your configured destination as expected.
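As a quick verification, re-apply the configuration that previously failed and confirm the resource was admitted; the file name below is the one from the error message above:

# Re-apply the ClusterMetricSink configuration
kubectl apply -f clustermetricsink-config.yaml

# Confirm the resource now exists
kubectl get clustermetricsinks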