Telegraf metric-sink pods crash with Back-off pulling image error after upgrading TKGI
Article ID: 335094
Products
VMware Tanzu Kubernetes Grid
Issue/Introduction
Symptoms:
After upgrading TKGI, the Telegraf metric-sink pods fail with the error below, shown in the pod events:
Events:
  Type    Reason   Age                     From     Message
  ----    ------   ----                    ----     -------
  Normal  BackOff  4m18s (x547 over 129m)  kubelet  Back-off pulling image "cnabu-docker-local.artifactory.eng.vmware.com/oratos/telegraf:1a3337bb81890b3ca0848b5dd456
Disk usage on the BOSH VMs has already been checked to rule out the Telegraf image having been deleted because of disk space consumption.
Restarting or deleting the pod does not fix the issue.
Cause
This is a known issue: after a TKGI upgrade, the metric-controller fails to update the Telegraf role, rolebinding, and deployment, so the Telegraf pods are not updated to use the new Telegraf image.
Resolution
The fix is available in TKGI 1.17.0, TKGI 1.16.3, and TKGI 1.15.6. These releases automatically update the Telegraf pods with the new image.
Workaround: Export the metricsinks.pksapi.io CRD to a YAML file, delete the current CRD, and then re-apply it from the exported YAML file. The new Telegraf pods should then be created successfully with the new Telegraf image.
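The workaround steps could be sketched with kubectl roughly as follows. This is a minimal illustration, not an official procedure. The function name, the file names, and the backup directory argument are hypothetical. The extra export of the existing MetricSink objects is an assumption added here as a precaution, because deleting a CRD also deletes the custom resources of that kind. The article does not mention this step.

```shell
#!/usr/bin/env bash
# Sketch of the workaround. Run with kubectl pointed at the affected cluster.
set -euo pipefail

CRD_NAME="metricsinks.pksapi.io"   # CRD name as given in the article

# Hypothetical helper: export, delete, and re-apply the CRD.
recreate_metricsink_crd() {
  local backup_dir="${1:-.}"   # where the exported YAML files are written

  # 1. Export the current CRD definition to a YAML file.
  kubectl get crd "$CRD_NAME" -o yaml > "$backup_dir/metricsinks-crd.yaml"

  # Assumption (not in the article): keep a copy of the existing MetricSink
  # objects too, since deleting the CRD removes them from the cluster.
  kubectl get "$CRD_NAME" --all-namespaces -o yaml > "$backup_dir/metricsinks-objects.yaml"

  # 2. Delete the current CRD.
  kubectl delete crd "$CRD_NAME"

  # 3. Re-apply the CRD from the exported file. If the API server rejects it,
  #    server-populated metadata in the export may need to be removed first.
  kubectl apply -f "$backup_dir/metricsinks-crd.yaml"
}
```

Calling `recreate_metricsink_crd /tmp/metricsink-backup` (with the directory created beforehand) would run the three steps in order. Afterwards, the Telegraf pods can be listed with `kubectl get pods` in the namespace where they run to confirm that they were recreated with the new image.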