Telegraf metric-sink pods crash with Back-off pulling image error after upgrading TKGI

Article ID: 335094

Products

VMware Tanzu Kubernetes Grid

Issue/Introduction

Symptoms:
  • After upgrading TKGI, Telegraf metric-sink pods crash with the following error:
Events:
  Type    Reason   Age                     From     Message
  ----    ------   ----                    ----     -------
  Normal  BackOff  4m18s (x547 over 129m)  kubelet  Back-off pulling image "cnabu-docker-local.artifactory.eng.vmware.com/oratos/telegraf:1a3337bb81890b3ca0848b5dd456 
  • Disk usage on the BOSH VMs has already been checked, confirming that the Telegraf image was not deleted because of disk space consumption.
  • Restarting or deleting the pod does not fix the issue.


Cause

  • This is a known issue: after the TKGI upgrade, the Telegraf pods cannot be updated to use the new Telegraf image because the metric-controller fails to update the Telegraf role, rolebinding, and deployment.

Resolution

The fix is available in TKGI 1.17.0, TKGI 1.16.3, and TKGI 1.15.6. These releases automatically update the Telegraf pods to use the new images.

Workaround:
Export the metricsinks.pksapi.io CRD to a YAML file, delete the current CRD, and then re-apply it from the exported YAML file. The new Telegraf pods should then be created successfully with the new Telegraf image.
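
The workaround steps can be sketched as a small shell function. This is a minimal sketch, not a verified procedure. It assumes kubectl is already pointed at the affected cluster, and it reads "CRD" literally as the CustomResourceDefinition object. The function name recreate_metricsink_crd and the backup file name are hypothetical and are not part of TKGI.

```shell
#!/usr/bin/env bash
# Sketch of the workaround: export, delete, and re-apply the
# metricsinks.pksapi.io CRD. The function name and the default backup
# file name are illustrative assumptions, not part of TKGI.
recreate_metricsink_crd() {
  local crd="metricsinks.pksapi.io"
  local backup="${1:-metricsinks-crd-backup.yaml}"

  # 1. Export the current CRD to a YAML file; stop if the export fails.
  kubectl get crd "$crd" -o yaml > "$backup" || return 1

  # Refuse to delete anything if the export came back empty.
  [ -s "$backup" ] || { echo "export is empty, aborting" >&2; return 1; }

  # 2. Delete the current CRD.
  # Note: deleting a CRD also deletes the custom resources of that type,
  # so back up any existing MetricSink objects first if they are needed.
  kubectl delete crd "$crd" || return 1

  # 3. Re-apply the CRD from the exported file.
  kubectl apply -f "$backup"
}
```

To use it, source the file and run recreate_metricsink_crd, optionally passing a path for the exported YAML. Keeping the exported file after the run leaves a copy of the original definition in case the re-apply needs to be repeated.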