TKGi telemetry-agent-image and csi-images post-start scripts fail

Article ID: 383743

Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

BOSH operations on nodes, such as "bosh deploy manifest", fail because the post-start scripts of the telemetry-agent-image and/or csi-images jobs fail:

Error: Action Failed get_task: Task 1d57e7f8-bb61-44bc-50fa-2bf53d0f778d result: 1 of 6 post-start scripts failed. Failed Jobs: telemetry-agent-image. Successful Jobs: bosh-dns, kubelet, csi-images, load-images, sink-resources-images.

On the node where the script fails, /var/vcap/sys/log/telemetry-agent-image/post-start.stderr.log and/or /var/vcap/sys/log/csi-images/post-start.stderr.log show:

ctr: failed to dial "/var/vcap/sys/run/containerd/containerd.sock": connection error: desc = "transport: error while dialing: dial unix /var/vcap/sys/run/containerd/containerd.sock: connect: connection refused"

 

Environment

All TKGi versions prior to 1.21.0

Cause

A brief, sporadic unavailability of the containerd socket causes the post-start scripts of the telemetry-agent-image and/or csi-images jobs to fail.

These scripts lack the logic needed to tolerate a short unavailability of the containerd socket.

The post-start scripts of other jobs have retry mechanisms, so they are not expected to be affected.
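
To illustrate the kind of retry logic the affected scripts are missing, the following is a minimal sketch of a generic retry helper. It is not taken from the TKGi release and is not the fix shipped by the product; the function name retry and its arguments (maximum attempts, delay in seconds) are hypothetical. The commented usage line assumes that "ctr version" is an acceptable probe for the containerd socket named in the error message.

```shell
#!/bin/sh
# Hypothetical helper (not part of TKGi): run a command until it succeeds,
# giving up after a fixed number of attempts.
#   $1 = maximum number of attempts
#   $2 = delay in seconds between attempts
#   remaining arguments = the command to run
retry() {
  local attempts="$1"
  local delay="$2"
  shift 2
  local n=1
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "command failed after ${n} attempts: $*" >&2
      return 1
    fi
    echo "attempt ${n} failed, retrying in ${delay}s: $*" >&2
    n=$((n + 1))
    sleep "$delay"
  done
}

# Assumed usage on a node: wait until containerd answers on its socket
# before loading images (up to 10 attempts, 3 seconds apart).
# retry 10 3 ctr --address /var/vcap/sys/run/containerd/containerd.sock version
```

With a helper like this, a socket that is refused for a few seconds would delay the script instead of failing it.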

Resolution

Workaround

  1. Log in to the node where the failure occurs and switch to the root user:
    # bosh -d <deployment-id> ssh <node-id>
    # sudo -i

  2. Locate the telemetry-agent-image and/or csi-images post-start scripts:
    # find / -name "post-start" | grep telemetry
    # find / -name "post-start" | grep csi

    Then run the scripts manually, using the paths returned by find:

    # /var/vcap/data/jobs/telemetry-agent-image/<id>/bin/post-start
    # /var/vcap/data/jobs/csi-images/<id>/bin/post-start

    For example:

    # find / -name "post-start" | grep telemetry
    /var/vcap/data/jobs/telemetry-agent-image/00c1dfb4a48f26dd0e333c92c7b245bcf86f1d05/bin/post-start

    # find / -name "post-start" | grep csi
    /var/vcap/data/jobs/csi-images/e910669941d80c9a5877d40409def0e2f1c85783/bin/post-start

    # /var/vcap/data/jobs/telemetry-agent-image/00c1dfb4a48f26dd0e333c92c7b245bcf86f1d05/bin/post-start
    [Mon Jan  6 02:46:51 PM UTC 2025] Loading cached container: /var/vcap/packages/telemetry-agent-image/pkstelemetrybot_telemetry-agent:65db853.tar
    unpacking docker.io/pkstelemetrybot/telemetry-agent:latest (sha256:2043771f05caab18ce35214052d6e3eafb334d1858aa5ed4afad2b1785179dba)...done
    [Mon Jan  6 02:46:54 PM UTC 2025] Successfully loaded container: /var/vcap/packages/telemetry-agent-image/pkstelemetrybot_telemetry-agent:65db853.tar

    # /var/vcap/data/jobs/csi-images/e910669941d80c9a5877d40409def0e2f1c85783/bin/post-start
    [Mon Jan  6 02:47:52 PM UTC 2025] Loading cached image: /var/vcap/packages/csi/container-images/gcr.io_cloud-provider-vsphere_csi_release_syncer:v3.1.2.tar
    unpacking gcr.io/cloud-provider-vsphere/csi/release/syncer:v3.1.2 (sha256:1c7d2ac07bfb6c95dba26d0ea8133ae9c9a38ba48be132811105f08035e0203e)...done
    [Mon Jan  6 02:47:54 PM UTC 2025] Successfully loaded image: /var/vcap/packages/csi/container-images/gcr.io_cloud-provider-vsphere_csi_release_syncer:v3.1.2.tar
    [Mon Jan  6 02:47:54 PM UTC 2025] Loading cached image: /var/vcap/packages/csi/container-images/gcr.io_k8s-staging-sig-storage_snapshot-controller:v6.2.2.tar
    unpacking gcr.io/k8s-staging-sig-storage/snapshot-controller:v6.2.2 (sha256:71500f91ddc8e2c6abd1019bdc06eaf3fde9f072376cffd3d78bdc95aaf49a60)...done
    [Mon Jan  6 02:47:55 PM UTC 2025] Successfully loaded image: /var/vcap/packages/csi/container-images/gcr.io_k8s-staging-sig-storage_snapshot-controller:v6.2.2.tar
    [Mon Jan  6 02:47:55 PM UTC 2025] Loading cached image: /var/vcap/packages/csi/container-images/registry.k8s.io_sig-storage_snapshot-validation-webhook:v6.2.2.tar
    unpacking registry.k8s.io/sig-storage/snapshot-validation-webhook:v6.2.2 (sha256:92bca6a86fcb9bd2e9751f8c562f1cde3b573e0c117f65eb355e0633de9914bc)...done
    [Mon Jan  6 02:47:56 PM UTC 2025] Successfully loaded image: /var/vcap/packages/csi/container-images/registry.k8s.io_sig-storage_snapshot-validation-webhook:v6.2.2.tar
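
The locate-and-run steps above can also be sketched as a small helper. This is an illustration only, under stated assumptions: the function name run_post_start is hypothetical, and the default search root /var/vcap/data/jobs is taken from the example output above rather than from product documentation. The helper runs every executable match it finds, so if more than one <id> directory exists for a job, review the find output before using it.

```shell
#!/bin/sh
# Hypothetical helper (not part of TKGi): find and run the post-start
# script of one BOSH job.
#   $1 = job name, for example csi-images
#   $2 = jobs root (assumed default: /var/vcap/data/jobs)
# Returns 0 if at least one script ran and all succeeded, 1 otherwise.
run_post_start() {
  local job="$1"
  local jobs_root="${2:-/var/vcap/data/jobs}"
  local script
  local found=1
  for script in "$jobs_root/$job"/*/bin/post-start; do
    # Skip the unexpanded pattern and anything that is not executable.
    [ -x "$script" ] || continue
    found=0
    echo "Running $script"
    "$script" || return 1
  done
  return "$found"
}

# Assumed usage, as root on the affected node:
# run_post_start telemetry-agent-image
# run_post_start csi-images
```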

Fix

Improvements to the telemetry-agent-image and csi-images post-start scripts will be included in TKGi 1.21.0.