BOSH operations on nodes, such as "bosh deploy manifest",
fail with post-start script errors for the telemetry-agent-image and/or csi-images jobs:
Error: Action Failed get_task: Task 1d57e7f8-bb61-44bc-50fa-2bf53d0f778d result: 1 of 6 post-start scripts failed. Failed Jobs: telemetry-agent-image. Successful Jobs: bosh-dns, kubelet, csi-images, load-images, sink-resources-images.
On the node where the script fails, /var/vcap/sys/log/telemetry-agent-image/post-start.stderr.log
and/or /var/vcap/sys/log/csi-images/post-start.stderr.log
shows:
ctr: failed to dial "/var/vcap/sys/run/containerd/containerd.sock": connection error: desc = "transport: error while dialing: dial unix /var/vcap/sys/run/containerd/containerd.sock: connect: connection refused"
This issue affects all TKGi versions prior to 1.21.0.
A brief, sporadic unavailability of the containerd socket causes the post-start scripts of the telemetry-agent-image and/or csi-images jobs to fail.
These scripts lack the logic needed to tolerate a short outage of the containerd socket.
The post-start scripts of other jobs have retry mechanisms, so they should not be affected.
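To illustrate the kind of retry logic described above, here is a minimal sketch. It is not the code used by any TKGi job; the function name `retry`, the attempt count, and the delay are assumptions chosen for illustration only.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not taken from the TKGi job scripts):
# run a command, retrying up to <attempts> times with <delay> seconds between tries.
retry() {
  local attempts="$1" delay="$2"
  shift 2
  local n=1
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "giving up after ${n} attempts: $*" >&2
      return 1
    fi
    echo "attempt ${n} failed, retrying in ${delay}s: $*" >&2
    n=$((n + 1))
    sleep "$delay"
  done
}

# Example use (assumed values): wait until containerd answers on its socket
# before attempting to load images.
# retry 10 3 ctr --address /var/vcap/sys/run/containerd/containerd.sock version
```

A script that wraps its containerd calls this way would ride out a socket outage shorter than roughly attempts × delay seconds instead of failing on the first refused connection.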
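As a workaround, log in to the affected node and run the failed post-start script(s) manually as root:
# bosh -d <deployment-id> ssh <node-id>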
# sudo -i
# find / -name "post-start" | grep telemetry
# find / -name "post-start" | grep csi
# /var/vcap/data/jobs/telemetry-agent-image/<id>/bin/post-start
# /var/vcap/data/jobs/csi-images/<id>/bin/post-start
Example:
# find / -name "post-start" | grep telemetry
/var/vcap/data/jobs/telemetry-agent-image/00c1dfb4a48f26dd0e333c92c7b245bcf86f1d05/bin/post-start
# find / -name "post-start" | grep csi
/var/vcap/data/jobs/csi-images/e910669941d80c9a5877d40409def0e2f1c85783/bin/post-start
# /var/vcap/data/jobs/telemetry-agent-image/00c1dfb4a48f26dd0e333c92c7b245bcf86f1d05/bin/post-start
[Mon Jan 6 02:46:51 PM UTC 2025] Loading cached container: /var/vcap/packages/telemetry-agent-image/pkstelemetrybot_telemetry-agent:65db853.tar
unpacking docker.io/pkstelemetrybot/telemetry-agent:latest (sha256:2043771f05caab18ce35214052d6e3eafb334d1858aa5ed4afad2b1785179dba)...done
[Mon Jan 6 02:46:54 PM UTC 2025] Successfully loaded container: /var/vcap/packages/telemetry-agent-image/pkstelemetrybot_telemetry-agent:65db853.tar
# /var/vcap/data/jobs/csi-images/e910669941d80c9a5877d40409def0e2f1c85783/bin/post-start
[Mon Jan 6 02:47:52 PM UTC 2025] Loading cached image: /var/vcap/packages/csi/container-images/gcr.io_cloud-provider-vsphere_csi_release_syncer:v3.1.2.tar
unpacking gcr.io/cloud-provider-vsphere/csi/release/syncer:v3.1.2 (sha256:1c7d2ac07bfb6c95dba26d0ea8133ae9c9a38ba48be132811105f08035e0203e)...done
[Mon Jan 6 02:47:54 PM UTC 2025] Successfully loaded image: /var/vcap/packages/csi/container-images/gcr.io_cloud-provider-vsphere_csi_release_syncer:v3.1.2.tar
[Mon Jan 6 02:47:54 PM UTC 2025] Loading cached image: /var/vcap/packages/csi/container-images/gcr.io_k8s-staging-sig-storage_snapshot-controller:v6.2.2.tar
unpacking gcr.io/k8s-staging-sig-storage/snapshot-controller:v6.2.2 (sha256:71500f91ddc8e2c6abd1019bdc06eaf3fde9f072376cffd3d78bdc95aaf49a60)...done
[Mon Jan 6 02:47:55 PM UTC 2025] Successfully loaded image: /var/vcap/packages/csi/container-images/gcr.io_k8s-staging-sig-storage_snapshot-controller:v6.2.2.tar
[Mon Jan 6 02:47:55 PM UTC 2025] Loading cached image: /var/vcap/packages/csi/container-images/registry.k8s.io_sig-storage_snapshot-validation-webhook:v6.2.2.tar
unpacking registry.k8s.io/sig-storage/snapshot-validation-webhook:v6.2.2 (sha256:92bca6a86fcb9bd2e9751f8c562f1cde3b573e0c117f65eb355e0633de9914bc)...done
[Mon Jan 6 02:47:56 PM UTC 2025] Successfully loaded image: /var/vcap/packages/csi/container-images/registry.k8s.io_sig-storage_snapshot-validation-webhook:v6.2.2.tar
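The manual steps above could also be wrapped in a small helper, sketched here under stated assumptions. The function name `run_post_start` is hypothetical, and the sketch assumes the rendered job scripts live under `<jobs-dir>/<job>/<id>/bin/post-start`, as in the find output above. If more than one `<id>` directory exists for a job, every match is run, so check the find output first.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run the post-start script of each named job found
# under a jobs directory. Returns non-zero if any script fails.
run_post_start() {
  local jobs_dir="$1"
  shift
  local job script rc=0
  for job in "$@"; do
    for script in "$jobs_dir/$job"/*/bin/post-start; do
      # Skip unmatched globs and non-executable files.
      [ -x "$script" ] || continue
      echo "Running $script"
      "$script" || rc=1
    done
  done
  return "$rc"
}

# Example use as root on the affected node:
# run_post_start /var/vcap/data/jobs telemetry-agent-image csi-images
```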
Improvements to the telemetry-agent-image and csi-images post-start scripts will be included in TKGi 1.21.0.