Worker nodes failing to start after startup of TKGi cluster, etcd post-start script failed

Article ID: 403926

Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

The environment was shut down as outlined in "Shutting down and restarting Tanzu Kubernetes Grid Integrated Edition".

On startup, the Master nodes start successfully, but startup of the Worker nodes fails while validating the health of etcd:

Task 564189 | 11:05:13 | L executing post-start: master/########-####-####-####-############ (0) (canary) (00:04:10)
                       L Error: Action Failed get_task: Task 6333b5e9-7f48-490b-6fc5-311eab3f3653 result: 1 of 5 post-start scripts failed. Failed Jobs: etcd. Successful Jobs: bosh-dns, kubernetes-roles, kube-apiserver, pks-nsx-t-ncp.

Cause

The etcd process is running on all 3 Master nodes, and all 3 members are listed in the etcd cluster configuration:

# etcdctl member list -w table
+------------------+---------+--------------------------------------+------------------------------------------+------------------------------------------+------------+
|        ID        | STATUS  |                 NAME                 |                PEER ADDRS                |               CLIENT ADDRS               | IS LEARNER |
+------------------+---------+--------------------------------------+------------------------------------------+------------------------------------------+------------+
| 17f206fd866fdab2 | started | ########-####-####-####-############ | https://master-0.etcd.cfcr.internal:2380 | https://master-0.etcd.cfcr.internal:2379 |      false |
| 8f18440d0ccf8bf9 | started | ########-####-####-####-############ | https://master-1.etcd.cfcr.internal:2380 | https://master-1.etcd.cfcr.internal:2379 |      false |
| fce4f52fecd850d5 | started | ########-####-####-####-############ | https://master-2.etcd.cfcr.internal:2380 | https://master-2.etcd.cfcr.internal:2379 |      false |
+------------------+---------+--------------------------------------+------------------------------------------+------------------------------------------+------------+

 

However, etcd is unhealthy on 2 of the nodes, and the etcdctl client cannot connect to them:

# etcdctl endpoint --cluster health -w table
{"level":"warn","ts":"2025-07-09T11:21:37.437621Z","logger":"client","caller":"[email protected]/retry_interceptor.go:63","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00032e000/master-2.etcd.cfcr.internal:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: authentication handshake failed: remote error: tls: internal error\""}
{"level":"warn","ts":"2025-07-09T11:21:37.437542Z","logger":"client","caller":"[email protected]/retry_interceptor.go:63","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00032e780/master-0.etcd.cfcr.internal:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
+------------------------------------------+--------+--------------+---------------------------+
|                 ENDPOINT                 | HEALTH |     TOOK     |           ERROR           |
+------------------------------------------+--------+--------------+---------------------------+
| https://master-1.etcd.cfcr.internal:2379 |   true |  20.010757ms |                           |
| https://master-2.etcd.cfcr.internal:2379 |  false | 5.005s       | context deadline exceeded |
| https://master-0.etcd.cfcr.internal:2379 |  false | 5.002047351s | context deadline exceeded |
+------------------------------------------+--------+--------------+---------------------------+
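
When several clusters need checking, it can help to pull the unhealthy endpoints out of this table mechanically rather than by eye. The following is a minimal sketch, not part of the TKGi tooling: unhealthy_endpoints is a hypothetical helper name, and it assumes the table layout shown above, with ENDPOINT in the first column and HEALTH in the second.

```shell
# Hypothetical helper: read the table printed by
# `etcdctl endpoint --cluster health -w table` on stdin and print
# every endpoint whose HEALTH column is "false".
unhealthy_endpoints() {
  awk -F'|' '
    # Table rows have at least 5 pipe-separated fields; border lines
    # ("+----+") and JSON log lines contain no pipes and are skipped.
    NF >= 5 {
      endpoint = $2
      health = $3
      gsub(/[ \t]/, "", endpoint)   # strip column padding
      gsub(/[ \t]/, "", health)
      if (health == "false") print endpoint
    }
  '
}
```

For example, on a Master node: etcdctl endpoint --cluster health -w table | unhealthy_endpoints. With the output shown above, this would list the master-2 and master-0 endpoints.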

 

Resolution

Restart etcd on each of the two unhealthy nodes, then check the cluster health again:

monit stop etcd
monit start etcd
etcdctl endpoint --cluster health -w table
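
etcd may need some time to rejoin the cluster after monit starts it, so a single health check immediately after the restart can still report a failure. The sketch below polls a check command until it succeeds or a retry limit is reached. wait_for_healthy is a hypothetical helper name, the retry count and delay are arbitrary, and it assumes the check command exits non-zero while etcd is unhealthy.

```shell
# Hypothetical helper: wait_for_healthy <attempts> <delay_seconds> <command...>
# Runs <command...> up to <attempts> times, sleeping <delay_seconds>
# between tries. Returns 0 as soon as the command succeeds, 1 otherwise.
wait_for_healthy() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    # Discard the command output; only its exit status matters here.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}
```

For example, after monit start etcd, running wait_for_healthy 12 10 etcdctl endpoint --cluster health would retry for up to about two minutes before giving up.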

 

If etcd is still unhealthy after the restart, please contact Broadcom Tanzu Support for assistance.