When a control-plane is created together with one or more node pools that each contain a large number of nodes (more than 10), the task may get stuck displaying a "processing" status in the UI/API.
Errors in etcd pod logs of the cluster:
The leader failed to send the heartbeat in a timely manner, likely due to being overloaded from a slow disk.
"wal/wal.go:805", "msg": "slow fdatasync", "took": "1.190301394s", "expected-duration": "1s"
etcd/0.log:3595: 2024-11-13T19:05:20.40535408Z stderr F WARNING: 2024/11/13 19:05:20 [core] grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
etcdserver/util.go:166, "msg": "apply request took too long", "took": "104.808029ms", "expected-duration": "100ms", "prefix": "read-only range", "request": "key:\"/registry/ako.vmware.com/clustersets/\" range_end:\"/registry/ako.vmware.com/clustersets0\" count_only:true", "response": "range_response_count:0 size:7"
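To check how often etcd exceeds its own thresholds, the log excerpts above can be scanned for entries whose "took" value is larger than the "expected-duration" value. The following is a minimal Python sketch, not part of the product tooling. It assumes the log lines carry the "msg", "took" and "expected-duration" fields in the form shown above. The names parse_go_duration and find_slow_operations are hypothetical helpers introduced only for this illustration.

```python
import re

# Go-style duration units, expressed in seconds.
_UNIT_SECONDS = {
    "ns": 1e-9, "us": 1e-6, "µs": 1e-6, "ms": 1e-3,
    "s": 1.0, "m": 60.0, "h": 3600.0,
}
# Two-letter units are listed before one-letter units so "ms" is not read as "m".
_DURATION_PART = re.compile(r"(\d+(?:\.\d+)?)(ns|us|µs|ms|s|m|h)")
# Only the three fields needed for the comparison are extracted.
_FIELD = re.compile(r'"(msg|took|expected-duration)":\s*"([^"]*)"')


def parse_go_duration(text):
    """Convert a Go duration string such as '1.190301394s' or '100ms' to seconds."""
    parts = _DURATION_PART.findall(text)
    if not parts:
        raise ValueError(f"not a duration: {text!r}")
    return sum(float(value) * _UNIT_SECONDS[unit] for value, unit in parts)


def find_slow_operations(lines):
    """Return one record per log line whose 'took' exceeds 'expected-duration'."""
    slow = []
    for line in lines:
        fields = dict(_FIELD.findall(line))
        if "took" not in fields or "expected-duration" not in fields:
            continue  # line does not carry timing information
        took_s = parse_go_duration(fields["took"])
        expected_s = parse_go_duration(fields["expected-duration"])
        if took_s > expected_s:
            slow.append({
                "msg": fields.get("msg", ""),
                "took_s": took_s,
                "expected_s": expected_s,
            })
    return slow


# One of the log lines quoted in this article, used as sample input.
SAMPLE_LINES = [
    r'"wal/wal.go:805", "msg": "slow fdatasync", "took": "1.190301394s", "expected-duration": "1s"',
]

if __name__ == "__main__":
    for record in find_slow_operations(SAMPLE_LINES):
        ratio = record["took_s"] / record["expected_s"]
        print(f'{record["msg"]}: {record["took_s"]:.3f}s ({ratio:.2f}x expected)')
```

A large number of "slow fdatasync" records during cluster or node pool creation would be consistent with the slow-storage condition described in this article. The sketch only counts threshold violations and does not identify their cause.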
TCA 2.3, 3.0, 3.1
TCP 3.x, 4.x, 5.x
There are a couple of scenarios in which this can occur:
When a cluster creation request includes one or more node pools with more than 10 nodes, the system attempts to create the control-plane and the node pools in parallel. This can put significant load on etcd and the API server. If the storage used by the control-plane and worker nodes is slow, the slow disk I/O can add CPU load through throttling and scheduling delays. As a result, some nodes or node pools may be created successfully while the initialization of one of the control-plane nodes does not complete, leaving the overall cluster status stuck in the "processing" state.
If control-plane nodes are already deployed and stable, and the user attempts to add one or more node pools with a large number of nodes, the system will try to bring up all the nodes simultaneously. Depending on the storage speed, this could place excessive load on etcd and lead to high CPU utilization on the control-plane nodes. As a result, the node pool creation process may get stuck in a "processing" state.
Since the creation of the cluster or node pool is a critical, one-time task during CaaS bring-up, it is essential to minimize or avoid any failures that could arise due to slow storage or oversubscription of CPU/memory in the underlying infrastructure.