Node Pool Creation with Large Number of Nodes Stuck Due to Slow Storage or High CPU on K8s Control-Plane

Article ID: 382554

Updated On: 11-21-2024

Products

VMware Telco Cloud Platform VMware Telco Cloud Platform - 5G Edition VMware Telco Cloud Platform Advanced VMware Telco Cloud Platform Essentials VMware Telco Cloud Platform Essentials for RAN VMware Telco Cloud Platform RAN VMware Telco Cloud Platform Standard MANO VMware Telco Cloud Automation

Issue/Introduction

When creating a control plane along with one or more node pools that contain a large number of nodes (more than 10), the task may get stuck displaying a "processing" status in the UI/API.

Errors in etcd pod logs of the cluster:

The leader failed to send the heartbeat in a timely manner, likely due to being overloaded from a slow disk.
"wal/wal.go:805", "msg": "slow fdatasync", "took": "1.190301394s", "expected-duration": "1s"
etcd/0.log:3595: 2024-11-13T19:05:20.40535408Z stderr F WARNING: 2024/11/13 19:05:20 [core] grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
etcdserver/util.go:166, "msg": "apply request took too long", "took": "104.808029ms", "expected-duration": "100ms", "prefix": "read-only range", "request": "key:\"/registry/ako.vmware.com/clustersets/\" range_end:\"/registry/ako.vmware.com/clustersets0\" count_only:true", "response": "range_response_count:0 size:7"
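
To gauge how often these warnings occur, the etcd log can be scanned for entries whose "took" value exceeds the "expected-duration" value. The following is a minimal Python sketch, assuming the field layout seen in the excerpts above. The function name slow_operations is illustrative and is not part of any VMware or etcd tooling, and only simple single-unit durations such as "1.19s" or "104.8ms" are handled.

```python
import re

# Seconds per unit for the simple duration strings seen in the excerpts.
_UNIT_SECONDS = {"ns": 1e-9, "us": 1e-6, "µs": 1e-6, "ms": 1e-3, "s": 1.0}

# Assumption: durations appear as a quoted "<number><unit>" value.
_FIELD = r'"{name}":\s*"([0-9.]+)(ns|us|µs|ms|s)"'
_TOOK = re.compile(_FIELD.format(name="took"))
_EXPECTED = re.compile(_FIELD.format(name="expected-duration"))
_MSG = re.compile(r'"msg":\s*"([^"]*)"')


def _seconds(match):
    """Convert a matched (number, unit) pair to seconds."""
    return float(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def slow_operations(lines):
    """Yield (message, took_s, expected_s) for each entry that overran."""
    for line in lines:
        took = _TOOK.search(line)
        expected = _EXPECTED.search(line)
        if not (took and expected):
            continue  # not a timing warning, skip
        took_s, expected_s = _seconds(took), _seconds(expected)
        if took_s > expected_s:
            msg = _MSG.search(line)
            yield (msg.group(1) if msg else "", took_s, expected_s)
```

Passing an open log file to slow_operations and counting the results per message gives a rough picture of whether disk syncs ("slow fdatasync") or request handling ("apply request took too long") dominate.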

Environment

TCA 2.3, 3.0, 3.1

TCP 3.x, 4.x, 5.x

Cause

This can occur in two scenarios:

Cluster Creation with Large Node Pools

When a cluster creation request includes one or more node pools with more than 10 nodes, the system attempts to create the control-plane and node pools in parallel. This can put significant load on the etcd and API servers. If the storage used by the control-plane and worker nodes is slow, it can cause extra CPU load due to throttling and scheduling delays. As a result, while some nodes or node pools may be created successfully, the initialization of one of the control-plane nodes might not complete, leaving the overall cluster status stuck in the "processing" state.

Adding Large Node Pools to an Existing Control-Plane

If control-plane nodes are already deployed and stable, and the user attempts to add one or more node pools with a large number of nodes, the system will try to bring up all the nodes simultaneously. Depending on the storage speed, this could place excessive load on etcd and lead to high CPU utilization on the control-plane nodes. As a result, the node pool creation process may get stuck in a "processing" state.

Resolution

Since the creation of the cluster or node pool is a critical, one-time task during CaaS bring-up, it is essential to minimize or avoid any failures that could arise due to slow storage or oversubscription of CPU/memory in the underlying infrastructure.

For a typical cluster setup with 3 control-plane nodes and up to 10 nodes in a node pool, submit a single request that creates both the control plane and the node pool.
Note: The total number of nodes in that request, including both control-plane and node pool nodes, should not exceed 20.
Once the cluster and node pools have been created successfully, scale each node pool to its desired size.
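
The sizing guidance above can be turned into a small planning helper. The following Python sketch is only an illustration: the limits (10 nodes per pool, 20 nodes in total, 3 control-plane nodes) come from this article, while the function name plan_initial_sizes and the round-robin way of sharing the node budget between pools are assumptions, not TCA behavior or a TCA API.

```python
def plan_initial_sizes(desired, control_plane_nodes=3,
                       per_pool_cap=10, total_cap=20):
    """Return the initial size of each node pool for the creation request.

    desired maps a pool name to its final node count. Nodes are handed out
    one at a time in round-robin order (an assumed policy) until every pool
    reaches min(desired, per_pool_cap) or the total budget is used up.
    """
    budget = total_cap - control_plane_nodes
    initial = {name: 0 for name in desired}
    progressed = True
    while budget > 0 and progressed:
        progressed = False
        for name, want in desired.items():
            if budget > 0 and initial[name] < min(want, per_pool_cap):
                initial[name] += 1
                budget -= 1
                progressed = True
    return initial


def remaining_scale_up(desired, initial):
    """Return how many nodes each pool still needs after creation."""
    return {name: desired[name] - initial[name] for name in desired}
```

For example, with desired sizes of 30 and 4 nodes for two pools, the creation request would start them at 10 and 4 nodes, and the first pool would be scaled up by the remaining 20 nodes once the cluster is stable.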
