CNF Application Scheduled Auto Backup Fails

Article ID: 385623


Products

VMware Telco Cloud Automation

Issue/Introduction

  • CNF application scheduled backup intermittently fails.
  • Increased disk I/O latency impacts the TKG control plane as soon as the CNF Auto Backup process starts.
  • The only consistent warnings logged in TKG control plane node etcd logs indicate slow disk I/O performance.

    2024-11-22T03:00:13.803904232Z stderr F {"level":"warn","ts":"2024-11-22T03:00:13.803Z","caller":"etcdserver/util.go:166","msg":"apply request took too long","took":"112.621055ms","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/registry/configmaps/xxxx\" ","response":"range_response_count:0 size:8"}

    2024-11-22T03:00:21.541557256Z stderr F {"level":"warn","ts":"2024-11-22T03:00:21.541Z","caller":"etcdserver/raft.go:415","msg":"leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk","to":"xxxxx","heartbeat-interval":"500ms","expected-duration":"1s","exceeded-duration":"663.212927ms"}
    2024-11-22T03:00:21.541570448Z stderr F {"level":"warn","ts":"2024-11-22T03:00:21.541Z","caller":"etcdserver/raft.go:415","msg":"leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk","to":"xxxxx","heartbeat-interval":"500ms","expected-duration":"1s","exceeded-duration":"663.236912ms"}

    2024-11-22T03:00:24.886580645Z stderr F WARNING: 2024/11/22 03:00:24 [core] grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
    2024-11-22T03:00:31.107090921Z stderr F WARNING: 2024/11/22 03:00:30 [core] grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
  • The ESXi host logs show a consistent pattern across all timestamps: disk I/O performance degradation occurs at approximately 3 AM UTC every day, i.e. the time the backup process is scheduled to start.
  • The performance deterioration typically lasts about 1 to 3 minutes before improving and eventually returning to the original levels.

    2024-11-22T03:00:44.751Z cpu95:2098448)WARNING: ScsiDeviceIO: 1513: Device naa.xxxxxxxxxxxxxxx6c00026a7a performance has deteriorated. I/O latency increased from average value of 386 microseconds to 10703 microseconds. (0.010703 seconds)
    2024-11-22T03:00:45.482Z cpu95:2098448)WARNING: ScsiDeviceIO: 1513: Device naa.xxxxxxxxxxxxxxxx6c00026a7a performance has deteriorated. I/O latency increased from average value of 386 microseconds to 22365 microseconds. (0.022365 seconds)
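When triaging this issue, the etcd "apply request took too long" warnings shown above can be filtered programmatically. Below is a minimal sketch (not part of the product, and it assumes the container-runtime `<timestamp> stderr F` line prefix and millisecond-formatted `took` values shown in the excerpt):

```python
import json
import re

EXPECTED_MS = 100.0  # etcd's default expected-duration for read-only range requests

def slow_applies(log_lines):
    """Return (took_ms, request) pairs for applies slower than EXPECTED_MS."""
    slow = []
    for line in log_lines:
        # The container runtime prefixes each line with "<ts> stderr F "; JSON follows.
        start = line.find("{")
        if start == -1:
            continue
        try:
            entry = json.loads(line[start:])
        except json.JSONDecodeError:
            continue
        if entry.get("msg") != "apply request took too long":
            continue
        m = re.fullmatch(r"([\d.]+)ms", entry.get("took", ""))
        if m and float(m.group(1)) > EXPECTED_MS:
            slow.append((float(m.group(1)), entry.get("request", "")))
    return slow

line = ('2024-11-22T03:00:13.803904232Z stderr F '
        '{"level":"warn","ts":"2024-11-22T03:00:13.803Z",'
        '"msg":"apply request took too long","took":"112.621055ms",'
        '"expected-duration":"100ms","request":"key: /registry/configmaps/..."}')
print(slow_applies([line]))  # one entry with took ~112.62 ms
```

Counting how many such warnings cluster around the backup window helps confirm the correlation described above.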

Environment

TCP 3.0
TCA 2.3
TKG 2.x

Cause

  • The errors in the logs indicate that the leader node of the etcd cluster is unable to send heartbeats to its followers in a timely manner because disk I/O on the leader is slow. The slow disk I/O can be attributed to the following:
    • High disk latency: The disk takes longer than expected to respond to read/write requests.
    • Overloaded disk: Too many concurrent operations are being performed on the disk, which could be due to excessive logging, heavy disk usage, or other processes consuming disk I/O, e.g. the backup activity.
    • Recurring pattern: The problem occurs at roughly the same time every day, i.e. 3 AM UTC.
    • Impact: If the leader continues to fail to send heartbeats within the required time, the Raft protocol may assume the leader has failed and trigger a leader election, which can cause temporary unavailability of the etcd cluster. The heartbeat interval is 500 ms, after which the leader is expected to send heartbeat messages to its followers, and the send operation is expected to complete within 1 second; in the logs the operation exceeded that budget by more than 600 ms (roughly 1.66 seconds in total), a significant delay.

  • Analysis of the ESXi host logs also supports the conclusion that the slow disk performance in the SCSI device at the start of the backup process is the cause of the failure.
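The timing in the heartbeat warning can be reconstructed with simple arithmetic from the fields in the log excerpt (`heartbeat-interval`, `expected-duration`, `exceeded-duration`):

```python
# Reconstruct the actual heartbeat send time from the etcd warning fields.
# Values are taken from the log excerpt above, normalized to milliseconds.
heartbeat_interval_ms = 500.0      # "heartbeat-interval":"500ms"
expected_duration_ms = 1000.0      # "expected-duration":"1s"
exceeded_duration_ms = 663.212927  # "exceeded-duration":"663.212927ms"

actual_ms = expected_duration_ms + exceeded_duration_ms
print(f"heartbeat send took ~{actual_ms:.0f} ms "
      f"({actual_ms / heartbeat_interval_ms:.1f}x the heartbeat interval)")
# prints: heartbeat send took ~1663 ms (3.3x the heartbeat interval)
```

A send that takes more than three heartbeat intervals is exactly the condition under which followers may start an election.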

Resolution

  • The CNF application backup process is highly I/O-intensive and impacted the TKG control plane, as clearly observed in the TKG and ESXi log files.
  • Collaborate with the storage vendor to ensure the TKG control plane receives the minimum recommended resources, particularly in terms of IOPS (specifically sequential IOPS).
  • I/O performance can be improved through a number of actions on the backend storage, which may include adjusting the RAID level or splitting the cluster components across datastores, in particular dedicating one datastore to the TKG control plane.
  • Consult the CNF application vendor about implementing an improved mechanism for the backup process.
  • Refer to the links for further guidance on etcd hardware requirements and etcd dimensioning requirements.
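To verify whether the control-plane disks meet etcd's guidance after remediation, etcd exposes the WAL fsync latency as a Prometheus histogram on its /metrics endpoint; its 99th percentile should generally stay below about 10 ms. The sketch below (metric values are made up for illustration) estimates the p99 from the histogram's cumulative buckets:

```python
# Minimal sketch: estimate the p99 of etcd's etcd_disk_wal_fsync_duration_seconds
# histogram from its cumulative buckets, as scraped from the /metrics endpoint.
# etcd guidance: this p99 should stay below ~10 ms (0.010 s).

def p99_from_buckets(buckets):
    """buckets: list of (upper_bound_seconds, cumulative_count), sorted ascending."""
    total = buckets[-1][1]
    target = 0.99 * total
    for upper, cumulative in buckets:
        if cumulative >= target:
            return upper  # upper bound of the bucket containing the p99
    return float("inf")

# Example cumulative bucket counts (illustrative values, not real measurements):
buckets = [(0.001, 500), (0.002, 900), (0.004, 980), (0.008, 995),
           (0.016, 999), (0.032, 1000)]
p99 = p99_from_buckets(buckets)
print(p99, "slow disk" if p99 > 0.010 else "ok")
```

This returns the upper bound of the bucket containing the 99th percentile, which is a conservative estimate suitable for a pass/fail check.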