Simple troubleshooting guide for etcd performance issues

Article ID: 411827

Products

VMware vSphere Kubernetes Service

Issue/Introduction

If you see the error message "apply request took too long" in the etcd logs, or "Client.Timeout exceeded while awaiting headers" in the status of a Kubernetes resource, it usually indicates a performance issue with etcd.

Environment

VMware vSphere Kubernetes Service

Cause

Common causes include:

  • Disk I/O latency
  • Network communication issues between etcd peers
  • Resource pressure, such as high CPU or memory usage

Resolution

Disk I/O

For disk I/O latency, the simplest check is to search the etcd logs for the message "slow fdatasync". etcd logs this message when it takes more than 1 second to flush WAL data to disk. Normally the flush takes no more than 32ms, so 1 second clearly indicates a disk latency problem.
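
For example, on a control plane VM you can search the etcd container logs directly. The commands below are a minimal sketch assuming a kubeadm-style static-pod etcd; the pod name "etcd-<control-plane-node-name>" and the use of crictl are assumptions, so adjust them for your environment:

    # Search the etcd static pod logs for slow WAL flushes
    kubectl logs -n kube-system etcd-<control-plane-node-name> | grep "slow fdatasync"

    # Or, from the control plane VM itself, search via the container runtime
    crictl logs $(crictl ps --name etcd -q) 2>&1 | grep "slow fdatasync"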

Additionally, you can monitor the following etcd metrics:

  • etcd_disk_wal_fsync_duration_seconds_bucket: measures the latency of WAL fsync operations
  • etcd_disk_backend_commit_duration_seconds_bucket: measures the latency of backend commit operations
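
As a quick check without a full monitoring stack, you can scrape the etcd metrics endpoint from a control plane VM. This sketch assumes the metrics endpoint is exposed on http://127.0.0.1:2381, a common kubeadm default; if your deployment differs, use the client port with the appropriate certificates instead:

    # Dump the WAL fsync and backend commit latency histograms from the local etcd member
    curl -s http://127.0.0.1:2381/metrics | grep -E 'etcd_disk_(wal_fsync|backend_commit)_duration_seconds'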

Network

For etcd peer-to-peer network communication, since etcd usually runs on the host network, the simplest check is to ping between the control plane VMs and measure the round-trip time. It should generally be less than three times the heartbeat interval (which defaults to 100ms), so it should be below 300ms.
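
For example, from one control plane VM, ping each of the other control plane VMs and compare the reported round-trip times against the 300ms guideline (the address below is a placeholder):

    # Send 20 probes to a peer control plane VM and review the min/avg/max round-trip time
    ping -c 20 <peer-control-plane-ip>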

Additionally, you can monitor the etcd metric "etcd_network_peer_round_trip_time_seconds_bucket", which measures the round-trip time between etcd peers.
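
This metric can be pulled from the same metrics endpoint as in the disk I/O section (again assuming it is exposed on http://127.0.0.1:2381):

    # Inspect the peer round-trip time histogram reported by the local etcd member
    curl -s http://127.0.0.1:2381/metrics | grep etcd_network_peer_round_trip_time_seconds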

Resource

For CPU or memory usage, the simplest check is to run top on each control plane VM and verify whether the etcd process, or the VM as a whole, is running short on CPU or memory.
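
For example, on each control plane VM (process names may vary depending on how etcd is deployed):

    # One-shot snapshot of overall CPU and memory usage on the control plane VM
    top -b -n 1 | head -n 20

    # CPU and memory usage of the etcd process specifically
    ps -C etcd -o pid,%cpu,%mem,rss,cmd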