Storage LUN performance degradation potentially caused by high IOPS from kube-apiserver

Products

VMware Telco Cloud Automation

Issue/Introduction

If you are experiencing performance issues related to disk write operations on the storage used by kube-apiserver, it may be due to excessive IOPS (Input/Output Operations Per Second) generated during certain cluster activities.

To determine whether this high IOPS load is only occurring during application instantiation or if it persists continuously after the cluster is deployed, performance monitoring should be performed. This will help isolate whether the performance degradation is event-driven (e.g., during deployments) or a consistent problem tied to the kube-apiserver workload.

By logging IOPS, CPU, memory, and top processes over time, you can identify patterns and correlate them with cluster events, enabling more targeted troubleshooting and remediation.

Environment

2.x, 3.x

Resolution

This Bash script is a system performance monitoring tool that collects system statistics every 2 seconds and logs them to a file. It must be run on the control plane nodes.
The script continuously monitors:
- Disk I/O (IOPS and latency)
- Network usage
- CPU and memory usage
- System load average
- Top CPU-consuming process

It saves this data in CSV format to /tmp/performance_monitor.log for later analysis.
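The disk figures come from cumulative counters in /proc/diskstats, so a rate or a latency is always the difference between two samples. A minimal sketch of that arithmetic follows. The helper names `rate_per_second` and `avg_latency_ms` are hypothetical and are not part of the script in this article, and the counter values in the example are made up.

```shell
#!/bin/bash

# Operations per second between two cumulative counter samples.
# Arguments: previous_count current_count interval_seconds
rate_per_second() {
    echo $(( ($2 - $1) / $3 ))
}

# Average time per operation, in milliseconds, between two samples.
# Arguments: previous_ops current_ops previous_ms current_ms
avg_latency_ms() {
    local ops=$(( $2 - $1 ))
    local ms=$(( $4 - $3 ))
    if [ "$ops" -le 0 ]; then
        echo "0.00"   # No operations completed in the interval
        return
    fi
    LC_ALL=C awk -v ms="$ms" -v ops="$ops" 'BEGIN { printf "%.2f\n", ms / ops }'
}

# Illustrative counters sampled 2 seconds apart (not real measurements)
rate_per_second 1000 1300 2          # 300 writes in 2 s
avg_latency_ms 1000 1300 5000 5600   # 600 ms spent on 300 writes
```

With these sample values the two calls print 150 and 2.00. A steadily rising average latency at a stable operation rate usually points at the storage backend rather than at the workload.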

To run the Bash script, follow these steps:
Create a file and paste the script into it:
```
vi performance_monitor.sh
```
Make the script executable:
```
chmod +x performance_monitor.sh
```
Run the script:
```
./performance_monitor.sh
```

The script runs for 60 seconds and then stops automatically.

```
#!/bin/bash

LOG_FILE="/tmp/performance_monitor.log"
DISK="sda"   # Block device to monitor
INTERVAL=2   # Sampling interval in seconds
DURATION=60  # Run time in seconds

# Print one field of the monitored disk's line in /proc/diskstats.
# Field 4 = reads completed, 7 = ms spent reading,
# field 8 = writes completed, 11 = ms spent writing.
disk_stat() {
    awk -v dev="$DISK" -v field="$1" '$3 == dev { print $field }' /proc/diskstats
}

PREV_READS=$(disk_stat 4)
PREV_WRITES=$(disk_stat 8)

echo "Starting performance monitoring for $DURATION seconds. Output will be saved to $LOG_FILE."
echo "This will stop automatically after $DURATION seconds."

# CSV header; the columns match the echo at the end of the loop.
echo "Timestamp, Interface, Network RX (KB/s), Network RX Drop, Network TX (KB/s), Network TX Drop, CPU Usage (%), Mem Used (MB), Mem Total (MB), Load Avg, Top Process, IOSTATS" > "$LOG_FILE"

START_TIME=$(date +%s)

while true; do
    CURRENT_TIME=$(date +%s)
    ELAPSED=$((CURRENT_TIME - START_TIME))
    if [ "$ELAPSED" -ge "$DURATION" ]; then
        echo "Monitoring finished after $DURATION seconds."
        break
    fi

    TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
    CURR_READS=$(disk_stat 4)
    CURR_WRITES=$(disk_stat 8)
    READ_TIME=$(disk_stat 7)
    WRITE_TIME=$(disk_stat 11)

    # Completed operations since the previous sample, divided by the interval.
    # Approximate: each pass takes slightly longer than INTERVAL seconds.
    READ_IOPS=$(( (CURR_READS - PREV_READS) / INTERVAL ))
    WRITE_IOPS=$(( (CURR_WRITES - PREV_WRITES) / INTERVAL ))

    IOSTATS="IOStats Reads: $CURR_READS Writes: $CURR_WRITES Read Time (ms): $READ_TIME Write Time (ms): $WRITE_TIME Read IOPS: $READ_IOPS Write IOPS: $WRITE_IOPS"

    PREV_READS=$CURR_READS
    PREV_WRITES=$CURR_WRITES

    # Field positions depend on the installed ifstat implementation;
    # check them against the ifstat output on the node.
    NETWORK=$(ifstat | grep eth0 | awk '{print $1","$6","$7","$8","$9}')
    CPU=$(top -b -n 1 | grep "Cpu(s)" | awk '{print $2 + $4}')
    MEM_TOTAL=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
    MEM_FREE=$(awk '/MemFree/ {print $2}' /proc/meminfo)
    MEM_USED=$((MEM_TOTAL - MEM_FREE))
    MEM_TOTAL_MB=$((MEM_TOTAL / 1024))
    MEM_USED_MB=$((MEM_USED / 1024))
    LOAD=$(uptime | awk -F 'load average: ' '{print $2}' | cut -d',' -f1)
    TOP_PROCESS=$(ps aux --sort=-%cpu | head -2 | tail -1 | awk '{print $11}')

    echo "$TIMESTAMP, $NETWORK, $CPU, $MEM_USED_MB, $MEM_TOTAL_MB, $LOAD, $TOP_PROCESS, $IOSTATS" >> "$LOG_FILE"

    sleep "$INTERVAL"
done
```
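Once the run has finished, the log can be summarized to find the busiest sample. The sketch below is one possible way to do that and is not part of the original script. The function name `peak_write_iops` is hypothetical, and it assumes only that each data line starts with the timestamp and ends with the `Write IOPS: <n>` text written by the script.

```shell
#!/bin/bash

# Print "timestamp,value" for the sample with the highest Write IOPS.
# Argument: path to the log file (defaults to the path used by the script).
peak_write_iops() {
    local log_file="${1:-/tmp/performance_monitor.log}"
    awk -F'Write IOPS: ' '
        NR > 1 && NF == 2 {
            value = $2 + 0
            if (!seen || value > max) {
                max = value
                split($1, cols, ",")   # cols[1] is the timestamp column
                when = cols[1]
                seen = 1
            }
        }
        END { if (seen) print when "," max }
    ' "$log_file"
}

# Summarize the default log if it exists on this node
if [ -r /tmp/performance_monitor.log ]; then
    peak_write_iops /tmp/performance_monitor.log
fi
```

Comparing the reported timestamp with the time of an application instantiation helps show whether the peak is tied to a deployment or occurs while the cluster is idle.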

Additional Information

To run the script for longer than 60 seconds, change this line:

```
DURATION=60  # Run time in seconds
```
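As an alternative to editing the file for each run, the run time could be taken from the first command-line argument. This is a sketch of one way to do it, not something the original script supports. The helper name `resolve_duration` is hypothetical, and the default of 60 seconds mirrors the value above.

```shell
#!/bin/bash

# Hypothetical helper: print the requested run time in seconds, or 60
# when no argument is given. Rejects anything that is not a whole number.
resolve_duration() {
    local requested="${1:-60}"
    case "$requested" in
        *[!0-9]*)
            echo "Duration must be a whole number of seconds" >&2
            return 1
            ;;
    esac
    echo "$requested"
}

# In the monitoring script this would replace the fixed DURATION=60 line.
DURATION=$(resolve_duration "${1:-}") || exit 1
```

With this change, `./performance_monitor.sh 300` would monitor for five minutes, and running the script without an argument would keep the 60-second default.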

The script must be run on the control plane nodes, which are listed with the control-plane role:

```
kubectl get nodes
NAME                                               STATUS   ROLES           AGE   VERSION
nginx-busybox-controlplane-sfmt7-2p7rg             Ready    control-plane   83d   v1.30.2+vmware.1
nginx-busybox-controlplane-sfmt7-cjlcx             Ready    control-plane   83d   v1.30.2+vmware.1
nginx-busybox-controlplane-sfmt7-k9dg6             Ready    control-plane   83d   v1.30.2+vmware.1
```
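In clusters that also have worker nodes, the list can be narrowed to the control plane by filtering on the ROLES column. The following is a small sketch under that assumption. The function name `control_plane_nodes` is hypothetical, and it expects the default `kubectl get nodes` column layout shown above.

```shell
#!/bin/bash

# Read `kubectl get nodes` output on stdin and print the names of nodes
# whose ROLES column (third field) includes control-plane.
control_plane_nodes() {
    awk '$1 != "NAME" && $3 ~ /(^|,)control-plane(,|$)/ { print $1 }'
}
```

A typical call would be `kubectl get nodes | control_plane_nodes`, which prints one node name per line so that the script can be copied to and run on each of them.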