Storage LUN performance degradation potentially caused by high IOPS from kube-api

Article ID: 399076

Products

VMware Telco Cloud Automation

Issue/Introduction

If you are experiencing performance issues related to disk write operations on the storage used by kube-apiserver, it may be due to excessive IOPS (Input/Output Operations Per Second) generated during certain cluster activities.

To determine whether the high IOPS load occurs only during application instantiation or persists after the cluster is deployed, monitor system performance over time. This helps isolate whether the degradation is event-driven (e.g., during deployments) or a consistent problem tied to the kube-apiserver workload.

By logging IOPS, CPU, memory, and top processes over time, you can identify patterns and correlate them with cluster events, enabling more targeted troubleshooting and remediation.

Environment

2.x, 3.x

Resolution

  • The Bash script below is a system performance monitoring tool. It collects system statistics every 2 seconds, logs them to a file, and must be run on a control plane node.

  • This script continuously monitors:
    • Disk I/O (IOPS and latency)
    • Network usage
    • CPU and memory usage
    • System load average
    • Top CPU-consuming process

The script saves this data in CSV format to /tmp/performance_monitor.log for later analysis.

  • To run the Bash script, follow these steps:
  • Create a file and paste the script into it:
    vi performance_monitor.sh
  • Make the script executable:
    chmod +x performance_monitor.sh
  • Run the script:
    ./performance_monitor.sh
  • The script runs for 60 seconds and then stops automatically:
    #!/bin/bash
    # Samples disk, network, CPU, memory, and load statistics and appends them to a CSV log.
    
    LOG_FILE="/tmp/performance_monitor.log"
    DISK="sda"      # Block device to monitor; adjust if the node uses a different device name
    IFACE="eth0"    # Network interface to monitor
    INTERVAL=2      # Seconds between samples
    DURATION=60  # Run time in seconds
    
    # Print one field of the monitored disk's line in /proc/diskstats:
    # 4 = reads completed, 7 = ms spent reading, 8 = writes completed, 11 = ms spent writing
    disk_stat() {
        awk -v dev="$DISK" -v field="$1" '$3 == dev { print $field }' /proc/diskstats
    }
    
    PREV_READS=$(disk_stat 4)
    PREV_WRITES=$(disk_stat 8)
    if [ -z "$PREV_READS" ] || [ -z "$PREV_WRITES" ]; then
        echo "Device $DISK not found in /proc/diskstats." >&2
        exit 1
    fi
    
    echo "Starting performance monitoring for $DURATION seconds. Output will be saved to $LOG_FILE."
    echo "This will stop automatically after $DURATION seconds."
    
    # The header columns match the fields written by the echo at the end of the loop.
    echo "Timestamp, Interface, Network RX (KB/s), Network RX Drop, Network TX (KB/s), Network TX Drop, CPU Usage (%), Memory Used (MB), Load Avg, Top Process, IOSTATS" > "$LOG_FILE"
    
    START_TIME=$(date +%s)
    
    while true; do
        CURRENT_TIME=$(date +%s)
        ELAPSED=$((CURRENT_TIME - START_TIME))
        if [ "$ELAPSED" -ge "$DURATION" ]; then
            echo "Monitoring finished after $DURATION seconds."
            break
        fi
    
        TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
        CURR_READS=$(disk_stat 4)
        CURR_WRITES=$(disk_stat 8)
        READ_TIME=$(disk_stat 7)     # Cumulative since boot
        WRITE_TIME=$(disk_stat 11)   # Cumulative since boot
    
        # Operations completed since the previous sample, divided by the interval.
        # This is approximate: each iteration takes slightly longer than INTERVAL seconds.
        READ_IOPS=$(( (CURR_READS - PREV_READS) / INTERVAL ))
        WRITE_IOPS=$(( (CURR_WRITES - PREV_WRITES) / INTERVAL ))
    
        IOSTATS="IOStats Reads: $CURR_READS Writes: $CURR_WRITES Read Time (ms): $READ_TIME Write Time (ms): $WRITE_TIME Read IOPS: $READ_IOPS Write IOPS: $WRITE_IOPS"
    
        PREV_READS=$CURR_READS
        PREV_WRITES=$CURR_WRITES
    
        # Field positions depend on the installed ifstat version; verify that they match the header.
        NETWORK=$(ifstat | awk -v ifc="$IFACE" -v OFS=", " '$1 == ifc { print $1, $6, $7, $8, $9 }')
        CPU=$(top -b -n 1 | grep "Cpu(s)" | awk '{print $2 + $4}')   # user + system
        MEM_TOTAL=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
        MEM_FREE=$(awk '/MemFree/ {print $2}' /proc/meminfo)
        MEM_USED_MB=$(( (MEM_TOTAL - MEM_FREE) / 1024 ))             # Includes buffers and page cache
        LOAD=$(uptime | awk -F 'load average: ' '{print $2}' | cut -d',' -f1)
        TOP_PROCESS=$(ps aux --sort=-%cpu | head -2 | tail -1 | awk '{print $11}')
    
        echo "$TIMESTAMP, $NETWORK, $CPU, $MEM_USED_MB, $LOAD, $TOP_PROCESS, $IOSTATS" >> "$LOG_FILE"
    
        sleep "$INTERVAL"
    done
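
Once the log has been collected, the write load can be summarized quickly. The following is a minimal sketch, not part of the original script: the function name summarize_write_iops is hypothetical, and it assumes each data line contains the "Write IOPS: <number>" text and starts with the timestamp, as written by the script above.

```shell
#!/bin/bash
# Hypothetical helper: summarize the "Write IOPS" values in the monitoring log.
# Usage: summarize_write_iops [logfile]
summarize_write_iops() {
    local log_file="${1:-/tmp/performance_monitor.log}"
    awk '
        # Data lines contain "Write IOPS: <number>"; the CSV header line does not match.
        match($0, /Write IOPS: [0-9]+/) {
            # Skip the 12-character "Write IOPS: " prefix to get the number.
            value = substr($0, RSTART + 12, RLENGTH - 12) + 0
            sum += value
            count++
            if (count == 1 || value > peak) {
                peak = value
                peak_time = substr($0, 1, 19)   # Leading "YYYY-MM-DD HH:MM:SS" timestamp
            }
        }
        END {
            if (count == 0) { print "no samples found"; exit 1 }
            printf "samples=%d avg_write_iops=%.1f peak_write_iops=%d peak_at=%s\n",
                   count, sum / count, peak, peak_time
        }
    ' "$log_file"
}
```

Calling summarize_write_iops with no argument reads /tmp/performance_monitor.log. The peak timestamp can then be compared with the time of an application instantiation to see whether the load is event-driven.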

Additional Information

To run the script for longer than 60 seconds, change the following line:

DURATION=60  # Run time in seconds
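
As an alternative to editing the file for each run, the duration could be taken from the first command-line argument. This is a sketch of one possible variation, not part of the original script; the function name parse_duration is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the run time in seconds, defaulting to 60
# when no argument is given. Rejects anything that is not a whole number.
parse_duration() {
    local value="${1:-60}"
    case "$value" in
        *[!0-9]*)
            echo "Duration must be a whole number of seconds." >&2
            return 1
            ;;
    esac
    echo "$value"
}

# In the monitoring script, the fixed assignment could then become:
# DURATION=$(parse_duration "$1") || exit 1
```

With that change, running ./performance_monitor.sh 300 would monitor for 5 minutes, while running it without an argument keeps the 60-second default.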

The script must be run on the control plane nodes. To list the nodes and their roles:

kubectl get nodes
NAME                                               STATUS   ROLES           AGE   VERSION
nginx-busybox-controlplane-sfmt7-2p7rg             Ready    control-plane   83d   v1.30.2+vmware.1
nginx-busybox-controlplane-sfmt7-cjlcx             Ready    control-plane   83d   v1.30.2+vmware.1
nginx-busybox-controlplane-sfmt7-k9dg6             Ready    control-plane   83d   v1.30.2+vmware.1
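
In clusters that also contain worker nodes, the control plane node names can be filtered from this output. The following is a minimal sketch; the function name control_plane_nodes is hypothetical, and it assumes the default column layout shown above, where ROLES is the third column.

```shell
#!/bin/bash
# Hypothetical helper: print the names of nodes whose ROLES column contains
# "control-plane". Reads `kubectl get nodes` output on standard input.
control_plane_nodes() {
    # NR > 1 skips the header row; $3 is the ROLES column in the default layout.
    awk 'NR > 1 && $3 ~ /control-plane/ { print $1 }'
}

# Usage on a machine with access to the cluster:
# kubectl get nodes | control_plane_nodes
```

On clusters where the control plane nodes carry the standard node-role.kubernetes.io/control-plane label, kubectl get nodes -l node-role.kubernetes.io/control-plane should return the same set of nodes.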