cdm reports top processes over 100% CPU, and java process PIDs in the alarm do not match the processes probe Status window

Article ID: 192904

Products

NIMSOFT PROBES DX Infrastructure Management

Issue/Introduction

When the cdm probe generates a CPU alarm, the Top Processes list shows java processes using more than 100% CPU. In addition, the PIDs reported in the alarm do not match the PIDs that the processes probe shows for the same java processes.

Environment

Release : 20.1

Component : UIM - AWS

Resolution

1. CPU over 100% for a given process. (This is working as designed.)

CPU utilization is measured relative to a single CPU. The maximum is 100% for each CPU, so a four-CPU system would have a maximum CPU utilization of 400%.

If you search for "Linux ps %CPU > 100" you will find that this is expected on multi-CPU systems.

If %CPU for a process is greater than 100, the process is using more than one full core, for example one full core plus part of another. So on a system with 4 cores, a multi-threaded process (one that can spread its load across all cores) could reach 400%. Running the top command on that system will confirm this behavior.
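To relate a per-process figure to the capacity of the whole machine, the value can be divided by the number of CPUs. The following is a minimal sketch of that arithmetic; the function name share_of_total_capacity is hypothetical, and the 4-CPU count in the comments is taken from the example above, not from the affected system.

```python
def share_of_total_capacity(pcpu: float, cpus: int) -> float:
    """Convert a per-process %CPU value (where 100% = one full CPU)
    into a percentage of the whole machine's CPU capacity."""
    if cpus < 1:
        raise ValueError("cpus must be at least 1")
    return pcpu / cpus


# Assumed 4-CPU system, as in the example above:
#   share_of_total_capacity(400.0, 4) is the theoretical maximum (all CPUs busy)
#   share_of_total_capacity(152.0, 4) is one java process from an alarm
```

On a real system, os.cpu_count() from the Python standard library can supply the CPU count.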

2. cdm alarms show top processes with PIDs that do not match the PPIDs and PIDs in the processes probe Status window for java processes.

After comparing the PIDs reported in several alarms with the processes probe Status window (refreshed each time), we could see that the PPIDs and PIDs of the java processes changed every few minutes. Example alarms:

Average (3 samples) total cpu is now 99.37%, which is above the error threshold (95%).Top Processes [java[4840]-(152.00%)];[java[5309]-(147.00%)];[java[5701]-(116.00%)];[java[6141]-(96.00%)];[java[6484]-(31.20%)]

Average (3 samples) total cpu is now 99.31%, which is above the error threshold (95%).Top Processes [java[13225]-(154.00%)];[java[13772]-(118.00%)];[java[14068]-(88.70%)];[tesvc[1161]-(4.30%)];[adclient[1641]-(2.30%)]

Average (3 samples) total cpu is now 99.38%, which is above the error threshold (95%).Top Processes [java[26840]-(153.00%)];[java[26570]-(139.00%)];[java[27180]-(124.00%)];[java[28141]-(30.80%)];[java[28232]-(27.50%)]

Average (3 samples) total cpu is now 99.24%, which is above the error threshold (95%).Top Processes [java[26863]-(154.00%)];[java[27187]-(154.00%)];[java[26609]-(151.00%)];[java[27640]-(105.00%)];[java[28041]-(78.20%)]
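Comparing PIDs across alarms by eye is error-prone, so a small parser can help. The sketch below assumes the alarm text follows the "[name[pid]-(pct%)]" pattern seen in the messages above; the function names are hypothetical and are not part of the cdm probe.

```python
import re
from typing import List, Set, Tuple

# Matches one Top Processes entry such as "[java[4840]-(152.00%)]".
TOP_PROCESS = re.compile(r"\[([^\[\]]+)\[(\d+)\]-\(([\d.]+)%\)\]")


def parse_top_processes(alarm: str) -> List[Tuple[str, int, float]]:
    """Return (process name, PID, %CPU) for each entry in the alarm text."""
    return [(name, int(pid), float(pct))
            for name, pid, pct in TOP_PROCESS.findall(alarm)]


def pids_for(alarm: str, name: str) -> Set[int]:
    """Return the set of PIDs reported for the given process name."""
    return {pid for n, pid, _ in parse_top_processes(alarm) if n == name}
```

If pids_for(first_alarm, "java") and pids_for(later_alarm, "java") have no PIDs in common, the java processes listed in the later alarm are not the same process instances as in the earlier one.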

cdm uses this command to get the top processes: /bin/ps -e -o pcpu,ppid,pid,args, --sort=-pcpu

We confirmed that the PPIDs and PIDs of the java processes (AWS Elasticsearch) on the RHEL 8 machine change within a few minutes.

The PPIDs and PIDs of other, non-java processes remain the same over time.
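The same check can be repeated outside the probes by taking two ps snapshots a few minutes apart and comparing them. This is a rough sketch, assuming a procps-style ps and the column order pcpu,ppid,pid,args; the command list is an adapted form of the cdm command (without the trailing comma after args), and the function names are hypothetical.

```python
import subprocess
from typing import Dict, Set, Tuple

# Assumed invocation, adapted from the command cdm uses.
PS_CMD = ["ps", "-e", "-o", "pcpu,ppid,pid,args", "--sort=-pcpu"]


def snapshot() -> str:
    """Run ps once and return its raw text output."""
    return subprocess.run(PS_CMD, capture_output=True, text=True,
                          check=True).stdout


def pids_by_command(ps_output: str) -> Dict[str, Set[int]]:
    """Map each full command line to the set of PIDs running it."""
    result: Dict[str, Set[int]] = {}
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue
        _pcpu, _ppid, pid, args = fields
        result.setdefault(args.strip(), set()).add(int(pid))
    return result


def changed_pids(before: str, after: str) -> Dict[str, Tuple[Set[int], Set[int]]]:
    """Return commands present in both snapshots whose PIDs differ."""
    old, new = pids_by_command(before), pids_by_command(after)
    return {cmd: (old[cmd], new[cmd])
            for cmd in old.keys() & new.keys() if old[cmd] != new[cmd]}
```

Calling snapshot() twice, a few minutes apart, and passing both results to changed_pids() lists the commands that now run under different PIDs. A command that appears in only one of the two snapshots is not reported by this sketch.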

Additional Information

In this case, the PIDs of the java processes for Elasticsearch and Docker changed every few minutes.

For Docker or Kubernetes, monitor container status as well, e.g., review the output of "docker ps".

Also, check with your administrator(s) to make sure that the java processes are not crashing and restarting, e.g., look for JVM fatal error log files (hs_### files, typically named hs_err_pid<pid>.log).
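As a rough aid for that check, the sketch below lists HotSpot fatal error logs in a directory, newest first. It assumes the default file name pattern hs_err_pid<pid>.log; the JVM normally writes these files to the process working directory unless -XX:ErrorFile points elsewhere, so the directory to search is an assumption you must supply. The function name is hypothetical.

```python
from pathlib import Path
from typing import List


def find_hs_err_logs(directory: str) -> List[Path]:
    """Return hs_err_pid*.log files in the directory, newest first."""
    logs = Path(directory).glob("hs_err_pid*.log")
    return sorted(logs, key=lambda p: p.stat().st_mtime, reverse=True)
```

Several recent files with different PIDs in their names would suggest that the java processes are crashing repeatedly rather than being restarted on purpose.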
