Troubleshoot and identify the cause of cpuCoreUtilizationExceeded alarms on APMs

Article ID: 167986

Products

XOS

Issue/Introduction

How do I find out what is causing my cpuCoreUtilizationExceeded alarms on my APMs?

An alarm for cpuCoreUtilizationExceeded is present on the chassis:

pod1# show alarms history 
ID Date Severity Source Description 
-- ---- -------- ------ ----------- 
890 Sep 6 13:44:05 Minor ap7 CPU utilization core: 2
890 Sep 6 13:44:15 Info system New system alarm level (clear)  
890 Sep 6 13:44:15 Clear ap7 CPU utilization core: none 
 

The following log lines can be found in the messages file:

Sep 6 13:44:05 pod1 cbsalarmlogrd: AlarmID 890 | Fri Sep 6 13:44:05 2013 | minor | ap7 | cpuCoreUtilizationExceeded | CPU utilization core: 2 
Sep 6 13:44:15 pod1 cbsalarmlogrd: AlarmID 892 | Fri Sep 6 13:44:15 2013 | clear | ap7 | cpuCoreUtilizationExceeded | CPU utilization core: none | CorrelationID 890 
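If many of these alarms are present, the module and core can be pulled out of each log line by splitting on the " | " delimiters. A minimal sketch using awk (the inline sample line stands in for grepping cpuCoreUtilizationExceeded out of the messages file; the exact log path varies by release):

```shell
# Sample cbsalarmlogrd line from the messages file
line='Sep 6 13:44:05 pod1 cbsalarmlogrd: AlarmID 890 | Fri Sep 6 13:44:05 2013 | minor | ap7 | cpuCoreUtilizationExceeded | CPU utilization core: 2'

# Split on " | "; field 4 is the alarm source (the APM),
# and the last field holds the affected core number.
echo "$line" | awk -F' [|] ' '{sub(/.*: /, "", $NF); print "module=" $4, "core=" $NF}'
# module=ap7 core=2
```

Running the same awk over the output of grepping cpuCoreUtilizationExceeded in the messages file gives one module/core pair per alarm occurrence.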

Resolution

We want to correlate this alarm to the process that is using the CPU core. To do this we will use both XOS tools and Linux tools.

First we want to identify the APM that is raising the alarm and the core that is being oversubscribed to the point of an alarm. We can pick both out of the log message above: ap7 is getting the alarm for core 2.

To identify the VAP group this APM belongs to and the slot number of the APM, run the following command:

pod1# show ap-vap-mapping
Module  Slot  Status  VAP IP Address  VAP Group  Index  Master (true/false)
AP3     5     Up      10.5.1.101      fw         1      false              
AP5     7     Up      10.5.1.102      fw         2      true               
(2 rows)

Our next step is to rsh into the APM. From the output of show ap-vap-mapping above, you can see the VAP group, VAP index, and VAP IP address that can be used to log in to the APM.
From the XOS CLI:

pod1# unix su
Password:
[root@pod1 admin]# rsh fw_1
fw_1 (pod1): [vs0] ~$

or
[root@pod1 admin]# rsh 10.5.1.102
fw_2 (pod1): [vs0] ~$


You would then run the top command and press the 1 key. This toggles top to display utilization for each individual core.

top - 13:18:10 up 6 days,  2:30,  1 user,  load average: 0.09, 0.06, 0.00
Tasks: 161 total,   1 running, 160 sleeping,   0 stopped,   0 zombie
Cpu0  :  0.0%us,  0.0%sy,  0.0%ni, 99.3%id,  0.3%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu8  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu9  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu10 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu11 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  12289724k total,  1705500k used, 10584224k free,      180k buffers
Swap:        0k total,        0k used,        0k free,   127708k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
    1 root      15   0 10352  724  608 S  0.0  0.0   0:02.15 init
    2 root      RT  -5     0    0    0 S  0.0  0.0   0:01.30 migration/0
    3 root      34  19     0    0    0 S  0.0  0.0   0:00.05 ksoftirqd/0
    4 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/0
    5 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 migration/1
    6 root      34  19     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/1
    7 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/1
    8 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 migration/2
    9 root      34  19     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/2
   10 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/2

After breaking out the cores with the 1 key we see the above output. By monitoring for a while we can determine which process is consuming an individual core's resources by watching which process stays highest in the process list.

In some instances ksoftirqd will be the highest process in the list. This indicates that the high CPU or core utilization is being driven by new connections into the blade. Another process often seen at the top is fw_worker_#, where # indicates which core is being used. This indicates that a firewall process assigned to that core is the one consuming resources, meaning the firewall is driving the higher CPU core utilization. In that case you may want to engage Check Point to narrow down the issue.
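Once the hot core is known (core 2 in this example), the same correlation can be done non-interactively with standard Linux ps. A sketch, assuming the APM's ps build supports the -eo and --sort options (older builds may not):

```shell
# List tasks last scheduled on core 2, sorted by CPU usage.
# The PSR column is the processor (core) a task last ran on;
# keep the header row plus any row whose PSR is 2.
ps -eo pid,psr,pcpu,comm --sort=-pcpu | awk 'NR==1 || $2==2'
```

If ksoftirqd for that core tops this list, checking /proc/interrupts for a rapidly incrementing counter on that CPU can help confirm that interrupt load from new connections is the driver.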