How do I find out what is causing my cpuCoreUtilizationExceeded alarms on my APMs?
A alarm for cpuCoreUtilizationExceeded is present on the chassis:
pod1# show alarms history
ID Date Severity Source Description
-- ---- -------- ------ -----------
890 Sep 6 13:44:05 Minor ap7 CPU utilization core: 2
890 Sep 6 13:44:15 Info system New system alarm level (clear)
890 Sep 6 13:44:15 Clear ap7 CPU utilization core: none
The following log lines can be found in the messages file:
Sep 6 13:44:05 pod1 cbsalarmlogrd: AlarmID 890 | Fri Sep 6 13:44:05 2013 | minor | ap7 | cpuCoreUtilizationExceeded | CPU utilization core: 2
Sep 6 13:44:15 pod1 cbsalarmlogrd: AlarmID 892 | Fri Sep 6 13:44:15 2013 | clear | ap7 | cpuCoreUtilizationExceeded | CPU utilization core: none | CorrelationID 890
We want to correlate this alarm to a process that is using the CPU core. We will want to utilize both XOS tools an Linux tools to do this.
First we want to identify the APM that is giving us the alarm and the core that is being subscribed to the point of an alarm. We can pick this out of the log message above. We can see that ap7 is getting the alarm for core 2.
To identify the vap-group this APM belongs to and the slot number of the APM run the following command
pod1# show ap-vap-mapping
Module Slot Status VAP IP Address VAP Group Index Master (true/false)
AP3 5 Up 10.5.1.101 fw 1 false
AP5 7 Up 10.5.1.102 fw 2 true
(2 rows)
Our next step would be to rsh into the APM. From the output of show ap-vap-mapping above, you can see the VAP-group, VAP-Index and VAP-IP that can be used to login into the APM
From theXOS CLI:
pod1# unix su
Password:
[[email protected] admin]# rsh fw_1
fw_1 (pod1): [vs0] ~$
or
[[email protected] admin]# rsh 10.5.1.102
fw_2 (pod1): [vs0] ~$
You would then run the top command, followed by the #1. This will open up the top command to show the individual cores.
top - 13:18:10 up 6 days, 2:30, 1 user, load average: 0.09, 0.06, 0.00
Tasks: 161 total, 1 running, 160 sleeping, 0 stopped, 0 zombie
Cpu0 : 0.0%us, 0.0%sy, 0.0%ni, 99.3%id, 0.3%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu1 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu8 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu9 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu10 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu11 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 12289724k total, 1705500k used, 10584224k free, 180k buffers
Swap: 0k total, 0k used, 0k free, 127708k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 root 15 0 10352 724 608 S 0.0 0.0 0:02.15 init
2 root RT -5 0 0 0 S 0.0 0.0 0:01.30 migration/0
3 root 34 19 0 0 0 S 0.0 0.0 0:00.05 ksoftirqd/0
4 root RT -5 0 0 0 S 0.0 0.0 0:00.00 watchdog/0
5 root RT -5 0 0 0 S 0.0 0.0 0:00.00 migration/1
6 root 34 19 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/1
7 root RT -5 0 0 0 S 0.0 0.0 0:00.00 watchdog/1
8 root RT -5 0 0 0 S 0.0 0.0 0:00.00 migration/2
9 root 34 19 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/2
10 root RT -5 0 0 0 S 0.0 0.0 0:00.00 watchdog/2
After breaking out the cores with the #1 key we see the above output. After monitoring we will be able to decipher which process is utilizing and individual core's resources by following which is the highest in the process list. In some instances we will see ksoftirqd
as the highest process in the list. This is an indication that what is driving high CPU or core utilization is new connections into the blade. Another instance which is often seen is a fw_worker_#
process. The # indicates which core is being used. This is an indication that a firewall process, assigned to the core, is the one that is using resources. This means that the firewall is driving the higher CPU core utilization. In this instance you may want to speak with Checkpoint to narrow down the issue.