Misconfigured remote-box causes high CPU and VRRP flapping

Article ID: 167887

Products

XOS

Issue/Introduction

An incorrectly configured remote-box entry can cause VRRP instability along with high CPU utilization on the CPM.

VRRP Flapping

In the /var/log/messages file, repeated VRRP failover messages show that failovers are happening several times each second. The chassis changes state from VRRP master to VRRP backup, then back to VRRP master, and so on.
 
Mar 17 19:23:58 CBS cbsalarmlogrd: AlarmID 6839 | Thu Mar 17 19:23:58 2011 | major | cp1 | vrrpFailGroupStatusChange | Failover group fg1 status backup 
Mar 17 19:23:58 CBS cbsalarmmond: [I] vrId:0 circuitId:0[] failOverGrId:2[fwbbe], Reason: masterNoResponse
Mar 17 19:23:58 CBS cbsirmd: [I] VRRP: Group 2 state changed backup->master lastRX 0 sec ago p=1
Mar 17 19:23:58 CBS cbsirmd: [I] VRRP: Group 2 state: backup -> master
Mar 17 19:23:58 CBS cbsirmd: [I] VRRP: VR on 10.0.4.28 took over XRRP group 2
Mar 17 19:23:58 CBS cbsirmd: [I] VRRP: Group 2 state changed master->backup lastRX 0 sec ago p=0
Mar 17 19:23:58 CBS cbsirmd: [I] VRRP: Group 2 state: master -> backup
Mar 17 19:23:58 CBS cbsalarmlogrd: AlarmID 6840 | Thu Mar 17 19:23:58 2011 | major | cp1 | vrrpFailGroupStatusChange | Failover group fg2 status master 
Mar 17 19:23:58 CBS cbsalarmmond: [I] vrId:0 circuitId:0[] failOverGrId:1[fg2], Reason: masterNoResponse
Mar 17 19:23:58 CBS cbsalarmlogrd: AlarmID 6841 | Thu Mar 17 19:23:58 2011 | major | cp1 | vrrpFailGroupStatusChange | Failover group fg2 status backup 
Mar 17 19:23:58 CBS cbsirmd: [I] VRRP: Group 1 state changed backup->master lastRX 0 sec ago p=1
Mar 17 19:23:58 CBS cbsirmd: [I] VRRP: Group 1 state: backup -> master
Mar 17 19:23:58 CBS cbsirmd: [I] VRRP: VR on 10.0.4.28 took over XRRP group 1
Mar 17 19:23:58 CBS cbsirmd: [I] VRRP: Group 1 state changed master->backup lastRX 0 sec ago p=0
Mar 17 19:23:58 CBS cbsirmd: [I] VRRP: Group 1 state: master -> backup
Mar 17 19:23:58 CBS cbsalarmlogrd: AlarmID 6842 | Thu Mar 17 19:23:58 2011 | major | cp1 | vrrpFailGroupStatusChange | Failover group fg2 status master 
Mar 17 19:23:58 CBS cbsalarmlogrd: AlarmID 6843 | Thu Mar 17 19:23:58 2011 | major | cp1 | vrrpFailGroupStatusChange | Failover group fg2 status backup 
Mar 17 19:23:58 CBS cbsalarmmond: [I] vrId:0 circuitId:0[] failOverGrId:3[fg1], Reason: masterNoResponse
Mar 17 19:23:58 CBS cbsirmd: [I] VRRP: Group 3 state changed backup->master lastRX 0 sec ago p=1
Mar 17 19:23:58 CBS cbsirmd: [I] VRRP: Group 3 state: backup -> master
Mar 17 19:23:58 CBS cbsirmd: [I] VRRP: VR on 10.0.4.28 took over XRRP group 3
Mar 17 19:23:58 CBS cbsirmd: [I] VRRP: Group 3 state changed master->backup lastRX 0 sec ago p=0
Mar 17 19:23:58 CBS cbsirmd: [I] VRRP: Group 3 state: master -> backup
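
The flapping pattern in the log excerpt above can be confirmed by counting VRRP state transitions per second for each group. The following Python sketch is illustrative only and is not an XOS tool: the function names and the threshold are assumptions, and the regular expression is derived solely from the cbsirmd line format shown above, so it may need adjusting for other releases.

```python
import re
from collections import Counter

# Matches cbsirmd lines of the form shown in the log excerpt, for example:
#   Mar 17 19:23:58 CBS cbsirmd: [I] VRRP: Group 2 state: backup -> master
# The "state changed ..." lines are deliberately not matched, so each
# transition is counted once.
TRANSITION = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+cbsirmd:.*"
    r"VRRP: Group (?P<group>\d+) state: (?P<old>\w+) -> (?P<new>\w+)"
)


def count_transitions(lines):
    """Count VRRP state transitions keyed by (timestamp, group)."""
    counts = Counter()
    for line in lines:
        match = TRANSITION.match(line)
        if match:
            counts[(match.group("ts"), match.group("group"))] += 1
    return counts


def flapping_groups(lines, threshold=2):
    """Return (timestamp, group, count) for groups that changed state at
    least `threshold` times within the same one-second timestamp.

    The threshold of 2 is an assumed value: one transition in a second is
    a normal failover, two or more suggests flapping.
    """
    return sorted(
        (ts, group, n)
        for (ts, group), n in count_transitions(lines).items()
        if n >= threshold
    )
```

To use it, pass the lines of /var/log/messages (for example, an open file object) to flapping_groups(). Any entry in the result is a group that switched state more than once within a single second.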

CPU Utilization Alarms

CPU utilization alarms can also be observed on the active CPM:

Mar 17 19:30:41 CBS cbshmonitord: [N] Violation (s=2, alarm) occurred 3+ times: module:13, item:3503 (H_ID_PROC_HI_CORE_UTIL), time:"Thu Mar 17 19:30:41 2011"
Mar 17 19:30:43 CBS cbsalarmlogrd: AlarmID 29180 | Thu Mar 17 19:30:41 2011 | major | cp1 | cpuCoreUtilizationExceeded | CPU utilization core: 2 

Out of Memory Condition

Eventually an Out of Memory condition can occur on the CPM, and the kernel oom-killer starts killing running processes to free memory:

Mar 20 02:58:04 CBS kernel: auditd invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=-17


Cause

Problem:

High CPU utilization on the CPM combined with unstable VRRP leaves the unit in a very problematic state. Memory is also gradually consumed.


Goal:

To explain this behavior.

Resolution

All of the serious symptoms described above have a single root cause: the administrator accidentally configured the remote-box command with the chassis's own IP address instead of the IP address of the other chassis. In the configuration example below, which corresponds to the logs above, the remote-box address 10.0.4.28 is the same as the address of the local management interface:

hostname CBS cp1
system-identifier 1
system-internal-network 1.4.0.0/16
remote-box 2 1.4.2.20 10.0.4.28
...

management gigabitethernet 13/2
  ip-addr 10.0.4.28/29 10.0.4.31
  enable
  access-list 1001 input
  access-list 1002 output
#

The solution is to reconfigure the remote-box command with the correct IP address of the other chassis. This resolves the problem.

Example:

CBS# configure remote-box 2 1.4.2.20 10.0.5.28
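
A saved configuration can be checked offline for this mistake before it is applied. The Python sketch below is not part of XOS and rests on two assumptions drawn only from the example in this article: the last argument of remote-box is the management IP address of the other chassis, and every ip-addr line in the file belongs to the local chassis. The function name is hypothetical.

```python
import ipaddress
import re

# Line formats taken from the configuration example in this article:
#   remote-box 2 1.4.2.20 10.0.4.28
#   ip-addr 10.0.4.28/29 10.0.4.31
REMOTE_BOX = re.compile(r"^\s*remote-box\s+(\d+)\s+(\S+)\s+(\S+)")
IP_ADDR = re.compile(r"^\s*ip-addr\s+(\S+)")


def self_pointing_remote_boxes(config_text):
    """Return (box_id, ip) for each remote-box entry whose last address
    is also configured locally through an ip-addr line."""
    local_ips = set()
    remote_boxes = []
    for line in config_text.splitlines():
        match = REMOTE_BOX.match(line)
        if match:
            remote_boxes.append((match.group(1), match.group(3)))
            continue
        match = IP_ADDR.match(line)
        if match:
            try:
                # "10.0.4.28/29" -> host address 10.0.4.28
                local_ips.add(ipaddress.ip_interface(match.group(1)).ip)
            except ValueError:
                pass  # ignore values that are not valid addresses
    flagged = []
    for box_id, ip in remote_boxes:
        try:
            if ipaddress.ip_address(ip) in local_ips:
                flagged.append((box_id, ip))
        except ValueError:
            pass  # ignore remote-box values that are not valid addresses
    return flagged
```

Run against the misconfigured example in this article, the function flags remote-box 2 with address 10.0.4.28. After the correction to 10.0.5.28 it returns an empty list.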

Workaround

N/A