APM marked down permanently threshold (MaxFaultInterval) & standby interaction

Article ID: 167982

Products

XOS

Issue/Introduction


Q1. When will the blade be marked down?
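A1. An APM is marked down permanently if it fails again after its third reboot because of missing heartbeats (HBs) or health polls. In the example below, slot 9 fails three times (MAX: 3) and is disabled permanently on the fourth failure, at which point the standby APM takes over.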
 
Oct 31 18:14:23 TXSAT1DCEX801 cbssysctrld: [I] No health poll from slot 9. Reset in 60 sec. 
Oct 31 18:15:23 TXSAT1DCEX801 cbssysctrld: [I] Slot 9 failed 1 time since "Thu Oct 31 18:15:23 2013" 
Oct 31 18:15:23 TXSAT1DCEX801 cbssysctrld: [I] Stopping slot 9 
Oct 31 18:20:24 TXSAT1DCEX801 cbssysctrld: [I] No health poll from slot 9. Reset in 60 sec. 
Oct 31 18:21:24 TXSAT1DCEX801 cbssysctrld: [I] Slot 9 failed 2 times since "Thu Oct 31 18:15:23 2013" (APP failure: 0, others: 2, TWT: 60, MFI: 35340, MAX: 3) 
Oct 31 18:21:24 TXSAT1DCEX801 cbssysctrld: [I] Stopping slot 9 
Oct 31 21:13:37 TXSAT1DCEX801 cbssysctrld: [I] No health poll from slot 9. Reset in 60 sec. 
Oct 31 21:14:37 TXSAT1DCEX801 cbssysctrld: [I] Slot 9 failed 3 times since "Thu Oct 31 18:15:23 2013" (APP failure: 0, others: 3, TWT: 122, MFI: 35402, MAX: 3) 
Oct 31 21:14:37 TXSAT1DCEX801 cbssysctrld: [I] Stopping slot 9 
Oct 31 21:19:32 TXSAT1DCEX801 cbssysctrld: [I] No health poll from slot 9. Reset in 60 sec. 
Oct 31 21:20:32 TXSAT1DCEX801 cbssysctrld: [E] Slot 9 is down PERMANENTLY 
Oct 31 21:20:32 TXSAT1DCEX801 cbssysctrld: [I] Stopping slot 9 
Oct 31 21:20:32 TXSAT1DCEX801 cbssysctrld: [I] excessive failures on slot 9, retry limit exceeded, disabling permanently.
Oct 31 21:20:32 TXSAT1DCEX801 cbssysctrld: [I] Stopping slot 9
Oct 31 21:20:32 TXSAT1DCEX801 cbssysctrld: [I] APM slot 9 (SN:P109N027) (VAP 0) state change: dying -> down
Oct 31 21:20:32 TXSAT1DCEX801 cbssysctrld: [I] Loading VAP 4 on APM slot 11   <<< standby taking over
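
As a rough illustration, the failure counters in log entries like those above can be extracted with a short script. The Python sketch below is not part of XOS. The function name summarize and both regular expressions are assumptions based only on the line format shown in this excerpt, and other XOS releases may log these events differently.

```python
import re

# Assumed patterns, derived only from the cbssysctrld lines quoted in this article.
FAILURE_RE = re.compile(
    r'Slot (?P<slot>\d+) failed (?P<count>\d+) times? since "(?P<since>[^"]+)"'
)
PERMANENT_RE = re.compile(r'Slot (?P<slot>\d+) is down PERMANENTLY')


def summarize(lines):
    """Return {slot: {"failures": int, "since": str or None, "permanent": bool}}."""
    summary = {}
    for line in lines:
        failed = FAILURE_RE.search(line)
        down = PERMANENT_RE.search(line)
        match = failed or down
        if match is None:
            continue  # not a failure-count or permanent-down line
        entry = summary.setdefault(
            int(match["slot"]),
            {"failures": 0, "since": None, "permanent": False},
        )
        if failed:
            # Keep the most recent counter and the "since" timestamp it refers to.
            entry["failures"] = int(failed["count"])
            entry["since"] = failed["since"]
        else:
            entry["permanent"] = True
    return summary


# Lines copied from the excerpt above.
SAMPLE = [
    'Oct 31 18:15:23 TXSAT1DCEX801 cbssysctrld: [I] Slot 9 failed 1 time since "Thu Oct 31 18:15:23 2013"',
    'Oct 31 18:21:24 TXSAT1DCEX801 cbssysctrld: [I] Slot 9 failed 2 times since "Thu Oct 31 18:15:23 2013" (APP failure: 0, others: 2, TWT: 60, MFI: 35340, MAX: 3)',
    'Oct 31 21:14:37 TXSAT1DCEX801 cbssysctrld: [I] Slot 9 failed 3 times since "Thu Oct 31 18:15:23 2013" (APP failure: 0, others: 3, TWT: 122, MFI: 35402, MAX: 3)',
    'Oct 31 21:20:32 TXSAT1DCEX801 cbssysctrld: [E] Slot 9 is down PERMANENTLY',
]

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

Run against a saved copy of the system log, a summary like this makes it easier to see which slots are approaching the MAX value before one is disabled permanently.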
 

Q2. If you have a standby APM, will it take over if an active APM crashes?
A2. A single crash or reboot does not trigger a standby APM takeover. The standby blade takes over only after the active blade exceeds the excessive-failure threshold or is manually disabled. This is by design.
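
The takeover rule in A2 can be written as a small decision function. This is only a model of the behavior described in this article, not XOS code. The function name, parameter names, and the "greater than MAX" comparison are assumptions inferred from the log excerpt, where slot 9 reaches 3 failures (MAX: 3) and is disabled on the next one.

```python
def standby_should_take_over(failure_count, max_failures=3, manually_disabled=False):
    """Model of A2: a standby takes over only for a permanently down or disabled blade.

    failure_count     -- failures counted inside the current MaxFaultInterval window
    max_failures      -- the MAX value from the log entries (3 in this article)
    manually_disabled -- True if an administrator disabled the blade
    """
    # Assumption: the blade goes down permanently on the failure after MAX is reached.
    exceeded = failure_count > max_failures
    return manually_disabled or exceeded


if __name__ == "__main__":
    for count in range(1, 5):
        print(count, standby_should_take_over(count))
```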

Q3. What is the time threshold for tracking missing HBs or health polls? At what point does a previous failure no longer count against the limit of 3?
A3. The threshold is MFI (MaxFaultInterval), which is measured from the timestamp of the first failure (the "since" time in the log entries). In this case, the log entry for the third failure reports an MFI of 35402 seconds (approximately 9.83 hours).

Oct 31 18:15:23 TXSAT1DCEX801 cbssysctrld: [I] Slot 9 failed 1 time since "Thu Oct 31 18:15:23 2013"
Oct 31 18:21:24 TXSAT1DCEX801 cbssysctrld: [I] Slot 9 failed 2 times since "Thu Oct 31 18:15:23 2013" (APP failure: 0, others: 2, TWT: 60, MFI: 35340, MAX: 3)
Oct 31 21:14:37 TXSAT1DCEX801 cbssysctrld: [I] Slot 9 failed 3 times since "Thu Oct 31 18:15:23 2013" (APP failure: 0, others: 3, TWT: 122, MFI: 35402, MAX: 3)
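
To make the arithmetic concrete, the Python sketch below converts the logged MFI value to hours and works out when the window that started at the first failure would end. The helper names are hypothetical. The idea that a failure stops counting once the window has elapsed is an assumption drawn from the answer above, not a confirmed description of how cbssysctrld resets its counter.

```python
from datetime import datetime, timedelta

# Values taken from the log entries above.
FIRST_FAILURE = datetime(2013, 10, 31, 18, 15, 23)   # "since" timestamp
THIRD_FAILURE = datetime(2013, 10, 31, 21, 14, 37)   # third failure entry
MFI_SECONDS = 35402                                  # MFI logged with the third failure


def window_end(first_failure, mfi_seconds):
    """Time at which the MaxFaultInterval window opened by the first failure ends."""
    return first_failure + timedelta(seconds=mfi_seconds)


def within_window(first_failure, event_time, mfi_seconds):
    """True if event_time falls inside the window (assumed to be inclusive)."""
    elapsed = (event_time - first_failure).total_seconds()
    return 0 <= elapsed <= mfi_seconds


if __name__ == "__main__":
    print("MFI in hours:", MFI_SECONDS / 3600)
    print("Window ends:", window_end(FIRST_FAILURE, MFI_SECONDS))
    print("Third failure inside window:",
          within_window(FIRST_FAILURE, THIRD_FAILURE, MFI_SECONDS))
```

With these values, the window that opened at 18:15:23 on Oct 31 runs until 04:05:25 on Nov 1, so the second and third failures in the excerpt both fall well inside it.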