Heartbeats and HealthCheck Polls

Article ID: 168112

Products

XOS

Issue/Introduction

Detailed information about Heartbeats and HealthCheck Polls, and the effects of missed heartbeats or HC Polls.

Log messages:

The following log entries in the messages file indicate that heartbeats or HealthCheck polls have been missed:

 

Feb 15 23:10:56 redpill cbshmonitord[1517]: [N] hm_process_report_data Violation (s=1, no alarm) occurred: module:5, item:1401 (H_ID_POLL_MISSING

 

Example log message of the CPM reporting missing heartbeats:

Mar 13 01:31:19 redpill cbssysctrld[1569]: [I] No heartbeat from slot 6

 

Example of two NPMs reporting missed heartbeats from another slot:

Mar 13 01:31:17 npm2  tHeartbeat: [I] age: [Fab-2] Missing heartbeats FROM slots =0020

Mar 13 01:31:17 npm2  tSdpState: [I] DisablePrcPort: Prc_A: Disable port1 .

Mar 13 01:31:18 npm1  tHeartbeat: [I] age: [Fab-1] Missing heartbeats FROM slots =0020

Mar 13 01:31:18 npm1  tSdpState: [I] DisablePrcPort: Prc_A: Disable port1 .
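
The "slots =0020" value in the NPM messages above looks like a hexadecimal bitmask of slots. The following is a minimal sketch under the assumption (not confirmed by this article) that bit 0 corresponds to slot 1. Under that assumption, 0020 decodes to slot 6, which agrees with the CPM message "No heartbeat from slot 6" logged at about the same time. The function name is hypothetical.

```python
def slots_from_mask(mask_hex: str) -> list[int]:
    """Return the 1-based slot numbers for each bit set in a hex mask string.

    Assumption: bit 0 of the mask corresponds to slot 1.
    """
    mask = int(mask_hex, 16)
    slots = []
    bit = 0
    while mask:
        if mask & 1:
            slots.append(bit + 1)
        mask >>= 1
        bit += 1
    return slots


print(slots_from_mask("0020"))  # prints [6]
```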

 

Cause

This article provides details about how Heartbeats and HealthCheck Polls are used between modules.

Resolution

Heartbeats:
 
Heartbeats go between all boards 4 times per second. Every board sends heartbeats on all channels - 2 data paths, and 2 control paths - if applicable. (2 NPMs = 2 data paths, 2 CPMs = 2 control paths)


Heartbeats are sent through each path, multicast on the control plane and unicast on the data plane. There are actually 2 types of heartbeat packets: 'Hello' and 'Election'. The latter are sent unicast between 2 CPMs and are used to negotiate which CP is to become the primary. The former are broadcast and are used to track states of links interconnecting various processor modules. All hello messages have the same format, which includes primary CP info and link state info.

Link state is maintained on each module. It is obtained by examining incoming heartbeat messages. APM and NPM modules only keep track of (unicast) connectivity from remote modules to themselves.

The CPM maintains the state of all connections (links) in the system. Each module indicates in its heartbeats which slots it has heard from and over which ports. The state of each such connection is represented by a 4-bit number in the packet. The CPM stores these numbers and makes them available to the user (e.g. via the ‘show heartbeat’ CLI command). Other modules ignore the state portion of heartbeat messages.

The state of a (unidirectional) link is based on the number of heartbeats received over time. The link state is represented as a number between 0 and 15, 0 meaning the link is down and 15 meaning the link is fully operational. It may be easier to think of the number 15 as representing 100% of the link being available (the CLI approach). As noted above, there are two heartbeat packet types: one is used for CP redundancy election and the other is the hello packet that all other blades respond to.
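
As an illustration of the 0 to 15 scale, the following minimal sketch converts a 4-bit link state into a percentage, with 15 treated as 100%. The linear scaling and the rounding are assumptions made for the example; the article does not state how the CLI derives its figure from the stored value. The function name is hypothetical.

```python
MAX_LINK_STATE = 15  # largest value that fits in the 4-bit link state field


def link_state_to_percent(state: int) -> int:
    """Scale a 4-bit link state (0..15) linearly to a whole percentage."""
    if not 0 <= state <= MAX_LINK_STATE:
        raise ValueError("link state must be between 0 and 15")
    # Integer arithmetic with round-half-up to avoid floating point surprises.
    return (state * 100 + MAX_LINK_STATE // 2) // MAX_LINK_STATE


print(link_state_to_percent(15))  # prints 100
print(link_state_to_percent(0))   # prints 0
```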

NPM originated:
The NPM sends heartbeats on the data plane in order to verify the integrity of the switch fabric. These heartbeats are sent 50% unicast and 50% multicast. This is done to verify the NPM's ability to transmit both types of packets. NPM heartbeats also incorporate send and receive sequence numbers, which make it possible for the NPM to determine whether heartbeats are being lost to or from the NPM. These are sent 4 times per second. If the NPM determines that heartbeats were lost to or from a particular slot for 2 seconds (8 heartbeats), it will disable and re-enable the switch fabric link (PRC port) to the offending slot. Note that if the switch fabric link is up but the NPM has not yet received a heartbeat from that particular slot, it will not bounce the link. The NPM bounces the link only if heartbeats have been received and then lost.
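
The bounce rule described above can be sketched as a small per-slot state machine. This is an illustration only, not the NPM implementation: the class and method names are hypothetical, and resetting the "heard from" flag after a bounce is an assumption based on the note that a freshly raised link is not bounced until a heartbeat has been received on it.

```python
HEARTBEATS_PER_SECOND = 4
LOSS_WINDOW_SECONDS = 2
LOSS_THRESHOLD = HEARTBEATS_PER_SECOND * LOSS_WINDOW_SECONDS  # 8 heartbeats


class SlotLinkMonitor:
    """Tracks heartbeats for one remote slot (hypothetical helper)."""

    def __init__(self) -> None:
        self.ever_received = False
        self.consecutive_missed = 0

    def on_interval(self, received: bool) -> bool:
        """Process one 250 ms heartbeat interval.

        Returns True when the link to the slot should be bounced.
        """
        if received:
            self.ever_received = True
            self.consecutive_missed = 0
            return False
        if not self.ever_received:
            # Nothing has been heard since the link came up: never bounce.
            return False
        self.consecutive_missed += 1
        if self.consecutive_missed >= LOSS_THRESHOLD:
            # Assumption: after a bounce the link counts as freshly up,
            # so no further bounce happens until a heartbeat is heard.
            self.ever_received = False
            self.consecutive_missed = 0
            return True
        return False


monitor = SlotLinkMonitor()
monitor.on_interval(True)                                # heartbeat seen once
print([monitor.on_interval(False) for _ in range(8)])    # last entry is True
```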

Heartbeats are originated within the kernel from a function called "bottom half." Responses to heartbeats are collected within the CPM, and this information is used to populate the "FPM state table," which cbssysctrld uses to keep track of VAP state and which cbsflowcalcd uses to make load-balancing decisions.

####-------------------------------------------------------------------###

HealthCheck Polls:

HealthCheck Polls are originated by the CPM's cbshmonitord. There are 3 different types of polls:
Fast polls - sent once per second; these query rapidly changing information such as insertion and removal of blades, alarm LED conditions, etc.
Medium polls - sent every 10 seconds; these query other information (link states, temperatures).
Slow polls - sent every 30 seconds; these query more slowly changing information on each blade (CPU utilization, voltage fluctuation, power and fan status, etc.).
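
To make the three poll intervals concrete, the following minimal sketch lists which poll types would be due at a given second. It assumes all three timers start together at second 0, which the article does not state; the names are hypothetical and this is not how cbshmonitord is implemented.

```python
# Poll intervals in seconds, taken from the list above.
POLL_INTERVALS_SECONDS = {"fast": 1, "medium": 10, "slow": 30}


def polls_due(elapsed_seconds: int) -> list[str]:
    """Return the poll types due at the given whole second.

    Assumption: all timers are aligned to a common start at second 0.
    """
    return [
        name
        for name, interval in POLL_INTERVALS_SECONDS.items()
        if elapsed_seconds % interval == 0
    ]


print(polls_due(7))   # prints ['fast']
print(polls_due(10))  # prints ['fast', 'medium']
print(polls_due(30))  # prints ['fast', 'medium', 'slow']
```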

Responses to HC polls are originated by "cbshagentd", which runs in user space on every blade. Engineering implemented it this way by design: if the user-space cbshagentd cannot run, then other user-space applications will not be able to run either. In addition to responding to polls, the health agent daemon on each blade can send asynchronous updates about hardware removal/insertion (blades, power supplies, fan trays), link states, etc.

####-------------------------------------------------------------###


Health status:

The percentage figure from 'show heartbeat' is calculated over the last 4 seconds. If 8 heartbeats are lost in 4 seconds then health is 50%. If 2 heartbeats are lost over 4 seconds then health is 88%, and so on.
If 2 seconds of heartbeats (8 heartbeats) are lost in a row then the path between the two boards is considered dead.
If a CPM (the master CPM in a dual-CPM system) detects this, the board will be rebooted. If the NPM detects this, that board will no longer receive traffic over that path until the problem is corrected. (With 2 NPMs or CPMs there's still a backup path, so that will be used.) The NPM will also send a message to the CPM about this. If an APM detects this, it won't use that path and will send a message to the CPM.
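
The arithmetic behind the percentage figures can be sketched as follows: at 4 heartbeats per second, a 4-second window holds 16 expected heartbeats, so 8 lost gives 50% and 2 lost gives 87.5%, shown as 88%. The rounding rule and the function names are assumptions for the example; only the window size, heartbeat rate, and the two sample values come from the text above.

```python
HEARTBEATS_PER_SECOND = 4
WINDOW_SECONDS = 4
EXPECTED_IN_WINDOW = HEARTBEATS_PER_SECOND * WINDOW_SECONDS  # 16
DEAD_AFTER_CONSECUTIVE_LOST = 2 * HEARTBEATS_PER_SECOND      # 8 (2 seconds)


def health_percent(lost_in_window: int) -> int:
    """Percentage of expected heartbeats received in the last 4 seconds."""
    if not 0 <= lost_in_window <= EXPECTED_IN_WINDOW:
        raise ValueError("lost heartbeats must be between 0 and 16")
    received = EXPECTED_IN_WINDOW - lost_in_window
    # Assumption: round half up to a whole percent (87.5 becomes 88).
    return (received * 100 + EXPECTED_IN_WINDOW // 2) // EXPECTED_IN_WINDOW


def path_is_dead(consecutive_lost: int) -> bool:
    """A path is considered dead after 2 seconds of consecutive loss."""
    return consecutive_lost >= DEAD_AFTER_CONSECUTIVE_LOST


print(health_percent(8))  # prints 50
print(health_percent(2))  # prints 88
print(path_is_dead(8))    # prints True
```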

Statistics used for flow-scheduling, which are sent from APMs to CPMs, are not used for health status. Health is determined by heartbeats only.


-----------

Missed Heartbeats or HealthCheck Polls:

If a blade misses 2 seconds of heartbeats (8 heartbeats) on each path and misses 3 HC polls, cbssysctrld will reset the blade.

If a blade misses 60 seconds of HC polls but is still receiving heartbeats, the cbssysctrld daemon will reset the blade.
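
The two reset conditions above can be expressed as a single decision function. This is a sketch of the rules as written, not the cbssysctrld logic: the function and parameter names are hypothetical, and the article does not say which poll type the "3 HC polls" refers to, so the caller supplies that count.

```python
HEARTBEAT_LOSS_SECONDS = 2   # 8 heartbeats at 4 per second
MISSED_POLL_LIMIT = 3
POLL_SILENCE_SECONDS = 60


def should_reset_blade(
    seconds_without_heartbeats_per_path: list[float],
    missed_hc_polls: int,
    seconds_without_hc_polls: float,
) -> bool:
    """Apply the two reset rules described in the text."""
    paths = seconds_without_heartbeats_per_path
    # Rule 1: heartbeats lost for 2 seconds on every path AND 3 HC polls missed.
    all_paths_silent = bool(paths) and all(
        s >= HEARTBEAT_LOSS_SECONDS for s in paths
    )
    if all_paths_silent and missed_hc_polls >= MISSED_POLL_LIMIT:
        return True
    # Rule 2: 60 seconds without HC poll responses, even if heartbeats arrive.
    return seconds_without_hc_polls >= POLL_SILENCE_SECONDS


print(should_reset_blade([2.0, 2.5], 3, 3.0))   # prints True
print(should_reset_blade([0.0, 0.0], 0, 60.0))  # prints True
print(should_reset_blade([2.0, 0.0], 3, 3.0))   # prints False
```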



###-----------------------------------------------------------###

'Show Heartbeat' Command:

The CLI command 'show heartbeat' provides a percentage view of the heartbeats received by each module. The percentage is calculated over the last 4 seconds, as described under Health status.

Here is an example:

pod1# show heartbeat
Link Quality TO: 1
FROM      1    2    3    4    5    6    7    8    9    10   11   12   13   14

ON ports

CB A:     NA   100% NA   NA   NA   NA   NA   100% 100% 100% 100% NA   100% NA

CB B:     NA   100% NA   NA   NA   NA   NA   100% 100% 100% 100% NA   NA   100%

DP A:     100% 100% NA   NA   NA   NA   NA   100% 100% 100% 100% NA   100% NA

DP B:     NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA

DP C:     NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA

DP D:     NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA

 

Link Quality TO: 2

FROM      1    2    3    4    5    6    7    8    9    10   11   12   13   14

ON ports

CB A:     100% NA   NA   NA   NA   NA   NA   100% 100% 100% 100% NA   100% NA

CB B:     100% NA   NA   NA   NA   NA   NA   100% 100% 100% 100% NA   NA   100%

DP A:     NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA

DP B:     100% 100% NA   NA   NA   NA   NA   100% 100% 100% 100% NA   100% NA

DP C:     NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA

DP D:     NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA   NA


 

###-------------------------------------------------------------------------###

Heartbeat Packet:

Heartbeats are low-level BOFL (Breath of Life) packets. Viewed in a sniffer trace, the text version of a packet looks like this:

 


Frame 1 (94 bytes on wire, 94 bytes captured)

    Arrival Time: Mar 25, 2004 13:59:05.984058000

    Time delta from previous packet: 0.000000000 seconds

    Time since reference or first frame: 0.000000000 seconds

    Frame Number: 1

    Packet Length: 94 bytes

    Capture Length: 94 bytes

Ethernet II, Src: 00:03:d2:00:01:03, Dst: 01:00:5e:00:00:01

    Destination: 01:00:5e:00:00:01 (01:00:5e:00:00:01)

    Source: 00:03:d2:00:01:03 (Crossbea_00:01:03)

    Type: Unknown (0x8102)

Breath of Life

    PDU: 0x11030d00

    Sequence: 16843028

    Padding (72 byte)

 

0000  01 00 5e 00 00 01 00 03 d2 00 01 03 81 02 11 03  ..^.............

0010  0d 00 01 01 01 14 03 e1 00 00 08 00 10 67 02 c3  .............g..

0020  10 10 41 08 60 00 00 00 00 00 01 02 3f ff 00 0f  ..A.`.......?...

0030  00 ff f0 f0 f0 ff 00 00 00 00 00 00 00 00 00 0f  ................

0040  00 ff f0 f0 f0 0f 00 0f 00 ff f0 f0 f0 f0 00 00   ...............

0050  00 00 00 00 00 00 00 00 00 00 00 00 00 00         ..............
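
The fields shown in the decode above can be recovered from the raw bytes of the hex dump. The following minimal sketch assumes only what the trace itself shows: a 14-byte Ethernet II header, then a 4-byte PDU and a 4-byte sequence number in network byte order, with the remaining 72 bytes treated as an opaque payload (the trace labels it padding). The function name is hypothetical.

```python
import struct

# Bytes copied from the hex dump above (94 bytes in total).
FRAME_HEX = """
01 00 5e 00 00 01 00 03 d2 00 01 03 81 02 11 03
0d 00 01 01 01 14 03 e1 00 00 08 00 10 67 02 c3
10 10 41 08 60 00 00 00 00 00 01 02 3f ff 00 0f
00 ff f0 f0 f0 ff 00 00 00 00 00 00 00 00 00 0f
00 ff f0 f0 f0 0f 00 0f 00 ff f0 f0 f0 f0 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00
"""


def format_mac(raw: bytes) -> str:
    """Format 6 raw bytes as a colon-separated MAC address."""
    return ":".join(f"{b:02x}" for b in raw)


def parse_heartbeat_frame(frame: bytes) -> dict:
    """Split a captured heartbeat frame into the fields shown in the trace."""
    dst, src, ethertype = struct.unpack("!6s6sH", frame[:14])
    pdu, sequence = struct.unpack("!II", frame[14:22])
    return {
        "dst": format_mac(dst),
        "src": format_mac(src),
        "ethertype": ethertype,
        "pdu": pdu,
        "sequence": sequence,
        "payload_len": len(frame) - 22,
    }


frame = bytes.fromhex("".join(FRAME_HEX.split()))
fields = parse_heartbeat_frame(frame)
print(fields["dst"], fields["src"], hex(fields["ethertype"]))
print(hex(fields["pdu"]), fields["sequence"], fields["payload_len"])
```

Run against the dump above, this yields the same values as the text decode: type 0x8102, PDU 0x11030d00, sequence 16843028, and 72 trailing bytes.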

Workaround

N/A