DE53884 - System Hang after many messages like Unable to resend broadcast message, slump has disconnected, out-of-sequence node broadcast


Article ID: 192731



Products

CA Service Management - Service Desk Manager

Issue/Introduction

In an Advanced Availability environment with multiple Application servers, processes intermittently lose the ability to communicate between servers, and the error "UNABLE TO RESEND BROADCAST MESSAGE" is logged in the stdlog file. When the number of these messages grows too high, the system hangs and becomes unresponsive.

Steps to Reproduce:

1.  Start a script that creates 100 tickets on the BG server.

2.  Connect to the BG server. In this test the server is a VM hosted on an ESX server, so open the ESX server's vSphere Web Client, uncheck the "Connected" checkbox to disconnect slump, and click OK.

 

   The following message appears in the stdlog on the BG server:


      05/07 23:43:04.76 sdmAA-BGSB1  slump_nxd            8496 SIGNIFICANT  list.c                 553 Node(10.80.58.238) Slump ID(581) has disconnected

 

3. Wait 1 or 2 seconds, then check the "Connected" checkbox again so that slump reconnects.

 

   The stdlog on the BG server may then show:

   

      05/07 23:43:04.95 sdmAA-ews24  slump_nxd            8496 SIGNIFICANT  list.c                 506 Node(10.80.58.238) Slump ID(581) has connected

 

4. Let the script continue creating tickets on the BG server.

 

   Messages like the following may appear in the stdlog of the BG server:

 

      05/07 23:43:07.73 sdmAA-BGSB1  slump_nxd            8496 TRACE        server.c              5412 Resent broadcast message 443 to node sdmAA-APP1
      05/07 23:43:07.73 sdmAA-BGSB1  slump_nxd            8496 TRACE        server.c              5412 Resent broadcast message 444 to node sdmAA-APP1
      05/07 23:43:07.75 sdmAA-BGSB1  slump_nxd            8496 TRACE        server.c              5412 Resent broadcast message 445 to node sdmAA-APP1

 

   Wireshark capture at the time of the resend:

 

      15815    2020-05-07 23:43:07.732326    10.80.62.187    10.80.58.238    TCP    1628    54132 → 2101 [PSH, ACK] Seq=2533 Ack=865 Win=2101504 Len=1574
      15818    2020-05-07 23:43:07.735310    10.80.58.238    10.80.62.187    TCP    60    2101 → 54132 [ACK] Seq=865 Ack=4107 Win=2102272 Len=0

 

   The stdlog on the App server shows the following messages:

 

      05/07 23:43:08.12 sdmAA-APP1  slump_nxd            5928 TRACE        server.c              3323 Received node broadcast message from 579|prov#6516_bpvirtdb_srvr to *|*|cr_status_trans_history::DB_CHANGE
      05/07 23:43:08.12 sdmAA-APP1  slump_nxd            5928 WARNING      list.c                 586 Received out-of-sequence node broadcast 508 from 10.80.62.187 - previous sequence was 442

 

After this, every broadcast message that the BG server sends is rejected by the App server with the "Received out-of-sequence" warning.

 

According to the Wireshark traces, the BG server is sending the DB_CHANGE messages and the same messages are arriving at the App server.
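
The out-of-sequence warnings shown above can be located in a large stdlog with a short script. The following is a minimal sketch, not a supported tool: the field layout is inferred only from the sample log lines in this article, and the names LINE_RE, OOS_RE and find_out_of_sequence are illustrative.

```python
import re

# Assumed stdlog layout, inferred from the sample lines in this article:
# date, time, host, process, pid, level, source file, source line, message.
LINE_RE = re.compile(
    r"^\s*(?P<date>\d{2}/\d{2})\s+(?P<time>\d{2}:\d{2}:\d{2}\.\d{2})\s+"
    r"(?P<host>\S+)\s+(?P<proc>\S+)\s+(?P<pid>\d+)\s+(?P<level>\S+)\s+"
    r"(?P<src>\S+)\s+(?P<srcline>\d+)\s+(?P<msg>.*)$"
)

# Message text as it appears in the App server stdlog sample.
OOS_RE = re.compile(
    r"Received out-of-sequence node broadcast (?P<received>\d+) "
    r"from (?P<node>\S+) - previous sequence was (?P<previous>\d+)"
)


def find_out_of_sequence(lines):
    """Return one dict per out-of-sequence warning found in the given lines."""
    events = []
    for raw in lines:
        parsed = LINE_RE.match(raw)
        if parsed is None:
            continue  # not a stdlog entry in the assumed layout
        hit = OOS_RE.search(parsed.group("msg"))
        if hit is None:
            continue
        events.append({
            "date": parsed.group("date"),
            "time": parsed.group("time"),
            "host": parsed.group("host"),
            "node": hit.group("node"),
            "received": int(hit.group("received")),
            "previous": int(hit.group("previous")),
        })
    return events
```

The function accepts any iterable of lines, so an open stdlog file object can be passed directly. A steadily growing list of events for the same node after a slump reconnect matches the behavior described in this article.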

 

Expected result:

 

   When "Resent broadcast message" is logged on the BG server, the App server should log a message like the following:

 

      Received resent broadcast message "<< *(dg.pSequenceNumber) << " from node " << pNode->get_node_host() << "; resetting broadcast sequence".

Test Environment

CA Service Desk Manager 17.1 - Rollup patches up to 17.1.0.4

Advanced Availability Configuration: Background Server (sdmAA-BGSB1) and more than two Application Servers, one of which is sdmAA-APP1

O.S.: Windows Server 2016

Database: Microsoft SQL Server 2014

 

 

Cause

Product defect DE53884.

Environment

Release : 17.1 up through at least 17.1.0.9, 17.2 up through at least 17.2.0.8, and at least 17.3 GA.

Component : SERVICE DESK MANAGER

Resolution

Debug test patch T56H126 (Linux) is available to resolve defect DE53884 on CA SDM 17.1 RU4.

The fix is to be integrated into future versions as soon as possible. Search the product documentation for DE53884 to determine which rollup patch includes the fix.