In an Advanced Availability environment with multiple Application servers, processes randomly become unable to communicate between servers, and the error "UNABLE TO RESEND BROADCAST MESSAGE" is logged in the stdlog file. When the number of messages is too high, the system hangs and becomes unresponsive.
Steps to Reproduce:
1. Start executing a script that creates 100 tickets on the BG server.
2. Connect to the BG server. In this test the BG server is a VM hosted on an ESX server, so log in to the ESX server's vSphere Web Client, uncheck the "Connected" checkbox to disconnect slump, and click OK.
The following message appears in the stdlog on the BG server:
05/07 23:43:04.76 sdmAA-BGSB1 slump_nxd 8496 SIGNIFICANT list.c 553 Node(10.80.58.238) Slump ID(581) has disconnected
3. Wait 1 or 2 seconds, then check the "Connected" checkbox again so that slump reconnects.
The BG server stdlog may then show:
05/07 23:43:04.95 sdmAA-ews24 slump_nxd 8496 SIGNIFICANT list.c 506 Node(10.80.58.238) Slump ID(581) has connected
4. Let the script continue creating tickets on the BG server.
Messages like the following may appear in the stdlog of the BG server:
05/07 23:43:07.73 sdmAA-BGSB1 slump_nxd 8496 TRACE server.c 5412 Resent broadcast message 443 to node sdmAA-APP1
05/07 23:43:07.73 sdmAA-BGSB1 slump_nxd 8496 TRACE server.c 5412 Resent broadcast message 444 to node sdmAA-APP1
05/07 23:43:07.75 sdmAA-BGSB1 slump_nxd 8496 TRACE server.c 5412 Resent broadcast message 445 to node sdmAA-APP1
Wireshark capture at the time of the resend:
15815 2020-05-07 23:43:07.732326 10.80.62.187 10.80.58.238 TCP 1628 54132 → 2101 [PSH, ACK] Seq=2533 Ack=865 Win=2101504 Len=1574
15818 2020-05-07 23:43:07.735310 10.80.58.238 10.80.62.187 TCP 60 2101 → 54132 [ACK] Seq=865 Ack=4107 Win=2102272 Len=0
The stdlog on the App server shows the following messages:
05/07 23:43:08.12 sdmAA-APP1 slump_nxd 5928 TRACE server.c 3323 Received node broadcast message from 579|prov#6516_bpvirtdb_srvr to *|*|cr_status_trans_history::DB_CHANGE
05/07 23:43:08.12 sdmAA-APP1 slump_nxd 5928 WARNING list.c 586 Received out-of-sequence node broadcast 508 from 10.80.62.187 - previous sequence was 442
From this point on, the App server rejects every broadcast message the BG server sends, logging a "Received out-of-sequence" warning for each one.
The Wireshark traces show that the BG server is sending the DB_CHANGE messages and that the same messages are arriving at the App server.
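To check whether a server has hit this condition, the stdlog files can be scanned for the two message types quoted above. The following is a minimal sketch, not a supported tool: the regular expressions are derived only from the log lines shown in this article, and the function name scan_stdlog is an assumption made for illustration.

```python
import re

# Patterns derived from the stdlog lines quoted in this article.
RESENT = re.compile(r"Resent broadcast message (\d+) to node (\S+)")
OUT_OF_SEQ = re.compile(
    r"Received out-of-sequence node broadcast (\d+) from (\S+)"
    r" - previous sequence was (\d+)"
)


def scan_stdlog(lines):
    """Return (resent, out_of_seq) tuples found in an iterable of log lines."""
    resent = []      # (sequence, target node)
    out_of_seq = []  # (received sequence, sender, previous sequence)
    for line in lines:
        match = RESENT.search(line)
        if match:
            resent.append((int(match.group(1)), match.group(2)))
            continue
        match = OUT_OF_SEQ.search(line)
        if match:
            out_of_seq.append(
                (int(match.group(1)), match.group(2), int(match.group(3)))
            )
    return resent, out_of_seq


# Sample lines copied from the BG and App server logs above.
sample = [
    "05/07 23:43:07.73 sdmAA-BGSB1 slump_nxd 8496 TRACE server.c 5412 "
    "Resent broadcast message 443 to node sdmAA-APP1",
    "05/07 23:43:07.73 sdmAA-BGSB1 slump_nxd 8496 TRACE server.c 5412 "
    "Resent broadcast message 444 to node sdmAA-APP1",
    "05/07 23:43:08.12 sdmAA-APP1 slump_nxd 5928 WARNING list.c 586 "
    "Received out-of-sequence node broadcast 508 from 10.80.62.187 "
    "- previous sequence was 442",
]

resent, out_of_seq = scan_stdlog(sample)
print("resent:", resent)
print("out of sequence:", out_of_seq)
```

On a real system, the lines would come from the stdlog files on each server. "Resent" entries on the BG server followed by repeated out-of-sequence warnings on an App server match the pattern described here.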
Expected result:
When "Resent broadcast message" is logged on the BG server, the App server should log a message along these lines:
"Received resent broadcast message " << *(dg.pSequenceNumber) << " from node " << pNode->get_node_host() << "; resetting broadcast sequence"
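The difference between the observed and the expected behavior can be illustrated with a toy model of per-node sequence tracking. This is only a sketch under stated assumptions, not the slump_nxd implementation: the class BroadcastReceiver, its receive method, and the resent flag are hypothetical, and this article does not describe how a resend is actually marked on the wire.

```python
class BroadcastReceiver:
    """Toy model of per-node broadcast sequence tracking (not SDM code)."""

    def __init__(self):
        # Sender host -> last accepted sequence number.
        self.last_seq = {}

    def receive(self, node, seq, resent=False):
        last = self.last_seq.get(node)
        if resent:
            # Expected behavior per this article: accept the resend and
            # reset the tracked sequence to the resent number.
            self.last_seq[node] = seq
            return (f"Received resent broadcast message {seq} from node "
                    f"{node}; resetting broadcast sequence")
        if last is not None and seq != last + 1:
            # Observed behavior: the message is rejected and the tracked
            # sequence never advances, so later messages fail as well.
            return (f"Received out-of-sequence node broadcast {seq} from "
                    f"{node} - previous sequence was {last}")
        self.last_seq[node] = seq
        return "accepted"


rx = BroadcastReceiver()
bg = "10.80.62.187"
first = rx.receive(bg, 442)                 # normal delivery
gap = rx.receive(bg, 508)                   # rejected: previous was 442
still_stuck = rx.receive(bg, 509)           # rejected again: still 442
reset = rx.receive(bg, 443, resent=True)    # resend resets the sequence
resumed = rx.receive(bg, 444)               # delivery resumes
for result in (first, gap, still_stuck, reset, resumed):
    print(result)
```

In this model, the receiver stays stuck at the previous sequence number until a message identified as a resend resets it, which corresponds to the log message expected above.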
Test environment:
CA Service Desk Manager 17.1 - Rollup patches up to 17.1.0.4
Advanced Availability Configuration: Background Server (sdmAA-BGSB1) and more than 2 Application Servers, one of which is sdmAA-APP1
O.S.: Windows Server 2016
Database: Microsoft SQL Server 2014
Product defect DE53884.
Release : 17.1 through at least 17.1.0.9, 17.2 through at least 17.2.0.8, and at least 17.3 GA.
Component : SERVICE DESK MANAGER
Debug test patch T56H126 (Linux) exists for resolving defect DE53884 on CA SDM 17.1 RU4.
The fix is to be integrated into future versions as soon as possible - please search the product documentation for DE53884 to determine the associated rollup patch that includes the fix.