Network connection loss logic between SDM primary and secondary server


Article ID: 19588


Updated On:

Products

CA Service Desk Manager
CA Service Management - Service Desk Manager

Issue/Introduction

This document covers the reconnection logic used by CA Service Desk Manager between primary and secondary servers as well as some common problem scenarios.

This document does not cover the "Advanced Availability" features available in SDM since release 12.9.

Environment

Release: CA Service Desk Manager 12.9 and higher
Component: Conventional Configuration

Resolution

The Service Desk installation consists of a primary server and any number of secondary servers. Each server runs multiple SDM daemon processes. The main SDM communication manager process, called Slump (sslump_nxd), runs only on the primary server.

As noted, the Service Desk application is spread across multiple processes that can be distributed across multiple servers. Each process, when it starts, connects to a single well-known process referred to as the "slump" process. Because every process connects to slump, the slump process knows how to route messages to the other connected processes. By default, all communication between processes is routed through the slump process. To improve scalability and performance, some processes have the ability to create a "fast-channel" connection between each other. Again, since slump is connected to every process, it helps both processes set up that communication. Each connection, whether via slump or a fast-channel, uses a TCP port. A process such as a webengine will have multiple ports open to other processes (domsrvr, bpvirtdb_srvr, etc.) as well as one to slump.

All processes connect to slump initially and stay connected to slump. Every process connected to slump has a TCP port open with a port number above 2100, although not necessarily close to 2100. By default, TCP port 2100 is slump's listener port. If your installation is configured to use slump communication only, each Service Desk process uses its existing slump connection to communicate with other processes, and slump handles the communication between processes over the TCP ports it already has open to each of them. If the NX_SLUMP_FIXED_SOCKETS variable is set, slump is forced to open ports as close as possible to 2100. This is required in firewall environments, as a firewall needs a defined range of ports to keep open so that Service Desk usage is not affected negatively.
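As a quick sanity check from a secondary server, the reachability of slump's listener can be probed with a short script. The sketch below is illustrative only: the host name is a placeholder, and port 2100 is the default noted above (your installation may differ).

# Minimal reachability probe for the slump listener (a conceptual check, not an SDM tool).
# Assumes the default listener port 2100 and a placeholder primary host name.
import socket

PRIMARY_HOST = "sdm-primary.example.com"   # placeholder - substitute your primary server
SLUMP_PORT = 2100                          # default slump listener port

def slump_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to the slump listener can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError as exc:
        print(f"Cannot reach {host}:{port} - {exc}")
        return False

if __name__ == "__main__":
    print("slump listener reachable:", slump_reachable(PRIMARY_HOST, SLUMP_PORT))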

If Service Desk is configured to use fast-channel connections, a Service Desk process will request that slump open a fast-channel connection to the other process. In other words: process A asks slump for a fast-channel connection to process B; slump notifies process B to open a port for process A; process B reports the new port information back to slump; and slump passes it to process A. From that point on, processes A and B communicate directly with each other without involving the slump server. Ports created this way are random.
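The broker-assisted handshake described above can be pictured roughly as follows. This is not SDM source code; the "broker" dictionary simply stands in for slump's role of relaying the port information between the two processes.

# Conceptual illustration of broker-assisted "fast-channel" setup (not SDM source code).
import socket
import threading

def process_b(report_port):
    # Process B: open an ephemeral listening port and report it to the broker.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))          # port 0 = let the OS pick an ephemeral port
    listener.listen(1)
    report_port(listener.getsockname()[1])   # tell the broker which port was opened
    conn, _ = listener.accept()              # wait for process A's direct connection
    print("B received:", conn.recv(1024).decode())
    conn.close()
    listener.close()

def main():
    broker = {}                              # stands in for slump relaying the port number
    ready = threading.Event()

    def report_port(port):
        broker["b_port"] = port
        ready.set()

    b_thread = threading.Thread(target=process_b, args=(report_port,))
    b_thread.start()
    ready.wait()                             # the broker passes the port information to A

    # Process A now talks to B directly; the broker is not in the data path.
    with socket.create_connection(("127.0.0.1", broker["b_port"])) as a_to_b:
        a_to_b.sendall(b"hello over the fast channel")
    b_thread.join()

if __name__ == "__main__":
    main()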

By default, Service Desk is configured to use fast-channel connections. This is controlled by the NX_NOFASTCHAN variable. Changing Service Desk to use slump-only communication would cause a performance hit on the primary server where the slump server is running. Even when fast-channel is enabled, not all SDM processes utilize it; some still rely on slump to send messages back and forth.
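To confirm how an installation is configured, the two variables mentioned above can be checked in the SDM options file. The sketch below assumes the file is NX.env under the install directory ($NX_ROOT) and uses simple NAME=VALUE lines; adjust the path and parsing for your environment.

# Minimal NX.env check (assumption: the options file is $NX_ROOT/NX.env and uses
# simple NAME=VALUE lines, possibly prefixed with "@"; adjust the path as needed).
import os

NX_ENV = os.path.join(os.environ.get("NX_ROOT", r"C:\Program Files\CA\Service Desk Manager"), "NX.env")
WATCHED = ("NX_NOFASTCHAN", "NX_SLUMP_FIXED_SOCKETS")

def read_nx_options(path):
    """Return a dict of the watched NX options found in the file."""
    found = {}
    with open(path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            line = line.strip().lstrip("@")
            if "=" not in line or line.startswith("#"):
                continue
            name, _, value = line.partition("=")
            if name.strip() in WATCHED:
                found[name.strip()] = value.strip()
    return found

if __name__ == "__main__":
    options = read_nx_options(NX_ENV)
    for name in WATCHED:
        print(f"{name} = {options.get(name, '<not set>')}")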

In Support, whenever there are connectivity problems, we have recommended that customers use TcpView to monitor the Service Desk processes and see which ports they have open to which processes.

There is currently no verbose logging within the product to show which port numbers are being used; hence the recommendation to use the TcpView tool.

The TCP ports opened are random; when the fixed sockets variable is set, they are kept as close to (and above) 2100 as possible.

It is ideal to have Fixed Sockets and Fast Channel enabled.
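As a scripted alternative to TcpView, something like the following can list the TCP connections currently held by the SDM processes. This is a hedged sketch: it requires the third-party psutil package (pip install psutil), typically needs administrator rights, and the process-name list is taken from this article and may differ slightly by release.

# List TCP connections held by SDM processes (a scripted alternative to TcpView).
import psutil

SDM_NAMES = ("sslump_nxd", "webengine", "domsrvr", "bpvirtdb_srvr", "proctor", "pdm_d_mgr")

def sdm_connections():
    # Map the PIDs of SDM-looking processes to their names.
    names = {}
    for proc in psutil.process_iter(["pid", "name"]):
        pname = (proc.info["name"] or "").lower()
        if any(pname.startswith(n) for n in SDM_NAMES):
            names[proc.info["pid"]] = proc.info["name"]

    # Print every TCP connection owned by one of those processes.
    for conn in psutil.net_connections(kind="tcp"):
        if conn.pid in names:
            local = f"{conn.laddr.ip}:{conn.laddr.port}" if conn.laddr else "-"
            remote = f"{conn.raddr.ip}:{conn.raddr.port}" if conn.raddr else "-"
            print(f"{names[conn.pid]:<16} {conn.status:<12} {local:<22} -> {remote}")

if __name__ == "__main__":
    sdm_connections()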

SDM Reconnection Attempt Logic:

When Service Desk processes communicate with each other over a fast-channel connection (without slump involvement), any time there is a disconnection between the two processes, the process that first reports the event will NOT shut down, but will attempt to reconnect. Reconnection happens quickly, within a few seconds; there is no code that tells the process to wait, and reconnection attempts continue indefinitely.
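Conceptually, that fast-channel behavior looks like the sketch below: keep retrying the peer connection and never shut down. This is an illustration only, not SDM source code; the small delay exists just to keep the example from spinning, since the product itself has no explicit wait.

# Conceptual sketch of the fast-channel reconnection behavior described above.
import socket
import time

def keep_peer_connection(host: str, port: int, retry_delay: float = 2.0):
    while True:                                   # reconnection attempts are indefinite
        try:
            with socket.create_connection((host, port), timeout=5.0) as peer:
                print(f"connected to {host}:{port}")
                exchange_messages(peer)           # returns/raises when the link drops
        except OSError as exc:
            print(f"peer connection lost ({exc}); retrying")
        time.sleep(retry_delay)                   # only here to avoid a hot loop in the sketch

def exchange_messages(peer: socket.socket):
    """Placeholder for the application traffic carried over the fast channel."""
    while True:
        data = peer.recv(4096)
        if not data:                              # empty read = peer closed the socket
            raise ConnectionResetError("peer closed the connection")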

When a Service Desk process needs to communicate with the Slump process for message routing and a disconnection occurs, the process that first reports the event WILL shut down. In this case the perception is that Slump on the primary is unreachable. This is considered a severe event, so the processes are asked to shut down. Slump is the communication manager, and any perception that it is unavailable warrants shutting processes down and entering a hibernation state. This allows the IT team to investigate the status of the primary and take manual steps if automatic recovery does not take place.

In this scenario, multiple processes on a secondary server will eventually report the same event, and all main daemons are told to shut down. The proctor on the secondary will in most cases also see the problem, restart, and then hibernate while waiting for the connection to slump to be re-established. The proctor process is the only process that hibernates on a secondary server; it will retry indefinitely until it reaches slump successfully. Once the proctor is communicating with slump, the daemon manager on the primary will notice the processes that are not running on the secondary and request a restart. Those processes will then connect to slump. If startup fails for some reason, the process is restarted again, up to ten times, before giving up.
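The restart policy for a failed daemon can be summarized with the following sketch. It is conceptual, not SDM source code; the only details taken from the documented behavior are the limit of ten restart attempts and the pdm_d_refresh reset.

# Conceptual sketch of the restart policy described above (not SDM source code).
MAX_RESTARTS = 10

def supervise(start_process):
    """start_process() should return True when the daemon comes up and logs on to slump."""
    attempts = 0
    while attempts < MAX_RESTARTS:
        attempts += 1
        if start_process():
            print(f"process started on attempt {attempts}")
            return True
        print(f"startup attempt {attempts} failed (e.g. couldn't logon to slump)")
    print("Max restarts attempted; reset the count with pdm_d_refresh")
    return False

if __name__ == "__main__":
    # Demo: a process that never manages to log on to slump gives up after ten tries.
    supervise(lambda: False)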

How a disconnection to slump is detected by SDM:

The processes determine that the connection to the slump server is not available when they attempt to read or write data on the socket. A process usually registers a callback function with the slump layer to be notified when the connection is broken. This function is called when the disconnection happens so that the process can take action. For example, the webengine terminates as a result of the disconnection once the callback function is invoked.
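The callback pattern described above can be pictured roughly as follows. This is a conceptual sketch, not SDM source code; the class and function names are invented for illustration.

# Conceptual sketch of a disconnect callback registered with a connection layer.
import socket
import sys

class SlumpConnection:
    """Minimal stand-in for a connection layer that reports broken links."""

    def __init__(self, sock: socket.socket):
        self._sock = sock
        self._on_disconnect = None

    def register_disconnect_callback(self, callback):
        self._on_disconnect = callback

    def read(self) -> bytes:
        try:
            data = self._sock.recv(4096)
        except OSError as exc:                    # e.g. WSAECONNRESET / ECONNRESET
            self._report(f"socket error: {exc}")
            return b""
        if not data:                              # orderly close by the remote side
            self._report("connection closed by peer")
        return data

    def _report(self, reason: str):
        if self._on_disconnect:
            self._on_disconnect(reason)

def webengine_style_handler(reason: str):
    # A process like the webengine terminates once its callback is invoked.
    print(f"slump connection lost ({reason}); shutting down")
    sys.exit(1)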

The socket layer (OS network layer) usually returns a socket error code (for example, WSAECONNRESET (10054)) to the process's read/write calls when it attempts to read or write data on an already established socket connection. The Service Desk process logs this error in the standard logs:

 

09/22 14:28:26.47 SERVER domsrvr:21 3868 INFORMATION socket_port.c 1582 Error: WSAECONNRESET (10054) reading from 
SOCKET_PORT(0x024F9958) description = 	TCP/IP port_name = Slump Port status = 0 ip address = xxx.xxx.xxx.xxx compression = 1 extra_flags = 0 file 
descriptor = 288 write pending = 0 handler type = DATA read count = 2434167 write count = 1683967 socket = 0 

 

Here is the description of the error code:

WSAECONNRESET (10054): Connection reset by peer.
An existing connection was forcibly closed by the remote host. This normally results if the peer application on the remote host is suddenly stopped, the host is rebooted, the host or remote network interface is disabled, or the remote host uses a hard close (see setsockopt for more information on the SO_LINGER option on the remote socket). This error may also result if a connection was broken due to keep-alive activity detecting a failure while one or more operations are in progress. Operations that were in progress fail with WSAENETRESET. Subsequent operations fail with WSAECONNRESET.


"BPServer:: init: couldn't logon to slump!" messages:

If multiple attempts to connect to the slump process on the primary server fail, the message above will be printed to the stdlogs on the secondary server.

In this scenario, the webengine on the secondary loses its connection to the Slump process on the primary. As a result, a slump connection error message is reported in the secondary's SDM log file.

Message: "EXIT webengine.c 1152 Slump died detected"

What happens:

  • The webengine process on the secondary will shut down. Any perceived problem communicating with the primary's Slump process is considered severe, and as such the SDM process is asked to shut down and restart itself.

  • The secondary's proctor process will wait for a request from the primary's pdm_d_mgr (daemon manager) process to restart the secondary's stopped daemon processes; in this case, the webengine process.

  • The primary's pdm_d_mgr process will notice that some daemons on the secondary are not running and will ping the secondary's proctor process to start them; in this case, the webengine process.

  • The secondary webengine process is started and its initialization code is executed; during startup the process opens a new connection to Slump on the primary.

  • Opening the new connection is one of the first steps of webengine startup, so the connection is re-established fairly quickly, within seconds, depending on how fast the daemon manager notices the webengine is down and requests the proctor to restart it. If there are no further disruptions between the daemon manager and the secondary, or between the proctor and the primary, the whole sequence completes within seconds, depending on the speed of the servers and networks involved.

  • If for some reason the starting webengine cannot establish a connection or cannot log on to slump, it is restarted and tries again, up to 10 times, before it gives up. Ten is the maximum restart limit per process.

  • If the maximum restart limit is reached, the only way to get the process running again is to restart the SDM services entirely or to execute the pdm_d_refresh command.

In the case below, the webengine restarted 10 times, failing each time with the "BPServer::init: couldn't logon to slump!" error message.

10/07 00:16:56.98 SERVER pdm_d_mgr 3548 ERROR daemon_obj.c 1781 Max restarts attempted for _web_eng_SERVER2 You may reset the count by running pdm_d_refresh from the command line.

Slump itself was available, as other SDM processes remained connected and working fine.

Support has seen that the most common cause of this is VMware snapshots and backups. If not configured correctly, the backups/snapshots may consume all of the bandwidth on the wire and cause connection drops for other applications.

After running pdm_d_refresh, the processes should restart if connectivity is re-established:

 

10/07 00:23:10.55 SERVER1 proctor_SERVER2 4888 SIGNIFICANT pdm_process.c 545 Process Started 
(4960):D:/PROGRA~1/CA/SERVIC~1/bin/webengine -q -d domsrvr:21 -S web:SERVER2:1 -c D:/PROGRA~1/CA/SERVIC~1/bopcfg/www/SERVER2-
web1.cfg -r rpc_srvr:SERVER2 

 

If connectivity still could not be established, then we would see messages like the following:

First message:

 

10/07 00:16:46.11 SERVER2 proctor_SERVER2 4888 SIGNIFICANT pdm_process.c 545 Process Started
(3636):D:/PROGRA~1/CA/SERVIC~1/bin/webengine -q -d domsrvr:21 	-S web: SERVER2:1 -c 
D:/PROGRA~1/CA/SERVIC~1/bopcfg/www/SDCBCBSVMSS02-web1.cfg -r rpc_srvr:SERVER2 
10/07 00:16:46.57 SERVER2 web-engine 3636 EXIT bpserver.c 246 BPServer::init: couldn't logon to slump! 

 

Last message:

 

10/07 00:16:54.69 SERVER2 proctor_SERVER2 4888 SIGNIFICANT pdm_process.c 
545 Process Started (3380):D:/PROGRA~1/CA/SERVIC~1/bin/webengine -q -d domsrvr:21
-S web:SERVER2 :1 -c D:/PROGRA~1/CA/SERVIC~1/bopcfg/www/SERVER2-web1.cfg -r rpc_srvr:SERVER2 

 

10/07 00:16:55.15 SERVER2 web-engine 3380 EXIT bpserver.c 246 BPServer::init: couldn't logon to slump!

In the example above, 10 restart/connection attempts failed within 10 seconds (the duration depends on how fast the servers are). There is no code that makes a starting process wait to establish a connection: the process is starting and needs to establish the connection in order to proceed. A starting process has no knowledge of the earlier disconnection error, which is why a restart mechanism is in place for up to ten attempts.
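To find these events after the fact, the standard logs can be scanned for the messages quoted in this article. The sketch below assumes the logs are the stdlog* files under $NX_ROOT/log; adjust the directory for your installation.

# Quick scan of the SDM standard logs for the disconnection and restart messages
# shown in this article.
import glob
import os

LOG_DIR = os.path.join(os.environ.get("NX_ROOT", r"C:\Program Files\CA\Service Desk Manager"), "log")
PATTERNS = (
    "couldn't logon to slump",
    "Slump died detected",
    "Max restarts attempted",
    "WSAECONNRESET",
)

def scan_stdlogs(log_dir: str):
    for path in sorted(glob.glob(os.path.join(log_dir, "stdlog*"))):
        with open(path, encoding="utf-8", errors="replace") as fh:
            for lineno, line in enumerate(fh, 1):
                if any(p in line for p in PATTERNS):
                    print(f"{os.path.basename(path)}:{lineno}: {line.rstrip()}")

if __name__ == "__main__":
    scan_stdlogs(LOG_DIR)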

Troubleshooting steps and tools for diagnosing network disconnections between the primary and secondary servers:

  1. Wireshark, or the command line version "tshark", run with rotating capture files.

  2. Speak with your firewall administrator about any reported issues or anything that can be found in the firewall logs. If this problem has just started, have there been any recent changes?

  3. Speak with your network team about any reported issues in that time frame.

  4. Use a network monitoring tool to run regular ping tests between the primary and secondary servers and check whether the pings show packet loss when there is a problem in Service Desk (a minimal ping-monitor sketch follows this list).

  5. Are backups running on the same NIC as the application, or on a dedicated backup NIC?

  6. When (if at all) are VMware snapshots performed? If this is a new problem, have there been any recent changes?

  7. Is there a WAN link between the primary and secondary servers? High latency may cause performance and stability problems in Service Desk.
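For item 4 above, a minimal ping monitor could look like the sketch below. It shells out to the operating system's ping command and does a simple success check on the exit code; the target host name is a placeholder and the interval is arbitrary.

# Simple packet-loss probe between the primary and secondary servers.
import platform
import subprocess
import time

TARGET = "sdm-secondary.example.com"   # placeholder - the server you want to probe
INTERVAL_SECONDS = 30
COUNT_FLAG = "-n" if platform.system() == "Windows" else "-c"

def ping_once(host: str) -> bool:
    """Return True if a single ping succeeds (based on ping's exit code)."""
    result = subprocess.run(
        ["ping", COUNT_FLAG, "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

if __name__ == "__main__":
    sent = lost = 0
    while True:
        sent += 1
        if not ping_once(TARGET):
            lost += 1
            print(f"{time.strftime('%Y-%m-%d %H:%M:%S')} ping to {TARGET} failed "
                  f"({lost}/{sent} lost)")
        time.sleep(INTERVAL_SECONDS)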

Note: Service Desk 12.9 introduced a new feature called "Advanced Availability", which replaces the primary/secondary server architecture and provides much greater availability.