This document covers the reconnection logic used by CA Service Desk Manager between primary and secondary servers as well as some common problem scenarios.
This document does not cover the "Advanced Availability" features of 12.9.
How CA Service Desk Manager (SDM) behaves within a network:
The Service Desk installation consists of a primary server and any number of secondary servers. Each server runs multiple SDM daemon processes. The main SDM communication manager process, called Slump (sslump_nxd), runs on the primary server only.
As noted, the Service Desk application is spread across multiple processes that can be distributed across multiple servers. Each process, when it starts,
connects to a single known process which we call our "slump" process. Since all processes connect to slump, the slump process is instrumental in knowing
how to route messages to other connected processes. By default, all communication between processes is routed through the slump process. To provide
improved scalability and performance, some processes have the ability to create a "fast-channel" between each other. Again, since slump is connected to
each process, it helps both processes set up the communication. Each connection, whether via slump or fast-channel, uses a TCP port. A process such as
a webengine will have multiple ports opened to other processes (domsrvr, bpvirtdb_srvr, etc.) as well as one to slump.
All processes connect to slump initially and stay connected to it. Each process connected to slump always has a TCP port opened above 2100, but not necessarily close to 2100. By default, TCP port 2100 is slump's listener port. If your installation is configured to use slump communication only, each Service Desk process uses its existing slump connection to communicate with other processes; slump relays the traffic over the TCP ports it already holds open to each process. If the NX_SLUMP_FIXED_SOCKETS variable is set, slump opens ports as close to 2100 as possible. This is required in firewall environments, because a firewall needs a known range of ports to keep open so that Service Desk usage is not affected.
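The fixed-sockets allocation policy can be sketched as "bind to the lowest free port at or above 2100." The following is a minimal illustration of that policy, not SDM source code; the function name and port range are assumptions for the example.

```python
import socket

def bind_near(base_port: int, max_tries: int = 200) -> socket.socket:
    """Bind a listener to the lowest free TCP port >= base_port.

    Mimics the NX_SLUMP_FIXED_SOCKETS behavior described above:
    ports are allocated as close to 2100 as possible so a firewall
    can keep a predictable range open.
    """
    for port in range(base_port, base_port + max_tries):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.bind(("127.0.0.1", port))
            s.listen()
            return s
        except OSError:
            s.close()  # port already taken; try the next one up
    raise RuntimeError("no free port in range")

listener = bind_near(2100)
print(listener.getsockname()[1])  # first free port at or above 2100
```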
If Service Desk is configured to use fast-channel connections, a Service Desk process asks slump to broker a fast-channel connection to the other process: process A requests a fast-channel connection to process B, slump notifies process B to open a port for process A, process B reports the new port information to slump, and slump passes it to process A. From then on, processes A and B communicate directly with each other without the slump server's involvement. Ports created this way are random.
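The handshake above can be sketched with plain sockets. This is a hypothetical simulation, not SDM code: in the real product slump relays B's port number to A, while here the port is simply returned, since the broker step is the only part being elided.

```python
import socket
import threading

def fast_channel_demo() -> bytes:
    """Simulate the fast-channel handshake between two processes."""
    # Process B: open a listener on a random ephemeral port.
    lst = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    lst.bind(("127.0.0.1", 0))   # 0 = let the OS pick a random port
    lst.listen(1)
    port = lst.getsockname()[1]  # this is what slump would relay to A

    def b_side():
        conn, _ = lst.accept()
        conn.sendall(conn.recv(64))  # echo, proving direct traffic
        conn.close()
        lst.close()

    threading.Thread(target=b_side, daemon=True).start()

    # Process A: connect straight to B's port; slump is no longer involved.
    a = socket.create_connection(("127.0.0.1", port))
    a.sendall(b"ping")
    reply = a.recv(64)
    a.close()
    return reply

print(fast_channel_demo())  # b'ping'
```

Note the listener binds to port 0, which is why fast-channel ports are random: the OS chooses any free ephemeral port.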
By default, Service Desk is configured to use fast-channel connections. This is controlled by the NX_NOFASTCHAN variable. Changing Service Desk to use slump-only
communication causes a performance hit on the primary server, where the slump server runs. Even when fast-channel is enabled, not all SDM processes
use it; some still rely on slump to relay messages between processes.
In support, whenever there are connectivity problems, we recommend that customers use TcpView to monitor the Service Desk processes and see which ports they have opened to which processes.
There is currently no verbose logging within the product to show which port numbers are in use, hence the recommendation to use the TcpView tool.
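As an alternative to TcpView, `netstat -ano` output can be grouped by owning PID to see which ports each SDM process holds. The sketch below parses sample netstat-style lines; the sample addresses and PIDs are invented for illustration, and the column layout assumed here is the Windows `netstat -ano` format.

```python
import re
from collections import defaultdict

# Illustrative netstat -ano output (fabricated addresses and PIDs).
SAMPLE = """\
  TCP    10.0.0.5:2100    10.0.0.6:49233   ESTABLISHED   3548
  TCP    10.0.0.5:2101    10.0.0.6:49234   ESTABLISHED   3868
  TCP    10.0.0.5:2102    10.0.0.6:49240   ESTABLISHED   3868
"""

def ports_by_pid(netstat_output: str) -> dict:
    """Map each owning PID to the local TCP ports it has open."""
    result = defaultdict(list)
    for line in netstat_output.splitlines():
        m = re.match(r"\s*TCP\s+\S+:(\d+)\s+\S+\s+\S+\s+(\d+)", line)
        if m:
            local_port, pid = int(m.group(1)), int(m.group(2))
            result[pid].append(local_port)
    return dict(result)

print(ports_by_pid(SAMPLE))  # {3548: [2100], 3868: [2101, 2102]}
```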
The TCP ports opened are otherwise random; when the fixed-sockets variable is set, they are allocated as close to (and above) 2100 as possible.
It is ideal to have Fixed Sockets and Fast Channel enabled.
SDM Reconnection Attempt Logic:
When two Service Desk processes communicate with each other over a fast-channel connection (without slump involvement) and the connection between them drops, the process that first reports the event will NOT shut down; instead it attempts to reconnect. Reconnection happens quickly, within a few seconds, and there is no code telling the process to wait. Reconnection attempts continue indefinitely.
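The fast-channel reconnect behavior described above amounts to an unbounded retry loop with a short pause between attempts. A minimal sketch, assuming a simple TCP peer (function name and delay are illustrative, not SDM code):

```python
import socket
import time

def reconnect_forever(host: str, port: int, delay: float = 2.0) -> socket.socket:
    """Keep retrying a fast-channel peer until it answers.

    Mirrors the behavior described above: the process reporting the
    drop does not shut down, retries within a few seconds, and never
    gives up.
    """
    while True:  # reconnection attempts are indefinite
        try:
            return socket.create_connection((host, port), timeout=3)
        except OSError:
            time.sleep(delay)  # brief pause, then try again
```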
When a Service Desk process needs the Slump process for message routing and the connection drops, the process that first reports the event WILL shut down. In this case the perception is that Slump on the primary is unreachable. This is considered a severe event, so our processes are asked to shut down. Slump is our communication manager, and any perception that it is unavailable warrants shutting processes down and entering a hibernation state. This allows the IT team to investigate the status of the primary and take manual steps if automatic recovery does not take place. Multiple processes on a secondary server will eventually report the same event, and all main daemons are told to shut down. The Proctor on the secondary will in most cases also hit the problem, restart, and then hibernate while waiting for its connection with slump to be re-established. The Proctor process is the only process that hibernates on a secondary server; it retries indefinitely to reach slump until it connects. Once the Proctor is communicating with slump, the daemon manager on the primary notices the processes that are not running on the secondary and requests a restart. Those processes then connect to slump. If startup fails for some reason, the process is restarted up to ten times before the daemon manager gives up.
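The restart cap described above can be sketched as a bounded supervision loop. This is a hypothetical illustration of the behavior, not SDM source code; `start_process` stands in for "start the daemon and have it log on to slump," returning True on success.

```python
MAX_RESTARTS = 10  # matches the restart cap described above

def supervise(start_process, max_restarts: int = MAX_RESTARTS) -> bool:
    """Restart a failing process up to max_restarts times, then give up."""
    for attempt in range(1, max_restarts + 1):
        if start_process():
            return True  # process came up and connected to slump
    # After ten failures the real product logs "Max restarts attempted"
    # and waits for an operator to run pdm_d_refresh.
    return False

# Example: a process that only comes up on its third start attempt.
tries = {"n": 0}
def flaky_start() -> bool:
    tries["n"] += 1
    return tries["n"] >= 3

print(supervise(flaky_start))  # True
```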
How a disconnection to slump is captured by SDM:
The processes determine that the connection to the slump server is unavailable when they attempt to read or write data on the socket. A process usually registers a callback function with the slump layer to be notified if the connection breaks. This function is invoked when a disconnection occurs so that the process can take action. For example, the webengine terminates as a result of the disconnection once the callback function is invoked.
The socket layer (OS network layer) usually returns a socket error code (for example, WSAECONNRESET (10054)) to the process's read/write calls when it attempts to read or write data on an already established socket connection. The Service Desk process logs this error in the standard logs:
09/22 14:28:26.47 SERVER domsrvr:21 3868 INFORMATION socket_port.c 1582 Error: WSAECONNRESET (10054) reading from SOCKET_PORT(0x024F9958) description = TCP/IP port_name = Slump Port status = 0 ip address = 220.127.116.11 compression = 1 extra_flags = 0 file descriptor = 288 write pending = 0 handler type = DATA read count = 2434167 write count = 1683967 socket = 0
Here is the description of the error code:
Connection reset by peer.
An existing connection was forcibly closed by the remote host. This normally results if the peer application on the remote host is suddenly stopped, the host is rebooted, the host or remote network interface is disabled, or the remote host uses a hard close (see setsockopt for more information on the SO_LINGER option on the remote socket). This error may also result if a connection was broken due to keep-alive activity detecting a failure while one or more operations are in progress. Operations that were in progress fail with WSAENETRESET. Subsequent operations fail with WSAECONNRESET.
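The condition in the log can be reproduced locally. WSAECONNRESET (10054) is the Windows name; on POSIX systems the same condition surfaces as ECONNRESET (Python's ConnectionResetError). The sketch below is a demonstration of the error path, not SDM code: one end performs a hard close (SO_LINGER with a zero timeout, which sends a TCP RST rather than a FIN), and the peer's next read fails, which is the point where a process like the webengine would invoke its registered callback.

```python
import socket
import struct

def demo_reset() -> bool:
    """Trigger a connection-reset error locally and report whether the
    read failed, mirroring WSAECONNRESET (10054) / ECONNRESET."""
    lst = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    lst.bind(("127.0.0.1", 0))
    lst.listen(1)
    client = socket.create_connection(lst.getsockname())
    server, _ = lst.accept()
    # Hard close: linger on, timeout 0 -> TCP RST instead of FIN.
    server.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                      struct.pack("ii", 1, 0))
    server.close()
    try:
        client.recv(16)   # read on the reset connection
        return False
    except ConnectionResetError:
        return True       # real code invokes its callback here
    finally:
        client.close()
        lst.close()

print(demo_reset())
```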
"BPServer:: init: couldn't logon to slump!" messages:
If multiple attempts to connect to the Slump process on the primary server fail, the message above is printed to the stdlogs on the secondary server.
In this scenario, the webengine on the secondary loses its connection to the Slump process on the primary. As a result, a slump connection error message is reported in the secondary's SDM log file.
Message: "EXIT webengine.c 1152 Slump died detected"
In the case below, the webengine restarted 10 times, failing each time with the "BPServer::init: couldn't logon to slump!" error message.
10/07 00:16:56.98 SERVER pdm_d_mgr 3548 ERROR daemon_obj.c 1781 Max restarts attempted for _web_eng_SERVER2 You may reset the count by running pdm_d_refresh from the command line.
Slump itself was up and accepting connections, as the other SDM processes were working fine.
Support has seen that the most common causes of this are VMware snapshots and backups. If not configured correctly, the backups/snapshots may consume all of the network bandwidth and cause connection drops for other applications.
After running pdm_d_refresh, the processes should restart if connectivity is re-established:
10/07 00:23:10.55 SERVER1 proctor_SERVER2 4888 SIGNIFICANT pdm_process.c 545 Process Started (4960):D:/PROGRA~1/CA/SERVIC~1/bin/webengine -q -d domsrvr:21 -S web:SERVER2:1 -c D:/PROGRA~1/CA/SERVIC~1/bopcfg/www/SERVER2- web1.cfg -r rpc_srvr:SERVER2
If connectivity still could not be established, then we would see messages like the following:
10/07 00:16:46.11 SERVER2 proctor_SERVER2 4888 SIGNIFICANT pdm_process.c 545 Process Started (3636):D:/PROGRA~1/CA/SERVIC~1/bin/webengine -q -d domsrvr:21 -S web: SERVER2:1 -c D:/PROGRA~1/CA/SERVIC~1/bopcfg/www/SDCBCBSVMSS02-web1.cfg -r rpc_srvr:SERVER2 10/07 00:16:46.57 SERVER2 web-engine 3636 EXIT bpserver.c 246 BPServer::init: couldn't logon to slump!
10/07 00:16:54.69 SERVER2 proctor_SERVER2 4888 SIGNIFICANT pdm_process.c 545 Process Started (3380):D:/PROGRA~1/CA/SERVIC~1/bin/webengine -q -d domsrvr:21 -S web:SERVER2 :1 -c D:/PROGRA~1/CA/SERVIC~1/bopcfg/www/SERVER2-web1.cfg -r rpc_srvr:SERVER2
10/07 00:16:55.15 SERVER2 web-engine 3380 EXIT bpserver.c 246 BPServer::init: couldn't logon to slump!
In the example above, 10 restart/connection attempts failed within 10 seconds (the duration depends on how fast the servers are). There is no code that makes a starting process wait to establish a connection: the process is starting and needs the connection to proceed, and it has no knowledge of the earlier disconnection error. That is why the restart mechanism retries up to ten times.
Troubleshooting and Tools used to troubleshoot network disconnections between Primary and Secondary servers:
Note: Service Desk 12.9 has a new feature called "Advanced Availability" which replaces the Primary / Secondary Server architecture and provides for much greater availability.