We are experiencing intermittent AGENTDOWN status messages for Apps on multiple ESP agents at the same time from our ESP server.
We found multiple "conversation" [manager_ipaddress:ephemeral_port]->[agent_ipaddress:7520] "has been closed by partner" messages in the management server's errors.txt file.
Release : 12.3
Component : ESP dSeries Workload Automation DE
This can occur when there are intermittent network problems between the DE Server and the agent. When such problems exist, the agent closes the connection (via a TCP reset, RST) after waiting for 10 seconds.
There are 2 options for fixing these problems.
"has been closed by the partner" messages that were previously causing AgentDown notifications:
20220513 07:00:13.915 [dm:communications:exception] [ERROR] DM.OutputMessageQueue.<AgentName>: [2022-05-13_07:00:13.915] Exception caught sending to <AgentName>: The conversation <manager_ipaddress>:49280-><agent_ipaddress>:8520 has been closed by the partner. - [Originator:ESP_UAT_12] [Destination: <AgentName>] [ProcessingStatus: 0] [Priority: 0] [AFM: 20220513 07000061+0400 <AgentName><ManagerName> <Path/To/AppName.GenerationID>/<JobName>_<AgentName> RUN . Data(Command=cmd.exe) User(<username>) Password(<pass>) TargetSubsystem(WIN) MFUser(SCHEDMASTER) ] [modified: false] [originatorModified: false] [destinationModified: false] [processingStatusModified: false] [priorityModified: false] [afmModified: false] [removed: false] [created: true] [agentDownNotified: false] [shutdownMessage: false] [controlManagerMessage: false] [MessageQueueTable: ESP_MANAGER_OUTQ] [MsgQueueTableKey: 6672592, 1652439600620] [TableDaoKey: 6672592] number of messages remaining: 0
com.ca.wa.comp.library.communications.WAConversationCloseByPartnerException: The conversation <manager_ipaddress>:49280-><agent_ipaddress>:8520 has been closed by the partner.
at com.ca.wa.comp.library.communications.WAConversation.receivePrefix(WAConversation.java:582)
at com.ca.wa.comp.library.communications.WAConversation.receiveBinary(WAConversation.java:452)
at com.ca.wa.comp.library.communications.WAConversation.receiveText(WAConversation.java:656)
at com.ca.wa.comp.distributedmanager.communications.WADistributedManagerOutputMessageQueue.receiveAck(WADistributedManagerOutputMessageQueue.java:155)
at com.ca.wa.comp.library.communications.OutputMessageQueue.sendMessages(OutputMessageQueue.java:372)
at com.ca.wa.comp.distributedmanager.communications.WADistributedManagerOutputMessageQueue.sendMessages(WADistributedManagerOutputMessageQueue.java:109)
at com.ca.wa.comp.library.communications.OutputMessageQueue.run(OutputMessageQueue.java:171)
at com.ca.wa.core.library.concurrent.WAThreadPool$ThreadPoolThread.run(WAThreadPool.java:698)
Caused by: java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at com.ca.wa.comp.library.communications.WAConversation.receivePrefix(WAConversation.java:552)
... 7 more
The "handling" message (there is no "deactivating" message or Status update indicating that an AGENTDOWN was triggered):
20220513 07:00:13.917 [essential] [INFO] DM.OutputMessageQueue.<AgentName>: [2022-05-13_07:00:13.917] Though AFM communication failed for agent '<AgentName>' subsequent Ping succeeded so assuming that agent is busy but not responding, so don't send AGENTDOWN message as of now (isFailureCountExceeded=false, MAX_FAILCOUNT_THRESHOLD=5, isUnrecoverableException=false)
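To see which agents are affected and how often, the server log can be tallied by agent name. The following is a minimal sketch rather than a supported tool: it assumes the "Exception caught sending to ..." wording shown in the excerpt above, and the function name `count_closed_by_partner` is hypothetical.

```python
import re
from collections import Counter

# Matches the server-side error text shown above. The pattern is an
# assumption based on that single excerpt; adjust it if your log differs.
CLOSED_RE = re.compile(
    r"Exception caught sending to (?P<agent>\S+): "
    r"The conversation (?P<src>\S+)->(?P<dst>\S+) has been closed by the partner"
)


def count_closed_by_partner(lines):
    """Return a Counter of 'closed by the partner' errors keyed by agent name."""
    counts = Counter()
    for line in lines:
        match = CLOSED_RE.search(line)
        if match:
            counts[match.group("agent")] += 1
    return counts
```

It can be run against the log with `count_closed_by_partner(open("errors.txt", errors="replace"))`, where the file path is whatever location your DE Server writes its error log to. The stack-trace line that repeats the message is not counted twice, because it lacks the "Exception caught sending to" prefix.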
The server-side messages are often accompanied by entries in the agent's receiver.log showing that a conversation was established and that a "Read timed out" exception then occurred. Example (the timestamps and ports below do not match the server excerpt above, but they do match when logs from the same incident are compared):
03/21/2022 00:00:03.849-0400 2 TCP/IP Controller Plugin.Receiver pool thread <Regular:2>.CybReceiverChannel.a[:158] - Conversation from <manager_ipaddress>:51738 to <agent_ipaddress>:7520 arrived
03/21/2022 00:00:17.350-0400 1 TCP/IP Controller Plugin.Receiver pool thread <Regular:2>.CybReceiverChannel.a[:234] - cybermation.library.communications.CybConversationTimeoutException: Read timed out
at cybermation.library.communications.protocol.CybCommunicationProtocolDynamic.receiveData(CybCommunicationProtocolDynamic.java:749)
at cybermation.library.communications.protocol.CybCommunicationProtocolDynamic.receiveMessage(CybCommunicationProtocolDynamic.java:365)
at cybermation.library.communications.CybConversation.receiveMessage(CybConversation.java:460)
at cybermation.commplugins.tcpip.receiver.CybReceiverChannel.a(CybReceiverChannel.java:174)
at cybermation.commplugins.tcpip.receiver.CybReceiverChannel.call(CybReceiverChannel.java:139)
at cybermation.commplugins.tcpip.receiver.CybReceiverChannel.call(CybReceiverChannel.java:51)
at cybermation.commplugins.tcpip.receiver.CybReceiverScheduler$CallableWrapper.call(CybReceiverScheduler$CallableWrapper.java:353)
at cybermation.commplugins.tcpip.receiver.CybReceiverScheduler$CallableWrapper.call(CybReceiverScheduler$CallableWrapper.java:317)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:821)
Caused by: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Unknown Source)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at cybermation.library.communications.protocol.CybCommunicationProtocol.receiveLength(CybCommunicationProtocol.java:414)
at cybermation.library.communications.protocol.CybCommunicationProtocolDynamic.receiveData(CybCommunicationProtocolDynamic.java:514)
at cybermation.library.communications.protocol.CybCommunicationProtocolDynamic.receiveMessage(CybCommunicationProtocolDynamic.java:365)
at cybermation.library.communications.CybConversation.receiveMessage(CybConversation.java:460)
at cybermation.commplugins.tcpip.receiver.CybReceiverChannel.a(CybReceiverChannel.java:174)
at cybermation.commplugins.tcpip.receiver.CybReceiverChannel.call(CybReceiverChannel.java:139)
at cybermation.commplugins.tcpip.receiver.CybReceiverChannel.call(CybReceiverChannel.java:51)
at cybermation.commplugins.tcpip.receiver.CybReceiverScheduler$CallableWrapper.call(CybReceiverScheduler$CallableWrapper.java:353)
at cybermation.commplugins.tcpip.receiver.CybReceiverScheduler$CallableWrapper.call(CybReceiverScheduler$CallableWrapper.java:317)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:821)
03/21/2022 00:00:17.350-0400 2 TCP/IP Controller Plugin.Receiver pool thread <Regular:2>.CybReceiverChannel.a[:253] - Exiting conversation
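To find these arrival/timeout pairs in a large receiver.log and match them against the server-side errors, the entries can be paired per receiver thread. This is a sketch under stated assumptions: the line layout is inferred from the excerpt above only, and the names `read_timeouts`, `ENTRY_RE`, and `ARRIVED_RE` are hypothetical.

```python
import re
from datetime import datetime

# Timestamped receiver.log entry, e.g.
# "03/21/2022 00:00:03.849-0400 2 TCP/IP Controller Plugin.Receiver pool
#  thread <Regular:2>.CybReceiverChannel.a[:158] - <message>"
ENTRY_RE = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}\.\d{3}[+-]\d{4}) "
    r".*?thread <(?P<thread>[^>]+)>\S* - (?P<msg>.*)$"
)
ARRIVED_RE = re.compile(r"Conversation from (\S+) to (\S+) arrived")


def parse_ts(text):
    """Parse a receiver.log timestamp such as '03/21/2022 00:00:03.849-0400'."""
    return datetime.strptime(text, "%m/%d/%Y %H:%M:%S.%f%z")


def read_timeouts(lines):
    """Return (source, destination, seconds) for each conversation that
    arrived on a thread and then hit 'Read timed out' on the same thread."""
    pending = {}  # thread name -> (arrival time, source, destination)
    results = []
    for line in lines:
        entry = ENTRY_RE.match(line)
        if not entry:
            continue  # stack-trace lines carry no timestamp
        thread, msg = entry.group("thread"), entry.group("msg")
        when = parse_ts(entry.group("ts"))
        arrived = ARRIVED_RE.search(msg)
        if arrived:
            pending[thread] = (when, arrived.group(1), arrived.group(2))
        elif "Read timed out" in msg and thread in pending:
            start, src, dst = pending.pop(thread)
            results.append((src, dst, (when - start).total_seconds()))
        elif "Exiting conversation" in msg:
            pending.pop(thread, None)  # conversation ended without a timeout
    return results
```

Each returned source address and port can then be looked up in the server's error log to confirm that both sides are describing the same conversation. For the excerpt above, the arrival and timeout entries are 13.501 seconds apart.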