Troubleshooting ESP and Agent Communication problems

Products

ESP Workload Automation

Issue/Introduction

When you run an ESPCOM command, you get the agent name and status. Is there a document that lists the probable causes for "SEND ERR", "CONN ERR", or any other possible messages? I searched the documentation and did not find anything.

Environment

Release:
Component: ESPWA

Resolution

Below is useful information to gather when trying to debug a communication problem.
Agent communication problem
- Identify the problem area or stage:
- If it shows “Agent notified”, ESP can talk to the agent without a problem:
1. Check transmitter.log. If it contains an exception error, the agent has the wrong manager communication settings.
2. Check defaultlog_agent.log and runner_os_component.log (with log.level=5 in agentparm) for errors. The normal pattern for a submitted job in runner_os_component.log is:
Received <20101012 14335617+0400 TEST JavaAgent#tcpip@XXX_MANAGER XXXX/AGENT.XXX/MAIN
Preparing job XXXX/AGENT21.XXX/MAIN
Job XXXX/AGENT21./MAIN starting
Transmitter: Sending AFM: . . JavaAgent#tcpip@XXX_MANAGER OSCOMPONENT XXXX/AGENT21.XXX/MAIN State EXEC SetStart Status(Executing at LUCY) Jobno(6084)
Job ZYBLZ/AGENT21.285/MAIN has finished
Transmitter: Sending AFM: . . JavaAgent#tcpip@XXX_MANAGER OSCOMPONENT XXXX/AGENT21.XXX/MAIN State State SUBERROR Cmpc(81011) SetEnd Status

If the lines with “Transmitter: Sending AFM” do not show up, then for a UNIX agent the problem is possibly related to the IPCS queue. Use “ipcs -q” and “ipcrm -q queue_id” (the related queue_id can be found in runner_os_component.log) to clear the current queue. oscomponent.msgqueue may also need to be defined explicitly instead of using the default.
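The check for missing “Transmitter: Sending AFM” lines can be scripted. The following is a minimal sketch, not a supported tool: the function name jobs_missing_afm is hypothetical, and the two patterns are derived only from the sample lines above, so real logs (which may carry extra fields such as timestamps) may need adjusted patterns.

```python
import re

# Assumption: these patterns mirror the sample log lines shown above.
JOB_STARTING = re.compile(r"Job (\S+) starting")
AFM_SENT = re.compile(r"Transmitter: Sending AFM: .*? OSCOMPONENT (\S+) State")


def jobs_missing_afm(log_lines):
    """Return jobs that logged 'starting' but never had an AFM sent for them."""
    started = []      # job names in the order they were first seen starting
    reported = set()  # job names that appear in at least one AFM line
    for line in log_lines:
        match = JOB_STARTING.search(line)
        if match:
            if match.group(1) not in started:
                started.append(match.group(1))
            continue
        match = AFM_SENT.search(line)
        if match:
            reported.add(match.group(1))
    return [job for job in started if job not in reported]
```

To use it, read runner_os_component.log into a list of lines and pass the list to jobs_missing_afm. A non-empty result points at jobs worth checking against the IPCS queue.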
- If it shows “Connection error” or “transmitter Busy”, ESP cannot get proper feedback from the agent. Possible causes:
1. Agent connection parameters (such as IP address, DNS name, or port) are defined incorrectly in AGENTDEF.
2. Agent parameters such as encryption or ASCII/EBCDIC do not match the definitions in AGENTDEF. In this case there are error messages in the ESP JESMSGLG and/or the agent receiver.log/defaultlog_agent.log. For an R7 AS400 agent, use ASCII; EBCDIC is used for Version 2.
3. Messages are still queued with the previous, incorrect settings. Issue OPER ESPCOM DEST(agent_name) FLUSH to clear the queue.
4. The agent is NOT started, and/or the agent input port is not in Listening status. Check the agent process and its input port.
5. Check the ESP JESMSGLG. If the related messages have W at the end (like 1546W), they are warning messages, mostly caused by TIMEOUT. If they have E at the end (like 1545E), they are definitely errors.
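The W/E distinction in step 5 can be applied to a saved copy of the JESMSGLG with a short script. This is a sketch under assumptions: the function name classify_messages is hypothetical, and message IDs are assumed to be four digits followed by W or E, as in the two examples above. Other ID formats would need a different pattern.

```python
import re

# Assumption: IDs look like 1546W or 1545E (four digits plus a severity letter),
# possibly preceded by a non-digit prefix.
MESSAGE_ID = re.compile(r"(?<!\d)(\d{4})([WE])\b")


def classify_messages(text):
    """Group message IDs found in text into warnings (W) and errors (E)."""
    found = {"warning": [], "error": []}
    for number, suffix in MESSAGE_ID.findall(text):
        key = "warning" if suffix == "W" else "error"
        found[key].append(number + suffix)
    return found
```

Warnings alone suggest looking at timeouts first; any entry under "error" should be investigated as a real failure.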
- If the problem occurs AFTER the Shadow Manager takes over and the MGRADDR command is issued:
1. For an AS400 agent, manually change the CFGTCP table: remove the old pair of manager name and manager IP address, and add the new pair of manager name and manager IP address.
2. An agent with a NATted MANAGER IP address cannot send messages to ESP until it is recycled. Use a DNS name on the MANAGER statement in the AGENTDEF table, combined with IF logic. For example:
IF SYSNAME='SYSA' THEN DO
MANAGER NAME(ESPPROD) TCPIP(SYSA.xxxx.COM)
ENDDO
ELSE IF SYSNAME='SYSB' THEN DO
MANAGER NAME(ESPPROD) TCPIP(SYSB.xxxx.COM)
ENDDO
ELSE IF SYSNAME='SYSC' THEN DO
MANAGER NAME(ESPPROD) TCPIP(SYSC.xxxx.COM)
ENDDO
The MGRADDR command then sends the DNS name, merely a string like SYS?.xxxx.COM, and it is up to the agent side to resolve it to a numeric address. This requires a DNS server and/or an appropriate HOST file on the agent side.
- Possible things to check if it is a TCPIP issue, once the configuration in AGENTDEF and agentparm.txt is confirmed to be consistent:
1. If a firewall is in between: the sending ports on both the agent and ESP must be open, and they can be any available ports; only the receiving port is fixed.
2. If a DNS name is used, try the IP address instead and see if it works. If it does, the problem is related to the DNS server and/or the /etc/hosts table on the UNIX box. If “TSO ping host_name” from ISPF and “INET Q H host_name” return different IP addresses, add //SYSTCPD to the ESP STC to point to the proper TCPIP settings.
3. When a DNS name is used for the agent address and warning messages (like 1546W and/or 2214W) show before requests are sent successfully from ESP, decrease RESOLVERTIMEOUT in TCPIP.DATA (the one specified on //SYSTCPD in the ESP STC) from the default 30 seconds to 3. RESOLVERTIMEOUT specifies how long the resolver waits for a response while trying to communicate with a name server. A DNS server is specified by the NSINTERADDR statement in TCPIP.DATA, and more than one DNS server can be defined there. If a DNS server on that list is unreachable, the resolver waits for the time specified by RESOLVERTIMEOUT and then moves to the next one. This can result in visible delays.
4. If a NATted or virtual IP address is used on the agent and/or ESP side, try without it and see if it works.
5. If the agent resides on the INACTIVE node of a cluster server, it does not have full communication functionality. When a cluster is used, AGENT MONITOR should be set only for the agent on the ACTIVE node, which is easily done by pointing to the virtual agent only.
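On the agent side, or on any box with Python available, the DNS-versus-IP comparison from step 2 and the resolver delay from step 3 can be observed with a short script. This is a sketch only: the function name check_name_resolution and its parameters are hypothetical, and it uses the local resolver of the machine it runs on, not the z/OS resolver that ESP uses.

```python
import socket
import time


def check_name_resolution(host_name, expected_ip, resolver=socket.gethostbyname_ex):
    """Resolve host_name, time the lookup, and compare the result with expected_ip."""
    started = time.monotonic()
    try:
        # gethostbyname_ex returns (hostname, alias_list, ip_address_list)
        _, _, addresses = resolver(host_name)
        error = None
    except OSError as exc:  # socket.gaierror and socket.herror are OSError subclasses
        addresses = []
        error = str(exc)
    return {
        "resolved": list(addresses),
        "matches": expected_ip in addresses,
        "seconds": time.monotonic() - started,
        "error": error,
    }
```

A result with "matches" set to False, or a "seconds" value close to the resolver timeout, points to the DNS server or hosts table and not to ESP or the agent itself. The resolver parameter exists only so the function can be exercised without a live DNS server.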
- As a general approach, gather the following (when nothing above resolves the problem):
   - Change log.level=8 dynamically and collect the agent logs with this value (the IP address and port of ESP and the agent show in receiver.log and transmitter.log);
   - Screenshots of the successful pings (and telnet, if it is allowed) from both environments;
   - ESP auditlog and JESMSGLG;
   - AGENTDEF table;
   - INET trace (optional);
   - TCPIP trace (like an ethereal trace) from the agent server (optional).
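Where telnet is not allowed, a plain TCP connect gives roughly the same evidence that the receiving port is reachable. The sketch below assumes Python is available on the testing machine; the function name port_is_listening is hypothetical, and the host and port must be taken from AGENTDEF or agentparm.txt.

```python
import socket


def port_is_listening(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and resolution failures
        return False
```

Run it from both environments, as with ping: from the manager side against the agent input port, and from the agent side against the manager port. A False result does not say whether the cause is a firewall, a stopped process, or a wrong address, so it complements the logs and traces listed above and does not replace them.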