Error Network request [O] returns -1 hostname resolution method not supported

Products

CA Automic Dollar Universe

Issue/Introduction

In the universe.log of the reference node, the following errors appear every hour, and it is not known which command or script causes them:

| 2023-01-09 08:10:01 |ERROR|X|IO |pid=p.t1| k_trt_req_network         | Network request [O] returns -1 [] error code [1] error msg [hostname resolution method not supported [ ]]
| 2023-01-09 08:10:01 |ERROR|X|cmd|pid=p.t2| o_io_api_out_bridge       | error decoding response
| 2023-01-09 08:10:01 |ERROR|X|cmd|pid=p.t2| owls_connect_auth         | o_io_api_out_bridge returns error [-1]
| 2023-01-09 08:10:01 |ERROR|X|cmd|pid=p.t2| o_callsrv_connect_r       | Connection error 0 []
| 2023-01-09 08:10:01 |ERROR|X|cmd|pid=p.t2| owls_cmd_return           | Can not connect to server. Error!

How can we find what script/command generates the errors and fix it?

Environment

Release : 6.x or 7.x

Component: Dollar Universe

Cause

Investigation:

To find more information about the error, start by raising the Main Log Level of the node to level 3 shortly before the time the errors are expected to appear.
This can be done in node settings - logging - main log level: 3
Then Save and Close, and empty the universe.log so that only the relevant messages are present when the errors appear again.
In this particular case, the additional traces revealed the following clues in the universe.log:
| 2023-01-11 12:10:01 |INFO |X|IO |pid=p.t2| kTrtHelloRequest | IOHELLO - hello request is [H2MOBuxcmd]
| 2023-01-11 12:10:01 |INFO |X|IO |pid=p.t2| kTrtHelloRequest | IOHELLO - from request - user is [root]
| 2023-01-13 12:10:01 |INFO |X|IO |pid=p.t| o_io_cache_get | Entry (8/NODENAME) not found: using provider
| 2023-01-13 12:10:01 |ERROR|X|IO |pid=p.t| o_io_cache_get | Object not found
| 2023-01-13 12:10:01 |ERROR|X|IO |pid=p.t| k_trt_req_network | Network request [O] returns -1 [] error code [1] error msg [hostname resolution method not supported [ ]]
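
On a busy node the level 3 log can be large, so it may help to filter out just the node names that the IO server failed to find. The following is a minimal sketch, assuming the log lines have the format shown above; the function name missing_nodes is hypothetical and not part of Dollar Universe.

```shell
#!/bin/bash
# Sketch: list the node names that o_io_cache_get could not find.
# Assumes level 3 lines of the form:
#   ... o_io_cache_get | Entry (8/NODENAME) not found: using provider
# missing_nodes is a hypothetical helper; it reads log lines from stdin.
missing_nodes() {
    # Keep only the text between "/" and ")" on matching lines,
    # then reduce the result to one line per distinct node name.
    sed -n 's#.*o_io_cache_get.*Entry ([0-9]*/\([^)]*\)) not found.*#\1#p' | sort -u
}
```

It could be called as `missing_nodes < universe.log`, where the path to universe.log depends on the installation. Any name printed that is no longer in the list of nodes is a candidate for the stale node= argument.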

NODENAME was the name of a node that had recently been decommissioned and was no longer in the list of nodes. This indicated that the errors were caused by a command launched on this node with the argument node=NODENAME, but nothing matching could be found in the Uproc scripts.
The string uxcmd in the hello request indicated a Dollar Universe command line (ux*), and user is [root] showed that the command was being launched as root.
We then stopped the Launcher of all the Areas of the node at the time the errors were expected. The error messages continued to appear, which meant that the command was being launched by another scheduler or application, such as the crontab.
By looking at the crontab of the user in question (root), we finally found the script that was being launched at the time of the issue. The command for this particular case was:
```
/duas_folder/bin/uxlst evt node=NODENAME MU=NODENAME_MU exp upr=UPROCNAME
```
This was actually an old supervision script that had been created to monitor the Launcher of other Dollar Universe nodes, to check if this technical Uproc was being launched every hour.
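When the crontab calls wrapper scripts rather than the ux* command directly, the search for the stale node name can be extended to those scripts. A minimal sketch, assuming GNU grep; the function name find_node_refs and the paths in the usage note are hypothetical.

```shell
#!/bin/bash
# Sketch: find files that still pass a given node name to a command.
# find_node_refs is a hypothetical helper.
# Arguments: the node name, then one or more files or directories to search.
find_node_refs() {
    local node=$1
    shift
    # -r recurse into directories, -H always print the file name,
    # -n print line numbers, -i match node= and NODE= alike,
    # -F treat the pattern as a literal string.
    # Note: node=NODENAME also matches longer names such as node=NODENAME2.
    grep -rHniF -- "node=${node}" "$@"
}
```

For example, the crontab of root could be dumped with `crontab -l -u root > /tmp/root.cron` and then searched together with a script folder: `find_node_refs NODENAME /tmp/root.cron /path/to/scripts`.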

Resolution

The command had to be removed from the script launched by the crontab of the user root, as the target node no longer existed.

Additional Information

Some other useful commands that were used for troubleshooting this case in Linux were:

1. A tcpdump capture on the port of the IO server of the area throwing the error (10600):
tcpdump -i any -nn -A tcp port 10600 -s0 -w captureio.pcap

This is to be launched just before the error messages are expected. The file captureio.pcap can then be opened in Wireshark to identify the source IP.
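
If Wireshark is not at hand, the capture can also be summarised on the command line. This is an alternative to the Wireshark step, not what was done in this case. A minimal sketch, assuming IPv4 traffic and the usual tcpdump text layout ("IP source.port > destination.port:"), which can vary between tcpdump versions; the function name top_sources is hypothetical.

```shell
#!/bin/bash
# Sketch: count packets per source IPv4 address sent to a given port.
# top_sources is a hypothetical helper. It reads tcpdump text output
# from stdin, for example from: tcpdump -nn -r captureio.pcap
# Argument: the destination port to look for (10600 in this article).
top_sources() {
    awk -v port="$1" '{
        for (i = 1; i < NF; i++) {
            if ($i == "IP") {
                src = $(i + 1)   # source.port
                dst = $(i + 3)   # destination.port:
                if (dst ~ ("\\." port ":?$")) {
                    sub(/\.[0-9]+$/, "", src)   # drop the trailing ".port"
                    count[src]++
                }
                break
            }
        }
    }
    END { for (s in count) print count[s], s }' | sort -rn
}
```

It could be used as `tcpdump -nn -r captureio.pcap | top_sources 10600`. A source address that belongs to the node itself points to a local process, such as a crontab entry.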

2. This was not needed, but could be helpful.

A script named script.sh that runs the commands "ps aux | grep root" and "netstat -nap | grep 10600" in a loop and appends their output to two files. Launched shortly before the occurrence, it can capture the command being launched and the parent script that launches it.

1. vi script.sh

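#!/bin/bash
# Sample ps and netstat in a loop for MAX_SECONDS seconds.
declare -ir MAX_SECONDS=30
# SECONDS is a bash built-in that counts seconds since the shell started.
declare -ir TIMEOUT=$SECONDS+$MAX_SECONDS

while (( $SECONDS < $TIMEOUT )); do
    # Timestamp each sample so it can be matched against universe.log.
    date >> ps.txt
    ps aux | grep root >> ps.txt
    date >> netstat.txt
    netstat -nap | grep 10600 >> netstat.txt
done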

Then save and close with :wq

2. Give execution permissions:

chmod a+x script.sh

3. Wait until about 20s before the occurrence of the errors and launch the script

./script.sh

4. Collect the two generated files, netstat.txt and ps.txt

5. These files should contain the command that was being launched. In my test case it was a "uxlst fnc" command, see below:

[root@hostname TST600_hostname ]# grep uxlst netstat.txt
[root@hostname TST600_hostname ]# grep uxlst ps.txt
root 641 0.0 0.0 181720 7384 pts/1 S+ 15:45 0:00 /apps du 600 TST600_hostname bin uxlst fnc
root 1911 0.0 0.0 177312 3996 pts/1 R+ 15:46 0:00 /apps/du/600/TST600_hostname/bin/uxlst FNC