In the universe.log of the reference node, the following errors appear every hour, and it is not clear which command or script causes them:
| 2023-01-09 08:10:01 |ERROR|X|IO |pid=p.t1| k_trt_req_network | Network request [O] returns -1 [] error code [1] error msg [hostname resolution method not supported [ ]]
| 2023-01-09 08:10:01 |ERROR|X|cmd|pid=p.t2| o_io_api_out_bridge | error decoding response
| 2023-01-09 08:10:01 |ERROR|X|cmd|pid=p.t2| owls_connect_auth | o_io_api_out_bridge returns error [-1]
| 2023-01-09 08:10:01 |ERROR|X|cmd|pid=p.t2| o_callsrv_connect_r | Connection error 0 []
| 2023-01-09 08:10:01 |ERROR|X|cmd|pid=p.t2| owls_cmd_return | Can not connect to server. Error!
How can we find what script/command generates the errors and fix it?
Release: 6.x or 7.x
Component: Dollar Universe
The command generating the errors was:
/duas_folder/bin/uxlst evt node=NODENAME MU=NODENAME_MU exp upr=UPROCNAME
It was part of an old supervision script that had been created to monitor the Launcher of other Dollar Universe nodes, by checking whether this technical Uproc was being launched every hour.
Because the target node no longer existed, the command had to be removed from the associated script, which was launched from the crontab of the user root.
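When the offending command sits inside a script rather than directly in the crontab, one way to locate it is to scan every script the crontab schedules. The following is only a sketch: the function name scripts_mentioning is hypothetical, and it assumes a user crontab with five time fields where each entry's command starts with the script path (entries such as "@reboot" or "/bin/sh script.sh" are not handled).

```shell
#!/bin/bash
# Hypothetical helper: read crontab text on stdin and print each scheduled
# script file that contains the given string (for example "uxlst").
scripts_mentioning() {
    local needle="$1" line script
    local skip_re='^[[:space:]]*(#|$)'                        # comments, blank lines
    local assign_re='^[[:space:]]*[A-Za-z_][A-Za-z0-9_]*='    # e.g. PATH=...
    while IFS= read -r line; do
        [[ $line =~ $skip_re ]] && continue
        [[ $line =~ $assign_re ]] && continue
        # Assumption: five time fields, so field 6 is the start of the command.
        script=$(awk '{ print $6 }' <<< "$line")
        # Only inspect entries whose first command word is an existing file.
        [[ -f "$script" ]] && grep -l -F -- "$needle" "$script"
    done
    return 0
}
```

As root, it could be fed with the crontab of the user in question, for example: crontab -l -u root | scripts_mentioning uxlst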
Some other useful commands that were used for troubleshooting this case in Linux were:
1. A tcpdump capture on the port of the IO server of the area throwing the error (10600):
tcpdump -i any -nn -A tcp port 10600 -s0 -w captureio.pcap
Launch it just before the error messages are expected; the file captureio.pcap can then be opened in Wireshark to identify the source IP.
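If Wireshark is not at hand, the capture can also be summarized on the command line. The sketch below counts packets per source address from the text output of "tcpdump -nn -r". The function name count_sources is hypothetical, and it assumes IPv4 output lines of the usual form "... IP src.port > dst.port: ...".

```shell
#!/bin/bash
# Hypothetical helper: read tcpdump text output on stdin and print
# "<packet count> <source IP>" sorted by count, highest first.
count_sources() {
    awk '{
        for (i = 1; i < NF; i++) {
            if ($i == "IP") {               # the source follows the "IP" token
                src = $(i + 1)
                sub(/\.[0-9]+$/, "", src)   # drop the trailing ".port"
                count[src]++
                break
            }
        }
    }
    END { for (s in count) print count[s], s }' | sort -rn
}
```

A possible invocation on the file produced above: tcpdump -nn -r captureio.pcap 'tcp dst port 10600' | count_sources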
2. This was not needed, but could be helpful: a script named script.sh that runs the commands "ps aux | grep root" and "netstat -nap | grep 10600" continuously and appends their output to two files. Launched shortly before the occurrence, it captures the command being launched and the parent script that launches it.
1. vi script.sh
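#!/bin/bash
# Sample the process list and the connections on port 10600 in a tight
# loop for MAX_SECONDS seconds, appending the results to ps.txt and netstat.txt.
declare -ir MAX_SECONDS=30
declare -ir TIMEOUT=$((SECONDS + MAX_SECONDS))
while (( SECONDS < TIMEOUT )); do
    date >> ps.txt
    ps aux | grep root >> ps.txt
    date >> netstat.txt
    netstat -nap | grep 10600 >> netstat.txt
done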
Then save and close with :wq
2. Give execution permissions:
chmod a+x script.sh
3. About 20 seconds before the errors are expected, launch the script:
./script.sh
4. Collect the two generated files, netstat.txt and ps.txt.
5. These files should contain the command that was being launched. In my test case it was a "uxlst fnc" command, see below:
[root@hostname TST600_hostname ]# grep uxlst netstat.txt
[root@hostname TST600_hostname ]# grep uxlst ps.txt
root 641 0.0 0.0 181720 7384 pts/1 S+ 15:45 0:00 /apps du 600 TST600_hostname bin uxlst fnc
root 1911 0.0 0.0 177312 3996 pts/1 R+ 15:46 0:00 /apps/du/600/TST600_hostname/bin/uxlst FNC
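The "ps aux" output above shows the uxlst process but not which process started it. As a possible refinement, a snapshot taken with "ps -eo pid=,ppid=,args=" includes the parent PID, so the chain of callers can be walked back to the launching script. This is a sketch only: the function name ancestry is hypothetical, and the snapshot must contain one "pid ppid command" entry per line.

```shell
#!/bin/bash
# Hypothetical helper: read "pid ppid args" lines on stdin and print the
# given PID followed by each of its ancestors, one per line.
ancestry() {
    awk -v target="$1" '
    {
        pid = $1
        ppid[pid] = $2
        $1 = ""; $2 = ""
        sub(/^ +/, "")          # what is left of the line is the command
        cmd[pid] = $0
    }
    END {
        p = target
        while (p in ppid) {
            print p, cmd[p]
            if (p == ppid[p]) break   # guard against a self-referencing entry
            p = ppid[p]
        }
    }'
}
```

For instance, with a snapshot saved while the command is running (ps -eo pid=,ppid=,args= > snapshot.txt), the callers of PID 641 could be listed with: ancestry 641 < snapshot.txt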