Many u_write_art errors appear in universe.log on a cluster node

Products

CA Automic Dollar Universe

Issue/Introduction

In a Dollar Universe node installed on a Linux cluster, some Uprocs remain in Launch Wait status with a start date in the past. Example of the error:

| 2022-01-26 12:00:00 |ERROR|X|IO |pid=17520.140392495183616| u_write_art | ret is [-4]. No article found for pos_data [180] in file [u_fmlp60/u_mmlp01]

Additionally, some Job Events are not transmitted to the remote nodes that are waiting for them, and universe.log shows errors such as the following:

| 2022-01-22 13:00:25 |ERROR|X|IO |pid=104533.140172070995712| u_read_direct | Cannot get article in file [u_fmev60/u_mmev02] (err=9) pos=[76].
| 2022-01-22 13:00:25 |ERROR|X|IO |pid=104533.140172070995712| u_write_art | ret is [-4]. No article found for pos_data [0] in file [u_fmev60/u_mmev02]
| 2022-01-22 13:00:25 |ERROR|X|IO |pid=104533.140172070995712| u_write_art | u_ident={u_fmev60, /FS-Data/DollarU/opt/AUTOMIC/DUAS/duasnode/data/exp/, 6, X}

How can we fix this situation?

Environment

Release : 6.x

Component : DOLLAR UNIVERSE

OS architecture: Linux Cluster

Cause

The uxioserv process of the secondary node was running at the same time as the uxioserv process of the primary node. This should never occur, because Dollar Universe uses an active/passive architecture in a cluster.

This was found by checking which PID wrote the error messages in universe.log and comparing it with the PID returned by ps -ef | grep uxioserv:

optnc5a 72312 1 0 Feb28 ? 00:04:57 ./uxioserv OPTNC5 X nodename

In universe.log, the correct uxioserv has PID 72312 and printed normal messages such as:

| 2022-03-01 00:02:22 |INFO |X|IO |pid=72312.139933020829440| u_trt_req_rgz             | End of reorg for area X (0)
| 2022-03-01 00:02:22 |INFO |X|IO |pid=72312.139933020829440| o_io_on_purge             | Next IO purge date [2022030200012200].

However, all the warning and error messages came from a different PID (17520), as shown below:

| 2022-03-02 00:26:15 |WARN |X|IO |pid=17520.140392129353472| o_module_exc_cycle_emissi | No matching numlot for CD article [131566]/[1008]
| 2022-03-02 00:28:45 |ERROR|X|IO |pid=17520.140392129353472| owls_api_exchanger_create | Error 200 connecting to node <distant_node> port 10600
| 2022-03-02 00:28:45 |WARN |X|IO |pid=17520.140392129353472| o_module_exc_cycle_emissi | No matching numlot for CD article [131566]/[1008]
| 2022-03-02 00:30:00 |ERROR|X|IO |pid=17520.140392495183616| u_write_art               | ret is [-4]. No article found for pos_data [232] in file [u_fmlp60/u_mmlp01]
| 2022-03-02 00:30:00 |ERROR|X|IO |pid=17520.140392495183616| u_write_art               | ret is [-4]. No article found for pos_data [0] in file [u_fmhs60/u_mmhs01]
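| 2022-03-02 00:30:02 |ERROR|X|IO |pid=17520.140392495183616| GAGESHIS                  | L'execution 0061845 est inconnue dans l'historique
| 2022-03-02 00:30:02 |ERROR|X|IO |pid=17520.140392495183616| GATERB33                  | Terminaison : L'uproc CVGCOEXQ0100 est absente de la base de pilotage
| 2022-03-02 00:45:00 |ERROR|X|IO |pid=17520.140392495183616| GAMEMO32                  | Effacement impossible 'D' sur fichier Log executions

The last three messages are logged in French. In English they read approximately: "Execution 0061845 is unknown in the history", "Termination: Uproc CVGCOEXQ0100 is absent from the control base (base de pilotage)", and "Deletion 'D' impossible on the executions log file".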
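The PID comparison described above can be automated. The following is a minimal Python sketch, not a Dollar Universe tool: it assumes the universe.log line layout shown in the excerpts above, and it assumes that the fourth field (IO) identifies messages written by uxioserv. The function names are hypothetical.

```python
import re
from collections import Counter, defaultdict

# Layout assumed from the excerpts in this article:
# | 2022-03-02 00:30:00 |ERROR|X|IO |pid=17520.140392495183616| u_write_art | ...
LOG_LINE = re.compile(
    r"^\|\s*\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\s*"   # timestamp
    r"\|\s*(?P<level>[A-Z]+)\s*"                      # INFO / WARN / ERROR
    r"\|[^|]*"                                        # area (X in the excerpts)
    r"\|\s*(?P<comp>[^|\s]+)\s*"                      # component (IO in the excerpts)
    r"\|\s*pid=(?P<pid>\d+)\.\d+\s*\|"                # pid=<process id>.<thread id>
)


def count_levels_by_pid(lines, component="IO"):
    """Count log levels per PID for one component; unparsable lines are skipped."""
    counts = defaultdict(Counter)
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("comp") == component:
            counts[int(match.group("pid"))][match.group("level")] += 1
    return counts


def unexpected_pids(counts, expected_pid):
    """Return the PIDs that logged for the component but are not the expected one."""
    return sorted(pid for pid in counts if pid != expected_pid)
```

For example, with `counts = count_levels_by_pid(open("universe.log", errors="replace"))`, calling `unexpected_pids(counts, 72312)` lists every other PID that wrote IO messages. Any PID returned there is a candidate for a second uxioserv and should be checked with ps -ef on each cluster member.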

Resolution

Stop all Dollar Universe processes on both servers, run a reorganization, and then start the node only on the correct (active) member of the cluster.
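Before restarting, it can help to confirm that no uxioserv process is left on either server, and afterwards that one is running only on the active member. The sketch below is a hypothetical helper, not part of Dollar Universe. It assumes the standard Linux ps -ef column layout (UID PID PPID C STIME TTY TIME CMD) and takes the captured output of ps -ef from one server as a string.

```python
def uxioserv_processes(ps_output):
    """Return (user, pid, command) for each uxioserv process in `ps -ef` output."""
    found = []
    for line in ps_output.splitlines():
        # Split into the 7 leading columns plus the full command line.
        fields = line.split(None, 7)
        if len(fields) < 8 or not fields[1].isdigit():
            continue  # header line or malformed entry
        command = fields[7]
        # Compare the executable name only, so "grep uxioserv" is not counted.
        executable = command.split()[0].rsplit("/", 1)[-1]
        if executable == "uxioserv":
            found.append((fields[0], int(fields[1]), command))
    return found
```

Run against the ps -ef output of each cluster member, the function should return an empty list on both servers after the stop, and an entry for the node on the active member only after the restart.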