While trying to recover several cluster nodes, we ejected the nodes and tried to re-add them, but several of them simply fail to add. We can see that the cluster configuration is reset on the node we are trying to add but remains in the active cluster.
The issue here was that several "sync" processes were orphaned on the node we were trying to add. The sync command is called by /sbin/aactrl.sh to force a flush of any outstanding writes to the filesystem during the cluster startup process. If this process does not complete, cluster startup cannot complete, and the node therefore fails to join the cluster.
[email protected]:~# ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
uagmon    3019     1  0 May13 ?        00:01:23 /usr/bin/perl -T /sbin/xiomo
uagmon    3045     1  0 May13 ?        00:00:03 /usr/bin/perl -T /sbin/logwa
root      3081     1  0 May13 ?        00:00:05 /sbin/xcd_sfamon
root      3172     1  0 May13 ?        00:01:21 sync
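The check above can be narrowed to just the orphaned sync processes. The following is a minimal sketch, not a product-supplied tool: it assumes a procps-style ps that accepts -eo with the pid, ppid, stat, etime and comm columns, and the function name find_orphaned_sync is hypothetical. A hung sync typically shows state D (uninterruptible sleep) in the STAT column.

```shell
# Hypothetical helper (not part of the product).
# Reads the output of `ps -eo pid,ppid,stat,etime,comm` on stdin and prints
# PID, state and elapsed time for every process whose command name is "sync"
# and whose parent is PID 1 (that is, an orphaned sync).
find_orphaned_sync() {
    awk 'NR > 1 && $5 == "sync" && $2 == 1 { print $1, $3, $4 }'
}

# Example invocation on a node (commented out so the definition can be
# sourced without side effects):
# ps -eo pid,ppid,stat,etime,comm | find_orphaned_sync
```

If the function prints one or more lines with a long elapsed time, the node is likely in the condition described above.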
Release : 3.4, 4.x
Component : PRIVILEGED ACCESS MANAGEMENT
Forcibly reboot the cluster node. Because the sync process is hung, a normal reboot may not complete either, so a hard reboot may be required.
Linux command sync - Synchronize cached writes to persistent storage