While trying to recover several cluster nodes, we ejected them and tried to re-add them, but several are simply failing to add. We can see the cluster configuration is reset on the node we are trying to add, but it remains in the active cluster.
Release : 3.4, 4.x
Component : PRIVILEGED ACCESS MANAGEMENT
The issue here was that several "sync" processes were orphaned on the node we were trying to add. The sync command is called by /sbin/aactrl.sh to force a flush of any outstanding writes to the filesystem during the cluster startup process. If this process does not complete, cluster startup cannot complete and the node therefore fails to join the cluster.
root@OPPrime1:~# ps -ef
UID PID PPID C STIME TTY TIME CMD
...
uagmon 3019 1 0 May13 ? 00:01:23 /usr/bin/perl -T /sbin/xiomo
uagmon 3045 1 0 May13 ? 00:00:03 /usr/bin/perl -T /sbin/logwa
root 3081 1 0 May13 ? 00:00:05 /sbin/xcd_sfamon
root 3172 1 0 May13 ? 00:01:21 sync
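To spot this condition without reading the full process list, the output can be narrowed to sync processes only. The following is a minimal sketch, not a vendor-supplied tool: the function names filter_sync and list_sync_procs are hypothetical, and it assumes shell access to the node with a procps-style ps and awk available. A sync process with a long elapsed time, often in state D (uninterruptible sleep), suggests the hang described above.

```shell
# Hypothetical helpers (not part of the product) for spotting hung sync processes.

# Reads "pid ppid stat etime comm" lines on stdin and prints
# "pid ppid stat etime" for every process whose command name is exactly "sync".
filter_sync() {
    awk '$5 == "sync" { print $1, $2, $3, $4 }'
}

# Runs ps with headers suppressed (the trailing "=" on each column)
# and keeps only the sync processes.
list_sync_procs() {
    ps -eo pid=,ppid=,stat=,etime=,comm= | filter_sync
}
```

Calling list_sync_procs on a healthy node should normally print nothing, since sync usually returns within seconds.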
Forcibly reboot the cluster node. Because the sync process is hung, a normal reboot may not complete either, so a hard reboot may be required.
Linux command sync - Synchronize cached writes to persistent storage