A previous network outage or change caused a cluster-wide interruption. Upon recovery, the primary node was demoted to a replica. Because no valid backups or recent snapshots existed, the cluster could not be restored to a known good state from before the role reversal.
VCF Operations 9.x
A definitive root cause for the initial VCF cluster node role swap could not be established as the underlying network event occurred approximately 60 days prior, exceeding the log retention period for granular connection tracking. The cluster had been operating in a degraded/diverged state since that time.
Because no backups are available, there are two ways to work around the issue. One option is to redeploy the cluster. The other is to recover the existing cluster with the following steps:
1. Take the cluster offline and take powered-off snapshots of all cluster nodes and Cloud Proxies to create a recovery baseline.
2. Execute vcopsConfigureRoles.py to manually force the original primary node back into the Primary role and demote the current primary back to Replica.
3. Bring the cluster online and confirm that the roles are corrected. Because of the cluster-wide interruption, configurations remain missing: data from the period of the network outage was never synchronized.
4. Manually recreate the missing customizations and configurations, since no backups exist to restore them from.
To check the current role of each node in the cluster, connect to each node via SSH and run:
egrep repl.db.role $VCOPS_BASE/user/conf/persistence/persistence.properties
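When the same check is repeated on several nodes, it can help to print only the value of the property instead of the whole matching line. The following is a minimal sketch, not part of the product tooling: `get_prop` is a hypothetical helper name, and it assumes the file uses the usual `key=value` or `key = value` properties layout.

```shell
# Hypothetical helper: print the value of one key from a .properties file.
# Usage: get_prop KEY FILE
get_prop() {
    key="$1"
    file="$2"
    # Escape dots so "repl.db.role" is matched literally, not as a regex wildcard.
    esc=$(printf '%s' "$key" | sed 's/\./\\./g')
    # Strip "key =" from matching lines and print what is left (the value).
    # Commented lines do not match; if the key appears twice, the last one wins.
    sed -n "s/^[[:space:]]*${esc}[[:space:]]*=[[:space:]]*//p" "$file" | tail -n 1
}

# Example, run on a node where VCOPS_BASE is set in the shell environment:
# get_prop repl.db.role "$VCOPS_BASE/user/conf/persistence/persistence.properties"
```

Comparing the printed value from each node side by side makes it easier to spot which node currently holds which role.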
To check the sliceInstanceID of each node, run:
cat /usr/lib/vmware-vcopssuite/utilities/sliceConfiguration/data/platformState.properties
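The `cat` command above prints the whole file. As a rough sketch, the output can be narrowed to the sliceInstanceID entry only. `show_slice_id` is a hypothetical helper name, and the exact capitalisation of the key inside the file is an assumption, which is why the match is case-insensitive.

```shell
# Hypothetical helper: print only the sliceInstanceID line(s) of a
# platformState.properties file. The path defaults to the one used above.
show_slice_id() {
    file="${1:-/usr/lib/vmware-vcopssuite/utilities/sliceConfiguration/data/platformState.properties}"
    # -i: the key's capitalisation in the file is assumed, not verified.
    grep -i 'sliceinstanceid' "$file" || {
        echo "no sliceInstanceID entry found in $file" >&2
        return 1
    }
}

# Example, run on each node via SSH:
# show_slice_id
```

The helper returns a non-zero status when no matching entry exists, so a missing value on a node is not silently overlooked.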