VCF Operations is unable to go offline, and several recent customizations and configurations are missing

Article ID: 439413

Updated On:

Products

VCF Operations

Issue/Introduction

VCF Operations cannot be taken offline, and several recent customizations and configurations are missing. A previous network outage or change caused a cluster-wide interruption. Upon recovery, the primary node was demoted to a replica. Because there were no valid backups or recent snapshots, the cluster could not be restored to a known good state from before the role reversal.

Environment

VCF Operations 9.x

Cause

A definitive root cause for the initial cluster node role swap could not be established. The underlying network event occurred approximately 60 days earlier, which exceeds the log retention period for granular connection tracking. The cluster had been operating in a degraded/diverged state since that time.

Resolution

Because no backups are available, there are two ways to work around the issue. One option is to redeploy the cluster. The other option is to follow these steps:

1. Bring the cluster offline and take powered-off snapshots of all cluster nodes and Cloud Proxies to create a recovery baseline.

2. Execute vcopsConfigureRoles.py to manually force the original primary node back into the Primary role and demote the current primary back to Replica.

3. Bring the cluster online and confirm that the roles are correct. Because of the cluster-wide interruption, some configurations will still be absent: data from the period of the network outage was never synchronized.

4. Manually recreate the missing customizations and configurations if no backups are available.

Additional Information

To check the current role of each node in the cluster, connect to each node over SSH and run:

egrep repl.db.role $VCOPS_BASE/user/conf/persistence/persistence.properties 

To check the sliceInstanceID for each node, run:

cat /usr/lib/vmware-vcopssuite/utilities/sliceConfiguration/data/platformState.properties
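When comparing several nodes, it can be easier to print only the value of a single property rather than the whole matching line. The following is a minimal sketch of a helper for that, not a tool shipped with VCF Operations. The function name prop_value is hypothetical, and the sketch assumes the files use the usual key=value properties layout.

```shell
#!/bin/sh
# Illustrative sketch only: prop_value is a hypothetical helper, not part of VCF Operations.
# It assumes a Java-style properties file with "key=value" or "key = value" lines.

# Usage: prop_value KEY FILE
# Prints the value of KEY and returns 0, or returns non-zero if KEY is not present.
prop_value() {
    awk -v k="$1" '
        /^[ \t]*[#!]/ { next }                  # skip comment lines
        {
            pos = index($0, "=")                # split on the first "=" only
            if (pos == 0) next
            name = substr($0, 1, pos - 1)
            val  = substr($0, pos + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", name)   # trim whitespace around the key
            gsub(/^[ \t]+|[ \t]+$/, "", val)    # trim whitespace around the value
            if (name == k) { print val; found = 1; exit }
        }
        END { exit found ? 0 : 1 }
    ' "$2"
}
```

On a node, the role could then be read with: prop_value repl.db.role "$VCOPS_BASE/user/conf/persistence/persistence.properties". The same helper may work for platformState.properties, but confirm the exact key name from the cat output first, since it is not stated here.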