VMware Cloud Director shows two Primary Nodes

Products

VMware Cloud Director

Issue/Introduction

A failover occurred and the VMware Cloud Director (VCD) cluster health is currently reported as DEGRADED or READ ONLY in the VAMI. To access the VAMI, log in as root to the appliance management UI at https://<primary_eth1_ip_address>:5480.
On the VAMI you see two primary cells: one with status failed and one with status running.
If you SSH to the running primary cell and run the command "sudo -i -u postgres /opt/vmware/vpostgres/current/bin/repmgr cluster show", the output lists two primary nodes, similar to the following:
 ID    | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
-------+-------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------------------------------
 ###46 | Cell1 | standby |   running | Cell2    | default  | 100      | 4        | host=##.##.##.## user=repmgr dbname=repmgr gssencmode=disable connect_timeout=2
 ###07 | Cell2 | primary | * running |          | default  | 100      | 4        | host=##.##.##.## user=repmgr dbname=repmgr gssencmode=disable connect_timeout=2
 ###18 | Cell3 | primary | - failed  | ?        | default  | 100      |          | host=##.##.##.## user=repmgr dbname=repmgr gssencmode=disable connect_timeout=2
WARNING: following issues were detected
- unable to connect to node "Cell3" (ID: ###18)

An automatic failover was triggered by performing an action such as rebooting the primary cell in a VCD appliance cluster.
VCD cells appear to be in a "split-brain" scenario with two primary cells listed.
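
As a rough way to spot this condition in a script, the following Python sketch parses text in the shape of the "repmgr cluster show" output above and reports any node that is still listed as primary but is not running while another primary is running. The helper names parse_cluster_show and find_stale_primaries are hypothetical and are not part of repmgr or VMware Cloud Director. The sketch assumes the pipe-delimited column layout shown in this article and has not been validated against other repmgr versions.

```python
# Sample text copied from this article (IDs and addresses are masked).
SAMPLE = """\
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
-------+-------+---------+-----------+----------+----------+----------+----------+-------
###46 | Cell1 | standby | running | Cell2 | default | 100 | 4 | host=##.##.##.## user=repmgr
###07 | Cell2 | primary | * running | | default | 100 | 4 | host=##.##.##.## user=repmgr
###18 | Cell3 | primary | - failed | ? | default | 100 | | host=##.##.##.## user=repmgr
WARNING: following issues were detected
- unable to connect to node "Cell3" (ID: ###18)
"""


def parse_cluster_show(text):
    """Turn the pipe-delimited node rows into a list of dicts (hypothetical helper)."""
    nodes = []
    for line in text.splitlines():
        cells = [c.strip() for c in line.split("|")]
        # The separator, warning, and blank lines contain no "|" and are skipped,
        # as is the header row.
        if len(cells) < 4 or cells[0] == "ID":
            continue
        nodes.append({
            "id": cells[0],
            "name": cells[1],
            "role": cells[2],
            # Drop the leading marker ("*", "-", "!", "?") in front of the status word.
            "status": cells[3].lstrip("*-!? ").strip(),
        })
    return nodes


def find_stale_primaries(nodes):
    """Return names of non-running primaries when a running primary also exists."""
    primaries = [n for n in nodes if n["role"] == "primary"]
    running = [n for n in primaries if n["status"] == "running"]
    if len(primaries) > 1 and len(running) == 1:
        return [n["name"] for n in primaries if n["status"] != "running"]
    return []


if __name__ == "__main__":
    print(find_stale_primaries(parse_cluster_show(SAMPLE)))
```

A non-empty result only indicates that the output matches the pattern described in this article. Confirm the cluster state in the VAMI before removing any cell.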

Environment

VMware Cloud Director 10.x

Cause

A failover occurred and the primary cell changed, but the old primary is still registered in the cluster as a failed cell.
The failed primary cell needs to be removed from the Cloud Director infrastructure, because it is broken and is not following the new primary cell. For more information about VMware Cloud Director appliance cluster health, see the document: View Your VMware Cloud Director Appliance Cluster Health and Failover Mode

Resolution

To remove a standby cell from VMware Cloud Director, follow the document: Unregister a Running Standby Cell in Your VMware Cloud Director Database High Availability Cluster

If you are unable to use the Cloud Director API for unregistering a standby, please contact Broadcom Support and note this Article ID (384401) in the problem description. For more information, see Creating and managing Broadcom support cases.

To avoid this issue when the primary cell needs to be rebooted and automatic failover is enabled, take one of the following actions:

Perform a switchover action so that the cell scheduled for reboot becomes a standby node, and then reboot the standby cell.
Alternatively, change the cluster to manual failover mode so that no automatic failover occurs during the primary cell reboot.
WARNING: In manual failover mode, if the primary cell is powered down, the Cloud Director UI becomes unavailable until the primary cell is started again.