VMware Cloud Director shows two Primary Nodes

Article ID: 384401

Products

VMware Cloud Director

Issue/Introduction

  • A failover occurred and the cluster health shown in the VMware Cloud Director (VCD) appliance management UI (VAMI) is DEGRADED. To open the VAMI, log in as root at https://primary_eth1_ip_address:5480.
  • In the VAMI you see two primary cells: one with status failed and one with status running.
  • If you SSH to the running primary cell and run the command sudo -i -u postgres /opt/vmware/vpostgres/current/bin/repmgr cluster show, the output resembles the following:

     ID    | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
    -------+-------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------------------------------
     ###46 | Cell1 | standby |   running | Cell2    | default  | 100      | 4        | host=##.##.##.## user=repmgr dbname=repmgr gssencmode=disable connect_timeout=2
     ###07 | Cell2 | primary | * running |          | default  | 100      | 4        | host=##.##.##.## user=repmgr dbname=repmgr gssencmode=disable connect_timeout=2
     ###18 | Cell3 | primary | - failed  | ?        | default  | 100      |          | host=##.##.##.## user=repmgr dbname=repmgr gssencmode=disable connect_timeout=2

    WARNING: following issues were detected
      - unable to connect to node "Cell3" (ID: ###18)

  • An automatic failover was triggered by performing an action such as rebooting the primary cell in a VCD appliance cluster.
  • VCD cells appear to be in a "split-brain" scenario with two primary cells listed.
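
The condition shown in the repmgr output above can also be checked programmatically. The following is a minimal sketch, not a VMware-provided tool: it assumes the table layout matches the sample output in this article, the function names parse_cluster_show and stale_primaries are hypothetical, and the connection strings in SAMPLE are shortened for readability.

```python
def parse_cluster_show(text):
    """Parse the table printed by `repmgr cluster show` into a list of dicts.

    Assumes pipe-separated columns in the order ID, Name, Role, Status,
    as in the sample output in this article.
    """
    nodes = []
    for line in text.splitlines():
        cells = [c.strip() for c in line.split("|")]
        # Skip the header row and any line that is not a data row
        # (separator, warnings, blank lines contain no pipe characters).
        if len(cells) < 4 or cells[0] == "ID":
            continue
        nodes.append({
            "id": cells[0],
            "name": cells[1],
            "role": cells[2],
            # Drop the leading marker ("*", "-", "!", "?") from the status.
            "status": cells[3].lstrip("*-!? "),
        })
    return nodes


def stale_primaries(nodes):
    """Return names of nodes registered as primary but not running."""
    return [n["name"] for n in nodes
            if n["role"] == "primary" and n["status"] != "running"]


SAMPLE = """\
 ID    | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
-------+-------+---------+-----------+----------+----------+----------+----------+------------------
 ###46 | Cell1 | standby |   running | Cell2    | default  | 100      | 4        | host=##.##.##.##
 ###07 | Cell2 | primary | * running |          | default  | 100      | 4        | host=##.##.##.##
 ###18 | Cell3 | primary | - failed  | ?        | default  | 100      |          | host=##.##.##.##

WARNING: following issues were detected
  - unable to connect to node "Cell3" (ID: ###18)
"""

nodes = parse_cluster_show(SAMPLE)
print([n["name"] for n in nodes if n["role"] == "primary"])  # ['Cell2', 'Cell3']
print(stale_primaries(nodes))                                # ['Cell3']
```

More than one node with the primary role, where one of them is not running, corresponds to the state described in this article.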

Environment

VMware Cloud Director 10.x

Cause

A failover occurred and the primary role moved to another cell, but the old primary is still registered in the cluster as a failed cell.
The failed primary cell must be removed from the Cloud Director infrastructure because it is broken and is not following the new primary cell. For more information about VMware Cloud Director appliance cluster health, see the document: View Your VMware Cloud Director Appliance Cluster Health and Failover Mode

Resolution

To remove a standby cell from VMware Cloud Director, follow the document: Unregister a Running Standby Cell in Your VMware Cloud Director Database High Availability Cluster
If you are unable to use the Cloud Director API to unregister a standby cell, contact Broadcom Support and note this Article ID (384401) in the problem description. For more information, see Creating and managing Broadcom support cases.

To avoid this issue when the primary cell needs to be rebooted and automatic failover is enabled, take one of the following actions:

  1. Perform a switchover so that the cell to be rebooted becomes a standby, then reboot that standby cell.
  2. Alternatively, change the cluster to manual failover mode so that no automatic failover occurs during the primary cell reboot.
    WARNING: In manual failover mode, if the primary cell is powered down, the Cloud Director UI is unavailable until the primary cell is started again.
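
The decision described in the steps above can be summarized as a small helper. This is an illustrative sketch only: the function name reboot_plan and the labels "primary", "standby", "automatic" and "manual" are assumptions for this example, and it does not call any Cloud Director API. For a primary cell in automatic failover mode it returns option 1 (switchover); changing to manual failover mode is the alternative.

```python
def reboot_plan(role, failover_mode):
    """Return the ordered steps suggested before rebooting a cell.

    role: "primary" or "standby" (role of the cell to be rebooted)
    failover_mode: "automatic" or "manual" (cluster failover mode)
    """
    if role != "primary":
        # Rebooting a standby does not trigger a failover.
        return ["reboot the cell"]
    if failover_mode == "automatic":
        # Option 1 from the list above: move the primary role away first.
        return [
            "perform a switchover so this cell becomes a standby",
            "reboot the cell",
        ]
    # Manual mode: no automatic failover occurs, but the UI is down
    # until the primary cell is started again.
    return [
        "expect the Cloud Director UI to be unavailable until the primary is back",
        "reboot the cell",
    ]


for step in reboot_plan("primary", "automatic"):
    print(step)
```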