Artemis does not readmit a Cloud Director cell to the cluster after JDBC exceptions occur

Article ID: 325682

Updated On:

Products

VMware Cloud Director

Issue/Introduction

  • Cloud Director API calls time out and the GUI sometimes shows a 404 error.
  • An event disrupted the eth1 network between the database nodes.
  • Within the vcloud-container-debug.log you see JDBC exceptions occurring.
  • The main vmware-vcd service is marked as inactive due to the JDBC exceptions.

2023-09-19 01:28:31,343 | INFO | Cell comatose marker | CellLivenessStatusServiceImpl | Marking Cell as inactive

  • Within the cell-runtime.log you see that the number of cells in the vcd-cluster=topology has been reduced, as the cell is inactive.
  • After a short time the cell is marked as active again.

2023-09-19 01:28:48,470 | INFO | HeartBeat-1 | CellLivenessStatusServiceImpl | Marking Cell as active |

  • The number of nodes in the vcd-cluster=topology in cell-runtime.log does not increase again after the cell becomes active.
  • Within the jms-debug.log you observe broadcast errors such as the following:

2023-09-19 01:28:13,124 | ERROR| Thread-0 (ActiveMQ-scheduled-threads) | VCDBroadcastEndpoint| Error during broadcast for local cell: ########-####-####-####-############ |
org.postgresql.util.PSQLException: The connection attempt failed.
at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:250)
at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:49)
at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:195)
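
The inactive/active sequence described above can be searched for in a log file with a short script. The following Python sketch is illustrative only: it assumes the liveness lines have exactly the layout of the two excerpts quoted in this article, and the function name find_flaps and the 5-minute max_gap default are hypothetical choices, not part of Cloud Director.

```python
import re
from datetime import datetime, timedelta

# Pattern for the liveness lines quoted in this article (assumed layout).
LIVENESS = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s*\|.*"
    r"CellLivenessStatusServiceImpl\s*\|\s*Marking Cell as (?P<state>inactive|active)"
)


def find_flaps(lines, max_gap=timedelta(minutes=5)):
    """Return (inactive_time, active_time) pairs where the cell was marked
    active again within max_gap of being marked inactive.

    max_gap is a hypothetical threshold for "a short time".
    """
    flaps = []
    went_inactive = None
    for line in lines:
        match = LIVENESS.match(line)
        if not match:
            continue  # not a liveness line (stack traces, other loggers)
        ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
        if match.group("state") == "inactive":
            went_inactive = ts
        elif went_inactive is not None:
            if ts - went_inactive <= max_gap:
                flaps.append((went_inactive, ts))
            went_inactive = None
    return flaps


if __name__ == "__main__":
    # The two excerpts from this article, used as sample input.
    sample = [
        "2023-09-19 01:28:31,343 | INFO | Cell comatose marker | "
        "CellLivenessStatusServiceImpl | Marking Cell as inactive",
        "2023-09-19 01:28:48,470 | INFO | HeartBeat-1 | "
        "CellLivenessStatusServiceImpl | Marking Cell as active |",
    ]
    for down, up in find_flaps(sample):
        print(f"cell inactive at {down}, active again at {up} (gap {up - down})")
```

To scan a real log, pass an open file object instead of the sample list. A reported pair only shows that the cell went inactive and recovered; whether the cell was readmitted still has to be confirmed from the vcd-cluster=topology entries in cell-runtime.log.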

Environment

VMware Cloud Director 10.5.1.1

Cause

When a disruptive event such as a network disconnection occurs between the Cloud Director cells, a cell that goes inactive is removed from the Artemis cluster group. If the cell becomes active again shortly thereafter, it is not readmitted to the Artemis cluster topology, because of an issue that causes the cell to fail to broadcast itself sufficiently to the remaining cells in the cluster.

Resolution

Investigate and mitigate the source of the network disruption between the Cloud Director cells.

To rejoin the cluster, reboot the affected node or the VMware Cloud Director cell.