VCD tasks slow after failover was triggered

Article ID: 315583

Products

VMware Aria Suite

Issue/Introduction

  • Cluster failover of VCD cells was triggered due to maintenance.
  • Tasks become slow because some subtasks take longer than they did before the failover.
  • Rebooting the cluster seems to resolve the issue temporarily, but the issue may recur.
  • The jms-debug.log shows entries similar to the following:
    • yyyy-mm-dd hh:mm:ss,ms | DEBUG    | Thread-2 (ActiveMQ-client-netty-threads) | ActiveMQClientProtocolManager  | Notifying <Node ID> going down
      ...
      yyyy-mm-dd hh:mm:ss,ms | DEBUG    | e=<Node ID>-<trace>) | RemotingServiceImpl            | RemotingServiceImpl::removing succeeded connection ID <connection ID>, we now have N connections
      yyyy-mm-dd hh:mm:ss,ms | DEBUG    | e=<Node ID>-<trace>) | ServerSessionImpl              | deleting temporary queue notif.<id>.ActiveMQServerImpl_name=<MQServer_Implementation_id>
      yyyy-mm-dd hh:mm:ss,ms | DEBUG    | e=<Node ID>-<trace>) | PageSubscriptionCounterImpl    | Subscription N delete, keepZero=false
  • The cell-runtime.log or vcloud-container-debug.log may show JDBC connectivity errors such as the following:
    • yyyy-mm-dd hh:mm:ss,ms  | ERROR    | Thread-0 (ActiveMQ-scheduled-threads) | VCDBroadcastEndpoint           | Error during broadcast for local cell: <cell-id> |
      org.postgresql.util.PSQLException: The connection attempt failed.
    • Caused by: java.net.SocketTimeoutException: connect timed out
    • Caused by: org.springframework.transaction.CannotCreateTransactionException: Could not open Hibernate Session for transaction; nested exception is org.hibernate.exception.JDBCConnectionException: Cannot open connection
    • Caused by: org.hibernate.exception.JDBCConnectionException: Cannot open connection
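
To check a cell for the signatures listed above, a small helper along the following lines may be useful. This is a minimal sketch, not a VMware-provided tool: the function names `count_matches` and `report_symptoms` are hypothetical, and the default log directory `/opt/vmware/vcloud-director/logs` is an assumption that should be adjusted to wherever the cell writes its logs.

```shell
#!/bin/sh
# Sketch: count log lines that match the signatures quoted in this article.
# count_matches and report_symptoms are hypothetical helper names.

# count_matches PATTERN FILE
# Prints the number of lines in FILE matching PATTERN (0 if FILE is unreadable).
count_matches() {
    if [ -r "$2" ]; then
        # grep -c exits non-zero when the count is 0, so ignore its status.
        grep -c -- "$1" "$2" || true
    else
        echo 0
    fi
}

# report_symptoms [LOG_DIR]
# LOG_DIR defaults to an assumed location; pass the real one if it differs.
report_symptoms() {
    dir="${1:-/opt/vmware/vcloud-director/logs}"
    echo "node going down notifications: $(count_matches 'Notifying .* going down' "$dir/jms-debug.log")"
    echo "broadcast errors: $(count_matches 'Error during broadcast for local cell' "$dir/cell-runtime.log")"
    echo "JDBC connection failures: $(count_matches 'JDBCConnectionException: Cannot open connection' "$dir/vcloud-container-debug.log")"
}
```

Calling `report_symptoms` with no argument on a cell prints one count per signature. Non-zero counts only indicate that the log lines are present; they do not by themselves confirm this issue.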

Environment

VMware Cloud Director 10.x

Cause

When a node that was in maintenance, or that was recently failed over, is restored, the Artemis cluster cannot reconnect it because of incomplete or incorrect cleanup.

Resolution

  • This issue is fixed in VMware Cloud Director 10.5.1.
  • To work around this issue:
    1. Set mq.discovery.generationDrift to 3600:
      /opt/vmware/vcloud-director/bin/cell-management-tool manage-config -n mq.discovery.generationDrift -v 3600
    2. Shut down and then start the vmware-vcd service on ALL cells in the environment so that each cell picks up the change:
      1. /opt/vmware/vcloud-director/bin/cell-management-tool cell -i $(service vmware-vcd pid cell) -s
      2. service vmware-vcd start
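
The workaround steps above can be wrapped in a small script so that they are applied the same way on each cell. The following is a sketch under stated assumptions, not an official procedure: the `run` helper, the `DRY_RUN` switch, the `DRIFT` variable, and the function names are hypothetical. Only the `cell-management-tool` and `service vmware-vcd` invocations are taken from the steps above. The article does not say whether the setting has to be applied on every cell, so the two steps are kept as separate functions.

```shell
#!/bin/sh
# Sketch of the workaround. Commands are only printed unless DRY_RUN=0 is set.
# run, DRY_RUN, DRIFT and the function names are illustrative, not part of VCD.
CMT=/opt/vmware/vcloud-director/bin/cell-management-tool
DRIFT="${DRIFT:-3600}"   # value stated in step 1 of the workaround

# run CMD [ARGS...]: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Step 1: set mq.discovery.generationDrift.
set_generation_drift() {
    run "$CMT" manage-config -n mq.discovery.generationDrift -v "$DRIFT"
}

# Step 2: shut down and start the vmware-vcd service on the local cell.
restart_cell() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        pid='<cell-pid>'   # placeholder shown in dry-run output only
    else
        pid=$(service vmware-vcd pid cell) || return 1
    fi
    run "$CMT" cell -i "$pid" -s || return 1
    run service vmware-vcd start
}
```

With the default `DRY_RUN=1`, calling `set_generation_drift` and `restart_cell` only prints the commands, which makes it possible to review them before running with `DRY_RUN=0` during a maintenance window.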