Intermittent task and failures in RabbitMQ backed Cloud Director environments
search cancel

Intermittent task and failures in RabbitMQ backed Cloud Director environments

book

Article ID: 429357

calendar_today

Updated On:

Products

VMware Cloud Director

Issue/Introduction

  • Approximately every second task or placement policy within the environment is failing with 'Automatically failed due to timeout.'
  • These tasks remain in a processing state in VCD until they reached the manual Blocking Task timeout limit.
  • Initial health checks suggest that the message bus is functional but intermittent failures persist.
  • RabbitMQ logs show either:
    PLAIN login refused: user 'vcd' - invalid credentials
    HTTP access denied: user 'admin' - invalid credentials

Environment

  • 10.3
  • 10.4
  • 10.5
  • 10.6

Cause

  • The issue relates to a specific RabbitMQ node entering an inconsistent state. While the node appears active and correct in the VCD Extensibility AMQP interface and passes basic connection tests, it is failing to handle actual task traffic correctly.
  • RabbitMQ logs should reveal authentication failures on this specific node, despite credentials being validated by logging in to RMQ and AMQP broker testing.
  • This results in a round-robin failure pattern where tasks hitting the failed RMQ stall and eventually time out.

Resolution

To resolve this inconsistency the following steps should be taken:

  1. Identify the problematic RabbitMQ node through log analysis.
  2. Restart the affected RabbitMQ service or the server.
  3. Revalidate RabbitMQ credentials.