Intermittent task failures in RabbitMQ-backed Cloud Director environments
Article ID: 429357
Products
VMware Cloud Director
Issue/Introduction
Approximately every second task or placement policy within the environment fails with 'Automatically failed due to timeout.'
These tasks remain in a processing state in VCD until they reach the manual Blocking Task timeout limit.
Initial health checks suggest that the message bus is functional, but intermittent failures persist.
RabbitMQ logs show either of the following errors:
PLAIN login refused: user 'vcd' - invalid credentials
HTTP access denied: user 'admin' - invalid credentials
Environment
10.3
10.4
10.5
10.6
Cause
The issue relates to a specific RabbitMQ node entering an inconsistent state. While the node appears active and correct in the VCD Extensibility AMQP interface and passes basic connection tests, it fails to handle actual task traffic correctly.
RabbitMQ logs reveal authentication failures on this specific node, even though the credentials are valid, as confirmed by logging in to RabbitMQ directly and by testing the AMQP broker connection.
This results in a round-robin failure pattern in which tasks routed to the failed RabbitMQ node stall and eventually time out.
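The round-robin pattern can be illustrated with a minimal sketch. It assumes a two-node cluster with strict alternation between nodes, which is a simplification (actual distribution depends on how connections are balanced). The node names `rmq-a` and `rmq-b` and the function `simulate_round_robin` are hypothetical and exist only for illustration.

```python
from itertools import cycle


def simulate_round_robin(nodes, unhealthy, task_count):
    """Assign tasks to nodes in turn; tasks landing on the unhealthy node time out."""
    assignment = cycle(nodes)
    results = []
    for task_id in range(1, task_count + 1):
        node = next(assignment)
        # Assumption: every task routed to the unhealthy node stalls and times out.
        outcome = "timeout" if node == unhealthy else "ok"
        results.append((task_id, node, outcome))
    return results


# Hypothetical two-node cluster in which rmq-b is the inconsistent node.
for task_id, node, outcome in simulate_round_robin(["rmq-a", "rmq-b"], "rmq-b", 6):
    print(task_id, node, outcome)
```

With two nodes and one of them unhealthy, every second task times out, which matches the symptom described above.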
Resolution
To resolve this inconsistency, take the following steps:
Identify the problematic RabbitMQ node through log analysis.
Restart the RabbitMQ service on the affected node, or restart the server itself.
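The log analysis step could be sketched as follows. This is a minimal example, not an official tool. It assumes you have collected a copy of the RabbitMQ log from each node, and it counts the two authentication errors quoted in this article. The function names `count_auth_failures` and `scan_log_file` are hypothetical, and the log file locations depend on your installation.

```python
import re
from collections import Counter

# Matches the two error messages quoted in this article, for any user name.
AUTH_FAILURE = re.compile(
    r"(PLAIN login refused|HTTP access denied): user '([^']+)' - invalid credentials"
)


def count_auth_failures(lines):
    """Count authentication-failure lines, keyed by (error kind, user name)."""
    counts = Counter()
    for line in lines:
        match = AUTH_FAILURE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


def scan_log_file(path):
    """Read one log file and return its authentication-failure counts."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return count_auth_failures(handle)
```

Running `scan_log_file` against the log collected from each node and comparing the counts should point to the node to restart: the affected node is expected to show these errors while the healthy nodes do not.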