Upgrade all instances errand is failing in Rabbit MQ

Article ID: 425480

Updated On:

Products

VMware Tanzu Platform - Cloud Foundry, VMware Tanzu RabbitMQ

Issue/Introduction

  • Apply Changes failed during the TAS upgrade process.
  • The Smoke Test and upgrade-all-service-instances errands fail in RabbitMQ during the upgrade to version 10.0.3.
  • The failure occurs on On-Demand RabbitMQ instances.
  • The upgrade-all-service-instances job fails with error messaging like:

    [upgrade-all-service-instances] 2026/01/11 04:56:15.802812 [upgrade-all] FINISHED PROCESSING Status: FAILED; Summary: Number of successful operations: 10; Number of skipped operations: 0; Number of service instance orphans detected: 0; Number of deleted instances before operation could happen: 0; Number of busy instances which could not be processed: 0; Number of service instances that failed to process: 1 [########-####-####-####-624vl4p1d9v0]\n[upgrade-all-service-instances] 2026/01/11 04:56:15.802828 [########-####-####-####-624vl4p1d9v0] Operation failed: bosh task id 0: \n","stderr":"Error: failed to run job-process: exit status 1 (exit status 1)

  • From the above logging example, the following ID correlates to the failing RMQ service instance: ########-####-####-####-624vl4p1d9v0. Running 'bosh -d service-instance_########-####-####-####-624vl4p1d9v0 vms' (sketched below) shows one or more instances in an Unresponsive Agent or Failed state.
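
A minimal shell sketch of that check, assuming the BOSH CLI is already targeted and logged in to the environment; the GUID value is the placeholder copied from the errand output above:

    # Placeholder GUID taken from the errand output; substitute the real value.
    GUID="########-####-####-####-624vl4p1d9v0"

    # On-Demand service instance deployments are named "service-instance_<GUID>".
    # List the VMs and their agent state for that deployment.
    bosh -d "service-instance_${GUID}" vms

    # Optionally include per-process health to see which jobs are failing.
    bosh -d "service-instance_${GUID}" instances --ps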

Environment

This issue was observed on Tanzu RabbitMQ for Tanzu Application Service when upgrading from version 6.0.12 to 10.0.3 (RabbitMQ server versions 3.3.16 to 4.0.12).

Cause

The cause of the upgrade-all-service-instances job failure during the RabbitMQ tile upgrade cannot be determined conclusively from the Ops Manager logging alone. Deeper investigation into the BOSH task logging is required in order to isolate the exact cause of failure. The resolution steps below will help identify exactly which job is failing on the deployment in question:

 

First, identify why the Deployment upgrade failed:

  1. Use the following command to view the 10 most recent BOSH tasks (a combined command sketch for steps 1-4 follows this list):

    bosh tasks -r=10

  2. Look for the task with the description "run errand upgrade-all-service-instances from deployment p-rabbitmq-ID". Above this task, there will be several tasks, each related to an On-Demand RabbitMQ deployment.
  3. Using the RabbitMQ service instance ID from the failure events in the Ops Manager GUI (in this example, ########-####-####-####-624vl4p1d9v0):
    • Identify the failed task whose Deployment column matches "service-instance_########-####-####-####-624vl4p1d9v0".
    • This task will have the description "create deployment".
    • Note the task ID.
  4. View the task by ID:

    bosh task <TASK_ID>
  5. Failure messaging will look like:

    {"time":1768346254,"stage":"Updating instance","tags":["rabbitmq-server"],"total":2,"task":"rabbitmq-server/########-####-####-####-a2e198c173d5 (1) (canary)","index":1,"state":"in_progress","progress":100,"data":{"status":"executing post-start"}}

    {"time":1768346264,"stage":"Updating instance","tags":["rabbitmq-server"],"total":2,"task":"rabbitmq-server/########-####-####-####-a2e198c173d5 (1) (canary)","index":1,"state":"failed","progress":100,"data":{"error":"Action Failed get_task: Task ########-####-####-####-c5175a153de2 result: 1 of 3 post-start scripts failed. Failed Jobs: rabbitmq-server. Successful Jobs: ipsec, bosh-dns."}}

    {"time":1768346264,"error":{"code":450001,"message":"Action Failed get_task: Task ########-####-####-####-c5175a153de2 result: 1 of 3 post-start scripts failed. Failed Jobs: rabbitmq-server. Successful Jobs: ipsec, bosh-dns."}}

 

Now that you know the post-start script failed on instance rabbitmq-server/########-####-####-####-a2e198c173d5 (index 1), use the following steps to find the failure messaging from the post-start script:

  1. SSH into the problem instance (a non-interactive variant of steps 1 and 2 is sketched after this list):

    bosh ssh -d service-instance_########-####-####-####-624vl4p1d9v0 rabbitmq-server/########-####-####-####-a2e198c173d5

  2. Check the post-start log for details:

    cat /var/vcap/sys/log/rabbitmq-server/post-start.stdout.log

  3. Failure messaging will look like:

    2026-01-12T23:42:14Z: Wait for RabbitMQ node startup...

    Waiting for pid file '/var/vcap/sys/run/rabbitmq-server/pid' to appear

    pid is 17086

    Waiting for erlang distribution on node 'rabbit@############################ab7a' while OS process '17086' is running

    Waiting for applications 'rabbit_and_plugins' to start on node 'rabbit@f7ffda6cc33ca0a4749d6f50dd4aab7a'

    Applications 'rabbit_and_plugins' are running on node 'rabbit@############################ab7a'

    2026-01-12T23:42:39Z: Running node checks at Mon Jan 12 11:42:39 PM UTC 2026 from post-start...

    2026-01-12T23:42:41Z: Node checks running from post-start passed

    2026-01-12T23:42:41Z: Running cluster checks from post-start...

    Testing TCP connections to all active listeners on node rabbit@############################ab7a using hostname resolution ...

    Will connect to ############################ab7a:15672

    Will connect to ############################ab7a:15692

    Will connect to ############################ab7a:25672

    Will connect to ############################ab7a:5672

    Successfully connected to ports 5672, 15672, 15692, 25672 on node rabbit@############################ab7a (using node hostname resolution)

    Checking if all vhosts are running on node rabbit@############################ab7a ...

    Node rabbit@############################ab7a reported all vhosts as running

    User 'guest' exists

    2026-01-12T23:42:45Z: RabbitMQ cluster is not healthy
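
Steps 1 and 2 above can also be run non-interactively from the jumpbox. A minimal sketch, assuming the same placeholder GUIDs as above and that the stderr counterpart of the post-start log exists alongside the stdout log (standard for BOSH jobs):

    # Placeholder deployment and instance names from the earlier steps.
    DEPLOYMENT="service-instance_########-####-####-####-624vl4p1d9v0"
    INSTANCE="rabbitmq-server/########-####-####-####-a2e198c173d5"

    # Tail the post-start logs over SSH without opening an interactive shell.
    bosh -d "${DEPLOYMENT}" ssh "${INSTANCE}" \
      -c "sudo tail -n 100 /var/vcap/sys/log/rabbitmq-server/post-start.stdout.log"
    bosh -d "${DEPLOYMENT}" ssh "${INSTANCE}" \
      -c "sudo tail -n 100 /var/vcap/sys/log/rabbitmq-server/post-start.stderr.log"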

Resolution

In the above instance failure, the RabbitMQ instances had previously been recreated using 'bosh cck' commands. Because of this, a new RabbitMQ cluster was created when the instance upgrade was attempted. Corrective action required manually updating users and manually joining nodes back into the cluster; any messaging data that had not already been processed was lost as a result.
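
For reference, the manual rejoin on a RabbitMQ node follows the standard rabbitmqctl flow sketched below. This is illustrative only: the target node name is a placeholder, 'reset' wipes the local node's data, rabbitmqctl may need to be invoked by its full path on BOSH-deployed VMs, and on Tanzu deployments these steps should be performed with guidance from Broadcom support.

    # Run on the node that needs to rejoin the cluster.
    rabbitmqctl stop_app        # stop the RabbitMQ application (the Erlang VM keeps running)
    rabbitmqctl reset           # WARNING: clears this node's local data and cluster membership
    rabbitmqctl join_cluster rabbit@<primary-node-hostname>   # placeholder cluster node
    rabbitmqctl start_app       # restart the application and sync with the cluster

    # Verify cluster membership afterwards.
    rabbitmqctl cluster_status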

 

Please open a support ticket with the Broadcom support team for help correcting the cluster replication, or recreate the On-Demand cluster (see the sketch below).
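
If recreating the On-Demand instance is acceptable (all messaging data in the instance is lost), the usual path is through the cf CLI. A minimal sketch, assuming the On-Demand offering is named p.rabbitmq and that the app, plan, and service instance names below are placeholders:

    # Unbind any applications first (repeat per bound app).
    cf unbind-service <app-name> <service-instance-name>

    # Delete the broken On-Demand instance, then recreate it with the same plan.
    cf delete-service <service-instance-name>
    cf create-service p.rabbitmq <plan-name> <service-instance-name>

    # Rebind applications and restage them so they pick up the new credentials.
    cf bind-service <app-name> <service-instance-name>
    cf restage <app-name>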