After upgrading to Aria Automation 8.18.0, new deployments and day 2 actions may become stuck and stop progressing after some time has passed.

Products

VMware Aria Suite

Issue/Introduction

Symptoms:

Aria Automation periodically stops processing requests (for example once a week, or after several hours or days).
When running deployments or day 2 actions, they can get stuck at any stage, including at 0 tasks complete. For example:
- (Delete - Initialization 0 / 5 Tasks)
- (Resize Server - In Progress 0 / 2 Tasks)
The TileExecution stats in the tango-blueprint log show that the number of running tiles equals maxConcurrent:
- INFO tango-blueprint [host='tango-blueprint-service-app-xxx' thread='generalScheduler-4' user='' org='' blueprint='' project='' deployment='' request='' flow='' task='' tile='' resourceName='' operation='' trace=''] com.vmware.tango.blueprint.telemetry.LogMetricsConfig - TileExecution stats: running=150 created=3 maxConcurrent=150 batchSize=5 rescheduled=2 currentPendingBlockingRequests=0 totalBlockingRequests=14 totalPendingBlockingRequests=0 currentNextExecutions=0
The tango-blueprint log also shows this format error:
- java.lang.IllegalArgumentException: Missing: traceId spanId

The log file referred to above is: /services-logs/prelude/tango-blueprint-service-app/file-logs/tango-blueprint-service-app.log
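The saturation symptom described above (running equals maxConcurrent) can be checked mechanically. The following is a minimal sketch in Python, assuming the log line format shown in the sample above; the function names parse_tile_stats and is_saturated are hypothetical helpers and are not part of Aria Automation.

```python
import re

# Matches the integer counters (name=value) that follow the stats marker.
_STAT = re.compile(r"(\w+)=(\d+)")
_MARKER = "TileExecution stats:"


def parse_tile_stats(line):
    """Return the TileExecution counters as a dict, or None if the line has none."""
    idx = line.find(_MARKER)
    if idx == -1:
        return None
    # Only look at the text after the marker, so fields such as host='...' are ignored.
    return {key: int(value) for key, value in _STAT.findall(line[idx + len(_MARKER):])}


def is_saturated(stats):
    """True when every execution slot is occupied (running has reached maxConcurrent)."""
    if not stats or "running" not in stats or "maxConcurrent" not in stats:
        return False
    return stats["running"] >= stats["maxConcurrent"]


# Shortened copy of the sample log line from this article.
sample = (
    "INFO tango-blueprint com.vmware.tango.blueprint.telemetry.LogMetricsConfig - "
    "TileExecution stats: running=150 created=3 maxConcurrent=150 batchSize=5 "
    "rescheduled=2 currentPendingBlockingRequests=0 totalBlockingRequests=14 "
    "totalPendingBlockingRequests=0 currentNextExecutions=0"
)

if __name__ == "__main__":
    print(is_saturated(parse_tile_stats(sample)))
```

A single saturated line is not proof of the issue on its own; the symptom is that running stays at maxConcurrent while requests do not progress.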

Environment

VMware Aria Automation systems which have been upgraded to 8.18.0

Cause

This can be caused by deployments that were in progress at the time of the upgrade, including those waiting for approval.

These incomplete pre-upgrade jobs are repeatedly added to the execution queue until the concurrent maximum is reached and nothing can progress.

As such, if the upgrade can be rolled back to the pre-upgrade snapshot, the issue may be avoided by ensuring that no actions or deployments are running while upgrading to 8.18.0.

Resolution

This issue is resolved in patch 1 for Aria Automation 8.18.1.

Patch instructions are available in KB 385294.

Workaround

Temporary relief can be obtained by restarting all tango-blueprint pods. Scale the deployment down to zero:

kubectl scale deployment -n prelude tango-blueprint-service-app --replicas=0

Then scale it back up. For a 3-node cluster:
kubectl scale deployment -n prelude tango-blueprint-service-app --replicas=3
or for a single-node system:
kubectl scale deployment -n prelude tango-blueprint-service-app --replicas=1

When the pods have terminated and been recreated, all hung deployments and day 2 actions should progress.

Manual Solution

The following procedure makes direct edits in the Aria Automation database. Please take careful note of step 1 and contact VMware Support if assistance is needed.

1. Before performing any updates, take snapshots and backups of the Automation VM(s).
2. Take a DB dump of tango-blueprint-db with this command:
- vracli db dump tango-blueprint-db > tango-blueprint-db.sql
3. Stop the tango-blueprint-service-app pods before performing the upcoming DB edits:
- kubectl scale deployment -n prelude tango-blueprint-service-app --replicas=0

Wait for the pods to be deleted. You can check their progress with this command:
- watch 'kubectl -n prelude get pods | grep tango-blueprint'
4. Log in to the vRA database:
- vracli dev psql
5. Change to the tango DB:
- \c tango-blueprint-db
6. Find the tiles/tasks/flows which are stuck and have the incorrect trace-context. Only those which are not in a terminal state are selected and updated.
Save the output of each SELECT query (for example in a text editor) so that you have a record of which DB rows you are updating. Note the number of rows returned; it should match the count reported by the corresponding UPDATE query.
7. Three queries need to be run, each wrapped in a transaction. Only COMMIT when you are sure that the UPDATE count for the tiles/tasks/flows is the same as the row count returned by the corresponding SELECT query.
For example: (7 rows) in the SELECT results, and UPDATE 7 returned from the update transaction.

7a. -- Tiles to be updated:

SELECT ID AS TILE_ID, ENV['DEPLOYMENT_ID'] AS DEPLOYMENT_ID, ENV['DEPLOYMENT_NAME'] AS NAME
FROM BP_TILE_EXECUTION
WHERE ENV['TRACE_CONTEXT']::text LIKE '%uber-trace-id%' AND STATUS IN ('SCHEDULED', 'IN_PROGRESS', 'WAITING', 'NOT_STARTED');

--> Update the tiles:

BEGIN;

UPDATE BP_TILE_EXECUTION
SET ENV['TRACE_CONTEXT'] = '"{\"traceId\":\"66ed579ea44c9710d4983c630c023780\",\"spanId\":\"c789ad6d4d5113b1\",\"trace\":\"66ed579ea44c9710d4983c630c023780\",\"traceparent\":\"00-66ed579ea44c9710d4983c630c023780-c789ad6d4d5113b1-00\"}"'
WHERE ID IN
(SELECT ID
FROM BP_TILE_EXECUTION
WHERE ENV['TRACE_CONTEXT']::text LIKE '%uber-trace-id%' AND STATUS IN ('SCHEDULED', 'IN_PROGRESS', 'WAITING', 'NOT_STARTED'));

If the row count in the SELECT results and the count in the "UPDATE _" result do not agree, roll back the transaction. This undoes the UPDATE query, back to the keyword BEGIN. To do so, run: ROLLBACK;

Otherwise, if the figures do agree, then run the following command to commit the update transaction:

COMMIT;
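The commit-or-rollback rule used in this and the following steps can be sketched as a small check. This is an illustrative Python sketch only, assuming you have pasted the psql output of the SELECT and of the UPDATE into two strings; it relies on the standard psql row footer ("(7 rows)") and command tag ("UPDATE 7"). The function names are hypothetical, and the sketch does not connect to the database.

```python
import re


def select_row_count(psql_output):
    """Read the row footer that psql prints after a SELECT, e.g. '(7 rows)' or '(1 row)'."""
    match = re.search(r"\((\d+) rows?\)", psql_output)
    return int(match.group(1)) if match else None


def update_row_count(psql_output):
    """Read the command tag that psql prints after an UPDATE, e.g. 'UPDATE 7'."""
    match = re.search(r"^UPDATE (\d+)\s*$", psql_output, re.MULTILINE)
    return int(match.group(1)) if match else None


def next_statement(select_output, update_output):
    """Return 'COMMIT;' only when both counts were found and agree, else 'ROLLBACK;'."""
    selected = select_row_count(select_output)
    updated = update_row_count(update_output)
    if selected is not None and selected == updated:
        return "COMMIT;"
    return "ROLLBACK;"


if __name__ == "__main__":
    print(next_statement(" tile_id | deployment_id | name\n(7 rows)", "BEGIN\nUPDATE 7"))
```

The sketch deliberately defaults to ROLLBACK; whenever a count is missing or the two counts differ, which mirrors the cautious approach of this procedure.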

7b. -- Tasks to be updated

SELECT ID AS TASK_ID , ENV['DEPLOYMENT_ID'] AS DEPLOYMENT_ID, ENV['DEPLOYMENT_NAME'] AS NAME
FROM BP_TASK_EXECUTION
WHERE FLOW_EXECUTION_ID IN (SELECT ID
FROM BP_FLOW_EXECUTION
WHERE ENV['TRACE_CONTEXT']::text LIKE '%uber-trace-id%'
AND STATUS IN ('SCHEDULED','WAITING','IN_PROGRESS'))
AND STATUS IN ('NOT_STARTED','SCHEDULED','IN_PROGRESS','WAITING');

--> Update the tasks

BEGIN;

UPDATE BP_TASK_EXECUTION
SET ENV['TRACE_CONTEXT'] = '"{\"traceId\":\"66ed579ea44c9710d4983c630c023780\",\"spanId\":\"c789ad6d4d5113b1\",\"trace\":\"66ed579ea44c9710d4983c630c023780\",\"traceparent\":\"00-66ed579ea44c9710d4983c630c023780-c789ad6d4d5113b1-00\"}"'
WHERE FLOW_EXECUTION_ID IN
(SELECT ID
FROM BP_FLOW_EXECUTION
WHERE ENV['TRACE_CONTEXT']::text LIKE '%uber-trace-id%'
AND STATUS IN ('SCHEDULED','WAITING','IN_PROGRESS'))
AND STATUS IN ('NOT_STARTED','SCHEDULED','IN_PROGRESS','WAITING');

If the row count in the SELECT results and the count in the "UPDATE _" result do not agree, roll back the transaction. This undoes the UPDATE query, back to the keyword BEGIN. To do so, run: ROLLBACK;

Otherwise, if the figures do agree, run the following command to commit the update transaction:

COMMIT;

7c. -- Flows to be updated

SELECT ID AS FLOW_ID, ENV['DEPLOYMENT_ID'] AS DEPLOYMENT_ID, ENV['DEPLOYMENT_NAME'] AS NAME
FROM BP_FLOW_EXECUTION
WHERE ENV['TRACE_CONTEXT']::text LIKE '%uber-trace-id%' AND STATUS IN ('SCHEDULED', 'WAITING', 'IN_PROGRESS');

--> Update the flows

BEGIN;

UPDATE BP_FLOW_EXECUTION
SET ENV['TRACE_CONTEXT'] = '"{\"traceId\":\"66ed579ea44c9710d4983c630c023780\",\"spanId\":\"c789ad6d4d5113b1\",\"trace\":\"66ed579ea44c9710d4983c630c023780\",\"traceparent\":\"00-66ed579ea44c9710d4983c630c023780-c789ad6d4d5113b1-00\"}"'
WHERE ID IN
(SELECT ID
FROM BP_FLOW_EXECUTION
WHERE ENV['TRACE_CONTEXT']::text LIKE '%uber-trace-id%' AND STATUS IN ('SCHEDULED', 'WAITING', 'IN_PROGRESS'));

If the row count in the SELECT results and the count in the "UPDATE _" result do not agree, roll back the transaction. This undoes the UPDATE query, back to the keyword BEGIN. To do so, run: ROLLBACK;

Otherwise, if the figures do agree, run the following command to commit the update transaction:

COMMIT;
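The TRACE_CONTEXT value used in the three UPDATE statements above is a JSON string whose content is itself a JSON object, which is why every inner quote is escaped. The Python sketch below only illustrates that structure. It assumes the traceparent field follows the W3C Trace Context layout (version, trace ID, span ID, flags), which matches the value shown in this article. The function name trace_context_literal is hypothetical. When following this procedure, use the SQL exactly as given rather than generated values.

```python
import json


def trace_context_literal(trace_id, span_id):
    """Build the single-quoted SQL literal assigned to ENV['TRACE_CONTEXT']."""
    context = {
        "traceId": trace_id,
        "spanId": span_id,
        "trace": trace_id,
        # Assumed layout: version "00", trace ID, span ID, flags "00".
        "traceparent": f"00-{trace_id}-{span_id}-00",
    }
    # First encoding: the object as compact JSON text.
    inner = json.dumps(context, separators=(",", ":"))
    # Second encoding: that text as a JSON string, which escapes the inner quotes.
    return "'" + json.dumps(inner) + "'"


if __name__ == "__main__":
    # The IDs below are the fixed values used in this article's UPDATE statements.
    print(trace_context_literal("66ed579ea44c9710d4983c630c023780", "c789ad6d4d5113b1"))
```

Decoding the value twice (first as a JSON string, then as a JSON object) recovers the traceId and spanId fields that the "Missing: traceId spanId" error reports as absent.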

8. Exit the database, with \q or Ctrl+D

9. Restart the tango-blueprint-service-app pod.

For a 3-node cluster:

kubectl scale deployment -n prelude tango-blueprint-service-app --replicas=3

For a single-node system:

kubectl scale deployment -n prelude tango-blueprint-service-app --replicas=1

Additional Information

The permanent code fix, which ensures that upgraded systems are not affected by in-progress or waiting deployments, is delivered in patch 1 for Aria Automation 8.18.1 (see Resolution).