After restoring an Aria Automation appliance from backup and rejoining it to a high-availability cluster, you experience intermittent deployment failures. Workflows that use certain Orchestrator plugins, such as the SSH plugin, may also fail intermittently.
You may see errors in the deployment history similar to the following:
Extensibility triggered task failed. Event ID: <EVENT-UUID>. Failure: Extensibility error received for topic compute.removal.pre, eventId = '<EVENT-UUID>': [10030] ... No reply from blocking subscription Compute-Removal ...
Orchestrator workflow logs for SSH-based tasks may show the following error:
Error in (Workflow:Run SSH command Linux / Execute SSH Command ... ) Unable to execute command: InternalError: Identity file not found !
VMware Aria Automation 8.x
VMware Aria Automation Orchestrator 8.x
Removing a node from an Aria Automation cluster and rejoining it removes custom configuration files stored within the Orchestrator / Automation service directories on that node. Because these files are unique to a customer's environment and are not part of the standard appliance configuration, they are not automatically synchronized from the existing cluster members to the newly joined node.
Common examples of these files include:
vco_key: The default private/public key pair generated by the SSH plugin's "Generate SSH key" workflow.
krb5.conf: The configuration file for Kerberos authentication, often used with the PowerShell plugin.
Because Aria Automation distributes workflow executions across all nodes in the cluster, a task fails only when it is routed to the rejoined node that is missing the required configuration file. This is why the failures are intermittent.
To resolve this issue, you must manually copy the missing configuration files from a healthy node in the cluster to the newly rejoined node.
Use SSH to log into a healthy Aria Automation appliance node (one that was not rejoined).
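For example, assuming root SSH access is enabled on the appliance (replace <healthy-node-fqdn-or-ip> with the FQDN or IP address of the healthy node):
ssh root@<healthy-node-fqdn-or-ip>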
Identify the necessary configuration files. The most common files are located in /data/vco/usr/lib/vco/app-server/conf/.
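For example, you can list this directory on the healthy node and compare it against the same directory on the rejoined node to see which custom files are missing (illustrative command; adjust the path if your plugin stores its configuration elsewhere):
ls -l /data/vco/usr/lib/vco/app-server/conf/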
Use a secure copy utility, such as scp, to transfer the files to the rejoined node. Replace <rejoined-node-fqdn-or-ip> with the actual FQDN or IP address of the node you are fixing.
Example for the SSH Plugin key:
scp /data/vco/usr/lib/vco/app-server/conf/vco_key root@<rejoined-node-fqdn-or-ip>:/data/vco/usr/lib/vco/app-server/conf/vco_key
Example for the Kerberos configuration file:
scp /data/vco/usr/lib/vco/app-server/conf/krb5.conf root@<rejoined-node-fqdn-or-ip>:/data/vco/usr/lib/vco/app-server/conf/krb5.conf
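After copying, you may also want to confirm that each file arrived intact and that its ownership and permissions match the healthy node. For example, run the following on both nodes and compare the output (illustrative commands using vco_key; repeat for any other copied file):
md5sum /data/vco/usr/lib/vco/app-server/conf/vco_key
ls -l /data/vco/usr/lib/vco/app-server/conf/vco_key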
Once the files have been copied, the issue is resolved; no service restarts are required. The rejoined node can now successfully process workflows that depend on these files.
This procedure applies to any custom file required by an Orchestrator plugin that is stored locally on the appliance and is not synchronized through the cluster database. Always verify whether custom configurations were in place on a node before it was removed from the cluster.
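One way to check for node-local customizations is to compare the Orchestrator configuration directory on the rejoined node against the same directory on a healthy node, for example (illustrative commands, assuming SSH access from the rejoined node to the healthy node):
ssh root@<healthy-node-fqdn-or-ip> 'ls /data/vco/usr/lib/vco/app-server/conf/' > /tmp/healthy-conf.txt
ls /data/vco/usr/lib/vco/app-server/conf/ > /tmp/rejoined-conf.txt
diff /tmp/healthy-conf.txt /tmp/rejoined-conf.txt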