One of the Automation Orchestrator nodes has been deleted or is missing from the cluster
The two remaining nodes in the cluster are functional; however, running some of the internal scripts produces errors stating that one of the nodes (the missing one) is not in a ready state
Running the command kubectl get pods -n prelude shows the pods for the missing node with a READY status of 0/x
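The affected pods stand out in the READY column of the kubectl output. The following is an illustrative sketch only (pod names, container counts, and statuses vary by deployment):
kubectl get pods -n prelude
NAME                      READY   STATUS     RESTARTS   AGE
<healthy pod>             3/3     Running    0          30d
<pod on missing node>     0/3     <status>   0          30d
Adding the -o wide option to the same command also shows a NODE column, which identifies the node each pod was assigned to.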
Environment
Aria Automation Orchestrator 8.18.1
Resolution
The recovery consists of two parts; the second is optional but highly recommended in addition to the first.
The first part consists of removing traces of the missing node from the existing cluster and re-scaling the pods to only the remaining nodes. This stabilizes the cluster and addresses the earlier error messages about one of the nodes not being in a ready state.
The second part consists of redeploying the missing node and joining it back to the cluster. This will restore the environment to the cluster's original functionality.
Part 1: Remove the missing node from the cluster and rescale the pods.
From either of the two remaining nodes in the cluster, remove the missing node from the cluster configuration with the following command:
vracli cluster remove <affected node fqdn or IP>
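Optionally, confirm the removal by listing the Kubernetes nodes:
kubectl get nodes
Only the two remaining nodes should be listed, both with a Ready status.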
Then, scale down the pods to just the number of nodes left in the cluster with the following command:
vracli cluster scale-pods -f
Above, the -f option is required to account for "not ready" status errors due to one of the nodes being missing from the cluster. Under normal conditions where all nodes are accounted for, this argument is not necessary.
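Optionally, verify the result by listing the pods again:
kubectl get pods -n prelude
Once the scale-down completes, every pod should report a Running status with a full READY count (for example 3/3 rather than 0/3).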
Once the node is removed from the cluster and the pods are scaled down to just the remaining nodes, perform an inventory sync of the environment in Aria Suite Lifecycle. This will trigger an expected error:
Error Code: LCMCOMMON800014 Snapshot failed with the given product component Spec. java.lang.IllegalStateException: No virtual machine with IP <missing node IP> found for the product: vro
Click the Retry button, leaving "Skip Task" set to True, to finish the inventory sync.
At this stage the missing node's reference has been removed from the cluster, and from Aria Suite Lifecycle.
Part 2: Redeploy an Orchestrator node, join it to the cluster, and perform an inventory sync.
In Aria Suite Lifecycle, create a new environment, e.g. Orchestrator Recovery.
When selecting the product, check the box for Aria Automation Orchestrator, set Deployment Type as Standard, and set Authentication Type to vSphere (authentication will be set back to what is configured for the cluster after joining it).
For the network configuration step, make sure to add the DNS, Gateway, and Mask matching the cluster being recovered (in a separate tab, navigate to the environment of the original Automation Orchestrator cluster, expand the details for the primary node, and review the details under the Network section).
Set the Admin Group and Admin Group Domain to the administrators group you use in vCenter, or create an admin user as a vSphere.local user, add it to the administrators group, and then reference that user's details in this step.
In the product configuration step, for the VM Name follow the naming convention used for the nodes in the original cluster, and assign the same FQDN and IP address as the original node being redeployed.
Finish the install by running the pre-check (this might return a certificate validation error), then click the Next button.
On the last part of the deployment, uncheck "Run Precheck on submit" if the previous step failed on certificate verification, and click the Submit button.
If the nodes in the original cluster are at a higher patch level, patch this new node before joining it to the cluster.
Once the node is patched to the same level as the nodes in the original cluster, the new node can be joined to the cluster with the following command:
vracli cluster join <master node FQDN>
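Optionally, verify the join by listing the Kubernetes nodes again:
kubectl get nodes
All three nodes should be listed with a Ready status before continuing.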
Once the node has joined the original cluster, delete the Orchestrator Recovery environment created earlier in Part 2 from Aria Suite Lifecycle (make sure not to check the box to delete the node from vCenter).
Trigger an inventory sync of the original environment.
Finally, it is good practice to confirm the cluster is behaving as expected, and to create snapshots of all nodes in their current state after validation.
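One optional way to validate the cluster is to review the pod distribution, for example:
kubectl get pods -n prelude -o wide
Every pod should report a Running status with a full READY count, and the NODE column would typically reference all three cluster nodes once services are redeployed.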
Additional Information
The original nodes in the cluster may still have a reference to the missing/deleted node's fingerprint in their known_hosts file. This may trigger an error message during the SSH access verification steps of future upgrades or patches.
Review the steps in KB 410138 to help resolve the error message.
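For reference, on appliances with standard OpenSSH tooling, a stale fingerprint can generally be cleared from a node's known_hosts file with:
ssh-keygen -R <missing node FQDN or IP>
Refer to the KB above for the product-specific steps.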