Aria Automation Pods fail to initialize due to vco-pod crash

Products

VCF Operations/Automation (formerly VMware Aria Suite)

Issue/Introduction

The Aria Automation UI is inaccessible and returns "500 Internal Server Error".
An upgrade of Aria Automation may fail:
- The Aria Suite Lifecycle request for the upgrade is stuck at the stage reporting: Error Code: LCMVRAVACONFIG90030 VMware Aria Automation VA Upgrade Status Check failed.
- The Aria Automation appliances have been upgraded in the course of the upgrade workflow, but the startup after the reboot is stuck.
Execution of the script /opt/scripts/deploy.sh fails, leaving pods stuck in an initializing state such as Init:0/1 or 0/2 (verify using the command kubectl get pods -n prelude).
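To spot the stuck pods quickly, the output of kubectl get pods -n prelude can be filtered. The following is a minimal sketch rather than an official tool: the helper name stuck_pods is hypothetical, and it assumes the default column layout of kubectl get pods (NAME, READY, STATUS, ...).

```shell
# Hypothetical helper: reads `kubectl get pods` output on stdin and prints
# NAME, READY and STATUS for pods that are still initializing or not fully ready.
# Assumes the default columns: NAME READY STATUS RESTARTS AGE.
stuck_pods() {
  awk 'NR > 1 {
    split($2, ready, "/")                      # READY column, e.g. "0/2"
    if ($3 ~ /^Init:/ || (ready[1] != ready[2] && $3 != "Completed"))
      print $1, $2, $3
  }'
}

# Usage on an appliance:
#   kubectl get pods -n prelude | stuck_pods
```

Pods in the Completed state are skipped because they legitimately report 0/1 in the READY column.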
The vco-server-app.log shows errors similar to the following during vco pod initialization:
INFO vco [host='vco-app-<pod-id>' thread='ApplicaitonEventHandler-1' user='' org='' trace=''] {} com.vmware.o11n.plugin.vsphere.connect.DefaultHostSessionFactory - Loading session 'HostSessionKey [hostname=<hostId>, username=<user_account> SdkSession]'.
ERROR vco [host='vco-app-<pod-id>' thread='ApplicaitonEventHandler-1' user='' org='' trace=''] {} com.vmware.o11n.plugin.vsphere.inventory.DefaultSdkManagedObjectFinder - Exception during executing finder for type 'VirtualMachine' and id '<host-id>,id:vm-<id>'.
com.vmware.vim.binding.vmodl.fault.ManagedObjectNotFound: The object 'vim.VirtualMachine:vm-<id>' has already been deleted or has not been completely created.
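To see which VM references are stale, the managed-object IDs can be pulled out of the ManagedObjectNotFound messages. This is a sketch under assumptions: the helper name stale_vm_ids is hypothetical, the log file path is passed in by the caller, and the IDs are assumed to be numeric (vm-<digits>).

```shell
# Hypothetical helper: lists the distinct VirtualMachine IDs that the log
# reports as ManagedObjectNotFound.
# $1 = path to a copy of vco-server-app.log (the location is not assumed here).
stale_vm_ids() {
  grep 'ManagedObjectNotFound' "$1" \
    | grep -o 'vim\.VirtualMachine:vm-[0-9][0-9]*' \
    | sort -u
}

# Usage:
#   stale_vm_ids /path/to/vco-server-app.log
```

The resulting IDs can then be compared against the references held by configuration elements in Orchestrator.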

Environment

Aria Automation 8.x

Aria Orchestrator 8.x

Cause

The vco pod fails to initialize completely because, as part of its initialization, it has to index stale configuration elements managed by the VC plug-in.
This task retries indexing and resolving the references held by the configuration element every 10s. Because the referenced objects are no longer available, the references can never be associated and the task goes into a loop.
This loop eventually causes the vco-service-app pod initialization to fail, and the pod ends up in a restart loop.

Resolution

Recommendations when implementing custom profiles:
- Avoid implementing them when an embedded Orchestrator is in use.
- Ensure that the configurations do not overcommit resources.
- Ensure that valid snapshots are in place before making the changes.
Recommendations to handle the consequences of stale configuration elements:
- Ideal management of configuration elements:
  - Clean up or update configuration elements whenever the objects they reference are modified on the endpoint.
  - Example:
    - If a VM that is referenced by a configuration element in Orchestrator is deleted on VC, delete or update that reference.
- Recovery in case of pod initialization failures:
  - As a temporary solution, disable the VC plug-in before the reboot or pod refresh and re-enable it after successful initialization:
    - Log in to the orchestration-ui page as an admin.
    - Navigate to `System Settings` ---> `Plug-ins`
    - Select the checkbox for the `VC` plug-in, click `enable/disable` in the top menu, and validate that the plug-in is now disabled.
    - Wait for Orchestrator to restart.
      Docs ref: Install, update, or delete a plug-in
  - If the UI is not accessible but the pods have been recreated, attempt to disable the plug-in temporarily from the command line: vracli vro command line utility or vracli capabilities vcoin --disable.
  - If the vracli command fails, reach out to Broadcom Support, as the plug-in would need to be disabled from the database.
  - Once the plug-in is disabled, rebuild the pods: /opt/scripts/deploy.sh
  - Validate that the script execution completes successfully and the UI is now accessible.
  - Navigate to the Orchestrator page to identify and delete the configuration elements with stale references.
  - Enable the VC plug-in:
    - Log in to the orchestration-ui page as an admin.
    - Navigate to `System Settings` ---> `Plug-ins`
    - Select the checkbox for the `VC` plug-in, click `enable/disable` in the top menu, and validate that the plug-in is now enabled.
    - Wait for Orchestrator to restart.
      Docs ref: Install, update, or delete a plug-in
  - Wait for the pods to restart, which takes around 10-15 minutes.
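Instead of re-running kubectl get pods -n prelude by hand while waiting, the check can be polled. The following is a rough sketch only: the function name wait_for_pods and its default values (30 attempts, 30 seconds apart, roughly the 15 minutes mentioned above) are assumptions for illustration, and it relies on the standard --no-headers flag of kubectl get.

```shell
# Hypothetical helper: polls the prelude namespace until every pod that is not
# Completed reports all of its containers ready, or gives up.
# $1 = number of attempts (assumed default 30)
# $2 = seconds between attempts (assumed default 30)
wait_for_pods() {
  attempts="${1:-30}"
  interval="${2:-30}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # Only evaluate readiness when kubectl succeeded and returned pods.
    if pods="$(kubectl get pods -n prelude --no-headers 2>/dev/null)" && [ -n "$pods" ]; then
      pending="$(printf '%s\n' "$pods" \
        | awk '{ split($2, r, "/"); if ($3 != "Completed" && r[1] != r[2]) n++ } END { print n + 0 }')"
      if [ "$pending" -eq 0 ]; then
        echo "all pods ready"
        return 0
      fi
      echo "$pending pod(s) not ready yet"
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "timed out waiting for pods" >&2
  return 1
}

# Usage:
#   wait_for_pods          # defaults: 30 attempts, 30 seconds apart
#   wait_for_pods 60 15    # 60 attempts, 15 seconds apart
```

A non-zero return code means the pods did not become ready within the allotted attempts, in which case the pod status and logs should be reviewed again.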

Article ID: 416209