When deploying a VM, or during a test run that provisions multiple VM instances, the Provisioning Pods consistently crash and restart. This behavior causes the deployment workflow to fail and prevents successful provisioning of virtual machines.
If memory is exhausted, the latest failure events may not be recorded in the provisioning logs. When they are recorded, the log contains events similar to the following:
Log Path: /services-logs/prelude/provisioning-service-app/file-logs/provisioning-service-app.log
ERROR provisioning [host='provisioning-service-app-<ID>' thread='reactor-http-epoll-8' user='' org='' trace='' parent='' span=''] c.v.a.a.gateway.ProvisioningGatewayImpl.lambda$registerADAdapterEndpoint$1:88 - [ad-integration] Registration of endpoint adapter [http://provisioning-service.prelude.svc.cluster.local:8282/provisioning/adapter/activedirectory/endpoint-config] for type [AD Integration] at [http://provisioning-service.prelude.svc.cluster.local:8282/config/photon-model-adapters-registry] failed with error finishConnect(..) failed: Connection refused: provisioning-service.prelude.svc.cluster.local/<IP Address>:8282
ERROR provisioning [host='provisioning-service-app-<ID>' thread='reactor-http-epoll-9' user='' org='' trace='' parent='' span=''] c.v.a.i.s.i.EndpointConfigAdapterServiceImpl.lambda$registerEndpoint$1:121 - Registration of endpoint adapter [http://provisioning-service.prelude.svc.cluster.local:8282/provisioning/adapter/ipam/endpoint-config] for type [IPAM Endpoint] at [http://provisioning-service.prelude.svc.cluster.local:8282/config/photon-model-adapters-registry] failed with error finishConnect(..) failed: Connection refused: provisioning-service.prelude.svc.cluster.local/<IP Address>:8282
Aria Automation 8.18.x
The diskOperationTaskState contains a large number of storage profiles, and loading all of these profiles into memory causes an out-of-memory (OOM) condition. This leads to the provisioning pod crashing and restarting repeatedly.
This workaround applies only when multiple storage profiles exist; if you do not have multiple storage profiles, it is not relevant.
Workaround:
Check for Pending or Stuck Tasks
Verify whether any old tasks are stuck in a pending state, as these can resume when a pod restarts and load stale data from the disk_operation_task_state table.
Review all tasks under Infrastructure → Requests and identify any that are incomplete.
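Before cleaning anything up, you can also summarize the unfinished entries directly in the database. The query below is a minimal sketch that uses only the request_status table and sub_stage column referenced by the cleanup statement in the next step; run it from a psql session against the provisioning database on the Postgres pod.

-- Summarize unfinished task-state rows by stage.
-- Uses the same filter as the cleanup statement in the next step.
SELECT sub_stage, COUNT(*) AS pending_rows
FROM request_status
WHERE sub_stage NOT IN ('COMPLETED', 'ERROR')
GROUP BY sub_stage;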
Clean Up Task State Tables
Remove outdated entries from the task state table to prevent old operations from being reloaded and to ensure the pod starts cleanly without reprocessing stale data.
DELETE FROM request_status WHERE sub_stage NOT IN ('COMPLETED', 'ERROR');
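After the delete, the same filter should return no rows. A quick verification, again using only the table and column above:

-- Confirm that no unfinished task-state rows remain.
SELECT COUNT(*)
FROM request_status
WHERE sub_stage NOT IN ('COMPLETED', 'ERROR');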
If you have multiple INCLUDE ALL storage profiles with the same compute (which may be empty), the same storage policy (which may be empty), or both, keep only one INCLUDE ALL profile and copy onto it the tags from the other INCLUDE ALL profiles, since those profiles are duplicates.
If you have multiple storage profiles that each contain a single datastore, consider using MANUAL storage profiles to group multiple datastores into a single storage profile, and add constraint tags accordingly.
Recommended Configuration (Best Practice):
Requirement:
Ability to allocate and target specific datastores.
Current Setup:
One storage profile per datastore using MANUAL filters, tagged as <CLUSTER_NAME><DATASTORE_NAME>.
Recommended Setup Using the 8.18 Feature (Multiple Datastores per Profile):
Tag each datastore with <DATASTORE_NAME> under Resources → Storage → Datastore.
Create one storage profile per compute cluster and tag it with the compute cluster name.
Add the required datastores to this profile (they already have the datastore tags).
In Cloud Templates, specify two constraint tags:
Compute cluster tag
Datastore tag
This configuration reduces the number of storage profiles, prevents memory overload, and avoids recurrence of the issue.
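For illustration, a Cloud Template disk definition carrying both constraint tags could look like the sketch below. The tag values (cluster:wdc-cluster-01 and datastore:ds-01) are hypothetical placeholders; substitute the compute cluster and datastore tags you created in the steps above.

# Hypothetical Cloud Template snippet: a disk constrained to a specific
# compute cluster and datastore via the two tags described above.
resources:
  Cloud_vSphere_Disk_1:
    type: Cloud.vSphere.Disk
    properties:
      capacityGb: 10
      constraints:
        - tag: 'cluster:wdc-cluster-01'   # compute cluster tag (placeholder)
        - tag: 'datastore:ds-01'          # datastore tag (placeholder)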