When deploying a VM, or during a test run that provisions multiple VM instances, the Provisioning Pods consistently crash and restart. This behavior causes the deployment workflow to fail and prevents successful provisioning of virtual machines.
If memory is exhausted, the latest failure events may not be recorded in the provisioning logs. When they are recorded, the log contains events similar to the following:
Log Path: /services-logs/prelude/provisioning-service-app/file-logs/provisioning-service-app.log
ERROR provisioning [host='provisioning-service-app-<ID>' thread='reactor-http-epoll-8' user='' org='' trace='' parent='' span=''] c.v.a.a.gateway.ProvisioningGatewayImpl.lambda$registerADAdapterEndpoint$1:88 - [ad-integration] Registration of endpoint adapter [http://provisioning-service.prelude.svc.cluster.local:8282/provisioning/adapter/activedirectory/endpoint-config] for type [AD Integration] at [http://provisioning-service.prelude.svc.cluster.local:8282/config/photon-model-adapters-registry] failed with error finishConnect(..) failed: Connection refused: provisioning-service.prelude.svc.cluster.local/<IP Address>:8282
ERROR provisioning [host='provisioning-service-app-<ID>' thread='reactor-http-epoll-9' user='' org='' trace='' parent='' span=''] c.v.a.i.s.i.EndpointConfigAdapterServiceImpl.lambda$registerEndpoint$1:121 - Registration of endpoint adapter [http://provisioning-service.prelude.svc.cluster.local:8282/provisioning/adapter/ipam/endpoint-config] for type [IPAM Endpoint] at [http://provisioning-service.prelude.svc.cluster.local:8282/config/photon-model-adapters-registry] failed with error finishConnect(..) failed: Connection refused: provisioning-service.prelude.svc.cluster.local/<IP Address>:8282
Aria Automation 8.18.x
The diskOperationTaskState contains a large number of storage profiles, and loading all of these profiles into memory causes an out-of-memory (OOM) condition. This leads to the provisioning pod crashing and restarting repeatedly.
This workaround applies only when multiple storage profiles exist; if you do not have multiple storage profiles, it is not relevant.
Workaround:
Check for Pending or Stuck Tasks
Verify whether any old tasks are stuck in a pending state, as these can resume when a pod restarts and load stale data from the disk_operation_task_state table.
Review all tasks under Infrastructure → Requests and identify any that are incomplete.
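Before cleaning anything up, you can also summarize the unfinished entries directly in the database. The query below is a minimal sketch that uses only the request_status table and sub_stage column referenced by the cleanup statement in the next step; run it from a psql session against the provisioning database on the Postgres pod.

-- Summarize unfinished task-state rows by stage.
-- Uses the same filter as the cleanup statement in the next step.
SELECT sub_stage, COUNT(*) AS pending_rows
FROM request_status
WHERE sub_stage NOT IN ('COMPLETED', 'ERROR')
GROUP BY sub_stage;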
Clean Up Task State Tables
Remove outdated entries from the task state table to prevent old operations from being reloaded and to ensure the pod starts cleanly without reprocessing stale data.
DELETE FROM request_status WHERE sub_stage NOT IN ('COMPLETED', 'ERROR');
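After the delete, the same filter should return no rows. A quick verification, again using only the table and column above:

-- Confirm that no unfinished task-state rows remain.
SELECT COUNT(*)
FROM request_status
WHERE sub_stage NOT IN ('COMPLETED', 'ERROR');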
If you have multiple INCLUDE ALL storage profiles with the same compute (which may be empty), the same storage policy (which may be empty), or both, keep only one INCLUDE ALL profile and copy onto it the tags from the other INCLUDE ALL profiles, since those profiles are duplicates.
If you have multiple storage profiles that each contain a single datastore, consider using MANUAL storage profiles to group multiple datastores into a single storage profile, and add constraint tags accordingly.
Recommended Configuration (Best Practice):
Requirement:
Ability to allocate and target specific datastores.
Current Setup:
One storage profile per datastore using MANUAL filters, tagged as <CLUSTER_NAME><DATASTORE_NAME>.
Recommended Setup Using the 8.18 Feature (Multiple Datastores per Profile):
Tag each datastore with <DATASTORE_NAME> under Resources → Storage → Datastore.
Create one storage profile per compute cluster and tag it with the compute cluster name.
Add the required datastores to this profile (they already have the datastore tags).
In Cloud Templates, specify two constraint tags:
Compute cluster tag
Datastore tag
This configuration reduces the number of storage profiles, prevents memory overload, and avoids recurrence of the issue.
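For illustration, a Cloud Template disk definition carrying both constraint tags could look like the sketch below. The tag values (cluster:wdc-cluster-01 and datastore:ds-01) are hypothetical placeholders; substitute the compute cluster and datastore tags you created in the steps above.

# Hypothetical Cloud Template snippet: a disk constrained to a specific
# compute cluster and datastore via the two tags described above.
resources:
  Cloud_vSphere_Disk_1:
    type: Cloud.vSphere.Disk
    properties:
      capacityGb: 10
      constraints:
        - tag: 'cluster:wdc-cluster-01'   # compute cluster tag (placeholder)
        - tag: 'datastore:ds-01'          # datastore tag (placeholder)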