The catalog service can crash when many endpoint enumeration requests run concurrently.
The catalog-service-app pod then enters a crash loop with error messages such as the following:
WARN The web application [ROOT] appears to have started a thread named [OkHttp TaskRunner] but has failed to stop it. This is very likely to create a memory leak.
ERROR catalog-service-app [...] - Error while starting the application: org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'catalogPolicyActuatorController': Invocation of init method failed;
Caused by: org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://provisioning-service:8282/provisioning/config/toggles": Read timed out; nested exception is java.net.SocketTimeoutException:
Aria Automation 8.18.0 or lower
SubnetRangeService makes a blocking call to the DB to update IP ranges from an external IPAM provider (for example, Infoblox) in Aria Automation.
IPAM endpoint enumeration invokes this call when there is a change in Infoblox, causing the provisioning service to execute the blocking code for multiple SubnetRangeStates.
When enough of these blocking invocations run at the same time, the index pool of the provisioning service is depleted and the service can no longer handle other database requests, which most APIs require. This is consistent with the "Read timed out" error above, where the catalog service fails to start because its request to provisioning-service gets no response in time.
This is fixed in 8.18.1, where these DB calls are non-blocking.
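The pool depletion described above can be illustrated with a minimal Python sketch. It is not the provisioning service's actual code: the pool size, `blocking_update`, and `other_request` are hypothetical stand-ins, and a plain thread pool is assumed to behave like the index pool for the purpose of the illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError

POOL_SIZE = 2  # hypothetical size; stands in for the provisioning service's index pool

release = threading.Event()

def blocking_update(range_id):
    # Stand-in for a blocking DB call: holds its worker thread until released.
    release.wait()
    return range_id

def other_request():
    # Stand-in for any unrelated API call that also needs a pool thread.
    return "ok"

pool = ThreadPoolExecutor(max_workers=POOL_SIZE)

# Enumeration submits one blocking update per changed range, filling the pool.
updates = [pool.submit(blocking_update, i) for i in range(POOL_SIZE)]

# The unrelated request is queued behind them and cannot start.
pending = pool.submit(other_request)
try:
    pending.result(timeout=0.2)
    starved = False
except TimeoutError:
    starved = True

# Once the blocking calls return, the queued request is served.
release.set()
result_after_release = pending.result(timeout=5)
pool.shutdown()
```

While every worker is held by a blocking update, the unrelated request times out, much like the catalog service's GET request in the log above. Non-blocking calls avoid this because they do not hold a pool thread while waiting.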
To bring the vRA system back up, configure the IPAM integration to filter out all network objects so that the problematic code is never executed.
However, the IPAM integration that causes the issue is then unusable for as long as the filter is in place.
Another approach is to compare all IP ranges in Infoblox with those in Aria Automation and identify the ranges whose start or end IP differs.
These ranges can then either be patched manually or excluded from collection with a filter.
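The comparison can be sketched in Python as follows. This assumes the ranges from both systems have already been exported into dictionaries keyed by a shared range name. The export step, the function name `find_mismatched_ranges`, and the sample data are all hypothetical and not part of either product's API.

```python
from ipaddress import ip_address

def find_mismatched_ranges(infoblox_ranges, automation_ranges):
    """Return names of ranges present on both sides whose start or end IP differs."""
    mismatched = []
    for name, (ib_start, ib_end) in infoblox_ranges.items():
        if name not in automation_ranges:
            continue  # only ranges known to both systems are compared
        au_start, au_end = automation_ranges[name]
        # Parse the addresses so that formatting differences do not cause false mismatches.
        infoblox_bounds = (ip_address(ib_start), ip_address(ib_end))
        automation_bounds = (ip_address(au_start), ip_address(au_end))
        if infoblox_bounds != automation_bounds:
            mismatched.append(name)
    return sorted(mismatched)

# Hypothetical sample data: net-b has a different end IP in Automation.
infoblox = {"net-a": ("10.0.0.10", "10.0.0.50"), "net-b": ("10.0.1.10", "10.0.1.20")}
automation = {"net-a": ("10.0.0.10", "10.0.0.50"), "net-b": ("10.0.1.10", "10.0.1.30")}
print(find_mismatched_ranges(infoblox, automation))  # ['net-b']
```

The returned names are the candidates to patch manually or to exclude with a filter.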