Multiple Pods Restart Intermittently in Aria Automation



Article ID: 407859


Updated On:

Products

VCF Operations/Automation (formerly VMware Aria Suite)

Issue/Introduction

 

  • When you run the command kubectl get pods -n prelude, you notice that multiple pods in Aria Automation restart intermittently (see the pod-status sketch after this list).

  • When you check the logs of the affected pods, you see the message "Possible too long JVM pause: ### milliseconds" (see the log-search sketch after this list).

    1. Below is a sample log excerpt from one of the key pods, /var/log/services-logs/prelude/tango-blueprint-service-app/file-logs/tango-blueprint-service-app.log:


    ####-##-####:##:##.#### INFO tango-blueprint host='tango-blueprint-service-app-<service_id>' thread='tcp-disco-srvr-[:47500]-#3%embedded%-#24%embedded%' user='' org='' blueprint='' project='' deployment='' request='' flow='' task='' tile='' resourceName='' operation='' trace='' org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi - TCP discovery accepted incoming connection [rmtAddr=/<ip_address>, rmtPort=53653]
    ####-##-####:##:##.#### WARN tango-blueprint [host='tango-blueprint-service-app-<service_id>' thread='jvm-pause-detector-worker' user='' org='' blueprint='' project='' deployment='' request='' flow='' task='' tile='' resourceName='' operation='' trace=''] org.apache.ignite.internal.IgniteKernal%embedded - Possible too long JVM pause: 607 milliseconds.
    ####-##-####:##:##.#### INFO tango-blueprint host='tango-blueprint-service-app-<service_id>' thread='tcp-disco-srvr-[:47500]-#3%embedded%-#24%embedded%' user='' org='' blueprint='' project='' deployment='' request='' flow='' task='' tile='' resourceName='' operation='' trace='' org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi - TCP discovery spawning a new thread for connection [rmtAddr=/<ip_address>, rmtPort=53653]
    ####-##-####:##:##.#### WARN tango-blueprint [host='tango-blueprint-service-app-<service_id>' thread='Notification listener' user='' org='' blueprint='' project='' deployment='' request='' flow='' task='' tile='' resourceName='' operation='' trace=''] com.####.####.####.ProxyConnection - ####Pool-1 - Connection org.postgresql.jdbc.PgConnection@#### marked as broken because of SQLSTATE(08006), ErrorCode(0)
    org.postgresql.util.PSQLException: An I/O error occurred while sending to the backend.

    2. Below is a sample log excerpt from another key pod,
    /var/log/services-logs/prelude/catalog-service-app/file-logs/catalog-service-app.log:


    ####-##-####:##:##.#### WARN catalog-service-app [host='catalog-service-app-<service_id>' thread='jvm-pause-detector-worker' user='' org='' trace=''] o.a.i.internal.IgniteKernal%embedded - Possible too long JVM pause: 638 milliseconds.
    ####-##-####:##:##.#### WARN catalog-service-app [host='catalog-service-app-<service_id>' thread='scheduling-3' user='' org='' trace='###############################'] c.v.s.c.c.r.i.SlowRequestInterceptor - Slow API call GET 'http://<POD_Name>:4242/event-broker/api/runnable/types/catalog-service.runnable/poll/100' with response 200 OK took 1321 ms.
    ####-##-####:##:##.#### WARN catalog-service-app [host='catalog-service-app-<service_id>' thread='jvm-pause-detector-worker' user='' org='' trace=''] o.a.i.internal.IgniteKernal%embedded - Possible too long JVM pause: 1042 milliseconds.


  • When you establish an SSH connection to the primary Aria Automation node and ping the other Aria Automation nodes from it, you may observe network latency exceeding the maximum supported latency of 5 ms between cluster nodes (see the latency-check sketch after this list). For more information, refer to the system requirements.

    root@<AriaAutomationNode01_FQDN> [ ~ ]# ping -c 4 <AriaAutomationNode02_FQDN>

    PING <AriaAutomationNode02_FQDN> (##.##.##.##) 56(84) bytes of data.
    64 bytes from ##.##.##.##: icmp_seq=1 ttl=110 time=15 ms
    64 bytes from ##.##.##.##: icmp_seq=2 ttl=110 time=51 ms
    64 bytes from ##.##.##.##: icmp_seq=3 ttl=110 time=54 ms
    64 bytes from ##.##.##.##: icmp_seq=4 ttl=110 time=39 ms

    --- <AriaAutomationNode02_FQDN> ping statistics ---
    4 packets transmitted, 4 received, 0% packet loss, time 3004ms
    rtt min/avg/max/mdev = 15.0/39.8/54.0/15.4 ms
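
To confirm the pod-restart symptom, list the pods with their restart counts; pods that restart intermittently show a non-zero, climbing RESTARTS value. The pod names and numbers below are illustrative only.

    kubectl get pods -n prelude

    NAME                                       READY   STATUS    RESTARTS   AGE
    tango-blueprint-service-app-<service_id>   1/1     Running   7          2d
    catalog-service-app-<service_id>           1/1     Running   5          2d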
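
To locate the JVM pause warnings across all services at once, you can search the shared log directory on an appliance. This is a minimal sketch that assumes the log layout shown in the samples above.

    # List every JVM pause warning recorded by the services
    grep -i "Possible too long JVM pause" /var/log/services-logs/prelude/*/file-logs/*.log

    # Count occurrences per log file to see which services are most affected
    grep -ci "Possible too long JVM pause" /var/log/services-logs/prelude/*/file-logs/*.log | grep -v ':0$'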
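
To measure inter-node latency from the primary node, ping each of the other nodes and compare the reported average against the 5 ms maximum. A minimal sketch; the node FQDNs are placeholders for your environment.

    # Average round-trip time to every other cluster node should stay below 5 ms
    for node in <AriaAutomationNode02_FQDN> <AriaAutomationNode03_FQDN>; do
        echo "--- $node ---"
        ping -c 10 "$node" | tail -n 2
    done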

Environment

Aria Automation 8.x

Cause

The issue occurs when the Aria Automation appliances have insufficient compute resources (CPU or memory), which causes long JVM pauses (high system stun times) on the nodes. These pauses lead to intermittent pod restarts and degraded service availability.
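
To confirm resource pressure on an appliance, standard Linux tools are sufficient; the checks below are generic sketches, not an Aria Automation-specific diagnostic.

    nproc                    # vCPUs visible to the appliance
    free -h                  # total, used, and available memory
    top -b -n 1 | head -15   # load average and the busiest processes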

Resolution

To resolve the issue, address the compute resource (CPU and memory) constraints in the vSphere cluster that hosts the Aria Automation appliances. Ensuring the cluster has adequate compute resources prevents pod restarts and keeps Aria Automation services running stably.
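
After increasing CPU and memory headroom (for example, by resizing the appliances or relieving contention on the hosts), you can verify that the environment has stabilized. A short sketch reusing the commands from the Issue/Introduction section:

    # RESTARTS counts should stop climbing once resources are adequate
    kubectl get pods -n prelude

    # Inter-node latency should be back within the 5 ms requirement
    ping -c 10 <AriaAutomationNode02_FQDN>

    # No new JVM pause warnings should appear in the service logs
    grep -i "Possible too long JVM pause" /var/log/services-logs/prelude/*/file-logs/*.log | tail -5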