Multiple POD restarts in Aria Automation 8.x

Article ID: 407100

Products

VCF Operations/Automation (formerly VMware Aria Suite)

Issue/Introduction

Multiple pod restarts are observed at random on the Aria Automation nodes.
Major pods such as tango-blueprint and catalog-service may log errors similar to "Possible too long JVM pause: ### milliseconds."

tango-blueprint-service-app.log :
####-##-####:##:##.#### INFO tango-blueprint host='tango-blueprint-service-app-<service_id>' thread='tcp-disco-srvr-[:47500]-#3%embedded%-#24%embedded%' user='' org='' blueprint='' project='' deployment='' request='' flow='' task='' tile='' resourceName='' operation='' trace='' org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi - TCP discovery accepted incoming connection [rmtAddr=/<ip_address>, rmtPort=53653]
####-##-####:##:##.#### WARN tango-blueprint [host='tango-blueprint-service-app-<service_id>' thread='jvm-pause-detector-worker' user='' org='' blueprint='' project='' deployment='' request='' flow='' task='' tile='' resourceName='' operation='' trace=''] org.apache.ignite.internal.IgniteKernal%embedded - Possible too long JVM pause: 26925 milliseconds.
####-##-####:##:##.#### INFO tango-blueprint host='tango-blueprint-service-app-<service_id>' thread='tcp-disco-srvr-[:47500]-#3%embedded%-#24%embedded%' user='' org='' blueprint='' project='' deployment='' request='' flow='' task='' tile='' resourceName='' operation='' trace='' org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi - TCP discovery spawning a new thread for connection [rmtAddr=/<ip_address>, rmtPort=53653]
####-##-####:##:##.#### WARN tango-blueprint [host='tango-blueprint-service-app-<service_id>' thread='Notification listener' user='' org='' blueprint='' project='' deployment='' request='' flow='' task='' tile='' resourceName='' operation='' trace=''] com.####.####.####.ProxyConnection - ####Pool-1 - Connection org.postgresql.jdbc.PgConnection@331ad6eb marked as broken because of SQLSTATE(08006), ErrorCode(0)
org.postgresql.util.PSQLException: An I/O error occurred while sending to the backend.


catalog-service-app.log :

####-##-####:##:##.#### INFO cgs-service [host='cgs-service-app-54cf65d954-4nxnx' thread='Thread-83' user='' org='' trace='']  com.vmware.symphony.shutdown.TaskExecutorGracefulShutdown - Waiting 26925ms to shutdown down "taskScheduler"(CustomizableThreadPoolTaskScheduler)
####-##-####:##:##.#### INFO cgs-service [host='cgs-service-app-54cf65d954-4nxnx' thread='Thread-83' user='' org='' trace='']  com.vmware.symphony.shutdown.TaskExecutorGracefulShutdown - "taskScheduler"(CustomizableThreadPoolTaskScheduler) has been shutdown successfully
####-##-####:##:##.#### INFO cgs-service [host='cgs-service-app-54cf65d954-4nxnx' thread='Thread-83' user='' org='' trace='']  com.vmware.symphony.shutdown.TaskExecutorGracefulShutdown - Waiting 26925ms to shutdown down "mbeanInitializer"(ThreadPoolTaskExecutor)
####-##-####:##:##.#### INFO cgs-service [host='cgs-service-app-54cf65d954-4nxnx' thread='Thread-83' user='' org='' trace='']  com.vmware.symphony.shutdown.TaskExecutorGracefulShutdown - "mbeanInitializer"(ThreadPoolTaskExecutor) has been shutdown successfully
####-##-####:##:##.#### INFO cgs-service [host='cgs-service-app-54cf65d954-4nxnx' thread='Thread-81' user='' org='' trace='']  org.springframework.boot.web.embedded.tomcat.GracefulShutdown - Commencing graceful shutdown. Waiting for active requests to complete
####-##-####:##:##.#### INFO cgs-service [host='cgs-service-app-54cf65d954-4nxnx' thread='tomcat-shutdown' user='' org='' trace='']  org.apache.coyote.http11.Http11NioProtocol - Pausing ProtocolHandler ["http-nio-8090"]
####-##-####:##:##.#### INFO cgs-service [host='cgs-service-app-54cf65d954-4nxnx' thread='tomcat-shutdown' user='' org='' trace='']  org.springframework.boot.web.embedded.tomcat.GracefulShutdown - Graceful shutdown complete.
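
To confirm the symptom, check the pod restart counts and search the service logs for the JVM pause warning. The commands below are a minimal sketch run from an Aria Automation appliance node; the prelude namespace is standard for Aria Automation 8.x, but the /services-logs paths are assumptions and may differ in your environment.

# List the service pods and their restart counts
kubectl -n prelude get pods -o wide

# Search the service logs for the long JVM pause warning (paths are assumed; adjust as needed)
grep -r "Possible too long JVM pause" /services-logs/prelude/tango-blueprint-service/
grep -r "Possible too long JVM pause" /services-logs/prelude/catalog-service/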

Environment

Aria Automation 8.x

Cause

  • High stun times observed on the Aria Automation nodes during snapshot/backup operations cause the vRA pods to restart.
  • Reviewing the virtual machine logs (vmware.log) of the Aria Automation nodes reveals errors similar to the following; a command sketch for locating these entries follows the excerpt.


vmware.log : 
####-##-####:##:##.#### In(05) vcpu-0 - CPT: vm was stunned for 26438450 us
####-##-####:##:##.#### No(00) vcpu-0 - CheckpointTiming unstun: VMX took 25705748 us
####-##-####:##:##.#### No(00) vcpu-0 - CheckpointTiming unstun: ALL took 25706303 us
####-##-####:##:##.#### No(00) vcpu-0 - CheckpointTiming total: ALL took 26438392 us
####-##-####:##:##.#### In(05) vcpu-0 - ConsolidateItemCombine: Failed to open disk '/vmfs/volumes/vsan:################-##############/#########-#####-#####-####-#########/<vm_name>_1-000002.vmdk' for consolidate: Failed to lock the file (5)
####-##-####:##:##.#### In(05) vcpu-0 - Destroying virtual dev for scsi0:0 vscsi=#############
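
As a rough sketch, the stun and unstun timings can be pulled out of vmware.log in the virtual machine's directory on the datastore (for example over SSH to the ESXi host); the datastore and VM names below are placeholders.

grep -iE "stunned|CheckpointTiming" /vmfs/volumes/<datastore>/<vm_name>/vmware.log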

Resolution

During snapshot operations it is normal for a VM to enter a stun state, but the stun duration should be brief. If the stun period is longer, it halts the vCPUs of the Aria Automation node, and the internal pods are restarted by the Kubernetes self-healing mechanism (liveness probes fail while the services are frozen).
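
To verify that the restarts line up with the snapshot/backup window, a minimal sketch using standard kubectl commands on an appliance node (the pod name is a placeholder):

# Sort pods by start time to see which ones were recently recreated
kubectl -n prelude get pods --sort-by=.status.startTime

# Inspect the last termination state of an affected pod
kubectl -n prelude describe pod <tango-blueprint-service-app-pod> | grep -A 5 "Last State"

If the pod start times match the backup schedule, the stun incurred during the snapshot operation is the likely trigger.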