Aria Automation Kubernetes pods randomly restart



Article ID: 403070


Products

VMware Aria Suite

Issue/Introduction

  • Running the command 'vracli service status' shows one or more services in the 'Starting' state
  • Reviewing Kubernetes pods with the command 'kubectl get pod -n prelude':
    • shows some services in status 'CrashLoopBackOff'
    • shows several services restarted, each with a different restart count
  • Reviewing the Aria Automation system journal for etcd shows large delays. To review the logs, use this command: journalctl -efu etcd

    Example logs:

    Jan 1 10:00:00 vra1.example.com etcd[33#####]: read-only range request "key:\"/registry/leases/kube-system/kube-controller-manager\" " with result "range_response_count:1 size:528" took too long (118.595116ms) to execute
    Jan 1 10:00:00 vra1.example.com etcd[33#####]: read-only range request "key:\"/registry/mutatingwebhookconfigurations/\" range_end:\"/registry/mutatingwebhookconfigurations0\" count_only:true " with result "range_response_count:0 size:7" took too long (457.69161ms) to execute
    Jan 1 10:00:01 vra1.example.com etcd[33#####]: request "header:<ID:3538025413016186117 username:\"vra2.example.com\" auth_revision:1 > txn:<compare:<target:MOD key:\"/registry/pods/kube-system/state-enforcement-cron-29180760-tv868\" mod_revision:57119151 > success:<request_put:<key:\"/registry/pods/kube-system/state-enforcement-cron-29180760-tv868\" value_size:2643 >> failure:<>>" with result "size:20" took too long (104.008418ms) to execute
    Jan 1 10:00:01 vra1.example.com etcd[33#####]: read-only range request "key:\"/registry/services/endpoints/prelude/pgpool\" " with result "range_response_count:1 size:825" took too long (436.842757ms) to execute
    Jan 1 10:00:01 vra1.example.com etcd[33#####]: request "header:<ID:3538025413016186124 username:\"vra2.example.com\" auth_revision:1 > lease_grant:<ttl:3660-second id:3119975e2d226d0b>" with result "size:44" took too long (166.813048ms) to execute
    Jan 1 10:00:01 vra1.example.com etcd[33#####]: read-only range request "key:\"/registry/apiextensions.k8s.io/customresourcedefinitions/\" range_end:\"/registry/apiextensions.k8s.io/customresourcedefinitions0\" count_only:true " with result "range_response_count:0 size:9" took too long (429.722191ms) to execute
         
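As a quick way to gauge how often etcd is reporting slow requests, the "took too long" warnings above can be tallied from the journal. A minimal sketch, assuming the message format shown in the example logs; the inlined sample lines stand in for real output and would be replaced with 'journalctl -u etcd --no-pager':

```shell
# Sample etcd journal lines (illustrative; replace with: journalctl -u etcd --no-pager)
journal_sample() {
cat <<'EOF'
Jan 1 10:00:00 vra1.example.com etcd[331234]: read-only range request took too long (118.595116ms) to execute
Jan 1 10:00:00 vra1.example.com etcd[331234]: read-only range request took too long (457.69161ms) to execute
Jan 1 10:00:01 vra1.example.com etcd[331234]: request took too long (104.008418ms) to execute
EOF
}

# Count the slow-request warnings and report the worst observed delay.
journal_sample | grep -o 'took too long ([0-9.]*ms)' \
  | tr -d '()ms' \
  | awk '{sub(/took too long /,""); n++; if ($1 > max) max = $1}
         END {printf "slow requests: %d, worst delay: %.1f ms\n", n, max}'
```

For the three sample lines this prints "slow requests: 3, worst delay: 457.7 ms"; a steady stream of such warnings on a live system points at the storage or network latency issue described below.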

Environment

Aria Automation 8.x

Cause

High latency in the etcd key-value store causes Kubernetes services to become sporadically unavailable, which triggers random pod restarts.

Resolution

Review the environment and infrastructure for high latencies.

Aria Automation requires network latency <= 5 ms and storage latency <= 20 ms (see System Requirements).
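Measured latencies can be compared against these limits with a small helper. A minimal sketch; the 5 ms and 20 ms thresholds come from the requirements above, while the 'ping' invocation and the node name vra2.example.com are illustrative:

```shell
# Flag a measured latency (in ms) that exceeds a given limit.
check_latency() {  # usage: check_latency <label> <measured_ms> <limit_ms>
  awk -v l="$1" -v m="$2" -v t="$3" \
    'BEGIN { printf "%s: %.1f ms (limit %s ms) - %s\n", l, m, t, (m <= t ? "OK" : "TOO HIGH") }'
}

# Example: feed in the average round-trip time between cluster nodes
# (node name illustrative; $5 of the final ping line is the avg RTT):
# avg=$(ping -c 10 vra2.example.com | awk -F/ 'END {print $5}')
# check_latency "network" "$avg" 5

check_latency "network" 3.2 5      # prints: network: 3.2 ms (limit 5 ms) - OK
check_latency "storage" 27.4 20    # prints: storage: 27.4 ms (limit 20 ms) - TOO HIGH
```

Any "TOO HIGH" result indicates the infrastructure is outside the supported latency range and is a likely cause of the etcd delays shown above.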