Aria Automation Kubernetes pods randomly restart



Article ID: 403070


Products

VMware Aria Suite

Issue/Introduction

  • Running the command 'vracli service status' shows one or more services in the 'Starting' state
  • Reviewing Kubernetes pods with the command 'kubectl get pod -n prelude':
    • shows some services in status 'CrashLoopBackOff'
    • shows several services restarted, each with a different restart count
  • Reviewing the Aria Automation system journal for etcd shows large delays. To review the logs, use this command: journalctl -efu etcd

    Example logs:

    Jan 1 10:00:00 vra1.example.com etcd[33#####]: read-only range request "key:\"/registry/leases/kube-system/kube-controller-manager\" " with result "range_response_count:1 size:528" took too long (118.595116ms) to execute
    Jan 1 10:00:00 vra1.example.com etcd[33#####]: read-only range request "key:\"/registry/mutatingwebhookconfigurations/\" range_end:\"/registry/mutatingwebhookconfigurations0\" count_only:true " with result "range_response_count:0 size:7" took too long (457.69161ms) to execute
    Jan 1 10:00:01 vra1.example.com etcd[33#####]: request "header:<ID:3538025413016186117 username:\"vra2.example.com\" auth_revision:1 > txn:<compare:<target:MOD key:\"/registry/pods/kube-system/state-enforcement-cron-29180760-tv868\" mod_revision:57119151 > success:<request_put:<key:\"/registry/pods/kube-system/state-enforcement-cron-29180760-tv868\" value_size:2643 >> failure:<>>" with result "size:20" took too long (104.008418ms) to execute
    Jan 1 10:00:01 vra1.example.com etcd[33#####]: read-only range request "key:\"/registry/services/endpoints/prelude/pgpool\" " with result "range_response_count:1 size:825" took too long (436.842757ms) to execute
    Jan 1 10:00:01 vra1.example.com etcd[33#####]: request "header:<ID:3538025413016186124 username:\"vra2.example.com\" auth_revision:1 > lease_grant:<ttl:3660-second id:3119975e2d226d0b>" with result "size:44" took too long (166.813048ms) to execute
    Jan 1 10:00:01 vra1.example.com etcd[33#####]: read-only range request "key:\"/registry/apiextensions.k8s.io/customresourcedefinitions/\" range_end:\"/registry/apiextensions.k8s.io/customresourcedefinitions0\" count_only:true " with result "range_response_count:0 size:9" took too long (429.722191ms) to execute
         
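As a quick way to gauge how often etcd is reporting slow requests, the "took too long" warnings above can be tallied from the journal. A minimal sketch, assuming the message format shown in the example logs; the inlined sample lines stand in for real output and would be replaced with 'journalctl -u etcd --no-pager':

```shell
# Sample etcd journal lines (illustrative; replace with: journalctl -u etcd --no-pager)
journal_sample() {
cat <<'EOF'
Jan 1 10:00:00 vra1.example.com etcd[331234]: read-only range request took too long (118.595116ms) to execute
Jan 1 10:00:00 vra1.example.com etcd[331234]: read-only range request took too long (457.69161ms) to execute
Jan 1 10:00:01 vra1.example.com etcd[331234]: request took too long (104.008418ms) to execute
EOF
}

# Count the slow-request warnings and report the worst observed delay.
journal_sample | grep -o 'took too long ([0-9.]*ms)' \
  | tr -d '()ms' \
  | awk '{sub(/took too long /,""); n++; if ($1 > max) max = $1}
         END {printf "slow requests: %d, worst delay: %.1f ms\n", n, max}'
```

For the three sample lines this prints "slow requests: 3, worst delay: 457.7 ms"; a steady stream of such warnings on a live system points at the storage or network latency issue described below.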

Environment

Aria Automation 8.x

Cause

High latency in the etcd key-value store causes Kubernetes services to become sporadically unavailable, which triggers random pod restarts.

Resolution

Review the environment and infrastructure for high latencies.

Aria Automation requires network latency <= 5 ms and storage latency <= 20 ms (see System Requirements).
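Measured latencies can be compared against these limits with a small helper. A minimal sketch; the 5 ms and 20 ms thresholds come from the requirements above, while the 'ping' invocation and the node name vra2.example.com are illustrative:

```shell
# Flag a measured latency (in ms) that exceeds a given limit.
check_latency() {  # usage: check_latency <label> <measured_ms> <limit_ms>
  awk -v l="$1" -v m="$2" -v t="$3" \
    'BEGIN { printf "%s: %.1f ms (limit %s ms) - %s\n", l, m, t, (m <= t ? "OK" : "TOO HIGH") }'
}

# Example: feed in the average round-trip time between cluster nodes
# (node name illustrative; $5 of the final ping line is the avg RTT):
# avg=$(ping -c 10 vra2.example.com | awk -F/ 'END {print $5}')
# check_latency "network" "$avg" 5

check_latency "network" 3.2 5      # prints: network: 3.2 ms (limit 5 ms) - OK
check_latency "storage" 27.4 20    # prints: storage: 27.4 ms (limit 20 ms) - TOO HIGH
```

Any "TOO HIGH" result indicates the infrastructure is outside the supported latency range and is a likely cause of the etcd delays shown above.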