DOI - OnPrem - apmservices-zookeeper - CrashLoopBackOff

Article ID: 233132

Products

DX Operational Intelligence

Issue/Introduction

The 'apmservices-zookeeper' pod fails to start and stays in CrashLoopBackOff.

Environment

Release : 21.3

Component : DOI / APM 

Env : RHEL 7.9 

Server dedicated to DOI only; no other applications

Cause

Engineering Analysis: 

apmservices-zookeeper log:

2159 [myid:] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@383] - Exception causing close of session 0x0: ZooKeeperServer not running
 
When zookeeper is down, most datastore and APM services cannot come up because they depend on it.
 
Hypothesis:
 
The NFS share is out of space: NFS usage is at 100%.
 
Per Engineering, NFS usage above 70% is not recommended.
 
The NFS share was extended by 100 GB, but disk usage was still at 95% and the same issue persisted.
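The usage check described above can be sketched as a small shell script. This is only an illustration: the NFS_DIR path is a hypothetical placeholder (the article does not give the actual mount point), and the 70% threshold is taken from the Engineering guidance above.

```shell
#!/bin/sh
# Hypothetical mount point; replace with the NFS path used by your DOI install.
NFS_DIR="${NFS_DIR:-/nfs/doi}"
THRESHOLD=70   # percent, per the Engineering guidance above

# Print the used-space percentage (digits only) of the filesystem holding $1.
usage_pct() {
    df -P "$1" | awk 'NR==2 { gsub("%", "", $5); print $5 }'
}

# Return 0 when the given percentage is above the threshold, 1 otherwise.
over_threshold() {
    [ "$1" -gt "$THRESHOLD" ]
}

# Only report when the (assumed) directory actually exists on this host.
if [ -d "$NFS_DIR" ]; then
    pct=$(usage_pct "$NFS_DIR")
    if over_threshold "$pct"; then
        echo "WARNING: $NFS_DIR is ${pct}% full (threshold ${THRESHOLD}%)"
    else
        echo "OK: $NFS_DIR is ${pct}% full"
    fi
fi
```

Running it with NFS_DIR set to the real mount point prints one status line; at 95% usage, as in this case, it would print the warning.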
 
Engineering analyzed further:
 
Check all the 'zookeeper' logs.
 
Run the command below and send the results:
  •     Go to the NFS directory.
  •     Go to the zookeeper folder.
  •     Run the following command:
         ls -laRt
After the NFS extension, all apmservices pods were still crashing because the zookeeper pod had failed again.
It failed because it tried to use the same transaction log, which has a corrupted header, and hit the same exception as before.
This file must be deleted before zookeeper can start.
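A minimal sketch of locating the transaction logs and taking the suspect file out of the way is shown below. Assumptions, none of which come from the case notes: the ZK_DIR path is hypothetical, the transaction logs are assumed to follow the standard ZooKeeper naming (log.<zxid>, normally under a version-2 directory), and the helper names are made up for illustration. The sketch moves the file to a backup directory instead of deleting it, a more cautious variant of the deletion described above.

```shell
#!/bin/sh
# Hypothetical path: the zookeeper folder on the NFS share.
ZK_DIR="${ZK_DIR:-/nfs/doi/zookeeper}"

# List transaction log files (log.<zxid>) under directory $1, newest first.
list_txn_logs() {
    find "$1" -type f -name 'log.*' -exec ls -lt {} +
}

# Move file $1 into backup directory $2 instead of deleting it outright,
# so it can still be sent to Support if requested.
quarantine() {
    file=$1
    backup_dir=$2
    mkdir -p "$backup_dir" && mv -- "$file" "$backup_dir/"
}

# Only list when the (assumed) directory exists; nothing is moved automatically.
if [ -d "$ZK_DIR" ]; then
    list_txn_logs "$ZK_DIR"
fi
```

Once the corrupted file has been identified from the listing and the zookeeper exception, it could be moved with, for example, quarantine "<path to corrupted log file>" "$ZK_DIR/corrupt-backup". Confirm with Support which file is affected before removing anything.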
 
 

Resolution

The root cause was:

1. The NFS store was 100% full, which caused zookeeper to write corrupted transaction logs.

2. After the NFS store was extended, zookeeper tried to read the same corrupted transaction log, so the pod could not start.

After the corrupted transaction logs were deleted, everything started normally and the console is up and running.

 

Resolution:

Delete the corrupted transaction log.