I have been running the 20.2 cluster since Oct 1st and have very few agents connecting to the cluster.
Also, the Postgres data is using 94% of the filesystem, so there is not enough free space to vacuum the database.
I believe this was caused by running the fake agent process to reproduce agent connection and metric counts.
[email protected]:/APM_dxi/axaservices/pg-data> du -hsx * | sort -rh | head -10
472G userdata
12K dxi
[email protected]:/APM_dxi/axaservices/pg-data> df -kh /APM_dxi/
Filesystem Size Used Avail Use% Mounted on
cdlenc1inasv44.es.oneadp.com:/APM_dxi 542G 509G 34G 94% /APM_dxi
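A usage figure like the one above can be checked from a script before a vacuum is attempted. The following is only a sketch: the helper name `use_percent` and the 90% threshold are illustrative assumptions, and it expects the single-line-per-filesystem format produced by `df -P`.

```shell
# use_percent: read `df -P` output on stdin and print the Capacity (Use%)
# value of the first filesystem row as a bare number.
# The function name is hypothetical, chosen for this example.
use_percent() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Example (not executed here): warn above an assumed 90% threshold.
#   pct=$(df -P /APM_dxi | use_percent)
#   [ "$pct" -ge 90 ] && echo "WARNING: /APM_dxi is ${pct}% full"
```

The `-P` option keeps each filesystem on one line even when the NFS source name is long, which makes the column positions predictable.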
After vacuuming the DB, the pods do not start up. All pods are down.
Release : 20.2
Component : APM Agents
This closely matches a known ZooKeeper bug:
https://issues.apache.org/jira/browse/ZOOKEEPER-2332
Check whether a zero-length TxnLog file is present in the ZooKeeper log directory (/nfs/ca/dxi/zookeeper/datalog/version-2).
If one is present, delete it.
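The check for a zero-length transaction log can be sketched as a small shell helper. This is an illustration, not a vendor-supplied tool: the function name `find_empty_txnlogs` is hypothetical, and it assumes the transaction logs follow ZooKeeper's usual `log.*` naming. It only lists candidates and deletes nothing.

```shell
# find_empty_txnlogs: print zero-length files named log.* that sit directly
# inside the directory given as $1. Snapshots and non-empty logs are ignored.
# The function name is hypothetical, chosen for this example.
find_empty_txnlogs() {
    find "$1" -maxdepth 1 -type f -name 'log.*' -size 0
}

# Example (not executed here), using the directory named in this article:
#   find_empty_txnlogs /nfs/ca/dxi/zookeeper/datalog/version-2
# Review the printed list before removing any file with rm.
```

Listing first and deleting by hand keeps the step reversible up to the point where the file name has been confirmed.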
Steps to follow:
1. Delete the zero-length log file
2. Delete the apmservices-zookeeper pod
3. Scale down the pods that are still in CrashLoopBackOff status
4. Scale the pods back up
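The steps above can be sketched with two small helpers and standard kubectl commands. This is a sketch under stated assumptions: the helper names are hypothetical, the pods are assumed to be managed by Deployments (so the Deployment name is the pod name minus its last two dash-separated suffixes), and the current kubectl namespace is assumed to be the one that holds the APM pods.

```shell
# crashloop_pods: read `kubectl get pods --no-headers` output on stdin and
# print the names of pods whose STATUS column is CrashLoopBackOff.
crashloop_pods() {
    awk '$3 == "CrashLoopBackOff" { print $1 }'
}

# deployment_of_pod: derive a Deployment name from a pod name by stripping
# the last two dash-separated suffixes (ReplicaSet hash and pod ID).
# Assumption: the pod is managed by a Deployment.
deployment_of_pod() {
    printf '%s\n' "$1" | sed 's/-[^-]*-[^-]*$//'
}

# Example flow (not executed here):
#   kubectl delete pod "$zk_pod"    # zk_pod: hypothetical variable holding the ZooKeeper pod name
#   for p in $(kubectl get pods --no-headers | crashloop_pods); do
#       d=$(deployment_of_pod "$p")
#       kubectl scale deploy "$d" --replicas=0
#       kubectl scale deploy "$d" --replicas=1   # assumes one replica was the original count
#   done
```

The parsing is kept in separate functions so it can be checked against saved `kubectl get pods` output without touching the cluster.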
Environment is now back up and running.
The pods in CrashLoopBackOff are all APM pods, mostly because they cannot connect to the gateway or ZooKeeper. The gateway pod also fails to connect to ZooKeeper.
The apmservices-zookeeper pod does not appear in the list of pods in NODE_HEALTH.log.
After checking whether this pod exists, it was started using:
kubectl scale deploy apmservices-zookeeper --replicas=1
The environment still did not come up. Running the latest whatsupdxi script returned the output below. See Resolution.
[ApmCosAgent] Agent configuration check.
[ApmCosAgent] Forking to backround
[ApmCosAgent] Agent starting
[ApmCosAgent] Agent status check (1)
agent down
[INFO] [MainThread] [ApmCOsAgent] Agent Running. PID: 7
[ApmCosAgent] Agent status check (2)
agent up (PID: 7)
[ApmCosAgent] Agent is up.
ZooKeeper JMX enabled by default
Using config: /conf/zoo.cfg
[myid:] - INFO [main:[email protected]] - Reading configuration from: /conf/zoo.cfg
[myid:] - INFO [main:[email protected]] - autopurge.snapRetainCount set to 3
[myid:] - INFO [main:[email protected]] - autopurge.purgeInterval set to 0
[myid:] - INFO [main:[email protected]] - Purge task is not scheduled.
[myid:] - WARN [main:[email protected]] - Either no config or no quorum defined in config, running in standalone mode
[myid:] - INFO [main:[email protected]] - Reading configuration from: /conf/zoo.cfg
[myid:] - INFO [main:[email protected]] - Starting server
[myid:] - INFO [main:[email protected]] - Server environment:zookeeper.version=3.4.13-2d71af4dbe22557fda74f9a9b4309b15a7487f03, built on 06/29/2018 04:05 GMT
[myid:] - INFO [main:[email protected]] - Server environment:host.name=apmservices-zookeeper-847dd6c5c6-lv9zp
[myid:] - INFO [main:[email protected]] - Server environment:java.version=11.0.8
[myid:] - INFO [main:[email protected]] - Server environment:java.vendor=AdoptOpenJDK
[myid:] - INFO [main:[email protected]] - Server environment:java.home=/opt/jdk
[myid:] - INFO [main:[email protected]] - Server environment:java.class.path=/opt/zookeeper/bin/../build/classes:/opt/zookeeper/bin/../build/lib/*.jar:/opt/zookeeper/bin/../lib/slf4j-log4j12-1.7.25.jar:/opt/zookeeper/bin/../lib/slf4j-api-1.7.25.jar:/opt/zookeeper/bin/../lib/netty-3.10.6.Final.jar:/opt/zookeeper/bin/../lib/log4j-1.2.17.jar:/opt/zookeeper/bin/../lib/jline-0.9.94.jar:/opt/zookeeper/bin/../lib/audience-annotations-0.5.0.jar:/opt/zookeeper/bin/../zookeeper-3.4.13.jar:/opt/zookeeper/bin/../src/java/lib/*.jar:/conf:
[myid:] - INFO [main:[email protected]] - Server environment:java.library.path=/usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib
[myid:] - INFO [main:[email protected]] - Server environment:java.io.tmpdir=/tmp
[myid:] - INFO [main:[email protected]] - Server environment:java.compiler=<NA>
[myid:] - INFO [main:[email protected]] - Server environment:os.name=Linux
[myid:] - INFO [main:[email protected]] - Server environment:os.arch=amd64
[myid:] - INFO [main:[email protected]] - Server environment:os.version=3.10.0-1127.8.2.el7.x86_64
[myid:] - INFO [main:[email protected]] - Server environment:user.name=default
[myid:] - INFO [main:[email protected]] - Server environment:user.home=/home/default
[myid:] - INFO [main:[email protected]] - Server environment:user.dir=/opt/zookeeper
[myid:] - INFO [main:[email protected]] - tickTime set to 2000
[myid:] - INFO [main:[email protected]] - minSessionTimeout set to -1
[myid:] - INFO [main:[email protected]] - maxSessionTimeout set to -1
[myid:] - INFO [main:[email protected]] - Using org.apache.zookeeper.server.NIOServerCnxnFactory as server connection factory
[myid:] - INFO [main:[email protected]] - binding to port 0.0.0.0/0.0.0.0:2181
[myid:] - ERROR [main:[email protected]] - Unexpected exception, exiting abnormally
java.io.EOFException
at java.base/java.io.DataInputStream.readInt(DataInputStream.java:397)
at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
at org.apache.zookeeper.server.persistence.FileHeader.deserialize(FileHeader.java:66)
at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.inStreamCreated(FileTxnLog.java:585)
at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.createInputArchive(FileTxnLog.java:604)
at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.goToNextLog(FileTxnLog.java:570)
at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:650)
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.fastForwardFromEdits(FileTxnSnapLog.java:219)
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:176)
at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:217)
at org.apache.zookeeper.server.ZooKeeperServer.loadData(ZooKeeperServer.java:284)
at org.apache.zookeeper.server.ZooKeeperServer.startdata(ZooKeeperServer.java:407)
at org.apache.zookeeper.server.NIOServerCnxnFactory.startup(NIOServerCnxnFactory.java:118)
at org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:122)
at org.apache.zookeeper.server.ZooKeeperServerMain.initializeAndRun(ZooKeeperServerMain.java:89)
at org.apache.zookeeper.server.ZooKeeperServerMain.main(ZooKeeperServerMain.java:55)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:119)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:81)
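The stack trace above has a recognizable signature: an EOFException raised while FileHeader.deserialize reads a transaction log header. A saved pod log can be screened for that signature with a helper like the one below. It is a sketch only: the function name `has_empty_txnlog_signature` is hypothetical, and a match suggests, but does not prove, that a zero-length TxnLog file is present.

```shell
# has_empty_txnlog_signature: read a ZooKeeper log on stdin and exit 0 only
# if it contains both a java.io.EOFException line and a
# persistence.FileHeader.deserialize stack frame; otherwise exit 1.
# The function name is hypothetical, chosen for this example.
has_empty_txnlog_signature() {
    awk '
        /java\.io\.EOFException/               { eof = 1 }
        /persistence\.FileHeader\.deserialize/ { hdr = 1 }
        END { exit !(eof && hdr) }
    '
}

# Example (not executed here):
#   kubectl logs "$zk_pod" | has_empty_txnlog_signature \
#       && echo "Check the datalog directory for a zero-length log file"
# zk_pod is a hypothetical variable holding the ZooKeeper pod name.
```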