The customer reported that several of their replicas went down for no apparent reason. The replicas were marked down in the configuration, but their processes were still running on the server. The logs reported the following:
20241013:11:16:23:019730 gptext-recover:mdw1:gpadmin-[ERROR]:-The following Processes are still running but marked down by solr:
Host: sdw16, port: 18989, data dir: /data2/primary/gptext/solr3, process id: 8157
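When many replicas are affected, it can help to pull the host, port, data directory, and process ID out of these lines automatically. The following is a minimal Python sketch, not part of the product. It assumes the line format is exactly as shown in the excerpt above; the function name parse_marked_down is hypothetical.

```python
import re

# Pattern derived from the "Host: ..., port: ..." line shown above.
# Assumption: other gptext-recover versions may format this line differently.
DOWN_PROC = re.compile(
    r"Host:\s*(?P<host>[^,\s]+),\s*port:\s*(?P<port>\d+),\s*"
    r"data dir:\s*(?P<data_dir>[^,\s]+),\s*process id:\s*(?P<pid>\d+)"
)


def parse_marked_down(log_text):
    """Return one dict per 'Host: ..., port: ...' entry found in log_text."""
    results = []
    for match in DOWN_PROC.finditer(log_text):
        results.append({
            "host": match.group("host"),
            "port": int(match.group("port")),
            "data_dir": match.group("data_dir"),
            "pid": int(match.group("pid")),
        })
    return results


# The sample line is copied from the log excerpt above.
sample = "Host: sdw16, port: 18989, data dir: /data2/primary/gptext/solr3, process id: 8157"
for entry in parse_marked_down(sample):
    print(entry["host"], entry["port"], entry["data_dir"], entry["pid"])
```

The resulting list can be used to confirm on each host that the listed process ID is still alive before starting the recovery procedure.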
The cause was a bug in Solr.
The bug is triggered when a network issue causes Solr to connect to and disconnect from ZooKeeper repeatedly within a very short time. When this happens, the Solr nodes may never attempt to reconnect to ZooKeeper.
The Solr logs are full of ZooKeeper-related errors. This is one example:
2024-10-13 11:29:10.955 ERROR (coreZkRegister-1-thread-188) [c:text.index01 /collections/db.schema.index01/leaders:shard76 r:core_node307 x:text.broadcom.index01_shard76_replica_n304] o.a.s.c.ZkController Error getting leader from zk => org.apache.solr.common.SolrException: Could not get leader props
at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1290)
org.apache.solr.common.SolrException: Could not get leader props
at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1290) ~[?:?]
at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1254) ~[?:?]
at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:1210) ~[?:?]
at org.apache.solr.cloud.ZkController.register(ZkController.java:1094) ~[?:?]
at org.apache.solr.cloud.ZkController$RegisterCoreAsync.call(ZkController.java:260) ~[?:?]
at java.util.concurrent.FutureTask.run(Unknown Source) ~[?:1.8.0_301]
at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:1.8.0_301]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:1.8.0_301]
at java.lang.Thread.run(Unknown Source) [?:1.8.0_301]
Caused by: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /collections/db.schema.index01/leaders/shard76/leader
at org.apache.zookeeper.KeeperException.create(KeeperException.java:130) ~[?:?]
at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) ~[?:?]
at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1221) ~[?:?]
at org.apache.solr.common.cloud.SolrZkClient.lambda$getData$5(SolrZkClient.java:341) ~[?:?]
at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60) ~[?:?]
at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:341) ~[?:?]
at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1268) ~[?:?]
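To gauge how often sessions expired, and when, the Solr log can be scanned for SessionExpiredException entries grouped by minute. The following Python sketch is illustrative only. It assumes each log entry starts with a timestamp in the "YYYY-MM-DD HH:MM:SS.mmm" form seen in the trace above and that the exception name appears on a continuation line of the same entry; the function name session_expiry_per_minute is hypothetical.

```python
import re
from collections import Counter

# Assumption: new log entries begin with 'YYYY-MM-DD HH:MM:SS.mmm', as in the
# excerpt above. Stack-trace lines ('at ...', 'Caused by: ...') have no timestamp.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2}\.\d{3}\s")


def session_expiry_per_minute(lines):
    """Count log entries that mention SessionExpiredException, keyed by minute."""
    counts = Counter()
    current_minute = None   # minute of the entry currently being read
    counted = False         # ensures each entry is counted at most once
    for line in lines:
        match = TIMESTAMP.match(line)
        if match:
            current_minute = match.group(1)
            counted = False
        if "SessionExpiredException" in line and current_minute and not counted:
            counts[current_minute] += 1
            counted = True
    return counts
```

The function accepts any iterable of lines, so an open file object for the Solr log can be passed directly. A dense cluster of counts within a few minutes would be consistent with the rapid connect and disconnect pattern described above, though it does not by itself prove that this bug was hit.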
This issue was identified in October 2024, and R&D is currently working on a fix. Please check future release notes for issue 33609.
To recover from this issue, follow the procedure below.
Note: Replace "<path_to_solr_dir>" with the appropriate value in the commands below.