GPText 3.9.1 failed replicas

Article ID: 380377

Products

VMware Tanzu Greenplum, Greenplum, Pivotal Data Suite Non Production Edition, VMware Tanzu Data Suite

Issue/Introduction

A customer reported that a number of replicas went down for no apparent reason. The replicas were marked down in the configuration, but their processes were still running on the server. The gptext-recover utility reported the following:

 

20241013:11:16:23:019730 gptext-recover:mdw1:gpadmin-[ERROR]:-The following Processes are still running but marked down by solr:
Host: sdw16, port: 18989, data dir: /data2/primary/gptext/solr3, process id: 8157
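When many replicas are affected, it can help to pull the host, port, data directory, and process ID out of these lines. The following is a minimal sketch, not part of GPText: the function name parse_marked_down is hypothetical, and it assumes the lines have exactly the "Host: ..., port: ..., data dir: ..., process id: ..." layout shown above.

```shell
# Hypothetical helper (not a GPText command).
# Reads gptext-recover output on stdin and prints one
# "host port data_dir pid" record for each line that has the
# "Host: ..., port: ..., data dir: ..., process id: ..." layout.
parse_marked_down() {
  sed -n -E 's/^Host: ([^,]+), port: ([0-9]+), data dir: ([^,]+), process id: ([0-9]+).*$/\1 \2 \3 \4/p'
}
```

For example, if the gptext-recover output has been saved to a file, "parse_marked_down < saved_output.txt" prints one record per process that is still running but marked down. The file name is a placeholder.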

Cause

The cause is a bug in Solr.

The bug is triggered when a network issue makes Solr connect to and disconnect from ZooKeeper repeatedly within a very short time. Afterwards, the affected Solr nodes may never try to connect to ZooKeeper again.

The Solr logs are full of ZooKeeper-related errors. This is one of them:

2024-10-13 11:29:10.955 ERROR (coreZkRegister-1-thread-188) [c:text.index01 /collections/db.schema.index01/leaders:shard76 r:core_node307 x:text.broadcom.index01_shard76_replica_n304] o.a.s.c.ZkController Error getting leader from zk => org.apache.solr.common.SolrException: Could not get leader props
        at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1290)
org.apache.solr.common.SolrException: Could not get leader props
        at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1290) ~[?:?]
        at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1254) ~[?:?]
        at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:1210) ~[?:?]
        at org.apache.solr.cloud.ZkController.register(ZkController.java:1094) ~[?:?]
        at org.apache.solr.cloud.ZkController$RegisterCoreAsync.call(ZkController.java:260) ~[?:?]
        at java.util.concurrent.FutureTask.run(Unknown Source) ~[?:1.8.0_301]
        at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:1.8.0_301]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:1.8.0_301]
        at java.lang.Thread.run(Unknown Source) [?:1.8.0_301]
Caused by: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /collections/db.schema.index01/leaders/shard76/leader
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:130) ~[?:?]
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) ~[?:?]
        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1221) ~[?:?]
        at org.apache.solr.common.cloud.SolrZkClient.lambda$getData$5(SolrZkClient.java:341) ~[?:?]
        at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60) ~[?:?]
        at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:341) ~[?:?]
        at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1268) ~[?:?
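To check whether a Solr log shows the same symptom, one option is to count the ZooKeeper session-expiry errors in it. This is a minimal sketch: the function name count_session_expired is hypothetical, and the location of the Solr log file is not stated in this article, so the log is read from standard input.

```shell
# Hypothetical helper (not a GPText or Solr command).
# Reads a Solr log on stdin and prints the number of lines that
# mention a ZooKeeper SessionExpiredException.
count_session_expired() {
  # -F matches the string literally, so "$" is not treated as a regex anchor.
  # grep exits non-zero when nothing matches; "|| true" keeps the function's
  # exit status at 0 in that case (the count is still printed as 0).
  grep -cF 'KeeperException$SessionExpiredException' || true
}
```

Usage: "count_session_expired < solr.log", where solr.log stands for the log file of the affected Solr instance. A large count over a short period is consistent with the behaviour described under Cause, but it does not prove that this bug is the reason.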
 
 


Resolution

This issue was identified in October 2024, and R&D is currently working on a fix. Check future release notes for issue 33609.

 

To recover from this issue, follow the procedure below. Note: replace "<path_to_solr_dir>" with the appropriate value in the commands.

  • Use "gptext-state -D" to find out which replicas are down.

  • Connect to the server that hosts the down replicas and try to start the Solr instance manually by running:
    SOLR_INCLUDE=/<path_to_solr_dir>/solr3/solr.in.sh /usr/local/greenplum-solr/bin/solr start
    For example:
    SOLR_INCLUDE=/data1/primary/gptext/solr3/solr.in.sh /usr/local/greenplum-solr/bin/solr start
    You may get an error because the process is still running.

  • Stop the instance with this command:
    SOLR_INCLUDE=/<path_to_solr_dir>/solr3/solr.in.sh /usr/local/greenplum-solr/bin/solr stop

  • Start it again with:
    SOLR_INCLUDE=/<path_to_solr_dir>/solr3/solr.in.sh /usr/local/greenplum-solr/bin/solr start

  • Run the recovery (gptext-recover) again.
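If several Solr instances on one host need this treatment, the stop and start steps above can be wrapped in a small shell function. This is a minimal sketch under stated assumptions: the function name restart_solr_instance and the SOLR_BIN override are hypothetical, and the default binary path is the one used in the commands above.

```shell
# Hypothetical helper (not a GPText command).
# Stops and then starts one Solr instance with the commands from the
# procedure above.
#   $1        Solr instance directory, i.e. the directory that contains
#             solr.in.sh (for example /data1/primary/gptext/solr3)
#   SOLR_BIN  optional override for the solr script; defaults to the
#             path used in this article
restart_solr_instance() {
  rsi_dir="$1"
  rsi_bin="${SOLR_BIN:-/usr/local/greenplum-solr/bin/solr}"
  if [ ! -f "$rsi_dir/solr.in.sh" ]; then
    echo "solr.in.sh not found in $rsi_dir" >&2
    return 1
  fi
  # The stop step may report an error if the process has already exited;
  # the start step runs either way, and its exit status is returned.
  SOLR_INCLUDE="$rsi_dir/solr.in.sh" "$rsi_bin" stop
  SOLR_INCLUDE="$rsi_dir/solr.in.sh" "$rsi_bin" start
}
```

Run it on the host that owns the instance, for example "restart_solr_instance /data1/primary/gptext/solr3", and then run the recovery from the coordinator as described in the last step.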