After a power outage, replicas were down on GPText 3.2.0, and gptext-recover did not recover them.
Both indexes on the cluster had replicas down:
gptext-state --index=pivotal.dbo.products
20200903:08:57:22:066419 gptext-state:mdw1:gpadmin-[INFO]:-Execute GPText state ...
20200903:08:57:23:066419 gptext-state:mdw1:gpadmin-[INFO]:-Check zookeeper cluster state ...
20200903:08:57:23:066419 gptext-state:mdw1:gpadmin-[INFO]:-Check GPText cluster statistics...
20200903:08:57:23:066419 gptext-state:mdw1:gpadmin-[INFO]:-  Replicas Up: 29
20200903:08:57:23:066419 gptext-state:mdw1:gpadmin-[INFO]:-  Replicas Down: 3
20200903:08:57:23:066419 gptext-state:mdw1:gpadmin-[INFO]:------------------------------------------------
20200903:08:57:23:066419 gptext-state:mdw1:gpadmin-[INFO]:-Index pivotal.dbo.products following replicas are down:
20200903:08:57:23:066419 gptext-state:mdw1:gpadmin-[INFO]:------------------------------------------------
20200903:08:57:23:066419 gptext-state:mdw1:gpadmin-[INFO]:-  core  replica name  state  node  is_leader  host
20200903:08:57:23:066419 gptext-state:mdw1:gpadmin-[INFO]:-  pivotal.dbo.products_shard1_replica_n147  core_node150  down  sdw1.mgmt:18984_solr  false  sdw1.mgmt
20200903:08:57:23:066419 gptext-state:mdw1:gpadmin-[INFO]:-  pivotal.dbo.products_shard5_replica_n146  core_node149  down  sdw1.mgmt:18984_solr  false  sdw1.mgmt
20200903:08:57:23:066419 gptext-state:mdw1:gpadmin-[INFO]:-  pivotal.dbo.products_shard13_replica_n145  core_node148  down  sdw1.mgmt:18984_solr  false  sdw1.mgmt
20200903:08:57:23:066419 gptext-state:mdw1:gpadmin-[INFO]:------------------------------------------------
20200903:08:57:23:066419 gptext-state:mdw1:gpadmin-[INFO]:-Index pivotal.dbo.products statistics.
20200903:08:57:23:066419 gptext-state:mdw1:gpadmin-[INFO]:------------------------------------------------
20200903:08:57:23:066419 gptext-state:mdw1:gpadmin-[INFO]:-  replication_factor  max_shards_per_node  num_docs  size in bytes  last_modified
20200903:08:57:23:066419 gptext-state:mdw1:gpadmin-[INFO]:-  2  6  3329653715  354552775166  2020-09-03T07:17:33.368Z
20200903:08:57:24:066419 gptext-state:mdw1:gpadmin-[INFO]:-Child partition indexes:
gptext-state --index=pivotal.dbo.product_participants
20200903:08:43:34:057868 gptext-state:mdw1:gpadmin-[INFO]:-Execute GPText state ...
20200903:08:43:34:057868 gptext-state:mdw1:gpadmin-[INFO]:-Check zookeeper cluster state ...
20200903:08:43:34:057868 gptext-state:mdw1:gpadmin-[INFO]:-Check GPText cluster statistics...
20200903:08:43:35:057868 gptext-state:mdw1:gpadmin-[INFO]:-  Replicas Up: 25
20200903:08:43:35:057868 gptext-state:mdw1:gpadmin-[INFO]:-  Replicas Down: 7
20200903:08:43:35:057868 gptext-state:mdw1:gpadmin-[INFO]:------------------------------------------------
20200903:08:43:35:057868 gptext-state:mdw1:gpadmin-[INFO]:-Index pivotal.dbo.product_participants following replicas are down:
20200903:08:43:35:057868 gptext-state:mdw1:gpadmin-[INFO]:------------------------------------------------
20200903:08:43:35:057868 gptext-state:mdw1:gpadmin-[INFO]:-  core  replica name  state  node  is_leader  host
20200903:08:43:35:057868 gptext-state:mdw1:gpadmin-[INFO]:-  pivotal.dbo.product_participants_shard0_replica_n209  core_node210  down  sdw3.mgmt:18984_solr  false  sdw3.mgmt
20200903:08:43:35:057868 gptext-state:mdw1:gpadmin-[INFO]:-  pivotal.dbo.product_participants_shard2_replica_n197  core_node198  down  sdw2.mgmt:18984_solr  false  sdw2.mgmt
20200903:08:43:35:057868 gptext-state:mdw1:gpadmin-[INFO]:-  pivotal.dbo.product_participants_shard4_replica_n203  core_node204  down  sdw2.mgmt:18984_solr  false  sdw2.mgmt
20200903:08:43:35:057868 gptext-state:mdw1:gpadmin-[INFO]:-  pivotal.dbo.product_participants_shard8_replica_n199  core_node200  down  sdw2.mgmt:18984_solr  false  sdw2.mgmt
20200903:08:43:35:057868 gptext-state:mdw1:gpadmin-[INFO]:-  pivotal.dbo.product_participants_shard10_replica_n205  core_node206  down  sdw2.mgmt:18984_solr  false  sdw2.mgmt
20200903:08:43:35:057868 gptext-state:mdw1:gpadmin-[INFO]:-  pivotal.dbo.product_participants_shard12_replica_n207  core_node208  down  sdw3.mgmt:18984_solr  false  sdw3.mgmt
20200903:08:43:35:057868 gptext-state:mdw1:gpadmin-[INFO]:-  pivotal.dbo.product_participants_shard14_replica_n201  core_node202  down  sdw2.mgmt:18984_solr  false  sdw2.mgmt
20200903:08:43:35:057868 gptext-state:mdw1:gpadmin-[INFO]:------------------------------------------------
20200903:08:43:35:057868 gptext-state:mdw1:gpadmin-[INFO]:-Index pivotal.dbo.product_participants statistics.
20200903:08:43:35:057868 gptext-state:mdw1:gpadmin-[INFO]:------------------------------------------------
20200903:08:43:35:057868 gptext-state:mdw1:gpadmin-[INFO]:-  replication_factor  max_shards_per_node  num_docs  size in bytes  last_modified
20200903:08:43:35:057868 gptext-state:mdw1:gpadmin-[INFO]:-  2  6  9910692348  692729753804  2020-09-03T07:17:33.903Z
Recovery had no impact:
gptext-recover -f
20200903:08:59:04:067669 gptext-recover:mdw1:gpadmin-[INFO]:-Execute GPText cluster recover.
20200903:08:59:04:067669 gptext-recover:mdw1:gpadmin-[INFO]:-Check zookeeper cluster state ...
20200903:08:59:06:067669 gptext-recover:mdw1:gpadmin-[INFO]:-Force recover GPText instances ...
20200903:08:59:06:067669 gptext-recover:mdw1:gpadmin-[INFO]:-No need to recover. Skip.
20200903:08:59:06:067669 gptext-recover:mdw1:gpadmin-[INFO]:-Recover down replicas for indexes ...
20200903:08:59:06:067669 gptext-recover:mdw1:gpadmin-[INFO]:-Drop down replicas ...
........
20200903:08:59:11:067669 gptext-recover:mdw1:gpadmin-[INFO]:-Add new replicas ...
......
20200903:08:59:14:067669 gptext-recover:mdw1:gpadmin-[INFO]:-Please run 'gptext-state --index=<index_name>' to see status of new replicas.
20200903:08:59:14:067669 gptext-recover:mdw1:gpadmin-[INFO]:-Done.
The log of the node hosting the down replica contains many "Failed to connect leader" errors:
pivotal.dbo.product_participants s:shard4 r:core_node150 x:pivotal.dbo.product_participants_shard4_replica_n149] o.a.s.c.RecoveryStrategy Failed to connect leader http://sdw4.mgmt:18985/solr on recovery, try again
The log of the node hosting the leader of the down replica shows that the leader received many ping requests and took more than one second to respond to each (QTime is reported in milliseconds, so values above 1000 exceed one second):
2020-08-20 09:10:57.395 INFO (qtp434091818-19) [c:demo.public.test s:shard0 r:core_node3 x:pivotal.dbo.product_participants_shard4_replica_n1] o.a.s.c.S.Request [pivotal.dbo.product_participants_shard4_replica_n1] webapp=/solr path=/admin/ping params={wt=javabin&version=2} status=0 QTime=1651
2020-08-20 09:10:58.917 INFO (qtp434091818-236) [c:pivotal.dbo.product_participants s:shard0 r:core_node3 x:pivotal.dbo.product_participants_shard4_replica_n1] o.a.s.c.S.Request [pivotal.dbo.product_participants_shard4_replica_n1] webapp=/solr path=/admin/ping params={wt=javabin&version=2} hits=721420288 status=0 QTime=1670
2020-08-20 09:10:58.917 INFO (qtp434091818-236) [c:pivotal.dbo.product_participants s:shard4 r:core_node3 x:pivotal.dbo.product_participants_shard4_replica_n1] o.a.s.c.S.Request [pivotal.dbo.product_participants_shard4_replica_n1] webapp=/solr path=/admin/ping params={wt=javabin&version=2} status=0 QTime=1670
Product Version: 5.7
During recovery, a replica first pings its leader. If the ping takes longer than one second, the replica treats it as a failure and retries. This is why the Solr logs contain many "Failed to connect leader" errors. See SOLR-13532 for more information.
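As a rough illustration, the behavior can be sketched in shell (this is not Solr's actual recovery code, which is implemented in the Java class RecoveryStrategy seen in the log above; the host, port, and core name are reused from the log excerpts and should be adjusted for your cluster):

# Simplified sketch of the recovery ping loop: treat any ping that takes
# longer than one second as a failure and retry. Host, port, and core name
# below are taken from the logs above, not a fixed GPText convention.
while ! timeout 1 curl -sf "http://sdw4.mgmt:18985/solr/pivotal.dbo.product_participants_shard4_replica_n1/admin/ping" > /dev/null
do
    echo "Failed to connect leader http://sdw4.mgmt:18985/solr on recovery, try again"
    sleep 2
done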
Why does the ping take more than a second to finish? When the leader receives a ping request, it runs a query such as q=*:*&rows=10. With billions of documents on the leader (as in the index statistics above), that query can take more than one second to complete. The workaround below changes the ping so that it no longer runs this query.
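You can observe the slow ping directly by timing the ping handler on the leader core. The URL below reuses the host, port, and core name from the log excerpts above; substitute the leader core for your own index:

# Time a single ping against the leader core; on a leader holding billions
# of documents, the underlying q=*:*&rows=10 query can push QTime past 1000 ms.
time curl -s "http://sdw4.mgmt:18985/solr/pivotal.dbo.product_participants_shard4_replica_n1/admin/ping?wt=json"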
Add the following /admin/ping request handler to solrconfig.xml to change the ping behavior. By default, the ping runs the query q=*:*&rows=10; with the configuration below it simply returns a status instead:
<requestHandler name="/admin/ping" class="solr.PingRequestHandler">
  <lst name="defaults">
    <str name="action">status</str>
  </lst>
  <str name="healthcheckFile">server-enabled.txt</str>
</requestHandler>
Note: The request handler must be placed inside the <config> </config> element of solrconfig.xml. If it is placed outside that element, it will not take effect.
This change must be applied to each index that has replicas down; here, both affected indexes can be edited with the following commands:
gptext-config edit -i pivotal.dbo.products -e vim -f solrconfig.xml
gptext-config edit -i pivotal.dbo.product_participants -e vim -f solrconfig.xml
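Once the edited solrconfig.xml is in place, the same timed curl from above should return almost immediately, because the ping now only reports a status instead of executing the query (again, the host, port, and core name are taken from the log excerpts and should be adjusted):

# Re-run the timed ping; QTime should now be negligible.
time curl -s "http://sdw4.mgmt:18985/solr/pivotal.dbo.product_participants_shard4_replica_n1/admin/ping?wt=json"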
Once the above change has been applied, run gptext-recover -f to recover the cluster:
gptext-recover -f
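Then, as the recover output suggests, confirm that the previously down replicas are back up for each affected index:

gptext-state --index=pivotal.dbo.products
gptext-state --index=pivotal.dbo.product_participants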
Note: Please revert this change once the recovery completes. This is a known Solr issue (SOLR-13532) and has been fixed in the latest release of GPText.
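To revert, open each solrconfig.xml again with the same commands and remove the /admin/ping request handler that was added:

gptext-config edit -i pivotal.dbo.products -e vim -f solrconfig.xml
gptext-config edit -i pivotal.dbo.product_participants -e vim -f solrconfig.xml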