GPText indexes in YELLOW - replicas are down - cannot be recovered

Article ID: 296293


Products

VMware Tanzu Greenplum

Issue/Introduction

After a power outage on a GPText 3.2.0 cluster, replicas were down and gptext-recover did not recover them.

Both indexes on the cluster had replicas down:

gptext-state --index=pivotal.dbo.products
20200903:08:57:22:066419 gptext-state:mdw1:gpadmin-[INFO]:-Execute GPText state ...
20200903:08:57:23:066419 gptext-state:mdw1:gpadmin-[INFO]:-Check zookeeper cluster state ...
20200903:08:57:23:066419 gptext-state:mdw1:gpadmin-[INFO]:-Check GPText cluster statistics...
20200903:08:57:23:066419 gptext-state:mdw1:gpadmin-[INFO]:-   Replicas Up:     29
20200903:08:57:23:066419 gptext-state:mdw1:gpadmin-[INFO]:-   Replicas Down:   3
20200903:08:57:23:066419 gptext-state:mdw1:gpadmin-[INFO]:------------------------------------------------
20200903:08:57:23:066419 gptext-state:mdw1:gpadmin-[INFO]:-Index pivotal.dbo.products following replicas are down:
20200903:08:57:23:066419 gptext-state:mdw1:gpadmin-[INFO]:------------------------------------------------
20200903:08:57:23:066419 gptext-state:mdw1:gpadmin-[INFO]:-   core                                      replica name   state   node                   is_leader   host
20200903:08:57:23:066419 gptext-state:mdw1:gpadmin-[INFO]:-   pivotal.dbo.products_shard1_replica_n147    core_node150   down    sdw1.mgmt:18984_solr   false       sdw1.mgmt
20200903:08:57:23:066419 gptext-state:mdw1:gpadmin-[INFO]:-   pivotal.dbo.products_shard5_replica_n146    core_node149   down    sdw1.mgmt:18984_solr   false       sdw1.mgmt
20200903:08:57:23:066419 gptext-state:mdw1:gpadmin-[INFO]:-   pivotal.dbo.products_shard13_replica_n145   core_node148   down    sdw1.mgmt:18984_solr   false       sdw1.mgmt
20200903:08:57:23:066419 gptext-state:mdw1:gpadmin-[INFO]:------------------------------------------------
20200903:08:57:23:066419 gptext-state:mdw1:gpadmin-[INFO]:-Index pivotal.dbo.products statistics.
20200903:08:57:23:066419 gptext-state:mdw1:gpadmin-[INFO]:------------------------------------------------
20200903:08:57:23:066419 gptext-state:mdw1:gpadmin-[INFO]:-   replication_factor   max_shards_per_node   num_docs     size in bytes   last_modified
20200903:08:57:23:066419 gptext-state:mdw1:gpadmin-[INFO]:-   2                    6                     3329653715   354552775166    2020-09-03T07:17:33.368Z
20200903:08:57:24:066419 gptext-state:mdw1:gpadmin-[INFO]:-Child partition indexes:
gptext-state --index=pivotal.dbo.product_participants
20200903:08:43:34:057868 gptext-state:mdw1:gpadmin-[INFO]:-Execute GPText state ...
20200903:08:43:34:057868 gptext-state:mdw1:gpadmin-[INFO]:-Check zookeeper cluster state ...
20200903:08:43:34:057868 gptext-state:mdw1:gpadmin-[INFO]:-Check GPText cluster statistics...
20200903:08:43:35:057868 gptext-state:mdw1:gpadmin-[INFO]:-   Replicas Up:     25
20200903:08:43:35:057868 gptext-state:mdw1:gpadmin-[INFO]:-   Replicas Down:   7
20200903:08:43:35:057868 gptext-state:mdw1:gpadmin-[INFO]:------------------------------------------------
20200903:08:43:35:057868 gptext-state:mdw1:gpadmin-[INFO]:-Index pivotal.dbo.product_participants following replicas are down:
20200903:08:43:35:057868 gptext-state:mdw1:gpadmin-[INFO]:------------------------------------------------
20200903:08:43:35:057868 gptext-state:mdw1:gpadmin-[INFO]:-   core                                                  replica name   state   node                   is_leader   host
20200903:08:43:35:057868 gptext-state:mdw1:gpadmin-[INFO]:-   pivotal.dbo.product_participants_shard0_replica_n209    core_node210   down    sdw3.mgmt:18984_solr   false       sdw3.mgmt
20200903:08:43:35:057868 gptext-state:mdw1:gpadmin-[INFO]:-   pivotal.dbo.product_participants_shard2_replica_n197    core_node198   down    sdw2.mgmt:18984_solr   false       sdw2.mgmt
20200903:08:43:35:057868 gptext-state:mdw1:gpadmin-[INFO]:-   pivotal.dbo.product_participants_shard4_replica_n203    core_node204   down    sdw2.mgmt:18984_solr   false       sdw2.mgmt
20200903:08:43:35:057868 gptext-state:mdw1:gpadmin-[INFO]:-   pivotal.dbo.product_participants_shard8_replica_n199    core_node200   down    sdw2.mgmt:18984_solr   false       sdw2.mgmt
20200903:08:43:35:057868 gptext-state:mdw1:gpadmin-[INFO]:-   pivotal.dbo.product_participants_shard10_replica_n205   core_node206   down    sdw2.mgmt:18984_solr   false       sdw2.mgmt
20200903:08:43:35:057868 gptext-state:mdw1:gpadmin-[INFO]:-   pivotal.dbo.product_participants_shard12_replica_n207   core_node208   down    sdw3.mgmt:18984_solr   false       sdw3.mgmt
20200903:08:43:35:057868 gptext-state:mdw1:gpadmin-[INFO]:-   pivotal.dbo.product_participants_shard14_replica_n201   core_node202   down    sdw2.mgmt:18984_solr   false       sdw2.mgmt
20200903:08:43:35:057868 gptext-state:mdw1:gpadmin-[INFO]:------------------------------------------------
20200903:08:43:35:057868 gptext-state:mdw1:gpadmin-[INFO]:-Index pivotal.dbo.product_participants statistics.
20200903:08:43:35:057868 gptext-state:mdw1:gpadmin-[INFO]:------------------------------------------------
20200903:08:43:35:057868 gptext-state:mdw1:gpadmin-[INFO]:-   replication_factor   max_shards_per_node   num_docs     size in bytes   last_modified
20200903:08:43:35:057868 gptext-state:mdw1:gpadmin-[INFO]:-   2                    6                     9910692348   692729753804    2020-09-03T07:17:33.903Z 


Recovery had no impact:

gptext-recover -f
20200903:08:59:04:067669 gptext-recover:mdw1:gpadmin-[INFO]:-Execute GPText cluster recover.
20200903:08:59:04:067669 gptext-recover:mdw1:gpadmin-[INFO]:-Check zookeeper cluster state ...
20200903:08:59:06:067669 gptext-recover:mdw1:gpadmin-[INFO]:-Force recover GPText instances ...
20200903:08:59:06:067669 gptext-recover:mdw1:gpadmin-[INFO]:-No need to recover. Skip.
20200903:08:59:06:067669 gptext-recover:mdw1:gpadmin-[INFO]:-Recover down replicas for indexes ...
20200903:08:59:06:067669 gptext-recover:mdw1:gpadmin-[INFO]:-Drop down replicas ...
........
20200903:08:59:11:067669 gptext-recover:mdw1:gpadmin-[INFO]:-Add new replicas ...
......
20200903:08:59:14:067669 gptext-recover:mdw1:gpadmin-[INFO]:-Please run 'gptext-state --index=<index_name>' to see status of new replicas.
20200903:08:59:14:067669 gptext-recover:mdw1:gpadmin-[INFO]:-Done.


From the log of the node containing the down replica, we can see many “failed to connect leader” errors:

pivotal.dbo.product_participants s:shard4 r:core_node150 x:pivotal.dbo.product_participants_shard4_replica_n149] o.a.s.c.RecoveryStrategy Failed to connect leader http://sdw4.mgmt:18985/solr on recovery, try again


From the log of the node containing the leader of the down replica, we can see that the leader received many “ping” requests and took more than one second to respond to each of them:

2020-08-20 09:10:57.395 INFO  (qtp434091818-19) [c:demo.public.test s:shard0 r:core_node3 x:pivotal.dbo.product_participants_shard4_replica_n1] o.a.s.c.S.Request [pivotal.dbo.product_participants_shard4_replica_n1]  webapp=/solr path=/admin/ping params={wt=javabin&version=2} status=0 QTime=1651
2020-08-20 09:10:58.917 INFO  (qtp434091818-236) [c:pivotal.dbo.product_participants s:shard0 r:core_node3 x:pivotal.dbo.product_participants_shard4_replica_n1] o.a.s.c.S.Request [pivotal.dbo.product_participants_shard4_replica_n1]  webapp=/solr path=/admin/ping params={wt=javabin&version=2} hits=721420288 status=0 QTime=1670
2020-08-20 09:10:58.917 INFO  (qtp434091818-236) [c:pivotal.dbo.product_participants s:shard4 r:core_node3 x:pivotal.dbo.product_participants_shard4_replica_n1] o.a.s.c.S.Request [pivotal.dbo.product_participants_shard4_replica_n1]  webapp=/solr path=/admin/ping params={wt=javabin&version=2} status=0 QTime=1670



Environment

Product Version: 5.7

Resolution

Root Cause

During recovery, a replica first pings its leader. If the “ping” takes more than one second, the replica treats it as a failure and retries, which is why the Solr logs contain many “Failed to connect leader” errors. Please also refer to SOLR-13532 for more information.


Why does the ping take more than a second to finish? When the leader receives the ping request, it runs a query such as “q=*:*&rows=10”. If the leader holds a very large number of documents, that query can take more than one second to complete. For more information, refer to the workaround below.
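
To confirm how long the ping query takes on the leader, the Solr ping handler can be queried directly. This is a minimal check, assuming the leader host, port, and core name shown in the logs above (substitute the values for your environment); the QTime field in the response header is the query time in milliseconds:

curl "http://sdw4.mgmt:18985/solr/pivotal.dbo.product_participants_shard4_replica_n1/admin/ping?wt=json"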


Workaround:

Add the following “/admin/ping” request handler to solrconfig.xml to change the ping behavior. By default, the ping runs the query “q=*:*&rows=10”; with this change it simply returns the status:

<requestHandler name="/admin/ping" class="solr.PingRequestHandler">
  <lst name="defaults">
    <str name="action">status</str>
  </lst>
  <str name="healthcheckFile">server-enabled.txt</str>
</requestHandler>

Note: The request handler must be placed inside the <config> </config> element of solrconfig.xml. If it is placed outside that element, it will not take effect.
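
For reference, a sketch of the placement inside solrconfig.xml (other elements omitted):

<config>
  <!-- ... existing handlers and settings ... -->
  <requestHandler name="/admin/ping" class="solr.PingRequestHandler">
    <lst name="defaults">
      <str name="action">status</str>
    </lst>
    <str name="healthcheckFile">server-enabled.txt</str>
  </requestHandler>
</config>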

This change needs to be applied to each index that has replicas down. Edit each index's solrconfig.xml with gptext-config:

gptext-config edit -i pivotal.dbo.product_participants -e vim -f solrconfig.xml

and 

gptext-config edit -i pivotal.dbo.products -e vim -f solrconfig.xml


Once the change has been applied, run gptext-recover -f to recover the cluster:

gptext-recover -f
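
After recovery completes, verify the status of the new replicas with gptext-state, for example:

gptext-state --index=pivotal.dbo.products
gptext-state --index=pivotal.dbo.product_participants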

Note: Please revert this change once the recovery completes. This is a known Solr issue and has been fixed in the latest release of GPText.