Reasons for Non-balanced shard distribution with GPTEXT


Article ID: 296292


Products

VMware Tanzu Greenplum

Issue/Introduction

Errors like the following were noticed on the GPText cluster:

2020-10-10 00:00:06.140980 GMT,"data_owner","db01",p386343,th-540252384,"127.0.0.1","35268",2020-10-10 00:00:06 GMT,0,con202716865,cmd2,seg-1,,dx196245360,,sx1,"LOG","00000","An exception was encountered during the execution of statement: insert into utils.ext_wr_log_details(seq,host_name,time_stamp,application,module_name,step,sevirity,error_message,error_details,rows_loaded,rows_failed,bunch_size,bunch_date) values(0,'gptextha',public.current_timestamp_py(),'gptext_index','index_name_01','Indexing-310','F','Error file created','[349314152053743424-349314152061376834] Failed to index (gptxt id):349314152053743424-349314152061376834 Function ""gptext.index(anytable,gptext.__index_context)"": Code: RUNTIME_ERROR, Message: 'Exception writing document id (null) to the index; possible analysis error: number of documents in the index cannot exceed 2147483519. ' (UDF.cpp:215)  (seg0 slice1 192.168.10.47:1025 pid=56516), error file:-1',0,0,0,public.current_timestamp_py());",,,,,,,0,,,,


The state of the shards:

select node_name, index_name, shard_name, index_name || '-' || shard_name as index_shard,
       num_docs, size_in_bytes, size,
       (num_docs / 2147483519::float) * 100 as percentage_from_max_docs_in_shard
from gptext.index_summary()
where is_leader is true
order by 4 desc;


sdw4.mgmt:18984_solr,xx.xxx.xxx,shard9,xx.xxx.xxx-shard9,1175978896,87340931264,81.34 GB,54.7607879453067
sdw4.mgmt:18985_solr,xx.xxx.xxx,shard8,xx.xxx.xxx-shard8,2147483519,150391886370,140.06 GB,100
sdw3.mgmt:18983_solr,xx.xxx.xxx,shard7,xx.xxx.xxx-shard7,1155984319,84813187908,78.99 GB,53.8297178428777
sdw2.mgmt:18985_solr,xx.xxx.xxx,shard6,xx.xxx.xxx-shard6,2147483519,150274611177,139.95 GB,100
sdw4.mgmt:18984_solr,xx.xxx.xxx,shard5,xx.xxx.xxx-shard5,1173983717,87964908607,81.92 GB,54.6678801775708
sdw4.mgmt:18985_solr,xx.xxx.xxx,shard4,xx.xxx.xxx-shard4,2147483519,150149894851,139.84 GB,100
sdw3.mgmt:18984_solr,xx.xxx.xxx,shard3,xx.xxx.xxx-shard3,1176001911,87272857968,81.28 GB,54.7618596648238
sdw2.mgmt:18984_solr,xx.xxx.xxx,shard2,xx.xxx.xxx-shard2,2147483519,150297313520,139.98 GB,100
sdw3.mgmt:18984_solr,xx.xxx.xxx,shard15,xx.xxx.xxx-shard15,1187994348,88680601447,82.59 GB,55.3203010635073
sdw2.mgmt:18984_solr,xx.xxx.xxx,shard14,xx.xxx.xxx-shard14,2147483519,150205669698,139.89 GB,100
sdw4.mgmt:18984_solr,xx.xxx.xxx,shard13,xx.xxx.xxx-shard13,1244018137,91202091163,84.94 GB,57.9291121907791
sdw4.mgmt:18985_solr,xx.xxx.xxx,shard12,xx.xxx.xxx-shard12,2147483519,150411751771,140.08 GB,100
sdw1.mgmt:18983_solr,xx.xxx.xxx,shard11,xx.xxx.xxx-shard11,2147483519,150312067746,139.99 GB,100
sdw2.mgmt:18984_solr,xx.xxx.xxx,shard10,xx.xxx.xxx-shard10,2147483519,150353134204,140.03 GB,100
sdw4.mgmt:18984_solr,xx.xxx.xxx,shard1,xx.xxx.xxx-shard1,1167008281,86564023953,80.62 GB,54.3430611073295
sdw4.mgmt:18985_solr,xx.xxx.xxx,shard0,xx.xxx.xxx-shard0,2147483519,150369705706,140.04 GB,100


As the output shows, some shards have reached 100% of the per-shard document limit (2,147,483,519 documents, the limit reported in the error above), while others are only at roughly half of it.

 

Resolution

These are some reasons for uneven shard distribution:

1. Incorrect user operations, for example ingesting duplicate data or deleting some of the documents from the index.

2. The most likely reason: the maximum document limit caused the ingest query to fail. Some of the data had already been written to the index, but the rest was not because the query was canceled. The indexing script then retried multiple times every minute, eventually producing many errors and duplicated data. This is why some shards hold more data than others. To correct it, delete all documents for the days on which the maximum document limit was hit, then re-index those days (see the sketch after this list).
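
A minimal sketch of that cleanup, assuming a hypothetical index named db01.public.my_table whose source table has a date column load_date that identifies the affected days. The index, schema, table, and column names are placeholders for your environment, and gptext.delete() takes a Solr query, so the date field must actually be indexed in a searchable form:

-- Delete every document indexed for the affected day (hypothetical load_date field).
SELECT * FROM gptext.delete('db01.public.my_table', 'load_date:[2020-10-09T00:00:00Z TO 2020-10-10T00:00:00Z}');

-- Commit so the deletes become visible in the index.
SELECT * FROM gptext.commit_index('db01.public.my_table');

-- Re-index only the rows for that day, then commit again.
SELECT * FROM gptext.index(TABLE(SELECT * FROM public.my_table WHERE load_date = '2020-10-09'), 'db01.public.my_table');
SELECT * FROM gptext.commit_index('db01.public.my_table');

After the re-index, rerun the gptext.index_summary() query shown above to confirm that num_docs is again below the limit and roughly balanced across shards.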