GPTEXT - How to index large number of documents in batch

Article ID: 296336


Products

VMware Tanzu Greenplum

Issue/Introduction

Indexing documents in GPTEXT is a memory-consuming operation. When indexing a large number of documents, a single gptext.index() call might therefore fail by exceeding memory limits or by timing out, with errors like the following:

java.util.concurrent.TimeoutException: Idle timeout expired: 120000/120000 ms
Error from server at http://sdwXX:1898X/solr/db.schema.table1_shardXXX_replica_nXXX: Server Error

To index such large volumes of documents, use the workaround described in this article: index the documents in batches.

Environment

Product Version: 6.5

Resolution

Let's assume we have a table named t1 as shown below. The index "demo.public.t1" was created on the id and content columns.

                                    Table "public.t1"
 Column  |            Type             |                    Modifiers
---------+-----------------------------+-------------------------------------------------
 id      | integer                     | not null default nextval('t1_id_seq'::regclass)
 content | text                        |
 date    | timestamp without time zone |

SELECT * FROM gptext.create_index('public','t1', 'id', 'content');


Step 1: Get the max indexed ID of the current index:

demo=# select id::int8 from gptext.search_row_limit('demo.public.t1','*', null, 'sort=id desc&rows=1');
  id
------
 2000    << currently, the max id in the index is 2000
(1 row)

Note: If the query above returns no rows, no documents have been indexed yet; in that case assume max_id = 0.

Step 2: Index the documents within a range. The example below indexes the documents with IDs 2001 through 10000 (that is, [max_id]+1 through [max_id]+8000):

SELECT * FROM gptext.index(TABLE(SELECT * FROM t1 WHERE id BETWEEN 2000+1 AND 2000+8000), 'demo.public.t1');
SELECT * FROM gptext.commit_index('demo.public.t1');
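
The batch boundaries in Step 2 can also be computed programmatically. Below is a minimal Python sketch (not part of GPText itself; the table name, index name, and batch size of 8000 are taken from this example) that builds the Step 2 statements for a given max indexed ID:

```python
def build_batch_sql(max_id, batch_size=8000, table="t1", index="demo.public.t1"):
    """Build the gptext.index() and gptext.commit_index() statements
    for the next batch of documents after max_id."""
    start = max_id + 1
    end = max_id + batch_size
    index_sql = (
        "SELECT * FROM gptext.index(TABLE("
        f"SELECT * FROM {table} WHERE id BETWEEN {start} AND {end}), "
        f"'{index}');"
    )
    commit_sql = f"SELECT * FROM gptext.commit_index('{index}');"
    return index_sql, commit_sql

# With the max indexed id from Step 1 (2000), this yields the range 2001-10000.
idx_sql, commit_sql = build_batch_sql(2000)
```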


Step 3: Verify that the max_id has been updated and that the document count in the index has increased:

demo=# select id::int8 from gptext.search_row_limit('demo.public.t1','*', null, 'sort=id desc&rows=1');
  id
-------
 10000
(1 row)

$ gptext-state -D -i demo.public.t1
...
20200728:00:39:45:011373 gptext-state:mdw:gpadmin-[INFO]:------------------------------------------------
20200728:00:39:45:011373 gptext-state:mdw:gpadmin-[INFO]:-Index demo.public.t1 statistics.
20200728:00:39:45:011373 gptext-state:mdw:gpadmin-[INFO]:------------------------------------------------
20200728:00:39:45:011373 gptext-state:mdw:gpadmin-[INFO]:-   replication_factor   max_shards_per_node   num_docs   size in bytes   last_modified
20200728:00:39:45:011373 gptext-state:mdw:gpadmin-[INFO]:-   2                    4                     10000      3340032         2020-07-28T04:38:49.340Z
20200728:00:39:45:011373 gptext-state:mdw:gpadmin-[INFO]:-Done.

Repeat the steps above until all documents have been indexed.
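
The repeat loop can be sketched as follows. This is a hypothetical driver, not part of GPText: run_query is an assumed helper that executes SQL through your preferred Greenplum client (for example, psycopg2) and returns the result rows; total_rows is the maximum id in the source table (e.g. from SELECT max(id) FROM t1), and the batch size of 8000 and the table/index names come from the example above:

```python
def index_in_batches(run_query, total_rows, batch_size=8000,
                     table="t1", index="demo.public.t1"):
    """Repeat Steps 1 and 2 until all documents are indexed.

    run_query(sql) -- assumed helper: executes SQL, returns result rows.
    total_rows     -- the max id in the source table.
    """
    while True:
        # Step 1: get the max indexed id (0 if the index is still empty)
        rows = run_query(
            f"SELECT id::int8 FROM gptext.search_row_limit("
            f"'{index}', '*', NULL, 'sort=id desc&rows=1');")
        max_id = rows[0][0] if rows else 0
        if max_id >= total_rows:
            break  # all documents have been indexed
        # Step 2: index and commit the next batch
        run_query(
            f"SELECT * FROM gptext.index(TABLE(SELECT * FROM {table} "
            f"WHERE id BETWEEN {max_id + 1} AND {max_id + batch_size}), "
            f"'{index}');")
        run_query(f"SELECT * FROM gptext.commit_index('{index}');")
```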