Indexing documents in GPTEXT is a memory consuming operation, therefore when indexing a large number of documents, the gptext.index() might fail due to exceeding the limitation of memory or due to timeout, with errors like the one below:
java.util.concurrent.TimeoutException: Idle timeout expired: 120000/120000 ms
Error from server at http://sdwXX:1898X/solr/db.schema.table1_shardXXX_replica_nXXX: Server Error
To be able to index these amounts of documents, we may use the workaround mentioned in this article, i.e., indexing the documents in batch.
Product Version: 6.5
Let's assume we have a table named t1 like below. The index "demo.public.t1" was create based on id and content.
Table "public.t1"
Column | Type | Modifiers
---------+-----------------------------+-------------------------------------------------
id | integer | not null default nextval('t1_id_seq'::regclass)
content | text |
date | timestamp without time zone |
SELECT * FROM gptext.create_index('public','t1', 'id', 'content');
Step 1: Get the max indexed ID of the current index:
demo=# select id::int8 from gptext.search_row_limit('demo.public.t1','*', null, 'sort=id desc&rows=1');
id
------
2000 << currently, the max id in the index is 2000
(1 row)
Note: If the above query did not return anything, that means no document has been indexed yet. We can assume the max_id = 0;
Step 2: Index the documents within a range. The example below shows how to index the document with ID 2001 to 10000 (E.g: [max_id]+1 to [max_id]+8000):
SELECT * FROM gptext.index(TABLE(select * from t1 where id between (select 2000+1) and (select 2000+8000)), 'demo.public.t1');
SELECT * FROM gptext.commit_index('demo.public.t1');
Step 3: Verify the max_id has been updated and the number of document in index has been added:
demo=# select id::int8 from gptext.search_row_limit('demo.public.t1','*', null, 'sort=id desc&rows=1');
id
-------
10000
(1 row)
$ gptext-state -D -i demo.public.t1
...
20200728:00:39:45:011373 gptext-state:mdw:gpadmin-[INFO]:------------------------------------------------
20200728:00:39:45:011373 gptext-state:mdw:gpadmin-[INFO]:-Index demo.public.t1 statistics.
20200728:00:39:45:011373 gptext-state:mdw:gpadmin-[INFO]:------------------------------------------------
20200728:00:39:45:011373 gptext-state:mdw:gpadmin-[INFO]:- replication_factor max_shards_per_node num_docs size in bytes last_modified
20200728:00:39:45:011373 gptext-state:mdw:gpadmin-[INFO]:- 2 4 10000 3340032 2020-07-28T04:38:49.340Z
20200728:00:39:45:011373 gptext-state:mdw:gpadmin-[INFO]:-Done.
Repeat the steps above until all documents have been indexed.