When querying an external table with the location set to gpfdist, multistage query plans will produce invalid results without producing any error messages.
Regardless of using a hash table, external table with gpfdist or file, the query planner is creating the same plan but:
Error Message:
No error messages will be logged for this issue. Master and segment logs will show normal activity while gpfdist, even with debugging, will show normal activity as well.
This issue has been raised to Pivotal Engineering via MPP-26406 and is awaiting their feedback on the cause to this issue.
RCA
1) Setup a file to load with gpfdist such as:
[gpadmin@gpdb-sandbox gpfdist]$ cat new_file_list.csv a;1;1 b;2;2 c;3;3 d;4;4 e;5;5 f;6;6
2) Create the external table definition:
CREATE EXTERNAL TABLE ext_new_file_list (name text, id int, val int) LOCATION ('gpfdist://localhost:9300/new_file_list.csv') FORMAT 'text' (delimiter ';' null '' escape 'off') LOG ERRORS INTO err_new_file_list SEGMENT REJECT LIMIT 100;
3. Start gpfdist -p 9300 where the file is located
4. Run some queries and notice you never get the correct values back:
distinct_repro=# select * from ext_new_file_list; name | id | val ------+----+----- a | 1 | 1 b | 2 | 2 c | 3 | 3 d | 4 | 4 e | 5 | 5 f | 6 | 6 (6 rows) distinct_repro=# select count(*), count(distinct id), count(distinct val) from ext_new_file_list ; count | count | count -------+-------+------- 6 | 6 | 0 (1 row) distinct_repro=# select count(*), count(distinct id), count(distinct val) from ext_new_file_list ; count | count | count -------+-------+------- 0 | 0 | 6 (1 row) distinct_repro=# select count(*), count(distinct id), count(distinct val) from ext_new_file_list ; count | count | count -------+-------+------- 6 | 6 | 0 (1 row) distinct_repro=# select count(*), count(distinct id), count(distinct val) from ext_new_file_list ; count | count | count -------+-------+------- 6 | 6 | 0 (1 row)
5. Set gp_enable_multiphase_agg off and immediately the results are correct:
distinct_repro=# set gp_enable_multiphase_agg=off; SET distinct_repro=# select count(*), count(distinct id), count(distinct val) from ext_new_file_list ; count | count | count -------+-------+------- 6 | 6 | 6 (1 row) distinct_repro=# select count(*), count(distinct id), count(distinct val) from ext_new_file_list ; count | count | count -------+-------+------- 6 | 6 | 6 (1 row)
This issue has been raised to Pivotal Engineering via MPP-26406 and is awaiting their resolution.