A customer frequently encountered the following error when running specific PXF jobs: "PXF server error : This feature is disabled. Please refer to dfs.client.block.write.replace-datanode-on-failure.enable configuration property." A representative log entry:
UTC,"generic_gpdb_load_utility","gpdb_prd",p46594,th1531054208,"10.x.x.x","58996",2024-09-04 16:47:40 UTC,0,con219152,cmd17,seg-1,,dx1118568,,sx1,"ERROR","P0001","Error during transaction
Error Code : 08000
Message : PXF server error : This feature is disabled. Please refer to dfs.client.block.write.replace-datanode-on-failure.enable configuration property. (seg99 10.x.x.x:40003 pid=311859)
Detail :
Hint : Check the PXF logs located in the '/usr/local/pxf-gp6/logs' directory on host 'localhost' or 'set client_min_messages=LOG' for additional details.
Context : SQL statement ""insert into gpdb.platform_attribute_client_12_ext (user_identity_key,user_identity_type_id,attribute_id) select user_identity_key,user_identity_type_id,attribute_id from gpdb._platform_attribute_client_12""
PL/pgSQL function stg.usp_pop_hdfs_generic(text,text,boolean) line 67 at EXECUTE statement",,,,,,"select stg.usp_pop_hdfs_generic('mesobase', 'fact_platform_attribute_client_12', true); --DAG:dw-mesobase620_lowes_prospect_srf TASK:hdfs_attribute_client.postgres",0,,"pl_exec.c",3072,
Prod:
GPDB: 6.25.1
PXF: 6.10.2
The issue is likely due to the high workload on the HDFS cluster or potential network problems between the PXF hosts and the HDFS cluster, leading to a high rate of timeouts between the HDFS client (PXF) and the HDFS cluster.
Greenplum DB (GPDB) engineering recommended that the customer enable the dfs.client.block.write.replace-datanode-on-failure.enable property on their Hadoop cluster. However, the customer declined to implement this change because of the large size of their Hadoop cluster and the significant time and effort required to apply the modification.
On the Greenplum/PXF side, only a few parameters can be adjusted to address this issue. As previously attempted, increasing the values of dfs.datanode.socket.write.timeout and dfs.client.block.write.retries may reduce the number of errors. These parameters can be raised further while monitoring whether the error rate drops.
Adjusting the following settings allowed the customer's PXF view to complete successfully.
Original settings:
<property>
    <name>dfs.datanode.socket.write.timeout</name>
    <value>9000000</value>
</property>
<property>
    <name>dfs.client.block.write.retries</name>
    <value>16</value>
</property>
Updated settings:
<property>
    <name>dfs.datanode.socket.write.timeout</name>
    <value>240000000</value>
</property>
<property>
    <name>dfs.client.block.write.retries</name>
    <value>32</value>
</property>
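After editing hdfs-site.xml, it can be useful to confirm which values the file actually contains. The following is a minimal sketch, not part of PXF or Hadoop tooling: read_properties and ms_to_hours are hypothetical helper names, and the file path is supplied by the caller because its location depends on the deployment. The conversion assumes dfs.datanode.socket.write.timeout is expressed in milliseconds, as in the Hadoop defaults.

```python
import sys
import xml.etree.ElementTree as ET

# The two properties discussed in this note.
WATCHED = (
    "dfs.datanode.socket.write.timeout",
    "dfs.client.block.write.retries",
)


def read_properties(xml_data, names=WATCHED):
    """Return {name: value} for the requested properties.

    xml_data is the content of an hdfs-site.xml file (str or bytes).
    Properties that are absent from the file are left out of the result.
    """
    root = ET.fromstring(xml_data)
    found = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        if name in names:
            found[name] = (prop.findtext("value") or "").strip()
    return found


def ms_to_hours(ms):
    """Convert a millisecond value (int or numeric string) to hours."""
    return int(ms) / 3_600_000


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python check_hdfs_site.py /path/to/hdfs-site.xml
    with open(sys.argv[1], "rb") as fh:
        props = read_properties(fh.read())
    for name in WATCHED:
        print(f"{name} = {props.get(name, '<not set in this file>')}")
    timeout = props.get("dfs.datanode.socket.write.timeout")
    if timeout is not None:
        print(f"socket write timeout is about {ms_to_hours(timeout):.1f} hours")
```

Under the millisecond assumption, the original timeout of 9000000 corresponds to 2.5 hours and the updated value of 240000000 to roughly 66.7 hours, so the updated setting is a very generous ceiling rather than a tuned value.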