Install and configure PXF for Local Files


Article ID: 296512


Products

VMware Tanzu Greenplum

Issue/Introduction

This article describes how to set up PXF to read local files, for example in a tmp directory or on a mounted file system, which requires adding specific profiles to pxf-profiles.xml.

Environment

Product Version: 5.25

Resolution

First of all, make sure you've followed the PXF setup by installing Java 8 or 11 on all hosts and sourcing your environment correctly.
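
As a quick sanity check on each host, you can verify the Java version and set JAVA_HOME for the gpadmin user. The JDK path below is only an assumed example; substitute your actual install location:

java -version                                                          # should report 1.8.x or 11
echo 'export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk' >> ~/.bashrc    # example path only
source ~/.bashrc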

Next, after initializing your PXF cluster, you have a $PXF_CONF directory set up. In its conf subdirectory, you'll see a pxf-profiles.xml file.
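
To confirm the file is where you expect (assuming $PXF_CONF was exported during setup), you can list it on the master host:

ls -l $PXF_CONF/conf/pxf-profiles.xml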

Use your favorite editor to add these profiles:
<profile>
    <name>localfile:parquet</name>
    <description>A profile for reading and writing Parquet data from local files</description>
    <plugins>
        <fragmenter>org.greenplum.pxf.plugins.hdfs.HdfsDataFragmenter</fragmenter>
        <accessor>org.greenplum.pxf.plugins.hdfs.ParquetFileAccessor</accessor>
        <resolver>org.greenplum.pxf.plugins.hdfs.ParquetResolver</resolver>
    </plugins>
    <protocol>localfile</protocol>
</profile>
<profile>
    <name>localfile:csv</name>
    <description>A profile for reading delimited single line records
        from plain text CSV files on the local file system
    </description>
    <plugins>
        <fragmenter>org.greenplum.pxf.plugins.hdfs.HdfsDataFragmenter</fragmenter>
        <accessor>org.greenplum.pxf.plugins.hdfs.LineBreakAccessor</accessor>
        <resolver>org.greenplum.pxf.plugins.hdfs.StringPassResolver</resolver>
    </plugins>
    <protocol>localfile</protocol>
</profile>
<profile>
    <name>localfile:text</name>
    <description>A profile for reading delimited single line records
        from plain text, tab-delimited files on the local file system
    </description>
    <plugins>
        <fragmenter>org.greenplum.pxf.plugins.hdfs.HdfsDataFragmenter</fragmenter>
        <accessor>org.greenplum.pxf.plugins.hdfs.LineBreakAccessor</accessor>
        <resolver>org.greenplum.pxf.plugins.hdfs.StringPassResolver</resolver>
    </plugins>
    <protocol>localfile</protocol>
</profile>

For this example, I've just added the localfile:csv profile to mine.

After this, depending on your PXF version, you must sync the configuration and restart PXF with pxf cluster stop, pxf cluster sync, and pxf cluster start (please check your version's docs for specific restart instructions).
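
On PXF versions with these commands, the sequence looks like this, run as gpadmin from the master host:

pxf cluster stop     # stop PXF on all hosts
pxf cluster sync     # copy $PXF_CONF changes, including pxf-profiles.xml, to all hosts
pxf cluster start    # start PXF on all hosts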

I put my CSV file at /tmp/core/Batting.csv. If you see an error like the following when doing a select from your table, it means your CSV file is not on all the hosts:
ERROR:  remote component error (500) from '127.0.0.1:5888':  type  Exception report   message   javax.servlet.ServletException: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/tmp/core/Batting.csv    description   The server encountered an internal error that prevented it from fulfilling this request.    exception   javax.servlet.ServletException: javax.servlet.ServletException: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/tmp/core/Batting.csv (libchurl.c:946)  (seg0 slice1 192.168.99.101:30002 pid=8264) (cdbdisp.c:254)
DETAIL:  External table atting, file pxf:///tmp/core/Batting.csv?PROFILE=localfile:csv

So please copy your CSV file to all the hosts, and make sure you've created the pxf extension; a sketch of both steps follows.
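
A minimal sketch of both steps, assuming a hostfile named hostfile that lists all segment hosts and a database named testdb (both placeholders for your own names):

gpssh -f hostfile 'mkdir -p /tmp/core'                  # create the directory on every host
gpscp -f hostfile /tmp/core/Batting.csv =:/tmp/core/    # copy the CSV to every host
psql -d testdb -c 'CREATE EXTENSION pxf;'               # register the PXF extension in your database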
Create your External Table like this:
CREATE EXTERNAL TABLE atting (
    playerID varchar(200), yearID varchar(200), stint varchar(200), teamID varchar(200),
    lgID varchar(200), G varchar(200), AB varchar(200), R varchar(200), H varchar(200),
    Doubles varchar(200), Triples varchar(200), HR varchar(200), RBI varchar(200),
    SB varchar(200), CS varchar(200), BB varchar(200), SO varchar(200), IBB varchar(200),
    HBP varchar(200), SH varchar(200), SF varchar(200), GIDP varchar(200)
)
LOCATION ('pxf:///tmp/core/Batting.csv?PROFILE=localfile:csv')
FORMAT 'CSV';

Now you are ready to select * from your table.
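
For example, using the placeholder testdb database from above:

psql -d testdb -c 'SELECT * FROM atting LIMIT 10;'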