Troubleshooting Guide: Configuring Amazon S3 with Greenplum

Products

VMware Tanzu Greenplum

Issue/Introduction

This article is a troubleshooting guide for common problems encountered when configuring Amazon Simple Storage Service (Amazon S3) with Greenplum, and it explains how to find their root cause. Amazon S3 is an easy-to-use object storage service with a simple web service interface for storing and retrieving any amount of data from anywhere on the web.

Starting with Greenplum Database 4.3.8.0, the CREATE EXTERNAL TABLE command supports readable external tables that access files in Amazon S3. To use this feature, the s3 protocol must be configured on the Greenplum cluster. This guide covers problems commonly encountered during and after configuring the s3 protocol and shows how to isolate them.

Resolution

Checklist:

The s3 URL must follow the format s3://S3_endpoint/bucket_name/[S3_prefix]. Make sure it points to the correct S3_endpoint and uses the correct bucket name and S3_prefix. The S3 endpoint depends on the region in which your resources are located. For the list of regions and the endpoints each one supports, refer to the list of valid regions and endpoints.

All segments in the cluster must be able to send requests to Amazon S3 and receive its responses. This is required because the Greenplum segments themselves send the requests that fetch objects from S3.

The configuration file must be present on all segments at the location specified in the s3 URL: 's3://S3_endpoint/bucket_name/[S3_prefix] [config=config_file_location]'

The AWS access key ID and secret access key together uniquely identify a user and determine whether that user is allowed to perform a specific operation under a parent account. These parameters are set in the configuration file s3.conf. You can retrieve and cross-check them by following "How to get your access key ID and secret access key".
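Two of the checklist items, the URL format and the presence of the credentials in the configuration file, can be checked mechanically before running a query. The following is a minimal sketch in POSIX shell, not part of Greenplum: the function names are hypothetical, the URL check validates only the shape of the location (not whether the endpoint or bucket exists), and the key names accessid and secret are assumptions based on the Greenplum s3 protocol documentation.

```shell
# Sketch only: helper names are hypothetical, not part of Greenplum.

# Succeeds if the location has the shape s3://S3_endpoint/bucket_name/[S3_prefix].
# It checks the form only, not whether the endpoint or bucket exists.
check_s3_url() {
    printf '%s\n' "$1" |
        grep -Eq '^s3://[^/[:space:]]+/[^/[:space:]]+(/[^[:space:]]*)?$'
}

# Succeeds if the file is readable and has an entry for each expected key.
# The key names "accessid" and "secret" are assumptions based on the
# Greenplum s3 protocol documentation; adjust them if your file differs.
check_s3_conf() {
    conf="$1"
    if [ ! -r "$conf" ]; then
        echo "cannot read: $conf"
        return 1
    fi
    status=0
    for key in accessid secret; do
        if ! grep -Eq "^[[:space:]]*${key}[[:space:]]*=[[:space:]]*[^[:space:]]" "$conf"; then
            echo "missing or empty: $key"
            status=1
        fi
    done
    return "$status"
}

# Example calls (URL and path taken from the error message in this article):
#   check_s3_url "s3://s3-us-west-2.amazonaws.com/test_s3_bucket/test_s3_file.txt"
#   check_s3_conf /home/gpadmin/s3/s3.conf
```

Because the configuration file has to exist on every segment, check_s3_conf is only meaningful when it is run on each segment host (for example through gpssh), not just on the master.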

If any of the above conditions is not met, Greenplum fails with a generic error message:

ERROR: Failed to init S3 extension, segid = 12, segnum = 24, please check your configurations and net connection (gps3ext.cpp:166)
 (seg12 slice1 sdw3:40000 pid=4741) (cdbdisp.c:1326) DETAIL: External table test_s3_table,
 file s3://s3-us-west-2.amazonaws.com/test_s3_bucket/test_s3_file.txt config=/home/gpadmin/s3/s3.conf

To get detailed error messages and isolate the issue, set log_min_messages to debug with the command below, reload the configuration so the change takes effect (for example with gpstop -u), and rerun the query.
```
gpconfig -c log_min_messages -v debug
```
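The error text names the segment that raised it, for example (seg12 slice1 sdw3:40000 pid=4741). As a small sketch, the hypothetical helper below extracts the segment id, host, port, and process id from such a line, which can help when deciding which segment host's log to read once debug logging is enabled. The function name and output format are illustrative only, and the field layout is assumed from the error messages shown in this article.

```shell
# Hypothetical helper: pull the segment id, host, port and pid out of a
# Greenplum error line containing "(segN sliceM host:port pid=P)".
# Prints nothing if the line does not contain that pattern.
parse_seg_info() {
    printf '%s\n' "$1" |
        sed -n -E 's/.*\(seg([0-9]+) slice[0-9]+ ([^:[:space:]]+):([0-9]+) pid=([0-9]+)\).*/seg=\1 host=\2 port=\3 pid=\4/p'
}

# Example with the segment details from the first error in this article.
parse_seg_info "ERROR: Failed to init S3 extension (seg12 slice1 sdw3:40000 pid=4741) (cdbdisp.c:1326)"
# prints: seg=12 host=sdw3 port=40000 pid=4741
```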
Errors can still occur even when everything is configured properly on Greenplum. This happens when the access ID configured in s3.conf does not have the required permissions on the S3 resources. In this scenario, Greenplum fails with a message similar to the one below:
```
 ERROR: s3_import: could not read data (gps3ext.cpp:185) (seg11 slice1 sdw5:40003 pid=9702)
 (cdbdisp.c:1326) DETAIL: External table test_s3_table, file s3://s3-us-west-2.amazonaws.com/test_s3_bucket/test_s3_file.txt
 config=/home/gpadmin/s3/s3.conf
```
The above error occurs because the user has not been granted sufficient permissions. Amazon S3 checks permissions at three different levels:

User - Checks whether the user has permission to perform the operation. The user is identified by the combination of access ID and secret access key.
Bucket - Checks whether the user has access to the S3 bucket.
Object - Checks whether the user has access to the objects in the S3 bucket. With debug logging enabled through log_min_messages, the exact object name can be traced in the log; that object is the one likely to have the permission issue.

If any of the above checks fails, Greenplum cannot perform read/write operations on Amazon S3.