Resolving FTS Max Retries and Hostname Lookup Issues



Article ID: 376190



Products

VMware Tanzu Data Suite - Greenplum, VMware Tanzu Greenplum

Issue/Introduction

In a Greenplum Database environment, segments may be marked down after the FTS (Fault Tolerance Service) exhausts its maximum connection retries. The log error reported is:

FTS: cannot establish libpq connection to (content=<content number>, dbid=<DB ID>): could not translate host name "(null)" to address: Name or service not known

This error indicates that the FTS process is unable to establish a connection to one or more segments, potentially due to hostname resolution issues or segment overload.

Environment

GPDB 6.X

This issue is observed in a Greenplum Database cluster where the FTS process monitors the health of the database segments. When FTS cannot connect to a segment within its retry limit, the segment is marked down and, if a mirror is configured, a failover occurs.

Cause

The log error suggests that the issue is related to the failure in translating the hostname to an IP address. Here’s a breakdown of potential causes:

1. FTS Max Retries: The error indicates that the FTS service has exhausted its maximum retry attempts while probing a segment. This is often a sign that the segment is heavily loaded or slow to respond.

2. Hostname Resolution Failure: The specific message "could not translate host name "(null)" to address: Name or service not known" suggests that the hostname provided by the system is either missing or incorrectly configured.

3. DNS Issues: If the cluster relies on DNS for hostname resolution, there may have been a brief outage or unresponsiveness from the DNS server, causing the failure to resolve hostnames.

4. Configuration Mismatch: Hostnames configured in the "gp_segment_configuration" table might not be correctly resolved by the system. The master node may be unable to translate these hostnames into IP addresses.


Resolution

To resolve this issue, follow these steps:

1. Check /etc/hosts File:

Inspect the /etc/hosts file on every host in the cluster. Ensure that all segment hostnames and their corresponding IP addresses are correctly listed. Use "ls -l /etc/hosts" to confirm that the file exists and is readable.

Example command to view the /etc/hosts file:

cat /etc/hosts
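As a sketch of the check above, a small shell loop can confirm that each expected hostname has an entry in /etc/hosts; the names sdw1 and sdw2 below are placeholders for your own segment hosts.

```shell
#!/bin/sh
# Verify that each expected hostname has an entry in /etc/hosts.
# sdw1/sdw2 are hypothetical names; replace them with your hosts.
check_hosts_entry() {
    # Succeeds (exit 0) if the hostname appears as a whole word in the file.
    grep -qw "$2" "$1"
}

for host in sdw1 sdw2; do
    if check_hosts_entry /etc/hosts "$host"; then
        echo "$host: present in /etc/hosts"
    else
        echo "$host: MISSING from /etc/hosts"
    fi
done
```

Run this on every host in the cluster, since each host resolves names from its own /etc/hosts file.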

 

2. Verify Hostname Configuration:

Ensure that the hostname and address values in the "gp_segment_configuration" table match the entries in the "/etc/hosts" file. You can inspect the configuration with:

SELECT * FROM gp_segment_configuration;
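To automate this comparison, the sketch below pulls each distinct hostname from gp_segment_configuration and asks the system resolver about it. It assumes it runs on the coordinator with psql on PATH and a reachable database; adjust connection options for your environment.

```shell
#!/bin/sh
# Cross-check each distinct hostname from gp_segment_configuration
# against the local system resolver (files and/or DNS).
resolves() {
    # Succeeds if the system resolver can translate the name to an address.
    getent hosts "$1" >/dev/null 2>&1
}

if command -v psql >/dev/null 2>&1; then
    psql -At -c "SELECT DISTINCT hostname FROM gp_segment_configuration;" |
    while read -r host; do
        if resolves "$host"; then
            echo "$host: resolves"
        else
            echo "$host: DOES NOT resolve"
        fi
    done
else
    echo "psql not found; run this on the Greenplum coordinator"
fi
```

Any hostname reported as not resolving is a candidate cause of the "could not translate host name" error.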

 

3. Check DNS Server Availability:

If your cluster is using DNS for hostname resolution, verify the availability and responsiveness of your DNS servers. You can use tools like nslookup or dig to test DNS resolution.

Example command to test DNS resolution:

nslookup <hostname>


Note - We recommend listing all hosts (both segment hosts and coordinator hosts) in the /etc/hosts files rather than relying on DNS. If the DNS server becomes unavailable or slow, hostname resolution failures can destabilize the cluster.
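To confirm that /etc/hosts entries actually take precedence over DNS, check the resolver order in /etc/nsswitch.conf: "files" listed before "dns" on the "hosts:" line means local entries win. The helper below is a small sketch of that check.

```shell
#!/bin/sh
# Report whether "files" is consulted before "dns" for a given
# "hosts:" line taken from /etc/nsswitch.conf.
files_first() {
    echo "$1" | awk '{
        for (i = 2; i <= NF; i++) {
            if ($i == "files") { print "yes"; exit }
            if ($i == "dns")   { print "no";  exit }
        }
    }'
}

# Inspect the live configuration, if the file is present:
line=$(grep '^hosts:' /etc/nsswitch.conf 2>/dev/null)
{ [ -n "$line" ] && echo "files first: $(files_first "$line")"; } || true
```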

4. Network and Load Checks:

Investigate whether network issues or high load on segments are contributing to the problem. Review network logs and segment performance metrics to ensure that segments are not overwhelmed.
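As a sketch of the load check, gpssh (shipped with Greenplum) can run uptime across every host, and a small helper can flag load averages above a threshold. The hostfile name and the threshold below are assumptions; pick values appropriate for your hardware.

```shell
#!/bin/sh
# Sweep load averages across all cluster hosts, e.g.:
#   gpssh -f hostfile_all "uptime"
# (hostfile_all is a hypothetical file listing all hosts, one per line.)

# Helper: flag a 1-minute load average above a threshold.
high_load() {
    awk -v l="$1" -v t="$2" 'BEGIN { print ((l + 0 > t + 0) ? "high" : "ok") }'
}
```

A host whose load stays "high" relative to its CPU count is a likely reason its segments respond slowly enough to exhaust FTS retries.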