How to perform a root cause analysis for the segments marked down

Article ID: 295959

Products

VMware Tanzu Greenplum, VMware Tanzu Data Suite, Greenplum

Issue/Introduction

This article discusses how to review and understand the cause of segment failures. Segments can go down or fail for several reasons.


Resolution

In order to perform root cause analysis (RCA) on a "segment down" event, the following information needs to be reviewed:
 

  • gp_segment_configuration table
select * from gp_segment_configuration gsc join pg_filespace_entry pfe on gsc.dbid=pfe.fsedbid where content= <contentID for segment which went down> ;
  • gp_configuration_history table
select * from gp_configuration_history order by time desc;
  • DB log files from the date and time segments went down.
Master logs ($MASTER_DATA_DIRECTORY/pg_log/)
Primary logs (segment_data_directory/pg_log/)
Mirror logs (segment_data_directory/pg_log/)
  • Collect OS log files (/var/log/messages), which may contain evidence of hardware or network issues.
  • Relevant configuration parameters (gpconfig -s)
gpconfig -s gp_fts_probe_interval 
gpconfig -s gp_fts_probe_threadcount
gpconfig -s gp_fts_probe_timeout
gpconfig -s gp_segment_connect_timeout
  • The time (exact or approximate) when the segment(s) went down.
  • Identification of the segment - either the Database ID (DBID) or a combination of the ContentID and Role (Primary or Mirror).
 

Note: The corresponding primary or mirror segment for a specific segment can be found in the "gp_segment_configuration" table. A primary/mirror pair shares the same content ID (the "content" column).
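For example, a query along the following lines lists the segments currently marked down together with their corresponding peers. The query is only a sketch based on the gp_segment_configuration columns; adjust it for your Greenplum version:

select down.content, down.dbid, down.role, down.preferred_role, down.hostname, down.status,
       peer.dbid as peer_dbid, peer.hostname as peer_hostname, peer.status as peer_status
from gp_segment_configuration down
join gp_segment_configuration peer
  on peer.content = down.content
 and peer.dbid <> down.dbid
where down.status = 'd';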
 

How to perform the RCA

Identify the following from the gp_segment_configuration table and the master logs (a sample query is sketched after this list):
 

  • Which segments went down?
  • What time did they go down?
  • Is it only primary segments?
  • Is it only mirror segments?
  • Is it only one server or rack?
  • Is it a mix of primary and mirror segments?
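A query such as the following is one way to answer these questions. It assumes the description text in gp_configuration_history contains the word "down" when a segment is marked down; verify the wording against your own history entries:

select gch.time, gch.dbid, gsc.content, gsc.hostname, gsc.preferred_role, gch."desc"
from gp_configuration_history gch
join gp_segment_configuration gsc on gsc.dbid = gch.dbid
where gch."desc" ilike '%down%'
order by gch.time desc;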

Case 1

In most cases, it is the mirror segments that are marked down. The most common cause is that the primary and mirror cannot keep up timely communication: the primary segment does not receive confirmation from the mirror within the time limit controlled by the gp_segment_connect_timeout parameter.
 

This is an indication of mirrors going down due to high workload or network overload. The "gp_segment_connect_timeout" parameter can be increased to allow a longer response time from the mirror (see the example after the log message below). This is not a permanent fix; if the workload keeps increasing, other failures will occur.

2013-05-08 04:10:50.730638 EDT,,,p28480,th111540096,,,,0,,,seg-1,,,,,"WARNING","01000","threshold '75' percent of 'gp_segment_connect_timeout=1500' is reached, mirror may not be able to keep up with primary, primary may transition to change tracking",,"increase guc 'gp_segment_connect_timeout' by 'gpconfig' and 'gpstop -u'",,,,,0,,"cdbfilerepprimaryack.c",860,
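If the workload justifies it, the timeout can be raised with gpconfig and the configuration reloaded without a restart. The value below is only an illustrative example:

gpconfig -c gp_segment_connect_timeout -v 3600
gpstop -u
gpconfig -s gp_segment_connect_timeout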


Case 2

Mirrors can also go down because of missing files. Find the log entry referring to 'transition' in the mirror log, then search the segment log files for errors about missing files.
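For example, a search of this kind (the data directory path and log file pattern are placeholders) can locate both the transition entries and any missing-file errors in a mirror's logs:

grep -iE "transition|could not open|no such file" segment_data_directory/pg_log/gpdb-*.csv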


Case 3

There are multiple reasons for primary segments going down. Review the primary segment log files from the time the primary segment failed and look for the word "transition". The reason for the failure is typically one of the following (example searches are sketched after the list):
 

  • PANIC/SIGSEGV
  • Out of memory (OS or VMEM)
  • Network issues
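The searches below are one possible starting point; the paths, dates, and exact message text vary by environment and Greenplum version:

grep -iE "PANIC|SIGSEGV|transition" segment_data_directory/pg_log/gpdb-*.csv
grep -iE "out of memory|vmem" segment_data_directory/pg_log/gpdb-*.csv
grep -iE "oom-killer|out of memory|i/o error" /var/log/messages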

Case 4

The postmaster process on the primary segment periodically verifies that I/O on the segment data directory works properly, by writing and reading back a file under the data directory ("fts_probe_file.bak").


When there is a problem with the I/O (for example, a stuck controller), the segment will not be able to respond to the FTS process on the master. FTS will then promote the mirror to primary and transition the former primary to mirror. A typical symptom is segments being transitioned while the segment servers appear "stuck".
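If a stuck controller is suspected, a quick manual check that is similar in spirit to the FTS probe (the commands and path below are illustrative and not part of the FTS mechanism) is to time a small synchronous write and read under the affected segment's data directory:

cd segment_data_directory
time dd if=/dev/zero of=io_check.tmp bs=8k count=1 oflag=sync
time cat io_check.tmp > /dev/null
rm -f io_check.tmp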


Long-Term Trend Analysis

Analyze the past behavior of segment failures to identify any long-term trends, such as possible hardware issues. There is a PSQL script attached to this article (segment_failures.sql) which can be used for this purpose.
 

This script analyzes the last three months of segment failures and produces three reports (a simplified query sketch follows the list below):
 

  1. Any primary segments that have failed more than once within the reporting window.
  2. Any mirror segments that have failed more than once within the reporting window.
  3. Any server with segment failures, with the date and time (at one-hour granularity) and the number of segment failures (mirror and primary) within that hour window.
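The sketch below approximates the first report only; it assumes the gp_configuration_history description text contains "down" for a mark-down event. The attached segment_failures.sql remains the authoritative version:

select gsc.hostname, gsc.content, count(*) as number_failures
from gp_configuration_history gch
join gp_segment_configuration gsc on gsc.dbid = gch.dbid
where gch.time > now() - '3 month'::interval
  and gch."desc" ilike '%down%'
  and gsc.preferred_role = 'p'
group by gsc.hostname, gsc.content
having count(*) > 1
order by number_failures desc;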

The output of the attached script will look similar to the following:

[gpadmin@mdw ~]$ psql -p 54320 -f f.sql
Timing is on.
Primary segments with more than 1 failure
 hostname | content | number_failures
----------+---------+-----------------
(0 rows)

Time: 15.570 ms
Mirror segments with more than 1 failure
 hostname | content | number_failures
----------+---------+-----------------
 sdw1     |       2 |               2
 sdw1     |       3 |               2
(2 rows)

Time: 4.946 ms
Hosts and time with failures
 hostname |      failure time      | number_failures
----------+------------------------+-----------------
 sdw1     | 2016-01-14 10:00:00-08 |               4
 sdw1     | 2016-02-29 07:00:00-08 |               2
 sdw1     |                        |               6
          |                        |               6
(4 rows)

Time: 2.476 ms

Based on the above output, certain segments can be flagged for further investigation of potential hardware or other recurring issues.


The third report can also be used to roll up the total number of segment failures per node for the reporting period.


To change the reporting period, alter the first line of the script:

\set report_interval ('3 month')::INTERVAL
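For example, to analyze the last six months instead (the value is illustrative):

\set report_interval ('6 month')::INTERVAL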


Additional Information

  1. Details about why segments went down are located in the segment DB logs.
  2. Use gp_configuration_history to understand whether there are any patterns.
  3. Use gpstate -e to get a quick view of segment status (example commands are sketched after this list).
  4. Catalog issues can also cause segments to go down.
  5. Always understand why a segment went down before suggesting a recovery.
  6. A full recovery (gprecoverseg -F) deletes all the data and files in the segment data directory.
  7. Because of the above, running an incremental recovery after a full recovery will not work.
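For reference, the commands mentioned above are typically run from the master as shown below; confirm the options against the documentation for your Greenplum version:

gpstate -e         # show segments with mirror status issues
gprecoverseg       # incremental recovery
gprecoverseg -F    # full recovery (recreates the segment data directory)
gprecoverseg -r    # rebalance segments back to their preferred roles after recovery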