This article discusses how to review and understand the causes of segment failures. There are several reasons why segments go down or fail.
In order to perform root cause analysis (RCA) on a "segment down" event, the following information needs to be reviewed:
select * from gp_segment_configuration gsc join pg_filespace_entry pfe on gsc.dbid = pfe.fsedbid where content = <contentID of the segment that went down>;
select * from gp_configuration_history order by time desc;
Master logs ($MASTER_DATA_DIRECTORY/pg_log/)
Primary logs (segment_data_directory/pg_log/)
Mirror logs (segment_data_directory/pg_log/)
gpconfig -s gp_fts_probe_interval
gpconfig -s gp_fts_probe_threadcount
gpconfig -s gp_fts_probe_timeout
gpconfig -s gp_segment_connect_timeout
Note: The corresponding primary or mirror for a specific segment can be identified in the "gp_segment_configuration" table. A primary/mirror pair will have the same content ID (the "content" column).
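As an illustration, the following lookup returns both members of a primary/mirror pair; the content ID 2 used here is only a placeholder for the failed segment's content ID:

# Example only: substitute the content ID of the segment that went down.
# The row with role 'p' is the primary; the row with role 'm' is its mirror.
psql -c "SELECT dbid, content, role, preferred_role, mode, status, hostname, port
FROM gp_segment_configuration
WHERE content = 2
ORDER BY role DESC;"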
Identify the following from the "gp_segment_configuration" table and the master logs:
This is an indication of mirrors going down due to high workload or network overload. The "gp_segment_connect_timeout" parameter can be increased to allow a longer response time from the mirror. This is not a permanent fix; if the workload keeps increasing, other failures will occur.
2013-05-08 04:10:50.730638 EDT,,,p28480,th111540096,,,,0,,,seg-1,,,,,"WARNING","01000","threshold '75' percent of 'gp_segment_connect_timeout=1500' is reached, mirror may not be able to keep up with primary, primary may transition to change tracking",,"increase guc 'gp_segment_connect_timeout' by 'gpconfig' and 'gpstop -u'",,,,,0,,"cdbfilerepprimaryack.c",860,
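If the workload genuinely requires a longer timeout, it can be raised as the log hint suggests and the configuration reloaded without a restart. The value below is only an example; choose one appropriate for the environment:

# Example only: raise gp_segment_connect_timeout (in seconds) and reload the configuration
gpconfig -c gp_segment_connect_timeout -v 3000
gpstop -u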
Mirrors can also go down due to missing files. Find the log entry referring to 'transition' and then search the segment log files for missing-file errors.
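For example (the mirror data directory path and log file names below are placeholders; substitute the failed segment's actual paths and dates):

# Locate the transition message in the mirror's log for the day of the failure
grep -i "transition" /data/mirror/gpseg2/pg_log/gpdb-2013-05-08_*.csv
# Then look for missing-file errors around the same time
grep -iE "No such file or directory|could not open" /data/mirror/gpseg2/pg_log/gpdb-2013-05-08_*.csv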
There are multiple reasons for a primary segment going down. Review the primary segment log files from the time the primary segment failed and look for the word "transition". The reason for the failure will be one of the following:
The postmaster process on the primary segment periodically verifies that I/O to the segment data directory is working properly by writing and reading back a file under the data directory ("fts_probe_file.bak").
When there is a problem with the I/O (for example, a stuck controller), the segment will not be able to respond to the FTS process on the master. The FTS will promote the mirror to primary and transition the primary to mirror. Typical symptoms of this issue are segments transitioning while the segment servers appear "stuck".
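A quick spot check on the suspect segment host is to mimic what the probe does: write a small file under the data directory and read it back. The path below is an example only; if this command hangs, it points to the same stuck I/O that the FTS probe detects:

# Example path; use the actual segment data directory
SEG_DATA_DIR=/data/primary/gpseg2
# Write a small test file, read it back, then remove it
date > "$SEG_DATA_DIR/io_probe_test.tmp" \
  && cat "$SEG_DATA_DIR/io_probe_test.tmp" \
  && rm -f "$SEG_DATA_DIR/io_probe_test.tmp" \
  || echo "WARNING: I/O to $SEG_DATA_DIR appears stuck or failing"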
Analyze the past behavior of segment failures to identify long-term trends, such as possible hardware issues. A psql script (segment_failures.sql) is attached to this article for this purpose.
This script analyzes the last three months of segment failures and produces three reports: primary segments with more than one failure, mirror segments with more than one failure, and failures per host and time.
The output will look similar to the following:
[gpadmin@mdw ~]$ psql -p 54320 -f f.sql
Timing is on.
Primary segments with more than 1 failure
 hostname | content | number_failures
----------+---------+-----------------
(0 rows)

Time: 15.570 ms
Mirror segments with more than 1 failure
 hostname | content | number_failures
----------+---------+-----------------
 sdw1     |       2 |               2
 sdw1     |       3 |               2
(2 rows)

Time: 4.946 ms
Hosts and time with failures
 hostname |      failure time      | number_failures
----------+------------------------+-----------------
 sdw1     | 2016-01-14 10:00:00-08 |               4
 sdw1     | 2016-02-29 07:00:00-08 |               2
 sdw1     |                        |               6
          |                        |               6
(4 rows)

Time: 2.476 ms
Based on the above output, specific segments can be flagged for further investigation as likely candidates for future failures.
The third report can also be used to roll up the total number of segment failures per node for the reporting period.
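As a minimal sketch of such a roll-up (this is not the attached segment_failures.sql; it simply counts every recorded state change per host in gp_configuration_history, which includes both failures and recoveries):

# Count recorded segment state changes per host over the last 3 months
psql <<'SQL'
SELECT gsc.hostname,
       count(*) AS recorded_events
FROM gp_configuration_history gch
JOIN gp_segment_configuration gsc ON gsc.dbid = gch.dbid
WHERE gch.time > now() - '3 month'::INTERVAL
GROUP BY gsc.hostname
ORDER BY recorded_events DESC;
SQL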
To change the reporting period, alter the first line of the script:
\set report_interval ('3 month')::INTERVAL
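For example, assuming the same \set syntax shown above, reporting on the last six months instead would be:

\set report_interval ('6 month')::INTERVAL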