gprecoverseg fails because segment PANIC "could not open relation xxx/yyy/zzz: No such file or directory"

Article ID: 296595

Products

VMware Tanzu Greenplum

Issue/Introduction

Note: The error "No such file or directory" may be reported for a number of different reasons. This article describes only one. Please refer to other articles if the symptoms do not match. 


- Noticed in Greenplum Database 5.29.2.
- gprecoverseg fails for both incremental and full recovery of a segment, with the following details:
20220529:10:18:40:030226 gprecoverseg:e9a331119f23268:gpadmin-[INFO]:-Starting gprecoverseg with args: -aF
(...)
20220529:10:19:01:030226 gprecoverseg:e9a331119f23268:gpadmin-[INFO]:-Updating configuration to mark mirrors up
20220529:10:19:01:030226 gprecoverseg:e9a331119f23268:gpadmin-[INFO]:-Updating primaries
20220529:10:19:01:030226 gprecoverseg:e9a331119f23268:gpadmin-[INFO]:-Commencing parallel primary conversion of 1 segments, please wait...
20220529:10:34:46:030226 gprecoverseg:e9a331119f23268:gpadmin-[INFO]:-Process results...
20220529:10:34:46:030226 gprecoverseg:e9a331119f23268:gpadmin-[WARNING]:-Failed to inform primary segment of updated mirroring state.  Segment: 8d7f44501e602d8:/data3/primary/gpseg60:content=60:dbid=62:mode=r:status=u: REASON: Conversion failed.  stdout:""  stderr:"peer shut down connection before response was fully received  Retrying no 1  failure: Error: MirroringFailure failure: Error: MirroringFailure "
- The above error is the result of the recovery target (the down segment being recovered) shutting down because filerep is failing.

[ mirror (recovery target) log ]
2022-05-29 10:24:44.230248 UTC,,,p10991,th-294840448,,,2000-01-01 00:00:00 UTC,0,,,seg-1,,,,,"WARNING","XX000","receive EOF on connection: Success (cdbfilerepconnserver.c:333)",,,,,,,0,,"cdbfilerepconnserver.c",333,
2022-05-29 10:24:44.304185 UTC,,,p10993,th-294840448,,,,0,,,seg-1,,,,,"LOG","00000","'set segment state', mirroring role 'mirror role' mirroring state 'resync' segment state 'in fault' filerep state 'fault' process name(pid) 'mirror consumer writer process(10993)' 'cdbfilerep.c' 'L2444' 'FileRep_SetSegmentState'",,,,,,,0,,"cdbfilerep.c",1824,"Stack trace:
1    0x96600b postgres errstart (elog.c:521)
2    0xa17ff4 postgres FileRep_InsertConfigLogEntryInternal (cdbfilerep.c:1806)
3    0xa1b67c postgres FileRepSubProcess_SetState (cdbfilerepservice.c:555)
4    0xa1b952 postgres FileRepSubProcess_ProcessSignals (cdbfilerepservice.c:281)
- The filerep process on the mirror shut down because the primary segment (recovery source) has a missing file and PANICs.

[ primary (recovery source) log ]
2022-05-29 10:24:44.222589 UTC,,,p15891,th-1524660352,,,,0,,,seg-1,,,,,"PANIC","58P01","could not open relation 1663/2379506/6450393: No such file or directory",,,,,,,0,,"md.c",1478,"Stack trace:
1    0x96600b postgres errstart (elog.c:521)
2    0x832e3f postgres <symbol not found> (md.c:1471)
3    0x8343ea postgres mdnblocks (md.c:1651)
4    0xa8d0fb postgres PersistentFileSysObj_MarkWholeMirrorFullCopy (cdbpersistentfilesysobj.c:4488)
5    0xa2e0e3 postgres FileRepPrimary_StartResyncManager (cdbfilerepresyncmanager.c:902)
6    0xa1bf08 postgres FileRepSubProcess_Main (cdbfilerepservice.c:867)
7    0xa16426 postgres <symbol not found> (cdbfilerep.c:2667)
8    0xa1b105 postgres FileRep_Main (cdbfilerep.c:3571)
9    0x58a7e3 postgres AuxiliaryProcessMain (bootstrap.c:513)
10   0x7d85db postgres <symbol not found> (postmaster.c:7406)
11   0x7dcf5c postgres StartFilerepProcesses (postmaster.c:1622)
12   0x7e68b9 postgres doRequestedPrimaryMirrorModeTransitions (primary_mirror_mode.c:1760)
13   0x7e1681 postgres <symbol not found> (postmaster.c:2465)
14   0x7e38ca postgres PostmasterMain (postmaster.c:1533)
15   0x4cdce7 postgres main (main.c:206)
16   0x7fb5a06fe555 libc.so.6 __libc_start_main + 0xf5
17   0x4ce29c postgres <symbol not found> + 0x4ce29c
"

[ checks on the recovery source segment ]
- Note the database OID from the PANIC error message in the log: it is the second number in "1663/2379506/6450393", i.e. 2379506 (the first number, 1663, is the tablespace OID).
- Note the relfilenode from the PANIC error message in the log: it is the third number, i.e. 6450393.
- Connect to the primary segment (recovery source) in utility mode and check for the relfilenode:
$ PGOPTIONS='-c gp_session_role=utility' psql -h <primary_segment_host> -p <primary_segment_port>
psql (8.3.23)
Type "help" for help.
gpadmin=# SELECT oid,datname from pg_database where oid = 2379506;
   oid   | datname
---------+---------
 2379506 | prod
(1 row)
gpadmin=# \c prod
You are now connected to database "prod" as user "gpadmin".
prod=# SELECT * from pg_class where relfilenode = 6450393;
(0 rows)
This shows that the relfilenode does not exist in the catalog. The recovery process is trying to synchronize a file that does not exist on the filesystem and does not belong to any relation, which causes gprecoverseg to fail.
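
The file's absence can also be confirmed directly on the recovery source host. This is a hedged sketch: it assumes the relation is in the default tablespace (OID 1663 in the PANIC message is pg_default) and uses the segment data directory shown in the gprecoverseg output (/data3/primary/gpseg60 in this example):
$ ls -l /data3/primary/gpseg60/base/2379506/6450393*
If the file is indeed missing, ls reports "No such file or directory", matching the PANIC above.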

- Check the persistent tables:
gpadmin=# SELECT * from gp_persistent_relation_node where database_oid=2379506 and relfilenode_oid=6450393;
(0 rows)


Environment

Product Version: 5.28

Resolution

Run REINDEX, VACUUM, ANALYZE, and then REINDEX again on the pg_class, pg_type, and pg_attribute tables on the affected primary segment (recovery source):
$ PGOPTIONS='-c gp_session_role=utility' psql -h <primary_segment_host> -p <primary_segment_port> -d <database_name>
# REINDEX TABLE pg_class;
# REINDEX TABLE pg_type;
# REINDEX TABLE pg_attribute;
# VACUUM pg_class;
# VACUUM pg_type;
# VACUUM pg_attribute;
# ANALYZE pg_class;
# ANALYZE pg_type;
# ANALYZE pg_attribute;
# REINDEX TABLE pg_class;
# REINDEX TABLE pg_type;
# REINDEX TABLE pg_attribute;
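
Once the catalog maintenance above completes without errors, segment recovery can typically be retried. As an example (using the same full-recovery invocation seen in the original log), run from the master:
$ gprecoverseg -aF
Resynchronization progress can then be monitored with gpstate -e until the segment pair returns to a synchronized state.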