Master may go into recovery mode if segment host has hardware issue

Article ID: 296787


Updated On:

Products

VMware Tanzu Greenplum

Issue/Introduction

Summary:

- By design, the Greenplum QD (Query Dispatcher, on the master node) dispatches tasks to the QEs (Query Executors, on the segment nodes).
- When a host has a hardware issue, for example an IO device that is not responding, the QE might not be able to respond to the master because of high IO wait.
- When the QD tries to commit an ongoing transaction and a QE does not respond to the QD for a long time, the operation is marked as failed.
- As a result, the master goes into recovery mode to ensure consistency across all segments.
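The sequence above can be illustrated with a toy model. This is only a sketch and not Greenplum code: the function name, the `CoordinatorPanic` exception, the segment callables, and the retry count are all hypothetical, chosen to show how a coordinator that cannot get an acknowledgement from every participant ends up failing the whole broadcast.

```python
class CoordinatorPanic(Exception):
    """Stands in for the master PANIC; hypothetical, not a Greenplum type."""


def broadcast_commit_prepared(segments, gid, max_retries=3):
    """Ask every segment to commit the prepared transaction `gid`.

    `segments` maps a segment name to a callable that returns True when
    the segment acknowledges the commit and False when it does not
    respond in time (for example, because of high IO wait).
    Returns the number of retries that were needed.
    """
    pending = set(segments)
    for attempt in range(1 + max_retries):
        # Keep only the segments that still have not acknowledged.
        pending = {name for name in pending if not segments[name](gid)}
        if not pending:
            return attempt
    # Retries exhausted: the coordinator cannot guarantee consistency.
    raise CoordinatorPanic(
        f"unable to complete 'Commit Prepared' broadcast: gid={gid}, "
        f"unresponsive={sorted(pending)}"
    )
```

In this model a single segment that never answers is enough to raise the exception, which mirrors why one unhealthy host can affect the whole cluster.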

An example log entry:
  2023-04-07 00:36:43.760627 EDT,"xxxxx","xxxxx",p779359,th1757833152,"xx.xx.xx.xx","33712",2023-04-07 00:28:55 EDT,103178757,con6721,,seg-1,,dx6072,x103178757,sx1,
  "PANIC","XX000","unable to complete 'Commit Prepared' broadcast (cdbtm.c:663)","gid=xxxxxxx-xxxxxxxx, state=Retry Commit Prepared",,,,,
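To check whether a master log contains this entry, a small script can scan the CSV-formatted log text. The sketch below assumes each log record is one CSV row, as in the example entry; the function name is hypothetical, and it deliberately avoids relying on fixed column positions by matching on field contents instead.

```python
import csv
import io

# Message text taken from the example log entry in this article.
PANIC_TEXT = "unable to complete 'Commit Prepared' broadcast"


def find_commit_prepared_panics(log_text):
    """Return the CSV records that report the Commit Prepared PANIC."""
    hits = []
    for record in csv.reader(io.StringIO(log_text)):
        # A matching record has a severity field equal to PANIC and a
        # message field that contains the broadcast failure text.
        is_panic = "PANIC" in record
        has_message = any(PANIC_TEXT in field for field in record)
        if is_panic and has_message:
            hits.append(record)
    return hits
```

The first field of each returned record is the timestamp, which can then be compared against hardware or OS events on the segment hosts.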

 

Based on the design of Greenplum, this is expected behavior. As mentioned above, the master must ensure consistency across all segments, so it has to roll back all the transactions.

IMPORTANT NOTE:
The example master PANIC string (unable to complete 'Commit Prepared') can also be triggered by causes other than hardware issues. Examples include a reboot of the segment host, or a software bug that prevents the QE from responding to the QD in time.

Suggestion:
Proactively run health checks on all hosts to avoid this kind of PANIC when the cause is a hardware issue.

Environment

Product Version: 6.19