GPCC Startup Fails or Aborts After Segment Host Crash

Article ID: 439601

Updated On:

Products

VMware Tanzu Data Intelligence, VMware Tanzu Greenplum, VMware Tanzu Greenplum / Gemfire

Issue/Introduction

After a segment host crashes or goes offline and becomes unreachable on the network, Greenplum Command Center (GPCC) fails to start.

Environment

GPCC 6.15

Cause

When executing the gpcc start command, the process may hang and eventually fail. A review of the GPCC logs indicates that the Agent Manager initialization has been aborted, often accompanied by SSH connection timeouts or host unreachable errors pointing to the downed segment host.

This issue occurs due to how GPCC validates the database cluster topology during its startup routine:

  1. Catalog Query: During initialization (StartAgentManager), GPCC dynamically reads the database's gp_segment_configuration system catalog to generate a complete list of all segment hosts.

  2. Lack of Status Filtering: GPCC does not currently filter this list based on the segment's actual status. It pulls the host list regardless of whether the database has marked the segments on that host as "up" or "down."

  3. SSH Verification: For every host retrieved from the catalog, GPCC attempts to establish an SSH connection to resolve the operating system hostname.

  4. Initialization Abort: Because the crashed host is offline, the SSH connection attempt times out or fails. GPCC treats this network failure as a critical error and immediately aborts the entire Agent Manager initialization, halting the GPCC startup.
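
The four steps above can be sketched as a small simulation. This is only an illustration of the described behavior, not GPCC's actual code: the segment rows, the host names (sdw1, sdw2, sdw3), and every function name are hypothetical, and the SSH lookup is replaced by a stand-in that fails for the crashed host.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical (hostname, status) rows in the style of gp_segment_configuration,
# where status 'u' means up and 'd' means down.
SEGMENTS: List[Tuple[str, str]] = [
    ("sdw1", "u"),
    ("sdw1", "u"),
    ("sdw2", "d"),
    ("sdw2", "d"),
    ("sdw3", "u"),
]


def all_hosts(rows: Iterable[Tuple[str, str]]) -> List[str]:
    """Distinct hosts with no status filter, as described in step 2."""
    return sorted({host for host, _status in rows})


def hosts_with_up_segments(rows: Iterable[Tuple[str, str]]) -> List[str]:
    """Distinct hosts with at least one segment marked up.

    Shown only for contrast: per the article, GPCC does not apply this filter.
    """
    return sorted({host for host, status in rows if status == "u"})


def init_agent_manager(hosts: Iterable[str],
                       resolve: Callable[[str], str]) -> Dict[str, str]:
    """Resolve every host; the first failure aborts the whole initialization."""
    resolved: Dict[str, str] = {}
    for host in hosts:
        resolved[host] = resolve(host)  # raises for an unreachable host
    return resolved


def fake_resolve(host: str) -> str:
    """Stand-in for the SSH hostname lookup; sdw2 plays the crashed host."""
    if host == "sdw2":
        raise TimeoutError(f"ssh to {host} timed out")
    return f"{host}.example.internal"


if __name__ == "__main__":
    try:
        init_agent_manager(all_hosts(SEGMENTS), fake_resolve)
    except TimeoutError as exc:
        print("startup aborted:", exc)
```

In this sketch the unfiltered list includes sdw2, so a single failed lookup stops the whole run, while the filtered list would let initialization finish. That contrast is the reason the crashed host has to be recovered before GPCC can start.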

Resolution

  1. Verify Cluster State: Check the status of all segments using gpstate -s to confirm which host/segments are down.
  2. Inspect Detailed Logs: Review the GPCC logs in the directory referenced in the error (for example, /bb/bin/gplum/greenplum-cc-6.16.1/logs) to identify whether a specific host is timing out or causing the Agent Manager to hang.
  3. Perform Segment Recovery: If segments on the crashed host are down, attempt recovery using gprecoverseg once the host is back online.
  4. Restart GPCC: Once the segments are recovered, run gpcc start again.
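
Before restarting GPCC, it can help to confirm that every segment host accepts connections on its SSH port again. The following is a minimal sketch of such a pre-check, not a GPCC or Greenplum utility. It assumes SSH listens on port 22, the function names are hypothetical, and the host list must be supplied by the administrator (for example, from the gpstate -s output).

```python
import socket
from typing import Callable, Iterable, List


def ssh_port_open(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Port 22 is an assumption; adjust it if sshd listens elsewhere.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and name resolution failures.
        return False


def unreachable_hosts(hosts: Iterable[str],
                      probe: Callable[[str], bool] = ssh_port_open) -> List[str]:
    """Return the hosts for which the probe fails, in input order."""
    return [host for host in hosts if not probe(host)]
```

Calling unreachable_hosts with the cluster's segment host names returns the hosts that still fail the probe. An empty result suggests the network-level condition behind the startup abort has cleared. It does not confirm that SSH authentication works or that the segments have been recovered.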