How to manually restart the Metrics collector without restarting the Greenplum Database.

Article ID: 400524

Updated On:

Products

VMware Tanzu Data Suite VMware Tanzu Greenplum VMware Tanzu Greenplum / Gemfire

Issue/Introduction

Customers may occasionally find that the metrics collector does not start after a segment failure and subsequent recovery with gprecoverseg or gprecoverseg -r. To work around this issue, the metrics collector can be started manually.

The following error can be seen in the logs:

2025-08-05 07:02:50.851618 EDT,"gpmon","gpperfmon",p1316779,th1713226368,"10.10.10.1","52170",2025-08-05 06:27:17 EDT,0,con280114,cmd1700,seg-1,,,,sx1,"ERROR","38000","external table gpcc_mc_checker_segment_hosts command ended with error.  (seg56 slice2 10.10.10.13:45000 pid=1931774)","Command: execute:ps -ef | grep postgres | grep metrics | grep -o ""postgres: .*,"" | grep -o ""[0-9]\+""",,,,,"(SELECT g.hostname, g.content, checker.port FROM gpmetrics.gpcc_mc_checker_coordinator checker INNER JOIN gp_segment_configuration g ON checker.gp_segment_id=g.content WHERE g.role = 'p' AND g.status='u')

Environment

Greenplum Command Center (GPCC) 6.13.x and 6.14.x

MetricsCollector: 6.13_gp_6.28.1

Cause

This is a bug in GPCC 6.13 and 6.14.

It occurs when no metrics_collector process is found.

In that case, the following command, run by the gpcc_mc_checker_segment_hosts external table, returns exit code 1, because grep exits with status 1 when it matches no lines:

ps -ef | grep postgres | grep metrics | grep -o "postgres: .*," | grep -o "[0-9]\+"

The non-zero exit code fails the SQL call that checks whether the metrics_collector processes exist, which in turn prevents the follow-up action that would restart the missing ones.
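The exit-code behaviour can be reproduced without a Greenplum cluster. The sketch below feeds the same filter chain a made-up ps line (the sample text, PID and port are illustrative only, not taken from a real system) in which no metrics_collector process appears:

```shell
#!/bin/bash
# Simulated "ps -ef" output for a host with no metrics_collector process.
# (Illustrative sample text only.)
ps_output='gpadmin 1234 1 0 07:00 ? 00:00:01 postgres: 45000, logger process'

# Same filter chain as the checker command, fed with the sample text.
# No line contains "metrics", so the later greps receive no input and
# the final grep exits with status 1 (no match).
status=0
result=$(printf '%s\n' "$ps_output" | grep postgres | grep metrics | grep -o "postgres: .*," | grep -o "[0-9]\+") || status=$?

echo "output='${result}' exit_status=${status}"
# prints: output='' exit_status=1
```

An empty result would be a perfectly valid answer here ("no collectors running"), but the status of 1 is what the caller sees as a command failure.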

Resolution

The attached script, start_mc.sh, checks whether the metrics collector is running for the specified segment and starts it only if none is found.

The script must be executed by the gpadmin user on the coordinator host to ensure proper permissions and access to the database.

Copy the script to the coordinator host and change its permissions so that gpadmin can read and execute it.

Run the script with the segment ID (the content value in gp_segment_configuration) as its input parameter:

./start_mc.sh <segID> 

NOTE: The start_mc.sh script is attached to this KB article.
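The permission and run steps can be sketched as a small wrapper. The function name run_start_mc is hypothetical and used only for this illustration, and 56 is just an example segment ID; the sketch assumes it is run as gpadmin on the coordinator host after the script has been copied there.

```shell
#!/bin/bash
# run_start_mc is a hypothetical helper name used only for this sketch.
# $1 = path to the copied start_mc.sh, $2 = segment (content) ID.
run_start_mc() {
	local script=$1 segid=$2
	if [ ! -f "${script}" ]; then
		echo "${script} not found; copy it to the coordinator host first" >&2
		return 1
	fi
	# Give the owner (expected to be gpadmin) read and execute permission.
	chmod u+rx "${script}"
	"${script}" "${segid}"
}

# Example call (56 is an example segment ID only):
#   run_start_mc ./start_mc.sh 56
```

Without the wrapper, the same steps are simply chmod u+rx start_mc.sh followed by ./start_mc.sh <segID>.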

Additional Information

Contents of the start_mc.sh script:

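#!/bin/bash
# Starts the metrics collector for one primary segment if it is not already running.
# Usage: ./start_mc.sh <segID>   (run as gpadmin on the coordinator host)

set -euo pipefail

# Default to an empty string so "set -u" does not abort before the check below.
segid=${1:-}
if [ -z "${segid}" ]; then
	echo "no segid provided, exit"
	exit 1
fi

# -A -t makes psql print the bare value, without alignment padding or headers.
port=$(psql -At -c "select port from gp_segment_configuration where role='p' and content=${segid}" postgres)
if [ -z "${port}" ]; then
	echo "no port found for segment $segid, exit"
	exit 1
fi
seghost=$(psql -At -c "select hostname from gp_segment_configuration where role='p' and content=${segid}" postgres)
if [ -z "${seghost}" ]; then
	echo "no host found for segment $segid, exit"
	exit 1
fi

echo "looking for metrics_collector on segment ${segid}/${seghost}, with port ${port}..."
# '[m]etrics' matches "metrics" but not this command's own entry in the ps output.
# Ending the pipeline with "wc -l" keeps its exit status 0 even when nothing matches.
mccount=$(ssh "${seghost}" "ps aux | grep postgres | grep '[m]etrics' | grep ${port} | wc -l")
if [ "$(echo "${mccount}" | tr -d '[:space:]')" = "0" ]; then
	echo "no metrics_collector found on segment ${segid}/${seghost}, start it..."
	psql -c "select gpmetrics.metrics_collector_restart_worker(${segid})" gpperfmon
else
	echo "metrics_collector is already there on segment ${segid}/${seghost}, won't start."
fi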
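
The script searches for the pattern [m]etrics rather than plain metrics. The sketch below illustrates why, using a made-up ps line for the remote check itself on a host where no metrics collector is running (the PIDs, port 45000 and the helper name count_matches are illustrative assumptions):

```shell
#!/bin/bash
# count_matches is a hypothetical helper for this illustration.
# $1 = the grep pattern used in the check ("metrics" or "[m]etrics").
count_matches() {
	local pattern=$1
	# How the check's own command line would appear in "ps aux" output.
	local self="gpadmin 3001 3000 0 07:05 ? 00:00:00 bash -c ps aux | grep postgres | grep ${pattern} | grep 45000 | wc -l"
	printf '%s\n' "${self}" | grep postgres | grep "${pattern}" | grep 45000 | wc -l | tr -d '[:space:]'
}

naive=$(count_matches 'metrics')      # the check matches its own command line
bracket=$(count_matches '[m]etrics')  # the literal text "[m]etrics" does not contain "metrics"
echo "naive=${naive} bracket=${bracket}"
# prints: naive=1 bracket=0
```

With the plain pattern, the check would count its own command line and wrongly conclude that a metrics collector is already running. The bracket form avoids that false positive.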

 

 

Attachments

start_mc.sh