Customers may occasionally find that the metrics collector does not start after a segment failure and recovery using gprecoverseg or gprecoverseg -r. To work around this issue, customers can start the metrics collector manually.
The following error can be seen in the logs:
2025-08-05 07:02:50.851618 EDT,"gpmon","gpperfmon",p1316779,th1713226368,"10.10.10.1","52170",2025-08-05 06:27:17 EDT,0,con280114,cmd1700,seg-1,,,,sx1,"ERROR","38000","external table gpcc_mc_checker_segment_hosts command ended with error. (seg56 slice2 10.10.10.13:45000 pid=1931774)","Command: execute:ps -ef | grep postgres | grep metrics | grep -o ""postgres: .*,"" | grep -o ""[0-9]\+""",,,,,"(SELECT g.hostname, g.content, checker.port FROM gpmetrics.gpcc_mc_checker_coordinator checker INNER JOIN gp_segment_configuration g ON checker.gp_segment_id=g.content WHERE g.role = 'p' AND g.status='u')
Greenplum Command Center (GPCC) 6.13.x and 6.14.x
MetricsCollector: 6.13_gp_6.28.1
This is a bug in GPCC v6.13 and v6.14.
It occurs when no metrics_collector process is found.
When that happens, grep matches nothing, so the following command in the gpcc_mc_checker_segment_hosts external table returns exit code 1,
ps -ef | grep postgres | grep metrics | grep -o "postgres: .*," | grep -o "[0-9]\+"
which fails the SQL call that checks whether metrics_collector is running and prevents any further action to restart the missing metrics_collector processes.
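The exit-code behavior can be reproduced without a database. The following is a minimal sketch that feeds two sample process lines through the same filter chain. The sample lines and the extract_ports helper are illustrative assumptions, not output captured from a real host or code taken from GPCC.

```shell
#!/bin/bash
# Sample process lines standing in for `ps -ef` output (illustrative only).
with_mc='gpadmin 2002 1 0 07:00 ? 00:00:01 postgres:  6000, metrics collector process'
without_mc='gpadmin 1001 1 0 07:00 ? 00:00:01 postgres:  6000, checkpointer process'

# The same filter chain the checker runs, applied to text on stdin.
# extract_ports is a hypothetical helper name used only for this demonstration.
extract_ports() {
    grep postgres | grep metrics | grep -o "postgres: .*," | grep -o "[0-9]\+"
}

# Record the exit status of the chain for each case.
ports=$(printf '%s\n' "$with_mc" | extract_ports) && status_present=0 || status_present=$?
printf '%s\n' "$without_mc" | extract_ports && status_missing=0 || status_missing=$?

echo "collector present: ports=${ports} exit=${status_present}"
echo "collector missing: exit=${status_missing}"
```

When no line contains "metrics", the last grep in the chain receives no input and exits with status 1, which is the status the external table command reports as an error.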
The attached script, start_mc.sh, checks whether the metrics collector is running for a given segment and safely starts it if none is found.
The script must be executed by the gpadmin user on the coordinator host to ensure proper permissions and access to the database.
Copy the script to the coordinator host and change its permissions so that gpadmin can read and execute it.
Run the script with the segment ID (the content value from gp_segment_configuration) as an input parameter:
./start_mc.sh <segID>
NOTE: The start_mc.sh script is attached to this KB.
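If several segments were recovered, the script can be called once per primary segment. The following is a rough sketch rather than part of the attached script: the names START_MC, list_primary_segments, and start_all are hypothetical, and the query assumes that content >= 0 selects the segments and excludes the coordinator.

```shell
#!/bin/bash
# Path to the attached script; START_MC is an illustrative variable name.
START_MC=${START_MC:-./start_mc.sh}

# Print the content ID of every primary segment, one per line.
# Assumption: content >= 0 excludes the coordinator (content = -1).
list_primary_segments() {
    psql -At -c "select content from gp_segment_configuration where role='p' and content >= 0 order by content" postgres
}

# Read segment IDs from stdin and run the script for each one.
start_all() {
    local segid failed=0
    while read -r segid; do
        [ -n "$segid" ] || continue
        # Redirect stdin so ssh inside the script cannot consume the ID list.
        "$START_MC" "$segid" </dev/null || failed=$((failed + 1))
    done
    echo "segments with errors: ${failed}"
}

# Usage on the coordinator host as gpadmin:
#   list_primary_segments | start_all
```

Because the script leaves a running metrics collector untouched, calling it for segments that are already healthy should be harmless.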
Contents of the start_mc.sh script:
#!/bin/bash
set -euo pipefail

# Segment content ID. Default to empty so that "set -u" does not abort
# before the friendly error message below.
segid=${1:-}
if [ -z "${segid}" ]; then
    echo "no segid provided, exit"
    exit 1
fi

# Look up the port of the primary for this content ID (-A -t: unaligned, tuples only).
port=$(psql -At -c "select port from gp_segment_configuration where role='p' and content=${segid}" postgres)
if [ -z "${port}" ]; then
    echo "no port found for segment $segid, exit"
    exit 1
fi

# Look up the host of the primary for this content ID.
seghost=$(psql -At -c "select hostname from gp_segment_configuration where role='p' and content=${segid}" postgres)
if [ -z "${seghost}" ]; then
    echo "no host found for segment $segid, exit"
    exit 1
fi

echo "looking for metrics_collector on segment ${segid}/${seghost}, with port ${port}..."
# Count metrics collector processes for this port; [m]etrics keeps grep from matching itself.
mccount=$(ssh "${seghost}" "ps aux | grep postgres | grep '[m]etrics' | grep ${port} | wc -l")
if [ "$(echo "${mccount}" | tr -d '[:space:]')" = "0" ]; then
    echo "no metrics_collector found on segment ${segid}/${seghost}, start it..."
    psql -c "select gpmetrics.metrics_collector_restart_worker(${segid})" gpperfmon
else
    echo "metrics_collector is already there on segment ${segid}/${seghost}, won't start."
fi