FATAL- cgroup Is Not Properly Configured, Gives The Error, "can not find cgroup mount point"

Products

VMware Tanzu Greenplum

Issue/Introduction

Symptoms:

Greenplum provides two resource management schemes:

Resource Queues
Resource Groups

This article applies to Greenplum configured with Resource Groups. To verify which scheme is configured, check the "postgresql.conf" file:

$ grep gp_resource_manager $MASTER_DATA_DIRECTORY/postgresql.conf
gp_resource_manager='group'
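
If "postgresql.conf" contains commented-out or repeated entries, a plain grep can return several lines. The following is a minimal sketch of a helper that prints only the value in effect. The function name effective_resource_manager is hypothetical (it is not part of Greenplum), and the sketch assumes the parameter is written in the usual name = value form:

```shell
# Print the effective gp_resource_manager value from a postgresql.conf file.
# Commented-out lines are skipped; when the parameter appears more than once,
# the last occurrence in the file is the one reported.
# Hypothetical helper, not a Greenplum utility.
effective_resource_manager() {
    conf_file=$1
    sed -n "s/^[[:space:]]*gp_resource_manager[[:space:]]*=[[:space:]]*'\{0,1\}\([A-Za-z]*\).*/\1/p" "$conf_file" |
        tail -n 1
}

# Example:
#   effective_resource_manager "$MASTER_DATA_DIRECTORY/postgresql.conf"
```

Because the helper only reads the file, it can be used while the master is down.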

Note: A cgroup error prevents the segments and/or the master from starting. If the problem is on the master, the master cannot be started even in master-only mode, so the "gpconfig" command cannot be used to verify the current resource management scheme. For that reason, it is better to check "postgresql.conf" directly.

When Greenplum is started, it hangs and then fails with the following output:

20180118:00:23:34:014090 gpstart:gpdb-sandbox:gpadmin-[INFO]:-Starting gpstart with args: -a -m
20180118:00:23:34:014090 gpstart:gpdb-sandbox:gpadmin-[INFO]:-Gathering information and validating the environment...
20180118:00:23:34:014090 gpstart:gpdb-sandbox:gpadmin-[INFO]:-Greenplum Binary Version: 'postgres (Greenplum Database) 5.3.0 build commit:2155c5a8cf8bb7f13f49c6e248fd967a74fed591'
20180118:00:23:34:014090 gpstart:gpdb-sandbox:gpadmin-[INFO]:-Greenplum Catalog Version: '301705051'
20180118:00:23:34:014090 gpstart:gpdb-sandbox:gpadmin-[INFO]:-Master-only start requested in configuration without a standby master.
20180118:00:23:34:014090 gpstart:gpdb-sandbox:gpadmin-[INFO]:-Starting Master instance in admin mode
...
20180118:00:33:36:014090 gpstart:gpdb-sandbox:gpadmin-[CRITICAL]:-Failed to start Master instance in admin mode
20180118:00:33:36:014090 gpstart:gpdb-sandbox:gpadmin-[CRITICAL]:-Error occurred: non-zero rc: 1
Command was: 'env GPSESSID=0000000000 GPERA=None $GPHOME/bin/pg_ctl -D /gpdata/master/gpseg-1 -l /gpdata/master/gpseg-1/pg_log/startup.log -w -t 600 -o " -p 5432 --gp_dbid=1 --gp_num_contents_in_cluster=0 --silent-mode=true -i -M master --gp_contentid=-1 -x 0 -c gp_role=utility " start'
rc=1, stdout='waiting for server to start........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................... stopped waiting
', stderr='pg_ctl: could not start server
Examine the log output.

You will find the following error in the startup log:

Error Message:

$ less $MASTER_DATA_DIRECTORY/pg_log/startup.log
...
2018-01-18 00:10:40.952175 GMT,,,p14026,th-1032046816,,,,0,,,seg-1,,,,,"FATAL","XX000","cgroup is not properly configured: can not find cgroup mount point (resgroup-ops-linux.c:610)",,,,,,,,"detectCgroupMountPoint","resgroup-ops-linux.c",610,1 0x95467b postgres errstart + 0x1db
2 0x9563f4 postgres elog_finish + 0xb4
3 0x9941c6 postgres ResGroupOps_Bless (resgroup-ops-linux.c:601)
4 0xa95917 postgres gpvars_assign_gp_resource_manager_policy (cdbvars.c:1251)
5 0x9739e2 postgres <symbol not found> (guc.c:0)
6 0x9774bd postgres set_config_option + 0x1dd
7 0x97dd64 postgres ProcessConfigFile + 0x304
8 0x97e0fe postgres SelectConfigFiles + 0x8e
9 0x7d3486 postgres PostmasterMain (postmaster.c:1113)
10 0x70c057 postgres main (main.c:206)
11 0x7f9fbe1b5d1d libc.so.6 __libc_start_main + 0xfd
12 0x4c86d5 postgres <symbol not found> + 0x4c86d5
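
To confirm quickly whether a startup log contains this error, without paging through it, the log can be filtered for the message text. The sketch below assumes the comma-separated log format shown above, where the first field is the timestamp. The function name cgroup_fatal_timestamps is hypothetical:

```shell
# Print the timestamp (first comma-separated field) of every log line that
# reports the cgroup configuration error. Prints nothing when the error is
# absent. Hypothetical helper, not a Greenplum utility.
cgroup_fatal_timestamps() {
    log_file=$1
    grep -F 'cgroup is not properly configured' "$log_file" | cut -d, -f1
}

# Example:
#   cgroup_fatal_timestamps "$MASTER_DATA_DIRECTORY/pg_log/startup.log"
```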

Environment

Cause

Greenplum with resource groups relies on cgroups being enabled and mounted. The following checks have to pass on a segment or master host in order for the Greenplum instance to start:

1) The control group service (cgconfig) has to be running:

# service cgconfig status
Running

2) By default, only the cpu and cpuacct resources are used. They have to be mounted and listed as available cgroups.

To check which resources gpdb uses, run:

# grep -e "cpuset\|cpu\|cpuacct\|memory\|devices\|freezer\|net_cls\|blkio" /etc/cgconfig.d/gpdb.conf | cut -d '{' -f 1
cpu
cpuacct

Check if cpu and cpuacct are mounted and available:

# cat /proc/mounts | grep cgroup
cgroup /cgroup/cpuset cgroup rw,relatime,cpuset 0 0
cgroup /cgroup/cpu cgroup rw,relatime,cpu 0 0
cgroup /cgroup/cpuacct cgroup rw,relatime,cpuacct 0 0
cgroup /cgroup/memory cgroup rw,relatime,memory 0 0
cgroup /cgroup/devices cgroup rw,relatime,devices 0 0
cgroup /cgroup/freezer cgroup rw,relatime,freezer 0 0
cgroup /cgroup/net_cls cgroup rw,relatime,net_cls 0 0
cgroup /cgroup/blkio cgroup rw,relatime,blkio 0 0

# lscgroup
cpuset:/
cpu:/
cpu:/gpdb
cpuacct:/
cpuacct:/gpdb
memory:/
devices:/
freezer:/
net_cls:/
blkio:/
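
The mount check above can also be scripted. The following is a minimal sketch that reports which required controllers are missing from a mounts file in the /proc/mounts format. The function name missing_cgroup_controllers is hypothetical. It compares each mount option exactly, so "cpu" is not confused with "cpuset" or "cpuacct":

```shell
# Print every requested controller that is NOT listed in the options of any
# cgroup mount in the given mounts file (same format as /proc/mounts).
# Prints nothing when all requested controllers are mounted.
# Hypothetical helper, not part of libcgroup or Greenplum.
missing_cgroup_controllers() {
    mounts_file=$1
    shift
    for controller in "$@"; do
        awk -v want="$controller" '
            $3 == "cgroup" {
                # Field 4 holds the comma-separated mount options.
                n = split($4, opts, ",")
                for (i = 1; i <= n; i++)
                    if (opts[i] == want) found = 1
            }
            END { exit !found }
        ' "$mounts_file" || printf '%s\n' "$controller"
    done
}

# Example:
#   missing_cgroup_controllers /proc/mounts cpu cpuacct
```

An empty result means that both controllers are mounted. Any name printed points to a controller that still needs to be mounted before Greenplum can start.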

Resolution

Most of the time, the issue can be fixed by restarting the cgconfig service:

# service cgconfig restart

After the restart, verify that cpu and cpuacct appear in /proc/mounts and in the lscgroup output.

Note: cgroups do not have a log file. So far, the only known problem with restarting the service occurs when a shell session has its working directory inside a cgroup subfolder:

# pwd
/cgroup/cpu
[root@gpdb-sandbox cpu]# service cgconfig restart
Stopping cgconfig service: cgclear failed with Device or resource busy
[ OK ]
Starting cgconfig service: Error: cannot mount cpu to /cgroup/cpu: Device or resource busy
/sbin/cgconfigparser; error loading /etc/cgconfig.conf: Cgroup mounting failed
/sbin/cgconfigparser; error loading /etc/cgconfig.d/gpdb.conf: Cgroup one of the needed subsystems is not mounted
Failed to parse /etc/cgconfig.conf or /etc/cgconfig.d [FAILED]

Make sure that no shell session has its working directory in a /cgroup subfolder:

# /usr/sbin/lsof | grep /cgroup
bash 14193 root cwd DIR 0,19 0 47480 /cgroup/cpu 
# cd /
# /usr/sbin/lsof | grep /cgroup
#
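
If lsof is not installed, the same check can be approximated by reading the cwd links under /proc. The following is a minimal sketch, assuming a Linux /proc layout and root privileges (needed to read the links of other users' processes). The function name pids_with_cwd_under is hypothetical:

```shell
# Print the PID of every process whose current working directory is the given
# directory or lies below it. The second argument (default: /proc) exists only
# so the function can be pointed at a different proc root for testing.
# Hypothetical helper, not part of any package.
pids_with_cwd_under() {
    target=$1
    proc_root=${2:-/proc}
    for cwd_link in "$proc_root"/[0-9]*/cwd; do
        # Skip processes that exited or whose link cannot be read.
        cwd=$(readlink "$cwd_link" 2>/dev/null) || continue
        case $cwd in
            "$target"|"$target"/*)
                pid_dir=${cwd_link%/cwd}
                printf '%s\n' "${pid_dir##*/}"
                ;;
        esac
    done
}

# Example:
#   pids_with_cwd_under /cgroup
```

Any PID printed belongs to a process that should change directory (or be stopped) before the cgconfig service is restarted.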