GPCC agent failed to start on part of servers

Article ID: 295916


Products

VMware Tanzu Greenplum

Issue/Introduction

Symptoms:
The GPCC agent fails to start on some hosts, and restarting does not help. Although gpcc start reports that all agents started successfully, gpcc status shows the agents stopped on two of the three hosts:
[gpadmin@dw-greenplum-1 ~]$ gpcc start
Starting the gpcc agents and webserver...
2019/03/18 16:47:43 Agent successfully started on 3/3 hosts
2019/03/18 16:47:43 View Greenplum Command Center at http://dw-greenplum-1:28080

[gpadmin@dw-greenplum-1 ~]$ gpcc status
2019/03/18 16:47:49 GPCC webserver: running
2019/03/18 16:47:49 GPCC agents: 1/3 agents running
2019/03/18 16:47:49 Agent is stopped on dw-greenplum-3
2019/03/18 16:47:49 Agent is stopped on dw-greenplum-2
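You can also verify directly on an affected host whether the agent process is running. A minimal check (the process name ccagent is an assumption here and may vary by GPCC version):

[gpadmin@dw-greenplum-2 ~]$ ps -ef | grep -i [c]cagent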

Environment


Cause

Since the issue involves the segment agents, we need to check the agent log located in the GPCC home directory on the affected segment server:
[gpadmin@dw-greenplum-3 logs]$ cat agent.log
2019/02/17 00:06:40 connect to rpc server dw-greenplum-1:8899
2019/02/17 00:06:43 Agent cannot start due to no RPC connectionrpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 172.16.76.149:8899: connect: no route to host"

The agent log shows that the segment server could not communicate with 172.16.76.149:8899, so the issue is likely related to this IP address. Pinging the IP from the segment server fails as well ("no route to host"), which suggests the address is not reachable from this host. The next question is where the segment gets this address from.
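For example, a quick reachability test from an affected segment host, using the IP from the log above (in this scenario it fails with "no route to host"):

[gpadmin@dw-greenplum-2 ~]$ ping -c 3 172.16.76.149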

Checking the /etc/hosts file on the segment server, we can see the IP is present and is mapped to the master host (mdw):

[root@dw-greenplum-2 ~]# cat /etc/hosts
127.0.0.1               localhost.localdomain localhost
::1             localhost6.localdomain6 localhost6
172.16.76.150   dw-greenplum-2  sdw1
172.16.76.151 dw-greenplum-3 sdw2
172.16.76.149 dw-greenplum-1 mdw

So name resolution is not the problem: the segment server correctly maps 172.16.76.149 to mdw. The remaining explanation is that no server in the cluster is using this IP address anymore.
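Before concluding, you can also probe the agent's RPC port directly from the segment host; with name resolution working, a connection failure here points at the target host itself (nc is assumed to be installed; port 8899 is taken from the agent log):

[gpadmin@dw-greenplum-2 ~]$ nc -vz 172.16.76.149 8899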

You can confirm this conclusion by checking ifconfig and the /etc/hosts file on the master:

[root@dw-greenplum-1 logs]# ifconfig | less
eth0      Link encap:Ethernet  HWaddr 00:0C:29:B0:25:F7  
          inet addr:172.16.76.152  Bcast:172.16.76.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:feb0:25f7/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:264197 errors:0 dropped:0 overruns:0 frame:0
          TX packets:174656 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:251842110 (240.1 MiB)  TX bytes:73770998 (70.3 MiB)

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:196812 errors:0 dropped:0 overruns:0 frame:0
          TX packets:196812 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:341777495 (325.9 MiB)  TX bytes:341777495 (325.9 MiB)


[root@dw-greenplum-1 logs]# cat /etc/hosts
127.0.0.1               localhost.localdomain localhost
::1             localhost6.localdomain6 localhost6
172.16.76.152   dw-greenplum-1  mdw
172.16.76.150 dw-greenplum-2 sdw1
172.16.76.151 dw-greenplum-3 sdw2

The true IP address of mdw is 172.16.76.152, not 172.16.76.149. This is the root cause of the issue: the master's IP address has changed, but the segment servers' /etc/hosts files still point to the old one.
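As a quick cross-check, you can compare what the segment resolves for mdw against the master's actual address (standard Linux tooling is assumed):

[gpadmin@dw-greenplum-2 ~]$ getent hosts mdw     # returns the stale 172.16.76.149
[root@dw-greenplum-1 ~]# hostname -I             # shows the real 172.16.76.152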

Resolution

Update the mdw entry in the /etc/hosts file of each affected segment server to the master's current IP address, then restart GPCC.
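A minimal sketch of the fix, using the example addresses from this article (run as root on each affected segment; the sed pattern assumes the single-space layout shown above, and editing the file by hand works just as well):

[root@dw-greenplum-2 ~]# cp /etc/hosts /etc/hosts.bak
[root@dw-greenplum-2 ~]# sed -i 's/^172\.16\.76\.149 /172.16.76.152 /' /etc/hosts

Then restart GPCC from the master so the agents reconnect:

[gpadmin@dw-greenplum-1 ~]$ gpcc stop
[gpadmin@dw-greenplum-1 ~]$ gpcc start

After the restart, gpcc status reports all agents running: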
[gpadmin@dw-greenplum-1 ~]$ gpcc status
2019/03/18 16:57:52 GPCC webserver: running
2019/03/18 16:57:53 GPCC agents: 3/3 agents running