Recover Etcd when "more than One Leader Exists"


Article ID: 297776


Products

VMware Tanzu Application Service for VMs

Issue/Introduction

Symptoms:

The following issue occurs with etcd when more than one leader exists.

The following error "more than one leader exists" is seen in the log:

/etcd_tls_server.0.job/job/etcd/etcd_consistency_checker.stderr.log

2017/01/11 18:40:29 more than one leader exists: [https://etcd-tls-server-1.cf-etcd.service.cf.internal:4001 https://etcd-tls-server-2.cf-etcd.service.cf.internal:4001 https://etcd-tls-server-3.cf-etcd.service.cf.internal:4001] 

 

 

Environment


Resolution

Make sure that there is only one leader across all of the nodes. Unfortunately, we have come across a scenario where the etcdctl tool reports that the cluster is healthy even though it is actually in a state where multiple leaders exist. This can be caused by a network partition.

There are several etcd clusters in PCF, so you will need to identify which cluster the affected VM is a member of.

SSH onto your Ops Manager VM and run the following:

  • bosh instances --ps 

This will display a list of the VMs that are running the etcd services.
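If the deployment contains many VMs, you can filter the output for the etcd processes. This is only an illustrative filter; adjust the pattern to match the job and process names used in your environment.

  • bosh instances --ps | grep -i etcd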

You can then bosh ssh onto one of these VMs and run the commands in the following sections to see which node is the leader and which nodes are members of the cluster.

Checking which host is the leader:

Run the following from your Ops Manager VM:

  • curl -k http://<IP_Address_of_etcd_node>:4001/v2/stats/leader | python -mjson.tool
  • curl -k http://<IP_Address_of_etcd_node>:4001/v2/stats/self | python -mjson.tool
Example output:

{
    "id": "68893871562c3e2e",
    "leaderInfo": {
        "leader": "68893871562c3e2e",
        "startTime": "2017-05-18T11:06:02.866578439Z",
        "uptime": "623h53m16.637407751s"
    },
    "name": "etcd-tls-server-0",
    "recvAppendRequestCnt": 0,
    "sendAppendRequestCnt": 0,
    "startTime": "2017-05-18T11:06:02.516168237Z",
    "state": "StateLeader"
}

Alternatively, you can run the command directly from the node you are on:

  • curl -k http://127.0.0.1:4001/v2/stats/self
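To compare the leader seen by each node in one pass, a small loop like the one below can be run from the Ops Manager VM. This is a minimal sketch that follows the curl examples above; the placeholder IPs are not real values and must be replaced with the addresses of your etcd nodes. In a healthy cluster every node reports the same leader ID.

  for ip in <IP_node_0> <IP_node_1> <IP_node_2>; do
    echo -n "$ip reports leader: "
    curl -sk http://$ip:4001/v2/stats/self | python -mjson.tool | grep '"leader"'
  done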

 

Displaying Cluster Health:

  • bosh ssh (Select an etcd VM)
  • sudo -i
  • /var/vcap/packages/etcd/etcdctl \
    --ca-file /var/vcap/jobs/etcd/config/certs/server-ca.crt \
    --cert-file /var/vcap/jobs/etcd/config/certs/client.crt \
    --key-file /var/vcap/jobs/etcd/config/certs/client.key \
    -C https://etcd-tls-server-0.cf-etcd.service.cf.internal:4001 cluster-health
cluster is healthy
member 5b17b9b8c8ecf1db is healthy
member 818096dea8a05251 is healthy
member f9a631500f5c2195 is healthy
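Because etcdctl can report the cluster as healthy even when a partition has produced more than one leader, it can be worth pointing the same cluster-health check at each node in turn. The following sketch simply repeats the command above against the three etcd-tls-server hostnames from the log excerpt; adjust the names and node count to match your deployment.

  for n in 0 1 2; do
    echo "--- checked from etcd-tls-server-$n ---"
    /var/vcap/packages/etcd/etcdctl \
      --ca-file /var/vcap/jobs/etcd/config/certs/server-ca.crt \
      --cert-file /var/vcap/jobs/etcd/config/certs/client.crt \
      --key-file /var/vcap/jobs/etcd/config/certs/client.key \
      -C https://etcd-tls-server-$n.cf-etcd.service.cf.internal:4001 cluster-health
  done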


 

Listing Cluster Members:

  • bosh ssh (Select an etcd VM)
  • sudo -i
  • /var/vcap/packages/etcd/etcdctl \
    --ca-file /var/vcap/jobs/etcd/config/certs/server-ca.crt \
    --cert-file /var/vcap/jobs/etcd/config/certs/client.crt \
    --key-file /var/vcap/jobs/etcd/config/certs/client.key \
    -C https://etcd-tls-server-0.cf-etcd.service.cf.internal:4001 member list
5b17b9b8c8ecf1db: name=diego-database-0 peerURLs=https://diego-database-0.etcd.service.cf.internal:7001 clientURLs=https://diego-database-0.etcd.service.cf.internal:4001
818096dea8a05251: name=diego-database-2 peerURLs=https://diego-database-2.etcd.service.cf.internal:7001 clientURLs=https://diego-database-2.etcd.service.cf.internal:4001
f9a631500f5c2195: name=diego-database-1 peerURLs=https://diego-database-1.etcd.service.cf.internal:7001 clientURLs=https://diego-database-1.etcd.service.cf.internal:4001
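If you only need the member names and client URLs (for example, to feed into the per-node leader check shown earlier), a quick and purely illustrative way to trim the member list output is:

  /var/vcap/packages/etcd/etcdctl \
    --ca-file /var/vcap/jobs/etcd/config/certs/server-ca.crt \
    --cert-file /var/vcap/jobs/etcd/config/certs/client.crt \
    --key-file /var/vcap/jobs/etcd/config/certs/client.key \
    -C https://etcd-tls-server-0.cf-etcd.service.cf.internal:4001 member list \
    | awk '{print $2, $4}'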

 

The fastest way to resolve this issue is to restart each etcd node, one at a time, so that the cluster can re-establish quorum with a single leader.

To restart the etcd service, run the following on each node of the affected cluster:

  • sudo -i
  • monit stop etcd 
  • monit start etcd 
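A minimal per-node restart sequence looks like the sketch below; run it on one node, wait for etcd to come back, and only then move on to the next node. monit summary is standard monit and is used here only to watch the process state; the exact status strings may differ slightly between monit versions.

  sudo -i
  monit stop etcd
  monit summary    # wait until the etcd process is reported as stopped / not monitored
  monit start etcd
  monit summary    # wait until the etcd process is reported as running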

Once the etcd cluster has been restarted, check that there is only one leader and that the correct number of members are healthy. Once etcd is operating normally, restart the doppler and metron services (using monit stop/start) to ensure they are properly communicating with the etcd cluster.
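As a rough illustration of that final verification and restart, assuming the Loggregator processes are managed by monit under the names doppler and metron_agent (confirm the exact names with monit summary on the relevant VMs in your deployment):

  # On each etcd node: confirm a single, consistent leader ID
  curl -sk http://127.0.0.1:4001/v2/stats/self | python -mjson.tool | grep '"leader"'

  # On the VMs running the Loggregator jobs
  monit stop doppler
  monit stop metron_agent
  monit start metron_agent
  monit start doppler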