This issue occurs when an etcd cluster has more than one leader. The error "more than one leader exists" appears in the log:
/etcd_tls_server.0.job/job/etcd/etcd_consistency_checker.stderr.log 2017/01/11 18:40:29 more than one leader exists: [https://etcd-tls-server-1.cf-etcd.service.cf.internal:4001 https://etcd-tls-server-2.cf-etcd.service.cf.internal:4001 https://etcd-tls-server-3.cf-etcd.service.cf.internal:4001]
Make sure that there is only one leader across all of the nodes. Unfortunately, we have come across a scenario where the etcdctl tool reports that the cluster is healthy while it actually has multiple leaders. This can be caused by a network partition.
There are several etcd clusters in PCF, so you will need to identify which cluster the affected VM is a member of. SSH onto your Ops Manager VM and run the following:
bosh instances --ps
This will display a list of the VMs that are running the etcd services.
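On a large deployment the output can be long. Assuming the bosh CLI is already targeted and logged in, a quick way to narrow it down is to filter for etcd processes:
bosh instances --ps | grep -i etcd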
Bosh SSH onto a VM in that cluster and use the commands below to see which node is the leader, whether the cluster is healthy, and which nodes are members of the cluster.
Checking which host is the leader:
Run the following from your Ops Manager VM:
curl -k http://<IP_Address_of_etcd_node>:4001/v2/stats/leader | python -mjson.tool
curl -k http://<IP_Address_of_etcd_node>:4001/v2/stats/self | python -mjson.tool
{
    "id": "68893871562c3e2e",
    "leaderInfo": {
        "leader": "68893871562c3e2e",
        "startTime": "2017-05-18T11:06:02.866578439Z",
        "uptime": "623h53m16.637407751s"
    },
    "name": "etcd-tls-server-0",
    "recvAppendRequestCnt": 0,
    "sendAppendRequestCnt": 0,
    "startTime": "2017-05-18T11:06:02.516168237Z",
    "state": "StateLeader"
}
Alternatively, you can run the command directly from the node you are on:
curl -k http://127.0.0.1:4001/v2/stats/self
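If every node reports the same ID in leaderInfo.leader, leadership is consistent. As a minimal sketch for comparing nodes, assuming three nodes with placeholder IP addresses:
# Placeholder IPs; replace with the addresses of your etcd nodes.
for ip in 10.0.1.10 10.0.1.11 10.0.1.12; do
  echo -n "$ip reports leader: "
  curl -sk http://$ip:4001/v2/stats/self | python -c 'import json,sys; print(json.load(sys.stdin)["leaderInfo"]["leader"])'
done
Seeing two or more different leader IDs in this output confirms the split-leader state shown in the log above.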
Displaying Cluster Health:
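The output below typically comes from etcdctl's cluster-health command. On a TLS cluster, etcdctl also needs the client certificates; the certificate paths here are assumptions for a BOSH-deployed etcd and may differ in your environment:
etcdctl --ca-file /var/vcap/jobs/etcd/config/certs/server-ca.crt --cert-file /var/vcap/jobs/etcd/config/certs/client.crt --key-file /var/vcap/jobs/etcd/config/certs/client.key --endpoints https://127.0.0.1:4001 cluster-health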
cluster is healthy
member 5b17b9b8c8ecf1db is healthy
member 818096dea8a05251 is healthy
member f9a631500f5c2195 is healthy
Listing Cluster Members:
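This output comes from etcdctl's member list subcommand, using the same certificate flags as above (paths again assumptions):
etcdctl --ca-file /var/vcap/jobs/etcd/config/certs/server-ca.crt --cert-file /var/vcap/jobs/etcd/config/certs/client.crt --key-file /var/vcap/jobs/etcd/config/certs/client.key --endpoints https://127.0.0.1:4001 member list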
5b17b9b8c8ecf1db: name=diego-database-0 peerURLs=https://diego-database-0.etcd.service.cf.internal:7001 clientURLs=https://diego-database-0.etcd.service.cf.internal:4001
818096dea8a05251: name=diego-database-2 peerURLs=https://diego-database-2.etcd.service.cf.internal:7001 clientURLs=https://diego-database-2.etcd.service.cf.internal:4001
f9a631500f5c2195: name=diego-database-1 peerURLs=https://diego-database-1.etcd.service.cf.internal:7001 clientURLs=https://diego-database-1.etcd.service.cf.internal:4001
The fastest way to resolve this issue is to restart each etcd node one at a time so that the cluster can achieve quorum.
To restart the etcd service, run the following on each node in the affected cluster, one at a time:
sudo -i
monit stop etcd
monit start etcd
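Before moving on to the next node, confirm that the process has come back up; monit summary shows the state of every monitored process on the VM:
monit summary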
Once the etcd cluster has been restarted, check that there is only one leader and that the correct number of members are healthy. Once etcd is operating normally, restart the doppler and metron services using monit stop/start to ensure they are communicating properly with the etcd cluster.
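As a sketch, the restart on each affected VM would look like the following; the exact monit process names (for example doppler and metron_agent) vary by release version, so check monit summary first:
sudo -i
monit stop metron_agent
monit stop doppler
monit start doppler
monit start metron_agent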