How to Recover ETCD Cluster after Failure

Article ID: 297831


Updated On:

Products

VMware Tanzu Application Service for VMs

Issue/Introduction

Symptoms:

Metron and Doppler use the ETCD cluster for discovery (from Pivotal CF 1.8, TCP routing group and route data are also kept in ETCD). There are scenarios where a node in the ETCD cluster incorrectly separates and forms its own single-node cluster, causing Metrons either to fail to find Dopplers or to all swarm a small group of Dopplers.
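
To confirm that a node has split off, you can compare the cluster membership reported by each etcd node. The following is a minimal sketch, assuming the etcd v2 client API is reachable locally on port 4001 without TLS; the actual port, TLS settings, and job/instance names depend on your deployment and the names below are examples only:

$ bosh ssh etcd_z1/0                           # connect to one etcd instance (name is an example)
$ curl -s http://127.0.0.1:4001/v2/members     # the member list this node believes makes up the cluster
$ curl -s http://127.0.0.1:4001/v2/stats/self  # whether this node sees itself as leader or follower

A healthy cluster returns the same member list from every node; a node that has split off typically reports only itself.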

 

Environment


Resolution

ETCD cluster failures in PCF can be corrected by wiping the data from the nodes and resetting them. This process essentially gives the cluster a fresh start, and because no persistent data is stored on the ETCD cluster, the operation is harmless.

Because this process is quick, non-destructive, and has a high success rate for fixing ETCD problems, Pivotal recommends trying it first, before doing any additional debugging.

To perform this process, follow the instructions in the "Failed Deploys, Upgrades, Split-Brain Scenarios, etc." section of the following link:

https://github.com/cloudfoundry-incubator/etcd-release#failure-recovery

$ monit stop etcd (on all nodes in etcd cluster)
$ rm -rf /var/vcap/store/etcd/* (on all nodes in etcd cluster)
$ monit start etcd (one-by-one on each node in etcd cluster)
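
The commands above must be run as root on each etcd VM. As a minimal sketch of reaching a node with the BOSH CLI and confirming the job state after each restart (the job and instance names are examples; list yours with bosh vms):

$ bosh vms | grep etcd     # identify the etcd instances in the deployment
$ bosh ssh etcd_z1/0       # connect to one instance (the name here is an example)
$ sudo -i                  # monit and /var/vcap/store require root privileges
$ monit summary            # after each "monit start etcd", wait for etcd to report "running" before starting the next node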

If you need assistance with these instructions, please open a ticket with Pivotal Support.

 

Impact

If you choose to enable TCP routing, do not remove ETCD data stores during failure recovery procedures, since router group data added by the routing API is not ephemeral.
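
If TCP routing is enabled, you can check what router group data the routing API currently holds before deciding how to proceed. A minimal sketch, assuming a cf CLI user with permission to read router groups (the routing.router_groups.read scope):

$ cf router-groups     # lists router groups registered with the routing API; in these versions this data is stored in ETCD and is not recreated automatically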