/var/vcap/sys/log/etcd/etcd.stderr.log
on the failing master VM.
2019-09-11 19:44:39.007268 I | raft: 17f206fd866fdab2 [term: 1] received a MsgHeartbeat message with higher term from 3682e7dfa9f9d2be [term: 14]
2019-09-11 19:44:39.007282 I | raft: 17f206fd866fdab2 became follower at term 14
2019-09-11 19:44:39.007287 C | raft: tocommit(11490872) is out of range [lastIndex(1)]. Was the raft log corrupted, truncated, or lost?
panic: tocommit(11490872) is out of range [lastIndex(1)]. Was the raft log corrupted, truncated, or lost?

goroutine 115 [running]:
github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog.(*PackageLogger).Panicf(0xc42000bce0, 0xfff0aa, 0x5d, 0xc42045b160, 0x2, 0x2)
    /tmp/etcd-release-3.3.12/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog/pkg_logger.go:75 +0x162
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.(*raftLog).commitTo(0xc4201f4d20, 0xaf5638)
    /tmp/etcd-release-3.3.12/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/log.go:191 +0x15c
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.(*raft).handleHeartbeat(0xc42012e500, 0x8, 0x17f206fd866fdab2, 0x3682e7dfa9f9d2be, 0xe, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /tmp/etcd-release-3.3.12/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raft.go:1194 +0x54
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.stepFollower(0xc42012e500, 0x8, 0x17f206fd866fdab2, 0x3682e7dfa9f9d2be, 0xe, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /tmp/etcd-release-3.3.12/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raft.go:1140 +0x3ff
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.(*raft).Step(0xc42012e500, 0x8, 0x17f206fd866fdab2, 0x3682e7dfa9f9d2be, 0xe, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /tmp/etcd-release-3.3.12/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raft.go:868 +0x12f1
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.(*node).run(0xc4200a5920, 0xc42012e500)
    /tmp/etcd-release-3.3.12/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/node.go:323 +0x1059
created by github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.StartNode
    /tmp/etcd-release-3.3.12/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/node.go:210 +0x61e
Product Version: 1.19
OS: Ubuntu
ETCDCTL_API=3 /var/vcap/jobs/etcd/bin/etcdctl -w table endpoint status
+------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
|                 ENDPOINT                 |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://master-0.etcd.cfcr.internal:2379 | 17f206fd866fdab2 |  3.3.12 |  3.7 MB |     false |        15 |      51293 |
| https://master-2.etcd.cfcr.internal:2379 | e206af779877c47b |  3.3.12 |  3.5 MB |     false |        15 |      51293 |
| https://master-1.etcd.cfcr.internal:2379 | 7b6fad663771e2c1 |  3.3.12 |  3.7 MB |      true |        15 |      51293 |
+------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
* When you run this command, your results might differ from those above. For example, if you run it on a healthy node, the output will not include the endpoint of the failing etcd member.
The NAME column is equal to the instance ID of the master VM. This can be retrieved with the bosh vms command. The output is in the format master/<instance_id>.
ETCDCTL_API=3 /var/vcap/jobs/etcd/bin/etcdctl -w table member list
+------------------+---------+--------------------------------------+------------------------------------------+------------------------------------------+
|        ID        | STATUS  |                 NAME                 |                PEER ADDRS                |               CLIENT ADDRS               |
+------------------+---------+--------------------------------------+------------------------------------------+------------------------------------------+
| 17f206fd866fdab2 | started | fa34147d-9458-468a-9dbd-f7238f01db33 | https://master-0.etcd.cfcr.internal:2380 | https://master-0.etcd.cfcr.internal:2379 |
| 7b6fad663771e2c1 | started | d22e457d-07b3-441c-b1c1-41785240daa9 | https://master-1.etcd.cfcr.internal:2380 | https://master-1.etcd.cfcr.internal:2379 |
| e206af779877c47b | started | 0c5d0f56-d4d6-4235-bd3e-d5d90ffe39fe | https://master-2.etcd.cfcr.internal:2380 | https://master-2.etcd.cfcr.internal:2379 |
+------------------+---------+--------------------------------------+------------------------------------------+------------------------------------------+
In this example we are going to remove member id 7b6fad663771e2c1.
From the member list table we identify the corresponding instance ID, d22e457d-07b3-441c-b1c1-41785240daa9, for the member. Using bosh vms we identify which host the unhealthy or corrupted member resides on. In this case it is master/d22e457d-07b3-441c-b1c1-41785240daa9.
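The lookup from member ID to BOSH instance can be scripted rather than read off the table by eye. The following is a minimal sketch, not part of the original procedure: member_name_for_id is a hypothetical helper name, and it assumes the five-column table layout shown above (ID, STATUS, NAME, PEER ADDRS, CLIENT ADDRS).

```shell
# Hypothetical helper: reads `etcdctl -w table member list` output on stdin
# and prints master/<NAME> for the row whose ID column matches $1.
# Assumes NAME is the third column of the table, as in the example above.
member_name_for_id() {
  awk -F'|' -v id="$1" '
    # Only rows that start with "|" carry data; border lines start with "+".
    /^[ \t]*\|/ {
      member = $2; name = $4
      gsub(/[ \t]/, "", member); gsub(/[ \t]/, "", name)
      if (member == id) { print "master/" name; found = 1 }
    }
    # Exit non-zero when no row matched the requested ID.
    END { exit (found ? 0 : 1) }
  '
}
```

Piping the member list output into member_name_for_id 7b6fad663771e2c1 would print master/d22e457d-07b3-441c-b1c1-41785240daa9 for the table above; the result can then be compared against the bosh vms output.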
On the failing VM, stop etcd and rename the current etcd data directory. Make sure the other monit services are up on the VM and that the etcd process has stopped completely before you rename the directory.
monit stop etcd
ps aux | grep etcd
mv /var/vcap/store/etcd /var/vcap/store/etcd-old
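Because monit stop returns before the process has necessarily exited, it can help to wait for the process to disappear before renaming the data directory. The sketch below is an illustration only: wait_for_exit is a hypothetical helper, the 30-second default is arbitrary, and it assumes the process name reported by the kernel is exactly etcd and that pgrep is installed on the VM.

```shell
# Hypothetical helper: polls once per second until no process with the exact
# name $1 is running, or until $2 seconds (default 30) have elapsed.
# Returns 0 when the process is gone, 1 on timeout.
wait_for_exit() {
  local name="$1" timeout="${2:-30}" elapsed=0
  while pgrep -x "$name" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 0
}
```

One possible use is to run monit stop etcd, then wait_for_exit etcd 60, and perform the mv only if the helper returns success.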
Remove the failing etcd member.
ETCDCTL_API=3 /var/vcap/jobs/etcd/bin/etcdctl member remove 7b6fad663771e2c1
Member 7b6fad663771e2c1 removed from cluster 6d938e3be5102340
Create a data directory under /var/vcap/store owned by user and group vcap. Both the etcd directory and an empty member directory must exist before you perform the following steps.
mkdir -p /var/vcap/store/etcd/member
chown -R vcap:vcap /var/vcap/store/etcd
chmod 700 /var/vcap/store/etcd
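Before continuing, the new directory can be checked against the requirements stated above (member subdirectory present and empty, owner vcap:vcap, mode 700). This is a rough sketch rather than part of the documented procedure: check_data_dir is a hypothetical helper, and it relies on GNU stat as shipped with Ubuntu.

```shell
# Hypothetical helper: verifies that data directory $1 has an empty member/
# subdirectory, is owned by $2 (default vcap:vcap), and has mode 700.
# Prints "ok" and returns 0 on success; prints the first problem otherwise.
check_data_dir() {
  local dir="$1" owner="${2:-vcap:vcap}"
  [ -d "$dir/member" ] || { echo "missing $dir/member"; return 1; }
  # stat -c is the GNU coreutils form; %U:%G prints owner and group names.
  [ "$(stat -c '%U:%G' "$dir")" = "$owner" ] || { echo "wrong owner on $dir"; return 1; }
  [ "$(stat -c '%a' "$dir")" = "700" ] || { echo "wrong mode on $dir"; return 1; }
  [ -z "$(ls -A "$dir/member")" ] || { echo "$dir/member is not empty"; return 1; }
  echo "ok"
}
```

Running check_data_dir /var/vcap/store/etcd after the three commands above should print ok if the directory was prepared as described.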
Add the member while etcd is in a stopped state.
source /var/vcap/jobs/etcd/bin/utils.sh
ETCDCTL_API=3 /var/vcap/jobs/etcd/bin/etcdctl member add 7b6fad663771e2c1 --peer-urls ${etcd_peer_address}
Restart etcd:
monit restart etcd
Confirm that all the members are on the same RAFT TERM and RAFT INDEX (the re-added member, master-1, is listed with a new ID):
ETCDCTL_API=3 /var/vcap/jobs/etcd/bin/etcdctl -w table endpoint status
+------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
|                 ENDPOINT                 |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://master-0.etcd.cfcr.internal:2379 | 17f206fd866fdab2 |  3.3.12 |  3.7 MB |      true |        16 |      56879 |
| https://master-2.etcd.cfcr.internal:2379 | e206af779877c47b |  3.3.12 |  3.5 MB |     false |        16 |      56879 |
| https://master-1.etcd.cfcr.internal:2379 | 60dc44d66fde6570 |  3.3.12 |  3.4 MB |     false |        16 |      56879 |
+------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
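The comparison of RAFT TERM and RAFT INDEX across rows can also be automated. The following is a minimal sketch under stated assumptions: check_raft_sync is a hypothetical helper, and it assumes the seven-column table layout shown above, with RAFT TERM and RAFT INDEX as the last two columns. On a busy cluster the index can advance between polls, so a brief mismatch is not necessarily a fault.

```shell
# Hypothetical helper: reads `etcdctl -w table endpoint status` output on
# stdin and reports whether every endpoint row shows the same RAFT TERM and
# RAFT INDEX. Prints "in sync" (exit 0) or "out of sync" (exit 1).
check_raft_sync() {
  awk -F'|' '
    # Data rows start with "|"; skip the header row, which contains ENDPOINT.
    /^[ \t]*\|/ && $2 !~ /ENDPOINT/ {
      term = $7; idx = $8
      gsub(/[ \t]/, "", term); gsub(/[ \t]/, "", idx)
      seen[term "/" idx] = 1
      rows++
    }
    END {
      n = 0
      for (k in seen) n++
      # In sync only if at least one row was read and all rows agree.
      if (rows > 0 && n == 1) { print "in sync"; exit 0 }
      print "out of sync"
      exit 1
    }
  '
}
```

Piping the endpoint status output above into check_raft_sync would print in sync, since all three rows report term 16 and index 56879.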
Verify the K8s objects using kubectl and remove the older data directory, /var/vcap/store/etcd-old.