How to recover an etcd member in a multi master Kubernetes cluster when persistent disk is lost or corrupted

Article ID: 298734

Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

IMPORTANT NOTE: These instructions are valid only for a multi-master Kubernetes cluster deployed by Tanzu Kubernetes Grid Integrated (TKGi). Do not attempt them on a single-master cluster, as doing so will lead to data loss. In a multi-master cluster, also ensure that 2 out of 3 etcd members are healthy before proceeding with these steps.
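The "2 out of 3" requirement follows from etcd needing a strict majority of members (a quorum) to stay available. The following is a minimal sketch of that arithmetic. The function names quorum_size and safe_to_repair are hypothetical helpers written for this illustration and are not part of TKGi or etcd.

```shell
#!/bin/sh
# Hypothetical helpers (not part of TKGi or etcd) illustrating the quorum rule.

# quorum_size N: number of members that must be healthy for an
# N-member etcd cluster to keep accepting writes (a strict majority).
quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

# safe_to_repair TOTAL HEALTHY: succeeds when the healthy members alone
# still form a quorum, so one broken member can be removed and re-added.
safe_to_repair() {
    [ "$2" -ge "$(quorum_size "$1")" ]
}

# Example: 3 members, 2 of them healthy.
if safe_to_repair 3 2; then
    echo "ok to proceed"
else
    echo "quorum at risk, do not proceed"
fi
```

With 3 members and 2 healthy the check passes. With a single master and no healthy member it fails, which matches the warning above.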

In a scenario where one of the etcd members in a Kubernetes cluster is corrupted or the data on the persistent disk is lost, the corrupted member fails to start up. The corruption can be verified from the logs under /var/vcap/sys/log/etcd/etcd.stderr.log on the failing master VM.
2019-09-11 19:44:39.007268 I | raft: 17f206fd866fdab2 [term: 1] received a MsgHeartbeat message with higher term from 3682e7dfa9f9d2be [term: 14]
2019-09-11 19:44:39.007282 I | raft: 17f206fd866fdab2 became follower at term 14
2019-09-11 19:44:39.007287 C | raft: tocommit(11490872) is out of range [lastIndex(1)]. Was the raft log corrupted, truncated, or lost?
panic: tocommit(11490872) is out of range [lastIndex(1)]. Was the raft log corrupted, truncated, or lost?

goroutine 115 [running]:
github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog.(*PackageLogger).Panicf(0xc42000bce0, 0xfff0aa, 0x5d, 0xc42045b160, 0x2, 0x2)
        /tmp/etcd-release-3.3.12/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog/pkg_logger.go:75 +0x162
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.(*raftLog).commitTo(0xc4201f4d20, 0xaf5638)
        /tmp/etcd-release-3.3.12/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/log.go:191 +0x15c
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.(*raft).handleHeartbeat(0xc42012e500, 0x8, 0x17f206fd866fdab2, 0x3682e7dfa9f9d2be, 0xe, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /tmp/etcd-release-3.3.12/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raft.go:1194 +0x54
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.stepFollower(0xc42012e500, 0x8, 0x17f206fd866fdab2, 0x3682e7dfa9f9d2be, 0xe, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /tmp/etcd-release-3.3.12/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raft.go:1140 +0x3ff
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.(*raft).Step(0xc42012e500, 0x8, 0x17f206fd866fdab2, 0x3682e7dfa9f9d2be, 0xe, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /tmp/etcd-release-3.3.12/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raft.go:868 +0x12f1
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.(*node).run(0xc4200a5920, 0xc42012e500)
        /tmp/etcd-release-3.3.12/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/node.go:323 +0x1059
created by github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.StartNode
        /tmp/etcd-release-3.3.12/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/node.go:210 +0x61e
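
To check for this condition without reading the whole log, the panic text shown above can be searched for directly. This is a minimal sketch: the function name has_raft_corruption is a hypothetical helper, and it only looks for the exact message from this example, so other corruption symptoms would not be detected.

```shell
#!/bin/sh
# Log path taken from this article; pass another file to override it.
ETCD_STDERR_LOG="/var/vcap/sys/log/etcd/etcd.stderr.log"

# has_raft_corruption [FILE]: hypothetical helper that succeeds if the
# log contains the raft panic message quoted in this article.
has_raft_corruption() {
    grep -q -F -- "Was the raft log corrupted, truncated, or lost?" \
        "${1:-$ETCD_STDERR_LOG}"
}
```

On the failing master VM it could be used as: has_raft_corruption && echo "raft log corruption reported"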


Environment

Product Version: 1.15
OS: Ubuntu

Resolution

Get the endpoint status

ETCDCTL_API=3 /var/vcap/jobs/etcd/bin/etcdctl -w table endpoint status

+------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
|                 ENDPOINT                 |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://master-0.etcd.cfcr.internal:2379 | 17f206fd866fdab2 |  3.3.12 |  3.7 MB |     false |        15 |      51293 |
| https://master-2.etcd.cfcr.internal:2379 | e206af779877c47b |  3.3.12 |  3.5 MB |     false |        15 |      51293 |
| https://master-1.etcd.cfcr.internal:2379 | 7b6fad663771e2c1 |  3.3.12 |  3.7 MB |      true |        15 |      51293 |
+------------------------------------------+------------------+---------+---------+-----------+-----------+------------+

Get the member list

The NAME column contains the instance ID of the master VM, which can be retrieved with the bosh vms command. Its output lists each instance in the format master/<instance_id>.

ETCDCTL_API=3 /var/vcap/jobs/etcd/bin/etcdctl -w table member list
+------------------+---------+--------------------------------------+------------------------------------------+------------------------------------------+
|        ID        | STATUS  |                 NAME                 |                PEER ADDRS                |               CLIENT ADDRS               |
+------------------+---------+--------------------------------------+------------------------------------------+------------------------------------------+
| 17f206fd866fdab2 | started | fa34147d-9458-468a-9dbd-f7238f01db33 | https://master-0.etcd.cfcr.internal:2380 | https://master-0.etcd.cfcr.internal:2379 |
| 7b6fad663771e2c1 | started | d22e457d-07b3-441c-b1c1-41785240daa9 | https://master-1.etcd.cfcr.internal:2380 | https://master-1.etcd.cfcr.internal:2379 |
| e206af779877c47b | started | 0c5d0f56-d4d6-4235-bd3e-d5d90ffe39fe | https://master-2.etcd.cfcr.internal:2380 | https://master-2.etcd.cfcr.internal:2379 |
+------------------+---------+--------------------------------------+------------------------------------------+------------------------------------------+

Identify the etcd member to remove and the master VM it is running on

In this example we are going to remove member ID 7b6fad663771e2c1. From the member list table, the corresponding instance ID for this member is d22e457d-07b3-441c-b1c1-41785240daa9. Using bosh vms, identify which VM the unhealthy or corrupted member resides on. In this case it is master/d22e457d-07b3-441c-b1c1-41785240daa9.
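
The lookup from member ID to instance ID can also be done by filtering the member list table. The following is a minimal sketch that assumes the table layout shown above (ID in the first column, NAME in the third). The function name member_name_for_id is a hypothetical helper.

```shell
#!/bin/sh
# member_name_for_id ID: hypothetical helper. Reads the output of
# "etcdctl -w table member list" on stdin and prints the NAME column
# of the row whose ID column matches the given member ID.
member_name_for_id() {
    awk -F'|' -v id="$1" '
        NF >= 6 {
            # Splitting on "|" leaves the ID in field 2 and NAME in field 4.
            gsub(/ /, "", $2)
            gsub(/ /, "", $4)
            if ($2 == id) print $4
        }'
}
```

Piping the member list command into member_name_for_id 7b6fad663771e2c1 would print d22e457d-07b3-441c-b1c1-41785240daa9 for the table in this example. Prefixing the result with master/ gives the instance name to look for in the bosh vms output.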


Fixing the etcd cluster

On the failing VM, stop etcd and rename the current etcd data directory. Make sure the other monit services on the VM are running and that the etcd process has stopped completely before renaming the directory.

monit stop etcd
ps aux | grep etcd
mv /var/vcap/store/etcd /var/vcap/store/etcd-old

Remove the failing etcd member.

ETCDCTL_API=3 /var/vcap/jobs/etcd/bin/etcdctl member remove 7b6fad663771e2c1
Member 7b6fad663771e2c1 removed from cluster 6d938e3be5102340

Create a new data directory under /var/vcap/store, owned by user and group vcap. Both the etcd directory and an empty member subdirectory must exist before you perform the following steps.

mkdir -p /var/vcap/store/etcd/member
chown -R vcap:vcap /var/vcap/store/etcd
chmod 700 /var/vcap/store/etcd

Add the member back to the cluster while etcd is still stopped on the failing VM. The cluster assigns the re-added member a new member ID (60dc44d66fde6570 in the status output below).

source /var/vcap/jobs/etcd/bin/utils.sh
ETCDCTL_API=3 /var/vcap/jobs/etcd/bin/etcdctl member add 7b6fad663771e2c1 --peer-urls ${etcd_peer_address}

Restart etcd:

monit restart etcd

Confirm that all members report the same RAFT TERM and RAFT INDEX:

ETCDCTL_API=3 /var/vcap/jobs/etcd/bin/etcdctl -w table endpoint status
+------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
|                 ENDPOINT                 |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://master-0.etcd.cfcr.internal:2379 | 17f206fd866fdab2 |  3.3.12 |  3.7 MB |      true |        16 |      56879 |
| https://master-2.etcd.cfcr.internal:2379 | e206af779877c47b |  3.3.12 |  3.5 MB |     false |        16 |      56879 |
| https://master-1.etcd.cfcr.internal:2379 | 60dc44d66fde6570 |  3.3.12 |  3.4 MB |     false |        16 |      56879 |
+------------------------------------------+------------------+---------+---------+-----------+-----------+------------+

Verify the Kubernetes objects using kubectl, then remove the old data directory, /var/vcap/store/etcd-old.
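
The RAFT TERM and RAFT INDEX comparison from the confirmation step can be scripted rather than checked by eye. The following is a minimal sketch that assumes the endpoint status table layout shown in this article (RAFT TERM and RAFT INDEX as the last two columns). The function name raft_in_sync is a hypothetical helper. On a busy cluster the index may differ briefly between members while entries are still being replicated, so a failed check may only mean that it should be repeated after a few seconds.

```shell
#!/bin/sh
# raft_in_sync: hypothetical helper. Reads the output of
# "etcdctl -w table endpoint status" on stdin. Succeeds only if at least
# one data row is present and every row has the same RAFT TERM and
# RAFT INDEX.
raft_in_sync() {
    awk -F'|' '
        # Data rows have 7 columns; skip borders and the header row.
        NF >= 8 && $2 !~ /ENDPOINT/ {
            term = $7
            idx = $8
            gsub(/ /, "", term)
            gsub(/ /, "", idx)
            key = term "/" idx
            if (rows++ == 0) first = key
            else if (key != first) mismatch = 1
        }
        END {
            if (rows > 0 && !mismatch) exit 0
            exit 1
        }'
}
```

Piping the endpoint status command into raft_in_sync returns success for the final table above, where all three members are at term 16 and index 56879.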