How to detect etcd inconsistency issues in TKGI 1.13.x which could be affected by a known issue in etcd v3.5.x



Article ID: 301116



Products

VMware Tanzu Kubernetes Grid

Issue/Introduction

Symptoms:

There may be no clear symptoms of data inconsistencies, but your clusters may still be affected. If you see the following symptoms, contact support:

  • Inconsistent data returned while executing “kubectl” commands
  • Run the command “/var/vcap/jobs/etcd/bin/etcdctl endpoint status -w json --cluster” on any control plane node to compare the revision between members. The members should report the same “revision”, or converge to the same value soon. Otherwise, the data between members is inconsistent (and data corruption is possible); see the example after this list.
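
For reference, the member revisions can be compared at a glance with a one-liner like the following. This is only a convenience sketch that assumes the jq utility is available on the node; otherwise, read the “revision” values from the JSON output manually:

/var/vcap/jobs/etcd/bin/etcdctl endpoint status -w json --cluster | jq -r '.[] | "\(.Endpoint) revision=\(.Status.header.revision)"'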

If you do not see these symptoms, it is safe to apply the “Workaround” steps in this KB.

To reduce the risk of hitting this issue, do not increase the workloads on your clusters, and wait for the release of TKGI 1.13.4, which will include a fix.

Note: Community analysis indicates that there is a very low possibility of running into this scenario if the control plane nodes running etcd do not experience memory pressure or SIGKILL interruptions.

 


Cause

With the recent reproduction of data inconsistency issues in #13766, the etcd maintainers no longer recommend v3.5 releases for production. This impacts all TKGI 1.13.x releases.

Resolution

The fix is expected in the next patch release, TKGI v1.13.4.
Note: The TKGI team is first awaiting a community release of etcd 3.5 (expected to be v3.5.3) with a fix before they can release TKGI 1.13.4.

Workaround:

The workaround adds two etcd corruption-check flags, --experimental-initial-corrupt-check and --experimental-corrupt-check-time. You can set them with either the manual steps or the automatic steps below.

Note: This operation is not persistent across a VM recreate, so the flags must be re-applied if the control plane VMs are recreated.

 

Here are the manual steps to set these two flags for each control plane node.

  1. SSH into the 1st control plane node and switch to the root user:
bosh -d service-instance_xxx ssh master/0

sudo -i
  2. Append the two flags after line 35 in the file /var/vcap/jobs/etcd/bin/etcd. The updated file should look like the example below. Make sure there is no trailing space after each \.
…
/var/vcap/packages/etcd/bin/etcd \
  --experimental-initial-corrupt-check=true \
  --experimental-corrupt-check-time="1h" \
  --name="d24253cb-63ec-4a73-9f41-d94bdaf53662" \
  --data-dir="/var/vcap/store/etcd" \
  …
  3. Restart the etcd service:
monit restart etcd
  4. Check that the etcd service state is running (an optional verification command is shown after these steps):
monit summary
  5. SSH into the 2nd and 3rd control plane nodes, if you have them, and repeat steps 1-4.
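
Optionally, you can confirm that the running etcd process picked up the two flags. This is only a sanity-check sketch (not part of the official procedure); run it as root on the control plane node, where the bracketed pattern keeps grep from matching itself:

ps -ef | grep "[e]xperimental-initial-corrupt-check"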

Here are the automatic steps to set these two flags for all control plane nodes together.

Note: Run these commands in a Linux terminal and pay close attention to the symbols, including single quotes, double quotes, and slashes; they must be entered exactly as shown.

Run these commands for each cluster in your environment.
 

  1. Run a single command to set the etcd flags on all control plane nodes:
bosh -d service-instance_xxxxxx ssh master -c 'sudo sed -i "/\/var\/vcap\/packages\/etcd\/bin\/etcd/a \  --experimental-initial-corrupt-check=true \\\\\n  --experimental-corrupt-check-time=\"1h\" \\\\" /var/vcap/jobs/etcd/bin/etcd'
  2. Restart the etcd service:
bosh -d service-instance_xxxxxx ssh master -c 'sudo monit restart etcd'
  3. Check the etcd running state (an optional verification is shown after these steps):
bosh -d service-instance_xxxxx ssh master -c 'sudo monit summary'
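
Optionally, verify that both flags were written to the startup script on every control plane node. This is only a convenience sketch that greps the same file the sed command above modifies:

bosh -d service-instance_xxxxxx ssh master -c 'sudo grep "corrupt-check" /var/vcap/jobs/etcd/bin/etcd'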


After applying the workaround

If the data is inconsistent between members or the data is corrupted, etcd will fail to start, and you will see error messages like "checkInitialHashKV failed, ...found data inconsistency with peers" in the “/var/vcap/sys/log/etcd/etcd.stderr.log” log file. You can confirm etcd status by running “sudo monit summary” on any control plane node.
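
For example, you can search the stderr log for the inconsistency message directly (run as root on the control plane node; this is only a quick check, and the exact message text may vary slightly between etcd patch versions):

grep "checkInitialHashKV" /var/vcap/sys/log/etcd/etcd.stderr.log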

If the above error messages occur, or the etcd member is stopped on one of your control plane nodes, contact support for help recovering the etcd cluster.


Additional Information

Impact/Risks:
The community consensus is that there is a very low possibility of running into this scenario if the control plane nodes running etcd do not experience memory pressure, SIGKILL interruptions, or crashes.

Once the two flags are added, etcd will stop if the issue is detected, which prevents further data corruption. If etcd is stopped, Kubernetes cluster operations will be impacted.
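
If you need a quick view of member health after enabling the flags, the etcdctl endpoint health command can be run from a control plane node whose etcd member is still up; stopped or unhealthy members are reported as errors. This is a convenience sketch, not a required step:

/var/vcap/jobs/etcd/bin/etcdctl endpoint health --cluster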