Data inconsistency may not produce clear symptoms, so your clusters may be affected even if they appear healthy. If you see the following symptoms, contact support:
If you do not see these symptoms, it is safe to apply the “Workaround” steps in this KB.
To reduce the risk of hitting this issue, do not increase the workload on your clusters, and wait for the release of TKGI 1.13.4, which will include a fix.
Note: Community analysis indicates that the probability of hitting this scenario is very low if the control plane nodes running etcd are not under memory pressure and the etcd process is not interrupted by SIGKILL.
Following the recent reproduction of data inconsistency issues in etcd issue #13766, the etcd maintainers no longer recommend v3.5 releases for production. This impacts all TKGI 1.13.x releases.
bosh -d service-instance_xxx ssh master/0
sudo -i
…
/var/vcap/packages/etcd/bin/etcd \
  --experimental-initial-corrupt-check=true \
  --experimental-corrupt-check-time="1h" \
  --name="d24253cb-63ec-4a73-9f41-d94bdaf53662" \
  --data-dir="/var/vcap/store/etcd" \
…
Make sure there is no trailing space after the backslash on the two added --experimental lines.
monit restart etcd
monit summary
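As a quick sanity check after the manual edit, the following sketch confirms that both flags are present and reports any backslash followed by trailing whitespace. It assumes the edited start script is /var/vcap/jobs/etcd/bin/etcd (the file targeted by the sed command in this KB). The function names are hypothetical helpers and are not part of TKGI, BOSH, or etcd.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for checking the edited etcd start script.

# Succeeds only if both corruption-check flags appear in the file given as $1.
has_corrupt_check_flags() {
  grep -q -e '--experimental-initial-corrupt-check=true' "$1" &&
    grep -q -e '--experimental-corrupt-check-time=' "$1"
}

# Prints (with line numbers) every line where a backslash is followed by
# trailing whitespace, which would break the shell line continuation.
find_broken_continuations() {
  grep -nE '\\[[:space:]]+$' "$1"
}

# Example usage on a control plane node (path is an assumption, see above):
#   has_corrupt_check_flags /var/vcap/jobs/etcd/bin/etcd && echo "flags present"
#   find_broken_continuations /var/vcap/jobs/etcd/bin/etcd
```

If find_broken_continuations prints anything, remove the trailing whitespace on the reported lines before restarting etcd.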
Note: Run these commands in a Linux terminal and pay close attention to the special characters, including single quotes, double quotes, and slashes.
Run these commands for each cluster in your environment.
bosh -d service-instance_xxxxxx ssh master -c 'sudo sed -i "/\/var\/vcap\/packages\/etcd\/bin\/etcd/a \ --experimental-initial-corrupt-check=true \\\\\n --experimental-corrupt-check-time=\"1h\" \\\\" /var/vcap/jobs/etcd/bin/etcd'
bosh -d service-instance_xxxxxx ssh master -c 'sudo monit restart etcd'
bosh -d service-instance_xxxxxx ssh master -c 'sudo monit summary'
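Because the commands must be repeated for each cluster, a small loop can help with the verification step. The sketch below is a minimal example, assuming that cluster deployments are named with the service-instance_ prefix as in the commands above. The function names and the BOSH_CMD variable are hypothetical and exist only so the loop can be dry-run without a BOSH director.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for repeating a check across all clusters.

# Reads deployment names on stdin (one per line, surrounding whitespace allowed)
# and prints only those that start with "service-instance_".
filter_cluster_deployments() {
  awk '$1 ~ /^service-instance_/ { print $1 }'
}

# Reads deployment names on stdin and runs "sudo monit summary" on the
# control plane nodes of each one. Set BOSH_CMD="echo bosh" for a dry run.
summarize_clusters() {
  while IFS= read -r deployment; do
    # Redirect stdin so the ssh session cannot consume the remaining names.
    ${BOSH_CMD:-bosh} -d "$deployment" ssh master -c 'sudo monit summary' </dev/null
  done
}

# Example usage (assumes the bosh CLI prints one deployment name per line):
#   bosh deployments --column=name | filter_cluster_deployments | summarize_clusters
```

Running the dry-run form first (BOSH_CMD="echo bosh") shows exactly which deployments would be contacted before any command is sent to a cluster.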
If the data is inconsistent between members or is corrupted, etcd will fail to start, and error messages such as "checkInitialHashKV failed, ...found data inconsistency with peers" will appear in the /var/vcap/sys/log/etcd/etcd.stderr.log log file. You can confirm the etcd status by running "sudo monit summary" on any control plane node.
If these error messages occur, or if the etcd member is stopped on one of your control plane nodes, contact support to help recover the etcd cluster.
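The log check described above can be sketched as a small helper that searches the etcd stderr log for the two message fragments quoted in this KB. The function name is hypothetical, and the exact wording of the log line may differ between etcd versions, so treat a negative result as indicative rather than conclusive.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the log file given as $1 contains either
# fragment of the corruption-check failure message quoted in this KB.
etcd_log_shows_inconsistency() {
  grep -q -e 'checkInitialHashKV failed' \
          -e 'found data inconsistency with peers' "$1"
}

# Example usage on a control plane node:
#   if etcd_log_shows_inconsistency /var/vcap/sys/log/etcd/etcd.stderr.log; then
#     echo "Inconsistency reported; contact support."
#   fi
```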