TKGI worker nodes are deployed by BOSH and have at least three disks attached when no Kubernetes persistent volumes are in use. These three disks are:
/ (the root disk)
/var/vcap/data (the ephemeral disk)
/var/vcap/store (the BOSH-created persistent disk)
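To see how these mount points map to block devices on a worker node, lsblk can be used. This is a minimal check; the device names it reports vary by IaaS, though in this article's example they match the df output shown later:
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT
Typically / is on the first disk, /var/vcap/data (ephemeral) on the second, and /var/vcap/store (persistent) on the third.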
The procedure in this article can be used to recover BOSH-created persistent disks.
All Versions of VMware Tanzu Kubernetes Grid Integrated Edition
Persistent disk corruption can occur due to underlying IaaS or filesystem issues.
Important Note: bosh recreate and bosh cck recreate or repair the VM itself; they do not repair filesystem corruption on a BOSH-created persistent disk, which is why the manual procedure below is needed.
In this example, the worker node with IP 10.20.0.5 has the corrupted persistent disk. First, identify the affected Kubernetes node:
kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
011704a1-5f0f-4cb9-bd91-f9ad7aec17e5 Ready <none> 20h v1.23.7+vmware.1 10.20.0.5 10.20.0.5 Ubuntu 16.04.7 LTS 4.15.0-191-generic containerd://1.6.4
8334e164-8e9b-4ffb-9c89-bfe015e094a8 Ready <none> 20h v1.23.7+vmware.1 10.20.0.4 10.20.0.4 Ubuntu 16.04.7 LTS 4.15.0-191-generic containerd://1.6.4
c649ec99-bb3a-4049-9c57-1751f6de271e Ready <none> 21h v1.23.7+vmware.1 10.20.0.3 10.20.0.3 Ubuntu 16.04.7 LTS 4.15.0-191-generic containerd://1.6.4
bosh vms -d service-instance_77e44aad-1a76-4980-8d4e-43d7c273d167 | grep 10.20.0.5
worker/fcd09dc3-9e7a-4528-8015-22620b553f27 running az 10.20.0.5 vm-c2b8073f-949d-4891-b420-36769ecdee60 medium.disk true bosh-vsphere-esxi-ubuntu-xenial-go_agent/621.265
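Optionally, the persistent disk CID attached to this worker can be confirmed with bosh instances --details. This is an extra sanity check and is not required for the rest of the procedure:
bosh -d service-instance_77e44aad-1a76-4980-8d4e-43d7c273d167 instances --details | grep fcd09dc3-9e7a-4528-8015-22620b553f27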
Note: Other drain options may be needed if the drain fails; see the example after the drain output below.
kubectl drain 011704a1-5f0f-4cb9-bd91-f9ad7aec17e5 --ignore-daemonsets
node/011704a1-5f0f-4cb9-bd91-f9ad7aec17e5 cordoned
WARNING: ignoring DaemonSet-managed Pods: pks-system/fluent-bit-7rg24, pks-system/telegraf-xjsx4
evicting pod kube-system/coredns-67bd78c556-9vwfd
pod/coredns-67bd78c556-9vwfd evicted
node/011704a1-5f0f-4cb9-bd91-f9ad7aec17e5 drained
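If the default drain is blocked, for example by pods using emptyDir volumes or pods that are not managed by a controller, additional flags can be passed. This is an illustrative example only; review the impact of --force before using it, because it evicts pods that are not managed by a ReplicaSet, Deployment, DaemonSet, StatefulSet or Job:
kubectl drain 011704a1-5f0f-4cb9-bd91-f9ad7aec17e5 --ignore-daemonsets --delete-emptydir-data --force --grace-period=60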
kubectl get nodes
NAME STATUS ROLES AGE VERSION
011704a1-5f0f-4cb9-bd91-f9ad7aec17e5 Ready,SchedulingDisabled <none> 20h v1.23.7+vmware.1
8334e164-8e9b-4ffb-9c89-bfe015e094a8 Ready <none> 20h v1.23.7+vmware.1
c649ec99-bb3a-4049-9c57-1751f6de271e Ready <none> 21h v1.23.7+vmware.1
bosh update-resurrection off -d service-instance_77e44aad-1a76-4980-8d4e-43d7c273d167
bosh -d service-instance_77e44aad-1a76-4980-8d4e-43d7c273d167 ssh worker/fcd09dc3-9e7a-4528-8015-22620b553f27
sudo su -
monit stop all
To confirm that everything has stopped, run:
monit summary
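After monit stop all, every job should eventually report not monitored in the summary. As a convenience, the summary can be watched until nothing is left running or pending (this check is not part of the original procedure):
watch -n 5 monit summary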
Before repairing /var/vcap/store, identify the block device backing it. In this example, /var/vcap/store is on /dev/sdc1:
df -h
Filesystem Size Used Avail Use% Mounted on
<------ Truncated Output ------>
/dev/sda1 2.9G 1.4G 1.4G 52% /
/dev/sdb1 32G 3.5G 27G 12% /var/vcap/data
tmpfs 16M 4.0K 16M 1% /var/vcap/data/sys/run
/dev/sdc1 50G 2.1G 45G 5% /var/vcap/store
<------ Truncated Output ------>
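Optionally, the filesystem type on the persistent disk can be confirmed before running the check. This assumes blkid is available on the stemcell:
blkid /dev/sdc1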
umount /var/vcap/store
If umount fails because the device is busy, identify which processes are blocking the operation using fuser -m -u -v /dev/sdc1 or fuser -m -u -v /var/vcap/store, and stop them with kill <PID>. Once the filesystem is unmounted, run fsck against the device:
fsck /dev/sdc1
fsck from util-linux 2.27.1
e2fsck 1.42.13 (17-May-2015)
/dev/sdc1: clean, 12599/3276800 files, 794069/13106688 blocks
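If fsck reports the filesystem as clean but corruption is still suspected, a full check can be forced. For ext4, -f makes e2fsck check the filesystem even when it is marked clean, and -y answers repair prompts automatically. This is a general e2fsck example rather than a step from the procedure above, so use it with care:
e2fsck -f -y /dev/sdc1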
mount /dev/sdc1 /var/vcap/store
mount | grep sdc
/dev/sdc1 on /var/vcap/store type ext4 (rw,relatime,data=ordered)
Start all the processes again:
monit start all
As part of this stop and start, kubelet is also restarted, which should bring the node out of the SchedulingDisabled state.
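Verify with kubectl get nodes that the node has returned to the Ready state. If it remains in SchedulingDisabled state after kubelet restarts, it can be uncordoned manually as a fallback:
kubectl uncordon 011704a1-5f0f-4cb9-bd91-f9ad7aec17e5
Once the node is healthy and scheduling workloads again, re-enable resurrection for the deployment, mirroring the earlier command that disabled it:
bosh update-resurrection on -d service-instance_77e44aad-1a76-4980-8d4e-43d7c273d167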