When a TKGm node fails and is scheduled for deletion by the MachineHealthCheck (MHC), it can be helpful to collect logs directly from the crashed node. Once the node is restarted, manually deleted, or removed by MHC, the logs that captured the failure may be lost or rotated out. These logs are valuable to support during troubleshooting because they show the state of the machine leading up to and during the failure.
vSphere and TKGm 2.5.x
1) From the management cluster context, pause cluster reconciliation for the cluster that is demonstrating the issue. This ensures that when the node/machine is powered off, it will not be recreated:
kubectl patch cluster <Workload Cluster> --type merge -p '{"spec":{"paused": true}}'
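Before powering off the VM, it may be worth confirming that the patch took effect. The following is a minimal sketch, not part of the original procedure: the function name, its arguments, and the example cluster and namespace names are hypothetical, and it assumes the Cluster object lives in the namespace you pass in.

```shell
# cluster_paused NAME NAMESPACE
# Prints the value of spec.paused for the given Cluster object.
# "true" means reconciliation is paused; empty or "false" means it is not.
# NAME and NAMESPACE are placeholders for your own values.
cluster_paused() {
  kubectl get cluster "$1" --namespace "$2" -o jsonpath='{.spec.paused}'
}

# Example (hypothetical names), run from the management cluster context:
#   cluster_paused my-workload-cluster default
```

The same check can be repeated after the final step to confirm that reconciliation has resumed.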
2) Power off the node/machine in vCenter.
3) Right-click on the powered-off machine and select the option to create a clone.
4) Once the clone has completed, attach the clone's disk to a Linux-based test machine. Only one VM using this disk can be powered on at a time, so ensure the original machine and the clone remain powered off.
5) On the test machine, run the following command to confirm that the disk is recognised:
sudo fdisk -l
6) Mount the disk on the test machine to access the filesystem. Replace "<disk>" with the name of the device (typically a partition on the cloned disk) identified in the previous step.
sudo mount /dev/<disk> /mnt/
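If several devices or partitions were listed, a quick sanity check can confirm that the mounted filesystem is the node's root filesystem. This is a sketch under the assumption that the directories collected later in this procedure (var/log and etc/kubernetes) sit directly under the mount point; the function name is hypothetical.

```shell
# check_node_root [MOUNTPOINT]
# Succeeds only if the directories this procedure collects from are present
# under MOUNTPOINT (defaults to /mnt). Reports the first missing directory.
check_node_root() {
  root="${1:-/mnt}"
  for dir in var/log etc/kubernetes; do
    if [ ! -d "$root/$dir" ]; then
      echo "missing: $root/$dir" >&2
      return 1
    fi
  done
  echo "ok: $root looks like a node root filesystem"
}

# Example:
#   check_node_root /mnt
```

If the check fails, unmount and try the next candidate device from the fdisk output.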
7) The disk should now be available in the /mnt/ directory on the test VM. From here, we can collect relevant logs from the node. We can use the following commands to bundle the logs so they can be easily shared with support later:
/var/log/ contains the most useful logs from the node, including pod logs, journal logs, audit logs, cloud-init-output, etc.
tar -czvf node_logs.tar.gz /mnt/var/log
Kubernetes manifests and other configuration files can be collected from /etc/kubernetes.
tar -czvf node_manifests.tar.gz /mnt/etc/kubernetes
If the node is a control plane node, etcd data can be collected from /var/lib/etcd.
tar -czvf node_etcd.tar.gz /mnt/var/lib/etcd
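The three tar commands above can also be wrapped in one helper. The sketch below is an illustration, not the documented method: the function name and arguments are hypothetical. It uses tar's -C option so that paths inside each archive are relative to the mount point (var/log rather than mnt/var/log), and it skips any directory that is absent, such as etcd data on a worker node.

```shell
# bundle_node_data [ROOT] [OUTDIR]
# Archives var/log, etc/kubernetes and var/lib/etcd from ROOT (default /mnt)
# into OUTDIR (default: current directory), using the archive names from the
# steps above. Missing directories are skipped with a message on stderr.
# Run from a root shell if the files are not readable by your user.
bundle_node_data() {
  root="${1:-/mnt}"
  out="${2:-.}"
  for pair in node_logs:var/log node_manifests:etc/kubernetes node_etcd:var/lib/etcd; do
    name="${pair%%:*}"   # archive base name, e.g. node_logs
    path="${pair#*:}"    # path relative to ROOT, e.g. var/log
    if [ -d "$root/$path" ]; then
      tar -czf "$out/$name.tar.gz" -C "$root" "$path"
      echo "created $out/$name.tar.gz"
    else
      echo "skipped $path (not present under $root)" >&2
    fi
  done
}

# Example:
#   bundle_node_data /mnt "$HOME"
```

Writing the archives to a directory outside /mnt keeps them on the test machine rather than on the cloned disk.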
Once the logs have been collected and examined, the resulting archives can be copied off the test machine and shared with support.
8) Unmount the disk (sudo umount /mnt), then detach it from the test VM. The cloned machine can now be safely deleted.
9) From the management cluster context, unpause cluster reconciliation:
kubectl patch cluster <Workload Cluster> --type merge -p '{"spec":{"paused": false}}'