Deleted k8s node is not recreated by Machine Health Check (MHC) in TKGm

Article ID: 421217

Updated On:

Products

Tanzu Kubernetes Runtime

Issue/Introduction

Machine Health Check (MHC) does not recreate a deleted Kubernetes node, even though the node has already been removed from the "kubectl get nodes" output.

# kubectl get nodes
# Only 2 worker nodes exist; the deleted node is no longer listed
NAME                            STATUS   ROLES           AGE   VERSION
test-controlplane-f5zbl-j44jj   Ready    control-plane   45d   v1.33.1+vmware.1
test-md-0-h2d4r-4qhs2-rsf2w     Ready    <none>          45d   v1.33.1+vmware.1
test-md-0-h2d4r-4qhs2-xwwhn     Ready    <none>          45d   v1.33.1+vmware.1

However, the orphaned objects for the deleted node (ma/vspheremachine/vspherevms) unexpectedly remain in the management cluster.

# kubectl get ma -A
NAME                           CLUSTER   NODENAME
test-controlplane-f5zbl-j44jj  test      test-controlplane-f5zbl-j44jj
test-md-0-h2d4r-4qhs2-rsf2w    test      test-md-0-h2d4r-4qhs2-rsf2w
test-md-0-h2d4r-4qhs2-xwwhn    test      test-md-0-h2d4r-4qhs2-xwwhn
test-md-0-h2d4r-4qhs2-z754m    test      test-md-0-h2d4r-4qhs2-z754m # <------- still remains
# kubectl get vspheremachine -A
NAME                           CLUSTER  READY
test-controlplane-f5zbl-j44jj  test     true
test-md-0-h2d4r-4qhs2-rsf2w    test     true
test-md-0-h2d4r-4qhs2-xwwhn    test     true
test-md-0-h2d4r-4qhs2-z754m    test     true # <---------- still remains
# kubectl get vspherevms -A
NAME                           AGE
test-controlplane-f5zbl-j44jj  45d
test-md-0-h2d4r-4qhs2-rsf2w    45d
test-md-0-h2d4r-4qhs2-xwwhn    45d
test-md-0-h2d4r-4qhs2-z754m    45d # <---------- still remains
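
To pinpoint which Machine is orphaned before deleting anything, each Machine's phase and backing node can be compared against the live node list. The following is a minimal sketch using standard Cluster API fields (status.phase and status.nodeRef.name); run it against the management cluster, and note that exact fields may vary by Cluster API version.

# Optional check: list each Machine with its phase and backing node
kubectl -n <NAMESPACE> get ma -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,NODE:.status.nodeRef.name
# A Machine whose NODE column is empty, or whose node is missing from "kubectl get nodes", is the orphan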

Environment

TKGm v2.5.0

Cause

Known Issue - TKGm v2.5.0 - Orphaned vSphereMachine objects remain after a cluster upgrade or scale operation

Resolution

Manually delete the orphaned objects (ma/vspheremachine/vspherevms) that belong to the deleted node.

# 1. Switch the context to the "Management Cluster"
kubectl config use-context <MANAGEMENT_CLUSTER>

# 2. Identify the orphaned objects
kubectl -n <NAMESPACE> get ma,vspheremachine,vspherevms

# 3. Delete the objects manually
kubectl -n <NAMESPACE> delete ma <TARGET_NODE_OBJECT>
kubectl -n <NAMESPACE> delete vspheremachine <TARGET_NODE_OBJECT>
kubectl -n <NAMESPACE> delete vspherevms <TARGET_NODE_OBJECT>

# 4. After about 5 minutes, verify that the objects have been deleted
kubectl -n <NAMESPACE> get ma,vspheremachine,vspherevms

# 5. If deletion hangs, edit each object and remove the entries under metadata.finalizers
kubectl -n <NAMESPACE> edit ma <TARGET_NODE_OBJECT>
kubectl -n <NAMESPACE> edit vspheremachine <TARGET_NODE_OBJECT>
kubectl -n <NAMESPACE> edit vspherevms <TARGET_NODE_OBJECT>
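# Alternative sketch: clear the finalizers non-interactively with a merge patch.
# This is a generic kubectl pattern rather than a TKGm-specific command; apply it
# only to objects confirmed to be orphaned.
kubectl -n <NAMESPACE> patch ma <TARGET_NODE_OBJECT> --type merge -p '{"metadata":{"finalizers":null}}'
kubectl -n <NAMESPACE> patch vspheremachine <TARGET_NODE_OBJECT> --type merge -p '{"metadata":{"finalizers":null}}'
kubectl -n <NAMESPACE> patch vspherevms <TARGET_NODE_OBJECT> --type merge -p '{"metadata":{"finalizers":null}}'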

# 6. A new node will be recreated by MHC
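
Before restarting anything, it can help to confirm that a MachineHealthCheck exists for the cluster and what it currently reports. A minimal check against the management cluster ("mhc" is the short name for machinehealthchecks; <MHC_NAME> is a placeholder, and the reported columns vary by Cluster API version):

kubectl -n <NAMESPACE> get mhc
kubectl -n <NAMESPACE> describe mhc <MHC_NAME>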

If MHC is still not triggered, restart the Cluster API controller pods.

kubectl -n capi-system rollout restart deployment/capi-controller-manager
kubectl -n capv-system rollout restart deployment/capv-controller-manager
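
To confirm the restarts have completed, the rollouts can be watched with standard kubectl:

kubectl -n capi-system rollout status deployment/capi-controller-manager
kubectl -n capv-system rollout status deployment/capv-controller-manager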

If the cluster is in a "paused" state, revert it to the "unpaused" state.

kubectl -n <NAMESPACE> patch cluster <CLUSTER> --type merge -p '{"spec":{"paused": false}}'
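
To check the current value of the flag before or after patching, a standard jsonpath lookup can be used (an empty result means the cluster is not paused):

kubectl -n <NAMESPACE> get cluster <CLUSTER> -o jsonpath='{.spec.paused}'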

Additional Information