
Time drift across hosts and/or nodes can cause unexpected machine deletion and recreation


Article ID: 323918


Products

VMware

Issue/Introduction

Symptoms:
The MachineHealthCheck (MHC) controller in TKG is responsible for remediating unhealthy machines. When time is not synchronized across hosts or nodes, the MHC can enter a loop of remediating nodes it deems unhealthy. If a node is restarted or recreated while clocks are skewed, the MHC may miscalculate the machine's age during startup, conclude that it has been unhealthy for a long time, and quickly delete and recreate the node. This loop continues until time is synchronized.

One symptom of this issue is that a new node VM keeps getting created and powered on, then powered off and deleted shortly afterwards.

From the capi-controller-manager pod logs, you can see when the machine was created by searching for the particular machine name (e.g., wkld-md-1-xxxxx-c695b5dxx-p9999 in the example below).
I0918 14:32:49.364540 1 machineset_controller.go:476] "Created machine 1 of 1 with name \"wkld-md-1-xxxxx-c695b5dxx-p9999\"" controller="machineset" controllerGroup="cluster.x-k8s.io" controllerKind="MachineSet" machineSet="tkg-system/wkld-md-1-xxxxx-c695b5dxx" namespace="tkg-system" name="wkld-md-1-xxxxx-c695b5dxx" reconcileID=9be7c4b0-60b5-4264-9397-5542a98xxxxx
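If needed, these entries can be pulled directly from the controller; a minimal sketch, assuming the capi-controller-manager deployment runs in the capi-system namespace (typical for TKG) and using the example machine name above:

# Search the CAPI controller logs for the affected machine name
kubectl logs -n capi-system deployment/capi-controller-manager | grep "wkld-md-1-xxxxx-c695b5dxx-p9999"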

Then compare this with the log entry that shows why the MHC remediated the machine, paying particular attention to the reason and the message. The message states how long the unhealthy condition has been observed.
I0918 14:33:55.360129 1 machinehealthcheck_controller.go:431] "Target has failed health check, marking for remediation" controller="machinehealthcheck" controllerGroup="cluster.x-k8s.io" controllerKind="MachineHealthCheck" machineHealthCheck="tkg-system/wkld-md-1-xxxxx" namespace="tkg-system" name="wkld-md-1-xxxxx" reconcileID=9da00031-982c-46fc-9f5b-f676c47xxxxx cluster="wkld" target="tkg-system/wkld-md-1-xxxxx/wkld-md-1-xxxxx-c695b5dxx-p9999/wkld-md-1-xxxxx-c695b5dxx-p9999" reason="UnhealthyNode" message="Condition Ready on node is reporting status False for more than 12m0s"

 
In this example, the message indicates that the machine has been in an unhealthy condition for more than 12 minutes, but the log timestamps show that the machine was created less than 2 minutes earlier.
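To confirm the discrepancy, the actual machine age and the health check configuration can be compared against the failure message; a minimal sketch, assuming the tkg-system namespace and object names from the logs above:

# The AGE column shows how long ago the Machine object was actually created
kubectl get machines -n tkg-system

# Review spec.nodeStartupTimeout and spec.unhealthyConditions evaluated by the MHC controller
kubectl get machinehealthcheck wkld-md-1-xxxxx -n tkg-system -o yaml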

DHCP IP address exhaustion can also occur because VMs keep getting created.

Environment

VMware Tanzu Kubernetes Grid Plus 1.x

Cause

The issue is caused by time not being synchronized across the hosts and nodes, which leads the MHC controller to miscalculate how long a machine has been unhealthy.

Resolution

Steps to recover from this issue:

1.) Pause the cluster reconciliation to stop the delete/recreate loop and to avoid IP address exhaustion:

kubectl patch cluster <cluster-name> --type merge -p '{"spec":{"paused": true}}'
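Optionally, verify that reconciliation is paused before continuing (add -n <namespace> if the Cluster object is not in the current namespace); the command below is a minimal sketch:

# Should print: true
kubectl get cluster <cluster-name> -o jsonpath='{.spec.paused}'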


2.) Set up time synchronization (NTP) across the ESXi hosts. Afterwards, confirm that time is synchronized across the hosts and nodes, for example as shown below.
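A minimal way to spot-check the clocks, assuming SSH access to the ESXi hosts and node VMs (the esxcli time namespace is available on recent ESXi releases):

# On each ESXi host: print the current system time in UTC
esxcli system time get

# On each node VM: print the current time in UTC for comparison
date -u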

3.) Make sure that the DHCP server has available IP addresses. If needed, release unassigned IP address leases.

4.) Resume the cluster reconciliation.

kubectl patch cluster <cluster-name> --type merge -p '{"spec":{"paused": false}}'

Afterwards, node recreation should succeed and the cluster should return to a stable, running state.
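To confirm the recovery, the machines and the cluster phase can be checked; a minimal sketch, assuming the tkg-system namespace used in the log examples above:

# Machines should settle in the Running phase and stop being deleted and recreated
kubectl get machines -n tkg-system

# The cluster phase should return to Provisioned
kubectl get cluster <cluster-name>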