Unable to find VM by BIOS UUID

Article ID: 326384

Products

VMware Tanzu Kubernetes Grid

Issue/Introduction

This article provides guidance on manually removing the .status.failureMessage and .status.failureReason properties from the affected objects. This step is necessary so that the controllers can resume reconciling the objects and the cluster status can recover.


Symptoms:

Customers may experience a situation where the VSphereVM controller is unable to locate a VSphereVM by its BiosUUID in vCenter. This problem results in the .status.failureMessage and .status.failureReason fields being set. These fields do not clear automatically and can prevent further reconciliation of the VSphereVM, VSphereMachine, and Machine objects, even after the underlying storage issues have been resolved.

 

The typical symptom is the failure message: "Unable to find VM by BIOS UUID <UUID>."
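
To identify the affected objects, you can list the VSphereVMs and inspect their failure fields. The command below is a sketch using kubectl custom-columns; healthy objects show <none> in the REASON and MESSAGE columns:

kubectl get vspherevm -A -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,REASON:.status.failureReason,MESSAGE:.status.failureMessage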


Environment

VMware Tanzu Kubernetes Grid 1.x

Cause

The issue may arise from a storage outage or a similar event that disrupts the VSphereVM controller's ability to locate a VSphereVM by its BiosUUID in vCenter. Subsequently, the .status.failureMessage and .status.failureReason fields are set, indicating a terminal problem. Because these fields are not designed to reset automatically, they impede further reconciliation by the controllers even after the storage issues are resolved.

Resolution

At the time of writing, there is no definitive resolution for this issue.


Workaround:

You can manually remove the .status.failureMessage and .status.failureReason properties from the affected objects. Once these fields have been removed, the controllers will resume reconciliation of the objects, allowing the cluster status to recover if no other issues exist.

 

You can use the following commands, replacing the $VSPHERE_VM_NAME and $MACHINE_NAME placeholders with the names of the affected objects. Run the commands in the namespace of those objects, or add the -n <namespace> flag.

 

For the VSphereVM object:

kubectl patch --subresource=status --type merge vspherevm $VSPHERE_VM_NAME --patch '{"status": {"failureMessage": null, "failureReason": null}}'


For the Machine object:

kubectl patch --subresource=status --type merge machine $MACHINE_NAME --patch '{"status": {"failureMessage": null, "failureReason": null}}'
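
After applying both patches, you can verify that the fields were cleared. As a quick check (using the same placeholders as above), the following commands should print empty output:

kubectl get vspherevm $VSPHERE_VM_NAME -o jsonpath='{.status.failureReason} {.status.failureMessage}'

kubectl get machine $MACHINE_NAME -o jsonpath='{.status.failureReason} {.status.failureMessage}'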


Note:
1) There is no need to reset the status on the VSphereMachine; clearing the fields on the VSphereVM will sync to the VSphereMachine.
2) The kubectl CLI must be version v1.24 or later to apply this workaround, because the --subresource=status flag was introduced as an alpha feature in v1.24 (see the version check below).
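
To confirm that your kubectl client meets the version requirement before attempting the patch:

kubectl version --client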


Additional Information

For more information, you may refer to GitHub Issue #9085.


Impact/Risks:

The impacted objects (VSphereVM, VSphereMachine, and Machine) will not reconcile, which prevents the cluster status from recovering and may significantly affect the cluster's ability to heal itself.