Drift Detector for Tanzu Kubernetes Grid Management Cluster

Products

VMware Tanzu Kubernetes Grid
VMware Tanzu Kubernetes Grid Management

Issue/Introduction

Symptoms:

Note: The drift detector is an experimental feature. Because drift can take many forms, the detector works on a best-effort basis. It may not cover every case, and its report should be used only as a reference.

Mismatch between the resources recorded in the backup and the actual state of the infrastructure of the Tanzu Kubernetes Grid management cluster.

Environment

VMware Tanzu Kubernetes Grid 2.1.0

Cause

VMware Tanzu Kubernetes Grid (TKG) is a product for managing the lifecycle of Kubernetes clusters.

Since v2.1.0, TKG has provided a solution for backing up and restoring the cluster objects on a management cluster. If a disaster makes the management cluster unavailable while the workload clusters remain accessible, the user can provision a new management cluster instance, restore the cluster objects, and continue managing the existing workload clusters through the new instance. For more details, refer to Back Up and Restore Management and Workload Cluster Infrastructure on vSphere.

In the context of this solution, "drift" refers to a mismatch between the resources recorded in the backup and the actual state of the infrastructure. This mismatch can cause problems during the restoration process. For more details on handling drift, refer to the "Handling Drift" section of the documentation: Handling Drift

Resolution

To address the issue of drift, the Drift Detector has been introduced as a tool. It compares the content of a backup with the current state of the infrastructure and generates a comprehensive report. This report assists users in identifying potential issues and performing necessary manual steps to mitigate the drift before initiating the restore workflow, thereby facilitating a smoother restoration process.

Workaround:

How to install

Download and unzip the drift-detector-v0.2.0.zip file attached to this KB in the "Attachments" section. The file contains binaries for Linux, macOS, and Windows that you can run as shell commands without any installation process.
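The unpacking step can be sketched as follows for a POSIX shell. The archive name comes from this KB. The target directory name and the helper function names are hypothetical, and because the layout of the archive is not described here, the sketch only lists the extracted files so that you can pick the binary for your platform.

```shell
#!/bin/sh
# Map the kernel name reported by `uname -s` to the platform names used
# in this KB. The mapping is illustrative only.
platform_label() {
    case "$1" in
        Linux)                 echo "Linux" ;;
        Darwin)                echo "macOS" ;;
        MINGW*|MSYS*|CYGWIN*)  echo "Windows" ;;
        *)                     echo "unknown" ;;
    esac
}

# Unpack the archive into a directory and list what it contains.
# $1: path to the zip file, $2: target directory (hypothetical name).
unpack_detector() {
    unzip -o "$1" -d "$2" && ls -R "$2"
}

echo "Platform: $(platform_label "$(uname -s)")"
# Example (uncomment once the zip file has been downloaded):
# unpack_detector drift-detector-v0.2.0.zip ./drift-detector
```

On Linux and macOS the extracted binary may also need the executable bit set with chmod +x before it can be run.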

How to use

Run the drift detector before performing the restoration, following these steps:

Download the backup tarball

Download the backup tarball either from the backup store portal directly or use the Velero CLI:

velero backup download <backup-name>
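As a sketch, the download can be followed by a quick check that the tarball is readable before it is passed to the detector. The backup name my-backup and the function names are hypothetical. The -o option of velero backup download sets the local output path.

```shell
#!/bin/sh
# Return 0 if the file is a readable gzip-compressed tarball, non-zero otherwise.
check_tarball() {
    tar -tzf "$1" >/dev/null 2>&1
}

# Download a backup with the Velero CLI and verify the result.
# $1: backup name (hypothetical), $2: local output path.
download_backup() {
    velero backup download "$1" -o "$2" || return 1
    check_tarball "$2" || { echo "unreadable tarball: $2" >&2; return 1; }
    echo "backup saved to $2"
}

# Example (requires Velero access to the backup store):
# download_backup my-backup ./my-backup-data.tar.gz
```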

Detect the drift with the detector

All the available options of the drift-detector command are as follows:

drift-detector detect -h
Detect the drifts between the backup and infrastructure

Usage:
  drift-detector detect [flags]

Flags:
      --backup string              The local path of the backup tarball file. Required
      --format string              The report format. One of: (json) (default "json")
  -h, --help                       help for detect
      --ignore-healthy-resources   Ignore the healthy resources in the report
      --insecure-skip-verify       Skip the verification of an infra server’s certificate during a connection.
  -o, --output string              Report output file. Required
      --skip-access-apiserver      Specify whether skip accessing the API servers of workload clusters during the detection

Global Flags:
  -D, --debug   Enable debug mode

The "--backup" option is required; it sets the local file system path of the backup tarball. The "-o" ("--output") option is also required; it sets the file to which the report is written.

If the management cluster manages many workload clusters, the detector output contains so much information that the drifted resources are hard to locate. Set the "--ignore-healthy-resources" option to limit the output to the drifted resources.

Connecting to the API servers of the workload clusters is helpful, but not required. If the API servers of the workload clusters are not accessible, set the "--skip-access-apiserver" option to skip this step.
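The options described above can be combined. The following sketch prints a drift-detector command line for review rather than running it. The function name and the ONLY_DRIFT and SKIP_API variables are hypothetical; the flags are the ones listed in the help output.

```shell
#!/bin/sh
# Print the drift-detector command line for a backup tarball and a report path.
# $1: backup tarball path, $2: report output path (paths without spaces).
# ONLY_DRIFT=1 adds --ignore-healthy-resources.
# SKIP_API=1 adds --skip-access-apiserver.
build_detect_cmd() {
    cmd="drift-detector detect --backup $1 -o $2"
    if [ "${ONLY_DRIFT:-0}" = "1" ]; then
        cmd="$cmd --ignore-healthy-resources"
    fi
    if [ "${SKIP_API:-0}" = "1" ]; then
        cmd="$cmd --skip-access-apiserver"
    fi
    echo "$cmd"
}

# Default form, then the form for a large environment with unreachable API servers.
build_detect_cmd my-backup-data.tar.gz report.json
ONLY_DRIFT=1 SKIP_API=1 build_detect_cmd my-backup-data.tar.gz report.json
```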

Run the drift detector command:

drift-detector detect --backup my-backup-data.tar.gz --insecure-skip-verify -o report.json

The command output has three main parts that describe how the Kubernetes objects in the backup match the VMs and other infrastructure resources:

Summary: An overview of the detect result, including the overall status, total cluster count in each status, total infrastructure machines that need further confirmation, and whether the detection process generated any errors.
Resource list: The list of clusters and infrastructure machines which are marked with different statuses.
JSON report: A detailed description of the detection result.

The Machine listings have four possible statuses:

Healthy: The object matches an infrastructure resource.
Stale: The object has no corresponding infrastructure resource. No manual remediation is required; the TKG controller will take care of it.
Ghost: An infrastructure resource is found that has no corresponding object in the backup. Requires manual remediation.
Unknown: Error during detection. Requires further investigation.
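For quick reference while reading a report, the four Machine statuses above can be mapped to their follow-up actions with a small helper. This is only a sketch; the function name machine_action is hypothetical, and the actions restate the list above.

```shell
#!/bin/sh
# Translate a Machine status into the follow-up described in this KB.
machine_action() {
    case "$1" in
        Healthy) echo "no action: object matches an infrastructure resource" ;;
        Stale)   echo "no action: the TKG controller handles it" ;;
        Ghost)   echo "manual remediation required" ;;
        Unknown) echo "investigate: error during detection" ;;
        *)       echo "unrecognized status: $1" ;;
    esac
}

# Print the action for every documented status.
for s in Healthy Stale Ghost Unknown; do
    printf '%-8s -> %s\n' "$s" "$(machine_action "$s")"
done
```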

ControlPlane, Workers, Cluster, and overall Summary listings have the following possible statuses:

Healthy: If all the sub resources are healthy, the resource itself is marked as healthy.
ManualRemediationNotRequired: The resource contains sub resources that are Stale. No manual remediation is required; the TKG controller will take care of it.
ManualRemediationRequired: The resource contains sub resources that are Ghost. Requires manual remediation.
NeedFurtherConfirmationInfraMachines: The resource matches no object and is not referenced by any cluster, so the detector cannot determine whether it is a Ghost TKG machine or has no connection to TKG. Requires further investigation.
NeedFurtherConfirmation: Error during detection. Requires further investigation.
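To get a rough idea of how much attention a report needs, the status names above can be counted with a plain text scan. This is a sketch under a stated assumption: the schema of the JSON report is not documented in this KB, so the script assumes only that status names appear in the report as quoted string values, and it does not parse any field names. The function names are hypothetical.

```shell
#!/bin/sh
# Count how many times a quoted status string occurs in a report file.
# This is a text scan, not a JSON parse, because the report schema is
# not documented here.
# $1: report file, $2: status name.
count_status() {
    grep -o "\"$2\"" "$1" | wc -l | tr -d ' '
}

# Print a count for each status that calls for action or investigation.
scan_report() {
    for s in Ghost Unknown ManualRemediationRequired NeedFurtherConfirmation; do
        echo "$s: $(count_status "$1" "$s")"
    done
}

# Example: scan_report report.json
```

A non-zero count is only a prompt to open the report and read the affected entries; it does not replace reviewing the report itself.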

Remediation

Follow the guide to remediate all the ghost machines after performing the restoration.

Attachments

drift-detector-v0.1.0

drift-detector-v0.2.0