TKGI: Kubernetes clusters and VMs in various unhealthy states after a Datastore outage: recovery

Article ID: 379129


Products

VMware Tanzu Kubernetes Grid Integrated Edition
VMware Tanzu Kubernetes Grid Integrated (TKGi)
VMware Tanzu Kubernetes Grid Integrated Edition (Core)
VMware Tanzu Kubernetes Grid Integrated Edition 1.x
VMware Tanzu Kubernetes Grid Integrated Edition Starter Pack (Core)

Issue/Introduction

This article provides guidance on bringing all TKGI Control Plane (CP) and Kubernetes cluster VMs back to a Running state after a vSphere Datastore outage.

Scenario:

  • You experienced an outage with one or more Datastores within the vSphere foundation used by the TKGI Control Plane and Kubernetes clusters.

  • The Datastore outage could have been due to:
    • A Datastore going offline unexpectedly
    • Loss of Datastore connectivity
    • etc.

  • The Datastores now report online and functional.

  • However, some TKGI CP and Kubernetes cluster VMs do not return to a Running status. 

  • The TKGI Control Plane and Kubernetes cluster VMs may show various unhealthy states.

  • You are looking for a method for quick recovery of those affected VMs.

 

 

Example VM States:

When checking the states of the TKGI CP and cluster VMs with bosh commands such as the following:

bosh vms

or

bosh -d service-instance_CLUSTERUUID vms


You may see VMs in various unhealthy states, including the following (an illustrative example appears after this list):

    • Unresponsive agent

    • Stopped

    • Failed

    • -
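
For illustration only, an affected Deployment's bosh vms output might resemble the following (the instance names, IPs, and CIDs below are hypothetical placeholders):

Instance                                      Process State       AZ   IPs         VM CID                                    VM Type
master/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX  unresponsive agent  az1  10.0.11.10  vm-XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX  medium
worker/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX  failed              az2  10.0.11.11  vm-XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX  worker
worker/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX  -                   az3  10.0.11.12  -                                         worker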

 

NOTE:

  • Below is a general approach for more quickly returning TKGI CP nodes, and potentially many Kubernetes clusters and their VMs, to a Running state. 

  • The following approach is based on a real-life customer incident in which all Datastores experienced a power outage.

  • Other approaches exist.  Review the outline below before beginning recovery.



Details of this approach:

  • This approach focuses on specific bosh and vSphere operations which can more quickly bring the bosh-managed VMs back to a Running state.

  • It may not be the only approach you should consider, nor necessarily the best one for your situation, but it can be a good starting point for your recovery.

  • In terms of IaaS support, this article covers only vSphere as the underlying IaaS.

Environment

TKGI 1.17 and above

VMware vSphere

VMware vCenter

Resolution

STEPS:

1- Identify all VMs requiring recovery

Use the commands below to identify all problematic bosh Deployments and, within them, the VMs with issues or failing processes.  It is a good idea to track items such as the following (a scripted tracking sketch follows this list):
  • The VM Instance (INSTANCE-GROUP/INSTANCE-ID),

  • VM CID,

  • Initial Process State of VM and processes,

  • Current state,

  • Any recovery operations you perform
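
As a convenience, the sketch below captures a point-in-time snapshot of every Deployment's VMs to one file per Deployment for tracking.  It is a minimal example, assuming the bosh CLI is already logged in, jq is installed, and the standard bosh --json table layout:

# Capture the current "bosh vms" output of every Deployment to vm-state.<deployment>.txt
for dep in $(bosh deployments --json | jq -r '.Tables[0].Rows[].name'); do
  bosh -d "${dep}" vms > "vm-state.${dep}.txt"
done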

 

    • List all bosh Deployment names (UUIDs)

bosh deployments --column=name
 
Example output:

Using environment '<BOSH_DIRECTOR_IP>' as client 'ops_manager'

Name
harbor-container-registry-XXXXXXXXXXXXXXXXXXXX
pivotal-container-service-YYYYYYYYYYYYYYYYYYYY
service-instance_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
service-instance_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
service-instance_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
service-instance_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX

 

    • Then, for each Deployment, check the overall state of the Virtual Machines:

bosh -d service-instance_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX vms

          • NOTE: Record any bosh VMs "not" in a running state. 


    • Then, for each Deployment showing one or more VMs "not" in running state:

      • Check the Process State of all its bosh processes:

bosh -d service-instance_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX instances --ps

NOTE: Record any bosh processes "not" in a running state.
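
To reduce noise, the instances command also accepts a --failing flag, which limits the output to instances and processes "not" in a running state:

bosh -d service-instance_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX instances --ps --failing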




 

2- Recover VMs

NOTE: The preferred command to start with is bosh recreate of individual VMs (or of an entire Deployment UUID).
            In a real-life incident, using bosh cck alone did not recover all VM issues.
            A bosh recreate (or a guest O/S restart from vSphere) was still required most of the time.
            Starting with bosh recreate may save a lot of time.

IMPORTANT: Unless you wish to recreate "every" VM under a Deployment UUID, you "must" include these options:

--fix --no-converge INSTANCE-GROUP/INSTANCE-ID

 

 

    • RECREATE INDIVIDUAL VMs:

Recreate single VMs that do "not" show a running state. 

Recommended when most VMs show a running state; recreating a large number of Deployment VMs takes much longer.

bosh -d service-instance_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX recreate --fix --no-converge INSTANCE-GROUP/INSTANCE-ID
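
If many individual VMs need recreating, a simple shell loop can help.  Below is a minimal sketch, assuming the failing instances were saved one per line to a file named failing-instances.txt (a hypothetical filename):

# Recreate each failing instance (one INSTANCE-GROUP/INSTANCE-ID per line) listed in failing-instances.txt.
# -n runs bosh non-interactively so it does not prompt for confirmation on every VM.
while read -r instance; do
  bosh -n -d service-instance_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX recreate --fix --no-converge "${instance}"
done < failing-instances.txt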

 

 

    • RECREATE ALL DEPLOYMENT VMs:

Recreate every VM under a Deployment UUID, regardless of running state.

Recommended when most VMs do "not" show a running state.  Note that recreating a large number of Deployment VMs will take much longer.

bosh -d service-instance_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX recreate --fix --skip-drain

 

 

    • Restart the guest O/S from the vSphere UI for anything else

For any VM that still does not recover:

  • Log in to the vSphere UI.

  • Search for the VM CID.

  • Use the Actions menu to restart the guest O/S of the VM.

  • Confirm the running state via bosh.

  • If needed, re-run the previous commands.
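
If many VMs need a guest O/S restart, the open-source govc CLI is a scriptable alternative to the vSphere UI.  A minimal sketch, assuming govc is installed and the GOVC_URL, GOVC_USERNAME, and GOVC_PASSWORD environment variables point at your vCenter:

# Reboot the guest O/S (via VMware Tools) of the VM whose name matches the bosh VM CID.
# The CID below is a placeholder; substitute the vm-... name recorded in step 1.
govc vm.power -r "vm-XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"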

 

 

NOTE: Run the following for each Deployment UUID above that still has processes or VMs "not" in a running state:

bosh -d service-instance_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX cloud-check 

        • Select the resolution: 3: Recreate VM without waiting for processes to start (the menu numbering may vary with the problems detected)

        • Enter y to continue
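
If you prefer to apply that resolution without the interactive menu, cloud-check also accepts a resolution by name.  A sketch, assuming the resolution name recreate_vm_without_wait matches the option shown in the interactive menu:

bosh -n -d service-instance_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX cloud-check --resolution=recreate_vm_without_wait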