This article provides guidance on bringing all TKGI CP and Kubernetes cluster VMs back to a Running state after a vSphere Datastore outage.
Scenario:
Example VM States:
Check the states of the TKGI CP and cluster VMs with bosh commands such as the following:
bosh vms
or
bosh -d service-instance_CLUSTERUUID vms
You may see VMs in various unhealthy states, including:
NOTE:
Details of this approach:
TKGI 1.17 and above
VMware vSphere
VMware vCenter
STEPS:
1- Identify all VMs requiring recovery
Use the commands below to identify all problematic bosh Deployments and, within them, only the VMs with issues or failing processes. It is a good idea to track details such as:
- The VM Instance (INSTANCE-GROUP/INSTANCE-ID),
- VM CID,
- Initial Process State of VM and processes,
- Current state,
- Any recovery operations you perform
bosh deployments --column=name
Example output:
Using environment '<BOSH_DIRECTOR_IP>' as client 'ops_manager'
Name
harbor-container-registry-XXXXXXXXXXXXXXXXXXXX
pivotal-container-service-YYYYYYYYYYYYYYYYYYYY
service-instance_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
service-instance_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
service-instance_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
service-instance_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
bosh -d service-instance_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX vms
bosh -d service-instance_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX is --ps
NOTE: Record any bosh processes "not" in running state.
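The check above can be scripted. The following is a minimal sketch, not a supported tool. It assumes that bosh prints one tab-separated row per VM when its output is piped, and that --column accepts the names instance and process_state; verify both against your bosh CLI version. The function name filter_not_running is hypothetical.

```shell
# Hypothetical helper: reads tab-separated "instance<TAB>state" rows on stdin
# and prints only the rows whose state is not "running".
filter_not_running() {
  awk -F'\t' 'BEGIN { OFS = "\t" }
  {
    inst = $1; state = $2
    # Trim leading/trailing blanks that may surround each field.
    gsub(/^[ \t]+|[ \t]+$/, "", inst)
    gsub(/^[ \t]+|[ \t]+$/, "", state)
    if (inst != "" && state != "running") print inst, state
  }'
}

# Example usage (assumed column names; DEPLOYMENT is a placeholder variable):
#   bosh -d "$DEPLOYMENT" vms --column=instance --column=process_state | filter_not_running
```

The output can be pasted directly into the tracking notes described earlier, since each line carries the VM instance and its initial state.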
NOTE: The preferred command to start with is bosh recreate, either for individual VMs or for an entire Deployment UUID.
In a real-life incident, bosh cck alone did not recover all VM issues.
Most of the time we still had to perform a bosh recreate (or a Guest O/S reboot from vSphere).
Starting with bosh recreate may save a lot of time.
IMPORTANT: Unless you wish to recreate "every" VM under a Deployment UUID, you "must" include these options:
--fix --no-converge INSTANCE-GROUP/INSTANCE-ID
Recreate single VMs that do "not" show running state.
Recommended if most VMs show running state. Recreating a large number of Deployment VMs will take much longer.
bosh -d service-instance_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX recreate --fix --no-converge INSTANCE-GROUP/INSTANCE-ID
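When several VMs in one Deployment need the single-VM recreate, the command can be generated once per instance. The sketch below only prints the commands so they can be reviewed before anything is run. The function name print_recreate_cmds and the sample instance names in the usage line are hypothetical.

```shell
# Hypothetical helper: prints (does not run) one recreate command per instance.
# $1 = deployment name; instances (INSTANCE-GROUP/INSTANCE-ID) are read from stdin.
# The body runs in a subshell "( ... )" so its variables do not leak to the caller.
print_recreate_cmds() (
  deployment="$1"
  while IFS= read -r instance || [ -n "$instance" ]; do
    [ -n "$instance" ] || continue   # skip blank lines
    printf 'bosh -d %s recreate --fix --no-converge %s\n' "$deployment" "$instance"
  done
)

# Example usage (placeholder names):
#   printf 'worker/ID-1\nworker/ID-2\n' | print_recreate_cmds "$DEPLOYMENT"
```

Printing instead of executing is a deliberate choice here: a recreate replaces the VM, so it is worth confirming the instance list first and then running the printed lines one at a time.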
Recreate every VM under a Deployment UUID, regardless of running state:
Recommended if most VMs do "not" show running state. Recreating a large number of Deployment VMs will take much longer.
bosh -d service-instance_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX recreate --fix --skip-drain
Log in to the vSphere UI.
Search for the VM CID.
Use the Actions menu to restart the Guest O/S of the VM.
Confirm running state via bosh.
If needed, run the previous commands again.
cloud-check for reconciling IaaS resources (vSphere).
NOTE: Run the following for each Deployment UUID above with processes or VMs "not" in running state:
bosh -d service-instance_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX cloud-check
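If many Deployments are affected, the cloud-check command can be generated for each one from the tracked list. This is a minimal sketch that only prints the commands; the function name print_cloud_check_cmds is hypothetical, and the input is assumed to be one deployment name per line.

```shell
# Hypothetical helper: reads deployment names (one per line) on stdin and
# prints (does not run) a cloud-check command for each.
# The body runs in a subshell "( ... )" so its variables do not leak to the caller.
print_cloud_check_cmds() (
  while IFS= read -r deployment || [ -n "$deployment" ]; do
    [ -n "$deployment" ] || continue   # skip blank lines
    printf 'bosh -d %s cloud-check\n' "$deployment"
  done
)

# Example usage (affected-deployments.txt is a placeholder file name):
#   print_cloud_check_cmds < affected-deployments.txt
```

cloud-check prompts for a resolution when it finds a problem, so the printed commands are meant to be run interactively, one Deployment at a time.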