BOSH deleting all VMs deployed in vSphere

Article ID: 413105

Products

VMware Tanzu Application Service

Issue/Introduction

This issue occurs during maintenance work in which some details are deleted from the BOSH state file and the BOSH director is then redeployed.

The deployment behaves unexpectedly: the BOSH director deploys successfully, but all VMs are deleted along the way. A sample log looks like this:

Compiling package 'nats/####'... Skipped [Package already compiled] (00:00:00)
Compiling package 's3cli/####'... Skipped [Package already compiled] (00:00:00)
Compiling package 'powerdns/####'... Skipped [Package already compiled] (00:00:00)
Compiling package 'uaa/####'... Skipped [Package already compiled] (00:00:01)
Compiling package 'postgres-10/####'... Skipped [Package already compiled] (00:00:00)
Compiling package 'director/####'... Skipped [Package already compiled] (00:00:00)
Compiling package 'bosh-gcscli/####'... Skipped [Package already compiled] (00:00:00)
Updating instance 'bosh/0'...


Finished (00:01:14)
Waiting for instance 'bosh/0' to be running...

Finished (00:01:11)
Running the post-start scripts 'bosh/0'... Finished (00:00:01)
Finished deploying (00:09:12)

Deleting unused stemcell ''... 

In vCenter, you can see VMs being deleted while the deployment runs.

 

Environment

Ops Manager deployed in vSphere cluster

Cause

This is a bug with the following behaviour:

1) When the state file contains stemcells:[{}] instead of stemcells:[], the BOSH CLI incorrectly interprets {} as a "valid stemcell" with a blank id and cid.


2) The create-env command deploys the (valid) stemcell specified in the passed bosh.yaml manifest, appends the resulting stemcell record to the stemcells list in the deployment state, and records the id of that stemcell as the "current" stemcell for the deployment (n.b., this list does not track stemcells for BOSH-deployed VMs, just the BOSH director itself).

3) At the end of the BOSH director deployment, a cleanup step deletes unused stemcells. This is intended to remove the stemcell the BOSH director was previously using after it has upgraded to a new stemcell. It works by iterating over the list of stemcells, and if a record's id does not match the "current" stemcell id (the stemcell currently used by the director), it issues a CPI command to delete the stemcell with the `cid` from that record.


4) In this case, when the blank record is encountered, the blank id does not match the current stemcell id, so a delete_stemcell command is issued to the vSphere CPI with a blank cid.

5) The delete_stemcell behavior for the vSphere CPI is intended to find the specified stemcell and delete it. Critically, however, the vSphere CPI does NOT do an "exact match" on the stemcell cid. Instead, it matches existing VM names with a regex built from the submitted stemcell cid. The reason is that stemcells may be replicated across datastores (presumably to ensure there is a "local" copy of the stemcell available for VM creation). Replicated stemcells get a CID based on the original stemcell CID, with some additional metadata. Thus, to completely delete a stemcell from a vSphere deployment, it is necessary to find all VMs whose names contain the stemcell name. When the stemcell CID is "", this results in a regex that matches ALL VM names, and consequently the CPI attempts to delete every VM it can find.
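The steps above can be illustrated with a small simulation. This is only a sketch of the described behaviour, not the actual BOSH CLI (Go) or vSphere CPI (Ruby) code. The function names, the VM names, and the naming of the replicated stemcell copy are hypothetical; only the `id` and `cid` fields and the non-exact matching come from the description above.

```python
import re


def unused_stemcell_cids(stemcells, current_id):
    """Return the cid of every record whose id differs from the current stemcell id."""
    # A blank record ({}) has no id, so it never equals the current id
    # and is therefore treated as "unused".
    return [s.get("cid", "") for s in stemcells if s.get("id", "") != current_id]


def vms_matching(cid, vm_names):
    """Non-exact match: return every VM name that contains the cid."""
    pattern = re.compile(re.escape(cid))
    return [name for name in vm_names if pattern.search(name)]


# Hypothetical state after create-env: the blank record left over from
# stemcells:[{}] plus the record for the freshly deployed stemcell.
stemcells = [{}, {"id": "new-id", "cid": "sc-new"}]

# Hypothetical inventory: a stemcell, a replicated copy of it, and two ordinary VMs.
vm_names = ["sc-new", "sc-new / datastore-2", "vm-aaa", "vm-bbb"]

for cid in unused_stemcell_cids(stemcells, current_id="new-id"):
    print(repr(cid), "->", vms_matching(cid, vm_names))
```

In this sketch the only "unused" cid is the empty string, and an empty pattern matches every name in the inventory, whereas a real cid such as `sc-new` matches only the stemcell and its replicated copy.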

This behaviour is therefore the result of two intersecting problems: 

1) state files are not validated by the CLI at load time:

https://github.com/cloudfoundry/bosh-cli/blob/93ac2eac353649bb4a55367f70b3875cc18d2c7d/cmd/deployment_preparer.go#L116

This is likely because the file is not considered "user input", so its contents are assumed to be valid. In the scenario described here, the "{}" value becomes an "empty" stemcell record in the deployment state. By itself, the missing validation is not ideal but somewhat benign: the state file is not intended to be modified by hand, so it is not totally unreasonable for the CLI to assume that an existing file has valid contents. However, the lack of validation leaves the system open to surprising, unintended behavior, and many KBs and even official documentation guide users to make manual changes to the state file, so validation at load time seems warranted. A validator could either error out or silently reject empty stemcells. This change will need to be made to the BOSH CLI. Details for this problem are tracked here:

https://github.com/cloudfoundry/bosh/issues/2625
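As an illustration of the "error out" option, a load-time check might look like the sketch below. It is not the BOSH CLI implementation (the CLI is written in Go) and does not describe what the tracked issue actually changed. The function and exception names are hypothetical, and the sketch assumes the state file is JSON with a top-level `stemcells` list whose records carry `id` and `cid`.

```python
import json


class InvalidStateError(ValueError):
    """Raised when the deployment state contains an unusable stemcell record."""


def validate_stemcells(state):
    """Reject any stemcell record that is not an object with a non-blank id and cid."""
    for index, record in enumerate(state.get("stemcells", [])):
        if not isinstance(record, dict) or not record.get("id") or not record.get("cid"):
            raise InvalidStateError(
                f"stemcells[{index}] has a blank id or cid: {record!r}"
            )
    return state


def load_state(text):
    """Parse the state file contents and validate them before they are used."""
    return validate_stemcells(json.loads(text))
```

With this check, `load_state('{"stemcells": []}')` succeeds, while `load_state('{"stemcells": [{}]}')` raises `InvalidStateError` before any deployment step can act on the blank record.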

2) A delete_stemcell message issued to the vSphere CPI with a blank string results in the CPI issuing API calls to delete everything its regex matches, which, as described above, is every VM it can find. This is clearly very bad, although the rest of the system generally seems to have enough safeguards to prevent the observed problem, as this code has been in place for at least 10 years without modification. The CPI should reject blank strings as invalid. This change will need to be made to the BOSH vSphere CPI. Details for this problem are tracked here:

https://github.com/cloudfoundry/bosh-vsphere-cpi-release/pull/407
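The kind of guard described in this point can be sketched as follows. This is an illustration only and does not reproduce the change in the pull request above or the CPI's real Ruby code. The function signature, the exception name, and the `delete_vm` callback are hypothetical stand-ins, and a simple substring test stands in for the CPI's non-exact matching.

```python
class BlankStemcellCidError(ValueError):
    """Raised when delete_stemcell is called without a usable stemcell cid."""


def delete_stemcell(cid, vm_names, delete_vm):
    """Delete every VM whose name contains cid, refusing blank cids outright."""
    # Guard first: a blank cid would otherwise match every VM name below.
    if not isinstance(cid, str) or not cid.strip():
        raise BlankStemcellCidError("refusing to delete a stemcell with a blank cid")

    matched = [name for name in vm_names if cid in name]
    for name in matched:
        delete_vm(name)  # hypothetical callback standing in for the vSphere API call
    return matched
```

Placing the check before any matching means a blank cid fails fast and no delete call is ever issued for it.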

 

 

Resolution

A fix for this bug is included in vSphere CPI version 98.0.1, which ships with Ops Manager v3.2.1.

Upgrade to Ops Manager v3.2.1 or later.

Additional Information

The BOSH state file is located on the Ops Manager VM. It is best practice to back up the state file before editing it, so that it is easier to revert:

cd /var/tempest/workspaces/default/deployments
sudo cp bosh-state.json bosh-state.json.bkp

A known KB for recreating the BOSH director requires editing the state file. Note that its resolution changes stemcells to:

stemcells:[]

Take extra care not to set it to the following (note the curly brackets):

stemcells:[{}]

This would delete VMs on versions prior to v3.2.1.
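One way to avoid a typo when making this edit is to apply it programmatically. The sketch below is a hypothetical helper, not a supported tool: it assumes the state file is JSON with a top-level `stemcells` key, writes a `.bkp` copy next to the file first, and changes nothing else. Any other edits the referenced KB requires are outside its scope, and on the Ops Manager VM it would need sufficient privileges to write to the file.

```python
import json
import shutil
from pathlib import Path


def reset_stemcells(state_path):
    """Back up the state file, then set its stemcells list to an empty list."""
    path = Path(state_path)

    # Keep a copy next to the original, e.g. bosh-state.json.bkp.
    backup = path.with_name(path.name + ".bkp")
    shutil.copy2(path, backup)

    state = json.loads(path.read_text())
    state["stemcells"] = []  # an empty list, never [{}]
    path.write_text(json.dumps(state, indent=4))
    return backup
```

For example, `reset_stemcells("/var/tempest/workspaces/default/deployments/bosh-state.json")` would leave the original contents in `bosh-state.json.bkp` in the same directory.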