If you are in the scenario described above, with duplicate IPs, it is likely the result of running a deploy with the --fix option while the underlying infrastructure was degraded.
For reference, Bosh does not check which IPs are actually in use. Bosh assumes that it has full control over the IP range, so if it does not know about a VM using an IP, it treats that IP as free. Running bosh deploy --fix deletes the knowledge Bosh has about the existing VM and its IP. Bosh therefore assumes the IP is free when it later creates the replacement VM, but the old VM might still be running because --fix was not able to delete it in vSphere.
The fix:
The duplicate VMs/IPs will need to be removed manually.
- Configure Bosh itself to use a different IP
- Search for the duplicate IPs in vCenter and resolve the issue by deleting the VM or changing its IP.
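As a rough aid for the vCenter search, the sketch below prints every IP that appears on more than one VM. It assumes you have already exported the inventory to a text file with one "<vm-name> <ip>" pair per line; that file format and the function name find_duplicate_ips are illustrative assumptions, not something Bosh or vCenter produces in this form.

```shell
# Print every IP that is assigned to more than one VM, with the VM names.
# $1: inventory file with one "<vm-name> <ip>" pair per line
#     (hypothetical export format, assumed for this sketch)
find_duplicate_ips() {
  inventory_file="$1"
  awk 'NF >= 2 {
         count[$2]++                                   # how many VMs use this IP
         names[$2] = (names[$2] == "" ? $1 : names[$2] "," $1)
       }
       END {
         for (ip in count)
           if (count[ip] > 1) print ip, names[ip]      # only IPs seen twice or more
       }' "$inventory_file" | sort
}
```

Calling find_duplicate_ips vcenter_inventory.txt prints one line per conflicting IP followed by the comma-separated VM names that hold it, so you can decide which VM to delete or re-address.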
Finding the Duplicate IP:
To find all duplicate IP instances, the most efficient way is to get all cloud IDs (CIDs) from Bosh, cross-reference them with vCenter/IaaS (possibly searching by the deployment tag name there), and then delete the extra VMs.
- Validate that the infrastructure is stable
- Find the VM CIDs: bosh vms --column="VM CID"
- Identify the VMs lost in the deploy --fix by looking for matching names in vSphere.
- Delete the duplicate VMs from the IaaS
- Delete any old/duplicate VMs from Bosh: bosh delete-vm <VM-CID>
In vCenter, the columns in the VMs tab can also be expanded by selecting the down arrow in the column header, which makes it easier to find any VM names that match.
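The cross-referencing step can be sketched as a simple set difference. The example assumes two text files: one with the CIDs Bosh knows about (for instance, the saved output of bosh vms --column="VM CID", trimmed to one CID per line if your CLI version prints extra lines) and one with the VM names found in vCenter for this director only. The file names and the function name find_untracked_vms are hypothetical.

```shell
# List VM names that exist in vCenter but are not tracked by Bosh.
# $1: file of VM CIDs known to Bosh, one per line
# $2: file of VM names found in vCenter, one per line (hypothetical export,
#     limited to the VMs that belong to this Bosh director)
find_untracked_vms() {
  bosh_cids="$1"
  vcenter_names="$2"
  # First file: remember every CID Bosh knows. Second file: print names not seen.
  # Only the first field of each line is used, which drops table padding.
  awk 'FILENAME == ARGV[1] { if (NF) known[$1] = 1; next }
       NF && !($1 in known) { print $1 }' "$bosh_cids" "$vcenter_names"
}
```

Any name printed by find_untracked_vms bosh_cids.txt vcenter_names.txt is a candidate for a VM left behind by deploy --fix; confirm each one in vCenter before deleting it.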
Differences Between --fix and CCK
deploy --fix: This is an experimental flag added to bosh deploy to automatically recreate VMs that are not responsive. The deploy --fix flag can result in duplicate VMs because failures and errors in deletion are ignored.
bosh -d <deployment> manifest > manifest.yml
bosh -d <deployment> deploy manifest.yml --fix
Bosh CCK: Choosing to recreate a VM destroys the ephemeral disk, creates a new VM, destroys the old VM, and starts everything on the brand-new VM based on the deployment manifest.
- https://docs.pivotal.io/ops-manager/2-9/install/trouble-advanced.html#cck
Additional Step if VM exists in Bosh but not vCenter
If there is latency in the communication between Bosh, the CPI, and vCenter, a VM can end up removed from vCenter but still present in the Bosh director database. In this case:
1. Use "bosh vms" to get the VM CID for that instance.
2. Verify that the VM Bosh expects no longer exists in vCenter by searching for the cloud ID.
3. Run "bosh cck" and choose "Delete VM reference" for the missing VM only.
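The verification in step 2 can be sketched as a small check before touching the director database. It assumes a text file of VM names exported from vCenter, one per line with no surrounding whitespace; that file and the function name vm_missing_from_vcenter are hypothetical and used only for illustration.

```shell
# Succeed (exit 0) only if the given CID is absent from the vCenter export.
# $1: VM CID reported by Bosh
# $2: file of VM names found in vCenter, one per line (hypothetical export)
vm_missing_from_vcenter() {
  cid="$1"
  vcenter_names="$2"
  # Refuse to guess when the export cannot be read.
  [ -r "$vcenter_names" ] || { echo "cannot read $vcenter_names" >&2; return 2; }
  # -F fixed string, -x whole-line match, -q quiet
  if grep -Fxq -- "$cid" "$vcenter_names"; then
    return 1   # still present in vCenter: do not delete the reference
  fi
  return 0     # not found: the record in Bosh appears to be stale
}
```

If vm_missing_from_vcenter "<VM-CID>" vcenter_names.txt succeeds, it is reasonable to continue with bosh cck and delete the VM reference for that instance; if it fails, the VM still exists and should be investigated first.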