Bosh VMs are fluctuating between 'Unresponsive Agents' and Healthy states

Products

VMware Tanzu Application Service for VMs

Issue/Introduction

When a BOSH agent frequently alternates between responsive and unresponsive states, the root cause is commonly traced back to network packet loss or a duplicate IP on the network.

This issue typically begins when VMs become unexpectedly unavailable within the IaaS itself, and the IaaS loses track of VM instances as a result of the fault.
Examples of such events are datastore failures and significant host crashes.

In the same scenario you may also see an 'Unable to render jobs' error if a deployment upgrade failed and bosh recreate or bosh cck was used to resolve VM issues. This error occurs because the code has been upgraded, but the variables it expects in CredHub have not yet been updated by the failed deploy.

Behavior

If bosh cck or bosh deploy --fix is run at any point while the system is degraded, new VMs may be created. The old, unavailable VMs are not destroyed once vSphere is restored, resulting in duplicate VMs/IPs.

As a result of the critical failure, operators can then see duplicate IPs reported by Bosh on the network because a VM may be running on a remote ESXi server which vCenter is unaware of.

Bosh, no longer knowing about the old unresponsive VMs, will try to update the deployment's missing VMs with new VMs and the CPI will complain about conflicting IPs.

Error: Unknown CPI error 'Unknown' with message 'Detected IP conflicts with other VMs on the same networks:..

The error above is specific to the vSphere CPI and is raised during the create_vm process. It means that another VM already exists and is using the IP address called out in the log.

Note: The following is the proper procedure to recover from instability resulting from an IaaS failure:

1. Turn off resurrection: bosh update-resurrection off
2. Validate that the IaaS is stable and all VMs are running
3. Run bosh cck
4. Delete any unstable/unavailable VMs from the IaaS

See the following resource for additional information:
https://bosh.io/docs/cck/#not-responsive-vm
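
Because the order of the recovery steps matters, it can help to print them as a checklist before touching the environment. The following is only a minimal sketch: the function name print_recovery_plan and the deployment name passed to it are hypothetical, and nothing is executed, because the IaaS validation and the VM clean-up are manual checks.

```shell
# Print the recovery sequence as a checklist; nothing is executed.
# print_recovery_plan and its argument are illustrative names, not part of the bosh CLI.
print_recovery_plan() {
  deployment="$1"
  cat <<EOF
bosh update-resurrection off
# manual: confirm in the IaaS that it is stable and all VMs are running
bosh -d ${deployment} cck
# manual: delete any unstable/unavailable VMs from the IaaS
EOF
}
```

Calling print_recovery_plan with a deployment name prints the two bosh commands in order, with the manual checks between them shown as comment lines.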

Resolution

If you are in the scenario mentioned above with duplicate IPs, it is likely the result of using the --fix option for a deploy while the underlying infrastructure was degraded.

For reference, Bosh does not check which IPs are actually in use. Bosh assumes that it has full control over the IP range, so if it doesn’t know about a VM using an IP, it treats that IP as free. When bosh deploy --fix is run, Bosh deletes the knowledge it has about the existing VM and its IP. It then assumes the IP is free when it later tries to create the VM, but the old VM might still be running because the --fix wasn’t able to delete it in vSphere.

The fix:

The duplicate VMs/IPs will need to be removed manually.

Configure Bosh itself to use a different IP
Search for the duplicate IPs in vCenter and resolve the issue by deleting the VM or changing its IP.
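
If a list of VM names and IP addresses can be exported from vCenter, the search for duplicates can be scripted. The sketch below assumes a hypothetical export format of one "vm_name ip_address" pair per line. The function name find_duplicate_ips is illustrative and is not part of any VMware or BOSH tooling.

```shell
# find_duplicate_ips: read "vm_name ip_address" pairs on stdin and print each
# IP that is assigned to more than one VM, followed by the names of those VMs.
# The input format is an assumption; adapt the field numbers to the real export.
find_duplicate_ips() {
  awk '
    NF >= 2 {
      count[$2]++
      names[$2] = (names[$2] == "" ? $1 : names[$2] "," $1)
    }
    END {
      for (ip in count)
        if (count[ip] > 1)
          print ip, names[ip]
    }
  ' | sort
}
```

The function would be fed the export on stdin, for example: find_duplicate_ips < vcenter_export.txt (the file name is a placeholder).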

Finding the Duplicate IP:

To find all duplicate IP instances, the most efficient way is to get all cloud IDs (CIDs), cross-reference them with vCenter/IaaS (possibly searching by the deployment tag name there), and then delete the extra VMs.

1. Validate that the infrastructure is stable
2. Find the VM CIDs: bosh vms --column="VM CID"
3. Identify the VMs lost in the deploy --fix by looking for matching names in vSphere
4. Delete the duplicate VMs from the IaaS
5. Delete any old/duplicate VMs from Bosh: bosh -d <deployment> delete-vm <VM-CID>

In vCenter, the columns in the VMs tab can also be expanded by selecting the down arrow in the column header, which makes it easier to find any VM names that match.
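
The cross-reference described above can be sketched as a comparison of two text files. This assumes the VM CIDs known to Bosh have been saved one per line (for example from the bosh vms command shown earlier), and that the VM names for the same deployment have been exported from vCenter one per line. The function name vms_unknown_to_bosh and both file names are hypothetical. It also assumes that the VM name in vSphere matches the CID.

```shell
# vms_unknown_to_bosh BOSH_CIDS_FILE VCENTER_NAMES_FILE
# Prints VM names that appear in the vCenter list but not in the Bosh CID list.
# These are candidates for the duplicate VMs left behind by deploy --fix.
vms_unknown_to_bosh() {
  known="$(mktemp)"
  seen="$(mktemp)"
  # comm requires sorted input; trim whitespace and drop blank lines first
  sed 's/^[[:space:]]*//; s/[[:space:]]*$//' "$1" | grep -v '^$' | sort -u > "$known"
  sed 's/^[[:space:]]*//; s/[[:space:]]*$//' "$2" | grep -v '^$' | sort -u > "$seen"
  comm -13 "$known" "$seen"   # lines found only in the vCenter list
  rm -f "$known" "$seen"
}
```

Review every name it prints in vCenter before deleting anything. If the export is not limited to the deployment's VMs, unrelated VMs will also be listed.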

Differences Between --fix and CCK

deploy --fix:
This is a flag added to bosh deploy to automatically recreate VMs that are not responsive. The deploy --fix flag can result in duplicate VMs because failures and errors in deletion are ignored.

 bosh -d <deployment> manifest > manifest.yml
 bosh -d <deployment> deploy manifest.yml --fix

Bosh CCK:
Choosing to recreate the VM in bosh cck will destroy the ephemeral disk, create a new VM, destroy the old VM, and start all jobs on the new VM based on the deployment manifest.

https://docs.pivotal.io/ops-manager/2-9/install/trouble-advanced.html#cck

Additional Step if VM exists in Bosh but not vCenter

If communication between Bosh and vCenter through the CPI is delayed, a VM can be removed from vCenter while its record remains in the Bosh director database. In this case:

1. Use "bosh vms" to get the VM CID for that instance
2. Verify that the VM BOSH expects no longer exists in vCenter by searching for its cloud ID (CID)
3. Run "bosh cck" and choose Delete VM reference for the missing VM only
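
The verification in step 2 can also be done against an exported list when several instances are affected. This is a rough sketch with the same assumptions as any text comparison: one CID per line from Bosh, one VM name per line from vCenter, and VM names that match the CIDs. The function name cids_missing_from_vcenter and the file names are hypothetical.

```shell
# cids_missing_from_vcenter BOSH_CIDS_FILE VCENTER_NAMES_FILE
# Prints CIDs that Bosh still tracks but that have no matching VM name in vCenter.
cids_missing_from_vcenter() {
  # Trim whitespace, drop blank lines, then keep only the lines that do not
  # match any vCenter name exactly. "|| true" keeps the exit status at 0 when
  # nothing is missing, because grep exits 1 when it prints no lines.
  sed 's/^[[:space:]]*//; s/[[:space:]]*$//' "$1" \
    | grep -v '^$' \
    | grep -Fxv -f "$2" || true
}
```

Only the CIDs it prints are candidates for the Delete VM reference choice in bosh cck. Confirm each one in vCenter first.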