BOSH deployment convergence is an important topic to understand. This KB aims to clear up what it is and when it occurs.
At a high level, a deployment convergence is BOSH's attempt to bring a deployment back to its intended state.
The official BOSH docs covering convergence can be found here.
This KB will walk you through a demonstration of a failed BOSH deploy scenario and review when a convergence would or would not be triggered.
This article uses a generic naming convention, with all uppercase characters and dashes, to represent specific items. This helps because deployment convergence applies to every BOSH deployment that has VMs. For example, a Cloud Foundry deployment, a MySQL service instance deployment, or a ZooKeeper deployment are all susceptible to a deployment convergence. However, this KB will not refer to them by those deployment names; instead, it will refer to a deployment as DEPLOYMENT-A.
This naming convention will be leveraged throughout and will apply to other components as well.
For example: VM-A, PROPERTY-A, VALUE-A.
To further help with clarification, refer to the following example for how this will translate.
Maybe the update is a syslog port setting, maybe the update is a stemcell upgrade, or maybe the update is the rotation of a certificate. In any case, abstracting away the specifics helps focus on the important concepts that are globally applicable to the concept of deployment convergence.
Options:
1. If you run the following:
bosh -d DEPLOYMENT-A recreate VM-C
Then the deployment will converge. All the changes that were applied to VM-A, VM-B, and VM-C will be reverted because the intended state wasn't updated due to the failed deploy of the updated DEPLOYMENT-A-manifest.yml file.
2. As of BOSH v270.4.0 and BOSH CLI v6.0.0, the start, stop, restart, and recreate commands all support a --no-converge flag. If you run the following:
bosh -d DEPLOYMENT-A recreate VM-C --no-converge
Then the deployment will not converge. All the changes that were applied to VM-A, VM-B, and VM-C will persist and VM-C will be recreated with its actual state (PROPERTY-A with a value of VALUE-B).
3. If you delete VM-C from the IaaS and run the following:
bosh -d DEPLOYMENT-A cck
Then the deployment will not converge. All the changes that were applied to VM-A, VM-B, and VM-C will persist and VM-C will be detected as missing and can be recreated with its actual state (PROPERTY-A with a value of VALUE-B).
4. If you delete VM-C from the IaaS and let the BOSH resurrector detect and fix it, then the deployment will not converge. All the changes that were applied to VM-A, VM-B, and VM-C will persist and VM-C will be detected as missing and will be recreated with its actual state (PROPERTY-A with a value of VALUE-B).
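The four options above can be condensed into a small pre-flight check. The following is only a sketch of the rules as this article describes them, not something BOSH itself provides: the function name will_converge is hypothetical, and it simply inspects the subcommand and flags you are about to pass to the bosh CLI.

```shell
# will_converge SUBCOMMAND [ARGS...]
# Prints "yes" if, per this article, the command triggers a deployment
# convergence, "no" if it does not, and "unknown" for anything else.
# (Hypothetical helper; it encodes only the cases covered in this KB.)
will_converge() {
  subcommand="$1"
  shift
  case "$subcommand" in
    start|stop|restart|recreate)
      # These update commands converge unless --no-converge is given.
      for arg in "$@"; do
        if [ "$arg" = "--no-converge" ]; then
          echo "no"
          return 0
        fi
      done
      echo "yes"
      ;;
    cck)
      # cck repairs an instance with its actual state.
      echo "no"
      ;;
    *)
      echo "unknown"
      ;;
  esac
}

will_converge recreate VM-C
will_converge recreate VM-C --no-converge
will_converge cck
```

Running a check like this before touching a deployment whose last deploy failed is one way to avoid an accidental convergence.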
It is paramount to note that at this point, even if you are able to get VM-C to recreate successfully, the deployment's intended state still has not changed. DEPLOYMENT-A is still susceptible to a deployment convergence. In order to update the intended state of the deployment, a successful deploy needs to occur. In this example, you should re-run the following:
bosh -d DEPLOYMENT-A deploy DEPLOYMENT-A-manifest.yml
Only after that command is successful will the intended state update in the BOSH database. Then you will no longer be at risk of a deployment convergence for DEPLOYMENT-A.
To reiterate why this is an important concept to understand, let's relate what you just learned to a large Cloud Foundry deployment. There can be hundreds of machines in this type of deployment.
1. A certificate rotation is needed, so the manifest is updated with the new certs and deployed against the Cloud Foundry deployment.
2. 100 machines were updated and a failure occurs on the 101st machine due to a firewall issue.
3. In an effort to mitigate, you open the port in the IaaS that is necessary for the 101st machine's update to be successful.
4. You run bosh -d CF recreate VM-101.
5. The deployment will converge back to the intended state and undo all of the changes for the 100 machines that already updated.
This can be very time consuming and undesirable. The correct thing to do in that situation would be to first get the 101st machine into a running state without converging the deployment and then resume deploying the manifest. For example, if you added the --no-converge flag to the recreate command in step 4, then you would have acted only on that target instance, recreating it with its actual state. Once the 101st machine is recreated successfully, you can proceed to redeploy the manifest. BOSH will detect the actual state of the 100 VMs, recognize that it doesn't need to update them again, and continue the deploy from where the initial deploy failed.
Note: The last sentence is only true if the --recreate flag is not present in the deploy command. If the --recreate flag is present, then BOSH will recreate the 100 machines even though they are already updated. If you need to avoid recreating the first 100 machines in that case, run bosh ignore against them before the deploy and bosh unignore against them after the deploy succeeds.
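The recovery sequence described above could be scripted roughly as follows. This is a sketch under stated assumptions rather than an official procedure: the manifest name CF-manifest.yml, the run wrapper, and both function names are hypothetical, and the instance names reuse this article's placeholders. By default the script only prints the bosh commands it would run; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Stop at the first failing command so the deploy never runs after a
# failed recreate, and unignore never runs after a failed deploy.
set -e

# Print commands instead of running them unless DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# recover_failed_deploy DEPLOYMENT FAILED_INSTANCE MANIFEST
recover_failed_deploy() {
  deployment="$1"
  instance="$2"
  manifest="$3"
  # Repair only the failed instance, keeping its actual state.
  run bosh -d "$deployment" recreate "$instance" --no-converge
  # Resume the deploy so the intended state is finally updated.
  run bosh -d "$deployment" deploy "$manifest"
}

# deploy_recreate_except DEPLOYMENT MANIFEST INSTANCE...
# Variant for a deploy that uses --recreate: the listed instances are
# ignored for the deploy and unignored once it has succeeded.
deploy_recreate_except() {
  deployment="$1"
  manifest="$2"
  shift 2
  for instance in "$@"; do
    run bosh -d "$deployment" ignore "$instance"
  done
  run bosh -d "$deployment" deploy "$manifest" --recreate
  for instance in "$@"; do
    run bosh -d "$deployment" unignore "$instance"
  done
}

recover_failed_deploy CF VM-101 CF-manifest.yml
```

Keeping the dry-run default makes it easy to review the exact commands before anything touches a production deployment.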
Next, you should define what an intended state is and compare it to what actual state is.
The intended state for a deployment is what is defined in the manifest that was last successfully deployed for that deployment. When a BOSH deploy occurs, it is supplied a manifest file. Upon successfully deploying the manifest file, BOSH will write the manifest to its internal database and this becomes the intended state for that deployment.
The actual state of a deployment is the intended state plus any changes that were applied to the deployment by a BOSH deploy, including a deploy that did not finish successfully.
To illustrate this, this article will consider the following example:
DEPLOYMENT-A has 1 instance group with 3 instances desired. Each instance will be a VM so this article will refer to them as VM-A, VM-B, and VM-C.
DEPLOYMENT-A was successfully deployed.
To get the intended state of DEPLOYMENT-A, you run the command:
bosh -d DEPLOYMENT-A manifest > DEPLOYMENT-A-manifest.yml
Let's now consider that you need to update a property in this deployment. To update the property, you make a change to the manifest and deploy it against the deployment.
You edit DEPLOYMENT-A-manifest.yml and change PROPERTY-A from VALUE-A to VALUE-B.
You then run:
bosh -d DEPLOYMENT-A deploy DEPLOYMENT-A-manifest.yml
During this deploy, VM-A updates, VM-B updates, VM-C attempts to update but encounters an error and does not successfully update. Because of this error, the deployment of the manifest file is not considered successful and the deployment's intended state does not change (remember that BOSH only updates its internal database upon successful deploys).
The actual state for the deployment however did change. This is key to understanding deployment convergence.
In this example, the following is true at this point:
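The intended state of DEPLOYMENT-A:
VM-A: PROPERTY-A = VALUE-A
VM-B: PROPERTY-A = VALUE-A
VM-C: PROPERTY-A = VALUE-A
The actual state of DEPLOYMENT-A:
VM-A: PROPERTY-A = VALUE-B
VM-B: PROPERTY-A = VALUE-B
VM-C: PROPERTY-A = VALUE-B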
Recall that VM-C had an error during the update. It is important to understand that BOSH did attempt to update it. Therefore VM-C will have the updated VALUE-B for PROPERTY-A in its actual state.
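One way to notice this mismatch is to compare the manifest stored by the director with the manifest file you tried to deploy. The helper below is a minimal sketch: the function name check_intended_state is hypothetical, and it assumes, as in this example, that the local file was produced from bosh manifest output and then edited, so a byte-for-byte comparison is meaningful. Manifests that were formatted differently or built with ops files would need a smarter comparison.

```shell
# check_intended_state DEPLOYMENT LOCAL_MANIFEST
# Fetches the director's stored manifest (the intended state) and
# reports whether it is identical to a local manifest file.
# Returns 0 if they match, 1 if they differ, 2 if the fetch failed.
check_intended_state() {
  deployment="$1"
  local_manifest="$2"
  intended="$(mktemp)"

  # bosh manifest prints the manifest the director has on record.
  if ! bosh -d "$deployment" manifest > "$intended"; then
    rm -f "$intended"
    echo "could not fetch manifest for $deployment" >&2
    return 2
  fi

  if cmp -s "$intended" "$local_manifest"; then
    echo "intended state matches $local_manifest"
    status=0
  else
    echo "intended state differs from $local_manifest"
    status=1
  fi
  rm -f "$intended"
  return "$status"
}
```

In this example, calling check_intended_state DEPLOYMENT-A DEPLOYMENT-A-manifest.yml after the failed deploy would report a difference, which signals that the deployment is still at risk of a convergence.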
Now you can start to understand deployment convergence.
Any time a BOSH update command is issued to a deployment, BOSH will try to converge the deployment to its intended state. BOSH update commands include the following:
bosh <start|stop|restart|recreate>
Let's continue this example further to explore different scenarios where a deployment convergence will take place.