In some vSphere environments, an upgrade to or installation of Pivotal Cloud Foundry version 1.9 fails with the following error:
Director task 10718
Started preparing deployment > Preparing deployment. Done (00:00:01)
Started preparing package compilation > Finding packages to compile. Done (00:00:00)
Started updating instance diego_cell
Started updating instance diego_cell > diego_cell/8007640d-569b-46e4-bd3a-ece29bad8cc5 (0) (canary). Failed: 'diego_cell/0 (8007640d-569b-46e4-bd3a-ece29bad8cc5)' is not running after update. Review logs for failed jobs: rep (00:05:45)
Error 400007: 'diego_cell/0 (8007640d-569b-46e4-bd3a-ece29bad8cc5)' is not running after update. Review logs for failed jobs: rep
Task 10718 error
The key symptom is that the `rep` process is failing. The `monit summary` output below, taken from the diego_cell, shows `rep` in an `unknown` state:
| diego_cell/0 (8007640d-569b-46e4-bd3a-ece29bad8cc5)* | failing | AZ1 | xlarge.disk | 10.2.15.20 |
|   consul_agent | running | | | |
|   rep | unknown | | | |
|   garden | running | | | |
|   metron_agent | running | | | |
Restarting the `rep` process does not fix the issue.
The issue is caused by the following line in the `/var/vcap/jobs/rep/bin/rep_as_vcap` file:
azure_fd=$(curl -f --connect-timeout 5 --silent http://169.254.169.254/metadata/v1/InstanceInfo/FD)
In some vSphere environments, the above curl command does not time out within 30 seconds, so the `rep_as_vcap` script is still running after 30 seconds. monit is configured to terminate a process if its associated startup script has not exited within 30 seconds, so in this case monit terminates the `rep` process. This is why `rep` shows a status of `unknown` in the `monit summary` output above.
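To check whether a given cell is affected, the same metadata request can be timed by hand on the diego_cell VM. The following is a minimal diagnostic sketch, not the fix shipped in the product. The function name `probe_metadata` and the `--max-time` cap are assumptions added for illustration. Note that curl's `--connect-timeout` only limits the connection phase, while `--max-time` limits the whole request.

```shell
#!/usr/bin/env bash
# Hypothetical diagnostic helper (not part of rep_as_vcap).
# Times the metadata request and reports curl's exit code.
# Arguments:
#   $1  URL to query (defaults to the URL used in rep_as_vcap)
#   $2  overall time limit in seconds passed to --max-time (assumed default: 10)
probe_metadata() {
  local url="${1:-http://169.254.169.254/metadata/v1/InstanceInfo/FD}"
  local max_time="${2:-10}"
  local start end rc=0

  start=$(date +%s)
  # Same flags as the original line, plus --max-time to bound the total duration.
  curl -f --connect-timeout 5 --max-time "$max_time" --silent "$url" >/dev/null || rc=$?
  end=$(date +%s)

  echo "curl exit code: $rc, elapsed: $((end - start))s"
}
```

Running `probe_metadata` with no arguments queries the URL from the script above. If the reported elapsed time comes close to the `--max-time` value rather than returning within about 5 seconds, the unbounded version of the request in `rep_as_vcap` is likely to run past monit's 30-second limit.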
Note: The reference to Azure in the control script above queries instance metadata on the Azure IaaS. This IaaS-specific call in the Diego control scripts is required to enable some features on Azure, but it has unintended consequences in some vSphere environments. See below for the permanent fix.
This issue is fixed in Elastic Runtime version 1.9.18. Upgrade to Elastic Runtime 1.9.18 or later. The release notes describe the fix as: "Adds Azure Fault-Domain detection failure logic to rep."