In some vSphere Environments, Upgrade or Installation to 1.9 Fails with diego_cell Errors

Article ID: 297688

Products

VMware Tanzu Application Service for VMs

Issue/Introduction

Symptoms:

In some vSphere environments, upgrade/installation to Pivotal Cloud Foundry version 1.9 fails with the following error:

Director task 10718 
Started preparing deployment > Preparing deployment. Done (00:00:01)

Started preparing package compilation > Finding packages to compile. Done (00:00:00)

Started updating instance diego_cell 
Started updating instance diego_cell > diego_cell/8007640d-569b-46e4-bd3a-ece29bad8cc5 (0) (canary). Failed: 'diego_cell/0 (8007640d-569b-46e4-bd3a-ece29bad8cc5)' is not running after update. Review logs for failed jobs: rep (00:05:45)

Error 400007: 'diego_cell/0 (8007640d-569b-46e4-bd3a-ece29bad8cc5)' is not running after update. Review logs for failed jobs: rep

Task 10718 error

 

The key symptom is that the rep process fails to start. In the `monit summary` output below, taken from the diego_cell, `rep` is in an `unknown` state:

| diego_cell/0 (8007640d-569b-46e4-bd3a-ece29bad8cc5)*                    | failing | AZ1 | xlarge.disk | 10.2.15.20  |
|   consul_agent                                                          | running |     |             |             |
|   rep                                                                   | unknown |     |             |             |
|   garden                                                                | running |     |             |             |
|   metron_agent                                                          | running |     |             |             |

 

Restarting the rep process does not fix the issue.
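To spot a failing job such as `rep` quickly on a cell, the `monit summary` output can be filtered. The following is a minimal sketch, assuming the plain `Process 'name'  state` line layout that monit typically prints on BOSH-managed VMs (this layout is an assumption and is not shown in this article). `list_unhealthy_processes` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `monit summary` text on stdin and print every
# process whose state is anything other than "running".
# Assumed input line layout:  Process 'rep'        not monitored
list_unhealthy_processes() {
  awk '$1 == "Process" && $3 != "running" {
         name = $2
         gsub(/\047/, "", name)      # strip the quote characters around the name
         state = $3
         for (i = 4; i <= NF; i++) state = state " " $i   # states may span several words
         print name ": " state
       }'
}
```

On the affected cell this could be run as `monit summary | list_unhealthy_processes` (as root). Detailed logs for the job are normally found under `/var/vcap/sys/log/rep/`.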

 

Cause

The issue is caused by the following line in the `/var/vcap/jobs/rep/bin/rep_as_vcap` file:

azure_fd=$(curl -f  --connect-timeout 5 --silent http://169.254.169.254/metadata/v1/InstanceInfo/FD)

 

In some vSphere environments, the above curl command does not return within 30 seconds, so the `rep_as_vcap` script is still running when that limit is reached. Because monit is configured to terminate a process whose startup script has not exited within 30 seconds, monit terminates the `rep` process. This is why `rep` is shown as `unknown` in the `monit summary` output above.
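The timing described above can be checked by hand from the affected cell. The following is a minimal sketch, assuming GNU `timeout` is available on the VM. `run_with_deadline` is a hypothetical helper name and is not part of the rep job. It runs a command under a hard time limit and reports whether the limit was hit.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the rep job): run a command under a hard
# time limit, given in seconds, and report whether the limit was hit.
# Assumes GNU coreutils `timeout`, which exits with status 124 on expiry.
run_with_deadline() {
  local limit="$1"
  shift
  local rc=0
  timeout "$limit" "$@" || rc=$?
  if [ "$rc" -eq 124 ]; then
    echo "timed out after ${limit}s"
  else
    echo "finished with exit code ${rc}"
  fi
  return "$rc"
}
```

For example, `run_with_deadline 30 curl -f --connect-timeout 5 --silent http://169.254.169.254/metadata/v1/InstanceInfo/FD` reports a timeout if the metadata request is still pending after 30 seconds. Note that curl's `--connect-timeout` limits only the connection phase, whereas `--max-time` limits the whole transfer.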

Note: The reference to Azure in the above control script exists because the script queries instance metadata on the Azure IaaS. This IaaS dependency in the Diego control scripts is required to enable some features on Azure, but it has unintended consequences in some vSphere environments. See the Resolution section below for the fix.

 

Resolution

This issue is fixed in Elastic Runtime 1.9.18. Upgrade to Elastic Runtime 1.9.18 or later. The release notes describe the fix as: "Adds Azure Fault-Domain detection failure logic to rep."
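To check whether a given Elastic Runtime version already includes the fix, the version string can be compared against 1.9.18. The following is a minimal sketch, assuming plain dotted version strings and GNU `sort -V`. `version_at_least` is a hypothetical helper name, and how the installed version is obtained is left to the operator.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed when version $1 is equal to or newer than $2.
# Assumes plain dotted versions (for example 1.9.18) and GNU `sort -V`.
version_at_least() {
  local have="$1" want="$2"
  # With version sorting, the wanted version sorts first (or ties)
  # exactly when the installed version is not older than it.
  [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)" = "$want" ]
}
```

For example, `version_at_least "1.9.12" "1.9.18" || echo "upgrade required"` prints the message, because 1.9.12 is older than 1.9.18.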