When upgrading Elastic Runtime, Diego cells fail to roll (update one batch at a time) properly:
Error Message:
Failed updating instance diego_cell > diego_cell/80bf75c3-caf9-4087-b0dd-f3b5571324fd (27): Timed out sending 'get_task' to 0f410cca-a50a-42d9-aa4a-a2599ad329e9 after 45 seconds (00:13:33)
Failed updating instance diego_cell > diego_cell/8c595349-5de1-4c29-8c7f-769c2c8aed65 (63): Action Failed get_task: Task b697cd2d-64c7-41d3-498a-e78584ba38b4 result: Stopping Monitored Services: Stopping services '[consul_agent]' errored (00:11:04)
Failed updating instance diego_cell > diego_cell/931389ef-3591-46ea-aad9-0a3c2139a6ea (23): Action Failed get_task: Task 533905df-5a62-48b4-586a-da092595ee34 result: Updating certificates with retries (00:13:30)
"Updating certificates with retries" errors can result from poor disk performance. Restarting so many instances simultaneously puts too much load on the IaaS.
Reduce the max-in-flight value in Operations (Ops) Manager to prevent too many Diego cells from being updated at the same time.
The exact API call for changing max-in-flight is documented in the Ops Manager API docs here:
https://[FQDN Ops Manager]/docs#configuring-the-max_in_flight-settings-for-a-product-39-s-jobs
Follow the steps below:
Authenticate & Get Token: https://[FQDN Ops Manager]/docs#authentication
1. Target the UAA endpoint of your Ops Manager:
uaac target https://[FQDN Ops Manager]/uaa
2. Log in to your Ops Manager with the Client name “opsman” as Ops Manager admin:
uaac token owner get
Client name: opsman
Client secret:
User name: YOUR_USERNAME_HERE
Password: YOUR_PASSWORD_HERE
3. Retrieve your Ops Manager access token from the "access_token:" field of the output:
uaac context
Result:
[5]*[https://[FQDN Ops Manager]/uaa]
  skip_ssl_validation: true
  [0]*[admin]
    user_id: 61258ded-24df-4724-b3a2-c88768437864
    client_id: opsman
    access_token: eyJhbGciOiJRUzI1NiIsImprdSI6I....
4. Get the PAS product guid from the "guid" field, starting with "cf-...":
curl "https://[FQDN Ops Manager]/api/v0/deployed/products" -k -X GET -H "Authorization: Bearer [Ops Manager access token retrieved at Step 3.]"
Result:
{
  "installation_name": "cf-6595dd22a5007e3f6f93",
  "guid": "cf-6595dd22a5007e3f6f93",
  "type": "cf",
  "product_version": "1.10.8-build.7"
}
5. Get the current Max in Flight values and note the Diego Cell guid, starting with "diego_cell-...":
curl "https://[FQDN Ops Manager]/api/v0/staged/products/[PAS products guid retrieved at Step 4.]/max_in_flight" -k -X GET -H "Authorization: Bearer [Ops Manager access token retrieved at Step 3.]"
Result:
{
  "max_in_flight": {
    ...
    "diego_cell-81b4916ae28d873c1988": 10,
    ...
  }
}
6. Set a new Max in Flight value for the Diego Cell. In the following example, the new value is 4:
curl "https://[FQDN Ops Manager]/api/v0/staged/products/[PAS products guid retrieved at Step 4.]/max_in_flight" -k -X PUT -H "Authorization: Bearer [Ops Manager access token retrieved at Step 3.]" -H "Content-Type: application/json" -d '{"max_in_flight": {"[Diego Cell guid retrieved at Step 5.]": 4 } }'
7. Confirm that the Max in Flight value is set as expected by repeating the command from Step 5.
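
Steps 3, 5 and 6 above can be partly scripted. The following is a minimal sketch, not an official tool. The function names (extract_access_token, find_diego_cell_guid, build_max_in_flight_payload, set_diego_max_in_flight) are hypothetical, and the grep-based matching is a simplifying assumption that relies on the job guid appearing as a quoted key starting with "diego_cell-"; a real JSON parser would be more robust. It uses only the endpoint and curl options already shown in Steps 5 and 6.

```shell
# Step 3: read `uaac context` output on stdin and print the access token.
extract_access_token() {
  awk '$1 == "access_token:" { print $2; exit }'
}

# Step 5: read the max_in_flight JSON on stdin and print the first key
# that starts with "diego_cell-" (assumption: plain text matching is enough).
find_diego_cell_guid() {
  grep -o '"diego_cell-[^"]*"' | head -n 1 | tr -d '"'
}

# Step 6: build the PUT body. $1 = Diego Cell guid, $2 = new max-in-flight value.
build_max_in_flight_payload() {
  printf '{"max_in_flight": {"%s": %s}}' "$1" "$2"
}

# Steps 5 and 6 combined. The body runs in a subshell, so its variables
# stay local and `exit` only leaves the function.
# $1 = Ops Manager FQDN, $2 = product guid (Step 4), $3 = token (Step 3), $4 = new value
set_diego_max_in_flight() (
  url="https://$1/api/v0/staged/products/$2/max_in_flight"
  job_guid=$(curl -sk "$url" -X GET -H "Authorization: Bearer $3" | find_diego_cell_guid)
  if [ -z "$job_guid" ]; then
    echo "No diego_cell job found in max_in_flight response" >&2
    exit 1
  fi
  curl -sk "$url" -X PUT \
    -H "Authorization: Bearer $3" \
    -H "Content-Type: application/json" \
    -d "$(build_max_in_flight_payload "$job_guid" "$4")"
)
```

Assuming the uaac login from Steps 1 and 2 has already been done, and with the host name below as a placeholder, usage might look like:

TOKEN=$(uaac context | extract_access_token)
set_diego_max_in_flight opsman.example.com "[PAS products guid retrieved at Step 4.]" "$TOKEN" 4

Afterwards, repeat the command at Step 5 to verify the stored value.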