How to avoid app downtime while upgrading from PAS v2.1.x to v2.2.7

Article ID: 297627

Products

VMware Tanzu Application Service for VMs

Issue/Introduction

Symptoms:

Operators encounter the following issues while upgrading from PAS v2.1 to v2.2:

  • As soon as the Diego database starts to upgrade, operators receive reports that applications are experiencing downtime.
  • App routes return 404 errors, and Apps Manager shows the application as crashed.
  • The cf logs contain no record of any instance crashing.
  • There is no Cloud Foundry (CF) event for any of the applications indicating that it crashed.

Pivotal Support has observed the following behaviors: 

  • The Diego Cell upgrade takes much longer than expected. In one case, it took 20 minutes to finish.
  • The BOSH task debug output shows that the slowness is caused by app scheduling.
  • For an app with six instances, Apps Manager shows four to five instances going down, and the routing table shows only one route being emitted for the six instances. The routes and instances come back after five to six seconds, but within that window a few more instances of the application go down.
  • This happens so often that changes on the Diego Cells are not being executed.
  • The Healthwatch smoke test, which performs a cf push, fails with the error message "insufficient resources".
  • Diego Cells that have not yet been upgraded are not accepting any applications.
  • As a result, applications crash and restart, because one or two cells are accepting all the applications while the others accept none.

Environment


Cause

PAS v2.2.7 includes an update to the MySQL PXC release that stops registering the MySQL proxy with Consul. During an upgrade from v2.1.x, services that expect to reach the MySQL proxy through Consul DNS cannot do so until they are updated and begin using BOSH DNS for service discovery.


This has been observed to cause API downtime and is believed to also be the cause of application downtime in later 2.2 patch releases.

Resolution

Upgrading directly from PAS v2.1.x to v2.2.7 may cause significant app downtime. If you are planning an upgrade from v2.1.x to v2.2.x, Pivotal recommends upgrading directly to PAS v2.2.12, which includes a fix that safely migrates components that use Consul to BOSH DNS.
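The upgrade-path recommendation above can be expressed as a simple version check, for example in a pre-upgrade script. The following is a minimal sketch, not a Pivotal tool. It assumes versions are plain "major.minor.patch" strings. The function names and the sample version "2.1.14" are hypothetical. It only encodes the v2.1.x to v2.2.x guidance from this article and makes no judgment about other upgrade paths.

```python
from typing import Optional, Tuple

# Assumption: versions look like "2.1.14" or "v2.2.7" (major.minor.patch).
Version = Tuple[int, int, int]

# Per this article, v2.2.12 is the 2.2 patch release that includes the fix
# migrating components from Consul to BOSH DNS.
FIRST_FIXED_2_2: Version = (2, 2, 12)


def parse_version(text: str) -> Version:
    """Parse a 'major.minor.patch' string (optional leading 'v') into ints."""
    major, minor, patch = (int(part) for part in text.strip().lstrip("v").split("."))
    return (major, minor, patch)


def is_recommended_upgrade(current: str, target: str) -> Optional[bool]:
    """Hypothetical helper: check a v2.1.x -> v2.2.x target against this article.

    Returns True if the target is v2.2.12 or later, False if it is an earlier
    2.2 patch, and None for any path this article does not cover.
    """
    cur, tgt = parse_version(current), parse_version(target)
    if cur[:2] == (2, 1) and tgt[:2] == (2, 2):
        # Tuples compare element by element, so (2, 2, 7) < (2, 2, 12).
        return tgt >= FIRST_FIXED_2_2
    return None


if __name__ == "__main__":
    for candidate in ["2.2.7", "2.2.12"]:
        print(candidate, is_recommended_upgrade("2.1.14", candidate))
    # prints "2.2.7 False" then "2.2.12 True"
```

Returning None rather than True for other paths keeps the check from implying a recommendation that this article does not make.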