Starting Container Count and LRP Auctions Spike During TPCF BBR Backup

Article ID: 392385

Updated On:

Products

VMware Tanzu Application Service

Issue/Introduction

While a BBR backup is running, you may notice a large spike in starting containers and LRP auctions. The spike can push the foundation over the maximum in-flight container start limit, in which case app starts are delayed and show the following message:

Error starting instances: 'waiting to start instance: reached in-flight start limit'

The BBR backup itself may also time out as a result, because it has to start some system apps, such as the autoscaler or usage service.

Cause

This can be caused by crashing apps. If a significant number of apps are crashing, many of them may try to start at the same time once the backup completes (specifically, when the Cloud Controllers are unlocked).

We have observed the following behavior:

  1. Some apps are continuously crashing and being restarted by Diego.
  2. The Cloud Controller is locked for the BBR backup.
  3. Apps crash and try to start again.
  4. Diego cannot download the droplet because the Cloud Controller is locked.
  5. The crashed apps keep crashing because the droplet is missing.
  6. The Cloud Controller is unlocked after the backup.
  7. A large number of apps now try to start at the same time.
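
The sequence above can be illustrated with a toy model. The sketch below is not Diego's scheduling logic: the function `simulate`, the tick-based timing, the crash interval, and the limit of 200 are all illustrative assumptions. It only shows why start attempts that cannot succeed while the Cloud Controller is locked all come due at the moment of unlock.

```python
def simulate(num_crashing_apps, crash_interval, lock_start, lock_end, horizon):
    """Return the number of apps attempting to start at each tick.

    Toy model (assumption, not Diego behavior): app i crashes every
    `crash_interval` ticks, offset by i so that crashes are spread out.
    Outside the lock window a crashed app restarts within the same tick.
    Inside the lock window the droplet cannot be downloaded, so the app
    stays pending until the lock is released.
    """
    starting_per_tick = []
    pending = set()  # apps that crashed during the lock and could not restart
    for tick in range(horizon):
        crashed_now = {
            app for app in range(num_crashing_apps)
            if (tick + app) % crash_interval == 0
        }
        locked = lock_start <= tick < lock_end
        if locked:
            pending |= crashed_now
            starting_per_tick.append(0)
        else:
            # Everything held back during the lock starts together.
            starting_per_tick.append(len(crashed_now | pending))
            pending.clear()
    return starting_per_tick


# Hypothetical numbers chosen only for illustration.
counts = simulate(num_crashing_apps=300, crash_interval=60,
                  lock_start=100, lock_end=400, horizon=500)
assumed_limit = 200
print("peak before lock:", max(counts[:100]))
print("starts at unlock:", counts[400])
print("over assumed limit:", counts[400] > assumed_limit)
```

With these made-up inputs, restarts are spread thinly before the lock, while every crashing app is queued by the time the lock is released, so the first tick after unlock carries the whole backlog.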

Resolution

There are two ways to resolve this issue:

  1. Recommended: Fix or stop the apps that are continuously crashing.
  2. Increase the maximum number of starting containers. Be aware that setting this value too high can overload the Diego Cells during a cold start.
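
For the recommended option, one possible way to spot crashed instances is to inspect the process stats returned by the Cloud Controller v3 API. The sketch below is a minimal example, not an official tool: the helper names `crashed_indexes` and `fetch_web_stats` are hypothetical, and it assumes a logged-in `cf` CLI session and an app GUID obtained with `cf app APP_NAME --guid`. The sample document is hand-written and trimmed to the fields the code reads.

```python
import json
import subprocess


def crashed_indexes(stats):
    """Return the instance indexes reported as CRASHED in a stats document."""
    return [
        res.get("index")
        for res in stats.get("resources", [])
        if res.get("state") == "CRASHED"
    ]


def fetch_web_stats(app_guid):
    """Fetch stats for the app's web process through `cf curl`.

    Assumes the cf CLI is installed and already targeted at the foundation.
    """
    result = subprocess.run(
        ["cf", "curl", f"/v3/apps/{app_guid}/processes/web/stats"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


# Hand-written sample, trimmed to the fields used above.
sample = {
    "resources": [
        {"type": "web", "index": 0, "state": "RUNNING"},
        {"type": "web", "index": 1, "state": "CRASHED"},
    ]
}
print(crashed_indexes(sample))
```

A single snapshot can miss an instance that happens to be between crashes, so it may be worth checking `cf events APP_NAME` for repeated crash events as well.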

 

The following metrics may be helpful for tracking the number of starting containers and crashing apps:

  • origin: rep
    statsd metric name: StartingContainerCount
  • origin: bbs
    statsd metric name: CrashedActualLRPs
  • origin: cc
    statsd metric name: tasks_running.count
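
As a rough illustration of how the StartingContainerCount metric might be checked against the in-flight start limit, the sketch below sums per-cell samples by timestamp and reports the timestamps where the total approaches the limit. The sample tuples, the cell names, the limit of 200, and the 80% warning fraction are assumptions made for the example; substitute the values configured on your foundation and the data exported from your own metrics system.

```python
from collections import defaultdict


def total_starting_by_time(samples):
    """Sum per-cell StartingContainerCount values for each timestamp.

    `samples` is an iterable of (timestamp, cell_id, value) tuples.
    """
    totals = defaultdict(int)
    for timestamp, _cell_id, value in samples:
        totals[timestamp] += value
    return dict(totals)


def timestamps_near_limit(samples, limit, fraction=0.8):
    """Return sorted timestamps whose summed value is at least fraction * limit."""
    totals = total_starting_by_time(samples)
    return sorted(ts for ts, total in totals.items() if total >= limit * fraction)


# Made-up samples: (seconds since start, cell id, StartingContainerCount).
samples = [
    (0, "cell-a", 2), (0, "cell-b", 3),
    (60, "cell-a", 90), (60, "cell-b", 95),
    (120, "cell-a", 1), (120, "cell-b", 0),
]
print(timestamps_near_limit(samples, limit=200))
```

Comparing these timestamps with the backup window and with CrashedActualLRPs can help confirm whether the spike lines up with the Cloud Controller being unlocked.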

Additional Information