While taking a BBR backup, you may notice a large spike in starting containers and LRP auctions. This could put you over the maximum in-flight container start limit and apps may get delayed and show the following message when starting up:
Error starting instances: 'waiting to start instance: reached in-flight start limit'
The BBR backup may also time out due to this as it tries to start some system apps such as the autoscaler or usage service
This can be caused by crashing apps. If you have a significant number of crashing apps, you may get a large number of apps trying to start at the same time after the backup completes (specifically when the cloud controllers get unlocked).
We have observed the below behavior:
There are a couple solutions to this issue:
Below are some metrics that may be helpful for tracking the number of starting containers and crashing apps