Script post-backup-unlock for job lock-unlock-pcf-autoscaling failed to connect to CAPI endpoint during backup

Article ID: 298146

Products

VMware Tanzu Application Service for VMs

Issue/Introduction

When using BBR to back up a TAS foundation, the backup can fail because the post-backup-unlock script for the lock-unlock-pcf-autoscaling job cannot connect to the CAPI endpoint. In some cases it fails with the following errors:
[bbr] 2023/06/20 15:18:57 DEBUG - Trying to execute 'sudo BBR_AFTER_BACKUP_SCRIPTS_SUCCESSFUL=true /var/vcap/jobs/lock-unlock-pcf-autoscaling/bin/bbr/post-backup-unlock' on remote
[bbr] 2023/06/20 15:18:57 DEBUG - Trying to execute 'sudo BBR_AFTER_BACKUP_SCRIPTS_SUCCESSFUL=true /var/vcap/jobs/bbr-usage-servicedb/bin/bbr/post-backup-unlock' on remote
[bbr] 2023/06/20 15:18:57 DEBUG - stdout: Setting API endpoint to https://api.system.testpcf.net...
[bbr] 2023/06/20 15:18:57 DEBUG - stdout: FAILED
[bbr] 2023/06/20 15:18:57 DEBUG - stderr: Error unmarshalling the following into a cloud controller error: 503 Service Unavailable: Requested route ('api.system.testpcf.net') has no available endpoints.
[bbr] 2023/06/20 15:18:57 ERROR - Error unlocking lock-unlock-pcf-autoscaling on backup_restore/32899568-e8a4-449c-9361-59a71cd4f2a2.
In other cases it fails with a different error that is still related to connecting to the CAPI endpoint:
[bbr] 2023/06/19 15:18:58 DEBUG - Trying to execute 'sudo BBR_AFTER_BACKUP_SCRIPTS_SUCCESSFUL=true /var/vcap/jobs/bbr-usage-servicedb/bin/bbr/post-backup-unlock' on remote
[bbr] 2023/06/19 15:18:58 DEBUG - stdout: Setting API endpoint to https://api.system.testpcf.net...
[bbr] 2023/06/19 15:18:58 DEBUG - stdout: FAILED
[bbr] 2023/06/19 15:18:58 DEBUG - stderr: Unexpected Response
Response Code: 502
Code: 0, Title: , Detail: 502 Bad Gateway: Registered endpoint failed to handle the request.
[bbr] 2023/06/19 15:18:58 ERROR - Error unlocking lock-unlock-pcf-autoscaling on backup_restore/32899568-e8a4-449c-9361-59a71cd4f2a2.
A review of the BBR logs shows that the cloud_controller jobs were unlocked at almost the same time as the lock-unlock-pcf-autoscaling script was executed:
[bbr] 2023/06/19 15:18:58 INFO - Finished unlocking cloud_controller_ng on cloud_controller/05c3321f-6a97-46c9-8a42-b3f8b98c9fc9.
[bbr] 2023/06/19 15:18:58 INFO - Finished unlocking cloud_controller_ng on cloud_controller/4bfad0f4-6adc-4d07-9ea8-c8e91f96aa43.
[bbr] 2023/06/19 15:18:58 INFO - Finished unlocking cloud_controller_ng on cloud_controller/e604480b-4ea9-4db5-9f14-7c525c1c016f.
[bbr] 2023/06/19 15:18:58 INFO - Finished unlocking cloud_controller_ng on cloud_controller/2f6ce3bc-07bf-4a6d-840f-2b5481bf55a9.
[bbr] 2023/06/19 15:18:58 INFO - Finished unlocking cloud_controller_ng on cloud_controller/0ebeede5-3d3c-4bbd-82cb-a35924eb8de9.
[bbr] 2023/06/19 15:18:58 INFO - Finished unlocking cloud_controller_ng on cloud_controller/50e50f4b-7621-4428-afeb-556260e463da.
[bbr] 2023/06/19 15:18:58 INFO - Finished unlocking routing-api on cloud_controller/2f6ce3bc-07bf-4a6d-840f-2b5481bf55a9.
[bbr] 2023/06/19 15:18:58 INFO - Finished unlocking routing-api on cloud_controller/50e50f4b-7621-4428-afeb-556260e463da.
[bbr] 2023/06/19 15:18:58 INFO - Finished unlocking routing-api on cloud_controller/05c3321f-6a97-46c9-8a42-b3f8b98c9fc9.
[bbr] 2023/06/19 15:18:58 INFO - Finished unlocking routing-api on cloud_controller/0ebeede5-3d3c-4bbd-82cb-a35924eb8de9.
[bbr] 2023/06/19 15:18:58 INFO - Finished unlocking routing-api on cloud_controller/4bfad0f4-6adc-4d07-9ea8-c8e91f96aa43.
[bbr] 2023/06/19 15:18:58 INFO - Finished unlocking routing-api on cloud_controller/e604480b-4ea9-4db5-9f14-7c525c1c016f.
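
To check for the same timing overlap in another environment, the relevant events can be pulled out of the BBR debug output. The following is a minimal sketch, assuming the debug output was saved to a file; the function name bbr_unlock_timeline and the file name in the usage line are hypothetical, and the patterns are taken from the log lines quoted in this article.

```shell
#!/bin/bash

# Hypothetical helper: print, in file order, the log lines that show
#   - cloud_controller_ng being unlocked,
#   - the autoscaling post-backup-unlock script being started,
#   - the autoscaling unlock failing.
# Comparing their timestamps shows whether the script ran right after the unlock.
bbr_unlock_timeline() {
  local log_file="$1"
  grep -E \
    -e 'Finished unlocking cloud_controller_ng' \
    -e 'Trying to execute .*lock-unlock-pcf-autoscaling/bin/bbr/post-backup-unlock' \
    -e 'Error unlocking lock-unlock-pcf-autoscaling' \
    "$log_file"
}
```

Usage, with an assumed file name: bbr_unlock_timeline bbr-debug.log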

Because the lock-unlock-pcf-autoscaling script starts the autoscaling apps, it first attempts to authenticate with the CAPI endpoint:
backup_restore/ba402cc8-83ed-4940-b0e3-16aa9e9162a3:/var/vcap/jobs/lock-unlock-pcf-autoscaling/bin/bbr# cat post-backup-unlock 

#!/bin/bash 
set -e 

function authenticate_and_target() { 
  cf api $API_ENDPOINT  
  cf auth $ADMIN_USER $ADMIN_PASSWORD 

  if ! cf target -o $ORG -s $SPACE ; then 
    echo "Autoscaler org/space not found; exiting" 
    exit 0 
  fi 
} 

......

cf start ${APP_NAME} 
cf start ${APP_NAME}-api 

However, it can take a short while for the CAPI endpoint to be registered with the gorouters. The lock-unlock-pcf-autoscaling script can therefore fail to connect to the CAPI endpoint if it runs immediately after the cloud_controller jobs are unlocked.

Environment

Product Version: 3.0

Resolution

Since this is expected behaviour, the product team decided to change the lock-unlock-pcf-autoscaling script so that it waits for a short while before attempting to connect to the CAPI endpoint. For example:

function authenticate_and_target() { 

  sleep 120 # to wait for cloud_controller coming up 

  cf api $API_ENDPOINT  
  cf auth $ADMIN_USER $ADMIN_PASSWORD 

  if ! cf target -o $ORG -s $SPACE ; then 
    echo "Autoscaler org/space not found; exiting" 
    exit 0 
  fi 
} 
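
A fixed sleep always costs the full wait, even when the CAPI endpoint is already reachable. As an alternative, the connection attempt could be polled with a bounded retry. This is a sketch only and is not the change that went into the cf-autoscaling release; the retry helper and its attempt and delay values are assumptions chosen to cover roughly the same two minutes.

```shell
#!/bin/bash

# Hypothetical helper: run a command until it succeeds or the attempts run out.
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() {
  local max_attempts="$1" delay_seconds="$2"
  shift 2
  local attempt=1
  # A command used as an 'until' condition does not trigger 'set -e' on failure
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "Command failed after ${attempt} attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay_seconds"
  done
}

function authenticate_and_target() {
  # Keep trying 'cf api' (12 attempts, 10 seconds apart) until the gorouters
  # have a route to the CAPI endpoint, instead of sleeping unconditionally
  retry 12 10 cf api "$API_ENDPOINT"
  cf auth "$ADMIN_USER" "$ADMIN_PASSWORD"

  if ! cf target -o "$ORG" -s "$SPACE" ; then
    echo "Autoscaler org/space not found; exiting"
    exit 0
  fi
}
```

With this approach the script proceeds as soon as the first cf api call succeeds, and still fails after the last attempt if the endpoint never becomes available.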

The change has been checked into the cf-autoscaling v249.2.2 release and is waiting to be patched into upcoming TAS releases.
As a temporary workaround, the same change can be added manually to the /var/vcap/jobs/lock-unlock-pcf-autoscaling/bin/bbr/post-backup-unlock script on the backup_restore instance of the TAS deployment. Note that the manual change is reverted if the backup_restore instance is updated or recreated.
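
Because the manual change is lost whenever the instance is recreated, it may need to be reapplied more than once. The sketch below shows one way to add the sleep line so that repeated runs do not insert it twice. The function name add_capi_wait and the .orig backup suffix are hypothetical, and it assumes the script still contains the function authenticate_and_target() line shown above.

```shell
#!/bin/bash

# Hypothetical helper: insert the wait as the first line of
# authenticate_and_target() in the given script, only if it is not there yet.
add_capi_wait() {
  local script="$1"
  local wait_line='  sleep 120 # to wait for cloud_controller coming up'

  # Already patched: leave the file untouched
  if grep -qF "$wait_line" "$script"; then
    return 0
  fi

  # Keep a copy of the unmodified script next to it
  cp -p "$script" "${script}.orig"

  # Re-emit every line, adding the wait right after the function header;
  # redirecting into the existing file keeps its permissions
  awk -v line="$wait_line" '
    { print }
    /^function authenticate_and_target\(\)/ { print line }
  ' "${script}.orig" > "$script"
}
```

Usage, as root on the backup_restore instance: add_capi_wait /var/vcap/jobs/lock-unlock-pcf-autoscaling/bin/bbr/post-backup-unlock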