Concourse for VMware Tanzu fails when running deployment errands that use the clock_global VMs

Products

Concourse for VMware Tanzu

Issue/Introduction

Operators can trigger an errand at any time after a deployment and receive the script's stdout, stderr, and exit code when it completes. This issue has been seen when an operator uses custom pipeline definitions to run errands in their deployments (VMware Tanzu Application Service for VMs, VMware Tanzu GemFire, and so on).

Errands are run using the deployment's clock_global VMs. However, when more than one clock_global VM is available in the deployment and the errand runs on all of them, a race condition can occur: the VMs try to upload bits to the Blobstore at the same time during the push-apps errand. This causes the errand to fail.

The error message depends on the errand that you run. This article will use the metric_registrar_smoke_test errand as an example. Below are some of the error messages reported on Concourse for VMware Tanzu:

FAILED
           Creating user provided service metric-registrar-smoke-test-structured in org system / space metric-registrar-monitor as metric_registrar_smoke_test...
           FAILED
           Server error, status code: 400, error code: 60002, message: The service instance name is taken: metric-registrar-smoke-test-structured

Stderr     + export PATH=/var/vcap/packages/cf-cli-6-linux/bin:/var/vcap/jobs/metric_registrar_smoke_test/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin:.
           + PATH=/var/vcap/packages/cf-cli-6-linux/bin:/var/vcap/jobs/metric_registrar_smoke_test/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin:.
           + echo '**************************** SETTING UP SESSION ****************************'
           + /var/vcap/jobs/metric_registrar_smoke_test/bin/setup_session.sh
           + cf -v
           + cf api https://XXXXXXXXXX
           + cf logout
           + cf auth metric_registrar_smoke_test XXXXXXXXXX --client-credentials
           + cf create-org system
           Org system already exists.

           + cf create-space metric-registrar-monitor -o system
           Space metric-registrar-monitor already exists

           + cf target -o system -s metric-registrar-monitor
           + echo '**************************** DEPLOYING MONITORING APP **********************'
           + /var/vcap/jobs/metric_registrar_smoke_test/bin/deploy.sh
           + pushd /var/vcap/packages/smoke_test/bin
           + delete-service metric-registrar-smoke-test-structured
           + cf service metric-registrar-smoke-test-structured
           Service instance metric-registrar-smoke-test-structured not found
           + cf create-user-provided-service metric-registrar-smoke-test-structured -l structured-format://DogStatsD
           Error: failed to run job-process: exit status 1 (exit status 1)


2 errand(s)

Errand 'metric_registrar_smoke_test' completed with error (exit code 1)

Exit code 1
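
A pipeline that captures the errand output could flag this particular failure with a check like the one below. This is a minimal sketch that applies only to the metric_registrar_smoke_test example above, since other errands report different messages. The function name has_name_taken_error is hypothetical, and the patterns are taken from the sample log rather than from any documented list of errors.

```shell
#!/bin/sh
# Hypothetical check: reads captured errand output on stdin and succeeds
# (exit 0) if it contains the "name is taken" failure shown in the log above.
# Assumption: the message text matches the sample in this article.
has_name_taken_error() {
  grep -q -E 'error code: 60002|name is taken'
}
```

The function reads from standard input, so saved errand output can be redirected or piped into it and the exit status used in an if statement.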

Environment

Product Version: 5.2

Resolution

Before applying the resolution, validate that the errand runs successfully on a single clock_global instance. Use one of the following options to confirm:

Option 1: Run the errand using BOSH

1. Log in to the Ops Manager VM with SSH: https://docs.pivotal.io/platform/2-7/customizing/trouble-advanced.html#ssh

2. Authenticate with BOSH: https://docs.pivotal.io/platform/2-7/customizing/trouble-advanced.html#log-in

3. List the VMs in the affected deployment and confirm that there is more than one clock_global present:

$ bosh -d <affected deployment name> vms

4. Run the errand using only one of the clock_global instances by running either of the following commands:

$ bosh -d <deployment name> run-errand <name of the errand> --instances clock_global/0 
or 
$ bosh -d <deployment name> run-errand <name of the errand> --instances clock_global/first
or
$ bosh -d <deployment name> run-errand <name of the errand> --instances <clock global instance name>
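
The check in step 3 can be sketched as a small shell helper. This is a minimal sketch, not part of the product: the function name count_clock_global is hypothetical, and it assumes that the table printed by the bosh vms command lists each instance at the start of a row as clock_global/ followed by an identifier.

```shell
#!/bin/sh
# Hypothetical helper: count clock_global instances in `bosh vms` output
# read from stdin.
# Assumption: each instance row starts with "clock_global/", optionally
# preceded by whitespace.
count_clock_global() {
  # grep -c prints the number of matching lines. It exits 1 when the count
  # is 0, so "|| true" keeps the function from failing in that case.
  grep -c -E '^[[:space:]]*clock_global/' || true
}
```

Piping the output of the bosh vms command from step 3 into count_clock_global prints the number of clock_global instances. A value greater than 1 means the errand should be pinned to a single instance as in step 4.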

Option 2: Run the errand from within one of the clock_global VMs

1. Log in to the Ops Manager VM with SSH: https://docs.pivotal.io/platform/2-7/customizing/trouble-advanced.html#ssh

2. Authenticate with BOSH: https://docs.pivotal.io/platform/2-7/customizing/trouble-advanced.html#log-in

3. List the VMs in the affected deployment and confirm that there is more than one clock_global present:

$ bosh -d <affected deployment name> vms

4. Pick one of the clock_global VMs listed and SSH into it: https://bosh.io/docs/cli-v2/#ssh-mgmt

$ bosh -d <affected deployment> ssh <clock global instance name>

5. Once inside the VM, switch to the root user (for example, with sudo -i).

6. Change to the directory containing the clock_global jobs:

# cd /var/vcap/jobs

7. From within the jobs directory, pick the errand you want to run and change to the bin directory inside that errand's directory:

# cd /var/vcap/jobs/<errand name>/bin

8. This directory contains a script called "run". Execute the script and confirm that the errand runs successfully:

# ./run
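
To keep a record of the run in step 8, the script's stdout, stderr, and exit code can be captured separately. The helper below is a minimal sketch: the function name run_and_report and the log file names errand.stdout and errand.stderr are hypothetical and chosen only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: run a command, store its stdout and stderr in
# separate files under a log directory, then print and return its exit code.
# Usage: run_and_report LOG_DIR COMMAND [ARGS...]
run_and_report() {
  log_dir="$1"
  shift
  "$@" >"$log_dir/errand.stdout" 2>"$log_dir/errand.stderr"
  code=$?
  echo "exit code: $code"
  return "$code"
}
```

From the errand's bin directory, this could be invoked as run_and_report /tmp ./run. An exit code of 0 indicates that the errand succeeded on this instance.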

If the errand succeeds with either option above, you have confirmed that it runs as expected on a single clock_global instance.

To resolve this issue, change the script that Concourse uses so that the errand is triggered on only one clock_global instance. Trigger the errand using one of the options below:

$ bosh -d <deployment name> run-errand <name of the errand> --instances clock_global/0 
or 
$ bosh -d <deployment name> run-errand <name of the errand> --instances clock_global/first
or
$ bosh -d <deployment name> run-errand <name of the errand> --instances <clock global instance name>
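
In a custom pipeline, the change could be wrapped in a small function in the task script so that every errand run is pinned to one instance. The following is a minimal sketch under stated assumptions: the function name run_errand_pinned and the DRY_RUN variable are hypothetical, and the default of clock_global/first is simply one of the three options listed above. Only the bosh command itself comes from this article.

```shell
#!/bin/sh
# Hypothetical wrapper for a Concourse task script: always pin the errand
# to a single clock_global instance.
# Usage: run_errand_pinned DEPLOYMENT ERRAND [INSTANCE]
# INSTANCE defaults to clock_global/first.
run_errand_pinned() {
  deployment="$1"
  errand="$2"
  instance="${3:-clock_global/first}"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Illustrative dry-run mode: print the command instead of running it.
    echo "bosh -d $deployment run-errand $errand --instances $instance"
  else
    bosh -d "$deployment" run-errand "$errand" --instances "$instance"
  fi
}
```

Setting DRY_RUN=1 before calling the function prints the command without contacting the BOSH Director, which makes it possible to review what the pipeline would run before enabling it.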

Note: If you are using Concourse for VMware Tanzu with Platform Automation, you will not experience this issue, because Platform Automation follows the Ops Manager workflow and runs the errand on only one instance of clock_global.