App usage data and events data get corrupted after upgrading to or installing Pivotal Cloud Foundry 1.7.x

Products

VMware Tanzu Application Service for VMs

Issue/Introduction

Symptoms:

Users could experience any of the following issues:

Rapidly inflating usage data on the Accounting Report and Usage Report in Apps Manager.
Crashing app-usage-server App: The app that handles this data, found in the System Org and System Space, will repeatedly crash.
The Push Apps Manager errand will fail during Apply Changes or when run with the BOSH run errand command.
cf logs app-usage-server --recent will display recent errors for the app-usage-server errand.
cf apps will display app-usage-server-venerable App in the started state.

Environment

Cause

This problem is caused by multiple instances of the App usage server suite running in a Pivotal Cloud Foundry (PCF) deployment. Multiple workers cause data replication, and in turn, calculations are performed on replicated data. This results in data corruption of the usage data.

This issue has been observed in the following scenarios:

1. You upgraded from PCF v1.6.x to a version earlier than PCF v1.7.16. In this scenario, a bug in the App usage service deployment moved the deployment to a new space but failed to clean up the Apps that were running in the old space. This results in multiple instances of the "app-usage" Apps running in both spaces.

2. You installed a new PCF foundation on a version earlier than PCF v1.7.15. In this scenario, there is a bug in the App usage service deployment that left “venerable” Apps from a blue-green deployment in a running state. This results in multiple instances of the App running.

3. You upgraded to any version of PCF later than PCF v1.7.16 from an affected version without addressing the bug first. In this scenario, you may see issues with app usage data integrity. This is the result of one of the above scenarios not being addressed in prior versions of PCF.

Resolution

The resolution to apply depends on the following factors:

1. How long has the foundation been on the affected version?

The longer the foundation has been on any of the affected versions of PCF, the more the integrity of the data is affected, so the best action to take depends on this factor. Refer to the Time Table below.

2. Production dependency of App Usage Data: Can it be deleted? Refer to "What Gets Deleted" at the end of this document.

Time Table

Pivotal Application Service Current Version	Date installed	Action	Result
v1.6.x	N/A	Upgrade to Pivotal Application Service 1.7.18+	Installations that upgrade directly to Pivotal Application Service 1.7.18+ will not experience the issue. Proceed to upgrade directly to Pivotal Application Service 1.7.18+
v1.7.0 - >1.7.15	<30 days ago	Upgrade to Pivotal Application Service 1.7.18+ and email us	The installation likely has data quality issues that can be resolved
	<60 days ago	Upgrade to Pivotal Application Service 1.7.18+ and email us	The installation likely has data quality issues that can be partially resolved, and in some cases fully resolved
	60+ days ago	Upgrade to Pivotal Application Service 1.7.18+ and email us	The installation likely has data quality issues that may be difficult to resolve without data loss
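
As a quick reference, the day thresholds in the Time Table can be expressed as a small shell function. This is only a sketch of the table above: recovery_outlook is a hypothetical helper name and is not part of any PCF tooling, and the action for every affected row remains to upgrade to 1.7.18+ and contact support.

```shell
# Map the number of days a foundation has been on an affected 1.7.x
# version to the outlook given in the Time Table.
# recovery_outlook is a hypothetical name used for illustration only.
recovery_outlook() {
  case "$1" in
    ''|*[!0-9]*)
      echo "usage: recovery_outlook <days-on-affected-version>" >&2
      return 2
      ;;
  esac
  if [ "$1" -lt 30 ]; then
    echo "data quality issues can likely be resolved"
  elif [ "$1" -lt 60 ]; then
    echo "data quality issues can be partially resolved, in some cases fully"
  else
    echo "data quality issues may be difficult to resolve without data loss"
  fi
}
```

For example, recovery_outlook 45 reports the partially-resolvable outlook from the second row.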

Repairing the Integrity of the Foundation

For PCF v1.7.x only, remove the app-usage-service Space and Apps.

If you have already upgraded to PCF v1.8.0 or above, skip this section and go to the "All Versions" section below.

Note: The following steps remove the app-usage-service Space and Apps. This is a temporary fix that can be reverted when a new deployment occurs. Upgrading to PCF 1.7.18 or above immediately following this procedure is strongly encouraged.

1. Log in to the CF API as admin and target the system org:

cf login -a https://<YOUR-APP-DOMAIN-API-ENDPOINT>
cf target -o system

2. List out all the spaces in the system org:

cf spaces

3. If an apps-manager Space or apps-usage-service Space exists, delete it. Deleting a Space also removes all Apps within it. These Spaces should not be present in a PCF 1.7 installation.

cf delete-space <SPACE-NAME>

4. Get a list of all the Apps running in the System Space:

cf target -o system -s system
cf apps

5. Confirm that none of app-usage-server-venerable, app-usage-worker-venerable, or app-usage-scheduler-venerable are running. If any of them are running, stop each one:

cf stop <APP-NAME>

6. Validate: Re-running cf apps and cf spaces should now show that the apps-usage-service Space has been removed and the app-usage-server-venerable application has stopped.

7. Upgrade. You should now upgrade to PCF 1.7.18+ to permanently resolve this issue.
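
Steps 4 and 5 can be scripted roughly as follows. This is a minimal sketch, not an official tool: the function names are hypothetical, and the filter assumes the usual cf apps column layout in which the first column is the app name and the second is the requested state.

```shell
# Print the app-usage "-venerable" apps that are still in the started state.
# Reads `cf apps` output on stdin. ASSUMPTION: column 1 is the app name and
# column 2 is the requested state.
find_started_venerable() {
  awk '$1 ~ /^app-usage-(server|worker|scheduler)-venerable$/ && $2 == "started" { print $1 }'
}

# Stop every app named on stdin, one name per line.
stop_listed_apps() {
  while read -r app; do
    if [ -n "$app" ]; then
      # Redirect stdin so cf cannot consume the remaining app names.
      cf stop "$app" </dev/null
    fi
  done
}
```

After targeting the system org and space, a run might look like: cf apps | find_started_venerable | stop_listed_apps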

All Versions Option 1: Repair and restore the data

Based on the factors listed above, the first step should be to try to repair the data. As previously mentioned, the length of time on the affected version will determine if the data is likely to have integrity issues and whether or not it can be repaired.

Customers who are affected by this issue AND are using the usage service data for business-critical applications should open a Support Ticket with the following information:

1. Obtain the results of the diagnostic tool that is located at:

https://app-usage.<system-domain>/data_status_report

2. Include all relevant details in the ticket, such as the version history of the foundation in question and the output from the data_status_report page, if available.

These results will be sent to the Apps Manager team to determine what level of recovery we can provide prior to executing the next steps.
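
To attach the diagnostic output to the ticket, the report can be saved to a file. The sketch below only builds the URL from step 1 and downloads it with curl; the function names and the output file name are hypothetical. It assumes the page is reachable without an interactive login. If the page requires you to sign in, open the URL in a browser and save the page from there instead.

```shell
# Build the diagnostic URL from the system domain (step 1).
data_status_url() {
  printf 'https://app-usage.%s/data_status_report\n' "$1"
}

# Download the report into data_status_report.html in the current directory.
# ASSUMPTION: no interactive login is needed to fetch the page.
save_data_status_report() {
  curl -sS -f -o data_status_report.html "$(data_status_url "$1")"
}
```

Usage: save_data_status_report <system-domain>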

At this point, Pivotal Support should have had a chance to review the results of the output above and determine if the data can be recovered. Next, we will need a copy of the Database, which can be obtained by creating a dump of the MySQL Database.

3. Obtain the MySQL root user's credentials from your installation:

https://<YOUR-OPSMAN-DOMAIN>/api/v0/deployed/products/cf-*/credentials/.mysql.mysql_admin_credentials

4. From the Ops Manager VM, use mysqldump to export the Database (prefix the command with sudo if you are not the root user). Depending on the size of the Database, this may take some time:

mysqldump -h <IPADDR-MYSQL/0 vm> -u <ROOT> -p app_usage_service > app_usage_service_export.sql

5. Upload the file to the Support Ticket by following the instructions at:

https://knowledge.broadcom.com/external/article/140731/uploading-files-to-cases-on-the-broadcom.html

We will then take that Database and repair it using internal tools and return the repaired Database dump to the customer. The restoration should be applied by or with the assistance of the Pivotal Field or Support staff. The restoration process is as follows:

6. Using the cf CLI, log in to the affected foundation:

cf login -a <your-affected-foundation>

7. Select System Org and System Space:

cf target -o system -s system

8. Stop all three App Usage applications:

cf stop app-usage-worker
cf stop app-usage-scheduler
cf stop app-usage-server

9. Export a backup of the current app_usage_service Database using mysqldump. You'll want to make sure that the export is suitable for importing in case you need to rollback. In other words, make sure the drop statements are included.

10. Import the repaired Database we provided:

mysql -u [username] -p app_usage_service < [database name].sql

11. Start the Usage Service applications using the cf CLI in the following order:

cf start app-usage-server
cf start app-usage-scheduler
cf start app-usage-worker

The data should start to look better and it should be 100% caught up after the Usage Service completes a full cycle at 2 AM server time. We advise waiting a full day to verify that it has worked.
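
Steps 8 through 11 can be wrapped in small shell functions so that the stop order, the backup, the import, and the start order are not mistyped. This is a sketch under stated assumptions: the function names are hypothetical, and the MySQL host, user, and file paths are values you supply. mysqldump includes DROP TABLE statements by default; the --add-drop-table option is spelled out here only to make the rollback requirement from step 9 explicit.

```shell
# Step 8: stop the three usage apps in the documented order.
stop_usage_apps() {
  cf stop app-usage-worker &&
  cf stop app-usage-scheduler &&
  cf stop app-usage-server
}

# Step 9: back up the current database, including DROP TABLE statements,
# so the backup can be re-imported over existing tables during a rollback.
# $1 = MySQL host, $2 = MySQL user, $3 = backup file to write.
backup_usage_db() {
  mysqldump --add-drop-table -h "$1" -u "$2" -p app_usage_service > "$3"
}

# Step 10: import the repaired dump.
# $1 = MySQL host, $2 = MySQL user, $3 = repaired .sql file to read.
import_usage_db() {
  mysql -h "$1" -u "$2" -p app_usage_service < "$3"
}

# Step 11: start the three usage apps in the documented order.
start_usage_apps() {
  cf start app-usage-server &&
  cf start app-usage-scheduler &&
  cf start app-usage-worker
}
```

Run the functions one at a time, checking the result of each before moving on. Both database functions prompt for the MySQL password.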

All Versions Option 2: Purge and reseed the data

Customers whose data can’t be recovered, or who are not using Usage Data for business-critical applications, should purge and reseed their app usage data and app events using the following process.

**Warning** This process will completely ERASE the app_usage Database, as well as Cloud Controller’s (CCs) current app events data. See the app_usage_service Database table below for details on what will be deleted.

1. Using the cf CLI, log in to your affected foundation:

cf login -a <your-affected-foundation>

2. Select System Org and System Space:

cf target -o system -s system

3. Stop all three App Usage applications:

cf stop app-usage-worker
cf stop app-usage-scheduler
cf stop app-usage-server

4. From the Ops Manager VM, connect to the MySQL server of your affected foundation:

Run bosh instances to find the name of your MySQL VM.
Run bosh ssh mysql/0, where mysql is the name of your VM.

5. Log in to the MySQL Database using the root credentials from https://<YOUR-OPSMAN-DOMAIN>/api/v0/deployed/products/cf-*/credentials/.mysql.mysql_admin_credentials:

mysql -u root -p

6. Drop the app_usage_service Database:

DROP DATABASE app_usage_service;

7. Recreate an empty Database called app_usage_service:

CREATE DATABASE app_usage_service;

8. Destructively purge and reseed app, task, and service usage events in Cloud Controller:

cf curl -X POST /v2/app_usage_events/destructively_purge_all_and_reseed_started_apps
cf curl -X POST /v2/service_usage_events/destructively_purge_all_and_reseed_existing_instances

9. Start the Usage Service applications using the CF CLI in the following order:

cf start app-usage-server
cf start app-usage-scheduler
cf start app-usage-worker

You should now have the new Database populated with the indexes and tables shown below in the sample app_usage_service Database.
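
Because the two Cloud Controller calls in step 8 cannot be undone, it can help to wrap them in a function that asks for explicit confirmation first. The following is a minimal sketch: purge_and_reseed is a hypothetical wrapper name, and the confirmation word is arbitrary. The two cf curl calls are the ones given in step 8.

```shell
# Ask for confirmation on stdin, then run the two destructive
# Cloud Controller calls from step 8. Returns non-zero if aborted.
purge_and_reseed() {
  printf 'This ERASES app and service usage events. Type "purge" to continue: ' >&2
  read -r answer
  if [ "$answer" != "purge" ]; then
    echo "Aborted; nothing was purged." >&2
    return 1
  fi
  cf curl -X POST /v2/app_usage_events/destructively_purge_all_and_reseed_started_apps &&
  cf curl -X POST /v2/service_usage_events/destructively_purge_all_and_reseed_existing_instances
}
```

Any answer other than the exact confirmation word leaves Cloud Controller untouched.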

What gets deleted when dropping the app_usage_service Database?

Sample app_usage_service Database:

[app_usage_service]> show tables;

Tables_in_app_usage_service
app_events
app_events_fetcher_job_run_logs
app_usage_rollover_job_run_logs
daily_app_config_usages
delayed_jobs
monthly_app_config_usages
old_app_data_deleter_job_run_logs
old_service_data_deleter_job_run_logs
persisted_monthly_usage_summaries
platform_app_instance_counts
schema_migrations
service_events
service_events_fetcher_job_run_logs
service_instance_usages
system_logs
worker_check_ins
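
Before dropping the Database, it may be useful to see roughly how much data each table holds. The sketch below queries information_schema.tables for the app_usage_service schema. The function name is hypothetical, the host and user are values you supply, and the table_rows figures are estimates for InnoDB tables rather than exact counts.

```shell
# Print an approximate row count for every table in app_usage_service.
# $1 = MySQL host, $2 = MySQL user. Prompts for the password.
usage_db_row_counts() {
  mysql -h "$1" -u "$2" -p -e \
    "SELECT table_name, table_rows FROM information_schema.tables WHERE table_schema = 'app_usage_service' ORDER BY table_name;"
}
```

Usage: usage_db_row_counts <IPADDR-MYSQL/0 vm> root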

Common Errors

cf logs app-usage-server --recent shows errors for the app-usage-server errand failure

"Connected, dumping recent logs for app app-usage-server in org system/space system as admin...
[APP/PROC/WEB/0]ERR /home/vcap/app/vendor/bundle/ruby/2.3.0/gems/activerecord-4.2.7.1/lib/active_record/migration.rb:955:in `each'
[APP/PROC/WEB/0]ERR /home/vcap/app/vendor/bundle/ruby/2.3.0/gems/activerecord-4.2.7.1/lib/active_record/migration.rb:955:in `migrate'

...

[APP/PROC/WEB/0]ERR Tasks: TOP => db:migrate
...
[API/0] OUT Process has crashed with type: "web"
[API/0] OUT App instance exited with guid <> payload: {"instance"=>"", "index"=>0, "reason"=>"CRASHED", "exit_description"=>"2 error(s) occurred:\n\n* 1 error(s) occurred:\n\n* Exited with status 4\n* 2 error(s) occurred:\n\n* cancelled\n* cancelled", "crash_count"=>134, "crash_timestamp"=>..., "version"=>"..."}"
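
To pick these lines out of a long log, the recent logs can be piped through a filter. This is a small sketch: the function name is hypothetical, and the patterns are taken only from the sample output above, so other failure modes will not match.

```shell
# Keep only log lines that point at a failed migration or a crash.
# Reads log text on stdin, for example from `cf logs app-usage-server --recent`.
filter_usage_server_errors() {
  grep -E 'db:migrate|migration\.rb|CRASHED|Process has crashed'
}
```

Usage: cf logs app-usage-server --recent | filter_usage_server_errors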

Apps Manager reports incorrect accounting data

Deploy will fail as a result of the Push Apps Manager errand failure

Running errand Push Apps Manager for Pivotal Application Service:

...

+ cf start app-usage-worker 
+ echo '+++++++++++++ USAGE DEPLOY FAILED! +++++++++++++' 
+++++++++++++ USAGE DEPLOY FAILED! +++++++++++++ 
...
0 of 1 instances running, 1 starting 
0 of 1 instances running, 1 starting 
FAILED 
Start app timeout 
Use 'cf logs app-usage-server --recent' for more information

…