Symptoms:
Users could experience any of the following issues:
1. The app-usage-server App, which handles this data and is found in the System Org and System Space, repeatedly crashes.
2. cf logs app-usage-server --recent displays recent errors for the app-usage-server errand.
3. cf apps displays the app-usage-server-venerable App in the started state.
Cause:
This problem is caused by multiple instances of the App usage server suite running in a Pivotal Cloud Foundry (PCF) deployment. Multiple workers cause data replication, and calculations are then performed on the replicated data, which corrupts the usage data.
This issue has been observed in the following scenarios:
1. You upgraded from PCF v1.6.x to a version earlier than PCF v1.7.16. In this scenario, a bug in the App usage service deployment moved the deployment to a new space but failed to clean up the Apps that were running in the old space. As a result, multiple instances of the "app-usage" Apps run in both spaces.
2. You installed a new PCF foundation on a version earlier than PCF v1.7.15. In this scenario, a bug in the App usage service deployment left “venerable” Apps from a blue-green deployment in a running state. As a result, multiple instances of the App are running.
3. You upgraded to any version of PCF later than PCF v1.7.16 from an affected version without addressing the bug first. In this scenario, you may see issues with app usage data integrity, because one of the above scenarios was not addressed in a prior version of PCF.
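As a rough illustration of the version ranges above, the following Python sketch checks whether a foundation is still on a 1.7.x release older than 1.7.18, the upgrade target this article recommends. The function names are hypothetical, and the check looks only at the current version: it cannot tell whether a foundation that is already on a later release was upgraded from an affected one (scenario 3).

```python
# Assumption: 1.7.18 is used as the cut-off because this article names
# 1.7.18+ as the upgrade target. The helper names are illustrative only.
FIRST_RECOMMENDED = (1, 7, 18)


def parse_version(text):
    """Convert a string such as 'v1.7.16' or '1.7.16' into (1, 7, 16)."""
    return tuple(int(part) for part in text.strip().lstrip("vV").split("."))


def is_on_affected_release(current_version):
    """Return True for a 1.7.x release older than 1.7.18."""
    version = parse_version(current_version)
    return (1, 7, 0) <= version < FIRST_RECOMMENDED
```

The helper expects fully numeric versions; a placeholder such as "1.7.x" is not handled.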
The resolution you will execute will depend on the following factors:
1. How long has the foundation been on the affected version?
Depending on the length of time that has passed since the foundation was on any of the affected versions of PCF, the integrity of the data will be affected. Subsequently, the best action to take will depend on this factor. Refer to the Time Table below.
2. Production dependency of App Usage Data: Can it be deleted? Refer to "What Gets Deleted" at the end of this document.
| Pivotal Application Service Current Version | Date installed | Action | Result |
|---|---|---|---|
| v1.6.x | N/A | Upgrade to Pivotal Application Service 1.7.18+ | Installations that upgrade directly to Pivotal Application Service 1.7.18+ will not experience the issue. Proceed to upgrade directly to Pivotal Application Service 1.7.18+ |
| v1.7.0 - 1.7.15 | <30 days ago | Upgrade to Pivotal Application Service 1.7.18+ and email us | The installation likely has data quality issues that can be resolved |
| v1.7.0 - 1.7.15 | <60 days ago | Upgrade to Pivotal Application Service 1.7.18+ and email us | The installation likely has data quality issues that can be partially resolved, and in some cases fully resolved |
| v1.7.0 - 1.7.15 | 60+ days ago | Upgrade to Pivotal Application Service 1.7.18+ and email us | The installation likely has data quality issues that may be difficult to resolve without data loss |
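The time thresholds in the table can be expressed as a small lookup. This is only a sketch of the table's logic for a foundation on an affected 1.7.x version; the function name and the returned wording are illustrative, not part of any Pivotal tooling.

```python
def expected_outcome(days_on_affected_version):
    """Map days spent on an affected 1.7.x version to the table's expected result."""
    if days_on_affected_version < 0:
        raise ValueError("days_on_affected_version must not be negative")
    if days_on_affected_version < 30:
        return "data quality issues can likely be resolved"
    if days_on_affected_version < 60:
        return "data quality issues can be partially, in some cases fully, resolved"
    return "data quality issues may be difficult to resolve without data loss"
```

In every case the action in the table is the same: upgrade to Pivotal Application Service 1.7.18+ and contact support.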
For PCF v1.7.x only, remove the app-usage-service Space and Apps.
If you have already upgraded to PCF v1.8.0 or above, skip this section and go to the "All Versions" section below.
Note: The following steps remove the app-usage-service Space and Apps. This is a temporary fix that can be reverted by a new deployment. Upgrading to PCF 1.7.18 or above immediately after this procedure is strongly encouraged.
1. Log into CF API target as admin and select the system org:
2. List out all the spaces in the system org:
3. If an apps-manager Space or apps-usage-service Space exists, then delete them as needed to remove these Spaces and all Apps within these Spaces. They should not be present in a PCF 1.7 installation.
4. Get a list of all the Apps running in the System Space:
5. Confirm that none of app-usage-server-venerable, app-usage-worker-venerable, or app-usage-scheduler-venerable are running. If any of them is running, stop it:
6. Validate: Re-running cf apps and cf spaces should now show that the apps-usage-service Space has been removed and the app-usage-server-venerable App is stopped.
7. Upgrade. You should now upgrade to PCF 1.7.18+ to permanently resolve this issue.
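The check in step 5 can be scripted against saved `cf apps` output. The sketch below is a minimal example: it assumes the usual column layout of `cf apps` (app name first, requested state second), and the sample text is made up for illustration rather than captured from a real foundation.

```python
# Illustrative sample only; real `cf apps` output will differ.
SAMPLE_CF_APPS_OUTPUT = """\
name                           requested state   instances   memory   disk   urls
app-usage-scheduler            started           1/1         128M     1G
app-usage-server               started           1/1         128M     1G     app-usage.example.com
app-usage-server-venerable     started           1/1         128M     1G
app-usage-worker-venerable     stopped           0/1         128M     1G
"""


def started_venerable_apps(cf_apps_output):
    """Return names of '-venerable' apps whose requested state is 'started'."""
    running = []
    for line in cf_apps_output.splitlines():
        fields = line.split()
        # Assumes column 1 is the app name and column 2 the requested state.
        if len(fields) >= 2 and fields[0].endswith("-venerable") and fields[1] == "started":
            running.append(fields[0])
    return running
```

Any app name this returns is a candidate for stopping as described in step 5.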
Based on the factors listed above, the first step should be to try to repair the data. As previously mentioned, the length of time on the affected version will determine if the data is likely to have integrity issues and whether or not it can be repaired.
If you are affected by this issue and are using the usage data for business-critical applications, please open a Support Ticket with the following information:
1. Obtain the results of the diagnostic tool that is located at:
2. Include all relevant details in the ticket, such as the version history of the foundation in question and, if available, the output from the data_status page.
These results will be sent to the Apps Manager team to determine what level of recovery we can provide prior to executing the next steps.
At this point, Pivotal Support should have had a chance to review the results of the output above and determine if the data can be recovered. Next, we will need a copy of the Database, which can be obtained by creating a dump of the MySQL Database.
3. Obtain the MySQL root user's credentials from your installation:
4. From the Ops Manager VM, as the root user or with sudo, use mysqldump to export the Database. Depending on the size of the Database, this may take some time:
5. Upload the file using the instructions at https://knowledge.broadcom.com/external/article/140731/uploading-files-to-cases-on-the-broadcom.html
We will then take that Database and repair it using internal tools and return the repaired Database dump to the customer. The restoration should be applied by or with the assistance of the Pivotal Field or Support staff. The restoration process is as follows:
6. Using the cf CLI, log in to the affected foundation:
7. Select System Org and System Space:
8. Stop all three app Usage Applications:
9. Export a backup of the current app_usage_service Database using mysqldump. Make sure that the export is suitable for importing in case you need to roll back. In other words, make sure the drop statements are included.
10. Import the repaired Database we provided:
11. Start the Usage Service applications using the cf CLI in the following order:
The data should start to look better and it should be 100% caught up after the Usage Service completes a full cycle at 2 AM server time. We advise waiting a full day to verify that it has worked.
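For the backup in step 9, one way to make sure drop statements are included is to pass `--add-drop-table` to mysqldump explicitly. The sketch below only builds the argument list and does not run anything. The host name, output file name, and function name are hypothetical placeholders; use the MySQL host and credentials from your own installation.

```python
def build_backup_command(host, output_file, user="root"):
    """Build a mysqldump argument list for backing up app_usage_service.

    Assumptions: `host` and `output_file` are placeholders supplied by the
    caller; `--password` without a value makes mysqldump prompt for it.
    """
    return [
        "mysqldump",
        f"--host={host}",
        f"--user={user}",
        "--password",
        "--add-drop-table",  # emit DROP TABLE before each CREATE TABLE
        f"--result-file={output_file}",
        "--databases",
        "app_usage_service",
    ]
```

The resulting list could be handed to `subprocess.run(command, check=True)` on a machine that has mysqldump installed and can reach the MySQL server.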
Customers whose data can’t be recovered, or who are not using Usage Data for business-critical applications, should purge and reseed their app usage data and app events using the following process.
**Warning** This process will completely ERASE the app_usage_service Database, as well as Cloud Controller’s (CC’s) current app events data. See the app_usage_service Database table below for details on what will be deleted.
1. Using the cf CLI, log in to your affected foundation:
2. Select System Org and System Space:
3. Stop all three app Usage Applications:
4. From the Ops Manager VM, connect to the MySQL server of your affected foundation:
5. Log in to the MySQL Database using the root credentials from https://<YOUR-OPSMAN-DOMAIN>/api/v0/deployed/products/cf-*/credentials/.mysql.mysql_admin_credentials:
6. Drop the app_usage_service Database:
7. Recreate an empty Database called app_usage_service:
8. Destructively purge and reseed app, task, and service usage events in Cloud Controller:
9. Start the Usage Service applications using the CF CLI in the following order:
You should now have the new Database populated with the indexes and tables shown below in the sample app_usage_service Database.
Sample app_usage_service Database:
Connected, dumping recent logs for app app-usage-server in org system/space system as admin...
[APP/PROC/WEB/0]ERR /home/vcap/app/vendor/bundle/ruby/2.3.0/gems/activerecord-4.2.7.1/lib/active_record/migration.rb:955:in `each'
[APP/PROC/WEB/0]ERR /home/vcap/app/vendor/bundle/ruby/2.3.0/gems/activerecord-4.2.7.1/lib/active_record/migration.rb:955:in `migrate'
...
[APP/PROC/WEB/0]ERR Tasks: TOP => db:migrate
...
[API/0] OUT Process has crashed with type: "web"
[API/0] OUT App instance exited with guid <> payload: {"instance"=>"", "index"=>0, "reason"=>"CRASHED", "exit_description"=>"2 error(s) occurred:\n\n* 1 error(s) occurred:\n\n* Exited with status 4\n* 2 error(s) occurred:\n\n* cancelled\n* cancelled", "crash_count"=>134, "crash_timestamp"=>..., "version"=>"..."}
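To spot the repeated-crash symptom in saved log output, a short script can look for the CRASHED reason and read the crash count from the payload. This is a minimal sketch based only on the log line format shown above; the function name is hypothetical and the sample line is an abbreviated copy of that log line.

```python
import re

# Abbreviated from the sample log line above, for illustration only.
SAMPLE_LINE = (
    '[API/0] OUT App instance exited with guid <> payload: '
    '{"instance"=>"", "index"=>0, "reason"=>"CRASHED", "crash_count"=>134}'
)


def crash_count(log_line):
    """Return the crash count from a CRASHED log line, or None otherwise."""
    if '"reason"=>"CRASHED"' not in log_line:
        return None
    match = re.search(r'"crash_count"=>(\d+)', log_line)
    return int(match.group(1)) if match else None
```

A high and growing count across successive log lines matches the repeated crashing described under Symptoms.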
Running errand Push Apps Manager for Pivotal Application Service:
...
+ cf start app-usage-worker
+ echo '+++++++++++++ USAGE DEPLOY FAILED! +++++++++++++'
+++++++++++++ USAGE DEPLOY FAILED! +++++++++++++
...
0 of 1 instances running, 1 starting
0 of 1 instances running, 1 starting
FAILED
Start app timeout
Use 'cf logs app-usage-server --recent' for more information
…