Potential data loss for service usage data after upgrading TAS

Products

VMware Tanzu Application Service for VMs

Issue/Introduction

The following versions of TAS include a version of usage-service that may lose some data: 4.0.9, 3.0.17, 2.13.27, and 2.11.45. This affects the upgrade path only; fresh installs are unaffected.

The potential data loss is limited to service usage data, and applies only if a new service was deployed after upgrading to one of the above TAS versions. App and Task usage are unaffected. Customers need to either execute the workaround or upgrade to a newer version of TAS with an upgraded usage service within 30 days. Otherwise, a purge and reseed would be the only way forward.

Environment

Product Version: 4.0

Resolution

Patches are now available for TAS. push-usage-service-release release version 674.0.69 contains the necessary mitigations for this issue.

TAS versions 4.0.10, 3.0.18, 2.13.28, and 2.11.46 are the fixed patch versions.
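As a quick aid, the version lists in this article can be encoded in a small shell helper. This is only a sketch: the function name tas_version_status is hypothetical, and it knows only the eight versions named here, so any other version is reported as not-listed rather than guessed at.

```shell
# Classify a TAS version string against the versions named in this article.
# Hypothetical helper; it only knows the eight versions listed above.
tas_version_status() {
  case "$1" in
    4.0.9|3.0.17|2.13.27|2.11.45)   echo "affected" ;;
    4.0.10|3.0.18|2.13.28|2.11.46)  echo "fixed" ;;
    *)                              echo "not-listed" ;;
  esac
}

# Example call:
tas_version_status "3.0.17"
```

A not-listed result says nothing either way; check the release notes for that version.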

For customers unable to upgrade immediately, the workaround below can be applied to prevent loss of service usage data.

How to tell if you're affected

Use the cf CLI to get logs from the app-usage-worker app in the system space of the system org. If you are affected by this issue, you should see Mysql2::Error: Duplicate entry errors roughly every 5 minutes. They will look something like this:

2023-10-06T11:10:05.77+0100 [APP/PROC/WEB/0] OUT I, [2023-10-06T10:10:05.776183 #7]  INFO -- : ActiveRecord::RecordNotUnique: Mysql2::Error: Duplicate entry '4b0c79bd-9528-412d-a94f-3526bd827a4f' for key 'service_events.index_service_events_on_guid'
2023-10-06T11:10:05.77+0100 [APP/PROC/WEB/0] OUT I, [2023-10-06T10:10:05.776250 #7]  INFO -- : /home/vcap/deps/0/vendor_bundle/ruby/3.2.0/gems/mysql2-0.5.5/lib/mysql2/client.rb:151:in `_query'
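One way to check for these errors without reading the logs by eye is to count matching lines. The sketch below assumes you are logged in with the cf CLI and that the recent-log buffer still holds at least one 5-minute cycle. The function name count_duplicate_guid_errors is hypothetical, and the pattern is taken from the sample lines above.

```shell
# Count "Duplicate entry" errors on the service_events GUID index.
# Reads log text from stdin and prints the number of matching lines.
count_duplicate_guid_errors() {
  # grep -c exits non-zero when the count is 0, so mask that status.
  grep -c "Duplicate entry .* for key 'service_events.index_service_events_on_guid'" || true
}

# Usage against a live foundation (not executed here):
#   cf target -o system -s system
#   cf logs app-usage-worker --recent | count_duplicate_guid_errors
```

A non-zero count suggests the issue is present. A count of zero is not conclusive if the recent-log buffer is short.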

If you connect to the SQL database that backs usage service (often this is the app_usage_service database on the CF internal MySQL instance that's included in the CF deployment), you can also inspect a table called service_events_fetcher_job_run_logs. If you are affected, you should see that all the entries in that table have a created_at timestamp that is earlier than the upgrade which subjected you to this bug.
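The table check described above can be reduced to a single query for the newest created_at value. The following is a minimal sketch: the function name latest_job_run_query and the column alias latest_run are hypothetical, while the table name, database name, and mysql invocation come from this article.

```shell
# Print a query that returns the newest created_at value in the job run log table.
latest_job_run_query() {
  printf '%s\n' "SELECT MAX(created_at) AS latest_run FROM service_events_fetcher_job_run_logs;"
}

# Usage on a MySQL VM (not executed here):
#   latest_job_run_query | sudo mysql --defaults-file=/var/vcap/jobs/pxc-mysql/config/mylogin.cnf app_usage_service
```

If latest_run is earlier than the time of your upgrade, that is consistent with being affected.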

Workaround

1. Identify your CF deployment name using the BOSH CLI.

E.g. by running
bosh ds --json | jq '.Tables[].Rows[] | select( .name | startswith("cf-")).name' -r

2. SSH onto one of the MySQL instances, replacing CF_NAME with the deployment name from step 1:
bosh -d CF_NAME ssh mysql/0

3. Connect to the database with the mysql CLI:

sudo mysql --defaults-file=/var/vcap/jobs/pxc-mysql/config/mylogin.cnf app_usage_service

4. Execute the following SQL statement:

INSERT INTO latest_service_usage_event_guids (service_event_guid, event_created_at, created_at, updated_at) SELECT guid, occurred_at, created_at, updated_at FROM service_events ORDER BY id DESC LIMIT 1;

How to tell if your intervention fixed the issue

Wait 10 to 15 minutes after applying the fix. Then use the cf CLI to get logs from the app-usage-worker app in the system space of the system org. You should find that the Mysql2::Error: Duplicate entry errors are no longer occurring.

Since you connected to the backing SQL database to apply the fix, you can also inspect the service_events_fetcher_job_run_logs table. You should find that new entries have started being created in this table since you applied the fix. Those new entries should have a created_at timestamp that is more recent than the time at which you applied the fix.
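The table check can also be expressed as a count of job runs recorded after the fix. This is a sketch under stated assumptions: the function name runs_since_query and the alias runs_since_fix are hypothetical, the timestamp in the usage line is a placeholder, and created_at is assumed to be stored in UTC (the sample log lines above show the app logging in UTC, but verify this for your database).

```shell
# Print a query counting job run log rows created after a given timestamp.
# $1: the time the fix was applied, e.g. "YYYY-MM-DD HH:MM:SS" (assumed UTC).
runs_since_query() {
  printf "SELECT COUNT(*) AS runs_since_fix FROM service_events_fetcher_job_run_logs WHERE created_at > '%s';\n" "$1"
}

# Usage on a MySQL VM (not executed here; the timestamp is a placeholder):
#   runs_since_query "2023-10-06 10:30:00" | sudo mysql --defaults-file=/var/vcap/jobs/pxc-mysql/config/mylogin.cnf app_usage_service
```

A count above zero after the waiting period indicates that the fetcher job has resumed writing run logs.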