After upgrading TPCF to v10.2.6 or later, users began observing inconsistencies in ORG memory quota usage

Products

VMware Tanzu Platform - Cloud Foundry

Issue/Introduction

After upgrading TPCF to v10.2.6 or later, users began observing inconsistencies in ORG memory quota usage.

As shown in the screenshot below, an ORG containing two spaces is reported as using 38 GB of memory quota. The expected usage is 26 GB, because Space1 consumes 25 GB and Space2 consumes 1 GB.

This is not merely a UI or display issue.

It can also have a functional impact, as users may encounter application deployment failures or task scheduling failures due to the ORG memory quota being incorrectly calculated as exhausted or insufficient.
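
The size of the discrepancy in the example can be checked with simple arithmetic. The shell sketch below uses only the figures quoted above; the variable names are illustrative.

```shell
# Figures from the example above, in GB (variable names are illustrative).
reported_gb=38   # usage reported for the ORG
space1_gb=25     # usage of Space1
space2_gb=1      # usage of Space2

# Expected usage is the sum of the two spaces.
expected_gb=$((space1_gb + space2_gb))
# Reported usage that the two spaces do not account for.
excess_gb=$((reported_gb - expected_gb))

echo "expected=${expected_gb}GB reported=${reported_gb}GB excess=${excess_gb}GB"
```

With these figures, 12 GB of the reported usage is not accounted for by the two spaces.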

Cause

This issue is believed to result from a feature enhancement introduced in TPCF v10.2.6, under which PENDING tasks are counted against quota. In earlier versions, PENDING tasks were not included in quota calculations.

How can you validate whether you are affected by the issue described in this article?

Access your MySQL database. If you are using the TAS internal MySQL database, refer to "How to connect to the VMware Tanzu Application Service (TAS) for VMs internal MySQL database".
Run USE ccdb; to access the CCDB database.
Run the following query to retrieve the task status summary for the problematic ORG, replacing ORG-name with the name of that ORG.

SELECT 
  t.state,
  COUNT(*) AS count
FROM tasks t
INNER JOIN apps a ON t.app_guid = a.guid
INNER JOIN spaces s ON a.space_guid = s.guid
INNER JOIN organizations o ON s.organization_id = o.id
WHERE o.name = 'ORG-name'
GROUP BY t.state;

If you are affected by the issue described in this article, the result will include PENDING tasks, similar to the following:

mysql> SELECT 
  t.state,
  COUNT(*) AS count
FROM tasks t
INNER JOIN apps a ON t.app_guid = a.guid
INNER JOIN spaces s ON a.space_guid = s.guid
INNER JOIN organizations o ON s.organization_id = o.id
WHERE o.name = 'test-org'
GROUP BY t.state;
+-----------+-------+
| state     | count |
+-----------+-------+
| SUCCEEDED |  2060 |
| PENDING   |    12 |
| FAILED    |    29 |
| RUNNING   |     6 |
+-----------+-------+
4 rows in set (0.00 sec)
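
When several ORGs need to be checked, the PENDING count can be pulled out of the query result with a small shell helper. This is a minimal sketch: it assumes the query is run with the mysql client in batch mode (-B -N), which prints tab-separated rows without a header line. The function name pending_count is hypothetical, and connection options are omitted.

```shell
# pending_count: read tab-separated "state<TAB>count" rows on stdin and
# print the count for the PENDING state (0 if there is no PENDING row).
# Hypothetical helper name; not part of any product tooling.
pending_count() {
  awk -F'\t' '$1 == "PENDING" { n = $2 } END { print n + 0 }'
}

# Example usage (connection options omitted; the query is the one shown above):
#   mysql -B -N ccdb -e "SELECT t.state, COUNT(*) ... GROUP BY t.state;" | pending_count
```

A non-zero result for an ORG means it has PENDING tasks and may be affected.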

Resolution

At present, there are two available mitigation options. It is strongly recommended that you take a backup of the MySQL database before performing either of the methods below.

Option 1
Use this option if you have only a small number of affected ORGs or a limited number of PENDING tasks.

Access your MySQL database. If you are using the TAS internal MySQL database, refer to "How to connect to the VMware Tanzu Application Service (TAS) for VMs internal MySQL database".
Run USE ccdb; to access the CCDB database.

Run the following query to list the PENDING tasks in the affected ORG, replacing ORG-name with the name of that ORG:

SELECT 
  t.state,
  t.app_guid,
  t.sequence_id
FROM tasks t
INNER JOIN apps a ON t.app_guid = a.guid
INNER JOIN spaces s ON a.space_guid = s.guid
INNER JOIN organizations o ON s.organization_id = o.id
WHERE o.name = 'ORG-name'
AND t.state = 'PENDING'
GROUP BY t.state, t.app_guid, t.sequence_id;

The output should look similar to the following:

+---------+---------------+-------------+
| state   | app_guid      | sequence_id |
+---------+---------------+-------------+
| PENDING | SOME-APP-GUID |         194 |
| PENDING | SOME-APP-GUID |         195 |
| PENDING | SOME-APP-GUID |         198 |
| PENDING | SOME-APP-GUID |         334 |
| PENDING | SOME-APP-GUID |         335 |
| PENDING | SOME-APP-GUID |         336 |
| PENDING | SOME-APP-GUID |         474 |
| PENDING | SOME-APP-GUID |         475 |
| PENDING | SOME-APP-GUID |         476 |
| PENDING | SOME-APP-GUID |         619 |
| PENDING | SOME-APP-GUID |         996 |
| PENDING | SOME-APP-GUID |        1002 |
+---------+---------------+-------------+
12 rows in set (0.00 sec)
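
If the PENDING tasks span several apps, the task IDs for one app can be filtered out of the query result in the shell. This sketch assumes the query output was produced with the mysql client in batch mode (-B -N), giving tab-separated rows of state, app_guid, and sequence_id. The function name task_ids_for_app is hypothetical.

```shell
# task_ids_for_app APP_GUID: read "state<TAB>app_guid<TAB>sequence_id" rows
# on stdin and print the sequence_id of every PENDING row that belongs to
# APP_GUID, one per line. Hypothetical helper name.
task_ids_for_app() {
  awk -F'\t' -v guid="$1" '$1 == "PENDING" && $2 == guid { print $3 }'
}

# Example usage (connection options omitted; the query is the one shown above):
#   mysql -B -N ccdb -e "SELECT t.state, t.app_guid, t.sequence_id ..." \
#     | task_ids_for_app SOME-APP-GUID
```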

On a jumpbox where the cf CLI is targeting the TPCF endpoint:

Retrieve the app name and space name for each app GUID returned by the query.
Target the problematic ORG and the space retrieved in the previous step.

Run the cf terminate-task APP_NAME TASK_ID command against each PENDING task, using the sequence_id from the query output as TASK_ID.

$ cf curl /v3/apps/SOME-APP-GUID | jq '.name'
"SOME-APP-NAME"

$ cf curl /v3/apps/SOME-APP-GUID | jq '.relationships.space.data.guid'
"SOME-SPACE-GUID"

$ cf curl /v3/spaces/SOME-SPACE-GUID | jq '.name'
"test-space"

$ cf t -o ORG-name -s test-space 

$ cf terminate-task SOME-APP-NAME 1002 
.
.
.
.
.
.
$ cf terminate-task SOME-APP-NAME 194

Repeat the above steps until all unwanted PENDING tasks have been terminated.
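
For an app with many PENDING tasks, the repeated cf terminate-task calls can be wrapped in a loop. The following is a minimal sketch, not an official tool: the run and terminate_tasks functions and the DRY_RUN variable are hypothetical names introduced here, and the script assumes the cf CLI is already targeting the correct ORG and space.

```shell
# run: execute a command, or only print it when DRY_RUN=1.
# DRY_RUN is a hypothetical variable used by this sketch only.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

# terminate_tasks APP_NAME TASK_ID...
# Calls "cf terminate-task APP_NAME TASK_ID" once per task ID and reports
# any task that could not be terminated, then continues with the next one.
terminate_tasks() {
  app_name="$1"
  shift
  for task_id in "$@"; do
    run cf terminate-task "$app_name" "$task_id" ||
      echo "could not terminate task ${task_id}" >&2
  done
}
```

Running DRY_RUN=1 terminate_tasks SOME-APP-NAME 194 195 198 prints the commands without executing them, which is a cheap way to review the list before terminating anything.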

Option 2
Use this option if you have a large number of affected ORGs or PENDING tasks.

Refer to "How to access CCNG console and retrieve the task status" to log in to the CCNG console.

Run the following to transition PENDING tasks that were created more than one hour ago to FAILED:

TaskModel.where(state: TaskModel::PENDING_STATE).where(Sequel.lit('created_at < ?', 1.hour.ago)).update(state: TaskModel::FAILED_STATE, failure_reason: 'Task expired in PENDING state')

If you need any assistance or have further questions, please feel free to open a support ticket.