Migration failures cause policy server pre-start errors

Article ID: 396712

Products

VMware Tanzu Application Service

Issue/Introduction

The policy-server pre-start script fails when upgrading to cf-networking-release version 3.68.0 or higher if dynamic ASGs (set in the “networking” tab) are in use with a MySQL database.

Symptom 1: failing on migration 82

{"timestamp":"2025-05-01T10:45:10.131469969Z","level":"error","source":"cfnetworking.policy-server-migrate-db","message":"cfnetworking.policy-server-migrate-db.failed migrating and populating tags, retrying","data":{"error":"perform migrations: executing migration: executor.Exec: Error 3906 (HY000): Exceeded max total length of values per record for multi-valued index 'staging_spaces_idx' by 84 bytes. handling 82"}}


Symptom 2: failing on migration 83

{"timestamp":"2025-05-01T10:45:10.131469969Z","level":"error","source":"cfnetworking.policy-server-migrate-db","message":"cfnetworking.policy-server-migrate-db.failed migrating and populating tags, retrying","data":{"error":"perform migrations: executing migration: executor.Exec: Error 3906 (HY000): Exceeded max total length of values per record for multi-valued index 'running_spaces_idx' by 84 bytes. handling 83"}}


If you have not yet upgraded to an impacted version, test whether you will be impacted.

Option 1: Use the CLI and API

# Fetch the first page to learn how many pages of security groups exist
pages="$(cf curl "/v3/security_groups" | jq -r '.pagination.total_pages')"

for (( p=1; p<=pages; p++ ))
do
    # Quote the URL so the shell does not glob-expand the "?"
    security_groups="$(cf curl "/v3/security_groups?page=${p}")"
    # List ASGs bound to 148 or more spaces, first for staging, then for running
    echo "${security_groups}" | jq '[.resources[] | select(.relationships.staging_spaces.data | length >= 148)] | map({guid, name, staging_spaces_count: (.relationships.staging_spaces.data | length)})'
    echo "${security_groups}" | jq '[.resources[] | select(.relationships.running_spaces.data | length >= 148)] | map({guid, name, running_spaces_count: (.relationships.running_spaces.data | length)})'
done


If any non-empty array is printed, you will run into this bug and should follow the mitigations. An empty array ([]) means no ASG on that page matched. Below is an example of what matching results from the script above look like.

[
  {
    "guid": "14ad7fc8-27c2-4456-9641-3d9f8cffb1c1",
    "name": "too_many_staging_spaces_example",
    "staging_spaces_count": 160
  }
]
[
  {
    "guid": "14ad7fc8-27c2-4456-9641-3d9f8cffb1c1",
    "name": "too_many_running_spaces_example",
    "running_spaces_count": 170
  }
]
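
The same filter can also be applied to a saved API response, which is useful when the check has to be reviewed away from the foundation. The following is a minimal sketch, assuming jq is installed; the function name asgs_over_limit and its threshold argument are illustrative and not part of the cf CLI.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads one /v3/security_groups page on stdin and prints
# "<name> <lifecycle> <count>" for every ASG at or above the threshold.
# The threshold defaults to 148, the value used in the script above.
asgs_over_limit() {
  local limit="${1:-148}"
  jq -r --argjson limit "${limit}" '
    .resources[]
    | . as $sg
    | (["staging", .relationships.staging_spaces.data],
       ["running", .relationships.running_spaces.data])
    | select((.[1] | length) >= $limit)
    | "\($sg.name) \(.[0]) \(.[1] | length)"
  '
}
```

For example, `cf curl "/v3/security_groups?page=1" | asgs_over_limit` prints one line per offending ASG and lifecycle, and prints nothing when the page is clean.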

 

Option 2: Query the database

  1. Connect to the policy server db.
  2. Run the following queries:
select name from security_groups WHERE JSON_LENGTH(staging_spaces) > 148;
select name from security_groups WHERE JSON_LENGTH(running_spaces) > 148;


If either of those queries returns any rows, you will run into this bug and should follow the mitigations.

Environment

The following TAS versions are affected:

  • 4.0.35
  • 6.0.15
  • 10.0.5

These versions are affected only when dynamic ASGs are used with a MySQL database.

 

Cause

Migrations 82 and 83 both add functional indexes to the policy server database to make dynamic ASGs more performant. However, when “staging_spaces” or “running_spaces” is too large, the functional index cannot be created and the migration fails, which in turn causes the pre-start script to fail.

The “staging_spaces” and “running_spaces” columns become too large when a single ASG is bound to more than 148 individual spaces for that lifecycle.

 

Resolution

A permanent fix is in CF Networking Release 3.70.0, which is included in the following TAS versions:

  • 4.0.36 and later
  • 6.0.16 and later
  • 10.0.6 and later
  • 10.2.0 and later

For deployments that use MySQL databases with dynamic ASGs enabled, we suggest skipping all affected TAS versions and using versions that include the fixed CF Networking Release. The permanent fix takes into account that some customers will have followed the mitigation steps and manually updated their databases.

Mitigation Option 1: Force skip migration by updating db.

If you need to continue an upgrade urgently and do not have time to apply Mitigation Option 2 or 3, you can force these migrations to be skipped:

  1. Access the policy server db (see "How to connect to TAS Internal DB").
  2. Add these rows manually so that migrations 82 and 83 are recorded as already applied.
insert into gorp_migrations (id, applied_at) values (82, NOW());
insert into gorp_migrations (id, applied_at) values (83, NOW());
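
If the statements might be run more than once, a guarded form avoids adding duplicate rows. The sketch below only generates SQL text; it assumes the gorp_migrations table has exactly the id and applied_at columns shown above, and the helper name skip_migration_sql is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints one guarded INSERT per migration id, so that
# executing the generated SQL twice does not add a second row for the same id.
# Nothing is sent to the database; paste the output into the MySQL session.
skip_migration_sql() {
  local id
  for id in "$@"; do
    printf 'INSERT INTO gorp_migrations (id, applied_at) SELECT %s, NOW() FROM DUAL WHERE NOT EXISTS (SELECT 1 FROM gorp_migrations WHERE id = %s);\n' "${id}" "${id}"
  done
}
```

Calling `skip_migration_sql 82 83` prints two statements, one for each migration.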

 

Mitigation Option 2: Use global ASGs

Global ASGs do not need to be bound to individual spaces. However, they can be bound unnecessarily to individual spaces, which will also trigger this bug.

  1. Make and bind a new global ASG with all the same rules as the problematic ASG.
  2. Delete the problematic ASG.
  3. Do not bind the new ASG to spaces or orgs individually.
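
The steps above can be sketched as a dry run that prints the cf CLI commands without executing them. The ASG names and the rules file path are placeholders supplied by the operator, and the function name global_asg_plan is hypothetical. The sketch binds the new ASG globally for both lifecycles; drop whichever line the original ASG did not need.

```shell
#!/usr/bin/env bash
# Dry run: prints the cf CLI commands for replacing a space-bound ASG with a
# global one. Nothing is executed; review the plan, then run it by hand.
# Arguments (all operator-supplied placeholders):
#   $1 = name of the problematic ASG
#   $2 = name for the new global ASG
#   $3 = path to a JSON file holding the same rules as the problematic ASG
global_asg_plan() {
  local old_name="$1" new_name="$2" rules_file="$3"
  printf 'cf create-security-group %s %s\n' "${new_name}" "${rules_file}"
  printf 'cf bind-running-security-group %s\n' "${new_name}"
  printf 'cf bind-staging-security-group %s\n' "${new_name}"
  printf 'cf delete-security-group %s -f\n' "${old_name}"
}
```

For example, `global_asg_plan too_many_spaces too_many_spaces_global rules.json` prints four commands in the order they should be run.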

Mitigation Option 3: Make multiple ASGs with the same rules

Instead of binding one ASG to 148 or more spaces, create two or more identical ASGs and bind each one to fewer than 148 spaces.
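
One way to plan the split is to assign the spaces to numbered copies of the ASG in fixed-size groups. The following is a minimal sketch using awk; the function name plan_asg_split, the "-<n>" naming scheme, and the default of 147 spaces per copy are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads one space name per line on stdin and assigns each
# space to a numbered copy of the ASG, so no copy gets more than max_per_asg.
# Output format: "<base>-<n> <space>". Blank input lines are skipped.
plan_asg_split() {
  local base="$1" max_per_asg="${2:-147}"
  awk -v base="${base}" -v max="${max_per_asg}" '
    NF { printf "%s-%d %s\n", base, int(n / max) + 1, $0; n++ }
  '
}
```

Each output line names the ASG copy a space should be bound to with `cf bind-security-group`. The sketch does not track which org a space belongs to, so check `cf bind-security-group --help` for the arguments your CLI version expects.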

Check Mitigation

If you have already deployed, or attempted to deploy, cf-networking-release version 3.68.0 or higher, you can run the policy server migrations manually to confirm that the mitigation worked.

# commands run on diego_database bootstrap VM

# become root
sudo su - 

# make sure you are on the bootstrap VM; if this file is empty, you are on the wrong VM
cat /var/vcap/jobs/policy-server/bin/pre-start

# run the pre-start script. It will log output and will migrate the db
/var/vcap/jobs/policy-server/bin/pre-start
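
To confirm that the migration error is gone, the output can be scanned for the MySQL error quoted in the symptoms. This is a minimal sketch, assuming the pre-start script writes its log lines to the terminal; the function name migration_error_absent is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: scans log lines on stdin for MySQL error 3906 raised
# against staging_spaces_idx or running_spaces_idx.
# Returns 0 when the error is absent, 1 when it is still present.
migration_error_absent() {
  if grep -E "Error 3906 .*'(staging|running)_spaces_idx'" >/dev/null; then
    return 1
  fi
  return 0
}
```

For example, `/var/vcap/jobs/policy-server/bin/pre-start 2>&1 | migration_error_absent && echo "no index errors"` reports success only when neither index error appears in the output.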