TCA 2.3.0 Troubleshooting guide for Migration failure cases

Article ID: 345730

Products

VMware Telco Cloud Automation

Issue/Introduction

This KB lists the known and possible errors that may occur during the MongoDB to PostgreSQL migration. For each error it gives the likely cause and a solution to unblock the upgrade.

Symptoms:
One or more of the following errors are observed:

1. Not a valid upgrade bundle for current version
2. com.mongodb.MongoSocketReadTimeoutException: Timeout while receiving message
3. "exception.MigrationException: 10007: Error migrating a collection"
4. Migration failed and upgrade continued
5. MigrationException: 10001: Unable to initialize a connection to PostgreSQL
6. com.mongodb.MongoGridFSException: No file found with the id: BsonObjectId
7. Unknown error reported during Upgrade
8. Performing Manual Upgrade using CLI


Environment

VMware Telco Cloud Automation 2.1
VMware Telco Cloud Automation 2.3

Cause

The cause of each error, with the corresponding log snippet and fix, is documented in Sections 1 to 7 of the Resolution section. Section 8 describes how to run the upgrade manually from the command line.
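
As a quick triage aid, the error signatures from the log snippets in this article can be searched for in the upgrade log. The following is a minimal sketch, not part of the TCA tooling: the function name `triage_upgrade_log` is hypothetical, the default log path is the one used in the steps of this article, and the match strings are copied from the log snippets in the Resolution section.

```shell
# Sketch: map known error signatures in an upgrade log to the sections of this article.
# triage_upgrade_log is a hypothetical helper name, not a TCA command.
triage_upgrade_log() {
    local log="${1:-/common/logs/upgrade/upgrade.log}"
    local found=1   # stays 1 (non-zero) when no known signature matches
    local entry section pattern

    # Each entry is "section number|fixed string to search for".
    for entry in \
        "1|Not a valid upgrade bundle for current version" \
        "2|MongoSocketReadTimeoutException" \
        "3|MigrationException: 10007" \
        "5|MigrationException: 10001" \
        "6|MongoGridFSException: No file found with the id"
    do
        section="${entry%%|*}"
        pattern="${entry#*|}"
        if grep -qF -- "$pattern" "$log"; then
            echo "Section ${section}: ${pattern}"
            found=0
        fi
    done
    return "$found"
}
```

A single log can match more than one entry. For example, a socket timeout (Section 2) on a collection in the fatal list also produces the 10007 error (Section 3).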

Resolution

Section 1: Not a valid upgrade bundle for current version

Log snippet for reference:
2023-04-24T20:38:42: Validating the distribution bundle...
Executing the pre validations
2023-04-24T20:38:42: Executing the pre validation upgrade script
2023-04-24T20:39:10: SHA256 check succeeded.
2023-04-24T20:39:10: Disk Space Availability check succeeded.
2023-04-24T20:39:10: Not a valid upgrade bundle for current version.
Cause

A direct upgrade from TCA 2.1.0 to TCA 2.3.0 is not a valid upgrade path. Versions cannot be skipped during an upgrade.

Solution

Skip-version upgrades are currently not supported for Telco Cloud Automation. Only appliances running version 2.2 can be upgraded to TCA 2.3, so the above message is expected behaviour. Upgrade the appliance to 2.2 first, and then to 2.3.
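
The upgrade path rule can be expressed as a small pre-check. This is only an illustrative sketch: the function name `can_upgrade_to_2_3` is hypothetical, and how the current version string is obtained is not covered by this article, so it is passed in as an argument.

```shell
# Sketch: check that the running version is a valid source for an upgrade to TCA 2.3.
# The version string is supplied by the operator; this helper does not query the appliance.
can_upgrade_to_2_3() {
    case "$1" in
        2.2|2.2.*) return 0 ;;   # 2.2.x to 2.3 is the supported path
        *)         return 1 ;;   # any other source version is rejected
    esac
}
```

For example, `can_upgrade_to_2_3 2.1.0 || echo "upgrade to 2.2 first"` prints the message, because 2.1.0 is not a supported source version.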


Section 2: com.mongodb.MongoSocketReadTimeoutException: Timeout while receiving message

Log snippet for reference:

upgrade.log


16:15:58 ERROR MigrationHelper: phase=migration, collection_name=Job, status=failed, total_rows_validated=0
com.mongodb.MongoSocketReadTimeoutException: Timeout while receiving message
        at com.mongodb.internal.connection.InternalStreamConnection.translateReadException(InternalStreamConnection.java:563)
        at com.mongodb.internal.connection.InternalStreamConnection.receiveMessage(InternalStreamConnection.java:448)
        at com.mongodb.internal.connection.InternalStreamConnection.receiveCommandMessageResponse(InternalStreamConnection.java:299)
        at com.mongodb.internal.connection.InternalStreamConnection.sendAndReceive(InternalStreamConnection.java:259)
        at com.mongodb.internal.connection.UsageTrackingInternalConnection.sendAndReceive(UsageTrackingInternalConnection.java:99)
        at com.mongodb.internal.connection.DefaultConnectionPool$PooledConnection.sendAndReceive(DefaultConnectionPool.java:450)
        at com.mongodb.internal.connection.CommandProtocolImpl.execute(CommandProtocolImpl.java:72)
        at com.mongodb.internal.connection.DefaultServer$DefaultServerProtocolExecutor.execute(DefaultServer.java:226)
        at com.mongodb.internal.connection.DefaultServerConnection.executeProtocol(DefaultServerConnection.java:269)
        at com.mongodb.internal.connection.DefaultServerConnection.command(DefaultServerConnection.java:131)
        at com.mongodb.internal.connection.DefaultServerConnection.command(DefaultServerConnection.java:123)
        at com.mongodb.operation.CommandOperationHelper.executeCommand(CommandOperationHelper.java:343)
        at com.mongodb.operation.CommandOperationHelper.executeCommand(CommandOperationHelper.java:334)
        at com.mongodb.operation.CommandOperationHelper.executeCommandWithConnection(CommandOperationHelper.java:220)
        at com.mongodb.operation.FindOperation$1.call(FindOperation.java:731)
        at com.mongodb.operation.FindOperation$1.call(FindOperation.java:725)
        at com.mongodb.operation.OperationHelper.withReadConnectionSource(OperationHelper.java:463)
        at com.mongodb.operation.FindOperation.execute(FindOperation.java:725)
        at com.mongodb.operation.FindOperation.execute(FindOperation.java:89)
        at com.mongodb.client.internal.MongoClientDelegate$DelegateOperationExecutor.execute(MongoClientDelegate.java:196)
        at com.mongodb.client.internal.MongoIterableImpl.execute(MongoIterableImpl.java:143)
        at com.mongodb.client.internal.MongoIterableImpl.iterator(MongoIterableImpl.java:92)
        at

Cause

During migration, records are fetched from MongoDB in batches of 500 per call. In some exceptional cases, fetching 500 records can take longer than the default timeout, either because of the size of the data or because TCA is under heavy usage and MongoDB reads take longer than expected.

Solution

In this scenario, run the upgrade manually from the command line so that the migration parameters can be edited. Follow these steps:

  • Access the VM over SSH. Enter the password when prompted.
SSH

#command to access the vm, user can also ssh as root

ssh admin@<<IP of the VM>>
Password:
 
  • Once logged in to the VM, switch to the root user. Enter the root password when prompted. Skip this step if you logged in as root in the first step.
root access

# to access root privileges

su -
Password:
 
  • Go to the directory where you want to download the upgrade bundle, for example:
cd command

# access directory to run upgrade

cd /tmp
  • Download the VMware Telco Cloud Automation upgrade bundle tarball from VMware Customer Connect
  • Extract the upgrade bundle tarball with the following command:
Extract tar

#To extract the tarball

tar -xzf VMware-Telco-Cloud-Automation-upgrade-bundle-2.3.0-21563123.tar.gz
  • Locate doMigration.sh in the extracted upgrade bundle folder and find the following parameters (lines 196 and 197):

doMigration.sh

#Records per single batch/call, use record size <500

MONGODB_FETCH_SIZE=500
 
# Mongo socket timeout in seconds

MONGODB_SOCKET_TIMEOUT=120

  • Increase MONGODB_SOCKET_TIMEOUT and decrease MONGODB_FETCH_SIZE.

  • Run the upgrade with the following command:
running upgrade

./vsm-upgrade.sh image/VMware-Telco-Cloud-Automation-image-2.3.0-21563123.img.dist
  • Upgrade logs are written to the "/common/logs/upgrade" directory and can be checked with the following command:
Check logs

#To check the upgrade logs

cat /common/logs/upgrade/upgrade.log

NOTE: If the error persists, increase the MongoDB timeout and decrease the fetch size further, then run the upgrade again with the same commands.
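
The parameter edit in doMigration.sh can also be scripted. The sketch below is one possible way to do it, under stated assumptions: the function name `tune_migration_params` is hypothetical, each parameter is assumed to be assigned at the start of a line as in the excerpt above, and the values in the usage example are illustrative, not recommended settings.

```shell
# Sketch: set the two tuning parameters in doMigration.sh and keep a backup copy.
# Assumes lines of the form NAME=value starting in the first column.
tune_migration_params() {
    local script="$1" fetch_size="$2" socket_timeout="$3"

    # Keep the original next to the script, preserving mode and timestamps.
    cp -p "$script" "${script}.bak" || return 1

    # Read from the backup and overwrite the script so its permissions are kept.
    sed -e "s/^MONGODB_FETCH_SIZE=.*/MONGODB_FETCH_SIZE=${fetch_size}/" \
        -e "s/^MONGODB_SOCKET_TIMEOUT=.*/MONGODB_SOCKET_TIMEOUT=${socket_timeout}/" \
        "${script}.bak" > "$script" || return 1

    # Print the resulting assignments for confirmation.
    grep -E '^MONGODB_(FETCH_SIZE|SOCKET_TIMEOUT)=' "$script"
}
```

For example, `tune_migration_params path/to/doMigration.sh 250 240` halves the fetch size and doubles the timeout. If the error persists, the function can be called again with a smaller fetch size and a larger timeout.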

Section 3: "exception.MigrationException: 10007: Error migrating a collection"

Log snippet for reference:
16:15:58 INFO  MigrationHelper: ==============-END==============collection_name=Job, execution_time_in_ms=30076
16:15:58 ERROR MigrationHelper: phase=migration, collection_name=Job, status=failure msg=aborting the migration process
16:15:58 ERROR MigrationManager: Error migrating collection: Job
com.vmware.vchs.hybridity.migration.exception.MigrationException: 10007: Error in migration a collection from the 'fatal list': aborting the migration flow
        at com.vmware.vchs.hybridity.migration.MigrationHelper.migrateDataForGivenMongoCollectionAndValidate(MigrationHelper.java:174)
        at com.vmware.vchs.hybridity.migration.MigrationManager.executeMigrationAndValidation(MigrationManager.java:108)
        at com.vmware.vchs.hybridity.migration.MigrationManager.doMigration(MigrationManager.java:69)
        at com.vmware.vchs.hybridity.migration.MigrationManager.main(MigrationManager.java:46)
Exception in thread "main" com.mongodb.MongoSocketReadTimeoutException: Timeout while receiving message

Cause

The migration fails with this error in the upgrade logs when it fails for a collection that is on the list of mandatory migration tables, the 'fatal list'. If migration fails for any collection on this list (see the example below), the entire migration process exits with the exception. In the above example, migration of the collection "Job" failed, and "Job" is on the fatal list. This failure is sometimes caused by a temporary issue such as a timeout.
 

Fatal list example
AlarmInfo
AlarmNameToIdMapping
ApplianceConfig
CSARSchemaVersionMatrix
CatalogConfig
ClusterComputeResource
CnfAlarmDefinitions
CnfInfraRequirementsParams
CnfInventory
ComputeProfile
ComputeResource
DarkLaunchServices
Datacenter
Datastore
ExtendedNetworks
Folder
HIClient
HIEntityRelations
HIRequestedData
HIWhiteListData
HostSystem
IntentWorkflowMapping
Job
.......

Solution

Restart the upgrade with the same upgrade bundle (see "Section 8: Performing upgrade manually from the command line" in this document). If the previous run failed because of a timeout or connectivity issue, the error is unlikely to reappear.

If the migration fails again, check the upgrade log with the following command. If it failed on the same collection, open a technical support ticket and attach the Tech support bundle, which includes a database dump, for further troubleshooting.
 

#To check the upgrade logs

cat /common/logs/upgrade/upgrade.log
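
To see which collection failed without reading the whole log, the collection names can be extracted from the error lines. This is a minimal sketch with a hypothetical function name, `failed_collections`. It relies only on the "Error migrating collection: <name>" line format shown in the log snippet above.

```shell
# Sketch: list the collections that the migration reported as failed.
# Prints each collection name once, sorted.
failed_collections() {
    local log="${1:-/common/logs/upgrade/upgrade.log}"
    sed -n 's/.*Error migrating collection: //p' "$log" | sort -u
}
```

Running `failed_collections` after each attempt and comparing the output shows whether the retry failed on the same collection.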

Section 4: Migration failed and upgrade continued
 

Cause

This occurs when the migration fails but the failure is not properly communicated to the upgrade process, so the upgrade continues instead of exiting. After an upgrade in this scenario, the application cannot come up properly because its data is incomplete.

Note: This error was seen only a couple of times and a fix has already been incorporated.

Solution

Use the following steps to debug and unblock.

First, check the upgrade logs to confirm that the migration failed and to find the reason for the failure:

cat /common/logs/upgrade/upgrade.log
.......
17:29:04 INFO  PostgresUtil: ##################### Postgres DB Stats Summary #############################
17:29:04 INFO  PostgresUtil: phase=validation, PostgreSQL migrated data size 20 MB
17:29:04 INFO  MigrationManager:
17:29:04 INFO  MigrationManager: #####################################################################
17:29:04 INFO  MigrationManager: #####################################################################
17:29:04 INFO  MigrationManager: ##################### Execution Summary #############################
17:29:04 INFO  MigrationManager: migration_status=FAILURE, total_execution_time_in_sec=12, error=10007: Error in migration a collection from the 'fatal list': aborting the migration flow
17:29:04 INFO  MigrationManager: #####################################################################
17:29:04 INFO  MigrationManager: #####################################################################
Migration script exit status: 0

Once the failed migration is confirmed, look for the exception or reason for the failure, and use the other sections of this document to debug based on that exception or error. Note that in the above log the Execution Summary reports migration_status=FAILURE even though the migration script exit status is 0. The previous state of the system can be restored with the backup and restore script. If no backup is available, a custom script will be provided to restore the system to its previous state.
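
Because the script exit status can be 0 even when the migration failed, the Execution Summary line is the more reliable indicator. The following sketch, with the hypothetical function name `migration_failed`, reads the last `migration_status=` line from the log. It checks only for the FAILURE value, which is the only value shown in this article.

```shell
# Sketch: decide from the Execution Summary whether the last migration run failed.
# Prints the last migration_status line and returns 0 only when it reports FAILURE.
migration_failed() {
    local log="${1:-/common/logs/upgrade/upgrade.log}"
    local line

    # Take the most recent summary line; sed exits 0 even when nothing matches.
    line="$(sed -n '/migration_status=/p' "$log" | tail -n 1)"
    if [ -n "$line" ]; then
        echo "$line"
    fi

    case "$line" in
        *migration_status=FAILURE*) return 0 ;;
        *)                          return 1 ;;
    esac
}
```

For example, `migration_failed && echo "do not continue, restore the previous state"` prints the summary line and the warning when the last run reported a failure.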

Section 5: MigrationException: 10001: Unable to initialize a connection to PostgreSQL
 

Cause

The migration service is not able to make a successful connection with Postgres. 

Solution

This can happen for various reasons. To debug this issue, start by checking whether the Postgres service is up and running:
 

#check status of postgres

systemctl status postgres

If the service is active and running, the failure may be a one-time problem caused by a network or connectivity issue. In this case, restart the upgrade with the same upgrade bundle.

If the Postgres service is down, restart the service and verify that it comes up. Once Postgres is running, restart the upgrade with the same upgrade bundle.
 

[root@ATCA /opt/vmware]# systemctl status postgres    
* postgres.service - Postgres
     Loaded: loaded (/etc/systemd/system/postgres.service; enabled; vendor preset: disabled)
     Active: active (exited) since Thu 2023-01-05 08:52:57 UTC; 1 week 5 days ago
   Main PID: 6859 (code=exited, status=0/SUCCESS)
      Tasks: 2 (limit: 2385)
     Memory: 45.3M
     CGroup: /system.slice/postgres.service
             |-   7485 /bin/bash /etc/systemd/postgres-port-forward.sh tca-mgr
             `-3080893 sleep 1

If Postgres fails to start, check whether Minikube is running. Restart Minikube only if it is down.
 

#check minikube status

systemctl status minikube
 
#restart minikube

systemctl restart minikube

 

If the Minikube is up, check the status of Postgres using the following commands.
 

# To check the status of the Postgres pods, log in to TCA using SSH and run the following commands in the shell.

export KUBECONFIG=/home/admin/.kube/config
kubectl get pods -n {namespace}

# namespace is tca-mgr for TCA Manager and tca-system for TCA Control Plane.
  
----------------------------------------------
For example:
[admin@tca-mgr~]$ kubectl get pods -n tca-mgr
 NAME                         READY   STATUS             RESTARTS        AGE
 postgresql-ha-postgresql-0   0/1     CrashLoopBackOff   12 (2m9s ago)   24d # postgres pod is in crashing state

If the Postgres pod is not present, or is in a crash state and fails to start, follow KB 92228 to run forceRestartPostgres.sh as a workaround to bring Postgres up.

Once the Postgres pod is up, restart the upgrade with the same upgrade bundle. To run the upgrade again, see "Section 8: Performing upgrade manually from the command line" in this document or the "Upgrade VMware Telco Cloud Automation Using the Upgrade Bundle" document.
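
The pod check can be reduced to a filter over the `kubectl get pods` output. This is a rough sketch with a hypothetical function name, `unhealthy_postgres_pods`. The "postgresql" name prefix is taken from the example output above and may differ in other deployments.

```shell
# Sketch: list Postgres pods that are not Running or not fully ready.
# Reads `kubectl get pods` output on stdin and prints NAME, READY and STATUS.
unhealthy_postgres_pods() {
    awk 'NR > 1 && $1 ~ /^postgresql/ {
        split($2, ready, "/")            # READY column, e.g. "0/1"
        if ($3 != "Running" || ready[1] != ready[2])
            print $1, $2, $3
    }'
}
```

A typical call is `kubectl get pods -n tca-mgr | unhealthy_postgres_pods`. Empty output means that no unhealthy Postgres pod was listed. It does not confirm that a Postgres pod exists, so the full pod list should still be checked when the pod may be missing.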

Section 6: com.mongodb.MongoGridFSException: No file found with the id: BsonObjectId{value=63cc20e0f245924b7190a911}

Cause

This error occurs when a scheduled activity clears temporary files from objectstore while the migration is running, which leads to a mismatch: the migration tries to read a file that no longer exists.

Log snippet for reference  

17:29:00 INFO  MigrationHelper: ==============START==============collection_name=objectstore.files
17:29:00 INFO  MigrationHelper: phase=migration, collection_name=objectstore.files, status=processing_start
17:29:04 INFO  MigrationHelper: phase=migration, collection_name=objectstore.files, msg=reading_from_mongodb, record_read=19, execution_time_in_ms=3951
17:29:04 INFO  MigrationHelper: phase=migration, msg=writing_into_postgres execution_time_in_ms=66
17:29:04 ERROR MigrationHelper: phase=migration, collection_name=objectstore.files, status=failed, total_rows_validated=0
com.mongodb.MongoGridFSException: No file found with the id: BsonObjectId{value=63cc20e0f245924b7190a911}
    at com.mongodb.client.gridfs.GridFSBucketImpl.getFileInfoById(GridFSBucketImpl.java:587)
    at com.mongodb.client.gridfs.GridFSBucketImpl.openDownloadStream(GridFSBucketImpl.java:272)
    at com.mongodb.client.gridfs.GridFSBucketImpl.openDownloadStream(GridFSBucketImpl.java:267)
    at com.vmware.vchs.hybridity.migration.MigrationHelper.migrateObjectstoreFilesFromMongoDBToPostgres(MigrationHelper.java:279)
    at com.vmware.vchs.hybridity.migration.MigrationHelper.migrateDataForGivenMongoCollectionAndValidate(MigrationHelper.java:117)
    at com.vmware.vchs.hybridity.migration.MigrationManager.executeMigrationAndValidation(MigrationManager.java:108)
    at com.vmware.vchs.hybridity.migration.MigrationManager.doMigration(MigrationManager.java:69)
    at com.vmware.vchs.hybridity.migration.MigrationManager.main(MigrationManager.java:46)
17:29:04 INFO  MigrationHelper: ==============-END==============collection_name=objectstore.files, execution_time_in_ms=4037
17:29:04 ERROR MigrationHelper: phase=migration, collection_name=objectstore.files, status=failure msg=aborting the migration process
17:29:04 ERROR MigrationManager: Error migrating collection: objectstore.files
com.vmware.vchs.hybridity.migration.exception.MigrationException: 10007: Error in migration a collection from the 'fatal list': aborting the migration flow
    at com.vmware.vchs.hybridity.migration.MigrationHelper.migrateDataForGivenMongoCollectionAndValidate(MigrationHelper.java:174)
    at com.vmware.vchs.hybridity.migration.MigrationManager.executeMigrationAndValidation(MigrationManager.java:108)
    at com.vmware.vchs.hybridity.migration.MigrationManager.doMigration(MigrationManager.java:69)
    at com.vmware.vchs.hybridity.migration.MigrationManager.main(MigrationManager.java:46)
17:29:04 INFO  MongoDBHelper: phase=validation, MongoDB dbStats = {"db": "hybridity", "collections": 300, "views": 0, "objects": 30442, "avgObjSize": 1451.0160633335524, "dataSize": 42.1255407333374, "storageSize": 36.203125, "numExtents": 0, "indexes": 679, "indexSize": 12.02734375, "ok": 1.0}


Solution

The suggested approach for this error is to retry. To run the upgrade again, see "Section 8: Performing upgrade manually from the command line" in this document or the "Upgrade VMware Telco Cloud Automation Using the Upgrade Bundle" document.

Section 7: Resolving Unknown errors

- The upgrade fails at a later stage after a successful migration, and when retried it fails again at migration.

Cause

During an upgrade, it is possible for the upgrade to fail with an unknown error after the MongoDB to Postgres migration has completed. In such scenarios, retrying the upgrade fails because of the "/common/pgsql/passwords/" directory that was created as part of the migration.

Solution

Manually delete the passwords directory "/common/pgsql/passwords/", and then retry the upgrade with the same bundle (see "Section 8: Performing upgrade manually from the command line" in this document).
 

delete password directory

# access the following path

cd /common/pgsql/
 
# delete the password directory

rm -rf passwords/
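
The same cleanup can be written with an absolute path and an existence check, so that `rm -rf` never runs in the wrong directory if the `cd` fails. This is only a sketch: the function name `remove_pgsql_passwords` is hypothetical, and the default base directory is the one given above.

```shell
# Sketch: remove the passwords directory left behind by an earlier migration run.
# Only <base>/passwords is touched; nothing else under the base directory is removed.
remove_pgsql_passwords() {
    local base="${1:-/common/pgsql}"
    local target="${base}/passwords"

    if [ -d "$target" ]; then
        rm -rf -- "$target" && echo "removed ${target}"
    else
        echo "nothing to remove: ${target} not found"
    fi
}
```

Called without arguments as root, `remove_pgsql_passwords` acts on /common/pgsql/passwords. The upgrade can then be retried with the same bundle.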

Section 8: Performing upgrade manually from the command line 
 

This section explains how to run the upgrade manually from the command line, in case there is a problem with the UI-based upgrade.

  • Access the VM over SSH. Enter the password when prompted.
SSH

#command to access the vm, user can also ssh as root

ssh admin@<<IP of the VM>>
Password:
 
  • Once logged in to the VM, switch to the root user. Enter the root password when prompted. Skip this step if you logged in as root in the first step.
root access

# to access root privileges

su -
Password:
 
  • Go to the directory where you want to download the upgrade bundle, for example:
cd command

# access directory to run upgrade

cd /tmp
  • Download the VMware Telco Cloud Automation upgrade bundle tarball from VMware Customer Connect
  • Extract the upgrade bundle tarball with the following command:
Extract tar

#To extract the tarball

tar -xzf VMware-Telco-Cloud-Automation-upgrade-bundle-2.3.0-21563123.tar.gz
  • Run the upgrade with the following command:
Execute upgrade:

./vsm-upgrade.sh image/VMware-Telco-Cloud-Automation-image-2.3.0-21563123.img.dist
  • Upgrade logs are written to the "/common/logs/upgrade" directory and can be checked with the following command:
Check logs

#To check the upgrade logs

cat /common/logs/upgrade/upgrade.log