As part of the upgrade from Telco Cloud Automation (TCA) 2.2 to 2.3, there is a database migration from MongoDB to PostgreSQL. This KB serves as a reference that lists all known errors and solutions for issues that may occur during the migration.
Symptoms:
The errors observed are listed below and apply to TCA versions 2.2 and 2.3. Each section details the error observed, its known cause, and the solution.
Log snippet for reference:
2023-04-24T20:38:42: Validating the distribution bundle... Executing the pre validations
2023-04-24T20:38:42: Executing the pre validation upgrade script
2023-04-24T20:39:10: SHA256 check succeeded.
2023-04-24T20:39:10: Disk Space Availability check succeeded.
2023-04-24T20:39:10: Not a valid upgrade bundle for current version.
Cause
The TCA 2.1.0 -> TCA 2.3.0 upgrade path is not valid; users cannot skip a version during the upgrade.
Solution
Skip-version upgrades are currently not supported for Telco Cloud Automation. Only appliances running v2.2 can be upgraded to TCA 2.3, so the above message is expected behaviour.
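To confirm that a failed pre-validation is due to an unsupported upgrade path, you can search the upgrade log for the validation message (this uses the standard /common/logs/upgrade/upgrade.log location referenced throughout this KB):
grep "Not a valid upgrade bundle" /common/logs/upgrade/upgrade.log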
Log snippet for reference:
upgrade.log
16:15:58 ERROR MigrationHelper: phase=migration, collection_name=Job, status=failed, total_rows_validated=0
com.mongodb.MongoSocketReadTimeoutException: Timeout while receiving message
    at com.mongodb.internal.connection.InternalStreamConnection.translateReadException(InternalStreamConnection.java:563)
    at com.mongodb.internal.connection.InternalStreamConnection.receiveMessage(InternalStreamConnection.java:448)
    at com.mongodb.internal.connection.InternalStreamConnection.receiveCommandMessageResponse(InternalStreamConnection.java:299)
    at com.mongodb.internal.connection.InternalStreamConnection.sendAndReceive(InternalStreamConnection.java:259)
    at com.mongodb.internal.connection.UsageTrackingInternalConnection.sendAndReceive(UsageTrackingInternalConnection.java:99)
    at com.mongodb.internal.connection.DefaultConnectionPool$PooledConnection.sendAndReceive(DefaultConnectionPool.java:450)
    at com.mongodb.internal.connection.CommandProtocolImpl.execute(CommandProtocolImpl.java:72)
    at com.mongodb.internal.connection.DefaultServer$DefaultServerProtocolExecutor.execute(DefaultServer.java:226)
    at com.mongodb.internal.connection.DefaultServerConnection.executeProtocol(DefaultServerConnection.java:269)
    at com.mongodb.internal.connection.DefaultServerConnection.command(DefaultServerConnection.java:131)
    at com.mongodb.internal.connection.DefaultServerConnection.command(DefaultServerConnection.java:123)
    at com.mongodb.operation.CommandOperationHelper.executeCommand(CommandOperationHelper.java:343)
    at com.mongodb.operation.CommandOperationHelper.executeCommand(CommandOperationHelper.java:334)
    at com.mongodb.operation.CommandOperationHelper.executeCommandWithConnection(CommandOperationHelper.java:220)
    at com.mongodb.operation.FindOperation$1.call(FindOperation.java:731)
    at com.mongodb.operation.FindOperation$1.call(FindOperation.java:725)
    at com.mongodb.operation.OperationHelper.withReadConnectionSource(OperationHelper.java:463)
    at com.mongodb.operation.FindOperation.execute(FindOperation.java:725)
    at com.mongodb.operation.FindOperation.execute(FindOperation.java:89)
    at com.mongodb.client.internal.MongoClientDelegate$DelegateOperationExecutor.execute(MongoClientDelegate.java:196)
    at com.mongodb.client.internal.MongoIterableImpl.execute(MongoIterableImpl.java:143)
    at com.mongodb.client.internal.MongoIterableImpl.iterator(MongoIterableImpl.java:92) at
Cause
During migration, 500 records are fetched per batch/call. In some exceptional cases it can take longer than the default timeout to fetch 500 records, either because of the size of the data or because TCA is under heavy usage, causing MongoDB reads to take longer than expected.
Solution
In this scenario, the upgrade needs to be run manually so that the migration parameters can be edited. Follow the steps below to run the upgrade manually from the command line:
ssh admin@<<IP of the VM>>
su -
cd /tmp
tar -xzf VMware-Telco-Cloud-Automation-upgrade-bundle-2.3.0-21563123.tar.gz
Open the doMigration.sh script with a text editor, then decrease the MONGODB_FETCH_SIZE value and/or increase the MongoDB timeout (see the illustrative example after the note below).
./vsm-upgrade.sh image/VMware-Telco-Cloud-Automation-image-2.3.0-21563123.img.dist
The upgrade logs are written to the /common/logs/upgrade directory and can be reviewed with the following command:
cat /common/logs/upgrade/upgrade.log
NOTE: If the error still persists, increase the MongoDB timeout and decrease the fetch size further using the same steps above.
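For reference, an illustrative sketch of the edit to doMigration.sh is shown below. Only MONGODB_FETCH_SIZE is confirmed in this KB; the timeout variable name and both values are placeholders, so verify the actual variable names in your copy of the script before changing anything.
# Illustrative values only -- confirm the real variable names in doMigration.sh
MONGODB_FETCH_SIZE=100            # lower than the default batch of 500 records
MONGODB_READ_TIMEOUT_MS=120000    # placeholder name for the MongoDB read timeout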
Log snippet for reference:
16:15:58 INFO MigrationHelper: ==============-END==============collection_name=Job, execution_time_in_ms=30076
16:15:58 ERROR MigrationHelper: phase=migration, collection_name=Job, status=failure msg=aborting the migration process
16:15:58 ERROR MigrationManager: Error migrating collection: Job
com.vmware.vchs.hybridity.migration.exception.MigrationException: 10007: Error in migration a collection from the 'fatal list': aborting the migration flow
    at com.vmware.vchs.hybridity.migration.MigrationHelper.migrateDataForGivenMongoCollectionAndValidate(MigrationHelper.java:174)
    at com.vmware.vchs.hybridity.migration.MigrationManager.executeMigrationAndValidation(MigrationManager.java:108)
    at com.vmware.vchs.hybridity.migration.MigrationManager.doMigration(MigrationManager.java:69)
    at com.vmware.vchs.hybridity.migration.MigrationManager.main(MigrationManager.java:46)
Exception in thread "main" com.mongodb.MongoSocketReadTimeoutException: Timeout while receiving message
Cause
Migration fails with this error in the upgrade logs when migration fails for a collection that is on the list of mandatory migration tables, also called the fatal list. In the above example, migrating the collection "Job" failed and "Job" is on the fatal list. This list (see below) contains the collections for which a migration failure causes the entire migration process to exit with an exception. This failure is sometimes caused by a temporary issue, for example a timeout.
Fatal list example:
AlarmInfo, AlarmNameToIdMapping, ApplianceConfig, CSARSchemaVersionMatrix, CatalogConfig, ClusterComputeResource, CnfAlarmDefinitions, CnfInfraRequirementsParams, CnfInventory, ComputeProfile, ComputeResource, DarkLaunchServices, Datacenter, Datastore, ExtendedNetworks, Folder, HIClient, HIEntityRelations, HIRequestedData, HIWhiteListData, HostSystem, IntentWorkflowMapping, Job, .......
Solution
Review the upgrade log to find the exception that caused the fatal-list collection to fail:
cat /common/logs/upgrade/upgrade.log
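To quickly identify which fatal-list collection failed and why, you can also filter the log for the migration error lines (a plain grep against the standard log location):
grep -E "ERROR (MigrationHelper|MigrationManager)" /common/logs/upgrade/upgrade.log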
Cause
This happens when the migration fails but the failure status is not properly communicated back to the upgrade process, so the upgrade does not exit. If the upgrade continues in this scenario, the application will not come up properly because of incomplete data.
Note: This error was seen only a couple of times and a fix has already been incorporated.
Solution
Check the upgrade log to confirm whether the migration actually succeeded:
cat /common/logs/upgrade/upgrade.log
.......
7:29:04 INFO PostgresUtil: ##################### Postgres DB Stats Summary #############################
17:29:04 INFO PostgresUtil: phase=validation, PostgreSQL migrated data size 20 MB
17:29:04 INFO MigrationManager:
17:29:04 INFO MigrationManager: #####################################################################
17:29:04 INFO MigrationManager: #####################################################################
17:29:04 INFO MigrationManager: ##################### Execution Summary #############################
17:29:04 INFO MigrationManager: migration_status=FAILURE, total_execution_time_in_sec=12, error=10007: Error in migration a collection from the 'fatal list': aborting the migration flow
17:29:04 INFO MigrationManager: #####################################################################
17:29:04 INFO MigrationManager: #####################################################################
Migration script exit status: 0
Once you have confirmed that the migration failed, look for the exception/reason for the failure. Check the other cases listed on this page to debug based on the exceptions/errors. The previous state of the system can be restored with the backup and restore script. If a backup is not available, a custom script will be provided to restore the system to its previous state.
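Because the script exit status (0) can disagree with the real outcome, check the Execution Summary in the log rather than relying on the exit code; a simple filter such as the following (standard log location assumed) shows the final status:
grep "migration_status" /common/logs/upgrade/upgrade.log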
Cause
The migration service is not able to make a successful connection with Postgres.
Solution
Check that the Postgres service is running on the appliance:
systemctl status postgres
[root@ATCA /opt/vmware]# systemctl status postgres
* postgres.service - Postgres
   Loaded: loaded (/etc/systemd/system/postgres.service; enabled; vendor preset: disabled)
   Active: active (exited) since Thu 2023-01-05 08:52:57 UTC; 1 week 5 days ago
 Main PID: 6859 (code=exited, status=0/SUCCESS)
    Tasks: 2 (limit: 2385)
   Memory: 45.3M
   CGroup: /system.slice/postgres.service
           |- 7485 /bin/bash /etc/systemd/postgres-port-forward.sh tca-mgr
           `-3080893 sleep 1
Check the status of the minikube service and restart it if it is not running:
systemctl status minikube
systemctl restart minikube
ssh admin@<<IP of the VM>>
export KUBECONFIG=/home/admin/.kube/config
kubectl get pods -n {namespace}
NOTE: {namespace} is tca-mgr for TCA Manager and tca-system for TCA Control Plane.
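To look at just the Postgres pod in the TCA Manager namespace, you can filter the pod list; the assumption that the pod name contains "postgres" is illustrative, so adjust the filter to your environment:
kubectl get pods -n tca-mgr | grep -i postgres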
Run the forceRestartPostgres.sh script as a workaround to ensure Postgres is up and running.
Cause
This error happens when a scheduled activity that clears temporary files from the objectstore runs while the migration is in progress, which leads to this mismatch.
Log snippet for reference:
17:29:00 INFO MigrationHelper: ==============START==============collection_name=objectstore.files
17:29:00 INFO MigrationHelper: phase=migration, collection_name=objectstore.files, status=processing_start
17:29:04 INFO MigrationHelper: phase=migration, collection_name=objectstore.files, msg=reading_from_mongodb, record_read=19, execution_time_in_ms=3951
17:29:04 INFO MigrationHelper: phase=migration, msg=writing_into_postgres execution_time_in_ms=66
17:29:04 ERROR MigrationHelper: phase=migration, collection_name=objectstore.files, status=failed, total_rows_validated=0
com.mongodb.MongoGridFSException: No file found with the id: BsonObjectId{value=########}
    at com.mongodb.client.gridfs.GridFSBucketImpl.getFileInfoById(GridFSBucketImpl.java:587)
    at com.mongodb.client.gridfs.GridFSBucketImpl.openDownloadStream(GridFSBucketImpl.java:272)
    at com.mongodb.client.gridfs.GridFSBucketImpl.openDownloadStream(GridFSBucketImpl.java:267)
    at com.vmware.vchs.hybridity.migration.MigrationHelper.migrateObjectstoreFilesFromMongoDBToPostgres(MigrationHelper.java:279)
    at com.vmware.vchs.hybridity.migration.MigrationHelper.migrateDataForGivenMongoCollectionAndValidate(MigrationHelper.java:117)
    at com.vmware.vchs.hybridity.migration.MigrationManager.executeMigrationAndValidation(MigrationManager.java:108)
    at com.vmware.vchs.hybridity.migration.MigrationManager.doMigration(MigrationManager.java:69)
    at com.vmware.vchs.hybridity.migration.MigrationManager.main(MigrationManager.java:46)
17:29:04 INFO MigrationHelper: ==============-END==============collection_name=objectstore.files, execution_time_in_ms=4037
17:29:04 ERROR MigrationHelper: phase=migration, collection_name=objectstore.files, status=failure msg=aborting the migration process
17:29:04 ERROR MigrationManager: Error migrating collection: objectstore.files
com.vmware.vchs.hybridity.migration.exception.MigrationException: 10007: Error in migration a collection from the 'fatal list': aborting the migration flow
    at com.vmware.vchs.hybridity.migration.MigrationHelper.migrateDataForGivenMongoCollectionAndValidate(MigrationHelper.java:174)
    at com.vmware.vchs.hybridity.migration.MigrationManager.executeMigrationAndValidation(MigrationManager.java:108)
    at com.vmware.vchs.hybridity.migration.MigrationManager.doMigration(MigrationManager.java:69)
    at com.vmware.vchs.hybridity.migration.MigrationManager.main(MigrationManager.java:46)
17:29:04 INFO MongoDBHelper: phase=validation, MongoDB dbStats = {"db": "hybridity", "collections": 300, "views": 0, "objects": 30442, "avgObjSize": 1451.0160633335524, "dataSize": 42.1255407333374, "storageSize": 36.203125, "numExtents": 0, "indexes": 679, "indexSize": 12.02734375, "ok": 1.0}
Solution
The suggested approach for this error is to retry the upgrade as per the Performing Manual Upgrade using CLI section of this KB.
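Before retrying, you can confirm that this GridFS mismatch is what aborted the migration by searching the upgrade log for the exception (standard log location assumed):
grep "MongoGridFSException" /common/logs/upgrade/upgrade.log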
Cause
During the upgrade of VMware Telco Cloud Automation from version 2.2 to 2.3, if the upgrade fails during the MongoDB to Postgres migration and the user then restarts services or the entire TCA virtual machine, the Appliance Management UI will remain stuck showing the upgrade status message "migrating data from mongo to postgres".
Solution
ssh admin@<<IP of the VM>>
su -
cd /common/logs/upgrade
rm upgrade-status.properties
systemctl restart appliance-management
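As an optional check (a plain directory listing, nothing TCA-specific), verify that the stale status file is gone before refreshing the Appliance Management UI:
ls /common/logs/upgrade/
# upgrade-status.properties should no longer be listed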
The upgrade fails at a later stage after a successful migration and, when retried, fails again at the migration step.
Cause
During the upgrade, after the MongoDB to Postgres migration completes, it is possible that the upgrade fails later due to an unrelated error. In such scenarios, retrying the upgrade fails at the migration step because of the /common/pgsql/passwords/ directory created as part of the first migration attempt.
Solution
Remove the passwords directory and then retry the upgrade:
cd /common/pgsql/
rm -rf passwords/
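Optionally, confirm the directory has been removed before retrying the upgrade (a plain directory listing):
ls /common/pgsql/
# passwords/ should no longer appear in the listing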
Performing Manual Upgrade using CLI
This section explains the process of running the upgrade manually from the command line in case the user experiences problems with the UI-based upgrade.
ssh admin@<<IP of the VM>>
su -
cd /tmp
tar -xzf VMware-Telco-Cloud-Automation-upgrade-bundle-2.3.0-21563123.tar.gz
./vsm-upgrade.sh image/VMware-Telco-Cloud-Automation-image-2.3.0-21563123.img.dist
cat /common/logs/upgrade/upgrade.log
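To monitor the progress of the manual upgrade while it runs, you can also follow the log live (standard tail usage, not TCA-specific):
tail -f /common/logs/upgrade/upgrade.log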