Troubleshooting guide for MongoDB to PostgreSQL migrations

Article ID: 345730


Products

VMware Telco Cloud Automation

Issue/Introduction

As part of the upgrade from Telco Cloud Automation (TCA) 2.2 to 2.3, the database is migrated from MongoDB to PostgreSQL. This KB serves as a reference that lists all known errors and solutions for issues that may occur during the migration.

Symptoms:
One of the following errors is observed:

  1. Not a valid upgrade bundle for current version
  2. com.mongodb.MongoSocketReadTimeoutException: Timeout while receiving the message
  3. "exception.MigrationException: 10007: Error migrating a collection"
  4. Migration failed and upgrade continued
  5. MigrationException: 10001: Unable to initialize a connection to PostgreSQL
  6. com.mongodb.MongoGridFSException: No file found with the id: BsonObjectId
  7. Upgrade status is stuck in Running state saying migrating data from mongo to postgres.
  8. Unknown error reported during Upgrade
  9. Performing Manual Upgrade using CLI  

Environment

2.2, 2.3

Cause

Each section details the cause, where known.

Resolution

Section 1: Not a valid upgrade bundle for current version

Log snippet for reference:

2023-04-24T20:38:42: Validating the distribution bundle...
Executing the pre validations
2023-04-24T20:38:42: Executing the pre validation upgrade script
2023-04-24T20:39:10: SHA256 check succeeded.
2023-04-24T20:39:10: Disk Space Availability check succeeded.
2023-04-24T20:39:10: Not a valid upgrade bundle for current version.

Cause

The TCA 2.1.0 -> TCA 2.3.0 upgrade path is not valid; users cannot skip a version during the upgrade.

Solution

Skip-version upgrades are currently not supported for Telco Cloud Automation. Only appliances running version 2.2 can be upgraded to TCA 2.3, so the message above is expected behavior.

Section 2: com.mongodb.MongoSocketReadTimeoutException: Timeout while receiving the message

Log snippet for reference:

upgrade.log


16:15:58 ERROR MigrationHelper: phase=migration, collection_name=Job, status=failed, total_rows_validated=0
com.mongodb.MongoSocketReadTimeoutException: Timeout while receiving message
        at com.mongodb.internal.connection.InternalStreamConnection.translateReadException(InternalStreamConnection.java:563)
        at com.mongodb.internal.connection.InternalStreamConnection.receiveMessage(InternalStreamConnection.java:448)
        at com.mongodb.internal.connection.InternalStreamConnection.receiveCommandMessageResponse(InternalStreamConnection.java:299)
        at com.mongodb.internal.connection.InternalStreamConnection.sendAndReceive(InternalStreamConnection.java:259)
        at com.mongodb.internal.connection.UsageTrackingInternalConnection.sendAndReceive(UsageTrackingInternalConnection.java:99)
        at com.mongodb.internal.connection.DefaultConnectionPool$PooledConnection.sendAndReceive(DefaultConnectionPool.java:450)
        at com.mongodb.internal.connection.CommandProtocolImpl.execute(CommandProtocolImpl.java:72)
        at com.mongodb.internal.connection.DefaultServer$DefaultServerProtocolExecutor.execute(DefaultServer.java:226)
        at com.mongodb.internal.connection.DefaultServerConnection.executeProtocol(DefaultServerConnection.java:269)
        at com.mongodb.internal.connection.DefaultServerConnection.command(DefaultServerConnection.java:131)
        at com.mongodb.internal.connection.DefaultServerConnection.command(DefaultServerConnection.java:123)
        at com.mongodb.operation.CommandOperationHelper.executeCommand(CommandOperationHelper.java:343)
        at com.mongodb.operation.CommandOperationHelper.executeCommand(CommandOperationHelper.java:334)
        at com.mongodb.operation.CommandOperationHelper.executeCommandWithConnection(CommandOperationHelper.java:220)
        at com.mongodb.operation.FindOperation$1.call(FindOperation.java:731)
        at com.mongodb.operation.FindOperation$1.call(FindOperation.java:725)
        at com.mongodb.operation.OperationHelper.withReadConnectionSource(OperationHelper.java:463)
        at com.mongodb.operation.FindOperation.execute(FindOperation.java:725)
        at com.mongodb.operation.FindOperation.execute(FindOperation.java:89)
        at com.mongodb.client.internal.MongoClientDelegate$DelegateOperationExecutor.execute(MongoClientDelegate.java:196)
        at com.mongodb.client.internal.MongoIterableImpl.execute(MongoIterableImpl.java:143)
        at com.mongodb.client.internal.MongoIterableImpl.iterator(MongoIterableImpl.java:92)
        at

Cause

During migration, records are fetched from MongoDB in batches of 500 per call. In some exceptional cases, fetching a batch can take longer than the default timeout, either because of the size of the data or because TCA is under heavy usage, causing MongoDB reads to take longer than expected.
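
To confirm this cause, it can help to see which collections are large enough to make a 500-record batch slow. The sketch below is illustrative only and is not part of the official procedure: it assumes the mongo shell can reach the TCA database (named "hybridity" in the dbStats output quoted later in this KB); how you reach MongoDB (directly on the appliance or from inside the MongoDB pod) depends on your deployment.
# Hedged sketch: print document count and average object size for every collection
mongo hybridity --quiet --eval 'db.getCollectionNames().forEach(function(c){var s=db.getCollection(c).stats();print(c+" count="+s.count+" avgObjSize="+s.avgObjSize);})'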

Solution

In this scenario, the upgrade needs to be run manually so that the parameters can be edited. Refer to the following steps to run the upgrade manually from the command line.

  • Access the VM from the command line with the ssh command. A password prompt follows; enter the admin password to log in.
ssh admin@<<IP of the VM>>
  • Once logged in to the VM, switch to root privileges; a prompt for the root password follows. This step can be skipped if the user from the first step is already root.
su -
  • Go to the /tmp directory to download the upgrade bundle.
cd /tmp
  • Download the VMware Telco Cloud Automation 2.3 upgrade bundle tarball from VMware Customer Connect
  • Extract the upgrade bundle tarball with the help of the following command 
tar -xzf VMware-Telco-Cloud-Automation-upgrade-bundle-2.3.0-21563123.tar.gz
  • Search for and open the doMigration.sh script with a text editor (a minimal editing sketch follows the note at the end of this section).
  • Decrease the MONGODB_FETCH_SIZE.
  • Increase the MONGODB_SOCKET_TIMEOUT (seconds).
  • Run the upgrade with the following command 
./vsm-upgrade.sh image/VMware-Telco-Cloud-Automation-image-2.3.0-21563123.img.dist
  • Upgrade logs are generated in the /common/logs/upgrade directory which can be reviewed with the following command
cat /common/logs/upgrade/upgrade.log 

NOTE: If the error still persists, increase the MongoDB timeout and decrease the fetch size further using the same steps above.
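
Below is a minimal sketch of locating and adjusting the two parameters named above. The assignment format and the example values are assumptions, not defaults taken from the KB; use the names and values present in your copy of doMigration.sh as the starting point.
# Hedged sketch: locate the script inside the extracted bundle
find /tmp -name doMigration.sh
# Assumed assignment format; lower the fetch size and raise the socket timeout, for example:
sed -i 's/^MONGODB_FETCH_SIZE=.*/MONGODB_FETCH_SIZE=100/' <path-to>/doMigration.sh
sed -i 's/^MONGODB_SOCKET_TIMEOUT=.*/MONGODB_SOCKET_TIMEOUT=300/' <path-to>/doMigration.sh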

Section 3: "exception.MigrationException: 10007: Error migrating a collection"

Log snippet for reference:

16:15:58 INFO  MigrationHelper: ==============-END==============collection_name=Job, execution_time_in_ms=30076
16:15:58 ERROR MigrationHelper: phase=migration, collection_name=Job, status=failure msg=aborting the migration process
16:15:58 ERROR MigrationManager: Error migrating collection: Job
com.vmware.vchs.hybridity.migration.exception.MigrationException: 10007: Error in migration a collection from the 'fatal list': aborting the migration flow
        at com.vmware.vchs.hybridity.migration.MigrationHelper.migrateDataForGivenMongoCollectionAndValidate(MigrationHelper.java:174)
        at com.vmware.vchs.hybridity.migration.MigrationManager.executeMigrationAndValidation(MigrationManager.java:108)
        at com.vmware.vchs.hybridity.migration.MigrationManager.doMigration(MigrationManager.java:69)
        at com.vmware.vchs.hybridity.migration.MigrationManager.main(MigrationManager.java:46)
Exception in thread "main" com.mongodb.MongoSocketReadTimeoutException: Timeout while receiving message

Cause

Migration fails with this error in the upgrade logs when migration fails for a collection that is listed in the mandatory migration tables, also called the fatal list. In the example above, migration of the "Job" collection failed because it is on the fatal list. If migration fails for any collection on this list (see below), the entire migration process exits with this exception. This failure is sometimes caused by a transient issue, for example a timeout.

Fatal list example
AlarmInfo
AlarmNameToIdMapping
ApplianceConfig
CSARSchemaVersionMatrix
CatalogConfig
ClusterComputeResource
CnfAlarmDefinitions
CnfInfraRequirementsParams
CnfInventory
ComputeProfile
ComputeResource
DarkLaunchServices
Datacenter
Datastore
ExtendedNetworks
Folder
HIClient
HIEntityRelations
HIRequestedData
HIWhiteListData
HostSystem
IntentWorkflowMapping
Job
.......

Solution

  • Restart the upgrade with the same upgrade bundle. Refer to the Performing Manual Upgrade using CLI section of this KB to run the upgrade again. If the previous run failed because of a timeout or connectivity issue, the error will likely not reappear.
  • If the migration fails again, check the upgrade log with the following command (a sketch for narrowing down the failing collection follows). If it fails again on the same collection, open a technical support ticket and attach a support bundle that includes a database dump for further troubleshooting.
cat /common/logs/upgrade/upgrade.log
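
To narrow down which collection failed without reading the whole log, the following is a minimal sketch; the patterns match the log lines quoted earlier in this section.
# Hedged sketch: show the most recent migration failures and the collections involved
grep -E "status=failed|Error migrating collection" /common/logs/upgrade/upgrade.log | tail -n 20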


Section 4: Migration failed and upgrade continued

Cause

This happens when the migration fails but the failure is not properly communicated back to the upgrade process, so the upgrade continues instead of exiting. If the upgrade completes in this state, the application will not come up properly because of incomplete data.

Note: This error was seen only a couple of times and a fix has already been incorporated.

Solution

Following are the steps to debug and unblock:

  • First, check the upgrade logs to confirm the failed migration and the reason for the failure (a sketch for extracting the summary follows this list).
cat /common/logs/upgrade/upgrade.log
.......
17:29:04 INFO PostgresUtil: ##################### Postgres DB Stats Summary #############################
17:29:04 INFO PostgresUtil: phase=validation, PostgreSQL migrated data size 20 MB
17:29:04 INFO MigrationManager:
17:29:04 INFO MigrationManager: #####################################################################
17:29:04 INFO MigrationManager: ##################### Execution Summary #############################
17:29:04 INFO MigrationManager: migration_status=FAILURE, total_execution_time_in_sec=12, error=10007: Error in migration a collection from the 'fatal list': aborting the migration flow
17:29:04 INFO MigrationManager: #####################################################################
Migration script exit status: 0
  • Once the failed migration is confirmed, look for the exception/reason for the failure. Check the other cases listed in this KB to debug based on the exception/error.
  • The previous state of the system can be restored with the backup and restore script. If a backup is not available, a custom script will be provided to restore the system to its previous state.
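
Since the full upgrade log can be long, a minimal sketch for pulling out just the Execution Summary block shown above:
# Hedged sketch: print the migration Execution Summary and the lines following it
grep -A 4 "Execution Summary" /common/logs/upgrade/upgrade.log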


Section 5: MigrationException: 10001: Unable to initialize a connection to PostgreSQL
 

Cause

The migration service is not able to make a successful connection with Postgres. 

Solution

  • This can happen for various reasons. To debug this issue, start by checking whether the Postgres service is up and running.
systemctl status postgres
  • If the service is active and running, the failure was likely a one-time network/connectivity issue. In this case, the upgrade can be restarted with the same upgrade bundle.
  • If the Postgres service is down, try restarting it and verify that it comes up. Once Postgres is running, the upgrade can be restarted with the same upgrade bundle.
[root@ATCA /opt/vmware]# systemctl status postgres    
* postgres.service - Postgres
     Loaded: loaded (/etc/systemd/system/postgres.service; enabled; vendor preset: disabled)
     Active: active (exited) since Thu 2023-01-05 08:52:57 UTC; 1 week 5 days ago
   Main PID: 6859 (code=exited, status=0/SUCCESS)
      Tasks: 2 (limit: 2385)
     Memory: 45.3M
     CGroup: /system.slice/postgres.service
             |-   7485 /bin/bash /etc/systemd/postgres-port-forward.sh tca-mgr
             `-3080893 sleep 1
  • If Postgres fails to start, check whether Minikube is running.
systemctl status minikube
  • If Minikube is down, restart minikube.
systemctl restart minikube
  • If Minikube is up, check the status of the Postgres pod using the following commands (a sketch for inspecting the pod follows the note below).
ssh admin@<<IP of the VM>>
export KUBECONFIG=/home/admin/.kube/config
kubectl get pods -n {namespace}

NOTE: {namespace} is tca-mgr for TCA Manager and tca-system for TCA Control Plane.
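
If the Postgres pod is not in the Running state, the sketch below can help inspect it. The pod name is a placeholder (an assumption); substitute the name reported by kubectl get pods, and use the namespace described in the note above (tca-mgr or tca-system).
# Hedged sketch: inspect the PostgreSQL pod and its recent logs (pod name is a placeholder)
kubectl get pods -n tca-mgr | grep -i postgres
kubectl describe pod <postgres-pod-name> -n tca-mgr
kubectl logs <postgres-pod-name> -n tca-mgr --tail=50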

Section 6: com.mongodb.MongoGridFSException: No file found with the id: BsonObjectId{value=63cc20e0f245924b7190a911}

Cause

This error happens when a scheduled activity clears temporary files from the objectstore while the migration is in progress, which leads to a mismatch between the file ids the migration has read and the files still present in the objectstore.

Log snippet for reference  

17:29:00 INFO  MigrationHelper: ==============START==============collection_name=objectstore.files
17:29:00 INFO  MigrationHelper: phase=migration, collection_name=objectstore.files, status=processing_start
17:29:04 INFO  MigrationHelper: phase=migration, collection_name=objectstore.files, msg=reading_from_mongodb, record_read=19, execution_time_in_ms=3951
17:29:04 INFO  MigrationHelper: phase=migration, msg=writing_into_postgres execution_time_in_ms=66
17:29:04 ERROR MigrationHelper: phase=migration, collection_name=objectstore.files, status=failed, total_rows_validated=0
com.mongodb.MongoGridFSException: No file found with the id: BsonObjectId{value=63cc20e0f245924b7190a911}
    at com.mongodb.client.gridfs.GridFSBucketImpl.getFileInfoById(GridFSBucketImpl.java:587)
    at com.mongodb.client.gridfs.GridFSBucketImpl.openDownloadStream(GridFSBucketImpl.java:272)
    at com.mongodb.client.gridfs.GridFSBucketImpl.openDownloadStream(GridFSBucketImpl.java:267)
    at com.vmware.vchs.hybridity.migration.MigrationHelper.migrateObjectstoreFilesFromMongoDBToPostgres(MigrationHelper.java:279)
    at com.vmware.vchs.hybridity.migration.MigrationHelper.migrateDataForGivenMongoCollectionAndValidate(MigrationHelper.java:117)
    at com.vmware.vchs.hybridity.migration.MigrationManager.executeMigrationAndValidation(MigrationManager.java:108)
    at com.vmware.vchs.hybridity.migration.MigrationManager.doMigration(MigrationManager.java:69)
    at com.vmware.vchs.hybridity.migration.MigrationManager.main(MigrationManager.java:46)
17:29:04 INFO  MigrationHelper: ==============-END==============collection_name=objectstore.files, execution_time_in_ms=4037
17:29:04 ERROR MigrationHelper: phase=migration, collection_name=objectstore.files, status=failure msg=aborting the migration process
17:29:04 ERROR MigrationManager: Error migrating collection: objectstore.files
com.vmware.vchs.hybridity.migration.exception.MigrationException: 10007: Error in migration a collection from the 'fatal list': aborting the migration flow
    at com.vmware.vchs.hybridity.migration.MigrationHelper.migrateDataForGivenMongoCollectionAndValidate(MigrationHelper.java:174)
    at com.vmware.vchs.hybridity.migration.MigrationManager.executeMigrationAndValidation(MigrationManager.java:108)
    at com.vmware.vchs.hybridity.migration.MigrationManager.doMigration(MigrationManager.java:69)
    at com.vmware.vchs.hybridity.migration.MigrationManager.main(MigrationManager.java:46)
17:29:04 INFO  MongoDBHelper: phase=validation, MongoDB dbStats = {"db": "hybridity", "collections": 300, "views": 0, "objects": 30442, "avgObjSize": 1451.0160633335524, "dataSize": 42.1255407333374, "storageSize": 36.203125, "numExtents": 0, "indexes": 679, "indexSize": 12.02734375, "ok": 1.0}

Solution

The suggested approach for this error is to retry the upgrade as per the Performing Manual Upgrade using CLI section of this KB.
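
If the retry fails on the same id, the sketch below (not part of the official procedure) checks whether the GridFS metadata entry still exists. Replace the id with the one from your log; the database name "hybridity" comes from the dbStats output above, and mongo shell access depends on your deployment.
# Hedged sketch: look up the GridFS metadata document the migration could not find
mongo hybridity --quiet --eval 'printjson(db.getCollection("objectstore.files").findOne({_id: ObjectId("63cc20e0f245924b7190a911")}))'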

Section 7: Upgrade status is stuck in Running state saying migrating data from mongo to postgres.

Cause
During the upgrade of VMware Telco Cloud Automation from version 2.2 to 2.3, if the upgrade fails during the MongoDB to PostgreSQL migration and the user then restarts services or the entire TCA virtual machine, the Appliance Management UI keeps showing the upgrade status as stuck with the message "migrating data from mongo to postgres".

Solution

  • SSH to the TCA VM and switch to the root user.
ssh admin@<<IP of the VM>>
su -
  • Go to the upgrade folder and delete the upgrade status properties file (a quick verification sketch follows this list).
cd /common/logs/upgrade
rm upgrade-status.properties
  • Restart the appliance management service
systemctl restart appliance-management
  • The Appliance management UI will show the correct upgrade status and will allow the user to retry the upgrade. Please refer to the Performing Manual Upgrade using CLI section of this KB.
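
Before retrying, a quick check that the stale status file is actually gone (a minimal sketch):
# Hedged sketch: expect "No such file or directory" after the cleanup above
ls -l /common/logs/upgrade/upgrade-status.properties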

 

Section 8: Resolving Unknown errors

This applies when the upgrade fails at a later stage after a successful migration, and when retried it fails again at the migration step.

Cause

During the upgrade, after the MongoDB to Postgres migration completes, the upgrade can still fail later with an unknown error. In such scenarios, retrying the upgrade fails at the migration step because the /common/pgsql/passwords/ directory created as part of the first migration already exists.

Solution

  • Manually delete the passwords directory
cd /common/pgsql/
rm -rf passwords/
  • Retry the upgrade as per the Performing Manual Upgrade using CLI section of this KB.

Section 9: Performing Manual Upgrade using CLI  

This section explains how to run the upgrade manually from the command line if the user experiences problems with the UI-based upgrade.

  • Access the VM from the command line with the ssh command. A password prompt follows; enter the admin password to log in.
ssh admin@<<IP of the VM>>
  • Once logged in to the VM, switch to root privileges; a prompt for the root password follows. This step can be skipped if the user from the first step is already root.
su -
  • Go to the /tmp directory to download the upgrade bundle.
cd /tmp
  • Download the VMware Telco Cloud Automation upgrade bundle tarball from VMware Customer Connect
  • Extract the upgrade bundle tarball with the help of the following command 
tar -xzf VMware-Telco-Cloud-Automation-upgrade-bundle-2.3.0-21563123.tar.gz
  • Run the upgrade with the following command
./vsm-upgrade.sh image/VMware-Telco-Cloud-Automation-image-2.3.0-21563123.img.dist
  • Upgrade logs are generated in the "/common/logs/upgrade" directory and can be reviewed with the following command
cat /common/logs/upgrade/upgrade.log
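
To follow the upgrade while it runs rather than reading the log afterwards, a minimal sketch:
# Hedged sketch: stream the upgrade log during the manual upgrade
tail -f /common/logs/upgrade/upgrade.log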