Remote Jobs on TCA-CP aren't getting updated on TCA-M


Article ID: 379182


Products

VMware Telco Cloud Automation

Issue/Introduction

  • Operations like VIM status are stuck in Pending.

  • TCA-CP Status in Virtual Infrastructure shows Pending.
  • Cluster Status shows Processing.
  • On TCA-M (2.3), the following entries appear in /common/logs/admin/app.log:
2024-10-04 03:12:15.873 UTC [RemotingService_SvcThread-3, Ent: HybridityAdmin, Usr: HybridityAdmin, , TxId: ###########-#####-#####-#####-############] WARN  c.v.v.h.m.k.KafkaProducerDelegate- Publish failed and will retry
java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.RecordTooLargeException: The message is 5363784 bytes when serialized which is larger than 2097152, which is the value of the max.request.size configuration.
        at org.apache.kafka.clients.producer.KafkaProducer$FutureFailure.<init>(KafkaProducer.java:1316)
        at org.apache.kafka.clients.producer.KafkaProducer.doSend(KafkaProducer.java:985)
        at org.apache.kafka.clients.producer.KafkaProducer.send(KafkaProducer.java:885)
        at org.apache.kafka.clients.producer.KafkaProducer.send(KafkaProducer.java:773)
        at com.vmware.vchs.hybridity.messaging.kafka.KafkaProducerDelegate.sendMessageWithRetries(KafkaProducerDelegate.java:214)
        at com.vmware.vchs.hybridity.messaging.kafka.KafkaProducerDelegate.publishMessageWithTransaction(KafkaProducerDelegate.java:191)
        at com.vmware.vchs.hybridity.messaging.kafka.KafkaProducerDelegate.publish(KafkaProducerDelegate.java:155)
        at com.vmware.vchs.hybridity.messaging.kafka.KafkaProducerDelegate.publish(KafkaProducerDelegate.java:149)
        at com.vmware.vchs.hybridity.messaging.adapter.JobManagerJobPublisher.publish(JobManagerJobPublisher.java:112)
        at com.vmware.vchs.hybridity.messaging.adapter.JobManager.queueJob(JobManager.java:1688)
        at com.vmware.vchs.hybridity.service.remoting.jobs.JobStatusPollAndNotify.handleJobsFromNewVersion(JobStatusPollAndNotify.java:695)
        at com.vmware.vchs.hybridity.service.remoting.jobs.JobStatusPollAndNotify.retrieveUpdatesFromRemoteSinceLastRequest(JobStatusPollAndNotify.java:571)
  • On TCA-M (3.2), the app pod log shows:
    stdout F Caused by: org.apache.kafka.common.errors.RecordTooLargeException: The message is ####### bytes when serialized which is larger than #######, which is the value of the max.request.size configuration.

 

Environment

VMware Telco Cloud Automation 2.3

VMware Telco Cloud Automation 3.2

Cause

This occurs when the topmost record in the "RemotingOutbox" table on TCA-CP exceeds the Kafka message limit of 2 MB. In this case the record was larger than 5 MB, so TCA-M could not consume it, and all subsequent updates queued behind it remained stuck.
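The 2 MB cap in the error message corresponds to the Kafka producer setting max.request.size. For reference only (the exact configuration file and its path on the TCA appliance are not covered by this article):

```properties
# Kafka producer default referenced in the error above:
# messages that serialize to more than this many bytes are
# rejected with RecordTooLargeException.
max.request.size=2097152
```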

Resolution

Follow the steps below to delete the stuck job:

    1. Take a backup or snapshot of both the TCA-Manager and the TCA-CP.

    2. SSH to the corresponding TCA-CP.

    3. Connect to Postgres:
       connect-to-postgres

    4. Identify the topmost (oldest) record with the query below:
       >> SELECT val->'job'->>'jobType', "creationDate", "lastUpdated" FROM "RemotingOutbox" ORDER BY "lastUpdated";

    5. Clean up the record:
       >> DELETE FROM "RemotingOutbox" WHERE val->'job'->>'jobType'='<JobType returned in above query>';

       

    6. Perform a dummy edit on the cluster that is showing Processing (for example, edit and re-save its configuration without any changes) to trigger a fresh status update.
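Steps 4 and 5 can also be run as a single transaction inside the Postgres session, so the record can be inspected before the delete is committed. This is an illustrative sketch; replace the '<JobType>' placeholder with the jobType value returned by the SELECT:

```sql
BEGIN;

-- Inspect the oldest outbox record and note its jobType
SELECT val->'job'->>'jobType' AS job_type, "creationDate", "lastUpdated"
FROM "RemotingOutbox"
ORDER BY "lastUpdated"
LIMIT 1;

-- Delete records of that jobType ('<JobType>' is a placeholder)
DELETE FROM "RemotingOutbox"
WHERE val->'job'->>'jobType' = '<JobType>';

-- Review the DELETE row count before committing;
-- run ROLLBACK instead of COMMIT to abort.
COMMIT;
```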

Additional Information

To confirm the issue, SSH to the target TCA-CP, connect to Postgres, and run:

  >> SELECT val->'job'->>'jobType', "creationDate", "lastUpdated" FROM "RemotingOutbox" ORDER BY "lastUpdated";

Then check the number of pending entries:

  >> SELECT count(*) FROM "RemotingOutbox" ;


A count in the hundreds (for example, 700 or more) indicates that remote jobs are accumulating on TCA-CP and not being updated on TCA-M.
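To confirm which record is exceeding the 2 MB Kafka limit, the stored size of each row's payload can be inspected with PostgreSQL's pg_column_size function. This query is illustrative and not part of the original procedure; note that pg_column_size reports the on-disk (possibly compressed) size, which may differ somewhat from the serialized Kafka message size in the error log:

```sql
-- Show the five largest outbox payloads, largest first
SELECT val->'job'->>'jobType' AS job_type,
       pg_column_size(val)    AS payload_bytes,
       "lastUpdated"
FROM "RemotingOutbox"
ORDER BY pg_column_size(val) DESC
LIMIT 5;
```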