In rare cases, the Zero Downtime Upgrade (ZDU) process may report a successful completion, but the subsequent install customization jobs fail.
This failure is generally attributed to Kubernetes (K8s) not updating its routing tables quickly enough after services are patched, so the jobs can fail even with their default retry limit of 4.
The issue can also arise if manual intervention was required during the ZDU process, leaving the ZDU status out of sync with the install-operator (the operator that manages the AAKE installation).
The example below shows the error messages logged while applying a CAU package:
Install from file
--------------------------------------------------
Path: /path/to/packages/cau/Automation.Engine_PCK.AUTOMIC_CAU_AGENT_UNIX_24_4_3+build.1764677075276.zip
Ignore Dependencies: true
Replace Existing: true
Prune Empty Folder: false
Use Existing Appdata: false
Ignore client restriction: false
Cancel Executions: false
--------------------------------------------------
Artifact /path/to/packages/cau found but it has invalid metadata
Resolution order:
PCK.AUTOMIC_CAU_AGENT_UNIX 24.4.3 will be installed
Installation of Pack PCK.AUTOMIC_CAU_AGENT_UNIX in version 24.4.3 with title Agent upgrade resources for unix started
Importing Pack content...
Importing Pack appdata...
System Error: Error occurred when sending request to Automation Engine: code:1001, reason:Connection Idle Timeout
com.automic.apm.exceptions.ApmException: Error occurred when sending request to Automation Engine: code:1001, reason:Connection Idle Timeout
at com.automic.apm.internal.DefaultAeConnector.send(DefaultAeConnector.java:172)
at com.automic.apm.DefaultAeTask.send(DefaultAeTask.java:141)
at com.automic.apm.tasks.ae.ImportXml.handleImportResource(ImportXml.java:52)
at com.automic.apm.tasks.ae.ImportXml.execute(ImportXml.java:41)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:569)
at com.automic.apm.internal.utils.JavaMethod.invoke(JavaMethod.java:62)
at com.automic.apm.internal.executers.TaskAnnotationFinder$TaskAnnotation.doExecute(TaskAnnotationFinder.java:77)
at com.automic.apm.internal.executers.TaskAnnotationFinder$TaskAnnotation.execute(TaskAnnotationFinder.java:70)
The JCP log shows the following error message:
U00045545 Web Socket error: 'ClosedChannel'
U00003407 Client connection '*CP002#00000001' from '<IP_ADDRESS>:<PORT>' has logged off from the Server.
To correct this state, rerun the failed install customization job and then restore the system's status to PROVISIONED to resynchronize the install-operator.
Since Kubernetes does not permit a direct re-run of a completed or failed job, a new job must be created from the definition of the failing one.
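Before recreating a job, it helps to confirm which customization job actually failed. The following is a minimal sketch, not part of the original procedure: the function names and the NAMESPACE variable are illustrative, and the default namespace is an assumption that should be replaced with the namespace of your AAKE installation.

```shell
# Sketch only: assumes kubectl is configured for the target cluster.
# NAMESPACE is a placeholder; "default" is an assumption, not a documented value.
NAMESPACE="${NAMESPACE:-default}"

# List all Jobs in the namespace with their succeeded and failed pod counts.
list_jobs() {
  kubectl get jobs -n "$NAMESPACE" \
    -o custom-columns=NAME:.metadata.name,SUCCEEDED:.status.succeeded,FAILED:.status.failed
}

# Print the last 50 log lines from a pod that belongs to the given Job.
show_job_logs() {
  kubectl logs -n "$NAMESPACE" "job/$1" --tail=50
}
```

With these functions loaded in a shell, run list_jobs to see which Job reports failed pods, then show_job_logs with that Job's name to check for errors like the ones shown above.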
1. Create a copy of the failed job's definition. Replace cust-ready-job-name-goes-here with the actual name of your failed job:
   kubectl get job cust-ready-job-name-goes-here -o yaml > myjob.yaml
2. Edit the job definition file (myjob.yaml):
   Change the job name (e.g., to myjob-rerun).
   Remove all selectors and any other generated or status-related fields that already exist. These typically include spec.selector, the controller-uid and job-name labels under spec.template.metadata.labels, metadata.uid, metadata.resourceVersion, metadata.creationTimestamp, and the status block. If you are unsure which fields to remove, proceed to step 3, and K8s will indicate the conflicting fields you need to delete from the YAML file.
3. Run the new job in the cluster:
   kubectl apply -f myjob.yaml
4. Wait for the job to succeed. Monitor its status using kubectl get jobs.
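The manual edit described above can also be scripted. The sketch below is one possible approach, under these assumptions: jq is installed, the Job is exported as JSON rather than YAML, and the function names and the NAMESPACE variable are hypothetical. The list of deleted fields covers the ones Kubernetes usually generates for a Job; your cluster may report others.

```shell
# Sketch only: assumes kubectl access to the cluster and that jq is installed.
# NAMESPACE is a placeholder; "default" is an assumption.
NAMESPACE="${NAMESPACE:-default}"

# Read a Job as JSON on stdin, rename it to $1, and drop the fields that
# Kubernetes generated for the original Job. Writes the result to stdout.
strip_job_fields() {
  jq --arg name "$1" '
    .metadata.name = $name
    | del(.status,
          .metadata.uid,
          .metadata.resourceVersion,
          .metadata.creationTimestamp,
          .metadata.managedFields,
          .spec.selector,
          .spec.template.metadata.labels["controller-uid"],
          .spec.template.metadata.labels["job-name"],
          .spec.template.metadata.labels["batch.kubernetes.io/controller-uid"],
          .spec.template.metadata.labels["batch.kubernetes.io/job-name"])'
}

# Recreate the failed Job $1 under the new name $2, then wait for completion.
rerun_job() {
  kubectl get job "$1" -n "$NAMESPACE" -o json \
    | strip_job_fields "$2" \
    | kubectl apply -n "$NAMESPACE" -f - &&
  kubectl wait --for=condition=complete "job/$2" -n "$NAMESPACE" --timeout=600s
}
```

For example, rerun_job cust-ready-job-name-goes-here myjob-rerun would create the copy and block until it completes or the 10-minute timeout (an arbitrary choice) expires. Keeping the field stripping in its own function makes it possible to inspect the cleaned definition before anything is applied to the cluster.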
After the customization job completes successfully, the system must be manually returned to the PROVISIONED state to restore synchronization with the install-operator.
1. Shut down the install-operator: scale the install-operator deployment down to 0 replicas.
2. Edit the automic-automation ConfigMap:
   Delete the following keys from the ConfigMap:
      statusZduStartTime
      statusZduStarts
   Ensure the following keys are set:
      statusVersion is set to the same value as specVersion
      statusStage is set to PROVISIONED
3. Edit the operator-config ConfigMap:
   Copy the values, overwriting the backups:
      backup.properties ← image.properties
      values.backup ← values
4. Restart the install-operator: scale the install-operator deployment back up to 1 replica.
The system should now be operational, and the install-operator will be back in sync.
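The resynchronization steps can be sketched as a script as well. Several details here are assumptions rather than facts from this article: that the operator runs as a Deployment named install-operator, that both ConfigMaps live in the namespace given by NAMESPACE, that the listed keys (including specVersion) sit directly under the ConfigMap's data field, and that jq is installed. The function names are hypothetical. Verify each assumption against your cluster before running anything.

```shell
# Sketch only: Deployment name, namespace, and key locations are assumptions.
NAMESPACE="${NAMESPACE:-default}"

# Reset the ZDU status keys of the automic-automation ConfigMap (JSON on stdin).
reset_zdu_status() {
  jq 'del(.data.statusZduStartTime, .data.statusZduStarts)
      | .data.statusVersion = .data.specVersion
      | .data.statusStage = "PROVISIONED"'
}

# Overwrite the backup keys of the operator-config ConfigMap (JSON on stdin).
refresh_operator_backups() {
  jq '.data["backup.properties"] = .data["image.properties"]
      | .data["values.backup"] = .data["values"]'
}

# Stop the operator, fix both ConfigMaps, then start the operator again.
resync_install_operator() {
  kubectl scale deployment install-operator -n "$NAMESPACE" --replicas=0 &&
  kubectl get configmap automic-automation -n "$NAMESPACE" -o json \
    | reset_zdu_status \
    | kubectl replace -n "$NAMESPACE" -f - &&
  kubectl get configmap operator-config -n "$NAMESPACE" -o json \
    | refresh_operator_backups \
    | kubectl replace -n "$NAMESPACE" -f - &&
  kubectl scale deployment install-operator -n "$NAMESPACE" --replicas=1
}
```

The two jq helpers can be run on their own against an exported ConfigMap to preview the result before calling resync_install_operator, which is the only function that changes the cluster.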