Some customizations after a ZDU on AAKE may fail.

Products

CA Automic Workload Automation - Automation Engine
Automic SaaS

Issue/Introduction

In rare cases, the Zero Downtime Upgrade (ZDU) process may report a successful completion, but the subsequent install customization jobs fail.

This unexpected failure is generally attributed to Kubernetes (K8s) not updating its routing tables fast enough after services are patched. The jobs can then fail even though they are retried up to their default retry limit of 4.

The issue can also arise if manual intervention was required during the ZDU process, leaving the ZDU status out of sync with the install-operator (the operator that manages the AAKE installation).

The example below shows the error messages logged while applying a CAU package:

Install from file
--------------------------------------------------
                       Path: /path/to/packages/cau/Automation.Engine_PCK.AUTOMIC_CAU_AGENT_UNIX_24_4_3+build.1764677075276.zip
        Ignore Dependencies: true
           Replace Existing: true
         Prune Empty Folder: false
       Use Existing Appdata: false
  Ignore client restriction: false
          Cancel Executions: false
--------------------------------------------------
Artifact /path/to/packages/cau found but it has invalid metadata
Resolution order:
 PCK.AUTOMIC_CAU_AGENT_UNIX 24.4.3 will be installed
Installation of Pack PCK.AUTOMIC_CAU_AGENT_UNIX in version 24.4.3 with title Agent upgrade resources for unix started
Importing Pack content...
Importing Pack appdata...
System Error: Error occurred when sending request to Automation Engine: code:1001, reason:Connection Idle Timeout
com.automic.apm.exceptions.ApmException: Error occurred when sending request to Automation Engine: code:1001, reason:Connection Idle Timeout
 at com.automic.apm.internal.DefaultAeConnector.send(DefaultAeConnector.java:172)
 at com.automic.apm.DefaultAeTask.send(DefaultAeTask.java:141)
 at com.automic.apm.tasks.ae.ImportXml.handleImportResource(ImportXml.java:52)
 at com.automic.apm.tasks.ae.ImportXml.execute(ImportXml.java:41)
 at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
 at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.base/java.lang.reflect.Method.invoke(Method.java:569)
 at com.automic.apm.internal.utils.JavaMethod.invoke(JavaMethod.java:62)
 at com.automic.apm.internal.executers.TaskAnnotationFinder$TaskAnnotation.doExecute(TaskAnnotationFinder.java:77)
 at com.automic.apm.internal.executers.TaskAnnotationFinder$TaskAnnotation.execute(TaskAnnotationFinder.java:70)

The JCP log shows the following error message:

U00045545 Web Socket error: 'ClosedChannel'
U00003407 Client connection '*CP002#00000001' from '<IP_ADDRESS>:<PORT>' has logged off from the Server.

Environment

Automic Automation Kubernetes Edition

Cause

Environment.

Resolution

To correct this state, rerun the failed install customization job and then restore the system's status to PROVISIONED to resynchronize the install-operator.

Part 1: Rerunning the Customization Job

Since Kubernetes does not permit a direct re-run of a completed or failed job, a new job must be created from the definition of the failing one.

1. Create a copy of the failed job's definition:
   - Replace cust-ready-job-name-goes-here with the actual name of your failed job.
```bash
kubectl get job cust-ready-job-name-goes-here -o yaml > myjob.yaml
```
2. Edit the job definition file (myjob.yaml):
   - Change the job name (e.g., to myjob-rerun).
   - Crucially, remove the selector (spec.selector), the status section, and any other server-generated fields that already exist, such as metadata.uid, metadata.resourceVersion, and the controller-uid labels on the pod template. If you are unsure which fields to remove, proceed to step 3, and K8s will indicate the conflicting fields you need to delete from the YAML file.
3. Run the new job in the cluster:
```bash
kubectl apply -f myjob.yaml
```
4. Wait for the job to succeed. Monitor its status using kubectl get jobs.
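
The manual edit of the job definition can also be scripted. The following is a minimal sketch, not an official procedure: it assumes that kubectl already targets the namespace of the AAKE installation and that jq is installed. The function names strip_job_manifest and rerun_job, the list of deleted fields, and the 600-second timeout are illustrative assumptions. Review the generated manifest before relying on it.

```shell
#!/usr/bin/env bash
# Sketch only. Assumptions: kubectl targets the AAKE namespace, jq is installed.

# Read a Job manifest (JSON) on stdin, drop the fields that the API server
# generated for the old Job, and set the new name passed as $1.
strip_job_manifest() {
  local new_name="$1"
  jq --arg name "$new_name" '
    del(.status,
        .spec.selector,
        .metadata.uid,
        .metadata.resourceVersion,
        .metadata.creationTimestamp,
        .metadata.labels["controller-uid"],
        .metadata.labels["batch.kubernetes.io/controller-uid"],
        .metadata.labels["job-name"],
        .metadata.labels["batch.kubernetes.io/job-name"],
        .spec.template.metadata.labels["controller-uid"],
        .spec.template.metadata.labels["batch.kubernetes.io/controller-uid"],
        .spec.template.metadata.labels["job-name"],
        .spec.template.metadata.labels["batch.kubernetes.io/job-name"])
    | .metadata.name = $name'
}

# Re-create the failed Job ($1) under a new name ($2) and wait for completion.
rerun_job() {
  local old_name="$1" new_name="$2"
  kubectl get job "$old_name" -o json \
    | strip_job_manifest "$new_name" \
    | kubectl create -f - || return 1
  # Returns non-zero if the Job has not completed within the timeout.
  kubectl wait --for=condition=complete "job/$new_name" --timeout=600s
}

# Example call (placeholder names taken from the steps above):
# rerun_job cust-ready-job-name-goes-here myjob-rerun
```

Only the auto-generated selector and its matching labels are removed here, so any other labels on the pod template are preserved.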

Part 2: Marking the System as Provisioned

After the customization job completes successfully, the system must be manually returned to the PROVISIONED state to restore synchronization with the install-operator.

1. Shut down the install-operator: scale the install-operator deployment down to 0 replicas.
2. Edit the automic-automation ConfigMap:
   - Delete the following keys from the ConfigMap:
     - statusZduStartTime
     - statusZduStarts
   - Ensure the following keys are set:
     - statusVersion is set to the same value as specVersion
     - statusStage is set to PROVISIONED
3. Edit the operator-config ConfigMap:
   - Copy the current values over their backups (the target is on the left, the source on the right):
     - backup.properties ← image.properties
     - values.backup ← values
4. Restart the install-operator: scale the install-operator deployment back up to 1 replica.
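
The steps in Part 2 could be scripted roughly as follows. This is a hedged sketch under several assumptions that are not confirmed by this article: the deployment is named install-operator, the listed keys sit directly under the data section of each ConfigMap, kubectl targets the correct namespace, and jq is installed. The function names build_status_patch and mark_provisioned are hypothetical. A JSON merge patch is used because setting a key to null removes it without failing when the key is already absent.

```shell
#!/usr/bin/env bash
# Sketch only. Assumptions: deployment name "install-operator", keys stored
# under .data, kubectl targets the AAKE namespace, jq is installed.

# Build the JSON merge patch for the automic-automation ConfigMap.
# Keys set to null are deleted by a merge patch.
build_status_patch() {
  local spec_version="$1"
  jq -cn --arg v "$spec_version" '{data: {
      statusZduStartTime: null,
      statusZduStarts: null,
      statusVersion: $v,
      statusStage: "PROVISIONED"}}'
}

mark_provisioned() {
  local spec_version
  # Read specVersion first so the function can stop before changing anything.
  spec_version=$(kubectl get configmap automic-automation \
    -o jsonpath='{.data.specVersion}') || return 1
  [ -n "$spec_version" ] || { echo "specVersion not found" >&2; return 1; }

  # Step 1: stop the install-operator.
  kubectl scale deployment install-operator --replicas=0 || return 1

  # Step 2: remove the ZDU keys and set the status keys.
  # If this or the next step fails, the operator stays scaled down.
  kubectl patch configmap automic-automation --type=merge \
    -p "$(build_status_patch "$spec_version")" || return 1

  # Step 3: overwrite the backups with the current values.
  kubectl get configmap operator-config -o json \
    | jq '.data["backup.properties"] = .data["image.properties"]
          | .data["values.backup"] = .data["values"]' \
    | kubectl replace -f - || return 1

  # Step 4: start the install-operator again.
  kubectl scale deployment install-operator --replicas=1
}

# Example call:
# mark_provisioned
```

kubectl scale returns before the operator pod has terminated, so it may be worth confirming with kubectl get pods that the pod is gone before the ConfigMaps are changed.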

The system should now be operational, and the install-operator will be back in sync.