Jobs get stuck in starting state when no disk space available in /opt

Article ID: 197227

Products

CA Workload Automation AE - Business Agents (AutoSys)
CA Workload Automation AE - System Agent (AutoSys)
CA Workload Automation AE - Scheduler (AutoSys)
CA Workload Automation Agent
CA Workload Automation AE

Issue/Introduction

Per an enhancement in the r12.0 Workload Automation System Agent, the agent no longer shuts down on Linux servers when the minimum free-space threshold on its file system is breached:
https://techdocs.broadcom.com/content/broadcom/techdocs/us/en/ca-enterprise-software/intelligent-automation/workload-automation-system-agent/12-0/release-notes.html

Support for the System Agent Continuing to Execute in a Non-persistence Mode During Low Disk Space Conditions
When resource monitoring has been enabled, the agent will now continue to execute, but in a non-persistent manner, when the disk space has dropped below the critical threshold. Prior to this change the agent would have shut down in this situation. The parameter agent.resourcemon.threshold.disk.critical.shutdown can be used to restore the prior behavior by setting it to 'true'.
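If the pre-r12.0 shutdown behavior is preferred, the parameter can be set in the agent's agentparm.txt configuration file (a restart of the agent is typically required for the change to take effect). A minimal fragment:

```properties
# Restore pre-r12.0 behavior: shut the agent down when disk space
# drops below the critical threshold, instead of running non-persistently
agent.resourcemon.threshold.disk.critical.shutdown=true
```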

However, even though the machine remains online and the Agent switches to non-persistent (in-memory) mode, any jobs that try to run while there is no space left (zero bytes) get stuck in STARTING state. The jobs remain stuck in STARTING even after space is made available again, and their status must be changed manually before they will run.

A message such as the following is seen in the $AUTOUSER/out/event_demon.$AUTOSERV log when jobs get stuck in STARTING:

<COMM_ERR_14 Agent on machine [agent_host] has not acknowledged this job request. Please investigate the status of this job.>

Cause

The Agent is unable to acknowledge the job request because it cannot write to its <agent_install_dir>/database directory to track the jobs coming in.

Environment

Release : 12.0

Component : CA Workload Automation System Agent

Resolution

This is working as designed. The Scheduler does not receive an acknowledgement from the Agent, so the job remains in STARTING state and requires manual intervention.
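Once space has been freed, the stuck jobs can be reset with the AutoSys sendevent command. A sketch of one common pattern (the job name is illustrative, and the appropriate target status depends on how your site wants the stuck run recorded):

```shell
# Move the stuck job out of STARTING by forcing a terminal status
sendevent -E CHANGE_STATUS -s FAILURE -J stuck_job_name

# Then start the job again
sendevent -E FORCE_STARTJOB -J stuck_job_name
```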

The Agent is able to switch to non-persistent (in-memory) mode when it drops below the critical threshold, but if the file system is completely depleted to 0 bytes, the Agent cannot acknowledge new jobs, and they will end up stuck in STARTING state with manual intervention needed.

The idea behind this feature is to allow the Agent to run a little longer while action is taken to free up space so the critical threshold is no longer breached. The Agent can recover once space is freed and jobs can run again, but it is not designed to acknowledge jobs and queue them up to run later if the file system actually hits 0 bytes.
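Since recovery depends on freeing space before the file system reaches 0 bytes, it can help to alert while there is still headroom. The following is a hypothetical watchdog sketch, not part of the product; AGENT_FS and MIN_KB are illustrative values for your environment:

```shell
#!/bin/sh
# Hypothetical watchdog: warn while there is still time to free space,
# before the agent's file system reaches 0 bytes.
AGENT_FS="${AGENT_FS:-/opt}"        # file system holding the agent install
MIN_KB="${MIN_KB:-102400}"          # warn below ~100 MB free

# Available kilobytes on the file system holding $1
# (POSIX df -kP; -P prevents line wrapping on long device names)
free_kb() {
    df -kP "$1" | awk 'NR==2 {print $4}'
}

avail=$(free_kb "$AGENT_FS")
if [ "$avail" -lt "$MIN_KB" ]; then
    echo "WARNING: ${avail} KB free on ${AGENT_FS}; agent may stop acknowledging jobs"
else
    echo "OK: ${avail} KB free on ${AGENT_FS}"
fi
```

Run from cron (or any scheduler) at a short interval, this gives operators a chance to clear space before jobs start sticking in STARTING.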