Scaling Client Automation (ITCM): Understanding the workflow of ITCM asset jobs and why they might not be running.

Products

CA Client Automation - Asset Management
CA Client Automation - IT Client Manager
CA Client Automation

Issue/Introduction

You've created and linked an asset job, but the job appears to be stuck indefinitely in the "waiting" status.

Environment

CA Client Automation (ITCM) -- all versions

Cause

Asset jobs in ITCM work very differently from software jobs.

First some background on how software jobs work in ITCM:

- When a software job is built by the system delivery engine (sd_taskm.exe) up on the domain manager, it is immediately forwarded to the software delivery server process (sd_server.exe), on each scalability server that contains an agent targeted by the software job.

- If the software package is not staged on the scalability server, then software delivery will invoke the data transport system (DTS) to temporarily stage the package for the duration of the software job.

- Once the package and the job are staged on the scalability server, the sd_server.exe process will begin proactively sending triggers to each agent to run a job check, and to begin downloading and executing the package.

- The SD agent job executioner (sd_jexec.exe) will send the return code of the software job back to sd_server.exe on the scalability server, along with any job output if the $rf macro was specified and used by the software package.

- The sd_server.exe process will then forward the results back to SD installation manager, another instance of sd_taskm.exe up on the domain manager, so the result can be recorded in the database and viewed in DSM Explorer. There may be some delay in sd_server.exe forwarding the results or job progress messages back to the domain manager, if a feature called "bulk update" has been enabled.

Asset jobs use a similar architecture, but a much different workflow:

- Once an asset job is linked to an agent or a computer group, it must get synchronized by the engine (cmengine.exe) processes running the collect tasks, for each scalability server that contains an agent targeted by the asset job.

- If you've ever observed an engine running a collect task, you will have noticed that one of its stages is to "validate" the scalability server. One of the tasks during the validation phase is to synchronize any asset jobs to the scalability server, along with the corresponding scheduling settings.

- Depending on the environment's scaling and architecture (the number of agents, the number of scalability servers, the distribution of agents across scalability servers, the distribution of collect tasks across the engines, and so on), there may be a considerable delay on the front-end in simply getting the asset job linked to every scalability server. The delay is worse if any scalability server has a large volume of files waiting to be collected by its engine.

- Once the asset job is linked to the scalability server, unlike software delivery, there is no proactive job triggering! The AM agent plugin on the agent is driven by two mechanisms:

1- The CAF scheduler policy settings (i.e. "Run the UAM Agent")
2- Environmental triggers (reboot, user logon, network address change, "caf register", "caf start amagent", or a remote job check from DSM Explorer or the CAF command line interface).

- For example, if the asset job is scheduled to run between 5pm and 7pm, but the "Run the UAM Agent" policy is not configured to run during that window and no event trigger occurs during it, then the asset job will not run!

- On the back-end, once the asset job does run, the agent sends a status file to the sector on the scalability server, reporting the job status. This status file then waits in line with all the other asset management files to be collected. It is quite possible that a number of computer registrations, execution date updates, other status files, hardware inventory files, software inventory files, and so on are still waiting to be collected by the engines on the domain manager. Hence there may also be a considerable delay on the back-end: the asset job may already have run on the agent, yet still appear not to have run because its status file has not been collected!
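The 5pm to 7pm scheduling example above can be sketched in Python. This is a minimal illustration, not ITCM code: the function name and its inputs are hypothetical, and it assumes an asset job can only start when the AM agent plugin runs (through the CAF scheduler or an event trigger) inside the job's scheduled window.

```python
from datetime import time


def asset_job_will_run(window_start, window_end, agent_run_times):
    """Return True if any AM agent run falls inside the job's window.

    agent_run_times is a hypothetical list of times of day at which the
    AM agent plugin runs, whether started by the "Run the UAM Agent"
    scheduler policy or by an event trigger such as a reboot or logon.
    """
    return any(window_start <= t <= window_end for t in agent_run_times)


# The job window is 5pm to 7pm.
window = (time(17, 0), time(19, 0))

# The scheduler starts the agent at 9am only: the job never gets a chance.
print(asset_job_will_run(*window, [time(9, 0)]))

# A user logon at 6:15pm triggers the agent inside the window.
print(asset_job_will_run(*window, [time(9, 0), time(18, 15)]))
```

The first call prints False and the second prints True, which mirrors the point above: nothing pushes the job to the agent, so an agent run has to coincide with the window.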

Resolution

If you read through the workflow details in the "Cause" section above, you may already have an idea of why your asset jobs are not running. The common bottleneck for asset jobs is the engines processing the collect task for each scalability server.

On the front-end, after an asset job is linked to an agent or computer group, the collect task for each scalability server needs to be run, in order to synchronize the asset job and its scheduling information with each scalability server.

In the middle, what's triggering the asset job to run? The AM agent plugin is driven by (1) the CAF scheduler policy and (2) event triggers. Asset jobs are not proactively triggered like software jobs. Check the CAF scheduler policy, "Run the UAM Agent".

On the back-end, after the asset job has run on the agent, its status is uploaded to the sector on the scalability server, where it joins all the other files waiting to be collected and processed by the engine: computer and user registrations, hardware and software inventory files, execution data and status files, and any custom inventory files.

Hence environment, scaling and architecture are very common factors in the perception that asset jobs are not working.

Here is a list of items you should check and consider:

1. Are all the scalability server collect tasks assigned/linked to an engine? You can check this in DSM Explorer --> Control Panel --> Engines --> All Engines --> Pick an engine (e.g. SystemEngine) --> Link Existing Task. Ensure the collect task for each scalability server is mapped to an engine, otherwise its files will not be collected.

2. Are all the engine plugins running? Check the "caf status" output to be sure.

3. Are any engines stuck processing a large backlog of files from any scalability server?

4. Architecture and scaling play a BIG ROLE in how efficiently asset jobs run:

a) The overall number of agents in the environment. A good rule of thumb is about 10,000 agents per ITCM domain manager.

b) The overall number of scalability servers in the environment. A good rule of thumb is about 1,000 agents per ITCM scalability server. Could they handle more? Absolutely, but at what trade-off? The location of each scalability server is also quite important. What's the point of a scalability server, if it's not locally/regionally serving its registered agents?

c) The overall number of engines on the domain manager. A good rule of thumb is no more than 8 to 10 additional engine instances, though this varies with the architecture and available resources. Remember, the engines are the portals into SQL for processing information. The database schema remains constant, so at some point, adding more engines and too much parallelism may only serve to increase delays and wait times for SQL commands to process. There are no hard and fast numbers here. It is up to the ITCM and SQL administrators to see which configuration works most efficiently for the volume of incoming data from the agents.

d) The distribution of agents across the scalability servers. You can download the WinOffline utility from the ITCM community site. Its scalability servers portal, under "Database Tools", shows the count of registered agents per scalability server, which saves manually filtering for each scalability server in DSM Explorer:
WinOffline

e) How often are the agents configured to send a hardware inventory delta? See the sister scaling document, kb000046211, "Scaling Client Automation: How to improve collect task and replication task performance by limiting the amount of hardware and software scans sent by agents".
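The rules of thumb in items (a), (b) and (d) can be checked mechanically once you have an agent count per scalability server. The following Python sketch is only an illustration: the function name, server names and counts are hypothetical, and the thresholds are the guideline figures quoted above, not hard limits.

```python
# Rules of thumb quoted in this article; guidelines, not hard limits.
AGENTS_PER_DOMAIN_MANAGER = 10_000
AGENTS_PER_SCALABILITY_SERVER = 1_000


def review_distribution(agents_per_server):
    """Return findings for a mapping of scalability server name -> agent count."""
    findings = []
    total = sum(agents_per_server.values())
    if total > AGENTS_PER_DOMAIN_MANAGER:
        findings.append(
            f"domain manager: {total} agents exceeds {AGENTS_PER_DOMAIN_MANAGER}"
        )
    for server, count in sorted(agents_per_server.items()):
        if count > AGENTS_PER_SCALABILITY_SERVER:
            findings.append(
                f"{server}: {count} agents exceeds {AGENTS_PER_SCALABILITY_SERVER}"
            )
    return findings


# Hypothetical counts, for example copied from the WinOffline view.
example = {"ss-east": 2400, "ss-west": 600, "ss-emea": 950}
for line in review_distribution(example):
    print(line)
```

With these sample numbers only ss-east is flagged. A flagged server is a candidate for moving agents elsewhere, not proof of a problem.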

5. Engine advanced settings: right click on an engine --> properties --> advanced tab.

a) Number of files the engine will collect during one collect cycle. The default value is 10,000. A good rule of thumb is to change this to 500-1,000 files. Why? It allows the engine to cycle through a collect task (or list of collect tasks) faster, so engines are "validated" more often. When engines are validated more often, computer registrations are updated more frequently, and most importantly for asset jobs, the scalability servers are validated more frequently, so asset jobs are not waiting for a lengthy list of files to be collected, before they are synchronized.

b) The interval in seconds between engine jobs. The default value is 60 seconds. A good rule of thumb is to change this to 20-30 seconds. Why? If you have a lengthy list of collect tasks assigned/linked to an engine, you'll want the engine resting less between tasks, so it can cycle through more tasks in a shorter period of time.
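To see why both settings matter, here is a rough back-of-the-envelope model in Python. It is a simplified sketch, not how the engine is implemented: it assumes the engine visits each linked collect task in turn, collects up to the per-cycle file limit, then pauses for the configured interval. The per-file processing time and the backlog figures are hypothetical.

```python
def seconds_per_pass(backlogs, files_per_cycle, interval_s, secs_per_file):
    """Estimate how long one pass over all linked collect tasks takes.

    backlogs: files waiting on each scalability server linked to the engine.
    files_per_cycle: advanced setting, maximum files collected per collect cycle.
    interval_s: advanced setting, pause in seconds between engine jobs.
    secs_per_file: ASSUMED average processing cost per file (hypothetical).
    """
    work = sum(min(b, files_per_cycle) for b in backlogs) * secs_per_file
    rest = interval_s * len(backlogs)
    return work + rest


backlogs = [12_000, 300, 50, 800]  # hypothetical backlog per scalability server
default = seconds_per_pass(backlogs, 10_000, 60, 0.5)
tuned = seconds_per_pass(backlogs, 1_000, 30, 0.5)
print(f"default settings: {default / 60:.1f} min per pass")
print(f"tuned settings:   {tuned / 60:.1f} min per pass")
```

Lowering the file limit does not reduce the total work. It caps how long one backlogged server can hold up the engine, so every scalability server is revisited, and validated, sooner.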

6. Asset jobs are not automatically triggered by the scalability servers the way software jobs are. Their execution is driven from the agent side by the AM agent plugin, which runs on the CAF scheduler policy and on event triggers.

a) CAF scheduler policy. DSM Explorer --> Control Panel --> Configuration --> Configuration Policy --> DSM --> Common Components --> CAF --> Scheduler --> Run the UAM Agent. Review the settings within. By default, the AM agent plugin is only triggered once per day by the CAF Scheduler.

b) Event triggers. These are external triggers that automatically cause the AM agent plugin to run. These include:

- Computer reboot.
- User login.
- Network address change.
- Local or remote "caf register".
- Local or remote "caf start amagent".
- Remote DSM Explorer Asset Job Check functionality.
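For a quick local test, one of the triggers listed above can be fired from a script. This Python sketch simply wraps the "caf start amagent" command named in the list. The function name is hypothetical, and it assumes the caf executable is on the PATH of the agent machine.

```python
import subprocess


def start_am_agent(run=subprocess.run):
    """Run "caf start amagent" locally and return its exit code.

    The run parameter exists only so the call can be replaced in tests;
    by default the real command is executed (assumes caf is on the PATH).
    """
    result = run(["caf", "start", "amagent"], check=False)
    return result.returncode
```

Calling start_am_agent() on an agent machine whose asset job is stuck in "waiting" is a simple way to separate a triggering problem from a collection backlog: if the job then runs, the schedule and triggers were the gap.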