SD Jobs are never executed on some target machines

Products

CA Client Automation
CA Client Automation - IT Client Manager

Issue/Introduction

Consider an SD job that is sent to a large number of targets (>500). Some machines never execute the job. A manual SD job check on an affected target reports that there is no job to execute, even though a job exists for it.

Other machines may be able to execute the job.

Environment

Client Automation - All versions

Cause

On the Scalability Server, SD Server (sd_server.exe) uses the following algorithm to contact all target machines that have an SD job in the waiting state:

1. SD Server builds a list of all targets with an SD job in the waiting state.
2. SD Server goes through the list one machine at a time.
3. If it is time to contact the machine, SD Server sends a TRIGGER datagram to it.
   a. If the machine answers the datagram, SD Server continues with the next machine in the list until 25 trigger datagrams (parameter MaxSimActCheck) have been sent.
   b. If the machine does not answer the datagram, SD Server sets the next trigger time for this machine to the current time + 600 seconds (parameter WaitBetweenJobCheckTriggs) and moves on to the next machine in the list.
4. Steps 2-3 end when 25 (MaxSimActCheck) triggers have been sent or when the end of the list has been reached.
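
A single pass of the steps above can be sketched in Python as follows. This is only an illustration of the described behaviour, not the actual sd_server.exe code: the names Target, run_pass, reachable and job_done are hypothetical, and the sketch assumes that a machine that answers the trigger picks up its job and leaves the waiting list.

```python
from dataclasses import dataclass

MAX_SIM_ACT_CHECK = 25             # triggers per pass (MaxSimActCheck)
WAIT_BETWEEN_JOBCHECK_TRIGGS = 600 # seconds (WaitBetweenJobCheckTriggs)


@dataclass
class Target:
    name: str
    reachable: bool          # hypothetical flag: does the machine answer?
    next_trigger: int = 0    # earliest time (seconds) for the next trigger
    job_done: bool = False   # assumption: answering removes it from the list


def run_pass(targets, now,
             max_triggers=MAX_SIM_ACT_CHECK,
             wait=WAIT_BETWEEN_JOBCHECK_TRIGGS):
    """Run one pass over the waiting list and return the names triggered."""
    triggered = []
    for target in targets:                 # every pass starts at the head
        if len(triggered) >= max_triggers:
            break                          # trigger budget for this pass used up
        if target.job_done or target.next_trigger > now:
            continue                       # not waiting, or not yet due
        triggered.append(target.name)
        if target.reachable:
            target.job_done = True         # machine answered the datagram
        else:
            target.next_trigger = now + wait   # retry after the wait period
    return triggered


targets = [Target("pc1", reachable=False), Target("pc2", reachable=True)]
print(run_pass(targets, now=0))    # ['pc1', 'pc2']
print(run_pass(targets, now=30))   # []  (pc1 is not due again until t=600)
print(run_pass(targets, now=600))  # ['pc1']
```

Note that an unreachable machine still consumes one of the 25 triggers in every pass in which it is due.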

SD Server executes this algorithm every 30 seconds. Each new run always starts at the beginning of the list.
Therefore, if the SD Job Container holds many machines, SD Server may never reach the end of the list, because it sends at most 25 trigger datagrams every 30 seconds. In 10 minutes it can send at most 25*2*10=500 trigger datagrams.

After those 10 minutes, however, the first unreachable machines are due for a new trigger, so SD Server sends the TRIGGER datagram to them again. As a result, SD Server may never send the TRIGGER datagram to machines at a position >500 in the list.

For example:
An SD job is sent to 2000 machines, 600 of which are switched off. After 10 minutes, SD Server has sent 500 (2*25*10) datagrams to 500 machines.
Some of these machines are on, and the SD job executes on them. The others are off, so SD Server sends the TRIGGER datagram to them again after 10 minutes.

After 20 or 30 minutes, SD Server sends the TRIGGER datagram only to the first 500 unreachable machines in the list, retrying each of them every 10 minutes (parameter WaitBetweenJobCheckTriggs), and may never progress to the machines at the end of the list.
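
The example can be reproduced with a small simulation. The sketch below makes two simplifying assumptions that are not stated in the article: the 600 unreachable machines sit at the head of the list (the worst case), and a machine that answers leaves the waiting list. The function name machines_reached and its parameters are hypothetical.

```python
def machines_reached(total, unreachable, wait=600, duration=3600,
                     per_pass=25, interval=30):
    """Count the distinct machines that receive at least one trigger.

    Assumption: positions 0..unreachable-1 never answer (worst case),
    all other machines answer their first trigger and leave the list.
    """
    next_trigger = [0] * total     # earliest next trigger time per machine
    answered = [False] * total
    reached = set()
    for now in range(0, duration, interval):   # one pass every 30 seconds
        sent = 0
        for i in range(total):                 # always start at the head
            if sent == per_pass:
                break
            if answered[i] or next_trigger[i] > now:
                continue
            sent += 1
            reached.add(i)
            if i < unreachable:
                next_trigger[i] = now + wait   # no answer: retry later
            else:
                answered[i] = True
    return len(reached)


# 2000 targets, 600 switched off, default 600-second wait, one hour:
print(machines_reached(2000, 600))  # 500
```

Under these assumptions only the first 500 positions are ever triggered, however long the simulation runs, so none of the 1400 reachable machines is contacted. Calling the function with a different wait value shows how far down the list the passes get with other settings.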

Resolution

This problem can occur when an SD job has many target machines and many of them are unreachable (switched off or not on the network).
The solution is to increase the value of the parameter "Jobcheck: Wait between JobCheck triggers" (WaitBetweenJobCheckTriggs) in the Default Computer Policy:

1. In DSM Explorer, drill down to Control Panel->Configuration->Configuration Policy.
2. Unseal the Default Computer Policy.
3. Under DSM->Software Delivery->Scalability Server, change the value of the parameter "Jobcheck: Wait between JobCheck triggers" from 600 (10 minutes) to 10800 (3 hours).
4. Seal the policy.
5. Wait a few minutes for the policy to be applied across the domain.
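
To see why the larger value helps, the number of triggers SD Server can send before the first unreachable machine becomes due again can be estimated from the figures above (25 triggers per pass, one pass every 30 seconds). The helper name triggers_per_window is hypothetical and the calculation is only a rough upper bound, not a sizing rule.

```python
def triggers_per_window(wait_seconds, per_pass=25, pass_interval=30):
    """Upper bound on triggers sent before the first retry becomes due."""
    return per_pass * (wait_seconds // pass_interval)


print(triggers_per_window(600))    # 500
print(triggers_per_window(10800))  # 9000
```

With a wait of 10800 seconds, SD Server can therefore work through a list of up to about 9000 machines before it returns to the unreachable ones at the head of the list.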