Consider an SD job that is sent to a large number of targets (more than 500). Some machines do not execute the SD job: a manual SD job check on the target machine reports that there is no job to execute, even though a job exists. Other machines may execute the job without problems.
Client Automation - All versions
On the Scalability Server, SD Server (sd_server.exe) uses the following algorithm to contact all target machines that have an SD job in the waiting state:
SD Server runs this algorithm every 30 seconds. Each run starts again from the beginning of the list.
Therefore, if the SD Job Container holds many machines, SD Server may never reach the end of the list, because it sends only 25 datagram triggers every 30 seconds. In 10 minutes it can send 25 * 2 * 10 = 500 datagram triggers.
After those 10 minutes, SD Server starts sending the datagram trigger to the unreachable machines again, so it may never send the trigger to machines beyond position 500 in the list.
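The arithmetic above can be checked with a short Python sketch. The figures of 25 triggers per run and one run every 30 seconds come from the text. The function names are hypothetical, and the second function is only a rough worst-case estimate (every machine unreachable), not an official sizing formula.

```python
import math

TRIGGERS_PER_RUN = 25   # datagram triggers sent per run (from the text)
RUN_INTERVAL_S = 30     # seconds between runs (from the text)


def triggers_in_window(wait_seconds):
    """Number of triggers SD Server can send before the wait period expires."""
    return (wait_seconds // RUN_INTERVAL_S) * TRIGGERS_PER_RUN


def min_wait_to_cover(machine_count):
    """Rough estimate (assumption): shortest wait, in seconds, that lets one
    pass reach every machine before the first ones become due again."""
    runs = math.ceil(machine_count / TRIGGERS_PER_RUN)
    return runs * RUN_INTERVAL_S


print(triggers_in_window(600))    # triggers sent within a 10-minute wait
print(min_wait_to_cover(2000))    # estimated wait, in seconds, for 2000 targets
```

With a 10-minute wait (600 seconds) the first call gives 500, matching the figure in the text. For 2000 targets the estimate is 2400 seconds (40 minutes).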
For example:
An SD job is sent to 2000 machines, 600 of which are switched off. After 10 minutes, SD Server has sent 500 (2 * 25 * 10) datagrams to 500 machines.
Some machines are on, and the SD job executes on them. Because some of the machines are off, SD Server sends the TRIGGER datagram to them again after 10 minutes.
After 20 or 30 minutes, SD Server is still sending the TRIGGER datagram only to the first 500 unreachable machines in the list (within the 10-minute interval set by the parameter WaitBetweenJobCheckTriggs) and may never progress to the machines at the end of the list.
This problem can occur when the SD job has many target machines and many of them are unreachable (switched off or not on the network).
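The behaviour described above can be reproduced with a small simulation. This is a simplified model built on assumptions, not the actual sd_server.exe logic. It assumes each run walks the list from the start, skips machines that have already executed the job or were triggered less than WaitBetweenJobCheckTriggs seconds ago, and stops after 25 triggers. It also assumes the worst case, in which all 600 switched-off machines sit at the front of the list. The function and parameter names are hypothetical.

```python
def simulate(total, offline, wait_s, duration_s, batch=25, run_interval_s=30):
    """Return the set of list positions that received at least one trigger.

    Simplified model (assumption): online machines execute the job as soon
    as they are triggered; offline machines stay offline for the whole run.
    """
    last_trigger = [None] * total   # time of the last trigger per machine
    done = [False] * total          # True once the job has executed
    reached = set()

    for now in range(0, duration_s, run_interval_s):
        sent = 0
        for i in range(total):      # every run restarts at the top of the list
            if sent == batch:
                break
            if done[i]:
                continue
            if last_trigger[i] is not None and now - last_trigger[i] < wait_s:
                continue            # triggered too recently, skip for now
            last_trigger[i] = now
            reached.add(i)
            sent += 1
            if i not in offline:
                done[i] = True
    return reached


offline = set(range(600))           # assumption: offline machines come first
four_hours = 4 * 3600

short_wait = simulate(2000, offline, wait_s=600, duration_s=four_hours)
long_wait = simulate(2000, offline, wait_s=3600, duration_s=four_hours)
print(len(short_wait), len(long_wait))
```

Under these assumptions, a 600-second wait leaves SD Server cycling over the first 500 positions for the full four hours, so only 500 machines are ever triggered. With a 3600-second wait (an arbitrary illustrative value, not a recommended setting) all 2000 machines are reached.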
The solution is to increase the value of the parameter "Jobcheck: Wait between JobCheck triggers" (WaitBetweenJobCheckTriggs) in the Default Computer Policy: