
Windows blue screen (BSOD) during reboot caused by controller.exe (robot)


Article ID: 6694


Products

DX Unified Infrastructure Management (Nimsoft / UIM)

Issue/Introduction

A Windows crash (blue screen of death, BSOD) may occur during or immediately after a system reboot.

A memory dump/minidump related to the crash may show the following:

FAILURE_BUCKET_ID: 0xEF_wininit.exe_BUGCHECK_CRITICAL_PROCESS_TERMINATED_BY_controller.exe_c9f84880 
BUCKET_ID: 0xEF_wininit.exe_BUGCHECK_CRITICAL_PROCESS_TERMINATED_BY_controller.exe_c9f84880 
PRIMARY_PROBLEM_CLASS: 0xEF_wininit.exe_BUGCHECK_CRITICAL_PROCESS_TERMINATED_BY_controller.exe_c9f84880 
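
When several dumps need to be triaged, a short script can flag the ones that carry this signature. The sketch below is only an illustration: it assumes the debugger analysis output has been saved as text with one `FIELD: value` pair per line, as in the lines above, and the function name `controller_signature_fields` is hypothetical.

```python
import re

# Field names are the ones quoted in this article. The "FIELD: value"
# line layout is an assumption based on those three lines.
FIELD_RE = re.compile(
    r"^[ \t]*(FAILURE_BUCKET_ID|BUCKET_ID|PRIMARY_PROBLEM_CLASS):[ \t]*(\S+)",
    re.MULTILINE,
)


def controller_signature_fields(analysis_text):
    """Return {field: value} for fields that show bugcheck 0xEF
    (critical process terminated) attributed to controller.exe."""
    hits = {}
    for field, value in FIELD_RE.findall(analysis_text):
        if value.startswith("0xEF_") and "TERMINATED_BY_controller.exe" in value:
            hits[field] = value
    return hits
```

An empty result means none of the three fields names controller.exe, so the crash probably has a different cause.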

 

Environment

Windows - any version

robot versions prior to 9.33

Cause

When a robot restarts, it tries to shut down all running probes. If a probe does not shut down within 10 seconds, the controller issues a 'kill' command against its PID. At this point the controller also records the PIDs of these processes. When it starts again, it checks whether those PIDs are still active and, if so, kills the processes before starting the probes. The following controller.log excerpt shows this behavior:


Aug 27 19:07:15:050 [139974114551552] Controller: Stopping processes from previous run
Aug 27 19:07:15:050 [139974114551552] Controller: ProcessControl: Sending SIGTERM signal to spooler (24711)...
Aug 27 19:07:15:050 [139974114551552] Controller: ProcessControl: Unable to send stop signal to process spooler (24711)
Aug 27 19:07:16:050 [139974114551552] Controller: ProcessControl: Child exited
Aug 27 19:07:16:050 [139974114551552] Controller: ProcessControl: Sending SIGTERM signal to hdb (24745)...
Aug 27 19:07:16:050 [139974114551552] Controller: ProcessControl: Unable to send stop signal to process hdb (24745)
Aug 27 19:07:17:050 [139974114551552] Controller: ProcessControl: Child exited
Aug 27 19:07:17:050 [139974114551552] Controller: ProcessControl: Sending SIGTERM signal to snmptd (24771)...
Aug 27 19:07:17:050 [139974114551552] Controller: ProcessControl: Unable to send stop signal to process snmptd (24771)
Aug 27 19:07:18:051 [139974114551552] Controller: ProcessControl: Child exited
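
To see which PIDs the controller targeted at startup, the log lines above can be summarized with a small script. This is a rough sketch: the two patterns are taken only from the sample excerpt (the wording in other robot versions or on other platforms may differ), and the function name `summarize_stop_attempts` is hypothetical.

```python
import re

# Patterns derived from the sample excerpt only; they are assumptions,
# not a documented log format.
SIGNAL_RE = re.compile(r"Sending (\w+) signal to (\S+) \((\d+)\)")
FAILED_RE = re.compile(r"Unable to send stop signal to process (\S+) \((\d+)\)")


def summarize_stop_attempts(log_text):
    """Return (probe, pid, delivered) tuples in the order first seen.

    delivered is False when the log reports that the stop signal
    could not be sent to that probe/PID pair.
    """
    attempts = {}
    for line in log_text.splitlines():
        sent = SIGNAL_RE.search(line)
        if sent:
            # Assume delivery succeeded unless a failure line follows.
            attempts[(sent.group(2), int(sent.group(3)))] = True
            continue
        failed = FAILED_RE.search(line)
        if failed:
            attempts[(failed.group(1), int(failed.group(2)))] = False
    return [(name, pid, ok) for (name, pid), ok in attempts.items()]
```

Each PID in the result is one the controller tried to stop from a previous run, which makes it a candidate to compare against the process that crashed.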

Sometimes during a reboot, one or more probes take longer to shut down and the reboot interrupts the shutdown. After the reboot, a new, unrelated process may have been assigned a PID that previously belonged to a probe, and the controller terminates that process. If it is a system-critical process, the result is a BSOD.
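
The hazard is that a PID alone does not identify a process across a reboot. The sketch below illustrates one common way to guard against PID reuse, by comparing the recorded process name and start time with the current owner of the PID. It is a generic illustration with a simulated process table, not a description of how the robot 9.33 fix is implemented; `ProcInfo`, `pids_safe_to_stop`, and the sample PIDs are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcInfo:
    pid: int
    name: str
    start_time: float  # creation time of the process, seconds since epoch


def pids_safe_to_stop(recorded, running):
    """Return recorded PIDs whose current owner is still the same process.

    recorded: list of ProcInfo saved before the restart
    running:  dict mapping pid -> ProcInfo for processes alive now
    """
    safe = []
    for old in recorded:
        current = running.get(old.pid)
        if current is None:
            continue  # the process is gone, nothing to stop
        if current.name == old.name and current.start_time == old.start_time:
            safe.append(old.pid)
        # Otherwise the PID now belongs to an unrelated process: leave it alone.
    return safe


# Simulated data: PID 620 was reused by a system process after the reboot.
recorded = [ProcInfo(620, "spooler.exe", 1000.0), ProcInfo(700, "hdb.exe", 1001.0)]
running = {
    620: ProcInfo(620, "wininit.exe", 2000.0),
    700: ProcInfo(700, "hdb.exe", 1001.0),
}
safe = pids_safe_to_stop(recorded, running)
```

In this simulated case only PID 700 is returned. A check based on the PID alone would also have selected PID 620 and terminated the unrelated process.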

Resolution

The fix for this issue was released in robot 9.33. Deploy robot 9.33 or any later version to apply it.

 

For robot 7.80HF21 and later versions prior to 9.33, the following settings work around the issue. Keep in mind that adding them may cause robot restarts to take longer than usual.

  • Add these keys to the main <controller> section of robot.cfg:
  • use_force_stop = 0
    • Prevents the robot from force-stopping probe processes; it loops, waiting for probes to shut down on their own instead.
  • stop_existing_processes = 0
    • Prevents the robot from killing processes left over from a previous run of the controller if it believes they still exist.
  • Used together, these settings should mitigate cases where the controller could terminate processes it doesn’t own.
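
As a rough sketch, the two keys would sit inside the existing <controller> section of robot.cfg as shown below. The other keys already present in that section are omitted here and should be left unchanged; the indentation is illustrative.

```
<controller>
   use_force_stop = 0
   stop_existing_processes = 0
</controller>
```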

 

Additional Information

There is no fix for robot versions prior to 7.80HF21.