In universe.log, the following warning message appears during a high-load period with many jobs in status Event Wait: |WARN |X|IO |pid=pid.threadid| o_module_sur_cycle | Interrupting supervisor job condition checks because the time limit was exceeded. The job runs conditions will be check less often than specified until further notice.
The Supervisor engine then stops checking resources, so jobs waiting for a resource remain in status Event Wait.
IMPORTANT: this problem applies only when the following INFO message does not appear in the minutes after the WARN message: |INFO |X|IO |pid=pid.threadid| o_module_sur_cycle | Supervisor cycle execution time back to acceptable. The condition checks are back to being done at normal interval.
If the INFO message does appear in the log a few minutes later, there is nothing to worry about: the Supervisor should be checking resources in a timely manner again.
It was observed that the Supervisor did not check resources again until the engine was manually stopped and started.
Right after the Supervisor engine is restarted, the INFO message appears in universe.log: "Supervisor cycle execution time back to acceptable. The condition checks are back to being done at normal interval"
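The check described above (a WARN message with no later INFO message) can be scripted. The following is a minimal sketch, not a vendor-supplied tool: it assumes that universe.log is a plain-text file, that each message sits on a single line, and that lines are written in chronological order. The function name supervisor_recovered is hypothetical. Because the timestamp layout is not shown here, the sketch relies on line order only and cannot measure how many minutes passed between the two messages.

```python
# Sketch: report whether the latest supervisor WARN was followed by the recovery INFO.
# Assumption: one log message per line, in chronological order.

MODULE = "o_module_sur_cycle"
WARN_MARKER = "Interrupting supervisor job condition checks because the time limit was exceeded"
INFO_MARKER = "Supervisor cycle execution time back to acceptable"


def supervisor_recovered(lines):
    """Return None if no WARN was seen, True if the most recent WARN was
    followed by the recovery INFO message, and False otherwise."""
    state = None
    for line in lines:
        if MODULE not in line:
            continue  # ignore messages from other modules
        if WARN_MARKER in line:
            state = False  # a new warning resets the state
        elif INFO_MARKER in line and state is False:
            state = True  # the recovery message arrived after the warning
    return state


def check_log(path):
    """Read a log file and return a short human-readable verdict."""
    with open(path, errors="replace") as log:
        result = supervisor_recovered(log)
    if result is None:
        return "No supervisor cycle warning found."
    if result:
        return "Warning found, and the supervisor reported recovery afterwards."
    return "Warning found with no recovery message after it."
```

Calling check_log with the path to universe.log returns one of the three verdicts. A "no recovery message" result only matches the symptom described above if the warning is already several minutes old.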
Environment
OS: All OS Version: N/A
Cause
Cause type: Defect
Root Cause: The Supervisor did not finish all of its checks before the end of its cycle, causing an endless cycle.
Resolution
Workaround: Stop and start the Supervisor engine.
Update to a fix version listed below, or to a newer version if available.