Using Timeouts in Orchestration and States

Article ID: 325826

Products

VMware Aria Suite

Issue/Introduction

In Salt, both orchestrations and states have timeouts. When an orchestration depends on states that take a long time to complete, it can fail because it gives up waiting before the state has finished. I will briefly describe the scenario and a solution in this article.

Let's say we need to execute a script that may take a long time to run before returning successfully. We'll use an orchestration to execute this across multiple nodes. The example is a bit contrived, but the setup, an orchestration executing a state, is common enough.

The orchestration state file orch.longtest.sls allows a Salt Minion ID to be passed in as Salt Pillar data to determine the target for the Salt State execution. If the pillar data is not provided, the default value of testing is used as the targeting parameter. Notice that this orchestration has a very long timeout, which allows plenty of time for the state to complete and return a response. If following along, be sure to place all files in the same directory.

# orch.longtest.sls
#
# The orchestration will wait up to 15 minutes before timing out
deploy-temporary-bbcpu:
  salt.state:
    - tgt: {{ pillar.get('minion', 'testing') }}
    - sls: {{ slspath }}.randomlongtest
    - timeout: 900

The following example state file, randomlongtest.sls, demonstrates the timeout and retry options provided by the Salt State system:

# randomlongtest.sls
#
# pycheck.py calls randint to sleep for a random amount of time 
# before returning True

copy_script:
  file.managed:
    - name: /root/pycheck.py
    - source: salt://{{ slspath }}/pycheck.py
    
run_a_random_long_test:
  cmd.run:
    - name: "python3 /root/pycheck.py"
    - timeout: 10
    - retry:
        interval: 3
        attempts: 50
    - require:
      - file: copy_script

# code for pycheck.py script
from random import randint
from time import sleep

def randomizethis():
    """
    Sleep for a random time between 0 and 60 seconds 
    before returning True
    """
    sleep_timer = randint(0, 60)  # randint includes both endpoints
    print("Sleeping for {} seconds".format(sleep_timer))
    sleep(sleep_timer)
    return True

if __name__ == '__main__':
    randomizethis()

With Salt Orchestration executing a Salt State as shown above, the timeout and retry options set directly on the state demonstrate the extra functionality provided by the Salt State system. In the above example, the minion will attempt to execute the State up to 50 times, with a 3-second pause between attempts. Each attempt has 10 seconds before timing out. The Orchestration State checks the status of the job every 5 seconds (configurable in the Master configuration) using the saltutil.find_job function. As long as the Minion reports the job as active, the orchestration should wait indefinitely.
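
To see how the retry settings relate to the orchestration timeout, the numbers from the two state files can be added up. This is only a rough sketch: the helper name worst_case_seconds is hypothetical, and it assumes that every attempt runs for its full timeout and that a pause occurs between attempts but not after the last one.

```python
def worst_case_seconds(attempts: int, timeout: int, interval: int) -> int:
    """Upper bound on run time if every attempt hits its timeout.

    Assumes one pause between consecutive attempts (attempts - 1 pauses).
    """
    return attempts * timeout + (attempts - 1) * interval


# Values taken from randomlongtest.sls and orch.longtest.sls
STATE_TIMEOUT = 10    # cmd.run timeout, in seconds
RETRY_INTERVAL = 3    # retry interval, in seconds
RETRY_ATTEMPTS = 50   # retry attempts
ORCH_TIMEOUT = 900    # orchestration timeout, in seconds

worst = worst_case_seconds(RETRY_ATTEMPTS, STATE_TIMEOUT, RETRY_INTERVAL)
print("worst case: {}s, orchestration timeout: {}s".format(worst, ORCH_TIMEOUT))
```

Under these assumptions the state finishes, one way or the other, within 647 seconds, which is comfortably inside the 900 seconds the orchestration allows.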

Use the following commands to run the examples:

# Before running the orchestration, connect to the Salt master's
# event bus with the following command in one terminal:
salt-run state.event

# In another terminal connected to your Salt master, assuming you placed all of 
# the above files in the /srv/salt/orch directory, run this:
salt-run state.orch orch.longtest pillar='{"minion":"REPLACE WITH VALID MINION ID"}'

This executes the orchestration while you watch the event bus of the Salt master, so you can see all of the activity as it happens.

The result is slightly randomized, but because the timeout in the orchestration is set to 900 seconds, you should always see the result for the state, since the state should never take that long to complete. Also, if you monitor the processes on the minion, you may notice multiple pycheck.py scripts running simultaneously. This may happen because the state itself only waits 10 seconds for a result from the command before retrying.
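
How randomized the result is can be estimated with a little arithmetic. The sketch below is a simplified model, not a description of Salt internals: it assumes an attempt succeeds only when the random sleep is strictly shorter than the 10-second timeout, it ignores interpreter start-up time, and the function names are hypothetical.

```python
from fractions import Fraction


def p_attempt_succeeds(timeout: int, max_sleep: int) -> Fraction:
    """Chance that a single attempt finishes before the timeout.

    randint(0, max_sleep) yields max_sleep + 1 equally likely integers;
    this model counts only sleeps of 0 .. timeout - 1 seconds as successes.
    """
    return Fraction(timeout, max_sleep + 1)


def p_all_attempts_fail(attempts: int, timeout: int, max_sleep: int) -> Fraction:
    """Chance that every attempt times out, treating attempts as independent."""
    return (1 - p_attempt_succeeds(timeout, max_sleep)) ** attempts


# Values from randomlongtest.sls and pycheck.py
single = p_attempt_succeeds(timeout=10, max_sleep=60)
all_fail = p_all_attempts_fail(attempts=50, timeout=10, max_sleep=60)

print("single attempt succeeds: {:.1%}".format(float(single)))
print("all 50 attempts fail:    {:.4%}".format(float(all_fail)))
```

In this model a single attempt succeeds about one time in six, so several retries are typical, while exhausting all 50 attempts is very unlikely.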

I hope this article helps clarify how timeouts and retries work in states, and how to use them with your orchestrations so that long-running states can complete and return their results.


Environment

VMware Aria Automation Config 8.12.x