Many customers run scheduled tasks using Tanzu Scheduler on their platform. It is important to understand what happens to those tasks during specific scenarios, like platform upgrades and Scheduler outages.
The Tanzu Scheduler software can be used to run periodic tasks on Tanzu Application Service for VMs. The Scheduler itself consists of two applications that run on TAS for VMs: a scheduler and an API. In addition, calls are made directly by the scheduler app, and the jobs that Scheduler executes run as tasks on your TAS for VMs foundation.
Because all of these items run on your TAS for VMs foundation, it's important to understand that they can be impacted by platform upgrades, not just by outages of Scheduler itself.
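To make these moving parts concrete, here is a minimal sketch of how a scheduled job ends up as a TAS task. It assumes you have the Scheduler cf CLI plugin installed; the app name, job name, command, and cron expression are placeholders, and the exact command syntax and cron format are documented in the Scheduler docs.

    # Create a job that runs a command against an existing app, then schedule it.
    cf create-job my-app nightly-cleanup "bundle exec rake cleanup"
    cf schedule-job nightly-cleanup "0 2 * * *"

    # When the schedule fires, the job appears as an ordinary task on the app.
    cf tasks my-app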
For the most part, Scheduler and scheduled tasks should not experience issues when you upgrade your foundation. Because it runs on the TAS platform, Scheduler gets the same HA benefits as the applications you deploy to the foundation. In addition, updates to the Scheduler tile itself use a blue/green deployment method that should result in no downtime during upgrades. All of this means that outages of Scheduler itself should be quite rare.
Where there is potential for problems is during an upgrade of your TAS tile or when deploying new stemcells for TAS. Both operations require all of your Diego Cells to be rebuilt, and when that happens applications and tasks are evicted from each Cell before it is recreated.
For the two applications that provide the Scheduler service, this is not a problem. Multiple instances of each application run by default, which prevents downtime because the platform guarantees that it will not take all of an application's instances down at the same time. (If you have lowered the instance count for these apps to one, you may experience Scheduler downtime during upgrades.)
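If you want to confirm this before an upgrade, you can check and, if needed, restore the instance counts with standard cf CLI commands. The org, space, and app names below are illustrative; the actual names depend on how the tile deployed the Scheduler apps in your foundation.

    # Target the space the Scheduler tile deployed its apps into and check instance counts.
    cf target -o system -s scheduler
    cf apps

    # If an app has been scaled down to a single instance, scale it back up.
    cf scale scheduler-api -i 2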
What will be impacted by an upgrade to TAS or TAS stemcells are running tasks. If a task has been triggered and is actively running on a Cell when that Cell is upgraded, the task and all applications running on the Cell are drained (i.e. stopped). Drained applications are rescheduled on other Cells, but drained tasks, due to their one-time nature, are not. A task that is drained is terminated, and its status will reflect that it failed.
This is important for task developers to note: your tasks need to be capable of being interrupted or canceled and then re-executed at a later time, whether manually or on the next scheduled invocation.
As a task developer, you can handle this by trapping the SIGTERM signal, which Diego sends to your task when it is being evicted. When you receive this signal, your task needs to save its state and stop work as soon as possible. The task has 10 seconds to perform any cleanup and exit; if it does not, it is killed immediately (i.e. SIGKILL is sent). As previously mentioned, your task will not be rescheduled on another Cell, so the task needs to be able to recognize that it previously ran and failed and, if necessary, continue from the point where it stopped. The task will run again at the next scheduled interval, or perhaps sooner if your operations team manually triggers it.
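Here is a minimal sketch of this pattern for a task implemented as a shell script. The more-work-remains, do-one-unit-of-work, and save-checkpoint commands are placeholders for your own logic; the important parts are trapping TERM, persisting progress quickly, and exiting well within the 10-second window.

    #!/bin/bash
    # Placeholders: save-checkpoint persists progress somewhere durable,
    # do-one-unit-of-work performs one small slice of the task's work,
    # and more-work-remains checks whether anything is left to do.

    on_term() {
      # Diego sent SIGTERM: save state and exit before the 10-second grace period ends.
      save-checkpoint
      exit 143   # 128 + 15, i.e. terminated by SIGTERM
    }
    trap on_term TERM

    while more-work-remains; do
      # Keep each unit of work short; bash only runs the trap between foreground commands.
      do-one-unit-of-work
    done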
If you are on an operations team, you can check which tasks have recently failed by running cf curl /v3/tasks?states=FAILED as an administrator user (you may need to paginate if there are many results). You can also filter on the last updated time of the task, using the window in which your upgrade ran, to narrow the results; see the API docs for details on filtering. When you have located failed tasks, and they failed during the time of the upgrade, you may choose to manually re-run them. You can do that with cf run-task or with cf curl, using the POST method to create a new task.
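For example, the sketch below finds tasks that failed during a hypothetical upgrade window and re-runs one of them. The timestamps, app name, and task command are placeholders; the exact filter parameters are described in the Cloud Controller v3 API documentation, and the cf run-task flags shown are cf CLI v7+ syntax.

    # List failed tasks, optionally narrowed to the upgrade window (timestamps are examples).
    cf curl "/v3/tasks?states=FAILED&updated_ats[gte]=2024-05-01T02:00:00Z&updated_ats[lt]=2024-05-01T06:00:00Z"

    # Re-run a failed task by creating a new task against the same app.
    cf run-task my-app --command "bundle exec rake cleanup" --name rerun-nightly-cleanup

    # Or create the replacement task directly against the v3 API.
    cf curl "/v3/apps/$(cf app my-app --guid)/tasks" -X POST -d '{"command": "bundle exec rake cleanup"}'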
Scheduler Outages
As mentioned previously, Scheduler outages should be rare. If one were to occur, the impact would be that triggers for scheduled jobs and calls do not fire.
Scheduler uses the venerable Quartz project as its job manager, and Quartz is configured within Scheduler such that the following happens during an outage:
1. If Quartz was in the middle of triggering a call or job, Quartz will fail and will not reschedule that call or job. For calls, the duration of this failure window is however long it takes for the call to execute. For jobs, the failure window is only the time it takes to make the API calls to Cloud Controller to queue the task; once the task has been queued, it will not be impacted by a Scheduler outage (unless TAS is also impacted). In most cases, this is a very small window of time.
2. If the outage lasted for a sustained period of time, it is possible that some calls or jobs were missed entirely (i.e. misfires). Scheduler is configured such that a missed call or job is re-run if the current time is less than 60 seconds after the call or job was originally scheduled to run. In fact, if Scheduler comes back up within that 60-second window, this is not even considered a misfire; it simply proceeds as normal.
If Scheduler has been down for more than 60 seconds and calls or jobs have been missed, what happens depends on the type of trigger. For simple triggers, Scheduler runs the missed call or job as soon as it is back in an operational state. For cron triggers, the behavior is similar: Scheduler also re-runs the job, but only once, even if Scheduler was down for a long period of time and multiple executions of the cron trigger were missed.
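The 60-second window corresponds to Quartz's standard misfire threshold. Scheduler's internal configuration is not exposed for you to edit, but for reference this is the Quartz property that governs the behavior, shown with the library's default value:

    # quartz.properties (illustrative only; not a file you modify on the Scheduler tile).
    # The misfire threshold is in milliseconds; 60000 ms (60 seconds) is the Quartz default.
    org.quartz.jobStore.misfireThreshold = 60000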
If you have any questions about the scenarios covered here or encounter other potential failure scenarios, please open a ticket with VMware Tanzu Support and we can provide more information.