How to track failed tasks

Article ID: 298448

Products

VMware Tanzu Application Service for VMs

Issue/Introduction

The official documentation describes tasks as:

In contrast to a long-running process (LRP), tasks run for a finite amount of time, then stop. Tasks run in their own containers and are designed to use minimal resources. After a task runs, Cloud Foundry destroys the container running the task.

As a single-use object, a task can be checked for its state and for a success or failure message.


This KB article aims to provide insight into identifying the reasons tasks fail. We do this by reviewing CAPI endpoint examples and walking through an investigation of task failures for Scheduler jobs. This article focuses on tracking failed job tasks and their failure reasons; a separate KB article describes Scheduler behavior in more depth if further investigation is required.

At a high level, all tasks can be viewed via the CAPI endpoint /v3/tasks:

cf curl "/v3/tasks"

This endpoint accepts query parameters as well. For example, to retrieve only tasks with state "FAILED":

cf curl "/v3/tasks?states=FAILED"

If this is a large environment that runs many tasks, we may need to "walk" multiple pages of results:

cf curl "/v3/tasks?states=FAILED&per_page=100&order_by=updated_at&page=1"

cf curl "/v3/tasks?states=FAILED&per_page=100&order_by=updated_at&page=2"
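If scripting this is preferred, a small loop can walk every page until no more results are returned. This is a sketch rather than part of the product tooling; it assumes a logged-in cf CLI session and jq, and prints each failed task's GUID, creation time, and failure reason:

page=1
while true; do
  body=$(cf curl "/v3/tasks?states=FAILED&per_page=100&order_by=updated_at&page=${page}")
  # Stop when a page comes back with no resources.
  [ "$(echo "${body}" | jq '.resources | length')" -eq 0 ] && break
  echo "${body}" | jq -r '.resources[] | [.guid, .created_at, .result.failure_reason] | @tsv'
  page=$((page + 1))
done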



Environment

Product Version: Other

Resolution

In this section, we will walk through an example where Scheduler job tasks show as FAILED, but the cause of the failures is not apparent and there are no logs associated with these tasks.

Scheduler creates jobs, and these jobs run as tasks. The job name for this demonstration is myjob3. Let us start by looking at the job-history of myjob3.

$ cf job-history myjob3
Getting scheduled job history for myjob3 in org jgainey / space playarea as admin
1 - 4 of 4 Total Results
Execution GUID                         Execution State   Scheduled Time                  Execution Start Time            Execution End Time              Exit Message
3b5974aa-####-4297-a750-679e2fcab5ce   FAILED            Tue, 20 Dec 2022 14:47:22 UTC   Tue, 20 Dec 2022 14:47:22 UTC   Tue, 20 Dec 2022 14:47:23 UTC   202 ACCEPTED - Cloud Controller Accepted Task
0402f3d6-####-499f-815a-6874e8d51b62   FAILED            Tue, 20 Dec 2022 14:43:18 UTC   Tue, 20 Dec 2022 14:43:18 UTC   Tue, 20 Dec 2022 14:43:18 UTC   202 ACCEPTED - Cloud Controller Accepted Task
afb82afb-####-4e86-a2ac-198608a89f4a   FAILED            Mon, 19 Dec 2022 23:26:02 UTC   Mon, 19 Dec 2022 23:26:02 UTC   Mon, 19 Dec 2022 23:26:02 UTC   202 ACCEPTED - Cloud Controller Accepted Task
dffb391a-####-4725-a7d4-36fbeb0fabf8   FAILED            Mon, 19 Dec 2022 23:25:30 UTC   Mon, 19 Dec 2022 23:25:30 UTC   Mon, 19 Dec 2022 23:25:30 UTC   422 UNPROCESSABLE_ENTITY - [memory_in_mb exceeds organization memory quota]


We can see myjob3 was initiated 4 times and each run failed. The "Exit Message" sometimes gives us more insight into the failure. For example, execution dffb391a-d8e0-4725-a7d4-36fbeb0fabf8 has an "Exit Message" of "422 UNPROCESSABLE_ENTITY - [memory_in_mb exceeds organization memory quota]", which tells us that the task was never accepted by Cloud Controller due to organization quota enforcement; a container was never even attempted for that execution because the org quota stopped the process.

Once we lifted the quota, we ran myjob3 an additional 3 times, but all 3 runs also failed. Each of these 3 runs has an "Exit Message" of "202 ACCEPTED - Cloud Controller Accepted Task". This message tells us that Cloud Controller accepted the task, but not what happened to the task or whether the task even ran. Looking at the recent logs for the application this task is associated with, we see nothing to indicate the task ran. To investigate these failed job runs further, we can leverage the Scheduler API and CAPI.
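As a quick sanity check before going to the APIs, the application's recent logs and task list can also be pulled with the cf CLI. The app name myapp below is only a placeholder for the app the Scheduler job runs against:

cf logs myapp --recent   # recent log lines for the app, including any task logs that were emitted
cf tasks myapp           # tasks run against the app, with their state and command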

1 - Obtain the space GUID where the job was created:

cf space playarea --guid
296266cd-fe28-4e2f-b30c-70cab8699f0a


2 - Curl the Scheduler API using the space GUID to identify the job GUID:

curl -k -X GET -H 'Accept: application/json' -H "authorization: $(cf oauth-token)" "https://scheduler.run-18.slot-##.#####-###-###.#####.com/jobs?space_guid=296266cd-####-4e2f-b30c-70cab8699f0a" | jq -r '.resources[] | .guid,.name +"\n"'

6d73ffe3-####-47f0-943a-45515d7b3bb7
myjob

a2184202-####-4d67-90fd-31547a0376f8
myjob2

d9deca23-#####-4869-807e-71f0fd1de776
myjob3

In this example, we are interested in myjob3, so our job GUID is d9deca23-####-4869-807e-71f0fd1de776.


3 - Use CAPI to retrieve the failed tasks for that job GUID:

cf curl "/v3/tasks?states=FAILED&per_page=100&order_by=created_at" | jq -r '.resources[] | select(.name | startswith("d9deca23-807b-4869-807e-71f0fd1de776")) | .guid, .created_at, .result.failure_reason + "\n"'

a7dcdb98-####-4674-93e0-bf3f7775a8e0
2022-12-19T23:26:02Z
insufficient resources: memory

dfd110ea-####-48a0-940d-f176589c7f75
2022-12-20T14:43:18Z
insufficient resources: memory

52979d6b-####-4e3c-b3ca-78256cead7fd
2022-12-20T14:47:22Z
insufficient resources: memory


Once we have this data, we can use the timestamps to match each task to an execution in the cf job-history output.

For example, we can see that

3b5974aa-baaa-4297-a750-679e2fcab5ce   FAILED            Tue, 20 Dec 2022 14:47:22 UTC   Tue, 20 Dec 2022 14:47:22 UTC   Tue, 20 Dec 2022 14:47:23 UTC   202 ACCEPTED - Cloud Controller Accepted Task

matches 

52979d6b-####-4e3c-b3ca-78256cead7fd
2022-12-20T14:47:22Z
insufficient resources: memory
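To pull just that record directly from CAPI (a sketch reusing the same fields as in step 3; the timestamp is the one from this example), something like the following works:

cf curl "/v3/tasks?states=FAILED&per_page=100&order_by=created_at" | jq -r '.resources[] | select(.created_at == "2022-12-20T14:47:22Z") | .guid, .result.failure_reason'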


This tells us that job execution GUID 3b5974aa-baaa-4297-a750-679e2fcab5ce was accepted by Cloud Controller, which initiated task GUID 52979d6b-4628-4e3c-b3ca-78256cead7fd, but that task failed due to "insufficient resources: memory". This is different from exceeding the quota: in this case there is not enough available memory within Diego for the task's container specification, so the task's container fails to spawn. To confirm, we can curl the Scheduler API for the job specification and compare it with what Diego has available:

$ curl -k -X GET -H 'Accept: application/json' -H "authorization: $(cf oauth-token)" https://scheduler.run-18.slot-35.tanzu-gss-labs.vmware.com/jobs/d9deca23-807b-4869-807e-71f0fd1de776  | jq

{
  "guid": "d9deca23-####-4869-807e-71f0fd1de776",
  "name": "myjob3",
  "state": "AVAILABLE",
  "command": "sleep 10",
  "app_guid": "168a7e89-0123-4ced-9074-db6ad0e78ecc",
  "space_guid": "296266cd-fe28-4e2f-b30c-70cab8699f0a",
  "created_at": "2022-12-19T23:25:24Z",
  "updated_at": "2022-12-19T23:25:24Z",
  "memory_in_mb": 10240,
  "disk_in_mb": 1024
}


In this output we can see that for myjob3 to run, it needs to spin up a container with 10 GB of memory. Let's SSH onto a diego_cell and use cfdot to see how much memory Diego has available on each diego_cell:

diego_cell/b0c62abc-9041-4a01-add6-f6ccca69f335:~$ cfdot cell-states | jq '.|.cell_id,.AvailableResources'

"b0c62abc-9041-####-add6-f6ccca69f335"
{
  "MemoryMB": 7640,
  "DiskMB": 90186,
  "Containers": 234
}
"c492f004-b549-####-b6cb-0b92424721eb"
{
  "MemoryMB": 5656,
  "DiskMB": 84042,
  "Containers": 228
}


Looking at the output from cfdot, we can see that there is no diego_cell with 10 GB of available memory, which explains the "insufficient resources: memory" failures.
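To make that check programmatic, the cell states can be filtered for cells that could fit the job's 10240 MB requirement. This is a sketch that assumes cfdot emits one JSON document per cell, as in the output above; an empty list means no cell can place the task:

cfdot cell-states | jq -s 'map(select(.AvailableResources.MemoryMB >= 10240)) | map(.cell_id)'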


Conclusion
We can leverage the Scheduler API and CAPI to investigate task failures further when the output from cf job-history and the task logs do not indicate an obvious failure reason.