Enabling CPU Steal Time Accounting in Tanzu Elastic Application Runtime (TAS) to Diagnose CPU Contention
Article ID: 434983

Products

VMware Tanzu Platform - Cloud Foundry

Issue/Introduction

When operating VMware Tanzu Elastic Application Runtime (TAS), users may observe the following issues:

  • Unexplained latency spikes in application requests.
  • "Missing" heartbeats or frequent "cell evacuation" events for Diego Cells.
  • System components (such as Router, Diego Cell, or BBS) reporting high CPU usage in top or external monitoring, yet vSphere reports the host is not fully utilized.
  • Symptoms similar to those described in Control VMs Unresponsive, where CPU contention leads to performance instability.

Environment

Tanzu Elastic Application Runtime (TAS). 

Cause

In virtualized environments, "Ready Time" (vSphere metric) represents the time a Virtual Machine (VM) waits for physical CPU resources. By default, Linux guests on vSphere are often unaware of this wait time, leading to gaps in observability.

When the hypervisor takes CPU cycles away from a VM to serve other VMs or host processes, the lost time is known as CPU Steal Time. Without specific configuration, the guest OS may attribute this lost time to "System" or "User" usage, or not report it at all. This makes it difficult to distinguish an application that is legitimately busy from one that is being throttled by the underlying infrastructure.

Resolution

To improve observability and address performance bottlenecks, you can enable the stealclock.enable flag in the VMX options of your TAS VMs. This requires a compatible Linux kernel (SLES 15, SLES 12 SP5+, or Ubuntu versions used in recent Stemcells) and vSphere Hardware Version 13 or newer.

Prerequisites

  • SSH access to the Ops Manager VM.
  • Your Ops Manager admin credentials.
  • The om CLI.
  • jq installed (standard on most Ops Manager versions).

Step 1: Environment Setup & Token Retrieval

From an SSH session on the Ops Manager VM, run these commands to authenticate and to identify your TAS Product GUID.

  1. Authenticate the om CLI using an env.yml file, as detailed in the Authenticate with the Tanzu Operations Manager CLI documentation.


  2. Get the TAS Product GUID:

    export PRODUCT_GUID=$(om -e env.yml -k curl -s -p /api/v0/staged/products -x GET | jq -r '.[] | select(.type=="cf") | .guid')
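If the jq filter above matches nothing, PRODUCT_GUID is left empty and every later API path that embeds it is malformed. The following is a minimal sketch of a guard you could run before continuing; the function name require_single_guid is hypothetical and not part of om or Ops Manager.

```shell
# Hypothetical helper: fail early if the GUID lookup returned nothing
# or returned more than one line.
require_single_guid() {
    guid="$1"
    if [ -z "$guid" ]; then
        echo "PRODUCT_GUID is empty: no staged product of type \"cf\" was found" >&2
        return 1
    fi
    # Count the lines in the value; exactly one GUID is expected.
    if [ "$(printf '%s\n' "$guid" | wc -l | tr -d ' ')" -ne 1 ]; then
        echo "PRODUCT_GUID contains more than one value" >&2
        return 1
    fi
    echo "Using product GUID: $guid"
}

require_single_guid "${PRODUCT_GUID:-}" || echo "Fix the GUID lookup before continuing."
```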


Step 2: Create the VM Extension

Define the stealclock.enable VMX option globally in the Ops Manager Director.

om -e env.yml -k curl -s -p /api/v0/staged/vm_extensions -x POST -d \
'{"vm_extension": {"name": "enable-cpu-steal", "cloud_properties": {"vmx_options": {"stealclock.enable": "TRUE"}}}}'
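To confirm that the extension was staged, you can list the staged VM extensions and look for the new name. The sketch below assumes that a GET on the same /api/v0/staged/vm_extensions endpoint returns a JSON document containing each extension's "name" field; has_vm_extension is a hypothetical helper that does a plain text match rather than full JSON parsing.

```shell
# Hypothetical helper: succeeds if the JSON on stdin contains
# an entry whose "name" equals the first argument.
has_vm_extension() {
    grep -Eq "\"name\"[[:space:]]*:[[:space:]]*\"$1\""
}

# Only attempt the API call where om and env.yml are actually available.
if command -v om >/dev/null 2>&1 && [ -f env.yml ]; then
    if om -e env.yml -k curl -s -p /api/v0/staged/vm_extensions -x GET \
        | has_vm_extension enable-cpu-steal; then
        echo "enable-cpu-steal is staged."
    else
        echo "enable-cpu-steal was not found in the staged VM extensions." >&2
    fi
fi
```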

Step 3: Bulk Edit & Upload 

Run these four steps in order. They download every job's resource config, inject the extension, and push the changes back to the API.

  1. Create a dedicated working directory so that only the expected files are modified and uploaded:

    mkdir /tmp/resource_config_updates

  2. Download all Resource Configs:

    This creates a local JSON file for every job in the TAS tile, named by its GUID, placed in /tmp/resource_config_updates.

    om -e env.yml -k curl -s -p /api/v0/staged/products/$PRODUCT_GUID/jobs | jq -r '.jobs[] | "\(.guid) \(.name)"' | while read -r JOB_GUID JOB_NAME; do \
        echo "Exporting $JOB_NAME..."; \
        om -e env.yml -k curl -s -p /api/v0/staged/products/$PRODUCT_GUID/jobs/$JOB_GUID/resource_config > "/tmp/resource_config_updates/${JOB_GUID}.json"; \
    done

  3. Edit all .json files to add the Extension:

    This one-liner uses jq to append "enable-cpu-steal" to the additional_vm_extensions array in every downloaded file.

    for f in /tmp/resource_config_updates/*.json; do \
        jq '.additional_vm_extensions += ["enable-cpu-steal"] | .additional_vm_extensions |= unique' "$f" > "$f.tmp" && mv "$f.tmp" "$f"; \
    done

  4. Upload the edited configurations back to the API:

    This loops through the .json files in /tmp/resource_config_updates and performs a PUT request to update the staged configuration in Ops Manager.

    for f in /tmp/resource_config_updates/*.json; do \
        JOB_GUID=$(basename "$f" .json); \
        echo "Uploading configuration for GUID: $JOB_GUID..."; \
        om -e env.yml -k curl -s -p /api/v0/staged/products/$PRODUCT_GUID/jobs/$JOB_GUID/resource_config -x PUT -d @"$f"; \
    done
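As an optional sanity check between steps 3 and 4, you may want to confirm that every downloaded file now references the extension before anything is uploaded. This is a minimal sketch using a plain text match; list_files_missing_extension is a hypothetical helper name, and the directory is the one created in step 1.

```shell
# Hypothetical helper: print every .json file in a directory that does
# not mention the extension name; return non-zero if any were found.
list_files_missing_extension() {
    dir="$1"
    ext="$2"
    missing=0
    for f in "$dir"/*.json; do
        [ -e "$f" ] || continue   # skip the literal pattern when no files match
        if ! grep -q "\"$ext\"" "$f"; then
            echo "$f"
            missing=1
        fi
    done
    return "$missing"
}

if [ -d /tmp/resource_config_updates ]; then
    list_files_missing_extension /tmp/resource_config_updates enable-cpu-steal \
        && echo "All files reference enable-cpu-steal."
fi
```

Any path printed by the helper is a file that step 3 did not update and that should be inspected before it is uploaded.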


Step 4: Apply Changes

  1. Navigate to the Ops Manager Installation Dashboard.
  2. Click Review Pending Changes.
  3. Ensure the VMware Tanzu Application Service tile is selected.
  4. Click Apply Changes.
  NOTE: Applying these changes recreates all VMs in the deployment so that the VMX option takes effect.

Verification

After the deployment finishes, SSH into any VM (e.g., a Diego Cell) and run top.

  • The st value (steal time) in the %Cpu(s) summary line will now report non-zero values when the vSphere host is overcommitted.
  • If st stays at 0.0 while vCenter shows high CPU Ready, ensure the VM is running on Hardware Version 13 or newer.
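
For a scriptable alternative to reading top interactively, steal time can also be sampled from /proc/stat, where steal is the eighth counter on the aggregate "cpu" line. The following is a minimal sketch; steal_percent is a hypothetical helper, and the one-second interval is an arbitrary choice.

```shell
# Hypothetical helper: given two "cpu ..." lines from /proc/stat, print the
# percentage of elapsed CPU time that was steal time.
# Fields after the "cpu" label: user nice system idle iowait irq softirq steal
steal_percent() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total[NR] = 0
            for (i = 2; i <= 9; i++) total[NR] += $i   # user through steal
            steal[NR] = $9
        }
        END {
            dt = total[2] - total[1]
            if (dt <= 0) { print "0.0"; exit }
            printf "%.1f\n", 100 * (steal[2] - steal[1]) / dt
        }'
}

# Take two samples one second apart on a Linux guest.
if [ -r /proc/stat ]; then
    first=$(grep '^cpu ' /proc/stat)
    sleep 1
    second=$(grep '^cpu ' /proc/stat)
    echo "Steal over the last second: $(steal_percent "$first" "$second")%"
fi
```

A value that stays well above zero across repeated samples points to contention on the host rather than load inside the guest.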