In order to understand why this occurs, we must understand where these metrics originate from.
Container metrics are emitted by the Rep process (located on the Diego Cell instance group for standard TAS deployments, and on the Compute instance group for small footprint TAS deployments).
These messages look like the following:
{ "timestamp": "1600967371255995124", "source_id": "5be0bad9-####-4f4a-b962-cbed63c50703", "instance_id": "4", "deprecated_tags": {}, "tags": { "deployment": "cf-617db73052fb756cb167", "index": "c39f4bcc-70ae-4f22-b35f-b592b94174bb", "instance_id": "4", "ip": "10.###.##.23", "job": "diego_cell", "origin": "rep", "process_id": "5be0bad9-76d9-4f4a-b962-cbed63c50703", "process_instance_id": "28f86b60-5d76-4c90-54dd-6e82", "product": "Pivotal Application Service", "source_id": "5be0bad9-####-4f4a-b962-cbed63c50703", "system_domain": "run-36.slot-##.###.####.com" }, "gauge": { "metrics": { "cpu": { "unit": "percentage", "value": 0.23444877907904652 }, "disk": { "unit": "bytes", "value": 132345856 }, "disk_quota": { "unit": "bytes", "value": 1073741824 }, "memory": { "unit": "bytes", "value": 147154137 }, "memory_quota": { "unit": "bytes", "value": 1073741824 } } } }
Now we know where these metrics originate.
One thing to consider is that when utilities request these container metrics, they want current values. For example, if an application had 200% CPU usage 20 minutes ago, we do not want 200% reported as the current CPU usage when we check resource usage now (for example, by running $ cf app APPNAME). To ensure the values are current, the utilities specify a time frame when making the request. Typically that time frame is now minus 2 minutes.
If there are no container metrics for the specified application process within the requested time frame, the resource utilization statistics for that process will default to 0.
The flow can now be visualized:
1. The rep process emits container metrics.
2. These metrics flow through the loggregator (or via syslog) to log-cache.
3. Utilities request these metrics for the specified time frame (typically now minus 2 minutes).
If, for any reason, the container metrics for the requested application are not in log-cache for that specific time frame, we will see 0's for the resource utilization statistics.
You can verify exactly what these utilities receive by running the following log-cache API call (be sure the terminal you run this from has a CF CLI session authenticated, so that cf oauth-token can supply a valid token):
export TWOMinsAgo=$(date -v -2M '+%s000000000')
export ZEROMinsAgo=$(date -v -0M '+%s000000000')
curl -k -H "Authorization: $(cf oauth-token)" "https://log-cache.<System Domain>/api/v1/read/<App Guid>?start_time=$TWOMinsAgo&end_time=$ZEROMinsAgo&envelope_types=GAUGE"
Replace <System Domain> with your System domain.
Replace <App Guid> with an app guid you are interested in.
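Note that the date -v syntax used above is BSD/macOS specific. If you are running the call from a Linux machine with GNU date, an equivalent sketch (verify locally before relying on it) is:
# GNU date equivalent of the two timestamps above
export TWOMinsAgo=$(date -d '-2 minutes' '+%s000000000')
export ZEROMinsAgo=$(date '+%s000000000')
If jq is available, you can also pipe the response through it to narrow the output down to just the gauge values (this assumes the usual log-cache read response shape of envelopes.batch):
curl -k -H "Authorization: $(cf oauth-token)" "https://log-cache.<System Domain>/api/v1/read/<App Guid>?start_time=$TWOMinsAgo&end_time=$ZEROMinsAgo&envelope_types=GAUGE" | jq '.envelopes.batch[].gauge.metrics'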
The reasons why the metrics are not making it to log-cache can vary widely, but here are some common things to check and common ways to mitigate:
First, make sure the container metrics are flowing through the firehose in one of the following ways:
If you have access to the nozzle plugin, run the following to see if any metrics come through:
cf nozzle -d -n | grep -i 'ContainerMetric' | grep <App-Name>
# Example
cf nozzle -d -n | grep -i 'ContainerMetric' | grep 'spring-music'
Replace <App-Name> with an app name you are interested in.
If you have access to the log-stream plugin instead, run the following:
cf log-stream -t gauge | grep <App-Name>
# Example
cf log-stream -t gauge | grep 'spring-music'
Replace <App-Name> with an app name you are interested in.
If the metrics can be seen with the above commands, then it is worth investigating why log-cache is not receiving them. This can happen for many reasons, but one of the biggest contributors is log loss. If there is egress log loss on the doppler job for the log-cache-nozzle consumer, that could explain why those metrics are sometimes missing from log-cache: they were dropped. See the Loggregator scaling documentation for information on how to scale loggregator components.
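A rough way to check for drops, if you have the nozzle plugin available, is to watch the firehose for Loggregator's dropped counter metrics; a steady stream of these from doppler or the other loggregator components suggests envelopes are being lost before they reach log-cache:
cf nozzle -n | grep -i 'dropped'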
If the metrics cannot be seen with the nozzle or log-stream commands above after a few minutes, please contact Tanzu Support.
Check to see if there are any helpful errors in the syslog agent log files. Check the following files on the VM where the application container of interest is running (the host can be found with cf curl /v2/apps/<App-Guid>/stats):
/var/vcap/sys/log/loggr-syslog-agent/*
Also check if there are helpful errors within the log-cache-syslog-server logs:
Check the following files on the Doppler VMs (or Control VMs if using small footprint TAS):
/var/vcap/sys/log/log-cache-syslog-server/*
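A convenient way to review both sets of log files (a sketch; the deployment name, instance group, and index are placeholders that will differ per environment) is to tail them over bosh ssh:
# On the Diego Cell (or Compute) VM hosting the application container
bosh -d <Deployment Name> ssh diego_cell/<index> -c 'sudo tail -n 200 /var/vcap/sys/log/loggr-syslog-agent/*.log'
# On the Doppler (or Control) VM
bosh -d <Deployment Name> ssh doppler/<index> -c 'sudo tail -n 200 /var/vcap/sys/log/log-cache-syslog-server/*.log'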
If the steps in this article do not yield results, please contact Tanzu Support.
Now that we know where the container statistics come from, we can focus on how utilities such as the CF CLI or Apps Manager obtain these statistics.
The CF CLI and Apps Manager obtain these metrics from log-cache. The log-cache is an in-memory caching layer that sits on the Doppler instance group for standard TAS deployments or the Control instance group for small footprint TAS.
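If you have the log-cache CLI plugin installed, a quick way to see whether log-cache currently holds gauge envelopes for a given app is cf tail (a sketch; flag names may vary between plugin versions):
cf tail <App-Name> --envelope-type GAUGE --lines 100
# Example
cf tail spring-music --envelope-type GAUGE --lines 100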
log-cache has 2 ways it can ingest messages:
1. From the firehose as a v2 consumer. This is the traditional and default setting that TAS deploys with. This means that the log-cache gets its messages in this fashion:
Prior to the Shared Nothing Architecture:
loggregator_agent --> doppler --> rlp --> log-cache-nozzle --> log-cache
In the Shared Nothing Architecture:
loggr-forwarder-agent --> loggregator_agent --> doppler --> rlp --> log-cache-nozzle --> log-cache
2. From the log-cache-syslog-server.
By default, the log-cache-syslog-server is disabled and the log-cache-nozzle is enabled.
When the setting "Enable Log Cache syslog ingestion" is enabled, the log-cache-syslog-server listens on port 6067 on the Doppler instance group for standard TAS deployments (or the Control instance group for small footprint TAS), and the log-cache-nozzle is disabled.
Note that the loggr-syslog-agents always treat the log-cache-syslog-server as an aggregate drain, in case log-cache is configured to receive logs and metrics over syslog. If the log-cache-syslog-server is disabled, the only effect you should see is an error log every 15 seconds; if you see any other behavior, such as excess CPU or memory usage, please let us know by contacting Support. The harmless log looks like the following:
2020/09/18 08:47:21 failed to write to doppler.service.cf.internal:6067, retrying in 15s, err: dial tcp 10.213.60.48:6067: connect: connection refused
With log-cache configured to ingest from syslog, the message path is different:
loggr-forwarder-agent --> loggr-syslog-agent --> log-cache-syslog-server --> log-cache
Log-cache can only be configured to ingest one way at a time - either from the firehose or from syslog.
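If you are unsure which ingestion path is active on your foundation, one quick check (a sketch; the deployment name and index are placeholders) is to see whether anything is listening on port 6067 on a Doppler or Control VM:
bosh -d <Deployment Name> ssh doppler/<index> -c 'sudo ss -lntp | grep 6067'
# A listener on 6067 indicates log-cache-syslog-server is enabled; no output indicates it is disabled,
# in which case the loggr-syslog-agents log the harmless "connection refused" error shown above.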