In order to understand why this occurs, we must understand where these metrics originate from.
Container metrics are emitted by the Rep process (located on the Diego Cell instance group for standard TAS deployments, and on the Compute instance group for small footprint TAS deployments).
These messages look like the following:
{ "timestamp": "1600967371255995124", "source_id": "5be0bad9-####-4f4a-b962-cbed63c50703", "instance_id": "4", "deprecated_tags": {}, "tags": { "deployment": "cf-617db73052fb756cb167", "index": "c39f4bcc-70ae-4f22-b35f-b592b94174bb", "instance_id": "4", "ip": "10.###.##.23", "job": "diego_cell", "origin": "rep", "process_id": "5be0bad9-76d9-4f4a-b962-cbed63c50703", "process_instance_id": "28f86b60-5d76-4c90-54dd-6e82", "product": "Pivotal Application Service", "source_id": "5be0bad9-####-4f4a-b962-cbed63c50703", "system_domain": "run-36.slot-##.###.####.com" }, "gauge": { "metrics": { "cpu": { "unit": "percentage", "value": 0.23444877907904652 }, "disk": { "unit": "bytes", "value": 132345856 }, "disk_quota": { "unit": "bytes", "value": 1073741824 }, "memory": { "unit": "bytes", "value": 147154137 }, "memory_quota": { "unit": "bytes", "value": 1073741824 } } } }
Now we know where these metrics originate.
One thing to consider is that when utilities request these container metrics, they want current values. For example, if an application had 200% CPU usage 20 minutes ago, we do not want 200% reported as the current CPU usage when we check resource usage now (for example, by running $ cf app APPNAME). To ensure the values are current, the utilities specify a time frame when making the request. Typically that time frame is now minus 2 minutes.
If there are no container metrics for the specified application process within the requested time frame, the resource utilization statistics for that process will default to 0.
The flow can now be visualized:
1. The rep process emits container metrics.
2. These metrics flow through the loggregator (or via syslog) to log-cache.
3. Utilities request these metrics for the specified time frame (typically now minus 2 minutes).
If, for any reason, the container metrics for the requested application are not in log-cache for that specific time frame, we will see 0's for the resource utilization statistics.
You can verify exactly what these utilities receive by running the following log-cache API call (be sure the terminal you run this from has a CF CLI session authenticated, so that cf oauth-token can supply a valid token):
export TWOMinsAgo=$(date -v -2M '+%s000000000')
export ZEROMinsAgo=$(date -v -0M '+%s000000000')
curl -k -H "Authorization: $(cf oauth-token)" "https://log-cache.<System Domain>/api/v1/read/<App Guid>?start_time=$TWOMinsAgo&end_time=$ZEROMinsAgo&envelope_types=GAUGE"
Replace <System Domain> with your System domain.
Replace <App Guid> with an app guid you are interested in.
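Note that the date -v syntax used above is BSD/macOS specific. If you are running the call from a Linux machine with GNU date, an equivalent sketch (verify locally before relying on it) is:
# GNU date equivalent of the two timestamps above
export TWOMinsAgo=$(date -d '-2 minutes' '+%s000000000')
export ZEROMinsAgo=$(date '+%s000000000')
If jq is available, you can also pipe the response through it to narrow the output down to just the gauge values (this assumes the usual log-cache read response shape of envelopes.batch):
curl -k -H "Authorization: $(cf oauth-token)" "https://log-cache.<System Domain>/api/v1/read/<App Guid>?start_time=$TWOMinsAgo&end_time=$ZEROMinsAgo&envelope_types=GAUGE" | jq '.envelopes.batch[].gauge.metrics'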
The reasons why the metrics are not making it to log-cache can vary widely, but here are some common things to check and common ways to mitigate:
First, make sure the container metrics are flowing through the firehose in one of the following ways:
If you have access to the nozzle plugin, run the following to see if any metrics come through:
cf nozzle -d -n | grep -i 'ContainerMetric' | grep <App-Name>
# Example
cf nozzle -d -n | grep -i 'ContainerMetric' | grep 'spring-music'
Replace <App-Name> with an app name you are interested in.
If you have access to the log-stream plugin instead, run the following:
cf log-stream -t gauge | grep <App-Name>
# Example
cf log-stream -t gauge | grep 'spring-music'
Replace <App-Name> with an app name you are interested in.
If the metrics can be seen with the above commands, then it is worth investigating why log-cache is not receiving them. This can happen for many reasons, but one of the biggest contributors is log loss. If there is egress log loss on the doppler job for the log-cache-nozzle consumer, that could explain why those metrics are sometimes missing from log-cache: they were dropped. See the Loggregator scaling documentation for information on how to scale loggregator components.
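A rough way to check for drops, if you have the nozzle plugin available, is to watch the firehose for Loggregator's dropped counter metrics; a steady stream of these from doppler or the other loggregator components suggests envelopes are being lost before they reach log-cache:
cf nozzle -n | grep -i 'dropped'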
If the metrics cannot be seen with the nozzle or log-stream commands above after a few minutes, please contact Tanzu Support.
Check to see if there are any helpful errors in the syslog agent log files. Check the following files on the VM where the application container of interest is running (the host can be found with cf curl /v2/apps/<App-Guid>/stats):
/var/vcap/sys/log/loggr-syslog-agent/*
Also check if there are helpful errors within the log-cache-syslog-server logs:
Check the following files on the Doppler VMs (or Control VMs if using small footprint TAS):
/var/vcap/sys/log/log-cache-syslog-server/*
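A convenient way to review both sets of log files (a sketch; the deployment name, instance group, and index are placeholders that will differ per environment) is to tail them over bosh ssh:
# On the Diego Cell (or Compute) VM hosting the application container
bosh -d <Deployment Name> ssh diego_cell/<index> -c 'sudo tail -n 200 /var/vcap/sys/log/loggr-syslog-agent/*.log'
# On the Doppler (or Control) VM
bosh -d <Deployment Name> ssh doppler/<index> -c 'sudo tail -n 200 /var/vcap/sys/log/log-cache-syslog-server/*.log'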
If the steps in this article do not yield results, please contact Tanzu Support.
Now that we know where the container statistics come from, we can focus on how utilities such as the CF CLI or Apps Manager obtain these statistics.
The CF CLI and Apps Manager obtain these metrics from log-cache. The log-cache is an in-memory caching layer that sits on the Doppler instance group for standard TAS deployments or the Control instance group for small footprint TAS.
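If you have the log-cache CLI plugin installed, a quick way to see whether log-cache currently holds gauge envelopes for a given app is cf tail (a sketch; flag names may vary between plugin versions):
cf tail <App-Name> --envelope-type GAUGE --lines 100
# Example
cf tail spring-music --envelope-type GAUGE --lines 100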
log-cache has 2 ways it can ingest messages:
1. From the firehose as a v2 consumer. This is the traditional and default setting that TAS deploys with. This means that the log-cache gets its messages in this fashion:
Prior to the Shared Nothing Architecture:
loggregator_agent --> doppler --> rlp --> log-cache-nozzle --> log-cache
In the Shared Nothing Architecture:
loggr-forwarder-agent --> loggregator_agent --> doppler --> rlp --> log-cache-nozzle --> log-cache
2. From the log-cache-syslog-server.
By default, the log-cache-syslog-server is disabled and the log-cache-nozzle is enabled.
When the setting "Enable Log Cache syslog ingestion" is enabled, the log-cache-syslog-server listens on port 6067 on the Doppler instance group for standard TAS deployments (or the Control instance group for small footprint TAS), and the log-cache-nozzle is disabled.
Note that the loggr-syslog-agents always treat the log-cache-syslog-server as an aggregate drain, in case log-cache is configured to receive logs and metrics over syslog. If the log-cache-syslog-server is disabled, the only effect you should see is an error log every 15 seconds; if you see any other behavior, such as excess CPU or memory usage, please let us know by contacting Support. The harmless log looks like the following:
2020/09/18 08:47:21 failed to write to doppler.service.cf.internal:6067, retrying in 15s, err: dial tcp 10.213.60.48:6067: connect: connection refused
With log-cache configured to ingest from syslog, the message path is different:
loggr-forwarder-agent --> loggr-syslog-agent --> log-cache-syslog-server --> log-cache
Log-cache can only be configured to ingest one way at a time - either from the firehose or from syslog.
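If you are unsure which ingestion path is active on your foundation, one quick check (a sketch; the deployment name and index are placeholders) is to see whether anything is listening on port 6067 on a Doppler or Control VM:
bosh -d <Deployment Name> ssh doppler/<index> -c 'sudo ss -lntp | grep 6067'
# A listener on 6067 indicates log-cache-syslog-server is enabled; no output indicates it is disabled,
# in which case the loggr-syslog-agents log the harmless "connection refused" error shown above.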