How to work with Container Metrics on Pivotal Cloud Foundry

Article ID: 297824

Products

VMware Tanzu Application Service for VMs

Issue/Introduction

Pivotal Support gets the following questions frequently:

  • Where are Container Metrics stored?
  • How frequently will stats be updated?
  • Who is responsible for the updates?
  • What is the workflow around fetching Container Metrics?


Environment


Resolution

Below is the workflow for retrieving Container Metrics; an illustrative request and response follow the list.

  1. The client initiates a request to the stats endpoint. This could be running cf app my-app or a direct request such as cf curl /v2/apps/<APP_GUID>/stats.
  2. The Cloud Controller receives the API request.
  3. The Cloud Controller talks to the Diego TPS Listener, which runs on the Diego Brain VMs. The TPS Listener provides the Cloud Controller with information about running processes on Diego, including metrics.
  4. The TPS Listener fetches the application metrics from Doppler and returns them to the Cloud Controller.
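
To illustrate the flow end to end, here is the request from step 1 and a trimmed-down sketch of what comes back. The app GUID is a placeholder and the response is abridged to the fields most relevant to metrics; the exact fields depend on your Cloud Controller version.

Example:
  cf curl /v2/apps/<APP_GUID>/stats

  # Abridged, illustrative response: one entry per application instance index
  {
    "0": {
      "state": "RUNNING",
      "stats": {
        "usage": { "time": "...", "cpu": 0.001, "mem": 104857600, "disk": 209715200 },
        "mem_quota": 268435456,
        "disk_quota": 1073741824,
        ...
      }
    }
  }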

Diagram: Cloud Foundry Application Metrics workflow

Instructions to Troubleshoot Problems with Container Metrics

  1. To begin troubleshooting, it helps to test the full flow of traffic. You can do this by running cf curl /v2/apps/<APP_GUID>/stats (you can get an app GUID by running cf app <APP_NAME> --guid). This sends a request to the Cloud Controller and asks it to fetch metrics on your behalf. If it returns metric data, everything is working. If it does not, or if the request is very slow, continue troubleshooting.
  2. When the request to the Cloud Controller is slow, this is usually because the Cloud Controller is under-provisioned and needs to be scaled up. Two resources commonly cause issues.
    The first is CPU. Certain requests to the Cloud Controller, such as those that set inline-relations-depth to a value greater than zero, can put a lot of load on it. This can be confirmed by looking for high CPU usage or a high Load Average on the Cloud Controller VMs (see the example commands after this list).
    The second limit is the number of threads used by the Cloud Controller. The Cloud Controller has a fixed pool of threads for processing incoming requests; when the pool is exhausted, requests queue up and wait for a free thread. This can be identified through the Cloud Controller metrics, where the number of available threads will be low or zero. This is the usual cause of slow requests when CPU usage is low.
    To resolve either of these issues, scale up the number of Cloud Controller instances or reduce the load on the Cloud Controller VMs.
  3. If you are receiving an HTTP error from the Cloud Controller, you can find more information about it in the Cloud Controller logs. You can download these via Operations (Ops) Manager or through BOSH (see the example commands after this list).
    HTTP errors from the Cloud Controller typically produce a backtrace in the logs that indicates the line of code where the problem occurred. It is not possible to list all of the causes here, but a common problem when pulling metrics is that the Cloud Controller receives an error from the TPS Listener. In that case, you will also need to pull the logs from the Diego Brain VMs to see why the TPS Listener failed. If you need assistance reviewing the Cloud Controller or TPS logs, please contact Support.
  4. Because the TPS Listener also requests information from an external resource, a similar issue can occur where it fails because of an error with Loggregator. In this case, you can troubleshoot the Loggregator errors by connecting directly to the firehose yourself. To connect to the firehose and pull metrics, you can use the Firehose CF CLI plugin.

    Example:
    cf nozzle --filter ContainerMetric  # run as admin to get metrics from all containers
    cf app-nozzle APP_NAME --filter ContainerMetric # run as a user to get metrics from a specific app
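
The commands below sketch the checks described in steps 1 through 3. They assume the BOSH CLI is targeted at your foundation; the deployment name and the instance group names for the Cloud Controller and Diego Brain vary between installations, so substitute the values shown by bosh deployments and bosh instances in your environment.

Example:
  # Step 1: exercise the full flow for a single application
  cf app my-app --guid                       # prints the app GUID
  cf curl /v2/apps/<APP_GUID>/stats          # request stats through the Cloud Controller

  # Step 2: check CPU usage and Load Average on the Cloud Controller VMs
  bosh -d <CF_DEPLOYMENT> vms --vitals

  # Step 3: download Cloud Controller and Diego Brain logs for review
  bosh -d <CF_DEPLOYMENT> logs <CLOUD_CONTROLLER_INSTANCE_GROUP>
  bosh -d <CF_DEPLOYMENT> logs <DIEGO_BRAIN_INSTANCE_GROUP>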
    

Impact

As you can see from the diagram above, asking the Cloud Controller to retrieve stats for you is not a trivial request. As such, this endpoint is not suitable for integration with monitoring systems. Having a monitoring system poll metrics through Cloud Controller will put a significant load on your Cloud Controller instances and either cause performance issues or require you to significantly scale up the number of Cloud Controllers in your deployment.

Instead of polling for metrics from the Cloud Controller, you should listen to the firehose and pull metrics directly from there. As the diagram above shows, this cuts out both the Cloud Controller and the TPS Listener, which makes retrieving the metrics much more efficient.
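
A minimal sketch of that approach is shown below. It assumes the Firehose CF CLI plugin referenced in step 4 is installed; the plugin name and repository shown here are common community defaults and may differ in your environment (cf repo-plugins lists what is available to you).

Example:
  # Install the Firehose plugin once, then stream metrics directly from the firehose
  cf install-plugin "Firehose Plugin" -r CF-Community
  cf nozzle --filter ContainerMetric > container-metrics.log   # admin access; all containers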