How to identify the source app / client which over-drives the cloud controller API


Article ID: 298103


Updated On:

Products

VMware Tanzu Application Service for VMs

Issue/Introduction

Cloud controllers can be over-driven by various clients / apps, including platform services (e.g., autoscaling_service), third-party services (e.g., nozzles consuming the Loggregator firehose), and on-demand service brokers.

As a consequence, you may see the following symptoms:
 

  • cf CLI commands fail, which impacts developer activity
  • The Apps Manager UI is stuck at loading

This article contains steps to identify these noisy clients / apps.

Environment

Product Version: 2.9

Resolution

First, we should try the following:
 

  1. Check cloud controller key metrics to confirm that it is facing resource contention
  2. Use the top-logs script to get insight into the load on the cloud controllers

The top-logs script analyzes cloud controller's nginx-access.log, which includes the "User-Agent" of the requests but does not include information about the UAA clients / users behind it.

Below is an example line from nginx-access.log which reveals that the User-Agent of the request is "go-cli":

api.example.com - [10/Sep/2020:06:55:44 +0000] "DELETE /v2/routes/0d776be6-####-437ab7388f07 HTTP/1.1" 204 226 "-" "go-cli 6.49.0+d0dfa93bb.2020-01-07 / linux" 10.####.43, 10.####.5, 10.####.31 vcap_request_id:8f9f326f-####-3ea0d56174d7::f94e5b5a-####-53d0b9f7dad6 response_time:0.084
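
If you only need a quick, ad-hoc view of the busiest User-Agents without running the top-logs script, a minimal sketch is the following one-liner. It assumes the default log format shown above, where the User-Agent is the third double-quoted field:

$ awk -F'"' '{print $6}' nginx-access.log | sort | uniq -c | sort -rn | head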


To identify the UAA client / user sending the request, we can locate the same request in cloud_controller_ng.log by matching on the vcap_request_id field.
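
For example, grepping cloud_controller_ng.log for the vcap_request_id taken from the nginx line above (the GUID here is the redacted example from that line) returns the entry shown below:

$ grep "8f9f326f-####-3ea0d56174d7" cloud_controller_ng.log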

{"timestamp":1599720944.2169876,"message":"Started DELETE \"/v2/routes/0d776be6-####-437ab7388f07\" for user: 54a6a01a-####-353ddfaaaf24, ip: 10.####.43 with vcap-request-id: 8f9f326f-####-3ea0d56174d7::f94e5b5a-####-53d0b9f7dad6 at 2020-09-10 06:55:44 UTC","log_level":"info","source":"cc.api","data":{"request_guid":"8f9f326f-####-3ea0d56174d7::f94e5b5a-####-53d0b9f7dad6"},"thread_id":47051152986220,"fiber_id":4705####420,"process_id":7,"file":"/var/vcap/data/packages/cloud_controller_ng/56f7####b291/cloud_controller_ng/middleware/request_logs.rb","lineno":12,"method":"call"}


Who is the user "54a6a01a-####-353ddfaaaf24"?

$ uaac users | egrep "^    (username:|id:)" | grep "54a6a01a-####-353ddfaaaf24" -A 1
    id: 54a6a01a-####-353ddfaaaf24
    username: healthwatch_space_developer

# or using cf CLI 
$ cf curl /v2/users/54a6a01a-####-353ddfaaaf24
{
   "metadata": {
      "guid": "54a6a01a-####-353ddfaaaf24",
      "url": "/v2/users/54a6a01a-####-353ddfaaaf24",
      "created_at": "2020-07-25T02:07:42Z",
      "updated_at": "2020-07-25T02:10:15Z"
   },
   "entity": {
      "admin": false,
      "active": false,
      "default_space_guid": null,
      "username": "healthwatch_space_developer",
......

It is possible that a request does not involve any user; in that case, the "for user" field in cloud_controller_ng.log may contain a UAA client name instead.

How about the IP "10.####.43"?

$ bosh vms | grep "10.####.43"
diego_cell/b2e8b2e3-####-4d83e6272cd4                   	running	az2	10.####.43 ......

It is likely the request is from some app running on this Diego cell. To further narrow down the app, we can use cf top or cfdot.
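
As a rough sketch (cfdot must be run from a VM with BBS access, such as a Diego cell, and the exact commands / output fields can vary by Diego version), the actual LRPs on that cell IP can be listed and mapped back to app GUIDs:

# list actual LRPs and keep the entries whose address matches the cell IP
$ cfdot actual-lrp-groups | grep "10.####.43"

# the process_guid of a matching entry starts with the app GUID (first 36 characters),
# which can then be resolved to an app name, for example:
$ cf curl /v3/apps/<APP_GUID>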

The following procedure summarizes the approach above:

1. List out the top users / clients.

# cat cloud_controller_ng.log | grep "for user" | awk -F'for user: ' '{print $2}' | awk -F',' '{print $1}' | sort | uniq -c | sort -n
     34 network-policy-server
    472 usage_service
    624 autoscaling_service
   1035 scheduler_service
   1191 loggregator
   2307 677d10d7-####-22ee2de082d1
   2385 068004fa-####-d414e8b1d2f0
   4128 54a6a01a-####-353ddfaaaf24
   4901 tile_installer
  13449


2. Identify the users based on GUIDs obtained above:

$ uaac users | egrep "^    (username:|id:)"
    id: 068004fa-####-d414e8b1d2f0
    username: admin
    id: 677d10d7-####-22ee2de082d1
    username: system_services
    id: 54a6a01a-####-353ddfaaaf24
    username: healthwatch_space_developer
    ... other users omitted ...

# or using cf CLI 
$ cf curl /v2/users/54a6a01a-####-353ddfaaaf24
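
If a name from step 1 is a UAA client rather than a user (for example, usage_service), it can be looked up with uaac clients instead; a minimal sketch, with the client name taken from the step 1 output:

$ uaac clients | grep -A 3 "usage_service"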


3. Review the recent request paths from a particular user / client:

# cat cloud_controller_ng.log | grep "for user: 54a6a01a-####-353ddfaaaf24" | awk -F'Started ' '{print $2}' | awk -F'for user' '{print $1}' | tail -n 100
......
GET \"/v3/apps/8c788656-####-c7ea7c7a525f/processes\"
DELETE \"/v2/routes/0d776be6-####-437ab7388f07\"
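
If tailing the raw list is too noisy, the same extraction can be aggregated to show which methods / endpoints dominate (a minimal sketch; the GUIDs embedded in the paths are not normalized, so identical endpoints with different GUIDs are counted separately):

# cat cloud_controller_ng.log | grep "for user: 54a6a01a-####-353ddfaaaf24" | awk -F'Started ' '{print $2}' | awk -F'for user' '{print $1}' | sort | uniq -c | sort -rn | head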

It is worth noting that some nozzles, which poll information about orgs / spaces / apps, may send heavy requests with query parameters such as "inline-relations-depth".

4. Which IPs did the recent requests from this user come from?

# cat cloud_controller_ng.log | grep "for user: 54a6a01a-####-353ddfaaaf24" | tail -n 100 | awk -F'ip: ' '{print $2}' | awk -F' with' '{print $1}' | sort | uniq -c
     97 10.####.43
      2 10.####.46
      1 10.####.47


These IPs may be Diego cells or other system components:

$ bosh vms --column instance --column ips | egrep "10.####.4(3|6|7)"
diego_cell/b2e8b2e3-####-4d83e6272cd4                   	10.####.43
loggregator_trafficcontroller/39e584f3-####-11cbdf9d4470	10.####.47

# Using cf top, we find that the app "cf-health-check" is running on cell .43.
# This app pushes / deletes a smoke-test app via the user healthwatch_space_developer.


5. If you suspect any aggressive firehose nozzles, you can directly check the cell IPs of the nozzle app instances and compare them with the cell IPs extracted from cloud_controller_ng.log in the previous step:

$ cf curl /v2/apps/$(cf app --guid NOZZLE_APP_NAME)/stats | grep host


6. Once we have identified the source apps / components causing the problem, we can shut them down temporarily to quickly confirm the theory, and then tune settings such as the nozzle polling interval. It is also possible that no single aggressive source stands out. In that case, verify whether the workload on the platform (cells / apps / services) has been growing while the cloud controllers remain under-scaled and unable to keep up.
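
For example, a suspected nozzle or service app can be stopped briefly while watching the cloud controller load (NOZZLE_APP_NAME is a placeholder for the suspected app):

$ cf stop NOZZLE_APP_NAME
# ... observe cloud controller key metrics / logs for a few minutes ...
$ cf start NOZZLE_APP_NAME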