When the TAS API endpoint returns 404 errors, it can mean that the route is missing from the GoRouter's routing table.
The Cloud Controller (CC) VM is the destination for the TAS API endpoint. The route_registrar job on that VM performs a health check against the Cloud Controller. If the health check passes, route_registrar publishes its route (the TAS API endpoint) to NATS, where it is consumed by the GoRouter and added to the routing table.
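To confirm whether the api route is currently registered, you can inspect the GoRouter's routing table directly. The following is a rough sketch only; it assumes a BOSH deployment named cf, the default GoRouter status port 8080, and that the status credentials can be read from the GoRouter's config file, any of which can differ in your environment:

# SSH onto a router VM (deployment and instance names are assumptions)
bosh -d cf ssh router/0

# Find the routing-table status credentials in the GoRouter config
sudo grep -A 3 "status:" /var/vcap/jobs/gorouter/config/gorouter.yml

# Dump the routing table and look for the TAS API route
curl -s "http://STATUS_USER:STATUS_PASSWORD@localhost:8080/routes" | grep "api.system.FQDN"

If the api route is missing from the output, the Cloud Controllers have unregistered it.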
If the Cloud Controller health check fails, route_registrar unregisters its route. This is the desired behavior: if a Cloud Controller is not healthy, you do not want requests routed to it. However, when all Cloud Controller VMs fail their health checks, all of them unregister their routes, and requests to the TAS API endpoint return 404 errors.
Anytime a Cloud Controller fails a health check and unregisters its route, the route_registrar job writes an entry to its log (/var/vcap/sys/log/route_registrar/route_registrar.stdout.log):
{"timestamp":"2020-03-24T17:13:09.118718176Z","level":"info","source":"Route Registrar","message":"Route Registrar.healthchecker errored for route","data":{"route":{"Type":"","Name":"api","Port":null,"TLSPort":9024,"Tags":{"component":"CloudController"},"URIs":["api.system.FQDN"],"RouterGroup":"","Host":"","ExternalPort":null,"RouteServiceUrl":"","RegistrationInterval":20000000000,"HealthCheck":{"Name":"api-health-check","ScriptPath":"/var/vcap/jobs/cloud_controller_ng/bin/cloud_controller_ng_health_check","Timeout":6000000000},"ServerCertDomainSAN":"cloud-controller-ng.service.cf.internal"}}}
{"timestamp":"2020-03-24T17:13:09.118741176Z","level":"info","source":"Route Registrar","message":"Route Registrar.Unregistering route","data":{"route":{"Type":"","Name":"api","Port":null,"TLSPort":9024,"Tags":{"component":"CloudController"},"URIs":["api.system.FQDN"],"RouterGroup":"","Host":"","ExternalPort":null,"RouteServiceUrl":"","RegistrationInterval":20000000000,"HealthCheck":{"Name":"api-health-check","ScriptPath":"/var/vcap/jobs/cloud_controller_ng/bin/cloud_controller_ng_health_check","Timeout":6000000000},"ServerCertDomainSAN":"cloud-controller-ng.service.cf.internal"}}}
One possible cause is a failure of the TAS for VMs errand named Metric Registrar Smoke Test.
This errand deploys and starts an application called metric-registrar-monitor in the system org and the metric-registrar-monitor space. The application is designed to be short-lived; if it is left running for longer periods, it leaks goroutines and makes an overwhelming number of requests to the TAS API.
The initially reported bug for this errand, in which the application was sometimes not cleaned up after a successful errand run, has been patched. However, we have identified another scenario in which the application can be left running when it is not meant to be.
If the errand fails for any reason after it creates the metric-registrar-monitor application, it does not clean the application up. The leftover application then overwhelms the Cloud Controllers to the point that they can no longer pass their health checks.
The resolution is to stop the metric-registrar-monitor application in the system org and the metric-registrar-monitor space, as shown in the sketch below.
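A minimal cf CLI sequence for this, assuming you are logged in as a user with access to the system org:

# Target the org and space the errand uses
cf target -o system -s metric-registrar-monitor

# Confirm the application is still running
cf app metric-registrar-monitor

# Stop it so it no longer sends requests to the TAS API
cf stop metric-registrar-monitor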
This article will be updated with the product version that includes the patch for this issue when it is released.
If stopping the metric-registrar-monitor application does not fix the issue, consult the Scaling Cloud Controller documentation to verify that the Cloud Controllers are healthy (https://docs.pivotal.io/pivotalcf/opsguide/scaling-cloud-controller.html) and/or contact Tanzu Support.