How to troubleshoot app health check failure problem

Article ID: 298252

Updated On:

Products

VMware Tanzu Application Service for VMs

Environment

Product Version: 2.10

Resolution

Checklist:
This article explains the app health check types on Tanzu Application Service and how to troubleshoot http and port health check failures.


Tanzu Application Service supports three types of app health checks. For more details, refer to the Tanzu documentation, Using App Health Checks.
 
  • port - a simple TCP connection check; developers cannot customize the check logic
  • http - an HTTP request to the app; the developer and the framework can customize what is checked
  • process - managed by the Diego container system

For the port and http types, an app needs time between starting and listening on its port or handling HTTP requests, so the checks run in two phases:
 

  • readiness check - runs during app start, with a default timeout of 60 seconds; on each attempt the client (healthcheck app) connects to the app's listening port and waits up to 1 second
  • liveness check - runs regularly while the app is running; the client (healthcheck app) connects to the app's listening port and waits up to 1 second
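
The two phases can be sketched as two loops around the same check. This is a minimal illustration, not the platform's implementation: the function names, the retry interval of 2 seconds, and the liveness interval of 30 seconds are assumptions made for the example. Only the 60 second readiness timeout comes from this article.

```python
import time
from typing import Callable, Optional


def readiness_phase(check: Callable[[], bool],
                    timeout: float = 60.0,
                    interval: float = 2.0) -> bool:
    """Retry `check` until it passes or `timeout` seconds elapse.

    `interval` is an assumed retry delay, chosen only for this sketch.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return True  # app is considered started
        time.sleep(interval)
    return False  # corresponds to "readiness health check never passed"


def liveness_phase(check: Callable[[], bool],
                   interval: float = 30.0,
                   max_checks: Optional[int] = None) -> bool:
    """Run `check` repeatedly; return False on the first failure.

    `max_checks` is a hypothetical knob that lets the sketch terminate;
    with None the loop runs until a check fails.
    """
    done = 0
    while max_checks is None or done < max_checks:
        if not check():
            return False  # corresponds to "Container became unhealthy"
        done += 1
        time.sleep(interval)
    return True
```

The point of the split is that a slow start is tolerated for the whole readiness timeout, while a single failed liveness check on a running app is enough to mark the container unhealthy.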


Below are some error messages raised by the healthcheck, with suggested solutions:
 

[CELL/0] ERR Failed after 1m0.772s: readiness health check never passed.

In the error above, the app took longer than 60 seconds to start listening on the port or serving requests. To resolve the problem, we suggest the following:
 

  • Review the app initialization and remove unnecessary work from it to reduce the startup time.
  • The readiness timeout is 60 seconds by default. Increase the value (up to 180s) with the cf push -t argument or the timeout attribute in the deployment manifest.
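
When choosing a new value, it can help to read the elapsed time out of the log line and to derive the timeout from a startup time measured separately. The sketch below assumes the log format shown above. The helper names and the 1.5x safety margin are hypothetical, and only the 180 second cap comes from this article. Note that the logged duration shows which timeout was in effect, not how long the app actually needs.

```python
import math
import re
from typing import Optional

MAX_TIMEOUT_S = 180  # upper bound quoted in this article

# Matches durations such as "1m0.772s" or "45.5s" (hours are not handled).
_FAILED_AFTER = re.compile(r"Failed after (?:(\d+)m)?(\d+(?:\.\d+)?)s")


def elapsed_seconds(log_line: str) -> Optional[float]:
    """Return the 'Failed after ...' duration in seconds, or None if absent."""
    match = _FAILED_AFTER.search(log_line)
    if match is None:
        return None
    minutes = int(match.group(1) or 0)
    return minutes * 60 + float(match.group(2))


def suggested_timeout(startup_seconds: float, margin: float = 1.5) -> int:
    """Scale a measured startup time by an assumed margin, capped at 180s."""
    return min(MAX_TIMEOUT_S, math.ceil(startup_seconds * margin))
```

For example, an app measured at about 80 seconds of startup would get 120 from `suggested_timeout(80)`, which could then be passed to cf push -t.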
 
[CELL/0] OUT Container became unhealthy
The error above is not a readiness check failure; it is a regular liveness check failure. The port or http check did not respond within 1 second (the invocation timeout), which indicates the app is in an unresponsive state. The invocation timeout is configurable with:
 
  • (v6) cf v3-set-health-check --invocation-timeout
  • (v7) cf set-health-check --invocation-timeout
  • health-check-invocation-timeout in the deployment manifest, when pushed with the v3 API
 
[HEALTH/0] ERR Failed to make TCP connection to port 8080: connection refused
The error above is a port type health check failure: the app could not respond to a TCP request within one second. This usually indicates that the app instance is under extremely high CPU or memory pressure. Review the app's resource usage and workload, then scale the app up or out accordingly.
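
To tell a refused connection apart from a slow one while investigating, a probe along the following lines can be pointed at the app port. It is a sketch using Python's standard socket module, not the healthcheck app itself. The function name and the returned labels are made up for this example, and the 1 second default mirrors the invocation timeout described above.

```python
import socket


def probe_port(host: str, port: int, timeout: float = 1.0) -> str:
    """Attempt one TCP connection and classify the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except ConnectionRefusedError:
        # Nothing is accepting connections on the port.
        return "refused"
    except socket.timeout:
        # No answer within `timeout` seconds.
        return "timeout"
    except OSError as exc:
        # Any other network-level failure, e.g. unreachable host.
        return f"error: {exc}"
```

A "refused" result suggests the process is not listening at all, whereas "timeout" points toward an instance too busy to accept connections in time.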
 
[HEALTH/0] ERR Failed to make HTTP request to '/actuator/health' on port 8080: timed out after 1.00 seconds
The error above is an http type health check failure: the app could not respond to an HTTP request at /actuator/health within 1 second. An http health check is far slower than a TCP check because it involves HTTP handling. In addition, the developer can add custom checks to the endpoint, and a framework may add backend service checks automatically. To resolve this error, we suggest reviewing not only the app's workload and resource usage, but also the health check implementation, and improving it where possible.
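
One way to keep a health endpoint fast is to avoid calling a slow backend on every request. The following is a minimal sketch of that idea, assuming the endpoint handler can call a small wrapper. The class `CachedCheck`, its `ttl` parameter, and the injectable clock are hypothetical names for this example and are not part of any framework.

```python
import time
from typing import Callable, Optional


class CachedCheck:
    """Cache the result of an expensive check for `ttl` seconds."""

    def __init__(self, check: Callable[[], bool], ttl: float = 10.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._check = check
        self._ttl = ttl
        self._clock = clock
        self._value: Optional[bool] = None
        self._expires = 0.0

    def healthy(self) -> bool:
        """Return the cached result, refreshing it once it has expired."""
        now = self._clock()
        if self._value is None or now >= self._expires:
            self._value = self._check()  # the only slow call
            self._expires = now + self._ttl
        return self._value
```

With this wrapper, most health requests return immediately from the cache. A request that lands on an expired entry still pays the full cost of the backend check, so refreshing the value from a background task would be a further step if that cost approaches the invocation timeout.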