How to troubleshoot app health check failure problem

Article ID: 298252

Products

VMware Tanzu Application Service for VMs

Issue/Introduction

This article describes the application health check types available in Tanzu Elastic Application Runtime (EAR) and provides a troubleshooting guide for common HTTP and TCP port-based health check failures.

Resolution


Elastic Application Runtime supports three types of app health checks. For more details, refer to the document Using App Health Checks.

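  • port - a plain TCP connection check; the developer cannot customize the check logic
  • http - an HTTP request to a configured endpoint; the developer and the framework can customize what is checked
  • process - managed by the Diego container system, which monitors whether the app process is still running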
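
As a rough illustration of how the port and http types differ, the sketch below approximates both checks from the client side. It is not the platform's healthcheck implementation: the function names are hypothetical, and it assumes an http check passes only on an HTTP 200 response.

```python
import http.client
import socket


def port_check(host: str, port: int, timeout: float = 1.0) -> bool:
    """Pass if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers connection refused and timeouts
        return False


def http_check(host: str, port: int, path: str = "/", timeout: float = 1.0) -> bool:
    """Pass if GET `path` answers with HTTP 200 (assumed success criterion).

    Note: `timeout` applies to each socket operation, not to the whole request.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        return conn.getresponse().status == 200
    except (OSError, http.client.HTTPException):
        return False
    finally:
        conn.close()
```

For example, `http_check("127.0.0.1", 8080, "/actuator/health")` exercises the full request path of the app, while `port_check("127.0.0.1", 8080)` only confirms that something accepts connections on the port.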

For the port and http types, an app needs some time after it starts before it listens on the port or handles HTTP requests, so the check runs in two phases:

  • readiness check - runs during app start and must pass within 60 seconds by default; on each attempt the client (the healthcheck app) connects to the app's listening port and waits up to 1 second
  • liveness check - runs regularly while the app is running; the client (the healthcheck app) connects to the app's listening port and waits up to 1 second
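
The two phases can be sketched as two small loops. This is only a conceptual model: the function names are hypothetical, and the polling intervals are illustrative assumptions, not values taken from the platform.

```python
import time
from typing import Callable


def wait_until_ready(check: Callable[[], bool], timeout: float = 60.0,
                     interval: float = 2.0) -> bool:
    """Readiness phase: poll `check` until it passes or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval)  # interval is an assumed value
    return False  # corresponds to "readiness health check never passed"


def run_liveness(check: Callable[[], bool], interval: float = 30.0,
                 rounds: int = 3) -> bool:
    """Liveness phase: run `rounds` periodic checks; stop at the first failure."""
    for _ in range(rounds):
        if not check():
            return False  # corresponds to "Container became unhealthy"
        time.sleep(interval)  # interval is an assumed value
    return True
```

The difference that matters for troubleshooting is that the readiness phase tolerates failed attempts until the timeout expires, while a single failed attempt in the liveness phase marks the instance unhealthy in this model.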

Below are the error messages raised by the healthcheck, along with suggested solutions:

[CELL/0] ERR Failed after 1m0.772s: readiness health check never passed.

In the error above, the app took longer than 60 seconds to start listening on the port or serving requests. To resolve the problem, we suggest the following:

  • Review the app initialization and remove unnecessary work to reduce the startup time
  • The readiness timeout is 60 seconds by default; increase it (up to 180s) with the cf push -t argument or the timeout attribute in the deployment manifest

[CELL/0] OUT Container became unhealthy

The error above is not a readiness check failure; it is a regular liveness check failure. The port or http check did not get a response within 1 second (the invocation timeout), which indicates that the app is unresponsive. The invocation timeout is configurable with:

  • (v6) cf v3-set-health-check --invocation-timeout
  • (v7) cf set-health-check --invocation-timeout
  • health-check-invocation-timeout in the deployment manifest when pushed with the v3 API

[HEALTH/0] ERR Failed to make TCP connection to port 8080: connection refused

The error above is a port type health check failure: the healthcheck could not establish a TCP connection to the app port within one second. This usually indicates that the app instance is under extremely high CPU or memory pressure. Review the app's resource usage and workload, then scale the app up or out accordingly.
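
When reproducing a port check failure by hand, it can help to separate a refused connection from a timeout. The helper below is a minimal sketch with a hypothetical name. In general TCP terms, "refused" means nothing was accepting connections on that port at that moment (for example, the process had not started listening or had exited), while "timeout" means the connection attempt got no answer in time.

```python
import socket


def diagnose_port(host: str, port: int, timeout: float = 1.0) -> str:
    """Classify the outcome of one TCP connection attempt."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except ConnectionRefusedError:
        return "refused"   # no listener on the port
    except socket.timeout:
        return "timeout"   # no answer within `timeout` seconds
    except OSError as exc:
        return f"error: {exc}"  # e.g. unreachable host
```

Calling `diagnose_port("127.0.0.1", 8080)` from inside the app container (8080 is the port from the log line above) shows which of the two cases applies.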

[HEALTH/0] ERR Failed to make HTTP request to '/actuator/health' on port 8080: timed out after 1.00 seconds

The error above is an http type health check failure: the app could not respond to an HTTP request at /actuator/health within 1 second. An http type health check can be far slower than a TCP check because it involves HTTP request handling. In addition, the developer can add custom check logic, and a framework can add backend service checks automatically. To resolve it, we suggest reviewing not only the app workload and resource usage but also the service health check implementation, and improving it where it is slow.
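
One possible way to improve a slow health check implementation is to run the expensive backend probes in the background and let the health endpoint return the last known result. The sketch below shows that idea only; the class name, the `probe` callable, and the refresh interval are all hypothetical and not part of any framework.

```python
import threading
from typing import Callable


class CachedHealth:
    """Runs a possibly slow probe in the background and serves the last result."""

    def __init__(self, probe: Callable[[], bool], refresh_seconds: float = 10.0):
        self._probe = probe
        self._refresh_seconds = refresh_seconds
        self._healthy = True  # assumption: report healthy until the first probe ends
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()

    def _loop(self) -> None:
        while not self._stop.is_set():
            try:
                result = self._probe()
            except Exception:
                result = False  # a crashing probe counts as unhealthy
            with self._lock:
                self._healthy = result
            self._stop.wait(self._refresh_seconds)

    def is_healthy(self) -> bool:
        """Return immediately; never waits for the probe itself."""
        with self._lock:
            return self._healthy
```

A health endpoint that calls `is_healthy()` answers in constant time even when the backend probe takes seconds. The trade-off is that the reported status can lag behind the real state by up to one refresh interval.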