App Healthcheck Effect During Container Replacement

Article ID: 298300


Products

VMware Tanzu Application Service for VMs

Issue/Introduction

This Knowledge Base article covers an important detail regarding application health checks for applications running on VMware Tanzu Application Service for VMs (TAS).

The application health check feature is one of the ways the platform ensures that an application is healthy and can accept requests. Every application pushed to TAS receives the default (port) health check unless a custom health check is specified.

More information regarding application health checks and the health check lifecycle can be found in the official documentation.
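For illustration, a health check is most commonly declared in the application manifest at push time. The snippet below is a sketch using the standard manifest attributes; the app name and endpoint path are placeholders:

  # Hypothetical manifest.yml snippet; app name and endpoint are placeholders.
  applications:
  - name: my-app
    health-check-type: http                 # valid types: port (default), process, http
    health-check-http-endpoint: /health     # only consulted by the http type
    health-check-invocation-timeout: 5      # seconds allowed per check attempt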

The detail covered specifically in this article is application availability during container replacement.

Firstly, we should cover the following:

  • The Gorouters maintain routing tables of all HTTP-routable endpoints exposed within the platform. For more information, see this documentation.
  • BOSH VM lifecycle updates (bosh recreate, bosh stop) trigger specific scripts so that jobs on the VM can shut down and start up gracefully and handle any logic needed during lifecycle updates (see the example below).
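For example, recreating a single Diego Cell causes BOSH to invoke each job's drain script on that VM before it is torn down. The deployment name and instance GUID below are placeholders:

  # Sketch of a lifecycle update; deployment name and GUID are placeholders.
  bosh -d cf-deployment recreate diego_cell/<instance-guid>
  # During drain, BOSH runs each job's drain script on the VM,
  # e.g. /var/vcap/jobs/rep/bin/drain for the rep job.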

When a Diego Cell goes through a lifecycle update, the rep job's drain script is executed. This drain script begins evacuating all app instances running on the Diego Cell. As part of the evacuation, one of the first tasks is to request replacements for all running apps so that availability can be maintained. The replacements have 10 minutes to spin up.
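That 10-minute window is the rep's evacuation timeout. In open-source diego-release this corresponds to the rep job's evacuation_timeout_in_seconds property (default 600); in TAS the value is managed by the tile, so the ops-file fragment below is only a sketch of where the knob lives, not a recommended change:

  # Sketch only; property name is from the open-source diego-release rep job spec.
  - type: replace
    path: /instance_groups/name=diego_cell/jobs/name=rep/properties/diego/rep/evacuation_timeout_in_seconds
    value: 600  # 10 minutes, the default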

The steps at a high level:

  1. App is running on Diego Cell 1.
  2. Diego Cell 1 is recreated during a bosh deploy.
  3. As part of the recreate, the rep drain script is executed.
  4. All running app instances on Diego Cell 1 request replacements to spin up on other Diego Cells so that availability may be maintained before shutting down the instances on Diego Cell 1.
  5. Diego Cell 2 spins up all replacement instances for Diego Cell 1 and all replacement instances are considered healthy.
  6. Diego Cell 1 stops the app instances it has since the replacements are up. 
  7. Diego Cell 1 can proceed with other lifecycle scripts and recreating.
  8. Diego Cell 2 contains the new instances of the apps and availability is maintained during Diego Cell 1 recreate.
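From the app operator's perspective, this sequence shows up as instance states cycling during the deploy. One rough way to watch it as the cells roll (app name is a placeholder):

  # Poll instance states every 2 seconds during the bosh deploy.
  watch -n 2 cf app my-app
  # cf events my-app also records the instance lifecycle events afterwards.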


Our interest is in the timing of when these application routes are registered and unregistered in Gorouter throughout this process.

To put a visual to this, let's look at appA in the following simplified scenario:

Gorouter is aware that appA is on Diego Cell 1. 

A bosh recreate is triggered on Diego Cell 1.


When the drain script on Diego Cell 1 is triggered, a replacement instance is requested. The rep on Diego Cell 2 wins the auction for the new instance and begins spinning it up. From the docs:
"When Diego starts an app instance, the app health check runs every two seconds until a response indicates that the app instance is healthy or until the health check timeout elapses. The 2-second health check interval is not configurable."

At this point we should cover the following:

  • appA instance in Diego Cell 1 will stay registered in Gorouter until the instance is stopped.
  • appA instance in Diego Cell 2 will not have its route registered in Gorouter until it is considered healthy.

Let's continue the visual: appA in Diego Cell 2 has just become healthy.



Once appA on Diego Cell 2 becomes healthy, its route is registered in Gorouter. Additionally, the rep on Diego Cell 1 is made aware that the new instance has successfully spun up and proceeds to stop its instance of appA.

It is important to note that the route for appA on Diego Cell 1 stays registered in Gorouter until the instance is stopped. Because routes are registered and unregistered with this timing, the failover from one application location to another has a high probability of being graceful while availability is maintained.

However, what if we have an application that is slow to start up while using the process health check type? Availability will be briefly lost. Let's see a real example of this.

Environment

Product Version: 2.11

Resolution

This real example follows the same logic as appA above. This Spring app is named "healthchecker" and has a health check type of process.

App healthchecker takes 1 minute to start up and will be unable to serve requests until successfully started.

The process health check is defined as follows: "Diego ensures that any process declared for the app stays running. If the process exits, Diego stops and deletes the app instance."

For a Spring app, this means that as soon as the JVM process is running, the instance is considered healthy. Though the JVM process is running, the application itself may not be ready yet. The window between the JVM process starting and the app actually being ready to accept requests is where availability can be impacted. This is why it is paramount to ensure applications have health checks that match their logic.
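This gap is easy to observe by hand. The sketch below is a rough check from inside the app container during the startup window; /health0 is the endpoint this app serves, as seen in the router logs further down:

  cf ssh healthchecker
  # What a process check effectively sees: the java process exists immediately.
  ps aux | grep [j]ava
  # What an http check would see: nothing is listening on port 8080 yet.
  curl -sf http://localhost:8080/health0 || echo "not ready yet"

With that in mind, let's continue to see the impact in this scenario: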

  • diego_cell/2205ef22-2e84-4890-83b9-5d5f9e9e0ce2 contains original healthchecker instance e15b8151-e082-4df2-569f-1c54.
  • diego_cell/84a84fa9-e105-421b-bae1-be4f3315634b contains original healthchecker instance 165031b5-8c05-429d-4c12-9d1a.


diego_cell/2205ef22-2e84-4890-83b9-5d5f9e9e0ce2 is being recreated:
 
1 - The original instance e15b8151-e082-4df2-569f-1c54 needs a replacement:

 2022-05-23T11:36:32.71-0400 [CELL/0] OUT Cell 2205ef22-2e84-4890-83b9-5d5f9e9e0ce2 requesting replacement for instance e15b8151-e082-4df2-569f-1c54

2 - diego_cell/84a84fa9-e105-421b-bae1-be4f3315634b won the auction and begins spinning up the replacement instance 165031b5-8c05-429d-4c12-9d1a:

 2022-05-23T11:36:35.09-0400 [CELL/0] OUT Cell 84a84fa9-e105-421b-bae1-be4f3315634b successfully created container for instance 165031b5-8c05-429d-4c12-9d1a

As soon as the replacement container is created, the JVM process starts and the application begins starting up. Diego performs a health check every 2 seconds while the new instance starts. Unfortunately, since the health check type is process, the first health check after the JVM starts returns healthy. We can see this in the route_emitter logs on diego_cell/84a84fa9-e105-421b-bae1-be4f3315634b:

{"timestamp":"2022-05-23T15:36:37.868863155Z","level":"info","source":"route-emitter","message":"route-emitter.watcher.handling-event.added","data":{"address":"x.x.x.x","cell-id":"84a84fa9-e105-421b-bae1-be4f3315634b","domain":"cf-apps","evacuating":false,"index":0,"instance-guid":"165031b5-8c05-429d-4c12-9d1a","ports":[{"container_port":8080,"host_port":61135,"container_tls_proxy_port":61001,"host_tls_proxy_port":61137},{"container_port":8080,"host_port":61135,"container_tls_proxy_port":61443,"host_tls_proxy_port":61138},{"container_port":2222,"host_port":61136,"container_tls_proxy_port":61002,"host_tls_proxy_port":61139}],"process-guid":"bb7cb43e-6606-4260-981a-2b08561fbcfb-1a963a6f-ed30-4a48-a330-aa816288db81","session":"8.43225","state":"RUNNING"}}

This means Gorouter updates its routing table with this new instance. However, the new instance will not be fully ready to accept requests until 1 minute later:

2022-05-23T11:37:40.63-0400 [APP/PROC/WEB/0] OUT 2022-05-23 15:37:40.630  INFO 22 --- [           main] o.s.b.w.embedded.tomcat.TomcatWebServer  : Tomcat started on port(s): 8080 (http) with context path ''

2022-05-23T11:37:40.63-0400 [APP/PROC/WEB/0] OUT 2022-05-23 15:37:40.639  INFO 22 --- [           main] c.j.h.HealthcheckerApplication           : Started HealthcheckerApplication in 62.195 seconds (JVM running for 62.809)


Unfortunately, the original healthchecker app instance was stopped (and its route unregistered) around the time the replacement instance was marked healthy.

2022-05-23T11:36:42.71-0400 [CELL/0] OUT Cell 2205ef22-2e84-4890-83b9-5d5f9e9e0ce2 stopping instance e15b8151-e082-4df2-569f-1c54


At this point, only the replacement instance's route is registered in Gorouter. This means healthchecker app availability is lost between 2022-05-23T11:36:42 and 2022-05-23T11:37:40. If a request is destined for the app during that window of unavailability, it results in a 502 response code:

2022-05-23T11:36:44.26-0400 [RTR/0] OUT healthchecker.cfapps-01.slot-20.pez.vmware.com - [2022-05-23T15:36:44.259838389Z] "GET /health0 HTTP/1.1" 502 0 67 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0 Safari/537.36" "x.x.x.x:46932" "x.x.x.x:61001" x_forwarded_for:"x.x.x.x, x.x.x.x" x_forwarded_proto:"http" vcap_request_id:"337b2bf4-6b01-47a6-588b-e4e61888f78e" response_time:0.007890 gorouter_time:0.000438 app_id:"bb7cb43e-6606-4260-981a-2b08561fbcfb" app_index:"0" instance_id:"165031b5-8c05-429d-4c12-9d1a" x_cf_routererror:"endpoint_failure (EOF (via idempotent request))" x_b3_traceid:"5b783e7b797c55c639f71b7708d5688b" x_b3_spanid:"39f71b7708d5688b" x_b3_parentspanid:"-" b3:"5b783e7b797c55c639f71b7708d5688b-39f71b7708d5688b"


It is not until the application has fully started that it can successfully accept traffic:

2022-05-23T11:37:42.80-0400 [RTR/0] OUT healthchecker.cfapps-01.slot-20.pez.vmware.com - [2022-05-23T15:37:42.680453719Z] "GET /health0 HTTP/1.1" 200 0 8 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0 Safari/537.36" "x.x.x.x:56134" "x.x.x.x:61001" x_forwarded_for:"x.x.x.x, x.x.x.x" x_forwarded_proto:"http" vcap_request_id:"1d8d9887-84e6-4d3b-42da-4bf5dd64e0ec" response_time:0.123125 gorouter_time:0.000448 app_id:"bb7cb43e-6606-4260-981a-2b08561fbcfb" app_index:"0" instance_id:"165031b5-8c05-429d-4c12-9d1a" x_cf_routererror:"-" x_b3_traceid:"5e1ab78b3c4e03ae5dcc82d3b45ec244" x_b3_spanid:"5dcc82d3b45ec244" x_b3_parentspanid:"-" b3:"5e1ab78b3c4e03ae5dcc82d3b45ec244-5dcc82d3b45ec244"


If the healthchecker application instead used a port or http health check type, then the new instance would not have been considered healthy as soon as the JVM process started up. Instead the new app instance would have been considered healthy when it started listening on port 8080 or was able to accept http requests, depending on the health check type.
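As a sketch of the fix, the health check type can be changed without re-pushing the app. The endpoint below is the /health0 path seen in the access logs; in general, any endpoint that returns 200 only once the app is ready will do:

  # Switch from the process check to an http check, then restart so it takes effect.
  cf set-health-check healthchecker http --endpoint /health0
  cf restart healthchecker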

Conclusion
Application health check types are vital in ensuring application availability during periods when the underlying host is updating.

Additional Note:
This note only applies when using NSX-T as the overlay network for TAS. Prior to NSX-T Container Plugin (NCP) v3.1.2.4 and v3.0.2.5, it was possible for proper health check types to still result in lost availability. This is because the drain script in openvswitch shut down container networking. Recall that BOSH lifecycle updates call drain scripts for all jobs on the VM. The rep drain script requested replacements for all app instances, while the openvswitch drain script shut down container networking. This resulted in an immediate loss of availability until the replacement instances were healthy. Ideally, the original instance remains functional and able to serve traffic until the new instance is healthy; however, the openvswitch drain script shut down container networking prematurely. This has been patched starting with NCP tile v3.1.2.4 and v3.0.2.5.