GSLB service with Site Persistence enabled not failing over or is too slow to failover to active site

Article ID: 374107

Updated On:

Products

VMware Avi Load Balancer

Issue/Introduction

A GSLB service with Site Persistence enabled does not fail over to the active site, or is too slow to fail over.

Environment

- GSLB environment with multiple Active Sites.
- A GSLB service pool with 2 or more VSs from the active sites as pool members. 
- GSLB Site Persistence enabled.

Cause

- When the local pool goes down for any member VS, the VS is not marked down because there is another Site-Persistence (SP) pool associated with it.
- This SP pool is an internal construct with the VSs from the remaining participating sites as members.
- Since this pool is not going down, the local VS will not be marked down. This is by design.
- As the local VS is not marked down, its A record is still active on the GSLB DNS VS. 
- Thus, new connections can still be proxied to this site, and they will eventually fail because the local pool is down.
- Existing connections will also not fail over to the active site, since the GSLB pool member is not down.
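
The behavior described above can be illustrated with a minimal Python sketch. It is a simplified model, not the Avi implementation: the `SiteVS` class, its field names, and the `vs_is_up` function are hypothetical names chosen for illustration, and the only rule modeled is the one stated above (a VS stays up while any pool associated with it is up).

```python
from dataclasses import dataclass


@dataclass
class SiteVS:
    """Hypothetical model of a member VS at one GSLB site."""
    name: str
    local_pool_up: bool  # the application pool at this site
    sp_pool_up: bool     # internal Site-Persistence pool (VSs of the other sites)


def vs_is_up(vs: SiteVS) -> bool:
    # The VS is only marked down when every associated pool is down.
    # With Site Persistence enabled, the SP pool usually stays up,
    # so a local pool failure alone does not bring the VS down.
    return vs.local_pool_up or vs.sp_pool_up


if __name__ == "__main__":
    site_a = SiteVS("site-a-vs", local_pool_up=False, sp_pool_up=True)
    # Local pool is down, yet the VS (and therefore its A record) stays active.
    print(site_a.name, "up:", vs_is_up(site_a))
```

In this model the VS only goes down when the SP pool is down as well, which is why the GSLB pool member keeps receiving traffic after a local pool failure.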

Resolution

- It is recommended to configure a Layer 7 data-plane health monitor on the GSLB service (for example, GSLB-HTTP-HM, GSLB-HTTPS-HM, or a custom equivalent of the HTTP/HTTPS HMs).
- These health checks will be performed from the SEs hosting the GSLB DNS service.
- The idea is that, when the local pool goes down, the GSLB HM will fail. When the HM fails, the GS pool member will be marked down. 
- Consequently, the A record for that IP will be removed from the DNS VSs. Thus, new connections will only be served the IP of the member that is UP.
- Existing connections will also fail over to the active site on the next DNS refresh.
- In cases where the failover is slow, we need to check the timers on the configured GSLB L7 HM.
- The HM will have a "Send Interval" and a "Failed Checks" field configured.
- These fields determine the time it takes the GSLB service to mark its pool member down.
- For example, if Send Interval is set to 15 seconds and Failed Checks is 3, it can take up to 45 seconds (Send Interval × Failed Checks) for the GSLB service to mark a pool member down.
- If this is too long, these fields can be tweaked to reduce the failover time. 
- However, the values should not be too aggressive, as that can lead to continual flips or false failures.
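
The timer arithmetic above can be sketched in a few lines of Python. This is an illustrative calculation only: the function names are hypothetical, and the `dns_ttl_secs` parameter is an assumption added here to represent the "next DNS refresh" for existing clients; the actual delay depends on the TTL in use and on resolver caching.

```python
def detection_time_secs(send_interval_secs: int, failed_checks: int) -> int:
    """Upper bound on the time for the GSLB HM to mark a pool member down."""
    if send_interval_secs <= 0 or failed_checks <= 0:
        raise ValueError("send interval and failed checks must be positive")
    return send_interval_secs * failed_checks


def client_failover_time_secs(send_interval_secs: int,
                              failed_checks: int,
                              dns_ttl_secs: int) -> int:
    """Rough upper bound for an existing client: detection time plus
    the time until its cached DNS answer expires (assumed to be the TTL)."""
    return detection_time_secs(send_interval_secs, failed_checks) + dns_ttl_secs


if __name__ == "__main__":
    print(detection_time_secs(15, 3))  # 45, the example from this article
    print(detection_time_secs(5, 2))   # 10, a more aggressive setting
    print(client_failover_time_secs(15, 3, dns_ttl_secs=30))  # 75, with an assumed 30 s TTL
```

Lowering either value shortens detection proportionally, but very small values leave little tolerance for a single slow or dropped health check.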