GSLB service with Site Persistence enabled does not fail over, or is too slow to fail over, to the active site.
- GSLB environment with multiple Active Sites.
- A GSLB service pool with 2 or more VSs from the active sites as pool members.
- GSLB Site Persistence enabled.
- When the local pool goes down for any member VS, the VS is not marked down because there is another Site-Persistence (SP) pool associated with it.
- This SP pool is an internal construct with the VSs from the remaining participating sites as members.
- Since this pool is not going down, the local VS will not be marked down. This is by design.
- As the local VS is not marked down, its A record is still active on the GSLB DNS VS.
- Thus, new connections can still be proxied to this site, and they will eventually fail because the local pool is down.
- Existing connections will also not fail over to the active site, since the GSLB pool member is not marked down.
- It is recommended to configure a Layer 7 DataPlane Health Monitor on the GSLB service (for example, GSLB-HTTP-HM, GSLB-HTTPS-HM, or a custom equivalent of the HTTP/HTTPS HMs).
- These health checks will be performed from the SEs hosting the GSLB DNS service.
- The idea is that, when the local pool goes down, the GSLB HM will fail. When the HM fails, the GS pool member will be marked down.
- Consequently, the A record for that IP will be removed from the DNS VSs. Thus, new connections will only be served the IP of the member that is UP.
- Existing connections will also be failed over to the active site on the next DNS refresh.
- In cases where the failover is slow, we need to check the timers on the configured GSLB L7 HM.
- The HM will have a "Send Interval" and a "Failed Checks" field configured.
- These fields determine the time it takes the GSLB service to mark its pool member down.
- Say the Send Interval is set to 15 secs and Failed Checks is 3. In this case, it can take up to 45 secs (Send Interval * Failed Checks) for the GSLB service to mark a pool member down.
- If this is too long, these fields can be tweaked to reduce the failover time.
- However, the values should not be too aggressive, as that can lead to continual flapping or false failures.
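
The timer arithmetic above can be sketched in a few lines of Python. This is only an illustrative calculation, not part of the product: the function name `max_detection_time` and the sample values other than 15 secs / 3 checks are hypothetical. It covers only the time for the health monitor to mark the member down, not the additional time clients may take to pick up the change on their next DNS refresh.

```python
def max_detection_time(send_interval_s: int, failed_checks: int) -> int:
    """Upper bound, in seconds, for the GSLB HM to mark a pool member down.

    Mirrors the rule of thumb from this article:
    Send Interval * Failed Checks.
    """
    if send_interval_s <= 0 or failed_checks <= 0:
        raise ValueError("send_interval_s and failed_checks must be positive")
    return send_interval_s * failed_checks


if __name__ == "__main__":
    # (Send Interval, Failed Checks) pairs; only the first comes from the article,
    # the others are hypothetical tuning candidates.
    candidates = [(15, 3), (10, 3), (5, 2)]
    for interval, checks in candidates:
        worst_case = max_detection_time(interval, checks)
        print(f"Send Interval={interval}s, Failed Checks={checks} "
              f"-> up to {worst_case}s to mark the member down")
```

Comparing a few candidate pairs this way makes the trade-off explicit: lower products shorten failover, but very small values leave less tolerance for a transient missed check.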