Client IP or App Cookie Persistence Fails in Large-Scale Virtual Service Deployments

Article ID: 400664


Products

VMware Avi Load Balancer

Issue/Introduction

  1. At high virtual service scale (~150-200 VSes), Service Engine (SE) persistence synchronization can take significantly longer when all virtual services are configured with Client IP or App Cookie persistence and a long persistence timeout (on the order of tens of minutes).
  2. Client IPs intermittently map to different backend servers, eventually breaking stickiness when multiple Service Engines (SEs) are involved.

Cause

At large scale, the volume of operations (key publications) from multiple SEs causes Redis timeouts during entry creation.

Resolution

Root Cause:

  • Each persistence entry must be synced between SEs via the Controller's Redis-based key-value (KV) store.

  • Redis keys are stored for only 60 seconds by default. In high-connection scenarios some keys expire before peers can read them.

  • Missed entries are only reconciled during a periodic resync, which by default runs at half the persistence timeout (e.g., every 15 minutes for a 30-minute persistence timeout).

  • These resync bursts add further Redis load, possibly leading to repeated timeouts and stale state between SEs.
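To make the timing concrete, the worst-case window during which a missed entry stays unsynchronized can be estimated from the defaults described above (a rough sketch; the 60-second key TTL and the timeout/2 resync rule are taken from this article):

```python
# Rough model of the default sync timing described above.
# A persistence key that expires from Redis (60 s default TTL) before a
# peer SE reads it is only recovered at the next periodic resync,
# which defaults to persistence_timeout / 2.

REDIS_KEY_TTL_S = 60  # default sdb key lifetime in Redis

def worst_case_stale_seconds(persistence_timeout_min: int) -> int:
    """Upper bound (seconds) on how long a missed entry stays unsynced."""
    resync_interval_s = (persistence_timeout_min * 60) // 2
    return resync_interval_s

# Example: with a 30-minute persistence timeout, the resync runs every
# 15 minutes, so a key missed in its 60 s window can stay stale for up
# to ~900 seconds on peer SEs.
print(worst_case_stale_seconds(30))  # -> 900
```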

Impact:

  • Persistence entries may remain unsynchronized between SEs for extended durations (up to the next resync interval), affecting traffic stickiness and session continuity.

 

Identifying the Issue

To check whether persistence entries are failing to sync:

  1. SSH into the Avi Controller.

  2. Run the following command:

> show pool <pool-name> internal

In the output, check these two counters for each Service Engine:

 

|   num_redis_recvd                     | 0                                                   | 
|   num_redis_sent_dp                   | 0                                                   |


num_redis_recvd should be approximately equal across all SEs for that pool. A large difference between SEs indicates that some entries are missing.

num_redis_recvd on one SE should be approximately equal to the sum of num_redis_sent_dp across all SEs.
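The two counter checks above can be automated with a short script. The per-SE values below are illustrative only; in practice you would parse them from the `show pool <pool-name> internal` output of each SE:

```python
# Hypothetical per-SE counters, as would be parsed from
# `show pool <pool-name> internal` (example values, not real output).
counters = {
    "se-1": {"num_redis_recvd": 980, "num_redis_sent_dp": 500},
    "se-2": {"num_redis_recvd": 995, "num_redis_sent_dp": 510},
}

def persistence_sync_healthy(counters, tolerance=0.05):
    """Flag the pool as suspect if any SE's num_redis_recvd deviates
    noticeably from the sum of num_redis_sent_dp across all SEs."""
    total_sent = sum(c["num_redis_sent_dp"] for c in counters.values())
    for se, c in counters.items():
        if abs(c["num_redis_recvd"] - total_sent) > tolerance * total_sent:
            return False  # this SE is missing entries
    return True

print(persistence_sync_healthy(counters))
```

The 5% tolerance is an arbitrary placeholder; counters on a busy pool will never match exactly, so choose a threshold that fits your traffic volume.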

This issue is fixed in releases 30.2.4 and 31.1.2, which introduce two new configuration knobs in the SE group to improve sync behavior and Redis reliability:

  1. sdb_key_timeout

    • Purpose: Controls how long the Controller Redis server holds keys before expiring.

    • Range: 60-600 seconds (default: 60s)

  2. persistence_update_interval

    • Purpose: Defines how often refreshed persistence entries are synced to peer SEs.

    • Range: 1-30 minutes (default behavior: persistence timeout/2)

 

Recommendation for Scaled Deployments

For large-scale environments with many VSes using Client IP or App Cookie Persistence with long timeouts:

  • sdb_key_timeout = 3 minutes (180 seconds)

  • persistence_update_interval = 3 minutes
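If you are on 30.2.4/31.1.2 or later, these knobs can be applied from the Avi shell on the Controller. The session below is a sketch assuming the standard Avi CLI syntax for editing an SE group; the group name is a placeholder, and sdb_key_timeout is specified in seconds (180 s = 3 minutes) while persistence_update_interval is in minutes, per the ranges above. Verify the exact property names with `?` completion on your release:

```
> configure serviceenginegroup <SE-group-name>
> sdb_key_timeout 180
> persistence_update_interval 3
> save
```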

These settings reduce sync delays and help maintain session persistence across SEs even at scale.