VIP Authentication Hub - support token API across datacenters with consistency

Article ID: 256557

Products

VIP Authentication Hub

Issue/Introduction

In production, VIP AuthHub is deployed across three data centers, each with its own database cluster. The clusters are eventually consistent across data centers; replication latency is about 100 ms, and replication delays longer than one second have not been observed.

In the current deployment, it is possible for an authorization code grant to be obtained in one data center while the client's subsequent call to the token API is routed to another.

We need to ensure that when the authorization code cannot be found, the token endpoint waits a reasonable amount of time for the code to replicate instead of failing immediately.
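
The race can be pictured with a minimal simulation (a sketch only; the store names, the routing, and the 100 ms delay below are hypothetical stand-ins for the real three-DC deployment):

    import time

    # Hypothetical two-DC simulation of the race: the authorization code is
    # written in DC1, but the /token call is routed to DC2 before the
    # eventually consistent replication (~100 ms here) has caught up.
    REPLICATION_DELAY_SECONDS = 0.1

    dc1_store, dc2_store = {}, {}

    def authorize_in_dc1(code):
        # Leg 1: /authorize lands in DC1 and persists the authorization code.
        dc1_store[code] = {"issued_at": time.time()}

    def replicate_to_dc2():
        # Asynchronous replication; in production this happens in the background.
        time.sleep(REPLICATION_DELAY_SECONDS)
        dc2_store.update(dc1_store)

    def token_in_dc2(code):
        # Leg 2: /token is routed to DC2, which may not yet have the code.
        return dc2_store.get(code)

    authorize_in_dc1("abc123")
    print(token_in_dc2("abc123"))   # None: request fails without a retry/wait
    replicate_to_dc2()
    print(token_in_dc2("abc123"))   # found once replication has completed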

 

Environment

Release : Oct.03

Cause

The main implication is that a /token request may need to wait a short time for data to arrive, instead of immediately failing due to missing data, and this requires a configurable mechanism to control the retry logic. For this reason, in an MDC (multi-data-center) topology utilizing different SSP clusters, there is a requirement to persist such transactional artifacts in a persistence tier that replicates across DCs.

- To support this, the M9 milestone (October 2022) introduced a persistence option that enables DB persistence instead of Hazelcast cache persistence. This is controlled by the global config parameter "persistentStoreForTransactionalData", which takes the value "CACHE" (default) or "DB". When set to "DB", transactional artifacts are written to, and read from, the operational DB, which is expected to be replicated across DCs (see the sketch after this list).

- Since such DB persistence across DCs, and in some cases even cache persistence within a DC, is almost never immediately consistent, the DCs will for a short period of time be in an inconsistent state with respect to transactional data. When such data is used by flows such as /authorize followed by /token, it is likely that the second leg of the flow will not find the data it needs to process the request.
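
As a rough illustration of what the switch controls (a sketch only; GLOBAL_CONFIG and the store arguments are hypothetical stand-ins for AuthHub's actual configuration and persistence layers):

    # Hypothetical sketch: choose the persistence tier for transactional
    # artifacts based on "persistentStoreForTransactionalData".
    GLOBAL_CONFIG = {"persistentStoreForTransactionalData": "DB"}  # or "CACHE"

    def transactional_store(cache_store, db_store):
        # "CACHE" (default): Hazelcast cache, local to a single DC.
        # "DB": operational database, expected to replicate across DCs.
        if GLOBAL_CONFIG["persistentStoreForTransactionalData"] == "DB":
            return db_store
        return cache_store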

Resolution

- Addressing this requires a configurable resiliency mechanism that retries fetching the data artifacts, while also ensuring that data replication is as fast as possible.

- The M9 patch, as well as the next M10 milestone, adds a configurable retry mechanism for reading such data elements from either persistence tier (DB or Cache). The additional global config params are as follows, illustrated in the sketch after this list.

-- "transactionalDataReadsRetryCount" (default is 0 - no retries by default, max 50 retries)
-- "transactionalDataReadsRetryWaitPeriodMillis" (default is 200ms between retries, range is 100ms to 1000ms)