Running the VIP authentication hub raises concerns about scaling:
When resource requests and limits (CPU and memory) are configured, do they apply to each pod of that type or to the total across all pods of that type?
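(As a general Kubernetes point, not specific to this chart: resource requests and limits apply to each pod replica individually, not to the total across all replicas of a service. The fragment below is a hypothetical values override illustrating per-service sizing; the exact key layout depends on the chart's values.yaml.)

  ssp:
    authmgr:
      replicas: 2            # each replica gets its own requests/limits
      resources:
        requests:
          cpu: "500m"
          memory: "1Gi"
        limits:
          cpu: "1"
          memory: "2Gi"      # the JVM heap must fit inside this per-pod limit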
Load tests were performed, and they did not go well.
The product reported a Java heap space error in authmgr:
[208876.827s][warning][gc,alloc] https-jsse-nio-8086-exec-33: Retried waiting for GCLocker too often allocating 167518 words
Jun 14 02:24:34 <name>-ssp-auth-mgr-<code> ssp-auth-mgr error [208876.827s][error ][jvmti ] Posting Resource Exhausted event: Java heap space
full trace:
Jun 14 02:24:34 <name>-ssp-auth-mgr-<code> ssp-auth-mgr warning [208876.827s][warning][gc,alloc] https-jsse-nio-8086-exec-33: Retried waiting for GCLocker too often allocating 167518 words
Jun 14 02:24:34 <name>-ssp-auth-mgr-<code> ssp-auth-mgr error [208876.827s][error ][jvmti ] Posting Resource Exhausted event: Java heap space
Jun 14 02:24:34 <name>-ssp-auth-mgr-<code> ssp-auth-mgr error {"timestamp":"2024-06-14T00:24:34.994148Z","type":"log","level":"error","thread":"https-jsse-nio-8086-exec-33","msg":"Failed to complete processing of a request","throwable":"java.lang.OutOfMemoryError: Java heap space\n\tat java.base/java.util.Arrays.copyOf(Arrays.java:3537)\n\tat java.base/java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:100)\n\tat java.base/java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:130)\n\tat org.bouncycastle.crypto.internal.io.CipherOutputStreamImpl.close(Unknown Source)\n\tat com.broadcom.layer7authentication.crypto.bcfipsimpl.AesCrypto.decryptBytes(AesCrypto.java:139)\n\tat com.broadcom.layer7authentication.crypto.CryptoFacade.decryptWithKey(CryptoFacade.java:108)\n\tat com.broadcom.layer7authentication.crypto.SSPMasterKey.decrypt(SSPMasterKey.java:221)\n\tat com.broadcom.layer7authentication.core.cache.cipher.CipherServiceImpl.decrypt(CipherServiceImpl.java:59)\n\tat com.broadcom.layer7authentication.core.cache.cipher.EncryptedHazelcastCache.fromStoreValue(EncryptedHazelcastCache.java:63)\n\tat com.hazelcast.spring.cache.HazelcastCache.get(HazelcastCache.java:68)\n\tat com.broadcom.layer7authentication.core.cache.cipher.EncryptedHazelcastCache.get(EncryptedHazelcastCache.java:25)\n\tat org.springframework.cache.interceptor.AbstractCacheInvoker.doGet(AbstractCacheInvoker.java:73)\n\tat org.springframework.cache.interceptor.CacheAspectSupport.findInCaches(CacheAspectSupport.java:570)\n\tat org.springframework.cache.interceptor.CacheAspectSupport.findCachedItem(CacheAspectSupport.java:535)\n\tat org.springframework.cache.interceptor.CacheAspectSupport.execute(CacheAspectSupport.java:401)\n\tat org.springframework.cache.interceptor.CacheAspectSupport.execute(CacheAspectSupport.java:345)\n\tat org.springframework.cache.interceptor.CacheInterceptor.invoke(CacheInterceptor.java:64)\n\tat org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:184)\n\tat org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(Cglib
Sizing has to be done according to the expected load.
Sizing tests for each environment determine the right mix of CPU/memory limits as well as the JVM configuration; JVM sizing is controlled with the chart parameter below.
For the JVM options, add the following when deploying:
--set ssp.<name>.env.jvmOpts="-XX:MaxRAMPercentage=75.0"
where <name> is admin, azserver, authmgr, identity, factor, geolocation, iaservice, iarisk.
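For example, a redeploy might repeat the flag for every service in one helm command (the release and chart names below are placeholders; -XX:MaxRAMPercentage=75.0 caps the JVM heap at 75% of the container's memory limit, so the heap follows the pod sizing):

  helm upgrade --install <release> <chart> \
    --set ssp.admin.env.jvmOpts="-XX:MaxRAMPercentage=75.0" \
    --set ssp.azserver.env.jvmOpts="-XX:MaxRAMPercentage=75.0" \
    --set ssp.authmgr.env.jvmOpts="-XX:MaxRAMPercentage=75.0" \
    --set ssp.identity.env.jvmOpts="-XX:MaxRAMPercentage=75.0" \
    --set ssp.factor.env.jvmOpts="-XX:MaxRAMPercentage=75.0" \
    --set ssp.geolocation.env.jvmOpts="-XX:MaxRAMPercentage=75.0" \
    --set ssp.iaservice.env.jvmOpts="-XX:MaxRAMPercentage=75.0" \
    --set ssp.iarisk.env.jvmOpts="-XX:MaxRAMPercentage=75.0"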
As of 3.1.1, this value must be provided under each of the services - admin, azserver, authmgr, identity, factor, geolocation, iaservice, iarisk. Take a look at values.yaml.
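The same settings can also be kept in a values override file instead of --set flags; the nesting below mirrors the ssp.<name>.env.jvmOpts path and is a sketch, not the chart's authoritative layout:

  ssp:
    authmgr:
      env:
        jvmOpts: "-XX:MaxRAMPercentage=75.0"
    azserver:
      env:
        jvmOpts: "-XX:MaxRAMPercentage=75.0"
    # ...repeat for admin, identity, factor, geolocation, iaservice, iarisk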
In the 3.2 chart, there will be a single global jvmOpts value applicable to all Java services.
Redeploy with the option above to resolve this issue.
Remember to configure this in a way that allows pre-provisioning resources to meet the maximum expected demand.
One option is to pre-create additional replicas for a few pods (authmgr, azserver, factors) so that they do not need to start while the load is higher than average but below peak.
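As a sketch, replicas can be raised ahead of an expected load window either through the chart's replica settings (if exposed in values.yaml) or directly with kubectl; the deployment name below follows the pod naming seen in the logs and is an assumption, and a kubectl-scaled count may be reset on the next helm upgrade:

  kubectl scale deployment <name>-ssp-auth-mgr --replicas=3 -n <namespace>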
When running tests that stress the system, observe node and pod utilization, then allocate resources according to that observed usage.
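For example, utilization can be observed during the tests with metrics-server in place:

  kubectl top nodes
  kubectl top pods -n <namespace> --sort-by=memory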
Ensure that whatever is pre-created actually gets used but still has room to grow.