A 500 error and "OutOfMemoryError: Java heap space" occur on the doi-cpa-ng pod while adding Capacity Analytics groups. This can happen, for example, when trying to define Capacity Analytics for all UIM groups by selecting the root group for UIM. The following error can also be found in the cpa_ng logs:
ERROR [2022-03-25 07:27:19,186] io.dropwizard.jersey.errors.LoggingExceptionMapper: Error handling a request: eae49914a6a25541 ! java.lang.OutOfMemoryError: Java heap space
The NASS metadata clamp size was very high, so the pod ran out of memory while fetching and processing the data locally.
DX Operational Intelligence 21.3.1
Capacity analytics groups for UIM
Add the NASS_METADATA_CLAMP_SIZE environment variable to the CPA-ng pod deployment, set its value to 50000, and then try to load the configuration page to see whether you still hit the out-of-memory error.
1/ Use the following commands to locate your CPA-ng pod and open the deployment for editing:
kubectl get pods -n<your-namespace> | grep cpa-ng
kubectl describe pod cpa-ng-<pod_id> -n<your-namespace>
kubectl exec cpa-ng-<pod_id> -it -n<your-namespace> -- bash
kubectl edit deployment -n<your-namespace> cpa-ng
2/ Scroll down and change the NASS_METADATA_CLAMP_SIZE value to 50000. If this property is not yet present in your deployment YAML file, please add it.
- name: NASS_METADATA_CLAMP_SIZE
  value: "50000"
Important Note: NASS_METADATA_CLAMP_SIZE is an environment variable of the CPA-ng pod, so it will not be present in your YAML file unless it has been changed before.
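As an alternative to editing the deployment YAML by hand, the same environment variable can be set with a single kubectl command. This is a sketch that assumes the deployment is named cpa-ng as in the steps above; <your-namespace> is a placeholder for your actual namespace:

```shell
# Set (or update) the variable on the deployment; Kubernetes will
# automatically roll out new pods with the updated environment.
kubectl set env deployment/cpa-ng NASS_METADATA_CLAMP_SIZE=50000 -n <your-namespace>
```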
3/ Save the file. The pods from this deployment will be recreated automatically, and your CPA-ng pod will then run with the increased NASS metadata clamp size.
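To confirm that the change took effect, you can check the environment of the newly created pod. This is a sketch using the same placeholders as the commands above (<your-namespace>, <pod_id>):

```shell
# Find the name of the recreated pod
kubectl get pods -n<your-namespace> | grep cpa-ng

# Print the variable from inside the running pod
kubectl exec cpa-ng-<pod_id> -n<your-namespace> -- env | grep NASS_METADATA_CLAMP_SIZE
```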
4/ If this does not help, please try reducing the value further.
Important Note: Some metric names might be missing from the configuration page after this change. In that case, make sure that the metric is actively flowing into the system and that it belongs to a device that is part of the group/service selected on the configuration page.
If this still does not work, please open a ticket with support and attach the logs from all three of your CPA pods. The cpa_ng logs are the most important in this case, but the logs from all three CPA pods can be useful for finding the root cause. If you are running OpenShift, please also provide the output of the oc describe deployment command for your doi-cpa-ng deployment. The commands below can be useful in this case:
oc get pods -ndxi | grep cpa
cpa-projection-<pod-id>
doi-cpa-ng-<pod-id>
doi-cpa-service-aggregation-<pod-id>
oc describe deployment doi-cpa-ng -ndxi
oc logs doi-cpa-ng | grep dropwizard
AIOPs - Troubleshooting, Common Issues and Best Practices