The production restmon pod had been in an 'Unschedulable' state since July 30. As a result, no monitoring alerts were processed, and no AIOps alarms or SNOW tickets were created. To fix this, the node pool configuration was updated on Aug 4, and a few test alerts were processed successfully. However, within an hour the restmon pod stopped processing alerts again. The restmon logs confirmed that restmon was stuck: it was still receiving payloads but not processing them. Restarting the restmon pod resolved the issue.
What could have caused these issues, and can they be avoided by setting up monitoring?
Release : 20.2
Component : CA DOI AO PLATFORM COMPONENTS
It seems likely that a flood of alarms overloaded Restmon, something crashed, and Restmon got stuck. The recommendation is to upgrade to 2.1, which adds a liveness probe. The liveness probe exists to detect this exact scenario and restart the pod automatically. That release also includes monitoring that shows the number of alerts received versus the number published, which gives you insight into what Restmon is doing. In addition, make sure the load on a single Restmon instance stays within its performance parameters. It can handle small spikes of 500-1000 alarms a minute for short durations. Sustained load at this level makes the queue grow faster than Restmon can process it and leads to issues like this one. The initial issue, where the pod was unschedulable, was probably due to resource constraints on the cluster.
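As an illustration of how a received-versus-published comparison can expose the "receiving but not processing" condition, here is a minimal Python sketch. It is not Restmon's own monitoring or liveness probe. The `Sample` type, the function names, and the idea of polling two cumulative counters at a fixed interval are all assumptions made for this example. How the counters are actually obtained depends on your environment and is not shown.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Sample:
    """One poll of two cumulative counters (hypothetical names)."""
    received: int   # total alerts received so far
    published: int  # total alerts published so far


def backlog(sample: Sample) -> int:
    """Alerts received but not yet published at the time of the poll."""
    return sample.received - sample.published


def is_stalled(samples: Sequence[Sample], window: int = 3) -> bool:
    """Return True if, over the last `window` polling intervals, the
    received counter increased while the published counter did not move.

    Returns False when there are not enough samples to decide.
    """
    if window < 1 or len(samples) < window + 1:
        return False
    first, last = samples[-(window + 1)], samples[-1]
    received_grew = last.received > first.received
    published_flat = last.published == first.published
    return received_grew and published_flat


def backlog_is_growing(samples: Sequence[Sample], window: int = 3) -> bool:
    """Return True if the backlog strictly increased at every one of the
    last `window` polling intervals (a sign of sustained overload)."""
    if window < 1 or len(samples) < window + 1:
        return False
    recent = [backlog(s) for s in samples[-(window + 1):]]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))
```

A stalled instance shows up as `is_stalled(...)` returning `True`, while a steadily rising backlog with some publishing still happening shows up in `backlog_is_growing(...)`. Either signal could be used to raise an alert before tickets stop being created. The window size is a tuning choice and should be matched to the polling interval.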