You want to know more about how incidents work in relation to the event flow and queues.
When troubleshooting correlation service issues, you may need to clear the ICE queues because of incident problems. Before deleting these queues (/opt/Symantec/simserver/queues/ice/input and /opt/Symantec/simserver/queues/ice/output), you must stop the simserver and icesvc services. The reasons are explained below:
The key is to stop both the simserver and the ice service and then delete the queues. This is important because the correlation engine inside the simserver holds state, which must be cleared when the ice service input queue is manually deleted.
By default, the incident state is cleared automatically 24 hours after the incident was created, or earlier if the incident is closed in the UI console. In other words, while the state exists, no new incidents will be created for the same set of events that already triggered the incident.
The incident state is shown in the UI console, in the Incident Details view, as the Tracking check box in the lower left corner. If the box is checked, the rule that created the incident is still “tracking” events, i.e. assigning events to the existing incident instead of creating a new incident.
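The tracking behaviour described above can be illustrated with a toy model. This is only a sketch of the rule as stated in this article, not SSIM code: the class name `RuleTracker`, its methods, and the way incident ids are numbered are all hypothetical.

```python
from datetime import datetime, timedelta

TRACKING_WINDOW = timedelta(hours=24)  # default lifetime of the incident state


class RuleTracker:
    """Toy model of one rule's incident state (hypothetical, not SSIM code)."""

    def __init__(self):
        self.incident_id = None   # incident currently being tracked, if any
        self.created_at = None    # creation time of that incident
        self._next_id = 1

    def close_incident(self):
        # Closing the incident in the console clears the tracking state.
        self.incident_id = None
        self.created_at = None

    def on_event(self, now):
        """Return the id of the incident that a matching event is assigned to."""
        expired = (
            self.incident_id is not None
            and now - self.created_at >= TRACKING_WINDOW
        )
        if self.incident_id is None or expired:
            # No live state: create a new incident and start tracking it.
            self.incident_id = self._next_id
            self._next_id += 1
            self.created_at = now
        return self.incident_id


t0 = datetime(2024, 1, 1, 9, 0)
tracker = RuleTracker()
first = tracker.on_event(t0)                        # creates an incident
same = tracker.on_event(t0 + timedelta(hours=3))    # still tracking, same incident
later = tracker.on_event(t0 + timedelta(hours=25))  # state expired, new incident
```

In this model, events that arrive while the state is live never produce a second incident, which is the behaviour the Tracking check box reflects.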
If no new incidents are being created, here is why:
The default event flow is:
                 events              incidents                incidents
<event service> -10010-> <simserver> --input--> <ice service> --output--> <event service>
                                                      |
                                                      |
                                                      V
                                              incident database
The event service sends events to the simserver over a TCP connection on port 10010. The simserver sends incident, conclusion, and correlated-event items to the ice service through an on-disk queue (the queues/ice/input folder), which guarantees that no events are lost and helps performance. Finally, the ice service writes the incidents, conclusions, and correlated events to the database and sends incident and conclusion events back to the event service, again through an on-disk queue (the queues/ice/output folder), so that the event service can save them in the archives.
As you can see, event processing works like a pipeline, so stopping one of the services will, sooner or later, “block” the service upstream of it. When an on-disk queue sits between two services, it takes longer before the upstream service “blocks” the event flow. For example, the on-disk queue between the simserver and the ice service was sized to allow the ice service to be stopped for months (assuming a low number of incidents, e.g. a few hundred incidents per day) without affecting the correlation engine and the event flow.
The on-disk queues help to reduce the event flow “blocking”, but they also have the following side effects:
-- the queues add delays in the event processing
-- the queues are slower than direct connection and affect the performance
-- and finally, under some extreme conditions (e.g. too many incidents created in a short time), the ice service input queue can fill with too many items (incidents, conclusions, and correlated events). Combined with the slower processing of the ice service (database writes are much slower than in-memory event correlation), this creates a huge backlog of incidents, and it looks as if SSIM is no longer creating new incidents. In other words, the ice service cannot keep up with the load: there are too many incidents to handle, and the database updates are delayed for a long time. Stopping both the simserver and the ice service and deleting the ice service input queues clears the pipeline and restores real-time event processing.
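The backlog effect in the last item can be made concrete with a small simulation. It is a deliberately simplified sketch: the function names and the per-tick rates are made up for illustration and do not come from SSIM.

```python
def simulate_backlog(arrivals_per_tick, writes_per_tick, ticks):
    """Return the input-queue depth after each tick of a producer/consumer pair."""
    depth = 0
    history = []
    for _ in range(ticks):
        depth += arrivals_per_tick            # simserver enqueues new items
        depth -= min(depth, writes_per_tick)  # ice service writes what it can
        history.append(depth)
    return history


def drain_time(depth, writes_per_tick):
    """Ticks needed to empty a backlog once nothing new arrives (rounded up)."""
    return -(-depth // writes_per_tick)


burst = simulate_backlog(arrivals_per_tick=50, writes_per_tick=20, ticks=5)
quiet = simulate_backlog(arrivals_per_tick=10, writes_per_tick=20, ticks=3)
```

With these illustrative rates, `burst` grows by 30 items every tick while `quiet` stays at zero. Whenever items arrive faster than the database writes can absorb them, the queue depth grows without bound, and newly created incidents wait behind everything already queued.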
There is no need to restart SSIM; simply restarting the simserver and the ice service will suffice.
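The whole procedure (stop both services, clear the queues, restart both services) can be sketched as a short script. Several parts are assumptions rather than documented SSIM behaviour: the `service <name> <action>` command form, the start order, and the choice to empty the queue folders instead of removing the folders themselves. The helper names are hypothetical. Adapt all of these to your appliance before using anything like it.

```python
import shutil
import subprocess
from pathlib import Path

QUEUE_ROOT = Path("/opt/Symantec/simserver/queues/ice")  # path from this article
SERVICES = ("simserver", "icesvc")


def service_command(name, action):
    # ASSUMPTION: services are controlled with `service <name> <action>`;
    # substitute whatever command your appliance actually uses.
    return ["service", name, action]


def empty_directory(path):
    """Delete everything inside `path` but keep the directory itself."""
    for entry in Path(path).iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()


def clear_ice_queues(queue_root=QUEUE_ROOT, run=subprocess.run):
    """Stop both services, empty the input and output queues, then restart."""
    for name in SERVICES:
        run(service_command(name, "stop"), check=True)
    for sub in ("input", "output"):
        queue_dir = Path(queue_root) / sub
        if queue_dir.is_dir():
            # ASSUMPTION: emptying the folder is equivalent to deleting the queue.
            empty_directory(queue_dir)
    for name in SERVICES:
        run(service_command(name, "start"), check=True)
```

The `run` parameter exists so the sequence can be rehearsed with a stand-in function before any real service is touched. Because `check=True` raises an error if a stop command fails, the queues are never deleted while a service may still be running.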