This article provides the steps to install and configure ITCM Health Monitoring, which can be used to monitor the health of the ITCM infrastructure, raise alerts, or take appropriate actions.
Client Automation (ITCM) -- any version
Step 1: Health Monitoring Component Overview
It's a good idea to start by familiarizing yourself with the various components involved in Health Monitoring (HM).
Health Monitoring Agent (hmAgent):
The hmAgent service (hmAgent.exe) is installed by default, alongside the CAF service, at every level of the ITCM architecture. Out of the box, the hmAgent is not configured with any monitoring rules, so nothing is monitored and no alerts are raised. Note that it is not possible to install ITCM without HM, though the hmAgent service can be disabled if desired.
Alert Collector (AC):
The Alert Collector (AC) plugin (AlertCollector.exe) is not installed by default; installation considerations are discussed further in the next step. When HM is configured, the AC is responsible for receiving alerts from remote hmAgents in the environment. The AC may then forward the alerts to another server or commit them into the manager's database. When the AC is installed, its role in the architecture is configured.
DSM Web Console:
The interface for viewing alerts raised by hmAgents in the environment is found in the ITCM/DSM Web Console. This makes the Web Console a required component for the purposes of HM. From the Web Console you can view, manage, and delete alerts as necessary.
All HM functionality is driven by configuration policy. Policy changes are required to enable the hmAgent for monitoring, to set monitoring rules or actions, and to configure where hmAgents should send their alerts. This leads into a discussion of HM architecture considerations.
Step 2: Architectural Considerations
There are two major considerations for HM with respect to architecture...
The first major consideration is which tiers of the architecture, or groups of systems, you intend to monitor. This article recommends not monitoring at the agent tier, because the agent tier may generate an excessive number of alerts to manage. HM is much more valuable for monitoring infrastructure servers: Enterprise, Domain, and Scalability Servers.
The second major consideration for HM is where in the architecture to install Alert Collectors, and how many are needed. At minimum, only one AC is needed. If you have an Enterprise, and it's reachable by all ITCM infrastructure servers, it may be ideal to configure and consolidate all alerts there. It may also be practical to install an AC at each Domain. Optionally, if you want a consolidated view of all the alerts, you can configure the AC at each Domain to forward its alerts to an AC installed at the Enterprise. HM is scalable and flexible.
Keep in mind that the decision of where to implement the Alert Collector(s) affects where the configuration policy settings will be configured. If each Domain is to have its own AC, then a separate policy will be needed at each Domain to configure all hmAgents within that Domain to use the respective AC. If, however, you have a single AC at the Enterprise tier, then it can be managed with configuration policy at the Enterprise tier.
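The forwarding arrangement described above can be sketched in a few lines of Python. This is only an illustrative model, not anything ITCM provides: the `Collector` class, the `alert_path` helper, and the host names are all hypothetical. It shows an AC at each Domain forwarding to a single AC at the Enterprise, with one policy per Domain pointing its hmAgents at that Domain's collector.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Collector:
    """Hypothetical model of an Alert Collector and where it forwards to."""
    host: str
    forwards_to: Optional["Collector"] = None


def alert_path(first_hop: Collector) -> list[str]:
    """Return the chain of collector hosts an alert would pass through."""
    path: list[str] = []
    current: Optional[Collector] = first_hop
    while current is not None:
        if current.host in path:
            # Guard against a misconfiguration where collectors forward in a circle.
            raise ValueError(f"forwarding loop at {current.host}")
        path.append(current.host)
        current = current.forwards_to
    return path


# Hypothetical host names, for illustration only.
enterprise_ac = Collector("enterprise.example.com")
domain_a_ac = Collector("domain-a.example.com", forwards_to=enterprise_ac)
domain_b_ac = Collector("domain-b.example.com", forwards_to=enterprise_ac)

# One policy per Domain: each points that Domain's hmAgents at its own AC.
policy_by_domain = {"Domain A": domain_a_ac, "Domain B": domain_b_ac}

for domain, collector in policy_by_domain.items():
    print(domain, "->", " -> ".join(alert_path(collector)))
```

Drawing out the chain like this before installing anything makes it easier to see how many separate policies are needed and which collectors must be reachable from which tiers.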
Step 3: Installing the Alert Collector
To install an Alert Collector, run setup.exe from the ITCM DVD media. Choose a "Custom" install and add the "Alert Collector" from the list of components. Recall that the Alert Collector requires the ITCM Web Console and Web Services to be installed.
Next, configure what this Alert Collector will do:
Persist alerts into MDB:
Only report the alerts, but don't take any actions. The alerts will be visible in the Web Console. The implication here is that the Alert Collector is installed on a Domain or Enterprise manager.
Persist alerts into MDB and take configured actions:
Same as above, except also take whatever actions are configured (e.g. send an email, raise an SNMP trap, or write an event log entry).
Persist alerts into MDB and take configured actions and forward them:
Same as above, except also FORWARD the alert to another Alert Collector. This is useful for forwarding alerts to an AC on the Enterprise, for consolidation of alerts from each Domain.
The final option only forwards the alerts. This may be useful for consolidation to another AC (e.g. at the Enterprise), or if you are monitoring infrastructure that is unable to reach the AC at the Enterprise or Domain tier, so that this AC must act as a proxy for other Alert Collector(s).
Following this step, the Alert Collector will be installed with the selected configuration.
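As a way of comparing the roles above, here is a minimal Python sketch that models each one as a combination of three capabilities. The `Capability` flags, the `ROLES` table, and the `needs_manager` helper are hypothetical names for illustration, and the "Forward only" key is an assumed label rather than the installer's actual wording. The rule that persisting roles belong on a Domain or Enterprise manager is taken from the description of the first role and assumed to carry over to the roles described as "same as above".

```python
from enum import Flag, auto


class Capability(Flag):
    """What an Alert Collector does with an alert it receives."""
    PERSIST = auto()   # commit the alert into the MDB
    ACT = auto()       # take configured actions (email, SNMP trap, event log)
    FORWARD = auto()   # pass the alert on to another Alert Collector


ROLES = {
    "Persist alerts into MDB": Capability.PERSIST,
    "Persist alerts into MDB and take configured actions":
        Capability.PERSIST | Capability.ACT,
    "Persist alerts into MDB and take configured actions and forward them":
        Capability.PERSIST | Capability.ACT | Capability.FORWARD,
    # Assumed label: the article does not give the exact name of this role.
    "Forward only": Capability.FORWARD,
}


def needs_manager(role: str) -> bool:
    """Assumption: any role that persists into the MDB sits on a manager."""
    return Capability.PERSIST in ROLES[role]


for role in ROLES:
    where = "Domain or Enterprise manager" if needs_manager(role) else "any reachable server"
    print(f"{role}: {where}")
```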
Step 4: Enabling Health Monitoring
Before any agents will generate alerts, health monitoring MUST be enabled via configuration policy.
Note: Depending on your specific architecture and implementation, you may be doing this at the Domain or Enterprise level. You may also be using either the Default Computer Policy or a custom configuration policy. A healthy understanding of the flexible nature of configuration policies is key.
To enable health monitoring, ensure this policy is set to TRUE:
DSM Explorer --> Control Panel --> Configuration --> Configuration Policy --> Whatever Policy --> DSM --> Health Monitoring --> Health Monitoring Agent --> Enable Health Monitoring (TRUE)
Also, this policy must be set to specify the Alert Collector to which agents with this policy applied will send their alerts:
DSM Explorer --> Control Panel --> Configuration --> Configuration Policy --> Whatever Policy --> DSM --> Web Services --> client --> Health Monitoring --> Alert Collector address (somewhere.ca.com)
Note: Changing these two policies will enable Health Monitoring, and tell the hmAgents where to send their alerts, but in the next step we will configure what to monitor...
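Before rolling the policy out, it can be worth confirming that the Alert Collector address actually resolves from an infrastructure server. The sketch below is a generic name-resolution check written in Python; it is not an ITCM tool, the `resolve_collector` helper is a hypothetical name, and it says nothing about whether the collector's service port is reachable (the article does not state which port is used).

```python
import socket


def resolve_collector(address: str) -> list[str]:
    """Return the IP addresses the given name resolves to, or [] if none."""
    try:
        infos = socket.getaddrinfo(address, None)
    except socket.gaierror:
        # The name could not be resolved from this machine.
        return []
    # Each entry's sockaddr tuple starts with the IP address string.
    return sorted({info[4][0] for info in infos})


# "somewhere.ca.com" is the placeholder used in the policy path above;
# substitute the address you actually configured.
addresses = resolve_collector("somewhere.ca.com")
print(addresses if addresses else "Alert Collector address did not resolve")
```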
Step 5: Configuring Alerts and Monitoring
The next step is to configure which problems are monitored, along with the frequency and threshold that determine when an alert is raised.
DSM Explorer --> Control Panel --> Configuration --> Configuration Policy --> Whatever Policy --> DSM --> Health Monitoring --> Alert Configuration
Once selected, the Alert Configuration dialog will open. For the purposes of this document, we will focus on the "Alerts" tab.
Alert Templates Tab:
Used to add health monitoring templates to ITCM. Out of the box, a number of useful templates are provided. This tab is not covered here.
Alerts Tab:
This tab covers enabling and configuring alert templates: whether each template is enabled, which tier will use it, and the threshold and frequency settings for monitoring.
It may take some time to experiment and find a balance of frequency and threshold settings so that you are not initially flooded with alerts. Take monitoring a Collect Task as an example. From time to time, the Collect Task might fail sporadically for some reason. If your threshold says to raise an alert immediately, you may get flooded with alerts for something that is working 99% of the time. You should therefore choose a reasonable threshold: a period long enough that, if the task has not been working for that long, you want to know about it.
Frequency: How often should health monitoring check whether this problem is occurring, or still occurring?
Threshold: How long after identifying the problem should the hmAgent wait before generating an alert?
Here is an example for monitoring a Collect Task. The hmAgent is configured to check every 30 minutes, but it won't raise an alert until the Collect Task has consistently failed to function for 6 hours.
In another example, we want to receive an alert immediately for a process crash occurring on any Enterprise, Domain or Scalability Server:
For each template, be mindful of:
- Whether it is enabled or disabled
- The threshold for raising the alert
- The frequency for checking the problem
- Where the configuration policy is applied or not applied!
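To get a feel for how frequency and threshold interact, the following Python sketch simulates the two examples above. It is a simplified model under stated assumptions, not the hmAgent's actual algorithm: it assumes the problem is first detected at a check, that checks then repeat at exactly the configured frequency, and that an alert is raised at the first check where the problem has persisted for at least the threshold. The `time_to_alert` function and the 5-minute frequency in the second example are hypothetical.

```python
from datetime import timedelta
from typing import Optional


def time_to_alert(frequency: timedelta,
                  threshold: timedelta,
                  outage: timedelta) -> Optional[timedelta]:
    """Time from first detection until an alert, or None if the problem clears first.

    Assumed model: a check at time 0 detects the problem, further checks run
    every `frequency`, and the problem lasts for `outage` in total.
    """
    if frequency <= timedelta(0):
        raise ValueError("frequency must be positive")
    elapsed = timedelta(0)
    while elapsed <= outage:          # the problem is still present at this check
        if elapsed >= threshold:      # it has persisted long enough to alert
            return elapsed
        elapsed += frequency
    return None                       # the problem cleared before the threshold


# Collect Task example: check every 30 minutes, alert after 6 hours of failure.
print(time_to_alert(timedelta(minutes=30), timedelta(hours=6), timedelta(hours=8)))
# A sporadic 2-hour failure never reaches the 6-hour threshold.
print(time_to_alert(timedelta(minutes=30), timedelta(hours=6), timedelta(hours=2)))
# Process crash example: a zero threshold alerts at the first detection
# (the 5-minute frequency here is an assumed value).
print(time_to_alert(timedelta(minutes=5), timedelta(0), timedelta(minutes=1)))
```

Running a few candidate settings through a model like this shows which outages would be silently absorbed and which would produce an alert, which is the trade-off being tuned.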
Step 6: Configuring Alert Actions
So you've installed an Alert Collector, enabled health monitoring, and configured alerts to be raised. Now what? Can you only view them in the Web Console? How do you get an email, or raise an event for another professional monitoring tool to pick up?
DSM Explorer --> Control Panel --> Configuration --> Configuration Policy --> Whatever Policy --> DSM --> Health Monitoring --> Alert Collector --> Alert Actions
It is here that you can configure the Alert Collector to take an action (e.g. send an email, raise an SNMP trap, or write to an event log):
Where do you configure the e-mail settings?
DSM Explorer --> Control Panel --> Configuration --> Configuration Policy --> Whatever Policy --> DSM --> Health Monitoring --> Alert Collector --> Alert Actions --> SMTP Email Configuration
Note: There is nowhere to specify credentials for ITCM/Health Monitoring to log in to the SMTP server. ITCM only supports anonymous access to the SMTP server for emailing the alert.
For more information on setting up an SMTP "relay" server, see this article from Microsoft:
The article gives an overview of how to implement a basic SMTP server using IIS that can integrate and authenticate with an Exchange server on its back end, in order to facilitate sending alert emails.
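Because the alert email is sent without credentials, it can help to confirm that your relay accepts anonymous mail before pointing Health Monitoring at it. The sketch below uses Python's standard `smtplib` and is independent of ITCM. The helper names, the addresses, the relay host in the commented usage line, and the use of port 25 are all assumptions to adapt to your environment. It sends a message without ever calling `login()`, which is the same kind of unauthenticated submission described above.

```python
import smtplib
from email.message import EmailMessage


def build_test_message(sender: str, recipient: str) -> EmailMessage:
    """Build a small plain-text test message."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "SMTP relay test for ITCM Health Monitoring"
    msg.set_content("If you can read this, the relay accepted anonymous mail.")
    return msg


def send_anonymously(host: str, msg: EmailMessage,
                     port: int = 25, timeout: float = 10.0) -> None:
    """Send the message without authenticating (no login() call)."""
    with smtplib.SMTP(host, port, timeout=timeout) as smtp:
        smtp.send_message(msg)


message = build_test_message("itcm-alerts@example.com", "admin@example.com")
print(message["Subject"])

# Hypothetical usage, run from the Alert Collector host:
# send_anonymously("relay.example.com", message)
```

If the relay rejects the message, `smtplib` raises an exception describing the refusal, which is usually enough to tell whether the relay is demanding authentication.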
- This article does not cover remediation scripts, which allow Health Monitoring to take specific actions to attempt to remediate a problem automatically.
- Example of alerts displayed in the DSM Web Console: