High 'Failed Objects' in Server Replication report

Products

IT Management Suite

Issue/Introduction

You notice that the Alert State in the Server Replication report is High, which is a concern because you have not seen it before. The replication jobs show a high number of failed objects:

Alert State: High

Environment

ITMS 8.x

Cause

The database contains a table called Evt_NS_Hierarchy_Alert that keeps track of the alert status for each replication job. Each replication job is tracked by its SourceNS GUID.

When the hierarchy topology view opens in the console, the spGetServerAlertState stored procedure runs for each node listed in the view.

DECLARE @AlertState__auto AS int;
EXECUTE spGetServerAlertState @ServerGuid='ca3aa222-f786-48bf-b93e-fa1940165b47', @AlertState=@AlertState__auto OUTPUT

sp_helptext spGetServerAlertState

SELECT MAX(AlertState) FROM Evt_NS_Hierarchy_Alert ha
INNER JOIN HierarchyNode hn ON ((ha.[DestinationNS] = hn.[ParentGuid] AND ha.[SourceNS] = hn.[ChildGuid]) OR
(ha.[DestinationNS] = hn.[ChildGuid] AND ha.[SourceNS] = hn.[ParentGuid]))
WHERE (ha.[SourceNS] = 'ca3aa222-f786-48bf-b93e-fa1940165b47') AND
(ha.[Latest] = 1) AND
(ha.[ValidUntilDate] >= GETUTCDATE())

spGetServerAlertState returns the maximum alert state across all replication jobs that ran within the last 24 hours. (ValidUntilDate is always exactly 24 hours after the time the alert was created in the Evt_NS_Hierarchy_Alert table.)

This max alert state is what is displayed by the indicator in the console.
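To see which replication link is driving the indicator, the individual alert rows can be listed instead of only their maximum. The following is a minimal sketch that reuses only the table and column names shown in the procedure text above. The GUID is the example value from this article, and the CASE labels follow the HierarchyAlertState values (0 = Unknown through 4 = AlertCritical). It omits the HierarchyNode join, so it may also return rows for links that are no longer part of the hierarchy.

```sql
-- Example GUID taken from this article; replace it with the GUID of your own server.
DECLARE @ServerGuid uniqueidentifier = 'ca3aa222-f786-48bf-b93e-fa1940165b47';

-- List the current (Latest = 1) and unexpired alert rows for the server,
-- highest alert state first. The top row is the value the console indicator shows.
SELECT ha.[SourceNS],
       ha.[DestinationNS],
       ha.[AlertState],
       CASE ha.[AlertState]
           WHEN 0 THEN 'Unknown'
           WHEN 1 THEN 'AlertLow'
           WHEN 2 THEN 'AlertMedium'
           WHEN 3 THEN 'AlertHigh'
           WHEN 4 THEN 'AlertCritical'
       END AS AlertStateName,
       ha.[ValidUntilDate]
FROM Evt_NS_Hierarchy_Alert ha
WHERE ha.[SourceNS] = @ServerGuid
  AND ha.[Latest] = 1
  AND ha.[ValidUntilDate] >= GETUTCDATE()
ORDER BY ha.[AlertState] DESC;
```

This is a read-only query. If it returns no rows, no alert created within the last 24 hours applies to that server.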

There are four alert states (Low, Medium, High, and Critical), plus an Unknown value.

/// <summary>
        /// The alert state is unknown.
        /// </summary>
        Unknown = 0,

        /// <summary>
        /// Low-priority information that might not be important to the user. For example,
        /// replication has been completed between two servers in a Hierarchy structure.
        /// </summary>
        AlertLow = 1,

        /// <summary>
        /// Medium-priority information that does not need to be conveyed to the user
        /// immediately. For example, several items could not be replicated between two
        /// servers in a Hierarchy structure. 1% or less of items could not be replicated
        /// </summary>
        AlertMedium = 2,

        /// <summary>
        /// Important information that should be conveyed to the user as soon as possible.
        /// For example, 10% or more of items could not be replicated between two servers
        /// in a Hierarchy structure.
        /// </summary>
        AlertHigh = 3,

        /// <summary>
        /// Critical information that should be conveyed to the user immediately. For
        /// example a link could not be established between two servers within a Hierarchy
        /// structure.
        /// </summary>
        AlertCritical = 4

The Medium and High alert states are calculated from the percentage of items that failed to replicate, based on the following code.

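failedReplicationPercentage = (status.FailedReplicationCount * 100) / status.TotalReplicationCount;

// The higher threshold is tested first. alertState is an illustrative name, not taken from the product source.
if (failedReplicationPercentage >= 10) alertState = HierarchyAlertState.AlertHigh;
else if (failedReplicationPercentage >= 1) alertState = HierarchyAlertState.AlertMedium;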
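The effect of these thresholds depends heavily on the total item count of the job. The sketch below applies the same arithmetic to two hypothetical jobs with the same number of failed items. The job names and counts are invented for illustration, and integer arithmetic is an assumption about how the product evaluates the expression.

```sql
-- Hypothetical job counts for illustration only; they are not taken from a real server.
SELECT j.JobName,
       j.FailedCount,
       j.TotalCount,
       p.FailedPct,
       CASE
           WHEN p.FailedPct >= 10 THEN 'AlertHigh'
           WHEN p.FailedPct >= 1  THEN 'AlertMedium'
           ELSE 'Below the Medium threshold'
       END AS ResultingState
FROM (VALUES ('Small job', 40, 200),
             ('Large job', 40, 50000)) AS j (JobName, FailedCount, TotalCount)
-- Integer division, mirroring (FailedReplicationCount * 100) / TotalReplicationCount.
-- NULLIF avoids a divide-by-zero error when a job replicated no items.
CROSS APPLY (SELECT (j.FailedCount * 100) / NULLIF(j.TotalCount, 0) AS FailedPct) AS p;
```

With these numbers, the small job evaluates to 20 and reaches AlertHigh, while the large job evaluates to 0 and stays below the Medium threshold, even though both jobs failed on the same 40 items.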

Resolution

In most environments a High alert state is displayed almost all of the time. A differential job typically replicates relatively few items, so the ratio of failed items to the total count of items replicated is high, and a differential job usually runs daily. Most of the failures occur because the items do not support replication: either the item itself does not support replication, its class GUID does not support replication, or the item's product does not support replication.

Examples of products that don't support replication are CMDB, Workflow, and Altiris Connector.

Items that have been deleted but still have an entry in the ReplicationItemDependencyCache table will continue to attempt to replicate. Patch Management files also add to the list of failed replication items. The Patch Management Import Data Replication for Windows replication rule replicates the configured updates down to the child servers. The dependencies associated with these updates, including the associated file GUIDs, are inserted into the ReplicationItemDependencyCache table, but these file GUIDs are configured as non-replicable because the files are actually pulled down as part of the PMImport process on the child SMPs.

Whenever replication runs, whether Patch Management updates need to replicate or not, their dependencies will attempt to replicate but will fail because of their non-replicable attributes.