Replication Data from a failed or hung job is not cleaned up automatically.

Products

IT Management Suite

Issue/Introduction

Data is no longer replicating between servers. You may see one or more of the following symptoms:

The database size is increasing dramatically, especially the ITEM table size.
The Current Replication Activity report shows a job that is failed, or appears to be hung. For example, the percentage is not increasing or the Start Date is greater than 48 hours.

Job shows as in progress for more than 48hrs

Environment

Symantec Management Platform 8.x

Cause

When a failure occurs during a replication job, the normal cleanup measures are unable to be performed.

Resolution

Cleanup steps prior to 8.x:

Download and import the attached Task named ‘Replication Data Cleanup’ on each server participating in the replication process, source/destination, parent/child
After importing, execute the Task (Note that this Task may time out depending on SQL Server speed and amount of data to be cleaned. If this happens, try re-running the Task several times if necessary as it should get farther with each pass)
Once the Task finishes, run the "NS.Quarter-Hour.." task from the Windows Scheduled Tasks folder if you want to recover additional database space immediately

NOTE: The Task attached to this article should be disabled/deleted from your servers once this issue has been resolved in a released version of SMP.

Also, if the ITEM table in the database is too large then the cleanup Task may fail to run, because there are too many rows to deal with at one time. If the Task continually fails to run then edit the SQL in the task ‘Replication Data Cleanup’ and at the very top line put the following statement:

SET ROWCOUNT 5000

This will cause the Task to only take out 5,000 rows. This may also require that the Task be run multiple times before everything is cleaned up.

Cleanup steps post 8.x:

Run the following queries to identify and stop any hung replication jobs:

/*
This query tells the current state of all replication jobs, past and present and includes other information such as duration and transfer sizes.
*/

select distinct
j.JobId,
fs1.Name as [Source],
fs2.Name as [Destination],
case when j.State = 1 then 'Not Started'
when j.State = 2 then 'Pending'
when j.State = 3 then 'Running'
when j.State = 4 then 'Complete'
when j.State = 5 then 'Aborted'
when j.State = 6 then 'Error'
end as [Job Status],
j.StartDate, j.FinishDate, j.RunCount, o.cnt as Operations,hs.RunningTime as [Duration], hs.TransferSize

from ReplicationJob j
join ForwardServer fs1 on fs1.Guid = j.[Source]
join ForwardServer fs2 on fs2.Guid = j.Destination
join ReplicationJobOperationState hs on hs.JobID = j.JobID

join
(select distinct JobId, cnt=COUNT(*)
from ReplicationJobOperationState
group by JobId
)o on o.JobId = j.JobID

where 1 = 1
-- and hs.OperationId = '{9317d360-bfa1-4070-95a0-0acc5e74c917}'
-- and DATEDIFF(hh, j.Startdate, GETDATE()) <= 24
order by j.StartDate desc

/*
The following query will show list of hung or stalled jobs currently not complete. The default is anything older than 8 hours.
It can be changed by modifying the value of the @hours variable on line 2.
*/

declare @hours float
set @hours = .34 -- approximately 8 hours

select j.StartDate
,isnull(jsi.Name,rs.Name) as Name
,cast(rs.State as XML).value('(//source/name/text())[1]','nvarchar(100)') sourceNS
,cast(rs.State as XML).value('(//destination/name/text())[1]','nvarchar(100)') DestinationNS
,isnull(LastAction,j.CreatedDate) as LastAction
,j.RunCount
,isnull(o.Type,'')+' '+isnull(o.Task,'') CurrentOperation
,m.ManifestCount, m.Pending,m.VerificationInProgress,m.[ReplicaitonInProgress],m.[Failed],m.[Replicated]
,j.JobID,ja.SourceJobID
from ReplicationJob j
left join Item rs on rs.Guid=j.ItemGuid
LEFT JOIN Evt_NS_Hierarchy_Job_Targets rt on rt.ReplicationJobID=j.JobID
LEFT JOIN Evt_NS_Hierarchy_Job_Start js on js.JobID=rt.JobID
LEFT JOIN Item jsi on jsi.Guid=js.CauseID
left join ReplicationJobAssociation ja on j.JobID=ja.DestinationJobID
left Join (
select o.JobID
,sum(case when m.State in (1,3,5,7,9,11,13) then 1 else 0 end) as [Pending]
,sum(case when m.State in (2,6) then 1 else 0 end) as [VerificationInProgress]
,sum(case when m.State in (4,8,10,12,14) then 1 else 0 end) as [ReplicaitonInProgress]
,sum(case when m.State=16 then 1 else 0 end) as [Failed]
,sum(case when m.State=15 then 1 else 0 end) as [Replicated]
,COUNT(m.State) ManifestCount
,max(case when m.ModifiedDate>o.ModifiedDate then m.ModifiedDate else o.ModifiedDate end) LastAction
from ReplicationOperation o
left join ReplicationManifest m on m.OperationID=o.OperationID
group by o.JobID
) m on m.JobID=j.JobID
left join ReplicationOperation o on o.JobID=j.JobID and o.State=2
where j.State<4
and isnull(LastAction,j.CreatedDate)<GETDATE()-@hours --No Action for over 8 Hrs
order by j.StartDate

Attachments

Replication Data Cleanup2.zip get_app