Replication Data from a failed or hung job is not cleaned up automatically.

book

Article ID: 152826

calendar_today

Updated On:

Products

Management Platform (Formerly known as Notification Server)

Issue/Introduction

Data is no longer replicating between servers. You may see one or more of the following symptoms:

  • The database size is increasing dramatically, especially the ITEM table size. 
  • The Current Replication Activity report shows a job that is failed, or appears to be hung. For example, the percentage is not increasing or the Start Date is greater than 48 hours.

Job shows as in progress for more than 48hrs

Cause

When a failure occurs during a replication job, the normal cleanup measures are unable to be performed.

Environment

Symantec Management Platform 7.0 SP5 and later
Symantec Management Platform 7.1 and later
Symantec Management Platform 7.5 and later

Symantec Management Platform 8.x

Resolution

Cleanup steps prior to 8.x:
  1. Download and import the attached Task named ‘Replication Data Cleanup’ on each server participating in the replication process, source/destination, parent/child
  2. After importing, execute the Task (Note that this Task may time out depending on SQL Server speed and amount of data to be cleaned. If this happens, try re-running the Task several times if necessary as it should get farther with each pass)
  3. Once the Task finishes, run the "NS.Quarter-Hour.." task from the Windows Scheduled Tasks folder if you want to recover additional database space immediately

NOTE: The Task attached to this article should be disabled/deleted from your servers once this issue has been resolved in a released version of SMP. 

Also, if the ITEM table in the database is too large then the cleanup Task may fail to run, because there are too many rows to deal with at one time.  If the Task continually fails to run then edit the SQL in the task  ‘Replication Data Cleanup’ and at the very top line put the following statement:

SET ROWCOUNT 5000

This will cause the Task to only take out 5,000 rows. This may also require that the Task be run multiple times before everything is cleaned up.



Cleanup steps post 8.x:

  1. Run the following queries to identify and stop any hung replication jobs:

    /* 
    This query tells the current state of all replication jobs, past and present and includes other information such as duration and transfer sizes.
    */

    select distinct
    j.JobId, 
    fs1.Name as [Source],
    fs2.Name as [Destination],
    case when j.State = 1 then 'Not Started'
         when  j.State = 2 then 'Pending'
         when  j.State = 3 then 'Running'
         when  j.State = 4 then 'Complete'
         when  j.State = 5 then 'Aborted'
         when  j.State = 6 then 'Error'
    end as [Job Status],
    j.StartDate, j.FinishDate, j.RunCount, o.cnt as Operations,hs.RunningTime as [Duration], hs.TransferSize

    from ReplicationJob j
    join ForwardServer fs1 on fs1.Guid = j.[Source]
    join ForwardServer fs2 on fs2.Guid = j.Destination
    join ReplicationJobOperationState hs on hs.JobID = j.JobID

    join
    (select distinct JobId, cnt=COUNT(*)
      from ReplicationJobOperationState
      group by JobId
    )o on o.JobId = j.JobID

    where 1 = 1
    -- and hs.OperationId = '{9317d360-bfa1-4070-95a0-0acc5e74c917}'
    -- and DATEDIFF(hh, j.Startdate, GETDATE()) <= 24
    order by j.StartDate desc


    /*
    The following query will show list of hung or stalled jobs currently not complete. The default is anything older than 8 hours. 
    It can be changed by modifying the value of the @hours variable on line 2.
    */

    declare @hours float
    set @hours = .34 -- approximately 8 hours

    select j.StartDate
    ,isnull(jsi.Name,rs.Name) as Name
    ,cast(rs.State as XML).value('(//source/name/text())[1]','nvarchar(100)')  sourceNS
    ,cast(rs.State as XML).value('(//destination/name/text())[1]','nvarchar(100)') DestinationNS
    ,isnull(LastAction,j.CreatedDate) as LastAction
    ,j.RunCount 
    ,isnull(o.Type,'')+' '+isnull(o.Task,'') CurrentOperation
    ,m.ManifestCount, m.Pending,m.VerificationInProgress,m.[ReplicaitonInProgress],m.[Failed],m.[Replicated]
    ,j.JobID,ja.SourceJobID
    from ReplicationJob j  
    left join Item rs on rs.Guid=j.ItemGuid
    LEFT JOIN Evt_NS_Hierarchy_Job_Targets rt on rt.ReplicationJobID=j.JobID
    LEFT JOIN Evt_NS_Hierarchy_Job_Start js on js.JobID=rt.JobID
    LEFT JOIN Item jsi on jsi.Guid=js.CauseID
    left join ReplicationJobAssociation ja on j.JobID=ja.DestinationJobID
    left Join (
    select o.JobID
    ,sum(case when m.State in (1,3,5,7,9,11,13) then 1 else 0 end) as [Pending]
    ,sum(case when m.State in (2,6) then 1 else 0 end) as [VerificationInProgress]
    ,sum(case when m.State in (4,8,10,12,14) then 1 else 0 end) as [ReplicaitonInProgress]
    ,sum(case when m.State=16 then 1 else 0 end) as [Failed]
    ,sum(case when m.State=15 then 1 else 0 end) as [Replicated]
    ,COUNT(m.State) ManifestCount 
    ,max(case when m.ModifiedDate>o.ModifiedDate then m.ModifiedDate else o.ModifiedDate end)  LastAction
    from ReplicationOperation o
         left join ReplicationManifest m on m.OperationID=o.OperationID 
    group by o.JobID
    ) m on m.JobID=j.JobID
    left join ReplicationOperation o on o.JobID=j.JobID and o.State=2
    where j.State<4
    and isnull(LastAction,j.CreatedDate)<GETDATE()[email protected]  --No Action for over 8 Hrs
    order by j.StartDate

Attachments

Replication Data Cleanup2.zip get_app