Discover - SharePoint Scan Improvements

Article ID: 165123


Products

Data Loss Prevention Network Discover

Issue/Introduction

When SharePoint returns a huge list, reading it can take FileReader longer than the timeout set in BoxMonitor.HeartbeatGapBeforeRestart, which BoxMonitor uses to restart a FileReader process it assumes is not responding. This caused frequent FileReader restarts, and scans behaved erratically.

Cause

The SharePoint crawler uses Microsoft's GetListItems query API to retrieve the items from a list; the API returns the list in XML format. By default this query returns all items of the list. Small XML results are read and processed in milliseconds; however, past a certain threshold, the time it takes to read and process the XML grows exponentially with its size and complexity. While the XML result is being read and processed, no heartbeat is sent between FileReader and BoxMonitor, so once the HeartbeatGapBeforeRestart interval (default 16 minutes) is exceeded, BoxMonitor restarts the FileReader process on the assumption that it has hung.

Resolution

Starting with version 15.0, the DLP SharePoint crawler simplifies the data requested for each item and uses the pagination option offered by the GetListItems API to break the list into multiple pages, with each call retrieving one page at a time. By default DLP uses a page size of 2,000 items, which for most SharePoint lists keeps the resulting XML small enough to be read within the existing HeartbeatGapBeforeRestart timeout.
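As a rough illustration of the paging behavior (not DLP's actual implementation), a paginated crawl keeps requesting pages until the source reports no further position; `fetch_page` below is a hypothetical stand-in for a single GetListItems call, and `row_limit` plays the role of the page size:

```python
def fetch_page(items, row_limit, position=0):
    """Stand-in for one GetListItems call: return one page of items plus
    the position of the next page (None when the list is exhausted).
    Hypothetical helper for illustration only."""
    page = items[position:position + row_limit]
    next_position = position + row_limit if position + row_limit < len(items) else None
    return page, next_position

def crawl_list(items, row_limit=2000):
    """Retrieve an entire list one page at a time, as a paginated crawler does."""
    collected, position = [], 0
    while position is not None:
        page, position = fetch_page(items, row_limit, position)
        collected.extend(page)
    return collected

# A 5,000-item list is retrieved in full, one bounded page at a time,
# so no single response ever exceeds row_limit items.
docs = [f"item-{i}" for i in range(5000)]
assert crawl_list(docs, row_limit=2000) == docs
```

Because each response is bounded by the page size, each read stays well under the heartbeat timeout even for very large lists.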

In some SharePoint environments, the size of the list items may still generate so much data that the default page size is too large and needs to be lowered. To do this, add the following line to the bottom of the crawler.properties config file to override the built-in default of 2,000 with a page size of 200:

sharepointcrawler.pagination.limit = 200

Restart the Detection Server service for the change to take effect.

This does increase the number of requests to SharePoint to retrieve the same data: where a 5,000-item list would normally take 3 requests, a page size of 200 would take 25 requests.
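The request count is simply the item count divided by the page size, rounded up, which makes the trade-off easy to estimate before lowering the setting:

```python
from math import ceil

def requests_needed(item_count, page_size):
    """Number of paged GetListItems calls required to walk a list."""
    return ceil(item_count / page_size)

print(requests_needed(5000, 2000))  # 3 requests at the default page size
print(requests_needed(5000, 200))   # 25 requests with the lowered page size
```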