How does Network Discover scan Enterprise Vault (or other storage solutions) archived files within a scan?

Article ID: 160060

Products

Data Loss Prevention Network Discover

Issue/Introduction

When a scan accesses files that have been archived by a storage solution such as Symantec Enterprise Vault or EMC's storage solutions, the files are restored to the file system. This can cause the scanned systems to run out of disk space.

Resolution

How does Discover work on shares?

Discover iterates through the directory structure as set up in the scan target. During the scan, each file is opened for reading. Depending on the scan method, the last-accessed date can be reset after the read, so that the scan does not change the last-accessed date of every scanned file to the time of the scan.
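
The read-then-reset pattern described above can be illustrated with a short Python sketch. This is not Discover's implementation. The function name scan_preserving_atime and the inspect callback are hypothetical, and the sketch assumes the account running it is allowed to change file timestamps.

```python
import os
from pathlib import Path


def scan_preserving_atime(root, inspect):
    """Walk root, read every file, and put the original timestamps back."""
    scanned = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            before = path.stat()              # remember timestamps before the read
            with open(path, "rb") as handle:  # plain read-open of the file
                inspect(path, handle.read())  # hand the content to the caller
            # Restore the access and modification times recorded before the read.
            os.utime(path, ns=(before.st_atime_ns, before.st_mtime_ns))
            scanned.append(path)
    return scanned
```

Note that resetting the timestamp does not undo a restore: if the read caused an archive solution to bring the file back, the content stays on disk even though the last-accessed date looks unchanged.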

Observed behavior

When a scan accesses files that are archived within a storage solution such as Enterprise Vault, the files are restored to the file system.

Note: This behavior is by design. Discover performs standard read operations on the file system. Archive solutions such as Enterprise Vault (EV), and others that use reparse points, work transparently and by design restore the file on access.

Background Information

Windows 2000 introduced the ability to create special file system functions and associate them with files or directories. This allows the functionality of the NTFS file system to be enhanced and extended dynamically. The feature is implemented using objects called reparse points.

Every reparse point is tagged with an identifier specific to the application and stored with the file or directory. A special application-specific filter (driver) is associated with the reparse point tag type and made known to the file system. This can be cascaded, so more than one application can be associated with a specific tag type.

Whenever a user or process opens a file, the file system notices the reparse point and opens the file via the driver associated with it. The original request for the file is "reparsed": the driver uses the data stored within the reparse point to retrieve the proper data transparently in the background.

The key point is that the data is retrieved transparently on the back end, and the scanning application does not know that an archived file is being fetched.

From a Discover perspective, the files are scanned as if they were regular files in place, since the data is retrieved in the background.
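
To make the effect on disk space concrete, here is a toy model in Python. It does not use any Windows or Enterprise Vault API. Stub and ToyVolume are hypothetical names, and the recall logic only imitates what a filter driver does when a placeholder is read.

```python
from dataclasses import dataclass, field


@dataclass
class Stub:
    """Placeholder left on the volume; stands in for a reparse point."""
    archive_key: str


@dataclass
class ToyVolume:
    files: dict = field(default_factory=dict)    # name -> bytes or Stub
    archive: dict = field(default_factory=dict)  # archive_key -> bytes

    def archive_file(self, name):
        """Move the content to the archive and leave a stub behind."""
        self.archive[name] = self.files[name]
        self.files[name] = Stub(archive_key=name)

    def read(self, name):
        """Read a file; a stub is recalled first, invisibly to the caller."""
        entry = self.files[name]
        if isinstance(entry, Stub):
            entry = self.archive[entry.archive_key]
            self.files[name] = entry  # the content is back on the volume
        return entry

    def used_bytes(self):
        """Bytes occupied by real content; stubs count as zero here."""
        return sum(len(e) for e in self.files.values() if isinstance(e, bytes))


volume = ToyVolume(files={"a.doc": b"x" * 1000, "b.doc": b"y" * 2000})
volume.archive_file("a.doc")
volume.archive_file("b.doc")
before = volume.used_bytes()
for name in list(volume.files):  # the "scan": a plain read of every file
    volume.read(name)
after = volume.used_bytes()
```

In this model the caller of read() never sees a stub, yet the space used on the volume grows from before to after by the full size of every archived file that was read.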

NOTE: In this situation, make sure that enough disk space is available to complete the Discover scan, because the restored data is copied back into its original location.

Reparse points are discussed in more detail on Microsoft's MSDN at http://msdn.microsoft.com/en-us/library/ms995846.aspx

Reparse points let an application associate a block of application data with a file or directory and let the Object Manager reparse, or reexecute a name lookup, when an application encounters a reparse point. (For information about the Object Manager's role in the OS's architecture, see "Inside NT's Object Manager," October 1997.) In addition to storing the reparse data, the reparse point stores a reparse code that identifies the reparse point as belonging to a particular application. Although not useful by themselves, reparse points let Win2K or third-party developers build functionality. Win2K provides several types of reparse-point functionality, including mount points, NTFS junctions, and Hierarchical Storage Management (HSM).
[....]
Not all reparse points rely on path reparsing functionality. The HSM system uses HSM reparse points to transparently migrate infrequently accessed files to offline storage. When HSM moves a file offline, the HSM system deletes the file's contents and creates a reparse point in the file's place. The reparse point data contains information the HSM system uses to locate the file's data on archival media. When an application later accesses an offline HSM file, the HSM driver RsFilter.sys (Remote Storage Filter) intercepts the reparse code that NTFS returns to the Object Manager. The driver deletes the reparse point, fetches the file data from archival storage, then reissues the original request. This time, NTFS accesses the file as it would any other, and the application doesn't realize that data shuffling occurred.

Workarounds

1] Custom scanner

Using the Discover SOAP API, you could traverse the directory you want to scan from within your own application. You could check whether a file is a reparse point or an actual file by using FSUtil (http://technet.microsoft.com/en-us/library/cc785451%28v=ws.10%29.aspx). If it is a reparse point, you could either skip the file or use the appropriate API to access the stored file on the back-end server.

For example, for Enterprise Vault see http://www.symantec.com/business/support/index?page=content&id=TECH69123
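
As an alternative to FSUtil, the reparse-point check in this workaround could be sketched in Python as follows. This is an illustration under stated assumptions, not part of the Discover SOAP API or of any archive API. The function names are hypothetical. The sketch relies on Python's os.lstat, which on Windows exposes the file attributes as st_file_attributes, and on the Windows attribute bit FILE_ATTRIBUTE_REPARSE_POINT (0x400). Symbolic links carry the same bit, so a real scanner may need to inspect the reparse tag as well.

```python
import os

# Windows file attribute bit that marks a reparse point.
FILE_ATTRIBUTE_REPARSE_POINT = 0x400


def is_reparse_point(attributes):
    """Return True if the attribute bit mask has the reparse-point bit set."""
    return bool(attributes & FILE_ATTRIBUTE_REPARSE_POINT)


def split_by_reparse_point(root):
    """Walk root and sort file paths into (regular, reparse) lists."""
    regular, reparse = [], []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # st_file_attributes exists only on Windows; treat it as 0 elsewhere.
            attrs = getattr(os.lstat(path), "st_file_attributes", 0)
            (reparse if is_reparse_point(attrs) else regular).append(path)
    return regular, reparse
```

A custom scanner could then submit only the regular list for scanning and either skip the reparse list or fetch those items through the archive's own API. Querying attributes is not expected to read file content, but confirm with your archive solution that it does not trigger a restore.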

2] Scan the backend storage directly

If the storage solution exposes the back-end storage via WebDAV or as a file share, point the Discover target to the exposed share for scanning.

3] Adjust Archive Solution Policy

Discuss with the administrator of the archive solution whether the solution can be set up to archive the file again after it has been accessed. This can be based on the origin of the access, such as IP address or user account, or on the age of the archived file.