Keyword rule failed to detect a key phrase in a PDF containing a LF or new line.

Article ID: 282245

Products

Data Loss Prevention Enterprise Suite Data Loss Prevention Data Loss Prevention Cloud Detection Service Data Loss Prevention Cloud Detection Service for ICAP Data Loss Prevention Cloud Detection Service for REST Data Loss Prevention Cloud Package Data Loss Prevention Cloud Prevent for Microsoft Office 365 Data Loss Prevention Cloud Service for Discovery/Connector Data Loss Prevention Cloud Service for Email Data Loss Prevention Cloud Storage Data Loss Prevention Core Package Data Loss Prevention Discover Suite Data Loss Prevention Endpoint Discover Data Loss Prevention Endpoint Prevent Data Loss Prevention Endpoint Suite Data Loss Prevention Enforce Data Loss Prevention for Mobile Data Loss Prevention Form Recognition Data Loss Prevention Network Discover Data Loss Prevention Network Email Data Loss Prevention Network Monitor Data Loss Prevention Network Monitor and Prevent for Email Data Loss Prevention Network Monitor and Prevent for Email and Web Data Loss Prevention Network Monitor and Prevent for Web Data Loss Prevention Network Web Data Loss Prevention Network Prevent for Email

Issue/Introduction

The words of a key phrase in a PDF file appear to be separated by spaces, yet DLP does not detect the entire phrase.

The search function in Adobe Acrobat Reader is able to find the entire key phrase in the same PDF file.

Cause

When a file from another application is converted to PDF, the conversion sometimes adds new lines and other formatting to the original text.

MS Word is known to do this so that the output conforms to the margin and document structure requirements of PDF.

In such a document, what looks like a single space may actually be a Line Feed (LF, also called a new line, \n) or a Carriage Return (CR, \r). This can be seen by extracting the raw text of the PDF with the DLP filter.exe and viewing the cracked content in an editor that can display symbols, e.g. Notepad++.
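
As a rough illustration of what a symbol-aware editor shows, the following Python sketch replaces line-ending characters in already-extracted text with visible markers. It assumes the extracted text is available as a string. The function name show_line_endings and the sample text are hypothetical and are not part of DLP or filter.exe.

```python
def show_line_endings(text: str) -> str:
    """Return text with CR, LF and TAB replaced by visible markers."""
    markers = {"\r": "[CR]", "\n": "[LF]", "\t": "[TAB]"}
    return "".join(markers.get(ch, ch) for ch in text)


# Hypothetical extracted content: the "space" after "Confidential" is really CR+LF.
sample = "Company Confidential\r\nInformation"
print(show_line_endings(sample))  # Company Confidential[CR][LF]Information
```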

When a LF or new line has been added in the middle of a key phrase, it breaks up the original phrase, and a keyword phrase rule does not detect it.

DLP does not detect the phrase because it extracts the content exactly as it is formatted in the scanned document.
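
The effect can be reproduced outside DLP with a plain string comparison. The Python sketch below is only an analogy and does not show how DLP matches keywords internally. The phrase and the extracted text are made-up examples.

```python
phrase = "Company Confidential Information"

# Hypothetical extracted text: the export wrapped the line after "Confidential".
extracted = "This document contains Company Confidential\nInformation only."

# A literal comparison treats LF and space as different characters.
literal_hit = phrase in extracted

# Collapsing every whitespace run to one space restores the phrase.
normalized = " ".join(extracted.split())
normalized_hit = phrase in normalized

print(literal_hit, normalized_hit)  # False True
```

A search that treats any whitespace as equivalent will find the phrase, while a literal match on the extracted text will not.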

Currently there is no way to distinguish new lines that were present in the original document from new lines added as part of an export process.

Resolution

To detect the phrase in this situation, you can do either of the following:

  1. Create a Regular Expression to detect the key phrase.
  2. To detect phrases that span multiple lines, break the phrase up and use keyword proximity rules.
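
Both options can be prototyped against extracted text before a rule is created. The Python sketch below is a minimal illustration under stated assumptions: phrase_to_pattern and within_proximity are hypothetical helper names, the sample text is made up, and the resulting pattern may need adapting to the regular expression syntax that the DLP rule accepts. It also assumes the phrase starts and ends with a word character.

```python
import re


def phrase_to_pattern(phrase: str) -> str:
    """Build a regex that allows any whitespace run (space, LF, CR) between words."""
    words = phrase.split()
    return r"\b" + r"\s+".join(re.escape(w) for w in words) + r"\b"


def within_proximity(text: str, first: str, second: str, max_words: int) -> bool:
    """Return True if `second` occurs at most `max_words` words after `first`."""
    tokens = [t.lower() for t in re.findall(r"\w+", text)]
    first, second = first.lower(), second.lower()
    for i, tok in enumerate(tokens):
        if tok == first and second in tokens[i + 1 : i + 1 + max_words]:
            return True
    return False


# Hypothetical extracted text with a LF in the middle of the key phrase.
extracted = "This document contains Company Confidential\nInformation only."

# Option 1: a whitespace-tolerant regular expression.
pattern = phrase_to_pattern("Company Confidential Information")
print(bool(re.search(pattern, extracted)))  # True

# Option 2: split the phrase and require the parts to be close together.
print(within_proximity(extracted, "Confidential", "Information", 3))  # True
```

The regular expression matches because \s+ accepts a line feed or carriage return as well as a space. The proximity check works on individual words, so the line break between them does not matter.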
