File is not being detected by OCR in DLP for CloudSOC and On-premises

Article ID: 257341

Updated On:

Products

Data Loss Prevention Core Package
Data Loss Prevention Cloud Package
Data Loss Prevention Cloud Detection Service
Data Loss Prevention Cloud Detection Service for ICAP
Data Loss Prevention Cloud Detection Service for REST

Issue/Introduction

Some PDF files are not detected by OCR in DLP. The files were scanned on-premises through the SharePoint connector with OCR enabled. Running filter.exe against an affected file completes, but the output file is blank. Running it against a different document returns the expected result, which indicates that the program itself is working correctly.

Environment

Release: 15.8, 16.0

Cause

The OCR extraction method used by DLP can extract image content from PDFs created with AcroForms.

PDF content created by other methods (e.g., "XFA") will not allow the DLP OCR engine to extract a readable image.
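
To check whether a given PDF is likely to be XFA-based before troubleshooting OCR further, a rough heuristic is to look for the /XFA key, which the PDF specification places in the interactive form (AcroForm) dictionary. The following Python sketch is not part of DLP and is not a Broadcom tool; the function name has_xfa_marker is hypothetical. It scans the raw file bytes and any Flate-compressed streams for that key. It does not handle encrypted files or other stream filters, so a negative result is not conclusive.

```python
import re
import sys
import zlib

# Captures the raw bytes between the "stream" and "endstream" keywords.
_STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def has_xfa_marker(pdf_bytes: bytes) -> bool:
    """Heuristic (assumption, not a DLP API): True if the /XFA key is
    found in the file body or in any Flate-compressed stream."""
    if b"/XFA" in pdf_bytes:
        return True
    for match in _STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates trailing bytes after the zlib data.
            inflated = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate data (or another filter is used); skip it
        if b"/XFA" in inflated:
            return True
    return False


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], "rb") as handle:
        found = has_xfa_marker(handle.read())
    print("XFA marker found" if found else "No XFA marker found")
```

Saved under any file name, the script takes the path of the PDF as its only argument. If it reports an XFA marker, the blank OCR output is consistent with the limitation described above.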

Resolution

If the OCR engine finds no images at all, this may be because the file does not meet the image quality and size requirements (see Image Quality and Resolution for OCR results (broadcom.com)).

However, in some cases the type of PDF involved also prevents image extraction, e.g., "XFA" (XML Forms Architecture).

Thus, a form created with XFA might include the following document properties (viewed through the Acrobat Reader "File > Properties" menu):

Additional Information

There is a Feature Request for this issue, PM-2963: "Support content extraction for XFA-based PDF forms".