File is not being detected by OCR in DLP for CloudSOC and On-premises

Article ID: 257341

Products

Data Loss Prevention Core Package, Data Loss Prevention Cloud Package, Data Loss Prevention Cloud Detection Service, Data Loss Prevention Cloud Detection Service for ICAP, Data Loss Prevention Cloud Detection Service for REST

Issue/Introduction

You are having trouble scanning a few PDF files. You scanned them on-premises using the SharePoint connector with OCR enabled. When you run filter.exe against one of the files, it completes, but the output file is blank. Scanning another document with the same program returns the expected result, which rules out a problem with the program itself.

Environment

Release: 15.8, 16.0

Cause

The OCR extraction method used by DLP can extract image content from PDFs created with AcroForms.

PDF content created by other methods (e.g., "XFA") will not allow the DLP OCR engine to extract a readable image.

Resolution

If the OCR engine finds no images at all, one possible cause is that the file does not meet the image quality and size requirements (see "Image Quality and Resolution for OCR results" on broadcom.com).

However, in some cases the type of PDF involved will also prevent image extraction - e.g., "XFA" (XML Forms Architecture).

Thus, a form created with XFA may be identifiable from its document properties (viewed through the Acrobat Reader "File > Properties" menu).
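As an alternative to checking the properties dialog file by file, a script can look for the /XFA key that the PDF specification defines for XFA forms inside the interactive form (AcroForm) dictionary. The following Python sketch is only a heuristic and is not part of DLP: the function name looks_like_xfa is hypothetical, it scans the raw bytes and any Flate-compressed streams it can inflate, and it can miss the key in encrypted files or in streams that use other filters.

```python
import re
import zlib

# Matches the raw bytes between the "stream" and "endstream" keywords.
_STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\n?endstream", re.DOTALL)


def looks_like_xfa(pdf_bytes: bytes) -> bool:
    """Heuristic check: return True if an /XFA key is found in the PDF.

    Hypothetical helper for illustration only. A True result suggests an
    XFA-based form; a False result does not prove the file is free of XFA.
    """
    # Case 1: the AcroForm dictionary is stored uncompressed.
    if b"/XFA" in pdf_bytes:
        return True

    # Case 2: the dictionary may sit inside a Flate-compressed object stream.
    for match in _STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates trailing bytes after the zlib data.
            inflated = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not zlib data (image, other filter, encrypted, ...)
        if b"/XFA" in inflated:
            return True
    return False
```

To use it, read a file in binary mode and pass the contents, for example looks_like_xfa(open("form.pdf", "rb").read()), where form.pdf is a placeholder name. A file flagged this way is a candidate for the XFA limitation described above rather than for an image quality problem.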

Additional Information

There is a Feature Request for this issue, PM-2963: "Support content extraction for XFA-based PDF forms".

This is expected to be addressed in DLP 16.1, which should support content extraction for XFA.
