OCR Unknown Document Format

Article ID: 242863

Products

Data Loss Prevention Network Monitor and Prevent for Email and Web
Data Loss Prevention Enterprise Suite
Data Loss Prevention Form Recognition

Issue/Introduction

Many incidents show that the OCR server was unable to detect the file format.

Many image files (.png, .jpg, .gif) are identified as Unknown Document Format.

On investigation, these are generally very small images of the kind often found in email signatures.
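To confirm that a flagged image really is small, its pixel dimensions can be read directly from the file header. The following is a minimal Python sketch, not part of the product: the helper name `png_dimensions` is hypothetical, and it handles PNG files only (.jpg and .gif store their dimensions differently).

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions(data):
    """Return (width, height) read from the IHDR chunk of PNG bytes."""
    # A PNG starts with an 8-byte signature, then the IHDR chunk:
    # 4-byte length, 4-byte type "IHDR", 4-byte width, 4-byte height.
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    return struct.unpack(">II", data[16:24])


# Synthetic header for a 120 x 40 image (illustrative, not a real incident file).
header = PNG_SIGNATURE + struct.pack(">I", 13) + b"IHDR" + struct.pack(">II", 120, 40)
print(png_dimensions(header))  # (120, 40)
```

For a file on disk, the first 24 bytes are enough, for example `png_dimensions(open(path, "rb").read(24))`.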

Environment

15.+

Cause

These images are too small for the OCR server to perform reliable detection on them.

Resolution

This issue is known and will likely be fixed in a future release. In the interim, it may be possible to avoid the issue by increasing the minimum image dimension setting in the ImageRecognition.properties file on the server, located in [<dir>:\Program Files\Symantec\DataLossPrevention\DetectionServer\<DLPVersion>\Protect\config].

ImagePreclassifier.OCR_MINIMUM_IMAGE_DIM, which is normally set to 200, can be increased to 400 or higher to filter out small images.
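The change described above amounts to editing a single line in the properties file. The Python sketch below illustrates it on a sample string. It assumes the property is stored as a plain `key = value` line, which is an assumption about the layout of ImageRecognition.properties; the helper name `set_min_image_dim` and the sample excerpt are hypothetical.

```python
import re

KEY = "ImagePreclassifier.OCR_MINIMUM_IMAGE_DIM"


def set_min_image_dim(properties_text, new_value):
    """Return the properties text with KEY set to new_value.

    Assumes a simple `key = value` line; commented-out lines are left alone.
    """
    pattern = re.compile(
        r"^([ \t]*" + re.escape(KEY) + r"[ \t]*=[ \t]*)\S*", re.MULTILINE
    )
    updated, count = pattern.subn(
        lambda m: m.group(1) + str(new_value), properties_text
    )
    if count == 0:
        raise KeyError(KEY + " not found")
    return updated


# Hypothetical excerpt; the real file contains other settings as well.
sample = "# hypothetical excerpt\nImagePreclassifier.OCR_MINIMUM_IMAGE_DIM = 200\n"
print(set_min_image_dim(sample, 400))
```

The same edit can be made by hand in a text editor; the sketch only shows which line changes and what the new value looks like.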

Additional Information

You may also want to review the following articles:

Article ID: 221599 What are the default image prefilter settings for a detection server

Article ID: 254861 Image Quality and Resolution for OCR results

Article ID: 160504 Detect sensitive data in an image file with DLP