OCR (optical character recognition) fails to detect, or inconsistently detects, violating text in a document that contains text in multiple languages.

Article ID: 242771

Updated On:

Products

Data Loss Prevention Enforce, Data Loss Prevention Cloud Service for Email, Data Loss Prevention Cloud Detection Service for ICAP

Issue/Introduction

You will see inconsistent detection, or no detection at all, with OCR when DLP scans an image document that contains multiple languages, such as Chinese and English.

Environment

Release: 15.x

Cause

In a multi-language document, the OCR library determines the dominant language, and detection is performed only on that single dominant language. Dominant language selection is a function of pixels (height * width), DPI, image sharpness, and character separability. However, even if the dominant language is not English, English characters can still be detected, because English is always a default language.
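The effect of dominant language selection can be illustrated with a small sketch. This is a toy model only, not the OCR library's actual algorithm: it assumes the dominant language is simply the one covering the largest pixel area (height * width) and ignores DPI, image sharpness, and character separability. The names `TextRegion`, `dominant_language`, and `inspected_text` are hypothetical and exist only for this illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TextRegion:
    """A hypothetical block of text found in a scanned image."""
    language: str   # language code, e.g. "zh" or "en"
    height_px: int
    width_px: int
    text: str


def dominant_language(regions):
    """Pick the language covering the largest total pixel area.

    Assumption: area alone decides. The real selection also depends on
    DPI, image sharpness, and character separability.
    """
    area = defaultdict(int)
    for region in regions:
        area[region.language] += region.height_px * region.width_px
    return max(area, key=area.get)


def inspected_text(regions, default_language="en"):
    """Return only text in the dominant language or the default language."""
    allowed = {dominant_language(regions), default_language}
    return [region.text for region in regions if region.language in allowed]


regions = [
    TextRegion("zh", 400, 800, "机密文件"),
    TextRegion("en", 100, 800, "CONFIDENTIAL"),
    TextRegion("ja", 100, 800, "社外秘"),
]

print(dominant_language(regions))  # zh
print(inspected_text(regions))     # ['机密文件', 'CONFIDENTIAL']
```

In this toy example the Japanese region is never inspected, because Japanese is neither the dominant language nor the default. This mirrors how violating text in a non-dominant language can go undetected.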

You may also see inconsistent detections. For example, if the same image file is sent as an attachment multiple times, the incident count may vary from one submission to the next. This happens because OCR currently cannot differentiate between Traditional and Simplified Chinese.

Resolution

OCR accuracy is best when processing a single-language image document that has high contrast, high DPI, and typewritten text.