
OCR (optical character recognition) fails to detect, or inconsistently detects, violating text in a document that contains text in multiple languages.


Article ID: 242771


Updated On:


Data Loss Prevention Enforce, Data Loss Prevention Cloud Service for Email, Data Loss Prevention Cloud Detection Service for ICAP


When DLP scans an image document that contains multiple languages, such as Chinese and English, OCR detection is inconsistent or does not occur at all.


Release: 15.x


In a multi-language document, the OCR library determines the dominant language and performs detection only on that single language. Dominant language selection is a function of pixel area (height * width), DPI, image sharpness, and character separability. If the dominant language is not English, English characters can still be detected, because English is always enabled as a default language.
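
The behavior described above can be illustrated with a simplified sketch. This is a toy model, not the OCR library's actual algorithm: the `TextRegion` type, the function names, and the sample regions are all hypothetical, and the sketch ranks languages by pixel area only, ignoring DPI, sharpness, and character separability.

```python
from dataclasses import dataclass


@dataclass
class TextRegion:
    """Hypothetical text region found in a scanned image."""
    language: str  # language tag, e.g. "en" or "zh"
    width: int     # region width in pixels
    height: int    # region height in pixels
    text: str      # text the region contains


def dominant_language(regions):
    """Return the language covering the largest total pixel area.

    Assumption: area (height * width) is the only factor considered here.
    """
    area = {}
    for region in regions:
        area[region.language] = area.get(region.language, 0) + region.width * region.height
    return max(area, key=area.get)


def scanned_text(regions, default="en"):
    """Return text only from the dominant language and the default language."""
    dominant = dominant_language(regions)
    return [r.text for r in regions if r.language in (dominant, default)]


# Hypothetical document: mostly Chinese, with some English and Japanese text.
regions = [
    TextRegion("zh", 800, 200, "机密"),
    TextRegion("en", 400, 100, "Confidential"),
    TextRegion("ja", 300, 100, "秘密"),
]

print(dominant_language(regions))  # language with the largest area
print(scanned_text(regions))       # text that remains eligible for detection
```

In this model, text in any language other than the dominant one and English never reaches detection, which matches the missed detections described in this article.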

Detections may also be inconsistent. For example, if the same image file is sent as an attachment multiple times, the detection count may vary from one incident to the next. This happens because OCR currently cannot differentiate between Traditional and Simplified Chinese.


OCR accuracy is best when processing a single-language image document with high-contrast, high-DPI images that contain typewritten text.