Data identifier (DI) detections can be missed on Linux systems when a document contains non-BMP (Basic Multilingual Plane) characters.
The BMP includes all of the characters used by the languages that DLP supports. Non-BMP characters (code points above U+FFFF, such as most emoji) lie outside those language character sets.
The DLP Data Identifier (DI) matcher, which is used by custom DIs and many system DIs, uses UTF-16 internally to evaluate documents against their respective DI validations. In UTF-16, each BMP character is represented by a single 16-bit code unit, while each non-BMP character is represented by 32 bits (a pair of 16-bit code units).
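The size difference can be checked with a short sketch. It only illustrates the UTF-16 encoding itself, not the DLP matcher; the helper name `utf16_units` is made up for this example.

```python
def utf16_units(text: str) -> int:
    """Return the number of 16-bit UTF-16 code units needed to encode text."""
    # utf-16-le emits exactly 2 bytes per code unit and no byte order mark.
    return len(text.encode("utf-16-le")) // 2


bmp_char = "A"                # U+0041, inside the BMP
non_bmp_char = "\U0001F600"   # U+1F600, outside the BMP

print(utf16_units(bmp_char))      # 1 code unit (16 bits)
print(utf16_units(non_bmp_char))  # 2 code units (32 bits)
```

Any code that counts 16-bit code units as characters will therefore count a non-BMP character twice.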
On Linux systems, the DLP DI validator incorrectly treats each non-BMP character as two distinct characters, which shifts tokenization left by one character each time a non-BMP character is encountered. Because of the shift, the validators evaluate the wrong string: it includes one or more characters that precede the match and omits the same number of characters from the end of the match.
For example, suppose a document contains the string "My identity example ABC123456789", and the DI match is "123456789". If two non-BMP characters appear earlier in the document, the character shift causes the validator to evaluate "BC1234567" instead of "123456789". Here the text is shifted by two characters because two non-BMP characters precede the match.
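The shift in this example can be reproduced with a simplified model. It assumes that the match offsets count whole characters while the validator indexes 16-bit code units; the actual DLP internals are not described here, and `utf16_slice` is a hypothetical helper written for this illustration.

```python
def utf16_slice(text: str, start: int, end: int) -> str:
    """Slice text by UTF-16 code unit offsets instead of character offsets."""
    units = text.encode("utf-16-le")  # 2 bytes per code unit
    return units[2 * start:2 * end].decode("utf-16-le", errors="replace")


# Two non-BMP characters appear before the string from the example.
document = "\U0001F600\U0001F600 My identity example ABC123456789"
match = "123456789"

start = document.index(match)  # offset counted in characters
end = start + len(match)

correct = document[start:end]                # what should be validated
shifted = utf16_slice(document, start, end)  # what the model validates

print(correct)  # 123456789
print(shifted)  # BC1234567
```

Each non-BMP character occupies two code units, so a character offset applied to the code unit sequence lands one position too early per preceding non-BMP character.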
Additionally, when one of the validators is the DNI Check validator, the shift can pass non-numeric data to it, which causes a NumberFormatException. The exception is not handled, so detection for that message chain crashes, all detection for the message is lost, and an error is left in the log.
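The failure mode can be sketched as follows. This is an analogy only: Python's `ValueError` stands in for the NumberFormatException, `numeric_check` and `safe_numeric_check` are hypothetical names, and the placeholder rule is not the real DNI Check logic. The guarded variant shows one general way to contain such an error and is not necessarily how the hotfix addresses it.

```python
def numeric_check(token: str) -> bool:
    """Hypothetical validator that parses the token as an integer first."""
    value = int(token)  # raises ValueError on non-numeric input
    return value > 0    # placeholder rule, not the real DNI checksum


def safe_numeric_check(token: str) -> bool:
    """Same check, but a non-numeric token counts as a failed validation."""
    try:
        return numeric_check(token)
    except ValueError:
        return False


print(numeric_check("123456789"))       # the correctly aligned token parses
print(safe_numeric_check("BC1234567"))  # the shifted token fails validation

try:
    numeric_check("BC1234567")          # unguarded call on the shifted token
except ValueError as exc:
    print("unhandled in the unguarded path:", exc)
```

In the unguarded path, a single shifted token is enough to abort processing of everything that follows it.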
This issue only impacts versions 15.8.00311 and older running on Linux detection servers. The fix for this issue will be included in the patch listed below.
15.8
Hotfix_15.8.00313.01002_Server.zip (you will need to upgrade to MP3 to apply this patch)
16.0
Not affected.