We are experiencing odd results with EDM detection: we get a number of false positives when our EDM index contains the keyword 123456 in one column.
DLP is matching for 123,456 and 123.456
No matches for 1,23456 or 1234.56
I also tried with 12345678 in the index; in that case there are no matches when commas or periods are added.
Modifying either of the two Advanced server settings Lexer.IncludePunctuationInWords and Lexer.Validate does not change the results; the behavior is always the same.
The lexer is quite complex, as pattern evaluation is performed for SSN, CCN, phone number, email, postal code, date, number, IP address, and other token types.
The interpretation of punctuation can be overruled by other algorithms in the lexer, which may apply more specific pattern recognition to certain punctuation symbols.
Some punctuation characters (`,~,!,&,-,',\",.,?,@,$,%,*,^,(,),[,{,],},/,#,=,+,_) have no specific lexer rules, so the lexer simply breaks tokens on them during detection.
Information provided by Engineering:
Let me briefly explain how the EDM tokenizer works.
Suppose we are given "$123,456,789".
First of all, using predefined matching patterns, the EDM tokenizer detects the type of each token, such as WORD, NUMBER, CCN, SSN, etc.
In this case it recognizes "$123,456,789" as NUMBER, since it starts with "$" followed by 3-digit groups separated by commas.
Then the token is normalized depending on its type. In this case the tokenizer recognized it as NUMBER, so the "number normalizer" runs and removes all non-digit characters, giving the normalized value "123456789".
So the detected token type is what matters.
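The normalization step can be sketched as follows (a minimal illustration; `normalize_number_token` is a hypothetical helper name, not the actual DLP implementation):

```python
import re

def normalize_number_token(token: str) -> str:
    """Strip every non-digit character from a token classified as NUMBER,
    mimicking the 'number normalizer' described above (illustrative only)."""
    return re.sub(r"\D", "", token)

print(normalize_number_token("$123,456,789"))  # -> 123456789
print(normalize_number_token("123,456"))       # -> 123456
```

The second call shows why "123,456" in scanned content false-positives against the index keyword 123456: after normalization the two values are identical.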
Let's look into the customer's case.
Our tokenizer detects the type of each format as below.
921-915-262-92 -> WORD
94 06 77 76 419 -> 5 NUMBERs
$941,448,265,30 -> 2 NUMBERs ($941,448,265 and 30)
945&635&464&1 -> WORD
963/999/234/99 -> WORD
965\178\87\660 -> WORD
980.17.77.35.58 -> WORD
98073272327 -> NUMBER
9810438908 -> NUMBER
*Note: If the tokenizer recognizes a token as WORD type, it does not perform number normalization.
As seen, only the last 2 patterns are detected as a single NUMBER type.
It may be unclear why "$941,448,265,30" is detected as 2 NUMBERs.
Because the trailing "30" is NOT 3 digits, the tokenizer treats it as a separate number.
So it recognizes the string as a combination of the following 3 tokens.
$941,448,265 -> NUMBER
, -> SEPARATOR
30 -> NUMBER
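The classification behavior described above can be approximated with a short sketch. This is illustrative only: the real lexer has many more token types and rules, and the regex below covers only the comma/dollar cases discussed here (it does not model the period or space variants).

```python
import re

# Rough approximation of the NUMBER pattern described above: an optional
# "$" followed by comma-grouped 3-digit blocks, or a plain run of digits.
NUMBER_RE = re.compile(r"\$?(?:\d{1,3}(?:,\d{3})+|\d+)")

def classify(chunk: str):
    """Return (type, text) tokens for one whitespace-delimited chunk."""
    if NUMBER_RE.fullmatch(chunk):
        return [("NUMBER", chunk)]
    if re.fullmatch(r"[$\d,]+", chunk):
        # A comma that breaks the 3-digit grouping acts as a separator,
        # splitting the chunk into several NUMBER tokens.
        tokens, pos = [], 0
        while pos < len(chunk):
            m = NUMBER_RE.match(chunk, pos)
            if m:
                tokens.append(("NUMBER", m.group()))
                pos = m.end()
            else:
                tokens.append(("SEPARATOR", chunk[pos]))
                pos += 1
        return tokens
    # Other punctuation (hyphen, slash, backslash, extra periods) has no
    # NUMBER rule in this sketch, so the whole chunk stays a single WORD.
    return [("WORD", chunk)]

print(classify("921-915-262-92"))   # [('WORD', '921-915-262-92')]
print(classify("$941,448,265,30"))  # NUMBER, SEPARATOR, NUMBER
print(classify("98073272327"))      # [('NUMBER', '98073272327')]
```

Because the WORD chunks are never number-normalized, none of them can match the plain digits stored in the EDM index.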
There is no simple way to make the EDM tokenizer recognize all these patterns as NUMBER.
If the customer really needs to match these account number formats, the only approach I can come up with is to add each format as a separate column in the data source.
Example Data Source:
Name | Account# format1 | Account# format2 | Account# format3 | Account# format4
Foo | 92191526292 | 921-915-262-92 | 921 915262 92 | 921.915.262.92
Then create an EDM rule that requires a match on 2 out of the 5 columns.
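Populating such a data source can be scripted. Here is a minimal sketch that derives the four example formats from the plain account number; the 3-3-3-2 grouping and the helper name `account_formats` are assumptions based on the example row above, not part of the product:

```python
def account_formats(digits: str) -> dict:
    """Derive the formatted variants shown in the example data source
    from the plain 11-digit account number (3-3-3-2 grouping assumed)."""
    groups = [digits[0:3], digits[3:6], digits[6:9], digits[9:]]
    return {
        "format1": digits,                                             # 92191526292
        "format2": "-".join(groups),                                   # 921-915-262-92
        "format3": f"{groups[0]} {groups[1]}{groups[2]} {groups[3]}",  # 921 915262 92
        "format4": ".".join(groups),                                   # 921.915.262.92
    }

row = account_formats("92191526292")
print(row["format2"])  # -> 921-915-262-92
```

Each derived value goes into its own column, so whichever variant appears in scanned content can match at least one column of the index.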