Multi-token EDM policy with punctuation characters - not recognising percentage, dollar, dot and dash symbols

Article ID: 196836

Products

Data Loss Prevention

Issue/Introduction

Testing multi-token EDM policies.

Example:

While testing an IBAN policy, I created indexes for IBAN numbers in two formats, one without blanks and one with blanks:

CH660076xxxx1005xxxx3
CH66 0076 xxxx 1005 xxxx 3

The policy is created with rules to detect either of these two formats.

I configured the server with the WIP setting false (Lexer.IncludePunctuationInWords = false), so that every blank in the second IBAN format in the indexed content can be matched by a punctuation character, and any IBAN written with multi-token punctuation characters should be detected.

The index appears to have been created correctly:

com.vontu.profileindexer.database.NativeStatisticsBuilder@6a4f787b
Cryptographic key used: EXTERNAL.1
Single Token Uncommon cells: 230322
Single Token Uncommon cell lists: 0
Single Token common cells: 0
Single Token common cell lists: 0
Multi Token Uncommon cells: 230322
Multi Token Uncommon cell lists: 0
Multi Token common cells: 0
Multi Token common cell lists: 0
Elapsed time: 5133 milliseconds.

Successfully created index

In every test mail I put 3 different IBAN numbers in both formats, without blanks and with a separator in place of the blanks, to check detection with multi-token punctuation characters. Almost all separators work fine and generate incidents.

The EXCEPTIONS are these 4 separators: % (percentage), $ (dollar), . (dot) and - (dash).

With these separators, only the IBAN format without blanks is detected.

Multi-token policies with punctuation are important for us, because we mostly want to detect IBAN numbers written with dot and dash punctuation characters.

Cause

The likely explanation is that characters such as *, @, ~ and ' have no specific lexer rules and are treated as plain punctuation. The punctuation set is: ` ~ ! & - ' " . ? @ $ % * ^ ( ) [ { ] } / # = + _. Because Lexer.IncludePunctuationInWords = false, the lexer simply breaks an IBAN from the first group (the separators that are detected) into the tokens CH66 0076 6000 2005 3066 3 and makes a multi-token match.

For the IBANs in the second group (the separators that are not detected):

CH66$0076$6000$2005$3066$3 is most likely recognized as a group of currency amounts ($6000 $2005 $3066 $3), so it won't match.

CH66%0076%6000%2005%3066%3 is most likely recognized as a group of percentages (6000% 2005% 3066%), so it won't match.

CH66-0076-6000-2005-3066-3 is less certain: part of it might be recognized as a telephone or credit card number. The lexer is complex enough that tracing it by hand is time-consuming; confirming this would require setting up a debug system and looking at how each token is parsed.
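
The effect described above can be illustrated with a toy tokenizer. This is only a sketch under stated assumptions: it is not the DLP lexer, and the rule names, the rule order and the `toy_tokenize` function are hypothetical. It assumes that special rules (currency, percentage) are tried before the generic word rule and that any other punctuation is simply dropped.

```python
import re

# Hypothetical rule table, NOT the real DLP lexer: special patterns are
# tried first, the generic alphanumeric "word" rule last.
TOKEN_RULES = [
    ("currency", re.compile(r"\$\d+")),
    ("percentage", re.compile(r"\d+%")),
    ("word", re.compile(r"[A-Za-z0-9]+")),
]


def toy_tokenize(text):
    """Return (kind, token) pairs; characters no rule matches are skipped."""
    tokens = []
    pos = 0
    while pos < len(text):
        for kind, pattern in TOKEN_RULES:
            match = pattern.match(text, pos)
            if match:
                tokens.append((kind, match.group()))
                pos = match.end()
                break
        else:
            pos += 1  # plain punctuation or whitespace: acts as a separator
    return tokens


for sample in ("CH66*0076*6000*2005*3066*3",
               "CH66$0076$6000$2005$3066$3",
               "CH66%0076%6000%2005%3066%3"):
    print(sample, "->", toy_tokenize(sample))
```

In this sketch the `*` variant yields the same word tokens as the blank-separated IBAN, while the `$` and `%` variants yield currency and percentage tokens that no longer line up with the indexed words.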

Resolution

If all of these combinations must be matched, the only workaround found so far is to set Lexer.IncludePunctuationInWords = true and index every combination in an EDM row.
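
Producing every combination by hand is error-prone, so the variants for each IBAN could be generated before the data source is indexed. The following is a minimal sketch: the `separator_variants` helper, its default separator list and the 4-character grouping are assumptions for illustration, and how the variants are laid out as columns of the EDM data source is left to the existing export process.

```python
def separator_variants(iban, separators=(" ", ".", "-", "$", "%"), group=4):
    """Return the compact IBAN followed by one grouped variant per separator."""
    # Strip whatever separators the source value already contains.
    compact = "".join(ch for ch in iban if ch.isalnum())
    # Split into blocks of `group` characters; the last block may be shorter.
    groups = [compact[i:i + group] for i in range(0, len(compact), group)]
    return [compact] + [sep.join(groups) for sep in separators]


for variant in separator_variants("CH66 0076 6000 2005 3066 3"):
    print(variant)
```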

Because the account numbers are fairly long and have a specific structure, consider using Data Identifiers (DIs) instead. If the numbers must also follow some logic, validators can be added to reduce false positives.
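
For IBANs, one example of such logic is the standard mod-97 check-digit test. The sketch below shows that check in plain Python as an illustration of what a validator would verify; it is not DLP validator syntax, and the `iban_checksum_ok` name and the loose shape pattern (two letters, two digits, 11 to 30 further alphanumerics) are assumptions made for this example.

```python
import re

# Loose overall shape: country code, check digits, then the account part.
IBAN_SHAPE = re.compile(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}")


def iban_checksum_ok(candidate):
    """Return True if the candidate passes the IBAN mod-97 check."""
    # Ignore blanks, dots, dashes and any other separator characters.
    compact = re.sub(r"[^A-Za-z0-9]", "", candidate).upper()
    if not IBAN_SHAPE.fullmatch(compact):
        return False
    # Move the first four characters to the end, then map letters to
    # numbers (A=10 ... Z=35); int(ch, 36) gives exactly that mapping.
    rearranged = compact[4:] + compact[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


print(iban_checksum_ok("GB82-WEST-1234-5698-7654-32"))
print(iban_checksum_ok("GB82-WEST-1234-5698-7654-33"))
```

Because the separators are stripped before the check, the same test applies regardless of which punctuation character was used between the groups.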
