How Are Spaces Treated in EDMs?


Article ID: 160468


Products

Data Loss Prevention Endpoint Prevent, Data Loss Prevention Network Monitor, Data Loss Prevention Network Prevent for Email, Data Loss Prevention Enforce, Data Loss Prevention Network Discover, Data Loss Prevention Network Prevent for Web, Data Loss Prevention Network Protect, Data Loss Prevention Endpoint Discover

Issue/Introduction

For DLP version 12.0 and below: EDM values that contain spaces may not be detected. When a data source used for an EDM contains spaces, false negatives can result.

Resolution

Note: This is for version 12.0 and below. Version 12.5 introduces Multi-token EDMs.

When processing a document or message, text is broken into tokens. In most cases, one word becomes a single token.


When creating an EDM, if the data source has a value such as "United States", the value is indexed as one multi-word token, and the processor looks for a match on the whole string "United States". On the detection side, if a message contains "United States" in non-tabular content, the text is broken into two tokens, "United" and "States". Neither of these matches the indexed token "United States", so no match is reported and a false negative can result.
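The mismatch described above can be illustrated with a minimal Python sketch. It is not the product's tokenizer: the function names index_cell, tokenize_message, and matches are hypothetical, and the sketch assumes only that a data-source cell is indexed whole while free text is split on whitespace.

```python
def index_cell(value: str) -> str:
    # Assumption: each data-source cell is indexed whole, as a single token.
    return value.strip()


def tokenize_message(text: str) -> list[str]:
    # Assumption: free text is split on whitespace, with surrounding
    # punctuation stripped from each word.
    return [word.strip('.,;:!?"') for word in text.split()]


def matches(indexed_token: str, message: str) -> bool:
    # A match requires the indexed token to equal one message token exactly.
    return indexed_token in tokenize_message(message)


# "United States" is one indexed token but two message tokens: no match.
print(matches(index_cell("United States"), "Shipped to the United States."))  # False
# "US" is a single token on both sides: match.
print(matches(index_cell("US"), "Shipped to the US."))  # True
```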


To improve EDM accuracy when creating an EDM:

  • Avoid values that contain spaces. For example, use US instead of United States.
  • Split names into First Name and Last Name columns. Do not put full names in a single column.
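One way to apply these guidelines is to scan the data source for cells that contain spaces before indexing it. The following Python sketch assumes the data source is a CSV export with a header row. The function name cells_with_spaces, the exempt_columns parameter, and the sample column names are hypothetical and are not part of the product.

```python
import csv
import io


def cells_with_spaces(csv_text, exempt_columns=()):
    """Return (line_number, column, value) for each cell containing a space."""
    reader = csv.DictReader(io.StringIO(csv_text))
    findings = []
    # The header is line 1, so the first data row is line 2.
    for line_number, row in enumerate(reader, start=2):
        for column, value in row.items():
            if column in exempt_columns or not isinstance(value, str):
                continue
            if " " in value.strip():
                findings.append((line_number, column, value))
    return findings


# Hypothetical sample data source.
sample = (
    "full_name,country,ssn\n"
    "Jane Doe,United States,111 22 3333\n"
    "Ali,US,111-22-3333\n"
)

# Columns holding a recognized pattern (here, ssn) can be exempted.
for finding in cells_with_spaces(sample, exempt_columns=("ssn",)):
    print(finding)
```

Each reported cell is a candidate for shortening (United States to US) or for splitting into separate columns (a full name into First Name and Last Name).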


    The exception to the rule against spaces is a set of specific data patterns. Indexing and detection both recognize these patterns in the same way:
  • Social Security Number
    • using dashes: 111-22-3333
    • using spaces: 111 22 3333
    • no delimiters: 111223333
  • Credit Card Number - numerous patterns, each tailored to specific credit card issuers, such as Visa, Mastercard, American Express
    • using dashes: 4444-2222-1111-4444
    • using spaces between groupings, for example 4444 2222 1111 4444 or 44442222 11114444 or 444422221111 4444
    • no spaces: 4444222211114444
  • Phone Number - currently only patterned for US or Canadian numbers
    • using area codes with or without parentheses, for example (415)111-2222 or 415111-2222 or 415 111-2222
    • no area codes, for example 111-2222
    • without dashes: 11112222
    • using the US country code, for example 1(415)111-2222
    • spaces instead of dashes: 415 111 2222
    • dots as separators: 415.111.2222
    Note that the format used when the value is indexed does not need to be the format that appears during detection. For example, indexing (415)111-2222 would match against 1 415 111 2222.
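The format-independent matching described above can be pictured as reducing each value to a canonical form before comparison. The Python sketch below is an illustration only and makes no claim about how the product implements pattern recognition. The function names normalize_digits and normalize_us_phone are hypothetical, and the sketch assumes that delimiters and a leading US country code are simply discarded.

```python
import re


def normalize_digits(text: str) -> str:
    # Drop every non-digit character (dashes, spaces, dots, parentheses).
    return re.sub(r"\D", "", text)


def normalize_us_phone(text: str) -> str:
    digits = normalize_digits(text)
    # Assumption: an 11-digit number starting with 1 carries the US country code.
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits


indexed = normalize_us_phone("(415)111-2222")
detected = normalize_us_phone("1 415 111 2222")
print(indexed == detected)  # True

# The same idea applies to the Social Security Number formats listed above.
print(normalize_digits("111 22 3333") == normalize_digits("111-22-3333"))  # True
```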