Create efficient regular expressions to improve DLP performance

Article ID: 159642

Products

Data Loss Prevention Enforce

Issue/Introduction

Regular expressions, especially poorly written ones, can degrade the performance of Symantec Data Loss Prevention (DLP). This article describes how to write more efficient regular expressions.

Resolution

Regular expressions are much slower than Data Identifiers, so use a Data Identifier whenever possible. If a Data Identifier does not fit the needs of a particular policy, regular expressions are still available.

Sample Regular Expression Constructs

+ Following a regular expression means 1 or more
- Range; Example [a-z]
* Following a regular expression means 0 or more (any number of repetitions)
? Following a regular expression means 0 or 1
\ Escape; Example \. \* \+ \?
\d Any digit character (0-9)
\D Non-digit character
\w Word character (a-z, A-Z, 0-9, _)
\W Non-word character
\s Any White Space
\S Any Non-White Space
[ ] Character Class Brackets
[a-z] Lower Case Alphabet
[A-Z] Upper Case Alphabet
[%*.#$%@-] Symbols (Exact match within Square Brackets)
^ Within a Character Class, negates the elements within
(?: ) Groups regular expressions together
(?i) Case Insensitive
(?u) Makes a period (.) match even newline characters
| Pipe Character; Means OR
(?=(?:[^-\w])|$) Enhanced Look Ahead (DLP 14.6)
(?<=(^|(?:[^)+\d][^-\w+]))) Enhanced Look Behind (DLP 14.6)
(?<=(^|(?:[^)+\d][^-\w+])|\t)) Enhanced Look Behind (DLP 14.6)
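
The general-purpose constructs above can be tried outside DLP before they go into a policy. The following is a minimal sketch that uses Python's re module as a stand-in for the DLP matching engine (an assumption; engines differ in detail, so confirm the final pattern in DLP itself). The sample strings and the helper name matches_fully are hypothetical, and the DLP-specific constructs such as (?u) and the enhanced look ahead and look behind are deliberately left out.

```python
import re

# Each entry: (pattern, sample text, whether the whole sample should match).
# Python's re module is only a stand-in here; it is not the DLP engine.
CASES = [
    (r"\d+", "2024", True),            # + : one or more digits
    (r"\d+", "", False),               # + needs at least one digit
    (r"ab*c", "ac", True),             # * : zero or more "b"
    (r"colou?r", "color", True),       # ? : zero or one "u"
    (r"[a-z]+", "abc", True),          # - : range inside a character class
    (r"[^0-9]+", "abc", True),         # ^ : negates the character class
    (r"[^0-9]+", "a1c", False),        # a digit is excluded by the negated class
    (r"(?:ab)+", "ababab", True),      # (?: ) groups expressions together
    (r"(?i)secret", "SeCrEt", True),   # (?i) : case insensitive
    (r"cat|dog", "dog", True),         # | : OR
    (r"3\.14", "3x14", False),         # \. : escaped, matches a literal period only
]


def matches_fully(pattern, text):
    """Return True when the pattern matches the entire sample text."""
    return re.fullmatch(pattern, text) is not None


if __name__ == "__main__":
    for pattern, text, expected in CASES:
        status = "ok" if matches_fully(pattern, text) == expected else "UNEXPECTED"
        print(f"{pattern!r:16} {text!r:10} {status}")
```

Keeping a small table of positive and negative samples like this makes it easier to notice when a change to a pattern starts matching more than intended.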

Regular Expression Tips 

  • Use PCRE-compatible regex syntax.
  • Only search in the appropriate message part.  If you are expecting something in the header, then do not search the body.  The bigger the defined search area, the more work needs to be done to evaluate it.
  • See TECH218937: "What is the definition of an Envelope, Body or Attachment for each protocol?"
  • Avoid using an asterisk (*) where possible.  An unbounded match forces the regular expression processor to scan and backtrack over more content, which raises CPU usage and makes the File Reader do more work than necessary.
  • Limit the scope by replacing the asterisk with a bounded range.  For example, .{0,10} matches any run of 0 to 10 characters.

Examples

Match a string starting with file with up to 10 additional characters, ending with .txt

Summary: This pattern matches between 0 and 10 characters after file. It will match on filereader.txt, but not on filewaytoolongofaname.txt

Best Practice: file.{0,10}\.txt

Not Best Practice: file.*\.txt
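
The difference between a bounded range and an asterisk can be checked against the sample file names. This is a minimal sketch using Python's re module as a stand-in for the DLP engine (an assumption). The patterns are written with an explicit period before the quantifier and an escaped period before txt, so that the quantifier applies to "any character" and the extension is matched literally. The third sample string is hypothetical.

```python
import re

# Bounded: at most 10 arbitrary characters between "file" and ".txt".
BOUNDED = re.compile(r"file.{0,10}\.txt")
# Unbounded: any number of characters, so the match can span unrelated text.
UNBOUNDED = re.compile(r"file.*\.txt")

SAMPLES = [
    "filereader.txt",
    "filewaytoolongofaname.txt",
    "file notes, then much later an unrelated report.txt",
]


def found(pattern, text):
    """Return True when the compiled pattern occurs anywhere in the text."""
    return pattern.search(text) is not None


if __name__ == "__main__":
    for text in SAMPLES:
        print(f"{text!r}: bounded={found(BOUNDED, text)} unbounded={found(UNBOUNDED, text)}")
```

Only the first sample satisfies the bounded pattern, while the unbounded pattern accepts all three, including the sentence where file and .txt are unrelated.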

Notes: This expression is looking at the beginning or end of the body part.  So, for a header, it would be looking at the beginning of the message header. Be aware of this or it may not provide the results you were expecting.

Recommended patterns for starting and ending characters

Begin: (?<=(^|(?:[^)+\d][^-\w+])))

End: (?=(?:[^-\w])|$)

Match on exactly 2 digits only

(?<=(^|(?:[^)+\d][^-\w+])))\d\d(?=(?:[^-\w])|$)
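
To see which strings these begin and end patterns accept, the two-digit expression can be exercised outside DLP. This is a minimal sketch in Python, not the DLP engine. Python's re module only accepts fixed-width look behind, so the ^ alternative is moved outside the look behind here. That rewrite is an assumption made for illustration and is intended to behave the same as the Begin pattern above. The names BEGIN, END, and find_two_digit_tokens are hypothetical.

```python
import re

# Begin: start of input, or a two-character prefix whose first character is not
# ")", "+", or a digit and whose second character is not "-", "+", or a word character.
# The article's form is (?<=(^|(?:[^)+\d][^-\w+]))); Python needs ^ outside the look behind.
BEGIN = r"(?:^|(?<=[^)+\d][^-\w+]))"
# End: the next character is not "-" or a word character, or the input ends.
END = r"(?=(?:[^-\w])|$)"

TWO_DIGITS = re.compile(BEGIN + r"\d\d" + END)


def find_two_digit_tokens(text):
    """Return every standalone two-digit token found in the text."""
    return TWO_DIGITS.findall(text)


if __name__ == "__main__":
    for sample in ["code 12 here", "12", "123", "id: 42.", "ab12 cd", "call 555-12-34"]:
        print(f"{sample!r}: {find_two_digit_tokens(sample)}")
```

The begin pattern needs either the start of the input or two qualifying characters before the digits, so digits that are glued to a word, a longer number, or a hyphenated sequence are skipped.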

Limitations

  • Symbols must be in square brackets in order to be matched on.
  • Symbols .*| are not supported for data identifier patterns.
  • \w does not match _ when implemented in a Data Identifier pattern.
  • \s cannot be used to match whitespace; use a literal whitespace character instead.

Reference: Regex performance

Below are some links that discuss performance and provide some ideas on how to improve your Regular Expressions:

PCRE - Perl Compatible Regular Expressions

Runaway Regular Expressions: Catastrophic Backtracking

Regex Optimization Using Atomic Grouping

Some basic guidelines on how to optimize regular expressions:

http://publib.boulder.ibm.com/infocenter/wmbhelp/v6r0m0/index.jsp?topic=/com.ibm.etools.mft.doc/ad09910_.htm

http://www.javaworld.com/javaworld/jw-09-2007/jw-09-optimizingregex.html

http://blog.stevenlevithan.com/archives/greedy-lazy-performance

For more information, see the "Detecting content using regular expressions" chapter in the Data Loss Prevention Administration Guide.