How to create a regular expression (regex) for an email header

Article ID: 159595

Products

Data Loss Prevention Enforce

Issue/Introduction

To check the content of a specific email header, a few extra steps are needed beyond the general check for a keyword in the body or an attachment.

Resolution

An email header field, often called simply an email header, is defined by RFC 2822 (http://www.ietf.org/rfc/rfc2822.txt). It is composed of three parts: a field name, a colon, and a field body. The field name occurs at the beginning of a line. A header that continues onto the next line must have whitespace before the continuation of the field body.

By default, the Java regex engine has multiline mode turned off. This means the input is treated as a single string rather than as separate lines terminated by newline characters, so the beginning-of-line anchor (^) and the end-of-line anchor ($) match at the beginning and end of the entire string, not of individual lines. Therefore, the beginning-of-line anchor (^) alone cannot be used to tie a field name to the start of a header line.
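The default anchor behavior described above can be checked with a small standalone program. This is a minimal sketch that assumes plain java.util.regex outside the product; the class name, helper method, and sample message are hypothetical.

```java
import java.util.regex.Pattern;

class DefaultAnchorDemo {
    // Hypothetical sample message; header lines end with CRLF as in real mail.
    static final String MESSAGE =
        "From: alice@example.com\r\n"
        + "Subject: Re: quarterly report\r\n"
        + "\r\n"
        + "Body text\r\n";

    // Returns true if the regex is found anywhere in the input.
    // No flags are set, so multiline mode is off.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // "From:" sits at the very start of the string, so ^ matches there.
        System.out.println(found("^From:", MESSAGE));    // prints true
        // "Subject:" starts a later line, which ^ does not see by default.
        System.out.println(found("^Subject:", MESSAGE)); // prints false
    }
}
```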

There are two ways to handle this problem: 

  1. Specify the beginning of the line in a regex pattern, or
  2. Turn on multiline mode.

To specify the beginning of a line in a regex that matches the Subject header, the pattern would look something like this:

(?i)\nsubject: re

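The (?i) makes the pattern case insensitive. The \n matches a newline character, so "subject" matches only when it directly follows a line break, that is, at the beginning of a line. Because the pattern requires a preceding line break, it does not match a Subject header that is the very first line of the message.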
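As a rough illustration of the newline-prefixed pattern, the sketch below runs it against a few made-up messages. It assumes plain java.util.regex; the class name, method name, and sample strings are hypothetical. In Java source the backslash is doubled, so the pattern is written as "(?i)\\nsubject: re".

```java
import java.util.regex.Pattern;

class NewlinePrefixDemo {
    // The article's pattern: case insensitive, "subject: re" right after a newline.
    static final Pattern SUBJECT_RE = Pattern.compile("(?i)\\nsubject: re");

    // Returns true if the message contains a line that begins with "subject: re".
    static boolean matches(String message) {
        return SUBJECT_RE.matcher(message).find();
    }

    public static void main(String[] args) {
        // Subject header on the second line: matches, regardless of case.
        System.out.println(matches("From: a@example.com\r\nSubject: RE: hello\r\n")); // prints true
        // "subject: re" appears mid-line, not after a newline: no match.
        System.out.println(matches("From: a@example.com\r\nX-Note: subject: re\r\n")); // prints false
        // Subject header is the first line, so no newline precedes it: no match.
        System.out.println(matches("Subject: Re: hello\r\nFrom: a@example.com\r\n")); // prints false
    }
}
```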

To turn multiline mode on and use the beginning-of-line anchor (^), the pattern would look like the following:

(?m)(?i)^subject: re:

The pattern above matches any Subject header whose value starts with "re:", regardless of case.
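The multiline pattern can be exercised the same way. This is a minimal sketch assuming plain java.util.regex; the class name, method name, and sample messages are hypothetical.

```java
import java.util.regex.Pattern;

class MultilineAnchorDemo {
    // The article's pattern: (?m) lets ^ match at the start of every line,
    // (?i) ignores case.
    static final Pattern SUBJECT_RE = Pattern.compile("(?m)(?i)^subject: re:");

    // Returns true if any line of the message begins with "subject: re:".
    static boolean matches(String message) {
        return SUBJECT_RE.matcher(message).find();
    }

    public static void main(String[] args) {
        // Subject header on a later line.
        System.out.println(matches("From: a@example.com\r\nSubject: Re: hello\r\n")); // prints true
        // Subject header on the very first line also matches in multiline mode.
        System.out.println(matches("Subject: RE: hello\r\nFrom: a@example.com\r\n")); // prints true
        // "Report" begins with "Re" but has no colon after it: no match.
        System.out.println(matches("From: a@example.com\r\nSubject: Report\r\n")); // prints false
    }
}
```

Unlike the newline-prefixed pattern, this form also matches when the Subject header is the first line of the input. In both forms the match is tied only to the start of a line, so a line in the message body that begins with the same text would match as well.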