WHITEPAPER
Digital Forensic
Regular Expressions (RegEx)
Whitepaper | Digital Forensic Regular Expressions (RegEx)
www.cellebrite.com
[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}
This is neither a random pattern, nor the result of a cat walking across our keyboard. It is a regular expression, or RegEx
for short, that forensic examiners may use to search for email addresses. During a forensic examination, investigators use
keywords to find exact string (‘word’) matches and regular expressions to find strings that match a pattern.
For example, if we use marchmadness@thisdomain.com as a keyword search term, a search function returns a ‘hit’ every
time the marchmadness@thisdomain.com email address is located. But, if we want to find any email address containing any
letter, number, or symbol combination, followed by the @ symbol, followed by any top-level domain (.net, .org, .com, etc.), we
need to search for the email address pattern. We call this pattern a regular expression.
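The difference between a keyword search and a pattern search can be sketched in Python. This is only an illustration using Python's built-in `re` module, not the search engine of any particular forensic tool; the sample text and the second email address are made up for the example.

```python
import re

# Hypothetical sample text for illustration only.
text = "Contact marchmadness@thisdomain.com or bracket.buster@example.org today."

# Keyword search: only the exact string produces a hit.
keyword = "marchmadness@thisdomain.com"
keyword_hits = [m.group(0) for m in re.finditer(re.escape(keyword), text)]

# Pattern search: the email RegEx shown at the top of this paper.
email_pattern = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}")
pattern_hits = email_pattern.findall(text)

print(keyword_hits)   # one hit: the keyword itself
print(pattern_hits)   # both email addresses
```

The keyword finds one address; the pattern finds every string shaped like an email address.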
To construct and use regular expressions, the information will be broken down into the following topics:
• Introduction and Definitions
• Alphanumeric RegEx Building Blocks
• Metacharacters and Escaping
• Escaped Alphabetical Characters
• Finding a Simple Word Pattern
• Regular Expression ‘Synonyms’ and Shortcuts
• Dissecting and Creating Regular Expressions
• Customizing BlackLight Regular Expression Presets
Introduction and Definitions
Regular expressions are used to search for data that matches a pattern. In order to use a regular expression to locate data
during a forensic examination, the forensic tool must have a regular expression search engine. Keep in mind there may be
slight variations in regular expression engines incorporated in some forensic tools. While most implementations are similar,
when in doubt consult the documentation provided with the forensic tool and test regular expressions in the forensic tool
in use. There are also other ways to test regular expressions such as the online testers https://regex101.com or
https://www.regextester.com.
Some tools also refer to regular expressions as grep expressions (where grep stands for global regular expression print).
People often use the terms RegEx and grep interchangeably.
Some terms commonly associated with regular expressions are as follows:
• Literal Characters: A character the regular expression sees exactly as it is typed. The regular expression engine is looking
for that character.
• Character Class: A range of alphanumeric characters such as a-z or 0-9. The regular expression engine looks for any
character in the specified range.
• Group: A defined set of characters in the regular expression. The regular expression engine is looking for any character in
the defined set.
• Metacharacters or Special Characters: Characters with special meaning to the regular expression engine. The regular
expression engine interprets the metacharacter based on its special meaning.
• Escaping: A means to instruct the regular expression engine to ignore the special meaning of a metacharacter and instead
look for that character.
Combining literal characters with character classes and groups, used in conjunction with metacharacters, is what creates
a regular expression. Below a methodology is introduced to help formulate regular expressions to find patterns of data.
Alphanumeric RegEx Building Blocks
When building regular expressions, one method is to verbally describe the regular expression before building it, and then
seek out each RegEx piece that fulfills the description. When describing the regular expression, consider which items
need to be part of a character class and/or part of a group. In regular expressions, character classes are enclosed with
[brackets], and groups are enclosed with (parentheses). Note brackets and parentheses are metacharacters.
The process of verbally describing the regular expression and then using RegEx pieces to replace the verbal description takes
time and may be slow initially, but with practice, it gets easier and faster. It is important when building regular expressions
that the expression built is neither too granular nor too expansive. If it is too granular, data may be missed. Too expansive and
there may be a lot of false positive hits. The verbal description process along with testing the regular expression will help
ensure the correct RegEx is created.
For example, take the case where a suspect uses Dan7a as a username and email address prefix. During the course of the
investigation, it is found the subject has another username that fits this same alphanumeric pattern. With this information,
a search can be conducted for additional five-character strings that match this pattern. First create the verbal description
of the pattern:
A string pattern that includes any single uppercase letter from A to Z, followed by any single lowercase letter from a
to z, followed by any single lowercase letter from a to z, followed by any single digit from 0 to 9, followed by any single
lowercase vowel.
The following would be included in the regular expression:
I want to find string patterns that include… | Regular Expression
Any single uppercase letter from A to Z | [A-Z
followed by | ] (close the character class)
Any single lowercase letter from a to z | [a-z
followed by | ]
Any single lowercase letter from a to z | [a-z
followed by | ]
Any single digit from 0 to 9 | [0-9
followed by | ]
Any single lowercase vowel | [aeiou
(close the last character class) | ]
Here is the complete regular expression: [A-Z][a-z][a-z][0-9][aeiou]
If an examiner uses this RegEx as a keyword, the search function returns Dan7a and Kim4i but not SAm9u because the
second letter in SAm9u is uppercase.
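As a quick check, the expression can be tried against the three usernames mentioned above. This is a minimal sketch using Python's `re` module; a forensic tool's engine may differ in small ways, so treat it as a test harness rather than a definitive result.

```python
import re

# The five-character pattern built in the table above.
pattern = re.compile(r"[A-Z][a-z][a-z][0-9][aeiou]")

candidates = ["Dan7a", "Kim4i", "SAm9u"]

# fullmatch() requires the whole candidate to fit the pattern.
hits = [name for name in candidates if pattern.fullmatch(name)]

print(hits)
```

SAm9u is rejected at the second position, where the pattern requires a lowercase letter.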
In the next example, the character classes are going to be organized a little bit differently. Here is the new verbal description:
A string pattern that includes any single digit from 0-9 or any single lowercase letter from a-z, followed by any single digit
from 0-9 or any single lowercase letter from a-z or any single uppercase letter from A-Z.
Use these components to build the regular expression:
I want to find string patterns that include… | Regular Expression
Any single digit from 0 to 9 | [0-9
Or | (no bracket – the definition of the character class is not finished)
Any single lowercase letter from a to z | a-z
followed by | ] (close bracket – the character class is defined)
Any single digit from 0 to 9 | [0-9
Or | (no bracket – the definition of the character class is not finished)
Any single lowercase letter from a to z | a-z
Or | (no bracket – the definition of the character class is not finished)
Any single uppercase letter from A to Z | A-Z
(close the last character class) | ] (close bracket – the character class is defined)
Here is the complete regular expression: [0-9a-z][0-9a-zA-Z]
Notice a close bracket is not added until all character class members are defined. Remember that character classes can
include a range of alphanumeric characters, single alphanumeric characters, and/or metacharacters.
Any number of character class regular expressions can be written. Writing a verbal description first helps organize what
should be included in the RegEx and how the characters should be grouped within the brackets.
Metacharacters and Escaping
In the regular expression created above, the brackets used (like the parentheses that enclose groups) are part of a group of characters known as
metacharacters. When a regular expression engine encounters a metacharacter, it knows that the metacharacter has a
special meaning; the engine will not look for a “[” or “(”, instead it is automatically interpreted as the beginning of a character
class or group.
Of course, parentheses and brackets are not the only metacharacters available. There are other metacharacters with
special meaning prompting the search engine to interpret these special characters. This creates a problem if
one of the metacharacters is needed as a literal character in the regular expression. Regular expression engines are typically built
into the search function of a forensic tool. Typically, there is some means in the interface to indicate to the search function
that the string entered should be interpreted as a regular expression as opposed to just literal characters. This is when the
regular expression engine takes over.
Once it is indicated to the search function that the string is a regular expression, or grep expression, the regex engine
interprets characters literally or non-literally depending on character type. In general, and by default, regex engines interpret
the metacharacters (special characters and/or punctuation marks) non-literally and regular alphanumeric characters
literally. To change a character’s default interpretation, an examiner must ‘escape’ it. The most common way to escape a
character is to place a backslash before it; however, some metacharacters are automatically escaped if they are part of a
character class (i.e., [included in a bracketed set]).
For example, by default, keyword search functions treat the ‘$’ and '+' metacharacters non-literally; the ‘$’ character is a
wildcard/anchor character and the ‘+’ character is a repetition character. But, if we escape these characters with a backslash,
the regex engine treats them literally and returns a hit each time a dollar sign or plus sign is located.
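The effect of escaping can be seen in a short Python sketch. The sample string is hypothetical, and Python's `re` module stands in for whatever engine a forensic tool uses.

```python
import re

text = "Total: $5+tax"  # hypothetical sample string

# Unescaped, '$' is an anchor: it matches the empty position at the end of the string.
anchor_hits = re.findall(r"$", text)

# Escaped, '$' and '+' are treated as literal characters.
dollar_hits = re.findall(r"\$", text)
plus_hits = re.findall(r"\+", text)

# Inside a character class, both characters are already literal.
class_hits = re.findall(r"[$+]", text)

print(anchor_hits, dollar_hits, plus_hits, class_hits)
```

The unescaped anchor returns an empty hit, because it matches a position rather than a character.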
Below are some of the most commonly used metacharacters, their default and escaped meanings:
Metacharacter | Default Meaning | Escaped Character | Escaped Meaning
\ (backslash) | Escape | \\ | Find a backslash
$ (wildcard/anchor) | Find the preceding pattern matches if they occur at (are anchored to) the end of the string or line | \$ | Find a dollar sign
^ (wildcard/anchor) (see additional notes below) | Find the subsequent pattern matches if they occur at (are anchored to) the beginning of the string or line | \^ | Find a caret
. (wildcard) (see additional notes below) | Find any character or space except a newline character | \. | Find a period
* (repetition) | Find zero or more consecutive preceding character/pattern matches | \* | Find an asterisk
+ (repetition) | Find one or more consecutive preceding character/pattern matches | \+ | Find a plus sign
? (repetition) (see additional notes below) | Find zero or one preceding character/pattern match | \? | Find a question mark
?! (negative look ahead) | Find anything that is not the subsequent characters/patterns | \?\! | Find a question mark and an exclamation point
[] | Find any single member in the character class | \[ or \] | Find an open or close bracket
() | Group RegEx items (such as multiple character classes, shortcuts, and/or characters) together | \( or \) | Find an open or close parenthesis
Additional Notes:
1. Carets: The caret is an ‘interesting character’ because it can modify the way a search function treats a character
class or becomes a character class member, depending on where it is placed. Here are three different caret
RegEx usage examples:
I want to find… | Caret Position | Regular Expression
Any one character that is not a digit from 1-6 or a lowercase letter from a-e | In a character class immediately after the open bracket | [^1-6a-e]
Any one digit from 1-6, or any one lowercase letter from a-e, or the ^ character | Anywhere in a character class except just after the open bracket | [1-6a-e^] or [1-6^a-e]
Every string or line of text that begins with ‘From:’ and a space (think email) | At the beginning of an unbracketed string | ^From:\s
2. Non-escaped periods: When a search function sees a non-escaped period, it matches any character that is not
a line break (i.e., carriage return, line feed, etc.). So, if we have a regular expression with five consecutive periods
“.....” the search function returns strings with any five characters (including a space), as long as each character
is not a newline character. So, the search function returns Dan7a, Kim4i, and SAm9u as hits, and also “98765”
and “abdlz”. Shorter strings such as “4” or “321” are returned only as part of a longer five-character run. As you can see,
regular expressions that contain the dot metacharacter can be very
useful, but if a regular expression definition is too broad, the search function returns many hits that do not have
relevance and/or meaning.
3. Greedy characters: The ‘*’ and ‘+’ repetition characters are greedy by default! Left to their own devices, they
match as many characters or patterns as their definition allows. To ‘limit’ these characters,
follow them with a ‘?’. This quells their greed, and politely asks the search function to match the fewest
characters or patterns the definition allows. This concept is demonstrated in the Finding a Simple Word
Pattern example.
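The caret and period notes above can be checked with a short Python sketch. The sample strings are hypothetical. The `re.MULTILINE` flag is a Python-specific way to make `^` match at the start of every line; other engines may expose this behavior differently.

```python
import re

sample = "a7^Z"  # hypothetical four-character test string

# Caret right after the open bracket: match anything NOT in 1-6 or a-e.
negated = re.findall(r"[^1-6a-e]", sample)

# Caret elsewhere in the class: it is just another member of the class.
literal_caret = re.findall(r"[1-6a-e^]", sample)

# Caret outside brackets: anchor to the beginning of each line.
headers = "From: alice\nTo: bob\nFrom: carol"
anchored = re.findall(r"^From:\s", headers, flags=re.MULTILINE)

# Five unescaped periods match any five non-newline characters, no more and no fewer.
five_any = [s for s in ["SAm9u", "98765", "321"] if re.fullmatch(r".....", s)]

print(negated, literal_caret, anchored, five_any)
```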
Escaped Alphabetical Characters
By default, the RegEx engine treats alphabetical characters literally; a hit is returned when the letter, word, or phrase is
located. But, if an escape (backslash) is used with an alphabetical character, the engine treats it non-literally: as a line break or
shortcut. There are specific alphabetical characters that have an escaped meaning.
Below are a few alphabetical characters and their default and escaped meanings:
Escaped Letter | Search Result
\t | Find a tab
\r | Find a carriage return
\n | Find a new line
\s | Find a space, tab, carriage return, or line break. The same as [ \t\r\n].
\v | Find a vertical tab (see additional notes below)
\f | Find a form feed
\d | Find any digit (including zero). The same as [0-9].
\w | Find any capital letter, or lowercase letter, or digit (including zero), or the underscore character. The same as [A-Za-z0-9_]
\b (word boundary) | Find a defined whole word (see additional notes below)
Additional Notes:
1. Vertical Tabs - If you are like us and just have to know more, you’ll find additional vertical tabs information here.
2. Word Boundaries - An escaped ‘b’ is also an ‘interesting character.’ If we want to write a regular expression that
finds both occurrences of the word ‘last’ in this sentence:
“I committed a crime last week and it will not be my last!”
We might be tempted to create a regular expression that looks for a white space, then the word ‘last,’ then another white
space: \slast\s
But if we use this RegEx, the search function skips the second ‘last’ occurrence because it is followed by an exclamation
point and not a space.
To find both occurrences, we need to use a word boundary like this: \blast\b
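The two attempts can be compared directly. This is a minimal Python sketch using the sentence above; `\b` is supported by Python's `re` module, though it is worth confirming in the forensic tool in use.

```python
import re

sentence = "I committed a crime last week and it will not be my last!"

# Whitespace on both sides: the final 'last' is followed by '!', so it is missed.
with_spaces = re.findall(r"\slast\s", sentence)

# Word boundaries: both occurrences are found, without the surrounding spaces.
with_boundaries = re.findall(r"\blast\b", sentence)

print(with_spaces)
print(with_boundaries)
```

The word-boundary hits also exclude the surrounding whitespace, which keeps the hit text clean.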
Finding a Simple Word Pattern
Using the information above, a RegEx can be built to find any bracketed word. The RegEx should return a hit for each
match separately. Using the following test sentence, hits should be returned for [kind of] and [and greedy]:
I think some metacharacters are [kind of] repetitive [and greedy].
The initial RegEx description:
I want to find string patterns that have an open bracket, then any character except a newline, then one or more of the
preceding characters, then a closed bracket.
Because brackets are metacharacters, the open and close brackets must be escaped for the RegEx engine to find
the literal characters. Without escaping, the engine treats them as a character class enclosure (non-literally) instead of
brackets (literally).
I want to find string patterns that include… | Regular Expression
An open bracket | \[
then any character except a newline | .
then one or more of ‘any character except a newline’ | +
then a close bracket | \]
Here is the RegEx: \[.+\]
Using a RegEx tester, the following hit is returned: [kind of] repetitive [and greedy]
As you can see, instead of getting two separate hits, one for [kind of] and one for [and greedy], the RegEx created returned
one hit that encompassed all of the data in brackets as well as the text in between the two groups of bracketed strings. The
RegEx created did not return the results wanted. When creating regular expressions, especially when repetition characters
are used, often the first try yields unexpected results, which is why it’s important to test the expression. In this case, the use
of the ‘+’ metacharacter, which is greedy by default, resulted in the return of the entire string between the first open bracket
and the last close bracket.
The expression needs to be altered to limit the ‘+’ metacharacter. The ‘?’ metacharacter can be used to force the RegEx engine
to match the previous character (in this case the period) the fewest number of times that still lets the rest of the
expression match. The ‘+’ metacharacter definition allows one or more character repetitions, so the ‘?’ makes the ‘+’ stop at
the first close bracket it reaches instead of the last.
Modifying the RegEx to this: \[.+?\]
And the RegEx engine returns two separate hits: [kind of] and [and greedy].
In this example, the brackets are escaped, so they are treated literally and the RegEx engine does not interpret them
as a character class enclosure. Inside non-escaped brackets, the period, plus sign, and question mark would be treated
as members of a character class and interpreted literally. If the brackets were not escaped, the RegEx engine
would therefore return a hit each time a period, plus sign, or question mark was located.
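The greedy and limited versions of the expression can be compared in Python. This is a sketch with the test sentence from this section; the last line uses a made-up string to show what happens when the brackets are left unescaped.

```python
import re

sentence = "I think some metacharacters are [kind of] repetitive [and greedy]."

# Greedy '+': runs to the LAST close bracket, producing one long hit.
greedy = re.findall(r"\[.+\]", sentence)

# '+?' stops at the FIRST close bracket, producing two separate hits.
limited = re.findall(r"\[.+?\]", sentence)

# Unescaped brackets form a character class containing '.', '+' and '?'.
unescaped = re.findall(r"[.+?]", "a+b? c.")  # hypothetical sample string

print(greedy)
print(limited)
print(unescaped)
```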
Escaped Alphabetical Character Example
Using escaped alphabetical characters can reduce the size of a regular expression. Regular expressions can grow to be
quite long when complex logic is in use. Using the escaped alphabetical characters can minimize the size of the regular
expression. Walking through the example below, which overall is rather simple, illustrates how powerful using escaped
alphabetical letters can be.
In the Alphanumeric RegEx Building Blocks section above, the following expression was created: [0-9a-z][0-9a-zA-Z]
If the underscore character is added to the second character class, the RegEx engine will also look for email
address prefixes with underscores: [0-9a-z][0-9a-zA-Z_]
Using escaped alphabetical characters, the first character class can be simplified by substituting \d for 0-9, and the second
character class can be simplified by substituting \w for [0-9a-zA-Z_]: [\da-z]\w
The overall expression is simplified. Both of these shortcuts are defined in the escaped alphabetical character table
above. The RegEx built previously was reduced by identifying escaped alphabetical characters
that match the character classes defined.
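Whether the long and short forms really are equivalent depends on the engine. In the Python sketch below, the `re.ASCII` flag makes `\d` and `\w` mean exactly [0-9] and [A-Za-z0-9_]. Without that flag, Python also accepts non-ASCII letters and digits. The sample strings are made up for the comparison.

```python
import re

long_form = re.compile(r"[0-9a-z][0-9a-zA-Z_]")
short_form = re.compile(r"[\da-z]\w", re.ASCII)   # ASCII-only \d and \w
unicode_form = re.compile(r"[\da-z]\w")           # Python default: Unicode-aware

samples = ["d_", "7Z", "Ab", "a-", "a\u00e9"]     # last sample is 'a' + e-acute

long_hits = [s for s in samples if long_form.fullmatch(s)]
short_hits = [s for s in samples if short_form.fullmatch(s)]
unicode_hits = [s for s in samples if unicode_form.fullmatch(s)]

print(long_hits == short_hits)   # the two forms agree on these samples
print(unicode_hits)              # the Unicode-aware form also accepts the accented letter
```

This is one reason to test a simplified expression in the tool that will actually run it.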
Regular Expression ‘Synonyms’ and Shortcuts
Some RegEx engines also recognize ‘synonyms’ and shortcuts. Both can be used to simplify regular expressions. Regular
expression synonyms are like word synonyms; they are different patterns that mean the same thing to a search function. The
use of synonyms can make the RegEx a bit more human readable. Shortcuts provide an easier way to write long, repetitive
regular expressions.
Below are a few RegEx synonyms:
Regular Expression | Synonym
[0-9] | [:digit:]
[0-9a-zA-Z] | [:alnum:]
\s | [:space:]
Here are a few RegEx shortcuts:
I want to… | Shortcut | This… | …is the same as this
Match the preceding pattern n times | {n} | [0-9][0-9][0-9][0-9] | [0-9]{4}
Match the preceding pattern a minimum of n times and a maximum of m times | {n,m} | [0-9][0-9] or [0-9][0-9][0-9] or [0-9][0-9][0-9][0-9] | [0-9]{2,4}
Match the preceding pattern n or more times | {n,} | [0-9][0-9][0-9][0-9] or [0-9][0-9][0-9][0-9][0-9] or [0-9][0-9][0-9][0-9][0-9][0-9] (and so on) | [0-9]{4,}
Find a two-word phrase. The first word in the phrase is ‘brown’ and the second word is either ‘cat’ or ‘dog’ | | (pipe, also called ‘or’ operator) | brown\scat or brown\sdog | brown\s(cat|dog)
An Example of Using Shortcuts
During an investigation, information is received that the subject seems to purposefully create passwords or email addresses
with at least six lowercase alphabetical characters followed by two numbers. The subject used the email address
donnie01@gmail.com at least once. The investigator would like to locate additional instances of this email address and additional email
addresses and passwords that match this pattern. To create a RegEx, begin by writing a verbal description:
I want to find string patterns that include six or more single lowercase
letters from a-z, followed by any single digit from 0-9, followed by any single digit from 0-9.
Here are the items to include in the regular expression:
I want to find string patterns that include… | Regular Expression
Six or more single lowercase letters from a-z | [a-z][a-z][a-z][a-z][a-z][a-z] (or more)
followed by | (each of these character classes is already closed)
Any single digit from 0 to 9 | [0-9
followed by | ]
Any single digit from 0 to 9 | [0-9
(close the last character class) | ]
Below is the starting regular expression (‘or more’ is not part of the final RegEx, but keep it in place for now as a reminder):
[a-z][a-z][a-z][a-z][a-z][a-z] (or more) [0-9][0-9]
Next, the expression can be simplified by replacing the six or more lowercase alphabetical character classes using a shortcut.
The simplification transforms this: [a-z][a-z][a-z][a-z][a-z][a-z] (or more) to this: [a-z]{6,}
Next, simplify the two numeric character classes from this: [0-9][0-9] to this: [0-9]{2}, then to this: \d{2}
The final simplified regular expression created is this:
[a-z]{6,}\d{2}
Always keep in mind that as long as the RegEx being used finds the data, then it is not wrong. Simplification allows the same
search pattern to be defined using fewer characters, but it is not required.
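To confirm that the longhand and simplified forms find the same data, both can be run against the same text. This is a sketch in Python; the sample text and the password in it are hypothetical, and "or more" is written in the longhand form as a trailing `+` on the sixth character class.

```python
import re

# Hypothetical text containing the known address plus a password-like string.
text = "login donnie01@gmail.com pw: sunshine42 and abc12"

# Longhand: six lowercase letters (the sixth repeated 'or more'), then two digits.
longhand = re.compile(r"[a-z][a-z][a-z][a-z][a-z][a-z]+[0-9][0-9]")

# Simplified form from this section.
simplified = re.compile(r"[a-z]{6,}\d{2}")

longhand_hits = longhand.findall(text)
simplified_hits = simplified.findall(text)

print(longhand_hits)
print(simplified_hits)
```

The string abc12 is not returned because it has fewer than six letters before the digits.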
Creating a RegEx to Find Social Security Numbers
In this example, a RegEx is created to locate U.S. Social Security numbers. Here is the RegEx description:
Find string patterns that include any single digit from 0 to 9, followed by any single digit from 0 to 9, followed by any single
digit from 0 to 9, then a dash, followed by any single digit from 0 to 9... (etc.)
There are several ways to write this RegEx. The expression can be written in the longhand first, and then simplified using
shortcuts.
Longhand: [0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]
Simplified: [0-9]{3}-[0-9]{2}-[0-9]{4}
Further simplified: \d{3}-\d{2}-\d{4}
Any of these regular expressions will work to find social security numbers.
This RegEx pattern can be further refined so that a search function looks for specific string matches. For example, to find
Social Security numbers that start with ‘824’ use this RegEx:
824-\d{2}-\d{4}
To find Social Security numbers that DO NOT start with ‘824’:
(?!824)\d{3}-\d{2}-\d{4}
The RegEx engine excluded 824 because of the (?!824) group. The ?! is a negative look ahead. The
characters were not in brackets and were not escaped, so the search function treated them non-literally.
Notice the \d{3} was still included after the ‘negative look ahead’. The look ahead only checks the upcoming characters; it
does not consume them. The \d{3} is needed to tell the search function to find patterns that
match the entire regular expression, then exclude the hits that begin with ‘824’. If the \d{3} is left off, the search function
ignores the first three digits altogether. The remainder of the Social Security pattern beginning with 824 is then returned as a
search hit even though we included the (?!824) negative look
ahead, because without the \d{3} it only looks for the second and third part of the Social Security number pattern.
Remember, if a regular expression is too broadly defined, the search function may return hits that are irrelevant and/or
meaningless, sometimes referred to as false positives. To exclude meaningless Social Security number hits comprised of
random numbers and dashes, a word boundary can be added:
\b\d{3}-\d{2}-\d{4}\b
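The Social Security number variations can be compared side by side. This is a sketch in Python with made-up numbers. The last entry in the sample text is a longer part number that merely contains a matching run of digits and dashes.

```python
import re

# Hypothetical sample text; none of these are real Social Security numbers.
text = "SSNs: 824-12-3456, 123-45-6789; part no. 55123-45-67890"

any_ssn = re.findall(r"\d{3}-\d{2}-\d{4}", text)

# Negative look ahead: skip candidates whose first three digits are 824.
not_824 = re.findall(r"(?!824)\d{3}-\d{2}-\d{4}", text)

# Look ahead without \d{3}: the tail of the 824 number is still returned.
missing_prefix = re.findall(r"(?!824)-\d{2}-\d{4}", text)

# Word boundaries: ignore runs embedded in longer digit strings.
bounded = re.findall(r"\b\d{3}-\d{2}-\d{4}\b", text)

print(any_ssn)
print(not_824)
print(missing_prefix)
print(bounded)
```

Only the word-boundary version drops the hit embedded in the part number.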
Finding Phone Numbers
The regular expression used to find a basic U.S. phone number pattern with dash separators (XXX-XXX-XXXX) is nearly
identical to the one used to locate Social Security numbers; making one small adjustment, the search function looks for
three numbers after the first dash instead of two. Any of the following regular expressions would work:
Longhand: [0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]
Simplified: [0-9]{3}-[0-9]{3}-[0-9]{4}
Further simplified: \d{3}-\d{3}-\d{4}
The above regular expressions make the assumption that phone numbers are all stored with dashes. This may not be the
case. Comprehensive phone number regular expressions are more complex because not all phone numbers are stored with
dashes. The regular expression needs to be altered so the search function will look for phone number patterns with dot or
whitespace (space, tab, or line break) separators too:
I want to find string patterns that include… | Regular Expression
three digits | \d{3}
then a dash, or a dot, or a space | [-.\s
followed by | ]
three digits | \d{3}
then a dash, or a dot, or a space | [-.\s
followed by | ]
four digits | \d{4}
Here is the regular expression: \d{3}[-.\s]\d{3}[-.\s]\d{4}
Remember, when metacharacters are part of a character class, included in a bracketed set, they are read literally. In the
RegEx created above, the dot (“.”) does not need to be escaped since it is included in the character class.
Another common method for storing U.S. phone numbers is to have parentheses around the area code. Ideally, a RegEx
created to locate U.S. phone numbers would optionally look for those stored with parentheses around the area code portion
of the number. To make this regular expression, start by building a piped shortcut as seen in the cat|dog example in the
shortcut table:
I want to find string patterns that include… | Regular Expression
An open parenthesis (this must be escaped) | \(
then three digits | \d{3}
then a closed parenthesis (escape again) | \)
or | | (pipe)
three digits | \d{3}
This must then be enclosed in non-escaped parentheses to make it a group: (\(\d{3}\)|\d{3})
After creating this bit of logic for the area code portion of the U.S. phone number, the rest of the previously created regular
expression can be added to complete the logic. In the resulting expression, the RegEx engine looks for the area code (with or
without parentheses), a dash, period, or whitespace character, the three-digit prefix, another dash, period, or whitespace
character, and the last four digits in the phone number:
(\(\d{3}\)|\d{3})[-.\s]\d{3}[-.\s]\d{4}
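The expression can be tried against several storage formats. This is a sketch in Python with made-up phone numbers. It uses `fullmatch()` because `findall()` would return only the parenthesized group, not the whole hit.

```python
import re

phone = re.compile(r"(\(\d{3}\)|\d{3})[-.\s]\d{3}[-.\s]\d{4}")

# Hypothetical numbers in different storage formats.
samples = [
    "212-555-0147",
    "212.555.0147",
    "212 555 0147",
    "(212) 555-0147",
    "(212)555-0147",   # no separator after the area code
    "2125550147",      # no separators at all
]

matched = [s for s in samples if phone.fullmatch(s)]

print(matched)
```

The last two samples are missed because this expression requires a separator after the area code and after the prefix.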
Dissecting and Creating Regular Expressions
While creating new regular expressions is a useful skill, there are already a lot of regular expressions out there created by
others. Before using a regular expression built by someone else, or included in forensic software, it is wise to understand
what the expression is looking for.
The first step when dissecting an already created regular expression is to identify group or character class enclosures,
parentheses and brackets. Next, unravel each group and/or character class until the entire expression is understood. Here
is a sample regular expression:
(((\(\d{3}\)|\d{3})[-.\s])|(\(\d{3}\)|\d{3}))?\d{3}[-.\s]?\d{4}([-.\s]?([Ee]xt|[Xx])[.]?[-.\s]?\d{2,5})?
This regular expression locates strings containing any phone number where the area code is optional and may or may not be
enclosed in parentheses, digits are optionally separated by a dash, dot, or whitespace character, and optionally locates an
extension. If an extension is present, it must be from
two to five digits and preceded by ext, Ext, x, or X, each optionally followed by a period.
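One way to dissect an expression like this is to spread it over several lines and comment each piece. Python's `re.VERBOSE` flag allows this. It is a Python feature and may not be available in a forensic tool's search box. The sketch below rewrites the sample expression piece by piece and checks it against the original on a few made-up numbers.

```python
import re

original = re.compile(
    r"(((\(\d{3}\)|\d{3})[-.\s])|(\(\d{3}\)|\d{3}))?\d{3}[-.\s]?\d{4}"
    r"([-.\s]?([Ee]xt|[Xx])[.]?[-.\s]?\d{2,5})?"
)

annotated = re.compile(r"""
    (                              # optional area code block
      ((\(\d{3}\)|\d{3})[-.\s])    #   area code, with or without parentheses, then a separator
      |(\(\d{3}\)|\d{3})           #   or an area code with no separator
    )?
    \d{3}                          # three-digit prefix
    [-.\s]?                        # optional separator
    \d{4}                          # last four digits
    (                              # optional extension block
      [-.\s]?                      #   optional separator
      ([Ee]xt|[Xx])                #   ext, Ext, x, or X
      [.]?                         #   optional period
      [-.\s]?                      #   optional separator
      \d{2,5}                      #   two to five extension digits
    )?
""", re.VERBOSE)

# Hypothetical test numbers.
samples = ["(212) 555-0147 ext. 123", "212-555-0147", "555-0147",
           "2125550147x12", "555-01"]

matched = [s for s in samples if annotated.fullmatch(s)]
same_result = all(bool(original.fullmatch(s)) == bool(annotated.fullmatch(s))
                  for s in samples)

print(matched, same_result)
```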
Finding Email Addresses in Allocated Space
Examiners may use the next few regular expressions to locate email addresses in allocated space. This section is provided
to show the evolution from the description to the regular expression. As you read through each example, take a moment to
practice writing your own RegEx descriptions. Begin with:
‘I want to find string patterns that include...’
This regular expression locates email address patterns with the ‘zebra’ domain name and the ‘.com’ top-level domain:
\w+@zebra\.com
In this example, the escaped ‘w’ is a shortcut for [A-Za-z0-9_]. Using this shortcut limits the email address hits; there
are more complex regular expressions for locating emails. The period is escaped to force the RegEx engine to look for
email addresses from the domain zebra.com; escaping the period looks for an actual period in the string. Remember, by
default a period is a wildcard metacharacter. Since the period is not included in a bracketed group, it must be escaped to be
interpreted literally.
Another way to provide a variation is to search for email addresses from two domain names. For example, a user may have a
zebra email address and a tiger email address. The following regular expression will locate email address patterns with the
zebra.com or tiger.com domain names and .com top-level domain: \w+@(zebra|tiger)\.com
Building on this variation, a regular expression can be constructed to find email addresses with the zebra.com, tiger.com, or
cheetah.net domain names and top-level domains:
\w+@(((zebra|tiger)\.com)|cheetah\.net)
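The domain variations can be tested in Python against a few made-up addresses. This sketch uses the zebra, tiger, and cheetah names from the text. It also includes an unescaped-period version to show why escaping matters.

```python
import re

# Hypothetical addresses; 'di@zebraXcom' is deliberately not a real address.
text = "amy@zebra.com, bo@tiger.com, cy@cheetah.net, di@zebraXcom, ed@cheetah.com"

def hits(pattern):
    """Return the full text of every match of pattern in the sample text."""
    return [m.group(0) for m in re.finditer(pattern, text)]

one_domain = hits(r"\w+@zebra\.com")
two_domains = hits(r"\w+@(zebra|tiger)\.com")
three_domains = hits(r"\w+@(((zebra|tiger)\.com)|cheetah\.net)")

# Unescaped period: any character is accepted between 'zebra' and 'com'.
unescaped_dot = hits(r"\w+@zebra.com")

print(one_domain, two_domains, three_domains, unescaped_dot, sep="\n")
```

`finditer()` is used here because Python's `findall()` returns only the parenthesized groups when a pattern contains them.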
Back to the regular expression at the top of this paper:
[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}
Hopefully this RegEx makes a lot more sense now. Notice that unlike the ‘\w’ shortcut, this regular expression tells the
search function to match email prefixes containing any alphanumeric character, a period, underscore, percentage sign, plus
sign, or minus sign. So it is more comprehensive than any of the regular expressions built in this section.
Finding Email Addresses in Allocated and Unallocated Space
The regular expressions used to search for email address patterns in unallocated space are similar to the ones used to
search in allocated space. But, we need to make one simple adjustment because email addresses in unallocated space may
have ‘%20’ instead of the ‘@’ symbol. To do so, simply change the ‘@’ symbol to this:
(%20|@)
For example, donnie01@gmail.com (allocated space) may be donnie01%20gmail.com in unallocated space. To search for
Gmail account email addresses with the ‘@’ symbol or ‘%20’, use this regular expression:
[A-Za-z0-9._%+-]+(%20|@)gmail\.com
To find any email address in allocated and unallocated space, use this regular expression:
[A-Za-z0-9._%+-]+(%20|@)[A-Za-z0-9.-]+\.[A-Za-z]{2,4}
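Both expressions can be exercised in Python with the two forms of the address described above. The sample sentence is made up for the sketch.

```python
import re

# Hypothetical text containing both forms of the same address.
text = "donnie01@gmail.com was later carved as donnie01%20gmail.com"

gmail_only = re.compile(r"[A-Za-z0-9._%+-]+(%20|@)gmail\.com")
any_email = re.compile(r"[A-Za-z0-9._%+-]+(%20|@)[A-Za-z0-9.-]+\.[A-Za-z]{2,4}")

# group(0) is the whole hit; group(1) is only the separator that was found.
gmail_hits = [m.group(0) for m in gmail_only.finditer(text)]
any_hits = [m.group(0) for m in any_email.finditer(text)]
separators = [m.group(1) for m in any_email.finditer(text)]

print(gmail_hits)
print(any_hits)
print(separators)
```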
Finding Web Addresses
Other commonly used regular expressions search for Web addresses by domain name and top-level domain (e.g., www.zebra.com).
Here is an initial RegEx description:
I want to find string patterns that begin with three consecutive instances of the letter ‘w’, then a dot ( . ), then one or more
alphanumeric characters or underscore ( _ ), then a dot ( . ) then the top-level ‘com’ domain.
Here is the preliminary regular expression: www\.\w+\.com
Because Web addresses do not always include ‘www.’, consider grouping www\. together in parentheses and add the ‘?’
repetition metacharacter just after the group to tell the search function to return a hit if ‘www.’ occurs zero or one times.
(www\.)?\w+\.com
To find http web addresses, add http:\/\/ to the beginning of the RegEx: http:\/\/(www\.)?\w+\.com
To find http, https, ftp, ftps, or afp (Apple Filing Protocol) web addresses with .com, .org, or .net top-level domains, start with
this RegEx: (http|https|ftp|ftps|afp):\/\/(www\.)?\w+\.(com|org|net)
Then, because http and https share the same first four characters, and ftp and ftps share the same first three characters,
simplify this RegEx by adding (s?) just after http and ftp. This tells the RegEx engine to return a pattern match if the ‘s’ exists
zero or one time:
(http(s?)|ftp(s?)|afp):\/\/(www\.)?\w+\.(com|org|net)
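A simplification like this is worth verifying, since the shorter expression should return exactly the same hits as the longer one. The sketch below compares the two forms in Python's re module; the sample addresses and the helper full_hit are assumptions chosen for illustration.

```python
import re

# The spelled-out form and the simplified form from the text.
LONG_FORM = re.compile(r"(http|https|ftp|ftps|afp):\/\/(www\.)?\w+\.(com|org|net)")
SHORT_FORM = re.compile(r"(http(s?)|ftp(s?)|afp):\/\/(www\.)?\w+\.(com|org|net)")

# Hypothetical test strings; the last one uses a scheme outside the list.
SAMPLES = [
    "http://www.zebra.com",
    "https://zebra.org",
    "ftp://gorilla.net",
    "ftps://gorilla.net",
    "afp://www.zebra.com",
    "gopher://zebra.com",
]


def full_hit(pattern, text):
    """Return the first full match of pattern in text, or None."""
    match = pattern.search(text)
    return match.group(0) if match else None


for sample in SAMPLES:
    same = full_hit(LONG_FORM, sample) == full_hit(SHORT_FORM, sample)
    print(sample, "same result" if same else "DIFFERENT result")
```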
Note: The list of approved generic top-level domains (.com, .net, .org, etc.) continues to grow. Consider adding to or changing the top-level domain piece of the RegEx definitions
to accommodate new domain names as they are approved. For further information, please visit this ICANN webpage.
Finding IP Addresses and Domain Name Web Addresses
Depending on the type of investigation an examiner is working on, locating IP addresses may be important. A regular
expression can be built to look for IP addresses (e.g., https://192.168.0.2) and domain name Web addresses. A place to begin
would be to simply append a dot and an asterisk (.*) to the first part of our domain name regular expression, like this:
(http(s?)|ftp(s?)|afp):\/\/.*
But this RegEx tells the search function to return strings that start with http://, https://, ftp://, ftps://, or afp:// plus all of the
subsequent content, which can run to the rest of the document depending on how the RegEx engine treats line breaks.
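The over-matching is easy to reproduce. In the sketch below, which assumes Python's re module rather than a forensic tool's engine, the dot stops at a line break by default and only runs to the end of the text when the DOTALL flag is set; other engines may differ. The sample text and the helper first_hit are hypothetical.

```python
import re

# The overly broad pattern from the text.
GREEDY = r"(http(s?)|ftp(s?)|afp):\/\/.*"

# Hypothetical two-line sample.
text = "see https://192.168.0.2 for details\nnext line"


def first_hit(pattern, text, flags=0):
    """Return the first full match of pattern in text, or None."""
    match = re.search(pattern, text, flags)
    return match.group(0) if match else None


# Default: '.*' swallows everything up to the end of the line.
print(repr(first_hit(GREEDY, text)))
# With DOTALL: '.*' swallows everything up to the end of the text.
print(repr(first_hit(GREEDY, text, re.DOTALL)))
```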
A better place to start is the piece that finds https://, etc., as in the previous example: (http(s?)|ftp(s?)|afp):\/\/
And also keep the RegEx piece that looks for the domain name: (www\.)?\w+\.(com|org|net)
Next, create a regular expression that looks for IP address patterns. Notice the similarities between the Social Security
number example created previously and this one. IP addresses are comprised of 4 octets, with each octet comprised of a
minimum of 1 digit and a maximum of 3 digits separated by a period (dot). Unlike the Social Security number, the number of
digits is not static, thus a min/max shortcut is needed. Also, the RegEx engine needs to look for an (escaped) period instead
of a dash: \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}
Simplify the expression using a {3} shortcut:
(\d{1,3}\.){3}\d{1,3}
Of course, this is an oversimplification of the IP address structure, since each octet (group of digits between the dots)
represents 8 bits and is therefore limited to a maximum of 255. A valid IP address never has 256 or higher in an octet, yet this expression still hits on such strings.
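One way to deal with the 255 limit is to keep the simple expression for searching and check the octet values of each hit afterwards. The following is a minimal sketch in Python, assuming the hits can be exported for review; the function name plausible_ipv4 is hypothetical.

```python
import re

# The simplified IP address pattern from the text.
IP_LOOSE = re.compile(r"(\d{1,3}\.){3}\d{1,3}")


def plausible_ipv4(candidate):
    """Return True if candidate is four dot-separated numbers, each 0-255."""
    if not IP_LOOSE.fullmatch(candidate):
        return False
    # The pattern accepts any 1-3 digit number, so check the range here.
    return all(int(octet) <= 255 for octet in candidate.split("."))


print(plausible_ipv4("192.168.0.2"))   # every octet is within range
print(plausible_ipv4("999.168.0.2"))   # matches the pattern, but 999 > 255
```

Filtering after the search keeps the regular expression short and readable, at the cost of a second review step.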
Look closer and note that the {3} shortcut is used, not {4}, even though there are 4 octets in each IP address. This is
done since the last octet is not followed by a dot (period). So the first three octets are one to three-digit numbers followed
by a dot (period), and the last octet is a one to three-digit number followed by nothing. Therefore, the last piece of this RegEx
pattern is a little bit different: \d{1,3}
To combine the expressions, add a ‘pipe’ between the RegEx piece that looks for a domain name and the RegEx piece that
looks for an IP address, and place parentheses around the piped pieces to create a group:
(((www\.)?\w+\.(com|org|net)|(\d{1,3}\.){3}\d{1,3}))
Below is the complete regular expression:
(http(s?)|ftp(s?)|afp):\/\/(((www\.)?\w+\.(com|org|net)|(\d{1,3}\.){3}\d{1,3}))
This regular expression finds ‘https://192.168.0.233’, ‘https://www.zebra.com’, ‘afp://192.168.1.125’, ‘ftp://192.168.1.111’,
‘ftps://192.168.1.234’, and ‘ftp://gorilla.net’.
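The listed examples can be checked against the complete expression with a short script. This sketch uses Python's re module as a stand-in for a forensic tool's RegEx engine; the extra string ‘https://zebra.info’ is a hypothetical negative case added for illustration.

```python
import re

# The complete expression from the text: a scheme, then either a
# domain name or an IP address.
URL_OR_IP = re.compile(
    r"(http(s?)|ftp(s?)|afp):\/\/"
    r"(((www\.)?\w+\.(com|org|net)|(\d{1,3}\.){3}\d{1,3}))"
)

# The six examples listed in the text.
EXAMPLES = [
    "https://192.168.0.233",
    "https://www.zebra.com",
    "afp://192.168.1.125",
    "ftp://192.168.1.111",
    "ftps://192.168.1.234",
    "ftp://gorilla.net",
]

for example in EXAMPLES:
    print(example, bool(URL_OR_IP.fullmatch(example)))

# Hypothetical negative case: '.info' is not in the (com|org|net) group.
print("https://zebra.info", bool(URL_OR_IP.fullmatch("https://zebra.info")))
```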
Remember that whenever a new or complex regular expression is encountered, it is helpful to take a moment and pull it
apart piece by piece. Begin by identifying group or character class enclosures (parentheses or brackets). Then, unravel each
group and/or character class until there is an understanding of the whole expression.
Customizing BlackLight Regular Expression Presets
BlackLight forensic analysis software ships with several RegEx presets that an examiner can use to create custom regular
expressions without reinventing the wheel each time. For those that do not currently own BlackLight but would like to follow
along, please visit the BlackLight product page and select the Request Trial button to request a fully-functional trial license.
Launch BlackLight, and in the ‘Component List’ click the green Add button next to Content Searches.
To add a RegEx preset to a keyword search, on the lower right side of the ‘Content Pane’ select the Add Preset drop-down
menu. For this example, select the Email Address (Simple) menu option.
BlackLight automatically activates the Selected Keyword is RegEx Pattern checkbox and adds an email RegEx to the
keyword search list.
Note: When adding RegEx pattern keywords to a search in BlackLight, activate the Selected Keyword is a RegEx Pattern checkbox. RegEx patterns are
blue in the Keywords list.
To customize the regular expression preset, click on it twice to activate the text field. Delete or modify the pattern preset as
desired and click anywhere outside the text box to deactivate it.
To share customized RegEx patterns with other case investigators, select the Export button at the bottom of the list.
BlackLight exports all of the keywords in the list, including any regular expressions, as a plain text file. To import a
custom keyword list (which may include regular expressions) into a case, select the Import button. Navigate to and select
any plain text file containing line-separated keywords and regular expressions.
Remember to test modified and created regular expressions on a set of sample data prior to running the search across an
entire case. One way to do this is to create test data in a text file or files. Add the test data text file(s) to BlackLight and then
perform the Content Search on only the test data text file(s). This will help ensure the RegEx is hitting on the data desired.
If the test data is not large enough, or does not provide enough variation, it may not illustrate when a regular expression
is too vague. Vague regular expressions are expressions that hit on the desired data but are written so poorly they hit on a
tremendous number of false positives. If the number of hits grows rapidly while a Content Search using a RegEx is running,
click the Pause button in the ‘Content Pane’ (the search must be selected in the ‘Component List’ to see the results as the
search is running). In the example below, 8,343,854 hits were located in 30 MB out of 11.3 GB. This is a large number of hits
in a small percentage of the data.
In the example shown above, only one RegEx was included in the Content Search. There were no other keywords so all the
hits could be attributed to the poorly written RegEx. If multiple keywords and/or regular expressions are included in the
search, the Statistics sub-view in the ‘Content Pane’ indicates how many hits are attributed to each keyword or RegEx. If the
number of hits associated with a RegEx is high, it is an indicator the expression is too vague. Test the RegEx with a more
comprehensive test data set to determine the cause.
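The same kind of per-pattern hit count can be approximated outside the tool when building test data. The sketch below is an illustration in Python only and does not reproduce BlackLight's Statistics sub-view; the pattern labels, the deliberately vague \d+ pattern, and the test data are all assumptions.

```python
import re
from collections import Counter


def hit_counts(patterns, text):
    """Count the hits each named pattern produces in text."""
    counts = Counter()
    for name, pattern in patterns.items():
        counts[name] = sum(1 for _ in re.finditer(pattern, text))
    return counts


# Hypothetical patterns: one reasonably specific, one far too vague.
PATTERNS = {
    "ip_address": r"(\d{1,3}\.){3}\d{1,3}",
    "too_vague": r"\d+",
}

# Hypothetical test data containing two IP addresses.
TEST_DATA = "host 192.168.0.2 port 8080\nhost 10.0.0.7 port 443\n"

for name, count in hit_counts(PATTERNS, TEST_DATA).items():
    print(name, count)
```

A pattern whose count is far higher than the number of known items in the test data is a candidate for tightening before it is run against an entire case.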
References
Goyvaerts, Jan. 2009. Regular Expressions Info. http://www.regular-expressions.info/reference.html (accessed July 21, 2020).
Stackoverflow.com. 2010. What is a vertical tab? http://stackoverflow.com/questions/3380538/what-is-a-vertical-tab (accessed July 20, 2020).
About Cellebrite
Cellebrite is the global leader of Digital Intelligence solutions for law enforcement, government and enterprise organizations.
Cellebrite delivers an extensive suite of innovative software solutions, analytic tools, and training designed to accelerate
digital investigations and address the growing complexity of handling crime and security challenges in the digital era.
Trusted by thousands of leading agencies and companies in more than 150 countries, Cellebrite is helping fulfill the joint
mission of creating a safer world.
• To learn more visit us at www.cellebrite.com
• Contact Cellebrite globally at www.cellebrite.com/contact