WHITEPAPER
Digital Forensic Regular Expressions (RegEx)
www.cellebrite.com

    [A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}

This is neither a random pattern, nor the result of a cat walking across our keyboard. It is a regular expression, or RegEx for short, that forensic examiners may use to search for email addresses.

During a forensic examination, investigators use keywords to find exact string (‘word’) matches and regular expressions to find strings that match a pattern. For example, if we use marchmadness@thisdomain.com as a keyword search term, a search function returns a ‘hit’ every time the marchmadness@thisdomain.com email address is located. But, if we want to find any email address containing any letter, number, or symbol combination, followed by the @ symbol, followed by any top-level domain (.net, .org, .com, etc.), we need to search for the email address pattern. We call this pattern a regular expression.

To construct and use regular expressions, the information will be broken down into the following topics:
• Introduction and Definitions
• Alphanumeric RegEx Building Blocks
• Metacharacters and Escaping
• Escaped Alphabetical Characters
• Finding a Simple Word Pattern
• Regular Expression ‘Synonyms’ and Shortcuts
• Dissecting and Creating Regular Expressions
• Customizing BlackLight Regular Expression Presets

Introduction and Definitions

Regular expressions are used to search for data that matches a pattern. In order to use a regular expression to locate data during a forensic examination, the forensic tool must have a regular expression search engine. Keep in mind there may be slight variations in regular expression engines incorporated in some forensic tools. While most implementations are similar, when in doubt consult the documentation provided with the forensic tool and test regular expressions in the forensic tool in use.
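One way to get a feel for the difference between a keyword search and a pattern search, before moving to the forensic tool itself, is a short script. The sketch below uses Python's built-in re module, which is an assumption of this example and not the engine of any particular forensic tool; the sample text and the second email address are invented for illustration.

```python
import re

# Hypothetical sample text; the addresses are invented for illustration.
sample = (
    "Contact marchmadness@thisdomain.com or j.doe+tips@example.org today. "
    "marchmadness@thisdomain.com replied."
)

# Keyword search: only exact occurrences of one string are counted.
keyword = "marchmadness@thisdomain.com"
keyword_hits = sample.count(keyword)

# Pattern search: the email expression from the top of this paper.
email_pattern = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}")
pattern_hits = email_pattern.findall(sample)

print(keyword_hits)   # 2
print(pattern_hits)   # all three addresses, including the one the keyword misses
```

The keyword finds only the address that was typed, while the pattern also returns the address nobody knew to look for. Results in a forensic tool may differ slightly, so the same expression should still be tested there.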
There are also other ways to test regular expressions, such as the online testers https://regex101.com or https://www.regextester.com.

Some tools also refer to regular expressions as grep expressions (where grep stands for global regular expression print). People often use the terms RegEx and grep interchangeably.

Some terms commonly associated with regular expressions are as follows:
• Literal Characters: A character the regular expression sees exactly as it is typed. The regular expression engine is looking for that character.
• Character Class: A set of characters enclosed in brackets, often written as a range such as a-z or 0-9. The regular expression engine looks for any single character in the specified set.
• Group: A sequence of regular expression items enclosed in parentheses and treated as a single unit. A repetition character that follows the group, or an ‘or’ operator inside it, applies to the group as a whole.
• Metacharacters or Special Characters: Characters with special meaning to the regular expression engine. The regular expression engine interprets the metacharacter based on its special meaning.
• Escaping: A means to instruct the regular expression engine to ignore the special meaning of a metacharacter and instead look for that character.

Combining literal characters with character classes and groups, used in conjunction with metacharacters, is what creates a regular expression. Below, a methodology is introduced to help formulate regular expressions to find patterns of data.

Alphanumeric RegEx Building Blocks

When building regular expressions, one method is to verbally describe the regular expression before building it, and then seek out each RegEx piece that fulfills the description. When describing the regular expression, consider which items need to be part of a character class and/or part of a group. In regular expressions, character classes are enclosed with [brackets], and groups are enclosed with (parentheses). Note that brackets and parentheses are metacharacters.
The process of verbally describing the regular expression and then using RegEx pieces to replace the verbal description takes time and may be slow initially, but with practice, it gets easier and faster. It is important when building regular expressions that the expression built is neither too granular nor too expansive. If it is too granular, data may be missed. If it is too expansive, there may be a lot of false positive hits. The verbal description process, along with testing the regular expression, will help ensure the correct RegEx is created.

For example, take the case where a suspect uses Dan7a as a username and email address prefix. During the course of the investigation, it is found the subject has another username that fits this same alphanumeric pattern. With this information, a search can be conducted for additional five-character strings that match this pattern.

First create the verbal description of the pattern:

A string pattern that includes any single uppercase letter from A to Z, followed by any single lowercase letter from a to z, followed by any single lowercase letter from a to z, followed by any single digit from 0 to 9, followed by any single lowercase vowel.

The following would be included in the regular expression:

I want to find string patterns that include…    Regular Expression
Any single uppercase letter from A to Z         [A-Z
followed by                                     ] (close the character class)
Any single lowercase letter from a to z         [a-z
followed by                                     ]
Any single lowercase letter from a to z         [a-z
followed by                                     ]
Any single digit from 0 to 9                    [0-9
followed by                                     ]
Any single lowercase vowel                      [aeiou
(close the last character class)                ]

Here is the complete regular expression:

    [A-Z][a-z][a-z][0-9][aeiou]

If an examiner uses this RegEx as a keyword, the search function returns Dan7a and Kim4i but not SAm9u because the second letter in SAm9u is uppercase.

In the next example, the character classes are going to be organized a little bit differently.
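Before moving on, the username pattern above can be checked with a few lines of code. This is a minimal sketch using Python's re module (an assumption of this example; a forensic tool's engine may behave slightly differently), and the surrounding sample text is invented.

```python
import re

# The five-character username pattern built from the verbal description.
username_pattern = re.compile(r"[A-Z][a-z][a-z][0-9][aeiou]")

# Hypothetical sample text containing the three usernames discussed above.
text = "Logins seen: Dan7a, Kim4i, SAm9u"

hits = username_pattern.findall(text)
print(hits)  # ['Dan7a', 'Kim4i']
```

SAm9u is skipped because its second character is uppercase, and the word "Logins" is skipped because its fourth character is not a digit.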
Here is the new verbal description:

A string pattern that includes any single digit from 0-9 or any single lowercase letter from a-z, followed by any single digit from 0-9 or any single lowercase letter from a-z or any single uppercase letter from A-Z.

Use these components to build the regular expression:

I want to find string patterns that include…    Regular Expression
Any single digit from 0 to 9                    [0-9
or                                              (no bracket: the definition of the character class is not finished)
Any single lowercase letter from a to z         a-z
followed by                                     ] (close bracket: the character class is defined)
Any single digit from 0 to 9                    [0-9
or                                              (no bracket: the definition of the character class is not finished)
Any single lowercase letter from a to z         a-z
or                                              (no bracket: the definition of the character class is not finished)
Any single uppercase letter from A to Z         A-Z
(close the last character class)                ] (close bracket: the character class is defined)

Here is the complete regular expression:

    [0-9a-z][0-9a-zA-Z]

Notice a close bracket is not added until all character class members are defined. Remember that character classes can include a range of alphanumeric characters, single alphanumeric characters, and/or metacharacters.

Any number of character class regular expressions can be written. Writing a verbal description first helps organize what should be included in the RegEx and how the characters should be grouped within the brackets.

Metacharacters and Escaping

In the regular expressions created above, the brackets used are, like parentheses, part of a group of characters known as metacharacters. When a regular expression engine encounters a metacharacter, it knows that the metacharacter has a special meaning; the engine will not look for a “[“ or “(“, instead it automatically interprets the character as the beginning of a character class or group. Of course, parentheses and brackets are not the only metacharacters available.
There are other metacharacters with special meaning that the search engine interprets in the same way. This creates an additional problem if one of the metacharacters is needed as a literal character in the regular expression.

Regular expression engines are typically built into the search function of a forensic tool. Typically, there is some means in the interface to indicate to the search function that the string entered should be interpreted as a regular expression as opposed to just literal characters. This is when the regular expression engine takes over. Once it is indicated to the search function that the string is a regular expression, or grep expression, the regex engine interprets characters literally or non-literally depending on character type. In general, and by default, regex engines interpret the metacharacters (special characters and/or punctuation marks) non-literally and regular alphanumeric characters literally.

To change a character’s default interpretation, an examiner must ‘escape’ it. The most common way to escape a character is to place a backslash before it; however, some metacharacters are automatically treated literally if they are part of a character class (i.e., [included in a bracketed set]).

For example, by default, regex engines treat the ‘$’ and ‘+’ metacharacters non-literally; the ‘$’ character is an anchor character and the ‘+’ character is a repetition character. But, if we escape these characters with a backslash, the regex engine treats them literally and returns a hit each time a dollar sign or plus sign is located.
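The effect of escaping can be seen directly in a small experiment. The sketch below assumes Python's re module as the engine and uses an invented sample string; it compares the unescaped ‘$’, the escaped ‘\$’ and ‘\+’, and the same two characters inside a character class.

```python
import re

# Hypothetical sample text, invented for illustration.
text = "Total: $5+$10 (approx.)"

# Unescaped, '$' is an end-of-line anchor: it matches the empty position at
# the end of the text, not a dollar sign.
anchor_hits = re.findall(r"$", text)

# Escaped with a backslash, '$' and '+' are treated literally.
dollar_hits = re.findall(r"\$", text)
plus_hits = re.findall(r"\+", text)

# Inside a character class, both characters are literal without a backslash.
class_hits = re.findall(r"[$+]", text)

print(anchor_hits)  # ['']
print(dollar_hits)  # ['$', '$']
print(plus_hits)    # ['+']
print(class_hits)   # ['$', '+', '$']
```

The first result is a single empty match, which is why an unescaped ‘$’ never produces a dollar-sign hit on its own.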
Below are some of the most commonly used metacharacters, with their default and escaped meanings:

Metacharacter: \ (backslash)
Default meaning: Escape
Escaped character: \\
Escaped meaning: Find a backslash

Metacharacter: $ (anchor)
Default meaning: Find the preceding pattern matches if they occur at (are anchored to) the end of the string or line
Escaped character: \$
Escaped meaning: Find a dollar sign

Metacharacter: ^ (anchor) (see additional notes below)
Default meaning: Find the subsequent pattern matches if they occur at (are anchored to) the beginning of the string or line
Escaped character: \^
Escaped meaning: Find a caret

Metacharacter: . (wildcard) (see additional notes below)
Default meaning: Find any character or space except a newline character
Escaped character: \.
Escaped meaning: Find a period

Metacharacter: * (repetition)
Default meaning: Find zero or more consecutive preceding character/pattern matches
Escaped character: \*
Escaped meaning: Find an asterisk

Metacharacter: + (repetition)
Default meaning: Find one or more consecutive preceding character/pattern matches
Escaped character: \+
Escaped meaning: Find a plus sign

Metacharacter: ? (repetition) (see additional notes below)
Default meaning: Find zero or one preceding character/pattern match
Escaped character: \?
Escaped meaning: Find a question mark

Metacharacter: ?! (negative look ahead, written inside parentheses)
Default meaning: Find a position that is not followed by the subsequent characters/patterns
Escaped character: \?\!
Escaped meaning: Find a question mark and an exclamation point

Metacharacter: []
Default meaning: Find any single member in the character class
Escaped character: \[ or \]
Escaped meaning: Find an open or close bracket

Metacharacter: ()
Default meaning: Group RegEx items (such as multiple character classes, shortcuts, and/or characters) together
Escaped character: \( or \)
Escaped meaning: Find an open or close parenthesis

Additional Notes:

1. Carets: The caret is an ‘interesting character’ because, depending on where it is placed, it can modify the way a search function treats a character class or become a character class member.
Here are three different caret RegEx usage examples:

I want to find: Any one character that is not a digit from 1-6 and not a lowercase letter from a-e
Caret position: In a character class immediately after the open bracket
Regular expression: [^1-6a-e]

I want to find: Any one digit from 1-6, or any one lowercase letter from a-e, or the ^ character
Caret position: Anywhere in a character class except just after the open bracket
Regular expression: [1-6a-e^] or [1-6^a-e]

I want to find: Every string or line of text that begins with ‘From:’ and a space (think email)
Caret position: At the beginning of an unbracketed string
Regular expression: ^From:\s

2. Non-escaped periods: When a search function sees a non-escaped period, it matches any character that is not a line break (i.e. carriage return, line feed, etc.). So, if we have a regular expression with five consecutive periods “.....” the search function returns strings with any five characters (including spaces), as long as each character is not a newline character. So, the search function returns Dan7a, Kim4i, and SAm9u as hits, and also “98765” and “abdlz”, as well as shorter values such as “4” or “321” together with the spaces or other characters around them. As you can see, regular expressions that contain the dot metacharacter can be very useful, but if a regular expression definition is too broad, the search function returns many hits that do not have relevance and/or meaning.

3. Greedy characters: The ‘*’ and ‘+’ repetition characters are greedy by default! Left to their own devices, they match as many character or pattern matches as their character definition allows. To ‘limit’ these characters, follow them with a ‘?’. This quells their greed, and politely asks the search function to match the fewest number of character or pattern matches a character definition allows. This concept will be demonstrated in the Finding a Simple Word Pattern example.

Escaped Alphabetical Characters

By default, the RegEx engine treats alphabetical characters literally; a hit is returned when the letter, word, or phrase is located.
But, if an escape (backslash) is used with an alphabetical character, the engine treats it non-literally, as a line break or shortcut. Only specific alphabetical characters have an escaped meaning. Below are a few alphabetical characters and their escaped meanings:

Escaped Letter        Search Result
\t                    Find a tab
\r                    Find a carriage return
\n                    Find a new line
\s                    Find a space, tab, carriage return, or line break. Roughly the same as [ \t\r\n] (many engines also include the form feed and vertical tab).
\v                    Find a vertical tab (see additional notes below)
\f                    Find a form feed
\d                    Find any digit (including zero). The same as [0-9].
\w                    Find any capital letter, or lowercase letter, or digit (including zero), or the underscore character. The same as [A-Za-z0-9_].
\b (word boundary)    Find a defined whole word (see additional notes below)

Additional Notes:

1. Vertical Tabs - If you are like us and just have to know more, you’ll find additional vertical tabs information here.

2. Word Boundaries - An escaped ‘b’ is also an ‘interesting character.’ Suppose we want to write a regular expression that finds both occurrences of the word ‘last’ in this sentence:

“I committed a crime last week and it will not be my last!”

We might be tempted to create a regular expression that looks for a white space, then the word ‘last,’ then another white space:

    \slast\s

But if we use this RegEx, the search function skips the second ‘last’ occurrence because it is followed by an exclamation point and not a space. To find both occurrences, we need to use a word boundary like this:

    \blast\b

Finding a Simple Word Pattern

Using the information above, a RegEx can be built to find any bracketed word. The RegEx should return a hit for each match separately. Using the following test sentence, hits should be returned for [kind of] and [and greedy]:

I think some metacharacters are [kind of] repetitive [and greedy].
The initial RegEx description:

I want to find string patterns that have an open bracket, then any character except a newline, then one or more of the preceding characters, then a close bracket.

Remember that brackets are metacharacters, so the open and close brackets must be escaped for the RegEx engine to find the literal characters. Without escaping, the engine treats them as a character class enclosure (non-literally) instead of brackets (literally).

I want to find string patterns that include…      Regular Expression
An open bracket                                   \[
then any character except a newline               .
then one or more ‘any character except a newline’ +
then a close bracket                              \]

Here is the RegEx:

    \[.+\]

Using a RegEx tester, a single hit is returned:

[kind of] repetitive [and greedy]

As you can see, instead of getting two separate hits, one for [kind of] and one for [and greedy], the RegEx returned one hit that encompassed all of the data in brackets as well as the text in between the two groups of bracketed strings. The RegEx created did not return the results wanted. When creating regular expressions, especially when repetition characters are used, often the first try yields unexpected results, which is why it’s important to test the expression.

In this case, the use of the ‘+’ metacharacter, which is greedy by default, resulted in the return of the entire string between the first open bracket and the last close bracket. The expression needs to be altered to limit the ‘+’ metacharacter. The ‘?’ metacharacter can be used to force the RegEx engine to match the previous character (in this case the period) the fewest number of times that the ‘+’ metacharacter definition allows. The ‘+’ metacharacter definition allows one or more character repetitions, so the ‘?’ makes the ‘+’ stop at the first point where the rest of the expression (here, the close bracket) can match.
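The greedy behavior described above, and the effect of following the ‘+’ with a ‘?’, can be reproduced outside a RegEx tester. The following is a minimal sketch that assumes Python's re module as the engine and uses the test sentence from this section.

```python
import re

sentence = "I think some metacharacters are [kind of] repetitive [and greedy]."

# Greedy: '+' takes as much as it can, so the match runs from the first
# open bracket to the last close bracket.
greedy_hits = re.findall(r"\[.+\]", sentence)

# Lazy: '+?' takes as little as it can, so each bracketed string is its own hit.
lazy_hits = re.findall(r"\[.+?\]", sentence)

print(greedy_hits)  # ['[kind of] repetitive [and greedy]']
print(lazy_hits)    # ['[kind of]', '[and greedy]']
```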
Modifying the RegEx to this:

    \[.+?\]

And the RegEx engine returns two separate hits: [kind of] and [and greedy].

In this example, the brackets are escaped, so they are treated literally. Inside non-escaped brackets, the period, plus sign, and question mark would be treated as members of a character class and interpreted literally. In the expression built above, the RegEx engine does not interpret the open and close bracket as a character class enclosure because they are escaped. If the brackets were not escaped, the RegEx engine would return a hit each time a period, plus sign, or question mark was located.

Escaped Alphabetical Character Example

Using escaped alphabetical characters can reduce the size of a regular expression, which can grow to be quite long when complex logic is in use. Walking through the example below, which overall is rather simple, illustrates how powerful escaped alphabetical characters can be.

In the Alphanumeric RegEx Building Blocks section above, the following expression was created:

    [0-9a-z][0-9a-zA-Z]

If the underscore character is added to the second character class, the RegEx engine will also match strings, such as email address prefixes, that have an underscore in that position:

    [0-9a-z][0-9a-zA-Z_]

Using escaped alphabetical characters, the first character class can be simplified by substituting \d for 0-9, and the second character class can be simplified by substituting \w for [0-9a-zA-Z_]:

    [\da-z]\w

The overall expression is simplified. Both of these shortcuts are defined in the ‘Escaped Alphabetical Characters’ table. The RegEx built previously was reduced by identifying escaped alphabetical characters that match the character classes defined.

Regular Expression ‘Synonyms’ and Shortcuts

Some RegEx engines also recognize ‘synonyms’ and shortcuts. Both can be used to simplify regular expressions.
Regular expression synonyms are like word synonyms; they are different patterns that mean the same thing to a search function. The use of synonyms can make the RegEx a bit more human readable. Shortcuts provide an easier way to write long, repetitive regular expressions.

Below are a few RegEx synonyms:

Regular Expression    Synonym
[0-9]                 [:digit:]
[0-9a-zA-Z]           [:alnum:]
\s                    [:space:]

Here are a few RegEx shortcuts:

I want to: Match the preceding pattern n times
Shortcut: {n}
This: [0-9][0-9][0-9][0-9]
is the same as this: [0-9]{4}

I want to: Match the preceding pattern a minimum of n times and a maximum of m times
Shortcut: {n,m}
This: [0-9][0-9] or [0-9][0-9][0-9] or [0-9][0-9][0-9][0-9]
is the same as this: [0-9]{2,4}

I want to: Match the preceding pattern n or more times
Shortcut: {n,}
This: [0-9][0-9][0-9][0-9] or [0-9][0-9][0-9][0-9][0-9] or [0-9][0-9][0-9][0-9][0-9][0-9], and so on
is the same as this: [0-9]{4,}

I want to: Find a two-word phrase. The first word in the phrase is ‘brown’ and the second word is either ‘cat’ or ‘dog’
Shortcut: | (pipe, also called ‘or’ operator)
This: brown\scat or brown\sdog
is the same as this: brown\s(cat|dog)

An Example of Using Shortcuts

During an investigation, information is received that the subject seems to purposefully create passwords or email addresses with at least six lowercase alphabetical characters followed by two numbers. The subject used the email address donnie01@gmail.com at least once. The investigator would like to locate additional instances of this email address and additional email addresses and passwords that match this pattern.

To create a RegEx, begin by writing a verbal description:

I want to find string patterns that include six or more single lowercase letters from a-z, followed by any single digit from 0-9, followed by any single digit from 0-9.

Here are the items to include in the regular expression:

I want to find string patterns that include…    Regular Expression
Six or more single lowercase letters from a-z   [a-z][a-z][a-z][a-z][a-z][a-z] (or more)
followed by
Any single digit from 0 to 9                    [0-9
followed by                                     ]
Any single digit from 0 to 9                    [0-9
(close the last character class)                ]

Below is the starting regular expression (‘or more’ is not part of the final RegEx, but it is kept in place for now as a reminder):

    [a-z][a-z][a-z][a-z][a-z][a-z] (or more) [0-9][0-9]

Next, the expression can be simplified by replacing the six or more lowercase alphabetical character classes using a shortcut. The simplification transforms this:

    [a-z][a-z][a-z][a-z][a-z][a-z] (or more)

to this:

    [a-z]{6,}

Next, simplify the two numeric character classes from this: [0-9][0-9] to this: [0-9]{2}, then to this: \d{2}

The final simplified regular expression created is this:

    [a-z]{6,}\d{2}

Always keep in mind that as long as the RegEx being used finds the data, then it is not wrong. Simplification allows the same search pattern to be defined using fewer characters, but it is not required.

Creating a RegEx to Find Social Security Numbers

In this example, a RegEx is created to locate U.S. Social Security numbers. Here is the RegEx description:

Find string patterns that include any single digit from 0 to 9, followed by any single digit from 0 to 9, followed by any single digit from 0 to 9, then a dash, followed by any single digit from 0 to 9... (etc.)

There are several ways to write this RegEx. The expression can be written in longhand first, and then simplified using shortcuts.

Longhand: [0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]
Simplified: [0-9]{3}-[0-9]{2}-[0-9]{4}
Further simplified: \d{3}-\d{2}-\d{4}

Any of these regular expressions will work to find Social Security numbers.
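That the three forms are interchangeable can be confirmed by running all of them over the same data. The sketch below assumes Python's re module as the engine; the sample numbers are invented, and the phone number is included only to show that it is not mistaken for a Social Security number.

```python
import re

# The three equivalent forms from the text above.
ssn_patterns = {
    "longhand": r"[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]",
    "simplified": r"[0-9]{3}-[0-9]{2}-[0-9]{4}",
    "further simplified": r"\d{3}-\d{2}-\d{4}",
}

# Hypothetical sample data, invented for illustration.
sample = "SSNs on file: 824-55-1234 and 123-45-6789; phone 555-123-4567."

# Run every form over the same text and collect its hits.
results = {name: re.findall(p, sample) for name, p in ssn_patterns.items()}

for name, hits in results.items():
    print(name, hits)
```

Each form returns the same two hits. One caveat under this assumption: in Python, \d can also match non-ASCII digits unless the re.ASCII flag is used, which does not matter for this sample but may matter for other data.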
This RegEx pattern can be further refined so that a search function looks for specific string matches. For example, to find Social Security numbers that start with ‘824’ use this RegEx:

    824-\d{2}-\d{4}

To find Social Security numbers that DO NOT start with ‘824’:

    (?!824)\d{3}-\d{2}-\d{4}

The RegEx engine excludes 824 because of the ?! group with the 824 in the parentheses. The ?! is a negative look ahead. The characters are not in brackets and are not escaped, so the search function treats them non-literally.

Notice the \d{3} is still included after the ‘negative look ahead’. It is needed to tell the search function to find patterns that match the entire regular expression, then exclude the hits that begin with ‘824’. If the \d{3} is left off, the search function ignores the first three digits altogether and returns only the dash-and-digit portion that follows them. The Social Security pattern beginning with 824 is then returned (minus its first three digits) as a search hit even though we included the (?!824) negative look ahead, because without the \d{3} the expression only looks for the second and third part of the Social Security number pattern.

Remember, if a regular expression is too broadly defined, the search function may return hits that are irrelevant and/or meaningless, sometimes referred to as false positives. To exclude meaningless Social Security number hits that are merely part of a longer run of numbers and dashes, word boundaries can be added:

    \b\d{3}-\d{2}-\d{4}\b

Finding Phone Numbers

The regular expression used to find a basic U.S. phone number pattern with dash separators (XXX-XXX-XXXX) is nearly identical to the one used to locate Social Security numbers; with one small adjustment, the search function looks for three numbers after the first dash instead of two.
Any of the following regular expressions would work:

Longhand: [0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]
Simplified: [0-9]{3}-[0-9]{3}-[0-9]{4}
Further simplified: \d{3}-\d{3}-\d{4}

The above regular expressions make the assumption that phone numbers are all stored with dashes. This may not be the case. Comprehensive phone number regular expressions are more complex because not all phone numbers are stored with dashes. The regular expression needs to be altered so the search function will look for phone number patterns with dot or whitespace (space, tab, or line break) separators too:

I want to find string patterns that include…    Regular Expression
three digits                                    \d{3}
then a dash, or a dot, or a space               [-.\s
followed by                                     ]
three digits                                    \d{3}
then a dash, or a dot, or a space               [-.\s
followed by                                     ]
four digits                                     \d{4}

Here is the regular expression:

    \d{3}[-.\s]\d{3}[-.\s]\d{4}

Remember, when metacharacters such as the dot are part of a character class (included in a bracketed set), they are read literally. In the RegEx created above, the dot (“.”) does not need to be escaped since it is included in the character class.

Another common method for storing U.S. phone numbers is to have parentheses around the area code. Ideally, a RegEx created to locate U.S. phone numbers would optionally look for those stored with parentheses around the area code portion of the number. To make this regular expression, start by building a piped shortcut as seen in the cat|dog example in the shortcut table:

I want to find string patterns that include…    Regular Expression
An open parenthesis (this must be escaped)      \(
then three digits                               \d{3}
then a close parenthesis (escape again)         \)
or                                              | (pipe)
three digits                                    \d{3}

This must then be enclosed in non-escaped parentheses to make it a group:

    (\(\d{3}\)|\d{3})

After creating this bit of logic for the area code portion of the U.S. phone number, the rest of the previously created regular expression can be added to complete the logic. In the resulting expression, the RegEx engine looks for the area code (with or without parentheses), then a dash, a dot, or a whitespace, then the three-digit prefix, then another dash, dot, or whitespace, and then the last four digits in the phone number:

    (\(\d{3}\)|\d{3})[-.\s]\d{3}[-.\s]\d{4}

Dissecting and Creating Regular Expressions

While creating new regular expressions is a useful skill, there are already a lot of regular expressions out there created by others. Before using a regular expression built by someone else, or included in forensic software, it is wise to understand what the expression is looking for.

The first step when dissecting an already created regular expression is to identify group or character class enclosures (parentheses and brackets). Next, unravel each group and/or character class until the entire expression is understood.

Here is a sample regular expression:

    (((\(\d{3}\)|\d{3})[-.\s])|(\(\d{3}\)|\d{3}))?\d{3}[-.\s]?\d{4}([-.\s]?([Ee]xt|[Xx])[.]?[-.\s]?\d{2,5})?

This regular expression locates strings containing any phone number where the area code is optional and may or may not be enclosed in parentheses, digits are optionally separated by a dash, dot, or whitespace, and an extension is optionally located. If an extension is present, it must be from two to five digits and preceded by any of the following: ext., ext, Ext., Ext, x, x., X, or X..

Finding Email Addresses in Allocated Space

Examiners may use the next few regular expressions to locate email addresses in allocated space. This section is provided to show the evolution from the description to the regular expression. As you read through each example, take a moment to practice writing your own RegEx descriptions.
Begin with: ‘I want to find string patterns that include...’

This regular expression locates email address patterns with the ‘zebra’ domain name and the ‘.com’ top-level domain:

    \w+@zebra\.com

In this example, the escaped ‘w’ is a shortcut for [A-Za-z0-9_]. Using this shortcut will limit the email address hits; there are more complex regular expressions for locating emails. The period is escaped to force the RegEx engine to look for email addresses from the domain zebra.com; escaping the period looks for an actual period in the string. Remember, by default a period is a wildcard metacharacter. Since the period is not included in a character class, it must be escaped to be interpreted literally.

Another variation is to search for email addresses from two domain names. For example, a user may have a zebra email address and a tiger email address. The following regular expression will locate email address patterns with the zebra.com or tiger.com domain names:

    \w+@(zebra|tiger)\.com

Building on this variation, a regular expression can be constructed to find email addresses with the zebra.com, tiger.com, or cheetah.net domain names and top-level domains:

    \w+@(((zebra|tiger)\.com)|cheetah\.net)

Back to the regular expression at the top of this paper:

    [A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}

Hopefully this RegEx makes a lot more sense now. Notice that unlike the ‘\w’ shortcut, this regular expression tells the search function to match email prefixes containing any alphanumeric character, a period, underscore, percentage sign, plus sign, or minus sign. So it is more comprehensive than any of the regular expressions built in this section.

Finding Email Addresses in Allocated and Unallocated Space

The regular expressions used to search for email address patterns in unallocated space are similar to the ones used to search in allocated space.
But, we need to make one simple adjustment because email addresses in unallocated space may have ‘%20’ instead of the ‘@’ symbol. To do so, simply change the ‘@’ symbol to this:

    (%20|@)

For example, donnie01@gmail.com (allocated space) may be donnie01%20gmail.com in unallocated space.

To search for Gmail account email addresses with the ‘@’ symbol or ‘%20’, use this regular expression:

    [A-Za-z0-9._%+-]+(%20|@)gmail\.com

To find any email address in allocated and unallocated space, use this regular expression:

    [A-Za-z0-9._%+-]+(%20|@)[A-Za-z0-9.-]+\.[A-Za-z]{2,4}

Finding Web Addresses

Other commonly used regular expressions search for Web addresses by domain name and top-level domain (e.g., www.zebra.com). Here is an initial RegEx description:

I want to find string patterns that begin with three consecutive instances of the letter ‘w’, then a dot ( . ), then one or more alphanumeric characters or underscore ( _ ), then a dot ( . ), then the top-level ‘com’ domain.

Here is the preliminary regular expression:

    www\.\w+\.com

Because Web addresses do not always include ‘www.’, consider grouping www\. together in parentheses and adding the ‘?’ repetition metacharacter just after the group to tell the search function to return a hit if ‘www.’ occurs zero or one times.

    (www\.)?\w+\.com

To find http web addresses, add http:\/\/ to the beginning of the RegEx:

    http:\/\/(www\.)?\w+\.com

To find http, https, ftp, ftps, or afp (Apple Filing Protocol) web addresses with .com, .org, or .net top-level domains, start with this RegEx:

    (http|https|ftp|ftps|afp):\/\/(www\.)?\w+\.(com|org|net)

Then, because http and https share the same first four characters, and ftp and ftps share the same first three characters, simplify this RegEx by adding (s?) just after http and ftp.
This tells the RegEx engine to return a pattern match if the ‘s’ exists zero or one time:

    (http(s?)|ftp(s?)|afp):\/\/(www\.)?\w+\.(com|org|net)

Note: The list of approved generic top-level domains (.com, .net, .org, etc.) continues to expand. Consider adding/changing the top-level domain piece of the RegEx definitions to accommodate the new domain names as they are approved. For further information, please visit this ICANN webpage.

Finding IP Addresses and Domain Name Web Addresses

Depending on the type of investigation an examiner is working on, locating IP addresses may be important. A regular expression can be built to look for IP addresses (e.g., https://192.168.0.2) and domain name Web addresses.

A place to begin would be to simply append a dot and an asterisk to the first part of our domain name regular expression, like this:

    (http(s?)|ftp(s?)|afp):\/\/.*

But this RegEx tells the search function to return strings that start with http://, https://, ftp://, ftps://, or afp:// plus all of the subsequent content up to the next line break, which is far more than the address itself.

A better place to start is with the piece that finds https:// etc., as done in the previous example:

    (http(s?)|ftp(s?)|afp):\/\/

And also keep the RegEx piece that looks for the domain name:

    (www\.)?\w+\.(com|org|net)

Next, create a regular expression that looks for IP address patterns. Notice the similarities between the Social Security number example created previously and this one. IPv4 addresses are comprised of 4 octets, with each octet comprised of a minimum of 1 digit and a maximum of 3 digits, separated by a period (dot). Unlike the Social Security number, the number of digits is not static, thus a min/max shortcut is needed.
Also, the RegEx engine needs to look for an (escaped) period instead of a dash:

\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}

Simplify the expression using a {3} shortcut:

(\d{1,3}\.){3}\d{1,3}

Of course, this is an oversimplification of the IP address structure, since each octet (group of digits between the dots) represents 8 bits and therefore has a limit of 255. No valid IP address has an octet of 256 or higher, yet this pattern will still match such strings.

Look closer and note that the {3} shortcut is used, not {4}, even though there are four octets in each IP address. This is because the last octet is not followed by a dot (period). The first three octets are one- to three-digit numbers followed by a dot, and the last octet is a one- to three-digit number followed by nothing. Therefore, the last piece of this RegEx pattern is a little bit different:

\d{1,3}

To combine the expressions, add a ‘pipe’ between the RegEx piece that looks for a domain name and the RegEx piece that looks for an IP address, and place parentheses around the piped pieces to create a group:

(((www\.)?\w+\.(com|org|net)|(\d{1,3}\.){3}\d{1,3}))

Below is the complete regular expression:

(http(s?)|ftp(s?)|afp):\/\/(((www\.)?\w+\.(com|org|net)|(\d{1,3}\.){3}\d{1,3}))

This regular expression finds ‘https://192.168.0.233’, ‘https://www.zebra.com’, ‘afp://192.168.1.125’, ‘ftp://192.168.1.111’, ‘ftps://192.168.1.234’, and ‘ftp://gorilla.net’.

Remember that whenever a new or complex regular expression is encountered, it is helpful to take a moment and pull it apart piece by piece. Begin by identifying group or character class enclosures (parentheses or brackets). Then, unravel each group and/or character class until the whole expression is understood.
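The complete expression can be tried outside a forensic tool before it is run against a case. The following is a minimal sketch in Python, assuming that Python's built-in re module behaves closely enough to the forensic tool's engine for this pattern. The sample text and the helper names find_hits and octets_in_range are hypothetical and are not part of this whitepaper's tooling. The sketch also shows one possible way to discard hits whose octets exceed 255, a check the pattern itself does not perform.

```python
import re

# Complete pattern from the text above, copied verbatim.
URL_OR_IP = re.compile(
    r"(http(s?)|ftp(s?)|afp):\/\/(((www\.)?\w+\.(com|org|net)|(\d{1,3}\.){3}\d{1,3}))"
)

# Hypothetical sample data; the last address has an impossible octet (999).
SAMPLE = (
    "visit https://www.zebra.com or ftp://gorilla.net today; "
    "hosts: https://192.168.0.233 afp://192.168.1.125 http://999.1.1.1"
)


def find_hits(text):
    """Return the full text of every match of the pattern in `text`."""
    return [m.group(0) for m in URL_OR_IP.finditer(text)]


def octets_in_range(hit):
    """Return False only when the hit is a dotted quad with an octet above 255."""
    host = hit.split("://", 1)[1]
    parts = host.split(".")
    if len(parts) == 4 and all(p.isdigit() for p in parts):
        return all(int(p) <= 255 for p in parts)
    return True  # domain-name hits are left untouched


hits = find_hits(SAMPLE)
valid = [h for h in hits if octets_in_range(h)]
print(valid)
```

finditer is used here instead of findall because the pattern contains groups, and findall would return the contents of each group rather than the full hit. Filtering octets after the match keeps the expression readable; whether a forensic tool allows such post-processing depends on the tool.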
Customizing BlackLight Regular Expression Presets

BlackLight forensic analysis software ships with several RegEx presets that an examiner can use to create custom regular expressions without reinventing the wheel each time. Those who do not currently own BlackLight but would like to follow along can visit the BlackLight product page and select the Request Trial button to request a fully functional trial license.

Launch BlackLight, and in the ‘Component List’ click the green Add button next to Content Searches. To add a RegEx preset to a keyword search, select the Add Preset drop-down menu on the lower right side of the ‘Content Pane’. For this example, select the Email Address (Simple) menu option. BlackLight automatically activates the Selected Keyword is RegEx Pattern checkbox and adds an email RegEx to the keyword search list.

Note: When adding RegEx pattern keywords to a search in BlackLight, activate the Selected Keyword is a RegEx Pattern checkbox. RegEx patterns are blue in the Keywords list.

To customize the regular expression preset, click on it twice to activate the text field. Delete or modify the pattern preset as desired, and click anywhere outside the text box to deactivate it.

To share customized RegEx patterns with other case investigators, select the Export button at the bottom of the list. BlackLight exports all of the keywords in the list, including any regular expressions, as a plain text file. To import a custom keyword list (which may include regular expressions) into a case, select the Import button. Navigate to and select any plain text file containing line-separated keywords and regular expressions.

Remember to test modified and newly created regular expressions on a set of sample data prior to running the search across an entire case.
One way to do this is to create test data in a text file or files. Add the test data text file(s) to BlackLight and then perform the Content Search on only the test data text file(s). This will help ensure the RegEx is hitting on the desired data.

If the test data is not large enough, or does not provide enough variation, it may not reveal that a regular expression is too vague. Vague regular expressions hit on the desired data but are written so loosely that they also hit on a tremendous number of false positives. If the number of hits grows rapidly while a Content Search using a RegEx is running, click the Pause button in the ‘Content Pane’ (the search must be selected in the ‘Component List’ to see the results as the search is running).

In the example below, 8,343,854 hits were located in 30 MB out of 11.3 GB. This is a large number of hits in a small percentage of the data.

In the example shown above, only one RegEx was included in the Content Search. There were no other keywords, so all the hits could be attributed to the poorly written RegEx. If multiple keywords and/or regular expressions are included in the search, the Statistics sub-view in the ‘Content Pane’ indicates how many hits are attributed to each keyword or RegEx. If the number of hits associated with a RegEx is high, it is an indicator that the expression is too vague. Test the RegEx with a more comprehensive test data set to determine the cause.

References

Goyvaerts, Jan. 2009. Regular Expressions Info. http://www.regular-expressions.info/reference.html (accessed July 21, 2020).

Stackoverflow.com. 2010. What is a vertical tab? http://stackoverflow.com/questions/3380538/what-is-a-vertical-tab (accessed July 20, 2020).
About Cellebrite

Cellebrite is the global leader of Digital Intelligence solutions for law enforcement, government and enterprise organizations. Cellebrite delivers an extensive suite of innovative software solutions, analytic tools, and training designed to accelerate digital investigations and address the growing complexity of handling crime and security challenges in the digital era. Trusted by thousands of leading agencies and companies in more than 150 countries, Cellebrite is helping fulfill the joint mission of creating a safer world.

• To learn more visit us at www.cellebrite.com
• Contact Cellebrite globally at www.cellebrite.com/contact