BIO 102 Renn_Lab#1 (pre-lab handout) Name ___________________

Perl Programming for Pattern Matching

THE OBJECTIVE OF THIS PRACTICAL is for each student to begin using regular expressions in a Perl programming environment, and to have some fun and obtain some instant gratification with some surprisingly relevant word play. There is not enough time in one lab session to learn Perl thoroughly. This lab is based on Exercise 3 from “Programming in PERL for Biology”, a text by Mark LeBlanc and Betsy Dreyer from Wheaton College. The full text will be published later in 2007 and is recommended for any student wishing to go further with this work. Until that book is published, I recommend Beginning Perl for Bioinformatics from O'Reilly & Associates, which addresses the needs of biologists who want to learn Perl programming.

However, Perl is useful for many applications other than bioinformatics. I argue that a basic understanding of computer programming will help any scientist. Even if you are lucky enough to always have a programmer working for you, understanding what programmers do and how they do it will help you to be a better collaborator and to accomplish your own goals more quickly. Furthermore, with rudimentary programming skills you will be able to adapt scripts to your current needs and even write simple code on your own. In the modern era of biology, with such complicated tools as whole-genome sequence analysis, automated environmental data logging, functional brain imaging, and gene expression analysis, you will encounter data manipulation issues beyond your wildest nightmares. Perl may help you solve your problems.

Question 1
You have covered 4 different topics in introductory biology, each of which has the potential to include experiments with extraordinarily large data sets that may not be manageable in Word and Excel. In your lab notebook, describe one such situation from each of at least two different areas in biology that interest you.

REGULAR EXPRESSIONS

It can be argued that almost all DNA sequence analysis is one form or another of pattern matching. Whether you are comparing sequences to determine their identities, to build phylogenetic trees, or to hunt for specific transcription factor binding sites, you are looking for patterns. Powerful pattern-matching tools are used to identify these patterns. These tools are called “Regular Expressions”, or “regex” for short. Many different programming languages use regexes, so while the syntax may differ slightly, this exercise will be useful no matter what programming language you use in your academic and professional career. Regular expressions are not only a way to describe text through pattern matching; they are also a powerful tool to validate data, pull pieces out of larger text, substitute new text for old text, etc.

Most books on Perl start with the basics of how to write a program, but because we have only 2 hours, the meat and potatoes of the program will be provided for you so that you can focus on the dessert: the regex, the entertaining and rewarding part of these programs. You will be writing all sorts of variations of regexes embedded within the Perl syntax. You are essentially filling in the blanks, just as you might if you were using a foreign-language phrase book. For example:

    “Je voudrais un croissant, s’il vous plait.”

could easily be changed to acquire all sorts of wonderful items:

    “Je voudrais un café, s’il vous plait.”
    “Je voudrais une brioche, s’il vous plait.”

You will not be required to learn the foreign language's grammar and syntax, yet you will be able to use this language to find what you want in the data files. In the case above, you have just ordered breakfast in France.

The lab handout will use a “facing page format”, which is often used in foreign language translations.
The English version will be on the left-hand page and the foreign language will be on the right-hand page. You will use the same regex on the English dictionary (see the left page) and on the E. coli genome DNA (see the right page).

FUN EXAMPLE FEATURING A RICH SET OF SYNTAX

Do not feel daunted by the syntax at this point; rather, try to get a sense of the power of finding words that contain certain patterns.

    ^ge.*[nm]e$

This regex matches all words that start with 'ge' and end with either 'ne' or 'me'. More explicitly, this pattern matches any word that starts (^) with the letters 'ge', followed by any number of characters (.*), and that ends ($) with either an 'n' or an 'm' ([nm]) followed by the letter 'e'. This pattern would match many words, a partial list of which is shown below:

    :
    gelatine
    gene
    genome
    genuine
    germane
    :

BEFORE CLASS EXERCISE

In the basic syntax, most characters are treated as literals: they match only themselves (i.e. "a" matches "a", "bc" matches "bc", etc.). The exceptions are called metacharacters:

.     Matches any single character.
      "a.cd" matches "abcd" as well as "akcd", "afcd" etc.
      "a..d" matches "abcd" as well as "acbd" or "akcd" etc.
[]    Matches any single character that is contained within the brackets.
      [abc] matches "a", "b", or "c"
      [a-z] matches any lowercase letter
      These can be mixed: [abcq-z] matches a, b, c, q, r, s, t, u, v, w, x, y, or z, and so does [a-cq-z]
[^ ]  Matches a single character that is not contained within the brackets after the caret.
      [^abc] matches any character other than "a", "b", or "c"
      [^a-z] matches any single character that is not a lowercase letter
^     Matches the start of the line (notice that the caret has a very different meaning inside and outside the square brackets).
$     Matches the end of the line.
()    Defines a "marked subexpression".
      A "marked subexpression" is also called a "block". Whatever is matched by the marked expression can be recalled later.
*     An expression followed by "*" matches zero or more copies of the expression.
      "ab*c" matches "ac", "abc", "abbc", "abbbc" etc.
      "(ab)*c" matches "c", "abc", "ababc" etc. (order matters)
      "[ab]*c" matches "c", "abc", "bac", "aac", "baac" and so on. (order doesn't matter)
+     An expression followed by "+" matches one or more copies of the expression.
      "ab+c" matches "abc", "abbbc" etc., but not "ac" (unlike "ab*c" in the previous example)
      "[xyz]+" matches "x", "y", "zx", "zyx", and so on.
{x,y} Means to match the last "block" at least x and not more than y times. In the examples below, the pattern is taken to describe the whole string.
      "(ab){2,4}" matches "abab", "ababab" or "abababab", but not "ab" nor "ababababab"
      "[ab]{2,3}" matches "bb" or "bbb" and also "aa", "aaa", "ab", "ba", "aba", "bab", "abb" and "baa", but not "b" or "aabb"

Before coming to class, decide which words would be matched by the following regular expressions, and write the explanation for your answer in English in your lab notebook:

Example: ".at" matches
    a) hat  b) cat  c) dog  d) bat  e) complicate  f) match
a, b, d, e and f are correct because the "." means to match any single character, and the letters "a" and "t" are literals and must occur in the specified order. Dog has none of these letters.

1: "[hc]at" matches
    a) hat  b) cat  c) dog  d) bat  e) complicate  f) match  g) atom

2: "[^b]at" matches
    a) hat  b) cat  c) dog  d) bat  e) complicate  f) match

3: "^[hb]at" matches
    a) hat colors were picked to match the bats
    b) bat colors were picked to match the hats
    c) cat colors were picked to match the hats.
    d) dog colors were picked to match the cats.
    e) complicated colors were picked.
    f) matched colors were picked.

4: "[hb]at$" matches
    a) hat colors were picked to match the bat
    b) bat colors were picked to match the hat
    c) cat colors were picked to match the hat
    d) dog colors were picked to match the cat
    e) complicated colors were picked
    f) matched colors were picked

During lab you will be introduced to the pattern matching syntax of regular expressions in a step-by-step fashion, with lots of opportunities to practice. You will likely come up with ideas that are beyond the scope of this tutorial but well within the power of Perl.

When programming in Perl, it is customary to “comment” your code so that future users of the program will be able to read the code and know the function of each part. These programmer-added comments are not read by the computer; they are only there to help human users. Comments are set apart from the program code using the number sign “#”. The computer will ignore anything that follows the “#” on that line. This trick can also be used to tell the computer which part of the code to run: placing a “#” in front of a line of code switches that line off. In class you will use “#” to tell the computer which dictionary to use.

QUESTION to answer before lab.
In the following program, which lines will be read by the computer?

    1  # This program will add or subtract 2 and 5
    2  # It is not a very exciting program
    3  ###################################
    4  $a = 5;
    5  $b = 2;
    6  $result = $a + $b;    #calculate add
    7  #$result = $a - $b;   #calculate subtract
    8  print "Equal to: $result";   #print result

5. What is the reason to include lines 1, 2, and 3?
6. How would you change the code if you wanted to do subtraction instead of addition?

Five datasets will be provided for you to work with in class.
-One is a list of all of the English words in Webster’s 2nd International Dictionary.
-The second is a list of all 7-mers (heptamers), composed of As, Cs, Gs, and Ts, from the entire genome of the bacterium Escherichia coli (E. coli). To produce it, a 7 base-pair window was shifted one letter at a time through the entire genome. Each 7-mer is listed on a single line so that it can be treated like the dictionary list.
-The third is the complete text of Darwin’s “The Origin of Species”.
-The fourth is a list of the amino acid sequences of all predicted proteins in the E. coli genome, in FASTA format.
-The fifth is a list of the DNA sequences coding for all predicted proteins in the E. coli genome, in FASTA format (see appendix for FASTA format descriptions).

A list of words from an English dictionary and a list of 7-letter DNA motifs from E. coli are not directly comparable at all. However, they are useful as introductory datasets for learning about regex syntax and for thinking about how pattern matching might be applied to DNA sequences.

The list of DNA motifs (words) was generated artificially from all 7-mers in the E. coli genome. It is a list with little meaning in and of itself: a collection in which every seven-letter motif that appears in the E. coli genome is considered. That is not to say that all of these 7-mer motifs are without meaning. A few DNA motifs of this length have had their functions worked out and demonstrated in the lab. For example, some are protein-binding sites, such as those found in the lac operon, and some are restriction sites (remember cutting DNA with restriction enzymes in lab).

The list of English language words is full of meaning to any native speaker. It serves as a sort of familiar, comfortable grounding for your first forays into pattern matching. However, it is not meant to be directly analogous to the DNA motif list. Any English language words that you “discover” will already have well-established meanings.
You are testing your ability to write an accurate search statement using the regex syntax. This matters, because carelessly written regexes can return bad results rather quickly, and it takes some practice and vigilance to spot the errors. In this case, we are using our knowledge of English to spot the successful searches and the errors (if any).

Please work through the regular expressions in this handout, recording your answers and thought processes, so that you will be ready to work with the programs in lab, where the camaraderie of your fellow students and instructors will make it fun.
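
To give a feel for what "filling in the blanks" might look like, here is a minimal sketch (not the program you will use in lab) that applies two regexes from this handout to short word lists. The variable names and the in-line word list are assumptions made up for this illustration; in lab the words will come from the dictionary file instead. Only patterns whose answers are already given in this handout are used.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the dictionary: a short, made-up word list.
my @words = qw(gelatine gene genome genuine germane general gem);

# The regex from the fun example: starts with "ge", ends with "ne" or "me".
# grep keeps only the words for which the pattern between the slashes matches.
my @ge_hits = grep { /^ge.*[nm]e$/ } @words;
print "Matched by ^ge.*[nm]e\$ : @ge_hits\n";

# The worked example from the before-class exercise: any character, then "at".
# With no ^ or $, the pattern may match anywhere inside the word.
my @choices = qw(hat cat dog bat complicate match);
my @at_hits = grep { /.at/ } @choices;
print "Matched by .at : @at_hits\n";
```

With these lists, the first line printed should contain gelatine, gene, genome, genuine and germane (but not general or gem), and the second should contain hat, cat, bat, complicate and match, in agreement with the worked example. To try a different pattern, change only the text between the two slashes.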