BIO 102 Renn_Lab#1 (pre-lab handout)
Name ___________________
Perl Programming for Pattern Matching.
THE OBJECTIVE OF THIS PRACTICAL is for each student to begin to use regular
expressions in a Perl programming environment, to have some fun, and to obtain
some instant gratification with some surprisingly relevant word play. There is not enough
time in one lab session to learn Perl thoroughly.
This lab is based on Exercise 3 from “Programming in PERL for Biology” a text by Mark
LeBlanc and Betsy Dreyer from Wheaton College. The full text will be published later in
2007 and is recommended for any student wishing to go further with this work. Until
that book is published, I recommend Beginning Perl for Bioinformatics from O'Reilly &
Associates, which addresses the needs of biologists who want to learn Perl programming.
However, Perl is useful for many applications other than Bioinformatics.
I argue that a basic understanding of computer programming will help any scientist. Even
if you are lucky enough to always have a programmer working for you, understanding
what programmers do and how they do it will help you to be a better collaborator and
accomplish your own goals more quickly. Furthermore, with rudimentary programming
skills you will be able to adapt scripts to your current needs and even write simple code
on your own. In the modern era of biology, with such complicated tools as whole genome
sequence analysis, automated environmental data logging, functional brain imaging, and
gene expression analysis, you will encounter data manipulation issues beyond your
wildest nightmares. Perl may help you solve your problems.
Question 1
You have covered four different topics in introductory biology, each of which has the
potential to include experiments with extraordinarily large data sets that may not be
manageable in Word and Excel. In your lab notebook, describe one such situation
from each of at least two different areas in Biology that interest you.
REGULAR EXPRESSIONS
It can be argued that almost all DNA sequence analysis is one form or another of pattern
matching. Whether you are comparing sequences to determine their identities, to build
phylogenetic trees or to hunt for specific transcription factor binding sites, you are looking for
patterns. Powerful syntax matching tools are used to identify these patterns. These tools
are called “Regular Expressions” or “regex” for short. Many different programming
languages use regexes, so while the syntax may be slightly different, this exercise will be
useful no matter what programming language you use in your academic and professional
career.
Regular expressions are not only a way to describe text through pattern matching, but also
constitute a powerful tool to validate data, pull pieces out of larger text, substitute new
text for old text, etc. Most books on Perl start with the basics of how to write a program,
but because we have only 2 hours, the meat and potatoes of the program will be provided
for you so that you can focus on the dessert, the entertaining and rewarding parts of these
programs: the regex. You will be writing all sorts of variations of regexes embedded
within the Perl syntax. You are essentially filling in the blanks, just as you might if you were
using a foreign-language phrase book. For example:
“Je voudrais un croissant, s’il vous plait.”
This could easily be changed to acquire all sorts of wonderful items:
“Je voudrais un café, s’il vous plait.”
“Je voudrais un brioche, s’il vous plait.”
You will not be required to learn the foreign language grammar and syntax, yet you will
be able to use this language to find what you want in the data files. In the case above, you
have just ordered breakfast in France.
The lab handout will use a “facing page format”, which is often used in foreign language
translations. The English version will be on the left page and the Foreign Language will
be on the right-hand page. You will use the same regex on the English dictionary (see the
left page) and on the DNA of the E. coli genome (see the right page).
FUN EXAMPLE FEATURING A RICH SET OF SYNTAX
Do not feel daunted by the syntax at this point, rather, try to get a sense of the power of
finding words that contain certain patterns.
^ge.*[nm]e$
This regex matches all words that
start with ‘ge’ and end with either ‘ne’ or ‘me’.
More explicitly, this pattern matches:
any word that starts (^) with the letters ‘ge’,
followed by any number of characters, including none (.*),
and ends ($) with either an ‘n’ or an ‘m’ ([nm])
followed by the letter ‘e’.
This pattern would match with many words, a partial list of which is shown below:
:
:
gelatine
gene
genome
genuine
germane
:
:
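A minimal Perl sketch of this search, assuming a short hand-picked word list (the array @words is invented for illustration and is not the lab dictionary):

```perl
use strict;
use warnings;

# Hypothetical sample words; the lab uses a full dictionary file instead.
my @words = qw(gelatine gene genome genuine germane gentle game legume);

# Keep only the words that start with 'ge' and end with 'ne' or 'me'.
my @matches = grep { /^ge.*[nm]e$/ } @words;

print "$_\n" for @matches;
```

With this sample list, "gentle", "game", and "legume" are left out: the first does not end in 'ne' or 'me', and the other two do not start with 'ge'.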
BEFORE CLASS EXERCISE
In the basic syntax, most characters are treated as literals — they match only themselves
(i.e. "a" matches "a", "bc" matches "bc", etc). The exceptions are called metacharacters:
.
Matches any single character.
"a.cd" matches "abcd" as well as "akcd", "afcd" etc.
"a..d" matches "abcd" as well as "acbd" or "akcd" etc.
[]
Matches any single character that is contained within the brackets
[abc] matches "a", "b", or "c"
[a-z] matches any lowercase letter
These can be mixed:
[abcq-z] matches a, b, c, q, r, s, t, u, v, w, x, y, or z, and so does [a-cq-z]
[^ ]
Matches a single character that is not contained within the brackets after the caret
[^abc] matches any character other than "a", "b", or "c"
[^a-z] matches any single character that is not a lowercase letter
^
Matches the start of the line (notice a very different meaning when the caret is
inside or outside the square brackets).
$
Matches the end of the line
()
Defines a "marked subexpression". A "marked subexpression" is also a "block"
Whatever is matched by the marked expression can be recalled later.
*
An expression followed by "*" matches zero or more copies of the expression.
"ab*c" matches "ac", "abc", "abbc", "abbbc" etc.
"(ab)*c" matches "c", "abc", "ababc" etc. (order matters)
"[ab]*c" matches "c", "abc", "bac", "aac", "baac" and so on. (order doesn’t matter)
+
An expression followed by "+" matches one or more copies of the expression
"ab+c" matches "abc", "abbbc" etc. but not "ac", which "ab*c" in the previous example does match
"[xyz]+" matches "x", "y", "zx", "zyx", and so on.
{x,y}
Matches the preceding "block" at least x and not more than y times. In the examples
below, the pattern must account for the whole string (as if written between ^ and $).
"(ab){2,4}" matches "abab", "ababab", or "abababab" but neither "ab" nor "ababababab"
"[ab]{2,3}" matches "bb" or "bbb" and also "aa", "aaa", "ab", "ba", "aba", "bab",
"abb", and "baa" but not "b" or "aabb"
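The quantifier examples in the metacharacter list describe whole strings. The following is a minimal sketch, not part of the lab programs, that checks a few of them in Perl by wrapping each pattern in ^ and $; the test pairs in @tests are chosen here for illustration.

```perl
use strict;
use warnings;

# Each entry is [pattern, string]. The pattern is wrapped in ^(?: ... )$
# below so that it must describe the whole string, not just part of it.
my @tests = (
    [ 'ab*c',      'ac'         ],
    [ 'ab+c',      'ac'         ],
    [ '(ab){2,4}', 'abab'       ],
    [ '(ab){2,4}', 'ababababab' ],
    [ '[ab]{2,3}', 'aabb'       ],
    [ '[^abc]',    'd'          ],
);

my @results;
for my $test (@tests) {
    my ($pattern, $string) = @$test;
    my $verdict = $string =~ /^(?:$pattern)$/ ? 'match' : 'no match';
    push @results, $verdict;
    print "\"$pattern\" against \"$string\": $verdict\n";
}
```

Without the anchors, Perl reports a match whenever the pattern occurs anywhere inside the string, so /(ab){2,4}/ on its own would also match "ababababab".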
Before coming to class, decide which words would be matched by the following
regular expressions and write in English the explanation for your answer in your
lab notebook:
Example: ".at" matches
a) hat
b) cat
c) dog
d) bat
e) complicate
f) match
a, b, d, e, and f are correct because the "." matches any single character, and
the letters "a" and "t" are literals that must follow it in that order, anywhere in the word.
"dog" contains neither of these letters.
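The worked example can be checked with a short Perl sketch such as the one below. It is an illustration only; the hash %found is a name introduced here, and the parentheses are added so that the matched text can be recalled through $1.

```perl
use strict;
use warnings;

my @candidates = qw(hat cat dog bat complicate match);

my %found;    # word => the three letters that ".at" matched
for my $word (@candidates) {
    # No ^ or $ here, so ".at" may occur anywhere inside the word.
    if ($word =~ /(.at)/) {
        $found{$word} = $1;
        print "$word matches (found \"$1\")\n";
    } else {
        print "$word does not match\n";
    }
}
```

Printing the recalled text shows why the longer words qualify: the pattern finds "cat" inside "complicate" and "mat" inside "match".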
1: "[hc]at" matches
a) hat
b) cat
c) dog
d) bat
e) complicate
f) match
g) atom
2: "[^b]at" matches
a) hat
b) cat
c) dog
d) bat
e) complicate
f) match
3: "^[hb]at" matches
a) hat colors were picked to match the bats
b) bat colors were picked to match the hats
c) cat colors were picked to match the hats.
d) dog colors were picked to match the cats.
e) complicated colors were picked.
f) matched colors were picked.
4: "[hb]at$" matches
a) hat colors were picked to match the bat
b) bat colors were picked to match the hat
c) cat colors were picked to match the hat
d) dog colors were picked to match the cat
e) complicated colors were picked
f) matched colors were picked
During lab you will be introduced to the pattern matching syntax of regular expressions
in a step-by-step fashion with lots of opportunities for you to practice. You will likely
come up with ideas that are beyond the scope of this tutorial but well within the power of
Perl.
When programming in Perl, it is customary to “comment” your code so that future users
of the program will be able to read the code and know the function of each part of the
code. These programmer-added comments are not read by the computer; they are only
there to help human users. Comments are set apart from the computer program
code using the number sign “#”. The computer will ignore anything that follows the
“#” on the same line. This trick can also be used to tell the computer which part of the code to run:
placing a “#” in front of a line of code switches that line off. In
class you will use “#” in order to tell the computer which dictionary you will use.
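As a rough illustration of switching data sets with “#”, the sketch below uses two tiny made-up lists (@english and @dna are stand-ins invented here; the lab programs read real files and may be organized differently).

```perl
use strict;
use warnings;

# Two tiny stand-in lists (hypothetical; the lab supplies real data files).
my @english = qw(gene genome hat);
my @dna     = qw(GAATTCA TTGACAT ACGTACG);

# Exactly one of the next two lines should be active.
# Move the "#" to the other line to switch data sets.
my @list = @english;
#my @list = @dna;

print "Searching ", scalar(@list), " entries, starting with $list[0]\n";
```

As written, the English list is active and the DNA line is ignored by Perl.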
QUESTION to answer before lab.
In the following program which lines will be read by the computer program?
1  # This program will add or subtract 2 and 5
2  # It is not a very exciting program
3  ###################################
4  $a = 5;
5  $b = 2;
6  $result = $a + $b;           #calculate add
7  #$result = $a - $b;          #calculate subtract
8  print "Equal to: $result";   #print result
5. What is the reason to include lines 1, 2, and 3?
6. How would you change the code if you wanted to do subtraction instead of
addition?
Five datasets will be provided for you to work with in class.
-One is a list of all of the English words in Webster’s 2nd International Dictionary.
-The second is a list of all possible 7-mers (heptamers) composed of As, Cs, Gs,
and Ts from the entire genome of the bacterium Escherichia coli (E. coli). To
produce it, a 7 base-pair window was shifted one letter at a time through the
entire genome. Each possible 7-mer is listed on a single line so that it can be
treated similarly to the dictionary list.
-The third is the complete text of Darwin’s “The Origin of Species”.
-The fourth is a list of all amino acid sequences for all predicted proteins in the
E. coli genome in FASTA format.
-The fifth is a list of all DNA sequences coding for all predicted proteins in the E.
coli genome in FASTA format (see appendix for FASTA format descriptions).
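The sliding-window step used to build the 7-mer list can be sketched in Perl as follows. The sequence in $genome is a short made-up string standing in for the real genome, and this sketch keeps repeated 7-mers; whether the lab file keeps or removes repeats is not stated here.

```perl
use strict;
use warnings;

# A short made-up sequence standing in for the E. coli genome.
my $genome = 'ATGACCATGATTACG';
my $k      = 7;    # window width in bases

my @kmers;
# Slide the window one base at a time; stop when fewer than $k bases remain.
for my $start (0 .. length($genome) - $k) {
    push @kmers, substr($genome, $start, $k);
}

# One 7-mer per line, like the words in the dictionary list.
print "$_\n" for @kmers;
```

A sequence of N bases yields N - 7 + 1 windows, so the 15-base stand-in gives 9 lines.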
A list of words from an English dictionary and a list of 7-letter DNA motifs from E. coli
are not directly comparable at all. However, they are useful as introductory datasets for
learning about regex syntax and thinking about how pattern matching might be applied to
DNA sequences.
The list of DNA motifs (words) was generated artificially from all possible 7-mers in the E. coli genome. It is a list with little meaning in and of itself, a collection in
which all seven-letter motifs that appear in the E. coli genome are considered. That is not
to say that all of these 7-mer motifs are without meaning. A few DNA motifs (of this
length) have had their functions worked out and demonstrated in the lab. For example,
some are protein-binding sites such as those found in the lac operon and some are
restriction sites (remember cutting DNA with restriction enzymes in lab).
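To suggest how the same kind of search applies to DNA, here is a minimal sketch that looks for a restriction site among a few made-up 7-mers. The EcoRI recognition sequence GAATTC is used as the example; that choice, and the list @motifs, are assumptions made for illustration rather than part of the lab materials.

```perl
use strict;
use warnings;

# Made-up 7-mers standing in for lines of the E. coli 7-mer list.
my @motifs = qw(GAATTCA AGAATTC TTGACAT GAATTGC CGAATTC);

# EcoRI recognizes GAATTC; keep every 7-mer that contains that site.
my @with_site = grep { /GAATTC/ } @motifs;

print "$_\n" for @with_site;
```

"GAATTGC" is rejected because a single letter differs from the site, which is the kind of near miss worth checking by eye when a search returns results.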
The list of English language words is full of meaning to any native speaker. It
serves as a sort of familiar, comfortable grounding for your first forays into pattern
matching. However, it is not meant to be directly analogous to the DNA motif list. Any
English language words that you “discover” will already have well-established meanings.
You are testing your ability to write an accurate search statement using the regex syntax.
Yet, this is quite important because carelessly written regexes can return bad results rather
quickly, and it takes some practice and vigilance to spot the errors. In this case, we are
using our knowledge of English to spot the successful searches and the errors (if any).
Please work through the regular expressions in this handout, recording your answers and
thought processes, so that you will be ready to work with the programs in lab, where the
camaraderie of your fellow students and instructors will make it fun.