Regular expressions
• Perl provides a pattern-matching engine
• Patterns are called regular expressions
• They are extremely powerful
– probably Perl's strongest feature, compared to
other languages
• Often called "regexps" for short
Motivation: N-glycosylation motif
• Common post-translational modification in ER
– Membrane & secreted proteins
– Purpose: folding, stability, cell-cell adhesion
• Attachment of a 14-sugar oligosaccharide
• Occurs at asparagine residues with the
consensus sequence NX1X2, where
– X1 can be anything
(but proline & aspartic acid inhibit)
– X2 is serine or threonine
• Can we detect potential N-glycosylation
sites in a protein sequence?
User Input from Keyboard
• Read a line of input from the user and save
it in a variable:
print "Enter your DNA sequence:";
$dna = <STDIN>;
chomp($dna);
chomp removes the trailing newline character,
which is usually unwanted in user input
• We can also input a file name from the
user that we want to open:
print "Enter data file name:";
$data = <STDIN>;
chomp($data);
open F, $data;
Create a file handle called
F for the file name stored
in the $data variable
Interactive testing
• This script echoes input from the keyboard
while (<STDIN>) {
print;
}
The special filehandle STDIN means
"standard input", i.e. the keyboard
• Sometimes (e.g. in Windows IDEs) the
output isn’t printed until the script stops
• This is because of buffering.
• To stop buffering, set to "autoflush":
$| = 1;
while (<STDIN>) {
print;
}
$| is the autoflush flag
Matching alternative characters
• [ACGT] matches one A, C, G or T:
while (<STDIN>) {
print "Matched: $_" if /[ACGT]/;
}
this is not printed
This is printed
Matched: This is printed
• In general square brackets denote a set of
alternative possibilities
• Use - to match a range of characters: [A-Z]
• . matches any character except a newline
• \s matches whitespace (spaces, tabs, newlines)
• \S is anything that's not whitespace
• [^X] matches anything but X
(Italics denote input text)
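The character classes above can be tried out on a few strings. This is a minimal sketch: the subroutine name classes_matched and the test strings are made up for illustration and do not come from the slides.

```perl
use strict;
use warnings;

# Hypothetical helper: report which of four character classes a line matches.
sub classes_matched {
    my ($line) = @_;
    my @hits;
    push @hits, "[ACGT]" if $line =~ /[ACGT]/;   # one upper-case base
    push @hits, "[a-z]"  if $line =~ /[a-z]/;    # one lower-case letter
    push @hits, "\\s"    if $line =~ /\s/;       # one whitespace character
    push @hits, "[^x]"   if $line =~ /[^x]/;     # one character other than x
    return @hits;
}

# Made-up test strings
foreach my $line ("ACGT", "x y", "xy") {
    print "'$line' matches: ", join(" ", classes_matched($line)), "\n";
}
```

Note that [^x] still needs one character to match: it succeeds on "xy" because of the "y", not because an "x" is absent.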
Matching alternative strings
• /(this|that)/ matches "this" or "that"
• ...and is equivalent to /th(is|at)/
while (<STDIN>) {
print "Matched: $_" if /this|that|other/;
}
Won't match THIS
Will match this
Matched: Will match this
Won't match ThE oThER
Will match the other
Matched: Will match the other
Remember, regexps
are case-sensitive
Matching multiple characters
• x* matches zero or more x's (greedily)
• x*? matches zero or more x's (sparingly)
• x+ matches one or more x's (greedily)
• x{n} matches n x's
• x{m,n} matches from m to n x's
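A short sketch of how these quantifiers behave on one string. The string "GAAAAT" is a made-up example; x+? (one or more, sparingly) is not listed above but follows the same pattern as x*?.

```perl
use strict;
use warnings;

my $seq = "GAAAAT";    # made-up example string

my ($star)  = $seq =~ /(A*)/;       # "": zero A's already match at the start, before the G
my ($plus)  = $seq =~ /(A+)/;       # "AAAA": greedy, takes as many as possible
my ($lazy)  = $seq =~ /(A+?)/;      # "A": sparing, takes as few as possible
my ($two)   = $seq =~ /(A{2})/;     # "AA": exactly two
my ($range) = $seq =~ /(A{2,3})/;   # "AAA": two to three, greedy

print "A*='$star' A+='$plus' A+?='$lazy' A{2}='$two' A{2,3}='$range'\n";
```

The first line is the one that tends to surprise: because x* is allowed to match nothing, it succeeds at the earliest position in the string even though longer runs exist further along.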
Word and string boundaries
• ^ matches the start of a string
• $ matches the end of a string
• \b matches word boundaries
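The anchors can be illustrated by filtering small lists of strings. This is only a sketch: the strings are invented, and grep (which keeps the list elements for which the pattern matches) is used for brevity although the slides do not introduce it.

```perl
use strict;
use warnings;

# Keep only the strings that match each anchored pattern (made-up strings)
my @starts_with_atg = grep { /^ATG/ }     qw(ATGCCC CCATG);
my @ends_with_atg   = grep { /ATG$/ }     qw(ATGCCC CCATG);
my @has_word_cat    = grep { /\bcat\b/ }  ("the cat sat", "concatenate");

print "Starts with ATG: @starts_with_atg\n";   # ATGCCC
print "Ends with ATG: @ends_with_atg\n";       # CCATG
print "Contains word 'cat': @has_word_cat\n";  # the cat sat
```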
"Escaping" special characters
• \ is used to "escape" characters that
otherwise have meaning in a regexp
• so \[ matches the character "["
– if not escaped, "[" signifies the start of a list of
alternative characters, as in [ACGT]
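As a small illustration of escaping, the sketch below compares an unescaped "." with an escaped one, and uses \[ and \] to match literal square brackets. The strings are made up, and \d (any digit) is a standard Perl class that the slides do not cover.

```perl
use strict;
use warnings;

# Unescaped "." matches any character, so "3x14" slips through
my $unescaped = ("3x14" =~ /^3.14$/)  ? "match" : "no match";
# Escaped "\." matches only a literal dot
my $escaped   = ("3x14" =~ /^3\.14$/) ? "match" : "no match";

# \[ and \] match literal brackets; (\d+) captures the digits between them
my ($index) = "exon[12]" =~ /\[(\d+)\]/;

print "Unescaped: $unescaped\n";   # match
print "Escaped: $escaped\n";       # no match
print "Index: $index\n";           # 12
```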
Retrieving what was matched
• If parts of the pattern are enclosed by
parentheses, then (following the match) those
parts can be retrieved from the scalars $1, $2...
$| = 1;
while (<STDIN>) {
if (/(a|the) (\S+)/i) {
print "Noun: $2\n";
}
}
Pick up the cup
Noun: cup
Sit on a chair
Noun: chair
Put the milk in the tea
Noun: milk
Note: only the first "the"
is picked up by this regexp
• e.g. /the (\S+) sat on the (\S+) drinking (\S+)/
• matches "the cat sat on the mat drinking milk"
• with $1="cat", $2="mat", $3="milk"
Variations and modifiers
• //i ignores upper/lower case distinctions:
while (<STDIN>) {
print "Matched: $_" if /pattern/i;
}
pAttERn
Matched: pAttERn
• //g starts search where last match left off
– pos($_) is index of first character after last match
• s/OLD/NEW/ replaces first "OLD" with "NEW"
• s/OLD/NEW/g is "global" (i.e. replaces every
occurrence of "OLD" in the string)
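The following sketch puts s///, s///g and //g with pos() side by side. The sequence and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $dna = "ACGTACGT";    # made-up example sequence

# Copy the string, then substitute in the copy
(my $first_only = $dna) =~ s/T/U/;     # replaces the first T only
(my $all        = $dna) =~ s/T/U/g;    # replaces every T

# //g in a while loop resumes where the last match ended.
# pos() is the index just after the match, so subtracting the
# match length (2) gives the zero-based start of each "CG".
my @starts;
while ($dna =~ /CG/g) {
    push @starts, pos($dna) - 2;
}

print "First only: $first_only\n";    # ACGUACGT
print "Global: $all\n";               # ACGUACGU
print "CG starts at: @starts\n";      # 1 5
```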
N-glycosylation site detector
$| = 1;
while (<STDIN>) {
    $_ = uc $_;    # convert to upper case
    while (/(N[^PD][ST])/g) {
        print "Potential N-glycosylation sequence ",
            $1, " at residue ", pos() - 2, "\n";
    }
}
• The main regular expression is /(N[^PD][ST])/
• The regexp uses the 'g' modifier to get all
matches in the sequence
• pos() is the index of the first residue after the
match, starting at zero; so pos()-2 is the index
of the first residue of the three-residue match,
starting at one.
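One limitation of the loop above is that //g resumes after the end of each match, so overlapping candidate sites (as in "NNST") are reported only once. The sketch below is one possible variation, not the slides' method: it wraps the pattern in a zero-width lookahead (?=...) so that every start position is tested. The subroutine name find_sites and the test sequence are hypothetical. $-[0] is Perl's built-in zero-based offset of the start of the last match.

```perl
use strict;
use warnings;

# Return a list of [one-based position, matched triplet] pairs,
# including overlapping candidates.
sub find_sites {
    my ($protein) = @_;
    $protein = uc $protein;
    my @sites;
    # The lookahead consumes no characters, so the search
    # moves forward one residue at a time.
    while ($protein =~ /(?=(N[^PD][ST]))/g) {
        push @sites, [ $-[0] + 1, $1 ];
    }
    return @sites;
}

# Made-up sequence with two overlapping candidates: NNS and NST
foreach my $site (find_sites("ANNSTA")) {
    print "Potential N-glycosylation sequence $site->[1] at residue $site->[0]\n";
}
```

On "ANNSTA" this reports NNS at residue 2 and NST at residue 3, whereas the non-overlapping loop would report only NNS.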
PROSITE and Pfam
PROSITE – a database of regular expressions
for protein families, domains and motifs
Pfam – a database of Hidden Markov
Models (HMMs) – equivalent to
probabilistic regular expressions
Subroutines
• Often, we can identify self-contained tasks that
occur in so many different places we may want
to separate their description from the rest of our
program.
• Code for such a task is called a subroutine.
• Examples of such tasks:
– finding the length of a sequence
(NB: Perl provides the subroutine
length($x) to do this already)
– reverse complementing a sequence
– finding the mean of a list of numbers
Finding all sequence lengths (2)
Subroutine calls
Subroutine definition;
code in here is not
executed unless
subroutine is called
open FILE, "fly3utr.txt";
while (<FILE>) {
chomp;
if (/>/) {
print_name_and_len();
$name = $_;
$len = 0;
} else {
$len += length;
}
}
print_name_and_len();
close FILE;
sub print_name_and_len {
if (defined ($name)) {
print "$name $len\n";
}
}
Reverse complement subroutine
"my" announces that
$rev is local to the
subroutine revcomp
"return" announces
that the return value
of this subroutine
is whatever's in $rev
sub revcomp {
my $rev;
$rev = reverse ($dna);
$rev =~ tr/acgt/tgca/;
return $rev;
}
$rev = 12345;
$dna = "accggcatg";
$rev1 = revcomp();
print "Revcomp of $dna is $rev1\n";
$dna = "cggcgt";
$rev2 = revcomp();
print "Revcomp of $dna is $rev2\n";
print "Value of rev is $rev\n";
Value of $rev is
unchanged by
calls to revcomp
Revcomp of accggcatg is catgccggt
Revcomp of cggcgt is acgccg
Value of rev is 12345
Revcomp with arguments
The array @_ holds
the arguments to
the subroutine
(in this case, the
sequence to be
revcomp'd)
sub revcomp {
my ($dna) = @_;
my $rev = reverse ($dna);
$rev =~ tr/acgt/tgca/;
return $rev;
}
Now we don't have to re-use the same variable
for the sequence to be revcomp'd:
$dna1 = "accggcatg";
$rev1 = revcomp ($dna1);
print "Revcomp of $dna1 is $rev1\n";
$dna2 = "cggcgt";
$rev2 = revcomp ($dna2);
print "Revcomp of $dna2 is $rev2\n";
Revcomp of accggcatg is catgccggt
Revcomp of cggcgt is acgccg
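The revcomp subroutine above only complements lower-case bases, so upper-case input would be reversed but not complemented. A minimal sketch of a case-preserving variant follows; the name revcomp_any_case and the mixed-case test sequence are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical variant of revcomp that handles both cases
sub revcomp_any_case {
    my ($dna) = @_;
    my $rev = reverse $dna;            # scalar context: reverses the string
    $rev =~ tr/acgtACGT/tgcaTGCA/;     # complement, preserving case
    return $rev;
}

my $dna = "ACCggcatg";    # made-up mixed-case sequence
print "Revcomp of $dna is ", revcomp_any_case($dna), "\n";
```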
Mean & standard deviation
@xdata = (1, 5, 1, 12, 3, 4, 6);
($x_mean, $x_sd) = mean_sd (@xdata);
@ydata = (3.2, 1.4, 2.5, 2.4, 3.6, 9.7);
($y_mean, $y_sd) = mean_sd (@ydata);
sub mean_sd {
    my @data = @_;        # takes a list of $n numeric arguments
    my $n = @data + 0;    # number of elements
    my $sum = 0;
    my $sqSum = 0;
    foreach my $x (@data) {
        $sum += $x;
        $sqSum += $x * $x;
    }
    my $mean = $sum / $n;
    my $variance = $sqSum / $n - $mean * $mean;
    my $sd = sqrt ($variance);    # square root
    return ($mean, $sd);          # returns a two-element list: (mean, sd)
}
Maximum element of an array
• Subroutine to find the largest entry in an array
@num = (1, 5, 1, 12, 3, 4, 6);
$max = find_max (@num);
print "Numbers: @num\n";
print "Maximum: $max\n";
sub find_max {
my @data = @_;
my $max = pop @data;
foreach my $x (@data) {
if ($x > $max) {
$max = $x;
}
}
return $max;
}
Numbers: 1 5 1 12 3 4 6
Maximum: 12
Including variables in patterns
• Subroutine to find number of instances of
a given binding site in a sequence
$dna = "ACGCGTAAGTCGGCACGCGTACGCGT";
$mcb = "ACGCGT";
print "$dna has ",
count_matches ($mcb, $dna),
" matches to $mcb\n";
sub count_matches {
my ($pattern, $text) = @_;
my $n = 0;
while ($text =~ /$pattern/g) { ++$n }
return $n;
}
ACGCGTAAGTCGGCACGCGTACGCGT has 3 matches to ACGCGT
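When a variable is interpolated into a pattern, any special characters it contains keep their regexp meaning. If the variable is meant as literal text, Perl's \Q...\E quoting can be used. The sketch below contrasts the two; count_literal_matches and the test strings are hypothetical, and count_matches is repeated from the slide so the block runs on its own.

```perl
use strict;
use warnings;

# As on the slide: $pattern is treated as a regexp
sub count_matches {
    my ($pattern, $text) = @_;
    my $n = 0;
    while ($text =~ /$pattern/g) { ++$n }
    return $n;
}

# Hypothetical variant: \Q...\E makes $literal match as plain text
sub count_literal_matches {
    my ($literal, $text) = @_;
    my $n = 0;
    while ($text =~ /\Q$literal\E/g) { ++$n }
    return $n;
}

# Made-up strings: "A.G" as a regexp matches both "ACG" and "A.G"
print "As regexp: ", count_matches("A.G", "ACGA.G"), "\n";           # 2
print "As literal: ", count_literal_matches("A.G", "ACGA.G"), "\n";  # 1
```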
Data structures
• Suppose we have a file containing a table
of Drosophila gene names and cellular
compartments, one pair on each line:
Cyp12a5  Mitochondrion
MRG15    Nucleus
Cop      Golgi
bor      Cytoplasm
Bx42     Nucleus
Suppose this file is in "genecomp.txt"
Reading a table of data
• We can split each
line into a 2-element
array using the
split command.
• This breaks the line
at each space:
open FILE, "genecomp.txt";
while (<FILE>) {
($g, $c) = split;
push @gene, $g;
push @comp, $c;
}
close FILE;
print "Genes: @gene\n";
print "Compartments: @comp\n";
Genes: Cyp12a5 MRG15 Cop bor Bx42
Compartments: Mitochondrion Nucleus Golgi Cytoplasm Nucleus
• The opposite of split is join, which makes a scalar
from an array: print join (" and ", @gene);
Cyp12a5 and MRG15 and Cop and bor and Bx42
Finding an entry in a table
• The following code assumes that we've
already read in the table from the file:
$geneToFind = shift @ARGV;
print "Searching for gene $geneToFind\n";
for ($i = 0; $i < @gene; ++$i) {
if ($gene[$i] eq $geneToFind) {
print "Gene: $gene[$i]\n";
print "Compartment: $comp[$i]\n";
exit;
}
}
print "Couldn't find gene\n";
• Example:
$ARGV[0] = "Cop"
Searching for gene Cop
Gene: Cop
Compartment: Golgi
Binary search
• The previous algorithm is inefficient. If there are N
entries in the list, then on average we have to search
through ½(N+1) entries to find the one we want.
• For the full Drosophila genome, N=12,000. This is
painfully slow.
• An alternative is the Binary Search algorithm:
Start with a sorted list.
Compare the middle element
with the one we want. Pick the
half of the list that contains our
element.
Iterate this procedure to
locate the right element.
This takes around log2(N) steps.
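The procedure described above can be sketched as follows. This is an illustrative implementation, not code from the slides: the name binary_search is made up, the list is passed as an array reference (\@genes) to avoid copying it, and string comparison (cmp) is assumed because the entries are gene names.

```perl
use strict;
use warnings;

# Return the index of $target in the sorted array, or -1 if absent.
sub binary_search {
    my ($target, $sorted) = @_;    # $sorted is a reference to a sorted array
    my ($lo, $hi) = (0, $#{$sorted});
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);          # middle element
        my $cmp = $target cmp $sorted->[$mid];   # -1, 0 or 1
        if    ($cmp == 0) { return $mid }        # found it
        elsif ($cmp < 0)  { $hi = $mid - 1 }     # keep the lower half
        else              { $lo = $mid + 1 }     # keep the upper half
    }
    return -1;
}

# Gene names from the earlier table, sorted in Perl's default string order
my @genes = sort qw(Cyp12a5 MRG15 Cop bor Bx42);
print "Sorted: @genes\n";
print "Index of Cop: ", binary_search("Cop", \@genes), "\n";
print "Index of xyz: ", binary_search("xyz", \@genes), "\n";
```

The list must be sorted with the same comparison that the search uses; here both rely on Perl's default string ordering, in which upper-case letters sort before lower-case ones.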
Associative arrays (hashes)
• Implementing algorithms like binary search
is a common task in languages like C.
• Conveniently, Perl provides a type of array
called an associative array (also called a
hash) that is pre-indexed for quick search.
• An associative array is a set of key-value pairs
(like our gene-compartment table)
$comp{"Cop"} = "Golgi";
Curly braces {} are used to
index an associative array
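A few basic hash operations, as a minimal sketch. The exists and delete functions are standard Perl but are not covered on the slide; the entries are taken from the gene-compartment table.

```perl
use strict;
use warnings;

my %comp;
$comp{"Cop"}   = "Golgi";      # add key-value pairs
$comp{"MRG15"} = "Nucleus";

# exists tests for a key without creating it
my $has_cop = exists $comp{"Cop"} ? "yes" : "no";
my $has_bor = exists $comp{"bor"} ? "yes" : "no";

delete $comp{"MRG15"};         # remove a pair
my $size = keys %comp;         # number of keys left

print "Cop present: $has_cop\n";    # yes
print "bor present: $has_bor\n";    # no
print "Entries left: $size\n";      # 1
```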
Reading a table using hashes
open FILE, "genecomp.txt";
while (<FILE>) {
($g, $c) = split;
$comp{$g} = $c;
}
$geneToFind = shift @ARGV;
print "Gene: $geneToFind\n";
print "Compartment: ", $comp{$geneToFind}, "\n";
...with $ARGV[0] = "Cop" as before:
Gene: Cop
Compartment: Golgi
Reading a FASTA file into a hash
sub read_FASTA {
my ($filename) = @_;
my (%name2seq, $name, $seq);
open FILE, $filename;
while (<FILE>) {
chomp;
if (/>/) {
s/>//;
if (defined $name) {
$name2seq{$name} = $seq;
}
$name = $_;
$seq = "";
} else {
$seq .= $_;
}
}
$name2seq{$name} = $seq;
close FILE;
return %name2seq;
}
Formatted output of sequences
sub print_seq {
my ($name, $seq) = @_;
print ">$name\n";
my $width = 50;    # 50-column output
for (my $i = 0; $i < length($seq); $i += $width) {
if ($i + $width > length($seq)) {
$width = length($seq) - $i;
}
print substr ($seq, $i, $width), "\n";
}
}
The term substr($x,$i,$len) returns the substring of
$x starting at position $i with length $len.
For example, substr("Biology",3,3) is "log"
keys and values
• keys returns the list of keys in the hash
– e.g. names, in the %name2seq hash
• values returns the list of values
– e.g. sequences, in the %name2seq hash
%name2seq = read_FASTA ("fly3utr.txt");
print "Sequence names: ",
join (" ", keys (%name2seq)), "\n";
my $len = 0;
foreach $seq (values %name2seq) {
$len += length ($seq);
}
print "Total length: $len\n";
Sequence names: CG11488 CG11604 CG11455
Total length: 210
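The order in which keys and values return their lists is not predictable, so output that should be reproducible is usually sorted first. A small sketch follows: the sequence names come from the output above, but the lengths are made-up numbers used only for illustration.

```perl
use strict;
use warnings;

# Names from the slide; lengths are invented for this example
my %name2len = (CG11488 => 80, CG11604 => 60, CG11455 => 70);

# Sort keys alphabetically
my @by_name = sort keys %name2len;

# Sort keys by their values, largest first
my @by_len = sort { $name2len{$b} <=> $name2len{$a} } keys %name2len;

print "By name: @by_name\n";      # CG11455 CG11488 CG11604
print "By length: @by_len\n";     # CG11488 CG11455 CG11604
```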
Files of sequence names
• Easy way to specify a subset of a given
FASTA database
• Each line is the name of a sequence in a
given database
• e.g.
CG1167
CG685
CG1041
CG1043
Get named sequences
• Given a FASTA database and a "file of sequence
names", print every named sequence:
($fasta, $fosn) = @ARGV;
%name2seq = read_FASTA ($fasta);
open FILE, $fosn;
while ($name = <FILE>) {
chomp $name;
$seq = $name2seq{$name};
if (defined $seq) {
print_seq ($name, $seq);
} else {
warn "Can't find sequence: $name. ",
"Known sequences: ",
join (" ", keys %name2seq), "\n";
}
}
close FILE;
Intersection of two sets
• Two files of sequence names:
fosn1.txt:
CG1167
CG685
CG1041
CG1043
fosn2.txt:
CG215
CG1041
CG483
CG1167
CG1163
• What is the overlap?
• Find intersection using hashes:
open FILE1, "fosn1.txt";
while (<FILE1>) { $gotName{$_} = 1; }
close FILE1;
open FILE2, "fosn2.txt";
while (<FILE2>) {
    print if $gotName{$_};
}
close FILE2;
CG1041
CG1167
Assigning hashes
• A hash can be assigned directly,
as a list of "key=>value" pairs:
%comp = ('Cyp12a5' => 'Mitochondrion',
'MRG15' => 'Nucleus',
'Cop' => 'Golgi',
'bor' => 'Cytoplasm',
'Bx42' => 'Nucleus');
print "keys: ", join(";",keys(%comp)), "\n";
print "values: ", join(";",values(%comp)), "\n";
keys: bor;Cop;Bx42;Cyp12a5;MRG15
values: Cytoplasm;Golgi;Nucleus;Mitochondrion;Nucleus
The genetic code as a hash
%aa = ('ttt'=>'F',
'ttc'=>'F',
'tta'=>'L',
'ttg'=>'L',
'tct'=>'S',
'tcc'=>'S',
'tca'=>'S',
'tcg'=>'S',
'tat'=>'Y',
'tac'=>'Y',
'taa'=>'!',
'tag'=>'!',
'tgt'=>'C',
'tgc'=>'C',
'tga'=>'!',
'tgg'=>'W',
'ctt'=>'L',
'ctc'=>'L',
'cta'=>'L',
'ctg'=>'L',
'cct'=>'P',
'ccc'=>'P',
'cca'=>'P',
'ccg'=>'P',
'cat'=>'H',
'cac'=>'H',
'caa'=>'Q',
'cag'=>'Q',
'cgt'=>'R',
'cgc'=>'R',
'cga'=>'R',
'cgg'=>'R',
'att'=>'I',
'atc'=>'I',
'ata'=>'I',
'atg'=>'M',
'act'=>'T',
'acc'=>'T',
'aca'=>'T',
'acg'=>'T',
'aat'=>'N',
'aac'=>'N',
'aaa'=>'K',
'aag'=>'K',
'agt'=>'S',
'agc'=>'S',
'aga'=>'R',
'agg'=>'R',
'gtt'=>'V',
'gtc'=>'V',
'gta'=>'V',
'gtg'=>'V',
'gct'=>'A',
'gcc'=>'A',
'gca'=>'A',
'gcg'=>'A',
'gat'=>'D',
'gac'=>'D',
'gaa'=>'E',
'gag'=>'E',
'ggt'=>'G',
'ggc'=>'G',
'gga'=>'G',
'ggg'=>'G' );
Translating: DNA to protein
$prot = translate ("gatgacgaaagttgt");
print $prot;
sub translate {
my ($dna) = @_;
$dna = lc ($dna);
my $len = length ($dna);
if ($len % 3 != 0) {
die "Length $len is not a multiple of 3";
}
my $protein = "";
for (my $i = 0; $i < $len; $i += 3) {
my $codon = substr ($dna, $i, 3);
if (!defined ($aa{$codon})) {
die "Codon $codon is illegal";
}
$protein .= $aa{$codon};
}
return $protein;
}
DDESC
Counting residue frequencies
%count = count_residues ("gatgacgaaagttgt");
@residues = keys (%count);
foreach $residue (@residues) {
print "$residue: $count{$residue}\n";
}
sub count_residues {
my ($seq) = @_;
my %freq;
$seq = lc ($seq);
for (my $i = 0; $i < length($seq); ++$i) {
my $residue = substr ($seq, $i, 1);
++$freq{$residue};
}
return %freq;
}
g: 5
a: 5
c: 1
t: 4
Counting N-mer frequencies
%count = count_nmers ("gatgacgaaagttgt", 2);
@nmers = keys (%count);
foreach $nmer (@nmers) {
print "$nmer: $count{$nmer}\n";
}
sub count_nmers {
my ($seq, $n) = @_;
my %freq;
$seq = lc ($seq);
for (my $i = 0; $i <= length($seq) - $n; ++$i) {
my $nmer = substr ($seq, $i, $n);
++$freq{$nmer};
}
return %freq;
}
cg: 1
tt: 1
ga: 3
tg: 2
gt: 2
aa: 2
ac: 1
at: 1
ag: 1
N-mer frequencies for a whole file
my %name2seq = read_FASTA ("fly3utr.txt");
while (($name, $seq) = each %name2seq) {
%count = count_nmers ($seq, 2, %count);
}
@nmers = keys (%count);
foreach $nmer (@nmers) {
print "$nmer: $count{$nmer}\n";
}
sub count_nmers {
my ($seq, $n, %freq) = @_;
$seq = lc ($seq);
for (my $i = 0; $i <= length($seq) - $n; ++$i) {
my $nmer = substr ($seq, $i, $n);
++$freq{$nmer};
}
return %freq;
}
The each command is a shorthand for looping
through each (key,value) pair in a hash
ct: 5
tc: 9
tt: 26
cg: 4
ga: 11
tg: 12
gc: 2
gt: 17
aa: 39
ac: 10
gg: 4
at: 17
ca: 11
ag: 15
ta: 20
cc: 2
Note how we keep passing %freq back into
the count_nmers subroutine, to get
cumulative counts
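Passing %freq in and out copies the whole hash on every call. One alternative, sketched here as an assumption rather than as the slides' approach, is to pass a reference to the hash (\%count) and update it in place. The name add_nmer_counts and the two short test sequences are made up.

```perl
use strict;
use warnings;

# Add the N-mer counts of $seq to the hash that $freq refers to
sub add_nmer_counts {
    my ($seq, $n, $freq) = @_;    # $freq is a hash reference
    $seq = lc $seq;
    for (my $i = 0; $i <= length($seq) - $n; ++$i) {
        ++$freq->{ substr($seq, $i, $n) };
    }
    return;
}

my %count;
foreach my $seq ("gatgac", "gaaag") {    # made-up sequences
    add_nmer_counts($seq, 2, \%count);   # counts accumulate in %count
}

foreach my $nmer (sort keys %count) {
    print "$nmer: $count{$nmer}\n";
}
```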
Files and filehandles
• Opening a file: open XYZ, $filename;
• Closing a file: close XYZ;
• Reading a line: $data = <XYZ>;
• Reading an array: @data = <XYZ>;
• Printing a line: print XYZ $data;
• Read-only: open XYZ, "<$filename";
• Write-only: open XYZ, ">$filename";
• Test if file exists:
if (-e $filename) {
    print "$filename exists!\n";
}
In each case, XYZ is the filehandle
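None of the open calls above check whether the file could actually be opened. The sketch below shows a common, more defensive style: the three-argument form of open, a lexical filehandle held in a "my" variable, and "or die" with the system error in $!. It is an illustration, not the slides' code. The temporary file is created with the core File::Temp module only so that the example can run anywhere.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a scratch file (deleted automatically when the script exits)
my ($tmp_fh, $filename) = tempfile(UNLINK => 1);
close $tmp_fh;

# Write-only: '>' is given as a separate argument
open(my $out, '>', $filename) or die "Can't write $filename: $!";
print $out "CG1167\nCG685\n";
close $out;

# Read-only: read all lines into an array, then strip the newlines
open(my $in, '<', $filename) or die "Can't read $filename: $!";
my @names = <$in>;
close $in;
chomp @names;

print "Read ", scalar(@names), " names: @names\n";
```

Keeping the mode separate from the file name means a name that happens to begin with "<" or ">" is not misread as a mode.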