Lecture 9:

Bioinformatics
Lecture 8: Perl pattern matching features
Questions to think about
• Create a hash table that performs the codon
to amino acid conversion and use it to convert
codons (entered from the keyboard) into
their corresponding amino acids
• Write a script that extracts the gene ID and
gene name from the descriptor header of a
DNA FASTA file
Questions to think about
• Write a script that reads in the DNA sequences
from two FASTA files (assume the sequence
length is the same for both) and determines
the number of aligned positions that match
versus the number that do not
Introduction
• Pattern matching
• Pattern extraction
• Pattern substitution
• Split and join functions
• Unpack function
Pattern Matching
• Recall =~ is the pattern matching operator
• A first simple match example
– print "EcoRI site found!" if $dna =~ /GAATTC/;
– It means: if the string $dna contains the pattern GAATTC (the EcoRI
recognition site), print "EcoRI site found!". What is between the two /
characters is the pattern, and =~ is the pattern matching operator
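• More patterns
– if ($dna =~ /[GATCgatc]/) (contains at least one base letter, in either case)
– if (/^[GATC]/i) (tests $_; the i modifier makes the match case-insensitive)
– if ($dna =~ /GAATTC|AAGCTT/) (| is the Boolean OR symbol: either
pattern may match)
• print "EcoRI or HindIII site found!";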
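The matches on this slide can be tried on a short string. The following is only a minimal sketch: the helper name find_sites and the example sequence are made up for illustration, and the i modifier is added so that lowercase sequences also match.

```perl
use strict;
use warnings;

# Hypothetical helper: report which restriction sites occur in a sequence.
# GAATTC is the EcoRI site and AAGCTT is the HindIII site.
sub find_sites {
    my ($dna) = @_;
    my @found;
    push @found, 'EcoRI'   if $dna =~ /GAATTC/i;
    push @found, 'HindIII' if $dna =~ /AAGCTT/i;
    return @found;
}

my $dna = 'ccgGAATTCttaagcttgg';    # made-up example sequence
my @sites = find_sites($dna);
print "Sites found: @sites\n";

# The alternation form answers the simpler question "is either site present?"
print "At least one site found\n" if $dna =~ /GAATTC|AAGCTT/i;
```

Keeping the two patterns separate tells you which site matched; the single alternation pattern only tells you that one of them did.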
Pattern Matching
• A More flexible pattern:
– print “EcoRI site found!” if $dna =~ /GAA[GATC]TTC/;
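– The 4th letter of the pattern can be any letter listed within the square brackets
– [^GATC] means any character other than G, A, T or C
– [0-9] or \d (digit)
– [a-z] (lowercase letter)
– [-A-Z] (hyphen or uppercase letter)
– /[AT][GC][TG]/ (three positions, each with its own set of allowed letters)
– /[a-zA-Z0-9_]/ or /\w/ (word character)
– /\s/ (white space); to invert \s, uppercase the letter: \S (non-white
space)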
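A short sketch of character classes in use, assuming two hypothetical helpers (is_plain_dna and count_non_bases are names invented here, not part of the lecture files). It contrasts [GATC] with the negated class [^GATC].

```perl
use strict;
use warnings;

# Hypothetical helper: true if the string consists only of G, A, T or C.
sub is_plain_dna {
    my ($seq) = @_;
    return $seq =~ /^[GATC]+$/i ? 1 : 0;
}

# Hypothetical helper: count the characters that are NOT G, A, T or C.
sub count_non_bases {
    my ($seq) = @_;
    # Matching in list context returns every match; assigning that list
    # to a scalar through () gives the number of matches.
    my $count = () = $seq =~ /[^GATC]/gi;
    return $count;
}

print "gattaca is plain DNA\n" if is_plain_dna('gattaca');
print "Non-base characters in GATNNCA-T: ", count_non_bases('GATNNCA-T'), "\n";
```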
Pattern matching: metacharacters
Metacharacter   Description
.               Any character except newline
\.              A literal full stop character
^               The beginning of a line
$               The end of a line
\w              Any word character (letter, digit or underscore)
\W              Any non-word character
\s              White space (spaces, tabs, newlines, carriage returns)
\S              Non-white space
\d              Any digit
\D              Any non-digit
You can also specify how many times an element must occur (once, many times, or a specific number of times) by using quantifiers
More information on variations of metacharacters here: metacharacters
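The metacharacters in the table can be combined with the g modifier to count things in a string. This is a sketch only; the three helper names are hypothetical and the test strings are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the text contains a literal full stop.
# An unescaped . would match any character, so the backslash matters.
sub has_full_stop {
    my ($text) = @_;
    return $text =~ /\./ ? 1 : 0;
}

# Hypothetical helper: count the digits (\d) in a string.
sub count_digits {
    my ($text) = @_;
    my $n = () = $text =~ /\d/g;
    return $n;
}

# Hypothetical helper: count runs of non-white-space characters (\S+).
sub count_fields {
    my ($text) = @_;
    my $n = () = $text =~ /\S+/g;
    return $n;
}

print "Digits in M58200.2: ", count_digits('M58200.2'), "\n";
print "Fields: ", count_fields('clone 333a, complete sequence'), "\n";
```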
Pattern matching: Quantifiers
• Quantifier   Description
– ?            0 or 1 occurrence
– +            1 or more occurrences
– *            0 or more occurrences
– {N}          Exactly N occurrences
– {N,M}        Between N and M occurrences
– {N,}         At least N occurrences
– {0,M}        No more than M occurrences (the shorthand {,M} is accepted only from Perl 5.34 onwards)
Pattern matching: Quantifiers
• Pattern to match the following format:
M58200.2 { =~ /\w+\.\d+/ }
• If the sequence is: Pu-C-X(40-80)-Pu-C
• where Pu is [AG] and X is [ATGC]
– $sequence =~ /[AG]C[GATC]{40,80}[AG]C/;
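Both quantifier examples from this slide can be checked directly. In this sketch the helper names are hypothetical, the spacer of repeated T bases is made up, and the ^ and $ anchors on the accession pattern are an added assumption so that the whole string must have the accession format.

```perl
use strict;
use warnings;

# Hypothetical helper: does the string look like an accession with a
# version suffix, such as M58200.2?
sub looks_like_accession {
    my ($id) = @_;
    return $id =~ /^\w+\.\d+$/ ? 1 : 0;
}

# Hypothetical helper: does the sequence contain Pu-C-X(40-80)-Pu-C?
sub has_motif {
    my ($sequence) = @_;
    return $sequence =~ /[AG]C[GATC]{40,80}[AG]C/ ? 1 : 0;
}

my $long_spacer  = 'T' x 40;    # made-up spacer, long enough
my $short_spacer = 'T' x 10;    # made-up spacer, too short

print "M58200.2 is an accession\n" if looks_like_accession('M58200.2');
print "Motif with 40-base spacer: ", has_motif("AC${long_spacer}GC"), "\n";
print "Motif with 10-base spacer: ", has_motif("AC${short_spacer}GC"), "\n";
```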
Extracting pattern to variables
• Anchors
– E.g. matching a word exactly:
• /\bword\b/ (\b is a word boundary): matches word only as a whole
word, not the letters w, o, r, d inside a longer word such as sword
– The start of line anchor ^
• /^>/ matches only lines beginning with >
– The end of line anchor $
• />$/ matches only lines whose last character is >
– /^$/ : what does this mean?
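A sketch of the ^ anchor and the \b boundary, using two hypothetical helpers and made-up input strings:

```perl
use strict;
use warnings;

# Hypothetical helper: true if the line is a FASTA descriptor (starts with >).
sub is_header {
    my ($line) = @_;
    return $line =~ /^>/ ? 1 : 0;
}

# Hypothetical helper: true if "word" occurs as a whole word in the text.
sub has_whole_word {
    my ($text) = @_;
    return $text =~ /\bword\b/ ? 1 : 0;
}

print "Header line\n"        if is_header('>M185580 clone 333a');
print "Not a header line\n"  unless is_header('GATTACA>');
print "Whole word found\n"   if has_whole_word('the word is here');
print "No whole word here\n" unless has_whole_word('a sharp sword');
```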
Further examples
• File_size_base_only.pl example
    #!/usr/bin/perl
    # file size2.pl
    $length = 0; $lines = 0;
    while (<>) {
        chomp;
        # add the line's length if the line contains at least one base letter
        $length = $length + length $_ if $_ =~ /[GATCgatc]/;
        # Alternative (whole line must be bases): $length += length if /^[GATCN]+$/i;
        $lines = $lines + 1;
    }
    print "LENGTH = $length\n"; print "LINES = $lines\n";
FASTA files
Write and test (file_size_bases_only.pl) using a FASTA file
as input: FASTADNA1.txt:
Example of a FASTA file:
>2L52.1 CE20433 Zinc finger, C2H2 type (CAMBRIDGE) protein id:CAA21776.1
GCAGCGCACGACAGCTGTGCTATCCCGGCGAGCCCGTGGCAGAGGACCTCGCTTGCGAAAGCATCGAGTACC
Sample of a file in EMBL format:
gccacagatt acaggaagtc atatttttag acctaaatca ctatcctcta tctttcagca 60
agaaaagaac atctacttgg tttcgttccc tatccaagat tcagatggtg aaacgagtga 120
tcatgcacct gatgaacgtg caaaaccaca gtcaagccat gacaaccccg atctacagtt 180
tgatgttgaa actgccgatt ggtacgccta cagtgaaaac tatggcacaa gtgaagaaaa 240
Sample of an NCBI record format:
1 atgaacccca acctgtgggt cgacgcgcag agcacttgca agagggaatg cgacgctgac
61 ctggagtgcg agacctttga gaagtgctgc cccaatgtct gtggaaccaa gagctgtgtg
121 gctgctcggt acatggacat caaggggaag aaggggcctg tggggatgcc caaagaggca
181 acctgtgacc gcttcatgtg catccagcaa ggctcagagt gcgacatctg ggacgggcag
241 cctgtctgca agtgcaagga caggtgtgag aaggagccga gctttacctg cgcctcggac
Extracting Patterns
• Consider a sequence like
• >M185580 clone 333a, complete sequence
– > marks the start of the descriptor line
– M18… is the sequence ID
– clone 333a, com… : optional comments
• We need to store some elements of the descriptor
line:
• =~ /^>(\S+)/ : the part of the match inside the parentheses is extracted
and put into the variable $1
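Capturing parentheses can pull both parts of a descriptor line into variables. This is a minimal sketch: parse_descriptor is a hypothetical helper, and the pattern assumes the ID is the first run of non-space characters after the > with everything after it being the optional comment.

```perl
use strict;
use warnings;

# Hypothetical helper: split a FASTA descriptor line into (ID, comment).
sub parse_descriptor {
    my ($line) = @_;
    chomp $line;
    # $1 = non-space characters straight after >, $2 = the rest (may be empty)
    if ($line =~ /^>(\S+)\s*(.*)$/) {
        return ($1, $2);
    }
    return;    # empty list: not a descriptor line
}

my ($id, $comment) = parse_descriptor(">M185580 clone 333a, complete sequence\n");
print "ID: $id\n";
print "Comment: $comment\n";
```

Testing the match in an if before reading $1 and $2 avoids picking up values left over from an earlier successful match.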
Extracting patterns
    #!/usr/bin/perl -w
    # demonstrates the effect of parentheses.
    while ( my $line = <> )
    {
        $line =~ /\w+ (\w+) \w+ (\w+)/;
        print "Second word: '$1' on line $..\n" if defined $1;
        print "Fourth word: '$2' on line $..\n" if defined $2;
    }
– Change it to capture the first and the third word of a sentence
Search and replace
• s/t/u/ replaces t (thymine) with u (uracil), first occurrence only
• s/t/u/g (g = global) replaces every occurrence in the string
• s/t/u/gi (global and case-insensitive): T is also replaced, by a lowercase u
• What about the following :
• s/^\s+//
• s/\s+$//
• s/\s+$/ /g (where g stands for global)
•
• Write a Perl script that reads in the DNA sequences from
FastaDNA1file.txt and replaces all the thymine bases
with the corresponding uracil bases
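The substitutions on this slide can be tried on single strings before being applied to a file. This is a sketch under stated assumptions: dna_to_rna and trim are hypothetical helper names, and two case-specific substitutions are used instead of the i modifier so that uppercase bases stay uppercase.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every thymine with uracil, keeping the case.
sub dna_to_rna {
    my ($seq) = @_;
    $seq =~ s/t/u/g;
    $seq =~ s/T/U/g;
    return $seq;
}

# Hypothetical helper: remove leading and trailing white space.
sub trim {
    my ($text) = @_;
    $text =~ s/^\s+//;    # white space at the start of the string
    $text =~ s/\s+$//;    # white space at the end of the string
    return $text;
}

print dna_to_rna('GATtaca'), "\n";
print '[', trim("   gattaca \n"), "]\n";
```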
Splits and joins
• To transform strings into arrays: split
– Line 1 looks like:
– 192a8,The Sanger Centre,GGGTTCCGATTTCCAA,CCTTAGGCCAAATTAAGGCC
– Consider the following code:
– chomp($line = <>); # read the line into $line
– @fields = split ',', $line;
– ($clone,$laboratory,$left_oligo,$right_oligo) = split ',', $line;
• This reads in line 1 and puts each part between the delimiters (e.g. 192a8) into an
element of the array, or into the named scalar variables
• To transform arrays (lists) into strings: join
• $tab = join "\t", @fields;
• 192a8 The Sanger Centre GGGTTCCGATTTCCAA CCTTAGGCCAAATTAAGGCC
#initialize an array
my @perlFunc = ("substr","grep","defined","undef");
my $perlFunc = join " ", @perlFunc;
print "Perl Functions: $perlFunc\n";
– See example split_file.pl
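A round trip through split and join, as a sketch only: the record is typed in as a string instead of being read from a file, and the helper names parse_record and to_tab_line are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: split one comma-separated record into its fields.
sub parse_record {
    my ($line) = @_;
    chomp $line;
    return split /,/, $line;
}

# Hypothetical helper: join fields into one tab-separated string.
sub to_tab_line {
    my (@fields) = @_;
    return join "\t", @fields;
}

my $line = "192a8,The Sanger Centre,GGGTTCCGATTTCCAA,CCTTAGGCCAAATTAAGGCC\n";
my ($clone, $laboratory, $left_oligo, $right_oligo) = parse_record($line);

print "Clone: $clone\n";
print "Laboratory: $laboratory\n";
print to_tab_line($clone, $laboratory, $left_oligo, $right_oligo), "\n";
```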
Other useful functions
• Other useful functions:
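– unpack syntax:
• @triplets = unpack("a3" x (length($line)/3), $line);
• Frame shift (1 position to the right):
• @triplets = unpack('a' . "a3" x (length($line)/3), $line);
(the leading 'a' takes the first base as a one-letter element, so the triplets that follow start one position to the right)
• Unpack_codons.pl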
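The unpack idea can be wrapped in one routine that handles all three reading frames. This is a sketch, not the contents of Unpack_codons.pl: the helper name codons is hypothetical, and it uses the x template code (skip one byte) for the frame shift so that the skipped bases do not appear in the result.

```perl
use strict;
use warnings;

# Hypothetical helper: cut a sequence into whole codons, starting
# $offset bases (0, 1 or 2) from the left.
sub codons {
    my ($seq, $offset) = @_;
    $offset ||= 0;
    my $n = int((length($seq) - $offset) / 3);    # whole triplets only
    $n = 0 if $n < 0;
    # 'x' skips one byte per repetition; 'a3' extracts three characters
    my $template = ('x' x $offset) . ('a3' x $n);
    return unpack($template, $seq);
}

my $seq = 'ATGGCCTAA';    # made-up sequence
print "Frame 0: @{[ codons($seq, 0) ]}\n";
print "Frame 1: @{[ codons($seq, 1) ]}\n";
```

Rounding the triplet count down with int drops an incomplete codon at the end instead of returning a short final element.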
Questions
• Modify file_bases_size_only.pl to count the
number of bases for a file in EMBL format
and one in NCBI format
• Using FASTADNA1.txt: extract the sections of
the descriptor line into appropriate scalar variables.
• Assuming the DNA sequence in FastaDNA1file.txt
is the complementary (antisense) strand, print
the mRNA produced when the primary strand (sequence)
is transcribed
Exam Questions
• Perl is an important bioinformatics language.
Explain the main features of Perl that make it
appealing to the field of bioinformatics.
– Write a script that extracts the gene ID and gene
name from the descriptor header of a DNA FASTA file
– Write a Perl script that only reads and prints DNA
sequences from a FASTA file.
– Write a script that reads in the DNA sequences from
two FASTA files (assume the sequence length is the
same for both) and determines the number of
alignment matches versus non-matches
FastaDNA1file.txt
– Write a script that reads in the DNA sequences
from two FASTA files (assume the sequence length
is the same for both) and illustrates the number of
alignment matches versus non-matches.