Lecture 7: Perl pattern handling features Pattern Matching • Recall =~ is the pattern matching operator • A first simple match example • print “An methionine amino acid is found ” if $AA =~ /m/; – It means if $AA (string) contains the m then print methionine amino acid found. – What is inside the / / is the pattern and =~ is the pattern matching symbol – It could also be written as • if ($dna =~ /m/) • { – print “An methionine amino acid is found ”; • } – Met.pl Pattern Matching – If we want to check for the start codon we could use: – if ($seq =~ /ATG/ ) – { • Print “a start codon was found on line number\n” } – Or could write if /ATG / i (where I stands for case) – if we want to see if there is an A or T or G or C in the sequence use: $seq =~ /[ATGC]/ – The main way to use the Boolean OR is • If ( $dna =~ /GAATTC|AAGCTT/) | (Boolean Or symbol) • { – Print “EcoR1 site found!!!”; • } – (note EcoR1 is an important DNA sequence) Sequence size example • File_size_2 example – – – – #!/usr/bin/perl # file size2.pl $length = 0; $lines = 0; while (<>) { • chomp; • $length = $length + length $_ if $_ =~ /[GATCNgatcn]/; # n refers to any nucelotide • #{refer to http://blast.ncbi.nlm.nih.gov/blastcgihelp.shtml} – • $lines = $lines + 1; – } – print "LENGTH = $length\n"; print "LINES = $lines\n"; • The above is a modification of the length of the file example to include only files that have G or A or T or C in the input line. • However this will lead to problems for FASTA files as the descriptor line will be included: Why? Pattern Matching • A NOT Boolean operator such as to see if the pattern contains letters that are not vowels can be represented via pattern handling by using the ^ symbol and a set of characters: e.g. – If ($seq =~ /[^aeiou]/ {print “no vowel”}; • More flexible pattern syntax: • Quite common to check for words or numbers so perl has represented as: – /[0-9]/ or/ \d/ is a digit – A word character is /[a-zA-Z0-9_]/ and is represented by /\w/ (word) – / \s/ represents a white space – By invert the case of the letter it has the reverse meaning; e.g. /\S/ (non white space) • A more complete list of what are referred to as “metacharacters” is shown in the next slide (you must of course use =~ in expression) • • • • • • • • • • • • • • Pattern matching: metacharacters Metacharacter Description . Any character except newline \. Full stop character ^ The beginning of a line $ The end of a line \w Any word character (non-punctuation, non-white space) \W Any non-word character \s White space (spaces, tabs, carriage returns) \S Non-white space \d Any digit \D Any non-digit You can also specify the number of times [ single, multiple or specific multiple] More information on metacharacters here: metacharacters and other regular expresions note (abc) \1 \2 are important for comparing sets of characters). Pattern matching: Quantifiers • Quantifier Description –? – + –* – {N} – {N,M} – {N, } – { ,M} 0 or 1 occurrence 1 or more occurrences 0 or more occurrences n occurrences Between N and M occurrences At least N occurrences No more than M occurrences Pattern matching: Quantifiers • Consider the following pattern – DT249 4 (your class code) consists of [one or more word characters; then a space and then a digit so the match is: • { =~/\w+\s\d/ } • If the sequence has the following format: – Pu-C-X(40-80)-Pu-C • Pu [AG] and X[ATGC] – $sequence =~ /[AG]C[GATC]{40,80}[AG]C/; Quantify.pl Pattern Matching • To determine where to look for a “pattern” in a sequence: • Anchors – The start of line anchor ^ {note it is like the Boolean not operator but it is within [^aeiou]} • /^>/ only those beginning with > – The end of line character $ • />$/ only where the last character is > – /^$/ : what does this mean? – The boundary anchor \b • E.g. Matching a word exactly: • /\bword\b/ where \b boundary: just looks for “word” and not a sequence of the letters such as w o r and d – The non boundary anchor is \B • /\Bword\B/ look for words like unworthy, trustworthy….. But not worthy or word Sequence Size example: modified • File_size_2 example – – – – #!/usr/bin/perl # file size2.pl $length = 0; $lines = 0; while (<>) { • chomp; • $length = $length + length $_ if $_ =~ /[GATCNgatcn]+$/; – #Alternative: $length += length if /^[GATCN]+$ / i; • $lines = $lines + 1; –} – print "LENGTH = $length\n"; print "LINES = $lines\n"; • Refer to DNA sequence codes to see meaning of A…N Extracting Patterns • • • • The second aspect of Perl pattern handling is: Pattern extraction: Consider a sequence like > M185580, clone 333a, complete sequence – M18… is the sequence ID – Clone 33a, com…. : optional comments • Need to stored some of elements of the descriptor line: – $seq =~/ ( \S+)/ part of the match is extracted and put into variable $1; Extracting patterns • #! /usr/bin/perl –w • # demonstrates the effect of parentheses. • while ( my $line = <> ) • { • $line =~ /\w+ (\w+) \w+ (\w+)/; • print "Second word: '$1' on line $..\n" if defined $1; • print "Fourth word: '$2' on line $..\n" if defined $2; • } – Change it to catch the first and the 3 word of a sentence • More examples in ExtractExample1.pl Search/replace and trans-literial • s/t/u/ replace (t)thymine with (u) Uracil; once only • s/t/u/g (g = global) so scan the whole string • s/t/u/gi (global and case insensitive) – What about the following : – s/^\s+// – s/\s+$// – s/\s+/ /g (where g stands for global) • • The transliteration search and replace function – $seq =~ tr/ATGC/TACG/; gets the compliment of a string of characters. (the normal search and replace works in a different way to the tr function) • Refer to SearchReplace.pl Search /replace/extract • Write a program that • removes the > from the FASTA line descriptor and assigns each element to appropriate variables. • Example Fastafile_replace.txt – – – – – – – – – >gi|171361, Saccharomyces cerevisiae, cystathionine gamma-lyase GCAGCGCACGACAGCTGTGCTATCCCGGCGAGCCCGTGGCAGAGGACCTCGCTTGCGAAAGCATCGAGTACC GCTACAGAGCCAACCCGGTGGACAAACTCGAAGTCATTGTGGACCGAATGAGGCTCAATAACGAGATTAGCG ACCTCGAAGGCCTGCGCAAATATTTCCACTCCTTCCCGGGTGCTCCTGAGTTGAACCCGCTTAGAGACTCCG AAATCAACGACGACTTCCACCAGTGGGCCCAGTGTGACCGCCACACTGGACCCCATACCACTTCTTTTTGTT ATTCTTAAATATGTTGTAACGCTATGTAATTCCACCCTTCATTACTAATAATTAGCCATTCACGTGATCTCA GCCAGTTGTGGCGCCACACTTTTTTTTCCATAAAAATCCTCGAGGAAAAGAAAAGAAAAAAATATTTCAGTT ATTTAAAGCATAAGATGCCAGGTAGATGGAACTTGTGCCGTGCCAGATTGAATTTTGAAAGTACAATTGAGG CCTATACACATAGACATTTGCACCTTATACATATAC Exercises • Write a script that: 1. Confirms if the user has input the code in the following format: • • Classcode_yearcode(papercode) E.g dt249 4(w203c) 2. Many important DNA sequences have specific patters; e.g. TATA write a script to find the position of this sequence in a FASTA file sequence. Exercises 3. Write a script that can find the reverse complement of an DNA sequence without using the tr function. (Hint: a global search and replace will give an incorrect answer) 4. Coding regions begin win the AUG (ATG) codon and end with a stop codons. Write a perl script that extract a coding sequence from a FASTA file. Exercise 5. Modify the Sequence size example from earlier to: – Allow the user to input a file name and determine its length. Exam Questions • Perl is a important bioinformatics language. Explain the main features of perl that make it suitable for bioinformatics (10 marks) • Write a perl script that illustrates its pattern matching extraction and substitution ability. (6 marks) • (refer to assignment/previous papers perl scripts)