Basics of Pattern matching

Lecture 7: Perl pattern handling features

Pattern Matching
• Recall that =~ is the pattern matching operator.
• A first simple match example:
    print "A methionine amino acid is found" if $AA =~ /m/;
  – This means: if the string $AA contains the letter m, print that a methionine amino acid was found.
  – What is inside the / / is the pattern, and =~ is the pattern matching operator.
  – It could also be written as:
    if ($AA =~ /m/)
    {
        print "A methionine amino acid is found";
    }
  – See Met.pl

Pattern Matching
• To check for the start codon we could use:
    if ($seq =~ /ATG/)
    {
        print "a start codon was found on line number\n";
    }
• Or we could write if (/ATG/i), which matches against $_; the i modifier makes the match case-insensitive.
• To see whether there is an A, T, G or C anywhere in the sequence, use a character class: $seq =~ /[ATGC]/
• The main way to express a Boolean OR is the | symbol (alternation):
    if ($dna =~ /GAATTC|AAGCTT/)
    {
        print "EcoRI or HindIII site found!";
    }
  – (Note: GAATTC is the recognition site of the restriction enzyme EcoRI, an important DNA sequence; AAGCTT is the HindIII site.)

Sequence size example
• File_size_2 example:
    #!/usr/bin/perl
    # file size2.pl
    $length = 0;
    $lines = 0;
    while (<>) {
        chomp;
        # N refers to any nucleotide
        # (refer to http://blast.ncbi.nlm.nih.gov/blastcgihelp.shtml)
        $length = $length + length $_ if $_ =~ /[GATCNgatcn]/;
        $lines = $lines + 1;
    }
    print "LENGTH = $length\n";
    print "LINES = $lines\n";
• The above is a modification of the file length example that counts only input lines containing G, A, T, C or N.
• However, this will lead to problems for FASTA files, as the descriptor line will be included. Why?

Pattern Matching
• A Boolean NOT, for example to test whether a string contains characters that are not vowels, is expressed by placing the ^ symbol at the start of a set of characters, e.g.
    if ($seq =~ /[^aeiou]/) { print "contains a character that is not a vowel"; }
• More flexible pattern syntax:
• It is quite common to check for words or numbers, so Perl provides shorthand classes:
  – /[0-9]/ or /\d/ matches a digit
  – A word character is /[a-zA-Z0-9_]/, written /\w/ (word)
  – /\s/ matches a white space character
  – Inverting the case of the letter reverses the meaning, e.g. /\S/ (non white space)
• A more complete list of what are referred to as "metacharacters" follows (you must of course still use =~ in the expression).

Pattern matching: metacharacters
    Metacharacter   Description
    .               Any character except newline
    \.              Full stop character
    ^               The beginning of a line
    $               The end of a line
    \w              Any word character (non-punctuation, non-white space)
    \W              Any non-word character
    \s              White space (spaces, tabs, newlines, carriage returns)
    \S              Non-white space
    \d              Any digit
    \D              Any non-digit
• You can also specify the number of times a pattern must occur (single, multiple or a specific multiple).
• More information on metacharacters here: metacharacters and other regular expressions. Note that parentheses (abc) capture a group and \1, \2 refer back to captured groups, which is important for comparing sets of characters.

Pattern matching: Quantifiers
    Quantifier   Description
    ?            0 or 1 occurrence
    +            1 or more occurrences
    *            0 or more occurrences
    {N}          Exactly N occurrences
    {N,M}        Between N and M occurrences
    {N,}         At least N occurrences
    {0,M}        No more than M occurrences (the short form {,M} is only a quantifier in Perl 5.34 and later)

Pattern matching: Quantifiers
• Consider the following pattern:
  – DT249 4 (your class code) consists of one or more word characters, then a space, then a digit, so the match is:
      =~ /\w+\s\d/
• If the sequence has the following format:
  – Pu-C-X(40-80)-Pu-C
      where Pu (a purine) is [AG] and X (any base) is [ATGC]
  – $sequence =~ /[AG]C[GATC]{40,80}[AG]C/;
• See Quantify.pl

Pattern Matching
• To determine where to look for a "pattern" in a sequence, use anchors:
  – The start of line anchor ^ (note that inside a character class, as in [^aeiou], the same symbol means NOT)
      /^>/ matches only lines beginning with >
  – The end of line anchor $
      />$/ matches only lines whose last character is >
  – /^$/ : what does this mean?
  – The boundary anchor \b
      E.g. matching a word exactly: /\bword\b/ finds "word" only as a whole word, not the same letters embedded in a longer word
  – The non-boundary anchor \B
      /\Bworth\B/ finds words like unworthy and trustworthy, but not worthy or word

Sequence Size example: modified
• File_size_2 example:
    #!/usr/bin/perl
    # file size2.pl
    $length = 0;
    $lines = 0;
    while (<>) {
        chomp;
        # Anchoring at both ends counts only lines made up entirely of bases
        $length = $length + length $_ if $_ =~ /^[GATCNgatcn]+$/;
        # Alternative: $length += length if /^[GATCN]+$/i;
        $lines = $lines + 1;
    }
    print "LENGTH = $length\n";
    print "LINES = $lines\n";
• Refer to DNA sequence codes to see the meaning of A…N

Extracting Patterns
• The second aspect of Perl pattern handling is pattern extraction.
• Consider a sequence descriptor like:
    > M185580, clone 333a, complete sequence
  – M18… is the sequence ID
  – clone 333a, com… : optional comments
• We need to store some of the elements of the descriptor line:
  – $seq =~ /(\S+)/; the part of the match inside the parentheses is extracted and put into the variable $1

Extracting patterns
    #!/usr/bin/perl -w
    # Demonstrates the effect of parentheses.
    while ( my $line = <> )
    {
        # Use $1 and $2 only when the match succeeds; after a failed
        # match they still hold values from the last successful match.
        if ( $line =~ /\w+ (\w+) \w+ (\w+)/ )
        {
            print "Second word: '$1' on line $..\n";
            print "Fourth word: '$2' on line $..\n";
        }
    }
  – Change it to catch the first and the third word of a sentence.
• More examples in ExtractExample1.pl

Search/replace and transliteration
• s/t/u/ replaces t (thymine) with u (uracil), once only
• s/t/u/g (g = global) scans the whole string and replaces every t
• s/t/u/gi (global and case-insensitive)
  – What about the following?
  – s/^\s+//
  – s/\s+$//
  – s/\s+/ /g (where g stands for global)
• The transliteration search and replace function:
  – $seq =~ tr/ATGC/TACG/; gives the complement of a string of bases. (The normal search and replace works in a different way from the tr function: tr maps single characters one-to-one in a single pass, whereas s/// replaces text that matches a pattern.)
• Refer to SearchReplace.pl

Search/replace/extract
• Write a program that removes the > from the FASTA descriptor line and assigns each element to an appropriate variable.
• Example Fastafile_replace.txt:
    >gi|171361, Saccharomyces cerevisiae, cystathionine gamma-lyase
    GCAGCGCACGACAGCTGTGCTATCCCGGCGAGCCCGTGGCAGAGGACCTCGCTTGCGAAAGCATCGAGTACC
    GCTACAGAGCCAACCCGGTGGACAAACTCGAAGTCATTGTGGACCGAATGAGGCTCAATAACGAGATTAGCG
    ACCTCGAAGGCCTGCGCAAATATTTCCACTCCTTCCCGGGTGCTCCTGAGTTGAACCCGCTTAGAGACTCCG
    AAATCAACGACGACTTCCACCAGTGGGCCCAGTGTGACCGCCACACTGGACCCCATACCACTTCTTTTTGTT
    ATTCTTAAATATGTTGTAACGCTATGTAATTCCACCCTTCATTACTAATAATTAGCCATTCACGTGATCTCA
    GCCAGTTGTGGCGCCACACTTTTTTTTCCATAAAAATCCTCGAGGAAAAGAAAAGAAAAAAATATTTCAGTT
    ATTTAAAGCATAAGATGCCAGGTAGATGGAACTTGTGCCGTGCCAGATTGAATTTTGAAAGTACAATTGAGG
    CCTATACACATAGACATTTGCACCTTATACATATAC

Exercises
• Write a script that:
  1. Confirms whether the user has input the code in the following format:
       Classcode_yearcode(papercode), e.g. dt249 4(w203c)
  2. Many important DNA sequences have specific patterns, e.g. TATA. Write a script to find the position of this sequence in a FASTA file sequence.

Exercises
  3. Write a script that can find the reverse complement of a DNA sequence without using the tr function.
     (Hint: a global search and replace will give an incorrect answer.)
  4. Coding regions begin with the AUG (ATG) codon and end with a stop codon. Write a Perl script that extracts a coding sequence from a FASTA file.

Exercise
  5. Modify the Sequence size example from earlier to:
     – Allow the user to input a file name and determine its length.

Exam Questions
• Perl is an important bioinformatics language. Explain the main features of Perl that make it suitable for bioinformatics. (10 marks)
• Write a Perl script that illustrates its pattern matching, extraction and substitution ability. (6 marks)
• (Refer to assignment/previous papers Perl scripts.)
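
As a recap, the three operations covered in this lecture (matching with =~, extraction with parentheses, and substitution or transliteration) can be sketched together in one short script. This is only a minimal illustration under stated assumptions: the helper names parse_descriptor, to_rna and complement are hypothetical and do not come from the lecture scripts, the descriptor is assumed to have the ">ID, comments" layout shown earlier, and the short sequence is made up for the demonstration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Extraction: split a FASTA descriptor of the assumed form ">ID, comments"
# into its ID ($1) and optional comments ($2) using capture groups.
# parse_descriptor is a hypothetical helper name.
sub parse_descriptor {
    my ($line) = @_;
    return unless $line =~ /^>\s*([^,\s]+)(?:,\s*(.*))?$/;
    my ($id, $comments) = ($1, $2);
    return ($id, defined $comments ? $comments : '');
}

# Substitution: replace every T with U, keeping the case of each letter.
# to_rna is a hypothetical helper name.
sub to_rna {
    my ($seq) = @_;
    $seq =~ s/T/U/g;
    $seq =~ s/t/u/g;
    return $seq;
}

# Transliteration: map each base to its complement in a single pass.
# complement is a hypothetical helper name.
sub complement {
    my ($seq) = @_;
    $seq =~ tr/ATGC/TACG/;
    return $seq;
}

my $descriptor = '>gi|171361, Saccharomyces cerevisiae, cystathionine gamma-lyase';
my $seq        = 'ATGCCAGGTAG';    # made-up sequence for illustration

my ($id, $comments) = parse_descriptor($descriptor);
print "ID: $id\n";
print "Comments: $comments\n";
print "Start codon found\n" if $seq =~ /ATG/i;    # simple match
print "RNA: ", to_rna($seq), "\n";
print "Complement: ", complement($seq), "\n";
```

The match is tested inside parse_descriptor before $1 and $2 are used, so a line that is not a descriptor returns nothing rather than stale captures. Note that complement gives the complement only; the reverse complement asked for in exercise 3 needs a further step.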
