Bioinformatics Lecture 8: Perl pattern matching features

Questions to think about
• Create a hash table that performs the codon-to-amino-acid conversion and use it to convert codons (entered from the keyboard) into their corresponding amino acids.
• Write a script that extracts the gene ID and gene name from the descriptor header of a DNA FASTA file.
• Write a script that reads in the DNA sequences from two FASTA files (assume the sequence length is the same for both) and determines the number of alignment matches to non-matches.

Introduction
• Pattern matching
• Pattern extraction
• Pattern substitution
• Split and join functions
• Unpack function

Pattern Matching
• Recall that =~ is the pattern matching operator.
• A first simple match example:
    print "EcoRI site found!" if $dna =~ /gat/;
  – It means: if the string $dna contains the pattern gat, print "EcoRI site found!". The pattern is whatever lies between the two / characters, and =~ is the pattern matching symbol. (gat is only a toy pattern; the EcoRI recognition site itself is GAATTC, as used in the patterns that follow.)
• More patterns:
    if ($dna =~ /[GATCgatc]/)       # contains at least one base letter, either case
    if (/^[GATC]/i)                 # $_ starts with a base letter; i = case-insensitive
    if ($dna =~ /GAATTC|AAGCTT/)    # | is the Boolean "or" symbol: EcoRI site or HindIII site
        { print "EcoRI or HindIII site found!!!"; }
• A more flexible pattern:
    print "EcoRI site found!" if $dna =~ /GAA[GATC]TTC/;
  – A pattern where the fourth letter is any one of the letters within the square brackets.
  – [GATC] means any one of G, A, T or C; [^GATC] means any character other than G, A, T or C.
  – [0-9] or \d (digit); [a-z]; [-A-Z] (a leading hyphen inside brackets is a literal hyphen); /[AT][GC][TG]/
  – /[a-zA-Z0-9_]/ or /\w/ (word character)
  – /\s/ (white space); to invert \s, uppercase the letter: \S (non-white space)

Pattern matching: metacharacters
    Metacharacter   Description
    .               Any character except newline
    \.              Full stop character
    ^               The beginning of a line
    $               The end of a line
    \w              Any word character (letter, digit or underscore)
    \W              Any non-word character
    \s              White space (spaces, tabs, newlines, carriage returns)
    \S              Non-white space
    \d              Any digit
    \D              Any non-digit
• You can also specify how many times a pattern may occur (once, many times, or a specific number of times).
• More information on variations of metacharacters here: metacharacters

Pattern matching: Quantifiers
    Quantifier   Description
    ?            0 or 1 occurrence
    +            1 or more occurrences
    *            0 or more occurrences
    {N}          Exactly N occurrences
    {N,M}        Between N and M occurrences
    {N,}         At least N occurrences
    {0,M}        No more than M occurrences (the shorthand {,M} is only accepted by Perl 5.34 and later)
• To match the following format: M58200.2
    =~ /\w+\.\d+/
• If the sequence motif is Pu-C-X(40-80)-Pu-C
  – where Pu (purine) is [AG] and X (any base) is [ATGC]:
    $sequence =~ /[AG]C[GATC]{40,80}[AG]C/;
  – The matched pattern can also be extracted into variables (see Extracting Patterns).
• Anchors
  – E.g. matching a word exactly: /\bword\b/
    \b is a word boundary: the pattern matches only the whole word "word", not the letters w, o, r and d embedded in a longer word.
  – The start-of-line anchor ^: /^>/ matches only lines beginning with >
  – The end-of-line anchor $: />$/ matches only lines whose last character is >
  – /^$/ : what does this mean?
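The quantifier and anchor patterns above can be tried out with a short sketch. The helper names (has_accession, has_pu_c_motif, has_whole_word) and the test strings are made up for illustration; only the regular expressions themselves come from the slides.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true if $text contains an accession-like token such as M58200.2
sub has_accession {
    my ($text) = @_;
    return $text =~ /\w+\.\d+/ ? 1 : 0;
}

# Hypothetical helper: true if $seq contains the motif Pu-C-X(40-80)-Pu-C
sub has_pu_c_motif {
    my ($seq) = @_;
    return $seq =~ /[AG]C[GATC]{40,80}[AG]C/ ? 1 : 0;
}

# Hypothetical helper: true if $word occurs in $text as a whole word (\b anchors)
sub has_whole_word {
    my ($text, $word) = @_;
    return $text =~ /\b\Q$word\E\b/ ? 1 : 0;
}

# Made-up test sequences: a 40-base spacer satisfies {40,80}, a 10-base spacer does not
my $hit  = 'AC' . ('T' x 40) . 'GC';
my $miss = 'AC' . ('T' x 10) . 'GC';

print has_accession('M58200.2'), "\n";                # prints 1
print has_accession('no dots here'), "\n";            # prints 0
print has_pu_c_motif($hit), "\n";                     # prints 1
print has_pu_c_motif($miss), "\n";                    # prints 0
print has_whole_word('a word here', 'word'), "\n";    # prints 1
print has_whole_word('swordfish', 'word'), "\n";      # prints 0
```

Returning 1 or 0 rather than the raw match result keeps the printed output easy to read; in a real script the match would normally be used directly in an if condition.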
Further examples
• file_size_bases_only.pl example
    #!/usr/bin/perl
    # file size2.pl
    $length = 0; $lines = 0;
    while (<>) {
        chomp;
        # add the line length only if the line contains a base letter
        # (this is also true for a descriptor line that happens to contain g, a, t or c)
        $length = $length + length $_ if $_ =~ /[GATCgatc]/;
        # Alternative: $length += length if /^[GATCN]/i;
        $lines = $lines + 1;
    }
    print "LENGTH = $length\n"; print "LINES = $lines\n";

FASTA files
• Write and test file_size_bases_only.pl using a FASTA file as input.
• FASTADNA1.txt: example of a FASTA file
    >2L52.1 CE20433 Zinc finger, C2H2 type (CAMBRIDGE) protein id:CAA21776.1
    GCAGCGCACGACAGCTGTGCTATCCCGGCGAGCCCGTGGCAGAGGACCTCGCTTGCGAAAGCATCGAGTACC
• Sample of a file in EMBL format
    gccacagatt acaggaagtc atatttttag acctaaatca ctatcctcta tctttcagca 60
    agaaaagaac atctacttgg tttcgttccc tatccaagat tcagatggtg aaacgagtga 120
    tcatgcacct gatgaacgtg caaaaccaca gtcaagccat gacaaccccg atctacagtt 180
    tgatgttgaa actgccgatt ggtacgccta cagtgaaaac tatggcacaa gtgaagaaaa 240
• Sample of an NCBI record format
      1 atgaacccca acctgtgggt cgacgcgcag agcacttgca agagggaatg cgacgctgac
     61 ctggagtgcg agacctttga gaagtgctgc cccaatgtct gtggaaccaa gagctgtgtg
    121 gctgctcggt acatggacat caaggggaag aaggggcctg tggggatgcc caaagaggca
    181 acctgtgacc gcttcatgtg catccagcaa ggctcagagt gcgacatctg ggacgggcag
    241 cctgtctgca agtgcaagga caggtgtgag aaggagccga gctttacctg cgcctcggac

Extracting Patterns
• Consider a descriptor line like
    >M185580 clone 333a, complete sequence
  – M18… is the sequence ID
  – clone 333a, com…: optional comments
• We need to store some of the elements of the descriptor line:
    =~ /^>(\S+)/
  – The part of the match inside the parentheses (here the run of non-white-space characters after the >) is extracted and put into the variable $1.
• Example script:
    #!/usr/bin/perl -w
    # demonstrates the effect of parentheses.
    while ( my $line = <> )
    {
        $line =~ /\w+ (\w+) \w+ (\w+)/;
        print "Second word: '$1' on line $..\n" if defined $1;
        print "Fourth word: '$2' on line $..\n" if defined $2;
    }
  – Change it to catch the first and the third word of a sentence.

Search and replace
• s/t/u/ replaces t (thymine) with u (uracil), once only
• s/t/u/g (g = global) scans the whole string
• s/t/u/gi (global and case-insensitive)
• What about the following?
  – s/^\s+//
  – s/\s+$//
  – s/\s+$/ /g (where g stands for global)
• Write a Perl script that reads in the DNA sequences from FastaDNA1file.txt and replaces all the thymine bases with the corresponding uracil bases.

Splits and joins
• To transform strings into arrays: split
  – Line 1 looks like:
      192a8,The Sanger Centre,GGGTTCCGATTTCCAA,CCTTAGGCCAAATTAAGGCC
  – Consider the following code:
      chomp($line = <>);    # read the line into $line
      @fields = split ',', $line;
      ($clone,$laboratory,$left_oligo,$right_oligo) = split ',', $line;
  – This reads in line 1 and puts each part that precedes a delimiter (e.g. 192a8) into an element of the array, or into the named scalars.
• To transform arrays (lists) into strings: join
      $tab = join "\t", @fields;
  – Result: 192a8	The Sanger Centre	GGGTTCCGATTTCCAA	CCTTAGGCCAAATTAAGGCC
      # initialize an array
      my @perlFunc = ("substr","grep","defined","undef");
      my $perlFunc = join " ", @perlFunc;
      print "Perl Functions: $perlFunc\n";
  – See example split_file.pl

Other useful functions
• unpack syntax:
    @triplets = unpack("a3" x (length($line)/3), $line);
• Frame shift (1 position to the right):
    @triplets = unpack('a' . "a3" x (length($line)/3), $line);
  – The leading 'a' takes the first base as an element of its own, so the triplets that follow start at the second base.
• See example Unpack_codons.pl

Questions
• Modify file_bases_size_only.pl to count the number of bases for a file in EMBL format and for one in NCBI format.
• Using FASTADNA1.txt: extract the sections of the descriptor line into appropriate scalar variables.
• Assuming the DNA sequence in FastaDNA1file.txt is the complementary (antisense) strand, print the mRNA produced when the primary strand (sequence) is transcribed.

Exam Questions
• Perl is an important bioinformatics language. Explain the main features of Perl that make it appealing to the field of bioinformatics.
  – Write a script that extracts the gene ID and gene name from the descriptor header of a DNA FASTA file.
  – Write a Perl script that only reads and prints DNA sequences from a FASTA file.
  – Write a script that reads in the DNA sequences from two FASTA files (assume the sequence length is the same for both) and determines the number of alignment matches to non-matches (FastaDNA1file.txt).
  – Write a script that reads in the DNA sequences from two FASTA files (assume the sequence length is the same for both) and illustrates the number of alignment matches to non-matches.
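As a recap, the sketch below combines three techniques from this lecture (pattern extraction, substitution, and unpack) on an in-memory record rather than a file. It is one possible starting point, not a model answer: the helper names (parse_descriptor, dna_to_rna, codons) are hypothetical, the sequence is only the first 18 bases of the FASTADNA1.txt example, and treating the first word after > as the ID is an assumption.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split a descriptor line into the ID and the remaining text.
# Assumes the ID is the first run of non-white-space characters after '>'.
sub parse_descriptor {
    my ($header) = @_;
    if ($header =~ /^>(\S+)\s*(.*)$/) {
        return ($1, $2);
    }
    return;    # empty list when the line is not a descriptor line
}

# Hypothetical helper: replace every thymine with uracil, keeping the letter case.
sub dna_to_rna {
    my ($dna) = @_;
    my $rna = $dna;
    $rna =~ s/T/U/g;
    $rna =~ s/t/u/g;
    return $rna;
}

# Hypothetical helper: cut a sequence into whole triplets with unpack.
# $offset (0, 1 or 2) selects the reading frame; incomplete trailing bases are dropped.
sub codons {
    my ($seq, $offset) = @_;
    $offset ||= 0;
    my $frame = substr($seq, $offset);
    my $n     = int(length($frame) / 3);
    return unpack('a3' x $n, $frame);
}

my $header = '>2L52.1 CE20433 Zinc finger, C2H2 type (CAMBRIDGE) protein id:CAA21776.1';
my $dna    = 'GCAGCGCACGACAGCTGT';    # first 18 bases of the example sequence

my ($id, $rest) = parse_descriptor($header);
print "ID: $id\n";                                      # prints ID: 2L52.1
print "RNA: ", dna_to_rna($dna), "\n";                  # prints RNA: GCAGCGCACGACAGCUGU
print "Frame 0: ", join(' ', codons($dna, 0)), "\n";    # prints Frame 0: GCA GCG CAC GAC AGC TGT
print "Frame 1: ", join(' ', codons($dna, 1)), "\n";    # prints Frame 1: CAG CGC ACG ACA GCT
```

Using substr to drop the leading bases before calling unpack means that every element returned is a full triplet, which differs slightly from the slide version, where the leading 'a' returns the first base as an extra element.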