סקר הוראה בשבועות הקרובים יתקיים סקר ההוראה (באתר מידע אישי לתלמיד) 13.1 13.2 BioPerl 13.3 BioPerl The BioPerl project is an international association of developers of open source Perl tools for bioinformatics, genomics and life science research. Things you can do with BioPerl: • Read and write sequence files of different format, including: Fasta, GenBank, EMBL, SwissProt and more… • Extract gene annotation from GenBank, EMBL, SwissProt files • Read and analyse BLAST results. • Read and process phylogenetic trees and multiple sequence alignments. • Analysing SNP data. • And more… 13.4 BioPerl BioPerl modules are called Bio::XXX You can use the BioPerl wiki: http://bio.perl.org/ with documentation and examples for how to use them – which is the best way to learn this. We recommend beginning with the "How-tos": http://www.bioperl.org/wiki/HOWTOs To a more hard-core inspection of BioPerl modules: BioPerl 1.5.2 Module Documentation 13.5 Object-oriented use of packages Many packages are meant to be used as objects. In Perl, an object is a data structure that can use subroutines that are associated with it. $obj 0x225d14 func() anotherFunc() We will not learn object oriented programming, but we will learn how to create and use objects defined by BioPerl packages. 13.6 BioPerl: the SeqIO module BioPerl modules are named Bio::xxxx The Bio::SeqIO module deals with Sequences Input and Output: In order to create a new SeqIO object, use new Bio::SeqIO as follows: use Bio::SeqIO; my $in = new Bio::SeqIO; 13.7 BioPerl: the SeqIO module BioPerl modules are named Bio::xxxx The Bio::SeqIO module deals with Sequences Input and Output: In order to create a new SeqIO object, use new Bio::SeqIO as follows: use Bio::SeqIO; my $in = Bio::SeqIO->new; or alternatively 13.8 BioPerl: the SeqIO module BioPerl modules are named Bio::xxxx The Bio::SeqIO module deals with Sequences Input and Output: We will pass arguments to the new argument of the file name and format use Bio::SeqIO; my $in = Bio::SeqIO->new("-file" => "<seq.gb", "-format" => "GenBank"); File argument (filename as would be in open) Format argument $in 0x25e211 A list of all the sequence formats BioPerl can read is in: http://www.bioperl.org/wiki/HOWTO:SeqIO#Formats next_seq() write_seq() 13.9 BioPerl: the SeqIO module use Bio::SeqIO; my $in = Bio::SeqIO->new("-file" => "<seq.gb", "-format" => "GenBank"); my $seqObj = $in->next_seq(); next_seq() returns the next sequence in the file as a Bio::Seq object (we will talk about them soon) Perform next_seq()subroutine on $in You could think of it as: SeqIO::next_seq($in) $in 0x25e211 next_seq() write_seq() 13.10 BioPerl: the SeqIO module use Bio::SeqIO; my $in = Bio::SeqIO->new("-file" => "<adeno12.gb", "-format" => "GenBank"); my $out = Bio::SeqIO->new("-file" => ">adeno12.out.fas", "-format" => "Fasta"); my $seqObj = $in->next_seq(); write_seq()write a while ( defined($seqObj) ){ $out->write_seq($seqObj); $seqObj = $in->next_seq(); } $out 0x3001a3 next_seq() write_seq() Bio::Seq object to $out according to its format $in 0x25e211 next_seq() write_seq() 13.11 BioPerl: the Seq module use Bio::SeqIO; my $in = Bio::SeqIO->new( "-file" => "<Ecoli.prot.fasta", "-format" => "Fasta"); my $seqObj = $in->next_seq(); while (defined($seqObj)) { print "ID:".$seqObj->id()."\n"; #1st word in header print "Desc:".$seqObj->desc()."\n"; #rest of header print "Sequence: ".$seqObj->seq()."\n"; #seq string print "Length:".$seqObj->length()."\n"; #seq length $seqObj = $in->next_seq() This one is } extremely useful You can read more about the Bio::Seq subroutines in: ... http://www.bioperl.org/wiki/HOWTO:Beginners#The_Sequence_Object 13.12 Printing last 30aa of each sequence open (IN, "<seq.fasta") or die "Cannot open seq.fasta..."; my $fastaLine = <IN>; while (defined $fastaLine) { chomp $fastaLine; # Read first word of header if (fastaLine =~ m/^>(\S*)/) { my $header = substr($fastaLine,1); $fastaLine = <IN>; } # Read seq until next header my $seq = ""; while ((defined $fastaLine) and(substr($fastaLine,0,1) ne ">" )) { chomp $fastaLine; $seq = $seq.$fastaLine; $fastaLine = <IN>; } # print last 30aa my $subseq = substr($seq,-30); print "$header\n"; print "$subseq\n"; } 13.13 Now using BioPerl use Bio::SeqIO; my $in = Bio::SeqIO->new("-file"=>"<seq.fasta","-format"=>"Fasta"); my $seqObj = $in->next_seq(); while (defined($seqObj)) { # Read first word of header my $header = $seqObj->id(); # print last 30aa my $seq = $seqObj->seq(); my $subseq = substr($seq,-30); print "$header\n"; Note: BioPerl warnings about: print "$subseq\n"; Subroutine ... redefined at ... $seqObj = $in->next_seq(); } Should not trouble you, it is a known issue – it is not your fault and won't effect your script's performances. 13.14 Now using BioPerl use Bio::SeqIO; my $in = Bio::SeqIO->new("-file"=>"<seq.fasta","-format"=>"Fasta"); my $seqObj; while (defined ($seqObj= $in->next_seq()) ) { # Read first word of header my $header = $seqObj->id(); # print last 30aa my $seq = $seqObj->seq(); my $subseq = substr($seq,-30); print "$header\n"; Note: BioPerl warnings about: print "$subseq\n"; Subroutine ... redefined at ... } Should not trouble you, it is a known issue – it is not your fault and won't effect your script's performances. 13.15 Class exercise 13a 1. Write a script that uses Bio::SeqIO to read a FASTA file (use the EHD nucleotide FASTA from the webpage) and print to an output FASTA file only sequences shorter than 3,000 bases. 2. Write a script that uses Bio::SeqIO to read a FASTA file, and print (to the screen) all header lines that contain the words "Mus musculus" (you may use the same file). 3. Write a script that uses Bio::SeqIO to read a GenPept file (use preProInsulinRecords.gp from the webpage), and convert it to FASTA. 4*. Same as Q1, but print to the FASTA the reverse complement of each sequence. (Do not use the reverse or tr// functions! BioPerl can do it for you - read the BioPerl documentation). 13.16 BioPerl: downloading files from the web The Bio::DB::Genbank module allows us to download a specific record from the NCBI website: use Bio::DB::GenBank; my $gb = Bio::DB::GenBank->new; my $seqObj = $gb->get_Seq_by_acc("J00522"); $seqObj = $gb->get_Seq_by_gi(195052); print $seqObj->seq(); see more options in: http://www.bioperl.org/wiki/HOWTO:Beginners#Retrieving_a_sequence_fro m_a_database http://doc.bioperl.org/releases/bioperl-1.4/Bio/DB/GenBank.html 13.17 BLAST Congrats, you just sequenced yourself some DNA. #$?!? And you want to see if it exists in any other organism 13.18 BLAST BLAST - Basic Local Alignment and Search Tool BLAST helps you find similarity between your sequence and other sequences 13.19 BLAST BLAST - Basic Local Alignment and Search Tool BLAST helps you find similarity between your sequence and other sequences 13.20 BLAST BLAST helps you find similarity between your sequence and other sequences 13.21 BLAST You can search using BLAST proteins or DNA: Query: DNA Protein Database: DNA Protein blastn – nucleotides vs. nucleotides blastp – protein vs. protein blastx – translated query vs. protein database tblastn– protein vs. translated nuc. DB tblastx – translated query vs. translated database 13.22 BioPerl: reading BLAST output First we need to have the BLAST results in a text file BioPerl can read. Here is one way to achieve this (using NCBI BLAST): Download Text Another alternative is to use BLASTALL on your computer, to perform BLAST on each sequence of a multiple sequence Fasta against another multiple sequence Fasta. 13.23 BioPerl: reading BLAST output Query Query= gi|52840257|ref|YP_094056.1| chromosomal replication initiator protein DnaA [Legionella pneumophila subsp. pneumophila str. Philadelphia 1] (452 letters) Database: Coxiella.faa 1818 sequences; 516,956 total letters Results info Searching..................................................done Sequences producing significant alignments: gi|29653365|ref|NP_819057.1| gi|29655022|ref|NP_820714.1| gi|29654861|ref|NP_820553.1| gi|29654871|ref|NP_820563.1| gi|29654481|ref|NP_820173.1| gi|29654004|ref|NP_819696.1| Score E (bits) Value chromosomal replication initiator p... DnaA-related protein [Coxiella burn... Holliday junction DNA helicase B [C... ATPase, AFG1 family [Coxiella burne... hypothetical protein CBU_1178 [Coxi... succinyl-diaminopimelate desuccinyl... 633 72 32 27 25 25 0.0 4e-14 0.033 1.4 3.1 3.1 13.24 BioPerl: reading BLAST output gi|215919162|ref|NP_820316.2| threonyl-tRNA synthetase [Coxiella... gi|29655364|ref|NP_821056.1| transcription termination factor rh... gi|215919324|ref|NP_821004.2| adenosylhomocysteinase [Coxiella b... gi|29653813|ref|NP_819505.1| putative phosphoribosyl transferase... 25 24 24 24 5.3 9.0 9.0 9.0 Result header >gi|29653365|ref|NP_819057.1| chromosomal replication initiator protein [Coxiella burnetii RSA 493] Length = 451 Score = 633 bits (1632), Expect = 0.0 Identities = 316/452 (69%), Positives = 371/452 (82%), Gaps = 5/452 (1%) Query: 1 MSTTAWQKCLGLLQDEFSAQQFNTWLRPLQAYMDEQR-LILLAPNRFVVDWVRKHFFSRI 59 + T+ W KCLG L+DE QQ+NTW+RPL A +Q L+LLAPNRFV+DW+ + F +RI LPTSLWDKCLGYLRDEIPPQQYNTWIRPLHAIESKQNGLLLLAPNRFVLDWINERFLNRI 62 Sbjct: 3 Query: 60 Sbjct: 63 EELIKQFSGDDIKAISIEVGSKPVEAVDTPAETIVTSSSTAPLKSAPKKAVDYKSSHLNK 119 EL+ + S D I +++GS+ E + + AP + + +++N TELLDELS-DTPPQIRLQIGSRSTEMPTKNSHEPSHRKAAAPPAGT---TISHTQANINS 118 Query: 120 KFVFDSFVEGNSNQLARAASMQVAERPGDAYNPLFIYGGVGLGKTHLMHAIGNSILKNNP 179 F FDSFVEG SNQLARAA+ QVAE PG AYNPLFIYGGVGLGKTHLMHA+GN+IL+ + Sbjct: 119 NFTFDSFVEGKSNQLARAAATQVAENPGQAYNPLFIYGGVGLGKTHLMHAVGNAILRKDS 178 Note: There could be more than one HSP for each result, in case of homology in different parts of the protein high scoring pair (HSP) data HSP Alignment 13.25 Bio::SearchIO : reading BLAST output The Bio::SearchIO module can read and parse BLAST output: use Bio::SearchIO; my $blast_report = Bio::SearchIO->new("-file" => "<LegCox.blastp", "-format" => "blast" ); my ($resultObj, $hitObj, $hspObj); while( defined($resultObj = $blast_report->next_result()) ){ print "Checking query ".$resultObj->query_name()."\n"; while( defined($hitObj = $resultObj->next_hit()) ) { print "Checking hit ". $hitObj->name()."\n"; $hspObj = $hitObj->next_hsp(); print "Best score: ".$hspObj->score()."\n"; } } (See the BLAST output example in course web-site) 13.26 BioPerl: reading BLAST output You can send parameters to the subroutines of the objects: # Get length of HSP (including gaps) $hspObj->length("total"); # Get length of hit part of alignment (without gaps) $hspObj->length("hit"); # Get length of query part of alignment (without gaps) $hspObj->length("query"); More about what you can do with query, hit and hsp see in: http://www.bioperl.org/wiki/HOWTO:SearchIO#Table_of_Methods Class exercise 13b 13.27 1. Write a script that uses Bio::SearchIO to parse the BLAST results (LegCox.blastp provided in the course web-site) and: a) For each query print out its name and the name of its first hit. b) Print the % identity of each HSP of the first hit of each query. c) Print the e-value of each HSP of the first hit of each query. 13.28 • • Installing BioPerl – how to add a repository to the PPM Start All Programs Active Perl… Perl Package manager You might need to add a repository to the PPM before installing BioPerl: 13.29 Installing modules from the internet • The best place to search for Perl modules that can make your life easier is: http://www.cpan.org/ • The easiest way to download and install a module is to use the Perl Package Manager (part of the ActivePerl installation) 1.Choose “View all packages” 2. Enter module name (e.g. bioperl) 3. Choose module (e.g. bioperl) 4. Add it to the installation list 5. Install! Note: ppm installs the packages under the directory “site\lib\” in the ActivePerl directory. You can put packages there manually if you would like to download them yourself from the net, instead of using 13.30 • Installing BioPerl – how to add a repository to the PPM Click the “Repositories” tab, enter “bioperl” in the “Name” field and http://bioperl.org/DIST in the “Location” field, click “Add”, and finally “OK”: 13.31 BioPerl installation • In order to add BioPerl packages you need to download and execute the bioperl10.bat file from the course website. • If that that does not work – follow the instruction in the last three slides of the BioPerl presentation. • Reminder: BioPerl warnings about: Subroutine ... redefined at ... Should not trouble you, it is a known issue – it is not your fault and won't effect your script's performances. 13.32 Installing modules from the internet • Alternatively in older Active Perl versions- Note: ppm installs the packages under the directory “site\lib\” in the ActivePerl directory. You can put packages there manually if you would like to download them yourself from the net, instead of using