12-BioPerl

12.4
BioPerl
BioPerl modules are named Bio::XXX.
The BioPerl wiki has documentation and examples showing how to use them, and working through these is the best way to learn BioPerl:
http://bio.perl.org/
We recommend beginning with the "How-tos":
http://www.bioperl.org/wiki/HOWTOs
For a more hard-core inspection of the BioPerl modules, see:
BioPerl 1.6.1 Module Documentation
12.6
BioPerl: the SeqIO module
BioPerl modules are named Bio::xxxx
The Bio::SeqIO module deals with sequence input and output.
We pass the file name and the format as arguments to the new() subroutine:
use Bio::SeqIO;
my $in = Bio::SeqIO->new("-file" => "<seq.gb",
"-format" => "GenBank");
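"-file": the file argument (the file name, written as it would be in open).
"-format": the format argument.
$in now holds a Bio::SeqIO object (a reference, shown on the slide as 0x25e211) that provides the subroutines next_seq() and write_seq().
A list of all the sequence formats BioPerl can read is in:
http://www.bioperl.org/wiki/HOWTO:SeqIO#Formats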
12.7
BioPerl: the SeqIO module
use Bio::SeqIO;
my $in = Bio::SeqIO->new("-file" => "<seq.gb",
"-format" => "GenBank");
my $seqObj = $in->next_seq();
next_seq() returns the next sequence in the file as a Bio::Seq object (we will talk about them soon).
$in->next_seq() performs the next_seq() subroutine on $in. You could think of it as:
SeqIO::next_seq($in)
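The arrow notation can be tried out on a toy class. The sketch below is an illustration only: the package ToyReader and everything in it are hypothetical and say nothing about how BioPerl is written internally. It shows that calling a subroutine with the arrow passes the object itself as the first argument.

```perl
use strict;
use warnings;

# Hypothetical toy class, for illustration only (not part of BioPerl)
package ToyReader;

# Constructor: stores a list of items inside a blessed hash reference
sub new {
    my ($class, @items) = @_;
    my $self = { items => [@items] };
    return bless $self, $class;
}

# Returns the next item, or undef when none are left
sub next_item {
    my ($self) = @_;
    return shift @{ $self->{items} };
}

package main;

my $reader = ToyReader->new("seq1", "seq2");

# Call with the arrow: Perl passes $reader as the first argument
my $first = $reader->next_item();

# The same subroutine called directly, passing the object explicitly
my $second = ToyReader::next_item($reader);

print "$first $second\n";
```

Both calls run the same subroutine on the same object, so the second call simply continues where the first one stopped.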
12.8
BioPerl: the SeqIO module
use Bio::SeqIO;
my $in = Bio::SeqIO->new("-file" => "<adeno12.gb",
"-format" => "GenBank");
my $out = Bio::SeqIO->new("-file" => ">adeno12.out.fas",
"-format" => "Fasta");
my $seqObj = $in->next_seq();
while ( defined($seqObj) ){
    $out->write_seq($seqObj);
    $seqObj = $in->next_seq();
}
write_seq() writes a Bio::Seq object to $out according to its format.
12.9
BioPerl: the Seq module
use Bio::SeqIO;
my $in = Bio::SeqIO->new( "-file" => "<Ecoli.prot.fasta",
"-format" => "Fasta");
my $seqObj = $in->next_seq();
while (defined($seqObj)) {
    print "ID:".$seqObj->id()."\n";           # 1st word in header
    print "Desc:".$seqObj->desc()."\n";       # rest of header
    print "Sequence:".$seqObj->seq()."\n";    # seq string
    print "Length:".$seqObj->length()."\n";   # seq length
    $seqObj = $in->next_seq();
}
You can read more about the Bio::Seq subroutines in:
http://www.bioperl.org/wiki/HOWTO:Beginners#The_Sequence_Object
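A Bio::Seq object does not have to come from a file. As a small sketch for experimenting with its subroutines, the script below builds one in memory; the ID, description and DNA string are made up for the example, and it assumes BioPerl is installed. Besides id() and length(), it tries subseq() and translate(), which are also Bio::Seq subroutines.

```perl
use strict;
use warnings;
use Bio::Seq;

# A made-up DNA sequence, built in memory instead of read from a file
my $seqObj = Bio::Seq->new("-id"   => "demo1",
                           "-desc" => "made-up example sequence",
                           "-seq"  => "ATGGCCAAATAA");

my $length     = $seqObj->length();       # number of letters
my $firstCodon = $seqObj->subseq(1, 3);   # positions are counted from 1
my $protObj    = $seqObj->translate();    # a new sequence object (protein)

print "ID: ".$seqObj->id()."\n";
print "Length: $length\n";
print "First codon: $firstCodon\n";
print "Protein: ".$protObj->seq()."\n";
```

Note that subseq() returns a plain string, while translate() returns another sequence object, so seq() is needed to get its letters.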
12.10
Print last 30aa of each sequence (no BioPerl)
open (IN, "<seq.fasta") or die "Cannot open seq.fasta...";
my $fastaLine = <IN>;
while (defined $fastaLine) {
    chomp $fastaLine;
    # Read first word of header
    my $header = "";
    if ($fastaLine =~ m/^>(\S*)/) {
        $header = $1;
        $fastaLine = <IN>;
    }
    # Read seq until next header
    my $seq = "";
    while ((defined $fastaLine) and (substr($fastaLine,0,1) ne ">")) {
        chomp $fastaLine;
        $seq = $seq.$fastaLine;
        $fastaLine = <IN>;
    }
    # print last 30aa
    my $subseq = substr($seq,-30);
    print "$header\n";
    print "$subseq\n";
}
close(IN);
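One detail of substr($seq,-30) is worth knowing: a negative offset counts from the end of the string, and when the sequence has fewer than 30 letters the whole sequence is returned. A minimal sketch, using made-up sequences rather than data from the course files:

```perl
use strict;
use warnings;

# Made-up sequences for illustration
my $longSeq  = ("A" x 40) . "MKVLAAGIVGLLLAQPSFAADKQTVELTEK";  # 40 + 30 letters
my $shortSeq = "MKVLA";                                        # only 5 letters

my $lastOfLong  = substr($longSeq, -30);    # the last 30 letters
my $lastOfShort = substr($shortSeq, -30);   # shorter than 30: the whole string

print length($lastOfLong), "\n";
print "$lastOfShort\n";
```

So the scripts on these slides do not need a special case for short sequences.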
12.11
Now using BioPerl
use Bio::SeqIO;
my $in = Bio::SeqIO->new("-file"=>"<seq.fasta","-format"=>"Fasta");
my $seqObj = $in->next_seq();
while (defined($seqObj)) {
# Read first word of header
my $header = $seqObj->id();
# print last 30aa
my $seq = $seqObj->seq();
my $subseq = substr($seq,-30);
print "$header\n";
print "$subseq\n";
$seqObj = $in->next_seq();
}
Note: BioPerl warnings of the form
Subroutine ... redefined at ...
should not trouble you. This is a known issue: it is not your fault and will not affect your script's performance.
12.12
Now using BioPerl
use Bio::SeqIO;
my $in = Bio::SeqIO->new("-file"=>"<seq.fasta","-format"=>"Fasta");
my $seqObj;
# Alternatively, read the next sequence inside the loop condition
while (defined ($seqObj = $in->next_seq()) ) {
    # Read first word of header
    my $header = $seqObj->id();
    # print last 30aa
    my $seq = $seqObj->seq();
    my $subseq = substr($seq,-30);
    print "$header\n";
    print "$subseq\n";
}
13.13
Class exercise 13a
1. Use Bio::SeqIO to read a FASTA file and print to an output FASTA file only sequences shorter than 3,000 bases.
(use the EHD nucleotide FASTA from the webpage)
2. Use Bio::SeqIO to read a FASTA file, and print (to the screen) header lines that contain the words "Mus musculus".
3*. Write a script that uses Bio::SeqIO to read a GenPept file and convert it to FASTA.
(use preProInsulinRecords.gp from the webpage)
4*. Same as Q1, but print to the FASTA the reverse complement of each sequence.
(Do not use the reverse or tr// functions! BioPerl can do it for you; read the BioPerl documentation.)
12.14
BioPerl: downloading files from the web
The Bio::DB::GenBank module allows us to download a specific record from the NCBI website:
use Bio::DB::GenBank;
my $gb = Bio::DB::GenBank->new;
my $seqObj = $gb->get_Seq_by_acc("J00522");
print $seqObj->seq();
see more options in:
http://www.bioperl.org/wiki/HOWTO:Beginners#Retrieving_a_sequence_from_a_database
http://doc.bioperl.org/releases/bioperl-1.4/Bio/DB/GenBank.html
12.15
BLAST
Congrats, you just sequenced yourself some DNA, and you want to see if it exists in any other organism.
12.16
BLAST
BLAST - Basic Local Alignment Search Tool
BLAST helps you find similarity between your sequence and other sequences.
12.19
BLAST
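Figure labels: the query is compared with the sequences in the database; a database sequence that matches is a hit, and an aligned region between the query and a hit is a high scoring pair (HSP).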
12.21
BioPerl: reading BLAST output
First we need to have the BLAST results in a text file that BioPerl can read.
Here is one way to achieve this (using NCBI BLAST): on the results page, choose "Download" and then "Text".
An alternative is to run BLASTALL on your own computer.
12.22
BioPerl: reading BLAST output
Query:
Query= gi|52840257|ref|YP_094056.1| chromosomal replication initiator
protein DnaA [Legionella pneumophila subsp. pneumophila str.
Philadelphia 1]
(452 letters)
Database: Coxiella.faa
1818 sequences; 516,956 total letters
Results info:
Searching..................................................done
                                                                    Score    E
Sequences producing significant alignments:                         (bits) Value
gi|29653365|ref|NP_819057.1| chromosomal replication initiator p...  633   0.0
gi|29655022|ref|NP_820714.1| DnaA-related protein [Coxiella burn...   72   4e-14
gi|29654861|ref|NP_820553.1| Holliday junction DNA helicase B [C...   32   0.033
gi|29654871|ref|NP_820563.1| ATPase, AFG1 family [Coxiella burne...   27   1.4
gi|29654481|ref|NP_820173.1| hypothetical protein CBU_1178 [Coxi...   25   3.1
gi|29654004|ref|NP_819696.1| succinyl-diaminopimelate desuccinyl...   25   3.1
12.23
BioPerl: reading BLAST output
gi|215919162|ref|NP_820316.2| threonyl-tRNA synthetase [Coxiella...   25   5.3
gi|29655364|ref|NP_821056.1| transcription termination factor rh...   24   9.0
gi|215919324|ref|NP_821004.2| adenosylhomocysteinase [Coxiella b...   24   9.0
gi|29653813|ref|NP_819505.1| putative phosphoribosyl transferase...   24   9.0
Result header:
>gi|29653365|ref|NP_819057.1| chromosomal replication initiator
protein [Coxiella burnetii RSA 493]
Length = 451
High scoring pair (HSP) data:
Score = 633 bits (1632), Expect = 0.0
Identities = 316/452 (69%), Positives = 371/452 (82%), Gaps = 5/452 (1%)
HSP alignment:
Query: 1   MSTTAWQKCLGLLQDEFSAQQFNTWLRPLQAYMDEQR-LILLAPNRFVVDWVRKHFFSRI 59
           + T+ W KCLG L+DE   QQ+NTW+RPL A   +Q  L+LLAPNRFV+DW+ + F +RI
Sbjct: 3   LPTSLWDKCLGYLRDEIPPQQYNTWIRPLHAIESKQNGLLLLAPNRFVLDWINERFLNRI 62
Query: 60  EELIKQFSGDDIKAISIEVGSKPVEAVDTPAETIVTSSSTAPLKSAPKKAVDYKSSHLNK 119
            EL+ + S D    I +++GS+  E     +       + AP        + +  +++N
Sbjct: 63  TELLDELS-DTPPQIRLQIGSRSTEMPTKNSHEPSHRKAAAPPAGT---TISHTQANINS 118
Query: 120 KFVFDSFVEGNSNQLARAASMQVAERPGDAYNPLFIYGGVGLGKTHLMHAIGNSILKNNP 179
            F FDSFVEG SNQLARAA+ QVAE PG AYNPLFIYGGVGLGKTHLMHA+GN+IL+ +
Sbjct: 119 NFTFDSFVEGKSNQLARAAATQVAENPGQAYNPLFIYGGVGLGKTHLMHAVGNAILRKDS 178
Note: There could be more than one HSP for each result, in case of homology in different parts of the protein.
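To get a feeling for what parsing such a report by hand involves, here is a sketch that pulls the identifier, score and E-value out of one line of the "Sequences producing significant alignments" table with a regular expression. It is an illustration only: the line layout is an assumption based on the example report, the cutoff of 0.001 is arbitrary, and real reports contain variations that this pattern does not handle.

```perl
use strict;
use warnings;

# One summary line, copied from the example report
my $line = "gi|29655022|ref|NP_820714.1| DnaA-related protein [Coxiella burn...   72   4e-14";

# Assumed layout: identifier, description, score (bits), E-value
my ($name, $desc, $bits, $evalue);
if ($line =~ m/^(\S+)\s+(.*?)\s+(\d+)\s+(\S+)\s*$/) {
    ($name, $desc, $bits, $evalue) = ($1, $2, $3, $4);
    print "Hit: $name\n";
    print "Score: $bits bits, E-value: $evalue\n";
    # Perl treats the string "4e-14" as a number in a numeric comparison
    if ($evalue < 0.001) {     # arbitrary cutoff chosen for this example
        print "Below the cutoff\n";
    }
}
```

Every part of the report would need its own pattern like this one, which is the work a ready-made parser saves.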
12.24
Bio::SearchIO : reading BLAST output
The Bio::SearchIO module can read and parse BLAST output:
use Bio::SearchIO;
my $blast_report = Bio::SearchIO->new("-file"   => "<LegCox.blastp",
                                      "-format" => "blast");
my ($resultObj, $hitObj, $hspObj);
while( defined($resultObj = $blast_report->next_result()) ){
    print "Checking query ".$resultObj->query_name()."\n";
    while( defined($hitObj = $resultObj->next_hit()) ) {
        print "Checking hit ". $hitObj->name()."\n";
        $hspObj = $hitObj->next_hsp();
        print "Best score: ".$hspObj->score()."\n";
    }
}
(See the BLAST output example on the course website)
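The nesting that the two while loops walk through (a report holds one result per query, each result holds hits, each hit holds HSPs) can be pictured with plain Perl data structures. This is a picture only: Bio::SearchIO objects are not plain hashes like these, the key names are made up, and the values are abbreviated from the example report.

```perl
use strict;
use warnings;

# Hypothetical picture of one result; NOT how Bio::SearchIO stores its data
my %result = (
    query_name => "gi|52840257|ref|YP_094056.1|",
    hits       => [
        { name => "gi|29653365|ref|NP_819057.1|", hsps => [ { bits => 633 } ] },
        { name => "gi|29655022|ref|NP_820714.1|", hsps => [ { bits => 72 } ] },
    ],
);

print "Checking query $result{query_name}\n";
foreach my $hit (@{ $result{hits} }) {
    print "Checking hit $hit->{name}\n";
    my $firstHsp = $hit->{hsps}[0];    # the first HSP of this hit
    print "Bits: $firstHsp->{bits}\n";
}
```

With the real objects you never touch such keys directly; next_result(), next_hit() and next_hsp() step through the same levels for you.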
12.25
BioPerl: reading BLAST output
You can send parameters to the subroutines of the objects:
# Get length of HSP (including gaps)
$hspObj->length("total");
# Get length of hit part of alignment (without gaps)
$hspObj->length("hit");
# Get length of query part of alignment (without gaps)
$hspObj->length("query");
For more about what you can do with result, hit and HSP objects, see:
http://www.bioperl.org/wiki/HOWTO:SearchIO#Table_of_Methods
13.26
Class exercise 13b
1. Use Bio::SearchIO to parse the BLAST results (LegCox.blastp, provided on the course website):
a) For each query print out its name and the name of its first hit.
b*) Print the % identity of each HSP of the first hit of each query.
c*) Print the e-value of each HSP of the first hit of each query.
12.27
Installing BioPerl: how to add a repository to the PPM
• Start → All Programs → ActivePerl… → Perl Package Manager
• You might need to add a repository to the PPM before installing BioPerl.
12.28
Installing modules from the internet
• The best place to search for Perl modules that can make your life easier
is:
http://www.cpan.org/
• The easiest way to download and install a module is to use the Perl
Package Manager (part of the ActivePerl installation)
1. Choose “View all packages”
2. Enter the module name (e.g. bioperl)
3. Choose the module (e.g. bioperl)
4. Add it to the installation list
5. Install!
Note: ppm installs the packages under the directory “site\lib\” in the ActivePerl directory. You can put packages there manually if you would like to download them yourself from the net, instead of using ppm.
12.29
Installing BioPerl: how to add a repository to the PPM
• Click the “Repositories” tab, enter “bioperl” in the “Name” field and http://bioperl.org/DIST in the “Location” field, click “Add”, and finally “OK”.
12.30
BioPerl installation
• In order to add BioPerl packages you need to download and
execute the bioperl10.bat file from the course website.
• If that does not work, follow the instructions in the last three slides of the BioPerl presentation.
• Reminder: BioPerl warnings of the form
Subroutine ... redefined at ...
should not trouble you. This is a known issue: it is not your fault and will not affect your script's performance.
12.31
Installing modules from the internet
• Alternatively, in older ActivePerl versions:
Note: ppm installs the packages under the directory “site\lib\” in the ActivePerl directory. You can put packages there manually if you would like to download them yourself from the net, instead of using ppm.