use Bio::Perl

advertisement
Bioinformatics Course
Day 4
Perl Extensions:
BioPerl
and
Ensembl API
BioPerl
●
●
●
●
●
Collection of Perl scripts and modules
Facilitate development of Perl scripts for
bioinformatics applications – not ready-to-use
programs!
Object-orientated Perl code
Generated by biologists, bioinformaticians,
computer scientists
Different levels of complexity
What are modules?
●
Perl extensions
●
.pm ending
●
inserted via 'use' or 'require' statement
●
can be nested:
–
Bio::DB::GenBank
–
/sw/lib/perl5/5.8.6/Bio/DB/GenBank.pm
●
contain code and documentation
●
perldoc Bio::DB::GenBank
Applications for BioPerl
●
Sequence retrieval and manipulation
●
Data re-formatting
●
Run and parse output from bioinformatics
programs:
–
Blast
–
ClustalW
–
Tcoffee
–
GenScan
–
HMMER
–
...
What is an Object?
●
Example: Bio::Seq Object
use Bio::Perl
$seqobj = get_sequence('swissprot', 'TLR4_HUMAN');
What is an Object?
●
Example: Bio::Seq Object
load module
new function available
use Bio::Perl
$seqobj = get_sequence('swissprot', 'TLR4_HUMAN');
generates object
What is an Object?
●
Example: Bio::Seq Object
load module
new function available
use Bio::Perl
$seqobj = get_sequence('swissprot', 'TLR4_HUMAN');
generates object
provides methods
$sequence_part = $seqobj->subseq(1,100);
$translation = $seqobj->translate();
What is an Object?
●
Example: Bio::Seq Object
load module
new function available
use Bio::Perl
$seqobj = get_sequence('swissprot', 'TLR4_HUMAN');
generates object
provides methods
$sequence_part = $seqobj->subseq(1,100);
$translation = $seqobj->translate();
combinations possible
$trans_trunc_rev = $seqobj->trunc(100,200)->revcom->translate();
BioPerl's Objects
●
Sequence Objects (representation of
various types of sequences):
–
Seq
–
PrimarySeq
–
LocatableSeq
–
RelSegment
–
LiveSeq
–
LargeSeq
–
RichSeq
–
SeqWithQuality
BioPerl's Objects
●
Location Objects:
–
Where is a feature on a sequence?
●
Alignment objects
●
Blast reports
●
...
Objects vs Functions
●
Different access method
–
Example: reading an EMBL file:
Function:
$seq = read_sequence($file, 'embl')
Object:
$seqio = Bio::SeqIO->new(
-format => 'embl' ,
-file
=> $file
);
$seqobj = $seqio->next_seq();
Bio::Perl example
●
Load the module
use Bio::Perl;
●
Get the sequence
$seqobj = get_sequence('swissprot', 'TLR4_HUMAN');
●
Now you have a Bio::Seq object!
Bio::Perl is not BioPerl!
●
BioPerl:
–
●
whole collection of Bio-related Perl extension
Bio::Perl
–
just one of many modules
–
“Easy first time access to BioPerl via functions”
–
“Functional access to BioPerl for people who don't
know objects”
–
limited functionality
–
nice starter
DB entry to Objects
●
Conversion of parts of the data into objects:
ID
AC
TLR4_HUMAN
STANDARD;
O00206; Q9UK78; Q9UM57;
PRT;
839 AA.
Bio::PrimarySeq
OS
OC
Homo sapiens (Human).
Eukaryota; Metazoa; Chordata; Craniata; ...
Bio::Species
FT
FT
REPEAT
REPEAT
52
77
76
100
LRR 1..
LRR 2..
Bio::SeqFeatureI
Bio::Seq – Formats
AB1,ABI
ALF
CTF
EMBL
EXP
Fasta
Fastq
GCG
GenBank
PIR
PLN
SCF
ZTR
ace
game
locuslink
phd
qual
raw
swiss
ABI tracefile format
ALF tracefile format
CTF tracefile format
EMBL format
Staden tagged experiment tracefile format
FASTA format
Fastq format
GCG format
GenBank format
Protein Information Resource format
Staden plain tracefile format
SCF tracefile format
ZTR tracefile format
ACeDB sequence format
GAME XML format
LocusLink annotation (LL_tmpl format only)
phred output
Quality values (get a sequence of quality scores)
Raw format (one sequence per line, no ID)
Swissprot format
Access through Bio::Seq object
perldoc Bio::Seq (methods returning strings):
$seqobj->seq();
# string of sequence
$seqobj->subseq(5,10);
# part of the sequence as a string
$seqobj->accession_number(); # when there, the accession number
$seqobj->alphabet();
# one of 'dna','rna',or 'protein'
$seqobj->seq_version()
# when there, the version
$seqobj->keywords();
# when there, the Keywords line
$seqobj->length()
# length
$seqobj->desc();
# description
$seqobj->display_id();
# the human readable id of the sequence
Derived Bio::Seq objects
perldoc Bio::Seq (methods returning strings):
$seqobj->trunc(5,10)
$seqobj->revcom
$seqobj->translate
●
# truncation from 5 to 10 as new object
# reverse complements sequence
# translation of the sequence
Example:
$seqobj = read_sequence($file);
if ($seqobj->alphabet eq 'dna'
or
$seqobj->alphabet eq 'rna') {
$revcom = $seqobj->revcom;
write_sequence('', 'fasta', $revcom);
}
Other Bio::Perl features
●
Remote Blast:
$seqobj = read_sequence($file);
$blast_report = blast_sequence($seqobj);
write_blast(">blast.out", $blast_report);
●
Also possible to run stand-alone Blast
Other Bio::Perl features
●
Generate alignments:
# load module
use Bio::Tools::Run::Alignment::Clustalw;
# define parameters
@params = ('ktuple' => 2, 'matrix' => 'BLOSUM');
# build a clustalw alignment factory
$factory =
Bio::Tools::Run::Alignment::Clustalw->new(@params);
# Pass the factory a list of sequences to be aligned.
$aln = $factory->align('TLRs.fa');
# $aln is a SimpleAlign object
Other Bio::Perl features
●
Work with alignments:
$aln->length
$aln->no_residues
$aln->is_flush
$aln->no_sequences
$aln->score
$aln->percentage_identity
$aln->consensus_string(75)
??????L?LS?N?I??????????L??L??L?L??N??????????????
???F?????L??L?L??N???????????????L??L?L???????????
?????????????????????????????????????????????F??L?
?L??L?L????????????????L?????L????????????????????
???????L?????????????????????L????????????????????
?????????L????????????????????????????????????????
mostly Leucines conserved
Other Features
●
sequence statistics (chemical description,
residue count, word frequency)
●
finding restriction enzyme sites
●
finding amino acid cleavage sites
●
show and add sequence annotation
●
gene detection
●
manipulate phylogenetic trees
●
statistics for population genetics
BioPerl Documentation
NAME
Bio::Perl - Functional access to BioPerl for people who don't know
objects
SYNOPSIS
use Bio::Perl;
# will guess file format from extension
$seq_object = read_sequence($filename);
# forces genbank format
$seq_object = read_sequence($filename,'genbank');
# reads an array of sequences
@seq_object_array = read_all_sequences($filename,'fasta');
# sequences are Bio::Seq objects, so the following methods work
# for more info see Bio::Seq, or do 'perldoc Bio/Seq.pm'
print "Sequence name is ",$seq_object->display_id,"\n";
print "Sequence acc is ",$seq_object->accession_number,"\n";
More Info
●
less `locate bptutorial`
`locate bptutorial`)
●
perldoc Bio::Perl
●
www.bioperl.org
●
BioPerl course:
(or perl -d
www.pasteur.fr/recherche/unites/sis/formation/bioperl/
Ensembl
●
●
joint project between EMBL-EBI and the Sanger
Centre
automatic annotation on selected eukaryotic
genomes
●
free access to all the data and software
●
ww.ensembl.org
Ensembl Organisms
●
Human
●
dog
●
mouse
●
chimp
●
fly
●
zebrafish
●
worm
●
pufferfish
●
chicken
●
mosquito
●
cow
●
honey bee
●
rat
●
yeast
local installations (Arabidopsis)
Ensembl Website
Ensembl Website
Ensembl Pipeline
●
Perl-based scripts
●
run programs to detect and annotate genes
●
compare genomes
●
provide Web graphics
●
all data stored in MySQL database
Ensembl release cycle
●
Data sets and software updates approximately
ten times a year
●
Versions for web code and databases
●
All older versions (back to 2004) accessible
●
Registry allows easy switch between versions
Ensembl database
●
very rich data set
●
complex database layout
●
user-friendly Web interface
●
DB components:
–
Core
–
Compara
●
abstract layers through API (Perl, Java)
●
anonymous @ ensembldb.ensembl.org
–
e.g. homo_sapiens_core_38_36
Ensembl core
●
Genome sequences and annotation info
–
Gene transcripts
–
Protein models
●
Assembly information
●
CDNA and protein alignments
●
External references
●
Markers
●
Repeats regions
Other Ensembl data sets
●
EST databases
●
Variation databases
●
Both with application programming interface
(API), e.g. Perl modules
Ensembl compara
●
Multi-species database
●
Genome-wide species comparison
●
Re-calculated for each release
●
Pair-wise whole genome alignments
●
Synteny sets
●
Orthologue predictions
●
Protein family clusters
Compara: Genome comparison
●
Which methods are available?
mysql> SELECT * FROM method_link;
+----------------+----------------------+
| method_link_id | type
|
+----------------+----------------------+
|
1 | BLASTZ_NET
|
|
2 | BLASTZ_NET_TIGHT
|
|
3 | BLASTZ_RECIP_NET
|
|
4 | PHUSION_BLASTN
|
|
5 | PHUSION_BLASTN_TIGHT |
|
6 | TRANSLATED_BLAT
|
|
7 | BLASTZ_GROUP
|
|
8 | BLASTZ_GROUP_TIGHT
|
|
101 | SYNTENY
|
|
201 | ENSEMBL_ORTHOLOGUES |
|
202 | ENSEMBL_PARALOGUES
|
|
301 | FAMILY
|
+----------------+----------------------+
Compara: Genome comparison
●
Which genomes are available?
mysql> SELECT * FROM genome_db WHERE genome_db_id IN (1, 11);
+--------------+----------+-------------------------+------------+
| genome_db_id | taxon_id | name
| assembly
|
+--------------+----------+-------------------------+------------+
|
1 |
9606 | Homo sapiens
| NCBI34
|
|
11 |
9031 | Gallus gallus
| WASHUC1
|
+--------------+----------+-------------------------+------------+
Compara: Genome comparison
●
Which genomes were compared?
mysql> SELECT * FROM method_link_species_set WHERE
method_link_species_set_id = 71;
+----------------------------+----------------+--------------+
| method_link_species_set_id | method_link_id | genome_db_id |
+----------------------------+----------------+--------------+
|
71 |
1 |
1 |
|
71 |
1 |
11 |
+----------------------------+----------------+--------------+
BLASTZ_NET (method_link_id = 1) has been used
for linking all the species of this set: Human
(genome_db_id = 1) and Chicken (genome_db_id =
11).
More Info
●
www.ensembl.org
●
www.ensembl.org/info/software
●
–
Tutorials
–
Database schema outline
Man pages, e.g.
perldoc Bio::Ensembl::DBSQL::DBAdaptor
Download