Functional Annotation - Background and Strategy

Functional Annotation
Background + Strategy
The Group
27th Feb 2012
Lavanya Rishishwar
Artika Nath
Lu Wang
Haozheng Tian
Shengyun Peng
Ashwath Kumar
Hamidreza Hassanzadeh
Outline
• What is Functional Annotation
• The Importance of Functional Annotation
• The Biology of H. haemolyticus
• Background for Functional Annotation
• Pros/Cons of Available Approaches
• Planned Approach
– Breadth
– Depth
Functional Annotation
THE ‘WHAT?’
Genome Assembly
Assemble the Pieces Right
Gene Prediction
Identify the words
When on board HMS Beagle, as naturalist, I was much struck with certain facts in the distribution of the inhabitants of South America, and in the geological relations of the present to the past inhabitants of that continent. These facts seemed to me to throw some light on the origin of species - that mystery of mysteries, as it has been called by one of our greatest philosophers.
Functional Annotation
Identify the function (i.e., meaning) of each word
When on board HMS Beagle, as naturalist, I was much struck with certain facts in the distribution of the inhabitants of South America, and in the geological relations of the present to the past inhabitants of that continent. These facts seemed to me to throw some light on the origin of species - that mystery of mysteries, as it has been called by one of our greatest philosophers.
nat·u·ral·ist [nach-er-uh-list, nach-ruh-]
noun
1. a person who studies or is an expert in natural history, especially a zoologist or botanist.
2. an adherent of naturalism in literature or art.
Origin: 1580–90; natural + -ist
DATABASES
PROFILES
Origin of Species, The
noun
(On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life) a treatise (1859) by Charles Darwin setting forth his theory of evolution.
THE GRAVITY OF THE ANNOTATION
PROCESS
Not just Newtonian
“Ultimately, one wishes to determine
how genes—and the proteins they
encode—function
in the intact
organism.”
Alberts B, et al. (2002) Molecular Biology of the Cell. New York: Garland Science.
Function? What is it?
• To a cell biologist, function might refer to the network of interactions in which the protein participates or to its localization to a certain cellular compartment.
• To a biochemist, function refers to the metabolic process in which a protein is involved or to the reaction catalyzed by an enzyme.
Functional Annotation
Functional annotation consists of attaching
biological information to genomic elements.
• Biochemical function
• Biological function
• Regulation and interactions involved
• Expression
Whatever happened to wet-lab?
“Experimentally annotating one complete
bacterial genome varies from organism to
organism. Roughly speaking, it could take as
much as $25,000 and a period of 6-12 months
for completing the process”
- Alejandro Caro
The Naked Truth
[Figure: No. of Genomes in KEGG (y-axis, 0 to 2000) over time (x-axis, 7/98 to 1/11). Source: KEGG Genome, Release Update of Jan 2012]
How Do Genes Perform Their Function? The Operon
• Operon: several genes with related functions that are regulated together, because one piece of mRNA codes for several related proteins.
• Polycistronic mRNA, i.e. mRNA coding for more than one polypeptide, is found mainly in prokaryotes
Coding and Non-coding RNAs
• Protein Coding: Enzymes, Structural, Regulatory, Signal Transduction, Receptors, Toxins, Virulence Factors, Membrane/Transmembrane
• Non Coding: Riboswitches, CRISPR, sRNAs
• Pathway Prediction
Domain/Motif
• Domain: a discrete structural unit that is assumed to fold independently of the rest of the protein and to have its own function.
~20-100 aa
• Motif: short, conserved regions, frequently the most conserved regions of domains. Motifs are critical for the domain to function.
Haemophilus haemolyticus
- The Biography
Understanding the Target
Haemophilus haemolyticus
• Gram-negative
• Facultative anaerobe
• Known to colonize the human respiratory tract.
• Of the 8 Haemophilus species found to colonize the respiratory tract, H. influenzae and H. haemolyticus are the most prevalent.
• H. haemolyticus is an emerging pathogen
– 5 cases of invasive disease reported between 2009 and 2010.
Strains of H. haemolyticus
Strain | Species | Disease State | State Isolated | Hemolysis | Hpd | fucK
M19107 | H. haemolyticus | Asymptomatic | Minnesota | Y | - | -
M19501 | H. haemolyticus | Asymptomatic | Minnesota | N | + | -
M21127 | H. haemolyticus | Pathogenic | Georgia | Y | - | -
M21621 | H. haemolyticus | Pathogenic | Texas | Y | - | -
M21639 | H. haemolyticus | Pathogenic | Illinois | N | - | -
M21709 | H. influenzae | Pathogenic | NY | N | - | +
fucK: encodes fuculose kinase. fucK deletion has been observed in some H. influenzae (Hi) isolates.
Hpd: encodes lipoprotein protein D.
Phylogeny
Nørskov-Lauritsen, N., et al. (2005). Multilocus sequence phylogenetic study of the genus Haemophilus with description of Haemophilus pittmaniae sp. nov. International Journal of Systematic and Evolutionary Microbiology, 55, 449–456.
View from 300 ft
and a brief time travel
Ontology
• An ontology is a "formal, explicit specification of a shared conceptualization"
• Two major formal schemes:
– EC – Enzyme Commission Number
– GO – Gene Ontology
Enzyme Commission (EC)
• A large-scale, comprehensive attempt to organize and classify enzymes according to their function
• For inclusion in the list, direct experimental evidence must be provided for the claimed activity
• Organizes the list of enzymes in four levels of hierarchy, starting with the 6 top-level classes:
1. Oxidoreductases
2. Transferases
3. Hydrolases
4. Lyases
5. Isomerases
6. Ligases
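The four-level hierarchy can be illustrated with a small parser. This is a minimal sketch, not part of any EC tooling: the function name `parse_ec` and the returned dictionary layout are assumptions made for illustration, and only the six top-level classes listed on the slide are recognized.

```python
# Top-level EC classes as listed on the slide.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
}


def parse_ec(ec_number):
    """Split an EC number such as '1.1.1.1' into its four hierarchy levels.

    A '-' marks a level that is left unspecified (e.g. '3.4.-.-').
    """
    fields = ec_number.strip().split(".")
    if len(fields) != 4:
        raise ValueError("an EC number has exactly four levels: %r" % ec_number)
    levels = [None if field == "-" else int(field) for field in fields]
    if levels[0] not in EC_CLASSES:
        raise ValueError("unknown top-level class in %r" % ec_number)
    return {"class": EC_CLASSES[levels[0]], "levels": levels}


print(parse_ec("1.1.1.1"))
print(parse_ec("3.4.-.-"))
```

Each further level narrows the parent class, which is the parent-to-child relationship the hierarchy encodes.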
Chronology: Enzyme Commission (EC)
• Cons of EC:
– The hierarchy only provides parent-to-child relationships
– Specific to enzymes only (doesn't cover all proteins)
Chronology: Gene Ontology (GO)
Or in other words "give this protein a name and stick to it!!"
What is the GO?
• Molecular Function
• Biological Process
• Cellular Component
• Relations between the terms
– ‘is_a’
– ‘part_of’, ‘has_part’
– ‘regulates’
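Because terms are linked by typed relations, a term's annotations can be propagated to its ancestors by walking the graph. The sketch below assumes a toy graph: the term names and edges in `TOY_GO` are illustrative only and are not taken from the real Gene Ontology, and the function name `ancestors` is hypothetical.

```python
from collections import deque

# Toy graph: child term -> list of (relation, parent term).
# Names and edges are illustrative, not real GO content.
TOY_GO = {
    "mitochondrial membrane": [("part_of", "mitochondrion")],
    "mitochondrion": [("is_a", "organelle")],
    "organelle": [("is_a", "cellular component")],
    "cellular component": [],
}


def ancestors(term, graph, relations=("is_a", "part_of")):
    """Collect every term reachable from `term` through the allowed relations."""
    seen = set()
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for relation, parent in graph.get(current, []):
            if relation in relations and parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


print(sorted(ancestors("mitochondrial membrane", TOY_GO)))
```

Restricting `relations` to `("is_a",)` shows why the relation type matters: the walk from "mitochondrial membrane" then stops immediately, because its only outgoing edge is `part_of`.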
Structure of GO
du Plessis L, Skunca N, Dessimoz C (2011). The what, where, how and
why of gene ontology–a primer for bioinformaticians. Brief Bioinform.
Doi: 10.1093/bib/bbr002
General Rule To Apply Evidence Code
Where Do Annotations Come From?
• Inferred from experiment
– Most reliable
– Basis for computational methods
• Inferred from computational method
– Sequence similarity, structural similarity, etc.
• Inferred from author statement
• Curator statement and obsolete evidence codes
Why use the GO?
• The ‘GO Consortium’ consists of a number of large
databases working together to define standardized
ontologies and provide annotations to the GO.
• Search for interacting genes
• Reason across the relations
• Analyze the results of high-throughput experiments
• Infer the function of un-annotated genes and infer protein-protein interactions.
CAUTION!
PROS AND CONS OF CONVENTIONAL APPROACHES
Choosing The Right Function Prediction Tool
“Perutz et al. showed in 1960 that myoglobin and hemoglobin, the
first two protein structures to be solved at atomic resolution using
X-ray crystallography, have similar structures even though their
sequences differ.”
Pros and Cons: There are no free lunches!
• Homology is useful, but it is not the same as “same function”
– It simply implies common ancestry
• The quality of a prediction is only as good as the quality of annotation of the database
• A eukaryotic function predictor cannot be used for prokaryotes, and vice versa
BREADTH AND DEPTH OF THE
ANALYSIS
A Snapshot of the Iceberg Named Functional Annotation
Spectrum of Methods Selected
BREADTH
Criteria for selecting methods
1. Currently being maintained
2. Applicable to prokaryotic sequences
3. Can be installed locally (supporting batch jobs if GUI-based)
OR
can be included in a pipeline, i.e., has a command-line interface
Categories of Approaches
• Sequence similarity-based
• Phylogenomics-based
• Domain/pattern/profile - based
– Domain-based
– Pattern-based
– Profile-based
• Sequence clustering-based
• Machine learning-based
• Network-based
Breadth: Options
Approach | Resource
Sequence similarity based | GOtcha, PFP, GOsling, OntoBlast, GOblet, Blast2GO
Phylogenomics based | SIFTER, AFAWE, RIO, OrthoStrapper
Domain/pattern/profile based | InterProScan, TMHMM, HMMTOP, HMMER, Pfam, SUPERFAMILY, PROSITE, PRINTS, SMART, Gene3D, PANTHER, TIGRFAMs, SCOP, CATH, CatFam, PIRSF, PRODOM, EFICAz, PRIAM
Sequence clustering based | ProtoNet, CluSTr, eggNOG, COGs, InParanoid, MultiParanoid, OrthoMCL
Machine learning based | ProtFun, GOPET, SVM-Prot, ffPred, EzyPred
Network based | MCODE, AGeS, SAMBA, RNSC, PRODISTIN, Cytoscape, STRING, VisANT, VIRGO
Pipelines | RAST, MultiParanoid, AGMIAL, MicroScope
Legend: Dead, GUI, Proprietary, Eukaryotic Model, External Servers, InterPro, Web-based Servers
Flowchart
Description of Selected Methods
DEPTH
Level 1
The building blocks!
PanGenome Analysis
• The pan-genome is the full complement of genes in a species.
• It includes the core genome (genes present in all strains), the dispensable genome (genes present in two or more strains, but not all), and unique genes (genes specific to a single strain).
• In this case, we will be using the pan-genome of Haemophilus influenzae.
• This database will be used as the reference database in BLAST.
• This method gives high-confidence annotations, since the strains selected are very closely related to the organism in question.
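The core / dispensable / unique split can be sketched from a gene presence table. This is a minimal illustration under stated assumptions: the input layout (strain name mapped to a set of gene-family identifiers), the function name `classify_pangenome`, and the toy strain and gene names are all hypothetical, and grouping genes into families is assumed to have been done beforehand.

```python
def classify_pangenome(presence):
    """Partition gene families into core, dispensable and unique sets.

    presence: dict mapping strain name -> set of gene-family ids found in it.
    """
    n_strains = len(presence)
    counts = {}
    for genes in presence.values():
        for gene in genes:
            counts[gene] = counts.get(gene, 0) + 1

    # Core: present in every strain.
    core = {g for g, c in counts.items() if c == n_strains}
    # Unique: present in exactly one strain (only meaningful with 2+ strains).
    unique = {g for g, c in counts.items() if c == 1 and n_strains > 1}
    # Dispensable: everything in between.
    dispensable = set(counts) - core - unique
    return core, dispensable, unique


toy = {
    "strain1": {"g1", "g2", "g3"},
    "strain2": {"g1", "g2"},
    "strain3": {"g1", "g4"},
}
core, dispensable, unique = classify_pangenome(toy)
print(sorted(core), sorted(dispensable), sorted(unique))
```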
BLAST: How it works?
1. Divide a query sequence into short chunks called words
2. Look for exact matches
3. In case of a hit, try extending the alignment
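The three steps above can be sketched as follows. This is a deliberately simplified illustration, not the BLAST implementation: it uses exact word matches only, ungapped extension, and a simple drop-off rule, and the function names, the scoring values (`match=1`, `mismatch=-1`) and the `drop_off` parameter are assumptions chosen for the example.

```python
def find_seeds(query, subject, word_size=3):
    """Steps 1-2: index subject words, then look up every query word."""
    index = {}
    for j in range(len(subject) - word_size + 1):
        index.setdefault(subject[j:j + word_size], []).append(j)
    seeds = []
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            seeds.append((i, j))
    return seeds


def extend_seed(query, subject, i, j, word_size=3, match=1, mismatch=-1, drop_off=2):
    """Step 3: ungapped extension of a seed at query[i], subject[j].

    Extension in each direction stops once the running score falls
    `drop_off` below the best score seen so far.
    Returns (query_start, subject_start, length, score).
    """
    score = best = word_size * match

    # Extend to the right.
    q_end, s_end = i + word_size, j + word_size
    best_q_end = q_end
    while q_end < len(query) and s_end < len(subject):
        score += match if query[q_end] == subject[s_end] else mismatch
        q_end += 1
        s_end += 1
        if score > best:
            best, best_q_end = score, q_end
        elif best - score >= drop_off:
            break

    # Extend to the left, starting again from the best score.
    score = best
    q_start, s_start = i, j
    best_q_start = q_start
    while q_start > 0 and s_start > 0:
        score += match if query[q_start - 1] == subject[s_start - 1] else mismatch
        q_start -= 1
        s_start -= 1
        if score > best:
            best, best_q_start = score, q_start
        elif best - score >= drop_off:
            break

    length = best_q_end - best_q_start
    subject_start = j - (i - best_q_start)
    return best_q_start, subject_start, length, best


query, subject = "ACGTACGT", "TTACGTACGTTT"
for i, j in find_seeds(query, subject):
    print((i, j), extend_seed(query, subject, i, j))
```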
Statistical assessment
E-value: $E = m \times n \times P$
where
$m$ = total number of residues in the database
$n$ = number of residues in the query sequence
$P$ = probability that an HSP alignment is the result of random chance
E.g., $m = 1 \times 10^{20}$, $n = 100$, $P = 1 \times 10^{-20}$
$\Rightarrow E = 1 \times 10^{20} \times 100 \times 1 \times 10^{-20} = 1 \times 10^{2}$
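The simplified formula on this slide is a plain product, so it is easy to evaluate for any database size. The sketch below uses made-up input values (a 5-million-residue database, a 300-residue query, a per-alignment chance probability of 1e-12); the function name `expected_hits` is hypothetical, and the full BLAST statistics (which derive the probability from the alignment score) are not modeled.

```python
def expected_hits(m, n, p):
    """Simplified E-value: database residues x query residues x chance probability."""
    return m * n * p


# Hypothetical inputs, chosen only to illustrate the arithmetic.
database_residues = 5e6
query_residues = 300
chance_probability = 1e-12

e_value = expected_hits(database_residues, query_residues, chance_probability)
print("E = %.2e" % e_value)
```

Because E scales linearly with database size, the same alignment becomes less significant as the reference database grows.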
Different flavors!
• BLASTN
– Nucleotide query vs. nucleotide database
• BLASTP
– Protein query vs. protein database
• BLASTX
– Nucleotide query translated in all 6 possible frames vs. protein database
• TBLASTN
– The reverse of BLASTX: protein query vs. nucleotide database translated in all 6 possible frames
• TBLASTX
– Nucleotide query translated in all 6 possible frames vs. nucleotide database translated in all 6 possible frames
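The "6 possible frames" used by the translated searches are three offsets on each strand. A minimal sketch, assuming the standard genetic code and an uppercase A/C/G/T alphabet; the function names and frame labels (`+1` to `-3`) are illustrative choices, and incomplete trailing codons are simply dropped.

```python
BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. containing N) become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return the six conceptual translations keyed by frame label."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames["%s%d" % (strand, offset + 1)] = translate(seq[offset:])
    return frames


for label, protein in six_frames("ATGGCCTAA").items():
    print(label, protein)
```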
"InterPro provides functional analysis of proteins by
classifying them into families and predicting domains
and important sites."
• Combines protein signatures from a number of member
databases into a single searchable resource
• Capitalizes on their individual strengths to produce an integrated
database and diagnostic tool.
Current release: 36.0 (23 February 2012)
New features:
• An update to Pfam (26.0) and PIRSF (2.78).
• The integration of 755 new methods from the GENE3D, PANTHER,
PIRSF, Pfam and SUPERFAMILY databases.
Member database information
Signature Database | Version | Signatures* | Integrated Signatures**
GENE3D | 3.3.0 | 2386 | 1441
HAMAP | 140911 | 1702 | 1686
PANTHER | 7 | 69566 | 2392
PIRSF | 2.78 | 2983 | 2983
PRINTS | 41.1 | 2050 | 2001
PROSITE patterns | 20.72 | 1308 | 1291
PROSITE profiles | 20.72 | 922 | 897
Pfam | 26 | 13672 | 12672
PfamB | 26 | 20000 | 0
ProDom | 2006.1 | 1894 | 1105
SMART | 6.2 | 1008 | 1002
SUPERFAMILY | 1.73 | 1774 | 1208
TIGRFAMs | 10.1 | 4023 | 4002
* Some signatures may not have matches to UniProtKB proteins.
** Not all signatures of a member database may be integrated at the time of an InterPro release.
Member databases
• HAMAP: High-quality Automated and Manual Annotation of microbial Proteomes
• PANTHER: Protein ANalysis THrough Evolutionary Relationships
• SMART: Simple Modular Architecture Research Tool
• PIRSF: evolutionary relationships of proteins from super- to sub-families
• PROSITE: database of protein domains, families and functional sites
• SUPERFAMILY: “SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.”
• Gene3D: “The Gene3D database is a large collection of CATH (Class, Architecture, Topology, Homologous superfamily) protein domain assignments for ENSEMBL genomes and Uniprot sequences.”
• PRINTS: “PRINTS is a database of protein family ‘fingerprints’ offering a diagnostic resource for newly-determined sequences.”
• TIGRFAMs
• ProDom
Integration into
InterPro
Features of Member Databases
• ProDom: provides sequence clusters built from UniProtKB using PSI-BLAST.
• PROSITE patterns: provide simple regular expressions.
• PROSITE and HAMAP profiles: provide sequence matrices.
• PRINTS: provides fingerprints, which are groups of aligned, unweighted Position Specific Sequence Matrices (PSSMs).
• PANTHER, PIRSF, Pfam, SMART, TIGRFAMs, Gene3D and SUPERFAMILY: provide hidden Markov models (HMMs).
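Of the signature types listed above, PROSITE patterns are the simplest to illustrate, since they map almost directly onto regular expressions. The converter below is a minimal sketch covering the common pattern elements (`x`, `[..]`, `{..}`, repetition counts and the `<` / `>` anchors); the function name `prosite_to_regex` is hypothetical, rarer syntax such as anchors inside brackets is not handled, and this is not how InterProScan itself scans patterns. The example pattern is the N-glycosylation site `N-{P}-[ST]-{P}`.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    regex = pattern.strip().rstrip(".")
    regex = regex.replace("-", "")                      # element separator
    regex = regex.replace("x", ".")                     # x = any residue
    regex = regex.replace("{", "[^").replace("}", "]")  # {P} = anything but P
    regex = regex.replace("(", "{").replace(")", "}")   # x(2,4) = repeat 2 to 4 times
    regex = regex.replace("<", "^").replace(">", "$")   # N- / C-terminal anchors
    return regex


n_glyc = re.compile(prosite_to_regex("N-{P}-[ST]-{P}"))
for seq in ("AANGSAA", "AANPSAA"):
    hit = n_glyc.search(seq)
    print(seq, hit.start() if hit else None)
```

Profiles and HMMs cannot be reduced to a regular expression in this way, because they score each position rather than accept or reject it.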
Querying with InterProScan
“Sequence-based queries are performed using InterProScan, a tool that combines the different protein signature recognition methods native to the InterPro member databases into one resource.”
Query Sequence → InterProScan
• Web version
• Stand-alone version
– A wrapper of sequence analysis apps
– Database and output files scanning
– Bulk data processing
Member Databases & Scanning Methods
Member Database | Scanning Method | Software Package
PROSITE patterns | not shown | not shown
Prosite Profiles | pfscan | Pftools
HAMAP Profiles | pfscan | Pftools
PRINTS | FingerPRINTScan |
PFAM | hmmscan | HMMER3.0b3
PRODOM | ProDomBlast |
SMART | hmmpfam | HMMER2.3.2
TIGRFAMs | hmmscan | HMMER3.0b3
PIR SuperFamily | hmmpfam | HMMER2.3.2
SUPERFAMILY | hmmpfam/hmmsearch | HMMER2.3.2
GENE3D | hmmpfam | HMMER2.3.2
The TMHMM and SignalP prediction search algorithms are provided
through the web interface at EBI. However, they are not integrated into
InterPro.
Blast2GO
• B2G has been designed to (1) allow automatic and high-throughput sequence annotation and (2) integrate functionality for annotation-based data mining.
Why Blast2GO?
• Blast2GO is designed for high-throughput
sequence annotation.
• Strong data mining and visualization capabilities
• Good at utilizing annotated sequences already deposited in public databases.
How Blast2GO works?
• Basically, Blast2GO uses local or remote BLAST
searches to find similar sequences to one or several
input sequences.
• The program extracts the GO terms associated with each of the obtained hits and returns an evaluated GO annotation for the query sequence(s).
• Enzyme codes are obtained by mapping from equivalent GOs, while InterPro motifs are directly queried at the InterProScan web service.
• GO annotation can be visualized by reconstructing the structure of the Gene Ontology relationships, and ECs are highlighted on KEGG maps.
How Blast2GO works?
• OBTAINING GO TERMS
– The first step is to find sequences similar to a
query set by Blast searching. Homology search can
either be done at public databases or custom
databases when a local Blast installation is
available.
– By using Blast hit gene identifiers (gi) and gene
accessions B2G retrieves all GO annotations for
the hit sequences, together with their evidence
codes (EC).
How Blast2GO works?
• ANNOTATION ASSIGNMENT
– annotation score (AS), direct term (DT)
How Blast2GO works?
• STATISTICS
– Statistical assessment of GO term enrichment in a group of interesting genes compared with a reference group (Blüthgen et al., 2004).
– Gossip computes Fisher’s Exact Test, applying robust FDR (false discovery rate) correction for multiple testing, and returns a list of significant GO terms ranked by their corrected or one-test P-values
• VISUALIZATION
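The enrichment test described under STATISTICS can be sketched with the standard library alone. This is an illustration of the idea, not the Gossip code: the one-sided Fisher's exact test is written as a hypergeometric upper tail with the test group treated as a subset of the reference universe, and the Benjamini-Hochberg procedure is swapped in as a common FDR correction, which may differ from the "robust FDR" method the slide refers to. Function names and the toy counts are assumptions.

```python
from math import comb  # requires Python 3.8+


def fisher_enrichment_p(k, n, K, N):
    """One-sided Fisher's exact test for over-representation of a GO term.

    N: genes in the universe, K: of those annotated with the term,
    n: genes in the test group, k: of those annotated with the term.
    Returns P(X >= k) under the hypergeometric distribution.
    """
    total = comb(N, n)
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy counts per GO term: (k, n, K, N). Values are invented for illustration.
terms = {"term_a": (8, 20, 10, 200), "term_b": (2, 20, 30, 200)}
raw = [fisher_enrichment_p(*counts) for counts in terms.values()]
for name, p, q in zip(terms, raw, benjamini_hochberg(raw)):
    print(name, "p=%.3g" % p, "adjusted=%.3g" % q)
```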
Systems for Functional Annotation
• Clusters of Orthologous Groups (COGs)
• euKaryote Orthologous Groups (KOGs)
• Gene Ontology (GO)
• Enzyme Commission number (EC)
Clusters of Orthologous Groups of Genes (KOGs, COGs)
– Why?
• Orthologs typically retain the same function during evolution and hence play a critical role in functional annotation. COGs provide a framework for functional analysis.
• They are also important for phylogenetic and evolutionary analysis of genomes. Interpretable phylogenetic trees generally can be constructed only within sets of orthologs.
How to find Orthologous genes?
• Naive approach: for a query gene and a target genome, the hit with the highest similarity score is taken to indicate the orthologous relationship
– Gives good results for species that are not too distant
– What about larger phylogenetic distances?
• Gene duplications: suggest that a many-to-many relationship is required
• What if several hits with only moderate scores emerge? A stringent threshold may lead to false negatives
• COG approach: any two genes inside a COG are either orthologous genes or belong to orthologous groups of paralogs
How to create COGs
• Choose all 2-permutations of the available genes and perform pairwise comparisons between genes from different clades (in this case 5 clades)
Number of genes | Number of 2-permutations
2 | 2
10 | 90
3000 | ~8.9e6
17967 | ~3.2e8
• Best hits (BeTs) in other organisms are identified
• Build the graph of consistent relations (this does not depend on an absolute threshold level)
• The simplest case is a triangle: if a gene has best hits in two other genomes, a best hit between those two genes is expected only if the three genes are orthologs
• Merge all triangles that share a common side
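The triangle-and-merge idea can be sketched on a toy best-hit graph. This is a simplified illustration, not the published COG procedure: a link is accepted if either gene lists the other as a best hit, all gene triples are enumerated by brute force, and the gene identifiers (genome, gene) as well as the function names are hypothetical.

```python
from itertools import combinations


def find_triangles(best_hits):
    """Find triples of genes from three different genomes that are all linked by BeTs.

    best_hits: dict mapping (genome, gene) -> set of (genome, gene) best hits.
    """
    def linked(a, b):
        return b in best_hits.get(a, ()) or a in best_hits.get(b, ())

    genes = sorted(set(best_hits) | {h for hits in best_hits.values() for h in hits})
    triangles = []
    for a, b, c in combinations(genes, 3):
        different_genomes = len({a[0], b[0], c[0]}) == 3
        if different_genomes and linked(a, b) and linked(b, c) and linked(a, c):
            triangles.append(frozenset((a, b, c)))
    return triangles


def merge_triangles(triangles):
    """Repeatedly merge clusters that share a side (at least two genes)."""
    clusters = [set(t) for t in triangles]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if len(clusters[i] & clusters[j]) >= 2:
                    clusters[i] |= clusters[j]
                    del clusters[j]
                    merged = True
                    break
            if merged:
                break
    return clusters


a1, b1, c1, d1 = ("A", "a1"), ("B", "b1"), ("C", "c1"), ("D", "d1")
toy_hits = {a1: {b1, c1}, b1: {c1, d1}, c1: {d1}}
triangles = find_triangles(toy_hits)
print(len(triangles), merge_triangles(triangles))
```

In the toy data the triangles (a1, b1, c1) and (b1, c1, d1) share the side b1-c1, so they merge into a single four-gene cluster even though a1 and d1 are not direct best hits of each other.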
How to create COGs - continued
• Due to the existence of paralogs, BeTs are not necessarily symmetrical (cf. RBBH, Reciprocal Best BLAST Hits)
Tatusov, Koonin & Lipman, Science 278, 631 (1997)
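The asymmetry is easy to see when best hits are compared in both directions. In the sketch below, the two dictionaries of best hits and the function name `reciprocal_best_hits` are hypothetical; the toy data assumes genome A carries two paralogs (a1, a2) that both find b1 as their best hit, while b1 can only return one of them.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Keep only the pairs where each gene is the other's best hit."""
    return sorted(
        (a, b) for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a
    )


# Toy data: paralogs a1 and a2 both hit b1, but b1's best hit is a1 only.
best_a_to_b = {"a1": "b1", "a2": "b1"}
best_b_to_a = {"b1": "a1"}

print(reciprocal_best_hits(best_a_to_b, best_b_to_a))
```

A strictly reciprocal criterion drops the a2 to b1 relation, whereas the BeT graph used for COGs keeps it, which is how paralogs can end up in the same cluster.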
Facing challenges when creating COGs
• The clusters, however, are subject to ambiguity:
– Proteins with distinct regions (multi-domain proteins), each belonging to a different conserved family.
• Solution: further inspection of domains
– When one gene in a pair of paralogs is lost in one lineage (but not in the other), the two COGs may be artificially merged.
• Solution: similarity measures
COGs vs. Gene Function
• Each COG includes proteins from at least 3 major clades, with divergence times estimated at over a billion years. Hence they are ancient conserved families with important (if not essential) functions
• Accordingly, proteins belonging to mysterious COGs are good candidates for further analysis
Clusters of Orthologous Groups
(COGs)
http://www.ncbi.nlm.nih.gov/COG/
Classification of COGs by functional categories
INFORMATION STORAGE AND PROCESSING
[J] Translation, ribosomal structure and biogenesis
[A] RNA processing and modification
[K] Transcription
[L] Replication, recombination and repair
[B] Chromatin structure and dynamics
CELLULAR PROCESSES AND SIGNALING
[D] Cell cycle control, cell division, chromosome partitioning
[Y] Nuclear structure
[V] Defense mechanisms
[T] Signal transduction mechanisms
[M] Cell wall/membrane/envelope biogenesis
[N] Cell motility
[Z] Cytoskeleton
[W] Extracellular structures
[U] Intracellular trafficking, secretion, and vesicular transport
[O] Posttranslational modification, protein turnover, chaperones
METABOLISM
[C] Energy production and conversion
[G] Carbohydrate transport and metabolism
[E] Amino acid transport and metabolism
[F] Nucleotide transport and metabolism
[H] Coenzyme transport and metabolism
[I] Lipid transport and metabolism
[P] Inorganic ion transport and metabolism
[Q] Secondary metabolites biosynthesis, transport and catabolism
POORLY CHARACTERIZED
[R] General function prediction only
[S] Function unknown
LipoP
• It is a tool used mainly to predict lipoprotein signal peptides.
• It is most suitable for Gram-negative bacteria but has been shown to have considerable accuracy for Gram-positive bacteria as well.
• It uses Hidden Markov Models to distinguish between lipoproteins (SPaseII-cleaved proteins), SPaseI-cleaved proteins, cytoplasmic proteins, and transmembrane proteins.
Thank You!
To be continued…