Blast2GO: adding GO for your data sets

Blast2GO presentation @ StatSeq COST workshop, 21st-23rd April, Helsinki, Finland
Friday, 25th January 2013, Royal Melbourne Hospital
Why Blast2GO
Functional characterization of novel sequence data
Adapted to the high-throughput needs of biological laboratories
Extracting knowledge about the functioning of genomes
Blast2GO Impact
Outline
Concepts of Functional Annotation
The Blast2GO annotation framework
Visualization of functional data
Pathway analysis with Blast2GO
Concepts of Functional Annotation
What is functional annotation?
How to annotate a large dataset?
The Gene Ontology
• Three branches:
  o Biological Process
  o Molecular Function
  o Cellular Component
• Annotations are given to the most specific (low) level
• True path rule: annotation at a given term implies annotation to all its parent terms
• Annotation is given with an Evidence Code:
  o IDA: inferred from direct assay
  o TAS: traceable author statement
  o ISS: inferred from sequence similarity
  o IEA: inferred from electronic annotation
  o ...
[Diagram: GO graph, from more general (top) to more specific (bottom)]
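The true path rule can be sketched in a few lines of Python. The toy ontology below (term names and the `PARENTS` mapping) is invented for illustration and is not real GO content; it only shows how one direct annotation implies a set of parent annotations.

```python
# Toy ontology: child term -> list of parent terms (hypothetical, not real GO data).
PARENTS = {
    "root": [],
    "metabolism": ["root"],
    "biosynthesis": ["metabolism"],
    "amino acid biosynthesis": ["biosynthesis"],
    "transport": ["root"],
}


def with_ancestors(terms, parents=PARENTS):
    """Return the given terms plus every ancestor implied by the true path rule."""
    implied = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in implied:
            implied.add(term)
            stack.extend(parents[term])  # walk upwards through all parents
    return implied


# One direct annotation at a specific term implies annotation to all its parents.
implied = with_ancestors({"amino acid biosynthesis"})
```

Because GO is a graph in which a term may have several parents, the sketch stores a list of parents per term rather than a single parent.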
Functional assignment
Annotation: Empirical or Transference
[Diagram: sources of functional evidence] Literature reference; Phylogeny; Molecular interactions; Gene/protein expression; Biochemical assay; Structure comparison; Identification of folds; Sequence analysis; Sequence homology; Motif identification
Annotation by similarity: concerns
[Diagram: the GO terms of the HIT (GO1, GO2, GO3, GO4) are transferred to the QUERY]
• Level of homology (transfer is possible from roughly 40-60%)
• The overlap between hit and query; the association between function and structure
• The paralog problem: genes with similar sequences might have different functional specifications
• The evidence for the original annotation
• Balance between quality and quantity: depends on the use
The Blast2GO annotation framework
Application scheme
[Diagram: from Fasta input to GO terms (biological process, cellular component)]
Basic annotation procedure
(1) Blast: each query sequence retrieves a list of hits
Sq1: Hit1, Hit2, Hit3, Hit4
Sq2: Hit1, Hit2, Hit3, Hit4
Sq3: Hit1, Hit2, Hit3, Hit4
Sq4: Hit1, Hit2
(2) Mapping: the GO terms attached to the hits are collected
Sq1: go1, go2, go3 | go1, go3, go4 | go3, go5, go6, go8 | go1, go4
Sq2: go6, go9, go8 | go1, go8 | go4, go1, go8, go9
Sq3: go2 | go2, go4, go4 | go2, go5, go6 | go2, go4
Sq4: (no GO terms shown)
(3) Annotation: GO terms are selected for Sq1-Sq4 from these candidate lists
Annotation Rule
- Let GO1…n be candidate annotations for sequence S1, obtained from hits Hi…k
- We compute an annotation score AS for each GOi that depends on:
- The similarity between sequence S1 and Hj
- The evidence code of GOi
- The existence of other neighboring GO candidates
- The structure of the Gene Ontology
- We define an arbitrary annotation threshold (AT)
- S1 is annotated with GOi if its AS(GOi) > AT
Annotation Rule
[Diagram: candidate terms GO1, GO2, GO3, GO4; Annotation Score against Cut-Off Value gives the new annotation]
Similarity requirement
Quality of source annotation: IEA = 0.7, IDA = 1, NR = 0.0, ...
Possibility of abstraction (True-Path-Rule)
Selectivity vs. specificity
Blast2GO annotation rule
- When I have a GO with ECw =1 and I do not allow abstraction (GOw =
0), then the Annotation Score = %similarity
- If the ECw < 1 my similarity requirement is higher to obtain the same
Annotation Score
- If I allow abstraction GOw > 0, then with less similarity I can obtain the
required Annotation Score at a parent node
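A minimal Python sketch of the three statements above. The additive form (a direct term plus an abstraction term), the function names and the example threshold of 55 are assumptions made for illustration; only the evidence-code weights IEA = 0.7, IDA = 1 and NR = 0.0 come from the slides.

```python
# Evidence-code weights (ECw). Only these three values are given on the slides.
EC_WEIGHTS = {"IDA": 1.0, "IEA": 0.7, "NR": 0.0}


def annotation_score(hits, go_weight=0.0, n_child_annotations=0):
    """Score one candidate GO term.

    hits: list of (percent_similarity, evidence_code) for hits carrying the term.
    go_weight (GOw) and n_child_annotations model abstraction to a parent node;
    adding them to the direct term is an assumption of this sketch.
    """
    direct = max((sim * EC_WEIGHTS[ec] for sim, ec in hits), default=0.0)
    abstraction = max(n_child_annotations - 1, 0) * go_weight
    return direct + abstraction


def is_annotated(score, threshold):
    """Annotation rule: the term is assigned when AS > AT."""
    return score > threshold


THRESHOLD = 55  # hypothetical AT value, chosen only for this example

# ECw = 1 and GOw = 0: the score equals the % similarity.
score_ida = annotation_score([(80, "IDA")])
# ECw < 1: the same similarity yields a lower score.
score_iea = annotation_score([(80, "IEA")])
# GOw > 0: a parent node can pass the threshold with less similarity.
score_parent = annotation_score([(60, "IEA")], go_weight=5, n_child_annotations=4)
```

With these numbers, a 60% IEA hit alone scores about 42 and stays below the example threshold, while the same hit at a parent node with abstraction enabled scores about 57 and passes.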
www.blast2go.com
Start Blast2GO
Blast2GO Application
(1) Blast
(2) Mapping
(3) Annotation
Main
Sequence
Table
Any operation
will only affect
to selected
sequences!!!!
Application statistics
Blast results
Application messages
Graph visualisation
Load sequences
Input data
(in FASTA format, AA or nt)
>my_favourite_species_seq1 | still unknown
gtgatggaaaagaaaagttttgttatcgtcgacgcatatgggtttctttttcgcgcgtattatgcgctgcctggattaagcacctcatacaattttcctgtaggaggtgtatatggtttt
ataaacatacttttgaaacatctctctttccacgatgcagattatttagttgtggtatttgattcggggtcgaaaaattttcgtcacactatgtattccgaatacaaaactaatcgccct
aaagcaccagaggatctgtcactacaatgtgctccgctacgtgaggctgttgaagcgtttaatattgtaagtgaagaagtgcttaactacgaagcagacgacgtaatagcta
cactctgtacaaaatatgcatctagtaatgttggagtgagaatactgtcagcagataaggatttactacaactcctaaatgataatgttcaagtttacgaccctataaaaagca
gatacctcaccaatgaatacgttttagaaaaatttggtgtttcatcagataagttgcatattgatacggttgcatcgagttataatgagaaaattattctcagctaagctgtacacc
gtttattacacactcgaaaggccgttag
as
>my_favourite_species_seq2 | no clue
df
ttgttagctaaaaaggaagactttcacacctttggtaatggtgttggctctgctggaacaggtggagttgtagtttctgcatccatgttgtctgcggatttttcaaatcttagagaaga
gatagcagcggttagtacggctggtgcagattggttacacattgatgtgatggatgggtgcttcgtccccagtttgactatgggtcctgtggtgatttccggcattaggaaatgta
caaatatgtttcttgatgtgcatttgatgattaatcgcccaggcgatcatctgaagagtgtggtagatgctggagctgataagatagagcacattcgcaagatgatagaggaa
asdf
agctcatcaaccgcgaaaatcgctgttgatggtggtgtttcaacggataatgcccgggctgttatcgaggcaggtgcgaatatactcgttgttggaacggcgctgtttgctgctg
acgatatgagtaaagttgtaagaactttaaaatcattttaa
>my_favourite_species_seq3 | just sequenced
gtgggactgctcatccctgtaggcagggtggctattttttgtgtaaaggcagtctttcatagtcttgtaccgccatactatctatggataactacaaagcagttttttgaggtgtggttt
ttctctcttcctatagtagcagttacatctttgtttacgggaggcgcgttagcccttcaggataccctcgtgggaagcgctaaagtatcagggtaatggagtttttactcctgcaag
atgtaatagagggtctggtaaaagctgtatcgtttgggctggtaatttcgctagttgggtgttacaacgggtatcactgtgagataggcgcaaggggtgtaggaacagcgaca
acaaaaacttcggtagcagcttctatgctcataattttgttaaactatataattactgttttttacgcgta
>my_favourite_species_seq4 | we will see soon...
atgtacgctgtatctctttcaaatttgcatgtctctttcaacaacaaggaggttttgaaaggtgttgacttggacatagcatggggggattccctggttatactgggagaatctggta
gtggaaagtctgtactaacaaaggttgtattgggtctaatagtgccccaagagggaagtgttactgtagatggcaccaatattcttgagaataggcagggcatcaagaatttt
agtgttttgtttcaaaactgtgcgttatttgacagtcttacgatttgggaaaatgtagtattcaatttccgtaggaggcttcgtttagataaggataatgccaaggctttggctttacgg
ggattggagcttgtgggattggacgccagtgtaatgaacgtgtatcctgtggagctatcaggcgggatgaaaaagcgcgtagctttggcaagagctattataggtagtccca
aaattctaattttggatgagccaacttcgggattggatcctataatgtcttcagtggt
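Before loading a file it can help to check that it parses as FASTA. The sketch below is a stand-alone Python helper, not part of Blast2GO; the function names and the small nucleotide alphabet are assumptions. It also flags lines that do not look like sequence data.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping header -> sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {name: "".join(parts) for name, parts in records.items()}


def looks_like_nucleotide(sequence, alphabet="acgtun"):
    """Heuristic: True if every character belongs to a small nucleotide alphabet."""
    return all(ch in alphabet for ch in sequence.lower())
```

A typical use would be `parse_fasta(open(path).read())` followed by a check of each sequence with `looks_like_nucleotide`, where `path` is the location of your own file.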
BLAST
Your email address
BLAST program (normally blastx)
BLAST database (many options)
E-Value (depends on the DB)
Number of HITs (use <= 20)
Recommended to save as XML
Human-readable seq. descriptions via BDA
Additional BLAST params
Set word size and filter
Use your own server
Minimum HSP length
Filter by description
Parsing options for own databases
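If you run BLAST yourself rather than through the Blast2GO dialog, the same choices map onto command-line options. The sketch below assumes a local NCBI BLAST+ installation; it only builds the argument list and does not run anything. The file names, database name and default E-value are placeholders, not recommendations from the slides.

```python
def blastx_command(query, db, out, evalue=1e-3, max_hits=20, word_size=None):
    """Build (but do not run) an argument list for NCBI BLAST+ blastx with XML output."""
    cmd = [
        "blastx",
        "-query", query,                   # FASTA file with nucleotide queries
        "-db", db,                         # protein database to search
        "-evalue", str(evalue),            # E-value cut-off (depends on the DB)
        "-max_target_seqs", str(max_hits), # number of hits to keep per query
        "-outfmt", "5",                    # 5 = BLAST XML
        "-out", out,
    ]
    if word_size is not None:
        cmd += ["-word_size", str(word_size)]
    return cmd


# Placeholder file and database names.
command = blastx_command("my_seqs.fasta", "nr", "blast_results.xml")
```

The list can be passed to `subprocess.run(command, check=True)` on a machine where `blastx` and the database are available.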
BLAST Results
Sequences with BLAST results turn RED
Blast Distribution Charts
Evaluate the similarity of
your sequences with public DBs
Single Sequence Menu
Mapping Results
Mapped sequences turn GREEN
Annotation Menu
BLAST based annotation
Other Annotation modes
Validation and Annex
Annotation
Allows setting a minimum percentage of the HIT sequence that must be spanned by the QUERY sequence
This helps to avoid the problem of cis-annotation
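The coverage idea can be written down as a short Python sketch. The function names and the 1-based, inclusive coordinate convention are assumptions of this example, not Blast2GO internals.

```python
def hit_coverage(hit_start, hit_end, hit_length):
    """Percentage of the hit sequence spanned by the alignment.

    Coordinates are assumed to be 1-based and inclusive.
    """
    aligned = abs(hit_end - hit_start) + 1
    return 100.0 * aligned / hit_length


def passes_coverage(hit_start, hit_end, hit_length, min_coverage):
    """True if the alignment spans at least min_coverage percent of the hit."""
    return hit_coverage(hit_start, hit_end, hit_length) >= min_coverage
```

A query that aligns to only a small domain of a long hit gets a low coverage value, so annotations that belong to other regions of the hit are not transferred.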
Annotation Result
Annotated sequences turn BLUE
Annotation Charts
Commonly, level 5 is the most abundant
specificity level in the Gene Ontology
Additional Annotation: ANNEX
Recovers implicit biological process and cellular component GO terms based on molecular function annotations (Myhre et al., Bioinformatics 2006)
Molecular Function → is involved in → Biological Process
Molecular Function → acts in → Cellular Component
Additional Annotation: InterProScan
Runs InterProScan searches at the EBI through Blast2GO
Once you have completed your InterPro annotation, results can be transformed to GO terms and merged with the Blast annotation
Results are stored on your computer as XML files. You can upload them later
InterProScan Results
Column with
InterProScan
results
Additional Annotation: GOSlim
GOSlim is a reduction of the Gene Ontology to a smaller vocabulary → helps to summarize information
After GOSlim transformation, sequences turn YELLOW
Different GOSlims are available in Blast2GO
Enzyme annotation and KEGG Maps
GO → Enzyme Codes → KEGG maps
Manual Curation
You can manually modify the annotation of particular sequences
If you tick this box, curated sequences turn purple
Export Results
Saves the complete
B2G project (heavy)
Export annotation results
in different formats
Export formats
.annot (also for import!)
C04018C10 | GO:0004707 | mitogen-activated protein kinase 3
C04018C10 | EC:2.7.11.24 |
C04018A12 | GO:0016798 | class iv chitinase
C04018A12 | GO:0000272 |
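An exported .annot file is easy to read back for downstream scripts. The sketch below assumes tab-separated columns (sequence name, GO or EC identifier, optional description); that layout is an assumption of this example, and the rows are the ones shown on the slide.

```python
from collections import defaultdict


def parse_annot(text):
    """Group annotation IDs (GO terms or EC numbers) by sequence name."""
    annotations = defaultdict(list)
    descriptions = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 2 or not fields[0]:
            continue  # skip blank or malformed lines
        name, term = fields[0], fields[1]
        annotations[name].append(term)
        if len(fields) > 2 and fields[2]:
            descriptions[name] = fields[2]
    return dict(annotations), descriptions


EXAMPLE = (
    "C04018C10\tGO:0004707\tmitogen-activated protein kinase 3\n"
    "C04018C10\tEC:2.7.11.24\n"
    "C04018A12\tGO:0016798\tclass iv chitinase\n"
    "C04018A12\tGO:0000272\n"
)
annotations, descriptions = parse_annot(EXAMPLE)
```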
GeneSpring Format
C04013E10 | response to water deprivation; regulation of transcription; multicellular organismal development; response to abscisic acid stimulus; | nucleus; | transcription factor activity;
C04013A12 | translation; | ribosome; plastid; | structural constituent of ribosome;
C04013C12 | galactose metabolic process; | plastid; | aldose 1-epimerase activity; carbohydrate binding;
GoStat
C04018C10 | 4707,9409,6979,10200,5524,169
C04018A12 | 16798,272,44248
C04018C12 | 4869,12505,8233
By Seq
C04018A02 | glyoxalase i | GO:0004462 | F:lactoylglutathione lyase activity
C04018C02 | metallothionein-like protein | GO:0046872 | F:metal ion binding
C04018G02 | protein phosphatase | GO:0008287 | C:protein serine/threonine phosphatase complex
More export formats
Export Sequence Table
Seq. Name | Seq. Description | Seq. Length | #Hits | min. eValue | mean Similarity | #GOs | GOs | Enzyme Codes | InterProScan
C04018C12 | cysteine proteinase inhibitor | 663 | 20 | 25 | 80.00% | 3 | F:GO:0004869; C:GO:0012505; F:GO:0008233 | | IPR000010; IPR01807
C04018E12 | protein phosphatase 2c | 663 | 20 | 77 | 85.00% | 2 | N:GO:0015071; F:GO:0003824 | | IPR001932; IPR01404
C04018G12 | alpha beta fold family protein | 578 | 20 | 84 | 79.00% | 4 | F:GO:0016787; C:GO:0005739; C:GO:0009507; P:GO:00 | | noIPR
C04018A02 | glyoxalase i | 600 | 20 | 64 | 74.00% | 2 | P:GO:0005975; F:GO:0004462 | EC:4.4.1.5 | IPR004360; noIPR
C04018C02 | metallothionein-like protein | 625 | 18 | 14 | 74.00% | 1 | F:GO:0046872 | | IPR000347
C04018E02 | haemolysin-iii related family expressed | 612 | 20 | 32 | 72.00% | 1 | C:GO:0016020 | | noIPR
C04018G02 | protein phosphatase expressed | 645 | 20 | 97 | 81.00% | 5 | C:GO:0008287; N:GO:0015071; P:GO:0006470; C:GO:00 | | no IPS match
C04018C04 | phosphoglycerate bisphosphoglycerate mutase family protein | 780 | 20 | 63 | 66.00% | 2 | P:GO:0008152; F:GO:0003824 | | IPR001345; IPR01307
C04018E04 | polyubiquitin | 707 | 20 | 115 | 99.00% | 2 | P:GO:0006464; C:GO:0005622 | | IPR000626; IPR01995
C04018G04 | meiotic recombination 11 | 575 | 20 | 45 | 89.00% | 21 | C:GO:0019013; P:GO:0007126; F:GO:0004519; F:GO:000 | | IPR003701; IPR00484
C04018A06 | late embryogenesis-abundant protein | 648 | 20 | 43 | 68.00% | 2 | P:GO:0009737; P:GO:0009409 | | no IPS match
Export BestHit Data
Sequence name | Sequence desc. | Sequence length | Hit desc. | Hit ACC | E-Value | Similarity | Score | Alignment length | Positives
C04018C10 | mitogen-activated protein kinase 3 | 717 | gi|122894104|gb|ABM67698.1| mitogen-activated protein kinase [Citrus sinensis] | ABM67698 | 1.35E-123 | 99 | 445.28 | 222 | 221
C04018E10 | ---NA--- | 706 | gi|157356307|emb|CAO62459.1| unnamed protein product [Vitis vinifera] | CAO62459 | 2.69E-036 | 83 | 155.22 | 119 | 99
C04018G10 | protein | 620 | gi|114153154|gb|ABI52743.1| 10 kDa putative secreted protein [Argas monolakensis] | ABI52743 | 7.47E-015 | 63 | 83.57 | 90 | 57
C04018A12 | class iv chitinase | 715 | gi|3608477|gb|AAC35981.1| chitinase CHI1 [Citrus sinensis] | AAC35981 | 1.45E-061 | 78 | 239.2 | 171 | 134
C04018C12 | cysteine proteinase inhibitor | 663 | gi|8099682|gb|AAF72202.1|AF265551_1 cysteine protease inhibitor [Manihot esculenta] | AAF72202 | 9.33E-025 | 83 | 116.7 | 99 | 83
C04018E12 | protein phosphatase 2c | 663 | gi|46277128|gb|AAS86762.1| protein phosphatase 2C [Lycopersicon esculentum] | AAS86762 | 2.76E-077 | 91 | 291.2 | 180 | 164
C04018G12 | alpha beta fold family protein | 578 | gi|147865769|emb|CAN83251.1| hypothetical protein [Vitis vinifera] >gi|157339464|emb|CAO44005.1| unn | CAN83251 | 1.67E-084 | 94 | 314.69 | 179 | 169
C04018A02 | glyoxalase i | 600 | gi|2213425|emb|CAB09799.1| hypothetical protein [Citrus x paradisi] | CAB09799 | 2.16E-064 | 81 | 248.05 | 114 | 93
C04018C02 | metallothionein-like protein | 625 | gi|3308980|dbj|BAA31561.1| metallothionein-like protein [Citrus unshiu] | BAA31561 | 2.23E-014 | 100 | 82.03 | 40 | 40
Sequence Selection
Sequence Selection tool to
obtain a selection based on
annotation status
Sequence Selection
By Name/Description
By Function
View Menu
Functions to switch between
displaying IDs or descriptions
for GO annotation or InterPro
results
Hands-on I
Annotate 10 seqs with Blast2GO
Visualization
How to understand the functional context of an annotated dataset
Combined Graph
Each term has a number of sequences associated
Nodes can be coloured
to indicate relevance
Each term is displayed within its biological context
Node shape to differentiate between
direct and indirect annotation
Combined Graph
Different GO branches
Reduces nodes by number of annotated sequences
Node data to be displayed
Criterion for highlighting
and filtering nodes
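Node information content
Accumulated by GO term (Sequence Count)
Incoming information (Node Score):
$\mathrm{score}(g') = \sum_{g \in \mathrm{desc}(g')} \mathrm{seq}(g) \cdot \alpha^{\mathrm{dist}(g, g')}$
[Diagram: example graph showing the sequence count and the node score at each node]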
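The node score, the sum of seq(g)·α^dist(g, g') over the descendants g of a node g', can be tried out on a toy graph. In the Python sketch below the graph, the sequence counts and the value of α are invented; counting the node itself at distance 0 and using the shortest path as the distance are assumptions of this example.

```python
# Toy graph: parent term -> child terms (hypothetical).
CHILDREN = {"A": ["B", "C"], "B": ["D"], "C": [], "D": []}
# Number of sequences directly annotated to each term (made-up counts).
DIRECT_SEQS = {"A": 0, "B": 1, "C": 2, "D": 4}


def distances_below(term, children=CHILDREN):
    """Breadth-first search: edge distance from term to itself and to each descendant."""
    dist = {term: 0}
    queue = [term]
    while queue:
        current = queue.pop(0)
        for child in children[current]:
            if child not in dist:  # first visit in BFS gives the shortest distance
                dist[child] = dist[current] + 1
                queue.append(child)
    return dist


def node_score(term, alpha, seqs=DIRECT_SEQS, children=CHILDREN):
    """Sum of seq(g) * alpha ** dist(g, term) over term and its descendants."""
    return sum(seqs[g] * alpha ** d for g, d in distances_below(term, children).items())


def sequence_count(term, seqs=DIRECT_SEQS, children=CHILDREN):
    """Accumulated count: the node score with alpha = 1 (no distance penalty)."""
    return node_score(term, 1.0, seqs, children)
```

With α below 1, annotations far below a node contribute less to its score, so very general nodes do not dominate the graph simply by accumulating everything beneath them.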
Compacting Graphs by GO-Slim
Saving Options
Save as picture and as txt
Graph Charts
• Sequence Distribution/GO
as Bar-Chart
• Sequence Distribution/GO
as Level-Pie (level selection)
• Sequence Distribution/GO as Multilevel-Pie (#score or #seq cutoff)
Analysis of a specific function
How many sequences are annotated to the function “photosynthesis”?
• Option 1: Find in the GO graph -> direct & indirect annotation
• Option 2: Find through the Select function. Two sub-options:
  o Option 2.1. Direct annotation (use GO-ID or description)
  o Option 2.2. Direct & indirect (use GO-ID and “include GO parents”)
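The difference between direct and direct-plus-indirect selection can be sketched as follows. The term hierarchy and the sequence annotations are toy data invented for this Python example, and `sequences_for` is a hypothetical helper, not a Blast2GO function.

```python
# Toy hierarchy: term -> parent terms (hypothetical).
PARENTS = {
    "metabolism": [],
    "photosynthesis": ["metabolism"],
    "light reaction": ["photosynthesis"],
    "transport": [],
}
# Toy direct annotations per sequence (hypothetical).
SEQ_ANNOTATIONS = {
    "seq1": {"photosynthesis"},
    "seq2": {"light reaction"},
    "seq3": {"transport"},
}


def ancestors(term, parents):
    """All terms reachable by following parent links upwards."""
    found = set()
    stack = list(parents.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(parents.get(parent, []))
    return found


def sequences_for(term, annotations, parents, include_indirect=True):
    """Sequences annotated to term directly or, optionally, via a more specific term."""
    selected = set()
    for seq, terms in annotations.items():
        if term in terms:
            selected.add(seq)
        elif include_indirect and any(term in ancestors(t, parents) for t in terms):
            selected.add(seq)
    return selected


direct_only = sequences_for("photosynthesis", SEQ_ANNOTATIONS, PARENTS, include_indirect=False)
direct_and_indirect = sequences_for("photosynthesis", SEQ_ANNOTATIONS, PARENTS)
```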
Analysis of a specific function
export
search
Find a function on the graph
Analysis of a specific function
By exporting the sequence table you can see the sequences annotated to the function
Analysis of a specific function
Select all sequences annotated to this function and its descendants
Analysis of a specific function
Locate these sequences
Hands-on II
Summary statistics
Visualize & Search
Pathway analysis with Blast2GO
Which cellular functions are important in my experiment?
Functional Enrichment Analysis
One gene list (responsive genes): Biosynthesis 54%, Sporulation 18%
The other list (non-responsive genes): Biosynthesis 18%, Sporulation 18%
Are these two groups of genes carrying out different biological roles?
Are pathway frequencies different?
Fisher's Exact Test
Contingency table:
Group | Biosynthesis | No biosynthesis
A (responsive genes) | 6 | 5
B (non-responsive genes) | 2 | 9
p-value for Biosynthesis = 0.0913
The genes in group A are not significantly associated with biosynthesis or sporulation.
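The p-value for the contingency table above can be reproduced with the hypergeometric distribution. The Python sketch below uses only the standard library (`math.comb`, Python 3.8+); the function names are chosen for this example. For the table A = (6, 5), B = (2, 9), the one-tailed test gives about 0.0913, the value quoted on the slide, while the two-tailed test gives about 0.183.

```python
from math import comb


def _table_weight(k, row1, col1, total):
    """Unnormalised hypergeometric probability of k in the top-left cell."""
    return comb(col1, k) * comb(total - col1, row1 - k)


def fisher_one_sided(a, b, c, d):
    """P(X >= a) for the 2x2 table [[a, b], [c, d]]: test for over-representation."""
    row1, col1, total = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    tail = sum(_table_weight(k, row1, col1, total) for k in range(a, upper + 1))
    return tail / comb(total, row1)


def fisher_two_sided(a, b, c, d):
    """Sum the probabilities of all tables at most as likely as the observed one."""
    row1, col1, total = a + b, a + c, a + b + c + d
    lower = max(0, row1 - (total - col1))
    upper = min(row1, col1)
    observed = _table_weight(a, row1, col1, total)
    weights = (_table_weight(k, row1, col1, total) for k in range(lower, upper + 1))
    return sum(w for w in weights if w <= observed) / comb(total, row1)


# Group A: 6 biosynthesis, 5 not; group B: 2 biosynthesis, 9 not.
p_one = fisher_one_sided(6, 5, 2, 9)
p_two = fisher_two_sided(6, 5, 2, 9)
```

The weights are kept as exact integers until the final division, so the comparison against the observed table in the two-tailed version is not affected by floating-point rounding.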
Multiple testing correction
We do this for all GO terms of our dataset!
Many tests => many false positives => we need correction!
FDR control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons. In a list of rejected hypotheses, FDR controls the expected proportion of incorrectly rejected null hypotheses.
FWER control: the familywise error rate is the probability of making one or more false discoveries among all the hypotheses when performing multiple pairwise tests (more conservative).
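As an illustration of the two kinds of control, the Python sketch below implements the Benjamini-Hochberg procedure (one common way to control the FDR) and the Bonferroni correction (a simple way to control the FWER). Which exact procedures Blast2GO applies is not stated here, so these two are picked as representative examples, and the p-values are made up.

```python
def bonferroni(pvalues):
    """FWER control: multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """FDR control: Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]  # made-up p-values, one per GO term
fwer_adjusted = bonferroni(raw)
fdr_adjusted = benjamini_hochberg(raw)
```

For these four values the Bonferroni-adjusted p-values are never smaller than the Benjamini-Hochberg ones, which is the sense in which FWER control is more conservative.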
Different types of comparisons
• Compare two equivalent conditions (root vs leaves)
• Remove common IDs
• Test and Ref-Set are interchangeable
[Diagram: Set 1 and Set 2 overlapping in the common IDs]
• Compare a subset against the total
• Common IDs removed from reference
• Test and Ref-Set are NOT interchangeable
[Diagram: TestSet and RefSet sharing the common IDs]
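The two ways of handling common IDs can be expressed with Python sets. The function and the mode names below are hypothetical and only illustrate the logic of the two comparison types; the sequence IDs are placeholders.

```python
def prepare_sets(test_ids, ref_ids, mode):
    """Return (test, reference) ID sets with common IDs handled per comparison type."""
    test, ref = set(test_ids), set(ref_ids)
    common = test & ref
    if mode == "two_conditions":
        # Equivalent conditions: drop common IDs from both sets.
        return test - common, ref - common
    if mode == "subset_vs_total":
        # Subset against the total: drop common IDs from the reference only.
        return test, ref - common
    raise ValueError(f"unknown mode: {mode}")


root = {"s1", "s2", "s3"}
leaves = {"s3", "s4"}
root_only, leaves_only = prepare_sets(root, leaves, "two_conditions")

subset = {"s1", "s2"}
total = {"s1", "s2", "s3", "s4"}
test_set, ref_set = prepare_sets(subset, total, "subset_vs_total")
```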
FET in Blast2GO
• The two-tailed test identifies not only over-represented but also under-represented functions.
• If no Ref-Set is chosen, all annotations are used as reference
FatiGO Results
Result table with link out to sequence lists
Most specific terms
Retains only the lowest,
most specific enriched
term per GO branch
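One way to read the "most specific terms" filter is: drop every enriched term that has an enriched descendant. The Python sketch below implements that reading on a toy hierarchy; both the interpretation and the data are assumptions of this example, not a description of the Blast2GO implementation.

```python
def ancestors(term, parents):
    """All terms reachable by following parent links upwards."""
    found = set()
    stack = list(parents.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(parents.get(parent, []))
    return found


def most_specific(enriched, parents):
    """Keep only enriched terms that have no enriched descendant."""
    enriched = set(enriched)
    covered = set()
    for term in enriched:
        # Any enriched ancestor of this term is made redundant by it.
        covered |= ancestors(term, parents) & enriched
    return enriched - covered


# Toy hierarchy: term -> parent terms (hypothetical).
PARENTS = {"a": [], "b": ["a"], "c": ["b"], "x": ["a"]}
```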
Enriched Graph
View enriched-term data as DAG graphs!
reduce
=> To draw all nodes, set filter to 1
Hands-on III
Enrichment Analysis
Concluding Remarks
Blast2GO is a versatile tool for the
annotation of sequence data
Blast2GO uses controlled vocabularies and an elaborate annotation rule to generate GO labels
Visualization and data mining functions
help to understand the functional
content of your dataset