STRING-Tutorial

The STRING Database
What it does and how it interfaces to other resources
Christian von Mering, University of Zurich & SIB
bigDATA Workshop
STRING
http://string-db.org/
Genomic Neighborhood
Genes/Species Co-occurrence
Gene Fusions
Database Imports
Exp. Interaction Data
Co-expression
Literature co-occurrence
- viewers for all types of evidence
- focus on usability and speed
- integrated scoring scheme
- information transfer between species
Numbers:
• 630 organisms
• 2.6 million proteins
• 88 million interactions
• server-footprint: 320 GB
Interaction prediction from genome information
Conserved Neighborhood
Phylogenetic Profiles
“genomic context”
Gene-Fusions
quantify …
networks
integrate …
Other Interaction Sources
Interaction Databases
Pathway Databases
Reactome
Automated Textmining
Interolog Transfer
The scoring system
benchmarking: raw score (x-axis) vs. KEGG performance, i.e. fraction on same map (y-axis)
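
The benchmarking idea above can be sketched as follows: sort predicted pairs by raw score, cut them into bins, and compute per bin the fraction of pairs whose two proteins lie on the same KEGG map. This is only a minimal sketch. The function name `calibrate`, the equal-size binning, and the toy data are assumptions for illustration and are not taken from the slides.

```python
def calibrate(pairs, n_bins):
    """pairs: list of (raw_score, on_same_kegg_map) tuples.

    Sorts the pairs by raw score and cuts them into bins of
    len(pairs) // n_bins pairs each (a shorter final bin may follow).
    Returns one (mean_raw_score, fraction_on_same_map) tuple per bin.
    """
    ordered = sorted(pairs, key=lambda p: p[0])
    size = max(1, len(ordered) // n_bins)
    curve = []
    for start in range(0, len(ordered), size):
        chunk = ordered[start:start + size]
        mean_raw = sum(score for score, _ in chunk) / len(chunk)
        fraction = sum(1 for _, same in chunk if same) / len(chunk)
        curve.append((mean_raw, fraction))
    return curve


# Hypothetical toy data: (raw score, both proteins on the same KEGG map?)
toy_pairs = [(10, True), (20, True), (200, False),
             (300, True), (900, False), (1500, False)]
curve = calibrate(toy_pairs, n_bins=3)
```

A curve of this kind could then be used to map any raw score of one predictor onto an expected fraction of true associations.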
Example - Neighborhood raw score:
[figure: gene A and gene B as neighbors, with intergenic distances of 100 bp, 6 bp and 20 bp]
raw score: sum of intergenic distances
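
As a worked instance of the neighborhood raw score, the three intergenic distances shown on the slide can simply be summed. The function name is a hypothetical one chosen for this sketch.

```python
def neighborhood_raw_score(intergenic_distances_bp):
    """Raw neighborhood score: sum of intergenic distances in base pairs."""
    return sum(intergenic_distances_bp)


# Distances from the slide: 100 bp, 6 bp and 20 bp
score = neighborhood_raw_score([100, 6, 20])
print(score)  # 126
```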
each predictor has its own raw-score regime
evidence transfer between species
nscore = 1 – (1 – nscore_{query species}) * (1 – nscore_{transferred})
information transfer between species either via orthologs (COG database) or via homology
analogous for cscore, escore, tscore, ...
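
The transfer formula above combines the score observed in the query species with the score transferred from other species. A minimal sketch, with a hypothetical function name and made-up input values:

```python
def transfer_score(score_query_species, score_transferred):
    """Combine direct and transferred evidence for one channel.

    Implements 1 - (1 - s_query) * (1 - s_transferred); both inputs
    are expected to lie between 0 and 1.
    """
    return 1.0 - (1.0 - score_query_species) * (1.0 - score_transferred)


# Hypothetical values, not taken from the slides
nscore = transfer_score(0.5, 0.4)
```

With this form the combined score is never lower than either input, and a transferred score of 0 leaves the query-species score unchanged.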
final interaction score: protein A – protein B 0.856
final score = 1 – (1 – nscore) * (1 – fscore) * (1 – pscore) * (1 – cscore) * (1 – escore) * (1 – tscore)
nscore: neighborhood, fscore: fusion, pscore: co-occurrence, cscore: coexpression, escore: experimental, tscore: textmining
between 0 and 1, pseudoprobability, “likelihood of functional association”
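
The integration formula above can be sketched in a few lines. The per-channel values below are hypothetical: they were picked so that the result matches the 0.856 shown on the slide, and they are not the actual evidence scores behind that example.

```python
from math import prod


def combined_score(channel_scores):
    """Integrate per-channel scores: 1 minus the product of (1 - score)."""
    return 1.0 - prod(1.0 - s for s in channel_scores.values())


# Hypothetical channel scores for a pair protein A - protein B
channels = {
    "nscore": 0.40,  # neighborhood
    "fscore": 0.00,  # fusion
    "pscore": 0.00,  # co-occurrence
    "cscore": 0.00,  # coexpression
    "escore": 0.76,  # experimental
    "tscore": 0.00,  # textmining
}
final = combined_score(channels)
print(round(final, 3))  # 0.856
```

The formula treats the channels as independent lines of evidence, so two moderate scores reinforce each other while channels scoring 0 have no effect.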
The raw score regimes
Phylogenetic profiles
• “similarity profiles”
• singular value decomposition
raw score: Euclidean distance
filter: downweight scores for homologous pairs
Neighborhood
[figure: gene A and gene B as neighbors, with intergenic distances of 100 bp, 6 bp and 20 bp]
raw score: sum of intergenic distances
Fusion
raw score: constant (0.99)
experimental interactions
• two-hybrid, TAP, annotated complexes, …
• topology-based analysis: who with whom, how many other partners?
raw score: various (usually ‘uniqueness’ of interaction).
Co-expression
• download all microarray datasets for a given species
• data normalization (spatial correction)
raw score: pairwise Pearson correlation coefficient
Textmining
• download all PubMed abstracts
• identify proteins in the abstracts
• search for co-mentioned pairs
raw score: log-odds score
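
Two of the raw scores named above lend themselves to a short sketch: the Pearson correlation between two expression profiles, and a log-odds score for co-mentions. The slides do not give the exact log-odds definition, so the form used here (log of the observed co-mention frequency over the frequency expected if both proteins were mentioned independently) is an assumption, as are all function and parameter names.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def log_odds(n_both, n_a, n_b, n_total):
    """Assumed log-odds form for co-mentions in abstracts.

    n_both: abstracts mentioning both proteins
    n_a, n_b: abstracts mentioning protein A / protein B
    n_total: all abstracts searched
    """
    observed = n_both / n_total
    expected = (n_a / n_total) * (n_b / n_total)
    return math.log(observed / expected)


# Hypothetical expression profiles and abstract counts
r = pearson([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
lo = log_odds(n_both=10, n_a=100, n_b=100, n_total=10000)
```

In this sketch a log-odds value above 0 means the pair is co-mentioned more often than chance would predict, and a value near 0 means the co-mentions are consistent with independence.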
User-Experience: Aiming to be Visual and Intuitive
1’000 visits / day
800 users / day
9’000 pageviews / day
> 10’000 DB-queries / day
Citations
2000 NAR Snel et al.: 80 citations
2003 NAR von Mering et al.: 215 citations
2005 NAR von Mering et al.: 183 citations
2007 NAR von Mering et al.: 189 citations
2009 NAR Jensen et al.: 47 citations
total: 714 citations
Cross-links
SMART: protein domain information
GENECARDS: info and products on human genes
SWISS-MODEL-REPOSITORY: homology models
CYTOSCAPE: access via plug-in architecture
SWISSPROT / UNIPROT: expert protein annotation
Cross-link example
launch SwissModel
Reciprocal View
popup: launch STRING
Example #1
A missing chaperone for Cytochrome C oxidase
Question: who inserts the copper atom into CcO?
Initial observation:
• gene expressed
• structure solved
• it binds copper!
• likely function - copper delivery
Example #2
Simplify discovery in genome-wide association screens?
Christian von Mering – UZH MolBio – SIB
In-House Use of STRING
a) download data in relational database schema
b) download data as compact flat-files
c) in-house installation of webserver
d) cross-link to server (version controlled, to network, protein, link, ...)
e) PSI-MI export
f) [ SOAP / webservices ]
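
For option b), a downloaded flat-file could be filtered in-house along the following lines. The slides do not specify the file layout, so the format below (a header line, then whitespace-separated rows of two protein identifiers and a combined score scaled by 1000) is an assumption, and the identifiers are invented placeholders.

```python
import io

# Hypothetical flat-file content; the real layout may differ
SAMPLE = """protein1 protein2 combined_score
9606.P_A 9606.P_B 856
9606.P_A 9606.P_C 310
9606.P_B 9606.P_D 702
"""


def load_links(handle, min_score=0.7):
    """Read links from an open text handle and keep those scoring
    at least min_score, with scores rescaled to the 0..1 range."""
    links = []
    next(handle)  # skip the header line
    for line in handle:
        if not line.strip():
            continue  # ignore blank lines
        protein_a, protein_b, raw = line.split()
        score = int(raw) / 1000.0
        if score >= min_score:
            links.append((protein_a, protein_b, score))
    return links


links = load_links(io.StringIO(SAMPLE), min_score=0.7)
```

Reading from a file on disk would only require replacing `io.StringIO(SAMPLE)` with an opened file handle.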
Version 9.0 – exceeding 1000 genomes
Irrelevant Organisms
[future category]
Core organisms:
• include all model organisms (annotated knowledge)
• non-redundant, each genus is covered
• include organisms with functional genomics data
More details & new features
“Payload Display” - Your Own STRING Server
=> “branding” STRING
via remote-control:
a call-back API
Acknowledgements
The STRING team:
Samuel Chaffron
Manuel Weiss
Michael Kuhn
Lars Juhl Jensen
Sean Hooper
Berend Snel
Martijn Huynen
Peer Bork
The STRING institutions:
SIB – Swiss Institute of Bioinformatics
University of Zurich
European Molecular Biology Laboratory
TU-Dresden
University of Copenhagen
“MySTRING”

users can register / login

using OpenID or similar for authentication

persistence of search results (“history”)

store lists / items of interest (“bag of genes”)

users can customize the interface

generate revenue (?)
Feature #2 (Finding Relevant Texts)
Example #2
The missing enzymes for uric acid degradation
Question: why can’t humans degrade uric acid?
initial observation:
• genes cloned, expressed
• enzymatic activity demonstrated
• candidate short-term therapeutics!