The STRING Database
What it does and how it interfaces to other resources
Christian von Mering, University of Zurich & SIB
bigDATA Workshop

STRING
http://string-db.org/
Genomic Neighborhood
Genes/Species Co-occurrence
Gene Fusions
Database Imports
Exp. Interaction Data
Co-expression
Literature co-occurrence
- viewers for all types of evidence
- focus on usability and speed
- integrated scoring scheme
- information transfer between species

Numbers:
• 630 organisms
• 2.6 million proteins
• 88 million interactions
• server footprint: 320 GB

Interaction prediction from genome information
“genomic context”:
• Conserved Neighborhood
• Phylogenetic Profiles
• Gene-Fusions
quantify … networks integrate …

Other Interaction Sources
• Interaction Databases
• Pathway Databases (Reactome)
• Automated Textmining
• Interolog Transfer

The scoring system
[Figure: benchmarking plot; axes: raw score, KEGG performance (fraction on same map)]

Example - Neighborhood
[Figure: gene A and gene B, with intergenic distances of 100 bp, 6 bp and 20 bp]
raw score: sum of intergenic distances
each predictor has its own raw-score regime

evidence transfer between species
nscore = 1 - (1 - nscore_{query species}) * (1 - nscore_{transf.})
information transfer between species: either via orthologs (COG database) or via homology
analogous for cscore, escore, tscore, ...

final interaction score: protein A - protein B
1 - (1 - nscore) * (1 - fscore) * (1 - pscore) * (1 - cscore) * (1 - escore) * (1 - tscore)
where nscore = neighborhood, fscore = fusion, pscore = co-occurrence, cscore = coexpression, escore = experimental, tscore = textmining
example: 0.856
between 0 and 1, pseudoprobability, “likelihood of functional association”

The raw score regimes

Phylogenetic profiles
• “similarity profiles”
• singular value decomposition
raw score: Euclidean distance
filter: downweight scores for homologous pairs

Neighborhood
[Figure: gene A and gene B, with intergenic distances of 100 bp, 6 bp and 20 bp]
raw score: sum of intergenic distances

Fusion

Experimental interactions
• two-hybrid, TAP, annotated complexes, …
• topology-based analysis: who with whom, how many other partners?
raw score (Fusion): constant (0.99)
raw score (experimental interactions): various (usually ‘uniqueness’ of interaction)

Co-expression
• download all microarray datasets for a given species
• data normalization (spatial correction)
raw score: pairwise Pearson correlation coefficient

Textmining
• download all PubMed abstracts
• identify proteins in the abstracts
• search for co-mentioned pairs
raw score: log-odds score

User-Experience: Aiming to be Visual and Intuitive
• 1,000 visits / day
• 800 users / day
• 9,000 pageviews / day
• > 10,000 DB queries / day

Citations
2000 NAR Snel et al.: 80 citations
2003 NAR von Mering et al.: 215 citations
2005 NAR von Mering et al.: 183 citations
2007 NAR von Mering et al.: 189 citations
2009 NAR Jensen et al.: 47 citations
total: 714 citations

Cross-links
SMART: protein domain information
GENECARDS: info and products on human genes
SWISS-MODEL-REPOSITORY: homology models
CYTOSCAPE: access via plug-in architecture
SWISSPROT / UNIPROT: expert protein annotation

Cross-link example
launch SwissModel
Reciprocal view popup: launch STRING

Example #1
A missing chaperone for Cytochrome C oxidase
Question: who inserts the copper atom into CcO?
Initial observation: [figure]
• gene expressed
• structure solved
• it binds copper!
• likely function: copper delivery

Example #2
Simplify discovery in genome-wide association screens?

In-House Use of STRING
a) download data in relational database scheme
b) download data as compact flat-files
c) in-house installation of webserver
d) cross-link to server (version controlled, to network, protein, link, ...)
e) PSI-MI export
f) [ SOAP / webservices ]

Version 9.0: exceeding 1000 genomes
Irrelevant Organisms [future category]
Core organisms:
• include all model organisms (annotated knowledge)
• non-redundant, each genus is covered
• include organisms with functional genomics data

More details & new features
“Payload Display” - Your Own STRING Server => “branding”
STRING via remote-control: a call-back API

Acknowledgements
The STRING team: Samuel Chaffron, Manuel Weiss, Michael Kuhn, Lars Juhl Jensen, Sean Hooper, Berend Snel, Martijn Huynen, Peer Bork
The STRING institutions: SIB (Swiss Institute of Bioinformatics), University of Zurich, European Molecular Biology Laboratory, TU-Dresden, University of Copenhagen

“MySTRING”
• users can register / log in using OpenID or similar for authentication
• persistence of search results (“history”)
• store lists / items of interest (“bag of genes”)
• users can customize the interface
• generate revenue (?)

Feature #2 (Finding Relevant Texts)

Example #2
The missing enzymes for uric acid degradation
Question: why can’t humans degrade uric acid?
[Figure: pathway with two unknown steps marked “?”]
Initial observation: [figure]
• genes cloned, expressed
• enzymatic activity demonstrated
• candidate short-term therapeutics!
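
Backup: the score combination in code
The scoring slides combine the per-channel scores as 1 - (1 - nscore) * (1 - fscore) * ..., and use the same form for evidence transfer between species. The following is a minimal sketch of that arithmetic only, not the STRING implementation. The function names and the example channel values are hypothetical, and any calibration of raw scores against KEGG is assumed to have happened beforehand.

```python
def combine_scores(scores):
    """Combine scores s_i in [0, 1] as 1 - prod(1 - s_i), as written on the slide."""
    remaining = 1.0  # running product of (1 - s_i)
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"score out of range [0, 1]: {s}")
        remaining *= 1.0 - s
    return 1.0 - remaining


def transfer_score(query_species_score, transferred_score):
    """Evidence transfer: same formula, applied within one channel (e.g. nscore)."""
    return combine_scores([query_species_score, transferred_score])


# Hypothetical channel scores for one protein pair (illustrative values only).
channels = {
    "nscore": 0.4,  # neighborhood
    "fscore": 0.0,  # fusion
    "pscore": 0.2,  # co-occurrence
    "cscore": 0.1,  # coexpression
    "escore": 0.5,  # experimental
    "tscore": 0.3,  # textmining
}
final = combine_scores(channels.values())
print(round(final, 3))  # 0.849
```

A channel with score 0 leaves the result unchanged, and the combined score is never lower than the strongest single channel, which matches the “pseudoprobability” reading on the slide.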