Bioinformatics - Why E. coli? Development of the Information Resource www.EcoliHub.org

Development of the www.EcoliHub.org
Information Resource
Barry L. Wanner
Purdue University
West Lafayette, IN 47907 USA
blwanner@purdue.edu
GeneSys Inaugural Meeting
National e-Science Centre, Edinburgh, UK
3 October 2008
• Bioinformatics - Why E. coli?
– Oldest molecular model system
• Information is distributed among many established sites
• E. coli information is frequently basal in other resources
• Significant information predates electronic age (>100,000
literature articles)
– Information is deep
• Majority of E. coli genes/proteins have known functions
• Over half have known 3-D structures
• Detailed enzymology
• Genetic and physical interactions
• Fundamental processes from DNA replication, transcription,
translation, mutation, DNA repair, protein secretion, protein
folding, protein repair, disulfide bond formation, cell division,
DNA movement, protein movement
• Ca. 600 E. coli proteins are highly conserved in eucaryotes
including humans.
Supported by NIH NIGMS U24 GM077905
Vision
1. Create biology-driven information resource that is
comprehensive, accurate, and up-to-date for
experimentalists and modelers.
2. Develop an integrated “one-stop-shopping” E. coli K-12
information resource to make full use of existing
knowledge and to enable new discoveries leading to
deeper understanding of life processes.
3. Implement a web services (Web 2.0) architecture that will
interoperate across multiple resources via simple,
transparent interfaces, which will be broadly useful for
development of other procaryote databases.
4. Develop bacterial database schema and a core database
for data not now easily accessible
5. Develop and nucleate a process for expert curation by
members of the community (EcoliWiki)
6. Facilitate development of accurate and up-to-date
annotation records for the K-12 group of organisms
E. coli Information is deep
Data Overload
Wealth of detailed information on
innumerable biochemical and
molecular processes continues to
accumulate.
High-throughput data from DNA
microarrays, multiple E. coli
genomes, ChIP-chip, proteomics,
protein-protein interactions,
metabolomics, experimental
resources, and genetic
interactions are rapidly
increasing.
• Part of what we don't know yet
is how the things we do know
fit together (Integration Tools)
• Finding the missing and
inconsistent information is
difficult (Conflict Detection)
• Requires facilities to compare
and contrast information
(What’s Different? What’s
New?)
New ways are needed to make
these data more accessible for
reuse by others and for data
integration
Web 2.0 (web services)
• …the philosophy of mutually maximizing collective intelligence and
added value for each participant by formalized and dynamic
information sharing and creation
EcoliHub Goals
EcoliHub will not replace existing information resources
Our goal is to add value to these resources by:
1. improving the ability to share information and computational
Services among Resources,
2. improving the community’s ability to find information and
resources (EcoliHub Websearch, Multi-site Search,
Workbench)
3. providing new information and resources that 'fill in the
gaps' between existing resources and improve the quality of
information provided by all participating E. coli resources.
4. allowing resources to be combined (piped together) in new
ways, without requiring additional development effort by the
provider (Integration, under development)
Problems and approaches
Finding and sharing data from different resources
• EcoliHub - information from collaborating biological electronic data
resources
Making data curation faster, cheaper, and better
• EcoliWiki - community annotation for E. coli K-12
Investigators
Walid G. Aref (Purdue)
Julio Collado-Vides (UNAM)
Tyrrell Conway (OU)
Michael R. Gribskov (Purdue)
Peter D. Karp (SRI)
Daisuke Kihara (Purdue)
James C. Hu (TAMU)
Hirotada Mori (NAIST)
Kenneth E. Rudd (Miami)
Debby Siegele (TAMU)
Todd J. Vision (UNC)
Barry L. Wanner (Principal, Purdue)
Management Team
Dawn R. Whitaker, Project Manager
Sara C. Ess, Assistant Project Manager
Deana L. Galema, Administrative Assistant
Ali Roumani, Lead Architect
EcoliHub People (current)
Shikha Agrawal, Lead Programmer, Purdue
Dave Clements, GMOD Help Desk, NESCent, UNC
Kirill A. Datsenko, Research Associate, Biology, Purdue
Hicham G. Emongui, Grad. R.A., Computer Science, Purdue
Joe Grissom, Lead Programmer, OU
James B. Hengenius, Grad. R.A., Bioinformatics, Purdue
Yi-Ju Hsieh, Grad. R.A., Biology, Purdue
Rajasekar Karthik, Programmer, Purdue
Yusuf Kaya, Post-doc Biocurator
Nathan Liles, Undergraduate Programmer, TAMU
Thomas McGrew, Programmer, Purdue
Brenley McIntosh, Post-doc, Bioinformatics, TAMU
Daniel Renfro, Lead Programmer, TAMU
Yasin Laura Silva, Grad. R.A., Computer Science, Purdue
Rikiya Takeuchi, Grad. R.A., Bioinformatics, NAIST
Matthew F. Traxler, Grad. R.A., OU
Anand Venkatraman, Grad. R.A., Bioinformatics, TAMU
Samuel D. Wehrspann, Web Developer, Purdue
John E. Wertz, Consultant, CGSC-Yale
Yifeng David Yang, Grad. R.A., Bioinformatics, Purdue
Jindan Zhou, Grad. R.A., Computer Science, Miami
Gregory R. Ziegler, Grad. R.A., Bioinformatics, Purdue
Adrienne Zweifel, Grad. R.A., Biochemistry, TAMU
Steering Committee
James J. Anderson (NIGMS)
Patricia C. Babbitt (UCSF)
Rex L. Chisholm (Northwestern)
Valentina di Francesco (NIAID)
Carol A. Gross (UCSF)
James C. Hu (TAMU)
Michael Hucka (Caltech)
Robert Landick (chair, Wisconsin)
Philip Matsumura (UIC)
Thomas J. Silhavy (Princeton)
Paul W. Sternberg (Caltech)
Barry L. Wanner (Purdue)
Owen R. White (Maryland)
Matthew E. Portnoy (ad hoc member;
Program Director, NIGMS)
Providing Services and Resources
Improving the community’s ability to find
information and resources (EcoliHub
Websearch, Multi-site Search, Workbench)
Sharing and discussing information via a
forum
Training videos – help on how to use the
resource
E. coli information pages indexed
• ASAP: A Systematic Annotation Package for Community Analysis of Genomes
• CGSC: The Coli Genetic Stock Center
• EcoCyc: Encyclopedia of Escherichia coli K-12 Genes and Metabolism
• EcoliWiki: EcoliHub's subsystem for community annotation
• EcoGene: Database of Escherichia coli Sequence and Function
• ECOR Collection: E. coli Reference Collection
• ecce: The E. coli Cell Envelope Protein Data Collection
• epd: E. coli protease database
• GenoBase: Functional Genomic Analysis of E. coli in Japan
• GIB: Genome Information Broker
• GTD: The Genomic Threading Database
• GtRNAdb: Genomic tRNA Database
• IS Finder: IS Database
• PEC: Profiling of E. coli Chromosome
• RegulonDB: a database on transcriptional regulation in Escherichia coli.
• Rfam (Janelia): The Rfam database of RNA alignments and CMs
• RPG: Ribosomal Protein Gene Database
• TCDB: Transport Classification Database
• TransportDB: Genomic Comparisons of Membrane Transport Systems
EcoliHub Databases
EcoliLiterature - a comprehensive database of all articles,
book chapters, and books with basic information on E.
coli, its phages, and plasmids
EcoliPredict - computationally predicted and
experimentally determined structures of proteins encoded
by E. coli K-12
EcoliWiki - community annotation system for EcoliHub
GenExpDB - a comprehensive database of publicly
deposited DNA microarray gene expression data on E.
coli
GenoBase - legacy E. coli database on comprehensive
resources, e.g., Keio collection and ASKA ORFeome
clones, now being further developed at EcoliHub
Participating Databases
EcoCyc - professionally curated encyclopedic source of
information on the genome, metabolic pathways, and
regulatory network
EcoGene - knowledgebase derived from extensive
literature surveys and bioinformatics research that
documents the functions of DNA, protein and RNA in E.
coli K-12
RegulonDB - source of highly curated knowledge on
regulation of transcription initiation and operon
organization, and regulatory networks
Making Resources Work Together
– Integration – Bring all information to one
place and centrally curate
• Labor intensive
• Puts the least expert people in charge of data
• Does not scale
– Federation – Leave data where it is and build
a unified query system
• Keeps experts in charge of data
• Scales better
• Superschema is logistically difficult to implement, and
makes resources "rigid"
• Difficult to extend (unanimity is essential)
– Interoperation – Leave the data where it is,
but provide well-defined low level access
• Keeps experts in charge
• Scales
• Freely extensible
EcoliHub
• Building a collaborative system
– Community participation via interactive systems
– Interoperation using web services where possible
• Data warehousing or mirroring where necessary
• A framework for sharing
• Goal – a network of independent resources that easily make their
respective information and computational services available to the
community of E. coli resources
– What can be shared?
– Web services to share information (memes?)
– Create bridge services to glue resources together
• logical intersection
• translation services
• object transformations and mappers
• portable displays
Web Services
• Data (lookup and display)
• Annotation (lookup services)
– Get gene information from gene ID
– Find expression experiments from ID
• Calculations (computational services)
– Sequence searches (e.g., BLAST)
– Predictions (e.g., promoters, terminators, microRNAs, 3-D structure)
– Pathway and network analysis (integration)
• Glue services
– Translate from one ID to the corresponding ID, e.g., SwissProt to GenBank
– Translate from one kind of object to another, e.g., DNA sequence to protein sequence
– AND/OR/NOT/XOR of objects
– Local storage and workspace
– Provenance
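The glue services listed above (ID translation, object translation, and AND/OR/NOT/XOR of objects) could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not EcoliHub's implementation: the function names, the `ID_MAP` table, and the `SRC-`/`DST-` identifiers are hypothetical placeholders. The codon table is the standard genetic code written in TCAG order.

```python
from itertools import product
from typing import Dict, Iterable, Set

# Hypothetical mapping between two ID namespaces (placeholder IDs, not real
# SwissProt or GenBank accessions).
ID_MAP: Dict[str, str] = {"SRC-1": "DST-1", "SRC-2": "DST-2", "SRC-3": "DST-3"}


def translate_ids(ids: Iterable[str], mapping: Dict[str, str]) -> Set[str]:
    """Translate IDs from one namespace to another, skipping unmapped IDs."""
    return {mapping[i] for i in ids if i in mapping}


# Standard genetic code; codons are enumerated in TCAG order, '*' marks a stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_dna(dna: str) -> str:
    """Translate a DNA sequence to protein, stopping at the first stop codon.

    Raises KeyError if a codon contains a character other than A, C, G, or T.
    """
    dna = dna.upper()
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[pos:pos + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def combine(a: Set[str], b: Set[str]) -> Dict[str, Set[str]]:
    """Logical operations on two result sets that share one ID namespace."""
    return {"AND": a & b, "OR": a | b, "NOT": a - b, "XOR": a ^ b}


if __name__ == "__main__":
    hits_a = {"DST-1", "DST-2"}                          # already in target IDs
    hits_b = translate_ids(["SRC-2", "SRC-3"], ID_MAP)   # translated first
    print(combine(hits_a, hits_b))
    print(translate_dna("ATGGCCTAA"))
```

Translating IDs into a common namespace before applying the set operations is what lets results from independent resources be compared at all, which is why ID translation sits alongside the logical operations as a glue service.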
Webservice/Workspace Integration
Web Services
• A simple transaction
[Diagram: a consumer (user) sends a keyword to the IDfromKeyword service of a vendor (EcoCyc, EcoGene), which returns a list of IDs]
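The simple transaction named on this slide (a consumer sends a keyword to an IDfromKeyword service and receives a list of IDs from each vendor) might look like the following minimal Python sketch. It uses in-memory stand-ins instead of network calls. The vendor names, records, and IDs are invented placeholders, not real EcoCyc or EcoGene content or APIs, and `make_id_from_keyword` and `consumer_search` are hypothetical names.

```python
from typing import Callable, Dict, List

# Hypothetical private records held by two independent vendors.
VENDOR_A_RECORDS = {
    "ID-A1": "lactose catabolism enzyme",
    "ID-A2": "DNA repair protein",
}
VENDOR_B_RECORDS = {
    "ID-B1": "lactose transport",
    "ID-B2": "cell division protein",
}


def make_id_from_keyword(records: Dict[str, str]) -> Callable[[str], List[str]]:
    """Build an IDfromKeyword service over one vendor's private records."""
    def id_from_keyword(keyword: str) -> List[str]:
        needle = keyword.lower()
        return sorted(i for i, text in records.items() if needle in text.lower())
    return id_from_keyword


# Each vendor exposes the same small operation; how it stores data stays private.
VENDORS: Dict[str, Callable[[str], List[str]]] = {
    "vendor_a": make_id_from_keyword(VENDOR_A_RECORDS),
    "vendor_b": make_id_from_keyword(VENDOR_B_RECORDS),
}


def consumer_search(keyword: str) -> Dict[str, List[str]]:
    """Consumer side: send one keyword to every vendor and collect the ID lists."""
    return {name: service(keyword) for name, service in VENDORS.items()}


if __name__ == "__main__":
    print(consumer_search("lactose"))
```

Adding another vendor only requires registering one more function in `VENDORS`. No shared schema has to change, which is the property the interoperation approach relies on.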
Web Services
Diesel fuel from E. coli
Hub and Spoke
Peer to peer
Workflows