What's True For E. coli…
Enlisting The Community In Ongoing Genome Annotation
Jim Hu
EcoliHub/EcoliWiki
Texas A&M University
Why more E. coli websites?
• The number of E. coli databases is large
• Extensive coverage exists for many aspects of E. coli biology
• Journals contain half a century of E. coli data
• Don't we already know everything?
• #1-3: The problem isn't the amount of information; it's finding it
• #4: No
• Part of what we don't know yet is how the things we do know fit together
• Most of us need help mining what's out there
Problems and approaches
• Finding data from different resources
– EcoliHub - information from collaborating biological electronic data resources
• Making data curation faster, cheaper, and better
– EcoliWiki - community annotation for E. coli K-12
• Community functional curation for cross-species comparison
– GONUTS - a community Gene Ontology resource
Integrating information from multiple sites
• EcoliHub is based on web services
• A user query to EcoliHub is passed on to participating sites
• EcoliHub gathers the responses and assembles output for the user
http://ecolihub.org or http://ecolicommunity.org
• But the users won't have to start at the EcoliHub site
• EcoliHub will provide the infrastructure to help member sites do peer-to-peer queries
[Figure: peer-to-peer query diagram. Callouts: "Who has info?" / "Try EcoCyc and RegulonDB"]
• The users don't need to know or care about EcoliHub
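The query flow described above (pass the query to participating sites, gather the responses, assemble the output) can be sketched roughly as follows. This is only an illustration under stated assumptions, not EcoliHub's actual implementation: the site functions, their canned records, and the names `hub_query` and `who_has_info` are all hypothetical, and real member sites would be reached through their web services rather than local function calls.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Hypothetical stand-ins for member-site web services. A real node would be
# contacted over the network; these just return made-up records.
SiteService = Callable[[str], List[str]]

def site_a(gene: str) -> List[str]:
    return [f"{gene}: record from site A"]

def site_b(gene: str) -> List[str]:
    # Pretend this site only knows about one gene.
    return [f"{gene}: record from site B"] if gene == "lacZ" else []

def site_down(gene: str) -> List[str]:
    raise ConnectionError("site unavailable")

SITES: Dict[str, SiteService] = {"A": site_a, "B": site_b, "down": site_down}

def hub_query(gene: str, sites: Dict[str, SiteService]) -> Dict[str, List[str]]:
    """Pass one query to every participating site and gather the responses."""
    def ask(service: SiteService) -> List[str]:
        try:
            return service(gene)
        except Exception:
            return []  # one unreachable site should not break the whole answer

    # Query all sites concurrently, then wait for every response.
    with ThreadPoolExecutor(max_workers=max(1, len(sites))) as pool:
        futures = {name: pool.submit(ask, svc) for name, svc in sites.items()}
        results = {name: fut.result() for name, fut in futures.items()}

    # Assemble the output: keep only the sites that returned something.
    return {name: hits for name, hits in results.items() if hits}

def who_has_info(gene: str, sites: Dict[str, SiteService]) -> List[str]:
    """Peer-to-peer helper: which sites should a member site try for this gene?"""
    return sorted(hub_query(gene, sites))
```

A member site could call `who_has_info("lacZ", SITES)` to learn which peers to try, without its own users ever visiting the hub.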
What kinds of nodes are connected to EcoliHub?
• So far:
– EcoCyc
• everything E. coli; professionally curated
– EcoGene*
• everything E. coli; professionally curated
– GenoBase
• functional genomics and resources
– EcoliPredict
• protein structure models
– OU GenExpDB
• transcriptomes, experimental data
– RegulonDB*
• operons and regulons
– EcoliWiki
• everything E. coli; community curated
– GONUTS
• community curation of the Gene Ontology; not just E. coli
• More coming…
The need for annotation is growing
“What is true of Escherichia coli is true of the elephant”
- Jacques Monod
http://www.pasteur.fr/infosci/archives/mon/im_ele.html
“Thanks to annotation creep, what’s false for E. coli is false for the elephant too”
People are limiting for annotation
• Major MODs (EcoCyc, SGD, WormBase, FlyBase, MGI, ZFIN, TAIR, etc.) employ large numbers of PhD-level curators
• This model is problematic for the future of biocuration, and not just for E. coli
– Curators are expensive
• NIH and NSF cannot afford to staff every organism at this level
– Broad expertise across all areas is hard
• Curators have to read papers in areas they were not trained in.
• Curators may not recognize the significance of papers in areas they were not trained in
• Can we make it:
– cheaper?
– faster?
– better?
The Wikipedia approach
• Get your user community to work for free!
• Many groups have tried community annotation, with mixed success (at best)
• Wikipedia has added more than a million articles in English since I made the first version of this slide!
EcoliWiki: http://ecoliwiki.org (or .net or .com), or come from EcoliHub
EcoliWiki philosophy
• Any registered user can register new users
• Any registered user can create new pages
• It's easier to revise than to create new content
– Seed content from other places, mostly EcoCyc
But won't that invite chaos?
GenBank's managers are dead set against letting users into GenBank's files, however. They say there already are procedures to deal with errors in the database, and researchers themselves have created secondary databases that improve on what GenBank has to offer. "That we would wholesale start changing people's records goes against our idea of an archive," says David Lipman, director of the National Center for Biotechnology Information (NCBI), GenBank's home in Bethesda, Maryland. "It would be chaos."
Correct compared to what?
[Figure: side-by-side screenshots labeled "Wikipedia:" and "NCBI RefSeq:"]
This is how biology achieves fidelity
A collage of books I haven’t read
Biology Wikis are proliferating
Participation is the major challenge
• Anyone can edit ≠ Anyone will edit
• Wikipedia: a tiny fraction of the users edit anything
– A tiny fraction of those do major editing
– Really big denominator
• Outreach to increase our user base
• Tools to make it easier to edit
• Biggest difference from other systems:
– Partial annotations are wanted
– It doesn't matter if you don't know the wiki markup
– It doesn't matter if what you're adding isn't fully worked out
• Someone else can fix it
• And you can fix what others write
Community annotation for everyone
• What if I don't work on E. coli ?
• Community annotation of gene function via the Gene Ontology
• Gene Ontology Normal Usage Tracking System (GONUTS)
• http://gowiki.tamu.edu
• Annotation pages based on UniProt IDs
The future of EcoliHub and EcoliWiki
• Making the resource more useful to the community
– incorporating more resources
– providing integration workflows
– teaching users how to use them
– adding content people want
• Making the approach available to other biology communities
– reusable open source tools
– public web services
E. COLI 2008
Don't forget the acknowledgements!
Thanks to
• EcoliWiki/GONUTS Team
– Chris Elsik
– Gwen Knapp
– Debby Siegele
– Daniel Renfro
– Jerry Tsai
– Xiaotao Qu
– Rosemarie Swanson
– Anand Venkatraman
– Adrienne Zweifel
• Sabbatical hosts
– SGD/Stanford
– Stein Lab/CSHL
• GO consortium
• EcoliHub Team Leaders
– Barry Wanner PI, Purdue
– Walid Aref, co-PI, Purdue
– Tyrell Conway, co-PI, Oklahoma
– Mike Gribskov, co-PI, Purdue
– Peter Karp, co-PI, SRI
– Daisuke Kihara, co-PI, Purdue
• Funding: NIH U24-GM077905
URLs: http://ecolihub.org
http://ecoliwiki.org
http://gowiki.tamu.edu