Crowdsourcing Chemistry for the
Community – 5 Years of Experiences
Antony Williams
NFAIS, February 28th 2012
The World of Online Chemistry
Safety data
Toxicity data
Blogs and Wikis
Property databases
Experimental results
Scientific publications
Compound aggregators
Open Notebook Science
Metabolic pathway databases
Encyclopedic articles (Wikipedia)
If it was not just about me…
 We might have a community
built encyclopedia
 I might know where the best
restaurants are
 I might get good advice on
books to read
 I might know which movies
to watch
 I might know which plumber
to call
 Data might just be Open
Collaborative Knowledge Management
QUESTION
 Are you involved with assisting chemists,
pharmaceutical scientists, etc. in sourcing
information about Chemistry?
 1. Yes
 2. No
Chemistry Databases on the Internet
 Public databases are “trusted” as primary sources
 Trust is granted without investigation of the
content
 Online data vary dramatically in quality!
 Examples…
With Great Fanfare…
NPC Browser http://tripod.nih.gov/npc/
How many contribute to clean-up?
 Fewer than a dozen contributors to data
 The majority are project members
The crowd is small…
What you might not know about
Chemistry Databases on the Internet
 Data-sharing between the databases is cyclic –
proliferating errors – “Linked Data”
What is the Structure of Vitamin K?
MeSH
 A lipid cofactor that is required for normal blood
clotting.
 Several forms of vitamin K have been identified:
 VITAMIN K 1 (phytomenadione) derived from
plants,
 VITAMIN K 2 (menaquinone) from bacteria, and
synthetic naphthoquinone provitamins,
 VITAMIN K 3 (menadione).
What is the Structure of Vitamin K1?
QUESTION
 Who has heard of ChemSpider as a chemistry
database?
 1. Yes
 2. No
ChemSpider
We Want to Answer Questions
 Questions a chemist might ask…
 What is the melting point of n-heptanol?
 What is the chemical structure of Xanax?
 Chemically, what is phenolphthalein?
 What are the stereocenters of cholesterol?
 Where can I find publications about xylene?
 What are the different trade names for Ketoconazole?
 What is the NMR spectrum of Aspirin?
 What are the safety handling issues for Thymol Blue?
Available Information…
 Linked to vendors, safety data, toxicity, metabolism
Available Information….
Crowdsourced “Annotations”
 Users can add
 Descriptions/Syntheses/Commentaries
 Links to PubMed articles
 Links to articles via DOIs
 Add spectral data
 Add Crystallographic Information Files
 Add photos
 Add MP3 files
 Add Videos
QUESTION
 Did you know that ChemSpider was OWNED by
the Royal Society of Chemistry?
 1. Yes
 2. No
Public Domain Databases
 Our databases are a mess…
 Non-curated databases are proliferating errors
 We source and deposit data between databases
 Original sources of errors hard to determine
 Curation is time-consuming and challenging
Stop Whining – Fix it
Crowdsourced Curation
 Crowdsourced curation: identify/tag errors, edit
names, synonyms, identify records to deprecate
Search “Vitamin H”
“Curate” Identifiers
Validated Name-Structure Dictionaries
 Chemical name dictionaries are used for:
 Text-mining (publications, patents)
 Used to index PubMed and link to Google Patents
 Linking to other databases – think Biology!
 When structures are not available, drug names provide the link
 Searching the web
 Names link to structures link to InChIs
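A minimal sketch of dictionary-based text-mining, assuming a small hand-built name dictionary. The structure IDs (STRUCT-ASPIRIN, STRUCT-BIOTIN) are hypothetical placeholders standing in for InChIs, and find_structures is an illustrative helper, not part of ChemSpider or any real library.

```python
import re

# Hypothetical name-to-structure dictionary; the IDs stand in for InChIs.
NAME_TO_STRUCTURE = {
    "aspirin": "STRUCT-ASPIRIN",
    "acetylsalicylic acid": "STRUCT-ASPIRIN",
    "biotin": "STRUCT-BIOTIN",
    "vitamin h": "STRUCT-BIOTIN",
}


def find_structures(text):
    """Return {structure_id: sorted matched names} for dictionary names found in text."""
    found = {}
    lowered = text.lower()
    for name, structure_id in NAME_TO_STRUCTURE.items():
        # Whole-word match so a name is not picked up inside a longer token.
        pattern = r"\b" + re.escape(name) + r"\b"
        if re.search(pattern, lowered):
            found.setdefault(structure_id, []).append(name)
    return {sid: sorted(names) for sid, names in found.items()}


hits = find_structures("Vitamin H (biotin) was compared with acetylsalicylic acid.")
```

Two different names in the sentence resolve to the same structure entry, which is why a wrong synonym in the dictionary would send every matching document to the wrong compound.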
Why are Dictionaries important?
The Final Search Strategy
Many Names, One Structure
I want to know about “Vincristine”
Vincristine: Identifiers and Properties
Vincristine: Patents
Linked by Name
Text-Mining Depends on Dictionaries
Curated Dictionaries Matter
Originally 15 compounds “called” Yohimbine
54 Skeletons for Yohimbine
Sharing ChemSpider Curation
Data Curation Sharing - Proof of Concept
Identifier Dictionaries
 Reciprocal curation processes…share curation
 A series of “added” and “removed” synonyms
against structures for matching.
 Announced 9 months ago – only one consumer
 Who will participate???
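One way a consumer might apply a shared series of "added" and "removed" synonyms to its own dictionary is sketched here. The event layout (action, structure ID, synonym), the function name apply_curation, and the sample names are all assumptions for illustration; the actual exchange format is not described in these slides.

```python
def apply_curation(dictionary, events):
    """Apply (action, structure_id, synonym) events to a copy of the dictionary.

    action is "added" or "removed"; the input dictionary is left untouched.
    """
    result = {sid: set(names) for sid, names in dictionary.items()}
    for action, structure_id, synonym in events:
        names = result.setdefault(structure_id, set())
        if action == "added":
            names.add(synonym)
        elif action == "removed":
            # discard() ignores synonyms the consumer never had
            names.discard(synonym)
        else:
            raise ValueError(f"unknown action: {action}")
    return result


# Hypothetical local dictionary and a hypothetical shared curation feed.
local = {"STRUCT-1": {"name-a", "name-b"}}
events = [
    ("removed", "STRUCT-1", "name-b"),
    ("added", "STRUCT-1", "name-c"),
]
curated = apply_curation(local, events)
```

Keying each event on the structure lets the consumer match curation against its own records even when its synonym lists differ from the source's.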
Community Contribution to ChemSpider
www.SpectralGame.com
http://www.jcheminf.com/content/1/1/9
Curation through “gaming”
Data Curation
Reversed Spectrum
True Curation of Data
ChemSpider SyntheticPages
Submission Process
 Simple template-based submission process
 Submissions reviewed by editorial board.
 Online Peer Review process
 Crowdsourced expansion?
 A few regular dedicated authors only
 Online peer review and feedback small but useful
Crowdsourcing – does it work?
 Only 192 people have EVER deposited or curated data
 ChemSpider SyntheticPages: a small group of authors
 Database hosts make the largest contributions
 ChemSpider staff tend to do the most curation
Contributions
Curations
 2009 – 8255 curations by 43 people
 2010 – 10014 curations by 66 people
 2011 – 16025 curations by 116 people
 “Crowdsourcing” – the crowd is small!
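A quick arithmetic check on the yearly figures above, computing the average number of curations per curator. The variable names are illustrative; the counts are taken directly from the slide.

```python
# (curations, people) per year, as listed on the slide
curation_stats = {
    2009: (8255, 43),
    2010: (10014, 66),
    2011: (16025, 116),
}

# Average curations per curator, rounded to one decimal place
per_person = {
    year: round(curations / people, 1)
    for year, (curations, people) in curation_stats.items()
}

for year, average in sorted(per_person.items()):
    print(year, average)
```

The average falls from roughly 192 curations per person in 2009 to roughly 138 in 2011, so the growth in total curations comes from more curators rather than from each curator doing more.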
www.SciMobileApps.com
 8 contributors only…in 7 months
www.SciDBs.com
 7 contributors only…in 6 months
www.ScientistsDB.com
 38 contributors …in 6 weeks
What encourages participation?
 “Interested” parties contribute
 Marketing and self-promotion are primary reasons
for participation
 There are very few “selfless” participants
 Relationships garner contributions…
 Crowdsourcing across drug discovery
 Open PHACTS: a partnership between the European
Community and European pharma companies
 Freely accessible for knowledge discovery and
verification.
 Data on chemistry and biology
 Pharmacological profiles
 Proprietary and public data sources.
How will it improve?
Participation
and
contribution
Conclusions
 For chemistry, crowdsourced deposition, annotation,
and curation work, but engagement has been low to date
 Primary challenge – engaging the community to help
create what they want. Rewards and recognition?
 MORE collaboration can benefit us all
 Indicators point to small but continued growth
Thank you
Email: williamsa@rsc.org
Twitter: ChemConnector
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams