How a music service inspired a new way of thinking about research

Jason Hoyt, PhD – Research Director
Impact and influence of Web 2.0-based Services on e-Research Workshop
Edinburgh, UK
3 November 2009

Public Service Announcement
Support Open Access

Let’s talk about…
• Idea behind Mendeley
• Overall architecture
• Mendeley in action
• Clean data

Last.fm works like this:
1) Install “Audioscrobbler”
2) Listen to music
3) Last.fm builds your music profile and recommends music you might also like... and it’s the world’s biggest open music database

Last.fm            Mendeley
music libraries    research libraries
artists            researchers
songs              papers
genres             disciplines

Based in London, UK.
We are 18 researchers, graduates and software developers from...
...backed by co-founders and former executives of:
We are young! Mean age = ~25

Our users
After 10 months in public beta (version 0.9.4.1):
• Stanford University
• Cambridge University
• MIT
• University of Edinburgh
• University of Michigan
• Harvard
• Cornell University
• Berkeley
• University of Cologne
• RWTH Aachen
• Dartmouth College
• University of Wisconsin
• Fraunhofer Institute
• ETH Zurich
• University of Southampton
• University College Dublin
• Columbia University
• Oxford University
• Trinity College Dublin
• Max Planck Society

Overall Architecture
• Repository
• Database
• Web Service

Mendeley In Action

Adding your papers
You have different options to set up your library:
• Add single files or an entire folder
• “Watch a folder” to automatically import PDF files
• Add existing EndNote/BibTeX/RIS databases, or…
…drag & drop PDF files into the library pane…
…and Mendeley will try to extract the document details automatically

Document details lookup
You can also try to complete the document details by querying various databases (Crossref, PubMed, arXiv or Google Scholar)
Enter the DOI, PubMed ID, or arXiv ID and click on the magnifying glass to start the lookup

What is Mendeley?
• Set up and manage your collections
• Add tags & notes and edit document details
• Library showing all your documents (citation or table view)
• Filter your papers by authors, keywords, tags, or publications
• Annotate and highlight
• Manage your library

Our Challenges
1. Extracting
2. Syncing
3. Verifying
4. Recommending
Clean Data Is Our Biggest Challenge!

Clean Data

Dirty Data
1. Poor extraction
2. User input errors
3. Lookup errors (Open Access issue)
4. Duplicates & near-duplicates*

What can we do with clean data?
Yummy Data
• Implement our own reference checking service for ourselves/others
• APIs & mashups
• Entity disambiguation (create ontologies and semantic services)
• Starting point for recommendations
• Discover research statistics

Discover research statistics
What is your impact?

Lots of Data
• 6.99M documents added by users
• 20M+ documents from other sources
• 140M references extracted so far
• 50M documents by Q3 2010

How to Clean
Clean Up
• Improve text extraction
• “Wiki-fy” metadata with users
• Create canonical documents

Canonical Documents
• Saves room
• Removes duplicates
• Corrects errors

Deduplication
• Markov clustering
• Affinity Propagation
• Pairwise similarity
• Fingerprinting

Fingerprinting
The cat in the h3t went home
The cat i the hat went home
The cat went home without a hat

Fingerprinting
The cat in the h3t went home    = 010011001
The cat i the hat went home     = 011011001
The cat went home without a hat = 011110011

Fingerprinting
The ct in th3 hat went hom[     1000101
The cat in the hat went home    1000101
1010101 1010101 1010101

Questions
• Will this scale to 50M+ documents?
• Will it scale to 1B references?
Q3 2010

Other Challenges
• Image-based PDFs
• Tables & Figures
• Real-time recommendations
• Entity disambiguation

jason.hoyt@mendeley.com
@jasonhoyt
www.mendeley.com
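
Appendix: a fingerprinting sketch
The Fingerprinting slides map noisy titles to short bit strings so that near-duplicates end up with nearly identical fingerprints. The talk does not say which scheme Mendeley uses, so the sketch below swaps in a SimHash-style fingerprint over character trigrams as a stand-in. The function names (shingles, simhash, hamming_distance), the trigram size, the 64-bit width, and the use of MD5 as the per-trigram hash are all assumptions made for illustration.

```python
import hashlib
import re


def shingles(text, k=3):
    """Return the set of character k-grams of a lowercased, whitespace-normalized string."""
    normalized = re.sub(r"\s+", " ", text.lower()).strip()
    if len(normalized) <= k:
        return {normalized}
    return {normalized[i:i + k] for i in range(len(normalized) - k + 1)}


def simhash(text, bits=64):
    """SimHash-style fingerprint (illustrative assumption, not Mendeley's actual scheme).

    Each trigram votes +1 or -1 on every bit position; the fingerprint keeps
    the bits with a positive total. Supports bits <= 128 (the MD5 digest size).
    """
    counts = [0] * bits
    for gram in shingles(text):
        value = int.from_bytes(hashlib.md5(gram.encode("utf-8")).digest(), "big")
        for bit in range(bits):
            counts[bit] += 1 if (value >> bit) & 1 else -1
    fingerprint = 0
    for bit, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


# The three example titles from the Fingerprinting slides.
titles = [
    "The cat in the h3t went home",
    "The cat i the hat went home",
    "The cat went home without a hat",
]
prints = [simhash(title) for title in titles]
for title, fp in zip(titles, prints):
    print(f"{fp:064b}  {title}")
print("distance 1 vs 2:", hamming_distance(prints[0], prints[1]))
print("distance 1 vs 3:", hamming_distance(prints[0], prints[2]))
```

Applied to the fingerprints printed on the slide itself, hamming_distance(0b010011001, 0b011011001) is 1 and hamming_distance(0b010011001, 0b011110011) is 4, which matches the intuition that the two typo variants are closer to each other than to the reworded title. A deduplication pass would then treat documents whose fingerprints fall within some small distance threshold as candidates for one canonical document; the threshold would need tuning on real data.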