Wiki-Authors: Combining Automatic Data Integration and Wiki Technology to Derive Unique Author Identifiers for Professional Knowledge Management

White paper compiled to summarize the results of the Workshop on Scholarly Data and Data Integration at Indiana University, Bloomington, IN, on August 30 and 31, 2006.

Miguel Andrade1, Katy Börner2, Stacy Kowalczyk2, Barend Mons3,4, Erik M. van Mulligen3,4, Marc Weeber4

1 Ottawa Health Research Institute, University of Ottawa, 501 Smyth Rd, Ottawa, ON K1H 8L6, Canada
2 Indiana University, SLIS, 10th Street & Jordan Avenue, Bloomington, IN 47405, USA
3 Erasmus Medical Center, Rotterdam, The Netherlands
4 Knewco, Inc., 733 Fallsgrove Drive, Suite 9130, Rockville, MD 20850, USA

The recent International Workshop on Scholarly Data and Data Integration1 brought together thought leaders in the fields of bibliographic analysis and data mining. The participants were a blend of content providers, Wiki developers, funding agencies, companies, and basic scientists focussing on data integration, disambiguation, on-line knowledge discovery, and data visualisation. The presentations on user needs and on currently available technologies and services can be summarized as follows: Bibliographic databases are becoming ever more important in scientific research and science management. These databases are essential to the primary work of science: retrieving the relevant literature and finding potential collaborators and competitors. Increasingly, mining these databases has become a science in itself. For science management, their use has become crucial for connecting new proposals to existing work and for finding reviewers for those proposals. In addition, measuring the success of both projects and individuals in science is a challenge, and bibliographic databases are the key resource for doing so.
One of the commonly perceived difficulties in the collection and use of bibliographic databases (e.g., PubMed, CiteSeer, arXiv, Google Scholar) is the need to uniquely identify authors. Subscription-restricted databases such as the Web of Science (maintained by Thomson Scientific) and SCOPUS (developed by Elsevier) have recently added a fair degree of author disambiguation, but even in these databases the issue remains a challenge, both technically and economically. The recent success of Wikipedia, a community effort to establish an on-line encyclopaedia, has led to a series of spin-off projects and proposals that have community annotation of data at their core. Wiki-Authors2 is a proposed environment for collectively annotating and disambiguating science authors. The participants of the workshop agreed that the new Wiki-data model3 and the supporting technologies of the various participants, in combination with public content (essentially authors and the titles of their publications as identified by the Digital Object Identifier, DOI), would be a sufficient basis to start a Wiki-Authors project with a high likelihood of success. The group has therefore decided to proceed by extending a pilot project already under way between Wikimedia developers and the company Knewco, and to try the Wiki-Authors approach in this environment. The pilot project will focus on scientific authors, starting in the biomedical domain and, if successful, soon expanding into other scientific disciplines. The reason to start with the biomedical domain is that the largest single public database of scientific abstracts is available in this discipline and that several algorithms for author disambiguation have been developed specifically for biomedicine. The proposed Wiki-Authors project is detailed below. All comments and suggestions are welcome; please send them to all authors of this paper.
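Automatic approaches to the disambiguation problem described above typically score shared metadata between records that carry similar names. The following is a deliberately minimal illustration of that idea; the records, weights, and threshold are invented for this sketch and do not represent any participant's actual algorithm.

```python
# Toy metadata-overlap disambiguation: two records with similar names are
# judged to be the same person when their shared co-authors and journal
# provide enough evidence. All data and weights below are hypothetical.

def jaccard(a, b):
    """Overlap between two sets, from 0.0 (disjoint) to 1.0 (identical)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def same_author(rec1, rec2, threshold=0.3):
    """Heuristic: accumulate evidence from shared metadata."""
    evidence = jaccard(rec1["coauthors"], rec2["coauthors"])
    if rec1["journal"] == rec2["journal"]:
        evidence += 0.2  # made-up bonus for publishing in the same journal
    return evidence >= threshold

# Three hypothetical records, all attributed to "S. Smith".
r1 = {"coauthors": ["A. Jones", "B. Lee"], "journal": "Bioinformatics"}
r2 = {"coauthors": ["B. Lee", "C. Wu"],    "journal": "Bioinformatics"}
r3 = {"coauthors": ["D. Kim"],             "journal": "Phys. Rev. B"}

print(same_author(r1, r2))  # shared co-author and journal: likely one person
print(same_author(r1, r3))  # no shared metadata: likely different people
```

Production systems combine many more signals (affiliations, subject headings, citation links) and tune the weights against labelled data; the point here is only that such pairwise evidence, not the name string alone, drives the grouping.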
1 http://www.scimaps.org/meeting_060830.php
2 http://meta.wikimedia.org/wiki/WikiAuthors
3 http://meta.wikimedia.org/wiki/Wikidata

Wiki-Authors Project

The Wiki-Authors project was originally proposed by Miguel Andrade. Technology for on-line author disambiguation has already been developed at the University of Illinois at Chicago by Neil Smalheiser's group and is currently also under development at the company Knewco. Meanwhile, the Wiki-data model and the WiktionaryZ application were technically conceived by the Wikimedia developers Gerard Meijssen and Erik Moeller, and a prototype was developed with financial, technical, and intellectual support from Knewco. Katy Börner's group at Indiana University designed a Scholarly Database4 that serves publication, patent, and grant datasets. The identification and interlinkage of unique authors, inventors, and investigators poses a major data integration challenge. Prototypes of the different databases and disambiguation algorithms were demonstrated during the workshop. The groups of Neil Smalheiser and Katy Börner, the Knewco group, Thomson Scientific, and the SCOPUS team at Elsevier have all developed approaches with partially overlapping elements. The presentations showed that automatic author disambiguation can reach levels of around 90% accuracy in a controlled environment. However, the unambiguous definition of a unique author, with all synonyms, spelling variations, and errors of the author name correctly associated, and with links to all publications, patents, and projects of that author, is a daunting task that cannot be fully achieved by any centralised organisation in isolation. It also became clear that no particular party with a specific interest is currently in a position to enforce a unique author ID on the scientific community. The envisioned Wiki-Authors will support multiple branches for imported content, similar to the version control software used in software development.
It will thus be possible to apply different disambiguation algorithms to the data without necessarily changing the current "living version" of the data. Tools will be implemented to facilitate merges across branches, and of new data sets into the primary branch, allowing the Wiki-Authors user community to work together in rejecting and approving changes. Wiki-Authors must also support a date hierarchy; for example, from 1999 to 2000 the author's name was "Susan Smith-Jones" and from 2000 to the present the author's name is "Susan Smith". At the workshop it was further concluded that: The integration challenge associated with the federation of multiple scholarly databases might best be solved by using existing unique-author lists generated for specific datasets. These lists should initially be combined using automatic data integration technologies to arrive at a 'master list' that forms a bridge across the stove-pipes of today's mostly unconnected publication, patent, grant, and other scholarly databases. The master lists of unique authors/journals/institutions/geo-locations would then be made available in an open Wiki environment to enable computer-assisted community correction, annotation, and completion, under a free content license as defined by the Free Content Definition5. The license will not have a 'copyleft' clause, i.e., it will not be necessary to make all derivative works freely available. The sources of the combined author list should be kept in their original state to enable updates. Community annotations should be made on the 'master list'. Individual institutions should be able to download the improved data set at regular intervals and apply the data in their own free or commercially available environments. Links to the original data should be possible.
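The date hierarchy requirement described above can be made concrete with a small sketch. The class and field names, and the convention of half-open date intervals with an open end meaning "to the present", are hypothetical choices for illustration, not a committed design.

```python
# Sketch of a date hierarchy for author names: each record keeps every
# name together with the interval during which it applied.

from datetime import date

class AuthorRecord:
    def __init__(self):
        # List of (start, end, name); end=None means "to the present".
        self.name_history = []

    def add_name(self, name, start, end=None):
        self.name_history.append((start, end, name))

    def name_on(self, when):
        """Return the name in force on a given date, or None if unknown."""
        for start, end, name in self.name_history:
            if start <= when and (end is None or when < end):
                return name
        return None

# The example from the text: a name change in 2000.
author = AuthorRecord()
author.add_name("Susan Smith-Jones", date(1999, 1, 1), date(2000, 1, 1))
author.add_name("Susan Smith", date(2000, 1, 1))

print(author.name_on(date(1999, 6, 1)))   # Susan Smith-Jones
print(author.name_on(date(2006, 8, 30)))  # Susan Smith
```

With such a structure, a publication dated 1999 and one dated 2005 can both be attached to the same unique author even though the printed by-lines differ.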
For such a 'professional' Wiki environment to be acceptable and useful, a number of features have to be guaranteed:
- An international governance structure to ensure credibility and reliability.
- Data provenance, recognition, and integrity of compiled sources.
- Transparency and ease of use of the technical setup.
- Long-term sustainability.
- 24/7 up time.
- Scalability.
- Clarity on licensing and access issues.

4 https://iv.slis.indiana.edu/db
5 www.freecontentdefinition.org

Plan of Work

Based on the needs expressed and the technical status reports, we suggest a stepwise approach:

1. A small, dedicated International Steering Group will be formed. The Steering Group will support the thinking and practice around the issues of credibility, reliability (up time, performance), and sustainability (long-term funding) once the pilot results in a decision to move to a professional production environment.

2. A dedicated technical team, wider than the current one, will be formed.

3. Participants and potentially additional parties will share information on the scope and format of their current data on the disambiguation of author names and on disambiguated author names.

4. Knewco will bring in the prototype with semantic support for the application, as well as its on-line technology for combining algorithmic disambiguation and self-disambiguation of authors, possibly combined with the technology developed by the Smalheiser group and with algorithms developed by Thomson and Elsevier, should these be made available.

5. The content providers will deliver their input to the technical team in the desired format. Data should be made available in open standard formats, ideally XML-based formats or MySQL, PostgreSQL, or Oracle database dumps. At least minimal documentation should be provided for all data sets. All string representations should be in Unicode.

6. After a period of alpha testing by a limited group of dedicated professionals, the prototype will be put in the public domain for beta testing and user feedback.

7. Follow-up focused workshops will be organised to deal with technical and deployment issues.

8. A production version of Wiki-Authors will continue to be governed by an International Steering Group representing the major contributors to the project.

9. The need for a legal entity associated with or responsible for Wiki-Authors will be discussed.

Proposed Timeline

Steps 1 and 2: September and early October 2006
Steps 3 and 4: October 2006
Steps 4 and 5: November 2006
Step 6: December 2006
Step 7: as needed
Steps 8 and 9: 2007 and beyond
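As an illustration of the open, Unicode-safe delivery format called for in step 5 of the plan, a minimal author record could be emitted as XML along the following lines. The element names and the "wa:" identifier scheme are invented for this sketch; no schema was fixed at the workshop.

```python
# Sketch of an XML author record for open data exchange. Element names
# and the id scheme are hypothetical. UTF-8 encoding preserves non-ASCII
# author names, as required by the Unicode provision in step 5.

import xml.etree.ElementTree as ET

author = ET.Element("author", id="wa:0001")          # "wa:" prefix is made up
ET.SubElement(author, "name").text = "Katy Börner"   # non-ASCII survives UTF-8
pubs = ET.SubElement(author, "publications")
ET.SubElement(pubs, "doi").text = "10.1000/example"  # placeholder DOI

xml_bytes = ET.tostring(author, encoding="utf-8")
print(xml_bytes.decode("utf-8"))

# Round-trip check: parsing the bytes recovers the Unicode name intact.
parsed = ET.fromstring(xml_bytes)
assert parsed.find("name").text == "Katy Börner"
```

A plain-text dump in a provider's own column layout would satisfy none of the portability goals; an open, self-describing format like this (or a documented MySQL/PostgreSQL/Oracle dump) lets the technical team ingest each source without bespoke parsing.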