Wiki-Authors:
Combining Automatic Data Integration and Wiki Technology to Derive
Unique Author Identifiers for Professional Knowledge Management
White paper compiled to summarize the results of the Workshop on Scholarly Data and Data Integration at Indiana University,
Bloomington, IN on August 30 and 31, 2006.
Miguel Andrade1, Katy Börner2, Stacy Kowalczyk2, Barend Mons3,4, Erik M. van Mulligen3,4, Marc Weeber4
1Ottawa Health Research Institute, University of Ottawa, 501 Smyth Rd, Ottawa, ON K1H 8L6, Canada
2Indiana University, SLIS, 10th Street & Jordan Avenue, Bloomington, IN 47405, USA
3Erasmus Medical Center, Rotterdam, The Netherlands
4Knewco, Inc., 733 Fallsgrove Drive, Suite 9130, Rockville, MD 20850, USA
The recent International Workshop on Scholarly Data and Data Integration 1 brought together thought leaders in the
fields of bibliographic analysis and data mining. The participants were a blend of content providers, Wiki-developers,
funding agencies, companies, and basic scientists focussing on data integration, disambiguation, on-line knowledge
discovery and data visualisation.
The presentations on user needs and on currently available technologies and services can be summarized as follows:
Bibliographic databases are becoming ever more important in scientific research and science management. These databases are
essential to the primary work of science - retrieving the relevant literature and finding potential collaborators and
competitors. Increasingly, mining of these databases has become a science in itself.
For science management, the use of these databases has become crucial for connecting new proposals to existing
work and for finding reviewers for these proposals. Additionally, measuring the success of both projects and individuals in
science is a challenge, and bibliographic databases are the key resource for doing so.
One of the commonly perceived difficulties in the collection and use of bibliographic databases (e.g., PubMed,
CiteSeer, arXiv, Google Scholar) is the need to uniquely identify authors. Subscription-restricted databases such as the
Web of Science (maintained by Thomson Scientific) and SCOPUS (developed by Elsevier) have recently added a fair
degree of author disambiguation, but even in these databases the issue remains a challenge, both technically and
economically.
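As a rough illustration of why author identification is hard, the sketch below groups hypothetical author records by a naive blocking key (surname plus first initial) and then splits groups on coauthor overlap. All names and records are invented, and this is a minimal toy heuristic, not any of the algorithms developed by the groups named in this paper:

```python
from collections import defaultdict

def block_key(name):
    """Normalize 'Last, First' to a blocking key 'last_f'."""
    last, _, first = name.partition(",")
    initial = first.strip()[:1].lower() if first.strip() else ""
    return f"{last.strip().lower()}_{initial}"

def disambiguate(records):
    """Group records sharing a block key, then split groups whose
    papers share no coauthors (a crude signal of distinct people)."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[block_key(rec["name"])].append(rec)
    clusters = []
    for recs in blocks.values():
        groups = []
        for rec in recs:
            placed = False
            for g in groups:
                if any(set(rec["coauthors"]) & set(r["coauthors"]) for r in g):
                    g.append(rec)
                    placed = True
                    break
            if not placed:
                groups.append([rec])
        clusters.extend(groups)
    return clusters

# Hypothetical records: two distinct "Smith, J."-type authors plus one other.
records = [
    {"name": "Smith, John", "coauthors": ["Lee", "Chen"]},
    {"name": "Smith, J.",   "coauthors": ["Chen", "Park"]},
    {"name": "Smith, Jane", "coauthors": ["Gupta"]},
    {"name": "Jones, Susan", "coauthors": ["Lee"]},
]
clusters = disambiguate(records)
```

Even on this toy input the limits are visible: "Smith, John" and "Smith, J." are merged only because they happen to share a coauthor, which is exactly the kind of evidence that breaks down at scale and keeps fully automatic approaches short of 100%.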
The recent success of Wikipedia, a community effort to establish an on-line encyclopaedia, has led to a series of
spin-off projects and proposals that have community annotation of data at their core. Wiki-Authors 2 is a proposed
environment to collectively annotate and disambiguate science authors. The participants of the workshop agreed that the
new Wiki-data model3 and the supporting technologies of the various participants in combination with public content
(essentially authors and the titles of their publications as identified by the Digital Object Identifier, DOI) would be a
sufficient basis to start a Wiki-Authors project with a high likelihood of success. Therefore, the group decided to
proceed with the extension of a pilot project that has already started between Wikimedia developers and the company
Knewco, and to try the Wiki-Authors approach in this environment. The pilot project will be focussed on scientific
authors, starting in the biomedical domain but expanding soon into other scientific disciplines if successful. The reason
to start with the biomedical domain is that the largest single public database of scientific abstracts is available in this
discipline and that several algorithms for author disambiguation have been developed specifically for biomedicine.
The proposed Wiki-Authors project is detailed below. All comments and suggestions are welcome. Please send to
all authors of this paper.
1 http://www.scimaps.org/meeting_060830.php
2 http://meta.wikimedia.org/wiki/WikiAuthors
3 http://meta.wikimedia.org/wiki/Wikidata
Wiki-Authors Project
The Wiki-Authors project was originally proposed by Miguel Andrade. Technology for on-line author
disambiguation has already been developed at the University of Illinois at Chicago by Neil Smalheiser's group and is
currently also under development at the company Knewco. Meanwhile, the Wiki-data model and the WiktionaryZ
application were conceived by the Wikimedia developers Gerard Meijssen and Erik Moeller, and a
prototype was developed with financial, technical, and intellectual support from the company Knewco.
Katy Börner’s Group at Indiana University designed a Scholarly Database 4 that serves publication, patent, and grant
datasets. The identification and interlinkage of unique authors, inventors, and investigators poses a major data
integration challenge.
Prototypes of the different databases and disambiguation algorithms were demonstrated during the workshop. The
groups of Neil Smalheiser and Katy Börner, The Knewco Group, Thomson Scientific, and the SCOPUS team at
Elsevier have all developed their approaches with partially overlapping elements. The presentations showed that
automatic author disambiguation can be achieved to levels around 90% in a controlled environment. However,
unambiguously defining a unique author, with all synonyms, spelling variations, and errors of the author name
associated correctly, together with links to all publications, patents, and projects of that author, is a daunting task that cannot
be fully achieved by any centralised organisation in isolation. It also became clear that no particular party with a specific
interest is currently in the position to enforce a unique author ID on the scientific community.
The envisioned Wiki-Authors will support multiple branches for imported content, similar to version control
software used in software development. It will thus be possible to apply different disambiguation algorithms to the data
without necessarily changing the current "living version" of the data. Tools will be implemented to facilitate merge
processes across branches, and of new data sets into the primary branch, to allow the Wiki-Authors user community to
work together in rejecting and approving changes. Wiki-Authors must also support a date hierarchy: for example, from 1999
to 2000 the author's name was "Susan Smith-Jones", and from 2000 to the present it has been "Susan Smith".
At the workshop it was further concluded that:
 The integration challenge associated with the federation of multiple scholarly databases might be best solved by
using existing unique-author lists generated for specific datasets.
 These lists should be combined initially using automatic data integration technologies to arrive at a ‘master list’
that forms a bridge across the stove-pipes of today’s mostly unconnected publication, patent, grant, and other
scholarly databases.
 The master lists of unique authors/journals/institutions/geo-locations would then be made available in an
open Wiki-environment to enable computer-assisted community correction, annotation, and completion, under
a free content license as defined by the Free Content Definition5. The license will not have a 'copyleft' clause,
i.e., it will not be necessary to make all derivative works freely available.
 The sources of the combined author list should be kept in their original state to enable updates. Community
annotations should be made on the ‘master list’.
 Individual institutions should be able to download the improved data set at regular intervals and apply the data
in their own free or commercially available environments.
 Links to the original data should be possible.
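The combination of per-dataset lists into a 'master list' with provenance links back to the untouched sources, as outlined in the points above, can be sketched as follows. The source names, local identifiers, and matching rule are all invented for illustration; in particular, real matching would use the disambiguation technologies discussed earlier rather than exact name equality:

```python
def build_master_list(sources):
    """Combine per-dataset unique-author lists into a master list.
    Each master entry keeps provenance links back to its origins;
    the source lists themselves are never modified."""
    master = {}
    for source_name, authors in sources.items():
        for local_id, name in authors.items():
            key = name.lower()          # naive match on exact name
            entry = master.setdefault(key, {"name": name, "links": []})
            entry["links"].append((source_name, local_id))
    return list(master.values())

# Hypothetical per-dataset lists (IDs and names invented for illustration).
sources = {
    "pubmed_list": {"pm17": "Susan Smith", "pm42": "John Doe"},
    "patent_list": {"pt03": "Susan Smith"},
}
master = build_master_list(sources)
```

Because each master entry records `(source, local_id)` pairs instead of copying source data, the original lists can be refreshed independently and the provenance links re-resolved, which is what makes regular re-imports and community annotation on the master list compatible.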
For such a ‘Professional’ Wiki environment to be acceptable and useful, a number of features have to be guaranteed:
 An international governance structure needs to be created to ensure credibility and reliability.
 Data provenance, recognition, and integrity of compiled sources have to be ensured.
 Transparency and ease of use of the technical setup.
 Long term sustainability.
 A 24/7 up time.
 Scalability.
 Clarity on licensing and access issues.
4 https://iv.slis.indiana.edu/db
5 www.freecontentdefinition.org
Plan of Work
Based on the needs expressed and the technical status reports, we suggest a stepwise approach:
1. A small, dedicated International Steering Group will be formed. The Steering Group will support the thinking
and practice around the issues of credibility, reliability (up time, performance), and sustainability (long term
funding) once the pilot results in a decision to move to a professional production environment.
2. A dedicated technical team, wider than the current one, will be formed.
3. Participants and potentially additional parties will share information on the scope and format of their
current data on the disambiguation of author names and on already disambiguated author names.
4. Knewco will bring in the prototype with semantic support of the application as well as their on-line technology
for combining algorithmic and self-disambiguation of authors, possibly combined with the technology
developed by the Smalheiser group and, if they can be made available, with the algorithms developed by
Thomson and Elsevier.
5. The content providers will deliver their input to the technical team in the desired format. Data should be made
available in open standard formats, ideally XML-based formats or MySQL, PostgreSQL, or Oracle database
dumps. At least minimal documentation should be provided for all data sets. All string representations should
be in Unicode.
6. After a period of alpha testing by a limited group of dedicated professionals, the prototype will be put in the
public domain for beta-testing and user feedback.
7. Follow-up focused workshops will be organised dealing with technical and deployment issues.
8. A production version of Wiki-Authors will continue to be governed by an International Steering Group
representing the major contributors to the project.
9. The need for a legal entity associated with or responsible for Wiki-Authors will be discussed.
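Step 5 asks content providers for open, Unicode-clean formats. As a minimal sketch of what an XML-based delivery of one author record might look like, the fragment below serializes a record with Python's standard library. The element names, the `wa:0001` identifier, and the DOI value are illustrative assumptions only; no delivery schema was agreed at the workshop:

```python
import xml.etree.ElementTree as ET

def author_to_xml(author_id, names, dois):
    """Serialize one author record to a small XML fragment.
    The element and attribute names here are illustrative only."""
    root = ET.Element("author", id=author_id)
    for n in names:
        ET.SubElement(root, "name").text = n
    for doi in dois:
        ET.SubElement(root, "publication", doi=doi)
    # encoding="unicode" yields a str, keeping the text Unicode-clean
    return ET.tostring(root, encoding="unicode")

xml = author_to_xml("wa:0001",
                    ["Susan Smith", "Susan Smith-Jones"],
                    ["10.1000/example.1"])
```

Publications are referenced by DOI rather than embedded, consistent with the earlier point that Wiki-Authors would build on public content consisting essentially of authors and DOI-identified titles.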
Proposed Timeline
Step 1 and 2: September and early October 2006
Step 3 and 4: October 2006
Step 4 and 5: November 2006
Step 6: December 2006
Step 7: as needed
Step 8 & 9: 2007 and beyond