Resource Identification and Tracking in the
Neuroscience Literature
Table of Contents
Executive summary and action items
Action items
Meeting overview
Motivation
Who benefits?
Background: Summary of first meeting held at Society for Neuroscience, Oct 15th, 2012
Antibodies
Transgenic animals
Digital tools
Summary of discussion at June 26th meeting
Presentations
Discussion of presentations
Working groups
Feasibility Group
Implementation Group
Meeting Outcomes
Pre-pilot
Pilot project
Action items
Appendix: Meeting attendees
Executive summary and action items
A meeting to discuss the challenge of identifying key research resources used in the course of
scientific studies published in the neuroscience literature was held at the National Institute on
Drug Abuse (NIDA) in Washington DC on June 26, 2013. The meeting was organized by the
Neuroscience Information Framework (NIF; http://neuinfo.org) and the International
Neuroinformatics Coordinating Facility (INCF; http://incf.org) with support from NIDA.
Attendees were drawn from different stakeholders, including government representatives,
publishers, journal editors, informaticians, curators and commercial resource suppliers.
At the end of the session, almost all attendees indicated their interest in a pilot project to identify
antibodies, model organisms and tools in a machine processable form across neuroscience
journals to improve reproducibility and tracking of resource utilization. One goal of this project
will be to gather data on the best implementation strategy to engage the authors in providing
these identifiers and in establishing a scalable process for verifying that the correct identifiers
are used. Another goal will be to provide a demonstration project to the research
community that will show the benefits of machine processable information within papers by
making it easier to find research resources.
Action items
1) Perform pre-pilot project (2 months-Resource Identification Group: NIF, NITRC, INCF,
Monarch, Cross-Ref, antibodies-online, Eagle-i and other interested parties):
● Form the Resource Identification Group: The RIG will develop and evaluate the
specific technologies and implementation. Ensuring that other groups who are working
in this area are involved will be important for the success of the project.
● Make sure that the appropriate identifiers are available for all model organisms
● Establish a single website with an easy-to-use front end for obtaining identifiers
● Prepare instructions for authors
● Perform usability studies with naive users (~25)
● Present results to workshop consortium
2) Discuss potential pilot project with publishers (meeting attendees) - 1 month
○ Get initial commitments from publishers for proposed pilot project: what journals,
what resources
○ Discuss potential implementation per journal
3) Prepare detailed proposal for publishers (at completion of pre-pilot project) (Resource
Identification Group)
● Include a link to a demonstration site and the results of the usability study
● Allow flexibility in implementation
● Launch pilot project at SFN?
4) Continue to improve the automated pipeline and authoring/curation tools (Resource
Identification Group)
● Contact Biocreative to see if they are interested in hosting a text mining challenge
5) Seek sponsorship for implementation and promoting the project (all)
● antibodies-online
● Mozilla Foundation: Open Science and Science in the Web
● Society for Neuroscience?
● CrossRef?
Meeting overview
As described in the executive summary, the meeting was held at the National Institute on
Drug Abuse (NIDA) in Washington DC on June 26, 2013, organized by the Neuroscience
Information Framework (NIF; http://neuinfo.org) and the International Neuroinformatics
Coordinating Facility (INCF; http://incf.org) with support from NIDA. Attendees were drawn
from different stakeholder groups, including government representatives, publishers, journal
editors, informaticians, curators and commercial resource suppliers. A list of attendees is
included in the appendix.
Motivation
The goal of the meeting was to come to agreement about a course of action to improve the
ability of both humans and automated agents to identify key research resources - defined here
as materials, data and digital tools - used in published studies. Digital tools in this context refer
to software programs, services, data sets or databases. The meeting was motivated by the
experiences of the Neuroscience Information Framework, a project of the NIH Blueprint
consortium tasked with cataloging these types of digital resources for neuroscience, and other
informatics projects such as the model organism databases, who attempt to identify important
research resources like the subject of a study or reagents used within a published paper. These
projects routinely encounter three problems:
1) Insufficient identifying information is included in the paper such that the exact
organism or antibody used is not identifiable, requiring either a loss of information or curator
effort to interact with the author to track down the information. In either case, this identifying
information is not included in the paper for humans to access.
2) If the information is in the paper, it is not machine readable, that is, it cannot be
parsed and recognized by a computer, either because the information is ambiguous or in a
form, e.g., lots of special characters, that is difficult for a computer to handle.
3) If a machine readable identifier, e.g., an accession number or a stock number, is
used within the paper, it is in a section of the paper, usually the materials and methods, that is
behind a paywall, hampering text mining approaches for extracting this information.
The issue of research resource identification thus reflects three critical needs in biomedical
science:
1) The need for better reporting of materials and methods to promote
reproducible science. Proper resource identification is a step towards that goal.
2) The need for a cultural shift in the way we write and structure papers.
Recognizing that we will interact with the literature through an automated agent, the conventions
we adopt should be tailored towards greater machine-processability.
3) The need for a cultural shift in the way we view the literature. The literature
should serve not only as a source of papers prepared for people to read but also as a
connected database of data, observations and claims in biomedicine that spans journals,
publishers and formats. To locate
and synthesize information from the literature requires universal machine access to key entities
within the paper.
Because the current practices for reporting research resources within the literature are
inadequate, non-standardized and not optimized for machine-based access, it is currently very
difficult to answer even a basic question about published studies such as “What studies used
resource X?” These types of questions are of interest to the biomedical community, which relies
on the published literature to identify appropriate reagents, troubleshoot experiments and
aggregate information about a particular organism or reagent to form hypotheses about
mechanism and function. Such information is also critical to funders who would like to be able
to track the impact of resource funding by generating reports on substantive usage of these
resources within the biomedical literature. Such information is also very useful for resource
providers, both commercial and academic, so that they can track use of their resources.
Based on pilot projects and solutions from other communities, NIF, along with several database
curators and informaticians, proposed that:
1) Key research resources be identified within papers using a unique and persistent
identifier, i.e., an accession number for antibodies, animals and tools, that is machine readable;
2) These resource identifiers be available outside of the paywall;
3) The same format for these identifiers be used across publishers.
There is precedent for these requirements already in that journals require accession numbers
for certain types of entities: gene sequences and protein structures.
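To make the proposal concrete, here is a minimal sketch of what machine-readable identification could enable; the registry prefixes (AB, MGI, SCR) and the identifier values are invented for illustration, not a format agreed on at the meeting:

```python
import re

# Hypothetical identifier scheme: a registry prefix (e.g., AB for an
# antibody registry) followed by a numeric accession, embedded in text.
ID_PATTERN = re.compile(r"\b(AB|MGI|SCR)_(\d+)\b")

def extract_identifiers(text):
    """Return (prefix, accession) pairs found in a passage of text."""
    return ID_PATTERN.findall(text)

methods = ("Sections were labeled with anti-GFAP (Sigma, AB_476889) "
           "and imaged; analysis used ImageJ (SCR_3070).")
print(extract_identifiers(methods))
# -> [('AB', '476889'), ('SCR', '3070')]
```

Because the identifiers follow a single predictable shape, the same few lines work across journals and publishers, which is the point of requiring a uniform format.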
Who benefits?
As discussed above, if these practices were adopted, it would clearly benefit funders, who
would be able to track the usage of research resources within literature to measure the impact
of funding. These practices would also clearly benefit database curators, who would reduce the
amount of time required for curating results from the literature. Science as a whole benefits,
because proper identification of the tools used to generate research findings is a cornerstone of
reproducible science. Resource providers clearly benefit, as it is easier to track who uses their
reagents and tools. The benefits to individual research scientists, which includes journal
editors, would need to be made clear for these policies to become adopted. Perceived benefits
to the research community include:
○ It will be easier to design and troubleshoot experiments, because researchers
would be able to find all studies that used a particular reagent, animal or tool
○ Researchers would be able to find studies that used a particular type of reagent,
e.g., a mouse monoclonal antibody, even if that information was not explicitly
included in the paper, because a complete characterization of the entity is
present in an external database that can be accessed at time of query
○ It will be easier to aggregate and compare results across studies, using both
human effort and data mining approaches
○ Problems found in a resource, e.g., specificity of an antibody, an error in a
database or algorithm, can be easily propagated across the literature, even
retrospectively. With proper tools, readers could be alerted to any potential
problem, thereby reducing the time, effort and money wasted on problematic
resources and incorrect conclusions based on the results of these studies.
These benefits can only be realized if machine-processable resource identification is carried out
on a large scale across journals, in order to create a rich enough data set for data mining and
resource linking across papers to be interesting to the research scientist.
Background: Summary of first meeting held at Society for
Neuroscience, Oct 15th, 2012
The June 26th meeting was a follow up to a meeting held at the Society for Neuroscience (SFN)
annual meeting involving NIF, INCF and a group of neuroscience journal editors and publishers.
At this meeting, the above proposal was presented, along with the results of a NIF pilot project
to identify antibodies, transgenic animals and digital tools used in neuroscience research. The
NIF pilot project used a combination of human curation and text mining. The results for
antibodies indicated that an antibody could be identified as a particular antibody sold or
produced by an individual in less than 50% of the papers. Reporting procedures required a lot
of human intervention, as there were as many styles of reporting research resources as there
were papers. The major problems with resource identification are summarized here:
Antibodies:
· The author did not supply sufficient identifying information, e.g., catalog number,
such that an antibody could be reliably found in a vendor catalog. Rather, general information,
e.g., "mouse monoclonal antibody against actin from Sigma," was provided. As many vendors
sell multiple antibodies that fit these descriptions, we could not identify the reagent used.
· The vendor no longer sold the antibody referenced, or the vendor no longer existed,
so information could not be discovered about its properties. In many cases, the same
manufacturer would sell their products through multiple vendors, with no ability to
cross-reference them.
· The same antibody identifier, e.g., clone ID, could point to multiple antibodies.
· Methods were not described within the paper; instead, readers were referred to other
papers, which then referred to other papers…
Transgenic animals:
· Authors did not supply sufficient information, e.g., stock number, to identify the exact
transgenic animal used. As with antibodies, a given reference to a transgenic from Jackson
Labs often could not be resolved to a particular transgenic line, but could point to more than one.
· The notation adopted by the IMSR does not lend itself to use by automated agents
or search systems, as it employs superscripts, subscripts and special characters.
Digital tools:
· NIF’s semi-automated pipeline did fairly well at recognizing research resources
listed in the NIF catalog within papers, except for those resources with names that are very
common or short, e.g., “R” or “Enzyme.”
· In trying to determine meaningful use of a resource, as opposed to a mention of
the resource without actual use within the study, NIF focused its search on the Materials and
Methods section. However, NIF had access to the materials and methods sections of only a
subset of PubMed Central (the PMC Open Access Subset, a relatively small part of the total
collection of articles in PMC) and other open access journals.
At the SFN meeting, the attendees were polled regarding the desirability and feasibility of
implementing the research resource identification proposal. No serious objections were raised
about the desirability of better resource identification. However, several issues were raised
about the feasibility of such a process:
•Who would do the identification?
–Author? Algorithm? Curator? Editor? How would the information be verified
once supplied?
•Would a special tool be needed? If so, who will pay?
–Would it scale to 40,000 papers/month?
–Is the information available from authors in general?
•Difficult to implement only for neuroscience journals, as publishers have many different
journals in their portfolios
•Granularity: Would we be able to specify the requirements at a level of granularity
that would be useful but still feasible?
•What will be the benefit to the user? How will we show that?
Prior to convening a follow up meeting, NIF and some of our partners agreed to develop some
pilot projects to address some of these issues, based on work that NIF was doing with Elsevier.
Elsevier had agreed to provide full text access to a significant number of neuroscience journals.
Summary of discussion at June 26th meeting
The meeting was divided into a morning session with 3 presentations and an afternoon breakout
session with two working groups. Dr. Jonathan Pollock opened the meeting with a charge to the
participants that at the end of the day, we needed to have a set of action steps.
Presentations
The morning session included presentations from:
1) Mike Huerta, National Library of Medicine: Discovering, Citing and Linking Data
-an overview of the planned NIH data catalog and of the BD2K project
NIH has several initiatives planned for increasing reporting of and access to data, in particular
the creation of a Data Catalog, to which a researcher would upload minimal information about a
data set. Each data set would receive a unique identifier that would be used to track
subsequent use of the data. These same identifiers will be used in PubMed.
2) Maryann Martone, Neuroscience Information Framework: Current practices in
reporting neuroscience resources
Maryann presented the results of several pilot and formal projects that had provided information
about some of the challenges and questions raised in the previous meeting. The main
conclusions of these projects were:
○ The issue of proper resource identification is not unique to neuroscience. Nicole
Vasilevsky and her colleagues from eagle I (https://www.eagle-i.net/) performed
a comprehensive study of resource reporting across a spectrum of journals and
fields, tracking the reporting of antibodies, cell lines, model organisms,
knockdown reagents and constructs. Although the results differed across fields
and type of resource, the general conclusions reached by the neuroscience pilot
held: most papers did not contain sufficient identifying information for either a
human or automated algorithm to identify the resources used. The study is under
review in PeerJ.
○ Vasilevsky et al. examined the reporting requirements of journals and found no
correlation between proper identification and the stringency of reporting
requirements
○ Although the pilot project did not address availability of the information from the
author systematically, in a case study of a single laboratory at Carnegie Mellon
University, Anita de Waard and colleagues found that the identifying information
for reagents and animals was kept in good order by the researcher, i.e., the
appropriate identifying information was available, but this information by and
large did not make it into published papers.
■ Although only an N of 1, the finding affirms the contention that authors
simply do not think to put this information in a paper. In contrast, the
vendor location and city are routinely supplied, because this information is
requested by many journals and mentors teach their students to supply it.
○ Scalability: NIF and Elsevier worked on a text mining project to see if a
machine-learning algorithm could be used to automate the process of resource
identification. They focused on antibodies and tools registered within the NIF
Antibody Registry and NIF Catalog (databases and software tools). Over 500
articles were hand annotated and then used for text mining. The algorithm was
reasonably accurate at detecting antibodies and identifying them when the catalog
numbers were provided (~87%), although the many different styles of reporting
catalog numbers decreased the total number identified (~63%). Identification of
tools was better, approaching 100%. The algorithms are still under development,
but the results were encouraging in that:
■ This project suggested that automated text mining would be helpful in
verifying information supplied by the authors.
■ This project also suggested that at some point, a “resource identification”
step could be incorporated into the manuscript submission pipeline that
would be able to assist authors in identifying their resources.
○ Commercial antibody providers are interested in helping to support such
efforts. NIF has interacted with antibodies-online
(http://www.antibodies-online.com/), a company that seeks to provide more
transparency in the antibody market; they are experts in that market and
are willing to help underwrite costs for the NIF Antibody Registry, an online
database for assigning unique identifiers to antibody reagents.
3) Geoffrey Bilder, Cross Ref: Current Solutions in working with the Biomedical
Literature
-provided an overview of Cross Ref and identifier systems
Dr. Bilder was invited to this meeting because of his expertise in identifier systems through
ORCID, the unique author identification system, and Cross Ref, a non-profit organization funded
by the publishing industry to ensure that articles could be identified on the web through the
creation of the Digital Object Identifier (DOI). Cross Ref works with the entirety of the
biomedical literature and has been approached by other groups to develop methods for
identifying specific research resources within the literature, e.g., chemical identifiers. These
efforts did not proceed because of objections from the publishers, similar to those expressed to
NIF: publishers do not want to do this in an ad hoc fashion for just one domain. Dr. Bilder gave
an overview of the use and need for identifiers and addressed issues of duplication and trust.
Some key points: the system should be as simple as possible; the identifiers should be owned
by the community of interest and not by an individual; and bi-directional verification provides
added trust, e.g., if an author supplies a catalog number for an antibody from a particular
vendor, and the system checks a database confirming that the vendor has an antibody with
those characteristics and catalog number, then the information receives some validation.
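Bilder's bi-directional verification idea can be sketched in a few lines; the catalog dictionary below is a stand-in for a query to a real vendor database, and all vendor names, catalog numbers and properties are illustrative assumptions:

```python
# Sketch of bi-directional verification: the author-supplied claim is
# checked against an independent record of the vendor's catalog.
# This dict stands in for a lookup against a real vendor database.
VENDOR_CATALOG = {
    ("Sigma", "G3893"): {"target": "GFAP", "host": "mouse"},
}

def verify(vendor, catalog_number, claimed_target):
    """Trust the identifier only if the vendor record exists and its
    properties agree with what the author claims."""
    record = VENDOR_CATALOG.get((vendor, catalog_number))
    if record is None:
        return False  # vendor has no such catalog number
    return record["target"] == claimed_target

print(verify("Sigma", "G3893", "GFAP"))   # True: both sides agree
print(verify("Sigma", "G3893", "actin"))  # False: claim contradicts record
```

The same pattern generalizes to model organisms and tools: each claim is validated against whichever registry is authoritative for that resource type.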
The major conclusions drawn from the presentations were:
-the problems of resource identification are not unique to neuroscience and therefore the
solutions could be applied across all of biomedicine; clearly, as per Bilder’s talk, the
study of Vasilevsky et al. and the experience of the model organism database curators,
there is need and interest from communities beyond neuroscience.
-getting authors to supply proper identifiers will require a cultural shift;
instructions to authors without some sort of editorial or curatorial oversight will probably
not be adequate, although right now the evidence for this is somewhat anecdotal
-tools can be developed that would make validating identifiers, and perhaps
assisting authors with annotating the correct entities in their papers, at least
semi-automated. We note that subsequent to the meeting, a paper by Kafkas et al
(2013) was published on the use of text mining to identify genomic database accession
numbers. Although the recall was not perfect, they note that “These initial results suggest
that, given the volume of references found, and the low cost and high precision of the text-mining
method we deploy..., it is useful to extend the scope of accession number mining beyond the
“core three” data sources [Genbank, UniProt, PDB], that publishers currently mark up.”
-Identifier systems exist for antibodies (NIF Antibody Registry), for most model
organisms (fly, zebrafish, rat, worm; mice are a bit of a problem, as not all of the
mouse suppliers have easily accessible unique identifiers) and for tools (NIF and NITRC)
-NIH will be addressing many of the same issues for data sets
Kafkas Ş, Kim J-H, McEntyre JR (2013) Database Citation in Full Text Biomedical
Articles. PLoS ONE 8(5): e63184. doi:10.1371/journal.pone.0063184
Discussion of presentations
The presentations generated much discussion from the audience, centering again on the issues
of feasibility rather than desirability. Issues of granularity again were raised and some
questioned whether the proposed reporting guidelines would go far enough. For example,
those who deal with behavioral data might want more stringent requirements, with unique
identifiers given for key procedures like the Morris Water Maze. The moderators
countered that while such things are identifiable through the many community ontologies
under development, and groups have been working to create fully structured methods
and to semantically enhance entire papers, e.g., the FEBS Journal Structured Abstract project, we
needed to start with a set of entities that we agree can be reasonably identified and for which
authoritative sources of identifiers exist. Similarly, while everyone present agreed that
researchers should write better and more detailed methods, the moderators made it clear that
this meeting was equally focused on the machine-accessibility and processability of information
and not just its presence in an article for another human with a subscription to read.
The larger discussion focused on who would do the work and when would it be done. Several
in the audience felt that adding yet another requirement for the journal staff or the reviewers
would be too onerous. Authors also might not adopt the practice if it was too difficult to find
accession numbers or if it wasn’t clear what they should identify. Geoff Bilder noted that the
authors currently spend a lot of time doing things that are no longer necessary, e.g., formatting
references for a particular journal style, and that perhaps if we started to eliminate some of
these unnecessary steps, we could free up time for new practices required for electronic
publishing, e.g., resource identification.
Dr. Bilder also noted that any workflow that involved a modification of the manuscript
submission system, e.g., Scholar One, would not likely succeed in the short run, as these
modifications are perceived to be expensive and take time. He did say, however, that the entire
process could be done with minimal modification of the current manuscript submission system.
Matt Giampaolo from Wiley noted that if the text mining tools are made available across all
publishers, it would make the process easier and more widespread.
The issue of whether resource identification and tracking would provide sufficient benefit to the
research community that it would spur adoption was raised, with some questioning whether it
would provide any benefit. While proper identification of antibodies via catalog numbers and even lot
numbers was viewed as a “no brainer”, these are not of relevance to all researchers in
neuroscience. Just knowing that a researcher used a particular software tool might also not be
useful, without additional information about version and other parameters. These objections
were countered by others, some of whom noted that projects like ADNI (Alzheimer’s Disease
Neuroimaging Initiative) had recently requested special identifiers within PubMed so that they
could track usage. The well-known problems with finding antibodies (described as “a search
industry” by antibodies-online) were reiterated.
There was general agreement that, although the problem of resource identification goes far
deeper than just supplying catalog numbers and other identifiers, we have to start simply,
with something that is doable. The hope is that if this proves beneficial to researchers,
funders, publishers and resource providers, the project would be expanded to include much
more structured methods.
Following the morning discussion, the workshop broke up into two groups: 1) Feasibility of a
pilot project: what would be identified and by whom? 2) Implementation: what would an
end-to-end system look like?
Working groups
Feasibility Group:
Scope: 3 types of entities should be identified as an initial pilot project: 1) Antibodies; 2)
Tools; 3) Model organisms
For tools, the scope should be those that are registered within the NIF Registry and not all
commercial tools or instruments used. The NIF Registry focuses on digital resources that are
largely, although not exclusively, produced by the academic community. Note that the NIF
Registry links with NITRC (Neuroimaging Tools and Resource Clearinghouse; http://nitrc.org),
which has catalogued software tools and databases for neuroimaging. For the purposes of
this proposal, references to the NIF Registry will also include NITRC, as that is the authoritative
source of neuroimaging tools.
Who: The issue of whether the author should be asked to supply this information or whether we
should attempt to use semi-automated means to identify potential research resources and then
go back to the author was discussed. One can envision a two-step process where the authors
are asked to supply the information and then the article is screened via NLP for verification.
The need to ensure that the process was not overly onerous for the author was emphasized.
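The screening half of this two-step process might look like the following minimal sketch; the trigger words and the identifier shape are illustrative assumptions, not the actual NIF pipeline:

```python
import re

# Step 2 sketch: after authors supply identifiers, scan the methods text
# for resource-like mentions that lack an accompanying identifier, so a
# curator or editor can ask the author to fill them in.
TRIGGERS = ["antibody", "mouse line", "software"]
HAS_ID = re.compile(r"\b[A-Z]{2,4}_\d+\b")  # illustrative identifier shape

def flag_sentences(methods_text):
    """Return sentences that mention a resource but carry no identifier."""
    flagged = []
    for sentence in methods_text.split("."):
        mentions = any(t in sentence.lower() for t in TRIGGERS)
        if mentions and not HAS_ID.search(sentence):
            flagged.append(sentence.strip())
    return flagged

text = ("Cells were stained with a rabbit antibody against GFAP. "
        "Images were analyzed with ImageJ software (SCR_3070).")
print(flag_sentences(text))
# -> ['Cells were stained with a rabbit antibody against GFAP']
```

In practice the first step, a full NLP pass, would be far more sophisticated, but even a simple check of this kind could route only problem sentences back to the author, keeping the burden low.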
When: Should the process of resource identification be done at time of submission, during
review or after acceptance? The general feeling was that during review or after acceptance
would be the time when we would likely get the most compliance. If this process is done during
review, then the reviewers would need to be alerted that they should look for this information
and be able to communicate with the author that they need to supply this information. We do
not want to make this an absolute requirement for publication, as we all recognize that the
authors may not possess this information and we do not want them supplying false information
in order to have the article published. If it is after acceptance, then the onus would be on the
staff or the editor to ensure compliance.
How: If authors are going to supply these identifiers, then it needs to be easy for them to obtain
them. Dr. Martone felt that the proper identifiers were sometimes difficult to find in the model
organism databases, but that NIF could help with a simple service. NIF itself also would need to
be made simpler, as it is currently difficult to know where to look. Communication with the
Mutant Mouse Resources is necessary to ensure that proper identifiers are being given to all
mouse strains.
Dr. Pollock brought up the issue of having animals identified through a bar code, or perhaps
spiking reagents with a sequence or some other identifier that could be automatically read. It is
clear that novel technology solutions are now possible or on the horizon and that investments
into laboratory information management need to be made. Once the research community
begins to make the shift towards a web-enabled platform for scholarly communication - one that
handles all types of diverse research objects - we believe that there will be numerous
opportunities to streamline the process of working with these objects.
Implementation Group
The implementation group mapped out what an end-to-end workflow might look like for a pilot
project and beyond. The minimum requirement is that we have the appropriate registries that
are viewed as authoritative for the entities to be identified.
Other steps:
1) Tagging: The option of having an independent group like NIF do the tagging, rather
than the author, was discussed but would likely bring up privacy concerns from authors. As with
the feasibility group, one can see pros and cons to performing the resource identification at
different steps in the publication process: at time of submission, during review, after
acceptance.
2) Verification step: The suggestion was made that we contact Biocreative
(http://biocreative.sourceforge.net): “Critical Assessment of Information Extraction systems in
Biology”, an organization that runs challenges for evaluating text mining and information
extraction systems applied to the biological domain. We could make the verification of research
resources within the materials and methods section a challenge project.
3) Where would the identifiers be? The request was that any identifiers supplied would
be available in a uniform format across publishers, would not be stripped out by PubMed, and
would be available to third parties outside of the paywall. In the NIF-Elsevier pilot, identifiers are placed
in the author-supplied keyword field, which is indexed by PubMed. This solution may be
unwieldy if larger numbers of antibodies are used, for example. Alternatively, Geoff Bilder
suggested that they could be stored in a single URL that points to a metadata record. Placing
the identifiers in text is something that is done already for entities like gene accession numbers,
but unless the text was accessible, this would not satisfy the requirements for 3rd party
accessibility. However, with access to materials and methods, these identifiers could be
extracted and placed in a location outside of a paywall. Clearly, as indicated in Mike Huerta’s
talk, the NIH Data Catalog will face similar issues.
4) Sustainability: The issue of sustainability of projects like NIF was brought up, as
some publishers are concerned about investing in a strategy only to have the database
disappear. Of course, no one can guarantee that any organization will exist in perpetuity.
One possible solution is to replicate the services for robustness; for example, the INCF and
eagle-i both offered to mirror the NIF system. Geoff Bilder also noted that if the identifiers and
systems are covered by a CC0 license, then they would be available for anyone to pick up
should NIF go out of business.
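Returning to point 3 above (where the identifiers would live), a purely illustrative sketch of the single-URL metadata record idea is shown below. Every field name and identifier value here is hypothetical; no format was agreed at the meeting.

```python
# Illustrative sketch only: a per-article resource metadata record of the
# kind that could live at a single URL outside the paywall. All field
# names and identifier values are hypothetical, not an agreed standard.
article_record = {
    "article_doi": "10.1000/xyz123",  # placeholder DOI
    "resources": [
        {"type": "antibody", "registry": "AntibodyRegistry", "id": "AB_0000001"},
        {"type": "organism", "registry": "MGI", "id": "MGI:0000001"},
        {"type": "software", "registry": "NITRC", "id": "nitrc_0001"},
    ],
}

def resources_of_type(record, resource_type):
    """Return the identifiers of a given resource type from a record."""
    return [r["id"] for r in record["resources"] if r["type"] == resource_type]

print(resources_of_type(article_record, "antibody"))  # → ['AB_0000001']
```

A third party could harvest and index such records without any access to the full text, which is the accessibility requirement raised in point 3.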
Meeting Outcomes
At the end of the session, almost all attendees indicated their interest in a pilot project to identify
antibodies, model organisms and tools in a machine processable form across neuroscience
journals. One goal of this project will be to gather data on the best implementation strategy to
engage the authors in providing these identifiers and in establishing a scalable process for
verifying that the correct identifiers are used. Another goal will be to provide a
demonstration project to the research community that will show the benefits of machine
processable information within papers by making it easier to find research resources.
Pre-pilot:
Considerable groundwork has been done, and the major resources (NIF Registry, Antibody
Registry, NITRC, NIF Integrated Model Organism database) required for this project are largely
in place. However, before a large-scale pilot project can be launched, we will need to run a
pre-pilot. Thus
far, the work done by NIF and Monarch has not engaged the author but has relied on curators
or automated agents to identify research resources. As the author must be engaged in this
process, a pre-pilot was outlined where a small group of users is given 5-10 papers and asked
to supply appropriate identifiers for antibodies, tools and animals. We would monitor whether:
○ naive users were able to understand which entities needed to be identified
○ naive users were able to look up the appropriate identifiers
○ users got frustrated or annoyed at the process
○ what percentage of the relevant entities within papers were available through NIF
We did not discuss what would constitute success for this pre-pilot phase, but clearly we would
like to see that a majority of users could successfully complete the task. This pre-pilot could be
conducted via webinar so that it did not involve a large expense.
Pilot project:
Once the system is in place for obtaining the appropriate identifiers, a larger scale pilot project
would be launched across journals. This project would involve asking the authors to supply the
correct identifiers at some point in the publication process: at submission, during review or after
acceptance. We will leave it to the individual journals and publishers to decide at which stage
they send the author the request, both to give them some flexibility and to allow us to test
different strategies for acquiring this information. Ideally, the
project would run for a specified period of time, e.g., one month, during which time all articles
from a particular journal would be tagged. Again, the journals and publishers can have some
flexibility in choosing the journals and the exact number of articles. However, it is important that
high impact journals participate in this project, as authors are usually highly motivated to comply
with requests from high impact journals and because it would give high visibility to the project.
Authors would be notified by the editors by email that they are participating in a pilot project to
make science more reproducible and to make articles easier for machines to read. NIF will
provide the appropriate instructions and a link to the website where the authors can obtain the
information. Geoff Bilder offered to work with colleagues, e.g., Steve Pettifer, to create a nice
front end for the system.
For the initial project, the authors should insert the identifiers into their materials and methods
section, as they would a gene accession number or a URL for a tool. Some journals have
author guidelines for this type of citation, and we would follow this convention, e.g., BMC
Genomics states that nucleic acid sequences, protein sequences, and the atomic coordinates of
macromolecular structures should be deposited in the appropriate database, and that the
accession number should be provided in square brackets with their corresponding database
name (e.g. [EMBL:AB026295, GenBank:U49845, PDB:1BFM] (Kafkas et al., 2013)).
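As a rough illustration of why this convention is machine-friendly, a short script can pull database-prefixed accessions out of a methods section. The pattern below is a sketch tuned to the bracketed style quoted above, not a production-grade extractor, and the example text is invented.

```python
import re

# Sketch: extract database-prefixed accession identifiers cited in square
# brackets, following the BMC Genomics convention described above,
# e.g. "[EMBL:AB026295, GenBank:U49845, PDB:1BFM]".
CITATION_PATTERN = re.compile(r"\[([A-Za-z0-9_-]+:[^\]]+)\]")

def extract_identifiers(methods_text):
    """Return (database, accession) pairs found in bracketed citations."""
    pairs = []
    for match in CITATION_PATTERN.finditer(methods_text):
        for item in match.group(1).split(","):
            db, _, acc = item.strip().partition(":")
            if db and acc:
                pairs.append((db, acc))
    return pairs

text = ("Sequences were deposited in public databases "
        "[EMBL:AB026295, GenBank:U49845, PDB:1BFM].")
print(extract_identifiers(text))
# → [('EMBL', 'AB026295'), ('GenBank', 'U49845'), ('PDB', '1BFM')]
```

A verification pipeline of the kind discussed for the pilot could run a pass like this over the materials and methods section and check each extracted identifier against the relevant registry.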
To oversee the implementation issues and ensure that the effort can extend beyond
neuroscience, we will create a Resource Identification Group that includes participants in this
workshop and others who have expertise and tools relevant to the pilot. We will utilize
FORCE11 (Future of Research Communications and e-Scholarship; http://force11.org), a
community platform for stakeholders interested in advancing scholarly communication through
technology, to align our efforts with those underway in different areas of biomedicine, as the
goal is to establish a uniform citation style. FORCE11 is already coordinating discussions on
data citation styles (http://www.force11.org/node/4381) and can provide feedback and advice
about the proposed implementation.
Once the papers have been annotated with resource identifiers, we would then need access to
the full text so that we could verify and extract these identifiers. For the initial pilot project, we
need not settle on a final solution for where the identifiers are to be stored outside of the
paywall, as an outside organization like NIF, INCF or CrossRef can store them. As per the
discussion above, the pilot project may involve mirroring at all three sites. After the pilot project is
complete, we will follow up with a questionnaire to find out how the authors viewed the task. We
will also provide them with a link where they can view the results of the pilot, and give them a
search interface so that they can find papers that used their reagent or tool. We hope to
engage the publishers to create various widgets that might provide this information through the
article itself.
Action items
1) Perform pre-pilot project (2 months; Resource Identification Group: NIF, NITRC, INCF,
Monarch, CrossRef, antibodies-online, eagle-i and other interested parties):
● Form the Resource Identification Group: The RIG will develop and evaluate the
specific technologies and implementation. Ensuring that other groups who are working
in this area are involved will be important for the success of the project.
● Make sure that the appropriate identifiers are available for all model organisms
● Establish a single website with an easy-to-use front end for obtaining identifiers
● Prepare instructions for authors
● Perform usability studies with naive users (~25)
● Present results to workshop consortium
2) Discuss potential pilot project with publishers (meeting attendees) - 1 month
○ Get initial commitments from publishers for proposed pilot project: what journals,
what resources
○ Discuss potential implementation per journal
3) Prepare detailed proposal for publishers (at completion of pre-pilot project) (Resource
Identification Group)
● Include a link to a demonstration site and the results of the usability study
● Allow flexibility in implementation
● Launch pilot project at SFN?
4) Continue to improve the automated pipeline and authoring/curation tools (Resource
Identification Group)
● Contact Biocreative to see if they are interested in hosting a text mining challenge
5) Seek sponsorship for implementing and promoting the project (all)
● antibodies-online
● Mozilla Foundation: Open Science and Science in the Web
● Society for Neuroscience?
● CrossRef?
Appendix: Meeting attendees
Helen Atkins
Director, Publishing Services, PLOS
1160 Battery Street, Suite 100
San Francisco, CA 94111
Phone: 415-624-1200
Direct: 415-624-1227
Email: hatkins@plos.org
Geoffrey Bilder
Director of Strategic Initiatives, CrossRef
21-27 George Street, Oxford
OX1 2AY, UK
Phone: +44 7766 410 380
Email: gbilder@crossref.org
Jan Bjaalie, M.D., Ph.D.
Head of Institute
Institute of Basic Medical Sciences, University
of Oslo
Domus Medica, Gaustad, Sognsvannsveien 9, 0372 Oslo
Phone: +47-22851575 / +47-22851199
Email: instituttleder@basalmed.uio.no
Katja Brose, Ph.D.
Neuron
600 Technology Square
Cambridge, MA 02139
Phone: 617-397-2835
Email: KBrose@cell.com
Anita de Waard
VP, Research Data Collaborations
Elsevier, Jericho, VT
Phone: +1 (619) 252 8589
Email: A.dewaard@elsevier.com
Howard Eichenbaum, Ph.D., B.S.
Boston University
Center for Memory and Brain/Psychology
Department
2 Cummington Street
Boston, MA 02215-2407
Phone: 617-353-1426
Email: hbe@bu.edu
Mathew Giampoala, Ph.D.
Senior Editor, Life Science Journals
Wiley and Sons
111 River Street
Hoboken, NJ 07030
Phone: 201-748-6000
Email: mgiampoala@wiley.com
Scott Grafton, M.D.
Associate Editor, Neuroimage
University of California at Santa Barbara
Psychological & Brain Science
Santa Barbara, CA 93106-9660
Phone: 805-975-5272
Email: grafton@psych.ucsb.edu
Sean Hill, Ph.D.
International Neuroinformatics Coordinating Facility
Nobels väg 15A
17177 Stockholm Sweden
Phone: 41 +41 21 69
Email: sean.hill@epfl.ch
Michael Huerta, Ph.D.
Associate Director
National Library of Medicine
National Institutes of Health
8600 Rockville Pike
Bethesda, MD 20894
Phone: 301-496-8834
Email: mike.huerta@nih.gov
David Kennedy, Ph.D.
University of Massachusetts Medical School
365 Plantation Street, Biotech 1
Suite 100
Psychiatry
Worcester, MA 01605
Phone: 508-856-8228
Email: David.Kennedy@umassmed.edu
Dr. Tim Hiddemann
antibodies-online GmbH
Schloß-Rahe-Str. 15
DE-52072 Aachen
Phone: +49(0)241 9367-2522
Mobile: +49(0)173 9432903
Email: tim.hiddemann@antibodies-online.com
Anthony-Samuel LaMantia, Ph.D.
Associate Editor, Cerebral Cortex
Professor of Pharmacology & Physiology
The George Washington University School of
Medicine and Health Sciences
Ross Hall 661
2300 Eye Street, NW
Washington, DC 20037
Phone: 202-994-8462
Lab phone: 202-994-8465
Email: lamantia@gwu.edu
Stephen Lisberger, Ph.D.
Duke University
101H Bryan Building
412 Research Drive
Department of Neurobiology
Durham, NC 27710
Phone: 415-476-1062
Email: lisberger@neuro.duke.edu
Maryann Martone, Ph.D.
Professor-in-Residence
Department of Neuroscience
University of California, San Diego
San Diego, CA 92093-0446
Phone: 858-822-0745
Email: maryann@ncmir.ucsd.edu
John HR Maunsell, Ph.D.
Editor, Journal of Neuroscience
Harvard Medical School
220 Longwood Avenue
Department of Neurobiology
Boston, MA 02115-5701
Phone: 617-432-6779
Email: maunsell@hms.harvard.edu
Angus Nairn, Ph.D.
Co-Editor Biological Psychiatry
Yale University
Psychiatry
34 Park Street, CMHC
New Haven, CT 06508
Phone: 212-327-8871
Email: angus.nairn@yale.edu
Kalyani Narasimhan
Chief Editor
Nature Neuroscience
75 Varick Street, 9th Floor
New York, NY 10013-1917
Phone: 212-726-9319
Email: k.narasimhan@us.nature.com
Jonathan Pollock, Ph.D.
Chief
Genetics and Molecular Neurobiology
Research Branch
Division of Basic Neuroscience and
Behavioral Research
National Institute on Drug Abuse
6001 Executive Blvd. Rm 4103
Bethesda, MD 20892
Phone: 301-435-1309
Email: jpollock@mail.nih.gov
Nicole Vasilevsky, PhD
Project Manager
Ontology Development Group
Oregon Health & Science University
Mail code: LIB
3181 S.W. Sam Jackson Park Road
Portland, OR 97239
Phone: 503-806-6900
Skype: nicolevasilevsky
Email: vasilevs@ohsu.edu
Laszlo Zaborszky, Ph.D., M.D.
Rutgers University
197 University Avenue
Newark, NJ 07102-1814
Phone: 973-353-3659
Email: zaborszky@axon.rutgers.edu