
Big data from small data: A deep survey of the neuroscience landscape data via the Neuroscience Information Framework
Maryann Martone, Ph.D.
University of California, San Diego
“Neural Choreography”
“A grand challenge in neuroscience is to elucidate brain function in relation
to its multiple layers of organization that operate at different spatial and
temporal scales. Central to this effort is tackling “neural choreography” -- the integrated functioning of neurons into brain circuits -- ... Neural
choreography cannot be understood via a purely reductionist approach.
Rather, it entails the convergent use of analytical and synthetic tools to
gather, analyze and mine information from each level of analysis, and
capture the emergence of new layers of function (or dysfunction) as we
move from studying genes and proteins, to cells, circuits, thought, and
behavior....
However, the neuroscience community is not yet fully engaged in exploiting the
rich array of data currently available, nor is it adequately poised to capitalize
on the forthcoming data explosion.”
Akil et al., Science, Feb 11, 2011
“Data choreography”
 In that same issue of Science
 Science asked its peer reviewers from the previous year about the availability and use of
data
 About half of those polled store their data only in their
laboratories—not an ideal long-term solution.
 Many bemoaned the lack of common metadata and archives as a
main impediment to using and storing data, and most of the
respondents have no funding to support archiving
 And even where accessible, much data in many fields is too poorly
organized to enable it to be efficiently used.
 “...it is a growing challenge to ensure that data produced during the
course of reported research are appropriately described, standardized,
archived, and available to all.” Lead Science editorial (Science, 11
February 2011: Vol. 331, no. 6018, p. 649)
A data federation problem
No single technology serves all of these equally well.
Multiple data types; multiple scales; multiple databases
Neuroscience is unlikely to be served by a few large databases, as the genomics and proteomics communities are
[Figure labels: Whole brain data (20 µm microscopic MRI); Mosaic LM images (1 GB+); Conventional LM images; Individual cell morphologies; EM volumes & reconstructions; Solved molecular structures]
 NIF is an initiative of the NIH Blueprint consortium of institutes
 What types of resources (data, tools, materials, services) are
available to the neuroscience community?
 How many are there?
 What domains do they cover? What domains do they not cover?
 Where are they?
 Web sites
 Databases
 Literature
 PDF files
 Desk drawers
 Supplementary material
 Who uses them?
 Who creates them?
 How can we find them?
 How can we make them better in the future?
http://neuinfo.org
We need more databases (?)
•NIF Registry: A catalog of neuroscience-relevant resources
•> 5000 currently listed
•> 2000 databases
•And we are finding more every day
But we have Google!
 Current web is designed
to share documents
 Documents are
unstructured data
 Much of the content of
digital resources is part of
the “hidden web”
 Wikipedia: The Deep Web
(also called Deepnet, the
invisible Web, DarkNet,
Undernet or the hidden
Web) refers to World Wide
Web content that is not
part of the Surface Web,
which is indexed by
standard search engines.
NIF must work with the ecosystem as
it is today
 NIF has developed a production technology platform for
researchers to discover, share, access, analyze, and
integrate neuroscience-relevant information
 Semantically-enabled search engine and interface that customizes
results for neuroscience
 System that searches the “hidden web”, i.e., content not well served by
search engines
 Data resources are predominantly relational, XML, text, RDF, OWL
 Automated data harvesting technologies that produce dynamic indices
of data content including databases, web pages, text, XML, etc.
 Tools to make products and data available
 Designed to be populated rapidly; set up process for progressive
refinement
NIF accomplishments

 Assembled the largest searchable collation of neuroscience data on the web
 Data federation
 Resource registry (materials, data, tools, services)
 PubMed literature
 Full text of open access
 The largest ontology for neuroscience
 NIF search portal: simultaneous search over data, NIF catalog and biomedical literature
 Neurolex Wiki: a community wiki serving neuroscience concepts
 A unique technology platform
 A reservoir of cross-disciplinary biomedical data expertise
UCSD, Yale, Cal Tech, George Mason, Washington Univ
NIF is poised to capitalize on the new tools
and emphasis on big data and open
science
NIF data federation
[Pie chart: Percentage of data records per data type. Labels: Brain activation foci; Animals; Images; Pathways; Drugs; Connectivity; Antibodies; Microarray (98%); Grants]
[Pie chart: Percentage of data records per data type: everything but microarray]
> 180 sources; 350 M records: NIF was designed to be populated rapidly, with progressive refinement of data
What do you mean by data?
Databases come in many shapes and sizes
 Primary data:

Data available for reanalysis, e.g.,
microarray data sets from GEO;
brain images from XNAT;
microscopic images (CCDB/CIL)
 Secondary data

Data features extracted through
data processing and sometimes
normalization, e.g., brain structure
volumes (IBVD), gene expression
levels (Allen Brain Atlas); brain
connectivity statements (BAMS)
 Tertiary data

Claims and assertions about the
meaning of data
 E.g., gene
upregulation/downregulation,
brain activation as a function
 Registries:
 Metadata
 Pointers to data sets or
materials stored elsewhere
 Data aggregators
 Aggregate data of the same
type from multiple sources,
e.g., Cell Image Library,
SUMSdb, Brede
 Single source
 Data acquired within a single
context, e.g., Allen Brain Atlas
Researchers are producing a variety of
information artifacts using a multitude of
technologies
What types of questions can I ask?
We’d like to be able to find:
 What is known:
 What is the average diameter of a Purkinje neuron?
 Is GRM1 expressed in cerebral cortex?
 What are the projections of hippocampus?
 What genes have been found to be upregulated in
chronic drug abuse in adults?
 Is there a database of fMRI studies?
 What studies used my polyclonal antibody against
GABA in humans?
 What rat strains have been used most
extensively in research during the last 20 years?
 What is not known:
 Connections among data
 Gaps in knowledge
Without some sort of framework, this is very difficult to do
What are the connections of the
hippocampus?
Hippocampus OR “Cornu Ammonis” OR
“Ammon’s horn”
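The synonym expansion behind the query above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not NIF's implementation: the `SYNONYMS` table, `quote_if_phrase` and `expand_query` are hypothetical names, and the table holds only the three terms shown on the slide.

```python
# Hypothetical synonym table; a real system would draw these from an ontology.
SYNONYMS = {
    "hippocampus": ["Cornu Ammonis", "Ammon's horn"],
}


def quote_if_phrase(term: str) -> str:
    """Wrap multi-word terms in double quotes so they are searched as phrases."""
    return f'"{term}"' if " " in term else term


def expand_query(term: str, synonyms: dict = SYNONYMS) -> str:
    """Return a Boolean OR query covering the term and its known synonyms."""
    variants = [term] + synonyms.get(term.lower(), [])
    return " OR ".join(quote_if_phrase(v) for v in variants)
```

Calling `expand_query("Hippocampus")` rebuilds the three-way OR query; a term with no entry in the table comes back unchanged.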
Data sources categorized by “data type” and level of nervous system
Link back to record in original source
Common views across multiple sources
Query expansion: Synonyms and related concepts
Boolean queries
Tutorials for using full resource when getting there from NIF
Results are organized within a common
framework
[Figure: relationship terms used by different resources: Target site; Synapsed by; innervates; Connects to; Input region; Synapsed with; Cellular contact; Projects to; Axon innervates; Subcellular contact; Source site]
Each resource implements a different, though related model;
systems are complex and difficult to learn, in many cases
The scourge of neuroanatomical nomenclature:
Importance of NIF semantic framework
•NIF Connectivity: 7 databases containing connectivity primary data or claims
from literature on connectivity between brain regions
•Brain Architecture Management System (rodent)
•Temporal-Lobe.com (rodent)
•Connectome Wiki (human)
•Brain Maps (various)
•CoCoMac (primate cortex)
•UCLA Multimodal database (Human fMRI)
•Avian Brain Connectivity Database (Bird)
•Total: 1800 unique brain terms (excluding Avian)
•Number of exact terms used in > 1 database: 42
•Number of synonym matches: 99
•Number of 1st order partonomy matches: 385
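The three kinds of match counted above (exact term, synonym, first-order partonomy) could be classified along the following lines. This is a hedged sketch, not the analysis that produced the numbers on the slide: `classify_matches`, `PREFERRED` and `PART_OF` are hypothetical names, the terms are toy data, and partonomy is checked in one direction only (a term in the first list being part of a term in the second).

```python
# Toy lookup tables (assumptions for illustration only).
PREFERRED = {"ammon's horn": "cornu ammonis"}          # synonym -> preferred label
PART_OF = {"dentate gyrus": "hippocampal formation"}   # part -> direct parent


def normalize(term: str) -> str:
    """Lower-case a term and collapse runs of whitespace."""
    return " ".join(term.lower().split())


def classify_matches(terms_a, terms_b, preferred, part_of):
    """Sort each term of terms_a by how it matches the vocabulary terms_b."""
    def canonical(term):
        n = normalize(term)
        return preferred.get(n, n)

    norm_b = {normalize(t) for t in terms_b}
    canon_b = {canonical(t) for t in terms_b}
    result = {"exact": [], "synonym": [], "partonomy": [], "unmatched": []}
    for term in terms_a:
        n, c = normalize(term), canonical(term)
        if n in norm_b:                      # same string in both databases
            result["exact"].append(term)
        elif c in canon_b:                   # different strings, same concept
            result["synonym"].append(term)
        elif part_of.get(c) in canon_b:      # direct parent appears in terms_b
            result["partonomy"].append(term)
        else:
            result["unmatched"].append(term)
    return result


EXAMPLE = classify_matches(
    ["Hippocampus", "Ammon's horn", "Dentate gyrus", "Amygdala"],
    ["hippocampus", "Cornu Ammonis", "Hippocampal formation"],
    PREFERRED,
    PART_OF,
)
```

Only the first term matches as a string; the second and third are recovered through the synonym and partonomy tables, which is the role the semantic framework plays here.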
NIF’s minimum requirements for
effective data sharing
 You (and the machine) have to be able to
find it
 Accessible through the web
 Annotations
 You have to be able to use it
 Data type specified and in a usable form
 You have to know what the data mean
 Some semantics
 Context: Experimental metadata
 Provenance: Where did the data come from?
Reporting neuroscience data within a consistent framework helps enormously
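As a rough illustration, the minimum requirements above can be read as a checklist applied to a data set description. The field names below (`url`, `annotations`, `data_type`, `data_format`, `semantics`, `experimental_metadata`, `provenance`) and the function `missing_requirements` are assumptions made for this sketch, not a NIF schema.

```python
# Hypothetical mapping from each requirement to the fields that satisfy it.
REQUIREMENTS = {
    "findable": ("url", "annotations"),
    "usable": ("data_type", "data_format"),
    "interpretable": ("semantics", "experimental_metadata", "provenance"),
}


def missing_requirements(record: dict) -> dict:
    """Return, per requirement, the fields that are absent or empty."""
    gaps = {}
    for requirement, fields in REQUIREMENTS.items():
        absent = [name for name in fields if not record.get(name)]
        if absent:
            gaps[requirement] = absent
    return gaps


# A made-up record that is on the web and typed, but poorly described.
EXAMPLE_RECORD = {
    "url": "http://example.org/dataset/1",
    "data_type": "image",
    "data_format": "TIFF",
    "provenance": "acquired in the depositing laboratory",
}
```

For `EXAMPLE_RECORD` the check reports missing annotations, semantics and experimental metadata, while the "usable" requirement is met.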
What is an ontology?
 Ontology: an explicit, formal representation of concepts and the relationships among them within a particular domain that expresses human knowledge in a machine-readable form
 Branch of philosophy: a theory of what is
 e.g., Gene ontologies
[Figure: Brain has a Cerebellum, which has a Purkinje Cell Layer, which has a Purkinje cell; a Purkinje cell is a neuron]
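The "has a" and "is a" statements in the figure are already close to machine-readable. A minimal sketch, assuming a plain list of subject-relation-object triples (the names `TRIPLES`, `objects`, `transitive_parts` and `contains_kind` are hypothetical, and this is not how NIFSTD is stored):

```python
# The figure's statements as (subject, relation, object) triples.
TRIPLES = [
    ("Brain", "has_part", "Cerebellum"),
    ("Cerebellum", "has_part", "Purkinje cell layer"),
    ("Purkinje cell layer", "has_part", "Purkinje cell"),
    ("Purkinje cell", "is_a", "Neuron"),
]


def objects(subject, relation, triples=TRIPLES):
    """All objects linked to `subject` by `relation`."""
    return [o for s, r, o in triples if s == subject and r == relation]


def transitive_parts(whole, triples=TRIPLES):
    """Every part reachable from `whole` by following has_part links."""
    found, stack = [], [whole]
    while stack:
        current = stack.pop()
        for part in objects(current, "has_part", triples):
            if part not in found:
                found.append(part)
                stack.append(part)
    return found


def contains_kind(whole, kind, triples=TRIPLES):
    """True if some part of `whole` is declared to be a `kind`."""
    return any(kind in objects(part, "is_a", triples)
               for part in transitive_parts(whole, triples))
```

With these four statements a program can infer that the brain contains neurons, even though no single statement says so.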
You need to use
ontology
identifiers instead
of strings
Blah, blah,
ontology blah
“Ontology as mathematics, computer science or Esperanto” Andrey Rzhetsky and James A. Evans
What can ontology do for us?
“Esperanto!”
 Express neuroscience concepts in a way that is machine readable
 Classes are identified by unique identifiers
 Synonyms, lexical variants
 Definitions
 Provide means of disambiguation of strings
 Nucleus part of cell; nucleus part of brain; nucleus part of atom
 Rules by which a class is defined, e.g., a GABAergic neuron is a neuron that releases
GABA as a neurotransmitter
 Properties
 Provide universals for navigating across different data sources
 Semantic “index”
 Perform reasoning
 Link data through relationships not just one-to-one mappings
 “Concept-based queries”
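The "nucleus" example can be made concrete with a small sketch of string disambiguation. Everything here is illustrative: the identifiers `EX:0001` to `EX:0003` are invented placeholders rather than real ontology IDs, and `OntologyClass`, `lookup` and `disambiguate` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OntologyClass:
    identifier: str          # placeholder ID, invented for this example
    label: str
    part_of: str             # the context that tells the senses apart
    synonyms: tuple = ()


CLASSES = [
    OntologyClass("EX:0001", "Nucleus", "cell", ("cell nucleus",)),
    OntologyClass("EX:0002", "Nucleus", "brain", ("brain nucleus",)),
    OntologyClass("EX:0003", "Nucleus", "atom", ("atomic nucleus",)),
]


def lookup(string, classes=CLASSES):
    """Every class whose label or synonym equals the string (case-insensitive)."""
    s = string.lower()
    return [c for c in classes
            if s == c.label.lower() or s in (x.lower() for x in c.synonyms)]


def disambiguate(string, context, classes=CLASSES):
    """Identifiers of the senses of `string` that belong to `context`."""
    return [c.identifier for c in lookup(string, classes) if c.part_of == context]
```

The bare string "nucleus" returns three candidates, whereas the string plus a context, or an identifier on its own, points to exactly one class.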
Power of unique identifiers: Are you the M
Martone who...
The Gene Wiki: community intelligence applied to human gene annotation.
Huss JW 3rd, Lindenbaum P, Martone M, Roberts D, Pizarro A, Valafar F, Hogenesch
JB, Su AI. Nucleic Acids Res. 2010 Jan;38(Database issue):D633-9.
Ontologies for Neuroscience: What are they and What are they Good for? Larson SD,
Martone ME. Front Neurosci. 2009 May;3(1):60-7. Epub 2009 May 1.
Three-dimensional electron microscopy reveals new details of membrane systems for
Ca2+ signaling in the heart. Hayashi T, Martone ME, Yu Z, Thor A, Doi M, Holst MJ,
Ellisman MH, Hoshijima M. J Cell Sci. 2009 Apr 1;122(Pt 7):1005-13.
Some analyses of forgetting of pictorial material in amnesic and demented patients.
Martone M, Butters N, Trauner D. J Clin Exp Neuropsychol. 1986 Jun;8(3):161-78.
Traumatic brain injury and the goals of care. Martone M. Hastings Cent Rep. 2006 Mar-Apr;36(2):3.
Three-dimensional pattern of enkephalin-like immunoreactivity in the caudate nucleus of the
cat. Groves PM, Martone M, Young SJ, Armstrong DM. J Neurosci. 1988 Mar;8(3):892-900.
I am not a number (but I should
be)
 Full URI: Uniform Resource Identifier
 http://orcid.org/1234567
 Label: Maryann Elizabeth Martone
 Synonym: ME Martone, M Martone, Maryann
 Abbreviation: MEM
 Is a
 Has a
 Is that entity which has these properties
ORCID project: Author IDs
[Figure labels: M Martone; Boston VA Hospital; Dept of Psychiatry, UCSD; Nelson Butters; Female; Publications]
Text mining algorithms can discover a lot of things
about me
NIF Semantic Framework: NIFSTD ontology
NIFSTD
Anatomical
Structure
Organism
Molecule Descriptors
Dysfunction
Subcellular
structure
Molecule
Macromolecule
Cell
Gene
NS Function
Techniques
Resource
Reagent
Quality
Investigation
Instrument
Protocols
 NIF covers multiple structural scales and domains of relevance to neuroscience
 Aggregate of community ontologies with some extensions for neuroscience, e.g.,
Gene Ontology, ChEBI, Protein Ontology
 Simple, basic “is a” hierarchies that can be used “as is” or to form the building blocks
for more complex representations
“We studied the behavior of CA2-binding proteins in
Ca2 neurons under high and low Ca2 conditions ”
BioGrid
Allen Brain Atlas
Brain Info
NIF queries across 170+ independent databases
But you don’t have what I need!
•Provide a simple framework for
defining the concepts required
•Cell, Part of brain,
subcellular structure,
molecule
•Community based:
•Communities contribute
their vocabularies
•Reconcile and align
concepts used by different
domains
•Each concept gets its own
unique identifier
•Creating a computable index for
neuroscience data
•INCF
Demo D03
http://neurolex.org
Stephen Larson/INCF
Concept-based search: search by meaning
 Search Google: GABAergic neuron
 Search NIF: GABAergic neuron
 NIF automatically searches for types of
GABAergic neurons
Types of GABAergic
neurons
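One way to picture concept-based search is to expand a class into the cell types that satisfy its defining rule (a GABAergic neuron is a neuron that releases GABA). The sketch below assumes a tiny hand-written table; `RELEASES`, `DEFINED_BY_TRANSMITTER` and `expand_concept` are hypothetical names, and NIF's actual expansion runs over its ontology rather than a dictionary.

```python
# Small illustrative table: cell type -> neurotransmitter released.
RELEASES = {
    "Purkinje cell": "GABA",
    "Cerebellar basket cell": "GABA",
    "Striatal medium spiny neuron": "GABA",
    "Cerebellar granule cell": "glutamate",
}

# Classes defined by a rule rather than by an explicit list of members.
DEFINED_BY_TRANSMITTER = {"GABAergic neuron": "GABA"}


def expand_concept(concept: str) -> list:
    """Return the concept followed by every cell type that satisfies its rule."""
    terms = [concept]
    transmitter = DEFINED_BY_TRANSMITTER.get(concept)
    if transmitter is not None:
        terms += sorted(cell for cell, released in RELEASES.items()
                        if released == transmitter)
    return terms
```

A string search for "GABAergic neuron" only finds records containing those words; the expanded term list also reaches records that mention Purkinje cells without ever using the word GABAergic.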
Esperanto!
 “The trouble is that if I make up all of my own URIs, my [data]
has no meaning to anyone else unless I explain what each URI is
intended to denote or mean. Two [data sets] with no URIs in
common have no information that can be interrelated.”
 NIF favors reuse of identifiers rather than mapping
 NIF imports many ontologies
 Creating ontologies to be used as common building blocks:
modularity, low semantic overhead, is important
 Many community ontologies available covering multiple domains
 NIFSTD available via web services
 Bioportal (http://bioportal.bioontology.org/)
http://www.rdfabout.com/intro/#Introducing%20RDF
NIF Analytics: The Neuroscience Ecosystem
Where are the data?
Data source
Brain region
Brain
Striatum
Hypothalamus
Olfactory bulb
Cerebral cortex
NIF is in a unique position to answer questions about the neuroscience
ecosystem
Vadim Astakhov, Kepler Workflow Engine
Whither neuroscience information?
∞
What is potentially knowable
What is known:
Literature, images, human
knowledge
What is easily machine
processable and accessible
Unstructured;
Natural language
processing, entity
recognition, image
processing and
analysis;
communication
Open world meets closed world
But...NIF has > 900,000 antibodies,
250,000 model organisms, and 3
million microarray records
Query for “reference” brain structures and their parts in NIF Connectivity database
Gender bias
NIF can start to
answer interesting
questions about
neuroscience
research, not just
about neuroscience
NIF Reports:
Male vs Female
What have we learned: Grabbing
the long tail of small data
 Analysis of NIF shows
multiple databases with
similar scope and content
 Many contain partially
overlapping data
 Data “flows” from one
resource to the next
 Data is reinterpreted,
reanalyzed or added to
 Is duplication good or bad?
Embracing duplication: Data Mash ups
•NIF queries across 3 of approximately 10 fMRI databases
•~300 PMIDs were common between Brede and SUMSdb
•PMID serves as a unique identifier for an article
•Same information; value added
Same data; different aspects
Same data: different analysis
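Because the PMID identifies the article in both resources, a mash-up reduces to a join on that key. A minimal sketch with made-up records: the PMIDs, field names and the `mash_up` function are placeholders, not actual Brede or SUMSdb content.

```python
# Made-up records keyed by PMID; the field names are assumptions.
BREDE = {
    "10000001": {"brede_foci": 12},
    "10000002": {"brede_foci": 7},
}
SUMSDB = {
    "10000002": {"sums_surface_files": 3},
    "10000003": {"sums_surface_files": 1},
}


def mash_up(source_a: dict, source_b: dict) -> dict:
    """Merge the records of articles that appear in both sources.

    Field names are assumed not to collide; if they did, the value
    from source_b would overwrite the one from source_a.
    """
    shared = sorted(source_a.keys() & source_b.keys())
    return {pmid: {**source_a[pmid], **source_b[pmid]} for pmid in shared}
```

Only the article present in both toy sources survives the join, and its merged record carries what each resource added.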
Chronic vs acute morphine in striatum
 Gemma: Gene ID + Gene Symbol
 DRG: Gene name + Probe ID
 Gemma presented results relative to baseline chronic
morphine; DRG with respect to saline, so direction of
change is opposite in the 2 databases
 Analysis:
 1370 statements from Gemma regarding gene expression as
a function of chronic morphine
 617 were consistent with DRG, so over half of the claims of
the paper were not confirmed in this analysis
 Results for 1 gene were opposite in DRG and Gemma
 45 did not have enough information provided in the paper to
make a judgment
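Comparing the two resources requires putting both on the same baseline first. The sketch below assumes each statement is reduced to "up" or "down" plus the baseline it was reported against; `relative_to_saline` and `consistent` are hypothetical helpers, and the final lines only redo the arithmetic on the counts given above.

```python
OPPOSITE = {"up": "down", "down": "up"}


def relative_to_saline(direction: str, baseline: str) -> str:
    """Express a direction of change with saline as the reference condition."""
    if baseline == "saline":
        return direction
    if baseline == "chronic morphine":
        # Reported against the treated condition, so the sign flips.
        return OPPOSITE[direction]
    raise ValueError(f"unknown baseline: {baseline}")


def consistent(gemma_direction: str, drg_direction: str) -> bool:
    """True if a Gemma statement and a DRG statement agree once normalized."""
    return (relative_to_saline(gemma_direction, "chronic morphine")
            == relative_to_saline(drg_direction, "saline"))


# Counts taken from the analysis above.
GEMMA_STATEMENTS = 1370
CONSISTENT_WITH_DRG = 617
confirmed_percent = round(100 * CONSISTENT_WITH_DRG / GEMMA_STATEMENTS, 1)
not_confirmed = GEMMA_STATEMENTS - CONSISTENT_WITH_DRG
```

A gene listed as "down" in Gemma and "up" in DRG is therefore consistent, not contradictory. The 617 consistent statements are about 45% of the 1370, which leaves 753 that were not confirmed.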
Taking a global view on data:
microculture to ecosystem
 Several powerful trends should change the way we
think about our data: One → Many
 Many data
 Generation of data is getting easier → shared data
 Data space is getting richer: more -omes every day
 But...compared to the biological space, still sparse
 Many eyes
 Wisdom of crowds
 More than one way to interpret data
 Many algorithms
 Not a single way to analyze data
 Many analytics
 “Signatures” in data may not be directly related to the question for
which they were acquired but tell us something really interesting
Are you exposing or burying your work?
The future of scientific
communication
 We have learned over the years how to write
a scientific paper for other humans to read
and for other agents to index
 We now have to learn how to write papers
for automated agents (and their humans)
to mine
 We have learned over the years to report
data in papers for humans to read
 We now have to learn how to publish data
in a form and on a suitable platform for
automated agents (and their humans) to
mine
Printing press
Linked data cloud
Reporting neuroscience data within a consistent framework helps enormously
Watson
Why does it matter?
47/50 major preclinical
published cancer studies
could not be replicated
 “The scientific community
assumes that the claims in a
preclinical study can be taken
at face value -- that although
there might be some errors in
detail, the main message of
the paper can be relied on and
the data will, for the most
part, stand the test of time.
Unfortunately, this is not
always the case.”
Begley and Ellis, Nature, 29 March 2012, Vol. 483, p. 531
 “There are no guidelines that
require all data sets to be
reported in a paper; often,
original data are removed
during the peer review and
publication process.”
 Getting data out sooner in a
form where they can be exposed
to many eyes and many
analyses, and easily compared,
may allow us to expose errors
and develop better metrics to
evaluate the validity of data
Data, not just stories about them!
Register your resource to NIF!
1
Institutional
repositories
“How do I share my
data?”
Cloud
2
“There is no database
for my data”
3
Community
database:
beginning
4
Community
database:
End
INCF: Global
infrastructure
Education
Industry
Government
NIF is designed to leverage existing investments in resources and infrastructure
It’s a messy ecosystem (and that’s OK)
NIF favors a hybrid, tiered,
federated system
 Domain knowledge
Gene
Organism
Neuron
Brain part
Disease
 Ontologies
 Claims about results
 Virtuoso RDF triples
 Data
 Data federation
 Workflows
 Narrative
 Full text access
Caudate projects to
SNpc
Betz cells
degenerate in ALS
Grm1 is upregulated in
chronic cocaine
Future of Research Communications
and e-Scholarship
 FORCE11: http://force11.org
 Founded by Phil Bourne, Tim
Clark, Ed Hovy, Anita de Waard
and Ivan Herman
 Bring together stakeholders with
an interest in moving scholarly
communication beyond reliance
on papers and traditional impact
metrics
 Beyond the PDF 2: Spring 2013
NIF team (past and present)
Jeff Grethe, UCSD, Co Investigator, Interim PI
Amarnath Gupta, UCSD, Co Investigator
Anita Bandrowski, NIF Project Leader
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen, Washington University
Erin Reid
Paul Sternberg, Cal Tech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli, George Mason University
Sridevi Polavarum
Fahim Imam, NIF Ontology Engineer
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Lee Hornbrook
Binh Ngo
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark, Harvard University
Paolo Ciccarese
Karen Skinner, NIH, Program Officer
Why do we create so many
overlapping products?
“That which I cannot build, I cannot understand”
Science is incremental; we build on the results of others
 Don’t trust any data you haven’t generated
 It’s ingrained in our culture
 Oh, now I see what you are saying
 Scientists know the domain, not informatics
 “Build a better mousetrap and the world will beat down our doors”
Yes, we are planning to do that...
 Little credit for making someone else’s product better
There’s more than one way to skin a cat....
 We are all time and resource constrained
 We are still mastering the medium
 We extend projects in time
 Technology is developing fast
You need to use
ontology
identifiers instead
of strings
Blah, blah,
ontology blah
When I talk to resource providers, neuroscientists (and
journal editors)...