PPT - The National Academies

Giridhar Manepalli
Corporation for National Research Initiatives
Information Types and Registries
Strategies for Discovering Online Data
BRDI Symposium – Feb 26, 2013
Research Data Interoperability
[Figure: Scientists, Data Curators, End Users, and Applications; Enabling Technologies for Discovery, Access, Interpretation, and Reuse; Datasets (each with an ID) Accessed via Repositories]
Research Data Interoperability (cont.)
• Interoperability of research data allows discovery, access, interpretation, and reuse of datasets by researchers
• Examples
  • Discovery: A scientist in the US “discovers” datasets from research in Germany, in a related or even an unrelated domain
  • Reuse: A scientist in the US “re-uses” or processes datasets from the discovered research in Germany
• For interpretation of accessible datasets, Types and Type Registries play a significant role
Information Types – Our Definition
• What they are not:
  • Programmatic data types (string, integer, double, etc.)
  • MIME types as normally used (text/xml, application/rdf)
• Types are identifiers that, with the help of associated metadata, characterize data structures used for managing information
• Data structures could be at multiple levels of granularity
  • From individual observations, to sets of observations within a time series, to multiple time-series sets that explain a phenomenon
• Usually
  • Spread across multiple files (each with a specific MIME type)
  • Distributed on the network (managed by various repositories)
• We call such data structures used for managing information digital objects
• Types (aka type identifiers) are unique across their user base
• Types are associated with machine-readable metadata to support interpretation of information
• CNRI’s focus is to support infrastructure for enabling inter-discipline types
[Figure: Typed Digital Object: a Digital Object made up of Files on the Network, with a Type ID and Machine Readable Metadata]
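The definition above can be sketched as a small data structure. This is only an illustration, not CNRI's data model: the class names, field names, identifiers, and host names are all hypothetical. It shows one digital object made of several files, each with its own MIME type, carrying a single type ID that points at machine-readable metadata.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FileRef:
    """One file belonging to a digital object (hypothetical fields)."""
    location: str   # where the file lives on the network
    mime_type: str  # each file keeps its own ordinary MIME type


@dataclass
class InfoType:
    """A type identifier plus its machine-readable metadata."""
    type_id: str
    metadata: dict = field(default_factory=dict)


@dataclass
class TypedDigitalObject:
    """A digital object: possibly many distributed files, one type ID."""
    object_id: str
    type_id: str
    files: list = field(default_factory=list)


# Made-up type identifier and metadata, for illustration only.
time_series = InfoType(
    type_id="type/example-time-series",
    metadata={"description": "a set of observations within a time series"},
)

obj = TypedDigitalObject(
    object_id="obj/0001",
    type_id=time_series.type_id,
    files=[
        FileRef("repo-a.example.org/readings.csv", "text/csv"),
        FileRef("repo-b.example.org/notes.xml", "text/xml"),
    ],
)

# The object has one information type even though its files differ in MIME type.
mime_types = sorted({f.mime_type for f in obj.files})
print(obj.type_id, mime_types)
```

The information type is attached to the object as a whole, while MIME types stay at the file level, which is the distinction the slide draws.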
Value Proposition of Info. Types
• Typing allows
  • Grouping of digital objects generated at different times and in different domains, for reasoning and for establishing correlations between different types of objects
    • Grouping is fundamental to how humans reason about things
  • Creation of services that can automate information processing based on information types
    • Advanced information processing can be performed to find unforeseen correlations, trends, etc.
    • This kind of advanced processing goes by different names: data-intensive science, fourth paradigm, big-data analytics, etc.
[Figure: Typed Digital Object Collection with Digital Objects of Type A, Type B, and Type C]
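A minimal sketch of the two benefits named above, grouping by type and type-driven services, might look as follows. The object IDs, type IDs, and the single service are invented for the example; a real collection would hold typed digital objects, not bare tuples.

```python
from collections import defaultdict

# (object ID, type ID) pairs; all identifiers are made up for illustration.
objects = [
    ("obj/1", "type/A"),
    ("obj/2", "type/B"),
    ("obj/3", "type/A"),
    ("obj/4", "type/C"),
]


def group_by_type(pairs):
    """Group object IDs under the type ID they were assigned."""
    groups = defaultdict(list)
    for object_id, type_id in pairs:
        groups[type_id].append(object_id)
    return dict(groups)


# Hypothetical services keyed by the type they know how to process.
services = {
    "type/A": lambda ids: f"processed {len(ids)} objects of type A",
}


def process(collection, available_services):
    """Run each group through the service registered for its type, if any."""
    results = {}
    for type_id, ids in collection.items():
        service = available_services.get(type_id)
        if service is not None:
            results[type_id] = service(ids)
    return results


collection = group_by_type(objects)
results = process(collection, services)
print(collection)
print(results)
```

Types with no registered service are simply skipped here; whether to skip, queue, or report them is a design choice the slides do not address.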
Value Proposition of Info. Types
(cont.)
[Figure: a Type Registry and a suite of services (Interaction, Visualization, Rights, Data Set Dissemination, Data Processing) over Digital Objects; arrows numbered 1 to 5 correspond to the steps below]
1. User requests the Type of a Digital Object of interest.
2. The Type ID is returned to the user.
3. User asks the Type Registry for the Type info.
4. Type info, including Services info, is returned to the user.
5. User requests a Service for processing.
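The five steps can be walked through with in-memory stand-ins. This is a sketch under stated assumptions: the dictionaries below replace a real repository, type registry, and service endpoint, and every identifier, key, and function name is hypothetical.

```python
# In-memory stand-ins for a repository, a type registry, and a service suite.
DIGITAL_OBJECTS = {"obj/42": {"type_id": "type/example-time-series"}}

TYPE_REGISTRY = {
    "type/example-time-series": {
        "description": "time-series observations",
        "services": ["visualize"],  # services info carried in the type info
    }
}

SERVICES = {
    "visualize": lambda object_id: f"visualization of {object_id}",
}


def get_type_id(object_id):
    """Steps 1-2: ask the digital object for its type; a type ID comes back."""
    return DIGITAL_OBJECTS[object_id]["type_id"]


def get_type_info(type_id):
    """Steps 3-4: ask the type registry for type info, including services."""
    return TYPE_REGISTRY[type_id]


def request_service(service_name, object_id):
    """Step 5: ask one of the listed services to process the object."""
    return SERVICES[service_name](object_id)


type_id = get_type_id("obj/42")
type_info = get_type_info(type_id)
result = request_service(type_info["services"][0], "obj/42")
print(result)
```

The user never needs to know in advance which services apply; the type ID is the only link between the object and the services that can process it.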
Info. Typing Challenges
• Challenge: When are two digital objects assigned the same type?
  • When the bit-level encoding matches?
  • Or when the higher-level structures and intent match?
  • If two observations are made by two similar instruments at the same time on the same entity, would the data generated by those two observations be considered to be of the same type?
  • Even if the data generated by each observation, though similar in concept, has a different format (e.g., JPEG vs. PNG)?
• Our approach:
  • Intent wins over optics (formats, encodings, etc.)
  • The metadata associated with the type could list possible formats, encodings, etc.
• Alternative approach:
  • Establish a base type and then sub-types to accommodate variations
  • Our experience was that it was too cumbersome to deal with multiple formats and encodings at the type definition level
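One way to picture "intent wins over optics" is a single type whose metadata lists the acceptable formats, so that a JPEG and a PNG of the same kind of observation receive the same type. The sketch below is an assumption-laden illustration: the `InfoType` fields, the intent string, and the matching rule are invented, and treating an unlisted format as "no match" is just one possible policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InfoType:
    type_id: str
    intent: str         # what the data is meant to represent
    formats: frozenset  # formats/encodings listed in the type's metadata


# Made-up type: one intent, several listed formats.
OBSERVATION_IMAGE = InfoType(
    type_id="type/example-observation-image",
    intent="image of one entity captured by an instrument",
    formats=frozenset({"image/jpeg", "image/png"}),
)


def assign_type(intent, mime_type, known_types):
    """Match on intent; the format only has to be listed in the metadata."""
    for candidate in known_types:
        if candidate.intent == intent and mime_type in candidate.formats:
            return candidate.type_id
    return None


jpeg_type = assign_type(OBSERVATION_IMAGE.intent, "image/jpeg", [OBSERVATION_IMAGE])
png_type = assign_type(OBSERVATION_IMAGE.intent, "image/png", [OBSERVATION_IMAGE])
print(jpeg_type == png_type)
```

Under the alternative approach, JPEG and PNG would each need a sub-type of a shared base type, which is the bookkeeping the slide describes as cumbersome.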
Info. Typing Challenges (cont.)
• Challenge: Can the same digital object be assigned multiple types?
  • If so, how do we deal with duplicate types?
  • If not, how do we manage multiple types assigned by several domains?
• Our approach:
  • An object is assigned an inter-discipline type only once.
  • Any domain-specific types are listed in its metadata
[Figure: a Typed Digital Object Collection with an Inter-discipline Type (Type I) and its Machine-readable Metadata listing Type α and Type β; a Biologist and a Computer Scientist are shown alongside Type α and Type β]
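The approach above can be sketched as a one-time assignment plus a metadata lookup. Everything here is hypothetical, including the class, the identifiers, and in particular the pairing of domains with "type/alpha" and "type/beta", which the slide does not specify; the sketch also assumes the domain-specific types are listed in the metadata of the inter-discipline type.

```python
class TypeAssignments:
    """Tracks the single inter-discipline type assigned to each object."""

    def __init__(self):
        self._by_object = {}

    def assign(self, object_id, type_id):
        # An object gets an inter-discipline type only once.
        if object_id in self._by_object:
            raise ValueError(f"{object_id} already has an inter-discipline type")
        self._by_object[object_id] = type_id

    def type_of(self, object_id):
        return self._by_object[object_id]


# Metadata of one inter-discipline type, listing domain-specific types.
# The domain-to-type pairing is invented for this example.
TYPE_METADATA = {
    "type/I": {
        "domain_types": {
            "biology": "type/alpha",
            "computer-science": "type/beta",
        }
    }
}


def domain_type(assignments, object_id, domain):
    """Find a domain's own type via the inter-discipline type's metadata."""
    metadata = TYPE_METADATA[assignments.type_of(object_id)]
    return metadata["domain_types"].get(domain)


assignments = TypeAssignments()
assignments.assign("obj/7", "type/I")
try:
    assignments.assign("obj/7", "type/II")
    second_assignment_rejected = False
except ValueError:
    second_assignment_rejected = True

print(domain_type(assignments, "obj/7", "biology"))
```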
Info. Typing Challenges (cont.)
• Challenge: How can existing information be typed under this new scheme?
  • A lot of information exists already
• One approach:
  • Start with domain-specific types, if any, generate domain-neutral types, and list the domain-specific types in their metadata records
Info. Types – Machine-readable Metadata
• Machine-readable metadata for Info. Types is still an area of research for us
• Type interdependence
  • It is clear that sub-typing is needed for building on previously defined types
  • Our experience shows that sub-typing based on variations in formats and encodings is a cumbersome process
  • Instead, an exhaustive list of possible formats and encodings may be specified in the metadata
• Domain-specific Types
  • Cross-domain Types could list or point at domain-specific types, which could be multiple for a given object and which might define detailed semantics for interpretation
• Metadata for automated interpretation
  • For the few types of information we prototyped, defining metadata that helps services process datasets is open-ended and sometimes impractical
  • A parsing language or pseudocode that transforms datasets into domain-specific ontologies or semantics may be captured instead
Info. Type Registries
• Info. Type Registries are metadata registries that
  • Support recording of information types and associated metadata records
  • Perform federation across other registries
  • De-duplicate (or match types) to control registration requests for existing types
  • Include a manual moderation and/or crowdsourcing function for spotting redundant registrations (optional)
• Cross-domain Type Registries may optionally link to domain-specific Type Registries
• Type Registries may manage or reference services that process information of certain types
• CNRI has extensive experience building metadata registries
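The registry functions listed above (recording, de-duplication, federation) can be illustrated with a toy in-memory registry. This is a sketch, not the design of any CNRI registry: the class and method names are invented, exact metadata equality stands in for real type matching, and federation is assumed to be a simple acyclic chain of peer registries.

```python
class TypeRegistry:
    """Toy metadata registry with de-duplication and federated lookup."""

    def __init__(self, name, peers=()):
        self.name = name
        self.records = {}         # type ID -> metadata record
        self.peers = list(peers)  # other registries to consult (no cycles)

    def register(self, type_id, metadata):
        """Record a type, or return the ID of an equivalent existing type."""
        for existing_id, existing in self.records.items():
            if existing == metadata:  # crude stand-in for type matching
                return existing_id
        if type_id in self.records:
            raise ValueError(f"{type_id} is already registered")
        self.records[type_id] = metadata
        return type_id

    def lookup(self, type_id):
        """Look locally first, then ask federated peers."""
        if type_id in self.records:
            return self.records[type_id]
        for peer in self.peers:
            found = peer.lookup(type_id)
            if found is not None:
                return found
        return None


# A domain-specific registry linked from a cross-domain registry.
domain = TypeRegistry("domain")
domain.register("type/alpha", {"description": "alpha"})

cross = TypeRegistry("cross", peers=[domain])
first = cross.register("type/I", {"description": "x"})
duplicate = cross.register("type/I-again", {"description": "x"})

print(first, duplicate, cross.lookup("type/alpha"))
```

Returning the existing ID for a redundant request is one way to "control registration requests for existing types"; a moderated registry might instead queue the request for human review.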
Next Steps
• Received Sloan Foundation funding to research Type Registries within scientific and financial communities
• CNRI employees lead and participate in a Type Registry working group within the Research Data Alliance
• The technical goal is to define the scope of ‘Information Type’ through the aforementioned projects, and to build and release an open-source Type Registry in the next 18 months.