Open Data Repositories and Big Data

ARD Prasad
DRTC, Indian Statistical Institute
Bangalore
Our Interest
We are working on (semantics of Big Data):
• Structured Data
• Hosting data repositories
• Metadata of Big Data
• Ontology
• Linked Open Data (LOD)
Presently NOT working on:
• Unstructured Data (especially on social networks)
• Data Analytics
Semantic Web
The Semantic Web can be realized only when the
web provides answers to queries rather than web
pages
Scope

The Semantic Web is a Web of Data ... The
collection of Semantic Web technologies
(RDF, OWL, SKOS, SPARQL, etc.) provides
an environment where applications can query
that data, draw inferences using vocabularies,
etc.

Linked data (http://www.w3.org/standards/semanticweb/data)
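
To make "query that data" concrete, here is a minimal sketch in plain Python of how a single SPARQL-style triple pattern is matched against RDF-like facts. It is not a real RDF store: the `ex:` identifiers, the `TRIPLES` list and the `match` helper are all illustrative assumptions, not part of any library.

```python
# Toy triple store: each fact is a (subject, predicate, object) tuple,
# mirroring the RDF data model. All "ex:" identifiers are made up.
TRIPLES = [
    ("ex:DRTC", "ex:locatedIn", "ex:Bangalore"),
    ("ex:DRTC", "ex:partOf", "ex:ISI"),
    ("ex:Bangalore", "ex:locatedIn", "ex:India"),
]


def match(pattern, triples=TRIPLES):
    """Return one dict of variable bindings per triple matching the pattern.

    Terms starting with '?' are variables (as in SPARQL); any other term
    must be equal to the corresponding part of the triple.
    """
    results = []
    for triple in triples:
        bindings = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable used twice must bind to the same value.
                if bindings.get(term, value) != value:
                    break
                bindings[term] = value
            elif term != value:
                break
        else:
            results.append(bindings)
    return results


# "Where is DRTC located?" expressed as a pattern with one variable.
answers = match(("ex:DRTC", "ex:locatedIn", "?place"))
print(answers)  # [{'?place': 'ex:Bangalore'}]
```

The point of the sketch is that the reply is a set of bindings (an answer), not a list of pages to read.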
Metadata
• Of books/documents (Dublin Core, etc.)
• Products
• Events
• Processes
• Organizations
• Individuals (FOAF)
• Keywords (Ontology)
• Preservation (PREMIS)

DATA
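
As an illustration of document metadata, the following sketch builds a small Dublin Core description using only the Python standard library. The namespace URI is the Dublin Core element set namespace; the wrapping `record` element, the helper name `dublin_core_record` and the chosen field values are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields):
    """Build a <record> element with one dc:* child per field.

    The <record> wrapper is a hypothetical container chosen for this
    sketch; real repositories use their own envelope formats.
    """
    record = ET.Element("record")
    for name, value in fields.items():
        child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        child.text = value
    return record


record = dublin_core_record({
    "title": "Open Data Repositories and Big Data",
    "creator": "ARD Prasad",
    "subject": "Open data",
})
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```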
Looking Back
Open Access to Information (OAI)
• A fairly successful movement, which resulted in:
• Open Access Repositories (> 2000)
• Open Access Journals (> 5000)
• Partial bridging of the digital divide in the Social, Physical and
Natural Sciences and the Humanities
Nature of Publications
Many publications use data. The published article may
not include the complete data used:
• For lack of space
• The author might have overlooked the data
• The author may have deliberately withheld the data so that others cannot verify it
For Example
Some suspect that Sigmund Freud's data describes
fictitious persons, not just fictitious names
If data is available ...
• Others may draw different conclusions
contradictory to that of the author
• Others may deal with other facets of the data
• Data Transparency supplements the Objectivity
and self corrective characteristics of Science
If “Case history of patients” is openly available, it
will contribute significantly to medical research
Digital Divide
• Social Sciences do not require laboratory
infrastructure
• However, physical and natural sciences do
require expensive infrastructure
• If experimental data is available to scientists who
do not have such infrastructure, it will significantly
reduce the digital divide in the Physical and Natural
Sciences
ODA is a step toward transparency and quality
in science
For Example
• Human Genome data
• Data from Accelerator Labs (CERN)
• Recent controversy about particles moving faster
than light
• Not surprisingly, astronomy data was openly
available even before the OA movement
Features of Open Data Repositories
• Metadata: specify who the owner, creator, etc. are
• Licence: license the data, waiving your rights, to facilitate
bulk download of open data
• Technology Tools: automate data extraction
• Ontology: Index data
Licences
Creative Commons licenses (apart from CCZero),
GPL, BSD, etc. are NOT quite appropriate as
open data licences
Open Data Licences
• Open Data Commons Public Domain Dedication and Licence (PDDL)
– Dedicate to the Public Domain (all rights waived)
• Open Data Commons Attribution License
– Attribution for data(bases)
• Open Data Commons Open Database License (ODbL)
– Attribution-ShareAlike for data(bases)
• Creative Commons CCZero
– Dedicate to the Public Domain (all rights waived)
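
The four licences above differ only in whether they require attribution and share-alike. A minimal sketch of that comparison as data follows; the `Licence` class, the short names and the helper `public_domain_dedications` are hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Licence:
    name: str
    attribution: bool   # must reusers credit the source?
    share_alike: bool   # must derived databases use the same licence?


# Conditions as summarised on the slide above.
LICENCES = [
    Licence("ODC PDDL", attribution=False, share_alike=False),
    Licence("ODC Attribution", attribution=True, share_alike=False),
    Licence("ODC ODbL", attribution=True, share_alike=True),
    Licence("CC0", attribution=False, share_alike=False),
]


def public_domain_dedications(licences=LICENCES):
    """Names of licences that waive all rights (no conditions at all)."""
    return [l.name for l in licences
            if not l.attribution and not l.share_alike]


print(public_domain_dedications())  # ['ODC PDDL', 'CC0']
```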
Amazon Web Services (AWS)
Public Data Sets on AWS
• Annotated Human Genome Data provided by ENSEMBL
– The Ensembl project produces genome databases for human
as well as almost 50 other species, and makes this
information freely available.
• Various US Census Databases from The US Census Bureau
– Demographic data
– US Censuses
– Summary information about Business and Industry
– Economic Household Profile Data.
• UniGene provided by the National Center for Biotechnology
Information
Data Repositories by Governments

Many countries host their data on data.gov-style
portals in various formats such as RDF, XLS, JSON,
CSV, etc.
Ex:
• www.data.gov.in (India)
• www.data.gov (USA)
• www.data.gov.au (Australia)
• www.data.gov.uk (UK)
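
Because the same dataset is often offered in several formats, converting between them is a routine step. The sketch below turns CSV into JSON with the Python standard library; the column names and values in `CSV_TEXT` are invented placeholders, not data from any of the portals listed.

```python
import csv
import io
import json

# Placeholder data, invented for this example.
CSV_TEXT = """region,year,households
region_a,2011,1200
region_b,2011,950
"""


def csv_to_json(text):
    """Parse CSV text (first row = header) and return a JSON array string.

    csv.DictReader yields every value as a string; numeric conversion is
    left to the consumer because it depends on the dataset's schema.
    """
    rows = list(csv.DictReader(io.StringIO(text)))
    return json.dumps(rows, indent=2)


json_text = csv_to_json(CSV_TEXT)
print(json_text)
```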
Registry of Data Repositories
Popular data registries: Databib and re3data.org
Databib connects to 978 data repositories and databases
(agriculture, geosciences, social sciences, biological sciences)
re3data.org currently lists 634 research data repositories from
different disciplines; 586 of these are described in detail
using the re3data.org schema.
In future, Databib and re3data.org are likely to be merged into
one service.
Note: Registry entries provide the URL of the data repository and
a brief description of it. One has to visit the repository and
download the data manually. Again, there is no
protocol to expose the metadata of data providers.
Digital Curation
• Collecting verifiable digital assets
• Providing digital asset search and retrieval
• Certification of the trustworthiness and integrity
of the collection content
• Semantic and ontological continuity and
comparability of the collection content
• Use of open standards (formats) for long-term
preservation and future-proofing by migration of
data
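
One common way to support the integrity point above is a fixity check: record a checksum when an asset is ingested and recompute it later. The sketch below uses SHA-256 from the Python standard library; the function names and the sample bytes are assumptions for illustration, and a real curation workflow would also record when and by whom each check was made.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Return the SHA-256 digest of the content as a hex string."""
    return hashlib.sha256(content).hexdigest()


def is_intact(content: bytes, recorded_digest: str) -> bool:
    """True if the content still matches the digest recorded at ingest."""
    return fingerprint(content) == recorded_digest


# Invented sample asset; in practice this would be read from a file.
original = b"station,rainfall_mm\nS1,12.5\n"
recorded = fingerprint(original)          # stored alongside the asset

tampered = original.replace(b"12.5", b"21.5")
print(is_intact(original, recorded))      # True
print(is_intact(tampered, recorded))      # False
```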
Technology
• Data repositories are much larger than OA
repositories
• Cloud Computing is a good solution (as used by AWS)
• Hadoop
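
Hadoop is built around the MapReduce programming model. The following single-process sketch shows the three phases (map, shuffle, reduce) on a word count; it does not use Hadoop itself, and the function names are chosen for this example only.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values collected for one key."""
    return key, sum(values)


def word_count(records):
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["open data", "big data", "open access"])
print(counts)  # {'open': 2, 'data': 2, 'big': 1, 'access': 1}
```

In a real cluster the map and reduce calls run on many machines in parallel, which is what makes the model suitable for repositories too large for one server.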
Resource Description in terms of Metadata and Ontology
• RDF: Resource Description Framework
• SKOS: Simple Knowledge Organization System
• OWL: Web Ontology Language
• SPARQL: SPARQL Protocol and RDF Query Language
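
A small example of drawing an inference from a vocabulary: SKOS links a concept to a broader one with skos:broader, and following those links repeatedly gives every ancestor (what SKOS calls skos:broaderTransitive). The sketch below does this in plain Python. The concept names are invented, and it assumes each concept has at most one broader concept, which SKOS itself does not require.

```python
# concept -> its broader concept (a simplified, single-parent view of
# skos:broader). All "ex:" concepts are invented for illustration.
BROADER = {
    "ex:Rice": "ex:Cereals",
    "ex:Cereals": "ex:Crops",
    "ex:Crops": "ex:Agriculture",
}


def broader_transitive(concept, broader=BROADER):
    """Return all ancestors of a concept, nearest first."""
    ancestors = []
    seen = {concept}
    while concept in broader:
        concept = broader[concept]
        if concept in seen:      # guard against cycles in bad data
            break
        seen.add(concept)
        ancestors.append(concept)
    return ancestors


print(broader_transitive("ex:Rice"))
# ['ex:Cereals', 'ex:Crops', 'ex:Agriculture']
```

With this, a search for "ex:Crops" can also return data indexed only under "ex:Rice", even though nobody stated that relationship directly.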
NoSQL DBMS
• Key/Value Based
– Redis, MemcacheDB, etc.
• Column Based
– Cassandra, HBase, etc.
• Document Based
– MongoDB, Couchbase, etc.
• Graph Based
– AllegroGraph, Neo4j, etc.
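
To show how the four families differ, the sketch below stores the same invented dataset description in each data model using plain Python structures. It imitates the shapes only; it does not use the client API of any of the products named above.

```python
import json

# Key/value: an opaque value looked up by a single key.
key_value = {
    "dataset:1": json.dumps({"title": "Census sample", "format": "CSV"}),
}

# Column-based: row key -> column family -> column -> value.
column_family = {
    "dataset:1": {
        "descriptive": {"title": "Census sample"},
        "technical": {"format": "CSV"},
    },
}

# Document-based: a self-describing record that may be nested.
documents = [
    {"_id": "dataset:1", "title": "Census sample",
     "distribution": {"format": "CSV"}},
]

# Graph-based: nodes plus labelled edges between them.
nodes = {"dataset:1": {"title": "Census sample"}, "format:csv": {"name": "CSV"}}
edges = [("dataset:1", "hasFormat", "format:csv")]


def format_from_graph(dataset_id):
    """Follow the hasFormat edge from a dataset node to its format node."""
    for source, label, target in edges:
        if source == dataset_id and label == "hasFormat":
            return nodes[target]["name"]
    return None


# The same question ("which format?") answered from each model.
formats = [
    json.loads(key_value["dataset:1"])["format"],
    column_family["dataset:1"]["technical"]["format"],
    documents[0]["distribution"]["format"],
    format_from_graph("dataset:1"),
]
print(formats)  # ['CSV', 'CSV', 'CSV', 'CSV']
```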
DBpedia Data Set
• Multi-domain ontology derived from Wikipedia
• 3.77 million “things” (entities - Entitypedia)
• 400 million “facts”
• Uses YAGO (Yet Another Great Ontology)
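
DBpedia can be queried with SPARQL over HTTP. The sketch below only builds the request URL and sends nothing, so it runs offline. The endpoint address and the example query (a few English city labels) are assumptions chosen for illustration; the `query` parameter name comes from the SPARQL Protocol.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed public endpoint; check the DBpedia site for the current address.
ENDPOINT = "https://dbpedia.org/sparql"

# Example query: up to five cities with their English labels.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?city ?label WHERE {
  ?city a dbo:City ;
        rdfs:label ?label .
  FILTER (lang(?label) = "en")
} LIMIT 5
"""


def build_request_url(query, endpoint=ENDPOINT):
    """Return a GET URL carrying the query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


url = build_request_url(QUERY)

# Round-trip check: decoding the URL gives back the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded == QUERY)  # True
```

Sending the URL (for example with `urllib.request.urlopen`) and choosing a result format through the `Accept` header are left out deliberately, since they depend on network access and server configuration.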
Entitypedia
• Multilingual controlled vocabulary
• Entity matching
• Data quality and type checking
• Entity type specific services
• Semantic or faceted search and navigation on entities
• Summarization of entities and concepts
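
Entity matching means deciding whether two records refer to the same real-world thing. The sketch below is a deliberately simple name-based approach using the Python standard library; it is not Entitypedia's method, and the `normalize` and `same_entity` helpers and the 0.85 threshold are assumptions for illustration.

```python
import difflib
import unicodedata


def normalize(name):
    """Lower-case, strip accents and collapse whitespace in a name."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(without_accents.casefold().split())


def same_entity(a, b, threshold=0.85):
    """Guess whether two names denote the same entity.

    Compares normalized names by string similarity only; a real matcher
    would also compare entity types and other attributes.
    """
    ratio = difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return ratio >= threshold


print(same_entity("São Paulo", "sao  paulo"))  # True
print(same_entity("Bangalore", "Trento"))      # False
```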
DRTC Projects
• Living Knowledge (European Commission funded
project on semantic web based on SRR's Analytico-
Synthetic Classification; completed)
• ITPAR: India-Trento Program for Advanced Research
(Govts. of India & Italy; work on DERA; ongoing)
• AgInfra: agriculture data (European Commission
funded; ongoing)
Immediate Plan

An international workshop on Big Data with
ICSU/CODATA
Thank You