Open data

How can Big Data contribute
to the Open Data process?
DANE Big Data Event, Bogotá, October 2013
Presentation contents
1. What is “Big Data”?
2. Big Data sources, challenges & opportunities
3. Big Data and official statistics
4. Current Big Data initiatives
5. What is “Open Data”?
6. Current Open Data initiatives
7. Incorporating Big Data into official statistics Open Data
programmes - the OECD perspective
DANE, Big Data, October 2013
BIG DATA
What is “Big Data”?
Big data are data sources that can generally be described as: “high
volume, velocity and variety of data that demand cost-effective,
innovative forms of processing for enhanced insight and decision
making.” (Gartner)
• Big data is characterized as data sets of increasing volume, velocity and
variety
• Big data is often largely unstructured, meaning that it has no predefined data model and/or does not fit well into conventional relational
databases
• The private sector may take advantage of the Big data era and produce more
and more statistics that attempt to beat official statistics on timeliness
and relevance
What is “Big Data”? The data deluge
New wealth of digital data: 90% of the world’s digital data was created in
just the last two years, and the volume is doubling every 20 months.
Big data comes in a number of forms:
• “Data Exhaust” collected passively from devices (phones, credit cards, web
searches etc) as sensors of human behaviour
• Online information (blogs, tweets, news articles...) as sensors of human
sentiment
• Physical sensors (pollution, light emission etc) remote sensors of human
activity
• Citizen reporting – information actively produced via phone-surveys,
hotlines etc
Big Data - Sources
• Administrative (electronic medical records,
hospital visits, insurance records, bank records,
food banks, etc.)
• Commercial/Transactional (credit card
transactions, on-line transactions, etc.)
• Sensors (satellite imaging, road sensors,
climate sensors, etc.)
• Tracking devices (mobile telephones, GPS,
etc.)
• Behavioural (online searches, online page
view, etc.)
• Opinion (comments on social media, etc.)
Big Data - Challenges
• Legislative - with respect to the access and use
of data.
• Privacy - managing public trust and acceptance
of data re-use and its link to other sources.
• Financial - potential costs of sourcing data vs.
benefits.
• Management - policies and directives about the
management and protection of the data.
• Methodological - data quality and suitability of
statistical methods.
• Technological - issues related to information
technology.
Big Data - Opportunities
• Collecting data in real time or near real time maximizes
the potential of data
• Big data has potential as an input for official
statistics, either for use on its own or in combination
with more traditional data sources such as sample
surveys and administrative registers
• Big data has the potential to produce more relevant
and timely statistics than traditional sources of official
statistics
• By incorporating relevant Big data sources into their
official statistics process NSOs are best positioned to
measure their accuracy
Big Data Opportunities - areas for
experimentation
1. NSO as brokers of Big Data?
2. NSO to provide “Quality Stamp”?
3. Combining Big data with official statistics
4. Replacing official statistics by Big data
5. Filling new data gaps, i.e. developing new 'Big data-based'
measurements to address emerging phenomena (not known
in advance or for which traditional approaches are not
feasible)
6. Visualization methods
7. Text mining
8. High Performance Computing
Big Data and Official Statistics
“What does Big Data mean for
official statistics?”
Big Data & Official Statistics
Statistical organisations are encouraged to
formally address Big data issues in their annual
and multi-annual work programmes by:
• undertaking research and pilot projects in
selected areas
• allocating appropriate resources for that
purpose.
Big Data & Official Statistics
• Collaboration of NSOs with private data source owners is
of critical importance and it touches upon sensitive issues such
as privacy, trust and corporate competitiveness, as well as the
legislation framework of the NSOs.
• Using Big data requires statisticians with a different
mind-set and new skills. The processing of more and more data
for official statistics requires statistically aware people with an
analytical mind-set, an affinity for IT and a determination to extract
valuable ‘knowledge’ from data: “Data scientists” (Quote stats from
US)
• NSOs should develop the necessary internal analytical capability
through specialised training.
Big Data – Examples
Example: Twitter. Using an algorithm that could
tell the difference between actual sickness
and casual usage of the common word ‘sick’,
researchers were able to plot and predict
when people from a certain area were at
risk of picking up a flu bug
Big Data – Examples (OECD)
Real Estate
• Collect real time real estate data, incl. location, product characteristics and price
information
• Gather data by extracting information from ads on real estate sites in major agglomerations
• Collection mechanism: Search engine with semantic capability collects and structures data
• Data analysis with existing statistical tools to produce aggregated indicators
• Enrich the GOV metropolitan database with the compiled indicators
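As a rough illustration of the "aggregated indicators" step for the real estate example, the sketch below computes a median price per square metre by city from already-structured ads. The field names (`city`, `area_m2`, `price`), the function name and the sample values are hypothetical and are not taken from the OECD project.

```python
from collections import defaultdict
from statistics import median

# Hypothetical structured ads, as a collector might emit them.
# All values are made up for illustration.
ads = [
    {"city": "City A", "area_m2": 80, "price": 240000},
    {"city": "City A", "area_m2": 50, "price": 175000},
    {"city": "City B", "area_m2": 100, "price": 250000},
]

def median_price_per_m2(ads):
    """Return the median price per square metre for each city."""
    by_city = defaultdict(list)
    for ad in ads:
        # Skip ads with a missing or zero area to avoid dividing by zero.
        if ad["area_m2"] > 0:
            by_city[ad["city"]].append(ad["price"] / ad["area_m2"])
    return {city: median(values) for city, values in by_city.items()}

indicators = median_price_per_m2(ads)
print(indicators)
```

The median is used here rather than the mean because ad prices tend to contain outliers; that is a design choice of this sketch, not something stated on the slide.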
Traffic
• Collect a sample of data tracking movements of mobile users over a territory
• Compile from that data and create transportation performance indicators (especially:
reliability)
• Including evolution over time
• Collection mechanism: Sample of #100K users across several countries who downloaded
an App on mobile access quality
• Data analysis with existing statistical tools to produce aggregated indicators
Big Data – Examples (OECD)
Political tension
• Produce indicators on Political tension for African Economic Outlook
• Collect a sample of qualitative data (articles, …) qualifying political tension in African
countries. Based on keywords: strike, demonstration, kidnapping,…
• Text mining on countries or topics could raise interest as part of the data collection
process
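A minimal sketch of the keyword-based approach for the political tension indicator, assuming articles are already available as plain text. The three keywords come from the slide; the function name, the naive plural handling and the sample sentence are illustrative assumptions only.

```python
import re
from collections import Counter

# Keywords listed on the slide.
KEYWORDS = ("strike", "demonstration", "kidnapping")

def keyword_counts(text, keywords=KEYWORDS):
    """Count occurrences of each keyword, including a simple '-s' plural."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    # Naive matching: the exact word plus its plain plural form.
    return {k: counts[k] + counts[k + "s"] for k in keywords}

# Made-up sample text, for illustration only.
sample = "A general strike began on Monday. Two demonstrations followed the strike."
result = keyword_counts(sample)
print(result)
```

Raw counts like these would still need normalising (for example per article or per period) before being used as an indicator.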
Employment
• Supplement survey data on employment with job offerings and applications collected from
the Internet
• Building indicators by analysing legal documents (labour codes, labour legislation, court
judgements, ….)
Internet
• Quality of Internet network infrastructure and security
• Ranking of languages most used on the Internet
• Study the effectiveness of current intellectual property protection laws
Big Data – Examples (OECD)
• Objectives
– Collect a sample of data tracking movements of mobile users over a
territory
– Compile from that data and create transportation performance
indicators (especially: reliability)
– Including evolution over time
• Proof of concept envisaged
– Solution provider identified (Sensorly, start-up specialised in mobile
data)
– Privacy should not be an issue since Sensorly collects data based on an
opt-in mechanism and does not rely on mobile operators data.
– Collection mechanism: Sample of #100K users across several countries
who downloaded an App on mobile access quality
– Data analysis with existing statistical tools to produce aggregated
indicators
Big Data – Examples (MIT)
Big Data – Examples (Health data)
Any questions?
OPEN DATA
What is “Open Data”?
From Wikipedia:
• Open data is the idea that data should
be freely available to everyone to use and
republish as they wish, without
restrictions from copyright, patents or
other mechanisms of control.
What is “Open Data”?
From the Open Data Handbook:
Open data, as defined by the Open
Definition:
Open data is data that can be freely used,
reused and redistributed by anyone.
What is “Open Data”?
From the OECD Open Data Project:
Definition of ‘Open’ from the 2011 OECD
Publishing Review:
To make OECD data machine-readable,
retrievable, indexable and re-usable
Open Data: Ten Principles for Opening Up
Government Information (Sunlight Foundation)
1. Completeness: Datasets released by the government should be as complete as possible,
reflecting the entirety of what is recorded about a particular subject. Metadata that defines
and explains the raw data should be included as well, along with formulas and explanations
for how derived data was calculated.
2. Primacy: Datasets released should be primary source data. This includes the original
information collected, details on how the data was collected and the original source
documents recording the collection of the data.
3. Timeliness: Datasets should be available to the public in a timely fashion. Whenever
feasible, information collected should be released as quickly as it is gathered and collected.
4. Ease of Physical and Electronic Access: Datasets should be as accessible as possible.
There should be no barriers such as completing forms or submitting requests or systems
that require browser-oriented technologies (e.g., Flash, Javascript, cookies or Java applets).
5. Machine readability: Information should be stored in widely-used file formats that easily
lend themselves to machine processing. These files should be accompanied by
documentation related to the format and how to use it in relation to the data.
Open Data: Ten Principles for Opening Up
Government Information (Sunlight Foundation)
6. Non-discrimination: Barriers to use of data can include registration or membership
requirements. Any person should be able to access the data at any time without having
to identify him/herself or provide any justification for doing so.
7. Use of Commonly Owned Standards: Data should be stored in freely available formats
that can be accessed without the need for a software license, making the data
available to a wider pool of potential users.
8. Licensing: Maximal openness means making data available without restrictions on use as
part of the public domain.
9. Permanence: Information should be available online in archives in perpetuity. Data
should remain online, with appropriate version-tracking and archiving over time.
10. Usage Costs: Data should be available free of charge.
Examples of open data
initiatives
Open Data examples – Data.Gov.uk
Open Data examples – Data.Gov
Open Data examples – World Bank
OECD Open Web Services
Incorporating Big Data into official statistics Open Data
programmes - the OECD perspective
Introducing the OECD DELTA Programme….
DELTA Programme – Making OECD
data Open, Accessible, Free
• Open: Machine-readable, Indexable, Re-usable
• Accessible: Find, Understand, Use
• Free: Available without charge
The Open Data project - goals
• To make OECD data machine-readable,
retrievable, indexable and re-usable.
• To increase the dissemination and impact of
OECD data via open data services for OECD
statistical data
• To encourage re-use of OECD data, and re-use by the OECD of external
innovation, via an open innovation process and communities.
The Open Data project - scope
Data content
• All datasets within the OECD.Stat data
warehouse with the standardised structural
format and content necessary for machine-to-machine “Open” access.
Open data web services – Data Formats
i) SDMX/JSON
JavaScript Object Notation (JSON) is a text-based open standard
designed for human-readable data interchange.
It is a widely used open data format on web sites today. JSON has a
number of advantages, including:
• Simplicity - simple and ‘lightweight’ format with a smaller
grammar and can map directly onto the data structures used in
today’s programming languages.
• Interoperability - has the same interoperability potential as
XML.
• Openness - has the same open capabilities as XML
• Readability - is much easier for humans to read than XML. It is
easier to write, and easier for machines to read and write.
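To illustrate the "maps directly onto data structures" point, here is a small sketch using Python's standard `json` module. The payload is a made-up, simplified structure and is not the actual SDMX/JSON message layout; the field names are assumptions for illustration.

```python
import json

# Hypothetical, simplified payload (NOT the real SDMX/JSON layout).
payload = (
    '{"indicator": "example", "observations": ['
    '{"period": "2012", "value": 1.5}, '
    '{"period": "2013", "value": 2.0}]}'
)

# Parsing yields plain dicts, lists, strings and numbers.
data = json.loads(payload)
values = [obs["value"] for obs in data["observations"]]

# Serialising the same structure back to text is a single call.
round_trip = json.dumps(data, sort_keys=True)
print(values)
print(round_trip)
```

No schema or dedicated parser is needed to reach the values, which is the simplicity the slide refers to.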
Open data web services – Data Formats
ii) Excel/CSV
Excel and CSV are already widely used
exchange standards so including them as
output formats was a fairly obvious decision.
iii) Open Data (OData)
OData is an open protocol for sharing data
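For the Excel/CSV output option, a minimal sketch of how tabular observations could be written to and read back from CSV with Python's standard `csv` module. The column names, the placeholder country codes and the values are hypothetical and do not reflect the actual OECD export layout.

```python
import csv
import io

# Hypothetical observations with placeholder codes and made-up values.
rows = [
    {"country": "AAA", "year": 2012, "value": 1.5},
    {"country": "BBB", "year": 2012, "value": 2.0},
]

def to_csv(rows, fieldnames):
    """Serialise a list of dicts to CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

csv_text = to_csv(rows, ["country", "year", "value"])

# Reading it back: every field comes back as a string.
parsed = list(csv.DictReader(io.StringIO(csv_text)))
print(csv_text)
```

Unlike JSON, CSV carries no type information, so consumers have to convert numeric columns themselves.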
Open data web services – Data Formats
iv) Future formats
could include Google Data (a REST-inspired
technology), Google Dataset Publishing
Language (DSPL) or Google KML, a geospatial
file format.
Any questions?