How can Big Data contribute to the Open Data process?
DANE Big Data Event, Bogotá, October 2013

Presentation contents
1. What is “Big Data”?
2. Big Data sources, challenges & opportunities
3. Big Data and official statistics
4. Current Big Data initiatives
5. What is “Open Data”?
6. Current Open Data initiatives
7. Incorporating Big Data into official statistics Open Data programmes - the OECD perspective

BIG DATA

What is “Big Data”?
Big data are data sources that can be generally described as: “high volume, velocity and variety of data that demand cost-effective, innovative forms of processing for enhanced insight and decision making.” (Gartner)
• Big data is characterized as data sets of increasing volume, velocity and variety.
• Big data is often largely unstructured, meaning that it has no predefined data model and/or does not fit well into conventional relational databases.
• The private sector may take advantage of the Big data era and produce more and more statistics that attempt to beat official statistics on timeliness and relevance.

What is “Big Data”? The data deluge
New wealth of digital data: 90% of the world’s digital data has been created in just the last two years, and the volume is doubling every 20 months.
Big data comes in a number of forms:
• “Data exhaust” collected passively from devices (phones, credit cards, web searches, etc.), acting as sensors of human behaviour
• Online information (blogs, tweets, news articles...), acting as sensors of human sentiment
• Physical sensors (pollution, light emission, etc.), acting as remote sensors of human activity
• Citizen reporting: information actively produced via phone surveys, hotlines, etc.

Big Data - Sources
• Administrative (electronic medical records, hospital visits, insurance records, bank records, food banks, etc.)
• Commercial/Transactional (credit card transactions, on-line transactions, etc.)
• Sensors (satellite imaging, road sensors, climate sensors, etc.)
• Tracking devices (mobile telephones, GPS, etc.)
• Behavioural (online searches, online page views, etc.)
• Opinion (comments on social media, etc.)

Big Data - Challenges
• Legislative - with respect to the access and use of data.
• Privacy - managing public trust and acceptance of data re-use and its link to other sources.
• Financial - potential costs of sourcing data vs. benefits.
• Management - policies and directives about the management and protection of the data.
• Methodological - data quality and suitability of statistical methods.
• Technological - issues related to information technology.

Big Data - Opportunities
• Collecting data in real time or near real time maximizes the potential of data.
• Big data has potential as an input for official statistics, either for use on its own or in combination with more traditional data sources such as sample surveys and administrative registers.
• Big data has the potential to produce more relevant and timely statistics than traditional sources of official statistics.
• By incorporating relevant Big data sources into their official statistics process, NSOs (national statistical offices) are best positioned to measure their accuracy.

Big Data Opportunities - areas for experimentation
1. NSOs as brokers of Big Data?
2. NSOs to provide a “Quality Stamp”?
3. Combining Big data with official statistics
4. Replacing official statistics by Big data
5. Filling new data gaps, i.e. developing new 'Big data - based' measurements to address emerging phenomena (not known in advance or for which traditional approaches are not feasible)
6. Visualization methods
7. Text mining
8. High Performance Computing

Big Data and Official Statistics
“What does Big Data mean for official statistics?”

Big Data & Official Statistics
Statistical organisations are encouraged to formally address Big data issues in their annual and multi-annual work programmes by:
• undertaking research and pilot projects in selected areas
• allocating appropriate resources for that purpose.

Big Data & Official Statistics
• Collaboration between NSOs and private data source owners is of critical importance. It touches upon sensitive issues such as privacy, trust and corporate competitiveness, as well as the legislative framework of the NSOs.
• To use Big data, statisticians with a different mind-set and new skills are needed. The processing of more and more data for official statistics requires statistically aware people with an analytical mind-set, an affinity for IT and a determination to extract valuable ‘knowledge’ from data: “data scientists” (Quote stats from US)
• NSOs should develop the necessary internal analytical capability through specialised training.

Big Data – Examples
Example: Twitter. Using an algorithm that could distinguish actual sickness from other uses of the common word ‘sick’, researchers were able to plot and predict when people in a certain area were at risk of picking up a flu bug.

Big Data – Examples (OECD)
Real Estate
• Collect real-time real estate data, incl. location, product characteristics and price information
• Gather data by extracting information from real estate site ads in major agglomerations
• Collection mechanism: search engine with semantic capability collects and structures data
• Data analysis with existing statistical tools to produce aggregated indicators
• Enrich the GOV metropolitan database with the compiled indicators
Traffic
• Collect a sample of data tracking movements of mobile users over a territory
• Compile that data into transportation performance indicators (especially reliability), including evolution over time
• Collection mechanism: sample of #100K users across several countries who downloaded an app on mobile access quality
• Data analysis with existing statistical tools to produce aggregated indicators

Big Data – Examples (OECD)
Political tension
• Produce indicators on political tension for the African Economic Outlook
• Collect a sample of qualitative data (articles, …) qualifying political tension in African countries, based on keywords: strike, demonstration, kidnapping, …
• Text mining on countries or topics could raise interest as part of the data collection process
Employment
• Supplement survey data on employment with job offerings and applications collected from the Internet
• Build indicators by analysing legal documents (labour codes, labour legislation, court judgements, …)
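
The keyword-based collection described under "Political tension" above can be sketched roughly as follows. This is a minimal illustration only, not the OECD's actual method: the sample articles and the function names (keyword_counts, tension_by_country) are invented here, and only the three keywords come from the slide. It counts whole-word keyword hits per country using the Python standard library.

```python
import re
from collections import Counter

# Keywords taken from the slide; the list is illustrative, not exhaustive.
TENSION_KEYWORDS = ("strike", "demonstration", "kidnapping")

# Whole-word, case-insensitive match; the optional "s" also catches simple plurals.
_PATTERN = re.compile(r"\b(" + "|".join(TENSION_KEYWORDS) + r")s?\b", re.IGNORECASE)


def keyword_counts(text):
    """Count occurrences of each tension keyword in one article."""
    return Counter(match.group(1).lower() for match in _PATTERN.finditer(text))


def tension_by_country(articles):
    """Aggregate keyword counts per country from (country, text) pairs."""
    totals = {}
    for country, text in articles:
        totals.setdefault(country, Counter()).update(keyword_counts(text))
    return totals


# Hypothetical sample articles, invented for illustration only.
SAMPLE_ARTICLES = [
    ("Country A", "Teachers began a strike on Monday; a second strike is planned."),
    ("Country A", "A demonstration took place in the capital."),
    ("Country B", "Markets were calm and no demonstrations were reported."),
]

if __name__ == "__main__":
    for country, counts in tension_by_country(SAMPLE_ARTICLES).items():
        print(country, dict(counts))
```

Note that plain keyword counting cannot tell "no demonstrations were reported" from a real event, so the Country B article still registers a hit. Any real indicator would need more careful text mining and validation against existing sources.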
Internet
• Quality of Internet network infrastructure and security
• Ranking of languages most used on the Internet
• Study the effectiveness of current intellectual property protection laws

Big Data – Examples (OECD)
• Objectives
  – Collect a sample of data tracking movements of mobile users over a territory
  – Compile that data into transportation performance indicators (especially reliability)
  – Including evolution over time
• Proof of concept envisaged
  – Solution provider identified (Sensorly, a start-up specialised in mobile data)
  – Privacy should not be an issue, since Sensorly collects data based on an opt-in mechanism and does not rely on mobile operators' data.
  – Collection mechanism: sample of #100K users across several countries who downloaded an app on mobile access quality
  – Data analysis with existing statistical tools to produce aggregated indicators

Big Data – Examples (MIT)

Big Data – Examples (Health data)

Any questions?

OPEN DATA

What is “Open Data”?
From Wikipedia:
• Open data is the idea that data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other mechanisms of control.

What is “Open Data”?
From the Open Data Handbook:
Open data, as defined by the Open Definition, is data that can be freely used, reused and redistributed by anyone.

What is “Open Data”?
From the OECD Open Data Project:
Definition of ‘Open’ from the 2011 OECD Publishing Review: to make OECD data machine-readable, retrievable, indexable and re-usable.

Open Data: Ten Principles for Opening Up Government Information (Sunlight Foundation)
1. Completeness: Datasets released by the government should be as complete as possible, reflecting the entirety of what is recorded about a particular subject. Metadata that defines and explains the raw data should be included as well, along with formulas and explanations for how derived data was calculated.
2. Primacy: Datasets released should be primary source data. This includes the original information collected, details on how the data was collected and the original source documents recording the collection of the data.
3. Timeliness: Datasets should be available to the public in a timely fashion. Whenever feasible, information collected should be released as quickly as it is gathered and collected.
4. Ease of Physical and Electronic Access: Datasets should be as accessible as possible. There should be no barriers such as completing forms, submitting requests, or systems that require browser-oriented technologies (e.g., Flash, Javascript, cookies or Java applets).
5. Machine readability: Information should be stored in widely-used file formats that easily lend themselves to machine processing. These files should be accompanied by documentation related to the format and how to use it in relation to the data.
6. Non-discrimination: Barriers to use of data can include registration or membership requirements. Any person should be able to access the data at any time without having to identify him/herself or provide any justification for doing so.
7. Use of Commonly Owned Standards: Data should be stored in freely available formats that can be accessed without the need for a software license, making the data available to a wider pool of potential users.
8. Licensing: Maximal openness means making data available without restrictions on use as part of the public domain.
9. Permanence: Information should be available online in archives in perpetuity.
Data should remain online, with appropriate version-tracking and archiving over time.
10. Usage Costs: Data should be available free of charge.

Examples of open data initiatives

Open Data examples – Data.Gov.uk

Open Data examples – Data.Gov

Open Data examples – World Bank

OECD Open Web Services

Incorporating Big Data into official statistics Open Data programmes - the OECD perspective
Introducing the OECD DELTA Programme…

DELTA Programme – Making OECD data Open, Accessible, Free
• Accessible: find, understand, use
• Open: machine-readable, indexable, re-usable
• Free: available without charge

The Open Data project - goals
• To make OECD data machine-readable, retrievable, indexable and re-usable.
• To increase the dissemination and impact of OECD data via open data services for OECD statistical data.
• To encourage re-use of OECD data, and reuse by the OECD of external innovation, via open innovation processes and communities.

The Open Data project - scope
Data content
• All datasets within the OECD.Stat data warehouse with the standardised structural format and content necessary for machine-to-machine “Open” access.

Open data web services – Data Formats
i) SDMX/JSON
JavaScript Object Notation (JSON) is a text-based open standard designed for human-readable data interchange. It is a widely used open data format on web sites today. JSON has a number of advantages, including:
• Simplicity - a simple, ‘lightweight’ format with a smaller grammar that maps directly onto the data structures used in today’s programming languages.
• Interoperability - has the same interoperability potential as XML.
• Openness - has the same open capabilities as XML.
• Readability - is much easier for humans to read than XML. It is also easier to write, and easier for machines to read and write.
ii) Excel/CSV
Excel and CSV are already widely used exchange standards, so including them as output formats was a fairly obvious decision.
iii) Open Data (OData)
OData is an open protocol for sharing data.
iv) Future formats could include Google Data (a REST-inspired technology), Google Dataset Publishing Language (DSPL) or Google KML, a geospatial file format.

Any questions?
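
As a supplement to the data formats slides, the claim that JSON "maps directly onto the data structures used in today's programming languages" can be illustrated with a small sketch. The payload below is a hypothetical, simplified structure invented for this example. It is not the actual SDMX-JSON message layout, and the dataset name, field names and values are made up. The sketch parses the JSON into ordinary Python lists and dictionaries, then writes the same records as CSV, the other output format mentioned above.

```python
import csv
import io
import json

# Hypothetical, simplified payload; NOT the actual SDMX-JSON message layout.
SAMPLE_JSON = """
{
  "dataset": "EXAMPLE_INDICATOR",
  "observations": [
    {"country": "COL", "year": 2011, "value": 1.5},
    {"country": "COL", "year": 2012, "value": 2.0},
    {"country": "FRA", "year": 2012, "value": 0.5}
  ]
}
"""


def load_observations(raw):
    """Parse JSON text; objects become dicts and arrays become lists."""
    return json.loads(raw)["observations"]


def to_csv(observations):
    """Serialise the same records as CSV text, one row per observation."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["country", "year", "value"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(observations)
    return buffer.getvalue()


if __name__ == "__main__":
    rows = load_observations(SAMPLE_JSON)
    print(rows[0]["country"], rows[0]["value"])
    print(to_csv(rows))
```

No schema or custom parser is needed to reach the values: a single json.loads call yields native structures. This is the "simplicity" advantage listed for JSON, whereas real SDMX-JSON messages carry additional structure metadata that a client would have to interpret.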