Prepared for
Census-MIT Big Data Workshop Series
MIT
December 2015

Workshop Overview: Transparency and Inference for Big Data
Micah Altman
Director of Research
MIT Libraries

Roadmap
  Workshop series: Challenges of big data for official statistics
  What to expect today and tomorrow
  Big Data Challenges: Acquisition, Analysis, Access, Protection, Governance

Credits & Disclaimers

DISCLAIMER
  These opinions are my own; they are not the opinions of MIT, Brookings, any of the project funders, nor (with the exception of co-authored previously published work) my collaborators.
  Secondary disclaimer:
    “It’s tough to make predictions, especially about the future!”
    Attributed to Woody Allen, Yogi Berra, Niels Bohr, Vint Cerf, Winston Churchill, Confucius, Disreali [sic], Freeman Dyson, Cecil B. Demille, Albert Einstein, Enrico Fermi, Edgar R. Fiedler, Bob Fourer, Sam Goldwyn, Allan Lamport, Groucho Marx, Dan Quayle, George Bernard Shaw, Casey Stengel, Will Rogers, M. Taub, Mark Twain, Kerr L. White, etc.

Collaborators & Co-Conspirators
  Workshop Series Organizers
    US Census: Cavan Capps, Ron Prevost
    MIT: Micah Altman
  Workshop Co-Organizers (US Census)
    Peter Miller
    Benjamin Reist
    Michael Thieme
  Research Support
    Supported by the U.S. Census Bureau

Related Work
  Main Project: Census-MIT Big Data Workshop Series
    projects.informatics.mit.edu/bigdataworkshops
  Related publications (reprints available from informatics.mit.edu):
    Altman M, Capps C, Prevost R. Using New Forms of Information for Official Economic Statistics - Examining the Commodity Flow Survey: Executive Summary from the 1rst Workshop in the MIT Big Data Workshop Series. SSRN: Social Science Research Network [Internet]. Working Paper.
    Altman M, O’Brien D, Vadhan S, Wood A. 2014. “Big Data Study: Request for Information.”
    Altman M, Wood A, O'Brien D, Gasser U, Vadhan S. Towards a Modern Approach to Privacy-Aware Government Data Releases. Berkeley Technology Law Journal. Forthcoming.
    Altman M, McDonald MP. 2014. Public Participation GIS: The Case of Redistricting. Proceedings of the 47th Annual Hawaii International Conference on System Sciences.

Online…
  Website: projects.informatics.mit.edu/bigdataworkshops
  Twitter hashtag: #cmbigdata
  E-mail:
    Micah Altman: escience@mit.edu
    Cavan Capps: Cavan.Paul.Capps@census.gov
    Ron Prevost: Ronald.C.Prevost@census.gov

Workshop Series: Big Data and Official Statistics

Trends and Challenges
  Trends
    Increasingly data-driven economy
    Individuals are increasingly mobile
    Technology changes data uses
    Stakeholder expectations are changing
    Agency budgets and staffing remain flat
  The next generation of official statistics
    Utilize broad sources of information
    Increase granularity, detail, and timeliness
    Reduce cost & burden
    Maintain confidentiality and security
  Multi-disciplinary challenges: Computation, Statistics, Informatics, Social Science, Policy

Workshops and Outcomes
  Acquisition Challenges: Using New Forms of Information for Official Economic Statistics [August 3-4]
  Privacy Challenges: Location Confidentiality and Official Surveys [November 30-December 1]
  Inference Challenges: Transparency and Inference [December 7-8]
  Expected outcomes:
    Workshop reports (September, January)
    Integrated white paper (February)
    Identify new opportunities for statistical agencies
    Inform the Census Big Data Research Program

Themes from Workshop 1: Big Data Sources
  Broad new sources of information have the potential to enhance official statistics
    Increased granularity & detail
    Increased timeliness
    Reduced burdens
  Incorporating big data creates challenges
    Acquisition challenges
    Management, confidentiality and governance challenges
    Analytic challenges
  Incorporating big data into statistical agencies will require adaptation:
    Agencies will need to broaden from data collection to information provisioning.
    Agencies will require different sources of data to support different types of decisions.
    Agencies will need to develop more extensive relationships with business stakeholders.
    Agencies have the potential to take on new roles with respect to big data sources, as…
      standards leaders
      certification authorities
      clearinghouses
      infrastructure for durable, trusted access

Themes from Workshop 2: Big Data Privacy
  Value of Census Reputation
    Reputation is a primary concern for Census
    Reputation affects willingness to participate and cost of participation
    Reliability & transparency are needed for official statistics to serve their policy purpose
  Consider data needs in terms of computations
    Sources of big data may not be willing to distribute data directly
    Sources of big data may not be able to distribute all data directly: data are typically internally distributed and reaggregated
  Access through computation
    To ensure accountability of process and programs
    To create a public data good, where results can be accepted across multiple sectors
    To support reliable inferences for a range of purposes
    Custom / private APIs could provide the analytics needed
    Where privacy and security are challenges, Secure Multi-Party Computing methods could be used in place of trusted systems
  Characterizing risks and harms
    Official statistics reflect an implicit harm/benefit balance, although it is not explicitly framed in law
    Need to move from binary measures (identification) to formal measures
    Census could be a leader: many countries/industries/states use aggregation or suppression with no formal risk/harm characterization

What to Expect Today, Tomorrow, & Beyond

Workshop Schedule
  Monday
    12:00  Lunch and Introductions
    1:00   Workshop Overview
    1:15   Overview of SIPP
    2:15   Overview of Census Needs for Reliable and Transparent Inference
    3:00   Coffee
    3:30   Preliminary Discussion of Workshop Questions
    4:00   Challenges in Extracting Information from Big Data
    4:45   Transparency Challenges
    5:15   Discussion & Provocations
    6:00   Transportation to Hotel
    7-10   Hosted Dinner
  Tuesday
    8:30   Breakfast
    9:00   Recap / Review of Days
    9:15   Overview of Census Uses: Implications for Inference
    10:15  Discussion: Key Challenges and Opportunities
    11:30  Lunch
    1:00   Emerging Approaches to Using Big Data in Official Statistics
    2:00   Discussion: Potential approaches to reliable, transparent & reproducible inference with Big Data
    3:00   Coffee
    3:30   Synthesis and next steps
    4:30   Taxis leave for airport
    5:00   (Optional) Beer/snacks and informal chat for those staying over in Boston

Workshop Questions
  What are the errors and biases in the collection, cleaning, editing, assembly, linking and other operations that affect Big Data utility?
  How can bias, construct validity, and reliability be measured and evaluated?
  What methods are most promising for discovering relationships that are substantively interesting, statistically reliable, and causally plausible?
  What are methods for ensuring transparency and replicability with big data sources?
  How do we detect dependencies among data sources?
  How can the integrity and authenticity of official statistics be maintained when integrating big data from outside sources?
  How should we assess the quality of Big Data information for different official statistics uses?

Use Cases
  Survey of Income and Program Participation
  Use cases may focus discussion; they should not limit it

What will be Shared
  Chatham House Rule
    When a meeting, or part thereof, is held under the Chatham House Rule, participants are free to use the information received, but neither the identity nor the affiliation of the speakers, nor that of any other participant, may be revealed.
    Please do not name individuals or companies in social media, etc.
  What’s Public
    Ideas/information shared (we will be taking notes and recording, but only for summary reports)
    Formal presentations
    Attendance & Participant List (unless opted out)
    Attribution, when requested/verified (opt-in)
  Future Outputs
    Draft summary report from workshop [January]
      Including corrections and attribution where requested
    White Paper: Series Summary & Synthesis [February]
      Circulated to participants for comments
    Public Summary of Report [December]
      To appear on project site

Suggested Readings
  Reimsbach-Kounatze, C. 2015. “The Proliferation of ‘Big Data’ and Implications for Official Statistics and Statistical Agencies: A Preliminary Analysis.” OECD Digital Economy Papers, No. 245. OECD Publishing.
  Lazer, David, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. “The Parable of Google Flu: Traps in Big Data Analysis.” Science 343 (14 March): 1203-1205. Copy at http://j.mp/1ii4ETo
  Kreuter, Frauke, Marcus Berg, Paul Biemer, Paul Decker, Cliff Lampe, Julia Lane, Cathy O'Neil, and Abe Usher. 2015. AAPOR Report on Big Data. No. 4eb9b798fd5b42a8b53a9249c7661dd8. Mathematica Policy Research.
  NRC. 2013. Frontiers in Massive Data Analysis. National Academies Press.

Questions?
  E-mail: escience@mit.edu
  Web: informatics.mit.edu

Creative Commons License
  This work by Micah Altman (http://redistricting.info) is licensed under the Creative Commons Attribution-Share Alike 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.