Transparency and Inference for Big Data

Prepared for
Census-MIT Big Data Workshop Series
MIT
December 2015
Workshop Overview:
Transparency and Inference for Big Data
Micah Altman
Director of Research
MIT Libraries
Roadmap
  Workshop series: Challenges of big data for official statistics
  What to expect today and tomorrow
Big Data Challenges: Acquisition, Analysis, Access, Protection, Governance

Credits & Disclaimers
DISCLAIMER
These opinions are my own; they are not the opinions of MIT, Brookings, any of the project funders, nor (with the exception of co-authored previously published work) my collaborators.
Secondary disclaimer:
“It’s tough to make predictions, especially about the future!”
-- Attributed to Woody Allen, Yogi Berra, Niels Bohr, Vint Cerf, Winston Churchill, Confucius, Disreali [sic], Freeman Dyson, Cecil B. Demille, Albert Einstein, Enrico Fermi, Edgar R. Fiedler, Bob Fourer, Sam Goldwyn, Allan Lamport, Groucho Marx, Dan Quayle, George Bernard Shaw, Casey Stengel, Will Rogers, M. Taub, Mark Twain, Kerr L. White, etc.
Collaborators & Co-Conspirators
Workshop Series Organizers
  US Census: Cavan Capps, Ron Prevost
  MIT: Micah Altman
Workshop Co-Organizers (US Census)
  Peter Miller
  Benjamin Reist
  Michael Thieme
Research Support
  Supported by the U.S. Census Bureau
Related Work
Main Project:
  Census-MIT Big Data Workshop Series
  projects.informatics.mit.edu/bigdataworkshops
Related publications:
(Reprints available from: informatics.mit.edu)
  Altman M, Capps C, Prevost R. Using New Forms of Information for Official Economic Statistics - Examining the Commodity Flow Survey: Executive Summary from the 1st Workshop in the MIT Big Data Workshop Series. SSRN: Social Science Research Network [Internet]. Working Paper.
  Altman M, O’Brien D, Vadhan S, Wood A. 2014. “Big Data Study: Request for Information.”
  Altman M, Wood A, O'Brien D, Gasser M, Vadhan S. Towards a Modern Approach to Privacy-Aware Government Data Releases. Berkeley Journal of Technology Law. Forthcoming.
  Altman M, McDonald MP. 2014. Public Participation GIS: The Case of Redistricting. Proceedings of the 47th Annual Hawaii International Conference on Systems Science.
Online…
Website: projects.informatics.mit.edu/bigdataworkshops
Twitter Hashtag: #cmbigdata
E-mail
  Micah Altman: escience@mit.edu
  Cavan Capps: Cavan.Paul.Capps@census.gov
  Ron Prevost: Ronald.C.Prevost@census.gov

Workshop Series: Big Data and Official Statistics
Trends and Challenges
Trends
  Increasingly data-driven economy
  Individuals are increasingly mobile
  Technology changes data uses
  Stakeholder expectations are changing
  Agency budgets and staffing remain flat
The next generation of official statistics
  Utilize broad sources of information
  Increase granularity, detail, and timeliness
  Reduce cost & burden
  Maintain confidentiality and security
Multi-disciplinary challenges:
  Computation, Statistics, Informatics, Social Science, Policy
Workshops and Outcomes
Acquisition Challenges
  Using New Forms of Information for Official Economic Statistics [August 3-4]
Privacy Challenges
  Location Confidentiality and Official Surveys [November 30-Dec 1]
Inference Challenges
  Transparency and Inference [December 7-8]
Expected outcomes:
  Workshop reports (September, January)
  Integrated white paper (February)
  Identifying new opportunities for statistical agencies
  Inform the Census Big Data Research Program
Themes from Workshop 1: Big Data Sources
Broad new sources of information have the potential to enhance official statistics
  increased granularity & detail
  increased timeliness
  reduced burdens
Incorporating big data creates challenges
  Acquisition challenges
  Management, confidentiality and governance challenges
  Analytic challenges
Incorporating big data into statistical agencies will require adaptation:
  Agencies will need to broaden from data collection to information provisioning.
  Agencies will require different sources of data to support different types of decisions.
  Agencies will need to develop more extensive relationships with business stakeholders.
  Agencies have the potential to take on new roles with respect to big data sources, as…
    standards leaders
    certification authorities
    clearinghouses
    infrastructure for durable, trusted access
Themes from Workshop 2: Big Data Privacy
Value of Census Reputation
  Reputation is a primary concern for Census
  Reputation affects willingness to participate, and thus the cost of participation
  Reliability & transparency are needed for official statistics to serve their policy purpose
    To ensure accountability of process and programs
    To create a public data good, where results can be accepted across multiple sectors
    To support reliable inferences for a range of purposes
Consider data needs in terms of computations
  Sources of big data may not be willing to distribute data directly
  Sources of big data may not be able to distribute all data directly: data are typically distributed internally and reaggregated
Access through computation
  Custom / private APIs could provide the analytics needed
  Where privacy and security are challenges, Secure Multi-Party Computing methods could be used in place of trusted systems
Characterizing risks and harms
  Official statistics reflect an implicit harm/benefit balance, although not legally framed explicitly
  Need to move from binary measures (identification) to formal measures
  Census could be a leader: many countries/industries/states use aggregation or suppression with no formal risk/harm characterization
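
As an illustration of the "access through computation" theme, the sketch below shows one simple form of secure multi-party computation: a secure sum built on additive secret sharing. It is a minimal example under stated assumptions, not a method proposed at the workshop. The function names (`make_shares`, `secure_sum`), the modulus, and the three-party scenario are all hypothetical, and the parties are simulated inside one process rather than over a network.

```python
import secrets

# Modulus for share arithmetic (an illustrative choice; inputs must be
# non-negative and their true sum must stay below this value).
PRIME = 2**61 - 1


def make_shares(value, n_parties):
    """Split `value` into n additive shares that sum to `value` mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % PRIME
    return shares + [last]


def secure_sum(private_values):
    """Simulate a secure sum: no single party sees another party's input."""
    n = len(private_values)
    # Each party i splits its private value; share j is sent to party j.
    all_shares = [make_shares(v, n) for v in private_values]
    # Party j adds up the shares it received. Each share alone looks random.
    partial_sums = [
        sum(all_shares[i][j] for i in range(n)) % PRIME for j in range(n)
    ]
    # Only the partial sums are published; together they reveal the total.
    return sum(partial_sums) % PRIME


if __name__ == "__main__":
    # Three hypothetical data holders with private counts.
    print(secure_sum([120, 45, 300]))
```

In this sketch only the aggregate is revealed, which is the sense in which such methods could stand in for a trusted system that would otherwise hold every party's raw data. Production protocols add networking, authentication, and protection against dishonest parties, none of which is shown here.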

What to Expect Today, Tomorrow, & Beyond
Workshop Schedule
Monday
  12:00 Lunch and Introductions
  1:00 Workshop Overview
  1:15 Overview of SIPP
  2:15 Overview of Census Needs for Reliable and Transparent Inference
  3:00 Coffee
  3:30 Preliminary Discussion of Workshop Questions
  4:00 Challenges in Extracting Information from Big Data
  4:45 Transparency Challenges
  5:15 Discussion & Provocations
  6:00 Transportation to Hotel
  7-10 Hosted Dinner
Tuesday
  8:30 Breakfast
  9:00 Recap / Review of Days
  9:15 Overview of Census Uses – Implications for Inference
  10:15 Discussion: Key Challenges and Opportunities
  11:30 Lunch
  1:00 Emerging Approaches to Using Big Data in Official Statistics
  2:00 Discussion: Potential approaches to reliable, transparent & reproducible inference with Big Data
  3:00 Coffee
  3:30 Synthesis and next steps
  4:30 Taxis leave for airport
  5:00 (Optional) Beer/snacks and informal chat for those staying over in Boston
Workshop Questions
  What are the errors and biases in the collection, cleaning, editing, assembly, linking and other operations that affect Big Data utility?
  How can bias, construct validity, and reliability be measured and evaluated?
  What methods are most promising for discovering relationships that are substantively interesting, statistically reliable, and causally plausible?
  What are methods for ensuring transparency and replicability with big data sources? How do we detect dependencies among data sources?
  How can the integrity and authenticity of official statistics be maintained when integrating big data from outside sources?
  How should we assess the quality of Big Data information for different official statistics uses?
Use Cases

Survey of Income and Program Participation
Use cases may focus discussion – they should not limit discussion
What will be Shared
Chatham House Rule
  When a meeting, or part thereof, is held under the Chatham House Rule, participants are free to use the information received, but neither the identity nor the affiliation of the speakers, nor that of any other participant, may be revealed.
  Please do not name individuals or companies in social media, etc.
What’s Public
  Ideas/information shared
  (We will be taking notes and recording, but only for summary reports)
  Formal presentations
  Attendance & Participant List (unless opted-out)
  Attribution – when requested/verified (opt-in)
Future Outputs
  Draft summary report from workshop [December]
    Circulated to participants for comments
  Public Summary of Report [January]
    Including corrections and attribution where requested
    To appear on project site
  White Paper – Series Summary & Synthesis [February]
Suggested Readings




Reimsbach-Kounatze, C. (2015), “The Proliferation of “Big Data” and Implications for Official Statistics and Statistical Agencies: A Preliminary Analysis”, OECD Digital Economy Papers, No. 245, OECD Publishing.
Lazer, David, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. “The Parable of Google Flu: Traps in Big Data Analysis.” Science 343 (14 March): 1203-1205. Copy at http://j.mp/1ii4ETo
Kreuter, Frauke, Marcus Berg, Paul Biemer, Paul Decker, Cliff Lampe, Julia Lane, Cathy O'Neil, and Abe Usher. AAPOR Report on Big Data. No. 4eb9b798fd5b42a8b53a9249c7661dd8. Mathematica Policy Research, 2015.
NRC, 2013, Frontiers in Massive Data Analysis, National Academies Press.
Questions?
E-mail: escience@mit.edu
Web: informatics.mit.edu
Creative Commons License
This work, by Micah Altman (http://redistricting.info), is licensed under the Creative Commons Attribution-Share Alike 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.