Introduction to Data Science Day 2

Introduction to Data Science
Section 6
Data Matters 2015
Sponsored by the Odum Institute, RENCI, and
NCDS
Thomas M. Carsey
carsey@unc.edu
Data Science Examples in Action
Google Flu
• Link between search terms used in Google and
actual flu outbreaks.
• The idea is that people with the flu or who are
in areas where more folks have the flu will be
searching more about it.
• Baseline model used searches on 45 terms
related to flu and flu symptoms
– http://www.google.org/flutrends/about/how.html
Billion Prices Project @ MIT
• Collects prices from hundreds of online
retailers every day.
– http://bpp.mit.edu/
• Measures health of the economy in near real
time.
• Two days after Lehman Brothers failed in Sept.
2008, their index began to fall.
– Govt. released data in mid-November on October
figures before it could show similar evidence.
Medicare Fraud Detection
• 90 people charged with $260 million in false
billing
– https://www.fbi.gov/newyork/pressreleases/2014/medicare-fraud-strike-forcecharges-90-individuals-for-approximately-260million-in-false-billing
Twitter Trending
• Twitter trending
– https://twitter.com/WhatsTrending
Nate Silver
• FiveThirtyEight – http://fivethirtyeight.com/
Computational Event Data System
• Automated coding of English-language news
reports to generate political event data in the
Middle East, Balkans, and West Africa.
• The focus is on contentious political events
(bombings, shootings, protests, riots, etc.)
• Records initiator, target, location, “size”
– http://eventdata.parusanalytics.com/
Satellite Data and Crop Insurance Fraud
• Mapping satellite images of fields (of 100 acres or
larger in Nebraska)
• Map “greenness” of the field against crop
insurance claims, then investigate claims on the
ground.
• Good predictor of fraud
– http://data2discovery.org/dev/wpcontent/uploads/2014/05/DMysnyk_Unsupervised_Learning_Classification.jpeg
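A minimal sketch of the screening idea above, under stated assumptions: "greenness" is approximated here by NDVI, (NIR - Red) / (NIR + Red), which the slide does not name, and the field records, the 0.6 threshold, and the function names are hypothetical. It only illustrates comparing greenness against claims to choose fields for ground investigation; it is not the method used in the project.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index, one common greenness measure (assumed here)."""
    return (nir - red) / (nir + red)


# Hypothetical field records: reflectance values and whether a loss claim was filed.
fields = [
    {"id": "A", "acres": 120, "nir": 0.50, "red": 0.08, "claimed_loss": True},
    {"id": "B", "acres": 300, "nir": 0.30, "red": 0.20, "claimed_loss": True},
    {"id": "C", "acres": 150, "nir": 0.55, "red": 0.07, "claimed_loss": False},
    {"id": "D", "acres": 80, "nir": 0.52, "red": 0.08, "claimed_loss": True},
]


def flag_for_review(fields, min_acres=100, green_threshold=0.6):
    """Return ids of large fields that look green yet carry a loss claim."""
    return [
        f["id"]
        for f in fields
        if f["acres"] >= min_acres
        and f["claimed_loss"]
        and ndvi(f["nir"], f["red"]) > green_threshold
    ]


flagged = flag_for_review(fields)
print(flagged)
```

A flag is only a reason to look at a claim on the ground, not evidence of fraud by itself.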
Interest Groups and Policy Diffusion
Garrett and Jansa, forthcoming SPPQ
Plagiarizing Policy
Kroeger, State Politics Conference 2015
• Compares model bills with actual legislation
– 64 groups; 1434 model bills
– 7,542,937 versions of bills
– Results in 10,816,571,658 pairwise comparisons
• Measured similarity based on percentage of 5-word strings that appear in both model and
actual bills.
– https://www.dropbox.com/sh/i4ud5e6tyap1qp2/AACRSIANXFLC8U5Ln9Bn6usca/Kroeger_SPPC.pdf?dl=0
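The similarity measure above can be sketched roughly as follows. This is an illustration under assumptions, not the study's code: words are split on whitespace and lowercased, the percentage is taken relative to the model bill's 5-word strings (the slide does not say which denominator was used), and the two example sentences are invented. The last line checks that 1434 model bills times 7,542,937 bill versions gives the number of pairwise comparisons reported on the slide.

```python
def word_ngrams(text, n=5):
    """Return the set of n-word strings in text (lowercased, whitespace-split)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_ngram_percent(model_bill, actual_bill, n=5):
    """Percent of the model bill's n-word strings that also appear in the actual bill."""
    model = word_ngrams(model_bill, n)
    if not model:
        return 0.0
    actual = word_ngrams(actual_bill, n)
    return 100.0 * len(model & actual) / len(model)


# Invented example texts, for illustration only.
model_text = ("the department shall establish a program to inspect "
              "all licensed facilities each year")
bill_text = ("the board shall establish a program to inspect "
             "all licensed facilities every two years")

score = shared_ngram_percent(model_text, bill_text)
print(round(score, 1))

# Each model bill is compared with each bill version.
comparisons = 1434 * 7_542_937
print(comparisons)
```

Using sets means a repeated 5-word string counts once; counting repeats would be an equally plausible design choice.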
Communication Networks
Desmarais et al.
• Email from New Hanover County, North
Carolina
– 30 Managers of departments of county govt.
– Feb., 2011
• 30,909 emails
• 8,097 authored by managers
• 1,739 sent to other managers
• Study within vs. between group
communication, broken down by topics
– http://people.umass.edu/bruced/pubs/Krafft_Moore_Desmarais_Wallach_NIPS2012.pdf
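As a rough sketch of what a within-group vs. between-group tally by topic could look like: the names, group labels, topics, and email records below are all hypothetical, and the paper's actual model is more involved than a simple count. The two shares at the end use the counts from the slide, assuming they are nested (manager-authored emails within all emails, manager-to-manager emails within manager-authored emails).

```python
from collections import Counter

# Hypothetical assignment of managers to groups (not from the study).
group_of = {
    "ana": "public_safety",
    "ben": "public_safety",
    "cy": "finance",
    "dee": "finance",
}

# Hypothetical (sender, recipient, topic) records.
emails = [
    ("ana", "ben", "budget"),
    ("ana", "cy", "budget"),
    ("cy", "dee", "budget"),
    ("ben", "dee", "scheduling"),
    ("dee", "cy", "scheduling"),
]


def tally_by_topic(emails, group_of):
    """Count emails per (topic, 'within' or 'between') pair."""
    counts = Counter()
    for sender, recipient, topic in emails:
        same_group = group_of[sender] == group_of[recipient]
        counts[(topic, "within" if same_group else "between")] += 1
    return counts


counts = tally_by_topic(emails, group_of)
print(dict(counts))

# Shares implied by the slide's counts, assuming they are nested.
share_authored_by_managers = 8_097 / 30_909
share_sent_to_managers = 1_739 / 8_097
print(round(share_authored_by_managers, 3), round(share_sent_to_managers, 3))
```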
Privacy, Ethics, Transparency
“Could” versus “Should”
• Advances in Data Science focus more on what
could be done than what should be done.
• Current law has not caught up with current
technology.
• Do people understand what they are giving away
about themselves?
• Are researchers being open about what they are
doing with data?
• Are companies?
• Are governments?
Data Access and Research Transparency (DA-RT)
• Increased regulations from government to
provide access to data funded by government
dollars.
• Increased professional publication standards for
sharing research data
• Increased pressure for transparency in research
methods more generally, including data access
• Symposium in Political Science (PS: Political
Science and Politics, Vol. 47, Issue 01.)
– http://journals.cambridge.org/action/displayIssue?jid=PSC&volumeId=47&seriesId=0&issueId=01
DA-RT Statement
• Commitment to DA-RT principles by academic
journals.
– Require authors to provide research data via a
trusted repository
– Require authors to provide clear instructions/code
used for analysis
– Maintain a consistent data citation policy
– Update all codes, guidelines, etc. as needed.
– http://media.wix.com/ugd/fa8393_da017d3fed824cf587932534c860ea25.pdf
Retracted Research
• Science article on changing minds on same-sex
marriage
• Donald Green and Michael LaCour
• http://retractionwatch.com/2015/05/20/authorretracts-study-of-changing-minds-on-same-sexmarriage-after-colleague-admits-data-were-faked/
• Who is responsible?
– Authors, publishers, advisors, IRBs, professional
societies, others?
Growth in Retractions
• NY Times: Nature reports a 10-fold increase in
retracted scientific research in the 2000s compared
to the 1990s.
• http://www.nytimes.com/interactive/2015/05/28/science/retractions-scientific-studies.html?_r=0
• http://www.nature.com/news/2011/111005/pdf/478026a.pdf
• Best guess is that improved standards and technology
are finding more problems, not that bad behavior
itself is on the rise, but we don’t know.
Retracted Studies Live On
• Studies by John Budd at U. Missouri (School of
Ed.) find that
– Retracted articles continue to be cited just as often as
non-retracted articles.
– Only 4-8% of citations mention the retraction
• There is an effort to deal with this called
CrossMark
– http://www.crossref.org/crossmark/
• More attention to retracted work at Retraction
Watch
– http://retractionwatch.com/
Other Retractions
• 1998 Study linking vaccines to autism retracted in 2010
• http://www.nytimes.com/2010/02/03/health/research/03lancet.html
• 2014 article in Nature on Stem Cell production method
• http://www.nytimes.com/2014/07/03/business/stem-cellresearch-papers-are-retracted.html
• 2004 and 2005 papers in Science on human cloning
• http://www.nytimes.com/2005/12/31/science/amid-confusionjournal-retracts-koreans-stem-cell-paper.html
There is More
• John Darsee faked data in about 100 published
papers on heart disease
• Study linking vaccines to autism
• Study touting a new method of producing stem
cells
• Two studies on human cloning
• Nearly 100 papers by John Darsee on heart
research
• 17 papers on physics by Jan Hendrik Schön
Privacy and Ethics
• Data, the elements of data science, and even
so-called “Big Data” are not new.
• One thing that is new is the greater variety of
data and, most importantly, the amount of
data available about humans.
• Discussion and good policy regarding privacy,
security, and the ethical use of data about
people lags behind the methods of collecting,
sharing, archiving, and analyzing data.
The Free Market, Unfair Competition, Big Brother?
Privacy and Ethics
• Several questions emerge from the explosion of
digital data about us:
– Identity: What is the relationship between our
offline and online identities?
– Privacy: Who should control access to online data?
– Ownership: Who owns data? Can ownership rights be
transferred? What are the obligations of those who
generate and use data?
– Reputation: How can we determine if data is
trustworthy? How can we know if a provider acquired
data legitimately?
• There are entire industries on reputation management.
Ethical Claims
• Davis claims “Big Data, like all technology, is
ethically neutral”
• However, business is a social/public enterprise, so
you will encounter the values others hold.
• Competing values means competing ethical
standards, and thus, ethical conflicts.
• Modern Big Data magnifies this encounter
because it increases the points of connection.
• It also leaves a digital trail of those encounters.
Ethical Decision Points
• Inquiry: what are our values?
• Analysis: what are our current data handling
practices? Do they align with our values?
• Articulation: explicit written expression of
alignment and gaps between values and
practices
• Action: specific plans and activities used to
close those gaps.
Buying vs. Selling
• Company data policies:
– 34 of 50 Fortune-class companies said they would not
sell personal data.
– No company said explicitly that they would sell
personal data.
– No company made any explicit statement that they
would not buy personal data.
– 11 companies said explicitly that buying personal data
was allowed.
• Why different standards for buying vs. selling
personal data?
Privacy
• Physical Privacy
• Informational Privacy
• Organizational Privacy
• Privacy of
– Our communications
– Our behavior
– Our person
• Is digital communal living different than physical
communal living?
• Who decides? Who regulates? Who enforces?
Privacy as a Common Pool Resource?
• We all benefit from privacy norms, but we all may
realize personal gains by violating them.
• If everyone has that same incentive, the public
good gets destroyed.
• To govern our collective behavior:
– We need norms/rules that are agreed upon and
known
– We need effective monitoring
– We need a mechanism to sanction violations
• The question is, do we need an external force or
can we govern ourselves? (Elinor Ostrom’s work)
Closing Thoughts on Data Science
Conclusions
• Data Science is an evolving field
– Exciting, confusing, immature
• Data science will be critical in an information
economy and to national security, but it is also
changing our social behavior, the arts, and everything
else.
• There are many claims made about data science and
“Big Data,” and some of them are probably true.
• The field focuses on the applied interaction between computer
science, information science, and statistics.
– This is good, but . . .
Conclusions (cont.)
• Data Science needs to figure out how to
include substantive expertise and theories.
– Finding patterns and gaining understanding are
two different tasks.
• Data Science needs to foster interdisciplinary
communication
– Data as the universal language
• Data Science needs greater attention to
privacy and ethics.